Optimizing Allocation

MMTk provides alloc() and post_alloc(), to allocate a piece of memory, and finalize the memory as an object. Calling them is sufficient for a functional implementation, and we recommend doing so in the early development of an MMTk integration. However, as allocation is performance critical, runtimes generally would want to optimize allocation to make it as fast as possible, in which invoking alloc() and post_alloc() becomes inadequate.

The following discusses a few design decisions and optimizations related to allocation. The discussion mainly focuses on alloc(). post_alloc() works in a similar way, and the discussion can also be applied to post_alloc(). For concrete examples, you can refer to any of our supported bindings, and check the implementation in the bindings.

Note: Some of the optimizations need to make assumptions about MMTk's internal implementation and may make the code less maintainable. We recommend adding assertions in the binding code to make sure the assumptions are not broken across versions.

Efficient access to MMTk mutators

An MMTk mutator context (created by bind_mutator()) is a thread-local data structure of type Mutator. MMTk expects the binding to provide efficient access to the mutator structure in their thread-local storage (TLS). Usually one of the following approaches is used to store MMTk mutators.

Option 1: Storing the pointer

The Box<Mutator<VM>> returned from mmtk::memory_manager::bind_mutator is actually a pointer to a Mutator<VM> instance allocated in the Rust heap. It is simple to store it in the TLS. This approach does not make any assumption about the internals of an MMTk Mutator. However, it requires an extra pointer dereference when accessing a value in the mutator. This may sound not too bad, however, this degrades the performance of a carefully implemented inlined fast-path allocation sequence which is normally just a few (assembly) instructions. This approach could be a simple start in early development, but we do not recommend it for an efficient implementation.

If the VM is not implemented in Rust, the binding needs to turn the boxed pointer into a raw pointer before storing it.

#![allow(unused)]
fn main() {
                struct MutatorInTLS {
                    // Store the mutator as a boxed pointer.
                    // Accessing any value in the mutator will need a dereferencing of the boxed pointer.
                    ptr: Box<Mutator<MockVM>>,
                }

                // Bind an MMTk mutator
                let mutator = memory_manager::bind_mutator(fixture.get_mmtk(), tls_opaque_pointer);
                // Store the pointer in TLS
                let mut storage = MutatorInTLS { ptr: mutator };

                // Allocate
                let addr =
                    memory_manager::alloc(&mut storage.ptr, 8, 8, 0, AllocationSemantics::Default);
}

Option 2: Embed the Mutator struct

To remove the extra pointer dereference, the binding can embed the Mutator type into their TLS type. This saves the extra dereference.

If the implementation language is not Rust, the developer needs to create a type that has the same layout as Mutator. It is recommended to have an assertion to ensure that the native type has the exact same layout as the Rust type Mutator.

#![allow(unused)]
fn main() {
                struct MutatorInTLS {
                    embed: Mutator<MockVM>,
                }

                // Bind an MMTk mutator
                let mutator = memory_manager::bind_mutator(fixture.get_mmtk(), tls_opaque_pointer);
                // Store the struct (or use memcpy for non-Rust code)
                let mut storage = MutatorInTLS { embed: *mutator };
                // Allocate
                let addr = memory_manager::alloc(
                    &mut storage.embed,
                    8,
                    8,
                    0,
                    AllocationSemantics::Default,
                );
}

Option 3: Embed the fast-path struct

The size of Mutator is a few hundreds of bytes, which could be considered too large to store in the TLS in some language implementations. Embedding Mutator also requires to duplicate a native type for the Mutator struct if the implementation language is not Rust. Sometimes it is undesirable to embed the Mutator type. One can choose to only embed the fast-path struct that is in use.

Unlike the Mutator type, the fast-path struct has a C-compatible layout, and it is simple and primitive enough so it is unlikely to change. For example, MMTk provides BumpPointer, which simply includes a cursor and a limit.

In the following example, we embed one BumpPointer struct in the TLS. The BumpPointer is used in the fast-path, and carefully synchronized with the allocator in the Mutator struct in the slow-path. We also need to revoke (i.e. reset) all the cached BumpPointer values for all mutators if a GC occurs. Currently, we recommend implementing this in the resume_mutators API call, however there is work in progress that would make it an explicit API call instead.

Note that the allocate_default closure in the example below assumes the allocation semantics is AllocationSemantics::Default and its selected allocator uses bump-pointer allocation. Real-world fast-path implementations for high-performance VMs are usually JIT-compiled, inlined, and specialized for the current plan and allocation site. Hence, the allocation semantics of the concrete allocation site (and therefore the selected allocator) is known to the JIT compiler.

For the sake of simplicity, we only store one BumpPointer in the TLS in the example. In MMTk, each plan has multiple allocators, and the allocation semantics are mapped to those allocator by the GC plan you choose. So a plan uses multiple allocators, and depending on how many allocation semantics are used by a binding, the binding may use multiple allocators as well. In practice, a binding may embed multiple fast-path structs for all the allocators they use if they would like more efficient allocation.

Also for simplicity, the example assumes the default allocator for the plan in use is a bump pointer allocator. Many plans in MMTk use bump pointer allocator for their default allocation semantics (AllocationSemantics::Default), which includes (but not limited to) NoGC, SemiSpace, Immix, generational plans, etc. If a plan does not do bump-pointer allocation, we may still implement fast-paths, but we need to embed different data structures instead of BumpPointer.

#![allow(unused)]
fn main() {
                use crate::util::alloc::BumpPointer;
                struct MutatorInTLS {
                    default_bump_pointer: BumpPointer,
                    mutator: Box<Mutator<MockVM>>,
                }

                // Bind an MMTk mutator
                let mutator = memory_manager::bind_mutator(fixture.get_mmtk(), tls_opaque_pointer);
                // Create a fastpath BumpPointer with default(). The BumpPointer from default() will guarantee to fail on the first allocation
                // so the allocation goes to the slowpath and we will get an allocation buffer from MMTk.
                let default_bump_pointer = BumpPointer::default();
                // Store the fastpath BumpPointer along with the mutator
                let mut storage = MutatorInTLS {
                    default_bump_pointer,
                    mutator,
                };

                // Allocate
                let mut allocate_default = |size: usize| -> Address {
                    // Alignment code is omitted here to make the code simpler to read.
                    // In an actual implementation, alignment and offset need to be considered by the bindings.
                    let new_cursor = storage.default_bump_pointer.cursor + size;
                    if new_cursor < storage.default_bump_pointer.limit {
                        let addr = storage.default_bump_pointer.cursor;
                        storage.default_bump_pointer.cursor = new_cursor;
                        addr
                    } else {
                        use crate::util::alloc::Allocator;
                        let selector = memory_manager::get_allocator_mapping(
                            fixture.get_mmtk(),
                            AllocationSemantics::Default,
                        );
                        let default_allocator = unsafe {
                            storage
                                .mutator
                                .allocator_impl_mut::<crate::util::alloc::BumpAllocator<MockVM>>(
                                    selector,
                                )
                        };
                        // Copy bump pointer values to the allocator in the mutator
                        default_allocator.bump_pointer = storage.default_bump_pointer;
                        // Do slow path allocation with MMTk
                        let addr = default_allocator.alloc_slow(size, 8, 0);
                        // Copy bump pointer values to the fastpath BumpPointer so we will have an allocation buffer.
                        storage.default_bump_pointer = default_allocator.bump_pointer;
                        addr
                    }
                };

                // Allocate: this will fail in the fastpath, and will get an allocation buffer from the slowpath
                let addr1 = allocate_default(8);
                // Allocate: this will allocate from the fastpath
                let addr2 = allocate_default(8);
}

And pseudo-code for how you would reset the BumpPointers for all mutators in resume_mutators. Note that these mutators are the runtime's actual mutator threads (i.e. where the cached bump pointers are stored) and are different from MMTk's Mutator struct.

#![allow(unused)]
fn main() {
impl Collection<RtName> for RtNameCollection {
  ...
  fn resume_mutators(tls: VMWorkerThread) {
    // Reset the cached bump pointers of each mutator (setting both cursor and limit to 0) after a GC to
    // ensure that the VM sees a cohesive state
    for mutator in mutators {
      mutator.storage.default_bump_pointer = BumpPointer::default();
    }
    // Actually resume all the mutators now
    ...
  }
  ...
}
}

Avoid resolving the allocator at run time

For a simple and general API of alloc(), MMTk requires AllocationSemantics as an argument in an allocation request, and resolves it at run-time. The following is roughly what alloc() does internally.

  1. Resolving the allocator
    1. Find the Allocator for the required AllocationSemantics. It is defined by the plan in use.
    2. Dynamically dispatch the call to Allocator::alloc().
  2. Allocator::alloc() executes the allocation fast-path.
  3. If the fast-path fails, it executes the allocation slow-path Allocator::alloc_slow().
  4. The slow-path will further attempt to allocate memory, and may trigger a GC.

Resolving to a specific allocator and doing dynamic dispatch is expensive for an allocation. With the build-time or JIT-time knowledge about the object that will be allocated, an MMTk binding can possibly skip the first step in the run time.

If you implement an efficient fast-path allocation in the binding side (like the Option 3 above, and generating allocation code in a JIT), that naturally avoids this problem. If you do not want to implement the fast-path allocation, the following is another example of how to avoid resolving the allocator.

Once MMTk is initialized, a binding can get the memory offset for the default allocator, and save it somewhere. When we know an object should be allocated with the default allocation semantics, we can use the offset to get a reference to the actual allocator (with unsafe code), and allocate with the allocator.

#![allow(unused)]
fn main() {
            // At boot time
            let selector = memory_manager::get_allocator_mapping(
                fixture.get_mmtk(),
                AllocationSemantics::Default,
            );
            let default_allocator_offset =
                crate::plan::Mutator::<MockVM>::get_allocator_base_offset(selector);
            let mutator = memory_manager::bind_mutator(fixture.get_mmtk(), tls_opaque_pointer);

            // At run time: allocate with the default semantics without resolving allocator
            let default_allocator: &mut BumpAllocator<MockVM> = {
                let mutator_addr = Address::from_ref(&*mutator);
                unsafe {
                    (mutator_addr + default_allocator_offset).as_mut_ref::<BumpAllocator<MockVM>>()
                }
            };
            let addr = default_allocator.alloc(8, 8, 0);
}

Emitting Allocation Sequence in a JIT Compiler

If the language has a JIT compiler, it is generally desirable to generate the code sequence for the allocation fast-path, rather than simply emitting a call instruction to the allocation function. The optimizations we talked above are relevant as well: (i) the compiler needs to be able to access the mutator, and (ii) the compiler needs to be able to resolve to a specific allocator at JIT time. The actual implementation highly depends on the compiler implementation.

The following are some examples from our bindings (at the time of writing):