not logged in | [Login]
The following design document describes the the new garbage collector (GC) to be introduced with LuaJIT 2.1. This document is very much a work in progress right now. Anything may change for the actual implementation. No code is available, yet.
The garbage collector used by LuaJIT 2.0 is essentially the same as the Lua 5.1 GC. There are some minor refinements for write barriers and a couple of speed optimizations. But the basic algorithms and data structures are unchanged.
The current garbage collector is relatively slow compared to implementations in other languages. It's not competitive with top-of-the-line GCs, especially for large workloads. The main causes for this are the independent memory allocator, cache-inefficient data structures and a high number of branch mispredictions. The current GC design is a dead end when it comes to further performance tuning.
What's needed is a complete redesign from scratch with the following goals:
This leads to the following implementation constraints:
The new garbage collector is an arena-based, quad-color incremental, generational, non-copying, high-speed, cache-optimized garbage collector.
Memory is requested from the operating system in big aligned memory blocks, called arenas. Arenas are split into 16-byte sized cells. One or more cells make up a block, which is the basic memory unit used to store objects and related object data. All objects inside an arena are either traversable or not. The mark and block bitmaps are kept in the metadata area of each arena, with a metadata overhead of 1.5%. Huge blocks are located in separate memory areas.
Pointers to arenas, huge blocks, interned strings, weak tables, finalizers, etc. are held in dedicated, cache-efficient data structures which minimize branch mispredictions. E.g. hashes, unrolled linked lists or trees with high fan-out.
The allocator switches on-the-fly from a bump allocator to a segregated-fit, best-fit allocator with bounded search, depending on fragmentation pressure.
The collector is a quad-color, incremental mark & sweep collector with very low latency. Traversals are local to an arena and exhibit high cache locality. Arenas with non-traversable objects don't even need to be considered. Object marking and the ultra-fast sweep phase only work on metadata to reduce cache pressure. The sweep phase brings neither live nor dead object data back into the cache.
The write barrier of the incremental GC is very cheap and rarely triggers. A sequential store buffer (SSB) helps to further reduce overhead. The write barrier check can be done with only 2 or 3 machine instructions. The JIT compiler can eliminate most write barriers.
The collector automatically switches between a regular mode and a generational mode, depending on workload characteristics.
This is a short overview of the different GC algorithms used in Lua 5.x and LuaJIT 1.x/2.0 as well as the new GC in LuaJIT 2.1.
All of these implementations use a tracing garbage collector (#) with two basic phases:
Any practical GC implementation has a couple more phases (e.g. an atomic phase), but this is not relevant to the following discussion. To avoid a recursive algorithm, a mark stack or mark list can be used to keep track of objects that need to be traversed.
(#) 'Tracing' in this context means it's tracing through the live object graph, as opposed to a reference counting garbage collector, which manages reference counts for each object. The terminology is not related to the concept of a trace compiler.
This is the classic (non-incremental) two-color mark & sweep algorithm:
Newly allocated objects are white. The mark phase turns all reachable objects black. The sweep phase frees all white (unreachable) objects and flips the color of all black (surviving) objects back to white.
The main drawback of this algorithm is that the mutator (your running program code) cannot run interleaved with the collector. A collection cycle has to be completely finished, which means the algorithm is non-incremental (atomic collection).
There are various optimizations, e.g. the meaning of the two colors can be switched after each GC cycle. This saves the color flip in the sweep phase. You can find plenty of variations in the literature.
This is the GC algorithm used by Lua 5.0. All objects are kept in a linked list, which is processed during the mark and sweep phases. Objects that have been marked and need to be traversed are chained in a separate mark list.
This is Dijkstra's three-color incremental mark & sweep algorithm:
Newly allocated objects are white. The mark phase starts at the GC roots. Marking a reachable object means flipping the color of it from white to gray and pushing it onto a gray stack (or rechaining it onto a gray list). The gray stack is iteratively processed, removing one gray object at a time. A gray object is traversed and all objects reachable from it are marked, like above. After an object has been traversed, it's turned from gray to black. The sweep phase works just like the two-color algorithm above.
This algorithm is incremental: the collector can operate in small steps, processing only a couple of objects from the gray stack and then let the mutator run again for a while. This spreads out the GC pauses into many short intervals, which is important for highly interactive workloads (e.g. games or internet servers).
But there's one catch: the mutator might get in the way of the collector and store a reference to a white (unprocessed) object at a black (processed) object. This object would never be marked and will be freed by the sweep, even though it's clearly still referenced from a reachable object, i.e. it should be kept alive.
To avoid this scenario, one has to preserve the tri-color invariant: a black object may never hold a reference to a white object. This is done with a write barrier, which has to be checked after every write. If the invariant has been violated, a fixup step is needed. There are two alternatives:
There are many optimizations to turn this into a practical algorithm. Here are the most important:
This is the GC algorithm used by Lua 5.1/5.2 and LuaJIT 1.x/2.0. It's an enhancement of the linked list algorithm from Lua 5.0. Tables use backward barriers, all other traversable objects use forward barriers.
LuaJIT 2.0 further optimizes the write barrier for tables by only checking for a black table, ignoring the color of the stored object. This is faster to check and still safe: the write barrier may trigger more often, but this does no harm. And it doesn't matter in practice, since GC cycles progress very fast and have long pauses inbetween, so objects are rarely black. Also stored objects usually are white, anyway.
The quad-color algorithm is a refinement of the tri-color algorithm:
There's a problem with the tri-color algorithm for backward barriers: the write barrier checks can get expensive if mark bits are not inline in the object itself. But one has to do an exact check for a black object before turning it gray again when the barrier triggers. Alas, the mark bits (white vs. black) are segregated in the new GC, only the gray bit is inline in the object.
Just checking for 'not gray' is not a good idea: the write barrier would be triggered for both white and black objects, always turning them gray on the first write. This is especially bad for white objects during GC pauses, as lots of gray objects may needlessly accumulate in the gray stack.
The solution is to introduce a fourth color, splitting up gray into light-gray and dark-gray. Newly allocated traversable objects are light-gray: the mark bit is white, the gray bit is set. A new object is usually written to immediately after allocation. The write barrier only checks for a cleared gray bit and doesn't trigger in this case.
When the object is marked during the mark phase, it's turned dark-gray (mark bit turned black) and pushed onto the gray stack. In case it's unreachable, the sweep phase can free a light-gray object like any other object marked white.
Dark-gray objects are turned black after traversal (clearing the gray bit) and turned white after sweeping. The write barrier may trigger during this short period and move the barrier back by turning it dark-gray again.
An object that survived one GC cycle is turned white like all other survivors. In case the object is written to after that, it's turned light-gray again. But this doesn't push the object onto the gray stack right away! In fact, only the gray bit needs to be flipped, which avoids further barriers as explained above.
The main advantage of the quad-color algorithm is the ultra-cheap write barrier: just check the gray bit, which needs only 2 or 3 machine instructions. And due to the initial coloring and the specific color transitions, write barriers for e.g. tables are hardly ever triggered in practice. The write barrier doesn't need to access the mark bitmap, which avoids polluting the cache with GC metadata while the mutator is running.
The quad-color algorithm can easily fall back to the tri-color algorithm for some traversable objects by turning them white initially and using forward write barriers. And there's an obvious shortcut for non-traversable objects: marking turns a white object black right away, which touches the mark bitmap only. Since these kind of objects are in segregated arenas, they don't need to be traversed and their data never needs to be brought into the cache during the mark phase.
This is the GC algorithm used by LuaJIT 2.1. Objects and the segregated metadata is managed in arenas (and not in linked lists).
There are various other GC algorithms that use more than the standard colors, e.g. two whites (see above). However none of them use the colors like this algorithm does. Also, a search for a "quad-color" GC (or variations) does not turn anything up. In the absence of evidence to the contrary, I (Mike Pall) hereby claim to have invented this algorithm. Please notify me immediately, if you disagree. As with all research results from my work on LuaJIT, I hereby donate the related intellectual property to the public domain. So please use it, share it and have fun!
The standard approach to generational GCs only works with a copying GC:
Obviously, this approach doesn't work for a non-copying GC. But the main insights behind a generational GC can be abstracted:
The basic idea is to modify the sweep phase: free the (unreachable) white objects, but don't flip the color of black objects before a minor collection. The mark phase of the following minor collection then only traverses newly allocated blocks and objects written to (marked gray). All other objects are assumed to be still reachable during a minor GC and are neither traversed, nor sweeped, nor are their marks changed (kept black). A regular sweep phase is used if a major collection is to follow.
The generational mode of the collector is automatically triggered by workloads with a high death rate for young allocations. Running (say) five minor collections that deal only with young allocations plus one major collection ought to be cheaper than running (say) two major collections in the same timeframe. And the maximum memory usage should be lower, too. The collector returns to the regular, non-generational mode, in case these assumptions turn out not to be true.
The image shown is greatly simplified: the allocation and survivor rates are constant; permanent objects don't change; collections are atomic.
An arena is a big, specially aligned memory block. Arenas are requested directly from the operating system using system-specific code. A generic memory management API is offered for non-standard operating systems or for embedded use. A simple 'one-fixed-block' memory manager is available as a build-time option.
All arenas have the same size and they are naturally aligned to their size. E.g. 64 KB-sized arenas have addresses ending with 16 zero bits. This has two advantages:
All objects inside an arena are either traversable or not. Arenas with only non-traversable objects obviously don't need to be traversed at all. And there's no need to store an object type inside of those objects either.
The arena size can be any power of two from 64 KB up to 1 MB. The optimal arena sizes for each OS and CPU still needs to be determined experimentally. Memory managers may request even bigger blocks from the OS (e.g. 2 MB or 4 MB for huge page support) and split them up.
1/64th of the space of each arena is dedicated to the metadata area, leading to a fixed overhead of 1.5%. The metadata area is located at the start of each arena. The remaining space is used for the data area. The following image shows the general layout of an arena:
The data area of each arena is split up into 16-byte sized cells. One or more cells make up a block, which is the basic memory unit used to store objects and related object data.
The index of a cell (or a block) can be derived from its starting address: mask out the top bits that identify the arena and shift the resulting relative address by 4 bits to the right to get the cell index.
Cell indices do not start at zero, because the first indices would point to the metadata area. A cell index always fits into 16 bits, because the maximum arena size is 1 MB.
The mark and block bitmaps make up the majority of the space in the metadata area of each arena. The two bitmaps are segregated for better cache behavior. Every cell has an associated bit in the mark bitmap and the block bitmap. The bit index in each bitmap corresponds to the cell index.
The two bitmaps determine the blocks of an arena and their status:
This is a differential encoding: the type of the block (free, white or black) is determined by its first cell. A block can span multiple cells, it's extended with block extents.
E.g. a white block with 3 cells followed by a free block with 1 cell and a black block with 2 cells is encoded as:
10 00 00
The advantage is that only one or two bits for the first cell need to be flipped to change the status of a whole block: allocating a block only needs to set a block bit; marking a block only needs to set a mark bit.
Since cell indices don't start at zero, the first few bytes of the mark and block bitmaps can be used for other metadata related to the arena. At least 16 bytes can be used for arena management and for the gray stack pointers, since the minimum arena size is 64 KB.
Huge blocks are located in separate contiguous memory areas. Their size is always a multiple of the arena size and they obey arena alignment. The optimal size to switch from regular block allocation to huge block allocation still needs to be determined experimentally.
There's no extra header in front of a huge block. Instead, all metadata (e.g. size, or mark and gray bits) is stored in a separate hash table, keyed by the block address.
Marking of all live objects is performed iteratively, using advanced, cache-optimized data structures to hold the objects that are to be marked.
The write barrier only has to check for the gray bit of objects stored into. This is a very fast test and it triggers only if the gray bit is not set (which is rare).
If the write barrier is triggered, white objects are turned light-gray and black objects are turned dark-gray. Dark-gray objects are additionally pushed onto a fast sequential store buffer (SSB).
Light-gray objects don't need to be pushed onto the SSB, but that requires checking the mark bit. This can be avoided, if performance of a triggered barrier becomes an issue. The check may be done on SSB overflow instead.
A sequential store buffer (SSB) is a small buffer which holds the block addresses of objects that triggered the write barrier. It always has at least one slot free, so the overflow check can be done at the end.
If the SSB overflows, it's emptied by converting the stored object addresses to cell indices and pushing them onto the corresponding gray stacks. This may involve multiple allocations and other overhead.
The advantage of the two-step process is the relatively low cache pollution due to the SSB while the mutator is running.
Every arena with traversable objects has an associated gray stack which holds the cell index of all of its gray objects. Memory for gray stacks is allocated and grown on demand and need not be part of the arena itself. The stack starts with a sentinel and grows downwards.
When an object is marked dark-gray, it's pushed onto the gray stack for the corresponding arena.
To improve cache access locality, the gray stack of each arena is processed separately. Objects removed from the gray stack are turned black before the traversal. The traversal may mark other objects, which may be located in different arenas. But processing always continues with the current arena until the gray stack is empty.
The gray queue holds arenas which have a non-empty gray stack. The gray queue is a priority queue, ordered by the size of the per-arena gray stack. This ensures the largest gray stacks get processed first. A binary heap is used to implement the priority queue. It behaves mostly like a stack (LIFO) for elements with the same priority.
The gray queue is processed iteratively, always removing the highest priority arena and processing its gray stack. The mark phase ends when the gray queue is empty (and the SSB is emptied, too).
Add your questions here and I'll try to clarify the docs. --Mike