not logged in | [Login]
The following design document describes the the new garbage collector (GC) to be introduced with LuaJIT 2.1. This document is very much a work in progress right now. Anything may change for the actual implementation. No code is available, yet.
The garbage collector used by LuaJIT 2.0 is essentially the same as the Lua 5.1 GC. There are some minor refinements for write barriers and a couple of speed optimizations. But the basic algorithms and data structures are unchanged.
The current garbage collector is relatively slow compared to implementations in other languages. It's not competitive with top-of-the-line GCs, especially for large workloads. The main causes for this are the independent memory allocator, cache-inefficient data structures and a high number of branch mispredictions. The current GC design is a dead end when it comes to further performance tuning.
What's needed is a complete redesign from scratch with the following goals:
This leads to the following implementation constraints:
The new garbage collector is an arena-based, quad-color incremental, generational, non-copying, high-speed, cache-optimized garbage collector.
Memory is requested from the operating system in big aligned memory blocks, called arenas. Arenas are split into tiles. Each tile is split into 16-byte sized cells. One or more cells make up a block, which is the basic memory unit used to store objects and related object data. All objects inside a tile are either traversable or not. The mark and block bitmaps are stored in the metadata area of each arena, with a metadata overhead of 1.5%. Huge blocks are located in separate memory areas.
Pointers to arenas, huge blocks, interned strings, weak tables, finalizers, etc. are held in dedicated, cache-efficient data structures which minimize branch mispredictions. E.g. hashes, unrolled linked lists or trees with high fan-out.
The allocator switches on-the-fly from a bump allocator to a segregated-fit, best-fit allocator with bounded search, depending on fragmentation pressure.
The collector is a quad-color, incremental mark & sweep collector with very low latency. Traversals are local to a tile and exhibit high cache locality. Tiles with non-traversable objects don't even need to be considered. Object marking and the ultra-fast sweep phase only work on metadata to reduce cache pressure. The sweep phase brings neither live nor dead object data back into the cache.
The write barrier of the incremental GC is very cheap and rarely triggers. A sequential store buffer (SSB) helps to further reduce overhead. The write barrier check can be done with only 2 or 3 machine instructions. The JIT compiler can eliminate most write barriers.
The collector automatically switches between a regular mode and a generational mode, depending on workload characteristics.
This is a short overview of the different GC algorithms used in Lua 5.x and LuaJIT 1.x/2.0 as well as the new GC in LuaJIT 2.1.
All of these implementations use a tracing garbage collector (#) with two basic phases:
Any practical GC implementation has a couple more phases (e.g. an atomic phase), but this is not relevant to the following discussion. To avoid a recursive algorithm, a mark stack or mark list can be used to keep track of objects that need to be traversed.
(#) 'Tracing' in this context means it's tracing through the live object graph, as opposed to a reference counting garbage collector, which manages reference counts for each object. The terminology is not related to the concept of a trace compiler.
This is the classic (non-incremental) two-color mark & sweep algorithm:
Newly allocated objects are white. The mark phase turns all reachable objects black. The sweep phase frees all white (unreachable) objects and flips the color of all black (surviving) objects back to white.
The main drawback of this algorithm is that the mutator cannot run interleaved with the collector. A collection cycle first has to be completely finished, which means the algorithm is non-incremental (atomic collection).
There are various optimizations, e.g. the meaning of the two colors can be switched after each GC cycle. This saves the color flip in the sweep phase. You can find plenty of variations in the literature.
This is the GC algorithm used by Lua 5.0. All objects are kept in a linked list, which is processed during the mark and sweep phases. Objects that have been marked and need to be traversed are chained in a separate mark list.
This is Dijkstra's three-color incremental mark & sweep algorithm:
Newly allocated objects are white. The mark phase starts at the GC roots. Marking a reachable object means flipping the color of it from white to gray and pushing it onto a gray stack (or rechaining it onto a gray list). The gray stack is iteratively processed, removing one gray object at a time. A gray object is traversed and all objects reachable from it are marked, like above. After an object has been traversed, it's turned from gray to black. The sweep phase works just like the two-color algorithm above.
This algorithm is incremental: the collector can operate in small steps, processing only a couple of objects from the gray stack and then let the mutator run again for a while. This spreads out the GC pauses into many short intervals, which is important for highly interactive workloads (e.g. games or internet servers).
But there's one catch: the mutator might get in the way of the collector and store a reference to a white (unprocessed) object at a black (processed) object. This object would never be marked and will be freed by the sweep, even though it's clearly still referenced from a reachable object, i.e. it should be kept alive.
To avoid this scenario, one has to preserve the tri-color invariant: a black object may never hold a reference to a white object. This is done with a write barrier, which has to be checked after every write. If the invariant has been violated, a fixup step is needed. There are two alternatives:
There are many optimizations to turn this into a practical algorithm. Here are the most important:
This is the GC algorithm used by Lua 5.1/5.2 and LuaJIT 1.x/2.0. It's an enhancement of the linked list algorithm from Lua 5.0. Tables use backward barriers, all other traversable objects use forward barriers.
LuaJIT 2.0 further optimizes the write barrier for tables by only checking for a black table, ignoring the color of the stored object. This is faster to check and still safe: the write barrier may trigger more often, but this does no harm. And it doesn't matter in practice, since GC cycles progress very fast and have long pauses inbetween, so objects are rarely black. Also stored objects usually are white, anyway.
The quad-color algorithm is a refinement of the tri-color algorithm:
There's a problem with the tri-color algorithm for backward barriers: the write barrier checks can get expensive if mark bits are not inline in the object itself. But one has to do an exact check for a black object before turning it gray again when the barrier triggers. Alas, the mark bits (white vs. black) are segregated in the new GC, only the gray bit is inline in the object.
Just checking for 'not gray' is not a good idea: the write barrier would be triggered for both white and black objects, always turning them gray on the first write. This is especially bad for white objects during GC pauses, as lots of gray objects may needlessly accumulate in the gray stack.
The solution is to introduce a fourth color, splitting up gray into light-gray and dark-gray. Newly allocated traversable objects are light-gray: the mark bit is white, the gray bit is set. A new object is usually written to immediately after allocation. The write barrier only checks for a cleared gray bit and doesn't trigger in this case.
When the object is marked during the mark phase, it's turned dark-gray (mark bit turned black) and pushed onto the gray stack. In case it's unreachable, the sweep phase can free a light-gray object like any other object marked white.
Dark-gray objects are turned black after traversal (clearing the gray bit) and turned white after sweeping. The write barrier may trigger during this short period and move the barrier back by turning it dark-gray again.
An object that survived one GC cycle is turned white like all other survivors. In case the object is written to after that, it's turned light-gray again. But this doesn't push the object onto the gray stack right away! In fact, only the gray bit needs to be flipped, which avoids further barriers as explained above.
The main advantage of the quad-color algorithm is the ultra-cheap write barrier: just check the gray bit, which needs only 2 or 3 machine instructions. And due to the initial coloring and the specific color transitions, write barriers for e.g. tables are hardly ever triggered in practice. The write barrier doesn't need to access the mark bitmap, which avoids polluting the cache with GC metadata while the mutator is running.
The quad-color algorithm can easily fall back to the tri-color algorithm for some traversable objects by turning them white initially and using forward write barriers. And there's an obvious shortcut for non-traversable objects: marking turns a white object black right away, which touches the mark bitmap only. Since these kind of objects are in segregated tiles, they don't need to be traversed and their data never needs to be brought into the cache during the mark phase.
This is the GC algorithm used by LuaJIT 2.1. Objects and the segregated metadata is managed in arenas (and not in linked lists).
There are various other GC algorithms that use more than the standard colors, e.g. two whites (see above). However none of them use the colors like this algorithm does. Also, a search for a "quad-color" GC (or variations) does not turn anything up. In the absence of evidence to the contrary, I (Mike Pall) hereby claim to have invented this algorithm. Please notify me immediately, if you disagree. As with all research results from my work on LuaJIT, I hereby donate the related intellectual property to the public domain. So please use it, share it and have fun!
The standard approach to generational GCs only works with a copying GC:
Obviously, this approach doesn't work for a non-copying GC. But the main insights behind a generational GC can be abstracted:
The basic idea is to modify the sweep phase: free the (unreachable) white objects, but don't flip the color of black objects before a minor collection. The mark phase of the following minor collection then only traverses tiles with newly allocated blocks and objects written to (marked gray). All other objects are assumed to be still reachable during a minor GC and are neither traversed, nor sweeped, nor are their marks changed (kept black). A regular sweep phase is used if a major collection is to follow.
The generational mode of the collector is automatically triggered by workloads with a high death rate for young allocations. Running (say) five minor collections that deal only with young allocations plus one major collection ought to be cheaper than running (say) two major collections in the same timeframe. And the maximum memory usage should be lower, too. The collector returns to the regular, non-generational mode, in case these assumptions turn out not to be true.
The image shown is greatly simplified: the allocation and survivor rates are constant; permanent objects don't change; collections are atomic.
Add your questions here and I'll try to clarify the docs. --Mike