Lua

The LuaJIT Wiki

not logged in | [Login]

The following design document describes the the new garbage collector (GC) to be introduced with LuaJIT 2.1. This document is very much a work in progress right now. Anything may change for the actual implementation. No code is available, yet.

Rationale

The garbage collector used by LuaJIT 2.0 is essentially the same as the Lua 5.1 GC. There are some minor refinements for write barriers and a couple of speed optimizations. But the basic algorithms and data structures are unchanged.

The current garbage collector is relatively slow compared to implementations in other languages. It's not competitive with top-of-the-line GCs, especially for large workloads. The main causes for this are the independent memory allocator, cache-inefficient data structures and a high number of branch mispredictions. The current GC design is a dead end when it comes to further performance tuning.

What's needed is a complete redesign from scratch with the following goals:

  • It must be a very fast, top-of-the-line, highly competitive garbage collector.
  • It must be able to sustain high throughput and large workloads.
  • It needs to be incremental (but not realtime) with very low latencies.
  • It must be non-copying, due to various constraints in the Lua/C API.
  • But it needs to tightly control fragmentation.
  • It doesn't need to be concurrent, since Lua states are completely independent.
  • It would be nice to have an (automatic) generational mode for certain workloads.
  • It should be optimized for operation with the combination of an interpreter and a JIT compiler.

This leads to the following implementation constraints:

  • The memory allocator and the garbage collector must be tightly integrated.
  • Plain linked lists should be avoided as far as possible.
  • Data structures should be optimized for high-speed allocation, traversal, marking and sweeping.
  • Object metadata, such as mark & block info should be segregated.
  • Huge objects should be segregated.
  • Traversable and non-traversable objects should be segregated.
  • Traversals should be linear or at least have strong locality.
  • Memory should be requested and returned from or to the operating system in big blocks only.

Overview

The new garbage collector is an arena-based, quad-color incremental, generational, non-copying, high-speed, cache-optimized garbage collector.

Memory is requested from the operating system in big aligned memory blocks, called arenas. Arenas are split into tiles. Each tile is split into 16 byte sized cells. One or more cells make up a block, which is the basic memory unit used to store objects and related object data. All objects inside a tile are either traversable or not. The mark and block bitmaps are stored in the metadata area of each arena, with a metadata overhead of 1.5%. Huge blocks are located in separate memory areas.

Pointers to arenas, huge blocks, interned strings, weak tables, finalizers, etc. are held in dedicated, cache-efficient data structures which minimize branch mispredictions. E.g. hashes, unrolled linked lists or trees with high fan-out.

The allocator switches on-the-fly from a bump allocator to a segregated-fit, best-fit allocator with bounded search, depending on fragmentation pressure.

The collector is a quad-color, incremental mark & sweep collector with very low latency. Traversals are local to a tile and exhibit high cache locality. Tiles with non-traversable objects don't even need to be considered. Object marking and the ultra-fast sweep phase only work on metadata to reduce cache pressure. The sweep phase brings neither live nor dead object data back into the cache.

The write barrier of the incremental GC is very cheap and rarely triggers. A sequential store buffer (SSB) helps to further reduce overhead. The write barrier check can be done with only 2 or 3 machine instructions. The JIT compiler can eliminate most write barriers.

The generational mode of the collector is automatically triggered by workloads with a high death rate for young allocations. The mark bits are simply not cleared after a major collection. A minor collection only needs to traverse tiles with newly allocated blocks and objects written to (marked gray).

GC Algorithm

GC Phases

Two-Color Mark & Sweep

This is the classic (non-incremental) two-color mark & sweep algorithm:

Newly allocated objects are white. The mark phase starts at the GC roots and recursively turns all reachable (i.e. live) objects black. This can also be done iteratively with an explicit mark stack or mark list. The sweep phase frees all white, (i.e. unreachable) objects and flips the color of all black objects back to white.

The main drawback of this algorithm is that the mutator cannot run interleaved with the collector. A collection cycle first has to be completely finished, i.e. the algorithm is non-incremental.

There are various optimizations, e.g. the meaning of the two colors can be switched after each GC cycle. This saves the color flip in the sweep phase. You can find plenty of variations in the literature.

This is the GC algorithm used by Lua 5.0. All objects are kept in a linked list, which is processed during the mark and sweep phases. Objects that have been marked and need to be traversed are chained in a separate mark list.

Tri-Color Incremental Mark & Sweep

This is Dijkstra's three-color incremental mark & sweep algorithm:

Newly allocated objects are white. The mark phase starts at the GC roots. Marking a reachable object means flipping the color of it from white to gray and pushing it onto a gray stack (or rechaining it onto a gray list). The gray stack is iteratively processed, removing one gray object at a time. A gray object is traversed and all objects reachable from it are marked, like above. After an object has been traversed, it's turned from gray to black. The sweep phase works just like the two-color algorithm above.

This algorithm is incremental: the collector can operate in small steps, processing only a couple of objects from the gray stack and then let the mutator run again for a while. This spreads out the GC pauses into many short intervals, which is important for highly interactive workloads (e.g. games or internet servers).

But there's one catch: the mutator might get in the way of the collector and store a reference to a white (unprocessed) object at a black (processed) object. This object would never be marked and will be freed by the sweep, even though it's clearly still referenced from a reachable object, i.e. it should be kept alive.

To avoid this scenario, one has to preserve the tri-color invariant: a black object may never hold a reference to a white object. This is done with a write barrier, which has to be checked after every write. If the invariant has been violated, a fixup step is needed. There are two alternatives:

  1. Either turn the black object gray and push it back onto the gray stack. This is moving the barrier "back", because the object has to be reprocessed later on. This is beneficial for container objects, because they usually receive several stores in succession. This avoids a barrier for the next objects that are stored into it (which are likely white, too).

  2. Or immediately mark the white object, turning it gray and push it onto the gray stack. This moves the barrier "forward", because it implicitly drives the GC forward. This works best for objects that only receive isolated stores.

There are many optimizations to turn this into a practical algorithm. Here are the most important:

  • Stacks should always be kept gray and re-traversed just before the final sweep phase. This avoids a write barrier for stores to stack slots, which are the most common kind of stores.

  • Objects which have no references to child objects can immediately be turned from white to black and don't need to go through the gray stack.

  • The sweep phase can be made incremental by using two whites and flipping between them just before entering the sweep phase. Objects with the 'current' white need to be kept. Only objects with the 'other' white should be freed.

This is the GC algorithm used by Lua 5.1 and LuaJIT 2.0. It's an enhancement of the linked list algorithm from Lua 5.0. Tables use backward barriers, all other traversable objects use forward barriers.

LuaJIT 2.0 further optimizes the write barrier for tables by only checking for a black table, ignoring the color of the stored object. This is faster to check and still safe: the write barrier may trigger more often, but this does no harm. And it doesn't matter in practice, since GC cycles progress very fast and have long pauses inbetween, so objects are rarely black. Also stored objects usually are white, anyway.

Quad-Color Optimized Incremental Mark & Sweep

The quad-color algorithm is a refinement of the tri-color algorithm:

There's a problem with the tri-color algorithm for table barriers: the write barrier checks can get expensive if mark bits are not inline in the object itself. But one has to do an exact check for a black table object before turning it gray again when the barrier triggers. Alas, the mark bits (white vs. black) are segregated in the new GC, only the gray bit is inline in the object.

Just checking for 'not gray' is not a good idea: the write barrier would be triggered for both white and black objects, always turning them gray on the first write. This is especially bad for white objects during GC pauses, as lots of gray objects may needlessly accumulate in the gray stack.

The solution is to introduce a fourth color, splitting up gray into light-gray and dark-gray. Newly allocated table objects are light-gray: the mark bit is white, the gray bit is set. A new table is usually written to immediately after allocation. The write barrier only checks for a cleared gray bit and doesn't trigger in this case.

When the table is marked during the mark phase, it's turned dark-gray (mark bit turned black) and pushed onto the gray stack. In case it's unreachable, the sweep phase can free a light-gray table like any other object marked white.

Dark-gray objects are turned black after traversal (clearing the gray bit) and turned white after sweeping. The write barrier may trigger during this short period and move the barrier back by turning it dark-gray again.

A table object that survived one GC cycle is turned white like all other survivors. In case the table is written to after that, it's turned light-gray again. But this doesn't push the table onto the gray stack right away! In fact, only the gray bit needs to be flipped, which avoids further barriers as explained above.

The main advantage of the quad-color algorithm is the ultra-cheap write barrier: just check the gray bit, which needs only 2 or 3 machine instructions. And due to the initial coloring and the specific color transitions, write barriers for tables are hardly ever triggered in practice. The write barrier doesn't need access to the mark bitmap, which avoids polluting the cache with GC metadata while the mutator is running.

The quad-color algorithm can easily fall back to the tri-color algorithm for other traversable objects by turning them white initially and using forward write barriers. And there's an obvious shortcut for non-traversable objects: marking turns a white object black right away, which touches the mark bitmap only. Since these kind of objects are in segregated tiles, they don't need to be traversed and their data never needs to be brought into the cache during the mark phase.

There are various other GC algorithms that use more than the standard colors, e.g. two whites (see above). However none of them use the colors like this algorithm does. Also, a search for a "quad-color" GC (or variations) does not turn anything up. In the absence of evidence to the contrary, I (Mike Pall) hereby claim to have invented this algorithm. Please notify me immediately, if you disagree. As with all research results from my work on LuaJIT, I hereby donate the related intellectual property to the public domain. So please use it, share it and have fun.

Generational GC

Arenas

Tiles

Cells

Blocks

Mark & Block Bitmaps

Gray Stack

Huge Objects

Block Allocator

Bump Allocator

Fit Allocator

Write Barrier

Sequential Store Buffer

Special Considerations

Stacks

Strings

Weak Tables

Finalizers

Object Layout

Object Tags

Performance Optimizations

Cache Effects

Branch Prediction

Data Structures

Bitmap Tricks

Bump Frontier

Segregated Traversal

Questions?

Add your questions here and I'll try to clarify the docs. --Mike

Q: