This is a mailing list post by Mike Pall.
The following things are needed to get the best speed out for
numerical computations with LuaJIT (in order of importance):
Reduce number of unbiased/unpredictable branches.
- Heavily biased branches (>95% in one direction) are fine.
- Prefer branch-free algorithms.
- Use math.min() and math.max().
- Use bit.*, e.g. for conditional index computations.
Use FFI data structures.
- Use int32_t, avoid uint32_t data types.
- Use double, avoid float data types.
- Metamethods are fine, but don't overdose them.
Call C functions only via the FFI.
- Avoid calling trivial functions, better rewrite them in Lua.
- Avoid callbacks -- use pull-style APIs (read/write) and
Use plain 'for i=start,stop,step do ... end' loops.
- Prefer plain array indexing, e.g. 'a[i+2]'.
- Avoid pointer arithmetic.
Find the right balance for unrolling.
- Avoid inner loops with low iteration count (< 10).
- Only unroll loops if the loop body has not too many instructions.
- Consider using templates instead of hand-unrolling (see GSL Shell).
- You may have to experiment a bit.
Define and call only 'local' (!) functions within a module.
Cache often-used functions from other modules in upvalues.
- E.g. local sin = math.sin ... local function foo() return 2*sin(x) end
- Don't do this for FFI C functions, cache the namespace
instead, e.g. local lib = ffi.load("lib").
Avoid inventing your own dispatch mechanisms.
- Prefer to use built-in mechanisms, e.g. metamethods.
Do not try to second-guess the JIT compiler.
- It's perfectly ok to write 'z = x[a+b] + y[a+b]'.
- Do not try CSE by hand, e.g. 'local c = a+b'.
- It's perfectly ok to write 'a[i][j] = a[i][j] * a[i][j+1]'.
- Do not try to cache partial FFI struct/array references
(e.g. a[i]) unless they are long-lived (e.g. in a big loop).
Be careful with aliasing, esp. when using multiple arrays.
- LuaJIT uses strict type-based disambiguation, but there are
limits to this due to C99 conformance.
- E.g. in 'x[i] = a[i] + c[i]; y[i] = a[i] + d[i]' the load of
a[i] needs to be done twice, because x could alias a.
It does make sense to use 'do local t = a[i] ... end' here.
Reduce the number of live temporary variables.
- Best to initialize on definition, e.g. 'local y = f(x)'
- Yes, this means you should interleave this with other code.
- Do not hoist variable definitions to the start of a
- Use 'do local y = f(x) ... end' to bound variable lifetimes.
Do not intersperse expensive or uncompiled operations.
- print() is not compiled, use io.write().
- E.g. avoid assert(type(x) == "number", "x is a "..mytype(x)")
- The problem is not the assert() or the condition (basically
free). The problem is the string concatenation, which has to
be executed every time, even if the assertion never fails!
- Watch the output of -jv and -jdump.
You need to take all of these factors into account before deciding
on a certain algorithm. An advanced algorithm, that's fast in
theory, may be slower than a simpler algorithm, if the simpler
algorithm has much fewer unbiased branches.