This is a mailing list post by Mike Pall.
The following things are needed to get the best speed out for numerical computations with LuaJIT (in order of importance):
Reduce number of unbiased/unpredictable branches.
- Heavily biased branches (>95% in one direction) are fine.
- Prefer branch-free algorithms.
- Use math.min() and math.max().
- Use bit.*, e.g. for conditional index computations.
Use FFI data structures.
- Use int32_t, avoid uint32_t data types.
- Use double, avoid float data types.
- Metamethods are fine, but don't overuse them.
Call C functions only via the FFI.
- Avoid calling trivial functions, better rewrite them in Lua.
- Avoid callbacks -- use pull-style APIs (read/write) and iterators instead.
Use plain 'for i=start,stop,step do ... end' loops.
- Prefer plain array indexing, e.g. 'a[i+2]'.
- Avoid pointer arithmetic.
Find the right balance for unrolling.
- Avoid inner loops with low iteration count (< 10).
- Only unroll loops if the loop body has not too many instructions.
- Consider using templates instead of hand-unrolling (see GSL Shell).
- You may have to experiment a bit.
Define and call only 'local' (!) functions within a module.
Cache often-used functions from other modules in upvalues.
- E.g. local sin = math.sin ... local function foo() return 2*sin(x) end
- Don't do this for FFI C functions, cache the namespace instead, e.g. local lib = ffi.load("lib").
Avoid inventing your own dispatch mechanisms.
- Prefer to use built-in mechanisms, e.g. metamethods.
Do not try to second-guess the JIT compiler.
It's perfectly ok to write 'z = x[a+b] + y[a+b]'.
- Do not try CSE (Common Subexpression Elimination) by hand, e.g. 'local c = a+b'.
- It may become detrimental if the lifetime of the temporary is longer than needed. If the compiler cannot deduce that it's dead, then the useless temporary will block a register or stack slot and/or it needs to be stored to the Lua stack.
- Duplicate expression involving basic arithmetic operators that are relatively close to each other (and likely in the same trace) should not be manually CSEd. Loads only need to be manually hoisted, if alias analysis is likely to fail.
It's perfectly ok to write 'a[i][j] = a[i][j] * a[i][j+1]'.
- Do not try to cache partial FFI struct/array references (e.g. a[i]) unless they are long-lived (e.g. in a big loop).
There are quite a few "easy" optimizations where the compiler is in a better position to perform them. Better focus on the difficult things, like algorithmic improvements.
Be careful with aliasing, esp. when using multiple arrays.
- LuaJIT uses strict type-based disambiguation, but there are limits to this due to C99 conformance.
- E.g. in 'x[i] = a[i] + c[i]; y[i] = a[i] + d[i]' the load of a[i] needs to be done twice, because x could alias a. It does make sense to use 'do local t = a[i] ... end' here.
Reduce the number of live temporary variables.
- Best to initialize on definition, e.g. 'local y = f(x)'
- Yes, this means you should interleave this with other code.
- Use 'do local y = f(x) ... end' to bound variable lifetimes.
Do not intersperse expensive or uncompiled operations.
- print() is not compiled, use io.write().
- E.g. avoid assert(type(x) == "number", "x is a "..mytype(x)")
- The problem is not the assert() or the condition (basically free). The problem is the string concatenation, which has to be executed every time, even if the assertion never fails!
- Watch the output of -jv and -jdump.
You need to take all of these factors into account before deciding on a certain algorithm. An advanced algorithm, that's fast in theory, may be slower than a simpler algorithm, if the simpler algorithm has much fewer unbiased branches.