Hell Oh Entropy!

Life, Code and everything in between

Wrestling with the register allocator: LuaJIT edition

Posted: Sep 15, 2019, 17:20

For some time now, I kept running into one specific piece of code in luajit repeatedly for various reasons and last month I came across a fascinating bug in the register allocator pitfall that I had never encountered before. But then as is the norm after fixing such bugs, I concluded that it’s too trivial to write about and I was just stupid to not have found it sooner; all bugs are trivial once they’re fixed.

After getting over my imposter syndrome (yeah I know, took a while) I finally have the courage to write about it, so like a famous cook+musician says, enough jibberjabber…

Looping through a hash table

One of the key data structures that luajit uses to implement metatables is a hash table. Due to its extensive use, the lookup loop into such hash tables is optimised in the JIT using an architecture-specific asm_href function. Here is what a snippet of the arm64 version of the function looks like. A key thing to note is that the assembly code is generated backwards, i.e. the last instruction is emitted first.

  /* Key not found in chain: jump to exit (if merged) or load niltv. */
  l_end = emit_label(as);
  as->invmcp = NULL;
  if (merge == IR_NE)
    asm_guardcc(as, CC_AL);
  else if (destused)
    emit_loada(as, dest, niltvg(J2G(as->J)));

  /* Follow hash chain until the end. */
  l_loop = --as->mcp;
  emit_n(as, A64I_CMPx^A64I_K12^0, dest);
  emit_lso(as, A64I_LDRx, dest, dest, offsetof(Node, next));
  l_next = emit_label(as);

  /* Type and value comparison. */
  if (merge == IR_EQ)
    asm_guardcc(as, CC_EQ);
    emit_cond_branch(as, CC_EQ, l_end);

  if (irt_isnum(kt)) {
    if (isk) {
      /* Assumes -0.0 is already canonicalized to +0.0. */
      if (k)
        emit_n(as, A64I_CMPx^k, tmp);
        emit_nm(as, A64I_CMPx, key, tmp);
      emit_lso(as, A64I_LDRx, tmp, dest, offsetof(Node, key.u64));
    } else {
      Reg ftmp = ra_scratch(as, rset_exclude(RSET_FPR, key));
      emit_nm(as, A64I_FCMPd, key, ftmp);
      emit_dn(as, A64I_FMOV_D_R, (ftmp & 31), (tmp & 31));
      emit_cond_branch(as, CC_LO, l_next);
      emit_nm(as, A64I_CMPx | A64F_SH(A64SH_LSR, 32), tisnum, tmp);
      emit_lso(as, A64I_LDRx, tmp, dest, offsetof(Node, key.n));
  } else if (irt_isaddr(kt)) {
    Reg scr;
    if (isk) {
      int64_t kk = ((int64_t)irt_toitype(irkey->t) << 47) | irkey[1].tv.u64;
      scr = ra_allock(as, kk, allow);
      emit_nm(as, A64I_CMPx, scr, tmp);
      emit_lso(as, A64I_LDRx, tmp, dest, offsetof(Node, key.u64));
    } else {
      scr = ra_scratch(as, allow);
      emit_nm(as, A64I_CMPx, tmp, scr);
      emit_lso(as, A64I_LDRx, scr, dest, offsetof(Node, key.u64));
    rset_clear(allow, scr);
  } else {
    Reg type, scr;
    lua_assert(irt_ispri(kt) && !irt_isnil(kt));
    type = ra_allock(as, ~((int64_t)~irt_toitype(ir->t) << 47), allow);
    scr = ra_scratch(as, rset_clear(allow, type));
    rset_clear(allow, scr);
    emit_nm(as, A64I_CMPw, scr, type);
    emit_lso(as, A64I_LDRx, scr, dest, offsetof(Node, key));

  *l_loop = A64I_BCC | A64F_S19(as->mcp - l_loop) | CC_NE;

Here, the emit_* functions emit assembly instructions and the ra_* functions allocate registers. In the normal case everything is fine and the table lookup code is concise and effective. When there is register pressure however, things get interesting.

As an example, here is what a typical type lookup would look like:

0x100	ldr x1, [x16, #52]
0x104	cmp x1, x2
0x108	beq -> exit
0x10c	ldr x16, [x16, #16]
0x110	cmp x16, #0
0x114	bne 0x100

Here, x16 is the table that the loop traverses. x1 is a key, which if it matches, results in an exit to the interpreter. Otherwise the loop moves ahead until the end of the table. The comparison is done with a constant stored in x2.

The value of x is loaded later (i.e. earlier in the code, we are emitting code backwards, remember?) whenever that register is needed for reuse through a process called restoration or spilling. In the restore case, it is loaded into register as a constant or expressed as another constant (look up constant rematerialisation) and in the case of a spill, the register is restored from a slot in the stack. If there is no register pressure, all of this restoration happens at the head of the trace, which is why if you study a typical trace you will notice a lot of constant loads at the top of the trace.

Like the Spill that ruined your keyboard the other day…

Things get interesting when the allocation of x2 in the loop above results in a restore. Looking at the code a bit closer:

    type = ra_allock(as, ~((int64_t)~irt_toitype(ir->t) << 47), allow);
    scr = ra_scratch(as, rset_clear(allow, type));
    rset_clear(allow, scr);
    emit_nm(as, A64I_CMPw, scr, type);
    emit_lso(as, A64I_LDRx, scr, dest, offsetof(Node, key));

The x2 here is type, which is a constant. If a register is not available, we have to make one available by either rematerializing or by restoring the register, which would result in something like this:

0x100   ldr x1, [x16, #52]
0x104   cmp x1, x2
0x108	mov x2, #42
0x10c   beq -> exit
0x110   ldr x16, [x16, #16]
0x114   cmp x16, #0
0x118   bne 0x100

This ends up breaking the loop because the allocator restore/spill logic assumes that the code is linear and the restore will affect only code that follows it, i.e. code that got generated earlier. To fix this, all of the register allocations should be done before the loop code is generated.

Making things right

The result of this analysis was this fix in my LuaJIT fork that allocates registers for operands that will be used in the loop before generating the body of the loop. That is, if the registers have to spill, they will do so after the loop (we are generating code in reverse order) and leave the loop compact. The fix is also in the luajit2 repository in the OpenResty project. The work was sponsored by OpenResty as this wonderfully vague bug could only be produced by some very complex scripts that are part of the OpenResty product.


A JIT in Time...

Posted: Mar 28, 2019, 13:29

It’s been a different 3 months. For over 6 years I had been working almost exclusively on the GNU toolchain with a focus on glibc and I now had the chance of working on a completely different set of projects, something I had done a lot of during my Red Hat technical support days but not since. I was to look into Pypy, OpenJDK and LuaJIT, three very different projects with very different development styles, communities and technologies. The comparison of these projects among themselves and the GNU projects is an interesting point but not the purpose of this post, maybe some other day. In this post I want to talk about the project I spent the most time on (~1.5 months) and found to be technically the most intriguing: LuaJIT.

A Just In Time Introduction

For those new to the concept, JIT compilation techniques are pretty old and there is a very interesting paper called the A brief history of just in time that does what the title states. The basic concept is quite straightforward - code written in a high level language (in the case of luajit, lua) is interpreted as usual while keeping track of which parts of the code get hit often. If a part of the code is seen to be executed repeatedly, all or part of that code is compiled into binary and mapped in, with entry and exit branches into the interpreter, also known as exit guards. There are a number of tradeoffs in designing a JIT and the paper I’ve linked above gives enough of an introduction to appreciate the complexity of the problem being solved.

The key difference from compilers is that the time required to compile is often as much a performance factor as the quality of the generated code. Due to this, one needs to be careful about the amount of processing one can do on the code to optimise it. So while gcc or llvm may end up giving higher quality code, the ~200 passes that are involved in building a TU may well end up eating up all the performance gains compiling just in time would have given.

LuaJIT: Peeking under the hood

The LuaJIT project was started and is mostly written by Mike Pall, that is apparently a pseudonym for a very private and very smart hacker. I assume that he is male given that Mike is a common male name. The source code repository is a bit odd. There is a github repository that is supposed to be official but isn’t; it is a mirror created by CloudFlare along with Mike with the aim to broaden the developer community base. That ride hasn’t been the smoothest and I’ve talked about it in more detail below. The latest code with support for other architectures such as arm64 and ppc are in the v2.1 branch, which has only had beta releases come off it, the last one in 2017. There are tests in a separate repository called LuaJIT-test-cleanup which has a big fat warning that it is not the official testsuite, although if you look around, it pretty much is the only testsuite worth using for luajit.

Wait, there’s also bench_lua, which has some benchmarks and a pretty nice driver for the benchmarks, something that the LuaJIT-test-cleanup benchmarks lack.

LuaJIT uses the concept of trace compiling which is pretty simple in concept but has some very interesting side-effects. The idea of trace compilation, specifically with luajit is quite simple and follows roughly this logic:

This keeps on repeating as the interpreter encounters more hotspots. The interesting bit here is that the only bit that gets compiled is the code that gets executed during the trace. So if you have a branch like so:

    if cond > threshold then
        i = i + 1
        i = i - 1

and the else block is executed during the trace, only that bit is compiled and not the if block. The compiled code then has branches (known as exit guards) to jump back into the interpreter if the condition is true. This produces an interesting optimisation opportunity that can be done during tracing itself. If cond > threshold is found to be always false because they are constants or some other reason, the if condition can be completely eliminated, which saves compilation time as well as execution time.

Another interesting side effect of tracing that is not seen in typical compilers is that function calls effectively get inlined. Again, that becomes a very cheap way to achieve something that would otherwise have been done in a separate pass in traditional compilers.

In addition to very fast tracing and compilation, all of luajit is quite compact. It’s IR is linear array based and is hence allows very fast traversal. It’s easy to visualize it using the jit.* debug modules and using the -jdump flag to dump the IR during execution. The luajit wiki has some pretty detailed documentation on its internals.

The coding style of the project is a bit too compact to my taste since I personally prefer writing for readability. There are a lot of constructs throughout the code that need a fair amount of squinting to understand, such as assignments inside the for loop headers and inside conditions. OK all of you pointing at the macro and makefile soups in glibc and laughing, please be quiet ;)

There’s also the infamous (at least in luajit circles) 47-bit address space limitation for garbage collected objects in luajit because luajit uses the top bits for metadata. This is known to have correctness issues with Lua userdata objects and also performance issues because luajit repeatedly tries allocations until it finds a suitable address in the 47-bit space. It doesn’t hurt x86 much (because of MAP_32BIT) but arm64 feels it and I imagine so do other architectures.

My LuaJIT involvement

My full time involvement with luajit was brief and will likely end soon (my personal involvement may still continue) so in this short period I wanted to tick off as many short but significant work items as I could. My github fork is here.

Sameera Deshpande started the initial work and then helped me ramp up later on. We got a couple of CI instances up and running to begin with, one for the official repository and another for my github fork so that I can review my changes regularly. If you’re interested in adding a node for your architecture to the Ci projects, please feel free to reach out to me, Linaro will happily add the node to the CI matrix.

Register Allocation improvements

The register allocator in luajit is pretty simple to keep the compilation overhead low. Registers are allocated sequentially based on their categories (caller saved, callee saved, etc.) and it uses some tricks such as constant rematerialization used to reduce register pressure. Rematerialization is also very basic in its implementation; whenever constants need to be allocated to registers, it is preferred that they use existing constants, (assuming their live ranges are compatible) either directly or as a constant computation. This is quite valuable because there is a fair amount of constant usage in the JITted code; exit guard addresses are coded in as constants for example and so are floating point numbers, in addition to the usual integers. The register modes are not specified during allocation and are defined by the instructions generated in the assembly phase.

There was a bug in the luajit register allocator due to which registers used for constant rematerialization were being clobbered, resulting in corruption. A fix was proposed but the author of the fix was not sure if it was correct. I posted an alternative patch and then realized and explained why my patch is overkill and his approach is optimal. I added additional cleanups to that to finish it up.

While working on this problem, I noticed that the arm64 backend was not using XZR often enough and I posted a patch to fix that. I started benchmarking the improvement (the codegen was obviously better, it was saving registers for stores fo zeroes for example) and quickly realized that both bench_lua and the LuaJIT-test-cleanup benchmarks were quite raw and couldn’t be relied upon for consistent results.

So I digressed.

Benchmark improvements and luaJIT-test-cleanup cleanup

bench_lua was my more favourite project to hack on benchmarks because it was evident that reviews were very hard to come by in the luajit project. Also, bench_lua had a benchmark driver that produced repeatable results but it still had some cleanup issues, including the fact that it did not have a license! The author was very responsive on the license question though and quickly put one in. I fixed some timing issues in the driver and while doing so, I realized that it might be better if I used this driver on the more extensive set of benchmarks in LuaJIT-test-cleanup. So that’s what I did.

I integrated the bench_lua driver into luajit-test-cleanup and added Makefile targets so that one could easily do make check and make bench to run the tests and benchmarks. Now I had something I could work with but it was still in a different repo and it was getting quite cumbersome to work with them.

So I integrated LuaJIT-test-cleanup into LuaJIT. Now I had a LuaJIT repository that IMO was complete and could handle the standard make/make check workflow. At the same time, it was modular enough that it could be merged into the upstream LuaJIT with relative ease. I posted all of these patches as PRs and watched as nothing happened. The LuaJIT-test-cleanup project had not seen a PR review since about 2016 and the LuaJIT project had seen occassional comments and patches from Mike in the past couple of years, but not much else.

Fusing and combining optimisations

Instruction fusion is an architecture dependent feature in luajit and each backend implements its own during the IR to assembly conversion phase, where the IR is traversed from the bottom up and assembly instructions generated sequentially. Luajit does some trivial reordering in its IR optimisation passes but during assembly, it does not peek ahead to actively look for instruction fusion opportunities; it only tries to fuse neighbouring instructions. As a result, while there are implementations for instructions like load and store pair in arm64, it is useful in only the most trivial of tests. Likewise for fmadd/fmsub; a simple intervening load is sufficient to prevent the optimisation.

In addition to this, it is often seen that optimisations like loop unrolling and vectorisation bring in even more opportunities for combining of loads and stores. Luajit does some loop peeling but that’s about it.

Sameera did some analysis on ways to introduce more aggressive unrolling and possibly some amount of vectorisation but we did not have enough time to implement it. She did have enough time to implement some instruction fusing and using fnmadd and fnmsub for arm64. She also looked at load combining opportunities but realized that luajit would need more powerful instruction reordering, similar to the load grouping in the gcc scheduler that makes load pair generation much easier. So that project was also not small enough for us to complete in the limited time.

Casting floats to unsigned integers

The C standard defines casting of floating point types to unsigned integer types only for the range (-1.0, UTYPE_MAX), where UTYPE_MAX is the unsigned version of TYPE. Casts to signed types work just fine as long as the number is in the range of that type. Waters get a bit murky with dynamic types and type narrowing when the default internal representation for all numbers is double. That was the situation in luajit. The fix for this was pretty straightforward in theory, which was to add an additional cast from float to signed int and then to unsigned int for floating point values less than zero and sticking to a direct cast to unsigned int for positive numbers. I have implemented this for the interpreter and for arm64 in my fork.

Project state and the road ahead

LuaJIT is a very interesting project that has some very interesting concepts that I learned in the last month or so. It has a pretty active user community that sings praises of the project and seems to advocate it in a number of areas. However, the project development itself is in a bit of a crisis.

Around 2015 Mike Pall said he wanted to step back from the project and wanted more people to get involved in the development. With that intent, Cloudflare created the github organisation and repository to allow for better collaboration. Based on conversation threads I read, things seemed to go fine when the community stepped in to create the LuaJIT-test-cleanup repository based on some initial tests Mike had written and built it up into a set of 500+ tests. However in about a year that excitement faded because nobody was made maintainer alongside Mike to carry forward the work and that meant that the LuaJIT project itself would only get sporadic fixes whenever Mike had some free time. Minor patches were accepted but bigger pieces of code went unreviewed and presumably the developers also lost interest.

Fast forward four years into 2019 and we are still in the same situation, probably worse. LuaJIT-test-cleanup has not had a patch review since 2016. LuaJIT has had comments about a couple of times each quarter and bug fixes with similar frequency, but not much else. The mailing list also has similar traffic - I announced all of the work I did above and did not get any responses. there are forks of LuaJIT all over the place in projects such as OpenResty and RaptorJIT and the projects seem happy to let things run that way. Lua language support is in a bit of a limbo with it being mostly 5.1 compliant with some 5.2 bits thrown in. Overall, it’s a great chunk of code that’s about to vanish into oblivion.

Then there is the very tricky question of copyright. The copyright notices all over the code say that Mike Pall has ownership. However, the code clearly has a number of contributions from others and there is no copyright assignment in place. While it’s likely not an issue from a licensing standpoint (IANAL, etc.), it is definitely something that needs to be addressed if the project is somehow ressurected, at the very least to give more prominent credit to contributors.

I’ve posted PRs for my work and tried to engage but I don’t have much hope given past history. I intend to spend at least some of my free time tinkering with this code since it’s just a very interesting project and there’s a lot that can be done. I am trawling the PRs and issue lists to look for patches that can be incorporated in my tree so if anyone is interested in contributing patches, you’re most welcome. I will continue to ensure that my tree applies on top of the official repository because I do not want to give up hope of the project coming back to life.