It’s been a different 3 months. For over 6 years I had been working almost exclusively on the GNU toolchain with a focus on glibc and I now had the chance of working on a completely different set of projects, something I had done a lot of during my Red Hat technical support days but not since. I was to look into Pypy, OpenJDK and LuaJIT, three very different projects with very different development styles, communities and technologies. The comparison of these projects among themselves and the GNU projects is an interesting point but not the purpose of this post, maybe some other day. In this post I want to talk about the project I spent the most time on (~1.5 months) and found to be technically the most intriguing: LuaJIT.
A Just In Time Introduction
For those new to the concept, JIT compilation techniques are pretty old and there is a very interesting paper called the A brief history of just in time that does what the title states. The basic concept is quite straightforward - code written in a high level language (in the case of luajit, lua) is interpreted as usual while keeping track of which parts of the code get hit often. If a part of the code is seen to be executed repeatedly, all or part of that code is compiled into binary and mapped in, with entry and exit branches into the interpreter, also known as exit guards. There are a number of tradeoffs in designing a JIT and the paper I’ve linked above gives enough of an introduction to appreciate the complexity of the problem being solved.
The key difference from compilers is that the time required to compile is often as much a performance factor as the quality of the generated code. Due to this, one needs to be careful about the amount of processing one can do on the code to optimise it. So while gcc or llvm may end up giving higher quality code, the ~200 passes that are involved in building a TU may well end up eating up all the performance gains compiling just in time would have given.
LuaJIT: Peeking under the hood
The LuaJIT project was started and is mostly written by Mike Pall, that is apparently a pseudonym for a very private and very smart hacker. I assume that he is male given that Mike is a common male name. The source code repository is a bit odd. There is a github repository that is supposed to be official but isn’t; it is a mirror created by CloudFlare along with Mike with the aim to broaden the developer community base. That ride hasn’t been the smoothest and I’ve talked about it in more detail below. The latest code with support for other architectures such as arm64 and ppc are in the v2.1 branch, which has only had beta releases come off it, the last one in 2017. There are tests in a separate repository called LuaJIT-test-cleanup which has a big fat warning that it is not the official testsuite, although if you look around, it pretty much is the only testsuite worth using for luajit.
Wait, there’s also bench_lua, which has some benchmarks and a pretty nice driver for the benchmarks, something that the LuaJIT-test-cleanup benchmarks lack.
LuaJIT uses the concept of trace compiling which is pretty simple in concept but has some very interesting side-effects. The idea of trace compilation, specifically with luajit is quite simple and follows roughly this logic:
- Interpret program and profile it while it is running. Typical candidates for profiling would be loops for the obvious reason that it will likely execute repeatedly.
- If a loop is hit repeatedly, i.e. it crosses a threshold number of iterations, the JIT compiler is invoked on its next iteration.
- The JIT compiler first traces execution of the program and generates an IR for the trace of the program.
- The IR then goes through some optimisation passes and finally code is generated for the desired CPU backend.
This keeps on repeating as the interpreter encounters more hotspots. The interesting bit here is that the only bit that gets compiled is the code that gets executed during the trace. So if you have a branch like so:
if cond > threshold then i = i + 1 else i = i - 1 end
and the else block is executed during the trace, only that bit is compiled and not the if block. The compiled code then has branches (known as exit guards) to jump back into the interpreter if the condition is true. This produces an interesting optimisation opportunity that can be done during tracing itself. If
cond > threshold is found to be always false because they are constants or some other reason, the
if condition can be completely eliminated, which saves compilation time as well as execution time.
Another interesting side effect of tracing that is not seen in typical compilers is that function calls effectively get inlined. Again, that becomes a very cheap way to achieve something that would otherwise have been done in a separate pass in traditional compilers.
In addition to very fast tracing and compilation, all of luajit is quite compact. It’s IR is linear array based and is hence allows very fast traversal. It’s easy to visualize it using the jit.* debug modules and using the -jdump flag to dump the IR during execution. The luajit wiki has some pretty detailed documentation on its internals.
The coding style of the project is a bit too compact to my taste since I personally prefer writing for readability. There are a lot of constructs throughout the code that need a fair amount of squinting to understand, such as assignments inside the for loop headers and inside conditions. OK all of you pointing at the macro and makefile soups in glibc and laughing, please be quiet ;)
There’s also the infamous (at least in luajit circles) 47-bit address space limitation for garbage collected objects in luajit because luajit uses the top bits for metadata. This is known to have correctness issues with Lua userdata objects and also performance issues because luajit repeatedly tries allocations until it finds a suitable address in the 47-bit space. It doesn’t hurt x86 much (because of MAP_32BIT) but arm64 feels it and I imagine so do other architectures.
My LuaJIT involvement
My full time involvement with luajit was brief and will likely end soon (my personal involvement may still continue) so in this short period I wanted to tick off as many short but significant work items as I could. My github fork is here.
Sameera Deshpande started the initial work and then helped me ramp up later on. We got a couple of CI instances up and running to begin with, one for the official repository and another for my github fork so that I can review my changes regularly. If you’re interested in adding a node for your architecture to the Ci projects, please feel free to reach out to me, Linaro will happily add the node to the CI matrix.
Register Allocation improvements
The register allocator in luajit is pretty simple to keep the compilation overhead low. Registers are allocated sequentially based on their categories (caller saved, callee saved, etc.) and it uses some tricks such as constant rematerialization used to reduce register pressure. Rematerialization is also very basic in its implementation; whenever constants need to be allocated to registers, it is preferred that they use existing constants, (assuming their live ranges are compatible) either directly or as a constant computation. This is quite valuable because there is a fair amount of constant usage in the JITted code; exit guard addresses are coded in as constants for example and so are floating point numbers, in addition to the usual integers. The register modes are not specified during allocation and are defined by the instructions generated in the assembly phase.
There was a bug in the luajit register allocator due to which registers used for constant rematerialization were being clobbered, resulting in corruption. A fix was proposed but the author of the fix was not sure if it was correct. I posted an alternative patch and then realized and explained why my patch is overkill and his approach is optimal. I added additional cleanups to that to finish it up.
While working on this problem, I noticed that the arm64 backend was not using XZR often enough and I posted a patch to fix that. I started benchmarking the improvement (the codegen was obviously better, it was saving registers for stores fo zeroes for example) and quickly realized that both bench_lua and the LuaJIT-test-cleanup benchmarks were quite raw and couldn’t be relied upon for consistent results.
So I digressed.
Benchmark improvements and luaJIT-test-cleanup cleanup
bench_lua was my more favourite project to hack on benchmarks because it was evident that reviews were very hard to come by in the luajit project. Also, bench_lua had a benchmark driver that produced repeatable results but it still had some cleanup issues, including the fact that it did not have a license! The author was very responsive on the license question though and quickly put one in. I fixed some timing issues in the driver and while doing so, I realized that it might be better if I used this driver on the more extensive set of benchmarks in LuaJIT-test-cleanup. So that’s what I did.
I integrated the bench_lua driver into luajit-test-cleanup and added Makefile targets so that one could easily do make check and make bench to run the tests and benchmarks. Now I had something I could work with but it was still in a different repo and it was getting quite cumbersome to work with them.
So I integrated LuaJIT-test-cleanup into LuaJIT. Now I had a LuaJIT repository that IMO was complete and could handle the standard make/make check workflow. At the same time, it was modular enough that it could be merged into the upstream LuaJIT with relative ease. I posted all of these patches as PRs and watched as nothing happened. The LuaJIT-test-cleanup project had not seen a PR review since about 2016 and the LuaJIT project had seen occassional comments and patches from Mike in the past couple of years, but not much else.
Fusing and combining optimisations
Instruction fusion is an architecture dependent feature in luajit and each backend implements its own during the IR to assembly conversion phase, where the IR is traversed from the bottom up and assembly instructions generated sequentially. Luajit does some trivial reordering in its IR optimisation passes but during assembly, it does not peek ahead to actively look for instruction fusion opportunities; it only tries to fuse neighbouring instructions. As a result, while there are implementations for instructions like load and store pair in arm64, it is useful in only the most trivial of tests. Likewise for fmadd/fmsub; a simple intervening load is sufficient to prevent the optimisation.
In addition to this, it is often seen that optimisations like loop unrolling and vectorisation bring in even more opportunities for combining of loads and stores. Luajit does some loop peeling but that’s about it.
Sameera did some analysis on ways to introduce more aggressive unrolling and possibly some amount of vectorisation but we did not have enough time to implement it. She did have enough time to implement some instruction fusing and using fnmadd and fnmsub for arm64. She also looked at load combining opportunities but realized that luajit would need more powerful instruction reordering, similar to the load grouping in the gcc scheduler that makes load pair generation much easier. So that project was also not small enough for us to complete in the limited time.
Casting floats to unsigned integers
The C standard defines casting of floating point types to unsigned integer types only for the range (-1.0, UTYPE_MAX), where UTYPE_MAX is the unsigned version of TYPE. Casts to signed types work just fine as long as the number is in the range of that type. Waters get a bit murky with dynamic types and type narrowing when the default internal representation for all numbers is double. That was the situation in luajit. The fix for this was pretty straightforward in theory, which was to add an additional cast from float to signed int and then to unsigned int for floating point values less than zero and sticking to a direct cast to unsigned int for positive numbers. I have implemented this for the interpreter and for arm64 in my fork.
Project state and the road ahead
LuaJIT is a very interesting project that has some very interesting concepts that I learned in the last month or so. It has a pretty active user community that sings praises of the project and seems to advocate it in a number of areas. However, the project development itself is in a bit of a crisis.
Around 2015 Mike Pall said he wanted to step back from the project and wanted more people to get involved in the development. With that intent, Cloudflare created the github organisation and repository to allow for better collaboration. Based on conversation threads I read, things seemed to go fine when the community stepped in to create the LuaJIT-test-cleanup repository based on some initial tests Mike had written and built it up into a set of 500+ tests. However in about a year that excitement faded because nobody was made maintainer alongside Mike to carry forward the work and that meant that the LuaJIT project itself would only get sporadic fixes whenever Mike had some free time. Minor patches were accepted but bigger pieces of code went unreviewed and presumably the developers also lost interest.
Fast forward four years into 2019 and we are still in the same situation, probably worse. LuaJIT-test-cleanup has not had a patch review since 2016. LuaJIT has had comments about a couple of times each quarter and bug fixes with similar frequency, but not much else. The mailing list also has similar traffic - I announced all of the work I did above and did not get any responses. there are forks of LuaJIT all over the place in projects such as OpenResty and RaptorJIT and the projects seem happy to let things run that way. Lua language support is in a bit of a limbo with it being mostly 5.1 compliant with some 5.2 bits thrown in. Overall, it’s a great chunk of code that’s about to vanish into oblivion.
Then there is the very tricky question of copyright. The copyright notices all over the code say that Mike Pall has ownership. However, the code clearly has a number of contributions from others and there is no copyright assignment in place. While it’s likely not an issue from a licensing standpoint (IANAL, etc.), it is definitely something that needs to be addressed if the project is somehow ressurected, at the very least to give more prominent credit to contributors.
I’ve posted PRs for my work and tried to engage but I don’t have much hope given past history. I intend to spend at least some of my free time tinkering with this code since it’s just a very interesting project and there’s a lot that can be done. I am trawling the PRs and issue lists to look for patches that can be incorporated in my tree so if anyone is interested in contributing patches, you’re most welcome. I will continue to ensure that my tree applies on top of the official repository because I do not want to give up hope of the project coming back to life.
About 2.5 years ago I left Red Hat to join Linaro in a move that surprised even me for the first few months. I still work on the GNU toolchain with a glibc focus, but my focus changed considerably. I am no longer looking at the toolchain in its entirety (although I do that on my own time whenever I can, either as glibc release manager or reviewer); my focus is making glibc routines faster for one specific server microprocessor; no prizes for guessing which processor that is. I have read architecture manuals in the past to understand specific behaviours but this is the first time that I have had to pore through the entire manual and optimization guides and try and eek out the last cycle of performance from a chip.
This post is an attempt to document my learnings and make a high level guide of the various things me and my team looked at to improve performance of the toolchain. Note that my team is continuing to work on this chip (and I continue to learn new techniques, I may write about it later) so this ‘guide’ is more of a personal journey. I may add more follow ups or modify this post to reflect any changes in my understanding of this vast topic.
All of my examples use ARM64 assembly since that’s what I’ve been working on and translating the examples to something x86 would have discouraged me enough to not write this at all.
What am I optimizing for?
CPUs today are complicated beasts. Vulnerabilities like Spectre allude to how complicated CPU behaviour can get but in reality it can get a lot more complicated and there’s never really a universal solution to get the best out of them. Due to this, it is important to figure out what the end goal for the optimization is. For string functions for example, there are a number of different factors in play and there is no single set of behaviours that trumps over all others. For compilers in general, the number of such combinations is even higher. The solution often is to try and ensure that there is a balance and there are no exponentially worse behaviours.
The first line of defence for this is to ensure that the algorithm used for the routine does not exhibit exponential behaviour. I wrote about algorithmic changes I did to the multiple precision fallback implementation in glibc years ago elsewhere so I’m not going to repeat that. I will however state that the first line of attack to improve any function must be algorithmic. Thankfully barring strcmp, string routines in glibc had a fairly sound algorithmic base. strcmp fall back to a byte comparison when inputs are not mutually aligned, which is now fixed.
Large strings vs small
This is one question that gets asked very often in the context of string functions and different developers have different opinions on it, some differences even leading to flamewars in the past. One popular approach to ‘solving’ this is to quote usage of string functions in a popular benchmark and use that as a measuring stick. For a benchmark like CPU2006 or CPU2017, it means that you optimize for smaller strings because the number of calls to smaller strings is very high in those benchmarks. There are a few issues to that approach:
- These benchmarks use glibc routines for a very small fraction of time, so you’re not going to win a lot of performance in the benchmark by improving small string performance
- Small string operations have other factors affecting it a lot more, i.e. things like cache locality, branch predictor behaviour, prefether behaviour, etc. So while it might be fun to tweak behaviour exactly the way a CPU likes it, it may not end up resulting in the kind of gains you’re looking for
- A 10K string (in theory) takes at least 10 times more cycles than a 1K string, often more. So effectively, there is 10x more incentive to look at improving performance of larger strings than smaller ones.
- There are CPU features specifically catered for larger sequential string operations and utilizing those microarchitecture quirks will guarantee much better gains
- There are a significant number of use cases outside of these benchmarks that use glibc far more than the SPEC benchmarks. There’s no established set of benchmarks that represent them though.
I won’t conclude with a final answer for this because there is none. This is also why I had to revisit this question for every single routine I targeted, sometimes even before I decide to target it.
Cached or not?
This is another question that comes up for string routines and the answer is actually a spectrum - a string could be cached, not cached or partially cached. What’s the safe assumption then?
There is a bit more consensus on the answer to this question. It is generally considered safe to consider that shorter string accesses are cached and then focus on code scheduling and layout for its target code. If the string is not cached, the cost of getting it into cache far outweighs the savings through scheduling and hence it is pointless looking at that case. For larger strings, assuming that they’re cached does not make sense due to their size. As a result, the focus for such situations should be on ensuring that cache utilization is optimal. That is, make sure that the code aids all of the CPU units that populate caches, either through a hardware prefetcher or through judiciously placed software prefetch instructions or by avoiding caching altogether, thus avoiding evicting other hot data. Code scheduling, alignment, etc. is still important because more often than not you’ll have a hot loop that does the loads, compares, stores, etc. and once your stream is primed, you need to ensure that the loop is not suboptimal and runs without stalls.
My branch is more important than yours
Branch predictor units in CPUs are quite complicated and the compiler does not try to model them. Instead, it tries to do the simpler and more effective thing; make sure that the more probably branch target is accessible through sequential fetching. This is another aspect of the large strings vs small for string functions and more often than not, smaller sizes are assumed to be more probable for hand-written assembly because it seems to be that way in practice and also the cost of a mispredict hits the smaller size more than it does the larger one.
Don’t waste any part of a
CPUs today are complicated beasts. Yes I know I started the previous section with this exact same line; they’re complicated enough to bear repeating that. However, there is a bit of relief in the fact that the first principles of their design hasn’t changed much. The components of the CPU are all things we heard about in our CS class and the problem then reduces to understanding specific quirks of the processor core. At a very high level, there are three types of quirks you look for:
- Something the core does exceedingly well
- Something the core does very badly
- Something the core does very well or badly under specific conditions
Typically this is made easy by CPU vendors when they provide documentation that specifies a lot of this information. Then there are cases where you discover these behaviours through profiling. Oh yes, before I forget:
For example, the falkor core does something interesting with respect with loads and addressing modes. Typically, a load instruction would take a specific number of cycles to fetch from L1, more if memory is not cached, but that’s not relevant here. If you issue a load instruction with a pre/post-incrementing addressing mode, the microarchitecture issues two micro-instructions; one load and another that updates the base address. So:
ldr x1, [x2, 16]!
ldr x1, [x2, 16] add x2, x2, 16
and that increases the net cost of the load. While it saves us an instruction, this addressing mode isn’t always preferred in unrolled loops since you could avoid the base address increment at the end of every instruction and do that at the end. With falkor however, this operation is very fast and in most cases, this addressing mode is preferred for loads. The reason for this is the way its hardware prefetcher works.
A hardware prefetcher is a CPU unit that speculatively loads the memory location after the location requested, in an attempt to speed things up. This forms a memory stream and larger the string, the more its gains from prefetching. This however also means that in case of multiple prefetcher units in a core, one must ensure that the same prefetcher unit is hit so that the unit gets trained properly, i.e. knows what’s the next block to fetch. The way a prefetcher typically knows is if sees a consistent stride in memory access, i.e. it sees loads of X, X+16, X+32, etc. in a sequence.
On falkor the addressing mode plays an important role in determining which hardware prefetcher unit is hit by the load and effectively, a pre/post-incrementing load ensures that the loads hit the same prefetcher. That combined with a feature called register renaming ensures that it is much quicker to just fetch into the same virtual register and pre/post-increment the base address than to second-guess the CPU and try to outsmart it. The memcpy and memmove routines use this quirk extensively; comments in the falkor routines even have detailed comments explaining the basis of this behaviour.
Doing something so badly that it is easier to win
A colleague once said that the best targets for toolchain optimizations are CPUs that do things badly. There always is this one behaviour or set of behaviours that CPU designers decided to sacrifice to benefit other behaviours. On falkor for example, calling the MRS instruction for some registers is painfully slow whereas it is close to single cycle latency for most other processors. Simply avoiding such slow paths in itself could result in tremendous performance wins; this was evident with the memset function for falkor, which became twice as fast for medium sized strings.
Another example for this is in the compiler and not glibc, where the fact that using a ‘str’ instruction on 128-bit registers with register addressing mode is very slow on falkor. Simply avoiding that instruction altogether results in pretty good gains.
Both gcc and llvm allow you to specify a model of the CPU pipeline, i.e.
- The number of each type of unit the CPU has. That is, the number of load/store units, number of integer math units, number of FP units, etc.
- The latency for each type of instruction
- The number of micro-operations each instruction splits into
- The number of instructions the CPU can fetch/dispatch in a single cycle
and so on. This information is then used to sequence instructions in a function that it optimizes for. This may also help the compiler choose between instructions based on how long those take. For example, it may be cheaper to just declare a literal in the code and load from it than to construct a constant using mov/movk. Similarly, it could be cheaper to use csel to select a value to load to a register than to branch to a different piece of code that loads the register or vice versa.
Optimal instruction sequencing can often result in significant gains. For example, intespersing load and store instructions with unrelated arithmetic instructions could result in both those instructions executing in parallel, thus saving time. On the contrary, sequencing multiple load instructions back to back could result in other units being underutilized and all instructions being serialized on to the load unit. The pipeline model allows the compiler to make an optimal decision in this regard.
Vector unit - to use or not to use, that is the question
The vector unit is this temptress that promises to double your execution rate, but it doesn’t come without cost. The most important cost is that of moving data between general purpose and vector registers and quite often this may end up eating into your gains. The cost of the vector instructions themselves may be high, or a CPU might have multiple integer units and just one SIMD unit, because of which code may get a better schedule when executed on the integer units as opposed to via the vector unit.
I had seen an opposite example of this in powerpc years ago when I noticed that much of the integer operations were also implemented in FP in multiple precision math. This was because the original authors were from IBM and they had noticed a significant performance gain with that on powerpc (possible power7 or earlier given the timelines) because the CPU had 4 FP units!
This is really just the tip of the iceberg when it comes to performance optimization in toolchains and utilizing CPU quirks. There are more behaviours that could be exploited (such as aliasing behaviour in branch prediction or core topology) but the cost benefit of doing that is questionable.
Despite how much fun it is to hand-write assembly for such routines, the best approach is always to write simple enough code (yes, clever tricks might actually defeat compiler optimization passes!) that the compiler can optimize for you. If there are missed optimizations, improve compiler support for it. For glibc and aarch64, there is also the case of impending multiarch explosion. Due to the presence of multiple vendors, having a perfectly tuned routine for each vendor may pose code maintenance problems and also secondary issues with performance, like code layout in a binary and instruction cache utilization. There are some random ideas floating about for that already, like making separate text sections for vendor-specific code, but that’s something we would like to avoid doing if we can.
Since I joined Linaro back in 2015 around this time, my travel has gone up 3x with 2 Linaro Connects a year added to the one GNU Tools Cauldron. This year I went to FOSSAsia too, so it’s been a busy traveling year. The special thing about Cauldron though is that it is one of those conferences where I ‘work’ as well as have a lot of fun. The fun bit is because I get to meet all of the people that I work with almost every day in person and a lot of them have become great friends over the years.
I still remember the first Cauldron I went to in 2013 at Mountain View where I felt dwarfed by all of the giants I was sitting with. It was exaggerated because it was the first time I met the likes of Jeff Law, Richard Henderson, etc. in personal meetings since I had joined the Red Hat toolchain team just months before; it was intimidating and exciting all at once. That was also the first time I met Roland McGrath (I still hadn’t met Carlos, he had just had a baby and couldn’t come), someone I was terrified of back then because his patch reviews would be quite sharp and incisive. I had imagined him to be a grim old man hammering out those words from a stern laptop, so it was a surprise to see him use the same kinds of words but with a sarcastic smile, completely changing the context and tone. That was the first time I truly realized how emails often lack context. Years later, I still try to visualize people when I read their emails.
Skip to 4 years later and I was at my 5th Cauldron last week and despite my assumptions on how it would go, it was a completely new experience. A lot of it had to do with my time at Linaro and very little to do with technical growth. I felt like an equal to Linaro folks all over the world and I seemed to carry that forward here, where I felt like an equal with all of the people present, I felt like I belonged. I did not feel insecure about my capabilities (I still am intimately aware of my limitations), nor did I feel the need to constantly prove that I belonged. I was out there seeking toolchain developers (we are hiring btw, email me if you’re a fit), comfortable with the idea of leading a team. The fact that I managed to not screw up the two glibc releases I managed may also have helped :)
Oh, and one wonderful surprise was that an old friend decided to drop in an Cauldron and spend a couple of days.
This year’s Cauldron had the most technical talks submitted in recent years. We had 5 talks in the glibc area, possibly also the highest for us; just as well because we went over time in almost all of them. I won’t say that it’s a surprise since that has happened in every single year that I attended. The first glibc talk was about tunables where I briefly recapped what we have done in tunables so far and talked about the future a bit more at length. Pedro Alves suggested putting pretty printers for tunables for introspection and maybe also for runtime tuning in the coming future. There was a significant amount of interest in the idea of auto-tuning, i.e. collecting profiling data about tunable use and coming up with optimal default values and possibly even eliminating such tunables in future if we find that we have a pretty good default. We also talked about tuning at runtime and the various kinds of support that would be required to make it happen. Finally there were discussions on tuning profiles and ideas around creating performance-enhanced routines for workloads instead of CPUs. The video recording of the talk will hopefully be out soon and I’ll link the video here when it is available.
Florian then talked about glibc 3.0, a notional concept (i.e. won’t be a soname bump) where we rewrite sections of code that have been rotting due to having to support some legacy platforms. The most prominent among them is libio, the module in glibc that implements stdio. When libio was written, it was designed to be compatible with libstdc++ so that FILE streams could be compatible with C++ stdio streams. The only version of gcc that really supports that is 2.95 since libstdc++ has since moved on. However because of the way we do things in glibc, we cannot get rid of them even if there is just one user that needs that ABI. We toyed with the concept of a separate compatibility library that becomes a graveyard for such legacy interfaces so that they don’t hold up progress in the library. It remains to be seen how this pans out, but I would definitely be happy to see this progress; libio was one of my backlog projects for years. I had to miss Raji’s talk on powerpc glibc improvements since I had to be in another meeting, so I’ll have to catch it when the video comes out.
The two BoFs for glibc dealt with a number of administrative and development issues, details of which Carlos will post on the mailing list soon. The highlights for me were the malloc instrumented benchmarks that Carlos wants to add to benchtests and build and review tools. Once I clear up my work backlog a bit, I’ll attempt to set up something like phabricator or gerrit and see how that works out or the community instead of patchwork. I am convinced that all of the issues that we want to solve like crediting reviewers, ensuring good git commit logs, running automated builds and tests, etc. can only be effectively solved with a proper review tool in place to review patches.
There was also a discussion on redoing the makefiles in glibc so that it doesn’t spend so much time doing dependecy resolution, but I am going to pretend that it didn’t happen because it is an ugly ugly task :/
I’m back home now, recovering from the cold that worsened while I was in Prague before I head out again in a couple of weeks to SFO for Linaro Connect. I’ve booked tickets for whale watching tours there, so hopefully I’ll be posting some pictures again after a long break.