About 2.5 years ago I left Red Hat to join Linaro in a move that surprised even me for the first few months. I still work on the GNU toolchain with a glibc focus, but my focus changed considerably. I am no longer looking at the toolchain in its entirety (although I do that on my own time whenever I can, either as glibc release manager or reviewer); my focus is making glibc routines faster for one specific server microprocessor; no prizes for guessing which processor that is. I have read architecture manuals in the past to understand specific behaviours but this is the first time that I have had to pore through the entire manual and optimization guides and try and eek out the last cycle of performance from a chip.
This post is an attempt to document my learnings and make a high level guide of the various things me and my team looked at to improve performance of the toolchain. Note that my team is continuing to work on this chip (and I continue to learn new techniques, I may write about it later) so this ‘guide’ is more of a personal journey. I may add more follow ups or modify this post to reflect any changes in my understanding of this vast topic.
All of my examples use ARM64 assembly since that’s what I’ve been working on and translating the examples to something x86 would have discouraged me enough to not write this at all.
What am I optimizing for?
CPUs today are complicated beasts. Vulnerabilities like Spectre allude to how complicated CPU behaviour can get but in reality it can get a lot more complicated and there’s never really a universal solution to get the best out of them. Due to this, it is important to figure out what the end goal for the optimization is. For string functions for example, there are a number of different factors in play and there is no single set of behaviours that trumps over all others. For compilers in general, the number of such combinations is even higher. The solution often is to try and ensure that there is a balance and there are no exponentially worse behaviours.
The first line of defence for this is to ensure that the algorithm used for the routine does not exhibit exponential behaviour. I wrote about algorithmic changes I did to the multiple precision fallback implementation in glibc years ago elsewhere so I’m not going to repeat that. I will however state that the first line of attack to improve any function must be algorithmic. Thankfully barring strcmp, string routines in glibc had a fairly sound algorithmic base. strcmp fall back to a byte comparison when inputs are not mutually aligned, which is now fixed.
Large strings vs small
This is one question that gets asked very often in the context of string functions and different developers have different opinions on it, some differences even leading to flamewars in the past. One popular approach to ‘solving’ this is to quote usage of string functions in a popular benchmark and use that as a measuring stick. For a benchmark like CPU2006 or CPU2017, it means that you optimize for smaller strings because the number of calls to smaller strings is very high in those benchmarks. There are a few issues to that approach:
- These benchmarks use glibc routines for a very small fraction of time, so you’re not going to win a lot of performance in the benchmark by improving small string performance
- Small string operations have other factors affecting it a lot more, i.e. things like cache locality, branch predictor behaviour, prefether behaviour, etc. So while it might be fun to tweak behaviour exactly the way a CPU likes it, it may not end up resulting in the kind of gains you’re looking for
- A 10K string (in theory) takes at least 10 times more cycles than a 1K string, often more. So effectively, there is 10x more incentive to look at improving performance of larger strings than smaller ones.
- There are CPU features specifically catered for larger sequential string operations and utilizing those microarchitecture quirks will guarantee much better gains
- There are a significant number of use cases outside of these benchmarks that use glibc far more than the SPEC benchmarks. There’s no established set of benchmarks that represent them though.
I won’t conclude with a final answer for this because there is none. This is also why I had to revisit this question for every single routine I targeted, sometimes even before I decide to target it.
Cached or not?
This is another question that comes up for string routines and the answer is actually a spectrum - a string could be cached, not cached or partially cached. What’s the safe assumption then?
There is a bit more consensus on the answer to this question. It is generally considered safe to consider that shorter string accesses are cached and then focus on code scheduling and layout for its target code. If the string is not cached, the cost of getting it into cache far outweighs the savings through scheduling and hence it is pointless looking at that case. For larger strings, assuming that they’re cached does not make sense due to their size. As a result, the focus for such situations should be on ensuring that cache utilization is optimal. That is, make sure that the code aids all of the CPU units that populate caches, either through a hardware prefetcher or through judiciously placed software prefetch instructions or by avoiding caching altogether, thus avoiding evicting other hot data. Code scheduling, alignment, etc. is still important because more often than not you’ll have a hot loop that does the loads, compares, stores, etc. and once your stream is primed, you need to ensure that the loop is not suboptimal and runs without stalls.
My branch is more important than yours
Branch predictor units in CPUs are quite complicated and the compiler does not try to model them. Instead, it tries to do the simpler and more effective thing; make sure that the more probably branch target is accessible through sequential fetching. This is another aspect of the large strings vs small for string functions and more often than not, smaller sizes are assumed to be more probable for hand-written assembly because it seems to be that way in practice and also the cost of a mispredict hits the smaller size more than it does the larger one.
Don’t waste any part of a
CPUs today are complicated beasts. Yes I know I started the previous section with this exact same line; they’re complicated enough to bear repeating that. However, there is a bit of relief in the fact that the first principles of their design hasn’t changed much. The components of the CPU are all things we heard about in our CS class and the problem then reduces to understanding specific quirks of the processor core. At a very high level, there are three types of quirks you look for:
- Something the core does exceedingly well
- Something the core does very badly
- Something the core does very well or badly under specific conditions
Typically this is made easy by CPU vendors when they provide documentation that specifies a lot of this information. Then there are cases where you discover these behaviours through profiling. Oh yes, before I forget:
For example, the falkor core does something interesting with respect with loads and addressing modes. Typically, a load instruction would take a specific number of cycles to fetch from L1, more if memory is not cached, but that’s not relevant here. If you issue a load instruction with a pre/post-incrementing addressing mode, the microarchitecture issues two micro-instructions; one load and another that updates the base address. So:
ldr x1, [x2, 16]!
ldr x1, [x2, 16] add x2, x2, 16
and that increases the net cost of the load. While it saves us an instruction, this addressing mode isn’t always preferred in unrolled loops since you could avoid the base address increment at the end of every instruction and do that at the end. With falkor however, this operation is very fast and in most cases, this addressing mode is preferred for loads. The reason for this is the way its hardware prefetcher works.
A hardware prefetcher is a CPU unit that speculatively loads the memory location after the location requested, in an attempt to speed things up. This forms a memory stream and larger the string, the more its gains from prefetching. This however also means that in case of multiple prefetcher units in a core, one must ensure that the same prefetcher unit is hit so that the unit gets trained properly, i.e. knows what’s the next block to fetch. The way a prefetcher typically knows is if sees a consistent stride in memory access, i.e. it sees loads of X, X+16, X+32, etc. in a sequence.
On falkor the addressing mode plays an important role in determining which hardware prefetcher unit is hit by the load and effectively, a pre/post-incrementing load ensures that the loads hit the same prefetcher. That combined with a feature called register renaming ensures that it is much quicker to just fetch into the same virtual register and pre/post-increment the base address than to second-guess the CPU and try to outsmart it. The memcpy and memmove routines use this quirk extensively; comments in the falkor routines even have detailed comments explaining the basis of this behaviour.
Doing something so badly that it is easier to win
A colleague once said that the best targets for toolchain optimizations are CPUs that do things badly. There always is this one behaviour or set of behaviours that CPU designers decided to sacrifice to benefit other behaviours. On falkor for example, calling the MRS instruction for some registers is painfully slow whereas it is close to single cycle latency for most other processors. Simply avoiding such slow paths in itself could result in tremendous performance wins; this was evident with the memset function for falkor, which became twice as fast for medium sized strings.
Another example for this is in the compiler and not glibc, where the fact that using a ‘str’ instruction on 128-bit registers with register addressing mode is very slow on falkor. Simply avoiding that instruction altogether results in pretty good gains.
Both gcc and llvm allow you to specify a model of the CPU pipeline, i.e.
- The number of each type of unit the CPU has. That is, the number of load/store units, number of integer math units, number of FP units, etc.
- The latency for each type of instruction
- The number of micro-operations each instruction splits into
- The number of instructions the CPU can fetch/dispatch in a single cycle
and so on. This information is then used to sequence instructions in a function that it optimizes for. This may also help the compiler choose between instructions based on how long those take. For example, it may be cheaper to just declare a literal in the code and load from it than to construct a constant using mov/movk. Similarly, it could be cheaper to use csel to select a value to load to a register than to branch to a different piece of code that loads the register or vice versa.
Optimal instruction sequencing can often result in significant gains. For example, intespersing load and store instructions with unrelated arithmetic instructions could result in both those instructions executing in parallel, thus saving time. On the contrary, sequencing multiple load instructions back to back could result in other units being underutilized and all instructions being serialized on to the load unit. The pipeline model allows the compiler to make an optimal decision in this regard.
Vector unit - to use or not to use, that is the question
The vector unit is this temptress that promises to double your execution rate, but it doesn’t come without cost. The most important cost is that of moving data between general purpose and vector registers and quite often this may end up eating into your gains. The cost of the vector instructions themselves may be high, or a CPU might have multiple integer units and just one SIMD unit, because of which code may get a better schedule when executed on the integer units as opposed to via the vector unit.
I had seen an opposite example of this in powerpc years ago when I noticed that much of the integer operations were also implemented in FP in multiple precision math. This was because the original authors were from IBM and they had noticed a significant performance gain with that on powerpc (possible power7 or earlier given the timelines) because the CPU had 4 FP units!
This is really just the tip of the iceberg when it comes to performance optimization in toolchains and utilizing CPU quirks. There are more behaviours that could be exploited (such as aliasing behaviour in branch prediction or core topology) but the cost benefit of doing that is questionable.
Despite how much fun it is to hand-write assembly for such routines, the best approach is always to write simple enough code (yes, clever tricks might actually defeat compiler optimization passes!) that the compiler can optimize for you. If there are missed optimizations, improve compiler support for it. For glibc and aarch64, there is also the case of impending multiarch explosion. Due to the presence of multiple vendors, having a perfectly tuned routine for each vendor may pose code maintenance problems and also secondary issues with performance, like code layout in a binary and instruction cache utilization. There are some random ideas floating about for that already, like making separate text sections for vendor-specific code, but that’s something we would like to avoid doing if we can.
Those of you tuned in to the wonderful world of system programming may have noticed that glibc 2.26 was released last night (or daytime if you live west of me or middle of the night/dawn if you live east of me, well you get the drift) and it came out with a host of new improvements, including the much awaited thread cache for malloc. The thread cache for malloc is truly a great step forward - it brings down latency of a bulk of allocations from hundreds of cycles to tens of cycles. The other major improvement that a bulk of users and developers will notice is the fact that glibc now detects when resolv.conf has changed and reloads the lookup configuration. Yes, this was long overdue but hey, it’s not like we were refusing patches for the past half a decade, so thank the nice soul (Florian Weimer) who actually got it done in the end.
We are not here to talk about the improvements mentioned in the NEWS. We are here to talk about an improvement that will likely have a long term impact on how optimizations are implemented in libraries. We are here to talk about…
Yes, I’m back with tunables, but this time I am not the one who did the work, it’s the wonderful people from Cavium and Intel who have started using tunables for a use case I had alluded to in my talk at Linaro Connect BKK 2016 and also in my previous blog post on tunables, which was the ability to influence IFUNCs.
IFUNCs? International functions? Intricate Functions? Impossibly ridiculous Functions?
There is a short introduction of the GNU Indirect Functions on the glibc wiki that should help you get started on this very powerful yet very complicated concept. In short, ifuncs extend the GOT/PLT mechanism of loading functions from dynamic libraries to loading different implementations of the same function depending on some simple selection criteria. Traditionally this has been based on querying the CPU for features that it supports and as a result we have had multiple variants of some very common functions such as memcpy_sse2 and memcpy_ssse3 for x86 processors that get executed based on the support declared by the processor the program is running on.
Tunables allow you to take this idea further because there are two ways to get performance benefits, (1) by utilizing all of the CPU features that help and (2) by catering to the workload. For example, you could have a workload that performs better with a supposedly sub-optimal memcpy variant for the CPU purely because of the way your data is structured or laid out. Tunables allow you to select that routine by pretending that the CPU has a different set of capabilities than it actually reports, by setting the glibc.tune.hwcaps tunable on x86 processors. Not only that, you can even tune cache sizes and non-temporal thresholds (i.e. threshold beyond which some routines use non-temporal instructions for loads and stores to optimize cache usage) to suit your workload. I won’t be surprised if some years down the line we see specialized implementations of these routines that cater to specific workloads, like memcpy_db for databases or memset_paranoid for a time invariant (or mostly invariant) implementation of memset.
Here’s where another very important feature landed in glibc 2.26: multiarch support in aarch64. The ARMv8 spec is pretty standard and as a result the high level instruction set and feature set of vendor chips is pretty much the same with some minor trivial differences. However, even though the spec is standard, the underlying microarchitecture implementation could be very different and that meant that selection of instructions and scheduling differences could lead to sometimes very significant differences in performance and vendors obviously would like to take advantage of that.
The only way they could reliably (well, kind of, there should be a whole blog post for this) identify their processor variant (and hence deploy routines for their processors) was by reading the machine identification register or MIDR_EL1. If you’re familiar with aarch64 registers, you’ll notice that this register cannot be read by userspace, it can only be read by the kernel. The kernel thus had to trap and emulate this instruction, support for which is now available since Linux 4.11. In glibc 2.26, we now use MIDR_EL1 to identify which vendor processor the program is running on and deploy an optimal routine (in this case for the Cavium thunderxt88) for the processor.
But wait, what about earlier kernels, how do they take advantage of this? There’s a tunable for it! There’s glibc.tune.cpu for aarch64 that allows you to select the CPU variant you want to emulate. For some workloads you’ll find the generic memcpy actually works better and the tunable allows you to select that as well.
Finally due to tunables, the much needed cleanup of LD_HWCAP_MASK happened, giving rise to the tunable glibc.tune.hwcap_mask. Tunables also eliminated a lot of the inconsistency in environment variable behaviour due to the way static and dynamic executables are initialized, so you’ll see much less differences in the way your applications behave when they’re built dynamically vs when they’re built statically.
Wow, that sounds good, where do I sign up for your newsletter?
The full list of hardware capability tunables are documented in the glibc manual so take a look and feel free to hop on to the libc-help mailing list to discuss these tunables and suggest more ways in which you would like to tune the library for your workload. Remember that tunables don’t have any ABI/API guarantees for now, so they can be added or removed between releases as we deem fit. Also, your distribution may end up adding their own tunables too in future, so look out for those as well. Finally, system level tunables coming up real soon to allow system administrators to control how users use these tunables.