Tunables story continued - glibc 2.26
Those of you tuned in to the wonderful world of system programming may have noticed that glibc 2.26 was released last night (or daytime if you live west of me or middle of the night/dawn if you live east of me, well you get the drift) and it came out with a host of new improvements, including the much awaited thread cache for malloc. The thread cache for malloc is truly a great step forward - it brings down latency of a bulk of allocations from hundreds of cycles to tens of cycles. The other major improvement that a bulk of users and developers will notice is the fact that glibc now detects when resolv.conf has changed and reloads the lookup configuration. Yes, this was long overdue but hey, it’s not like we were refusing patches for the past half a decade, so thank the nice soul (Florian Weimer) who actually got it done in the end.
We are not here to talk about the improvements mentioned in the NEWS. We are here to talk about an improvement that will likely have a long term impact on how optimizations are implemented in libraries. We are here to talk about…
Yes, I’m back with tunables, but this time I am not the one who did the work, it’s the wonderful people from Cavium and Intel who have started using tunables for a use case I had alluded to in my talk at Linaro Connect BKK 2016 and also in my previous blog post on tunables, which was the ability to influence IFUNCs.
IFUNCs? International functions? Intricate Functions? Impossibly ridiculous Functions?
There is a short introduction of the GNU Indirect Functions on the glibc wiki that should help you get started on this very powerful yet very complicated concept. In short, ifuncs extend the GOT/PLT mechanism of loading functions from dynamic libraries to loading different implementations of the same function depending on some simple selection criteria. Traditionally this has been based on querying the CPU for features that it supports and as a result we have had multiple variants of some very common functions such as memcpy_sse2 and memcpy_ssse3 for x86 processors that get executed based on the support declared by the processor the program is running on.
Tunables allow you to take this idea further because there are two ways to get performance benefits, (1) by utilizing all of the CPU features that help and (2) by catering to the workload. For example, you could have a workload that performs better with a supposedly sub-optimal memcpy variant for the CPU purely because of the way your data is structured or laid out. Tunables allow you to select that routine by pretending that the CPU has a different set of capabilities than it actually reports, by setting the glibc.tune.hwcaps tunable on x86 processors. Not only that, you can even tune cache sizes and non-temporal thresholds (i.e. threshold beyond which some routines use non-temporal instructions for loads and stores to optimize cache usage) to suit your workload. I won’t be surprised if some years down the line we see specialized implementations of these routines that cater to specific workloads, like memcpy_db for databases or memset_paranoid for a time invariant (or mostly invariant) implementation of memset.
Here’s where another very important feature landed in glibc 2.26: multiarch support in aarch64. The ARMv8 spec is pretty standard and as a result the high level instruction set and feature set of vendor chips is pretty much the same with some minor trivial differences. However, even though the spec is standard, the underlying microarchitecture implementation could be very different and that meant that selection of instructions and scheduling differences could lead to sometimes very significant differences in performance and vendors obviously would like to take advantage of that.
The only way they could reliably (well, kind of, there should be a whole blog post for this) identify their processor variant (and hence deploy routines for their processors) was by reading the machine identification register or MIDR_EL1. If you’re familiar with aarch64 registers, you’ll notice that this register cannot be read by userspace, it can only be read by the kernel. The kernel thus had to trap and emulate this instruction, support for which is now available since Linux 4.11. In glibc 2.26, we now use MIDR_EL1 to identify which vendor processor the program is running on and deploy an optimal routine (in this case for the Cavium thunderxt88) for the processor.
But wait, what about earlier kernels, how do they take advantage of this? There’s a tunable for it! There’s glibc.tune.cpu for aarch64 that allows you to select the CPU variant you want to emulate. For some workloads you’ll find the generic memcpy actually works better and the tunable allows you to select that as well.
Finally due to tunables, the much needed cleanup of LD_HWCAP_MASK happened, giving rise to the tunable glibc.tune.hwcap_mask. Tunables also eliminated a lot of the inconsistency in environment variable behaviour due to the way static and dynamic executables are initialized, so you’ll see much less differences in the way your applications behave when they’re built dynamically vs when they’re built statically.
Wow, that sounds good, where do I sign up for your newsletter?
The full list of hardware capability tunables are documented in the glibc manual so take a look and feel free to hop on to the libc-help mailing list to discuss these tunables and suggest more ways in which you would like to tune the library for your workload. Remember that tunables don’t have any ABI/API guarantees for now, so they can be added or removed between releases as we deem fit. Also, your distribution may end up adding their own tunables too in future, so look out for those as well. Finally, system level tunables coming up real soon to allow system administrators to control how users use these tunables.