Those of you tuned in to the wonderful world of system programming may have noticed that glibc 2.26 was released last night (or daytime if you live west of me, or middle of the night/dawn if you live east of me; well, you get the drift) and it came out with a host of new improvements, including the much awaited thread cache for malloc. The thread cache for malloc is truly a great step forward - it brings the latency of the bulk of allocations down from hundreds of cycles to tens of cycles. The other major improvement that most users and developers will notice is that glibc now detects when resolv.conf has changed and reloads the lookup configuration. Yes, this was long overdue but hey, it’s not like we were refusing patches for the past half a decade, so thank the nice soul (Florian Weimer) who actually got it done in the end.
We are not here to talk about the improvements mentioned in the NEWS. We are here to talk about an improvement that will likely have a long term impact on how optimizations are implemented in libraries. We are here to talk about…
Yes, I’m back with tunables, but this time I am not the one who did the work, it’s the wonderful people from Cavium and Intel who have started using tunables for a use case I had alluded to in my talk at Linaro Connect BKK 2016 and also in my previous blog post on tunables, which was the ability to influence IFUNCs.
IFUNCs? International functions? Intricate Functions? Impossibly ridiculous Functions?
There is a short introduction to GNU Indirect Functions on the glibc wiki that should help you get started on this very powerful yet very complicated concept. In short, ifuncs extend the GOT/PLT mechanism of loading functions from dynamic libraries to loading different implementations of the same function depending on some simple selection criteria. Traditionally this has been based on querying the CPU for the features it supports, and as a result we have had multiple variants of some very common functions, such as memcpy_sse2 and memcpy_ssse3 for x86 processors, that get executed based on the support declared by the processor the program is running on.
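To make the selection idea a bit more concrete, here is a minimal sketch in plain C. It does not use the real IFUNC machinery (that needs toolchain and dynamic linker support); a function pointer filled in by a resolver stands in for it. All of the names (my_memcpy, my_memcpy_generic, my_memcpy_wide, resolve_memcpy, cpu_has_wide_copy) are made up for illustration, and the feature check is a stub rather than a real CPU query.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Signature shared by every variant of the routine. */
typedef void *(*memcpy_fn)(void *, const void *, size_t);

/* Hypothetical baseline variant: a plain byte-by-byte copy. */
static void *my_memcpy_generic(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}

/* Hypothetical "optimized" variant. It just defers to the C library here;
   a real variant would use wider or more specialized instructions. */
static void *my_memcpy_wide(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* Stub standing in for a CPU feature query (an assumption of this sketch). */
static int cpu_has_wide_copy = 0;

/* The resolver: decides which variant callers should get. */
static memcpy_fn resolve_memcpy(void)
{
    return cpu_has_wide_copy ? my_memcpy_wide : my_memcpy_generic;
}

/* Cached result of the resolver, filled in on first use. */
static memcpy_fn selected_memcpy;

/* Public entry point: resolves once, then always calls the chosen variant. */
void *my_memcpy(void *dst, const void *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    if (selected_memcpy == NULL)
        selected_memcpy = resolve_memcpy();
    return selected_memcpy(dst, src, n);
}
```

With real ifuncs the resolver runs while the symbol is being relocated rather than on the first call, so the call site pays no extra check; the sketch only shows the "pick one implementation up front" shape.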
Tunables allow you to take this idea further because there are two ways to get performance benefits, (1) by utilizing all of the CPU features that help and (2) by catering to the workload. For example, you could have a workload that performs better with a supposedly sub-optimal memcpy variant for the CPU purely because of the way your data is structured or laid out. Tunables allow you to select that routine by pretending that the CPU has a different set of capabilities than it actually reports, by setting the glibc.tune.hwcaps tunable on x86 processors. Not only that, you can even tune cache sizes and non-temporal thresholds (i.e. threshold beyond which some routines use non-temporal instructions for loads and stores to optimize cache usage) to suit your workload. I won’t be surprised if some years down the line we see specialized implementations of these routines that cater to specific workloads, like memcpy_db for databases or memset_paranoid for a time invariant (or mostly invariant) implementation of memset.
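Here is a rough sketch of what "pretending the CPU has different capabilities" amounts to. The manual describes the hwcaps value as a comma-separated list where a leading minus disables a feature and a bare name enables it; the code below applies such a list to a bitmask. The feature bits, the name table and apply_hwcaps_override are all hypothetical and are not how glibc stores its CPU features.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical feature bits, for illustration only. */
enum {
    FEAT_SSE2  = 1u << 0,
    FEAT_SSSE3 = 1u << 1,
    FEAT_AVX2  = 1u << 2
};

struct feature_name {
    const char *name;
    unsigned int bit;
};

static const struct feature_name feature_names[] = {
    { "SSE2",  FEAT_SSE2  },
    { "SSSE3", FEAT_SSSE3 },
    { "AVX2",  FEAT_AVX2  }
};

/* Return the bit for a (case-sensitive) name of length len, or 0 if unknown. */
static unsigned int lookup_feature(const char *name, size_t len)
{
    size_t count = sizeof feature_names / sizeof feature_names[0];
    for (size_t i = 0; i < count; i++)
        if (strlen(feature_names[i].name) == len
            && strncmp(feature_names[i].name, name, len) == 0)
            return feature_names[i].bit;
    return 0;
}

/* Apply a list such as "-AVX2,SSSE3" to the detected capabilities:
   "-name" clears a bit, "name" sets it, unknown names are ignored. */
unsigned int apply_hwcaps_override(unsigned int detected, const char *spec)
{
    assert(spec != NULL);
    unsigned int caps = detected;
    const char *p = spec;
    while (*p != '\0') {
        size_t len = strcspn(p, ",");
        int disable = (len > 0 && p[0] == '-');
        const char *name = disable ? p + 1 : p;
        size_t name_len = disable ? len - 1 : len;
        unsigned int bit = lookup_feature(name, name_len);
        if (disable)
            caps &= ~bit;
        else
            caps |= bit;
        p += len;
        if (*p == ',')
            p++;
    }
    return caps;
}
```

The routine selection then runs against the overridden mask instead of what the CPU reported, which is how a "sub-optimal" variant can be forced for a workload that happens to like it.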
Here’s where another very important feature landed in glibc 2.26: multiarch support in aarch64. The ARMv8 spec is pretty standard and as a result the high level instruction set and feature set of vendor chips is pretty much the same, with some minor differences. However, even though the spec is standard, the underlying microarchitecture implementations can be very different, which means that instruction selection and scheduling can lead to sometimes very significant differences in performance, and vendors obviously would like to take advantage of that.
The only way they could reliably (well, kind of, there should be a whole blog post for this) identify their processor variant (and hence deploy routines for their processors) was by reading the machine identification register or MIDR_EL1. If you’re familiar with aarch64 registers, you’ll notice that this register cannot be read by userspace; it can only be read by the kernel. The kernel thus had to trap and emulate the instruction that reads it, support for which has been available since Linux 4.11. In glibc 2.26, we now use MIDR_EL1 to identify which vendor processor the program is running on and deploy an optimal routine (in this case for the Cavium thunderxt88) for the processor.
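For the curious, here is a small sketch of how a MIDR_EL1 value breaks down into its fields once you have it. It only decodes a number passed in, so it runs anywhere; it does not read the register. The struct and function names are mine, and the Cavium implementer code (0x43) and ThunderX T88 part number (0x0a1) are the values as I understand them, so check them against the vendor documentation before relying on them.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of MIDR_EL1 as laid out in the ARMv8 architecture. */
struct midr_fields {
    unsigned int implementer;   /* bits 31..24 */
    unsigned int variant;       /* bits 23..20 */
    unsigned int architecture;  /* bits 19..16 */
    unsigned int part_num;      /* bits 15..4  */
    unsigned int revision;      /* bits  3..0  */
};

/* Split a MIDR_EL1 value into its fields. */
struct midr_fields decode_midr(uint64_t midr)
{
    /* This sketch only handles the low 32 bits of the register. */
    assert((midr >> 32) == 0);

    struct midr_fields f;
    f.implementer  = (unsigned int)((midr >> 24) & 0xff);
    f.variant      = (unsigned int)((midr >> 20) & 0xf);
    f.architecture = (unsigned int)((midr >> 16) & 0xf);
    f.part_num     = (unsigned int)((midr >> 4) & 0xfff);
    f.revision     = (unsigned int)(midr & 0xf);
    return f;
}

/* Assumed identification values: implementer 0x43 ('C') with part 0x0a1. */
int is_thunderx_t88(uint64_t midr)
{
    struct midr_fields f = decode_midr(midr);
    return f.implementer == 0x43 && f.part_num == 0x0a1;
}
```

A resolver can then pick the vendor-specific routine when the check matches and fall back to the generic one otherwise.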
But wait, what about earlier kernels, how do they take advantage of this? There’s a tunable for it! There’s glibc.tune.cpu for aarch64 that allows you to select the CPU variant you want to emulate. For some workloads you’ll find the generic memcpy actually works better and the tunable allows you to select that as well.
Finally, due to tunables, the much needed cleanup of LD_HWCAP_MASK happened, giving rise to the tunable glibc.tune.hwcap_mask. Tunables also eliminated a lot of the inconsistency in environment variable behaviour due to the way static and dynamic executables are initialized, so you’ll see far fewer differences in the way your applications behave when they’re built dynamically vs when they’re built statically.
Wow, that sounds good, where do I sign up for your newsletter?
The full list of hardware capability tunables is documented in the glibc manual, so take a look and feel free to hop on to the libc-help mailing list to discuss these tunables and suggest more ways in which you would like to tune the library for your workload. Remember that tunables don’t have any ABI/API guarantees for now, so they can be added or removed between releases as we deem fit. Also, your distribution may end up adding its own tunables in future, so look out for those as well. Finally, system level tunables are coming up real soon to allow system administrators to control how users use these tunables.
This is long overdue and I have finally got around to writing this. Apologies to everyone who asked me to write about it and I responded with "Oh yeah, right away!" If you are not interested in the story bits, start with "So what are tunables anyway?" below.
The story of tunables began in 2013 when I was a relatively fresh glibc engineer in the Red Hat toolchain team. We wanted to add an environment variable to allow users to set the default stack sizes for thread stacks and Carlos took that idea to the next level with the question: How do we make this more extensible so that we have full control over the kind of tuning parameters we accept in glibc but at the same time, allow distributions to add their own tuning parameters without affecting upstream code? He asked this question in the 2013 Cauldron in Mountain View, where the famous glibc BoF happened in a tiny meeting room which overflowed into an adjacent room, which also filled up quickly, and then the BoF overran its 45 minute slot by roughly a couple of hours! Carlos joined the BoF over Hangout (I think it was called Google Talk then) because he couldn’t make it and we had a lengthy back and forth about the pros and cons of having such tuning parameters. In principle, everybody agreed that such a thing would be desirable from a maintenance perspective. However the approach for doing it was something nobody seemed to agree on.
Thus the idea of tunables was born 4 years ago, except that Carlos wrote the first wiki page and called it ‘tunnables’. He consistently spelled it tunnables and I tunables. I won in the end because I wrote the patches ;)
Jokes aside, we were happy about the reception of the idea and we went about documenting it at length. However, given that we were a two-man army manning the glibc bunkers in Red Hat, and that upstream was still reviving itself from the post-Uli era, we would not come back to it for a while.
Then 2015 happened and it came with a memorable Cauldron in Prague. It was memorable because by then I had come up with a first draft of an API for the tunables framework. It was also memorable because it was my last month at Red Hat, something I never imagined would ever happen. I was leaving my dream team and I wasn’t sure if I would ever be as happy again. Those uncertainties were unfounded as I know now, but that’s a story for another post.
The struggle to write code
The first draft I presented at Cauldron in 2015 was really just a naive attempt at storing and initializing public values accessed across libraries in glibc and we had not even thought through everything we would end up fixing with tunables. It kinda worked, but it was never going to make the cut. A new employer meant that tunables would become a weekend project and as a result it missed the release deadline. And another, and then another. Towards the close of every release I would whip out a patchset that would have holes poked into it and then the change would be considered too risky to include.
Finally we set a deadline of 2.25 for tunables because by then quite a few devs had started maintaining their own list of tunables on top of my tree, frustratingly rebasing every time I completely changed my approach. We made it in the end, with Florian and I working through the year end holidays to get the whole patchset in before freeze.
So as of 2.25, tunables is firmly entrenched into glibc and as we speak, there are more tunables to come, especially to override IFUNC selections and to tune the processor capability mask.
So what are tunables anyway?
This is where you start if you want the technical description and are not interested in the story bits.
Tunables is an internal implementation detail in glibc. It is a way to manage how we allow behaviour in glibc to be modified. As of now the only way to tune glibc is via environment variables, and the code to do that is strewn all over the place in the source code. Tunables provide one place to add the tunable parameter with all of the characteristics it would have and then the framework will handle everything from there. The user of that tunable (e.g. malloc for MALLOC_MMAP_THRESHOLD_ or malloc.mmap.threshold in tunables parlance) would then simply access the tunable from the list and do what it wants to do, without bothering about where it came from.
The framework is implemented in elf/dl-tunables.c and all of the supporting code is named as elf/dl-tunable*. As is evident, tunables is linked into the dynamic linker, where it is initialized very early. In static binaries, the initialization is done in libc-start.c, again early enough to influence almost everything in the program. The list is initialized just once and is modifiable only in the dynamic linker before it relocates itself.
The main list of tunables is maintained in elf/dl-tunables.list. Architectures may define their own tunables in sysdeps/…/dl-tunables.list. There is a README.tunables that lists out the gory details of using tunables within glibc to access its values and if necessary, update it.
This gives us a number of advantages, some of them being the following:
All environment variables used by glibc would be read in by a single double-nested loop which initializes all tunables. Accesses are then just a GOT away, so no more getenv loops in glibc code. This is not achieved yet since not all of the environment variables have been ported to tunables (Hint: here’s a nice project for you, you aspiring glibc developer!)
All tunables are listed in a single file
The file elf/dl-tunables.list has a full list of tunables along with their properties such as type, value range, default value and behaviour with setuid binaries. This caused us to introspect on each environment variable we ported into tunables and we ended up fixing a few bugs as well.
Very Early Initialization
Yes, very early, earlier than you would imagine, earlier than IFUNCs! *gasp*
Tunables get initialized very early so that they can influence almost every behaviour in glibc. The unreleased 2.26 makes this even earlier (or rather, delays CPU features initialization enough) so that tunables can impact selection of routines using IFUNCs. This fixes an important inconsistency in glibc, where LD_HWCAP_MASK was read in dynamically linked binaries but not in static binaries because it was not read in early enough.
The tunable list is read-only, so glibc reads from a list that cannot be tampered with by malicious code that gets loaded after relocation.
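The "one list, one double-nested loop" idea described above can be sketched roughly like this. It is a simplified illustration and not the code in elf/dl-tunables.c: the struct layout, the function names (tunables_init, tunable_find, parse_long) and the ranges and defaults in the table are assumptions made for the example. Only the tunable names and their legacy environment variables are real.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified description of one tunable. */
struct tunable {
    const char *name;       /* name in tunables parlance */
    const char *env_alias;  /* legacy environment variable */
    long min;               /* smallest accepted value */
    long max;               /* largest accepted value */
    long value;             /* starts out as the default */
    int allow_setuid;       /* may a setuid program honour it? */
};

/* The single list; ranges and defaults here are illustrative only. */
static struct tunable tunable_list[] = {
    { "glibc.malloc.mmap_threshold", "MALLOC_MMAP_THRESHOLD_", 0, LONG_MAX, 131072, 0 },
    { "glibc.malloc.top_pad",        "MALLOC_TOP_PAD_",        0, LONG_MAX, 0,      0 },
};

#define TUNABLE_COUNT (sizeof tunable_list / sizeof tunable_list[0])

/* Parse a complete number; return 1 on success, 0 on any error. */
static int parse_long(const char *text, long *out)
{
    char *end;
    errno = 0;
    long v = strtol(text, &end, 0);
    if (errno != 0 || end == text || *end != '\0')
        return 0;
    *out = v;
    return 1;
}

/* Outer loop over the environment, inner loop over the tunable list. */
void tunables_init(char *const envp[], int is_setuid)
{
    assert(envp != NULL);
    for (size_t e = 0; envp[e] != NULL; e++) {
        const char *eq = strchr(envp[e], '=');
        if (eq == NULL)
            continue;
        size_t name_len = (size_t)(eq - envp[e]);
        for (size_t t = 0; t < TUNABLE_COUNT; t++) {
            struct tunable *tun = &tunable_list[t];
            long v;
            if (strlen(tun->env_alias) != name_len
                || strncmp(tun->env_alias, envp[e], name_len) != 0)
                continue;
            if (is_setuid && !tun->allow_setuid)
                continue;  /* ignore the variable for setuid programs */
            if (parse_long(eq + 1, &v) && v >= tun->min && v <= tun->max)
                tun->value = v;  /* out-of-range values keep the default */
        }
    }
}

/* Users of a tunable just look it up; no getenv at the point of use. */
const struct tunable *tunable_find(const char *name)
{
    assert(name != NULL);
    for (size_t t = 0; t < TUNABLE_COUNT; t++)
        if (strcmp(tunable_list[t].name, name) == 0)
            return &tunable_list[t];
    return NULL;
}
```

The sketch leaves out the read-only part: it keeps the list writable, whereas the point made above is that the real list can no longer be modified once the dynamic linker has relocated itself.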
What changes for me as a user?
The change in 2.25 is minimal enough that you won’t notice. In this release, only the malloc tuning environment variables have been ported to tunables and if you’ve been using those environment variables before, they will continue to work even now. In addition, you get to tune these parameters in a fancy way that doesn’t require the stupid trailing underscore, using the GLIBC_TUNABLES environment variable. The manual describes it extensively so I won’t go into details.
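As a quick illustration of the format: GLIBC_TUNABLES takes a colon-separated list of name=value pairs, for example glibc.malloc.mmap_threshold=4096:glibc.malloc.top_pad=0. The sketch below pulls one value out of such a string. The function name tunables_string_lookup and its interface are made up for this example, and it is not how glibc parses the variable internally.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Look up `name` in a "name=value:name=value" string.
   On success copy the value into out (NUL-terminated) and return 1.
   Return 0 if the name is absent or the value does not fit in out. */
int tunables_string_lookup(const char *spec, const char *name,
                           char *out, size_t out_size)
{
    assert(spec != NULL && name != NULL && out != NULL && out_size > 0);
    size_t name_len = strlen(name);
    const char *p = spec;
    while (*p != '\0') {
        /* One item runs up to the next ':' or the end of the string. */
        size_t item_len = strcspn(p, ":");
        const char *eq = memchr(p, '=', item_len);
        if (eq != NULL && (size_t)(eq - p) == name_len
            && strncmp(p, name, name_len) == 0) {
            size_t value_len = item_len - name_len - 1;
            if (value_len >= out_size)
                return 0;
            memcpy(out, eq + 1, value_len);
            out[value_len] = '\0';
            return 1;
        }
        p += item_len;
        if (*p == ':')
            p++;
    }
    return 0;
}
```

Compare that with the old way, where each parameter needed its own environment variable, trailing underscore and all.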
The major change is about to happen now. Intel is starting to push a number of tunables to allow you to tune your library to your liking, changing things like string routines that get selected for your program, cache parameters, etc. I believe PowerPC and S390 will see something similar too in the lock elision space and aarch64 multiarch will be tunable as well. All of this will hopefully come in 2.26, or by 2.27 at the latest.
One thing to note though is that for now tunables are not covered by any ABI or API guarantees. That is to say, if you like a tunable that is in 2.26, we may well remove the tunable in 2.27 if we find that it either does not make sense to have that tunable exposed or exposing that tunable is somehow detrimental to user programs.
The big difference will likely come in when distributions start adding their own tunables into the mix, since it will allow them to add customizations to the library without having to maintain huge ugly patchsets.
The Road Ahead
The big advantage of collecting all tuning parameters under a single framework is the ability to then add new ways to influence those tuning parameters. We have environment variables now, but we could add other methods to tune the library. Some ideas discussed are as follows:
- Have a systemwide configuration file (e.g. /etc/sysctl.user.conf) that sets different defaults for some tunables and limits the degree to which specific tunables are altered. This allows systems administrators to have more fine grained control over the processes on their system
- Have user-specific configuration files (e.g. $HOME/.sysctl.user.conf) that do something similar but at a user level
- Have some tunables modified during execution via some shared memory mechanism
All of this is still evolving, so if you have an idea or would like to work on any of these ideas, feel free to get in touch with me and we can find a way to get you contributing to one of the most critical parts of the operating system!