GNU Tools Cauldron 2016, ARMv8 multi-arch edition

by siddhesh on Sep 28, 2016, 11:49

Worst planned trip ever.

That is what my England trip for the GNU Tools Cauldron was, but that only seemed to add to the pleasure of meeting friends again. I flewin to Heathrow and started on an almost long train journey to Halifax,with two train changes from Reading. I forgot my phone on the trainbut the friendly station manager at Halifax helped track it down andgot it back to me. That was the first of the many times I forgotstuff in a variety of places during this trip. Like I discovered thatI forgot to carry a jacket or an umbrella. Or shorts. Or full lengthpants for that matter. Like I purchased an umbrella from Sainsbury’s but forgot to carry it out. I guess you got the drift of it.

All that mess aside, the conference itself was wonderful as usual. My main point of interest at the Cauldron this time was to try and make progress on discussions around multi-arch support for ARMv8. I have never talked about this in my blog the past, so a brief introduction is in order.

What is multi-arch?

Processors evolve over time and introduce features that can be exploited by the C library to do work faster, like using the vectori SIMD unit to do memory copies and manipulation faster. However, this is at odds with the goal of the C library to be able to run on all hardware, including those that may not have a vector unit or may not have that specific type of vector unit (e.g. have SSE4 but not AVX512 on x86). To solve this problem, we exploit the concept of PLT and dynamic linking.

I thought we were talking about multiarch, what’s a PLT now?

When a program calls a function in a library that it links to dynamically (i.e. only the reference of the library and the function are present in the binary, not the function implementation), it makes the call via an indirect reference (aka a trampoline) within thebinary because it cannot know where the function entry point in another library resides in memory. The trampoline uses a table (called the Procedure Linkage Table, PLT for short) to then jump to the final location, which is the entry point of the function.

In the beginning, the entry point is set as a function in the dynamic linker (lets call it the resolver function), which then looks for the function name in libraries that the program links to and then updates the table with the result. The dynamic linker resolver function can do more than just look for the exact function name in the libraries the function links to and that is where the concept of Indirect Functions or IFUNCs come into the picture.

Further down the rabbit hole - what’s an IFUNC?

When the resolver function finds the function symbol in a library, it looks at the type of the function before simply patching the PLT with its address. If it finds that the function is an IFUNC type (lets call it the IFUNC resolver), it knows that executing that function will give the actual address of the function it should patch into the PLT. This is a very powerful idea because it now allows us to have multiple implementations of the same function built into the library for different features and then have the IFUNC resolver study its execution environment and return the address of the most appropriate function. This is fundamentally how multiarch is implemented in glibc, where we have multiple implementations of functions like memcpy, each utilizing different features, like AVX, AVX2, SSE4 and so on. The IFUNC resolver for memcpy then queries the CPU to find the features it supports and then returns the address of the implementation best suited to the processor.

… and we’re back! Multi-arch for ARMv8

ARMv8 has been making good progress in terms of adoption and it is clear that ARM servers are going to form a significant portion of datacenters of the future. That said, major vendors of such servers with architecture licenses are trying to differentiate by innovating onthe microarchitecture level. This means that a sequence of instructions may not necessarily have the same execution cost on all processors. This gives an opportunity for vendors to write optimal code sequences for key function implementations (string functions for example) for their processors and have them included in the C library. They can use the IFUNC mechanism to then identify their processors and then launch the routine best suited for their processor implementation.

This is all great, except that they can’t identify their processors reliably with the current state of the kernel and glibc. The way to identify a vendor processor is to read the MIDR_EL1 and REVIDR_EL1 registers using the MSR instruction. As the register name suggests, they are readable only in exception level 1, i.e. by the kernel, which makes it impossible for glibc to directly read this, unlike on Intel processors where the CPUID instruction is executable in userspace and is sufficient to identify the processor and its features.

… and this is only the beginning of the problem. ARM processors have a very interesting (and hence painful) feature called big.LITTLE, which allows for different processor configurations on a single die. Even if we have a way to read te two registers, you could end up reading the MIDR_EL1 from one CPU and REVIDR_EL1 from another, so you need a way to ensure that both values are read from the same core.

This led to the initial proposal for kernel support to expose the information in a sysfs directory structure in addition to a trap into the kernel for the MRS instruction. This meant that for any IFUNC implementation to find out the vendor IDs of the cores on the system, it would have to traverse a whole directory structure, which is not the most optimal thing to do in an IFUNC, even if it happens only once in the lifetime of a process. As a result, we wanted to look for a better alternative.

VDSO FTW!

The number of system calls in a directory traversal would be staggering for, say, a 128 core processor and things will undoubtedly get worse as we scale. Another way for the kernel to share this (mostly static) information with userspace is via a VDSO, with an opaque structure in userspace pages in the vdso and helper functionsto traverse that structure. This however (or FS traversal for that matter) exposed a deeper problem, the extent of things we can do in an IFUNC.

An IFUNC runs very early in a dynamically linked program and even earlier in a statically linked program. As a result, there is very little that it can do because most of the complex features are not even initialized at that point. What’s more, the things you can do in a dynamic program are different from the things you can do in a static program (pretty much nothing right now in the latter), so that’s an inconsistency that is hard to reconcile. This makes the IFUNC resolvers very limited in their power and applicability, at least in their current state.

What were we talking about again?

The brief introduction turned out to be not so brief after all, but I hope it was clear. All of this fine analysis was done by Szabolcs Nagy from ARM when we talked about multi-arch first and the conclusion was that we needed to fix and enhance IFUNC support first if we had any hope of doing micro-architecture detection for ARM. However, there is another way for now…

Tunables!

A (not so) famous person (me) once said that glibc tunables are the answer to all problems including world hunger and of course, the ARMv8 multi-arch problem. This was a long term idea I had shared at the Linaro Connect in Bangkok earlier this year, but it looks like it might become a reality sooner. What’s more, it seems like Intel is looking for something like that as well, so I am not alone in making this potentially insane suggestion.

The basic idea here would be to have environment variable(s) todo/override IFUNC selection via tunables until the multi-arch situation is resolved. Tunables initialization is much more lightweight and only really relies on what the kernel provides on the stackand in the auxilliary vector and what the CPU provides directly. It seems easier to delay IFUNC resolution at least until tunables are initialized and then look harder at how much further they can be delayed so that they can use other things like VDSO and/or files.

So here is yet another idea that has culminated into a “just finish tunables already!” suggestion. The glibc community has agreed on setting the 2.25 release as the deadline to get this support in, so hopefully we will see some real code in this time.

england glibc gnutools