Hell Oh Entropy!

Life, Code and everything in between

Buggy HLE, microcode updates and SIGILLs

Posted: Sep 26, 2014, 09:32

Update: Disabling lock elision in glibc doesn’t seem to be sufficient. Either way, the Fedora kernel folks will have an update in place to update the microcode early by default so that both the kernel and the first instantiation of pthreads will see HLE disabled. So read the story as something interesting that we did but didn’t quite work. It was fun though…

Amit and I ran into an interesting problem today with his new Haswell process based system. A fully updated Fedora 21 alpha would fail during boot and fall into the maintainer shell. The systemd journal showed that systemd-udevd was crashing with a SIGILL, which seemed strange. The core dump revealed the problem:

(gdb) x/i $rip
=> 0x7f68b0b978ba <pthread_rwlock_rdlock+186>:  xbeginq 0x7f68b0b978c0 <pthread_rwlock_rdlock+192>

The xbeginq instruction is an HLE instruction, so the first thing that came to mind was the recent errata that Intel pushed out, effectively announcing that HLE was buggy and that they were going to disable it soon. We looked at /proc/cpuinfo expecting to find hle and rtm missing, but were even more confused to find that they were present.

After much tinkering about, Amit made a vague reference to microcode_ctl being able to change CPU microcode on the fly. It took a while to hit us, but we finally realized that we had found the culprit. microcode_ctl had been updated with the latest Intel microcode update. We initially thought that it ought to be a one-time problem since the microcode would be flashed into the cpu and later everything would work, but then we found out that the microcode needs to be flashed on every boot.

So the root cause was that the microcode would happen late enough that systemd was already up and had read the hle bit, thus enabling lock elision support in systemd. Also, since the kernel had already read in cpu capabilities, it also did not have the updated capabilities, due to which we continued seeing hle and rtm set in cpuinfo.

As a result, thanks to the microcode update, all haswell based F21 alpha systems are essentially unbootable. Carlos is now fixing this by disabling lock-elision completely in the glibc build. Work is in progress for rawhide, F21 and F20 as I write this, so the impact of this will hopefully be minimal. If you do run into this problem, all you have to do is dowwngrade the microcode_ctl package and pin it so that it doesn’t get updated till the glibc update becomes available.