Update on libm performance
If you’ve been following glibc development (I don’t think many do; the kernel is sexier apparently) you’ll see that a new NEWS item is in for 2.18:
* Improved worst case performance of libm functions with double inputs and output.
If you’re wondering if this has anything to do with my previous few posts on libm, then you’re right, it does. Thanks to a number of very interesting changes, the speed of the
pow function on x86_64 is now roughly 5 times faster in the worst case than in 2.17. I have considered the
pow function throughout my work because it is probably the most ill-reputed function implementation. I plan to write up a detailed description of various improvements I made to the code (other than formatting it and fixing the value of TWO) in a separate post or series of posts. To summarize, I have saved time by:
- Avoiding wasting time multiplying zeroes in the multiplication function
- Written a fast squaring method that is a special case of generic multiplication
- Faster polynomial evaluation for multiple precision exp function
- Configurable mantissa type for multiple precision numbers, to allow integral mantissa for x86 and retain the floating point mantissa for powerpc
- Tweak to the multiplication algorithm to reduce multiplcations
- Dozens of minor tweaks to eek out performance
But the worst case is a few thousand times slower than the average case; 5x is nothing!
Yes and no. 5x is indeed not a good enough improvement if one is looking to compare with the average case, but for an implementation that tries to guarantee 0.5 ulp correctness, it is quite impressive. The comparison point for that is multiple precision implementations like mpfr and we’re not far off that target. Anyone who has done some work on math library implementations will tell you how it is currently not possible to predict worst case precision required to guarantee accuracy in bivariate functions like
pow, as a result of which one has to descend into large precisions whenever necessary.
I don't care about exactness, give me something fast and reasonably accurate
I got this a lot from people while working on this and honestly, it’s more a question of project goals than anything else. Currently we’re keeping things as they are, i.e. we’re going to try and maintain our halfulp correctness and try to speed things up. Maybe in future we could think of having different variants of implementations that have different performance and accuracy characteristics, but there’s nothing like that on the table right now.
Is this it? Will there be more?
There’s still a couple of interesting changes pending, the most important of them being the limitation of worst case precision for exp and log functions, based on the results of the paper Worst Cases for Correct Rounding of the Elementary Functions in Double Precision. I still have to prove that those results apply to the glibc multiple precision bits.
After that there is still a fair bit of scope for improvement, but before that I plan to get the performance benchmarking bits working for at least the major functions in glibc. That will give a standard way to measure performance across architectures and also track it across releases or milestones.
And now for the Call for Contributions!
And now on to what we need help with. Glibc exports a lot of functions and it is nearly impossible for me to write benchmark tests for all these functions in the 2.18 timeframe. I guess we’ll be happy to go with whatever we get, but if you’re looking for some interesting work, adding a function to the benchmark could be it.
benchtests/Makefile has instructions on how one could add new benchmarks for functionms to the testsuite and I’d be more than happy to help with any queries anyone may have while doing this - all you have to do is post your query on the libc-help mailing list (libc-help at sourceware dot org).
The benchmark framework itself could use some love. The current implementation is simply based on
clock_gettime, which is at best a vDSO function and at worst a system call. It would be really cool to have architecture-specific overrides that do measurements with little or no overhead so that the measurements are as accurate as they possibly can be.