Hell Oh Entropy!

Update on libm performance

Posted: Mar 30, 2013, 14:15

If you’ve been following glibc development (I don’t think many do; the kernel is sexier apparently) you’ll see that a new NEWS item is in for 2.18:

* Improved worst case performance of libm functions with double inputs and
  output.

If you’re wondering if this has anything to do with my previous few posts on libm, then you’re right, it does. Thanks to a number of very interesting changes, the speed of the pow function on x86_64 is now roughly 5 times faster in the worst case than in 2.17. I have considered the pow function throughout my work because it is probably the most ill-reputed function implementation. I plan to write up a detailed description of various improvements I made to the code (other than formatting it and fixing the value of TWO) in a separate post or series of posts. To summarize, I have saved time by:

Avoiding wasting time multiplying zeroes in the multiplication function
Written a fast squaring method that is a special case of generic multiplication
Faster polynomial evaluation for multiple precision exp function
Configurable mantissa type for multiple precision numbers, to allow integral mantissa for x86 and retain the floating point mantissa for powerpc
Tweak to the multiplication algorithm to reduce multiplcations
Dozens of minor tweaks to eek out performance

But the worst case is a few thousand times slower than the average case; 5x is nothing!

Yes and no. 5x is indeed not a good enough improvement if one is looking to compare with the average case, but for an implementation that tries to guarantee 0.5 ulp correctness, it is quite impressive. The comparison point for that is multiple precision implementations like mpfr and we’re not far off that target. Anyone who has done some work on math library implementations will tell you how it is currently not possible to predict worst case precision required to guarantee accuracy in bivariate functions like pow, as a result of which one has to descend into large precisions whenever necessary.

I don't care about exactness, give me something fast and reasonably accurate

I got this a lot from people while working on this and honestly, it’s more a question of project goals than anything else. Currently we’re keeping things as they are, i.e. we’re going to try and maintain our halfulp correctness and try to speed things up. Maybe in future we could think of having different variants of implementations that have different performance and accuracy characteristics, but there’s nothing like that on the table right now.

Is this it? Will there be more?

There’s still a couple of interesting changes pending, the most important of them being the limitation of worst case precision for exp and log functions, based on the results of the paper Worst Cases for Correct Rounding of the Elementary Functions in Double Precision. I still have to prove that those results apply to the glibc multiple precision bits.

After that there is still a fair bit of scope for improvement, but before that I plan to get the performance benchmarking bits working for at least the major functions in glibc. That will give a standard way to measure performance across architectures and also track it across releases or milestones.

And now for the Call for Contributions!

And now on to what we need help with. Glibc exports a lot of functions and it is nearly impossible for me to write benchmark tests for all these functions in the 2.18 timeframe. I guess we’ll be happy to go with whatever we get, but if you’re looking for some interesting work, adding a function to the benchmark could be it. benchtests/Makefile has instructions on how one could add new benchmarks for functionms to the testsuite and I’d be more than happy to help with any queries anyone may have while doing this - all you have to do is post your query on the libc-help mailing list (libc-help at sourceware dot org).

The benchmark framework itself could use some love. The current implementation is simply based on clock_gettime, which is at best a vDSO function and at worst a system call. It would be really cool to have architecture-specific overrides that do measurements with little or no overhead so that the measurements are as accurate as they possibly can be.

Comments

Multiprecision arithmetic in libm

Posted: Feb 05, 2013, 13:59

Before my two week break, I was working on a bunch of ideas to try and get the multiprecision arithmetic performance in libm to not suck as badly as it currently does. There was a lot of stuff going on, so I’ll try to summarize them here. The primary reason for this post is to get myself back in groove (although I’ve tried to provide as muchh background as possible), so I apologize to readers if some of the content is not coherent. Feel free to point them out in the comments and I’ll try to clarify.

The multiprecision bits in libm are essentially all files starting with mp in $srcdir/sysdeps/iee754/dbl-64/. The structure that stores a multiprecision number is called mp_no and is declared in mpa.h as:

typedef struct
{
  int e;
  double d[40];
} mp_no;

where e is the exponent of the number and the mantissa digits are in d. The radix of the number is 2²⁴, so each digit in d is always a non-negative integral value less than 2²⁴.

The other all-important module is mpa.c, which defines basic operations on mp_no (construct from double, deconstruct to double, add, subtract, multiply, divide). It was relatively easy to see that these basic operations were the hotspots (specifically multiplication), but not so easy to see that Power code has its own copy of these functions. And there begins the real difficulty.

The Power architecture is unique in that it has all of 4 floating point units. The power specific code takes advantage of that fact and is tweaked in a manner that the execution units are used in parallel. In contrast, the intel x86 architecture is quite weak on floating point units and this is where the major conflict is. The mantissa of mp_no, which is currently a double (but does not need to be since it can only have integral values), is perfect for Power, but not good enough for intel, which has much faster fixed point computations. Conversion between doubles and ints is too slow on both platforms and is hence not a good enough alternative.

A possible approach is using a mantissa_t typedef that is then overridden by Power, but I need to do some consolidation in the rest of the code to ensure that the internal structure of mp_no is not exposed anywhere. So that’s a TODO.

Apart from exploiting architecture-specific traits, the other approach I considered was to tweak the algorithm for multiplication to make it as fast as possible in a platform-independent manner. A significant number of multiplication inputs are numbers that do not utilize the full precision of mp_no, i.e. a large number of mantissa digits are zeroes. The current algorithm blindly loops around the entire precision multiplying these zeroes, which is a waste. A better idea is to find the precision of the numbers and then run the multiply-add-carry loop only to the extent of that precision. The result of this was an improvement in performance to an extent of 36% in the pow function, i.e. a bit less than twice the speed of the earlier algorithm!

Next steps are to evaluate the impact of storing the actual precision of numbers so that it does not have to be computed within the multiplication function. That’s another TODO for me.

Finally there is a lot of cleanup in order in pretty much all of the mathlib code (i.e. the stuff in sysdeps/iee754/dbl-64), which I’ve been chugging away at and will continue to do so. I’m sure the glibc community will love to review any patch submissions that even simply reformat these files in the GNU format, so there’s a good way to get started with contributing code to glibc.

Comments

Observations on glibc libm

Posted: Dec 13, 2012, 03:42

One of my recent (ongoing) projects has been to figure out a way to make performance characteristics of some transcendental functions (exp, log, sin, cos, etc.) in libm a bit more uniform. All of the functions have been implemented with two computation phases. The first phase uses a table of precomputed values and a polynomial approximation of the function. If the result is not accurate enough, the second phase is triggered, which descends into multiprecision computation. These multiprecision computations are dog-slow and that’s what I’m trying to avoid.

As expected, this has turned out to be more of a mathematics project than programming. I haven’t done any serious mathematics (beyond filing tax returns) in recent years, so it has been quite an amazing ride so far with the real fun still to come. I’ve started a page on the glibc wiki to document my observations as I go along. Currently it has notes on pow and exp functions since that’s where I started. Read up and enjoy!

Comments