File offsets, active handles and stdio

Posted: 2014-03-19T18:39:18+05:30

Back in 2012 I wrote a patch to optimize ftell so that it doesn’t always flush buffers when it just has to report the current file offset. That patch unleashed a whole lot of pain due to the way in which the ftell code was written - it shared code with fseek, which is semantically different from ftell in a few ways. The general mess in libio code didn’t help matters much either.

Since that patch, a number of fixes went in to correct broken behaviour and it culminated in me essentially rewriting ftell. The main problems encountered were related to caching of the file offset in the underlying file to avoid a syscall, so Carlos suggested I write up a wiki document explaining the various scenarios. So I wrote File offsets in stdio stream and ftell in the glibc wiki.

Since this is a new ftell implementation, we’d love to get feedback on correctness (as bug reports) and performance (as bug reports, email, tweets, etc.).

Comments

GNU C Library 2.19 and what developers can get from it

Posted: 2014-02-10T05:55:19+05:30

The GNU C Library project released version 2.19 of it’s library on Saturday (Friday for some), with Allan McRae as the release manager. Apart from numerous bug fixes, there are a couple improvements that would interest developers. Both improvements are related in some manner to the library documentation, which apparently is not very well known. In fact, it seems that very few people know that the official documentation for functions in the GNU C library is not the man pages, it is the GNU C Library Manual. This is not to discredit the man page project in any way of course - Michael Kerrisk does a swell job of keeping man pages in sync with glibc behaviour wherever necessary and I’d like to think that we’re cooperating to the best of our abilities. The man page project however is more general and covers a fairly broad set of components including the kernel, the C library and some tools. The glibc manual focusses specifically on functionality provided by glibc.

Now for the first big improvement. Alexandre Oliva, along with a number of reviewers did the mammoth job of adding documentation regarding multi-thread safety, async-signal safety and async-cancellation safety for functions provided by glibc. This is an invaluable resource because it tries to describe precisely what kind of guarantees the glibc implementation of various functions provides, as opposed to the guarantees documented in the various standards.

The second improvement is the introduction of Systemtap userspace probe markers for various events in malloc and some mathematical functions. The malloc subsystem is fairly complex and has some critical events that a developer may want to track when trying to profile memory allocation patterns for their programs. These probes are placed at such critical points in the malloc subsystem so that one may write up a systemtap script to profile their applications much more easily. I had written a description of the malloc internal implementation some time ago, which is still relevant and may help developers select the appropriate probes.

Some mathematical functions try to provide a guarantee of accuracy of the result to the last bit and to do so, some inputs may require multiple precision computation (I have of course, written about this in a bit more detail in the past). This fallback computation may be in multiple stages and may take anywhere between a 100 times to a 1000 times more than the normal execution speed of the function. While this fallback is needed only for a handful of inputs in the entire range, the performance impact is really high when an application does hit this path. So to help developers identify whether their performance hit is due to these multiple precision fallback paths, these paths have been marked with probe markers that can be used in systemtap scripts to profile applications. The probes have been documented in the libc manual, in the Internal Probes section.

Finally, I have managed to finish producing a significant set of benchmark inputs for the math functions I care about, so it might be a good time for folks to start trying them out and sending in results. The README in the benchtests directory should be a good starting point. The output file format is still not final - I’m toying with JSON as the final format - so expect changes there in future. The string benchmarks still need some attention, which hopefully will happen in the 2.20 time frame.

Looking forward to 2.20, Joseph Myers has already begun the work of moving the architectures in the ports directory to the regular source tree. Once this is complete, we will have no concept of secondary architectures, which is a good thing. Hopefully in future we will also get rid of the libc-ports mailing list, which I have complained about in the past as being an unnecessary separation.

On the benchmarks front, we’ll be moving to python as the language of choice for the scripts and adding features such as graphing, a better file format, scripts to compare benchmark outputs and anything else that catches my fancy during the next 6 months.

Finally, I’ve mentioned this before - The GNU C Library manual needs contributors and here’s how you can help.

Comments

Blank lines in /etc/netgroup

Posted: 2014-01-27T11:36:02+05:30

While doing a code audit, I noticed that netgroups queries resulted in warnings in valgrind on my system:

==14597== Invalid read of size 1
==14597==    at 0xBB735E0: _nss_files_setnetgrent (files-netgrp.c:106)
==14597==    by 0x4F4954E: __internal_setnetgrent_reuse (getnetgrent_r.c:139)
==14597==    by 0x4F49879: setnetgrent (getnetgrent_r.c:181)
==14597==    by 0x4033EA: netgroup_keys (getent.c:493)
==14597==    by 0x402370: main (getent.c:1011)
==14597==  Address 0x51fe63f is 1 bytes before a block of size 120 alloc'd
==14597==    at 0x4C2A45D: malloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==14597==    by 0x4EA403D: getdelim (iogetdelim.c:66)
==14597==    by 0xBB73575: _nss_files_setnetgrent (stdio.h:117)
==14597==    by 0x4F4954E: __internal_setnetgrent_reuse (getnetgrent_r.c:139)
==14597==    by 0x4F49879: setnetgrent (getnetgrent_r.c:181)
==14597==    by 0x4033EA: netgroup_keys (getent.c:493)
==14597==    by 0x402370: main (getent.c:1011)

The code was very obviously buggy too:

         ...
         while (line[curlen - 1] == '\n' && line[curlen - 2] == '\\')
           {
             /* Yes, we have a continuation line.  */
             if (found)
             ...

So if line has just the newline character, curlen will be 1 and hence, line[curlen - 2] will be the byte before the array. This kind of thing is never caught normally because of two factors:

  1. The byte before line is part of the malloc metadata
  2. The byte is read and not written to, hence there's no question of corrupting data and triggering one of malloc's consistency checking mechanisms.

Hence the code never crashes and we think that everything is just fine until the byte preceding line just happens to be 0x5b, i.e. ‘\‘. That will again never happen in the normal case since it would mean that the size of the line is at least 0x5b00000000000000 on a 64-bit system. However, even if this happens, it may not do a lot of harm since all it does is resulting in concatenating an empty string to the subsequent netgroup line.

However, consider a scenario where a netgroups file was generated in a kickstart like this:

{ for i in foo bar baz; do
    echo "${i}_netgroup \\"
    for j in $(seq 1 10); do
        echo "($i, user_$j,) \\"
    done
    # Guess what! I can get away with the \ in the end if I leave a blank line!
    # How Cool is that!
    echo ""
done } > /etc/netgroup

Here, the administrator has found a ‘feature’ where leaving blank lines allows them to get away with their very simple script to generate netgroups with trailing backslashes. Now what happens if line happens to have a preceding 0x5b? The groups separated by the blank lines will get concatenated and all of a sudden you have one group with all of the entries being members of the first group!

So if an attacker manages to control this preceding byte in a program (possibly through a different heap overflow bug), (s)he could manipulate netgroup memberships and cause a breach. So we have a security vulnerability on our hands! Or do we?

Not such a big deal after all

So while it is fun to imagine scenarios to exploit code (who knows when you’ll get a hit!), this one seems like just a harmless buglet. Here’s why:

Controlling the preceding byte in such a manner is extremely difficult, since the malloc consistency checker should kick in before line is even allocated this offending chunk of memory. One may argue that malloc could be overridden and the preceding byte could be tailored to ones needs, but that immediately takes out the interesting attack scenarios, i.e. the setuid-root programs. If you’re doing authentication using a non-setuid command line program then you’ve got more serious issues anyway because one could simply override getnetgrent_r or similar user/group/netgroup browsing functions and bypass your check.

Finally, this ‘exploit’ depends on the netgroups file being wrong, which is again extremely unlikely and even if it is, that’s a problem with the configuration and not a vulnerability in the code.

In the end it is just a trivial off-by-one bug which must be fixed, so it was.

Comments

NSCD Database File Layout

Posted: 2013-06-21T18:00:26+05:30

I had to do some analysis on the nscd database file this week, for which I hacked up a quick program to dump the contents of an nscd database file. I intend to post the code on the upstream mailing list when it is complete, but for now I wrote up a description of the file layout on the glibc wiki since there isn’t any documentation around it.

Comments

Update on libm performance

Posted: 2013-03-30T14:15:00+05:30

If you’ve been following glibc development (I don’t think many do; the kernel is sexier apparently) you’ll see that a new NEWS item is in for 2.18:

* Improved worst case performance of libm functions with double inputs and
  output.

If you’re wondering if this has anything to do with my previous few posts on libm, then you’re right, it does. Thanks to a number of very interesting changes, the speed of the pow function on x86_64 is now roughly 5 times faster in the worst case than in 2.17. I have considered the pow function throughout my work because it is probably the most ill-reputed function implementation. I plan to write up a detailed description of various improvements I made to the code (other than formatting it and fixing the value of TWO) in a separate post or series of posts. To summarize, I have saved time by:

But the worst case is a few thousand times slower than the average case; 5x is nothing!

Yes and no. 5x is indeed not a good enough improvement if one is looking to compare with the average case, but for an implementation that tries to guarantee 0.5 ulp correctness, it is quite impressive. The comparison point for that is multiple precision implementations like mpfr and we’re not far off that target. Anyone who has done some work on math library implementations will tell you how it is currently not possible to predict worst case precision required to guarantee accuracy in bivariate functions like pow, as a result of which one has to descend into large precisions whenever necessary.

I don't care about exactness, give me something fast and reasonably accurate

I got this a lot from people while working on this and honestly, it’s more a question of project goals than anything else. Currently we’re keeping things as they are, i.e. we’re going to try and maintain our halfulp correctness and try to speed things up. Maybe in future we could think of having different variants of implementations that have different performance and accuracy characteristics, but there’s nothing like that on the table right now.

Is this it? Will there be more?

There’s still a couple of interesting changes pending, the most important of them being the limitation of worst case precision for exp and log functions, based on the results of the paper Worst Cases for Correct Rounding of the Elementary Functions in Double Precision. I still have to prove that those results apply to the glibc multiple precision bits.

After that there is still a fair bit of scope for improvement, but before that I plan to get the performance benchmarking bits working for at least the major functions in glibc. That will give a standard way to measure performance across architectures and also track it across releases or milestones.

And now for the Call for Contributions!

And now on to what we need help with. Glibc exports a lot of functions and it is nearly impossible for me to write benchmark tests for all these functions in the 2.18 timeframe. I guess we’ll be happy to go with whatever we get, but if you’re looking for some interesting work, adding a function to the benchmark could be it. benchtests/Makefile has instructions on how one could add new benchmarks for functionms to the testsuite and I’d be more than happy to help with any queries anyone may have while doing this - all you have to do is post your query on the libc-help mailing list (libc-help at sourceware dot org).

The benchmark framework itself could use some love. The current implementation is simply based on clock_gettime, which is at best a vDSO function and at worst a system call. It would be really cool to have architecture-specific overrides that do measurements with little or no overhead so that the measurements are as accurate as they possibly can be.

Comments

The glibc manual needs volunteers

Posted: 2013-02-21T06:17:00+05:30

The GNU C Library (glibc) needs more contributors to help improve the library. This is even more true for a very key portion of the library, which is the documentation. We have a glibc manual that gets updated on every release, which is a bit incomplete and has a fair number of bugs filed against it. We would welcome volunteers to take a crack at those bug reports and send in patches. Here’s some information to get you started.

The glibc manual code is maintained within the glibc source tree, in the manual directory. It is in texinfo format, so if you’re familiar with tex, you’re halfway through. The various chapters in the manual are in *.texi files, with a Makefile to build the manual in either info or html format.

To build the manual, create a directory within the source tree (or anywhere outside it is also fine), which I will refer to as the build directory from now on. This keeps the generated content separate from the code. Enter that directory and type the command:

$SRCDIR/configure --prefix=/usr

where $SRCDIR is the path to the sources. Don’t worry about the /usr prefix since we’re not going to install anything. If you’re especially worried about it, then use some other prefix like $HOME/sandbox or similar. Once configure succeeds, build the info format documentation using:

make info

or the html format using:

make html

The documentation gets built in the manual directory within the build directory. The html documentation is built in the libc directory within the manual directory. You can open index.html in a browser to browse through and verify your changes.

Contributing to glibc usually requires a copyright assignment to FSF. If you don’t mind doing this, the procedure is fairly easy, albeit time consuming. All you have to do is post your patch on the libc-alpha (libc-alpha AT sourceware dot org) mailing list and if copyright assignment is necessary for the change, the reviewer will let you know what to do. For help when writing a patch, the libc-help (libc-help AT sourceware dot org) mailing list is the place to go.

Comments

Multiprecision arithmetic in libm

Posted: 2013-02-05T13:59:25+05:30

Before my two week break, I was working on a bunch of ideas to try and get the multiprecision arithmetic performance in libm to not suck as badly as it currently does. There was a lot of stuff going on, so I’ll try to summarize them here. The primary reason for this post is to get myself back in groove (although I’ve tried to provide as muchh background as possible), so I apologize to readers if some of the content is not coherent. Feel free to point them out in the comments and I’ll try to clarify.

The multiprecision bits in libm are essentially all files starting with mp in $srcdir/sysdeps/iee754/dbl-64/. The structure that stores a multiprecision number is called mp_no and is declared in mpa.h as:

typedef struct
{
  int e;
  double d[40];
} mp_no;

where e is the exponent of the number and the mantissa digits are in d. The radix of the number is 224, so each digit in d is always a non-negative integral value less than 224.

The other all-important module is mpa.c, which defines basic operations on mp_no (construct from double, deconstruct to double, add, subtract, multiply, divide). It was relatively easy to see that these basic operations were the hotspots (specifically multiplication), but not so easy to see that Power code has its own copy of these functions. And there begins the real difficulty.

The Power architecture is unique in that it has all of 4 floating point units. The power specific code takes advantage of that fact and is tweaked in a manner that the execution units are used in parallel. In contrast, the intel x86 architecture is quite weak on floating point units and this is where the major conflict is. The mantissa of mp_no, which is currently a double (but does not need to be since it can only have integral values), is perfect for Power, but not good enough for intel, which has much faster fixed point computations. Conversion between doubles and ints is too slow on both platforms and is hence not a good enough alternative.

A possible approach is using a mantissa_t typedef that is then overridden by Power, but I need to do some consolidation in the rest of the code to ensure that the internal structure of mp_no is not exposed anywhere. So that’s a TODO.

Apart from exploiting architecture-specific traits, the other approach I considered was to tweak the algorithm for multiplication to make it as fast as possible in a platform-independent manner. A significant number of multiplication inputs are numbers that do not utilize the full precision of mp_no, i.e. a large number of mantissa digits are zeroes. The current algorithm blindly loops around the entire precision multiplying these zeroes, which is a waste. A better idea is to find the precision of the numbers and then run the multiply-add-carry loop only to the extent of that precision. The result of this was an improvement in performance to an extent of 36% in the pow function, i.e. a bit less than twice the speed of the earlier algorithm!

Next steps are to evaluate the impact of storing the actual precision of numbers so that it does not have to be computed within the multiplication function. That’s another TODO for me.

Finally there is a lot of cleanup in order in pretty much all of the mathlib code (i.e. the stuff in sysdeps/iee754/dbl-64), which I’ve been chugging away at and will continue to do so. I’m sure the glibc community will love to review any patch submissions that even simply reformat these files in the GNU format, so there’s a good way to get started with contributing code to glibc.

Comments

Observations on glibc libm

Posted: 2012-12-13T03:42:31+05:30

One of my recent (ongoing) projects has been to figure out a way to make performance characteristics of some transcendental functions (exp, log, sin, cos, etc.) in libm a bit more uniform. All of the functions have been implemented with two computation phases. The first phase uses a table of precomputed values and a polynomial approximation of the function. If the result is not accurate enough, the second phase is triggered, which descends into multiprecision computation. These multiprecision computations are dog-slow and that’s what I’m trying to avoid.

As expected, this has turned out to be more of a mathematics project than programming. I haven’t done any serious mathematics (beyond filing tax returns) in recent years, so it has been quite an amazing ride so far with the real fun still to come. I’ve started a page on the glibc wiki to document my observations as I go along. Currently it has notes on pow and exp functions since that’s where I started. Read up and enjoy!

Comments

Malloc per-thread arenas in glibc

Posted: 2012-10-24T14:40:35+05:30

The malloc subsystem in glibc has had the feature of per-thread arenas for quite some time now and based on my experience, it seems to be the source of a lot of confusion. This is especially for enterprise users who move from RHEL-5 to RHEL-6 to see their apps taking up a lot more ‘memory’ and you’ll see a fair amount of content in this regard in the Red Hat Customer Portal if you’re a RHEL customer. This post is an attempt to get more perspective on this from the internal design perspective. I have not written any of the code in this implementation (barring a tiny improvement), so all of the talk about ‘original intention’ of any design is speculation on my behalf.

Background

The glibc malloc function allocates blocks of address space to callers by requesting memory from the kernel. I have written about the two syscalls glibc uses to do this, so I won’t repeat that here beyond mentioning that they two ‘places’ from where the address space is obtained are ‘arenas’ and ‘anonymous memory maps’ (referred to as anon maps in the rest of the post). The concept of interest here is the arena, which is nothing but a contiguous block of memory obtained from the kernel. The difference from the anon maps is that one anon map fulfills only one malloc request while an arena is a scratchpad that glibc maintains to return smaller blocks to the requestor. As you may have guessed, the data area (or ‘heap’) created by extending the process break (using the brk syscall) is also an arena - it is referred to as the main arena. The ‘heap’ keyword has a different meaning in relation to the glibc malloc implementation as we’ll find out later.

The Arenas

In addition to the main arena, glibc malloc allocates additional arenas. The reason for creation of arenas always seems to have been to improve performance of multithreaded processes. A malloc call needs to lock an arena to get a block from it and contention for this lock among threads is quite a performance hit. So the multiple arenas implementation did the following to reduce this contention:

  1. Firstly, the malloc call always tries to always stick to the arena it accessed the last time
  2. If the earlier arena is not available immediately (as tested by a pthread_mutex_trylock), try the next arena
  3. If we don't have uncontended access on any of the current arenas, then create a new arena and use it.

We obviously start with just the main arena. The main arena is extended using the brk() syscall and the other arenas are extended by mapping new ‘heaps’ (using mmap) and linking them. So an arena can generally be seen as a linked list of heaps.

An Arena for a Thread

The arena implementation was faster than the earlier model of using just a single arena with an on-demand model for reduction of contention in the general case. The process of detecting contention however was sufficiently slow and hence a better idea was needed. This is where the idea of having a thread stick to one arena comes in.

With the arena per-thread model, if malloc is called for the first time within a thread, a new arena is created without looking at whether the earlier locks would be contended or not. As a result, for a sane number of threads*, one can expect zero contention among threads when locking arenas since they’re all working on their own arenas.

There is a limit to the number of arenas that are created in this manner and that limit is determined based on the number of cores the system has. 32-bit systems get twice the number of cores and 64-bit systems get 8 times the number of cores. This can also be controlled using the MALLOC_ARENA_MAX environment variable.

The Problem (or not)

The whole issue around the concept of arenas is the amount of ‘memory’ it tends to waste. The common complaint is that the virtual memory footprint of programs increase due to use of arenas and definitely more so due to using a different arena for each thread. The complaint is mostly bogus because of a couple of very important features that have been there for years - loads and loads of address space and demand paging.

The linux virtual memory model can allow a process to access most of its virtual memory address space (provided it is requested and mapped in). A 64-bit address space is massive and is hence not a resource that is likely to run out for any reasonable process. Add to it the fact that the kernel has mechanisms to use physical memory only for pages that are needed by the process at that time and you can rest assured that even if you map in terabytes of memory in your process, only the pages that are used will ever be accounted against you.

Both arena models rely on these facts. A mapped in arena is explicitly requested without any permissions (PROT_NONE) so that the kernel knows that it does not need any physical memory to back it. Block requests are then fulfilled by giving read+write permissions in parts. Freed blocks near the end of heaps in an arena are given back to the system either by using madvise(MADV_DONTNEED) or by explicitly unmapping.

So in the light of all of this, as an administrator or programmer, one needs to shift focus from the virtual memory column in your top output to the RSS column, since that is where the real resource usage is. Address space usage is not the same thing as memory usage. It is of course a problem if RSS is also high and in such cases one should look at the possibility of of a memory leak within to process and if that is not the problem, then fragmentation within the arenas. There are tuning options available to reduce fragmentation. Setting resource limits on address space usage is passe and it is time that better alternatives such as cgroups are given serious consideration.

Comments

Changing the default loader for a program in its ELF

Posted: 2011-03-27T01:11:00+05:30

I was following an email thread in the libc-help mailing list, where the OP asked how one could force a program to load with a different loader, i.e. a loader from a different build of glibc. The answer, which is available if you follow the thread, is that it is necessary to rebuild the program to do this, with the --dynamic-linker switch to the linker. The best answer for the thread was to do this without a rebuild by making a wrapper script that invokes the program with an alternate loader, like:

/path/to/other/ld-linux.so program_name

So the story should have ended there. But it got me thinking; how hard could it be to edit the program to change the hard-coded loader value in the program? So I decided to find out.

The first try was as simple as replacing the string with my new loader string in a test program, which is basically just a Hell Oh World program. To do this, I opened the binary in emacs and changed to hex mode using M-x and then hexl-mode. Next I just wrote over the string with the absolute path of the new loader. I saved my changes and ran the program:

[siddhesh@localhost ~]$ ./a.out
bash: ./a.out: cannot execute binary file

Obviously I had done something wrong. So I had to do a bit of reading to understand what all was to change. One very obvious thing about this was that the length of the loader string was being recorded somewhere -- I was prety sure I had not overwritten anything and I had not changed the offset of the string at all. So I started reading the System V ABI documentation, which includes information on the ELF header. The ELF header documents where the program header table is, and the size of each program header table entry. one could read this very easily using the elfutils readelf command. So here's what I got:

[siddhesh@localhost ~]$ readelf -h ./a.out
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0x4003e0
  Start of program headers:          64 (bytes into file)
  Start of section headers:          2472 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         8
  Size of section headers:           64 (bytes)
  Number of section headers:         30
  Section header string table index: 27

So the important information here is the start of program headers and the size of each program header. In the case above, the program header starts 64 bytes into the file and then every 56 bytes after that was a program header. You can also read the program headers using elfutils:

[siddhesh@localhost ~]$ readelf -l ./a.out

Elf file type is EXEC (Executable file)
Entry point 0x4003e0
There are 8 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  PHDR           0x0000000000000040 0x0000000000400040 0x0000000000400040
                 0x00000000000001c0 0x00000000000001c0  R E    8
  INTERP         0x0000000000000200 0x0000000000400200 0x0000000000400200
                 0x000000000000001c 0x000000000000001c  R      1
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
  LOAD           0x0000000000000000 0x0000000000400000 0x0000000000400000
                 0x000000000000068c 0x000000000000068c  R E    200000
  LOAD           0x0000000000000690 0x0000000000600690 0x0000000000600690
                 0x00000000000001ec 0x0000000000000200  RW     200000
  DYNAMIC        0x00000000000006b8 0x00000000006006b8 0x00000000006006b8
                 0x0000000000000190 0x0000000000000190  RW     8
  NOTE           0x000000000000021c 0x000000000040021c 0x000000000040021c
                 0x0000000000000044 0x0000000000000044  R      4
  GNU_EH_FRAME   0x00000000000005e8 0x00000000004005e8 0x00000000004005e8
                 0x0000000000000024 0x0000000000000024  R      4
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RW     8

 Section to Segment mapping:
  Segment Sections...
   00     
   01     .interp 
   02     .interp .note.ABI-tag .note.gnu.build-id .gnu.hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt .init .plt .text .fini .rodata .eh_frame_hdr .eh_frame 
   03     .ctors .dtors .jcr .dynamic .got .got.plt .data .bss 
   04     .dynamic 
   05     .note.ABI-tag .note.gnu.build-id 
   06     .eh_frame_hdr 
   07     

The relevant section here is the INTERP section. Each program header is basically this struct:

typedef struct {
        Elf32_Word p_type;
        Elf32_Off p_offset;
        Elf32_Addr p_vaddr;
        Elf32_Addr p_paddr;
        Elf32_Word p_filesz;
        Elf32_Word p_memsz;
        Elf32_Word p_flags;
        Elf32_Word p_align;
} Elf32_Phdr;

Here, the relevant values of the string size are p_filesz and p_memsz. If you look into the values from the readelf, you'll see that the size is 0x1c, which is the length of the string, including the trailing null character. So that is what I had to change. Now I had to find this header in the hex editor. The header is the one beginning with PT_INTERP, which has value 0x03:

87654321  0011 2233 4455 6677 8899 aabb ccdd eeff  0123456789abcdef                                                                                                               
00000000: 7f45 4c46 0201 0100 0000 0000 0000 0000  .ELF............
00000010: 0200 3e00 0100 0000 e003 4000 0000 0000  ..>.......@.....
00000020: 4000 0000 0000 0000 a809 0000 0000 0000  @...............
00000030: 0000 0000 4000 3800 0800 4000 1e00 1b00  ....@.8...@.....
00000040: 0600 0000 0500 0000 4000 0000 0000 0000  ........@.......
00000050: 4000 4000 0000 0000 4000 4000 0000 0000  @.@.....@.@.....
00000060: c001 0000 0000 0000 c001 0000 0000 0000  ................
00000070: 0800 0000 0000 0000 0300 0000 0400 0000  ................
00000080: 0002 0000 0000 0000 0002 4000 0000 0000  ..........@.....
00000090: 0002 4000 0000 0000 1c00 0000 0000 0000  ..@.....(.......
000000a0: 1c00 0000 0000 0000 0100 0000 0000 0000  (...............

I just changed the highlighted 1c to 28, which is the hex value of the string length of my interpreter (/home/siddhesh/sandbox/lib/ld-2.8.90.so). This and the change in the string was all that was needed to change the interpreter for the program. This has one really basic flaw, which is that if the interpreter string is longer than the current available space in the header, I will end up overwriting other data, thus corrupting the binary. So if you're being adventurous, be sure that your interpreter path is small enough to fit into the available space.

Comments