Hell Oh Entropy!

Life, Code and everything in between

gcc under the hood

Posted: Oct 03, 2019, 14:12

My background in computers is a bit hacky for a compiler engineer. I never had the theoretical computer science base that the average compiler geek does (yes I have a Masters in Computer Applications, but it’s from India and yes I’m going to leave that hanging without explanation) and I have pretty much winged it all these years. What follows is a bunch of thoughts (high five to those who get that reference!) from my winging it for almost a decade with the GNU toolchain. If you’re a visual learner then I would recommend watching my talk video at Linaro Connect 2019 instead of reading this. This is an imprecise transcript of my talk, with less silly quips and in a more neutral accent.

Hell Oh World!

It all started as a lightning talk at SHD Belgaum where I did a 5 minute demonstration of how a program goes from source code to a binary program. I got many people asking me to talk about this in more detail and it eventually became a full hour workshop at FOSSASIA in 2017. Those remained the basis for the first part of my talk at Connect, in fact they’re a bit more detailed in their treatment of the subject of purely taking code from source to binary.

Go read those posts, I’ll wait.

Under the hood

Welcome back! So now you know almost exactly how code goes from source to binary and how it gets executed on your computer. Congratulations, you now have a headstart into hacking on the dynamic loader bits in glibc! Now that we’ve explored the woods around the swamp, lets step into the swamp. Lets take a closer look at what gcc does with our source code.

Babelfish

The first thing a compiler does is to read the source code and make it into a data structure form that is easy for the computer to manipulate. Since the computer deals best with binary data, it converts the text form of the source code language into a tree-like structure. There is plenty of literature available on how that is implemented; in fact most compiler texts end up putting too much focus on this aspect of the compiler and then end up rushing through the real fun stuff that the compiler does with the program you wrote.

The data structure that the compiler translates your source code into is called an Intermediate Representation (IR), and gcc has three of them!

This part of the compiler that parses the source code into IR is known as the front end. gcc has many such front ends, one for each language it implements; they are all cluttered into the gcc/ directory. The C family of languages has a common subset of code in the gcc/c-family directory and then there are specializations like the C++ front end, implemented in the gcc/cp directory. There is a directory for fortran, another for java, yet another for go and so on. The translation of code from source to IR varies from frontend to frontend because of language differences, but they all have one thing in common; their output is an IR called GENERIC and it attempts to be language-independent.

Optimisation passes

Once gcc translates the source code into its IR form, it runs the IR through a number of passes (about 200 of them, more or less depending on the -O flag you pass) to try and come up with the most optimal machine code representation of the program. gcc builds source code a file at a time, which is called a Translation Unit (TU). A lot of its optimisation passes operate at the function level, i.e. individual functions are seen as independent units and are optimised separately. Then there are Inter-procedural analysis (IPA) passes that look at interactions of these functions and finally there is Link Time Optimisation(LTO) that attempts to analyse source code across translation units to potentially get even better results.

Optimisation passes can be architecture-independent or architecture-dependent.

Architecture independent passes do not care too much about the underlying machine beyond some basic details like word size, whether the CPU has floating point support, vector support, etc. These passes have some configuration hooks that allow their behaviour to be modified according to the target CPU architecture but the high level behaviour is architecture-agnostic. Architecture-independent passes are the holy grail for optimisation because they usually don’t get old; optimisations that work today will continue to work regardless of CPU architecture evolution.

Architecture-dependent passes need more information about the architecture and as a result their results may change as architectures evolve. Register allocation and instruction scheduling for example are very interesting and complex problems that architecture-specific passes handle. The instruction scheduling problem was a lot more critical back in the day when CPUs could only execute code sequentially. With out-of-order execution, the scheduling problem has not become less critical, but it has definitely changed in nature. Similarly the register allocation problem can be complicated by various factors such as number of locigal registers, how they share physical register files, how costly moving between different types of registers is, and so on. Architecture-dependent passes have to take into consideration all of these factors.

The final pass in architecture-dependent passes does the work of emitting assembly code for the target CPU. The architecture-independent passes constitute what is known as the middle-end of the compiler and the architecture-dependent passes form the compiler backend.

Each pass is a complex work of art, mathematics and logic and may have one or more highly cited research papers as their basis. No single gcc engineer would claim to understand all of these passes; there are many who have spent most of their careers on a small subset of passes, such is their complexity. But then, this is also a testament to how well we can work together as humans to create something that is significantly more complex than what our individual minds can grasp. What I mean to say with all this is, it’s OK to not know all of the passes, let alone know them well; you’re definitely not the only one.

Optimisation passes are all listed in gcc/passes.def and that is the sequence in which they are executed. A pass is defined as a class with a gate and execute function that determine whether to run and what to do respectively. Here’s what a single pass namespace looks like:

/* Pass data to help the pass manager classify, prepare and cleanup.  */
const pass_data pass_data_my_pass =
{
  GIMPLE_PASS, /* type */
  "my_pass", /* name */
  OPTGROUP_LOOP, /* optinfo_flags */
  TV_TREE_LOOP, /* tv_id */
  PROP_cfg, /* properties_required */
  0, /* properties_provided */
  0, /* properties_destroyed */
  0, /* todo_flags_start */
  0, /* todo_flags_finish */
};

/* This is a GIMPLE pass.  I know you don't know what GIMPLE is yet ;) */
class pass_my_pass : public gimple_opt_pass
{
public:
  pass_my_pass (gcc::context *ctxt)
    : gimple_opt_pass (pass_data_my_pass, ctxt)
  {}

  /* opt_pass methods: */
  virtual bool gate (function *) { /* Decide whether to run.  */ }

  virtual unsigned int execute (function *fn);
};

unsigned int
pass_my_pass::execute (function *)
{
  /* Magic!  */
}

We will not go into the anatomy of a pass yet. That is perhaps a topic for a follow-up post.

GENERIC

GENERIC is the first IR that gets generated from parsing the source code. It is a tree structure that has been present since the earliest gcc versions. Its core data structure is a tree_node, which is a hierarchy of structs, with tree_base as the Abraham. OK, if you haven’t been following gcc development, this can come as a surprise: a significant portion of gcc is now in c++!

It’s OK, take a minute to mourn/celebrate that.

The tree_node can mean a lot of things (it is a union), but the most important one is the tree_typed node. There are various types of tree_typed, like tree_int_cst, tree_identifier, tree_string, etc. that describe the elements of source code that they house. The base struct, i.e. tree_typed and even further up, tree_base have flags that have metadata about the elements that help in optimisation decisions. You’ll almost never use generic in the context of code traversal in optimisation passes, but the nodes are still very important to know about because GIMPLE continues to use them to store operand information. Take a peek at gcc/tree-core.h and gcc/tree.def to take a closer look at all of the types of nodes you could have in GENERIC.

What is GIMPLE you ask? Well, that’s our next IR and probably the most important one form the context of optimisations.

OK so now you have enough background to go look at the guts of the GENERIC tree IR in gcc. Here’s the gcc internals documentation that will help you navigate all of the convenience macros to analyze th tree nodes.

GIMPLE

The GIMPLE IR is a workhorse of gcc. Passes that operate on GIMPLE are architecture-independent.

The core structure in GIMPLE is a struct gimple that holds all of the metadata for a single gimple statement and is also a node in the list of gimple statements. There are various structures named gimple_statement_with_ops_* that have the actual operand information based on its type. Once again like with GENERIC, it is a hierarchy of structs. Note that the operands are all of the tree type so we haven’t got rid of all of GENERIC. gcc/gimple.h is where all of these structures are defined and gcc/gimple.def is where all of the different types of gimple statements are defined.

Where did my control flow go?

But how is it that a simple list of gimple statements is sufficient to traverse a program that has conditions and loops? Here is where things get interesting. Every sequence of GIMPLE statements is housed in a Basic Block (BB) and the compiler, when translating from GENERIC to GIMPLE (also known as lowering, since GIMPLE is structurally simpler than GENERIC), generates a Control FLow Graph (CFG) that describes the flow of the function from one BB to another. The only control flow idea one then needs in GIMPLE to traverse code from within the gimple statement context is a jump from one block to another and that is fulfilled by the GIMPLE_GOTO statement type. The CFG with its basic blocks and edges connecting those blocks, takes care of everything else. Routines to generate and manipulate the CFG are defined in gcc/cfg.h and gcc/cfg.c but beware, you don’t modify the CFG directly. Since CFG is tightly linked with GIMPLE (and RTL, yes that’s our third and final IR), it provides hooks to manipulate the graph and update GIMPLE if necessary.

The last interesting detail about CFG is that it has a special construct for loops, because they’re typically the most interesting subjects for optimisation: you can splice them, unroll them, distribute them and more to produce some fantastic performance results. gcc/cfgloop.h is there you’ll find all of the routines you need to traverse and manipulate loops.

The final important detail with regard to GIMPLE is the Single Static Assignment (SSA) form. Typical source code would have variables that get declared, assigned to, manipulated and then stored back into memory. Essentially, it is multiple operations on a single variable, sometimes destroying its previous contents as we reuse variables for different things that are logically related in the context of the high level program. This reuse however makes it complicated for a compiler to understand what’s going on and hence it ends up missing a host of optimisation opportunities.

To make things easier for optimisation passes, these variables are broken up into versions of themselves that have a specific start and end point in their lifecycle. This is called the Single Static Assignment form of a variable, where each version of the variable as a single starting point, viz. its definition. So if you have code like this:

    x = 10;
    x += 20;

it becomes:

    x_1 = 10;
    x_2 = x_1 + 20;

where x_1 and x_2 are versions of x. If you have versions of variables in conditions, things get interesting and the compiler deals with it with a mysterious concept called PHI nodes. So this code:

    if (n > 10)
      x = 10;
    else
      x = 20;
    return x;

becomes:

    if (n > 10)
      x_1 = 10;
    else
      x_2 = 20;
    # x_3 = PHI<x_1, x_2>;
    return x_3;

So the PHI node is a conditional selector of the earlier two versions of the variable and depending on the results of the optimisation passes, you could eliminate versions of the variables altogether or use CPU registers more efficiently to store those variable versions.

There you go, now you have everything you need to get started on hacking GIMPLE. I know this part is a bit heavy but guess what, this is where you can seriously start thinking about hacking on gcc! When you jump in, you’ll need the more detailed information in the gcc internals manual on GIMPLE, CFG and GIMPLE optimisations.

RTL

We are yet another step closer to generating assembly code for our assembler and linker to build into the final program. Recall that GIMPLE is largely architecture independent, so it works on high level ideas of statements, expressions and types and their relationships. RTL is much more primitive in comparison and is designed to mimic sequential architecture instructions. Its main purpose is to do architecture-specific work, such as register allocation and scheduling instructions in addition to more optimisation (because you can never get enough of that!) passes that make use of architecture information.

Internally, you will encounter two forms of RTL, one being the rtx struct that is used for most transformations and passes in the compiler. The other form is in text to map machine instructions to RTL and these are Lisp-like S expressions. These are used to make machine descriptions, where you can specify machine instructions for all of the common operations such as add, sub, multiply and so on. For example, this is what the description of a jump looks like for aarch64:

(define_insn "jump"
  [(set (pc) (label_ref (match_operand 0 "" "")))]
  ""
  "b\\t%l0"
  [(set_attr "type" "branch")]
)

Machine description files are in the gcc/config directory in their respective architecture subdirectory and have the .md suffix. The main aarch64 machine description file for example is gcc/config/aarch64/aarch64.md. These files are parsed by a tool in gcc to generate C code with rtx structures for each of those S-expressions.

In general, the gcc/config directory contains source code that handles compilation for that specific architecture. In addition to the machine descriptions these directories also have C code that enhance the RTL passes to exploit as much of th architecture information as they possibly can. This is where some of the detailed architecture-specific analysis of the RTL instructions go. For example, combination of loads into load pairs for aarch64 is an important task and it is done with a combination of machine description and some code to peek into and rearrange neighbouring RTL instructions

But then you’re wondering, why are there multiple description files? Other than just cleaner layout (put constraint information in a separate file, type information in another, etc.) it is because there are multiple evolutions of an architecture. The ‘i386’ architecture is a mess of evolutions that span word sizes and capabilities. The aarch64 architecture has within it many microarchitectures developed by various Arm licensee vendors like xgene, thunderxt88, falkor and also those developed by Arm such as the cortex-a57, cortex-a72, ares, etc. All of these have different behaviours and performance characteristics despite sharing an instruction set. For example on some microarchitecture one may prefer to emit the csel instruction instead of cmp, b.cond and multiple mov instructions to reduce code size (and hence improve performance) but on some other architecture, the csel instruction may have been designed really badly and hence it may well be cheaper to execute the 4+ instructions instead of the one csel. These are behaviour quirks that you select when you use the -mtune flag to optimise for a specific CPU. A lot of this information is also available in the C code of the architecture in the form of structures called cost tables. These are relative costs of various operations that help the RTL passes and some GIMPLE passes determine the best target code and optimisation behaviour accordingly for the CPU. Here’s an example of register move costs for the Qualcomm Centriq processor:

static const struct cpu_regmove_cost qdf24xx_regmove_cost =
{
  2, /* GP2GP  */
  /* Avoid the use of int<->fp moves for spilling.  */
  6, /* GP2FP  */
  6, /* FP2GP  */
  4 /* FP2FP  */
};

This tells us that in general, moving between general purpose registers is cheaper, moving between floating point registers is slightly more expensive and moving between general purpose and floating point registers is most expensive.

The other important detail in such machine description files is the pipeline description. This is a description of how the pipeline for a specific CPU microarchitecture is designed along with latencies for instructions in that pipeline. This information is used by the instruction scheduler pass to determine the best schedule of instructions for a CPU.

Now where do I start?

This is a lot of information to page in at once and if you’re like me, you’d want something more concrete to get started with understanding what gcc is doing. gcc has you covered there because it has flags that allow you to study the IR generated for the compiler at every stage of compilation. By every stage, I mean every pass! The -fdump-* set of flags allow you to dump the IR to study what gcc did in each pass. Particularly, -fdump-tree-* flags dump GIMPLE IR (in a C-like format so that it is not too complicated to read) into a file for each pass and the -fdump-rtl-* flags do the same for RTL. The GIMPLE IR can be dumped in its raw form as well (e.g. -fdump-tree-all-raw), which makes it much simpler to correlate with the code that is manipulating the GIMPLE in an optimisation pass.

The easiest way to get into gcc development (and compiler development in general) in my experience is the back end. Try tweaking the various cost tables to see what effect it has on code generation. Modify the instructions generated by the RTL descriptions and use that to look closer at one pass that interests you. Once you’re comfortable with making changes to gcc, rebuilding and checking its outputs, you can then try writing a pass, which is a slightly more involved process. Maybe I’ll write about it some day.

References

Comments