Bazel & Introduction: Part Two – Toolchains

Why is all of this so complicated?

Compiling C++ can be described by one word: complex. Most of this really hails back to how C++ was designed back in the day, but the reality is that the compilation process for any language is complex. C++ works as hard as it can to create an uber-fast binary, and as such the steps are a little more nasty to deal with. Here, we’ll break it down into the component steps of the process.

Basically, we’re pretending like we just ran g++ main.cpp thepolice.cpp

Preprocessor

The first thing that happens is the preprocessor runs through each file, and the associated header file. In this case, our tree for the project looks like this:

$ tree .
   |-- main.cpp
   |-- main.hpp
   |-- thepolice.cpp
   |-- thepolice.hpp

The .hpp files are called header files, and typically contain only declarations (templates being the notable exception, thank you very much). The .cpp files are the actual source files, and contain implementations. This is sort of out of scope for this post, but basically in the .hpp you put this (Java analogy):

  public int foo(short blah, Object o);

and in the .cpp file:

  public int foo(short blah, Object o) {
    // do stuff
    return (int) blah;
  }
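For the record, here is roughly what that split looks like in real C++ rather than Java-flavoured pseudocode. It’s a minimal sketch: both halves are shown in one listing so it compiles on its own, foo is the same made-up function as above, and the std::string parameter is just my stand-in for Object.

```cpp
#include <string>

// --- what would live in the .hpp file: a declaration only ---
// It tells the compiler that foo exists and what its signature is.
int foo(short blah, const std::string& o);

// --- what would live in the .cpp file: the definition ---
// This is the only place where the body is actually written out.
int foo(short blah, const std::string& o) {
    (void)o;  // unused in this sketch
    // do stuff
    return static_cast<int>(blah);
}
```

Any file that only needs to call foo can get by with the first half; the second half has to exist exactly once somewhere in the project.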

Back to the preprocessor. It reads each source file as text and produces another text file as output. Any lines which begin with ‘#’ are not actually C++; they’re written in the preprocessor language. #pragma once, #include and #ifndef are all preprocessor directives. I can’t go into a lot of detail here, but the important thing to note is that any file which #includes a header has that line replaced by the content of that header. As a result, the compiler sees a declaration for each thing named in the header, and can trust that it is implemented (written out) in some source file, even if that isn’t the file it is currently compiling. Making good on that promise is the linker’s job (we’re getting there).
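To make the text-substitution idea a bit more concrete, here is a small sketch of what a source file might look like once the same header has been pasted in twice. The names THEPOLICE_HPP, siren_count and SIREN_COUNT are made up for illustration; the include-guard pattern itself is the standard #ifndef trick.

```cpp
// First paste of the (hypothetical) header: THEPOLICE_HPP is not defined
// yet, so the preprocessor keeps this text and defines the guard macro.
#ifndef THEPOLICE_HPP
#define THEPOLICE_HPP
int siren_count();  // declaration only
#endif

// Second paste of the same header: the guard macro is now defined, so
// everything up to #endif is thrown away before the compiler ever sees it.
// Directives inside a skipped region are ignored, so this #error never fires.
#ifndef THEPOLICE_HPP
#error "the header body was pasted in twice"
#endif

// An object-like macro: plain text replacement, done before compilation.
#define SIREN_COUNT 3

// By the time the compiler runs, this line reads "return 3;".
int siren_count() { return SIREN_COUNT; }
```

If you want to see the real thing, g++ -E stops after this stage and prints the preprocessed text.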

Compiler

The compiler is probably the most straightforward step, but it does have one relevant subtlety. Basically, it takes your source code and converts it to machine code. So

  --- contents of iostream library from #include ---

  int main() {
    std::cout << "Hello, World\n";
    return 0;
  }

ends up looking more like

    8020	78
    8021	A9 80
    8023	8D 15 03
    8026	A9 2D
    8028	8D 14 03
    802B	58
    802C	60
    802D	EE 20 D0
    8030	4C 31 EA

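The numbers on the left are memory addresses, and those on the right are the contents of bytes starting at those addresses. (This listing is only here to show what machine code looks like; it isn’t the actual output for the program above.)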

The compiler takes in the output files from the preprocessor (.i extension for preprocessed C; GCC uses .ii for preprocessed C++) and outputs .o extension files, called “object files”.

Now for the subtlety. The object files are not just machine code. They also contain tags which reference external functions (those written in other files). Clearly, our project is nowhere near complete.

Note that the compiler doesn’t go straight from source code to machine code: the source is first translated into assembly code, and the assembly is then translated into machine code. Those are two distinct steps.

Linker

This step is where it all comes together, hence the name ‘Linker.’ Some people who build C++, myself included, actually break this whole process into two steps: preprocessor/compiler and linker. Basically, we run gcc -c, which stops once it has generated the object files, and then ld, which is the GNU linker.

As we noted before, the compiler couldn’t quite write pure machine code. The reason? The compiler only knows where stuff is in its own file. So instead it just writes notes on how it assumed stuff was laid out. Then the linker uses these notes to assign actual memory addresses to everything. This allows one file to call functions in another file. If you’re confused, do email me, because I can explain this better, just in more words.
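If it helps, here is a toy model of that note-taking scheme. To be clear, this is not how ld or any real object format works: ObjectFile, link_objects and link_example are names I invented, "machine code" is just a vector of ints, and an "address" is just an index into the final image. It only sketches the two passes: first decide where everything lives, then go back and fill in the placeholders.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Toy stand-in for an object file (real formats are far richer).
struct ObjectFile {
    std::vector<int> code;                       // pretend machine code
    std::map<std::string, std::size_t> defined;  // symbol -> offset in this file
    std::map<std::size_t, std::string> notes;    // placeholder offset -> symbol wanted
};

// Lay the files out back to back, then patch each placeholder with the
// final address of the symbol it refers to.
std::vector<int> link_objects(const std::vector<ObjectFile>& objects) {
    // Pass 1: assign a final address to every defined symbol.
    std::map<std::string, std::size_t> address;
    std::size_t base = 0;
    for (const ObjectFile& obj : objects) {
        for (const auto& entry : obj.defined) {
            address[entry.first] = base + entry.second;
        }
        base += obj.code.size();
    }

    // Pass 2: copy each file's code and resolve its notes.
    std::vector<int> image;
    for (const ObjectFile& obj : objects) {
        std::vector<int> code = obj.code;
        for (const auto& note : obj.notes) {
            auto found = address.find(note.second);
            if (found == address.end()) {
                throw std::runtime_error("undefined reference to " + note.second);
            }
            code.at(note.first) = static_cast<int>(found->second);
        }
        image.insert(image.end(), code.begin(), code.end());
    }
    return image;
}

// Example: "main.o" calls foo, which lives in "thepolice.o".
std::vector<int> link_example() {
    ObjectFile main_o{{10, 0, 11}, {{"main", 0}}, {{1, "foo"}}};
    ObjectFile police_o{{20, 21}, {{"foo", 0}}, {}};
    return link_objects({main_o, police_o});
}
```

In link_example, the placeholder at index 1 of main_o ends up holding 3, because foo sits at the start of police_o and police_o is placed right after the three words of main_o. Leave police_o out and you get the toy version of the familiar "undefined reference" error.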

Doing this by hand (why would you?)
  1. Run cpp which is the name of the preprocessor executable. (Preprocessor)
  2. Run gcc -S on the .i files output by the preprocessor, so that it stops after writing assembly. (Compiler)
  3. Run as on the .s files output by the compiler. (Assembler)
  4. Run ld on the .o files output by the assembler. (Linker)
  5. ???
  6. Profit?

edit: See this post


So that’s the full overview of the C++ compilation process. I totally didn’t cover multiarch like I meant to, so we’ll get to that next post, I suppose!

edit: Post is up, click here to read it!