Self Hosted Compilers
High-level languages can (mostly) be divided into 2 categories: compiled & interpreted. There don’t tend to be differences in what they can actually do, but compiled languages have one curious property: the compiler can be written in the same language it compiles. Such a compiler is called self-hosted. If you think that sounds recursively nonsensical, then you’re right! But it actually makes some sense.
How It Works
In order to compile a self-hosted compiler, you need to go through a process known as bootstrapping. The steps involved are in principle very straightforward, but the actual process is often rather involved and can be quite time consuming.
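To make the shape of the process concrete before going through the stages one by one, here is a minimal Python sketch of the chain. Everything in it is hypothetical: `bootstrap` and `fake_build` are names I made up, and the "build" just produces a label rather than running a real compiler. The only point is that each stage's output becomes the compiler for the next stage.

```python
from typing import Callable, List


def bootstrap(stage0: str, build: Callable[[str, int], str], stages: int = 3) -> List[str]:
    """Chain builds together: the compiler from stage N builds stage N + 1."""
    compilers = [stage0]
    for stage in range(1, stages + 1):
        # The most recently produced compiler is always the one doing the building.
        compilers.append(build(compilers[-1], stage))
    return compilers


def fake_build(compiler: str, stage: int) -> str:
    """Stand-in for a real build step (an assumption, not a real build system):
    it only records which compiler produced which stage."""
    builder = compiler.split(" ")[0]
    return f"stage{stage} (built by {builder})"


chain = bootstrap("stage0", fake_build)
for compiler in chain:
    print(compiler)
```

In a real project, `fake_build` would be replaced by whatever actually invokes the compiler on its own source tree, which is where all the time goes.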
Stage 0 - The Native Compiler
First, you need to acquire a compiler that already exists & works. This tends to be a previous version of the compiler, but can also be one written in a different language specifically for bootstrapping purposes. Where the native compiler comes from can also change over time; Rust, for example, started out with a compiler written in OCaml but now uses a prebuilt earlier release of itself (the current beta or the previous stable, depending on which branch is being built) as the native compiler, generally downloaded from the internet. GCC is different in this regard, as it just assumes that you already have a C compiler (how could you not), or a C++ one in the case of recent GCC versions, & doesn’t have a standard mechanism for acquiring one if you don’t.
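The "just assume one is already installed" approach boils down to searching the `PATH`. A rough sketch of that check, using Python's standard `shutil.which`; the function name `find_native_compiler` and the candidate list are my own assumptions, not what any real build system uses:

```python
import shutil
from typing import Optional, Sequence


def find_native_compiler(candidates: Sequence[str] = ("cc", "gcc", "clang")) -> Optional[str]:
    """Return the path of the first candidate found on PATH, or None if there is none."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


if __name__ == "__main__":
    compiler = find_native_compiler()
    if compiler is None:
        # This is the situation where a downloadable stage 0 would come in handy.
        print("no native compiler found; one has to be installed or downloaded first")
    else:
        print(f"using {compiler} as the stage 0 compiler")
```

A build system that downloads its stage 0 instead would replace the `None` branch with a fetch of a known-good prebuilt compiler.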
The native compiler is generally more ‘stupid’ than the final one produced, as in this role it’s never meant to actually be used by an end-user - it only needs to build the next stage - and of course an older version of the compiler would be worse. That’s the whole reason for updating it! Furthermore, if the native compiler is in a different language, then maintaining it defeats the purpose of self-hosting (you’d be maintaining 2 different compilers for no reason), so it usually only exists to kickstart the bootstrapping process & is removed once the self-hosted compiler can stand on its own (this is what the Zig programming language did).
Stage 1 - The Unoptimised Optimiser
Once you have your native compiler set up, it’s time to compile the one you actually care about. Except, not really! The first build of the compiler will be pretty poorly optimised due to being built by the worse native compiler, so this still isn’t the final compiler that’ll be built. Alongside the stage 1 compiler, the language’s standard library also tends to be built (I’ll refer to them collectively as artifacts), as the compiler just about always depends on it. The exception is, again, C - it just assumes you have a good enough libc already somewhere on your computer & dynamically links against that (like any other C program), which usually Just Works™, but when it doesn’t it’s absolute hell - miscompilations that happen without there being a bug in the compiler are unbelievably frustrating.
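The key idea here is that what a compiler does is decided by its source code, while how well the compiler itself runs is decided by whoever built it. A toy Python model of that distinction follows; the `Source`, `Binary`, and `build` names and the numeric "optimiser levels" are all invented for illustration and don't correspond to any real toolchain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    name: str
    optimiser_level: int  # how good the optimiser described by this source code is


@dataclass(frozen=True)
class Binary:
    source: Source     # decides what the binary does when it is run
    optimised_at: int  # optimiser level of whichever compiler built this binary


def build(compiler: Binary, source: Source) -> Binary:
    """Compile `source` with `compiler`: the result behaves like `source`,
    but is only as well optimised as `compiler`'s own optimiser allows."""
    return Binary(source=source, optimised_at=compiler.source.optimiser_level)


old_source = Source("old compiler", optimiser_level=1)
new_source = Source("new compiler", optimiser_level=3)

stage0 = Binary(old_source, optimised_at=1)  # the native compiler
stage1 = build(stage0, new_source)           # new behaviour, old optimisation quality

print(stage1.source.name, "optimised at level", stage1.optimised_at)
```

In this model, `stage1` already contains the level 3 optimiser, but it was itself only optimised at level 1. Running `build(stage1, new_source)` would produce a binary optimised at level 3, which is why one more build is needed.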
Stage 2 - The Real Deal
The artifacts are built once again, this time by the stage 1 compiler, so they get the full benefit of the new optimiser. The compiler is finally production ready. It produces optimised code quickly (unless your language is C++ or Rust), and can be distributed to end users without fear - except the usual one that a code goblin spuriously introduced bugs into your perfect code. However, bootstrapping usually doesn’t end here. For extra safety, one last build of the compiler & standard library takes place.
Stage 3 - The Real Deal, Again
In this stage, the stage 2 compiler is used to build the stage 3 artifacts. These are then compared with those from stage 2 to ensure they’re byte-for-byte identical. Both were built from the same source by compilers that should behave identically, so any difference points to a miscompilation causing the compiler to generate the wrong machine code. There’s literally nothing else to this stage, and it can be skipped without ruining anything.
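The comparison itself is nothing fancy. Here is a minimal sketch of a byte-for-byte check in Python, assuming (hypothetically) that each stage's artifacts live in their own directory; `digest_tree` and `differing_artifacts` are names I made up, and real build systems have their own comparison steps.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def digest_tree(root: Path) -> Dict[str, str]:
    """Map each file under `root` (by relative path) to the SHA-256 of its contents."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            digests[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def differing_artifacts(stage2: Path, stage3: Path) -> List[str]:
    """List every artifact that is missing from one stage or differs between the two."""
    a, b = digest_tree(stage2), digest_tree(stage3)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

An empty list from `differing_artifacts` means the two stages match; anything else is a sign that something went wrong along the way. Note that this only works if the build is deterministic, so things like embedded timestamps would need to be dealt with first.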
Why Do It
Despite being a really weird idea, there are advantages to doing this. For one, it means people working on the compiler only need to know one language rather than two (the one the compiler is written in, and the one it compiles), which lowers the barrier to entry for new contributors who just want to work on a language they like. Secondly, it allows the developers to dogfood the compiler. Dogfooding refers to the developers making use of the program themselves, which helps them gain better insight into UI/UX issues & which features should be added. The same holds for the standard library, which the compiler would make use of - using a program yourself is a really good way to come up with improvements & fixes for it without relying on extensive user bug reports.
Why Not Do It
While self-hosting is absolutely magical, enough so that many of the compiled languages around today are self-hosted, there are valid reasons to choose not to self-host.
The primary reason is that kickstarting the bootstrapping process is very hard. First, you need to write a compiler for your language in another one, and keep using it until your language is mature enough to write a compiler in. Then, you need to actually write the compiler in your new language, which will be a massive pain because there most likely won’t be any libraries you can use to ease the process of parsing, optimising, and code generation (have fun writing LLVM bindings yourself!), and you’ll probably need to do this while keeping the old compiler up to date as bug fixes come in. That upfront cost is why Yorick Peterse, creator of the Inko programming language, explicitly decided against it. Additionally, there’s the problem of what to do if a native compiler doesn’t magically already exist & can’t easily be obtained. At that point, the only way to build your compiler is to compile a bootstrapper written in some other language - which is what mrustc aims to achieve for Rust, being written in the devil’s tongue, C++.
Wrapping Up
Self-hosted compilers are an immense feat of human ingenuity & software engineering prowess. While the bootstrapping process may be time-consuming, especially if you can only bootstrap from the immediately previous version (DAMN YOU RUST), it can be a huge help in many ways - not least of which is the opportunity to learn how production compilers are made & organised by going through the process of bootstrapping one. It’s a difficult process to get set up, but once it works, it works well, proving that even ostensibly nonsensical ideas can be implemented to great effect.