Bootstrapping Pachuco

If you want to build a self-hosting compiler for a new programming language, you hit an interesting problem: If the compiler is the first, and therefore the only, implementation of the language, and the compiler is written in the same language, how do you compile the compiler so that you can run it?

To break this circularity, you have to come up with a second implementation of your language, just for the purpose of running the compiler. This bootstrap implementation must be written in another language, in order to escape the circularity that affected the compiler. Once you have a bootstrap implementation, you can run the compiler under it. Now you can pass the source code from the compiler as the input to the running compiler, so that it compiles itself, resulting is an executable for the compiler. That compiler executable can now be run directly by the computer, so you no longer need the bootstrap implementation. You can now use the compiler executable to compile the compiler, and to compile later enhanced versions of the compiler; the compiler has become self-sustaining. This process is called boostrapping a compiler.

So how do you obtain that bootstrap implementation? Assuming the compiler is your main goal, you want to divert as little effort as possible to the bootstrap implementation. And because it has to be written in some other programming language, it's also a in the sense that you have to work in some boring established language rather than your shiny new language. So a classic approach is to build an interpreter to act as the bootstrap implementation, interpreters being generally shorter and simpler than compilers. Interpreters are slow (particularly simple ones), but in this case, that shouldn't matter: it only needs to run once to bootstrap the compiler.

That is not the course I followed with Pachuco. The problem with it is that the compiler doesn't just spring into existence, ready to run under the interpreter. Like any non-trivial program, the compiler gets written incrementally, satisfying a gradually growing set of requirements. So the relative slowness of the interpreter (a couple of orders of magnitude for a trivial metacircular interpreter in Lisp) does matter — it makes the compiler development process cumbersome. Furthermore, the compiler will contain bugs as its development proceeds, and these bugs have to be found and fixed. A trivial interpreter lacks good error reporting and debugging facilities that make for a comfortable development environment.

So instead, I took advantage of the similarity between Pachuco and Common Lisp, and also the powerful macro system present in both languages. The compiler is written in a hybrid language that can be read, via a layer of macros, as either Common Lisp or Pachuco. This allowed me to develop the compiler under sbcl, but compile it with itself when it became sufficiently complete and robust. Since sbcl is a compiling implementation of Common Lisp, the Pachuco compiler always ran quickly, and sbcl's REPL loop and debugging facilities helped things along.

Pachuco Background

I've been interested in the implementation of programming languages for many years. I've often indulged this interest by browsing the published literature on the subject, which is extensive. But you don't really understand something in computing until you've worked in that area yourself.

One of source of my fascination is the fact that something as expressive as a good programming language can be mapped onto a dumb machine. Interpreters, and compilers that target a virtual machine, may be an important and pragmatic design point, but they are not as exciting as true compilers that target the bare metal.

Unfortunately, implementations of production compilers tend to be large codebases with unnecessary complexity due to historical baggage, just like most mature software projects (I'm basing this on my explorations of open-source compilers, but I'm sure that proprietary compilers are no better). So they don't make attractive environments for explorations of language implementation techniques. Another issue with many production compilers is that they are implemented in C and C++. These languages still have their place in the world, but they are no longer a good answer to the challenges encountered when writing a compiler.

So for a number of years I had an urge to tinker with a compiler, but couldn't find a compiler worth tinkering with. Influenced by texts such as Structure and Interpretation of Computer Programs, I came to the conclusion that writing a compiler for a Lisp/Scheme-like language in that language is not such a big deal. The effort involved is significant, but not infeasible, and it's a stimulating way to spend your free time.

So last year, when I had a block of time that I could commit to getting the project off the ground, I started working on Pachuco. It also helped that I had just got my first x86-64 machine, and a wish to learn about x86-64 machine code.

Pachuco

Pachuco is a recreational project I've been working on for a while. It's a self-hosting compiler for a dialect of the Lisp programming language (and it also has a lot in common with Scheme). It generates i386 or x86-64 assembly code directly, rather than compiling to another high-level language or virtual machine. The system is about 5000 lines of code, including the compiler, runtime, garbage collector, and interpreter (used in the macro system). The whole system is written in the Pachuco language, except for a minimal amount of C and Common Lisp code to allow the system to be bootstrapped.

Pachuco is supposed to be an interesting toy, and it is unlikely ever to be useful in any direct way. I plan to write about what aspects of the project might be of interest in future posts. But I suspect that people either find programming lanugage implementations fascinating, or they don't. If, like me, you do, then take a look.

Coming soon to this blog

I'm at the end of my first year at LShift. Time, as they say, has flown.

Before I joined the company, I took some time off between jobs, to do some traveling, get married, stuff like that. I also worked on a few recreational programming projects, none of which I got around to publishing. Until now, that is!

One project in particular will probably be the main focus of this blog for some time to come. It's not the most original project in the world, but it continues to hold my interest, so hopefully I'll have no trouble writing about it on a regular basis.

Beagle Board

I'm always on the look out for small silent power-sipping computers. My home server is an NSLU2, and I love it in many ways, but its performance and memory size remind me of the PC I had 15 years ago.

The other day I discovered the Beagle Board. It has a 500MHz superscalar ARM processor, 128MB of RAM, an SD card slot, a USB port, and video output, all for 150 USD. Lots of information is available at beagleboard.org and the embedded Linux wiki. And there are a bunch of demo videos on youtube.

This might well be the successor to my NSLU2 (though I'll wait for the Rev C board, which should have the main USB port working — the earlier revs have to do everything through the little USB OTG port). It's a shame there is no case for it, but I'm sure I'll be able to work something out.

1 comment