Post-mortem of a novice’s foray into compiler design
Gather round, gather round, what follows is a tale of human folly and great woe. OK, it’s not actually that bad, but I did spend most of a working week on this, only to git reset --hard most of it away. I don’t want to say I wasted that time; I’ve since successfully done the task I was originally trying to accomplish, with a much better design that was informed by my experience here, so it’s less waste and more painful, painful learning. Anyway, now I’m giving away those lessons for free, so you should probably worship me or something, idk.
The setup
I have a language which is essentially a lisp. It’s not a “proper” lisp like Common Lisp or Scheme or whatever; it’s in the same neighbourhood as the lambda calculus, with an extra set of domain-specific instructions (which I won’t go into here because they aren’t relevant) and some other stuff to make it actually possible to write code in (let, modules, if, basic math operators, etc.) without wanting to get into the Forever Rc.
There’s no built-in support for recursion, though of course that doesn’t stop anyone who knows a bit of lambda calculus:
(let ((fix (lambda (f) (f f)))
      (y-comb (lambda (f) (fix (lambda (r) (f (lambda (y) ((r r) y)))))))
      (adder (lambda (self) (lambda (x) (lambda (lim) (if (== x lim) () ((self (+ 1 x)) lim)))))))
  (((y-comb adder) 0) 200))
(this is a strict variant of the Y combinator, or the why god why combinator if you’re very funny)
I’m writing a compiler for this language which outputs domain-specific instructions (rather than traditional assembly). A somewhat important thing to note is that the language is fully compile-time static: there are no runtime values. This means the compiler can theoretically (and does, because the domain instructions only support a few operations, and only on immediate or otherwise constant values) evaluate most of the language out of existence. This is a bit like boiling salt water, but instead of delicious crystals we get the domain instructions that were written by the programmer (which are black boxes at this level, beyond the number of arguments they take and their types).
The language is mostly dynamically typed, with the caveat that there is no mutability, and “dynamic” refers to the fact that a lambda has no predefined return type: it is allowed to produce different types depending on its arguments, but none of this is tracked at runtime because lambdas are evaluated at compile time. It is also strongly typed, i.e. there is no type coercion.
Frontend
I’ll skip over most of the details of the parser and lexer; they’re not hugely complex, barely a hundred lines (S-expressions are a piece of piss to transform into an AST). What’s important for this story is that the AST is transformed directly into a very simple subset of the IR. Non-empty lists are transformed into Apply expressions, symbols into Symbol expressions, etc.
IR overview
struct IrId { /* ... */ }
struct Span { /* ... */ }

struct Ident {
    span: Span,
    name: Box<str>,
}

struct IrNode {
    pub span: Span,
    pub id: IrId,
    pub kind: IrNodeKind,
}

enum IrNodeKind {
    Lambda(LambdaExpr),
    If(IfExpr),
    Op(OpExpr),
    Constant(ConstExpr),
    Symbol(String),                  // (1)
    Apply(Box<IrNode>, Vec<IrNode>), // (2)
    // ...
}

enum OpExpr {
    BinOp(BinOpKind, Box<IrNode>, Box<IrNode>),
    UnOp(UnOpKind, Box<IrNode>),
}

enum ConstExpr {
    Int(i64),
    Bool(bool),
    // ...
}
As you can see, every IrNode has an id which is used for cheap comparisons and lookups. It can also be used to diff the tree, since when a pass modifies a node it has to keep the IrId (enforced by only the AST -> IR step knowing how to make them). If I wanted to be really strict I could’ve made the id hidden and required some kind of map operation on the IrNode, which would have prevented the programmer from writing the bug where she just copied another node’s id and reused it, but I didn’t think of that at the time.
The AST is a strict subset of this IR: it can be mapped directly by turning every list into an Apply and everything else into either a Symbol or a Constant.
In order to actually be useful for optimisations and for lowering to the domain instructions, we need to map things to the full surface area of the IR, which can be done by passing over it and replacing the lisp patterns with their IR counterparts (you can cross off “mixing levels of abstraction” on your design-pitfalls bingo card now). For example:
(lambda (a b) (+ 9 26 7))

parses into the simple subset as:

Apply(
    Symbol("lambda"),
    [
        Apply(
            Symbol("a"),
            [Symbol("b")]
        ),
        Apply(
            Symbol("+"),
            [Int(9), Int(26), Int(7)]
        )
    ]
)
After the lambda pattern is recognised:

Lambda {
    params: ["a", "b"],
    body: Apply(
        Symbol("+"),
        [Int(9), Int(26), Int(7)]
    ),
}
And after the operators are resolved (the variadic + has to become a nest of binary ops, since BinOp only holds two operands):

Lambda {
    params: ["a", "b"],
    body: BinOp(
        Add,
        Int(9),
        BinOp(Add, Int(26), Int(7))
    ),
}
I joked earlier, but this mixing of abstraction levels was, I believe, my first big mistake. Not separating the rich IR from the simple IR at the type level made debugging much harder, as the question of whether a pass had simply not happened or had happened wrongly was not easily answerable without a lot of printing. It also made passes very dependent on the details of the passes before them, which was fun.
I’m getting ahead of myself though; back in the present, we still need a way to actually define passes and traverse the IR.
The Visitor (RIP Jake Sisko, I cri every tim)
Visitors are somewhat common in Rust; syn provides them as a way to traverse its IR, for example. syn also provides a mutable version of the visitor which allows visiting the IR with mut references, and hence in-place modification of it. This gave me a bright idea: define passes as visitors over the IR!
Now, I don’t have anything against visitors; back when I was a C++ stan, std::visit was my bread and butter when writing insane TMP programs (which was often). In Rust, however (and generally; Rust just won’t compile it), visitors which can mutate the thing they are visiting can be problematic. For example, if I modify a node in place, the order of visiting now matters, and it’s not explicit in the control flow. Essentially we’ve made out-params a fundamental part of our traversal API, which isn’t great (out-params are one of the truly evil things in the world, although if you’ve never written C++, just wait until you encounter out-params that don’t require a &mut when passing them… those are fun). This, predictably, caused at least 3 bugs, probably more, and caused another issue that I’ll tell you about when you’re older (specifically around 5-10 minutes older, depending on your reading speed and commitment).
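For concreteness, here’s roughly the shape of the thing (a sketch with my own names, modelled on syn’s VisitMut rather than lifted from the real code):

trait VisitMut {
    // each pass overrides the hooks it cares about; the default impls
    // recurse into the children
    fn visit_node_mut(&mut self, node: &mut IrNode) {
        walk_node_mut(self, node);
    }
}

fn walk_node_mut<V: VisitMut + ?Sized>(v: &mut V, node: &mut IrNode) {
    match &mut node.kind {
        IrNodeKind::Apply(head, args) => {
            v.visit_node_mut(head);
            for arg in args {
                v.visit_node_mut(arg);
            }
        }
        // ... the other variants recurse similarly
        _ => {}
    }
}

The implicit-ordering problem lives in walk_node_mut: whether head gets rewritten before args is a detail of the default traversal, not of any pass’s control flow.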
A slight detour into group theory
Usually a visitor does not produce anything, except maybe a Result<()>, but I decided to generalise with some group theory:
trait Semigroup: Sized {
    fn concat(self, other: Self) -> Self;
}

trait Monoid: Semigroup {
    fn empty() -> Self;
}
If you are unfamiliar with semigroups and monoids then I invite you to read the Wikipedia article on them, and then come back more confused. Properly explaining what a semigroup is would be another blog post on its own, but for our purposes this instance is what matters:
impl<T: Semigroup, E> Semigroup for Result<T, E> {
    fn concat(self, other: Self) -> Self {
        match self {
            Ok(lok) => match other {
                Ok(rok) => Ok(lok.concat(rok)),
                Err(e) => Err(e),
            },
            Err(e) => Err(e),
        }
    }
}
As you can see, if both sides are Ok then it passes the Ok through and applies the semigroup operation to the inner values. All other paths lead to Err. We can then build a chain of concats to produce our result.
There are two things to note about this design:
- It lets us make erroring optional: we don’t even have to .unwrap() at the end, just make the return type (), which is trivially a semigroup as it has only one, singleton, member
- It is somewhat fundamentally broken when it comes to Result, at least in the sense of preserving the semantics
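For the record, the trivial instance from the first point looks like this (not from the original code, but there’s really only one way to write it):

impl Semigroup for () {
    fn concat(self, _other: Self) -> Self {}
}

impl Monoid for () {
    fn empty() -> Self {}
}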
To see what I mean by (2), consider the following code:
/// This really should be on a type like `I64UnderAddition`, since there are
/// many valid semigroups for the integers, but still, it's an example
impl Semigroup for i64 {
    fn concat(self, o: Self) -> Self {
        self + o
    }
}

impl Monoid for i64 {
    fn empty() -> Self {
        0
    }
}

// this impl was implied rather than shown before: `run` needs it so that
// `R::empty()` exists when R is a Result
impl<T: Monoid, E> Monoid for Result<T, E> {
    fn empty() -> Self {
        Ok(T::empty())
    }
}

fn run<R: Monoid>(mut f: impl FnMut(i64) -> R) -> R {
    (0..1000).fold(R::empty(), |r, v| r.concat(f(v)))
}
let r = run(|v| {
    if v == 100 { Err("lower your expectations") }
    else if v > 100 { unreachable!() }
    else { Ok(v) }
});
assert!(r.is_err()); // never hit
This code will panic after hitting that unreachable!(). If you don’t see why, stare at it for a while, then stare at it some more. I didn’t notice this issue for a couple of days (and was rather confused by my log lines being hit after returning Err). The reason this code panics, even though we return an error before that point can ever be reached, is that although we propagate the error on concat, both sides are still evaluated: the fold keeps calling f for every element, regardless of whether an Err has already appeared. We have lost the short-circuiting/early-return semantics of Result.
This is where my second big mistake comes in: I did not write tests for the visitor code, nor did I write tests for the Monoid impl on Result. If I had written tests I probably would have thought through exactly what the behaviour was and caught this bug early. As it was, I spent days debugging the other bugs while contending with inaccurate log output (since the IR was modified in place, the later passes that ran between the error being emitted and it being handled changed the IR… that was fun) and preconditions that seemed obviously enforced but were actually not.
It’s also worth noting that restoring the short-circuiting semantics would be possible, and in fact quite easy, with something like std::ops::ControlFlow, but I wasn’t aware of its existence at the time, and it would be a bit annoying because ? would no longer work properly for the visitor if it wanted to return a Result.
IR passes
Now that we have a way to traverse the IR, we can finally start creating the passes that will progressively expand it and then lower it to the next stage (the domain-specific instructions, or more accurately a simplified version of them which gets “assembled” into the full versions, but anyway).
Name resolution
This level of the IR hasn’t had name resolution done yet, at least not at the type level. We can define a pass which does the resolution (by traversing the IR and resolving the scope of each variable), but it needs somewhere to store the output, somewhere which is also accessible to later passes.
This is where IrContext comes in. As the name suggests, it provides additional context to the visitors of the IR as they traverse it, and it can be modified by mutable visitors. This object holds all the scopes of variables as well as a map from IrIds to IR expressions. We will also need to modify the visitor to take an IrContext parameter along with the node it is visiting.
struct Scope {
    span: Span,
    variables: Vec<(Ident, IrId)>,
}

struct IrContext {
    scopes: Vec<Scope>,
    id_to_expr: HashMap<IrId, IrNode>,
}
Like the sensible people we are, we’re using IrId for the bindings rather than an IrNode, since if the IR is modified we want lookups to go to the modified version and not whatever it was when we created the scope and cloned the IrNode into it. I believe writers call this foreshadowing.
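Resolution itself then boils down to a lookup like this (a sketch, not verbatim from the project; Span::contains is an assumed helper, and IrId is assumed to be Copy):

impl IrContext {
    fn resolve(&self, name: &str, at: Span) -> Option<IrId> {
        // innermost scope wins: walk from the most recently pushed scope
        // outwards, only considering scopes whose span contains the use site
        self.scopes
            .iter()
            .rev()
            .filter(|scope| scope.span.contains(at))
            .find_map(|scope| {
                scope
                    .variables
                    .iter()
                    .rev()
                    .find(|(ident, _)| &*ident.name == name)
                    .map(|(_, id)| *id)
            })
    }
}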
Now we just have to modify our visitor methods to take &mut T, &[mut] IrContext and we are away (the reason the mut on IrContext is optional is that passes which don’t add variables or anything don’t need to modify it). If that signature sets alarm bells ringing then you have probably shared a few candlelit dinners with the borrow checker. Don’t worry, I’ll come back to this (and if you have no idea what I’m on about, don’t worry either; it’s non-obvious).
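In trait form, the mutable flavour looked something like this (again a reconstruction, not the original code):

trait VisitMutWithContext {
    type Output: Monoid;

    // two separate &mut parameters: `node` is NOT borrowed out of
    // `ctx.id_to_expr`. It can't be, the two borrows would alias.
    // Remember this for later.
    fn visit_node_mut(&mut self, node: &mut IrNode, ctx: &mut IrContext) -> Self::Output;
}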
Lambda inlining
If name resolution is pass 0, then lambda inlining is pass 1: our first proper pass. It finds all lambda call sites and inlines them recursively in place. This is necessary because the next level does not support lambdas. We could of course jump to a different position in the IR every time we encounter a lambda instead, but that would mean later passes having to know about lambdas and how they are evaluated (and, crucially, the scope for their evaluation, which we will get to in a minute).
In order to evaluate a lambda we need to know 3 things:
- The parent scope: for the captures
- The parameter names: to map the arguments to known names
- The arguments at the call site: I feel like this one is obvious
Getting (2) and (3) is more or less trivial: we can store the parameter names as part of the IR node for the lambda, and we of course know the arguments because we are at the call site.
(1) turned out to be quite a lot harder, although theoretically it shouldn’t be. Ideally you could simply ask the IrContext for the closest binding scope of a variable/span and it would calculate it for you. This is possible, mostly. It falls down a bit when you have a let block, because variables inside it are actually scoped to the entire block rather than progressively bound, but that’s a separate bug.
Although that gives us (1), it doesn’t solve binding arguments to parameters: because we have effectively “jumped” to the lambda’s location in the IR, the names of the arguments will no longer resolve, as their scope is different. We can fix this by simply pushing a new scope onto the stack, as sketched below. This one has the same span as the lambda expression we jumped to, and binds the parameter names to the argument expressions (there was actually yet another bug where I made the span global, which required lambda parameters to be globally unique or else get resolved wrongly when doing a nested lambda call).
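Something like this, assuming LambdaExpr is roughly { params: Vec<Ident>, body: Box<IrNode> } (it’s opaque in the snippets above, and this is my reconstruction; Ident is assumed Clone and Span Copy):

fn bind_call_site(ctx: &mut IrContext, lambda: &LambdaExpr, args: Vec<IrNode>) {
    ctx.scopes.push(Scope {
        // same span as the lambda body, so the bindings only resolve
        // inside the lambda we jumped to (NOT a global span, see above)
        span: lambda.body.span,
        variables: lambda
            .params
            .iter()
            .zip(&args)
            .map(|(param, arg)| (param.clone(), arg.id))
            .collect(),
    });
    // the argument expressions themselves go into the id map, so a later
    // symbol lookup can be expanded into them
    for arg in args {
        ctx.id_to_expr.insert(arg.id, arg);
    }
}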
Variable expansion
This is where things really started to go wrong.
This pass is the last one we do. After lambda inlining, resolving all our operators, etc., we can interpret any Symbols left in the IR as variable names and expand them, so that the lowering step doesn’t have to and can just read them directly.
Saying this pass was what made everything crash and burn would be victim blaming, and I won’t have it. This pass is simply an honourable whistleblower, calling out all the other passes for being bad at their jobs. Since it knows that all Symbols must be variables, it can emit a compile error when it sees one that isn’t defined in scope. Awesome! We can tell the user their code is wrong! Yeah, but, uh, one small problem: what if the other passes didn’t find all the symbols? Well then I guess we’ll just have to tell the user their code:

(lambda (x) (thing x))

clearly has not defined thing, and they should feel bad.
The part where it all catches fire, collapses, hits an iceberg, and then explodes oh god oh lord oh help
To pick up on that foreshadowing from earlier, you might be wondering: how does a visitor take both a &mut IrNode and a &mut IrContext? If the IrNodes are owned by the IrContext, surely the IrNode reference and the IrContext reference couldn’t both be mutably borrowed, since that would violate aliasing? Well… the IrContext’s IrNodes are not used as the arguments when visiting. They can’t be, due to those pesky aliasing rules I mentioned earlier. Still, that’s not a problem, it just means you have to store the root IrNode and the IrContext separately. It’s not like we are modifying the IrNode passed to the visitors or anyth-oh… oh

oh no
If you’re very observant you may have noticed a slight issue with this design. If a node gets modified in the IR but the context is not updated (which it can’t be unless it’s a mutable visitor, since updating id_to_expr requires the context to be mut), then later lookups will produce the now out-of-date version. You may remember that all our passes are built on modifying the IR nodes in place. This gets even worse when we consider the nesting involved: e.g. if I update the grandchild of A, then later ask the context to resolve A by its ID, it will happily give me back a version of A that does not have the modification to the grandchild.
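The failure mode in miniature (a contrived sketch; assume the node wasn’t a Constant to begin with):

fn demo(root: &mut IrNode, ctx: &mut IrContext) {
    // a pass rewrites the node it was handed, in place...
    root.kind = IrNodeKind::Constant(ConstExpr::Int(42));

    // ...but the clone sitting in id_to_expr is untouched, so a lookup
    // by id resurrects the pre-rewrite version
    let stale = &ctx.id_to_expr[&root.id];
    assert!(!matches!(stale.kind, IrNodeKind::Constant(_))); // out of sync!
}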
In fact this is the issue that the variable expansion pass showed up. Even if the previous passes had expanded everything correctly (which did need some fixing to actually be true), as soon as the variable expansion pass came along and looked up any expression in the context, it was given the original version from before any of those passes happened. It would then happily replace the symbol with this expression, obliterating any of the changes to its children in the process, and then fail when it couldn’t resolve the symbols that were (meant to have been) resolved by previous passes.
This has got to be one of my all-time top bugs (I mean, I’ve only had a few years to accumulate them, but still). Working out that this was what was happening took me days, and even without everything else going on I think it would’ve taken me a while. This is what happens when you try to mix mutability with .clone(), my friends.
Now, in retrospect, I really should’ve just cut my losses here and redesigned everything. I had picked a fundamentally flawed design and it was not going to be fixable. However, I did not cut my losses. I’d spent days on this, after all, and if I simply gave up after every horrifically subtle bug I would never get anything done, so instead I simply hacked around the problem like the great developer I am. IMO, knowing when to scrap your work and when to keep going is an essential skill as a software dev; everything is obvious in hindsight, and I did eventually admit defeat for the better, but sometimes you do just have to keep going at a problem until you break through.
An attempted fix
While the IR and the context getting out of sync is a problem, it’s also very easy to detect and fix. You can simply traverse the IR and find all the IrNodes where the IR version and the context version are not equal. Then you can just iterate over that list and replace the context version with the IR version; simplicity itself. So I made this run after every pass. It worked, and with my dirty hack in hand I proceeded on my merry way.
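The hack, roughly (a sketch; it assumes IrNode derives Clone and PartialEq, and the recursion into children is elided):

fn resync(root: &IrNode, ctx: &mut IrContext) {
    let mut stale = Vec::new();
    collect_stale(root, ctx, &mut stale);
    for node in stale {
        // the IR version wins: overwrite the context's copy
        ctx.id_to_expr.insert(node.id, node);
    }
}

fn collect_stale(node: &IrNode, ctx: &IrContext, out: &mut Vec<IrNode>) {
    if ctx.id_to_expr.get(&node.id) != Some(node) {
        out.push(node.clone());
    }
    // ... recurse into node's children
}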
You fool, you idiot, you absolute buffoon
This lasted a full day before it came crashing down when I tried to inline a nested lambda call, by which I mean a lambda that calls another lambda inside its body. The problem is that it starts to show up the flaws in the diff-and-patch approach: if the parent of an expression doesn’t have an update to one of its children but gets applied after the child, then the resultant context will still be out of sync.
The problem can be somewhat “fixed” by sorting the patches to ensure the parents get applied before the children, but this doesn’t fix the issue of the parent being out of sync with its children. I’m going to be honest: I don’t actually know how the parents got out of sync. I spent probably a full working day, all told, debugging just this, and I’m still not sure exactly how it happened. Unfortunately more of my earlier mistakes came back to bite me, and this whole mess was very difficult to debug or test.
This was pretty much the last straw. I had seen the writing on the wall for this design, and I had been prototyping several other approaches over the previous days (in the mornings generally, while I was getting back into the code for the day), but they were all either too complex or ran into the same sorts of issues. Repeatedly, either different copies would get out of sync, or the borrow checker would just refuse to accept my code because it could see that what I was doing was bad. The most promising approach, I think, was one that tried to emulate the lazy/query-driven compilation of rustc (or at least most of it). The problem was that it required a significant amount of supporting infrastructure and would have been a pretty big undertaking (and I had already burned most of a week at this point). After I worked out the source of this bug (another 3 hours or so of debugging), tried the fix mentioned above, and then had to debug why that wasn’t working either, I gave up and scrapped the whole thing.
2 subtle-as-hell bugs, and many, many more less subtle ones that were still not fun to debug given my lack of tests, issues with log output (I didn’t find that bug in Result’s Monoid impl until around this point, by the way), bugs in the visitor impl, and so much more. To quote our esteemed ex-PM Boris “He really said this” Johnson: “Them’s the breaks”.
Post-mortem
Let’s wrap this up, shall we? Overall I made a lot of mistakes with this design: it had mut everywhere, different passes had to know what the other ones did and had strict dependencies on each other, there were absolutely no tests, fundamental building blocks like the default visitor implementation were just broken (again, no tests), and many more problems besides.
I think, though, it boils down to a case of too much separation of concerns. The IR was quite broad but not very deep. What I mean by this is that it could express a lot of the constructs of the input language (lambdas, ifs, math, etc.) but you couldn’t actually manipulate those constructs, or even understand the semantics of the program, without an awful lot more work, which realistically required carrying around a context that cached some of it, more for ease of use than for performance.
After I finally abandoned this approach I switched back to what I was using before: a recursive evaluator which just holds a bunch of state as it traverses the AST. I then added another IR below it, which is far, far simpler, having only an abstract representation of the domain-specific operations and the immediate values they need. This IR is much easier to work with: it requires no IrContext shenanigans, and modifications are expressed using copy-on-write. It took many days of pain to learn these lessons, but I think I’m a better programmer for it, and I was being paid for my time, so I can live with it.
Appendix: Rust’s performance
I do just want to take a moment here to marvel at Rust’s performance. After writing in GC’d languages for a while, where everything is a linked list and you need 50 pointer lookups to call a method, it always blows me away how Rust and C++ can take what is objectively terrible code from a performance perspective and still have it run almost instantly. That id_to_expr hashmap was using $O(n^2)$ memory by my count (every IrNode contains every child IrNode, and every child IrNode was also in id_to_expr); debug-printing the context of a toy program gave me a 70MB log file! Yet even in debug mode it was almost instant.
The only time in this entire project where I actually hit performance issues was after I had abandoned this approach for a much more tightly bound frontend. It lowers the AST directly to the step below the one we’ve been discussing here. The actual evaluation is recursive with a value stack (now that I think about it, I could probably just make it return Option<T> and do away with the stack), and when a lambda is encountered it clones the current scope and stores it as part of its value (for use when evaluating). This wouldn’t be too much of an issue on its own, but combined with nested lambda calls it can become A Problem.
The scope struct is basically just a Vec<(String, Value)>, where Value is a big sum type holding all the possible results of evaluating an expression, so cloning it isn’t cheap, but it’s not too expensive either. Where it does become very expensive is the way values are taken out of it and evaluated: when a symbol is mapped to a value, it is cloned out of the scope and pushed onto the value stack. This means that whenever a lambda expression is resolved, its entire body and, more importantly, its entire scope is also cloned and, of course, if that scope contains other lambdas, their scopes are also cloned. This, perhaps unsurprisingly, caused that y-comb example I gave at the beginning to take 1-2 seconds per recursion of the adder in debug mode (and around 1 second in release; I didn’t benchmark this or anything, this is pure eyeball :p).
So, how did I solve yet another issue (caused by my stupidity) with .clone()? By using Rc of course! Now, before you laugh, this is actually the 1-in-10 times you should actually use an Rc. I want cheap copies of immutable data that needs multiple owners (owners in the C++ sense, not the Rust sense); I’m not just doing it to make the borrow checker happy. Anyway, I put Rc everywhere and now it runs instantly again.
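For flavour, the shape of the change was roughly this (a sketch with invented names; Expr stands in for whatever the evaluator actually walks):

use std::rc::Rc;

// Before, cloning a lambda Value deep-copied its body and every binding in
// its captured scope (and, recursively, the scopes of any lambdas inside
// those bindings). Now a clone is a few reference-count bumps.
#[derive(Clone)]
enum Value {
    Int(i64),
    Lambda {
        params: Rc<[String]>,
        body: Rc<Expr>,
        captured: Rc<EvalScope>,
    },
    // ...
}

#[derive(Clone)]
struct EvalScope {
    bindings: Vec<(String, Value)>,
}

struct Expr; // stand-in for the real expression type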