Post-mortem of a novice’s foray into compiler design
Gather round, gather round, what follows is a tale of human folly and great woe. OK, it’s not actually that bad, but I did spend most of a working week on this, only to git reset --hard most of it away. I don’t want to say I wasted that time; I’ve since successfully done the task I was originally trying to accomplish, with a much better design that was informed by my experience here, so it’s less waste and more painful, painful learning. Anyway, now I’m giving away those lessons for free, so you should probably worship me or something, idk.
The setup
I have a language which is essentially a lisp. It’s not a “proper” lisp like Common Lisp or Scheme or whatever; it’s in the same neighbourhood as the lambda calculus, with an extra set of domain-specific instructions (which I won’t go into here because they aren’t relevant) and some other stuff to make it actually possible to write code in (let, modules, if, basic math operators, etc.) without wanting to get into the Forever Rc.
There’s no built-in support for recursion, though of course that doesn’t stop anyone who knows a bit of lambda calculus:
(let ((fix (lambda (f) (f f)))
      (y-comb (lambda (f) (fix (lambda (r) (f (lambda (y) ((r r) y)))))))
      (adder (lambda (self) (lambda (x) (lambda (lim) (if (== x lim) () ((self (+ 1 x)) lim)))))))
  (((y-comb adder) 0) 200))
(this is a strict variant of the Y combinator, or the why god why combinator if you’re very funny)
I’m writing a compiler for this language which outputs domain-specific instructions (rather than traditional assembly). A somewhat important thing to note is that the language is fully compile-time static: there are no runtime values. This means the compiler can theoretically (and does, because the domain instructions only support a few operations, and only on immediate or otherwise constant values) evaluate most of the language out of existence. This is a bit like boiling salt water, but instead of delicious crystals we get the domain instructions that were written by the programmer (which are black boxes at this level, beyond the number of arguments they take and their types).
The language is mostly dynamically typed, with the caveat that there is no mutability, and “dynamic” refers to the fact that a lambda has no predefined return type: it is allowed to produce different types depending on its arguments, but none of this is tracked at runtime because lambdas are evaluated at compile time. It is also strongly typed, i.e. there is no type coercion.
Frontend
I’ll skip over most of the details of the parser and lexer; they’re not hugely complex, barely a hundred lines (S-expressions are a piece of piss to transform into an AST). What’s important for this story is that the AST is transformed directly into a very simple subset of the IR. Non-empty lists are transformed into Apply expressions, symbols into Symbol expressions, etc.
IR overview
struct IrId { /* ... */ }
struct Span { /* ... */ }

struct Ident {
    span: Span,
    name: Box<str>,
}

struct IrNode {
    pub span: Span,
    pub id: IrId,
    pub kind: IrNodeKind,
}

enum IrNodeKind {
    Lambda(LambdaExpr),
    If(IfExpr),
    Op(OpExpr),
    Constant(ConstExpr),
    Symbol(String),                  // (1)
    Apply(Box<IrNode>, Vec<IrNode>), // (2)
    // ...
}

enum OpExpr {
    BinOp(BinOpKind, Box<IrNode>, Box<IrNode>),
    UnOp(UnOpKind, Box<IrNode>),
}

enum ConstExpr {
    Int(i64),
    Bool(bool),
    // ...
}
As you can see, every IrNode has an id which is used for cheap comparisons and lookups. It can also be used to diff the tree, since when a pass modifies a node it has to keep the IrId (enforced by only the AST -> IR step knowing how to make them). If I wanted to be really strict I could’ve made the id hidden and required some kind of map operation on the IrNode, which would have prevented the programmer from writing the bug where she just copied another node’s id and reused it, but I didn’t think of that at the time.
The AST is a strict subset of this IR: it can be mapped directly by turning every list into an Apply and everything else into either a Symbol or a Constant.
In order to actually be useful for optimisations and for lowering to the domain instructions, we need to map things to the full surface area of the IR, which can be done by passing over it and replacing the lisp patterns with their IR counterparts (you can cross off “mixing levels of abstraction” on your design-pitfalls bingo card now). For example:
(lambda (a b) (+ 9 26 7))

parses into the simple subset as:

Apply(
    Symbol("lambda"),
    [
        Apply(
            Symbol("a"),
            [Symbol("b")]
        ),
        Apply(
            Symbol("+"),
            [Int(9), Int(26), Int(7)]
        )
    ]
)
After the lambda pattern is recognised:

Lambda {
    params: ["a", "b"],
    body: Apply(
        Symbol("+"),
        [Int(9), Int(26), Int(7)]
    ),
}
And after the operators are resolved (the variadic + has to become a nest of binary ops, since BinOp only holds two operands):

Lambda {
    params: ["a", "b"],
    body: BinOp(
        Add,
        Int(9),
        BinOp(Add, Int(26), Int(7))
    ),
}
I joked earlier, but this mixing of abstraction levels was, I believe, my first big mistake. Not separating the rich IR from the simple IR at the type level made debugging much harder, as the question of whether a pass had simply not happened or had happened wrongly was not easily answerable without a lot of printing. It also made passes very dependent on the details of the passes before them, which was fun.
I’m getting ahead of myself though; back in the present, we still need a way to actually define passes and traverse the IR.
The Visitor (RIP Jake Sisko, I cri every tim)
Visitors are somewhat common in Rust; syn provides them as a way to traverse its IR, for example. syn also provides a mutable version of the visitor which allows visiting the IR with mut references, and hence in-place modification of it. This gave me a bright idea: define passes as visitors over the IR!
Now, I don’t have anything against visitors; back when I was a C++ stan, std::visit was my bread and butter when writing insane TMP programs (which was often). In Rust, however (and generally; Rust just won’t compile it), visitors which can mutate the thing they are visiting can be problematic. For example, if I modify a node in place, the order of visiting now matters, and it’s not explicit in the control flow. Essentially we’ve made out-params a fundamental part of our traversal API, which isn’t great (out-params are one of the truly evil things in the world, although if you’ve never written C++, just wait until you encounter out-params that don’t require a &mut when passing them… those are fun). This, predictably, caused at least 3 bugs, probably more, and caused another issue that I’ll tell you about when you’re older (specifically around 5-10 minutes older, depending on your reading speed and commitment).
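For concreteness, here’s roughly the shape of the thing (a sketch with my own names, modelled on syn’s VisitMut rather than lifted from the real code):

trait VisitMut {
    // each pass overrides the hooks it cares about; the default impls
    // recurse into the children
    fn visit_node_mut(&mut self, node: &mut IrNode) {
        walk_node_mut(self, node);
    }
}

fn walk_node_mut<V: VisitMut + ?Sized>(v: &mut V, node: &mut IrNode) {
    match &mut node.kind {
        IrNodeKind::Apply(head, args) => {
            v.visit_node_mut(head);
            for arg in args {
                v.visit_node_mut(arg);
            }
        }
        // ... the other variants recurse similarly
        _ => {}
    }
}

The implicit-ordering problem lives in walk_node_mut: whether head gets rewritten before args is a detail of the default traversal, not of any pass’s control flow.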
A slight detour into group theory
Usually a visitor does not produce anything, except maybe a Result<()>, but I decided to generalise with some group theory:
trait Semigroup: Sized {
    fn concat(self, other: Self) -> Self;
}

trait Monoid: Semigroup {
    fn empty() -> Self;
}
If you are unfamiliar with semigroups and monoids then I invite you to read the Wikipedia article on them, and then come back more confused. Properly explaining what a semigroup is would be another blog post on its own, but for our purposes this instance is what matters:
impl<T: Semigroup, E> Semigroup for Result<T, E> {
    fn concat(self, other: Self) -> Self {
        match self {
            Ok(lok) => match other {
                Ok(rok) => Ok(lok.concat(rok)),
                Err(e) => Err(e),
            },
            Err(e) => Err(e),
        }
    }
}
As you can see, if both sides are Ok then it passes the Ok through and applies the semigroup operation to the inner values. All other paths lead to Err. We can then build a chain of concats to produce our result.
There are two things to note about this design:
- It lets us make erroring optional: we don’t even have to .unwrap() at the end, just make the return type (), which is trivially a semigroup as it has only one, singleton, member
- It is somewhat fundamentally broken when it comes to Result, at least in the sense of preserving the semantics
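For the record, the trivial instance from the first point looks like this (not from the original code, but there’s really only one way to write it):

impl Semigroup for () {
    fn concat(self, _other: Self) -> Self {}
}

impl Monoid for () {
    fn empty() -> Self {}
}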
To see what I mean by (2), consider the following code:
/// This really should be on a type like `I64UnderAddition`, since there are
/// many valid semigroups for the integers, but still, it's an example
impl Semigroup for i64 {
    fn concat(self, o: Self) -> Self {
        self + o
    }
}

impl Monoid for i64 {
    fn empty() -> Self {
        0
    }
}

// this impl was implied rather than shown before: `run` needs it so that
// `R::empty()` exists when R is a Result
impl<T: Monoid, E> Monoid for Result<T, E> {
    fn empty() -> Self {
        Ok(T::empty())
    }
}

fn run<R: Monoid>(mut f: impl FnMut(i64) -> R) -> R {
    (0..1000).fold(R::empty(), |r, v| r.concat(f(v)))
}
let r = run(|v| {
    if v == 100 { Err("lower your expectations") }
    else if v > 100 { unreachable!() }
    else { Ok(v) }
});
assert!(r.is_err()); // never hit
This code will panic after hitting that unreachable!(). If you don’t see why, stare at it for a while, then stare at it some more. I didn’t notice this issue for a couple of days (and was rather confused by my log lines being hit after returning Err). The reason this code panics, even though we return an error before that point can ever be reached, is that although we propagate the error on concat, both sides are still evaluated: the fold keeps calling f for every element, regardless of whether an Err has already appeared. We have lost the short-circuiting/early-return semantics of Result.
This is where my second big mistake comes in: I did not write tests for the visitor code, nor did I write tests for the Monoid impl on Result. If I had written tests I probably would have thought through exactly what the behaviour was and caught this bug early. As it was, I spent days debugging the other bugs while contending with inaccurate log output (since the IR was modified in place, the later passes that ran between the error being emitted and it being handled changed the IR… that was fun) and preconditions that seemed obviously enforced but were actually not.
It’s also worth noting that restoring the short-circuiting semantics would be possible, and in fact quite easy, with something like std::ops::ControlFlow, but I wasn’t aware of its existence at the time, and it would be a bit annoying because ? would no longer work properly for the visitor if it wanted to return a Result.
IR passes
Now that we have a way to traverse the IR, we can finally start creating the passes that will progressively expand it and then lower it to the next stage (the domain-specific instructions, or more accurately a simplified version of them which gets “assembled” into the full versions, but anyway).
Name resolution
This level of the IR hasn’t had name resolution done yet, at least not at the type level. We can define a pass which does the resolution (by traversing the IR and resolving the scope of each variable), but it needs somewhere to store the output, somewhere which is also accessible to later passes.
This is where IrContext comes in. As the name suggests, it provides additional context to the visitors of the IR as they traverse it, and it can be modified by mutable visitors. This object holds all the scopes of variables as well as a map from IrIds to IR expressions. We will also need to modify the visitor to take an IrContext parameter along with the node it is visiting.
struct Scope {
    span: Span,
    variables: Vec<(Ident, IrId)>,
}

struct IrContext {
    scopes: Vec<Scope>,
    id_to_expr: HashMap<IrId, IrNode>,
}
Like the sensible people we are, we’re using IrId for the bindings rather than an IrNode, since if the IR is modified we want lookups to go to the modified version and not whatever it was when we created the scope and cloned the IrNode into it. I believe writers call this foreshadowing.
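Resolution itself then boils down to a lookup like this (a sketch, not verbatim from the project; Span::contains is an assumed helper, and IrId is assumed to be Copy):

impl IrContext {
    fn resolve(&self, name: &str, at: Span) -> Option<IrId> {
        // innermost scope wins: walk from the most recently pushed scope
        // outwards, only considering scopes whose span contains the use site
        self.scopes
            .iter()
            .rev()
            .filter(|scope| scope.span.contains(at))
            .find_map(|scope| {
                scope
                    .variables
                    .iter()
                    .rev()
                    .find(|(ident, _)| &*ident.name == name)
                    .map(|(_, id)| *id)
            })
    }
}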
Now we just have to modify our visitor methods to take &mut T, &[mut] IrContext and we are away (the reason the mut on IrContext is optional is that passes which don’t add variables or anything don’t need to modify it). If that signature sets alarm bells ringing then you have probably shared a few candlelit dinners with the borrow checker. Don’t worry, I’ll come back to this (and if you have no idea what I’m on about, don’t worry either; it’s non-obvious).
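In trait form, the mutable flavour looked something like this (again a reconstruction, not the original code):

trait VisitMutWithContext {
    type Output: Monoid;

    // two separate &mut parameters: `node` is NOT borrowed out of
    // `ctx.id_to_expr`. It can't be, the two borrows would alias.
    // Remember this for later.
    fn visit_node_mut(&mut self, node: &mut IrNode, ctx: &mut IrContext) -> Self::Output;
}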
Lambda inlining
If name resolution is pass 0, then lambda inlining is pass 1: our first proper pass. It finds all lambda call sites and inlines them recursively in place. This is necessary because the next level does not support lambdas. We could of course jump to a different position in the IR every time we encounter a lambda instead, but that would mean later passes having to know about lambdas and how they are evaluated (and, crucially, the scope for their evaluation, which we will get to in a minute).
In order to evaluate a lambda we need to know 3 things:
- The parent scope: for the captures
- The parameter names: to map the arguments to known names
- The arguments at the call site: I feel like this one is obvious
Getting (2) and (3) is more or less trivial: we can store the parameter names as part of the IR node for the lambda, and we of course know the arguments because we are at the call site.
(1) turned out to be quite a lot harder, although theoretically it shouldn’t be. Ideally you could simply ask the IrContext for the closest binding scope of a variable/span and it would calculate it for you. This is possible, mostly. It falls down a bit when you have a let block, because variables inside it are actually scoped to the entire block rather than progressively bound, but that’s a separate bug.
Although that gives us (1), it doesn’t solve binding arguments to parameters: because we have effectively “jumped” to the lambda’s location in the IR, the names of the arguments will no longer resolve, as their scope is different. We can fix this by simply pushing a new scope onto the stack, as sketched below. This one has the same span as the lambda expression we jumped to, and binds the parameter names to the argument expressions (there was actually yet another bug where I made the span global, which required lambda parameters to be globally unique or else get resolved wrongly when doing a nested lambda call).
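Something like this, assuming LambdaExpr is roughly { params: Vec<Ident>, body: Box<IrNode> } (it’s opaque in the snippets above, and this is my reconstruction; Ident is assumed Clone and Span Copy):

fn bind_call_site(ctx: &mut IrContext, lambda: &LambdaExpr, args: Vec<IrNode>) {
    ctx.scopes.push(Scope {
        // same span as the lambda body, so the bindings only resolve
        // inside the lambda we jumped to (NOT a global span, see above)
        span: lambda.body.span,
        variables: lambda
            .params
            .iter()
            .zip(&args)
            .map(|(param, arg)| (param.clone(), arg.id))
            .collect(),
    });
    // the argument expressions themselves go into the id map, so a later
    // symbol lookup can be expanded into them
    for arg in args {
        ctx.id_to_expr.insert(arg.id, arg);
    }
}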
Variable expansion
This is where things really started to go wrong.
This pass is the last one we do. After lambda inlining, resolving all our operators, etc., we can interpret any Symbols left in the IR as variable names and expand them, so that the lowering step doesn’t have to and can just read them directly.
Saying this pass was what made everything crash and burn would be victim blaming, and I won’t have it. This pass is simply an honourable whistleblower, calling out all the other passes for being bad at their jobs. Since it knows that all Symbols must be variables, it can emit a compile error when it sees one that isn’t defined in scope. Awesome! We can tell the user their code is wrong! Yeah, but, uh, one small problem: what if the other passes didn’t find all the symbols? Well then I guess we’ll just have to tell the user their code:

(lambda (x) (thing x))

clearly has not defined thing, and they should feel bad.
The part where it all catches fire, collapses, hits an iceberg, and then explodes oh god oh lord oh help
To pick up on that foreshadowing from earlier, you might be wondering: how does a visitor take both a &mut IrNode and a &mut IrContext? If the IrNodes are owned by the IrContext, surely the IrNode reference and the IrContext reference couldn’t both be mutably borrowed, since that would violate aliasing? Well… the IrContext’s IrNodes are not used as the arguments when visiting. They can’t be, due to those pesky aliasing rules I mentioned earlier. Still, that’s not a problem, it just means you have to store the root IrNode and the IrContext separately. It’s not like we are modifying the IrNode passed to the visitors or anyth-oh… oh

oh no
If you’re very observant you may have noticed a slight issue with this design. If a node gets modified in the IR but the context is not updated (which it can’t be unless it’s a mutable visitor, since updating id_to_expr requires the context to be mut), then later lookups will produce the now out-of-date version. You may remember that all our passes are built on modifying the IR nodes in place. This gets even worse when we consider the nesting involved: e.g. if I update the grandchild of A, then later ask the context to resolve A by its ID, it will happily give me back a version of A that does not have the modification to the grandchild.
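The failure mode in miniature (a contrived sketch; assume the node wasn’t a Constant to begin with):

fn demo(root: &mut IrNode, ctx: &mut IrContext) {
    // a pass rewrites the node it was handed, in place...
    root.kind = IrNodeKind::Constant(ConstExpr::Int(42));

    // ...but the clone sitting in id_to_expr is untouched, so a lookup
    // by id resurrects the pre-rewrite version
    let stale = &ctx.id_to_expr[&root.id];
    assert!(!matches!(stale.kind, IrNodeKind::Constant(_))); // out of sync!
}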
In fact this is the issue that the variable expansion pass showed up. Even if the previous passes had expanded everything correctly (which did need some fixing to actually be true), as soon as the variable expansion pass came along and looked up any expression in the context, it was given the original version from before any of those passes happened. It would then happily replace the symbol with this expression, obliterating any of the changes to its children in the process, and then fail when it couldn’t resolve the symbols that were (meant to have been) resolved by previous passes.
This has got to be one of my all-time top bugs (I mean, I’ve only had a few years to accumulate them, but still). Working out that this was what was happening took me days, and even without everything else going on I think it would’ve taken me a while. This is what happens when you try to mix mutability with .clone(), my friends.
Now, in retrospect, I really should’ve just cut my losses here and redesigned everything. I had picked a fundamentally flawed design and it was not going to be fixable. However, I did not cut my losses. I’d spent days on this, after all, and if I simply gave up after every horrifically subtle bug I would never get anything done, so instead I simply hacked around the problem like the great developer I am. IMO, knowing when to scrap your work and when to keep going is an essential skill as a software dev; everything is obvious in hindsight, and I did eventually admit defeat for the better, but sometimes you do just have to keep going at a problem until you break through.
An attempted fix
While the IR and the context getting out of sync is a problem, it’s also very easy to detect and fix. You can simply traverse the IR and find all the IrNodes where the IR version and the context version are not equal. Then you can just iterate over that list and replace the context version with the IR version; simplicity itself. So I made this run after every pass. It worked, and with my dirty hack in hand I proceeded on my merry way.
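The hack, roughly (a sketch; it assumes IrNode derives Clone and PartialEq, and the recursion into children is elided):

fn resync(root: &IrNode, ctx: &mut IrContext) {
    let mut stale = Vec::new();
    collect_stale(root, ctx, &mut stale);
    for node in stale {
        // the IR version wins: overwrite the context's copy
        ctx.id_to_expr.insert(node.id, node);
    }
}

fn collect_stale(node: &IrNode, ctx: &IrContext, out: &mut Vec<IrNode>) {
    if ctx.id_to_expr.get(&node.id) != Some(node) {
        out.push(node.clone());
    }
    // ... recurse into node's children
}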
You fool, you idiot, you absolute buffoon
This lasted a full day before it came crashing down when I tried to inline a nested lambda call, by which I mean a lambda that calls another lambda inside its body. The problem is that it starts to show up the flaws in the diff-and-patch approach: if the parent of an expression doesn’t have an update to one of its children but gets applied after the child, then the resultant context will still be out of sync.
The problem can be somewhat “fixed” by sorting the patches to ensure the parents get applied before the children, but this doesn’t fix the issue of the parent being out of sync with its children. I’m going to be honest: I don’t actually know how the parents got out of sync. I spent probably a full working day, all told, debugging just this, and I’m still not sure exactly how it happened. Unfortunately more of my earlier mistakes came back to bite me, and this whole mess was very difficult to debug or test.
This was pretty much the last straw. I had seen the writing on the wall for this design, and I had been prototyping several other approaches over the previous days (in the mornings generally, while I was getting back into the code for the day), but they were all either too complex or ran into the same sorts of issues. Repeatedly, either different copies would get out of sync, or the borrow checker would just refuse to accept my code because it could see that what I was doing was bad. The most promising approach, I think, was one that tried to emulate the lazy/query-driven compilation of rustc (or at least most of it). The problem was that it required a significant amount of supporting infrastructure and would have been a pretty big undertaking (and I had already burned most of a week at this point). After I worked out the source of this bug (another 3 hours or so of debugging), tried the fix mentioned above, and then had to debug why that wasn’t working either, I gave up and scrapped the whole thing.
2 subtle-as-hell bugs, and many, many more less subtle ones that were still not fun to debug given my lack of tests, issues with log output (I didn’t find that bug in Result’s Monoid impl until around this point, by the way), bugs in the visitor impl, and so much more. To quote our esteemed ex-PM Boris “He really said this” Johnson: “Them’s the breaks”.
Post-mortem
Let’s wrap this up, shall we? Overall I made a lot of mistakes with this design: it had mut everywhere, different passes had to know what the other ones did and had strict dependencies on each other, there were absolutely no tests, fundamental building blocks like the default visitor implementation were just broken (again, no tests), and many more problems besides.
I think, though, it boils down to a case of too much separation of concerns. The IR was quite broad but not very deep. What I mean by this is that it could express a lot of the constructs of the input language (lambdas, ifs, math, etc.) but you couldn’t actually manipulate those constructs, or even understand the semantics of the program, without an awful lot more work, which realistically required carrying around a context that cached some of it, more for ease of use than for performance.
After I finally abandoned this approach I switched back to what I was using before: a recursive evaluator which just holds a bunch of state as it traverses the AST. I then added another IR below it, which is far, far simpler, having only an abstract representation of the domain-specific operations and the immediate values they need. This IR is much easier to work with: it requires no IrContext shenanigans, and modifications are expressed using copy-on-write. It took many days of pain to learn these lessons, but I think I’m a better programmer for it, and I was being paid for my time, so I can live with it.
Appendix: Rust’s performance
I do just want to take a moment here to marvel at Rust’s performance. After writing in GC’d languages for a while, where everything is a linked list and you need 50 pointer lookups to call a method, it always blows me away how Rust and C++ can take what is objectively terrible code from a performance perspective and still have it run almost instantly. That id_to_expr hashmap was using $O(n^2)$ memory by my count (every IrNode contains every child IrNode, and every child IrNode was also in id_to_expr); debug-printing the context of a toy program gave me a 70MB log file! Yet even in debug mode it was almost instant.
The only time in this entire project where I actually hit performance issues was after I had abandoned this approach for a much more tightly bound frontend. It lowers the AST directly to the step below the one we’ve been discussing here. The actual evaluation is recursive with a value stack (now that I think about it, I could probably just make it return Option<T> and do away with the stack), and when a lambda is encountered it clones the current scope and stores it as part of its value (for use when evaluating). This wouldn’t be too much of an issue on its own, but combined with nested lambda calls it can become A Problem.
The scope struct is basically just a Vec<(String, Value)>, where Value is a big sum type holding all the possible results of evaluating an expression, so cloning it isn’t cheap, but it’s not too expensive either. Where it does become very expensive is the way values are taken out of it and evaluated: when a symbol is mapped to a value, it is cloned out of the scope and pushed onto the value stack. This means that whenever a lambda expression is resolved, its entire body and, more importantly, its entire scope is also cloned and, of course, if that scope contains other lambdas, their scopes are also cloned. This, perhaps unsurprisingly, caused that y-comb example I gave at the beginning to take 1-2 seconds per recursion of the adder in debug mode (and around 1 second in release; I didn’t benchmark this or anything, this is pure eyeball :p).
So, how did I solve yet another issue (caused by my stupidity) with .clone()? By using Rc of course! Now, before you laugh, this is actually the 1-in-10 times you should actually use an Rc. I want cheap copies of immutable data that needs multiple owners (owners in the C++ sense, not the Rust sense); I’m not just doing it to make the borrow checker happy. Anyway, I put Rc everywhere and now it runs instantly again.
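For flavour, the shape of the change was roughly this (a sketch with invented names; Expr stands in for whatever the evaluator actually walks):

use std::rc::Rc;

// Before, cloning a lambda Value deep-copied its body and every binding in
// its captured scope (and, recursively, the scopes of any lambdas inside
// those bindings). Now a clone is a few reference-count bumps.
#[derive(Clone)]
enum Value {
    Int(i64),
    Lambda {
        params: Rc<[String]>,
        body: Rc<Expr>,
        captured: Rc<EvalScope>,
    },
    // ...
}

#[derive(Clone)]
struct EvalScope {
    bindings: Vec<(String, Value)>,
}

struct Expr; // stand-in for the real expression type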