Build a Compiler in Five Projects

206 points | 3 months ago | kmicinski.com
fjfaase | 2 months ago

It surprises me that they are still teaching parsing techniques based on limitations from 40 years ago, when memory was scarce and you had to parse a file one character at a time. Why not start with a back-tracking recursive descent parser on a file stored in memory? It can be made efficient with some simple caching. In an introductory course there is no need to aim for maximum performance if parsing a 10k-line program takes less than a second.
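To sketch what I mean (a toy grammar of my own, not anything from the course): a back-tracking recursive descent parser over an in-memory string, where each parsing function returns either a result or None (the backtracking step), and a memo table caches each (rule, position) pair so nothing is parsed twice.

```python
from functools import lru_cache

def make_parser(text):
    # Toy grammar:  expr -> term ('+' term)*
    #               term -> NUMBER | '(' expr ')'
    # Each function returns (value, next_pos) on success or None on
    # failure; returning None and trying another alternative is the
    # backtracking, and lru_cache is the "simple caching".

    @lru_cache(maxsize=None)
    def expr(pos):
        r = term(pos)
        if r is None:
            return None
        value, pos = r
        while pos < len(text) and text[pos] == '+':
            r = term(pos + 1)
            if r is None:
                return None
            rhs, pos = r
            value += rhs
        return value, pos

    @lru_cache(maxsize=None)
    def term(pos):
        if pos < len(text) and text[pos] == '(':
            r = expr(pos + 1)
            if r is not None and r[1] < len(text) and text[r[1]] == ')':
                return r[0], r[1] + 1
            return None  # backtrack: '(' did not lead to a parse
        end = pos
        while end < len(text) and text[end].isdigit():
            end += 1
        if end == pos:
            return None  # backtrack: no number here either
        return int(text[pos:end]), end

    def parse():
        r = expr(0)
        if r is None or r[1] != len(text):
            raise SyntaxError("parse error")
        return r[0]

    return parse

print(make_parser("(1+2)+30")())  # -> 33
```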

norir | 2 months ago

Parsing is strange in that many people tend to believe it is a solved problem and yet every project handles it slightly differently (and almost none do it truly well).

I have been studying compiler design for several years and I have found that writing a simple parser by hand is the best way to go most of the time. There is a process to it: you start with a "Hello, world!" program and parse it character by character with no separate lexer. You ensure that your parser makes an unambiguous decision at each character and never backtracks. The decision may be that you need to enter a disambiguation function that also only moves forward. If the grammar gets in the way of preserving this property, change the grammar, not the parser design.

If you follow that high-level algorithm, you will end up with a parser whose performance is linear in the number of characters, which is asymptotically as good as you can hope for. It is both easy and simple to implement (provided you have solid programming fundamentals) and no caching is needed for efficiency.
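A minimal illustration of this forward-only discipline (the grammar and names here are my own toy, not from any real compiler): prefix expressions like `+1+23`, where the current character alone decides which rule applies, so the cursor only ever moves forward and the whole parse is one left-to-right scan.

```python
def parse(text):
    pos = 0

    def expr():
        nonlocal pos
        if pos >= len(text):
            raise SyntaxError("unexpected end of input")
        ch = text[pos]
        if ch == '+':        # unambiguous: '+' always starts an addition
            pos += 1
            return expr() + expr()
        if ch.isdigit():     # unambiguous: a digit always starts a number
            pos += 1
            return int(ch)
        raise SyntaxError(f"unexpected {ch!r} at {pos}")

    value = expr()
    if pos != len(text):
        raise SyntaxError("trailing input")
    return value

print(parse("+1+23"))  # 1 + (2 + 3) -> 6
```

Note there is no backtracking anywhere: every character is inspected exactly once, which is where the linear bound comes from.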

Deliberate backtracking in a compiler is an evil that should be avoided at all costs. It potentially injects exponentially poor performance into a developer's primary feedback loop, which is a theft of their time for which they have little to no recourse.

fjfaase | 2 months ago

I agree that if you want to write a production-grade parser, this is probably the best way to go. I also agree that parsing is not a solved problem for all cases; but that is true of many other problems as well. For many cases it is a solved problem, and often it is not the first thing you should focus on optimizing.

If you teach a course about compiler construction, I think it might be better to teach your students how to write a grammar for some language and use an interactive parser that can parse input according to the grammar (and visualize the AST). See for example [1] and [2]. (Even if you feed it the C grammar, it succeeds in parsing thousands of lines of (preprocessed) C code at every keystroke. This interpreting parser is written in JavaScript and uses a simple caching strategy for performance.)

For the scripting language [3] in some of the Bizzdesigns modeling tools, a similar interactive parser was used (implemented in C++). This scripting language is also internally used for implementing the various meta-models. These scripts are parsed once, cached, and interpreted often.

I think the same is true for many domain-specific languages (DSLs).

[1] https://info.itemis.com/demo/agl/editor

[2] https://fransfaase.github.io/MCH2022ParserWorkshop/IParseStu...

[3] https://help.bizzdesign.com/articles/#!horizzon-help/the-scr...

Rendello | 2 months ago

I love parsing and have written a lot of parsers, but never a typical programming language parser. It's interesting that most of the literature (from academic papers to blog posts) focuses on programming language parsers, while the vast majority of parsers out there deal with other things. I had to really figure things out myself, and that's been the same story for every parser I've written.

A lesson I have to relearn every time: while you can always skip lexing (which is really just another parser), it almost always screws you over to do so.
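For concreteness, here is roughly what "not skipping lexing" buys you (a toy lexer of my own design; the token names are made up): whitespace, numbers, and names get folded into tokens up front, so the parser proper only ever deals with a handful of token kinds instead of raw characters.

```python
import re

# One regex alternation, one named group per token kind.
TOKEN_RE = re.compile(r"""
    (?P<NUMBER>\d+)
  | (?P<NAME>[A-Za-z_]\w*)
  | (?P<OP>[+\-*/=()])
  | (?P<WS>\s+)
""", re.VERBOSE)

def lex(text):
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"bad character {text[pos]!r} at {pos}")
        if m.lastgroup != "WS":   # whitespace dies here, not in the parser
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(lex("x = 10 + y2"))
# [('NAME', 'x'), ('OP', '='), ('NUMBER', '10'), ('OP', '+'), ('NAME', 'y2')]
```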

marcosdumay | 2 months ago

Well, I can immediately think of two reasons:

Backtracking parsers lead people into creating bad grammars. In principle, people are perfectly capable of creating simple context-free grammars and writing any parser they want to read them. But in practice your tools guide your decisions to a huge extent, and the less experience people have, the more true that becomes; so it's a really dangerous tool, in particular for students.

Also, fully backtracking parsers have the most unpredictable and hardest-to-fix error conditions of all the possibilities. There are middle grounds where the parser execution is still predictable but you get most of the benefit of backtracking; but reaching that optimum, and keeping your project close to it, takes a lot of complex engineering decisions.

Immediate edit: in a CS context there is one reason that is probably more important than any other. People use parsers as an application of automata theory and the theory of regular languages. Those two concepts are far more important than the practical implications of parsing.

fjfaase | 2 months ago

What do you mean by bad grammars? Do you mean grammars that are hard to parse (requiring a lot of backtracking), or do you mean that it leads people to create bad languages?

My experience is that if a back-tracking parser lists all the terminals it expects at the first location it fails to get past (with some additional information about the rules they occurred in), that usually gives enough information to understand what is wrong with the input or the grammar.

marcosdumay | 2 months ago

I mean grammars that are hard for people to follow.

kmicinski | 2 months ago

> In an introduction course there is no need to aim for maximum performance if parsing a 10k lines program takes less than a second.

I'll do you one better. The main compiler in my course uses only six characters to parse every single project: `(read)`.
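For readers without Racket at hand, here is roughly what `(read)` buys you, sketched as a toy Python s-expression reader (my own sketch, not Racket's actual reader): when the surface syntax is s-expressions, "parsing" is little more than matching parentheses.

```python
def read(text):
    # Pad parens with spaces, then split: the whole "lexer".
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()

    def read_form(i):
        if tokens[i] == "(":
            items, i = [], i + 1
            while tokens[i] != ")":
                form, i = read_form(i)
                items.append(form)
            return items, i + 1
        tok = tokens[i]
        # Atoms: numbers become ints, everything else stays a symbol.
        return (int(tok) if tok.lstrip("-").isdigit() else tok), i + 1

    form, i = read_form(0)
    if i != len(tokens):
        raise SyntaxError("trailing input")
    return form

print(read("(+ 1 (* 2 3))"))  # -> ['+', 1, ['*', 2, 3]]
```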

vintagedave | 2 months ago

Are you referring to lookahead, as in allowing more ambiguous grammars?

fjfaase | 2 months ago

Sorry, I meant to say: even if a grammar is not ambiguous, it can require unbounded look-ahead to be parsed correctly [1].

The grammar of C is ambiguous. The statement "a * b;" can be parsed both as a declaration of the variable 'b' with type pointer-to-'a' and as an expression multiplying 'a' and 'b'. It all depends on whether 'a' is a type or not. In most cases it would be a declaration, because why multiply two variables and ignore the result? One trick to deal with this is to give the declaration grammar rule precedence over the expression grammar rule. However, this is not something that can be done with many parser-generating tools.

Yet the first C compilers were single-pass compilers with a single look-ahead lexical token, probably implemented as recursive descent parsers. It worked because the compiler kept a (reverse) list of all declarations, which allowed it to determine, when 'a' was parsed, whether it was the start of a declaration or the start of a statement, based on whether 'a' had previously been defined as a type or not.
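A toy version of that trick (the function name and token shape are mine, simplified far beyond real C): before deciding how to parse `a * b;`, consult the table of type names seen so far, using exactly one token of context.

```python
def classify_statement(tokens, typedefs):
    """tokens: e.g. ['a', '*', 'b', ';'].  Decides how `a * b;` parses,
    based only on whether the first token names a previously seen type."""
    name = tokens[0]
    if name in typedefs:
        return "declaration"   # `a *b;`  -- b is a pointer to type a
    return "expression"        # `a * b;` -- multiply and discard result

typedefs = set()
print(classify_statement(["a", "*", "b", ";"], typedefs))  # expression
typedefs.add("a")              # after something like `typedef int a;`
print(classify_statement(["a", "*", "b", ";"], typedefs))  # declaration
```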

[1] https://stackoverflow.com/questions/12971191/grammars-that-r...

fjfaase | 2 months ago

No, even if a grammar is ambiguous it can require unbounded look-ahead to be parsed, although this is very rarely the case for meaningful grammars such as the ones you would write for a programming language.

What I wanted to say is that you do not need complex algorithms to implement a parser if your grammar can be parsed with a single look-ahead lexical element.

fjfaase | 2 months ago

I meant to say: no, even if a grammar is not ambiguous ...

UncleOxidant | 2 months ago

The Essentials of Compilation book mentioned is only ~$24 on Amazon. Usually books like this are much more expensive. I ordered a copy.

declank | 2 months ago

There is also a Python version, which is available for free. I do like the register allocation/graph colouring chapter.
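The idea of that chapter, in a deliberately minimal sketch (my own toy code, not the book's): variables that are live at the same time interfere, so build an interference graph and greedily colour it so that interfering variables land in different registers, spilling to the stack when the colours run out.

```python
def color(graph, k):
    """graph: {var: set of interfering vars}; k: registers available.
    Returns (register assignment, list of spilled vars)."""
    assignment, spilled = {}, []
    # Colour most-constrained variables first (a simple degree heuristic).
    for var in sorted(graph, key=lambda v: len(graph[v]), reverse=True):
        taken = {assignment[n] for n in graph[var] if n in assignment}
        free = [r for r in range(k) if r not in taken]
        if free:
            assignment[var] = free[0]
        else:
            spilled.append(var)   # no register left: spill to the stack
    return assignment, spilled

# x interferes with y and z; y and z can share a register.
interference = {"x": {"y", "z"}, "y": {"x"}, "z": {"x"}}
regs, spills = color(interference, k=2)
print(regs, spills)
```

Real allocators (Chaitin-Briggs and friends) iterate simplify/spill/rebuild rather than doing a single greedy pass, but the core graph-colouring idea is this.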

almostgotcaught | 2 months ago

Looks like a fun book, but just be forewarned: real compiler engineering is nothing like what's covered there.

kryptiskt | 2 months ago

I'm knee-deep in clang at the moment, and I'm so fed up with real compiler engineering. Give me Chez Scheme and the nanopass compiler any day. That is so much better than the big ball of mud that goes into a "real" compiler.

almostgotcaught | 2 months ago

..... Yes, that's exactly my point: these cutesy books give a completely inaccurate picture of what the job is really like.

fragmede | 2 months ago

Real compiler engineering covers a lot of ground. This book is an intro to it, not the whole everything. No need to posture about it.

pjmlp | 2 months ago

Most people just want to compile their toy language into machine code and be done with it.

anta40 | 2 months ago

Any recommendation for a more realistic book?

I think hacking GCC/LLVM can be pretty challenging, but hey they are real, production-grade compilers and not just typical academic projects.

almostgotcaught | 2 months ago

there are no good modern compiler books - everything that's been written down pales in comparison to what GCC/LLVM really involve. recently i found Engineering a Compiler by Cooper and Torczon when reviewing/prepping for interviews - it wasn't bad. also there's now LLVM Code Generation by Quentin Colombet, but that's basically a code walk-through of LLVM (it doesn't cover any of the algos), and it was probably out of date the second it got published lol (not really but maybe). the truth is that trying to learn how to build a compiler from a single book is like trying to learn how to build a skyscraper from a single book.

sph | 2 months ago

> the truth is that trying to learn how to build a compiler from a single book

I think you conflate "learning to build a compiler for a toy language" with "being effective at working on a modern optimizing compiler suite like GCC/LLVM".

The book is perfectly fine for the first use case, and never claims to touch upon the latter.

kmicinski | 2 months ago

Respectfully, I think what you mean is that there are no books which give you the experience of hacking on LLVM for several years.

mamcx | 2 months ago

A good counterpoint is that a lot of the information about this is dense, cryptic, weird, confusing, and hard to get.

The major problem is not finding the sophisticated things, but understanding how to do them in simple-ish ways.

Doing otherwise is a major waste of time!

P.S.: And yes, even once you get the basics and learn the jargon, it is still a problem to find the neat tricks; but by then you have likely already learned that there is nothing like reading the source... (sadly that source is in C or, worse, C++, but lately, with Rust gaining traction, at least it makes more sense!)

AdityaSanthosh | 2 months ago

Hi, this seems like an interesting course. I haven't studied compilers in my undergrad (I'm an electronics student), but I have been working as a programmer and have studied C and a bit of low-level languages. Is there any prerequisite compiler knowledge required for this course?

ktimespi | 2 months ago

The only prerequisite here is probably Racket, to follow along with the book

evnix | 2 months ago

Most of these compiler projects and books would be 100x more popular and accessible if they were in JavaScript.

wiseowise | 2 months ago

Just use an LLM to translate the examples to whatever language you want.

derrida | 2 months ago

Now now, isn't that what a compiler does?

Iwan-Zotow | 2 months ago

SICP in JavaScript exists somewhere.