dmaze ([personal profile] dmaze) wrote 2004-11-27 10:47 am

Whither lex?

For no terribly good reason, I'm writing a 6.035 Espresso compiler. (I told [livejournal.com profile] obra last night, "I want a compiler to hack on.") I'm doing this in Python, since that's a fairly sane (if dynamically typed) OO language, and the various arcane things I can think of doing with it are easy. There aren't obvious parser/scanner generators for it, so I'm hand-writing a scanner and an LL(1) parser.

Having gone through this exercise, I'm almost done (I can't yet deal with a function call as a statement, or with interesting things on the left-hand side of an assignment) after about six hours of work. This makes me wonder what the motivation for tools like lex and yacc is. For lex, regexps for many things are almost overkill; my scanner is a state machine that builds up words, checks to see whether they're keywords, and emits an appropriate token. This is a little simpler for being a Python generator function (so I can return multiple values via yield), but it feels like a smallish constant factor away from equivalent C code, and about as much code as the lex description would be. And in all, that's 300 lines of code; are the code-space savings worth the loss of debuggability you get from using a tool?
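
A minimal sketch of that generator style, assuming a toy token set; the names here (Token, KEYWORDS, tokenize) are invented for illustration, not the actual Espresso scanner:

    # Sketch of a generator-based scanner: walk the input one character
    # at a time, build up words, and check each word against a keyword set.
    from collections import namedtuple

    Token = namedtuple("Token", ["kind", "value"])

    KEYWORDS = {"if", "else", "while", "return", "int", "boolean", "void"}
    PUNCTUATION = set("(){}[];,+-*/<>=!")

    def tokenize(text):
        """Yield Token objects lazily; the parser pulls one at a time."""
        i, n = 0, len(text)
        while i < n:
            c = text[i]
            if c.isspace():
                i += 1
            elif c.isalpha() or c == "_":
                j = i
                while j < n and (text[j].isalnum() or text[j] == "_"):
                    j += 1
                word = text[i:j]
                # The keyword check happens here, after the word is built.
                yield Token("keyword" if word in KEYWORDS else "ident", word)
                i = j
            elif c.isdigit():
                j = i
                while j < n and text[j].isdigit():
                    j += 1
                yield Token("int", text[i:j])
                i = j
            elif c in PUNCTUATION:
                yield Token("punct", c)
                i += 1
            else:
                raise SyntaxError("unexpected character %r" % c)

    print(list(tokenize("if (x < 10) return x;")))

The yield is what makes this pleasant: the state machine's position lives implicitly in the function's control flow instead of in an explicit state variable.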

Pretty much the same thing can be said about yacc. Hand-writing the LL(1) parser was pretty easy. Maybe a table-based LR(k) parser has a smaller code size, so it ran better on old machines? For your trouble you get zero debuggability, though possibly better performance than a recursive-descent parser. At MIT I used ANTLR, an LL(k) parser generator, but I don't think it got me much beyond some automated munging of the grammar. My impression (not that I'm trying it here) is that error recovery sucks in LR parsers and is a little better in the LL world. yacc makes you munge your grammar less, but you still need some work for things like operator precedence.
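
For comparison, the whole recursive-descent trick for precedence fits in a few functions - one per precedence level, with tighter-binding operators parsed by deeper calls. A sketch over the usual expression grammar (all names invented; this is not the Espresso parser):

    # Recursive-descent parsing with precedence encoded in the call
    # structure: expr calls term, term calls factor.
    def parse_expr(tokens, pos=0):
        """expr -> term (('+'|'-') term)*  -- returns (ast, next_pos)."""
        node, pos = parse_term(tokens, pos)
        while pos < len(tokens) and tokens[pos] in ("+", "-"):
            op = tokens[pos]
            right, pos = parse_term(tokens, pos + 1)
            node = (op, node, right)
        return node, pos

    def parse_term(tokens, pos):
        """term -> factor (('*'|'/') factor)*"""
        node, pos = parse_factor(tokens, pos)
        while pos < len(tokens) and tokens[pos] in ("*", "/"):
            op = tokens[pos]
            right, pos = parse_factor(tokens, pos + 1)
            node = (op, node, right)
        return node, pos

    def parse_factor(tokens, pos):
        """factor -> NUMBER | '(' expr ')'"""
        if tokens[pos] == "(":
            node, pos = parse_expr(tokens, pos + 1)
            if tokens[pos] != ")":
                raise SyntaxError("expected ')' at token %d" % pos)
            return node, pos + 1
        return ("num", tokens[pos]), pos + 1

    ast, _ = parse_expr(["1", "+", "2", "*", "(", "3", "-", "4", ")"])
    print(ast)
    # ('+', ('num', '1'), ('*', ('num', '2'), ('-', ('num', '3'), ('num', '4'))))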

So if these tools don't generate fast or understandable code, and interpreting their output/error messages involves understanding their theoretical basis, and they make code harder to debug, why do they get so much use?

[identity profile] eichin.livejournal.com 2004-11-27 09:18 pm (UTC)
"This parser can be expressed in yacc" does say *something* (albeit probably not as much as people think it does) about the grammar itself - sort of a static-check on the possible ambiguity. Similarly, it reduces certain kinds of mistake (though it trades them for others at a higher level, mostly.)

On the other hand, the BSD ftpd used yacc for the protocol parser - which seems to have been a mistake of near-sendmail proportions. And the one major change I made to the GNU C++ grammar (nested types) couldn't be done in yacc, and involved C code between the tokenizer and the parser. I haven't had much excuse for writing parsers, in general - there's almost always a "simplify this to something I can throw at some pre-existing parser" path that makes things easier for the end-user too.

[identity profile] eichin.livejournal.com 2004-11-28 10:21 pm (UTC)
And then there's the other approach - write a parser generator as part of your project. There are apparently nineteen of them for Python alone...

[identity profile] iabervon.livejournal.com 2004-12-01 09:17 pm (UTC)
I don't get the point of lex either. With a little less work, you can just write a scanner that handles grouping constructs in the obvious fashion (parens match, brackets match, braces match; also, commas and semicolons mark off parts of groups) in addition to doing the right things with operators, comments, literals, and symbols. "read" is not a particularly complex Scheme procedure, and it only takes a couple more features to make it handle the C family as well.

I've also never understood why the scanner isn't responsible for paren matching, but is responsible for distinguishing keywords from identifiers.
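
A sketch of that reader style (read_group and PAIRS are made-up names): the scanner pairs brackets itself and hands back nested lists, roughly what Scheme's read produces:

    # Reader-style scanning: the bracket matching happens here, not in
    # the parser, and the result is a tree rather than a flat stream.
    PAIRS = {"(": ")", "[": "]", "{": "}"}

    def read_group(tokens, closer=None):
        """Consume tokens until `closer`, nesting on open brackets."""
        group = []
        while tokens:
            tok = tokens.pop(0)
            if tok == closer:
                return group
            if tok in PAIRS:
                group.append(read_group(tokens, PAIRS[tok]))
            elif tok in PAIRS.values():
                raise SyntaxError("unbalanced %r" % tok)
            else:
                group.append(tok)
        if closer is not None:
            raise SyntaxError("missing %r" % closer)
        return group

    print(read_group(["f", "(", "x", ",", "{", "y", ";", "}", ")"]))
    # ['f', ['x', ',', ['y', ';']]]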

Of course, things do get a bit more interesting if you decide that you want your language to have angle brackets (and recognize these in the scanner), and inequality or shift operators.
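
The canonical headache is C++-style templates, where ">>" can either close two angle brackets (vector<vector<int>>) or be a shift operator. One common resolution - sketched here with invented names, not any particular compiler's code - is to let the parser, which knows how many angle brackets are open, ask for the token to be split:

    # If the parser is inside open angle brackets, ">>" is really two
    # closing ">" tokens; split it in place in the token list.
    def close_angle(tokens, pos, open_angles):
        if tokens[pos] == ">>" and open_angles > 0:
            tokens[pos:pos + 1] = [">", ">"]
        return tokens

    print(close_angle(["vector", "<", "vector", "<", "int", ">>"], 5, 2))
    # ['vector', '<', 'vector', '<', 'int', '>', '>']

The point is exactly the one above: the scanner alone can't make this call, because it depends on context only the parser has.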