Whither lex?
Nov. 27th, 2004 10:47 am

For no terribly good reason, I'm writing a 6.035 Espresso compiler. (I told obra last night, "I want a compiler to hack on.") I'm doing this in Python, since that's a fairly sane (if dynamically typed) OO language, and the various arcane things I can think to do with it are easy. There aren't obvious parser/scanner generators for it, so I'm hand-writing a scanner and an LL(1) parser.
Having gone through this exercise, I'm almost done (I can't yet deal with a function call as a statement, or with interesting things on the left-hand side of an assignment) in about six hours of work. This makes me wonder what the motivation for tools like lex and yacc is. For lex, regexps are almost overkill for many things; my scanner is a state machine that builds up words, checks whether they're keywords, and emits an appropriate token. This is a little simpler for being a Python generator function (so I can return multiple values via yield), but it feels like a smallish constant factor away from equivalent C code, and about as much code as the lex description would be. All in all that's 300 lines of code; is the code-space savings from using a tool worth the loss of debuggability?
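For the curious, the scanner looks roughly like this; the token names and the keyword set below are made up for illustration, not the actual Espresso ones.

```python
# A minimal sketch of the generator-style scanner: a hand-rolled state machine
# that builds up words, checks them against a keyword set, and yields tokens.
# (Illustrative keywords only, not the real Espresso keyword list.)
KEYWORDS = set(["if", "else", "while", "return", "int", "boolean"])

def tokenize(text):
    i, n = 0, len(text)
    while i < n:
        c = text[i]
        if c.isspace():
            i += 1
        elif c.isalpha() or c == "_":
            # Build up a word, then decide: keyword or identifier.
            start = i
            while i < n and (text[i].isalnum() or text[i] == "_"):
                i += 1
            word = text[start:i]
            if word in KEYWORDS:
                yield ("KEYWORD", word)
            else:
                yield ("IDENT", word)
        elif c.isdigit():
            # Integer literals.
            start = i
            while i < n and text[i].isdigit():
                i += 1
            yield ("NUMBER", text[start:i])
        elif c in "+-*/=<>(){};,":
            yield ("PUNCT", c)
            i += 1
        else:
            raise SyntaxError("unexpected character %r" % c)
```

So `list(tokenize("if (x1 > 42) return x1;"))` gives back a stream of (kind, text) pairs; the real scanner just has more states for comments, multi-character operators, and string literals.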
Pretty much the same thing can be said about yacc. Hand-writing the LL(1) parser was pretty easy. Maybe a table-driven LR(k) parser has a smaller code size, so it runs better on old machines? For your trouble you get zero debuggability, though possibly better performance than a recursive-descent parser. At MIT I used ANTLR, an LL(k) parser generator, but I don't think it got me much beyond some automated munging of the grammar. My impression (not that I'm trying it here) is that error recovery sucks in LR parsers and is a little better in the LL world. yacc makes you munge your grammar less, but you still need some work for things like operator precedence.
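The hand-written parser is one function per nonterminal. Here's the flavor of it over a toy expression grammar (not the real Espresso grammar); note that operator precedence is handled by layering the grammar, which is exactly the munging yacc would otherwise spare you.

```python
# A sketch of the recursive-descent style, over a toy expression grammar.
# One method per nonterminal; it consumes the (kind, text) tokens produced
# by the tokenize() sketch above.
#
#   expr   -> term (("+" | "-") term)*
#   term   -> factor (("*" | "/") factor)*
#   factor -> NUMBER | IDENT | "(" expr ")"
class Parser(object):
    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.pos = 0

    def peek(self):
        if self.pos < len(self.tokens):
            return self.tokens[self.pos]
        return ("EOF", None)

    def expect(self, kind, text=None):
        # Consume one token, complaining if it isn't what the grammar demands.
        tok = self.peek()
        if tok[0] != kind or (text is not None and tok[1] != text):
            raise SyntaxError("expected %s %r, got %r" % (kind, text, tok))
        self.pos += 1
        return tok

    def expr(self):
        node = self.term()
        while self.peek() in (("PUNCT", "+"), ("PUNCT", "-")):
            op = self.expect("PUNCT")[1]
            node = (op, node, self.term())
        return node

    def term(self):
        node = self.factor()
        while self.peek() in (("PUNCT", "*"), ("PUNCT", "/")):
            op = self.expect("PUNCT")[1]
            node = (op, node, self.factor())
        return node

    def factor(self):
        kind, text = self.peek()
        if kind == "NUMBER":
            self.expect("NUMBER")
            return int(text)
        if kind == "IDENT":
            self.expect("IDENT")
            return text
        self.expect("PUNCT", "(")
        node = self.expr()
        self.expect("PUNCT", ")")
        return node
```

`Parser(tokenize("1 + 2 * (x - 3)")).expr()` hands back a nested-tuple AST, and because each nonterminal is an ordinary function, you can step through a parse in a debugger, which is exactly what a table-driven generator takes away.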
So if these tools don't generate dramatically faster or more understandable code, if interpreting their output and error messages requires understanding their theoretical basis, and if they make the result harder to debug, why do they get so much use?