Therapeutic parsers
Programming doesn’t always fill me with joy, but there’s one particular kind of programming that does so almost invariably. I’m talking about parsers or, more specifically, parsing formal grammars. If you know what a parser is but haven’t had the pleasure of writing one yourself, this may sound geeky or even plain ludicrous. Fortunately, in this case your gut sense is letting you down: parsers are the soul of programming. They animate lifeless symbols and set them on their path to becoming a computation. And yet, no matter how many parsers I write, I keep rediscovering their unassuming beauty. Let this post be a written reminder, and also an ode to these wonderful algorithms.
Recently I’ve been experimenting with CouchBase Lite 2.0 as a database for one of my hobby projects. Among other things, I was compelled to try N1QL, its SQL-like language for querying schemaless data. CouchBase Lite 2.0 is still in beta, and there are no (and may never be) native Ruby bindings, so I had to resort to using its core library via an FFI interface. Overall, the process is straightforward enough, with one sad exception: N1QL queries. Where the end-user libraries come with hand-crafted query builders, I was left with a rather unfriendly abstract syntax tree represented as a JSON document. My only source of documentation was a functional test written in C. After a few evenings of writing queries this way, I came to an obvious conclusion: I had to write an N1QL parser and compiler. It was the most reasonable thing to do. There was no other way around it. (Stop questioning my judgment! 😅)
There’s more than one way to go about implementing a parser. If you’re curious about learning the fundamentals, the traditional “start with a lexer” approach is a natural choice, and I definitely recommend going through with it at least once: it’s worth your time. That said, for building a practical, fun parser, I usually reach for a specialized library that can generate a working parser from a defined grammar. Ruby’s go-to parser generator is Parslet, and it’s a pleasure to use. One thing that makes parser generators so delightful is the power of a well-understood domain-specific language (DSL) combined with test-driven development (TDD). I don’t subscribe to TDD as my default mode of programming, but I find it very productive for writing parsers. Coincidentally, DSLs and TDD are two things Ruby is really great at, so I was grateful to be armed with the right tool for the job.
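For a taste of what that traditional route looks like, here is a minimal hand-rolled lexer sketch in plain Ruby, using the standard library’s StringScanner. The token names are invented for illustration and have nothing to do with any real N1QL implementation:

```ruby
require 'strscan'

# A minimal hand-rolled lexer: turns "SELECT name FROM users"
# into a flat list of [type, value] tokens. The :keyword and
# :name token types are made up for this example.
def tokenize(input)
  scanner = StringScanner.new(input)
  tokens = []
  until scanner.eos?
    if scanner.scan(/\s+/)
      next # skip whitespace between tokens
    elsif (word = scanner.scan(/[a-zA-Z]\w*/))
      type = %w[SELECT FROM].include?(word.upcase) ? :keyword : :name
      tokens << [type, word]
    else
      raise "Unexpected character: #{scanner.getch.inspect}"
    end
  end
  tokens
end

tokenize('SELECT name FROM users')
# => [[:keyword, "SELECT"], [:name, "name"], [:keyword, "FROM"], [:name, "users"]]
```

A parser would then consume this token stream instead of raw characters; a parser generator collapses both stages into one set of grammar rules.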
it { is_expected.to parse('SELECT name FROM table').as(select: 'name', from: 'table') }
I started with a test similar to the one above and made it a point to implement the absolute minimum of N1QL syntax needed to achieve the desired result. This statement consists of two keywords (SELECT and FROM), two names (name and table), and some whitespace. While N1QL is insensitive to case, I reasoned that representing keywords as string constants would give me a good enough approximation:
rule(:kw_select) { str('SELECT') }
rule(:kw_from) { str('FROM') }
The names could be simplified to strings that start with a letter, followed by any number of word characters:
rule(:name) { match('[a-zA-Z]') >> match('[\w\_]').repeat }
Whitespace characters can be matched by a regular expression repeating one or more times:
rule(:ws) { match('\s').repeat(1) }
By combining these four primitives, a trivial SELECT statement can be described as:
rule(:select) { kw_select >> ws >> name.as(:select) >> ws >> kw_from >> ws >> name.as(:from) }
As you can see, just five rules produce a working parser that can already handle the simplest SELECT statements, with names of arbitrary length and any amount of whitespace between tokens. And it passes the unit test.
Parser.new.parse('SELECT name FROM users')
# => {:select=>"name"@7, :from=>"users"@17}
In my experience, setting yourself up for an easy win at the start of a project is a nice hack for getting into the “flow”. After all, parsers are “simple not easy”: from here on you have to work your way up towards ever more complex rules: multiple columns, escaped names, strings, objects, compound operators, and infix expression trees. It all makes sense though, if you take it one step at a time.
This is where the magic happens. As I checked N1QL’s syntactic features off my list, one after another, I realized there was something therapeutic about the process: every filled gap in my implementation made me feel more whole. As the structure grew to resemble the end result, I’d often catch myself smiling at a newly added parsing rule. It was akin to reading “Gödel, Escher, Bach”: the revelation that everything is interconnected and endlessly recursive: “a strange loop”, as Douglas Hofstadter would call it. The beauty of tackling complex problems by composing numerous simple pieces never fails to charm me.
Let’s reiterate. Parsing formal grammars is way more fun than it sounds: you work with pure functions, free of side effects, easily testable, and packaged in a well-understood DSL. TDD helps, because the tight feedback loop will keep you motivated. Once you have a working parser, try writing a compiler (or an interpreter) for it. Looking for a real challenge? Implement a subset of your favorite programming language, then compile it to C or JavaScript. The process is incredibly rewarding, and the challenge makes a nice break from the routine of professional software development. Last but not least, it might just expand your understanding of computation and make you a better programmer (you never know). That’s all. Happy parsing!
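To make the interpreter suggestion concrete, here is a toy sketch that evaluates the {select:, from:} parse tree from earlier against plain in-memory data instead of a real database. The tables hash and the run_query helper are invented for the example:

```ruby
# A toy interpreter for the {select:, from:} parse tree: looks up
# the table by name and projects the selected field from each row.
# The in-memory "database" below is made up for illustration.
def run_query(tree, tables)
  rows = tables.fetch(tree[:from].to_s) # Parslet yields slices, hence to_s
  rows.map { |row| row[tree[:select].to_s] }
end

tables = { 'users' => [{ 'name' => 'Ada' }, { 'name' => 'Grace' }] }
run_query({ select: 'name', from: 'users' }, tables)
# => ["Ada", "Grace"]
```

From here, a real compiler is the same idea pointed the other way: instead of evaluating the tree, walk it and emit the target representation.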
P.S. If you’re curious to take a peek at my N1QL parser, the code is available on GitHub. It’s still unfinished, but it mostly works for my use case. I have no plans to publish it to RubyGems, so the repository is out there for educational purposes (and as a backup).