Yacc is dead

We present two novel approaches to parsing context-free languages. The first approach is based on an extension of Brzozowski’s derivative from regular expressions to context-free grammars. The second approach is based on a generalization of the derivative to parser combinators. The payoff of these techniques is a small (less than 250 lines of code), easy-to-implement parsing library capable of parsing arbitrary context-free grammars into lazy parse forests. Implementations for both Scala and Haskell are provided. Preliminary experiments with S-Expressions parsed millions of tokens per second, which suggests this technique is efficient enough for use in practice.


💡 Research Summary

The paper titled “Yacc is dead” introduces two novel parsing techniques that aim to replace the traditional Yacc‑style LALR parsers, which are often criticized for their complexity, large generated tables, and limited grammar flexibility. The first technique extends Brzozowski’s derivative—originally defined for regular expressions—to arbitrary context‑free grammars (CFGs). In the regular‑expression setting, the derivative with respect to a symbol consumes that symbol and yields a new expression that describes the remainder of the language. By generalizing this notion to CFGs, the authors define a derivative operation for each non‑terminal that, given the next input token, produces a new grammar representing all possible continuations. This operation is recursive, and the implementation memoizes intermediate results to avoid recomputation, effectively turning the parsing process into a series of on‑the‑fly grammar transformations. The result is a parser that scans the input exactly once while dynamically constructing a parse forest.
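To make the starting point concrete, here is a minimal sketch (not the paper's code) of Brzozowski's derivative for plain regular expressions, the base case that the paper generalizes to context-free grammars. All names here are illustrative:

```scala
sealed trait Re
case object Empty extends Re                 // the empty language
case object Eps extends Re                   // accepts only the empty string
case class Ch(c: Char) extends Re            // a single character
case class Alt(a: Re, b: Re) extends Re      // union
case class Cat(a: Re, b: Re) extends Re      // concatenation
case class Star(a: Re) extends Re            // Kleene star

// nullable(r): does r accept the empty string?
def nullable(r: Re): Boolean = r match {
  case Empty     => false
  case Eps       => true
  case Ch(_)     => false
  case Alt(a, b) => nullable(a) || nullable(b)
  case Cat(a, b) => nullable(a) && nullable(b)
  case Star(_)   => true
}

// derive(r, c): the residual language after consuming character c
def derive(r: Re, c: Char): Re = r match {
  case Empty     => Empty
  case Eps       => Empty
  case Ch(d)     => if (c == d) Eps else Empty
  case Alt(a, b) => Alt(derive(a, c), derive(b, c))
  case Cat(a, b) =>
    val left = Cat(derive(a, c), b)
    if (nullable(a)) Alt(left, derive(b, c)) else left
  case Star(a)   => Cat(derive(a, c), Star(a))
}

// Matching is repeated derivation followed by a nullability check.
def matches(r: Re, s: String): Boolean =
  nullable(s.foldLeft(r)(derive))
```

The CFG extension follows the same shape, but because non-terminals can be recursive, the paper's version additionally needs laziness, memoization, and fixed-point computation of nullability.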

The second technique builds on the functional‑programming concept of parser combinators. Instead of writing a monolithic parser, one composes small parsers using combinators such as sequencing, choice, and repetition. The authors show how to equip each combinator with a derivative rule: for a sequence, the derivative of the first parser is sequenced with the unchanged second parser, and, if the first parser can accept the empty string, the derivative of the second parser is added as an alternative; for a choice, the derivatives of both alternatives are computed and merged; for repetition, one iteration is unrolled, so the derivative of the body is sequenced with the original repetition. By treating combinators as differentiable objects, the whole combinator expression can be reduced step by step as the input is consumed, again yielding a lazy parse forest rather than an eager concrete tree.
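The per-combinator rules above can be sketched as follows. This is a hedged simplification, restricted to non-recursive grammars and recognition only; the paper's full version adds laziness and memoization to handle recursive grammars and to build parse forests:

```scala
sealed trait P
case object Fail extends P                   // never succeeds
case object Done extends P                   // succeeds on empty input
case class Tok(t: String) extends P          // a single terminal
case class Or(a: P, b: P) extends P          // choice
case class Then(a: P, b: P) extends P        // sequencing
case class Rep(a: P) extends P               // zero-or-more repetition

// Can this parser succeed without consuming input?
def acceptsEmpty(p: P): Boolean = p match {
  case Fail       => false
  case Done       => true
  case Tok(_)     => false
  case Or(a, b)   => acceptsEmpty(a) || acceptsEmpty(b)
  case Then(a, b) => acceptsEmpty(a) && acceptsEmpty(b)
  case Rep(_)     => true
}

// One derivative step with respect to the next token t.
def d(p: P, t: String): P = p match {
  case Fail       => Fail
  case Done       => Fail
  case Tok(u)     => if (t == u) Done else Fail
  case Or(a, b)   => Or(d(a, t), d(b, t))     // derive both alternatives
  case Then(a, b) =>                          // derive the first parser; if it
    val left = Then(d(a, t), b)               // can be empty, the token may
    if (acceptsEmpty(a)) Or(left, d(b, t))    // also start the second parser
    else left
  case Rep(a)     => Then(d(a, t), Rep(a))    // unroll one iteration
}

def accepts(p: P, input: List[String]): Boolean =
  acceptsEmpty(input.foldLeft(p)(d))
```

For example, `accepts(Then(Or(Tok("0"), Tok("1")), Rep(Or(Tok("0"), Tok("1")))), List("1", "0", "1"))` recognizes a nonempty binary string by deriving the combinator expression once per token.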

A key engineering insight is the use of a lazy parse forest. The parser does not immediately instantiate every possible parse tree; instead, it builds a compact, shared structure that can be explored on demand. This dramatically reduces memory consumption, especially for ambiguous grammars, and enables the parser to return all possible parses without a combinatorial explosion of intermediate data structures.
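The demand-driven behavior can be illustrated with Scala's built-in `LazyList`, used here as a simplified stand-in for the paper's shared forest structure (the counter and tree shapes are illustrative, not from the paper):

```scala
sealed trait Tree
case class Leaf(tok: String) extends Tree
case class Node(kids: List[Tree]) extends Tree

var built = 0  // counts how many alternatives were actually constructed

def mkAlternative(i: Int): Tree = { built += 1; Node(List.fill(i)(Leaf("x"))) }

// An "ambiguous" result with unboundedly many alternatives, produced lazily:
// nothing is constructed until a client demands it.
val forest: LazyList[Tree] = LazyList.from(1).map(mkAlternative)

val first = forest.head  // forces exactly one alternative; the rest stay unbuilt
```

After forcing only `forest.head`, the counter shows that a single alternative was materialized, which is the property that keeps ambiguous grammars from exploding memory.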

The authors provide reference implementations in both Scala and Haskell, each consisting of fewer than 250 lines of code. The Scala version runs on the JVM and leverages mutable data structures for fast memoization, while the Haskell version exploits pure functional laziness and strong typing to achieve similar performance. Both implementations are deliberately minimalistic to demonstrate that the derivative‑based approach does not require the heavyweight infrastructure typical of Yacc‑generated parsers (e.g., state‑machine tables, conflict resolution code).

Performance experiments focus on parsing S‑Expressions, a simple yet representative benchmark for symbolic data. The authors report preliminary parsing rates of millions of tokens per second on commodity hardware for both the Scala and Haskell implementations, which suggests the technique is efficient enough for practical use despite its tiny code base.

Beyond raw speed, the paper emphasizes modularity and extensibility. Adding a new language construct merely requires defining its derivative rule; the rest of the parser automatically incorporates the new construct without any changes to the core algorithm. This property makes the approach attractive for rapid prototyping of domain‑specific languages, educational tools, or any setting where grammar evolution is frequent.

In summary, the work demonstrates that derivative‑based parsing—whether applied directly to CFGs or via parser combinators—offers a compelling alternative to traditional Yacc‑style parsers. It achieves a rare combination of simplicity (tiny code base), efficiency (high token‑per‑second rates, low memory footprint), and flexibility (easy grammar extension, support for ambiguous grammars). The experimental results suggest that the technique is not merely a theoretical curiosity but a practical tool ready for real‑world language processing tasks.

