A Probabilistic Generative Grammar for Semantic Parsing
Domain-general semantic parsing is a long-standing goal in natural language processing, in which the semantic parser robustly parses sentences from domains other than those on which it was trained. Current approaches largely rely on additional supervision from new domains in order to generalize to those domains. We present a generative model of natural language utterances and logical forms and demonstrate its application to semantic parsing. Our approach relies on domain-independent supervision to generalize to new domains. We derive and implement efficient algorithms for training, parsing, and sentence generation. The work relies on a novel application of hierarchical Dirichlet processes (HDPs) for structured prediction, which we also present in this manuscript. This manuscript is an excerpt of chapter 4 from the Ph.D. thesis of Saparov (2022), where the model plays a central role in a larger natural language understanding system. This manuscript provides a new, simplified, and more complete presentation of the work first introduced in Saparov, Saraswat, and Mitchell (2017). The descriptions and proofs of correctness of the training, parsing, and sentence generation algorithms are much simplified in this new presentation. We also describe the novel application of hierarchical Dirichlet processes for structured prediction. In addition, we extend the earlier work with a new model of word morphology, which utilizes the comprehensive morphological data from Wiktionary.
💡 Research Summary
The paper tackles the long‑standing challenge of domain‑general semantic parsing by proposing a fully generative probabilistic grammar that jointly models natural language utterances and their corresponding logical forms. Unlike most contemporary parsers that rely on discriminative models and require domain‑specific supervision (lexicons, annotated grammar rules, or large amounts of paired data), this work introduces a model that can be trained with supervision that is independent of any particular domain. The core idea is to treat logical forms as latent variables drawn from a prior distribution; given a sampled logical form, a top‑down grammar recursively generates a derivation tree, selecting production rules according to probabilities conditioned on the logical form. This generative perspective makes it straightforward to incorporate background knowledge (type constraints, ontology information, etc.) into the logical‑form prior, thereby enabling the parser to handle novel terminology without re‑learning the entire grammar.
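The two-step generative story above (sample a latent logical form from a prior, then expand a derivation top-down with rule probabilities conditioned on that form) can be illustrated with a toy sketch. Everything here is invented for illustration — the predicates, rules, and probabilities are not the paper's actual model, which conditions on rich logical-form structure rather than a single predicate label:

```python
import random

# Toy "prior" over logical forms: each form is just a predicate name here.
LOGICAL_FORM_PRIOR = {"capital_of": 0.5, "population_of": 0.5}

# Production probabilities conditioned on the latent logical form.
# Uppercase symbols are nonterminals; lowercase words are terminals.
RULES = {
    "capital_of": {
        "S": [(("what", "is", "the", "capital", "of", "NP"), 1.0)],
        "NP": [(("texas",), 0.5), (("france",), 0.5)],
    },
    "population_of": {
        "S": [(("how", "many", "people", "live", "in", "NP"), 1.0)],
        "NP": [(("texas",), 0.5), (("france",), 0.5)],
    },
}

def sample(dist):
    """Draw a key from a {key: probability} dict."""
    r = random.random()
    for key, p in dist.items():
        r -= p
        if r <= 0:
            return key
    return key  # guard against floating-point underflow

def expand(symbol, form):
    """Top-down expansion: nonterminals recurse, terminals are emitted."""
    if symbol not in RULES[form]:          # terminal word
        return [symbol]
    rhs = sample(dict(RULES[form][symbol]))
    return [word for child in rhs for word in expand(child, form)]

form = sample(LOGICAL_FORM_PRIOR)          # step 1: draw the logical form
sentence = expand("S", form)               # step 2: derive an utterance
print(form, "->", " ".join(sentence))
```

Because generation is conditioned on the logical form, background knowledge (type constraints, ontology information) only needs to enter the prior over forms, not the grammar itself.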
A major technical contribution is the novel application of hierarchical Dirichlet processes (HDPs) to structured prediction within the parsing model. The authors adapt the HDP, traditionally used for topic modeling, to the tree‑structured space of logical forms. Each node in the logical‑form hierarchy possesses its own Dirichlet process whose base distribution is the process of its parent node. This hierarchy captures the intuition that sub‑structures of a logical form share statistical strength while still allowing fine‑grained variation. The HDP framework yields a set of “tables” (clusters) that correspond to reusable grammar fragments or lexical items, and the Chinese restaurant franchise metaphor provides an intuitive generative story for sampling observations (words or morphemes) conditioned on their source node.
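The parent–child sharing described above can be sketched with a minimal two-level Chinese restaurant franchise. The vocabulary, concentration parameters, and flat two-level structure are made up for illustration; the paper's HDPs sit over a deeper hierarchy derived from logical-form structure:

```python
import random

VOCAB = ["city", "state", "river", "mountain"]  # hypothetical base support
ALPHA = 1.0   # concentration of each child (node-level) process
GAMMA = 1.0   # concentration of the shared parent process

parent_customers = []  # one dish recorded per parent-level customer

def draw_parent():
    """Top-level CRP: open a new table (dish from the base distribution)
    with prob GAMMA/(n+GAMMA), else pick a dish proportional to how many
    customers already have it."""
    n = len(parent_customers)
    if random.random() < GAMMA / (n + GAMMA):
        dish = random.choice(VOCAB)
    else:
        dish = random.choice(parent_customers)
    parent_customers.append(dish)
    return dish

def make_child():
    """A child node's DP whose base distribution is the parent's process:
    opening a new table at the child sends a customer to the parent."""
    customers = []
    def draw():
        n = len(customers)
        if random.random() < ALPHA / (n + ALPHA):
            dish = draw_parent()           # new table: ask the parent
        else:
            dish = random.choice(customers)  # dish proportional to popularity
        customers.append(dish)
        return dish
    return draw

node_a, node_b = make_child(), make_child()
samples_a = [node_a() for _ in range(10)]
samples_b = [node_b() for _ in range(10)]
print(samples_a, samples_b)
```

Both child nodes route their new tables through the same parent process, so a dish popular at one node becomes more likely at its sibling — the "shared statistical strength" the summary refers to.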
Inference in this model involves three intertwined tasks: (1) learning the HDP parameters (table assignments and associated probabilities) from a corpus of paired utterances and logical forms, (2) parsing a new utterance by searching for the most probable derivation tree given the learned distributions, and (3) generating natural language sentences from a sampled logical form. The authors develop efficient Markov chain Monte Carlo (MCMC) samplers for the HDP posterior, a beam‑search‑style parser that can return partial parses (useful when encountering unseen tokens), and a top‑down generator that samples production rules conditioned on the logical‑form prior. To handle unseen words, they augment the model with a morphological component that automatically extracts morpheme‑level information from Wiktionary; this allows the parser to decompose unknown tokens into known morphemes and still produce meaningful partial parses.
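The morphological fallback for unseen tokens can be sketched as a simple segmentation into known morphemes. The morpheme inventory below is a tiny made-up stand-in for the Wiktionary-derived data, and the greedy longest-prefix strategy is one simplification among many possible:

```python
# Hypothetical morpheme inventory; the actual system draws on Wiktionary.
KNOWN_MORPHEMES = {"walk", "talk", "ed", "ing", "s", "un", "re"}

def segment(word):
    """Return one decomposition of `word` into known morphemes, or None.
    Tries longer prefixes first, backtracking when the remainder fails."""
    if word == "":
        return []
    for i in range(len(word), 0, -1):
        prefix = word[:i]
        if prefix in KNOWN_MORPHEMES:
            rest = segment(word[i:])
            if rest is not None:
                return [prefix] + rest
    return None

print(segment("walked"))     # ['walk', 'ed']
print(segment("retalking"))  # ['re', 'talk', 'ing']
print(segment("xyz"))        # None
```

A token the parser has never seen, such as an inflected form of a known verb, can thus still contribute its morpheme-level evidence to a partial parse rather than failing outright.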
A particularly elegant solution is presented for the problem of inferring the source node of a new observation when the node is unknown. The authors formulate this as a discrete optimization problem and solve it with a branch‑and‑bound algorithm that exploits the hierarchical structure of the HDP. The heuristic used in the bound is computed locally at each node, ensuring that the search scales logarithmically with the size of the hierarchy and remains tractable even for deep trees.
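The branch-and-bound search over the node hierarchy can be sketched generically: each subtree carries an upper bound on the best score it contains, and subtrees whose bound cannot beat the incumbent are pruned. The tree, scores, and bounds below are invented for illustration; in the model the scores would be posterior probabilities of candidate source nodes:

```python
import heapq

# Node = (name, score at this node, upper bound over its subtree, children).
TREE = ("root", 0.1, 0.9, [
    ("a", 0.3, 0.9, [("a1", 0.9, 0.9, []), ("a2", 0.2, 0.2, [])]),
    ("b", 0.4, 0.5, [("b1", 0.5, 0.5, [])]),
])

def branch_and_bound(root):
    """Best-first search: expand the subtree with the highest bound,
    pruning any subtree whose bound cannot beat the incumbent."""
    best_name, best_score = None, float("-inf")
    frontier = [(-root[2], root)]          # max-heap keyed on the bound
    while frontier:
        _, (name, score, bound, children) = heapq.heappop(frontier)
        if bound <= best_score:
            continue                       # prune: subtree cannot win
        if score > best_score:
            best_name, best_score = name, score
        for child in children:
            heapq.heappush(frontier, (-child[2], child))
    return best_name, best_score

print(branch_and_bound(TREE))  # ('a1', 0.9)
```

In this example the subtrees rooted at `b` and `a2` are pruned without being expanded, which is what keeps the search tractable when the bounds are tight and computable locally at each node.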
Empirical evaluation is conducted on two classic semantic‑parsing benchmarks: GEOQUERY and JOBS. Both datasets consist of natural‑language questions paired with logical forms expressed in a Datalog‑style language. The proposed model is trained on the standard training splits and evaluated on held‑out test sets. Results show that the generative HDP‑based parser achieves accuracy comparable to, and in some configurations slightly exceeding, that of state‑of‑the‑art discriminative parsers (e.g., CCG‑based, Seq2Seq, and neural graph‑based models). More importantly, the authors perform a domain‑transfer experiment in which test instances contain entities and predicates not seen during training. The model's logical‑form prior and the HDP's ability to share statistical strength across related sub‑structures enable it to maintain high parsing accuracy under these out‑of‑domain conditions, demonstrating genuine domain‑generalization capability.
In summary, the paper makes four key contributions: (1) a fully generative grammar for semantic parsing that separates domain‑independent supervision from domain‑specific lexical knowledge, (2) a novel hierarchical Dirichlet process formulation for structured prediction that captures dependencies between logical forms and grammar rules, (3) an automatic morphological augmentation using Wiktionary to improve handling of unseen words, and (4) efficient training, parsing, and generation algorithms—including a branch‑and‑bound inference procedure—that scale to realistic datasets. The work bridges the gap between probabilistic language modeling and semantic parsing, offering a promising direction for building parsers that can be deployed across diverse domains with minimal additional annotation effort.