Syntactic Topic Models
The syntactic topic model (STM) is a Bayesian nonparametric model of language that discovers latent distributions of words (topics) that are both semantically and syntactically coherent. The STM models dependency parsed corpora where sentences are grouped into documents. It assumes that each word is drawn from a latent topic chosen by combining document-level features and the local syntactic context. Each document has a distribution over latent topics, as in topic models, which provides the semantic consistency. Each element in the dependency parse tree also has a distribution over the topics of its children, as in latent-state syntax models, which provides the syntactic consistency. These distributions are convolved so that the topic of each word is likely under both its document and syntactic context. We derive a fast posterior inference algorithm based on variational methods. We report qualitative and quantitative studies on both synthetic data and hand-parsed documents. We show that the STM is a more predictive model of language than current models based only on syntax or only on topics.
💡 Research Summary
The paper introduces the Syntactic Topic Model (STM), a Bayesian nonparametric framework that jointly captures semantic topics and syntactic dependencies in text. Traditional topic models such as LDA or HDP treat documents as bags of words, ignoring the grammatical structure that often guides word choice. Conversely, syntax‑only models (e.g., latent‑state grammar models) exploit dependency parses but lack a coherent representation of document‑level semantics. STM bridges this gap by assigning each word a latent topic that must be plausible under two complementary distributions: (1) a document‑level topic distribution θ_d, drawn from a Hierarchical Dirichlet Process (HDP) to allow an unbounded number of topics, and (2) a syntactic transition distribution φ_{parent→child}, which governs the probability that a child node in a dependency tree adopts a particular topic given its parent's topic. The generative process first samples a topic from the product of θ_d and φ, then emits the observed word from a topic‑specific word distribution. This "convolution" ensures that a word's topic is consistent with both the overall semantic theme of its document and the local syntactic context.
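The convolved generative step can be sketched in a few lines. This is a minimal toy illustration, not the paper's exact construction: the topic count `K`, vocabulary size `V`, and all distributions here are small hypothetical placeholders rather than values from the paper, and the infinite topic space is replaced by a finite one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small example: K topics over a V-word vocabulary.
K, V = 4, 50
topics = rng.dirichlet(np.ones(V) * 0.1, size=K)   # per-topic word distributions
theta_d = rng.dirichlet(np.ones(K) * 0.5)          # document-level topic weights
transition = rng.dirichlet(np.ones(K), size=K)     # parent topic -> child-topic distribution

def sample_node(parent_topic):
    """Draw a child's topic from the renormalized product ("convolution")
    of the document distribution and the parent's transition distribution,
    then emit a word from that topic."""
    p = theta_d * transition[parent_topic]  # combine document and syntax signals
    p /= p.sum()
    z = rng.choice(K, p=p)                  # latent topic for this tree node
    w = rng.choice(V, p=topics[z])          # observed word
    return z, w

root_z = rng.choice(K, p=theta_d)
child_z, child_w = sample_node(root_z)
```

Because the product renormalization downweights topics that are unlikely under either factor, a word's topic must be simultaneously plausible for the document and for its syntactic parent.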
The model’s non‑parametric nature means the number of topics is not fixed a priori; a GEM stick‑breaking construction provides a flexible prior over topic weights. Separate Dirichlet hyper‑parameters (α for document topics, β for syntactic transitions) control sparsity at each level, allowing the model to adapt to varying degrees of semantic and syntactic regularity across corpora.
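The GEM stick-breaking construction mentioned above can be written directly: repeatedly break off a Beta-distributed fraction of the remaining stick to obtain topic weights that sum to one in the limit. This is a standard truncated sketch; the truncation level and concentration value are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def gem_weights(alpha, truncation=100):
    """Truncated GEM stick-breaking construction:
    beta_k ~ Beta(1, alpha),  pi_k = beta_k * prod_{j<k} (1 - beta_j).
    Larger alpha spreads mass over more topics."""
    betas = rng.beta(1.0, alpha, size=truncation)
    # Length of stick remaining before each break.
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas[:-1])])
    return betas * remaining

pi = gem_weights(alpha=2.0)
```

With a reasonable truncation the leftover stick mass is negligible, so the finite vector `pi` closely approximates a draw from the infinite prior over topic weights.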
Exact inference is intractable because of the infinite topic space and the tree‑structured dependencies. The authors therefore develop a mean‑field variational Bayes algorithm. The variational distribution factorizes over document‑topic variables, syntactic‑transition variables, and word‑topic assignments. Updates are derived by maximizing the Evidence Lower Bound (ELBO). Crucially, the φ updates exploit the tree structure: a forward‑backward style message passing efficiently aggregates information from parent to children and vice versa. Hyper‑parameters α and β are also optimized within the variational EM loop. Empirically, the variational scheme is an order of magnitude faster than Gibbs sampling while achieving comparable posterior quality.
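The core coordinate-ascent step for a word's topic posterior can be sketched as below. This is a simplified finite-truncation illustration, not the authors' exact infinite-dimensional updates: `gamma_doc` and `gamma_trans_row` stand in for Dirichlet variational parameters of the document and transition distributions, and the message passing over the full tree is omitted.

```python
import numpy as np
from scipy.special import digamma

def expected_log_dirichlet(gamma):
    """E_q[log x_k] under q(x) = Dirichlet(gamma)."""
    return digamma(gamma) - digamma(gamma.sum())

def update_word_topic(gamma_doc, gamma_trans_row, log_word_lik):
    """One mean-field coordinate-ascent step for q(z_n): sum the expected
    log document weights, the expected log transition from the parent's
    topic, and the word log-likelihood, then renormalize."""
    log_q = (expected_log_dirichlet(gamma_doc)
             + expected_log_dirichlet(gamma_trans_row)
             + log_word_lik)
    log_q -= log_q.max()          # subtract max for numerical stability
    q = np.exp(log_q)
    return q / q.sum()

# Symmetric toy inputs: all K topics look identical, so q should be uniform.
K = 4
q = update_word_topic(np.ones(K), np.full(K, 0.5), np.log(np.full(K, 1.0 / 50)))
```

The additive combination in log space mirrors the multiplicative convolution in the generative model: a topic receives high posterior mass only when both the document-level and the parent-transition terms favor it.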
Experiments are conducted on two fronts. First, synthetic data are generated with known topic and transition matrices; STM successfully recovers both, demonstrating its ability to disentangle semantic and syntactic signals. Second, real‑world corpora (news articles and Wikipedia pages) are parsed using the Stanford Dependency Parser. STM’s performance is benchmarked against LDA, HDP, and a syntax‑only dependency‑based topic model. Evaluation metrics include perplexity, topic coherence, and a syntactic consistency score measuring parent‑child topic agreement. STM consistently yields lower perplexity (≈12‑15 % reduction) and higher coherence, and it achieves an 85 % agreement rate in syntactic consistency, indicating that topics are both semantically meaningful and syntactically appropriate.
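Of the metrics above, perplexity is the most standard and is computed from held-out token log-likelihoods; a quick sketch of the formula (not tied to any particular model) is:

```python
import numpy as np

def perplexity(token_log_probs, n_tokens):
    """Perplexity = exp(-(sum of held-out token log-likelihoods) / N).
    Lower values mean the model predicts the held-out text better."""
    return float(np.exp(-np.sum(token_log_probs) / n_tokens))

# Sanity check: a uniform model over a 50-word vocabulary has perplexity 50.
lp = np.log(np.full(10, 1.0 / 50))
ppl = perplexity(lp, 10)
```

Under this metric, the reported ~12-15 % reduction means STM assigns noticeably higher probability to unseen text than the LDA, HDP, and syntax-only baselines.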
The authors discuss limitations: the model’s reliance on accurate parses makes it vulnerable to parsing errors, especially in noisy or low‑resource languages. The variational updates become more computationally demanding as tree depth grows, suggesting a need for further scalability improvements. Moreover, the current implementation is tuned for relatively fixed‑order languages; languages with freer word order may require additional structural regularization.
Future directions include end‑to‑end learning that jointly optimizes parsing and topic assignment, multimodal extensions that incorporate visual or auditory signals, and fully unsupervised syntax discovery that would relax the dependence on external parsers.
In summary, STM demonstrates that integrating document‑level topic modeling with dependency‑based syntactic constraints yields a richer, more predictive representation of language. By simultaneously enforcing semantic coherence and syntactic plausibility, the model outperforms approaches that consider only one of these linguistic dimensions, opening avenues for more nuanced text analysis, information retrieval, and natural language generation applications.