Category-Theoretic Quantitative Compositional Distributional Models of Natural Language Semantics
This thesis is about the problem of compositionality in distributional semantics. Distributional semantics presupposes that the meanings of words are a function of their occurrences in textual contexts. It models words as distributions over these contexts and represents them as vectors in high-dimensional spaces. The problem of compositionality for such models concerns how to produce representations for larger units of text by composing the representations of smaller units of text. This thesis focuses on a particular approach to this compositionality problem, namely the categorical framework developed by Coecke, Sadrzadeh, and Clark, which combines syntactic analysis formalisms with distributional semantic representations of meaning to produce syntactically motivated composition operations. This thesis shows how this approach can be theoretically extended and practically implemented to produce concrete compositional distributional models of natural language semantics. It furthermore demonstrates that such models can perform on par with, or better than, other competing approaches in the field of natural language processing. This thesis makes three principal contributions to computational linguistics. The first is to extend the DisCoCat framework on both the syntactic and semantic fronts, incorporating a number of syntactic analysis formalisms and providing learning procedures that allow for the generation of concrete compositional distributional models. The second is to evaluate the models developed from the procedures presented here, showing that they outperform other compositional distributional models in the literature. The third is to show that using category theory to solve linguistic problems forms a sound basis for research, illustrated by examples of work on this topic that also suggest directions for future research.
💡 Research Summary
The dissertation tackles the long‑standing gap between formal semantics, which maps syntactic structures to logical forms, and distributional semantics, which derives word meanings from statistical co‑occurrence patterns. Building on the DisCoCat (Distributional Compositional Categorical) framework introduced by Coecke, Sadrzadeh, and Clark, the author extends the original pregroup‑based approach in three major directions: syntactic coverage, learning algorithms, and empirical validation.
First, the thesis shows how to embed a variety of grammatical formalisms into categorical structures. Context‑Free Grammars (CFGs) are treated as product categories; a functor maps CFG derivations to morphisms in a compact closed category, turning parse trees into linear algebraic operations. Lambek grammars are interpreted as monoidal bi‑closed categories, allowing type‑raising and currying to be expressed as categorical morphisms. Finally, Combinatory Categorial Grammar (CCG) is incorporated by defining a functor that respects its combinatory rules, thereby unifying a broad class of syntactic phenomena under a single tensor‑based composition scheme. These extensions preserve the original intuition that syntax guides the way distributional vectors and tensors are combined.
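As an illustration of how a grammatical reduction becomes a linear-algebraic operation, the following minimal NumPy sketch contracts a transitive verb, modelled as an order-3 tensor, with subject and object vectors to produce a sentence-space vector. The dimensions and random vectors are placeholder assumptions for illustration, not the thesis's actual data or parameter choices:

```python
import numpy as np

# Hypothetical dimensions: noun space of size N, sentence space of size S.
N, S = 4, 3
rng = np.random.default_rng(0)

subj = rng.random(N)          # subject noun vector
obj = rng.random(N)           # object noun vector
verb = rng.random((N, S, N))  # transitive verb as an order-3 tensor in N ⊗ S ⊗ N

# The pregroup reduction n · (n^r s n^l) · n → s corresponds to contracting
# the verb tensor with the subject and object vectors:
sentence = np.einsum('i,isj,j->s', subj, verb, obj)
assert sentence.shape == (S,)
```

The point of the sketch is that the functor from syntax to semantics turns each step of the parse into a tensor contraction, so the shape of the composed representation is dictated by the grammatical type.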
Second, the work proposes concrete learning procedures for the semantic tensors. A high‑dimensional “sentence space” is defined, and lexical items are assigned tensor orders according to their grammatical types: nouns become vectors, transitive verbs become matrices, ditransitive verbs become third‑order tensors, and so forth. The core learning algorithm is a multi‑step linear regression that fits observed co‑occurrence statistics (extracted from large corpora) to the target tensors by minimizing squared error. When dimensionality reduction is required, the author introduces a “generalised Kronecker model” that approximates high‑order tensors with lower‑order ones while preserving the essential compositional behaviour. An efficient alternative that pre‑computes matrix‑vector products further reduces computational cost, making the approach feasible for realistic vocabulary sizes.
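The efficiency gain from a Kronecker-style approximation can be sketched with a toy example. Here a verb matrix is approximated by the outer product of the verb's own distributional vector with itself, and the algebra lets the full matrices be avoided entirely; the vectors are random placeholders and this illustrates the factorization, not the thesis's exact learning procedure:

```python
import numpy as np

d = 5
rng = np.random.default_rng(1)
subj, verb, obj = rng.random(d), rng.random(d), rng.random(d)

# Kronecker-style approximation (illustrative assumption): the verb matrix
# is the outer product of the verb's distributional vector with itself.
V = np.outer(verb, verb)
sent_matrix = np.outer(subj, obj) * V  # elementwise product with subject-object outer product

# Factorized form: (s ⊗ o) ⊙ (v ⊗ v) = (s ⊙ v) ⊗ (o ⊙ v), so the
# d×d matrices never need to be materialized explicitly.
sent_fast = np.outer(subj * verb, obj * verb)
assert np.allclose(sent_matrix, sent_fast)
```

Factorizations of this kind are what make tensor-based composition tractable at realistic vocabulary sizes: the cost drops from quadratic in the dimension per entry to a pair of elementwise products.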
Third, the dissertation presents three extensive experiments. Each experiment uses a different dataset: (1) intransitive sentences (subject + verb), (2) transitive sentences (subject + verb + object), and (3) adjective‑transitive constructions. Human similarity judgments serve as the gold standard. The author compares the extended DisCoCat models against additive, multiplicative, and earlier tensor‑based compositional models. Across all datasets, the DisCoCat variants achieve the highest Spearman correlation with human scores, with statistical significance (p < 0.05). The gains are especially pronounced for relational words (verbs, prepositions), demonstrating that the categorical wiring of syntactic types to tensor operations captures nuanced semantic interactions that simpler models miss.
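Evaluation against human judgments of this kind reduces to a rank correlation between model scores and averaged human scores. The sketch below computes Spearman's rho directly from ranks over invented placeholder scores (not the thesis's data), assuming no ties for simplicity:

```python
import numpy as np

def spearman_rho(a, b):
    # Spearman's rho is the Pearson correlation of the ranks
    # (this simple ranking assumes no tied scores).
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

# Hypothetical scores: model similarity per sentence pair (e.g. cosine of
# composed vectors) versus averaged human judgments on a 1-7 scale.
model_scores = np.array([0.91, 0.15, 0.64, 0.42, 0.77])
human_scores = np.array([6.2, 1.8, 4.9, 3.1, 5.5])

rho = spearman_rho(model_scores, human_scores)
print(f"Spearman rho = {rho:.3f}")  # identical rank orders here, so rho = 1.000
```

In practice libraries such as `scipy.stats.spearmanr` handle ties and significance testing; the manual version above just makes the rank-based nature of the metric explicit.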
The final chapter outlines future research avenues. “Distributional logic” is proposed as a way to interpret tensors as functional representations of logical operators, enabling logical inference within a distributional framework. Multi‑step linear regression is further explored for learning complex verb tensors in a staged fashion. The author also sketches how the categorical pipeline can be extended to further grammar formalisms and how non‑linear activation functions might be incorporated without breaking the underlying compact‑closed structure.
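The "distributional logic" idea can be illustrated with a toy encoding in which truth values are basis vectors and a logical operator is a matrix acting on them; the two-dimensional encoding below is an illustrative assumption rather than the thesis's definitive construction:

```python
import numpy as np

# Truth values as basis vectors of a 2-dimensional "truth space".
T = np.array([1.0, 0.0])  # ⊤
F = np.array([0.0, 1.0])  # ⊥

# Negation as the swap matrix: a linear map, hence a tensor,
# that exchanges the two basis vectors.
NOT = np.array([[0.0, 1.0],
                [1.0, 0.0]])

assert np.array_equal(NOT @ T, F)
assert np.array_equal(NOT @ F, T)
```

Binary connectives become higher-order tensors in the same spirit, which is what makes logical operators representable inside the tensor-based semantic framework at all.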
In conclusion, the thesis establishes that category theory provides a mathematically rigorous and practically effective bridge between syntactic analysis and distributional meaning. By broadening the syntactic frontiers, devising scalable learning algorithms, and delivering empirical results that surpass existing compositional distributional models, the work offers a solid foundation for future research at the intersection of formal semantics, statistical language modeling, and categorical mathematics.