A model of language inflection graphs


Inflection graphs are highly complex networks representing relationships between inflectional forms of words in human languages. For so-called synthetic languages, such as Latin or Polish, they have a particularly interesting structure due to the abundance of inflectional forms. We construct the simplest form of inflection graph, namely a bipartite graph in which one group of vertices corresponds to dictionary headwords and the other group to inflected forms encountered in a given text. We then study the projection of this graph onto the set of headwords. The projection decomposes into a large number of connected components, which we call word groups. The distribution of word-group sizes exhibits some remarkable properties, resembling the cluster-size distribution in lattice percolation near the critical point. We propose a simple model which produces graphs of this type, reproducing the desired component distribution and other topological features.


💡 Research Summary

The paper investigates the structural properties of inflection graphs for highly synthetic languages such as Latin and Polish. An inflection graph is defined as a bipartite network G = (H, I, E) where H is the set of dictionary headwords and I is the set of all inflected forms that appear in a given text; an edge connects a headword to each of its inflected forms. For Latin the authors built a graph (G_LA) containing 28 092 headwords, 1 028 972 inflected forms and 1 077 806 edges; a comparable Polish graph (G_PL) was also constructed.

Projecting G onto the headword set yields a simple graph G′ in which two headwords are linked whenever they share at least one inflected form. The connected components of G′, called “word groups”, exhibit a striking size distribution: the number of groups of size s follows a power law n_s ∝ s^−τ with τ≈3.1 for Latin and τ≈4.3 for Polish. This scaling resembles the cluster‑size distribution at the percolation threshold of a lattice, although both exponents are larger than the mean‑field value of the Fisher exponent (τ = 5/2), which is also the value for critical Erdős–Rényi random graphs.
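The projection and the word-group count n_s can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `word_group_sizes`, the dictionary input format, and the five-entry Latin toy lexicon are assumptions made for the example. It uses a union-find structure so that the projection G′ never has to be built explicitly.

```python
from collections import Counter, defaultdict

def word_group_sizes(forms_by_headword):
    """Return {group size s: number of word groups n_s}.

    forms_by_headword maps each headword to the set of its inflected forms,
    i.e. it is the bipartite graph G stored as adjacency sets (assumed format).
    """
    parent = {h: h for h in forms_by_headword}

    def find(h):
        # Follow parent pointers to the representative, halving the path.
        while parent[h] != h:
            parent[h] = parent[parent[h]]
            h = parent[h]
        return h

    # Invert the bipartite graph: inflected form -> headwords that produce it.
    headwords_by_form = defaultdict(list)
    for headword, forms in forms_by_headword.items():
        for form in forms:
            headwords_by_form[form].append(headword)

    # Headwords sharing a form are adjacent in G', so they join one group.
    for headwords in headwords_by_form.values():
        root = find(headwords[0])
        for h in headwords[1:]:
            parent[find(h)] = root

    group_size = Counter(find(h) for h in parent)   # representative -> size
    return dict(Counter(group_size.values()))       # size s -> n_s

# Hypothetical toy lexicon: "amor" is a form of both amo and amor,
# and "regis" is a form of both rex and rego.
toy_lexicon = {
    "amo": {"amo", "amas", "amor"},
    "amor": {"amor", "amoris"},
    "rex": {"rex", "regis"},
    "rego": {"rego", "regis", "regit"},
    "puella": {"puella", "puellae"},
}
sizes = word_group_sizes(toy_lexicon)
```

On the toy lexicon this gives two word groups of size 2 and one of size 1. Estimating τ would then amount to fitting the resulting n_s values on a log-log scale, which is left out here.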

Further topological analysis shows that G′ does not behave like an Erdős–Rényi random graph at criticality. Its degree distribution is exponential rather than Poisson, with an average degree ≈1.8, and its k‑core decomposition reveals a rich hierarchy of highly clustered cores, in contrast to the narrow core spectrum of critical Erdős–Rényi graphs.
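For readers unfamiliar with k‑core decomposition, the sketch below computes the core number of every vertex by repeatedly peeling off a vertex of minimum remaining degree. It is a simple quadratic-time illustration under the assumption that G′ is stored as a dictionary of neighbour sets; the name `core_numbers` and the small example graph are hypothetical and not taken from the paper.

```python
def core_numbers(adj):
    """Core number of each vertex of an undirected simple graph.

    adj maps every vertex to the set of its neighbours (assumed format).
    A vertex has core number k if it belongs to the k-core but not to
    the (k+1)-core.
    """
    degree = {v: len(neighbours) for v, neighbours in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Peel a vertex with the smallest remaining degree.
        v = min(remaining, key=degree.__getitem__)
        k = max(k, degree[v])      # core numbers never decrease while peeling
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return core

# A triangle (a, b, c) with one pendant vertex d attached to c.
example = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
cores = core_numbers(example)
```

Here the triangle forms the 2‑core and the pendant vertex has core number 1. A wide spread of core numbers across the components of G′ is what the summary describes as a rich core hierarchy.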

To reproduce these empirical findings, the authors propose a two‑stage stochastic model. In the first stage, each headword is assigned a random number x_i drawn from a mixture of three normal distributions, that is, a weighted sum of three normal densities (representing verbs, nouns/adjectives, and other word classes). The absolute value of x_i determines how many inflected‑form vertices are attached to that headword, creating a collection of star‑like subgraphs. In the second stage, a set of random “bridges” is added between stars, allowing different headwords to share inflected forms. The parameters of the normal mixture (weights, means, variances) are calibrated to match the observed degree distribution of the real inflection graph; small variations in these parameters do not affect the emergence of the power‑law component‑size distribution.
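The two stages described above can be sketched as follows. The summary does not give the calibrated mixture parameters or the exact bridge rule, so several details here are assumptions: the `MIXTURE` values are placeholders, |x_i| is rounded to an integer with a minimum of one form, and a bridge is taken to be an extra edge from a uniformly chosen headword to a uniformly chosen inflected form. The names `generate_inflection_graph` and `n_bridges` are hypothetical.

```python
import random

# Placeholder (weight, mean, standard deviation) triples for the three word
# classes. These are NOT the calibrated values from the paper.
MIXTURE = [(0.3, 40.0, 10.0), (0.5, 10.0, 3.0), (0.2, 1.0, 0.5)]

def generate_inflection_graph(n_headwords, n_bridges, mixture=MIXTURE, seed=0):
    """Return a synthetic bipartite graph as {headword id: set of form ids}."""
    rng = random.Random(seed)
    weights = [w for w, _, _ in mixture]
    forms_by_headword = {}
    next_form = 0

    # Stage 1: one star per headword, sized by |x_i| with x_i from the mixture.
    for headword in range(n_headwords):
        _, mean, std = rng.choices(mixture, weights=weights)[0]
        x = rng.gauss(mean, std)
        n_forms = max(1, round(abs(x)))   # assumption: round, at least one form
        forms_by_headword[headword] = set(range(next_form, next_form + n_forms))
        next_form += n_forms

    # Stage 2: random bridges. Each bridge attaches a headword to a form that
    # usually belongs to another star (assumed rule; a bridge that lands in
    # the headword's own star changes nothing).
    for _ in range(n_bridges):
        headword = rng.randrange(n_headwords)
        form = rng.randrange(next_form)
        forms_by_headword[headword].add(form)

    return forms_by_headword

synthetic = generate_inflection_graph(n_headwords=1000, n_bridges=300, seed=42)
```

With `n_bridges=0` the stars are disjoint and every word group has size 1; increasing the number of bridges merges stars into larger groups. The bridge density at which the component sizes become power-law distributed is not reproduced by this sketch.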

Simulations of the model generate synthetic graphs whose projected headword networks display the same exponential degree distribution, multi‑core clustering pattern, and power‑law component‑size exponent as the empirical Latin and Polish graphs. Thus the model captures the essential topological features of inflection graphs, suggesting that the observed scaling arises from a simple combination of heterogeneous star formation (reflecting grammatical categories) and sparse random inter‑star connections.

Overall, the work bridges linguistic morphology and complex‑network theory, providing a quantitative framework for understanding the combinatorial richness of synthetic languages and offering a basis for improved lexical counting, disambiguation, and text‑processing algorithms.

