From Linear Input to Hierarchical Structure: Function Words as Statistical Cues for Language Learning

What statistical conditions support learning hierarchical structure from linear input? In this paper, we address this question by focusing on the statistical distribution of function words. Function words have long been argued to play a crucial role in language acquisition due to their distinctive distributional properties, including high frequency, reliable association with syntactic structure, and alignment with phrase boundaries. We first use cross-linguistic corpus analysis to establish that all three properties are present across 186 languages. Next, we combine counterfactual language modeling with ablation experiments to show that language variants preserving all three properties are more easily acquired by neural learners, with frequency and structural association contributing more strongly than boundary alignment. Follow-up probing and ablation analyses further reveal that different learning conditions lead to systematically different reliance on function words, indicating that similar performance can arise from distinct internal mechanisms.


💡 Research Summary

The paper investigates the statistical conditions under which learners can acquire hierarchical syntactic structure from purely linear input, focusing on the role of function words. The authors identify three key distributional properties of function words that have long been hypothesized to aid acquisition: (i) high lexical frequency, (ii) reliable association with specific syntactic structures, and (iii) systematic alignment with phrase boundaries.

First, a large‑scale cross‑linguistic analysis is performed on the Universal Dependencies (UD) treebanks covering 186 languages. Function words are operationalized as the closed‑class POS tags (DET, ADP, CCONJ, SCONJ, AUX), excluding numerals and pronouns. For each language the authors compute (a) the share of vocabulary types and (b) the share of tokens accounted for by function versus content words. The results show a robust pattern: function words occupy a small proportion of types but a disproportionately large share of tokens, confirming high frequency. Next, they assess structural predictability by measuring the entropy of the POS tags that co‑occur as dependents of each function‑word tag; function‑word entropy is consistently lower than that of content words, indicating strong syntactic selectivity. Finally, they verify that function words tend to appear at the beginnings or ends of phrases across languages, establishing the boundary‑alignment property. Together, these analyses demonstrate that the three properties are universal rather than English‑specific.
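The frequency and selectivity measures can be sketched in a few lines of Python. This is an illustration under stated assumptions (tagged `(word, POS)` pairs and the closed-class UD tags named above), not the authors' actual analysis code:

```python
import math
from collections import Counter

# Closed-class UD POS tags treated as function words (per the paper's setup).
FUNCTION_TAGS = {"DET", "ADP", "CCONJ", "SCONJ", "AUX"}

def function_word_shares(tagged):
    """Return (token share, type share) of function words in a tagged corpus."""
    func_tokens = sum(1 for _, t in tagged if t in FUNCTION_TAGS)
    types = {w for w, _ in tagged}
    func_types = {w for w, t in tagged if t in FUNCTION_TAGS}
    return func_tokens / len(tagged), len(func_types) / len(types)

def pos_entropy(dependent_tags):
    """Shannon entropy (bits) over the POS tags observed as dependents."""
    counts = Counter(dependent_tags)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

tagged = [("the", "DET"), ("cat", "NOUN"), ("sat", "VERB"),
          ("on", "ADP"), ("the", "DET"), ("mat", "NOUN")]
print(function_word_shares(tagged))   # function words: many tokens, few types
print(pos_entropy(["NOUN", "NOUN"]))  # a fully selective word: entropy 0.0
```

Even this toy corpus shows the pattern the paper reports: the function words supply half the tokens (0.5) but only two of five types (0.4), and a tag that selects a single dependent category has zero entropy.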

To test whether these properties actually facilitate learning, the authors create manipulated versions of Wikipedia text. They keep overall token count, sentence‑length distribution, and content‑word inventory constant, but systematically vary each property:

  • Frequency conditions – NO FUNCTION (all function words removed), STANDARD (natural inventory), FIVE FUNCTION (a minimal inventory with one prototype per syntactic class), and MORE FUNCTION (each natural function word expanded into ten pseudowords).
  • Structural predictability – PHRASE DEPENDENCY (natural mapping of function words to syntactic frames), BIGRAM DEP (function word determined deterministically by the following content word), RANDOM DEP (function‑word identities shuffled while preserving positions).
  • Boundary alignment – AT BOUNDARY (natural placement) and WITHIN BOUNDARY (function words moved adjacent to their heads, breaking the phrase‑boundary cue).
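As a concrete illustration, the RANDOM DEP manipulation can be pictured as shuffling function-word identities over their original slots while leaving content words untouched. The sketch below is hypothetical (the paper's pipeline operates on full treebank-annotated Wikipedia text):

```python
import random

FUNCTION_TAGS = {"DET", "ADP", "CCONJ", "SCONJ", "AUX"}

def random_dep(tagged_sentence, seed=0):
    """Sketch of a RANDOM DEP-style corruption: function-word identities are
    shuffled across the function-word positions; content words stay in place."""
    rng = random.Random(seed)
    pool = [w for w, t in tagged_sentence if t in FUNCTION_TAGS]
    rng.shuffle(pool)
    it = iter(pool)
    return [next(it) if t in FUNCTION_TAGS else w for w, t in tagged_sentence]

sent = [("the", "DET"), ("dog", "NOUN"), ("ran", "VERB"),
        ("under", "ADP"), ("a", "DET"), ("bridge", "NOUN")]
shuffled = random_dep(sent, seed=3)
# Content words and function-word slots are preserved; only identities move,
# so frequency and boundary cues survive while structural association is destroyed.
```

The design point is that each manipulation breaks exactly one property while holding the others fixed, which is what licenses the causal comparison in the next section.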

For each condition a dedicated tokenizer is trained, and a GPT‑2 Small model is trained from scratch for ten epochs, with three random seeds per condition. Because the manipulations change the entropy of the training distribution, perplexity is not used for comparison. Instead, the authors adapt the BLiMP benchmark, applying the same transformations to the test sets and filtering out any minimal pairs that become ill‑defined under a given manipulation.
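The minimal-pair evaluation itself reduces to a simple comparison: a pair counts as correct when the model assigns higher probability to the grammatical member. A sketch with a placeholder scoring function (any sentence log-probability, such as one from the trained GPT-2 models, would slot in; the toy scores below are invented for illustration):

```python
def minimal_pair_accuracy(pairs, score):
    """BLiMP-style accuracy: fraction of (grammatical, ungrammatical) pairs
    where `score` (a sentence log-probability) prefers the grammatical one."""
    return sum(score(good) > score(bad) for good, bad in pairs) / len(pairs)

# Toy stand-in for a model's sentence log-probabilities (illustration only).
toy_scores = {"the cat sleeps": -5.0, "the cat sleep": -9.0,
              "dogs bark": -4.0, "dogs barks": -3.5}
pairs = [("the cat sleeps", "the cat sleep"), ("dogs bark", "dogs barks")]
print(minimal_pair_accuracy(pairs, toy_scores.get))  # 0.5
```

Because accuracy only depends on the sign of the score difference within each pair, it is comparable across conditions whose absolute perplexities differ, which is exactly why the authors prefer it over perplexity here.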

The evaluation reveals a clear hierarchy of importance. The STANDARD condition achieves the highest BLiMP accuracy. Removing function words (NO FUNCTION) or destroying structural predictability (RANDOM DEP) leads to the largest drops (10–30 percentage points). BIGRAM DEP, which preserves frequency but replaces structural cues with a simple bigram rule, also harms performance substantially, indicating that mere co‑occurrence statistics are insufficient. The boundary‑alignment manipulation (WITHIN BOUNDARY) yields a modest decline (≈5 percentage points), suggesting that boundary cues, while helpful, are less critical than frequency and structural association. Moreover, the FIVE FUNCTION and MORE FUNCTION conditions illustrate a “Goldilocks” effect: function words must be frequent enough to be reliable, yet sufficiently diverse to remain informative; both extreme reduction and excessive proliferation degrade learning.

Probing analyses further uncover that models reaching similar BLiMP scores can rely on different internal mechanisms. Attention‑probing shows varying degrees of focus on function‑word tokens across conditions. Ablation experiments that mask function words during inference cause larger performance drops in models trained with strong structural cues than in those trained with boundary cues, confirming distinct reliance patterns.
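The inference-time ablation can be pictured as masking every function word before re-scoring, so that any drop in accuracy isolates the model's reliance on function-word identity. A minimal sketch (the mask token and the paper's exact masking scheme are assumptions here):

```python
FUNCTION_TAGS = {"DET", "ADP", "CCONJ", "SCONJ", "AUX"}

def mask_function_words(tagged_sentence, mask="<mask>"):
    """Replace function words with a mask token, keeping sentence length fixed.
    Comparing scores before and after masking probes reliance on these tokens."""
    return [mask if t in FUNCTION_TAGS else w for w, t in tagged_sentence]

# Pronouns (PRON) are excluded from the function-word set, so "it" survives.
sent = [("the", "DET"), ("dog", "NOUN"), ("ran", "VERB"),
        ("under", "ADP"), ("it", "PRON")]
print(mask_function_words(sent))  # ['<mask>', 'dog', 'ran', '<mask>', 'it']
```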

In sum, the paper makes four major contributions: (1) it empirically validates that high frequency, structural selectivity, and phrase‑boundary alignment of function words are universal across a wide typological sample; (2) it demonstrates that neural language models can exploit these cues to learn hierarchical syntax from linear input; (3) it quantifies the relative contribution of each cue, establishing a clear hierarchy (frequency > structural association > boundary alignment) and a Goldilocks constraint on frequency; and (4) it shows that comparable behavioral performance does not guarantee identical internal representations, highlighting multiple viable routes to successful grammar acquisition. These findings have implications for theories of human language acquisition and for the design of more linguistically informed neural models.

