A Hierarchical and Attentional Analysis of Argument Structure Constructions in BERT Using Naturalistic Corpora
Notice: This research summary and analysis were automatically generated using AI technology. For exact details, please refer to the original arXiv source.

This study investigates how the Bidirectional Encoder Representations from Transformers (BERT) model processes four fundamental Argument Structure Constructions. We employ a multi-dimensional analytical framework that integrates MDS and t-SNE for dimensionality reduction, the Generalized Discrimination Value (GDV) as a cluster-separation metric, linear diagnostic probing, and Fisher Discriminant Ratio (FDR)-based attention analysis. Our results reveal a hierarchical representational structure: construction-specific information emerges in early layers, forms maximally separable clusters in middle layers, and is maintained through later processing stages.


💡 Research Summary

The paper investigates how BERT internally represents four core Argument Structure Constructions (ASCs)—Resultative, Caused‑Motion, Ditransitive, and Way—using a strictly controlled naturalistic corpus drawn exclusively from the fiction sections of the British National Corpus and the Corpus of Contemporary American English. A total of 800 sentences (200 per construction) were retrieved with POS‑based query patterns, manually verified, and annotated for key syntactic roles (subject, verb, direct object, indirect object, preposition, and the noun “way”). Each sentence was tokenized with the WordPiece tokenizer of bert‑base‑uncased; for multi‑token words, the first sub‑word embedding was taken as the representation of the whole word. Contextual embeddings and full attention weight matrices were extracted from all 12 transformer layers, yielding a layer‑wise dataset of 768‑dimensional vectors for each syntactic role and a set of attention maps for every head.
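The extraction step described above can be sketched as follows. This is a minimal illustration using the Hugging Face transformers library (an assumed tooling choice; the paper only names bert-base-uncased and WordPiece), with an illustrative example sentence; it shows how all 12 layers' hidden states and attention maps are obtained and how a multi-token word is represented by its first sub-word embedding.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained(
    "bert-base-uncased",
    output_hidden_states=True,   # hidden states from all layers
    output_attentions=True,      # attention maps from every head
)
model.eval()

sentence = "She pushed the door open."   # illustrative example, not from the corpus
encoded = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**encoded)

# 13 hidden-state tensors: input embeddings + 12 transformer layers,
# each of shape (batch, seq_len, 768).
hidden_states = outputs.hidden_states
# 12 attention tensors, each of shape (batch, 12 heads, seq_len, seq_len).
attentions = outputs.attentions

# First-sub-word pooling: represent a multi-token word by the embedding
# of its first WordPiece, as described in the paper.
word_ids = encoded.word_ids(0)           # maps token positions to word indices
first_subword_pos = {}
for pos, wid in enumerate(word_ids):
    if wid is not None and wid not in first_subword_pos:
        first_subword_pos[wid] = pos

# Layer-6 vector for word index 1 ("pushed"): a 768-dimensional embedding.
verb_vec = hidden_states[6][0, first_subword_pos[1]]
print(verb_vec.shape)
```

Repeating this over the 800 annotated sentences yields the layer-wise dataset of 768-dimensional role vectors and per-head attention maps used in the analyses below.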

To probe the geometry of these representations, the authors applied classic Multidimensional Scaling (MDS) and t‑Distributed Stochastic Neighbor Embedding (t‑SNE) to the CLS token embeddings at each layer. Visualizations show that early layers (1‑3) contain only vague separation, while middle layers (approximately 5‑7) produce the most distinct clusters for the four constructions; later layers retain this separation, albeit with slight diffusion. Quantitative cluster separation was measured with the Generalized Discrimination Value (GDV), which becomes increasingly negative and reaches its minimum (≈‑0.42) in layers 5‑7, indicating maximal inter‑class distance relative to intra‑class variance.
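The GDV computation can be illustrated with a small self-contained sketch. The formulation below (z-score each dimension, scale by one half, then compare mean intra-class to mean inter-class distances, normalized by the square root of the dimensionality) is one common version of the metric from the neural-network clustering literature; the paper's exact definition may differ in detail. Synthetic 2-D clusters stand in for the real CLS embeddings.

```python
import numpy as np
from itertools import combinations

def gdv(points, labels):
    """Generalized Discrimination Value (one common formulation).
    Negative values indicate well-separated clusters: intra-class
    distances are small relative to inter-class distances."""
    X = np.asarray(points, dtype=float)
    y = np.asarray(labels)
    # z-score each dimension, then scale by 1/2
    X = 0.5 * (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    classes = np.unique(y)
    D = X.shape[1]

    def mean_pairwise(A, B=None):
        if B is None:  # distances within one class
            d = [np.linalg.norm(a - b) for a, b in combinations(A, 2)]
        else:          # distances between two classes
            d = [np.linalg.norm(a - b) for a in A for b in B]
        return float(np.mean(d))

    intra = np.mean([mean_pairwise(X[y == c]) for c in classes])
    inter = np.mean([mean_pairwise(X[y == c1], X[y == c2])
                     for c1, c2 in combinations(classes, 2)])
    return (intra - inter) / np.sqrt(D)

# Four well-separated synthetic clusters (standing in for the four
# constructions) should yield a clearly negative GDV.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [5, 0], [0, 5], [5, 5]], dtype=float)
X = np.vstack([c + 0.3 * rng.standard_normal((50, 2)) for c in centers])
y = np.repeat(np.arange(4), 50)
gdv_value = gdv(X, y)
print(round(gdv_value, 3))  # clearly below zero for separated clusters
```

On the paper's real data, this kind of metric reaches its most negative values (≈ -0.42) in layers 5-7, quantifying the visually observed mid-layer cluster separation.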

Linear probing experiments were conducted by training a 4‑class Linear Support Vector Machine on the embeddings of three token types (CLS, VERB, OBJ) at each layer. Starting from layer 2, classification accuracy exceeds 85 % and peaks at around 92 % for the VERB token, demonstrating that construction category information is linearly decodable, especially in the verb representation.
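A minimal sketch of such a layer-wise probe, assuming scikit-learn's LinearSVC as the classifier (the paper specifies only a 4-class linear SVM). Synthetic Gaussian clouds stand in for the real 768-dimensional CLS/VERB/OBJ embeddings; in the actual experiment this probe is run once per layer and per token type.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_per_class, dim = 200, 768          # 200 sentences per construction, BERT width
# Simulate four construction classes as Gaussian clouds around random
# class means (a stand-in for one layer's real token embeddings).
means = rng.standard_normal((4, dim))
X = np.vstack([m + 0.8 * rng.standard_normal((n_per_class, dim)) for m in means])
y = np.repeat(np.arange(4), n_per_class)

# 4-class linear probe: high cross-validated accuracy indicates that
# construction category is linearly decodable from the embeddings.
probe = LinearSVC(C=1.0, max_iter=5000)
scores = cross_val_score(probe, X, y, cv=5)
print(f"mean 5-fold accuracy: {scores.mean():.3f}")  # far above 0.25 chance
```

Plotting this accuracy across the 12 layers reproduces the qualitative pattern reported in the paper: decodability jumps above chance early and peaks for the verb token.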

Attention analysis employed the Fisher Discriminant Ratio (FDR) on attention weight distributions across token pairs. The verb‑object pair shows a sharp rise in FDR from layer 3 onward, signifying that attention increasingly focuses on this relation for construction discrimination. The Way construction exhibits a uniquely high FDR for the verb‑“way” noun pair, revealing a distinct attentional schema that separates it from the other three constructions.
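The FDR computation can be sketched as follows, using the standard multiclass form (between-class variance of class means over mean within-class variance) applied to a scalar feature such as the verb-to-object attention weight of one head. The data here are synthetic stand-ins for the real per-sentence attention weights.

```python
import numpy as np

def fisher_discriminant_ratio(values, labels):
    """Multiclass Fisher Discriminant Ratio for a scalar feature
    (here: one token pair's attention weight per sentence).
    High values mean the classes are well separated on this feature."""
    v = np.asarray(values, dtype=float)
    y = np.asarray(labels)
    classes = np.unique(y)
    grand_mean = v.mean()
    between = np.mean([(v[y == c].mean() - grand_mean) ** 2 for c in classes])
    within = np.mean([v[y == c].var() for c in classes])
    return between / (within + 1e-12)

# Synthetic verb->object attention weights for four constructions:
# distinct class means give a high FDR, overlapping ones a low FDR.
rng = np.random.default_rng(1)
labels = np.repeat(np.arange(4), 200)
separated = np.concatenate(
    [rng.normal(loc=m, scale=0.02, size=200) for m in (0.1, 0.3, 0.5, 0.7)])
overlapping = rng.normal(loc=0.4, scale=0.05, size=800)

fdr_sep = fisher_discriminant_ratio(separated, labels)
fdr_ovl = fisher_discriminant_ratio(overlapping, labels)
print(fdr_sep, fdr_ovl)
```

In the paper, a rise in this ratio for the verb-object pair from layer 3 onward signals that attention to that relation increasingly distinguishes the constructions, while the verb-"way" pair produces a distinctively high FDR only for the Way construction.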

Overall, the study uncovers a hierarchical representational trajectory in BERT: low‑level layers encode surface form cues, middle layers integrate abstract syntactic‑semantic schemas yielding highly separable clusters, and higher layers preserve these schemas while supporting broader semantic processing. The findings align with prior work on BERT’s layer‑wise syntactic and semantic gradients but extend them by demonstrating that abstract constructional knowledge—central to Construction Grammar—emerges naturally from self‑supervised training on authentic text. The distinct treatment of the Way construction underscores that BERT can develop specialized schemata for constructions with particularly salient form‑meaning pairings.

Methodologically, the paper showcases a comprehensive probing pipeline that combines geometric visualization, scalar cluster metrics, linear decoding, and attention‑based discriminability. This integrated framework offers a robust template for future investigations of other linguistic phenomena (e.g., idioms, metaphor, complex clause embedding) within large language models, and it reinforces the view of such models as viable computational testbeds for cognitive‑linguistic theories.
