A Computational Approach to Language Contact -- A Case Study of Persian
We investigate structural traces of language contact in the intermediate representations of a monolingual language model. Focusing on Persian (Farsi) as a historically contact-rich language, we probe the representations of a Persian-trained model when exposed to languages with varying degrees and types of contact with Persian. Our methodology quantifies the amount of linguistic information encoded in intermediate representations and assesses how this information is distributed across model components for different morphosyntactic features. The results show that universal syntactic information is largely insensitive to historical contact, whereas morphological features such as Case and Gender are strongly shaped by language-specific structure, suggesting that contact effects in monolingual language models are selective and structurally constrained.
💡 Research Summary
This paper investigates whether a monolingual Persian language model (ParsBERT) contains latent structural traces of Persian's long-standing contact with a variety of other languages. The authors treat the model's intermediate token representations as a window into historical and modern contact phenomena, probing these representations with two complementary techniques: an information-theoretic probe based on variational usable information (I_V) and an attribution analysis using Language Activation Probability Entropy (LAPE).
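To make the second technique concrete, the sketch below computes LAPE scores from a matrix of per-neuron activation probabilities. It follows my reading of the general LAPE recipe (the entropy of a neuron's activation probabilities after normalizing over languages), not the authors' released code, and all names (act_prob, lape_scores) are illustrative.

```python
# Hedged sketch of Language Activation Probability Entropy (LAPE).
# Assumption: act_prob[i, l] has already been estimated as the probability
# that neuron i activates (e.g. takes a positive value) on text in language l.
import numpy as np

def lape_scores(act_prob: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Return one LAPE value per neuron; low entropy => language-specific."""
    # Turn each neuron's row into a distribution over languages, then take
    # its Shannon entropy.
    p = act_prob / (act_prob.sum(axis=1, keepdims=True) + eps)
    return -(p * np.log(p + eps)).sum(axis=1)

# Toy usage: 4 neurons x 3 languages.
act_prob = np.array([
    [0.90, 0.02, 0.03],   # fires almost only on language 0 -> low LAPE
    [0.30, 0.35, 0.28],   # fires for every language        -> high LAPE
    [0.05, 0.80, 0.04],
    [0.20, 0.25, 0.22],
])
print(lape_scores(act_prob))
```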
The experimental setup uses the Parallel Universal Dependencies (PUD) treebanks, which provide aligned translations of 1,000 news and Wikipedia sentences into 21 languages. From these, eight languages are selected to span a spectrum of contact intensity with Persian: Turkish and Arabic (high, historical), Hindi (moderate, historical), Russian (moderate, regional), English, French, and German (modern, global), and Japanese (essentially no contact). This selection allows the authors to disentangle pure typological similarity from contact-driven convergence.
ParsBERT is a 12-layer encoder-only transformer (hidden size 768) pretrained on roughly 3.9 billion Persian tokens from diverse domains. Because the model never sees any of the target languages during pretraining, any alignment between its internal states and the structural patterns of the contact languages must arise from (i) Persian's own historical borrowing and calquing, (ii) the statistical imprint of loanwords and hybrid constructions present in modern Persian corpora, or (iii) universal linguistic regularities shared across languages.
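For readers who want to reproduce the extraction step described next, a minimal sketch using the Hugging Face transformers API is given below. The checkpoint name is an assumption (a publicly available ParsBERT model), not necessarily the exact model used in the paper.

```python
# Hedged sketch: load a Persian BERT encoder and expose all hidden states.
# "HooshvareLab/bert-fa-base-uncased" is an assumed public ParsBERT checkpoint.
import torch
from transformers import AutoTokenizer, AutoModel

name = "HooshvareLab/bert-fa-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_hidden_states=True).eval()

sentence = "..."  # a sentence from one of the PUD test sets
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# out.hidden_states is a tuple: the embedding layer plus 12 transformer
# layers, each of shape (batch, seq_len, 768).
print(len(out.hidden_states), out.hidden_states[-1].shape)
```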
For each token in the test sentences, the authors extract the embedding vectors from every layer. The I_V probe treats the embedding X as a random variable and a linguistic property Y (language identity, Universal POS tag, Case, or Gender) as the target. I_V(X→Y) is normalized by the marginal entropy H(Y) to yield a dimensionless score \hat{I}_V in [0, 1], where 0 means the embeddings expose no usable information about Y and 1 means Y is fully recoverable.
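Below is a minimal sketch of how such a normalized score can be estimated, assuming the standard usable-information recipe: H_V(Y|X) is the test cross-entropy of a probe trained on the embeddings, H_V(Y) is that of a "null" probe trained without them, and the difference is divided by the empirical label entropy H(Y). Logistic-regression probes and all variable names are my assumptions; the paper's probe family and training details may differ.

```python
# Hedged sketch of a normalized usable-information score, \hat{I}_V in [0, 1].
import numpy as np
from scipy.stats import entropy
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def normalized_iv(X_train, y_train, X_test, y_test):
    # H_V(Y | X): cross-entropy of a probe that sees the layer embeddings.
    probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    h_y_given_x = log_loss(y_test, probe.predict_proba(X_test), labels=probe.classes_)

    # H_V(Y): cross-entropy of a null probe that sees no input information.
    null = LogisticRegression(max_iter=1000).fit(np.zeros_like(X_train), y_train)
    h_y = log_loss(y_test, null.predict_proba(np.zeros_like(X_test)), labels=null.classes_)

    # Empirical marginal label entropy H(Y), in nats, for normalization.
    _, counts = np.unique(y_test, return_counts=True)
    return max(h_y - h_y_given_x, 0.0) / entropy(counts / counts.sum())

# Toy usage with synthetic "embeddings" and binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 16))
y = (X[:, 0] > 0).astype(int)
print(normalized_iv(X[:300], y[:300], X[300:], y[300:]))
```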