Do language models accommodate their users? A study of linguistic convergence

Notice: This research summary and analysis were automatically generated using AI. For absolute accuracy, please refer to the original arXiv source.

While large language models (LLMs) are generally considered proficient in generating language, how similar their language usage is to that of humans remains understudied. In this paper, we test whether models exhibit linguistic convergence, a core pragmatic element of human language communication: do models adapt, or converge, to the linguistic patterns of their user? To answer this, we systematically compare model completions of existing dialogues to original human responses across sixteen language models, three dialogue corpora, and various stylometric features. We find that models strongly converge to the conversation’s style, often significantly overfitting relative to the human baseline. While convergence patterns are often feature-specific, we observe consistent shifts in convergence across modeling settings, with instruction-tuned and larger models converging less than their pretrained and smaller counterparts. Given the differences in human and model convergence patterns, we hypothesize that the underlying mechanisms driving these behaviors are very different.


💡 Research Summary

The paper investigates whether large language models (LLMs) exhibit linguistic convergence—the tendency to adapt one’s speech style to that of an interlocutor—when interacting with human users. To do this, the authors construct a synthetic yet realistic experimental setup: they take existing human‑human dialogues from three English corpora (DailyDialog, NPR interview transcripts, and a movie script corpus), select conversations with at least six turns, and then ask a variety of LLMs to replace specific turns (starting at turn 6 and every other turn thereafter) with model‑generated responses. The model is always given at least five prior turns, so its output is directly conditioned on the most recent human utterance (rₜ₋₁). This design allows a direct, turn‑by‑turn comparison between the model’s response (rₜ) and the original human response that it replaces, as well as against two baselines: (1) the human baseline, measuring how much the original human turn already converges with its predecessor, and (2) a random baseline, where a randomly sampled utterance from the same dataset is inserted instead of rₜ.
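The turn-replacement schedule described above can be sketched as a small helper. This is a minimal illustration, assuming 1-based turn indexing; `replacement_turns` is a hypothetical function name, not from the paper.

```python
def replacement_turns(n_turns, start=6, step=2):
    """Indices (1-based) of the turns replaced by model output:
    turn 6, then every other turn thereafter, per the setup above."""
    return list(range(start, n_turns + 1, step))

# Each replaced turn t is scored against the preceding turn t-1,
# so the model is always conditioned on at least five prior turns.
print(replacement_turns(12))  # e.g. for a 12-turn dialogue
```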

Four stylometric metrics are used to quantify convergence:

  1. Utterance Length – a symmetric similarity score (LSM) based on the absolute difference in token count between rₜ and rₜ₋₁.
  2. LIWC Agreement – average similarity across eight LIWC 2007 functional‑word categories (personal pronouns, articles, conjunctions, prepositions, auxiliary verbs, frequent adverbs, negations, quantifiers).
  3. PROPN Overlap – percentage of proper nouns shared between rₜ and rₜ₋₁, intended as a proxy for topical alignment.
  4. Token Novelty – proportion of tokens in rₜ that are novel relative to rₜ₋₁; lower novelty indicates stronger lexical alignment.
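Two of these metrics can be sketched directly on whitespace-tokenized text. This is an illustrative reconstruction, assuming the standard LSM-style normalization 1 − |a − b| / (a + b) for the length score; the paper's exact tokenization may differ.

```python
def length_similarity(r_t, r_prev):
    """LSM-style length score: 1 - |a - b| / (a + b) over token counts.
    Symmetric; 1.0 means identical lengths. Normalization is assumed."""
    a, b = len(r_t.split()), len(r_prev.split())
    return 1.0 if a + b == 0 else 1.0 - abs(a - b) / (a + b)

def token_novelty(r_t, r_prev):
    """Proportion of tokens in r_t absent from r_prev.
    Lower novelty indicates stronger lexical alignment."""
    prev = set(r_prev.split())
    toks = r_t.split()
    return sum(tok not in prev for tok in toks) / len(toks)

print(length_similarity("a b", "a b c d c d"))   # counts 2 vs 6
print(token_novelty("the cat sat", "the cat ran"))  # 1 novel token of 3
```

LIWC Agreement and PROPN Overlap additionally require a LIWC lexicon and a POS tagger, so they are omitted from this sketch.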

The study evaluates sixteen models from two open‑source families: Gemma (1 B, 4 B, 12 B, 27 B) and Llama 3 (1 B, 3 B, 8 B, 70 B). For each size, both the pretrained checkpoint and an instruction‑tuned variant are tested, yielding a total of sixteen configurations. All models are run via HuggingFace with 8‑bit quantization for the largest checkpoint. Prompting is kept uniform: “Continue this conversation based on the given context,” followed by the full dialogue history (including any earlier model‑generated turns). Post‑generation cleaning removes whitespace artifacts and dialogue tags.
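The prompt construction and post-generation cleaning described above might look like the following sketch. The prompt wording is taken from the summary; the dialogue-tag pattern in `clean_response` is an assumption, since the paper's exact cleanup rules are not specified here.

```python
import re

def build_prompt(history):
    """Uniform prompt: instruction followed by the full dialogue history
    (including any earlier model-generated turns)."""
    return ("Continue this conversation based on the given context.\n"
            + "\n".join(history))

def clean_response(text):
    """Post-generation cleanup: trim whitespace artifacts and strip a
    leading dialogue tag (e.g. "A:"). The tag regex is a guess."""
    text = text.strip()
    return re.sub(r"^[\w ]{1,12}:\s*", "", text)

print(clean_response("  A: Sure, that works for me. "))
```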

Key Findings

  • Overall Convergence: Across all three corpora, LLMs consistently achieve convergence scores that meet or exceed the human baseline on most metrics. The strongest effects appear in Token Novelty and PROPN Overlap, where models often show significantly lower novelty and higher proper‑noun sharing than humans, suggesting a tendency to over‑align lexically with the immediate context.

  • Model Size and Tuning Effects: Larger models (e.g., Llama 3 70 B) and instruction‑tuned versions generally converge less than smaller, purely pretrained models. This pattern holds for all four metrics, indicating that scaling and fine‑tuning for instruction following may promote more generalized, less “copy‑cat” behavior.

  • Corpus‑Specific Patterns: DailyDialog and NPR exhibit clear LIWC‑based convergence (most functional‑word categories show significant differences from the random baseline), whereas the Movie corpus shows minimal LIWC effects—only auxiliary verbs reach significance. This aligns with prior observations that scripted movie dialogue is less spontaneous and thus displays weaker natural accommodation.

  • Feature‑Specific Variability: Not all convergence is uniform across features. Some models align strongly in length but not in LIWC categories; others show high proper‑noun overlap but retain relatively high token novelty. This mirrors the multifaceted nature of human accommodation, yet the variance profiles differ, hinting at distinct underlying mechanisms.

  • Over‑fitting Concern: The authors note that many models “over‑converge” relative to the human baseline, especially on lexical measures. This could be interpreted as the model memorizing statistical regularities of the training data and reproducing them verbatim when a strong contextual cue is present, rather than engaging in a socially motivated adaptation.

Interpretation and Implications

The authors argue that while LLMs do display measurable convergence, the drivers differ from human accommodation. Humans modulate convergence based on social factors (e.g., rapport, status, intent), whereas LLMs appear to react primarily to immediate statistical cues in the prompt. Over‑convergence may reduce conversational diversity and creativity, potentially harming user experience in open‑ended dialogue settings.

Future Directions Proposed

  1. Controlled Decoding Strategies: Develop methods (e.g., temperature scheduling, nucleus sampling adjustments) that can explicitly regulate the degree of stylistic alignment.
  2. Adaptive Fine‑Tuning: Incorporate user‑feedback signals or persona embeddings to steer convergence toward desired levels without sacrificing fluency.
  3. Human‑Centric Evaluation: Conduct user studies to link measured convergence with subjective metrics such as perceived empathy, trust, and satisfaction.
  4. Cross‑Linguistic Extension: Test whether similar patterns hold in non‑English corpora, where functional‑word distributions and proper‑noun usage differ markedly.

In sum, the paper provides a systematic, quantitative framework for assessing linguistic convergence in LLMs, reveals consistent but sometimes excessive alignment to user style, and highlights how model architecture, scale, and fine‑tuning influence this behavior. These insights are valuable for developers aiming to build more nuanced, socially aware conversational agents.

