Far Out: Evaluating Language Models on Slang in Australian and Indian English

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Language models exhibit systematic performance gaps when processing text in non-standard language varieties, yet their ability to comprehend variety-specific slang remains underexplored for several languages. We present a comprehensive evaluation of slang awareness in Indian English (en-IN) and Australian English (en-AU) across seven state-of-the-art language models. We construct two complementary datasets: WEB, containing 377 web-sourced usage examples from Urban Dictionary, and GEN, featuring 1,492 synthetically generated usages of these slang terms across diverse scenarios. We assess language models on three tasks: target word prediction (TWP), guided target word prediction (TWP$^*$), and target word selection (TWS). Our results reveal three key findings: (1) higher average model performance on TWS versus TWP and TWP$^*$, with average accuracy increasing from 0.03 to 0.49; (2) stronger average model performance on WEB versus GEN, with average similarity scores higher by 0.03 and 0.05 on the TWP and TWP$^*$ tasks respectively; (3) en-IN tasks outperform en-AU when averaged across all models and datasets, with TWS showing the largest disparity, where average accuracy rises from 0.44 to 0.54. These findings underscore fundamental asymmetries between generative and discriminative competencies for variety-specific language, and show that slang remains challenging even in a technologically well-resourced language such as English.


💡 Research Summary

This paper investigates how well state‑of‑the‑art large language models (LLMs) understand slang that is specific to two English varieties: Indian English (en‑IN) and Australian English (en‑AU). The authors first construct two complementary evaluation sets. The WEB set contains 377 real‑world usage examples scraped from Urban Dictionary and manually validated by native speakers. The GEN set consists of 1,492 synthetically generated scenarios created with Google Gemini 2.5 Pro, which expands each slang term into four distinct contexts (different characters, settings, and dialogues). Diversity analyses using ROUGE scores and cosine similarity of all‑MiniLM‑L6‑v2 embeddings confirm that the generated scenarios are lexically varied and semantically dissimilar, reducing the risk of over‑fitting to a narrow set of contexts.
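The kind of diversity check described above can be sketched in a few lines. This is a minimal illustration, not the paper's pipeline: it uses a unigram ROUGE F1 and cosine similarity over sparse bag‑of‑words vectors as stand‑ins for the paper's ROUGE variants and all‑MiniLM‑L6‑v2 embeddings, and the example sentences are invented.

```python
from collections import Counter
from math import sqrt

def rouge1_f(candidate: str, reference: str) -> float:
    """Unigram ROUGE F1: harmonic mean of unigram precision and recall."""
    c, r = Counter(candidate.lower().split()), Counter(reference.lower().split())
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def cosine(u: dict, v: dict) -> float:
    """Cosine similarity of sparse bag-of-words vectors (a crude stand-in
    for the dense all-MiniLM-L6-v2 sentence embeddings used in the paper)."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Two invented generated scenarios for the same slang term: low lexical
# overlap between them suggests the generation step produced varied contexts.
s1 = "At the beach Mia said the surf was heaps good today"
s2 = "During the office meeting Raj called the new policy a proper headache"
print(rouge1_f(s1, s2))  # low score => lexically varied contexts
```

In the actual study, low pairwise ROUGE and embedding similarity across the four scenarios per term is what supports the claim that GEN is not over‑fit to a narrow set of contexts.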

Three downstream tasks are defined to probe model competence. (1) Target Word Prediction (TWP) masks the slang phrase and asks the model to generate any token(s) from its full vocabulary that fill the blank. (2) Guided TWP (TWP*) adds an explicit instruction to the prompt, directing the model to produce a slang term appropriate for the target variety. (3) Target Word Selection (TWS) presents a multiple‑choice format: the masked sentence plus four candidate phrases (the correct slang and three distractors drawn from the same variety). Encoder‑only models (BERT‑Base, RoBERTa‑Large, XLM‑RoBERTa‑Large) are evaluated via masked‑language‑model predictions; decoder‑only models (Granite‑1B, Llama‑3.2‑3B‑Instruct, Olmo‑2‑7B‑Instruct, Qwen‑3‑4B‑Instruct) are evaluated with a cloze‑style prompt at temperature 0.8. All models are run in 8‑bit quantized form on an Apple M1 Pro (16 GB RAM).
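The three task formats above differ only in how much guidance the prompt provides. The summary does not give the exact prompt wording used in the paper, so the templates below are hypothetical stand‑ins (including the example sentence and candidate slang terms) that illustrate the structural difference between open generation, guided generation, and multiple‑choice selection.

```python
def make_prompts(masked: str, variety: str, candidates: list[str]) -> dict:
    """Build illustrative prompts for TWP, TWP*, and TWS from a masked
    sentence. Templates are hypothetical, not the paper's exact wording."""
    # TWP: open-vocabulary cloze, no hint about slang or variety.
    twp = f"Fill in the blank: {masked}"
    # TWP*: same cloze plus an explicit variety-targeted instruction.
    twp_star = f"Fill in the blank with a slang term used in {variety} English: {masked}"
    # TWS: multiple choice over the gold slang and three same-variety distractors.
    options = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(candidates))
    tws = f"Choose the phrase that best fills the blank.\n{masked}\n{options}\nAnswer:"
    return {"TWP": twp, "TWP*": twp_star, "TWS": tws}

prompts = make_prompts(
    "That movie was ____, mate.",
    "Australian",
    ["bonza", "ripper", "chockers", "arvo"],  # invented gold + distractors
)
print(prompts["TWS"])
```

For the encoder‑only models the blank would instead be filled with the tokenizer's mask token and scored via masked‑language‑model predictions, while the decoder‑only models receive the cloze prompt directly.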

Performance is measured by two metrics: (a) accuracy – the proportion of instances where the model’s output exactly matches the gold slang phrase, and (b) semantic similarity – cosine similarity between Sentence‑BERT embeddings of the predicted phrase and the reference phrase (validated by a secondary embedding set, Granite‑embedding‑125m‑english, with an average Pearson correlation of 0.77 across all conditions).
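The two metrics can be sketched with stdlib Python. This is a toy illustration: the predictions and gold phrases are invented, and plain float lists stand in for the Sentence‑BERT embeddings the paper actually compares (the normalization inside the accuracy function is an assumption, not a detail from the summary).

```python
from math import sqrt

def exact_match_accuracy(preds: list[str], golds: list[str]) -> float:
    """Fraction of predictions that exactly match the gold slang phrase
    (lowercased and stripped here; the paper's normalization may differ)."""
    hits = sum(p.strip().lower() == g.strip().lower() for p, g in zip(preds, golds))
    return hits / len(golds)

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two embedding vectors; in the paper these are
    Sentence-BERT encodings of the predicted and reference phrases."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

preds = ["arvo", "heaps good", "servo"]   # invented model outputs
golds = ["arvo", "ripper", "servo"]       # invented references
print(exact_match_accuracy(preds, golds))  # 2 of 3 exact matches
```

The similarity metric is what lets the evaluation credit near‑miss generations (e.g. a synonym of the gold slang) that the strict exact‑match accuracy scores as zero.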

Averaged over the seven models, results differ strikingly across tasks. For open‑vocabulary generation (TWP) the mean accuracy is only 0.02 and similarity 0.27, indicating that models rarely generate the exact slang token. Guided generation (TWP*) improves modestly to 0.04 accuracy and 0.28 similarity, suggesting that an explicit instruction helps only slightly. In contrast, the discriminative multiple‑choice task (TWS) yields a mean accuracy of 0.49 and similarity of 0.61, demonstrating that models are considerably better at recognizing the correct slang when the candidate set is limited.

Domain effects are also evident. Models perform better on the WEB (real‑world) data than on the GEN (synthetically generated) data. For TWP, WEB accuracy is 0.04 versus 0.01 on GEN; for TWP* the drop is from 0.07 to 0.02; for TWS the decrease is modest (0.50 → 0.49). The authors attribute this to possible data contamination – the web examples may resemble text seen during pre‑training – whereas the generated scenarios differ in style and contextual cues, making them a harder test.

Variety effects are even more pronounced. Across all tasks, en‑IN consistently outperforms en‑AU. In TWP, en‑IN achieves 0.03 accuracy versus 0.02 for en‑AU; in TWS the gap widens to 0.54 versus 0.44 accuracy (a 0.10 absolute increase) and 0.69 versus 0.61 similarity. The authors hypothesize that Indian English slang may be better represented in the massive web corpora used to pre‑train these models, or that its stylistic patterns align more closely with the models’ learned distributions.

A qualitative error analysis focuses on the best‑performing model by similarity on GEN – Olmo‑2‑7B‑Instruct. Errors often involve the model producing a synonym or a morphological variant that is semantically close but not an exact match, which hurts the strict exact‑match accuracy metric. Multi‑sense slang terms also cause confusion; without sufficient contextual clues, the model sometimes selects a plausible but incorrect sense. The analysis confirms that while LLMs have a latent understanding of slang semantics, they lack precise lexical recall in generation mode.

The paper concludes that current LLMs are more adept at discriminating among candidate slang expressions than at generating them outright. This asymmetry highlights a gap in the models’ ability to handle non‑standard, community‑specific language. The authors recommend expanding pre‑training data to include more variety‑specific slang, developing evaluation benchmarks that combine both generative and discriminative tasks, and exploring prompting strategies that can better elicit exact slang recall. Their work provides the first systematic, variety‑aware benchmark for slang comprehension in English and underscores the need for more inclusive language technology that respects linguistic diversity beyond standard varieties.

