Digital Linguistic Bias in Spanish: Evidence from Lexical Variation in LLMs
This study examines the extent to which Large Language Models (LLMs) capture geographic lexical variation in Spanish, a language that exhibits substantial regional variation. Treating LLMs as virtual informants, we probe their dialectal knowledge using two survey-style question formats: Yes-No questions and multiple-choice questions. To this end, we exploit a large-scale, expert-curated database of Spanish lexical variation. Our evaluation covers more than 900 lexical items across 21 Spanish-speaking countries and is conducted at both the country and dialectal-area levels. Across both evaluation formats, the results reveal systematic differences in how LLMs represent Spanish language varieties. Lexical variation associated with Spain, Equatorial Guinea, Mexico & Central America, and the La Plata River area is recognized more accurately by the models, while the Chilean variety proves particularly difficult for the models to distinguish. Importantly, differences in the volume of country-level digital resources do not account for these performance patterns, suggesting that factors beyond data quantity shape dialectal representation in LLMs. By providing a fine-grained, large-scale evaluation of geographic lexical variation, this work advances the empirical understanding of dialectal knowledge in LLMs and contributes new evidence to discussions of Digital Linguistic Bias in Spanish.
💡 Research Summary
This paper investigates how well large language models (LLMs) capture geographic lexical variation in Spanish, a language with extensive regional differences. The authors treat LLMs as “virtual informants” and probe their dialectal knowledge using two survey‑style question formats: a Yes‑No format (YNQF) and a Multiple‑Choice format (MCQF). The gold‑standard reference is VARILEX, an expert‑curated database that documents 934 lexical items and 9,057 variants across 21 Spanish‑speaking countries, organized into eight dialectal areas.
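To make the two formats concrete, the sketch below builds prompts from VARILEX-style records. The exact Spanish wording used in the paper is not reproduced in this summary, so these templates (and the field names `variant`, `country`, `concept`) are illustrative assumptions, not the authors' prompts.

```python
def ynqf_prompt(variant: str, country: str) -> str:
    """Yes-No format: ask whether a variant is commonly used in a country.
    Template wording is assumed; the paper's actual phrasing is not given here."""
    return (f"En {country}, ¿se usa habitualmente la palabra "
            f"'{variant}'? Responde solo 'Sí' o 'No'.")

def mcqf_prompt(concept: str, variants: list[str]) -> str:
    """Multiple-choice format: list all variants for an item and ask for the
    numbers of those the model would normally use. The paper randomizes option
    order to mitigate position effects; shuffling is omitted here for brevity."""
    options = "\n".join(f"{i + 1}. {v}" for i, v in enumerate(variants))
    return (f"¿Cuáles de estas palabras usarías normalmente para "
            f"'{concept}'? Devuelve solo los números.\n{options}")
```

A YNQF query such as `ynqf_prompt("frutilla", "Chile")` expects a single "Sí"/"No" token, while an MCQF query over all documented variants of an item expects a set of option numbers, which is why the two formats call for different metrics (F1 vs. an overlap coefficient).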
For YNQF, each prompt asks the model whether a specific lexical variant is commonly used in a given country, requiring a binary “Sí” or “No” answer. Performance is measured with the standard F1 score. For MCQF, the model is presented with all variants for an item and must return the option numbers of the variants it would normally use; because multiple variants can be predominant, the authors evaluate overlap using an adjusted Jaccard coefficient (J_adj) that corrects for chance agreement based on a hypergeometric null model.
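The chance correction can be sketched as follows. The summary does not reproduce the paper's exact formula, so this assumes the standard correction J_adj = (J − E[J]) / (1 − E[J]), where E[J] is the expected Jaccard overlap when the model picks its k options uniformly at random from the n variants of an item (a hypergeometric null).

```python
from math import comb

def expected_jaccard(n: int, g: int, k: int) -> float:
    """E[J] under a hypergeometric null: k of n options chosen uniformly at
    random, g of the n options are gold. The intersection size X follows
    Hypergeometric(n, g, k), and J = X / (g + k - X)."""
    total = comb(n, k)
    e = 0.0
    for x in range(max(0, g + k - n), min(g, k) + 1):
        p = comb(g, x) * comb(n - g, k - x) / total
        if g + k - x > 0:
            e += p * x / (g + k - x)
    return e

def adjusted_jaccard(gold: set, pred: set, n: int) -> float:
    """Chance-corrected Jaccard over option numbers; n is the total number of
    variants listed for the item. 1.0 = perfect, 0.0 = chance level."""
    union = len(gold | pred)
    j = len(gold & pred) / union if union else 1.0
    ej = expected_jaccard(n, len(gold), len(pred))
    return (j - ej) / (1 - ej) if ej < 1 else 0.0
```

Under this correction a random guesser scores near 0 regardless of how many variants an item has, which is what makes the reported baseline (0.110) comparable across items with different option counts.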
The study evaluates three OpenAI models accessed via API: GPT‑4o, GPT‑5.1, and GPT‑5.2. In YNQF, GPT‑4o achieves the highest F1 (0.514), with GPT‑5.1 (0.499) and GPT‑5.2 (0.480) close behind; all three outperform a naïve baseline (always answering “Sí”) by roughly a factor of two. In MCQF, GPT‑5.1 attains the best J_adj (0.338), GPT‑5.2 follows (0.336), and GPT‑4o records 0.314, again far above the baseline (0.110).
Country‑level analysis reveals systematic patterns. Variants associated with Spain, Equatorial Guinea, Mexico & Central America, and the La Plata River area (Argentina, Uruguay, Paraguay) are recognized with high accuracy (e.g., Spain F1 = 0.723). In contrast, the Chilean variety is poorly captured (YNQF F1 = 0.372; MCQF J_adj = 0.080), indicating a pronounced blind spot. The authors also examine whether the volume of digital resources per country explains these differences. Correlation analyses show that resource quantity does not account for the performance gaps; for example, Bolivia and Chile have relatively modest web corpora yet exhibit markedly lower scores than some better‑resourced countries. This suggests that factors beyond sheer data volume—such as data curation practices, web‑crawling policies, and the representativeness of regional language on the internet—drive the observed digital linguistic bias (DLB).
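The resource-volume check above amounts to a rank correlation between per-country corpus size and per-country score. The summary specifies neither the statistic used nor the corpus figures, so both the choice of Spearman's ρ and the numbers in the usage comment below are illustrative assumptions.

```python
def rank(xs: list[float]) -> list[float]:
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of tied positions, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x: list[float], y: list[float]) -> float:
    """Spearman's rho: Pearson correlation of the rank vectors."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx)
           * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

# Usage (toy numbers, NOT the paper's data): a rho near 0 across
# countries would mirror the finding that corpus size alone does not
# predict dialectal accuracy.
# rho = spearman(corpus_sizes, f1_scores)
```

A weak or non-significant ρ is exactly the pattern the authors report: Bolivia and Chile have similarly modest web corpora yet land at very different performance levels.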
Methodologically, the paper contributes a large‑scale, expert‑grounded evaluation framework that bridges traditional dialectology and modern LLM assessment. The adjusted Jaccard metric is a novel, chance‑corrected measure suitable for multi‑label lexical tasks with variable numbers of correct answers. The authors also report that, despite randomizing answer order to mitigate option‑position effects, models tend to favor the most frequent variants, highlighting the importance of prompt engineering and post‑processing when deploying LLMs for region‑sensitive applications.
In sum, the study provides robust evidence that LLMs encode Spanish dialectal knowledge unevenly, excelling for some regions while largely ignoring others, particularly Chilean Spanish. The findings underscore that digital linguistic bias is not merely a function of data quantity but also of data quality, collection strategies, and model architecture. The authors call for future work to expand the evaluation to additional languages and dialects, incorporate multimodal data, and develop bias‑mitigation techniques such as balanced sampling and targeted fine‑tuning to achieve more equitable representation of linguistic diversity in AI systems.