Linguistic Diversity and Digital Inclusion: Challenges for a Brazilian AI

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Linguistic diversity is a human attribute that is coming under threat with the advance of generative AI. Drawing on the contributions of sociolinguistics, this paper examines the consequences of the variety selection bias imposed by technological applications, and the vicious circle in which a variety becomes dominant and standardized because it already has the linguistic documentation needed to train large language models.


💡 Research Summary

The paper “Linguistic Diversity and Digital Inclusion: Challenges for a Brazilian AI” examines how Brazil’s artificial‑intelligence roadmap (the 2024‑2028 “AI for the Good of All” plan) risks reproducing linguistic bias if it relies solely on Portuguese, especially the prestige Brazilian variety, to train large language models (LLMs). Drawing on sociolinguistic research, the author first outlines the legal framework that recognises the many languages spoken in Brazil: the 1988 Constitution, the 2002 law on Brazilian Sign Language (Libras), the 2010 decree establishing the National Inventory of Linguistic Diversity (INDL), and the subsequent co‑officialisation of 23 indigenous languages. Despite this recognition, the national imagination and most digital initiatives treat Portuguese as the sole “national language,” marginalising indigenous languages, Afro‑Brazilian languages, immigrant languages, sign language, and creoles.

The paper stresses that Portuguese itself is pluricentric, with a wide range of regional and social varieties. Current software localisation and documentation typically address only two hegemonic standards (European Portuguese and Brazilian Portuguese), leaving countless sub‑varieties undocumented and absent from training corpora. This “variety selection bias” mirrors documented biases in English‑language LLMs, where African‑American English (AAE) and other non‑standard dialects trigger stereotyped or erroneous outputs. The author argues that similar dynamics will emerge in Brazil: models trained on a narrow Portuguese corpus will reinforce prestige norms, erase minority speech forms, and produce discriminatory responses in voice assistants, chatbots, and other AI‑mediated services.

To break this vicious cycle, the paper proposes a two‑pronged strategy. First, a national “Brazilian Linguistic Diversity Platform” should be created under the joint leadership of ABRALIN (the Brazilian Association of Linguistics) and the ANPOLL sociolinguistics working group. This platform would aggregate existing field‑work data—audio recordings, transcriptions, annotations—into a standardized, openly accessible repository with clear metadata schemas and storage protocols. Such an infrastructure would turn fragmented sociolinguistic resources into a reusable asset for AI development, ensuring that authentic linguistic data are available for model training, evaluation, and fine‑tuning.
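To make the idea of "clear metadata schemas" concrete, the following is a minimal sketch of what one repository record and its validation might look like. The paper does not specify a schema: the class `CorpusRecord`, every field name, the allowed media types, and the sample values are assumptions made for illustration only. The one external convention relied on is ISO 639‑3, in which `por` denotes Portuguese.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical set of media types; a real platform would define its own.
ALLOWED_MEDIA = {"audio", "video", "text"}


@dataclass(frozen=True)
class CorpusRecord:
    """One field-work item in a hypothetical diversity repository."""
    record_id: str
    language: str            # ISO 639-3 code, e.g. "por" for Portuguese
    variety: str             # free-text label chosen by the depositing team
    region: str              # where the data was collected
    media_type: str          # one of ALLOWED_MEDIA
    has_transcription: bool
    license: str             # reuse terms, e.g. a Creative Commons identifier

    def problems(self) -> List[str]:
        """Return a list of schema violations (empty if the record is valid)."""
        issues = []
        code = self.language
        if not (len(code) == 3 and code.isalpha() and code.islower()):
            issues.append("language must be a three-letter lowercase code")
        if self.media_type not in ALLOWED_MEDIA:
            issues.append(f"unknown media_type: {self.media_type}")
        for name in ("record_id", "variety", "region", "license"):
            if not getattr(self, name).strip():
                issues.append(f"{name} must not be empty")
        return issues


# Illustrative records only; none of these values come from the paper.
ok = CorpusRecord("rec-0001", "por", "hypothetical rural variety",
                  "Sergipe", "audio", True, "CC-BY-4.0")
bad = CorpusRecord("rec-0002", "Portuguese", "",
                   "Bahia", "hologram", False, "CC-BY-4.0")

print(ok.problems())
print(bad.problems())
```

Recording the variety and region on every item, rather than only the language, is what would later allow a training pipeline to sample and evaluate per variety.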

Second, LLM development pipelines must deliberately incorporate this diversified data. The author recommends balanced sampling across Portuguese regional varieties and systematic inclusion of non‑Portuguese languages, either as separate sub‑models or via multilingual/multimodal training techniques. Bias‑detection metrics should be integrated into the evaluation stage, measuring performance per variety and per language to prevent over‑optimisation on the dominant variety. By doing so, AI systems can respect the principles of fairness, equity, and inclusion articulated in the national AI plan.
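The two technical recommendations above, balanced sampling and per‑variety evaluation, can be sketched in a few lines. This is an illustration under stated assumptions rather than the author's method: the function names, the `"variety"` key, the labels `"A"` and `"B"`, and the choice to oversample small varieties with replacement are all hypothetical, and the accuracy gap is only one of several possible disparity measures.

```python
import random
from collections import defaultdict


def balanced_sample(examples, per_variety, seed=0):
    """Draw `per_variety` examples from each variety.

    Assumption: varieties with fewer examples than requested are
    oversampled with replacement, so every variety ends up equally
    represented regardless of how much documentation it has.
    """
    rng = random.Random(seed)
    pools = defaultdict(list)
    for example in examples:
        pools[example["variety"]].append(example)

    sample = []
    for variety in sorted(pools):
        pool = pools[variety]
        if len(pool) >= per_variety:
            sample.extend(rng.sample(pool, per_variety))
        else:
            sample.extend(rng.choices(pool, k=per_variety))
    return sample


def accuracy_by_variety(results):
    """Compute accuracy separately per variety.

    `results` is an iterable of (variety, is_correct) pairs.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for variety, is_correct in results:
        totals[variety] += 1
        correct[variety] += int(is_correct)
    return {v: correct[v] / totals[v] for v in totals}


def disparity_gap(per_variety_accuracy):
    """Difference between the best- and worst-served variety."""
    scores = list(per_variety_accuracy.values())
    return max(scores) - min(scores)


# Toy data: variety "A" is well documented, variety "B" is not.
corpus = [{"variety": "A", "text": f"a{i}"} for i in range(10)]
corpus += [{"variety": "B", "text": f"b{i}"} for i in range(2)]
train = balanced_sample(corpus, per_variety=4)

scores = accuracy_by_variety(
    [("A", True), ("A", True), ("B", True), ("B", False)]
)
gap = disparity_gap(scores)
print(scores, gap)
```

Reporting the gap alongside the overall score keeps a strong result on the dominant variety from hiding poor performance on the others, which is the over‑optimisation the paragraph above warns against.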

The paper concludes that linguistic diversity is not a peripheral concern but a core component of Brazil’s digital sovereignty. Without intentional policy and technical measures, Brazil’s AI initiatives risk perpetuating linguistic hierarchies, undermining cultural heritage, and excluding vulnerable communities from the benefits of digital transformation. Embracing a comprehensive, data‑driven approach to language diversity will enable Brazil to build ethically sound, socially responsive AI that truly reflects the nation’s multilingual reality.

