From Data Scarcity to Data Care: Reimagining Language Technologies for Serbian and other Low-Resource Languages
Large language models are commonly trained on dominant languages like English, and their representation of low-resource languages typically reflects cultural and linguistic biases present in the source-language materials. Using the Serbian language as a case, this study examines the structural, historical, and sociotechnical factors shaping language technology development for low-resource languages in the AI age. Drawing on semi-structured interviews with ten scholars and practitioners, including linguists, digital humanists, and AI developers, it traces challenges rooted in the historical destruction of Serbian textual heritage and intensified by contemporary issues that drive reductive, engineering-first approaches prioritizing functionality over linguistic nuance. These issues include superficial transliteration, reliance on English-trained models, data bias, and dataset curation lacking cultural specificity. To address these challenges, the study proposes Data Care, a framework grounded in the CARE principles (Collective Benefit, Authority to Control, Responsibility, and Ethics) that reframes bias mitigation from a post hoc technical fix to an integral component of corpus design, annotation, and governance. It positions Data Care as a replicable model for building inclusive, sustainable, and culturally grounded language technologies in contexts where traditional LLM development reproduces existing power imbalances and cultural blind spots.
💡 Research Summary
The paper “From Data Scarcity to Data Care: Reimagining Language Technologies for Serbian and other Low‑Resource Languages” investigates why language technologies for low‑resource languages (LRLs) remain under‑developed, using Serbian as a case study. The authors combine historical analysis, sociotechnical context, and semi‑structured interviews with ten scholars and practitioners (linguists, digital humanists, AI developers, etc.) to trace the roots of data scarcity and the prevailing “engineering‑first” mindset that prioritises quick functionality over linguistic nuance.
Historically, centuries of Ottoman occupation and repeated wars eroded Serbian textual heritage, most severely in the 1941 bombing of the National Library of Serbia, which destroyed half a million volumes, including medieval Cyrillic manuscripts. This loss, compounded by limited digitisation, restrictive copyright regimes, and weak institutional support, has left Serbian with a severely depleted textual record. Modern digital readiness is relatively high (broad internet penetration, a national AI strategy, a supercomputing center), yet Serbian is still classified as “fragmentarily supported” by the European Language Equality framework because of the paucity of corpora, tools, and coordinated funding.
Interviewees highlighted four major barriers: (1) insufficient high‑quality data due to historical loss and current digitisation gaps; (2) reliance on English‑trained multilingual models that only superficially capture Serbian’s complex morphology, dual scripts, and dialectal variation; (3) superficial transliteration and translation pipelines that ignore cultural context, thereby cementing bias; and (4) centralized data governance that excludes local experts and community authority, limiting ethical oversight.
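One concrete way a shallow transliteration pipeline can go wrong is sketched below. This is an illustration only, not a method from the paper: the function names are hypothetical, and the example assumes lowercase input for the Latin-to-Cyrillic direction. Cyrillic-to-Latin conversion is a plain letter mapping, but the reverse is ambiguous because the Latin digraphs lj, nj, and dž usually stand for one Cyrillic letter and occasionally for two.

```python
# Serbian Cyrillic -> Latin letter mapping (30 letters, lowercase).
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}
LAT_TO_CYR = {lat: cyr for cyr, lat in CYR_TO_LAT.items()}


def cyr_to_lat(text: str) -> str:
    """Letter-by-letter Cyrillic -> Latin; unknown characters pass through."""
    out = []
    for ch in text:
        lat = CYR_TO_LAT.get(ch.lower())
        if lat is None:
            out.append(ch)
        elif ch.isupper():
            out.append(lat.capitalize())  # e.g. Љ -> Lj
        else:
            out.append(lat)
    return "".join(out)


def naive_lat_to_cyr(text: str) -> str:
    """Greedy Latin -> Cyrillic that always merges lj/nj/dž (lowercase only).

    Deliberately naive: it has no lexicon, so it cannot tell a true
    digraph from two adjacent letters.
    """
    out, i = [], 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in ("lj", "nj", "dž"):
            out.append(LAT_TO_CYR[pair])
            i += 2
        else:
            out.append(LAT_TO_CYR.get(text[i], text[i]))
            i += 1
    return "".join(out)


word = "инјекција"                       # "injection": н and ј are separate letters
latin = cyr_to_lat(word)                 # "injekcija"
round_trip = naive_lat_to_cyr(latin)     # merges n+j into њ
print(latin, round_trip, round_trip == word)
```

The round trip does not restore the original word, which is why a script-conversion step typically needs a lexicon or exception list rather than a character table alone.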
To address these challenges, the authors propose a new “Data Care” framework grounded in the CARE principles: Collective Benefit, Authority to Control, Responsibility, and Ethics. Unlike post-hoc bias-mitigation techniques, Data Care embeds bias awareness throughout the data lifecycle: (a) collaborative digitisation of surviving manuscripts with rich metadata; (b) inclusive corpus construction that deliberately balances Cyrillic/Latin scripts, Ekavian/Ijekavian variants, and dialectal data; (c) mandatory involvement of local linguists, cultural anthropologists, and community representatives in annotation and validation; (d) transparent governance structures (e.g., a Data Stewardship Board) that grant data contributors decision-making power over access, reuse, and commercial exploitation; and (e) an ethics checklist that enforces privacy, cultural sensitivity, and equitable benefit sharing.
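The script-balancing goal in point (b) could be checked with a small audit like the one below. This is a minimal sketch under stated assumptions, not the authors' tooling: the function names and the 0.9 dominance threshold are hypothetical, and a document is classified only by the Unicode names of its letters.

```python
import unicodedata
from collections import Counter


def dominant_script(text: str, threshold: float = 0.9) -> str:
    """Return 'cyrillic', 'latin', 'mixed', or 'none' for one document.

    The threshold (an assumed value) is the share of letters that must
    belong to one script for the document to count as that script.
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            counts["cyrillic"] += 1
        elif name.startswith("LATIN"):
            counts["latin"] += 1
    if not counts:
        return "none"
    script, n = counts.most_common(1)[0]
    return script if n / sum(counts.values()) >= threshold else "mixed"


def script_balance(docs: list[str]) -> dict[str, float]:
    """Share of documents per script label across a corpus."""
    labels = Counter(dominant_script(d) for d in docs)
    total = sum(labels.values())
    return {label: n / total for label, n in labels.items()}


docs = ["Добар дан", "Dobar dan", "Како си?", "Šta radiš?"]
print(script_balance(docs))
```

A report like this only surfaces imbalance; deciding what ratio is appropriate, and extending the audit to Ekavian/Ijekavian or dialectal variation, would still rest with the linguists and community representatives the framework calls for.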
The paper also discusses the political dimension of language in the post-Yugoslav space. Although Serbian, Croatian, Bosnian, and Montenegrin are grouped under a single macrolanguage (ISO 639-3 code hbs), nationalistic policies favor separate, nation-specific models, which hinders the creation of a larger shared dataset that could improve model performance for all four. The authors argue for regional collaboration that respects sovereign language rights and suggest joint initiatives such as a “Balkan Language Resource Hub.”
In conclusion, Data Care reframes the development of language technologies for LRLs as a socio‑technical endeavor that must integrate historical awareness, community authority, and ethical stewardship from the outset. By doing so, it offers a replicable roadmap for building inclusive, sustainable, and culturally grounded language models that can restore digital sovereignty to languages like Serbian and, by extension, to the world’s many under‑represented tongues.