Dialect Matters: Cross-Lingual ASR Transfer for Low-Resource Indic Language Varieties


We conduct an empirical study of cross-lingual transfer using spontaneous, noisy, and code-mixed speech across a wide range of Indic dialects and language varieties. Our results indicate that although ASR performance generally improves as the phylogenetic distance between languages decreases, this factor alone does not fully explain performance in dialectal settings. Often, fine-tuning on smaller amounts of dialectal data yields performance comparable to fine-tuning on larger amounts of phylogenetically related, high-resource standardized languages. We also present a case study on Garhwali, a low-resource Pahari language variety, and evaluate multiple contemporary ASR models. Finally, we analyze transcription errors to examine bias toward pre-training languages, providing additional insight into the challenges ASR systems face on dialectal and non-standardized speech.


💡 Research Summary

This paper presents a comprehensive empirical study of cross‑lingual transfer for low‑resource Indic language varieties, focusing on spontaneous, noisy, and code‑mixed speech in Devanagari script. The authors first evaluate a state‑of‑the‑art multilingual ASR model, IndicWav2Vec, which has been pre‑trained on 17,000 h of clean speech from 40 Indic languages and then fine‑tuned on Hindi. When tested on 30 Devanagari‑script varieties from the large VAANI dataset (≈150 k h of speech, ~10 % transcribed), the Hindi‑fine‑tuned model achieves a best‑case word error rate (WER) of 50.4 % on Hindi itself and similarly high error rates on many other languages, even those seen during pre‑training. This demonstrates that pre‑training on clean, read speech does not guarantee robust performance on real‑world, spontaneous, and code‑mixed utterances.
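The WER figures quoted throughout are word-level edit distance normalized by reference length. A minimal stand-alone sketch of that metric (our own illustration, not the authors' evaluation code) looks like:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Rolling-array dynamic program over word sequences.
    d = list(range(len(hyp) + 1))  # distances for the empty reference
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i       # prev holds the diagonal cell
        for j, h in enumerate(hyp, 1):
            cur = min(d[j] + 1,         # deletion
                      d[j - 1] + 1,     # insertion
                      prev + (r != h))  # substitution or match
            prev, d[j] = d[j], cur
    return d[len(hyp)] / len(ref)
```

A WER above 100% is possible when the hypothesis contains many insertions, which is common for the spontaneous, noisy speech described here.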

To investigate why performance varies, the authors quantify orthographic variability by measuring the proportion of hapax‑legomena (words occurring only once) and the type‑to‑token ratio in each test set. A strong positive correlation (Pearson ρ = 0.705, p = 4 × 10⁻⁴) is observed between the percentage of hapax‑legomena and WER, indicating that languages with less consistent spelling conventions (e.g., Thethi, Surjapuri) are substantially harder for the model. A weaker trend is seen for character error rate (CER).
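Both lexical-variability measures fall out of a simple token count. The sketch below (our own illustration; the hapax proportion is taken over word types, an assumption on our part) shows the computation:

```python
from collections import Counter

def lexical_variability(tokens: list[str]) -> tuple[float, float]:
    """Return (hapax proportion over types, type-to-token ratio)."""
    counts = Counter(tokens)
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return hapaxes / len(counts), len(counts) / len(tokens)
```

A test set with inconsistent spelling conventions inflates both numbers, since each spelling variant counts as a separate, often singleton, type.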

The core of the study examines cross‑lingual transfer when fine‑tuning on small amounts of dialectal data (1–7 h per language). Using the w2v‑bert‑2.0 model, the authors fine‑tune on each of several dialects or standard languages and evaluate zero‑shot performance across all Devanagari varieties. Across the full set of languages, phylogenetic distance between the fine‑tuning language and the evaluation language correlates positively with WER (Spearman ρ = 0.333, p = 1.1 × 10⁻⁷), confirming that closer linguistic relatives generally transfer better. However, when the evaluation is restricted to non‑standard dialects, this relationship weakens. Notably, models fine‑tuned on relatively low‑resource dialects such as Marwari and Magadhi (5–7 h) often outperform models fine‑tuned on high‑resource standard languages like Hindi, Marathi, or Rajasthani, even when the latter are phylogenetically closer to the test dialect. For example, a Marwari‑fine‑tuned model transfers well to Kumaoni (a Pahari variety) despite belonging to different sub‑families and being geographically distant. These findings suggest that the presence of dialectal acoustic and lexical patterns in the fine‑tuning data can be more beneficial than sheer data volume from a standard language.
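The Spearman ρ reported for the distance–WER relationship is simply Pearson correlation computed on ranks. A minimal tie-free sketch (illustrative only; real rank-correlation code must also handle ties):

```python
def spearman_rho(x: list[float], y: list[float]) -> float:
    """Spearman rank correlation, assuming no tied values."""
    def ranks(v):
        order = sorted(range(len(v)), key=v.__getitem__)
        r = [0] * len(v)
        for rank, idx in enumerate(order):
            r[idx] = rank
        return r

    rx, ry = ranks(x), ranks(y)
    mean = (len(x) - 1) / 2  # mean of ranks 0..n-1
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)  # identical for ry when tie-free
    return cov / var
```

A positive ρ here means greater phylogenetic distance goes with higher WER, i.e., worse transfer.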

A dedicated case study on Garhwali—a low‑resource Pahari language spoken in Uttarakhand—provides concrete evidence of the challenges. The authors evaluate several contemporary self‑supervised speech models on Garhwali data and conduct a detailed error analysis. Results show WERs exceeding 70 %, with systematic biases toward Hindi: the models frequently render Garhwali phonological variants as Hindi words, over‑recognize English insertions in code‑mixed segments, and misspell dialect‑specific tokens according to Hindi orthography. This bias quantification demonstrates that pre‑training language dominance can skew transcription in dialectal contexts.
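One simple way such lexical bias can be quantified is to count hypothesis tokens that belong to a standard-language lexicon but not to the dialect's. The function below is a hypothetical sketch of that idea, not the authors' actual diagnostic; the lexicon sets are assumed inputs:

```python
def standard_bias_rate(hyp_tokens: list[str],
                       standard_lexicon: set[str],
                       dialect_lexicon: set[str]) -> float:
    """Fraction of output tokens found only in the standard-language lexicon.

    A high value suggests the model rewrites dialect forms as
    standard-language (e.g., Hindi) words.
    """
    standard_only = [t for t in hyp_tokens
                     if t in standard_lexicon and t not in dialect_lexicon]
    return len(standard_only) / len(hyp_tokens)
```

Comparing this rate against the same statistic on reference transcripts separates genuine lexical overlap from model-introduced bias.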

The paper’s contributions are threefold: (1) a thorough empirical analysis showing that phylogenetic distance is a useful but insufficient predictor of transfer performance for dialectal speech; (2) the first detailed ASR evaluation and error analysis for Garhwali, highlighting the impact of orthographic inconsistency and code‑mixing; and (3) a diagnostic framework for measuring bias toward pre‑training languages in dialectal ASR. The work underscores that even modest amounts of dialect‑specific data can rival or surpass large amounts of standard language data for transfer, offering practical guidance for building ASR systems for the myriad low‑resource Indic varieties.

