Language Family Matters: Evaluating LLM-Based ASR Across Linguistic Boundaries
Large Language Model (LLM)-powered Automatic Speech Recognition (ASR) systems achieve strong performance with limited resources by linking a frozen speech encoder to a pretrained LLM via a lightweight connector. Prior work trains a separate connector per language, overlooking linguistic relatedness. We propose an efficient and novel connector-sharing strategy based on linguistic family membership, enabling one connector per family, and empirically validate its effectiveness across two multilingual LLMs and two real-world corpora spanning curated and crowd-sourced speech. Our results show that family-based connectors reduce parameter count while improving generalization across domains, offering a practical and scalable strategy for multilingual ASR deployment.
💡 Research Summary
The paper “Language Family Matters: Evaluating LLM‑Based ASR Across Linguistic Boundaries” investigates how to make multilingual automatic speech recognition (ASR) systems that combine a frozen speech encoder with a large language model (LLM) more parameter‑efficient and robust. Existing approaches train a separate lightweight “connector” (or adapter) for each language, which scales poorly as the number of languages grows. The authors hypothesize that linguistic relatedness—specifically, membership in the same language family—can be exploited to share a single connector across multiple languages, thereby reducing model size while preserving or even improving transcription quality.
Two research questions guide the study: (RQ1) Does training connectors at the language‑level or the family‑level yield better multilingual ASR performance? (RQ2) How well do these connectors generalize across domains (i.e., when trained on one speech corpus and evaluated on another)?
Methodology
The experimental pipeline follows an Encoder‑Connector‑Decoder architecture. Whisper‑large‑v3 serves as the frozen speech encoder, and two distinct pretrained LLM decoders, Gemma‑2‑2b and Salamandra‑2b, are kept frozen throughout training. The connector consists of two linear projection layers with a GELU non‑linearity and is the only trainable component, a small fraction of the full model's parameter count. Training uses AdamW (lr = 1e‑4, weight decay = 1e‑6) for 10 epochs with early stopping.
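The connector's forward pass can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the encoder dimension matches Whisper‑large‑v3 (1280), but the LLM embedding dimension (2048) and the weight initialization are assumptions.

```python
import numpy as np

# Assumed dimensions: Whisper-large-v3 encoder outputs 1280-dim frames;
# 2048 is a guessed embedding size for a ~2B-parameter LLM.
ENC_DIM, LLM_DIM = 1280, 2048

rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.02, (ENC_DIM, LLM_DIM))
b1 = np.zeros(LLM_DIM)
W2 = rng.normal(0, 0.02, (LLM_DIM, LLM_DIM))
b2 = np.zeros(LLM_DIM)

def gelu(x):
    # tanh approximation of the GELU non-linearity
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def connector(speech_feats):
    """Project frozen-encoder features into the LLM embedding space
    via two linear layers with a GELU in between."""
    h = gelu(speech_feats @ W1 + b1)
    return h @ W2 + b2

feats = rng.normal(size=(50, ENC_DIM))  # 50 encoder frames for one utterance
out = connector(feats)
print(out.shape)  # (50, 2048)
```

Note that with these assumed sizes the two projections alone hold several million weights, which is still tiny next to a 2B‑parameter LLM; only these weights receive gradient updates.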
Datasets: the authors use two publicly available multilingual speech corpora—FLEURS and CommonVoice_22. To ensure balanced coverage, they select seven language families (Afro‑Asiatic, Austronesian, Dravidian, Indo‑European, Niger‑Congo, Turkic, Uralic) and sample up to five representative languages per family, capping each language’s training data at 100 hours. This yields roughly 40 languages across the families.
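The sampling scheme above can be expressed as a simple budget calculation. The family-to-language mapping below is a placeholder for illustration, not the paper's exact language list; only the 100‑hour cap and the up‑to‑five‑languages rule come from the summary.

```python
# Illustrative family membership (placeholder ISO codes, not the paper's list).
FAMILIES = {
    "Indo-European": ["de", "fr", "ru", "hi", "es"],
    "Dravidian": ["ta", "te", "kn", "ml"],
    "Uralic": ["fi", "hu", "et"],
    # ... remaining families omitted for brevity
}

MAX_LANGS_PER_FAMILY = 5   # "up to five representative languages per family"
MAX_HOURS_PER_LANG = 100   # per-language training-data cap

def training_budget(family):
    """Upper bound on pooled training hours for one family-level connector."""
    n_langs = min(len(FAMILIES[family]), MAX_LANGS_PER_FAMILY)
    return n_langs * MAX_HOURS_PER_LANG

print(training_budget("Dravidian"))  # 400
```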
Two connector variants are evaluated:
- LANG‑CONN – a language‑specific connector trained only on data from a single language.
- FAM‑CONN – a family‑level connector trained on the pooled data of all languages within a given family.
Additionally, a universal connector (UNICONN) trained on all languages together is introduced to test whether gains stem merely from larger data volume.
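The three variants differ only in how training data is pooled before fitting a connector. A minimal sketch of that pooling logic, with a toy corpus and placeholder family assignments:

```python
# Toy corpus: language code -> list of training clips (illustrative only).
corpus = {
    "de": ["de_clip_1", "de_clip_2"],
    "nl": ["nl_clip_1"],
    "ta": ["ta_clip_1"],
}
families = {"Germanic": ["de", "nl"], "Dravidian": ["ta"]}

def pool(variant, target=None):
    """Return the training pool for one connector under each strategy."""
    if variant == "LANG-CONN":   # one connector per language
        return corpus[target]
    if variant == "FAM-CONN":    # one connector per family
        return [c for lang in families[target] for c in corpus[lang]]
    if variant == "UNICONN":     # a single connector over all languages
        return [c for clips in corpus.values() for c in clips]
    raise ValueError(variant)

print(len(pool("LANG-CONN", "de")))       # 2
print(len(pool("FAM-CONN", "Germanic")))  # 3
print(len(pool("UNICONN")))               # 4
```

The UNICONN pool is strictly a superset of every FAM‑CONN pool, which is what lets the authors separate the effect of data volume from the effect of linguistic relatedness.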
Results – RQ1 (Granularity)
Across both LLM backbones and both corpora, FAM‑CONN outperforms LANG‑CONN in the majority of families. For example, with Salamandra on FLEURS, the Germanic family's WER drops from 23.37 % (LANG‑CONN) to 15.67 % (FAM‑CONN); the Romance family improves from 37.47 % to 11.15 %. Gains are especially pronounced in families with strong morphological and phonological similarity (Germanic, Romance, Slavic, Baltic). In high‑variance families such as Afro‑Asiatic and Dravidian, FAM‑CONN sometimes underperforms, indicating that genealogical grouping does not always align with acoustic similarity.
Parameter efficiency is notable: a single family connector replaces multiple language‑specific adapters, reducing total connector parameters by roughly 30‑60 % without sacrificing accuracy.
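The saving follows directly from replacing N per-language connectors with one shared connector of the same size. A back-of-the-envelope sketch (the formula is generic; the actual 30–60 % figure depends on which languages had their own baseline connectors):

```python
def connector_savings(n_lang_connectors, n_family_connectors):
    """Fractional reduction in total connector parameters when per-language
    connectors are replaced by per-family ones of the same size."""
    return 1 - n_family_connectors / n_lang_connectors

# e.g. a family of four languages served by one shared connector:
print(f"{connector_savings(4, 1):.0%}")  # 75%
```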
Results – RQ2 (Cross‑Domain Generalization)
When training on FLEURS and testing on CommonVoice (and vice‑versa), FAM‑CONN generally yields lower WERs than LANG‑CONN, demonstrating stronger robustness to domain shift. The Germanic family, for instance, sees WER fall from 124 % (LANG‑CONN) to 56 % (FAM‑CONN) in the CommonVoice‑to‑FLEURS direction. Similar improvements appear in Slavic and Romance families. However, certain families (e.g., Dravidian) still favor language‑specific adapters, likely due to high intra‑family phonetic diversity.
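For readers unfamiliar with the metric: WER is word-level edit distance divided by the reference length, so it can exceed 100 % when the hypothesis contains many insertions, which is why the 124 % figure above is legitimate. A generic sketch of the standard computation (not the paper's evaluation code):

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / |reference|."""
    r, h = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / len(r)

print(round(wer("the cat sat", "the cat sat down"), 3))  # 0.333
```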
UNICONN, despite being trained on the largest pooled dataset, performs worse than FAM‑CONN across all families, confirming that the advantage stems from linguistic relatedness rather than sheer data volume.
Discussion and Implications
The study validates the intuition that language families provide a natural inductive bias for multilingual ASR. Sharing a connector across related languages captures common acoustic‑phonetic patterns, leading to better generalization both within the same domain and across divergent domains. The approach also dramatically cuts the parameter budget, facilitating deployment in resource‑constrained settings.
Limitations include families with heterogeneous scripts or phonologies where a single connector may be too coarse. Future work could explore hierarchical sharing (sub‑family or language‑specific adapters on top of a family backbone), dynamic routing of language embeddings, or meta‑learning strategies to automatically discover optimal sharing granularity.
Conclusion
By systematically comparing language‑level and family‑level connectors across two LLM backbones and two multilingual speech corpora, the authors demonstrate that family‑based connector sharing offers a practical, scalable, and performance‑enhancing solution for multilingual LLM‑ASR. The findings encourage the community to incorporate linguistic taxonomy into model design, paving the way for more efficient and universally accessible speech technologies.