Building Robust and Scalable Multilingual ASR for Indian Languages
Reading time: 5 minutes
...
📝 Original Info
Title: Building Robust and Scalable Multilingual ASR for Indian Languages
ArXiv ID: 2511.15418
Date: 2025-11-19
Authors: **SPRING Lab, Indian Institute of Technology Madras (IIT Madras)**
📝 Abstract
This paper describes the systems developed by SPRING Lab, Indian Institute of Technology Madras, for the ASRU MADASR 2.0 challenge. The systems focus on adapting ASR to better predict the language and dialect of each utterance among 8 languages spanning 33 dialects. We participated in Track 1 and Track 2, which restrict the use of additional data and require building multilingual systems from scratch. We present a novel training approach using a Multi-Decoder architecture with the phonemic Common Label Set (CLS) as an intermediate representation, which improved performance over the baseline (in the CLS space). We also discuss several methods for retaining the gains obtained in the phonemic space while converting back to the corresponding grapheme representations. Our systems beat the baseline in 3 languages (Track 2) in terms of WER/CER and achieved the highest language ID and dialect ID accuracy among all participating teams (Track 2).
📄 Full Content
The ASRU MADASR 2.0 challenge (https://sites.google.com/view/respinasrchallenge2025/home) presents a multilingual, multi-dialect dataset spanning 8 low-resource Indian languages. The challenge provides a dataset of 1200 hours (150 hours per language) and is evaluated over 4 tracks. The ASR systems developed are evaluated on hidden test sets with metrics such as word error rate (WER), character error rate (CER), language ID accuracy (LID accuracy) and dialect ID accuracy (DID accuracy). The 4 tracks impose different levels of restrictions; we participated in the following two.
• Track 1 allows only the use of the given 30-hour small subset per language, with no external data or models.
• Track 2 allows only the use of the given 120-hour large subset per language, with no external data or models.
This paper presents the ASR systems developed by SPRING Lab, IIT Madras for Tracks 1 and 2. We mainly focus on leveraging the phonemic similarities among Indian languages through a common label set [1] and discuss ways to retain the gains from the ASR system while converting back from the CLS space to the corresponding graphemic notations.
A very strong correlation exists between the phonemes and graphemes of Indian languages, which eases grapheme-to-phoneme (G2P) conversion. Moreover, the strong phonemic and graphemic similarities between many Indian languages stem from their common roots in related language families. For instance, among the languages used in this challenge, Bhojpuri, Magahi, Marathi, and Chhattisgarhi use the same Devanagari script, while Telugu and Kannada belong to the same Dravidian language family. We exploit these facts by converting the graphemic representations into a common phonemic space using the unified parser [2]. A standard set of labels, known as the Common Label Set, is assigned to phonetically similar speech sounds across different Indian languages. Examples of CLS representations across different languages are shown in Fig. 1. Reconstructing text in the native script from CLS representations is inherently challenging due to linguistic phenomena such as schwa deletion, geminate correction, and the intricacies of syllable segmentation. This paper discusses a few text-to-text machine transliteration (MT) approaches to retain the gains obtained by the CLS ASR.
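To make the idea concrete, the toy sketch below maps a word written in two different scripts to the same phoneme-like labels. The label strings and the tiny mapping table are invented for illustration only; the actual label inventory and conversion rules are defined by the Common Label Set [1] and the unified parser [2].

```python
# Illustrative only: a toy grapheme-to-CLS-style mapping for a single word.
# The labels below are invented for this sketch; the real Common Label Set
# and the unified parser [2] define the actual inventory and rules.

toy_g2p = {
    # Devanagari (used by Bhojpuri, Magahi, Marathi, Chhattisgarhi)
    "म": "m", "न": "n", "क": "k",
    # Kannada and Telugu share many phonemes despite different scripts
    "ಮ": "m", "మ": "m", "ನ": "n", "న": "n",
}

def to_cls(word: str) -> list[str]:
    """Map each known grapheme to its (toy) common phonemic label."""
    return [toy_g2p[ch] for ch in word if ch in toy_g2p]

# The same phoneme sequence is reached from different scripts, which is what
# lets a single multilingual ASR model share output units across languages.
print(to_cls("मन"), to_cls("ಮನ"))   # both -> ['m', 'n']
```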
We use the ESPnet toolkit [3] for all our experiments. Track 1 models are trained on the small dataset (approx. 240 hours), and Track 2 models are trained on the large dataset (approx. 1200 hours). We use 80-dimensional log-Mel spectrograms as speech features for all our experiments, and the unified parser to convert native text to CLS. For all our models, language ID and dialect ID information is passed as a special token <LID DID> at the beginning of both the CLS and native-script text.
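As a rough sketch of this preprocessing, assuming 16 kHz audio, a torchaudio-based feature extractor, and a `<LID_DID>` token spelling (the exact token format, Mel parameters, and file name are assumptions; ESPnet performs the equivalent steps internally from its config):

```python
# Minimal sketch of the input/target preparation described above.
# Not the ESPnet pipeline itself; parameters and token format are assumed.
import torch
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=80
)

def make_example(wav: torch.Tensor, cls_text: str, lid: str, did: str):
    # 80-dimensional log-Mel spectrogram with shape (frames, 80)
    feats = torch.log(mel(wav) + 1e-10).squeeze(0).transpose(0, 1)
    # Language-ID / dialect-ID special token prepended to the transcript
    target = f"<{lid}_{did}> {cls_text}"
    return feats, target

wav, sr = torchaudio.load("utt1.wav")            # hypothetical utterance
feats, target = make_example(wav, "m a n", "bhojpuri", "east")
```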
The baseline system follows a standard encoder-decoder architecture, comprising a Conformer [4] encoder and a Transformer decoder [5]. It takes log-Mel spectrograms as input to the encoder and generates output in the native script. A special token <LID> is prepended to the target text to indicate language identity. The baseline model does not predict dialects and is trained using a hybrid CTC-Attention loss [6].
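The hybrid objective interpolates a CTC loss on the encoder outputs with an attention (cross-entropy) loss on the decoder outputs. A minimal PyTorch sketch follows; the interpolation weight of 0.3 and the blank/padding conventions are assumptions, not values reported by the authors.

```python
# Sketch of a hybrid CTC-Attention objective [6]; weights and id conventions
# are assumptions for illustration.
import torch.nn.functional as F

def hybrid_loss(ctc_logits, feat_lens, ctc_targets, target_lens,
                dec_logits, att_targets, lam=0.3, blank_id=0, pad_id=-1):
    # CTC branch: frame-level log-probs over the encoder outputs
    log_probs = ctc_logits.log_softmax(-1).transpose(0, 1)      # (T, B, V)
    ctc = F.ctc_loss(log_probs, ctc_targets, feat_lens, target_lens,
                     blank=blank_id, zero_infinity=True)
    # Attention branch: token-level cross-entropy over the decoder outputs
    att = F.cross_entropy(dec_logits.reshape(-1, dec_logits.size(-1)),
                          att_targets.reshape(-1), ignore_index=pad_id)
    return lam * ctc + (1.0 - lam) * att
```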
In this approach, we employ a cascaded system comprising two models. The first model performs speech recognition in the CLS (Common Label Set) space: it takes log-Mel spectrograms as input and generates transcriptions in the CLS format. The second model handles machine transliteration by converting the CLS transcriptions into the target language’s native script. The two models are trained separately.
For speech recognition, we use an encoder-decoder architecture with a Conformer encoder and a Transformer decoder. The model is trained using a hybrid CTC-Attention loss function. The machine transliteration model also follows an encoder-decoder architecture, utilising Transformer layers for both the encoder and decoder, and is trained using only the attention-based cross-entropy loss.
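A minimal sketch of the cascaded decoding path, assuming two trained models wrapped behind a simple `decode(...)` interface (placeholder method names, not the actual ESPnet API):

```python
# Illustrative cascade: speech -> CLS text -> native-script text.
# The `decode` methods are placeholders assumed for this sketch.

def cascaded_recognize(speech_feats, asr_model, mt_model):
    # Stage 1: acoustic model produces the CLS phonemic transcription
    cls_hyp = asr_model.decode(speech_feats)
    # Stage 2: transliteration model maps CLS text to the native script
    native_hyp = mt_model.decode(cls_hyp)
    return native_hyp
```

Because the two stages are decoded independently, the transliteration model cannot consult the audio when resolving ambiguities; the multi-decoder architecture described next allows exactly that, through cross-attention to the speech encoder outputs.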
The multi-decoder architecture [7] consists of two sub-networks, as shown in Fig. 2: an ASR sub-network and an MT (machine transliteration) sub-network. The ASR sub-network employs a Conformer encoder followed by a Transformer decoder, while the MT sub-network uses a lightweight Transformer encoder and a Transformer decoder.
The ASR sub-network takes log-Mel spectrograms as input and generates hidden states through its decoder. These hidden states are passed directly to the encoder of the MT sub-network. The MT decoder then produces the final output in the target language’s native script. The ASR sub-network is trained using a hybrid CTC-Attention loss, where the CTC loss is computed with respect to the CLS (Common Label Set) transcription. The MT sub-network is trained using an attention-based cross-entropy loss, and the two sub-networks are trained jointly. Additionally, the MT decoder incorporates cross-attention over the speech encoder outputs, allowing it to access acoustic information directly.
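A structural sketch of this forward pass is shown below; the module interfaces and tensor shapes are assumed for illustration, since the actual sub-networks are built and jointly optimised inside ESPnet.

```python
# Structural sketch of the multi-decoder forward pass [7].
# Module call signatures and shapes are assumptions for illustration.
import torch.nn as nn

class MultiDecoderSketch(nn.Module):
    def __init__(self, speech_encoder, asr_decoder, ctc_head,
                 mt_encoder, mt_decoder):
        super().__init__()
        self.speech_encoder = speech_encoder   # Conformer encoder
        self.asr_decoder = asr_decoder         # Transformer decoder (CLS tokens)
        self.ctc_head = ctc_head               # linear projection for CTC over CLS labels
        self.mt_encoder = mt_encoder           # lightweight Transformer encoder
        self.mt_decoder = mt_decoder           # Transformer decoder (native script)

    def forward(self, feats, cls_tokens, native_tokens):
        # 1) Acoustic encoding of the log-Mel features
        enc_out = self.speech_encoder(feats)                    # (B, T, D)
        ctc_logits = self.ctc_head(enc_out)                     # CTC loss vs. CLS text
        # 2) ASR decoder produces hidden states for the CLS sequence
        asr_hidden, asr_logits = self.asr_decoder(cls_tokens, enc_out)
        # 3) MT sub-network consumes the ASR decoder hidden states ...
        mt_enc_out = self.mt_encoder(asr_hidden)
        # ... and its decoder also cross-attends to the speech encoder outputs
        mt_logits = self.mt_decoder(native_tokens, mt_enc_out, enc_out)
        # Training sums the hybrid CTC-Attention loss on the ASR side (CLS)
        # and the cross-entropy loss on the MT side (native script).
        return ctc_logits, asr_logits, mt_logits
```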