The TCG CREST -- RKMVERI Submission for the NCIIPC Startup India AI Grand Challenge
In this report, we summarize the integrated multilingual audio processing pipeline developed by our team for the inaugural NCIIPC Startup India AI GRAND CHALLENGE, addressing Problem Statement 06: Language-Agnostic Speaker Identification and Diarisation, and subsequent Transcription and Translation System. Our primary focus was on advancing speaker diarization, a critical component for multilingual and code-mixed scenarios. The main intent of this work was to study the real-world applicability of our in-house speaker diarization (SD) systems. To this end, we investigated a robust voice activity detection (VAD) technique and fine-tuned speaker embedding models for improved speaker identification in low-resource settings. We leveraged our own recently proposed multi-kernel consensus spectral clustering framework, which substantially improved the diarization performance across all recordings in the training corpus provided by the organizers. Complementary modules for speaker and language identification, automatic speech recognition (ASR), and neural machine translation were integrated in the pipeline. Post-processing refinements further improved system robustness.
💡 Research Summary
The paper presents a complete multilingual audio‑processing pipeline built for the NCIIPC Startup India AI Grand Challenge (Problem Statement 06). The system tackles five tasks: speaker identification (SID), speaker diarization (SD), language identification (LID), automatic speech recognition (ASR), and neural machine translation (NMT). The authors focus on advancing diarization for code‑mixed Hindi, English, and Punjabi recordings.
For voice activity detection they adopt the pre‑trained Silero‑VAD model, achieving near‑perfect frame‑level precision (0.9956) and recall (0.9946). Speaker diarization is driven by a newly proposed multi‑kernel consensus spectral clustering (MK‑CSC) method that combines exponential and arccosine kernels with a 15‑nearest‑neighbour graph. Embeddings are extracted with an ECAPA‑TDNN model trained on VoxCeleb, using 1‑second windows with 0.5‑second overlap. This unsupervised clustering reduces diarization error rate (DER) to an average of 24.7 % across the mock dataset, outperforming conventional spectral clustering.
Speaker identification uses the same ECAPA‑TDNN embeddings. Enrollment is performed on a single speaker file (ID16.ogg); test segments are 3 seconds long and scored by cosine similarity against the enrollment centroid. A median filter smooths label switches, and an analytically derived threshold (Δ = 0.3147) yields an identification error rate (IER) of 8.34 %.
Language identification fine‑tunes a logistic‑regression classifier on VoxLingua107 ECAPA‑TDNN embeddings using 20 mock recordings. The system reaches a language DER of 21.4 % and reliably tags Hindi, Punjabi, and English segments.
ASR employs Whisper‑Small.en for English and ai4bharat IndicWhisper‑hi for Hindi and Punjabi. Segment‑level transcripts are concatenated to compute a file‑level word error rate (WER) of 0.7464; on well‑formed timestamps the WER is 0.7919.
For translation, the pipeline uses IndicTrans2 for Hindi→English and Opus‑MT for Punjabi→English, applying beam search (size = 5) and splitting code‑switched spans to match model inputs. The resulting BLEU score averages 0.209.
The authors evaluate each module on 21 recordings of varying formats, sample rates (8 kHz–192 kHz), bit depths, and SNRs (≥5 dB). Overall system metrics are: VAD accuracy ≈ 88.8 %, average DER ≈ 24.7 %, SID IER ≈ 20.2 %, LID DER ≈ 74.6 %, ASR WER ≈ 0.746, and NMT BLEU ≈ 0.209.
Strengths include the novel MK‑CSC clustering that improves diarization in multilingual, code‑mixed contexts, modular CSV‑based integration for reproducibility, and reliance on well‑tested open‑source components. Limitations involve reliance on a single enrollment recording, modest translation quality, and evaluation on a limited mock dataset that may not reflect real‑world noise and reverberation. The paper concludes that the pipeline demonstrates practical applicability for low‑resource Indian multilingual scenarios, while suggesting future work on real‑time processing, multi‑enrollment handling, and translation refinement.
Comments & Academic Discussion
Loading comments...
Leave a Comment