TICL+: A Case Study On Speech In-Context Learning for Children's Speech Recognition


Authors: Haolong Zheng, Yekaterina Yegorova, Mark Hasegawa-Johnson

University of Illinois at Urbana-Champaign
{haolong2, yay2, jhasegaw}@illinois.edu

ABSTRACT

Children's speech recognition remains challenging due to substantial acoustic and linguistic variability, limited labeled data, and significant differences from adult speech. Speech foundation models can address these challenges through Speech In-Context Learning (SICL), allowing adaptation to new domains without fine-tuning. However, the effectiveness of SICL depends on how in-context examples are selected. We extend an existing retrieval-based method, Text-Embedding KNN for SICL (TICL), introducing an acoustic reranking step to create TICL+. This extension prioritizes examples that are both semantically and acoustically aligned with the test input. Experiments on four children's speech corpora show that TICL+ achieves up to a 53.3% relative word error rate reduction over zero-shot performance and 37.6% over baseline TICL, highlighting the value of combining semantic and acoustic information for robust, scalable ASR in children's speech.

Index Terms: In-context learning, automatic speech recognition, large multimodal models

1. INTRODUCTION

More than half of the children served under the Individuals with Disabilities Education Act (IDEA), approximately 3.4 million children, require speech and language services. Children with speech or language-related concerns risk falling behind in their academic and social-emotional development [1]. Typically, the earlier these concerns can be identified and addressed with ability-based interventions, the greater the likelihood that these children will thrive academically and socio-economically.
However, due to the substantial imbalance between the number of Speech and Language Pathologists (SLPs) and the children who require their services, there has been growing interest in automating these tasks to improve the efficiency of screening for language disorders [2]. The success of such automation depends heavily on the accuracy and robustness of the automatic speech recognition (ASR) systems integrated into these pipelines.

ASR for children's speech remains a low-resource task and exhibits a notable performance gap when off-the-shelf ASR systems are applied directly, due to the substantial acoustic and linguistic variability inherent in children's speech, including inter-speaker variability due to differing developmental rates and intra-speaker variability resulting from underdeveloped pronunciation skills [3–8]. The resulting performance degradation is significant, as these sources of variability are largely absent from the data used to train large-scale ASR models. To address this challenge, transfer learning techniques have been employed to apply knowledge from adult ASR systems to children's ASR, specifically by fine-tuning models such as Whisper [9, 10] and Wav2Vec2 [11] with children's speech data. To mitigate data bias when fine-tuning self-supervised learning models with data from a different domain than the pretraining data, [12] proposed a Domain-Responsible Adaptation and Fine-Tuning strategy and reported improvements in word error rate (WER) across multiple speech models when fine-tuned with a children's speech dataset [12, 13]. Beyond fine-tuning methods, in-context learning (ICL) [14, 15] has emerged as a flexible adaptation paradigm for large language models (LLMs) that mitigates catastrophic forgetting and eliminates the need for parameter updates.
SICL studies typically rely on random sampling to select in-context examples [16–18]; however, previous work has demonstrated that the selection of in-context examples strongly influences the performance of ICL [19–21]. To make the selection of in-context examples more targeted, we previously introduced a Text-Embedding KNN for SICL (TICL) pipeline that first generates a pseudo-label for the test sample and then retrieves semantically similar demonstrations, enhancing SICL performance [22]. This method depends on the pseudo-labels, whose quality can be substantially degraded in low-resource scenarios such as children's speech. In these settings, incorporating acoustic similarity as an additional similarity measure can prove beneficial. To address this, we extend the retrieval stage of TICL by introducing an acoustic-based reranking step that prioritizes demonstrations with acoustic characteristics closer to the test utterance, resulting in the proposed TICL+ pipeline. This dual-criteria selection strategy, illustrated in Fig. 1, improves context construction for SICL in low-resource speech domains such as children's speech.

Fig. 1. Overview of the TICL+ pipeline: ① pseudo-labelling of the input audio, ② Top-K context retrieval from the candidate pool, ③ acoustic reranking of the top 300 samples, ④ input preparation (concatenating the retrieved audio-label demonstration pairs with the input audio), and ⑤ inference with the frozen large multimodal model.

2. METHODOLOGY

2.1. Speech In-Context Learning

Rather than updating the model parameters, ICL adapts a model to a target domain by conditioning on demonstrations drawn from the target domain.
SICL extends text-based ICL by conditioning jointly on paired audio and text tokens. Given a test speech sample $s^*$, a model $\Lambda$ generates a transcription $\hat{y}$ conditioned on a context $C$:

$$\hat{y} = \arg\max_{y} \Pr(y \mid C, x_s^*, \Lambda),$$

where $x_s^*$ denotes the audio encoding of $s^*$. The context $C$ consists of query-answer pairs $c^{(i)} = (q^{(i)}, a^{(i)})$, where each query $q^{(i)}$ is an encoded audio segment, and each answer $a^{(i)}$ is the transcription of that audio segment.

2.2. Text-Embedding KNN Candidate Selection

The TICL pipeline [22] introduces a text-embedding-based KNN candidate selection method designed to identify an effective context $C$ for SICL for a given test sample. To construct $C$, TICL retrieves speech-transcription pairs whose transcriptions are lexically similar to the test utterance from a candidate dataset $\mathcal{C} = \{(s^{(i)}, y^{(i)})\}_{i=1}^{N}$, where $s^{(i)}$ denotes the speech audio and $y^{(i)}$ its corresponding transcription.

A frozen text encoder $\phi: \mathcal{Y}_{\text{text}} \to \mathbb{R}^d$ maps each transcription $y$ to a $d$-dimensional sentence embedding. The $\ell_2$-normalized embedding is defined as

$$\bar{\phi}(y) = \frac{\phi(y)}{\|\phi(y)\|_2}.$$

For each candidate $c_i \in \mathcal{C}$, its normalized embedding is precomputed as $\bar{z}^{(i)} = \bar{\phi}(y^{(i)})$. During inference, the ground-truth transcription of the test utterance $s^*$ is unavailable. Instead, a pseudo-transcription $\tilde{y} = f_\theta(s^*)$ is generated using a frozen ASR model $f_\theta: \mathcal{X}_{\text{audio}} \to \mathcal{Y}_{\text{text}}$, where $\mathcal{X}_{\text{audio}}$ and $\mathcal{Y}_{\text{text}}$ denote the audio and text spaces, respectively.

The pseudo-label is then encoded into a normalized lexical embedding $\bar{z}^* = \bar{\phi}(\tilde{y})$. To select relevant in-context examples, we compute the Euclidean distance between $\bar{z}^*$ and each candidate embedding $\bar{z}^{(i)}$:

$$r(i) = \|\bar{z}^* - \bar{z}^{(i)}\|_2.$$

The $K$ most similar candidates are retrieved as

$$\mathcal{N}_K(s^*) = \mathrm{TopK}_{i \in [N]}\left(-r(i)\right),$$

and used to construct the final context $C$.
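The retrieval step above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' released code: the random vectors stand in for frozen-sentence-encoder embeddings of the pseudo-label and the candidate transcriptions, and the function name is ours.

```python
import numpy as np

def topk_semantic(z_star: np.ndarray, Z: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k candidates whose l2-normalized text
    embeddings are closest in Euclidean distance to the pseudo-label
    embedding z_star (the TopK step of Sec. 2.2)."""
    z_bar = z_star / np.linalg.norm(z_star)
    Z_bar = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    r = np.linalg.norm(Z_bar - z_bar, axis=1)  # r(i) = ||z* - z(i)||_2
    return np.argsort(r)[:k]                   # smallest distances first

# Toy usage: candidate 42's embedding, slightly perturbed, plays the role
# of the pseudo-label embedding, so candidate 42 should be retrieved first.
rng = np.random.default_rng(0)
Z = rng.normal(size=(100, 16))               # 100 candidate transcriptions
z_star = Z[42] + 0.01 * rng.normal(size=16)
print(topk_semantic(z_star, Z, k=3)[0])      # -> 42
```

Because all embeddings are $\ell_2$-normalized, ranking by Euclidean distance is equivalent to ranking by cosine similarity.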
TICL was evaluated on Phi-4-MultiModal-Instruct (Phi-4-MM) [23].

2.3. Acoustic Reranking for Refining Context Selection

To better align the context $C$ with the acoustic characteristics of the test audio, we extend the TICL pipeline to TICL+ by introducing an acoustic reranking step. Prior results demonstrated that Whisper embeddings were the second-best retrieval method after semantic similarity [22]. Whisper embeddings can capture many aspects of the input speech, such as prosody, speaker identity, and pronunciation, that are not reflected in purely lexical representations [24]. Motivated by this, we incorporate an acoustic-based distance measure to refine the selection of in-context examples.

Starting from the top $M$ semantically similar candidates $\mathcal{N}_M(s^*)$, where $M = 300$, we compute acoustic similarity with embeddings precomputed by a frozen speech encoder $g: \mathcal{X}_{\text{audio}} \to \mathbb{R}^p$; Whisper-large-v3-turbo was used in our experiments. The $\ell_2$-normalized acoustic embedding is precomputed as

$$\bar{g}(s) = \frac{g(s)}{\|g(s)\|_2}.$$

The acoustic distance between the test audio and each candidate is then computed as

$$r_{\text{acoustic}}(i) = \|\bar{a}^* - \bar{a}^{(i)}\|_2,$$

where $\bar{a}^* = \bar{g}(s^*)$ and $\bar{a}^{(i)} = \bar{g}(s^{(i)})$. Candidates are reranked according to $r_{\text{acoustic}}(i)$, and the top $K$ acoustically closest samples are selected as

$$\mathcal{N}_K^{\text{acoustic}}(s^*) = \mathrm{TopK}_{i \in \mathcal{N}_M(s^*)}\left(-r_{\text{acoustic}}(i)\right).$$

The final SICL context $C$ is constructed from $\mathcal{N}_K^{\text{acoustic}}(s^*)$. This two-stage retrieval process ensures that the selected in-context examples are both lexically related and acoustically similar to the test utterance.
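The two-stage TICL+ selection (semantic Top-M, then acoustic rerank to Top-K) can be sketched end to end as follows. This is an illustrative sketch under the assumption that both text and acoustic embeddings are precomputed; in the actual pipeline the text embeddings come from a frozen sentence encoder applied to transcriptions and the acoustic embeddings from Whisper-large-v3-turbo.

```python
import numpy as np

def l2n(X: np.ndarray) -> np.ndarray:
    """Row-wise l2 normalization (works for 1-D and 2-D arrays)."""
    return X / np.linalg.norm(X, axis=-1, keepdims=True)

def ticl_plus_select(z_star, Z_text, a_star, A_audio, m=300, k=4):
    """Stage 1: keep the m candidates closest to the pseudo-label in
    text-embedding space (N_M(s*)). Stage 2: rerank those m by acoustic
    distance to the test utterance and keep the top k (N_K^acoustic(s*))."""
    r_text = np.linalg.norm(l2n(Z_text) - l2n(z_star), axis=1)
    top_m = np.argsort(r_text)[:m]
    r_ac = np.linalg.norm(l2n(A_audio[top_m]) - l2n(a_star), axis=1)
    return top_m[np.argsort(r_ac)[:k]]

# Toy usage with random stand-in embeddings.
rng = np.random.default_rng(1)
Z_text = rng.normal(size=(1000, 16))   # candidate transcription embeddings
A_audio = rng.normal(size=(1000, 32))  # candidate acoustic embeddings
z_star, a_star = rng.normal(size=16), rng.normal(size=32)
ctx = ticl_plus_select(z_star, Z_text, a_star, A_audio, m=300, k=4)
print(ctx)  # indices of the 4 demonstrations used to build the context C
```

Note that the acoustic rerank can only reorder within the semantic shortlist, so every selected demonstration remains lexically related to the pseudo-label by construction.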
3. EXPERIMENTAL RESULTS AND ANALYSIS

To evaluate children's speech recognition performance, we used four corpora: My Science Tutor (MyST) [25], a corpus containing science tutoring dialogues with students ages 8-11; the OGI Kids' Speech Corpus [26], which contains about 100 hours of read and prompted speech from children ages 5-16; the Edmonton Narrative Norms Instrument (ENNI) [27, 28], which contains narrative retellings and story completions by children ages 4-9; and the Redmond Sentence Recall (RSR) corpus [29], consisting of sentence repetition tasks for children ages 5-9, including those with developmental language disorders. All datasets were preprocessed, and the corresponding candidate sets were generated according to the procedure outlined in [22]. We evaluate TICL+ using Phi-4-MM.

Table 1 demonstrates that incorporating acoustic reranking into the TICL pipeline substantially improves recognition accuracy across all datasets. TICL+ achieves up to a 53.3% relative improvement over zero-shot performance and up to 37.62% over TICL. These gains may stem from the limitations of the pseudo-labeler, which often produces inaccurate transcriptions for children's speech. Selecting the top 300 semantically closest utterances helps remove unrelated examples, while the acoustic reranking step further refines the context by prioritizing samples that are acoustically similar to the test utterance, regardless of their lexical similarity.

Across all four datasets, TICL+ consistently outperforms both zero-shot and TICL. The largest relative improvement (53.3%) is observed on MyST, which contains conversational speech with high variability in speaker age and background noise. This suggests that acoustic similarity may help identify examples with comparable speaker and environmental conditions, resulting in more robust contextual alignment.
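The Δrel rows of Table 1 follow directly from the zero-shot row and the best result over k in each column; as a sanity check, the TICL+ relative reductions over zero-shot can be recomputed from the table's own numbers:

```python
# Recompute the TICL+ Delta_rel row of Table 1: relative WER reduction
# of the best TICL+ result (over k = 1..4) with respect to zero-shot.
zero_shot = {"MyST": 12.81, "OGI": 16.17, "ENNI": 14.37, "RSR": 20.06}
ticl_plus = {  # WER% for k = 1, 2, 3, 4
    "MyST": [11.48, 10.17, 10.52, 10.57],
    "OGI":  [8.84, 7.97, 7.78, 7.55],
    "ENNI": [14.83, 12.01, 11.52, 11.52],
    "RSR":  [12.89, 12.26, 12.19, 12.75],
}

for corpus, zs in zero_shot.items():
    best = min(ticl_plus[corpus])
    rel = 100.0 * (zs - best) / zs
    print(f"{corpus}: best WER {best:.2f}, {rel:.1f}% relative reduction")
# MyST: 20.6%, OGI: 53.3%, ENNI: 19.8%, RSR: 39.2%, matching Table 1
```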
Smaller improvements are observed on OGI and ENNI, which primarily contain read or structured speech, where lexical overlap already provides strong guidance during the semantic retrieval step. In these cases, the acoustic filter likely contributes by accounting for pronunciation differences and developmental variability across speakers. Performance on RSR also improves despite it also containing read utterances, showing that acoustic reranking remains effective even when lexical diversity is limited, potentially due to its ability to identify developmentally similar speakers.

Overall, these results demonstrate that acoustic similarity provides complementary information to lexical similarity, enabling the model to better capture inter- and intra-speaker variability inherent in children's speech. The improvements across all corpora demonstrate that TICL+ is effective for low-resource and developmentally variable speech domains.

Table 1. Children's speech results for Phi-4-MM (WER%, lower is better).

Method     k      MyST    OGI     ENNI    RSR
Zero-Shot  0      12.81   16.17   14.37   20.06
TICL       1      17.27    9.55   17.57   18.92
           2      11.77    8.94   14.07   18.92
           3      11.69    8.75   13.54   18.90
           4      11.81    8.52   13.75   19.54
           Δrel    8.7%   47.3%    5.8%    5.8%
TICL+      1      11.48    8.84   14.83   12.89
           2      10.17    7.97   12.01   12.26
           3      10.52    7.78   11.52   12.19
           4      10.57    7.55   11.52   12.75
           Δrel   20.6%   53.3%   19.8%   39.2%

4. CONCLUSION

In this work, we introduced an acoustic reranking step into the TICL pipeline, leading to the TICL+ framework, which improves SICL for children's speech recognition. The proposed approach leverages both acoustic and semantic similarity to construct more effective in-context examples. Experiments on four children's speech corpora demonstrate significant performance gains, reducing relative WER by up to 53.3% over zero-shot performance and up to 37.62% over the baseline TICL.
These findings highlight the value of incorporating multiple factors when selecting in-context examples, paving the way toward more robust and scalable ASR systems for children's speech.

5. ACKNOWLEDGMENTS

This work was supported by National Science Foundation grant #2229873. This work used the Delta system at the National Center for Supercomputing Applications through allocation beiq-delta-gpu from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296.

6. REFERENCES

[1] University at Buffalo, The State University of New York, "The project: The national AI institute for exceptional education - technology," https://www.buffalo.edu/ai4exceptionaled/technology.html, 2025, accessed: 2025-10-30.
[2] Dancheng Liu et al., "Automatic screening for children with speech disorder using automatic speech recognition: Opportunities and challenges," in Proceedings of the AAAI Symposium Series, 2024, vol. 4.
[3] Laura L. Koenig, Jorge C. Lucero, and Elizabeth Perlman, "Speech production variability in fricatives of children and adults: Results of functional data analysis," The Journal of the Acoustical Society of America, vol. 124, no. 5, pp. 3158-3170, 2008.
[4] Laura L. Koenig and Jorge C. Lucero, "Stop consonant voicing and intraoral pressure contours in women and children," The Journal of the Acoustical Society of America, vol. 123, no. 2, pp. 1077-1088, 2008.
[5] Sungbok Lee, Alexandros Potamianos, and Shrikanth Narayanan, "Acoustics of children's speech: Developmental changes of temporal and spectral parameters," The Journal of the Acoustical Society of America, vol. 105, no. 3, pp. 1455-1468, 1999.
[6] Sungbok Lee, Alexandros Potamianos, and Shrikanth Narayanan, "Analysis of children's speech: Duration, pitch and formants," in Fifth European Conference on Speech Communication and Technology, 1997.
[7] Houri K. Vorperian and Ray D. Kent, "Vowel acoustic space development in children: A synthesis of acoustic and anatomic data," 2007.
[8] Bruce L. Smith, "Relationships between duration and temporal variability in children's speech," The Journal of the Acoustical Society of America, vol. 91, no. 4, pp. 2165-2174, 1992.
[9] Ahmed Adel Attia, Jing Liu, Wei Ai, Dorottya Demszky, and Carol Espy-Wilson, "Kid-Whisper: Towards bridging the performance gap in automatic speech recognition for children vs. adults," in Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, 2024, vol. 7, pp. 74-80.
[10] Rishabh Jain, Andrei Barcovschi, Mariam Yahayah Yiwere, Peter Corcoran, and Horia Cucu, "Adaptation of Whisper models to child speech recognition," in INTERSPEECH, 2023.
[11] Rishabh Jain, Andrei Barcovschi, Mariam Yiwere, Dan Bigioi, Peter Corcoran, and Horia Cucu, "A wav2vec2-based experimental study on self-supervised learning methods to improve child speech recognition," IEEE Access, 2023.
[12] Ruchao Fan and Abeer Alwan, "DRAFT: A novel framework to reduce domain shifting in self-supervised learning and its application to children's ASR," 2022.
[13] Ruchao Fan, Yunzheng Zhu, Jinhan Wang, and Abeer Alwan, "Towards better domain adaptation for self-supervised models: A case study of child ASR," IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1242-1252, 2022.
[14] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, et al., "Language models are few-shot learners," in Advances in Neural Information Processing Systems, 2020, vol. 33, pp. 1877-1901, Curran Associates, Inc.
[15] Qingxiu Dong, Lei Li, Damai Dai, et al., "A survey on in-context learning," in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, Florida, USA, Nov. 2024, pp. 1107-1128, Association for Computational Linguistics.
[16] Siyin Wang, Chao-Han Yang, Ji Wu, and Chao Zhang, "Can Whisper perform speech-based in-context learning?," in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 13421-13425.
[17] Nathan Roll, Calbert Graham, Yuka Tatsumi, et al., "In-context learning boosts speech recognition via human-like adaptation to speakers and language varieties," 2025.
[18] Jiaming Zhou, Shiwan Zhao, Jiabei He, Hui Wang, Wenjia Zeng, Yong Chen, Haoqin Sun, Aobo Kong, and Yong Qin, "M2R-Whisper: Multi-stage and multi-scale retrieval augmentation for enhancing Whisper," in ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2025, pp. 1-5.
[19] Zihao Zhao, Eric Wallace, Shi Feng, et al., "Calibrate before use: Improving few-shot performance of language models," in Proceedings of the 38th International Conference on Machine Learning, Jul. 2021, vol. 139 of Proceedings of Machine Learning Research, pp. 12697-12706, PMLR.
[20] Zhao Yang, Yuanzhe Zhang, Dianbo Sui, et al., "Representative demonstration selection for in-context learning with two-stage determinantal point process," in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, Dec. 2023, pp. 5443-5456, Association for Computational Linguistics.
[21] Rishabh Agarwal, Avi Singh, Lei Zhang, et al., "Many-shot in-context learning," in Advances in Neural Information Processing Systems, 2024, vol. 37, pp. 76930-76966, Curran Associates, Inc.
[22] Haolong Zheng, Yekaterina Yegorova, and Mark Hasegawa-Johnson, "TICL: Text-embedding KNN for speech in-context learning unlocks speech recognition abilities of large multimodal models," arXiv preprint, 2025.
[23] Microsoft: Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson, et al., "Phi-4-Mini technical report: Compact yet powerful multimodal language models via mixture-of-LoRAs," 2025.
[24] Yuan Gong, Sameer Khurana, Leonid Karlinsky, and James Glass, "Whisper-AT: Noise-robust automatic speech recognizers are also strong general audio event taggers," in INTERSPEECH 2023, Aug. 2023, ISCA.
[25] Sameer Pradhan, Ronald Cole, and Wayne Ward, "My Science Tutor (MyST) - a large corpus of children's conversational speech," in Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 2024, pp. 12040-12045.
[26] Khaldoun Shobaki, John-Paul Hosom, and Ronald Cole, "The OGI kids' speech corpus and recognizers," Oct. 2000, pp. 258-261.
[27] Phyllis Schneider, Denyse Hayward, Rita Vis Dubé, et al., "Storytelling from pictures using the Edmonton Narrative Norms Instrument," Journal of Speech Language Pathology and Audiology, vol. 30, no. 4, p. 224, 2006.
[28] Dancheng Liu and Jinjun Xiong, "FASA: A flexible and automatic speech aligner for extracting high-quality aligned children speech data," 2024.
[29] AI4ExceptionalEd, "Redmond Sentence Recall (RSR)," https://huggingface.co/datasets/ai4exceptionaled/Redmond-Sentence-Recall, Hugging Face dataset; accessed 2025-09-08.
