KidSpeak: A General Multi-purpose LLM for Kids Speech Recognition and Screening
📝 Original Info
- Title: KidSpeak: A General Multi-purpose LLM for Kids Speech Recognition and Screening
- ArXiv ID: 2512.05994
- Date: 2025-12-01
- Authors: Rohan Sharma, Dancheng Liu, Jingchen Sun, Shijie Zhou, Jiayu Qin, Jinjun Xiong, Changyou Chen
📝 Abstract
With the rapid advancement of conversational and diffusion-based AI, there is a growing adoption of AI in educational services, ranging from grading and assessment tools to personalized learning systems that provide targeted support for students. However, this adaptability has yet to fully extend to the domain of children's speech, where existing models often fail due to their reliance on datasets designed for clear, articulate adult speech. Children, particularly those in early developmental stages or with speech and language pathologies, present unique challenges that current AI models and datasets are ill-equipped to handle. To address this, we introduce KidSpeak, a multi-task speech-enhanced Foundation Model capable of both generative and discriminative tasks specifically tailored to children's speech patterns. Our framework employs a two-stage training process that incorporates phonetic knowledge into the speech encoder, achieving an average accuracy of 87% across four separate tasks. Furthermore, recognizing the limitations of scalable human annotation and existing speech alignment tools, we propose the Flexible and Automatic Speech Aligner (FASA) and leverage the method to construct high quality datasets for training and evaluation. This novel alignment tool significantly improves the quality of aligned children's speech from noisy data, enhancing data quality by 13.6× compared to human annotations, as demonstrated on the CHILDES dataset. To the best of our knowledge, KidSpeak and FASA represent the first comprehensive solution designed for speech and language therapy in children, offering both a multi-purpose speech LLM and a robust alignment tool.
📄 Full Content
Example: existing ASR models on children's speech.
Utterance: and they are looking at the frog; and because he cracked his egg
Whisper: and they recognize the fog; and because do you grab this egg?
Wav2Vec: unfated in that the fog; and because fee practis ed
In an attempt to overcome these major hurdles, we introduce KidSpeak, a speech-based LLM with multi-task capacities for ASR, gender and dialect identification, and speech pathology classification, trained on a curated corpus of kids' speech through instruction tuning. The framework builds upon the foundations of spoken language understanding, adapted towards kids' speech transcription and the diagnosis of speech-language pathologies. We train the method using a specialized two-stage procedure: in the first stage, we use simultaneous phonetic and English transcription as a pre-training task for the Whisper ASR model in order to incorporate phonetically informed encoding capacities into its encoder; the resulting encoder is then used in the final framework.
Additionally, we acknowledge several limitations inherent in the existing datasets for children's speech. Given the scarcity of relevant data, we explore the CHILDES (MacWhinney, 2000b) corpus as a resource for children's speech. However, it is important to note that the transcriptions within this dataset are significantly compromised. The annotators involved are often engaged in multifaceted tasks, as the children included in the corpus frequently exhibit speech and language disabilities. Consequently, some annotators focus specifically on issues such as stuttering or speech sound disorders, while others address dialectal variations. This diversity in annotation purpose leads to inconsistencies in human transcriptions, limiting their applicability for developing robust automated systems. We therefore develop a new forced-alignment tool, the Flexible and Automatic Speech Aligner (FASA), allowing us to extract accurate, aligned, and well-segmented audio segments and the corresponding transcriptions under flexible conditions, creating a corpus for KidSpeak. Our main contributions are:
1. We develop KidSpeak, a novel multi-task speech-based foundation language model aimed at diagnosis and transcription of children's speech.
2. We innovate a two-stage training procedure for the audio encoder in order to incorporate phonetic information into the encoder, provably enhancing the downstream performance of the framework.
3. We develop the Flexible and Automatic Speech Aligner, a novel forced-alignment tool enabling extraction of accurate and aligned audio from noisy speech, and demonstrate its utility over the CHILDES corpus in our framework.
The availability of data in the visual and descriptive-visual domain has spurred a preponderance of work towards understanding the visual modality, and we witness a similar emergence in the aural understanding domain. Herein we provide an abridged summary of the contemporary work relevant to this manuscript. A more detailed description of the literature is provided in Section A.1 of the Appendix.
Challenges in encoding speech with LLMs stem from handling long sequences of audio. GSLM (Lakhotia et al., 2021), TWIST (Hassid et al., 2024), and SpeechGPT (Zhang et al., 2023) address this by using quantized speech representations from models such as HuBERT (Hsu et al., 2021). Others employ log-mel spectrograms to develop representations that are then combined with textual data for multi-modal generative tasks such as speech recognition, generation, and understanding (Fathullah et al., 2024; Nachmani et al., 2023; Zhao et al., 2023; Gong et al., 2023), using ASR models such as Whisper (Radford et al., 2023), Wav2Vec (Baevski et al., 2020), Conformer (Gulati et al., 2020), and AST (Gong et al., 2021), or multimodal retrieval-based models such as ImageBind (Girdhar et al., 2023), as in PandaGPT (Su et al., 2023). Our work innovates techniques essential for processing and understanding kids' speech.
A key limitation of current works is the insufficient handling of nuanced speech variations, such as accents, dialects, intonations, and developmental or disordered speech, as is typical in children. The field of children's speech recognition remains under-researched, with only a few notable approaches, such as LSTM-based disfluency detection (Venkatasubramaniam et al., 2023) and teacher-student models (Plantinga & Fosler-Lussier, 2019). To the best of our knowledge, this work is the first to propose leveraging LLMs as multi-task models with diagnostic capabilities for children's speech, offering substantial potential in the domain of speech therapy and supporting Speech-Language Pathologists.
Forced-Alignment Toolkits: Traditional audio-transcription alignment relies on human annotators (Boersma & Weenink, 2007; Grover et al., 2020), which is not scalable for large datasets. While Kisler et al. (2017) offers components of a forced-alignment pipeline, it does not solve the alignment issue. Sheng et al. (2019) uses GANs for data augmentation in children's ASR datasets but does not introduce new data. Several studies utilize human-labeled transcriptions for forced-alignment (McAuliffe et al., 2017; Rodd et al., 2021; Zhang et al., 2023; Liu et al., 2023), with the Montreal Forced Aligner (MFA) being prominent (McAuliffe et al., 2017). However, MFA demands perfect alignment, limiting its effectiveness. Our work improves over existing works and over human annotators by margins of over 13×.
We describe the method that we implement in order to create a multi-purpose speech LLM that exhibits potential as a useful diagnostic tool for speech-related impairments. The framework is trained using instruction finetuning. In summary, we utilize an audio encoder pre-trained using a targeted procedure to generate representations, which are subsequently post-processed, prepended to the textual embeddings, and processed by a pre-trained LLM in order to generate answers. The main framework is illustrated in Figure 2. We employ the pre-trained Vicuna 7B model (Chiang et al., 2023) as our main LLM. We note that our data exhibits a significant domain gap from the pre-training, both in terms of format (audio vs. text) and content (kids' speech). Therefore, in order to retain the general capacity of the LLM and avoid overfitting to the newer data, we finetune the LLM using Low-Rank Adaptation (LoRA) (Hu et al., 2021). This additionally reduces the memory footprint of the model, allowing for larger batch sizes. We illustrate two instructions in the general input sequence that we implement for the IFT procedure. The conversation structure comprises alternating exchanges between a human user and KidSpeak, where the tags <Aud> and </Aud> demarcate the audio representations. The framework is trained to predict the answers Y_a using the aural and instructional context. The <STOP> token is set to ### in practice.
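For concreteness, the following is a minimal sketch of how such an instruction-tuning sequence could be assembled before being fed to the LLM. The tag strings and the ### stop token follow the text above; the role prefixes, the turn template, and the helper `embed_text` are hypothetical placeholders rather than the authors' exact implementation.

```python
# Sketch: assemble one multi-turn IFT sequence in embedding space.
# Audio representations sit between <Aud> ... </Aud>; turns end with "###".
import torch

AUD_OPEN, AUD_CLOSE, STOP = "<Aud>", "</Aud>", "###"

def build_sequence(audio_features: torch.Tensor,
                   turns: list[tuple[str, str]],
                   embed_text) -> torch.Tensor:
    """Prepend adapted audio features to the embedded multi-turn conversation.

    audio_features: (num_audio_tokens, hidden_dim) output of the adapter.
    turns: list of (instruction, answer) pairs for this audio sample.
    embed_text: hypothetical callable mapping a string to (n, hidden_dim).
    """
    pieces = [embed_text(AUD_OPEN), audio_features, embed_text(AUD_CLOSE)]
    for instruction, answer in turns:
        pieces.append(embed_text(f"Human: {instruction} {STOP} "))
        pieces.append(embed_text(f"KidSpeak: {answer} {STOP} "))
    return torch.cat(pieces, dim=0)  # (sequence_length, hidden_dim)
```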
We employ a Whisper-based encoder for speech in our main framework. The audio representations from Whisper are prepended to the text embeddings consisting of the instructions and the teacher-forced output. The complete sequence is then processed by the Vicuna LLM, whose self-attention incorporates audio-lingual context in order to learn and generate informed inferences. However, we note that the native implementation of the Whisper encoder generates encodings of shape (batch size × 1500 × 768), leading to a significant increase in the memory footprint of the model due to the extensive sequence length and the consequent self-attention matrices. We therefore apply a post-processing step to the feature tensors produced by the Whisper encoder, aiming to reduce the final input sequence length while minimizing information loss. In this procedure, we aggregate multiple consecutive audio features (Figure 3) to form a cumulative representation that spans 80 milliseconds per feature vector. The aggregated features are then processed through a two-layer adapter network to align the feature-space dimensions with those of the textual embeddings, which are further processed by the LLM.
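As an illustration, here is a minimal sketch of this aggregation-plus-adapter step. It assumes that four consecutive 20 ms features are concatenated to span 80 ms and that the adapter is a two-layer MLP with a GELU; the aggregation operator, the activation, and the LLM embedding width (4096 for Vicuna 7B) are assumptions not fixed by the text.

```python
# Sketch of the feature post-processing step, under stated assumptions:
# four consecutive 20 ms Whisper features are concatenated into one 80 ms
# feature, and a two-layer MLP maps them into the LLM embedding space.
import torch
import torch.nn as nn

class AudioAdapter(nn.Module):
    def __init__(self, enc_dim: int = 768, llm_dim: int = 4096, group: int = 4):
        super().__init__()
        self.group = group                      # 4 x 20 ms = 80 ms per vector
        self.mlp = nn.Sequential(
            nn.Linear(enc_dim * group, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, 1500, enc_dim) from the Whisper encoder
        b, t, d = feats.shape
        t = t - t % self.group                  # drop any ragged tail
        grouped = feats[:, :t].reshape(b, t // self.group, d * self.group)
        return self.mlp(grouped)                # (batch, 375, llm_dim)
```

Under these assumptions, the 1500-step encoder output becomes a 375-step audio prefix for the LLM.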
For each audio sample i, we create a multi-turn instruction-following dataset (Y_q1^(i), Y_a1^(i), ..., Y_qT^(i), Y_aT^(i)), illustrated in Figure 4, wherein the instructions are randomly ordered during training. The framework is then trained using the conditional auto-regressive prediction objective

$$\operatorname*{arg\,min}_{\theta_s,\,\theta_m}\; -\sum_{t}\log p\!\left(Y^{(i)}_{a_j,t}\,\middle|\,Y^{(i)}_{a_j,<t},\,A^{(i)},\,Y^{(i)}_{q_j};\,\theta_s,\theta_m,\theta_{enc}\right) \qquad (1)$$

for a sample indexed i and instruction j. The speech-contexted LoRA parameters θ_s and the adapter MLP parameters θ_m are estimated conditioned on the speech sample A^(i), represented by the frozen speech encoder with parameters θ_enc, and the instruction Y_qj^(i); the framework is trained to predict the answer Y_aj^(i).

Additionally, we find that a targeted encoding scheme for θ_enc benefits the framework, as we detail next. A preponderance of developing speech is characterized by phonetic challenges wherein the child mispronounces similar phonetic units (Munson et al., 2012). Speech and language therapists must proficiently utilize phonetics for transcription to accurately diagnose and treat speech-sound disorders (Munson et al., 2012; Ball & Rahilly, 2002). Therefore, reliable pretraining in phonetic transcription is imperative, as inaccuracies can significantly affect clinical management and therapeutic outcomes. In recognition of these facets of diagnosis, we conduct a separate procedure in order to endow the speech encoder with phonetic information. The audio encoder employed is based on the Whisper model, which is designed to encode mono-channel audio sampled at 16 kHz that is transformed into log-Mel spectrogram images. The encoder generates audio features, with each feature tensor corresponding to a 20-millisecond segment of audio. For optimal performance, we utilize the default configuration, which processes audio in 30-second chunks, and affix it with separate dedicated decoders for orthographic English transcription and phonetic transcription, as illustrated in Figure 5. This yields a training regime wherein the same encoder facilitates both textual and phonetic transcriptions while the two decoders specialize in decoding the features into English and phonetic transcriptions respectively. The model is then trained using the next-token prediction loss for both decodings simultaneously.
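The following is a minimal sketch of how such a shared-encoder, dual-decoder setup could be wired using the Hugging Face Whisper modules. The duplicated decoder, the reuse of the English vocabulary for the phonetic output head, and the loss wiring are simplifying assumptions rather than the authors' exact implementation.

```python
# Sketch: one shared Whisper encoder, two decoders (English / phonetic),
# trained with the sum of the two next-token prediction losses (cf. Eq. 2).
import copy
import torch.nn as nn
from transformers import WhisperModel

class DualDecoderWhisper(nn.Module):
    def __init__(self, name: str = "openai/whisper-small"):
        super().__init__()
        base = WhisperModel.from_pretrained(name)
        self.encoder = base.encoder                 # shared parameters theta_enc
        self.dec_en = base.decoder                  # orthographic English decoder
        self.dec_ph = copy.deepcopy(base.decoder)   # phonetic decoder
        d = base.config.d_model
        # Assumption: both heads reuse the Whisper vocabulary size; a dedicated
        # phoneme vocabulary would use a smaller output dimension.
        self.head_en = nn.Linear(d, base.config.vocab_size, bias=False)
        self.head_ph = nn.Linear(d, base.config.vocab_size, bias=False)

    def forward(self, mel, en_ids, ph_ids):
        enc = self.encoder(mel).last_hidden_state
        h_en = self.dec_en(input_ids=en_ids, encoder_hidden_states=enc).last_hidden_state
        h_ph = self.dec_ph(input_ids=ph_ids, encoder_hidden_states=enc).last_hidden_state
        return self.head_en(h_en), self.head_ph(h_ph)

def combined_loss(logits_en, logits_ph, en_labels, ph_labels):
    # Cross-entropy over English tokens plus cross-entropy over phonetic tokens.
    ce = nn.CrossEntropyLoss(ignore_index=-100)
    return ce(logits_en.transpose(1, 2), en_labels) + ce(logits_ph.transpose(1, 2), ph_labels)
```

In the paper's two-stage recipe, only the encoder trained in this stage is carried over into KidSpeak.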
The objective is

$$\mathcal{L}^{(i)}_{ar} \;=\; -\sum_{t}\log p\!\left(y^{(i)}_{t,e}\,\middle|\,y^{(i)}_{<t,e},\,A^{(i)};\,\{\theta_{enc},\theta^{(en)}_{dec}\}\right)\;-\;\sum_{t}\log p\!\left(y^{(i)}_{t,p}\,\middle|\,y^{(i)}_{<t,p},\,A^{(i)};\,\{\theta_{enc},\theta^{(ph)}_{dec}\}\right) \qquad (2)$$

where L_ar^(i) is the combined auto-regressive loss for the i-th sample, y^(i)_{<t,e} and y^(i)_{<t,p} are the preceding English and phonetic tokens, and A^(i) is the input audio. We specify the decoder parameters separately, with θ^(en)_dec and θ^(ph)_dec representing the English and phonetic decoders respectively, sharing the same encoder parameters θ_enc. We also utilize a special token |phn| to signify phonetic decoding. However, we hypothesize that explicit measures towards alignment of the two decoders may lead to enhanced generalization and understanding of the language through a unified feature-space representation. Therefore, in order to enhance the alignment of the transcription mechanisms, we further process the downstream features of the decoders using two additional mechanisms.
One such mechanism, multi-head cross-attention alignment, leverages cross-attention to synchronize the hidden states of the two decoders derived from the same audio input. We implement the mechanism over the final hidden states of both decoders, where H_e^(i) and H_p^(i) represent the hidden states of the English and phonetic transcriptions, U (768 × 1024) and D (1024 × 768) the upward and downward projection matrices, and d_k the dimensionality of the keys. The upward projection enriches the representation space to capture detailed alignments, while the downward projection ensures the contextualized outputs are compatible with the original dimensions. In practice, we implement this mechanism using multi-head attention with 16 heads (an illustrative sketch is provided below). The residuals are ultimately used in the evaluation of L_ar in Equation 2. A detailed justification for this scheme is provided under Section A.2.1. In summary, these alignment mechanisms ensure that phonetic and orthographic transcriptions are not only aligned but also mutually reinforcing, enhancing the encoder's ability to generate accurate and contextually relevant transcriptions for both modalities, ultimately improving the model's overall performance and utility in downstream tasks. The model is subsequently trained end-to-end using a linear combination of the two resulting loss functions.
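The sketch below illustrates one plausible form of this multi-head cross-attention alignment. Only the projection shapes (768 → 1024 → 768), the 16 heads, and the residual connection are taken from the text; the query/key/value roles and the use of a single `nn.MultiheadAttention` are assumptions.

```python
# Illustrative sketch of cross-decoder alignment with up/down projections.
import torch
import torch.nn as nn

class CrossDecoderAlign(nn.Module):
    def __init__(self, d_model: int = 768, d_attn: int = 1024, heads: int = 16):
        super().__init__()
        self.up = nn.Linear(d_model, d_attn)      # U: 768 -> 1024
        self.down = nn.Linear(d_attn, d_model)    # D: 1024 -> 768
        self.attn = nn.MultiheadAttention(d_attn, heads, batch_first=True)

    def forward(self, h_en: torch.Tensor, h_ph: torch.Tensor) -> torch.Tensor:
        # h_en, h_ph: (batch, seq, 768) final hidden states of the two decoders
        q, kv = self.up(h_en), self.up(h_ph)
        ctx, _ = self.attn(q, kv, kv)             # English stream attends to phonetic
        return h_en + self.down(ctx)              # residual, back to 768 dims
```

A symmetric module with the two streams swapped would let the phonetic hidden states attend to the English ones; whether both directions are used is not specified here.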
In this section, we present the Flexible and Automatic Speech Aligner (FASA), a novel toolkit designed for forced alignment to create high-quality fine-tuning datasets. A robust children’s speech model necessitates a large, diverse dataset with accurately aligned audio and transcriptions. However, obtaining such high-quality data is challenging due to the distinctive speech patterns of children, particularly those with speech and language disorders. Human annotation is labor-intensive and requires domain expertise, as noted by Miller et al. (2016), with our experience showing that annotating a single audio segment can take 3-8 times longer than its duration. Additionally, the quality of annotations varies significantly, especially in datasets like CHILDES (MacWhinney, 2000a), which cater to diverse transcription purposes, resulting in many transcriptions being incomplete or irrelevant. To enhance KidSpeak, a general-purpose forced alignment toolkit is crucial for extracting high-quality children’s speech datasets from low-quality sources. Existing methods, such as MFA (McAuliffe et al., 2017), rely on accurate transcriptions, which are often impractical to obtain. Therefore, we propose a flexible and automated forced alignment toolkit that addresses various challenges in current children’s speech datasets.
Given a non-timestamped audio file and its noisy or incomplete transcription, forced alignment generates time-stamped audio segments paired with high-quality transcriptions. KidSpeak, like many modern automatic speech recognition systems, requires input audio to be divided into smaller segments during training. For instance, the Whisper model (Radford et al., 2022) pads or trims audio inputs to 30 seconds. Consequently, when associating a non-timestamped transcription with a lengthy audio file, a forced-alignment toolkit is essential for creating a model-compatible dataset. Formally, the forced-alignment task involves an audio sample containing n utterances, A := {A_1, A_2, ..., A_n}, and a transcription of m "words," T := {T_1, T_2, ..., T_m}. A "word" in T represents a fundamental unit of transcription, which may refer to a sentence, a single word, or a phonetic symbol. The goal is to associate each A_i with its corresponding words in T, from T_{s_i} to T_{e_i}, or to indicate that A_i lacks a transcription in T. We denote this association as A_i = (T_{s_i}, T_{e_i}). A robust auto-alignment system should exhibit two crucial features. First, it must not assume that if A_i = (T_{s_i}, T_{e_i}) and A_j = (T_{s_j}, T_{e_j}) with i < j, then e_i < s_j; an utterance appearing earlier in the audio does not guarantee its early appearance in the transcription. Second, A_i may lack a corresponding (T_{s_i}, T_{e_i}), implying that A_i = ∅. This means not all audio segments have transcriptions, and some audio may remain untranscribed. Similarly, T_k ∈ T does not imply T_k ∈ A; not every word in the transcription corresponds to an audio segment. These features are essential as they relax the need for completeness and order in the provided transcription, mirroring more realistic scenarios. While a common method for obtaining large datasets of paired audio and transcriptions is through Internet scraping, many online transcriptions are noisy and incomplete, often with missing or misordered entries. Under these conditions, existing forced-alignment toolkits, such as those proposed by McAuliffe et al. (2017), are inadequate. Further details are discussed in Appendix A.1.
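To make the relaxed output format concrete, here is a small sketch of a possible data structure, assuming hypothetical field names; it only encodes the two properties above, namely that a transcription span may be missing and that spans need not appear in utterance order.

```python
# Sketch of an alignment record: an utterance with an optional (s_i, e_i) span
# into the provided transcription T.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class AlignedUtterance:
    start_sec: float                          # utterance boundaries in the audio
    end_sec: float
    span: Optional[Tuple[int, int]] = None    # (s_i, e_i) indices into T, or None

def aligned_text(utt: AlignedUtterance, transcription: list[str]) -> Optional[str]:
    """Return the ground-truth text for an utterance, or None if untranscribed."""
    if utt.span is None:
        return None
    s, e = utt.span
    return " ".join(transcription[s:e + 1])
```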
FASA follows a five-module pipeline to automatically segment, label, and align a long audio file with its transcription, as illustrated in Figure 6. Among the five modules, the second and third are mandatory, whereas the other three are optional for enhancing the quality and quantity of the dataset. Together, these five modules maximize the correctness of forced alignment under flexible conditions: module ② segments the audio and makes predictions on it; module ③ forced-aligns audio segments with the provided transcription using Algorithm 1; module ④ performs post-generation checking (PGC); and module ⑤ allows the user to augment the dataset via manual selections. The entire system besides module ⑤ is automatic.

① The first module applies a regular expression to clean the provided transcriptions and to exclude any non-alphanumeric characters. ② For the second module, modern ASR models are used to obtain word-level timestamps of the transcriptions; currently, sentence-level separations from the provided model are used as the segmentation marks for long audio. ③ After the second module, a folder consisting of audio segments and their corresponding predictions is generated. The set of predictions for sentence-level utterances is denoted T̂ = {T̂_1, T̂_2, ..., T̂_n}; for each utterance A_k, its predicted transcription is denoted T̂_{A_k}. The third module applies a sliding-window algorithm (Algorithm 1 in Appendix A.6) to find the best match from the provided transcription T for each utterance A_k. After this module, two datasets are generated. The first, DATA_align, contains utterances for which the algorithm finds a close alignment between the prediction and the provided transcription within a threshold; the second, DATA_verify, contains utterances for which the algorithm finds slight mismatches between the prediction and the transcription. For DATA_align, the provided transcription from T is used as the ground truth of the utterance. ④ The fourth module, post-generation checking (PGC), is an optional module that iterates through DATA_align to find significant mismatches between a second-round prediction and the aligned transcription; the implemented metric is the difference in sentence length between the two, and if this difference exceeds a threshold, the utterance and its transcription are removed from DATA_align (a sketch is given after this paragraph). ⑤ The fifth module, user selection, is an optional module that launches a graphical user interface (GUI) allowing the user to listen to, select, or input the correct transcription for each utterance in DATA_verify so that it can be added to the dataset. After the two optional modules, FASA assumes the validity of DATA_align, which is used as the final output dataset. Furthermore, FASA features additional qualitative and user-friendly traits, as described in Section A.3.
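A minimal sketch of the PGC length check described in module ④, under stated assumptions: the length metric is a whitespace word count, the threshold value is illustrative, and `second_round_asr` is a hypothetical callable wrapping the second-round ASR pass.

```python
# Sketch of the optional post-generation check (PGC): keep an utterance only if
# a second-round prediction and the aligned transcription have similar lengths.
def passes_pgc(second_pred: str, aligned_text: str, max_diff: int = 3) -> bool:
    return abs(len(second_pred.split()) - len(aligned_text.split())) <= max_diff

def apply_pgc(data_align, second_round_asr):
    kept = []
    for utt, text in data_align:                 # (audio segment, aligned text)
        if passes_pgc(second_round_asr(utt), text):
            kept.append((utt, text))
    return kept
```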
We compile multiple open-source datasets in order to test our framework against a wide variety of instructions, in addition to datasets generated by FASA, leading to over 57 hours of high-quality data. The corpus was built to adequately represent both children with speech pathologies and those with clear speech, and it contains a rich collection of speaker-related attributes that we aim for our method to predict and generate. We summarize the data in Table 1. A broader description of the datasets is provided in Section A.4.
We conduct training and performance evaluation of our method across several distinct tasks, each based on specific speech traits. These traits are extracted from the corresponding ground-truth labels available in the datasets referenced in Table 1. The tasks are described as follows:
◆ Disorder Classification: Speech disorders are categorized into the following classes: (1) inconsistent phonological disorder, (2) consistent phonological disorder, (3) childhood apraxia of speech, (4) phonological delay, (5) vowel disorder, (6) articulation disorder, and (7) no disorder, based on the ground truths provided in the Ultraphonix (UPX) subset of the Ultrasuite repository (Eshky et al., 2019). Detailed descriptions of these disorders are provided in Section A.5.
◆ Gender Classification: We conduct binary gender classification using the ground-truth gender labels available in the datasets. Predictions are made only where gender information is explicitly provided.
◆ Age Group Classification: Age labels are sourced from the ENNI dataset. To account for the minimal acoustic differences between closely aged children, we divide the age range into two groups: 1-5 years and 6-13 years.
◆ Transcription: We compile the transcriptions available for the kids with no speech-related disorders in order to create a reliable benchmark for training and evaluating speech recognition models. We train the framework to transcribe the speech and evaluate using the word error rate and character error rate metrics.
We evaluate the performance of the classification tasks using the accuracy of inference. Additional details for the configuration of the training setup are provided under Table 6.
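For the transcription task, the word and character error rates can be computed with a standard Levenshtein-distance formulation; the sketch below is a generic reference implementation, not the authors' evaluation code.

```python
# Standard WER / CER via edit distance (single-row dynamic programming).
def edit_distance(ref: list, hyp: list) -> int:
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

def wer(reference: str, hypothesis: str) -> float:
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / max(len(ref), 1)

def cer(reference: str, hypothesis: str) -> float:
    return edit_distance(list(reference), list(hypothesis)) / max(len(reference), 1)
```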
In addition to the benefits it confers within the KidSpeak framework, the aligned training procedure is beneficial for the transcription performance of the Whisper model itself, as shown in Table 2, where we evaluate the Phonetic Error Rate of the models under various configurations. The multi-task evaluation is reported in Table 3. In the forced-alignment comparison, MFA (McAuliffe et al., 2017) completely fails to properly align the audio segments with the correct transcription. To be specific, both documents have missing transcriptions corresponding to the beginning of the audio, which results in a 99.93% AW Error: MFA tries to align all the words from the beginning, but since those words do not have available transcriptions, the entire system fails. Second, FASA incorrectly aligns one utterance with its transcription: it misses the "so the" sound at the end of the utterance, and the two words are not recorded in the aligned transcription. Manual inspection finds that the speaker stuttered and repeated "so the", which may explain why the model did not pick up that trailing sound in the segmented utterance. Lastly, FASA's result is potentially much better than human annotation. Attia et al. (2023) report that 5 out of 393 hours of speech in the MyST dataset (Pradhan et al., 2023) are potentially incorrect with WER > 50%, resulting in a 3% increase in WER for the entire training dataset. Compared to the human annotators used to annotate MyST, FASA achieves an order of magnitude lower WER (13.6×) without requiring any human labor.
In conclusion, this work presents significant advancements in the field of children's speech analysis. First, we introduce KidSpeak, a pioneering multi-task speech-based foundation language model designed specifically for diagnostic tasks pertaining to children's speech. Second, we propose an innovative two-stage training procedure for the audio encoder that effectively integrates phonetic information, leading to marked improvements in both diagnostic and transcription performance, as well as downstream performance when integrated with KidSpeak. Finally, we develop the Flexible and Automatic Speech Aligner, a novel forced alignment tool that extracts accurate and aligned audio from noisy speech, allowing us to create a high quality kids' data corpus. Collectively, these contributions enhance the capabilities of speech analysis frameworks for children, paving the way for future research and applications in this critical area.

Table 5: Conversation with KidSpeak.
User: Determine the gender of the speaker in this clip.
KidSpeak: The child in the clip is a girl.
Audio (4y old Boy): "The elephant grabs it and drops it in the water."
User: Transcribe the spoken words into text.
KidSpeak: The elephant grabs it and drops it in the water.
Audio (7y old Boy): "Just turn the tap on for water."
User: Evaluate whether the speech in this clip is typical or impaired.
KidSpeak: The manner of speaking seems natural.
Audio (10y old Girl): "The hippo jumps in the pool."
User: Estimate the age-group of the speaker from this clip.
KidSpeak: It is estimated that the speaker may be between 6-13 years old.
Audio-Visual Understanding: Contemporary work has attained remarkable feats with regard to understanding the visual and lingual modalities, often in synergy with each other, owing to the large-scale availability of datasets facilitating research. Several works, including but not limited to LLaVA (Liu et al., 2024), PaLM-E (Driess et al., 2023), Flamingo (Alayrac et al., 2022), BLIP (Li et al., 2022), and GPT-4 (Achiam et al., 2023), use instruction-based tuning to endow language models with multi-modal capabilities. The training routines for these methods involve tokenization of the image modality, allowing for the formation of a sequence that can be incorporated into instruction/text token sequences using self-attention (LLaVA, PaLM-E), cross-attention (Flamingo), or a separate network (BLIP). This often consists of multiple stages wherein the initial stages are aimed at priming the learnable parameters for the newer modality. In view of their tremendous potential, we also witness these advancements significantly enhancing human-AI interaction by improving search engines, supporting creative tasks, and, importantly, advancing accessibility, particularly for the image modality. For instance, these technologies can be utilized to enhance accessibility tools, such as through augmented communication (Chanjaradwichai et al., 2019), as assistive learning tools (Padmanabha et al., 2024; Kazemitabaar et al., 2024), and sign language recognition systems (Gong et al., 2024).
Speech based LLMs and Spoken Language Understanding: Encoding speech with LLMs faces challenges associated with encoding very large sequences of aural representations, given that a 16 kHz sampling of one second of audio yields 16,000 samples. Several recent contributions leverage quantized representations of speech provided by HuBERT (Hsu et al., 2021), as in Generative Spoken Language Modeling (Lakhotia et al., 2021), TWIST (Hassid et al., 2024), and SpeechGPT (Zhang et al., 2023), wherein, depending on the application, the representations are used in transformer-based (Vaswani, 2017) encoder-decoder systems or are combined with textual embeddings to construct an interactive multi-modal system with generative audio capabilities via instruction-based tuning. Another encoding scheme for audio is the use of log-mel spectrograms processed by pre-trained ASR encoders such as Whisper.

The two alignment mechanisms address a critical need for alignment between phonetic and orthographic transcriptions within multi-modal speech processing systems. This alignment is essential for several reasons:
◆ Phonetic and Orthographic Consistency: Phonetic transcriptions represent the pronunciation of words, focusing on sounds, while orthographic transcriptions represent the written form of language. Aligning these two modalities ensures that the pronunciation (phonetics) and spelling (orthographics) are consistent with each other. This consistency is crucial for tasks such as speech recognition and language learning, where accurate mapping between spoken and written forms is required.
◆ Enhanced Encoder Utility: By aligning phonetic and orthographic transcriptions through the shared encoder, the model benefits from enriched feature representations. The encoder, which is common to both decoders, learns to produce more comprehensive and phonetically aware representations. This shared learning helps the encoder capture nuances that prove to be critical for improved understanding of pronunciation based nuances necessary for improved diagnostic capacities.
◆ Robust Multi-Modal Learning: Aligning the hidden states of phonetic and orthographic decoders allows the system to leverage complementary information from both transcriptions. Phonetics provides insights into pronunciation nuances, while orthographics offer context about spelling and grammar. The combined insights from both modalities lead to a more robust and versatile model capable of handling diverse linguistic tasks.
In the following, we demonstrate the phonetic capacities of our scheme using comparisons with Whisper and Wav2Vec trained over TIMIT. The phonetic alphabet used natively by TIMIT is illustrated here for selected samples. We notice that all of the models capture a meaningful pronunciation for each word of the examples listed. However, the targeted alignment scheme of our method captures the nuances in pronunciation, audible in the ground-truth phonetic captioning, enabling a more accurate transcription and hence better encoder representations for therapeutic downstream tasks. Next, we demonstrate the capacity of the phonetically endowed Whisper model using speech from the Ultraphonix dataset (Eshky et al., 2019). The Whisper model was trained using the multi-head alignment scheme described in Section 3.2 over the TIMIT corpus (Garofolo, 1993). Subsequently, we conduct inference using the phonetic decoder of the model over the speech of a 4-year-old boy undergoing therapy for a phonological disorder. The child mistakenly uses the sound "da" in place of "ga" in words: for instance, the pronunciations of "luggage" ("LUG-ij") and "gore" ("GOw-ar") are produced as "LAD-ij" and "DOw-ar". However, post therapy, the child learns to pronounce these clearly, as is captured by our model. We use the TIMIT phonetic transcription code here; pau indicates a pause in speech.
Instructor: g ow axr pau g ey t pau g eh t pau l ah g ux jh
Child Pre: d ow ax pau d iy t pau d ae t pau l ah d ix jh
Child Post: g ow aa pau g ih g eh ix t pau g ae t pau l ah g ih jh
As is evinced in the illustration, the phonetically endowed Whisper correctly detects the improvement in pronunciation from pre- to post-therapy, thereby providing tailored features for targeted, therapy-oriented downstream tasks such as those implemented in KidSpeak.
Similar to existing auto-alignment toolkits, FASA requires an audio file and its corresponding transcription. However, due to high uncertainty in the raw dataset, FASA assumes only a minimal input format and does not require the transcription to be accurate. The ground truth (GT) for an utterance A_k is defined by Equation 6 in the FASA pipeline. In contrast to previous forced-alignment toolkits, FASA deliberately ignores utterances without valid transcriptions, thereby enhancing quality.
FASA also incorporates beneficial design elements from established toolkits to enhance user convenience, following the same design principles as MFA. Users need only to place the audio file and its transcriptions in a designated folder before executing the program, after which all processes are fully automated, enhancing the user experience. FASA allows users to select and manually input transcriptions for utterances when the provided transcriptions are suspected to be inaccurate, ensuring precision and user control. Additionally, FASA features an optional post-generation check to automatically exclude incorrect alignments, minimizing errors from the underlying model.
The Core-Ultraphonix (UPX) subset of the Ultrasuite repository (Eshky et al., 2019) provided the labels for speech with pathologies. Additionally, we incorporate the Children’s Speech Recording (CSR) dataset by Kennedy et al. (2017). With FASA, we convert subsets of the Child Language Data Exchange System (CHILDES) (MacWhinney, 2000a) that contain English children’s speech. Specifically, we use a collection of 352 children from ages 4 to 9. The children are performing the Edmonton Narrative Norms Instrument (ENNI) test (Schneider et al., 2005). To the best of our knowledge, this generated dataset will be the first at-scale high-quality dataset for young children from clinical recordings that is fully compatible with modern DL systems. While we only use the subset from CHILDES with rich clinical information for KidSpeak, FASA remains a generic forced-alignment toolkit that can extract many more datasets than the one used.
The following provides a broad explanation of the various speech-related disorders that KidSpeak is capable of diagnosing based on the speech patterns exhibited by the child, along with seminal and intriguing citations from the field of speech-language pathology.
◆ Inconsistent Phonological Disorder: This pediatric speech sound disorder is characterized by the inconsistent production of the same words across repeated trials (Dodd et al., 2024). For example, a child may say “bat,” “gat,” and “at” instead of “cat,” or produce “log” for “dog” one day and “fog” the next. Moreover, a child may say “fider,” “sider,” and “pider” when attempting to pronounce “spider” (Dodd & Crosbie, 2010;Carter et al., 2019).
◆ Consistent Phonological Disorder: In contrast, this disorder is marked by the child’s ability to produce the same errors consistently when attempting to articulate the same word. For instance, a child may reliably say “tup” instead of “cup” or “wabbit” for “rabbit.” Such patterns indicate a stable phonological processing issue, as the child consistently makes the same substitutions or distortions (Felsenfeld et al., 1995;Bleile, 2002).
◆ Phonological Delay: This specific speech sound disorder entails developmental phonological errors that align with typical speech development patterns but persist longer than expected, often for six months or more, which can impact clarity and sound production (Orsolini et al., 2001). Children acquire speech by learning entire words rather than individual sounds; as their speech matures, they categorize words by their components, often simplifying sounds or sequences into easier alternatives (e.g., saying “ca” for “cat”) (Waring et al., 2022).
◆ Vowel Disorder: Vowel disorders are marked by difficulties in the positioning and sequencing of the articulators, particularly the tongue and lips, affecting vowel quality and accuracy. Incorrect positioning can lead to issues with vowel production, such as excessively long vowels or distortions. For instance, vowels may be partially voiced due to challenges in controlling vocal fold vibration, or they may exhibit excessive nasality from difficulties managing velopharyngeal closure. These spatial, temporal, and coordination difficulties often result in challenges in vowel production (Gibbon & Beck, 2002;Ball & Gibbon, 2002). Additionally, children may struggle with vowel lengthening or shortening, such as elongating the vowel in “see” for “sit” or shortening it in “cat” as “kit,” with omissions occurring as well (e.g., saying “bll” instead of “ball”) (Stoel-Gammon & Pollock, 2008).
◆ Articulation Disorder: This type of speech sound disorder is characterized by difficulties in accurately producing speech sounds due to the imprecise use of the lips, tongue, or throat. Individuals may demonstrate various symptoms, including the omission of sounds (e.g., final consonants), distortion of sounds (e.g., producing an “s” sound with a whistle), and challenges in coordinating the movements of their lips, tongue, teeth, palate, and lungs (Hall & Tomblin, 1978;Rvachew & Jamieson, 1989).
◆ Childhood Apraxia of Speech: Childhood apraxia of speech (CAS) is a neurological speech sound disorder characterized by impaired precision and consistency of movements underlying speech, absent neuromuscular deficits (e.g., abnormal reflexes or tone) (Davis et al., 1998;Association et al., 2007;Kummer et al., 2007). Children with CAS may encounter difficulties in speech production, such as trouble transitioning smoothly between sounds and syllables, groping movements of the jaw, lips, or tongue, vowel distortions, incorrect stress patterns (e.g., pronouncing “banana” as “BUH-nan-uh,”) equal emphasis on all syllables (e.g., saying “BUH-NAN-UH,”) separation of syllables with pauses, inconsistency in errors when repeating words, and voicing errors (e.g., saying “down” instead of “town”) (Carter et al., 2019).
Here, we provide the pseudocode for the third module of FASA’s workflow in Algorithm 1. FASA uses a sliding window algorithm with two thresholds to determine the final two subsets of audio segments. In the algorithm, DIS is the Levenshtein distance between two sentences.
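Since the pseudocode itself is not reproduced here, the following is a hedged Python sketch in the spirit of Algorithm 1: it slides a window over the provided transcription, scores candidate spans with a Levenshtein distance DIS, and routes each utterance to DATA_align or DATA_verify using two thresholds. The window widths, the character-level distance, the normalization, and the threshold values are assumptions; the authors' exact algorithm may differ.

```python
# Sliding-window matching sketch in the spirit of FASA's Algorithm 1.
def dis(a: str, b: str) -> int:
    """Levenshtein distance between two sentences (character level here)."""
    d = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, cb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (ca != cb))
    return d[-1]

def best_window(pred: str, words: list[str], slack: int = 2):
    """Best-matching span (s, e) in the provided transcription for one ASR prediction."""
    n = len(pred.split())
    best = (None, float("inf"))
    for width in range(max(1, n - slack), n + slack + 1):
        for s in range(0, max(1, len(words) - width + 1)):
            cand = " ".join(words[s:s + width])
            score = dis(pred.lower(), cand.lower())
            if score < best[1]:
                best = ((s, s + width - 1), score)
    return best                                    # (span, distance)

def route(pred: str, words: list[str], t_align: float = 0.15, t_verify: float = 0.35):
    """Assign an utterance to DATA_align, DATA_verify, or discard it."""
    span, score = best_window(pred, words)
    rel = score / max(len(pred), 1)                # normalized mismatch
    if rel <= t_align:
        return "DATA_align", span
    if rel <= t_verify:
        return "DATA_verify", span
    return "discard", None
```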
Children's speech exhibits considerable variability, including phonological disorders and developmental variances. By incorporating phonetic information, the model better understands these nuances, allowing it to differentiate subtle pronunciation variations common among young speakers. This phonetic grounding enhances the model's ability to generalize across diverse dialects and individual speech patterns, ultimately contributing to more reliable assessments and interventions in speech-related applications for children.
We choose Vicuna as it was one of the strongest LLMs available when we began. Due to computational constraints, we did not try other strong pretrained LLMs; however, we believe they would perform similarly and our conclusions would remain unchanged.
This is not to say that MFA is not a good model. MFA works fine with high-quality transcriptions that are an approximate match of the audio. However, if the audio/transcription match before the alignment is not good, MFA will not produce anything meaningful.