Synthetic Data Domain Adaptation for ASR via LLM-based Text and Phonetic Respelling Augmentation


Authors: Natsuo Yamashita, Koichi Nagatsuka, Hiroaki Kokubo, Kota Dohi, Tuan Vu Ho (Hitachi, Ltd.)

Natsuo Yamashita, Koichi Nagatsuka, Hiroaki Kokubo, Kota Dohi, Tuan Vu Ho
Hitachi, Ltd.

ABSTRACT

End-to-end automatic speech recognition often degrades on domain-specific data due to scarce in-domain resources. We propose a synthetic-data-based domain adaptation framework with two contributions: (1) a large language model (LLM)-based text augmentation pipeline with a filtering strategy that balances lexical diversity, perplexity, and domain-term coverage, and (2) phonetic respelling augmentation (PRA), a novel method that introduces pronunciation variability through LLM-generated orthographic pseudo-spellings. Unlike conventional acoustic-level methods such as SpecAugment, PRA provides phonetic diversity before speech synthesis, enabling synthetic speech to better approximate real-world variability. Experimental results across four domain-specific datasets demonstrate consistent reductions in word error rate, confirming that combining domain-specific lexical coverage with realistic pronunciation variation significantly improves ASR robustness.

Index Terms: Automatic speech recognition, domain adaptation, large language models, synthetic speech, phonetic respelling

1. INTRODUCTION

End-to-end automatic speech recognition (ASR) systems have achieved remarkable progress in recent years, but they still suffer substantial performance degradation when applied to domain-specific data that differs from the training distribution [1]. Since collecting large amounts of target-domain text and speech can be costly, recent studies have explored generating domain-specific text using large language models (LLMs) and converting it into synthetic speech via text-to-speech (TTS) as a cost-effective approach for ASR domain adaptation [2, 3].
However, existing synthetic-data approaches face two key limitations: (1) insufficient domain-specific lexical diversity: prior studies have primarily focused on increasing the amount of text without explicitly optimizing for domain-aware lexical diversity and coverage; and (2) lack of natural phonetic variability: synthetic speech generated via TTS lacks the pronunciation variations, errors, and idiosyncrasies found in real speech [4]. Existing acoustic-level augmentation methods (e.g., SpecAugment [5]) mask parts of the spectrograms rather than introduce pronunciation variants and, when applied to uniformly rendered synthetic speech, can be detrimental in some setups [3].

To address these limitations, we propose a robust ASR domain adaptation framework that relies solely on synthetic data. Our approach makes two key contributions: (1) it enhances domain-specific lexical diversity via an LLM-based text augmentation pipeline, equipped with a novel filtering strategy that jointly maximizes type-token ratio (TTR), perplexity, and domain-specific vocabulary coverage; and (2) it introduces phonetic respelling augmentation (PRA), a novel method that leverages LLMs to generate orthographic pseudo-spellings reflecting realistic pronunciation variability. Unlike SpecAugment, which modifies acoustic features after synthesis, PRA injects phonetic diversity directly at the text stage, enabling synthetic speech to capture natural variations such as pronunciation errors, substitutions, and idiosyncrasies, while remaining fully compatible with standard TTS systems.

[Fig. 1: Proposed methods overview. Italics denote placeholders. (a) Text augmentation pipeline: a domain seed (e.g., air traffic control) conditions scenario generation; context seeds (scenarios and domain-specific terms drawn from web pages and manuals) drive sentence generation with multilingual prompts (English, Japanese, Chinese; "Keep the term in English"), followed by translation, paraphrasing, and filtering. Daggers mark the use of multiple LLMs (e.g., GPT, Llama, Qwen) and multilingual prompts. (b) Phonetic respelling augmentation: an LLM converts a sentence into a pronunciation respelling, e.g., Original: "Zhang Feng piloted the Boeing Seven Three Seven aircraft." Respelled: "Jang Feng pilotid ze Bo-in Sevem Three Sevem eer-kraft."]

Experimental results show that our LLM-based text augmentation pipeline improves recognition accuracy across multiple domain-specific datasets, and PRA yields additional gains. Example prompts, generated texts, filtering code, and audio samples with multiple TTS systems are available on our project page.1

1 https://natsuooo.github.io/llm-asr-augmentation/

2. RELATED WORK

2.1. Text filtering for ASR domain adaptation

Recent studies have explored the use of LLM-generated text for downstream ASR domain adaptation [2, 3], but the quality of the generated text has not been sufficiently discussed. Some previous studies have focused on filtering out high-perplexity sentences, aiming to retain only fluent texts [6, 7]. However, prioritizing low-perplexity sentences may risk excluding the specialized or technical expressions necessary for domain adaptation. Other approaches employ vocabulary coverage maximization (VCM) to ensure lexical diversity by maximizing the number of unique words included in the training set [8].
Nevertheless, VCM may lead to the selection of numerous irrelevant or meaningless words generated by LLMs, which can hinder model learning rather than enhance it. In contrast, we adopt a normalized tri-objective selection that jointly balances TTR, perplexity, and domain-term coverage, resulting in texts that are both diverse and domain-specific.

2.2. Phonological Tasks and LLMs

Traditional grapheme-to-phoneme (G2P) models rely heavily on pronunciation dictionaries and often fail to generalize to domain-specific words [9]. Recently, LLMs have been explored for phonological tasks such as G2P conversion [10, 11]. While LLMs can capture broader contextual and linguistic cues than conventional models, prior work has shown that their performance remains inferior to specialized G2P systems or human annotators. Moreover, approaches that employ complex phoneme representations (e.g., IPA [12]) often degrade text-to-speech (TTS) quality due to the difficulty of accurately rendering fine-grained phonetic symbols [13]. In contrast, our approach uses LLMs to generate alphabetic respellings that reflect actual pronunciation, thus avoiding the limitations of dictionary-based G2P and specialized phonetic symbols while enabling diverse and realistic variations.

3. PROPOSED METHOD

Our framework has two key components: an LLM-based text augmentation pipeline (Figure 1(a)) and PRA (Figure 1(b)). Synthesized speech generated from both components is used to fine-tune the ASR model.

3.1. LLM-based text augmentation pipeline

Unlike DAS [2], which generates only the required amount of text, our proposed pipeline first over-generates a large pool of candidate sentences through multi-stage LLM-based augmentation (steps 1-5), and then applies a novel filtering process (step 6) to select the most relevant and diverse subset.
This over-generation and filtering is a key novelty, allowing more effective control over lexical diversity and domain coverage.

1. Domain seed: To ensure that the generated sentences are relevant to the target domain, our pipeline begins by conditioning each prompt on domain-specific information, referred to as a domain seed (e.g., air traffic control).

2. Context seed: Scenarios are generated as context seeds, and texts are created for each scenario to increase diversity. In many production settings, stakeholders provide an operational lexicon (e.g., user names, product names, call signs) obtained from chat logs, websites, or manuals, and systems must recognize these terms with high recall [13-16]. Therefore, we also explore using these terms as context seeds to generate texts containing them.

3. Multilingual prompting: For each context, prompts are constructed in multiple languages (e.g., English, Japanese, Chinese), and the outputs are translated back into the target language to enrich linguistic diversity. To ensure that domain-specific terms are preserved during translation, we add an instruction such as "Keep the term in English" to the prompts in the previous step.

4. Paraphrasing: LLM-based paraphrasing generates alternative expressions for each sentence.

5. Multiple LLMs: Multiple LLMs trained on different data are used in steps 1-4, and their outputs are combined.

6. Filtering: While the above steps yield a large pool of candidate sentences, synthesizing speech and fine-tuning an ASR model on the entire set is inefficient. To address this, we introduce a novel filtering step based on three key heuristics: (1) maximizing TTR to promote lexical diversity, (2) maximizing perplexity to encourage technical and in-domain words, and (3) weighting domain-specific terms to enhance coverage of underrepresented vocabulary.
We compute the combined score S(s) for each candidate sentence s as a weighted sum of TTR gain, perplexity, and the normalized count of domain-specific terms:

S(s) = \alpha \frac{|\mathrm{Vocab}(s) \setminus V|}{|s|} + \beta \exp\left( -\frac{1}{|s|} \sum_{i=1}^{|s|} \log p(w_i \mid w_{<i}) \right) + \gamma \frac{C(s)}{|s|},

where Vocab(s) denotes the word types in s, V the vocabulary already covered by previously selected sentences, |s| the number of words in s, p(w_i | w_{<i}) the language-model probability of the i-th word, and C(s) the number of domain-specific terms in s.
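The tri-objective selection described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' released filtering code: the per-sentence average negative log-likelihood is assumed to come from any external language model, tokenization is whitespace-based for simplicity, and in practice the three terms would be normalized to comparable scales before weighting.

```python
import math

def score(sentence, covered_vocab, domain_terms, avg_nll,
          alpha=1.0, beta=1.0, gamma=1.0):
    """Combined score S(s): TTR gain + perplexity + normalized domain-term count."""
    words = sentence.lower().split()
    new_types = set(words) - covered_vocab       # word types not yet covered (Vocab(s) \ V)
    ttr_gain = len(new_types) / len(words)       # lexical-diversity gain per word
    perplexity = math.exp(avg_nll)               # exp of mean negative log-likelihood
    term_cov = sum(w in domain_terms for w in words) / len(words)  # C(s)/|s|
    return alpha * ttr_gain + beta * perplexity + gamma * term_cov

def greedy_select(candidates, domain_terms, avg_nlls, k, **weights):
    """Greedily pick k sentences, updating the covered vocabulary after each pick."""
    covered, selected = set(), []
    pool = list(range(len(candidates)))
    for _ in range(min(k, len(pool))):
        best = max(pool, key=lambda i: score(candidates[i], covered,
                                             domain_terms, avg_nlls[i], **weights))
        pool.remove(best)
        selected.append(candidates[best])
        covered |= set(candidates[best].lower().split())
    return selected
```

Because the covered vocabulary V grows after each pick, repeated sentences lose their TTR-gain term, which is what steers the selection toward diverse, domain-heavy subsets rather than many near-duplicates.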
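PRA itself reduces to a single prompted LLM call per sentence. The sketch below uses the prompt wording shown in Fig. 1(b); the `llm` callable and the `respell` helper are illustrative stand-ins for whatever chat/completion client is used, and the returned respelling would replace the original text as TTS input.

```python
# Illustrative PRA sketch (prompt wording from Fig. 1(b)); `llm` is any
# text-in/text-out callable wrapping an LLM client.
PRA_PROMPT = (
    "Convert the English sentence into a pronunciation respelling.\n"
    "- Reflect possible pronunciation errors, substitutions, or idiosyncrasies.\n"
    "- Use English letters (not IPA) to represent each word as it would sound.\n"
    "Sentence: {sentence}\n"
    "Respelling:"
)

def respell(sentence, llm):
    """Ask an LLM for an orthographic pseudo-spelling of `sentence`."""
    return llm(PRA_PROMPT.format(sentence=sentence)).strip()
```

Since the output is plain alphabetic text rather than IPA, it can be passed unchanged to any off-the-shelf TTS system, which is what keeps PRA TTS-agnostic.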
