Synthetic Data Domain Adaptation for ASR via LLM-based Text and Phonetic Respelling Augmentation


Authors: Natsuo Yamashita, Koichi Nagatsuka, Hiroaki Kokubo, Kota Dohi, Tuan Vu Ho (Hitachi, Ltd.)

Natsuo Yamashita, Koichi Nagatsuka, Hiroaki Kokubo, Kota Dohi, Tuan Vu Ho
Hitachi, Ltd.

ABSTRACT

End-to-end automatic speech recognition often degrades on domain-specific data due to scarce in-domain resources. We propose a synthetic-data-based domain adaptation framework with two contributions: (1) a large language model (LLM)-based text augmentation pipeline with a filtering strategy that balances lexical diversity, perplexity, and domain-term coverage, and (2) phonetic respelling augmentation (PRA), a novel method that introduces pronunciation variability through LLM-generated orthographic pseudo-spellings. Unlike conventional acoustic-level methods such as SpecAugment, PRA provides phonetic diversity before speech synthesis, enabling synthetic speech to better approximate real-world variability. Experimental results across four domain-specific datasets demonstrate consistent reductions in word error rate, confirming that combining domain-specific lexical coverage with realistic pronunciation variation significantly improves ASR robustness.

Index Terms: Automatic speech recognition, domain adaptation, large language models, synthetic speech, phonetic respelling

1. INTRODUCTION

End-to-end automatic speech recognition (ASR) systems have achieved remarkable progress in recent years, but they still suffer substantial performance degradation when applied to domain-specific data that differs from the training distribution [1]. Since collecting large amounts of target-domain text and speech can be costly, recent studies have explored generating domain-specific text using large language models (LLMs) and converting it into synthetic speech via text-to-speech (TTS) as a cost-effective approach for ASR domain adaptation [2, 3].
However, existing synthetic-data approaches face two key limitations: (1) insufficient domain-specific lexical diversity: prior studies have primarily focused on increasing the amount of text without explicitly optimizing for domain-aware lexical diversity and coverage; and (2) lack of natural phonetic variability: synthetic speech generated via TTS lacks the pronunciation variations, errors, and idiosyncrasies found in real speech [4]. Existing acoustic-level augmentation methods (e.g., SpecAugment [5]) mask parts of the spectrograms rather than introduce pronunciation variants and, when applied to uniformly rendered synthetic speech, can be detrimental in some setups [3].

To address these limitations, we propose a robust ASR domain adaptation framework that relies solely on synthetic data. Our approach makes two key contributions: (1) it enhances domain-specific lexical diversity via an LLM-based text augmentation pipeline, equipped with a novel filtering strategy that jointly maximizes type-token ratio (TTR), perplexity, and domain-specific vocabulary coverage; and (2) it introduces phonetic respelling augmentation (PRA), a novel method that leverages LLMs to generate orthographic pseudo-spellings reflecting realistic pronunciation variability. Unlike SpecAugment, which modifies acoustic features after synthesis, PRA injects phonetic diversity directly at the text stage, enabling synthetic speech to capture natural variations such as pronunciation errors, substitutions, and idiosyncrasies, while remaining fully compatible with standard TTS systems.

[Fig. 1: Proposed methods overview. Italics denote placeholders. (a) Text augmentation pipeline: a domain seed (e.g., air traffic control) conditions scenario generation; context seeds (scenarios and domain-specific terms drawn from web pages and manuals) drive sentence generation with multilingual prompts (English, Japanese, Chinese; "Keep the term in English"), followed by translation, paraphrasing, and filtering. Daggers mark the use of multiple LLMs (e.g., GPT, Llama, Qwen) and multilingual prompts. (b) Phonetic respelling augmentation: an LLM converts a sentence into a pronunciation respelling, e.g., Original: "Zhang Feng piloted the Boeing Seven Three Seven aircraft." Respelled: "Jang Feng pilotid ze Bo-in Sevem Three Sevem eer-kraft."]

Experimental results show that our LLM-based text augmentation pipeline improves recognition accuracy across multiple domain-specific datasets, and PRA yields additional gains. Example prompts, generated texts, filtering code, and audio samples with multiple TTS systems are available on our project page.1

1 https://natsuooo.github.io/llm-asr-augmentation/

2. RELATED WORK

2.1. Text filtering for ASR domain adaptation

Recent studies have explored the use of LLM-generated text for downstream ASR domain adaptation [2, 3], but the quality of the generated text has not been sufficiently discussed. Some previous studies have focused on filtering out high-perplexity sentences, aiming to retain only fluent texts [6, 7]. However, prioritizing low-perplexity sentences may risk excluding the specialized or technical expressions necessary for domain adaptation. Other approaches employ vocabulary coverage maximization (VCM) to ensure lexical diversity by maximizing the number of unique words included in the training set [8].
Nevertheless, VCM may lead to the selection of numerous irrelevant or meaningless words generated by LLMs, which can hinder model learning rather than enhance it. In contrast, we adopt a normalized tri-objective selection that jointly balances TTR, perplexity, and domain-term coverage, resulting in texts that are both diverse and domain-specific.

2.2. Phonological Tasks and LLMs

Traditional grapheme-to-phoneme (G2P) models rely heavily on pronunciation dictionaries and often fail to generalize to domain-specific words [9]. Recently, LLMs have been explored for phonological tasks such as G2P conversion [10, 11]. While LLMs can capture broader contextual and linguistic cues than conventional models, prior work has shown that their performance remains inferior to specialized G2P systems or human annotators. Moreover, approaches that employ complex phoneme representations (e.g., IPA [12]) often degrade text-to-speech (TTS) quality due to the difficulty of accurately rendering fine-grained phonetic symbols [13]. In contrast, our approach uses LLMs to generate alphabetic respellings that reflect actual pronunciation, thus avoiding the limitations of dictionary-based G2P and specialized phonetic symbols while enabling diverse and realistic variations.

3. PROPOSED METHOD

Our framework has two key components: an LLM-based text augmentation pipeline (Figure 1(a)) and PRA (Figure 1(b)). Synthesized speech generated from both components is used to fine-tune the ASR model.

3.1. LLM-based text augmentation pipeline

Unlike DAS [2], which generates only the required amount of text, our proposed pipeline first over-generates a large pool of candidate sentences through multi-stage LLM-based augmentation (steps 1-5), and then applies a novel filtering process (step 6) to select the most relevant and diverse subset.
This over-generation and filtering is a key novelty, allowing more effective control over lexical diversity and domain coverage.

1. Domain seed: To ensure that the generated sentences are relevant to the target domain, our pipeline begins by conditioning each prompt on domain-specific information, referred to as a domain seed (e.g., air traffic control).

2. Context seed: Scenarios are generated as context seeds, and texts are created for each scenario to increase diversity. In many production settings, stakeholders provide an operational lexicon (e.g., user names, product names, call signs) obtained from chat logs, websites, or manuals, and systems must recognize these terms with high recall [13-16]. Therefore, we also explore using these terms as context seeds to generate texts containing them.

3. Multilingual prompting: For each context, prompts are constructed in multiple languages (e.g., English, Japanese, Chinese), and the outputs are translated back into the target language to enrich linguistic diversity. To ensure that domain-specific terms are preserved during translation, we add an instruction such as "Keep the term in English" to the prompts in the previous step.

4. Paraphrasing: LLM-based paraphrasing generates alternative expressions for each sentence.

5. Multiple LLMs: Multiple LLMs trained on different data are used in steps 1-4, and their outputs are combined.

6. Filtering: While the above steps yield a large pool of candidate sentences, synthesizing speech and fine-tuning an ASR model on the entire set is inefficient. To address this, we introduce a novel filtering step based on three key heuristics: (1) maximizing TTR to promote lexical diversity, (2) maximizing perplexity to encourage technical and in-domain words, and (3) weighting domain-specific terms to enhance coverage of underrepresented vocabulary.
We compute the combined score S(s) for each candidate sentence s as a weighted sum of TTR gain, perplexity, and the normalized count of domain-specific terms:

S(s) = \alpha \frac{|\mathrm{Vocab}(s) \setminus V|}{|s|} + \beta \exp\left( -\frac{1}{|s|} \sum_{i=1}^{|s|} \log p(w_i \mid w_{<i}) \right) + \gamma \frac{C(s)}{|s|},

where Vocab(s) denotes the word types in s, V the vocabulary already covered by previously selected sentences, |s| the number of words in s, p(w_i | w_{<i}) the language-model probability of the i-th word, and C(s) the number of domain-specific terms in s.
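The tri-objective selection described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' released filtering code: the per-sentence average negative log-likelihood is assumed to come from any external language model, tokenization is whitespace-based for simplicity, and in practice the three terms would be normalized to comparable scales before weighting.

```python
import math

def score(sentence, covered_vocab, domain_terms, avg_nll,
          alpha=1.0, beta=1.0, gamma=1.0):
    """Combined score S(s): TTR gain + perplexity + normalized domain-term count."""
    words = sentence.lower().split()
    new_types = set(words) - covered_vocab       # word types not yet covered (Vocab(s) \ V)
    ttr_gain = len(new_types) / len(words)       # lexical-diversity gain per word
    perplexity = math.exp(avg_nll)               # exp of mean negative log-likelihood
    term_cov = sum(w in domain_terms for w in words) / len(words)  # C(s)/|s|
    return alpha * ttr_gain + beta * perplexity + gamma * term_cov

def greedy_select(candidates, domain_terms, avg_nlls, k, **weights):
    """Greedily pick k sentences, updating the covered vocabulary after each pick."""
    covered, selected = set(), []
    pool = list(range(len(candidates)))
    for _ in range(min(k, len(pool))):
        best = max(pool, key=lambda i: score(candidates[i], covered,
                                             domain_terms, avg_nlls[i], **weights))
        pool.remove(best)
        selected.append(candidates[best])
        covered |= set(candidates[best].lower().split())
    return selected
```

Because the covered vocabulary V grows after each pick, repeated sentences lose their TTR-gain term, which is what steers the selection toward diverse, domain-heavy subsets rather than many near-duplicates.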
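PRA itself reduces to a single prompted LLM call per sentence. The sketch below uses the prompt wording shown in Fig. 1(b); the `llm` callable and the `respell` helper are illustrative stand-ins for whatever chat/completion client is used, and the returned respelling would replace the original text as TTS input.

```python
# Illustrative PRA sketch (prompt wording from Fig. 1(b)); `llm` is any
# text-in/text-out callable wrapping an LLM client.
PRA_PROMPT = (
    "Convert the English sentence into a pronunciation respelling.\n"
    "- Reflect possible pronunciation errors, substitutions, or idiosyncrasies.\n"
    "- Use English letters (not IPA) to represent each word as it would sound.\n"
    "Sentence: {sentence}\n"
    "Respelling:"
)

def respell(sentence, llm):
    """Ask an LLM for an orthographic pseudo-spelling of `sentence`."""
    return llm(PRA_PROMPT.format(sentence=sentence)).strip()
```

Since the output is plain alphabetic text rather than IPA, it can be passed unchanged to any off-the-shelf TTS system, which is what keeps PRA TTS-agnostic.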
