We propose CC-G2PnP, a streaming grapheme-to-phoneme and prosody (G2PnP) model that connects a large language model and text-to-speech in a streaming manner. CC-G2PnP is based on the Conformer-CTC architecture. Specifically, the input grapheme tokens are processed chunk by chunk, which enables streaming inference of phonemic and prosodic (PnP) labels. By guaranteeing a minimum look-ahead size for each input token, the proposed model can consider future context for every token, which leads to stable PnP label prediction. Unlike previous streaming methods that depend on explicit word boundaries, the CTC decoder in CC-G2PnP effectively learns the alignment between graphemes and phonemes during training, making the model applicable to unsegmented languages. Experiments on a Japanese dataset, which has no explicit word boundaries, show that CC-G2PnP significantly outperforms the baseline streaming G2PnP model in the accuracy of PnP label prediction.
Owing to the growing demand for natural and efficient human-machine interaction, spoken dialogue models have been actively studied [1]. Although one line of research explores end-to-end models that perform speech-to-speech generation within a single model [2,3,4], cascade approaches that combine automatic speech recognition (ASR), a large language model (LLM), and text-to-speech (TTS) offer advantages in robustness and flexibility [5,6]. To improve the response speed of spoken dialogue models, cascade approaches have applied streaming TTS to the text incrementally generated by the LLM, achieving promising results [6,7,8,9].
In terms of TTS input, some models can take the raw output of the LLM, i.e., graphemes [10,11]. These models are trained to generate correct pronunciation and prosody in speech directly from graphemes. However, achieving sufficient performance in this setting generally requires a huge amount of paired text-speech data, since the model must simultaneously learn pronunciation and prosody, as well as additional aspects of speech such as speaker characteristics and speaking style. Therefore, a practical alternative is to first convert graphemes into phonemes and prosodic labels (G2PnP) and then use them as input to the TTS model [12,13,14]. Since phonemes and prosodic labels (e.g., accent position and accent type) provide more direct information for synthesizing speech than graphemes, this approach tends to yield more stable performance even with limited training data. Nevertheless, since G2PnP is introduced between the LLM and the TTS, reducing latency requires streaming not only in TTS but also in G2PnP.
A naive approach to achieve streaming G2PnP is to process the input text in chunks using non-streaming sentence-level G2PnP models [12,15]. However, since G2PnP in many languages is highly dependent on the surrounding context, this approach generally struggles to deliver robust performance.
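To make this weakness concrete, the following minimal sketch (not from the paper) implements the naive strategy: a hypothetical non-streaming sentence-level model `sentence_g2pnp` is applied to fixed-size chunks in isolation, so labels near chunk boundaries never see the surrounding context.

```python
from typing import Callable, Iterator, List

def naive_streaming_g2pnp(
    grapheme_stream: Iterator[str],
    sentence_g2pnp: Callable[[str], List[str]],
    chunk_size: int = 8,
) -> Iterator[List[str]]:
    """Yield PnP labels chunk by chunk; each chunk is converted in isolation."""
    buffer = ""
    for grapheme in grapheme_stream:
        buffer += grapheme
        if len(buffer) >= chunk_size:
            # The sentence-level model sees only this chunk, so labels near
            # the chunk edges lack the left/right context they may need.
            yield sentence_g2pnp(buffer)
            buffer = ""
    if buffer:  # flush the final partial chunk
        yield sentence_g2pnp(buffer)
```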
Since G2PnP is a sequence-to-sequence mapping problem, another approach would be to employ the encoder-decoder Transformer architecture [16]. Dekel et al. [8] proposed LLM2PnP, a streaming G2PnP model that applies a word-level restriction to the attention mask of the Transformer. Unlike the naive chunk-based streaming approaches, this model can effectively incorporate both past and future context during streaming, and it demonstrated strong performance in streaming G2PnP for English. However, since the implementation of look-ahead and masking assumes the presence of word boundaries, it cannot be directly applied to languages without explicit word segmentation, such as Japanese or Chinese, i.e., unsegmented languages.
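For illustration, the sketch below gives one reading of such word-level look-ahead masking, not the authors' exact implementation: token i may attend to every token up to the end of the word lying `lookahead_words` words after its own. Note that the `word_ids` input presupposes word boundaries, which is precisely what unsegmented languages lack.

```python
import numpy as np

def word_lookahead_mask(word_ids: list, lookahead_words: int = 1) -> np.ndarray:
    """word_ids[i] is the index of the word containing token i (monotone)."""
    n = len(word_ids)
    mask = np.zeros((n, n), dtype=bool)  # True = attention allowed
    for i in range(n):
        limit_word = word_ids[i] + lookahead_words
        for j in range(n):
            mask[i, j] = word_ids[j] <= limit_word
    return mask

# e.g., subword tokens of "the cat sat" -> word_ids = [0, 0, 1, 2];
# with lookahead_words=1, the tokens of "the" can also attend into "cat".
```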
To overcome the limitations of the naive approaches and LLM2PnP, we introduce CC-G2PnP, a streaming G2PnP model capable of incorporating past and future context without depending on word boundaries. CC-G2PnP is composed of a stack of streaming Conformer [17] layers followed by a connectionist temporal classification (CTC) decoder [18]. Unlike LLM2PnP, which requires an explicit mapping between words and phonemes to construct its attention masks, CC-G2PnP learns these mappings dynamically through CTC, making it applicable to unsegmented languages. Furthermore, to enable efficient streaming with the Conformer, the proposed method employs chunk-aware streaming [19], which divides the input sequence into chunks and allows tokens to attend within each chunk to realize look-ahead. To adapt this framework to the G2PnP task, we further extend the model so that every token has at least one token of look-ahead, thereby enabling more stable PnP label prediction.
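One plausible realization of such a mask is sketched below (the paper's exact scheme may differ): tokens attend to all past tokens plus their own chunk, and every token is additionally granted at least `min_lookahead` future tokens, so that even the last token of a chunk sees some right context.

```python
import numpy as np

def chunked_mask(seq_len: int, chunk_size: int, min_lookahead: int = 1) -> np.ndarray:
    """Chunk-aware streaming mask with a guaranteed minimum look-ahead."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)  # True = attention allowed
    for i in range(seq_len):
        # End (exclusive) of the chunk containing token i.
        chunk_end = (i // chunk_size + 1) * chunk_size
        # Guarantee at least `min_lookahead` future tokens, even for the
        # last token of a chunk, which would otherwise have none.
        visible_until = max(chunk_end, i + 1 + min_lookahead)
        mask[i, :min(visible_until, seq_len)] = True
    return mask
```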
To demonstrate the effectiveness of CC-G2PnP in the streaming G2PnP task for unsegmented languages, we conducted experiments using a Japanese dataset. The experimental results revealed that CC-G2PnP achieved significant improvements over baseline streaming methods in terms of both objective evaluations (character error rate, CER; sentence error rate, SER) and subjective assessments of the naturalness of TTS samples.
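CER and SER are standard metrics; since the paper does not specify an implementation, the following sketch uses the usual definitions: CER is the label-level edit distance normalized by the reference length, and SER is the fraction of sentences containing at least one error.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two label sequences."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # deletion
                        dp[j - 1] + 1,      # insertion
                        prev + (r != h))    # substitution (or match)
            prev = cur
    return dp[-1]

def cer(refs, hyps):
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return errors / sum(len(r) for r in refs)

def ser(refs, hyps):
    return sum(r != h for r, h in zip(refs, hyps)) / len(refs)
```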
Considering the successful application of CTC in prior work on non-streaming G2P tasks [20,21,22] and streaming scenarios [19,23], the proposed method employs a CTC decoder to model the G2PnP task as a sequence labeling problem. The use of CTC eliminates the need for pre-defined alignments and word boundaries. The proposed model architecture is shown in Fig. 1. As shown in the figure, the proposed model adopts a structure commonly used in the field of ASR, consisting of a stack of Conformer layers [17] followed by a CTC decoder [18,24]. To relax the conditional independence assumption between output tokens and improve the performance of the model, we introduce self-conditioned CTC [25] into the intermediate Conformer layers. The model is optimized to minimize the sum of the final and intermediate CTC losses.
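As a rough sketch of this objective, assuming a PyTorch-style implementation (the layer choices and the unweighted sum here are illustrative, not necessarily the paper's exact configuration), CTC losses are computed at the final layer and at intermediate layers and summed:

```python
import torch.nn.functional as F

def total_ctc_loss(final_logits, intermediate_logits, targets,
                   input_lengths, target_lengths):
    """final_logits: (T, B, V); intermediate_logits: list of same-shape tensors."""
    def ctc(logits):
        log_probs = F.log_softmax(logits, dim=-1)
        return F.ctc_loss(log_probs, targets, input_lengths,
                          target_lengths, blank=0, zero_infinity=True)

    loss = ctc(final_logits)
    for inter in intermediate_logits:
        # In self-conditioned CTC, these intermediate predictions are also
        # fed back into subsequent layers during the forward pass (not shown).
        loss = loss + ctc(inter)
    return loss
```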
To make the Conformer streaming-capable, it is necessary to restrict the dependencies on future tokens in both the convolution layers and the self-attention layers.
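For the convolution side, a common way to remove future dependencies is to pad only on the past side, as in the sketch below (a standard technique, not necessarily the paper's exact configuration); with chunk-aware streaming, a bounded amount of right context can instead be admitted.

```python
import torch
import torch.nn as nn

class CausalDepthwiseConv1d(nn.Module):
    """Depthwise 1-D convolution that sees no future frames."""
    def __init__(self, channels: int, kernel_size: int):
        super().__init__()
        self.left_pad = kernel_size - 1  # pad only the past side
        self.conv = nn.Conv1d(channels, channels, kernel_size, groups=channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); left zero-padding keeps the output at
        # each step a function of current and past inputs only.
        x = nn.functional.pad(x, (self.left_pad, 0))
        return self.conv(x)
```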