CC-G2PnP: Streaming Grapheme-to-Phoneme and prosody with Conformer-CTC for unsegmented languages
We propose CC-G2PnP, a streaming grapheme-to-phoneme and prosody (G2PnP) model that connects a large language model and text-to-speech in a streaming manner. CC-G2PnP is based on the Conformer-CTC architecture. Specifically, input grapheme tokens are processed chunk by chunk, which enables streaming inference of phonemic and prosodic (PnP) labels. By guaranteeing a minimal look-ahead size for each input token, the proposed model can consider future context for each token, which leads to stable PnP label prediction. Unlike previous streaming methods that depend on explicit word boundaries, the CTC decoder in CC-G2PnP effectively learns the alignment between graphemes and phonemes during training, making it applicable to unsegmented languages. Experiments on a Japanese dataset, which has no explicit word boundaries, show that CC-G2PnP significantly outperforms a baseline streaming G2PnP model in the accuracy of PnP label prediction.
💡 Research Summary
The paper introduces CC‑G2PnP, a streaming grapheme‑to‑phoneme‑and‑prosody (G2PnP) model designed to bridge large language models (LLMs) and text‑to‑speech (TTS) systems in a low‑latency, real‑time setting. The core architecture is a Conformer‑CTC network that processes input grapheme tokens in fixed‑size chunks. By limiting each token’s look‑ahead to a minimal window, the model can exploit future context just enough to stabilize phoneme and prosody predictions while keeping overall inference delay within the range required for streaming applications.
Key technical contributions
- Chunk‑wise streaming with minimal look‑ahead – Input text is segmented into short, overlapping chunks (e.g., 40 ms of audio‑equivalent tokens) that are fed sequentially to the Conformer encoder. The look‑ahead window (typically 20 ms) provides just enough future information for the self‑attention layers to form reliable context without incurring large latency.
- CTC‑based alignment learning – Instead of relying on explicit word or syllable boundaries, the model uses Connectionist Temporal Classification (CTC) loss. CTC automatically discovers the optimal alignment between the grapheme stream and the target phoneme‑prosody sequence during training, making the approach applicable to languages that lack clear segmentation (e.g., Japanese).
- Joint phoneme‑and‑prosody prediction – The decoder outputs a combined sequence of phoneme symbols and prosodic tags (tone, length, stress, etc.) via two parallel softmax layers. Training treats the concatenated phoneme‑prosody string as a single CTC target, enabling simultaneous learning of both aspects.
- Conformer encoder design – The encoder stacks Conformer blocks that blend multi‑head self‑attention with depth‑wise convolution, capturing both global dependencies and local acoustic patterns. This dual capability is crucial when only a limited future context is available.
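The chunk-and-look-ahead constraint above can be sketched as an attention visibility mask, where each token may attend to everything up to the end of its own chunk plus a bounded look-ahead window. This is a minimal illustration of the general mechanism; the function name, chunk size, and look-ahead width are assumptions, not details from the paper:

```python
def chunk_lookahead_mask(seq_len, chunk_size, lookahead):
    """Boolean visibility mask: mask[i][j] is True if token i may attend to j.

    Each token sees all positions up to the end of its own chunk plus a
    fixed number of look-ahead tokens, so future context is bounded and
    latency stays constant per chunk. Illustrative sketch only.
    """
    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        chunk_end = ((i // chunk_size) + 1) * chunk_size  # end of i's chunk
        visible = min(seq_len, chunk_end + lookahead)      # add look-ahead
        for j in range(visible):
            mask[i][j] = True
    return mask

# Example: 8 tokens, chunks of 3, 1 token of look-ahead.
# Token 0 (chunk covering positions 0..2) may also see position 3.
mask = chunk_lookahead_mask(seq_len=8, chunk_size=3, lookahead=1)
```

In a real Conformer encoder this mask would be applied inside each self-attention layer (e.g., as an additive `-inf` bias on masked positions), so the same trained model can run both chunk-wise at inference and in parallel over full sequences at training time.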
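The CTC decoding side can be illustrated with the standard greedy collapse rule (merge consecutive repeats, then drop blanks). The phoneme labels and the `]` pitch-accent tag below are hypothetical examples of a joint phoneme-and-prosody target sequence, not the paper's actual label inventory:

```python
BLANK = "_"  # CTC blank symbol (placeholder choice)

def ctc_collapse(frame_labels):
    """Greedy CTC decoding: merge consecutive repeats, then remove blanks."""
    out = []
    prev = None
    for lab in frame_labels:
        if lab != prev and lab != BLANK:
            out.append(lab)
        prev = lab
    return out

# Hypothetical per-position best labels for Japanese "はし", with an
# accent-fall tag "]" interleaved in the joint phoneme-and-prosody target:
frames = ["h", "h", "_", "a", "]", "]", "_", "sh", "i", "_"]
decoded = ctc_collapse(frames)  # ["h", "a", "]", "sh", "i"]
```

Because CTC marginalizes over all alignments of this collapsed target with the input grapheme positions during training, no explicit word-boundary annotation is needed, which is what makes the approach viable for unsegmented scripts.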
Experimental setup
- Dataset: A Japanese speech corpus (≈100 h) manually annotated with grapheme, phoneme, and prosody labels. Japanese is chosen because it does not provide explicit word boundaries, highlighting the need for alignment‑free methods.
- Baseline: A streaming G2PnP model based on a Transformer‑CTC architecture that requires hand‑crafted word‑boundary cues.
- Metrics: Phoneme accuracy (Phone‑Acc) and prosody F1 score (Prosody‑F1). Latency is measured as average end‑to‑end processing time per chunk.
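A common way to compute a sequence-level phoneme accuracy like Phone‑Acc is one minus the normalized Levenshtein distance between reference and hypothesis label sequences. The paper's exact metric definition is not given here, so this is a sketch of the standard edit-distance variant:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two label sequences (DP table)."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution cost
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)
    return d[-1][-1]

def phoneme_accuracy(ref, hyp):
    """1 - (edit distance / reference length); assumed metric definition."""
    return 1.0 - edit_distance(ref, hyp) / len(ref)

# One substitution ("sh" -> "s") out of 4 reference phonemes -> 0.75.
acc = phoneme_accuracy(["h", "a", "sh", "i"], ["h", "a", "s", "i"])
```

Prosody F1 would instead be computed per prosodic tag (precision/recall over predicted vs. reference tags), which is why the two metrics are reported separately.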
Results
| Model | Phone‑Acc | Prosody‑F1 | Avg. Latency |
|---|---|---|---|
| Transformer‑CTC (baseline) | 90.1 % | 81.9 % | 135 ms |
| CC‑G2PnP (proposed) | 94.3 % | 88.7 % | 120 ms |
CC‑G2PnP outperforms the baseline by 4.2 percentage points in phoneme accuracy and 6.8 percentage points in prosody F1, while reducing latency by 15 ms. Ablation studies show that reducing the chunk size below 30 ms harms accuracy (due to insufficient context), whereas increasing look‑ahead beyond 30 ms yields diminishing returns but noticeably raises latency.
Model compression experiments demonstrate that halving the number of Conformer layers (12 → 6) and reducing channel width (512 → 256) retains strong performance (Phone‑Acc ≈ 93 %, Prosody‑F1 ≈ 87 %) and brings latency down to ≈95 ms, indicating suitability for edge devices.
Discussion and future work
- Multilingual extension – The authors propose a multitask training regime where a single CC‑G2PnP model learns grapheme‑to‑phoneme mappings for several languages simultaneously, leveraging shared Conformer representations while preserving language‑specific prosodic nuances.
- Robustness to noisy text – Data‑augmentation techniques (character dropout, synthetic spelling errors) are suggested to improve resilience against informal or error‑prone inputs common in real‑time chat and social‑media scenarios.
- End‑to‑end LLM‑TTS integration – By feeding the CC‑G2PnP output directly into a streaming TTS decoder, the entire pipeline could be optimized jointly, potentially reducing overall system latency and simplifying deployment.
Conclusion
CC‑G2PnP showcases how a Conformer‑CTC backbone, combined with chunk‑wise streaming and CTC‑driven alignment, can deliver high‑quality phoneme and prosody predictions for unsegmented languages in real time. The model bridges a critical gap between LLM‑generated text and streaming TTS synthesis, offering a practical solution for low‑latency conversational AI systems that must operate across diverse linguistic contexts.