
๐Ÿ“ Original Info

  • Title:
  • ArXiv ID: 2512.22491
  • Date:
  • Authors: Unknown

๐Ÿ“ Abstract

As an endangered language, Manchu presents unique challenges for speech synthesis, including severe data scarcity and strong phonological agglutination. This paper proposes ManchuTTS (Manchu Text-to-Speech), a novel approach tailored to Manchu's linguistic characteristics. To handle agglutination, the method designs a three-tier text representation (phoneme, syllable, prosodic) and a cross-modal hierarchical attention mechanism for multi-granular alignment. The synthesis model integrates deep convolutional networks with a flow-matching Transformer, enabling efficient, non-autoregressive generation. A hierarchical contrastive loss further guides structured acoustic-linguistic correspondence. To address low-resource constraints, this work constructs the first Manchu TTS dataset and employs a data augmentation strategy. Experiments demonstrate that ManchuTTS attains a MOS of 4.52 using a 5.2-hour training subset derived from the full 6.24-hour annotated corpus, outperforming all baseline models by a notable margin. Ablations confirm that hierarchical guidance improves agglutinative word pronunciation accuracy (AWPA) by 31% and prosodic naturalness by 27%.

📄 Full Content

Research on speech synthesis [1] for common languages is relatively mature, but Manchu, a UNESCO "critically endangered" language, faces severe challenges due to data scarcity and complex phonological characteristics. As the official language of China's Qing Dynasty, Manchu has left behind a vast number of historical archives awaiting translation, as shown in Fig. 1. However, with fewer than 100 fluent speakers globally today, most documents remain uninterpreted, significantly hindering historical research.

Manchu speech synthesis [2] faces dual challenges from severe data scarcity and complex linguistic characteristics [3], [4]. As an agglutinative language, its phonological patterns such as vowel harmony are difficult for conventional TTS models to capture accurately. While methods such as LRSpeech [5] offer promising directions for low-resource synthesis, they assume a minimum data threshold unavailable for truly endangered languages. Existing approaches, including flow-based models like F5-TTS [6] and ReFlow-TTS [7], lack designs for agglutinative morphological complexity. Technically, explicit alignment methods show high error rates (25%-35%) in low-resource settings, duration predictors produce rigid rhythms, and non-autoregressive architectures struggle with nuanced prosodic variations. This creates a dual challenge: overcoming both data scarcity and the alignment demands of agglutinative phonology.

To address data scarcity and agglutinative complexity, this paper proposes ManchuTTS, a framework integrating hierarchical linguistic guidance with conditional flow matching [8]. Unlike standard flow-based TTS, the method conditions the flow vector field on three-tier linguistic features c = {c_phon, c_syll, c_pros}. The model employs cross-modal attention for multi-granular alignment and efficient non-autoregressive generation via conditional differential equations. Key contributions include: a hierarchical guidance mechanism for agglutinative languages; an end-to-end system based on conditional flow matching; and the first public Manchu TTS dataset with validation. Experimental results show ManchuTTS achieves a MOS of 4.52 with only 5.2 hours of training data, surpassing all baselines. Ablations validate the hierarchical design, improving agglutinative word accuracy by 31% and prosodic naturalness by 27%. The main contributions of this work are as follows:

• This paper proposes a hierarchical text feature guidance framework tailored for Manchu speech synthesis, effectively addressing its agglutinative characteristics.

• This paper develops an end-to-end speech synthesis system based on hierarchical conditional flow matching, enabling high-quality synthesis with minimal data requirements.

• This paper constructs the first publicly available Manchu speech synthesis dataset and demonstrates the effectiveness of the approach through comprehensive experiments.

This work provides both a practical solution for Manchu speech synthesis and a methodological framework that can be extended to other endangered languages, contributing to the preservation and accessibility of linguistic heritage through technological innovation.

Theoretical Framework. This method establishes an implicit mapping from reference speech V_ref and reference text T_ref to the target speech for target text T_tgt, forming a ternary speech-text-acoustic relationship. To handle Manchu's agglutinative nature, the input text is decomposed into three hierarchical linguistic units: c_phon (phoneme-level features), c_syll (syllable/word-structure features), and c_pros (prosodic features). These are integrated into a conditional flow matching framework to guide the speech generation process.

Conditional Flow Matching. The transformation from noise x_0 to target speech x_1 is modeled via a linear interpolation path:

x_t = (1 - t) x_0 + t x_1,  (1)

where x_0 ∼ N(0, I) and x_1 is the target mel-spectrogram.

The corresponding vector field that drives this transformation is:

u_t = x_1 - x_0.  (2)

The goal is to learn a parameterized model v_t(x_t | c; θ) that approximates u_t. The training objective is:

L_CFM = E_{t, x_0, x_1} [ || v_t(x_t | c; θ) - (x_1 - x_0) ||^2 ],  (3)

where c = {c_phon, c_syll, c_pros} represents the multi-level linguistic conditions. At inference, the method starts from Gaussian noise and solves the ODE defined by v_t to generate speech that aligns with the input text. To handle Manchu's complex agglutinative morphology, this work proposes a three-stage mechanism (phoneme-syllable-prosody). Fig. 2 illustrates the multi-layer Praat annotations for an agglutinative sentence, validated by linguists. The complete synthesis pipeline of ManchuTTS is presented in Fig. 3.
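
To make the flow-matching objective and the ODE-based inference concrete, the following is a minimal PyTorch sketch of one training step and Euler-integration sampling. The `velocity_model` interface (taking the noisy mel x_t, the time t, and the condition c), the tensor shapes, and the number of Euler steps are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def cfm_training_step(velocity_model, mel_target, cond, optimizer):
    """One conditional flow-matching step: regress the predicted velocity
    v_t(x_t | c) onto the constant target u_t = x1 - x0 along the linear
    path x_t = (1 - t) * x0 + t * x1 (cf. Eqs. (1)-(3))."""
    x1 = mel_target                                   # (B, n_mels, T) target mel
    x0 = torch.randn_like(x1)                         # noise sample, x0 ~ N(0, I)
    t = torch.rand(x1.size(0), device=x1.device)      # one time value per utterance
    t_b = t.view(-1, 1, 1)                            # broadcast over mel bins/frames
    x_t = (1 - t_b) * x0 + t_b * x1                   # Eq. (1): interpolation path
    u_t = x1 - x0                                     # Eq. (2): target vector field
    loss = torch.mean((velocity_model(x_t, t, cond) - u_t) ** 2)  # Eq. (3)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def cfm_sample(velocity_model, cond, mel_shape, steps=32, device="cpu"):
    """Inference: start from Gaussian noise and integrate the learned ODE
    dx/dt = v_t(x | c) from t = 0 to t = 1 with simple Euler steps."""
    x = torch.randn(mel_shape, device=device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((mel_shape[0],), i * dt, device=device)
        x = x + dt * velocity_model(x, t, cond)
    return x
```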

This paper proposes a three-layer cross-modal attention architecture for progressive alignment between textual features F_t and acoustic features F_a. Following a “self-attention → interaction → self-attention” paradigm, it enables stepwise cross-modal alignment.

Layer 1: Intra-modal Self-Attention. First, both textual and acoustic features undergo intra-modal self-attention to enhance their internal representations. The multi-head attention mechanism captures semantic dependencies within the textual sequence, computed as:

F_t^(1) = MHA(F_t, F_t, F_t).  (4)

Similarly, the acoustic features are processed through an analogous self-attention operation to strengthen the temporal relationships within the acoustic sequence:

F_a^(1) = MHA(F_a, F_a, F_a).  (5)

Here, MHA(Q, K, V) denotes the standard multi-head attention mechanism. This layer produces enhanced modality-specific features, laying the foundation for subsequent cross-modal interaction.

Layer 2: Cross-modal Cross-Attention. After obtaining internally refined features, cross-modal interaction is performed via a bidirectional cross-attention mechanism. Specifically, the textual features are updated by attending to the acoustic features as contextual information:

F_t^(2) = MHA(F_t^(1), F_a^(1), F_a^(1)).  (6)

Symmetrically, the acoustic features are updated by attending to the textual features, ensuring mutual information exchange:

F_a^(2) = MHA(F_a^(1), F_t^(1), F_t^(1)).  (7)

This layer serves as the core of cross-modal alignment, allowing each modality to integrate relevant information from the other and promoting feature-space fusion.

Layer 3: Intra-modal Self-Attention. Finally, to consolidate the information acquired through cross-modal interaction and further refine the internal representations, another intra-modal self-attention layer is applied. The output textual features are computed as:

F_t^out = MHA(F_t^(2), F_t^(2), F_t^(2)).  (8)

Similarly, the output acoustic features are obtained via:

F_a^out = MHA(F_a^(2), F_a^(2), F_a^(2)).  (9)

The resulting aligned features, F_t^out and F_a^out, preserve the internal consistency of their respective modalities while embodying complementary cross-modal information.
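
The three-layer alignment of Eqs. (4)-(9) can be sketched with standard multi-head attention modules, as below. Hidden sizes, head counts, and the omission of residual connections, layer normalization, and feed-forward sublayers are simplifying assumptions rather than the authors' exact architecture.

```python
import torch.nn as nn

class CrossModalAligner(nn.Module):
    """Sketch of the 'self-attention -> cross-attention -> self-attention'
    stack of Eqs. (4)-(9), with residual/FFN sublayers omitted for brevity."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        make = lambda: nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.text_self1, self.audio_self1 = make(), make()   # Layer 1
        self.text_cross, self.audio_cross = make(), make()   # Layer 2
        self.text_self2, self.audio_self2 = make(), make()   # Layer 3

    def forward(self, f_text, f_audio):
        # Layer 1: intra-modal self-attention, Eqs. (4)-(5)
        t1, _ = self.text_self1(f_text, f_text, f_text)
        a1, _ = self.audio_self1(f_audio, f_audio, f_audio)
        # Layer 2: bidirectional cross-attention, Eqs. (6)-(7)
        t2, _ = self.text_cross(t1, a1, a1)    # text queries attend to audio
        a2, _ = self.audio_cross(a1, t1, t1)   # audio queries attend to text
        # Layer 3: intra-modal self-attention, Eqs. (8)-(9)
        t_out, _ = self.text_self2(t2, t2, t2)
        a_out, _ = self.audio_self2(a2, a2, a2)
        return t_out, a_out
```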

ManchuTTS employs a three-tier text representation to address Manchu’s agglutinative structure. At the phoneme level, input text is converted into IPA sequences to capture fine-grained phonetic features, including vowel harmony and coarticulation patterns. The syllable/word level decomposes complex affixes into [root + suffix] structures, modeling inter-syllable transitions and root-suffix prosodic coupling to handle morphological variations. The prosody level predicts global intonation and rhythm patterns, adapting pitch contours and energy distribution according to sentence type (e.g., rising pitch for questions). Collectively, these hierarchical features c provide structured linguistic guidance for subsequent cross-modal alignment and conditional flow matching.
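
As a rough illustration of how the three tiers might be packaged for the model, one possible container is sketched below; the field names and example values (phoneme symbols, root-suffix split, prosody tags) are hypothetical placeholders, not annotations from the paper's corpus.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class HierarchicalTextFeatures:
    """One possible container for the three guidance tiers c_phon, c_syll, c_pros."""
    phonemes: List[str]               # phoneme tier: IPA symbols for the sentence
    morphemes: List[Tuple[str, str]]  # syllable/word tier: (root, suffix) per word
    prosody: Dict[str, str]           # prosody tier: sentence-level intonation/rhythm tags

# A hypothetical instance for a short interrogative sentence (placeholder values).
features = HierarchicalTextFeatures(
    phonemes=["š", "u", "n", "g", "i", "r", "a"],
    morphemes=[("šunggi", "ra")],     # illustrative root + suffix decomposition
    prosody={"sentence_type": "question", "pitch_contour": "rising"},
)
```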

Based on the conditional flow matching model, this paper proposes a hierarchical contrastive alignment loss to enhance the consistency between the generated speech and the hierarchical linguistic conditions c = {c_phon, c_syll, c_pros}. For each level of condition c^(k), positive sample pairs (x_1, c^(k)) and negative sample pairs (x_1, c^(k)_neg) are constructed, and the mutual information between positive pairs is maximized. The loss function is defined as a weighted sum of the contrastive losses at each level:

L_align = - Σ_k λ_k log [ exp(s(x_1, c^(k))) / ( exp(s(x_1, c^(k))) + Σ_neg exp(s(x_1, c^(k)_neg)) ) ],  (10)

where s(·, ·) is a similarity function and λ_k is the weight of level k. This loss guides the generation process to better preserve fine-grained linguistic information at the phoneme, syllable, and prosodic levels.
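
A plausible PyTorch realization of this multi-level contrastive objective is sketched below, using in-batch negatives and cosine similarity as the function s(·, ·); the level weights, temperature, and pooling into fixed-size embeddings are assumptions, since the paper's exact implementation is not given here.

```python
import torch
import torch.nn.functional as F

def hierarchical_contrastive_loss(speech_emb, cond_embs, weights=(1.0, 1.0, 1.0), tau=0.1):
    """InfoNCE-style reading of Eq. (10): for each linguistic level k, the speech
    embedding of utterance i should be most similar to its own level-k condition
    embedding; the other utterances in the batch serve as negatives.

    speech_emb: (B, D) pooled embedding of the generated/target speech.
    cond_embs:  list of three (B, D) tensors for c_phon, c_syll, c_pros.
    """
    loss = 0.0
    for w, c in zip(weights, cond_embs):
        # (B, B) similarity matrix between speech and condition embeddings
        sim = F.cosine_similarity(speech_emb.unsqueeze(1), c.unsqueeze(0), dim=-1)
        labels = torch.arange(sim.size(0), device=sim.device)  # positives on the diagonal
        loss = loss + w * F.cross_entropy(sim / tau, labels)
    return loss
```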

ManchuTTS introduces a novel “multi-granular linguistic conditional flow matching” framework. Unlike standard flow-matching TTS models (e.g., F5-TTS, E2/E3-TTS), it jointly models agglutinative structures by conditioning the flow vector field on hierarchical linguistic features c = {c_phon, c_syll, c_pros}. This coupling of linguistic priors with the underlying ODE fundamentally reshapes the probability flow path during generation, in contrast to traditional hierarchical TTS, which merely uses multi-level embeddings.

Baselines. This paper compares the method with several representative models widely used in low-resource speech synthesis and for minority languages:

• Tacotron 2 [9]: Classic seq2seq TTS with an encoder-attention-decoder architecture; data-hungry and slow in low-resource settings.

• FastSpeech 2 [10]: Non-autoregressive model using duration/pitch prediction; faster synthesis suitable for real-time use.

• Glow-TTS [11]: Flow-based model with invertible networks for bidirectional text-speech conversion.

• VITS [12]: End-to-end TTS integrating a VAE and a GAN, directly generating speech from text.

• F5-TTS [6]: Fast, controllable model with a multi-scale architecture for complex prosody.

• Cloning-based Voice Conversion (CBVC) [13]: Adapts pre-trained models with minimal target data via transfer learning.

Baseline Configuration. All models used the same data (5.2 h train, 0.52 h test), pronunciation dictionary, and audio parameters (44.1 kHz/16-bit, 80-dim Mel) under identical low-resource settings.
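
For reference, the shared audio front end implied by these settings (44.1 kHz sampling, 80-dim Mel) could be configured with torchaudio as below; the FFT size and hop length are assumed values not stated in this section.

```python
import torchaudio

# Mel-spectrogram front end matching the stated baseline audio settings
# (44.1 kHz, 80 mel bins). n_fft and hop_length are assumed, not from the paper.
mel_frontend = torchaudio.transforms.MelSpectrogram(
    sample_rate=44100,
    n_fft=2048,
    hop_length=512,
    n_mels=80,
)
# Usage: mel = mel_frontend(waveform)  # waveform: (channels, num_samples) float tensor
```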

Evaluation included subjective and objective methods: Mean Opinion Score (MOS) tests by 20 native speakers with ABX preference tests, and objective metrics comprising Mel Cepstral Distortion (MCD), fundamental frequency Root Mean Square Error (F0-RMSE), Word Error Rate (WER), speaker similarity (SIM), and Perceptual Evaluation of Speech Quality (PESQ). All tests used the same test set with 95% confidence intervals.

Fig. 5. Impact of three-level text feature guidance on MOS (1-5, higher is better), AWPA (%, higher is better), and prosodic naturalness (%, higher is better), with guidance conditions added incrementally (Phoneme → +Syllable → +Prosody).
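
Where MOS values are reported with 95% confidence intervals (e.g., 4.52 ± 0.11), a simple normal-approximation computation such as the sketch below is one common way to obtain them; the paper's exact CI procedure is not specified here.

```python
import numpy as np

def mos_with_ci(ratings, z=1.96):
    """Mean opinion score with a 95% confidence half-width under a normal
    approximation over listener ratings (assumed procedure, for illustration)."""
    r = np.asarray(ratings, dtype=float)
    mean = r.mean()
    half_width = z * r.std(ddof=1) / np.sqrt(len(r))
    return mean, half_width

# Example: mos, ci = mos_with_ci(listener_scores); report as f"{mos:.2f} ± {ci:.2f}"
```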

ManchuTTS outperforms all baselines across objective metrics (Table III) and shows consistent superiority (Fig. 4).

  1. Ablation Study of the Hierarchical Text Feature Guidance Mechanism: Fig. 5 illustrates the performance contribution of the three-level “phoneme-syllable-prosody” guidance framework in ManchuTTS. Using only the phoneme layer yields a MOS of 3.94. Incorporating the syllable layer increases MOS to 4.21, and enabling all three layers further improves MOS to 4.52, with notable enhancements in agglutinative word accuracy and prosodic naturalness, as summarized in Table IV. Table IV shows the impact of progressively removing guidance layers (other conditions fixed) on MOS, agglutinative word accuracy, and prosodic naturalness.

Configuration A (Phoneme Layer Only): The MOS drops to 3.94, a reduction of 0.58 compared to the full model. Spectral analysis reveals a loss of high-frequency energy (≥ 4 kHz), particularly affecting frication noise in sibilants and nasal resonance in final nasal sounds. Agglutinative word accuracy reaches 70.8%, and prosodic naturalness is 70.4%.

Configuration B (Phoneme + Syllable Layers): MOS increases to 4.21. Key improvements include smoother stem-suffix transitions; for instance, in the verb “šunggi-ra,” the energy gap between the stem /šung/ and suffix /ra/ narrows by 12 ms, contributing to an additional 0.8 dB reduction

ManchuTTS demonstrates deployment-ready efficiency, achieving RTF 0.12, 86 ms latency, and 4.1 GB VRAM on an RTX 4090, enabling 8× concurrent streams (Table VI). It significantly outperforms autoregressive baselines in speed and resource usage (Fig. 7). After INT8 quantization, it sustains 3× real-time synthesis on a Jetson Orin Nano (8 GB), offering a practical edge-computing solution for low-resource language applications.
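
The real-time factor (RTF) quoted here is conventionally the ratio of synthesis wall-clock time to the duration of the generated audio; a minimal measurement helper might look like the following (the `synthesize` callable and sample rate are placeholders).

```python
import time

def real_time_factor(synthesize, text, sample_rate=44100):
    """RTF = wall-clock synthesis time / duration of the generated audio;
    RTF < 1 means faster than real time. `synthesize` is any callable
    returning a 1-D waveform array for the given text."""
    start = time.perf_counter()
    waveform = synthesize(text)
    elapsed = time.perf_counter() - start
    return elapsed / (len(waveform) / sample_rate)
```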

The results are shown in Table VII. Laboratory tests with 20 native speakers showed that ManchuTTS achieved a naturalness MOS of 4.52 ± 0.11, comparable to ground truth, and a clarity score of 4.61 ± 0.10. Qualitative feedback indicated that 63% of elderly listeners praised the clear word endings, though limited emotional expression in long sentences and occasional stress drift were noted, highlighting prosodic modeling as a key direction for future improvement.

ManchuTTS has demonstrated that, with 5.2 hours of training data, it can achieve speech synthesis acceptable to human listeners, as confirmed by the experimental results. However, the model is still limited to studio-recorded speech, and its accuracy decreases on dialectal variants. Future work

