ARCHI-TTS: A flow-matching-based Text-to-Speech Model with Self-supervised Semantic Aligner and Accelerated Inference
Although diffusion-based, non-autoregressive text-to-speech (TTS) systems have demonstrated impressive zero-shot synthesis capabilities, their efficacy is still hindered by two key challenges: the difficulty of text-speech alignment modeling and the high computational overhead of the iterative denoising process. To address these limitations, we propose ARCHI-TTS, which features a dedicated semantic aligner to ensure robust temporal and semantic consistency between text and audio. To overcome high inference costs, ARCHI-TTS employs an efficient inference strategy that reuses encoder features across denoising steps, drastically accelerating synthesis without performance degradation. An auxiliary CTC loss applied to the condition encoder further enhances semantic understanding. Experimental results demonstrate that ARCHI-TTS achieves a WER of 1.98% on LibriSpeech-PC test-clean, and 1.47%/1.42% on SeedTTS test-en/test-zh, with high inference efficiency, consistently outperforming recent state-of-the-art TTS systems.
💡 Research Summary
ARCHI‑TTS introduces a novel non‑autoregressive (NAR) text‑to‑speech architecture that simultaneously tackles two long‑standing challenges in diffusion‑based TTS: (1) reliable text‑speech alignment and (2) the heavy computational burden of iterative denoising. The system is built around three main components: a self‑supervised semantic aligner, a compressed VAE latent representation, and a conditional flow‑matching decoder that reuses encoder outputs across denoising steps.
Semantic Aligner
The aligner receives two streams: (a) tokenized text (character or pinyin) embedded and enriched by ConvNeXt‑V2 blocks, and (b) a “mask” sequence of learnable embeddings replicated to match the target speech length. A start‑of‑sequence token is prepended to each stream, and a Transformer processes the concatenated inputs. This design lets the model learn a flexible mapping from text semantics to a temporal canvas, eliminating the need for explicit duration predictors or rigid padding. The mask embeddings initially encode only duration; the Transformer injects semantic context, producing a sequence of aligned semantic features (z). An auxiliary CTC loss applied to intermediate Transformer layers further enforces consistency between the predicted alignment and the original transcript.
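The "temporal canvas" idea above can be sketched in a few lines. The snippet below is a minimal numpy illustration (toy sizes, random stand-ins for the learned embeddings; `d_model`, `text_len`, and `speech_len` are assumed values, not from the paper): a single mask embedding is replicated to the target speech length, a start-of-sequence token is prepended to each stream, and the two streams are concatenated as Transformer input.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 8      # embedding width (toy size)
text_len = 5     # number of text tokens
speech_len = 12  # target number of speech frames

# (a) embedded text stream (stands in for ConvNeXt-V2-enriched embeddings)
text_emb = rng.normal(size=(text_len, d_model))

# (b) one learnable "mask" embedding replicated to the speech length,
# forming the temporal canvas that initially encodes only duration
mask_emb = rng.normal(size=(1, d_model))
canvas = np.repeat(mask_emb, speech_len, axis=0)

# prepend a start-of-sequence token to each stream, then concatenate
sos = rng.normal(size=(1, d_model))
aligner_input = np.concatenate([sos, text_emb, sos, canvas], axis=0)

# a Transformer over aligner_input would then inject semantics into the
# canvas positions, yielding aligned features z of shape (speech_len, d_model)
```

Because the canvas length is set independently of the text length, no explicit duration predictor or rigid padding scheme is needed; the Transformer is free to distribute text semantics over the frames.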
Compressed Speech Latent
Instead of mel‑spectrograms, ARCHI‑TTS adopts a VAE‑based audio compressor that encodes 24 kHz speech into a continuous latent sequence at 12.5 Hz (≈80 ms per token). The VAE is trained with a KL‑regularization term, following the Stable Audio paradigm, yielding a compact representation that removes the need for a separate neural vocoder.
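The compression ratio follows directly from the two rates; a quick check of the arithmetic:

```python
sample_rate = 24_000  # Hz, input waveform
latent_rate = 12.5    # Hz, VAE latent sequence

samples_per_token = sample_rate / latent_rate  # waveform samples per latent
ms_per_token = 1000 / latent_rate              # milliseconds per latent
```

Each latent token thus covers 1,920 waveform samples, i.e. 80 ms of audio, which is roughly an order of magnitude coarser than typical mel-spectrogram frame rates.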
Conditional Flow‑Matching Decoder
The decoder follows the conditional flow‑matching (CFM) framework: a time‑dependent vector field (v_t(x_t; \theta)) is learned to transform a simple Gaussian prior into the data distribution along the optimal transport (linear interpolation) path. ARCHI‑TTS implements CFM with a Diffusion Transformer (DiT) split into a condition encoder (18 layers) and a velocity decoder (4 layers). Conditioning inputs are: (i) semantic features (z), (ii) a global speaker embedding (s) broadcast to the speech length, and (iii) a masked audio prompt (x_{\text{ref}}) derived from the ground‑truth latent. The condition encoder produces hidden states (h) that are added to the sinusoidal timestep embeddings and injected globally into each DiT block of the velocity decoder.
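Under the optimal-transport (linear interpolation) path, the training target for the velocity network has a closed form. The sketch below (numpy, toy shapes; the network itself is omitted) shows how a training pair is constructed: interpolate between a prior sample and a data latent, and regress toward the constant difference vector.

```python
import numpy as np

rng = np.random.default_rng(0)

x1 = rng.normal(size=(12, 8))  # data latent (speech frames x dims)
x0 = rng.normal(size=(12, 8))  # sample from the Gaussian prior
t = 0.3                        # flow time in [0, 1]

# optimal-transport (linear interpolation) path between prior and data
x_t = (1 - t) * x0 + t * x1

# the analytically known target velocity along this path
v_target = x1 - x0

# the network v_theta(x_t, t; z, s, x_ref) is trained to match v_target
```

Because the target velocity is constant in `t` along each straight path, the regression target is simple and the learned ODE can be integrated with few solver steps at inference.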
Training Objectives
The primary loss is the CFM loss, which penalizes the L2 distance between the predicted velocity and the analytically known velocity of the optimal transport path. Two auxiliary terms are added: a direction loss (L_{\text{DIR}}) (cosine similarity) to keep the flow orientation correct, and the CTC loss (L_{\text{CTC}}) weighted by (\eta=0.1). The total loss is (L = L_{\text{CFM}} + L_{\text{DIR}} + \eta L_{\text{CTC}}).
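The combined objective can be sketched as follows. This is a minimal numpy version under stated assumptions: the CFM term is taken as mean squared error, the direction term as one minus cosine similarity (a common formulation; the paper's exact normalization may differ), and the CTC term is passed in as a precomputed scalar.

```python
import numpy as np

def total_loss(v_pred, v_target, ctc_loss, eta=0.1):
    # CFM loss: L2 distance between predicted and target velocity
    l_cfm = np.mean((v_pred - v_target) ** 2)
    # direction loss: 1 - cosine similarity, keeps the flow orientation correct
    cos = np.sum(v_pred * v_target) / (
        np.linalg.norm(v_pred) * np.linalg.norm(v_target) + 1e-8)
    l_dir = 1.0 - cos
    # total: L = L_CFM + L_DIR + eta * L_CTC
    return l_cfm + l_dir + eta * ctc_loss
```

With a perfect velocity prediction, both the CFM and direction terms vanish and only the weighted CTC term remains.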
Inference Acceleration via Encoder Reuse
A key efficiency breakthrough is the reuse of the condition encoder’s hidden states across successive denoising steps. Since the encoder dominates the computational cost, storing (h) after the first step and feeding it unchanged to later steps eliminates repeated encoder forward passes. This “training‑free” acceleration yields a 4× speed‑up (RTF ≈ 0.21 for 10‑second audio) without any knowledge‑distillation or teacher‑student setup.
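The caching pattern is straightforward to express in code. The sketch below (numpy stand-in for the 18-layer condition encoder; the call counter is only there to make the saving visible) runs the encoder on the first denoising step and reuses its hidden states on every later step.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))
calls = {"n": 0}

def condition_encoder(cond):
    # stand-in for the 18-layer DiT condition encoder (the dominant cost)
    calls["n"] += 1
    return np.tanh(cond @ W)

cond = rng.normal(size=(12, 8))  # semantic features + speaker + prompt
h_cache = None

for step in range(10):           # 10 denoising steps
    if h_cache is None:
        h_cache = condition_encoder(cond)  # run the encoder once
    h = h_cache                            # reuse on every later step
    # ... the 4-layer velocity decoder consumes h plus the timestep embedding ...
```

Since the conditioning inputs do not change across steps, the cached `h` is exact rather than approximate, which is why the acceleration is training-free and lossless.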
Zero‑Shot Synthesis
For zero‑shot TTS, a short reference audio and its transcript are provided. The reference yields a speaker embedding and a masked latent prompt; the target transcript is concatenated with the reference transcript and passed through the semantic aligner to obtain a unified semantic vector (z_{\text{ref·gen}}). The target duration is estimated by scaling the reference token‑per‑frame rate. The ODE defined by the learned velocity field is solved with an Euler solver, optionally guided by Classifier‑Free Guidance (CFG) where the guided velocity is (\tilde v_t = (1+\omega)v_t^{\text{cond}} - \omega v_t^{\text{uncond}}). The resulting latent is decoded by the VAE decoder into waveform.
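The sampling loop above can be sketched as an Euler solver with CFG. The velocity field here is a toy contracting field, not the learned network, and `steps`/`omega` are assumed values; only the update structure mirrors the description.

```python
import numpy as np

rng = np.random.default_rng(0)

def velocity(x, t, cond):
    # stand-in for the learned velocity field v_t(x_t; theta)
    scale = 1.0 if cond else 0.5
    return scale * (-x)           # toy field contracting toward zero

def sample(steps=10, omega=2.0):
    x = rng.normal(size=(12, 8))  # start from the Gaussian prior
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v_cond = velocity(x, t, cond=True)
        v_uncond = velocity(x, t, cond=False)
        # classifier-free guidance: v~ = (1 + w) v_cond - w v_uncond
        v = (1 + omega) * v_cond - omega * v_uncond
        x = x + dt * v            # Euler step along the ODE
    return x

latent = sample()  # the real latent would be decoded to waveform by the VAE
```

Each Euler step needs both a conditional and an unconditional velocity evaluation, which is exactly where the encoder-reuse trick pays off twice per step.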
Experiments
Training used the 100 k‑hour multilingual Emilia corpus on 8 RTX 5090 GPUs for 4 days (≈800 k updates). The model has 289 M parameters and was evaluated on LibriSpeech‑PC test‑clean and the multilingual Seed‑TTS benchmark. Results:
- LibriSpeech‑PC test‑clean – WER 1.98 % (SSIM 0.70), RTF 0.21, outperforming models ranging from 300 M to 1.1 B parameters.
- Seed‑TTS English – WER 1.47 % (SSIM 0.68); Chinese – WER 1.42 % (SSIM 0.70).
- MOS evaluation showed 3.53 NMOS and 3.48 SMOS, comparable to large‑scale industrial systems.
Analysis
The semantic aligner’s mask‑based temporal canvas enables flexible length handling, crucial for low‑token‑rate representations where text tokens are often shorter than speech frames. The CTC auxiliary loss tightens the alignment without requiring an external duration model. Encoder reuse provides a practical, training‑free speed boost that sidesteps the complexity of diffusion distillation methods, which typically need a pre‑trained teacher and extra forward passes.
Limitations & Future Work
While the VAE latent is compact, its reconstruction fidelity still depends on the VAE’s capacity; improving the compressor could further raise audio quality. The current CFG strength and number of shared encoder steps are hyper‑parameters that may need tuning for different languages or speaker styles. Extending the approach to real‑time streaming synthesis and exploring multilingual joint training are promising directions.
In summary, ARCHI‑TTS delivers a high‑quality, fast, and zero‑shot capable TTS system by integrating a self‑supervised semantic aligner, a low‑rate VAE latent, a conditional flow‑matching decoder, and an encoder‑reuse inference scheme, setting a new benchmark for non‑autoregressive speech synthesis.