T-Mimi: A Transformer-based Mimi Decoder for Real-Time On-Phone TTS
Neural audio codecs provide promising acoustic features for speech synthesis, with streaming codecs such as Mimi delivering high-quality acoustic features for real-time Text-to-Speech (TTS) applications. However, Mimi’s decoder, which employs a hybrid transformer-and-convolution architecture, introduces a significant latency bottleneck on edge devices, because its compute-intensive deconvolution layers are poorly supported by mobile-CPU inference frameworks such as XNNPACK. This paper introduces T-Mimi, a modification of the Mimi codec decoder that replaces its convolutional components with a purely transformer-based decoder, inspired by the TS3-Codec architecture. This change reduces on-device TTS latency from 42.1 ms to just 4.4 ms per frame. Furthermore, we conduct quantization-aware training and derive a crucial finding: the final two transformer layers and the concluding linear layers of the decoder, which are closest to the waveform, are highly sensitive to quantization and must be kept at full precision to maintain audio quality.
💡 Research Summary
The paper addresses a critical bottleneck in on‑phone streaming text‑to‑speech (TTS) systems that rely on the Mimi neural audio codec. While Mimi’s hybrid decoder—eight transformer layers followed by de‑convolution (transpose‑convolution) up‑sampling—delivers high‑fidelity audio, the de‑convolution blocks are computationally expensive on mobile CPUs and are poorly supported by inference frameworks such as XNNPACK. This results in a latency of 42.1 ms per 80 ms audio frame, which is unacceptable for real‑time interaction.
Inspired by the fully transformer‑based TS3‑Codec, the authors propose T‑Mimi, a decoder that replaces all de‑convolution layers with four additional transformer layers (using fixed‑window streaming self‑attention) and two linear layers that perform up‑sampling without overlap‑add. The total parameter count is kept constant, but the computational cost drops to roughly 13 % of the original CNN‑based decoder. The architecture retains the original eight pretrained transformer layers, adding depth (12 layers total) to leverage the pretrained weights and improve representation capacity.
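The overlap-free linear up-sampling can be sketched as follows: each decoder frame is projected independently to a fixed block of waveform samples, so no transposed convolution or overlap-add is needed. The hidden size, the use of a nonlinearity, and the weight initialization here are illustrative assumptions; only the 2048-wide intermediate dimension and the 80 ms frame (1920 samples at Mimi's 24 kHz rate) follow from the text.

```python
import numpy as np

# Sketch of overlap-free linear up-sampling in the spirit of T-Mimi's
# decoder head. Dimensions other than D_HIDDEN and HOP are assumptions.
D_MODEL = 512        # transformer hidden size (assumed)
D_HIDDEN = 2048      # intermediate linear width (paper's default ablation value)
HOP = 1920           # samples per 80 ms frame at 24 kHz audio

rng = np.random.default_rng(0)
W1 = rng.standard_normal((D_MODEL, D_HIDDEN)) * 0.02
W2 = rng.standard_normal((D_HIDDEN, HOP)) * 0.02

def upsample_frames(frames: np.ndarray) -> np.ndarray:
    """Map (T, D_MODEL) decoder states to (T * HOP,) waveform samples."""
    hidden = np.tanh(frames @ W1)   # nonlinearity assumed for this sketch
    blocks = hidden @ W2            # (T, HOP): one sample block per frame
    return blocks.reshape(-1)       # concatenate blocks; no overlap-add

wave = upsample_frames(rng.standard_normal((4, D_MODEL)))
print(wave.shape)  # (7680,) = 4 frames x 1920 samples
```

Because every frame maps to a disjoint block of samples, this head streams naturally: each new 80 ms chunk can be emitted as soon as its frame is decoded.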
Training proceeds in two stages. Stage 1 uses a composite loss: a multi‑scale mel‑spectrogram L1 loss (weight 2.0), a least‑squares GAN loss and a feature‑matching loss (each weight 4.0), and an auxiliary L1 term (weight 0.1). Stage 2 fine‑tunes the decoder with only the feature‑matching loss to boost perceptual quality. To suppress spurious noise in silent regions, the authors augment 10 % of training samples with leading and trailing silence, forcing the model to learn a robust silence representation.
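The Stage-1 objective is a straightforward weighted sum; a minimal sketch follows, where the individual loss terms are placeholder scalars and only the weights (2.0 mel, 4.0 GAN, 4.0 feature-matching, 0.1 auxiliary L1) come from the recipe above.

```python
# Stage-1 composite loss weights from the training recipe; the loss
# terms themselves are dummy scalars in this sketch.
STAGE1_WEIGHTS = {"mel_l1": 2.0, "gan": 4.0, "feat_match": 4.0, "aux_l1": 0.1}

def stage1_loss(terms: dict) -> float:
    """Weighted sum of loss terms; `terms` maps name -> scalar loss value."""
    return sum(STAGE1_WEIGHTS[name] * value for name, value in terms.items())

# Example with dummy scalar losses:
total = stage1_loss({"mel_l1": 0.5, "gan": 0.25, "feat_match": 0.1, "aux_l1": 1.0})
print(total)  # 2.0*0.5 + 4.0*0.25 + 4.0*0.1 + 0.1*1.0 = 2.5
```

Stage 2 then corresponds to keeping only the `feat_match` term.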
Quantization‑aware training (QAT) is applied to reduce model size and further improve latency. Experiments with 4‑bit group‑wise and 8‑bit per‑channel weight quantization, together with 8‑bit dynamic activation quantization, reveal that layers closest to waveform synthesis are extremely sensitive to reduced precision. By preserving the final two transformer layers and the two linear layers in full‑precision (FP32) while quantizing the rest to 8‑bit, the model achieves a storage reduction from 163.2 MB to 68.7 MB (≈58 % decrease) with only a minor PESQ drop (3.21 → 3.16). The mixed‑precision configuration (layers 1‑10 at 8‑bit, layers 11‑12 + linear layers at 32‑bit) offers the best trade‑off between size, latency, and audio quality.
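The best mixed-precision layout can be expressed as a simple per-layer bit-width rule; the layer-naming scheme below is a hypothetical convention for illustration, but the split (layers 1–10 at 8-bit, layers 11–12 and the output linears at FP32) follows the finding above.

```python
# Sketch of the mixed-precision layout the paper found best: transformer
# layers 1-10 quantized to 8-bit, while layers 11-12 and the two final
# linear layers (closest to the waveform) stay at full FP32 precision.
NUM_LAYERS = 12

def weight_bits(layer: str) -> int:
    """Return the weight bit-width for a named decoder layer (naming assumed)."""
    if layer.startswith("transformer."):
        idx = int(layer.split(".")[1])   # 1-based layer index
        return 8 if idx <= 10 else 32    # last two transformer layers kept FP32
    if layer.startswith("linear."):
        return 32                        # waveform-side linear layers kept FP32
    return 8

layers = [f"transformer.{i}" for i in range(1, NUM_LAYERS + 1)] + ["linear.1", "linear.2"]
print({name: weight_bits(name) for name in layers})
```

Activations would additionally use 8-bit dynamic quantization on the quantized layers, per the experiments described above.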
Real‑device benchmarking on a Samsung Galaxy S22 shows that T‑Mimi processes an 80 ms audio chunk in an average of 4.4 ms, a 9.6× speed‑up over the baseline CNN‑Mimi (42.1 ms). Even when the baseline’s convolutional context window is reduced from 5 to 2 (cutting latency to 18 ms), it remains slower than T‑Mimi and suffers quality degradation. Ablation studies confirm that increasing depth from 8 to 12 layers yields substantial gains in PESQ, STOI, and SI‑SDR, while expanding the linear layer dimension from 2048 to 3072 offers marginal improvement at the cost of an extra ~6 MB. Adding more layers (16) provides diminishing returns, solidifying the 12‑layer, 2048‑dimensional configuration as the sweet spot for on‑device constraints.
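The reported numbers are easy to sanity-check: dividing the baseline latency by T-Mimi's gives the quoted speed-up, and dividing per-chunk latency by the 80 ms chunk duration gives the real-time factor.

```python
# Sanity check of the reported benchmark figures (Samsung Galaxy S22).
CHUNK_MS = 80.0
baseline_ms, tmimi_ms = 42.1, 4.4

speedup = baseline_ms / tmimi_ms    # ~9.57, reported as 9.6x
rtf_tmimi = tmimi_ms / CHUNK_MS     # fraction of real time spent decoding
print(round(speedup, 1), round(rtf_tmimi, 3))  # 9.6 0.055
```

An RTF of about 0.055 means decoding consumes only ~5.5 % of each chunk's duration, leaving ample headroom for the rest of the TTS pipeline.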
In conclusion, T‑Mimi demonstrates that a fully transformer‑based decoder can replace de‑convolution up‑sampling without sacrificing audio fidelity, delivering real‑time TTS performance on mobile hardware. The study also uncovers a generalizable quantization rule: layers nearest to the waveform output must remain in high precision to avoid perceptual degradation. These insights are applicable to other neural audio codecs seeking efficient on‑device deployment.