Sylber 2.0: A Universal Syllable Embedding

Notice: This research summary and analysis were generated automatically using AI. For full accuracy, please refer to the original arXiv source.

Scaling spoken language modeling requires speech tokens that are both efficient and universal. Recent work has proposed syllables as promising speech tokens at low temporal resolution, but existing models are constrained to English and fail to capture sufficient acoustic detail. To address this gap, we present Sylber 2.0, a self-supervised framework for coding speech at the syllable level that enables efficient temporal compression and high-fidelity reconstruction. Sylber 2.0 achieves a very low token frequency of around 5 Hz while retaining both linguistic and acoustic detail across multiple languages and expressive styles. Experiments show that it performs on par with previous high-frequency baselines. Furthermore, Sylber 2.0 enables efficient TTS modeling that can generate speech with intelligibility and quality competitive with SOTA models using only 72M parameters. Moreover, the universality of Sylber 2.0 provides more effective features for low-resource ASR than previous speech coding frameworks. In sum, we establish an effective syllable-level abstraction for general spoken language.


💡 Research Summary

Sylber 2.0 introduces a universal, syllable-level speech tokenization framework that dramatically reduces token frequency while preserving both linguistic and acoustic fidelity across a wide range of languages and expressive styles. Building on the original Sylber concept, the authors extend the approach to 102 languages using a multilingual HuBERT (mHuBERT) backbone pretrained on 147 languages. The system compresses speech into non-uniform tokens at roughly 5 Hz (average 4.8 Hz, ranging from 3.2 Hz to 6.4 Hz), the lowest reported token rate for multilingual speech.
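To put the roughly 5 Hz rate in perspective, a quick back-of-the-envelope comparison against a 50 Hz frame-level tokenizer (50 Hz is the typical HuBERT feature rate; using it as the baseline here is an assumption for illustration, not a figure from the paper):

```python
# Token-budget comparison for a 10-second utterance (illustrative sketch).
seconds = 10.0
frame_rate_hz = 50.0    # assumed frame-level baseline (typical HuBERT rate)
sylber_rate_hz = 4.8    # average Sylber 2.0 token rate reported above

frame_tokens = int(seconds * frame_rate_hz)       # tokens at the frame level
sylber_tokens = round(seconds * sylber_rate_hz)   # tokens at the syllable level
compression = frame_tokens / sylber_tokens        # ~10x shorter sequences
```

Shorter sequences translate directly into cheaper attention (quadratic in length) for any downstream Transformer language model.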

The architecture consists of three main components: a content encoder, an acoustic encoder, and a boundary detector. The content encoder is trained through a four‑stage self‑supervised pipeline. Stage 1 employs frame‑wise self‑distillation between a student and an EMA teacher to learn language‑agnostic representations. Stages 2 and 3 refine these representations using self‑segmentation distillation, where an unsupervised segmentation algorithm on the teacher’s features provides target segment‑averaged embeddings. Unlike the original Sylber, silent‑frame masking is removed, allowing low‑energy syllables to be retained. Stage 4 replaces the costly similarity‑based segmentation with a learned boundary detector (three‑layer Transformer + binary head) that predicts syllable boundaries in parallel, dramatically speeding up inference.
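The segment-averaged targets used in the distillation stages above can be sketched as a simple pooling step: given a binary boundary mask (as the learned boundary detector would predict), frame-level features are averaged within each syllable span. This is an illustrative sketch under assumed array layouts, not the paper's implementation; the function name `segment_average` is hypothetical.

```python
import numpy as np

def segment_average(frames: np.ndarray, boundary_mask: np.ndarray) -> np.ndarray:
    """Pool frame-level features into one embedding per predicted syllable.

    frames:        (T, D) frame-level features from the content encoder.
    boundary_mask: (T,) binary array, 1 where a new syllable starts.
    Returns:       (S, D) segment-averaged embeddings, one per syllable.
    """
    # Assign each frame a segment id by cumulatively counting boundaries.
    segment_ids = np.cumsum(boundary_mask)
    # Fold any frames before the first detected boundary into segment 1.
    segment_ids = np.maximum(segment_ids, 1)
    # Average the frames belonging to each segment.
    return np.stack([frames[segment_ids == s].mean(axis=0)
                     for s in range(1, int(segment_ids.max()) + 1)])
```

Because the detector emits the whole mask in one parallel forward pass, this pooling replaces the iterative similarity-based segmentation that dominated inference cost in the original Sylber.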

The content encoder outputs a 64‑dimensional continuous “content” embedding (C) for each syllable, capturing linguistic information. To restore the acoustic details that self‑distillation tends to discard (speaker identity, timbre, emotion), a separate acoustic encoder processes the same syllable boundaries using a CNN followed by six Transformer layers, producing a complementary 64‑dimensional “acoustic” embedding (A). Each token also carries a duration value (d) indicating how many original frames the syllable spans. During decoding, the (d, C, A) triplet is expanded back to the original frame rate and fed to a lightweight Siuzdak vocoder, which synthesizes 24 kHz waveforms with near‑perfect quality.
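The decoding step described above, expanding each (d, C, A) triplet back to the original frame rate, can be sketched as duration-based repetition. The function name and the simple concatenation of C and A are assumptions for illustration; the paper's actual decoder interface may differ.

```python
import numpy as np

def expand_syllable_tokens(durations: np.ndarray,
                           content: np.ndarray,
                           acoustic: np.ndarray) -> np.ndarray:
    """Expand per-syllable (d, C, A) tokens back to frame-level features.

    durations: (S,) integer frame count each syllable spans.
    content:   (S, Dc) linguistic "content" embeddings (C).
    acoustic:  (S, Da) complementary "acoustic" embeddings (A).
    Returns:   (sum(durations), Dc + Da) frame-level features for the vocoder.
    """
    # Pair each syllable's content and acoustic embedding.
    per_syllable = np.concatenate([content, acoustic], axis=1)
    # Repeat each syllable's embedding for the number of frames it spans.
    return np.repeat(per_syllable, durations, axis=0)
```

The resulting frame-rate feature sequence is what the vocoder consumes to synthesize the 24 kHz waveform.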

Extensive experiments demonstrate that Sylber 2.0 matches or exceeds high-frequency baselines (e.g., Mimi at 12.5 Hz, VibeVoice at 7.5 Hz, CLEAR at 7.7 Hz) on metrics such as PESQ, STOI, and MOS, as well as on subjective listening tests. Notably, the system can reconstruct expressive singing voice with minimal degradation, highlighting the effectiveness of the acoustic encoder.

Downstream evaluations show the practical benefits of the low-frequency tokenization. A zero-shot multilingual TTS model built on Sylber 2.0 tokens uses only 72M parameters yet achieves intelligibility and naturalness comparable to state-of-the-art TTS systems that typically require hundreds of millions of parameters. In low-resource ASR scenarios, Sylber 2.0 tokens improve word error rates by 2–3% absolute over prior VQ-VAE tokenizers, while also reducing sequence length, training time, and GPU memory consumption.

Training efficiency is another highlight: the entire pipeline fits on a single 24 GB GPU, and the boundary detector reduces the real-time factor of segmentation by an order of magnitude compared to similarity-based methods. By operating in a continuous embedding space rather than a large discrete codebook, Sylber 2.0 simplifies generation for diffusion- or flow-based models and offers finer control over sampling.

In summary, Sylber 2.0 delivers a universal, syllable-level speech representation that combines ultra-low token frequency, multilingual coverage, and high-fidelity reconstruction. It bridges the gap between efficient speech coding and expressive acoustic detail, opening new possibilities for large-scale spoken language modeling, low-resource language technologies, and resource-constrained speech synthesis.

