Kanade: A Simple Disentangled Tokenizer for Spoken Language Modeling
A good language model starts with a good tokenizer. Tokenization is especially important for speech modeling, which must handle continuous signals that mix linguistic and non-linguistic information. A speech tokenizer should extract phonetics and prosody, suppress linguistically irrelevant information such as speaker identity, and enable high-quality synthesis. We present Kanade, a single-layer disentangled speech tokenizer that realizes this ideal. Kanade separates out utterance-level acoustic constants (such as speaker identity) to create a single stream of tokens that captures rich phonetics and prosody. It does so without the auxiliary mechanisms that existing disentangled codecs often rely on. Experiments show that Kanade achieves state-of-the-art speaker disentanglement and lexical availability while maintaining excellent reconstruction quality.
💡 Research Summary
The paper introduces Kanade, a single‑layer, disentangled speech tokenizer that cleanly separates linguistic content (phonetics and prosody) from non‑linguistic factors such as speaker identity and recording conditions. Kanade builds on self‑supervised learning (SSL) features extracted from a frozen WavLM encoder. Deep SSL layers (6 and 9), which encode linguistic information, are fed into a content branch consisting of a transformer encoder with local‑window attention, followed by a strided convolution for temporal down‑sampling. The resulting vectors are quantized with Finite Scalar Quantization (FSQ), a codebook‑free method that yields one discrete token per timestep at 12.5 Hz or 25 Hz.
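To make the quantization step concrete, here is a minimal NumPy sketch of the FSQ idea: each latent dimension is bounded with tanh and rounded to one of a small number of uniform levels, so the "codebook" is implicit (its size is the product of the per-dimension level counts) and no codebook vectors are learned. This is an illustration of the general FSQ technique, not Kanade's actual code; the straight-through gradient used in training is omitted, and odd level counts are assumed so that the rounded values land on integers.

```python
import numpy as np

def fsq_quantize(z, levels):
    """Finite Scalar Quantization (sketch): bound each dimension with tanh,
    then round to one of `levels[d]` uniform levels. The implicit codebook
    size is prod(levels); no learned codebook is needed."""
    levels = np.asarray(levels)
    half = (levels - 1) / 2.0          # e.g. 5 levels -> values in {-2,...,2}
    bounded = np.tanh(z) * half        # squash each dim into [-half, half]
    quantized = np.round(bounded)      # snap to the nearest integer level
    # mixed-radix index: fold the per-dimension levels into one token id
    digits = (quantized + half).astype(int)              # shift to {0,...,L-1}
    radices = np.cumprod(np.concatenate(([1], levels[:-1])))
    token_id = (digits * radices).sum(axis=-1)
    return quantized, token_id

# a zero latent sits at the middle level of each dimension
q, tid = fsq_quantize(np.zeros((1, 3)), [5, 5, 5])
```

With levels `[5, 5, 5]` the implicit codebook has 125 entries, which is why FSQ avoids the codebook-collapse issues of VQ-VAE-style quantizers: every code is reachable by construction.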
Shallow SSL layers (1 and 2), which capture speaker‑related cues, are processed by a ConvNeXt‑based global branch that produces a single continuous embedding for the whole utterance. This embedding is not discretized; instead it conditions the decoder via AdaLN‑Zero, providing a pathway for all non‑linguistic information.
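The AdaLN‑Zero conditioning described above can be sketched as follows. This is a simplified NumPy illustration of the general technique (as popularized by DiT), not Kanade's implementation: the global utterance embedding is projected to a (shift, scale, gate) triple, and because the projection is zero‑initialized, the conditioned branch contributes nothing at the start of training and is gated in gradually.

```python
import numpy as np

def adaln_zero(x, cond, W_mod, b_mod):
    """AdaLN-Zero modulation (sketch): `cond` is the global utterance
    embedding; W_mod/b_mod are zero-initialized so the modulated branch
    starts as an exact no-op (the gate is zero)."""
    # layer-normalize x over the feature dimension (no learned affine)
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    x_norm = (x - mu) / np.sqrt(var + 1e-5)
    # project the conditioning vector to modulation parameters
    mod = cond @ W_mod + b_mod
    shift, scale, gate = np.split(mod, 3, axis=-1)
    # scale/shift inject speaker information; gate controls the branch
    return gate * ((1.0 + scale) * x_norm + shift)
```

In a full decoder block this modulated output would feed a residual connection; the key point is that all non‑linguistic information reaches the decoder only through `cond`, never through the discrete content tokens.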
Training optimizes two reconstruction losses: (1) an SSL‑feature loss that forces the content tokens to retain phonetic detail, and (2) a mel‑spectrogram loss that encourages preservation of prosodic cues. Because the content branch is bitrate‑constrained, it naturally discards information that the global branch can carry, achieving unsupervised disentanglement without adversarial, contrastive, or variance‑learning auxiliary objectives.
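The two-term training objective can be written down compactly. The sketch below is an assumption about the loss forms (L2 on SSL features, L1 on mel‑spectrograms are common choices; the paper summary does not specify them or the weights) and is meant only to show that no adversarial, contrastive, or variance term appears: disentanglement falls out of the bitrate bottleneck alone.

```python
import numpy as np

def kanade_losses(ssl_pred, ssl_target, mel_pred, mel_target,
                  w_ssl=1.0, w_mel=1.0):
    """Two reconstruction objectives (forms and weights are illustrative
    assumptions): an SSL-feature loss anchoring phonetic content, and a
    mel loss preserving prosody. No auxiliary disentanglement terms."""
    ssl_loss = np.mean((ssl_pred - ssl_target) ** 2)   # L2 on SSL features
    mel_loss = np.mean(np.abs(mel_pred - mel_target))  # L1 on mel frames
    return w_ssl * ssl_loss + w_mel * mel_loss
```

Because the content tokens cannot carry everything within their bit budget, minimizing this combined loss pushes utterance-level constants into the unconstrained global embedding, which is the whole disentanglement mechanism.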
Extensive evaluation shows that Kanade attains state‑of‑the‑art speaker disentanglement (voice conversion and speaker discrimination) and lexical availability (downstream ASR and TTS) while maintaining reconstruction quality comparable to multi‑layer neural audio codecs. Remarkably, the model is trained on only 600 hours of speech and has just 120M trainable (unfrozen) parameters.
In summary, Kanade demonstrates that a simple architecture—SSL input, a narrow information bottleneck, and codebook‑free quantization—can deliver a single‑stream token representation that is both linguistically rich and acoustically faithful. This eliminates the need for complex multi‑layer token structures or auxiliary disentanglement mechanisms, paving the way for more efficient and effective spoken language models, voice conversion systems, and high‑quality text‑to‑speech pipelines.