Continuous Audio Language Models
Audio Language Models (ALM) have emerged as the dominant paradigm for speech and music generation by representing audio as sequences of discrete tokens. Yet, unlike text tokens, which are invertible, audio tokens are extracted from lossy codecs with a limited bitrate. As a consequence, increasing audio quality requires generating more tokens, which imposes a trade-off between fidelity and computational cost. We address this issue by studying Continuous Audio Language Models (CALM). These models instantiate a large Transformer backbone that produces a contextual embedding at every timestep. This sequential information then conditions an MLP that generates the next continuous frame of an audio VAE through consistency modeling. By avoiding lossy compression, CALM achieves higher quality at lower computational cost than their discrete counterparts. Experiments on speech and music demonstrate improved efficiency and fidelity over state-of-the-art discrete audio language models, facilitating lightweight, high-quality audio generation. Samples are available at hf.co/spaces/kyutai/calm-samples. Finally, we release Pocket TTS, an open-source 100M-parameter text-to-speech model that can run faster than real time on a laptop CPU: github.com/kyutai-labs/pocket-tts.
💡 Research Summary
The paper introduces Continuous Audio Language Models (CALM), a new paradigm for speech and music generation that eliminates the need for lossy discrete tokenization used by current Audio Language Models (ALMs). Traditional ALMs rely on neural audio codecs that quantize audio into a hierarchy of residual vector quantizer (RVQ) tokens. While this makes audio amenable to autoregressive language modeling, higher fidelity requires deeper token hierarchies, which dramatically increases sequence length and transformer attention cost. Consequently, generating high‑quality audio on edge devices becomes prohibitively expensive.
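The cost asymmetry between discrete and continuous representations can be made concrete with some back-of-the-envelope arithmetic. The numbers below (frame rate, codebook counts) are illustrative assumptions, not values from the paper, but the scaling they show is the point: RVQ token count grows linearly with the number of codebooks, while a continuous VAE emits one latent vector per frame regardless of fidelity.

```python
def tokens_per_second(frame_rate_hz: int, num_codebooks: int) -> int:
    """Discrete ALM: each frame contributes one token per RVQ level."""
    return frame_rate_hz * num_codebooks

def latents_per_second(frame_rate_hz: int) -> int:
    """Continuous ALM: one latent vector per frame, independent of fidelity."""
    return frame_rate_hz

# Illustrative numbers: a codec at 75 frames/s.
low_fidelity  = tokens_per_second(75, 4)    # 4 codebooks  -> 300 tokens/s
high_fidelity = tokens_per_second(75, 16)   # 16 codebooks -> 1200 tokens/s
continuous    = latents_per_second(75)      # 75 latent vectors/s at any fidelity
```

Since transformer attention cost grows quadratically with sequence length, the 4× longer token sequence of the high-fidelity codec costs far more than 4× in compute, which is the trade-off CALM sidesteps.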
CALM addresses this by modeling the latent space of a pretrained variational auto‑encoder (VAE) directly. The architecture consists of two main components: (1) a causal transformer backbone that ingests the previously generated continuous latents x₁,…,x_{s‑1} and produces a contextual embedding z_s; (2) a consistency head, implemented as a small MLP, that takes z_s together with a short‑context embedding z_s^{short} derived from a lightweight transformer summarizing the most recent clean latents. The consistency head is trained as a continuous‑time consistency model (Lu & Song, 2025), which learns to map a noisy latent x_t directly to the clean latent x₀ in a single step, effectively collapsing the many diffusion steps of prior work into one.
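The generation loop described above can be sketched in a few lines. Everything here is a toy stand-in: the real backbone is a large causal transformer and the real consistency head is a trained MLP, while the functions below just illustrate the data flow (past latents → contextual embedding → one-step denoising of Gaussian noise into the next latent).

```python
import random

LATENT_DIM = 4  # toy dimension; the real VAE latent is much larger

def backbone(context):
    """Stand-in for the causal transformer: maps past latents x_1..x_{s-1}
    to a contextual embedding z_s (here, just the mean of the context)."""
    if not context:
        return [0.0] * LATENT_DIM
    return [sum(x[i] for x in context) / len(context) for i in range(LATENT_DIM)]

def consistency_head(x_noisy, z_s):
    """Stand-in for the consistency MLP: maps a noisy latent plus the
    contextual embedding to a clean latent estimate in a SINGLE step."""
    return [0.5 * (a + b) for a, b in zip(x_noisy, z_s)]  # toy denoiser

def generate(num_frames):
    latents = []
    for _ in range(num_frames):
        z_s = backbone(latents)
        x_noisy = [random.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
        x_clean = consistency_head(x_noisy, z_s)  # one forward pass, no diffusion loop
        latents.append(x_clean)
    return latents

frames = generate(5)
```

The key property the sketch preserves is that each timestep costs exactly one head evaluation, which is where the paper's 12-20× speed-up over per-step diffusion decoding comes from.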
To make training stable and inference efficient, the authors introduce several innovations:
- Long‑term noise injection – random Gaussian noise is added to the long‑range context during training, forcing the backbone to focus on coarse structure and reducing error accumulation.
- Short‑context transformer – provides fine‑grained local information that the backbone alone cannot capture, improving generation of rapid transients in speech and music.
- Diffusion‑to‑Consistency replacement – swaps the per‑step diffusion head used in MAR‑style continuous autoregressive models for a consistency model, yielding up to 20× speed‑up for music and 12× for speech without quality loss.
- Gaussian temperature heuristic – because consistency models lack an explicit temperature parameter, a heuristic weighting function over time w_ψ(t) is introduced to approximate temperature control, which is crucial for natural‑sounding speech.
- Latent classifier‑free guidance (CFG) – the latent z_s is conditioned with a guidance scale during sampling, enabling text‑to‑audio, text‑to‑music, and speech continuation without an external classifier.
- Latent distillation – the CFG computation is distilled into a smaller student transformer, halving the backbone size at inference time while preserving quality. This leads to the 100‑million‑parameter Pocket TTS model that runs faster than real‑time on a laptop CPU.
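Two of the sampling-time mechanisms above can be sketched concretely. The CFG formula below is the standard guidance interpolation applied at the embedding level, as the summary describes; the temperature sketch is a deliberate scalar simplification (the paper's heuristic is a time-dependent weighting w_ψ(t), whose exact form is not given here), so treat both as illustrative rather than the paper's implementation.

```python
import random

def cfg_embedding(z_cond, z_uncond, gamma):
    """Latent classifier-free guidance: interpolate/extrapolate between the
    unconditional and conditional embeddings, z = z_uncond + gamma * (z_cond - z_uncond).
    gamma = 1 recovers plain conditioning; gamma > 1 strengthens guidance."""
    return [u + gamma * (c - u) for c, u in zip(z_cond, z_uncond)]

def sample_initial_noise(dim, tau):
    """Temperature-like control for a consistency head: scale the Gaussian
    noise it denoises. A scalar stand-in for the paper's w_psi(t) heuristic;
    smaller tau -> more conservative, less varied samples."""
    return [tau * random.gauss(0.0, 1.0) for _ in range(dim)]

z = cfg_embedding([1.0, 2.0], [0.0, 0.0], gamma=2.0)  # -> [2.0, 4.0]
noise = sample_initial_noise(4, tau=0.8)
```

Distilling the CFG step means training a student to emit the guided embedding z directly, so the two backbone passes (conditional and unconditional) collapse into one at inference, which is how Pocket TTS halves the backbone cost.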
The paper evaluates CALM on four tasks: speech continuation, music continuation, text‑to‑speech (TTS), and text‑to‑music. Datasets include LibriSpeech, VCTK, MAESTRO, and a multi‑instrument music collection. Objective metrics (PESQ, STOI, Si‑SDR, FAD) show that CALM consistently outperforms state‑of‑the‑art discrete models such as AudioLM, MusicGen, and RQ‑Transformer, achieving 1.2–1.5 dB higher quality at comparable or lower FLOPs. Subjective listening tests confirm the objective gains, especially in preserving fine‑grained timbre and rapid dynamics. In terms of efficiency, CALM’s single‑step consistency sampling reduces latency from hundreds of diffusion steps to a single forward pass, delivering 12× speed‑up for speech and up to 20× for music.
Ablation studies isolate each contribution: removing long‑term noise degrades long‑range coherence; omitting the short‑context transformer harms transient fidelity; disabling CFG lowers conditional generation quality; and skipping distillation increases inference cost without improving output. The authors also discuss limitations: the VAE's reconstruction error sets an upper bound on achievable fidelity, transformer attention cost grows quadratically with sequence length, and the temperature heuristic lacks a formal theoretical foundation.
Future directions suggested include linear‑complexity attention mechanisms, multi‑scale VAE hierarchies, and principled temperature control for consistency models.
In summary, CALM demonstrates that continuous latent autoregressive modeling, combined with modern consistency training tricks, can surpass discrete token‑based audio language models in both quality and efficiency. The release of Pocket TTS showcases the practical impact, enabling real‑time, high‑fidelity speech synthesis on commodity hardware. This work opens a promising path toward scalable, low‑cost generative audio systems.