Scaling Open Discrete Audio Foundation Models with Interleaved Semantic, Acoustic, and Text Tokens

Reading time: 5 minutes

📝 Original Info

  • Title: Scaling Open Discrete Audio Foundation Models with Interleaved Semantic, Acoustic, and Text Tokens
  • ArXiv ID: 2602.16687
  • Date: 2026-02-18
  • Authors: Potsawee Manakul (Stanford University), William Held (Stanford University), Diyi Yang (Stanford University), and additional co-authors (listed in full in the paper)

📝 Abstract

Current audio language models are predominantly text-first, either extending pre-trained text LLM backbones or relying on semantic-only audio tokens, limiting general audio modeling. This paper presents a systematic empirical study of native audio foundation models that apply next-token prediction to audio at scale, jointly modeling semantic content, acoustic details, and text to support both general audio generation and cross-modal capabilities. We provide comprehensive empirical insights for building such models: (1) We systematically investigate design choices -- data sources, text mixture ratios, and token composition -- establishing a validated training recipe. (2) We conduct the first scaling law study for discrete audio models via IsoFLOP analysis on 64 models spanning $3{\times}10^{18}$ to $3{\times}10^{20}$ FLOPs, finding that optimal data grows 1.6$\times$ faster than optimal model size. (3) We apply these lessons to train SODA (Scaling Open Discrete Audio), a suite of models from 135M to 4B parameters on 500B tokens, comparing against our scaling predictions and existing models. SODA serves as a flexible backbone for diverse audio/text tasks -- we demonstrate this by fine-tuning for voice-preserving speech-to-speech translation, using the same unified architecture.

💡 Deep Analysis

📄 Full Content

Building foundation models that can understand and generate audio is a key challenge in multimodal AI. Current approaches have distinct limitations. LLM-centric architectures, such as SALMONN (Tang et al., 2024) or Qwen3-Omni (Qwen Team, 2025b), add audio modules to a pre-trained text LLM; while effective for instruction-following, they have a “semantic bottleneck” that limits general audio-to-audio modeling. Semantic-only speech language models, such as TWIST (Hassid et al., 2023) or SpiritLM (Nguyen et al., 2025), are trained speech-first but discard acoustic details, limiting high-fidelity understanding and generation. Native audio models like Moshi (Défossez et al., 2024) or Llama-Mimi (Sugiura et al., 2025) model acoustic tokens directly but focus on specific tasks without text integration. Meanwhile, next-token prediction has enabled unified models for text and vision-language (Chameleon Team, 2024), yet analogous approaches that jointly model audio understanding and generation in a single backbone remain limited.

Affiliations: 1 Stanford University, 2 SCB 10X, 3 University of Southern California, 4 University of Cambridge, 5 OpenAthena. Correspondence to: Potsawee Manakul, William Held, Diyi Yang.

Project Page: https://soda-audio.github.io/ for model checkpoints, discrete audio data, code, and the experiment log.

To bridge this gap, this paper presents a systematic empirical study of native audio foundation models that jointly model semantic, acoustic, and text tokens within a unified next-token prediction framework, establishing the first training recipes and scaling laws analogous to scaling studies of text LLMs (Kaplan et al., 2020). This design enables a range of tasks within a single model: audio continuation, semantic/acoustic understanding, cross-modal capabilities (e.g., text-to-speech and speech-to-text), and text generation. We adopt utterance-level interleaving of tokens derived from neural codecs, avoiding word-level alignment errors and enabling the use of large datasets with available transcripts. A challenge in training such audio models is the lack of established pre-training understanding: while the Chinchilla study for text LLMs (Hoffmann et al., 2022) established that model size $N$ and training tokens $D$ should scale equally ($N^*, D^* \propto C^{0.5}$), it is unclear whether this holds for audio, where the information density per token can be far lower. We address key questions for pre-training discrete audio models:

• What training data and token design should we use? (§4): We systematically compare speech corpora, text mixture ratios, and token compositions (semantic-only vs. semantic+acoustic vs. semantic+acoustic+text), establishing a validated training recipe.

• How should we allocate compute, and is validation loss a reliable metric? (§5): We show that validation loss is predictive of downstream performance, then derive scaling laws from 64 IsoFLOP models ($3 \times 10^{18}$ to $3 \times 10^{20}$ FLOPs), finding $D^* \propto C^{0.579}$ and $N^* \propto C^{0.367}$ (a generic fitting sketch appears after this list).

• Does scaling up work? (§6): We train SODA (Scaling Open Discrete Audio), a suite of models from 135M to 4B parameters on 500B tokens (up to $1.3 \times 10^{22}$ FLOPs), and validate them against our scaling predictions. We compare cold-start (from scratch) versus warm-start (from text LLMs) training at scale, finding that cold-start is superior and provides higher training stability. We further validate SODA as a flexible backbone by formulating voice-preserving speech-to-speech translation simply as a next-token prediction task and fine-tuning SODA.
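To make the utterance-level interleaving described above concrete, the sketch below assembles one training sequence from a transcript, semantic units, and acoustic codec codes. It is a minimal illustration only: the special tokens, vocabulary sizes, offsets, and the ordering of modality blocks are assumptions for exposition, not the paper's actual format.

```python
# Illustrative sketch of utterance-level interleaving: one next-token-prediction
# sequence built from a transcript, semantic units, and acoustic codec codes.
# All special tokens, vocabulary sizes, and offsets are hypothetical placeholders.
from dataclasses import dataclass
from typing import List

# Hypothetical special-token ids marking sequence and modality boundaries.
BOS, EOS = 0, 1
TEXT_START, SEM_START, ACOU_START = 2, 3, 4

# Hypothetical flat vocabulary layout: text BPEs, then semantic units, then codec codes.
TEXT_VOCAB = 32_000
SEM_VOCAB = 1_024    # e.g. HuBERT-style semantic units
ACOU_VOCAB = 2_048   # e.g. neural-codec acoustic codes
SEM_OFFSET = TEXT_VOCAB
ACOU_OFFSET = TEXT_VOCAB + SEM_VOCAB


@dataclass
class Utterance:
    text_ids: List[int]      # transcript BPE ids in [0, TEXT_VOCAB)
    semantic_ids: List[int]  # semantic unit ids in [0, SEM_VOCAB)
    acoustic_ids: List[int]  # flattened codec code ids in [0, ACOU_VOCAB)


def interleave(utt: Utterance) -> List[int]:
    """Concatenate modality blocks for one utterance into a single token sequence.

    Interleaving at the utterance level only needs an utterance-level transcript,
    so no word-level forced alignment is required.
    """
    seq = [BOS]
    seq += [TEXT_START] + utt.text_ids
    seq += [SEM_START] + [SEM_OFFSET + t for t in utt.semantic_ids]
    seq += [ACOU_START] + [ACOU_OFFSET + t for t in utt.acoustic_ids]
    seq.append(EOS)
    return seq


if __name__ == "__main__":
    utt = Utterance(text_ids=[17, 93, 512],
                    semantic_ids=[5, 5, 88, 240],
                    acoustic_ids=[7, 1033, 12, 999, 406, 3])
    print(interleave(utt))
```

Dropping the text block from such a sequence gives the semantic+acoustic composition, and dropping both the text and acoustic blocks gives the semantic-only composition that §4 compares against the full semantic+acoustic+text mixture.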
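For the compute-allocation question, the generic IsoFLOP-style fit below shows the shape of the analysis: for each compute budget $C$, fit a quadratic in log model size to the validation losses, take its minimum as $N^*$, infer $D^*$, and regress the optima against $C$ in log-log space. The loss values are made-up placeholders, and the $C \approx 6ND$ approximation is a standard assumption rather than necessarily the paper's exact procedure.

```python
# Generic IsoFLOP-style scaling-law fit (illustrative placeholders, not the
# paper's data): for each compute budget C, fit a quadratic in log10(N) to the
# validation losses, take the vertex as the compute-optimal model size N*,
# infer D* from the standard C ~= 6*N*D approximation, then fit power laws
# N* ~ C^a and D* ~ C^b by linear regression in log-log space.
import numpy as np

# Placeholder IsoFLOP grid: {compute budget C: [(model params N, val loss), ...]}
isoflop_runs = {
    3e18: [(5e7, 3.10), (1e8, 2.95), (2e8, 2.92), (4e8, 2.98)],
    3e19: [(1e8, 2.80), (2e8, 2.66), (4e8, 2.62), (8e8, 2.67)],
    3e20: [(2e8, 2.55), (4e8, 2.41), (8e8, 2.36), (1.6e9, 2.40)],
}

budgets, n_opt, d_opt = [], [], []
for C, runs in isoflop_runs.items():
    log_n = np.log10([n for n, _ in runs])
    loss = np.array([l for _, l in runs])
    a2, a1, _ = np.polyfit(log_n, loss, 2)   # parabola in log10(N)
    n_star = 10 ** (-a1 / (2 * a2))          # vertex = loss-minimizing model size
    d_star = C / (6 * n_star)                # C ~= 6*N*D; a paper may instead count
                                             # tokens directly, so the two fitted
                                             # exponents need not sum exactly to one
    budgets.append(C)
    n_opt.append(n_star)
    d_opt.append(d_star)

a, _ = np.polyfit(np.log10(budgets), np.log10(n_opt), 1)
b, _ = np.polyfit(np.log10(budgets), np.log10(d_opt), 1)
print(f"N* ~ C^{a:.3f}, D* ~ C^{b:.3f}")
```

With the paper's reported exponents, $0.579 / 0.367 \approx 1.6$, which matches the abstract's statement that optimal data should grow about 1.6$\times$ faster than optimal model size and departs from the roughly equal $C^{0.5}$ split Chinchilla found for text.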

SODA achieves competitive performance across audio and cross-modal benchmarks, with fine-tuning for S2ST demonstrating its flexibility. We release checkpoints, discrete audio data, experiment log, and code to facilitate future research.

2.1. Audio & Speech Foundation Models LLM-Centric Architectures. Models such as SALMONN (Tang et al., 2024), Llama-Omni (Fang et al., 2025), and Qwen3-Omni (Qwen Team, 2025b) warm-start from a pretrained text LLM and add audio capability via separate encoder/decoder modules. The backbone processes textaligned semantic representations, creating a “semantic bottleneck” where fine-grained acoustic details are compressed or lost. While effective for instruction following, these models cannot natively generate audio and often rely on separate modules like vocoders with fixed speaker embeddings, limiting their utility as end-to-end audio foundation models.

Semantic-Only Models. Approaches like TWIST (Hassid et al., 2023), SpiritLM (Nguyen et al., 2025), VoxtLM (Maiti et al., 2024), SUTLM (Chou et al., 2023), and SIMS (Maimon et al., 2025a) operate on discrete speech tokens but restrict themselves to semantic tokens (e.g., HuBERT units). VoxtLM and SUTLM combine text BPEs with HuBERT tokens and support ASR, TTS, and continuation via control tokens, but still lack acoustic detail. While SpiritLM introduces token interleaving, it focuses on semantic content, discarding acoustic details.

