Beyond Fixed Frames: Dynamic Character-Aligned Speech Tokenization

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original ArXiv source.

Neural audio codecs are at the core of modern conversational speech technologies, converting continuous speech into sequences of discrete tokens that can be processed by LLMs. However, existing codecs typically operate at fixed frame rates, allocating tokens uniformly in time and producing unnecessarily long sequences. In this work, we introduce DyCAST, a Dynamic Character-Aligned Speech Tokenizer that enables variable-frame-rate tokenization through soft character-level alignment and explicit duration modeling. DyCAST learns to associate tokens with character-level linguistic units during training and supports alignment-free inference with direct control over token durations at decoding time. To improve speech resynthesis quality at low frame rates, we further introduce a retrieval-augmented decoding mechanism that enhances reconstruction fidelity without increasing bitrate. Experiments show that DyCAST achieves competitive speech resynthesis quality and downstream performance while using significantly fewer tokens than fixed-frame-rate codecs. Code and checkpoints will be released publicly at https://github.com/lucadellalib/dycast.


💡 Research Summary

The paper introduces DyCAST (Dynamic Character‑Aligned Speech Tokenizer), a novel speech tokenization framework that departs from the conventional fixed‑frame‑rate neural audio codecs. Traditional codecs discretize continuous speech at a constant temporal resolution, which leads to unnecessarily long token sequences and poor alignment with textual units. DyCAST addresses these issues by (1) learning a soft alignment between tokens and character‑level linguistic units, (2) explicitly modeling token durations, and (3) augmenting low‑frame‑rate reconstruction with a retrieval‑based decoder.

The system starts with a frozen self‑supervised speech encoder (e.g., wav2vec 2.0) that extracts high‑dimensional frame‑level features at a fixed rate. A lightweight compressor reduces dimensionality, after which the “chunker” groups consecutive frames into variable‑length chunks. Chunk boundaries are predicted by a hazard‑model‑based boundary predictor, which outputs a per‑frame probability hₜ that a boundary occurs at frame t, given that none has occurred since the previous boundary. The hazard formulation models the time‑to‑next‑boundary distribution, enabling temporally dependent and properly normalized predictions. Ground‑truth character boundaries, obtained from a frozen CTC‑based ASR aligner, supervise the hazard model during training. At inference, boundaries are decoded either greedily or via sequential sampling, subject to user‑defined minimum/maximum chunk lengths and a threshold τₕ that trades off frame rate against token granularity. Within each chunk, the last frame is selected (down‑sampling) to form a chunk‑level representation, which is then quantized into a discrete token.
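The greedy decoding regime described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the exact tie between the threshold τₕ and the length constraints, and the handling of a trailing partial chunk are all assumptions.

```python
def decode_boundaries(hazard, tau=0.5, min_len=2, max_len=8):
    """Greedy chunk-boundary decoding from per-frame hazard probabilities.

    hazard[t] is the predicted probability that a boundary occurs at
    frame t, given that none has occurred since the previous boundary.
    A boundary is emitted when hazard[t] >= tau, subject to minimum and
    maximum chunk-length constraints (a sketch of the paper's greedy
    regime; the sampling regime would draw from hazard[t] instead).
    """
    boundaries = []
    chunk_len = 0
    for t, h in enumerate(hazard):
        chunk_len += 1
        if chunk_len < min_len:
            continue  # chunk too short to close yet
        if h >= tau or chunk_len >= max_len:
            boundaries.append(t)
            chunk_len = 0
    if chunk_len > 0:
        boundaries.append(len(hazard) - 1)  # close a trailing partial chunk
    return boundaries

hazard = [0.1, 0.2, 0.9, 0.1, 0.3, 0.8, 0.2, 0.1, 0.1, 0.95]
print(decode_boundaries(hazard, tau=0.5, min_len=2, max_len=4))  # → [2, 5, 9]
```

Raising `tau` (or `min_len`) closes chunks less often, lowering the frame rate at the cost of coarser tokens, which is exactly the trade-off the τₕ knob exposes.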

During decoding, a duration predictor restores the temporal structure that was discarded when only the token sequence is transmitted. The predictor uses a negative‑binomial distribution to model token durations, a choice motivated by the over‑dispersed nature of speech segment lengths. For each token i, a free mean μ_free,i = softplus(g_ϕ(c₁:ₙ)_i) is computed, a minimum duration d_min (default = 1) is added, and the excess duration y_i = d_i − d_min follows NB(μ_free,i, α) with a global dispersion α. The loss combines the negative log‑likelihood of the NB distribution with a normalized length regularizer that encourages the sum of predicted durations to match the total number of frames. This model enables two decoding regimes: (a) free decoding, where durations are taken as d̂_i = d_min + round(μ_free,i), and (b) budget‑constrained decoding, where the predicted means are renormalized to fit a known target length (e.g., when resynthesizing a specific utterance).
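The two decoding regimes can be sketched numerically. This is a simplified illustration under stated assumptions: `logits` stands in for the output of g_ϕ(c₁:ₙ), plain rounding is used (so a budget may be missed by a frame or two), and the paper's exact renormalization may differ.

```python
import math

def softplus(x):
    """softplus(x) = log(1 + exp(x)), as used for the free mean."""
    return math.log1p(math.exp(x))

def free_durations(logits, d_min=1):
    """Free decoding: d_i = d_min + round(mu_free_i),
    with mu_free_i = softplus(logit_i)."""
    return [d_min + round(softplus(z)) for z in logits]

def budget_durations(logits, total_frames, d_min=1):
    """Budget-constrained decoding: rescale the predicted excess means so
    the durations sum (approximately) to a known target frame budget."""
    mus = [softplus(z) for z in logits]
    excess_budget = max(total_frames - d_min * len(mus), 0)
    scale = excess_budget / sum(mus) if sum(mus) > 0 else 0.0
    return [d_min + round(mu * scale) for mu in mus]

print(free_durations([0.0, 1.5]))            # per-token durations, no budget
print(budget_durations([0.0, 0.0, 0.0], 9))  # durations rescaled to 9 frames
```

Only the mean of the negative-binomial enters at decoding time; the dispersion α matters for the training likelihood, where it lets the model fit the over-dispersed spread of segment lengths.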

Because the character‑aligned token stream yields a very low frame rate (6–18 Hz), reconstructing high‑fidelity waveforms is challenging. To mitigate this, the authors propose Retrieval‑Augmented Decoding (RAD). At inference, each discrete token’s latent vector is refined by a similarity search against a pre‑computed pool of continuous latents. The nearest neighbor is substituted before waveform synthesis, supplying missing fine‑grained acoustic details without increasing bitrate.
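The retrieval step amounts to a nearest-neighbor substitution on the decoder side. A minimal sketch, assuming cosine similarity over a small in-memory pool; the paper's similarity metric, pool construction, and indexing strategy are not specified here and may differ.

```python
import numpy as np

def retrieval_augmented_refine(token_latents, pool):
    """Replace each decoded token latent with its nearest neighbor
    (cosine similarity) from a pre-computed pool of continuous latents,
    supplying fine-grained detail before waveform synthesis. Bitrate is
    unchanged: the lookup happens entirely on the decoder side.

    token_latents: (n_tokens, dim) array of dequantized latents.
    pool:          (pool_size, dim) array of continuous latents.
    """
    p = pool / np.linalg.norm(pool, axis=1, keepdims=True)
    q = token_latents / np.linalg.norm(token_latents, axis=1, keepdims=True)
    nearest = np.argmax(q @ p.T, axis=1)  # index of best pool match per token
    return pool[nearest]

pool = np.array([[1.0, 0.0], [0.0, 1.0]])
latents = np.array([[0.9, 0.1], [0.1, 2.0]])
print(retrieval_augmented_refine(latents, pool))
```

In practice the pool would be large, so an approximate nearest-neighbor index (e.g., FAISS) would replace the brute-force matrix product; the substitution logic stays the same.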

Experiments are conducted on LibriSpeech‑clean‑100, VCTK, and internal conversational datasets. Metrics include mean opinion score (MOS), PESQ, and STOI for reconstruction quality, word error rate (WER) for downstream ASR, and token count for efficiency. DyCAST achieves comparable or slightly better MOS (≈ 4.2) than state‑of‑the‑art fixed‑frame codecs while reducing token count by over 30 % at the same bitrate (~6 kbps). WER improves from 7.2 % to 6.8 %, and PESQ improves by ≈ 0.3 when RAD is applied. The τₕ parameter allows smooth control of the trade‑off between frame rate and quality, enabling applications ranging from ultra‑low‑bitrate streaming to high‑quality TTS.

Key contributions are:

  1. A soft, character‑level alignment mechanism that grounds token boundaries in linguistic units.
  2. A hazard‑based boundary predictor that yields stable, variable‑length chunking without external text at inference.
  3. A negative‑binomial duration model that captures the over‑dispersed distribution of speech segment lengths, providing explicit duration control.
  4. Retrieval‑augmented decoding that restores high‑frequency and speaker‑specific details at low frame rates without extra bitrate.

Overall, DyCAST presents a comprehensive solution for efficient, linguistically grounded speech tokenization, facilitating more compact inputs for large language models and multimodal systems while preserving reconstruction quality and offering flexible duration control. Future work includes extending the approach to multilingual settings, real‑time streaming, and optimizing the retrieval pool for lower latency.

