ZeroSyl: Simple Zero-Resource Syllable Tokenization for Spoken Language Modeling

Reading time: 5 minutes

📝 Original Info

  • Title: ZeroSyl: Simple Zero-Resource Syllable Tokenization for Spoken Language Modeling
  • ArXiv ID: 2602.15537
  • Date: 2026-02-17
  • Authors: Nicol Visser (lead author); other co-authors not specified in the paper

📝 Abstract

Pure speech language models aim to learn language directly from raw audio without textual resources. A key challenge is that discrete tokens from self-supervised speech encoders result in excessively long sequences, motivating recent work on syllable-like units. However, methods like Sylber and SyllableLM rely on intricate multi-stage training pipelines. We propose ZeroSyl, a simple training-free method to extract syllable boundaries and embeddings directly from a frozen WavLM model. Using L2 norms of features in WavLM's intermediate layers, ZeroSyl achieves competitive syllable segmentation performance. The resulting segments are mean-pooled, discretized using K-means, and used to train a language model. ZeroSyl outperforms prior syllabic tokenizers across lexical, syntactic, and narrative benchmarks. Scaling experiments show that while finer-grained units are beneficial for lexical tasks, our discovered syllabic units exhibit better scaling behavior for syntactic modeling.

💡 Deep Analysis

📄 Full Content

Advances in self-supervised learning (SSL) [1] have made it possible to train language models directly on audio data: instead of using tokenized text, a causal language model is trained on tokens derived from an SSL model [2]. Without relying on any textual resources, this framework would enable natural language processing for languages lacking the massive speech-text datasets required by traditional speech-driven technology. These pure speech language models showed early potential [3], but subsequent progress has slowed.

The decline in progress is most evident in syntactic modeling, where performance has plateaued since 2023. AudioLM [4] remains the state-of-the-art, yet it relies on a resource-intensive framework: a high-capacity language model trained on high-bitrate tokens from w2v-BERT XL [5]. Moreover, recent work indicates that pure speech modeling scales less favorably than text [6], suggesting that simply increasing model size and data volume will not solve the problem. This has resulted in a shift toward discovering more effective speech units in order to close the performance gap with textual models of comparable size [7].

One hypothesis is that the fine granularity of standard speech tokens creates a bottleneck [8]. Because discrete speech tokens are extracted at a much higher rate than text tokens, the resulting sequences are substantially longer, making it difficult to model long-range dependencies. To mitigate this, studies have attempted to utilize larger phone-, syllable-, or word-like units [8][9][10][11]. Syllable-based modeling has proven to be particularly successful [10,11]. However, existing syllabic tokenizers rely on complex multi-stage pipelines where an SSL model must be fine-tuned with a specialized objective.
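To see the scale of this bottleneck, the back-of-the-envelope comparison below contrasts frame-level and syllable-level token rates. The rates are illustrative (SSL tokenizers typically emit around 25-50 tokens per second, while English speech averages roughly 4-5 syllables per second), not figures taken from the paper.

```python
# Rough sequence-length comparison (illustrative rates, not from the paper):
seconds = 60                   # one minute of speech
frame_tokens = seconds * 50    # 50 Hz frame-level tokens  -> 3000
syllable_tokens = seconds * 5  # ~5 syllables per second   -> 300

print(f"frame-level:    {frame_tokens} tokens")
print(f"syllable-level: {syllable_tokens} tokens "
      f"({frame_tokens // syllable_tokens}x shorter)")
```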


Figure 1: ZeroSyl detects syllabic boundaries using prominence-based peak detection on features from layer 13 of a frozen WavLM. It then mean-pools semantic features from layer 22 within the discovered boundaries and clusters these using spherical K-means. A language model is trained on these cluster IDs.

In this paper, we propose an unsupervised approach to syllable tokenization that does not require the complex methods of previous work. Figure 1 shows our simple pipeline. We find that high-quality syllabic boundaries can be extracted by simply applying peak detection to the L2 norms of frozen WavLM [12] features. We then mean-pool and cluster features within these boundaries to generate a discrete vocabulary. A symbolic language model [13] is trained on the resulting sequences. This approach, which we call ZeroSyl, surpasses the syntactic (sBLIMP) performance of comparable syllabic models and achieves superior performance on the Topic StoryCloze (tSC) narrative benchmark. It also outperforms syllabic systems on the lexical sWUGGY benchmark, although in scaling experiments it does not surpass the performance of a system utilizing fine-grained units [7]. Nevertheless, our results demonstrate that high-quality syllable discovery is possible without complex multi-stage training pipelines, paving a simple and effective path for future research in spoken language modeling. Models and code are available at: https://github.com/nicolvisser/ZeroSyl/
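To make the pipeline concrete, here is a minimal sketch of the tokenization steps, assuming WavLM features have already been extracted. The `prominence` value, the toy codebook size, and the choice to treat detected peaks directly as segment boundaries are our illustrative assumptions, not details confirmed by the paper.

```python
import numpy as np
from scipy.signal import find_peaks
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def segment_and_pool(boundary_feats, semantic_feats, prominence=0.1):
    """Segment one utterance and mean-pool features within each segment.

    boundary_feats: (T, D) frames from the boundary layer (layer 13 in the paper)
    semantic_feats: (T, D) frames from the semantic layer (layer 22 in the paper)
    prominence: peak-detection threshold (illustrative, not from the paper)
    """
    # Per-frame L2 norms of the boundary-layer features.
    norms = np.linalg.norm(boundary_feats, axis=-1)

    # Prominence-based peak detection; this sketch treats the detected
    # peaks as segment boundaries.
    peaks, _ = find_peaks(norms, prominence=prominence)
    edges = np.concatenate(([0], peaks, [len(norms)]))

    # Mean-pool the semantic-layer features within each segment.
    return np.stack([semantic_feats[a:b].mean(axis=0)
                     for a, b in zip(edges[:-1], edges[1:]) if b > a])

# Dummy stand-ins for frozen-WavLM features; in practice these come from
# layers 13 and 22 of WavLM run over real audio.
rng = np.random.default_rng(0)
utterances = [(rng.standard_normal((500, 1024)), rng.standard_normal((500, 1024)))
              for _ in range(8)]

# Spherical K-means approximated by L2-normalising the pooled embeddings,
# so Euclidean K-means behaves like clustering by cosine similarity.
pooled = normalize(np.vstack([segment_and_pool(b, s) for b, s in utterances]))
kmeans = KMeans(n_clusters=50, n_init=10).fit(pooled)  # toy codebook size

# Token IDs for the first utterance, ready for language-model training.
ids = kmeans.predict(normalize(segment_and_pool(*utterances[0])))
print(ids[:20])
```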

This work focuses on pure speech language models, where text is not used or incorporated at all [14]. Standard benchmarks have been developed that assess how well these models capture lexical, syntactic, and semantic properties [15,16].

To date, AudioLM [4] achieves state-of-the-art performance in syntactic modeling. However, since it uses a 25 Hz tokenizer that produces long sequences, a large language model (300M parameters) is subsequently required to capture long-range dependencies. SpidR [7] is a recent SSL-based approach that is more lightweight. It uses online clustering and a novel training objective to improve the SSL representations and resulting speech units. While SpidR achieves the best lexical modeling scores, its units remain close to the frame level. As with AudioLM, the sequence lengths are therefore very long, and SpidR has not surpassed AudioLM's syntactic scores.

Instead of directly discretizing SSL features, two noteworthy approaches have turned to syllable units to deal with the long sequence lengths. Sylber [10] utilizes a sentence-level distillation objective [17] to fine-tune HuBERT [18]. After distillation, syllable boundaries are accessible using cosine similarity between the framewise features. SyllableLM [11] identifies boundaries by analyzing the HuBERT loss objective. The idea is that fully masking a syllable results in a distinct disruption of the loss compared to partial masking. Syllable boundaries are predicted by monitoring the loss as a masked interval sweeps across the utterance. To reduce computational load at inference time, both approaches distill the boundary information: they train an encoder to predict piecewise-constant targets, derived by mean-pooling the teacher model's embeddings within the discovered segments.
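As a concrete illustration of those piecewise-constant targets, the sketch below mean-pools framewise teacher embeddings within given segment boundaries and broadcasts each mean back over the segment's frames. The function name and boundary format are our own assumptions, not taken from either paper.

```python
import numpy as np

def piecewise_constant_targets(teacher_feats, boundaries):
    """Build frame-level distillation targets from segment boundaries.

    teacher_feats: (T, D) framewise teacher embeddings
    boundaries: sorted segment start indices, e.g. [0, 12, 31, ...]
    Returns a (T, D) array where all frames in a segment share that
    segment's mean-pooled teacher embedding.
    """
    targets = np.empty_like(teacher_feats)
    edges = list(boundaries) + [len(teacher_feats)]
    for a, b in zip(edges[:-1], edges[1:]):
        targets[a:b] = teacher_feats[a:b].mean(axis=0)  # broadcast segment mean
    return targets

# Toy example: 10 frames, 3 segments.
feats = np.arange(10, dtype=float)[:, None]  # (10, 1) fake embeddings
print(piecewise_constant_targets(feats, [0, 4, 7]).ravel())
# -> [1.5 1.5 1.5 1.5 5.  5.  5.  8.  8.  8. ]
```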
