A Study of Data Selection Strategies for Pre-training Self-Supervised Speech Models
Self-supervised learning (SSL) has transformed speech processing, yet its reliance on massive pre-training datasets remains a bottleneck. While robustness is often attributed to scale and diversity, the role of the data distribution is less understood. We systematically examine how curated subsets of pre-training data influence Automatic Speech Recognition (ASR) performance. Surprisingly, optimizing for acoustic, speaker, or linguistic diversity yields no clear improvements over random sampling. Instead, we find that prioritizing the longest utterances achieves superior ASR results while using only half the original dataset, reducing pre-training time by 24% on a large corpus. These findings suggest that for pre-training speech SSL models, data length is a more critical factor than either data diversity or overall data quantity for performance and efficiency, offering a new perspective for data selection strategies in SSL speech processing.
💡 Research Summary
This paper investigates how the composition of pre‑training data influences the performance of self‑supervised learning (SSL) speech models, with a focus on downstream automatic speech recognition (ASR). While prior work has largely addressed model architecture, loss functions, and evaluation efficiency, the impact of the data distribution itself has received far less attention. To fill this gap, the authors conduct a systematic study using the large‑scale Loquacious corpus (25 000 h of English speech) and evaluate several unsupervised data‑selection strategies.
Two families of selection methods are explored. The first family aims to increase acoustic, speaker, or linguistic diversity. For acoustic diversity, 39‑dimensional MFCC‑based vectors (including first‑ and second‑order derivatives) are computed per utterance. Speaker diversity is captured using 256‑dimensional embeddings from the WESpeaker model, while linguistic diversity relies on 1024‑dimensional semantic embeddings from the SENSE model. Each feature space is clustered with k‑means (k = 150 for the medium split, k = 200 for the large split), and a balanced sample of roughly 50 % of the utterances is drawn by ensuring each cluster contributes at least one example and then filling the remainder uniformly.
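The cluster-balanced draw described above can be sketched as follows, assuming cluster labels have already been assigned by k‑means over one of the feature spaces (MFCC, speaker, or semantic embeddings). The function name and interface are illustrative, not the authors' code:

```python
import numpy as np

def balanced_cluster_sample(labels, target_fraction=0.5, seed=0):
    """Given per-utterance cluster labels, draw a balanced subset:
    every non-empty cluster contributes at least one utterance, and the
    remainder of the ~target_fraction quota is filled uniformly at random.
    Illustrative sketch, not the paper's exact sampler."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    n = len(labels)
    target = int(target_fraction * n)

    # Step 1: one guaranteed representative per non-empty cluster.
    selected = {int(rng.choice(np.flatnonzero(labels == c)))
                for c in np.unique(labels)}

    # Step 2: top up uniformly from the remaining utterances.
    pool = np.setdiff1d(np.arange(n), list(selected))
    extra = rng.choice(pool, size=max(0, target - len(selected)),
                       replace=False)
    selected.update(int(i) for i in extra)
    return np.sort(np.fromiter(selected, dtype=int))
```

The guaranteed per-cluster representative is what distinguishes this from plain uniform sampling: rare acoustic or speaker conditions survive the 50 % cut even if their clusters are tiny.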
The second family is based purely on utterance length. One variant selects the longest 50 % of utterances (“Length”), and another combines speaker clustering with length by picking the longest utterances within each speaker cluster (“Speaker+Len”). All subsets are sized to match the total audio duration of a random 50 % baseline, enabling a fair comparison of performance versus data quantity.
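Sizing subsets by total audio duration rather than utterance count is what makes the comparison fair: a longest-first subset contains far fewer utterances than a random one of equal hours. A minimal sketch of the "Length" criterion, with a hypothetical interface:

```python
import numpy as np

def select_longest_by_duration(durations, budget_seconds):
    """Pick utterances longest-first until the cumulative audio duration
    reaches the budget (e.g. the total duration of a random 50% subset),
    so that subsets are compared at equal audio quantity, not equal counts.
    Illustrative sketch of the paper's 'Length' selection."""
    durations = np.asarray(durations, dtype=float)
    order = np.argsort(durations)[::-1]      # indices, longest first
    cumulative = np.cumsum(durations[order])
    # Smallest prefix of the sorted list whose duration meets the budget.
    n_keep = int(np.searchsorted(cumulative, budget_seconds, side="left")) + 1
    return np.sort(order[:n_keep])
```

The "Speaker+Len" variant would apply the same longest-first rule within each speaker cluster instead of globally.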
All models are pre‑trained with the BEST‑RQ framework, which uses a random‑projection quantizer and a Conformer backbone (12 layers, 640 hidden size, 8 heads, ~100 M parameters). Pre‑training runs for 200 k steps (≈200 h on 8 × NVIDIA A100 GPUs) with dynamic batching that caps each GPU batch at 800 s of audio, yielding roughly 1.77 h of audio per batch. After pre‑training, each model is fine‑tuned on a fixed 250 h labeled ASR dataset using a CTC‑based feed‑forward head and a 1 024‑token BPE vocabulary. Performance is measured by word error rate (WER) on development and test splits, and training efficiency is reported in GPU‑hours and total pre‑training audio used. Statistical significance is assessed via 1 000‑sample bootstrapping (95 % confidence intervals).
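The significance test described above can be sketched as a standard utterance-level bootstrap over per-utterance error and word counts. The function and its inputs are an assumed reconstruction, not the authors' evaluation code:

```python
import numpy as np

def bootstrap_wer_ci(errors, words, n_boot=1000, seed=0):
    """95% bootstrap confidence interval for corpus-level WER.
    errors[i] is the edit-distance error count and words[i] the reference
    word count for utterance i. Resamples utterances with replacement,
    mirroring the paper's 1000-sample bootstrap. Illustrative sketch."""
    rng = np.random.default_rng(seed)
    errors, words = np.asarray(errors), np.asarray(words)
    n = len(errors)
    wers = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)     # resample utterances
        wers.append(errors[idx].sum() / words[idx].sum())
    lo, hi = np.percentile(wers, [2.5, 97.5])
    return float(lo), float(hi)
```

Two systems are then called significantly different when their intervals (or, more strictly, the interval of their paired WER difference) exclude overlap.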
Results (Table 1) show that diversity‑driven sampling (MFCC, Speaker, SENSE) does not outperform a random 50 % baseline. On the medium split (2.5 k h), the random baseline yields 19.39 % test WER, while MFCC, Speaker, and SENSE achieve 19.98 %, 19.72 %, and 20.39 % respectively, none of which is statistically better. On the large split (25 k h), speaker‑based diversity modestly improves over random sampling (17.97 % vs 18.54 % test WER) and even edges past the full‑data baseline (18.08 %), though the margin is small.
In stark contrast, length‑based selection consistently delivers the best performance. Selecting the longest 50 % of utterances reduces test WER to 19.02 % (medium) and 17.77 % (large), both statistically significant improvements over random and full‑data baselines. Adding speaker diversity to the length criterion yields a marginal further gain (18.97 % and 17.42 % respectively). Importantly, because longer utterances lead to fewer samples per batch, the large‑scale length‑based runs complete ~24 % faster than the full‑data baseline despite processing the same total amount of audio (≈12.6 k h).
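The speed-up follows directly from duration-capped dynamic batching: at a fixed per-GPU cap (800 s here), long utterances fill a batch with fewer samples, so the same amount of audio is consumed with less per-sample overhead. A toy greedy batcher (assumed, for illustration only) makes this concrete:

```python
def duration_capped_batches(durations, cap_seconds=800.0):
    """Greedy dynamic batching: accumulate utterances into a batch until
    adding the next one would exceed the per-GPU duration cap.
    Illustrates why longer utterances yield fewer samples per batch."""
    batches, current, total = [], [], 0.0
    for d in durations:
        if current and total + d > cap_seconds:
            batches.append(current)
            current, total = [], 0.0
        current.append(d)
        total += d
    if current:
        batches.append(current)
    return batches
```

For 1600 s of audio, 10 s clips pack 80 samples into each 800 s batch while 40 s clips pack only 20; the batch count (and total audio seen) is identical, but the shorter-clip batches carry 4x as many per-sample costs.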
The authors interpret these findings as evidence that, for SSL speech pre‑training, the sheer quantity of data is less critical than the presence of long, context‑rich utterances. Longer segments likely provide richer temporal dependencies and more varied acoustic patterns, offering stronger learning signals for the model’s representation encoder. Moreover, the lack of benefit from enforced diversity suggests that SSL objectives already capture a wide range of acoustic and linguistic variations when trained on sufficiently large random corpora.
The paper concludes by emphasizing utterance length as a key factor in constructing efficient pre‑training corpora. By halving the data volume while selecting the longest utterances, practitioners can achieve equal or better ASR performance and reduce computational cost. The authors acknowledge limitations: experiments are confined to a single language (English), a single SSL architecture (BEST‑RQ), and one large corpus. Future work will explore cross‑lingual settings, other SSL models (e.g., wav2vec 2.0, HuBERT), and deeper analyses of why longer utterances are beneficial.
Overall, this study provides a practical, data‑centric guideline for the speech community: when assembling massive unlabeled corpora for SSL, prioritize longer recordings over random sampling or handcrafted diversity metrics to achieve both performance gains and resource efficiency.