
๐Ÿ“ Original Info

  • Title:
  • ArXiv ID: 2512.21204
  • Date:
  • Authors: Unknown

๐Ÿ“ Abstract

Human infants, with only a few hundred hours of speech exposure, acquire the basic units of new languages, highlighting a striking efficiency gap compared to data-hungry self-supervised speech models. To address this gap, this paper introduces SpidR-Adapt, a model for rapid adaptation to new languages using minimal unlabeled data. We cast such low-resource speech representation learning as a meta-learning problem and construct a multi-task adaptive pretraining (MAdaPT) protocol which formulates the adaptation process as a bi-level optimization framework. To enable scalable meta-training under this framework, we propose a novel heuristic solution, first-order bi-level optimization (FOBLO), avoiding heavy computation costs. Finally, we stabilize meta-training by using a robust initialization obtained through interleaved supervision, which alternates self-supervised and supervised objectives. Empirically, SpidR-Adapt achieves rapid gains in phonemic discriminability (ABX) and spoken language modeling (sWUGGY, sBLIMP, tSC), improving over in-domain language models after training on less than 1 hour of target-language audio, over 100× more data-efficient than standard training. These findings highlight a practical, architecture-agnostic path toward biologically inspired, data-efficient representations. We open-source the training code and model checkpoints at https://github.com/facebookresearch/spidr-adapt.

📄 Full Content

Human infants demonstrate a remarkable capacity for language acquisition: at under 6 months of age, they begin distinguishing phonemic contrasts and rapidly internalize the structure of their native language (Werker and Tees, 1984; Kuhl, 2004; Eimas et al., 1971), all from continuous auditory input and with only 100 to 500 hours of speech exposure (Bergelson et al., 2019; Cychosz et al., 2021).

In contrast, current self-supervised learning (SSL) models such as HuBERT (Hsu et al., 2021) and WavLM (Chen et al., 2022) require thousands of hours of training data to learn meaningful linguistic representations, and even then, their learned units are brittle, sensitive to acoustic and contextual variability (Gat et al., 2023; Hallap et al., 2023). When used as the basis for spoken language models (SLMs), these representations lead to limited language modeling performance compared to text-based systems (Hassid et al., 2023; Lakhotia et al., 2021) and fall far short of the learning trajectories of human infants (Bergelson and Swingley, 2012).

A key reason for this discrepancy lies in inductive biases: infants begin with strong predispositions for speech perception, such as sensitivity to phones, rhythmic regularities, and speaker invariance (Werner, 2007; Kuhl, 2004). These biases constrain learning to plausible linguistic structures, enabling rapid generalization from sparse input. By contrast, most machine learning systems are initialized from random weights and rely solely on the statistical regularities of massive datasets. Without built-in inductive priors, they fail to discover the linguistic abstractions of new languages efficiently.

To move toward the inductive efficiency of human learners, we propose a fast-adaptive self-supervised framework for speech representation learning comprising three broad components:

• Multi-task Adaptive Pre-training (MAdaPT), a novel protocol that frames model learning as a bi-level optimization problem. The model is meta-optimized across several data-scarce adaptation episodes, each simulating a "lifetime" of low-resource language learning. Intuitively, this episodic design draws inspiration from evolutionary processes, with a second-order optimization occurring at an outer, population-like level that shapes the model's inductive biases over generations. To further encourage cross-lingual abstraction, we introduce controlled active forgetting between episodes, resetting key model components to simulate the onset of a new "lifetime," thereby promoting robust, transferable representations.

• First-Order Bi-level Optimization (FOBLO), a meta-optimization heuristic that efficiently solves the second-order bi-level problem posed by MAdaPT. It trains the model to learn from unlabeled, under-resourced data in the inner-loop, with the outer-loop calibrating meta-parameters through feedback from a gold-standard labeled set.

• Interleaved supervision, which combines self-supervised training with occasional phoneme-supervised steps, yielding an initialization that imitates human robustness to contextual and acoustic variations of speech while remaining label-efficient.

Together, these mechanisms produce a model that achieves performance comparable to SSL systems trained on 6,000 hours of language data, despite seeing only 10 minutes to 100 hours of data in the target language. We further demonstrate that the resulting fast-adaptive model learns speech representations of an unseen language significantly faster than standard multi-task training.

We build on SpidR (Poli et al., 2025b), a speech SSL model that achieves state-of-the-art (SOTA) performance on phonemic discrimination and SLM metrics with efficient training. Our framework extends SpidR with the above fast-adaptive components, yielding SpidR-Adapt. Although our current implementation of MAdaPT-FOBLO uses SpidR as the backbone and focuses on speech representation, our framework is architecture-agnostic and broadly applicable to self-supervised models.

Our results demonstrate a step toward biologically inspired, data-efficient speech representation learning. Our paper makes three broad contributions: (1) Methodologically, we introduce MAdaPT, a general meta-training protocol that structures training as a series of episodes, each mirroring the low-resource language adaptation scenario.

The approach naturally formulates the adaptation process as a bi-level optimization problem. (2) Technically, we propose FOBLO, a novel heuristic solution to the bi-level optimization challenge formulated by MAdaPT. Additionally, we introduce interleaved supervision as a complementary strategy to build stronger model initializations for meta-training. (3) Empirically, we conduct comprehensive experiments, including comparisons with alternative meta-learning heuristics (Reptile), demonstrating that the combination of MAdaPT and FOBLO consistently achieves superior performance, on par with in-domain language training.

2 Related Works

2.1 Self-supervised learning

Self-supervised learning (SSL) has enabled speech models to learn rich representations from unlabeled audio and now underpins a wide range of downstream applications, including ASR, emotion recognition, and spoken language modeling (SLM). Among these, SLM, where the objective is to capture linguistic structure directly from speech (Lakhotia et al., 2021; Dunbar et al., 2021; Borsos et al., 2022), is particularly relevant for our work, given our motivation to build SSL models that enable human-like acquisition of spoken language. In the context of SLM, recent research has demonstrated that how well the learned units represent linguistic content, especially their phonemic discriminability, directly impacts downstream spoken language performance (Poli et al., 2024; Hallap et al., 2023). Hence, when evaluating the performance of speech SSL models in this work, we employ measures of phonemic discriminability such as ABX (Schatz, 2016), PNMI (Hsu et al., 2021), and phoneme error rate.

Self-supervised models like HuBERT (Hsu et al., 2021) and WavLM (Chen et al., 2022) use masked prediction and clustering to build speech representations, but require extensive training time. SpidR (Poli et al., 2025b) improves on prior SSL models by combining self-distillation and online clustering, achieving SOTA SLM results with more efficient training. This efficiency makes SpidR an ideal backbone for current meta-learning approaches.

Meta-learning aims to optimize models for rapid adaptation to new tasks, often in low-resource settings (Finn et al., 2017; Nichol et al., 2018). This is typically achieved by performing two loops of optimization: in the inner-loop, the model is repeatedly adapted to a new task, and in the outer-loop, its meta-parameters are updated based on how well it adapts to that task. First-order model-agnostic meta-learning (FOMAML) and Reptile (Nichol et al., 2018), in particular, use first-order outer-loop updates, making them computationally attractive heuristics for large-scale meta-learning.
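As a point of reference for the methods discussed below, the following is a minimal sketch of a Reptile-style first-order meta-update. The model, loss callable, and task batches are placeholders, and this is not the implementation used in the paper.

```python
import torch

def reptile_meta_step(model, task_batches, compute_loss, inner_lr=1e-4, meta_lr=0.1):
    """One Reptile-style first-order meta-update (illustrative sketch).

    `task_batches` is an iterable of mini-batches for a single low-resource
    task, and `compute_loss(model, batch)` returns a scalar training loss.
    """
    # Snapshot the meta-parameters phi before adaptation.
    phi = [p.detach().clone() for p in model.parameters()]

    # Inner loop: adapt the model to the task with plain SGD.
    inner_opt = torch.optim.SGD(model.parameters(), lr=inner_lr)
    for batch in task_batches:
        inner_opt.zero_grad()
        compute_loss(model, batch).backward()
        inner_opt.step()

    # Outer update (first-order): pull phi toward the adapted parameters
    # instead of backpropagating through the inner loop.
    with torch.no_grad():
        for p, p0 in zip(model.parameters(), phi):
            p.copy_(p0 + meta_lr * (p - p0))
```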

Meta-learning has demonstrated significant effectiveness in improving out-of-domain (OoD) generalization. Recent studies have introduced risk-aware task selection frameworks that significantly improve adaptability and robustness without sacrificing training efficiency under distribution shifts (Wang et al., 2025; Qu et al., 2025), while others have proposed meta-learning for OoD detection and model selection (Qin et al., 2024). In this paper, we evaluate generalization capability by meta-testing on OoD languages that are not available during meta-training.

Recent work has also explored active forgetting as a complementary mechanism for improving model plasticity (Chen et al., 2023;Aggarwal et al., 2024). By periodically resetting parts of the model, such as embeddings or prediction layers, active forgetting encourages the formation of weights that can be reconfigured for new linguistic domains and prevents overfitting to unstable patterns. Here, we blend traditional meta-learning with active forgetting to amplify the adaptive benefits of both.

Despite their success in few-shot learning, meta-learning methods have seen limited application in speech models, where training typically relies on large, static corpora. Only a few studies explore meta-learning for speech classification or ASR (e.g., Chen et al., 2021; Hsu et al., 2020), and none target self-supervised speech representations. In contrast, we apply meta-learning at the level of SSL itself, with spoken language modeling as the goal.

Here we introduce SpidR-Adapt, a speech representation model tailored for rapid and robust adaptation to new languages with limited unlabeled audio data. First, we build a general multi-task training setup (MAdaPT; Sec. 3.1) that imitates fast adaptation to new languages in low-resource scenarios, incorporating active forgetting to encourage stronger cross-lingual abstraction. This approach frames the adaptation process as a bi-level optimization problem. Then, to efficiently solve this non-trivial bi-level problem, we introduce an empirical solution called first-order bi-level optimization (FOBLO; Sec. 3.2), which avoids the heavy computational cost of second-order gradient steps in the outer-loop. Finally, to stabilize meta-optimization, we propose initializing with a pretrained model and design an interleaved supervised objective (interleaved supervision; Sec. 3.3).

The goal of MAdaPT is to address the OoD generalization challenge: the model is pre-trained on source (seen) linguistic domains with sufficient data and subsequently adapted to target (new) linguistic domains for which only limited unlabeled data is available.

Notation. Let $S$ denote the set of source languages available during training and $T$ the set of unseen target languages encountered during adaptation. For each source language $\ell \in S$, we assume access to a sufficiently large unlabeled corpus $D^{u}_{\ell}$ and, optionally, a small labeled corpus $D^{s}_{\ell}$. In contrast, for each target language in $T$, only a limited unlabeled corpus is available.

Episodic multi-lingual setup. We cast the OoD challenge from seen to new languages as a meta-learning problem. To simulate fast adaptation to target languages with limited speech data, we partition the large unlabeled corpus $D^{u}_{\ell}$ of each source language into multiple smaller data chunks. Thus, one task in this work corresponds to a specific language $\ell$ together with one scarce data chunk of $D^{u}_{\ell}$ as the training set. During meta-training, the model is presented with a mini-batch of task-specific episodes and is optimized in the outer-loop based on the learning performance of the inner-loops. At the meta-test stage, we fine-tune the learned model on data-scarce tasks derived from each target language, evaluating adaptation in low-resource scenarios.

SpidR as backbone speech model. In this work, we deploy the SOTA speech representation model SpidR (Poli et al., 2025b) as our backbone, which has a student-teacher architecture. We describe the components of the speech model as follows:

A convolutional downsampler first encodes the raw audio, and $E_s$, $E_t$ are Transformer encoders for the student and teacher, respectively. The teacher is an exponential moving average (EMA) of the student. $W^{k}$ is the prediction head of the student and $C^{k}$ is the target codebook of the teacher at intermediate layer $k$.
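To make these components concrete, below is a minimal, hypothetical sketch of how a SpidR-like student-teacher backbone could be organized; the class name, target layer indices, dimensions, and codebook size are illustrative assumptions rather than details of the SpidR codebase.

```python
import copy
import torch
import torch.nn as nn

class StudentTeacherBackbone(nn.Module):
    """Illustrative container for a SpidR-like student-teacher model."""

    def __init__(self, downsampler: nn.Module, encoder: nn.Module,
                 target_layers=(6, 7, 8), dim=768, codebook_size=512):
        super().__init__()
        self.downsampler = downsampler          # convolutional feature extractor
        self.student = encoder                  # Transformer encoder E_s
        self.teacher = copy.deepcopy(encoder)   # EMA copy E_t (not trained by SGD)
        for p in self.teacher.parameters():
            p.requires_grad_(False)
        # One prediction head W^k per target intermediate layer k ...
        self.heads = nn.ModuleDict(
            {str(k): nn.Linear(dim, codebook_size) for k in target_layers})
        # ... and one target codebook C^k per layer on the teacher side.
        self.codebooks = nn.ParameterDict(
            {str(k): nn.Parameter(torch.randn(codebook_size, dim))
             for k in target_layers})

    @torch.no_grad()
    def update_teacher(self, decay: float):
        """EMA update: teacher <- decay * teacher + (1 - decay) * student."""
        for pt, ps in zip(self.teacher.parameters(), self.student.parameters()):
            pt.mul_(decay).add_(ps, alpha=1.0 - decay)
```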

Given a language $\ell$ with its low-resource dataset $D^{u}_{\ell}$, we formalize the adaptation process as:

$$\theta^{*}_{\ell} = \arg\min_{\theta}\ \mathcal{L}_{\mathrm{ssl}}\big(\theta;\, D^{u}_{\ell}\big), \tag{1}$$

where $\mathcal{L}_{\mathrm{ssl}}$ denotes a self-supervised loss function, $\theta$ represents all learnable parameters of the speech model SpidR, and $\theta^{*}_{\ell}$ are the optimal model parameters specific to the language $\ell$. We note that $D^{u}_{\ell}$ is not sufficient to train a language-specific speech model from scratch due to severe overfitting (Dupoux, 2018).

Bi-level optimization. To mitigate the model's overfitting to source languages, we propose a generic bi-level optimization framework which aims to learn, from the source languages, meta-parameters that adapt rapidly to target languages. Within this framework, training with pure SSL as in Equation (1) serves as the inner optimization; meanwhile, lightweight labeled data are used to supervise these adaptation processes at the outer level by shaping the meta-parameters. Meta-parameters are shared across concurrent tasks and can be intuitively viewed as inductive biases for speech representation learning.

For clarity, we instantiate the meta-parameters $\phi$ as the initial parameters of the backbone model in Equation (1). Thus, the expected bi-level objective for MAdaPT is:

$$\min_{\phi}\ \mathbb{E}_{\ell \sim S}\Big[\mathcal{L}_{\mathrm{sl}}\big(\theta^{*}_{\ell}(\phi);\, D^{s}_{\ell}\big)\Big] \quad \text{s.t.} \quad \theta^{*}_{\ell}(\phi) = \arg\min_{\theta;\ \theta_{0}=\phi}\ \mathcal{L}_{\mathrm{ssl}}\big(\theta;\, D^{u}_{\ell}\big), \tag{2}$$

where $\mathcal{L}_{\mathrm{ssl}}$ denotes the self-supervised loss function at the inner level, performing adaptation from unlabeled speech data, and $\mathcal{L}_{\mathrm{sl}}$ denotes the supervised loss function at the outer level. In contrast to regular meta-learning frameworks designed for supervised learning (Finn et al., 2017), supervised information here is only used in the outer optimization, while the inner adaptations remain unsupervised. This preserves the assumption of low-resource, unlabeled data usage within the inner-loop, while leveraging supervised information in the outer-loop to resolve the ambiguities of pure self-supervision.

Active forgetting in task adaptation. To suppress unstable and language-specific learning from past episodes, we introduce an active forgetting mechanism. During meta-training, SpidR's prediction heads and codebooks tend to be dominated by phonemic knowledge from the source languages, hindering generalization to new languages.

To this end, we reinitialize these components at the start of each inner loop. Concretely, we copy the student and teacher parameters from the shared meta-parameters $\phi$ but reset all heads and codebooks, yielding the following optimization with initialization $\theta_{\mathrm{AF}}(\phi)$ for each inner loop at both the meta-training and meta-test stages:

$$\theta^{*}_{\ell}(\phi) = \arg\min_{\theta;\ \theta_{0}=\theta_{\mathrm{AF}}(\phi)}\ \mathcal{L}_{\mathrm{ssl}}\big(\theta;\, D^{u}_{\ell}\big). \tag{3}$$

Here, each codebook $C^{k}_{0}$ is sampled from a normal distribution $\mathcal{N}(0, 1)$ and each head $W^{k}_{0}$ is warmed up for 20 steps using the first batch of $D^{u}_{\ell}$.
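A minimal sketch of this reset step is shown below, reusing the hypothetical backbone layout from the earlier sketch. Only the normal-distribution reset and the 20 warmup steps come from the text; the warmup optimizer, learning rate, and loss callable are illustrative assumptions.

```python
import torch

def apply_active_forgetting(model, meta_state, first_batch, ssl_loss_fn,
                            warmup_steps=20, warmup_lr=1e-4):
    """Reset heads and codebooks at the start of an inner loop (sketch).

    `model` follows the hypothetical StudentTeacherBackbone layout above and
    `meta_state` is a state_dict holding the shared meta-parameters phi.
    """
    # Copy the student/teacher encoders (and downsampler) from phi.
    model.load_state_dict(meta_state)

    # Reset codebooks C_0^k ~ N(0, 1) and reinitialize the prediction heads W_0^k.
    with torch.no_grad():
        for codebook in model.codebooks.values():
            codebook.normal_(mean=0.0, std=1.0)
        for head in model.heads.values():
            head.reset_parameters()

    # Briefly warm up the fresh heads on the first batch of the task data.
    warmup_opt = torch.optim.SGD(model.heads.parameters(), lr=warmup_lr)
    for _ in range(warmup_steps):
        warmup_opt.zero_grad()
        ssl_loss_fn(model, first_batch).backward()
        warmup_opt.step()
    return model
```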

Solving the bi-level optimization in Equation (2) is non-trivial because both the inner- and outer-loops require multiple gradient steps. To make meta-training scalable, we introduce a first-order bi-level optimizer that yields a principled first-order approximation to the meta-gradient. In contrast to other first-order approximations (Finn et al., 2017; Nichol et al., 2018), our optimizer is intended for the more challenging case where the inner- and outer-loops use different loss functions. Given a specific language $\ell$, the update of the meta-parameters $\phi$ can be formulated as:

$$\phi \leftarrow \phi - \beta\, \nabla_{\phi}\ \mathcal{L}_{\mathrm{sl}}\big(\theta^{*}_{\ell}(\phi);\, D^{s}_{\ell}\big), \tag{4}$$

where $\beta$ is the learning rate used in the outer-loop to update the meta-parameters $\phi$. Assume that the inner- and outer-loops perform $M$ and $N$ gradient steps, respectively. By applying the chain rule to Equation (4) during backpropagation over the $M$ inner steps, we can reformulate the meta-update as:

$$\phi \leftarrow \phi - \beta \left[\prod_{m=1}^{M}\Big(I - \alpha\, \nabla^{2}_{\theta}\, \mathcal{L}_{\mathrm{ssl}}\big(\theta^{m-1}_{\ell}\big)\Big)\right] \nabla_{\theta}\ \mathcal{L}_{\mathrm{sl}}\big(\theta^{M}_{\ell};\, D^{s}_{\ell}\big), \tag{5}$$

where $\alpha$ is the learning rate of the inner-loop updates and the task-specific parameter $\theta^{m}_{\ell}$ denotes the model's parameters after the $m$-th inner step. To avoid the heavy computational cost of the Jacobian product of second derivatives in Equation (5), we adopt a first-order approximation by dropping the second-order terms (i.e., we stop the gradient through the inner-loop).

The outer-loop typically performs $N$ supervised steps on the labeled speech corpus $D^{s}_{\ell}$. Following Reptile (Nichol et al., 2018), we approximate the outer-loop gradient by the parameter difference between the end of the inner-loop and the end of the outer-loop:

$$\nabla_{\theta}\ \mathcal{L}_{\mathrm{sl}}\big(\theta^{M}_{\ell};\, D^{s}_{\ell}\big) \approx \theta^{M}_{\ell} - \theta^{M+N}_{\ell}, \tag{6}$$

where $\theta^{M}_{\ell}$ is obtained after $M$ self-supervised inner-steps starting from $\theta$ and $\theta^{M+N}_{\ell}$ is obtained by taking an additional $N$ supervised steps from $\theta^{M}_{\ell}$.

Substituting Equation (6) into Equation (5), FOBLO updates the meta-parameters as follows:

$$\phi \leftarrow \phi - \beta\,\big(\theta^{M}_{\ell} - \theta^{M+N}_{\ell}\big) = \phi + \beta\,\big(\theta^{M+N}_{\ell} - \theta^{M}_{\ell}\big). \tag{7}$$

An illustration of our framework is provided in Figure 1. This work provides a principled and practical solution for few-shot self-supervised adaptation by nesting self-supervised inner-loops within supervised outer-loops.
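The sketch below pulls the pieces of one FOBLO meta-step together in code: $M$ self-supervised inner steps, $N$ supervised outer steps, and the first-order update of Equation (7). It is an illustrative rendering under assumptions (plain SGD, placeholder loss callables and data iterators), not the released implementation.

```python
import torch

def foblo_meta_step(model, ssl_batches, sl_batches, ssl_loss_fn, sl_loss_fn,
                    inner_lr=5e-5, meta_lr=0.01):
    """One FOBLO-style meta-update (illustrative sketch).

    Inner loop: M self-supervised steps on unlabeled chunks (`ssl_batches`).
    Outer loop: N supervised steps on labeled data (`sl_batches`).
    Meta-update: move phi along (theta_{M+N} - theta_M), first-order only.
    """
    phi = [p.detach().clone() for p in model.parameters()]
    opt = torch.optim.SGD(model.parameters(), lr=inner_lr)

    # Inner loop: self-supervised adaptation, theta_0 -> theta_M.
    for batch in ssl_batches:
        opt.zero_grad()
        ssl_loss_fn(model, batch).backward()
        opt.step()
    theta_m = [p.detach().clone() for p in model.parameters()]

    # Outer loop: supervised steps, theta_M -> theta_{M+N}.
    for batch in sl_batches:
        opt.zero_grad()
        sl_loss_fn(model, batch).backward()
        opt.step()

    # First-order meta-update: phi <- phi + beta * (theta_{M+N} - theta_M).
    with torch.no_grad():
        for p, p0, pm in zip(model.parameters(), phi, theta_m):
            p.copy_(p0 + meta_lr * (p - pm))
```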

In practice, we find that initializing the meta-parameters from random weights leads to unstable learning dynamics and poor convergence (see Appendix D.2). Thus, to facilitate effective bi-level optimization, it is necessary to perform a dedicated pre-training phase prior to the meta-training stage.

To this end, we introduce an interleaved pre-training objective to obtain the most performative meta-initialization, denoted $\phi_{0}$. During this dedicated pre-training phase, we alternate between self-supervised and supervised objectives in an interleaved manner. This mechanism leverages both unlabeled and labeled data, allowing the model to benefit from large-scale unsupervised corpora while grounding representations with supervised signals. The pre-training objective is defined as:

$$\mathcal{L}_{\mathrm{pre}} = \lambda\, \mathcal{L}_{\mathrm{ssl}} + (1 - \lambda)\, \mathcal{L}_{\mathrm{sl}}, \tag{8}$$

where $\lambda \in \{0, 1\}$ is a binary hyperparameter. Here, $\mathcal{L}_{\mathrm{ssl}}$ denotes the self-supervised loss, applied to the union of the unlabeled corpora from all source languages, $\{D^{u}_{\ell},\ \ell \sim S\}$, while $\mathcal{L}_{\mathrm{sl}}$ is the supervised loss, applied to the union of the labeled corpora $\{D^{s}_{\ell},\ \ell \sim S\}$.

In the current work, we use two distinct meta-initializations: 1) Multi-Task-PT [SSL]: setting $\lambda = 1$ throughout pre-training, yielding standard self-supervision. 2) Multi-Task-PT [SSL/SL]: switching $\lambda$ to 0 periodically, interleaving occasional supervised steps into the self-supervised training regime. The latter provides a superior initialization for meta-training.
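Concretely, with the every-tenth-step schedule reported in Appendix B, the interleaving can be sketched as a simple switch on the step index; the loss callables and batches are placeholders.

```python
def interleaved_pretraining_loss(step, model, ssl_batch, sl_batch,
                                 ssl_loss_fn, sl_loss_fn, period=10):
    """Interleaved objective: lambda = 0 (supervised) every `period` steps,
    lambda = 1 (self-supervised) otherwise."""
    lam = 0 if step % period == 0 else 1
    if lam == 1:
        return ssl_loss_fn(model, ssl_batch)   # standard SSL step
    return sl_loss_fn(model, sl_batch)         # occasional phoneme-supervised step
```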

In the experimental section, we seek to address the following key questions: (1) How data-efficient is SpidR-Adapt in generalizing to the linguistic structure of new languages? (2) Can the MAdaPT framework produce improvements when labeled data is unavailable during pre-training? (3) Can SpidR-Adapt produce improvements in downstream spoken language modeling? (4) Can SpidR-Adapt achieve superior performance compared to existing speech models under the OoD setup?

Datasets. We collect data from 27 languages to evaluate the adaptation capabilities of speech encoders under in-domain (ID) and out-of-domain (OoD) setups. We partition the languages as follows: 19 source languages for training, 5 target languages for development, and 3 target languages for testing. Importantly, there is no overlap between source and target languages. Each source language is supported by a substantial unlabeled corpus (300 hours per language) collected from VoxPopuli (Wang et al., 2021) and a small phoneme-aligned corpus (at most 50 hours per language) collected from the VoxCommunis Corpus (Ahn and Chodroff, 2022) to serve as labels for the FOBLO outer-loop and for interleaved supervision.

Only small-scale unlabeled corpora are available for fast adaptation to the target languages (mimicking infant learning settings). We construct four subsets per target language with durations of 10 minutes, 1 hour, 10 hours, and 100 hours. To quantify the performance gap between ID and OoD training, we additionally collect large-scale in-domain training corpora from VoxPopuli for each test language. Each in-domain corpus comprises 6k hours, comparable in scale to the combined duration of the OoD corpora. The small-scale adaptation sets for these test languages are sampled from the same in-domain training pool; consequently, the OoD models are adapted using subsets of the ID data. These choices enable fair comparisons between ID and OoD models.

Small-scale adaptation corpora were also created for the meta-development languages, sourced from CommonVoice (Ardila et al., 2020) and used for model development. Further details on dataset construction are provided in Appendix A.

Training Setup. We perform multi-task pre-training of SpidR with self-supervised or interleaved-supervised objectives (interleaving supervision every 10 steps; see Sec. 3.3). These models serve as initializations for meta-training, wherein we train across 800 episodes, each consisting of 1,800 inner- and 200 outer-steps. In each inner-loop, the model is trained on a random 10-hour data chunk of a random source language. Training is performed across 16 GPUs in a distributed fashion. Details regarding training can be found in Appendix B.

To evaluate data efficiency, we adapt meta-trained models to new target languages using only limited unlabeled data. We benchmark our approach against baselines using ABX (lower is better), computed using the fastabx toolkit (Poli et al., 2025a). ABX scores quantify how well model embeddings capture phone distinctions and correlate strongly with downstream SLM performance (Poli et al., 2025b), serving as an efficient zero-shot proxy. In the ABX task, embeddings are computed for three triphones: A, B, and X. Here, A and X are instances of the same triphone, while B differs in its central phone (e.g., /bag/ vs. /beg/). The model succeeds if X is closer to A than to B in embedding space. The within-speaker condition uses triphones from the same speaker, while the across-speaker condition uses A and B from one speaker and X from another, making the task more challenging.
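For intuition, here is a heavily simplified ABX check. Mean-pooled embeddings and cosine distance are simplifying assumptions (ABX toolkits such as fastabx typically use frame-wise distances with dynamic time warping), so this is not the evaluation code used in the paper.

```python
import torch
import torch.nn.functional as F

def abx_correct(emb_a, emb_x, emb_b):
    """Simplified ABX check: is X closer to A (same triphone) than to B?

    Each input is a (frames, dim) tensor of embeddings for one triphone token.
    Real ABX pipelines typically use DTW over frame-wise distances; here we
    mean-pool frames and use cosine distance for brevity.
    """
    a, x, b = (e.mean(dim=0) for e in (emb_a, emb_x, emb_b))
    d_ax = 1.0 - F.cosine_similarity(a, x, dim=0)
    d_bx = 1.0 - F.cosine_similarity(b, x, dim=0)
    return bool(d_ax < d_bx)

def abx_error_rate(triples):
    """ABX error (%) over an iterable of (A, X, B) embedding triples."""
    results = [abx_correct(a, x, b) for a, x, b in triples]
    return 100.0 * (1.0 - sum(results) / len(results))
```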

Figure 2 shows the results: the x-axis indicates adaptation data size, and the y-axis shows average ABX scores across our three test languages and across the within- and across-speaker ABX conditions (individual trends are consistent). With SpidR as backbone and under the two meta-initialization strategies of Sec. 3.3 (self- and interleaved-supervised initialization), we compare: 1) In-Domain Mono-Task-PT: standard in-domain pre-training with sufficient data from the target language. Because every small-scale evaluation subset is drawn from the ID training pool, we do not perform additional small-scale adaptation, so In-Domain PT appears as a horizontal line in Figure 2. 2) Multi-Task-PT: standard OoD pre-training with ample unlabeled data from all source languages, using the same data-feeding protocol as In-Domain PT. 3) MAdaPT-FOBLO: our proposed approach that combines MAdaPT with its first-order heuristic solution, FOBLO. Used with SpidR as backbone and interleaved-supervised initialization, this method constitutes our few-shot speech encoder, SpidR-Adapt.

Figure 2 (a) shows that Multi-Task-PT underperforms In-Domain PT, especially when the adaptation budget is small (< 100 hours). This suggests that regular multi-task pre-training lacks the adaptation capacity needed for unseen targets, and simply mixing several source languages during pre-training does not guarantee better generalization.

In contrast, MAdaPT-FOBLO improves rapidly across data scales, indicating high effectiveness for adapting to OoD data. Notably, MAdaPT-FOBLO reaches parity with In-Domain PT after adapting with just 1 hour of unlabeled target-language audio, highlighting a data-efficiency improvement of 100×. Such efficiency is crucial for real-world scenarios where language corpora are scarce.

Finally, Figure 2 (b) indicates that the interleaved-supervised initialization (Multi-Task-PT [SSL/SL]) provides a better starting point (lower initial ABX) than self-supervised initialization (Multi-Task-PT [SSL]). However, regardless of initialization, the incorporation of MAdaPT-FOBLO delivers the largest gains in rapid adaptation to unseen languages. This suggests that while initialization can set a stronger baseline, the adaptation strategy is the primary driver of sustained performance improvements. Full results are provided in Appendix C.1 for detailed comparison.

Here we consider an extreme setting in which no supervised training data is available for source languages. In this regime, MAdaPT must be optimized using a purely self-supervised procedure.

To instantiate MAdaPT without labels, we adopt Reptile (Nichol et al., 2018), a first-order meta-learning heuristic that approximates the meta-gradient assuming identical inner- and outer-loop objectives. Here, the meta-update moves the previous meta-parameters toward the task-specific solution after each episode. In contrast to our FOBLO update in Equation (7), $\theta^{M+N}_{\ell}$ in Reptile denotes the parameters obtained by purely self-supervised training for a total of $M+N$ steps.

In Table 1, we report ABX (in %) averaged over adaptation budgets from 10 minutes to 100 hours and over the three test languages. The results show that MAdaPT-based optimization consistently improves over standard Multi-Task-PT, with FOBLO (which requires supervised labels) achieving the strongest performance. Notably, even when all source languages lack supervision, Reptile, used as a purely self-supervised instantiation of MAdaPT, outperforms the Multi-Task-PT baseline. These findings underscore the importance of a tailored multi-task framework for low-resource OoD adaptation.

We evaluate the SLM performance of SSL models adapted to English on English test sets, using three complementary linguistic metrics. 1) Lexical (sWUGGY) (Nguyen et al., 2020) tests whether the model assigns higher probability to true words than to matched non-words. 2) Syntax (sBLIMP) requires the model to choose the grammatical sentence from minimal pairs. 3) Discourse/Narrative (Spoken Topic StoryCloze) (Mostafazadeh et al., 2017) asks the model to select appropriate continuations for short stories. We report accuracy (in %) averaged across the three metrics in Table 2. Detailed per-task results are included in Appendix C.2.

Table 2 shows that MAdaPT-FOBLO achieves rapid gains under the few-shot adaptation scenario (for both self-and interleaved-supervised initializations). MAdaPT-Reptile comes a close second, with especially strong zero-shot performance.

To further investigate the adaptability of the proposed methods, we compare them with a performant speech SSL model, HuBERT, trained under the OoD Multi-Task-PT setup, using the Phoneme Discovery Benchmark (Poli et al., 2026). The reported metrics include: 1) phone-normalized mutual information (PNMI), the uncertainty about a phone label eliminated by a predicted unit; 2) phoneme error rate (PER), computed after mapping each unit to its most frequently associated phoneme; and 3) ABX (within- and across-speaker conditions). Results are reported in Table 3.

We present SpidR-Adapt, a speech representation model that enables data-efficient adaptation to new languages by combining meta-adaptive pre-training, bi-level optimization, and interleaved supervision. Achieving superior performance with as little as 1 hour of target-language audio, 100× less data than traditional mono- and multi-task methods, SpidR-Adapt demonstrates the effectiveness of a tailored meta-learning framework for flexible representation learning in low-resource settings.

This work offers promising data-efficiency in few-shot speech representation learning, but several limitations remain. Model performance is influenced by the choice of meta-initialization, suggesting that further research is needed into more robust meta-learning methods that can be trained without a meta-initialization. Supervised information from the source languages is still required at the outer level, which limits the scaling of source languages. Additionally, the training of spoken language models has not been included in the meta-learning framework and hence is not data-efficient; future work could focus on applying meta-learning directly to SLM training to enhance efficiency and reduce data requirements.

Table 4 summarizes the datasets used for meta-training, meta-development, and meta-testing. To prepare the unlabeled training data, we apply the Silero Voice Activity Detector to the audio files, segmenting them into smaller audio files ranging from 0.5 to 30 seconds in duration (with a mean of 14.6 seconds). This pre-processing step ensures that the model is exposed to realistic, variable-length speech segments during both training and evaluation. For reproducibility, the start and end timestamp metadata for all processed audio files used in training and evaluation will be made available in the accompanying GitHub codebase.
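As an illustration, the segmentation step could be implemented roughly as follows using the publicly documented Silero VAD entry points; the exact preprocessing script, thresholds, and handling of out-of-range segments are not specified in the paper, so treat this as a sketch.

```python
import torch

# Load the Silero VAD model and helper utilities from torch.hub.
model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
get_speech_timestamps, _, read_audio, _, _ = utils

SR = 16000
MIN_DUR, MAX_DUR = 0.5, 30.0  # segment bounds in seconds (mean ~14.6 s)

def segment_file(path):
    """Return (start, end) times in seconds of detected speech segments in
    `path`, keeping only segments between MIN_DUR and MAX_DUR."""
    wav = read_audio(path, sampling_rate=SR)
    stamps = get_speech_timestamps(wav, model, sampling_rate=SR)
    segments = []
    for ts in stamps:
        start, end = ts['start'] / SR, ts['end'] / SR
        if MIN_DUR <= end - start <= MAX_DUR:
            segments.append((start, end))
    return segments
```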

In addition to the unlabeled dataset, we also use a small supervised dataset, mainly sourced from the VoxCommunis Corpus (Ahn and Chodroff, 2022). This corpus comprises phoneme alignments inferred on CommonVoice (Ardila et al., 2020) data using the Montreal Forced Aligner (MFA; McAuliffe et al., 2017). While CommonVoice has data for 18 of the training languages, it does not contain data for one language, Croatian. To obtain a labeled set in Croatian, we use the transcribed set of VoxPopuli (Wang et al., 2021) and align phonemes using off-the-shelf MFA models. We clean the alignment data by applying filtering and phoneme-mapping measures similar to those employed in Ortiz Tandazo et al. (2025); this includes filtering out alignments with spn segments or with non-silent phones that are excessively long (which indicate alignment errors), fixing diacritics that were wrongly attached to adjacent phones, and replacing some MFA phones with their IPA equivalents ([ě] becomes [g]). The amount of phoneme-aligned data available varied widely by language; to avoid overfitting on any one language, we limit the maximum quantity to 50 hours per language, yielding a labeled dataset of 372 hours in total.

For calculating ABX scores on the test languages in the experiments of Sec. 4.1 and Sec. 4.2, we use phoneme alignments obtained from the test set of the Zero Resource 2017 Challenge (Dunbar et al., 2017). For computing the ABX score on the development languages, we use data from CommonVoice (Ardila et al., 2020) and alignments from VoxCommunis (Ahn and Chodroff, 2022).

Models are trained using a distributed setup across 16 GPUs. Default SpidR hyperparameters (Poli et al., 2025b) are used for pre-training the ID mono-task and the OoD multi-task models. In interleaved supervised pre-training (i.e., Multi-Task-PT [SSL/SL]), every tenth step is backpropagated using the phoneme-supervised loss (hence, in Equation (8), $\lambda = 0$ if step mod 10 = 0, else $\lambda = 1$). For the prediction of supervised labels, language-specific classifier heads (19 heads in total) are attached to the 8th Transformer layer of the SpidR model. The 8th layer was used because hyperparameter exploration indicated it as optimal for few-shot performance on the development languages. During supervised training steps, utterances are batched by language, while during self-supervised training steps, each batch consists of a mix of languages.
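A hypothetical sketch of the language-specific supervised heads is shown below; the layer index (8) and the number of heads (19, one per training language) come from the text, while the module layout, hidden dimension, and frame-level cross-entropy objective are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LanguagePhonemeHeads(nn.Module):
    """Illustrative language-specific phoneme classifiers attached to one
    intermediate encoder layer (layer 8 in the interleaved setup)."""

    def __init__(self, languages, phoneme_vocab_sizes, dim=768, layer=8):
        super().__init__()
        self.layer = layer
        self.heads = nn.ModuleDict(
            {lang: nn.Linear(dim, phoneme_vocab_sizes[lang])
             for lang in languages})

    def supervised_loss(self, hidden_states, phoneme_targets, lang):
        """Frame-level cross-entropy for a batch from a single language.

        `hidden_states[self.layer]` is assumed to be (batch, frames, dim) and
        `phoneme_targets` a (batch, frames) tensor of aligned phoneme ids.
        """
        logits = self.heads[lang](hidden_states[self.layer])
        return nn.functional.cross_entropy(
            logits.flatten(0, 1), phoneme_targets.flatten())
```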

In self-supervised pre-training (i.e., Multi-Task-PT [SSL]), the standard SSL loss (as defined by the SpidR architecture) is used throughout. The OoD multi-task models trained under these schemes are used as initialization weights for meta-training.

Eight MAdaPT episodes are trained in parallel across 16 GPUs. During meta-training, each episode consists of 2,000 steps: 1,800 steps for the inner-loop (self-supervised adaptation) and 200 steps for the outer-loop (supervised meta-optimization). For each inner-loop task $D^{u}_{\ell}$, we use a randomly chosen 10-hour data chunk from a randomly chosen source language. For the outer-loop optimization, the inner-loop language $\ell$ is retained, but the data duration is not fixed at 10 hours. The overall training spans 200,000 steps, resulting in a total of 800 episodes (calculated as $200{,}000 / 2{,}000 \times 8 = 800$ episodes). This meta-training setup is chosen both for practicality of implementation (on limited compute and with limited time) and to closely mimic the low-resource adaptive fine-tuning scenario central to our research.

For the self-supervised initialization of FOBLO, the supervised outer-loop optimization is applied to the 6th layer of the model, while for the interleaved-supervised initialization it is applied to the 8th layer (consistent with the layer supervised during meta-initialization). The FOBLO supervised layers were selected based on the best-performing layers of the meta-initialization models used for meta-training. When computing ABX scores, we therefore report results from the 6th and 8th layers for the self- and interleaved-supervised models, respectively.

In SpidR, the teacher is trained as an exponential moving average of the student, with the teacher decay at timestep $t$ defined as $1 - (1 - \beta_0)\,\exp(-t/T)$. We find that some meta-training configurations (specifically, trainings initialized using interleaved supervision or meta-trained using FOBLO) perform better when trained with $\beta_0 = 1$, which effectively produces a frozen teacher. Hence, we select the best-performing value of $\beta_0$ (from 1.0 and the default 0.999) for each meta-training variant (i.e., Reptile or FOBLO with SSL or SSL/SL initializations) based on few-shot performance on the development language set.
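As a small worked example of this schedule, the helper below evaluates the decay formula; the time constant $T$ is a placeholder value, since its actual setting is not given here.

```python
import math

def teacher_decay(t, beta0=0.999, time_constant=10_000):
    """EMA teacher decay at step t: 1 - (1 - beta0) * exp(-t / T).

    With beta0 = 1.0 the decay is always 1, i.e. a frozen teacher;
    `time_constant` (T) is a placeholder value.
    """
    return 1.0 - (1.0 - beta0) * math.exp(-t / time_constant)

# The decay starts near beta0 and approaches 1.0 as t grows.
print(round(teacher_decay(0), 6), round(teacher_decay(100_000), 6))
print(teacher_decay(0, beta0=1.0))  # frozen-teacher case: always 1.0
```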

Within each meta-training inner-loop, we use a constant learning rate with a small warmup of 600 steps at the beginning of each loop. The learning rate within each episode is set through a tri-stage learning rate scheduler with a maximum learning rate of 5e-5. The detailed schedule is illustrated in Figure 3.

For fast adaptive fine-tuning to the OoD target languages, we use a single GPU. For each model variant (i.e., Multi-Task-PT, MAdaPT-Reptile, or MAdaPT-FOBLO with SSL or SSL/SL initializations) and each adaptation dataset size (10 minutes to 100 hours), we conduct a hyperparameter search on the development language set to identify the optimal number of training steps (varied between 4,000 and 24,000), the learning rate (constant, either 5e-4 or 5e-5), and $\beta_0$ for the teacher decay (1.0 or the default 0.999). The best checkpoint for each adaptation run is selected based on the lowest validation loss, ensuring optimal model performance for downstream evaluations.
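For illustration, this grid search could be organized as follows; the intermediate step counts and the `train_and_validate` callable are placeholders (only the 4,000 and 24,000 endpoints, the two learning rates, and the two β₀ values come from the text).

```python
from itertools import product

# Hyperparameter grid described above (intermediate step counts assumed).
TRAIN_STEPS = [4_000, 8_000, 16_000, 24_000]
LEARNING_RATES = [5e-4, 5e-5]
TEACHER_BETA0 = [1.0, 0.999]

def select_adaptation_config(train_and_validate):
    """Pick the configuration with the lowest validation loss (sketch).

    `train_and_validate(steps, lr, beta0)` is a placeholder that runs one
    adaptation and returns its best validation loss.
    """
    best = None
    for steps, lr, beta0 in product(TRAIN_STEPS, LEARNING_RATES, TEACHER_BETA0):
        loss = train_and_validate(steps, lr, beta0)
        if best is None or loss < best[0]:
            best = (loss, dict(steps=steps, lr=lr, beta0=beta0))
    return best[1]
```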

Here, we present detailed ABX scores for both within-speaker and across-speaker setups as illustrated in Figure 2. As shown in Table 5, the In-Domain Mono-Task-PT [SSL] models are trained with sufficient in-domain data (6k hours per language), resulting in oracle-level performance. Moreover, we evaluate all methods on the five development languages, with their ABX scores reported in Table 6. Due to the lack of unlabeled corpora for these five development languages, the oracle performance is not reported in the table. Across both tables, our proposed MAdaPT-FOBLO consistently outperforms the Oracle baseline and achieves performance comparable to the MAdaPT-Reptile method. Notably, when self-supervised initialization is applied, our approach rapidly improves performance as adaptation time increases, highlighting its data efficiency and overall effectiveness.

Detailed results for downstream spoken language modeling are provided in Table 7. As described in Sec. 4.3, we use sWUGGY, sBLIMP, and spoken tSC to estimate the performance of the spoken language models. For all tasks, candidates are scored by length-normalized log-likelihood (log-likelihood divided by token count) for comparability across strings, and decisions are made by selecting the higher-scoring alternative.
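A minimal sketch of this scoring rule is given below; the spoken LM interface (`lm` returning next-token logits over discrete speech units) and the tokenization are assumptions, not the fairseq2 API.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def length_normalized_loglik(lm, token_ids):
    """Length-normalized log-likelihood of one tokenized spoken-unit string.

    `lm(token_ids)` is assumed to return next-token logits of shape
    (1, seq_len, vocab); `token_ids` is a (1, seq_len) LongTensor.
    """
    logits = lm(token_ids)
    log_probs = F.log_softmax(logits[:, :-1], dim=-1)
    targets = token_ids[:, 1:]
    token_ll = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_ll.sum().item() / targets.numel()

def choose(lm, candidate_a, candidate_b):
    """Pick the candidate with the higher length-normalized score."""
    return max((candidate_a, candidate_b),
               key=lambda ids: length_normalized_loglik(lm, ids))
```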

We use SSL models fine-tuned on the English adaptation sets (0 hours to 100 hours) as encoders for the downstream SLM. OPT-125M models (Zhang et al., 2022) are used as the SLMs, trained using fairseq2 (Balioglu et al., 2023) and following the architectural choices of previous work (Hassid et al., 2023; Poli et al., 2025b). The 6k-hour subset of Libri-Light (Kahn et al., 2020) is used as the training dataset. We train on 8 GPUs, with a context length of 2,048 and batches of at most 81,920 tokens, for 25,000 steps. The learning rate is set to 1e-2 with a 1,000-step warmup period and a cosine annealing schedule. The remaining hyperparameters follow OPT-125M defaults. We select the checkpoint with the lowest validation loss.

The Phoneme Discovery Benchmark (Poli et al., 2026) is specifically designed to investigate the abilities of speech representation models to encode phonemic information in a low resource setting.

It employs metrics such as PNMI, PER, and ABX (described in the main paper). The benchmark includes 6 development languages (Swahili, Tamil, Thai, Ukrainian, Turkish, and German) and 6 test languages (French, English, Japanese, Mandarin, Wolof, and Basque). Note that the development and test language sets in the benchmark differ from those in our previous experiments but remain disjoint from our training set.

In the current work, we applied our previously tuned MAdaPT-Reptile and MAdaPT-FOBLO models to these tasks. Our models are trained on 19 VoxPopuli languages (Wang et al., 2021). We compare our approaches to an OoD HuBERT trained on 20 VoxPopuli languages. Test language results are reported in

To investigate the impact of active forgetting in our approach, we conduct ablation studies by removing the active forgetting mechanism from the inner-loop, on the 5 development and 3 test languages. As shown in Table 10 and Table 11, incorporating active forgetting consistently outperforms the variant without this mechanism. This demonstrates that resetting the prediction heads and codebooks helps the model avoid overfitting to previous episodes, thereby improving overall performance.

To explore the influence of meta-initialization, we meta-train our model from three types of initialization. Multi-Task-PT [SSL] and Multi-Task-PT [SSL/SL] were introduced in the main manuscript; both are obtained via multi-task pre-training. Here, we additionally attempt random initialization, wherein the backbone is initialized by random sampling from the default parameter distribution. Table 12 and Table 13 present ablation studies with the different meta-initializations on the 5 development and 3 test languages, respectively.

We find that random meta-initialization does not work for meta-training. Without a meaningful starting point, meta-training may fail to converge or may require significantly more data and iterations to achieve competitive performance for self-supervised speech models. Thus, the success of meta-learning for speech representation learning is tightly coupled with the quality and relevance of the initial representations encoded in the backbone.

To systematically investigate the impact of the meta-learning rate $\beta$ in our approach, we conduct a series of experiments with SpidR-Adapt, using the interleaved-supervised meta-initialization and varying $\beta$ across 0.001, 0.01, 0.1, and 1. Table 14 presents the results, with the left and right subtables corresponding to the 5 development and 3 test language sets, respectively. Our analysis reveals a clear trend on the development set as $\beta$ increases: ABX scores initially decrease, reaching their best value at $\beta = 0.01$ before degrading at higher values. This suggests that a moderate meta-learning rate strikes the best balance between adaptation and stability, while excessively high rates may lead to suboptimal generalization.

To ensure robust hyperparameter selection and prevent overfitting to the 3 test languages, we rely on the 5 development-language results to identify the optimal $\beta$. Consequently, all reported results in the paper are based on $\beta = 0.01$, which consistently yields the strongest performance across our evaluations.

To investigate how layer-specific embeddings affect the model's ability to discriminate between phonemes, we present ABX scores for each student layer in Figure 4. The scores are averaged across (a) the 5 development or (b) the 3 test languages, and across the 10-minute to 100-hour adaptation data scales. Our analysis reveals distinct trends for the different meta-initialization strategies applied to MAdaPT-FOBLO: 1) with Multi-Task-PT [SSL], phone discriminability improves with increasing layer depth, peaking at layer 6; beyond this point, performance declines, suggesting that intermediate layers capture the most relevant phonetic representations, while deeper layers may become overly specialized or abstracted for the ABX task. 2) with Multi-Task-PT [SSL/SL], the optimal performance is observed at layer 8. These results suggest that the best-performing layer coincides with the layer at which supervision is applied during the outer-loop of FOBLO: for SSL meta-initialization, a supervision head is attached to the 6th encoder layer for outer-loop supervision, while for SSL/SL meta-initialization it is attached to the 8th layer (see Appendix B for more details).

Table caption (comparison with MAdaPT-Reptile, a purely SSL solution for MAdaPT): ABX scores averaged across 10-minute to 100-hour adaptation budgets (excluding zero-shot, 0 h). Although Reptile underperforms FOBLO, it achieves better results than the baseline Multi-Task-PT, demonstrating the effectiveness of MAdaPT. The best scores are in bold and the second best are underlined.

Table caption (Phoneme Discovery Benchmark): MAdaPT-FOBLO outperforms an alternative speech SSL model (HuBERT) and an alternative meta-learning framework (Reptile). Here, MAdaPT-FOBLO and MAdaPT-Reptile are initialized using Multi-Task-PT [SSL/SL]. The best scores are in bold and the second best are underlined.

Table caption: Within-speaker and across-speaker ABX scores (in %) on the 5 DEVELOPMENT languages. MAdaPT-FOBLO outperforms alternative methods in phoneme representation. Hyperparameters are tuned using results from the development language set. The best scores are in bold and the second best are underlined.
