An Entropy-Guided Curriculum Learning Strategy for Data-Efficient Acoustic Scene Classification under Domain Shift
Acoustic Scene Classification (ASC) faces challenges in generalizing across recording devices, particularly when labeled data is limited. The DCASE 2024 Challenge Task 1 highlights this issue by requiring models to learn from small labeled subsets recorded on a few devices, and these models must then generalize to recordings from previously unseen devices under strict complexity constraints. While techniques such as data augmentation and the use of pre-trained models are well-established for improving model generalization, optimizing the training strategy represents a complementary yet less-explored path that introduces no additional architectural complexity or inference overhead. Among various training strategies, curriculum learning offers a promising paradigm by structuring the learning process from easier to harder examples. In this work, we propose an entropy-guided curriculum learning strategy to address the domain shift problem in data-efficient ASC. Specifically, we quantify the uncertainty of device domain predictions for each training sample by computing the Shannon entropy of the device posterior probabilities estimated by an auxiliary domain classifier. Using entropy as a proxy for domain invariance, the curriculum begins with high-entropy samples and gradually incorporates low-entropy, domain-specific ones to facilitate the learning of generalizable representations. Experimental results on multiple DCASE 2024 ASC baselines demonstrate that our strategy effectively mitigates domain shift, particularly under limited labeled data conditions. Our strategy is architecture-agnostic and introduces no additional inference cost, making it easily integrable into existing ASC baselines and offering a practical solution to domain shift.
💡 Research Summary
Acoustic Scene Classification (ASC) is a fundamental task in computational auditory analysis, but its real‑world deployment is hampered by domain shift caused by variations in recording devices. The DCASE 2024 Challenge Task 1 explicitly targets this problem by restricting training to a very small fraction of labeled data (as low as 5 % of the full set) collected on a limited set of devices, while evaluation includes recordings from previously unseen devices. Existing solutions such as data augmentation (e.g., Freq‑MixStyle, simulated impulse responses) and large pre‑trained models improve robustness but increase model size, require external resources, or add inference overhead—undesirable for the strict complexity constraints of the challenge (≤128 kB parameters, ≤30 MMACs per second).
The authors propose a lightweight, architecture‑agnostic training strategy based on curriculum learning (CL). The key novelty lies in defining sample difficulty not by classification loss or confidence, but by domain uncertainty measured through the Shannon entropy of device posterior probabilities. An auxiliary domain classifier f_dom is first trained (with the feature extractor f_feat frozen) to predict device identities. For each training sample x, the classifier outputs a probability distribution p_d(x) over the D devices; the entropy H(x)=−∑_i p_i log p_i quantifies how ambiguous the device prediction is. High entropy indicates that the sample does not contain strong device‑specific cues and is therefore more “domain‑invariant”.
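The entropy score described above is straightforward to compute from the auxiliary classifier's outputs. The sketch below (function names are illustrative, not from the paper's code) takes raw device logits, converts them to posteriors with a numerically stable softmax, and returns H(x) = −∑_i p_i log p_i per sample:

```python
import numpy as np

def device_entropy(logits: np.ndarray) -> np.ndarray:
    """Shannon entropy of device posteriors for each sample.

    logits: array of shape (N, D) -- raw outputs of the auxiliary
            domain classifier f_dom over D devices.
    Returns an array of shape (N,) with H(x) in nats.
    """
    # Numerically stable softmax: subtract the per-row max before exp.
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    # Small epsilon guards against log(0) for near-one-hot posteriors.
    return -(p * np.log(p + 1e-12)).sum(axis=-1)
```

A uniform posterior over D devices attains the maximum entropy log D (the most "domain-invariant" case), while a confident, near-one-hot device prediction yields entropy close to zero (a strongly device-specific sample).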
All training samples are ranked by H(x) and split at the median: the top 50 % form the domain‑invariant set X_inv, the remainder form the domain‑specific set X_spec. Training proceeds in two stages. Stage 1 uses only X_inv to train the scene classifier f_cls on top of f_feat, encouraging the network to learn representations that are robust to device variations. Stage 2 gradually introduces X_spec by constructing each mini‑batch with a fixed ratio (e.g., 80 % X_inv + 20 % X_spec). The transition from Stage 1 to Stage 2 is triggered when validation loss on X_inv stops improving for a predefined number of epochs. Throughout both stages, the original optimizer, learning‑rate schedule, and batch size of each ASC baseline are kept unchanged, guaranteeing a plug‑and‑play integration without any hyper‑parameter tuning.
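The median split and the Stage 2 mini-batch construction described above can be sketched as follows. This is a minimal illustration under the paper's stated scheme (median threshold, fixed 80/20 mixing ratio); the function names and the index-based sampling are assumptions for the example, not the authors' implementation:

```python
import numpy as np

def split_by_entropy(entropies: np.ndarray):
    """Median split: high-entropy samples form the domain-invariant
    set X_inv, low-entropy samples the domain-specific set X_spec."""
    med = np.median(entropies)
    inv_idx = np.where(entropies >= med)[0]   # top 50%: X_inv
    spec_idx = np.where(entropies < med)[0]   # bottom 50%: X_spec
    return inv_idx, spec_idx

def stage2_batch(inv_idx, spec_idx, batch_size=32, inv_ratio=0.8, rng=None):
    """Stage 2 mini-batch with a fixed mix, e.g. 80% X_inv + 20% X_spec.
    (Stage 1 would simply draw batches from inv_idx alone.)"""
    rng = rng if rng is not None else np.random.default_rng()
    n_inv = int(round(batch_size * inv_ratio))
    batch = np.concatenate([
        rng.choice(inv_idx, size=n_inv, replace=False),
        rng.choice(spec_idx, size=batch_size - n_inv, replace=False),
    ])
    rng.shuffle(batch)  # avoid a fixed inv/spec ordering within the batch
    return batch
```

The switch from Stage 1 (X_inv only) to Stage 2 (mixed batches) would be driven by the early-stopping criterion the authors describe, i.e. when validation loss on X_inv plateaus for a fixed number of epochs.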
Experiments were conducted on the DCASE 2024 Task 1 dataset, which contains 10 acoustic scenes recorded across 12 European cities with both real devices (A, B, C) and simulated devices (S1‑S3). The test set also includes three unseen simulated devices (S4‑S6). Following the official low‑resource protocol, five training subsets containing 5 %, 10 %, 25 %, 50 %, and 100 % of the labeled data were used. Three representative ASC baselines were evaluated: the official DCASE 2024 baseline, the XJTLU model from Cai et al., and the SJTUT‑HU model from Han et al. Performance was measured by class‑wise average accuracy, with a particular focus on the unseen‑device subset.
Results show consistent improvements across all data regimes. For the official baseline, unseen‑device accuracy rose from 42.4 % (5 % data) to 44.0 % with the curriculum, a 1.6 % absolute gain. The XJTLU model improved from 46.7 % to 49.3 % (≈2.6 % gain), and similar gains were observed for the SJTUT‑HU model. Even when the full 100 % of labeled data were used, modest but reliable improvements were recorded, confirming that the method is beneficial beyond extreme low‑resource settings.
The contributions of this work are threefold: (1) introducing entropy‑based domain uncertainty as a principled difficulty metric for curriculum learning, thereby directly targeting domain invariance rather than task‑specific confidence; (2) delivering a model‑agnostic, inference‑free augmentation that respects the strict parameter and computational budgets of the DCASE challenge; and (3) demonstrating that careful ordering of training samples can substantially mitigate device‑induced domain shift, especially when labeled data are scarce. Future directions include exploring adaptive entropy thresholds (instead of a fixed median split), weighting samples continuously rather than binary grouping, and combining entropy with Bayesian uncertainty estimates for even richer curricula. Overall, the paper provides a practical, easily adoptable strategy for improving domain generalization in acoustic scene classification under realistic resource constraints.