A Semantically Consistent Dataset for Data-Efficient Query-Based Universal Sound Separation
Query-based universal sound separation is fundamental to intelligent auditory systems, aiming to isolate specific sources from mixtures. Despite recent advances, existing methods continue to suffer from residual interference in complex acoustic scenes. This performance limitation stems largely from a data bottleneck: in-the-wild datasets contain weak labels and severe co-occurrence of events. These flaws induce models to learn spurious correlations between background noise and target categories instead of robust acoustic features. To address this, we propose an automated pipeline that eliminates event co-occurrence by mining high-purity single-event segments from in-the-wild datasets via a semantically consistent synthesis protocol. Using this pipeline, we constructed Hive, a high-quality synthetic dataset comprising 2.4k hours of raw audio. Experimental results demonstrate that, compared with the state-of-the-art model SAM-Audio, which was trained on a dataset roughly 500 times larger than Hive, open-source models trained on Hive achieve competitive separation accuracy and perceptual quality. Moreover, these models exhibit remarkable zero-shot generalization on out-of-distribution evaluation benchmarks. These findings highlight that prioritizing the purity of supervision signals enables significant data efficiency, offering a new paradigm for training robust auditory foundation models at reduced computational cost. Code and dataset are available at https://shandaai.github.io/Hive.
💡 Research Summary
The paper tackles a fundamental bottleneck in query‑based universal sound separation (USS): the quality of training data. While recent USS models have achieved impressive results by scaling up both model size and dataset volume—often relying on massive in‑the‑wild collections such as AudioSet and VGGSound—these datasets suffer from weak annotations and pervasive co‑occurrence of multiple sound events. Consequently, models learn spurious correlations, treating background noises that frequently accompany a target class (e.g., wind with rain) as intrinsic characteristics of that class, which leads to residual interference in complex acoustic scenes.
To overcome this limitation, the authors propose a fully automated pipeline that extracts high‑purity single‑event segments from uncurated corpora and then mixes them under strict semantic constraints. The pipeline consists of three stages:
- Ontology reconstruction & preprocessing – Starting from the 474‑leaf AudioSet ontology, they merge synonyms, collapse overly fine‑grained categories, and discard abstract or environment‑only labels, arriving at a refined 283‑leaf taxonomy tailored for separation. Raw audio is sliced into 10‑second windows with 5‑second overlap, and silent segments (RMS < 5 × 10⁻⁴) are removed.
- Single‑event semantic‑acoustic alignment – Using metadata, they first filter out clips with multiple tags. They then employ the multimodal LLM Qwen‑3‑Omni as a zero‑shot polyphony detector to reject any remaining multi‑event mixtures. A coarse‑to‑fine hierarchical labeling step follows: a pretrained audio‑tagging model predicts parent nodes (confidence > 0.7), which serve as priors for Qwen‑3‑Omni to refine the classification to the exact leaf node. This dual discriminative‑generative approach ensures that each retained clip contains a single, well‑labeled acoustic event.
- Super‑resolution based standardization – Because source datasets exhibit heterogeneous sampling rates, the authors standardize everything to 44.1 kHz. Low‑bandwidth clips are up‑sampled using the Apollo audio super‑resolution model, while higher‑rate audio is down‑sampled with anti‑aliasing filters, preserving spectral fidelity.
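The slicing and silence-filtering step of the first stage can be sketched in a few lines. This is an illustrative reconstruction from the parameters stated above (10 s windows, 5 s overlap, RMS threshold of 5 × 10⁻⁴), not the authors' released code; function names are our own.

```python
import numpy as np

def slice_windows(audio, sr, win_s=10.0, hop_s=5.0):
    """Slice a mono waveform into 10 s windows with 5 s overlap (hop = 5 s)."""
    win, hop = int(win_s * sr), int(hop_s * sr)
    return [audio[i:i + win] for i in range(0, max(len(audio) - win, 0) + 1, hop)]

def is_silent(segment, rms_thresh=5e-4):
    """Reject segments whose RMS falls below the paper's 5e-4 threshold."""
    return np.sqrt(np.mean(segment.astype(np.float64) ** 2)) < rms_thresh

def preprocess(audio, sr):
    """Keep only non-silent fixed-length windows."""
    return [seg for seg in slice_windows(audio, sr) if not is_silent(seg)]
```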
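The coarse-to-fine labeling gate of the second stage reduces to a simple control flow: accept a clip only if the discriminative tagger is confident about a parent node, then let the generative model choose the leaf. The sketch below is a hypothetical harness; `tag_parents` and `refine_leaf` stand in for the pretrained audio tagger and Qwen‑3‑Omni, whose actual interfaces are not specified here.

```python
def label_clip(clip, tag_parents, refine_leaf, conf_thresh=0.7):
    """Coarse-to-fine hierarchical labeling.

    tag_parents(clip) -> [(parent_label, confidence), ...]  (stand-in for the tagger)
    refine_leaf(clip, parent_labels) -> leaf_label           (stand-in for the LLM)
    """
    # Coarse stage: keep only parent predictions above the confidence threshold.
    parents = [p for p, conf in tag_parents(clip) if conf > conf_thresh]
    if not parents:
        return None  # no confident parent prior -> reject the clip entirely
    # Fine stage: confident parents serve as priors for leaf-level refinement.
    return refine_leaf(clip, parents)
```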
After cleaning, the pipeline yields roughly 0.9 M unique clips (≈2,442 hours) drawn from 12 public sources (AudioSet, VGGSound, FreeSound, BBC Sound Effects, etc.). The authors then synthesize mixtures of 2–5 sources each, following a "principled mix" strategy that respects semantic compatibility (e.g., never mixing two instances of the same fine‑grained class). The result is the Hive dataset, a 2.4 k‑hour synthetic collection with uniform 44.1 kHz audio and fully consistent labels.
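The mixing protocol can be sketched as sampling 2–5 clips from pairwise-distinct leaf classes and summing them at random relative gains. This is a minimal sketch under stated assumptions: the gain range and peak normalization are illustrative choices, not the paper's exact recipe, and `pool` (a mapping from leaf label to clips) is a hypothetical data structure.

```python
import random
import numpy as np

def make_mixture(pool, n_min=2, n_max=5, gain_db=(-5.0, 5.0), rng=None):
    """Sample 2-5 single-event clips with distinct leaf classes and sum them.

    pool: dict mapping leaf label -> list of equal-length waveforms.
    Distinct classes per mixture enforce the no-co-occurrence constraint.
    """
    rng = rng or random.Random()
    n = rng.randint(n_min, min(n_max, len(pool)))
    labels = rng.sample(sorted(pool), n)  # distinct leaf classes
    sources = []
    for lab in labels:
        clip = rng.choice(pool[lab]).astype(np.float64)
        gain = 10 ** (rng.uniform(*gain_db) / 20)  # random relative level in dB (assumed)
        sources.append(gain * clip)
    mix = np.sum(sources, axis=0)
    peak = np.max(np.abs(mix))
    if peak > 1.0:
        mix = mix / peak  # avoid clipping in the rendered mixture
    return mix, labels, sources
```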
To evaluate the impact of data purity, the authors train two representative USS models on Hive alone: the discriminative AudioSep and the generative FlowSep. They compare these against SAM‑Audio, a state‑of‑the‑art unified model trained on roughly one million hours of data (≈500 × Hive’s size). Despite using only ~0.2 % of SAM‑Audio’s data, Hive‑trained models achieve comparable SI‑SDR, PESQ, and MOS scores on standard test sets. Moreover, in zero‑shot generalization experiments on out‑of‑distribution benchmarks such as FSD50K and ESC‑50, Hive‑trained models retain competitive performance, especially on long‑tail, rare events where data efficiency is most critical.
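Of the metrics above, SI-SDR is the standard objective measure for separation quality and is easy to compute directly: project the estimate onto the reference to get the target component, then compare target energy against residual energy. The implementation below follows the standard definition (with mean removal); it is provided for context, not taken from the paper.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB."""
    est = estimate - estimate.mean()
    ref = reference - reference.mean()
    alpha = np.dot(est, ref) / (np.dot(ref, ref) + eps)  # optimal scaling factor
    target = alpha * ref            # component of the estimate aligned with the reference
    noise = est - target            # everything else counts as distortion
    return 10 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))
```

Because of the optimal scaling step, a perfect separation at any gain scores arbitrarily high, which is why SI-SDR is preferred over plain SNR for separation benchmarks.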
The results substantiate the authors’ central claim: prioritizing the purity and semantic consistency of supervised signals can dramatically improve data efficiency, reducing the need for massive, noisy datasets. By releasing the full pipeline code, the refined ontology, and the Hive dataset, the work also enhances reproducibility and invites the community to adopt similar data‑centric strategies across other audio domains (e.g., music separation, speech enhancement).
In summary, this study demonstrates that a carefully curated, semantically consistent synthetic dataset can close the performance gap with far larger, noisier corpora, offering a scalable and accessible path toward robust, query‑driven universal sound separation.