Positive-Unlabelled Active Learning to Curate a Dataset for Orca Resident Interpretation

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

This work presents the largest curation of Southern Resident Killer Whale (SRKW) acoustic data to date, also containing other marine mammals in their environment. We systematically search all available public archival hydrophone data within the SRKW habitat (over 30 years of audio data). The search consists of a weakly-supervised, positive-unlabelled, active learning strategy to identify all instances of marine mammals. The resulting transformer-based detectors outperform state-of-the-art detectors on the DEEPAL, DCLDE-2026, and two newly introduced expert-annotated datasets in terms of accuracy, energy efficiency, and speed. The detection model has a specificity of 0-28.8% at 95% sensitivity. Our multiclass species classifier obtains a top-1 accuracy of 42.1% (11 train classes, 4 test classes) and our ecotype classifier obtains a top-1 accuracy of 43.0% (4 train classes, 5 test classes) on the DCLDE-2026 dataset. We yield 919 hours of SRKW data, 230 hours of Bigg’s orca data, 1374 hours of orca data from unlabelled ecotypes, 1501 hours of humpback data, 88 hours of sea lion data, 246 hours of Pacific white-sided dolphin data, and over 784 hours of unspecified marine mammal data. This SRKW dataset is larger than DCLDE-2026, Ocean Networks Canada, and OrcaSound combined. The curated species labels are available under a CC-BY 4.0 license, and the corresponding audio data are available under the licenses of the original owners. The comprehensive nature of this dataset makes it suitable for unsupervised machine translation, habitat usage surveys, and conservation endeavours for this critically endangered ecotype.


💡 Research Summary

The paper presents a comprehensive effort to build the largest acoustic dataset of Southern Resident Killer Whales (SRKW) and co‑occurring marine mammals by mining over 30 years of publicly available hydrophone recordings. The authors adopt a weakly‑supervised Positive‑Unlabelled (PU) learning framework combined with several active learning strategies to efficiently discover and label marine mammal vocalizations in an archive of roughly 260 000 hours of audio.

Data sources include Ocean Networks Canada, SanctSound, the Ocean Observatories Initiative, OrcaSound, the DCLDE‑2026 challenge set, DEEP‑AL fieldwork, and two recent FGPD/PSGCH collections. After filtering for permissive licenses and high‑probability segments, the final corpus comprises more than 5 142 hours of audio covering SRKW (919 h), Bigg’s orca (230 h), unlabelled orca ecotypes (1 374 h), humpback whale (1 501 h), sea lion (88 h), Pacific white‑sided dolphin (246 h), and unspecified marine mammals (784 h).

Pre‑processing converts all files to lossless FLAC, resamples to 32 kHz, applies a 1 kHz high‑pass filter, and splits longer recordings into 5‑minute chunks. Each chunk is passed through the Whisper‑tiny encoder (39 M parameters) to obtain a 15‑second embedding; layer‑norm and mean‑pooling reduce this to a single vector per chunk. The total energy consumption for embedding generation is 153 kWh, corresponding to less than 10 kg CO₂, demonstrating an environmentally conscious pipeline.
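A minimal sketch of this preprocessing chain in Python. The filter order and the layer‑norm/mean‑pool routine are assumptions for illustration, and the Whisper‑tiny encoder itself is not reproduced here; `pool_embedding` simply shows the pooling arithmetic applied to whatever per‑frame embeddings the encoder would emit:

```python
import numpy as np
from scipy.signal import butter, sosfilt, resample_poly

def preprocess(audio, sr, target_sr=32_000, hp_cutoff_hz=1_000, chunk_s=300):
    """Resample to 32 kHz, high-pass at 1 kHz, split into 5-minute chunks."""
    audio = resample_poly(audio, target_sr, sr)              # resample to 32 kHz
    sos = butter(4, hp_cutoff_hz, btype="highpass",          # 4th order is an assumption
                 fs=target_sr, output="sos")
    audio = sosfilt(sos, audio)
    n = target_sr * chunk_s                                  # samples per chunk
    return [audio[i:i + n] for i in range(0, len(audio) - n + 1, n)]

def pool_embedding(frames):
    """Layer-norm each frame embedding (shape (T, D)), then mean-pool over time."""
    mu = frames.mean(axis=-1, keepdims=True)
    sd = frames.std(axis=-1, keepdims=True) + 1e-6
    return ((frames - mu) / sd).mean(axis=0)                 # -> single (D,) vector
```

Trailing audio shorter than one chunk is dropped in this sketch; the paper does not state how remainders are handled.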

For the PU stage, the authors treat existing ONC labels and expert‑verified positive instances as the positive set, while randomly sampling from the massive unlabelled pool to serve as negative examples. A logistic regression model with L2 regularization provides an initial classifier. Theoretical analysis invokes the SCAR (selected completely at random) assumption, the Vapnik‑Chervonenkis dimension V, the propensity eₘ of a positive sample being labeled, and the Massart margin h, yielding convergence bounds O(V/(n·eₘ·h)) for h ≥ … and O(V/(n·eₘ)) for h < …. As active learning proceeds, eₘ approaches 1, effectively reducing the PU problem to standard supervised learning.
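The PU initialization can be sketched on toy data. The cluster geometry, dimensionality, and sample counts below are illustrative stand-ins for the real Whisper embeddings, not values from the paper:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Toy 8-D "embeddings": labeled positives cluster at +2; the unlabelled pool
# mixes a few hidden positives with true negatives clustered at -2.
pos = rng.normal(2.0, 1.0, size=(200, 8))
unl = np.vstack([rng.normal(2.0, 1.0, size=(20, 8)),     # hidden positives
                 rng.normal(-2.0, 1.0, size=(380, 8))])  # true negatives
X = np.vstack([pos, unl])
y = np.r_[np.ones(len(pos)), np.zeros(len(unl))]  # unlabelled treated as negative
clf = LogisticRegression(C=1.0, max_iter=1000).fit(X, y)  # L2 penalty is sklearn's default
scores = clf.predict_proba(unl)[:, 1]
review_queue = np.argsort(scores)[::-1][:10]  # highest-confidence candidates for listening
```

Even with the hidden positives mislabeled as negatives, they score far above the true negatives, which is what makes the subsequent active-learning loop productive.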

Active learning proceeds in three phases: (1) Positive‑only sampling, where only the top‑k highest‑confidence positives are labeled; (2) Entropy‑based sampling, selecting the top‑k most uncertain samples; and (3) Diversity‑preserving sampling, ensuring coverage of under‑represented regions of the feature space. After each batch, the model is retrained and newly labeled positives are added to the positive pool, gradually improving both recall and precision. Non‑expert volunteers performed the manual listening, achieving 50‑65 % agreement with expert labels, which is sufficient to boost the positive labeling rate above the ambient prevalence.
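The three phases might look roughly as follows; the greedy farthest-point routine is one plausible reading of "diversity-preserving sampling", not necessarily the paper's exact method:

```python
import numpy as np

def positive_only(probs, k):
    # Phase 1: send only the top-k highest-confidence positives for labeling
    return np.argsort(probs)[::-1][:k]

def entropy_sampling(probs, k):
    # Phase 2: send the top-k most uncertain samples
    # (binary entropy peaks at p = 0.5)
    p = np.clip(probs, 1e-9, 1 - 1e-9)
    ent = -(p * np.log(p) + (1 - p) * np.log(1 - p))
    return np.argsort(ent)[::-1][:k]

def diversity_sampling(emb, k, rng):
    # Phase 3: greedy farthest-point selection, so under-represented
    # regions of the feature space get covered
    chosen = [int(rng.integers(len(emb)))]
    d = np.linalg.norm(emb - emb[chosen[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(d))
        chosen.append(nxt)
        d = np.minimum(d, np.linalg.norm(emb - emb[nxt], axis=1))
    return np.array(chosen)
```

After each batch is labeled, the classifier is refit and the three selectors are re-run on the updated scores and embeddings.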

The detection backbone is a 12‑layer transformer with lightweight attention mechanisms, trained on the PU‑augmented dataset. It outperforms state‑of‑the‑art detectors on DEEP‑AL, DCLDE‑2026, and two newly curated expert‑annotated test sets in terms of accuracy, speed, and energy efficiency. At a fixed sensitivity of 95 %, the model attains a specificity of 0‑28.8 %, a notable improvement over prior work. The multiclass species classifier (11 training classes, 4 test classes) reaches a top‑1 accuracy of 42.1 %, while the ecotype classifier (4 training, 5 test) reaches 43.0 % on DCLDE‑2026.
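Specificity at a fixed sensitivity can be read off a threshold sweep over the detector's scores. A small sketch (it ignores tied scores, which a production evaluation would need to handle):

```python
import numpy as np

def specificity_at_sensitivity(y_true, scores, target_sens=0.95):
    # Sort by descending score, sweep the threshold, and report the
    # specificity at the first operating point whose sensitivity
    # (recall on positives) reaches the target.
    order = np.argsort(scores)[::-1]
    y = np.asarray(y_true, dtype=float)[order]
    tp = np.cumsum(y)
    fp = np.cumsum(1.0 - y)
    sens = tp / y.sum()
    spec = 1.0 - fp / (len(y) - y.sum())
    return float(spec[sens >= target_sens][0])
```

The paper's 0-28.8% range presumably reflects this quantity varying across the evaluation datasets.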

The resulting dataset, named DORI (Dataset for Orca Resident Interpretation), is released under a CC‑BY 4.0 license for the curated labels, while the raw audio retains the original owners’ licenses. The authors argue that DORI’s breadth makes it suitable for unsupervised machine translation of cetacean calls, habitat‑usage surveys, and conservation planning for the critically endangered SRKW ecotype.

Limitations discussed include potential violation of the SCAR assumption due to bias toward high‑SNR samples, possible under‑representation of rare species in entropy‑based sampling, and the 5‑minute chunk size limiting capture of long‑duration vocal sequences. Future work will explore multi‑scale temporal models, Bayesian PU formulations with dynamic priors based on location, bathymetry, and season, and integration of additional sensor modalities (e.g., video, environmental data) to further enhance detection and classification performance.

