Audio-to-Image Bird Species Retrieval without Audio-Image Pairs via Text Distillation

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Audio-to-image retrieval offers an interpretable alternative to audio-only classification for bioacoustic species recognition, but learning aligned audio-image representations is challenging due to the scarcity of paired audio-image data. We propose a simple and data-efficient approach that enables audio-to-image retrieval without any audio-image supervision. Our proposed method uses text as a semantic intermediary: we distill the text embedding space of a pretrained image-text model (BioCLIP-2), which encodes rich visual and taxonomic structure, into a pretrained audio-text model (BioLingual) by fine-tuning its audio encoder with a contrastive objective. This distillation transfers visually grounded semantics into the audio representation, inducing emergent alignment between audio and image embeddings without using images during training. We evaluate the resulting model on multiple bioacoustic benchmarks. The distilled audio encoder preserves audio discriminative power while substantially improving audio-text alignment on focal recordings and soundscape datasets. Most importantly, on the SSW60 benchmark, the proposed approach achieves strong audio-to-image retrieval performance exceeding baselines based on zero-shot model combinations or learned mappings between text embeddings, despite not training on paired audio-image data. These results demonstrate that indirect semantic transfer through text is sufficient to induce meaningful audio-image alignment, providing a practical solution for visually grounded species recognition in data-scarce bioacoustic settings.


💡 Research Summary

The paper tackles the problem of audio‑to‑image retrieval for bird species recognition, a task that offers interpretable, visually grounded results compared to traditional audio‑only classification. A major obstacle is the scarcity of paired audio‑image data in the wild, which makes direct cross‑modal alignment expensive or infeasible. The authors propose a remarkably simple yet effective solution: use text as a semantic bridge between the two modalities.

Two large‑scale pretrained multimodal models are leveraged: (1) BioCLIP‑2, an image‑text model trained on hundreds of millions of biological images and captions, whose text embeddings encode rich visual, taxonomic, and hierarchical information; and (2) BioLingual, an audio‑text model trained on millions of bioacoustic recordings paired with textual descriptions. Although each model aligns its own modality to text, the two embedding spaces are not aligned with each other.

The key insight is that the text space of BioCLIP‑2 already contains the visual semantics needed for image retrieval. By distilling this space into the audio encoder of BioLingual, the audio representations can inherit the visual grounding without ever seeing an image. Concretely, the audio encoder output is passed through a learnable linear projection g to match the dimensionality of BioCLIP‑2’s text embeddings. For each audio‑text pair (a_i, t_i) from the iNatSounds dataset, the projected audio vector z_Ai = g(f_A(a_i)) and the frozen BioCLIP‑2 text vector z_Ti = f_IT(t_i) are brought together using a contrastive loss:

L_distill = −(1/N) Σ_i log [ exp(sim(z_Ai, z_Ti) / τ) / Σ_j exp(sim(z_Ai, z_Tj) / τ) ]

where sim(·, ·) denotes cosine similarity and τ is a temperature hyperparameter. The sum over j ranges over all text embeddings in the batch, so each projected audio vector is pulled toward its own text embedding and pushed away from the others.
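A minimal NumPy sketch of this distillation objective (the function name, batch shapes, and the temperature value 0.07 are illustrative assumptions, not details taken from the paper):

```python
import numpy as np

def distill_loss(audio_emb, text_emb, temperature=0.07):
    """Contrastive (InfoNCE-style) loss pulling each projected audio
    embedding toward its paired frozen text embedding."""
    # L2-normalize rows so dot products are cosine similarities
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (a @ t.T) / temperature             # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # matching audio-text pairs sit on the diagonal
    return -np.mean(np.diag(log_probs))
```

In the paper's setup, `audio_emb` would hold the projected audio vectors z_Ai = g(f_A(a_i)) and `text_emb` the frozen BioCLIP‑2 text vectors z_Ti = f_IT(t_i); only the audio encoder and the projection g receive gradients.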

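Because the distilled audio embeddings live in BioCLIP‑2's text space, which is itself aligned with its image space, audio-to-image retrieval at inference time reduces to nearest-neighbor search by cosine similarity. A hypothetical sketch (function name and shapes are assumptions for illustration):

```python
import numpy as np

def retrieve_images(audio_emb, image_embs, top_k=5):
    """Rank a gallery of image embeddings by cosine similarity
    to a single distilled audio embedding."""
    a = audio_emb / np.linalg.norm(audio_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = imgs @ a                     # cosine similarity per image
    return np.argsort(-sims)[:top_k]   # indices of the best matches
```

No audio-image pairs are needed here either: the image gallery is embedded once with BioCLIP‑2's image encoder, and the query comes from the distilled audio encoder.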
