Joint Multimodal Contrastive Learning for Robust Spoken Term Detection and Keyword Spotting


Acoustic Word Embeddings (AWEs) improve the efficiency of speech retrieval tasks such as Spoken Term Detection (STD) and Keyword Spotting (KWS). However, existing approaches suffer from limitations, including unimodal supervision, disjoint optimization of audio-audio and audio-text alignment, and the need for task-specific models. To address these shortcomings, we propose a joint multimodal contrastive learning framework that unifies both acoustic and cross-modal supervision in a shared embedding space. Our approach simultaneously optimizes: (i) audio-text contrastive learning, inspired by the CLAP loss, to align audio and text representations, and (ii) audio-audio contrastive learning, via the Deep Word Discrimination (DWD) loss, to enhance intra-class compactness and inter-class separation. The proposed method outperforms existing AWE baselines on the word discrimination task while flexibly supporting both STD and KWS. To our knowledge, this is the first comprehensive approach of its kind.


💡 Research Summary

This paper addresses key limitations in current Acoustic Word Embedding (AWE) methods for speech retrieval tasks like Spoken Term Detection (STD) and Keyword Spotting (KWS). Traditional AWE approaches often rely on unimodal supervision (audio-only or text-only), optimize audio-audio and audio-text alignment objectives separately, and require separate models for different tasks (e.g., query-by-example vs. text-query). To overcome these shortcomings, the authors propose a novel Joint Multimodal Contrastive Learning framework.

The core innovation lies in unifying two complementary learning objectives within a single, shared embedding space. The framework simultaneously trains an audio encoder and a text encoder. First, it employs a symmetric audio-text contrastive loss, inspired by CLAP (Contrastive Language-Audio Pretraining). This loss pulls the embeddings of matching audio segments and their corresponding text keywords closer together while pushing apart non-matching pairs within a training batch. This enables cross-modal retrieval, essential for KWS where the query is text. Second, it incorporates an audio-audio discrimination loss based on Deep Word Discrimination (DWD). This loss operates solely on the audio embeddings, enforcing intra-class compactness (clustering different utterances of the same word) and inter-class separation (pushing apart embeddings of different words). This enhances the discriminative power of embeddings for direct audio-to-audio comparison, which is crucial for Query-by-Example STD.
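The two objectives above can be sketched in a few lines. The following is a minimal NumPy illustration, not the authors' implementation: the audio-text term is standard symmetric InfoNCE (as in CLAP), while the audio-audio term is a simple margin-based stand-in for the paper's DWD loss, whose exact formulation is an assumption here.

```python
import numpy as np

def symmetric_audio_text_loss(audio_emb, text_emb, temperature=0.07):
    """CLAP-style symmetric contrastive loss over a batch of paired,
    L2-normalized (B, D) embeddings; row i of each matrix is a matching pair."""
    logits = audio_emb @ text_emb.T / temperature      # (B, B) similarity matrix

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)           # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.diag(log_probs).mean()              # diagonal = matching pairs

    # audio->text and text->audio directions, averaged
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

def audio_audio_loss(audio_emb, word_ids, margin=0.5):
    """Margin-based stand-in for the DWD objective: same-word pairs should be
    highly similar, different-word pairs pushed below a similarity margin."""
    sims = audio_emb @ audio_emb.T                     # cosine sims (rows normalized)
    same = word_ids[:, None] == word_ids[None, :]
    off_diag = ~np.eye(len(sims), dtype=bool)
    pos = sims[same & off_diag]                        # same word, different utterance
    neg = sims[~same]                                  # different words
    # intra-class compactness + inter-class separation
    return np.maximum(0.0, 1.0 - pos).mean() + np.maximum(0.0, neg - (1.0 - margin)).mean()
```

Both losses consume the same normalized embeddings, which is what allows them to be optimized jointly in one shared space.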

The total loss function is a weighted sum of these two components (L_total = α1·L_at + α2·L_aa), allowing the model to learn a representation space that is both semantically aligned across modalities and structurally discriminative within the audio modality itself. The authors conduct a systematic ablation study to find an effective balance, ultimately setting α1=0.1 and α2=1.
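In code, the weighted combination is a one-liner, shown here with the ablation's chosen weights as defaults:

```python
def total_loss(loss_audio_text, loss_audio_audio, alpha1=0.1, alpha2=1.0):
    """Weighted sum of the two objectives; alpha1=0.1, alpha2=1 per the ablation."""
    return alpha1 * loss_audio_text + alpha2 * loss_audio_audio
```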

Experiments are conducted on the LibriSpeech corpus. The training uses the train-clean-100 subset, with word-level segments obtained via forced alignment and filtered by duration. A consistent neural architecture (a 3-layer BiLSTM) is used for the audio encoder across all baselines and the proposed model to ensure a fair comparison. The baselines include representative AWE models: Siamese RNN, Correspondence Autoencoder RNN, Multi-view RNN, Contrastive RNN, and a model using only the DWD loss.

Evaluation focuses on the intrinsic quality of embeddings via a word discrimination task, reporting Average Precision (AP). The test set is carefully partitioned into In-Vocabulary (IV) and Out-Of-Vocabulary (OOV) words to assess generalization. The results demonstrate that the proposed joint multimodal framework consistently outperforms all unimodal and disjoint training baselines on both IV and OOV sets. Notably, the combination of losses proves complementary: the audio-text loss improves performance when text queries are involved, while the audio-audio loss significantly boosts robustness for purely acoustic matching, especially for unseen (OOV) words.
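For concreteness, the same/different word discrimination AP can be computed along these lines (a minimal NumPy sketch under the standard formulation; the paper's exact evaluation scripts are not shown):

```python
import numpy as np

def word_discrimination_ap(embeddings, word_ids):
    """Average precision for word discrimination: rank all embedding pairs by
    cosine similarity; a pair is positive if both segments are the same word."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = emb @ emb.T
    iu = np.triu_indices(len(emb), k=1)                 # each unordered pair once
    scores = sims[iu]
    labels = (word_ids[:, None] == word_ids[None, :])[iu].astype(float)
    order = np.argsort(-scores)                         # most similar pairs first
    labels = labels[order]
    precision = np.cumsum(labels) / np.arange(1, len(labels) + 1)
    return (precision * labels).sum() / labels.sum()    # AP over positive pairs
```

An embedding space that perfectly clusters word instances yields AP = 1.0, which is why AP serves as the intrinsic quality measure for both the IV and OOV partitions.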

In conclusion, this work presents the first comprehensive framework to jointly learn from acoustic and textual supervision for AWE. It delivers a single, flexible model capable of state-of-the-art performance on both STD and KWS tasks. Additionally, the paper contributes a standardized and reproducible evaluation protocol for AWE research, addressing inconsistencies in prior work.

