Synaspot: A Lightweight, Streaming Multi-modal Framework for Keyword Spotting with Audio-Text Synergy
Open-vocabulary keyword spotting (KWS) in continuous speech streams holds significant practical value across a wide range of real-world applications. Multi-modal approaches to KWS have attracted increasing attention, and their effectiveness is well established. However, the parameter overhead of multimodal integration and the constraints of end-to-end deployment have limited the practical applicability of such models. To address these challenges, we propose a lightweight, streaming multi-modal framework. First, we focus on multimodal enrollment features and suppress speaker-specific (voiceprint) information in the speech enrollment to extract speaker-independent characteristics. Second, we effectively fuse speech and text features. Finally, we introduce a streaming decoding framework in which the encoder only extracts features, which are then scored directly against our three modal representations. Experiments on LibriPhrase and WenetPhrase demonstrate the performance of our model. Compared to existing streaming approaches, our method achieves better performance with significantly fewer parameters.
💡 Research Summary
The paper introduces Synaspot, a lightweight, streaming‑oriented multi‑modal framework for open‑vocabulary keyword spotting (KWS). Traditional multi‑modal KWS systems improve accuracy by combining audio and text enrollment, but they suffer from large model sizes and reliance on non‑streaming end‑to‑end decoders, which limits deployment on resource‑constrained devices. Synaspot addresses these issues through three main innovations.
First, the audio encoder is built from seven DFSMN layers (hidden size 256) that ingest FBank frames and output frame‑level embeddings (E_A). To make these embeddings speaker‑independent, a speaker classifier is attached to the encoder and its gradients are reversed during training (gradient reversal). This forces utterances with the same linguistic content but different speakers to map to similar embeddings. In parallel, a phoneme classifier is trained with an Additive Angular Margin (AAM) loss, which enlarges inter‑phoneme angular margins and reduces phoneme confusion. The combined audio loss is L_audio = α_A·L_ph + β_A·L_vp.
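The speaker-adversarial training described above hinges on a gradient reversal layer: it acts as the identity in the forward pass, but in the backward pass it flips (and optionally scales) the gradient flowing from the speaker classifier back into the encoder, pushing the encoder toward speaker-invariant embeddings. A minimal PyTorch sketch, where the class name `GradReverse` and the scale `lam` are illustrative, not from the paper:

```python
import torch


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reversed, scaled gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Flip the sign of the gradient flowing into the encoder.
        return -ctx.lam * grad_output, None


def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)


# Demo: the gradient reaching x is negated and scaled by lam.
x = torch.ones(3, requires_grad=True)
grad_reverse(x, 0.5).sum().backward()
```

In the full model, the speaker classifier would sit behind `grad_reverse(E_A)`, so minimizing the voiceprint loss L_vp through this layer maximizes speaker confusion in the encoder.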
Second, text embeddings (E_T) are generated by a simple embedding layer followed by an LSTM. A cross‑attention module treats E_T as queries and the audio embeddings E_A as keys/values, producing a mixed audio‑text embedding (E_M). All three modalities (audio, text, mixed) are aligned in a shared space using contrastive learning losses L_clat (audio‑text) and L_clam (mixed‑audio). The overall training objective for the multimodal branch is L_mixed = α_M·L_ph + β_M·L_clat + γ_M·L_clam.
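The fusion step can be sketched with a standard multi-head cross-attention layer in which the text embeddings E_T act as queries and the audio embeddings E_A as keys and values. This is a sketch under assumptions (batch-first tensors, embedding dimension 256 matching the encoder, 4 heads); the paper's exact module may differ:

```python
import torch
import torch.nn as nn


class AudioTextFusion(nn.Module):
    """Cross-attention fusion: text queries attend over audio frames."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, e_t, e_a):
        # e_t: (batch, text_len, dim) queries
        # e_a: (batch, audio_len, dim) keys/values
        e_m, _ = self.attn(e_t, e_a, e_a)
        return e_m  # mixed audio-text embedding E_M, aligned to the text length


fusion = AudioTextFusion()
e_t = torch.randn(2, 10, 256)    # text embeddings E_T
e_a = torch.randn(2, 120, 256)   # frame-level audio embeddings E_A
e_m = fusion(e_t, e_a)
```

Because the queries come from text, E_M inherits the text-side sequence length while absorbing acoustic evidence, which is what the contrastive losses L_clat and L_clam then align in the shared space.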
Third, during inference only the audio encoder runs, producing streaming embeddings (E_W) for each incoming audio chunk. The system pre‑computes and caches the three enrollment embeddings (E_A, E_T, E_M). For each frame, cosine similarity p_ij between E_W and each enrollment embedding is computed. To mitigate noise, a causal moving‑average smoothing with window w_smooth is applied, yielding p′_ij. A scoring window w_scoring aggregates the maximum smoothed similarity across the window, producing per‑modality scores S_A, S_T, and S_M. The final wake‑up score is a weighted sum: S = α_S·S_A + β_S·S_T + γ_S·S_M. This design eliminates the need for a separate decoder, allows arbitrary keyword lengths, and operates at frame‑level latency.
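The decoder-free scoring pipeline above can be sketched in NumPy: per-frame cosine similarity against a cached enrollment embedding, causal moving-average smoothing over w_smooth frames, a windowed maximum over w_scoring frames per modality, and a weighted sum. The window sizes and weights below are placeholders, not the paper's tuned values:

```python
import numpy as np


def cosine_sim(frames, enroll):
    """Cosine similarity between each frame embedding and one enrollment embedding."""
    frames = frames / np.linalg.norm(frames, axis=1, keepdims=True)
    enroll = enroll / np.linalg.norm(enroll)
    return frames @ enroll


def causal_smooth(p, w):
    """Causal moving average: each frame averages the last w similarities."""
    return np.array([p[max(0, i - w + 1): i + 1].mean() for i in range(len(p))])


def modality_score(p, w_smooth, w_scoring):
    """Maximum smoothed similarity over the most recent scoring window."""
    return causal_smooth(p, w_smooth)[-w_scoring:].max()


def wakeup_score(p_a, p_t, p_m, w_smooth=5, w_scoring=50,
                 alpha=0.4, beta=0.3, gamma=0.3):
    """Weighted sum S = alpha*S_A + beta*S_T + gamma*S_M over the three modalities."""
    s_a = modality_score(p_a, w_smooth, w_scoring)
    s_t = modality_score(p_t, w_smooth, w_scoring)
    s_m = modality_score(p_m, w_smooth, w_scoring)
    return alpha * s_a + beta * s_t + gamma * s_m
```

Since each step only looks backward in time, the score can be updated frame by frame as chunks arrive, with no decoder pass and no fixed keyword length.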
Experiments were conducted on two benchmark suites: LibriPhrase (English), derived from LibriSpeech, and WenetPhrase (Mandarin). Training used LibriSpeech train‑clean‑100/360 for audio, with the respective phrase datasets used for evaluation. Synaspot variants (audio‑only, text‑only, and fused) contain only 0.9 M parameters, within or below the range of prior streaming models (0.6–2.2 M) and far below non‑streaming baselines (up to 3.9 M). Results show that the fused model achieves an Equal Error Rate (EER) of 5.77 % and an Area Under Curve (AUC) of 27.29 % on the hard English set, outperforming CMCD (EER 8.42 %) with comparable parameters. On the easy set, EER is 5.97 % with AUC 97.19 %. Mandarin experiments confirm similar trends: the fused model reaches 14.56 % EER and 34.50 % AUC, surpassing MM‑KWS (3.9 M parameters).
Ablation studies reveal that removing the mixed embedding raises EER to 7.07 % and discarding the speaker classifier increases EER to 8.85 %, confirming that both speaker‑invariant audio features and multimodal alignment are crucial. Qualitative heat‑map visualizations illustrate clear separation between positive and negative examples in both offline and streaming similarity matrices.
The paper also compares Synaspot with non‑streaming end‑to‑end approaches that rely on fixed‑length windows (1.5 s, 2 s). Those methods are sensitive to window size, incur higher computational load, and cannot adapt to variable‑length keywords, whereas Synaspot processes audio continuously with minimal latency and memory footprint.
In conclusion, Synaspot demonstrates that a carefully designed speaker‑agnostic audio encoder, contrastive multimodal alignment, and a decoder‑free streaming scoring mechanism can deliver high‑accuracy open‑vocabulary KWS with a sub‑million‑parameter model. This makes it well‑suited for on‑device, real‑time applications such as wake‑word detection, voice assistants, and other human‑machine interaction scenarios where both flexibility and efficiency are paramount.