Equipping LLM with Directional Multi-Talker Speech Understanding Capabilities
Recent studies have demonstrated that prompting large language models (LLMs) with audio encodings enables effective speech understanding capabilities. However, most speech LLMs are trained on single-channel, single-talker data, which makes it challenging to apply them directly to multi-talker and multi-channel speech understanding tasks. In this work, we present a comprehensive investigation into how to enable directional multi-talker speech understanding capabilities for LLMs, specifically for the smart-glasses use case. We propose two novel approaches to integrate directivity into LLMs: (1) a cascaded system that leverages a source separation front-end module, and (2) an end-to-end system that utilizes serialized output training. Both approaches utilize a multi-microphone array embedded in smart glasses to optimize directivity interpretation and processing in a streaming manner. Experimental results demonstrate the efficacy of our proposed methods in endowing LLMs with directional speech understanding capabilities, achieving strong performance in both speech recognition and speech translation tasks.
💡 Research Summary
The paper addresses the challenge of extending large language model (LLM)–based speech understanding to multi‑talker, multi‑channel scenarios typical of wearable devices such as smart glasses. While recent work has shown that prompting LLMs with audio embeddings can yield strong speech recognition (ASR) and translation capabilities, those models have been trained almost exclusively on single‑channel, single‑talker data, limiting their applicability in realistic conversational environments where multiple speakers overlap and spatial cues are essential.
To bridge this gap, the authors propose two distinct architectures that explicitly incorporate directional information from a five‑microphone array embedded in smart glasses. The first is a cascaded system that employs a streaming multi‑channel source‑separation (SS) front‑end. Raw microphone signals are beam‑formed into K + 1 channels, transformed to the short‑time Fourier transform (STFT) domain, and processed by an encoder‑decoder network consisting of gated linear unit (GLU) convolutional blocks, a three‑layer LSTM, and convolutional decoders. The network outputs time‑frequency masks for the wearer (near‑field) and the conversation partner (far‑field). After inverse STFT, the two separated waveforms are examined chunk‑wise (600 ms). A simple RMS‑ratio heuristic, combined with a voice‑activity detector, determines which speaker dominates each chunk, producing a speaker tag (“self” or “other”). The tag is then used to select a task‑specific prompt (e.g., “translate to French”) and the reference channel (a single‑channel signal) is fed to a pre‑trained multimodal LLM (Gemma‑3‑n 4B). This approach yields very low speaker‑attribution (SA) errors for the wearer and strong ASR/translation performance, but it cannot handle overlapping speech because the SS masks introduce distortion when two voices are active simultaneously.
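The chunk-wise speaker-tagging step of the cascaded system can be sketched as follows. This is an illustrative approximation: the RMS-ratio threshold, the `is_speech` VAD hook, and the tie-breaking rule for ambiguous chunks are assumptions not specified in the summary; only the 600 ms chunk size and the "self"/"other" tags come from the paper.

```python
import numpy as np

CHUNK_MS = 600          # chunk size described in the paper
RMS_RATIO_THRESH = 2.0  # hypothetical threshold; the paper does not give a value

def rms(x):
    """Root-mean-square energy of a waveform chunk."""
    return float(np.sqrt(np.mean(np.square(x)) + 1e-12))

def tag_chunks(self_wav, other_wav, sample_rate=16000, is_speech=None):
    """Assign a 'self'/'other'/'silence' tag to each 600 ms chunk by
    comparing the RMS energy of the two separated streams."""
    hop = int(sample_rate * CHUNK_MS / 1000)
    tags = []
    for start in range(0, len(self_wav) - hop + 1, hop):
        a = self_wav[start:start + hop]
        b = other_wav[start:start + hop]
        # Optional VAD hook: skip chunks a detector marks as non-speech.
        if is_speech is not None and not is_speech(start, start + hop):
            tags.append("silence")
            continue
        ratio = rms(a) / max(rms(b), 1e-12)
        if ratio > RMS_RATIO_THRESH:
            tags.append("self")
        elif ratio < 1.0 / RMS_RATIO_THRESH:
            tags.append("other")
        else:
            # Ambiguous chunk: carry the previous tag forward (assumption).
            tags.append(tags[-1] if tags else "other")
    return tags
```

Downstream, each tag would select the task prompt and reference channel fed to the multimodal LLM for that chunk.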
The second architecture is an end‑to‑end solution based on Serialized Output Training (SOT). Multi‑channel audio is first processed by a non‑linearly constrained minimum variance (NLCMV) beamformer that produces several directional beams; only one “mouth‑beam” is passed to the LLM’s audio encoder, preserving the spatial contrast between speakers while keeping the model’s input dimensionality unchanged. Training data are constructed in SOT format: reference transcripts are ordered by speaker start times and interleaved with a special speaker‑change token. The LLM is fine‑tuned on this data using Low‑Rank Adaptation (LoRA) applied to both the audio encoder and the language decoder (rank = 64, updating ≈ 1.9 % of parameters). This design enables the model to learn to emit a serialized sequence that includes both transcription and translation while simultaneously indicating speaker changes, thereby handling overlapping speech without an explicit separation front‑end.
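The SOT target construction described above (utterances ordered by start time, interleaved with a speaker-change token) can be sketched in a few lines. The token string `<sc>` and the dictionary field names are hypothetical; the paper only specifies that a special speaker-change token separates speakers in start-time order.

```python
# Hypothetical speaker-change token; the paper uses a special token whose
# exact surface form is not given in this summary.
SC_TOKEN = "<sc>"

def serialize_sot(utterances):
    """Build a serialized-output-training target: order utterances by
    start time and insert a speaker-change token whenever the speaker
    differs from the previous utterance."""
    ordered = sorted(utterances, key=lambda u: u["start"])
    pieces, prev_spk = [], None
    for u in ordered:
        if prev_spk is not None and u["speaker"] != prev_spk:
            pieces.append(SC_TOKEN)
        pieces.append(u["text"])
        prev_spk = u["speaker"]
    return " ".join(pieces)
```

During fine-tuning, the model learns to emit this serialized sequence directly, so overlapping speech needs no explicit separation front-end.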
Experiments are conducted on simulated multi‑channel data that mimic the geometry of Aria‑style smart glasses. The simulation uses real room impulse responses (RIRs) from 12 directions spaced at 30°, with five frontal directions (‑60°, ‑30°, 0°, 30°, 60°) designated for the partner. Speech sources are drawn from Common Voice, Multilingual LibriSpeech (MLS), and LibriSpeech for training; evaluation uses the FLEURS dataset in English, French, Spanish, and Italian. The source‑separation model is trained with a combination of L1, STFT, and Log‑SI‑SDR losses, achieving PESQ scores of 1.60–2.91, STOI of 0.70–0.97, and SI‑SDR improvements from ‑13.28 dB (unprocessed) to +8.66 dB for the partner speaker.
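The core of such a multi-channel simulation is convolving each dry source with one measured RIR per microphone and summing the speakers. A minimal sketch, assuming pre-loaded RIR arrays (the paper's actual pipeline, loss weighting, and noise handling are not reproduced here):

```python
import numpy as np

def simulate_array(source, rirs):
    """Simulate a multi-microphone capture by convolving a dry mono
    source with one room impulse response per microphone.
    `rirs` has shape (num_mics, rir_len)."""
    num_mics, rir_len = rirs.shape
    out_len = len(source) + rir_len - 1
    mics = np.zeros((num_mics, out_len))
    for m in range(num_mics):
        mics[m] = np.convolve(source, rirs[m])
    return mics

def mix_speakers(signals):
    """Sum several multi-channel speaker images into one mixture,
    zero-padding to the longest signal."""
    out_len = max(s.shape[1] for s in signals)
    mix = np.zeros((signals[0].shape[0], out_len))
    for s in signals:
        mix[:, :s.shape[1]] += s
    return mix
```

For the glasses geometry, the wearer's RIRs would come from the near-field mouth position and the partner's from one of the five frontal directions.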
Performance is measured with a modified word error rate (WER) that penalizes both transcription errors and speaker-attribution (SA) mistakes, as well as BLEU for translation. The cascaded system (SS+SLM) attains the lowest overall WER (10.2–12.5 %) and near-zero SA error for the wearer, while the SOT-based system (SOT+SLM) shows slightly higher SA error for the partner (up to 2.5 %) and higher WER caused by occasional confusion between transcription and translation prompts. Nevertheless, both systems outperform two strong baselines: a multi-channel RNN-T ASR model and the JST-AR streaming speech-translation model. In the "partner speaks Spanish" scenario, SS+SLM raises BLEU from 18.3 (baseline) to 25.3, and SOT+SLM improves it to 22.6. Similar gains are observed across the other language pairs.
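One simple way to fold speaker attribution into WER, shown here as an illustrative approximation rather than the paper's exact metric, is to score (speaker, word) pairs with standard edit distance: a hypothesis word only matches when both the word and its speaker tag agree with the reference.

```python
def sa_wer(ref, hyp):
    """Speaker-attributed WER sketch: Levenshtein distance over
    (speaker, word) pairs divided by reference length, so a correctly
    transcribed word attributed to the wrong speaker still counts as
    an error. Illustrative only; not the paper's exact formulation."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i          # deletions
    for j in range(m + 1):
        d[0][j] = j          # insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution/match
    return d[n][m] / max(n, 1)
```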
Key insights include: (1) explicit directional processing via source separation dramatically reduces speaker‑attribution errors, enabling LLMs to handle multi‑talker contexts; (2) SOT provides a principled way to model overlapping speech without a separate separation front‑end, but current prompt designs cause occasional instruction ambiguity; (3) LoRA‑based fine‑tuning is an efficient method to adapt large multimodal LLMs to multi‑channel audio with minimal parameter updates; (4) streaming inference can be approximated by chunking audio into 600 ms windows and maintaining a sliding‑window text context (max 30 s audio, 50 recent words), though alignment between audio and text remains imperfect.
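The sliding-window context pruning from insight (4) can be sketched as follows, keeping at most 30 s of recent audio and the 50 most recent decoded words. The data layout (a list of audio chunks plus a word list) is an assumption for illustration.

```python
def prune_context(audio_chunks, words, sample_rate=16000,
                  max_audio_s=30.0, max_words=50):
    """Keep only the most recent audio (up to 30 s) and the 50 most
    recent decoded words as streaming context, per the sliding-window
    scheme described above. `audio_chunks` is a list of sample arrays
    in arrival order."""
    max_samples = int(max_audio_s * sample_rate)
    total = 0
    kept_audio = []
    # Walk chunks newest-first until the audio budget is exhausted.
    for chunk in reversed(audio_chunks):
        if total + len(chunk) > max_samples:
            break
        kept_audio.append(chunk)
        total += len(chunk)
    kept_audio.reverse()
    return kept_audio, words[-max_words:]
```

Because audio and text are truncated independently, the retained words may not align exactly with the retained audio, which matches the imperfect-alignment caveat noted above.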
The authors acknowledge limitations: the cascaded approach cannot process overlapping speech, while the SOT approach suffers from instruction‑following errors that affect speaker attribution and cause higher deletion/insertion rates. Future work may explore more robust prompt engineering, joint training of separation and language models, and tighter audio‑text synchronization mechanisms.
In summary, this work presents the first comprehensive study of equipping LLMs with directional multi‑talker speech understanding for smart‑glasses applications, offering both a source‑separation‑driven pipeline and an end‑to‑end serialized‑output pipeline, and demonstrates that both substantially improve ASR and speech‑translation performance in realistic, spatially‑aware, multi‑speaker scenarios.