Cross-Attention End-to-End ASR for Two-Party Conversations


We present an end-to-end speech recognition model that learns the interaction between two speakers from turn-changing information. Unlike conventional speech recognition models, our model exploits both speakers’ history of conversational context spanning multiple turns within an end-to-end framework. Specifically, we propose a speaker-specific cross-attention mechanism that attends to the other speaker’s output as well as the current speaker’s to better recognize long conversations. We evaluate the models on the Switchboard conversational speech corpus and show that our model outperforms standard end-to-end speech recognition models.


💡 Research Summary

The paper introduces a novel end‑to‑end automatic speech recognition (ASR) architecture designed specifically for two‑speaker conversations. Traditional end‑to‑end ASR systems typically treat each utterance in isolation, relying on an external language model (LM) trained only on text to capture long‑range contextual information. This separation limits the ability to exploit conversational cues such as turn‑taking, speaker identity, and the semantic flow that naturally occurs in dialog.

Motivation and Related Work
Previous work on dialog‑aware ASR has incorporated conversational context by concatenating or averaging embeddings of preceding utterances, or by using phrase‑list attention for task‑specific vocabularies (e.g., song titles). However, these approaches either ignore speaker differentiation or require external resources unavailable at inference time. Text‑only recurrent LM approaches have modeled speaker interactions but remain detached from the acoustic model, preventing joint optimization.

Proposed Architecture
The authors propose three key components:

  1. Utterance Encoder – Each past utterance is transformed into a fixed‑dimensional vector using a BERT‑based sentence embedding. This richer representation replaces simpler one‑hot averages or word2vec embeddings.

  2. Speaker‑Specific History Queues – Two separate queues store the embeddings of utterances spoken by Speaker A and Speaker B, respectively. Turn‑changing information is assumed to be known a priori, allowing the system to maintain distinct histories for each participant.

  3. Cross‑Attention Mechanism – Two variants are explored:

    • Simple Attention – A standard attention layer computes a weighted sum over each speaker’s history, producing a speaker‑specific contextual vector.
    • matchLSTM‑Based Cross‑Attention – Inspired by matchLSTM used in question‑answering, the model treats the other speaker’s history as a “question” and the current speaker’s history as a “passage.” At each time step, an LSTM cell attends to the opposite speaker’s embeddings, thereby capturing sequential interactions and longer‑range dependencies. The final hidden state serves as the contextual embedding for the decoder.
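The queue-plus-attention flow described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the dot-product scoring, the embedding dimension, and the random stand-ins for BERT sentence embeddings are all assumptions made for the example.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def simple_attention(query, history):
    """Simple-attention variant: weighted sum over one speaker's history.

    query:   (d,) state used to score past utterances
    history: (T, d) stacked sentence embeddings for one speaker
    """
    scores = history @ query            # dot-product scoring (one plausible choice)
    weights = softmax(scores)           # attention weights over past utterances
    return weights @ history            # speaker-specific contextual vector

rng = np.random.default_rng(0)
d = 8  # toy embedding size; the paper uses BERT-sized sentence embeddings

# Two separate queues, one per speaker, filled as turns change.
history_a = np.stack([rng.standard_normal(d) for _ in range(3)])  # Speaker A
history_b = np.stack([rng.standard_normal(d) for _ in range(4)])  # Speaker B

query = rng.standard_normal(d)
ctx_a = simple_attention(query, history_a)
ctx_b = simple_attention(query, history_b)
print(ctx_a.shape, ctx_b.shape)  # (8,) (8,)
```

Keeping the two histories in separate structures is what makes the attention "speaker-specific": the same scoring function yields one contextual vector per participant.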

These contextual embeddings (att_e) are fed, together with the acoustic embedding from the CNN‑BLSTM encoder and the previous word embedding, into a joint CTC/Attention decoder. The additional attention parameters amount to roughly 2 million, raising the total model size from 32 M to 34 M parameters – a modest increase.
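The matchLSTM-style variant and the decoder fusion can be sketched as follows. To keep the example self-contained, a plain tanh recurrence stands in for the LSTM cell, and the weight matrix, dimensions, and stand-in acoustic and word embeddings are illustrative assumptions, not the paper's parameters.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def matchlstm_cross_attention(own_hist, other_hist, d, seed=1):
    """Simplified matchLSTM-style pass: walk the current speaker's history
    (the "passage") while attending to the other speaker's history (the
    "question"). A tanh recurrence replaces the LSTM cell for brevity.
    """
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((d, 2 * d)) * 0.1  # illustrative recurrence weights
    h = np.zeros(d)
    for u in own_hist:                          # one step per past utterance
        scores = other_hist @ u                 # score the other speaker's turns
        ctx = softmax(scores) @ other_hist      # cross-speaker context
        h = np.tanh(W @ np.concatenate([u, ctx]))
    return h                                    # final state = contextual embedding att_e

d = 8
rng = np.random.default_rng(2)
own_hist = rng.standard_normal((3, d))    # current speaker's utterance embeddings
other_hist = rng.standard_normal((4, d))  # interlocutor's utterance embeddings

att_e = matchlstm_cross_attention(own_hist, other_hist, d)

# Decoder input fusion: contextual + acoustic + previous-word embeddings.
acoustic = rng.standard_normal(d)   # stand-in for the CNN-BLSTM encoder output
prev_word = rng.standard_normal(d)  # stand-in for the previous word embedding
decoder_input = np.concatenate([att_e, acoustic, prev_word])
print(decoder_input.shape)  # (24,)
```

The sequential loop is the key difference from the simple-attention variant: each step conditions on both the current utterance and the opposite speaker's attended context, so the final state reflects the order of the exchange.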

Experimental Setup
The system is trained on 300 hours of the Switchboard corpus (≈2,400 dialogs) and evaluated on the HUB5 Eval2000 test sets (Switchboard and CallHome). Acoustic features consist of 80‑dimensional log‑mel filterbanks plus 3 pitch dimensions. The baseline follows a standard joint CTC/Attention configuration with a 6‑layer CNN, 6‑layer BLSTM encoder (320 cells per direction), and a 2‑layer LSTM decoder (300 cells). Training uses AdaDelta with gradient clipping, and a two‑stage pre‑training pipeline: baseline → vanilla conversational model → proposed cross‑attention model. Decoding employs beam search (beam = 10) with a length penalty of 0.1.

Results
Table 2 shows word error rates (WER) for various history lengths (6, 10, 20 utterances) and both attention variants. The matchLSTM model with a 20‑utterance history achieves the best performance: 16.4 % WER on Switchboard and 29.8 % on CallHome, outperforming the baseline (17.9 % / 30.6 %) and the simple attention variant (16.6 % / 30.0 %). The improvement is consistent across history lengths, indicating that the cross‑attention mechanism effectively leverages conversational context. Visualizations of attention weights reveal that the model focuses on longer, semantically rich utterances while down‑weighting filler words such as “oh” or “yeah.”

Analysis and Future Directions
The superiority of matchLSTM suggests that modeling the interaction between speakers sequentially captures the dialog flow better than a static weighted average. The authors note that extending the approach to word‑level or sub‑word‑level histories could provide finer granularity. Moreover, handling more than two speakers, integrating external language models, and applying the method in streaming or low‑latency scenarios are identified as promising research avenues.

Conclusion
By introducing speaker‑specific cross‑attention that jointly considers what the current speaker says and what the interlocutor previously said, the paper demonstrates a tangible reduction in WER for two‑party conversational speech. The architecture remains end‑to‑end trainable, adds only modest parameter overhead, and opens the door for richer context‑aware ASR systems that more closely mirror human conversational processing.
