Multilingual and Continuous Backchannel Prediction: A Cross-lingual Study
We present a multilingual, continuous backchannel prediction model for Japanese, English, and Chinese, and use it to investigate cross-linguistic timing behavior. The model is Transformer-based and operates at the frame level, jointly trained with auxiliary tasks on approximately 300 hours of dyadic conversations. Across all three languages, the multilingual model matches or surpasses monolingual baselines, indicating that it learns both language-universal cues and language-specific timing patterns. Zero-shot transfer with two-language training remains limited, underscoring substantive cross-lingual differences. Perturbation analyses reveal distinct cue usage: Japanese relies more on short-term linguistic information, whereas English and Chinese are more sensitive to silence duration and prosodic variation; multilingual training encourages shared yet adaptable representations and reduces overreliance on pitch in Chinese. A context-length study further shows that Japanese is relatively robust to shorter contexts, while Chinese benefits markedly from longer contexts. Finally, we integrate the trained model into real-time processing software, demonstrating CPU-only inference. Together, these findings provide a unified model and empirical evidence for how backchannel timing differs across languages, informing the design of more natural, culturally aware spoken dialogue systems.
💡 Research Summary
This paper tackles the problem of predicting the timing of listener backchannels (“uh‑huh”, “yeah”, etc.) in spoken dialogue, extending prior work from single‑language settings to a truly multilingual scenario. The authors collect a large‑scale corpus of dyadic conversations in Japanese, English, and Mandarin Chinese, amounting to roughly 300 hours of audio (≈ 100 h per language). Conversations were recorded via Zoom, segmented into Inter‑Pausal Units using a 200 ms silence threshold, and transcribed with Whisper models (Kotoba‑Whisper for Japanese, Whisper‑large for English and Chinese). A manually curated list of backchannel forms (both “continuers” and “assessments”) was verified by native speakers, and consecutive backchannels were merged. Statistics show that Japanese exhibits the highest backchannel rate (34.4 % of utterances, 16.5 % of total time) and the largest proportion of backchannels occurring during the speaker’s utterance (69.4 %). English and Chinese have lower rates (≈ 28 % of utterances) and more backchannels after the speaker finishes (≈ 60 % for English, 52 % for Chinese).
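The preprocessing steps above can be sketched as follows. This is a minimal illustration of segmenting a timestamped transcript into Inter-Pausal Units with the 200 ms silence threshold and merging consecutive backchannels; the token format and the (tiny) backchannel list are our own assumptions, not the paper's curated inventory.

```python
# Sketch: IPU segmentation with a 200 ms silence threshold, plus merging of
# consecutive backchannel-only IPUs. Token format and BACKCHANNEL_FORMS are
# illustrative placeholders.

SILENCE_THRESHOLD = 0.2  # seconds of silence that closes an IPU
BACKCHANNEL_FORMS = {"uh-huh", "yeah", "mm"}  # hypothetical subset

def segment_ipus(tokens):
    """tokens: list of (word, start_sec, end_sec), sorted by start time."""
    ipus, current = [], []
    for word, start, end in tokens:
        # A gap longer than the threshold closes the current IPU.
        if current and start - current[-1][2] > SILENCE_THRESHOLD:
            ipus.append(current)
            current = []
        current.append((word, start, end))
    if current:
        ipus.append(current)
    return ipus

def merge_consecutive_backchannels(ipus):
    """Merge adjacent IPUs that consist solely of backchannel forms."""
    merged = []
    for ipu in ipus:
        is_bc = all(w in BACKCHANNEL_FORMS for w, _, _ in ipu)
        if merged and is_bc and merged[-1][1]:
            merged[-1] = (merged[-1][0] + ipu, True)
        else:
            merged.append((ipu, is_bc))
    return [ipu for ipu, _ in merged]
```

With word-level timestamps from Whisper, this pipeline yields the unit inventory over which the backchannel statistics above are computed.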
The core contribution is a frame‑level (100 ms) continuous prediction model built on a Transformer architecture. Input consists of the separated waveforms of the two interlocutors. Each waveform is first encoded by a frozen Contrastive Predictive Coding (CPC) encoder pre‑trained on 60 k h of Libri‑light data, producing 500 ms embeddings. These embeddings are processed by separate speaker‑specific Transformers (one layer each) and then fused through three layers of cross‑attention Transformers to capture inter‑speaker dynamics. The shared representation feeds four linear heads: (1) Voice Activity Detection (VAD) for current speech/non‑speech, (2) Voice Activity Projection (VAP) predicting joint speaking states over the next 2 s (quantized into four bins, yielding 256 classes), (3) Backchannel Detection (BD) indicating whether the listener is currently producing a backchannel, and (4) Backchannel Prediction (BP) – the primary task – estimating the probability of a backchannel 0.5 s in the future. During training, annotated backchannel onsets are shifted forward by 0.5 s to create supervision for BP. The total loss is a weighted sum L = α₁L_VAD + α₂L_VAP + α₃L_BD + α₄L_BP, with α₁ = α₂ = 1 and α₃ = α₄ = 5, emphasizing the backchannel‑related objectives.
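The label construction and loss weighting described above can be sketched as follows. Backchannel onsets are shifted 0.5 s earlier along the time axis so that each frame is supervised to predict a backchannel half a second ahead, and the four task losses are combined with the stated weights. The frame rate constant and function names are our own; the weights follow the paper.

```python
# Sketch: BP target construction (0.5 s look-ahead) and the weighted
# multi-task loss L = a1*L_VAD + a2*L_VAP + a3*L_BD + a4*L_BP with
# alpha = (1, 1, 5, 5). Frame rate matches the 100 ms frame step.
import numpy as np

FRAME_RATE = 10                        # frames per second (100 ms frames)
SHIFT_FRAMES = int(0.5 * FRAME_RATE)   # 0.5 s prediction horizon

def bp_targets_from_bd(bd_labels):
    """Shift Backchannel Detection labels so that frame t is positive
    when a backchannel occurs at t + 0.5 s."""
    bd_labels = np.asarray(bd_labels)
    bp = np.zeros_like(bd_labels)
    bp[: len(bd_labels) - SHIFT_FRAMES] = bd_labels[SHIFT_FRAMES:]
    return bp

def total_loss(l_vad, l_vap, l_bd, l_bp, alphas=(1.0, 1.0, 5.0, 5.0)):
    """Weighted sum of the four task losses."""
    return sum(a * l for a, l in zip(alphas, (l_vad, l_vap, l_bd, l_bp)))
```

Upweighting L_BD and L_BP (5x) concentrates gradient signal on the backchannel objectives while VAD and VAP act as regularizing auxiliary tasks.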
Four models are trained: three monolingual models (each on a single language) and one multilingual model (trained on the combined data). All models share identical architecture, hyper‑parameters (AdamW, lr = 3.63 × 10⁻⁴, batch = 8, up to 25 epochs), and evaluation metric (frame‑level F1 with threshold 0.5). Results (Table 3) show that the multilingual model attains F1 scores of 33.69 (Japanese), 23.96 (English), and 22.65 (Chinese), matching or surpassing the corresponding monolingual baselines (33.27, 22.85, 21.37). This demonstrates that a single model can learn universal cues (e.g., pauses, turn‑taking signals) while still adapting to language‑specific timing patterns.
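A minimal sketch of the evaluation metric, frame-level F1 with a 0.5 decision threshold, assuming per-frame backchannel probabilities and binary reference labels as inputs:

```python
# Sketch: frame-level F1 at threshold 0.5, computed over per-frame
# backchannel probabilities against binary ground-truth labels.
import numpy as np

def frame_f1(probs, labels, threshold=0.5):
    pred = (np.asarray(probs) >= threshold).astype(int)
    labels = np.asarray(labels)
    tp = int(np.sum((pred == 1) & (labels == 1)))
    fp = int(np.sum((pred == 1) & (labels == 0)))
    fn = int(np.sum((pred == 0) & (labels == 1)))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```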
Zero‑shot transfer experiments (Table 4) where models are trained on any two languages and tested on the third reveal substantial performance drops (e.g., English‑Chinese training → Japanese test F1 = 8.02). The degradation underscores substantive cross‑linguistic differences: Japanese backchannels often occur mid‑utterance, whereas Chinese backchannels tend to follow clear utterance boundaries, and English lies in between. Hence, no language subsumes the others.
Ablation studies assess the contribution of auxiliary tasks. For monolingual models, removing any auxiliary loss yields only minor changes, sometimes even slight improvements, suggesting that each language can learn its own backchannel cues without heavy reliance on auxiliary supervision. In contrast, the multilingual model shows pronounced sensitivity: removing VAP loss causes the largest drops (up to –3.59 F1 points), indicating that turn‑taking dynamics are crucial for shared representation learning. Removing BD also harms performance, while surprisingly, dropping VAD slightly improves scores, perhaps because speech activity alone is not a strong predictor of backchannel timing.
Perturbation analyses further probe which acoustic cues the models exploit. Four manipulations are applied to test audio: pitch flattening, intensity flattening, pause removal, and cepstral liftering. Results (Tables 7 and 8) reveal language‑specific dependencies. Japanese models are most affected by intensity flattening and pause removal, implying reliance on short‑term energy cues and silence patterns. English and Chinese models are more sensitive to pitch flattening and especially to pause removal, reflecting a stronger dependence on prosodic contours and silence duration. Notably, the multilingual model exhibits reduced sensitivity to pitch in Chinese, suggesting that multilingual training mitigates over‑reliance on any single cue and promotes more balanced feature use.
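One of the four manipulations, intensity flattening, can be illustrated as below: each short frame of the waveform is rescaled so its RMS energy matches the global RMS, removing loudness variation while leaving other cues intact. The frame length and epsilon guard are our choices, not necessarily the paper's exact settings.

```python
# Sketch: intensity flattening as a test-time perturbation. Each 100 ms
# frame is rescaled to the global RMS energy, removing loudness contours.
import numpy as np

def flatten_intensity(wav, frame_len=1600, eps=1e-8):
    """wav: 1-D float array (e.g. 16 kHz audio; 1600 samples = 100 ms)."""
    wav = np.asarray(wav, dtype=np.float64)
    target_rms = np.sqrt(np.mean(wav ** 2) + eps)
    out = np.copy(wav)
    for start in range(0, len(wav), frame_len):
        frame = wav[start:start + frame_len]
        frame_rms = np.sqrt(np.mean(frame ** 2) + eps)
        out[start:start + frame_len] = frame * (target_rms / frame_rms)
    return out
```

Pitch flattening, pause removal, and cepstral liftering follow the same pattern: a targeted manipulation of one acoustic dimension, with the resulting F1 drop measuring how much the model relied on it.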
Finally, the authors integrate the trained model into a real‑time processing pipeline that runs on CPU only. Inference latency stays within a few tens of milliseconds, confirming feasibility for deployment in live spoken dialogue systems without GPU acceleration.
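The deployment setting can be sketched as a streaming loop: audio arrives in 100 ms chunks, a rolling context window is maintained, and per-frame inference latency is measured. The context length and the `dummy_model` stand-in are our assumptions; the real system runs the trained Transformer in place of the placeholder.

```python
# Sketch: streaming CPU inference loop with a rolling audio context and
# per-frame latency measurement. The model call is a placeholder.
import time
from collections import deque

FRAME_SEC = 0.1        # 100 ms frame step
CONTEXT_FRAMES = 200   # e.g. 20 s rolling context (our choice)

def dummy_model(context):
    """Stand-in for the trained predictor; returns a BP probability."""
    return min(1.0, len(context) / CONTEXT_FRAMES)

def run_stream(chunks, model=dummy_model):
    context = deque(maxlen=CONTEXT_FRAMES)
    latencies, probs = [], []
    for chunk in chunks:
        context.append(chunk)
        t0 = time.perf_counter()
        probs.append(model(context))            # one inference per frame
        latencies.append(time.perf_counter() - t0)
    return probs, latencies
```

As long as each inference call completes well under the 100 ms frame budget, the predictor keeps pace with live audio, consistent with the reported tens-of-milliseconds CPU latency.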
Overall, the paper makes four key contributions: (1) a Transformer‑based, frame‑level multilingual backchannel predictor that jointly learns universal and language‑specific timing cues; (2) a comprehensive empirical comparison of backchannel behavior across Japanese, English, and Mandarin, quantifying differences in frequency, intra‑utterance vs. post‑utterance placement, and silence‑duration preferences; (3) an extensive analysis of auxiliary task importance and acoustic cue reliance, highlighting how multilingual training shapes shared representations; and (4) a demonstration of real‑time CPU‑only inference, paving the way for culturally aware, natural‑sounding spoken dialogue agents.