O-EENC-SD: Efficient Online End-to-End Neural Clustering for Speaker Diarization
We introduce O-EENC-SD: an end-to-end online speaker diarization system based on EEND-EDA, featuring a novel RNN-based stitching mechanism for online prediction. In particular, we develop a novel centroid refinement decoder whose usefulness is assessed through a rigorous ablation study. Our system provides key advantages over existing methods: it is hyperparameter-free, unlike unsupervised clustering approaches, and it is a more efficient alternative to current online end-to-end methods, which are computationally costly. We demonstrate that O-EENC-SD is competitive with the state of the art in the two-speaker conversational telephone speech domain, as evaluated on the CallHome dataset. Our results show that O-EENC-SD provides a favorable trade-off between diarization error rate (DER) and complexity, even when working on independent, non-overlapping chunks, making the system highly efficient.
💡 Research Summary
The paper introduces O‑EENC‑SD, an online speaker diarization system that builds on the EEND‑EDA architecture and incorporates a novel RNN‑based stitching mechanism together with two transformer‑based refinement decoders. The core idea is to process audio in short, non‑overlapping chunks, obtain frame‑level embeddings and attractors from EEND‑EDA, and then resolve the permutation problem across chunks by means of an online neural clustering module. This module uses a set of gated recurrent units (GRUs), each representing a speaker centroid; the hidden state of each GRU is updated only when a newly estimated attractor is assigned to that centroid. Assignment is performed via a softmax over all existing centroids plus a special “average speaker” vector h₀, and is supervised by a cross‑entropy loss (L_cluster_CE) as well as a diarization loss applied to the stitched output (L_cluster_diar).
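The centroid-tracking logic described above can be sketched in a few lines of numpy. This is a simplified illustration under stated assumptions: the GRU weights are random stand-ins for the trained centroid-update network, and plain dot-product similarity replaces the paper's learned softmax assignment; the structure (one GRU hidden state per speaker, updated only on assignment, with h₀ spawning new speakers) follows the summary.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # embedding dimension (illustrative; not the paper's value)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def make_gru_params(d, scale=0.1):
    """Random GRU weights standing in for the trained centroid-update GRU."""
    return {k: rng.normal(0.0, scale, (d, d))
            for k in ("Wz", "Uz", "Wr", "Ur", "Wn", "Un")}

def gru_update(h, x, p):
    """Standard GRU cell: the centroid h is updated only when an
    attractor x is assigned to it."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h)        # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h)        # reset gate
    n = np.tanh(p["Wn"] @ x + p["Un"] @ (r * h))  # candidate state
    return (1.0 - z) * n + z * h

def assign(attractor, centroids, h0):
    """Softmax over existing centroids plus the 'average speaker' h0.
    Dot-product similarity is an assumption for illustration."""
    cands = np.stack(centroids + [h0])
    logits = cands @ attractor
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(np.argmax(probs)), probs

# Toy streaming loop over per-chunk attractors (fake random inputs).
params = make_gru_params(D)
h0 = np.zeros(D)   # special "average speaker" vector
centroids = []     # one GRU hidden state per discovered speaker

for attractor in rng.normal(0.0, 1.0, (6, D)):
    idx, probs = assign(attractor, centroids, h0)
    if idx == len(centroids):   # assigned to h0 -> new speaker appears
        centroids.append(gru_update(h0, attractor, params))
    else:                       # existing speaker: update its centroid GRU
        centroids[idx] = gru_update(centroids[idx], attractor, params)
```

In training, the softmax assignment probabilities would be supervised by L_cluster_CE, and the stitched frame-level output by L_cluster_diar; here only the inference-time bookkeeping is shown.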
Two refinement decoders are added to improve the quality of the attractors and centroids before stitching. The attractor refinement decoder re‑processes the attractors using only the current chunk’s frame embeddings, keeping computation lightweight. The newly proposed centroid refinement decoder introduces a “ghost speaker” embedding and applies cross‑attention between the previous centroids and the concatenated set of ghost speaker and current attractors. This allows active speakers’ centroids to be nudged toward their correct attractors while providing a fallback option for non‑active speakers, thereby reducing mismatches when speakers appear or disappear.
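The centroid refinement step can be illustrated with a minimal cross-attention sketch: previous centroids act as queries, while the ghost speaker concatenated with the current chunk's attractors serves as keys and values. Single-head attention without learned projections is an assumption for brevity; the actual decoder is transformer-based.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 8   # embedding dimension (illustrative)
S = 2   # previously tracked speaker centroids
M = 2   # attractors estimated from the current chunk

def cross_attention(Q, K, V):
    """Single-head scaled dot-product cross-attention."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V, w

prev_centroids = rng.normal(0.0, 1.0, (S, D))  # queries
attractors = rng.normal(0.0, 1.0, (M, D))      # current chunk's attractors
ghost = np.zeros((1, D))                       # "ghost speaker" fallback embedding

# Keys/values: ghost speaker prepended to the current attractors, so a
# centroid whose speaker is silent in this chunk can attend to the ghost
# instead of being pulled toward an unrelated attractor.
keys = np.concatenate([ghost, attractors])
refined_centroids, attn = cross_attention(prev_centroids, keys, keys)
```

Each row of `attn` shows how strongly a previous centroid attends to the ghost (column 0) versus each current attractor, which is the mechanism the summary describes for handling speakers that appear or disappear.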
Training proceeds in two stages. First, a pre‑training phase uses simulated telephone conversations generated from the CallHome and CallFriend corpora. Stereo channels are filtered for single‑speaker segments, then randomly mixed to create multi‑speaker dialogues, preserving language consistency. The model is initially trained on 10‑second audio split into ten non‑overlapping 1‑second sub‑chunks, which helps the network learn robust chunk‑level representations. In the fine‑tuning stage, the real CallHome development set is used, with various buffer sizes (from 5 s to 100 s) and latencies (1 s to 5 s) to study the trade‑off between real‑time constraints and performance. The total loss combines global and chunk‑level EEND‑EDA losses (weighted by a factor of 10 for the chunk loss) with the two clustering losses, yielding four terms that are jointly optimized.
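The four-term objective described above can be written out directly. The factor of 10 on the chunk-level loss comes from the summary; unit weights on the two clustering terms are an assumption, since the summary does not state them.

```python
def total_loss(l_diar_global, l_diar_chunk, l_cluster_ce, l_cluster_diar,
               chunk_weight=10.0):
    """Jointly optimized objective: global and chunk-level EEND-EDA
    diarization losses plus the two clustering losses. Only the
    chunk-loss weight of 10 is given in the text."""
    return (l_diar_global
            + chunk_weight * l_diar_chunk
            + l_cluster_ce
            + l_cluster_diar)

print(total_loss(0.5, 0.1, 0.3, 0.2))  # prints 2.0
```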
Experiments on the two‑speaker CallHome test set (using a 0.25 s collar) demonstrate that O‑EENC‑SD achieves state‑of‑the‑art diarization error rates (DER) while requiring a far smaller computational budget than previous online methods. With a 100‑second FIFO buffer and 5‑second latency, the system reaches 9.33 % DER; with a 10‑second buffer and 10‑second latency, DER is 9.50 %. Even when the buffer size equals the latency (i.e., no overlap between chunks), the model attains 12.47 % DER, outperforming the earlier EEND‑EDA+FW‑STB baseline (12.70 %). Ablation studies show that the centroid refinement decoder contributes the most to the performance gains (reducing DER from 15.38 % to 12.69 % in a baseline configuration), while the attractor refinement decoder offers a modest improvement.
A notable observation is that models trained with higher latency (e.g., 5 s) generalize better even when evaluated at lower latency, suggesting that longer context during training yields more accurate attractor predictions, which in turn eases the stitching process. The paper also highlights that the proposed method is hyperparameter‑free with respect to clustering, unlike traditional unsupervised clustering approaches that require careful tuning of distance thresholds and estimates of the number of clusters.
In summary, O‑EENC‑SD provides a compact, efficient, and accurate solution for online speaker diarization. Its ability to operate on non‑overlapping chunks makes it suitable for edge devices and real‑time streaming applications where memory and processing power are limited. Future work could extend the approach to more than two speakers, explore multilingual scenarios, and integrate the entire pipeline onto low‑power hardware for on‑device diarization.