Dual-Strategy-Enhanced ConBiMamba for Neural Speaker Diarization

Notice: This research summary and analysis were automatically generated using AI technology. For accuracy, please refer to the original arXiv source.

Conformer and Mamba have achieved strong performance in speech modeling but face limitations in speaker diarization. Mamba is efficient but struggles with local details and nonlinear patterns, while Conformer’s self-attention incurs high memory overhead on long speech sequences and can be unstable when modeling long-range dependencies. These limitations are critical for diarization, which requires both precise modeling of local variations and robust speaker consistency over extended spans. To address these challenges, we apply ConBiMamba to speaker diarization for the first time. Following the Pyannote pipeline, we propose the Dual-Strategy-Enhanced ConBiMamba neural speaker diarization system. ConBiMamba combines the strengths of Conformer and Mamba: Conformer’s convolutional and feed-forward structures improve local feature extraction, while replacing Conformer’s self-attention with ExtBiMamba lets the model handle long audio sequences efficiently and alleviates the high memory cost of self-attention. Furthermore, to counter the elevated DER around speaker change points, we introduce the Boundary-Enhanced Transition Loss, which sharpens the detection of speaker change points, and we propose Layer-wise Feature Aggregation to make better use of multi-layer representations. The system is evaluated on six diarization datasets and achieves state-of-the-art performance on four of them. The source code of our study is available at https://github.com/lz-hust/DSE-CBM.


💡 Research Summary

This paper presents a dual‑strategy enhancement of the ConBiMamba architecture for neural speaker diarization, achieving state‑of‑the‑art results on four out of six widely used diarization benchmarks. The authors first adopt ConBiMamba, a hybrid model that replaces the self‑attention module of the Conformer with ExtBiMamba, a bidirectional state‑space model that captures long‑range dependencies with linear computational complexity. To compensate for Mamba’s known weakness in modeling fine‑grained local patterns, the original Conformer convolutional block is redesigned as a multi‑branch depthwise convolution with kernel sizes 15, 31, and 63, providing multi‑scale temporal perception.
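As a rough illustration of the multi-branch convolution described above, the NumPy sketch below runs three depthwise (per-channel) branches with kernel sizes 15, 31, and 63 over a feature sequence and averages them. The random filter values and the branch-averaging are assumptions for illustration only; in the paper the filters are learned inside the Conformer convolution module.

```python
import numpy as np

def multi_branch_dwconv(x, kernels):
    """Multi-scale depthwise convolution over time.

    x: (T, D) feature sequence; kernels: dict mapping kernel size -> (k, D)
    per-channel filters. Each branch convolves every channel independently
    ('same' padding), and the branch outputs are averaged (illustrative choice).
    """
    T, D = x.shape
    out = np.zeros_like(x)
    for k, w in kernels.items():
        branch = np.stack(
            [np.convolve(x[:, d], w[:, d], mode="same") for d in range(D)],
            axis=1,
        )
        out += branch
    return out / len(kernels)

# Toy filters for the three kernel sizes named in the summary (15, 31, 63).
rng = np.random.default_rng(0)
ks = {k: rng.standard_normal((k, 4)) * 0.1 for k in (15, 31, 63)}
y = multi_branch_dwconv(rng.standard_normal((100, 4)), ks)
```

The three kernel sizes give the block short-, mid-, and long-range temporal receptive fields in parallel, which is what compensates for Mamba's weakness on fine-grained local patterns.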

Two complementary strategies are introduced. The first is Layer‑wise Feature Aggregation (LFA), which learns scalar importance weights for each of the seven ConBiMamba layers, masks out early layers, normalizes the weights with a softmax, and then aggregates the selected layer outputs via a weighted sum, layer‑norm, and dropout. Experiments show that aggregating the last three layers yields the lowest diarization error rate (DER), confirming that deep representations combined with mid‑level features are most beneficial.
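The LFA steps above (scalar weights, masking of early layers, softmax normalization, weighted sum, layer norm) can be sketched as follows. Function and variable names are illustrative, and dropout is omitted; the paper's configuration corresponds to seven layers with the last three kept.

```python
import numpy as np

def layerwise_feature_aggregation(layer_outputs, weights, keep_last=3):
    """Aggregate encoder layer outputs with learned scalar weights.

    layer_outputs: list of (T, D) arrays, one per encoder layer.
    weights: (L,) raw importance scores (learned parameters in the paper).
    keep_last: number of final layers to aggregate; earlier layers are masked.
    """
    L = len(layer_outputs)
    mask = np.full(L, -np.inf)
    mask[L - keep_last:] = 0.0                 # mask out early layers
    scores = weights + mask
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                       # softmax over the kept layers
    stacked = np.stack(layer_outputs)          # (L, T, D)
    agg = np.tensordot(probs, stacked, axes=1) # weighted sum -> (T, D)
    # Layer norm over the feature dimension (the paper also applies dropout).
    mu = agg.mean(-1, keepdims=True)
    sigma = agg.std(-1, keepdims=True)
    return (agg - mu) / (sigma + 1e-5)
```

Masking with negative infinity before the softmax guarantees that excluded early layers receive exactly zero weight, so gradient flow concentrates on the selected deep and mid-level layers.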

The second strategy is the Boundary‑Enhanced Transition Loss (BETL). An auxiliary binary task predicts speaker change points by comparing frame‑wise speaker activity labels. Because change points are sparse, the loss adopts a focal‑loss formulation with a dynamic positive‑sample weight α set to the observed positive‑sample ratio r, and a focusing parameter γ = 2. BETL is combined with the standard permutation‑invariant training (PIT) loss using a weighting factor λ = 0.5, encouraging the model to explicitly learn boundary information rather than relying on implicit supervision. Ablation studies demonstrate that removing BETL degrades DER by 0.3–1.5 % and increases false‑alarm and miss rates around speaker transitions.
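A minimal sketch of BETL as described above: change-point labels are derived by comparing adjacent frames of the speaker-activity matrix, and a focal binary cross-entropy with γ = 2 is added to the PIT loss with λ = 0.5. Setting α to the observed positive-sample ratio r follows this summary's wording; the paper's exact class-weighting convention may differ, and all names here are illustrative.

```python
import numpy as np

def change_labels(activity):
    """Derive binary change-point labels from frame-wise speaker activity.

    activity: (T, S) 0/1 matrix (frames x speakers). A frame is a change
    point when any speaker's activity differs from the previous frame.
    """
    diff = np.abs(np.diff(activity, axis=0)).max(axis=1)
    return np.concatenate([[0], diff])

def focal_bce(p, y, alpha, gamma=2.0, eps=1e-8):
    """Focal binary cross-entropy for the sparse change-point task."""
    pos = -alpha * (1 - p) ** gamma * y * np.log(p + eps)
    neg = -(1 - alpha) * p ** gamma * (1 - y) * np.log(1 - p + eps)
    return (pos + neg).mean()

def total_loss(pit_loss, p, y, lam=0.5, gamma=2.0):
    """Combine the PIT loss with the boundary loss via weighting factor lam."""
    r = y.mean()  # dynamic alpha from the observed positive-sample ratio
    return pit_loss + lam * focal_bce(p, y, alpha=r, gamma=gamma)
```

The (1 − p)^γ and p^γ factors downweight easy frames, so the scarce change-point frames dominate the auxiliary gradient signal.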

The system follows the Pyannote pipeline: frozen WavLM Base+ extracts 768‑dim acoustic features, which are linearly projected to 256 dimensions and fed into a seven‑layer ConBiMamba encoder. An ECAPA‑TDNN model provides speaker embeddings for agglomerative hierarchical clustering; clustering thresholds and minimum cluster sizes are tuned via Bayesian optimization. Training proceeds in two stages: (1) a compound dataset comprising six public corpora (AISHELL‑4, MagicData‑RAMC, VoxConverse, MSDWild, AMI channel 1, AliMeeting) and a large simulated four‑speaker dataset is used to train a generic model with 20‑second segments; (2) the model is fine‑tuned on each target dataset with segment lengths of 10, 20, or 30 seconds.
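The clustering stage can be sketched with SciPy's agglomerative hierarchical clustering over segment-level speaker embeddings. The cosine metric, average linkage, threshold, and minimum-cluster-size handling below are illustrative stand-ins for the Bayesian-optimized values mentioned above, not the paper's settings.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def cluster_speakers(embeddings, threshold=0.7, min_cluster_size=2):
    """Agglomerative hierarchical clustering of speaker embeddings.

    embeddings: (N, D) array, e.g. one ECAPA-TDNN embedding per local segment.
    threshold / min_cluster_size are illustrative defaults (tuned via
    Bayesian optimization in the actual pipeline).
    """
    Z = linkage(embeddings, method="average", metric="cosine")
    labels = fcluster(Z, t=threshold, criterion="distance")
    # Reassign clusters smaller than min_cluster_size to the nearest big cluster.
    ids, counts = np.unique(labels, return_counts=True)
    big = ids[counts >= min_cluster_size]
    if len(big) and len(big) < len(ids):
        centroids = {i: embeddings[labels == i].mean(0) for i in big}
        for n in np.where(np.isin(labels, big, invert=True))[0]:
            sims = {i: embeddings[n] @ c
                    / (np.linalg.norm(embeddings[n]) * np.linalg.norm(c))
                    for i, c in centroids.items()}
            labels[n] = max(sims, key=sims.get)
    return labels
```

Cutting the dendrogram at a distance threshold (rather than fixing the cluster count) lets the number of speakers emerge from the data, which is why the threshold is worth tuning per dataset.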

Results (DER with 0‑second collar) show the proposed system achieving 9.8 % on AISHELL‑4, 10.9 % on RAMC, 8.6 % on VoxConverse, 19.2 % on MSDWild, and 14.9 % on AMI channel 1, outperforming Pyannote‑AI, Diarizen (both frozen and fine‑tuned variants), and the previously best Mamba‑diarization system. Detailed error analysis reveals substantial reductions in false‑alarm and miss rates at speaker change points, confirming the effectiveness of BETL.
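For reference, DER with a 0-second collar is the sum of missed speech, false alarm, and speaker confusion divided by total reference speech time. A frame-level sketch (assuming reference and hypothesis speakers are already optimally mapped; the real metric also searches that mapping):

```python
import numpy as np

def frame_der(ref, hyp):
    """Frame-level diarization error rate (0-second collar).

    ref, hyp: (T, S) binary speaker-activity matrices on a shared frame grid,
    with hypothesis speakers already mapped to reference speakers.
    """
    n_ref = ref.sum(1)                      # active reference speakers per frame
    n_hyp = hyp.sum(1)                      # active hypothesis speakers per frame
    correct = np.minimum(ref, hyp).sum(1)   # correctly attributed speech
    miss = np.maximum(n_ref - n_hyp, 0).sum()
    fa = np.maximum(n_hyp - n_ref, 0).sum()
    conf = (np.minimum(n_ref, n_hyp) - correct).sum()
    return (miss + fa + conf) / max(n_ref.sum(), 1)
```

Decomposing DER this way is what makes the boundary-focused error analysis possible: false alarms and misses can be counted separately within windows around reference speaker change points.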

The authors acknowledge limitations: overlapping speech is not explicitly modeled, which hampers performance on highly overlapped datasets such as AliMeeting; and while fixing the WavLM encoder simplifies training, joint fine‑tuning yields further gains, suggesting future work should explore end‑to‑end optimization of the feature extractor together with ConBiMamba. Additionally, extremely long recordings may still challenge GPU memory despite ExtBiMamba’s linear scaling.

In summary, by integrating multi‑scale local convolutions, bidirectional state‑space modeling, layer‑wise feature aggregation, and a dedicated boundary loss, the Dual‑Strategy‑Enhanced ConBiMamba system delivers robust speaker diarization across diverse languages and recording conditions, setting a new benchmark for future research.

