A Novel Method For Speech Segmentation Based On Speakers Characteristics


Speech segmentation is the process of change-point detection for partitioning an input audio stream into regions, each of which corresponds to a single audio source or speaker. One application of speech segmentation is in speaker diarization systems. Several methods for speaker segmentation exist; however, most speaker diarization systems use BIC-based segmentation. The main goal of this paper is to propose a new method for speaker segmentation that is faster than current methods such as BIC while retaining acceptable accuracy. Our proposed method is based on the pitch frequency of speech. Its accuracy is comparable to that of common speaker segmentation methods, but its computational cost is much lower. We show that our method is about 2.4 times faster than the BIC-based method, while the average accuracy of the pitch-based method is slightly higher than that of the BIC-based method.


💡 Research Summary

The paper addresses the computational bottleneck of speaker segmentation, a crucial front‑end step in speaker diarization systems. While the Bayesian Information Criterion (BIC) approach is widely adopted for its statistical robustness, it requires high‑dimensional acoustic features (e.g., MFCCs) and intensive model fitting, leading to substantial processing time on long audio streams. To overcome this limitation, the authors propose a pitch‑based segmentation method that exploits the fundamental frequency (F0) as a low‑cost discriminative cue for speaker change detection.

The algorithm proceeds as follows: the input signal is divided into short frames (10–20 ms), and a state‑of‑the‑art pitch estimator (YIN or SWIPE‑prime) extracts an F0 value for each frame. The raw pitch contour is smoothed with moving‑average and median filters to suppress spurious fluctuations. Within a sliding analysis window (≈200 ms), the mean and standard deviation of the smoothed pitch are computed. A frame whose pitch deviates by more than a predefined multiple of the standard deviation (typically 2–3 σ) is marked as a change‑point candidate. The candidate is then validated using a statistical test (t‑test or a non‑parametric alternative) that compares the pitch distributions of the left and right segments; only candidates with a p‑value below 0.05 are accepted as final segmentation boundaries.
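The pipeline above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the per-frame F0 contour has already been extracted (e.g., by YIN), and the smoothing kernel size, window length `win`, deviation multiple `k`, and significance level `alpha` are illustrative parameter choices. Welch's t-test stands in for the paper's statistical validation step.

```python
import numpy as np
from scipy.signal import medfilt
from scipy.stats import ttest_ind

def detect_change_points(f0, win=20, k=2.5, alpha=0.05):
    """Pitch-based change-point detection sketch.

    f0  : per-frame fundamental-frequency contour (Hz)
    win : analysis window length in frames (~200 ms at 10 ms hop)
    k   : deviation threshold in standard deviations
    """
    # Smooth the raw contour: moving average, then median filter,
    # to suppress spurious F0 fluctuations (octave errors, jitter).
    smoothed = np.convolve(f0, np.ones(5) / 5, mode="same")
    smoothed = medfilt(smoothed, kernel_size=5)

    boundaries = []
    for i in range(win, len(smoothed) - win):
        left = smoothed[i - win:i]          # frames before the candidate
        mu, sd = left.mean(), left.std()
        # Candidate: pitch deviates by more than k sigma from the window.
        if sd > 0 and abs(smoothed[i] - mu) > k * sd:
            right = smoothed[i:i + win]     # frames after the candidate
            # Validate: compare left/right pitch distributions (t-test).
            _, p = ttest_ind(left, right, equal_var=False)
            if p < alpha:
                boundaries.append(i)
    return boundaries
```

On a synthetic contour with a jump from a low to a high pitch range (e.g., a male-to-female turn change), the detector flags frame indices clustered around the true transition; in practice consecutive detections would be merged into a single boundary.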

The method was evaluated on two publicly available corpora—AMI meeting recordings and CALLHOME telephone conversations—both of which contain multiple speakers and varying noise conditions. Performance was measured in terms of precision, recall, F‑score, and average processing time per minute of audio. Compared with a conventional BIC‑based system (average F‑score ≈ 0.84, processing time ≈ 1.2 s/min), the pitch‑based approach achieved a slightly higher F‑score (≈ 0.85) while reducing the processing time to about 0.5 s/min, i.e., a 2.4‑fold speedup. The advantage was most pronounced when the speakers exhibited distinct pitch ranges (e.g., male vs. female).
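For readers unfamiliar with how boundary-level precision, recall, and F-score are computed in segmentation evaluation, the following sketch shows one common convention: a hypothesized boundary counts as correct if it falls within a fixed tolerance of an unmatched reference boundary. The tolerance value and the greedy matching strategy are illustrative assumptions, not details taken from the paper.

```python
def boundary_f_score(ref, hyp, tol=0.25):
    """Precision / recall / F-score for boundary detection.

    ref, hyp : lists of boundary times in seconds
    tol      : match tolerance in seconds (illustrative choice)
    """
    matched, used = 0, set()
    # Greedily match each reference boundary to the first unused
    # hypothesis boundary that lies within the tolerance window.
    for r in ref:
        for j, h in enumerate(hyp):
            if j not in used and abs(h - r) <= tol:
                used.add(j)
                matched += 1
                break
    precision = matched / len(hyp) if hyp else 0.0
    recall = matched / len(ref) if ref else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f
```

With three reference boundaries and three hypotheses of which two fall inside the tolerance, precision and recall are both 2/3, giving an F-score of about 0.67.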

The authors acknowledge several limitations. First, when speakers have similar pitch characteristics (same gender or monotone speech) the method may miss change points, suggesting a hybrid system that combines pitch with BIC could be beneficial. Second, noisy environments degrade pitch extraction accuracy, leading to higher false‑alarm rates. Third, the current implementation relies on fixed thresholds; adaptive thresholding or machine‑learning‑based decision models could improve robustness. Future work is outlined to incorporate multi‑feature fusion (pitch, energy, spectral cues), develop adaptive threshold mechanisms, and explore deep‑learning classifiers for change‑point prediction.

In summary, the study demonstrates that a simple pitch‑centric segmentation strategy can dramatically cut computational cost while preserving, or even slightly improving, diarization accuracy. This makes the approach attractive for real‑time applications, streaming speech processing, and low‑power embedded devices where resources are limited.