On the Use of Different Feature Extraction Methods for Linear and Non-Linear Kernels
Speech feature extraction has long been a key focus of robust speech recognition research, since it strongly affects recognition performance. In this paper, we first study a set of feature extraction methods, namely linear predictive coding (LPC), mel-frequency cepstral coefficients (MFCC), and perceptual linear prediction (PLP), together with several feature normalization techniques such as RASTA filtering and cepstral mean subtraction (CMS). Based on this, a comparative evaluation of these features is performed on the task of text-independent speaker identification using a combination of Gaussian mixture models (GMM) and support vector machines (SVM) with linear and non-linear kernels.
💡 Research Summary
The paper investigates how different speech feature extraction methods and normalization techniques affect the performance of a text‑independent speaker identification system that combines Gaussian Mixture Models (GMM) with Support Vector Machines (SVM). Three classic feature extraction algorithms are examined: Linear Predictive Coding (LPC), Mel‑Frequency Cepstral Coefficients (MFCC), and Perceptual Linear Prediction (PLP). For each algorithm, two widely used normalization procedures—RASTA filtering and Cepstral Mean Subtraction (CMS)—are applied, yielding six distinct feature pipelines.
LPC models the speech spectrum with an all‑pole linear filter, offering low computational cost but limited frequency resolution and high sensitivity to channel noise. MFCC maps the spectrum onto the mel scale, takes a logarithm, and then applies a discrete cosine transform, thereby capturing perceptually relevant acoustic cues in a compact form. PLP builds on LPC while incorporating auditory‑based spectral smoothing, equal‑loudness pre‑emphasis, and an intensity‑loudness power law, which together improve robustness to noise and channel variations.
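The MFCC steps just described (framing, power spectrum, mel filterbank, log, DCT) can be sketched in NumPy/SciPy. The frame length, hop size, FFT size, and filter count below are common illustrative defaults, not values taken from the paper:

```python
# Minimal MFCC sketch; parameters are illustrative assumptions.
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale.
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512,
         n_filters=26, n_ceps=13):
    # Frame the signal, window it, and take the power spectrum.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Mel filterbank energies -> log -> DCT, keeping the first coefficients.
    energies = power @ mel_filterbank(n_filters, n_fft, sr).T
    log_energies = np.log(np.maximum(energies, 1e-10))
    return dct(log_energies, type=2, axis=1, norm='ortho')[:, :n_ceps]
```

With a 1 s signal at 16 kHz, a 25 ms frame (400 samples), and a 10 ms hop (160 samples), this yields 98 frames of 13 coefficients each.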
RASTA filtering attenuates slow spectral variations caused by channel distortions, while CMS removes the average cepstral value across an utterance, making the resulting vectors largely channel‑independent. Both techniques are known to reduce intra‑class variance, which is crucial for downstream classifiers.
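Both normalizations are short transforms applied per utterance. Below is a minimal sketch: CMS is a per-coefficient mean removal, and RASTA is a band-pass IIR filter applied along time. The filter coefficients and pole value follow the commonly cited Hermansky and Morgan formulation and are an assumption of this sketch:

```python
import numpy as np
from scipy.signal import lfilter

def cms(cepstra):
    # Cepstral mean subtraction: remove the per-utterance mean of each
    # coefficient, cancelling stationary (convolutive) channel effects.
    return cepstra - cepstra.mean(axis=0, keepdims=True)

def rasta_filter(log_spectra, pole=0.98):
    # Band-pass each coefficient trajectory along time with the classic
    # RASTA IIR filter; pole values of 0.98 or 0.94 are common choices.
    num = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])
    den = np.array([1.0, -pole])
    return lfilter(num, den, log_spectra, axis=0)
```

Both functions expect a (frames × coefficients) matrix, matching the output of the feature extraction stage.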
In the classification stage, a GMM is trained for each enrolled speaker to model the probability density of the extracted feature vectors. The log‑likelihood scores produced by the GMMs are then fed to an SVM. The authors evaluate both a linear kernel and several non‑linear kernels (RBF and polynomial) to determine whether more complex decision boundaries can yield additional gains. Linear kernels are attractive because they avoid over‑fitting in high‑dimensional spaces and keep computational demands modest; non‑linear kernels can capture intricate class separations but require careful parameter tuning and incur higher runtime costs.
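The GMM-then-SVM stage can be illustrated with scikit-learn on synthetic data. The speaker count, GMM size, and the exact score-vector construction (one average per-frame log-likelihood per enrolled speaker) are assumptions of this sketch, not the paper's configuration:

```python
# Illustrative GMM -> SVM pipeline on synthetic 2-D "feature vectors";
# all sizes and data here are assumptions for the sketch.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_speakers, n_frames = 3, 200

# Synthetic per-speaker training frames with distinct means.
train = {s: rng.normal(loc=3.0 * s, scale=1.0, size=(n_frames, 2))
         for s in range(n_speakers)}

# One GMM per enrolled speaker models that speaker's feature density.
gmms = {s: GaussianMixture(n_components=4, random_state=0).fit(X)
        for s, X in train.items()}

def score_vector(utterance):
    # Average per-frame log-likelihood under each speaker's GMM;
    # this score vector is the input to the SVM stage.
    return np.array([gmms[s].score(utterance) for s in range(n_speakers)])

# Build SVM training data: one score vector per short utterance.
X_svm, y_svm = [], []
for s in range(n_speakers):
    for _ in range(30):
        utt = rng.normal(loc=3.0 * s, scale=1.0, size=(50, 2))
        X_svm.append(score_vector(utt))
        y_svm.append(s)

# Linear kernel shown here; kernel='rbf' gives the non-linear case.
clf = SVC(kernel='linear').fit(X_svm, y_svm)
test_utt = rng.normal(loc=3.0, scale=1.0, size=(50, 2))  # "speaker 1"
pred = clf.predict([score_vector(test_utt)])[0]
```

Swapping `kernel='linear'` for `kernel='rbf'` (with tuned `C` and `gamma`) reproduces the non-linear comparison at the cost of extra hyperparameter search.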
Experiments are conducted on a publicly available text‑independent speaker identification corpus containing more than ten speakers. Performance is measured using identification accuracy and Equal Error Rate (EER) under a cross‑validation protocol. The results show that the MFCC‑CMS combination consistently outperforms the other pipelines when paired with a GMM‑linear SVM, achieving over 94 % identification accuracy and an EER below 2 %. LPC with RASTA yields the poorest results (≈78 % accuracy, ≈8 % EER), reflecting its vulnerability to noise and lack of perceptual scaling. PLP sits in the middle (≈85 % accuracy, ≈5 % EER); however, when a non‑linear RBF kernel is used, PLP gains a modest 2‑point accuracy improvement, indicating that PLP’s richer spectral representation can benefit from more flexible decision boundaries. Overall, the addition of CMS proves to be the most influential factor across all feature types, while the advantage of non‑linear kernels is limited to specific cases and does not justify the extra computational burden for most practical applications.
The authors conclude that, for robust speaker identification, a pipeline consisting of MFCC extraction followed by CMS normalization, GMM modeling, and a linear‑kernel SVM offers the best trade‑off between accuracy, robustness, and computational efficiency. They also highlight that the choice of kernel should be guided by the complexity of the feature space: when features already provide strong discrimination (as with MFCC‑CMS), a linear kernel suffices; when features are less discriminative (e.g., PLP without CMS), a carefully tuned non‑linear kernel may provide marginal gains. Future work is suggested in three directions: (1) exploring feature selection or dimensionality‑reduction techniques to further improve GMM‑SVM synergy, (2) integrating deep learning‑based feature learning to replace hand‑crafted descriptors, and (3) implementing real‑time speaker identification systems that exploit the identified optimal pipeline while meeting latency constraints.