A Comparison of Classifiers in Performing Speaker Accent Recognition Using MFCCs
An algorithm involving Mel-Frequency Cepstral Coefficients (MFCCs) is provided to perform signal feature extraction for the task of speaker accent recognition. Different classifiers are then compared on the MFCC features. For each signal, the mean vector of its MFCC matrix is used as the input vector for pattern recognition. A sample of 330 signals, containing 165 US voices and 165 non-US voices, is analyzed. In the comparison, k-nearest neighbors yields the highest average test accuracy over a cross-validation of size 500 while requiring the least computation time.
💡 Research Summary
The paper presents a comparative study of several classifiers for the binary task of speaker accent recognition—distinguishing between U.S. and non‑U.S. English—using Mel‑Frequency Cepstral Coefficients (MFCCs) as the sole acoustic feature. A total of 330 utterances (165 U.S., 165 non‑U.S.) were collected from 22 speakers (11 female, 11 male) via an online source; each utterance is roughly one second long, sampled at 44.1 kHz, and contains no background noise.
Feature extraction follows the standard MFCC pipeline: pre‑emphasis filtering, short‑time Fourier transform with a Hamming window (20–40 ms frames), mapping of the power spectrum onto the Mel scale, logarithmic compression, and a discrete cosine transform to obtain the cepstral coefficients. The authors experiment with MFCC dimensionalities of 12, 19, 26, 33, and 39. For each utterance, the MFCC matrix (frames × coefficients) is reduced to a single fixed‑length vector by averaging each coefficient across all frames, yielding a q‑dimensional mean vector.
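The pipeline above can be sketched in NumPy/SciPy. This is a minimal illustration of the standard MFCC recipe (pre-emphasis, Hamming-windowed frames, power spectrum, Mel filterbank, log compression, DCT) followed by the paper's mean-over-frames reduction; frame lengths, the pre-emphasis coefficient, and the filterbank size are common defaults, not values taken from the paper.

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the Mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mean_mfcc(signal, sr=44100, q=13, frame_ms=25, hop_ms=10, n_filters=40):
    """Return the q-dimensional mean-MFCC vector for one utterance."""
    # Pre-emphasis filter (0.97 is a conventional choice)
    emph = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_fft = 2048
    window = np.hamming(frame_len)
    # Short-time power spectra of Hamming-windowed frames
    frames = []
    for start in range(0, len(emph) - frame_len + 1, hop):
        frame = emph[start:start + frame_len] * window
        frames.append(np.abs(np.fft.rfft(frame, n_fft)) ** 2 / n_fft)
    # Mel mapping, log compression, DCT -> cepstral coefficients
    mel_energy = np.array(frames) @ mel_filterbank(n_filters, n_fft, sr).T
    log_mel = np.log(mel_energy + 1e-10)
    mfcc = dct(log_mel, type=2, axis=1, norm="ortho")[:, :q]  # frames x q
    # Average each coefficient across all frames -> fixed-length vector
    return mfcc.mean(axis=0)

# Example: a one-second 440 Hz tone at 44.1 kHz
t = np.linspace(0, 1, 44100, endpoint=False)
vec = mean_mfcc(np.sin(2 * np.pi * 440 * t))
print(vec.shape)
```

Averaging over the frame axis is what turns the variable-length MFCC matrix into a fixed-length input that any of the classifiers below can consume.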
Five classification approaches are evaluated: Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), Support Vector Machines with a radial‑basis‑function kernel (RBF‑SVM), SVM with a second‑order polynomial kernel (Poly‑SVM), and k‑Nearest Neighbors (k‑NN) with k = 3. The authors use stratified random sampling to create 500 cross‑validation folds; accuracy is defined as (TP + TN)/N for each fold, and the mean accuracy across folds is reported.
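The evaluation protocol can be reproduced with scikit-learn. The sketch below compares the same five classifiers under repeated stratified train/test splits; the features here are synthetic placeholders standing in for the 330 mean-MFCC vectors, and the split count is reduced from the paper's 500 so the example runs quickly.

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Placeholder data: in the paper, X is the 330 x q matrix of mean-MFCC
# vectors and y the US / non-US labels (165 of each class).
rng = np.random.default_rng(0)
X = rng.normal(size=(330, 13))
y = np.repeat([0, 1], 165)

classifiers = {
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "RBF-SVM": SVC(kernel="rbf"),
    "Poly-SVM": SVC(kernel="poly", degree=2),
    "3-NN": KNeighborsClassifier(n_neighbors=3),
}

# Stratified random splits; the paper uses 500 (n_splits=500).
cv = StratifiedShuffleSplit(n_splits=50, test_size=0.2, random_state=0)

results = {}
for name, clf in classifiers.items():
    # Accuracy per fold is (TP + TN) / N; report the mean across folds.
    scores = cross_val_score(clf, X, y, cv=cv)
    results[name] = scores.mean()
    print(f"{name}: mean accuracy {results[name]:.4f}")
```

With real mean-MFCC features the same loop yields the per-classifier average accuracies reported in the paper's Table 2; on the random placeholder data all classifiers hover near chance.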
Results (Table 2, Figure 5) show a clear trend: increasing the number of MFCCs improves performance up to about 30 coefficients, after which gains plateau. k‑NN consistently outperforms the other methods, achieving its highest average accuracy of 95.86 % with 33 MFCCs. QDA and both SVM variants also perform well (≈92–94 % accuracy), whereas LDA lags behind (≈73–83 %).
Computation time (Table 3, Figure 6) reveals that k‑NN is the fastest, requiring roughly 0.6–1.2 seconds per 500‑fold cross‑validation run, because it does not involve model training—only distance calculations at test time. In contrast, SVMs need 9–16 seconds, QDA about 9–12 seconds, and LDA 7–18 seconds, reflecting the overhead of estimating covariance matrices and solving optimization problems.
The authors conclude that, for this modestly sized, noise‑free dataset, a simple mean‑MFCC representation combined with a non‑parametric k‑NN classifier yields the best trade‑off between accuracy and speed. They caution that using only the mean discards potentially useful information such as coefficient variance or temporal dynamics, and that MFCCs may become less effective in noisy conditions. Future work is suggested to incorporate additional statistics (e.g., standard deviations), explore Gaussian mixture modeling of MFCC trajectories, or adopt more sophisticated deep‑learning architectures that operate on raw spectrograms. Moreover, expanding the corpus to include varied recording environments and larger speaker populations would be necessary to validate the generalizability of the findings.