Practical Selection of SVM Supervised Parameters with Different Feature Representations for Vowel Recognition
It is known that the classification performance of the Support Vector Machine (SVM) is strongly affected by the choice of kernel, the kernel parameters, and the regularization parameter C. In this article, we therefore propose a study to find a suitable kernel, and suitable parameter values, with which the SVM achieves good generalization performance. We analyze the behavior of the SVM classifier when these parameters take very small or very large values. The study is conducted on a multi-class vowel-recognition task using the TIMIT corpus. For the experiments, we used different feature representations, such as MFCC and PLP. Finally, a comparative study points out the impact of the choice of parameters, kernel, and feature representation on the performance of the SVM classifier.
💡 Research Summary
The paper investigates how the choice of kernel function and the regularization parameter C affect the performance of Support Vector Machines (SVM) in a multi‑class vowel‑recognition task using the TIMIT speech corpus. Two widely used acoustic feature sets—Mel‑Frequency Cepstral Coefficients (MFCC) and Perceptual Linear Prediction (PLP)—are extracted from 25 ms frames with 10 ms overlap, and each frame is represented by a 39‑dimensional vector (13 static coefficients plus first‑ and second‑order deltas). The authors adopt a one‑vs‑one strategy to handle the five vowel classes, training a binary SVM for every pair of classes and deciding the final label by majority voting.
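The one-vs-one setup described above can be sketched with scikit-learn, whose `SVC` already trains one binary SVM per class pair and decides by majority voting. This is an illustrative sketch on synthetic stand-in data, not the paper's TIMIT pipeline; the array shapes merely mimic the 39-dimensional frame vectors.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for 39-dimensional MFCC/PLP frame vectors
# (13 static coefficients + deltas + delta-deltas), 5 vowel classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 39))
y = rng.integers(0, 5, size=500)

# SVC trains a binary SVM for every pair of classes (one-vs-one)
# and resolves the final label by majority voting, matching the
# strategy the summary attributes to the paper.
clf = SVC(kernel="rbf", C=10, gamma="scale")
scores = cross_val_score(clf, X, y, cv=5)  # 5-fold cross-validation
print(scores.mean())
```

With real vowel frames, `X` would hold the extracted MFCC or PLP vectors and `y` the vowel labels; everything else stays the same.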
Four kernel families are examined: linear, polynomial (degrees 2–4), Gaussian radial‑basis‑function (RBF), and sigmoid. For each kernel, the regularization parameter C is varied over a logarithmic scale (0.01, 0.1, 1, 10, 100, 1000). When applicable, the RBF and sigmoid kernels are also tested with different γ values (default 1/feature‑dimension, a larger 0.5, and a smaller 0.01). All experiments are evaluated with 5‑fold cross‑validation; performance metrics include overall classification accuracy, standard deviation, and area under the ROC curve (AUC). Statistical significance is assessed using ANOVA and Tukey’s HSD post‑hoc test.
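The hyper-parameter sweep just described maps directly onto a cross-validated grid search. Below is a minimal sketch using scikit-learn's `GridSearchCV` with a grid mirroring the summary (C on a logarithmic scale, polynomial degrees 2–4, γ around the default 1/feature-dimension); the data are synthetic placeholders, not TIMIT.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 39))
y = rng.integers(0, 5, size=300)

# Grid following the summary: log-scale C, poly degrees 2-4,
# and gamma values around the default 1/39 for the RBF kernel.
param_grid = [
    {"kernel": ["linear"], "C": [0.01, 0.1, 1, 10, 100, 1000]},
    {"kernel": ["poly"], "degree": [2, 3, 4], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "gamma": [1 / 39, 0.5, 0.01],
     "C": [0.01, 0.1, 1, 10, 100, 1000]},
]
search = GridSearchCV(SVC(), param_grid, cv=5)  # 5-fold CV per config
search.fit(X, y)
print(search.best_params_)
```

The paper's reported metrics (accuracy, AUC, ANOVA) would be computed on top of such a sweep; `GridSearchCV` here only reproduces the accuracy part.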
Key findings can be summarised as follows:
- Effect of C – Very small C (0.01) yields overly wide margins, causing under‑fitting and accuracies below 60 %. Medium values (C ≈ 1–10) strike a balance between margin maximisation and error tolerance, delivering the highest accuracies (≈ 92 % for RBF‑MFCC and ≈ 94 % for RBF‑PLP). Large C (≥ 100) leads to over‑fitting; validation accuracy drops by 1–2 % and training time increases sharply.
- Kernel performance – Linear kernels perform the worst (≈ 78 % accuracy) because vowel data are not linearly separable in the chosen feature spaces. Polynomial kernels improve with degree, but degree 2 reaches only ≈ 85 % and higher degrees cause instability and over‑fitting. RBF kernels consistently outperform the others when γ is set to the default 1/feature‑dimension (≈ 0.025). Larger γ (0.5) makes the decision surface too localized, while smaller γ (0.01) produces overly smooth boundaries, both reducing accuracy. Sigmoid kernels are unstable; with large C they diverge, making them unsuitable for this task.
- Feature‑set comparison – PLP, which incorporates auditory modelling and spectral flattening, shows a modest advantage over MFCC, especially in noisy conditions, achieving 1–2 % higher accuracy and an AUC of 0.97 versus 0.94 for MFCC. Both feature sets reach their peak performance with the RBF kernel and C = 5–10, confirming that the non‑linear mapping of the RBF kernel effectively captures the complex variations in vowel spectra.
- Computational considerations – Linear‑MFCC training is the fastest (≈ 30 % of the total runtime) but sacrifices accuracy. RBF‑PLP yields the best results but requires roughly 2.5× more computation. For real‑time applications, the authors suggest MFCC with a linear kernel as a lightweight alternative, accepting a lower accuracy ceiling (~80 %).
- Statistical validation – ANOVA reveals a significant interaction between C and kernel type (p < 0.01). Post‑hoc analysis confirms that the RBF‑PLP combination is statistically superior to all other configurations. No significant difference is observed between C = 5 and C = 10, giving practitioners flexibility in selecting a convenient value.
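The under-fitting/over-fitting trade-off governed by C can be reproduced on any small dataset. The sketch below sweeps C for an RBF SVM on a synthetic classification problem (an illustration only, not the paper's data) so that cross-validated accuracy can be inspected across the same log scale.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Illustrative toy data: 3 classes, 20 features. On such data one
# typically sees accuracy rise from near-chance at very small C
# (overly wide margin, under-fitting) to a plateau at moderate C.
X, y = make_classification(n_samples=400, n_features=20, n_informative=10,
                           n_classes=3, n_clusters_per_class=2,
                           random_state=0)

accs = {}
for C in (0.01, 1, 10, 1000):
    accs[C] = cross_val_score(SVC(kernel="rbf", C=C, gamma="scale"),
                              X, y, cv=5).mean()
    print(f"C={C}: {accs[C]:.3f}")
```

On real vowel frames the plateau location differs, which is exactly why the summary recommends validating C per task rather than trusting a default.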
Based on these observations, the authors propose practical guidelines: for most scenarios, set C = 5–10, use the default γ, and select the RBF kernel; prefer PLP when robustness to noise is required. When computational resources are limited, adopt MFCC with a linear kernel and a smaller C (≈ 1) to maintain stability. For high‑performance systems, fine‑tune the RBF‑PLP pair and consider kernel approximations (e.g., random Fourier features) or GPU acceleration to mitigate the increased training cost.
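The kernel-approximation route mentioned above (random Fourier features) can be sketched with scikit-learn's `RBFSampler`, which approximates the RBF feature map so a fast linear SVM can stand in for kernel `SVC`. The data below are synthetic placeholders; the γ and C values follow the guidelines in the summary.

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 39))
y = rng.integers(0, 5, size=400)

# RBFSampler draws random Fourier features approximating the RBF
# kernel map with gamma = 1/39 (the default 1/feature-dimension),
# so the downstream LinearSVC trains in linear time.
approx = make_pipeline(
    RBFSampler(gamma=1 / 39, n_components=200, random_state=0),
    LinearSVC(C=5),
)
scores = cross_val_score(approx, X, y, cv=5)
print(scores.mean())
```

Increasing `n_components` tightens the approximation to the exact RBF kernel at a proportional increase in cost, which is the usual knob to turn when the accuracy gap matters.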
In conclusion, the study provides a thorough quantitative analysis of how SVM hyper‑parameters interact with acoustic feature representations in vowel recognition. It demonstrates that the RBF kernel combined with moderate regularisation yields the best generalisation, that PLP offers a slight robustness edge over MFCC, and that careful tuning of C is essential to avoid under‑ or over‑fitting. These insights furnish speech‑recognition engineers with concrete, evidence‑based recommendations for configuring SVM‑based vowel classifiers.