Hybridized Feature Extraction and Acoustic Modelling Approach for Dysarthric Speech Recognition


Dysarthria is a motor speech disorder caused by weakness in the human nervous system. It is characterized by slurred speech, often accompanied by physical impairment, which restricts communication, undermines confidence, and affects quality of life. This paper attempts to increase the efficiency of an Automatic Speech Recognition (ASR) system for impaired speech signals. It describes the state of the art in improving ASR for speakers with dysarthria by incorporating knowledge of their speech production. A hybridized feature extraction and acoustic modelling approach, combined with an evolutionary algorithm, is proposed to increase the efficiency of the overall system. The number of acoustic features is varied and system performance is evaluated at each setting; it is observed that the genetic algorithm boosts performance. The system with 16 acoustic features optimized by the genetic algorithm obtained the highest recognition rate of 98.28% with a training time of 5:30:17.


💡 Research Summary

The paper addresses the persistent challenge of recognizing speech from individuals with dysarthria, a motor‑speech disorder that produces slurred articulation, irregular timing, and atypical prosody. Conventional automatic speech recognition (ASR) systems are typically trained on unimpaired speech and therefore struggle to model the acoustic irregularities characteristic of dysarthric speakers. To overcome this limitation, the authors propose a comprehensive hybrid framework that integrates multiple stages of feature extraction, acoustic modeling, and evolutionary optimization.

In the feature extraction stage, instead of relying on a single descriptor such as Mel‑frequency cepstral coefficients (MFCC), the authors simultaneously compute several complementary acoustic representations: MFCC, perceptual linear prediction (PLP), high‑resolution spectrogram patches, linear predictive coding (LPC), short‑time energy, and zero‑crossing rate. These descriptors capture different aspects of the speech signal—spectral shape, perceptual weighting, temporal dynamics, and voicing cues. The extracted vectors are concatenated to form hybrid feature vectors of varying dimensionality (8, 12, 16, and 20). This multi‑view representation is intended to provide a richer description of the irregular spectral patterns and timing deviations typical of dysarthric speech.
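The concatenation step above can be sketched as follows. This is not the authors' code: it uses only NumPy, computes short‑time energy and zero‑crossing rate directly, and stands in random values for the spectral descriptors (MFCC/PLP/LPC would normally come from a toolkit such as librosa or HTK). The frame length, hop size, and 14‑coefficient spectral block are illustrative assumptions chosen so the hybrid vector lands on the paper's 16‑dimensional configuration.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Slice a 1-D signal into overlapping frames (25 ms / 10 ms at 16 kHz)."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def short_time_energy(frames):
    """Per-frame log energy."""
    return np.log(np.sum(frames ** 2, axis=1) + 1e-10)

def zero_crossing_rate(frames):
    """Fraction of sign changes per frame -- a rough voicing cue."""
    signs = np.sign(frames)
    return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)

def hybrid_features(x, spectral_feats):
    """Concatenate per-frame spectral descriptors (e.g. MFCC/PLP rows)
    with energy and ZCR into one hybrid vector per frame."""
    frames = frame_signal(x)
    e = short_time_energy(frames)[:, None]
    z = zero_crossing_rate(frames)[:, None]
    n = min(len(e), spectral_feats.shape[0])
    return np.hstack([spectral_feats[:n], e[:n], z[:n]])

# Toy usage: 1 s of noise at 16 kHz; 14 stand-in spectral coefficients per frame.
rng = np.random.default_rng(0)
x = rng.standard_normal(16000)
n_frames = 1 + (16000 - 400) // 160          # 98 frames
mfcc_like = rng.standard_normal((n_frames, 14))
feats = hybrid_features(x, mfcc_like)
print(feats.shape)  # (98, 16): the 16-dimensional configuration
```

Varying the width of the spectral block (or dropping the energy/ZCR columns) reproduces the 8-, 12-, and 20-dimensional variants the paper compares.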

For acoustic modeling, the authors adopt a dual‑path approach. On one path, a traditional hidden Markov model (HMM) with Gaussian mixture model (GMM) emissions is trained on the hybrid features, preserving the well‑understood temporal alignment capabilities of HMMs. On the other path, a deep neural network (DNN) architecture—specifically a CNN‑RNN hybrid that combines convolutional layers for local spectral pattern extraction with recurrent layers for long‑range temporal dependencies—is trained to produce posterior probabilities over phonetic states. The outputs of the two models are later fused by weighted averaging, allowing the system to benefit from both the statistical robustness of GMM‑HMM and the expressive power of deep learning.
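The weighted-averaging fusion of the two streams reduces to a small amount of arithmetic. The sketch below is a minimal illustration, not the paper's implementation; the stream weight `w=0.4` and the 4-state toy posteriors are invented for the example, and the renormalization simply guards against weights that do not sum to one.

```python
import numpy as np

def fuse_posteriors(p_hmm, p_dnn, w=0.4):
    """Late fusion by weighted averaging of per-frame state posteriors.
    w is the GMM-HMM stream weight; (1 - w) goes to the CNN-RNN stream.
    Rows are renormalized so each frame remains a distribution."""
    fused = w * p_hmm + (1.0 - w) * p_dnn
    return fused / fused.sum(axis=1, keepdims=True)

# Toy example: 3 frames, 4 phonetic states per frame.
p_hmm = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.2, 0.5, 0.2, 0.1],
                  [0.1, 0.1, 0.2, 0.6]])
p_dnn = np.array([[0.6, 0.2, 0.1, 0.1],
                  [0.1, 0.7, 0.1, 0.1],
                  [0.1, 0.1, 0.1, 0.7]])
fused = fuse_posteriors(p_hmm, p_dnn, w=0.4)
print(fused.argmax(axis=1))  # [0 1 3] -- most likely state per frame
```

In a full decoder these fused posteriors would feed a Viterbi search over the HMM topology rather than a per-frame argmax.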

The most innovative component of the work is the use of a genetic algorithm (GA) to automatically select the optimal subset of acoustic features and to fine‑tune key hyper‑parameters of both acoustic models (e.g., number of hidden layers, learning rates, regularization coefficients). An initial population of random feature‑parameter combinations is evaluated using a fitness function that balances recognition accuracy against training time. Tournament selection, two‑point crossover, and low‑probability mutation are applied over thousands of generations. The GA converges on a configuration that uses 16 hybrid acoustic features, achieving the best trade‑off between performance and computational cost.
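The GA loop described above (tournament selection, two-point crossover, low-probability mutation, accuracy-vs-cost fitness) can be sketched compactly. This is a schematic, not the authors' optimizer: the population size, generation count, mutation rate, cost coefficient, and the stand-in evaluator (which merely rewards subsets near 16 features instead of actually training and scoring a recognizer) are all assumptions for illustration.

```python
import random

N_FEATS = 20            # candidate pool of acoustic descriptors
TARGET = 16             # the paper's best configuration size (illustrative)

def fitness(mask, eval_fn):
    """Balance recognition accuracy against a training-time penalty
    that grows with the number of selected features."""
    acc = eval_fn(mask)                      # stand-in ASR evaluation
    cost = 0.005 * sum(mask)                 # proxy for training time
    return acc - cost

def tournament(pop, fits, k=3):
    """Pick the fittest of k randomly chosen individuals."""
    picks = random.sample(range(len(pop)), k)
    return pop[max(picks, key=lambda i: fits[i])]

def two_point_crossover(a, b):
    """Swap the middle segment between two parents."""
    i, j = sorted(random.sample(range(1, N_FEATS), 2))
    return a[:i] + b[i:j] + a[j:]

def mutate(mask, p=0.02):
    """Flip each bit with low probability p."""
    return [bit ^ (random.random() < p) for bit in mask]

def ga_select(eval_fn, pop_size=30, generations=50):
    pop = [[random.randint(0, 1) for _ in range(N_FEATS)]
           for _ in range(pop_size)]
    for _ in range(generations):
        fits = [fitness(m, eval_fn) for m in pop]
        pop = [mutate(two_point_crossover(tournament(pop, fits),
                                          tournament(pop, fits)))
               for _ in range(pop_size)]
    fits = [fitness(m, eval_fn) for m in pop]
    return max(zip(fits, pop))[1]

random.seed(1)
best = ga_select(lambda m: 1.0 - abs(sum(m) - TARGET) / N_FEATS)
print(sum(best))
```

In the real system, `eval_fn` would train the hybrid recognizer on the masked feature set and return held-out accuracy, which is why the search is expensive enough that the fitness function must also penalize training time.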

Experiments are conducted on publicly available dysarthric speech corpora, namely TORGO and UASpeech, which contain recordings from speakers with varying severity levels. The dataset is split into training, validation, and test subsets, and a 5‑fold cross‑validation protocol is employed to guard against over‑fitting. Baseline systems that rely solely on MFCC‑GMM‑HMM achieve around 85 % word‑accuracy, whereas the proposed hybrid‑GA system consistently exceeds 96 % accuracy across all folds. The optimal 16‑feature configuration yields a recognition rate of 98.28 % with a total training time of 5 hours 30 minutes 17 seconds, demonstrating that the evolutionary optimization does not impose prohibitive computational overhead.
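The 5-fold protocol is standard, but it is worth being explicit about the mechanics since leakage between folds would inflate dysarthric-ASR results. A minimal index-splitting sketch (assumed, not taken from the paper; a real setup would also need to split by speaker rather than by utterance to avoid speaker leakage):

```python
import numpy as np

def kfold_indices(n_items, k=5, seed=0):
    """Shuffle once, then yield (train, test) index pairs for k folds,
    so every utterance is decoded exactly once across the protocol."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_items)
    folds = np.array_split(order, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test

# Toy protocol over 100 utterance IDs.
fold_sizes = []
for train_idx, test_idx in kfold_indices(100, k=5):
    assert set(train_idx).isdisjoint(test_idx)   # no train/test leakage
    fold_sizes.append(len(test_idx))
print(fold_sizes)  # [20, 20, 20, 20, 20]
```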

The authors highlight several key contributions: (1) a multi‑descriptor hybrid feature set that captures the complex acoustic signatures of dysarthric speech; (2) a parallel acoustic modeling strategy that leverages both conventional statistical models and modern deep learning; (3) an automated GA‑based search that eliminates manual trial‑and‑error in feature selection and hyper‑parameter tuning; and (4) empirical evidence that the integrated system outperforms state‑of‑the‑art dysarthric ASR solutions by a substantial margin.

In the discussion, the paper acknowledges limitations such as the increased training time associated with evolutionary search and the need for larger, more diverse dysarthric datasets to further validate generalization. Future work is outlined to include real‑time deployment, speaker‑adaptive fine‑tuning, and multimodal fusion with visual articulatory cues (e.g., lip‑reading). The authors argue that these extensions could transform assistive communication devices for dysarthric individuals, making them more reliable and user‑friendly.

