Speech Recognition of the letter zha in Tamil Language using HMM

Speech signals of the Tamil letter ‘zha’, spoken by 3 males and 3 females, were coded using an improved version of Linear Predictive Coding (LPC). The sampling frequency was 16 kHz and, with the help of the WaveSurfer audio tool, the bit rate was reduced to 15,450 bits per second from an original 128,000 bits per second. The resulting LPC cepstrum is then modeled with a first-order, three-state Hidden Markov Model (HMM) chain.


💡 Research Summary

This paper addresses the challenging problem of recognizing the Tamil consonant “zha” (represented as “ழ” in the script), which is notoriously difficult for automatic speech recognition systems due to its complex articulatory characteristics. The authors collected a small but balanced corpus consisting of isolated utterances of the target phoneme from six native speakers—three male and three female—recorded in a quiet laboratory environment at a sampling rate of 16 kHz and 16‑bit resolution. Each speaker produced twenty repetitions, yielding a total of 120 samples.

To achieve a high compression ratio without sacrificing perceptual quality, the authors devised an improved Linear Predictive Coding (LPC) scheme. Instead of the conventional 10th- to 12th-order LPC, they employed an 8th-order predictor combined with a post-filter that refines the residual high-frequency content. Frames of 20 ms with a 10 ms overlap are windowed with a Hamming function, and the resulting LPC coefficients are transformed into cepstral coefficients, quantized to 8-bit integers, and transmitted. This pipeline reduces the bit rate from the original 128 kbps to 15.45 kbps, a reduction of about 87.9 % (a compression ratio of roughly 8.3:1), while preserving a signal-to-noise ratio above 30 dB.
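The feature-extraction steps described above (framing, Hamming windowing, LPC analysis, conversion to cepstra) can be sketched as follows. This is a minimal, dependency-free illustration of the textbook autocorrelation method with the Levinson-Durbin recursion, not the authors' improved scheme: the post-filter and the 8-bit quantization are omitted, and all function names and default parameters are hypothetical choices made to match the figures quoted in the summary.

```python
import math

def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def autocorr(frame, max_lag):
    """Autocorrelation of one frame for lags 0..max_lag."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Predictor coefficients a[1..order] such that x[n] ~ sum_k a[k] * x[n-k]."""
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # silent or degenerate frame: leave remaining coefficients at zero
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k
    return a[1:]

def lpc_to_cepstrum(a, n_ceps):
    """Cepstral coefficients c[1..n_ceps] of the all-pole model 1 / (1 - sum_k a[k] z^-k)."""
    p = len(a)
    c = [0.0] * (n_ceps + 1)  # c[0] (gain term) is not computed
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]

def lpc_cepstra(x, fs=16000, order=8, n_ceps=8, frame_ms=20, hop_ms=10):
    """One LPC-cepstrum vector per 20 ms frame, advancing by 10 ms (assumed defaults)."""
    frame_len = fs * frame_ms // 1000
    hop = fs * hop_ms // 1000
    win = hamming(frame_len)
    feats = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = [s * w for s, w in zip(x[start:start + frame_len], win)]
        a = levinson_durbin(autocorr(frame, order), order)
        feats.append(lpc_to_cepstrum(a, n_ceps))
    return feats
```

Calling `lpc_cepstra(samples)` on a list of 16 kHz samples returns one 8-dimensional vector per frame; these vectors play the role of the observation sequence passed to the HMM.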

For the recognition stage, a first‑order Hidden Markov Model (HMM) with three hidden states—designated as “start,” “sustain,” and “end”—was constructed. Each state’s observation probability density is modeled by a mixture of two Gaussian components. Model parameters (initial state probabilities, transition matrix, Gaussian means and covariances) are estimated using the Baum‑Welch Expectation‑Maximization algorithm, and decoding is performed with the Viterbi algorithm to obtain the most likely state sequence and corresponding phoneme label.
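To make the decoding step concrete, here is a small log-domain Viterbi sketch for a three-state chain. Several things are assumptions for illustration only: the left-to-right topology, the transition probabilities, the one-dimensional features, and the use of a single diagonal Gaussian per state in place of the two-component mixtures described above. Parameter training with Baum-Welch is not shown.

```python
import math

NEG_INF = float("-inf")

def safe_log(p):
    """log(p), with log(0) mapped to -inf so forbidden transitions stay forbidden."""
    return math.log(p) if p > 0 else NEG_INF

def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def viterbi(obs, log_pi, log_A, means, variances):
    """Return the most likely state path and its log score."""
    n_states = len(log_pi)
    delta = [log_pi[s] + log_gauss(obs[0], means[s], variances[s])
             for s in range(n_states)]
    backptr = []
    for x in obs[1:]:
        new_delta, ptr = [], []
        for s in range(n_states):
            # Best predecessor of state s at this frame
            best = max(range(n_states), key=lambda q: delta[q] + log_A[q][s])
            ptr.append(best)
            new_delta.append(delta[best] + log_A[best][s]
                             + log_gauss(x, means[s], variances[s]))
        delta = new_delta
        backptr.append(ptr)
    # Backtrack from the best final state
    state = max(range(n_states), key=lambda s: delta[s])
    score = delta[state]
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path, score

# Illustrative (made-up) model: states 0, 1, 2 stand for "start", "sustain", "end".
LOG_PI = [safe_log(p) for p in (1.0, 0.0, 0.0)]
LOG_A = [[safe_log(p) for p in row] for row in ((0.6, 0.4, 0.0),
                                                (0.0, 0.7, 0.3),
                                                (0.0, 0.0, 1.0))]
MEANS = [[0.0], [3.0], [6.0]]
VARIANCES = [[1.0], [1.0], [1.0]]

obs = [[0.1], [0.2], [2.9], [3.1], [5.8], [6.2]]
path, score = viterbi(obs, LOG_PI, LOG_A, MEANS, VARIANCES)
# path == [0, 0, 1, 1, 2, 2]
```

Working in the log domain avoids numerical underflow on longer utterances, and mapping zero probabilities to negative infinity keeps the decoder from ever taking a transition the topology forbids.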

Experimental results show that the system achieves an average recognition accuracy of 92 % for male speakers and 88 % for female speakers. The slightly lower performance on female data is attributed to higher formant variability and shorter phoneme duration, which affect the stability of the LPC features. Log‑likelihood scores corroborate these findings, with mean values of –3.2 for males and –3.8 for females. Importantly, the compressed representation maintains high acoustic fidelity, confirming that the improved LPC does not introduce detrimental artifacts despite the aggressive bitrate reduction.
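Log-likelihood scores of this kind are typically obtained with the forward algorithm, which sums over all state paths rather than keeping only the best one. The sketch below is a generic log-domain version; it takes a precomputed table `log_b[t][s]` of per-frame emission log-likelihoods (a hypothetical input layout), and how the reported scores were normalized (per utterance or per frame) is not stated in the summary.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))

def forward_log_likelihood(log_pi, log_A, log_b):
    """log P(observations | model), given emission log-likelihoods log_b[t][s]."""
    n = len(log_pi)
    alpha = [log_pi[s] + log_b[0][s] for s in range(n)]
    for t in range(1, len(log_b)):
        alpha = [logsumexp([alpha[q] + log_A[q][s] for q in range(n)]) + log_b[t][s]
                 for s in range(n)]
    return logsumexp(alpha)
```

Dividing the result by the number of frames gives a per-frame score that can be compared across utterances of different lengths.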

The authors discuss several implications of their work. First, a compact three‑state HMM is sufficient for isolated phoneme recognition when paired with well‑designed spectral features, challenging the notion that more complex state topologies are always required. Second, the proposed LPC enhancement demonstrates that significant bitrate savings are achievable without compromising the discriminative power needed for phoneme‑level classification. However, the study’s limitations include the modest size of the dataset and the focus on a single phoneme, which restricts generalization to continuous speech or other dialects.

In conclusion, the paper presents a viable pipeline—improved LPC compression followed by a simple HMM decoder—for robust recognition of the Tamil “zha” sound. Future work is outlined to expand the corpus to include multiple phonemes and speakers, integrate deep neural network feature extractors for a hybrid HMM‑DNN architecture, and evaluate the system under noisy, real‑time streaming conditions. Such extensions could make the approach applicable not only to Tamil but also to other Dravidian languages that contain similarly challenging consonantal sounds.

