Implementation of an Automatic Syllabic Division Algorithm from Speech Files in Portuguese Language

A new algorithm for automatic syllabic splitting of speech in the Portuguese language is proposed, based on the envelope of the speech signal in the input audio file. A computational implementation in MATLAB is presented and made available at the URL http://www2.ee.ufpe.br/codec/divisao_silabica.html. Owing to its simplicity, the proposed method is attractive for embedded systems (e.g., iPhones). It can also be used as a screening stage to assist more sophisticated methods. Voice excerpts that contain more than one syllable but are identified by a single envelope are called super-syllables, and they are subsequently separated. The results indicate which samples correspond to the beginning and end of each detected syllable. Preliminary tests on fifty words yielded an identification rate of about 70% (further improvements may be incorporated to treat particular phonemes). The algorithm is also useful in voice command systems, as a tool for teaching the Portuguese language, or even for patients with speech pathologies.

💡 Research Summary

The paper presents a straightforward algorithm for automatically segmenting speech into syllables in Portuguese, relying solely on the amplitude envelope of the audio signal. The authors argue that, unlike conventional methods that depend on complex spectral analyses, formant tracking, or machine‑learning models, an envelope‑based approach can be implemented with minimal computational resources, making it especially suitable for embedded platforms such as smartphones or micro‑controller‑based devices.

Algorithm Overview
The processing pipeline consists of four main stages. First, the input WAV file is normalized, resampled to a common rate (typically 16 kHz), and divided into short overlapping frames (10 ms long with 50 % overlap). For each frame the root‑mean‑square (RMS) value is computed, and a moving‑average filter (≈5 ms window) smooths these RMS values to produce a continuous envelope. Second, a signal‑dependent threshold is derived from the global mean (μ) and standard deviation (σ) of the envelope; the threshold τ = μ + α·σ (α empirically set between 0.5 and 1.0) determines which portions of the signal are considered “active.” Consecutive envelope samples above τ constitute candidate syllable regions.
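The first two stages can be illustrated with a minimal pure‑Python sketch. It is not the authors' MATLAB code: the function names, the 3‑frame smoothing width, the use of the population standard deviation, and the synthetic two‑burst test signal are all assumptions made for illustration.

```python
import math

def frame_rms(samples, rate, frame_ms=10.0, overlap=0.5):
    """RMS of short overlapping frames; returns (rms_values, hop_in_samples)."""
    frame_len = max(1, int(round(rate * frame_ms / 1000.0)))
    hop = max(1, int(round(frame_len * (1.0 - overlap))))
    rms = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        rms.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return rms, hop

def moving_average(values, width):
    """Centered moving average; the window shrinks at the edges."""
    half = width // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out

def active_regions(envelope, alpha=0.5):
    """Return (tau, regions), where regions are half-open frame ranges
    in which the envelope exceeds tau = mu + alpha * sigma."""
    n = len(envelope)
    mu = sum(envelope) / n
    sigma = math.sqrt(sum((e - mu) ** 2 for e in envelope) / n)
    tau = mu + alpha * sigma
    regions, start = [], None
    for i, e in enumerate(envelope):
        if e > tau and start is None:
            start = i                      # region opens
        elif e <= tau and start is not None:
            regions.append((start, i))     # region closes
            start = None
    if start is not None:
        regions.append((start, n))
    return tau, regions

# Synthetic test signal (assumption): two 200 ms tone bursts separated by silence.
rate = 16000

def burst(n, amp):
    return [amp * math.sin(2 * math.pi * 200 * k / rate) for k in range(n)]

signal = [0.0] * 1600 + burst(3200, 0.8) + [0.0] * 1600 + burst(3200, 0.8) + [0.0] * 1600
rms, hop = frame_rms(signal, rate)
envelope = moving_average(rms, 3)          # 3-frame width is an assumed value
tau, regions = active_regions(envelope, alpha=0.5)
print(len(regions), "candidate regions; first starts near sample", regions[0][0] * hop)
```

Because τ depends on the statistics of the whole recording, long stretches of leading or trailing silence lower μ and therefore τ; in practice α would probably need retuning for recordings with very different silence proportions.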

Third, the algorithm distinguishes between ordinary syllable candidates and “super‑syllables,” i.e., active regions whose duration exceeds a preset limit (e.g., 300 ms) and which are therefore likely to contain more than one syllable. Within each super‑syllable the method applies a second‑derivative peak detector to locate internal energy peaks, then splits the region into individual syllables while enforcing a minimum syllable duration (≈50 ms). Finally, the start and end sample indices of each detected syllable are recorded; these indices can be converted into timestamps for downstream applications.
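A rough Python sketch of the super‑syllable step follows. The summary does not say where the cut between two detected peaks is placed, so cutting at the lowest envelope value between neighbouring peaks is an assumption here, as are the function names, the 5 ms hop, and the synthetic two‑hump envelope.

```python
import math

def find_peaks(env, min_gap):
    """Indices of local maxima (rise, then non-rise, negative second
    difference), kept at least min_gap frames apart."""
    peaks = []
    for i in range(1, len(env) - 1):
        rising = env[i] - env[i - 1] > 0
        not_rising_after = env[i + 1] - env[i] <= 0
        second_diff = env[i + 1] - 2 * env[i] + env[i - 1]
        if rising and not_rising_after and second_diff < 0:
            if peaks and i - peaks[-1] < min_gap:
                if env[i] > env[peaks[-1]]:
                    peaks[-1] = i          # keep the taller of two close peaks
            else:
                peaks.append(i)
    return peaks

def split_super_syllable(env, start, end, hop_ms=5.0, max_ms=300.0, min_ms=50.0):
    """Split the half-open frame region [start, end) if it lasts longer than max_ms."""
    if (end - start) * hop_ms <= max_ms:
        return [(start, end)]              # ordinary syllable candidate
    min_frames = max(1, int(round(min_ms / hop_ms)))
    local = env[start:end]
    peaks = find_peaks(local, min_frames)
    bounds = [start]
    for a, b in zip(peaks, peaks[1:]):
        # Assumed rule: cut at the envelope minimum between two peaks.
        valley = start + min(range(a, b + 1), key=lambda k: local[k])
        # Enforce the minimum syllable duration on both sides of the cut.
        if valley - bounds[-1] >= min_frames and end - valley >= min_frames:
            bounds.append(valley)
    bounds.append(end)
    return list(zip(bounds, bounds[1:]))

def frames_to_seconds(segment, hop, rate):
    """Convert a (start_frame, end_frame) pair to seconds via sample indices."""
    return (segment[0] * hop / rate, segment[1] * hop / rate)

# Synthetic envelope (assumption): two humps inside one 400 ms active region.
humps = [0.6 - 0.4 * math.cos(2 * math.pi * k / 40) for k in range(80)]
env = [0.0] * 10 + humps + [0.0] * 10
segments = split_super_syllable(env, 10, 90)
times = [frames_to_seconds(s, hop=80, rate=16000) for s in segments]
print(segments, times)
```

Frame indices are mapped to sample indices by multiplying by the hop size, so the reported boundaries are only as precise as one hop (5 ms in this sketch).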

Implementation Details
The authors provide a complete MATLAB implementation, consisting of a main script (syllable_split.m) and auxiliary functions (compute_envelope.m, detect_peaks.m). The code is openly available at http://www2.ee.ufpe.br/codec/divisao_silabica.html, allowing other researchers to reproduce the experiments or adapt the parameters (frame length, moving‑average window, α, duration thresholds) to different recording conditions. Because the algorithm uses only elementary MATLAB functions (abs, filter, movmean), it can be ported to other environments (Python, C) with little effort.

Experimental Evaluation
The evaluation employed a corpus of 50 Portuguese words spoken by native speakers, covering a range of syllable counts (2–4 syllables per word). The total number of syllables in the test set was 150. The algorithm correctly identified the boundaries of approximately 70 % of the syllables. Performance was higher for words with clear vowel separation (e.g., “casa,” “pão”), reaching 85 % accuracy, while it degraded for sequences containing diphthongs or consonants with weak energy signatures (e.g., /r/, /l/, /s/). The super‑syllable handling step contributed an average 10 % improvement, particularly for longer words where a single envelope peak spanned multiple syllables.
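The summary does not state how a syllable was counted as correctly identified. One plausible scoring rule, shown here purely as an assumption, counts a reference syllable as identified when some detected segment matches both its start and its end within a tolerance. The function name, the 20 ms tolerance, and the boundary values below are made up for illustration and are not data from the paper.

```python
def identification_rate(detected, reference, tol):
    """Fraction of reference syllables matched by a distinct detected segment
    whose start and end both lie within tol samples of the reference."""
    if not reference:
        return 0.0
    used = set()
    hits = 0
    for r_start, r_end in reference:
        for j, (d_start, d_end) in enumerate(detected):
            if j in used:
                continue
            if abs(d_start - r_start) <= tol and abs(d_end - r_end) <= tol:
                used.add(j)                # each detection may match only once
                hits += 1
                break
    return hits / len(reference)

# Illustrative boundaries in samples at 16 kHz (invented values).
reference = [(0, 3200), (3200, 6400), (6400, 9600)]
detected = [(100, 3100), (3100, 6500), (6900, 9500)]
rate_found = identification_rate(detected, reference, tol=320)  # 320 samples = 20 ms
print(f"identification rate: {rate_found:.2f}")
```

In this toy case the third syllable is missed because its detected start lies 500 samples from the reference, outside the assumed tolerance.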

Strengths and Limitations
The primary advantage of the proposed method is its simplicity and low computational load, which makes real‑time execution feasible on devices with limited processing power. It also serves as an effective pre‑filter or “screening” stage for more sophisticated segmentation techniques that may require a rough estimate of syllable locations. However, relying exclusively on the envelope makes the algorithm vulnerable to phonetic contexts where energy does not vary sharply, such as voiced fricatives, nasalized vowels, or rapid speech. Consequently, the method exhibits higher false‑negative rates for those cases.

Potential Enhancements
The authors suggest several avenues for improvement. (1) Pre‑processing steps such as pre‑emphasis filtering and high‑frequency boosting could accentuate transient energy changes associated with consonantal releases. (2) Adaptive thresholding, perhaps based on a short‑term energy histogram or a Bayesian model, could replace the static μ + α·σ rule and better accommodate varying speaking styles. (3) A post‑processing stage using statistical sequence models (Hidden Markov Models, Conditional Random Fields) could refine the raw boundary estimates by enforcing linguistic constraints (e.g., permissible consonant clusters). (4) Integration with machine‑learning feature extractors (MFCCs, spectral flux) could create a hybrid system that retains low latency while improving robustness to noisy or pathological speech.
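As an illustration of the first suggestion, a pre‑emphasis stage is commonly written as the first‑order filter y[n] = x[n] − a·x[n−1]. The sketch below is a generic version of that filter, not code from the paper; the coefficient a = 0.97 is a conventional choice assumed here, and the function name is hypothetical.

```python
def pre_emphasis(samples, coeff=0.97):
    """First-order pre-emphasis: y[n] = x[n] - coeff * x[n-1], with y[0] = x[0]."""
    if not samples:
        return []
    out = [samples[0]]
    for n in range(1, len(samples)):
        out.append(samples[n] - coeff * samples[n - 1])
    return out

# A constant (low-frequency) input is almost cancelled, while a
# sample-to-sample alternating (high-frequency) input is nearly doubled.
flat = pre_emphasis([1.0, 1.0, 1.0, 1.0])
alternating = pre_emphasis([1.0, -1.0, 1.0, -1.0])
print(flat, alternating)
```

Applied before the envelope computation, such a filter would raise the relative energy of consonantal bursts, at the cost of also amplifying high‑frequency noise.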

Applications
Beyond academic interest, the algorithm has concrete practical uses. In voice‑controlled interfaces, it can quickly isolate candidate command words before a full speech‑recognition engine processes them, thereby reducing latency and power consumption. In language‑learning software, visualizing detected syllable boundaries can give learners immediate feedback on pronunciation timing. In clinical settings, the method could be employed to monitor speech‑production patterns of patients with dysarthria or apraxia, providing quantitative metrics for therapy progress.

Conclusion
The paper demonstrates that a minimalist envelope‑based approach can achieve reasonable syllable segmentation performance for Portuguese, with an identification rate of roughly 70 % on a modest test set. While not yet competitive with state‑of‑the‑art deep‑learning models, its ease of implementation, low resource requirements, and open‑source availability make it a valuable tool for embedded speech‑processing tasks and as a baseline for more elaborate systems. Future work focusing on adaptive thresholding, hybrid feature integration, and extensive testing on larger, more diverse corpora is likely to raise accuracy and broaden the algorithm’s applicability.

📜 Original Paper Content