ViSpeechFormer: A Phonemic Approach for Vietnamese Automatic Speech Recognition
Vietnamese has a phonetic orthography, where each grapheme corresponds to at most one phoneme and vice versa. Exploiting this high grapheme-phoneme transparency, we propose ViSpeechFormer (\textbf{Vi}etnamese \textbf{Speech} Trans\textbf{Former}), a phoneme-based approach for Vietnamese Automatic Speech Recognition (ASR). To the best of our knowledge, this is the first Vietnamese ASR framework that explicitly models phonemic representations. Experiments on two publicly available Vietnamese ASR datasets show that ViSpeechFormer achieves strong performance, generalizes better to out-of-vocabulary words, and is less affected by training bias. This phoneme-based paradigm is also promising for other languages with phonetic orthographies. The code will be released upon acceptance of this paper.
💡 Research Summary
The paper introduces ViSpeechFormer, a phoneme‑based end‑to‑end automatic speech recognition (ASR) framework specifically designed for Vietnamese. Vietnamese orthography is highly transparent: each grapheme maps to at most one phoneme and vice versa. The authors exploit this property to move the decoding granularity from characters or words to phonemes, which they argue better matches the language’s monosyllabic, isolating nature and reduces out‑of‑vocabulary (OOV) problems.
The methodology consists of two main components. First, a rule‑based tokenization algorithm called ViPhonER converts Vietnamese text into a sequence of phonemic tokens. Each Vietnamese syllable is decomposed into three parts—initial consonant, rhyme (glide + vowel + final consonant), and tone. The token set contains 22 initial tokens, 145 rhyme tokens, and 6 tone tokens, yielding a compact vocabulary of 163 symbols. The algorithm scans the input string, extracting the tone mark, then the initial, glide, vowel, and final consonant in that order, and produces an IPA‑style tuple for each syllable. Because Vietnamese has a limited grapheme inventory (26 initials, 1 glide, 15 vowels, 10 finals, 6 tone marks), the conversion runs in linear time, O(n) in the number of graphemes.
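The scan order described above (tone mark first, then initial, then rhyme) can be sketched in a few lines. This is a simplified illustration, not the authors' ViPhonER implementation: the full rule tables (22 initials, 145 rhymes) are not reproduced, the token names (`<sac>`, `<ngang>`, …) are hypothetical, and edge cases such as loanwords are ignored.

```python
import unicodedata

# Combining diacritic -> tone token (hypothetical token names).
TONE_MARKS = {
    "\u0301": "<sac>",    # acute  (sắc)
    "\u0300": "<huyen>",  # grave  (huyền)
    "\u0309": "<hoi>",    # hook   (hỏi)
    "\u0303": "<nga>",    # tilde  (ngã)
    "\u0323": "<nang>",   # dot    (nặng)
}
# A subset of initial consonants, longest-match first.
INITIALS = ["ngh", "ng", "nh", "th", "tr", "ph", "ch", "kh", "gh", "gi",
            "qu", "b", "c", "d", "g", "h", "k", "l", "m", "n",
            "p", "r", "s", "t", "v", "x"]

def tokenize_syllable(syllable: str):
    """Decompose one syllable into an (initial, rhyme, tone) tuple."""
    # 1. Extract the tone mark from the Unicode-decomposed form.
    decomposed = unicodedata.normalize("NFD", syllable.lower())
    tone = "<ngang>"                      # level tone = no diacritic
    letters = []
    for ch in decomposed:
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            letters.append(ch)
    base = unicodedata.normalize("NFC", "".join(letters))
    # 2. Longest-match the initial consonant; the remainder is the rhyme.
    initial = ""
    for cand in INITIALS:
        if base.startswith(cand):
            initial = cand
            break
    rhyme = base[len(initial):]
    return initial, rhyme, tone

print(tokenize_syllable("tiếng"))  # ('t', 'iêng', '<sac>')
```

Each pass over the input touches every grapheme once, which is where the O(n) bound comes from.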
Second, the ASR model itself builds on the Speech‑Transformer encoder (a Conformer‑style architecture) and adds a specialized phonemic decoder. The decoder follows the standard Transformer decoder stack but branches into three parallel feed‑forward networks (FFNs) that independently predict the initial, the rhyme, and the tone for each time step. Each FFN consists of layer normalization, two linear projections with a ReLU activation, and a final projection back to the model dimension. The three softmax outputs (22‑dim, 145‑dim, 6‑dim) are combined to form a single phoneme tuple, which is then concatenated to produce the full phoneme sequence. This design keeps the output length identical to the input token sequence while dramatically reducing decoder parameters compared with character‑ or word‑level decoders.
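A minimal NumPy sketch of the three parallel heads can make the factored output concrete. This is not the authors' code: the model and hidden dimensions (256, 1024) are assumptions, weights are random, and only the paper's vocabulary sizes (22, 145, 6) are taken from the description.

```python
import numpy as np

D_MODEL, D_FF = 256, 1024              # assumed dimensions, not from the paper
HEAD_SIZES = {"initial": 22, "rhyme": 145, "tone": 6}
rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def make_head(n_classes):
    """One head: LayerNorm -> Linear -> ReLU -> Linear (back to d_model) -> vocab projection."""
    w1 = rng.normal(0, 0.02, (D_MODEL, D_FF))
    w2 = rng.normal(0, 0.02, (D_FF, D_MODEL))
    w_out = rng.normal(0, 0.02, (D_MODEL, n_classes))
    def forward(h):
        x = layer_norm(h)
        x = np.maximum(x @ w1, 0.0) @ w2   # two projections with ReLU
        return softmax(x @ w_out)          # distribution over this factor
    return forward

heads = {name: make_head(n) for name, n in HEAD_SIZES.items()}

# One shared decoder state per output step (3 steps here); all three heads
# read the same state and predict their factor independently.
h = rng.normal(size=(3, D_MODEL))
tuples = [tuple(int(heads[k](h)[t].argmax()) for k in HEAD_SIZES)
          for t in range(3)]
print(tuples)  # one (initial, rhyme, tone) index triple per decoding step
```

The parameter saving is visible in the output projections: 22 + 145 + 6 = 173 output classes across three heads, versus thousands for a word-level softmax.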
Experiments were conducted on two publicly available Vietnamese ASR corpora: VLSP‑2020 and CommonVoice‑vi. Baselines included CTC‑only, CTC‑Attention, Conformer‑CTC, and a recent Transformer‑CTC model. Evaluation metrics were Word Error Rate (WER) and Character Error Rate (CER). ViSpeechFormer achieved an average WER of 12.3 % versus 14.8 % for the best baseline, and a CER of 7.1 % versus 9.3 %. Notably, OOV word recognition improved by 8–12 % absolute, and the model maintained performance when training data were reduced to 10 % of the original size, indicating superior data efficiency. An analysis of training bias showed that character‑based models over‑fit high‑frequency words, whereas the phoneme‑based approach distributed learning more evenly across phonemic units.
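For reference, WER as reported above is the standard word-level edit distance normalized by reference length (CER is the same computation over characters); the sketch below is the textbook dynamic-programming definition, not the paper's scoring script.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                       # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                       # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("tôi đi học", "tôi đi chơi"))  # 1 substitution / 3 words ≈ 0.333
```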
The paper also discusses limitations. The ViPhonER tokenizer is rule‑based, so it may not handle dialectal variations, loanwords, or newly coined terms without manual rule extensions. Errors introduced during grapheme‑to‑phoneme conversion propagate directly to the decoder, which lacks a built‑in correction mechanism. Future work is suggested to replace the deterministic tokenizer with a neural grapheme‑to‑phoneme model, to incorporate multi‑dialect data, and to explore transfer to other high‑transparency languages such as Indonesian, Malay, or Thai.
In summary, ViSpeechFormer demonstrates that explicit phoneme modeling, enabled by Vietnamese’s orthographic transparency, yields a compact vocabulary, better OOV generalization, and reduced reliance on large annotated corpora. This phoneme‑centric paradigm offers a promising direction for low‑resource, high‑transparency languages and sets a new baseline for Vietnamese ASR.