Interpretable Modeling of Articulatory Temporal Dynamics from real-time MRI for Phoneme Recognition


Real-time Magnetic Resonance Imaging (rtMRI) visualizes vocal tract action, offering a comprehensive window into speech articulation. However, its signals are high-dimensional and noisy, hindering interpretation. We investigate compact representations of spatiotemporal articulatory dynamics for phoneme recognition from midsagittal vocal tract rtMRI videos. We compare three feature types: (1) raw video, (2) optical flow, and (3) six linguistically relevant regions of interest (ROIs) for articulator movements. We evaluate models trained independently on each representation, as well as multi-feature combinations. Results show that multi-feature models consistently outperform single-feature baselines, with the lowest phoneme error rate (PER) of 0.34 obtained by combining ROI and raw video. Temporal fidelity experiments demonstrate a reliance on fine-grained articulatory dynamics, while ROI ablation studies reveal strong contributions from tongue and lips. Our findings highlight how rtMRI-derived features provide accuracy and interpretability, and establish strategies for leveraging articulatory data in speech processing.


💡 Research Summary

This paper investigates how to extract compact, interpretable representations from real‑time magnetic resonance imaging (rtMRI) of the vocal tract for the purpose of phoneme recognition. The authors compare three feature types derived from midsagittal rtMRI videos: (1) raw video frames, (2) optical‑flow fields, and (3) six linguistically motivated regions of interest (ROIs) that capture the motion of the lips, tongue tip, tongue body, velum, tongue root, and larynx. Each feature stream is processed by a dedicated spatial encoder (Vision Transformer, ResNet, or 1‑D convolution) followed by a temporal encoder (LSTM or the recent Mamba architecture). The resulting embeddings are fed into a lightweight classification head trained with Connectionist Temporal Classification (CTC) loss to predict phoneme sequences directly from the visual data.
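A minimal sketch of this pipeline in PyTorch (the backbone choice, hidden sizes, and input resolution below are illustrative assumptions, not the paper's exact configuration; the ViT / 1-D-convolution and Mamba variants would slot into the same places):

```python
import torch
import torch.nn as nn
import torchvision.models as tvm

class ArticulatoryPhonemeRecognizer(nn.Module):
    """Spatial encoder -> temporal encoder -> lightweight CTC head."""

    def __init__(self, num_phonemes: int, hidden_dim: int = 256):
        super().__init__()
        # Spatial encoder applied frame by frame (a ResNet here; the paper also
        # uses a Vision Transformer for images and a 1-D convolution for ROI signals).
        backbone = tvm.resnet18(weights=None)
        backbone.fc = nn.Identity()          # keep the 512-d frame embedding
        self.spatial = backbone
        # Temporal encoder over the frame embeddings (LSTM here; Mamba is the alternative).
        self.temporal = nn.LSTM(512, hidden_dim, batch_first=True, bidirectional=True)
        # Classification head; one extra output for the CTC blank symbol.
        self.head = nn.Linear(2 * hidden_dim, num_phonemes + 1)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (B, T, 3, H, W); grayscale rtMRI frames can be repeated across channels.
        b, t = video.shape[:2]
        emb = self.spatial(video.flatten(0, 1)).view(b, t, -1)   # (B, T, 512)
        seq, _ = self.temporal(emb)                              # (B, T, 2*hidden_dim)
        return self.head(seq).log_softmax(dim=-1)                # (B, T, num_phonemes+1)

model = ArticulatoryPhonemeRecognizer(num_phonemes=39)   # toy inventory size
video = torch.randn(1, 100, 3, 84, 84)                   # 1 s of frames at 100 fps (toy resolution)
log_probs = model(video)                                 # (1, 100, 40)
ctc_loss = nn.CTCLoss(blank=39)                          # expects (T, B, C) log-probs; permute before use
```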

The experimental corpus consists of rtMRI recordings from a single male American-English speaker, comprising 71 videos (≈102 minutes in total) that were segmented into 5-second overlapping windows, yielding 1,224 samples. The dataset includes synchronized audio, phoneme-level transcriptions, and a video frame rate of 99 fps (resampled to 100 fps). Training uses a batch size of 1 on an NVIDIA A40 GPU, the Adam optimizer (lr = 1e-3, weight decay = 1e-4), and a step learning-rate schedule over 300 epochs.
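A sketch of the windowing step implied by these numbers (only the 5-second window length and 100 fps rate come from the text above; the hop length, i.e. the amount of overlap, is an illustrative assumption):

```python
import numpy as np

def segment_video(frames: np.ndarray, fps: int = 100,
                  win_sec: float = 5.0, hop_sec: float = 2.5) -> np.ndarray:
    """Cut a (T, H, W) frame array into fixed-length overlapping windows.

    win_sec matches the 5-second windows described above; hop_sec is an
    illustrative choice, not a value taken from the paper.
    """
    win, hop = int(win_sec * fps), int(hop_sec * fps)
    starts = range(0, len(frames) - win + 1, hop)
    return np.stack([frames[s:s + win] for s in starts])

# Example: a 30-second recording at 100 fps yields 11 overlapping 500-frame samples.
video = np.zeros((3000, 84, 84), dtype=np.float32)
print(segment_video(video).shape)   # (11, 500, 84, 84)
```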

Performance is evaluated with phoneme error rate (PER) and top‑1/top‑3 phoneme classification accuracy. Single‑feature models achieve PERs of 0.37 (raw video), 0.41 (optical flow), and 0.53 (ROI). Multi‑feature models consistently improve over these baselines: ROI + raw video yields the lowest PER of 0.34, ROI + optical flow reaches 0.35, and raw video + optical flow obtains 0.39. The best result demonstrates that ROI provides a compact, noise‑robust summary of articulatory activity, while raw video preserves fine‑grained kinematic detail; their combination supplies complementary information.
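PER here is the usual Levenshtein (edit) distance between predicted and reference phoneme sequences, normalized by the reference length; a minimal implementation (the example sequences are illustrative, not drawn from the paper):

```python
def phoneme_error_rate(ref, hyp):
    """Levenshtein distance between phoneme sequences, normalized by reference length."""
    # dp[j] = edit distance between the current reference prefix and hyp[:j].
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev_diag, dp[0] = dp[0], i
        for j, h in enumerate(hyp, start=1):
            prev_diag, dp[j] = dp[j], min(
                dp[j] + 1,                # deletion of r
                dp[j - 1] + 1,            # insertion of h
                prev_diag + (r != h),     # substitution or match
            )
    return dp[-1] / max(len(ref), 1)

ref = ["HH", "AH", "L", "OW", "Z"]
hyp = ["HH", "AH", "L", "UW"]
print(phoneme_error_rate(ref, hyp))   # 1 substitution + 1 deletion over 5 phonemes -> 0.4
```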

Temporal fidelity is examined by systematically corrupting the time dimension: (i) shuffling phoneme order while preserving intra-phoneme frame order, (ii) shuffling frames within each phoneme, (iii) reversing frame order, (iv) up-sampling by a factor of two, and (v) down-sampling by a factor of two. All manipulations increase PER, confirming that the models rely heavily on temporal continuity and co-articulation cues. Down-sampling causes the largest PER jumps for raw video (0.20) and optical flow (0.24), indicating sensitivity to high-frequency motion components. Up-sampling also raises PER (≈0.30), likely due to over-parameterization on the longer sequences. ROI is comparatively robust, with PER increases below 0.15 for all temporal perturbations.
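A sketch of these temporal manipulations on a (T, ...) feature sequence (frame-index operations only; the paper's exact resampling method is not stated here, so up/down-sampling is implemented as frame repetition/decimation, and the phoneme segment boundaries are assumed to come from the time-aligned transcriptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def reverse_frames(x):
    """(iii) Reverse the frame order of a (T, ...) feature sequence."""
    return x[::-1]

def upsample(x, factor=2):
    """(iv) Repeat each frame `factor` times (nearest-frame up-sampling)."""
    return np.repeat(x, factor, axis=0)

def downsample(x, factor=2):
    """(v) Keep every `factor`-th frame, dropping high-frequency motion detail."""
    return x[::factor]

def shuffle_within_phonemes(x, segments):
    """(ii) Shuffle frames inside each phoneme segment, keeping segment order."""
    out = x.copy()
    for start, end in segments:          # (start, end) frame indices per phoneme
        idx = rng.permutation(np.arange(start, end))
        out[start:end] = x[idx]
    return out

def shuffle_phoneme_order(x, segments):
    """(i) Shuffle the order of phoneme segments while keeping frames within each intact."""
    order = rng.permutation(len(segments))
    return np.concatenate([x[segments[i][0]:segments[i][1]] for i in order], axis=0)
```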

An ablation study removes each ROI channel individually and measures the resulting PER change. The most detrimental removals are the lip aperture (LA) and tongue tip (TT) channels, which increase PER by 0.15 and 0.13 respectively. This aligns with phonetic theory that bilabial constriction and tongue‑tip placement are critical for distinguishing a large portion of English phonemes.
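One way to run such a channel ablation, assuming the ROI stream is a (T, 6) array of per-region signals; the channel abbreviations beyond LA and TT, and the `evaluate_per` helper, are hypothetical:

```python
import numpy as np

# Lip aperture (LA) and tongue tip (TT) are named above; the remaining abbreviations
# for tongue body, velum, tongue root, and larynx are placeholders.
ROI_NAMES = ["LA", "TT", "TB", "VEL", "TR", "LAR"]

def ablate_channel(roi_features: np.ndarray, channel: int) -> np.ndarray:
    """Zero out one ROI channel of a (T, 6) feature matrix, leaving the rest intact."""
    ablated = roi_features.copy()
    ablated[:, channel] = 0.0
    return ablated

# Hypothetical evaluation loop (evaluate_per is assumed to return a float PER):
# baseline = evaluate_per(model, roi_features)
# for ch, name in enumerate(ROI_NAMES):
#     delta = evaluate_per(model, ablate_channel(roi_features, ch)) - baseline
#     print(f"{name}: PER increase = {delta:+.2f}")
```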

Confusion matrices for the best multimodal model (ROI + raw video) reveal that most phonemes are correctly identified, with notable errors between phonetically similar categories such as /z/ vs. /s/ and /k/ vs. /g/. These errors suggest that even with high-resolution visual data, distinguishing subtle articulatory differences remains challenging.
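Such a confusion matrix can be tallied from aligned (reference, predicted) phoneme pairs; the alignment step itself (e.g. a backtrace of the edit-distance computation above) is assumed here:

```python
import numpy as np

def confusion_matrix(pairs, phoneme_to_idx):
    """Count (reference, predicted) co-occurrences from aligned phoneme pairs.

    `pairs` is an iterable of (ref, hyp) phoneme tuples produced by an external
    alignment step (not shown); rows index the reference phoneme.
    """
    n = len(phoneme_to_idx)
    cm = np.zeros((n, n), dtype=int)
    for ref, hyp in pairs:
        cm[phoneme_to_idx[ref], phoneme_to_idx[hyp]] += 1
    return cm

# Row-normalizing cm then highlights which reference phonemes are most often confused.
```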

The authors conclude that careful preprocessing of rtMRI can dramatically affect both accuracy and interpretability in phoneme recognition. ROI features offer a low-dimensional, easily visualizable representation, while raw video and optical flow retain richer motion cues; combining them yields the best trade-off. Limitations include reliance on a single speaker, manual ROI definition, and a phoneme inventory that does not cover every English sound. Future work should explore automated ROI extraction, multi-speaker datasets, and integration with acoustic models to improve generalization and practical applicability.

