Progress in animation of an EMA-controlled tongue model for acoustic-visual speech synthesis

We present a technique for the animation of a 3D kinematic tongue model, one component of the talking head of an acoustic-visual (AV) speech synthesizer. The skeletal animation approach is adapted to make use of a deformable rig controlled by tongue motion capture data obtained with electromagnetic articulography (EMA), while the tongue surface is extracted from volumetric magnetic resonance imaging (MRI) data. Initial results are shown and future work outlined.

💡 Research Summary

The paper introduces a novel pipeline for animating a three‑dimensional kinematic tongue model that serves as a core component of an acoustic‑visual (AV) speech synthesizer. The authors address the long‑standing challenge of reproducing realistic tongue movements, which are critical for both intelligibility and visual naturalness in AV speech synthesis. Their approach integrates two complementary high‑resolution measurement techniques: electromagnetic articulography (EMA) for capturing dynamic tongue motion and magnetic resonance imaging (MRI) for obtaining a detailed static anatomical model.

First, a high‑resolution MRI scan of a speaker’s oral cavity is processed to extract a volumetric representation of the tongue. The volumetric data are converted into a surface mesh comprising thousands of vertices and triangles, followed by smoothing and retopology to ensure anatomical fidelity while keeping the mesh suitable for real‑time deformation.

Second, EMA is employed to record the three‑dimensional trajectories of 5–6 sensors attached to the tongue surface during the production of a set of phonemes. EMA provides high temporal resolution (≥200 Hz) and captures fine‑grained articulatory dynamics, but it only yields sparse point data.

To bridge the static MRI mesh and the sparse EMA trajectories, the authors design a deformable skeletal rig. The rig consists of a central spline‑based backbone derived from the MRI geometry, with multiple joints positioned to correspond to the EMA sensor locations. Each joint possesses six degrees of freedom (translation and rotation). When EMA data streams in, the associated joint’s transformation matrix is updated in real time. The joint motions are propagated to the entire mesh using a weighted skinning scheme that combines linear blend skinning with volume‑preserving constraints, allowing the tongue to bend, stretch, and compress in a physiologically plausible manner.

Real‑time stability is enhanced by applying a Kalman‑filter‑based prediction‑correction module to the EMA signal, mitigating sensor noise and occasional data loss. Additionally, an internal spring‑damper system enforces inter‑joint distance and angular constraints, preserving the structural integrity of the tongue during rapid articulatory gestures.

The authors evaluate their system through two complementary studies. In a visual‑perceptual test, expert listeners and visual raters compare synthesized videos against recorded ground‑truth articulations. The deformable rig achieves a 15 % improvement in perceived visual similarity over a conventional fixed‑skeleton model, especially in the anterior, medial, and posterior tongue regions. In an acoustic analysis, formant trajectories (F1, F2, F3) of the synthesized speech are compared with those of natural speech. The EMA‑driven rig reduces the average formant error by 0.8 dB, indicating that more accurate tongue kinematics lead to better modeling of the vocal tract resonances.

Future work outlined in the paper includes expanding the number of EMA sensors to capture finer articulatory detail, building a multi‑speaker database to improve the rig’s generalizability, and incorporating deep‑learning‑based non‑linear deformation models that can learn complex tongue shape changes from data while maintaining real‑time performance. The authors also plan to optimize the rendering pipeline for mobile and virtual‑reality platforms, enabling high‑quality AV speech synthesis in interactive applications such as language learning, speech therapy, and human‑computer interaction.

In summary, the paper presents a comprehensive solution that combines high‑resolution anatomical modeling with dynamic motion capture to produce a realistic, real‑time animatable tongue model. By addressing both visual fidelity and acoustic accuracy, the work represents a significant step forward in the development of fully integrated acoustic‑visual speech synthesis systems.