This work considers merging two independent models, TTS and A2F, into a unified model to enable internal feature transfer, thereby improving the consistency between the audio and facial expressions generated from text. We also discuss extending the emotion control mechanism from TTS to the joint model. This work does not aim to showcase generation quality; instead, from a system design perspective, it validates the feasibility of reusing intermediate representations from TTS for joint modeling of speech and facial expressions, and provides engineering practice references for subsequent speech-expression co-design. The project code has been open-sourced at: https://github.com/GoldenFishes/UniTAF
Existing speech-driven facial expression generation systems typically adopt a multi-stage pipeline composed of a large language model (LLM), text-to-speech (TTS), and audio-driven facial animation (A2F). In this paradigm, the modules are decoupled: the LLM generates textual content, TTS synthesizes speech from the text, and A2F predicts facial expressions and motions from the audio signal. While this design offers modularity, it implicitly assumes that the high-level information required for speech and expression generation, such as emotion, prosody, and expressive intent, can be independently inferred from plain text or raw audio.
However, although such expression-related high-level information is implicitly modeled during the internal reasoning of the LLM, it is not preserved in the final textual output. As a result, both TTS and A2F must repeatedly infer the same emotional and prosodic attributes from limited inputs, leading to information loss and redundant computation.
From a theoretical perspective, directly modeling the joint distribution of text, audio, and facial expressions with a unified multimodal model is a more natural solution. In practice, however, such end-to-end approaches are often constrained by inference cost and controllability, making them difficult to deploy within existing frameworks. In contrast, modern TTS systems explicitly reconstruct rich intermediate representations during the text-to-speech mapping process, including phonetic timing and prosodic variations, which can be regarded as an engineering approximation of the intent generated by the LLM. Therefore, directly leveraging TTS intermediate representations to drive A2F constitutes a more practical solution under current system constraints, with reduced information loss.
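To make this design choice concrete, the sketch below contrasts a conventional cascade, in which A2F re-encodes the synthesized waveform, with a pipeline that feeds the TTS intermediate representation directly into the A2F head. This is a minimal illustration only; the module interfaces, including the return_intermediate flag, are hypothetical placeholders rather than the actual IndexTTS2 or UniTalker APIs.

import torch
import torch.nn as nn

class CascadedPipeline(nn.Module):
    # Conventional LLM -> TTS -> A2F cascade: the A2F module only sees the
    # synthesized waveform, so emotion and prosody must be re-inferred from audio.
    def __init__(self, tts: nn.Module, a2f: nn.Module):
        super().__init__()
        self.tts, self.a2f = tts, a2f

    def forward(self, text_tokens: torch.Tensor):
        waveform = self.tts(text_tokens)            # (B, T_audio)
        face_params = self.a2f(waveform)            # (B, T_frames, D_face)
        return waveform, face_params

class FeatureReusePipeline(nn.Module):
    # Alternative discussed above: the A2F head consumes the TTS intermediate
    # representation (e.g. acoustic tokens or prosody features) instead of
    # re-encoding the waveform, so high-level information is inferred only once.
    def __init__(self, tts: nn.Module, a2f_head: nn.Module):
        super().__init__()
        self.tts, self.a2f_head = tts, a2f_head

    def forward(self, text_tokens: torch.Tensor):
        # return_intermediate=True is a hypothetical hook standing in for
        # whatever mechanism exposes the TTS latent sequence.
        waveform, intermediates = self.tts(text_tokens, return_intermediate=True)
        face_params = self.a2f_head(intermediates)  # (B, T_frames, D_face)
        return waveform, face_params

In the second variant the only change is the input to the facial branch, which is what keeps the approach deployable within an existing TTS-centric system.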
Based on these observations, this paper proposes UniTextAudioFace (UniTAF), a joint modeling framework built upon IndexTTS2 [1] and UniTalker [2], enabling the model to generate audio and facial expressions directly from text within a unified process. Furthermore, we explore the transfer of emotion control mechanisms from TTS to facial expression generation, thereby reducing redundant inference of the same high-level information across different modules and providing a more direct implementation path for consistent speech-expression modeling.
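The following sketch illustrates how an emotion control signal can be shared between the two output branches; it is an assumption-laden toy module, not the released UniTAF code, and all layer names and dimensions are placeholders.

import torch
import torch.nn as nn

class SharedEmotionConditioning(nn.Module):
    # Illustrative sketch: one emotion embedding conditions both the speech
    # branch and the facial-parameter branch, so the two modalities share a
    # single high-level control signal.
    def __init__(self, num_emotions: int = 8, emo_dim: int = 256,
                 latent_dim: int = 512, face_dim: int = 52):
        super().__init__()
        self.emotion_embed = nn.Embedding(num_emotions, emo_dim)
        # Hypothetical projections standing in for the real TTS / A2F decoders.
        self.speech_decoder = nn.Linear(latent_dim + emo_dim, latent_dim)
        self.face_decoder = nn.Linear(latent_dim + emo_dim, face_dim)

    def forward(self, tts_latents: torch.Tensor, emotion_id: torch.Tensor):
        # tts_latents: (B, T, latent_dim) intermediate sequence from the TTS stage.
        # emotion_id:  (B,) integer emotion labels used by the TTS emotion control.
        emo = self.emotion_embed(emotion_id)                       # (B, emo_dim)
        emo = emo.unsqueeze(1).expand(-1, tts_latents.size(1), -1)
        cond = torch.cat([tts_latents, emo], dim=-1)               # (B, T, latent_dim + emo_dim)
        acoustic = self.speech_decoder(cond)    # feeds the speech/vocoder path
        face = self.face_decoder(cond)          # feeds the facial-expression path
        return acoustic, face

Because both decoders read the same conditioned sequence, changing emotion_id shifts speech prosody and facial expression together, which is the speech-expression consistency property the joint model targets.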
2 Related Work
In recent years, high-quality text-to-speech (TTS) systems have gradually shifted from end-to-end waveform generation toward hierarchical or intermediate-representation-driven modeling paradigms to improve speech naturalness, controllability, and modeling stability. Representative works such as the Tacotron series [3,4] and the FastSpeech series [5,6] decouple textual modeling from acoustic modeling by introducing intermediate features such as mel-spectrograms, duration, and pitch. More recently, TTS methods based on neural codecs have become a prevailing trend: works such as VALL-E [7] and AudioLM [8] achieve higher fidelity and stronger generation consistency by representing speech with discretized acoustic tokens. Building upon this line of research, IndexTTS2 [1] adopts multi-level acoustic indexing and explicit intermediate-representation modeling, maintaining speech quality while providing stable, structured intermediate speech information for downstream tasks. Compared with modeling directly from text or waveforms, such intermediate representations exhibit stronger structure and interpretability in terms of rhythm, prosody, and speech dynamics.
Although these intermediate representations have been widely used to improve the generation quality of TTS systems themselves, their potential value in cross-modal generation tasks has not yet been systematically explored, particularly in scenarios involving speech-driven facial expression or facial animation generation.
Audio-driven facial animation, or audio-to-face (A2F), aims to generate facial motions that are consistent with the semantics, rhythm, and emotion of the input speech. Early approaches largely relied on handcrafted audio features, such as MFCCs and F0, combined with regression models for prediction [9,10], but exhibited clear limitations in expression naturalness and emotional consistency.
With the advancement of deep learning, end-to-end neural networks have gradually become the dominant approach. Methods such as VOCA [11], FaceFormer [12], and SadTalker [13] directly predict 3D facial parameters from audio or acoustic features, achieving notable improvements in lip synchronization and motion coherence. However, these methods typically rely only on raw audio or shallow acoustic features, making it difficult to explicitly disentangle speech content, speaking rhythm, and higher-level prosodic or emotional information.
UniTalker [2] further proposes a unified speech-to-talker animation modeling framework. By modeling long-term dependencies between audio and facial parameters, it demonstrates strong generalization across multiple facial animation datasets.