FactorPortrait: Controllable Portrait Animation via Disentangled Expression, Pose, and Viewpoint


We introduce FactorPortrait, a video diffusion method for controllable portrait animation that enables lifelike synthesis from disentangled control signals of facial expressions, head movement, and camera viewpoints. Given a single portrait image, a driving video, and camera trajectories, our method animates the portrait by transferring facial expressions and head movements from the driving video while simultaneously enabling novel view synthesis from arbitrary viewpoints. We utilize a pre-trained image encoder to extract facial expression latents from the driving video as control signals for animation generation. Such latents implicitly capture nuanced facial expression dynamics with identity and pose information disentangled, and they are efficiently injected into the video diffusion transformer through our proposed expression controller. For camera and head pose control, we employ Plücker ray maps and normal maps rendered from 3D body mesh tracking. To train our model, we curate a large-scale synthetic dataset containing diverse combinations of camera viewpoints, head poses, and facial expression dynamics. Extensive experiments demonstrate that our method outperforms existing approaches in realism, expressiveness, control accuracy, and view consistency.


💡 Research Summary

FactorPortrait introduces a video diffusion framework for controllable, lifelike portrait animation. The central challenge it addresses is disentangling three components of portrait animation: facial expressions, head pose, and camera viewpoint. Prior methods often entangle these elements, so that changing an expression can inadvertently alter the subject's identity or head position; FactorPortrait instead enables independent, precise control over each factor.

The method rests on a multi-modal control mechanism. For expression transfer, a pre-trained image encoder extracts facial expression latents from the driving video. These latents capture the nuanced dynamics of facial movement while discarding identity and pose information. They are injected into the video diffusion transformer through a dedicated "expression controller," so the generated animation inherits the expression dynamics of the driving source without compromising the identity of the source portrait.
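The paper does not publish the controller's architecture in this summary, but the injection pattern described (external latents conditioning a diffusion transformer without disturbing it at initialization) is commonly realized with cross-attention and a zero-initialized gate. The sketch below is a hypothetical illustration in that style; `ExpressionController`, its dimensions, and the gating scheme are assumptions, not the authors' actual module.

```python
import torch
import torch.nn as nn

class ExpressionController(nn.Module):
    """Hypothetical sketch: inject per-frame expression latents into
    diffusion-transformer tokens via cross-attention. Names and sizes
    are illustrative, not the paper's architecture."""

    def __init__(self, dim: int = 1024, expr_dim: int = 512, heads: int = 8):
        super().__init__()
        self.proj = nn.Linear(expr_dim, dim)  # lift expression latents to model width
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Zero-initialized gate: the injection starts as a no-op, so the
        # pre-trained backbone is undisturbed at the beginning of training
        # (a common trick in ControlNet-style conditioning).
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, tokens: torch.Tensor, expr_latents: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim) video tokens; expr_latents: (B, T, expr_dim)
        ctx = self.proj(expr_latents)
        out, _ = self.attn(self.norm(tokens), ctx, ctx)
        return tokens + self.gate * out
```

Because the gate starts at zero, the module initially passes video tokens through unchanged and learns how strongly to apply the expression signal during fine-tuning.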

To handle head pose and camera movement, the researchers move beyond 2D manipulation and adopt a 3D-aware geometric approach. Leveraging 3D body mesh tracking, the model takes Plücker ray maps and normal maps as control signals. Plücker ray maps encode each pixel's viewing ray (its direction and moment), giving the model an explicit parameterization of camera geometry, while normal maps rendered from the tracked mesh convey surface orientation of the head and body. This geometric grounding enables novel view synthesis: the camera can move around the subject or change angle while structure and texture remain consistent, avoiding the distortions common in 2D-centric diffusion models.
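To make the Plücker representation concrete, the sketch below renders a 6-channel Plücker ray map for a standard pinhole camera: each pixel stores the unit direction of its viewing ray in world space together with the ray's moment (camera center crossed with direction). This is the standard Plücker parameterization; the function name and interface are our own, not from the paper.

```python
import numpy as np

def plucker_ray_map(K: np.ndarray, R: np.ndarray, t: np.ndarray,
                    h: int, w: int) -> np.ndarray:
    """Render an (h, w, 6) Plucker ray map for a pinhole camera.

    K: 3x3 intrinsics; R, t: world-to-camera rotation and translation.
    Each pixel stores (d, m): d is the unit ray direction in world space
    and m = o x d is the ray moment, where o is the camera center.
    """
    # Camera center in world coordinates: o = -R^T t
    o = -R.T @ t
    # Pixel grid sampled at pixel centers
    v, u = np.mgrid[0:h, 0:w].astype(np.float64) + 0.5
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)      # (h, w, 3) homogeneous pixels
    # Back-project to camera-frame rays, then rotate into world frame:
    # d_world = R^T (K^-1 pix), done per pixel via matrix right-multiplication
    dirs = pix @ np.linalg.inv(K).T @ R
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)  # normalize directions
    moments = np.cross(np.broadcast_to(o, dirs.shape), dirs)
    return np.concatenate([dirs, moments], axis=-1)       # (h, w, 6)
```

A map like this is constant per frame for a fixed camera and varies smoothly along a camera trajectory, which is what makes it a convenient dense conditioning signal for a diffusion model.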

A key contribution of this work is the creation of a large-scale synthetic dataset. To train a model capable of managing such complex, multi-modal inputs, the researchers curated a massive dataset containing diverse combinations of camera trajectories, head poses, and facial expression dynamics. This extensive training allows the model to learn the intricate relationships between the control signals and the resulting pixels, ensuring robust generalization across various scenarios.

Extensive experimental evaluations demonstrate that FactorPortrait outperforms existing state-of-the-art approaches across four dimensions: visual realism, expressiveness of facial movement, accuracy in following control signals, and temporal and view consistency. By synthesizing high-quality, controllable, view-consistent portrait animations from a single image, FactorPortrait opens up applications in digital human creation, cinematic visual effects, and immersive virtual reality.

