FastGHA: Generalized Few-Shot 3D Gaussian Head Avatars with Real-Time Animation

Notice: This research summary and analysis were automatically generated using AI technology. For complete accuracy, please refer to the original arXiv source.

Despite recent progress in 3D Gaussian-based head avatar modeling, efficiently generating high-fidelity avatars remains a challenge. Current methods typically rely on extensive multi-view capture setups or monocular videos with per-identity optimization during inference, limiting their scalability and ease of use on unseen subjects. To overcome these efficiency drawbacks, we propose FastGHA, a feed-forward method to generate high-quality Gaussian head avatars from only a few input images while supporting real-time animation. Our approach directly learns a per-pixel Gaussian representation from the input images, and aggregates multi-view information using a transformer-based encoder that fuses image features from both DINOv3 and the Stable Diffusion VAE. For real-time animation, we extend the explicit Gaussian representations with per-Gaussian features and introduce a lightweight MLP-based dynamic network to predict 3D Gaussian deformations from expression codes. Furthermore, to enhance the geometric smoothness of the 3D head, we employ point maps from a pre-trained large reconstruction model as geometry supervision. Experiments show that our approach significantly outperforms existing methods in both rendering quality and inference efficiency, while supporting real-time dynamic avatar animation.


💡 Research Summary

FastGHA introduces a novel feed‑forward pipeline for creating high‑fidelity 3D Gaussian head avatars from only a few input photographs while supporting real‑time facial animation. The method consists of two stages: canonical reconstruction and dynamic deformation.

In the reconstruction stage, each input image—captured from arbitrary viewpoints and expressions—is processed by two frozen, pre‑trained feature extractors: DINOv3 for semantic cues and a Stable Diffusion VAE (SD‑Turbo) for color and texture information. These feature maps are concatenated with Plücker ray‑maps that encode camera geometry, forming a multi‑view token tensor. A multi‑view Vision Transformer (ViT) aggregates cross‑view correspondences, and a modified SD‑Turbo decoder regresses per‑pixel 3D Gaussian parameters (position, color, rotation, scale, opacity). The resulting per‑view Gaussian maps are fused into a single canonical Gaussian cloud, deliberately representing a neutral‑expression head.
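The summary does not spell out how the Plücker ray-maps are built, but the standard parameterization encodes each pixel's ray as the pair (direction, origin × direction), giving the transformer a per-pixel description of camera geometry. A minimal numpy sketch, with a hypothetical toy camera placed at the origin:

```python
import numpy as np

def plucker_raymap(origins, directions):
    """Per-pixel Pluecker coordinates (d, o x d) for camera rays.

    origins, directions: (H, W, 3) arrays of ray origins and directions.
    Returns an (H, W, 6) ray-map ready to concatenate with image features.
    """
    d = directions / np.linalg.norm(directions, axis=-1, keepdims=True)
    moment = np.cross(origins, d)  # the "moment" part of the Pluecker pair
    return np.concatenate([d, moment], axis=-1)

# toy 2x2 "image" of rays from a pinhole camera at the origin looking down +z
H, W = 2, 2
gx, gy = np.meshgrid(np.linspace(-0.1, 0.1, W), np.linspace(-0.1, 0.1, H))
dirs = np.stack([gx, gy, np.ones((H, W))], axis=-1)
origins = np.zeros((H, W, 3))

raymap = plucker_raymap(origins, dirs)
print(raymap.shape)  # (2, 2, 6)
```

In the pipeline described above, such a 6-channel map would be concatenated channel-wise with the DINOv3 and SD-Turbo feature maps before tokenization; the camera model here is illustrative only.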

For animation, each Gaussian is augmented with a 32‑dimensional per‑Gaussian feature vector. A lightweight multilayer perceptron (MLP) receives the canonical Gaussian set together with a FLAME expression code (z_exp) and predicts per‑Gaussian offsets for position and color. Because the MLP operates independently on each Gaussian, the deformation can be computed in parallel, enabling >30 fps rendering on a modern GPU. The deformed Gaussians are rasterized using the standard differentiable 3D Gaussian splatting pipeline, producing images, alpha masks, and depth maps from arbitrary viewpoints.
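Because the deformation MLP shares its weights across all Gaussians, the whole update is one batched matrix multiply. The sketch below (hypothetical dimensions and random weights; the paper's actual architecture is not detailed in the summary) shows how per-Gaussian features and a FLAME expression code could be mapped to position and color offsets in parallel:

```python
import numpy as np

rng = np.random.default_rng(0)

N, FEAT, EXP = 1000, 32, 10          # Gaussians, per-Gaussian feature dim, expression dim
positions = rng.normal(size=(N, 3))  # canonical Gaussian centers
colors = rng.uniform(size=(N, 3))
features = rng.normal(size=(N, FEAT))  # learned 32-d per-Gaussian features
z_exp = rng.normal(size=(EXP,))        # FLAME expression code for the target frame

# tiny 2-layer MLP with shared weights, applied to every Gaussian at once
W1 = rng.normal(scale=0.1, size=(FEAT + EXP, 64)); b1 = np.zeros(64)
W2 = rng.normal(scale=0.01, size=(64, 6)); b2 = np.zeros(6)  # 3 pos + 3 color offsets

x = np.concatenate([features, np.broadcast_to(z_exp, (N, EXP))], axis=-1)
h = np.maximum(x @ W1 + b1, 0.0)     # ReLU
offsets = h @ W2 + b2

deformed_pos = positions + offsets[:, :3]
deformed_col = np.clip(colors + offsets[:, 3:], 0.0, 1.0)
print(deformed_pos.shape, deformed_col.shape)  # (1000, 3) (1000, 3)
```

Since every Gaussian is processed independently, this step is trivially parallelizable on a GPU, which is what makes the >30 fps animation budget feasible.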

Training combines photometric supervision (L1 RGB loss, SSIM, perceptual loss, silhouette loss) with a geometry prior derived from a large reconstruction model, VGGT. VGGT supplies point‑wise geometry maps that are used as a regularization term, encouraging the learned Gaussian cloud to align with realistic facial topology and to remain smooth. This prior is applied only during training, avoiding any extra inference cost.
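The training objective is a weighted sum of these terms. A minimal numpy sketch of how such a composite loss could be assembled — the weights are hypothetical, and the SSIM and perceptual terms are omitted here since they require dedicated implementations:

```python
import numpy as np

def l1_loss(pred, gt):
    """Photometric L1 loss between rendered and ground-truth RGB."""
    return np.abs(pred - gt).mean()

def silhouette_loss(pred_alpha, gt_mask):
    """L1 loss between the rendered alpha map and the ground-truth mask."""
    return np.abs(pred_alpha - gt_mask).mean()

def geometry_loss(pred_points, prior_points):
    """Regularizer aligning the Gaussian point map with the VGGT prior."""
    return np.square(pred_points - prior_points).mean()

rng = np.random.default_rng(0)
pred_rgb, gt_rgb = rng.uniform(size=(64, 64, 3)), rng.uniform(size=(64, 64, 3))
pred_a = rng.uniform(size=(64, 64))
gt_m = rng.integers(0, 2, size=(64, 64)).astype(float)
pred_pts, prior_pts = rng.normal(size=(64, 64, 3)), rng.normal(size=(64, 64, 3))

# hypothetical loss weights; the paper's actual values are not given in the summary
total = (1.0 * l1_loss(pred_rgb, gt_rgb)
         + 0.1 * silhouette_loss(pred_a, gt_m)
         + 0.05 * geometry_loss(pred_pts, prior_pts))
print(float(total) > 0.0)  # True
```

Note that only the photometric and silhouette terms correspond to rendered outputs; the geometry term uses the VGGT point maps purely as a training-time prior, so it adds no cost at inference.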

Extensive experiments on large multi‑view head video datasets demonstrate that FastGHA outperforms state‑of‑the‑art feed‑forward methods such as Avat3r and Facelift in PSNR, SSIM, and LPIPS, while reducing reconstruction time to under one second and achieving real‑time animation (≤30 ms per frame). User studies confirm superior visual quality and responsiveness.

Key contributions are: (1) a few‑shot, feed‑forward architecture that directly predicts per‑pixel 3D Gaussian representations, eliminating the need for per‑identity optimization; (2) an efficient per‑Gaussian feature and MLP‑based deformation network that enables real‑time, expression‑driven animation; (3) the integration of a large‑scale geometry prior as a training‑time regularizer to improve shape consistency and robustness.

FastGHA thus provides a practical solution for real‑time, high‑quality digital human creation, opening avenues for AR/VR avatars, live streaming, and interactive gaming where rapid avatar generation and on‑the‑fly facial animation are essential.

