Relightable and Dynamic Gaussian Avatar Reconstruction from Monocular Video

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

Modeling relightable and animatable human avatars from monocular video is a long-standing and challenging task. Recently, Neural Radiance Field (NeRF) and 3D Gaussian Splatting (3DGS) methods have been employed to reconstruct such avatars. However, they often produce unsatisfactory photo-realistic results because of insufficient geometric detail related to body motion, such as clothing wrinkles. In this paper, we propose a 3DGS-based human avatar modeling framework, termed Relightable and Dynamic Gaussian Avatar (RnD-Avatar), that produces accurate pose-dependent deformation for high-fidelity geometric detail. To achieve this, we introduce dynamic skinning weights that define the human avatar's articulation based on pose while also learning additional deformations induced by body motion. We also introduce a novel regularization to capture fine geometric detail under sparse visual cues. Furthermore, we present a new multi-view dataset with varied lighting conditions to evaluate relighting. Our framework enables realistic rendering of novel poses and views while supporting photo-realistic lighting effects under arbitrary lighting conditions. Our method achieves state-of-the-art performance in novel view synthesis, novel pose rendering, and relighting.


💡 Research Summary

This paper introduces “Relightable and Dynamic Gaussian Avatar (RnD-Avatar),” a novel framework for reconstructing high-quality, animatable, and relightable 3D human avatars from monocular video input. The work addresses significant limitations in existing avatar modeling techniques, particularly the inability to capture fine-grained geometric details like clothing wrinkles that are dependent on body motion, and the challenges of learning accurate geometry from sparse visual cues in single-view videos.

The proposed method builds upon the 3D Gaussian Splatting (3DGS) representation, which offers fast rendering and explicit geometric control compared to slower, implicit Neural Radiance Field (NeRF)-based approaches. RnD-Avatar’s core technical innovations are threefold. First, it proposes Dynamic Skinning Weights. Unlike prior 3DGS-based methods that use static skinning weights regressed from canonical space positions, RnD-Avatar employs an encoder network that conditions the skinning weights on a sequence of body poses. This encoder utilizes temporal and spatial attention mechanisms to capture both global motion dynamics and local joint movements. The resulting pose-variant weights enable more accurate articulation and, crucially, learn additional non-rigid deformations induced by specific body motions, leading to superior modeling of dynamic geometric details.
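The mechanics of pose-conditioned skinning can be illustrated with a minimal sketch. The paper describes an attention-based encoder over a pose sequence; here a single linear map stands in for that encoder (an assumption), and the resulting correction offsets per-point skinning logits before linear blend skinning (LBS) is applied:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_skinning_weights(static_logits, pose_seq, W_pose):
    """Toy pose-conditioned skinning weights.
    static_logits: (N, J) per-point logits regressed in canonical space.
    pose_seq: (T, D) history of body poses.
    W_pose: (J, T*D) linear stand-in for the paper's attention encoder
    (a simplifying assumption, not the actual architecture)."""
    delta = W_pose @ pose_seq.reshape(-1)          # (J,) pose-dependent correction
    return softmax(static_logits + delta, axis=-1)  # rows sum to 1

def lbs(points_canonical, weights, joint_transforms):
    """Linear blend skinning: blend per-joint rigid 4x4 transforms."""
    N = points_canonical.shape[0]
    homog = np.concatenate([points_canonical, np.ones((N, 1))], axis=1)  # (N, 4)
    T = np.einsum('nj,jab->nab', weights, joint_transforms)  # (N, 4, 4)
    return np.einsum('nab,nb->na', T, homog)[:, :3]
```

Because the correction depends on the pose history rather than only the canonical position, the same Gaussian can deform differently under different motions, which is what allows motion-dependent detail such as wrinkles to emerge.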

Second, to mitigate the under-constrained nature of monocular reconstruction, the authors introduce a novel Geometric Consistency Regularization loss. This loss operates on feature maps extracted from rendered images at different frames. It encourages feature vectors sampled from corresponding spatial locations (positive pairs) to be similar, while pushing apart features from different locations (negative pairs). This regularization enforces consistency in the geometric representation of the same surface point across different viewpoints, thereby stabilizing and improving the learning of depth-related attributes like surface normals, which are crucial for relighting.

Third, the framework incorporates a Physically-Based Rendering (PBR) pipeline to achieve relightability under arbitrary illumination. The Gaussian attributes are disentangled into geometry (opacity, rotation, scale, normal) and appearance (albedo, roughness, Spherical Harmonics coefficients for color and view-dependent visibility). A learnable environment light map and the Disney BRDF model are used for shading. Visibility is modeled in canonical space via a lightweight MLP to maintain consistency across poses. The training is conducted in two stages: first optimizing the geometry attributes and dynamic skinning network, then fine-tuning the appearance attributes and lighting.
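To make the disentangled-attribute shading concrete, here is a minimal per-point shading sketch. It uses Lambertian diffuse plus a Blinn-Phong-style specular lobe driven by roughness, not the full Disney BRDF, and omits the paper's learned visibility MLP and environment map (all simplifying assumptions):

```python
import numpy as np

def shade(albedo, roughness, normal, view_dir, light_dir, light_rgb):
    """Minimal physically-based shading sketch (NOT the Disney BRDF):
    Lambertian diffuse + Blinn-Phong specular with a roughness-driven
    exponent, for a single directional light. Visibility/occlusion and
    environment lighting from the paper are omitted."""
    n = normal / np.linalg.norm(normal)
    v = view_dir / np.linalg.norm(view_dir)
    l = light_dir / np.linalg.norm(light_dir)
    n_dot_l = np.dot(n, l)
    if n_dot_l <= 0.0:          # light below the surface: no contribution
        return np.zeros(3)
    diffuse = albedo / np.pi * n_dot_l
    h = (v + l) / np.linalg.norm(v + l)             # half vector
    shininess = 2.0 / max(roughness ** 2, 1e-4) - 2.0
    specular = max(np.dot(n, h), 0.0) ** shininess * n_dot_l
    return (diffuse + specular) * light_rgb
```

Because geometry (normal) and appearance (albedo, roughness) enter the shading as separate inputs, swapping the light direction or color relights the point without retraining appearance, which is the point of the disentanglement.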

A significant additional contribution is the creation of a new multi-view video dataset captured under varied colored lighting conditions (e.g., red, green, blue lights). This dataset addresses the lack of ground-truth data for quantitatively evaluating relighting performance in human avatar modeling.

Comprehensive experiments demonstrate that RnD-Avatar achieves state-of-the-art performance across multiple tasks: novel-view synthesis, novel-pose animation, and relighting. It outperforms existing NeRF-based and 3DGS-based methods on standard metrics (PSNR, SSIM, LPIPS) and produces visually superior results with finer details and more realistic lighting effects. Ablation studies confirm the effectiveness of both the dynamic skinning weights and the geometric consistency regularization. The work provides a practical and high-quality solution for creating dynamic, relightable avatars from easily accessible monocular video, while also contributing a valuable benchmark dataset to the research community.

