Detail Enhanced Gaussian Splatting for Large-Scale Volumetric Capture
📝 Abstract
We present a unique system for large-scale, multi-performer, high-resolution 4D volumetric capture that provides realistic free-viewpoint video up to and including 4K-resolution facial closeups. To achieve this, we employ a novel volumetric capture, reconstruction, and rendering pipeline based on Dynamic Gaussian Splatting and Diffusion-based Detail Enhancement. We design our pipeline specifically to meet the demands of high-end media production. We employ two capture rigs: the Scene Rig, which captures multi-actor performances at a resolution that falls short of 4K production quality, and the Face Rig, which records high-fidelity single-actor facial detail to serve as a reference for detail enhancement. We first reconstruct dynamic performances from the Scene Rig using 4D Gaussian Splatting, incorporating new model designs and training strategies to improve reconstruction, dynamic range, and rendering quality. Then, to render high-quality images for facial closeups, we introduce a diffusion-based detail enhancement model. This model is fine-tuned with high-fidelity data from the same actors recorded in the Face Rig. We train on paired data generated from low- and high-quality Gaussian Splatting (GS) models, using the low-quality input to match the quality of the Scene Rig and the high-quality GS renderings as ground truth. Our results demonstrate the effectiveness of this pipeline in bridging the gap between the scalable performance capture of a large-scale rig and the high-resolution standards required for film and media production.
📄 Content
1 Introduction

4D volumetric performance capture systems are being leveraged with increasing frequency in media production, including for film and television, where 4K-resolution output is a requirement. Film and TV applications also introduce the need to capture the interaction of multiple actors over an extended area, and to produce recordings that appear high-resolution in wide, medium, and closeup shots. Placing the cameras around a larger performance area (ours is 6 m × 9 m) increases their distance from the subjects, which makes it harder to capture high-resolution details of the dynamic performances.
In this paper, we present a novel volumetric recording, reconstruction, and detail enhancement pipeline designed to address these challenges. Our approach leverages two complementary physical capture rigs built at different scales. The Scene Rig is designed for multi-view, multi-actor performance capture, enabling high-quality reconstructions, though not at sufficient resolution to render production-quality facial closeups. The Face Rig records the head of each actor with production-quality resolution for closeups, but cannot capture full-body performances.
We first reconstruct performances of actors captured by the Scene Rig using a novel Gaussian Splatting-based approach optimized for this capture setup. This approach integrates a temporally stable camera calibration method and an HDR-aware 4D Gaussian Splatting method that accounts for practical capture conditions: our rendering pipeline incorporates carefully designed components and training strategies for color, exposure, and black levels, ensuring the color fidelity that production requires.
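The paper does not specify the exact parameterization of the per-camera color handling; a minimal sketch, assuming a learned per-camera affine model (exposure gain, 3×3 color matrix, black-level offset) applied to the linear HDR rendering before the photometric loss is computed:

```python
import numpy as np

def apply_camera_response(rendered_rgb, exposure, black_level, color_matrix):
    """Map a linear HDR rendering into one camera's color space.

    All parameter names here are illustrative assumptions, not the paper's API.
    rendered_rgb : (H, W, 3) linear-radiance image from the splatting renderer
    exposure     : scalar per-camera exposure gain (learned)
    black_level  : (3,) per-camera black-level offset (learned)
    color_matrix : (3, 3) per-camera color correction matrix (learned)
    """
    img = rendered_rgb * exposure   # exposure compensation
    img = img @ color_matrix.T      # color-space alignment across cameras
    img = img + black_level         # sensor black-level offset
    return np.clip(img, 0.0, 1.0)   # match the range of the ground-truth frames
```

Fitting these few parameters per camera jointly with the Gaussians lets one shared HDR scene representation explain footage from cameras with differing exposures and color responses.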
Next, to bridge the quality gap between the Scene Rig captures and production resolution (especially for close-ups), we introduce a detail enhancement diffusion model. We modify a pre-trained image generation diffusion model to support conditioning, to be temporally stable, and to jointly generate RGB and alpha channels. We fine-tune this model on high-fidelity Face Rig data of the actors who performed in the Scene Rig. Specifically, we use paired RGBA sequences of low- and high-quality renderings, obtained from pairs of low- and high-quality Dynamic Gaussian Splatting models. We limit the Gaussian count in the low-quality models to mimic the Scene Rig's quality, with the high-quality renderings serving as ground truth. We demonstrate our method on several sequences of three sub-groups of actors with various novel camera paths, including facial close-ups that significantly exceed the quality of the original Scene Rig capture. We demonstrate the importance of our system components through a set of baseline and ablation comparisons.
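The paired-data construction can be sketched as follows. The paper states only that the Gaussian count of the low-quality model is limited; the opacity-based selection and the function names below are illustrative assumptions:

```python
import numpy as np

def make_training_pair(gaussians, budget, render_fn, camera):
    """Render a (low-quality, high-quality) image pair from one GS model.

    gaussians : dict with a per-Gaussian 'opacity' array (N,) plus other
                per-Gaussian attributes of shape (N, ...)
    budget    : max Gaussian count for the degraded model (selection by
                opacity is a hypothetical criterion, not from the paper)
    render_fn : callable(gaussians, camera) -> (H, W, 4) RGBA rendering
    """
    keep = np.argsort(gaussians["opacity"])[::-1][:budget]  # most opaque first
    low = {k: v[keep] for k, v in gaussians.items()}        # capped model
    return render_fn(low, camera), render_fn(gaussians, camera)
```

Rendering both models from the same camera yields aligned input/target RGBA pairs, so the diffusion model learns to hallucinate only the detail lost to the reduced Gaussian budget rather than to correct pose or viewpoint errors.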
To summarize, the main contributions of this work are as follows:
• A two-stage approach to performance capture, combining a scene-scale capture rig and a single-actor facial capture rig.

2 Related Work

Rendering photorealistic, view-controllable human performances from volumetric capture remains an active research area. Pioneering works focus on reconstructing 3D meshes from multi-view setups, addressing facial performance [Beeler et al. 2011; Fyffe et al. 2011; Guenter et al. 1998] and full-body geometry [Ahmed et al. 2008; Cagniart et al. 2010; de Aguiar et al. 2008; Kanade et al. 1997; Vlasic et al. 2008, 2009]. Some methods rely on template priors such as shape-from-silhouettes [Ahmed et al. 2008; Vlasic et al. 2008], or track and estimate the performer's deforming geometry using canonical or reference geometry [Beeler et al. 2011; Cagniart et al. 2010; de Aguiar et al. 2008; Vlasic et al. 2009]. Others leverage various illumination patterns to capture both geometry and reflectance information [Einarsson et al. 2006; Fyffe et al. 2011; Guo et al. 2019]. However, these approaches either rely on parameterized templates or fail to capture detailed geometry and appearance.
To reconstruct more details from multi-view videos, subsequent works propose using IR video cameras [Collet et al. 2015;D