Figure 1. MatSpray Overview. We utilize 2D material world knowledge from 2D diffusion models to reconstruct 3D relightable objects. Given multi-view images of a target object, we first generate per-view PBR material predictions (base color, roughness, metallic) using any 2D diffusion-based material model. These 2D estimates are then integrated into a 3D Gaussian Splatting reconstruction via Gaussian ray tracing. Finally, a neural refinement stage applies a softmax-based restriction to enforce multi-view consistency and enhance the physical accuracy of the materials. The resulting 3D assets feature high-quality, fully relightable PBR materials under novel illumination.
We begin by reconstructing scene geometry via Gaussian Splatting. From the input images, a diffusion model generates 2D maps for albedo, roughness, and metallic parameters. Any existing diffusion model that can convert images or videos to PBR materials can be applied. The predictions are then integrated into the 3D representation either by optimizing an image-based loss or by directly projecting the material parameters onto the Gaussians using Gaussian ray tracing. To enhance fine-scale accuracy and multi-view consistency, we further introduce a lightweight neural refinement step (Neural Merger), which takes ray-traced material features as input and produces detailed adjustments. Our results demonstrate that the proposed methods outperform existing techniques in both quantitative metrics and perceived visual realism, enabling more accurate, relightable, and photorealistic renderings.
Editing and relighting real scenes captured with casual cameras is central to many vision and graphics applications. While modern neural 3D reconstruction methods can produce impressive geometry and appearance from images, they often entangle illumination with appearance, yielding textures or coefficients that are not physically meaningful for relighting. Classical inverse rendering requires strong assumptions about lighting and exposure and remains fragile when materials vary spatially. In parallel, recent 2D material predictors learn rich priors from large-scale data and can produce plausible material maps from images, yet they operate in 2D and are not directly consistent across views or attached to a 3D representation.
We introduce a method to transfer 2D material predictions onto a 3D Gaussian representation to obtain relightable assets with spatially varying base color, roughness, and metallic parameters. The approach projects 2D material maps to 3D via efficient ray-traced assignment, refines materials with a small MLP to reduce multi-view inconsistencies, and supervises rendered material maps directly with the 2D predictions to preserve plausible priors while discouraging baked-in lighting. This combination yields cleaner albedo, more accurate roughness, and informed metallic estimates, enabling higher-quality relighting compared to pipelines that learn only appearance. Our contributions are:
• World Material Fusion. A plug-and-play pipeline that, to our knowledge, is the first to fuse swappable diffusion-based 2D PBR priors ("world material knowledge") with 3D Gaussian material optimization via Gaussian ray tracing and PBR-consistent supervision to obtain relightable assets.
• Neural Merger. A softmax neural merger that aggregates per-Gaussian, multi-view material estimates, suppresses baked-in lighting, and enforces cross-view consistency while stabilizing joint environment map optimization.
• Faster Reconstruction. A simple projection and optimization scheme that reconstructs high-quality relightable 3D materials with 3.5× less per-scene optimization time than IRGS [11].
Materials Spatially varying BRDFs (svBRDFs) have long been studied, with early high-resolution texel-based representations enabling point-wise material parameterization [25,44]. In this work, we adopt a Cook-Torrance variant, which is based on the widely used Disney principled BRDF and real-time formulations in major engines [5,19,23]. Building on this foundation, recent research has explored richer material parameterizations and extended the expressiveness of svBRDF models [12].
Diffusion Diffusion models enable high-fidelity image synthesis and conditioning, with efficient latent-space formulations and extensions to video generation [2,13,14,42].
For material estimation from 2D images, large-scale diffusion priors have been used to infer PBR maps [8,20,27,55,59]. Of particular relevance is DiffusionRenderer by Huang et al. [16], whose results on high-quality material maps motivated this work. Diffusion approaches have been further explored for related tasks such as HDR prediction, texture estimation, and relighting [1,30,54].
We exploit Gaussian Ray Tracing as a mechanism for transferring 2D material data to 3D. Although the field remains in its early stages, emerging works have explored stochastic and explicit Gaussian ray tracing methods [34,39,47] and related neural optimization schemes [10]. Our formulation builds upon insights from Mai et al. [34] and Moenne-Loccoz et al. [39], extending them toward material-aware 3D reconstruction.
Novel View Synthesis and Scene Reconstruction Neural representations for novel view synthesis and scene reconstruction have rapidly advanced 3D modeling, with NeRF and Gaussian Splatting providing strong foundations for radiance-based scene representations [21,37]. Material modeling atop radiance fields has been explored through inverse rendering and relighting [3,4,43,57], where coupling with signed distance fields (SDFs) improves physical plausibility [36,50]. Gaussian-based inverse rendering approaches such as R3DGS [9] reconstruct materials via per-scene optimization, while IRGS [11] extends this with 2D Gaussians and deferred shading for improved appearance modeling. Complementary efforts leverage diffusion models for 3D reconstruction and sparse-view recovery [31,38], and hybrid cues or mesh integration further enhance geometric fidelity [28,29,48,52]. Beyond these, extensions of 2D and 3D Gaussians have enabled spatially varying materials, advanced relighting, and richer reflectance modeling [6,7,15,18,22,24,26,32,33,46,51,53,58].
In contrast to R3DGS and IRGS, which optimize material and geometry parameters per scene, our approach employs Gaussian Ray Tracing to lift 2D material estimates by diffusion models into 3D representations, exploiting the world-knowledge priors learned by the diffusion models and the broader understanding of material behavior they provide. This enables faster, more consistent reconstruction and material reasoning across scenes.

Figure 2. Pipeline. From multi-view images, a diffusion predictor yields per-view material maps. We reconstruct the object's geometry using 3D Gaussian Splatting. Then we project 2D materials to 3D via ray tracing, and refine per-Gaussian materials with our Neural Merger, which has a softmax output layer and chooses between the projected values. We then supervise the produced material maps using the predicted 2D material maps. Additionally, using deferred shading, we supervise with a PBR-based photometric rendering loss against the multi-view ground-truth images of the object.
Our method recovers consistent 3D PBR materials from multiple views by combining 2D diffusion predictions with a 3D Gaussian representation. We first obtain material maps (base color, roughness, metallic) per view from any diffusion material predictor, making our approach compatible with a wide range of existing and future diffusion models. Scene geometry is reconstructed via a relightable Gaussian Splatting pipeline (R3DGS) [9,21], which provides both geometry and normals. The 2D material estimates are then transferred to 3D and jointly refined for multi-view consistency by a newly introduced Neural Merger step. The material maps and normals are further refined based on a rendering loss, evaluated with deferred rendering. The illumination is modeled by an optimizable environment map during refinement.
Specifically, we (1) lift 2D materials to 3D via Gaussian ray tracing, (2) refine per-Gaussian material parameters with the Neural Merger across views, and (3) supervise rendered material maps to preserve plausible 2D priors while suppressing baked-in lighting.
We leverage pretrained diffusion priors to predict physically meaningful per-view material maps, enabling accurate reconstruction of 3D PBR materials. Each material channel, base color (three channels), roughness (single channel), and metallic (single channel), is inferred explicitly. In practice, we evaluate multiple prebuilt diffusion-based predictors and select the one that provides the best fidelity and consistency balance for our data. In our experiments, this is DiffusionRenderer [27]. We also tested Marigold [20] and RGB-to-X [55]. We choose DiffusionRenderer because it achieves roughly 30% higher PSNR than the other methods.
DiffusionRenderer predicts material maps from short frame batches and uses a limited temporal context to improve consistency and material understanding within each batch. While this improves local consistency, the predicted materials in a batch can still differ across views. The model cannot recover the complete environment illumination from a small input batch, which may lead to small shifts in color, roughness, or metallic appearance within a batch. In addition, results from separate batches may not align, as these images may introduce new information that was not visible in previous batches. These variations make a direct projection of the predicted maps unreliable and often result in blurry and washed out material maps. Our method resolves this by refining and merging the estimates into one consistent 3D representation.
For each view $v_i$, we collect the material attributes (base color, roughness, metallic) corresponding to every Gaussian $g$ from the pixels within its projected footprint $fp_{g,v_i}$ using Gaussian ray tracing, following the formulation of Mai et al. [35]. Their approach determines each Gaussian or ellipsoid's contribution to a ray based on density. Because the opacity $\alpha$ used in Gaussian Splatting [21] does not directly correspond to a physical density, we adopt the formulation by Moenne-Loccoz et al. [39], which allows the direct use of the Gaussian Splatting opacity $\alpha$ in ray tracing.
For a Gaussian with mean $\mu$ and covariance $\Sigma$, the point of maximum response $x_{\max}$ along a ray with origin $o$ and direction $d$ is
$$x_{\max} = o + t_{\max}\, d, \qquad t_{\max} = \frac{(\mu - o)^{\top} \Sigma^{-1} d}{d^{\top} \Sigma^{-1} d}. \tag{1}$$
The corresponding opacity $\alpha_{\max}$, given a base opacity $\alpha$ and a falloff parameter $\lambda > 0$, is
$$\alpha_{\max} = \alpha \, \exp\!\left(-\lambda\, (x_{\max} - \mu)^{\top} \Sigma^{-1} (x_{\max} - \mu)\right). \tag{2}$$
Material values per pixel $m_p$ are then assigned to the Gaussians intersected by the ray and aggregated across all pixels in each Gaussian's footprint $fp_{g,v_i}$ per view. To reduce color distortion from outliers and overlapping footprints, we compute a median of all assigned material parameters $m_{g,v_i}$ per Gaussian:
$$m_{g,v_i} = \operatorname{median}_{\,p \in fp_{g,v_i}}\, m_p.$$
After computing the median for each view, the resulting values are assigned to their corresponding Gaussians. Gaussians not intersected in any view are removed. Grid-based supersampling per pixel is used to ensure stable Gaussian hits.
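For concreteness, below is a minimal per-ray sketch of this projection step, assuming the closed forms written above and simple Python dictionaries for bookkeeping; the actual implementation relies on OptiX-based Gaussian ray tracing, and all function names here are illustrative:

```python
import torch

def max_response_point(o, d, mu, sigma_inv):
    """Point of maximum Gaussian response along the ray x(t) = o + t * d (cf. Eq. 1)."""
    t_max = ((mu - o) @ sigma_inv @ d) / (d @ sigma_inv @ d)
    return o + t_max * d

def max_opacity(x_max, mu, sigma_inv, alpha, lam):
    """Opacity at the maximum-response point with falloff parameter lam > 0 (cf. Eq. 2)."""
    diff = x_max - mu
    return alpha * torch.exp(-lam * (diff @ sigma_inv @ diff))

def aggregate_per_gaussian(hits):
    """Median of all per-pixel material values assigned to each Gaussian in one view.

    hits: dict mapping a Gaussian id to the list of per-pixel material tensors m_p
    that fell inside its footprint.
    """
    return {g: torch.stack(vals).median(dim=0).values for g, vals in hits.items()}
```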
For each Gaussian g, we now obtain arrays of material estimates across all views:
$$\text{roughness}_g = \{\, r_{g,1}, r_{g,2}, \dots, r_{g,n} \,\},$$
and analogously for the base color and metallic channels.
These arrays contain the inconsistent material values per view produced by DiffusionRenderer. These inconsistencies motivate the subsequent Neural Merger step, which refines the estimates into a coherent 3D representation.
To reduce inconsistencies across views, we introduce the Neural Merger, which predicts weights per view for the material parameters collected during the projection step for each Gaussian. It fuses the predictions into a single, consistent estimate. The key idea is to interpolate between the predicted values rather than allowing the network to freely generate new colors or material values. This ensures that the merged results remain consistent with the world knowledge captured by the diffusion priors while enforcing coherence across views.
For each Gaussian $g$, the Neural Merger takes as input the projected material values $m_{g,v}$ for all views $v \in \{1, \dots, V\}$, along with the Gaussian position $p_g$, encoded using a positional encoding $\gamma(\cdot)$. The input is processed by a lightweight MLP $f_\theta$ to produce unnormalized weights $h_{g,v}$ for each view:
$$\{h_{g,v}\}_{v=1}^{V} = f_\theta\!\left(\gamma(p_g),\, m_{g,1}, \dots, m_{g,V}\right).$$
A softmax function then converts these outputs into normalized weights:
$$w_{g,v} = \frac{\exp(h_{g,v})}{\sum_{v'=1}^{V} \exp(h_{g,v'})}.$$
The merged material $m_g$ for the Gaussian is computed as the weighted sum of the per-view predictions:
$$m_g = \sum_{v=1}^{V} w_{g,v}\, m_{g,v}.$$
The Neural Merger is optimized during the refinement explained in the next section. Using the softmax weighting is crucial. Without it, the merger can converge faster than the environment map optimization, producing unrealistic material values that match the ground truth only superficially. By interpolating among the predicted values, the Neural Merger produces physically plausible material estimates while allowing the environment map to converge reliably. In our framework, we use a separate Neural Merger for each material channel, enabling improved disentanglement of the materials.
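As an illustration, the following is a sketch of one per-channel Neural Merger consistent with the description above; the hidden sizes follow the implementation details reported later, while the number of positional-encoding frequencies and the exact input layout are assumptions of this sketch:

```python
import math
import torch
import torch.nn as nn

class NeuralMerger(nn.Module):
    """Sketch of one per-channel Neural Merger: predicts softmax weights over the
    V per-view estimates of each Gaussian and returns their weighted sum."""

    def __init__(self, num_views, channels, pe_freqs=6):
        super().__init__()
        self.pe_freqs = pe_freqs                          # number of encoding frequencies (assumed)
        in_dim = 3 * 2 * pe_freqs + num_views * channels  # encoded position + projected materials
        self.mlp = nn.Sequential(                         # 3 hidden layers of 128 with ReLU
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, num_views),                    # unnormalized per-view weights h_{g,v}
        )

    def positional_encoding(self, p):                     # p: (G, 3) Gaussian positions
        freqs = 2.0 ** torch.arange(self.pe_freqs, device=p.device, dtype=p.dtype) * math.pi
        angles = p.unsqueeze(-1) * freqs                  # (G, 3, F)
        return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(1)

    def forward(self, positions, per_view_materials):     # materials: (G, V, C)
        G, V, C = per_view_materials.shape
        x = torch.cat([self.positional_encoding(positions),
                       per_view_materials.reshape(G, V * C)], dim=-1)
        w = torch.softmax(self.mlp(x), dim=-1)            # (G, V), sums to one over views
        return (w.unsqueeze(-1) * per_view_materials).sum(dim=1)  # merged materials (G, C)
```

Because the output is a convex combination of the projected inputs, the merged value can never leave the range spanned by the diffusion predictions, which is exactly the restriction the softmax is meant to enforce.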
The Neural Merger produces material values per Gaussian, which are then rasterized into material maps. These maps are iteratively refined using two complementary supervised losses. First, we supervise the rendered material maps against the diffusion model's 2D material predictions using an $L_1$ loss. This loss is applied exclusively to the material parameters, thereby optimizing only the Neural Merger. Given the rendered material maps $M_{\text{render}}$ and the diffusion-predicted material maps $M_{\text{2D}}$, the material supervision loss $\mathcal{L}_{\text{Image}}$ is defined as:
$$\mathcal{L}_{\text{Image}} = \left\| M_{\text{render}} - M_{\text{2D}} \right\|_1.$$
Second, the rasterized materials are used for deferred shading to generate a physically based rendering (PBR) image. This rendered image is then compared to the ground-truth input using the loss introduced in Gaussian Splatting [21]. The rendering supervision loss is defined as:
$$\mathcal{L}_{\text{PBR}} = \lambda\, \mathcal{L}_1\!\left(I_{\text{PBR}}, I_{\text{GT}}\right) + (1 - \lambda)\, \mathcal{L}_{\text{D-SSIM}}\!\left(I_{\text{PBR}}, I_{\text{GT}}\right), \tag{11}$$
where $I_{\text{PBR}}$ denotes the rendered image, $I_{\text{GT}}$ is the ground-truth image, and $\lambda \in [0, 1]$ (typically set to 0.8). This loss supervises both the Neural Merger and the environment map estimation, ensuring that the final rendering is consistent with the input views.

Figure 3. Relighting Comparison between our method, an extended version of R3DGS [9], and IRGS [11]. The objects are all relit under the same environment maps. In IRGS, reconstructed scene geometry might partially occlude the environment map.
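The two supervision terms can be sketched as follows, assuming Eq. (11) takes the blended L1/D-SSIM form given above; `ssim_fn` is a placeholder for any differentiable SSIM routine and is not part of the original description:

```python
import torch
import torch.nn.functional as F

def material_loss(m_render, m_2d):
    """L_Image: L1 between rendered material maps and the diffusion-predicted 2D maps."""
    return F.l1_loss(m_render, m_2d)

def rendering_loss(i_pbr, i_gt, ssim_fn, lam=0.8):
    """PBR photometric loss in the spirit of the 3DGS loss: L1 blended with D-SSIM."""
    l1 = F.l1_loss(i_pbr, i_gt)
    d_ssim = 1.0 - ssim_fn(i_pbr, i_gt)
    return lam * l1 + (1.0 - lam) * d_ssim

def total_loss(m_render, m_2d, i_pbr, i_gt, ssim_fn, w_material=1.0):
    # w_material = 1.0 matches the material supervision weight reported in the implementation details
    return w_material * material_loss(m_render, m_2d) + rendering_loss(i_pbr, i_gt, ssim_fn)
```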
We evaluate our method on both synthetic and real-world datasets from the Navi dataset [17], comparing against state-of-the-art approaches for 3D material estimation and relighting, specifically an extended version of R3DGS [9] (modified to support metallic materials) and IRGS [11]. Our evaluation includes qualitative comparisons of material maps and relighting quality, quantitative metrics across material channels, computational performance, and an ablation study.
All methods use the same experimental setup for fair comparison. Synthetic scenes are initialized with a unit cube point cloud containing 100,000 points sampled uniformly at random. Real-world scenes use structure-from-motion reconstruction from COLMAP [45] for both initialization and camera inference.
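For reference, a minimal sketch of the synthetic-scene initialization, under the assumption that the unit cube is centered at the origin:

```python
import numpy as np

# 100,000 points sampled uniformly at random inside a unit cube; centering the cube
# at the origin is an assumption of this sketch.
init_points = np.random.uniform(low=-0.5, high=0.5, size=(100_000, 3))
```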
We evaluate on 17 synthetic objects with ground-truth material maps for quantitative analysis. The synthetic dataset uses 100 training images and 200 evaluation images per object. Real-world objects use all available images (average of 27 per object).
To handle highly specular objects, we train Gaussian Splatting on DiffusionRenderer normals as RGB, which helps guide the geometry more effectively, since Gaussian Splatting alone often struggles with specular surfaces and can produce holes in the reconstruction.
The Neural Merger consists of separate MLPs for each material channel (basecolor, roughness, metallic), each with 3 hidden layers of 128 neurons and ReLU activations. The final layer outputs view-specific weights passed through softmax to form a probability distribution.
We spend 30,000 iterations on the 3D Gaussian geometry optimization and 10,000 on the material refinement. All experiments run on an NVIDIA RTX 4090 GPU [40].
We evaluate material estimation using PSNR, SSIM [49], and LPIPS [56] between predicted and ground-truth material maps. For relighting, we render novel views under different environment maps and compare against ground-truth renderings using the same metrics.
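A sketch of how these metrics could be computed with common libraries is shown below; the paper does not state which implementations or LPIPS backbone are used, so the choices here (scikit-image and the lpips package with a VGG backbone) are assumptions:

```python
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net='vgg')  # backbone choice is an assumption

def evaluate(pred, gt):
    """pred, gt: float numpy images in [0, 1] with shape (H, W, 3)."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)
    to_t = lambda x: torch.from_numpy(x).permute(2, 0, 1).unsqueeze(0).float() * 2 - 1
    lp = lpips_fn(to_t(pred), to_t(gt)).item()  # LPIPS expects NCHW tensors in [-1, 1]
    return psnr, ssim, lp
```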
Relighting Quality Figure 3 compares relighting quality across methods. Our method produces results that more closely resemble ground truth, particularly for specular objects. The extended R3DGS struggles with specular materials, often producing brighter images than ground truth due to unconstrained material maps during joint environment map optimization. In contrast, our method constrains material maps, enabling more accurate environment map optimization and improved material estimation.

Figure 4. Material Maps produced by our method compared to extended R3DGS [9], which can also predict metallic material maps, IRGS [11], and the DiffusionRenderer material output produced on the test images that are not used for training. We show four images each, where the top left is the base color, top right is the roughness, bottom left is metallic, and bottom right are the normals.

Figure 5. Real-World Comparison of our method, extended R3DGS [9] and IRGS [11]. The ground truth images show the object masked (top) and unmasked (bottom) to give a better understanding of the object and the surrounding lighting.
IRGS exhibits artifacts such as floaters and overly flat surfaces. While it reconstructs flat surfaces well using 2D Gaussians, it loses fine detail compared to our method. Moreover, IRGS cannot predict metallic materials, preventing accurate reconstruction of highly specular metallic surfaces. Our method achieves clear advantages in reconstructing metallic and highly specular materials while preserving geometric detail across both diffuse and flat surfaces.

Table 1. Quantitative Comparison of material estimation and relighting results on 17 synthetic objects. We compare against IRGS [11] and an extended version of R3DGS [9] that supports metallic materials. *For non-metallic objects, our model correctly optimizes the parameter to zero, which can result in infinite PSNR when all metallic maps are predicted as zero.
The real-world relighting results in Figure 5 show that our method produces the most consistent results. With R3DGS, the estimated color appears too bright, the objects are too shiny, and the geometry is less smooth, while IRGS produces smooth geometry but colors that are too dark.
Material Maps Figure 4 compares material maps across methods. Our method effectively removes baked-in lighting effects from basecolor maps, producing nearly diffuse basecolors with minimal shadowing, while other methods exhibit visible shadows and residual lighting.
Compared to DiffusionRenderer’s original predictions, which exhibit significant view-dependent inconsistencies, our Neural Merger enhances multi-view consistency. This is particularly evident in roughness and metallic maps, where DiffusionRenderer’s per-view predictions vary across views. Our approach removes these inconsistencies while preserving high-quality spatially varying material properties.
Our metallic maps match ground truth well, with predictions substantially closer than competing methods. For roughness, our method guides material maps toward consistent values for surfaces sharing the same properties, improving upon the 2D predictions. While DiffusionRenderer struggles with roughness accuracy (reflected in darker roughness maps), our Neural Merger refines initial predictions by enforcing view consistency, resulting in more physically plausible parameters. In contrast, R3DGS tends to overestimate roughness due to specular highlights appearing in only a subset of the images, biasing optimization toward diffuse surfaces. IRGS produces overly uniform roughness where fine details are difficult to discern.
Although IRGS normals are slightly closer to ground truth in some regions, our method significantly improves normals compared to extended R3DGS despite starting from the same 3D Gaussian geometry. Overall, our method produces qualitatively superior material maps across basecolor, normals, metallic, and roughness channels.
Table 1 shows quantitative results on 17 synthetic objects. Our method consistently outperforms all baselines in relighting quality and basecolor estimation, consistent with the qualitative comparisons.
For roughness, IRGS achieves slightly higher PSNR (16.182 vs. 15.331), but our method achieves better SSIM (0.820 vs. 0.744) and LPIPS (0.181 vs. 0.192), indicating better preservation of structural information and perceptual quality. All methods face challenges in roughness estimation, which remains a difficult problem.
For metallic maps, our method shows substantial improvement. When fully non-metallic objects are predicted correctly, PSNR becomes infinite, which occurs exclusively for our method; these objects are excluded from the reported PSNR. While DiffusionRenderer often predicts partial metallicity in certain views, our method enforces view-consistent material estimation, producing stable non-metallic predictions.
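To make the exclusion rule concrete, here is a small sketch of PSNR with the infinite-score handling described above (function names are illustrative):

```python
import numpy as np

def psnr(pred, gt, data_range=1.0):
    mse = np.mean((pred - gt) ** 2)
    if mse == 0.0:          # e.g. metallic predicted as exactly zero on a non-metallic object
        return np.inf
    return 10.0 * np.log10(data_range ** 2 / mse)

def mean_psnr_excluding_inf(preds, gts):
    scores = [psnr(p, g) for p, g in zip(preds, gts)]
    finite = [s for s in scores if np.isfinite(s)]   # infinite scores are left out of the mean
    return float(np.mean(finite)) if finite else float('inf')
```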
Table 2 shows the runtime breakdown on the Navi dataset [17]. DiffusionRenderer requires 112 seconds (~6 seconds per image) to process the full image set. Gaussian Splatting takes 131 seconds on average (ranging from 64 to 274 seconds depending on object complexity). Normal generation using R3DGS takes 270 seconds on average (247-347 seconds). Material optimization requires 975 seconds on average, extending up to 3,631 seconds (approximately one hour) for complex objects. In total, our method takes 1,488 seconds (~25 minutes) on average, approximately 3.5× faster than IRGS (5,347 seconds, ~89 minutes).
Table 3 demonstrates that the Neural Merger is the key component responsible for our method’s superior performance. The full model achieves the highest scores across all metrics (PSNR: 29.164, SSIM: 0.9105, LPIPS: 0.0626), significantly outperforming all ablated variants. This improvement stems from the Neural Merger’s ability to enforce multi-view consistency while preserving high-quality material properties predicted by the diffusion model. The Supervised variant performs worse than the full model, relying solely on diffusion predictions without geometric or photometric optimization. Interestingly, it is even worse than the Proj. Average baseline, which simply projects diffusion predictions into Gaussians without training. We attribute this to view-dependent effects captured in optimized Gaussians that are absent without optimization.
The qualitative results in Figure 6 show that the Neural Merger produces cleaner and more consistent material maps. Visualized using base color (where baked-in lighting effects are most evident), the Neural Merger yields substantial improvements in both sharpness and color fidelity compared to the Proj. Average baseline, which serves as our initialization. These results validate that the Neural Merger enhances visual quality and enforces consistency with underlying physical properties, resulting in more accurate and realistic relighting outcomes.
MatSpray enables casual acquisition and high-quality reconstruction of photorealistic relightable 3D assets with spatially varying materials. It effectively lifts 2D material predictions to 3D and fuses them with the 3D Gaussian geometry. It employs pretrained 2D diffusion-based material estimators without requiring additional expensive training on large-scale 3D PBR datasets. By introducing the Neural Merger, our method significantly improves multi-view consistency, which even video-based material prediction models still struggle with. The resulting relightable 3D models feature improved quality both in the estimated material maps and in the final relit appearance, even more so for highly specular objects. As demonstrated, MatSpray outperforms current state-of-the-art methods and excels in reconstructing accurate metallic maps for both synthetic and real-world inputs. The approach provides a powerful tool for easy 3D content generation.
Limitations While our approach drastically improves multi-view consistency, the overall material quality remains dependent on the performance of the chosen diffusion model. However, our PBR-to-image loss partially corrects small deviations in the diffusion predictions (see Figure 4, DiffusionRenderer vs. MatSpray).
Our method struggles when inconsistent geometry and normals are produced by the underlying R3DGS method [9], though the photometric loss may partially mitigate these issues. Additionally, very small or flat Gaussians might sometimes be missed during ray tracing. Future work could address missing Gaussians through a combined projection and transformer assignment scheme similar to [41].
The high quality of the resulting 3D geometry-material association could be exploited for accurate 3D object part segmentation. This segmentation, paired with matching language features, might enable object-specific constraints in reconstruction or a natural interface for manipulating both geometry and reflections.
This supplementary material provides extended results, analyses, and implementation details that complement the findings in the main paper. For ease of navigation, the main components are summarized here and referenced through the corresponding section labels.
• Additional Videos and Real-World Objects (A): A detailed collection of video comparisons and reconstructions of real objects that highlight the performance and stability of our method relative to earlier approaches.
• Neural Merger Ablation (B): An extended analysis of the importance of the final Softmax layer in the Neural Merger, supported by qualitative and quantitative evidence.
• Tone Mapping Analysis (C): A discussion of the tone mapping behaviour of DiffusionRenderer, how this affects predicted base color, roughness and metallic maps, and why this creates a mismatch when compared to linear ground truth.
• Implementation Details (D): A description of our training setup, super sampling strategy, Neural Merger inputs and other practical considerations that are important for stable optimization.
Figure 7 shows the thumbnail that links to all additional videos included with this supplementary material. These videos provide an extensive visual comparison of our method with Extended R3DGS [9], IRGS [11], and the forward renderer of DiffusionRenderer [27]. While the main paper presents representative examples, the extended videos give a more complete picture of the consistency and stability of our approach, especially compared to the produced material maps of DiffusionRenderer.
Across the set of videos, our method consistently produces reconstructions that remain stable across all viewpoints, without the flickering or structural collapse that can be observed in the other methods. This is particularly visible in objects with complex geometry or pronounced specular highlights such as the Kettle. Extended R3DGS often fails to maintain surface smoothness and yields unstable representations. On the other hand, IRGS tends to oversmooth surfaces and to bake specular reflections of metallic objects into its base color. In contrast, our approach maintains coherent structure even under strong lighting variations.
To illustrate this, we provide three relighting videos: White Golden Airplane, Stone Birdhouse, and Kettle. Additionally, three videos visualize predicted material properties: Yellow Airplane, Birdhouse with Yellow Flower, and Chair. These examples show that DiffusionRenderer, despite being trained on its own dataset, still produces inconsistent material maps that vary strongly with camera angle and lighting. Our method mitigates these issues and aligns predictions across views more reliably.
Real-World Objects Figure 8 shows additional real objects reconstructed by our method and by Extended R3DGS and IRGS. Here, each method is evaluated under two relighting settings and compared in base color and normals. The differences are most obvious in the base color: our base color is locally sharp and coherent across the surface, while both baselines exhibit noise, distortions, or view-dependent artifacts. The relighting results further demonstrate that our predicted materials generalize well across lighting conditions, while the other methods still have lighting effects baked into their materials (R3DGS) or tend to be washed out (IRGS).
The Neural Merger plays a key role in ensuring that the material parameters assigned to each Gaussian remain stable and consistent across all viewpoints. One central element of the Neural Merger is the final Softmax layer, which normalizes its output into weights acting as a weighted average of the inputs. Although this layer may appear to be a small architectural detail, it has a sizable impact on the quality of the final reconstruction.

Figure 8. Additional real objects reconstructed with our method, Extended R3DGS [9] and IRGS [11]. The figure includes relighting under two environments, base color and normal maps.
Without the Softmax normalization, the Neural Merger becomes unconstrained and starts to absorb illumination cues directly from the training images. In other words, instead of learning clean, view-independent materials, the MLP blends in signals that correspond to lighting variations and shadows. Because these patterns differ between viewpoints, the network produces material values that vary from view to view, which leads to inconsistency during rendering. This effect may additionally be influenced by slight variations in the geometry underlying the 2D diffusion predictions. The behaviour becomes especially problematic under relighting, because the embedded shadows and highlights interfere with the simulated lighting and produce unrealistic results.
Figure 9 shows a comparison between the full method, the version without the Softmax, and the linear ground truth. The differences become clear when observing fine geometric structures and shadow placement. Without Softmax, shadows from the input images appear in the base color maps and the renderings become blurry in high detail areas. These issues are especially visible in the lower birdhouse example, where the version without Softmax fails to maintain consistent materials on the swim ring and the surrounding areas.
We further quantify these findings in Table 4, which reports results across all scenes in the dataset. The full model outperforms the version without the Softmax across all metrics, with especially large gains in perceptual similarity (LPIPS). This confirms that the Softmax-based normalization is not merely a numerical improvement but a key component that ensures robustness and prevents the network from encoding view-dependent appearance into the materials.
One recurring observation in our experiments was that the base color predicted by our method tended to appear darker than the linear ground-truth material map. This initially appeared to be a misprediction of the 2D material maps by DiffusionRenderer for a few objects. However, the discoloration appeared in almost all objects that we tested, hinting at a systemic problem in DiffusionRenderer. Figure 10 illustrates this systemic discoloration of the predicted base color. It indicates that during training DiffusionRenderer was supervised using tone-mapped ground-truth images.
Our analysis suggests that DiffusionRenderer employs a filmic or AgX tone mapping curve. These tone-mapping algorithms compress high dynamic range values into a limited displayable range. Base color is affected in a predictable way, because tone mapping acts like a softened gamma curve. Applying an inverse gamma of roughly 1.8 partially recovers the linear values but cannot undo the full nonlinearity. Roughness is affected more severely, because its values occupy a small part of the [0, 1] interval, which collapses under tone mapping. Metallic values, on the other hand, remain closer to either zero or one and thus suffer less from compression. These effects explain why our predicted material maps sometimes differ from the linear ground truth: they closely match DiffusionRenderer's tone-mapped output.
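A minimal sketch of this approximate linearization, assuming a plain inverse-gamma correction of about 1.8 applied to the tone-mapped base color:

```python
import numpy as np

def approx_linearize(base_color_tonemapped, gamma=1.8):
    """Partially undo the tone mapping baked into the predicted base color by applying
    an inverse gamma of roughly 1.8. A filmic or AgX curve is not a pure power law,
    so this only approximates the linear values and cannot undo the full nonlinearity."""
    return np.clip(base_color_tonemapped, 0.0, 1.0) ** gamma
```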
Our experiments were performed on an NVIDIA RTX 4090 GPU with PyTorch, C++, and OptiX. To keep the input consistent with the internal resolution of DiffusionRenderer, we render all training views at a resolution of 512×512 pixels. This choice ensures that the reconstruction quality aligns with the scale at which DiffusionRenderer was originally trained. In scenes with strong specular highlights, we disable geometry learning entirely and keep the Gaussian positions fixed, because additional geometric optimization tends to destabilize the representation under these conditions.
The Neural Merger is optimized using a learning rate of 0.001. Material supervision uses an L1 loss with a weight of 1.0, as we found that this balance prevents the model from overfitting to shadows while still enforcing high fidelity in the material maps. During training, we also apply random view sampling to avoid biasing the model toward any particular viewpoint.
Super Sampling A key technical detail is the use of super sampling during the projection of material values into the Gaussian representation. We employ a 16×16 grid of rays per pixel to ensure that even small or distant Gaussians receive material parameters. With fewer samples, Gaussians are occasionally missed, leading to patchy geometry and low-resolution material parameter transfer. Figure 11 shows an example where a lower sampling rate produces obvious reconstruction defects.
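A small sketch of how such a per-pixel sample grid could be generated; the exact offset convention used in our renderer is an assumption:

```python
import numpy as np

def subpixel_offsets(n=16):
    """Regular n x n grid of sub-pixel sample positions inside one pixel."""
    centers = (np.arange(n) + 0.5) / n                  # n evenly spaced sub-pixel centers
    u, v = np.meshgrid(centers, centers, indexing="xy")
    return np.stack([u.ravel(), v.ravel()], axis=-1)    # (n * n, 2) offsets in [0, 1)^2

# One ray is cast per offset, so every pixel spawns 16 * 16 = 256 rays, making it far
# less likely that small or distant Gaussians are missed during material projection.
offsets = subpixel_offsets(16)
```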
Merger Inputs Finally, Figure 12 illustrates the input to the Neural Merger. The features consist of a NeRF-style positional encoding of the Gaussian location along with the projected base color, roughness and metallic values. The combination of positional encoding and projected materials allows the network to balance local detail with global consistency, which is essential for producing clean results under relighting.
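As an illustration, a sketch of how the input feature vector for one Gaussian could be assembled; the number of encoding frequencies and whether all three channels are fed to each per-channel merger are assumptions of this sketch:

```python
import math
import torch

def merger_input(position, basecolor_views, roughness_views, metallic_views, pe_freqs=6):
    """Assemble the Neural Merger input for one Gaussian: a NeRF-style positional
    encoding of its location concatenated with the projected per-view materials."""
    freqs = 2.0 ** torch.arange(pe_freqs, dtype=position.dtype) * math.pi
    angles = position[:, None] * freqs                               # (3, F)
    pe = torch.cat([angles.sin(), angles.cos()], dim=-1).flatten()   # (3 * 2F,)
    return torch.cat([pe,
                      basecolor_views.flatten(),   # (V * 3,) projected base colors
                      roughness_views.flatten(),   # (V,)     projected roughness
                      metallic_views.flatten()])   # (V,)     projected metallic
```

Combining the encoded position with the projected material values lets the network trade off local detail against global consistency, which matches the behaviour described above.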