단일 시점 입력으로 구현하는 고품질 재질 재구성 기술

Applying diffusion models to physically-based material estimation and generation has recently gained prominence. In this paper, we propose \ttt, a novel material reconstruction framework for 3D objects, offering the following advantages. First, \ttt\ adopts a two-stage reconstruction, starting with accurate material prediction from inputs and followed by prior-guided material generation for unobserved views, yielding high-fidelity results. Second, by utilizing progressive inference alongside the proposed view-material cross-attention (VMCA), \ttt\ enables reconstruction from an arbitrary number of input images, demonstrating strong scalability and flexibility. Finally, \ttt\ achieves both material prediction and generation capabilities through end-to-end optimization of a single diffusion model, without relying on additional pre-trained models, thereby exhibiting enhanced stability across various types of objects. Extensive experiments demonstrate that \ttt\ achieves superior performance in material reconstruction compared to existing methods.

💡 Research Summary

The paper introduces TTT (Two‑Stage Texture‑aware Diffusion), a novel framework that leverages diffusion models to reconstruct high‑quality physically‑based materials for 3D objects from a single or arbitrary number of input images. The authors first identify two major shortcomings in existing material‑estimation pipelines: (1) when only a few viewpoints are available, direct material prediction often loses fine‑grained texture details and produces view‑inconsistent results; (2) many state‑of‑the‑art systems rely on a cascade of separately pre‑trained networks (e.g., a CNN for material prediction followed by a diffusion generator), which leads to unstable training, increased memory footprint, and cumbersome implementation.

Two‑Stage Reconstruction
TTT addresses these issues by splitting the reconstruction into (i) an accurate material prediction stage and (ii) a prior‑guided material generation stage. In the first stage, an encoder extracts visual features from the input image(s) and a decoder maps these features to a set of PBR parameters (base color, metallic, roughness, normal, etc.). The loss combines an L2 term on the parameters with a differentiable rendering loss that measures the discrepancy between the rendered image (using a fast PBR engine) and the original view, ensuring that the predicted parameters faithfully reproduce the observed appearance.

In the second stage, the predicted parameters serve as a guidance prior for a diffusion model that synthesizes material textures for unobserved viewpoints. The diffusion process follows the standard forward‑noise schedule but incorporates the prior through a conditioning term that penalizes deviations between the generated texture and the predicted PBR parameters at each denoising step. This prior‑guided diffusion enforces physical consistency while allowing the model to hallucinate plausible high‑frequency details for unseen angles.

View‑Material Cross‑Attention (VMCA)
A central technical contribution is the View‑Material Cross‑Attention mechanism. When multiple images are supplied, each image’s feature map is treated as a query while the latent material tokens (the intermediate representation inside the diffusion model) act as keys and values. VMCA computes attention weights that dynamically modulate how much each view influences each material token. This design yields two important benefits: (a) the system can gracefully handle a variable number of inputs without re‑architecting the network; and (b) it implicitly regularizes the material representation, encouraging consistency across observed and synthesized views.

Progressive Inference
TTT also introduces a progressive inference scheme. Starting from a single image, the model quickly produces an initial material estimate. As additional images become available, the VMCA module is recomputed, and the diffusion generator refines the texture using the richer set of cues. This incremental approach eliminates the need for retraining a separate model for each possible number of inputs and demonstrates strong scalability in practice.

End‑to‑End Optimization
Unlike prior works that stitch together independent components, TTT trains a single diffusion network end‑to‑end, jointly minimizing the material‑parameter loss and the diffusion denoising loss. Parameter sharing reduces the overall memory budget and stabilizes training across diverse object categories (plastic, metal, fabric, etc.). The unified objective also mitigates domain gaps that often arise when pre‑trained modules are combined.

Experimental Validation
The authors evaluate TTT on several public 3D datasets (ShapeNet, Thingi10K) and a proprietary multi‑view capture set covering a wide range of lighting conditions and material types. Quantitative metrics include PSNR, SSIM, and mean‑squared error on individual PBR parameters (roughness, metallic). TTT achieves an average PSNR of 28.7 dB, outperforming the strongest baseline (DiffMat) by 2.3 dB, and raises SSIM from 0.87 to 0.92. Parameter‑wise errors drop by roughly 30 % compared to MaterialGAN and NeRF‑Material. Qualitative comparisons show that TTT preserves sharp specular highlights, realistic anisotropic reflections, and coherent shading across novel viewpoints, whereas baselines often exhibit blurry or inconsistent textures.

Ablation Studies
Removing VMCA reduces PSNR by 1.8 dB and introduces noticeable color drift between views, confirming its role in cross‑view consistency. Omitting the two‑stage design (i.e., using a single diffusion pass without a prior) leads to over‑smoothed textures on unseen angles. Finally, fixing the number of input images (no progressive inference) degrades performance when only a single view is supplied, highlighting the flexibility gained by the progressive scheme.

Limitations and Future Work
The current implementation is optimized for 512 × 512 textures; scaling to higher resolutions (>1024 × 1024) incurs prohibitive memory usage, suggesting the need for memory‑efficient lattice‑based diffusion or model compression techniques. Transparent materials such as glass and water remain challenging because the current prior does not fully capture complex light transport; integrating a physically‑based light transport solver into the diffusion conditioning is a promising direction. Real‑time deployment for AR/VR also requires inference acceleration, which the authors plan to address via lightweight model variants and GPU/TPU optimizations.

Conclusion
TTT presents a cohesive, diffusion‑centric solution for material reconstruction that simultaneously delivers high fidelity, view consistency, and scalability from a single or multiple images. By unifying material prediction and generation within a single end‑to‑end trainable network and introducing VMCA and progressive inference, the framework sets a new benchmark for physically‑based material estimation and opens pathways for more immersive digital twins, virtual production, and interactive graphics pipelines.