PI-Light: Physics-Inspired Diffusion for Full-Image Relighting
Full-image relighting remains a challenging problem due to the difficulty of collecting large-scale structured paired data, the challenge of maintaining physical plausibility, and the limited generalizability imposed by data-driven priors. Existing attempts to bridge the synthetic-to-real gap for full-scene relighting remain suboptimal. To tackle these challenges, we introduce Physics-Inspired diffusion for full-image reLight ($π$-Light, or PI-Light), a two-stage framework that leverages physics-inspired diffusion models. Our design incorporates (i) batch-aware attention, which improves the consistency of intrinsic predictions across a collection of images, (ii) a physics-guided neural rendering module that enforces physically plausible light transport, (iii) physics-inspired losses that regularize training dynamics toward a physically meaningful landscape, thereby enhancing generalizability to real-world image editing, and (iv) a carefully curated dataset of diverse objects and scenes captured under controlled lighting conditions. Together, these components enable efficient finetuning of pretrained diffusion models while also providing a solid benchmark for downstream evaluation. Experiments demonstrate that $π$-Light synthesizes specular highlights and diffuse reflections across a wide variety of materials, achieving superior generalization to real-world scenes compared with prior approaches.
💡 Research Summary
PI‑Light tackles the long‑standing challenge of full‑image relighting by marrying physically‑based rendering principles with modern latent diffusion models. The authors observe three fundamental obstacles in prior work: (1) the scarcity of large‑scale paired datasets that capture the same scene under multiple lighting conditions, (2) the inability of purely data‑driven pipelines to respect light‑transport physics, and (3) limited generalization caused by over‑reliance on learned priors. To address these, PI‑Light introduces a two‑stage framework.
Stage 1 – Inverse Neural Rendering. A pretrained Stable Diffusion U‑Net is repurposed to predict four intrinsic maps—albedo, surface normal, roughness, and metallic—in a single forward pass. The key innovation is batch‑aware attention: standard self‑attention is extended across the batch dimension so that multiple images (e.g., the same scene under different lighting) can exchange information. This yields consistent intrinsic estimates across the batch, reduces per‑image variance, and improves overall accuracy. CLIP image embeddings are injected via cross‑attention, providing additional semantic guidance.
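The batch-aware extension can be sketched as follows: instead of each image's spatial tokens attending only within that image, tokens from the whole batch are flattened into one sequence so attention crosses image boundaries. This is a minimal illustrative sketch, not the authors' implementation; the class name, head count, and use of `nn.MultiheadAttention` are assumptions.

```python
import torch
import torch.nn as nn


class BatchAwareAttention(nn.Module):
    """Illustrative sketch: self-attention extended across the batch dimension.

    Standard self-attention over x of shape (B, N, C) keeps each of the B
    images independent. Here the batch is flattened into a single sequence of
    B*N tokens so that tokens from different images (e.g., different lightings
    of the same scene) can exchange information.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        # Flatten the batch into one long token sequence so attention
        # crosses image boundaries, then restore the original shape.
        tokens = x.reshape(1, B * N, C)
        out, _ = self.attn(tokens, tokens, tokens)
        return out.reshape(B, N, C)
```

In a real U-Net this would replace (or augment) the existing self-attention blocks; the memory cost grows with the square of `B * N`, which is why the paper applies it only to modest batch sizes.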
Stage 2 – Neural Forward Rendering. The intrinsic maps from Stage 1 are combined with a user‑specified lighting condition to generate a relit image. Lighting is encoded as a gray‑ball map derived from the front hemisphere of an HDRI environment, deliberately discarding contributions from self‑luminous objects and background illumination. A physics‑inspired loss, derived from the Disney Principled BRDF (Lambertian diffuse + Cook‑Torrance specular), is added to the standard V‑prediction diffusion loss. This loss regularizes the diffusion trajectory toward a physically plausible light‑transport manifold, accelerating convergence and enabling the model to learn correct specular highlights and diffuse shading with far fewer training samples.
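The physics-inspired loss described above can be approximated by rendering the predicted intrinsic maps with a Lambertian diffuse term plus a Cook–Torrance specular term and penalizing the discrepancy against the target image. The sketch below uses the common GGX distribution, Smith geometry, and Schlick Fresnel approximations under a single directional light; the exact BRDF terms, light parameterization, and how this loss is weighted against the V-prediction objective are assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

EPS = 1e-7

def ggx_ndf(n_dot_h, roughness):
    # GGX normal distribution; alpha = roughness^2 (Disney convention).
    a2 = (roughness ** 2) ** 2
    denom = n_dot_h ** 2 * (a2 - 1.0) + 1.0
    return a2 / (torch.pi * denom ** 2 + EPS)

def smith_g(n_dot_v, n_dot_l, roughness):
    # Smith geometry term with Schlick-GGX approximation.
    k = (roughness + 1.0) ** 2 / 8.0
    g1 = lambda x: x / (x * (1.0 - k) + k + EPS)
    return g1(n_dot_v) * g1(n_dot_l)

def fresnel_schlick(v_dot_h, f0):
    return f0 + (1.0 - f0) * (1.0 - v_dot_h) ** 5

def physics_render(albedo, normal, roughness, metallic, light_dir, view_dir):
    """Per-pixel Lambert + Cook-Torrance shading under one directional light.

    All map tensors are (..., C) with unit-length normals and directions.
    """
    h = F.normalize(light_dir + view_dir, dim=-1)
    n_dot_l = (normal * light_dir).sum(-1, keepdim=True).clamp(min=0.0)
    n_dot_v = (normal * view_dir).sum(-1, keepdim=True).clamp(min=1e-4)
    n_dot_h = (normal * h).sum(-1, keepdim=True).clamp(min=0.0)
    v_dot_h = (view_dir * h).sum(-1, keepdim=True).clamp(min=0.0)
    # Dielectrics reflect ~4%; metals tint the specular lobe by albedo.
    f0 = 0.04 * (1.0 - metallic) + albedo * metallic
    spec = (ggx_ndf(n_dot_h, roughness)
            * smith_g(n_dot_v, n_dot_l, roughness)
            * fresnel_schlick(v_dot_h, f0)) / (4.0 * n_dot_v * n_dot_l + EPS)
    diffuse = (1.0 - metallic) * albedo / torch.pi
    return (diffuse + spec) * n_dot_l

def physics_loss(pred_maps, target_image, light_dir, view_dir):
    # Penalize the gap between physically rendered output and the target.
    rendered = physics_render(*pred_maps, light_dir, view_dir)
    return F.mse_loss(rendered, target_image)
```

In training this term would be added (with some weight) to the standard V-prediction diffusion loss, nudging the denoising trajectory toward renderings that obey light-transport physics.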
Dataset Construction. Recognizing the data bottleneck, the authors curate a new benchmark. At the object level, they sample >10 k BRDF‑compatible models from Objaverse, rendering each under 10 viewpoints and 10 distinct lighting setups (point lights and HDRI maps), yielding ~1 M images with ground‑truth albedo, normal, roughness, metallic, and mask annotations. At the scene level, 300 high‑quality indoor and outdoor BlenderKit scenes are rendered with varied camera poses and a supplemental point light placed behind the camera to diversify shadow patterns. Crucially, lighting labels are provided as front‑hemisphere gray‑ball images rather than irradiance maps, simplifying user control and avoiding interference from built‑in illumination.
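A record in such a benchmark would bundle the rendered image with its intrinsic and lighting annotations. The schema below is a hypothetical illustration of the annotations the summary lists (the field names and file layout are assumptions, not the authors' actual format); the combinatorics match the stated scale, since >10 k objects × 10 viewpoints × 10 lighting setups gives roughly 1 M samples.

```python
from dataclasses import dataclass


@dataclass
class RelightSample:
    """Hypothetical metadata for one rendered object-level sample."""
    rgb_path: str        # rendered image under one of the 10 lighting setups
    albedo_path: str     # ground-truth intrinsic maps
    normal_path: str
    roughness_path: str
    metallic_path: str
    mask_path: str
    grayball_path: str   # front-hemisphere gray-ball lighting label
    object_id: str
    view_index: int      # 0..9 (10 viewpoints per object)
    light_index: int     # 0..9 (10 lighting setups per object)
```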
Experiments. Quantitative metrics (RMSE, SSIM, LPIPS) show PI‑Light outperforming state‑of‑the‑art methods such as RGB↔X, LightIt, and OutCast across both synthetic and real‑world test sets. The batch‑aware attention reduces intrinsic inconsistency between different lighting views of the same scene by over 30 %. Qualitative results demonstrate faithful preservation of albedo, accurate reconstruction of specular highlights on metallic surfaces, and robust handling of translucent objects—areas where prior methods typically falter.
Limitations and Future Work. The front‑hemisphere lighting representation, while convenient, cannot fully capture complex indirect illumination, potentially limiting realism in highly inter‑reflective environments. The current pipeline processes static images; extending to video would require temporal consistency mechanisms. High‑resolution deployment (>1024²) still incurs substantial memory costs, suggesting a need for more efficient attention or hierarchical diffusion schemes.
Conclusion. PI‑Light presents a compelling synthesis of physics‑based rendering and diffusion models, delivering a data‑efficient, physically plausible, and broadly generalizable solution for full‑image relighting. By embedding light‑transport priors directly into the diffusion training objective and introducing batch‑wise intrinsic consistency, the method sets a new benchmark for controllable image editing and opens avenues for future research in global illumination modeling, video diffusion, and scalable high‑resolution relighting.