MRD: Using Physically Based Differentiable Rendering to Probe Vision Models for 3D Scene Understanding

Reading time: 5 minutes
...

📝 Original Info

  • Title: MRD: Using Physically Based Differentiable Rendering to Probe Vision Models for 3D Scene Understanding
  • ArXiv ID: 2512.12307
  • Date: 2025-12-13
  • Authors: Benjamin Beilharz, Thomas S. A. Wallis

📝 Abstract

While deep learning methods have achieved impressive success in many vision benchmarks, it remains difficult to understand and explain the representations and decisions of these models. Though vision models are typically trained on 2D inputs, they are often assumed to develop an implicit representation of the underlying 3D scene (for example, showing tolerance to partial occlusion, or the ability to reason about relative depth). Here, we introduce MRD (metamers rendered differentiably), an approach that uses physically based differentiable rendering to probe vision models' implicit understanding of generative 3D scene properties, by finding 3D scene parameters that are physically different but produce the same model activation (i.e. are model metamers). Unlike previous pixel-based methods for evaluating model representations, these reconstruction results are always grounded in physical scene descriptions. This means we can, for example, probe a model's sensitivity to object shape while holding material and lighting constant. As a proof-of-principle, we assess multiple models in their ability to recover scene parameters of geometry (shape) and bidirectional reflectance distribution function (material). The results show high similarity in model activation between target and optimized scenes, with varying visual results. Qualitatively, these reconstructions help investigate the physical scene attributes to which models are sensitive or invariant. MRD holds promise for advancing our understanding of both computer and human vision by enabling analysis of how physical scene parameters drive changes in model responses.

💡 Deep Analysis

Figure 1

📄 Full Content

Figure 1: Starting from a scene with known parameters π, we sample positions on the unit sphere and render the set of ground-truth images I. A new scene is initialized from some other state π′ (for example, a sphere shape instead of a dragon). The optimization loop then renders images at the sampled camera origins and computes the loss L between the renders and the ground truth. We compute the gradient with respect to the involved scene parameters and backpropagate, updating the target scene parameters (here, geometry) while holding other parameters (e.g. lighting) constant. This enables targeted probing of a neural network's understanding of scene properties by separating physical causes, and can uncover invariances or even equivalence classes.
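
As a concrete illustration, the sketch below implements this kind of loop with Mitsuba 3's Python API. The framework, scene file, parameter key, and hyperparameters are assumptions made for illustration, and a plain pixel loss stands in for the model-activation comparison that MRD actually optimizes (see the PyTorch sketch further below).

```python
# Minimal sketch of the Figure-1 optimization loop (illustrative; the paper's
# exact framework, loss, and hyperparameters may differ).
import drjit as dr
import mitsuba as mi

mi.set_variant("llvm_ad_rgb")               # differentiable CPU variant; "cuda_ad_rgb" on GPU

scene = mi.load_file("target_scene.xml")    # hypothetical scene with known parameters (pi)
params = mi.traverse(scene)                 # dict-like view of differentiable scene parameters

# Render the ground-truth image(s) I from the target scene. One camera is used here
# for brevity; the paper samples several camera origins on the unit sphere.
image_ref = mi.render(scene, spp=64)

# Re-initialize the parameter being probed to some other physical state (pi').
# A BRDF albedo is used here; geometry works analogously via vertex positions.
key = "object.bsdf.reflectance.value"       # hypothetical parameter key
opt = mi.ad.Adam(lr=0.05)
opt[key] = mi.Color3f(0.5, 0.5, 0.5)
params.update(opt)

for it in range(200):
    image = mi.render(scene, params, spp=8, seed=it)
    # Pixel-space L2 loss for illustration; MRD instead compares model activations
    # of the render and the ground truth.
    loss = dr.mean(dr.sqr(image - image_ref))
    dr.backward(loss)                       # gradients w.r.t. the optimized scene parameters
    opt.step()
    params.update(opt)                      # write the updated parameters back into the scene
```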

While deep learning methods have achieved impressive success in many vision benchmarks, it remains difficult to understand and explain the representations and decisions of these models. Though vision models are typically trained on 2D inputs, they are often assumed to develop an implicit representation of the underlying 3D scene (for example, showing tolerance to partial occlusion, or the ability to reason about relative depth). Here, we introduce MRD (metamers rendered differentiably), an approach that uses physically based differentiable rendering to probe vision models' implicit understanding of generative 3D scene properties, enabling us to answer the question of how physical scene parameters cause changes in model responses.

Deep learning has revolutionized pattern recognition from visual input. Image-computable models can now perform many tasks with performance matching or exceeding that of humans, and their activations can correlate highly with visually driven responses in primate brains [14]. However, it remains difficult to explain how and why these models make the decisions they do. Some work [1,76] probes whether models truly understand scenes, and judges whether such explanations might also account for visual processing in humans and other animals [22,46,70].

In this work, we demonstrate how a relatively new technology from computer graphics, physically based differentiable rendering (PBDR), can be used to evaluate 3D scene understanding in image-computable vision models. PBDR allows the reconstruction of physically plausible 3D scene parameters, such as geometry, camera parameters, and material definitions, via optimization with gradient descent. In contrast to approaches that use neural networks for inverse rendering [37,38,40,44], the PBDR approach is always grounded in the physics of light transport, allowing the physical causes of an image to be separated, decomposed, and thereby understood.
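
To make this separation of physical causes concrete, the short sketch below (assuming Mitsuba 3 and its bundled Cornell-box test scene, not necessarily the scenes used in the paper) lists the named parameters a PBDR system exposes; any subset of them, for example the material parameters, can receive gradients while everything else is held fixed.

```python
# List the physical scene parameters exposed for gradient-based optimization
# (illustrative; uses Mitsuba 3's built-in Cornell box).
import mitsuba as mi

mi.set_variant("llvm_ad_rgb")

scene = mi.load_dict(mi.cornell_box())
params = mi.traverse(scene)

# Every physical cause is a named, typed parameter: BSDF reflectances (material),
# emitter radiance (lighting), vertex positions (geometry), sensor pose (camera), ...
for key in params.keys():
    print(key)

# Restrict optimization to material parameters only, holding lighting,
# geometry, and camera fixed.
material_keys = [k for k in params.keys() if "reflectance" in k]
print(material_keys)
```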

Applying PBDR allows one to synthesize new physical scene descriptions that, for example, produce model activations matching those of a target scene while being physically different (i.e. are model metamers; [18,19]). This objective has long been used in the study of human vision, first to support trichromatic color theory [42] and then for more general image representations [2,21,65,66], because it allows the identification of perceptual invariants. Here, we combine PBDR with the metamerism objective to create metamers rendered differentiably (MRD). Rather than trying to infer what a model might understand by interpreting noisy pixel images (as in existing synthesis-based explanation methods), the user can interpret vision neural networks by using model representations to reconstruct specific scene parameters, expressed in the physical units of the generating scene. This also opens possibilities to fine-tune existing models on specific scene properties.
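
Written as an optimization problem (a generic formulation; the paper's exact loss, feature layer, and any regularizers may differ), the metamer search is

$$
\pi^{\ast} = \arg\min_{\pi'} \bigl\lVert f_{\theta}\bigl(R(\pi')\bigr) - f_{\theta}\bigl(R(\pi)\bigr) \bigr\rVert_{2}^{2},
$$

where R is the differentiable renderer, f_θ the frozen vision model whose activations are compared, π the known target scene parameters, and π′ the subset being optimized (e.g. geometry), with all remaining parameters held fixed. A solution whose activations match those of the target while its physical description differs from π is a model metamer.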

To demonstrate the usefulness of PBDR as a tool for model interpretability, we evaluate the implicitly learned 3D knowledge of vision models trained on 2D images. Vision models are assumed to learn about the underlying 3D structure of scenes even when trained only on 2D images; for example, models trained on 2D images can perform well on tasks such as novel view synthesis [44], depth estimation [26], and 3D object reconstruction [28,68]. We demonstrate MRD in two example settings. First, we investigate the general problem of “material appearance” by studying the recovery of surface properties (the bidirectional reflectance distribution function, BRDF). Second, we examine the recovery of shape (geometry) using the Learned Perceptual Image Patch Similarity (LPIPS) metric [72] and ImageNet-trained networks such as ResNet-50 [29] and its shape-bias-induced variant ResNet-50 SIN, trained on a stylized version of ImageNet [24]. We also present results for CLIP [55] as a common multi-modal embedding backbone.
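
For reference, the sketch below shows how activation-matching losses for two of these backbones could be set up in PyTorch. The package names (lpips, torchvision), the choice of ResNet-50 layer, and the random tensors standing in for the rendered and target images are illustrative assumptions, and the differentiable-rendering side through which MRD backpropagates such a loss is omitted.

```python
# Sketch of activation-matching losses for two example backbones (illustrative;
# the paper's exact layers, preprocessing, and weighting may differ).
import torch
import lpips                                   # pip install lpips
from torchvision.models import resnet50, ResNet50_Weights
from torchvision.models.feature_extraction import create_feature_extractor

# Two RGB images in [0, 1], shape (1, 3, H, W): stand-ins for the render R(pi')
# and the ground-truth render R(pi).
render = torch.rand(1, 3, 224, 224, requires_grad=True)
target = torch.rand(1, 3, 224, 224)

# (a) LPIPS: perceptual distance on VGG features (expects inputs in [-1, 1]).
lpips_vgg = lpips.LPIPS(net="vgg")
loss_lpips = lpips_vgg(render * 2 - 1, target * 2 - 1).mean()

# (b) ResNet-50 activations: L2 distance between intermediate feature maps.
weights = ResNet50_Weights.IMAGENET1K_V2
backbone = resnet50(weights=weights).eval()
extractor = create_feature_extractor(backbone, return_nodes={"layer3": "feat"})
preprocess = weights.transforms()
with torch.no_grad():
    feat_target = extractor(preprocess(target))["feat"]
feat_render = extractor(preprocess(render))["feat"]
loss_resnet = torch.nn.functional.mse_loss(feat_render, feat_target)

# In MRD, a loss of this kind is backpropagated through the differentiable renderer
# to update physical scene parameters (geometry or BRDF), not the pixels themselves.
(loss_lpips + loss_resnet).backward()
print(render.grad.shape)   # image gradients, which PBDR maps onto scene parameters
```

A ResNet-50 SIN or CLIP backbone would slot into the same pattern with its own weights and preprocessing.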

Our specific contributions are (1) a new method for understanding the learned visual representations of neural networks by linking their activations to physical environmental properties and efficiently optimizing to find invariants, and (2) results evaluating contemporary vision models using this new method. Because it allows the decomposition of model activations into physical causes, we hope that this method will become an important tool for understanding both computer and human vision.

📸 Image Gallery

brushed_metal.png brushedmetal-lpips-vgg.png cover.png dragon-hallstatt-baseline.png dragon.png ecdf-lpips.png

Reference

This content is AI-processed based on open access ArXiv data.
