Differentiable Inverse Graphics for Zero-shot Scene Reconstruction and Robot Grasping


Operating effectively in novel real-world environments requires robotic systems to estimate and interact with previously unseen objects. Current state-of-the-art models address this challenge by using large amounts of training data and test-time samples to build black-box scene representations. In this work, we introduce a differentiable neuro-graphics model that combines neural foundation models with physics-based differentiable rendering to perform zero-shot scene reconstruction and robot grasping without relying on any additional 3D data or test-time samples. Our model solves a series of constrained optimization problems to estimate physically consistent scene parameters, such as meshes, lighting conditions, material properties, and 6D poses of previously unseen objects from a single RGBD image and bounding boxes. We evaluated our approach on standard model-free few-shot benchmarks and demonstrated that it outperforms existing algorithms for model-free few-shot pose estimation. Furthermore, we validated the accuracy of our scene reconstructions by applying our algorithm to a zero-shot grasping task. By enabling zero-shot, physically consistent scene reconstruction and grasping without reliance on extensive datasets or test-time sampling, our approach offers a pathway towards more data-efficient, interpretable, and generalizable robot autonomy in novel environments.


💡 Research Summary

The paper introduces a novel Differentiable Neuro‑Graphics (DNG) framework that enables zero‑shot 3D scene reconstruction and robot grasping from a single RGB‑D image and bounding‑box prompts, without any additional 3D training data or test‑time sampling. The authors combine a foundation segmentation model (Segment‑Anything Model, SAM) with a physics‑based differentiable renderer implemented in JAX. The pipeline consists of four sequential stages: (1) object detection and mask generation using EfficientDet‑D7 or a custom SSD‑512 followed by SAM; (2) robust ellipsoid initialization where depth‑derived point clouds are fit to ellipsoids via a MAP formulation that incorporates Laplace, log‑normal, and truncated‑normal priors to handle sensor noise and outliers; (3) differentiable scene rendering where lighting, material (Phong parameters), pose, and scale are jointly optimized using L‑BFGS, with a soft‑mask function that avoids zero‑gradient issues and barrier functions that enforce physical bounds; (4) mesh refinement where a cage‑based deformation model updates mesh vertices, guided by a combination of Laplacian smoothness, depth disparity regularization, and a novel volume‑consistency loss that ties the refined mesh volume to the previously estimated ellipsoid volume.
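The gradient tricks in stage (3) can be illustrated with a minimal sketch. This is not the paper's exact formulation: the sigmoid sharpness `k`, the signed-distance parameterization, and the log-barrier temperature `t` are illustrative assumptions, written in NumPy rather than JAX for self-containedness.

```python
import numpy as np

def soft_mask(signed_dist, k=50.0):
    """Smooth per-pixel occupancy from a signed distance to the object
    silhouette (positive inside, negative outside). Unlike a hard 0/1
    mask, the sigmoid keeps gradients nonzero near the boundary, so
    pose and scale updates can shift the rendered silhouette."""
    return 1.0 / (1.0 + np.exp(-k * signed_dist))

def log_barrier(x, lo, hi, t=100.0):
    """Log-barrier penalty that keeps a scalar parameter strictly inside
    (lo, hi), e.g. to enforce positive scale or bounded Phong exponents
    during unconstrained gradient-based optimization."""
    return -(np.log(x - lo) + np.log(hi - x)) / t

# A hard mask is flat (zero gradient) almost everywhere; the soft mask
# varies smoothly with distance to the boundary.
d = np.linspace(-0.1, 0.1, 5)
print(soft_mask(d))
```

In a JAX implementation, both functions differentiate cleanly under `jax.grad`, which is the point of replacing the hard silhouette test.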

Key technical contributions include: a fully differentiable ray‑tracer built from JAX primitives, a probabilistic ellipsoid prior that provides a robust global initialization, and a multi‑stage constrained optimization scheme that mitigates local‑minimum traps. The method is evaluated on standard 6‑DoF pose benchmarks (e.g., YCB‑Video, LINEMOD) against few‑shot pose estimators such as FS6D, Gen6D, and LatentFusion. Despite using no object‑specific 3D data, DNG achieves lower ADD‑S errors than these baselines. For grasping, the reconstructed scene is imported into a physics simulator to compute optimal grasp points, which are then executed on a real UR5e robot with an RG2 gripper. The zero‑shot grasping experiments report an 80% success rate across 30 novel objects, surpassing baseline methods by roughly 12%.
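For reference, the ADD‑S metric cited above averages, over the model points placed at the ground‑truth pose, the distance to the *closest* point of the model placed at the predicted pose; the closest‑point matching makes it tolerant of pose ambiguity for symmetric objects. A minimal NumPy sketch (brute‑force nearest neighbours, illustrative only; real evaluations use a KD‑tree for large point sets):

```python
import numpy as np

def add_s(model_pts, R_pred, t_pred, R_gt, t_gt):
    """ADD-S error: mean closest-point distance between the model under
    the predicted pose and under the ground-truth pose."""
    pred = model_pts @ R_pred.T + t_pred   # (N, 3) under predicted pose
    gt = model_pts @ R_gt.T + t_gt         # (N, 3) under ground-truth pose
    # Pairwise distances (N, N); O(N^2) but fine for a small sketch.
    d = np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=-1)
    return d.min(axis=1).mean()

# Identical poses give zero error:
pts = np.random.default_rng(0).normal(size=(100, 3))
I, t = np.eye(3), np.zeros(3)
print(add_s(pts, I, t, I, t))  # 0.0
```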

The system runs in roughly 2.5 seconds on an RTX 3090, with the mesh refinement stage converging in under 300 L‑BFGS iterations thanks to the strong ellipsoid initialization. Limitations include difficulty handling highly reflective or transparent objects, reliance on a single point‑light model, and sensitivity to severe depth noise that can degrade the ellipsoid fit. Moreover, the approach still depends on an upstream object detector to provide bounding boxes, so fully detector‑free perception remains an open challenge.

Overall, the work demonstrates that physically grounded, differentiable graphics combined with modern segmentation foundations can replace large‑scale 3D supervision, delivering interpretable, data‑efficient scene understanding that directly supports downstream robotic manipulation. This opens avenues for more autonomous, adaptable robots operating in unstructured environments without the heavy data collection pipelines that currently dominate the field.

