Inverse Graphics with Probabilistic CAD Models
Recently, multiple formulations of vision problems as probabilistic inversions of generative models based on computer graphics have been proposed. However, applications to 3D perception from natural images have focused on low-dimensional latent scenes, due to challenges in both modeling and inference. Accounting for the enormous variability in 3D object shape and 2D appearance via realistic generative models seems intractable, as does inverting even simple versions of the many-to-many computations that link 3D scenes to 2D images. This paper proposes and evaluates an approach that addresses key aspects of both these challenges. We show that it is possible to solve challenging, real-world 3D vision problems by approximate inference in generative models for images based on rendering the outputs of probabilistic CAD (PCAD) programs. Our PCAD object geometry priors generate deformable 3D meshes corresponding to plausible objects and apply affine transformations to place them in a scene. Image likelihoods are based on similarity in a feature space based on standard mid-level image representations from the vision literature. Our inference algorithm integrates single-site and locally blocked Metropolis-Hastings proposals, Hamiltonian Monte Carlo and discriminative data-driven proposals learned from training data generated from our models. We apply this approach to 3D human pose estimation and object shape reconstruction from single images, achieving quantitative and qualitative performance improvements over state-of-the-art baselines.
💡 Research Summary
The paper presents a comprehensive framework that brings together probabilistic generative modeling, computer‑aided design (CAD) programs, and modern Bayesian inference to tackle the long‑standing “inverse graphics” problem: recovering detailed three‑dimensional structure from a single natural image. The authors introduce Probabilistic CAD (PCAD) programs, which are stochastic versions of traditional CAD pipelines. In a PCAD program, latent variables describe both the geometry of a deformable mesh and the affine transformation that places the object in the scene. For generic objects, the mesh is generated by a lathing operation driven by a Gaussian Process (GP) prior over the object’s cross‑sectional profile; for human bodies, a skeletal armature with per‑joint affine parameters is used. These priors are highly expressive, allowing the model to capture a wide range of shapes while still being amenable to sampling.
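The GP‑lathing prior can be illustrated with a short sketch: sample a smooth radial profile from a GP, then revolve it around the vertical axis to obtain mesh vertices. The kernel choice, length scale, and vertex layout below are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)

def gp_profile(heights, length_scale=0.3, variance=0.05, mean_radius=1.0):
    """Sample a smooth radial profile r(h) from a GP with an RBF kernel."""
    d = heights[:, None] - heights[None, :]
    K = variance * np.exp(-0.5 * (d / length_scale) ** 2)
    K += 1e-8 * np.eye(len(heights))  # jitter for numerical stability
    return mean_radius + rng.multivariate_normal(np.zeros(len(heights)), K)

def lathe(heights, radii, n_angles=32):
    """Revolve the profile around the vertical axis to form mesh vertices."""
    thetas = np.linspace(0.0, 2 * np.pi, n_angles, endpoint=False)
    verts = np.array([[r * np.cos(t), r * np.sin(t), h]
                      for h, r in zip(heights, radii)
                      for t in thetas])
    return verts  # one ring of n_angles vertices per profile sample

heights = np.linspace(0.0, 1.0, 20)
radii = np.clip(gp_profile(heights), 0.05, None)  # keep radii positive
verts = lathe(heights, radii)
```

An affine transformation of `verts` would then place the lathed object in the scene, as described above.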
To connect the 3‑D generative model to observed images, the authors treat a standard graphics engine (Blender) as a stochastic scene generator. Rather than comparing raw pixels, they render a mid‑level representation: contour maps and their distance transforms. The similarity between a rendered contour image I_R and a real contour image I_D is measured using a non‑symmetric Chamfer distance, which is then embedded in a Gaussian likelihood P(I_D | I_R). This abstraction removes the need to model lighting, texture, and other high‑frequency image variations, focusing inference on shape consistency.
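A one‑sided Chamfer distance of this kind can be computed cheaply with a distance transform: precompute, for every pixel, the distance to the nearest observed contour pixel, then average that distance over the rendered contour pixels. The sketch below uses SciPy's Euclidean distance transform; the Gaussian bandwidth `sigma` is an illustrative assumption.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def chamfer(rendered, observed):
    """One-sided Chamfer distance from rendered to observed contours.

    Both inputs are boolean arrays where True marks a contour pixel."""
    # Distance from every pixel to the nearest observed contour pixel
    # (distance_transform_edt measures distance to the nearest zero).
    dist_to_observed = distance_transform_edt(~observed)
    # Average that distance over the rendered contour pixels only.
    return dist_to_observed[rendered].mean()

def log_likelihood(rendered, observed, sigma=2.0):
    """Gaussian log-likelihood on the Chamfer distance (sigma illustrative)."""
    d = chamfer(rendered, observed)
    return -0.5 * (d / sigma) ** 2

# Toy contours: a square outline vs. a slightly shifted copy.
obs = np.zeros((64, 64), dtype=bool)
obs[16, 16:48] = obs[47, 16:48] = obs[16:48, 16] = obs[16:48, 47] = True
ren = np.roll(obs, 2, axis=1)
```

Because the distance is non‑symmetric, rendered contours are penalized for straying from observed ones, but unexplained observed contours (e.g. background clutter) are not penalized directly.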
Because the posterior over the latent scene variables is highly multimodal, high‑dimensional, and contains both continuous and discrete components, exact inference is infeasible. The authors therefore design a hybrid Markov chain Monte Carlo (MCMC) sampler that mixes four complementary proposal mechanisms:
- Local Random Proposals – single‑site Metropolis–Hastings updates for continuous variables and Gibbs moves for discrete ones.
- Block Proposals – joint updates of groups of tightly coupled variables (e.g., an affine matrix together with the mesh parameters it influences).
- Discriminative (Data‑driven) Proposals – a learned proposal distribution obtained by k‑nearest‑neighbor search in a feature space of synthetic (image, latent) pairs, followed by kernel density estimation. This helps the chain escape local minima caused by occlusion or clutter.
- Hamiltonian Monte Carlo (HMC) Proposals – gradient‑based updates for the continuous subset of variables, enabling larger, informed jumps in the high‑dimensional space.
The overall proposal distribution is a weighted mixture of these kernels, and a Metropolis–Hastings acceptance ratio guarantees asymptotic correctness.
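The mixture‑of‑kernels scheme can be sketched as follows. For brevity, the toy target below is a 2‑D Gaussian standing in for the PCAD posterior, and only the local and block kernels are implemented (both symmetric, so the MH ratio reduces to a ratio of target densities); the data‑driven and HMC kernels would be additional entries in the same mixture. All step sizes and weights are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_target(x):
    """Toy log-density standing in for the PCAD posterior: a 2-D Gaussian."""
    return -0.5 * np.sum(x ** 2)

def single_site(x):
    """Local proposal: perturb one randomly chosen coordinate."""
    y = x.copy()
    i = rng.integers(len(x))
    y[i] += 0.5 * rng.standard_normal()
    return y

def block(x):
    """Block proposal: perturb all coordinates jointly."""
    return x + 0.25 * rng.standard_normal(len(x))

kernels = [single_site, block]  # data-driven and HMC kernels would join here
weights = [0.7, 0.3]

def step(x):
    propose = kernels[rng.choice(len(kernels), p=weights)]
    y = propose(x)
    # Symmetric proposals, so the MH ratio is a target-density ratio.
    if np.log(rng.random()) < log_target(y) - log_target(x):
        return y
    return x

x = np.zeros(2)
samples = []
for _ in range(5000):
    x = step(x)
    samples.append(x)
samples = np.array(samples)
```

Because each kernel is a valid MH move and the mixture weights do not depend on the current state, the combined chain leaves the target distribution invariant.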
The framework is evaluated on two challenging tasks.
3‑D Object Parsing – A small dataset of about 20 real‑world objects is used. The authors compare against SIRFS, a state‑of‑the‑art single‑object reconstruction method that requires pre‑segmented masks. Their PCAD approach, which jointly infers segmentation, pose, and shape, achieves substantially lower depth error (Z‑MAE) and surface‑normal error (N‑MSE). Qualitatively, the reconstructed depth maps and meshes align far better with the ground truth, demonstrating the benefit of a strong, learned 3‑D prior.
3‑D Human Pose Estimation – Using images from KTH, LabelMe, and internet sources (including heavily occluded “person sitting” cases), the method is benchmarked against the Deformable Part Model (DPM) pose detector. The PCAD system consistently yields lower joint localization error and produces coherent 3‑D skeletons even when large portions of the body are hidden. The authors also show that independent MCMC runs converge to the same posterior, indicating robust mixing.
In summary, the paper makes three major contributions: (1) a flexible probabilistic CAD language that can encode both rigid and highly deformable objects; (2) an inverse graphics pipeline that leverages mid‑level contour representations to define a tractable likelihood; and (3) a hybrid MCMC inference engine that combines local, block, gradient‑based, and learned proposals to efficiently explore a complex posterior. The results demonstrate that realistic generative models, once thought intractable for real‑world vision, can indeed be inverted to achieve state‑of‑the‑art performance on difficult 3‑D perception tasks. Future directions include scaling to more diverse object categories, incorporating temporal cues, and accelerating rendering and inference with learned surrogates.