MessyKitchens: Contact-rich object-level 3D scene reconstruction
Junaid Ansari*, Ran Ding*, Fabio Pizzati, and Ivan Laptev
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, UAE
{junaid.ansari,ran.ding,fabio.pizzati,ivan.laptev}@mbzuai.ac.ae
* Equal contribution.

Abstract. Monocular 3D scene reconstruction has recently seen significant progress. Powered by modern neural architectures and large-scale data, recent methods achieve high performance in depth estimation from a single image. Meanwhile, reconstructing and decomposing common scenes into individual 3D objects remains a hard challenge due to the large variety of objects, frequent occlusions and complex object relations. Notably, beyond shape and pose estimation of individual objects, applications in robotics and animation require physically-plausible scene reconstruction where objects obey physical principles of non-penetration and realistic contacts. In this work we advance object-level scene reconstruction along two directions. First, we introduce MessyKitchens, a new dataset with real-world scenes featuring cluttered environments and providing high-fidelity object-level ground truth in terms of 3D object shapes, poses and accurate object contacts. Second, we build on the recent SAM 3D approach for single-object reconstruction and extend it with a Multi-Object Decoder (MOD) for joint object-level scene reconstruction. To validate our contributions, we demonstrate that MessyKitchens significantly improves over previous datasets in registration accuracy and inter-object penetration. We also compare our multi-object reconstruction approach on three datasets and demonstrate consistent and significant improvements of MOD over the state of the art. Our new benchmark, code and pre-trained models will become publicly available on our project website: https://messykitchens.github.io/.

Keywords: 3D reconstruction · Benchmark · Physical accuracy

1 Introduction

Accurate 3D scene reconstruction plays a pivotal role in many applications in digital arts and content creation, industrial inspection, surgery, heritage preservation, and navigation, as well as robot learning and simulation. While some tasks, e.g., navigation [2, 9, 42, 47], may only require free-space estimation and scene-level surface reconstruction, other tasks, e.g., robotic manipulation [1, 6, 19, 36, 38, 51, 56] and animation [5, 43, 55], often rely on detailed reconstruction of individual objects. Object-level scene reconstruction is a difficult task challenged by the large variety of object shapes and frequent occlusions. Moreover, beyond object shape and pose estimation, animation and simulation tasks require physically-plausible scene reconstruction with correct estimation of object contacts, deformations and other parameters.

Fig. 1: The MessyKitchens benchmark. Images of real scenes and corresponding high-fidelity object-level 3D scene reconstructions composed of accurate object scans: 100 high-fidelity registered scenes, 3D scans of 130 textured objects, and highly accurate contacts.

Recent methods for 3D scene reconstruction have evolved from classic geometry-based formulations to learning-based approaches. The latter rely on learned inductive biases and enable accurate shape predictions from a single image.
In particular, several recent methods such as Depth Anything [59], VGGT [52] and Gen3C [45] significantly advance results for monocular depth estimation. In comparison, object-level scene reconstruction has received relatively less attention. Among existing methods, MIDI [27] and PartCrafter [37] present impressive results but focus on synthetic scenes, while the recent SAM 3D [11] enables the estimation of the shape and pose of single objects in real images. Besides new methods, progress in object-level scene reconstruction also requires realistic and high-fidelity benchmarks for training and evaluation. While several benchmarks exist [3, 19, 32], their 3D ground truth often suffers from limited registration accuracy and inter-object penetrations.

To address these limitations, we advance object-level scene reconstruction on two fronts. First, we introduce MessyKitchens, a new dataset with 100 real-world scenes featuring cluttered environments along with high-fidelity object-level 3D ground truth. As shown in Fig. 1, our dataset contains contact-rich scenes composed of 130 varying kitchen objects that have been scanned and registered with high precision. Moreover, MessyKitchens provides accurate object contacts, and hence enables evaluation of geometrically-precise and physically-realistic scene reconstruction. In addition, we provide a large-scale synthetic training set, MessyKitchens-train, composed of 1.8k contact-rich scenes and 10.8k rendered images. Second, alongside the benchmark, we propose a method for joint object-level scene reconstruction. Our approach builds upon the recent SAM 3D framework for single-object reconstruction and extends it with a Multi-Object Decoder (MOD) that jointly predicts the geometry and poses of multiple objects in a scene. By reconstructing objects simultaneously, MOD captures contextual relationships and enforces more physically-plausible configurations.

In summary, we propose the following contributions:
– We introduce MessyKitchens, a new benchmark with cluttered real scenes and high-fidelity 3D object-level ground truth, including accurate object shapes, poses and contacts along with precise scene-level object registration.
– We propose the Multi-Object Decoder (MOD), a method that extends SAM 3D to the joint modeling and reconstruction of multiple objects in a scene.
– Through extensive experiments we demonstrate that MOD outperforms state-of-the-art object-level scene reconstruction on MessyKitchens, GraspNet-1B [19] and HouseCat6D [32]. We also demonstrate significantly improved 3D accuracy of the MessyKitchens benchmark compared to other recent datasets.

2 Related Works

Existing benchmarks for 3D scenes. Early datasets such as LINEMOD [25], T-LESS [26], YCB-M [22], and YCB-Video [57] established standardized evaluation protocols for 3D pose estimation and object-level reconstruction, significantly advancing the field. However, they were captured in controlled laboratory settings and typically contain a limited number of objects and scenes with restricted category diversity. Datasets such as MP6D [10] and T-LESS [26] focus largely on industrial objects and lack representation of everyday kitchen categories, limiting their realism for domestic environments.
Although synthetic datasets such as Falling Things [49] and ZeroGrasp-11B [29] scale up object count substantially, they remain purely simulation-based and do not reflect the challenges of real-world scene acquisition and annotation. More recent datasets, including GraspNet-1B [19], HouseCat6D [32], PhoCaL [54], Omni6DPose [63], KITchen [61], PACE [60], and GraspClutter6D [3], increase object diversity and incorporate kitchen and transparent objects. Nevertheless, they often face limitations in annotation accuracy, scene complexity, or scalability, as their capture pipelines rely on controlled environments and specialized hardware, such as robotic manipulators or motion capture systems. In contrast, MessyKitchens is collected using a scalable pipeline that does not require a specialized setup, enabling scene acquisition across diverse locations. Alternatives provide limited contact quality, which is one of the core focuses of our work. Moreover, our dataset achieves higher object-to-scene registration accuracy while remaining comparable in object diversity and scale.

Object-centric 3D scene reconstruction. Object-centric 3D scene reconstruction has traditionally evolved from feed-forward and retrieval-based strategies. Feed-forward methods typically leverage encoder-decoder architectures to regress scene properties, such as geometry, instance labels, and poses, directly from images [13, 21, 39, 40, 53, 58, 62, 65], while retrieval-based approaches, such as DiffCAD [20], align high-quality 3D assets from existing databases to the input [23, 30, 34, 35]. However, these methods are often limited by the scarcity of supervised 3D data and a heavy dependence on database diversity, which hinders their ability to generalize to complex out-of-distribution scenes [8, 15, 16]. Recently, compositional generative paradigms [12, 24, 48] have sought to model complex scenes using large-scale perceptual and 3D priors [18, 31, 33, 44, 46, 64]. Frameworks such as MIDI [27] and PartCrafter [37] achieve high generative fidelity through multi-instance diffusion and Diffusion Transformers (DiT) [41]. However, these approaches are primarily developed on synthetic datasets and often face challenges when generalizing to the complexities of real-world captures. In contrast, SAM 3D [11] provides a highly scalable pipeline that delivers strong foundational results in diverse real-world scenes through robust data alignment. Yet, because SAM 3D typically processes objects as independent tokens, it lacks explicit, end-to-end reasoning about the spatial inter-dependencies between multiple objects. This independent treatment often leads to inaccuracies in the global spatial layout and imprecise relative object poses, posing a significant challenge for achieving spatially consistent 3D scene reconstructions in complex environments.

3 The MessyKitchens Benchmark

We introduce here our MessyKitchens benchmark. Its core aim is the evaluation of object-centered 3D scene reconstruction accuracy in challenging scenarios, using physically-curated ground-truth data. We describe the acquisition process and the characteristics of the data in Section 3.1. In Section 3.2, we also provide a synthetic set, coined MessyKitchens-synthetic, to enable training on similar scenarios.
Finally, in Section 4, we propose the Multi-Object Decoder as a new, simple baseline for 3D scene reconstruction.

3.1 Real data

Data acquisition. We now describe our data acquisition strategy. In total, we collect 100 real scenes, each composed of a variable number of kitchenware objects that depends on the difficulty level of the scene. We employ 130 objects in total, gathered from 10 different kitchens, for which we collect 3D scans with an Einstar Vega 3D scanner. We built a dedicated apparatus for scanning objects in isolation, displayed in Figure 2, left. In practice, we position the object on a transparent acrylic surface. Since the scanner does not detect the acrylic surface, this allows us to take multiple scans of the same object from different viewpoints without moving it, which greatly increases the precision of object scanning. We then collect two scans per object: a first scan from above the object, and a second from below, in order to capture the entire 3D geometry. These two scans are later aligned to obtain a dense 3D ground truth. To facilitate alignment, we position visual markers on the acrylic surface. We use double-sided reflective markers, positioned so that they can be seen by the scanner from either above or below the object. The full process takes 6-10 minutes per object.

Fig. 2: On the left, we show our object scanning system. The transparent surface allows us to take multiple scans without moving the object. On the right, we show RGB and 3D samples of MessyKitchens for the three difficulty levels (Easy, Medium, Hard). Scenes get more cluttered, with more sophisticated object interactions, as the difficulty increases. We also provide a synthetic set (MessyKitchens-synthetic), usable for training, with constructed scenes similar to the real dataset.

Difficulty levels. We then construct each scene by assembling pre-scanned objects on different surfaces, to promote variability. We define three difficulty levels depending on the configuration: (1) Easy: we include 4 objects, well separated and with minimal contact; (2) Medium: we include 6 objects, among which 4 are base objects lying on a flat surface and the other 2 are stacked on top of the others, in equilibrium. We reduce the separation and impose more contacts between objects; (3) Hard: we use 8 objects, imposing maximum contact between them. In addition to the characteristics of the medium setup, we include nested objects inserted one inside the other (such as a cup into a bowl). We display examples of our scene configurations in Figure 2, center. After being assembled, the scenes are scanned with the same sensor that was used for the objects. In particular, we carefully scan each constructed setup while ensuring stable tracking and complete surface coverage, so that even heavily occluded or contact-rich regions are captured as accurately as possible. The acquired scene scans are subsequently processed, cleaned, and decimated using the scanner software, preserving geometric fidelity (with a point-to-mesh error below 0.05 mm). We subsequently map textures onto the mesh. In total, the scanning process takes up to 20 minutes per scene.
The fully reconstructed scenes are exported for subsequent registration with the individual object models. Note that scenes are built incrementally: we first construct a hard scene and scan only that scene. We perform the registration of the existing objects on the hard scene. Then, we remove some objects from the hard scene and construct, consequently, a medium scene first, and finally an easy scene. This allows us to obtain the same configuration of some objects, with varying contacts. Note that the RGB images associated with the scenes do vary, due to the different camera poses used when recording the different difficulty levels. This can be useful for assessing the dependency of 3D reconstruction algorithms on camera viewpoints.

Registration. To register the object models to their corresponding scenes, we first perform a manual coarse alignment, which gives us the initial transformations for our registration pipeline. Our registration pipeline has two stages: in the first stage, we randomly sample a set of 3D points from the surface of each object mesh, find the distance to the closest on-surface point of the scene mesh, and optimize the object transformations to minimize this cost; in the second stage, taking the transformations from the first stage as new initial estimates, we additionally impose, at optimization time, that the normals of the scan and of the object are aligned for a given point. Indeed, most of our objects are thin and concave, so a point on the top surface and its opposite point on the bottom surface often have very similar distances to the scan. During automatic registration, the optimizer can therefore minimize the error by placing the scan surface between the two walls of the object, incorrectly satisfying both sides. Penalizing for normal coherence prevents this.

3.2 Synthetic data

Scene construction. To enable training on MessyKitchens, we construct a synthetic dataset that closely mirrors our setup for real scene scanning. We call this set MessyKitchens-synthetic. We start from the 3D assets of GSO [17], which include 42 kitchenware objects. We then generate scenes following the same difficulty levels as in our MessyKitchens real scenes. For easy scenes, we simply place four objects at random on a flat surface. For medium scenes, after randomly placing four objects, we position two additional objects on top of existing ones, and we activate gravity to make the simulation physically realistic. We make sure that objects are effectively stable on top of others before including the scene in the dataset. For hard scenes, we include stacked objects, as in the real setup. Ensuring realistic randomized stacking of objects in synthetic simulation is nontrivial. To do so, we first compute object volume estimates, and drop appropriately sized objects onto compatible support objects. All scenes are generated using concave, mesh-based collision, which is critical for achieving realistic stacked and nested interactions. In all cases, we ensure that our scenes are contact-rich and physically realistic. We visualize samples in Figure 2, right.

Rendering. Once we construct the 3D scene, we sample multiple views to enable training of 3D reconstruction methods. To promote photorealism, we employ Blender's Cycles engine for rendering the scenes and use the original texture files from the dataset. We render 10 different views, randomizing the camera angle by sampling the azimuth $\varphi \in [0, 2\pi]$ and the elevation $\iota \in [\pi/4, \pi/2]$. This sampling ensures diverse viewpoints while maintaining a top-down bias consistent with tabletop scenarios (see the sketch below). Simultaneously with the image rendering, we extract instance-based semantic maps for all objects.
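The view sampling described above can be summarized in a short snippet. The following is a minimal NumPy sketch of the azimuth/elevation sampling; the camera radius, the look-at convention and the random seed are illustrative assumptions, not values from the paper.

```python
import numpy as np

def sample_camera_positions(n_views=10, radius=1.0, seed=0):
    """Sample camera centers on a sphere around the scene origin.

    Azimuth is drawn uniformly from [0, 2*pi) and elevation from
    [pi/4, pi/2], which biases the views towards top-down angles, as in
    tabletop capture. The radius and the implicit look-at target (the
    origin) are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    azimuth = rng.uniform(0.0, 2.0 * np.pi, size=n_views)
    elevation = rng.uniform(np.pi / 4.0, np.pi / 2.0, size=n_views)
    # Spherical-to-Cartesian conversion; elevation is measured from the
    # horizontal plane, so z = r * sin(elevation).
    x = radius * np.cos(elevation) * np.cos(azimuth)
    y = radius * np.cos(elevation) * np.sin(azimuth)
    z = radius * np.sin(elevation)
    return np.stack([x, y, z], axis=-1)  # (n_views, 3) camera centers
```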
4 Multi-Object Decoder

Fig. 3: Multi-Object Decoder for 3D reconstruction. SAM 3D outputs 3D shapes from input images and object masks. To impose scene-level constraints, we use a Multi-Object Decoder that refines the SAM 3D predictions of the object poses. The refined residual term is summed to the original prediction to obtain a scene-aware pose estimation.

Fig. 4: Multi-Object Decoder. We inform pose tokens with scene-level context by using K blocks including multi-object self-attention and cross-attention. We use pose and shape information from all objects to obtain residual pose-correcting factors.

4.1 Method details

To extract 3D objects from 2D images, SAM 3D takes as input an image $x$ and an object pixelwise mask $m_i$ indicating the position of object $i$. It then returns a voxel-based shape prediction $s$ and a 7-DoF pose $p = (q, t, \sigma)$ of the object in 3D space, where $q \in \mathbb{H}_1$ is a unit quaternion representing the 3D rotation, $t \in \mathbb{R}^3$ is the translation vector, and $\sigma \in \mathbb{R}^+$ is the isotropic scaling factor. To do so, it processes $x$ by extracting shape tokens $t^s \in \mathbb{R}^{F_s \times C}$ and pose tokens $t^p \in \mathbb{R}^{F_p \times C}$, used to estimate $s$ and $p$, respectively, through the SAM 3D decoder. $F_s$ and $F_p$ are the sequence lengths, and $C$ is the feature dimension. We refer to the original paper for details [11].

For a scene with $N$ objects, we denote the aggregated scene tokens as $T^s = \{t^s_1, \ldots, t^s_N\} \in \mathbb{R}^{N \times F_s \times C}$ and $T^p = \{t^p_1, \ldots, t^p_N\} \in \mathbb{R}^{N \times F_p \times C}$, where $t^s_i$ and $t^p_i$ correspond to the shape and pose tokens of object $i$, respectively. Our objective is to train a Multi-Object Decoder to estimate residual scene-aware pose updates $\tilde{P} = \{\tilde{p}_1, \ldots, \tilde{p}_N\}$ from the aggregated tokens $T^p$ and $T^s$, and to compute the final refined poses as $P + \tilde{P}$, encouraging globally consistent scene-level understanding. We visualize our approach in Figure 3.

To correctly predict scene-aware pose tokens, each pose prediction should be performed with awareness of both the pose and the shape of all other objects. We achieve this by building the Multi-Object Decoder as a stack of $K$ blocks composed of (i) a multi-object self-attention layer for pose tokens, which can correlate the poses of all objects, and (ii) a multi-object cross-attention that grounds the refined pose tokens in the shape tokens of all objects. As we show in Figure 4, within a block we first refine pose tokens independently for each object by applying standard self-attention $\mathrm{SA}(\cdot)$ along the sequence dimension, using $N$ as the batch size:
$$\hat{T}^p = \mathrm{SA}(T^p), \quad \hat{T}^p \in \mathbb{R}^{N \times F_p \times C}. \qquad (1)$$
This step increases the expressive capacity of the pose features while preserving object-wise separation. We then enable cross-object reasoning by flattening the object and token dimensions of $\hat{T}^p$ into a single sequence of length $N F_p$, similarly to related literature [27, 37], so that now $\hat{T}^p \in \mathbb{R}^{1 \times (N F_p) \times C}$. We then process it with a self-attention layer $\mathrm{SA}_{\mathrm{multi}}(\cdot)$:
$$\tilde{T}^p = \mathrm{SA}_{\mathrm{multi}}(\hat{T}^p), \quad \tilde{T}^p \in \mathbb{R}^{1 \times (N F_p) \times C}. \qquad (2)$$
Finally, to ground the pose predictions in geometry, we similarly process shape and pose tokens with a multi-object cross-attention, using the aggregated pose tokens $\tilde{T}^p$ as queries and the reshaped shape tokens $T^s \in \mathbb{R}^{1 \times (N F_s) \times C}$ as keys and values:
$$T^{\mathrm{out}} = \mathrm{CA}_{\mathrm{multi}}(\tilde{T}^p, T^s), \quad T^{\mathrm{out}} \in \mathbb{R}^{1 \times (N F_p) \times C}. \qquad (3)$$
The result $T^{\mathrm{out}}$ is finally reshaped to $\mathbb{R}^{N \times F_p \times C}$ and provided as input to the next block. The output of the last block is decoded to $\tilde{P}$ with a linear layer, independently for each refined pose. A minimal implementation sketch of one block and of the residual-pose head is given below.
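To make the block structure above concrete, the following is a minimal PyTorch sketch of one Multi-Object Decoder block and of the residual-pose head. Hidden sizes, head counts, the 8-dimensional residual-pose parameterization and the mean-pooling before the linear head are assumptions made for illustration; the actual implementation may differ.

```python
import torch
import torch.nn as nn

class MODBlock(nn.Module):
    """One Multi-Object Decoder block (sketch of Sec. 4.1).

    Per-object self-attention on pose tokens, multi-object self-attention
    over the flattened (object, token) sequence, and multi-object
    cross-attention from pose tokens to the shape tokens of all objects.
    """

    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.multi_self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.multi_cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, pose_tokens, shape_tokens):
        # pose_tokens:  (N, F_p, C) -- the N objects act as the batch dim
        # shape_tokens: (N, F_s, C)
        n, f_p, c = pose_tokens.shape
        # (i) per-object self-attention along the token dimension, Eq. (1)
        t_hat, _ = self.self_attn(pose_tokens, pose_tokens, pose_tokens)
        # (ii) flatten objects and tokens into one sequence of length N*F_p, Eq. (2)
        t_hat = t_hat.reshape(1, n * f_p, c)
        t_tilde, _ = self.multi_self_attn(t_hat, t_hat, t_hat)
        # (iii) cross-attention: pose tokens as queries, the shape tokens of
        # all objects (length N*F_s) as keys/values, Eq. (3)
        s_flat = shape_tokens.reshape(1, -1, c)
        t_out, _ = self.multi_cross_attn(t_tilde, s_flat, s_flat)
        return t_out.reshape(n, f_p, c)


class MultiObjectDecoder(nn.Module):
    """Stack of K MOD blocks followed by a linear residual-pose head.

    The paper decodes residual poses per object with a linear layer;
    mean-pooling the F_p tokens before the head is an assumption made
    here for concreteness. The 8-dim output packs a quaternion (4),
    a translation (3) and a scale correction (1).
    """

    def __init__(self, dim: int, k: int = 3, pose_dim: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList(MODBlock(dim) for _ in range(k))
        self.head = nn.Linear(dim, pose_dim)

    def forward(self, pose_tokens, shape_tokens):
        x = pose_tokens
        for block in self.blocks:
            x = block(x, shape_tokens)
        return self.head(x.mean(dim=1))  # (N, pose_dim) residual poses
```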
4.2 Training and inference

We train the Multi-Object Decoder with a weighted combination of rotation, translation, and scale losses. First, we impose a geometry term $\mathcal{L}_{\mathrm{CD}}$ based on the Chamfer distance (CD) between the predicted and ground-truth shapes. Then, we represent rotations as unit quaternions $q \in \mathbb{H}_1$ and use an alignment term between the predicted quaternion $q$ and the ground-truth quaternion $\hat{q}$:
$$\mathcal{L}_{\mathrm{ip}} = 1 - \langle q, \hat{q} \rangle^2.$$
This loss accounts for the double-cover property of quaternions, where $q$ and $-q$ represent the same physical rotation; by squaring the inner product, we effectively minimize the geodesic distance on the $SO(3)$ manifold without sign ambiguity [28]. Translation and scale are supervised with standard regression losses between the outputs (the translation vector $t \in \mathbb{R}^3$ and the isotropic scaling factor $\sigma \in \mathbb{R}^+$) and the ground truth, i.e. $\mathcal{L}_t$ and $\mathcal{L}_s$, respectively. Explicitly, the symmetric Chamfer distance is:
$$\mathcal{L}_{\mathrm{CD}}(S, \hat{S}) = \frac{1}{|S|} \sum_{x \in S} \min_{\hat{x} \in \hat{S}} \| x - \hat{x} \|_2^2 + \frac{1}{|\hat{S}|} \sum_{\hat{x} \in \hat{S}} \min_{x \in S} \| x - \hat{x} \|_2^2. \qquad (4)$$
This geometric term is vital for handling object symmetry; since it relies on nearest-neighbor correspondences rather than a fixed point-wise mapping, geometrically identical views of symmetric objects result in an equivalent loss, preventing the model from being penalized for valid but non-unique orientations.

Considering that our training data and the SAM 3D outputs may lie in different canonical spaces, to obtain reliable ground truth we first estimate a global $\mathrm{Sim}(3)$ transformation between the predicted and ground-truth scenes using ICP [4]. We then match predicted objects to ground-truth ones and, for each matched pair, refine the object pose by performing $\mathrm{Sim}(3)$ ICP from the SAM 3D object to its ground-truth counterpart. Finally, we decompose the resulting $\mathrm{Sim}(3)$ transformation into rotation, translation, and scale (denoted as $\hat{q}$, $\hat{t}$, and $\hat{\sigma}$, respectively) to serve as direct supervision. Our final objective is:
$$\mathcal{L} = 0.1\,\mathcal{L}_{\mathrm{CD}} + 100\,\mathcal{L}_t + 100\,\mathcal{L}_s + 10\,\mathcal{L}_{\mathrm{ip}}. \qquad (5)$$
At inference, we extract tokens for each object detected by SAM 3 [7] and refine their pose predictions with the Multi-Object Decoder. The structural outputs of SAM 3D, such as voxels or meshes, remain unchanged but are accurately repositioned in the 3D scene to ensure global coherence and physical consistency. A minimal sketch of the loss terms entering Eq. (5) is given below.
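The following PyTorch sketch illustrates the loss terms used in Eq. (5). The brute-force Chamfer distance and the batch reduction are illustrative simplifications, not the paper's exact implementation.

```python
import torch

def quaternion_alignment_loss(q_pred, q_gt):
    """L_ip = 1 - <q, q_hat>^2, robust to the quaternion double cover.

    Both inputs are (N, 4) unit quaternions; squaring the inner product
    makes q and -q equivalent, avoiding sign ambiguity.
    """
    inner = (q_pred * q_gt).sum(dim=-1)
    return (1.0 - inner ** 2).mean()

def chamfer_distance(s_pred, s_gt):
    """Symmetric Chamfer distance between point sets (N, 3) and (M, 3), Eq. (4).

    A brute-force sketch; production code would use a dedicated kernel.
    """
    d = torch.cdist(s_pred, s_gt) ** 2            # (N, M) squared distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def total_loss(l_cd, l_t, l_s, l_ip):
    """Weighted objective of Eq. (5): L = 0.1 L_CD + 100 L_t + 100 L_s + 10 L_ip."""
    return 0.1 * l_cd + 100.0 * l_t + 100.0 * l_s + 10.0 * l_ip
```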
5 Experiments

5.1 Setup and baselines

Datasets. We compare against multiple concurrent benchmarks containing household and kitchenware objects: T-LESS [26], LINEMOD [25], YCB-Video [57], MP6D [10], GraspNet-1B [19], and GraspClutter6D [3]. For the evaluation of MOD, besides MessyKitchens, we use GraspNet-1B [19] and HouseCat6D [32] as out-of-distribution test sets.

Methods. We compare the 3D reconstructions of the Multi-Object Decoder against three major object-level baselines: PartCrafter [37], MIDI [27], and SAM 3D [11]. PartCrafter is the only method that does not require a segmentation map of the objects as input. For all the others and ours, we use the same segmentation maps extracted by SAM 3 [7]. For all methods, we compare in terms of intersection-over-union (IoU) and Chamfer Distance (CD). For both metrics, we report object-level and scene-level values.

Training setup. We train our Multi-Object Decoder on 4 NVIDIA A100 GPUs (40 GB). For training, we use MessyKitchens-synthetic, sampling 600 scenes per difficulty level (easy/medium/hard) and rendering 6 images per scene, for a total of 10,800 images. MOD adds a few parameters to SAM 3D (approximately 81 million), and training for 10 epochs takes approximately 2 hours. We set K = 3 for the architecture of MOD. We use a learning rate of 5e-5 with a linear warm-up during the first 10% of training.

Fig. 5: Comparison with other benchmarks. In Table (a), we show that MessyKitchens yields significant improvements in registration accuracy, measured as depth errors (mm), compared to the others. In Figure (b), we plot contact area against penetration area (mm^2, log scale) and annotate, for each dataset, the ratio between penetration and contact surface area. MessyKitchens exhibits the best ratio, demonstrating the high quality of our cluttered scenes and resulting in physically-realistic contacts.

(a) Registration accuracy: mean, median and standard deviation of the absolute depth error (mm).

| Dataset            | Sensor           | Mean abs. error | Median abs. error | Std. dev. |
|--------------------|------------------|-----------------|-------------------|-----------|
| T-LESS [26]        | Carmine          | 4.28            | 2.46              | 7.72      |
| T-LESS [26]        | Kinect v2        | 8.40            | 5.45              | 11.36     |
| LINEMOD [25]       | Kinect v2        | 5.89            | 5.57              | 1.47      |
| YCB-Video [57]     | Xtion Pro Live   | 3.95            | 3.66              | 2.26      |
| MP6D [10]          | Tuyang FM851-E2  | 3.54            | 2.70              | 0.17      |
| GraspNet-1B [19]   | RealSense D435   | 7.69            | 4.95              | 14.30     |
| GraspNet-1B [19]   | Azure Kinect     | 14.79           | 9.54              | 20.20     |
| GraspClutter6D [3] | Zivid            | 3.22            | 1.55              | 11.10     |
| GraspClutter6D [3] | RealSense D415   | 5.71            | 3.77              | 14.67     |
| GraspClutter6D [3] | RealSense D435   | 7.02            | 4.58              | 13.82     |
| GraspClutter6D [3] | Azure Kinect     | 13.85           | 6.83              | 32.59     |
| MessyKitchens      | Einstar Vega     | 1.62            | 0.91              | 3.83      |

(b) Contacts and penetration: scatter of contact area vs. penetration area (both mm^2, log scale) for MessyKitchens, HouseCat6D, GraspNet-1B and GraspClutter6D, with annotated penetration-to-contact ratios (0.14, 0.28, 0.43, 0.66).

5.2 Data quality

Registration accuracy. Registration accuracy is crucial for faithful 3D reconstruction. Following [3, 26], we assess registration quality by measuring the depth discrepancy between the rendered depth of the registered objects and the ground-truth scan depth. In Figure 5a, we report the mean and median absolute depth error, together with the standard deviation, all in millimeters (a sketch of this metric is given below). As shown, MessyKitchens achieves very accurate registration, with a mean error of 1.62 mm, corresponding to a 49.7% relative improvement over the second-best benchmark, GraspClutter6D (3.22 mm). We observe a similar trend for the median error, where MessyKitchens attains 0.91 mm, improving by 41.3% over the second best (1.55 mm). These results highlight the precision of our data and the effectiveness of our automatic registration pipeline, demonstrating that MessyKitchens provides a reliable foundation for object-level 3D reconstruction.
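A minimal NumPy sketch of the depth-discrepancy metric is given below. It assumes that a depth map rendered from the registered object models and the corresponding scan depth are already available in millimeters, share the same viewpoint, and come with a validity mask; how the maps are rendered is outside this sketch.

```python
import numpy as np

def depth_registration_error(rendered_depth, scan_depth, valid_mask):
    """Per-scene registration accuracy from depth discrepancy (mm).

    `rendered_depth` is rendered from the registered object models,
    `scan_depth` is the ground-truth scan depth, and `valid_mask`
    selects pixels covered by both maps.
    """
    delta = np.abs(rendered_depth[valid_mask] - scan_depth[valid_mask])
    return {
        "mean_abs": float(delta.mean()),
        "median_abs": float(np.median(delta)),
        "std": float(delta.std()),
    }
```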
Contacts and penetrations. Realistic modeling of contacts is essential in cluttered scenes, as many downstream tasks rely on physically plausible object interactions. However, imprecise dataset construction may introduce unrealistic inter-object penetrations, limiting the reliability of the data for contact reasoning and physics-based applications. In Figure 5b, we compare the datasets by jointly analyzing contact and penetration statistics. On the y-axis, we report the contact area, computed as the surface area of points lying within 2.5 mm of a distinct object. On the x-axis, we measure the penetration area, defined as the area of surfaces from one object intersecting another, detected via voxelization and intersection testing (a point-sampling approximation of both measures is sketched below). Next to each point, we report the ratio between penetration surface area and contact surface area. While GraspClutter6D exhibits large contact regions, it also shows substantial penetration, resulting in an unfavorable ratio (0.66), suggesting that many contacts in its 3D scenes are not physically realistic. In contrast, MessyKitchens achieves the best ratio (0.14), showcasing the precision of our registered cluttered scenes. This serves as further proof of the physical consistency of object interactions in MessyKitchens. Finally, in Table 1, we report the contact and penetration measures in easy/medium/hard scenes. We notice that the ratios obtained in medium and hard scenes are similar, highlighting that our registration is robust enough for physically-accurate contacts independently of the complexity. Easy scenes have almost no contacts.

Table 1: Contacts across difficulty levels. We report contact accuracy metrics across difficulty levels (all areas in mm^2). The ratio between penetration and contact area in medium and hard scenes is similar, showcasing the quality of our data.

| Split  | Contact area | Penetration area | Ratio  |
|--------|--------------|------------------|--------|
| Easy   | 14.66        | 1.062            | 0.0724 |
| Medium | 2593         | 397.4            | 0.1533 |
| Hard   | 5892         | 793.5            | 0.1347 |
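As a rough illustration of the contact and penetration measures used above, the following trimesh-based sketch approximates both areas by surface point sampling. The paper detects penetration via voxelization and intersection testing, so this variant is an approximation; it also assumes watertight meshes so that signed distances are well defined.

```python
import numpy as np
import trimesh

def contact_and_penetration_area(mesh_a, mesh_b, n_samples=20000,
                                 contact_tol=2.5):
    """Approximate contact / penetration area between two meshes (mm^2).

    Surface points of A closer than `contact_tol` (2.5 mm) to B count as
    contact; points with positive signed distance (i.e. inside B) count
    as penetration. Areas are estimated as the covered fraction of A's
    surface area, which is a point-sampling approximation.
    """
    pts = mesh_a.sample(n_samples)
    query = trimesh.proximity.ProximityQuery(mesh_b)
    signed = query.signed_distance(pts)          # > 0 for points inside mesh_b
    contact = np.abs(signed) < contact_tol
    penetration = signed > 0.0
    per_point_area = mesh_a.area / n_samples
    return contact.sum() * per_point_area, penetration.sum() * per_point_area
```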
5.3 3D object-based reconstruction

Table 2: Effectiveness of the Multi-Object Decoder. On MessyKitchens, GraspNet-1B, and HouseCat6D, SAM 3D+MOD achieves state-of-the-art object-level 3D scene reconstruction. We measure both object-level metrics (first four rows) and scene-level ones (last four rows).

| Level  | Method      | MessyKitchens IoU ↑ | MessyKitchens CD ↓ | GraspNet-1B IoU ↑ | GraspNet-1B CD ↓ | HouseCat6D IoU ↑ | HouseCat6D CD ↓ |
|--------|-------------|---------------------|--------------------|-------------------|------------------|------------------|-----------------|
| Object | PartCrafter | 0.071               | 0.495              | 0.020             | 0.956            | 0.029            | 0.852           |
| Object | MIDI        | 0.186               | 0.285              | 0.067             | 0.640            | 0.092            | 0.433           |
| Object | SAM 3D      | 0.409               | 0.064              | 0.336             | 0.082            | 0.325            | 0.125           |
| Object | MOD         | 0.445               | 0.061              | 0.344             | 0.078            | 0.404            | 0.100           |
| Scene  | PartCrafter | 0.133               | 0.228              | 0.075             | 0.355            | 0.114            | 0.289           |
| Scene  | MIDI        | 0.238               | 0.165              | 0.121             | 0.327            | 0.172            | 0.215           |
| Scene  | SAM 3D      | 0.431               | 0.054              | 0.356             | 0.074            | 0.374            | 0.099           |
| Scene  | MOD         | 0.472               | 0.050              | 0.377             | 0.069            | 0.458            | 0.079           |

Effectiveness of MOD. In Table 2, we compare the Multi-Object Decoder against state-of-the-art baselines for object-centric 3D reconstruction. We train MOD exclusively on MessyKitchens-synthetic, while evaluating SAM 3D, PartCrafter, and MIDI in a zero-shot setting. Evaluation is conducted on MessyKitchens, GraspNet-1B, and HouseCat6D. As the predicted 3D reconstructions and ground-truth scenes may not share a common coordinate frame, we perform scene-level alignment using Sim(3) ICP. To mitigate sensitivity to initialization and avoid alignment bias, we run ICP three times and report the best result. Naive inference with SAM 3D already substantially outperforms the competing methods across all benchmarks, achieving, for example, a +19.3% improvement over the second best (MIDI) in scene-level reconstruction on MessyKitchens (0.431 vs 0.238), motivating our design choice to use MOD as a plug-in component on top of SAM 3D. However, incorporating MOD further boosts performance consistently. For instance, for object-level IoU, training solely on synthetic data yields improvements on MessyKitchens (0.445 with MOD vs 0.409), as well as gains on GraspNet-1B (0.344 vs 0.336) and HouseCat6D (0.404 vs 0.325). These results indicate that MessyKitchens provides effective supervision for robust object-based 3D reconstruction and highlight the importance of accurate contacts and object pose/scale refinement, as even minor geometric corrections lead to notable improvements. Furthermore, MOD demonstrates strong generalization beyond the training distribution, achieving consistent performance under domain shift. Notably, HouseCat6D and GraspNet-1B contain object categories distinct from kitchenware, further confirming out-of-distribution generalization to different objects.

Fig. 6: Qualitative comparison. We show examples of 3D reconstructions (Input, GT, SAM 3D, MOD, PartCrafter, MIDI) on MessyKitchens. In the insets, we show significant scene-level improvements over SAM 3D, demonstrating the effectiveness of MOD. The gray shapes are ground truth.

Qualitative results. We report in Figure 6 a qualitative comparison between MOD and alternative methods. As shown, PartCrafter and MIDI struggle to generalize to the evaluated setups, often producing incomplete or geometrically inconsistent reconstructions. SAM 3D instead generates visually realistic object shapes across MessyKitchens and the other datasets. MOD builds on these predictions by refining only the pose and scale of the detected objects, leaving their geometry unchanged. In the insets, we present close-up comparisons between SAM 3D and MOD outputs, where the gray surface denotes the ground truth. The addition of MOD consistently improves alignment in the presence of object interactions, correcting pose and scale estimates by enforcing scene-level geometric consistency and more accurate inter-object spatial relationships. In particular, in the last column, it is visible how the introduction of MOD allows the correct reconstruction of two objects in contact with each other. We attribute this to our precise, contact-rich training data in MessyKitchens.

5.4 Ablation studies

Fig. 7: Registration ablation. We show that our registration based on distance+normals (b) considerably improves results with respect to distance only (a). The quantitative registration metrics in (c) agree with our qualitative evaluation.

(c) Quantitative registration ablation: absolute depth errors (mm).

| Registration     | Mean abs. error | Median abs. error | Std. dev. |
|------------------|-----------------|-------------------|-----------|
| Manual           | 4.689           | 3.209             | 7.439     |
| Distance only    | 2.892           | 2.141             | 4.822     |
| Distance+normals | 1.615           | 0.911             | 3.827     |

Registration strategy. We ablate our design choice for the registration strategy, following the same metrics reported in Figure 5a. First, we show the Manual setup, which uses the rough manual initialization of our registration pipeline.
Then, we perform our automatic registration, with either a naive distance criterion or our normal-based matching criterion described in Sec. 3.1. From the results in Fig. 7, our full approach considerably outperforms the alternatives. This showcases the effectiveness of our automatic registration pipeline.

Multi-object attention. In the Multi-Object Decoder, we use multi-object self- and cross-attention for pose and shape tokens. We now investigate the effectiveness of this choice. We compare our design (S+P), which uses multi-object attention on both shape (S) and pose (P) tokens, with alternative designs in which we use multi-object attention for shape or pose only. In these, we replace the multi-object attention with a standard cross- or self-attention operating on a single object. In Table 3a, we show that our setup achieves the best results, suggesting that the interaction between the pose and shape tokens of all objects is important.

Table 3: MOD ablation studies. In Table (a), we investigate the effects of the multi-object attention layers, showing that both scene-level information about pose and shape contribute to the best performance. In Table (b), we report that 3 transformer blocks (K = 3) yield the best results.

(a) Multi-object attention (MessyKitchens):

| Level  | MOD setup  | IoU ↑ | CD ↓  |
|--------|------------|-------|-------|
| Object | Shape only | 0.438 | 0.063 |
| Object | Pose only  | 0.432 | 0.065 |
| Object | S+P (Ours) | 0.445 | 0.061 |
| Scene  | Shape only | 0.463 | 0.054 |
| Scene  | Pose only  | 0.457 | 0.056 |
| Scene  | S+P (Ours) | 0.472 | 0.050 |

(b) Number of blocks (MessyKitchens):

| Level  | Blocks setup | IoU ↑ | CD ↓  |
|--------|--------------|-------|-------|
| Object | K = 1        | 0.441 | 0.061 |
| Object | K = 3 (Ours) | 0.445 | 0.061 |
| Object | K = 6        | 0.403 | 0.065 |
| Scene  | K = 1        | 0.467 | 0.051 |
| Scene  | K = 3 (Ours) | 0.472 | 0.050 |
| Scene  | K = 6        | 0.427 | 0.059 |

Architecture. We propose an ablation on the number of blocks used in the Multi-Object Decoder. In Table 3b, we vary K, the number of blocks, in the range K = {1, 3, 6}. From our results, K = 3 yields the best performance, while adding more blocks (K = 6) tends to degrade results. We hypothesize that this is due to the richness of the SAM 3D features, on top of which MOD is plugged. Indeed, it may be that SAM 3D does not require many non-linearities and complex architectures to encode scene awareness, and thus benefits from a smaller MOD.

6 Conclusion

In this work, we addressed the challenging problem of physically-plausible, object-level 3D scene reconstruction from monocular images. To overcome the limitations of existing datasets, we introduced MessyKitchens, a novel benchmark featuring cluttered, contact-rich real-world scenes. By utilizing a rigorous data acquisition and normal-aware registration pipeline, MessyKitchens provides high-fidelity 3D ground truth with significantly lower inter-object penetration compared to previous datasets, setting a new standard for evaluating physical consistency. Furthermore, we proposed the Multi-Object Decoder (MOD), a simple yet highly effective extension of the single-object SAM 3D framework based on the combination of several multi-object attentions. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art baselines on MessyKitchens, GraspNet-1B, and HouseCat6D, showing particularly strong out-of-distribution generalization and high-quality results.
Ultimately, we believe that MessyKitchens and the Multi-Object Decoder will provide a robust foundation for future research in physics-consistent 3D computer vision, facilitating advancements in downstream applications such as robotic manipulation, virtual reality, and 3D animation.

References

1. Ahn, M., Brohan, A., Brown, N., Chebotar, Y., Cortes, O., David, B., Finn, C., Fu, C., Gopalakrishnan, K., Hausman, K., et al.: Do as I can, not as I say: Grounding language in robotic affordances. CoRL (2022)
2. Ansari, J.A., Tourani, S., Kumar, G., Bhowmick, B.: Exploring social motion latent space and human awareness for effective robot navigation in crowded environments. In: IROS. IEEE
3. Back, S., Lee, J., Kim, K., Rho, H., Lee, G., Kang, R., Lee, S., Noh, S., Lee, Y., Lee, T., et al.: GraspClutter6D: A large-scale real-world dataset for robust perception and grasping in cluttered scenes. RA-L (2025)
4. Besl, P.J., McKay, N.D.: Method for registration of 3-D shapes. In: Sensor Fusion IV: Control Paradigms and Data Structures. SPIE (1992)
5. Blender Online Community: Blender - a 3D modelling and rendering package (2024), http://www.blender.org
6. Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Dabis, J., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., Hsu, J., et al.: RT-1: Robotics transformer for real-world control at scale. RSS (2023)
7. Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., et al.: SAM 3: Segment anything with concepts. arXiv (2025)
8. Chang, A.X., et al.: ShapeNet: An information-rich 3D model repository. arXiv (2015)
9. Chaplot, D.S., Gandhi, D., Gupta, S., Gupta, A., Salakhutdinov, R.: Learning to explore using active neural SLAM. ICLR (2020)
10. Chen, L., Yang, H., Wu, C., Wu, S.: MP6D: An RGB-D dataset for metal parts' 6D pose estimation. RA-L (2022)
11. Chen, X., Chu, F.J., Gleize, P., Liang, K.J., Sax, A., Tang, H., Wang, W., Guo, M., Hardin, T., Li, X., et al.: SAM 3D: 3Dfy anything in images. arXiv (2025)
12. Chen, Y., Wang, T., Wu, T., Pan, X., Jia, K., Liu, Z.: ComboVerse: Compositional 3D assets creation using spatially-aware diffusion guidance. In: ECCV (2024)
13. Dahnert, M., Hou, J., Nießner, M., Dai, A.: Panoptic 3D scene reconstruction from a single RGB image. NeurIPS (2021)
14. Dawson-Haggerty, M.: Trimesh: A Python library for loading and using triangular meshes. https://github.com/mikedh/trimesh (2019)
15. Deitke, M., et al.: Objaverse: A universe of annotated 3D objects. CVPR (2023)
16. Deitke, M., et al.: Objaverse-XL: A universe of 10M+ 3D objects. NeurIPS (2024)
17. Downs, L., Francis, A., Koenig, N., Kinman, B., Hickman, R., Reymann, K., McHugh, T.B., Vanhoucke, V.: Google Scanned Objects: A high-quality dataset of 3D scanned household items. In: ICRA (2022)
18. Eftekhar, A., Sax, A., Malik, J., Zamir, A.: Omnidata: A scalable pipeline for making multi-task mid-level vision datasets from 3D scans. In: ICCV (2021)
19. Fang, H.S., Wang, C., Gou, M., Lu, C.: GraspNet-1Billion: A large-scale benchmark for general object grasping. In: CVPR (2020)
20. Gao, D., Rozenberszki, D., Leutenegger, S., Dai, A.: DiffCAD: Weakly-supervised probabilistic CAD model retrieval and alignment from an RGB image. TOG (2024)
21. Gkioxari, G., Malik, J., Johnson, J.: Mesh R-CNN. In: ICCV (2019)
22. Grenzdörffer, T., Günther, M., Hertzberg, J.: YCB-M: A multi-camera RGB-D dataset for object recognition and 6DoF pose estimation. In: ICRA (2020)
23. Gümeli, C., Dai, A., Nießner, M.: ROCA: Robust CAD model retrieval and alignment from a single image. In: CVPR (2022)
24. Han, H., et al.: Reparo: Compositional 3D assets generation with differentiable 3D layout alignment. ICCV (2025)
25. Hinterstoisser, S., Lepetit, V., Ilic, S., Holzer, S., Bradski, G., Konolige, K., Navab, N.: Model based training, detection and pose estimation of texture-less 3D objects in heavily cluttered scenes. In: ACCV (2012)
26. Hodan, T., Haluza, P., Obdržálek, Š., Matas, J., Lourakis, M., Zabulis, X.: T-LESS: An RGB-D dataset for 6D pose estimation of texture-less objects. In: WACV (2017)
27. Huang, Z., Guo, Y.C., An, X., Yang, Y., Li, Y., Zou, Z.X., Liang, D., Liu, X., Cao, Y.P., Sheng, L.: MIDI: Multi-instance diffusion for single image to 3D scene generation. In: CVPR (2025)
28. Huynh, D.Q.: Metrics for 3D rotations: Comparison and analysis. Journal of Mathematical Imaging and Vision (2009)
29. Iwase, S., Irshad, M.Z., Liu, K., Guizilini, V., Lee, R., Ikeda, T., Amma, A., Nishiwaki, K., Kitani, K., Ambrus, R., et al.: ZeroGrasp: Zero-shot shape reconstruction enabled robotic grasping. In: CVPR (2025)
30. Izadinia, H., Shan, Q., Seitz, S.M.: IM2CAD. In: CVPR (2017)
31. Jun, H., Nichol, A.: Shap-E: Generating conditional 3D implicit functions. arXiv preprint arXiv:2305.02463 (2023)
32. Jung, H., Wu, S.C., Ruhkamp, P., Zhai, G., Schieber, H., Rizzoli, G., Wang, P., Zhao, H., Garattoni, L., Meier, S., et al.: HouseCat6D: A large-scale multi-modal category-level 6D object perception dataset with household objects in realistic scenarios. In: CVPR, pp. 22498-22508 (2024)
33. Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollár, P., Girshick, R.: Segment anything. In: ICCV (2023)
34. Kuo, W., Angelova, A., Lin, T.Y., Dai, A.: Mask2CAD: 3D shape prediction by learning to segment and retrieve. In: ECCV (2020)
35. Kuo, W., Angelova, A., Lin, T.Y., Dai, A.: Patch2CAD: Patchwise embedding learning for in-the-wild shape retrieval from a single image. In: ICCV (2021)
36. Levine, S., Pastor, P., Krizhevsky, A., Ibarz, J., Quillen, D.: Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research (2018)
37. Lin, Y., Lin, C., Pan, P., Yan, H., Feng, Y., Mu, Y., Fragkiadaki, K.: PartCrafter: Structured 3D mesh generation via compositional latent diffusion transformers. In: NeurIPS (2025)
38. Mousavian, A., Eppner, C., Fox, D.: 6-DOF GraspNet: Variational grasp generation for object manipulation. In: ICCV (2019)
39. Nie, Y., Han, X., Guo, S., Zheng, Y., Chang, J., Zhang, J.J.: Total3DUnderstanding: Joint layout, object pose and mesh reconstruction for indoor scenes from a single image. In: CVPR (2020)
40. Paschalidou, D., Kar, A., Shugrina, M., Kreis, K., Geiger, A., Fidler, S.: ATISS: Autoregressive transformers for indoor scene synthesis. NeurIPS 34, 12013-12026 (2021)
41. Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: ICCV (2023)
42. Philion, J., Fidler, S.: Lift, Splat, Shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3D. In: ECCV (2020)
43. Pumarola, A., Corona, E., Pons-Moll, G., Moreno-Noguer, F.: D-NeRF: Neural radiance fields for dynamic scenes. In: CVPR (2021)
44. Ren, T., et al.: Grounded SAM: Assembling open-world models for diverse visual tasks. arXiv (2024)
45. Ren, X., Shen, T., Huang, J., Ling, H., Lu, Y., Nimier-David, M., Müller, T., Keller, A., Fidler, S., Gao, J.: GEN3C: 3D-informed world-consistent video generation with precise camera control. In: CVPR (2025)
46. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR (2022)
47. Shah, D., Eysenbach, B., Kahn, G., Rhinehart, N., Levine, S.: ViNG: Learning open-world navigation with visual goals. In: ICRA (2021)
48. Tang, J., Nie, Y., Markhasin, L., Dai, A., Thies, J., Nießner, M.: DiffuScene: Denoising diffusion models for generative indoor scene synthesis. CVPR (2024)
49. Tremblay, J., To, T., Birchfield, S.: Falling Things: A synthetic dataset for 3D object detection and pose estimation. In: CVPR Workshops (2018)
50. Virtanen, P., Gommers, R., Oliphant, T.E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt, S.J., Brett, M., Wilson, J., Millman, K.J., Mayorov, N., Nelson, A.R.J., Jones, E., Kern, R., Larson, E., Carey, C.J., Polat, İ., Feng, Y., Moore, E.W., VanderPlas, J., Laxalde, D., Perktold, J., Cimrman, R., Henriksen, I., Quintero, E.A., Harris, C.R., Archibald, A.M., Ribeiro, A.H., Pedregosa, F., van Mulbregt, P., SciPy Contributors: SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods 17, 261-272 (2020). https://doi.org/10.1038/s41592-019-0686-2
51. Wang, H., Shahriar, F., Azimi, A., Vasan, G., Mahmood, R., Bellinger, C.: Versatile and generalizable manipulation via goal-conditioned reinforcement learning with grounded object detection. arXiv (2025)
52. Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., Novotny, D.: VGGT: Visual geometry grounded transformer. In: CVPR (2025)
53. Wang, N., Zhang, Y., Li, Z., Fu, Y., Liu, W., Jiang, Y.G.: Pixel2Mesh: Generating 3D mesh models from single RGB images. In: ECCV (2018)
54. Wang, P., Jung, H., Li, Y., Shen, S., Srikanth, R.P., Garattoni, L., Meier, S., Navab, N., Busam, B.: PhoCaL: A multi-modal dataset for category-level object pose estimation with photometrically challenging objects. In: CVPR (2022)
55. Wu, Z., Yu, C., Wang, F., Bai, X.: AnimateAnyMesh: A feed-forward 4D foundation model for text-driven universal mesh animation. In: ICCV (2025)
56. Xia, Y., Ding, R., Qin, Z., Zhan, G., Zhou, K., Yang, L., Dong, H., Cremers, D.: TARGO: Benchmarking target-driven object grasping under occlusions. arXiv preprint arXiv:2407.06168 (2024)
57. Xiang, Y., Schmidt, T., Narayanan, V., Fox, D.: PoseCNN: A convolutional neural network for 6D object pose estimation in cluttered scenes. RSS (2017)
58. Xie, H., Yao, H., Sun, X., Zhou, S., Zhang, S.: Pix2Vox: Context-aware 3D reconstruction from single and multi-view images. In: ICCV (2019)
59. Yang, L., Kang, B., Huang, Z., Xu, X., Feng, J., Zhao, H.: Depth Anything: Unleashing the power of large-scale unlabeled data. In: CVPR (2024)
60. You, Y., Xiong, K., Yang, Z., Huang, Z., Zhou, J., Shi, R., Fang, Z., Harley, A.W., Guibas, L., Lu, C.: PACE: Pose annotations in cluttered environments (2024)
61. Younes, A., Asfour, T.: KITchen: A real-world benchmark and dataset for 6D object pose estimation in kitchen environments. In: Humanoids (2024)
62. Zhang, C., Cui, Z., Zhang, Y., Zeng, B., Pollefeys, M., Liu, S.: Holistic 3D scene understanding from a single image with implicit representation. In: CVPR, pp. 8833-8842 (2021)
63. Zhang, J., Huang, W., Peng, B., Wu, M., Hu, F., Chen, Z., Zhao, B., Dong, H.: Omni6DPose: A benchmark and model for universal 6D object pose estimation and tracking. In: ECCV (2024)
64. Zhang, L., et al.: CLAY: A controllable large-scale generative model for creating high-quality 3D assets. TOG (2024)
65. Zhang, X., Chen, Z., Wei, F., Tu, Z.: Uni-3D: A universal model for panoptic 3D scene reconstruction. In: ICCV (2023)

Appendix

In this appendix, we provide additional results of our method, MOD, on GraspClutter6D, a highly cluttered and out-of-distribution dataset, in Section A. We then describe details of the creation of the real and synthetic subsets of our MessyKitchens dataset in Sections B and C, respectively. Furthermore, we provide an attached video demonstrating our high-quality MessyKitchens dataset and MOD's qualitative reconstruction results.

A Additional Results on GraspClutter6D

GraspClutter6D [3] serves as a challenging out-of-distribution (OOD) benchmark for our method due to its high clutter and diverse real-world object interactions. We provide here the complete quantitative comparison and further qualitative insights. Following the experimental protocol established in the main paper, all evaluations of MOD on this benchmark are conducted in a zero-shot manner; no fine-tuning was performed on this specific dataset.

A.1 Quantitative Performance

We compare our Multi-Object Decoder (MOD) against object-level baselines including PartCrafter [37], MIDI [27], and SAM 3D [11]. As shown in Table 4, MOD consistently achieves state-of-the-art results. Notably, in scene-level integrated reconstruction, MOD improves the IoU from 0.487 (SAM 3D) to 0.496, while reducing the Chamfer Distance to 0.058, demonstrating its robust generalization to unseen cluttered environments.

Table 4: Quantitative results on GraspClutter6D. Comparison of MOD against object-level baselines. We report the mean IoU and Chamfer Distance (CD) for both object-level reconstruction and scene-level integrated reconstruction.

| Level  | Method      | IoU ↑ | CD ↓  |
|--------|-------------|-------|-------|
| Object | PartCrafter | 0.067 | 0.674 |
| Object | MIDI        | 0.086 | 0.500 |
| Object | SAM 3D      | 0.328 | 0.105 |
| Object | MOD         | 0.340 | 0.103 |
| Scene  | PartCrafter | 0.227 | 0.227 |
| Scene  | MIDI        | 0.213 | 0.201 |
| Scene  | SAM 3D      | 0.487 | 0.059 |
| Scene  | MOD         | 0.496 | 0.058 |

Fig. 8: Qualitative comparison across HouseCat6D, GraspNet-1B, and GraspClutter6D. We visualize 3D scene reconstructions (Input, GT, SAM 3D, MOD) for MOD and the SAM 3D baseline. Our method consistently produces physically more plausible results, with fewer interpenetrations and more accurate grounding compared to the baseline.
Ground-truth meshes are overlaid in gray for reference.

A.2 Qualitative Visualization

We present qualitative reconstruction results on three challenging out-of-distribution (OOD) benchmarks: GraspClutter6D [3], GraspNet-1B [19], and HouseCat6D [32]. As shown in Figure 8, our Multi-Object Decoder (MOD) demonstrates superior generalization capabilities compared to SAM 3D. Despite the significant domain gap between our synthetic training data and these real-world captures, MOD successfully maintains global scene coherence. Specifically, it resolves common artifacts such as inter-object penetrations and unstable "floating" poses that frequently occur in the baseline reconstructions. This highlights the effectiveness of our learned object-level priors in reasoning about complex, contact-rich interactions across diverse environments.

B MessyKitchens-Real dataset creation

In this section, we elaborate on our MessyKitchens-Real dataset creation pipeline; see Figure 9.

Fig. 9: Overview of our MessyKitchens-Real dataset acquisition and registration pipeline: 3D scanning of the scene and of the objects in the scene, coarse manual alignment, stage-1 distance-based automatic registration, stage-2 distance+normal automatic registration yielding accurate 3D registration, and RGB capture of 15 images per scene (five azimuths at each of three elevation levels).

B.1 Object Scanning System

Our object scanning system consists of an acrylic plate mounted on a manual turntable. Since a complete 3D model requires capturing both visible and occluded surfaces, each object must be scanned from at least two sides (typically top and bottom), and the resulting scans must be aligned. Standard scan alignment requires selecting at least three corresponding features across scans, which is tedious and often infeasible for texture-less objects. To address this limitation, we designed a simple yet effective solution: we attach reflective markers to an acrylic base using double-sided placement (two single-sided markers carefully aligned back-to-back so that their centers coincide). Because the 3D scanner can see through the acrylic, the same marker is visible in both the top and bottom scans, enabling straightforward manual correspondence selection in the Einstar StarVision software. Additionally, we mount Lego blocks with embedded 3D markers onto the base, introducing non-coplanar constraints that improve alignment stability and prevent degeneracies caused by relying solely on in-plane features. The turntable further accelerates scanning by exposing all object sides without moving the scanner.

B.2 Object Scanning Procedure

Before scanning scenes, we first create high-quality 3D models of all objects that will appear in the scenes. Each object is placed on the acrylic platform mounted on the turntable, and two scans are captured: one from a top-down viewpoint and one from a bottom-up viewpoint. To expedite acquisition, the scanner remains fixed at one of the predefined positions while the turntable is rotated, ensuring full surface coverage.
Scanning is performed in the device's marker-only alignment mode, where reflective markers are used to align frames within each scanning session. The two complementary scans are then merged in the Einstar StarVision software by manually selecting corresponding markers across the scans. Because the reflective markers are double-sided and visible through the acrylic base, identifying consistent correspondences between the two scans is straightforward and reliable.

B.3 Scene Scanning

After all objects are successfully scanned, we construct scenes using these objects. Although object scanning has been consistently successful, in a few cases we repeated the scanning and alignment process multiple times to obtain the highest-quality model. Scenes are scanned using the device's Fast mode at the highest available resolution, with both feature and texture alignment enabled. The resulting point cloud is imported into the Einstar StarVision software, where it is reconstructed into a mesh. The raw scene meshes typically contain on the order of 3 x 10^6 vertices. To keep the mesh size manageable for our registration pipeline, we decimate the mesh by 80%. We verified geometric fidelity by measuring the point-to-mesh error between the original and decimated meshes, observing an error below 0.05 mm and confirming negligible loss of accuracy.

B.4 RGB Image Acquisition

After completing the 3D scene scanning, we capture RGB images of each scene using a handheld cell-phone camera operating in auto-focus mode. For each scene, we acquire 15 images in total: images are captured from three different elevation levels, and at each elevation level, five images are taken uniformly across azimuth angles to ensure diverse viewpoints. To increase variability and improve robustness for downstream tasks, we intentionally capture scenes under varying background settings and lighting conditions. This includes changes in ambient illumination and background appearance, resulting in realistic variability across the RGB image set.

B.5 Object-to-Scene Registration

Distance-based registration. After manual coarse alignment, we refine each object pose by minimizing a robust surface-to-surface distance objective. Let $^{\mathrm{scene}}T_{\mathrm{obj}} \in SE(3)$ denote the rigid transformation of the object. We uniformly sample $M = 500$ points $\{p_i\}_{i=1}^{M}$ from the object's surface. For each transformed point $^{\mathrm{scene}}T_{\mathrm{obj}}\, p_i$, we compute its closest point on the scene mesh, denoted by $\Pi_S(^{\mathrm{scene}}T_{\mathrm{obj}}\, p_i)$. Closest-point queries are performed using the trimesh.proximity module [14], which efficiently handles large scene meshes (about 600K faces) via spatial acceleration structures.

The geometric residual for each sampled point is defined as
$$r_i(^{\mathrm{scene}}T_{\mathrm{obj}}) = \left\| ^{\mathrm{scene}}T_{\mathrm{obj}}\, p_i - \Pi_S(^{\mathrm{scene}}T_{\mathrm{obj}}\, p_i) \right\|_2. \qquad (6)$$
We optimize the object transformation by minimizing the robustified least-squares objective
$$E_{\mathrm{dist}}(^{\mathrm{scene}}T_{\mathrm{obj}}) = \sum_{i=1}^{M} \rho\!\left( r_i(^{\mathrm{scene}}T_{\mathrm{obj}})^2 \right), \qquad (7)$$
where $\rho(\cdot)$ is the soft-$\ell_1$ loss used in SciPy's least_squares solver [50]:
$$\rho(s) = 2\left( \sqrt{1 + \frac{s}{f^2}} - 1 \right), \qquad (8)$$
with $s = r_i(^{\mathrm{scene}}T_{\mathrm{obj}})^2$ and $f = 4.5$ the f_scale parameter. This loss behaves quadratically for small residuals and approximately linearly for large residuals, providing robustness to outliers caused by occlusions, scan noise, or missing geometry. We run the solver for 20 iterations in this stage. A minimal sketch of this first stage is given below.
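Below is a minimal sketch of the stage-1 optimization described above, using trimesh for closest-point queries and SciPy's least_squares with the soft-L1 loss (f_scale = 4.5). The parameterization of the pose update as a rotation vector plus translation around the coarse initialization, and the use of max_nfev as the "20 iterations" budget, are assumptions made for illustration.

```python
import numpy as np
import trimesh
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def register_object_to_scene(obj_mesh, scene_mesh, T_init, n_points=500):
    """Stage-1 distance-based refinement of an object pose (sketch of Eqs. 6-8).

    `T_init` is the 4x4 coarse manual alignment; the function returns the
    refined scene-from-object transform.
    """
    pts = obj_mesh.sample(n_points)                       # object-surface samples
    pts_h = np.hstack([pts, np.ones((n_points, 1))])      # homogeneous coordinates

    def residuals(x):
        # Small SE(3) update (rotation vector + translation) composed with T_init.
        delta = np.eye(4)
        delta[:3, :3] = Rotation.from_rotvec(x[:3]).as_matrix()
        delta[:3, 3] = x[3:]
        T = delta @ T_init
        moved = (pts_h @ T.T)[:, :3]
        # Distance from each transformed sample to the closest scene-surface point.
        _, dist, _ = trimesh.proximity.closest_point(scene_mesh, moved)
        return dist

    sol = least_squares(residuals, np.zeros(6),
                        loss="soft_l1", f_scale=4.5, max_nfev=20)
    delta = np.eye(4)
    delta[:3, :3] = Rotation.from_rotvec(sol.x[:3]).as_matrix()
    delta[:3, 3] = sol.x[3:]
    return delta @ T_init
```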
Normal-Aware Registration. For thin and concave objects, purely distance-based alignment may produce ambiguous solutions. Points sampled from opposite sides of a thin surface may have similar distances to the scan, allowing the optimizer to incorrectly place the scene surface between two object walls while still minimizing the distance cost. To mitigate this issue, we incorporate surface-normal consistency in a second optimization stage. For each sampled object point $p_i$, we compute its surface normal $n_i^{\mathrm{obj}}$ and the normal of the scene surface at the closest point, denoted $n_i^{\mathrm{scene}}$. We define a normal-consistency weight

w_i = \begin{cases} n_i^{\mathrm{obj}} \cdot n_i^{\mathrm{scene}}, & \text{if } n_i^{\mathrm{obj}} \cdot n_i^{\mathrm{scene}} \ge 0.7, \\ 0, & \text{otherwise}. \end{cases}   (9)

The weighted residual becomes

\tilde{r}_i({}^{\mathrm{scene}}T_{\mathrm{obj}}) = w_i \, r_i({}^{\mathrm{scene}}T_{\mathrm{obj}}),   (10)

and the optimized objective is

E_{\mathrm{normal}}({}^{\mathrm{scene}}T_{\mathrm{obj}}) = \sum_{i=1}^{M} \rho\!\left( \tilde{r}_i({}^{\mathrm{scene}}T_{\mathrm{obj}})^2 \right).   (11)

This stage is initialized with the transformation obtained from the distance-only stage and optimized for 20 iterations using the same sampling strategy ($M = 500$) and soft-$\ell_1$ loss. The two-stage procedure first ensures geometric proximity and then resolves thin-surface ambiguities through normal coherence, resulting in physically plausible and stable object-to-scene alignments.

In Figure 10 we show a few samples of our MessyKitchens-Real dataset from all three difficulty-level categories: Easy, Medium, and Hard. Additionally, we show 3D visualizations of a few samples from our MessyKitchens-Real dataset in the attached supplementary video.

C MessyKitchens-Synthetic dataset creation

C.1 Preprocessing

To enable training on the MessyKitchens dataset, we construct a synthetic dataset that closely matches the real-world acquisition setup described in the main paper. We refer to this dataset as MessyKitchens-Synthetic. We begin with the 3D object assets from GSO [17], selecting 42 kitchenware objects consistent with our real scenes. Prior to scene generation, we preprocess all meshes to ensure simulation stability. Specifically, we (1) remesh the objects to obtain well-conditioned, uniformly distributed triangle meshes suitable for physics simulation, and (2) adjust the center of mass when necessary to ensure physically correct behavior during stacking and dropping. These preprocessing steps are critical for avoiding unstable or biased object poses during simulation.

C.2 Simulation parameters

All scenes are simulated in Blender using concave, mesh-based collision shapes, which are essential for accurately modeling nested and stacked interactions. To ensure physical plausibility, we carefully tune the simulation parameters. The object-object collision margin is set to 0.01 mm, while the object-plane collision margin is set to 1.0 mm. We disable velocity-based deactivation, as it can prematurely freeze objects in physically implausible poses. The restitution parameter is set to zero to eliminate bounce during collisions, preventing unrealistic motion. Additionally, we found that modeling the flat plane as an "active" object in the physics engine, while fully constraining its translation and rotation (effectively making it passive), significantly reduces interpenetration artifacts compared to treating it as a purely static collider. All generated scenes are contact-rich, stable, and visually consistent with real-world stacking behavior.
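As a rough illustration of these settings, the rigid-body parameters above could be configured through Blender's Python API (bpy) along the following lines. The object naming, the use of the kinematic ("Animated") flag to approximate the constrained active plane, and the unit conversion to meters are assumptions made for this sketch; the paper does not prescribe this exact code.

import bpy

def make_rigid_body(obj, margin_m, animated=False):
    """Attach rigid-body physics to `obj` with the margins and flags described in Sec. C.2."""
    if bpy.context.scene.rigidbody_world is None:
        bpy.ops.rigidbody.world_add()
    bpy.context.view_layer.objects.active = obj
    bpy.ops.rigidbody.object_add()
    rb = obj.rigid_body
    rb.type = 'ACTIVE'
    rb.collision_shape = 'MESH'      # concave, mesh-based collisions
    rb.use_margin = True
    rb.collision_margin = margin_m   # Blender scene units assumed to be meters
    rb.restitution = 0.0             # no bounce
    rb.use_deactivation = False      # never freeze objects prematurely
    rb.kinematic = animated          # "Animated": active but not moved by the solver

# Hypothetical naming convention: kitchenware meshes prefixed with "obj_", plus a ground plane.
kitchen_objects = [o for o in bpy.data.objects if o.name.startswith("obj_")]
plane = bpy.data.objects["Plane"]

for ob in kitchen_objects:
    make_rigid_body(ob, margin_m=1e-5)                    # 0.01 mm object-object margin
make_rigid_body(plane, margin_m=1e-3, animated=True)      # 1.0 mm margin, constrained plane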
C.3 Controlled scene generation

We generate synthetic scenes following the same difficulty levels as in the MessyKitchens-Real dataset. For easy scenes, we randomly place or drop four objects onto a flat plane. To increase realism and diversity, we enforce a balanced distribution of object orientations: approximately 50% of the base objects are placed upright, reflecting common kitchen arrangements, while the remaining objects are placed with arbitrary orientations.

For medium scenes, we first sample four base objects and randomly perturb their (x, y) positions around the scene center while placing them at appropriate heights along the z-axis. The extent of the perturbation along the x and y axes is a tunable parameter that controls how tightly packed the medium scenes are. We then select two additional objects and place them above the existing support objects so that, when gravity acts on them during simulation, we obtain stacks of objects (the number of stacks and the number of objects per stack are also controllable). Stacking feasibility is determined using a precomputed dictionary containing upright top-face bounding-box areas and object volumes; objects are only stacked if their volume and support surface area are compatible (see the illustrative sketch at the end of this section). Furthermore, we maintain a list of non-stackable objects (e.g., pitchers, kettles) that are never placed on top of other objects. For hard scenes, we extend this procedure to allow multiple stacked and nested configurations, carefully controlling stacking and compatibility to produce realistic, cluttered arrangements. Figure 11 shows how a typical "hard" scene is created.

In Figure 12 we show a few samples of our MessyKitchens-Synthetic dataset from all three difficulty-level categories: Easy, Medium, and Hard. In addition, we show 3D visualizations of a few samples from our MessyKitchens-Synthetic dataset in the attached supplementary video.

Fig. 10: A few samples from all three difficulty levels of our MessyKitchens-Real dataset. We show the RGB image and the ground-truth 3D acquired from our registration pipeline.

Fig. 11: An example of controlled scene generation for the "hard" category: 4 base objects + 2 dropped objects + 2 stacked objects, followed by a Blender physics simulation.

Fig. 12: A few samples from all three difficulty levels of our MessyKitchens-Synthetic dataset. We show the rendered RGB image and the ground-truth 3D of the scene obtained from our Blender simulation.
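To make the compatibility test of Sec. C.3 concrete, below is a minimal sketch of the stacking-feasibility check. All names, numeric values, and thresholds are illustrative assumptions; the paper only specifies that upright top-face bounding-box areas and object volumes are precomputed and compared, and that a list of non-stackable objects is maintained.

# Hypothetical precomputed per-object statistics (areas in m^2, volumes in m^3).
OBJECT_STATS = {
    "plate":  {"top_area": 0.030, "volume": 0.0004},
    "bowl":   {"top_area": 0.015, "volume": 0.0010},
    "mug":    {"top_area": 0.005, "volume": 0.0003},
    "kettle": {"top_area": 0.010, "volume": 0.0025},
}
NON_STACKABLE = {"kettle", "pitcher"}   # never placed on top of other objects

def can_stack(top_name, base_name, area_ratio=0.8, volume_ratio=1.5):
    """Return True if `top_name` may be placed on `base_name`.

    The ratios are illustrative thresholds: the top object must fit within a fraction of the
    base's upright top-face area and must not be much more voluminous than the base.
    """
    if top_name in NON_STACKABLE:
        return False
    top, base = OBJECT_STATS[top_name], OBJECT_STATS[base_name]
    fits_on_support = top["top_area"] <= area_ratio * base["top_area"]
    volume_ok = top["volume"] <= volume_ratio * base["volume"]
    return fits_on_support and volume_ok

# Example: a mug may sit on a plate, while a kettle is never stacked.
assert can_stack("mug", "plate") and not can_stack("kettle", "bowl")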