ManiTwin: Scaling Data-Generation-Ready Digital Object Dataset to 100K
Kaixuan Wang 1∗, Tianxing Chen 1,2∗, Jiawei Liu 10∗, Honghao Su 10∗, Shaolong Zhu 2∗, Minxuan Wang 10, Zixuan Li 10, Yue Chen 8, Huan-ang Gao 9, Yusen Qin 7, Jiawei Wang 3,6, Qixuan Zhang 3,5, Lan Xu 5, Jingyi Yu 5, Yao Mu 4,†, Ping Luo 1,†

1 The University of Hong Kong  2 Xspark AI  3 Deemos Tech  4 Shanghai Jiao Tong University  5 ShanghaiTech University  6 University of California, San Diego  7 D-Robotics  8 Peking University  9 Tsinghua University  10 Shenzhen University
∗ Equal Contribution  † Corresponding Authors

https://manitwin.github.io

Figure 1: ManiTwin. An automated pipeline for synthesizing data-generation-ready assets and rich annotations, and ManiTwin-100K, a large-scale dataset of 100K such assets with diverse, high-quality annotations.

Abstract

Learning in simulation provides a useful foundation for scaling robotic manipulation capabilities. However, this paradigm often suffers from a lack of data-generation-ready digital assets, in both scale and diversity. In this work, we present ManiTwin, an automated and efficient pipeline for generating data-generation-ready digital object twins. Our pipeline transforms a single image into a simulation-ready and semantically annotated 3D asset, enabling large-scale robotic manipulation data generation. Using this pipeline, we construct ManiTwin-100K, a dataset containing 100K high-quality annotated 3D assets. Each asset is equipped with physical properties, language descriptions, functional annotations, and verified manipulation proposals. Experiments demonstrate that ManiTwin provides an efficient asset synthesis and annotation workflow, and that ManiTwin-100K offers high-quality and diverse assets for manipulation data generation, random scene synthesis, and VQA data generation, establishing a strong foundation for scalable simulation data synthesis and policy learning.

Preprint.

1 Introduction

Robotic manipulation learning in simulation critically depends on large-scale, high-quality object assets that not only capture geometric diversity but also encode how objects can be physically interacted with. While 3D asset repositories have grown rapidly, most existing datasets target geometric or visual understanding rather than manipulation-centric robotics. As a result, robot learning researchers often face a fundamental mismatch between the assets used for training and the physical requirements of real-world manipulation.

Existing datasets expose a persistent gap between scale, semantic richness, and physical usability. Large-scale geometry datasets provide millions of meshes but lack physical parameters and interaction semantics, requiring extensive manual curation for robotics applications. Conversely, robotics-oriented datasets introduce articulation models or functional labels but remain limited in scale and do not systematically verify physical validity. What is missing is a large-scale dataset that simultaneously provides manipulation-centric objects, rich functional and grasp annotations aligned with language descriptions, and assets verified to be simulation-ready and collision-free.

To address this gap, we introduce ManiTwin, an automated pipeline for generating data-generation-ready digital object twins at scale.
Starting from a single input image, ManiTwin synthesizes simulation-ready 3D assets, then employs vision-language models (VLMs) to annotate physical properties, functional points, and language descriptions. Candidate interaction points are sampled via farthest point sampling and filtered through VLM reasoning. A learning-based grasp generator produces diverse grasp proposals, which are screened by proximity to selected interaction points and validated through physics simulation. The resulting assets are high-fidelity, collision-ready, and equipped with rich manipulation annotations.

Our main contributions are:

• We propose an automated pipeline that transforms a single input image into high-fidelity digital object twins with rich language, manipulation, and functional annotations, while ensuring that the generated assets are simulation-ready and collision-ready.
• Building upon this pipeline, we construct ManiTwin-100K, a large-scale dataset containing 100K semantically annotated digital assets, which supports a wide range of applications including simulation-based manipulation data synthesis, scene layout generation, and VQA data synthesis.
• We conduct extensive experiments to validate the quality and diversity of ManiTwin, demonstrating that ManiTwin provides a strong foundation for large-scale robotic manipulation data generation.

2 Related Works

2.1 Digital Object Twin Asset Dataset

Large-scale geometry datasets such as ShapeNet [3], ModelNet [11], Objaverse [10], and Objaverse-XL [9] provide extensive mesh collections with broad category coverage and language captions. However, these datasets are geometry-centric, containing many decorative or static objects unsuitable for manipulation. They lack physical parameters, articulation structures, and collision-validated assets, limiting their direct use in physics-based simulation.

To support interactive tasks, datasets such as PartNet-Mobility [25], GAPartNet [13], and PhysXNet [2] introduce articulated structures, part-level semantics, and affordance labels. However, these datasets remain limited in scale, often require post-processing for simulation deployment, and generally lack language descriptions. PASG [31] proposes automatic annotation, but without physics verification or a large-scale curated dataset. High-quality collections such as YCB [1] emphasize physical fidelity but cover only tens of objects. RoboTwin-OD [5] provides simulation-ready assets with manipulation annotations but lacks functional annotations and remains limited in scale. ManiTwin-100K bridges this gap by providing 100K manipulation-centric digital twins with functional and grasp annotations, language descriptions, and simulation-verified physical validity. We compare several representative datasets with ManiTwin-100K in Table 1.

Table 1: Comparison of ManiTwin-100K with other asset datasets. We compare typical asset datasets in terms of geometry, simulation and collision readiness, manipulation and semantic annotations, and dataset scale. Sim-ready indicates whether an object can be directly loaded into mainstream robotic simulators, grasping annotation refers to pre-contact grasp poses or affordances, function annotation refers to part-wise function information, and language annotation enables semantic understanding of the objects.
Asset Dataset          | Rigid | Sim-Ready | Grasping Annotation | Function Annotation | Language Annotation | #Objs
Objaverse [10]         | ✓ | ✗ | ✗ | ✗ | ✓ | 818K
Objaverse-XL [9]       | ✓ | ✗ | ✗ | ✗ | ✓ | 10M+
Objaverse++ [15]       | ✓ | ✗ | ✗ | ✗ | ✗ | 500K
PhysXNet [2]           | ✓ | ✓ | ✗ | ✓ | ✓ | 26K
PhysXNet-XL [2]        | ✓ | ✓ | ✗ | ✓ | ✓ | 6M
PartNet [25]           | ✓ | ✓ | ✗ | ✗ | ✓ | 26K
PartNet-Mobility [25]  | ✓ | ✓ | ✗ | ✗ | ✓ | 2K
YCB Object Set [1]     | ✓ | ✗ | ✓ | ✓ | ✓ | 77
ModelNet [11]          | ✓ | ✗ | ✗ | ✗ | ✗ | 127K
ShapeNet [3]           | ✓ | ✗ | ✗ | ✗ | ✗ | 51K
RoboTwin-OD [5]        | ✓ | ✓ | ✓ | ✗ | ✗ | 731
GAPartNet [13]         | ✓ | ✓ | ✗ | ✓ | ✗ | 4K
OmniObject3D [24]      | ✓ | ✗ | ✗ | ✗ | ✓ | 6K
ManiTwin-100K (ours)   | ✓ | ✓ | ✓ | ✓ | ✓ | 100K

2.2 Data Generation in Simulation

Simulation-based data synthesis has become a widely adopted approach for scaling robotic learning across diverse tasks. A representative line of work focuses on trajectory and interaction data synthesis based on annotated object assets. The RoboTwin 1.0 series [16, 17, 7] explores synthesizing manipulation trajectories by combining asset-level manipulation annotations with expert code, while RoboTwin 2.0 [5] further introduces a larger annotated asset collection, RoboTwin-OD, and supports cluttered scene layouts to synthesize large-scale data spanning 50 manipulation tasks. Related efforts, including RoboGen [21], InternData-A1 [20], UniVTAC [4], RMBench [8] and HumanoidGen [14], follow a similar paradigm of generating manipulation data from annotated assets, with InternData-A1 reporting over 7K hours of synthesized manipulation data. Beyond action trajectories, several works target language-conditioned data synthesis. RoboRefer [29], RoboTracer [30], and Vlaser [27] synthesize large-scale VQA data in simulation to support vision-language-action learning. Despite their success, these approaches fundamentally rely on simulation-ready assets with manipulation annotations, which constrains scalability and diversity. ManiTwin addresses this limitation by providing an automated pipeline for asset generation, annotation, and verification, enabling large-scale synthesis of physically valid assets for simulation data generation.

3 Method

We present ManiTwin, an automated pipeline for generating data-generation-ready digital object twins at scale. As illustrated in Fig. 2, our approach includes three stages: (I) Asset Generation, which transforms 2D images into simulation-ready 3D meshes; (II) Asset Annotation, which enriches assets with functional and manipulation semantics; and (III) Verification, which ensures physical validity through simulation and human review.

3.1 Asset Generation

The asset generation stage converts input images into physically-grounded 3D meshes suitable for robotic simulation.
Figure 2: ManiTwin Pipeline Overview. Our pipeline consists of three stages: (I) Asset Generation transforms input images into simulation-ready 3D meshes with VLM-estimated physical properties; (II) Asset Annotation combines FPS-based candidate sampling, VLM-driven functional and grasp point selection, and learning-based grasp proposal generation; (III) Verification validates annotations through physics simulation and human review, producing fully annotated digital twins ready for robotic manipulation research.

3D Generation. Given one or more input images depicting an object, or a piece of text, we employ a state-of-the-art 3D generative model [28] to synthesize high-fidelity meshes. Input images undergo preprocessing, including background removal and resolution normalization, to isolate target objects. The generated meshes are converted to various formats with sim-ready APIs for compatibility with downstream simulation platforms.

Quality Verification. Not all generated assets are suitable for manipulation tasks, particularly because some input images and text prompts are themselves synthetically generated. We employ a VLM-based quality gate that evaluates multi-view renderings against two criteria: (i) object singularity, meaning exactly one coherent object should be present; and (ii) visual quality, meaning the mesh should be free of severe artifacts such as fragmentation, texture corruption, or geometric implausibility. Assets failing these checks are filtered from the pipeline, removing approximately 10–15% of generated content.

Physical Property Estimation. For assets passing quality verification, we estimate the physical properties essential for realistic simulation. A VLM analyzes eight uniformly-distributed renderings to infer oriented bounding box (OBB) dimensions, estimated mass based on apparent material and volume, and surface friction coefficients derived from visual material cues. These estimates enable scale normalization to real-world dimensions via uniform scaling.

Semantic Captioning. The VLM additionally generates rich semantic descriptions covering object category, color, material, size, shape, and functional purpose. These language annotations support downstream applications such as language-conditioned manipulation and VQA data synthesis.

3.2 Asset Annotation

The annotation stage enriches each asset with manipulation-relevant semantics through a combination of geometric sampling, VLM-based reasoning, and learning-based grasp synthesis.

Candidate Point Sampling. To identify potential interaction regions, we sample a dense point cloud from the mesh surface and apply farthest point sampling (FPS) to select spatially-distributed candidates. FPS iteratively selects the point maximizing the minimum distance to the current set:

p_{k+1} = \arg\max_{p \in \mathcal{P} \setminus \mathcal{S}} \min_{p_i \in \mathcal{S}} \| p - p_i \|_2,   (1)

ensuring broad coverage across the object surface. These candidates are visualized as numbered markers on multi-view renderings for subsequent VLM annotation.
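To make this sampling step concrete, below is a minimal NumPy sketch of the FPS selection in Eq. (1); the surface points are random placeholders, while the 20,000-point and 42-candidate counts follow the implementation details in Sec. 7.2.3. This is an illustrative sketch under these assumptions, not the pipeline's exact implementation.

```python
import numpy as np

def farthest_point_sampling(points: np.ndarray, k: int) -> np.ndarray:
    """Greedily select k indices from an (N, 3) point cloud, each time picking
    the point that maximizes the minimum distance to the selected set (Eq. 1)."""
    n = points.shape[0]
    selected = np.zeros(k, dtype=np.int64)
    min_dist = np.full(n, np.inf)  # distance from each point to the selected set
    selected[0] = 0  # arbitrary seed; a random seed works equally well
    for i in range(1, k):
        last = points[selected[i - 1]]
        min_dist = np.minimum(min_dist, np.linalg.norm(points - last, axis=1))
        selected[i] = int(np.argmax(min_dist))
    return selected

# Example: reduce 20,000 surface samples to 42 spatially spread candidates,
# matching the counts reported in Sec. 7.2.3 (random points used as a stand-in).
surface_points = np.random.rand(20000, 3)
candidate_indices = farthest_point_sampling(surface_points, k=42)
candidates = surface_points[candidate_indices]
```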
Figure 3: Annotation Visualization. ManiTwin provides functional points (e.g., spout, handle), grasp points with type labels, and simulation-verified 6-DoF grasp poses for each object. The example shows a hammer asset (ID 007, sourced from Google Images) stored in usdz/glb/obj formats with captions, a bounding box, a collision radius, and grasp groups containing TCP positions, orientations, and confidence scores.

VLM-Based Point Selection. We query a VLM with the labeled multi-view images to identify two types of semantically meaningful points. Functional Points correspond to regions with specific object functions: spouts for pouring, handles for gripping, blades for cutting, buttons for control. For each selected point, the VLM provides a function description, confidence score, and brief rationale. Grasp Points identify locations suitable for stable robotic grasping, considering proximity to the center of gravity, surface geometry, and safety. Each grasp point includes a grasp type (parallel-jaw, pinch, power, three-finger, or enveloping) and an intended use scenario, which allows different grasp poses to be selected for different tasks.

Grasp Proposal Generation. Building on the VLM-selected points, we generate dense grasp proposals using GraspGen [18], a learning-based method that predicts stable grasp configurations from point cloud observations. Each proposal comprises a 6-DoF pose (position and quaternion orientation) with an associated confidence score.

Spatial Filtering and Semantic Association. Raw grasp proposals are filtered by spatial proximity to the VLM-selected points, retaining grasps aligned with identified affordances. Each filtered grasp inherits semantic labels from its nearest functional and/or grasp point, enabling task-oriented grasp selection (e.g., "grasp the handle for pouring").

3.3 Verification

The verification stage ensures that all annotations are physically valid and ready for deployment in simulation environments.

Simulation Verification. Each grasp proposal undergoes physics-based validation using the SAPIEN simulator with PhysX. We execute a standardized grasp sequence: position the gripper, close the fingers until contact, and verify stability. Successful grasps must maintain stable contact for multiple consecutive frames. We further test robustness through a slide resistance protocol, moving the grasped object along orthogonal directions and discarding grasps where object displacement exceeds a threshold. Only grasps passing both the stability and slide tests are retained.

Human Verification. While automated verification ensures physical validity, human annotators review a sampled subset to catch edge cases. Annotators assess mesh quality, the physical plausibility of estimated properties, the semantic correctness of annotations, and simulation results. Feedback is aggregated to iteratively refine VLM prompts and filtering thresholds.
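To make the stability and slide-resistance criteria above concrete, the following simulator-agnostic sketch implements the pass/fail logic. The actual pipeline runs these checks inside SAPIEN with the PhysX backend (see Sec. 7.2.5); the `step` and `object_position` callbacks below are placeholders for the simulator API, and the slide threshold value is an assumption (the paper only states that displacement must stay below a threshold).

```python
import numpy as np
from typing import Callable

def grasp_is_stable(
    step: Callable[[], None],                    # advances the physics simulation one step
    object_position: Callable[[], np.ndarray],   # returns the object's world position (3,)
    max_steps: int = 2000,                       # step budget, per Sec. 7.2.5
    required_stable_frames: int = 3,             # consecutive stable frames, per Sec. 7.2.5
    displacement_threshold: float = 0.01,        # meters, per Sec. 7.2.5
) -> bool:
    """Pass if the grasped object moves less than the threshold for the
    required number of consecutive frames within the step budget."""
    prev = object_position()
    stable = 0
    for _ in range(max_steps):
        step()
        cur = object_position()
        stable = stable + 1 if np.linalg.norm(cur - prev) < displacement_threshold else 0
        if stable >= required_stable_frames:
            return True
        prev = cur
    return False

def passes_slide_test(
    slide_and_measure_slip: Callable[[np.ndarray], float],  # moves the gripper along a direction, returns object slip (m)
    slip_threshold: float = 0.01,                            # assumed value for illustration
) -> bool:
    """Move the grasped object along orthogonal directions and reject grasps
    whose relative slip exceeds the threshold in any direction."""
    return all(slide_and_measure_slip(d) <= slip_threshold for d in np.eye(3))
```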
Consolidated Output. The final output for each object comprises: (i) a simulation-ready 3D mesh with PBR materials; (ii) physical properties (dimensions, mass, friction); (iii) language annotations; (iv) functional and grasp point annotations with semantic labels; (v) simulation-verified 6-DoF grasp poses; and (vi) placement annotations for scene generation. Representative annotations are visualized in Fig. 3.

3.4 ManiTwin-100K Dataset

Using the pipeline above, we construct ManiTwin-100K, a large-scale dataset of 100K data-generation-ready digital twins. Unlike existing 3D datasets that prioritize geometric diversity or visual fidelity alone, ManiTwin-100K is specifically designed to address the fundamental requirements of robotic manipulation research: simulation-ready assets with rich manipulation semantics and verified physical validity. ManiTwin-100K bridges this gap by providing both scale and manipulation-centric annotations within a unified dataset, enabling large-scale policy learning across diverse object categories and interaction types.

Data Collection. Input images are curated from diverse sources: e-commerce product catalogs capturing real-world object appearances, and text-to-image generations for underrepresented categories.

Statistics and Distribution of the ManiTwin-100K Dataset. ManiTwin-100K spans manipulation-relevant categories including kitchen items (cups, mugs, utensils, containers, bottles), tools (hammers, screwdrivers, pliers, wrenches), electronics (phones, remotes, controllers), personal items (brushes, cosmetics, accessories), office supplies (staplers, tape dispensers, pens), and household objects (cleaning supplies, toys, food items). Each object in ManiTwin-100K includes: 2–4 functional points with semantic labels, 2–3 grasp points with grasp type annotations, 10–50 simulation-verified 6-DoF grasp poses, physical properties (OBB dimensions, mass, friction), and rich language descriptions (category, color, material, shape, function), as shown in Fig. 3.

4 Applications

The rich semantic annotations in ManiTwin enable a variety of downstream applications in robotics and 3D vision. In this section, we highlight four key application scenarios: manipulation data generation, scene layout generation, robotics VQA data synthesis, and 3D understanding tasks.

4.1 Manipulation Data Generation

One of the primary applications of ManiTwin is the automated generation of large-scale manipulation training data. The combination of simulation-ready meshes, verified grasp poses, and functional point annotations enables fully automated data collection pipelines for learning generalizable manipulation skills.

Pick and Place Skills Data Generation. Each object in ManiTwin comes with simulation-verified 6-DoF grasp poses and placement vectors that are guaranteed to be collision-free and physically stable. By placing objects in simulation environments and executing these grasp poses, we can automatically generate successful grasp demonstrations at scale. This eliminates the need for costly human teleoperation or manual grasp labeling, enabling the collection of millions of grasp trajectories across diverse objects.

Functional Manipulation. Beyond simple pick-and-place, ManiTwin's functional point annotations enable task-oriented manipulation data generation. For instance, we can generate trajectories for grasping a mug by its handle for pouring, picking up a knife by its handle for cutting, or holding a spray bottle by its trigger for spraying. These functionally-grounded demonstrations provide richer supervision signals for learning manipulation policies that understand object affordances.
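As an illustration of how these annotations can drive functional data generation, the sketch below selects simulation-verified grasps associated with a requested functional label. The JSON field names (`functional_points`, `verified_grasps`, `tcp_position`, `conf`) mirror the annotation excerpts in Figs. 2 and 3 but are assumptions about the released schema, and the 5 cm association radius is illustrative.

```python
import json
import numpy as np

def grasps_for_function(annotation_path: str, function_keyword: str, radius: float = 0.05):
    """Return verified 6-DoF grasps whose TCP lies near a functional point
    matching the requested keyword (e.g., 'handle' or 'spout')."""
    with open(annotation_path) as f:
        ann = json.load(f)

    # Functional points carry a short function string such as "water spout".
    targets = [np.asarray(p["position"]) for p in ann["functional_points"]
               if function_keyword in p["function"]]
    if not targets:
        return []

    selected = []
    for g in ann["verified_grasps"]:
        tcp = np.asarray(g["tcp_position"])
        if min(np.linalg.norm(tcp - t) for t in targets) < radius:
            selected.append(g)
    # Highest-confidence grasps first.
    return sorted(selected, key=lambda g: g["conf"], reverse=True)

# Hypothetical usage: grasp a kettle by its handle before executing a pouring task.
# grasps = grasps_for_function("kettle_007/annotation.json", "handle")
```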
Automated Task Generation. Beyond data collection for predefined tasks, ManiTwin's rich semantic annotations enable automated task generation at scale. Given a generated scene layout, we can programmatically compose diverse manipulation tasks by leveraging object-level annotations: functional points define what actions are possible (pour, cut, press, open), grasp points specify how to interact, and language descriptions provide natural task specifications. For example, given a scene with a kettle and a mug, the system can automatically generate tasks like "pour water from the kettle into the mug" by identifying the kettle's spout (functional point) and the mug's opening (placement target). This capability significantly reduces the human effort required to scale task diversity. Prior works such as RoboTwin 2.0 [5] and RoboCasa365 [19] rely on manual task specification, limiting scalability. In contrast, ManiTwin's annotation-driven approach can generate hundreds of task variants per scene combination, covering pick-and-place, tool use, pouring, insertion, and other manipulation primitives. Diverse tasks translate to diverse skills, and training data spanning this task distribution can benefit general-purpose manipulation policy learning.

Cross-Embodiment Data. While grasp poses in ManiTwin are initially generated and verified for the Franka Panda gripper, the underlying grasp points and functional annotations transfer across different end-effectors. As shown in Fig. 4a, we leverage ManiTwin to generate manipulation data for multiple robotic platforms, including parallel-jaw grippers, dexterous hands, and custom end-effectors. This cross-embodiment data generation capability supports pretraining of generalizable manipulation policies that can transfer across different robot hardware.

Figure 4: ManiTwin Data Generation. (Left) Cross-embodiment manipulation trajectories generated for multiple end-effectors (Piper, AgileX Aloha, Xspark X-One, UMI, Franka, ARX X5) using shared object annotations. (Right) Large-scale grasping data generation.

Figure 5: Layout Generation. Using placement and collision radius annotations, we generate diverse multi-object scene layouts that are collision-free and physically plausible.

4.2 Layout Generation

With per-object placement annotations, including a placement position and orientation, we can deterministically place a single object on a supporting surface in simulation. To avoid overlaps when randomly generating multi-object scenes, we additionally provide a collision_radius annotation, which defines the projected collision radius of each object on the placement plane. This allows us to sample object placements while preventing inter-object overlaps and collisions. Using these annotations, we can generate diverse random layouts, for example with the simple rejection-sampling scheme sketched below.
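The following is a minimal rejection-sampling sketch of this layout procedure, using each object's collision_radius to keep the projected footprints disjoint; the table extents, retry budget, and example radii are illustrative rather than values from the released dataset.

```python
import math
import random

def sample_layout(objects, table_half_extent=(0.4, 0.3), max_tries=200):
    """Place objects on a table plane so that no two projected collision
    circles overlap. Each object dict carries its 'collision_radius' (meters)."""
    placed = []  # (x, y, radius, object)
    for obj in objects:
        r = obj["collision_radius"]
        for _ in range(max_tries):
            x = random.uniform(-table_half_extent[0] + r, table_half_extent[0] - r)
            y = random.uniform(-table_half_extent[1] + r, table_half_extent[1] - r)
            if all(math.hypot(x - px, y - py) > r + pr for px, py, pr, _ in placed):
                placed.append((x, y, r, obj))
                break
        # If no collision-free pose is found within the budget, the object is skipped.
    return placed

# Illustrative objects; the hammer's 0.19 m radius matches the example in Fig. 3.
layout = sample_layout([
    {"name": "hammer", "collision_radius": 0.19},
    {"name": "kettle", "collision_radius": 0.10},
    {"name": "mug", "collision_radius": 0.05},
])
```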
Fig. 5 visualizes randomly generated layouts of multiple objects on different table configurations. These layouts support manipulation policy data generation and evaluation with random tabletop distractors, and also serve as a basis for VQA data synthesis.

4.3 VQA Data Generation

The rich semantic annotations in ManiTwin provide a foundation for generating Visual Question Answering (VQA) data specifically tailored for robotics applications. Unlike general-purpose VQA datasets, robotics-focused VQA requires understanding of manipulation affordances, physical properties, spatial reasoning, and action-relevant object attributes. We curate a VQA dataset across diverse tabletop environments with varying surface textures (metallic, wood, marble) and object densities to simulate realistic manipulation scenarios. Fig. 6 illustrates representative examples spanning five question categories.

Figure 6: Robotics VQA Examples. Each pair links manipulation-relevant questions to grounded scene understanding, covering language grounding, functional planning, scene understanding, task planning, and object detection. Representative visual-scene/VQA training pairs include:
Q [Language Grounding]: Locate the object with a grid of colored squares and describe its best grasp type. A: The Rubik's cube at the bottom right. Its flat, orthogonal faces are ideal for a stable parallel-jaw grasp.
Q [Functional Planning]: Where is the interaction point of the emergency safety device, and is it occluded by the riser? A: The red circular button on the top-left housing. It is not occluded by the central black riser and is accessible via a top-down approach.
Q [Scene Understanding]: Which tool requires a precision "pinch" grasp, and is there enough clearance to its right? A: The screwdriver (yellow/black handle). There is insufficient clearance to the right due to the proximity of the white cloth roll.
Q [Task Planning]: If the robot slides the clock 10 cm to the right then 5 cm to the front, identify the risk of collision. A: There is a high collision risk with the white ceramic bowl, which is located directly in the clock's linear path.
Q [Detection]: Enumerate and identify all containers present on the brushed brass surface. A: There are two containers: one yellow vitamin jar (left) and one red/green seedling pot (right).

The resulting VQA dataset can be used to train or fine-tune robotics-specific vision-language models (VLMs) that understand manipulation-relevant visual concepts and can provide actionable guidance for robotic task execution.

4.4 3D Understanding Tasks

As a large-scale 3D object dataset with dense semantic annotations, ManiTwin supports various 3D computer vision tasks beyond robotics applications.

3D Part Segmentation. Combined with our layout generation capability, ManiTwin enables the creation of large-scale training data for 3D object segmentation in cluttered scenes. By rendering multi-object layouts with known object identities and positions, we can automatically generate ground-truth instance segmentation masks for both 2D images and 3D point clouds. This supports training and evaluation of models for instance segmentation, semantic segmentation, and panoptic segmentation in realistic tabletop manipulation scenarios with varying levels of clutter and occlusion.

3D Object Understanding. ManiTwin provides multi-modal annotations including point clouds, meshes, multi-view images, and language descriptions for each object.
This rich annotation enables research on 3D object classification, shape retrieval, cross-modal learning (e.g., text-to-3D and image-to-3D retrieval), and 3D captioning tasks.

Affordance Prediction. The grasp points and functional points in ManiTwin can serve as ground truth for learning affordance prediction models. Given a 3D object representation (point cloud, mesh, or multi-view images), models can be trained to predict where and how the object can be grasped or manipulated, which is valuable for both robotics and human-object interaction understanding.

4.5 Empowering Real-World Manipulation

Recent works such as G3Flow [6], D(R,O) Grasp [22] and CordViP [12] suggest that high-quality 3D object assets, when combined with pose estimation tools like FoundationPose [23], can bridge the gap between simulation and real-world manipulation. ManiTwin contributes to this direction by providing manipulation-ready 3D object twins with accurate geometry, physical properties, and semantic annotations. In practice, a robot can reconstruct or retrieve an object twin from a single image, estimate its 6D pose in the scene, and then leverage the associated semantic information, grasp proposals, and functional points to plan and execute actions. Moreover, our scalable asset generation makes it feasible to cover long-tail object instances and diverse appearances, which is important for robust manipulation in open-world settings.

5 Experiments

We conduct comprehensive experiments to evaluate the quality of ManiTwin assets and demonstrate their utility for downstream applications. We assess: (1) 3D generation quality through latent-based metrics; (2) annotation quality through automated verification and human evaluation; and (3) grasp quality and data generation statistics.

5.1 3D Generation Quality

We evaluate the quality of generated 3D assets using latent-based metrics that measure semantic alignment between generated 3D assets and their input conditions (images or text).

Evaluation Metrics. We adopt latent-based metrics to assess both geometric and appearance quality. For geometry evaluation, CLIP(N-I) and CLIP(N-T) compute the CLIP similarity between rendered normal maps and the input image or text description, respectively, measuring how well the generated 3D geometry aligns with the input conditions. For appearance evaluation, CLIP(I-I) and CLIP(I-T) compute the CLIP similarity between rendered RGB images and the input image or text, assessing visual fidelity and semantic consistency. Additionally, ULIP-I and ULIP-T measure 3D-to-image and 3D-to-text alignment using the ULIP [26] latent space, providing a holistic assessment of how well the generated 3D asset matches the input semantically. For each metric, we compute the average score across all 30 rendered views.

Results. Table 2 reports the 3D generation quality metrics for both image-to-3D and text-to-3D generation, adopting the results from CLAY [28]. The results demonstrate that our pipeline produces semantically aligned 3D assets that faithfully preserve both the geometric structure and visual appearance of the input conditions. Image-to-3D generation achieves substantially higher scores across all metrics, reflecting the richer conditioning signal provided by input images compared to text descriptions.

Table 2: 3D Generation Quality Evaluation. Latent metrics for text-to-3D and image-to-3D. Geometry uses normal renders; appearance uses RGB renders.
Metric      | Text-to-3D | Image-to-3D
ULIP        | 0.1705     | 0.2140
CLIP(N-I/T) | 0.1948     | 0.6848
CLIP(I-I/T) | 0.2324     | 0.7769
Time (s)    | ~45        | ~45
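For reference, here is a sketch of how the CLIP(I-I) and CLIP(I-T) appearance scores can be computed with the Hugging Face CLIP implementation, averaging over the 30 rendered views. The model checkpoint and file paths are assumptions, the rendering itself is omitted, and the normal-map and ULIP variants would follow the same pattern with the corresponding renders and encoders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_scores(rendered_views, reference_image, caption):
    """Average cosine similarity between rendered views and the input
    image (CLIP I-I) or the input text (CLIP I-T)."""
    with torch.no_grad():
        views = processor(images=rendered_views, return_tensors="pt")
        view_emb = model.get_image_features(**views)
        ref = processor(images=[reference_image], return_tensors="pt")
        ref_emb = model.get_image_features(**ref)
        txt = processor(text=[caption], return_tensors="pt", padding=True)
        txt_emb = model.get_text_features(**txt)

    def cos(a, b):
        a = a / a.norm(dim=-1, keepdim=True)
        b = b / b.norm(dim=-1, keepdim=True)
        return (a @ b.T).mean().item()

    return {"CLIP(I-I)": cos(view_emb, ref_emb), "CLIP(I-T)": cos(view_emb, txt_emb)}

# Hypothetical usage: 30 rendered views of one generated asset.
# views = [Image.open(f"renders/view_{i:02d}.png") for i in range(30)]
# scores = clip_scores(views, Image.open("input.png"), "dark green electric kettle")
```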
5.2 Annotation Quality

We evaluate the quality of ManiTwin annotations through both automated verification and human evaluation.

Automated Verification. Our multi-stage pipeline incorporates automated quality checks at each stage. Table 4a summarizes the key statistics. The 3D generation stage achieves a 69.67% success rate, filtering out low-quality or failed generations. For grasp annotation, we generate an average of 81.63 grasp candidates per object through GraspGen and VLM-guided filtering. After physics-based simulation verification, 62.14 grasps per object are retained on average, corresponding to a 76.13% verification success rate. This indicates that the majority of proposed grasps are physically valid and stable.

Table 4: Automated and human evaluation of annotation quality.
(a) Automated Verification. Pass rates of 3D generation and grasp verification in simulation, and the average number of grasp proposals per object before/after verification.
Metric                        | Value
3D-gen succ. rate             | 69.67%
Grasp verification succ. rate | 76.13%
Avg. grasp candidates / obj.  | 81.63
Avg. verified grasps / obj.   | 62.14
(b) Human Evaluation of Annotation Quality. Percent of annotations judged correct on 500 sampled objects.
Annotation Type         | Acc. (%)
Category Classification | 100.0
Language Descriptions   | 99.6
Functional Point Labels | 92.2
Phys. Prop. Estimation  | 92.2
Grasp Point Selection   | 84.8

Human Evaluation. We conduct a human study to evaluate the quality of VLM-generated annotations. Annotators assess a random sample of 500 objects across five dimensions to evaluate whether the 3D assets and annotations meet the requirements for manipulation data generation. Table 4b reports the results. Category classification achieves perfect accuracy, while language descriptions reach 99.6% accuracy. Functional point labels and physical property estimation both achieve 92.2% accuracy. Grasp point selection shows slightly lower accuracy at 84.8%, reflecting the inherent difficulty of identifying optimal grasp locations purely from visual reasoning. Overall, the high accuracy across all annotation types validates the effectiveness of our VLM-based annotation approach.

5.3 Analysis of ManiTwin-100K Diversity

We analyze the semantic and object diversity of the ManiTwin-100K dataset. The dataset covers 512 object categories. The semantic diversity is illustrated in Fig. 8, with the category distribution and size statistics shown in Fig. 7. Object dimensions span from 2 cm (small tools, cosmetics) to 37 cm (large containers, tools), covering the typical range encountered in household and industrial manipulation scenarios. The results demonstrate highly diverse object categories and semantic labels, which benefit diverse task and manipulation data generation.

5.4 Data Generation Statistics

To demonstrate the scalability of ManiTwin for downstream applications, we report statistics on large-scale grasping data generation. Using the simulation-verified grasp poses and placement annotations in ManiTwin, we automatically generate cross-embodiment manipulation trajectories across the full dataset.
Table 3 summarizes the data generation statistics. The dataset contains 100K objects with over 5 million simulation-verified grasp poses. From these, we generate 10 million grasp trajectories with an average length of 7.6 seconds. This scale of manipulation data, automatically generated without human teleoperation, demonstrates the practical utility of ManiTwin for large-scale robotic learning. The combination of diverse objects, verified grasps, and functional annotations enables training manipulation policies that generalize across object categories and task types.

Table 3: Data Generation Statistics. Summary of manipulation data generated with ManiTwin assets and annotations.
Statistic                | Value
Total objects            | 100K
Total verified grasps    | 5M
Total grasp trajectories | 10M
Avg. trajectory length   | 7.6 s

Figure 7: Dataset Distribution. (Left) Word cloud showing object category diversity across ManiTwin-100K. (Right) Distribution of object dimensions, spanning typical manipulation scales from 5–50 cm.

Figure 8: Semantic Diversity. Word clouds for object attributes: color, function, size, category, material, and shape.

6 Conclusion

We present ManiTwin, an automated pipeline for generating data-generation-ready digital object twins, and ManiTwin-100K, a large-scale dataset of 100K objects constructed using this pipeline. ManiTwin transforms single input images into simulation-ready 3D assets with physical properties, functional point annotations, grasp configurations, and language descriptions, all validated through physics-based simulation.

The key contribution lies in unifying scale, semantic richness, and physical usability. Unlike geometry-focused datasets requiring extensive manual curation, ManiTwin-100K assets are directly deployable in physics simulators with collision-ready meshes and verified manipulation annotations. We demonstrated utility across multiple applications: cross-embodiment manipulation data generation, scene layout synthesis, robotics VQA curation, and 3D understanding tasks. Experiments validate annotation quality, with over 90% human-evaluated accuracy on most annotation types.

By providing manipulation-centric assets at unprecedented scale, ManiTwin-100K establishes a foundation for training generalizable manipulation policies in simulation. The automated nature of the ManiTwin pipeline also enables continuous expansion of the dataset as 3D generation technology improves, supporting the growing demands of robotic learning systems.

Limitations. ManiTwin-100K currently covers rigid graspable objects but excludes articulated objects (drawers, doors) and deformable objects (cloth, rope). Physical property estimates are VLM-inferred rather than ground-truth calibrated. Extending the pipeline to articulated structures and incorporating real-world calibration are directions for future work.

Acknowledgements

We acknowledge Weiyang Jin for his valuable discussions during this project.

7 Appendix

7.1 Dataset Examples

Figure 9 presents representative examples from ManiTwin-100K, illustrating the full pipeline from input image to annotated digital twin. Each row shows a single object with four visualization stages:

Input Image. The leftmost column shows the input image used for 3D generation. These images are sourced from diverse origins including e-commerce product photos, rendered views from existing 3D repositories, and text-to-image generations.

Generated 3D Asset. The second column displays the 3D asset produced by our generation pipeline, rendered from a canonical viewpoint. The generated meshes faithfully preserve the geometric structure and visual appearance of the input images, including fine details such as handles, spouts, buttons, and surface textures.
Mesh Visualization. The third column shows the underlying mesh geometry, revealing the mesh topology and density. Our pipeline produces clean, watertight meshes suitable for physics simulation, with appropriate polygon density to capture geometric details while remaining computationally efficient for collision detection.

Sampled Grasps. The rightmost column visualizes a subset of simulation-verified grasp poses. Grasp poses are shown as gripper visualizations positioned at the predicted 6-DoF configurations, representing diverse approach directions and grasp locations validated through physics simulation.

Figure 9: ManiTwin-100K Dataset Examples. Each row shows one object. From left to right: input image, generated 3D asset, mesh visualization, and samples of simulation-verified grasp poses. Objects shown include a toy car, mouse, fork, cup, brush, box, glasses, laptop, door holder, can, punch, carafe, banana, scissors, keyboard, cap, book, badminton racket, and camera.

7.2 Pipeline Implementation Details

7.2.1 3D Generation

We use CLAY [28] as the primary image-to-3D generation model. For each input image, we generate a 3D mesh in USDZ format, which is then converted to GLB for downstream processing. Generation takes approximately 45 seconds per object. Failed generations (non-watertight meshes, degenerate geometry) are automatically filtered, resulting in a 69.67% success rate.

7.2.2 Physical Property Estimation

The VLM analyzes 8 multi-view renders of each object to estimate:
• OBB Dimensions: length, width, and height in meters
• Mass: estimated in kilograms based on apparent material and size
• Friction Coefficient: based on surface material (e.g., 0.3 for plastic, 0.5 for rubber)
Objects are rescaled to match the VLM-estimated real-world dimensions using the longest OBB axis as reference.

7.2.3 Point Sampling and Selection

We sample 20,000 points uniformly from the mesh surface, then apply farthest point sampling (FPS) to select 42 candidate points that maximize spatial coverage. The VLM evaluates each candidate against multi-view renders to identify functional regions and suitable grasp locations.

7.2.4 Grasp Generation and Filtering

GraspGen produces up to 4,000 grasp candidates per object using the Franka Panda gripper model. We filter grasps by: (1) proximity to VLM-selected grasp points within a 3 cm threshold; (2) 7-DoF FPS for diversity, retaining 100 representative grasps; and (3) simulation verification in SAPIEN [25].

7.2.5 Simulation Verification

Each candidate grasp undergoes physics simulation with SAPIEN (PhysX 5.0 backend) for a maximum of 2,000 simulation steps, requiring 3 consecutive stable frames with less than 0.01 m displacement. A grasp passes if the object remains stably grasped without collision penetration throughout the lift trajectory.
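As a condensed sketch of the two-stage candidate filtering in Sec. 7.2.4, the snippet below applies a proximity gate around VLM-selected grasp points followed by FPS-style downsampling for diversity. The grasp dictionary fields and the combined position-plus-quaternion pose distance used here are assumptions made for illustration, not the pipeline's exact metric.

```python
import numpy as np

def filter_grasp_candidates(grasps, grasp_points, proximity=0.03, keep=100):
    """grasps: dicts with 'tcp_position' (3,) and 'quaternion' (4,) entries.
    Step 1: keep candidates within `proximity` meters of any VLM-selected grasp point.
    Step 2: farthest point sampling over a simple 7-D pose vector for diversity."""
    gp = np.asarray(grasp_points)  # (M, 3) VLM-selected grasp points
    near = [g for g in grasps
            if np.min(np.linalg.norm(gp - np.asarray(g["tcp_position"]), axis=1)) < proximity]
    if len(near) <= keep:
        return near

    # 7-D pose vectors: position concatenated with quaternion components.
    poses = np.array([np.concatenate([g["tcp_position"], g["quaternion"]]) for g in near])
    selected = [0]
    min_d = np.linalg.norm(poses - poses[0], axis=1)
    for _ in range(keep - 1):
        nxt = int(np.argmax(min_d))
        selected.append(nxt)
        min_d = np.minimum(min_d, np.linalg.norm(poses - poses[nxt], axis=1))
    return [near[i] for i in selected]
```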
7.3 Failure Cases and Filtering

Our multi-stage pipeline incorporates quality filtering at each stage:

3D Generation Failures. Objects may be rejected due to incomplete geometry where parts are missing or hollow, incorrect topology with self-intersecting meshes, or texture artifacts.

VLM Quality Verification Failures. The VLM-based quality checker rejects assets containing multiple objects instead of a single coherent item, or assets that appear broken, melted, or visually corrupted.

Grasp Verification Failures. Grasps fail simulation verification due to collision between the gripper and the object, unstable grasps where the object slips during lifting, or unreachable gripper configurations.

References

[1] Calli, B., Singh, A., Walsman, A., Srinivasa, S., Abbeel, P., Dollar, A.M.: The YCB object and model set: Towards common benchmarks for manipulation research. In: 2015 International Conference on Advanced Robotics (ICAR). pp. 510–517. IEEE (2015)
[2] Cao, Z., Chen, Z., Pan, L., Liu, Z.: PhysX-3D: Physical-grounded 3D asset generation. arXiv preprint arXiv:2507.12465 (2025)
[3] Chang, A.X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., et al.: ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012 (2015)
[4] Chen, B., Wan, W., Chen, T., Guo, X., Xu, C., Qi, Y., Zhang, H., Wu, L., Xu, T., Li, Z., et al.: UniVTAC: A unified simulation platform for visuo-tactile manipulation data generation, learning, and benchmarking. arXiv preprint arXiv:2602.10093 (2026)
[5] Chen, T., Chen, Z., Chen, B., Cai, Z., Liu, Y., Li, Z., Liang, Q., Lin, X., Ge, Y., Gu, Z., et al.: RoboTwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation. arXiv preprint arXiv:2506.18088 (2025)
[6] Chen, T., Mu, Y., Liang, Z., Chen, Z., Peng, S., Chen, Q., Xu, M., Hu, R., Zhang, H., Li, X., et al.: G3Flow: Generative 3D semantic flow for pose-aware and generalizable object manipulation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 1735–1744 (2025)
[7] Chen, T., Wang, K., Yang, Z., Zhang, Y., Chen, Z., Chen, B., Dong, W., Liu, Z., Chen, D., Yang, T., et al.: Benchmarking generalizable bimanual manipulation: RoboTwin dual-arm collaboration challenge at CVPR 2025 MEIS workshop. arXiv preprint arXiv:2506.23351 (2025)
[8] Chen, T., Wang, Y., Li, M., Qin, Y., Shi, H., Li, Z., Hu, Y., Zhang, Y., Wang, K., Chen, Y., et al.: RMBench: Memory-dependent robotic manipulation benchmark with insights into policy design. arXiv preprint arXiv:2603.01229 (2026)
[9] Deitke, M., Liu, R., Wallingford, M., Ngo, H., Michel, O., Kusupati, A., Fan, A., Laforte, C., Voleti, V., Gadre, S.Y., et al.: Objaverse-XL: A universe of 10M+ 3D objects. Advances in Neural Information Processing Systems 36, 35799–35813 (2023)
[10] Deitke, M., Schwenk, D., Salvador, J., Weihs, L., Michel, O., VanderBilt, E., Schmidt, L., Ehsani, K., Kembhavi, A., Farhadi, A.: Objaverse: A universe of annotated 3D objects. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13142–13153 (2023)
[11] Fang, Z., Li, X., Li, X., Zhao, S., Liu, M.: ModelNet-O: A large-scale synthetic dataset for occlusion-aware point cloud classification. Computer Vision and Image Understanding 246, 104060 (2024)
[12] Fu, Y., Feng, Q., Chen, N., Zhou, Z., Liu, M., Wu, M., Chen, T., Rong, S., Liu, J., Dong, H., et al.: CordViP: Correspondence-based visuomotor policy for dexterous manipulation in real-world. arXiv preprint arXiv:2502.08449 (2025)
[13] Geng, H., Xu, H., Zhao, C., Xu, C., Yi, L., Huang, S., Wang, H.: GAPartNet: Cross-category domain-generalizable object perception and manipulation via generalizable and actionable parts. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7081–7091 (2023)
[14] Jing, Z., Yang, S., Ao, J., Xiao, T., Jiang, Y.G., Bai, C.: HumanoidGen: Data generation for bimanual dexterous manipulation via LLM reasoning. arXiv preprint arXiv:2507.00833 (2025)
[15] Lin, C., Liu, H., Lin, Q., Bright, Z., Tang, S., He, Y., Liu, M., Zhu, L., Le, C.: Objaverse++: Curated 3D object dataset with quality annotations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 6813–6822 (2025)
[16] Mu, Y., Chen, T., Chen, Z., Peng, S., Lan, Z., Gao, Z., Liang, Z., Yu, Q., Zou, Y., Xu, M., et al.: RoboTwin: Dual-arm robot benchmark with generative digital twins. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 27649–27660 (2025)
[17] Mu, Y., Chen, T., Peng, S., Chen, Z., Gao, Z., Zou, Y., Lin, L., Xie, Z., Luo, P.: RoboTwin: Dual-arm robot benchmark with generative digital twins (early version). In: European Conference on Computer Vision. pp. 264–273. Springer (2024)
[18] Murali, A., Sundaralingam, B., Chao, Y.W., Yuan, W., Yamada, J., Carlson, M., Ramos, F., Birchfield, S., Fox, D., Eppner, C.: GraspGen: A diffusion-based framework for 6-DoF grasping with on-generator training. arXiv preprint arXiv:2507.13097 (2025)
[19] Nasiriany, S., Nasiriany, S., Maddukuri, A., Zhu, Y.: RoboCasa365: A large-scale simulation framework for training and benchmarking generalist robots. In: International Conference on Learning Representations (ICLR) (2026)
[20] Tian, Y., Yang, Y., Xie, Y., Cai, Z., Shi, X., Gao, N., Liu, H., Jiang, X., Qiu, Z., Yuan, F., et al.: InternData-A1: Pioneering high-fidelity synthetic data for pre-training generalist policy. arXiv preprint arXiv:2511.16651 (2025)
[21] Wang, Y., Xian, Z., Chen, F., Wang, T.H., Wang, Y., Fragkiadaki, K., Erickson, Z., Held, D., Gan, C.: RoboGen: Towards unleashing infinite data for automated robot learning via generative simulation. arXiv preprint arXiv:2311.01455 (2023)
[22] Wei, Z., Xu, Z., Guo, J., Hou, Y., Gao, C., Cai, Z., Luo, J., Shao, L.: D(R,O) Grasp: A unified representation of robot and object interaction for cross-embodiment dexterous grasping. arXiv preprint arXiv:2410.01702 (2024)
[23] Wen, B., Yang, W., Kautz, J., Birchfield, S.: FoundationPose: Unified 6D pose estimation and tracking of novel objects. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 17868–17879 (2024)
[24] Wu, T., Zhang, J., Fu, X., Wang, Y., Ren, J., Pan, L., Wu, W., Yang, L., Wang, J., Qian, C., et al.: OmniObject3D: Large-vocabulary 3D object dataset for realistic perception, reconstruction and generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 803–814 (2023)
[25] Xiang, F., Qin, Y., Mo, K., Xia, Y., Zhu, H., Liu, F., Liu, M., Jiang, H., Yuan, Y., Wang, H., et al.: SAPIEN: A simulated part-based interactive environment. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11097–11107 (2020)
[26] Xue, L., Gao, M., Xing, C., Martín-Martín, R., Wu, J., Xiong, C., Xu, R., Niebles, J.C., Savarese, S.: ULIP: Learning a unified representation of language, images, and point clouds for 3D understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1179–1189 (2023)
[27] Yang, G., Zhang, T., Hao, H., Wang, W., Liu, Y., Wang, D., Chen, G., Cai, Z., Chen, J., Su, W., et al.: Vlaser: Vision-language-action model with synergistic embodied reasoning. arXiv preprint arXiv:2510.11027 (2025)
[28] Zhang, L., Wang, Z., Zhang, Q., Qiu, Q., Pang, A., Jiang, H., Yang, W., Xu, L., Yu, J.: CLAY: A controllable large-scale generative model for creating high-quality 3D assets. ACM Transactions on Graphics (TOG) 43(4), 1–20 (2024)
[29] Zhou, E., An, J., Chi, C., Han, Y., Rong, S., Zhang, C., Wang, P., Wang, Z., Huang, T., Sheng, L., et al.: RoboRefer: Towards spatial referring with reasoning in vision-language models for robotics. arXiv preprint arXiv:2506.04308 (2025)
[30] Zhou, E., Chi, C., Li, Y., An, J., Zhang, J., Rong, S., Han, Y., Ji, Y., Liu, M., Wang, P., et al.: RoboTracer: Mastering spatial trace with reasoning in vision-language models for robotics. arXiv preprint arXiv:2512.13660 (2025)
[31] Zhu, Z., Zheng, Y., Pan, S., Jin, Y., Mu, Y.: PASG: A closed-loop framework for automated geometric primitive extraction and semantic anchoring in robotic manipulation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 8950–8960 (2025)