UniVTAC: A Unified Simulation Platform for Visuo-Tactile Manipulation Data Generation, Learning, and Benchmarking
Robotic manipulation has seen rapid progress with vision-language-action (VLA) policies. However, visuo-tactile perception is critical for contact-rich manipulation, as tasks such as insertion are difficult to complete robustly using vision alone. At the same time, acquiring large-scale and reliable tactile data in the physical world remains costly and challenging, and the lack of a unified evaluation platform further limits policy learning and systematic analysis. To address these challenges, we propose UniVTAC, a simulation-based visuo-tactile data synthesis platform that supports three commonly used visuo-tactile sensors and enables scalable and controllable generation of informative contact interactions. Based on this platform, we introduce the UniVTAC Encoder, a visuo-tactile encoder trained on large-scale simulation-synthesized data with designed supervisory signals, providing tactile-centric visuo-tactile representations for downstream manipulation tasks. In addition, we present the UniVTAC Benchmark, which consists of eight representative visuo-tactile manipulation tasks for evaluating tactile-driven policies. Experimental results show that integrating the UniVTAC Encoder improves average success rates by 17.1% on the UniVTAC Benchmark, while real-world robotic experiments further demonstrate a 25% improvement in task success. Our webpage is available at https://univtac.github.io/.
💡 Research Summary
The paper introduces UniVTAC, a comprehensive simulation‑based platform that tackles two major bottlenecks in visuo‑tactile robotic manipulation: the scarcity of large‑scale tactile datasets and the lack of a unified benchmark for evaluating tactile‑driven policies. Built on NVIDIA Isaac Sim and the TacEx framework, UniVTAC faithfully virtualizes three widely used visuo‑tactile sensors—GelSight Mini, VITAI GF225, and Xense WS—by configuring camera intrinsics, gel‑pad meshes, and rendering pipelines to match real hardware.
Data generation is orchestrated through a set of atomic manipulation APIs (Grasp, Move, Place, Probe, Rotate). The Grasp API incorporates a closed‑loop control law that modulates gripper joint velocity based on real‑time tactile depth (d_min) relative to a zero‑contact reference (d_max). This feedback prevents non‑physical penetration, keeps deformation within realistic manifolds, and yields high‑quality tactile imprints without damaging the simulated sensors.
UniVTAC’s data pipeline captures three “perception pathways”: (1) Shape – global geometry, supervised by RGB‑Depth reconstruction; (2) Contact – local deformation and shear, supervised by marker‑displacement and pressure‑map prediction; (3) Pose – spatial grounding, supervised by ground‑truth object pose regression. By jointly optimizing these three supervised objectives, all labeled for free by the simulator, the UniVTAC Encoder learns a tactile‑centric latent space that encodes fine‑grained contact physics while remaining invariant to sensor‑specific visual artifacts.
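A minimal sketch of how the three pathway losses might be combined during pretraining. The plain MSE terms, the dictionary field names, and the loss weights are assumptions for illustration; the paper's exact loss formulation is not reproduced here.

```python
import numpy as np

def mse(pred, target):
    """Mean squared error between two array-likes."""
    return float(np.mean((np.asarray(pred) - np.asarray(target)) ** 2))

def univtac_pretrain_loss(pred, gt, w_shape=1.0, w_contact=1.0, w_pose=1.0):
    """Joint objective over the Shape / Contact / Pose pathways.

    pred and gt are dicts with keys (names assumed):
      'depth'    : reconstructed depth map              (Shape)
      'markers'  : predicted marker-displacement field  (Contact)
      'pressure' : predicted pressure map               (Contact)
      'pose'     : regressed object pose                (Pose)
    """
    l_shape = mse(pred['depth'], gt['depth'])
    l_contact = mse(pred['markers'], gt['markers']) + mse(pred['pressure'], gt['pressure'])
    l_pose = mse(pred['pose'], gt['pose'])
    return w_shape * l_shape + w_contact * l_contact + w_pose * l_pose
```

The key design point carried over from the paper is that all three targets come from the simulator's ground truth, so supervision scales with data generation at no labeling cost.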
The encoder architecture combines a Vision‑Transformer‑based image backbone with a PointNet‑style module for processing marker coordinates, yielding a compact embedding that can be plugged directly into downstream policy networks without extra inference overhead.
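The two-branch fusion can be illustrated at the level of tensor shapes. The sketch below is an assumption-laden toy (random projections instead of trained weights, arbitrary layer sizes, a hypothetical 7×7 marker grid); it shows only the structural idea of a ViT-style patch stem for the tactile image plus a PointNet-style shared MLP with symmetric max-pooling over marker coordinates, concatenated into one embedding.

```python
import numpy as np

rng = np.random.default_rng(0)

def patch_embed(image, patch=16, dim=64):
    """Split an HxWxC image into patches and linearly project each (ViT stem)."""
    h, w, c = image.shape
    patches = image.reshape(h // patch, patch, w // patch, patch, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)
    proj = rng.standard_normal((patches.shape[1], dim)) * 0.02
    return patches @ proj  # (num_patches, dim) token sequence

def pointnet_embed(markers, dim=32):
    """Shared per-marker MLP followed by order-invariant max-pooling (PointNet idea)."""
    w = rng.standard_normal((markers.shape[1], dim)) * 0.02
    feats = np.maximum(markers @ w, 0.0)  # shared linear + ReLU applied to each marker
    return feats.max(axis=0)              # (dim,) pooled marker feature

image = rng.standard_normal((64, 64, 3))  # simulated tactile RGB frame (size assumed)
markers = rng.standard_normal((49, 2))    # 7x7 grid of 2-D marker displacements (assumed)

tokens = patch_embed(image)               # 16 tokens of dim 64
marker_vec = pointnet_embed(markers)      # 32-d marker feature
embedding = np.concatenate([tokens.mean(axis=0), marker_vec])  # fused 96-d code
```

Because the pooled embedding is a fixed-size vector, a downstream policy network can consume it like any other observation feature, which is what allows the encoder to be plugged in without extra inference overhead.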
To evaluate the impact of the simulated data and learned representations, the authors construct the UniVTAC Benchmark, comprising eight representative contact‑rich manipulation tasks (insertion, alignment, rotation, fine adjustment, etc.) built on top of TacEx and Isaac Sim. Each task defines quantitative success criteria (pose error, contact duration, force limits) and provides automated data collection and evaluation scripts, ensuring reproducibility and fair comparison across methods.
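A per-episode success check over those criteria might look like the following. The thresholds and argument names are placeholders, not the benchmark's actual values; only the three criterion types (pose error, contact duration, force limits) come from the summary above.

```python
def episode_success(pose_err_mm: float, contact_s: float, peak_force_n: float,
                    max_pose_err_mm: float = 2.0,   # tolerance, assumed
                    min_contact_s: float = 0.5,     # required contact time, assumed
                    max_force_n: float = 15.0) -> bool:
    """An episode succeeds if the final pose is within tolerance, contact was
    maintained long enough, and the force limit was never exceeded."""
    return (pose_err_mm <= max_pose_err_mm
            and contact_s >= min_contact_s
            and peak_force_n <= max_force_n)
```

Encoding the criteria as a pure function of logged quantities is what makes the benchmark's automated evaluation reproducible across methods.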
Experimental results show that policies equipped with the pre‑trained UniVTAC Encoder achieve a 17.1% higher average success rate on the benchmark compared to baselines that use plain CNN encoders or single‑task supervision. The advantage is most pronounced on tasks that require subtle tactile feedback, such as precise insertion. Moreover, when the same policies are transferred to a real‑world robot equipped with actual GelSight sensors, they exhibit a 25% improvement in task success, demonstrating effective sim‑to‑real transfer thanks to the high‑fidelity tactile simulation and multi‑task pretraining.
Additionally, UniVTAC’s modular sensor support allows researchers to benchmark algorithms across different hardware configurations within a single environment, facilitating hardware‑agnostic algorithm development.
In summary, UniVTAC delivers (1) a scalable, high‑fidelity visuo‑tactile simulation engine, (2) large‑scale annotated tactile datasets with rich physical supervision, (3) a multi‑task pre‑trained tactile encoder, and (4) a standardized benchmark suite. By integrating these components, the work substantially lowers the entry barrier for visuo‑tactile manipulation research and opens avenues for future extensions such as deformable object handling, multi‑robot collaboration, and language‑conditioned tactile tasks.