SynthVerse: A Large-Scale Diverse Synthetic Dataset for Point Tracking
Point tracking aims to follow visual points through complex motion, occlusion, and viewpoint changes, and has advanced rapidly with modern foundation models. Yet progress toward general point tracking remains constrained by limited high-quality data, as existing datasets often provide insufficient diversity and imperfect trajectory annotations. To this end, we introduce SynthVerse, a large-scale, diverse synthetic dataset specifically designed for point tracking. SynthVerse includes several new domains and object types missing from existing synthetic datasets, such as animated-film-style content, embodied manipulation, scene navigation, and articulated objects. SynthVerse substantially expands dataset diversity by covering a broader range of object categories and providing high-quality dynamic motions and interactions, enabling more robust training and evaluation for general point tracking. In addition, we establish a highly diverse point tracking benchmark to systematically evaluate state-of-the-art methods under broader domain shifts. Extensive experiments and analyses demonstrate that training with SynthVerse yields consistent improvements in generalization and reveal limitations of existing trackers under diverse settings.
💡 Research Summary
SynthVerse addresses a critical bottleneck in point‑tracking research: the lack of large‑scale, high‑quality, and diverse training data. Existing synthetic datasets such as Kubric, PointOdyssey, and Dynamic Replica are limited either in the number of frames, the variety of object categories, or the range of motion patterns they capture. Consequently, state‑of‑the‑art trackers, while impressive on narrow benchmarks, often fail to generalize when faced with domain shifts such as egocentric viewpoints, articulated or deformable objects, and complex interaction dynamics.
The authors introduce SynthVerse, a synthetic dataset comprising 5.8 million training frames across 48,000 video sequences. The dataset expands coverage to four previously under‑represented domains: (1) animated‑film‑style scenes sourced from publicly released Blender Studio projects, (2) embodied manipulation where a robot arm follows text‑driven instructions (GenManip) and is captured by four synchronized cameras, (3) indoor navigation sequences in which a camera traverses cluttered rooms, and (4) articulated and deformable objects, including URDF‑based multi‑joint mechanisms and soft‑body assets such as garments and flowers. Human and animal content is also richly represented (≈4,000 human sequences, 75 animal species with multiple motion styles).
A key technical contribution is a cross‑platform data‑generation pipeline that combines Blender for high‑fidelity modeling and animation with Isaac Sim for physics‑based simulation and multi‑view rendering. The pipeline automatically produces RGB images, depth maps, instance masks, 3D point trajectories, 2D projections, and per‑frame visibility flags. By leveraging both egocentric (hand‑mounted, head‑mounted) and allocentric (orbit, top‑down) camera setups, SynthVerse supplies the viewpoint diversity required for robust learning.
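The relationship between the annotation types the pipeline produces (3D trajectories, 2D projections, depth maps, and visibility flags) can be illustrated with a minimal sketch. This is not the authors' pipeline code; it assumes a simple pinhole camera model and decides visibility by comparing a point's depth against the rendered depth buffer, which is a common way such flags are derived in synthetic datasets:

```python
import numpy as np

def project_tracks(points_3d, intrinsics, depth_map, eps=0.05):
    """Project 3D points (camera frame) to 2D pixels and flag visibility.

    points_3d : (N, 3) points in the camera coordinate frame.
    intrinsics: (3, 3) pinhole camera matrix K (hypothetical values in the demo).
    depth_map : (H, W) rendered depth image; a point counts as visible when
                its depth agrees with the depth buffer at its pixel within eps.
    """
    uvw = points_3d @ intrinsics.T                    # perspective projection
    uv = uvw[:, :2] / uvw[:, 2:3]                     # divide by depth -> pixels
    H, W = depth_map.shape
    cols = np.clip(np.round(uv[:, 0]).astype(int), 0, W - 1)
    rows = np.clip(np.round(uv[:, 1]).astype(int), 0, H - 1)
    in_frame = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    unoccluded = np.abs(depth_map[rows, cols] - points_3d[:, 2]) < eps
    visible = in_frame & unoccluded & (points_3d[:, 2] > 0)
    return uv, visible

# Hypothetical demo: a 64x64 depth buffer with a surface at depth 10.
K = np.array([[100., 0., 32.], [0., 100., 32.], [0., 0., 1.]])
depth = np.full((64, 64), 10.0)
pts = np.array([[0., 0., 10.],    # on the rendered surface -> visible
                [0., 0., 12.]])   # behind the surface      -> occluded
uv, vis = project_tracks(pts, K, depth)
```

Running the depth test per frame is what turns a single 3D trajectory into the 2D track plus visibility flag pairs that point-tracking models are actually trained on.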
To evaluate the impact of this data, the authors construct a benchmark that tests eight recent point‑tracking methods (PIP, TAPIR, MVTracker, CoTracker, SpatialTracker, DELTA, TAPIP‑3D, D4R) under a wide spectrum of domain shifts. When trained solely on prior synthetic data, the trackers achieve reasonable performance on in‑domain tests but suffer dramatic drops on out‑of‑domain scenarios (e.g., from rigid‑fall to articulated‑object tracking). Training on SynthVerse consistently improves average precision by 7–12% across all test conditions, with especially large gains (≈15%) on articulated and deformable object sequences. These results demonstrate that the added scale and diversity directly translate into better generalization.
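The summary does not spell out how the accuracy numbers are computed, but point-tracking benchmarks in this line of work commonly report a TAP-Vid-style average position accuracy: the fraction of visible points predicted within a set of pixel thresholds of the ground truth, averaged over thresholds. A hedged sketch of that metric (the exact thresholds and inequality convention vary between benchmarks):

```python
import numpy as np

def position_accuracy(pred, gt, visible, thresholds=(1, 2, 4, 8, 16)):
    """TAP-Vid-style average position accuracy (often written delta_avg).

    pred, gt : (T, N, 2) predicted / ground-truth 2D tracks in pixels,
               for T frames and N query points.
    visible  : (T, N) boolean ground-truth visibility; only visible
               points are scored, as occluded points have no valid target.
    """
    err = np.linalg.norm(pred - gt, axis=-1)     # (T, N) per-point pixel error
    err = err[visible]                           # keep visible points only
    # Fraction of points within each threshold, then average the fractions.
    accs = [(err < t).mean() for t in thresholds]
    return float(np.mean(accs))

# Hypothetical demo: four visible points with errors 0.5, 3, 10, 20 pixels.
gt = np.zeros((1, 4, 2))
pred = np.array([[[0.5, 0.], [3., 0.], [10., 0.], [20., 0.]]])
vis = np.ones((1, 4), dtype=bool)
score = position_accuracy(pred, gt, vis)
```

A "7–12% improvement" under such a metric means a correspondingly larger share of tracked points landing within a few pixels of their true locations across the benchmark's domains.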
The paper also discusses limitations. Because the data are synthetic, real‑world sensor noise, lens distortion, and other optical artifacts are not fully reproduced. Moreover, while the dataset provides geometry‑level annotations (trajectory, depth, masks), it does not include physical interaction forces or contact pressures, which could be valuable for physics‑aware tracking research. The authors suggest future work on domain‑adaptation techniques, incorporation of realistic noise models, and extension of the annotation set to include force information.
In summary, SynthVerse delivers the first truly large‑scale, multi‑domain synthetic dataset for point tracking, together with a rigorous benchmark that reveals the shortcomings of current methods under realistic distribution shifts. By openly releasing the data, generation pipeline, and benchmark, the authors provide the community with a powerful resource to develop more robust, generalizable point‑tracking algorithms applicable to robotics, AR/VR, 4D reconstruction, and beyond.