StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation
The growing adoption of XR devices has fueled strong demand for high-quality stereo video, yet its production remains costly and artifact-prone. To address this challenge, we present StereoWorld, an end-to-end framework that repurposes a pretrained video generator for high-fidelity monocular-to-stereo video generation. Our framework jointly conditions the model on the monocular video input while explicitly supervising the generation with a geometry-aware regularization to ensure 3D structural fidelity. A spatio-temporal tiling scheme is further integrated to enable efficient, high-resolution synthesis. To enable large-scale training and evaluation, we curate a high-definition stereo video dataset containing over 11M frames aligned to natural human interpupillary distance (IPD). Extensive experiments demonstrate that StereoWorld substantially outperforms prior methods, generating stereo videos with superior visual fidelity and geometric consistency. The project webpage is available at https://ke-xing.github.io/StereoWorld/.
💡 Research Summary
StereoWorld tackles the pressing need for high‑quality stereoscopic video in the era of XR devices, where traditional dual‑camera rigs are expensive, cumbersome, and inaccessible to most creators. Existing monocular‑to‑stereo pipelines fall into two categories: (1) novel‑view synthesis (NVS) that reconstructs 3D geometry via SfM, NeRF, or 3D Gaussian Splatting, and (2) depth‑warping‑inpainting pipelines that first estimate depth, warp the left view, and then hallucinate occluded regions. Both approaches suffer from severe drawbacks: NVS pipelines are fragile to pose errors and struggle with dynamic scenes, while depth‑warping pipelines break pixel‑level correspondence, leading to texture distortions, color shifts, and uncomfortable stereoscopic artifacts.
StereoWorld proposes a fundamentally different solution: an end‑to‑end diffusion‑based framework that directly generates the right‑eye video conditioned on the left‑eye input, without any intermediate warping or inpainting stages. The core of the system is a pretrained text‑to‑video diffusion model built on the DiT (Diffusion Transformer) architecture. Instead of redesigning the network, the authors simply concatenate the latent representations of the left and right videos along the temporal (frame) dimension, allowing the model’s existing 3D self‑attention layers to fuse information across space, time, and viewpoint. This “monocular‑conditioning” strategy requires no architectural changes and leverages the strong spatio‑temporal priors already learned by the foundation model.
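The conditioning strategy above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the latent layout `(batch, frames, channels, h, w)` and the function name `stereo_condition` are assumptions for exposition.

```python
import torch

# Hedged sketch: latents assumed shaped (batch, frames, channels, h, w).
def stereo_condition(left_latent: torch.Tensor,
                     noisy_right_latent: torch.Tensor) -> torch.Tensor:
    # Concatenate the clean left-view latent (conditioning) and the noisy
    # right-view latent (to be denoised) along the temporal/frame axis.
    # The pretrained DiT's 3D self-attention then fuses information across
    # space, time, and viewpoint with no architectural change.
    return torch.cat([left_latent, noisy_right_latent], dim=1)

left = torch.randn(1, 8, 16, 30, 45)   # toy latent for an 8-frame left-eye clip
right = torch.randn(1, 8, 16, 30, 45)  # noisy right-eye latent at some timestep
joint = stereo_condition(left, right)
print(joint.shape)  # torch.Size([1, 16, 16, 30, 45]): 8 left + 8 right frames
```

The denoiser sees a doubled frame sequence, so viewpoint fusion rides on the same attention pattern the model already uses for time.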
To endow the model with true 3‑D awareness, StereoWorld introduces a geometry‑aware regularization consisting of two complementary supervision signals:
- Disparity Supervision – Ground‑truth disparity maps are obtained offline using a state‑of‑the‑art stereo matching network. During training, a lightweight differentiable stereo projector κ takes the left‑view latent and the denoised right‑view latent to predict a disparity map b̂_pred. A combined log‑L1 loss (global log‑ratio consistency plus pixel‑wise L1) forces the predicted disparity to match the ground truth, reducing cross‑view misalignment and temporal disparity drift.
- Depth Supervision – Disparity only covers overlapping regions; newly visible areas in the right view would otherwise lack guidance. Therefore, the model jointly diffuses depth maps together with RGB frames. The last few DiT blocks are duplicated to form parallel branches that learn separate velocity fields for color and depth, ensuring that the generated right‑eye video is geometrically consistent even in non‑overlapping zones.
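The combined log‑L1 disparity term can be illustrated as follows. This is a hedged sketch: the weighting `lambda_log` and the exact form of the global term are assumptions, and in the real system the predicted map would come from the stereo projector κ rather than a toy tensor.

```python
import torch

def disparity_loss(disp_pred: torch.Tensor, disp_gt: torch.Tensor,
                   lambda_log: float = 1.0, eps: float = 1e-6) -> torch.Tensor:
    # Pixel-wise L1 term: local alignment between predicted and
    # ground-truth disparity, penalizing cross-view misalignment.
    l1 = (disp_pred - disp_gt).abs().mean()
    # Global log-ratio term: penalizes overall scale drift of the
    # disparity map, discouraging temporal disparity drift.
    log_ratio = (torch.log(disp_pred.clamp_min(eps)) -
                 torch.log(disp_gt.clamp_min(eps))).abs().mean()
    return l1 + lambda_log * log_ratio

pred = torch.rand(1, 8, 1, 60, 90) + 0.5  # toy positive disparity map
gt = pred.clone()
loss = disparity_loss(pred, gt)
print(loss.item())  # 0.0 when the prediction matches the ground truth exactly
```

The L1 term keeps individual pixels honest while the log‑ratio term anchors the global disparity scale, which matters for a comfortable, stable stereo baseline.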
High‑resolution, long‑duration synthesis is enabled by a spatio‑temporal tiling scheme. During training, early frames of the noisy latent sequence are occasionally replaced with ground‑truth frames, encouraging the model to rely on past context. At inference time, videos are split into overlapping segments; the final frames of a segment are fed as conditioning for the next, preserving temporal coherence while keeping memory usage modest.
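The inference-time segmentation described above can be sketched as a window schedule. The window size (81 frames, matching the training clip length) is from the source; the 8‑frame overlap and the function name are illustrative assumptions.

```python
# Hedged sketch: split a long video into overlapping windows where the tail
# of each generated segment conditions the next, keeping memory bounded.
def segment_schedule(num_frames: int, window: int = 81, overlap: int = 8):
    """Return (start, end, context_start) triples: frames
    [context_start, start) are reused from the previous segment as
    conditioning; frames [start, end) are newly generated."""
    segments = []
    start = 0
    while start < num_frames:
        # After the first segment, `overlap` frames of the window are spent
        # on conditioning context, so fewer new frames fit per segment.
        end = min(start + window - (overlap if start > 0 else 0), num_frames)
        context_start = max(0, start - overlap) if start > 0 else 0
        segments.append((start, end, context_start))
        start = end
    return segments

print(segment_schedule(200))
# [(0, 81, 0), (81, 154, 73), (154, 200, 146)]
```

Each window (context plus new frames) stays within the 81‑frame budget, so arbitrarily long videos are generated with constant memory while the shared context frames preserve temporal coherence across segment boundaries.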
A major bottleneck for training such a system is data. Existing stereo datasets either have baselines far larger than the human interpupillary distance (IPD) or are not publicly available. StereoWorld therefore curates a new “StereoWorld‑11M” dataset: over 11 million frames extracted from high‑definition Blu‑ray side‑by‑side movies, all aligned to a natural human IPD (≈55–75 mm). The videos span diverse genres, are standardized to 1080p, 24 fps, and down‑scaled to 480p for training (81‑frame clips). This dataset provides both the visual diversity and the correct geometric baseline needed for realistic XR content.
Extensive experiments demonstrate that StereoWorld outperforms prior diffusion‑based stereo generators (e.g., StereoCrafter, SpatialDreamer, GenStereo) across a suite of metrics: higher PSNR and SSIM, lower LPIPS, lower disparity error, reduced depth RMSE, and improved temporal warping error. User studies confirm that the generated videos exhibit stronger depth perception and lower visual fatigue.
In summary, StereoWorld shows that a pretrained video diffusion model, when equipped with simple latent‑level conditioning and geometry‑aware regularization, can be transformed into a powerful monocular‑to‑stereo generator. It delivers high‑fidelity, IPD‑accurate stereoscopic video at scale, opening a practical pathway for creators to convert existing monocular content into immersive XR experiences without costly hardware or multi‑stage pipelines.