3D Scene Prompting for Scene-Consistent Camera-Controllable Video Generation

Notice: This research summary and analysis were automatically generated using AI technology. For accuracy, please refer to the original arXiv source.

We present 3DScenePrompt, a framework that generates the next video chunk from arbitrary-length input while enabling precise camera control and preserving scene consistency. Unlike methods conditioned on a single image or a short clip, we employ dual spatio-temporal conditioning that reformulates context-view referencing across the input video. Our approach conditions on both temporally adjacent frames for motion continuity and spatially adjacent content for scene consistency. However, when generating beyond temporal boundaries, directly using spatially adjacent frames would incorrectly preserve dynamic elements from the past. We address this by introducing a 3D scene memory that represents exclusively the static geometry extracted from the entire input video. To construct this memory, we leverage dynamic SLAM with our newly introduced dynamic masking strategy that explicitly separates static scene geometry from moving elements. The static scene representation can then be projected to any target viewpoint, providing geometrically consistent warped views that serve as strong 3D spatial prompts while allowing dynamic regions to evolve naturally from temporal context. This enables our model to maintain long-range spatial coherence and precise camera control without sacrificing computational efficiency or motion realism. Extensive experiments demonstrate that our framework significantly outperforms existing methods in scene consistency, camera controllability, and generation quality. Project page: https://cvlab-kaist.github.io/3DScenePrompt/


💡 Research Summary

Paper Title: 3DScenePrompt: 3D Scene Prompting for Scene‑Consistent Camera‑Controllable Video Generation

Problem Statement
The authors define a new task: given an arbitrary‑length dynamic video as context and a user‑specified camera trajectory, generate a future video chunk that follows the trajectory while preserving the static geometry of the scene and allowing dynamic objects to evolve naturally. Existing camera‑controllable video generators condition on a single frame or a short clip, which limits their ability to maintain long‑range spatial consistency when the camera revisits previously seen viewpoints. Video‑to‑future‑video models improve temporal continuity but still rely on a small temporal window, leading to spatial drift for long trajectories.

Key Contributions

  1. Dual Spatio‑Temporal Conditioning – The model receives (i) the last w frames for motion continuity and (ii) a spatial prompt derived from a 3D scene memory that encodes the static geometry of the entire input video.
  2. 3D Scene Memory Construction – A dynamic SLAM pipeline estimates per‑frame camera poses and reconstructs a dense point cloud. A novel dynamic‑masking strategy, based on optical‑flow and depth cues, removes moving objects, leaving only persistent static points. The resulting static‑only point cloud is aggregated into a global memory.
  3. Projection‑Based Spatial Prompt – The static point cloud is rendered from the target viewpoint(s) defined by the user’s camera trajectory, producing geometrically consistent warped views that serve as strong spatial cues for the diffusion model.
  4. Architecture – The diffusion backbone (a U‑Net) is equipped with two encoders: a Temporal‑Prompt Encoder that processes the recent w frames, and a Spatial‑Prompt Encoder that ingests the rendered static views together with the target pose. Cross‑Attention fuses the two streams before decoding future frames. Text prompts can be incorporated via a standard text encoder.
  5. Efficiency – By storing only the static geometry and a small temporal window, memory consumption grows linearly with the number of static points rather than with the total frame count, achieving ~40 % lower GPU memory usage and ~30 % faster inference compared with naïve full‑frame conditioning.
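Contribution 3 — projecting the static scene memory into the target viewpoint — can be illustrated with a simple z-buffered pinhole projection. The following NumPy sketch is an assumption-laden simplification (the paper uses a more elaborate splatting renderer); the function name, intrinsics convention, and single-pixel splats are illustrative only.

```python
import numpy as np

def render_spatial_prompt(points, colors, w2c, K, hw):
    """Project a static point cloud into a target view with a z-buffer.

    points : (N, 3) world-space static points
    colors : (N, 3) per-point RGB in [0, 1]
    w2c    : (4, 4) world-to-camera SE(3) pose of the target view
    K      : (3, 3) pinhole intrinsics
    hw     : (H, W) output resolution
    Returns an RGB image and a depth map (zeros / inf where nothing projects).
    """
    H, W = hw
    # Transform points into the target camera frame.
    cam = (w2c[:3, :3] @ points.T + w2c[:3, 3:4]).T
    z = cam[:, 2]
    front = z > 1e-6                          # keep points in front of the camera
    cam, z, cols = cam[front], z[front], colors[front]
    # Pinhole projection to pixel coordinates.
    uvw = (K @ cam.T).T
    u = np.round(uvw[:, 0] / z).astype(int)
    v = np.round(uvw[:, 1] / z).astype(int)
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    u, v, z, cols = u[inside], v[inside], z[inside], cols[inside]
    rgb = np.zeros((H, W, 3))
    depth = np.full((H, W), np.inf)
    # Nearest-point-wins: draw far-to-near so nearer points overwrite.
    for i in np.argsort(-z):
        rgb[v[i], u[i]] = cols[i]
        depth[v[i], u[i]] = z[i]
    return rgb, depth
```

The resulting RGB-D image is exactly the kind of "geometrically consistent warped view" the Spatial-Prompt Encoder consumes; holes (pixels with infinite depth) correspond to regions the input video never observed, which the diffusion model must hallucinate.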

Methodology Details

  • Dynamic SLAM: The authors adopt recent D‑SLAM frameworks (e.g., Zhang 2022, Li 2024) that jointly estimate camera extrinsics and a semi‑dense 3D reconstruction.
  • Dynamic Masking: For each frame, optical flow magnitude and depth discontinuities are combined to generate a binary mask. Pixels flagged as moving are excluded from the point cloud. This mask is refined across frames to reduce false positives.
  • Scene Memory Aggregation: Static points from all frames are merged using pose‑aligned voxel hashing; duplicate points are averaged, and outliers are filtered by consistency checks.
  • Spatial Prompt Rendering: Given a target SE(3) pose, the static point cloud is rasterized into a depth‑aware image (RGB‑D) using splatting. The rendered view is fed to the Spatial‑Prompt Encoder as a 2‑D feature map.
  • Diffusion Process: The model follows a classifier‑free guidance scheme. The loss combines a standard denoising objective with a consistency regularizer that penalizes deviation from the static prompt in static regions, encouraging the decoder to respect the geometry while freely generating dynamic content guided by the temporal frames.
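The masking, aggregation, and loss components above can be sketched as follows. This is a hedged NumPy illustration, not the paper's implementation: the residual-flow formulation, thresholds, voxel size, and the L2 regularizer weight `lam` are all assumptions made for the sketch.

```python
import numpy as np

def dynamic_mask(flow, ego_flow, depth, flow_thresh=2.0, edge_thresh=0.5):
    """Flag pixels whose observed flow cannot be explained by camera motion.

    flow     : (H, W, 2) observed optical flow between consecutive frames
    ego_flow : (H, W, 2) flow induced by camera motion alone (computable
               from SLAM poses and depth); the residual isolates object motion
    depth    : (H, W) per-pixel depth, used for discontinuity cues
    """
    moving = np.linalg.norm(flow - ego_flow, axis=-1) > flow_thresh
    gy, gx = np.gradient(depth)
    edges = np.hypot(gx, gy) > edge_thresh
    # Grow the motion mask onto adjacent depth edges, which usually trace
    # the boundary of the same moving object.
    near = np.zeros_like(moving)
    near[1:] |= moving[:-1]; near[:-1] |= moving[1:]
    near[:, 1:] |= moving[:, :-1]; near[:, :-1] |= moving[:, 1:]
    return moving | (edges & near)

def aggregate_static_points(points, colors, voxel=0.05):
    """Voxel-hash merge of pose-aligned static points: points falling in the
    same voxel are averaged, deduplicating the global scene memory."""
    keys = np.floor(points / voxel).astype(np.int64)
    _, inverse, counts = np.unique(keys, axis=0,
                                   return_inverse=True, return_counts=True)
    merged_p = np.zeros((counts.size, 3))
    merged_c = np.zeros((counts.size, 3))
    np.add.at(merged_p, inverse, points)   # sum contributions per voxel
    np.add.at(merged_c, inverse, colors)
    return merged_p / counts[:, None], merged_c / counts[:, None]

def training_loss(pred_eps, true_eps, pred_x0, prompt, static_mask, lam=0.1):
    """Denoising objective plus a static-region consistency regularizer that
    penalizes deviation from the spatial prompt only where the mask is static."""
    denoise = np.mean((pred_eps - true_eps) ** 2)
    m = static_mask[..., None].astype(float)
    consist = np.sum(m * (pred_x0 - prompt) ** 2) / (np.sum(m) + 1e-8)
    return denoise + lam * consist
```

Note the division of labor: the mask decides what enters the memory, the voxel hash keeps that memory compact, and the regularizer only constrains static pixels, leaving dynamic regions free to follow the temporal prompt.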

Experiments

  • Datasets: synthetic indoor/outdoor scenes with moving agents and real‑world captures (e.g., KITTI‑style driving sequences).
  • Baselines: CameraCtrl, VD3D, Cosmos‑predict2, DFoT, and recent geometry‑aware models (Gen3C).
  • Metrics: PSNR, SSIM, and LPIPS for visual fidelity; camera pose error for trajectory adherence; a scene‑consistency score (average distance between the rendered static prompt and the generated static pixels).
  • Results: a 10–20 % improvement across all metrics. Qualitative examples show that when the camera loops back to a previously seen viewpoint, the static background aligns with the earlier observations, while moving objects continue from their latest positions rather than snapping back to earlier states.
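The scene‑consistency score can be sketched as a masked pixel distance between the warped static prompt and the generated frame. The summary does not specify the exact distance function, so the L1 form and function name below are assumptions:

```python
import numpy as np

def scene_consistency_score(generated, static_prompt, static_mask):
    """Mean per-pixel distance between the generated frame and the rendered
    static prompt, evaluated only where the prompt has valid static content.
    Lower is better; the L1 distance is an illustrative assumption.

    generated, static_prompt : (H, W, 3) images in [0, 1]
    static_mask              : (H, W) bool, True where the prompt is valid
    """
    diff = np.abs(generated - static_prompt).mean(axis=-1)
    return float(diff[static_mask].mean()) if static_mask.any() else 0.0
```

A perfectly consistent loop closure (generated static pixels identical to the reprojected memory) scores 0, so the metric directly measures the spatial drift that single-frame-conditioned baselines accumulate.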

Ablation Studies

  • Removing the spatial prompt degrades scene consistency dramatically, confirming its necessity.
  • Using the full point cloud without dynamic masking leads to “ghost” artifacts where past moving objects reappear; the dynamic mask eliminates this issue.
  • Varying the temporal window size shows diminishing returns beyond w = 4, supporting the design choice of a small temporal context.

Limitations

  • Dynamic masking can fail for fast motions or low‑texture regions, causing residual dynamic points in the static memory.
  • SLAM may break down under severe illumination changes, affecting pose accuracy and point cloud quality.
  • The current pipeline is offline; real‑time interactive control would require further optimization.

Future Work
The authors propose integrating learning‑based motion segmentation to replace the hand‑crafted flow‑depth mask, exploring multi‑view fusion to improve robustness, and developing a lightweight version of the diffusion model for real‑time applications such as VR telepresence or live video editing.

Conclusion
3DScenePrompt introduces a principled way to combine temporal dynamics with a globally consistent 3D static scene memory, enabling high‑quality, camera‑controllable video generation from arbitrarily long inputs. By separating static geometry from dynamic content and projecting the static geometry as a spatial prompt, the method achieves superior scene consistency, precise camera trajectory adherence, and efficient computation, opening new possibilities for long‑duration video synthesis in film, VR, and synthetic data generation.

