ReRoPE: Repurposing RoPE for Relative Camera Control
Video generation with controllable camera viewpoints is essential for applications such as interactive content creation, gaming, and simulation. Existing methods typically adapt pre-trained video models using camera poses relative to a fixed reference, e.g., the first frame. However, these encodings lack shift-invariance, often leading to poor generalization and accumulated drift. While relative camera pose embeddings defined between arbitrary view pairs offer a more robust alternative, integrating them into pre-trained video diffusion models without prohibitive training costs or architectural changes remains challenging. We introduce ReRoPE, a plug-and-play framework that incorporates relative camera information into pre-trained video diffusion models without compromising their generation capability. Our approach is based on the insight that Rotary Positional Embeddings (RoPE) in existing models underutilize their full spectral bandwidth, particularly in the low-frequency components. By seamlessly injecting relative camera pose information into these underutilized bands, ReRoPE achieves precise control while preserving strong pre-trained generative priors. We evaluate our method on both image-to-video (I2V) and video-to-video (V2V) tasks in terms of camera control accuracy and visual fidelity. Our results demonstrate that ReRoPE offers a training-efficient path toward controllable, high-fidelity video generation. See project page for more results: https://sisyphe-lee.github.io/ReRoPE/
💡 Research Summary
The paper addresses the longstanding challenge of precise camera viewpoint control in video generation models, which is crucial for interactive content creation, gaming, and simulation. Existing approaches typically condition pre‑trained video diffusion models on absolute 6‑DoF camera poses referenced to a fixed frame (usually the first frame). This absolute encoding lacks shift‑invariance, leading to poor generalization to unseen trajectories and accumulated drift over long sequences. Relative camera pose embeddings—defined between arbitrary view pairs—offer a more robust alternative, but integrating them into large pre‑trained video diffusion transformers has required either extensive architectural redesign or training from scratch, both of which are prohibitively expensive.
ReRoPE (Repurposing RoPE) proposes a lightweight, plug‑and‑play solution that injects relative camera information directly into the Rotary Positional Embedding (RoPE) modules of existing video diffusion transformers, without altering the backbone, adding auxiliary encoders, or introducing new learnable parameters; only a brief fine‑tuning phase is required. The key insight is that RoPE’s full spectral bandwidth is not uniformly utilized: low‑frequency bands (corresponding to large frequency indices) contribute almost no positional discrimination for typical video lengths (e.g., 50 frames). This redundancy is especially pronounced in the temporal dimension, where the latent sequence is short relative to spatial dimensions.
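The spectral-redundancy argument is easy to verify numerically. A minimal sketch (using the standard RoPE frequency schedule, not code from the paper): for a 50-frame temporal axis, the highest-frequency channel pair sweeps through tens of radians across the sequence, while the lowest-frequency pair barely rotates at all, so it contributes essentially no positional signal.

```python
import numpy as np

def rope_rotation_span(dim=64, base=10000.0, seq_len=50):
    """Per-pair RoPE frequencies and total rotation across a sequence.

    Standard RoPE assigns channel pair i the frequency theta_i = base^(-2i/d).
    The total rotation of pair i over the sequence is (seq_len - 1) * theta_i.
    """
    i = np.arange(dim // 2)
    theta = base ** (-2.0 * i / dim)      # per-pair angular frequency
    max_angle = (seq_len - 1) * theta     # total rotation across all frames
    return theta, max_angle

theta, max_angle = rope_rotation_span()
# The highest-frequency pair rotates 49 rad over 50 frames; the
# lowest-frequency pair rotates well under 0.01 rad -- effectively unused.
print(f"highest-frequency pair: {max_angle[0]:.1f} rad over 50 frames")
print(f"lowest-frequency pair:  {max_angle[-1]:.4f} rad over 50 frames")
```

Channels whose total rotation stays far below one radian over the whole clip are the "unused" low-frequency bands the paper proposes to repurpose.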
The authors conduct a toy experiment that visualizes attention scores across frequency bands and demonstrate that high‑frequency bands encode fine‑grained relative positions, while low‑frequency bands remain near‑constant, effectively “unused.” Leveraging this observation, ReRoPE repurposes a selected subset of low‑frequency temporal channels to encode the relative camera transformation matrix (rotation and translation) between any two frames. The high‑frequency temporal and spatial RoPE components continue to preserve the original temporal ordering and spatial layout, ensuring that the generative priors learned during large‑scale pre‑training remain intact.
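The repurposing step can be sketched as follows. This is an illustrative simplification, not the paper's implementation: here each frame's camera is reduced to a few scalar parameters written into the lowest-frequency temporal RoPE angles (the paper encodes the full relative rotation/translation matrix). Because RoPE attention depends only on per-pair angle *differences*, the repurposed channels automatically encode relative camera geometry between any two frames, which is what makes the encoding shift-invariant.

```python
import numpy as np

def build_temporal_angles(num_frames, dim, cam_params, num_repurposed,
                          base=10000.0):
    """Standard temporal RoPE angles, with the lowest-frequency pairs
    overwritten by per-frame camera parameters.

    cam_params: (num_frames, num_repurposed) scalar pose coordinates per
    frame -- a hypothetical parameterization for illustration only.
    """
    i = np.arange(dim // 2)
    theta = base ** (-2.0 * i / dim)
    t = np.arange(num_frames)[:, None]
    angles = t * theta[None, :]              # (T, dim/2) standard RoPE angles
    angles[:, -num_repurposed:] = cam_params # repurpose the unused low bands
    return angles

def apply_rope(x, angles):
    """Rotate each channel pair of x (shape (T, dim)) by its angle."""
    c, s = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * c - x2 * s
    out[:, 1::2] = x1 * s + x2 * c
    return out
```

A useful sanity check: adding a constant offset to every frame's camera parameters (i.e., changing the world reference frame) leaves all query-key attention scores unchanged, since only pairwise angle differences matter, which is precisely the shift-invariance that absolute first-frame encodings lack.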
ReRoPE supports two generation settings: (1) Video‑to‑Video (V2V), where an input video and its source camera trajectory are re‑rendered to follow a new target trajectory; and (2) Image‑to‑Video (I2V), where a single reference image is expanded into a video that follows a prescribed camera path. In both cases, the method operates on the 3‑D latent grid (T × H × W) of the pre‑trained diffusion transformer, inserting the relative pose into the low‑frequency temporal RoPE band while leaving spatial RoPE untouched. This design requires only a brief fine‑tuning phase (on the order of minutes) and preserves the model’s visual fidelity.
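The channel layout described above can be made concrete with a small sketch. The specific split below (16 temporal, 24 height, 24 width channels, with 4 low-frequency temporal pairs repurposed) is purely illustrative; the actual split is backbone-specific and not stated in this summary.

```python
def rope_channel_layout(head_dim=64, t_dim=16, h_dim=24, w_dim=24,
                        repurposed_pairs=4):
    """Illustrative 3-D factorized RoPE layout: the head dimension is split
    across temporal, height, and width axes, and only a low-frequency slice
    of the temporal block carries camera pose. Spatial blocks are untouched,
    preserving the pre-trained spatial priors.
    """
    assert t_dim + h_dim + w_dim == head_dim
    return {
        # (start, end) channel spans within one attention head
        "temporal_positional": (0, t_dim - 2 * repurposed_pairs),
        "temporal_camera":     (t_dim - 2 * repurposed_pairs, t_dim),
        "height":              (t_dim, t_dim + h_dim),
        "width":               (t_dim + h_dim, head_dim),
    }

print(rope_channel_layout())
```

The same layout serves both V2V and I2V: only the camera trajectory fed into the `temporal_camera` slice differs between the two settings.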
Extensive experiments on three state‑of‑the‑art video diffusion backbones (including Wan2.1 and CogVideoX) show that ReRoPE dramatically reduces camera control error (pose deviation and trajectory mismatch) compared to absolute‑pose baselines and recent relative‑pose methods such as PRoPE, BulletTime, and UCPE. At the same time, quantitative image‑quality metrics (FID, LPIPS) and human evaluations indicate that ReRoPE maintains or improves visual quality, confirming that the low‑frequency injection does not corrupt the high‑frequency generative features. Moreover, because no additional parameters or architectural changes are introduced, the computational overhead is negligible.
In summary, ReRoPE demonstrates that the under‑utilized low‑frequency spectrum of RoPE can be safely repurposed to carry relative camera geometry, enabling precise, shift‑invariant camera control in pre‑trained video diffusion models with minimal training cost. This plug‑and‑play approach offers a practical pathway for adding controllable viewpoint capabilities to existing large‑scale video generators, opening new possibilities for interactive media creation and simulation.