World-Shaper: A Unified Framework for 360° Panoramic Editing
Being able to edit panoramic images is crucial for creating realistic 360° visual experiences. However, existing perspective-based image editing methods fail to model the spatial structure of panoramas. Conventional cube-map decompositions attempt to overcome this problem but inevitably break global consistency due to their mismatch with spherical geometry. Motivated by these observations, we reformulate panoramic editing directly in the equirectangular projection (ERP) domain and present World-Shaper, a unified geometry-aware framework that bridges panoramic generation and editing within a single editing-centric design. To overcome the scarcity of paired data, we adopt a generate-then-edit paradigm, where controllable panoramic generation serves as an auxiliary stage to synthesize diverse paired examples for supervised editing learning. To address geometric distortion, we introduce a geometry-aware learning strategy that explicitly enforces position-aware shape supervision and implicitly internalizes panoramic priors through progressive training. Extensive experiments on our new benchmark, PEBench, demonstrate that our method achieves superior geometric consistency, editing fidelity, and text controllability compared to state-of-the-art methods, enabling coherent and flexible 360° visual world creation with unified editing control. Code, model, and data will be released at our project page: https://world-shaper-project.github.io/
💡 Research Summary
World‑Shaper introduces a unified diffusion‑based framework for editing 360° panoramic images directly in the equirectangular projection (ERP) domain. The authors identify two fundamental obstacles that have limited prior work: (1) the severe latitude‑dependent geometric distortion inherent to ERP images, which causes perspective‑based models to fail or cube‑map decompositions to break global consistency, and (2) the scarcity of paired data (source panorama, edit instruction, target panorama) needed for supervised editing. To overcome these challenges, the paper proposes a generate‑then‑edit pipeline combined with geometry‑aware learning.
First, a controllable panoramic generator G is trained to synthesize diverse target panoramas from a source panorama under a rich set of conditions C_gen = {text prompt, bounding boxes, optional reference images}. The generator operates entirely in ERP space, receiving latent encodings of the source image and any reference images, while spatial masks derived from the bounding boxes are down‑sampled to match latent resolution and concatenated with text tokens. Stochastic dropping of condition elements during training forces the model to handle varying levels of guidance.
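The conditioning scheme above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's released code: the latent stride, drop probabilities, and function names are assumptions.

```python
# Hypothetical sketch of the generator's conditioning pipeline:
# bounding boxes are rasterized into an ERP mask and down-sampled to the
# latent resolution; condition elements are stochastically dropped during
# training so the model tolerates varying levels of guidance.
# All constants and names here are assumptions for illustration.
import random

import torch
import torch.nn.functional as F

LATENT_STRIDE = 8  # assumed VAE down-sampling factor


def boxes_to_latent_mask(boxes, height, width):
    """Rasterize axis-aligned (x0, y0, x1, y1) boxes into a binary ERP
    mask, then down-sample with max-pooling so thin boxes survive."""
    mask = torch.zeros(1, 1, height, width)
    for x0, y0, x1, y1 in boxes:
        mask[..., y0:y1, x0:x1] = 1.0
    return F.max_pool2d(mask, kernel_size=LATENT_STRIDE)


def drop_conditions(cond, p_text=0.1, p_boxes=0.3, p_ref=0.5):
    """Independently drop each condition element (classifier-free-
    guidance style); the probabilities are illustrative."""
    out = dict(cond)
    if random.random() < p_text:
        out["text"] = ""
    if random.random() < p_boxes:
        out["boxes"] = []
    if random.random() < p_ref:
        out["reference_images"] = None
    return out
```

In this sketch the down-sampled mask would then be flattened and concatenated with the text tokens before being fed to the backbone.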
Using G, the authors automatically construct a large paired dataset D = {(I_src, P_edit, I_tgt)}. For each source panorama, five edit types are generated: addition, removal, replacement, movement, and modification. Object descriptions and global prompts are produced by GPT‑5, reference images are fetched from the web, and the generator creates the corresponding edited panoramas. The resulting triplets are then automatically annotated with natural‑language edit instructions, yielding a scalable training set without manual labeling.
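The data-construction loop can be expressed as a simple pipeline. In this sketch, `generate_panorama` and `describe_edit` are placeholders for the trained generator G and the GPT-based instruction annotator; their signatures are assumptions, not the paper's API.

```python
# Hypothetical sketch of the generate-then-edit data pipeline: for each
# source panorama and each of the five edit types, the generator produces
# an edited target and an annotator writes the matching instruction,
# yielding (source, instruction, target) triplets without manual labels.
from dataclasses import dataclass

EDIT_TYPES = ["addition", "removal", "replacement", "movement", "modification"]


@dataclass
class EditTriplet:
    source: str       # path to the source ERP panorama
    instruction: str  # natural-language edit instruction
    target: str       # path to the edited panorama produced by G


def build_triplets(sources, generate_panorama, describe_edit):
    """Build the paired dataset D by iterating sources x edit types.

    generate_panorama(src, edit_type) -> target path (stands in for G);
    describe_edit(src, tgt, edit_type) -> instruction string (stands in
    for the GPT-based annotator).
    """
    triplets = []
    for src in sources:
        for edit_type in EDIT_TYPES:
            tgt = generate_panorama(src, edit_type)
            instruction = describe_edit(src, tgt, edit_type)
            triplets.append(EditTriplet(src, instruction, tgt))
    return triplets
```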
Second, the editing model E is trained on D with a geometry‑aware learning strategy. Two complementary mechanisms are introduced: (i) Position‑aware Shape Constraints, which compute latitude‑aware shape masks and inject them into both the loss function and the attention mechanism, ensuring that object boundaries remain consistent across the non‑uniform stretching of ERP; (ii) Progressive Curriculum Training, which starts with global panorama generation tasks and gradually shifts to localized object manipulation. This curriculum lets the network internalize ERP distortion priors before tackling fine‑grained edits, improving both stability and fidelity.
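One way the two mechanisms could be realized is sketched below: a cos(latitude) weight matches the true spherical area of each ERP row, so high-latitude pixels (which are stretched horizontally) do not dominate the shape-constrained loss, and a simple schedule ramps the share of localized-edit samples over training. The exact constraints and schedule in the paper may differ; everything here is an illustrative assumption.

```python
# Minimal sketch of latitude-aware shape supervision and a progressive
# curriculum, assuming cos(latitude) area weighting; not the paper's
# exact formulation.
import math

import torch


def latitude_weight(height, width):
    """Per-pixel weights proportional to cos(latitude) for an ERP image
    of shape (height, width); rows map linearly to latitudes in
    (-pi/2, pi/2), so equator rows get weight ~1 and pole rows ~0."""
    rows = (torch.arange(height, dtype=torch.float32) + 0.5) / height
    lat = (rows - 0.5) * math.pi   # latitude of each row
    w = torch.cos(lat)             # spherical area element
    return w[:, None].expand(height, width)


def shape_aware_loss(pred, target, shape_mask):
    """L2 loss restricted to the edited region (shape_mask), weighted by
    latitude so distortion near the poles does not dominate gradients."""
    w = latitude_weight(*pred.shape[-2:]).to(pred)
    per_pixel = (pred - target) ** 2 * shape_mask * w
    return per_pixel.sum() / (shape_mask * w).sum().clamp(min=1e-6)


def curriculum_mix(step, total_steps, warmup_frac=0.4):
    """Probability of sampling a localized-edit example (vs. a global
    generation example) at a given step: zero during warmup, then a
    linear ramp to one. The 0.4 warmup fraction is an assumption."""
    t = step / total_steps
    if t < warmup_frac:
        return 0.0
    return min(1.0, (t - warmup_frac) / (1.0 - warmup_frac))
```

The same latitude weights could also modulate attention over the shape mask, which is how the summary describes the constraint entering both the loss and the attention mechanism.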
The overall architecture builds on a modern diffusion backbone (MM‑DiT) and incorporates separate image and text encoders. During inference, a user supplies a source ERP panorama and an instruction (text, optional spatial cues). The model produces an edited panorama that respects the instruction while preserving global geometric continuity.
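The inference interface described above amounts to packaging the user's inputs as conditions and sampling once. The wrapper below is a hypothetical illustration; the released model's actual API may look different.

```python
# Hypothetical inference wrapper: a source ERP panorama plus an
# instruction (text, optional box cues) is passed to the editing model's
# sampler. `model.sample` is an assumed interface, not the released API.
def edit_panorama(model, source_erp, instruction, boxes=None):
    """Run the editing model on a source ERP panorama.

    source_erp: image tensor or path; instruction: free-form edit text;
    boxes: optional list of axis-aligned (x0, y0, x1, y1) spatial cues.
    """
    cond = {"text": instruction, "boxes": boxes or []}
    return model.sample(source_erp, cond)
```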
To evaluate the approach, the authors introduce PEBench, a new benchmark comprising synthetic and real‑world ERP panoramas with ground‑truth edit pairs for the five edit categories. Quantitative metrics (FID, LPIPS, Geometry Consistency Score, CLIP‑Score) show that World‑Shaper outperforms state‑of‑the‑art cube‑map based methods such as SE360 and Omni2 by 12‑18% across metrics, and achieves a CLIP‑Score improvement from 0.84 to 0.92 for text‑driven edits. Qualitative results demonstrate seamless object boundaries even at high latitudes, consistent lighting, and coherent scene layout after complex multi‑object edits.
The paper also discusses extensions toward 3D world generation: by feeding depth maps or mesh information alongside ERP images, the framework can be adapted for VR/AR scenarios requiring real‑time scene expansion or dynamic object insertion. Limitations include the current reliance on axis‑aligned bounding boxes and the need for more sophisticated geometric controls (e.g., curved paths).
In summary, World‑Shaper delivers a comprehensive solution for panoramic editing by (1) operating natively in ERP to maintain global consistency, (2) automatically generating large‑scale paired data, (3) enforcing latitude‑aware shape supervision, and (4) employing a progressive curriculum to master distortion‑aware reasoning. The released code, models, and PEBench dataset are expected to catalyze further research in immersive media creation, 3D scene manipulation, and multimodal interaction within 360° environments.