Title: MVRoom: Controllable 3D Indoor Scene Generation with Multi-View Diffusion Models
ArXiv ID: 2512.04248
Date: 2025-12-03
Authors: Shaoheng Fang, Chaohui Yu, Fan Wang, Qixing Huang
📝 Abstract
We introduce MVRoom, a controllable novel view synthesis (NVS) pipeline for 3D indoor scenes that uses multi-view diffusion conditioned on a coarse 3D layout. MVRoom employs a two-stage design in which the 3D layout is used throughout to enforce multi-view consistency. The first stage employs novel representations to effectively bridge the 3D layout and consistent image-based condition signals for multi-view generation. The second stage performs image-conditioned multi-view generation, incorporating a layout-aware epipolar attention mechanism to enhance multi-view consistency during the diffusion process. Additionally, we introduce an iterative framework that generates 3D scenes with varying numbers of objects and scene complexities by recursively performing multi-view generation (MVRoom), supporting text-to-scene generation. Experimental results demonstrate that our approach achieves high-fidelity and controllable 3D scene generation for NVS, outperforming state-of-the-art baseline methods both quantitatively and qualitatively. Ablation studies further validate the effectiveness of key components within our generation pipeline.
📄 Full Content
MVRoom: Controllable 3D Indoor Scene Generation with Multi-View Diffusion Models
Shaoheng Fang1*, Chaohui Yu2,3, Fan Wang2, Qixing Huang1
1The University of Texas at Austin, 2DAMO Academy, Alibaba Group, 3Hupan Lab
*Work done during internship at DAMO Academy, Alibaba Group
1. Introduction
Creating high-quality 3D content is crucial for immersive applications in augmented reality (AR), virtual reality (VR), gaming, and filmmaking. Yet producing detailed 3D assets remains challenging and labor-intensive, often requiring professional skills and tools. Recent advances in generative modeling enable text-to-3D object synthesis, streamlining creation [20, 32]. However, generating 3D scenes is more complex, as it involves creating numerous objects arranged within complex spatial layouts, and the generated scenes must exhibit realism to ensure authenticity and user engagement.
Several methods have recently been proposed to address 3D scene generation [7, 9, 14, 23, 31, 38, 56–58]. Composition approaches first generate objects via text-to-3D techniques and then assemble a full scene [7, 57, 58]. While these methods leverage the strengths of 3D object generation, they frequently fall short in overall scene realism. Other methods adopt incremental generation frameworks: they construct 3D indoor environments by sequentially synthesizing different viewpoints frame by frame and reconstructing room meshes from these images [14, 31, 56]. However, these methods often suffer from error accumulation and weak layout control, making it difficult to ensure a coherent spatial arrangement.
To address these limitations, we propose MVRoom, a novel approach for multi-view-based 3D scene generation conditioned on initial scene image(s) and a coarse 3D layout (oriented object bounding boxes with class labels). The 3D layout, which can be obtained from user input or from off-the-shelf 3D layout generative models, provides a flexible and easily editable representation that serves as the essential guidance of our framework. Moreover, the initial images can be generated using text-to-image models (see Figure 5 and the accompanying video).
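The coarse layout used as conditioning is simply a set of labeled, oriented 3D boxes. As a concrete reference, the sketch below shows one plausible in-memory representation of such a layout; the class names (`OrientedBox`, `SceneLayout`), the choice of z as the up axis, and the optional room extents are illustrative assumptions rather than the paper's actual data format.

```python
from dataclasses import dataclass, field
import numpy as np


@dataclass
class OrientedBox:
    """One layout element: an oriented 3D bounding box with a class label.

    Illustrative structure; the paper only specifies oriented object
    bounding boxes with class labels as the coarse layout.
    """
    label: str          # e.g. "bed", "nightstand"
    center: np.ndarray  # (3,) box center in world coordinates
    size: np.ndarray    # (3,) extents along the box's local axes
    yaw: float          # rotation about the up axis (assumed z here), in radians

    def corners(self) -> np.ndarray:
        """Return the 8 box corners in world coordinates, shape (8, 3)."""
        signs = np.array([[sx, sy, sz]
                          for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)], dtype=float)
        local = signs * (self.size / 2.0)
        c, s = np.cos(self.yaw), np.sin(self.yaw)
        rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])  # yaw about z
        return local @ rot.T + self.center


@dataclass
class SceneLayout:
    """A coarse 3D scene layout: labeled oriented boxes plus (optional) room extents."""
    boxes: list = field(default_factory=list)  # list[OrientedBox]
    room_size: np.ndarray = None               # (3,) room width / depth / height, if known


# Example: a single bed in a 5 m x 4 m room.
layout = SceneLayout(
    boxes=[OrientedBox("bed", center=np.array([2.0, 1.5, 0.4]),
                       size=np.array([2.0, 1.6, 0.8]), yaw=0.0)],
    room_size=np.array([5.0, 4.0, 2.8]),
)
```

Rendering each box into a target camera (e.g., rasterizing its faces) is one natural way to turn this purely geometric description into the per-view semantic and depth condition signals described next.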
MVRoom has two stages. The first stage bridges the 3D layout to image-based condition signals for multi-view generation. To achieve this, we introduce a novel image-based representation that preserves 3D layout information and enables cross-view feature aggregation during generation. This representation includes hybrid layout priors, incorporating multi-layer semantics and depth conditions, along with layout spatial embeddings. In the second stage, we employ an image-conditioned multi-view generative model that produces views covering the underlying scene, supporting whole-scene generation. In addition to conditioning on rich signals derived from the 3D layout, we introduce a layout-aware epipolar attention layer to effectively fuse cross-view features. This approach ensures accurate feature alignment across perspectives, greatly enhancing multi-view consistency in the generated images, a critical issue in multi-view-based 3D generation.
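The paper does not spell out the layout-aware epipolar attention layer at this point. As a rough reference, the sketch below builds a plain epipolar attention mask between two views with known pinhole intrinsics and relative pose, restricting each patch token in one view to tokens near its epipolar line in the other. How the 3D layout additionally modulates this attention is the paper's contribution and is not reproduced here; the function names, patch size, and distance threshold are illustrative assumptions.

```python
import torch


def skew(t: torch.Tensor) -> torch.Tensor:
    """Skew-symmetric matrix [t]_x such that [t]_x @ v == cross(t, v)."""
    t0, t1, t2 = float(t[0]), float(t[1]), float(t[2])
    return torch.tensor([[0.0, -t2,  t1],
                         [ t2, 0.0, -t0],
                         [-t1,  t0, 0.0]])


def fundamental_matrix(K1, K2, R, t):
    """F such that x2^T F x1 = 0, where (R, t) map view-1 coordinates into view 2."""
    E = skew(t) @ R                                    # essential matrix [t]_x R
    return torch.inverse(K2).T @ E @ torch.inverse(K1)


def epipolar_attention_mask(K1, K2, R, t, image_hw, patch=16, thresh=1.5):
    """Boolean mask of shape (N1, N2) over patch tokens of two views.

    Entry (i, j) is True if token j in view 2 lies within `thresh` patch
    widths of the epipolar line induced by token i in view 1. Both views
    are assumed to share the same resolution and token grid.
    """
    H, W = image_hw
    F = fundamental_matrix(K1, K2, R, t)
    # Homogeneous pixel coordinates of patch centers on a (H/patch) x (W/patch) grid.
    ys, xs = torch.meshgrid(torch.arange(patch / 2, H, patch),
                            torch.arange(patch / 2, W, patch), indexing="ij")
    xs, ys = xs.reshape(-1), ys.reshape(-1)
    pts = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1)   # (N, 3)
    lines = pts @ F.T                                          # epipolar lines in view 2, (N1, 3)
    # Point-to-line distance |a x + b y + c| / sqrt(a^2 + b^2) for every (line, token) pair.
    num = torch.abs(lines @ pts.T)                             # (N1, N2)
    den = torch.linalg.norm(lines[:, :2], dim=-1, keepdim=True).clamp(min=1e-8)
    return (num / den) < (thresh * patch)
```

Inside a multi-view diffusion U-Net, a mask like this can be applied as a large negative bias on the cross-view attention logits, so that features are aggregated only along geometrically consistent directions.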
In addition, to generate a complete indoor scene, we adopt an iterative framework that recursively produces multi-view content according to the 3D scene layout. Our framework takes as input the 3D layout and one or more
Figure 1. We introduce MVRoom, an indoor scene generation pipeline utilizing multi-view diffusion models. Given a 3D layout and an initial image (generated from a text description), MVRoom uses conditional layout-aware multi-view diffusion models to generate consistent novel views along continuous camera trajectories within the 3D scene. The consistent views are fed into a 3D-GS pipeline for scene reconstruction and novel-view