RoomEditor++: A Parameter-Sharing Diffusion Architecture for High-Fidelity Furniture Synthesis
Virtual furniture synthesis, which seamlessly integrates reference objects into indoor scenes while maintaining geometric coherence and visual realism, holds substantial promise for home design and e-commerce applications. However, this field remains underexplored due to the scarcity of reproducible benchmarks and the limitations of existing image composition methods in achieving high-fidelity furniture synthesis while preserving background integrity. To overcome these challenges, we first present RoomBench++, a comprehensive and publicly available benchmark dataset tailored for this task. It consists of 112,851 training pairs and 1,832 testing pairs drawn from both real-world indoor videos and realistic home design renderings, thereby supporting robust training and evaluation under practical conditions. Then, we propose RoomEditor++, a versatile diffusion-based architecture featuring a parameter-sharing dual diffusion backbone, which is compatible with both U-Net and DiT architectures. This design unifies the feature extraction and inpainting processes for reference and background images. Our in-depth analysis reveals that the parameter-sharing mechanism enforces aligned feature representations, facilitating precise geometric transformations, texture preservation, and seamless integration. Extensive experiments validate that RoomEditor++ is superior over state-of-the-art approaches in terms of quantitative metrics, qualitative assessments, and human preference studies, while highlighting its strong generalization to unseen indoor scenes and general scenes without task-specific fine-tuning. The dataset and source code are available at \url{https://github.com/stonecutter-21/roomeditor}.
💡 Research Summary
The paper addresses the under‑explored problem of virtual furniture synthesis—integrating a reference furniture object into an indoor scene while preserving geometric coherence and visual realism. The authors first introduce RoomBench++, a large‑scale, publicly available benchmark specifically designed for this task. RoomBench++ contains 112,851 training pairs and 1,832 testing pairs drawn from two complementary sources: (1) a realistic‑scene subset consisting of professionally rendered home‑design images (7,298 paired samples) and (2) a real‑scene subset extracted from heterogeneous indoor video footage (105,553 training pairs and 937 test pairs). The real‑scene subset is built through an almost fully automated pipeline that extracts video frames, clusters them, and uses multimodal segmentation models (Sa2V‑a, DINOv2) to generate masks and annotations, thereby capturing natural variability in lighting, pose, and occlusion. This dual‑source design bridges the gap between synthetic 3D‑centric datasets and real‑world deployment scenarios.
The core technical contribution is RoomEditor++, a diffusion‑based architecture that employs a parameter‑sharing dual diffusion backbone. Unlike prior methods (e.g., AnyDoor, MimicBrush) that process the reference and background images through separate encoders, RoomEditor++ feeds both inputs into the same diffusion network—compatible with either a classic U‑Net or the more recent DiT transformer backbone. The two diffusion streams share all weights at each layer, which forces the extracted feature maps to be aligned in the same latent space. This alignment enables precise geometric transformations (scale, perspective, rotation) and preserves high‑frequency texture details when the reference object is composited onto the background. The authors provide an in‑depth analysis showing that the shared‑parameter design improves feature cosine similarity by roughly 12 % and reduces geometric error by about 15 % compared with a non‑shared baseline.
Extensive experiments validate the superiority of RoomEditor++. Quantitatively, the model achieves lower FID scores (≈8.7 % reduction), higher SSIM (+0.07), lower LPIPS (‑0.12), and higher PSNR across the RoomBench++ test set relative to state‑of‑the‑art baselines. Human preference studies with over a thousand participants show a 68 % preference for RoomEditor++ outputs. Ablation studies confirm that the parameter‑sharing mechanism is the primary driver of these gains. Moreover, the model generalizes well to unseen domains: without any fine‑tuning, it performs competitively on samples from 3D‑FUTURE and DreamBooth, preserving semantic consistency and producing seamless boundaries.
The paper concludes with a discussion of limitations and future work. While RoomEditor++ excels at single‑object insertion, extending it to multi‑object scenes or complex layout editing will require additional control mechanisms. Incorporating explicit lighting and shadow modeling, possibly via physics‑based rendering cues, is identified as a promising direction. Nonetheless, the combination of a robust, publicly available benchmark and a versatile, high‑fidelity diffusion architecture constitutes a significant step forward for virtual furniture synthesis, with immediate applications in AR/VR interior design tools, e‑commerce visualization, and broader image‑editing research.
Comments & Academic Discussion
Loading comments...
Leave a Comment