SceneLinker: Compositional 3D Scene Generation via Semantic Scene Graph from RGB Sequences
We introduce SceneLinker, a novel framework that generates compositional 3D scenes via semantic scene graphs from RGB sequences. To adapt Mixed Reality (MR) content to each user's space, it is essential to generate a 3D scene that reflects the real-world layout by compactly capturing the semantic cues of the surroundings. Prior works either struggled to fully capture the contextual relationships between objects or focused mainly on synthesizing diverse shapes, making it difficult to generate 3D scenes aligned with real object arrangements. We address these challenges by designing a graph network with cross-check feature attention for scene graph prediction and by constructing a graph variational autoencoder (graph-VAE) with a joint shape-and-layout block for 3D scene generation. Experiments on the 3RScan/3DSSG and SG-FRONT datasets demonstrate that our approach outperforms state-of-the-art methods in both quantitative and qualitative evaluations, even in complex indoor environments and under challenging scene graph constraints. Our work enables users to generate consistent 3D spaces from their physical environments via scene graphs, allowing them to create spatial MR content. The project page is available at https://scenelinker2026.github.io.
💡 Research Summary
SceneLinker presents a two‑stage framework that converts a sequence of RGB images into a compositional 3‑D scene aligned with the real‑world layout, targeting mixed‑reality (MR) applications where virtual content must adapt to a user’s physical environment.
In the first stage, the method builds a global 3‑D semantic scene graph from incremental RGB‑based SLAM outputs. ORB‑SLAM3 extracts keyframes and sparse point clouds, which feed two complementary graphs: an entity‑visibility graph that aggregates multi‑view image features (ResNet‑18) and point features (PointNet) via a learnable sigmoid gate, and a neighbor graph that defines spatial adjacency through oriented bounding‑box collision detection. Edge features encode the relative pose and axis‑wise extrema derived from bounding‑box corners.
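The sigmoid‑gated fusion of the two modalities on each entity node can be sketched as below. This is a minimal illustration, not the paper's exact parameterization: conditioning the gate on the concatenated features, and the weight names `w_gate`/`b_gate` and feature dimensions, are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(img_feat, pts_feat, w_gate, b_gate):
    """Fuse per-entity multi-view image features (e.g. from ResNet-18) and
    point features (e.g. from PointNet) with a sigmoid gate. Conditioning
    the gate on both modalities is an assumption for this sketch."""
    g = sigmoid(np.concatenate([img_feat, pts_feat]) @ w_gate + b_gate)
    return g * img_feat + (1.0 - g) * pts_feat  # elementwise convex mix

rng = np.random.default_rng(0)
d = 256                                        # hypothetical feature dim
img = rng.standard_normal(d)                   # image-branch features
pts = rng.standard_normal(d)                   # point-branch features
w_gate = rng.standard_normal((2 * d, d)) * 0.01
b_gate = np.zeros(d)
fused = gated_fusion(img, pts, w_gate, b_gate)
print(fused.shape)  # (256,)
```

Because the gate lies in (0, 1), each fused coordinate is a convex combination of the two modalities, so neither branch can dominate unboundedly.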
The core contribution here is the Cross‑Check Feature Attention (CCFA) network. Unlike prior attention mechanisms that are biased toward either node or edge features, CCFA cross‑checks the features of adjacent nodes, computes similarity scores, and refines the attention weights accordingly. This yields robust relationship inference (e.g., “close‑by”, “symmetrical”, “taller than”) even under occlusions and limited viewpoints, producing a coherent global scene graph.
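The cross‑check idea can be illustrated with a stripped‑down sketch: pairwise similarity between adjacent node features sets the attention weights used for aggregation, rather than attending over node or edge features alone. The learned query/key projections and the edge‑feature pathway of the full CCFA network are omitted here, and all shapes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_check_attention(nodes, adj):
    """Toy cross-check attention: similarity between features of adjacent
    nodes is cross-checked and normalized into attention weights, which
    then aggregate neighbor features. Learned projections are omitted."""
    d = nodes.shape[1]
    sim = nodes @ nodes.T / np.sqrt(d)   # cross-checked similarity scores
    sim = np.where(adj > 0, sim, -1e9)   # restrict to adjacent node pairs
    attn = softmax(sim, axis=1)          # refined attention weights
    return attn, attn @ nodes            # weights and aggregated features

rng = np.random.default_rng(0)
n, d = 5, 64                             # hypothetical graph size / dim
nodes = rng.standard_normal((n, d))
# Chain-shaped adjacency with self-loops, just for demonstration
adj = np.eye(n) + np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
attn, refined = cross_check_attention(nodes, adj)
print(refined.shape)  # (5, 64)
```

Masking non‑adjacent pairs before the softmax keeps the attention consistent with the neighbor graph from the previous stage.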
The second stage feeds the predicted graph and per‑node shape priors (DeepSDF latent codes) into a graph‑variational auto‑encoder (graph‑VAE). Object and relationship embeddings are further enriched by a pre‑trained CLIP vision‑language model. A Joint Shape‑and‑Layout (JSL) block fuses shape latent vectors with bounding‑box parameters, ensuring that generated objects respect the spatial layout while allowing diverse geometry synthesis. The VAE backbone enables fast inference compared with diffusion‑based generators, making the system suitable for real‑time MR scenarios.
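How the JSL block might combine the two per‑object signals can be shown with a toy fusion. The 7‑D box parameterization (size, centroid, yaw), the tanh mixing, and all dimensions are assumptions for illustration, not the paper's exact operators.

```python
import numpy as np

def jsl_fuse(z_shape, bbox, w_s, w_b, w_out):
    """Toy Joint Shape-and-Layout fusion: project the shape latent and the
    box parameters into a shared hidden space, then decode one fused code
    per object so geometry and layout are generated jointly."""
    h = np.tanh(z_shape @ w_s + bbox @ w_b)  # joint shape-layout state
    return h @ w_out                         # fused per-object code

rng = np.random.default_rng(1)
n_obj, d_shape, d_box, d_h = 4, 128, 7, 64   # hypothetical dimensions
z = rng.standard_normal((n_obj, d_shape))    # DeepSDF-style shape latents
boxes = rng.standard_normal((n_obj, d_box))  # assumed size/centroid/yaw
w_s = rng.standard_normal((d_shape, d_h)) * 0.05
w_b = rng.standard_normal((d_box, d_h)) * 0.05
w_out = rng.standard_normal((d_h, d_h)) * 0.05
fused = jsl_fuse(z, boxes, w_s, w_b, w_out)
print(fused.shape)  # (4, 64)
```

Fusing the two signals before decoding is what lets generated geometry respect the predicted layout, rather than sampling shape and placement independently.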
Experiments on the 3RScan/3DSSG and SG‑FRONT datasets demonstrate that SceneLinker outperforms state‑of‑the‑art methods in both graph prediction and scene generation. Relationship‑specific recall improves by over 7% for “close‑by” and 14% for “symmetrical” relations. Quantitative metrics such as mAP, IoU, and FID show significant gains, and qualitative results confirm accurate reconstruction of complex indoor layouts with multiple objects.
Overall, SceneLinker advances MR content creation by (1) leveraging multimodal SLAM data for robust 3‑D entity extraction, (2) introducing cross‑check attention to enhance scene‑graph consistency, and (3) integrating shape and layout learning within a graph‑VAE for fast, layout‑consistent 3‑D scene synthesis. Future work may explore dynamic graph updates for moving objects and scaling to larger shape repositories, further bridging the gap between physical spaces and immersive virtual experiences.