GSR: Learning Structured Reasoning for Embodied Manipulation

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Despite rapid progress, embodied agents still struggle with long-horizon manipulation that requires maintaining spatial consistency, causal dependencies, and goal constraints. A key limitation of existing approaches is that task reasoning is implicitly embedded in high-dimensional latent representations, making it challenging to separate task structure from perceptual variability. We introduce Grounded Scene-graph Reasoning (GSR), a structured reasoning paradigm that explicitly models world-state evolution as transitions over semantically grounded scene graphs. By reasoning step-wise over object states and spatial relations, rather than directly mapping perception to actions, GSR enables explicit reasoning about action preconditions, consequences, and goal satisfaction in a physically grounded space. To support learning such reasoning, we construct Manip-Cognition-1.6M, a large-scale dataset that jointly supervises world understanding, action planning, and goal interpretation. Extensive evaluations across RLBench, LIBERO, GSR-benchmark, and real-world robotic tasks show that GSR significantly improves zero-shot generalization and long-horizon task completion over prompting-based baselines. These results highlight explicit world-state representations as a key inductive bias for scalable embodied reasoning.


💡 Research Summary

The paper introduces Grounded Scene‑graph Reasoning (GSR), a novel framework that separates perception, reasoning, and control for embodied manipulation by using semantically grounded scene graphs as the explicit world state. Raw RGB‑D observations are processed by a vision foundation model to produce a 3D scene graph composed of object nodes (with functional keypoints and articulated parts) and relational edges (on, inside, adjacent, etc.). This graph serves as the input to a large language model (LLM) based on Qwen‑3‑8B, which performs step‑wise commonsense reasoning over the graph and outputs the next atomic action.
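The node-and-edge structure described above can be sketched as a minimal data structure that linearizes the graph into text for the LLM. This is an illustrative sketch, not the paper's actual implementation; the class and method names are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class ObjectNode:
    """An object in the scene; functional keypoints are an assumed schema."""
    name: str
    keypoints: dict = field(default_factory=dict)  # e.g. {"handle": (x, y, z)}

@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)  # name -> ObjectNode
    edges: set = field(default_factory=set)    # (subject, relation, object) triples

    def add_node(self, node: ObjectNode) -> None:
        self.nodes[node.name] = node

    def add_edge(self, subj: str, rel: str, obj: str) -> None:
        self.edges.add((subj, rel, obj))

    def to_prompt(self) -> str:
        """Linearize the graph into text that a language model can consume."""
        header = "Objects: " + ", ".join(sorted(self.nodes))
        relations = [f"({s}, {r}, {o})" for s, r, o in sorted(self.edges)]
        return header + "\n" + "\n".join(relations)

# Build a tiny example scene and serialize it.
g = SceneGraph()
g.add_node(ObjectNode("mug", keypoints={"handle": (0.1, 0.2, 0.3)}))
g.add_node(ObjectNode("table"))
g.add_edge("mug", "on", "table")
print(g.to_prompt())
```

Keeping relations as explicit triples is what makes the world state language-compatible: the LLM reasons over a compact symbolic description rather than raw pixels.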

To train GSR, the authors construct a massive dataset called Manip‑Cognition‑1.6M, which contains three supervised components: (1) world‑understanding pairs (text → scene graph), (2) action‑planning triples (forward action reasoning, world modeling of edge changes, and goal‑conditioned planning), and (3) goal‑interpretation samples (current graph + natural‑language goal → target goal graph). After extensive augmentation, the dataset totals 1.6 million samples covering 6,000 manipulation trajectories.
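The three supervision components can be illustrated with toy samples. All field names and contents below are assumptions for illustration, not the dataset's actual schema.

```python
# (1) world understanding: text description -> scene graph
world_understanding = {
    "input": "A mug sits on the table next to a closed drawer.",
    "target": {
        "nodes": ["mug", "table", "drawer"],
        "edges": [("mug", "on", "table"),
                  ("mug", "adjacent", "drawer"),
                  ("drawer", "state", "closed")],
    },
}

# (2) action planning: current graph + action -> predicted edge changes
action_planning = {
    "current_graph": world_understanding["target"],
    "action": "open(drawer)",
    "edge_changes": {
        "add": [("drawer", "state", "open")],
        "remove": [("drawer", "state", "closed")],
    },
}

# (3) goal interpretation: current graph + language goal -> target goal graph
goal_interpretation = {
    "current_graph": world_understanding["target"],
    "goal_text": "Put the mug inside the drawer.",
    "goal_graph": {
        "nodes": ["mug", "table", "drawer"],
        "edges": [("mug", "inside", "drawer")],
    },
}

for sample in (world_understanding, action_planning, goal_interpretation):
    print(sorted(sample.keys()))
```

Supervising all three jointly is what lets one model map text to graphs, predict state transitions from actions, and infer a goal configuration from language.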

Training proceeds in two stages. First, supervised fine‑tuning (SFT) with LoRA adapts the LLM to the structured inputs, teaching it to map graphs to actions, predict state transitions, and infer final goal configurations. Second, reinforcement fine‑tuning (RFT) uses Group Relative Policy Optimization (GRPO) to align the model with execution constraints. Three custom reward terms penalize (a) multi‑action outputs (step‑wise constraint), (b) hallucinated objects (scene‑graph grounding), and (c) premature or missing termination (termination reward).
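The three reward terms can be sketched as a single scoring function over a model rollout. This is a hedged sketch: the weights, action format, and exact penalty formulas are assumptions, not the paper's definitions.

```python
def execution_reward(output_actions, scene_objects, is_goal_state):
    """Score one rollout against the three execution constraints.

    output_actions: list of dicts like {"name": "pick", "args": ["mug"]}
    scene_objects:  set of object names present in the current scene graph
    is_goal_state:  whether the current graph already satisfies the goal
    """
    r = 0.0

    # (a) step-wise constraint: penalize emitting more than one atomic action
    if len(output_actions) > 1:
        r -= 1.0

    # (b) scene-graph grounding: penalize references to hallucinated objects
    for act in output_actions:
        if any(arg not in scene_objects for arg in act.get("args", [])):
            r -= 1.0

    # (c) termination reward: "done" should be emitted exactly when the goal holds
    emits_done = any(act.get("name") == "done" for act in output_actions)
    r += 1.0 if emits_done == is_goal_state else -1.0

    return r

# A well-formed single action on a real object, goal not yet reached:
print(execution_reward([{"name": "pick", "args": ["mug"]}], {"mug", "table"}, False))
```

Under GRPO, rewards like this are compared across a group of sampled rollouts, so the model is pushed toward single-step, grounded, correctly terminated outputs without a learned value model.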

The authors evaluate GSR on three fronts: (i) open‑source benchmarks (230 tasks from RLBench and LIBERO), (ii) a dedicated GSR‑benchmark (180 long‑horizon tasks emphasizing object disambiguation, spatial sequencing, and goal generalization), and (iii) real‑world robot experiments with a UR5e arm and a meta‑skill library for low‑level control. Compared to strong baselines—end‑to‑end models (RT‑2, OpenVLA, π0) and prompting‑based spatial reasoning methods (VoxPoser, ReKep, ConceptGraphs, ENACT)—GSR achieves substantially higher zero‑shot success rates (an average gain of 23 percentage points) and markedly better performance on tasks requiring more than ten sequential steps (a gain of 35 percentage points). Object‑hallucination errors drop by over 80%, and termination errors decrease by more than 70%.

Ablation studies confirm that the explicit scene‑graph representation provides a robust inductive bias separating task structure from visual variability, while the two‑stage training mitigates reasoning artifacts observed after SFT alone. The paper also discusses limitations: reliance on the vision front‑end for accurate graph extraction (especially for transparent or reflective objects), dependence on a predefined meta‑skill library for low‑level execution, and the growing context length required for large graphs, which may exceed the LLM’s token window. Future directions include improving graph extraction robustness, learning new skills automatically, graph compression techniques, and extending the framework to multi‑robot collaboration.

In summary, GSR demonstrates that grounding manipulation in a structured, language‑compatible world model enables more reliable causal reasoning, better generalization, and higher success on long‑horizon embodied tasks, positioning explicit scene‑graph representations as a key building block for scalable robot intelligence.

