Information-Theoretic Graph Fusion with Vision-Language-Action Model for Policy Reasoning and Dual Robotic Control
Teaching robots dexterous skills from human videos remains challenging due to the reliance on low-level trajectory imitation, which fails to generalize across object types, spatial layouts, and manipulator configurations. We propose Graph-Fused Vision-Language-Action (GF-VLA), a framework that enables dual-arm robotic systems to perform task-level reasoning and execution directly from RGB-D human demonstrations. GF-VLA first extracts Shannon-information-based cues to identify the hands and objects with the highest task relevance, then encodes these cues into temporally ordered scene graphs that capture both hand-object and object-object interactions. These graphs are fused by a language-conditioned transformer that generates hierarchical behavior trees and interpretable Cartesian motion commands. To improve execution efficiency in bimanual settings, we further introduce a cross-hand selection policy that infers optimal gripper assignment without explicit geometric reasoning. We evaluate GF-VLA on four structured dual-arm block assembly tasks involving symbolic shape construction and spatial generalization. Experimental results show that the information-theoretic scene representation achieves over 95 percent graph accuracy and 93 percent subtask segmentation accuracy, supporting the LLM planner in generating reliable and human-readable task policies. When executed by the dual-arm robot, these policies yield 94 percent grasp success, 89 percent placement accuracy, and 90 percent overall task success across stacking, letter-building, and geometric reconfiguration scenarios, demonstrating strong generalization and robustness across diverse spatial and semantic variations.
💡 Research Summary
This paper introduces GF-VLA (Graph-Fused Vision-Language-Action), a novel framework designed to enable robotic systems, particularly dual-arm manipulators, to learn and execute complex manipulation tasks directly from single human demonstrations. The core challenge addressed is the limitation of traditional low-level trajectory imitation methods, which fail to generalize across variations in object types, spatial layouts, and robot morphologies.
The GF-VLA framework operates through a sophisticated pipeline that bridges high-level semantic reasoning with low-level physical understanding. It begins by processing RGB-D human demonstration videos. A perception module, utilizing models like SAM2 and hand pose estimators, segments and tracks hands and objects in each frame. The key innovation lies in the subsequent information-theoretic processing. Instead of treating all visual data equally, the system calculates entropy and mutual information over temporal windows of the trajectories of detected elements (hands and objects). This analysis automatically identifies the most “task-relevant” or “active” components—those with high dynamical variation and strong interaction with others—based purely on statistical measures of the data itself.
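To make the information-theoretic scoring concrete, here is a minimal sketch of how entropy and mutual information could be computed over windows of tracked 2D trajectories to rank task-relevant elements. The function names, binning scheme, and bin size are illustrative assumptions, not the paper's implementation:

```python
# Hypothetical sketch: entropy of a tracked element's motion, and mutual
# information between two elements, over a temporal window of discretized
# positions. High entropy = strong dynamical variation; high MI = strong
# statistical coupling (a proxy for interaction).
import math
from collections import Counter

def discretize(traj, bin_size=0.05):
    """Quantize (x, y) positions into grid cells (assumed binning scheme)."""
    return [(round(x / bin_size), round(y / bin_size)) for x, y in traj]

def entropy(traj, bin_size=0.05):
    """Shannon entropy (bits) of a trajectory's cell-occupancy distribution."""
    cells = discretize(traj, bin_size)
    n = len(cells)
    return -sum((c / n) * math.log2(c / n) for c in Counter(cells).values())

def mutual_information(traj_a, traj_b, bin_size=0.05):
    """I(A;B) = H(A) + H(B) - H(A,B) between two tracked elements."""
    cells_a = discretize(traj_a, bin_size)
    cells_b = discretize(traj_b, bin_size)
    joint = list(zip(cells_a, cells_b))
    n = len(joint)
    h_joint = -sum((c / n) * math.log2(c / n) for c in Counter(joint).values())
    return entropy(traj_a, bin_size) + entropy(traj_b, bin_size) - h_joint
```

Under this scheme, a stationary distractor scores zero entropy and near-zero mutual information with the hand, while an actively manipulated object scores high on both, which is exactly the "active component" signal the pipeline relies on.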
These information-theoretic cues are then used to construct temporally ordered scene graphs. In each keyframe, hands and objects become nodes, and the edges between them encode their spatial relationships and the calculated informational flow. This creates a structured, symbolic representation that explicitly captures the dynamics of hand-object and object-object interactions over time, effectively segmenting the demonstration into logical subtasks.
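A minimal sketch of such a scene-graph representation, and of segmenting subtasks at keyframes where the interaction structure changes, might look as follows. The field names, relation labels, and boundary criterion are illustrative assumptions:

```python
# Sketch of a per-keyframe scene graph: hands/objects as nodes, typed edges
# carrying a spatial relation and an interaction score. Subtask boundaries
# are placed wherever the edge set changes between consecutive keyframes
# (e.g. a new grasp, release, or contact event).
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    frame: int                                  # keyframe index
    nodes: set = field(default_factory=set)     # element ids, e.g. "right_hand"
    edges: dict = field(default_factory=dict)   # (src, dst) -> (relation, score)

    def add_edge(self, src, dst, relation, score):
        self.nodes |= {src, dst}
        self.edges[(src, dst)] = (relation, score)

def segment_subtasks(graphs):
    """Return keyframe indices where a new subtask begins."""
    boundaries = [graphs[0].frame]
    for prev, curr in zip(graphs, graphs[1:]):
        if set(prev.edges) != set(curr.edges):
            boundaries.append(curr.frame)
    return boundaries
```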
This sequence of scene graphs, representing the physical "how" of the task, is fused with the natural language instruction, representing the "what". The combined input is fed into a language-conditioned Vision-Language-Action (VLA) transformer model. Crucially, providing the model with this pre-structured graph representation rather than raw pixels reduces its reasoning burden, enabling more reliable and interpretable planning. The model generates a hierarchical behavior tree and translates it into executable Cartesian motion commands. To enhance transparency, the framework incorporates Chain-of-Thought (CoT) prompting, requiring the LLM planner to articulate intermediate reasoning steps before generating the final actions.
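As a rough illustration of the final stage, a hierarchical behavior tree can be flattened depth-first into a sequence of Cartesian primitives. The node structure, primitive names, and coordinates below are hypothetical, not the paper's interface:

```python
# Hedged sketch: a behavior tree whose leaves carry executable Cartesian
# primitives; interior nodes are sequences executed left-to-right.
from dataclasses import dataclass, field
from typing import List

@dataclass
class BTNode:
    name: str
    children: List["BTNode"] = field(default_factory=list)
    command: str = ""   # leaves carry one motion primitive

def flatten(node):
    """Depth-first traversal producing the ordered command list."""
    if not node.children:
        return [node.command] if node.command else []
    cmds = []
    for child in node.children:
        cmds.extend(flatten(child))
    return cmds

# Hypothetical plan for one pick-and-place subtask of a stacking task.
tree = BTNode("stack_blocks", [
    BTNode("pick_A", [
        BTNode("move", command="move_to(x=0.30, y=0.10, z=0.05)"),
        BTNode("grasp", command="close_gripper()"),
    ]),
    BTNode("place_A", [
        BTNode("move", command="move_to(x=0.45, y=0.10, z=0.10)"),
        BTNode("release", command="open_gripper()"),
    ]),
])
```

Keeping the plan in this tree form is what makes the policy human-readable: each subtree corresponds to one subtask segmented from the demonstration.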
A significant contribution for practical bimanual manipulation is the cross-hand selection policy. This component infers which gripper (left or right) should manipulate which object to optimize overall task efficiency, without resorting to explicit geometric reasoning; instead, it leverages the contextual understanding encoded in the scene graphs and the task description.
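One plausible (and deliberately simplified) realization of such a context-driven policy: assign each object to the gripper matching the human hand that interacted with it most in the demonstration's scene graphs, falling back to workload balancing when there is no evidence. The exact policy in the paper may differ; everything below is an assumption for illustration:

```python
# Illustrative cross-hand assignment from demonstration context rather than
# explicit geometry: count hand-object interaction edges mined from the
# scene graphs, then assign each object to its dominant hand.
from collections import defaultdict

def assign_grippers(interactions, objects):
    """interactions: list of (hand, obj) pairs from scene-graph edges.
    Returns {obj: "left" | "right"}."""
    counts = defaultdict(lambda: {"left": 0, "right": 0})
    for hand, obj in interactions:
        counts[obj][hand] += 1
    load = {"left": 0, "right": 0}
    assignment = {}
    for obj in objects:
        c = counts[obj]
        if c["left"] != c["right"]:
            hand = "left" if c["left"] > c["right"] else "right"
        else:  # no demonstration evidence: balance gripper workload
            hand = min(load, key=load.get)
        assignment[obj] = hand
        load[hand] += 1
    return assignment
```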
The framework was rigorously evaluated on four structured dual-arm block assembly tasks involving stacking, letter-building, and geometric reconfiguration. The results were highly promising: the information-theoretic scene representation achieved over 95% graph accuracy and 93% subtask segmentation accuracy. When the resulting policies were executed on a real dual-arm robot, they yielded a 94% grasp success rate, 89% placement accuracy, and an overall task success rate of 90% across diverse scenarios. These results demonstrate that GF-VLA possesses strong generalization capabilities and robustness to spatial and semantic variations.
In conclusion, GF-VLA represents a major step towards robust and generalizable robot learning from demonstration. Through its principled integration of information theory for salient feature extraction, graph-based representations for structured physical reasoning, and modern VLA models for semantic task planning, it offers a powerful and interpretable paradigm for bridging the gap between human demonstration and robotic execution in complex, contact-rich manipulation tasks.