BINDER: Instantly Adaptive Mobile Manipulation with Open-Vocabulary Commands

Reading time: 5 minutes

📝 Original Info

  • Title: BINDER: Instantly Adaptive Mobile Manipulation with Open-Vocabulary Commands
  • ArXiv ID: 2511.22364
  • Date: 2025-11-27
  • Authors: Seongwon Cho, Daechul Ahn, Donghyun Shin, Hyeonbeom Choi, San Kim, Jonghyun Choi

📝 Abstract

Open-vocabulary mobile manipulation (OVMM) requires robots to follow language instructions, navigate, and manipulate while updating their world representation under dynamic environmental changes. However, most prior approaches update their world representation only at discrete update points such as navigation targets, waypoints, or the end of an action step, leaving robots blind between updates and causing cascading failures: overlooked objects, late error detection, and delayed replanning. To address this limitation, we propose BINDER (Bridging INstant and DEliberative Reasoning), a dual-process framework that decouples strategic planning from continuous environment monitoring. Specifically, BINDER integrates a Deliberative Response Module (DRM, a multimodal LLM for task planning) with an Instant Response Module (IRM, a Video-LLM for continuous monitoring). The two modules play complementary roles: the DRM performs strategic planning with structured 3D scene updates and guides what the IRM attends to, while the IRM analyzes video streams to update memory, correct ongoing actions, and trigger replanning when necessary. Through this bidirectional coordination, the modules address the trade-off between maintaining awareness and avoiding costly updates, enabling robust adaptation under dynamic conditions. Evaluated in three real-world environments with dynamic object placement, BINDER achieves substantially higher success and efficiency than state-of-the-art baselines, demonstrating its effectiveness for real-world deployment.
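The abstract describes a dual-process control loop: the DRM plans and tells the IRM what to attend to, while the IRM watches the video stream, updates shared memory, and can interrupt execution or request replanning. The Python sketch below is only an illustration of that coordination; the paper publishes no such code, so every class, method, and event name here (SceneMemory, attention_targets, "replan_needed", the `robot` object with `execute`/`interrupt`) is a hypothetical placeholder, not BINDER's actual interface.

```python
# Hypothetical sketch of the dual-process coordination described in the abstract
# (not the authors' code): a deliberative planner (DRM) and an instant video
# monitor (IRM) share a scene memory and coordinate in both directions.
from collections import deque
from dataclasses import dataclass, field


@dataclass
class SceneMemory:
    """Shared world representation that both modules read and write."""
    objects: dict = field(default_factory=dict)  # object name -> last observed pose


class DeliberativeResponseModule:
    """Stand-in for the multimodal-LLM planner (DRM)."""

    def plan(self, instruction: str, memory: SceneMemory) -> list:
        # Toy planner: explore for, then grasp, the requested object.
        obj = instruction.strip().split()[-1]
        return [f"explore({obj})", f"grasp({obj})"]

    def attention_targets(self, plan: list) -> list:
        # DRM -> IRM: which objects the monitor should watch for.
        return [step[step.index("(") + 1:-1] for step in plan]


class InstantResponseModule:
    """Stand-in for the Video-LLM monitor (IRM)."""

    def monitor(self, frame: dict, targets: list, memory: SceneMemory) -> str:
        # Toy monitor: a frame is a dict {"visible": [...], "pose": ..., "blocked": bool}.
        for obj in targets:
            if obj in frame.get("visible", []):
                memory.objects[obj] = frame.get("pose")  # instant memory update
                return "object_found"
        return "replan_needed" if frame.get("blocked") else "none"


def run_task(instruction: str, robot, drm, irm, memory: SceneMemory) -> None:
    """Execute a plan while the IRM inspects every frame between DRM updates."""
    queue = deque(drm.plan(instruction, memory))
    targets = drm.attention_targets(list(queue))
    while queue:
        action = queue.popleft()
        for frame in robot.execute(action):            # video stream during execution
            event = irm.monitor(frame, targets, memory)
            if event == "object_found":
                robot.interrupt()                      # act on an opportunistic detection
            elif event == "replan_needed":             # IRM -> DRM: request replanning
                queue = deque(drm.plan(instruction, memory))
                targets = drm.attention_targets(list(queue))
                break
```

A caller would supply a `robot` object whose `execute(action)` yields camera frames and which supports `interrupt()`; both are assumptions made for this sketch.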

💡 Deep Analysis

Figure 1: Limitations of existing OVMM approaches and the proposed BINDER (full caption in the Full Content section below).

📄 Full Content

BINDER: Instantly Adaptive Mobile Manipulation with Open-Vocabulary Commands

Seongwon Cho1,∗, Daechul Ahn1,∗, Donghyun Shin2, Hyeonbeom Choi1, San Kim1, Jonghyun Choi1†
∗Seongwon Cho and Daechul Ahn contributed equally to this work. 1Seoul National University, 2Korea University. †JC is with ECE, ASRI and IPAI in SNU and a corresponding author. Email: jonghyunchoi@snu.ac.kr
https://seongwon980.github.io/BINDER

Abstract— Open-vocabulary mobile manipulation (OVMM) requires robots to follow language instructions, navigate, and manipulate while updating their world representation under dynamic environmental changes. However, most prior approaches update their world representation only at discrete update points—such as navigation targets, waypoints, or the end of an action step—leaving robots blind between updates and causing cascading failures: overlooked objects, late error detection, and delayed replanning. To address this limitation, we propose BINDER (Bridging INstant and DEliberative Reasoning), a dual-process framework that decouples strategic planning from continuous environment monitoring. Specifically, BINDER integrates a Deliberative Response Module (DRM, a multimodal LLM for task planning) with an Instant Response Module (IRM, a Video-LLM for continuous monitoring). The two modules play complementary roles: the DRM performs strategic planning with structured 3D scene updates and guides what the IRM attends to, while the IRM analyzes video streams to update memory, correct ongoing actions, and trigger replanning when necessary. Through this bidirectional coordination, the modules address the trade-off between maintaining awareness and avoiding costly updates, enabling robust adaptation under dynamic conditions. Evaluated in three real-world environments with dynamic object placement, BINDER achieves substantially higher success and efficiency than state-of-the-art baselines, demonstrating its effectiveness for real-world deployment.

I. INTRODUCTION

Open-Vocabulary Mobile Manipulation (OVMM) aims to enable robots to navigate unknown environments and manipulate objects based on open-vocabulary language instructions [1]. Particularly, in real-world settings (e.g., home, office), robots must cope with continuous environmental changes—objects added, relocated, and humans or robots moving through space. To handle such dynamics, robots require both sophisticated reasoning for task planning and continuous environmental monitoring throughout execution. While early approaches operated in fixed, pre-scanned environments without considering environmental changes [2], [3], [4], recent work has introduced various environmental feedback mechanisms—including updating 3D voxel memory [5], scene graph memory [6], [7], and leveraging powerful VLMs like GPT-4V for closed-loop reasoning [8]. However, these suffer from a limitation: they operate with intermittent scene perception, leaving robots effectively blind to environmental changes between scene-perception updates.

Fig. 1: Limitations of existing OVMM approaches and our proposed BINDER. Robots are searching for a banana while exploring an unknown environment from navigation target p0 to p1. (a) Sparse-update approaches refresh perception only at navigation targets, leading to intermittent scene perception that leaves robots blind during traversal and causes them to miss objects that appear en route. (b) Methods that perform more frequent updates at intermediate waypoints partially reduce this temporal blindness but require repeated vision-processing pauses for 3D reconstruction, introducing inefficiency and still leaving blind spots between update intervals. (c) BINDER instead maintains continuous visual awareness en route via video-based monitoring and triggers 3D updates only when needed, enabling opportunistic detections (such as the banana appearing along the path) and task execution without intermittent pauses.

Due to the computational demands of updating a 3D semantic scene representation, previous approaches update their environmental representations—whether 3D voxel maps [5], [8], scene graphs [6], [7], [3], or volumetric/object-centric maps [4], [9], [8], [10]—only at discrete intervals [5], [6], [8], [7]. Even approaches employing powerful task-planning models (e.g., GPT [6], [8]) are undermined by this intermittent perception, as their reasoning for task planning might rely on potentially outdated environmental data. Consider a robot searching for a 'banana' while exploring from navigation target p0 to p1, as illustrated in Fig. 1. Despite the object being directly in its path, approaches that update 3D semantic scenes only at navigation targets or after completing sub-actions (e.g., grasping or placing) entirely miss this opportunity (Fig. 1-(a)).
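To make the update-policy trade-off in Fig. 1 concrete, the sketch below contrasts a fixed-update baseline with an event-triggered policy that runs the costly 3D reconstruction only when a lightweight per-frame check flags a relevant change. It is an illustration under assumed interfaces (`camera.stream_along`, `reconstruct_3d_scene`, `video_flags_change`, `route.update_points`), not code from the paper.

```python
# Illustrative only: contrasts "update at fixed points" (Fig. 1(a)/(b)) with
# "update when the video monitor flags a change" (Fig. 1(c)). All functions
# and attributes below are assumed placeholders, not the paper's API.

def explore_fixed_updates(route, camera, memory, reconstruct_3d_scene):
    """Baseline policy: pay for a 3D scene update only at chosen poses."""
    for pose, frame in camera.stream_along(route):
        if pose in route.update_points:                # navigation targets / waypoints
            memory.update(reconstruct_3d_scene(frame, pose))
        # Frames between update points are never inspected: the robot is blind here.


def explore_event_triggered(route, camera, memory, reconstruct_3d_scene,
                            video_flags_change, goal_object):
    """Event-triggered policy: cheap per-frame check, expensive update on demand."""
    for pose, frame in camera.stream_along(route):
        if video_flags_change(frame, watch_list=[goal_object]):  # lightweight monitor
            memory.update(reconstruct_3d_scene(frame, pose))     # costly, but only now
            if goal_object in memory.objects:
                return memory.objects[goal_object]               # e.g. banana found en route
    return None
```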


Reference

This content is AI-processed based on open access ArXiv data.
