BINDER: Instantly Adaptive Mobile Manipulation with Open-Vocabulary Commands
Reading time: 5 minutes
...
📝 Original Info
Title: BINDER: Instantly Adaptive Mobile Manipulation with Open-Vocabulary Commands
ArXiv ID: 2511.22364
Date: 2025-11-27
Authors: Seongwon Cho, Daechul Ahn, Donghyun Shin, Hyeonbeom Choi, San Kim, Jonghyun Choi
📝 Abstract
Open-vocabulary mobile manipulation (OVMM) requires robots to follow language instructions, navigate, and manipulate while updating their world representation under dynamic environmental changes. However, most prior approaches update their world representation only at discrete update points such as navigation targets, waypoints, or the end of an action step, leaving robots blind between updates and causing cascading failures: overlooked objects, late error detection, and delayed replanning. To address this limitation, we propose BINDER (Bridging INstant and DEliberative Reasoning), a dual-process framework that decouples strategic planning from continuous environment monitoring. Specifically, BINDER integrates a Deliberative Response Module (DRM, a multimodal LLM for task planning) with an Instant Response Module (IRM, a Video-LLM for continuous monitoring). The two modules play complementary roles: the DRM performs strategic planning with structured 3D scene updates and guides what the IRM attends to, while the IRM analyzes video streams to update memory, correct ongoing actions, and trigger replanning when necessary. Through this bidirectional coordination, the modules address the trade-off between maintaining awareness and avoiding costly updates, enabling robust adaptation under dynamic conditions. Evaluated in three real-world environments with dynamic object placement, BINDER achieves substantially higher success and efficiency than state-of-the-art baselines, demonstrating its effectiveness for real-world deployment.
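To make the dual-process idea concrete, below is a minimal sketch (not the authors' code) of how a deliberative planner and an instant video monitor could be coordinated in a single control loop. All class names, method signatures, and the `detect_objects` helper are hypothetical stand-ins for the DRM, the IRM, and the robot interface described in the abstract.

```python
# Minimal sketch of a dual-process control loop in the spirit of DRM + IRM.
# Everything here (class names, method signatures, detect_objects) is a
# hypothetical stand-in, not the paper's implementation.
from dataclasses import dataclass, field


@dataclass
class SceneMemory:
    """Structured scene state shared by both modules."""
    objects: dict = field(default_factory=dict)  # object name -> last known pose


def detect_objects(video_clip):
    """Placeholder for the monitor's perception; a real system would query a Video-LLM."""
    return {}


class DeliberativePlanner:
    """Stands in for the DRM: slow, strategic planning over the scene memory."""

    def plan(self, instruction, memory):
        # Produce an ordered list of sub-actions plus an attention hint that
        # tells the monitor what to watch for (e.g., the target object).
        subtasks = [("explore", instruction), ("grasp", instruction)]
        watch_for = instruction
        return subtasks, watch_for


class InstantMonitor:
    """Stands in for the IRM: cheap, continuous checks on the video stream."""

    def check(self, video_clip, watch_for, memory):
        detections = detect_objects(video_clip)
        memory.objects.update(detections)   # keep memory fresh en route
        return watch_for in detections      # True -> trigger replanning


def run_task(instruction, robot, planner, monitor, memory):
    plan, watch_for = planner.plan(instruction, memory)
    while plan:
        action = plan.pop(0)
        # Frames keep arriving while the action executes, so the monitor can
        # intervene mid-motion instead of waiting for the next update point.
        for clip in robot.execute_streaming(action):
            if monitor.check(clip, watch_for, memory):
                plan, watch_for = planner.plan(instruction, memory)
                break
```

The point of the split is that the per-clip check stays cheap, while the expensive planner is re-invoked only when the monitor flags something task-relevant, which mirrors the awareness-versus-cost trade-off the abstract describes.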
💡 Deep Analysis
📄 Full Content
BINDER: Instantly Adaptive Mobile Manipulation with Open-Vocabulary Commands
Seongwon Cho1,∗, Daechul Ahn1,∗, Donghyun Shin2, Hyeonbeom Choi1, San Kim1, Jonghyun Choi1†
https://seongwon980.github.io/BINDER
Abstract— Open-vocabulary mobile manipulation (OVMM) requires robots to follow language instructions, navigate, and manipulate while updating their world representation under dynamic environmental changes. However, most prior approaches update their world representation only at discrete update points (such as navigation targets, waypoints, or the end of an action step), leaving robots blind between updates and causing cascading failures: overlooked objects, late error detection, and delayed replanning. To address this limitation, we propose BINDER (Bridging INstant and DEliberative Reasoning), a dual-process framework that decouples strategic planning from continuous environment monitoring. Specifically, BINDER integrates a Deliberative Response Module (DRM, a multimodal LLM for task planning) with an Instant Response Module (IRM, a Video-LLM for continuous monitoring). The two modules play complementary roles: the DRM performs strategic planning with structured 3D scene updates and guides what the IRM attends to, while the IRM analyzes video streams to update memory, correct ongoing actions, and trigger replanning when necessary. Through this bidirectional coordination, the modules address the trade-off between maintaining awareness and avoiding costly updates, enabling robust adaptation under dynamic conditions. Evaluated in three real-world environments with dynamic object placement, BINDER achieves substantially higher success and efficiency than state-of-the-art baselines, demonstrating its effectiveness for real-world deployment.
I. INTRODUCTION
Open-Vocabulary Mobile Manipulation (OVMM) aims to enable robots to navigate unknown environments and manipulate objects based on open-vocabulary language instructions [1]. In particular, in real-world settings (e.g., home, office), robots must cope with continuous environmental changes: objects are added or relocated, and humans or other robots move through the space. To handle such dynamics, robots require both sophisticated reasoning for task planning and continuous environmental monitoring throughout execution.

While early approaches operated in fixed, pre-scanned environments without considering environmental changes [2], [3], [4], recent work has introduced various environmental feedback mechanisms, including updating 3D voxel memory [5] or scene graph memory [6], [7] and leveraging powerful VLMs such as GPT-4V for closed-loop reasoning [8]. However, these methods share a limitation: they operate with intermittent scene perception, leaving robots effectively blind to environmental changes between scene perception updates.
∗Seongwon Cho and Daechul Ahn contributed equally to this work.
1Seoul National University 2Korea University
†JC is with ECE, ASRI and IPAI in SNU and a corresponding author.
Email: jonghyunchoi@snu.ac.kr
[Fig. 1 graphic: three panels (a)–(c) showing a robot moving from p0 to p1 while searching for a banana. In-figure labels: “Blind”, “Still explore” (panels a, b); “Success”, “Grasp” (panel c); “Vision processing pause”; “Navigation target”; task: explore(“banana”) → grasp(“banana”); legend: Visible / Invisible / Robot. See caption below.]
Fig. 1: Limitations of existing OVMM approaches and our proposed BINDER. Robots are searching for a banana while exploring an unknown environment from navigation target p0 to p1. (a) Sparse-update approaches refresh perception only at navigation targets, leading to intermittent scene perception that leaves robots blind during traversal and causes them to miss objects that appear en route. (b) Methods that perform more frequent updates at intermediate waypoints partially reduce this temporal blindness but require repeated vision-processing pauses for 3D reconstruction, introducing inefficiency and still leaving blind spots between update intervals. (c) BINDER instead maintains continuous visual awareness en route via video-based monitoring and triggers 3D updates only when needed, enabling opportunistic detections (such as the banana appearing along the path) and task execution without intermittent pauses.
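As a rough illustration of the difference between panels (a) and (c), the sketch below contrasts a policy that rebuilds the 3D scene only at navigation targets with an event-triggered one that keeps a cheap monitor running en route. The callables `rebuild_3d_scene` and `monitor_flags_change`, as well as the robot interface, are assumed, hypothetical names, not APIs from the paper.

```python
# Hypothetical sketch contrasting Fig. 1-(a) and Fig. 1-(c); the callables
# rebuild_3d_scene and monitor_flags_change are assumptions, not the paper's API.

def sparse_update_policy(robot, targets, rebuild_3d_scene):
    """Fig. 1-(a): refresh the costly 3D scene only at navigation targets."""
    scene = rebuild_3d_scene(robot.current_view())
    for target in targets:
        robot.move_to(target)                      # blind while moving
        scene = rebuild_3d_scene(robot.current_view())
    return scene


def event_triggered_policy(robot, targets, rebuild_3d_scene, monitor_flags_change):
    """Fig. 1-(c): watch the stream en route; rebuild only when the
    lightweight monitor flags a task-relevant change."""
    scene = rebuild_3d_scene(robot.current_view())
    for target in targets:
        for frames in robot.stream_to(target):     # no vision-processing pauses
            if monitor_flags_change(frames, scene):
                scene = rebuild_3d_scene(robot.current_view())  # costly, but only when needed
    return scene
```

Under this framing, the expensive reconstruction cost is paid only when the stream actually shows something worth updating, rather than on a fixed schedule of update points.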
Due to the computational demands of updating a 3D semantic scene representation, previous approaches update their environmental representations (whether 3D voxel maps [5], [8], scene graphs [6], [7], [3], or volumetric/object-centric maps [4], [9], [8], [10]) only at discrete intervals [5], [6], [8], [7]. Even approaches employing powerful task-planning models (e.g., GPT [6], [8]) are undermined by this intermittent perception, as their task-planning reasoning may rely on potentially outdated environmental data.
Consider a robot searching for a ‘banana’ while exploring from navigation target p0 to p1, as illustrated in Fig. 1. Despite the object being directly in its path, approaches that update 3D semantic scenes only at navigation targets or after completing sub-actions (e.g., grasping or placing) entirely miss this opportunity (Fig. 1-(a)).