SWITCH: A Benchmark for Real-World Control Interfaces

Reading time: 6 minutes

📝 Abstract

Autonomous intelligence requires not only perception and reasoning but, critically, effective interaction with the existing world and its infrastructure. Everyday environments are rich in tangible control interfaces (TCIs), e.g., light switches, appliance panels, and embedded GUIs, that demand not only commonsense and physics reasoning but also causal prediction and outcome verification across time and space (e.g., delayed heating, remote lights). Moreover, failures here have potential safety implications, yet current benchmarks rarely test grounding, partial observability (video), or post-hoc verification in situated settings. We introduce SWITCH (Semantic World Interface Tasks for Control & Handling), an embodied, task-driven benchmark created through iterative releases to probe these gaps. Its first iteration, SWITCH-Basic, evaluates five complementary abilities (task-aware VQA, semantic UI grounding, action generation, state-transition prediction, and result verification) under egocentric RGB video input and device diversity. Across 351 tasks spanning 98 real devices and appliances, commercial and open LMMs exhibit inconsistent performance even on single-step interactions, often over-relying on textual cues and under-using visual/video evidence (and high aggregate scores can mask such failures). SWITCH provides data, code, and held-out splits to enable reproducible evaluation and community contributions toward more challenging future iterations of the benchmark and the creation of training datasets. Benchmark resources are available at: https://github.com/BAAI-Agents/SWITCH.


📄 Content

Intelligent behaviour requires not only perception and reasoning, but also effective interaction with the existing world and the infrastructure in it. Despite significant progress in computer vision (CV), large multimodal models (LMMs), and interactive agents, research efforts have largely overlooked understanding and operating tangible control interfaces (TCIs), from light switches and appliances to on-device GUI panels, the primary medium of everyday human-device interaction.

Recent benchmark efforts (e.g., [3,4,8]) have probed models’ ability to understand common-sense causality or physics-based outcomes and consistency. However, TCIs require modeling causality beyond these dimensions alone: the effects of TCI interactions can depend on the specific device (e.g., the same device type can behave differently), involve temporal delays (e.g., pre-heating an oven), and may even require spatial verification (e.g., pressing a switch in one room to turn on a light in another).

Moreover, despite over two decades of “smart home” and IoT efforts, the vast majority of deployed devices remain non-API-enabled or, if automatable at all, expose fragmented proprietary protocols: buildings, offices, and homes today still operate through knobs, panels, and remotes that require eyes, hands, and potentially device-specific knowledge, rather than programmatic control. As a result, evaluating whether current foundation models can function in everyday settings requires moving beyond text-image benchmarks and into embodied tasks where success depends on using these interfaces correctly and verifying their effects in situ.

As in work on manipulating GUIs, grounding is critical for properly situated interaction with TCIs. However, current benchmarks and simulators either do not cover such interfaces or do not model them in enough detail to transfer to real-world settings.

These issues highlight the critical need for models that can understand and operate within such interaction-centric environments, bridging not only perception, reasoning, and action, but also verification of outcomes and adaptation. We also need ways to evaluate them both in higher-level causality understanding and action planning, and in grounded, realistic execution of interactions.

Figure 1. An overview of the SWITCH benchmark, using the case “Turn off all the lights” as a running example. SWITCH covers the collection and annotation of real-world TCI interaction data (“Collected Data”), which we systematically structure into five distinct tasks. These tasks are designed to evaluate models across three crucial capability dimensions: Perception/Spatial Reasoning, Causal Reasoning/Planning, and Verification. Furthermore, we leverage the benchmark to evaluate advanced generative models, like Veo3 [10]. By comparing generated videos against ground truth, we illustrate how current models still exhibit significant room for improvement in logical consistency and fine-grained interaction for real-world use, thus underscoring the importance of SWITCH’s target scenarios.

To address this gap, we introduce SWITCH (Semantic World Interface Tasks for Control and Handling), a new effort towards a unified task-driven benchmark designed to evaluate models’ abilities to perceive, reason about, and interact with complex dynamic environments via TCI elements, and to understand and verify the effects of such interactions. SWITCH’s initial iteration, SWITCH-Basic, emphasizes five complementary capabilities:

1. Task-Aware Visual Question Answering: answering questions conditioned on multimodal observations and task goals;
2. Semantic UI Comprehension: grounding and interpreting actionable UI elements in context;
3. Action Generation: planning and executing context-aware actions aligned with user intent;
4. State Transition Prediction: reasoning about the causal consequences of UI actions;
5. Result Verification: post-hoc evaluation to determine task success.

Figure 1 illustrates the benchmark design dimensions.
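To make the five capabilities concrete, a single benchmark item could be represented roughly as follows. This is a hypothetical sketch; the field names, task-type identifiers, and example values are illustrative assumptions, not SWITCH’s actual schema:

```python
from dataclasses import dataclass
from typing import List

# Hypothetical identifiers mirroring the five SWITCH-Basic capabilities.
TASK_TYPES = [
    "task_aware_vqa",         # answer questions given video + task goal
    "semantic_ui_grounding",  # locate/interpret actionable UI elements
    "action_generation",      # choose the action matching user intent
    "state_transition",       # predict the causal effect of a UI action
    "result_verification",    # judge post-hoc whether the task succeeded
]

@dataclass
class SwitchItem:
    """Illustrative single-step benchmark item (assumed structure)."""
    item_id: str
    task_type: str     # one of TASK_TYPES
    goal: str          # e.g. "Turn off all the lights"
    video_path: str    # egocentric RGB clip showing the device/TCI
    question: str      # multiple-choice question posed to the model
    choices: List[str] # MCQ options
    answer_index: int  # index of the correct option

# Toy example echoing the paper's running scenario.
example = SwitchItem(
    item_id="demo-001",
    task_type="state_transition",
    goal="Turn off all the lights",
    video_path="clips/demo-001.mp4",
    question="After pressing the leftmost switch, what changes?",
    choices=["Ceiling light turns off", "Fan turns on",
             "Nothing changes", "Oven starts pre-heating"],
    answer_index=0,
)
assert example.task_type in TASK_TYPES
```

Keeping the goal, video, and question in one record reflects the benchmark’s situated framing: every question is conditioned on both the task intent and the egocentric observation.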

Each task is explicitly designed to bridge the gap between virtual reasoning and real-world applicability, requiring models to understand UI layout and functionality, anticipate consequences of actions, and verify outcomes. Unlike existing datasets, which often consider only object-level interactions or abstract digital actions, SWITCH evaluates practical, generalizable interactive skills that reflect realistic human-device interactions.

We take an iterative benchmark design approach, both to allow timely analysis of current model capabilities and to collect feedback for upcoming, more in-depth iterations of the benchmark. This iterative process will also enable the future creation of an effective dataset for training models for such scenarios.

This first SWITCH-Basic iteration of the benchmark focuses on single-step interactions and an initial analysis of current model capabilities through a set of multiple-choice questions (MCQs) in different modalities to match
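Because SWITCH-Basic evaluates via multiple-choice questions, scoring reduces to exact-match accuracy, reported per capability rather than in aggregate (the paper notes aggregate scores can mask failures). A minimal sketch of such scoring, with assumed record field names:

```python
from collections import defaultdict

def mcq_accuracy(records):
    """Per-task-type MCQ accuracy.

    Each record is a dict with 'task_type', 'answer_index' (gold),
    and 'predicted_index' (model output) -- an assumed format.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["task_type"]] += 1
        if r["predicted_index"] == r["answer_index"]:
            correct[r["task_type"]] += 1
    return {t: correct[t] / total[t] for t in total}

# Toy usage: one capability can score perfectly while another fails,
# which a single pooled accuracy number would hide.
records = [
    {"task_type": "task_aware_vqa",
     "answer_index": 1, "predicted_index": 1},
    {"task_type": "result_verification",
     "answer_index": 2, "predicted_index": 0},
]
print(mcq_accuracy(records))
# {'task_aware_vqa': 1.0, 'result_verification': 0.0}
```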

This content is AI-processed based on ArXiv data.
