Automatic Cognitive Task Generation for In-Situ Evaluation of Embodied Agents

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

As general intelligent agents are poised for widespread deployment in diverse households, evaluation tailored to each unique, unseen 3D environment has become a critical prerequisite. However, existing benchmarks suffer from severe data contamination and a lack of scene specificity, rendering them inadequate for assessing agent capabilities in unseen settings. To address this, we propose a dynamic in-situ task generation method for unseen environments, inspired by human cognition. We define tasks through a structured graph representation and construct a two-stage interaction-evolution task generation system for embodied agents (TEA). In the interaction stage, the agent actively interacts with the environment, creating a loop between task execution and generation that allows for continuous task generation. In the evolution stage, task graph modeling allows us to recombine and reuse existing tasks to generate new ones without external data. Experiments across 10 unseen scenes demonstrate that TEA automatically generated 87,876 tasks in two cycles, which human verification confirmed to be physically reasonable and to encompass essential daily cognitive capabilities. Benchmarking SOTA models against humans on our in-situ tasks reveals that models, despite excelling on public benchmarks, perform surprisingly poorly on basic perception tasks, severely lack 3D interaction awareness, and show high sensitivity to task type in reasoning. These sobering findings highlight the necessity of in-situ evaluation before deploying agents into real-world human environments.


💡 Research Summary

The paper tackles a crucial gap in the evaluation of embodied agents destined for deployment in unseen household environments. Existing benchmarks suffer from data contamination, a lack of scene specificity, and reliance on pre‑generated task instances, which makes them unsuitable for assessing agents in truly novel 3‑D spaces. To overcome these limitations, the authors introduce TEA, a two‑stage, fully automatic interaction–evolution task generation system for embodied agents that mimics human cognitive processes.

In the first stage, called the Interaction stage, an agent is placed in a completely unknown environment with no pre‑defined tasks. The agent performs a random walk (ε‑random walk) to explore the scene, collecting multimodal data at each step: RGB images, depth maps, 3‑D bounding boxes, object labels, and positional information. This data stream (denoted D) is fed into a set of task‑generation functions G, each of which maps raw sensory inputs to a structured task graph. A task graph consists of vertices (objects, rooms, the agent), edges (spatial or ownership relationships), and attributes (color, label, image, depth). The generated tasks are filtered for redundancy using multimodal embeddings; a similarity matrix S is built, spectral clustering is applied, and only K representative tasks are retained. This prevents exponential explosion while preserving diversity.
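The redundancy-filtering step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes each candidate task already has a multimodal embedding vector, builds a cosine-similarity matrix S, runs spectral clustering (via scikit-learn's `SpectralClustering`), and keeps one medoid-like representative per cluster. The function name `select_representatives` and the medoid heuristic are my own choices.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def select_representatives(embeddings: np.ndarray, k: int, seed: int = 0) -> list[int]:
    """Cluster task embeddings and keep one representative task per cluster.

    Sketch of the paper's filtering idea: similarity matrix -> spectral
    clustering -> K representatives. Details here are illustrative.
    """
    # Cosine similarity matrix S over L2-normalised embeddings.
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    S = E @ E.T
    # Spectral clustering requires a non-negative affinity matrix.
    affinity = np.clip(S, 0.0, None)
    labels = SpectralClustering(
        n_clusters=k, affinity="precomputed", random_state=seed
    ).fit_predict(affinity)
    reps = []
    for c in range(k):
        idx = np.where(labels == c)[0]
        sub = S[np.ix_(idx, idx)]
        # Medoid heuristic: the member most similar, on average, to its cluster.
        reps.append(int(idx[np.argmax(sub.mean(axis=1))]))
    return sorted(reps)
```

With K representatives retained per loop, the task pool stays diverse without growing exponentially.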

The second stage, the Evolution stage, operates purely on the graph representations of already generated tasks. Two graph‑based operations are defined: (1) Reuse, where a subgraph t₁ that is a subset of a larger task graph t₂ (t₁ ⪯ t₂) can inherit the concrete instances discovered in t₂, allowing a complex task to spawn simpler ones without additional perception. (2) Recombination, where vertices of the same semantic type but different attributes are swapped, producing novel task templates (e.g., converting a label‑conditioned object search into an image‑conditioned search). These operations generate a new set of tasks T′ without any external assets, thereby achieving massive task diversity in situ.

To quantify redundancy and diversity, the authors propose the Maximum Independent Ratio (MIR). Given a similarity threshold α (set to 0.8), MIR is the size of the largest subset of tasks whose pairwise similarity stays below α, divided by the total number of tasks. Higher MIR indicates less redundancy. Experiments across ten unseen scenes show that the first interaction loop (ε = 1) yields a modest MIR of 0.31, while the second loop (ε = 0) combined with the evolution stage raises MIR to an average of 0.54 and up to 0.68, demonstrating the effectiveness of the evolution mechanisms.
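The MIR metric can be approximated as follows. Finding the true maximum independent set is NP-hard, so this sketch uses a greedy heuristic (process tasks with the fewest high-similarity conflicts first), which gives a lower bound on MIR; the paper does not specify its exact solver, so treat this as an assumption.

```python
import numpy as np

def mir(similarity: np.ndarray, alpha: float = 0.8) -> float:
    """Greedy lower bound on the Maximum Independent Ratio: size of the
    largest found subset with all pairwise similarities below alpha,
    divided by the total number of tasks."""
    n = len(similarity)
    conflict = similarity >= alpha          # edges of the "too similar" graph
    np.fill_diagonal(conflict, False)       # a task never conflicts with itself
    # Visit tasks with the fewest conflicts first (greedy heuristic).
    order = np.argsort(conflict.sum(axis=1))
    chosen: list[int] = []
    for i in order:
        if all(not conflict[i, j] for j in chosen):
            chosen.append(int(i))
    return len(chosen) / n
```

On a pool where every pair of tasks is near-duplicate (similarity ≥ α), this returns 1/n; on a fully diverse pool it returns 1.0, matching the intuition that higher MIR means less redundancy.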

In total, TEA automatically generated 87,876 tasks over two interaction‑evolution cycles. Human annotators verified that the tasks are physically plausible (e.g., “the red table is not in view”) and collectively cover essential daily cognitive capabilities such as navigation, object classification, visual relationship detection, and counting.

The authors then benchmark several state‑of‑the‑art vision‑language models (VLMs) and compare them to human performance on the same in‑situ tasks. While the models achieve high scores on public benchmarks, their performance collapses on TEA’s tasks: basic perception accuracy falls below 30 %, 3‑D interaction awareness is markedly poor, and reasoning performance is highly sensitive to task type. This discrepancy highlights severe over‑fitting to contaminated datasets and underscores the necessity of environment‑specific evaluation before real‑world deployment.

Overall, TEA contributes three major advances: (1) a closed‑loop, agent‑in‑the‑loop task generation method that works without any initial task instances; (2) a graph‑based reuse and recombination paradigm that creates a virtually unlimited set of diverse, physically grounded tasks without external assets; (3) a rigorous evaluation pipeline (including MIR) and human validation that confirms the relevance of the generated tasks. The work argues convincingly that in‑situ, scene‑specific evaluation is indispensable for trustworthy deployment of embodied agents in real households, and positions TEA as a foundational tool for future research in this direction.

