I-FailSense: Towards General Robotic Failure Detection with Vision-Language Models


Language-conditioned robotic manipulation in open-world settings requires not only accurate task execution but also the ability to detect failures for robust deployment in real-world environments. Although recent advances in vision-language models (VLMs) have significantly improved the spatial reasoning and task-planning capabilities of robots, they remain limited in their ability to recognize their own failures. In particular, a critical yet underexplored challenge lies in detecting semantic misalignment errors, where the robot executes a task that is semantically meaningful but inconsistent with the given instruction. To address this, we propose a method for building datasets for Semantic Misalignment Failure detection from existing language-conditioned manipulation datasets. We also present I-FailSense, an open-source VLM framework with grounded arbitration designed specifically for failure detection. Our approach relies on post-training a base VLM, followed by training lightweight classification heads, called FS blocks, which are attached to different internal layers of the VLM and whose predictions are aggregated using an ensembling mechanism. Experiments show that I-FailSense outperforms state-of-the-art VLMs, both comparable in size and larger, in detecting semantic misalignment errors. Notably, despite being trained only on semantic misalignment detection, I-FailSense generalizes to broader robotic failure categories and transfers effectively to other simulation environments and to the real world, either zero-shot or with minimal post-training. The datasets and models are publicly released on HuggingFace (Webpage: https://clemgris.github.io/I-FailSense/).

💡 Research Summary

I‑FailSense tackles the under‑explored problem of detecting semantic misalignment failures in language‑conditioned robotic manipulation. While recent vision‑language models (VLMs) excel at spatial reasoning and instruction following, they struggle to recognise when a robot’s behavior, although visually plausible, does not match the textual goal. The authors first construct a dedicated dataset (D_SMF) from existing manipulation benchmarks by pairing each expert trajectory with its original instruction (a positive example) and with an alternative instruction from the same task category (a negative example). This yields challenging “semantic misalignment” cases in which the robot carries out a plausible action from the right task category that nonetheless contradicts the instruction (e.g., lifting a blue block when told to lift the red one).
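The pairing step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Episode` and `LabeledPair` types, the `build_smf_pairs` function, and the choice of one randomly sampled negative per trajectory are all hypothetical and are not taken from the paper or its released code.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    """One expert demonstration (hypothetical schema)."""
    trajectory_id: str
    instruction: str
    task_category: str


@dataclass(frozen=True)
class LabeledPair:
    """A (trajectory, instruction) pair with a success label."""
    trajectory_id: str
    instruction: str
    success: bool


def build_smf_pairs(episodes, seed=0):
    """Pair each trajectory with its own instruction (positive) and with a
    different instruction from the same task category (negative)."""
    rng = random.Random(seed)

    # Collect the distinct instructions seen in each task category.
    by_category = {}
    for ep in episodes:
        by_category.setdefault(ep.task_category, set()).add(ep.instruction)

    pairs = []
    for ep in episodes:
        pairs.append(LabeledPair(ep.trajectory_id, ep.instruction, True))
        # Assumption: one negative per trajectory, sampled uniformly.
        alternatives = sorted(by_category[ep.task_category] - {ep.instruction})
        if alternatives:
            negative = rng.choice(alternatives)
            pairs.append(LabeledPair(ep.trajectory_id, negative, False))
    return pairs


episodes = [
    Episode("traj_0", "lift the red block", "lift_block"),
    Episode("traj_1", "lift the blue block", "lift_block"),
    Episode("traj_2", "open the drawer", "drawer"),
]
pairs = build_smf_pairs(episodes)
```

In this toy input, `traj_2` receives no negative example because its category contains no alternative instruction; a real pipeline would need a policy for such cases.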

The core of I‑FailSense is a two‑stage post‑training pipeline built on the 3‑billion‑parameter PaliGemma2‑mix VLM. Stage 1 applies Low‑Rank Adaptation (LoRA) to the language side of the model, fine‑tuning only a small set of parameters while keeping the vision encoder frozen. Stage 2 freezes the entire VLM and attaches lightweight binary classification heads—called Failure‑Sense (FS) blocks—to multiple internal language layers. Each FS block independently predicts success or failure for a given observation trajectory and instruction; their outputs are combined via a voting‑based arbitration mechanism. This multi‑layer ensembling leverages diverse levels of representation, improving robustness over single‑layer baselines.
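As a rough illustration of how per-layer predictions might be combined, the sketch below applies a simple majority vote over the failure probabilities produced by several FS blocks. The summary only states that arbitration is voting-based; the function name `arbitrate`, the 0.5 threshold, and the tie-breaking rule (ties count as success) are assumptions made for this example.

```python
def arbitrate(fs_failure_probs, threshold=0.5):
    """Return True if a strict majority of FS blocks vote 'failure'.

    fs_failure_probs: one failure probability per FS block (hypothetical
    interface; the real heads may expose logits or hard labels instead).
    """
    if not fs_failure_probs:
        raise ValueError("at least one FS block prediction is required")

    # Each block casts a binary vote by thresholding its probability.
    failure_votes = sum(p >= threshold for p in fs_failure_probs)

    # Assumption: a tie is resolved in favour of 'success'.
    return 2 * failure_votes > len(fs_failure_probs)


# Three blocks attached to different layers, two of which flag a failure.
is_failure = arbitrate([0.9, 0.8, 0.2])
```

A majority vote is only one possible aggregation rule; averaging probabilities or weighting deeper layers more heavily would be equally plausible variants of the same ensembling idea.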

Empirical results show that I‑FailSense achieves over 90 % accuracy on the newly created simulated semantic‑misalignment benchmark, outperforming zero‑shot state‑of‑the‑art VLMs of comparable or larger size. Remarkably, despite being trained only on semantic‑misalignment data, the system generalises to control‑error detection and to unseen simulation environments (the AHA dataset), surpassing a VLM‑based baseline by 19 percentage points. In real‑world trials, fine‑tuning only the FS blocks yields 74 % accuracy, demonstrating effective sim‑to‑real transfer with minimal supervision.

Overall, I‑FailSense contributes (1) a systematic method for generating semantic‑misalignment failure data, (2) a parameter‑efficient adaptation strategy that preserves the strengths of large VLMs, and (3) a multi‑layer, ensemble‑based failure classifier that generalises across failure types, simulation domains, and real‑world settings. The models and datasets are released publicly on HuggingFace, facilitating reproducibility and further research in robust, failure‑aware robotic systems.
