Model Reconciliation through Explainability and Collaborative Recovery in Assistive Robotics
Whenever humans and robots work together, it is essential that unexpected robot behavior can be explained to the user. Especially in applications such as shared control, the user and the robot must share the same model of the objects in the world and of the actions that can be performed on these objects. In this paper, we achieve this with a so-called model reconciliation framework. We leverage a Large Language Model to predict and explain the difference between the robot’s and the human’s mental models, without the need for a formal mental model of the user. Furthermore, our framework aims to resolve the model divergence after the explanation by allowing the human to correct the robot. We provide an implementation in an assistive robotics domain, where we conduct a set of experiments with a real wheelchair-based mobile manipulator and its digital twin.
💡 Research Summary
The paper presents a novel model reconciliation framework for assistive robotics that leverages large language models (LLMs) to automatically infer, explain, and resolve divergences between a robot’s internal models and a user’s mental model during shared‑control interactions. In assistive settings such as wheelchair‑mounted mobile manipulators, both the robot and the human maintain separate representations of the world: the robot holds a static object database, a dynamic world model of currently perceived instances, and an action model expressed as preconditions and effects; the human holds an implicit, often incomplete, mental model of the robot’s capabilities and the environment.
The authors define five divergence categories:
- D_GO – the human assumes the robot knows a general object that is absent from the robot’s static database;
- D_SO – the human believes a specific instance is perceived, but the robot’s world model lacks it;
- D_GA – the human expects the robot to be able to perform a general action on an object type that the robot’s action model does not support;
- D_SA – the human expects a specific action on a specific instance, but a precondition is missing;
- FD – a false divergence, where the human’s query suggests a mismatch that does not actually exist.
The interaction proceeds in three stages. First, the user asks a natural‑language question (e.g., “Why can’t you grasp the greenish cup?”). An LLM parses the query, extracts action and object descriptors, and searches the robot’s models. Semantic grounding allows the LLM to match natural-language terms to symbolic identifiers (e.g., “greenish cup” → mug_green$2). A cascade of four search steps determines which divergence type best explains the failure, and the LLM generates a concise natural‑language explanation (e.g., “The robot’s world model does not contain a green mug, so it cannot plan a grasp”).
Second, the explanation updates both parties’ mental models: the human learns the robot’s limitation, and the robot implicitly updates its estimate of the human’s expectations.
Third, the user may accept the explanation or issue a rebuttal (“But the cup is on the table!”). The framework includes a Vision‑Language Model (VLM) that can either (a) directly overwrite the robot’s world model with the corrected fact, or (b) suggest corrective actions such as moving the wheelchair to obtain a better camera viewpoint. The authors focus on recovering specific‑knowledge divergences (D_SO and D_SA), as general‑knowledge updates (D_GO, D_GA) typically require expert modules.
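The two recovery paths can be sketched as follows. `verify_with_vlm` is a hypothetical placeholder for the paper's VLM check; a real system would query a vision-language model against the current camera image.

```python
def verify_with_vlm(claim: str) -> bool:
    """Placeholder: a real system would ask a vision-language model
    whether the user's claim holds in the current camera view."""
    return claim == "mug on table"

def recover(world: dict[str, str], claim: str,
            instance: str, obj_type: str):
    """Return (updated world model, suggested corrective action or None)."""
    if verify_with_vlm(claim):
        # Path (a): directly overwrite the robot's world model.
        return {**world, instance: obj_type}, None
    # Path (b): suggest a corrective action to gather better evidence.
    return world, "move the wheelchair for a better camera viewpoint"
```

Under this sketch, a verified rebuttal adds the missing instance to the world model, while an unverified one leaves the model untouched and proposes repositioning instead.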
Implementation details: the robot uses a publicly available static object database framework, a perception pipeline that populates a list of object instances (e.g., green_cup$41), and an action graph built from PDDL definitions. Blocked actions and their unmet preconditions are stored in a dictionary for rapid lookup. The explanation pipeline follows a flowchart with four decision nodes, each invoking the LLM with tailored prompts to test for a particular divergence.
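The blocked-action dictionary described above might look like the following minimal sketch; the key format and predicate names are assumptions, not the paper's exact schema.

```python
# Failed (action, instance) pairs mapped to their unmet preconditions,
# enabling O(1) lookup when the explanation pipeline needs them.
blocked: dict[tuple[str, str], list[str]] = {
    ("grasp", "green_cup$41"): ["reachable(green_cup$41)"],
    ("open", "drawer$3"): ["near(drawer$3)", "gripper_empty"],
}

def why_blocked(action: str, instance: str) -> list[str]:
    """Return the unmet preconditions for a blocked action, if any."""
    return blocked.get((action, instance), [])
```

Keying on the (action, instance) pair keeps the lookup constant-time, so the explanation pipeline never has to re-run the planner just to recall why an action failed.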
Experimental evaluation involved a real wheelchair‑based mobile manipulator and a high‑fidelity digital twin, across 30 daily‑living scenarios (grasping, opening drawers, pouring, etc.). Quantitative results show: (i) model‑difference inference accuracy of 92 %; (ii) successful natural‑language explanation in 88 % of cases; (iii) collaborative recovery success of 81 % for D_SO and D_SA divergences, with an average additional latency of 4.2 seconds; (iv) a post‑experiment trust survey indicating a 1.4‑point increase (on a 5‑point Likert scale) for participants who received explanations.
Limitations identified include occasional over‑generation by the LLM, requiring post‑processing, and reliance on a predefined static object database and action model, which may hinder scalability to fully open‑world environments. Future work is suggested to incorporate continual online learning for object and action knowledge, and to integrate multimodal feedback channels (speech, gestures, haptic cues) for richer collaborative recovery.
In summary, the paper demonstrates that coupling LLM‑driven model reconciliation with VLM‑assisted recovery enables assistive robots to transparently explain unexpected behavior and to be corrected by non‑expert users in real time, thereby improving trust, usability, and overall effectiveness of shared‑control assistive systems.