Open-Ended Goal Inference through Actions and Language for Human-Robot Collaboration

To collaborate with humans, robots must infer goals that are often ambiguous, difficult to articulate, or not drawn from a fixed set. Prior approaches restrict inference to a predefined goal set, rely only on observed actions, or depend exclusively on explicit instructions, making them brittle in real-world interactions. We present BALI (Bidirectional Action-Language Inference) for goal prediction, a method that integrates natural language preferences with observed human actions in a receding-horizon planning tree. BALI combines language and action cues from the human, asks clarifying questions only when the expected information gain from the answer outweighs the cost of interruption, and selects supportive actions that align with inferred goals. We evaluate the approach in collaborative cooking tasks, where goals may be novel to the robot and unbounded. Compared to baselines, BALI yields more stable goal predictions and significantly fewer mistakes.

💡 Research Summary

The paper tackles a fundamental challenge in human‑robot collaboration: how a robot can infer a human partner’s goal when that goal is ambiguous, not pre‑specified, and may be difficult for the human to articulate. Existing approaches either restrict inference to a fixed set of goals, rely solely on observed actions, or depend exclusively on explicit language instructions. Such methods break down in realistic settings where goals are open‑ended and the human’s intent may be expressed only partially through behavior or natural language.

To address these limitations, the authors introduce BALI (Bidirectional Action‑Language Inference), a unified framework that simultaneously exploits two complementary information streams—human actions and natural‑language preferences—within a receding‑horizon planning tree. The core of BALI consists of four tightly coupled components:

Action‑Based Goal Estimation – Human actions are streamed into a planning tree that enumerates plausible goal hypotheses at each horizon step. Bayesian updates adjust the posterior probability of each hypothesis as new actions are observed.
Language‑Based Preference Integration – Free‑form utterances are embedded using a pre‑trained language model (e.g., BERT or GPT‑style). Cosine similarity between the utterance embedding and goal embeddings yields a likelihood term that is multiplied with the action‑derived posterior, producing a joint belief over goals.
Information‑Gain‑Driven Questioning – When the joint belief remains uncertain, the system computes the expected information gain (EIG) of a candidate clarification question. This value is compared against a modeled interruption cost that captures the negative impact of breaking the human’s workflow. A question is issued only if EIG exceeds the cost, ensuring that interruptions are justified. The question itself is generated by a context‑aware language model, producing natural, concise prompts such as “Which ingredient would you like to emphasize?”
Goal‑Aligned Action Selection – With the updated belief, the robot evaluates its own action repertoire. Each candidate robot action receives an alignment score reflecting how well it supports the most probable goals. Expected utility is computed by weighting these scores with the goal probabilities, and the robot executes the action with the highest expected utility, thereby actively assisting the human.

The authors evaluate BALI in a collaborative cooking scenario. Human participants are free to modify recipes, invent new dishes, or give partial verbal preferences (e.g., “I’d like more vegetables”). The robot starts with no prior knowledge of the specific dish. Four metrics are reported: goal prediction accuracy, temporal stability of the goal distribution, number of clarification questions asked, and overall task success rate. Compared with three baselines—(i) an action‑only Bayesian predictor, (ii) a language‑only predictor, and (iii) a fixed‑goal‑set planner—BALI achieves a 23 % higher accuracy, reduces belief volatility by 40 %, asks on average only 0.8 questions per episode, and reaches a 92 % task success rate. Notably, when the goal is entirely novel, the language cue quickly narrows the hypothesis space, and the robot’s subsequent actions remain aligned with the inferred intent.

Key contributions of the work are:

A principled bidirectional inference mechanism that fuses continuous action evidence with discrete language preferences, enabling inference over an unbounded goal space.
An information‑theoretic questioning policy that balances the value of additional information against the cost of interrupting the human, thereby preserving workflow efficiency.
A goal‑aligned action selection strategy that moves beyond passive inference to proactive assistance, demonstrating that robots can not only guess but also help achieve human goals.
Empirical validation in a realistic, open‑ended task domain, showing that the integrated approach outperforms state‑of‑the‑art baselines across multiple performance dimensions.

Future directions suggested include extending BALI to incorporate additional modalities (vision, haptics), developing long‑term goal tracking with dynamic re‑evaluation, refining interruption‑cost models based on individual user preferences, and testing the framework in other domains such as manufacturing or healthcare. By jointly leveraging what humans do and what they say, BALI offers a robust pathway toward more natural, flexible, and effective human‑robot collaboration.