This paper addresses the topic of robustness under sensing noise, ambiguous instructions, and human-robot interaction. We take a radically different tack on the issue of reliable embodied AI: instead of focusing on formal verification methods aimed at achieving model predictability and robustness, we emphasise the dynamic, ambiguous and subjective nature of human-robot interactions, which requires embodied AI systems to perceive, interpret, and respond to human intentions in a manner that is consistent, comprehensible and aligned with human expectations. We argue that when embodied agents operate in human environments that are inherently social, multimodal, and fluid, reliability is contextually determined and only has meaning in relation to the goals and expectations of the humans involved in the interaction. This calls for a fundamentally different approach to achieving reliable embodied AI, centred on building and updating an accessible "explicit world model" that represents the common ground between human and AI and is used to align robot behaviours with human expectations.
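To make the notion of an accessible "explicit world model" more concrete, the following minimal Python sketch shows one possible representation of common-ground state that a robot could inspect, update, and use to flag items needing clarification. All class and field names (ExplicitWorldModel, Belief, needs_clarification, the confidence threshold) are illustrative assumptions, not an implementation from this work.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Entity:
    """An object or agent in the shared scene, as currently perceived."""
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)

@dataclass
class Belief:
    """A single proposition in the common ground, with provenance and confidence."""
    proposition: str   # e.g. "red_mug is on the table"
    source: str        # "perception", "human_utterance", or "inference"
    confidence: float  # 0.0 - 1.0

@dataclass
class ExplicitWorldModel:
    """Shared, inspectable state that both human and robot can refer to."""
    entities: Dict[str, Entity] = field(default_factory=dict)
    common_ground: List[Belief] = field(default_factory=list)
    human_goal: Optional[str] = None  # current estimate of the human's goal

    def update(self, belief: Belief) -> None:
        """Add or revise a belief about the shared scene or the human's intent."""
        self.common_ground = [b for b in self.common_ground
                              if b.proposition != belief.proposition]
        self.common_ground.append(belief)

    def needs_clarification(self, threshold: float = 0.6) -> List[Belief]:
        """Beliefs the robot is unsure about and should verify with the human."""
        return [b for b in self.common_ground if b.confidence < threshold]

# Example: grounding a spoken instruction against perception
wm = ExplicitWorldModel()
wm.entities["red_mug"] = Entity("red_mug", {"colour": "red", "type": "mug"})
wm.update(Belief("red_mug is on the table", source="perception", confidence=0.9))
wm.update(Belief("human wants the red_mug handed over",
                 source="human_utterance", confidence=0.5))
print(wm.needs_clarification())  # -> the low-confidence intention belief
```

The design point illustrated here is that the world model is explicit and queryable, so misalignments between robot beliefs and human expectations can be surfaced and repaired through interaction rather than remaining hidden in opaque network weights.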
Human Inspiration Humans learn to interpret the world not only through static visual or speech perception, but through the continuous integration of multimodal cues, including gaze, gestures, prosody and movement dynamics, together with contextual knowledge. These cues carry rich inferential signals about what others mean, what they want, and what will happen next, and they contribute to a mutual understanding of the world that forms the basis for cooperative action. This shared conception of the world has been termed common ground, a joint understanding of the tasks, communications, and environments between agents (Dillenbourg and Traum 1999).
Common Ground The idea of common ground originates from language and cognition studies (Clark, Schreuder, and Buttrick 1983) and has been studied extensively in the field of human-AI teaming under the umbrella of shared mental models, which cover constructs such as knowledge representation, schema, and situation awareness (Andrews et al. 2022). In the domain of Human-Robot Collaboration (HRC), this concept underpins effective teamwork, requiring sophisticated mechanisms to bridge the differences between human and artificial agents in terms of perception, cognition, and embodiment (Tan et al. 2020).
Perceptual Grounding Perceptual grounding is arguably the first step in establishing the common ground needed to construct a valid world model for HRC. Research in this domain has centred on visual understanding enabled by deep learning models and, more recently, visual foundation models. These models have been applied to various tasks and benchmarks on Visual Question Answering (VQA) (Zhong et al. 2022), which are nevertheless inadequate for capturing the dynamic, multimodal, and task-specific context of HRC. There is therefore growing interest in building common ground for task-oriented collaboration. We proposed the Task-oriented Collaborative Question Answering (TCQA) benchmark (Tan et al. 2020) to quantitatively evaluate the effectiveness of grounding methods in HRC tasks. Our baseline model, which combines deep learning for basic perception with symbolic reasoning to capture high-level contextual information and inference, achieved good performance on the benchmark, but this approach still suffers from fragility and errors in novel scenes and lacks the flexibility to construct new semantic inferences. To address these issues, Large Language and Multimodal Models (LLMs/LMMs) have been leveraged as sources of semantic knowledge to inform affordance reasoning (Ahn et al. 2022; Huang et al. 2023), coordination (Zhang et al. 2024) and human goal reasoning (Wan, Mao, and Tenenbaum 2023). However, these approaches face challenges owing to their intrinsic disembodiment from the physical world.
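As a hedged illustration of the neuro-symbolic division of labour described above (deep perception feeding a symbolic, task-aware reasoning layer), the sketch below shows a toy pipeline in which a detector emits symbolic facts and a small rule layer answers a task-oriented question. The function names, rules, and task context are hypothetical and do not reproduce the actual TCQA baseline.

```python
from typing import Dict, List, NamedTuple

class Detection(NamedTuple):
    label: str     # e.g. "screwdriver"
    location: str  # e.g. "toolbox"
    score: float   # detector confidence

def perceive(image) -> List[Detection]:
    """Stand-in for a learned detector / visual foundation model."""
    # In practice this would be a deep model; here we return fixed detections.
    return [Detection("screwdriver", "toolbox", 0.92),
            Detection("screw", "table", 0.88)]

def answer_task_query(query: str, detections: List[Detection],
                      task_context: Dict[str, str]) -> str:
    """Symbolic layer: map a task-oriented question onto perceived facts."""
    # Rule: "what tool do I need next?" -> look up the tool required by the current step.
    if query == "what tool do I need next?":
        needed = task_context.get("next_step_tool")
        for d in detections:
            if d.label == needed and d.score > 0.5:
                return f"The {needed} is in the {d.location}."
        return f"I cannot see the {needed}; it may be out of view."
    return "I do not understand the question."

detections = perceive(image=None)
print(answer_task_query("what tool do I need next?", detections,
                        task_context={"next_step_tool": "screwdriver"}))
```

The fragility noted in the text shows up directly in such a design: the symbolic layer can only reason over the vocabulary the perception module was trained to emit, which is why novel scenes and unseen semantic relations remain problematic.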
Joint Attention and Multimodal Interaction Foundational work in social robotics has emphasised the importance of joint attention and shared intentionality for meaningful interaction. Scassellati demonstrated that joint attention enables robots to interpret human referential cues (Scassellati 1996). Extending this, Breazeal et al. showed that nonverbal behaviours significantly improve efficiency and robustness in human-robot teamwork (Breazeal et al. 2005), revealing that embodied communication is essential for reliable coordination, while Sato et al. showed that continuous monitoring of human behaviours, expressed both through conscious actions and language and through unconscious, involuntary nonverbal cues, is needed for robots to actively infer human intentions (Sato et al. 1995). In parallel, work on legible robot motion showed that robots must act not just efficiently but also expressively, producing behaviours that communicate intent to human partners and improve predictability and coordination in shared workspaces (Dragan, Lee, and Srinivasa 2013). These works collectively support the argument that reliable collaboration emerges from interactive common ground building, not merely from isolated perception. In related work, we demonstrated how multimodal human cues are essential for reliable referential grounding. In M2GESTIC (Weerakoon et al. 2020), we showed that a distance-weighted understanding of pointing gestures can significantly reduce ambiguity in comprehending natural multimodal human instructions. We also demonstrated that eye gaze provides strong cues for predicting referents and action steps during joint tasks (Johari et al. 2021). COSM2IC (Weerakoon et al. 2022) introduced adaptive real-time multimodal fusion that prioritises gesture or linguistic structure depending on context, highlighting that reliability emerges from dynamic coordination rather than rigid pipelines. Most recently, Ges3ViG (Mane et al. 2025) integrates pointing gestures with 3D visual grounding, advancing spatially grounded reference understanding for real-world embodied AI.
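As a hedged sketch of the kind of distance-weighted gesture disambiguation discussed above, the toy example below scores candidate referents by their perpendicular distance to a pointing ray and fuses that score with a language-match score. The Gaussian weighting, parameter values, and candidate names are illustrative assumptions rather than the published M2GESTIC or COSM2IC models.

```python
import numpy as np

def distance_to_ray(point: np.ndarray, origin: np.ndarray, direction: np.ndarray) -> float:
    """Perpendicular distance from a candidate object to the pointing ray."""
    d = direction / np.linalg.norm(direction)
    v = point - origin
    # Component of v orthogonal to the ray direction.
    return float(np.linalg.norm(v - np.dot(v, d) * d))

def score_candidates(candidates: dict, origin: np.ndarray, direction: np.ndarray,
                     language_scores: dict, sigma: float = 0.3) -> dict:
    """Fuse a gesture cue (Gaussian in ray distance) with a language-match score."""
    scores = {}
    for name, pos in candidates.items():
        gesture = np.exp(-distance_to_ray(np.asarray(pos, dtype=float), origin, direction) ** 2
                         / (2 * sigma ** 2))
        scores[name] = gesture * language_scores.get(name, 1.0)
    return scores

candidates = {"red_cup": [1.0, 0.2, 0.0], "blue_cup": [1.0, 0.8, 0.0]}
language = {"red_cup": 0.9, "blue_cup": 0.9}  # "that cup" is ambiguous by language alone
scores = score_candidates(candidates, origin=np.zeros(3),
                          direction=np.array([1.0, 0.25, 0.0]),
                          language_scores=language)
print(max(scores, key=scores.get))  # pointing resolves the ambiguity -> "red_cup"
```

The example shows the general principle argued for in this paragraph: when the linguistic channel alone is ambiguous, a co-occurring gesture cue can carry the disambiguating weight, and the relative weighting of the two channels can itself be adapted to context.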
Cognitive Architectures (CAs) These symbolic AI systems rely on symbol manipulation and reasoning based on logical rules emulating human cognition.
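As a minimal illustration of this symbol-manipulation style, the toy forward-chaining loop below derives new action symbols from a working memory of facts and a small set of production rules. The facts and rules are invented for illustration and do not correspond to any specific cognitive architecture.

```python
# Working memory: symbolic facts as (predicate, argument) tuples.
facts = {("holding", "cup"), ("goal", "place_cup_on_table")}

# Production rules: (preconditions, facts to add). Names are illustrative.
rules = [
    ({("holding", "cup"), ("goal", "place_cup_on_table")}, {("action", "move_to_table")}),
    ({("action", "move_to_table")}, {("action", "release_cup")}),
]

changed = True
while changed:  # forward-chain until no rule adds new facts
    changed = False
    for preconditions, additions in rules:
        if preconditions <= facts and not additions <= facts:
            facts |= additions
            changed = True

print(sorted(facts))  # includes the derived actions move_to_table and release_cup
```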