Where Norms and References Collide: Evaluating LLMs on Normative Reasoning

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Embodied agents, such as robots, will need to interact in situated environments where successful communication often depends on reasoning over social norms: shared expectations that constrain what actions are appropriate in context. A key capability in such settings is norm-based reference resolution (NBRR), where interpreting referential expressions requires inferring implicit normative expectations grounded in physical and social context. Yet it remains unclear whether Large Language Models (LLMs) can support this kind of reasoning. In this work, we introduce SNIC (Situated Norms in Context), a human-validated diagnostic testbed designed to probe how well state-of-the-art LLMs can extract and utilize normative principles relevant to NBRR. SNIC emphasizes physically grounded norms that arise in everyday tasks such as cleaning, tidying, and serving. Across a range of controlled evaluations, we find that even the strongest LLMs struggle to consistently identify and apply social norms, particularly when norms are implicit, underspecified, or in conflict. These findings reveal a blind spot in current LLMs and highlight a key challenge for deploying language-based systems in socially situated, embodied settings.


💡 Research Summary

The paper tackles the problem of Norm‑Based Reference Resolution (NBRR), where an ambiguous linguistic request must be disambiguated by invoking socially shared norms that are grounded in the physical environment. To investigate whether current large language models (LLMs) can perform this type of reasoning, the authors introduce SNIC (Situated Norms in Context), a diagnostic benchmark specifically designed for NBRR.

SNIC was built in two stages. First, the authors handcrafted 120 textual vignettes, each containing (1) an ambiguous imperative (e.g., “hand me the mug”) and (2) a set of 5–7 candidate objects with differing properties (e.g., clean vs. dirty, decorative vs. functional). Each vignette is linked to a concrete, physically grounded norm such as “use clean items when serving” or “do not touch objects that belong to someone else.” Human participants (210 workers on Prolific) were asked to select the intended referent and provide a free‑form justification. The authors measured full matches, partial matches, and no‑matches, and computed a binary version of Fleiss’ κ. After filtering for items where the norm‑guided referent received a plurality of votes, 51 vignettes were retained as a high‑quality seed set.
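The binary agreement statistic used in this validation step can be computed with the standard Fleiss' κ formula. The sketch below is a generic implementation, not the authors' code; `counts[i][j]` holds how many annotators assigned item `i` to category `j`, with the binary case using two categories (e.g., "selected the norm‑guided referent" vs. "did not").

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for N items, each rated by n raters into k categories.

    counts[i][j] = number of raters who placed item i in category j.
    Assumes every item was rated by the same number of raters.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    # Per-item observed agreement P_i
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement P_e from marginal category proportions
    total = n_items * n_raters
    p_e = sum(
        (sum(row[j] for row in counts) / total) ** 2
        for j in range(len(counts[0]))
    )
    return (p_bar - p_e) / (1 - p_e)
```

When all annotators agree on every item, κ = 1; values near zero indicate agreement close to chance, which is the regime the paper reports for some norm categories.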

In the second stage, the seed set was procedurally augmented to generate roughly 9,000 additional examples. Augmentation involved systematic swaps of object attributes (clean ↔ dirty), changes of setting (kitchen ↔ library), and insertion of norm‑conflict scenarios (e.g., a hazardous broken plate versus a dirty plate). This process preserved the core requirement that correct disambiguation depends on applying a social norm, while creating a large, diverse corpus that mimics real‑world variability.
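The augmentation procedure described above can be sketched as a combinatorial expansion over a seed vignette. The attribute table, setting list, and vignette schema below are illustrative assumptions, not the paper's actual templates; the point is only to show how systematic swaps multiply a small seed set into thousands of variants.

```python
import itertools

# Hypothetical inventories for illustration only.
ATTRIBUTE_SWAPS = {"clean": "dirty", "dirty": "clean"}
SETTINGS = ["kitchen", "library"]

def augment(vignette):
    """Yield variants of a seed vignette via systematic swaps.

    vignette = {"setting": str, "objects": [{"name": str, "attr": str}, ...]}
    """
    for setting in SETTINGS:
        # Independently toggle each object's attribute (clean <-> dirty).
        flips = itertools.product([False, True], repeat=len(vignette["objects"]))
        for flip in flips:
            objects = [
                {"name": o["name"],
                 "attr": ATTRIBUTE_SWAPS[o["attr"]] if f else o["attr"]}
                for o, f in zip(vignette["objects"], flip)
            ]
            yield {"setting": setting, "objects": objects}
```

A seed with two attribute-bearing objects already yields 2 settings × 4 attribute combinations = 8 variants; with 5–7 objects per vignette and additional norm‑conflict insertions, 51 seeds can plausibly expand to the reported ~9,000 examples.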

The authors evaluated several state‑of‑the‑art LLMs (GPT‑4, Claude‑Sonnet, Llama‑2‑70B, etc.) using a one‑shot prompting format: each model received the vignette and the ambiguous request and was asked to name the most appropriate object. Results show a clear split between explicit and implicit norm conditions. When the relevant norm was stated explicitly in the text, models reached roughly 70% accuracy; for implicit norms, or for scenarios where two norms conflicted (e.g., safety versus cleanliness), accuracy dropped to the 30–45% range, and models frequently produced “partial matches” by listing multiple candidates or applying only one of the competing norms. Inter‑annotator agreement among humans was low (κ values from 0.07 to 0.22), indicating that even humans disagree on some norm interpretations, yet the models still performed substantially worse than the human baseline.
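The evaluation loop implied here has two parts: assembling a one‑shot prompt and scoring a model's answer as a full, partial, or no match. The field names, template wording, and scoring heuristic below are assumptions for illustration; the paper's exact prompt and match criteria are not reproduced.

```python
def build_prompt(example, vignette, candidates, request):
    """Assemble a one-shot prompt: one worked example, then the test item.

    Template wording is hypothetical, not the paper's actual prompt.
    """
    lines = [
        "Example:",
        f"Scene: {example['scene']}",
        f"Request: {example['request']}",
        f"Answer: {example['answer']}",
        "",
        "Now answer the following:",
        f"Scene: {vignette}",
        f"Candidates: {', '.join(candidates)}",
        f"Request: {request}",
        "Answer:",
    ]
    return "\n".join(lines)

def score(prediction, gold):
    """Full match: exactly the gold referent; partial match: the gold
    referent listed among several candidates; otherwise no match."""
    mentioned = [c.strip() for c in prediction.lower().split(",")]
    if mentioned == [gold.lower()]:
        return "full"
    if gold.lower() in mentioned:
        return "partial"
    return "none"
```

Under this scheme, a model that hedges by listing both a clean and a dirty mug is credited only with a partial match, mirroring the failure mode the paper reports under norm conflict.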

Key insights from the study are: (1) LLMs, despite their strong world‑knowledge representations, do not reliably retrieve or apply context‑specific, physically grounded social norms; (2) the presence of norm conflict dramatically amplifies this weakness, revealing that current models lack a principled mechanism for norm prioritization; (3) human validation confirms that many everyday norms are not universally salient, which explains variability in model performance across norm categories.

The paper contributes (a) the SNIC benchmark, publicly released with code and data, (b) a rigorous human‑validation pipeline that ensures each test item truly requires normative reasoning, and (c) an empirical characterization of the gap between LLM capabilities and the normative reasoning needed for embodied agents. The authors argue that closing this gap will likely require integrating formal deontic logic or other norm‑representation frameworks with LLMs, as well as incorporating multimodal perception so that agents can directly perceive physical cues (e.g., dirtiness, hazards) that trigger norms. Future work is suggested in three directions: formalizing norms as logical rules that can be dynamically activated, developing meta‑learning approaches that adapt norm salience to cultural context, and building end‑to‑end robot dialogue systems that combine LLM language understanding with sensor‑driven norm detection. Overall, the study highlights a critical blind spot in contemporary language models and provides a concrete testbed for advancing socially aware, norm‑compliant embodied AI.

