SimpleToM: Exposing the Gap between Explicit ToM Inference and Implicit ToM Application in LLMs

Notice: This research summary and analysis were generated automatically using AI. For authoritative details, please refer to the original arXiv paper.

Large language models (LLMs) are increasingly tested for a “Theory of Mind” (ToM) - the ability to attribute mental states to oneself and others. Yet most evaluations stop at explicit belief attribution in classical toy stories or stylized tasks, leaving open the question of whether LLMs can implicitly apply such knowledge to predict human behavior, or to judge an observed behavior, in diverse scenarios. We introduce SimpleToM, a benchmark that advances ToM evaluation along two novel axes. First, it probes multiple levels of ToM reasoning, from mental state inference (explicit ToM) to behavior prediction and judgment (applied ToM). Second, it situates these tasks in diverse, everyday scenarios - such as supermarkets, hospitals, schools, and offices - where information asymmetries naturally arise (e.g., hidden defects in grocery store items, incomplete information in provider-patient interactions, or restricted access to locked devices). SimpleToM contains concise stories (e.g., “The can of Pringles has moldy chips in it. Mary picks up the can in the supermarket and walks to the cashier.”), each with three questions that test different degrees of ToM reasoning, asking models to predict: (a) mental states (“Is Mary aware of the mold?”), (b) behaviors (“Will Mary pay for the chips or report the mold?”), and (c) judgments (“Mary paid for the chips. Was that reasonable?”). Experiments reveal a striking gap: state-of-the-art models often reliably infer mental states (a), but fail at applying that knowledge for secondary predictions, with performance dropping sharply for behavior prediction (b) and further for behavior judgment (c). This exposes a critical fragility in LLMs’ social reasoning in terms of what they know (explicit ToM) versus how well they can implicitly apply that knowledge for predictions (applied ToM).


💡 Research Summary

The paper introduces SimpleToM, a novel benchmark designed to evaluate large language models (LLMs) on both explicit Theory of Mind (ToM) inference and applied ToM reasoning. While prior work on LLM ToM has largely focused on classic false‑belief tasks such as the Sally‑Anne experiment, these evaluations stop at asking the model directly whether a character knows a piece of information. SimpleToM goes further by embedding the same information asymmetry in everyday contexts—supermarkets, hospitals, schools, offices, etc.—and probing three distinct question types for each short two‑sentence story: (a) explicit mental‑state inference (“Is Mary aware that the Pringles can contains mold?”), (b) behavior prediction (“Will Mary pay for the chips or report the mold?”), and (c) judgment of the observed behavior (“Mary paid for the chips. Was that reasonable?”).
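Concretely, each benchmark item pairs one short story with the three question tiers. A minimal sketch of how such an item could be represented in code (the field names are illustrative, not the paper's released schema):

```python
from dataclasses import dataclass

@dataclass
class SimpleToMItem:
    """One benchmark item: a two-sentence story plus three ToM question tiers."""
    story: str
    mental_state_q: str   # (a) explicit mental-state inference
    behavior_q: str       # (b) behavior prediction
    judgment_q: str       # (c) judgment of the observed behavior

item = SimpleToMItem(
    story=("The can of Pringles has moldy chips in it. "
           "Mary picks up the can in the supermarket and walks to the cashier."),
    mental_state_q="Is Mary aware of the mold?",
    behavior_q="Will Mary pay for the chips or report the mold?",
    judgment_q="Mary paid for the chips. Was that reasonable?",
)
```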

The dataset construction proceeds in four stages. First, ten real‑world scenarios that naturally generate information asymmetry are defined (e.g., hidden product defects, unobservable medical data, locked device contents). For each scenario a seed story is manually written. Second, large generative models (GPT‑4, Claude‑3‑Opus, Claude‑3‑Sonnet) are prompted to propose diverse entity sets compatible with the scenario, and then to generate three new stories at varying severity levels for each entity set. Third, each story is automatically paired with two possible next actions: an “unaware” action (what the protagonist would likely do if they truly lack the key information) and an “aware” counterfactual. Fourth, human annotators rigorously validate that (i) the key information is indeed unknown to the protagonist, (ii) the “unaware” action is appropriate only under that ignorance, and (iii) the “aware” action is appropriate only if the protagonist were to know the information. After two rounds of generation and filtering, the final corpus contains 1,147 stories and 3,441 questions (1,147 × 3).
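The human-validation stage described above amounts to a three-way filter: a story survives only if all three annotator checks hold. A toy sketch of that filtering logic (the check names are hypothetical, chosen to mirror criteria (i)-(iii)):

```python
def passes_validation(checks: dict) -> bool:
    """Keep a story only if all three annotator checks hold:
    (i)  the key information is unknown to the protagonist,
    (ii) the 'unaware' action is appropriate only under that ignorance,
    (iii) the 'aware' action is appropriate only given the information."""
    required = ("info_unknown", "unaware_action_fits", "aware_action_fits")
    return all(checks.get(k, False) for k in required)

candidates = [
    {"info_unknown": True, "unaware_action_fits": True, "aware_action_fits": True},
    {"info_unknown": True, "unaware_action_fits": False, "aware_action_fits": True},
]
# Only stories passing every check are kept for the final corpus.
kept = [c for c in candidates if passes_validation(c)]
```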

Evaluation covers twelve state‑of‑the‑art LLMs, including GPT‑5, o1‑preview, Claude‑3‑Opus, and several open‑source models, under zero‑shot and few‑shot prompting. Results reveal a striking performance gap: on explicit mental‑state questions models achieve 85-90% accuracy, indicating they can reliably infer whether a character is aware of a hidden fact. However, on behavior‑prediction questions accuracy drops to roughly 55%, and on judgment questions it falls below 48%. The gap widens in scenarios that involve complex social norms or competing goals (e.g., medical ethics, security protocols).
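The tier-wise gap reported above can be computed by scoring each question type separately over a model's raw predictions. A minimal scoring sketch over toy records (the record fields are assumptions for illustration):

```python
from collections import defaultdict

def per_tier_accuracy(records):
    """Compute accuracy per question tier.

    records: iterable of dicts with keys 'tier', 'pred', 'gold'.
    Returns a dict mapping each tier to its accuracy in [0, 1].
    """
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["tier"]] += 1
        correct[r["tier"]] += int(r["pred"] == r["gold"])
    return {tier: correct[tier] / total[tier] for tier in total}

# Toy predictions illustrating the explicit-vs-applied gap.
records = [
    {"tier": "mental_state", "pred": "no", "gold": "no"},
    {"tier": "behavior", "pred": "report", "gold": "pay"},
    {"tier": "judgment", "pred": "unreasonable", "gold": "reasonable"},
]
accuracy = per_tier_accuracy(records)
```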

The authors attribute this discrepancy to two main factors. First, current LLMs excel at single‑step commonsense inference but lack a mechanism to integrate that inference into downstream decision‑making that respects scenario‑specific constraints and norms. Second, applied ToM questions require a chain of reasoning: infer the mental state, map that state onto likely actions, and then evaluate the rationality of the chosen action given the inferred state. Existing prompting strategies do not sufficiently scaffold this multi‑step chain, leading to error propagation.
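One natural mitigation suggested by this analysis is to scaffold the chain explicitly in the prompt: first elicit the protagonist's mental state, then condition the behavior prediction or judgment on it. A hedged sketch of such a prompt builder (this is an illustration of the idea, not a method from the paper):

```python
def scaffolded_prompt(story: str, question: str) -> str:
    """Build a prompt that walks the model through the applied-ToM chain:
    infer the mental state first, then answer conditioned on it."""
    return (
        f"Story: {story}\n"
        "Step 1: State what the protagonist does and does not know.\n"
        "Step 2: Given only what the protagonist knows, answer the question.\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = scaffolded_prompt(
    "The can of Pringles has moldy chips in it. Mary picks up the can.",
    "Will Mary pay for the chips or report the mold?",
)
```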

SimpleToM also distinguishes itself from prior ToM datasets by (1) providing a broad set of everyday scenarios rather than a single toy story, (2) deliberately avoiding explicit perception verbs (“see”, “notice”) to force models to rely on implicit commonsense, and (3) offering a three‑tiered question hierarchy that jointly measures inference, prediction, and normative judgment.

In the discussion, the paper outlines future research directions: (i) developing architectures or prompting frameworks that explicitly model goal‑norm integration so that inferred mental states can be fed into a decision module, (ii) extending SimpleToM to multi‑turn dialogues to test sustained ToM tracking, and (iii) leveraging human‑in‑the‑loop evaluation to refine judgment criteria for safety‑critical applications such as medical advice or customer‑service bots. By releasing the dataset and code publicly, the authors aim to establish SimpleToM as a standard benchmark for diagnosing and improving the social reasoning capabilities of next‑generation LLMs.

