CAR-bench: Evaluating the Consistency and Limit-Awareness of LLM Agents under Real-World Uncertainty


Existing benchmarks for Large Language Model (LLM) agents focus on task completion under idealized settings but overlook reliability in real-world, user-facing applications. In domains such as in-car voice assistants, users often issue incomplete or ambiguous requests, creating intrinsic uncertainty that agents must manage through dialogue, tool use, and policy adherence. We introduce CAR-bench, a benchmark for evaluating consistency, uncertainty handling, and capability awareness in multi-turn, tool-using LLM agents in an in-car assistant domain. The environment features an LLM-simulated user, domain policies, and 58 interconnected tools spanning navigation, productivity, charging, and vehicle control. Beyond standard task completion, CAR-bench introduces Hallucination tasks that test agents’ limit-awareness under missing tools or information, and Disambiguation tasks that require resolving uncertainty through clarification or internal information gathering. Baseline results reveal large gaps between occasional and consistent success on all task types. Even frontier reasoning LLMs achieve less than a 50% consistent pass rate on Disambiguation tasks due to premature actions, and frequently violate policies or fabricate information to satisfy user requests in Hallucination tasks, underscoring the need for more reliable and self-aware LLM agents in real-world settings.


💡 Research Summary

CAR‑bench introduces a comprehensive benchmark designed to evaluate large language model (LLM) agents in the context of an in‑car voice assistant, where real‑world deployment demands more than raw tool‑calling ability. The benchmark combines six tightly integrated components: (a) an LLM‑simulated user that follows detailed persona specifications (age, conversational style, technical proficiency) and generates multi‑turn dialogues; (b) a set of 19 domain‑specific safety and policy rules (e.g., prohibiting simultaneous high‑beam and fog‑light activation, requiring user confirmation before sending an email); (c) a toolkit of 58 API‑style tools spanning navigation, charging, vehicle control, productivity, weather, and cross‑domain functions, each defined with JSON‑encoded parameters and categorized as “get” (information retrieval) or “set” (state modification); (d) mutable state variables (e.g., climate settings, window positions, navigation status) and immutable context variables (e.g., vehicle model, city); (e) large static databases that provide realistic geographic, POI, route, weather, contact, and calendar data for 48 European cities; and (f) evaluation metrics Pass@3 (at least one success in three trials) and Pass^3 (success in all three trials) to capture consistency across repeated interactions.
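The two consistency metrics can be sketched in a few lines. This is a minimal illustration based on the definitions above (at least one success in three trials vs. success in all three); the function names and trial representation are assumptions, not the benchmark's actual code:

```python
def pass_at_3(trials: list[bool]) -> bool:
    """Pass@3: the task counts as solved if at least one of the
    three independent trials reaches the ground-truth end state."""
    return any(trials)

def pass_hat_3(trials: list[bool]) -> bool:
    """Pass^3: the task counts as solved only if all three trials
    succeed, capturing consistency across repeated interactions."""
    return all(trials)

# Example: a task solved in two of three trials passes the lenient
# metric but fails the consistency metric.
trials = [True, False, True]
print(pass_at_3(trials))   # True
print(pass_hat_3(trials))  # False
```

The gap between the two numbers, aggregated over tasks, is exactly the occasional-versus-reliable gap the baseline results highlight.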

Beyond standard “Base” tasks that simply require reaching a ground‑truth end state, CAR‑bench defines two novel task families. Hallucination tasks deliberately hide a required tool or withhold critical data, forcing the agent to recognize its own limitations and either refuse gracefully or propose a viable alternative rather than fabricating information. Disambiguation tasks present under‑specified or ambiguous user requests, requiring the agent to either ask clarifying questions or internally gather missing information before taking any action. Both task types are tightly coupled with the policy engine, making premature or unsafe actions a direct safety violation.
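The coupling between ambiguity and policy can be illustrated with a toy decision rule built from the tool categories described above: before any state-modifying "set" tool fires, the agent must either resolve the ambiguity internally via an information-retrieval "get" tool or ask the user a clarifying question. All names, the slot-based notion of ambiguity, and the two-tool toolkit here are hypothetical simplifications, not the benchmark's implementation:

```python
from dataclasses import dataclass

@dataclass
class Tool:
    name: str
    kind: str  # "get" retrieves information, "set" modifies state

# Hypothetical mini-toolkit; the real benchmark defines 58 such tools.
TOOLS = [
    Tool("find_charging_station", "get"),
    Tool("start_navigation", "set"),
]

def handle_request(request: dict) -> str:
    """Toy clarify-before-acting loop. A request is 'ambiguous' here if
    a required slot is missing -- a stand-in for the benchmark's richer
    notion of under-specification."""
    missing = [s for s in request["required"] if s not in request["slots"]]
    if missing:
        gettable = {t.name for t in TOOLS if t.kind == "get"}
        if request.get("resolver") in gettable:
            # Resolve internally via a "get" tool instead of bothering the user.
            request["slots"][missing[0]] = f"<from {request['resolver']}>"
        else:
            # Otherwise, asking beats acting: a premature "set" call on an
            # unresolved request would be a direct policy violation.
            return f"clarify: which {missing[0]}?"
    return f"set: start_navigation({request['slots']})"

# An under-specified request triggers a clarifying question, not an action.
print(handle_request({"required": ["destination"], "slots": {}}))
```

The point of the sketch is the ordering constraint, not the mechanics: Disambiguation tasks reward agents that defer irreversible "set" actions until the request is fully specified, by whichever route.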

Baseline experiments cover a spectrum of models, from non‑reasoning LLMs to state‑of‑the‑art reasoning‑enhanced models such as GPT‑5. Across all tasks the average Pass^3 is only 54%, revealing a substantial gap between occasional success (Pass@3) and reliable performance. Hallucination tasks show modest improvement for reasoning models (≈60% Pass^3) but still exhibit frequent hallucinations. Disambiguation tasks are the most challenging: the best model drops from 68% Pass@3 to 36% Pass^3, with no system exceeding a 50% consistent success rate. The error taxonomy highlights a “completion‑policy tension”: agents prioritize satisfying the user request, often at the expense of policy compliance, leading to premature actions, stochastic policy violations, and fabricated responses when capabilities are missing.

The paper’s contributions are threefold: (1) a richly detailed evaluation environment with 58 interconnected tools and 19 safety policies tailored to automotive assistants; (2) the introduction of Hallucination and Disambiguation task types to systematically probe limit‑awareness and uncertainty resolution; and (3) a thorough analysis comparing reasoning versus non‑reasoning models, accompanied by an error taxonomy that surfaces the core weaknesses of current LLM agents. The authors argue that future work must focus on tighter policy verification, meta‑reasoning mechanisms for uncertainty detection, and real‑time user feedback loops to improve self‑awareness and safety in deployed LLM‑driven in‑car assistants. CAR‑bench thus provides a rigorous, reproducible benchmark that can drive the next generation of trustworthy, policy‑compliant LLM agents for safety‑critical domains.
