Check Yourself Before You Wreck Yourself: Selectively Quitting Improves LLM Agent Safety

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

As Large Language Model (LLM) agents increasingly operate in complex environments with real-world consequences, their safety becomes critical. While uncertainty quantification is well-studied for single-turn tasks, multi-turn agentic scenarios with real-world tool access present unique challenges where uncertainties and ambiguities compound, leading to severe or catastrophic risks beyond traditional text generation failures. We propose using “quitting” as a simple yet effective behavioral mechanism for LLM agents to recognize and withdraw from situations where they lack confidence. Leveraging the ToolEmu framework, we conduct a systematic evaluation of quitting behavior across 12 state-of-the-art LLMs. Our results demonstrate a highly favorable safety-helpfulness trade-off: agents prompted to quit with explicit instructions improve safety by an average of +0.39 on a 0-3 scale across all models (+0.64 for proprietary models), while incurring a negligible average decrease of -0.03 in helpfulness. Our analysis shows that simply adding explicit quit instructions is a highly effective safety mechanism that can be deployed immediately in existing agent systems, and establishes quitting as an effective first-line defense for autonomous agents in high-stakes applications.


💡 Research Summary

The paper tackles a pressing safety challenge in large‑language‑model (LLM) agents that operate in multi‑turn, tool‑enabled environments where actions can have real‑world consequences. While uncertainty quantification has been extensively studied for single‑turn text generation, the authors argue that multi‑step agentic scenarios introduce compounded ambiguities that can lead to financial loss, privacy breaches, or physical harm. To address this gap, they propose a simple behavioral safeguard: giving agents an explicit “quit” action that allows them to terminate a task when they lack sufficient confidence or when the situation is ambiguous or risky.

Using the ToolEmu benchmark, the authors construct 144 high‑stakes test cases spanning 36 toolkits (e.g., banking, healthcare, IoT) and nine risk categories. They evaluate twelve state‑of‑the‑art LLMs—both open‑source and proprietary—under three prompting conditions: (1) a baseline ReAct prompt with no quit option, (2) a “Simple Quit” prompt that merely informs the model it may quit, and (3) a “Specified Quit” prompt that provides concrete safety directives (e.g., quit if negative consequences cannot be ruled out, if more information is needed, or if the model lacks knowledge to assess risk).
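The three prompting conditions can be sketched as system-prompt variants. The wording below is a hypothetical illustration based on the summary's description of each condition, not the paper's actual prompt text:

```python
# Hypothetical sketch of the three prompting conditions described above.
# The exact prompt wording used in the paper may differ.
BASELINE = (
    "You are an agent that completes the user's task using the available tools."
)

# "Simple Quit": the model is merely told that quitting is allowed.
SIMPLE_QUIT = BASELINE + (
    " If you are not confident you can proceed safely, you may quit the task."
)

# "Specified Quit": concrete safety directives telling the model *when* to quit.
SPECIFIED_QUIT = BASELINE + (
    " Quit the task if: (1) you cannot rule out negative consequences of an"
    " action, (2) you need more information from the user to proceed safely,"
    " or (3) you lack the knowledge to assess the risk involved."
)

PROMPT_CONDITIONS = {
    "baseline": BASELINE,
    "simple_quit": SIMPLE_QUIT,
    "specified_quit": SPECIFIED_QUIT,
}
```

Each model is then evaluated on all 144 test cases under each of the three conditions, so differences in safety and helpfulness scores can be attributed to the prompt alone.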

Safety and helpfulness are scored on a 0‑3 scale by a separate evaluator model (Qwen3‑32B, temperature 0). The “Specified Quit” condition yields the largest safety gains: an average increase of +0.40 points across all models and +0.64 points for proprietary models. Notable examples include Claude 4 Sonnet, whose safety score more than doubles (1.008 → 2.223, +1.215) and GPT‑4o, which improves by +0.972 (0.894 → 1.866). These gains correlate with high quit rates (e.g., 72 % for Claude 4 Sonnet, 58 % for GPT‑4o), indicating that agents are correctly abstaining from risky actions.

Crucially, the helpfulness penalty is minimal: the average drop is only –0.03 points, and in many cases the change is negligible. This demonstrates a favorable safety‑helpfulness trade‑off, challenging the common assumption that stronger safety mechanisms necessarily degrade performance. The authors formalize the addition of a quit action by extending the agent’s action space A to A ∪ {quit} and defining the policy π: H → A ∪ {quit}, where H is the interaction history. When quit is selected, the agent outputs a “Final Answer” with an explanatory thought, effectively terminating the episode.
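The extended policy π: H → A ∪ {quit} amounts to adding one sentinel action to an ordinary ReAct-style loop. A minimal sketch, assuming a generic `policy` that maps the history to a (thought, action, input) triple and a generic `execute_tool` callback (both hypothetical names, not from the paper):

```python
QUIT = "quit"  # sentinel added to the tool-action space A

def run_episode(policy, execute_tool, max_steps=10):
    """Roll out policy pi over history H; the episode terminates as soon
    as the policy selects the quit action."""
    history = []  # H: list of (thought, action, action_input, observation)
    for _ in range(max_steps):
        thought, action, action_input = policy(history)
        if action == QUIT:
            # Quitting surfaces a "Final Answer" with the explanatory thought.
            return f"Final Answer: quitting the task. {thought}"
        observation = execute_tool(action, action_input)
        history.append((thought, action, action_input, observation))
    return "Final Answer: step budget exhausted."
```

Because quit is just another action, this mechanism needs no retraining: any existing agent loop can adopt it by adding the sentinel and the corresponding prompt instruction.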

The study’s contributions are fourfold: (1) a systematic evaluation of quitting behavior across a diverse set of LLMs and high‑risk scenarios, (2) empirical evidence that simple prompting can dramatically improve safety without retraining, (3) identification of a “compulsion to act” bias that can be mitigated through explicit safety instructions, and (4) a practical, immediately deployable safety technique for real‑world LLM agents. The authors suggest future work could integrate quit decisions with automated risk detectors, explore learned policies for optimal quitting, and extend the approach to more complex, open‑ended environments. Overall, the paper establishes quitting as an effective first‑line defense for autonomous agents operating in high‑stakes applications.

