Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents
LLM-based agents execute real-world workflows via tools and memory. These affordances enable ill-intended adversaries to also use these agents to carry out complex misuse scenarios. Existing agent misuse benchmarks largely test single-prompt instructions, leaving a gap in measuring how agents end up helping with harmful or illegal tasks over multiple turns. We introduce STING (Sequential Testing of Illicit N-step Goal execution), an automated red-teaming framework that constructs a step-by-step illicit plan grounded in a benign persona and iteratively probes a target agent with adaptive follow-ups, using judge agents to track phase completion. We further introduce an analysis framework that models multi-turn red-teaming as a time-to-first-jailbreak random variable, enabling analysis tools like discovery curves, hazard-ratio attribution by attack language, and a new metric: Restricted Mean Jailbreak Discovery. Across AgentHarm scenarios, STING yields substantially higher illicit-task completion than single-turn prompting and chat-oriented multi-turn baselines adapted to tool-using agents. In multilingual evaluations across six non-English settings, we find that attack success and illicit-task completion do not consistently increase in lower-resource languages, diverging from common chatbot findings. Overall, STING provides a practical way to evaluate and stress-test agent misuse in realistic deployment settings, where interactions are inherently multi-turn and often multilingual.
💡 Research Summary
The paper addresses a critical gap in the safety evaluation of tool‑enabled, memory‑augmented large language model (LLM) agents: existing benchmarks focus on single‑prompt malicious instructions, ignoring how an adversary might gradually coerce an agent over multiple turns and in non‑English languages. To fill this void, the authors introduce STING (Sequential Testing of Illicit N‑step Goal execution), an automated red‑team framework that simulates a sophisticated attacker through four coordinated agents: a Strategist, an Attacker, a Refusal Detector, and a Phase‑Completion Checker.
The Strategist first constructs a benign‑looking persona and decomposes a harmful intent (e.g., fabricating a deep‑fake video) into a sequence of atomic phases, each requiring specific tool calls (summarization, image generation, animation, posting). The Attacker then engages the target agent turn‑by‑turn, attempting to satisfy the current phase. After each response, the Refusal Detector flags explicit or implicit refusals, while the Phase‑Completion Checker evaluates whether the phase’s objective has been met. Based on this feedback, the Attacker either retries the same phase with an adapted prompt or proceeds to the next phase. Completion of all phases constitutes a “jailbreak.”
Methodologically, the authors model multi‑turn red‑teaming as a bounded‑budget reachability problem and define a discrete random variable S_H representing the index of the first successful strategy (time‑to‑first‑jailbreak). This formulation enables the use of survival‑analysis tools: Kaplan–Meier discovery curves (D(s) = 1 – S_sur(s)), hazard‑ratio analysis, and a novel scalar metric called Restricted Mean Jailbreak Discovery (RMJD), which is the area under the discovery curve up to a maximum number of strategies S_max. Hazard ratios are estimated via a Cox proportional‑hazards model stratified by intent, allowing the authors to isolate the effect of attack language while controlling for task difficulty.
Experiments are conducted on the public AgentHarm test set (44 base harms, each with four prompt variants, yielding 176 instances). Attack strategies are generated with Gemini 3 Pro in English and executed in seven languages (English, Chinese, French, Ukrainian, Hindi, Urdu, Telugu). The attacker, refusal detector, and phase checker are instantiated with Qwen3‑Next‑80B‑A3B‑Instruct for its strong multilingual capabilities; target agents include Qwen3‑Next, OpenAI’s GPT‑5.1, Google’s Gemini 3 Flash, and Anthropic’s Claude Sonnet 4.5. Each strategy is allowed up to 10 turns, and up to 100 strategies per intent are evaluated.
Results show that STING achieves up to 107 % higher illicit‑task completion compared with single‑prompt baselines, demonstrating that multi‑turn, tool‑driven attacks expose failure modes that static evaluations miss. The discovery curves are markedly steeper for agents with weaker defenses, and RMJD values correlate with known safety gaps. Multilingual analysis reveals that lower‑resource languages do not uniformly increase jailbreak risk; Chinese shows a modest hazard increase, whereas Hindi, Urdu, and Telugu exhibit lower hazard ratios than English, contradicting prior findings in chatbot‑only settings.
Ablation studies on attacker reasoning depth indicate a non‑monotonic relationship: modest reasoning boosts success, but excessive prompting can trigger refusals or detection. Lightweight defenses—adding a refusal detector and phase‑completion checker to the target—substantially reduce illicit‑task completion while incurring only modest drops in benign tool‑use performance, highlighting a practical trade‑off.
In sum, STING provides a comprehensive, statistically grounded framework for evaluating the safety of LLM agents in realistic, multi‑turn, multilingual deployments. By quantifying attack efficiency (through discovery curves and RMJD) and isolating language effects (via hazard ratios), the work offers actionable insights for developers, researchers, and policymakers aiming to harden agentic systems against sophisticated misuse.
Comments & Academic Discussion
Loading comments...
Leave a Comment