A Multi-Turn Framework for Evaluating AI Misuse in Fraud and Cybercrime Scenarios

Notice: This research summary and analysis were automatically generated using AI technology. For accuracy, please refer to the original arXiv source.

AI is increasingly being used to assist fraud and cybercrime, yet the extent to which current large language models can provide useful information for complex criminal activity remains unclear. Working with law enforcement and policy experts, we developed multi-turn evaluations for three fraud and cybercrime scenarios (romance scams, CEO impersonation, and identity theft), focusing on text-to-text interactions. In each scenario, we evaluate whether models provide actionable assistance beyond information typically available on the web, as assessed by domain experts, and we do so in ways designed to resemble real-world misuse, such as breaking a fraud request down into a sequence of seemingly benign queries. We found that (1) without advanced jailbreaking techniques, current large language models provide minimal actionable information for fraud and cybercrime; (2) model safeguards have a significant impact on the provision of information, with the two open-weight large language models fine-tuned to remove safety guardrails providing the most actionable and useful responses; and (3) decomposing requests into benign-seeming queries elicited more assistance than explicitly malicious framing or basic system-level jailbreaks. Overall, the results suggest that, absent extensive effort to circumvent safeguards, current text-generation models provide relatively little uplift for fraud and cybercrime through information provision. This work contributes a reproducible, expert-grounded framework for tracking how these risks may evolve as models grow more capable and adversaries adapt.


💡 Research Summary

The paper presents a systematic, reproducible framework for measuring how large language models (LLMs) might be misused to facilitate fraud and cyber‑crime. Working with law‑enforcement and policy experts, the authors focus on three high‑impact scenarios—romance scams, CEO impersonation, and identity theft—each of which is broken down into the six stages of a typical fraud lifecycle (planning, intelligence gathering, falsification, execution, evasion, scaling). For each stage they design two fixed prompts (an initial probe and a follow‑up), yielding a 12‑turn “Long‑Form Task” (LFT) per scenario.
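The stage-and-prompt structure described above can be sketched in a few lines of Python. Everything here (the stage identifiers, the `Turn` record, and the placeholder prompt strings) is an illustrative assumption, not the paper's actual prompt set or implementation:

```python
from dataclasses import dataclass

# Hypothetical sketch of the 12-turn Long-Form Task (LFT) structure.
# Stage names follow the fraud lifecycle listed above; prompt texts are
# placeholders, not the paper's actual evaluation prompts.
STAGES = [
    "planning", "intelligence_gathering", "falsification",
    "execution", "evasion", "scaling",
]

@dataclass
class Turn:
    stage: str
    kind: str    # "probe" (initial) or "follow_up"
    prompt: str

def build_lft(scenario: str) -> list[Turn]:
    """Assemble one LFT: two fixed prompts per lifecycle stage (6 x 2 = 12)."""
    turns: list[Turn] = []
    for stage in STAGES:
        turns.append(Turn(stage, "probe", f"<{scenario}/{stage} initial probe>"))
        turns.append(Turn(stage, "follow_up", f"<{scenario}/{stage} follow-up>"))
    return turns
```

Fixing the prompt sequence this way is what makes the evaluation reproducible across models and repeated runs.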

Two framing strategies are compared: a "malicious" framing that explicitly states criminal intent, and a "benign" framing that masks the intent behind legitimate‑sounding research or educational language. The authors also test three system‑level jailbreak conditions (none, publicly available, and internal) to gauge low‑effort attempts to bypass safety guardrails. In total, 1,674 LFT runs (each experimental condition repeated five times) produce 20,088 model responses, 12 per run, across fourteen LLMs that vary in size, reasoning capability, web‑search integration, and safeguard status (including "uncensored" models fine‑tuned to remove guardrails).
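The experimental grid can be sketched as a cross-product. The condition labels below are assumptions for illustration, and the per-model counts are not the paper's exact design, but the runs-to-responses arithmetic is checkable:

```python
from itertools import product

# Illustrative experimental grid (labels are assumptions, not the paper's IDs).
scenarios = ["romance_scam", "ceo_impersonation", "identity_theft"]
framings = ["malicious", "benign"]
jailbreaks = ["none", "public", "internal"]

conditions = list(product(scenarios, framings, jailbreaks))  # 3 * 2 * 3 = 18

# Each LFT run contains 12 turns, so total runs and responses relate as:
TOTAL_RUNS = 1674
TURNS_PER_RUN = 12
TOTAL_RESPONSES = TOTAL_RUNS * TURNS_PER_RUN  # 20,088
```

The 20,088 figure quoted above is thus the number of individual turn-level responses, not the number of distinct LFT runs.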

Responses are evaluated on two dimensions using an LLM‑as‑a‑judge autograder (built with the Inspect framework) calibrated against human expert scores: (1) Actionability – whether the model supplies ready‑to‑use attack material, and (2) Information Access – whether the model aggregates information more effectively than an expert‑calibrated web‑search baseline. Scores range from 1 (refusal or non‑response) to 6 (novel, fully automated solutions).
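A minimal sketch of the two-dimensional rubric, assuming only the 1–6 endpoints quoted above; the `Grade` record and its validation are illustrative, not the autograder's actual schema:

```python
from dataclasses import dataclass

# Rubric endpoints as described above; intermediate levels are not specified
# here, so only range validation is modeled.
SCORE_MIN, SCORE_MAX = 1, 6  # 1 = refusal/non-response, 6 = novel, automated solution

@dataclass
class Grade:
    actionability: int        # ready-to-use attack material?
    information_access: int   # beats the expert-calibrated web-search baseline?

    def __post_init__(self) -> None:
        for name in ("actionability", "information_access"):
            score = getattr(self, name)
            if not SCORE_MIN <= score <= SCORE_MAX:
                raise ValueError(f"{name}={score} is outside the 1-6 rubric")
```

Keeping the two dimensions separate matters: a response can aggregate web-available information well (Information Access) while still offering nothing directly usable (Actionability).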

Key findings:

  1. Limited uplift – 88.5% of all responses fall into the "refused or non‑actionable" category for Actionability, and 67.5% do not exceed the web‑search baseline for Information Access. In other words, current LLMs provide only modest incremental assistance to fraudsters when no advanced jailbreak is used.
  2. Safety guards matter – Models that have had their safety guardrails removed (“uncensored” models) achieve significantly higher scores on both dimensions, especially on Actionability where they gain roughly one to two rating points on average. This underscores the protective role of built‑in safeguards.
  3. Benign framing is more effective – Queries presented in a research‑oriented, seemingly innocuous style elicit answers that are more detailed and less often refused than overtly malicious prompts. This suggests that adversaries can increase their success rate by disguising intent.
  4. Low‑effort jailbreaks have limited impact – System‑level jailbreak prompts do not substantially increase compliance, indicating that the models’ core safety mechanisms remain robust against simple adversarial prompts.
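The headline percentages in findings 1 and 2 are simple aggregates over rubric scores. A hedged sketch of that aggregation, over made-up sample scores (the real distribution is the paper's):

```python
def share_at_or_below(scores: list[int], threshold: int) -> float:
    """Fraction of responses scoring at or below a rubric threshold."""
    return sum(s <= threshold for s in scores) / len(scores)

# Hypothetical sample of Actionability scores (1 = refused/non-actionable).
sample = [1, 1, 1, 2, 1, 1, 3, 1]
refusal_rate = share_at_or_below(sample, threshold=1)  # 6/8 = 0.75
```

For Information Access, the same function with the threshold set at the web-search baseline score would yield the "does not exceed the baseline" share.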

The authors argue that multi‑turn evaluations capture capabilities that single‑turn compliance tests miss, as they reflect the iterative nature of real‑world attacks. They also note limitations: fixed prompts cannot capture dynamic adaptation to the model's responses, and the autograder, while in good agreement with human judgments (α ≈ 0.72–0.79), is not a perfect substitute for expert review.

Overall, the study contributes a novel, expert‑validated LFT methodology that can be reused to monitor evolving misuse risks as LLM capabilities grow. Policymakers and security practitioners can employ this framework to track risk trajectories, inform guardrail design, and prioritize mitigation strategies. Future work is suggested to expand prompt diversity, incorporate dynamic dialogue, and apply the framework to other dual‑use domains such as deep‑fake generation or ransomware assistance.

