How Catastrophic is Your LLM? Certifying Risk in Conversation
Large Language Models (LLMs) can produce catastrophic responses in conversational settings that pose serious risks to public safety and security. Existing evaluations often fail to fully reveal these vulnerabilities because they rely on fixed attack prompt sequences, lack statistical guarantees, and do not scale to the vast space of multi-turn conversations. In this work, we propose C$^3$LLM, a principled statistical Certification framework for Catastrophic risks in multi-turn Conversation for LLMs that bounds the probability of an LLM generating catastrophic responses under multi-turn conversation distributions, with statistical guarantees. We model multi-turn conversations as probability distributions over query sequences, represented by a Markov process on a query graph whose edges encode semantic similarity to capture realistic conversational flow, and quantify catastrophic risks using confidence intervals. We define several inexpensive and practical distributions: random node, graph path, and adaptive with rejection. Our results demonstrate that these distributions can reveal substantial catastrophic risks in frontier models, with certified lower bounds as high as 70% for the worst model, highlighting the urgent need for improved safety training strategies in frontier LLMs.
💡 Research Summary
The paper introduces C³LLM, a statistical certification framework that quantifies and bounds the probability of catastrophic outputs from large language models (LLMs) in multi‑turn conversational settings. Existing safety evaluations rely on fixed attack prompt sequences or single‑turn benchmarks, which do not capture the vast space of realistic dialogues and lack statistical guarantees. C³LLM addresses these gaps by modeling conversations as probability distributions over query sequences on a graph whose nodes are individual prompts and whose edges encode semantic similarity.
The authors construct a query graph G = (V, E) from a finite set of queries V. A lifted state space Ω = {(v, S)} ∪ {τ} tracks the current query v and the set S of already‑used queries, preventing repetitions within a single conversation. A Markov process governs transitions between states. Two families of transition schemes are defined: forward selection, which starts from an initial distribution μ and proceeds step‑by‑step, and backward selection, which starts from an endpoint distribution ν and builds the path in reverse.
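The graph construction and forward-selection transition can be sketched as follows. This is a minimal illustration, not the authors' implementation: the similarity function, threshold, and uniform neighbor choice are assumptions made for the example.

```python
import itertools
import random

def build_query_graph(queries, similarity, threshold=0.8):
    """Connect queries whose pairwise semantic similarity exceeds a threshold.

    `similarity` is any callable returning a score in [0, 1]; in practice it
    could be cosine similarity of sentence embeddings (an assumption here).
    """
    edges = {q: set() for q in queries}
    for a, b in itertools.combinations(queries, 2):
        if similarity(a, b) >= threshold:
            edges[a].add(b)
            edges[b].add(a)
    return edges

def forward_step(state, edges, rng=random):
    """One forward-selection transition on the lifted space Omega = {(v, S)} ∪ {tau}.

    From (v, S), move uniformly to an unvisited neighbor of v; if none
    remains, transition to the absorbing terminal state tau. Uniform choice
    is an illustrative assumption; the paper allows prescribed weights.
    """
    v, visited = state
    candidates = [u for u in edges[v] if u not in visited]
    if not candidates:
        return "tau"  # terminal state: no admissible continuation
    u = rng.choice(candidates)
    return (u, visited | {u})
```

Tracking the visited set S in the state is what prevents a query from being repeated within a single conversation.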
Three concrete distributions over query sequences are instantiated:
- Random node – each turn selects an unvisited node uniformly (or with a prescribed weight) from V \ S. This provides a baseline estimate of a model’s overall propensity to produce harmful content without exploiting graph structure.
- Graph path – the sequence must follow edges in the graph, ensuring semantic coherence. Two variants are explored: (a) a vanilla path where the final node is any node in V, and (b) a constrained path where the final node belongs to a high‑risk target set Q_T (e.g., prompts that explicitly ask for bomb instructions). Path‑based sampling captures the contextual buildup that can make later turns more dangerous.
- Adaptive with rejection – after each model response the attacker observes whether the model “rejects” (refuses) the request. If a rejection occurs, the attacker adapts the next query, effectively re‑sampling from the remaining nodes conditioned on the observed outcome. This mimics realistic red‑team behavior where prompts are iteratively refined to bypass safety filters.
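The three distributions above can be sketched as simple samplers. This is an illustrative reading of the summary, not the authors' code: the rejection-resampling loop for the constrained path, the keyword-based refusal check, and the retry-within-a-turn adaptation are all assumptions made for the example.

```python
import random

def sample_random_node(queries, n_turns, rng=random):
    """Random-node distribution: n distinct queries drawn uniformly, no graph."""
    return rng.sample(list(queries), n_turns)

def sample_graph_path(edges, n_turns, target_set=None, rng=random, max_tries=1000):
    """Graph-path distribution: an edge-respecting sequence of distinct nodes.

    If `target_set` is given (the constrained variant), resample until the
    final node lands in it; this rejection loop is an illustrative choice.
    """
    nodes = list(edges)
    for _ in range(max_tries):
        path = [rng.choice(nodes)]
        visited = {path[0]}
        while len(path) < n_turns:
            cands = [u for u in edges[path[-1]] if u not in visited]
            if not cands:
                break
            u = rng.choice(cands)
            path.append(u)
            visited.add(u)
        if len(path) == n_turns and (target_set is None or path[-1] in target_set):
            return path
    raise RuntimeError("no admissible path found")

def adaptive_with_rejection(queries, model, n_turns, rng=random):
    """Adaptive-with-rejection sketch: on a refusal, retry the turn with a
    different query from the remaining pool. The refusal check is a naive
    keyword test, purely illustrative."""
    remaining = set(queries)
    transcript = []
    for _ in range(n_turns):
        while remaining:
            q = rng.choice(sorted(remaining))
            remaining.discard(q)
            r = model(q, transcript)  # model(query, history) -> response text
            if not r.strip().lower().startswith("i can't"):
                transcript.append((q, r))
                break  # turn succeeded, move on
            # refused: adapt by resampling a different query for this turn
        else:
            break  # query pool exhausted
    return transcript
```

All three samplers define distributions over query sequences that are cheap to draw from, which is what makes Monte Carlo certification of the risk tractable.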
Risk is formalized via a binary judge function J_q∗(r_i) that returns 1 if the i‑th response r_i reveals a predefined catastrophic target q∗ (e.g., instructions for weapon synthesis) and 0 otherwise. The objective is to certify, with confidence level 1 − α (α = 0.05), an upper bound on
Pr_{γ ∼ D_n}[∃ i : J_q∗(r_i) = 1],

the probability that a conversation γ drawn from the distribution D_n contains at least one catastrophic response.
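The summary does not spell out which interval construction the paper uses. As a minimal sketch of how such a certificate can be computed from Monte Carlo samples, the following uses a two-sided Hoeffding bound, a generic distribution-free choice assumed here for illustration:

```python
import math

def hoeffding_bounds(successes, n, alpha=0.05):
    """Two-sided Hoeffding bounds on a Bernoulli mean at confidence 1 - alpha.

    `successes` counts sampled conversations the judge flagged as catastrophic,
    out of `n` total draws from the conversation distribution. Each side uses
    alpha/2, so both bounds hold jointly with probability >= 1 - alpha. This is
    a generic certificate, not necessarily the construction used in the paper.
    """
    p_hat = successes / n
    eps = math.sqrt(math.log(2 / alpha) / (2 * n))
    return max(0.0, p_hat - eps), min(1.0, p_hat + eps)
```

For example, if 350 of 500 sampled conversations are flagged by the judge, the certified lower bound at α = 0.05 comes out near 0.64, which is how a statement like "certified lower bounds as high as 70%" can be backed by finite samples.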