Time-To-Inconsistency: A Survival Analysis of Large Language Model Robustness to Adversarial Attacks

Notice: This research summary and analysis were generated automatically using AI. For authoritative details, please refer to the original arXiv source.

Large Language Models (LLMs) have revolutionized conversational AI, yet their robustness in extended multi-turn dialogues remains poorly understood. Existing evaluation frameworks focus on static benchmarks and single-turn assessments, failing to capture the temporal dynamics of conversational degradation that characterize real-world interactions. In this work, we present a large-scale survival analysis of conversational robustness, modeling failure as a time-to-event process over 36,951 turns from 9 state-of-the-art LLMs on the MT-Consistency benchmark. Our framework combines Cox proportional hazards, Accelerated Failure Time (AFT), and Random Survival Forest models with simple semantic drift features. We find that abrupt prompt-to-prompt semantic drift sharply increases the hazard of inconsistency, whereas cumulative drift is, counterintuitively, protective, suggesting adaptation in conversations that survive multiple shifts. AFT models with model-drift interactions achieve the best combination of discrimination and calibration, and proportional hazards checks reveal systematic violations for key drift covariates, explaining the limitations of Cox-style modeling in this setting. Finally, we show that a lightweight AFT model can be turned into a turn-level risk monitor that flags most failing conversations several turns before the first inconsistent answer while keeping false alerts modest. These results establish survival analysis as a powerful paradigm for evaluating multi-turn robustness and for designing practical safeguards for conversational AI systems.


💡 Research Summary

The paper tackles a critical gap in the evaluation of large language models (LLMs): the temporal dynamics of consistency loss during multi‑turn, adversarial conversations. While most existing benchmarks assess static, single‑turn performance, the authors reframe robustness as a “time‑to‑inconsistency” survival problem. Using the MT‑Consistency benchmark, they collect 36,951 turns from nine state‑of‑the‑art LLMs (Claude 3.5 Sonnet, DeepSeek R1, GPT‑4o, a 120B open‑weight GPT‑style model, Llama 3.3 70B, Llama 4 Maverick, Gemini 2.5, Mistral Large, and Qwen 3). Each conversation starts with a correct answer; the event of interest is the first turn where the model’s response deviates from that answer under a strict consistency criterion. Conversations that remain correct through the eight‑turn horizon are right‑censored.
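The time-to-event framing above can be sketched in a few lines. The helper below is purely illustrative (the function name and the boolean per-turn label format are assumptions, not the paper's code): each conversation becomes a (duration, event) pair, with conversations that stay consistent through the eight-turn horizon right-censored.

```python
# Sketch: turning per-turn consistency labels into survival records.
# Assumes each conversation is a list of booleans, True = the answer is
# still consistent with the (correct) turn-1 answer. Illustrative only.

HORIZON = 8  # adversarial turns per conversation in MT-Consistency

def to_survival_record(consistent_flags):
    """Return (duration, event): duration is the turn of the first
    inconsistency; event=0 marks a right-censored conversation that
    survived the full horizon."""
    for turn, ok in enumerate(consistent_flags[:HORIZON], start=1):
        if not ok:
            return turn, 1      # event observed: first inconsistent turn
    return HORIZON, 0           # never failed -> right-censored

records = [to_survival_record(c) for c in [
    [True, True, False, True],  # fails at turn 3
    [True] * 8,                 # never fails -> censored at turn 8
]]
print(records)  # [(3, 1), (8, 0)]
```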

The authors engineer time‑varying covariates from sentence‑transformer embeddings of prompts and full dialogue context. From these they derive three semantic‑drift metrics: prompt‑to‑prompt drift (Dp2p), context‑to‑prompt drift (Dc2p), and cumulative drift (Dcum). Additional covariates include prompt length, subject domain clusters, difficulty level, and model identity.

Three families of survival models are fitted: (i) Cox proportional hazards (PH) with time-varying covariates, (ii) parametric Accelerated Failure Time (AFT) models (Weibull, log-normal, log-logistic) and extensions that allow model-drift interactions, and (iii) non-parametric Random Survival Forests (RSF). The Cox models reveal that abrupt drift variables violate the proportional-hazards assumption (Schoenfeld residual tests), indicating that risk spikes are not constant over time. AFT models, by contrast, treat covariates as multiplicative factors on the time scale, allowing direct interpretation of how drift stretches or compresses the median time-to-inconsistency. When model-specific interaction terms are added, the AFT framework captures heterogeneous sensitivities: for example, a 0.2 increase in Dp2p reduces the median survival time by 30% for Claude 3.5 but only 15% for Llama 4. RSF confirms the importance of Dp2p and Dc2p in early turns but suffers from limited interpretability and higher computational overhead.
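The AFT interpretation admits a small worked example. In an AFT model, survival time scales multiplicatively as T = T0 * exp(beta . x), so a coefficient beta for Dp2p maps a drift increase dx to an acceleration factor exp(beta * dx) on the median time-to-inconsistency. The coefficients below are back-solved from the illustrative effect sizes quoted above (a 0.2 Dp2p increase shrinking the median by 30% vs. 15%), not values reported by the paper:

```python
import math

def aft_coefficient(median_ratio, dx):
    """beta such that exp(beta * dx) == median_ratio, i.e. the AFT
    coefficient implied by an observed median-survival ratio for a
    covariate increase of dx."""
    return math.log(median_ratio) / dx

# -30% median survival for +0.2 Dp2p  ->  ratio 0.70
beta_claude = aft_coefficient(0.70, 0.2)
# -15% median survival for +0.2 Dp2p  ->  ratio 0.85
beta_llama4 = aft_coefficient(0.85, 0.2)
print(round(beta_claude, 2), round(beta_llama4, 2))  # -1.78 -0.81
```

The more negative coefficient for the Claude-interaction term encodes its greater sensitivity to abrupt drift: the same drift shock compresses its expected time-to-failure more strongly.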

Performance is evaluated using Harrell’s concordance index (C‑index) and Brier scores. The best‑performing configuration is a log‑normal AFT model with model‑drift interactions, achieving a C‑index of 0.78 and superior calibration relative to Cox and RSF baselines. A striking empirical finding is that while abrupt drift sharply raises hazard, cumulative drift appears protective, suggesting that conversations which survive multiple semantic shifts may induce an adaptive stabilization in the model’s internal state.

Capitalizing on the lightweight nature of the AFT model, the authors construct a turn-level risk monitor that outputs a hazard score after each turn. In a held-out test, the monitor flags 70% of failing conversations on average 2.3 turns before the first inconsistency, while maintaining a false-alert rate below 12%. This demonstrates practical feasibility for real-time safety layers in deployed LLM systems.
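The monitor's control flow can be sketched as a streaming threshold on the model's predicted survival probability. Here `predict_survival` is a stand-in for the fitted AFT model and the 0.5 threshold is a hypothetical operating point, not the paper's calibrated value:

```python
def monitor(drift_stream, predict_survival, threshold=0.5):
    """Yield (turn, flagged) as drift features arrive. Drift features
    exist from turn 2 onward (they need a previous prompt), so the
    stream is enumerated starting at turn 2. `predict_survival` maps
    (features, turn) to the predicted probability that the conversation
    stays consistent through the remaining horizon."""
    for turn, feats in enumerate(drift_stream, start=2):
        p_survive = predict_survival(feats, turn)
        yield turn, p_survive < threshold  # flag when survival is unlikely

# Dummy stand-in model: survival probability falls with abrupt drift
# and with turn index (illustrative arithmetic only).
dummy = lambda feats, turn: max(0.0, 1.0 - feats["Dp2p"] - 0.05 * turn)

alerts = list(monitor([{"Dp2p": 0.1}, {"Dp2p": 0.6}], dummy))
print(alerts)  # [(2, False), (3, True)]
```

Because the AFT model is a small parametric formula over a handful of drift features, this per-turn scoring costs essentially nothing next to the LLM call itself, which is what makes the early-warning deployment practical.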

In sum, the work pioneers the application of survival analysis to LLM robustness, providing a statistically rigorous framework to quantify when and why consistency failures occur, uncovering nuanced effects of semantic drift, and delivering a deployable early‑warning mechanism. The methodology opens avenues for more temporally aware evaluation protocols and for integrating risk‑aware safeguards into conversational AI pipelines.

