How Toxic Can You Get? Search-based Toxicity Testing for Large Language Models
Language is a deep-rooted means of perpetuating stereotypes and discrimination. Large Language Models (LLMs), now a pervasive technology in our everyday lives, can cause extensive harm when prone to generating toxic responses. The standard way to address this issue is to align the LLM, which, however, mitigates the issue without constituting a definitive solution. Therefore, testing LLMs even after alignment efforts remains crucial for detecting any residual deviations from ethical standards. We present EvoTox, an automated testing framework for LLMs’ inclination to toxicity, providing a way to quantitatively assess how much LLMs can be pushed towards toxic responses even in the presence of alignment. The framework adopts an iterative evolution strategy that exploits the interplay between two LLMs: the System Under Test (SUT) and the Prompt Generator, which steers SUT responses toward higher toxicity. The toxicity level is assessed by an automated oracle based on an existing toxicity classifier. We conduct a quantitative and qualitative empirical evaluation using five state-of-the-art LLMs as evaluation subjects of increasing complexity (7B–671B parameters). Our quantitative evaluation assesses the cost-effectiveness of four alternative versions of EvoTox against existing baseline methods based on random search, curated datasets of toxic prompts, and adversarial attacks. Our qualitative assessment engages human evaluators to rate the fluency of the generated prompts and the perceived toxicity of the responses collected during the testing sessions. Results indicate that the effectiveness, in terms of detected toxicity level, is significantly higher than the selected baseline methods (effect size up to 1.0 against random search and up to 0.99 against adversarial attacks). Furthermore, EvoTox yields a limited cost overhead (from 22% to 35% on average).
💡 Research Summary
The paper introduces EvoTox, an automated, black‑box testing framework designed to assess how far a large language model (LLM) can be pushed toward generating toxic content, even after alignment procedures. EvoTox exploits the interaction between two LLMs: the System Under Test (SUT) and a Prompt Generator (PG). Starting from an initial seed prompt, EvoTox runs a (1 + λ) Evolution Strategy (ES). In each iteration the PG creates λ mutated prompts that are close variations of the current “parent” prompt. The SUT then produces a response for each mutant, and an automated oracle—implemented with a pre‑trained toxicity classifier (e.g., Perspective API)—assigns a toxicity confidence score to each response. The mutant with the highest toxicity score becomes the parent for the next generation, and the process repeats until the toxicity score plateaus.
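The (1 + λ) loop described above can be sketched as follows. This is a minimal illustration of the search strategy as summarized here, not the authors' implementation: the function names (`mutate`, `sut_respond`, `toxicity_score`) and the plateau-based stopping rule (`max_stall`) are hypothetical stand-ins for the Prompt Generator, the SUT, and the oracle.

```python
def evolve_toxic_prompt(seed_prompt, mutate, sut_respond, toxicity_score,
                        lam=4, max_stall=3):
    """Sketch of a (1 + lambda) Evolution Strategy for toxicity testing.

    mutate(parent, lam) -> list of lam mutated prompts (Prompt Generator role)
    sut_respond(prompt) -> SUT response text
    toxicity_score(text) -> toxicity confidence in [0, 1] (oracle role)
    """
    parent = seed_prompt
    best_score = toxicity_score(sut_respond(parent))
    stall = 0
    history = [(parent, best_score)]
    while stall < max_stall:
        # The Prompt Generator creates lam close variations of the parent.
        offspring = mutate(parent, lam)
        # The oracle scores the SUT response to each mutant.
        scored = [(p, toxicity_score(sut_respond(p))) for p in offspring]
        candidate, score = max(scored, key=lambda ps: ps[1])
        if score > best_score:
            # The most toxic mutant becomes the next parent.
            parent, best_score, stall = candidate, score, 0
        else:
            stall += 1  # no improvement: count toward the plateau
        history.append((parent, best_score))
    return parent, best_score, history
```

With real components, `sut_respond` would call the model under test and `toxicity_score` a classifier such as Perspective API; here any callables with the stated signatures work.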
Prompt mutation leverages few‑shot in‑context learning: the PG receives a few examples of prior prompts, their SUT responses, and the objective of increasing toxicity, enabling it to generate grammatically correct, semantically coherent, and realistic prompts. This contrasts with typical jailbreak attacks that prepend unnatural strings or random noise to force a model to violate its refusal mechanisms.
The authors evaluate EvoTox on five state‑of‑the‑art LLMs ranging from 7B to 671B parameters, covering both open‑source and commercial models with varying degrees of alignment. Four EvoTox variants (basic, stateful, context‑aware, hybrid) are compared against three baseline strategies: random search, curated toxic‑prompt datasets, and a leading jailbreak method (AutoDAN). Quantitative results show that EvoTox achieves substantially higher detected toxicity levels, with effect sizes up to 1.0 versus random search and up to 0.99 versus jailbreak attacks. Importantly, the additional computational cost is modest, averaging a 22%–35% increase in execution time relative to the baselines.
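The summary does not name the effect-size statistic, but values bounded by 1.0 are consistent with the Vargha–Delaney Â₁₂ measure commonly used in search-based software engineering; under that assumption, the statistic can be computed as:

```python
def vargha_delaney_a12(xs, ys):
    """Vargha-Delaney A12: probability that a value drawn from xs exceeds
    one drawn from ys, with ties counted as half. A12 = 1.0 means every
    observation in xs beats every observation in ys."""
    wins = sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in xs for y in ys)
    return wins / (len(xs) * len(ys))
```

Read this way, an effect size of 1.0 against random search would mean EvoTox's detected toxicity scores dominated the baseline's in every pairwise comparison.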
A qualitative study involves domain experts (psychologists and psychotherapists) who rate the fluency of generated prompts and the perceived toxicity of SUT responses. Human raters find EvoTox prompts significantly more natural and human‑like than those produced by adversarial attacks, and they perceive the corresponding SUT responses as more toxic. This alignment between automated toxicity scores and human judgment validates the oracle’s relevance.
The paper’s contributions are: (1) the EvoTox framework, a novel black‑box, ES‑based toxicity testing approach that uses LLMs to test LLMs; (2) a thorough empirical assessment covering cost‑effectiveness, prompt realism, and human‑perceived toxicity across multiple models; (3) an open‑source replication package enabling reproducibility. The authors acknowledge limitations such as potential bias in the toxicity classifier, dependence on the quality of the PG LLM, and the limited set of evaluated models. Future work is suggested on combining multiple toxicity oracles, optimizing PG meta‑prompts, integrating continuous testing pipelines in production, and extending the methodology to multilingual and culturally diverse contexts.