Navigating the Rabbit Hole: Emergent Biases in LLM-Generated Attack Narratives Targeting Mental Health Groups
Large Language Models (LLMs) have been shown to exhibit disproportionate biases against certain groups. However, the study of unprovoked targeted attacks by LLMs towards at-risk populations remains underexplored. Our paper presents three novel contributions: (1) the explicit evaluation of LLM-generated attacks on highly vulnerable mental health groups; (2) a network-based framework to study the propagation of relative biases; and (3) an assessment of the relative degree of stigmatization that emerges from these attacks. Our analysis of a recently released large-scale bias audit dataset reveals that mental health entities occupy central positions within attack narrative networks, as revealed by a significantly higher mean closeness centrality (p-value = 4.06e-10) and dense clustering (Gini coefficient = 0.7). Drawing on an established stigmatization framework, our analysis indicates increased labeling components for mental health disorder-related targets relative to initial targets in generation chains. Taken together, these insights shed light on the structural tendency of large language models to amplify harmful discourse and highlight the need for suitable mitigation approaches.
💡 Research Summary
The paper investigates a previously under‑explored form of bias in large language models (LLMs): the emergence of hostile narratives targeting mental‑health groups even when those groups are not mentioned in the initial prompt. Building on the “Toxicity Rabbit Hole” (TRH) framework introduced by Dutta et al. (2024), the authors repeatedly ask a fixed LLM (Mistral 7B) to rewrite an input stereotype in increasingly toxic language. Each generation becomes the input for the next step, producing long chains of progressively hateful text. The authors extract 190,599 generations organized into 15,401 chains from the publicly released TRH dataset, which contains 459 million tokens across 1,266 identity groups.
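The recursive protocol above can be sketched as a simple feedback loop: each generation becomes the prompt for the next rewrite. The sketch below is illustrative only; in the actual study each step queries Mistral 7B, whereas here the model call is a placeholder function so the chain logic is self-contained.

```python
def rewrite_step(text: str) -> str:
    """Placeholder for the LLM call that rewrites `text` in more toxic
    language. In the TRH protocol this would be a prompt to Mistral 7B."""
    return text + " [amplified]"  # stand-in transformation

def build_chain(seed: str, depth: int) -> list[str]:
    """Feed each generation back as the next input, recording the chain
    of progressively rewritten text (seed plus `depth` rewrites)."""
    chain = [seed]
    for _ in range(depth):
        chain.append(rewrite_step(chain[-1]))
    return chain

chain = build_chain("initial stereotype about group X", depth=3)
```

Each chain in the released dataset is such a sequence; the 15,401 chains together yield the 190,599 generations analyzed.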
To focus on mental‑health entities, the researchers compile a lexicon of 390 terms drawn from ICD‑10 (F01‑F99), Wikipedia lists, and colloquial expressions (e.g., “anxiety”, “depression”). They run LLaMA‑3.2B to label each generation as toxic or non‑toxic, extract victim entities, and filter out non‑toxic outputs (240 instances). After extensive cleaning and manual consolidation, they obtain a “VictimSet” of 24,185 unique entities, of which 195 belong to the mental‑health subset (MHSet) and the remaining 23,989 to a non‑mental‑health set (Non‑MHSet).
The core methodological contribution is the construction of a directed, weighted “Rabbit Hole Network”. Nodes correspond to victim entities; a directed edge u → v exists when entity v appears in a generation that follows a generation containing entity u within the same chain. Edge weights count how often each transition occurs across all chains. The resulting graph contains 24,184 nodes and 663,433 edges; the largest weakly‑connected component is retained for analysis.
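The edge-weighting rule described above can be sketched with plain counters: for every consecutive pair of generations in a chain, each entity in the earlier generation gets a weighted edge to each entity in the later one. This is an assumption-laden simplification (entity sets per step are given as input), not the authors' code:

```python
from collections import Counter

def build_rabbit_hole_network(chains):
    """Build weighted directed edges u -> v counting how often entity v
    appears in the generation immediately following one containing u,
    across all chains. `chains` maps a chain id to an ordered list of
    entity sets, one set per generation step. Self-loops record an
    entity persisting across consecutive steps."""
    edges = Counter()
    for steps in chains.values():
        for prev, curr in zip(steps, steps[1:]):
            for u in prev:
                for v in curr:
                    edges[(u, v)] += 1
    return edges

# One toy chain: "depression" enters at step 1 and persists.
chains = {0: [{"immigrants"}, {"immigrants", "depression"}, {"depression"}]}
edges = build_rabbit_hole_network(chains)
```

The resulting weighted edge list would then be loaded into a graph library (e.g., networkx) to extract the largest weakly connected component and compute the centrality measures reported next.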
Network analysis reveals that mental‑health nodes are disproportionately central. Closeness centrality for MHSet nodes is significantly higher than the overall average (p = 4.06 × 10⁻¹⁰). The Gini coefficient of clustering is 0.7, indicating that mental‑health entities form dense, unevenly distributed clusters. PageRank and degree metrics also rank MHSet nodes above most non‑mental‑health nodes, suggesting that once a mental‑health group is introduced, it tends to remain a focal point for subsequent toxic expansions. Community‑detection algorithms show that mental‑health entities congregate in a few tightly‑connected sub‑communities rather than being scattered randomly throughout the network.
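Closeness centrality and PageRank are standard graph-library computations; the Gini coefficient over per-node clustering values, which summarizes how unevenly clustering is concentrated (0 = perfectly even, 1 = maximally concentrated), is simple enough to sketch directly. This is a generic implementation of the standard discrete formula, not the authors' code:

```python
def gini(values):
    """Gini coefficient of a list of non-negative values, via the
    rank-weighted formula G = (2 * sum(i * x_i)) / (n * sum(x)) - (n+1)/n
    with values sorted ascending and ranks i = 1..n."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0  # degenerate cases: no inequality measurable
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return (2 * weighted) / (n * total) - (n + 1) / n
```

A reported Gini of 0.7 over clustering values therefore means a small fraction of nodes accounts for most of the clustering mass, consistent with mental-health entities forming dense, concentrated clusters.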
Beyond structural metrics, the authors apply a stigmatization framework comprising four components—labeling, stereotyping, threat, and dehumanization—to each toxic generation. Compared with the initial target groups (often race, religion, or nationality), mental‑health entities exhibit amplified stigmatizing language: labeling frequency is ~1.8× higher, threat language ~2.3× higher, and dehumanization ~2.0× higher. The “threat” component shows the strongest increase, indicating that the model escalates the portrayal of mental‑health groups as dangerous or harmful as the rabbit‑hole deepens.
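The amplification ratios above reduce to per-component frequency comparisons: tag each generation with boolean flags for the four components, then divide the mental-health group's rates by the initial targets' rates. A minimal sketch with toy tags (the component labels are from the paper; the data and helper names are illustrative):

```python
COMPONENTS = ("labeling", "stereotyping", "threat", "dehumanization")

def component_rates(tagged):
    """Fraction of generations flagged for each stigma component.
    `tagged` is a list of dicts with boolean flags per component."""
    n = len(tagged)
    return {c: sum(t[c] for t in tagged) / n for c in COMPONENTS}

def amplification(mh_tagged, initial_tagged):
    """Per-component ratio of mental-health rates to initial-target rates."""
    mh, init = component_rates(mh_tagged), component_rates(initial_tagged)
    return {c: mh[c] / init[c] if init[c] else float("inf") for c in COMPONENTS}

# Toy data: two tagged generations per group.
mh = [dict(labeling=True, stereotyping=False, threat=True, dehumanization=True),
      dict(labeling=True, stereotyping=True, threat=True, dehumanization=False)]
init = [dict(labeling=True, stereotyping=True, threat=False, dehumanization=True),
        dict(labeling=False, stereotyping=True, threat=True, dehumanization=False)]
ratios = amplification(mh, init)
```

A ratio of ~2.3 for "threat", as reported, means threat language appears over twice as often for mental-health targets as for the chains' original targets.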
The paper’s contributions are threefold: (1) introducing an explicit evaluation of LLM‑generated attacks on vulnerable mental‑health groups using a recursive toxicity‑amplification protocol; (2) presenting a network‑based methodology that captures how toxic narratives propagate across identity entities; (3) quantifying the escalation of stigmatizing language toward mental‑health groups relative to the original targets.
Limitations include reliance on a single model (Mistral 7B), binary toxicity labeling that may obscure nuanced hate, and automated stigmatization tagging without extensive human validation, which could introduce systematic errors. Future work should examine multiple LLM architectures, adopt graded toxicity scales, and incorporate human annotators to verify stigmatization measures.
The findings have immediate practical implications. Current LLM safety benchmarks typically test direct prompt‑based bias (e.g., “Generate a hateful sentence about X”). This study shows that even when X is absent, the model can autonomously pull mental‑health identities into toxic discourse, making them structural “high‑risk” nodes in the generative process. Consequently, developers, policymakers, and AI ethics boards should extend guard‑rail testing to include recursive generation scenarios and explicitly monitor for mental‑health stigmatization. Mitigation strategies might involve targeted data‑curation, fine‑tuning with anti‑stigma objectives, and post‑generation filtering that flags emergent mental‑health attacks. Overall, the work highlights a critical blind spot in LLM safety evaluation and calls for more comprehensive, network‑aware bias auditing frameworks.