MindGuard: Guardrail Classifiers for Multi-Turn Mental Health Support
Large language models are increasingly used for mental health support, yet their conversational coherence alone does not ensure clinical appropriateness. Existing general-purpose safeguards often fail to distinguish between therapeutic disclosures and genuine clinical crises, leading to safety failures. To address this gap, we introduce a clinically grounded risk taxonomy, developed in collaboration with PhD-level psychologists, that identifies actionable harm (e.g., self-harm and harm to others) while preserving space for safe, non-crisis therapeutic content. We release MindGuard-testset, a dataset of real-world multi-turn conversations annotated at the turn level by clinical experts. Using synthetic dialogues generated via a controlled two-agent setup, we train MindGuard, a family of lightweight safety classifiers (with 4B and 8B parameters). Our classifiers reduce false positives at high-recall operating points and, when paired with clinician language models, help achieve lower attack success and harmful engagement rates in adversarial multi-turn interactions compared to general-purpose safeguards. We release all models and human evaluation data.
💡 Research Summary
The paper addresses a critical gap in the safety of large language models (LLMs) deployed for mental‑health support. While existing guardrails can flag broad categories such as self‑harm or violence, they lack the nuance to distinguish therapeutic disclosures from genuine crises, leading to both missed emergencies and unnecessary escalations. To remedy this, the authors collaborate with PhD‑level licensed psychologists to create a clinically grounded risk taxonomy that partitions user utterances into three categories: safe, self‑harm risk, and harm‑to‑others risk. This taxonomy mirrors clinical decision‑making, separating self‑directed from other‑directed threats and preserving a “safe” class for non‑crisis therapeutic content.
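The taxonomy can be summarized as a simple three-way label schema. The sketch below is a minimal illustration in Python; the enum names and the escalation helper are assumptions for clarity, not the authors' implementation.

```python
# Minimal sketch of the three-way risk taxonomy as a label schema.
# Enum names and the routing helper are illustrative assumptions.
from enum import Enum


class RiskLabel(Enum):
    SAFE = "safe"                            # non-crisis therapeutic content
    SELF_HARM = "self_harm_risk"             # self-directed actionable harm
    HARM_TO_OTHERS = "harm_to_others_risk"   # other-directed actionable harm


def requires_escalation(label: RiskLabel) -> bool:
    """Only the two actionable-harm classes trigger crisis handling."""
    return label is not RiskLabel.SAFE
```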
A new evaluation set, MindGuard‑testset, is assembled by having ten clinicians converse with a proprietary clinician‑LLM across seven patient archetypes, generating 67 multi‑turn dialogues (averaging 16.9 turns each). A separate trio of clinicians annotates each user turn according to the taxonomy, achieving 94.4% unanimous agreement and a Krippendorff's α of 0.57. The final corpus contains 1,134 turns, of which 3.7% are labeled unsafe (1.8% self‑harm, 1.9% harm‑to‑others), reflecting the realistic rarity of acute risk while ensuring sufficient coverage for evaluation.
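To make the reported agreement figures concrete, the hedged sketch below shows how unanimous agreement and nominal Krippendorff's α could be computed; the toy ratings and the use of the third-party `krippendorff` package are assumptions, not the authors' tooling.

```python
# Hedged sketch of the two agreement statistics: unanimous agreement across
# three annotators and nominal Krippendorff's alpha. Toy data only.
import numpy as np
import krippendorff  # pip install krippendorff

# rows = annotators, columns = user turns
# integer codes: 0 = safe, 1 = self-harm risk, 2 = harm-to-others risk
ratings = np.array([
    [0, 0, 1, 0, 2, 0],
    [0, 0, 1, 0, 2, 0],
    [0, 0, 1, 0, 0, 0],
])

# fraction of turns on which all three annotators gave the same label
unanimous = np.mean([len(set(col)) == 1 for col in ratings.T])

# chance-corrected agreement for nominal labels
alpha = krippendorff.alpha(reliability_data=ratings, level_of_measurement="nominal")

print(f"unanimous agreement: {unanimous:.1%}")
print(f"Krippendorff's alpha: {alpha:.2f}")
```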
For model training, the authors generate synthetic dialogues using a controlled two‑agent setup that mimics the emergence of risk signals over time. They train lightweight guardrail classifiers—MindGuard‑4B and MindGuard‑8B—on these data, conditioning each prediction on the current user message and the full conversation history.
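This conditioning scheme can be illustrated with a short sketch: prior turns and the current user message are serialized into a single classifier input. The prompt template and helper names below are assumptions, not the released MindGuard interface.

```python
# Minimal sketch of conditioning a turn-level guardrail classifier on the full
# conversation history plus the current user message. Template is illustrative.
from typing import Dict, List


def build_classifier_input(history: List[Dict[str, str]], current_user_msg: str) -> str:
    """Serialize prior turns and the turn under assessment into one prompt."""
    lines = [f"{turn['role'].upper()}: {turn['content']}" for turn in history]
    lines.append(f"USER (assess this turn): {current_user_msg}")
    lines.append("Label the final user turn as one of: safe, self_harm_risk, harm_to_others_risk.")
    return "\n".join(lines)


# Example usage with a toy two-turn history.
history = [
    {"role": "user", "content": "I've been feeling overwhelmed at work lately."},
    {"role": "assistant", "content": "That sounds exhausting. What has been weighing on you most?"},
]
prompt = build_classifier_input(history, "Some days I wonder if everyone would be better off without me.")
```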
Intrinsic evaluation shows an AUROC of 0.982, and at high‑recall operating points (recall ≥ 0.9) the false‑positive rate drops by more than 40% compared with a strong baseline (Llama Guard 3). Extrinsic evaluation via automated red‑team attacks demonstrates that MindGuard markedly reduces attack success and harmful engagement rates in multi‑turn adversarial scenarios, confirming its robustness in realistic deployment settings.
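The two intrinsic quantities, AUROC and the false-positive rate at a recall ≥ 0.9 operating point, can be computed as in the hedged sketch below; the scores and labels are placeholders, and only the metric definitions follow the text.

```python
# Hedged sketch of the intrinsic metrics: AUROC over turn-level risk scores and
# the false-positive rate at the best operating point with recall >= 0.9.
# The labels and scores are placeholders, not the paper's data.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 0, 1, 0, 1, 0, 0, 1, 0])  # 1 = unsafe turn
y_score = np.array([0.10, 0.20, 0.05, 0.90, 0.30, 0.80, 0.15, 0.40, 0.70, 0.25])

auroc = roc_auc_score(y_true, y_score)

fpr, tpr, thresholds = roc_curve(y_true, y_score)
high_recall = tpr >= 0.9                 # operating points with recall >= 0.9
fpr_at_high_recall = fpr[high_recall].min()  # lowest FPR among those points

print(f"AUROC: {auroc:.3f}, FPR at recall >= 0.9: {fpr_at_high_recall:.3f}")
```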
All resources—including model weights, the annotated test set, annotation guidelines, and human evaluation data—are released publicly on Hugging Face. By integrating a clinically validated risk taxonomy with specialized guardrail models, the work establishes a new standard for safety‑critical AI in mental‑health applications, offering a scalable solution that can detect actionable risk without over‑blocking therapeutic dialogue.
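As a usage note, a released checkpoint could presumably be loaded with the standard Hugging Face `transformers` API, as sketched below; the repository id is a hypothetical placeholder, and the actual model names should be taken from the release page.

```python
# Hedged sketch of loading a released guardrail checkpoint from Hugging Face.
# "example-org/MindGuard-4B" is a hypothetical placeholder id, not the real repo.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "example-org/MindGuard-4B"  # placeholder; see the official release
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)
```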