Assessing LLM Response Quality in the Context of Technology-Facilitated Abuse
Technology-facilitated abuse (TFA) is a pervasive form of intimate partner violence (IPV) that leverages digital tools to control, surveil, or harm survivors. While tech clinics are a reliable source of support for TFA survivors, staffing constraints and logistical barriers limit their reach. As a result, many survivors turn to online resources for assistance. With the growing accessibility and popularity of large language models (LLMs), and increasing interest from IPV organizations, survivors may begin to consult LLM-based chatbots before seeking help from tech clinics. In this work, we present the first expert-led manual evaluation of four LLMs - two widely used general-purpose non-reasoning models and two domain-specific models designed for IPV contexts - focused on their effectiveness in responding to TFA-related questions. Using real-world questions collected from the literature and online forums, we assess the quality of zero-shot, single-turn LLM responses generated with a survivor-safety-centered prompt, using criteria tailored to the TFA domain. Additionally, we conducted a user study to evaluate the perceived actionability of these responses from the perspective of individuals who have experienced TFA. Our findings, grounded in both expert assessment and user feedback, provide insights into the current capabilities and limitations of LLMs in the TFA context and may inform the design, development, and fine-tuning of future models for this domain. We conclude with concrete recommendations to improve LLM performance for survivor support.
💡 Research Summary
This paper investigates the suitability of large language models (LLMs) for assisting survivors of technology‑facilitated abuse (TFA), a rapidly growing subset of intimate partner violence that exploits smartphones, IoT devices, social media, and other digital tools. Existing “tech clinics” provide high‑quality, personalized help but are limited by staffing, geography, and accessibility, forcing many survivors to rely on search engines and forums that often deliver incomplete or unsafe advice. With the proliferation of LLM‑powered chat interfaces (e.g., ChatGPT, Claude, Gemini), survivors are already exposed to AI‑generated content, yet no systematic assessment of the safety and usefulness of such responses in the TFA context has been performed.
The authors conduct the first expert‑led manual evaluation of four LLMs: two general‑purpose, non‑reasoning models (GPT‑4o and Claude 3.7 Sonnet) and two domain‑specific models designed for intimate partner violence (IPV) support (Ruth and Aimee). They assemble a realistic corpus of survivor questions by extracting 431 items from peer‑reviewed literature and another 431 from Reddit and Quora using a semi‑automated similarity‑based retrieval pipeline. Each question is annotated with one of 17 abuse types (e.g., surveillance, financial control) and 28 specific means (e.g., spyware, GPS tracker). After verification, 193 representative questions are sampled to ensure balanced coverage across categories.
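The semi-automated similarity-based retrieval step can be illustrated with a minimal sketch. The paper does not specify its implementation here; this stand-in uses simple token-overlap (Jaccard) similarity, where a real pipeline would more likely use embedding similarity, and all function names, example questions, and the threshold are illustrative.

```python
import re

def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two questions (a crude stand-in
    for the embedding similarity a real pipeline would likely use)."""
    ta = set(re.findall(r"[a-z']+", a.lower()))
    tb = set(re.findall(r"[a-z']+", b.lower()))
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def retrieve_similar(seed_questions, forum_posts, threshold=0.3):
    """Keep forum posts whose best match against any literature-derived
    seed question exceeds the (illustrative) similarity threshold."""
    kept = []
    for post in forum_posts:
        score = max(jaccard(post, seed) for seed in seed_questions)
        if score >= threshold:
            kept.append((post, score))
    return sorted(kept, key=lambda pair: -pair[1])

# Hypothetical seed question and candidate forum posts
seeds = ["How do I check my phone for spyware my ex installed?"]
posts = [
    "How can I check whether my ex installed spyware on my phone?",
    "Best pizza places downtown?",
]
matches = retrieve_similar(seeds, posts)
```

Candidates retained this way would still require the manual verification and category annotation the paper describes before entering the corpus.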
All models receive the same “survivor‑safety‑centered” prompt and generate zero‑shot, single‑turn answers. Expert raters (three domain specialists) evaluate each response on four dimensions:
- Accuracy – factual correctness of the information.
- Completeness – inclusion of all critical steps or considerations.
- Safety – presence of warnings about potential escalation, legal ramifications, or further risk.
- Actionability – clarity and practicality of steps a survivor can actually follow.
Ratings are on a 5‑point Likert scale, and inter‑rater agreement is reported. The evaluation reveals that across all four LLMs, 81 % of answers are inaccurate, incomplete, or not actionable, and 60 % omit essential safety warnings. Concrete error patterns include recommending VPN usage for spyware detection (a VPN neither detects nor removes spyware), suggesting simple password changes for sophisticated account hijacking, and failing to acknowledge that certain technical mitigations may increase the abuser’s retaliation risk.
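The summary does not reproduce the agreement statistic used, but for three raters scoring a shared set of items a standard choice is Fleiss' kappa. The sketch below implements it from scratch; the example rating matrix is invented for illustration and is not the paper's data.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a matrix where ratings[i][j] is the number of
    raters who assigned item i to category j (row sums must be equal)."""
    N = len(ratings)        # number of items rated
    n = sum(ratings[0])     # raters per item
    k = len(ratings[0])     # number of categories
    # Observed per-item agreement P_i, then its mean P_bar
    P = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings]
    P_bar = sum(P) / N
    # Chance agreement P_e from marginal category proportions
    p = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    P_e = sum(x * x for x in p)
    return (P_bar - P_e) / (1 - P_e)

# Illustrative only: 3 raters, 4 responses, binary safe/unsafe judgment
example = [
    [3, 0],  # all three raters agree: safe
    [0, 3],  # all three agree: unsafe
    [2, 1],  # partial agreement
    [3, 0],
]
kappa = fleiss_kappa(example)
```

Values near 1 indicate strong agreement among the expert raters; values near 0 indicate agreement no better than chance.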
To complement expert assessment, a user study with 114 self‑identified TFA survivors measures perceived actionability. Participants frequently describe the AI responses as “overly long,” “hard to understand,” and “difficult to implement given financial or logistical constraints.” Many express concern that the advice does not consider the possibility of escalation, thereby potentially endangering them further.
Based on these findings, the authors propose four concrete recommendations for future LLM development in the TFA domain:
- Curate a domain‑specific, survivor‑authored dataset – the released question‑answer‑metadata collection should serve as a benchmark for training and evaluation.
- Design safety‑first prompting templates that explicitly require the model to flag risks, suggest low‑risk first steps, and provide escalation pathways (e.g., contact a trusted organization).
- Fine‑tune general‑purpose models on the curated TFA dataset while integrating a multi‑stage safety verification pipeline (automated checks + human‑in‑the‑loop).
- Implement a human‑AI collaborative workflow where AI‑generated advice is reviewed by trained IPV advocates before being delivered to survivors, especially for high‑severity queries.
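The second recommendation can be made concrete with a template like the one below. The wording, structure, and function names are hypothetical illustrations of a safety-first prompt, not the prompt the paper actually used.

```python
# Hypothetical safety-first system-prompt template for a TFA support
# chatbot; the wording is illustrative, not the paper's actual prompt.
SAFETY_PROMPT_TEMPLATE = """\
You are assisting a survivor of technology-facilitated abuse.
Before giving technical advice, you MUST:
1. Warn about any step the abuser could detect (risk of escalation).
2. Suggest the lowest-risk first step before more invasive mitigations.
3. Offer an escalation pathway: contacting a trusted advocate,
   a tech clinic, or a domestic-violence hotline.
Keep each step short, concrete, and achievable without special
equipment or payment.

Survivor question: {question}
"""

def build_prompt(question: str) -> str:
    """Fill the template with a single survivor question (zero-shot,
    single-turn, matching the paper's evaluation setup)."""
    return SAFETY_PROMPT_TEMPLATE.format(question=question.strip())

prompt = build_prompt("I think my ex can read my texts. What do I do?")
```

Even with such a template, the paper's results suggest the output should pass through the safety-verification and human-review stages described above before reaching a survivor.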
The paper concludes that, while LLMs hold promise for scaling early‑stage support for TFA survivors, current zero‑shot performance is insufficient for safe deployment. Targeted data collection, safety‑oriented model engineering, and robust human oversight are essential to transform LLMs from a risky information source into a reliable component of survivor assistance ecosystems.