Evaluating Text-based Conversational Agents for Mental Health: A Systematic Review of Metrics, Methods and Usage Contexts
Text-based conversational agents (CAs) are increasingly used in mental health, yet evaluation practices remain fragmented. We conducted a PRISMA-guided systematic review (May-June 2024) across ACM Digital Library, Scopus, and PsycINFO. From 613 records, 132 studies were included, with dual-coder extraction achieving substantial agreement (Cohen’s kappa = 0.77-0.92). We synthesized evaluation approaches across three dimensions: metrics, methods, and usage contexts. Metrics were classified into CA-centric attributes (e.g., reliability, safety, empathy) and user-centric outcomes (experience, knowledge, psychological state, health behavior). Methods included automated analyses, standardized psychometric scales, and qualitative inquiry. Temporal designs ranged from momentary to follow-up assessments. Findings show reliance on Western-developed scales, limited cultural adaptation, predominance of small and short-term samples, and weak links between automated performance metrics and user well-being. We argue for methodological triangulation, temporal rigor, and equity in measurement. This review offers a structured foundation for reliable, safe, and user-centered evaluation of mental health CAs.
💡 Research Summary
This paper presents a PRISMA‑guided systematic review of evaluation practices for text‑based conversational agents (CAs) used in mental‑health contexts. The authors searched ACM Digital Library, Scopus, and PsycINFO between May and June 2024 and retrieved 613 records; after duplicate removal, title‑abstract screening, full‑text assessment, and quality checks, they retained 132 empirical studies that evaluated a mental‑health CA via text interaction. Dual‑coder extraction yielded high inter‑rater reliability (Cohen’s κ = 0.77–0.92).
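To illustrate the agreement statistic reported above, the sketch below computes Cohen’s κ for two coders’ categorical labels. The labels are hypothetical toy data, not values from the review; a minimal sketch of the formula κ = (p_o − p_e) / (1 − p_e).

```python
# Minimal sketch (not from the paper): Cohen's kappa for two coders' labels.
# The coding labels below are hypothetical illustrations only.
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa = (p_o - p_e) / (1 - p_e) for two raters' categorical labels."""
    assert len(coder_a) == len(coder_b)
    n = len(coder_a)
    # Observed agreement: proportion of items both coders labelled identically.
    p_o = sum(x == y for x, y in zip(coder_a, coder_b)) / n
    # Expected chance agreement, from each coder's marginal label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    p_e = sum(freq_a[k] * freq_b.get(k, 0) for k in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical codes ("CA-centric" vs "user-centric" metric) for ten studies.
a = ["CA", "user", "CA", "CA", "user", "user", "CA", "user", "CA", "user"]
b = ["CA", "user", "CA", "user", "user", "user", "CA", "user", "CA", "CA"]
print(round(cohens_kappa(a, b), 2))  # 0.6 for this toy data
```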
The review is organized around three analytical dimensions: (1) metrics, (2) methodological approaches, and (3) usage contexts (timing and duration of assessment). Metrics are split into CA‑centric attributes and user‑centric outcomes. CA‑centric attributes include technical performance (response latency, resource usage), algorithmic performance (precision/recall, BLEU, ROUGE, perplexity), information quality (clarity, coherence, diversity, conciseness), human‑likeness (emotional intelligence, social intelligence, personality consistency), reliability (safety, fairness, privacy, trustworthiness, explainability, transparency), and mental‑health expertise (use of CBT/DBT techniques, support for clinicians). User‑centric outcomes are further divided into experience (usability, satisfaction, acceptability, perceived performance, relationship building, engagement) and change (knowledge acquisition, psychological state, health‑behaviour modification).
Methodologically, the studies fall into three main categories: (a) automated analyses of conversational output (BLEU, ROUGE‑L, perplexity, Distinct‑n, etc.), (b) standardized psychometric scales (PHQ‑9, GAD‑7, WHO‑5, SWLS, etc.), and (c) qualitative inquiry (semi‑structured interviews, focus groups, user diaries). While 126 of the 132 papers employed questionnaires or validated scales, only about 15 % combined multiple methods to achieve methodological triangulation. Consequently, the link between automated performance metrics and actual user well‑being remains weak.
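To make concrete what the automated analyses measure, here is a minimal sketch (not the authors’ code) of two surface metrics of the kind mentioned above: Distinct‑n for lexical diversity and a simple ROUGE‑L recall variant based on the longest common subsequence with a reference reply. The example sentences are hypothetical.

```python
# Illustrative only: two automated surface metrics for generated CA replies.

def distinct_n(tokens, n):
    """Share of unique n-grams among all n-grams; higher = more diverse output."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)

def rouge_l_recall(reference, candidate):
    """Longest-common-subsequence length over reference length (ROUGE-L recall)."""
    m, k = len(reference), len(candidate)
    dp = [[0] * (k + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(k):
            if reference[i] == candidate[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][k] / max(m, 1)

# Hypothetical reference reply and generated reply.
ref = "it sounds like this week has felt overwhelming for you".split()
gen = "it sounds like you have felt overwhelmed this week".split()
print(distinct_n(gen, 2))        # bigram diversity of the generated reply
print(rouge_l_recall(ref, gen))  # lexical overlap with the reference reply
```

Metrics of this kind are cheap to automate, which helps explain their prevalence; as the review notes, however, they say little on their own about user well‑being.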
Regarding usage contexts, the authors distinguish momentary assessments (immediate post‑session feedback) from follow‑up assessments (short‑term 1–4 weeks, medium‑term 1–3 months, long‑term >6 months). The majority of studies (≈78 %) relied on single‑session or short‑term follow‑up designs; only a handful examined long‑term outcomes or repeated‑use patterns. Participant ages ranged from 11 to 82, with adults predominating and no studies involving children under 11, reflecting the literacy requirements of text‑based systems. Quantitative studies reported an average sample size of 533 (range 2–36,070), yet 65 % of them included fewer than 100 participants and only 3 % exceeded 1,000. Qualitative studies averaged 80 participants, with 84 % below 100.
The review identifies several systemic limitations: (1) heavy reliance on Western‑developed psychometric instruments without cultural adaptation, (2) small and short‑term samples that limit generalizability, (3) fragmented methodological approaches that prevent comprehensive assessment of both technical and therapeutic dimensions, (4) insufficient evaluation of reliability dimensions such as safety, fairness, and privacy, and (5) under‑representation of mental‑health practitioners in evaluation processes.
To address these gaps, the authors propose three overarching recommendations. First, adopt methodological triangulation as a standard practice, integrating automated performance analysis, validated self‑report scales, and qualitative feedback to capture complementary aspects of CA effectiveness. Second, enforce temporal rigor by designing longitudinal studies that track user outcomes over months or years, thereby assessing sustained therapeutic impact and user retention. Third, embed equity considerations into metric selection: develop culturally adapted versions of psychometric tools, ensure diverse demographic representation (age, gender, ethnicity, language proficiency), and systematically evaluate bias, fairness, and privacy safeguards.
In conclusion, the review offers a structured taxonomy of evaluation metrics, a clear methodological framework, and a contextual map of current practices. It highlights that while technical performance and user satisfaction are well‑studied, the crucial connection between CA behaviour, ethical reliability, and genuine mental‑health improvement remains under‑explored. The authors’ synthesis provides a foundation for future researchers and practitioners to design more robust, inclusive, and outcome‑oriented evaluation protocols for mental‑health conversational agents.