A Systematic Evaluation of Large Language Models for PTSD Severity Estimation: The Role of Contextual Knowledge and Modeling Strategies

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv paper.

Large language models (LLMs) are increasingly used in a zero-shot fashion to assess mental health conditions, yet little is known about which factors affect their accuracy. In this study, we use a clinical dataset of natural-language narratives and self-reported PTSD severity scores from 1,437 individuals to comprehensively evaluate the performance of 11 state-of-the-art LLMs. To understand the factors affecting accuracy, we systematically varied (i) contextual knowledge, such as subscale definitions, distribution summaries, and interview questions, and (ii) modeling strategies, including zero-shot vs. few-shot prompting, amount of reasoning effort, model size, structured subscale vs. direct scalar prediction, output rescaling, and nine ensemble methods. Our findings indicate that (a) LLMs are most accurate when provided with detailed construct definitions and the context of the narrative; (b) increased reasoning effort leads to better estimation accuracy; (c) the performance of open-weight models (Llama, DeepSeek) plateaus beyond 70B parameters, while closed-weight models (o3-mini, GPT-5) improve with newer generations; and (d) the best performance is achieved by ensembling a supervised model with the zero-shot LLMs. Taken together, the results suggest that the choice of contextual knowledge and modeling strategy is important for deploying LLMs to accurately assess mental health.


💡 Research Summary

This paper presents a comprehensive, systematic evaluation of eleven state‑of‑the‑art large language models (LLMs) for estimating Post‑Traumatic Stress Disorder (PTSD) severity from free‑form clinical interview transcripts. Using a dataset of 1,437 participants that pairs natural‑language interview recordings with self‑reported PTSD Checklist for DSM‑5 (PCL‑5) scores, the authors explore how two orthogonal dimensions—contextual knowledge supplied to the model and modeling strategy—affect predictive accuracy.

Contextual Knowledge Manipulations
The study varies the amount and type of background information provided in prompts: (a) detailed definitions of the four PCL‑5 subscales (Re‑experiencing, Avoidance, Dysphoria, Hyperarousal) and item‑level descriptions, (b) study‑level context such as participant eligibility criteria and interview questions, and (c) distributional priors (e.g., the mean and standard deviation of scores in the sample). Results show that models given rich, structured knowledge achieve substantially higher Pearson correlations (up to r = 0.455 for subscale‑based 3‑shot prompting) and lower mean absolute error (MAE) than "context‑light" configurations, which can suffer a 40% drop in performance.
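To make the "contextual knowledge" manipulation concrete, the sketch below assembles a context-rich prompt from subscale definitions and a distributional prior. The subscale names follow the paper; the definition wording, helper names, and example values are illustrative assumptions, not the authors' actual prompt text.

```python
# Hypothetical sketch: building a context-rich PCL-5 estimation prompt.
# Subscale names come from the paper; definitions are paraphrased stand-ins.
SUBSCALES = {
    "Re-experiencing": "intrusive memories, nightmares, flashbacks",
    "Avoidance": "avoiding trauma-related thoughts, feelings, or reminders",
    "Dysphoria": "negative mood, detachment, loss of interest",
    "Hyperarousal": "irritability, hypervigilance, sleep disturbance",
}

def build_prompt(narrative, mean=None, sd=None):
    parts = ["Estimate PCL-5 PTSD severity from the interview below."]
    parts.append("Subscale definitions:")
    for name, desc in SUBSCALES.items():
        parts.append(f"- {name}: {desc}")
    if mean is not None and sd is not None:
        # Distributional prior: sample-level mean/SD of total scores.
        parts.append(f"In this sample, total scores have mean {mean:.1f} "
                     f"and SD {sd:.1f}.")
    parts.append("Interview transcript:\n" + narrative)
    return "\n".join(parts)

# Example values are made up for illustration.
prompt = build_prompt("I keep having nightmares...", mean=31.2, sd=18.5)
```

A "context-light" configuration would correspond to calling `build_prompt` with the narrative alone and omitting the definitions and prior.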

Modeling Strategies Explored

  1. Prompt Shot Number – zero‑shot, 1‑shot, and 3‑shot prompting.
  2. Chain‑of‑Thought (CoT) Variants – a "Think‑Step‑By‑Step" (TSBS) style that forces the model to verbalize intermediate reasoning. TSBS yields modest gains for the 8B and 405B models but can slightly degrade performance for the strongest 70B LLaMA and for GPT‑4o‑mini, indicating that explicit CoT is not universally beneficial.
  3. Reasoning Effort Levels – low, medium, high, operationalized by allowing the model to generate increasingly many “reasoning tokens”. High effort (≈2,200 tokens) consistently improves MAE (e.g., o3‑mini MAE drops from 9.56 to 8.23) and modestly raises r (from 0.388 to 0.422).
  4. Model Size and Architecture – Open‑source LLaMA‑3.1‑Instruct models plateau in performance beyond 70B parameters; larger variants (405B, 670B) do not surpass the 70B baseline. In contrast, closed‑weight, newer models such as OpenAI’s GPT‑5 continue to improve, achieving the highest zero‑shot r = 0.441 and 3‑shot r = 0.475 (statistically significant).
  5. Prediction Formulation – (i) Subscale‑based prediction (model outputs four subscale scores that are summed) and (ii) Direct scalar prediction (single total score). While subscale‑based outputs align with clinical practice, direct prediction with study‑context or distributional priors yields the best overall metrics (r ≈ 0.482, MAE ≈ 7.80).
  6. Post‑Processing – Predictive Redistribution – a calibration step that aligns the predicted score distribution with the empirical distribution of PCL‑5 scores. Applying this technique reduces MAE by 2–3 points across models, with only marginal gains in r, indicating improved absolute calibration without substantially altering rank ordering.
  7. Ensembling – Nine ensemble methods (simple averaging, weighted averaging, stacking, etc.) are evaluated. The top‑performing ensemble combines a supervised RoBERTa‑based regression baseline (trained on frozen embeddings) with the best zero‑shot LLMs (70B LLaMA‑Instruct and GPT‑5). This hybrid achieves r = 0.492 and MAE = 7.31, outperforming any single model.
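The "predictive redistribution" step (item 6) can be read as a form of quantile mapping: each prediction is replaced by the empirical PCL‑5 quantile at the same rank, so the predicted distribution matches the observed one while rank ordering is preserved. The paper names the technique but not its exact implementation; the version below is a minimal sketch under that quantile-mapping assumption.

```python
# Sketch of predictive redistribution as quantile mapping. The exact
# calibration used in the paper may differ; this is one common recipe.
import numpy as np

def redistribute(preds, observed):
    """Map each prediction onto the empirical score distribution at the
    same rank, preserving the predictions' ordering."""
    preds = np.asarray(preds, dtype=float)
    ranks = preds.argsort().argsort()        # rank of each prediction
    quantiles = (ranks + 0.5) / len(preds)   # mid-rank quantiles in (0, 1)
    return np.quantile(observed, quantiles)  # empirical quantile at each rank

# Toy data for illustration only (not from the study).
observed = np.array([5, 12, 20, 33, 47, 58, 64, 71])
preds = [30.0, 10.0, 55.0, 41.0]
calibrated = redistribute(preds, observed)
```

Because the mapping is monotonic, correlation-based metrics (r) change little while absolute errors can shrink, which matches the pattern reported for this post‑processing step.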

Key Findings

  • Detailed contextual knowledge is the single most influential factor for LLM accuracy in this clinical regression task.
  • Increasing the model’s reasoning effort (more CoT tokens) reliably improves absolute error and modestly boosts correlation.
  • Open‑source LLMs exhibit a performance ceiling around 70B parameters; newer closed‑weight models keep improving with each generation.
  • Direct scalar prediction with distributional priors can outperform clinically faithful subscale aggregation.
  • Calibration via predictive redistribution is an effective, low‑cost post‑processing step.
  • The best practical solution is a hybrid ensemble that leverages both supervised embeddings and zero‑shot LLM reasoning.
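The hybrid-ensemble finding can be illustrated with the simplest of the nine methods, weighted averaging over model predictions. The weights and predictions below are illustrative assumptions; the paper's best ensemble (stacking a supervised RoBERTa baseline with zero‑shot LLMs) would learn its combination from data rather than fix it by hand.

```python
# Minimal weighted-averaging ensemble over per-model predictions.
# Weights here are hand-picked for illustration, not the paper's values.
import numpy as np

def weighted_ensemble(pred_lists, weights):
    P = np.asarray(pred_lists, dtype=float)  # shape: (n_models, n_samples)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                          # normalize weights to sum to 1
    return w @ P                             # weighted average per sample

# Toy predictions from a supervised baseline and two zero-shot LLMs.
supervised = [22.0, 40.0, 15.0]
llm_a = [25.0, 35.0, 18.0]
llm_b = [20.0, 42.0, 12.0]
combined = weighted_ensemble([supervised, llm_a, llm_b], [0.5, 0.3, 0.2])
```

Stacking replaces the fixed weights with a small regression model fit on held-out predictions, which is one plausible route to the reported r = 0.492 hybrid.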

Implications
The work provides actionable guidance for deploying LLMs in mental‑health assessment pipelines. It underscores that naïve zero‑shot usage without carefully crafted prompts can lead to substantial errors, whereas thoughtful prompt engineering, controlled reasoning, and simple post‑processing can bring LLM performance close to, and in some configurations surpass, supervised baselines and human raters. Moreover, the plateau observed for open‑source models suggests that future research should prioritize better prompt design, knowledge grounding, and ensemble strategies over merely scaling model size. The findings pave the way for more reliable, scalable, and ethically responsible AI‑assisted PTSD severity estimation in clinical settings.

