Automatic Item Generation for Personality Situational Judgment Tests with Large Language Models
Personality assessment through situational judgment tests (SJTs) offers unique advantages over traditional Likert-type self-report scales, yet SJT development remains labor-intensive, time-consuming, and heavily dependent on subject matter experts. Recent advances in large language models (LLMs) have shown promise for automatic item generation (AIG). Building on these developments, the present study develops and evaluates a structured, generalizable framework for automatically generating personality SJTs, using GPT-4 and ChatGPT-5 as empirical examples. Three studies were conducted. Study 1 systematically compared the effects of prompt design and temperature settings on the content validity of LLM-generated items in order to establish an effective and stable LLM-based AIG approach for personality SJTs. Results showed that optimized prompts and a temperature of 1.0 achieved the best balance of creativity and accuracy on GPT-4. Study 2 examined the cross-model generalizability and reproducibility of this automated SJT generation approach across multiple generation rounds. The results showed that the approach consistently produced reproducible, high-quality items on ChatGPT-5. Study 3 evaluated the psychometric properties of LLM-generated SJTs covering five facets of the Big Five personality traits. Results demonstrated satisfactory reliability and validity across most facets, though limitations were observed in the convergent validity of the compliance facet and in certain aspects of criterion-related validity. These findings provide robust evidence that the proposed LLM-based AIG approach can produce culturally appropriate and psychometrically sound SJTs with efficiency comparable to or exceeding that of traditional methods.
💡 Research Summary
The paper presents a comprehensive framework for automatically generating personality situational judgment tests (SJTs) using large language models (LLMs), specifically GPT‑4 and ChatGPT‑5. Recognizing that SJTs offer distinct advantages over traditional Likert‑type self‑report scales—such as reduced social desirability bias, higher ecological validity, and lower cognitive load—the authors address the major bottleneck of SJT development: the labor‑intensive, expert‑driven process of crafting realistic scenarios and plausible response options.
The study is organized into three sequential experiments. In Study 1, the authors systematically explore how prompt design and temperature settings affect the content validity of LLM‑generated items. Four prompt variants (basic, step‑by‑step, chain‑of‑thought, and example‑enhanced) are combined with four temperature values (0.5, 0.7, 1.0, 1.3) to produce 96 SJT items using GPT‑4. A panel of seven subject‑matter experts rates each item on a 5‑point content‑validity scale. Results indicate that a temperature of 1.0 yields the best trade‑off between creativity and accuracy, and that step‑by‑step and chain‑of‑thought prompts achieve the highest average validity scores (≈4.2/5).
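To make the generation procedure concrete, the sketch below shows one way such a prompt-variant by temperature grid could be driven through the OpenAI Python SDK. The prompt templates, model name, and items-per-cell count are illustrative assumptions, not the authors' actual Study 1 materials.

```python
# Illustrative sketch only: the prompt wording, model name, and item counts are
# placeholders, not the materials used in the paper's Study 1.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT_VARIANTS = {
    "basic": "Write a situational judgment test item measuring {facet}.",
    "step_by_step": (
        "Write a situational judgment test item measuring {facet}. "
        "First describe a realistic everyday scenario, then write four "
        "response options varying in trait expression, then label the keyed option."
    ),
    # chain-of-thought and example-enhanced variants would be defined similarly
}
TEMPERATURES = [0.5, 0.7, 1.0, 1.3]

def generate_items(facet: str, items_per_cell: int = 6) -> list[dict]:
    """Generate items for every prompt-variant x temperature condition."""
    items = []
    for variant, template in PROMPT_VARIANTS.items():
        for temp in TEMPERATURES:
            for _ in range(items_per_cell):
                resp = client.chat.completions.create(
                    model="gpt-4",
                    temperature=temp,
                    messages=[{"role": "user", "content": template.format(facet=facet)}],
                )
                items.append({
                    "variant": variant,
                    "temperature": temp,
                    "text": resp.choices[0].message.content,
                })
    return items
```

Recording the prompt variant and temperature alongside each generated item keeps the subsequent expert content-validity ratings easy to aggregate per condition.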
Study 2 tests cross‑model generalizability by applying the same optimal prompt‑temperature configuration to ChatGPT‑5 across three generation rounds, yielding 180 items. Rating consistency (Cohen’s κ = 0.78) and average quality scores (≈4.1/5) remain comparable to GPT‑4, demonstrating that the framework is robust to model changes and can be reproduced reliably.
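Cohen's κ is a chance-corrected agreement index computed over paired categorical ratings. As a purely illustrative aside (the rating vectors below are invented, not the study's data), it can be obtained with scikit-learn as follows:

```python
# Minimal illustration of Cohen's kappa over paired categorical ratings
# (e.g., "acceptable" vs. "revise" judgments from two rating passes).
# The vectors below are made up for demonstration, not the study's data.
from sklearn.metrics import cohen_kappa_score

pass_1 = ["acceptable", "acceptable", "revise", "acceptable", "revise"]
pass_2 = ["acceptable", "revise",     "revise", "acceptable", "revise"]

kappa = cohen_kappa_score(pass_1, pass_2)
print(f"Cohen's kappa = {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance level
```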
Study 3 evaluates the psychometric properties of the automatically generated SJTs. The authors generate 20 items for each of five Big Five facets (Extraversion‑Activity, Extraversion‑Sociability, Conscientiousness‑Responsibility, Neuroticism‑Anxiety, and Compliance‑Cooperativeness), totaling 100 items. A pilot sample of 1,200 online participants completes the SJTs, and the authors assess internal consistency (McDonald’s ω ranging from .71 to .84) and convergent validity with self‑report Big Five measures (correlations r = .45 to .62). Most facets meet acceptable reliability and validity thresholds, but the Compliance facet shows weaker performance (ω = .58, r = .31) and some criterion‑related validity analyses reveal modest predictive power for external outcomes. The authors attribute these shortcomings to the relative scarcity of compliance‑related language in the LLM’s training data and to insufficient prompt emphasis on socially normative behavior.
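For reference, McDonald's ω (total) for a facet scale is commonly estimated from a one-factor model as ω = (Σλ)² / [(Σλ)² + Σθ], where λ are the standardized loadings and θ the unique variances. The snippet below is a rough sketch of that computation using the factor_analyzer package; it illustrates the formula and is not the authors' analysis code.

```python
# Rough sketch of McDonald's omega-total from a one-factor solution:
# omega = (sum of loadings)^2 / ((sum of loadings)^2 + sum of uniquenesses).
# Assumes `items` is a respondents-by-items DataFrame for a single facet,
# with all items keyed in the same direction. Illustration only.
import pandas as pd
from factor_analyzer import FactorAnalyzer

def mcdonald_omega(items: pd.DataFrame) -> float:
    fa = FactorAnalyzer(n_factors=1, rotation=None)
    fa.fit(items)
    loadings = fa.loadings_.flatten()      # standardized loadings, one per item
    uniquenesses = fa.get_uniquenesses()   # unique (error) variances
    common = loadings.sum() ** 2
    return common / (common + uniquenesses.sum())
```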
The discussion highlights several key contributions. First, the study demonstrates that carefully engineered prompts and an appropriate temperature setting can produce high‑quality, culturally appropriate SJT items with efficiency that rivals or exceeds traditional expert‑driven methods. Second, the cross‑model reproducibility suggests that the approach can be adapted to future LLM releases without extensive re‑engineering. Third, the psychometric evaluation confirms that LLM‑generated SJTs can achieve satisfactory reliability and construct validity across most personality facets, though targeted refinements are needed for more socially nuanced traits.
Limitations include the reliance on a single expert panel for content validity judgments, potential overfitting to the specific prompts used, and the modest performance of the Compliance facet. Future work is proposed to explore multi‑model ensembles, human‑in‑the‑loop refinement workflows, and facet‑specific prompt templates to address these gaps.
Finally, the authors make all raw data, the full item bank, and supplementary materials publicly available on OSF, promoting transparency and enabling replication. The paper thus provides a solid empirical foundation for leveraging LLMs in large‑scale, cost‑effective personality assessment, opening avenues for rapid test development in resource‑constrained settings and for expanding SJT methodology across diverse cultural contexts.