Projective Psychological Assessment of Large Multimodal Models Using Thematic Apperception Tests

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

The Thematic Apperception Test (TAT) is a projective psychological instrument designed to uncover unconscious aspects of personality by asking respondents to tell stories about ambiguous images. This study examines whether personality-like traits of Large Multimodal Models (LMMs) can be assessed through this non-language-based modality, scored with the Social Cognition and Object Relations Scale - Global (SCORS-G), a psychometrically grounded, multidimensional framework that differentiates cognitive-representational from affective-relational components of personality functioning. LMMs are employed in two distinct roles: as subject models (SMs), which generate stories in response to TAT images, and as evaluator models (EMs), which assess these narratives using the SCORS-G framework. Evaluators demonstrated an excellent ability to understand and analyze TAT responses, and their interpretations were highly consistent with those of human experts. The assessments indicate that all models grasp interpersonal dynamics and the concept of self well, yet they consistently fail to perceive and regulate aggression. Performance varied systematically across model families, with larger and more recent models outperforming smaller and earlier ones across SCORS-G dimensions.


💡 Research Summary

This paper investigates whether large multimodal models (LMMs) possess measurable “personality‑like” traits that can be assessed through a non‑language‑based, visual projective test. The authors employ the Thematic Apperception Test (TAT), a classic psychological instrument that presents ambiguous pictures and asks respondents to generate narratives, and they score those narratives using the Social Cognition and Object Relations Scale – Global (SCORS‑G). SCORS‑G provides eight global dimensions (Complexity of Representations, Affective Quality, Emotional Investment in Relationships, Emotional Investment in Moral Standards, Understanding of Social Causality, Experience and Management of Aggressive Impulses, Self‑Esteem, and Identity and Coherence of Self), each rated on a 7‑point scale.
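The rubric described above can be represented as a simple data structure with a validation helper. This is an illustrative sketch: the dimension names come from the paper, but the `validate_rating` helper is hypothetical.

```python
# SCORS-G dimension names as listed in the paper; the helper below is
# an illustrative sketch, not part of the original study's code.
SCORS_G_DIMENSIONS = [
    "Complexity of Representations",
    "Affective Quality",
    "Emotional Investment in Relationships",
    "Emotional Investment in Moral Standards",
    "Understanding of Social Causality",
    "Experience and Management of Aggressive Impulses",
    "Self-Esteem",
    "Identity and Coherence of Self",
]

def validate_rating(dimension: str, rating: int) -> int:
    """Reject unknown dimensions or ratings outside the 7-point scale."""
    if dimension not in SCORS_G_DIMENSIONS:
        raise ValueError(f"unknown SCORS-G dimension: {dimension}")
    if not 1 <= rating <= 7:
        raise ValueError(f"rating must be in 1..7, got {rating}")
    return rating
```

Keeping the dimension list in one place makes it easy to reuse when prompting evaluator models and when aggregating their scores.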

The experimental pipeline consists of four stages. First, a set of vision‑capable LMMs is selected as “subject models” (SMs). The authors include recent models from the GPT, Gemini, Claude, and LLaMA families, accessed via the OpenRouter unified API. Second, each SM is prompted with a shortened TAT consisting of seven standard images. For each image three different instruction phrasings are used, and each instruction‑image pair is repeated three times, yielding 63 stories per model (7 × 3 × 3). No conversation history is retained between prompts, ensuring independence of each story.
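The story-generation schedule described above (7 images × 3 instruction phrasings × 3 repetitions, each prompt sent with no conversation history) can be sketched as follows; the image and instruction identifiers are placeholders, not the paper's actual prompts.

```python
from itertools import product

# Placeholder identifiers; the actual TAT cards and instruction
# phrasings used in the study are not reproduced here.
IMAGES = [f"tat_card_{i}" for i in range(1, 8)]          # 7 images
INSTRUCTIONS = ["phrasing_a", "phrasing_b", "phrasing_c"]  # 3 phrasings
REPEATS = 3                                               # 3 repetitions

def build_schedule(images, instructions, repeats):
    """Return one prompt spec per independent story.

    Each prompt carries an empty conversation history, so no story
    can influence the generation of another.
    """
    return [
        {"image": img, "instruction": ins, "repeat": r, "history": []}
        for img, ins, r in product(images, instructions, range(repeats))
    ]

schedule = build_schedule(IMAGES, INSTRUCTIONS, REPEATS)
print(len(schedule))  # 63 stories per subject model (7 x 3 x 3)
```

Each entry in `schedule` would then be sent to a subject model as a single, stateless request (e.g., through a unified API such as OpenRouter).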

Third, a separate set of LMMs is designated as “evaluator models” (EMs). Each EM receives the story, the associated image, and the SCORS‑G rubric, and is asked to assign a rating for every dimension. To mitigate stochastic variation, each story is evaluated three times and the scores are averaged. Human experts have previously rated a benchmark set of 92 TAT stories; these human averages serve as the reference for validating EM performance.
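The repeated-evaluation averaging can be sketched as below. The paper specifies three evaluation runs per story; the dimension names and ratings in the example are illustrative.

```python
from statistics import mean

def average_ratings(runs):
    """Average SCORS-G ratings over repeated evaluation runs of one story.

    `runs` is a list of dicts mapping dimension name -> rating (1-7).
    Averaging over the three runs smooths stochastic decoding variation.
    """
    dims = runs[0].keys()
    return {d: mean(r[d] for r in runs) for d in dims}

# Illustrative ratings from three evaluation runs of the same story.
runs = [
    {"Affective Quality": 4, "Self-Esteem": 5},
    {"Affective Quality": 5, "Self-Esteem": 5},
    {"Affective Quality": 3, "Self-Esteem": 4},
]
print(average_ratings(runs))
```

The same averaging applies per story and per dimension before any model-level comparison is made.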

The authors first filter EMs by unsupervised inter‑rater agreement, selecting those with the highest consistency. The chosen EMs are then benchmarked against human raters, showing strong correlation across most SCORS‑G dimensions, confirming that the models can reliably apply the scoring framework.
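One simple proxy for the unsupervised agreement step is the mean pairwise Pearson correlation between evaluators' scores over a shared set of stories. This is a hedged sketch: the paper's exact agreement statistic is not specified here, and the data are illustrative.

```python
from statistics import mean, stdev

def pearson(x, y):
    """Sample Pearson correlation between two equal-length rating lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

def mean_pairwise_agreement(ratings_by_em):
    """Average correlation over all evaluator pairs.

    `ratings_by_em` maps evaluator name -> list of ratings over the
    same stories; higher values indicate more consistent evaluators.
    """
    ems = list(ratings_by_em)
    pairs = [(a, b) for i, a in enumerate(ems) for b in ems[i + 1:]]
    return mean(pearson(ratings_by_em[a], ratings_by_em[b]) for a, b in pairs)

# Illustrative ratings over four shared stories.
ratings = {
    "em_a": [1, 2, 3, 4],
    "em_b": [2, 3, 4, 5],
    "em_c": [1, 2, 3, 4],
}
print(mean_pairwise_agreement(ratings))
```

Evaluators whose scores correlate poorly with the rest of the pool would be dropped before the comparison against human raters.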

Results reveal systematic patterns across model families. Larger, more recent models achieve higher overall SCORS‑G scores, especially on dimensions related to interpersonal understanding (Understanding of Social Causality) and self‑concept (Identity and Coherence of Self). Conversely, all models score low on Experience and Management of Aggressive Impulses, indicating a consistent inability—or unwillingness—to recognize, express, or regulate aggression in the narratives. The authors discuss two plausible explanations: (1) current multimodal training data may under‑represent aggressive or conflictual scenarios, limiting the models’ internal representations; (2) safety‑filter mechanisms and social desirability bias cause models to avoid aggressive content when they infer they are being evaluated. The latter is supported by the observation that certain stories containing physical intimacy or potential violence were rejected by several models due to content‑filter blocks.

A detailed analysis of model‑by‑model performance shows that even within a family, newer or larger variants (e.g., GPT‑5 vs. GPT‑5‑mini, Claude‑sonnet‑4 vs. Claude‑3.5‑sonnet) outperform their counterparts on most dimensions, suggesting that architectural improvements and instruction‑tuning contribute to richer internal representations.

The paper acknowledges several limitations. SCORS‑G, while psychometrically validated for humans, still relies on human‑derived rating norms; thus, EMs may be reproducing human bias rather than revealing an independent “personality” of the model. The limited set of seven images restricts cultural and demographic generalizability. Moreover, the evaluation does not directly test whether the models’ low aggression scores stem from genuine representational deficits or from deliberate avoidance due to safety constraints.

Future work is proposed in three directions: (i) expanding the image set to include diverse cultural, age, and gender contexts; (ii) incorporating hybrid human‑model labeling to calibrate EM judgments and reduce reliance on purely algorithmic scoring; and (iii) designing controlled experiments that temporarily relax safety filters to disentangle genuine representational gaps from socially desirable response suppression.

In conclusion, the study demonstrates that LMMs can be probed with a visual projective test and scored with a multidimensional psychometric framework, revealing that modern multimodal models exhibit sophisticated interpersonal and self‑conceptual reasoning but consistently underperform on aggression‑related dimensions. This work opens a novel methodological avenue for assessing the “psychological” characteristics of AI systems beyond traditional questionnaire‑based approaches, with important implications for AI safety, alignment, and ethical deployment.

