Background: The House-Tree-Person (HTP) drawing test, introduced by John Buck in 1948, remains a widely used projective technique in clinical psychology. However, it has long faced challenges such as heterogeneous scoring standards, reliance on examiners' subjective experience, and the lack of a unified quantitative coding system.
Results: Quantitative experiments showed that the mean semantic similarity between Multimodal Large Language Model (MLLM) interpretations and human expert interpretations was approximately 0.75 (standard deviation about 0.05). In structurally oriented expert data sets, this similarity rose to 0.85, indicating expert-level baseline comprehension. Qualitative analyses demonstrated that the multi-agent system, by integrating social-psychological perspectives and destigmatizing narratives, effectively corrected visual hallucinations and produced psychological reports with high ecological validity and internal coherence.
Conclusions: The findings confirm the potential of multimodal large models as standardized tools for projective assessment. The proposed multi-agent framework, by dividing roles, decouples feature recognition from psychological inference and offers a new paradigm for digital mental-health services.
Keywords: House-Tree-Person test; multimodal large language model; multi-agent collaboration; cosine similarity; computational psychology; artificial intelligence
From Visual Perception to Deep Empathy: An Automated Assessment
Framework for House-Tree-Person Drawings Using Multimodal LLMs and
Multi-Agent Collaboration
Shuide Wen*1, Yu Sun2, Beier Ku3, Zhi Gao4, Lijun Ma5,
Yang Yang†6, and Can Jiao†7,8
1Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
2School of Psychology, Shenzhen University, Shenzhen, China
3Jesus College, University of Oxford, Oxford, UK
4Shenzhen Institute of Education Sciences, Shenzhen, China
5Department of Psychology, School of Public Health and Management,
Guangzhou University of Chinese Medicine, Guangzhou, China
6Harbin Institute of Technology, Harbin, China
7School of Government, Shenzhen University, Shenzhen, China
8The Shenzhen Humanities & Social Sciences Key Research Bases of the Center
for Mental Health, Shenzhen University, Shenzhen, China
wenshuide@sz.tsinghua.edu.cn, 2410161009@mails.szu.edu.cn,
beier.ku@jesus.ox.ac.uk,
career@sz.edu.cn, malj@gzucm.edu.cn, yfield@hit.edu.cn, jiaocan@szu.edu.cn
Abstract
Background The House–Tree–Person (HTP) drawing test, introduced by John Buck
in 1948 (Buck, 1948), remains one of the most widely used projective techniques in
clinical psychology. Yet it has long faced challenges such as heterogeneous scoring
standards, reliance on examiners’ subjective experience and a lack of a unified
quantitative coding system (Guo et al., 2023).
Objective This study seeks to verify the effectiveness of representative multimodal large
language models (MLLMs) in recognising and interpreting HTP features, and to develop
an automated assessment framework based on multi-agent collaboration to address
hallucinations and empathy deficits associated with single models.
Methods The research comprised two stages. In Study 1, 307 anonymised HTP
drawings previously interpreted by human experts (e.g., Yan Hu, Wang Long) were
re-interpreted by Qwen-VL-Plus, and cosine similarity between model-generated and
expert interpretations was computed using Doubao-embedding-vision. In Study 2, a
multi-agent system was built, incorporating roles such as an Observer, Interpreter,
Zeitgeist Observer and Listener (Park et al., 2023), to deliver deep assessments of typical
cases.
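The similarity measure named in Study 1 can be illustrated with a minimal sketch. It assumes each interpretation has already been converted to an embedding vector; the call to the embedding model is not shown, and the function names `cosine_similarity` and `summarise`, the toy vectors, and the use of the sample standard deviation are illustrative assumptions, not details taken from the paper.

```python
import math
from statistics import mean, stdev


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def summarise(pairs):
    """Return (mean, sample SD) of similarities over (model, expert) vector pairs."""
    sims = [cosine_similarity(model_vec, expert_vec) for model_vec, expert_vec in pairs]
    # Assumption: sample SD; the source does not say which estimator was used.
    spread = stdev(sims) if len(sims) > 1 else 0.0
    return mean(sims), spread


# Toy vectors standing in for real embeddings (hypothetical values).
toy_pairs = [
    ([1.0, 0.0, 1.0], [1.0, 0.0, 1.0]),  # identical direction
    ([1.0, 0.0], [0.0, 1.0]),            # orthogonal
]
avg, sd = summarise(toy_pairs)
print(f"mean similarity = {avg:.3f}, SD = {sd:.3f}")
```

Applied to all 307 model/expert pairs, a summary of this kind would yield figures of the sort reported in the Results paragraph, though the exact preprocessing used by the authors is not specified here.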
Results Quantitative experiments showed that the mean semantic similarity between
MLLM interpretations and human expert interpretations was around 0.75 (SD ≈ 0.05);
in structurally oriented expert data sets this rose to 0.85, indicating expert-level baseline
comprehension. Qualitative analyses demonstrated that the multi-agent system, by
integrating social-psychological perspectives and destigmatising narratives, effectively
corrected visual hallucinations and produced psychological reports with high ecological
validity and internal coherence.
Conclusions The findings confirm the potential of multimodal large models as
standardised tools for projective assessment. The multi-agent framework proposed
here, by dividing roles, decouples “feature recognition” from “psychological inference”
and offers a new paradigm for digital mental-health services.
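The role separation described above can be pictured as a sequential pipeline in which each agent sees only the previous agent's text output. The sketch below is a rough illustration under stated assumptions: the role names come from the Methods paragraph, but the prompts, the strictly linear ordering, the `Agent` class, the `assess` function and the `echo_model` stub are all hypothetical and do not reproduce the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A model is any callable taking (role_prompt, input_text) and returning text.
ModelFn = Callable[[str, str], str]


@dataclass
class Agent:
    name: str
    prompt: str  # hypothetical instruction text, not taken from the paper

    def run(self, model: ModelFn, payload: str) -> str:
        return model(self.prompt, payload)


# Assumed ordering: feature recognition first, inference and reframing afterwards.
PIPELINE = [
    Agent("Observer", "Describe only what is visibly drawn; do not interpret."),
    Agent("Interpreter", "Propose psychological hypotheses from the listed features."),
    Agent("Zeitgeist Observer", "Add social and cultural context to the hypotheses."),
    Agent("Listener", "Rewrite the assessment in an empathic, non-stigmatising tone."),
]


def assess(model: ModelFn, drawing_input: str) -> List[Tuple[str, str]]:
    """Pass the input through every agent in turn and keep a per-role trace."""
    trace = []
    payload = drawing_input
    for agent in PIPELINE:
        payload = agent.run(model, payload)
        trace.append((agent.name, payload))
    return trace


def echo_model(prompt: str, payload: str) -> str:
    """Stub standing in for a real MLLM call; it only tags the payload."""
    return f"[{prompt.split()[0]}] {payload}"


for role, output in assess(echo_model, "house with a large chimney"):
    print(f"{role}: {output}")
```

Keeping a per-role trace makes it possible to check whether an error originated in the Observer's description or in a later inferential step, which is the practical benefit of decoupling the two.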
Keywords House–Tree–Person test; multimodal large language model; multi-agent
collaboration; cosine similarity; computational psychology; artificial intelligence
1 Introduction
The House–Tree–Person (HTP) drawing test is a projective technique in which
individuals are asked to draw a house, a tree and a person, and the resulting images are
analysed to uncover unconscious emotions and personality traits. Since its introduction
by John Buck in 1948 (Buck, 1948) and subsequent systematisation by
Emanuel Hammer in 1958 (Hammer, 1958), the test has been widely used in clinical
evaluation, personnel selection and educational counselling. Because HTP is an
unstructured projective method, its scoring and interpretive depth depend heavily
on the examiner’s personal experience, leading to inconsistent readings of the same
graphical symbols (e.g., chimneys, ornaments). A systematic review noted that current
studies lack uniform methods for selecting and interpreting drawing indicators (Guo et
al., 2023). Although 39 drawing features have been reported as significant predictors
of mental disorders, their categories, measurement dimensions and interpretive frameworks remain inconsistent
(Guo et al., 2023); for example, a chimney has been interpreted as a sign of family
conflict in some studies and of warmth and external support in others (Guo et al.,
2023). Such heterogeneity limits test reliability and restricts large-scale application while
complicating clinical training and replication.
In recent years, advances in artificial intelligence—especially multimodal large language
models (MLLMs) capable of visual understanding and complex reasoning—have made
automated psychological assessment possible. However, existing work largely focuses
on simple image classification or object detection tasks and lacks deep reasoning about
psychodynamic mechanisms; in ad
…(Full text truncated)…
This content is AI-processed based on ArXiv data.