From Visual Perception to Deep Empathy: An Automated Assessment Framework for House-Tree-Person Drawings Using Multimodal LLMs and Multi-Agent Collaboration

February 23, 2026

Reading time: 5 minutes

📝 Original Info

Title: From Visual Perception to Deep Empathy: An Automated Assessment Framework for House-Tree-Person Drawings Using Multimodal LLMs and Multi-Agent Collaboration
ArXiv ID: 2512.21360
Date: 2025-12-23
Authors: Shuide Wen, Yu Sun, Beier Ku, Zhi Gao, Lijun Ma, Yang Yang, Can Jiao

📝 Abstract

Background: The House-Tree-Person (HTP) drawing test, introduced by John Buck in 1948, remains a widely used projective technique in clinical psychology. However, it has long faced challenges such as heterogeneous scoring standards, reliance on examiners' subjective experience, and the lack of a unified quantitative coding system.

Results: Quantitative experiments showed that the mean semantic similarity between Multimodal Large Language Model (MLLM) interpretations and human expert interpretations was approximately 0.75 (standard deviation about 0.05). In structurally oriented expert data sets, this similarity rose to 0.85, indicating expert-level baseline comprehension. Qualitative analyses demonstrated that the multi-agent system, by integrating social-psychological perspectives and destigmatizing narratives, effectively corrected visual hallucinations and produced psychological reports with high ecological validity and internal coherence.

Conclusions: The findings confirm the potential of multimodal large models as standardized tools for projective assessment. The proposed multi-agent framework, by dividing roles, decouples feature recognition from psychological inference and offers a new paradigm for digital mental-health services.

Keywords: House-Tree-Person test; multimodal large language model; multi-agent collaboration; cosine similarity; computational psychology; artificial intelligence

💡 Deep Analysis

Deep Dive into From Visual Perception to Deep Empathy: An Automated Assessment Framework for House-Tree-Person Drawings Using Multimodal LLMs and Multi-Agent Collaboration.
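The paper's Methods score agreement between model-generated and expert interpretations by cosine similarity over embedding vectors, and the Results summarise those scores as a mean and standard deviation. The following is a minimal sketch of that computation, not the authors' code. It assumes the embeddings are already available as plain lists of floats (the paper obtains them from Doubao-embedding-vision, which is not called here), and the function names `cosine_similarity` and `summarize_similarities` are hypothetical. The source does not say whether the reported SD is a sample or population statistic; the sketch uses the sample form.

```python
import math
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def summarize_similarities(
    pairs: List[Tuple[Sequence[float], Sequence[float]]]
) -> Tuple[float, float]:
    """Return (mean, sample SD) of similarities over (model, expert) pairs."""
    sims = [cosine_similarity(model_vec, expert_vec) for model_vec, expert_vec in pairs]
    # Sample SD is an assumption; it needs at least two scores.
    spread = stdev(sims) if len(sims) > 1 else 0.0
    return mean(sims), spread


if __name__ == "__main__":
    # Toy vectors for illustration only; these are not data from the paper.
    toy_pairs = [
        ([1.0, 0.0], [1.0, 0.0]),
        ([1.0, 1.0], [1.0, 0.0]),
        ([0.5, 0.5], [0.4, 0.6]),
    ]
    avg, sd = summarize_similarities(toy_pairs)
    print(f"mean={avg:.3f} sd={sd:.3f}")
```

In a setup like Study 1, each pair would hold the embedding of one model interpretation and the embedding of the matching expert interpretation for the same drawing.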

📄 Full Content

From Visual Perception to Deep Empathy: An Automated Assessment Framework for House-Tree-Person Drawings Using Multimodal LLMs and Multi-Agent Collaboration

Shuide Wen *1, Yu Sun 2, Beier Ku 3, Zhi Gao 4, Lijun Ma 5, Yang Yang †6, and Can Jiao †7,8

1 Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
2 School of Psychology, Shenzhen University, Shenzhen, China
3 Jesus College, University of Oxford, Oxford, UK
4 Shenzhen Institute of Education Sciences, Shenzhen, China
5 Department of Psychology, School of Public Health and Management, Guangzhou University of Chinese Medicine, Guangzhou, China
6 Harbin Institute of Technology, Harbin, China
7 School of Government, Shenzhen University, Shenzhen, China
8 The Shenzhen Humanities & Social Sciences Key Research Bases of the Center for Mental Health, Shenzhen University, Shenzhen, China

wenshuide@sz.tsinghua.edu.cn, 2410161009@mails.szu.edu.cn, beier.ku@jesus.ox.ac.uk, career@sz.edu.cn, malj@gzucm.edu.cn, yfield@hit.edu.cn, jiaocan@szu.edu.cn

Abstract

Background: The House–Tree–Person (HTP) drawing test, introduced by John Buck in 1948 (Buck, 1948), remains one of the most widely used projective techniques in clinical psychology. Yet it has long faced challenges such as heterogeneous scoring standards, reliance on examiners’ subjective experience and a lack of a unified quantitative coding system (Guo et al., 2023).

Objective: This study seeks to verify the effectiveness of representative multimodal large language models (MLLMs) in recognising and interpreting HTP features, and to develop an automated assessment framework based on multi-agent collaboration to address hallucinations and empathy deficits associated with single models.

Methods: The research comprised two stages. In Study 1, 307 anonymised HTP drawings previously interpreted by human experts (e.g., Yan Hu, Wang Long) were re-interpreted by Qwen-VL-Plus, and cosine similarity between model-generated and expert interpretations was computed using Doubao-embedding-vision. In Study 2, a multi-agent system was built, incorporating roles such as an Observer, Interpreter, Zeitgeist Observer and Listener (Park et al., 2023), to deliver deep assessments of typical cases.

Results: Quantitative experiments showed that the mean semantic similarity between MLLM interpretations and human expert interpretations was around 0.75 (SD ≈ 0.05); in structurally oriented expert data sets this rose to 0.85, indicating expert-level baseline comprehension. Qualitative analyses demonstrated that the multi-agent system, by integrating social-psychological perspectives and destigmatising narratives, effectively corrected visual hallucinations and produced psychological reports with high ecological validity and internal coherence.

Conclusions: The findings confirm the potential of multimodal large models as standardised tools for projective assessment. The multi-agent framework proposed here, by dividing roles, decouples “feature recognition” from “psychological inference” and offers a new paradigm for digital mental-health services.

Keywords: House–Tree–Person test; multimodal large language model; multi-agent collaboration; cosine similarity; computational psychology; artificial intelligence

1 Introduction

The House–Tree–Person (HTP) drawing test is a projective technique in which individuals are asked to draw a house, a tree and a person, and the resulting images are analysed to uncover unconscious emotions and personality traits. Since its introduction by John Buck in 1948 (Buck, 1948) and subsequent systematisation by Emanuel Hammer in 1958 (Hammer, 1958), the test has been widely used in clinical evaluation, personnel selection and educational counselling.

HTP belongs to non-structured projective methods; its scoring and interpretive depth depend heavily on the examiner’s personal experience, leading to inconsistent readings of the same graphical symbols (e.g., chimneys, ornaments). A systematic review noted that current studies lack uniform methods for selecting and interpreting drawing indicators (Guo et al., 2023). Although 39 features are significant predictors of mental disorders, their categories, measurement dimensions and interpretive frameworks remain muddled (Guo et al., 2023); for example, a chimney has been interpreted as a sign of family conflict in some studies and of warmth and external support in others (Guo et al., 2023). Such heterogeneity limits test reliability and restricts large-scale application while complicating clinical training and replication.

In recent years, advances in artificial intelligence, especially multimodal large language models (MLLMs) capable of visual understanding and complex reasoning, have made automated psychological assessment possible. However, existing work largely focuses on simple image classification or object detection tasks and lacks deep reasoning about psychodynamic mechanisms; in ad

…(Full text truncated)…
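The abstract says the multi-agent system divides roles so that feature recognition is decoupled from psychological inference, and it names an Observer, Interpreter, Zeitgeist Observer and Listener. The excerpt does not describe what each role does internally, so the sketch below is only one possible reading: the responsibilities assigned to each role, the `CaseState` record, and the `describe` callback standing in for a multimodal model call are all assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class CaseState:
    """Shared record that each agent reads from and writes to (hypothetical)."""
    drawing_id: str
    features: List[str] = field(default_factory=list)       # what is visible
    hypotheses: List[str] = field(default_factory=list)     # what it might mean
    context_notes: List[str] = field(default_factory=list)  # social framing
    report: str = ""


Agent = Callable[[CaseState], CaseState]


def make_observer(describe: Callable[[str], List[str]]) -> Agent:
    """Observer: feature recognition only. `describe` stands in for an MLLM call."""
    def observer(state: CaseState) -> CaseState:
        state.features = list(describe(state.drawing_id))
        return state
    return observer


def interpreter(state: CaseState) -> CaseState:
    """Interpreter: infers from the recorded features, never from the image."""
    state.hypotheses = [f"tentative reading of: {f}" for f in state.features]
    return state


def zeitgeist_observer(state: CaseState) -> CaseState:
    """Zeitgeist Observer: adds social context (assumed responsibility)."""
    state.context_notes = ["weigh each reading against the drawer's social context"]
    return state


def listener(state: CaseState) -> CaseState:
    """Listener: assembles the final report (assumed responsibility)."""
    lines = [f"Drawing {state.drawing_id}"]
    lines += [f"- observed: {f}" for f in state.features]
    lines += [f"- {h}" for h in state.hypotheses]
    lines += [f"- note: {n}" for n in state.context_notes]
    state.report = "\n".join(lines)
    return state


def default_agents(describe: Callable[[str], List[str]]) -> List[Agent]:
    """Order the roles so recognition always precedes inference."""
    return [make_observer(describe), interpreter, zeitgeist_observer, listener]


def run_pipeline(state: CaseState, agents: List[Agent]) -> CaseState:
    """Pass the shared state through each agent in turn."""
    for agent in agents:
        state = agent(state)
    return state
```

A call such as `run_pipeline(CaseState("demo"), default_agents(lambda _id: ["placeholder feature"]))` exercises the flow with a stub in place of a real model. Because the Interpreter only sees `state.features`, a recognition error can be corrected in one place without touching the inference step, which is the kind of separation the abstract describes.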

Reference

This content is AI-processed based on ArXiv data.
