Visual Affect Analysis: Predicting Emotions of Image Viewers with Vision-Language Models
Vision-language models (VLMs) show promise as tools for inferring affect from visual stimuli at scale, but it is not yet clear how closely their outputs align with human affective ratings. We benchmarked nine VLMs, ranging from state-of-the-art proprietary models to open-source models, on three psychometrically validated affective image datasets: the International Affective Picture System, the Nencki Affective Picture System, and the Library of AI-Generated Affective Images (LAI-GAI). The models performed two tasks in the zero-shot setting: (i) top-emotion classification (selecting the strongest discrete emotion elicited by an image) and (ii) continuous prediction of human ratings on 7- or 9-point Likert scales for discrete emotion categories and affective dimensions. We also evaluated the impact of rater-conditioned prompting on the LAI-GAI dataset using de-identified participant metadata. The results show good performance in discrete emotion classification, with accuracies typically ranging from 60% to 80% in the six-emotion setting and from 60% to 75% on a more challenging 12-category task. The predictions of anger and surprise had the lowest accuracy in all datasets. For continuous rating prediction, models showed moderate to strong alignment with humans (r > 0.75) but also exhibited consistent biases, notably weaker performance on arousal, and a tendency to overestimate response strength. Rater-conditioned prompting resulted in only small, inconsistent changes in predictions. Overall, VLMs capture broad affective trends but lack the nuance found in validated psychological ratings, highlighting their potential and current limitations for affective computing and mental health-related applications.
💡 Research Summary
This paper presents a comprehensive benchmark of nine state‑of‑the‑art vision‑language models (VLMs), ranging from proprietary systems (GPT‑4.1, Gemini‑2.5‑Flash) to open‑source architectures, on their ability to infer human affective responses to images. The authors evaluate the models on three psychometrically validated affective image corpora: the International Affective Picture System (IAPS), the Nencki Affective Picture System (NAPS), and the Library of AI‑Generated Affective Images (LAI‑GAI). Each dataset provides normative ratings collected under standardized protocols, including discrete emotion categories (six basic emotions in IAPS/NAPS, twelve in LAI‑GAI) and continuous dimensions such as valence, arousal, and approach‑avoidance, all on 7‑ or 9‑point Likert scales with 30‑70 raters per image.
The evaluation follows a strict zero‑shot protocol: models receive only task instructions, no fine‑tuning or exemplars, and the same prompts are used across all models to eliminate prompt‑engineering confounds. Three tasks are defined. Task 1 asks the model to select the single emotion that received the highest average human rating for each image (top‑emotion classification). Deterministic generation (temperature = 0.0, seed = 42) yields accuracies of 60‑80 % for the six‑emotion setting and 60‑75 % for the twelve‑emotion setting. Errors are systematic: anger and surprise are the least well‑predicted across all models, suggesting that these emotions rely heavily on contextual or cultural cues not captured by visual features alone.
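The paper's own pipeline is open-sourced; as a purely illustrative aside, the Task 1 metric reduces to exact-match accuracy between the model's chosen emotion and the emotion with the highest mean human rating per image. A minimal sketch, with made-up labels and predictions (the label set and toy data below are assumptions, not the paper's actual outputs):

```python
# Hypothetical six-emotion label set in the style of IAPS/NAPS norms.
EMOTIONS = ["happiness", "sadness", "fear", "anger", "disgust", "surprise"]

def top_emotion_accuracy(predictions, gold_top_emotions):
    """Fraction of images where the model's chosen emotion matches the
    emotion with the highest mean human rating for that image."""
    assert len(predictions) == len(gold_top_emotions)
    matches = sum(p == g for p, g in zip(predictions, gold_top_emotions))
    return matches / len(predictions)

# Toy run: 4 of 5 predictions agree with the human top emotion.
preds = ["fear", "happiness", "anger", "sadness", "disgust"]
gold = ["fear", "happiness", "surprise", "sadness", "disgust"]
acc = top_emotion_accuracy(preds, gold)  # 0.8
```

Note that the systematic anger/surprise confusions reported above would show up here as stable off-diagonal mass in a confusion matrix, not as random error.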
Task 2 requires the model to output a scalar rating on the same Likert scale used in the human studies, effectively performing regression on each affective dimension. To emulate human inter‑rater variability, the authors generate 50 samples per image with temperature = 0.5, then average the results. Pearson correlations between model predictions and human means range from r = 0.75 to r ≈ 0.88, indicating strong alignment overall. However, the arousal dimension consistently shows weaker correlations (r ≈ 0.60‑0.68), and across all dimensions models tend to overestimate response strength, reflecting a bias toward more extreme affective language in the underlying VLM training data.
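The sample-and-average protocol and the two headline statistics (Pearson r and overestimation bias) can be sketched with simulated data. This is not the paper's code: the toy "model" below is just the human mean plus noise and a positive offset, chosen to mimic the reported bias:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy stand-ins: mean human valence ratings for 20 images on a 1-9 scale.
human_means = rng.uniform(1, 9, size=20)

# Simulate 50 stochastic samples per image (temperature > 0 makes outputs
# vary). The +0.5 offset is an assumed stand-in for the overestimation
# bias the paper reports; the noise scale is likewise arbitrary.
samples = human_means[:, None] + 0.5 + rng.normal(0.0, 0.8, size=(20, 50))
model_means = samples.mean(axis=1)  # average over the 50 samples

# Alignment and bias, as in the paper's Task 2 analysis.
r = np.corrcoef(model_means, human_means)[0, 1]
bias = (model_means - human_means).mean()  # positive -> overestimation
```

Averaging 50 samples shrinks per-image noise by a factor of about sqrt(50), which is why the protocol recovers a stable mean rating even at nonzero temperature.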
Task 3 explores rater‑conditioned prompting on the LAI‑GAI dataset, where de‑identified participant metadata (age, sex, country, initial emotional state) is appended to the prompt. The goal is to test whether personal background can improve individual‑level prediction. The results show only marginal gains (1‑3 % absolute improvement) and no statistically significant difference, implying that current VLM architectures do not effectively integrate such metadata for fine‑grained affective personalization.
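Mechanically, rater-conditioned prompting amounts to prepending a short participant profile to the otherwise unchanged task instruction. A minimal sketch, assuming illustrative metadata field names (the paper's exact prompt wording and schema are not reproduced here):

```python
def build_prompt(base_instruction, rater=None):
    """Optionally prepend de-identified rater metadata to the task prompt.
    The field names below are illustrative, not the paper's exact schema."""
    if rater is None:
        return base_instruction
    profile = (
        f"You are predicting the rating of a {rater['age']}-year-old "
        f"{rater['sex']} participant from {rater['country']} whose initial "
        f"emotional state is '{rater['initial_state']}'.\n"
    )
    return profile + base_instruction

instruction = "Rate the arousal this image would evoke on a 1-9 scale."
prompt = build_prompt(
    instruction,
    rater={"age": 29, "sex": "female", "country": "Poland",
           "initial_state": "calm"},
)
```

Because the base instruction is identical with and without the profile, any change in predictions is attributable to the metadata alone, which is what makes the null result interpretable.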
The authors conclude that VLMs capture broad affective trends but lack the nuance required for high‑stakes applications such as mental‑health support or emotion‑aware content moderation. Specific limitations include (1) systematic under‑performance on certain emotions (anger, surprise), (2) weaker modeling of arousal, and (3) limited benefit from demographic conditioning. They recommend future work on (a) emotion‑specific prompt engineering, (b) architectural modifications to better fuse visual cues with affective semantics, (c) large‑scale human‑in‑the‑loop fine‑tuning to reduce over‑estimation bias, and (d) integration of physiological or contextual signals to improve arousal prediction. The paper also provides an open‑source evaluation pipeline, prompts, and scripts to facilitate reproducibility and further research in affective computing with multimodal foundation models.