Comparing affective responses to standardized pictures and videos: A study report

Multimedia documents such as text, images, sounds or videos elicit emotional responses of different polarity and intensity in exposed human subjects. These stimuli are stored in affective multimedia databases. The problem of emotion processing is an important issue in Human-Computer Interaction and different interdisciplinary studies particularly those related to psychology and neuroscience. Accurate prediction of users’ attention and emotion has many practical applications such as the development of affective computer interfaces, multifaceted search engines, video-on-demand, Internet communication and video games. To this regard we present results of a study with N=10 participants to investigate the capability of standardized affective multimedia databases in stimulation of emotion. Each participant was exposed to picture and video stimuli with previously determined semantics and emotion. During exposure participants’ physiological signals were recorded and estimated for emotion in an off-line analysis. Participants reported their emotion states after each exposure session. The a posteriori and a priori emotion values were compared. The experiment showed, among other reported results, that carefully designed video sequences induce a stronger and more accurate emotional reaction than pictures. Individual participants’ differences greatly influence the intensity and polarity of experienced emotion.

💡 Research Summary

The paper investigates whether standardized affective multimedia databases—specifically image and video collections—differ in their ability to elicit emotional responses in human participants. Ten adult volunteers (balanced gender, ages 20‑35) were exposed to a set of 20 pictures and 20 short video clips, each previously annotated for valence (positive, neutral, negative) and arousal (high, low). While participants viewed each stimulus, four physiological signals were recorded continuously at 500 Hz: electrocardiogram (ECG), galvanic skin response (GSR), respiration rate, and electro‑oculogram (EOG). Immediately after each stimulus, participants completed a 7‑point Likert self‑report rating of their experienced emotion intensity and polarity.

The authors pre‑processed the physiological data (band‑pass filtering, baseline correction) and extracted a suite of features: heart‑rate variability metrics, GSR peak count and amplitude, respiratory variability, and fixation duration from eye‑movement data. These features fed two machine‑learning classifiers—Random Forest and Support Vector Machine—to predict the a priori emotion label associated with each stimulus. Classification performance (accuracy, F1‑score) was then compared with the a posteriori self‑report and physiological indices.

Results show that video clips produce significantly stronger and more reliable emotional reactions than static pictures. Self‑reported intensity was higher for videos (mean = 5.8) than for pictures (mean = 4.3; p < 0.01). Correspondingly, physiological arousal markers such as GSR amplitude increased more during video exposure (0.42 µS vs. 0.27 µS; p < 0.05). The machine‑learning models achieved 78 % classification accuracy for video‑induced emotions versus 62 % for picture‑induced emotions, indicating that videos generate clearer multimodal signatures that are easier to decode. However, substantial inter‑individual variability was observed; some participants exhibited pronounced responses to specific emotions (e.g., fear), while others showed muted reactions to the same stimuli. The authors suggest that personal traits, cultural background, and baseline affective sensitivity likely modulate these differences.

In the discussion, the authors argue that the temporal dynamics and auditory components of video stimuli provide richer affective cues, leading to heightened engagement and more consistent physiological patterns. They propose that affective user‑interfaces, adaptive multimedia retrieval systems, and emotion‑aware gaming could benefit from prioritizing video over image content when robust emotional elicitation is required. Limitations include the small sample size, lack of cultural diversity, and the fact that video clips were longer and more complex than the pictures, potentially confounding modality with stimulus duration. Future work is recommended to involve larger, more heterogeneous participant pools, to control for stimulus length, and to explore real‑time emotion detection pipelines using deep neural networks.

In conclusion, the study provides empirical evidence that standardized affective video databases outperform image databases in provoking stronger, more accurate emotional responses, both subjectively and physiologically. This finding supports the integration of video‑based affective stimuli in research and application domains that rely on precise emotion modeling.

💡 Research Summary

📜 Original Paper Content