A Large-Scale Unsupervised Dataset for Visual Content Memorability Modeling and Multimodal ToT Retrieval
📝 Abstract
Visual content memorability has intrigued the scientific community for decades, with applications ranging widely, from understanding nuanced aspects of human memory to enhancing content design. A significant challenge in progressing the field lies in the expensive process of collecting memorability annotations from humans. This limits the diversity and scalability of datasets for modeling visual content memorability. Most existing datasets are limited to collecting aggregate memorability scores for visual content, not capturing the nuanced memorability signals present in natural, open-ended recall descriptions. In this work, we introduce the first large-scale unsupervised dataset designed explicitly for modeling visual memorability signals, containing over 82,000 videos, accompanied by descriptive recall data. We leverage tip-of-the-tongue (ToT) retrieval queries from online platforms such as Reddit. We demonstrate that our unsupervised dataset provides rich signals for two memorability-related tasks: recall generation and ToT retrieval. Large vision-language models fine-tuned on our dataset outperform state-of-the-art models such as GPT-4o in generating open-ended memorability descriptions for visual content. We also employ a contrastive training strategy to create the first model capable of performing multimodal ToT retrieval. Our dataset and models present a novel direction, facilitating progress in visual content memorability research.
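The abstract names a contrastive training strategy for multimodal ToT retrieval but does not spell it out. As a rough illustration only, a symmetric InfoNCE objective is a common way to align query and video embeddings for retrieval; the sketch below assumes that setup, and the encoder outputs, embedding dimensions, and temperature are illustrative choices, not the paper's exact configuration.

```python
# Minimal sketch (assumed setup, not the paper's exact method):
# symmetric InfoNCE loss pairing ToT query embeddings with the
# embeddings of their ground-truth videos within a batch.
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor,
                  video_emb: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """query_emb, video_emb: (batch, dim) embeddings of matched pairs."""
    q = F.normalize(query_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    # (batch, batch) cosine-similarity matrix; diagonal entries are positives.
    logits = q @ v.t() / temperature
    targets = torch.arange(q.size(0), device=q.device)
    # Each query should retrieve its own video, and vice versa.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```

At inference time, ToT retrieval under this setup would rank candidate videos by cosine similarity between the query embedding and each video embedding.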
📄 Content
Unsupervised Memorability Modeling from Tip-of-the-Tongue Retrieval Queries

Sree Bhattacharyya¹, Yaman K. Singla², Sudhir Yarram², Somesh Singh², Harini SI², James Z. Wang¹
¹The Pennsylvania State University ²Adobe Media and Data Science Research
sfb6038@psu.edu, behavior-in-the-wild@googlegroups.com
1. Introduction

In today’s digital age, the constant stream of content delivered through the internet and social media inundates people with more information than ever before. Our minds, however, retain only part of such information effectively. Several factors influence what persists in memory, including characteristics of the information itself, as well as individual differences. Studying content memorability, especially for visual media, has long been a research goal within psychology, cognitive science, and computer science. Creating systems that can automatically model content memorability has diverse applications, including in education [12], advertising, marketing [16], and design [39, 48]. Within the field of Artificial Intelligence (AI), applications include creating personalized agents [41], tailored recommenders [4], and retrieval systems [44, 46], or providing a better computational understanding of human intelligence [31].

Past research within AI has mainly focused on understanding and modeling visual content memorability, spanning the modalities of images and videos. Adding to the findings from human studies in psychology [8, 33], these computational efforts have found that a large part of memorability can be attributed to factors intrinsic to the visual content [21, 26, 31], such as semantic concepts, emotions, or object categories. This is crucial to understanding and incorporating memorable properties within content, which can potentially generalize across all individuals interacting with it.

While several past research works study visual memorability, they usually depend on obtaining human annotations to create a memorability dataset. The most common method for collecting such data, especially for short-term recall, is a visual memory game [26]. For experiments studying long-term recall, the process becomes further challenging, as individual participants are required to first view the visual content and then respond to memorability-related questions a significant amount of time later. In [21], for example, the authors state that the total time for curating the dataset (which contains only 2,205 samples) was two years. This makes the data curation process less scalable, both in terms of the number of data points collected and the detail of memorability signals obtained from the participants. For example, most works express memorability for visual content through singular scores, aggregated over the data.

¹ Code available at: https://github.com/sreebhattacharyya/web_scale_memorability

[Figure 1: Example of a tip-of-the-tongue query paired with its ground-truth content. OP: “I used to be and still am a huge fan of horror, and from what i remember it was in an iceberg/compilation of scary media……I’ve been looking for this for a long time and i know it exists since i had found a screenshot of it a few months ago. i appreciate any kind of help!” Commenter: “Best Haul Video on the Internet! Uploaded by YouHaveSchizophrenia.” The OP’s post serves as the descriptive memorability recall; the video identified by the commenter is the ground-truth content.]
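To make the figure’s pairing concrete: a resolved ToT thread supplies supervision for both tasks named in the abstract, recall generation (video → open-ended description) and ToT retrieval (description → video). The sketch below shows one plausible way to derive training examples from such a thread; the field names and structure are assumptions for illustration, not the paper’s actual data schema.

```python
# Hypothetical sketch of converting a resolved ToT thread into training
# examples for the two tasks; all field names here are assumptions.
from dataclasses import dataclass

@dataclass
class ToTThread:
    query: str         # OP's open-ended recall description
    answer_title: str  # title of the content identified by a commenter
    video_id: str      # identifier of the ground-truth video

def to_recall_generation_example(t: ToTThread) -> dict:
    # Recall generation: given the video, produce a human-like
    # open-ended memorability description (the OP's query).
    return {"input_video": t.video_id, "target_text": t.query}

def to_retrieval_example(t: ToTThread) -> dict:
    # ToT retrieval: given the query, retrieve the ground-truth video.
    return {"query": t.query, "positive_video": t.video_id}
```

Pairs of this form could feed, respectively, the fine-tuning of a vision-language model for recall generation and the contrastive objective sketched after the abstract.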