Beyond Vision: Contextually Enriched Image Captioning with Multi-Modal Retrieval

Reading time: 5 minutes
...

📝 Original Info

📝 Abstract

Real-world image captions often lack contextual depth, omitting crucial details such as event background, temporal cues, outcomes, and named entities that are not visually discernible. This gap limits the effectiveness of image understanding in domains like journalism, education, and digital archives, where richer, more informative descriptions are essential. To address this, we propose a multimodal pipeline that augments visual input with external textual knowledge. Our system retrieves semantically similar images using BEIT-3 (Flickr30k-384 and COCO-384) and SigLIP So-384, reranks them using ORB and SIFT for geometric alignment, and extracts contextual information from related articles via semantic search. A Qwen3 model fine-tuned with QLoRA then integrates this context with base captions generated by InstructBLIP (Vicuna-7B) to produce event-enriched, context-aware descriptions. Evaluated on the OpenEvents v1 dataset, our approach generates significantly more informative captions than traditional methods, showing strong potential for real-world applications requiring deeper visual-textual understanding.
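The abstract describes a five-stage pipeline. A minimal sketch of that data flow is below; every helper name is a hypothetical stand-in introduced here for illustration, not a function from the authors' repository.

```python
# Hedged sketch of the five-stage pipeline described in the abstract.
# All helper functions are hypothetical stand-ins, shown only to make
# the data flow concrete.
def enrich_caption(query_image):
    # Stage 1: dense retrieval with BEIT-3 / SigLIP image embeddings.
    candidates = retrieve_similar_images(query_image, k=20)
    # Stage 2: geometric rerank with ORB/SIFT keypoint matching.
    matches = rerank_by_keypoints(query_image, candidates)
    # Stage 3: semantic search over articles linked to the matched images.
    context = search_related_articles(matches)
    # Stage 4: base caption of the visible content from InstructBLIP (Vicuna-7B).
    base_caption = instructblip_caption(query_image)
    # Stage 5: QLoRA fine-tuned Qwen3 fuses caption and context into an
    # event-enriched, paragraph-level description.
    return qwen3_generate(base_caption, context)
```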

💡 Deep Analysis

Figure 1: System Architecture of the Multimodal Caption Generation Framework

📄 Full Content

Beyond Vision: Contextually Enriched Image Captioning with Multi-Modal Retrieval

arXiv:2512.20042v2 [cs.CV] 1 Feb 2026

Nguyen Lam Phu Quy* (23122048@student.hcmus.edu.vn), University of Science - VNUHCM, Ho Chi Minh City, Vietnam
Pham Phu Hoa* (23122030@student.hcmus.edu.vn), University of Science - VNUHCM, Vietnam
Tran Chi Nguyen* (23122044@student.hcmus.edu.vn), University of Science - VNUHCM, Ho Chi Minh City, Vietnam
Dao Sy Duy Minh† (23122041@student.hcmus.edu.vn), University of Science - VNUHCM, Ho Chi Minh City, Vietnam
Nguyen Hoang Minh Ngoc† (ng0005oc@e.ntu.edu.sg), Nanyang Technological University, Singapore
Huynh Trung Kiet† (23132039@student.hcmus.edu.vn), University of Science - VNUHCM, Ho Chi Minh City, Vietnam

*The first three authors contributed equally as lead authors.
†The last three authors contributed equally in a supporting role.

Figure 1: System Architecture of the Multimodal Caption Generation Framework

Abstract

Real-world image captions often lack contextual depth, omitting crucial details such as event background, temporal cues, outcomes, and named entities that are not visually discernible. This gap limits the effectiveness of image understanding in domains like journalism, education, and digital archives, where richer, more informative descriptions are essential. To address this, we propose a multimodal pipeline that augments visual input with external textual knowledge. Our system retrieves semantically similar images using BEIT-3 (Flickr30k-384 and COCO-384) and SigLIP So-384, reranks them using ORB and SIFT for geometric alignment, and extracts contextual information from related articles via semantic search. A Qwen3 model fine-tuned with QLoRA then integrates this context with base captions generated by InstructBLIP (Vicuna-7B) to produce event-enriched, context-aware descriptions. Evaluated on the OpenEvents v1 dataset, our approach generates significantly more informative captions than traditional methods, showing strong potential for real-world applications requiring deeper visual-textual understanding. Our code is available at https://github.com/PhamPhuHoa-23/Event-Enriched-Image-Captioning-ReZeroSlavery

MM '25, October 27–31, 2025, Dublin, Ireland. © 2025 Copyright held by the owner/author(s). Publication rights licensed to ACM. ACM ISBN 978-1-4503-XXXX-X/2018/06.

CCS Concepts: • Computing methodologies

Keywords: event-enriching captioning, image-caption retrieval, image-caption generation, visual analysis

ACM Reference Format: Nguyen Lam Phu Quy, Pham Phu Hoa, Tran Chi Nguyen, Dao Sy Duy Minh, Nguyen Hoang Minh Ngoc, and Huynh Trung Kiet. 2025. Beyond Vision: Contextually Enriched Image Captioning with Multi-Modal Retrieval. In Proceedings of 2025 ACM Multimedia Conference (MM '25). ACM, New York, NY, USA, 7 pages. https://doi.org/XXXXXXX.XXXXXXX
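The two-stage retrieval named in the abstract (dense embedding search, then ORB/SIFT geometric rerank) can be illustrated with a short sketch. This is an assumption-laden illustration, not the authors' implementation: it stands in SigLIP via the Hugging Face checkpoint google/siglip-so400m-patch14-384 and uses OpenCV ORB with RANSAC inlier counting for the rerank; a SIFT variant via cv2.SIFT_create with NORM_L2 matching would be analogous.

```python
# Hedged sketch of two-stage image retrieval: dense SigLIP search,
# then a geometric rerank by ORB keypoint inliers. Model choice and
# wiring are illustrative assumptions, not the authors' exact code.
import cv2
import numpy as np
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

CKPT = "google/siglip-so400m-patch14-384"  # assumed SigLIP So-384 checkpoint
processor = AutoProcessor.from_pretrained(CKPT)
model = AutoModel.from_pretrained(CKPT).eval()

@torch.no_grad()
def embed(image: Image.Image) -> np.ndarray:
    """L2-normalized SigLIP image embedding."""
    inputs = processor(images=image, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)[0].numpy()

def retrieve(query: Image.Image, db_embeds: np.ndarray, k: int = 20) -> np.ndarray:
    """Stage 1: cosine-similarity search over precomputed unit-norm embeddings."""
    sims = db_embeds @ embed(query)
    return np.argsort(-sims)[:k]

def orb_inliers(img_a: np.ndarray, img_b: np.ndarray) -> int:
    """Stage 2 rerank signal: RANSAC homography inliers over ORB matches."""
    orb = cv2.ORB_create(nfeatures=2000)
    kp_a, des_a = orb.detectAndCompute(img_a, None)
    kp_b, des_b = orb.detectAndCompute(img_b, None)
    if des_a is None or des_b is None:
        return 0
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des_a, des_b)
    if len(matches) < 4:  # homography needs at least 4 correspondences
        return 0
    src = np.float32([kp_a[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp_b[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    _, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return int(mask.sum()) if mask is not None else 0
```

In such a setup, the stage-1 candidates would be sorted by descending inlier count before their linked articles are pulled for context extraction.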
1 Introduction

Image captioning has evolved from simple object recognition to sophisticated multimodal understanding, yet current approaches remain limited by their reliance on purely visual information [14, 15]. While state-of-the-art models can accurately describe visible elements, they fail to capture crucial contextual details such as event background, temporal dynamics, named entities, and real-world significance that extend beyond what is directly observable in the image.

Several recent works have attempted to incorporate external knowledge into image captioning [5, 25, 39]. However, these approaches typically generate brief, sentence-level descriptions that focus on immediate visual-textual alignment rather than comprehensive event understanding. For instance, knowledge-enhanced models may correctly identify "President Biden" in an image but fail to provide the rich contextual narrative about the specific meeting, its outcomes, or broader implications, producing captions of 10-20 words instead of the detailed, paragraph-length descriptions needed for meaningful event documentation.

This limitation is particularly problematic in domains like journalism and digital archives, where images serve as visual documentation of significant events requiring rich, contextually aware descriptions. A photograph of officials at a conference table may appear visually similar across diff…
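The generation stage is only named in the abstract: Qwen3 is fine-tuned with QLoRA to fuse the InstructBLIP base caption with retrieved article context. A plausible setup using bitsandbytes 4-bit quantization and peft LoRA adapters is sketched below; the checkpoint size, LoRA rank, target modules, and prompt template are all illustrative assumptions rather than the paper's reported configuration.

```python
# Hedged sketch: QLoRA setup for the caption-enrichment model.
# Hyperparameters and checkpoint are assumptions for illustration.
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

BASE = "Qwen/Qwen3-8B"  # assumed Qwen3 checkpoint size
bnb = BitsAndBytesConfig(
    load_in_4bit=True,                       # QLoRA: 4-bit quantized base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE, quantization_config=bnb, device_map="auto")
model = prepare_model_for_kbit_training(model)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,   # assumed rank/scaling
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)  # only the small adapter weights are trained

# Assumed training prompt: fuse the InstructBLIP base caption with retrieved
# article context; the enriched reference caption is the supervision target.
PROMPT = (
    "Base caption: {caption}\n"
    "Retrieved context: {context}\n"
    "Write an event-enriched caption:"
)
```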


Reference

This content is AI-processed based on open access ArXiv data.
