MERGE: Multimodal Entity-based News Image Caption Generation
📝 Abstract
News image captioning aims to produce journalistically informative descriptions by combining visual content with contextual cues from associated articles. Despite recent advances, existing methods struggle with three key challenges: (1) incomplete information coverage, (2) weak cross-modal alignment, and (3) suboptimal visual-entity grounding. To address these issues, we introduce MERGE, the first Multimodal Entity-aware Retrieval-augmented GEneration framework for news image captioning. MERGE constructs an entity-centric multimodal knowledge base (EMKB) that integrates textual, visual, and structured knowledge, enabling enriched background retrieval. It improves cross-modal alignment through a multistage hypothesis-caption strategy and enhances visual-entity matching via dynamic retrieval guided by image content. Extensive experiments on GoodNews and NYTimes800k show that MERGE significantly outperforms state-of-the-art baselines, with CIDEr gains of +6.84 and +1.16 in caption quality, and F1-score improvements of +4.14 and +2.64 in named entity recognition. Notably, MERGE also generalizes well to the unseen Visual News dataset, achieving +20.17 in CIDEr and +6.22 in F1-score, demonstrating strong robustness and domain adaptability.
📄 Content
Knowledge Completes the Vision: A Multimodal Entity-aware Retrieval-Augmented Generation Framework for News Image Captioning

Xiaoxing You1,*, Qiang Huang2,*, Lingyu Li1, Chi Zhang3, Xiaopeng Liu3, Min Zhang2, Jun Yu2,4,†
1Hangzhou Dianzi University, Hangzhou, China
2Harbin Institute of Technology (Shenzhen), Shenzhen, China
3People’s Daily, Beijing, China
4Peng Cheng Laboratory, Shenzhen, China
{youxiaoxing, lilingyu0571}@hdu.edu.cn, {huangqiang, zhangmin2021, yujun}@hit.edu.cn, {zhangchi, liuxiaopeng}@pdnews.cn
Code — https://github.com/youxiaoxing/MERGE

1 Introduction

News articles typically include images accompanied by captions that blend visual elements with contextual details, enhancing reader comprehension and engagement. Unlike vanilla image captioning methods (Vinyals et al. 2016; Hossain et al. 2019; Yu et al. 2019; Xu et al. 2023), which primarily describe visible content, news image captioning demands both precise entity recognition and the incorporation of deeper contextual knowledge. Editors must analyze key elements, such as people, events, time, and location, and craft captions tailored to diverse journalistic contexts, where the same image may require entirely different descriptions (Nguyen et al. 2023).

*Equal contribution. †Corresponding author.
Copyright © 2026, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Figure 1: Challenges in news image captioning: (a) Identifying entities absent from the article; (b) Aligning numerical details and visual objects across modalities; (c) Disambiguating entities in images with multiple subjects. Panel examples:
Ground-Truth: Ruth and Jake in the “Constellations.” Tell (Tran et al., 2020): Jake and Franco in “Burn This.” Kalarani et al. (2023): Alle and Jake in “Burn This.” MERGE: Ruth and Jake in “Constellations.”
Ground-Truth: The 2011 Toyota Tacoma. Tell (Tran et al., 2020): The 2005 Toyota Tacoma. Kalarani et al. (2023): The 2005 Toyota Tacoma. MERGE: The 2011 Toyota Tacoma.
Ground-Truth: Chloe, Luke and Jason in “Tap World.” Tell (Tran et al., 2020): A scene from “Tap World.” Kalarani et al. (2023): A scene from Tap World. MERGE: Chloe, Luke and Jason in “Tap World.”

Automated news image captioning has been widely studied to assist editors. Early template-based systems (Ramisa et al. 2018; Biten et al. 2019; Hu, Chen, and Jin 2020) filled predefined templates with entities.
While effective for structured output, these methods often yield rigid captions lacking nuanced context. Transformer-based models (Tran, Mathews, and Xie 2020; Zhao and Wu 2024) have introduced richer modeling of visual features, such as faces and objects, to improve entity-aware captioning. However, they often struggle to extract precise details from long or noisy articles, leading to incomplete or generic captions. Another prominent line of work (Zhou et al. 2022; Qu, Tuytelaars, and Moens 2024) focuses on extracting relevant textual contexts from articles. Techniques in this direction often leverage pre-trained or fine-tuned CLIP models (Radford et al. 2021) to retrieve salient sentences while minimizing redundancy. Yet, these methods typically fall short in establishing deep semantic connections between visual elements and textual narratives. More recently, Multimodal Large Language Models (MLLMs) (Xu et al. 2024a; Zhang, Zhang, and Wan 2024) have shown great promise by jointly modeling visual and textual modalities. Their advanced reasoning capabilities and flexibility make them well-suited for the complex demands of news image captioning. Despite significant progress, as illustrated in Figure 1, existing appro

arXiv:2511.21002v1 [cs.CV] 26 Nov 2025
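The retrieval-augmented pipeline the abstract describes, retrieving entity background by image similarity and conditioning generation on it, can be sketched in miniature. This is a toy illustration only: the knowledge base, embeddings, and prompt format below are invented for the example and do not reflect MERGE's actual EMKB, CLIP encoder, or MLLM.

```python
# Toy sketch of entity-aware retrieval-augmented caption generation:
# score knowledge-base entries against an image embedding by cosine
# similarity, then assemble the top facts into a generation prompt.
import math

def cosine(a, b):
    """Cosine similarity between two plain-list vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_background(image_emb, knowledge_base, top_k=2):
    """Return the top_k entries most similar to the image embedding."""
    ranked = sorted(knowledge_base,
                    key=lambda e: cosine(image_emb, e["emb"]),
                    reverse=True)
    return ranked[:top_k]

def build_prompt(article_snippet, entries):
    """Combine article context and retrieved entity facts into a prompt
    that would be handed to a caption generator (e.g. an MLLM)."""
    background = "; ".join(e["fact"] for e in entries)
    return (f"Article: {article_snippet}\n"
            f"Entity background: {background}\n"
            f"Write a news caption naming the entities shown.")

# Invented mini knowledge base with hand-made 3-d "embeddings".
emkb = [
    {"fact": "Ruth and Jake appeared in 'Constellations'", "emb": [0.9, 0.1, 0.0]},
    {"fact": "'Constellations' is a two-character play",    "emb": [0.8, 0.2, 0.1]},
    {"fact": "The Toyota Tacoma is a pickup truck",         "emb": [0.0, 0.1, 0.9]},
]

image_emb = [0.85, 0.15, 0.05]  # pretend image embedding of a stage photo
prompt = build_prompt("A two-hander play opened this week.",
                      retrieve_background(image_emb, emkb))
print(prompt)
```

Because the stage photo's embedding is close to the two theater facts, only those reach the prompt; the unrelated Tacoma fact is filtered out, which is the behavior dynamic image-guided retrieval is meant to provide.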
This content is AI-processed based on ArXiv data.