Self-Augmented Visual Contrastive Decoding


Authors: Eun Woo Im, Muhammad Kashif Ali, Vivek Gupta

Published as a conference paper at ICLR 2026

Self-Aug: Query and Entropy Adaptive Decoding for Large Vision-Language Models

Eun Woo Im 1*, Muhammad Kashif Ali 2, Vivek Gupta 1
1 Arizona State University   2 Friedrich-Alexander-Universität Erlangen-Nürnberg
{eunwooim, vgupt140}@asu.edu, kashif.m.ali@fau.de

Abstract

Large Vision-Language Models (LVLMs) have demonstrated remarkable multimodal capabilities, but they inherit the tendency to hallucinate from their underlying language models. While visual contrastive decoding has been proposed to mitigate this issue, existing methods often apply generic visual augmentations that disregard the specific context provided by the text query, limiting their effectiveness. This study introduces a novel training-free decoding strategy that addresses these limitations, featuring two key contributions. First, a self-augmentation prompting strategy leverages the intrinsic knowledge of the model to dynamically align semantics between the query and the visual augmentation. Second, an adaptive thresholding algorithm adjusts the next-token candidate size based on output sparsity, utilizing the full information of the logit distribution. Extensive experiments across five LVLMs and seven benchmarks demonstrate that the proposed decoding significantly enhances factual consistency compared to state-of-the-art decoding methods. This work highlights the importance of integrating query-dependent augmentation and entropy-aware decoding for improving the effective generation of LVLMs.

1 Introduction

Large Language Models (LLMs) have achieved remarkable success in language comprehension, generation, and reasoning (Brown et al., 2020; Google, 2023; Touvron et al., 2023; Chiang et al., 2023; OpenAI, 2023).
By integrating visual encoding and projection, Large Vision-Language Models (LVLMs) have extended these capabilities to multimodal applications such as visual perception and planning (Li et al., 2022; Yu et al., 2022; Li et al., 2023a; Maaz et al., 2023; Ye et al., 2023; Zhang et al., 2023a; Zhu et al., 2023; Huang et al., 2023). Despite their impressive performance, LVLMs inherit critical limitations from their foundational language models. One of the most significant issues is hallucination, the phenomenon of generating plausible but factually incorrect or nonsensical outputs. This behavior is largely a byproduct of the model's auto-regressive training objective, which maximizes token likelihood based on surface-level statistical patterns and thereby incentivizes reliance on spurious correlations over a precise understanding of underlying facts (Bender & Koller, 2020). Advanced decoding methods can significantly enhance factual consistency by shaping how token sequences are selected from output distributions at each generation step (Van der Poel et al., 2022; Favero et al., 2024). A prominent decoding strategy for reducing hallucination is Contrastive Decoding (CD) (Li et al., 2023c), a technique that improves factuality by contrasting the outputs of an expert model with those of a weaker, amateur counterpart (Zhang et al., 2023b; Chuang et al., 2023). Motivated by this principle, Visual Contrastive Decoding (VCD) (Leng et al., 2024) was introduced to improve the general perceptual capabilities of LVLMs by contrasting the standard output with amateur logits generated from an input image degraded by random noise. Subsequent research on VCD has primarily focused on determining, via experimental heuristics, which visual modifications or hidden states maximize the sample variance while maintaining the semantics (Li et al., 2023b; Huang et al., 2024).
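As a minimal sketch of this family of methods, VCD-style decoding commonly combines expert and amateur logits as (1 + α)·l − α·l′ and pairs the subtraction with an adaptive plausibility constraint that keeps only sufficiently probable tokens (Li et al., 2023c; Leng et al., 2024). The numpy sketch below is illustrative only; the values of α and β are assumptions, not hyperparameters reported in this paper:

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def vcd_step(expert_logits, amateur_logits, alpha=1.0, beta=0.1):
    """One VCD-style decoding step (illustrative sketch).

    expert_logits:  logits from the original image.
    amateur_logits: logits from the degraded (e.g. noised) image.
    Returns contrastive scores with implausible tokens masked out.
    """
    expert = np.asarray(expert_logits, dtype=float)
    amateur = np.asarray(amateur_logits, dtype=float)
    # Contrast: boost tokens the clean view prefers over the degraded view.
    scores = (1.0 + alpha) * expert - alpha * amateur
    # Adaptive plausibility constraint: drop tokens whose expert probability
    # falls below beta times the maximum expert probability.
    probs = softmax(expert)
    scores[probs < beta * probs.max()] = -np.inf
    return scores

expert = np.array([4.0, 3.5, 1.0, -2.0])
amateur = np.array([1.0, 3.5, 1.0, -2.0])  # degraded view still favors token 1
scores = vcd_step(expert, amateur)
```

Token 0, whose score collapses under degradation, is promoted over token 1, which survives degradation unchanged; the plausibility mask prevents low-probability tokens from being erroneously amplified by the subtraction.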
However, these methods often overlook the critical role of the input text query, which specifies which aspects of an image are relevant to the user's request. For instance, asking a model to identify an object in an image and asking it to solve a handwritten math problem require entirely different capabilities and reasoning from the LVLM. While VACoDe (Kim et al., 2024b) addressed this by estimating, in a brute-force manner, the divergence between logit distributions over a predefined set of visual augmentations at the first generation step, it has two fundamental limitations. First, first-token divergence is an empirical measure that does not always guarantee a favorable augmentation choice for the entire generation sequence. Second, its dependence on a single token makes it suitable for short, multiple-choice style answers but fundamentally limits its effectiveness for complex tasks requiring open-ended generation and multi-step reasoning. Moreover, a further challenge in contrastive decoding arises from the subtraction of the amateur logits from the expert logits (Li et al., 2023c). This operation can have the undesired effect of amplifying the scores of certain tokens: if the amateur model produces a negative logit value, the corresponding final score is erroneously increased (Lyu et al., 2024). To mitigate this amplification effect, existing methods (Jin et al., 2024) truncate the vocabulary set using a threshold set proportionally to the maximum value of the expert logit distribution. However, while this approach effectively penalizes false positives, its reliance on a single data point (i.e., the maximum logit) prevents it from utilizing the rich information encoded in the full logit distribution, such as model confidence. These limitations lead us to two main research questions.

*Corresponding author. Project page: https://eunwooim.github.io/selfaug
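Before turning to those questions, the thresholding contrast can be made concrete. The sketch below places the max-based cutoff just described next to an entropy-scaled variant in the spirit of the confidence-aware thresholding this paper advocates; the exact rule is defined in the main text, so the scaling and the beta_min/beta_max values here are illustrative assumptions only:

```python
import numpy as np

def candidate_mask(logits, beta=0.1):
    """Baseline plausibility constraint: keep tokens whose probability is at
    least beta times the maximum probability. Only the single largest value
    of the distribution influences the cutoff."""
    logits = np.asarray(logits, dtype=float)
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return p >= beta * p.max()

def entropy_adaptive_mask(logits, beta_min=0.02, beta_max=0.5):
    """Entropy-scaled variant (illustrative, not the paper's exact rule):
    a peaked, confident distribution yields a strict cutoff and a small
    candidate set; a flat, uncertain one keeps more candidates."""
    logits = np.asarray(logits, dtype=float)
    p = np.exp(logits - logits.max())
    p /= p.sum()
    ent = -np.sum(p * np.log(p + 1e-12))
    ent_norm = ent / np.log(len(p))                 # normalized to [0, 1]
    beta = beta_max - (beta_max - beta_min) * ent_norm
    return p >= beta * p.max()

confident = np.array([10.0, 1.0, 1.0, 1.0])   # low entropy: one candidate
uncertain = np.array([1.0, 1.0, 1.0, 1.2])    # high entropy: all four kept
```

Unlike the baseline, the entropy-scaled cutoff responds to the whole distribution: the same beta range prunes aggressively when the model is confident and relaxes when it is not.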
(1) How can the semantic intent of a text query guide the selection of a visual augmentation to elicit a maximally informative discrepancy for contrastive decoding? (2) Is there a correlation between the predictive confidence of the model and the plausibility of its next-token candidates? To address these questions, this study introduces Self-Aug, a novel decoding strategy that adaptively selects the visual augmentation best suited to the context. Unlike prior work (Kim et al., 2024b), Self-Aug utilizes intrinsic model knowledge to determine an optimal visual modification out of the box. Furthermore, we introduce an improved thresholding algorithm, Sparsity Adaptive Truncation (SAT), to overcome the limitations of existing plausibility constraints. Where prior methods often fail to use the full information in the logits, SAT dynamically determines a threshold by treating the entire logit distribution as a proxy for the confidence of the output. The proposed method integrates seamlessly into any LVLM without requiring architectural modifications or additional training. Extensive experiments and analysis verify that the proposed methods significantly enhance factual consistency and reduce hallucinations across multiple models and benchmarks. The contributions of this study are summarized as follows:

1. This work introduces Self-Aug, a prompting strategy that leverages the parametric knowledge of the model to select a visual augmentation that is semantically relevant to the textual query, thereby extracting a more informative discrepancy.
2. The proposed SAT improves the existing adaptive plausibility constraint by leveraging the entropy of the expert logits to dynamically set a threshold on token implausibility.
3. Extensive experiments validate the effectiveness of the proposed method across five LVLMs and seven benchmarks.
The results demonstrate that Self-Aug significantly reduces hallucinations while amplifying the relevance and informativeness of the responses.

2 Preliminaries

Auto-regressive Generation of LVLMs. Suppose that f_θ is an LVLM (Gong et al., 2023; Maaz et al., 2023; Li et al., 2025a), parameterized by θ. The model operates on a vocabulary set V, and the set of all possible token sequences can be denoted by its Kleene closure, V* = ∪_{i≥0} V^i, where i indicates the timestamp of the LVLM output. The function f_θ : V* × R^{h×w×3} → V* auto-regressively generates a response from a given text query x ∈ V* and a visual input v ∈ R^{h×w×3}. At each timestep t, the LVLM computes a logit distribution over the vocabulary for the next token y_t, conditioned on the inputs (x, v) and the sequence of previously generated tokens y_{<t}.

If l′ remains as accurate as l, the contrastive formulation yields minimal performance gains because the two distributions are nearly identical. Therefore, the failure case for all CD-based methods is the case in which the two assumptions are invalidated (i.e., l[i] > l′[i] and l[j] ≤ l′[j]). Tab. 7 demonstrates this phenomenon using the averaged scores of Qwen3-VL-32B across three separate runs. Determining and addressing such failures is beyond the scope of this study and is deferred to future work.

Limitations and Future Work. The proposed method suggests several branches for future work by addressing current limitations. First, the effectiveness of SAS Prompting depends on the reasoning and instruction-following ability of the base model. Less capable models might produce malformed outputs or poor augmentation choices. This dependency could be addressed in future work by developing more robust prompting methods (e.g., Chain-of-Thought (Wei et al., 2022)) or by utilizing a smaller, specialized model for the selection task.
Second, the current method is limited to a predefined set of visual augmentations. While this set covers common scenarios, it may not contain the best augmentation for highly specialized visual reasoning tasks. A promising solution involves developing methods that can dynamically select from a larger and more diverse library of transformations using external modules (e.g., an object detector), enhancing versatility. Finally, extending the principle to the video domain by designing temporally aware dynamic recalibration methods for obtaining contrastive pairs would enable frame-consistent decoding. This direction aids the understanding of complex temporal contexts and facilitates downstream applications, as highlighted in previous studies (Im et al., 2023; Ali et al., 2024).

6 Conclusion

This study introduces Self-Aug, a novel decoding strategy designed to mitigate hallucinations in LVLMs. The proposed method aligns the semantics between the query and the visual augmentation by leveraging the flexible intrinsic reasoning of the model, without relying on predefined heuristics. In addition, the proposed sparsity adaptive truncation introduces confidence-aware thresholding that dynamically adjusts candidate sets based on logit entropy, effectively penalizing false positives. Extensive experiments conducted across five LVLMs and seven benchmarks demonstrated that Self-Aug consistently improves factual consistency over existing decoding strategies while maintaining practical computational efficiency. Beyond the immediate performance gains, this study underlines the importance of coupling query-augmentation semantics with confidence-sensitive decoding as a principled approach for developing more robust multimodal generation.

Acknowledgments

We gratefully acknowledge the Complex Data Reasoning and Analysis Lab at Arizona State University for their resources and computational support.
References

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems (NeurIPS), 35:23716–23736, 2022.
Muhammad Kashif Ali, Eun Woo Im, Dongjin Kim, and Tae Hyun Kim. Harnessing meta-learning for improving full-frame video stabilization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12605–12614, 2024.
Muhammad Kashif Ali, Eun Woo Im, Dongjin Kim, Tae Hyun Kim, Vivek Gupta, Haonan Luo, and Tianrui Li. Harnessing meta-learning for controllable full-frame video stabilization. arXiv preprint arXiv:2508.18859, 2025.
Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 1(2):3, 2023.
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint, 2025.
Sourya Basu, Govardana Sachitanandam Ramachandran, Nitish Shirish Keskar, and Lav R Varshney. Mirostat: A neural text decoding algorithm that directly controls perplexity. arXiv preprint arXiv:2007.14966, 2020.
Emily M Bender and Alexander Koller. Climbing towards NLU: On meaning, form, and understanding in the age of data. In Annual Meeting of the Association for Computational Linguistics (ACL), pp. 5185–5198, 2020.
Alexander Borichev and Yuri Tomilov. Optimal polynomial decay of functions and operator semigroups. Mathematische Annalen, 347(2):455–478, 2010.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems (NeurIPS), 33:1877–1901, 2020.
Zhaorun Chen, Zhuokai Zhao, Hongyin Luo, Huaxiu Yao, Bo Li, and Jiawei Zhou. HALC: Object hallucination reduction via adaptive focal-contrast decoding. arXiv preprint arXiv:2403.00425, 2024a.
Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites. Science China Information Sciences, 67(12):220101, 2024b.
Wei-Lin Chiang, Zhuohan Li, Ziqing Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2(3):6, 2023.
Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James Glass, and Pengcheng He. DoLa: Decoding by contrasting layers improves factuality in large language models. arXiv preprint arXiv:2309.03883, 2023.
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. Advances in Neural Information Processing Systems (NeurIPS), 36:49250–49267, 2023.
Imant Daunhawer, Thomas M Sutter, Kieran Chin-Cheong, Emanuele Palumbo, and Julia E Vogt. On the limitations of multimodal VAEs. arXiv preprint, 2021.
Alessandro Favero, Luca Zancato, Matthew Trager, Siddharth Choudhary, Pramuditha Perera, Alessandro Achille, Ashwin Swaminathan, and Stefano Soatto. Multi-modal hallucination control by visual information grounding.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 14303–14312, 2024.
Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2024.
Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. MultiModal-GPT: A vision and language model for dialogue with humans. arXiv preprint arXiv:2305.04790, 2023.
Google. Bard. https://bard.google.com/, 2023.
Alex Graves. Sequence transduction with recurrent neural networks. arXiv preprint, 2012.
Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, et al. A survey on LLM-as-a-judge. arXiv preprint, 2024.
Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning (ICML), pp. 1321–1330. PMLR, 2017.
Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. 2019.
Qidong Huang, Xiaoyi Dong, Pan Zhang, Bin Wang, Conghui He, Jiaqi Wang, Dahua Lin, Weiming Zhang, and Nenghai Yu. OPERA: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13418–13427, 2024.
Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Barun Patra, et al. Language is not all you need: Aligning perception with language models. Advances in Neural Information Processing Systems (NeurIPS), 36:72096–72109, 2023.
Eun Woo Im, Junsung Shin, Sungyong Baik, and Tae Hyun Kim. Deep variational bayesian modeling of haze degradation process. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pp. 895–904, 2023.
Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023.
Zhuoran Jin, Pengfei Cao, Yubo Chen, Kang Liu, Xiaojian Jiang, Jiexin Xu, Qiuxia Li, and Jun Zhao. Tug-of-war between knowledge: Exploring and resolving knowledge conflicts in retrieval-augmented language models. arXiv preprint arXiv:2402.14409, 2024.
Junho Kim, Hyunjun Kim, Kim Yeonju, and Yong Man Ro. CODE: Contrasting self-generated description to combat hallucination in large multi-modal models. Advances in Neural Information Processing Systems (NeurIPS), 37:133571–133599, 2024a.
Sihyeon Kim, Boryeong Cho, Sangmin Bae, Sumyeong Ahn, and Se-Young Yun. VACoDe: Visual augmented contrastive decoding. arXiv preprint arXiv:2408.05337, 2024b.
Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, and Lidong Bing. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13872–13882, 2024.
Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Joshua Adrian Cahyono, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025a.
Chaoyu Li, Eun Woo Im, and Pooyan Fazli. VidHalluc: Evaluating temporal hallucinations in multimodal large language models for video understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 13723–13733, 2025b.
Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning (ICML), pp. 12888–12900. PMLR, 2022.
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning (ICML), pp. 19730–19742. PMLR, 2023a.
Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model. Advances in Neural Information Processing Systems (NeurIPS), 36:41451–41530, 2023b.
Xiang Lisa Li, Ari Holtzman, Daniel Fried, Percy Liang, Jason Eisner, Tatsunori Hashimoto, Luke Zettlemoyer, and Mike Lewis. Contrastive decoding: Open-ended text generation as optimization. 2023c.
Yaoyiran Li, Anna Korhonen, and Ivan Vulić. Self-augmented in-context learning for unsupervised word translation. arXiv preprint arXiv:2402.10024, 2024.
Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint, 2023d.
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 740–755. Springer, 2014.
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems (NeurIPS), 36:34892–34916, 2023.
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.
26296–26306, 2024a.
Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, 2024b.
Xinyu Lyu, Beitao Chen, Lianli Gao, Hengtao Shen, and Jingkuan Song. Alleviating hallucinations in large vision-language models through hallucination-induced optimization. Advances in Neural Information Processing Systems (NeurIPS), 37:122811–122832, 2024.
Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-ChatGPT: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023.
Potsawee Manakul, Adian Liusie, and Mark JF Gales. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. arXiv preprint arXiv:2303.08896, 2023.
Clara Meister, Tiago Pimentel, Gian Wiher, and Ryan Cotterell. Locally typical sampling. Transactions of the Association for Computational Linguistics, 11:102–121, 2023.
OpenAI. ChatGPT. https://openai.com/blog/chatgpt/, 2023.
Yeji Park, Deokyeong Lee, Junsuk Choe, and Buru Chang. ConVis: Contrastive decoding with hallucination visualization for mitigating hallucinations in multimodal large language models. In Association for the Advancement of Artificial Intelligence (AAAI), volume 39, pp. 6434–6442, 2025.
Maitreya Patel, Naga Sai Abhiram Kusumba, Sheng Cheng, Changhoon Kim, Tejas Gokhale, Chitta Baral, et al. TripletCLIP: Improving compositional reasoning of CLIP via synthetic vision-language negatives. Advances in Neural Information Processing Systems (NeurIPS), 37:32731–32760, 2024.
SW Provencher. A fourier method for the analysis of exponential decay curves. Biophysical Journal, 16(1):27–41, 1976.
Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning. arXiv preprint arXiv:1809.02156, 2018.
Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-OKVQA: A benchmark for visual question answering using world knowledge. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 146–162. Springer, 2022.
Claude E Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379–423, 1948.
Weijia Shi, Xiaochuang Han, Mike Lewis, Yulia Tsvetkov, Luke Zettlemoyer, and Wen-tau Yih. Trusting your evidence: Hallucinate less with context-aware decoding. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), pp. 783–791, 2024.
Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, et al. Aligning large multimodal models with factually augmented RLHF. arXiv preprint arXiv:2309.14525, 2023.
Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? Exploring the visual shortcomings of multimodal LLMs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9568–9578, 2024.
Gabriele N Tornetta. Entropy methods for the confidence assessment of probabilistic classification models. arXiv preprint arXiv:2103.15157, 2021.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint, 2023.
Liam Van der Poel, Ryan Cotterell, and Clara Meister. Mutual information alleviates hallucinations in abstractive summarization. arXiv preprint arXiv:2210.13210, 2022.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems (NeurIPS), 35:24824–24837, 2022.
Simon Wenkel, Khaled Alhazmi, Tanel Liiv, Saud Alrshoud, and Martin Simon. Confidence score: The forgotten dimension of object detection performance evaluation. Sensors, 21(13):4350, 2021.
Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mPLUG-Owl: Modularization empowers large language models with multimodality. arXiv preprint, 2023.
Shukang Yin, Chaoyou Fu, Sirui Zhao, Tong Xu, Hao Wang, Dianbo Sui, Yunhang Shen, Ke Li, Xing Sun, and Enhong Chen. Woodpecker: Hallucination correction for multimodal large language models. Science China Information Sciences, 67(12):220105, 2024.
Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. CoCa: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022.
Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. MM-Vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023.
Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D Goodman. STaR: Self-taught reasoner bootstrapping reasoning with reasoning. In Advances in Neural Information Processing Systems (NeurIPS), volume 1126, 2024.
Hang Zhang, Xin Li, and Lidong Bing. Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023a.
Yue Zhang, Leyang Cui, Wei Bi, and Shuming Shi. Alleviating hallucinations of large language models through induced hallucinations. arXiv preprint arXiv:2312.15710, 2023b.
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems (NeurIPS), 36:46595–46623, 2023.
Yiyang Zhou, Chenhang Cui, Jaehong Yoon, Linjun Zhang, Zhun Deng, Chelsea Finn, Mohit Bansal, and Huaxiu Yao. Analyzing and mitigating object hallucination in large vision-language models. arXiv preprint arXiv:2310.00754, 2023.
Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.

Appendix

Due to space limitations in the main manuscript, this appendix provides supplementary materials that elaborate on the proposed design, experimental settings, and visualizations. This includes the complete prompt design for Self-Augmentation Selection (SAS) and LLM-as-a-Judge for reasoning quality, additional qualitative examples, extended experimental results, and a detailed breakdown of the model and benchmark information.

A Additional Experimental Setup Details

The visual augmentations were implemented on top of the official VCD (Leng et al., 2024) source code with the following parameters. The color inversion operation was performed using the PyTorch torchvision.transforms.functional.invert function. For the random crop and random mask augmentations, a ratio of 2.0 was used, which corresponds to applying the operation to a randomly placed square patch with side lengths equal to half the original image dimensions.
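A minimal numpy sketch of the random mask operation under these parameters follows (a ratio of 2.0 occludes a randomly placed square patch whose sides are half the image dimensions); this is illustrative only, as the paper's implementation sits on top of the VCD code with torchvision:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_mask(image, ratio=2.0):
    """Occlude a randomly placed square patch whose side length is the
    image side divided by `ratio` (ratio=2.0 -> half of each dimension),
    mirroring the setup described above. Illustrative sketch only."""
    h, w = image.shape[:2]
    ph, pw = int(h / ratio), int(w / ratio)
    top = rng.integers(0, h - ph + 1)
    left = rng.integers(0, w - pw + 1)
    out = image.copy()
    out[top:top + ph, left:left + pw] = 0  # zero out the occluded patch
    return out

img = np.ones((8, 8, 3))
masked = random_mask(img)  # exactly one 4x4 patch of the 8x8 image is zeroed
```

The occluded copy is what the amateur forward pass receives, so objects hidden by the patch can no longer support their tokens in the contrastive subtraction.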
For the noise augmentation, a diffusion noise step of 500 was applied using the official VCD random noise implementation.

B Full Prompt Design

This section provides the verbatim prompts used for both the self-augmentation selection and the subsequent reasoning quality assessment. The full SAS Prompt, which leverages in-context learning and reasoning to achieve optimal query-augmentation semantic alignment, is presented first. It is followed by the prompt used to instruct the LLM-as-a-Judge for the evaluation of the SAS reasoning trace against the Oracle. The individual effects of the reasoning and in-context learning components within the SAS Prompt are quantified in the ablation study section of the main manuscript.

B.1 SAS Prompting

You are an expert data augmentation analyst. Your task is to select the single most semantically disruptive image augmentation that most effectively invalidates the question's premise or prevents a confident answer. Provide a clear reason explaining why the augmentation is chosen, then state your final choice.

## Augmentations and Their Effects ##
- Vertical flip: Flips image top-to-bottom. Disrupts questions about "above", "below", "under", or reading orientation.
- Color inversion: Replaces each color with its complement. Disrupts questions relying on accurate color identification.
- Random crop: Removes random parts of the image. Disrupts questions requiring global context or peripheral objects.
- Random mask: Occludes portions of the image. Disrupts object presence, count, or attribute recognition.
- Noise: Adds visual distortion. Disrupts questions requiring small details, texture, or text clarity.
- Horizontal flip: Flips the image left-to-right. Disrupts questions about left/right positioning and left-to-right text reading.

## Examples ##
Question: "Is the mirror above the TV?" Reason: The question focuses on vertical positioning.
Vertical flip reverses top and bottom, making "above" mean "below," invalidating the question. Other augmentations don't affect vertical relationships. Choice: vertical flip

Question: "Is this photo taken indoors?" Reason: The question requires identifying a specific environmental context. Random crop may exclude key background elements like trees, invalidating the question. Flips, color inversion, noise, and random mask don't directly affect scene context. Choice: random crop

Question: "Are there any green beans in the image?" Reason: The question requires identifying a specific color. Color inversion changes green to its complement, invalidating the question. Flips, noise, random mask, and random crop don't target color directly. Choice: color inversion

Question: "How many people are in the image?" Reason: The question requires counting visible people. Random mask can completely obscure one or more people, making the exact count impossible. Noise obscures details but typically doesn't hide entire objects, allowing approximate counting. Flips and color inversion don't affect object visibility or count. Choice: random mask

Question: "Is the cat on the right side of the laptop?" Reason: The question relies on horizontal positioning. Horizontal flip reverses left and right, making "right" mean "left", invalidating the question. Other augmentations don't target horizontal positions. Choice: horizontal flip

Question: "Does this artwork exist in the form of painting?" Reason: The question requires identifying the texture of the artwork. Noise obscures fine details, making it hard to identify the medium. Other augmentations don't target texture details. Choice: noise

## Your Answer ##
If multiple augmentations could disrupt the question, select the one whose effect is most direct and unambiguous. You must choose one of the given augmentations following the "Reason:" and "Choice:" format.
Question: "{text}"

B.2 LLM-AS-A-JUDGE PROMPT FOR REASONING QUALITY

Your task is to evaluate a candidate model's response against an expert-provided reference solution. The goal is to select the image augmentation that most effectively disrupts the premise of a given question.

## Evaluation Rubric (Integer Scale 0-10) ##
- 10 (Excellent): The candidate's choice is highly effective and the reasoning is clear, logically sound, and directly supports the choice. The response is of reference quality.
- 7-9 (Good): The choice is effective and the reasoning is logical, but may be slightly less specific or insightful than the reference.
- 4-6 (Acceptable): The choice is plausible but not optimal. The reasoning is generic, weak, or contains minor flaws.
- 1-3 (Poor): The choice is ineffective and the reasoning is flawed or irrelevant.
- 0 (Very Poor): The choice and reasoning are completely incorrect or nonsensical.

## Reference Example ##
Question: "How many people are in the image?"
Reference Reason: "The question requires counting visible people. Random mask can completely obscure one or more people, making the exact count impossible."
Reference Choice: "random mask"
Candidate Reason: "Random crop might cut some people out of the frame."
Candidate Choice: "random crop"
Evaluation: Score: 7, Reason: The candidate's choice is a valid strategy for disrupting a counting task, but it is less direct than the reference. The reasoning is correct but lacks specificity.

## Task ##
Question: "{question}"
Reference Reason: "{oracle reason}"
Reference Choice: "{oracle choice}"
Candidate Reason: "{model reason}"
Candidate Choice: "{model choice}"
Evaluation:

C MODEL AND BENCHMARK DETAILS

Model Families
• LLaVA-1.5 (Liu et al., 2023) is a powerful open-source LVLM that establishes the effectiveness of visual instruction tuning for creating general-purpose visual assistants. Its architecture is characterized by its simplicity, connecting a pretrained CLIP vision encoder to a Vicuna LLM using a single Multi-Layer Perceptron projection layer. The LLaVA-1.5 version improved upon the original by incorporating a more capable LLM and scaling the instruction-following data.
• Qwen-VL (Bai et al., 2023) is a series of highly performant, versatile vision-language models based on the Qwen language model family. Key features of the Qwen-VL architecture are its support for multiple languages, its ability to process multi-image inputs, and its strong capabilities in fine-grained visual understanding, including text recognition and object localization.
• InstructBLIP (Dai et al., 2023) is a vision-language instruction tuning framework designed to enhance zero-shot generalization across a diverse set of tasks. Its central innovation is the use of an instruction-aware Query Transformer. This module is trained to extract visual features from the image encoder that are specifically relevant to the given text instruction, enabling more targeted and effective multimodal reasoning.
• Qwen3-VL (Bai et al., 2025) is the latest evolution in the Qwen vision-language series, engineered to overcome the limitations of fixed-resolution processing and enhance visual agent capabilities. Its architecture employs Naive Dynamic Resolution, which processes images at their native aspect ratios by converting them into variable-length token sequences, effectively eliminating the loss of detail caused by resizing or padding.
Combined with Multimodal Rotary Positional Embeddings (M-RoPE) that decompose positional information into temporal, height, and width components, the model achieves state-of-the-art performance in comprehending high-resolution documents, long-context videos, and executing complex tool-use instructions.

Discriminative Benchmarks
• MME (Fu et al., 2024) is a benchmark that provides a granular evaluation of multimodal tasks, spanning 10 perception and 4 cognition categories. Performance is measured on binary yes-or-no questions using an accuracy-based MME score. Following standard practice (Leng et al., 2024; Kim et al., 2024b), we consider the perception category for the experiments.
• MMVP (Tong et al., 2024) is designed to evaluate a model's understanding of fine-grained visual details. It achieves this by using 300 CLIP-blind image pairs, where models must capture subtle differences to perform paired classification accurately. These image pairs cover nine distinct visual patterns: orientation and direction, feature presence, state and condition, quantity and count, positional and relational context, color and appearance, structural and physical characteristics, text, and viewpoint and perspective. The evaluation follows a multiple-choice format, where final model responses are mapped to the answer options using GPT-4 as an automated judge.
• POPE (Li et al., 2023d) serves as a dominant benchmark for assessing object hallucination by testing models with three distinct types of negative questions. These categories include queries about random non-existent objects, popular objects that are frequent in the dataset but absent from the image, and adversarial objects selected for their high co-occurrence. The dataset contains 9,000 question-image pairs built from 500 images, each evaluated against multiple questions across the three categories.

Generative Benchmarks
• LLaVA-Bench (In-the-Wild) (Liu et al., 2023) is a benchmark to evaluate the ability of Large Vision-Language Models (LVLMs) to handle complex tasks and adapt to new domains. It features 24 images and 60 queries, which fall into three categories: conversation, detailed description, and complex reasoning. The evaluation is conducted using GPT-4V as a judge to rate both the model response and a reference answer. The final performance is reported as a score ratio, calculated by dividing the model's total score by the total score of the reference answer.
• MMHal-Bench (Sun et al., 2023) evaluates and penalizes hallucinations across a diverse set of reasoning types. It is composed of 96 image-question pairs that cover eight distinct categories, including object attributes, comparison, and spatial relations. Evaluation is performed using GPT-4V as an automated judge to assess the severity of hallucination in the generated response. The responses are scored on a scale from 0 to 7, where a higher score indicates greater factual consistency.
• MM-Vet (Yu et al., 2023) evaluates an LVLM's ability to integrate multiple multimodal capabilities for complex reasoning. The benchmark defines six fundamental multimodal abilities: recognition, knowledge, OCR, spatial awareness, language generation, and mathematics. A key feature of MM-Vet is its focus on compositional tasks, where these six core abilities are combined to create 16 distinct capability integrations. The dataset itself is composed of 200 images and 218 questions, each requiring a specific combination of these integrated skills.

D ADDITIONAL QUALITATIVE RESULTS

To provide a more granular understanding of the behavior of the method, this section presents additional qualitative results from both discriminative and generative benchmarks.
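The "Reason:"/"Choice:" format enforced by the SAS Prompt and the "Score:" format returned by the judge prompt (Appendix B) lend themselves to lightweight parsing. The sketch below is illustrative only: the function names, the augmentation vocabulary check, and the fall-back behavior on malformed responses are assumptions, not taken from the paper's implementation.

```python
import re

# Assumed augmentation vocabulary, mirroring the options listed in Appendix B.1.
VALID_AUGS = {"vertical flip", "color inversion", "random crop",
              "random mask", "noise", "horizontal flip"}

def parse_sas_response(text: str):
    """Extract (reason, choice) from a response shaped like
    'Reason: ... Choice: ...'; returns (None, None) if malformed."""
    reason = re.search(r"Reason:\s*(.*?)\s*Choice:", text, re.DOTALL)
    choice = re.search(r"Choice:\s*([a-z ]+)", text)
    if reason is None or choice is None:
        return None, None  # caller may fall back to a default augmentation
    picked = choice.group(1).strip().lower()
    return reason.group(1).strip(), picked if picked in VALID_AUGS else None

def parse_judge_score(text: str):
    """Extract the integer 0-10 rating from a judge response like
    'Score: 7, Reason: ...'."""
    m = re.search(r"Score:\s*(\d+)", text)
    return int(m.group(1)) if m else None
```

In practice a parser like this also allows logging how often each model deviates from the expected output format.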
Each example provides a comprehensive analysis that includes the reasoning trace for the chosen augmentation, a stylized visualization of the augmentation, the logit values for the expert, amateur, and contrasted distributions, and the corresponding Sparsity Adaptive Truncation (SAT) threshold. For improved visualization clarity, common punctuation tokens such as commas and periods have been omitted from the presented logit distributions.

[Figure A.1: Qualitative results on MMVP (Tong et al., 2024). A spot-counting query ("How many spots are on the animal in the image?") is paired with random mask, and a facing-direction query ("Is the sports car in the picture facing left or right?") is paired with horizontal flip; per-token expert, amateur, and contrasted logits are shown with their SAT thresholds.]

[Figure A.2: Qualitative results on MME (Fu et al., 2024). A presence query ("Are there two horses in this image?") is paired with random mask, and a color query ("Is there a red scarf in the image?") is paired with color inversion.]

[Figure A.3: Qualitative results on MM-Vet (Yu et al., 2023). A map-reading query ("Which of these highlighted cities is the farthest north?") is paired with noise, and a relative-position query ("What is located to the right of the shampoo?") is paired with random crop.]
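The expert, amateur, and contrasted logits visualized in these figures follow the standard visual-contrastive-decoding pattern: the amateur (augmented-image) distribution is subtracted from the expert (clean-image) distribution, and a plausibility cutoff restricts the candidate set. The sketch below is a minimal, assumed reconstruction of one such step; the paper's exact SAT schedule, the `alpha`/`beta_max` values, and the entropy-to-threshold mapping are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def contrastive_decode_step(expert_logits, amateur_logits, alpha=1.0):
    """VCD-style contrast: amplify what the expert (clean image) predicts
    relative to the amateur (augmented image)."""
    return (1.0 + alpha) * expert_logits - alpha * amateur_logits

def adaptive_candidate_mask(expert_logits, contrast_logits, beta_max=0.5):
    """Entropy-adaptive plausibility truncation (a sketch): sparse,
    low-entropy expert distributions get a stricter cutoff, flat ones
    keep more candidates. Returns masked logits and the threshold used."""
    p = np.exp(expert_logits - expert_logits.max())
    p /= p.sum()
    # Normalized entropy in [0, 1]: 0 = one-hot, 1 = uniform.
    h = -(p * np.log(p + 1e-12)).sum() / np.log(len(p))
    beta = beta_max * (1.0 - h)  # assumed sparsity-to-threshold mapping
    keep = p >= beta * p.max()   # drop implausible candidates
    return np.where(keep, contrast_logits, -np.inf), beta
```

Under this sketch, the scalar printed beside each token in the figures (e.g. 0.361, 0.253) would correspond to the per-step `beta` value; confident one-hot-like steps saturate near `beta_max`.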
[Figure A.4: Qualitative results on MMHal-Bench (Sun et al., 2023). Two color queries ("What are the colors of the shirts worn by the three men from left to right in the image?" and "What are the colors of the parachutes in the sky?") are both paired with color inversion.]

[Figure A.5: Confusion matrices of augmentation choices for the LLaVA-1.5 7B model vs. the GPT-4o-mini Oracle on LLaVA-Bench, MME, MM-Vet, MMHal-Bench, MMVP, and POPE.]

[Figure A.6: Confusion matrices of augmentation choices for the LLaVA-1.5 13B model vs. the GPT-4o-mini Oracle on the same six benchmarks.]

Table A.1: Comparison of model performance against the gpt-4o-mini oracle. Agreement measures the accuracy percentage (%) of augmentation choices, and Judge Score estimates the quality rating of the model reasoning on a 0 to 10 scale by gpt-4o-mini.

Model      Metric         LLaVA-Bench   MME     MM-Vet   MMHal   MMVP    POPE    Average
LLaVA-7B   Agreement (%)  56.67         72.56   61.93    54.17   52.00   87.57   64.15
           Judge Score    7.97          8.59    7.98     7.79    7.69    9.63    8.28
LLaVA-13B  Agreement (%)  51.67         69.77   40.37    63.54   72.67   99.10   66.19
           Judge Score    8.63          9.12    8.55     9.08    8.86    9.99    9.04

E DETAILED COMPARISON AGAINST ORACLE

In the main manuscript, experiments were conducted to evaluate the impact of model scale on the quality of augmentation choice and reasoning. The agreement of each model's choice and the quality of its reasoning trace were measured against the Oracle, with the results summarized in Tab. A.1. These results confirm that larger model capacity generally leads to better query-augmentation semantic alignment and higher reasoning quality. A more granular analysis using the confusion matrices in Fig. A.5-A.6 reveals a complex, task-dependent relationship. On uniform benchmarks such as POPE, the alignment between the 13B model and the Oracle is nearly optimal.
In contrast, on more complex benchmarks such as MM-Vet, the 13B model exhibits a predictive bias, frequently selecting random crop when the Oracle chooses the functionally similar random mask. Note that this disagreement is not a critical failure, but rather a choice between two functionally similar occlusion-based augmentations. This finding highlights a key strength of the proposed method: strong downstream performance is achieved without requiring a perfect, Oracle-level selection, which confirms that the framework is highly effective at leveraging the competent, albeit imperfect, reasoning of different model scales to significantly improve factual consistency.
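The Agreement metric in Tab. A.1 and the confusion matrices in Fig. A.5-A.6 can be computed directly from paired choice lists. A minimal sketch (variable and function names are illustrative):

```python
from collections import Counter

# Augmentation labels in the order used by the confusion-matrix figures.
AUGS = ["color_inversion", "horizontal_flip", "noise",
        "random_crop", "random_mask", "vertical_flip"]

def agreement(model_choices, oracle_choices):
    """Percentage of queries where the model's augmentation choice
    matches the Oracle's (the Agreement column of Tab. A.1)."""
    hits = sum(m == o for m, o in zip(model_choices, oracle_choices))
    return 100.0 * hits / len(oracle_choices)

def confusion_matrix(model_choices, oracle_choices):
    """Rows index the Oracle's choice, columns the model's prediction,
    as in Fig. A.5-A.6."""
    counts = Counter(zip(oracle_choices, model_choices))
    return [[counts[(o, m)] for m in AUGS] for o in AUGS]
```

Off-diagonal mass concentrated in the (random_mask, random_crop) cell is exactly the benign occlusion-vs-occlusion disagreement discussed above.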
