When Choices Become Priors: Contrastive Decoding for Scientific Figure Multiple-Choice QA



Preprint. Under review.

Taeyun Roh 1, Eun-yeong Jo 2, Wonjune Jang 3, Jaewoo Kang 1,4 †
1 Korea University  2 Konkuk University  3 Myongji University  4 AIGEN Sciences
nrbsld@korea.ac.kr  tina0325@konkuk.ac.kr  dnjswnswkd03@mju.ac.kr  kangj@korea.ac.kr

Abstract

Scientific figure multiple-choice question answering (MCQA) requires models to reason over diverse visual evidence, ranging from charts and multipanel figures to microscopy and biomedical images. However, this setting suffers from a distinctive bias: answer choices themselves can act as priors, steering multimodal models toward scientifically plausible options even when the figure supports a different answer. We investigate this failure mode through a simple question: what if decoding explicitly discounts what the model would prefer from text alone, so as to favor figure-grounded evidence? To this end, we propose SciCON, a training-free decoding method that scores each candidate by subtracting a text-only option score from its image-conditioned counterpart. Unlike prior contrastive decoding approaches that mitigate hallucinations by contrasting original inputs with distorted images or perturbed instructions, SciCON directly targets the choice-induced prior encoded in candidate text. Across three scientific figure QA benchmarks and three model backbones, SciCON consistently improves accuracy over standard decoding baselines. These results show that decoding against choice-induced priors is an effective and simple way to improve figure-grounded reasoning in scientific MCQA.

1 Introduction

Scientific figures are a primary medium for communicating scientific evidence. Plots, microscopy panels, radiology images, symbolic diagrams, and multipanel experimental summaries often contain the central empirical claims of a paper in visual form (Hsu et al., 2021; Roberts et al., 2024).
Building multimodal systems that can reliably interpret such figures is therefore important for a wide range of scientific applications, including literature understanding, evidence-grounded question answering, scientific search, and research assistance, where agentic systems increasingly rely on accurate grounding in scientific documents and their visual evidence (Baek et al., 2025; Schmidgall et al., 2025; Agarwal et al., 2025). Despite this importance, most progress in vision-language modeling has been driven by object-centric and general-domain benchmarks, where the dominant challenges are object recognition, captioning, and broad visual question answering (Lin et al., 2015; Agrawal et al., 2016; Hudson & Manning, 2019). As a result, scientific figure understanding remains relatively underexplored given its practical importance. Recent benchmarks (Roberts et al., 2024; Li et al., 2024b; Jiang et al., 2025) reveal that even advanced multimodal models struggle with scientific figure reasoning. Unlike object-centric natural-image tasks, scientific figures require interpreting trends, comparing panels, and mapping symbolic abstractions to domain-specific meanings. The challenge thus shifts from simple visual recognition to identifying which candidate answer is actually supported by the scientific evidence encoded in the figure.

∗ Code is available at https://github.com/dmis-lab/SciCON.
† Corresponding author.

This setting is especially vulnerable to shortcut behavior in the multiple-choice regime. In scientific multiple-choice QA (MCQA), answer choices are not merely alternatives to rank; they often contain strong semantic and domain-specific cues that can themselves act as priors (Balepur et al., 2024). As a result, a model may prefer an option because it is scientifically plausible from text alone, even when the figure supports a different answer.
We refer to this phenomenon as choice-induced prior bias. Its inference-time manifestation is what we call text-prior-dominant decoding: the final prediction remains overly aligned with what the model would answer from the question and choices alone, rather than with the visual evidence in the figure. Prior work has shown that vision-language models can over-rely on language priors and memorized associations instead of genuine visual reasoning (Luo et al., 2025; Vo et al., 2025; Sun et al., 2026). Related lines of work have also explored decoding-based methods for mitigating prior-driven failures and hallucinations in multimodal generation (Leng et al., 2024; Wang et al., 2024b; Park et al., 2024). However, while language-prior effects have been widely discussed in general multimodal settings, there has been limited analysis of how answer choices themselves induce bias in scientific figure QA, and even less work on inference-time methods designed specifically to counteract this failure mode.

In this paper, we ask a simple question: what if decoding explicitly discounts what the model would prefer from text alone, so as to favor figure-grounded evidence? To answer this, we propose SciCON (Scientific Contrastive decoding), a simple training-free decoding method for scientific figure MCQA. For each candidate option, SciCON computes an image-conditioned score and a text-only score, and subtracts the latter during final decision making. If a candidate is preferred mainly because it is textually or scientifically plausible, its score is reduced; if it is genuinely supported by the figure, it remains competitive after subtraction.

Our contributions are threefold:

• We identify choice-induced prior bias as a distinctive failure mode in scientific figure MCQA, where answer choices themselves act as priors and steer models toward semantically plausible distractors over visually grounded answers.
• We propose SciCON, a simple training-free contrastive decoding method that subtracts text-only answer preference from image-conditioned answer scores, thereby explicitly decoding against choice-induced prior bias.
• Across three benchmarks and three model backbones, we show that SciCON consistently improves accuracy. Our analysis further suggests that its primary mechanism is gold-answer recovery: it is most effective when the image-conditioned branch supports the correct answer but the text prior does not.

2 Related work

2.1 Improving Scientific Figure Understanding

Recent work on scientific figure understanding has advanced along two main directions: building benchmarks that expose the limitations of general-domain multimodal models, and adapting multimodal models to scientific imagery and reasoning. On the evaluation side, SciFIBench, MMSci, and MAC introduce benchmarks for scientific figure interpretation, multidisciplinary scientific reasoning, and live scientific cross-modal evaluation, respectively (Roberts et al., 2024; Li et al., 2024b; Jiang et al., 2025). On the modeling side, prior efforts have improved scientific figure understanding through domain-specific data construction and adaptation of general-purpose vision-language models (Li et al., 2023; 2024a; Lozano et al., 2025). Our work is complementary to these directions: rather than introducing a new benchmark or adapting model parameters, we focus on an inference-time decoding strategy for scientific MCQA.

[Figure 1 chart: answer probabilities p over options A-D. Multimodal: 0.01, 0.24, 0.72, 0.03; Text-only: 0.02, 0.04, 0.91, 0.04; SciCON: 0.05, 0.55, 0.34, 0.07.]

Question: Which of the following captions best describes the whole figure? Correct answer: B.
• Option A: Relative expression of Tet transcripts in development.
• Option B (Gold): Increase of 5hmC in the maternal genome of zygotes derived from PGC7-null oocytes.
• Option C: 5hmC preferentially appears in the paternal genome of early mouse preimplantation embryos.
• Option D: 5hmC is present in rabbit and bovine zygotes.

Figure 1: An MMSci example where both text-only and multimodal decoding favor a plausible distractor (Option C). SciCON suppresses this text-driven bias and recovers the visually grounded correct answer (Option B).

2.2 Choice Bias in Multiple-Choice Question Answering

Prior work on multiple-choice QA shows that predictions can be biased by answer-space artifacts rather than underlying evidence. In language-only and multimodal MCQA settings, prior studies have examined biases arising from answer-option artifacts, option ordering, and superficial answer plausibility, showing that models can exploit shortcuts in the candidate set without performing the intended reasoning (Balepur et al., 2024; Zheng et al., 2024; Wang et al., 2024a; Atabuzzaman et al., 2025). Related work has also explored calibration, answer permutation, and debiasing strategies to reduce such effects at inference time (Zheng et al., 2024; Atabuzzaman et al., 2025).

Our setting is related but distinct. Rather than positional bias among options, we focus on a stronger issue in scientific figure QA: answer choices often encode domain knowledge and scientifically plausible phrasing that act as priors. As a result, a model may favor a candidate from the question and choices alone, even when the figure supports another answer. To our knowledge, this choice-induced prior bias remains largely unexplored in scientific MCQA, with little work on decoding methods designed to suppress it.

2.3 Contrastive and Debiasing Decoding

Contrastive decoding has emerged as an effective training-free strategy for steering model predictions and suppressing undesirable behaviors (O'Brien & Lewis, 2023).
In multimodal generation, much of this line of work has focused on hallucination reduction, especially failures where models mention objects or attributes that are not supported by the image. Visual Contrastive Decoding (VCD) addresses this problem by contrasting predictions under original and distorted visual inputs (Leng et al., 2024). Instruction Contrastive Decoding (ICD) instead perturbs the instruction side, contrasting standard prompts with disturbance instructions to reduce hallucination and prior-driven errors (Wang et al., 2024b).

Dataset      JS (Correct)  JS (Wrong)  Cosine (Correct)  Cosine (Wrong)
MAC          0.2477        0.1448      0.5559            0.7218
SciFIBench   0.1161        0.0728      0.7623            0.8253
MMSci        0.0870        0.0536      0.8005            0.8790

Table 1: Distances between multimodal and text-only answer distributions with the Qwen3.5-4B backbone. Correct cases are farther from the text-only prior, showing higher JS divergence and lower cosine similarity.

3 Method

3.1 Problem Setting

We study scientific figure MCQA across three benchmark datasets. Each example consists of a scientific figure x, a question q, and a set of candidate answers C = {c_1, ..., c_K}. The goal is to predict the correct answer y* ∈ C. Given a vision-language model, we define two answer distributions over the same candidate set. The first is the multimodal answer distribution,

    p_mm(c | x, q, C),    (1)

obtained by conditioning on the figure, question, and answer choices. The second is the corresponding text-only answer distribution,

    p_txt(c | q, C),    (2)

obtained from the same input after removing the figure. Intuitively, p_mm reflects the model's preference when both visual and textual evidence are available, whereas p_txt captures the preference induced by textual evidence alone.
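Concretely, both distributions can be obtained by scoring each answer option once with the figure and once without it, then normalizing over the candidate set. A minimal sketch; the softmax helper and the logit values are illustrative, not the released implementation:

```python
import math

def candidate_distribution(logits):
    """Normalize per-candidate logits (e.g., the model's scores for
    option letters A, B, C, ...) into an answer distribution via softmax."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical logits from the same backbone, scored with the figure
# (multimodal context) and with the figure removed (text-only context).
p_mm = candidate_distribution([3.0, 3.3, 0.9])   # p_mm(c | x, q, C)
p_txt = candidate_distribution([0.8, 2.8, 1.4])  # p_txt(c | q, C)
```

Both distributions live over the same candidate set, which is what makes them directly comparable in the analysis that follows.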
3.2 Preliminary Experiments

Our central hypothesis is that scientific figure QA suffers from a failure mode where the final prediction distribution p_mm remains overly aligned with the text-only distribution p_txt. Ideally, the model should assign high probability to an answer because it is supported by the visual evidence in x. In practice, however, some candidate answers may already appear highly plausible from text alone due to domain-specific wording, semantic relatedness, or common scientific associations. As a result, the image-conditioned distribution p_mm may remain too close to p_txt, and the final prediction may reflect answer plausibility rather than genuine figure grounding.

We refer to this phenomenon as text-prior-dominant decoding. Formally, it arises when the multimodal decoder outputs an incorrect answer,

    ŷ = argmax_{c ∈ C} p_mm(c | x, q, C) ≠ y*,    (3)

and the selected distractor is more strongly favored by the text-only distribution than the correct answer,

    p_txt(ŷ | q, C) > p_txt(y* | q, C).    (4)

This failure is especially problematic when the correct answer receives substantial additional support from the image, yet that visually grounded signal is still insufficient to overcome the textual prior. Our goal is therefore to design a decoding rule that suppresses answer preference that can already be explained by text alone, while preserving answer preference that emerges only when the image is available.

Before introducing our method, we first test whether successful scientific figure reasoning is associated with a measurable departure from the text-only prior. If our hypothesis is correct, then correct predictions should be those in which the image-conditioned answer distribution deviates meaningfully from what the model would prefer from text alone.
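The failure condition in Eqs. (3)-(4) can be checked mechanically per example. A small sketch (the helper name is ours), using the Figure 1 probabilities, where both branches favor option C (index 2) although the gold answer is B (index 1):

```python
def is_text_prior_dominant(p_mm, p_txt, gold):
    """Conditions (3)-(4): the multimodal prediction is wrong, and the
    chosen distractor is favored over the gold answer by the text-only
    distribution."""
    pred = max(range(len(p_mm)), key=p_mm.__getitem__)
    return pred != gold and p_txt[pred] > p_txt[gold]

# Figure 1 example: p_mm over A-D, p_txt over A-D, gold answer B (index 1).
print(is_text_prior_dominant(
    [0.01, 0.24, 0.72, 0.03],   # p_mm
    [0.02, 0.04, 0.91, 0.04],   # p_txt
    gold=1))                    # prints True
```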
[Figure 2 diagram: text-only candidate scores (0.8, 2.8, 1.4) are scaled by α = 0.5 and subtracted from the multimodal candidate scores (3.0, 3.3, 0.9) to give the SciCON candidate scores.]

Figure 2: Illustration of SciCON. Given a question and candidate answers, the model produces candidate scores under both multimodal and text-only inputs. SciCON subtracts the text-only score, scaled by α, from the multimodal score, so that candidates favored mainly by the textual prior are suppressed and visually grounded candidates are promoted.

To test this, we use a standard greedy decoder and compute both p_mm(c | x, q, C) and p_txt(c | q, C) for each example. We then partition examples into correct and wrong groups according to whether the greedy prediction matches the gold answer, and compare the two distributions using Jensen-Shannon divergence,

    JS(p_mm ∥ p_txt)    (5)

and cosine similarity,

    cos(p_mm, p_txt).    (6)

Table 1 shows a consistent pattern across MAC, SciFIBench, and MMSci: correct predictions exhibit higher Jensen-Shannon divergence and lower cosine similarity than wrong predictions. In other words, successful predictions tend to move farther away from the text-only prior, whereas incorrect predictions remain more aligned with it. These results support our failure-mode hypothesis: many errors arise when the model does not move sufficiently far from the text-only preference even when visual evidence is available. This observation motivates our decoding approach, which explicitly discounts text-only preference at inference time so that answers are favored only when they remain strong after visual grounding.
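The two diagnostics in Eqs. (5)-(6) are straightforward to compute from the candidate distributions. A self-contained sketch of the standard definitions:

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q), with the 0 * log 0 = 0 convention."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence JS(p || q): symmetric and bounded by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def cosine_similarity(p, q):
    """Cosine similarity between two distributions viewed as vectors."""
    dot = sum(pi * qi for pi, qi in zip(p, q))
    norm = math.sqrt(sum(x * x for x in p)) * math.sqrt(sum(x * x for x in q))
    return dot / norm
```

Higher JS divergence (and lower cosine similarity) between p_mm and p_txt indicates that conditioning on the image actually moved the model away from its text-only preference.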
3.3 SciCON: Contrastive Decoding for Scientific Figure Multiple-Choice QA

For each question, we query the model under the two contexts defined in Section 3.1: a multimodal context that includes the scientific figure, and a text-only context in which the figure is removed. This yields two candidate-wise logits over the same answer set:

    l_mm(c) = logit_θ(c | x, q, C),    (7)
    l_txt(c) = logit_θ(c | q, C).    (8)

Here, l_mm(c) measures the model's preference for candidate c when both visual and textual evidence are available, whereas l_txt(c) measures the preference induced by the textual evidence only.

Backbone                 Method             MAC (ACC / F1)  SciFIBench (ACC / F1)  MMSci (ACC / F1)
Qwen 3.5 4B              Greedy (Baseline)  69.72 / 70.75   46.20 / 44.93          38.83 / 19.41
                         VCD                68.50 / 68.77   45.40 / 44.00          40.85 / 26.58
                         ICD                58.41 / 58.57   40.50 / 38.23          33.79 / 14.83
                         SciCON (Ours)      74.01 / 74.24   48.70 / 46.99          43.44 / 19.06
Qwen 3.5 9B              Greedy (Baseline)  81.35 / 81.40   55.10 / 54.67          46.54 / 27.34
                         VCD                81.96 / 81.92   55.90 / 55.55          49.29 / 32.35
                         ICD                81.04 / 81.01   53.00 / 52.13          46.65 / 24.14
                         SciCON (Ours)      82.26 / 82.34   58.00 / 57.55          52.14 / 33.91
Phi-3.5-vision-instruct  Greedy (Baseline)  42.81 / 42.02   48.60 / 48.47          47.78 / 34.43
                         VCD                43.73 / 43.27   53.50 / 53.75          51.95 / 34.59
                         ICD                42.81 / 41.60   47.10 / 46.86          46.32 / 31.38
                         SciCON (Ours)      49.54 / 49.89   54.90 / 55.02          52.71 / 29.93

Table 2: Main results on scientific figure QA benchmarks. We compare SciCON with greedy decoding and contrastive baselines (VCD and ICD) across Qwen 3.5 4B, Qwen 3.5 9B, and Phi-3.5-vision-instruct backbones. Performance is reported in accuracy (ACC) and macro-F1. The best result for each metric is highlighted in bold.

SciCON then adjusts the final decision by explicitly discounting the text-only preference from the multimodal logit:

    l_sc(c) = l_mm(c) − α · l_txt(c),    (9)

where α > 0 controls the strength of prior suppression. The final prediction is then given by

    ŷ = argmax_{c ∈ C} l_sc(c).    (10)
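Under these definitions, the whole method is a few lines at decision time. A minimal sketch using the illustrative candidate scores from Figure 2 (values are illustrative, not taken from the released code):

```python
def scicon_predict(l_mm, l_txt, alpha=0.5):
    """Eqs. (9)-(10): subtract the alpha-scaled text-only logit from the
    multimodal logit for each candidate, then pick the best candidate."""
    l_sc = [m - alpha * t for m, t in zip(l_mm, l_txt)]
    return max(range(len(l_sc)), key=l_sc.__getitem__)

l_mm = [3.0, 3.3, 0.9]   # multimodal candidate scores (illustrative)
l_txt = [0.8, 2.8, 1.4]  # text-only candidate scores (illustrative)

greedy = max(range(3), key=l_mm.__getitem__)     # index 1: text-favored option
scicon = scicon_predict(l_mm, l_txt, alpha=0.5)  # index 0: survives subtraction
```

With α = 0.5 the contrastive scores are [2.6, 1.9, 0.2]: the candidate that was strong mainly because of its textual plausibility is suppressed, and the candidate whose support is not explained by text alone wins.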
This decoding rule favors candidates whose support cannot be explained by text alone. A distractor that is attractive primarily because it is linguistically plausible receives a lower contrastive score, whereas an answer that remains strong after subtracting the text-only preference is more likely to reflect genuine visual grounding.

4 Experiments

4.1 Benchmarks

We evaluate the effectiveness of our method on three representative scientific multimodal MCQA benchmarks, which require high-level visual reasoning and domain-specific knowledge:

• MAC (Jiang et al., 2025): MAC is constructed from scientific journal cover images drawn from venues such as Cell, Nature, Science, and ACS journals, spanning domains including biology, medicine, chemistry, and materials science. Compared with conventional figure QA benchmarks, it focuses on visually rich cover images paired with scientific narratives, making it suitable for evaluating abstract and high-level scientific visual understanding.
• SciFIBench (Roberts et al., 2024): SciFIBench consists of scientific figures paired with textual descriptions across both computer science and broader scientific domains, including areas such as computer vision, AI, machine learning, physics, mathematics, and quantitative biology. Its figures are accompanied by hard distractors, making the benchmark particularly challenging for evaluating fine-grained figure-text grounding.
• MMSci (Li et al., 2024b): MMSci is a broad scientific multimodal benchmark covering five major categories and 72 subjects, including physical sciences, earth and
environmental sciences, biological sciences, and health. It contains diverse scientific figure types and associated textual contexts, making it useful for evaluating scientific figure understanding across heterogeneous domains and visual formats.

Category                          N     Greedy  ICD    VCD    SciCON
Biological sciences               2062  44.23   45.25  48.93  51.16
Physical sciences                 1039  49.95   48.22  50.05  53.61
Health sciences                   330   49.09   47.27  50.30  53.03
Earth and environmental sciences  246   50.00   51.22  50.41  54.47
Scientific community and society  34    32.35   44.12  29.41  41.18

Table 3: Category-wise accuracy (%) on MMSci with the Qwen3.5-9B backbone. SciCON achieves the highest accuracy in four of the five scientific categories, while ICD performs best in Scientific community and society.

4.2 Experimental Setup

Backbone Models. To demonstrate the generalizability of our approach across model scales and architectures, we use three vision-language backbones: Qwen 3.5 4B, Qwen 3.5 9B, and Phi-3.5-vision-instruct. The Qwen models provide two different scale settings, while Phi-3.5-vision-instruct offers an additional architecture with strong visual and instruction-following capabilities.

Baselines. We compare SciCON against one standard decoding baseline and two contrastive decoding methods:

• Greedy Decoding: The standard decoding strategy that directly selects the candidate with the highest model score under the original multimodal context, without any explicit correction for textual prior bias. It serves as the main non-contrastive baseline.
• VCD (Leng et al., 2024): Visual Contrastive Decoding suppresses language-prior-driven predictions by contrasting the original image-conditioned output with the output obtained from a degraded visual input, such as a blurred or distorted image. The underlying idea is that candidates whose scores remain high even when visual information is corrupted are less likely to be genuinely grounded in the image.
• ICD (Wang et al.
, 2024b): Instruction Contrastive Decoding reduces hallucinations by contrasting the original output distribution with a distribution induced by disturbance instructions, which are designed to exacerbate hallucination in multimodal reasoning. By discounting concepts that are overly sensitive to such disturbed instructions, ICD suppresses unsupported predictions and improves alignment with the visual evidence.

Implementation Details. Performance is evaluated using Accuracy and Macro-F1 to account for potential class imbalance across benchmarks. We provide the formal definition of Macro-F1 in Appendix A.1. For VCD and ICD, we adopt the default hyperparameter settings from the original implementations without additional tuning. For SciCON, we use α = 0.5 as the default setting throughout the main experiments. This choice is rank-equivalent to the default VCD/ICD coefficient setting, and therefore yields the same effective weighting ratio between the original and contrastive branches. Further details on the hyperparameter settings of VCD, ICD, and SciCON, as well as SciCON's sensitivity to α, are provided in Appendix A.3.

4.3 Experimental Results

Table 2 summarizes the main results. Across all three benchmarks and all three backbone models, SciCON achieves the best accuracy in every case. The gains are especially clear with Phi-3.5-vision-instruct, where SciCON improves over greedy decoding on MAC, SciFIBench, and MMSci, while also surpassing the strong VCD baseline on all three benchmarks. Similar improvements are observed for both Qwen 3.5 4B and Qwen 3.5 9B.

Macro-F1 follows the same overall trend on MAC and SciFIBench, where SciCON performs best across all backbones. MMSci is the main exception: VCD yields higher macro-F1 for Qwen 3.5 4B and Phi-3.5-vision-instruct despite lower accuracy. We hypothesize that this discrepancy is related to the highly imbalanced gold answer-label distribution in MMSci. As shown in Appendix A.5, MMSci has both a long-tailed answer-label distribution and highly variable candidate counts. Under this imbalance, gains on frequent labels can substantially improve overall accuracy, while limited gains on rare labels may constrain macro-F1.

Table 3 reports category-wise results on MMSci. SciCON attains the best accuracy in four of the five categories, with particularly strong improvements in Biological sciences and Earth and environmental sciences. Scientific community and society is the only category where ICD slightly outperforms SciCON. Overall, the category-wise analysis supports the effectiveness of text-prior subtraction across diverse scientific domains, while indicating that the influence of textual priors is not uniform across categories.

5 Analysis

To better understand how SciCON changes prediction behavior, we analyze the answer distributions produced by the Qwen3.5-4B model. Table 4 defines the four diagnostic quantities used throughout this section.

Name                    Definition                                  Interpretation
Gold uplift             l_sc(y*) − l_mm(y*)                         The extent to which the contrastive decoder raises the gold option relative to the original image-conditioned branch.
Visual evidence margin  l_mm(y*) − l_txt(y*)                        The extent to which the multimodal branch supports the gold answer more strongly than the text-only branch. Positive values indicate gold-specific visual evidence.
Text-prior gold hit     1[argmax p_txt = y*], averaged over a group  Whether the text-only branch already ranks the gold answer first. High values indicate that the text prior is already informative.
Prior alignment         cos(p_mm, p_txt)                            The similarity between the multimodal and text-only answer distributions. Higher values indicate that the multimodal branch remains close to the text prior.

Table 4: Definitions of the four diagnostic quantities used in the analysis.
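These four quantities follow directly from the logits and distributions defined in Section 3 and can be computed per example. A sketch (the function names are ours):

```python
import math

def gold_uplift(l_sc, l_mm, gold):
    """l_sc(y*) - l_mm(y*): how far the contrastive decoder raises the gold option."""
    return l_sc[gold] - l_mm[gold]

def visual_evidence_margin(l_mm, l_txt, gold):
    """l_mm(y*) - l_txt(y*): gold-specific support that appears only with the image."""
    return l_mm[gold] - l_txt[gold]

def text_prior_gold_hit(p_txt, gold):
    """1 if the text-only branch already ranks the gold answer first, else 0."""
    return int(max(range(len(p_txt)), key=p_txt.__getitem__) == gold)

def prior_alignment(p_mm, p_txt):
    """cos(p_mm, p_txt): how closely the multimodal branch tracks the text prior."""
    dot = sum(a * b for a, b in zip(p_mm, p_txt))
    return dot / (math.sqrt(sum(a * a for a in p_mm)) *
                  math.sqrt(sum(b * b for b in p_txt)))
```

The group-level numbers reported below are averages of these per-example values over the corrected and harmed subsets.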
5.1 Main mechanism: gold-answer recovery

The clearest and most consistent pattern across datasets is gold-answer recovery. Table 5 compares two groups of examples: corrected cases, where the baseline is wrong but SciCON is correct, and harmed cases, where the baseline is correct but SciCON becomes wrong.

In corrected cases, the gold uplift is strongly positive on all three datasets. This indicates that SciCON explicitly increases the score of the gold answer relative to the original multimodal branch. At the same time, the visual evidence margin is also large and positive, showing that the image-conditioned branch already contains substantially stronger support for the gold answer than the text-only branch.

By contrast, harmed cases exhibit little or even negative gold-specific visual advantage. Their visual evidence margin is close to zero or below zero across datasets, suggesting that the multimodal branch provides little additional evidence for the gold answer beyond what is already present in the text prior.

Dataset     Gold uplift (corrected)  Visual evidence margin (corrected)  Visual evidence margin (harmed)
MAC         1.568                    2.088                                0.013
SciFIBench  1.273                    1.171                               -0.054
MMSci       1.301                    1.197                               -0.189

Table 5: Gold-answer recovery is the primary mechanism behind SciCON. Corrected cases show large gold uplift and strong visual evidence margins, whereas harmed cases show little (or negative) gold-specific visual advantage.

Dataset     Text-prior gold hit (corrected)  Text-prior gold hit (harmed)  Prior alignment (harmed)
MAC         0.103                            0.733                         0.846
SciFIBench  0.000                            0.739                         0.887
MMSci       0.012                            0.798                         0.896

Table 6: When SciCON helps and when it hurts. Corrected cases arise when the text prior is misleading, whereas harmed cases arise when the text prior is already aligned with the gold answer.
Overall, these results indicate that SciCON mainly helps by recovering gold answers when visual evidence is present but underutilized, whereas failures tend to arise when such visual support is weak.

5.2 When prior subtraction helps and when it hurts

Table 6 further clarifies the regime in which prior subtraction is beneficial. In corrected cases, the text-prior gold hit is extremely low. In other words, the text-only branch almost never identifies the correct answer in the very examples that SciCON successfully fixes. These are precisely the cases where the text prior is misleading, and subtracting it allows the model to rely more on visual evidence.

The harmed cases show the opposite pattern. Here, the text-prior gold hit is high, meaning that the text-only prior is already well aligned with the correct answer. These examples also exhibit high prior alignment, indicating that the multimodal branch remains close to the text prior. In this regime, aggressive prior subtraction can remove useful information and occasionally turn a correct prediction into an error.

6 Limitations

SciCON is most effective when visual evidence provides a strong signal that diverges from the text prior. However, when visual evidence is ambiguous, subtracting the text prior can be counterproductive, especially if the prior already points to the correct answer. In addition, our evaluation is limited to scientific figure multiple-choice QA, so its effectiveness in broader open-ended scientific multimodal reasoning remains to be validated. Furthermore, while SciCON demonstrates robust performance with a fixed subtraction weight, the optimal value of α varies slightly across different scientific domains. This suggests that the degree of choice-induced prior bias is not uniform, and a static coefficient may not capture the nuances of every specialized dataset.
Additionally, our current formulation is specifically tailored to the multiple-choice regime; extending this contrastive approach to open-ended generative scientific QA remains a subject for future investigation.

7 Conclusion

We introduced SciCON, a training-free contrastive decoder for scientific figure MCQA that subtracts text-only answer preferences from image-conditioned scores. Across three benchmarks, this simple approach consistently improves accuracy over greedy decoding and strong contrastive baselines. Our analysis confirms that SciCON excels at gold-answer recovery by prioritizing visual evidence over linguistic biases. These findings demonstrate that decoding against the answer-text prior is a practical and effective strategy for grounding multimodal scientific reasoning.

Beyond its performance gains, SciCON offers a more computationally efficient alternative to existing contrastive methods by requiring only one additional text-only prefill pass instead of a second multimodal pass. This efficiency makes it a practical candidate for deployment in real-time scientific assistant systems that demand both high accuracy and low latency. Future work will focus on developing adaptive mechanisms to dynamically adjust prior suppression based on the model's internal confidence in the visual evidence.

References

Shubham Agarwal, Gaurav Sahu, Abhay Puri, Issam H. Laradji, Krishnamurthy DJ Dvijotham, Jason Stanley, Laurent Charlin, and Christopher Pal. LitLLM: A toolkit for scientific literature review, 2025.

Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Dhruv Batra, and Devi Parikh. VQA: Visual question answering, 2016. URL https://arxiv.org/abs/1505.00468.

Md. Atabuzzaman, Ali Asgarov, and Chris Thomas. Benchmarking and mitigating MCQA selection bias of large vision-language models, 2025.
Jinheon Baek, Sujay Kumar Jauhar, Silviu Cucerzan, and Sung Ju Hwang. ResearchAgent: Iterative research idea generation over scientific literature with large language models, 2025.

Nishant Balepur, Abhilasha Ravichander, and Rachel Rudinger. Artifacts or abduction: How do LLMs answer multiple-choice questions without the question?, 2024. URL https://arxiv.org/abs/2402.12483.

Ting-Yao Hsu, C. Lee Giles, and Ting-Hao 'Kenneth' Huang. SciCap: Generating captions for scientific figures, 2021.

Drew A. Hudson and Christopher D. Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering, 2019. URL https://arxiv.org/abs/1902.09506.

Mohan Jiang, Jin Gao, Jiahao Zhan, and Dequan Wang. MAC: A live benchmark for multimodal large language models in scientific understanding. arXiv preprint, 2025.

Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, and Lidong Bing. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. arXiv preprint, 2024.

Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day, 2023.

Lei Li, Yuqi Wang, Runxin Xu, Peiyi Wang, Xiachong Feng, Lingpeng Kong, and Qi Liu. Multimodal ArXiv: A dataset for improving scientific comprehension of large vision-language models, 2024a.

Zekun Li, Xianjun Yang, Kyuri Choi, Wanrong Zhu, Ryan Hsieh, HyeonJung Kim, Jin Hyuk Lim, Sungyoung Ji, Byungju Lee, Xifeng Yan, Linda Ruth Petzold, Stephen D. Wilson, Woosang Lim, and William Yang Wang. MMSci: A dataset for graduate-level multi-discipline multimodal scientific understanding. arXiv preprint, 2024b.

Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C.
Lawrence Zitnick, and Piotr Dollár . Micr osoft coco: Common objects in context, 2015. URL . Alejandro Lozano, Min W oo Sun, James Burgess, Liangyu Chen, Jeffrey J Nirschl, Jeffr ey Gu, Ivan Lopez, Josiah Aklilu, Austin W olfgang Katzer , Collin Chiu, Anita Rau, Xiaohan W ang, Y uhui Zhang, Alfred Seunghoon Song, Robert T ibshirani, and Serena Y eung-Levy . Biomedica: An open biomedical image-caption archive, dataset, and vision-language models derived fr om scientific literature, 2025. URL . T iange Luo, Ang Cao, Gunhee Lee, Justin Johnson, and Honglak Lee. Probing visual language priors in VLMs. arXiv preprint , 2025. Sean O’Brien and Mike Lewis. Contrastive decoding improves reasoning in lar ge language models. arXiv preprint , 2023. Y eji Park, Deokyeong Lee, Junsuk Choe, and Buru Chang. Convis: Contrastive decod- ing with hallucination visualization for mitigating hallucinations in multimodal large language models, 2024. URL . Jonathan Roberts, Kai Han, Neil Houlsby , and Samuel Albanie. Scifibench: Bench- marking large multimodal models for scientific figure interpretation. arXiv preprint arXiv:2405.08807 , 2024. Samuel Schmidgall, Y usheng Su, Ze W ang, Ximeng Sun, Jialian W u, Xiaodong Y u, Jiang Liu, Michael Moor , Zicheng Liu, and Emad Barsoum. Agent laboratory: Using llm agents as resear ch assistants, 2025. URL . Xiaoxiao Sun, Mingyang Li, Kun Y uan, Min W oo Sun, Mark Endo, Shengguang W u, Changlin Li, Y uhui Zhang, Zeyu W ang, and Ser ena Y eung-Levy . Do VLMs perceive or recall? pr obing visual perception vs. memory with classic visual illusions. arXiv preprint arXiv:2601.22150 , 2026. An V o, Khai-Nguyen Nguyen, Mohammad Reza T aesiri, V y T uong Dang, Anh T otti Nguyen, and Daeyoung Kim. V ision language models are biased. arXiv preprint , 2025. Haochun W ang, Sendong Zhao, Zewen Qiang, Nuwa Xi, Bing Qin, and T ing Liu. Llms may perform mcqa by selecting the least incorrect option, 2024a. URL abs/2402.01349 . 
Xintong Wang, Jingheng Pan, Liang Ding, and Chris Biemann. Mitigating hallucinations in large vision-language models with instruction contrastive decoding. arXiv preprint arXiv:2403.18715, 2024b.

Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, and Minlie Huang. Large language models are not robust multiple choice selectors, 2024.

A Additional Experimental Details

A.1 Evaluation Metrics

We report both Accuracy and Macro-F1. Accuracy measures the proportion of correctly answered multiple-choice questions over the entire evaluation set. Macro-F1 is included as a complementary class-balanced metric to account for potential imbalance in answer labels across benchmarks.

α              MAC (Acc / F1)    SciFIBench (Acc / F1)    MMSci (Acc / F1)
0.1            74.01 / 74.18     47.80 / 46.79            40.85 / 21.98
0.3            74.01 / 74.23     49.00 / 47.79            42.09 / 23.07
0.5 (Default)  74.01 / 74.24     48.70 / 46.99            43.44 / 19.06
0.7            74.62 / 74.76     48.30 / 46.66            44.46 / 18.39
0.9            73.39 / 73.54     47.10 / 45.45            43.41 / 16.25

Table 7: Hyperparameter sensitivity of SciCon with respect to the subtraction weight α on Qwen3.5-4B. Performance is generally stable across a broad range of α values, although the best setting varies slightly by dataset.

Formally, let C denote the set of answer classes (e.g., A/B/C/D). For each class c ∈ C, we compute precision and recall as

Precision_c = TP_c / (TP_c + FP_c),    Recall_c = TP_c / (TP_c + FN_c),

where TP_c, FP_c, and FN_c denote the numbers of true positives, false positives, and false negatives for class c, respectively. The class-wise F1 score is then

F1_c = (2 · Precision_c · Recall_c) / (Precision_c + Recall_c).

Macro-F1 is defined as the unweighted average across classes:

Macro-F1 = (1 / |C|) · Σ_{c ∈ C} F1_c.

Following standard practice, if the denominator of precision or recall is zero for a class, the corresponding term is treated as zero.
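The Macro-F1 computation above can be sketched in a few lines; this is a minimal illustrative implementation of the definitions in A.1 (function and variable names are ours), with zero-denominator terms treated as zero:

```python
def macro_f1(gold, pred, classes=None):
    """Unweighted mean of class-wise F1 scores, as defined in Appendix A.1.
    Precision/recall terms with a zero denominator are treated as zero."""
    classes = classes or sorted(set(gold) | set(pred))
    f1s = []
    for c in classes:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        prec = tp / (tp + fp) if (tp + fp) else 0.0
        rec = tp / (tp + fn) if (tp + fn) else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if (prec + rec) else 0.0)
    return sum(f1s) / len(f1s)

# Toy example with three answer classes.
gold = ["A", "B", "A", "C", "B", "C"]
pred = ["A", "B", "B", "C", "B", "A"]
print(round(macro_f1(gold, pred), 4))  # → 0.6556
```

Because the average is unweighted, a rare class with poor F1 drags the score down as much as a frequent one, which is exactly why Macro-F1 diverges from Accuracy on imbalanced benchmarks such as MMSci.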
A.2 Jensen–Shannon Divergence

To quantify the discrepancy between answer preferences induced by different input conditions, we measure the Jensen–Shannon divergence (JSD) between two answer distributions. In our setting, these correspond to the answer distribution produced by the text-only model and that produced by the multimodal model.

Given two discrete probability distributions P and Q over the same answer set, the Jensen–Shannon divergence is defined as

JSD(P ∥ Q) = (1/2) KL(P ∥ M) + (1/2) KL(Q ∥ M),    where M = (1/2)(P + Q),

and KL(· ∥ ·) denotes the Kullback–Leibler divergence:

KL(P ∥ M) = Σ_i P(i) log [P(i) / M(i)].

JSD is a symmetric and bounded measure of distributional difference. A lower JSD indicates that the two answer distributions are more similar, while a higher JSD indicates a larger discrepancy. In our analysis, lower JSD between the text-only and multimodal answer distributions suggests that the multimodal model remains closer to the text-only preference, consistent with stronger dominance of textual priors.

A.3 Hyperparameter Details

We use the default hyperparameter settings from the released implementations of VCD and ICD. In both baselines, contrastive decoding is performed at the candidate-logit level. For VCD, the candidate score can be written as

s_VCD(y) = l_orig(y) + α [l_orig(y) − l_noisy(y)],    (11)

where l_orig(y) denotes the logit of candidate y under the original image-question input, and l_noisy(y) denotes the corresponding logit under a noise-corrupted image. With the default setting α = 1.0, this becomes

s_VCD(y) = l_orig(y) + [l_orig(y) − l_noisy(y)].    (12)

For ICD, the candidate score can be written as

s_ICD(y) = l_orig(y) + α [l_orig(y) − l_dist(y)],    (13)

where l_orig(y) denotes the logit of candidate y under the original multimodal prompt, and l_dist(y) denotes the corresponding logit under a disturbed instruction prompt. With the default setting α = 1.0, this becomes

s_ICD(y) = l_orig(y) + [l_orig(y) − l_dist(y)].    (14)

SciCon follows the same contrastive intuition, but uses a simpler subtraction rule:

s_SciCon(y) = l_mm(y) − α l_txt(y),    (15)

where l_mm(y) and l_txt(y) denote the candidate logits under the multimodal and text-only inputs, respectively. Unless otherwise noted, we use α = 0.5 in the main experiments, yielding

s_SciCon(y) = l_mm(y) − 0.5 l_txt(y).    (16)

Under argmax decoding, this default SciCon form is rank-equivalent to

2 l_mm(y) − l_txt(y),    (17)

matching the effective original-to-contrastive weighting ratio used by default VCD/ICD.

A.4 Impact of Hyperparameter α on Performance

We further analyze the sensitivity of SciCon to the subtraction weight α on the Qwen3.5-4B backbone. As shown in Table 7, performance remains reasonably stable across a broad range of α values, although the best setting varies somewhat by dataset. The best results are obtained at α = 0.7 on MAC, α = 0.3 on SciFIBench, and α = 0.7 in accuracy on MMSci, while MMSci macro-F1 peaks at α = 0.3. Overall, these results suggest that SciCon is not overly sensitive to the exact choice of α, and that α = 0.5 provides a reasonable default trade-off for the main experiments.

A.5 Answer-Label and Candidate-Count Distributions

To better understand the discrepancy between accuracy and macro-F1 on MMSci, we report the distributions of gold answer labels and candidate counts for all three benchmarks.
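The candidate-level scoring rules from Section A.3 can be sketched as follows. This is an illustrative sketch, not the released implementations: `logits_mm`/`logits_txt` stand for per-candidate logits from the multimodal and text-only prefill passes, and all names are ours:

```python
def scicon_scores(logits_mm, logits_txt, alpha=0.5):
    """SciCon (Eq. 15): subtract the text-only (choice-prior) logit from
    the image-conditioned logit for each answer candidate."""
    return [lm - alpha * lt for lm, lt in zip(logits_mm, logits_txt)]

def contrastive_scores(logits_orig, logits_contrast, alpha=1.0):
    """VCD/ICD-style scoring (Eqs. 11, 13): amplify the gap between the
    original branch and a corrupted (noisy-image / disturbed-instruction) branch."""
    return [lo + alpha * (lo - lc) for lo, lc in zip(logits_orig, logits_contrast)]

# Toy candidate logits: the text-only branch strongly prefers option 0.
logits_mm = [2.0, 1.8, 0.5]
logits_txt = [3.0, 0.2, 0.1]

scored = scicon_scores(logits_mm, logits_txt)  # ≈ [0.5, 1.7, 0.45]
best = max(range(len(scored)), key=scored.__getitem__)
print(best)  # → 1: option 1 wins once the text-only prior is discounted
```

Note that with α = 0.5 the SciCon score is rank-equivalent to 2 l_mm − l_txt, so the argmax matches the 2:1 original-to-contrastive weighting used by default VCD/ICD.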
As shown in Tables 8 and 9, MAC and SciFIBench have fixed candidate counts and relatively balanced gold answer-label distributions, whereas MMSci exhibits both a highly imbalanced answer-label distribution and large variation in the number of candidate choices. This observation is consistent with the gap between accuracy and macro-F1 on MMSci discussed in the main text.

Dataset      Gold answer label distribution
MAC          A: 78, B: 89, C: 73, D: 87
SciFIBench   A: 188, B: 221, C: 180, D: 200, E: 211
MMSci        A: 963, B: 941, C: 826, D: 773, E: 76, F: 63, G: 22, H: 21, I: 14, J: 2, K: 4, L: 1, M: 1, N: 2, O: 1, V: 1

Table 8: Distribution of gold answer labels across datasets. MAC and SciFIBench are relatively balanced, while MMSci shows a long-tailed distribution concentrated on early option indices.

Dataset      Candidate count distribution (Count: Frequency)
MAC          4: 327
SciFIBench   5: 1000
MMSci        2: 103, 3: 115, 4: 2731, 5: 107, 6: 213, 7: 99, 8: 95, 9: 56, 10: 40, 11: 33, 12: 43, 13: 2, 14: 15, 15: 18, 16: 20, 17: 5, 18: 3, 20: 1, 21: 2, 23: 1, 24: 1, 25: 1, 26: 7

Table 9: Distribution of candidate counts across datasets. MAC and SciFIBench use fixed numbers of answer choices, whereas MMSci contains a wide range of candidate counts.

B Inference-Time Complexity Analysis

We provide a simple theoretical comparison of inference-time cost across decoding methods. Since all compared methods are training-free and differ mainly in the number and type of forward passes required at test time, their dominant cost can be analyzed at the level of Transformer prefill computation. Let L_q denote the number of text tokens in the question/prompt, and let L_v denote the number of visual tokens extracted from the input figure. For a Transformer forward pass over a sequence of length L, we approximate the prefill cost as

C(L) = O(L² d),    (18)

where d denotes the hidden-size scale factor.
Under this approximation, a multimodal input has total length

L_mm = L_q + L_v,    (19)

while a text-only input has length

L_txt = L_q.    (20)

Therefore, the cost of one multimodal forward pass and one text-only forward pass can be written as

C_mm = O((L_q + L_v)² d),    C_txt = O(L_q² d).    (21)

Greedy decoding requires only a single forward pass on the original multimodal input:

C_Greedy = O((L_q + L_v)² d).    (22)

VCD and ICD each require an additional contrastive branch besides the original multimodal branch. Assuming the degraded-image branch in VCD and the disturbed-input branch in ICD have approximately the same token length as the original multimodal input, their costs are

C_VCD ≈ O(2 (L_q + L_v)² d),    C_ICD ≈ O(2 (L_q + L_v)² d).    (23)

In contrast, SciCon uses one multimodal forward pass and one text-only forward pass:

C_SciCon = O((L_q + L_v)² d + L_q² d).    (24)

Since

L_q² < (L_q + L_v)²  for L_v > 0,    (25)

it follows that

C_Greedy < C_SciCon < C_VCD ≈ C_ICD.    (26)

Thus, while SciCon incurs additional inference cost relative to greedy decoding, it is theoretically more efficient than contrastive decoding methods that require two full multimodal forward passes.

C Case Study: Gold-Answer Recovery

In this section, we provide a qualitative analysis of a successful case from the MAC benchmark where baseline greedy decoding fails due to strong choice-induced prior bias, but SciCon successfully recovers the correct answer.

C.1 Gold-Answer Recovery in MAC

[Figure 3 bar chart: "Comparison of Answer Distributions" (probability p over options A–D). Multimodal: 0.61, 0.37, 0.02, 0.01; Text-only: 0.85, 0.01, 0.08, 0.06; SciCon: 0.12, 0.87, 0.01, 0.01.]

Question: Which of the following options best describes the cover image?
Correct Answer: B
• Option A: California is home to expansive, water-intensive industrial agriculture.
As the state faced severe drought in the 2010s, industry remained buoyed by unjust water rights, while neighboring families suffered without access to water. The cover by Tali Weinberg transforms lush agricultural landscapes into drought-stricken fields.
• Option B (Gold): Mangroves are crucial carbon sinks, yet these ecosystems are being lost rapidly, mainly due to aquaculture and shrimp farming. The cover highlights the restoration potential of lost mangroves in China and Southeast Asia.
• Option C: Clean water is vital to planetary health, yet anthropogenic pressures threaten this resource. The cover shows the swirls of an algae bloom caused by agricultural runoff, with consequences for biodiversity and human health.
• Option D: The inaugural issue of One Earth focuses on climate action, depicting the generation at risk from climate change and their engagement with the issue.

Figure 3: A MAC case where greedy decoding selects a text-prior-dominant distractor (Option A). The text-only branch assigns overwhelming probability to A, and the multimodal branch remains biased toward the same incorrect choice. After subtracting the text-only prior, SciCon suppresses A and recovers the visually grounded correct answer (Option B).

Figure 3 illustrates a representative gold-answer recovery case on MAC. Greedy decoding selects Option A because both the multimodal and text-only branches strongly favor it, indicating that the incorrect prediction is largely driven by textual plausibility rather than genuine visual grounding. In contrast, SciCon suppresses this text-prior-dominant distractor and shifts probability mass to the gold answer B, which becomes the top prediction after prior subtraction. This example highlights the central mechanism of SciCon: when an incorrect option is attractive mainly because of text alone, contrastive prior subtraction can recover the visually supported answer.
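The recovery in the MAC case above can be reproduced approximately from the per-option probabilities reported in Figure 3. The sketch below applies the SciCon subtraction in log-probability space and renormalizes; this is our approximation for illustration (the method itself operates on raw candidate logits over the full vocabulary, so the resulting numbers differ slightly from the figure):

```python
import math

def scicon_from_probs(p_mm, p_txt, alpha=0.5):
    """Approximate SciCon on per-option probabilities: subtract alpha times
    the text-only log-probability from the multimodal one, then renormalize
    with a softmax over the candidates."""
    scores = [math.log(m) - alpha * math.log(t) for m, t in zip(p_mm, p_txt)]
    z = [math.exp(s) for s in scores]
    total = sum(z)
    return [v / total for v in z]

# Rounded per-option probabilities read off Figure 3 (options A-D).
p_mm = [0.61, 0.37, 0.02, 0.01]   # multimodal branch
p_txt = [0.85, 0.01, 0.08, 0.06]  # text-only branch

p_sc = scicon_from_probs(p_mm, p_txt)
print("ABCD"[max(range(4), key=p_sc.__getitem__)])  # → B
```

Even though the multimodal branch alone would pick A (p = 0.61), discounting the text-only prior flips the prediction to the gold answer B, mirroring the figure's SciCon distribution.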
C.2 Gold-Answer Recovery in SciFIBench

[Figure 4: scientific figure example with bar chart "Comparison of Answer Distributions" (probability p over options A–E). Multimodal: 0.40, 0.03, 0.52, 0.04, 0.00; Text-only: 0.00, 0.02, 0.92, 0.04, 0.02; SciCon: 0.87, 0.03, 0.07, 0.03, 0.01.]

Question: Which caption best matches the image?
Correct Answer: A
• Option A (Gold): ρ = 0 case. Left panel depicts the conditional expectation function E{E[R_i | X_1i = y, X_2i]} as a function of y, when ρ = 0. The observable relationship between R and income reflects the true negative effect β_1. The right panel reports regression results on a simulated dataset of 10,000 observations.
• Option B: ρ = 1 case. The regression results reflect a spurious positive association between income and reported satisfaction, and Column (3) reports an infeasible direct regression recovering the true parameters.
• Option C: Mean response R_i versus income X_1i, comparing ρ = 0 and ρ = 1; the right panel reflects a spurious upward slope under correlated reporting heterogeneity.
• Option D: ρ = 0 case with a lowess regression of R on income in a simulated dataset of 10,000 observations.
• Option E: Radial case with d = 2 and m = 1, analyzing α⋆_rad,(n=0)(p) and Λ^D_rad,(n=0)(α, p).

Figure 4: A representative SciFIBench case where the text-only branch assigns overwhelming probability to a scientifically plausible distractor (Option C, p = 0.918), and the multimodal branch follows the same incorrect preference. By subtracting this text-only answer bias, SciCon shifts probability mass toward the visually grounded correct answer and recovers Option A (p = 0.870).

Analysis: As shown in Figure 4, the text-only branch strongly prefers Option C (p_txt(C) = 0.918), likely because it provides a compact and scientifically coherent summary of the visual comparison shown in the figure.
The multimodal branch remains closely aligned with this textual prior and incorrectly predicts Option C (p_mm(C) = 0.519), despite assigning substantial probability to the gold answer A (p_mm(A) = 0.404). After subtracting the text-only preference, SciCon sharply suppresses the text-driven distractor and recovers the correct answer, assigning the highest probability to Option A (p_sc(A) = 0.870). This example illustrates that even when the multimodal model contains useful visual evidence, its final decision can still be dominated by a strong text prior unless that prior is explicitly corrected.

C.3 Gold-Answer Recovery in MMSci

Figure 1 shows a representative MMSci example where greedy decoding fails because both the multimodal and text-only branches strongly prefer the same distractor. This indicates that the multimodal prediction is dominated by the textual prior rather than by visual grounding. By subtracting the text-only prior, SciCon suppresses the distractor and recovers the correct answer.

D Detailed Results on SciFIBench Categories

To provide a more granular understanding of model performance, we report the accuracy breakdown across all individual sub-categories of SciFIBench. While the main text focuses on aggregate performance, this detailed view illustrates how different decoding strategies interact with the unique visual and linguistic characteristics of diverse scientific disciplines.

The General Science subset encompasses a wide range of fields, from quantitative biology (q-bio) and economics (econ) to theoretical physics. Across these varied domains, we observe that SciCon generally maintains superior performance compared to both greedy decoding and other contrastive baselines.
This consistency suggests that the tendency of multimodal models to over-rely on linguistic priors is a widespread phenomenon across scientific literature, and that our approach of explicitly subtracting the text-only prior is robust to differences in domain-specific terminology.

In the Computer Science (CS) subset, which includes specialized fields such as Computer Vision (cs.CV), Cryptography (cs.CR), and Machine Learning (cs.LG), performance shows higher variance across sub-categories. While SciCon achieves the highest overall accuracy for the CS group, certain baselines occasionally show strengths in specific niche areas. These results highlight the inherent complexity of scientific figure reasoning, where the optimal decoding strategy may be influenced by the specific symbolic conventions and structural layouts unique to each sub-field.

Overall, the category-wise results reinforce the claim that SciCon provides a more reliable grounding in visual evidence across the broad spectrum of scientific research.
Group                    Category     N     Greedy  ICD    VCD    SciCon
General Science          q-bio        77    41.56   44.16  40.26  44.16
General Science          econ         79    62.03   55.70  54.43  63.29
General Science          q-fin        75    65.33   60.00  64.00  66.67
General Science          stat         61    72.13   73.77  73.77  73.77
General Science          eess         77    48.05   50.65  48.05  49.35
General Science          math         52    48.08   48.08  48.08  51.92
General Science          physics      79    75.95   77.22  75.95  78.48
General Science          overall      500   59.20   58.60  57.80  61.20
Computer Science (CS)    cross-list   133   49.62   44.36  51.13  51.88
Computer Science (CS)    other cs     132   46.97   49.24  52.27  54.55
Computer Science (CS)    cs.CR        25    80.00   64.00  84.00  76.00
Computer Science (CS)    cs.DC        25    44.00   44.00  48.00  56.00
Computer Science (CS)    cs.CV        25    48.00   44.00  56.00  52.00
Computer Science (CS)    cs.RO        25    52.00   44.00  48.00  56.00
Computer Science (CS)    cs.SE        25    48.00   60.00  56.00  60.00
Computer Science (CS)    cs.NI        25    64.00   44.00  60.00  60.00
Computer Science (CS)    cs.SY        25    52.00   44.00  48.00  44.00
Computer Science (CS)    cs.LG        25    56.00   52.00  64.00  60.00
Computer Science (CS)    cs.CL        25    56.00   48.00  56.00  56.00
Computer Science (CS)    cs.AI        10    20.00   20.00  30.00  30.00
Computer Science (CS)    overall      500   51.00   47.40  54.00  54.80

Table 10: Detailed category-wise accuracy (%) on SciFIBench with the Qwen3.5-9B backbone. SciFIBench consists of two subsets: General Science and Computer Science (CS). SciCon achieves the best overall accuracy in both subsets, while performance on individual categories varies by baseline.
