Kestrel: Grounding Self-Refinement for LVLM Hallucination Mitigation
Jiawei Mao¹, Hardy Chen¹, Haoqin Tu¹, Yuhan Wang¹, Letian Zhang¹, Zeyu Zheng², Huaxiu Yao³, Zirui Wang⁴, Cihang Xie¹, Yuyin Zhou¹
¹UC Santa Cruz  ²UC Berkeley  ³UNC-Chapel Hill  ⁴Apple

Figure 1. Kestrel progressively corrects hallucinated LVLM answers by integrating an external grounding agent with iterative self-improvement. At each round, the model grounds the current claim with explicit visual and textual evidence, conducts claim-level verification, and conservatively refines the response, yielding a final answer that is both more reliable and more interpretable.

Abstract

Large vision-language models (LVLMs) have become increasingly strong but remain prone to hallucinations in multimodal tasks, which significantly narrows their deployment. As training these LVLMs to avoid hallucinations becomes prohibitively expensive for larger models, training-free methods offer a cheap and flexible solution to this problem, yet existing approaches based on decoding or tool use often bring limited gains and/or weak interpretability. We propose Kestrel, a training-free framework for LVLM hallucination mitigation that combines an explicit visual-grounding agent with an evidence-verified self-refinement mechanism. In detail, Kestrel first collects explicit visual evidence and converts tool outputs into reusable and structured textual evidence. Second, to take full advantage of this evidence, Kestrel verifies it via an LVLM judge for evidence checking, then iteratively self-refines answers based on the verified evidence to reduce the risk of over-correction.
Extensive experiments show that Kestrel improves performance over strong baselines across hallucination benchmarks (e.g., an average of +3.31% on POPE and +28.34 on MME-Hallucination with Qwen3-VL), while providing transparent verification traces for hallucination diagnosis and analysis; for instance, the integrated self-refinement module and the grounding agent together contribute an average +2.0% gain on POPE. Project website: https://jwmao1.github.io/Kestrel_project/

1. Introduction

Recent advances in large-scale pretraining [1, 26, 34] and multimodal instruction tuning [8, 22] have substantially improved the capabilities of large vision-language models (LVLMs) [3, 10, 31] on multimodal understanding and reasoning tasks such as visual question answering (VQA). However, LVLMs still exhibit hallucination, producing responses that are inconsistent with or weakly supported by the input image. Empirical studies [20, 28, 29] show that this issue remains prevalent, making hallucination a central challenge for improving the reliability of LVLMs.

To mitigate hallucination, two broad classes of methods have been proposed: training-based and training-free. For the training-based line of work, continual training with hallucination annotations or alignment with external feedback has been shown to be effective [7, 15, 24, 28, 36]. However, these training-based solutions incur significant data and compute overhead, posing hurdles for real-world deployment. Existing training-free methods for hallucination mitigation improve test-time correction without additional training, but leave key gaps to be filled: (i) limited gains and robustness when operating purely on internal decoding dynamics without external grounding evidence, and (ii) limited reliability when correction is performed in a single pass.
Distribution-contrast methods [17, 32] can reduce object hallucinations but remain sensitive to perturbations and often favor common-object representations. Many approaches rely on internal logit dynamics [13] or language-level decoding control [12], which can yield brittle corrections that are difficult to validate against concrete visual evidence. On the other hand, methods that introduce external verification may produce non-deterministic evidence due to the randomness of tools [35]. While other methods [33] can collect reliable evidence, their one-time verification-and-update can be insufficient to prevent over-correction in challenging cases.

Building on these limitations, we propose Kestrel (see Fig. 1), a training-free framework for LVLM hallucination mitigation that unifies an explicit visual grounding agent with evidence-driven iterative self-refinement. Specifically, Kestrel first decomposes the question into verifiable claim-level targets (e.g., existence, color, count, and position), and then invokes SAM3 [4] around each target to collect segmentation overlays, bounding boxes, target crop-and-zoom views, and textual evidence derived from the collected visual evidence, all of which are collated as structured evidence items with citation identifiers. The framework then performs claim-level verification with verdicts and outputs confidence-aware verification results, forming an auditable evidence chain. To regulate potential over-correction, we further introduce an evidence-gated update scheme into the iteration: the framework progressively supplements and strengthens claim-level evidence through multiple rounds of verification and revision, and permits answer flips only when evidence strength, confidence, and evidence coverage jointly satisfy predefined criteria.
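The flip criteria just described can be made concrete as a small predicate. A minimal sketch follows; the function name and threshold defaults are illustrative placeholders, not the paper's actual settings:

```python
def allow_flip(evidence_strength: float,
               confidence: float,
               coverage: float,
               min_strength: float = 0.8,
               min_confidence: float = 0.85,
               min_coverage: float = 1.0) -> bool:
    """Permit an answer flip only when evidence strength, verifier
    confidence, and claim coverage jointly clear their thresholds.
    Threshold defaults here are hypothetical, not from the paper."""
    return (evidence_strength >= min_strength
            and confidence >= min_confidence
            and coverage >= min_coverage)
```

If any one signal falls below its threshold, the current answer is kept and more evidence is gathered in the next round.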
These designs preserve the training-free setting while improving the interpretability, robustness, and decision stability of hallucination mitigation.

Experiments show that Kestrel remains fully training-free, yet consistently reduces hallucinations at test time across multiple benchmarks, with improvements that transfer across different state-of-the-art LVLM backbones. On POPE [20] (MS-COCO, A-OKVQA, and GQA), Kestrel improves accuracy by an average of +3.31 percentage points over Qwen3-VL and +3.03 over InternVL3.5; it also surpasses prior training-free mitigation baselines by +1.38 and +1.47 points on average under the same backbones, respectively. On the more challenging MME-Hallucination [9], Kestrel boosts Qwen3-VL by +28.34 points and exceeds OPERA [12] by +16.67, delivering consistent gains across diverse hallucination types (existence, count, and position) while maintaining strong overall performance and setting a new state of the art.

Our main contributions are summarized as follows:
• We propose Kestrel, a training-free LVLM hallucination mitigation framework that unifies an explicit visual grounding agent with iterative self-refinement at test time. Kestrel decomposes answers into verifiable claim-level targets, grounds them with structured visual and textual evidence, and performs conservative multi-round verification and revision to improve interpretability and reduce over-correction.
• Kestrel achieves state-of-the-art performance in hallucination mitigation on POPE and the more fine-grained MME-Hallucination.
• Kestrel generalizes across multiple state-of-the-art LVLM backbones with substantial and consistent gains, showing that the framework is backbone-agnostic and broadly applicable in the training-free setting.

2. Related Work

2.1.
Large Vision-Language Models

Large vision-language models (LVLMs) have advanced rapidly through large-scale multimodal pretraining and instruction tuning, achieving strong performance across multimodal understanding and reasoning tasks. Representative paradigms include CLIP-style vision-language pretraining [26], Flamingo-style few-shot multimodal modeling [1], and BLIP-2-style [18] modular alignment between frozen vision encoders and LLMs. LVLMs such as LLaVA [22, 23], InstructBLIP [8], OpenFlamingo [2], CogVLM [30], Kosmos-2 [25], and recent models [3, 5, 6, 10, 31] further demonstrate the effectiveness of scalable multimodal alignment and visual instruction tuning. Meanwhile, grounded multimodal modeling has become increasingly important, as exemplified by Kosmos-2 [25], which explicitly supports phrase grounding and visual referring. Nevertheless, current LVLMs still struggle to maintain faithful grounding between generated responses and image content, especially in fine-grained reasoning scenarios, making hallucination a persistent challenge for reliable deployment.

Figure 2. Kestrel vs. prior training-free hallucination mitigation methods. By combining an external grounding agent with iterative self-improvement, Kestrel collects explicit visual evidence and further converts tool outputs into structured textual evidence for verification. Compared with prior approaches, this design yields more interpretable and stable evidence, reduces overconfident corrections, and avoids the biased interpretation that may arise when LVLMs rely only on raw visual evidence.

2.2. Hallucination in LVLMs

Hallucination is a persistent problem in large vision-language models (LVLMs). Early studies [20] show that LVLMs often generate content inconsistent with the input image, especially by predicting non-existent objects, while POPE [20] improves the stability of such evaluation.
Subsequent work shows that hallucination extends beyond object existence to finer-grained errors in attributes, counts, and relations, as benchmarked by AMBER [29]. More challenging settings, such as visual illusions and ambiguous local evidence, are further explored in HallusionBench [11]. Broader benchmarks, including MME [9], MMHal-Bench [28], and THRONE [16], further suggest that hallucination is heterogeneous, benchmark-sensitive, and closely tied to failures in visual grounding and multimodal reasoning. These findings motivate mitigation methods that verify model outputs against explicit and fine-grained visual evidence.

2.3. Training-based Hallucination Mitigation

Early approaches improve faithfulness by redesigning instruction data or supervision signals so that models better distinguish grounded from ungrounded responses. For example, robust visual instruction tuning [7] introduces hallucination-oriented supervision, while HACL [15] uses contrastive learning to separate grounded and hallucinated representations. Reflective instruction tuning [36] further improves reliability by adding rationale supervision. Alignment-based methods, such as factually augmented RLHF [28] and Silkie [19], incorporate preference or factual signals during post-training, and HIO [24] strengthens token-level contrastive learning around hallucinated content. Overall, training-based methods are effective, but they usually require additional annotations, synthetic data, preference collection, or repeated optimization, leading to higher training cost and deployment complexity.

2.4. Training-free Hallucination Mitigation

Training-free hallucination mitigation aims to reduce hallucination at inference time without updating model parameters.
A major line of work focuses on contrastive or decoding-based strategies, such as VCD [17], RITUAL [32], OPERA [12], and SHIELD [13], which alleviate hallucination by intervening on decoding behavior or visual token representations. Another line introduces explicit verification or post-hoc correction. For example, Woodpecker [33] adopts a multi-stage correction pipeline, while DeGF [35] leverages text-to-image generative feedback for iterative refinement. Meanwhile, recent grounding models such as SAM3 [4] make it increasingly practical to collect explicit visual evidence at inference time. Compared with prior training-free methods, our work further emphasizes combining explicit grounding evidence with conservative iterative self-refinement to mitigate hallucination.

Figure 3. Overview of Kestrel. Given an image-question pair, Kestrel follows a training-free four-stage pipeline for LVLM hallucination mitigation: (1) Initialization, which obtains an initial answer and rewrites it into question-aligned verifiable claims with associated visual entities and claim types; (2) Agent Grounding, which invokes an external SAM3-based grounding agent to collect explicit visual evidence (e.g., segmentation overlays, boxes, and crop-and-zoom views) and convert it into structured textual evidence; (3) Claim-level Verification, which verifies each claim against the cited evidence to produce claim-wise verdicts, confidence scores, and a top-level verification decision; and (4) Self-Refinement, which performs evidence-gated answer updating based on the current and previous verification traces.

3. Method

We propose Kestrel, a training-free framework for mitigating LVLM hallucination with an explicit visual grounding agent and structured evidence-driven self-refinement at test time.
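Such a test-time loop can be sketched as follows; `propose_claims`, `ground`, `verify`, and `refine` are illustrative stand-ins for the four stages, and the names and signatures are our own, not Kestrel's actual interface:

```python
from typing import Callable

def kestrel_style_loop(question: str,
                       initial_answer: str,
                       propose_claims: Callable,
                       ground: Callable,
                       verify: Callable,
                       refine: Callable,
                       max_iters: int = 3) -> str:
    """Hypothetical sketch of a grounding-and-refinement loop:
    propose verifiable claims, collect grounded evidence, verify,
    and conservatively update the answer until it stabilizes."""
    answer, history = initial_answer, []
    for _ in range(max_iters):
        claims = propose_claims(question, answer)
        evidence = ground(claims)            # external grounding agent
        verdict = verify(claims, evidence)   # claim-level verification
        history.append(verdict)
        if verdict["label"] == "supported":  # answer is stable; stop early
            break
        answer = refine(answer, verdict, history)  # evidence-gated update
    return answer
```

Keeping the four stages behind callables mirrors the framework's modularity: the grounding agent, the LVLM judge, and the gated updater can each be swapped without retraining.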
Given an image I and a corresponding question Q, Kestrel iteratively follows a four-step pipeline: (i) initialization, (ii) agent grounding, (iii) claim-level verification, and (iv) self-refinement (see Fig. 3).

3.1. Initialization

Kestrel first queries the LVLM to obtain an initial answer Â(0). To support claim-level verification, Kestrel converts Q into a small set of verifiable claims that directly correspond to the question. Concretely, Kestrel rewrites the question-answer decision into visually checkable claims, each anchored to one or two concrete visual entities. These entities serve as the detection targets for the grounding agent. Meanwhile, based on the verifiable attributes required by the question, we categorize the extracted claims by type (e.g., existence, color, count, position) to route subsequent agent grounding.

3.2. Agent Grounding

To obtain explicit, inspectable grounding evidence, Kestrel invokes an external visual grounding agent built on SAM3 promptable concept segmentation [4].

Visual evidence. SAM3 takes the visual entities in the claims as concept prompts and returns the matched instances, from which Kestrel collects explicit visual evidence, including: (i) segmentation overlays for transparent localization, (ii) instance bounding boxes (derived from SAM3 masks) to support geometry-based reasoning, and (iii) crop-and-zoom views around predicted instances to reduce ambiguity for attribute inspection (e.g., color) and local details.

Structured textual evidence.
To make agent outputs directly usable for claim verification and auditable diagnosis, Kestrel derives textual evidence from the visual evidence via the LVLM for each claim type: (i) for existence, we convert the predicted instances into an existence statement by checking whether the number of matched instances is greater than zero; (ii) for count, we report the instance count computed from the number of predicted masks; (iii) for color, we generate a concise color observation conditioned on the masked crop-and-zoom view and the full image; (iv) for position, we convert SAM3 geometry into text by deriving coarse spatial cues from the union bounding box and, when two entities are involved, computing their relative relation from the corresponding bounding-box centers. Each textual evidence item is paired with a citation identifier and can be referenced during verification and answer revision.

3.3. Claim-level Verification

Given the claims and corresponding structured evidence items, Kestrel performs claim-level verification using an LVLM-as-a-judge. The judge is instructed to base its decision only on the provided evidence and to cite the corresponding evidence. For each claim, the verifier outputs: (i) a verdict (supported / contradicted / insufficient), (ii) a confidence score, and (iii) a short reasoning that must cite the relevant evidence. We then consolidate the claim-wise judgments into a top-level verification verdict for the current answer: it is labeled contradicted if any claim is confidently refuted with cited evidence, supported only when all claims are confidently supported, and insufficient otherwise. The resulting verification trace constitutes an explicit, evidence-grounded audit trail, enabling interpretable analysis of when hallucinations arise and how corrections are triggered.

3.4.
Self-Refinement

Since the external agent and the LVLM may not be fully robust, directly revising the answer based on the instance-level verdict can introduce over-correction. Therefore, Kestrel adopts an evidence-gated self-refinement strategy: it permits correction only when the verification provides sufficiently reliable signals, i.e., high-confidence claim-level judgments together with cited evidence for the corresponding claim. Otherwise, Kestrel preserves the current answer Â(i) (where i denotes the i-th iteration) and proceeds to collect stronger evidence in subsequent rounds.

Importantly, the self-refinement is stateful: the revision step conditions not only on the current verification results but also on prior rounds' claims, evidence, and decisions. Based on the verification trace, Kestrel updates the answer to obtain Â(i+1) and proposes a new set of claims for the next iteration, prioritizing claims that remain uncertain or are implicated by contradictions. This iterative process progressively strengthens evidence and stabilizes decision-making, while remaining training-free.

Kestrel repeats the cycle for a small number of iterations and stops early when the answer stabilizes under consistently supportive verification, or when additional iterations no longer yield stronger evidence. The final output is the answer together with its claim-level verification traces.

4. Experiments

4.1. Experimental Setup

Benchmarks. We evaluate Kestrel on POPE [20] across three source datasets (MS-COCO [21], A-OKVQA [27], and GQA [14]) under random (Rand.), popular (Pop.), and adversarial (Adv.) sampling, and on MME-Hallucination [9], which evaluates fine-grained hallucination across existence, count, position, and color.

LVLM backbones. We evaluate Kestrel with SoTA open-weight LVLMs, e.g., Qwen3-VL 8B [3] and InternVL3.5 8B [31].

Baselines.
We compare against the Qwen3-VL agent (with a zoom-in tool)¹ and training-free LVLM hallucination mitigation baselines: VCD [17], OPERA [12], RITUAL [32], Woodpecker [33], and DeGF [35]. All baselines use the same LVLM backbone for a fair comparison (in the Woodpecker pipeline, all components except the visual verification module are replaced with Qwen3-VL).

¹ https://github.com/QwenLM/Qwen-Agent/blob/main/examples/cookbook_think_with_images.ipynb

Table 1. Results on the POPE [20] benchmark. Higher (↑) accuracy indicates better performance. The best results are bolded, and the second-best are underlined.

Backbone: Qwen3-VL [3]
Method             | MS-COCO [21] Rand./Pop./Adv. | A-OKVQA [27] Rand./Pop./Adv. | GQA [14] Rand./Pop./Adv.
Baseline [3]       | 89.00 / 86.92 / 86.20 | 92.36 / 86.67 / 81.87 | 90.70 / 83.63 / 81.50
Qwen3-VL agent [3] | 91.03 / 88.06 / 86.13 | 92.87 / 85.20 / 78.03 | 91.41 / 81.81 / 78.30
VCD [17]           | 90.40 / 88.80 / 87.41 | 93.53 / 87.86 / 82.00 | 91.56 / 85.76 / 81.93
Woodpecker [33]    | 89.97 / 88.03 / 87.10 | 93.23 / 88.90 / 83.33 | 91.27 / 86.27 / 82.77
RITUAL [32]        | 86.20 / 83.67 / 82.27 | 87.67 / 83.50 / 77.76 | 86.86 / 82.30 / 78.23
OPERA [12]         | 90.50 / 88.83 / 87.50 | 93.76 / 89.50 / 83.86 | 91.80 / 87.11 / 83.30
DeGF [35]          | 90.33 / 88.16 / 86.90 | 92.96 / 87.70 / 82.61 | 91.13 / 83.79 / 82.00
Kestrel (ours)     | 91.53 / 89.30 / 87.53 | 93.46 / 91.73 / 86.76 | 91.67 / 90.33 / 86.27

Backbone: InternVL3.5 [31]
Baseline [31]      | 90.77 / 88.10 / 85.73 | 92.67 / 87.83 / 81.53 | 89.77 / 84.10 / 81.31
VCD [17]           | 91.35 / 89.22 / 87.60 | 92.87 / 89.73 / 83.73 | 91.60 / 85.07 / 83.37
Woodpecker [33]    | 91.20 / 89.11 / 87.50 | 93.73 / 89.80 / 84.00 | 91.43 / 85.16 / 83.26
RITUAL [32]        | 91.60 / 89.03 / 87.48 | 93.71 / 89.75 / 83.90 | 91.39 / 85.18 / 83.29
OPERA [12]         | 91.53 / 89.18 / 87.55 | 93.55 / 89.79 / 83.81 | 91.45 / 85.20 / 83.31
DeGF [35]          | 91.43 / 89.12 / 87.37 | 93.39 / 89.68 / 84.11 | 91.48 / 85.06 / 83.20
Kestrel (ours)     | 91.27 / 89.27 / 88.10 | 93.57 / 91.80 / 87.13 | 91.57 / 89.87 / 86.53

Implementation details.
We set the maximum number of self-refinement iterations to K = 3 and stop early when the answer is stable with two consecutive supported verification verdicts. For the SAM3 grounding agent, we use a confidence threshold of 0.5 (with a recheck threshold of 0.35 when needed for existence claims). In self-refinement, we use confidence thresholds in the range [0.82, 0.90] for different claim types. All experiments are conducted on NVIDIA A100 GPUs. Details of the corresponding prompt templates are provided in the supplementary material.

4.2. Results and Discussion

Results on POPE. Tab. 1 compares Kestrel with baselines under two LVLM backbones (Qwen3-VL and InternVL3.5) across three evaluation sources (MS-COCO, A-OKVQA, and GQA). Overall, Kestrel exhibits the most consistent gains on the more challenging popular and adversarial splits, while remaining competitive on random sets, indicating robust hallucination mitigation across backbones and data sources.

With Qwen3-VL, Kestrel achieves the best performance on MS-COCO across all splits. On A-OKVQA, Kestrel reaches 91.73% (Pop.) and 86.76% (Adv.), surpassing decoding-based baselines such as OPERA (89.50%/83.86%) and VCD (87.86%/82.00%). On GQA, Kestrel further improves to 90.33% (Pop.) and 86.27% (Adv.), outperforming OPERA (87.11%/83.30%) and VCD (85.76%/81.93%) by a clear margin. With InternVL3.5, we observe a similar trend, with Kestrel delivering the largest improvements over the LVLM backbone and the strongest overall performance.

Compared to decoding-centric methods that reshape token distributions at inference time (e.g., VCD [17] and OPERA [12]) and post-hoc correction pipelines such as Woodpecker [33], Kestrel externalizes grounding into explicit evidence and performs claim-level verification with conservative, evidence-gated updates. This design better preserves correctness while reducing risky corrections under stronger prior interference.
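The early-stopping criterion in the implementation details above (terminate once two consecutive top-level verdicts are supported) reduces to a simple check over the verification history. A minimal sketch, with the function name chosen for illustration:

```python
def should_stop_early(verdict_history: list, window: int = 2) -> bool:
    """Stop refining once the last `window` top-level verification
    verdicts are all 'supported', i.e., the answer has stabilized."""
    if len(verdict_history) < window:
        return False
    return all(v == "supported" for v in verdict_history[-window:])
```

A single supported verdict is not enough to terminate; requiring two in a row guards against a lucky pass on noisy evidence.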
Fig. 4 further shows that the gains on POPE mainly come from correcting initially incorrect predictions, while most correct predictions are preserved after refinement. The relatively low rate of over-correction suggests that our conservative gating mechanism effectively suppresses unnecessary revisions, enabling the model to mitigate hallucinations without sacrificing prediction stability.

Figure 4. Prediction transition statistics with Qwen3-VL before and after refinement. Predictions are categorized into four types: correctly preserved, error corrected, over-corrected, and incorrectly preserved. The results show that the refinement process is conservative, retaining most originally correct predictions while correcting a portion of erroneous ones, with limited over-correction. Zoom in for a better view.

Results on MME-Hallucination. Tab. 2 reports MME-Hallucination results decomposed into object-level (Existence, Count) and attribute-level (Position, Color) subsets. Overall, Kestrel achieves the best MME Score under Qwen3-VL, reaching 760.00 (+28.34), outperforming baselines such as OPERA (743.33, +11.67) and VCD/RITUAL (736.66, +5.00). These results suggest that Kestrel's evidence-driven, claim-level verification and conservative update rule particularly strengthen object existence/count and spatial reasoning, yielding more balanced improvements across hallucination-sensitive attributes. We also observe slightly larger variance for Kestrel on some subsets, which is expected when grounding and verification are performed with stochastic components; nevertheless, the mean improvement remains consistent and substantial over all compared methods. Additional results are provided in the supplementary material.

Table 2. Results on the MME-Hallucination [9] benchmark. Higher scores (↑) indicate better performance. The best results are bolded, and the second-best are underlined.

Backbone: Qwen3-VL [3]
Method             | Existence ↑    | Count ↑         | Position ↑      | Color ↑        | MME Score ↑
Baseline [3]       | 195.00 (+0.00) | 175.00 (+0.00)  | 168.33 (+0.00)  | 193.33 (+0.00) | 731.66 (+0.00)
Qwen3-VL agent [3] | 200.00 (+5.00) | 181.67 (+6.67)  | 168.33 (+0.00)  | 193.33 (+0.00) | 743.33 (+11.67)
VCD [17]           | 195.00 (+0.00) | 180.00 (+5.00)  | 168.33 (+6.73)  | 193.33 (+0.00) | 736.66 (+5.00)
Woodpecker [33]    | 195.00 (+0.00) | 173.33 (+1.67)  | 168.33 (+0.00)  | 195.00 (+1.67) | 731.66 (+10.00)
RITUAL [32]        | 195.00 (+0.00) | 180.00 (+5.00)  | 168.33 (+0.00)  | 193.33 (+0.00) | 736.66 (+5.00)
OPERA [12]         | 195.00 (+0.00) | 180.00 (+5.00)  | 168.33 (+0.00)  | 200.00 (+6.67) | 743.33 (+11.67)
DeGF [35]          | 195.00 (+0.00) | 181.67 (+6.67)  | 166.67 (+1.66)  | 188.33 (+5.00) | 732.67 (+1.01)
Kestrel (ours)     | 200.00 (+5.00) | 186.67 (+11.67) | 180.00 (+11.67) | 193.33 (+0.00) | 760.00 (+28.34)

Backbone: InternVL3.5 [31]
Baseline [31]      | 200.00 (+0.00) | 175.00 (+0.00)  | 175.00 (+0.00)  | 193.33 (+0.00) | 743.33 (+0.00)
VCD [17]           | 200.00 (+0.00) | 175.00 (+0.00)  | 175.00 (+0.00)  | 193.33 (+0.00) | 736.66 (−6.67)
Woodpecker [33]    | 200.00 (+0.00) | 166.67 (−8.33)  | 161.67 (−13.33) | 186.67 (−6.66) | 715.01 (−28.32)
RITUAL [32]        | 195.00 (−5.00) | 175.00 (+0.00)  | 175.00 (+0.00)  | 193.33 (+0.00) | 738.33 (−5.00)
OPERA [12]         | 195.00 (−5.00) | 173.33 (−1.67)  | 175.00 (+0.00)  | 195.00 (+1.67) | 738.33 (−5.00)
DeGF [35]          | 195.00 (−5.00) | 175.00 (+0.00)  | 168.33 (−6.67)  | 188.33 (−5.00) | 726.66 (−16.67)
Kestrel (ours)     | 200.00 (+0.00) | 186.67 (+11.67) | 181.67 (+6.67)  | 195.00 (+1.67) | 763.34 (+20.01)

Tab. 2 also presents the results of Kestrel with the InternVL3.5 backbone [31] on MME-Hallucination.
Notably, Kestrel still achieves further improvement over such a strong backbone, showing that our framework remains effective even when the base model already performs at a high level. This result is particularly meaningful, as reducing hallucinations becomes increasingly difficult as the backbone grows stronger. Moreover, among the methods compared in Tab. 2, Kestrel is the only one that consistently improves InternVL3.5, further highlighting the advantage of our grounded, evidence-based refinement framework.

Figure 5. Qualitative results of Kestrel. We compare the VQA responses from the regular baseline and our method based on Qwen3-VL. Zoom in for a better view.

Qualitative Analysis. In Fig. 5, we present results of hallucination mitigation across existence, count, color, and position. The initial LVLM answers are inconsistent with the image content but are corrected by Kestrel through external evidence from the grounding agent and multi-round self-refinement. The visual evidence provides explicit support for target localization and attribute verification, which helps recover missed objects, disambiguate object counts, reject unsupported color predictions, and correct erroneous spatial relations. Based on these grounded cues, the iterative refinement procedure progressively updates the answer through claim-level verification, yielding more reliable corrections than direct one-step revision. These results demonstrate that explicit grounding and conservative iterative refinement work together to effectively reduce hallucinations across diverse perception scenarios.

4.3. Efficiency

Table 3. Efficiency comparison. We report the average inference latency per instance, peak GPU memory, and POPE accuracy on MS-COCO. Experiments are conducted on a single NVIDIA A100 GPU. Lower latency/memory is better, while higher POPE accuracy is better.

Method                  | Average Latency ↓ | GPU Memory ↓     | Checked Cases | POPE (MS-COCO) ↑
Qwen3-VL [3]            | 0.78 s (×1.00)    | 17428 MB (×1.00) | 9000          | 87.37
Kestrel (1st iteration) | 12.00 s (×15.38)  | 21472 MB (×1.23) | 9000          | 89.28
Kestrel (2nd iteration) | 8.54 s (×10.94)   | 21296 MB (×1.22) | 4978          | 89.45
Kestrel (3rd iteration) | 7.12 s (×9.12)    | 21274 MB (×1.22) | 477           | 89.34
Kestrel                 | 18.75 s (×24.03)  | 21472 MB (×1.23) | 9000          | 89.45

Tab. 3 reports the efficiency of Qwen3-VL [3] and Kestrel in terms of average inference latency, peak GPU memory, checked cases, and POPE accuracy on MS-COCO [21]. As expected, Kestrel incurs substantially higher end-to-end latency than single-pass LVLM inference, due to the additional overhead introduced by external grounding, claim-level verification, and iterative self-refinement. Importantly, the per-round statistics are computed over different numbers of checked cases because of early stopping: the first iteration is applied to all 9000 cases, whereas only 4978 and 477 cases proceed to the second and third iterations, respectively. Consequently, later iterations are evaluated on progressively smaller subsets. Since many easy cases are resolved early, the remaining instances typically involve fewer unresolved claims and require less additional evidence, resulting in lower average latency and slightly reduced memory usage in subsequent iterations. This also explains why the best POPE accuracy is already achieved after the second iteration, while the third iteration only yields limited changes on a much smaller subset.
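For reference, the "×" multipliers in Tab. 3 are each latency divided by the 0.78 s single-pass baseline, apparently truncated (not rounded) to two decimals; the arithmetic can be checked directly (the helper name is ours):

```python
import math

def relative_cost(latency_s: float, baseline_s: float = 0.78) -> float:
    """Latency multiplier over the single-pass baseline, truncated to
    two decimal places, which matches the factors reported in Tab. 3."""
    return math.floor(latency_s / baseline_s * 100) / 100

# Latencies from Tab. 3: per-iteration and end-to-end
assert relative_cost(12.00) == 15.38  # 1st iteration
assert relative_cost(8.54) == 10.94   # 2nd iteration
assert relative_cost(7.12) == 9.12    # 3rd iteration
assert relative_cost(18.75) == 24.03  # full pipeline
```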
Based on this trade-off, we set the maximum number of refinement iterations to 3, which provides a practical balance between performance and efficiency.

4.4. Human Study

To complement the automatic benchmark results, we conduct a human preference study to evaluate whether Kestrel produces responses that are more reliable, more interpretable, and better aligned with human preference under evidence-grounded judgment.

Figure 6. Human preference comparison across methods. Bars show the fraction of trials in which each method is preferred (n=60). Error bars denote 95% Wilson confidence intervals.

Setup. We sample 60 evaluation cases and compare five training-free mitigation methods: Kestrel, DeGF, Woodpecker, RITUAL, and VCD, which are also included in our main experiments. For each case, annotators are presented with the candidate outputs in randomized order and asked to choose the response they prefer based on grounding quality, factual consistency with the image, and overall answer reliability. We report the human preference rate of each method.

Results. As shown in Fig. 6, Kestrel is preferred in 41/60 cases (68.3%), substantially outperforming DeGF (13.3%), Woodpecker (11.7%), RITUAL (6.7%), and VCD (0.0%). This result indicates that Kestrel's outputs are more consistent with human judgments of groundedness and answer quality. We attribute this advantage to Kestrel's explicit evidence-backed verification process. Instead of relying only on decoding-time intervention, Kestrel externalizes grounding into structured visual and textual evidence, verifies question-aligned claims against cited evidence, and updates the answer only when the evidence is sufficiently reliable.

4.5. Ablation Study

Designs. Tab. 4 reports the ablation results of Kestrel on POPE-COCO.
Table 4. Ablation study of core design components. We evaluate different settings by progressively enabling the key components in Kestrel. Higher scores indicate better performance.

Setting  | Grounding Agent | Structured Textual Evidence | Claim-level Verification | Evidence-gated Update | Self-Refinement | History-Aware | POPE MS-COCO
baseline | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | 87.37
s1       | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | 87.44
s2       | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | 87.50
s3       | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ | 87.58
s4       | ✓ | ✗ | ✓ | ✓ | ✓ | ✓ | 88.61
s5       | ✓ | ✓ | ✗ | ✓ | ✓ | ✓ | 89.05
s6       | ✓ | ✓ | ✓ | ✗ | ✓ | ✓ | 89.30
s7       | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | 89.16
s8       | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | 89.32
s9       | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 89.45

Table 5. Ablation of visual evidence types. Contribution of different visual evidence forms, including segmentation, bounding boxes, and crop-zoom views. Higher scores indicate better performance.

Setting  | segment | box | crop-zoom | MME Score
baseline | –       | –   | –         | 731.66
s1       | ✗       | ✓   | ✓         | 738.33
s2       | ✓       | ✗   | ✓         | 743.33
s3       | ✓       | ✓   | ✗         | 740.00
s4       | ✓       | ✓   | ✓         | 760.00

Table 6. Ablation of structured textual evidence. We study the contribution of different textual evidence types used for claim verification. Higher scores indicate better performance.

Setting | exist | count | color | position | MME Score
s1      | ✗     | ✓     | ✓     | ✓        | 755.00
s2      | ✓     | ✗     | ✓     | ✓        | 751.67
s3      | ✓     | ✓     | ✗     | ✓        | 753.33
s4      | ✓     | ✓     | ✓     | ✗        | 748.33
s5      | ✓     | ✓     | ✓     | ✓        | 760.00

Naive self-refinement and the grounding agent alone bring only marginal improvements over the baseline, indicating that neither iterative revision nor external tool access is sufficient by itself. Introducing verification-guided refinement further improves performance, while incorporating the grounding agent yields a larger gain, highlighting the importance of explicit grounded evidence. Adding structured textual evidence and claim-level verification leads to further improvements, showing that normalized evidence construction and fine-grained verification are both beneficial. We also observe that removing the evidence-gated update degrades performance, confirming the role of conservative update control in preventing risky corrections.
Finally, self-refinement and stateful (history-aware) refinement provide additional gains on top of the full evidence-verification pipeline, and the full Kestrel achieves the best result overall.

Evidence. The evidence ablations (Tab. 5 and Tab. 6) further highlight the complementary roles of visual and textual evidence in Kestrel. For visual evidence, removing any single component consistently degrades performance relative to the full setting, showing that segmentation overlays (segment), bounding boxes (box), and crop-zoom views (crop-zoom) all contribute to the final result. Among them, removing segmentation or crop-zoom causes a larger drop than removing boxes, suggesting that precise region localization and enlarged local inspection are particularly important for reliable claim verification. For structured textual evidence, ablating any evidence type also leads to performance degradation, confirming that existence, count, color, and position evidence all provide useful support for refinement. Notably, removing position or count evidence causes a relatively larger drop, indicating that these attributes benefit more from explicit structured evidence. Overall, the best performance is achieved when all visual and textual evidence types are used together, demonstrating that the different evidence forms are complementary rather than redundant.

5. Conclusion

In this paper, we presented Kestrel, a training-free framework for LVLM hallucination mitigation that integrates explicit visual grounding with structured evidence-driven self-refinement. By converting external tool outputs into citable visual and textual evidence, Kestrel enables claim-level verification and conservative evidence-gated answer updates, providing both improved reliability and transparent verification traces.
Extensive experiments on POPE and MME-Hallucination show that Kestrel consistently improves over strong training-free baselines across multiple backbones, with particularly clear gains under more challenging hallucination settings. Further ablations validate the contributions of each design. Overall, Kestrel demonstrates that coupling explicit grounding with interpretable iterative verification offers an effective and stable path toward more trustworthy LVLMs.

References

[1] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
[2] Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. OpenFlamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint, 2023.
[3] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025.
[4] Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Liliane Momeni, Rishi Hazra, Shuangrui Ding, Sagar Vaze, Francois Porcher, Feng Li, Siyuan Li, Aishwarya Kamath, Ho Kei Cheng, Piotr Dollár, Nikhila Ravi, Kate Saenko, Pengchuan Zhang, and Christoph Feichtenhofer. SAM 3: Segment anything with concepts, 2025.
[5] Guiming Hardy Chen, Shunian Chen, Ruifei Zhang, Junying Chen, Xiangbo Wu, Zhiyi Zhang, Zhihong Chen, Jianquan Li, Xiang Wan, and Benyou Wang. ALLaVA: Harnessing GPT4V-synthesized data for lite vision-language models, 2024.
[6] Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. ShareGPT4V: Improving large multi-modal models with better captions, 2023.
[7] Zhiyang Chen, Yousong Zhu, Yufei Zhan, Zhaowen Li, Chaoyang Zhao, Jinqiao Wang, and Ming Tang. Mitigating hallucination in visual language models with visual supervision. arXiv preprint arXiv:2311.16479, 2023.
[8] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. Advances in Neural Information Processing Systems, 36:49250–49267, 2023.
[9] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.
[10] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
[11] Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. HallusionBench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14375–14385, 2024.
[12] Qidong Huang, Xiaoyi Dong, Pan Zhang, Bin Wang, Conghui He, Jiaqi Wang, Dahua Lin, Weiming Zhang, and Nenghai Yu.
OPERA: Alleviating hallucination in multimodal large language models via over-trust penalty and retrospection-allocation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13418–13427, 2024.
[13] Yiyang Huang, Liang Shi, Yitian Zhang, Yi Xu, and Yun Fu. SHIELD: Suppressing hallucinations in LVLM encoders via bias and vulnerability defense. arXiv preprint arXiv:2510.16596, 2025.
[14] Drew A Hudson and Christopher D Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6700–6709, 2019.
[15] Chaoya Jiang, Haiyang Xu, Mengfan Dong, Jiaxing Chen, Wei Ye, Ming Yan, Qinghao Ye, Ji Zhang, Fei Huang, and Shikun Zhang. Hallucination augmented contrastive learning for multimodal large language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27036–27046, 2024.
[16] Prannay Kaul, Zhizhong Li, Hao Yang, Yonatan Dukler, Ashwin Swaminathan, CJ Taylor, and Stefano Soatto. THRONE: An object-based hallucination benchmark for the free-form generations of large vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27228–27238, 2024.
[17] Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, and Lidong Bing. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. arXiv preprint arXiv:2311.16922, 2023.
[18] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pages 19730–19742. PMLR, 2023.
[19] Lei Li, Zhihui Xie, Mukai Li, Shunian Chen, Peiyi Wang, Liang Chen, Yazheng Yang, Benyou Wang, and Lingpeng Kong. Silkie: Preference distillation for large visual language models. arXiv preprint arXiv:2312.10665, 2023.
[20] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023.
[21] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[22] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.
[23] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024.
[24] Xinyu Lyu, Beitao Chen, Lianli Gao, Hengtao Shen, and Jingkuan Song. Alleviating hallucinations in large vision-language models through hallucination-induced optimization. Advances in Neural Information Processing Systems, 37:122811–122832, 2024.
[25] Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023.
[26] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[27] Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-OKVQA: A benchmark for visual question answering using world knowledge. In European Conference on Computer Vision, pages 146–162. Springer, 2022.
[28] Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liangyan Gui, Yu-Xiong Wang, Yiming Yang, et al. Aligning large multimodal models with factually augmented RLHF. In Findings of the Association for Computational Linguistics: ACL 2024, pages 13088–13110, 2024.
[29] Junyang Wang, Yuhang Wang, Guohai Xu, Jing Zhang, Yukai Gu, Haitao Jia, Jiaqi Wang, Haiyang Xu, Ming Yan, Ji Zhang, et al. AMBER: An LLM-free multi-dimensional benchmark for MLLM hallucination evaluation. arXiv preprint arXiv:2311.07397, 2023.
[30] Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Song XiXuan, et al. CogVLM: Visual expert for pretrained language models. Advances in Neural Information Processing Systems, 37:121475–121499, 2024.
[31] Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025.
[32] Sangmin Woo, Jaehyuk Jang, Donguk Kim, Yubin Choi, and Changick Kim. RITUAL: Random image transformations as a universal anti-hallucination lever in large vision language models. arXiv preprint, 2024.
[33] Shukang Yin, Chaoyou Fu, Sirui Zhao, Tong Xu, Hao Wang, Dianbo Sui, Yunhang Shen, Ke Li, Xing Sun, and Enhong Chen. Woodpecker: Hallucination correction for multimodal large language models. Science China Information Sciences, 67(12):220105, 2024.
[34] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023.
[35] Ce Zhang, Zifu Wan, Zhehan Kan, Martin Q Ma, Simon Stepputtis, Deva Ramanan, Russ Salakhutdinov, Louis-Philippe Morency, Katia Sycara, and Yaqi Xie. Self-correcting decoding with generative feedback for mitigating hallucinations in large vision-language models. arXiv preprint arXiv:2502.06130, 2025.
[36] Jinrui Zhang, Teng Wang, Haigang Zhang, Ping Lu, and Feng Zheng. Reflective instruction tuning: Mitigating hallucinations in large vision-language models. In European Conference on Computer Vision, pages 196–213. Springer, 2024.

Technical Appendices

A. Prompt Template

We provide the prompt templates used in the three core stages of Kestrel: initialization (Fig. 7), claim-level verification (Fig. 8), and self-refinement (Fig. 9). These prompts follow the main evidence-grounded iterative framework, in which Kestrel first rewrites the original question into a concrete and visually verifiable claim, then verifies the claim against the collected evidence, and finally updates the answer conservatively through refinement. To improve reproducibility, all prompts adopt constrained JSON outputs with explicit field definitions. The initialization prompt produces a question-aligned claim for grounding, the verification prompt returns evidence-based verdicts and confidence scores, and the self-refinement prompt updates the answer while proposing a new claim for the next round.

B. More Case Studies

Fig. 10 and Fig. 11 present representative examples of Kestrel's evidence-grounded self-refinement process.
In each case, Qwen3-VL [3] and InternVL3.5 [31] initially produce an incorrect answer based on their first-pass visual impression, while Kestrel revises the prediction after introducing an explicit claim and verifying it against grounded evidence. The examples cover several queries, showing that the proposed framework is effective across different claim types. These cases highlight two key properties of Kestrel. First, the method can correct errors even when the initial response is confidently wrong, by converting free-form reasoning into claim-level verification. Second, the refinement is interpretable: each answer update is supported by explicit visual evidence, such as segmentation overlays, cropped regions, or structured text evidence describing existence or position. Overall, the qualitative results show that Kestrel improves faithfulness by grounding the refinement process in verifiable visual observations, rather than relying on unconstrained self-correction.

Prompt template for Initialization:

  messages = [{"role": "user", "content": [
    {"type": "image", "image": sample["image"]},
    {"type": "text", "text":
      You are given an image and a Yes/No question. Determine the answer and output one verifiable claim.
      Task requirements:
      - answer must be exactly "Yes" or "No".
      - output exactly one claim in verifiable_claims.
      - claim fields: id, type, text, targets.
      - type must be exactly "{sample["expected_claim_type"]}".
      - text must be concrete and visually checkable.
      Target rules:
      - if type = position: targets contain 1–2 short object phrases; for a two-object relation, use [subject, anchor].
      - otherwise: targets contain exactly one object noun from the question; do not include color, number, or position words.
      Return JSON only:
      {"answer": "Yes|No",
       "verifiable_claims": [{"id": "c1", "type": "{sample["expected_claim_type"]}", "text": "...", "targets": sample["example_targets"]}]}
      Question: "{sample["question"]}"
      PrevSummary (optional): "{sample["prev_summary"]}"
      <image>
    }]}]

Yes-Guard template (optional, only when the initial answer is "Yes"):

  messages = [{"role": "user", "content": [
    {"type": "image", "image": sample["image"]},
    {"type": "text", "text":
      Check whether the Yes/No question is true in this image.
      Return JSON only:
      {"answer": "yes|no|unclear", "confidence": "high|medium|low", "reason": "one short sentence"}
      Rules:
      - use image evidence only;
      - yes: clearly true; no: clearly false; otherwise unclear;
      - be conservative when evidence is weak.
      Question: "{sample["question"]}"
      Target hint: "{sample["target_hint"]}"
      <image>
    }]}]

Figure 7. Prompt template for the initialization stage.

Prompt template for claim-level verification:

  messages = [{"role": "user", "content": [
    sample["evidence_content"],
    {"type": "text", "text":
      You are a strict verifier. Judge each claim using ONLY the provided evidence items.
      How to use evidence:
      - Each item has an EvidenceID and a Type.
      - Use only EvidenceID values that appear in the given evidence.
      - Typical IDs:
        - seg_overlay: e_seg_{tkey}
        - count_text: e_count_{tkey}
        - count_compare_text: e_countcmp_{tkey}
        - count_vision_text: e_countvis_{tkey}
        - count_vision_compare_text: e_countviscmp_{tkey}
        - crop_zoom: e_crop_{tkey}
        - color_text: e_color_{tkey}
        - position_text: e_pos_{tkey}
        - position_relation_text: e_posrel_{claim_id}
      Judging rules:
      - For each claim, choose exactly one status: supported | contradicted | insufficient.
      - supported: evidence clearly confirms the claim.
      - contradicted: evidence clearly refutes the claim.
      - insufficient: evidence is missing or ambiguous.
      - Do NOT use common sense.
      Top-level verdict:
      - contradicted: at least one claim is strongly contradicted.
      - supported: all claims are strongly supported.
      - otherwise: insufficient.
      Return JSON only:
      {"verdict": "supported|contradicted|insufficient",
       "checked": [{"claim_id": "...", "status": "supported|contradicted|insufficient", "confidence": 0.0, "why": "...", "citations": ["EvidenceID1", "EvidenceID2"]}]}
      Input:
      Question: "{sample["question"]}"
      Claims: sample["claims_json"]
      PrevVerdict (optional): sample["prev_verdict_json"]
    }]}]

Figure 8. Prompt template for the claim-level verification stage.

C. Limitations and Future Work

Kestrel introduces a multi-stage, iterative inference pipeline that includes claim decomposition, external grounding, evidence construction, claim-level verification, and self-refinement.
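The stages enumerated above can be summarized as a single refinement loop with early stopping. The sketch below is our schematic rendering of that loop; the class and method names, and the exact stopping rule, are illustrative assumptions rather than the released implementation.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    status: str  # 'supported' | 'contradicted' | 'insufficient'

def kestrel_refine(image, question, model, grounder, max_rounds=3):
    """Schematic Kestrel loop: claim -> grounding -> evidence -> verify -> refine."""
    answer, claim = model.initialize(image, question)  # answer plus verifiable claim
    history = []
    for _ in range(max_rounds):
        visual = grounder.ground(image, claim)         # masks, boxes, crop-zoom views
        textual = grounder.structure(visual)           # exist/count/color/position facts
        verdict = model.verify(claim, visual, textual) # claim-level verification
        if verdict.status == "supported":
            break                                      # early stop: claim is confirmed
        answer, claim = model.refine(question, answer, verdict, history)
        history.append((claim, verdict))               # history-aware refinement
    return answer
```

With `max_rounds=3`, the loop matches the iteration cap chosen in the efficiency trade-off discussed in Sec. 4; the early stop on a supported verdict is what keeps the average number of rounds low in practice.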
Prompt template for Self-Refine:

  messages = [{"role": "user", "content": [
    {"type": "image", "image": sample["image"]},
    {"type": "text", "text":
      You are a cautious VQA assistant. Decide the final answer as a binary choice.
      HARD REQUIREMENT:
      - Answer must be exactly "Yes" or "No" (capitalized).
      - No punctuation, no explanation, no extra tokens.
      - You must output a concrete answer.
      Use all available history:
      - previous rounds: hypothesis / verify / refine;
      - current round: hypothesis and verify details.
      Also output one new claim for the next round.
      Claim constraints:
      - RELEVANCE: every new claim must directly verify the Question's Yes/No.
      - new_claim.type must be exactly "{sample["expected_claim_type"]}".
      - claim.text must mention at least one key entity / attribute / relation from the Question.
      Type-specific target rules:
      - if type = position:
        - targets contain 1–2 short object phrases from the Question;
        - if a two-object relation exists, use [subject, anchor];
        - keep distinguishing modifiers when needed.
      - otherwise:
        - targets contain exactly one object noun from the Question;
        - do not include color / number / position / quantifier words.
      - if a claim is not clearly usable to support or refute the Question, it is invalid and must not be output.
      Return JSON only:
      {"new_claims": [{"id": "c1", "type": "{sample["expected_claim_type"]}", "text": "...", "targets": sample["example_targets"], "priority": 1}],
       "Answer": "Yes|No"}
      Question: "{sample["question"]}"
      PreviousAnswer: "{sample["A_t"]}"
      RoundHistory: sample["round_history_json"]
      CurrentRoundContext: sample["current_round_context_json"]
      <image>
    }]}]

Figure 9. Prompt template for the self-refinement stage.

Although early stopping significantly reduces the number of refinement rounds in practice, the overall inference latency remains substantially higher than that of single-pass LVLM inference due to the additional tool calls and verification steps.
This overhead may limit the applicability of the framework in latency-sensitive or large-scale deployment scenarios. In future work, we plan to explore more efficient strategies for evidence-driven verification, such as adaptive tool invocation based on uncertainty signals, prioritizing high-risk claims for grounding, and reusing intermediate evidence across iterations. Such improvements could substantially reduce the computational overhead while preserving the benefits of explicit grounding and iterative verification.

Figure 10. Qualitative results of Kestrel. We compare the VQA responses from the regular baseline and our method based on Qwen3-VL. Zoom in for a better view.

Figure 11. Qualitative results of Kestrel. We compare the VQA responses from the regular baseline and our method based on InternVL3.5. Zoom in for a better view.