Kestrel: Grounding Self-Refinement for LVLM Hallucination Mitigation
Jiawei Mao¹, Hardy Chen¹, Haoqin Tu¹, Yuhan Wang¹, Letian Zhang¹, Zeyu Zheng², Huaxiu Yao³, Zirui Wang⁴, Cihang Xie¹, Yuyin Zhou¹
¹UC Santa Cruz  ²UC Berkeley  ³UNC-Chapel Hill  ⁴Apple

Figure 1. Kestrel progressively corrects hallucinated LVLM answers by integrating an external grounding agent with iterative self-improvement. At each round, the model grounds the current claim with explicit visual and textual evidence, conducts claim-level verification, and conservatively refines the response, yielding a final answer that is both more reliable and more interpretable.

Abstract

Large vision-language models (LVLMs) have become increasingly strong but remain prone to hallucinations in multimodal tasks, which significantly narrows their deployment. As training these LVLMs to avoid hallucinations becomes prohibitively expensive for larger models, training-free methods offer a cheap and flexible solution to this problem, yet existing approaches based on decoding or tool use often bring limited gains and/or weak interpretability. We propose Kestrel, a training-free framework for LVLM hallucination mitigation that combines an explicit visual-grounding agent with an evidence-verified self-refinement mechanism. In detail, Kestrel first collects explicit visual evidence and converts tool outputs into reusable and structured textual evidence. Second, to take full advantage of this evidence, Kestrel verifies it via an LVLM judge for evidence checking, then iteratively self-refines answers based on the verified evidence to reduce the risk of over-correction.
Extensive experiments show that Kestrel improves performance over strong baselines across hallucination benchmarks (e.g., an average of +3.31% on POPE and +28.34 on MME-Hallucination with Qwen3-VL), while providing transparent verification traces for hallucination diagnosis and analysis; for instance, the integrated self-refinement module and the grounding agent together contribute an average +2.0% gain on POPE. Project website: https://jwmao1.github.io/Kestrel_project/

1. Introduction

Recent advances in large-scale pretraining [1, 26, 34] and multimodal instruction tuning [8, 22] have substantially improved the capabilities of large vision-language models (LVLMs) [3, 10, 31] on multimodal understanding and reasoning tasks such as visual question answering (VQA). However, LVLMs still exhibit hallucination, producing responses that are inconsistent with or weakly supported by the input image. Empirical studies [20, 28, 29] show that this issue remains prevalent, making hallucination a central challenge for improving the reliability of LVLMs.

To mitigate hallucination, two broad classes of methods have been proposed: training-based and training-free. For the training-based line of work, continual training with hallucination annotations or alignment with external feedback has been shown to be effective [7, 15, 24, 28, 36]. However, these training-based solutions incur significant data and compute overhead, posing hurdles for real-world deployment. Existing training-free methods for hallucination mitigation improve test-time correction without additional training, but leave key gaps to be filled: (i) limited gains and robustness when operating purely on internal decoding dynamics without external grounding evidence, and (ii) limited reliability when correction is performed in a single pass.
Distribution-contrast methods [17, 32] can reduce object hallucinations but remain sensitive to perturbations and often favor common-object representations. Many approaches rely on internal logit dynamics [13] or language-level decoding control [12], which can yield brittle corrections that are difficult to validate against concrete visual evidence. On the other hand, methods that introduce external verification may produce non-deterministic evidence due to the randomness of tools [35]. While other methods [33] can collect reliable evidence, their one-time verification-and-update can be insufficient to prevent over-correction in challenging cases.

Building on these limitations, we propose Kestrel (see Fig. 1), a training-free framework for LVLM hallucination mitigation that unifies an explicit visual grounding agent with evidence-driven iterative self-refinement. Specifically, Kestrel first decomposes the question into verifiable claim-level targets (e.g., existence, color, count, and position), and then invokes SAM3 [4] around each target to collect segmentation overlays, bounding boxes, target crop-and-zoom views, and textual evidence derived from the collected visual evidence, all of which are collated as structured evidence items with citation identifiers. The framework then performs claim-level verification with verdicts and outputs confidence-aware verification results, forming an auditable evidence chain. To regulate potential over-correction, we further introduce an evidence-gated update scheme into the iteration: the framework progressively supplements and strengthens claim-level evidence through multiple rounds of verification and revision, and permits answer flips only when evidence strength, confidence, and evidence coverage jointly satisfy predefined criteria.
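The flip criteria just described can be made concrete as a small predicate. A minimal sketch follows; the function name and threshold defaults are illustrative placeholders, not the paper's actual settings:

```python
def allow_flip(evidence_strength: float,
               confidence: float,
               coverage: float,
               min_strength: float = 0.8,
               min_confidence: float = 0.85,
               min_coverage: float = 1.0) -> bool:
    """Permit an answer flip only when evidence strength, verifier
    confidence, and claim coverage jointly clear their thresholds.
    Threshold defaults here are hypothetical, not from the paper."""
    return (evidence_strength >= min_strength
            and confidence >= min_confidence
            and coverage >= min_coverage)
```

If any one signal falls below its threshold, the current answer is kept and more evidence is gathered in the next round.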
These designs preserve the training-free setting while improving the interpretability, robustness, and decision stability of hallucination mitigation.

Experiments show that Kestrel remains fully training-free, yet consistently reduces hallucinations at test time across multiple benchmarks, with improvements that transfer across different state-of-the-art LVLM backbones. On POPE [20] (MS-COCO, A-OKVQA, and GQA), Kestrel improves accuracy by an average of +3.31 percentage points over Qwen3-VL and +3.03 over InternVL3.5; it also surpasses prior training-free mitigation baselines by +1.38 and +1.47 points on average under the same backbones, respectively. On the more challenging MME-Hallucination [9], Kestrel boosts Qwen3-VL by +28.34 points and exceeds OPERA [12] by +16.67, delivering consistent gains across diverse hallucination types (existence, count, and position) while maintaining strong overall performance and setting a new state of the art.

Our main contributions are summarized as follows:
• We propose Kestrel, a training-free LVLM hallucination mitigation framework that unifies an explicit visual grounding agent with iterative self-refinement at test time. Kestrel decomposes answers into verifiable claim-level targets, grounds them with structured visual and textual evidence, and performs conservative multi-round verification and revision to improve interpretability and reduce over-correction.
• Kestrel achieves state-of-the-art performance in hallucination mitigation on POPE and the more fine-grained MME-Hallucination.
• Kestrel generalizes across multiple state-of-the-art LVLM backbones with substantial and consistent gains, showing that the framework is backbone-agnostic and broadly applicable in the training-free setting.

2. Related Work

2.1.
Large Vision-Language Models

Large vision-language models (LVLMs) have advanced rapidly through large-scale multimodal pretraining and instruction tuning, achieving strong performance across multimodal understanding and reasoning tasks. Representative paradigms include CLIP-style vision-language pretraining [26], Flamingo-style few-shot multimodal modeling [1], and BLIP-2-style [18] modular alignment between frozen vision encoders and LLMs. LVLMs such as LLaVA [22, 23], InstructBLIP [8], OpenFlamingo [2], CogVLM [30], Kosmos-2 [25], and recent models [3, 5, 6, 10, 31] further demonstrate the effectiveness of scalable multimodal alignment and visual instruction tuning. Meanwhile, grounded multimodal modeling has become increasingly important, as exemplified by Kosmos-2 [25], which explicitly supports phrase grounding and visual referring. Nevertheless, current LVLMs still struggle to maintain faithful grounding between generated responses and image content, especially in fine-grained reasoning scenarios, making hallucination a persistent challenge for reliable deployment.

Figure 2. Kestrel vs. prior training-free hallucination mitigation methods. By combining an external grounding agent with iterative self-improvement, Kestrel collects explicit visual evidence and further converts tool outputs into structured textual evidence for verification. Compared with prior approaches, this design yields more interpretable and stable evidence, reduces overconfident corrections, and avoids the biased interpretation that may arise when LVLMs rely only on raw visual evidence.

2.2. Hallucination in LVLMs

Hallucination is a persistent problem in large vision-language models (LVLMs). Early studies [20] show that LVLMs often generate content inconsistent with the input image, especially by predicting non-existent objects, while POPE [20] improves the stability of such evaluation.
Subsequent work shows that hallucination extends beyond object existence to finer-grained errors in attributes, counts, and relations, as benchmarked by AMBER [29]. More challenging settings, such as visual illusions and ambiguous local evidence, are further explored in HallusionBench [11]. Broader benchmarks, including MME [9], MMHal-Bench [28], and THRONE [16], further suggest that hallucination is heterogeneous, benchmark-sensitive, and closely tied to failures in visual grounding and multimodal reasoning. These findings motivate mitigation methods that verify model outputs against explicit and fine-grained visual evidence.

2.3. Training-based Hallucination Mitigation

Early approaches improve faithfulness by redesigning instruction data or supervision signals so that models better distinguish grounded from ungrounded responses. For example, robust visual instruction tuning [7] introduces hallucination-oriented supervision, while HACL [15] uses contrastive learning to separate grounded and hallucinated representations. Reflective instruction tuning [36] further improves reliability by adding rationale supervision. Alignment-based methods, such as factually augmented RLHF [28] and Silkie [19], incorporate preference or factual signals during post-training, and HIO [24] strengthens token-level contrastive learning around hallucinated content. Overall, training-based methods are effective, but they usually require additional annotations, synthetic data, preference collection, or repeated optimization, leading to higher training cost and deployment complexity.

2.4. Training-free Hallucination Mitigation

Training-free hallucination mitigation aims to reduce hallucination at inference time without updating model parameters.
A major line of work focuses on contrastive or decoding-based strategies, such as VCD [17], RITUAL [32], OPERA [12], and SHIELD [13], which alleviate hallucination by intervening on decoding behavior or visual token representations. Another line introduces explicit verification or post-hoc correction. For example, Woodpecker [33] adopts a multi-stage correction pipeline, while DeGF [35] leverages text-to-image generative feedback for iterative refinement. Meanwhile, recent grounding models such as SAM3 [4] make it increasingly practical to collect explicit visual evidence at inference time. Compared with prior training-free methods, our work further emphasizes combining explicit grounding evidence with conservative iterative self-refinement to mitigate hallucination.

Figure 3. Overview of Kestrel. Given an image-question pair, Kestrel follows a training-free four-stage pipeline for LVLM hallucination mitigation: (1) Initialization, which obtains an initial answer and rewrites it into question-aligned verifiable claims with associated visual entities and claim types; (2) Agent Grounding, which invokes an external SAM3-based grounding agent to collect explicit visual evidence (e.g., segmentation overlays, boxes, and crop-and-zoom views) and convert it into structured textual evidence; (3) Claim-level Verification, which verifies each claim against the cited evidence to produce claim-wise verdicts, confidence scores, and a top-level verification decision; and (4) Self-Refinement, which performs evidence-gated answer updating based on the current and previous verification traces.

3. Method

We propose Kestrel, a training-free framework for mitigating LVLM hallucination with an explicit visual grounding agent and structured evidence-driven self-refinement at test time.
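Such a test-time loop can be sketched as follows; `propose_claims`, `ground`, `verify`, and `refine` are illustrative stand-ins for the four stages, and the names and signatures are our own, not Kestrel's actual interface:

```python
from typing import Callable

def kestrel_style_loop(question: str,
                       initial_answer: str,
                       propose_claims: Callable,
                       ground: Callable,
                       verify: Callable,
                       refine: Callable,
                       max_iters: int = 3) -> str:
    """Hypothetical sketch of a grounding-and-refinement loop:
    propose verifiable claims, collect grounded evidence, verify,
    and conservatively update the answer until it stabilizes."""
    answer, history = initial_answer, []
    for _ in range(max_iters):
        claims = propose_claims(question, answer)
        evidence = ground(claims)            # external grounding agent
        verdict = verify(claims, evidence)   # claim-level verification
        history.append(verdict)
        if verdict["label"] == "supported":  # answer is stable; stop early
            break
        answer = refine(answer, verdict, history)  # evidence-gated update
    return answer
```

Keeping the four stages behind callables mirrors the framework's modularity: the grounding agent, the LVLM judge, and the gated updater can each be swapped without retraining.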
Given an image I and a corresponding question Q, Kestrel iteratively follows a four-step pipeline: (i) initialization, (ii) agent grounding, (iii) claim-level verification, and (iv) self-refinement (see Fig. 3).

3.1. Initialization

Kestrel first queries the LVLM to obtain an initial answer Â(0). To support claim-level verification, Kestrel converts Q into a small set of verifiable claims that directly correspond to the question. Concretely, Kestrel rewrites the question-answer decision into visually checkable claims, each anchored to one or two concrete visual entities. These entities serve as the detection targets for the grounding agent. Meanwhile, based on the verifiable attributes required by the question, we categorize the extracted claims by type (e.g., existence, color, count, position) to route subsequent agent grounding.

3.2. Agent Grounding

To obtain explicit, inspectable grounding evidence, Kestrel invokes an external visual grounding agent built on SAM3 promptable concept segmentation [4].

Visual evidence. SAM3 takes the visual entities in the claims as concept prompts and returns the matched instances, from which Kestrel collects explicit visual evidence, including: (i) segmentation overlays for transparent localization, (ii) instance bounding boxes (derived from SAM3 masks) to support geometry-based reasoning, and (iii) crop-and-zoom views around predicted instances to reduce ambiguity for attribute inspection (e.g., color) and local details.

Structured textual evidence.
To make agent outputs directly usable for claim verification and auditable diagnosis, Kestrel derives textual evidence from the visual evidence via the LVLM for each claim type: (i) for existence, we convert the predicted instances into an existence statement by checking whether the number of matched instances is greater than zero; (ii) for count, we report the instance count computed from the number of predicted masks; (iii) for color, we generate a concise color observation conditioned on the masked crop-and-zoom view and the full image; (iv) for position, we convert SAM3 geometry into text by deriving coarse spatial cues from the union bounding box and, when two entities are involved, computing their relative relation from the corresponding bounding-box centers. Each textual evidence item is paired with a citation identifier and can be referenced during verification and answer revision.

3.3. Claim-level Verification

Given the claims and corresponding structured evidence items, Kestrel performs claim-level verification using an LVLM-as-a-judge. The judge is instructed to base its decision only on the provided evidence and to cite the corresponding evidence. For each claim, the verifier outputs: (i) a verdict (supported / contradicted / insufficient), (ii) a confidence score, and (iii) a short reasoning that must cite the relevant evidence. We then consolidate the claim-wise judgments into a top-level verification verdict for the current answer: it is labeled contradicted if any claim is confidently refuted with cited evidence, supported only when all claims are confidently supported, and insufficient otherwise. The resulting verification trace constitutes an explicit, evidence-grounded audit trail, enabling interpretable analysis of when hallucinations arise and how corrections are triggered.

3.4.
Self-Refinement

Since the external agent and the LVLM may not be fully robust, directly revising the answer based on the instance-level verdict can introduce over-correction. Therefore, Kestrel adopts an evidence-gated self-refinement strategy: it permits correction only when the verification provides sufficiently reliable signals, i.e., high-confidence claim-level judgments together with cited evidence for the corresponding claim. Otherwise, Kestrel preserves the current answer Â(i) (where i denotes the i-th iteration) and proceeds to collect stronger evidence in subsequent rounds.

Importantly, the self-refinement is stateful: the revision step conditions not only on the current verification results but also on prior rounds' claims, evidence, and decisions. Based on the verification trace, Kestrel updates the answer to obtain Â(i+1) and proposes a new set of claims for the next iteration, prioritizing claims that remain uncertain or are implicated by contradictions. This iterative process progressively strengthens evidence and stabilizes decision-making, while remaining training-free.

Kestrel repeats the cycle for a small number of iterations and stops early when the answer stabilizes under consistently supportive verification, or when additional iterations no longer yield stronger evidence. The final output is the answer together with its claim-level verification traces.

4. Experiments

4.1. Experimental Setup

Benchmarks. We evaluate Kestrel on POPE [20] across three source datasets (MS-COCO [21], A-OKVQA [27], and GQA [14]) under random (Rand.), popular (Pop.), and adversarial (Adv.) sampling, and on MME-Hallucination [9], which evaluates fine-grained hallucination across existence, count, position, and color.

LVLM backbones. We evaluate Kestrel with SoTA open-weight LVLMs, e.g., Qwen3-VL 8B [3] and InternVL3.5 8B [31].

Baselines.
We compare against the Qwen3-VL agent (with a zoom-in tool)¹ and training-free LVLM hallucination mitigation baselines: VCD [17], OPERA [12], RITUAL [32], Woodpecker [33], and DeGF [35]. All baselines use the same LVLM backbone for a fair comparison (in the Woodpecker pipeline, all components except the visual verification module are replaced with Qwen3-VL).

¹ https://github.com/QwenLM/Qwen-Agent/blob/main/examples/cookbook_think_with_images.ipynb

Table 1. Results on the POPE [20] benchmark. Higher (↑) accuracy indicates better performance. The best results are bolded, and the second-best are underlined.

Backbone: Qwen3-VL [3]
Method             | MS-COCO [21] Rand./Pop./Adv. | A-OKVQA [27] Rand./Pop./Adv. | GQA [14] Rand./Pop./Adv.
Baseline [3]       | 89.00 / 86.92 / 86.20 | 92.36 / 86.67 / 81.87 | 90.70 / 83.63 / 81.50
Qwen3-VL agent [3] | 91.03 / 88.06 / 86.13 | 92.87 / 85.20 / 78.03 | 91.41 / 81.81 / 78.30
VCD [17]           | 90.40 / 88.80 / 87.41 | 93.53 / 87.86 / 82.00 | 91.56 / 85.76 / 81.93
Woodpecker [33]    | 89.97 / 88.03 / 87.10 | 93.23 / 88.90 / 83.33 | 91.27 / 86.27 / 82.77
RITUAL [32]        | 86.20 / 83.67 / 82.27 | 87.67 / 83.50 / 77.76 | 86.86 / 82.30 / 78.23
OPERA [12]         | 90.50 / 88.83 / 87.50 | 93.76 / 89.50 / 83.86 | 91.80 / 87.11 / 83.30
DeGF [35]          | 90.33 / 88.16 / 86.90 | 92.96 / 87.70 / 82.61 | 91.13 / 83.79 / 82.00
Kestrel (ours)     | 91.53 / 89.30 / 87.53 | 93.46 / 91.73 / 86.76 | 91.67 / 90.33 / 86.27

Backbone: InternVL3.5 [31]
Baseline [31]      | 90.77 / 88.10 / 85.73 | 92.67 / 87.83 / 81.53 | 89.77 / 84.10 / 81.31
VCD [17]           | 91.35 / 89.22 / 87.60 | 92.87 / 89.73 / 83.73 | 91.60 / 85.07 / 83.37
Woodpecker [33]    | 91.20 / 89.11 / 87.50 | 93.73 / 89.80 / 84.00 | 91.43 / 85.16 / 83.26
RITUAL [32]        | 91.60 / 89.03 / 87.48 | 93.71 / 89.75 / 83.90 | 91.39 / 85.18 / 83.29
OPERA [12]         | 91.53 / 89.18 / 87.55 | 93.55 / 89.79 / 83.81 | 91.45 / 85.20 / 83.31
DeGF [35]          | 91.43 / 89.12 / 87.37 | 93.39 / 89.68 / 84.11 | 91.48 / 85.06 / 83.20
Kestrel (ours)     | 91.27 / 89.27 / 88.10 | 93.57 / 91.80 / 87.13 | 91.57 / 89.87 / 86.53

Implementation details.
We set the maximum number of self-refinement iterations to K = 3 and stop early when the answer is stable with two consecutive supported verification verdicts. For the SAM3 grounding agent, we use a confidence threshold of 0.5 (with a recheck threshold of 0.35 when needed for existence claims). In self-refinement, we use confidence thresholds in the range [0.82, 0.90] for different claim types. All experiments are conducted on NVIDIA A100 GPUs. Details of the corresponding prompt templates are provided in the supplementary material.

4.2. Results and Discussion

Results on POPE. Tab. 1 compares Kestrel with baselines under two LVLM backbones (Qwen3-VL and InternVL3.5) across three evaluation sources (MS-COCO, A-OKVQA, and GQA). Overall, Kestrel exhibits the most consistent gains on the more challenging popular and adversarial splits, while remaining competitive on random sets, indicating robust hallucination mitigation across backbones and data sources.

With Qwen3-VL, Kestrel achieves the best performance on MS-COCO across all splits. On A-OKVQA, Kestrel reaches 91.73% (Pop.) and 86.76% (Adv.), surpassing decoding-based baselines such as OPERA (89.50%/83.86%) and VCD (87.86%/82.00%). On GQA, Kestrel further improves to 90.33% (Pop.) and 86.27% (Adv.), outperforming OPERA (87.11%/83.30%) and VCD (85.76%/81.93%) by a clear margin. With InternVL3.5, we observe a similar trend, with Kestrel delivering the largest improvements over the LVLM backbone and the strongest overall performance.

Compared to decoding-centric methods that reshape token distributions at inference time (e.g., VCD [17] and OPERA [12]) and post-hoc correction pipelines such as Woodpecker [33], Kestrel externalizes grounding into explicit evidence and performs claim-level verification with conservative, evidence-gated updates. This design better preserves correctness while reducing risky corrections under stronger prior interference.
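The early-stopping criterion in the implementation details above (terminate once two consecutive top-level verdicts are supported) reduces to a simple check over the verification history. A minimal sketch, with the function name chosen for illustration:

```python
def should_stop_early(verdict_history: list, window: int = 2) -> bool:
    """Stop refining once the last `window` top-level verification
    verdicts are all 'supported', i.e., the answer has stabilized."""
    if len(verdict_history) < window:
        return False
    return all(v == "supported" for v in verdict_history[-window:])
```

A single supported verdict is not enough to terminate; requiring two in a row guards against a lucky pass on noisy evidence.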
Fig. 4 further shows that the gains on POPE mainly come from correcting initially incorrect predictions, while most correct predictions are preserved after refinement. The relatively low rate of over-correction suggests that our conservative gating mechanism effectively suppresses unnecessary revisions, enabling the model to mitigate hallucinations without sacrificing prediction stability.

Figure 4. Prediction transition statistics with Qwen3-VL before and after refinement. Predictions are categorized into four types: correctly preserved, error corrected, over-corrected, and incorrectly preserved. The results show that the refinement process is conservative, retaining most originally correct predictions while correcting a portion of erroneous ones, with limited over-correction. Zoom in for a better view.

Results on MME-Hallucination. Tab. 2 reports MME-Hallucination results decomposed into object-level (Existence, Count) and attribute-level (Position, Color) subsets. Overall, Kestrel achieves the best MME Score under Qwen3-VL, reaching 760.00 (+28.34), outperforming baselines such as OPERA (743.33, +11.67) and VCD/RITUAL (736.66, +5.00). These results suggest that Kestrel's evidence-driven, claim-level verification and conservative update rule particularly strengthen object existence/count and spatial reasoning, yielding more balanced improvements across hallucination-sensitive attributes. We also observe slightly larger variance for Kestrel on some subsets, which is expected when grounding and verification are performed with stochastic components; nevertheless, the mean improvement remains consistent and substantial over all compared methods. Additional results are provided in the supplementary material.

Table 2. Results on the MME-Hallucination [9] benchmark. Higher scores (↑) indicate better performance. The best results are bolded, and the second-best are underlined.

Backbone: Qwen3-VL [3]
Method             | Existence ↑    | Count ↑         | Position ↑      | Color ↑        | MME Score ↑
Baseline [3]       | 195.00 (+0.00) | 175.00 (+0.00)  | 168.33 (+0.00)  | 193.33 (+0.00) | 731.66 (+0.00)
Qwen3-VL agent [3] | 200.00 (+5.00) | 181.67 (+6.67)  | 168.33 (+0.00)  | 193.33 (+0.00) | 743.33 (+11.67)
VCD [17]           | 195.00 (+0.00) | 180.00 (+5.00)  | 168.33 (+6.73)  | 193.33 (+0.00) | 736.66 (+5.00)
Woodpecker [33]    | 195.00 (+0.00) | 173.33 (+1.67)  | 168.33 (+0.00)  | 195.00 (+1.67) | 731.66 (+10.00)
RITUAL [32]        | 195.00 (+0.00) | 180.00 (+5.00)  | 168.33 (+0.00)  | 193.33 (+0.00) | 736.66 (+5.00)
OPERA [12]         | 195.00 (+0.00) | 180.00 (+5.00)  | 168.33 (+0.00)  | 200.00 (+6.67) | 743.33 (+11.67)
DeGF [35]          | 195.00 (+0.00) | 181.67 (+6.67)  | 166.67 (+1.66)  | 188.33 (+5.00) | 732.67 (+1.01)
Kestrel (ours)     | 200.00 (+5.00) | 186.67 (+11.67) | 180.00 (+11.67) | 193.33 (+0.00) | 760.00 (+28.34)

Backbone: InternVL3.5 [31]
Baseline [31]      | 200.00 (+0.00) | 175.00 (+0.00)  | 175.00 (+0.00)  | 193.33 (+0.00) | 743.33 (+0.00)
VCD [17]           | 200.00 (+0.00) | 175.00 (+0.00)  | 175.00 (+0.00)  | 193.33 (+0.00) | 736.66 (−6.67)
Woodpecker [33]    | 200.00 (+0.00) | 166.67 (−8.33)  | 161.67 (−13.33) | 186.67 (−6.66) | 715.01 (−28.32)
RITUAL [32]        | 195.00 (−5.00) | 175.00 (+0.00)  | 175.00 (+0.00)  | 193.33 (+0.00) | 738.33 (−5.00)
OPERA [12]         | 195.00 (−5.00) | 173.33 (−1.67)  | 175.00 (+0.00)  | 195.00 (+1.67) | 738.33 (−5.00)
DeGF [35]          | 195.00 (−5.00) | 175.00 (+0.00)  | 168.33 (−6.67)  | 188.33 (−5.00) | 726.66 (−16.67)
Kestrel (ours)     | 200.00 (+0.00) | 186.67 (+11.67) | 181.67 (+6.67)  | 195.00 (+1.67) | 763.34 (+20.01)

Tab. 2 also presents the results of Kestrel with the InternVL3.5 backbone [31] on MME-Hallucination.
Notably, Kestrel still achieves further improvement over such a strong backbone, showing that our framework remains effective even when the base model already performs at a high level. This result is particularly meaningful, as reducing hallucinations becomes increasingly difficult as the backbone grows stronger. Moreover, among the methods compared in Tab. 2, Kestrel is the only one that consistently improves InternVL3.5, further highlighting the advantage of our grounded, evidence-based refinement framework.

Figure 5. Qualitative results of Kestrel. We compare the VQA responses from the regular baseline and our method based on Qwen3-VL. Zoom in for a better view.

Qualitative Analysis. In Fig. 5, we present results of hallucination mitigation across existence, count, color, and position. The initial LVLM answers are inconsistent with the image content but are corrected by Kestrel through external evidence from the grounding agent and multi-round self-refinement. The visual evidence provides explicit support for target localization and attribute verification, which helps recover missed objects, disambiguate object counts, reject unsupported color predictions, and correct erroneous spatial relations. Based on these grounded cues, the iterative refinement procedure progressively updates the answer through claim-level verification, yielding more reliable corrections than direct one-step revision. These results demonstrate that explicit grounding and conservative iterative refinement work together to effectively reduce hallucinations across diverse perception scenarios.

4.3. Efficiency

Table 3. Efficiency comparison. We report the average inference latency per instance, peak GPU memory, and POPE accuracy on MS-COCO. Experiments are conducted on a single NVIDIA A100 GPU. Lower latency/memory is better, while higher POPE accuracy is better.

Method                  | Average Latency ↓ | GPU Memory ↓     | Checked Cases | POPE (MS-COCO) ↑
Qwen3-VL [3]            | 0.78 s (×1.00)    | 17428 MB (×1.00) | 9000          | 87.37
Kestrel (1st iteration) | 12.00 s (×15.38)  | 21472 MB (×1.23) | 9000          | 89.28
Kestrel (2nd iteration) | 8.54 s (×10.94)   | 21296 MB (×1.22) | 4978          | 89.45
Kestrel (3rd iteration) | 7.12 s (×9.12)    | 21274 MB (×1.22) | 477           | 89.34
Kestrel                 | 18.75 s (×24.03)  | 21472 MB (×1.23) | 9000          | 89.45

Tab. 3 reports the efficiency of Qwen3-VL [3] and Kestrel in terms of average inference latency, peak GPU memory, checked cases, and POPE accuracy on MS-COCO [21]. As expected, Kestrel incurs substantially higher end-to-end latency than single-pass LVLM inference, due to the additional overhead introduced by external grounding, claim-level verification, and iterative self-refinement. Importantly, the per-round statistics are computed over different numbers of checked cases because of early stopping: the first iteration is applied to all 9000 cases, whereas only 4978 and 477 cases proceed to the second and third iterations, respectively. Consequently, later iterations are evaluated on progressively smaller subsets. Since many easy cases are resolved early, the remaining instances typically involve fewer unresolved claims and require less additional evidence, resulting in lower average latency and slightly reduced memory usage in subsequent iterations. This also explains why the best POPE accuracy is already achieved after the second iteration, while the third iteration only yields limited changes on a much smaller subset.
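For reference, the "×" multipliers in Tab. 3 are each latency divided by the 0.78 s single-pass baseline, apparently truncated (not rounded) to two decimals; the arithmetic can be checked directly (the helper name is ours):

```python
import math

def relative_cost(latency_s: float, baseline_s: float = 0.78) -> float:
    """Latency multiplier over the single-pass baseline, truncated to
    two decimal places, which matches the factors reported in Tab. 3."""
    return math.floor(latency_s / baseline_s * 100) / 100

# Latencies from Tab. 3: per-iteration and end-to-end
assert relative_cost(12.00) == 15.38  # 1st iteration
assert relative_cost(8.54) == 10.94   # 2nd iteration
assert relative_cost(7.12) == 9.12    # 3rd iteration
assert relative_cost(18.75) == 24.03  # full pipeline
```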
Based on this trade-off, we set the maximum number of refinement iterations to 3, which provides a practical balance between performance and efficiency.

4.4. Human Study

To complement the automatic benchmark results, we conduct a human preference study to evaluate whether Kestrel produces responses that are more reliable, more interpretable, and better aligned with human preference under evidence-grounded judgment.

Figure 6. Human preference comparison across methods. Bars show the fraction of trials in which each method is preferred (n=60). Error bars denote 95% Wilson confidence intervals.

Setup. We sample 60 evaluation cases and compare five training-free mitigation methods: Kestrel, DeGF, Woodpecker, RITUAL, and VCD, which are also included in our main experiments. For each case, annotators are presented with the candidate outputs in randomized order and asked to choose the response they prefer based on grounding quality, factual consistency with the image, and overall answer reliability. We report the human preference rate of each method.

Results. As shown in Fig. 6, Kestrel is preferred in 41/60 cases (68.3%), substantially outperforming DeGF (13.3%), Woodpecker (11.7%), RITUAL (6.7%), and VCD (0.0%). This result indicates that Kestrel's outputs are more consistent with human judgments of groundedness and answer quality. We attribute this advantage to Kestrel's explicit evidence-backed verification process. Instead of relying only on decoding-time intervention, Kestrel externalizes grounding into structured visual and textual evidence, verifies question-aligned claims against cited evidence, and updates the answer only when the evidence is sufficiently reliable.

4.5. Ablation Study

Designs. Tab. 4 reports the ablation results of Kestrel on POPE-COCO.
Table 4. Ablation study of core design components. We evaluate different settings by progressively enabling the key components in Kestrel. Higher scores indicate better performance.

Setting  | Grounding Agent | Structured Textual Evidence | Claim-level Verification | Evidence-gated Update | Self-Refinement | History-Aware | POPE MS-COCO
baseline | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | 87.37
s1       | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | 87.44
s2       | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | 87.50
s3       | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ | 87.58
s4       | ✓ | ✗ | ✓ | ✓ | ✓ | ✓ | 88.61
s5       | ✓ | ✓ | ✗ | ✓ | ✓ | ✓ | 89.05
s6       | ✓ | ✓ | ✓ | ✗ | ✓ | ✓ | 89.30
s7       | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | 89.16
s8       | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | 89.32
s9       | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 89.45

Table 5. Ablation of visual evidence types. Contribution of different visual evidence forms, including segmentation, bounding boxes, and crop-zoom views. Higher scores indicate better performance.

Setting  | segment | box | crop-zoom | MME Score
baseline | –       | –   | –         | 731.66
s1       | ✗       | ✓   | ✓         | 738.33
s2       | ✓       | ✗   | ✓         | 743.33
s3       | ✓       | ✓   | ✗         | 740.00
s4       | ✓       | ✓   | ✓         | 760.00

Table 6. Ablation of structured textual evidence. We study the contribution of different textual evidence types used for claim verification. Higher scores indicate better performance.

Setting | exist | count | color | position | MME Score
s1      | ✗     | ✓     | ✓     | ✓        | 755.00
s2      | ✓     | ✗     | ✓     | ✓        | 751.67
s3      | ✓     | ✓     | ✗     | ✓        | 753.33
s4      | ✓     | ✓     | ✓     | ✗        | 748.33
s5      | ✓     | ✓     | ✓     | ✓        | 760.00

Naive self-refinement and the grounding agent alone bring only marginal improvements over the baseline, indicating that neither iterative revision nor external tool access is sufficient by itself. Introducing verification-guided refinement further improves performance, while incorporating the grounding agent yields a larger gain, highlighting the importance of explicit grounded evidence. Adding structured textual evidence and claim-level verification leads to further improvements, showing that normalized evidence construction and fine-grained verification are both beneficial. We also observe that removing the evidence-gated update degrades performance, confirming the role of conservative update control in preventing risky corrections.
Finally, self-refinement and stateful (history-aware) refinement provide additional gains on top of the full evidence-verification pipeline, and the full Kestrel achieves the best result overall.

Evidence. The evidence ablations (Tab. 5 and Tab. 6) further highlight the complementary roles of visual and textual evidence in Kestrel. For visual evidence, removing any single component consistently degrades performance relative to the full setting, showing that segmentation overlays (segment), bounding boxes (box), and crop-zoom views (crop-zoom) all contribute to the final result. Among them, removing segmentation or crop-zoom causes a larger drop than removing boxes, suggesting that precise region localization and enlarged local inspection are particularly important for reliable claim verification. For structured textual evidence, ablating any evidence type also leads to performance degradation, confirming that existence, count, color, and position evidence all provide useful support for refinement. Notably, removing position or count evidence causes a relatively larger drop, indicating that these attributes benefit more from explicit structured evidence. Overall, the best performance is achieved when all visual and textual evidence types are used together, demonstrating that the different evidence forms are complementary rather than redundant.

5. Conclusion

In this paper, we presented Kestrel, a training-free framework for LVLM hallucination mitigation that integrates explicit visual grounding with structured evidence-driven self-refinement. By converting external tool outputs into citable visual and textual evidence, Kestrel enables claim-level verification and conservative evidence-gated answer updates, providing both improved reliability and transparent verification traces.
Extensive experiments on POPE and MME-Hallucination show that Kestrel consistently improves over strong training-free baselines across multiple backbones, with particularly clear gains under more challenging hallucination settings. Further ablations validate the contributions of each design. Overall, Kestrel demonstrates that coupling explicit grounding with interpretable iterative verification offers an effective and stable path toward more trustworthy LVLMs.

References

[1] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
[2] Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. OpenFlamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint, 2023.
[3] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025.
[4] Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Liliane Momeni, Rishi Hazra, Shuangrui Ding, Sagar Vaze, Francois Porcher, Feng Li, Siyuan Li, Aishwarya Kamath, Ho Kei Cheng, Piotr Dollár, Nikhila Ravi, Kate Saenko, Pengchuan Zhang, and Christoph Feichtenhofer. SAM 3: Segment anything with concepts, 2025.
[5] Guiming Hardy Chen, Shunian Chen, Ruifei Zhang, Junying Chen, Xiangbo Wu, Zhiyi Zhang, Zhihong Chen, Jianquan Li, Xiang Wan, and Benyou Wang. ALLaVA: Harnessing GPT4V-synthesized data for lite vision-language models, 2024.
[6] Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. ShareGPT4V: Improving large multi-modal models with better captions, 2023.
[7] Zhiyang Chen, Yousong Zhu, Yufei Zhan, Zhaowen Li, Chaoyang Zhao, Jinqiao Wang, and Ming Tang. Mitigating hallucination in visual language models with visual supervision. arXiv preprint arXiv:2311.16479, 2023.
[8] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. Advances in Neural Information Processing Systems, 36:49250–49267, 2023.
[9] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.
[10] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
[11] Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. HallusionBench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14375–14385, 2024.
[12] Qidong Huang, Xiaoyi Dong, Pan Zhang, Bin Wang, Conghui He, Jiaqi Wang, Dahua Lin, Weiming Zhang, and Nenghai Yu.
OPERA: Alleviating hallucination in multimodal large language models via over-trust penalty and retrospection-allocation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13418–13427, 2024.
[13] Yiyang Huang, Liang Shi, Yitian Zhang, Yi Xu, and Yun Fu. SHIELD: Suppressing hallucinations in LVLM encoders via bias and vulnerability defense. arXiv preprint arXiv:2510.16596, 2025.
[14] Drew A Hudson and Christopher D Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6700–6709, 2019.
[15] Chaoya Jiang, Haiyang Xu, Mengfan Dong, Jiaxing Chen, Wei Ye, Ming Yan, Qinghao Ye, Ji Zhang, Fei Huang, and Shikun Zhang. Hallucination augmented contrastive learning for multimodal large language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27036–27046, 2024.
[16] Prannay Kaul, Zhizhong Li, Hao Yang, Yonatan Dukler, Ashwin Swaminathan, CJ Taylor, and Stefano Soatto. THRONE: An object-based hallucination benchmark for the free-form generations of large vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27228–27238, 2024.
[17] Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, and Lidong Bing. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. arXiv preprint arXiv:2311.16922, 2023.
[18] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pages 19730–19742. PMLR, 2023.
[19] Lei Li, Zhihui Xie, Mukai Li, Shunian Chen, Peiyi Wang, Liang Chen, Yazheng Yang, Benyou Wang, and Lingpeng Kong. Silkie: Preference distillation for large visual language models. arXiv preprint arXiv:2312.10665, 2023.
[20] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023.
[21] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[22] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.
[23] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024.
[24] Xinyu Lyu, Beitao Chen, Lianli Gao, Hengtao Shen, and Jingkuan Song. Alleviating hallucinations in large vision-language models through hallucination-induced optimization. Advances in Neural Information Processing Systems, 37:122811–122832, 2024.
[25] Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023.
[26] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[27] Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-OKVQA: A benchmark for visual question answering using world knowledge. In European Conference on Computer Vision, pages 146–162. Springer, 2022.
[28] Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liangyan Gui, Yu-Xiong Wang, Yiming Yang, et al. Aligning large multimodal models with factually augmented RLHF. In Findings of the Association for Computational Linguistics: ACL 2024, pages 13088–13110, 2024.
[29] Junyang Wang, Yuhang Wang, Guohai Xu, Jing Zhang, Yukai Gu, Haitao Jia, Jiaqi Wang, Haiyang Xu, Ming Yan, Ji Zhang, et al. AMBER: An LLM-free multi-dimensional benchmark for MLLM hallucination evaluation. arXiv preprint arXiv:2311.07397, 2023.
[30] Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Song XiXuan, et al. CogVLM: Visual expert for pretrained language models. Advances in Neural Information Processing Systems, 37:121475–121499, 2024.
[31] Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025.
[32] Sangmin Woo, Jaehyuk Jang, Donguk Kim, Yubin Choi, and Changick Kim. RITUAL: Random image transformations as a universal anti-hallucination lever in large vision language models. arXiv preprint, 2024.
[33] Shukang Yin, Chaoyou Fu, Sirui Zhao, Tong Xu, Hao Wang, Dianbo Sui, Yunhang Shen, Ke Li, Xing Sun, and Enhong Chen. Woodpecker: Hallucination correction for multimodal large language models. Science China Information Sciences, 67(12):220105, 2024.
[34] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023.
[35] Ce Zhang, Zifu Wan, Zhehan Kan, Martin Q Ma, Simon Stepputtis, Deva Ramanan, Russ Salakhutdinov, Louis-Philippe Morency, Katia Sycara, and Yaqi Xie. Self-correcting decoding with generative feedback for mitigating hallucinations in large vision-language models. arXiv preprint arXiv:2502.06130, 2025.
[36] Jinrui Zhang, Teng Wang, Haigang Zhang, Ping Lu, and Feng Zheng. Reflective instruction tuning: Mitigating hallucinations in large vision-language models. In European Conference on Computer Vision, pages 196–213. Springer, 2024.

Technical Appendices

A. Prompt Template

We provide the prompt templates used in the three core stages of Kestrel: initialization (Fig. 7), claim-level verification (Fig. 8), and self-refinement (Fig. 9). These prompts follow the main evidence-grounded iterative framework, in which Kestrel first rewrites the original question into a concrete and visually verifiable claim, then verifies the claim against the collected evidence, and finally updates the answer conservatively through refinement. To improve reproducibility, all prompts adopt constrained JSON outputs with explicit field definitions. The initialization prompt produces a question-aligned claim for grounding, the verification prompt returns evidence-based verdicts and confidence scores, and the self-refinement prompt updates the answer while proposing a new claim for the next round.

B. More Case Studies

Fig. 10 and Fig. 11 present representative examples of Kestrel's evidence-grounded self-refinement process.
In each case, Qwen3-VL [3] and InternVL3.5 [31] initially produce an incorrect answer based on their first-pass visual impression, while Kestrel revises the prediction after introducing an explicit claim and verifying it against grounded evidence. The examples cover several queries, showing that the proposed framework is effective across different claim types. These cases highlight two key properties of Kestrel. First, the method can correct errors even when the initial response is confidently wrong, by converting free-form reasoning into claim-level verification. Second, the refinement is interpretable: each answer update is supported by explicit visual evidence, such as segmentation overlays, cropped regions, or structured text evidence describing existence or position. Overall, the qualitative results show that Kestrel improves faithfulness by grounding the refinement process in verifiable visual observations, rather than relying on unconstrained self-correction.

Prompt template for Initialization:

  messages = [{"role": "user", "content": [
    {"type": "image", "image": sample["image"]},
    {"type": "text", "text":
      You are given an image and a Yes/No question. Determine the answer and output one verifiable claim.
      Task requirements:
      - answer must be exactly "Yes" or "No".
      - output exactly one claim in verifiable_claims.
      - claim fields: id, type, text, targets.
      - type must be exactly "{sample["expected_claim_type"]}".
      - text must be concrete and visually checkable.
      Target rules:
      - if type = position: targets contain 1–2 short object phrases; for a two-object relation, use [subject, anchor].
      - otherwise: targets contain exactly one object noun from the question; do not include color, number, or position words.
      Return JSON only:
      {"answer": "Yes|No",
       "verifiable_claims": [{"id": "c1", "type": "{sample["expected_claim_type"]}", "text": "...", "targets": sample["example_targets"]}]}
      Question: "{sample["question"]}"
      PrevSummary (optional): "{sample["prev_summary"]}"
      <image>
    }]}]

Yes-Guard template (optional, only when the initial answer is "Yes"):

  messages = [{"role": "user", "content": [
    {"type": "image", "image": sample["image"]},
    {"type": "text", "text":
      Check whether the Yes/No question is true in this image.
      Return JSON only:
      {"answer": "yes|no|unclear", "confidence": "high|medium|low", "reason": "one short sentence"}
      Rules:
      - use image evidence only;
      - yes: clearly true; no: clearly false; otherwise unclear;
      - be conservative when evidence is weak.
      Question: "{sample["question"]}"
      Target hint: "{sample["target_hint"]}"
      <image>
    }]}]

Figure 7. Prompt template for the initialization stage.

Prompt template for claim-level verification:

  messages = [{"role": "user", "content": [
    sample["evidence_content"],
    {"type": "text", "text":
      You are a strict verifier. Judge each claim using ONLY the provided evidence items.
      How to use evidence:
      - Each item has an EvidenceID and a Type.
      - Use only EvidenceID values that appear in the given evidence.
      - Typical IDs:
        - seg_overlay: e_seg_{tkey}
        - count_text: e_count_{tkey}
        - count_compare_text: e_countcmp_{tkey}
        - count_vision_text: e_countvis_{tkey}
        - count_vision_compare_text: e_countviscmp_{tkey}
        - crop_zoom: e_crop_{tkey}
        - color_text: e_color_{tkey}
        - position_text: e_pos_{tkey}
        - position_relation_text: e_posrel_{claim_id}
      Judging rules:
      - For each claim, choose exactly one status: supported | contradicted | insufficient.
      - supported: evidence clearly confirms the claim.
      - contradicted: evidence clearly refutes the claim.
      - insufficient: evidence is missing or ambiguous.
      - Do NOT use common sense.
      Top-level verdict:
      - contradicted: at least one claim is strongly contradicted.
      - supported: all claims are strongly supported.
      - otherwise: insufficient.
      Return JSON only:
      {"verdict": "supported|contradicted|insufficient",
       "checked": [{"claim_id": "...", "status": "supported|contradicted|insufficient", "confidence": 0.0, "why": "...", "citations": ["EvidenceID1", "EvidenceID2"]}]}
      Input:
      Question: "{sample["question"]}"
      Claims: sample["claims_json"]
      PrevVerdict (optional): sample["prev_verdict_json"]
    }]}]

Figure 8. Prompt template for the claim-level verification stage.

C. Limitations and Future Work

Kestrel introduces a multi-stage, iterative inference pipeline that includes claim decomposition, external grounding, evidence construction, claim-level verification, and self-refinement.
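The stages enumerated above can be summarized as a single refinement loop with early stopping. The sketch below is our schematic rendering of that loop; the class and method names, and the exact stopping rule, are illustrative assumptions rather than the released implementation.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    status: str  # 'supported' | 'contradicted' | 'insufficient'

def kestrel_refine(image, question, model, grounder, max_rounds=3):
    """Schematic Kestrel loop: claim -> grounding -> evidence -> verify -> refine."""
    answer, claim = model.initialize(image, question)  # answer plus verifiable claim
    history = []
    for _ in range(max_rounds):
        visual = grounder.ground(image, claim)         # masks, boxes, crop-zoom views
        textual = grounder.structure(visual)           # exist/count/color/position facts
        verdict = model.verify(claim, visual, textual) # claim-level verification
        if verdict.status == "supported":
            break                                      # early stop: claim is confirmed
        answer, claim = model.refine(question, answer, verdict, history)
        history.append((claim, verdict))               # history-aware refinement
    return answer
```

With `max_rounds=3`, the loop matches the iteration cap chosen in the efficiency trade-off discussed in Sec. 4; the early stop on a supported verdict is what keeps the average number of rounds low in practice.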
Prompt template for Self-Refine:

  messages = [{"role": "user", "content": [
    {"type": "image", "image": sample["image"]},
    {"type": "text", "text":
      You are a cautious VQA assistant. Decide the final answer as a binary choice.
      HARD REQUIREMENT:
      - Answer must be exactly "Yes" or "No" (capitalized).
      - No punctuation, no explanation, no extra tokens.
      - You must output a concrete answer.
      Use all available history:
      - previous rounds: hypothesis / verify / refine;
      - current round: hypothesis and verify details.
      Also output one new claim for the next round.
      Claim constraints:
      - RELEVANCE: every new claim must directly verify the Question's Yes/No.
      - new_claim.type must be exactly "{sample["expected_claim_type"]}".
      - claim.text must mention at least one key entity / attribute / relation from the Question.
      Type-specific target rules:
      - if type = position:
        - targets contain 1–2 short object phrases from the Question;
        - if a two-object relation exists, use [subject, anchor];
        - keep distinguishing modifiers when needed.
      - otherwise:
        - targets contain exactly one object noun from the Question;
        - do not include color / number / position / quantifier words.
      - if a claim is not clearly usable to support or refute the Question, it is invalid and must not be output.
      Return JSON only:
      {"new_claims": [{"id": "c1", "type": "{sample["expected_claim_type"]}", "text": "...", "targets": sample["example_targets"], "priority": 1}],
       "Answer": "Yes|No"}
      Question: "{sample["question"]}"
      PreviousAnswer: "{sample["A_t"]}"
      RoundHistory: sample["round_history_json"]
      CurrentRoundContext: sample["current_round_context_json"]
      <image>
    }]}]

Figure 9. Prompt template for the self-refinement stage.

Although early stopping significantly reduces the number of refinement rounds in practice, the overall inference latency remains substantially higher than that of single-pass LVLM inference due to the additional tool calls and verification steps.
This overhead may limit the applicability of the framework in latency-sensitive or large-scale deployment scenarios. In future work, we plan to explore more efficient strategies for evidence-driven verification, such as adaptive tool invocation based on uncertainty signals, prioritizing high-risk claims for grounding, and reusing intermediate evidence across iterations. Such improvements could substantially reduce the computational overhead while preserving the benefits of explicit grounding and iterative verification.

Figure 10. Qualitative results of Kestrel. We compare the VQA responses from the regular baseline and our method based on Qwen3-VL. Zoom in for a better view.

Figure 11. Qualitative results of Kestrel. We compare the VQA responses from the regular baseline and our method based on InternVL3.5. Zoom in for a better view.