Locate-then-Sparsify: Attribution Guided Sparse Strategy for Visual Hallucination Mitigation



Tiantian Dang 1,2, Chao Bi 2*, Shufan Shen 2, Jinzhe Liu 2, Qingming Huang 2,3, Shuhui Wang 1,2*
1 School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences
2 State Key Lab. of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences
3 School of Computer Science and Technology, University of Chinese Academy of Sciences
{dangtiantian23, qmhuang}@ucas.ac.cn  {bichao, shenshufan22z, liujinzhe23b, wangshuhui}@ict.ac.cn

Abstract

Despite the significant advancements in Large Vision-Language Models (LVLMs), their tendency to generate hallucinations undermines reliability and restricts broader practical deployment. Among hallucination mitigation methods, feature steering emerges as a promising approach that reduces erroneous outputs in LVLMs without increasing inference costs. However, current methods apply uniform feature steering across all layers. This heuristic strategy ignores inter-layer differences, potentially disrupting layers unrelated to hallucinations and ultimately leading to performance degradation on general tasks. In this paper, we propose a plug-and-play framework called Locate-Then-Sparsify for Feature Steering (LTS-FS), which controls the steering intensity according to the hallucination relevance of each layer. We first construct a synthetic dataset comprising token-level and sentence-level hallucination cases. Based on this dataset, we introduce an attribution method based on causal interventions to quantify the hallucination relevance of each layer. With the attribution scores across layers, we propose a layerwise strategy that converts these scores into feature steering intensities for individual layers, enabling more precise adjustments specifically on hallucination-relevant layers.
Extensive experiments across multiple LVLMs and benchmarks demonstrate that our LTS-FS framework effectively mitigates hallucination while preserving strong performance.

* Corresponding author.

1. Introduction

Figure 1. (a) t-SNE visualizations of features in LVLM layers. (b) Performance on CHAIR and MMMU benchmarks. Current methods (e.g., Nullu [42]) mitigate hallucinations by uniformly steering features across layers, which (a) alters feature distributions and (b) leads to degraded performance on general tasks like MMMU. In contrast, we propose a layerwise steering framework, LTS-FS, which mitigates hallucinations more effectively (e.g., on CHAIR) while minimally perturbing the feature distributions, thus preserving more generalization ability.

By harnessing the advanced text generation capabilities of Large Language Models, Large Vision-Language Models (LVLMs) have achieved impressive performance across various multimodal tasks [1, 26, 40, 44]. Despite their strong performance, LVLMs face a significant challenge known as hallucination, wherein the model generates fluent and semantically coherent responses that include factually incorrect statements about the input visual content [12, 21, 44]. Such hallucinations hinder the reliability of LVLMs, posing serious risks in real-world applications [17, 34].

To mitigate hallucinations in LVLMs, early studies fine-tune the whole model on specially designed datasets, which is costly and degrades generalization ability [9, 25, 39]. In contrast, decoding-based methods introduce strategies such as contrastive decoding [2, 19] and self-correction [5, 45] to mitigate hallucinations in a training-free manner, thereby preserving the original capabilities of pre-trained models. Nevertheless, these methods significantly increase the number of decoding steps required for each input query, leading to high inference costs in real-world deployment.
Recently, feature steering methods [31, 42] have shown promise in overcoming the above limitations. These methods adjust the features of intermediate layers by steering them from their original positions in the feature space toward directions that are less prone to generating hallucinated outputs. By modifying only the features without introducing additional decoding steps, feature steering methods maintain inference costs comparable to those of the original model. However, current methods steer features based on heuristically designed rules [31] (e.g., adjusting all layers). These rules overlook the inherent differences across layers in pre-trained models, causing the steering process to disturb layers that are less relevant to hallucinations. The disruption alters the distributions of features (Fig. 1a) and ultimately impairs the model's generalization ability (Fig. 1b), similar to the tuning-based methods. Therefore, an upgraded method that mitigates hallucinations through feature steering while preserving the original capabilities of LVLMs is urgently required.

In this paper, we propose Locate-Then-Sparsify for Feature Steering (LTS-FS), a plug-and-play framework that effectively mitigates hallucinations while preserving the inherent capabilities of LVLMs. First, we construct a dataset including hallucination samples at two granularities. With this dataset, we locate the hallucination-relevant layers through intervention-based attribution. Guided by the attribution scores, we propose a layerwise strategy that selectively steers features in hallucination-relevant layers rather than uniformly adjusting all layers. As shown in Fig. 1, compared with Nullu, a classical feature-steering-based method, our strategy barely disrupts the original feature distribution.
Meanwhile, the evaluation results on the MMMU benchmark demonstrate that our LTS-FS not only produces fewer hallucinatory expressions but also achieves better generalization performance.

Specifically, for dataset construction, we first distinguish hallucinations in LVLMs according to token-level and sentence-level granularities. Then, we construct hallucination samples at both granularity levels to build a dataset. Supported by this dataset, we locate hallucination-relevant layers through an attribution method based on causal interventions. This method sequentially masks the attention output of each layer to assess its contribution to the logits of hallucination outputs. Based on this contribution, we define attribution scores and assign them to each layer, reflecting its relevance to the hallucination phenomenon. After obtaining layer-wise attribution scores, we propose a sparsity-aware layer selection and steering strategy that converts the attribution scores into steering intensities (i.e., applying weaker steering to layers with low scores and stronger steering to those with high scores). By modifying only hallucination-relevant layers, we mitigate hallucinations while minimizing interference with the model's feature distribution, thereby more effectively preserving its original capabilities. We conduct extensive experiments to demonstrate that LTS-FS can further improve the hallucination mitigation capacity of current SOTA feature steering methods (e.g., a 2% accuracy gain on POPE-popular with Qwen-VL-2.5-7B) while preserving more of the generalization capability of LVLMs (e.g., increasing detailedness from 4.72 to 4.92 under GPT-4V-aided evaluation on LLaVA-Bench). Code is available at https://github.com/huttersadan/LTS-FS.

Our contributions are summarized as follows:
• We introduce a granularity-based hallucination categorization and construct a synthetic dataset to correlate model components with hallucinations.
• We employ an intervention-based attribution method to locate hallucination-relevant layers by quantifying their contributions to hallucination outputs.
• We propose a layerwise strategy that selectively adjusts steering intensity, achieving SOTA hallucination mitigation results while preserving generalization ability.

2. Related Work

2.1. Hallucinations in LVLMs

Hallucinations have been extensively studied in the artificial intelligence community, and many studies aim to reduce their impact [13, 16, 29, 48]. Hallucinations in LVLMs refer more specifically to the mismatch between visual and textual content. Most methods are based on self-correction [5, 45], instruction tuning [9, 25, 39], or decoding enhancement [2, 14, 19]. Typically, Yin et al. [45] refined textual responses while correcting hallucinations. Liu et al. [25] composed negative instances to refrain from over-confidence. Huang et al. [14] penalized specific tokens during decoding, which suppresses the formation of hallucinations. These methods generally require a large amount of manually labeled data and computing resources, or suffer from longer inference times. To avoid these limitations, recent studies have proposed feature steering methods [20, 31, 42]. Yang et al. [42] projected generated captions into a dedicated space to suppress hallucinated entities. Liu et al. [31] proposed an intervention-based approach, steering the latent representations during inference with a pre-computed "anti-hallucination" direction. However, directly adjusting weights or features may hinder internal knowledge and reduce generalization ability. To overcome these limitations, our method identifies hallucination-relevant layers and selectively adjusts features within them, thereby better preserving the internal knowledge of LVLMs.

2.2.
Parameter Localization

Parameter localization, a technique that identifies parameters correlated with specific datasets, offers flexible and effective solutions for downstream tasks such as model fine-tuning [36], knowledge editing [30], and model compression [38]. According to localization granularity, existing methods can be categorized into weight-level [11] and layer-level [7] paradigms. For the weight-level paradigm, current methods design specific rules based on activations [11], redundancy [37], second derivatives [8], and energy efficiency [43] to locate data-relevant weights. For the layer-level paradigm, GRIFFIN [7] selects layers based on their high activation magnitudes in response to input prompts. FLAP [3] computes the sample variance of each input feature as an importance measure and locates layers accordingly. RL-Pruner [38] determines the layer-wise importance distribution through reinforcement learning. Unlike the above methods designed for pruning or adjusting model parameters, we employ a layer-level strategy to locate the layers relevant to the hallucination phenomenon in LVLMs. The localization results effectively support the feature steering process to mitigate hallucinations.

2.3. Sparse Adjustments for Pre-trained Models

To enhance model capability in a specific domain while minimizing unintended disruptions to overall model behavior, researchers have proposed sparse adjustment methods [18, 23, 30, 32] that selectively modify a subset of model components. NMKE [30] sparsely updates hidden neurons to edit the internal knowledge in LLMs. Jia et al. [18] develop a sparsity-aware method for model unlearning. BNS [32] selectively suppresses neuron activations to mitigate social bias in pre-trained language models. Their sparse selection strategies are typically neuron-wise and designed for specific parameter adjustment methods [30].
In contrast, we propose a layer-wise sparse selection strategy to enhance the feature steering paradigm for hallucination mitigation. This strategy is decoupled from any particular steering method, delivering consistently improved performance across different steering methods.

3. Method

In this section, we first construct the bi-granularity hallucination dataset (Sec. 3.1). Based on this dataset, we introduce causal attribution to locate hallucination-relevant layers (Sec. 3.2) and employ a layerwise sparse selection scheme to mitigate hallucination while maintaining the generalization ability of LVLMs (Sec. 3.3).

Figure 2. Hallucination examples at token level and sentence level.

3.1. Bi-granularity Dataset Construction

To locate hallucination-relevant layers, we build a bi-granularity dataset by constructing hallucination samples at the token level and sentence level according to text length. Specifically, for single-sentence texts, hallucinations can be annotated at the token level based on existing hallucination benchmarks [21, 41]. However, for multi-sentence texts, token-level annotation is insufficient. As the length of generated text increases, the model's behavior evolves from producing isolated hallucinatory tokens to generating entire hallucinatory sentences (i.e., removing them can significantly enhance factuality with minimal impact on generation quality) [14]. Therefore, we categorize these samples at the sentence level for comprehensive localization of hallucination-relevant layers.

The bi-granularity hallucination samples are constructed from current hallucination benchmarks: CHAIR [33], POPE [22], and Antidote [41]. Token-level samples are typically constructed from prompts phrased as wh-questions or yes/no questions; the POPE and Antidote benchmarks contain such questions. Moreover, hallucination tokens can be identified by rule-based methods.
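As a concrete illustration of such a rule-based check, the minimal sketch below flags tokens that name objects absent from an image's ground-truth object set, in the spirit of CHAIR-style matching. The object vocabulary and synonym map are hypothetical placeholders, not the ones used in the paper:

```python
# Minimal sketch of rule-based token-level hallucination labeling.
# OBJECT_VOCAB and SYNONYMS are illustrative stand-ins for the object
# vocabulary and synonym table that a CHAIR-style pipeline would use.
OBJECT_VOCAB = {"dog", "cat", "boat", "palette"}
SYNONYMS = {"puppy": "dog", "kitty": "cat"}

def label_hallucinated_tokens(tokens, gt_objects):
    """Return a 0/1 label per token: 1 if the token names an object
    that is not grounded in the image's ground-truth object set."""
    labels = []
    for tok in tokens:
        word = SYNONYMS.get(tok.lower(), tok.lower())
        is_object = word in OBJECT_VOCAB
        labels.append(1 if is_object and word not in gt_objects else 0)
    return labels

# Example in the spirit of Fig. 2(a): "the palette" is absent from the image.
print(label_hallucinated_tokens(["a", "dog", "near", "the", "palette"], {"dog"}))
# -> [0, 0, 0, 0, 1]
```

Only object-naming tokens can be flagged, matching the observation above that not all tokens in a hallucinatory response are themselves hallucinatory.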
For sentence-level hallucinations, we split multi-sentence texts and assess the image-grounded consistency of each sentence based on CHAIR, which is effective in identifying hallucination tokens. Sentences containing such tokens are labeled as hallucinatory. More details are presented in the supplementary material.

Examples of Both Granularities. At the token level, as shown in Fig. 2(a), the model generates a short response to a specific interrogative about a given item. In such cases, not all tokens are hallucinatory: only "the palette" is absent from the image, while the remaining tokens describe objectively present content. At the sentence level, for longer free-form responses as in Fig. 2(b), the red part of the text reflects content conjectured from the prior text and the image; the entire clause following "reflection" is unsupported.

Figure 3. Overview of our LTS-FS framework. First, we build a bi-granularity dataset with token-level and sentence-level hallucinations. Then, based on the dataset, hallucination-relevant layers are located through intervention-based attribution. Finally, a layerwise strategy is applied to control the feature steering intensity across layers according to the attribution scores.

Data Usage and Split Protocol. Note that all samples used to locate hallucination-relevant layers are computed only on the training split (or a small calibration subset drawn from it) and do not include any samples from the evaluation benchmarks. Once the dataset is constructed, the subsequent policy is fixed and consistently applied to all following test evaluations without modification.

3.2. Hallucination-Relevant Layer Localization

After constructing a dataset of images and texts annotated with hallucination labels, we utilize it to locate the LVLM layers that are more prone to inducing hallucination (i.e., hallucination-relevant layers).
Inspired by prior studies [46, 47], we design an attribution method that estimates the relevance between hidden layers of LVLMs and the logits of hallucination outputs through causal intervention.

Feed-Forward Process in LVLM Layers. Consider an LVLM composed of an image encoder, a projection module, and an LLM with $L$ layers. In the LLM decoding process, the output feature $h^l$ of layer $l$ is calculated as follows:

$$a^l = \mathrm{MultiHeadAttn}(h^{l-1}, h), \tag{1}$$

$$x^l = \mathrm{MLP}\big(\mathrm{LN}(h^{l-1} + a^l)\big), \tag{2}$$

$$h^l = \mathrm{LN}\big(x^l + \mathrm{LN}(h^{l-1} + a^l)\big). \tag{3}$$

Here, $a^l$ and $x^l$ are the outputs of the multi-head attention (MHA) and MLP, respectively, and LN denotes the LayerNorm module. The MHA output $a^l$ concatenates the outputs of $H$ heads. Given the output feature $h^{l-1}$ at layer $l-1$, the attention output $a^l$, and the parameters in subsequent layers $\theta_{\ge l}$, the logits of token $y$ are predicted as follows:

$$y \sim P_{\theta_{\ge l}}\big(y \mid h^{l-1}, a^l\big). \tag{4}$$

Layer-wise Attribution. To locate hallucination-relevant layers, we measure their contributions to hallucination outputs by introducing attribution scores at the token level and sentence level based on causal intervention techniques. Given the MHA output $a^l$ of layer $l$ and the output feature $h^{l-1}$ of the prior layer, the token-level attribution score $s^l_{\mathrm{tok}}$ of layer $l$ is calculated as:

$$s^l_{\mathrm{tok}} = \sum_{h=1}^{H} \log\!\left(\frac{P_{\theta_{\ge l}}(y \mid h^{l-1}, a^l)}{P_{\theta_{\ge l}}(y \mid h^{l-1}, a^l \odot M_h)}\right). \tag{5}$$

$M_h$ denotes a mask that sets the output of the $h$-th attention head to zero. We independently intervene on attention heads to measure the relevance between layers and hallucinatory tokens, building on prior studies [47, 49] showing that such interventions enable more accurate estimation of how individual layers contribute to the logits of output tokens.
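To make Eq. (5) concrete, the sketch below computes the token-level score with a toy stand-in for $P_{\theta_{\ge l}}$; in the real model, `token_logprob` would be a forward pass through the layers above layer $l$ with the chosen head zeroed out:

```python
import math

# Toy stand-in for log P_{θ≥l}(y | h^{l-1}, a^l): this "model" sums the kept
# head contributions and applies a log-sigmoid likelihood. In practice it
# would be a forward pass through the remaining LVLM layers.
def token_logprob(head_outputs, head_mask):
    z = sum(a * m for a, m in zip(head_outputs, head_mask))
    return -math.log1p(math.exp(-z))  # log sigmoid(z)

def token_attribution_score(head_outputs):
    """Eq. (5): sum over heads of the log-ratio between the intact output
    and the output with head h masked to zero (the intervention M_h)."""
    H = len(head_outputs)
    full = token_logprob(head_outputs, [1.0] * H)
    score = 0.0
    for h in range(H):
        mask = [0.0 if i == h else 1.0 for i in range(H)]
        score += full - token_logprob(head_outputs, mask)  # log(P / P_masked)
    return score
```

A layer whose heads raise the likelihood of a hallucinated token $y$ receives a large positive score, marking it as hallucination-relevant.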
For the sentence level, we compute the attribution scores across all tokens in a sentence and aggregate them to obtain an overall attribution score, as the entire sentence is intrinsically associated with the hallucinated content [14, 49]. Since individual tokens vary in their contribution to hallucinations, we employ a weight-based aggregation method that assigns token weights according to several indicators designed based on insights from prior studies [14, 49]. These studies suggest that (1) initial summarizing cues (e.g., "additional") or the terminal punctuation of the preceding sentence and (2) later tokens in the sentence are more likely to trigger hallucination. Furthermore, (3) tokens exhibiting factual errors should also be emphasized. Therefore, we design three indicators to assign these tokens higher weights. Given the set of tokens $T_{\mathrm{sent}}$ in a sentence, the indicators for a token $y_t \in T_{\mathrm{sent}}$ are defined as follows:

(1) Cue indicator: $u(y_t \mid T_{\mathrm{sent}}) \in \{0, 1\}$, where $u(y_t \mid T_{\mathrm{sent}}) = 1$ if $y_t$ is a summary token (e.g., "additional" or a period), and $u(y_t \mid T_{\mathrm{sent}}) = 0$ otherwise.

(2) Position indicator: $r(y_t \mid T_{\mathrm{sent}}) \in [0, 1]$; a higher value indicates a later position in the sentence.

(3) Hallucination indicator: $v(y_t \mid T_{\mathrm{sent}}) \in \{0, 1\}$, where $v(y_t \mid T_{\mathrm{sent}}) = 1$ if token $y_t$ is identified as containing a factual error, and $v(y_t \mid T_{\mathrm{sent}}) = 0$ otherwise.

A multiplicative weight is formed and then normalized:

$$\tilde{w}(y_t \mid T_{\mathrm{sent}}) = \big(1 + \lambda_{\mathrm{cue}}\, u(y_t \mid T_{\mathrm{sent}})\big) \times \big(1 + \lambda_{\mathrm{pos}}\, r(y_t \mid T_{\mathrm{sent}})\big) \times \big(1 + \lambda_{\mathrm{hall}}\, v(y_t \mid T_{\mathrm{sent}})\big),$$

$$w(y_t \mid T_{\mathrm{sent}}) = \frac{\tilde{w}(y_t \mid T_{\mathrm{sent}})}{\sum_{y_k \in T_{\mathrm{sent}}} \tilde{w}(y_k \mid T_{\mathrm{sent}})}, \tag{6}$$

where $\lambda_{\mathrm{pos}}, \lambda_{\mathrm{cue}}, \lambda_{\mathrm{hall}} \ge 0$ are hyperparameters that control the strength of the three indicators. The attribution score for sentence-level hallucinations at layer $l$ is computed as the weighted sum of the token-level attribution scores $s^l_{\mathrm{tok}}$:
$$s^l_{\mathrm{sent}} = \sum_{y_t \in T_{\mathrm{sent}}} w(y_t \mid T_{\mathrm{sent}}) \cdot s^l_{\mathrm{tok}}. \tag{7}$$

In practice, attribution scores are selected according to the task. For simple tasks such as question answering, the token-level score is employed due to the conciseness of model outputs. In contrast, the sentence-level score is adopted for more general tasks such as image captioning.

3.3. Layerwise Feature Steering

After locating hallucination-relevant layers with higher attribution scores, an intuitive approach is to apply feature steering exclusively to these layers. In contrast to existing feature steering methods that uniformly steer all layers, layer-wise steering enables more targeted hallucination mitigation while minimizing unnecessary interference with the LVLM's internal representations.

Specifically, we propose a layer-wise steering strategy that combines hard sparsification and soft weighting. For layers with extremely low attribution scores, steering their features has minimal impact on mitigating hallucinations while substantially impairing generalization capability. Therefore, we exclude such layers from the steering process by employing a mask parameterized by a threshold $r_s$. For layers with high attribution scores, we scale the steering intensity proportionally to their normalized attribution scores $\tilde{s}^l$ (i.e., features in higher-scoring layers are steered more strongly). The soft weighting achieves a more favorable balance between mitigating hallucinations and preserving the model's generalization capability.

The detailed implementation of our layer-wise steering strategy is presented in Algorithm 1. Since we only adjust the intensity of steering, our method can be seamlessly integrated into existing feature steering methods [31, 42], as all of them inherently require an explicit setting of steering intensity.
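The indicator weighting of Eq. (6) and the aggregation of Eq. (7) can be sketched in a few lines; the per-token attribution scores are assumed to come from the token-level score of Eq. (5):

```python
def token_weights(is_cue, pos_frac, is_hall,
                  lam_cue=1.0, lam_pos=1.0, lam_hall=1.0):
    """Eq. (6): multiplicative indicator weights, normalized over the
    sentence. is_cue / is_hall are 0/1 flags per token; pos_frac lies in
    [0, 1] and grows toward the end of the sentence."""
    raw = [(1 + lam_cue * u) * (1 + lam_pos * r) * (1 + lam_hall * v)
           for u, r, v in zip(is_cue, pos_frac, is_hall)]
    total = sum(raw)
    return [w / total for w in raw]

def sentence_attribution_score(weights, token_scores):
    """Eq. (7): weighted sum of the per-token attribution scores."""
    return sum(w * s for w, s in zip(weights, token_scores))

# Hallucinated tokens and later positions receive larger weights; the
# defaults match the paper's setting lam_cue = lam_pos = lam_hall = 1.
w = token_weights(is_cue=[0, 0, 0], pos_frac=[0.0, 0.5, 1.0], is_hall=[0, 0, 1])
```

By construction the weights sum to one, so the sentence-level score stays on the same scale as the token-level scores.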
Moreover, given a fixed pre-trained LVLM, the layer-wise intensity derived by our framework is generalizable across diverse steering methods, highlighting the broad applicability and strong reusability of our framework.

Algorithm 1: Feature Steering at Layer $l$
Require: layerwise attribution score $s^l$, mask threshold $r_s$, initial feature steering intensity $\lambda$, output features $h^l$, and feature steering function $f: \mathbb{R}^d \times \mathbb{R} \to \mathbb{R}^d$
Ensure: steered feature $\tilde{h}^l$
1: Compute threshold: $\tau = r_s \cdot \frac{1}{L}\sum_{l=1}^{L} s^l$
2: Construct mask: $m^l = \mathbb{1}\big[s^l \ge \tau\big]$
3: Mask attribution score: $s^l \leftarrow m^l \cdot s^l$
4: Normalize attribution score: $\tilde{s}^l = s^l / \sum_{i=1}^{L} s^i$
5: Scale feature steering intensity: $\lambda^l = \lambda \cdot m^l + \lambda \cdot \tilde{s}^l$
6: Steer feature: $\tilde{h}^l = f(h^l, \lambda^l)$
7: return $\tilde{h}^l$

4. Experiments and Analysis

In this section, we empirically investigate the effectiveness of LTS-FS in mitigating hallucinations while preserving model generalization. Notably, we use only 100 sentence-level and 100 token-level hallucination samples to build the bi-granularity dataset for layer-wise attribution. The sentence-level samples are selected and processed from the CHAIR benchmark [33], while the token-level samples come from POPE [22] and Antidote [41].

4.1. Benchmarks and Baselines

Benchmarks. Following prior work, we evaluate LTS-FS on the standard benchmarks CHAIR [33] and POPE [22]. Each metric is averaged across three independent runs with distinct random seeds. To assess overall performance after feature steering, we further include experiments on MME [10] and LLaVA-Bench [27].

(a) CHAIR: Caption Hallucination Assessment with Image Relevance [33] is a widely used benchmark for evaluating object hallucination in image captioning.
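Algorithm 1 above translates directly into a few lines of code. A minimal sketch, assuming non-negative attribution scores and a generic steering function `f(h, lam)` (e.g., Nullu's or VTI's update rule):

```python
def layerwise_steering_intensity(scores, lam, r_s):
    """Algorithm 1, steps 1-5: convert per-layer attribution scores into
    per-layer steering intensities via a hard mask plus soft weighting.
    Assumes non-negative scores, so at least one layer survives for r_s <= 1."""
    L = len(scores)
    tau = r_s * sum(scores) / L                        # step 1: threshold
    mask = [1.0 if s >= tau else 0.0 for s in scores]  # step 2: hard mask
    masked = [m * s for m, s in zip(mask, scores)]     # step 3: mask scores
    total = sum(masked)
    normed = [s / total for s in masked]               # step 4: normalize
    # step 5: base intensity on kept layers plus a score-proportional term
    return [lam * m + lam * n for m, n in zip(mask, normed)]

def steer_features(features, intensities, f):
    """Step 6: apply the chosen steering function per layer with the
    layer-specific intensity."""
    return [f(h, l) for h, l in zip(features, intensities)]

# Low-scoring layers are excluded entirely; higher-scoring layers are
# steered more strongly (here with the paper's default r_s = 0.5).
lams = layerwise_steering_intensity([0.1, 0.4, 0.5], lam=1.0, r_s=0.5)
```

Because only the per-layer intensity is produced, the same intensities can be reused with any steering function, matching the plug-and-play claim above.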
It quantifies the degree of object hallucination as the ratio of objects mentioned in the generated text that are not in the ground-truth object set. There are two assessment criteria: CHAIR_S quantifies object hallucination at the sentence level, while CHAIR_I focuses on the instance level. Lower C_S and C_I indicate fewer hallucinations. In addition, we report Recall and sentence Length to ensure a fair comparison, since the reported hallucination metrics may be affected by the amount of generated content.

Table 1. CHAIR results of various LVLMs on MSCOCO. Bold indicates the best performance. Lower C_S and C_I mean less hallucination. Recall and output length (Len.) serve as controls, indicating that reductions in C_S/C_I do not stem from suppressing objects or truncating responses. * denotes feature steering methods. Column groups correspond to LLaVA-v1.5-7B / LLaVA-v1.5-13B / Qwen-VL2.5-7B.

| Method | C_S↓ | C_I↓ | Recall | Len. | C_S↓ | C_I↓ | Recall | Len. | C_S↓ | C_I↓ | Recall | Len. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Regular | 53.0 | 13.9 | 77.2 | 98.0 | 40.8 | 9.5 | 77.2 | 111.8 | 27.0 | 7.4 | 61.6 | 120.6 |
| VCD | 55.2 | 16.7 | 77.5 | 89.2 | 39.2 | 9.2 | 79.1 | 108.2 | 26.2 | 7.6 | 61.2 | 120.3 |
| AGLA | 50.8 | 16.1 | 75.2 | 88.1 | 38.4 | 9.1 | 78.7 | 109.3 | 25.2 | 7.1 | 59.5 | 118.6 |
| Nullu* | 50.2 | 13.7 | 76.9 | 93.3 | 38.0 | 9.4 | 74.5 | 105.8 | 27.4 | 7.7 | 60.7 | 121.6 |
| VTI* | 47.4 | 13.9 | 76.2 | 88.9 | 36.3 | 9.2 | 75.9 | 94.4 | 25.5 | 7.1 | 61.6 | 121.3 |
| LTS-FS (Nullu) | 46.8 | 13.5 | 76.6 | 93.2 | 35.7 | 8.9 | 76.1 | 109.8 | 23.8 | 6.0 | 60.8 | 120.6 |
| LTS-FS (VTI) | 35.8 | 11.9 | 75.4 | 82.2 | 32.0 | 8.8 | 74.2 | 83.6 | 24.8 | 6.6 | 62.5 | 120.0 |

Table 2. Comparison of average accuracy under different settings (i.e., Random, Popular, Adversarial) with different baselines and our framework on POPE. Bold indicates the best results, and underline indicates the second best. * denotes feature steering methods.
Column groups correspond to LLaVA-v1.5-7B / LLaVA-v1.5-13B / Qwen-VL2.5-7B.

| Method | Popular | Random | Adversarial | Popular | Random | Adversarial | Popular | Random | Adversarial |
|---|---|---|---|---|---|---|---|---|---|
| Regular | 77.52 | 85.37 | 70.13 | 78.40 | 81.91 | 71.07 | 83.31 | 85.32 | 80.17 |
| VCD | 79.09 | 86.55 | 71.48 | 79.38 | 82.27 | 71.73 | 83.19 | 85.94 | 80.56 |
| AGLA | 78.67 | 85.32 | 71.63 | 80.11 | 82.64 | 72.27 | 83.34 | 86.02 | 80.92 |
| Nullu* | 79.42 | 86.35 | 71.57 | 80.88 | 83.24 | 72.43 | 83.06 | 85.82 | 80.74 |
| VTI* | 77.03 | 84.84 | 69.40 | 79.22 | 84.08 | 71.77 | 82.74 | 85.49 | 80.19 |
| LTS-FS (Nullu) | 80.09 | 87.13 | 72.62 | 81.46 | 83.96 | 73.06 | 83.59 | 86.21 | 81.11 |
| LTS-FS (VTI) | 79.96 | 86.77 | 73.04 | 81.77 | 86.59 | 73.78 | 83.35 | 86.04 | 80.92 |

(b) POPE: Polling-based Object Probing Evaluation [22] contains 27,000 question-answer pairs about objects in MSCOCO [24], A-OKVQA [35], and GQA [15]. These pairs involve only yes/no questions and are evenly distributed among existing and absent objects. Each dataset has three negative sample settings, i.e., random, popular, and adversarial [22]. This benchmark is evaluated as classification, with Accuracy, Recall, Precision, and F1-score as metrics.

(c) MME: Multi-modal Large Language Model Evaluation [10] is a comprehensive evaluation benchmark for LVLMs that assesses their perception and cognition abilities. It comprises ten perception-related and four cognition-related tasks evaluated by binary classification. MME is employed to measure hallucination while also capturing aspects of general model ability.

(d) LLaVA-Bench: LLaVA-Bench [27] comprises 24 images, each accompanied by a detailed, manually crafted description and a set of meticulously selected questions. Although this collection is relatively small in scale, it poses greater challenges for LVLMs. We use GPT-4V to evaluate the model's generations, assessing general capability.

Baselines. We integrate our framework with the two feature steering methods Nullu [42] and VTI [31].
To validate the utility of our method, we evaluate the effectiveness of the two resulting models, LTS-FS (Nullu) and LTS-FS (VTI), on three mainstream large vision-language models: LLaVA-v1.5-7B [28], LLaVA-v1.5-13B [28], and Qwen-VL2.5-7B [4]. We compare our method with state-of-the-art baselines: VCD [19], AGLA [2], Nullu [42], and VTI [31].

Implementation. For the other hallucination mitigation methods, we use the default settings. In our method, we set λ_pos = 1, λ_cue = 1, λ_hall = 1, and r_s = 0.5. More implementation details can be found in the Appendix.

4.2. Results on CHAIR

In the CHAIR evaluation, we use "Please describe the image in detail" as the prompt. The results in Tab. 1 confirm that our LTS-FS consistently outperforms the evaluated methods. The lowest CHAIR_S and CHAIR_I indicate that our framework better integrates visual knowledge and effectively reduces hallucinations. Comparison with Nullu and VTI demonstrates that our strategy can further enhance the performance of feature-steering-based methods. Moreover, the Recall and Length of our method are comparable to those of other methods, providing partial evidence that our method mitigates hallucinations without sacrificing generation quality. For the evaluation of text generation quality, we provide additional results in Appendix 9.

Inference Time Analysis. Compared with decoding-based methods (VCD and AGLA), feature-steering-based methods (Nullu and VTI) do not involve time-consuming additional processes at inference; thus, their inference speed is similar to that of the regular setting. Our framework inherits this favorable characteristic. Detailed analysis can be found in Appendix 12.

Table 3. Comparison of average F1 score under different settings (i.e., Random, Popular, Adversarial) with different baselines and our framework on POPE. Bold indicates the best results, and underline indicates the second best.
* denotes feature steering methods. Column groups correspond to LLaVA-v1.5-7B / LLaVA-v1.5-13B / Qwen-VL2.5-7B.

| Method | Popular | Random | Adversarial | Popular | Random | Adversarial | Popular | Random | Adversarial |
|---|---|---|---|---|---|---|---|---|---|
| Regular | 80.71 | 86.47 | 75.85 | 81.30 | 83.85 | 76.47 | 81.68 | 83.54 | 78.93 |
| VCD | 81.23 | 87.16 | 76.04 | 82.01 | 83.76 | 75.76 | 81.95 | 83.88 | 79.51 |
| AGLA | 81.47 | 86.77 | 75.89 | 82.32 | 83.58 | 75.48 | 81.86 | 83.63 | 79.14 |
| Nullu* | 81.67 | 86.28 | 76.17 | 82.97 | 84.73 | 77.04 | 81.27 | 83.73 | 79.32 |
| VTI* | 80.40 | 86.08 | 75.42 | 81.83 | 83.82 | 76.80 | 80.88 | 83.37 | 78.70 |
| LTS-FS (Nullu) | 82.20 | 87.64 | 76.22 | 83.42 | 85.56 | 78.36 | 82.55 | 84.31 | 79.83 |
| LTS-FS (VTI) | 82.25 | 87.32 | 77.32 | 83.58 | 87.48 | 79.91 | 81.38 | 83.88 | 79.46 |

Figure 4. Results of MME evaluation.

4.3. Results on POPE

We conduct evaluations on the POPE benchmark under the Popular, Random, and Adversarial settings. Here, we mainly report the average Accuracy and F1-score, shown in Tab. 2 and Tab. 3, respectively. Comprehensive results can be found in the Supplementary Materials. Since we use Qwen-VL for evaluation and some methods (e.g., Nullu) did not report corresponding results, we reproduce all methods under as consistent an environment as possible to ensure fair comparison.

Experiments show that our method achieves the best accuracy and F1-score under all settings. In particular, when using LLaVA-v1.5-13B, it increases the accuracy of the Random setting from 81.91% to 86.59%. The results demonstrate the effectiveness of LTS-FS in mitigating hallucinations and its broad applicability across diverse open-source LVLMs and datasets.

4.4. Results on MME

We present the results of LLaVA-1.5-7B on the MME benchmark as a representative case to evaluate the general ability of the edited model. As shown in Fig. 4, LTS-FS consistently achieves enhanced performance across all perception-related tasks in MME. It is worth noting that Nullu achieves significant improvements mainly in tasks such as OCR and Posters, but has negligible effects on some tasks (e.g., Count). This likely indicates that typical feature-steering-based methods are susceptible to changes in feature distribution, whereas our layer-wise strategy better preserves the comprehensive capability of the model. More details are provided in Appendix 11.

Table 4. Ablation results for different granularity levels.

| Hallucination Level | C_S | C_I | POPE acc | POPE f1 |
|---|---|---|---|---|
| Baseline (Nullu) | 50.2 | 13.7 | 79.11 | 81.37 |
| Token-level only | 50.0 | 13.4 | 79.59 | 81.85 |
| Sentence-level only | 47.3 | 13.0 | 79.33 | 81.58 |
| Both levels | 46.8 | 13.5 | 79.92 | 82.02 |

Table 5. Ablation results for different choices of r_s.

| r_s | CHAIR_S | CHAIR_I | Recall | Length |
|---|---|---|---|---|
| 0.0 (Regular) | 53.0 | 13.9 | 77.2 | 98.0 |
| 0.3 | 49.5 | 14.2 | 76.2 | 95.7 |
| 0.5 | 46.8 | 13.5 | 76.6 | 93.2 |
| 0.7 | 47.6 | 13.0 | 76.6 | 97.0 |
| 0.9 | 49.1 | 13.3 | 75.7 | 96.8 |
| Soft Gating | 46.7 | 13.4 | 76.1 | 94.5 |

4.5. Ablation Studies

Effect of Two Granularity Hallucination Levels. We study the effect of the two granularity levels by evaluating LTS-FS on sub-datasets with only token-level hallucinations and only sentence-level hallucinations, taking the performance of Nullu as the baseline (Tab. 4). The "Token-level only" setup calculates attribution scores using only token-level hallucination samples in the layer localization process; the "Sentence-level only" setup uses only sentence-level samples; the "Both levels" setup is equivalent to the overall LTS-FS framework.
Figure 5. Demonstration of our framework for hallucination mitigation on two examples from LLaVA-Bench using LLaVA-v1.5-7B.

Results on the CHAIR benchmark indicate that layer localization based on sentence-level hallucinations achieves a more significant mitigation effect. This further demonstrates that sentence-level hallucination attribution is particularly beneficial for longer outputs, which is consistent with intuitive expectations. In contrast, layer localization based on token-level hallucinations is more adaptable to the short responses in POPE. However, the "Both level" setup achieves the optimal performance in the POPE evaluation, which indicates that integrating sentence-level attribution is more conducive to enhancing the model's robustness. Combining hallucination samples at the two granularity levels expands the conceptual range of hallucination attribution, thereby enabling more precise layer-wise localization.

Selection of Mask Threshold r_s. The hyper-parameter r_s directly determines how many layers should be steered. We compare the results on CHAIR across a set of candidate r_s values to discuss the impact of the mask threshold on generation performance.
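The hard-gating rule controlled by r_s can be sketched as follows. The min-max normalization and the proportional-intensity choice are illustrative assumptions on our part, not the released implementation:

```python
import numpy as np

def layerwise_steering_intensity(attribution_scores, r_s=0.5):
    """Convert per-layer attribution scores into steering intensities.

    Layers whose normalized attribution score falls below the mask
    threshold r_s receive zero steering; the remaining layers keep an
    intensity proportional to their normalized score. This is a sketch
    of the thresholding idea, not the authors' exact implementation.
    """
    scores = np.asarray(attribution_scores, dtype=float)
    # Normalize to [0, 1] so r_s is comparable across models.
    norm = (scores - scores.min()) / (scores.max() - scores.min() + 1e-8)
    mask = (norm >= r_s).astype(float)  # hard gating: keep only top layers
    return mask * norm                  # zero intensity for masked-out layers

# Example: a 32-layer model whose later layers attribute more strongly.
scores = np.linspace(0.0, 1.0, 32)
intensities = layerwise_steering_intensity(scores, r_s=0.5)
print(int((intensities > 0).sum()))  # 16 of 32 layers are steered at r_s = 0.5
```

With r_s = 0.0 every layer is steered (recovering uniform feature steering), while larger r_s values restrict the intervention to the most hallucination-relevant layers.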
The LVLM is LLaVA-v1.5-7B, and the feature-steering basis method is Nullu. As shown in Tab. 5, our strategy consistently enhances the hallucination mitigation effect. Meanwhile, the differences in results caused by r_s ∈ [0.5, 0.7] are negligible, which indicates that non-extreme selections of r_s are sufficient to improve the performance of the feature steering method.

We also investigate a soft gating variant for selecting r_s. Compared with fixing r_s for all samples, soft gating sets r_s per sample based on the attribution-score distribution across layers, allowing the number of steered layers to vary with the input. As shown in Tab. 5, soft gating performs comparably to hard gating with negligible differences, so we use hard gating with a fixed r_s in all experiments for simplicity.

4.6. Further Analysis
Case Study on LLaVA-Bench. In Fig. 5, we provide two case studies based on LLaVA-v1.5-7B. The examples clearly show that hallucinations still exist in typical feature-steering methods, where nonexistent details such as "cut-in-half sandwiches", "pirate ships", and "pirate hats" are fabricated. Our method consistently avoids these errors and produces descriptions that remain faithful to the visual content. These qualitative results demonstrate that our layer-wise feature steering effectively suppresses hallucinations while better preserving the comprehensive capabilities of the model and the fluency of its outputs.

Table 6. GPT-4V-aided evaluation on LLaVA-Bench.
Model       Method          Accuracy ↑  Detailedness ↑
LLaVA-1.5   Original        5.74        5.23
            Nullu           6.46        5.51
            LTS-FS(Nullu)   6.96        6.23
Qwen-vl2.5  Original        6.06        5.54
            Nullu           6.37        5.68
            LTS-FS(Nullu)   6.59        6.07

GPT-4V Aided Evaluation on LLaVA-Bench. Following Nullu [42], we evaluate the performance of our method using GPT-4V-aided evaluation. The results are shown in Tab. 6, which demonstrates that our method can mitigate hallucinations while better maintaining generation ability.
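The soft-gating variant compared in Tab. 5 derives the threshold per sample from the attribution-score distribution. A minimal sketch is given below; the mean-plus-std statistic and the coefficient k are our own illustrative assumptions, not the paper's exact rule:

```python
import numpy as np

def soft_gating_threshold(attribution_scores, k=1.0):
    """Per-sample mask threshold for a soft-gating variant (sketch).

    Instead of a fixed r_s, the threshold is derived from the current
    input's attribution-score distribution, here as mean + k * std of
    the normalized scores. The statistic and the coefficient k are
    illustrative assumptions, not the paper's released rule.
    """
    scores = np.asarray(attribution_scores, dtype=float)
    norm = (scores - scores.min()) / (scores.max() - scores.min() + 1e-8)
    return float(norm.mean() + k * norm.std())

# A peaked score distribution yields a higher threshold (few steered
# layers) than a flat one, so the number of steered layers varies per input.
peaked = np.zeros(32)
peaked[20:24] = 1.0          # a few strongly attributed layers
flat = np.ones(32) * 0.5     # no layer stands out
print(soft_gating_threshold(peaked) > soft_gating_threshold(flat))
```

Under such a rule, inputs whose attribution mass concentrates on a few layers steer only those layers, while inputs with diffuse scores steer more of them.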
5. Conclusion
In this paper, we propose a plug-and-play framework called Locate-Then-Sparsify for Feature Steering (LTS-FS), which mitigates hallucinations in LVLMs through feature steering while better preserving their generalization ability. We first construct a bi-granularity hallucination dataset. With this dataset, we attribute hallucination-relevant layers based on causal interventions. Finally, we design a layerwise strategy to selectively control the steering intensity according to the attribution scores across layers. Extensive experiments demonstrate that LTS-FS can effectively mitigate hallucinations while preserving the generalization ability of LVLMs. For future work, we will investigate the characteristics of the hallucination-relevant layers detected by LTS-FS and try to integrate the LTS-FS framework into the model pre-training process to more fundamentally reduce the generation of hallucinations.

Acknowledgement. This work was supported in part by National Natural Science Foundation of China: 62236008, and in part by the Natural Science Foundation of Beijing under grant number L251082. The authors would like to thank all the anonymous reviewers for their insightful comments.

References
[1] Jean-Baptiste Alayrac et al. Flamingo: a visual language model for few-shot learning. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
[2] Wenbin An, Feng Tian, Sicong Leng, Jiahao Nie, Haonan Lin, Qianying Wang, Ping Chen, Xiaoqin Zhang, and Shijian Lu. Mitigating object hallucinations in large vision-language models with assembly of global and local attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 29915–29926, 2025.
[3] Yongqi An, Xu Zhao, Tao Yu, Ming Tang, and Jinqiao Wang. Fluctuation-based adaptive structured pruning for large language models.
In Proceedings of the AAAI Conference on Artificial Intelligence, 2024.
[4] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
[5] Chao Bi, Tiantian Dang, Shuhui Wang, Feng Cao, and Qingming Huang. Asking questions to alleviate object hallucination in large vision-language models. IEEE Transactions on Circuits and Systems for Video Technology, 2025.
[6] David M. Chan, Suzanne Petryk, et al. CLAIR: Evaluating image captions with large language models. In EMNLP 2023, 2023.
[7] Harry Dong, Beidi Chen, and Yuejie Chi. Prompt-prompted adaptive structured pruning for efficient LLM generation. arXiv preprint arXiv:2404.01365, 2024.
[8] Xin Dong, Shangyu Chen, and Sinno Jialin Pan. Learning to prune deep neural networks via layer-wise optimal brain surgeon. In NeurIPS, 2017.
[9] Qianyu Feng, Yu Wu, Hehe Fan, Chenggang Yan, Mingliang Xu, and Yi Yang. Cascaded revision network for novel object captioning. IEEE Transactions on Circuits and Systems for Video Technology, 30(10):3413–3421, 2020.
[10] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. MME: A comprehensive evaluation benchmark for multimodal large language models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025.
[11] Hengyuan Hu, Rui Peng, Yu-Wing Tai, and Chi-Keung Tang. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250, 2016.
[12] Jing Huang et al. A survey on evaluation of multimodal large language models. arXiv preprint arXiv:2408.15769, 2024.
[13] Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv preprint arXiv:2311.05232, 2023.
[14] Qidong Huang, Xiaoyi Dong, Pan Zhang, Bin Wang, Conghui He, Jiaqi Wang, Dahua Lin, Weiming Zhang, and Nenghai Yu. OPERA: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13418–13427, 2024.
[15] Drew A Hudson and Christopher D Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6700–6709, 2019.
[16] Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023.
[17] Ziwei Ji, Nayeon Lee, Rita Frieske, et al. Survey of hallucination in natural language generation. ACM Computing Surveys, 2023.
[18] Jinghan Jia, Jiancheng Liu, Parikshit Ram, Yuguang Yao, Gaowen Liu, Yang Liu, Pranay Sharma, and Sijia Liu. Model sparsity can simplify machine unlearning, 2024.
[19] Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, and Lidong Bing. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13872–13882, 2024.
[20] Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model. Advances in Neural Information Processing Systems, 36:41451–41530, 2023.
[21] Y. Li et al. Evaluating object hallucination in large vision-language models. arXiv preprint, 2023.
[22] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 292–305, 2023.
[23] Jessy Lin, Luke Zettlemoyer, Gargi Ghosh, Wen-Tau Yih, Aram Markosyan, Vincent-Pierre Berges, and Barlas Oğuz. Continual learning via sparse memory finetuning, 2025.
[24] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
[25] Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Mitigating hallucination in large multi-modal models via robust instruction tuning. In The Twelfth International Conference on Learning Representations, pages 1–12, 2023.
[26] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint, 2023.
[27] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024.
[28] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36, 2024.
[29] Hanchao Liu, Wenyuan Xue, Yifei Chen, Dapeng Chen, Xiutian Zhao, Ke Wang, Liping Hou, Rongjun Li, and Wei Peng. A survey on hallucination in large vision-language models. arXiv preprint arXiv:2402.00253, 2024.
[30] Jinzhe Liu, Junshu Sun, Shufan Shen, Chenxue Yang, and Shuhui Wang. Edit less, achieve more: Dynamic sparse neuron masking for lifelong knowledge editing in LLMs, 2025.
[31] Sheng Liu, Haotian Ye, and James Zou. Reducing hallucinations in large vision-language models via latent space steering. In The Thirteenth International Conference on Learning Representations, 2025.
[32] Yan Liu, Yu Liu, Xiaokang Chen, Pin-Yu Chen, Daoguang Zan, Min-Yen Kan, and Tsung-Yi Ho. The devil is in the neurons: Interpreting and mitigating social biases in pre-trained language models, 2024.
[33] Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4035–4045, 2018.
[34] Prasanta Sahoo et al. A comprehensive survey of hallucination mitigation techniques in large language models. Findings of EMNLP, 2024.
[35] Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-OKVQA: A benchmark for visual question answering using world knowledge. In European Conference on Computer Vision, pages 146–162. Springer, 2022.
[36] Shufan Shen, Junshu Sun, Xiangyang Ji, Qingming Huang, and Shuhui Wang. Expanding sparse tuning for low memory usage. In NeurIPS, 2024.
[37] Suraj Srinivas and R. Venkatesh Babu. Data-free parameter pruning for deep neural networks. In BMVC, 2015.
[38] Boyao Wang and Volodymyr Kindratenko. RL-Pruner: Structured pruning using reinforcement learning for CNN compression and acceleration. arXiv preprint arXiv:2411.06463, 2024.
[39] Bin Wang, Fan Wu, Xiao Han, Jiahui Peng, Huaping Zhong, Pan Zhang, Xiaoyi Dong, Weijia Li, Wei Li, Jiaqi Wang, et al. VIGC: Visual instruction generation and correction. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5309–5317, 2024.
[40] Peng Wang, Shuai Bai, et al. Qwen2-VL: Enhancing vision-language model's understanding of the open world. arXiv preprint arXiv:2409.12191, 2024.
[41] Yuanchen Wu, Lu Zhang, Hang Yao, Junlong Du, Ke Yan, Shouhong Ding, Yunsheng Wu, and Xiaoqiang Li. Antidote: A unified framework for mitigating LVLM hallucinations in counterfactual presupposition and object perception. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14646–14656, 2025.
[42] Le Yang, Ziwei Zheng, Boxu Chen, Zhengyu Zhao, Chenhao Lin, and Chao Shen. Nullu: Mitigating object hallucinations in large vision-language models via HalluSpace projection. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14635–14645, 2025.
[43] Tien-Ju Yang, Yu-Hsin Chen, and Vivienne Sze. Designing energy-efficient convolutional neural networks using energy-aware pruning. In CVPR, 2017.
[44] Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models. National Science Review, 2024. Earlier arXiv:2306.13549.
[45] Shukang Yin, Chaoyou Fu, Sirui Zhao, Tong Xu, Hao Wang, Dianbo Sui, Yunhang Shen, Ke Li, Xing Sun, and Enhong Chen. Woodpecker: Hallucination correction for multimodal large language models. Science China Information Sciences, 67(12):220105, 2024.
[46] Zeping Yu and Sophia Ananiadou. Neuron-level knowledge attribution in large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 3267–3280, 2024.
[47] Zeping Yu and Sophia Ananiadou. Understanding multimodal LLMs: the mechanistic interpretability of LLaVA in visual question answering. arXiv preprint arXiv:2411.10950, 2024.
[48] Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al.
Siren's song in the AI ocean: a survey on hallucination in large language models. arXiv preprint arXiv:2309.01219, 2023.
[49] Yiyang Zhou, Chenhang Cui, Jaehong Yoon, Linjun Zhang, Zhun Deng, Chelsea Finn, Mohit Bansal, and Huaxiu Yao. Analyzing and mitigating object hallucination in large vision-language models. In The Twelfth International Conference on Learning Representations, 2024.

Locate-then-Sparsify: Attribution Guided Sparse Strategy for Visual Hallucination Mitigation
Supplementary Material

6. Details of the construction of the dataset
In this section, we introduce the details of how the bi-granularity dataset is constructed.

First, to preserve generalization, the data used for dataset construction and the data used for experiments are strictly disjoint. In particular, for data selected based on CHAIR and POPE, we use data from the train split of MSCOCO; for data selected from Antidote, we do not use these data for evaluation.

Secondly, we explain how a single data instance is obtained. Taking CHAIR as an example, the data instance is generated by an LVLM. We use LLaVA-v1.5-7B to produce a response according to the CHAIR benchmark, as illustrated in Fig. 6. We then apply CHAIR's evaluation criteria to detect hallucinations and annotate the instance under our two-level scheme (token- and sentence-level), yielding one data instance. Responses without hallucinations are simply not selected.

Finally, to balance the two levels of data, we select 100 sentence-level samples and 100 token-level samples. All data are manually inspected to ensure accuracy.

7. Implementation details of LTS-FS
Hyper-parameters. The strength control parameters of s_l^tok, namely λ_cue, λ_pos, and λ_hall, are all set to 1. The mask threshold r_s is set to 0.5, as shown in Tab. 5 of the main text.
Environment. All experiments are conducted on one A100 80G GPU. For 7B models, two RTX 3090 24G GPUs can replace the A100.
For detailed Python requirements, please refer to our released code.

8. Implementation Settings of CHAIR Results
Generation Setting. We set the generation config as follows: Max New Tokens=128, num beams=1, and sampling=False.
Compared methods. We employ the default parameters and settings as reported in the original papers.

9. Generation Capability
To evaluate general capability more comprehensively, we perform an evaluation using a broader benchmark called CLAIR [6]. The results in Tab. 7 show that LTS-FS achieves a better trade-off between hallucination mitigation and general capability preservation.

Table 7. Trade-off between hallucination mitigation and general capability preservation.
Method          CHAIR_S  POPE acc  Detailedness  CLAIR
Original        53.0     77.63     5.23          80.03
Nullu           50.2     79.11     5.51          75.00
LTS-FS(Nullu)   46.8     79.92     6.23          82.74
Soft Gating     46.7     79.5      6.26          83.64

10. More details of POPE results
Generation Setting. We set the generation config as follows: Max New Tokens=16, num beams=1, and sampling=False.
Compared methods. We employ the default parameters and settings as reported in the original papers.
Total Results. The total results are shown in Tab. 13. Across all settings, our LTS-FS framework achieves the best accuracy and F1, demonstrating consistent effectiveness in hallucination mitigation. Compared with the original feature-steering methods, applying LTS-FS consistently improves both VTI and Nullu on hallucination-related metrics. Although LTS-FS and the other methods trade wins on recall, LTS-FS consistently maintains higher precision. Since, in hallucination evaluation, precision is more indicative of mitigation quality, this further supports the strong performance of our approach.

11. More details of MME results
We report the MME numerical results in Tab. 8. The numerical results demonstrate that LTS-FS can strongly increase the mitigation ability of feature steering methods.
Specifically, across the subsets most related to hallucination, Count and Position, LTS-FS achieves substantial improvements, highlighting its effectiveness in enhancing feature-steering-based mitigation. MME includes not only perception-related tasks but also recognition-related tasks; we report these results in Tab. 9. Although the sparsity selection emphasizes hallucination-related cues rather than recognition factors, LTS-FS still produces improvements on recognition-related tasks.

Table 8. Results on all MME perception-related tasks.
Method          Existence  Count  Position  Color  Posters  Celebrity  Scene  Landmark  Artwork  OCR  Total
Regular         182        118    105       151    118      112        145    131       108      78   1248
Nullu           190        122    106       157    128      118        148    130       114      121  1334
LTS-FS(Nullu)   195        153    128       157    130      127        155    131       113      123  1412

Table 9. Results on all MME recognition-related tasks.
Model         Method          Common Sense Reasoning  Numerical Calculation  Text Translation  Code Reasoning  Total
LLaVA-1.5-7B  Regular         110                     50                     50                71              281
              Nullu           113                     59                     75                77              324
              LTS-FS + Nullu  120                     59                     75                80              334

12. Time Analysis
There are two time costs to analyze: the time to apply a method and the time for inference. The time to apply a method is the time required to integrate a hallucination mitigation method into a specific LVLM. For example, to apply VTI to an LVLM, the direction vector needs to be computed and the layers need to be adjusted; the total of these steps is the application time. For our method, the application time consists of two parts. First, we perform layer-wise attribution to select specific layers. Second, we apply feature steering methods on these sparse layers. The time for the second part is almost the same as for the original feature steering methods and can be completed in under 30 minutes. The first part is the attribution process, which is time-consuming: for LLaVA-v1.5-7B, it takes about 1–2 hours on a single A100 80 GB GPU.

As for the inference time, our framework is built on feature steering methods, so its inference time is comparable with regular generation. A comparison is shown in Tab. 10. Despite requiring a longer preparation phase, the additional cost is reasonable, as it avoids the extra inference-time latency that would otherwise accumulate during decoding, further highlighting the inherent advantages of feature steering techniques.

Table 10. Time analysis comparison of different hallucination mitigation strategies. VCD represents a decoding-based method; Nullu represents a feature-steering-based method.
Method    Preparation Cost  Inference Cost
Regular   –                 1.31s
VCD       0s                3.14s
Nullu     30mins            1.37s
Ours      90mins            1.34s

13. Ablation Study about Indicators
In this section, we discuss the effect of the three indicators in sentence-level hallucination attribution. The results are shown in Tab. 11. We investigate the effect of removing each indicator in turn and find that removing the cue indicator or the position indicator yields only small changes, whereas removing the hallucination indicator causes a much larger decline. This indicates that hallucination token attribution is paramount, with the cue and position indicators still providing auxiliary gains.

Table 11. Ablation study of indicators. HI, CI, and PI respectively denote the hallucination indicator, cue indicator, and position indicator.
Settings        C_S   C_I   Recall  Length
Regular         53.0  13.9  77.2    98.0
w/o HI          52.0  14.0  76.9    97.4
w/o CI          48.2  13.6  77.1    95.7
w/o PI          47.6  13.7  76.9    94.3
LTS-FS(Nullu)   46.8  13.5  76.6    93.2

Table 12. Results of the generalization test of our framework. We use LLaVA-v1.5-7B to conduct this experiment. C_S and C_I are CHAIR_S and CHAIR_I under the CHAIR benchmark. Acc and F1 denote the accuracy and F1 score on the GQA subset of POPE.
Settings            C_S   C_I   Acc    F1
Regular             53.0  13.9  75.47  79.83
MSCOCO → GQA        —     —     77.31  79.57
GQA → MSCOCO        49.5  13.2  —      —
Antidote → GQA      —     —     77.28  80.12
Antidote → MSCOCO   49.8  13.7  —      —
LTS-FS(Nullu)       46.8  13.5  77.15  80.63

14.
Discussion about Generalization
To assess generalization beyond the construction sources, we evaluate on datasets whose distributions differ from those used to build our bi-granularity labels. Although the construction leverages CHAIR, POPE, and Antidote, we additionally report results on MME and LLaVA-Bench, which serve as out-of-distribution tests of overall capability. We also run a decoupled calibration evaluation protocol: layer scores and weights are calibrated on one source (e.g., CHAIR on MSCOCO), then frozen and applied to a different target set for evaluation (e.g., POPE-GQA or Antidote). Concretely, CHAIR relies on MSCOCO; POPE uses MSCOCO and GQA; Antidote uses its own corpus. We therefore test cross-dataset pairs such as MSCOCO → GQA to verify transfer. The results are shown in Tab. 12. MSCOCO → GQA denotes calibrating attribution on MSCOCO and evaluating on the POPE-GQA subset, while GQA → MSCOCO denotes attribution based on the GQA dataset and evaluation on the MSCOCO dataset under the CHAIR benchmark. Despite calibrating on only part of the data, our framework typically delivers additional gains. The findings suggest that our improvements are driven by intrinsic generalization capacity, not by overfitting to a particular data distribution.

Figure 6. A sample generation based on the CHAIR benchmark.

15. More cases in LLaVA-Bench
More case studies on LLaVA-Bench are presented in Fig. 7, which demonstrate the effectiveness of our framework in hallucination mitigation. In particular, color and count attributes are given greater emphasis, thereby avoiding hallucinations in these aspects.

Figure 7. More examples on LLaVA-Bench.

16. GPT-4V evaluation prompt
Following VCD, the prompt for GPT-4V-aided evaluation is shown in Fig. 8. GPT-4V receives three types of LVLM responses and then generates its output. We collect the output from GPT-4V and report the average accuracy and detailedness.

17. Limitation and future work
Although our approach can be effectively ported to feature-steering methods and achieves strong hallucination mitigation, there is still room for development. Since existing feature steering techniques have not been evaluated on larger 70B-scale models, extending our method to 70B models remains a challenge. We aim to extend our framework to larger models and further investigate its impact across additional multimodal domains.

Table 13. Average POPE results under the Random, Popular, and Adversarial settings.
Setting      Model            Method          Accuracy↑  Precision↑  Recall↑  F1 Score↑
Random       LLaVA-v1.5-7B    Regular         85.37      80.77       93.22    86.47
                              VCD             86.55      84.02       90.69    87.16
                              AGLA            85.32      83.56       91.34    86.77
                              Nullu           86.35      84.36       91.09    86.28
                              VTI             84.84      80.02       93.36    86.08
                              LTS-FS(Nullu)   87.13      84.69       91.02    87.64
                              LTS-FS(VTI)     86.77      84.13       91.00    87.32
             LLaVA-v1.5-13B   Regular         81.91      75.84       93.82    83.85
                              VCD             82.27      75.97       92.68    83.76
                              AGLA            82.64      76.19       93.16    83.58
                              Nullu           83.24      77.93       92.89    84.73
                              VTI             84.08      76.29       93.04    83.82
                              LTS-FS(Nullu)   83.96      78.89       93.85    85.56
                              LTS-FS(VTI)     86.59      82.35       93.47    87.48
             Qwen-VL2.5-7B    Regular         85.32      96.38       73.57    84.03
                              VCD             85.94      97.13       74.11    83.89
                              AGLA            86.02      96.56       73.65    83.63
                              Nullu           85.82      97.17       73.93    83.73
                              VTI             85.49      96.85       73.51    83.37
                              LTS-FS(Nullu)   86.21      97.09       74.78    84.31
                              LTS-FS(VTI)     86.04      97.23       73.64    83.87
Popular      LLaVA-v1.5-7B    Regular         77.52      71.45       93.22    80.71
                              VCD             79.09      73.21       92.17    81.23
                              AGLA            78.67      75.39       89.02    81.47
                              Nullu           79.42      74.45       91.04    81.67
                              VTI             77.03      70.90       93.36    80.40
                              LTS-FS(Nullu)   80.09      75.28       91.07    82.20
                              LTS-FS(VTI)     79.96      75.25       91.14    82.25
             LLaVA-v1.5-13B   Regular         78.40      71.78       93.76    81.30
                              VCD             79.38      72.24       92.47    82.01
                              AGLA            80.11      72.88       92.16    82.32
                              Nullu           80.88      74.91       93.02    82.97
                              VTI             79.22      73.26       93.04    81.83
                              LTS-FS(Nullu)   81.46      75.62       92.93    83.42
                              LTS-FS(VTI)     81.77      75.58       93.47    83.58
             Qwen-VL2.5-7B    Regular         83.31      91.14       74.58    81.68
                              VCD             83.19      90.27       74.18    81.95
                              AGLA            83.34      90.69       74.53    81.86
                              Nullu           83.06      91.20       74.04    81.27
                              VTI             82.74      90.86       73.51    80.88
                              LTS-FS(Nullu)   83.59      91.12       76.18    82.55
                              LTS-FS(VTI)     83.35      90.96       73.64    81.38
Adversarial  LLaVA-v1.5-7B    Regular         70.13      64.14       93.22    75.85
                              VCD             71.48      66.28       89.62    76.04
                              AGLA            71.63      66.59       90.13    75.89
                              Nullu           71.57      66.06       90.53    76.17
                              VTI             69.40      63.46       93.36    75.42
                              LTS-FS(Nullu)   72.62      65.99       90.62    76.22
                              LTS-FS(VTI)     73.04      67.37       91.24    77.32
             LLaVA-v1.5-13B   Regular         71.07      64.60       93.83    76.47
                              VCD             71.73      63.61       94.23    75.76
                              AGLA            72.27      64.14       93.56    75.48
                              Nullu           72.43      66.08       92.44    77.04
                              VTI             71.77      65.58       93.01    76.80
                              LTS-FS(Nullu)   73.06      67.01       92.96    78.36
                              LTS-FS(VTI)     73.78      67.51       93.47    79.91
             Qwen-VL2.5-7B    Regular         80.17      85.21       73.64    78.93
                              VCD             80.56      85.31       75.07    79.51
                              AGLA            80.92      85.73       74.72    79.14
                              Nullu           80.74      86.32       74.24    79.32
                              VTI             80.19      85.75       73.51    78.70
                              LTS-FS(Nullu)   81.11      86.14       75.07    79.83
                              LTS-FS(VTI)     80.92      85.94       73.64    79.46

Description: AI that scores image description accuracy and detailedness.

Instructions: You are an AI designed to evaluate and score the performance of three AI assistants in describing a given image. Your primary focus is on the accuracy and detailedness of their descriptions. You will assess the accuracy by checking for hallucinations - any part of the description that is inconsistent with the image content. For detailedness, you will consider how rich the response is in necessary details, excluding any hallucinated parts. You will provide scores on a scale from 1 to 10 for each assistant separately, based on these criteria. After scoring, you will offer an explanation for your evaluation, ensuring it is free from bias and not influenced by the order of presentation of the responses.

Input format:
[Assistant 1]
{Response 1}
[End of Assistant 1]
[Assistant 2]
{Response 2}
[End of Assistant 2]
[Assistant 3]
{Response 3}
[End of Assistant 3]

Output format:
Accuracy:
Scores of the three answers:
Reason:
Detailedness:
Scores of the three answers:
Reason:

Figure 8. Prompt of GPT-4V Evaluation.
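The evaluation protocol of Fig. 8 can be sketched programmatically. The snippet below is a minimal illustration of wrapping three model responses in the prompt's input format and averaging the per-assistant scores returned by the judge; names such as build_prompt and average_scores are our own, and the GPT-4V API call itself is omitted.

```python
# Hypothetical sketch of the Fig. 8 evaluation pipeline (names are ours).

def build_prompt(responses):
    """Wrap three LVLM responses in the [Assistant k] ... format of Fig. 8."""
    parts = []
    for i, r in enumerate(responses, start=1):
        parts.append(f"[Assistant {i}]\n{r}\n[End of Assistant {i}]")
    return "\n".join(parts)

def average_scores(score_lists):
    """Average per-assistant scores over images.

    score_lists: one [s1, s2, s3] list per image, holding the accuracy
    (or detailedness) scores GPT-4V assigned to the three assistants.
    """
    n = len(score_lists)
    return [sum(scores[k] for scores in score_lists) / n for k in range(3)]
```

In practice each built prompt is sent to GPT-4V together with the image, the two score lines of the output format are parsed, and the averages are what Tab. 11-style accuracy/detailedness numbers report.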
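For reference, the four columns of Tab. 13 follow the standard binary-classification definitions over POPE's yes/no answers. The sketch below assumes "yes" (object present) is the positive class, as is conventional for POPE; the function name pope_metrics is ours.

```python
# Minimal sketch of the POPE metrics in Tab. 13 (assumes "yes" is positive).

def pope_metrics(predictions, labels):
    """predictions, labels: equal-length lists of 'yes'/'no' strings."""
    tp = sum(p == "yes" and l == "yes" for p, l in zip(predictions, labels))
    fp = sum(p == "yes" and l == "no" for p, l in zip(predictions, labels))
    tn = sum(p == "no" and l == "no" for p, l in zip(predictions, labels))
    fn = sum(p == "no" and l == "yes" for p, l in zip(predictions, labels))
    accuracy = (tp + tn) / len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

Under this convention, a model that over-answers "yes" inflates recall at the expense of precision, which is why the LLaVA rows show high recall but lower precision while the Qwen rows show the opposite pattern.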
