PruneHal: Reducing Hallucinations in Multi-modal Large Language Models through Adaptive KV Cache Pruning
Authors: Fengyuan Sun, Hui Chen, Xinhao Xu
Fengyuan Sun 1,2, Hui Chen 1, Xinhao Xu 1, Dandan Zheng 2, Jingdong Chen 2, Jun Zhou 2, Jungong Han 3, Guiguang Ding 1
1 School of Software, Tsinghua University  2 Ant Group  3 Department of Automation, Tsinghua University
sfy24@tsinghua.edu.cn

ABSTRACT

While multi-modal large language models (MLLMs) have made significant progress in recent years, hallucination remains a major challenge. To mitigate this phenomenon, existing solutions either introduce additional data for further training or incorporate external or internal information during inference. However, these approaches inevitably introduce extra computational costs. In this paper, we observe that hallucinations in MLLMs are strongly associated with insufficient attention allocated to visual tokens. In particular, the presence of redundant visual tokens disperses the model's attention, preventing it from focusing on the most informative ones. As a result, critical visual cues are often under-attended, which in turn exacerbates the occurrence of hallucinations. Building on this observation, we propose PruneHal, a training-free, simple yet effective method that leverages adaptive KV cache pruning to enhance the model's focus on critical visual information, thereby mitigating hallucinations. To the best of our knowledge, we are the first to apply token pruning for hallucination mitigation in MLLMs. Notably, our method requires no additional training and incurs nearly no extra inference cost. Moreover, PruneHal is model-agnostic and can be seamlessly integrated with different decoding strategies, including those specifically designed for hallucination mitigation.
We evaluate PruneHal on several widely used hallucination evaluation benchmarks using four mainstream MLLMs, achieving robust and outstanding results that highlight the effectiveness and superiority of our method. Our code will be publicly available.

1 INTRODUCTION

Recent advancements in multi-modal large language models (MLLMs) have led to significant breakthroughs, enabling these models to effectively tackle a wide range of complex visual tasks (Wang et al., 2024d; Bai et al., 2023; Achiam et al., 2023; Chen et al., 2024b; Dai et al., 2023; Liu et al., 2023b; 2024a; Lu et al., 2024; Yao et al., 2024). Despite MLLMs' strong performance across various visual tasks, their practicality remains limited by hallucinations. Hallucinations (Bai et al., 2024; Rohrbach et al., 2018; Li et al., 2023b) often cause MLLMs to output content that is inconsistent with the visual input, severely undermining their reliability. Several studies have explored mitigating hallucinations in MLLMs from different perspectives. Some works (Gunjal et al., 2024; Zhou et al., 2023; Liu et al., 2023a; Yu et al., 2024) focus on further training on specifically designed datasets, or on alignment, to reduce the model's propensity to generate hallucinated content, typically by leveraging additional annotated data or specialized supervision. In contrast, training-free approaches (Leng et al., 2024; Huang et al., 2024; Liu et al., 2024b; Wang et al., 2024c; Xu et al., 2025) aim to alleviate hallucinations during inference, often by designing specific decoding strategies. Although these methods have demonstrated effectiveness, they incur additional training costs or computational overhead during inference.
Figure 1: An example where the hallucinated word "bed", which is not present in the image, is correctly changed to "window" by amplifying the visual attention. More examples can be found in Appendix C.3.

In this paper, we first note that hallucinations in MLLMs are closely associated with the model's insufficient attention to visual tokens. Previous works (Chen et al., 2024a; Liu et al., 2024b) have demonstrated that in MLLMs, visual tokens constitute most of the input tokens, yet they receive little attention during the forward pass of self-attention (Vaswani et al., 2017). It is reasonable to hypothesize that the generation of hallucinated content is also associated with this phenomenon. To validate this intuitively, we conduct a qualitative experiment. Specifically, we directly amplify the visual attention in the attention maps during model inference, immediately before the model generates hallucinated content (details are provided in Appendix C.1). As shown in Fig. 1, increasing MLLMs' attention to visual information enables them to generate content that aligns with the image during inference, thereby reducing hallucinations. To illustrate this quantitatively, we present a comparison between the visual attention in the inference steps that lead to hallucinations and the average visual attention across all inference steps (details are provided in Appendix C.2). As illustrated in Fig.
2, it is clear that across the shallow, middle, and deep layers of LLaVA-v1.5-7B, the generation of hallucinated content is strongly associated with lower attention scores on visual tokens (intuitively, most of the blue scatter points lie above the red ones). This evidence reveals that the low attention scores received by visual tokens are potentially one of the causes of hallucinated content during decoding.

Figure 2: Average visual attention in the language model of LLaVA-v1.5-7B, shown for (a) Layer 1, (b) Layer 16, and (c) Layer 30. Each blue scatter point represents the mean attention score received by all visual tokens in a caption, averaged over all decoding steps. The red scatter points represent the mean attention values when hallucinated content is generated in captions. Attention scores are averaged over all attention heads. Images and captions are selected from the validation set of MSCOCO 2014.

Meanwhile, existing works (Chen et al., 2024a; Yang et al., 2025) have pointed out that visual tokens in MLLMs exhibit widespread redundancy, with only a small number of key tokens containing the critical visual information needed for inference. We reasonably hypothesize that hallucinations in MLLMs are highly correlated with visual token redundancy. In particular, the redundancy of visual tokens could disperse the model's attention and cause important tokens to be under-attended. This further reduces attention to critical and informative visual tokens, exacerbating the hallucination phenomenon.

Based on these insights, we propose PruneHal, a novel framework that leverages KV cache pruning to mitigate hallucinations in MLLMs.
Initially, we apply a simple KV cache pruning method that retains the top-K most important visual tokens. We show that this top-K-based approach is both simple and effective, enabling the model to focus on key visual tokens. However, this strategy struggles to determine the optimal time for pruning, often leading to excessive pruning. This, in turn, results in the loss of crucial visual information, negatively impacting the overall performance of the model. To address this, we propose an adaptive pruning strategy that tracks the historical visual attention distribution to prune redundant visual tokens more effectively. This allows PruneHal to achieve a balanced trade-off as inference progresses, mitigating hallucinations while preserving output quality.

To verify the effectiveness of the proposed method, we conducted extensive experiments on multiple widely-used MLLMs, including LLaVA (Liu et al., 2023b), InstructBLIP (Dai et al., 2023), and Qwen-VL (Bai et al., 2023), across various benchmarks and hallucination metrics. The results demonstrate that PruneHal significantly improves performance on a range of MLLMs, benchmarks, and datasets, showing favourable robustness. Additionally, we found that PruneHal is fully compatible with existing decoding strategies designed specifically for hallucination mitigation.

Our contributions can be summarized as follows:

1. Our experiments reveal a strong link between hallucinations in MLLMs and the insufficient attention given to key visual tokens. We further observe that redundant visual tokens siphon off a substantial portion of the model's attention, leaving key visual tokens under-attended.

2. To address the above issues, we introduce adaptive KV cache pruning and propose the PruneHal framework. PruneHal alleviates hallucinations in MLLMs at inference time while introducing almost no computational overhead.

3.
Experimental results show that PruneHal effectively mitigates MLLMs' hallucinations. When combined with existing approaches, PruneHal achieves state-of-the-art results on various hallucination benchmarks.

2 RELATED WORKS

2.1 HALLUCINATION IN MULTI-MODAL LARGE LANGUAGE MODELS

Hallucinations in MLLMs refer to the phenomenon where the generated textual content conflicts with the visual content in the input. Existing works have attempted to detect and address hallucinations in MLLMs from multiple perspectives. CHAIR (Rohrbach et al., 2018) requires large models to generate detailed captions for images and calculates the proportion of hallucinated content in the captions. POPE (Li et al., 2023b) turns the hallucination problem into binary classification, which can detect object hallucinations in MLLMs. Existing works mainly address hallucinations in MLLMs from two perspectives. On one hand, some approaches involve additional training or use external knowledge to guide the model. LURE (Zhou et al., 2023) trains an extra state detector that triggers the regeneration of detected hallucinated content by a revisor model; Woodpecker (Yin et al., 2024) introduces an additional visual model to monitor the original model and ask it to regenerate hallucinated content. On the other hand, some works focus on specially designed decoding strategies: OPERA (Huang et al., 2024) highlights the relationship between MLLMs' aggregation patterns and hallucinations, and mitigates hallucinations using an over-trust logit penalty and retrospection; VCD (Leng et al., 2024) points out that models' visual uncertainty may lead to hallucinations and proposes a contrastive decoding method to address this; DeCo (Wang et al., 2024c) suggests that MLLMs can correctly perceive visual content, and leverages information from shallow layers of the language models in MLLMs to guide the decoding process.
However, all of the above-mentioned works introduce additional computational overhead, which slows down model inference.

2.2 VISUAL TOKEN COMPRESSION IN MULTI-MODAL LARGE LANGUAGE MODELS

Previous works accelerate MLLM inference by compressing visual tokens. FastV (Chen et al., 2024a) first leveraged attention scores in the language models of MLLMs to prune redundant visual tokens. LLaVA-PruMerge (Shang et al., 2024) and VTC-CLS (Wang et al., 2024b) utilize information from the [CLS] token in the MLLM's visual encoder to prune and merge redundant visual tokens. VisionZip (Yang et al., 2025) picks a small proportion of key visual tokens based on attention scores extracted from the visual encoder and applies a merging strategy to retain the remaining information. Additionally, some works (Tao et al., 2025; Fu et al., 2024; Sun et al., 2025) have been specifically designed for video LLMs. These methods effectively accelerate MLLM inference.

3 METHODOLOGY

3.1 PRELIMINARIES

During MLLM inference, the language model performs a forward pass for each generated token. When the KV cache is available, the self-attention modules compute the attention map using only the last generated token as the query. During decoding, we denote the input sequence as X = [x_1, x_2, ..., x_n] ∈ R^{n×d}, where x_n ∈ R^{1×d} is the last generated token and d denotes the dimension of the hidden states. In the self-attention modules, for the i-th attention head, the attention map A^i ∈ R^{1×n} for the last generated query token is computed as:

    A^i = softmax( q_n^i (K_n^i)^T / sqrt(d_k) ),    (1)

where q_n^i ∈ R^{1×d_k} is the query vector for the last generated token, K_n^i is the key matrix of the i-th attention head, and d_k denotes the dimension of each attention head.
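As a concrete illustration of Eq. 1, a minimal NumPy sketch of the single-query attention computed at each cached decoding step might look as follows (our own illustration, not the authors' implementation; the function and array names are assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def last_token_attention(q_n, K):
    """Eq. (1): attention map of the newest token over all n cached keys.

    q_n: (1, d_k) query vector of the last generated token (one head).
    K:   (n, d_k) cached key matrix for the same head.
    Returns A^i of shape (1, n); the row sums to 1.
    """
    d_k = q_n.shape[-1]
    return softmax(q_n @ K.T / np.sqrt(d_k))
```

During decoding only this single row of the attention map is computed per head, which is what makes tracking the attention mass on visual tokens essentially free.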
Next, the key and value vectors k_n^i and v_n^i are concatenated with the corresponding KV cache, and the cache itself is updated:

    K_cache^i := concat([K_cache^i, k_n^i]),   V_cache^i := concat([V_cache^i, v_n^i]),    (2)

where K_cache^i, V_cache^i ∈ R^{n×d_k} represent the KV cache. We average the attention map A^i over all attention heads to obtain the average attention map A ∈ R^{1×n}, where each value in A represents the attention paid by the last generated token to the token at that position. The attention map A can also be partitioned as:

    A = {A_t, A_v, A_o},    (3)

where A_t, A_v, A_o denote the attention scores distributed to prompt text tokens, visual tokens, and output text tokens, respectively. Previous works (Chen et al., 2024a; Liu et al., 2024b) have highlighted that the attention scores in A_v are often very low, especially considering that the number of visual tokens is often large. In Sec. 1, we explored the connection between this phenomenon and hallucinations in MLLMs. In this section, we propose PruneHal, a training-free, plug-and-play framework to mitigate MLLMs' hallucinations. Specifically, in Sec. 3.2, we propose to leverage simple top-K-based KV cache pruning to guide MLLMs toward focusing on critical visual tokens. In Sec. 3.3, we introduce an adaptive design into our framework, which seeks to strike a balance between the loss of crucial visual information and the elimination of redundant visual tokens.

3.2 KV CACHE PRUNING REMOVES REDUNDANT VISUAL TOKENS TO ENHANCE FOCUS ON CRITICAL VISUAL INFORMATION

In this subsection, we argue that insufficient attention paid to critical visual tokens is a key contributor to hallucinations in MLLMs. Specifically, a large number of redundant visual tokens consume a substantial portion of the model's attention, leaving key visual information under-attended. Simply amplifying visual attention as in Fig.
1 not only leads MLLMs to generate uninformative outputs, but also exacerbates the disruption caused by redundant tokens. To address this, we naturally turn to KV cache pruning, which preserves the model's language prior while seamlessly removing redundant visual tokens. Among all visual tokens with attention scores A_v, we select the top-k tokens with the highest attention scores:

    I_v = TopK(A_v, k),   0 < k < N_v,    (4)

where N_v denotes the number of visual tokens. After obtaining the indices of the tokens to be retained in the previous decoding step, we perform KV cache pruning in the current step by keeping only the entries corresponding to those indices.

Figure 3: Average visual attention scores over all samples from the validation set of MSCOCO 2014, shown for Layer 1 of (a) LLaVA-v1.5-7B (Top-576 (12.5), Top-288 (11.9), Top-144 (10.2)), (b) Qwen-VL-7B (Top-256 (14.7), Top-192 (14.4), Top-128 (13.6)), and (c) InstructBLIP-7B (Top-32 (24.2), Top-24 (24.0), Top-16 (23.5)). The x-axis represents decoding steps, and the y-axis represents the average visual attention score. Top-K denotes retaining the K most attended visual tokens after pruning, corresponding to the curve of that color. The curves start from the second decoding step after pruning and end at the 50th decoding step. The numbers in parentheses represent the proportion of hallucinated objects among all mentioned objects; the higher the number, the more severe the hallucinations. More examples can be found in Appendix C.4.
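To make the selection-and-slicing step concrete, here is a minimal NumPy sketch of retaining the top-k attended visual tokens in the cache (our own illustration under the notation above; the function name and signature are assumptions, not the authors' code):

```python
import numpy as np

def prune_visual_kv(A_v, K_vis, V_vis, k):
    """Keep only the k most-attended visual tokens in the visual KV cache.

    A_v:   (N_v,) attention received by each visual token at one decoding step.
    K_vis: (N_v, d_k) cached keys of the visual tokens; V_vis likewise.
    Returns the retained indices I_v and the pruned key/value caches.
    """
    I_v = np.sort(np.argsort(A_v)[-k:])   # top-k indices, kept in original order
    return I_v, K_vis[I_v], V_vis[I_v]    # slice the cache rows to those indices
```

Subsequent decoding steps then attend only over the pruned cache, so the retained tokens receive a larger share of the attention mass.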
For convenience, we only show the case of a single attention head; the principle for multi-head attention is exactly the same:

    K'_cache = {K_text, K'_vis},  where K'_vis = K_vis[I_v, :],    (5)
    V'_cache = {V_text, V'_vis},  where V'_vis = V_vis[I_v, :],    (6)

where we write K_cache = {K_text, K_vis}, with K_text and K_vis corresponding to the KV cache for text and visual tokens, respectively. Here K_vis, V_vis ∈ R^{N_v×d} represent the full KV cache, while K'_vis, V'_vis ∈ R^{k×d} represent the pruned KV cache. Among all KV cache entries in storage, we retain the subset corresponding to the indices in I_v and discard the others.

As stated above, we conducted simple top-K-based KV cache pruning¹ on several different MLLMs: we recorded the indices to keep during the first decoding step, and performed pruning at the second decoding step. To validate the effectiveness of this approach, during each subsequent decoding step we recorded the visual attention distribution under different degrees of KV cache pruning, averaged over 100 randomly selected samples from the validation set of MSCOCO 2014. The resulting visual attention plots are shown in Fig. 3. As shown in Fig. 3, KV cache pruning removes redundant visual tokens, which increases the average attention scores of the remaining ones. Moreover, as more tokens are discarded and visual attention scores increase, hallucinations are gradually mitigated, demonstrating that pruning effectively guides MLLMs to focus on critical visual cues. This effect is consistent across all layers and models.
3.3 ADAPTIVE KV CACHE PRUNING STRIKES A TRADE-OFF BETWEEN PRESERVING VISUAL INFORMATION AND MITIGATING HALLUCINATIONS

While KV cache pruning can mitigate hallucinations, excessive pruning leads to substantial loss of visual information, which in turn inevitably degrades the performance of MLLMs (Wang et al., 2024a; Liu et al., 2024c; Zhang et al., 2023). Moreover, as shown in Fig. 3, while the number of tokens in the auto-regressive sequence keeps growing as decoding proceeds, the attention scores assigned to visual tokens gradually diminish, causing the model's focus on them to continuously decline. This highlights the need for a dynamic mechanism to sustain attention to visual tokens throughout decoding.

¹ We want to showcase the conceptual effectiveness of combining KV cache pruning with hallucination mitigation in MLLMs. More advanced KV cache pruning methods are also compatible with our framework, which we leave for future work.

Figure 4: Illustration of PruneHal. During decoding step i, the visual attention distribution is tracked and continuously monitored in subsequent steps. Once the visual attention of more than half of the layers falls below a predefined threshold, a pruning operation is performed.
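The trigger condition illustrated in Fig. 4 might be sketched as follows (a simplified stand-alone version we wrote for illustration; the full procedure, including the per-layer top-k pruning and bookkeeping, is given in Alg. 1):

```python
import numpy as np

def majority_vote_prune(curr_attn, hist_attn, r):
    """Decide whether to prune at the current decoding step.

    curr_attn: (N,) current average visual attention, one value per layer.
    hist_attn: (N,) tracked values recorded at the last pruning operation.
    r:         keep ratio; a layer votes "prune" when its average visual
               attention falls below sqrt(r) times its tracked value.
    Returns True when more than half of the layers vote to prune.
    """
    votes = curr_attn < np.sqrt(r) * hist_attn
    return int(votes.sum()) > len(votes) // 2
```

A separate counter capping the total number of pruning operations at t (as in Alg. 1) would prevent the cache from shrinking indefinitely.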
Therefore, we emphasize the need to dynamically strike a balance during decoding between preserving crucial visual information and eliminating redundant visual tokens, so as to maintain the model's performance. To this end, our PruneHal framework adaptively performs KV cache pruning, enhancing visual attention while avoiding the loss of crucial visual information caused by excessive pruning, thereby mitigating hallucinations while retaining MLLMs' performance to the maximum extent.

As illustrated in Fig. 4, we adopt a voting mechanism to decide whether to prune. Specifically, when a majority of layers indicate insufficient visual attention, a pruning operation is triggered to encourage the model to concentrate on the most informative visual tokens. Each pruning step removes a fixed proportion of the remaining tokens, and an upper bound is imposed on the total number of pruning operations to prevent excessive loss of visual information. At the first decoding step, we record the average attention score of all remaining visual tokens in each layer, i.e., avg(A^1_{v_i}) for i = 1, 2, ..., N, where N is the number of layers in the language model. In the second step, these recorded values are used to guide the first pruning operation, as stated in Eqs. 4-6.

Algorithm 1: Our PruneHal framework
Input: keep ratio r; max pruning times t; KV cache K_vis, V_vis ∈ R^{N_v×d}; visual token count n.
Input: {A^m_{v_i}} for i = 1..N, m = 1, 2, ...; i: layer index, m: decoding step.
 1: prune_cnt ← 0; {Ā_i} for i = 1..N ← {A^1_{v_i}} for i = 1..N
 2: for each decoding step m do
 3:   if prune_cnt = t then
 4:     continue
 5:   I_l = { i | A^m_{v_i} < sqrt(r) · Ā_i }        // layer vote
 6:   if |I_l| ≥ N/2 or m = 2 then
 7:     for i = 1 to N do
 8:       I_v = TopK(A^m_{v_i}, r × n) ⊂ {1, 2, ..., n}
 9:       K_vis ← K_vis[I_v, :]
10:       V_vis ← V_vis[I_v, :]                      // perform KV cache pruning
11:       Ā_i ← A^{m-1}_{v_i}                        // update parameters
12:     prune_cnt ← prune_cnt + 1                    // update prune count
13:     n ← r × n                                    // update visual token count

Our full framework is described in Alg. 1. After each pruning operation, the tracked historical visual attention distribution is immediately refreshed to reflect the most recent state. We predefine the retention ratio for KV cache pruning as r, which controls the fraction of visual tokens preserved at each pruning step. In subsequent decoding steps, the framework continuously monitors the average visual attention scores across layers. Whenever more than half of the layers show that their current average visual attention has dropped below sqrt(r) times the historical value, a pruning operation is triggered. During this process, only the top-r fraction of the remaining visual tokens with the highest attention scores is retained in the KV cache, while the rest are discarded. By performing pruning dynamically, we minimize the loss of crucial visual information while maintaining the model's focus on key visual tokens. In this way, the model's all-round ability is preserved, while hallucinations are effectively suppressed.

4 EXPERIMENTS

4.1 SETUP

Baselines. We first apply PruneHal with various decoding methods, including greedy decoding, nucleus sampling, and beam search. Besides, following previous work (Wang et al., 2024c), we also integrate PruneHal with various decoding strategies for mitigating hallucination, including DoLa (Chuang et al., 2023), VCD (Leng et al., 2024), OPERA (Huang et al., 2024), and DeCo (Wang et al., 2024c). For all aforementioned decoding strategies, we use the default hyperparameters from the source code for fair comparison.

Model selection.
Building upon the experiments of previous works (Huang et al., 2024; Wang et al., 2024c), and to evaluate the generalizability of our method, we conduct experiments on the following models: LLaVA-v1.5-7B, LLaVA-v1.5-13B (Liu et al., 2023b), InstructBLIP-7B (Dai et al., 2023), and Qwen-VL-7B (Bai et al., 2023). These models include both MLP and Q-Former (Li et al., 2023a) connectors, as well as variants of different sizes, which allows for a comprehensive demonstration of our method's effectiveness and robustness.

Settings. We conduct all our experiments on a single NVIDIA H20 GPU. Due to the diverse characteristics of different MLLMs (including significant variations in the number of visual tokens and the proportion of attention they receive), we specify tailored parameters for each model. For LLaVA-v1.5-7B and LLaVA-v1.5-13B, we set r = 0.4 and t = 3; for InstructBLIP-7B, we set r = 0.7 and t = 2; for Qwen-VL-7B, we set r = 0.9 and t = 4.

Benchmarks and Metrics. Since our framework only affects the decoding phase of MLLM inference and leaves the prefilling phase unchanged, we cannot use metrics that require the model to answer only yes or no, such as POPE (Li et al., 2023b), since the model's responses would remain unchanged. We use CHAIR (Rohrbach et al., 2018), AMBER (Wang et al., 2023), and GPT-4V-assisted evaluation as hallucination metrics to assess our framework, with detailed settings provided in Appendix A.

4.2 MAIN RESULTS

As shown in Tab. 1, across all models and decoding methods, our proposed PruneHal improves the performance of the models on both the CHAIR_S and CHAIR_I metrics. When integrated with existing decoding strategies for hallucination mitigation, our method further improves their performance.
For example, on LLaVA-v1.5-7B, the CHAIR_S metric shows that PruneHal reduces the proportion of hallucinated sentences by 21.1%, 22.9%, and 25.0% under greedy, nucleus, and beam search, respectively. When combined with DeCo, the current state-of-the-art approach, it achieves an additional improvement of 23.9%, 16.7%, and 18.0% over DeCo alone under greedy, nucleus, and beam search, respectively, providing strong evidence for the effectiveness of our method.

On the AMBER dataset, as shown in Tab. 2, after applying our method the models' performance on the three hallucination-related metrics (CHAIR, Hal, and Cog) shows a significant improvement, demonstrating that our method enhances the accuracy and reliability of the models' outputs.

As for the GPT-4V-assisted evaluation, as shown in Tab. 3, our method significantly improves the correctness of the model outputs while maintaining their detailedness. On LLaVA-v1.5-7B, PruneHal improves the correctness score from 6.04 to 6.98, demonstrating a significant enhancement in the truthfulness of the model's output. The improvement in correctness indicates a successful reduction in MLLMs' hallucinations, while the preservation of detailedness highlights that PruneHal does not compromise the diversity of the model's responses.

4.3 MODEL ANALYSIS

Analysis on hyperparameters. We conducted experiments with different hyperparameter settings; the results, shown in Appendix B, demonstrate the robustness of our framework.

Table 1: CHAIR object hallucination evaluation results. Lower scores mean fewer hallucinations.
Decoding      Method              LLaVA-v1.5-7B      InstructBLIP-7B    Qwen-VL-7B         LLaVA-v1.5-13B
                                  CHAIR_S  CHAIR_I   CHAIR_S  CHAIR_I   CHAIR_S  CHAIR_I   CHAIR_S  CHAIR_I
Greedy        Vanilla             44.6     12.5      60.0     24.2      53.6     14.7      44.2     12.1
              PruneHal            35.2     10.0      52.8     23.3      52.4     13.5      34.6      9.4
              DoLa (ICLR'24)      43.6     12.7      48.6     16.0      58.4     16.4      46.4     15.3
              DoLa + PruneHal     36.2     10.5      44.3     14.7      51.4     13.8      39.0     13.6
              DeCo (ICLR'25)      36.8     10.3      41.0     14.4      48.0     12.8      41.0     11.1
              DeCo + PruneHal     28.0      8.7      38.8     13.5      40.8     11.4      32.6      9.7
Nucleus       Vanilla             53.2     15.3      57.8     25.7      54.4     14.8      53.8     14.8
              PruneHal            41.0     12.4      49.4     25.2      52.8     14.1      45.2     13.2
              VCD (CVPR'24)       54.6     16.3      58.0     16.9      54.0     12.9      53.6     14.6
              VCD + PruneHal      41.2     12.4      54.0     14.8      51.2     11.7      42.4     11.6
              DeCo                40.8     11.0      45.2     13.4      48.6     13.0      41.8     11.9
              DeCo + PruneHal     34.0      9.8      40.6     11.7      46.6     12.4      33.8     10.2
Beam Search   Vanilla             48.8     13.9      54.0     15.4      42.2     10.5      47.8     13.7
              PruneHal            36.6     10.4      49.4     14.1      38.4      9.8      38.0     10.7
              OPERA (CVPR'24)     45.2     13.2      44.6     14.2      34.6      9.2      35.6     11.2
              OPERA + PruneHal    35.2     10.2      39.8     12.8      31.0      8.6      29.6      9.5
              DeCo                35.6      9.5      44.5     12.8      35.8      9.7      38.0     10.7
              DeCo + PruneHal     29.2      8.0      41.8     11.5      33.2      9.3      31.2      9.0

Table 2: Results on the AMBER image captioning dataset. Lower CHAIR, Hal, and Cog values indicate better truthfulness. G., N., and B. represent Greedy, Nucleus, and Beam Search, respectively.

Dec.  Method     LLaVA-v1.5-7B       InstructBLIP-7B     Qwen-VL-7B          LLaVA-v1.5-13B
                 CHAIR  Hal   Cog    CHAIR  Hal   Cog    CHAIR  Hal   Cog    CHAIR  Hal   Cog
G.    Vanilla     6.5   30.6  3.3    20.6   60.5  8.0    11.0   52.5  6.1     6.2   29.3  3.0
      PruneHal    6.2   26.7  2.9    20.0   58.6  7.2    10.5   50.8  5.6     6.0   28.0  2.9
N.    Vanilla     9.4   40.3  4.2    23.8   67.9  8.4    12.0   53.2  5.6     9.0   38.5  3.5
      PruneHal    8.8   37.1  4.0    22.6   64.7  7.3    11.5   51.2  5.1     8.6   37.2  3.3
B.    Vanilla     7.8   30.5  3.5    11.0   46.0  5.7     6.7   32.5  3.1     7.3   30.0  3.5
      PruneHal    7.4   28.4  3.2    10.0   45.2  5.2     6.5   31.9  3.0     7.0   26.7  3.2

Analysis on adaptive module design. The adaptive module balances the loss of crucial visual information against attention to key tokens. Excessive pruning harms the detailedness and diversity of outputs, while insufficient pruning weakens hallucination mitigation. Experiments against simple KV cache pruning at different ratios confirm this trade-off. We evaluate hallucinations with CHAIR, and diversity with the GPT-4V Detailedness metric and MM-Vet (Yu et al., 2023).

Table 3: GPT-4V-assisted evaluation results. Two aspects are verified: correctness (C) and detailedness (D). Higher correctness indicates less hallucination. LLaVA-7B, BLIP-7B, Qwen-7B, and LLaVA-13B denote LLaVA-v1.5-7B, InstructBLIP-7B, Qwen-VL-7B, and LLaVA-v1.5-13B, respectively.

Model      LLaVA-7B     BLIP-7B      Qwen-7B      LLaVA-13B
           C     D      C     D      C     D      C     D
Greedy     6.04  6.76   5.88  5.94   6.82  6.68   6.58  7.02
PruneHal   6.98  6.73   6.04  5.96   7.10  6.72   6.84  7.04

Table 4: Ablation on adaptive module design. For Conservative and Aggressive, pruning is applied once at the second decoding step. Conservative keeps ratio r, while Aggressive keeps ratio r^t, which equals the maximum pruning extent for each model.
Method         LLaVA-v1.5-7B                         InstructBLIP-7B
               CHAIR_S  CHAIR_I  GPT4V-D  MM-Vet    CHAIR_S  CHAIR_I  GPT4V-D  MM-Vet
Vanilla        44.6     12.5     6.76     28.3      60.0     24.2     5.94     24.1
Conservative   40.8     11.9     6.72     29.5      56.8     23.9     5.97     23.6
Aggressive     37.2     11.1     6.36     26.2      54.8     22.9     5.74     22.5
PruneHal       35.2     10.0     6.73     29.4      52.8     23.3     5.96     23.7

As shown in Tab. 4, conservative pruning yields high CHAIR scores (indicating persistent hallucinations), while aggressive pruning reduces hallucinations but degrades GPT4V-D and MM-Vet. In contrast, PruneHal strikes a balanced trade-off, achieving improvements on both kinds of metrics.

Figure 5: Average visual token attention and hallucinations in LLaVA-v1.5-7B under (a) greedy, (b) nucleus, and (c) beam search. Blue, orange, and green denote PruneHal (top-attended tokens), random tokens, and least-attended tokens, respectively. Lines show average attention scores over 500 samples across decoding steps; bars report CHAIR_S and CHAIR_I, where lower values indicate fewer hallucinations.

Analysis on visual attention scores. In PruneHal, we retain the most attended visual tokens. To study the link between visual attention and hallucinations, we compare against retaining random tokens or the least attended ones, finding that lower attention leads to more hallucinations. As shown in Fig. 5, pruning by retaining the visual tokens with the lowest attention scores significantly increases the proportion of hallucinated vocabulary (i.e., CHAIR_I) in the generated content.
Concretely, under greedy, nucleus, and beam search, CHAIR_I increases by 58.7% (10.9 → 17.3), 138.7% (12.4 → 29.6), and 53.8% (10.4 → 16.0), respectively. This further underscores the importance of retaining highly-attended key visual tokens.

Latency analysis. Our method adds negligible overhead compared to baselines. On 100 random MSCOCO-2014 validation samples, we measured the average forward-pass latency over all output tokens with default settings. Under beam search, our approach even accelerates inference, since pruning reduces the heavy computational cost of forward propagation. In contrast, Deco and DoLa introduce extra overhead from intermediate-layer processing, while VCD and OPERA require multiple forward passes, yielding a marked decrease in inference speed. Our method only involves lightweight tensor operations and even reduces FLOPs, making it the most efficient.

Per-token latency (ms/token):
            Greedy          Nucleus         Beam Search
Vanilla     19.5            20.8            29.2
Ours        21.2            21.7            21.9
Baseline    23.3 (Deco)     40.7 (VCD)      168.0 (OPERA)
Figure 6: Per-token latency for various methods on LLaVA-v1.5-7B.

5 CONCLUSION

In this paper, we reveal the strong correlation between low attention to visual tokens in MLLMs and object hallucination, and propose to leverage KV cache pruning to reduce such issues. We introduce PruneHal, a dynamic KV cache pruning framework that effectively mitigates hallucinations across multiple mainstream models, while remaining compatible with decoding strategies specifically designed for hallucination mitigation, further enhancing their performance. Moreover, our method is training-free and introduces virtually no computational overhead, enabling a seamless and cost-free reduction of hallucinations in MLLMs.

REFERENCES

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al.
Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.

Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, and Mike Zheng Shou. Hallucination of multimodal large language models: A survey. arXiv preprint arXiv:2404.18930, 2024.

Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In European Conference on Computer Vision, pp. 19–35. Springer, 2024a.

Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint, 2024b.

Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James Glass, and Pengcheng He. Dola: Decoding by contrasting layers improves factuality in large language models. arXiv preprint arXiv:2309.03883, 2023.

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in Neural Information Processing Systems, 36:49250–49267, 2023.

Tianyu Fu, Tengxuan Liu, Qinghao Han, Guohao Dai, Shengen Yan, Huazhong Yang, Xuefei Ning, and Yu Wang. Framefusion: Combining similarity and importance for video token reduction on large visual language models. arXiv preprint arXiv:2501.01986, 2024.

Anisha Gunjal, Jihan Yin, and Erhan Bas. Detecting and preventing hallucinations in large vision language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp. 18135–18143, 2024.
Qidong Huang, Xiaoyi Dong, Pan Zhang, Bin Wang, Conghui He, Jiaqi Wang, Dahua Lin, Weiming Zhang, and Nenghai Yu. Opera: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13418–13427, 2024.

Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, and Lidong Bing. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13872–13882, 2024.

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pp. 19730–19742. PMLR, 2023a.

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023b.

Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Mitigating hallucination in large multi-modal models via robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023a.

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023b.

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 26296–26306, 2024a.

Shi Liu, Kecheng Zheng, and Wei Chen. Paying more attention to image: A training-free method for alleviating hallucination in lvlms. In European Conference on Computer Vision, pp. 125–140. Springer, 2024b.
Zuyan Liu, Benlin Liu, Jiahui Wang, Yuhao Dong, Guangyi Chen, Yongming Rao, Ranjay Krishna, and Jiwen Lu. Efficient inference of vision instruction-following models with elastic cache. In European Conference on Computer Vision, pp. 54–69. Springer, 2024c.

Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, et al. Deepseek-vl: Towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525, 2024.

Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning. arXiv preprint arXiv:1809.02156, 2018.

Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, and Yan Yan. Llava-prumerge: Adaptive token reduction for efficient large multimodal models. arXiv preprint arXiv:2403.15388, 2024.

Fengyuan Sun, Leqi Shen, Hui Chen, Sicheng Zhao, Jungong Han, and Guiguang Ding. Adatp: Attention-debiased token pruning for video large language models. arXiv preprint arXiv:2505.20100, 2025.

Keda Tao, Can Qin, Haoxuan You, Yang Sui, and Huan Wang. Dycoke: Dynamic compression of tokens for fast video large language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 18992–19001, 2025.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

Ao Wang, Hui Chen, Jiaxin Li, Jianchao Tan, Kefeng Zhang, Xunliang Cai, Zijia Lin, Jungong Han, and Guiguang Ding. Prefixkv: Adaptive prefix kv cache is what vision instruction-following models need for efficient generation. arXiv preprint, 2024a.

Ao Wang, Fengyuan Sun, Hui Chen, Zijia Lin, Jungong Han, and Guiguang Ding. [cls] token tells everything needed for training-free efficient mllms. arXiv preprint arXiv:2412.05819, 2024b.
Chenxi Wang, Xiang Chen, Ningyu Zhang, Bozhong Tian, Haoming Xu, Shumin Deng, and Huajun Chen. Mllm can see? dynamic correction decoding for hallucination mitigation. arXiv preprint arXiv:2410.11779, 2024c.

Junyang Wang, Yuhang Wang, Guohai Xu, Jing Zhang, Yukai Gu, Haitao Jia, Jiaqi Wang, Haiyang Xu, Ming Yan, Ji Zhang, et al. Amber: An llm-free multi-dimensional benchmark for mllms hallucination evaluation. arXiv preprint arXiv:2311.07397, 2023.

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint, 2024d.

Xinhao Xu, Hui Chen, Mengyao Lyu, Sicheng Zhao, Yizhe Xiong, Zijia Lin, Jungong Han, and Guiguang Ding. Mitigating hallucinations in multi-modal large language models via image token attention-guided decoding. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 1571–1590, 2025.

Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. Visionzip: Longer is better but not necessary in vision language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 19792–19802, 2025.

Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800, 2024.

Shukang Yin, Chaoyou Fu, Sirui Zhao, Tong Xu, Hao Wang, Dianbo Sui, Yunhang Shen, Ke Li, Xing Sun, and Enhong Chen. Woodpecker: Hallucination correction for multimodal large language models. Science China Information Sciences, 67(12):220105, 2024.
Tianyu Yu, Yuan Yao, Haoye Zhang, Taiwen He, Yifeng Han, Ganqu Cui, Jinyi Hu, Zhiyuan Liu, Hai-Tao Zheng, Maosong Sun, et al. Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13807–13816, 2024.

Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023.

Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2o: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36:34661–34710, 2023.

Yiyang Zhou, Chenhang Cui, Jaehong Yoon, Linjun Zhang, Zhun Deng, Chelsea Finn, Mohit Bansal, and Huaxiu Yao. Analyzing and mitigating object hallucination in large vision-language models. arXiv preprint arXiv:2310.00754, 2023.

A EXPERIMENTAL DETAILS

A.1 CHAIR

Caption Hallucination Assessment with Image Relevance (CHAIR) is the most widely used hallucination identification metric. CHAIR measures what proportion of the objects mentioned in generated text are actually present in the image, according to the ground-truth sentences and object segmentations, and evaluates object hallucination both at the instance level (CHAIR_I) and at the sentence level (CHAIR_S), as shown in Eq. 7. We follow the same experimental settings as OPERA (Huang et al., 2024) and Deco (Wang et al., 2024c), using the same 500 images from the MSCOCO 2014 validation dataset, with the prompt "Please describe this image in detail.".

CHAIR_I = |{hallucinated objects}| / |{all mentioned objects}|,
CHAIR_S = |{captions with hallucinated objects}| / |{all captions}|.    (7)
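Eq. 7 translates directly into code. The following is a minimal sketch assuming object mentions have already been extracted and normalized per caption; the synonym matching that the actual CHAIR toolkit performs is omitted, and the function and argument names are ours:

```python
def chair_scores(caption_objects, gt_objects):
    """Compute CHAIR_I and CHAIR_S as in Eq. 7.

    caption_objects: list of sets, objects mentioned in each caption.
    gt_objects: list of sets, ground-truth objects for each image.
    """
    hallucinated = mentioned = bad_captions = 0
    for mention, gt in zip(caption_objects, gt_objects):
        halluc = mention - gt              # mentioned but not in the image
        hallucinated += len(halluc)
        mentioned += len(mention)
        bad_captions += bool(halluc)       # caption has >= 1 hallucination
    chair_i = hallucinated / max(mentioned, 1)
    chair_s = bad_captions / max(len(caption_objects), 1)
    return chair_i, chair_s
```

For instance, two captions mentioning {dog, fork} and {cat} against ground truths {dog} and {cat, sofa} yield CHAIR_I = 1/3 (one hallucinated object out of three mentions) and CHAIR_S = 1/2 (one of two captions hallucinated).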
A.2 AMBER

AMBER is an LLM-free, multi-dimensional benchmark designed to evaluate existence, attribute, and relation hallucinations. It includes four metrics: CHAIR, Cover, Hal, and Cog. Among these, Cover evaluates the comprehensiveness of model outputs, while the others assess object hallucinations from different perspectives. In our experiments, we select CHAIR, Hal, and Cog, the three hallucination-related metrics, to assess MLLMs' hallucinations.

After generating responses, AMBER first extracts nouns from the sentences using language toolkits (e.g., NLTK). The proportion of extracted nouns that do not appear in the annotated words is reported as the CHAIR metric. The Hal metric measures the proportion of responses containing hallucinations, while the Cog metric evaluates whether these hallucinations resemble those found in human cognition: by leveraging a set of hallucinatory target objects, the likelihood of MLLMs generating these objects is computed.

The AMBER dataset contains 1,004 images spanning 14 major categories, such as Nature, Architecture, and Street View. The distribution across categories is fairly balanced, without a long-tail phenomenon. The prompt used for the evaluation is: "Describe this image.".

A.3 GPT-4V ASSISTED EVALUATION

Following prior work (Huang et al., 2024), we conduct an open-ended evaluation with GPT-4V on 500 randomly sampled COCO images. GPT-4V compares the outputs of two assistants with respect to Correctness (C) (i.e., truthfulness) and Detailedness (D) (i.e., richness). The two answers, from the vanilla model and the model paired with our PruneHal framework, are presented to GPT-4V simultaneously for a fair comparison, and it is asked to give a judgment from 1 to 10 points for each metric. The prompt is provided in Tab. 5.
GPT-4V prompt:

You are required to score the performance of two AI assistants in describing a given image. You should pay extra attention to hallucination, which refers to the parts of a description that are inconsistent with the image content, such as claiming the existence of something not present in the image or incorrectly describing the counts, positions, or colors of objects in the image. Please rate the responses of the assistants on a scale of 1 to 10, where a higher score indicates better performance, according to the following criteria:
1: Accuracy: whether the response is accurate with respect to the image content. Responses with fewer hallucinations should be given higher scores.
2: Detailedness: whether the response is rich in necessary details. Note that hallucinated descriptions should not count as necessary details.
Please output the scores for each criterion, containing only two values indicating the scores for Assistant 1 and 2, respectively. The two scores are separated by a space. Following the scores, please provide an explanation of your evaluation, avoiding any potential bias and ensuring that the order in which the responses were presented does not affect your judgment.
[Assistant 1] {} [End of Assistant 1]
[Assistant 2] {} [End of Assistant 2]
Output format:
Accuracy: <Scores of the two answers> Reason:
Detailedness: <Scores of the two answers> Reason:

Table 5: Prompt for GPT-4V assisted evaluation.

B ABLATION STUDIES ON HYPERPARAMETERS

We conduct hyperparameter ablation experiments on LLaVA-v1.5-7B and InstructBLIP-7B, using the CHAIR metrics on the same 500 images from the MSCOCO 2014 validation dataset. The results show that both parameters of PruneHal (r and t) exhibit strong robustness and consistently reduce hallucinations within a reasonable range. The results are shown in Tab. 6.

Table 6: Ablation study on hyperparameters.
Our selected hyperparameters and the best results are highlighted in bold.

LLaVA-v1.5-7B     Greedy        Nucleus       Beam Search
                  C_S↓  C_I↓    C_S↓  C_I↓    C_S↓  C_I↓
Vanilla           44.6  12.5    53.2  15.3    48.8  13.9
r=0.4, t=3        35.2  10.0    41.0  12.4    36.6  10.4
r=0.4, t=2        38.4  11.1    41.2  12.3    38.2  10.7
r=0.5, t=3        38.2  10.5    41.6  12.8    38.6  10.7
r=0.5, t=2        41.8  11.6    47.6  13.7    41.2  12.3
r=0.3, t=3        37.4  10.8    43.2  14.0    36.2  10.7
r=0.3, t=2        36.7  10.9    43.2  10.8    36.6  10.8

InstructBLIP-7B   Greedy        Nucleus       Beam Search
                  C_S↓  C_I↓    C_S↓  C_I↓    C_S↓  C_I↓
Vanilla           60.0  24.2    57.8  25.7    54.0  15.4
r=0.7, t=2        52.8  23.3    49.4  25.2    49.4  14.1
r=0.7, t=3        52.9  23.4    50.3  25.1    50.1  14.8
r=0.8, t=2        57.0  22.9    50.2  24.4    53.2  14.7
r=0.8, t=3        56.2  23.7    54.4  26.0    51.4  14.7
r=0.6, t=2        57.4  23.8    51.2  23.5    53.2  15.1
r=0.6, t=3        59.8  24.7    56.9  25.4    51.8  15.2

C DETAILED EXPERIMENTAL SETTINGS AND MORE EXAMPLES IN SECTION 1

For the experiments in Fig. 1 and Fig. 2, following (Huang et al., 2024), the images are selected from the 500-image subset of the MSCOCO 2014 validation dataset, and all experiments are conducted on LLaVA-v1.5-7B. The prompts and images are exactly the same as in Appendix A.1. We first obtain the generated results as image-caption pairs, and select the captions containing hallucinatory outputs.

C.1 DETAILED SETTINGS FOR THE QUALITATIVE EXPERIMENT

For the experiment in Fig. 1, we truncate the generated text up to (but not including) the hallucinated word, and feed it into the model together with the original instruction. In each layer's self-attention module of the language model, after computing the attention map, we double the attention values corresponding to visual tokens at the position of the last token. Once the forward pass produces an output token, we append it to the instruction.
For the following tokens, no further amplification of attention is applied, and the computation proceeds normally.

C.2 DETAILED SETTINGS FOR THE QUANTITATIVE EXPERIMENT

For the experiment in Fig. 2, we select all image-caption pairs containing hallucinatory outputs (223 out of 500). For captions containing hallucinations, we track visual attention scores throughout the entire generation process. At each decoding step, we compute the average attention score from the generated token to all visual tokens. After collecting the data for all tokens, we calculate both the overall average across all tokens and the average over the hallucinated tokens specifically (each shown as a single red scatter point in Fig. 2).

C.3 MORE EXAMPLES FOR THE QUALITATIVE EXPERIMENT

We provide more examples for the qualitative experiment of Sec. 1 in Fig. 7.

C.4 MORE EXAMPLES FOR EXPERIMENTS IN SEC. 3.2

In Fig. 3, we presented examples from the first layer of the language models in various MLLMs. In Fig. 8, we additionally report results from middle and deep layers of the 32-layer language models. These layers exhibit the same phenomenon as Fig. 3, highlighting the generalizability of our findings.

D CASES

In this section, we present several cases where our proposed PruneHal successfully mitigates hallucinations in MLLMs. By contrasting these successful cases with the outputs of the vanilla model, we demonstrate that PruneHal not only suppresses hallucinations but also preserves the overall descriptive quality of the generated responses. We conduct experiments under greedy search for every model. The cases for LLaVA-v1.5-7B, Qwen-VL-7B, InstructBLIP-7B, and LLaVA-v1.5-13B are shown in Figs. 9–12, respectively.
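For reference, the per-step statistic described in Appendix C.2 can be sketched as follows. This is a minimal NumPy illustration under our own assumptions (the function name, the shape convention, and the uniform averaging over heads are not specified by the paper):

```python
import numpy as np

def avg_visual_attention(attn_maps, vis_slice):
    """Mean attention mass on visual tokens at each decoding step.

    attn_maps: list over decoding steps; each entry is an
        (n_heads, seq_len) array of attention weights from the
        newly generated token to all cached positions.
    vis_slice: slice covering the visual-token positions.
    Returns one scalar per decoding step.
    """
    # Average over heads and over the visual positions jointly.
    return [float(a[:, vis_slice].mean()) for a in attn_maps]
```

Averaging these per-step scalars over all generated tokens, or over the hallucinated tokens only, gives the two red scatter points described for Fig. 2.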
[Figure 7: three qualitative cases on LLaVA-v1.5-7B, each prompted with "Please describe this image in detail." In every case, amplifying visual attention removes a hallucinated object from the description: a fork on the table becomes a camera, a remote control becomes a window, and a chair and wall clock become a rug on the floor.]
Figure 7: Additional examples for qualitative experiments in Sec. 1.
[Figure 8: average visual attention (avg_vis_attn) across decoding steps for Top-k visual tokens; panels (a)-(c) show Layer 16 and panels (d)-(f) show Layer 30 of LLaVA-v1.5-7B (Top-576/144/36), InstructBLIP-7B (Top-32/16/8), and Qwen-VL-7B (Top-256/128/64).]
Figure 8: Examples from middle and deep layers for experiments in Sec. 3.2.

[Figure 9: two image-captioning cases prompted with "Please describe this image in detail." PruneHal removes hallucinated objects (a parked car and a handbag in one case; a bow and a truck in the other) while preserving the rest of the description.]
Figure 9: Cases for LLaVA-v1.5-7B.

[Figure 10: two cases in which PruneHal drops hallucinated items, such as an electronic device on a bathroom floor and extra utensils (spoons, a knife, a book) around a birthday cake.]
Figure 10: Cases for Qwen-VL-7B.

[Figure 11: two cases in which PruneHal removes hallucinated cars parked near a pier and produces a more grounded description of a canal scene.]
Figure 11: Cases for InstructBLIP-7B.

[Figure 12: two cases in which PruneHal removes a hallucinated dining table and person from a meeting-room description, and hallucinated birds from a giraffe scene.]
Figure 12: Cases for LLaVA-v1.5-13B.