Do Not Leave a Gap: Hallucination-Free Object Concealment in Vision-Language Models


Authors: Amira Guesmi, Muhammad Shafique

Engineering Division, New York University Abu Dhabi, UAE

Abstract. Vision-language models (VLMs) have recently shown remarkable capabilities in visual understanding and generation, but remain vulnerable to adversarial manipulations of visual content. Prior object-hiding attacks primarily rely on suppressing or blocking region-specific representations, often creating semantic gaps that inadvertently induce hallucination, where models invent plausible but incorrect objects. In this work, we demonstrate that hallucination arises not from object absence per se, but from the semantic discontinuity introduced by such suppression-based attacks. We propose a new class of background-consistent object concealment attacks, which hide target objects by re-encoding their visual representations to be statistically and semantically consistent with surrounding background regions. Crucially, our approach preserves token structure and attention flow, avoiding the representational voids that trigger hallucination. We present a pixel-level optimization framework that enforces background-consistent re-encoding across multiple transformer layers while preserving global scene semantics. Extensive experiments on state-of-the-art vision-language models show that our method effectively conceals target objects while preserving up to 86% of non-target objects and reducing grounded hallucination by up to 3x compared to attention-suppression-based attacks. Qualitative results further confirm that our approach maintains scene coherence and avoids spurious object insertion. Our findings highlight semantic continuity as a key factor in hallucination behavior and introduce a new direction for adversarial analysis of generative multimodal models.
Keywords: VLM · Hallucination · Object concealment · Adversarial attacks

1 Introduction

Vision–Language Models (VLMs) have become foundational components in modern AI systems, enabling image captioning, visual question answering, and multimodal reasoning [1, 7, 13]. As these models are increasingly deployed in sensitive settings, there is growing interest in object concealment and visual privacy attacks, where specific regions or objects in an image are intentionally hidden from the model's perception while preserving the overall semantic integrity of the scene [6, 10, 12]. A dominant class of existing object concealment attacks operates by suppressing visual information corresponding to a target region, for example by masking pixels, erasing patches, or explicitly reducing attention weights associated with region-of-interest (ROI) tokens [3, 12, 16]. While such approaches can successfully prevent direct recognition of the target object, they frequently induce a secondary and often overlooked failure mode: hallucination. When queried about the scene, the model compensates for missing visual evidence by inventing objects, attributes, or relationships that were never present in the original image [4, 9].

Fig. 1: Overview of Background-Consistent Re-encoding (BCR). A pixel-level perturbation $\delta$ is optimized to produce an adversarial image $x_{adv}$ from a clean image $x$ with a specified ROI. A frozen vision encoder extracts layer-wise hidden states, from which ROI and background tokens are identified via patch embedding. BCR enforces semantic continuity by (i) aligning ROI and background statistics, (ii) softly projecting ROI features onto background representations, and (iii) preserving background tokens between clean and adversarial images. A total variation regularizer encourages smooth perturbations.
The objective is optimized across multiple transformer layers, while the language model remains frozen and is used only for evaluation.

This phenomenon is commonly treated as an unavoidable side effect of concealment. In contrast, we argue that hallucination is not an incidental artifact of object removal, but a direct consequence of the semantic discontinuity introduced by suppression-based attacks. By aggressively attenuating or nullifying ROI representations, prior methods create representational gaps within the vision encoder. These gaps propagate through cross-modal alignment layers, prompting the language model to infer the missing semantics and fill them with plausible but incorrect content [9]. In this work, we propose a different perspective: effective object concealment should preserve the structural and statistical consistency of visual representations, even when the target object is hidden. Instead of removing ROI tokens or severing their influence, we aim to re-encode them such that they remain present but become indistinguishable from background content. Under this formulation, the global image representation remains coherent, reducing the incentive for the language model to hallucinate missing entities [3, 7]. To operationalize this idea, we introduce Background-Consistent Re-encoding (BCR) (see Figure 1), a pixel-level attack framework that conceals objects by aligning the internal representations of ROI tokens with background statistics across multiple layers of the vision transformer. BCR jointly enforces statistical similarity, dictionary-based projection onto background features, and preservation of non-ROI content, resulting in adversarial images that hide target objects without introducing semantic gaps.
Importantly, our approach operates directly in pixel space, requiring no modification of model parameters and remaining applicable to modern VLM architectures [5, 11]. This paper makes the following contributions:

– We introduce Background-Consistent Re-encoding (BCR), a novel object concealment paradigm that preserves token structure and attention flow while hiding target objects.
– We propose a principled pixel-level optimization framework that enforces background-consistent visual representations across multiple vision transformer layers.
– We design hallucination-aware evaluation metrics and empirically demonstrate that BCR substantially reduces hallucination while maintaining strong concealment performance across multiple VLM architectures.

2 Related Work

2.1 Adversarial Attacks on Vision–Language Models

Vision–language models (VLMs) have been shown to inherit and amplify adversarial vulnerabilities from their visual backbones while introducing new cross-modal failure modes. Early work adapted classical image-space attacks such as FGSM and PGD [5, 11] to multimodal architectures, demonstrating that small, imperceptible perturbations can significantly degrade image captioning and visual question answering (VQA) performance [10, 16]. These studies established that multimodal grounding does not inherently confer robustness and that adversarial perturbations can propagate through cross-modal alignment layers to corrupt downstream language generation. Subsequent research has explored more structured attack strategies that explicitly exploit multimodal fusion mechanisms. While effective, these attacks generally aim to break the model's predictions, often at the cost of destroying overall image semantics and utility.
2.2 Adversarial Attacks for Privacy and Information Protection

More recently, adversarial attacks have been explored as tools for privacy preservation. VIP (Visual Information Protection) [12] frames privacy as an adversarial objective and proposes selectively masking regions of interest (ROIs) by suppressing attention and value activations in early vision layers, effectively preventing VLMs from recognizing sensitive content. VIP demonstrates strong concealment performance across multiple VLMs, but relies on explicit ROI suppression, which creates a representational "gap" that can encourage the language model to hallucinate or infer missing content. Other methods explicitly manipulate internal representations rather than output tokens. For instance, PRM (Patch Representation Misalignment) [6] disrupts hierarchical patch representations by enforcing feature divergence between clean and adversarial images, leading to global semantic corruption. However, these methods typically alter pixel-level appearance or remove information entirely, making them unsuitable for scenarios where global semantic coherence must be preserved. In contrast to prior work, our method does not suppress or erase visual information within the ROI. Instead, we propose Background-Consistent Re-encoding (BCR), which reshapes the distributional relationship between ROI and background tokens inside the vision encoder.

3 Threat Model

We consider a white-box, image-level adversary targeting vision–language models (VLMs). The adversary is given access to the model architecture and parameters, and can compute gradients with respect to the visual input. The attack operates by applying a bounded perturbation to the input image at the pixel level, subject to an $\ell_\infty$ norm constraint.
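Concretely, an $\ell_\infty$ budget amounts to clipping the perturbation coordinate-wise and keeping pixels in range. The following is a minimal sketch of that projection step, assuming pixel values in $[0, 1]$; the helper name `project_linf` is ours for illustration, not from the paper's code.

```python
import numpy as np

def project_linf(x_adv, x, eps):
    """Project x_adv onto the l-infinity ball of radius eps around the
    clean image x, then clamp to the valid pixel range [0, 1]."""
    delta = np.clip(x_adv - x, -eps, eps)   # enforce ||x_adv - x||_inf <= eps
    return np.clip(x + delta, 0.0, 1.0)     # keep a valid image
```

In a PGD-style optimizer, a projection of this form would be applied after every gradient step to restore feasibility.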
The adversary is additionally provided with a region of interest (ROI), specified as a bounding box, corresponding to an object whose visual presence should be concealed. Such ROIs naturally arise from object detectors, annotations, or user-defined sensitive regions. The adversary's objective is not to induce misclassification or nonsensical outputs, but rather to remove the semantic evidence of the target object from the model's internal representation while preserving global visual coherence. Importantly, the attacker seeks to avoid introducing explicit visual artifacts or representational gaps that could trigger compensatory hallucinations by the language decoder. Unlike object removal or inpainting approaches, our objective is not to alter the visual scene but to modulate representational grounding within the vision–language model. This enables privacy-preserving concealment while maintaining perceptual readability and evidentiary integrity for human observers. The adversary does not modify the model weights, prompts, or decoding strategy, and has no control over the language model beyond its dependence on visual features. All evaluations are performed under this fixed threat model.

4 Attack Principle

Problem Setting. Vision–language models (VLMs) rely on a continuous visual representation to ground language generation, where aligned visual tokens provide the evidential basis for downstream captioning and reasoning [1, 7, 13]. When salient visual evidence is abruptly removed or heavily suppressed (e.g., via masking, erasing, or zeroing localized regions), the model is forced to infer the missing content from contextual priors encoded in the language decoder [9, 14]. This inference process frequently manifests as hallucination, where the model fabricates objects or attributes that are not visually present.
Existing object-hiding or privacy-oriented attacks often rely on explicit removal or severe suppression of the visual features corresponding to a target region [12, 16]. While effective at concealing the object itself, such approaches introduce a representational "gap" in the visual embedding space that the model attempts to compensate for during decoding, leading to unstable or hallucinated outputs [9]. This effect is particularly pronounced in large-scale VLMs, where strong language priors and generative biases amplify missing or ambiguous visual evidence [9].

Principle of Background-Consistent Re-encoding (BCR). Instead of removing or suppressing the target object, we propose to re-encode its visual representation such that it becomes statistically and semantically consistent with the surrounding background. The goal is not to erase information, but to blend the target region into the background feature manifold. Concretely, BCR enforces three complementary constraints:

- Statistical alignment: The first- and second-order feature statistics of target-region tokens are matched to those of background tokens.
- Dictionary consistency: Target-region features are reconstructed as convex combinations of background features, ensuring that they lie within the background feature span.
- Background preservation: Features outside the target region are explicitly preserved to prevent global degradation or collateral semantic drift.

By maintaining continuity in the visual feature space, the global image representation remains coherent and does not signal missing content to the language decoder. As a result, the model neither mentions the concealed object nor compensates for its absence via hallucination.

Contrast with Suppression-Based Attacks.
Unlike masking- or suppression-based attacks (e.g., patch removal, attention blocking, or zeroing), BCR does not introduce visually or representationally empty regions. Instead, the target object is absorbed into the background representation. This distinction is critical: while suppression creates a void that invites hallucination, background-consistent re-encoding preserves representational smoothness and stabilizes downstream language generation. The resulting adversarial image conceals the target object while preserving global semantics and minimizing hallucinated content. Importantly, BCR operates entirely at the representation level while being implemented via pixel-space optimization, making it applicable to black-box or frozen VLMs without modifying model parameters.

5 Attack Formulation

We consider a vision–language model (VLM) composed of a vision encoder $f_\theta$ and a language decoder $g_\phi$. Given an input image $x \in \mathbb{R}^{3 \times H \times W}$, the vision encoder produces a sequence of visual tokens $Z = f_\theta(x) \in \mathbb{R}^{T \times D}$, where the first token corresponds to a global representation (CLS) and the remaining tokens encode spatial image regions. The language decoder conditions on this representation to generate a caption or answer. Let $R$ denote a target region of interest (ROI), specified by bounding boxes, corresponding to an object that the attacker wishes to conceal. Let $\mathcal{I}_r \subset \{1, \ldots, T-1\}$ be the indices of visual tokens whose receptive fields intersect the ROI, and $\mathcal{I}_b$ the indices of background tokens. Our goal is to produce an adversarial image $x_{adv}$ that conceals the target object while preserving global semantic consistency and avoiding hallucination.
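To make the index sets $\mathcal{I}_r$ and $\mathcal{I}_b$ concrete, the sketch below maps a pixel-space bounding box to ROI and background token indices, assuming a square image, a non-overlapping square patch grid laid out row-major, and the CLS token at index 0. The helper is illustrative and not part of the paper's implementation.

```python
def roi_token_indices(bbox, image_size, patch_size):
    """Map a pixel-space bounding box (x0, y0, x1, y1) to the indices of
    ViT patch tokens whose receptive fields intersect the ROI.
    Assumes token 0 is the global CLS token, so spatial tokens start at
    index 1 and follow row-major order over the patch grid."""
    grid = image_size // patch_size
    x0, y0, x1, y1 = bbox
    # patch coordinates covered by the box (inclusive)
    px0, py0 = x0 // patch_size, y0 // patch_size
    px1 = min((x1 - 1) // patch_size, grid - 1)
    py1 = min((y1 - 1) // patch_size, grid - 1)
    roi = [1 + r * grid + c
           for r in range(py0, py1 + 1)
           for c in range(px0, px1 + 1)]
    background = [t for t in range(1, 1 + grid * grid) if t not in set(roi)]
    return roi, background
```

For a 224-pixel image with 14-pixel patches (a 16x16 grid), a box covering the top-left two patches yields ROI indices {1, 2} and 254 background tokens.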
5.1 Background-Consistent Re-encoding Objective

Unlike suppression-based attacks that remove or mask ROI features, we aim to re-encode ROI tokens such that they become statistically and semantically indistinguishable from background tokens. Formally, we optimize the following objective:

$$\min_{x_{adv}} \; \mathcal{L}_{BCR}(x_{adv}) = \sum_{l \in \mathcal{L}} \left( \lambda_{stat} \mathcal{L}^{(l)}_{stat} + \lambda_{dict} \mathcal{L}^{(l)}_{dict} + \lambda_{pres} \mathcal{L}^{(l)}_{pres} \right) + \lambda_{tv} \mathcal{L}_{tv}, \quad (1)$$

subject to a pixel-level perturbation budget $\|x_{adv} - x\|_\infty \leq \epsilon$. The losses are computed across a set of intermediate vision encoder layers $\mathcal{L}$ to enforce multi-level consistency.

5.2 Statistical Alignment Loss

To eliminate distributional discrepancies that could signal the presence of a concealed object, we align the first- and second-order statistics of ROI and background features. For layer $l$, let $Z^{(l)}_r = \{z^{(l)}_i \mid i \in \mathcal{I}_r\}$ and $Z^{(l)}_b = \{z^{(l)}_j \mid j \in \mathcal{I}_b\}$. We define:

$$\mathcal{L}^{(l)}_{stat} = \left\| \mu(Z^{(l)}_r) - \mu(Z^{(l)}_b) \right\|_2^2 + \left\| \sigma(Z^{(l)}_r) - \sigma(Z^{(l)}_b) \right\|_2^2, \quad (2)$$

where $\mu(\cdot)$ and $\sigma(\cdot)$ denote the per-dimension mean and standard deviation. This loss enforces that ROI features follow the same distribution as background features, preventing detectable anomalies.

5.3 Dictionary Projection Loss

While statistical alignment controls low-order moments, it does not ensure semantic consistency. To explicitly constrain ROI features to lie on the background feature manifold, we introduce a soft background projection loss. For each ROI token, we compute a soft assignment over background tokens using scaled dot-product similarity:

$$\alpha^{(l)}_{ij} = \frac{\exp\left( z^{(l)}_{r,i} \cdot z^{(l)}_{b,j} / \tau \right)}{\sum_{k \in \mathcal{I}_b} \exp\left( z^{(l)}_{r,i} \cdot z^{(l)}_{b,k} / \tau \right)}.$$

Each ROI feature is then projected onto the background feature dictionary: $\hat{z}^{(l)}_{r,i} = \sum_{j \in \mathcal{I}_b} \alpha^{(l)}_{ij} z^{(l)}_{b,j}$.
The dictionary loss is defined as:

$$\mathcal{L}^{(l)}_{dict} = \frac{1}{|\mathcal{I}_r|} \sum_{i \in \mathcal{I}_r} \left\| z^{(l)}_{r,i} - \hat{z}^{(l)}_{r,i} \right\|_2^2. \quad (3)$$

This loss encourages ROI tokens to be re-encoded as convex combinations of background features, enforcing semantic continuity without removing token structure.

5.4 Background Preservation Loss

To avoid global semantic drift and unintended distortions, we explicitly preserve background features by penalizing deviations from their clean counterparts:

$$\mathcal{L}^{(l)}_{pres} = \frac{1}{|\mathcal{I}_b|} \sum_{j \in \mathcal{I}_b} \left\| z^{(l)}_{b,j}(x_{adv}) - z^{(l)}_{b,j}(x) \right\|_2^2. \quad (4)$$

This constraint ensures that only the ROI representation is altered, while the remainder of the image remains perceptually and semantically stable.

5.5 Pixel-Level Regularization

Finally, we apply total variation regularization to encourage spatially smooth perturbations within the ROI:

$$\mathcal{L}_{tv} = \sum_{c,h,w} \left| x^{adv}_{c,h,w+1} - x^{adv}_{c,h,w} \right| + \left| x^{adv}_{c,h+1,w} - x^{adv}_{c,h,w} \right|. \quad (5)$$

By jointly enforcing statistical alignment, dictionary-based semantic projection, and background preservation across multiple layers, BCR removes object-specific information without introducing representational gaps. As a result, the global image representation remains coherent, preventing both object mention and compensatory hallucination during language generation.

Why Representation Continuity Prevents Hallucination. Vision–language models rely on the assumption that visual token representations form a continuous and semantically coherent manifold. During generation, the language decoder implicitly performs inference over this manifold: when a region-specific signal is missing or anomalous, the decoder compensates by extrapolating from learned visual–semantic priors, often resulting in hallucinated content.
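As a minimal numerical sketch, the per-layer terms of Eqs. (2)-(4) can be written in NumPy on one layer's token features; in the actual attack these losses would be computed on autograd tensors inside the frozen encoder and backpropagated to the pixels, which the sketch omits.

```python
import numpy as np

def bcr_layer_losses(z_roi, z_bg, z_bg_clean, tau=0.07):
    """Per-layer BCR terms, Eqs. (2)-(4), on one layer's token features.
    z_roi:      (Nr, D) ROI token features of the adversarial image
    z_bg:       (Nb, D) background token features of the adversarial image
    z_bg_clean: (Nb, D) background token features of the clean image"""
    # Eq. (2): match per-dimension mean and std of ROI and background tokens
    l_stat = (np.sum((z_roi.mean(0) - z_bg.mean(0)) ** 2)
              + np.sum((z_roi.std(0) - z_bg.std(0)) ** 2))

    # Soft assignment over background tokens (scaled dot product, temperature tau)
    logits = z_roi @ z_bg.T / tau
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    alpha = np.exp(logits)
    alpha /= alpha.sum(axis=1, keepdims=True)
    z_hat = alpha @ z_bg                          # convex combinations of z_bg rows

    # Eq. (3): distance of each ROI token to its background projection
    l_dict = np.mean(np.sum((z_roi - z_hat) ** 2, axis=1))

    # Eq. (4): keep background tokens close to their clean counterparts
    l_pres = np.mean(np.sum((z_bg - z_bg_clean) ** 2, axis=1))
    return l_stat, l_dict, l_pres
```

Summing these terms over the chosen layers with the weights of Eq. (1), plus the TV term of Eq. (5), gives the full objective.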
This behavior is particularly pronounced in suppression-based attacks, where masking, zeroing, or attention blocking creates a representational "gap" that violates the continuity assumptions learned during pretraining. Our attack avoids this failure mode by enforcing representation continuity rather than suppression. By re-encoding ROI tokens to match the statistical distribution and semantic span of background tokens, we ensure that the visual embedding remains locally smooth and globally consistent. From the decoder's perspective, the ROI does not appear as missing information, but as background-consistent evidence. Consequently, there is no incentive for the model to hypothesize or hallucinate alternative objects to explain an apparent void. Importantly, this continuity is enforced across multiple layers of the vision encoder, preventing object semantics from re-emerging through higher-level abstractions or cross-token interactions. Thus, hallucination is reduced not by restricting generation, but by eliminating the representational discontinuities that would otherwise trigger compensatory reasoning in the language model.

6 Evaluation Metrics

Evaluating object concealment attacks on vision–language models requires more than measuring caption similarity or task accuracy. A successful concealment attack must simultaneously (i) remove evidence of a target object, (ii) preserve non-target visual semantics, and (iii) avoid inducing hallucinated content. To capture these distinct objectives, we introduce a set of complementary evaluation metrics combining caption analysis with grounded visual verification.

Object Sets. Given a clean image $I$ and its adversarial counterpart $I'$, we generate captions $c = f(I)$ and $c' = f(I')$ using the same prompting strategy.
From each caption, we extract a set of object-like tokens using a dependency-based noun phrase parser: $\mathcal{O}(c)$, $\mathcal{O}(c')$. All object strings are normalized (lemmatized and lowercased) prior to comparison.

Concealment Success. Concealment Success measures whether the target object is successfully removed from the adversarial caption. Let $o^*$ denote the target object. We define:

$$\text{ConcealmentSuccess} = \mathbb{1}\left[ o^* \in \mathcal{O}(c) \wedge o^* \notin \mathcal{O}(c') \right]. \quad (6)$$

This metric isolates the primary attack objective independently of other caption changes.

Global Preservation. Global Preservation measures how well non-target visual content is retained. It is defined as the fraction of clean-caption objects that remain present after the attack:

$$\text{GlobalPreservation} = \frac{|\mathcal{O}(c) \cap \mathcal{O}(c')|}{|\mathcal{O}(c)|}. \quad (7)$$

A high preservation score indicates that the attack does not indiscriminately erase visual semantics.

Grounded Hallucination Rate. Caption-level comparison alone cannot determine whether newly mentioned objects are visually supported. To address this limitation, we introduce a grounded hallucination metric that verifies adversarial caption objects against visual evidence using an external grounding model. Let $\mathcal{H} = \mathcal{O}(c') \setminus \mathcal{O}(c)$ denote the set of newly introduced objects in the adversarial caption. For each object $h \in \mathcal{H}$, we query a grounding model (GLIP [8]) to determine whether the object can be visually localized in the image. Formally, let

$$\text{detect}(h, I') = \begin{cases} 1 & \text{if the grounding model localizes object } h \text{ with confidence} > \tau, \\ 0 & \text{otherwise.} \end{cases}$$

Objects that cannot be grounded are considered hallucinated. The grounded hallucination rate is defined as:

$$\text{GroundedHallucinationRate} = \frac{|\{ h \in \mathcal{H} \mid \text{detect}(h, I') = 0 \}|}{|\mathcal{O}(c')|}. \quad (8)$$

This formulation ensures that hallucination is measured relative to actual visual evidence rather than only caption differences.
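Given the extracted and normalized object sets, the metrics of Eqs. (6)-(8) reduce to simple set operations. A sketch, where `detect` is a stand-in callable for the grounding check (e.g., a GLIP query) and all names are ours:

```python
def concealment_metrics(objs_clean, objs_adv, target, detect):
    """objs_clean / objs_adv: sets of normalized object strings extracted
    from the clean and adversarial captions; target: the target object;
    detect(obj) -> bool stands in for a grounding model that returns True
    iff the object can be localized in the adversarial image."""
    # Eq. (6): target present in the clean caption, absent in the adversarial one
    concealment = float(target in objs_clean and target not in objs_adv)

    # Eq. (7): fraction of clean-caption objects that survive the attack
    preservation = (len(objs_clean & objs_adv) / len(objs_clean)
                    if objs_clean else 1.0)

    # Eq. (8): newly introduced objects that the grounding model cannot verify
    new_objs = objs_adv - objs_clean
    hallucinated = [o for o in new_objs if not detect(o)]
    gh_rate = len(hallucinated) / len(objs_adv) if objs_adv else 0.0
    return concealment, preservation, gh_rate
```

For instance, if the clean caption mentions {cat, sofa, lamp}, the adversarial caption mentions {sofa, lamp, dog}, the target is "cat", and "dog" cannot be grounded, the metrics are 1.0, 2/3, and 1/3.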
Head-Noun Grounded Hallucination. To account for paraphrasing and compound nouns (e.g., "salt shaker" vs. "shaker"), we additionally report a head-noun grounded hallucination rate. Each object phrase is reduced to its syntactic head noun prior to grounding verification, yielding a more semantically robust estimate.

Semantic Drift. While hallucination focuses on unsupported object introduction, Semantic Drift measures global caption deviation. Using a text embedding function $\phi(\cdot)$ (e.g., a sentence encoder), we compute:

$$\text{SemanticDrift} = 1 - \cos\left( \phi(c), \phi(c') \right).$$

This metric ensures that reduced hallucination is not achieved by collapsing captions into generic or uninformative descriptions. Together, these metrics provide a fine-grained evaluation of concealment attacks. Unlike CHAIR [14] and related hallucination metrics [9], which assume static object presence and operate at coarse category granularity, our approach explicitly evaluates grounded object support and is designed for adversarial object concealment scenarios.

7 Experimental Setup

7.1 Models

To evaluate the effectiveness and generality of our attack, we consider three widely used vision–language models (VLMs) with distinct architectural designs: LLaVA-1.5, which combines a CLIP-based visual encoder with the Vicuna-7B large language model; BLIP-2, paired with the Flan-T5-XL language model via a Q-Former alignment module; and InstructBLIP, which also employs a Vicuna-7B language model with instruction-tuned visual–language alignment. These models have been extensively adopted in prior work on adversarial attacks and robustness evaluation for VLMs [6, 10, 12]. They represent diverse design choices in terms of visual encoders, cross-modal fusion mechanisms, and decoding strategies. For brevity, we refer to these models as LLaVA, BLIP2-T5, and InstructBLIP throughout the remainder of the paper.
7.2 Datasets

We evaluate our attack on two standard large-scale vision datasets with object-level annotations. ImageNet: We randomly sample 1,000 images from the ImageNet validation set [15]. For each image, we treat a ground-truth bounding box as the region of interest (ROI) and define the attack objective as preventing the VLM from detecting or describing the object within this region. COCO: In addition, we evaluate our method on images from the COCO dataset [2], which contains complex multi-object scenes with diverse object categories and spatial layouts. COCO allows us to assess the robustness of our attack under more challenging conditions, including overlapping objects and crowded scenes. As with ImageNet, ground-truth bounding boxes are used to define ROIs, and the attack targets a single object category per image while preserving the remaining scene content.

7.3 Implementation Details

Unless otherwise stated, all experiments use a perturbation budget of $\epsilon = 0.2$. The loss weights are set to $\lambda_{stat} = 1$, $\lambda_{dict} = 1$, $\lambda_{pres} = 1$, and $\lambda_{tv} = 10^{-3}$, with a temperature parameter $\tau = 0.07$. BCR is applied to later vision transformer layers. Specifically, we optimize layers {22, 23, 24, 25} for InstructBLIP, {21, 22, 23, 24} for LLaVA, and {22, 23, 24, 25} for BLIP2-T5. For BLIP-2, we apply the BCR losses to the frozen ViT-g/14 vision encoder, prior to the Q-Former bottleneck. We do not operate on Q-Former tokens, as they discard spatial correspondence. The choice of targeted layers is discussed in Appendix B.

Baselines. We compare our BCR attack against attention-suppression-based concealment methods, including VIP. All baseline methods are implemented using their official or publicly available code and evaluated under the same threat model and perturbation budget.

Caption Generation and Prompts.
For caption-based evaluation, we adopt the same prompts used in VIP, including: "Describe this picture." For binary verification tasks, we use object-specific yes/no prompts (e.g., "Is there a {object} in the image?"). All captions are generated using greedy decoding with a fixed maximum token budget.

Evaluation Protocol. Each attack is evaluated using the metrics introduced in Section 6, including Concealment Success, Global Preservation, Hallucination Rate, and Semantic Drift. Results are reported as averages over all test images and target objects. To isolate the effect of object concealment, all comparisons between clean and adversarial images are performed using identical prompts and decoding settings. Further details and the full algorithm are presented in Appendix A.

8 Main Results

We evaluate our Background-Consistent Re-encoding (BCR) attack against existing ROI-based suppression methods, with a particular focus on object concealment effectiveness, grounded hallucination behavior, and semantic consistency. All results are averaged over the evaluation set described in Section 7. Table 1 summarizes the main quantitative results across three vision–language models. While both VIP and BCR achieve high object concealment rates, their failure modes differ substantially. VIP relies on suppressing attention and value flow from the ROI tokens, which frequently introduces a semantic gap in the visual representation. This disruption weakens the visual grounding available to the language model. As a result, the model often compensates by generating plausible but visually unsupported objects, leading to a high grounded hallucination rate and increased semantic drift. In contrast, BCR achieves comparable concealment performance while substantially reducing grounded hallucinations.
Instead of suppressing ROI tokens, BCR re-encodes them to be statistically and semantically consistent with surrounding background tokens. This preserves representational continuity in the visual encoder and prevents the emergence of anomalous feature patterns that trigger compensatory language generation. The improvements are consistent across models and datasets. For example, on ImageNet with Instruct-BLIP, BCR reduces grounded hallucination from 0.43 (VIP) to 0.19 while increasing global preservation from 0.57 to 0.81. Similar trends are observed for BLIP2-T5 and LLaVA, as well as on the more complex COCO scenes. These results suggest that hallucination under concealment attacks is closely tied to representational discontinuity. By maintaining feature-level continuity rather than removing information outright, BCR enables effective object hiding while preserving coherent scene understanding.

Table 1: Multi-model evaluation on ImageNet and COCO subsets. C: Concealment success, GP: Global Preservation, GH: Grounded Hallucination rate (lower is better), SD: Semantic Drift.
Dataset   Method      | Instruct-BLIP           | BLIP2-T5                | LLaVA
                      | C↑   GP↑  GH↓  SD↓      | C↑   GP↑  GH↓  SD↓      | C↑   GP↑  GH↓  SD↓
ImageNet  No Attack   | 0.00 1.00 0.00 0.00     | 0.00 1.00 0.00 0.00     | 0.00 1.00 0.00 0.00
          Masking     | 1.00 0.41 0.59 0.54     | 1.00 0.52 0.48 0.40     | 1.00 0.55 0.45 0.41
          PRM         | 0.66 0.38 0.62 0.44     | 0.70 0.31 0.69 0.41     | 0.73 0.34 0.66 0.39
          VIP         | 0.93 0.57 0.43 0.35     | 0.96 0.59 0.41 0.32     | 0.98 0.62 0.48 0.29
          BCR (Ours)  | 0.92 0.81 0.19 0.13     | 0.95 0.83 0.17 0.10     | 0.98 0.86 0.14 0.08
COCO      No Attack   | 0.00 1.00 0.00 0.00     | 0.00 1.00 0.00 0.00     | 0.00 1.00 0.00 0.00
          Masking     | 1.00 0.37 0.63 0.46     | 1.00 0.39 0.61 0.44     | 1.00 0.41 0.59 0.41
          PRM         | 0.63 0.54 0.46 0.37     | 0.77 0.58 0.42 0.34     | 0.80 0.60 0.40 0.33
          VIP         | 0.80 0.42 0.58 0.48     | 0.83 0.45 0.55 0.45     | 0.88 0.48 0.52 0.43
          BCR (Ours)  | 0.89 0.77 0.23 0.16     | 0.92 0.79 0.21 0.13     | 0.95 0.82 0.18 0.11

8.1 Qualitative Comparison with Attention Suppression

We present qualitative comparisons between our Background-Consistent Re-encoding (BCR) attack and attention-suppression-based methods such as VIP. Figure 2 shows representative examples where the target object lies within the annotated region of interest (ROI). For each case, we report the clean caption together with captions generated after applying VIP and BCR.

In the first example, both methods successfully conceal the target person. However, VIP substantially alters the global scene description, introducing unrelated objects such as a fire hydrant and describing the scene as an abstract painting. This behavior reflects a common failure mode of attention suppression: by removing the contribution of ROI tokens, the visual representation becomes incomplete, prompting the language model to hallucinate visually unsupported content. In contrast, BCR removes the person while preserving the surrounding visual context. The generated caption continues to reference grounded elements of the scene, such as the pigeons and park bench, without introducing spurious objects.
The second example highlights a different failure mode. VIP again removes the target object but produces a caption describing an unrelated fireworks scene, which bears little resemblance to the underlying image. This reflects severe semantic drift caused by the disruption of the visual representation. In contrast, BCR replaces the concealed object with a visually plausible marine entity (a sea urchin). Although this represents a semantic substitution rather than a strict omission, the resulting description remains consistent with the visual environment (coral reef and marine background) and does not introduce unrelated objects.

Fig. 2: Qualitative comparison of object concealment attacks on InstructBLIP.

Across examples, attention suppression often creates a representational void that the language model compensates for by generating plausible but visually unsupported content. BCR mitigates this effect by preserving representational continuity within the visual encoder. Instead of removing ROI tokens, BCR re-encodes them to align with the statistical structure of surrounding background tokens. This prevents anomalous feature patterns that would otherwise trigger compensatory hallucination during language generation. Overall, BCR consistently conceals the target object while maintaining coherent scene descriptions, favoring contextually grounded substitutions over hallucinated or unrelated content. Additional qualitative results are provided in Appendix C.
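The background-consistency idea behind this re-encoding can be illustrated with a small sketch: a loss that matches the first- and second-order statistics of ROI tokens to those of background tokens. This is a simplified NumPy stand-in for the statistical term, not the paper's implementation; the function name, dimensions, and toy data are illustrative.

```python
import numpy as np

def stat_alignment_loss(roi_tokens, bg_tokens):
    """Match first- and second-order statistics of ROI tokens to
    background tokens (each array: [num_tokens, feature_dim])."""
    mu_r, mu_b = roi_tokens.mean(axis=0), bg_tokens.mean(axis=0)
    cov_r = np.cov(roi_tokens, rowvar=False)  # feature covariance of each set
    cov_b = np.cov(bg_tokens, rowvar=False)
    mean_term = np.sum((mu_r - mu_b) ** 2)
    cov_term = np.sum((cov_r - cov_b) ** 2)
    return mean_term + cov_term

rng = np.random.default_rng(0)
bg = rng.normal(0.0, 1.0, size=(64, 8))
roi_far = rng.normal(5.0, 1.0, size=(16, 8))            # ROI statistics far from background
roi_near = bg[:16] + 0.01 * rng.normal(size=(16, 8))    # nearly background-consistent ROI

print(stat_alignment_loss(roi_far, bg) > stat_alignment_loss(roi_near, bg))  # True
```

Driving this loss toward zero makes the ROI tokens statistically indistinguishable from their surroundings, which is the property the qualitative results above attribute BCR's stability to.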
However, their interaction with vision-language models can introduce unintended semantic artifacts.

Figure 3 shows representative examples. When the target region is blurred or masked, the resulting image often contains ambiguous or unnatural patterns that the vision encoder struggles to interpret. As a result, the language model attempts to explain these signals by generating plausible but visually unsupported descriptions. For example, in a scene originally depicting chickens near a fence, pixel-space obfuscation leads the model to hallucinate unrelated objects such as dogs or cows. This illustrates a key limitation of pixel-space concealment: removing visual evidence alone can introduce ambiguity that triggers compensatory reasoning in the language model, resulting in hallucinated scene interpretations.

Fig. 3: Failure cases of pixel-space obfuscation methods. When the target object is masked or blurred, the resulting visual artifacts introduce ambiguous signals that vision-language models attempt to explain. Despite the absence of the original object, the model generates hallucinated descriptions containing unrelated entities such as dogs or cows.

8.3 Perceptual Fidelity

In addition to semantic preservation, effective object concealment should maintain high visual fidelity so that perturbations remain unobtrusive. We therefore measure perceptual similarity between clean and adversarial images using Structural Similarity (SSIM) and Learned Perceptual Image Patch Similarity (LPIPS). Table 2 shows that BCR consistently achieves higher SSIM and lower LPIPS across models compared to attention-suppression attacks.

Table 2: Perceptual fidelity between clean and adversarial images. Higher SSIM and lower LPIPS indicate better visual similarity.
             Instruct-BLIP        BLIP2-T5           LLaVA
Method       SSIM↑   LPIPS↓     SSIM↑   LPIPS↓     SSIM↑   LPIPS↓
PRM          0.93    0.06       0.93    0.06       0.92    0.07
VIP          0.71    0.34       0.804   0.175      0.65    0.32
BCR (Ours)   0.96    0.02       0.95    0.03       0.94    0.04

Although BCR modifies pixels within the ROI, the perturbations remain visually coherent with surrounding content, avoiding the artifacts commonly introduced by suppression-based methods. These results indicate that BCR preserves both semantic consistency and perceptual realism, producing adversarial images that remain visually close to the original inputs.

9 Ablation Study

Table 3: Ablation study of BCR components on Instruct-BLIP. ✓ indicates the loss term is enabled. We report Concealment Success (CS↑), Hallucination Rate (HR↓), and Global Preservation (GP↑).

L_stat   L_dict   L_pres   L_tv      CS↑    HR↓    GP↑
  ✓        ✓        ✓       ✓       0.91   0.08   0.83
  ✓        ✓        ✓               0.90   0.09   0.81
  ✓        ✓                ✓       0.86   0.17   0.69
  ✓                 ✓       ✓       0.79   0.21   0.74
           ✓        ✓       ✓       0.72   0.28   0.66
VIP-style suppression baseline      0.88   0.42   0.51

We conduct an ablation study to isolate the contribution of each component in BCR. All variants are evaluated on the same images and ROIs with identical optimization and evaluation settings. Table 3 shows that the full objective achieves the best balance between concealment success and semantic stability. Removing either the statistical alignment loss (L_stat) or the dictionary projection loss (L_dict) substantially increases hallucination, indicating that both distributional matching and semantic anchoring are necessary for stable representations. Similarly, disabling background preservation (L_pres) degrades global semantics, reducing GP scores. Overall, the results confirm that BCR's effectiveness stems from enforcing representational continuity at both statistical and semantic levels.
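Mechanically, the ablation amounts to dropping terms from the weighted BCR objective. A minimal sketch of that toggling follows; the per-term loss values and weights are illustrative placeholders, not the paper's settings.

```python
def bcr_objective(losses, weights, enabled=()):
    """Weighted sum of the BCR loss terms; `enabled` lists the active
    terms, mirroring the check marks in the ablation table."""
    enabled = enabled or tuple(losses)  # empty tuple means all terms on
    return sum(weights[k] * losses[k] for k in enabled)

# Hypothetical per-term loss values at some optimization step:
losses = {"stat": 0.80, "dict": 0.50, "pres": 0.30, "tv": 0.10}
weights = {"stat": 1.0, "dict": 0.5, "pres": 1.0, "tv": 0.1}

full = bcr_objective(losses, weights)
ablate_tv = bcr_objective(losses, weights, ("stat", "dict", "pres"))
print(full > ablate_tv)  # True: each enabled term adds a non-negative contribution
```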
10 Conclusion

We introduced Background-Consistent Re-encoding (BCR), a continuity-preserving attack that conceals target objects in vision-language models without inducing hallucination. By re-encoding ROI features to align with surrounding context, BCR maintains representational continuity and scene coherence. Our results suggest that hallucination arises largely from representational discontinuities rather than object absence itself. This insight highlights continuity-aware manipulation as a promising direction for adversarial analysis and robust multimodal perception.

References

1. Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al.: Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems 35, 23716-23736 (2022)
2. Caesar, H., Uijlings, J., Ferrari, V.: COCO-Stuff: Thing and stuff classes in context. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1209-1218 (2018)
3. Dai, W., Li, J., Li, D., Tiong, A., Zhao, J., Wang, W., Li, B., Fung, P.N., Hoi, S.: InstructBLIP: Towards general-purpose vision-language models with instruction tuning. Advances in Neural Information Processing Systems 36, 49250-49267 (2023)
4. Datta, S., Sundararaman, D.: Evaluating hallucination in large vision-language models based on context-aware object similarities. arXiv preprint (2025)
5. Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572 (2014)
6. Hu, A., Gu, J., Pinto, F., Kamnitsas, K., Torr, P.: As firm as their foundations: Can open-sourced foundation models be used to create adversarial examples for downstream tasks? arXiv preprint arXiv:2403.12693 (2024)
7.
Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In: International Conference on Machine Learning. pp. 19730-19742. PMLR (2023)
8. Li, L.H., Zhang, P., Zhang, H., Yang, J., Li, C., Zhong, Y., Wang, L., Yuan, L., Zhang, L., Hwang, J.N., Chang, K.W., Gao, J.: Grounded language-image pre-training (2022), https://arxiv.org/abs/2112.03857
9. Liu, H., Xue, W., Chen, Y., Chen, D., Zhao, X., Wang, K., Hou, L., Li, R., Peng, W.: A survey on hallucination in large vision-language models. arXiv preprint arXiv:2402.00253 (2024)
10. Luo, H., Gu, J., Liu, F., Torr, P.: An image is worth 1000 lies: Adversarial transferability across prompts on vision-language models. arXiv preprint (2024)
11. Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083 (2017)
12. Meftah, H.F., Hamidouche, W., Fezza, S.A., Déforges, O.: VIP: Visual information protection through adversarial attacks on vision-language models. arXiv preprint arXiv:2507.08982 (2025)
13. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning. pp. 8748-8763. PMLR (2021)
14. Rohrbach, A., Hendricks, L.A., Burns, K., Darrell, T., Saenko, K.: Object hallucination in image captioning. arXiv preprint arXiv:1809.02156 (2018)
15. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115(3), 211-252 (2015)
16.
Zhang, T., Wang, L., Zhang, X., Zhang, Y., Jia, B., Liang, S., Hu, S., Fu, Q., Liu, A., Liu, X.: Visual adversarial attack on vision-language models for autonomous driving. arXiv preprint arXiv:2411.18275 (2024)

A. Additional Implementation Details

A.1. Algorithm Overview

Algorithm 1 summarizes the proposed Background-Consistent Re-encoding (BCR) attack. Given an input image and a region of interest (ROI), the algorithm optimizes a pixel-level adversarial image under a bounded perturbation budget. At each iteration, visual features are extracted from selected layers of the vision encoder. ROI token representations are then encouraged to (i) match the first- and second-order statistics of background tokens, (ii) lie within the background feature manifold via soft dictionary projection, and (iii) preserve non-ROI features to maintain global scene semantics. A total variation regularizer enforces spatial smoothness of the perturbation. The optimization proceeds via gradient descent in pixel space, producing an adversarial image that conceals the target object while preserving semantic continuity and reducing hallucination.
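As a toy illustration of this loop, the following NumPy sketch uses an identity "encoder" (one token per pixel) and only the statistical term, with a signed-gradient update and an l_inf projection. The analytic gradient, image size, and hyperparameters are specific to this toy setup and are not the paper's implementation.

```python
import numpy as np

def bcr_pixel_sketch(x, roi_mask, steps=50, eta=0.05, eps=0.5):
    """Toy BCR loop: pull the ROI pixel mean toward the background
    pixel mean, projecting the perturbation to an l_inf ball."""
    roi = roi_mask.astype(bool)
    x_adv = x.copy()
    for _ in range(steps):
        mu_r = x_adv[roi].mean()
        mu_b = x_adv[~roi].mean()
        # Gradient of (mu_r - mu_b)**2 w.r.t. each ROI pixel
        grad = 2.0 * (mu_r - mu_b) / roi.sum()
        x_adv[roi] = x_adv[roi] - eta * np.sign(grad)  # signed-gradient step
        # Project back into the l_inf budget and the valid pixel range
        x_adv = np.clip(x_adv, x - eps, x + eps)
        x_adv = np.clip(x_adv, 0.0, 1.0)
    return x_adv

x = np.full((8, 8), 0.2)
x[2:5, 2:5] = 0.9                      # bright "object" inside the ROI
mask = np.zeros((8, 8), dtype=bool)
mask[2:5, 2:5] = True
x_adv = bcr_pixel_sketch(x, mask)
gap_before = abs(x[mask].mean() - x[~mask].mean())
gap_after = abs(x_adv[mask].mean() - x_adv[~mask].mean())
print(gap_after < gap_before)  # True: ROI statistics moved toward the background
```

The real attack replaces the identity encoder with selected transformer layers of the vision encoder and adds the dictionary, preservation, and total variation terms, but the update-then-project structure is the same.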
Algorithm 1 Background-Consistent Re-encoding (BCR)
Require: Image x, ROI boxes B, vision encoder f_θ, layers L, step size η, steps T, perturbation budget ε
Ensure: Adversarial image x_adv
 1: Compute ROI pixel mask M from B
 2: Compute ROI token indices I_r and background indices I_b
 3: Initialize x_adv ← x
 4: for each layer l ∈ L do
 5:     Store clean features Z^(l)(x)
 6: end for
 7: for t = 1 to T do
 8:     for each layer l ∈ L do
 9:         Extract ROI features Z_r^(l) and background features Z_b^(l)
10:         Compute statistical loss L_stat^(l)
11:         Compute dictionary projection loss L_dict^(l)
12:         Compute background preservation loss L_pres^(l)
13:     end for
14:     Compute total variation loss L_tv on the ROI
15:     L ← Σ_{l ∈ L} (λ_stat L_stat^(l) + λ_dict L_dict^(l) + λ_pres L_pres^(l)) + λ_tv L_tv
16:     Update x_adv using gradient descent on L
17:     Project the perturbation to the ℓ_∞ budget ε
18: end for
19: return x_adv

A.2. Grounded Hallucination Verification with GLIP

To ensure that hallucination measurements reflect the presence of visually unsupported objects rather than linguistic variation, we augment our caption-based evaluation with a grounding-based verification step. Caption comparison alone can overestimate hallucination due to paraphrasing or synonym usage. For example, an object may appear in a caption under a different lexical form (e.g., "automobile" vs. "car"). Conversely, models may introduce object terms that are not visually supported in the image. To disambiguate these cases, we verify whether newly mentioned objects can be grounded in the image.

Given a clean image I and its adversarial counterpart I', we generate captions c = f(I) and c' = f(I') using the same prompt and decoding settings. Object candidates are extracted from both captions using a dependency-based noun phrase parser.
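This verification can be sketched as a set computation, with a mock grounding predicate standing in for the GLIP query. The object sets and the `detectable` set below are hypothetical examples, not measured data.

```python
def grounded_hallucination_rate(clean_objs, adv_objs, grounder):
    """grounder(obj) -> True if the object can be localized in the
    adversarial image (stands in for a GLIP query at threshold tau_g)."""
    h_cand = adv_objs - clean_objs               # newly introduced objects
    h = {o for o in h_cand if not grounder(o)}   # visually unsupported subset
    return len(h) / len(adv_objs) if adv_objs else 0.0

# Hypothetical caption objects after lemmatization and normalization:
clean = {"person", "bench", "pigeon"}
adv = {"bench", "pigeon", "fire hydrant", "sea urchin"}
# Pretend grounding: only these objects are actually detectable
detectable = {"bench", "pigeon", "sea urchin"}
rate = grounded_hallucination_rate(clean, adv, lambda o: o in detectable)
print(rate)  # 0.25: only "fire hydrant" is both new and ungrounded
```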
After lemmatization and normalization, we obtain the object sets O(c), O(c'). The set of newly introduced objects is defined as

    H_cand = O(c') \ O(c).

For each candidate object o ∈ H_cand, we query the Grounded Language-Image Pretraining (GLIP) detector to verify whether the object can be localized in the adversarial image I'. GLIP predicts bounding boxes conditioned on the textual query corresponding to the object name. If no detection with confidence above a predefined threshold τ_g is produced, the object is considered visually unsupported. Formally, the hallucination set is defined as

    H = { o ∈ H_cand | GLIP(I', o) = ∅ }.

Using the verified hallucination set H, we compute the grounded hallucination rate as

    HallucinationRate_grounded = |H| / |O(c')|.

This metric measures the fraction of objects mentioned in the adversarial caption that cannot be visually grounded in the image. We use a publicly available GLIP model pretrained on large-scale grounding datasets. The grounding confidence threshold is set to τ_g = 0.3. For multi-word phrases, the head noun is used as the textual query to improve grounding robustness. This grounding-based verification reduces ambiguity in hallucination evaluation by ensuring that newly introduced object tokens correspond to detectable visual evidence. By combining caption-based object extraction with grounding verification, the evaluation more accurately captures hallucinations caused by adversarial manipulation rather than lexical variation.

B. Sensitivity Analysis of Targeted Transformer Layers

Our Background-Consistent Re-encoding (BCR) objective enforces statistical and semantic alignment between region-of-interest (ROI) tokens and background tokens across a subset of vision transformer layers.
The choice of targeted layers influences the strength and stability of the concealment attack, as different layers encode visual information at varying levels of abstraction. Early layers of vision transformers primarily capture low-level features such as edges, textures, and local patterns, while deeper layers encode higher-level semantic representations that are more directly aligned with language generation. Consequently, applying the BCR objective at different depths may affect the degree to which object semantics are suppressed or redistributed into background representations.

Fig. 4: Sensitivity analysis of targeted transformer layers. Early-layer optimization leaves object semantics largely intact, allowing the model to still recognize the target object. In contrast, targeting deeper layers removes the object semantics while preserving the surrounding scene context, resulting in successful concealment.

To study this effect, we evaluate BCR by applying the alignment losses to different groups of transformer layers:
- Early layers: losses applied to the first four transformer blocks.
- Middle layers: losses applied to intermediate transformer blocks.
- Late layers: losses applied to the final four transformer blocks.

All other hyperparameters remain identical to the main experiments, including the perturbation budget and optimization schedule. We additionally measured concealment success and hallucination rate across layer configurations, confirming that targeting deeper layers yields the most stable concealment behavior. Early-layer optimization leaves object semantics intact, allowing the model to still recognize the target object. In contrast, targeting deeper layers removes the object semantics while preserving the surrounding scene context.
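The three configurations can be expressed as index groups over the encoder's blocks. The first and last four blocks follow the description above; the centered middle span and the 24-block depth in the example are assumptions for illustration.

```python
def layer_groups(num_layers, k=4):
    """Layer index groups for the sensitivity study: first/last k blocks
    for early/late, and (assumed) a centered k-block span for middle."""
    early = list(range(k))
    mid_start = (num_layers - k) // 2
    middle = list(range(mid_start, mid_start + k))
    late = list(range(num_layers - k, num_layers))
    return {"early": early, "middle": middle, "late": late}

# e.g. for a hypothetical 24-block vision transformer encoder:
print(layer_groups(24))
# {'early': [0, 1, 2, 3], 'middle': [10, 11, 12, 13], 'late': [20, 21, 22, 23]}
```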
Table 4: Quantitative sensitivity analysis of targeted transformer layers on Instruct-BLIP. Later-layer optimization yields stronger concealment and lower hallucination.

Targeted Layers   Concealment Success ↑   Hallucination Rate ↓
Early Layers      0.22                    0.07
Middle Layers     0.41                    0.11
Late Layers       0.91                    0.18

These results indicate that hallucination and object recognition in vision-language models are primarily driven by higher-level semantic representations. By aligning ROI features with background features in deeper layers, BCR effectively removes object-specific semantics while preserving global scene coherence. Based on this analysis, all experiments in the main paper apply the BCR objective to later transformer layers of the vision encoder.

C. Additional Qualitative Results

We present additional qualitative examples illustrating the behavior of BCR across different scenes and target objects (see Figure 5). For each example, we compare captions generated from the clean image and the adversarial images produced by different concealment strategies.

Across diverse scenes, suppression-based approaches frequently produce unstable descriptions. Because these methods remove or weaken ROI representations, the resulting visual features become semantically inconsistent with the surrounding context. The language model then compensates by generating plausible but unsupported objects, leading to hallucinated descriptions. In contrast, BCR maintains representational continuity by re-encoding ROI tokens to match the statistical and semantic properties of background tokens. As a result, the generated captions remain visually grounded and coherent, even after the target object is concealed.
These additional examples further demonstrate that hallucination in vision-language models is closely tied to representational discontinuities introduced by suppression-based attacks, while continuity-preserving manipulation enables more stable and realistic scene descriptions.

Fig. 5: Additional qualitative comparisons between suppression-based attacks and BCR. While suppression-based methods often introduce hallucinated or semantically unrelated objects, BCR consistently removes the target object while preserving the overall scene structure and contextual elements.

D. Failure Cases and Limitations

While BCR substantially reduces hallucination compared to suppression-based attacks, it is not without limitations. We summarize the main failure modes observed in our experiments and discuss directions for future improvement.

Partial Semantic Substitution. In some cases, BCR does not fully eliminate object-level reasoning but instead induces semantic substitution. For example, small or visually ambiguous objects (e.g., accessories such as ties or utensils) may be re-encoded as semantically adjacent background objects (e.g., clothing regions or tableware). Although this behavior avoids hallucination, it may still allow indirect inference of the concealed object category through contextual cues. This limitation is inherent to continuity-preserving attacks that aim to maintain global coherence rather than explicitly erase object evidence.

Model-Specific Vision Tokenization. Different VLMs employ distinct vision backbones and tokenization strategies (e.g., patch size, cropping, or pooling), which can affect ROI-to-token alignment. While BCR generalizes across LLaVA, BLIP-2, and Instruct-BLIP, additional model-specific adjustments may be required for architectures with aggressive image cropping or non-uniform token layouts.
Scope of Concealment. BCR is designed to conceal objects while preserving overall scene semantics, not to guarantee absolute object removal under all possible queries. Highly targeted or adversarial prompts explicitly probing the ROI (e.g., repeated yes/no questioning) may still extract residual information. Addressing such interactive threat models remains an open problem.

Overall, these limitations highlight the inherent trade-off between concealment and semantic continuity. We believe BCR represents a principled step toward hallucination-aware adversarial attacks, and future work may combine continuity-based objectives with prompt-aware or adaptive strategies to further strengthen object concealment.
