From Drop-off to Recovery: A Mechanistic Analysis of Segmentation in MLLMs


Authors: Boyong Wu, Sanghwan Kim, Zeynep Akata

Boyong Wu (1,2), Sanghwan Kim (1,2,3), and Zeynep Akata (1,2,3)
(1) Technical University of Munich  (2) Helmholtz Munich  (3) Munich Center for Machine Learning (MCML)

Fig. 1: Overview of main findings. (a) Layerwise linear probing on ADE20K: the adapter introduces a representation drop-off, but LLM layers progressively recover segmentation quality. (b) Attention knockout on conflicting class pairs: knocking out attention from correctly classified tokens degrades segmentation, confirming that cross-token self-refinement is driven by semantic anchors. (c) Per-token pixel accuracy at an intermediate LLM layer: causal attention starves early-position tokens of semantic anchors, while bidirectional attention among image tokens alleviates this bottleneck.

Abstract. Multimodal Large Language Models (MLLMs) are increasingly applied to pixel-level vision tasks, yet their intrinsic capacity for spatial understanding remains poorly understood. We investigate segmentation capacity through a layerwise linear probing evaluation across the entire MLLM pipeline: vision encoder, adapter, and LLM. We further conduct an intervention-based attention knockout analysis to test whether cross-token attention progressively refines visual representations, and an evaluation of bidirectional attention among image tokens on spatial consistency. Our analysis reveals that the adapter introduces a segmentation representation drop-off, but LLM layers progressively recover it through attention-mediated refinement, where correctly classified tokens steer misclassified neighbors toward the correct label. At early image token positions, this recovery is bounded by causal attention, which bidirectional attention among image tokens alleviates.
These findings provide a mechanistic account of how MLLMs process visual information for segmentation, informing the design of future segmentation-capable models.

1 Introduction

Detecting and segmenting content in visual scenes is foundational for applications in robotics, AR/VR, autonomous driving, and medical imaging [9, 16, 39]. Vision Transformers pretrained with contrastive image-text objectives or self-supervised distillation have emerged as powerful feature extractors: their patch-level embeddings capture rich dense semantics and transfer strongly to segmentation when paired with lightweight decoders or linear probes [3, 27, 28]. More recently, large-scale pretraining has yielded specialized vision encoders such as SAM [5, 18, 30] that further push segmentation performance. In parallel, Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in multimodal reasoning, instruction following, and language-conditioned control [2, 8, 23]. Among MLLMs, adapter-style architectures are notable for their simplicity and have demonstrated state-of-the-art performance by combining a pretrained vision encoder with a pretrained Large Language Model (LLM) through a learned adapter network that maps vision embeddings into the LLM's token space [23-25]. These capabilities have motivated a growing body of work that adapts MLLMs for pixel-level semantic segmentation tasks, arguing that language supervision can improve interpretability and enable compositional segmentation driven by Referring Expression Segmentation, which allows segmentation through instruction following beyond what traditional segmentation models support [17, 19].
While recent works have proposed MLLM-based methods specialized for semantic segmentation [7, 31, 36, 40], it remains unclear whether the MLLM as a whole is actually better at image segmentation than its underlying vision encoder under a controlled probing setup. Recent analyses suggest that, without task-specific tuning, MLLMs may underperform on classical vision tasks and may overlook visual evidence that is already present in their vision backbones [13, 14, 33, 42].

To address this gap in the literature, we propose a systematic, intervention-driven framework to dissect where segmentation competence arises or degrades across the MLLM stack. First, we perform linear probing of frozen embeddings at the vision encoder, the adapter, and every intermediate LLM layer (Sec. 3). This reveals a representation drop-off at the adapter: projecting into the LLM embedding space trades fine-grained spatial fidelity for cross-modal alignment, degrading token-level separability. However, downstream LLM layers progressively self-refine these visual representations and gradually recover segmentation quality. This recovery raises the question of whether it is actively driven by cross-token attention or is merely a byproduct of residual connections and normalization. To further study these mechanisms, we design attention knockout interventions on conflicting class pairs (e.g., ceiling vs. sky), which suppress specific attention pathways while leaving all other operations unchanged (Sec. 4). We find that correctly classified tokens act as semantic anchors whose attention signals pull misclassified neighbors toward the correct label, indicating that cross-token attention drives the observed self-refinement.
Finally, under standard causal attention, early image tokens can only attend to preceding tokens in sequence order and thus lack global spatial context, limiting self-refinement at those positions. We ablate this context starvation by introducing bidirectional attention exclusively among image tokens and by varying the vision encoder (Sec. 5). Bidirectional attention eliminates the positional bias at early patches, and the degree of recovery depends on the compatibility between the vision encoder's representation space and the LLM's embedding space, with text-aligned encoders (e.g., CLIP, SigLIP) benefiting more than vision-only ones (e.g., DINOv2).

Our contributions can be summarized as follows (Fig. 1):
- We reveal a representation drop-off at the adapter and progressive self-refinement across LLM layers by performing layerwise linear probing that analyzes segmentation competence across the MLLM stack.
- Our attention knockout experiments show that self-refinement is driven by cross-token attention, with correctly classified tokens acting as semantic anchors that guide their misclassified neighbors.
- We further find that applying bidirectional attention alleviates context starvation at early image tokens and that text-aligned encoders benefit more from LLM-layer refinement than vision-only encoders.

Across experiments, we observe that LLM layers provide structure- and constraint-aware refinement that improves local consistency and resolves certain class conflicts, but they do not recover fine-grained spatial details absent from the encoder. These findings clarify when and where MLLMs aid semantic segmentation and where they fall short, offering practical guidance for the design of segmentation-capable MLLMs.

2 Related Work

Vision Encoders for Segmentation.
Vision Transformers pretrained at scale with supervised, contrastive, or self-supervised objectives have become powerful feature extractors for dense prediction tasks [6, 11, 28, 30, 38]. Notably, patch-level features from these models already transfer strongly to segmentation and other dense tasks when decoded with lightweight heads or linear probes [3, 27], establishing that rich spatial information is present in the encoder representations themselves. Our work builds on this probing paradigm but extends it beyond the vision encoder in isolation: we probe representations at every stage of the full MLLM pipeline, from the encoder through the adapter and into each LLM layer, to understand whether the downstream components preserve, degrade, or enhance the spatial information already present in the encoder features.

MLLMs for Segmentation. Adapter-style MLLMs connect a pretrained vision encoder to a pretrained LLM through a learned projection [24, 25, 37], enabling multimodal reasoning without training either component from scratch. A growing line of work adapts this architecture for segmentation, leveraging instruction following and language-conditioned control to produce pixel-level outputs. A common strategy introduces special tokens that bridge language reasoning with mask prediction. LISA [19] pioneered this by connecting an MLLM to SAM via a learned [SEG] token, enabling reasoning segmentation from complex language instructions. GSVA [36] extended this to multi-target and empty-target cases with multiple [SEG] and [REJ] tokens. PixelLM [31] takes a SAM-free approach, generating multi-scale segment tokens decoded by a lightweight head. Other recent notable efforts include GLaMM [29], SAM4MLLM [7], OMG-LLaVA [40], and others [32, 35, 41, 43, 45], each proposing architectural variants for grounding or segmentation.
While these works report competitive segmentation results, they focus on system-level performance and do not investigate where within the MLLM stack segmentation competence actually resides. Our work complements this line of research by providing a diagnostic analysis of the representations these architectures produce.

Diagnostic Analyses of MLLM Representations. Recent analyses caution that MLLMs may underperform on standard vision tasks without task-specific finetuning. Zhang et al. [42] examine the performance of MLLMs on classification tasks and find that they can overlook visual evidence encoded by their own vision backbone. Other works in this area have shown that MLLMs substantially underperform their vision encoders on vision-centric tasks such as depth estimation and correspondence, identifying the LLM as the primary bottleneck; that analysis probes representations across the VLM but focuses on general perceptual tasks and does not perform causal interventions [13]. Conversely, Li et al. [20] show that generative MLLMs can extract visual information more effectively than CLIP from the same frozen encoder, suggesting a more nuanced picture. Liang et al. [22] demonstrate that intermediate MLLM layers can hold richer region-level descriptions than the final layer. We build on these findings but focus specifically on semantic segmentation, introduce attention knockout experiments to test whether LLM layers actively resolve ambiguity, and ablate architectural choices such as attention directionality and vision encoder selection. Our work contributes a segmentation-focused, causal-intervention view, complementing classification- and correspondence-oriented findings.

3 Layerwise Linear Probing

Models and notation.
Given an input image, the vision encoder divides it into a grid of T non-overlapping patches (e.g., T = 576 for a 336 × 336 image with patch size 14) and maps them to a sequence of token embeddings. The adapter, a learned two-layer MLP, projects each token embedding into the LLM's d-dimensional input space. Within each LLM layer, the full input sequence contains system tokens, image tokens, and text prompt tokens. We denote by X^(ℓ) ∈ R^(T×d) the representations corresponding to the T image patch tokens only at layer ℓ. Under standard causal attention, image tokens can only attend to preceding tokens in the sequence (i.e., system tokens and earlier image patches) and are therefore not influenced by the text prompt that follows them. Since each token corresponds to a fixed spatial position in the image, per-token predictions can be reshaped into a 2D grid and upsampled to the original resolution for pixel-level evaluation.

Fig. 2: Overview of the three analysis methods. (a) Layerwise linear probing: given an input image, the vision encoder produces patch token embeddings (brown), which are projected by the adapter into the LLM's embedding space. Inside the LLM, image tokens are processed jointly with text prompt tokens (yellow). At each layer ℓ, we extract only the image token representations and train an independent linear probe to predict per-patch semantic classes, reassembled into a 2D segmentation map. (b) Attention knockout: we selectively block attention to incorrectly classified tokens (left) or correctly classified tokens (right) across all LLM layers, testing whether cross-token attention drives self-refinement. (c) Bidirectional attention mask: image tokens attend to each other bidirectionally while all other token pairs retain causal masking, alleviating context starvation at early image positions.

Datasets. We evaluate on three standard segmentation benchmarks: ADE20K (150 classes, diverse indoor and outdoor scenes) [44]; PASCAL VOC 2012 (20 foreground classes, augmented training set) [12]; and Cityscapes (urban street scenes) [10]. We use standard train/validation splits and report mIoU as the primary metric and pixel accuracy (pAcc) as a secondary measure.

Probing protocol. We introduce a probing protocol to systematically evaluate where segmentation competence arises, degrades, or is refined across the MLLM stack. Our framework targets adapter-style MLLMs, which combine a frozen vision encoder, a trained adapter, and an LLM. This modular structure allows us to isolate and compare representations at three stages: the vision encoder output, the adapter output, and each intermediate LLM layer. To evaluate the segmentation quality of representations at each stage of the MLLM, we train an independent linear probe per layer [1]. The procedure is identical across all stages for the vision encoder, adapter, and LLM layers, and differs only in which hidden state is extracted and the dimension of that hidden state.
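Since each image token maps to a fixed patch position, the per-token probe predictions can be reassembled into a 2D label map and upsampled for pixel-level evaluation. A minimal sketch of that reshape-and-upsample step, using pure Python on a toy 2 × 2 patch grid (the function names are illustrative, not from the paper's codebase):

```python
# Reassemble per-token class predictions into a 2D label grid, then
# upsample by nearest-neighbor replication of each patch label.
# In the paper's setting, grid_size = 24 gives T = 576 tokens for a
# 336x336 image with patch size 14; here we use a tiny toy example.

def tokens_to_grid(preds, grid_size):
    """Row-major reshape of T = grid_size**2 per-token labels."""
    assert len(preds) == grid_size * grid_size
    return [preds[r * grid_size:(r + 1) * grid_size] for r in range(grid_size)]

def upsample_nearest(grid, patch_size):
    """Replicate each patch label over a patch_size x patch_size pixel block."""
    out = []
    for row in grid:
        pixel_row = [label for label in row for _ in range(patch_size)]
        out.extend([pixel_row[:] for _ in range(patch_size)])
    return out

# Toy example: a 2x2 "image" of patch labels, each patch covering 3x3 pixels.
grid = tokens_to_grid([0, 1, 2, 3], grid_size=2)
pixels = upsample_nearest(grid, patch_size=3)
```

Real pipelines would do this with tensor reshapes and interpolation, but the mapping from token order to spatial position is the same.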
For a target layer ℓ, we freeze the entire MLLM and extract X^(ℓ) for every image in the training set. Each token is independently classified by a linear probe trained with cross-entropy on the frozen features. Additional training and implementation details are provided in the supplementary material.

We train three MLLM variants by pairing Vicuna-7B [34] with CLIP ViT-L/14@336, DINOv2 Large@336, and SigLIP SO400M/14, each following the standard LLaVA-1.5 two-stage procedure [23]: pretraining the adapter on 558K image-caption pairs, then finetuning the full model on 665K visual instruction data. By comparing mIoU across layers under identical probe training conditions, we obtain a complete profile of how segmentation-relevant information evolves from the vision encoder through the adapter and into the LLM.

3.1 The Adapter Introduces a Representation Drop-off

We first compare the segmentation quality of features immediately before and after the adapter. Fig. 3 reports the mIoU for three vision encoders at the encoder output and the adapter output. Across all encoders, we observe a consistent drop in mIoU after the adapter projects the visual features into the LLM's embedding space. The magnitude of this drop varies: CLIP experiences a modest decline, while DINOv2 and SigLIP suffer larger degradations. This representation drop-off indicates that the adapter introduces a structural bottleneck, trading fine-grained spatial fidelity for cross-modal alignment with the language embedding space. Analogous representation gaps have been documented in contrastive vision-language spaces, where image and text embeddings occupy geometrically separated regions [21], while Fu et al. [13] show that VLMs can underperform their own vision encoders on tasks such as correspondence.
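The layerwise comparisons above score each probe with mIoU, the per-class intersection-over-union averaged over classes present in either map. A minimal sketch of that metric on flat label lists (toy inputs; a real evaluation accumulates a confusion matrix over the whole validation set):

```python
# mIoU: for each class c, IoU = |pred==c AND gt==c| / |pred==c OR gt==c|,
# averaged over classes that appear in prediction or ground truth.

def mean_iou(pred, gt, num_classes):
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

def pixel_accuracy(pred, gt):
    """Secondary metric: fraction of positions labeled correctly."""
    return sum(1 for p, g in zip(pred, gt) if p == g) / len(gt)

# Toy flattened label maps with three classes.
pred = [0, 0, 1, 1, 2, 2]
gt   = [0, 0, 1, 2, 2, 2]
```

On this toy pair, class 0 scores IoU 1.0, class 1 scores 0.5, class 2 scores 2/3, so mIoU is their average and pAcc is 5/6.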
3.2 LLM Layers Progressively Recover Segmentation Quality

Segmentation quality does not continue to degrade as features propagate through the LLM. On the contrary, as shown in Fig. 3, mIoU steadily recovers across the LLM layers. The recovery is characterized by a sharp increase in mIoU in the early LLM layers, followed by a plateau in the mid-to-late layers where peak performance is reached. This is a notable finding: the LLM layers, which were pretrained for language modeling and not for spatial reasoning, are able to refine the visual representations and restore segmentation-relevant structure that was lost at the adapter.

The drop-off and recovery pattern is consistent across all three encoders, but the strength of recovery varies. CLIP, which is pretrained with a contrastive objective that aligns visual features to text, exhibits the strongest recovery: its LLM-layer representations ultimately exceed the vision encoder baseline. SigLIP, which shares a similar contrastive pretraining objective, also shows meaningful recovery. DINOv2, a vision-only encoder with no text alignment, recovers less strongly. This suggests that the degree of compatibility between the vision encoder's representation space and the LLM's embedding space influences how effectively the LLM layers can refine the visual features.

Fig. 3: Layerwise linear probing results across the MLLM stack. mIoU on ADE20K for CLIP, DINOv2, and SigLIP encoders paired with Vicuna-7B, measured at the vision encoder output, adapter output, and each LLM layer. All three encoders exhibit a drop at the adapter followed by progressive recovery across LLM layers. Dashed lines mark the best-performing layer; values on the right indicate the total mIoU improvement across LLM layers.

3.3 Qualitative Evidence and Semantic Clustering

Fig. 4 shows segmentation predictions from the linear probe at different stages of the MLLM for representative ADE20K validation images. At the adapter output, predictions exhibit noisy boundaries and more frequent confusion between classes. At deeper LLM layers, these errors are progressively resolved: boundaries become more spatially coherent and class assignments stabilize. This visual evidence complements the quantitative mIoU improvements and suggests that the LLM layers perform a form of contextual refinement, leveraging global token interactions and semantics to re-impose structure, disambiguate local patch-level predictions, and yield net gains for specific conflicts.

Fig. 4: Qualitative segmentation predictions across the MLLM stack. From left to right: input image, ground truth, and linear probe predictions at the vision encoder output, adapter output, and an intermediate LLM layer. The representation drop-off appears at the adapter, but deeper layers resolve class confusions (e.g., wall vs. bed) and produce more spatially coherent predictions.

To further investigate how the LLM layers refine visual representations, we visualize the hidden states of individual image patches using UMAP projections at different depths of the model. Fig. 5 shows the 2D embeddings of all 576 patch tokens from a single image of the CLIP MLLM at the adapter output, an intermediate LLM layer, and the layer at which linear probing performance peaks. Each patch is colored by its semantic class; classes not among the four most prevalent are shown in gray. At the adapter output, patches from different semantic classes are heavily interleaved, confirming that the projected features lack clear category-level organization. As representations pass through the LLM, same-class patches progressively cluster together, and by layer 20, distinct semantic groups such as floor, ceiling, wall, and building occupy clearly separated regions of the embedding space; wall and building cluster close together because of semantic similarity. This provides a complementary, geometric perspective on the recovery and corroborates the mIoU improvements observed in the linear probing results and qualitative results: the LLM does not merely make features more linearly separable, but actively re-organizes them into semantically coherent clusters.

Fig. 5: UMAP projections of patch-level hidden states across the MLLM stack. Each point represents one of 576 image patch tokens, colored by semantic class. At the adapter output, classes are interleaved; by layer 20, same-class patches form distinct clusters, illustrating the progressive emergence of semantic structure through the LLM layers.

4 Attention Knockout

The layerwise probing results from Sec. 3.1 show that segmentation quality improves across LLM layers, suggesting that the model progressively resolves classification errors. To test whether this self-refinement is actively mediated by cross-token attention rather than being a passive byproduct of residual connections or layer normalization, we design a global attention knockout experiment. We adapt the attention knockout technique introduced by Geva et al. [15] for tracing information flow in auto-regressive language models, and applied to MLLMs by Neo et al. [26] to study how visual information is extracted at the output position. While both prior works use knockout to identify which tokens inform the model's generated text output, we repurpose the technique to probe a different question: whether cross-token attention among image patch tokens themselves drives the self-refinement of spatial representations across LLM layers.

Procedure. Given a test image, we first obtain predicted labels ŷ_t for each image patch token t. We select a target class c to block and identify the corresponding token set B_c = { s ∈ I : ŷ_s = c }, where I denotes the full set of image tokens. At every LLM layer ℓ, we mask out all attention from any image token to any token in B_c by setting the pre-softmax attention logits to −∞:

A^(ℓ)_{t←s} ← −∞   ∀ t ∈ I, s ∈ B_c.   (1)

This renders class c completely invisible: no image token, including tokens of class c themselves, can attend to the blocked tokens. Blocked tokens can still attend to all non-blocked tokens, so their representations continue to evolve, but without any self-reinforcement from same-class neighbors. We then extract hidden states at every layer and evaluate segmentation via the same per-layer linear probes used in the earlier experiment.

Experimental conditions. We focus on images where the unmodified model exhibits characteristic class confusions, e.g., patches of ceiling misclassified as sky. For each such image, we run two complementary conditions and compare the resulting layerwise segmentation against the unmodified model (Fig. 2b):

1. Block incorrect class. We block the class that the model incorrectly assigns to some tokens (e.g. block sky when the ground truth is ceiling). If attention to incorrectly classified tokens reinforces errors, removing their influence should accelerate self-correction through global context.
2. Block correct class. We block the class that is correctly assigned (e.g. block ceiling).
If correctly classified tokens serve as semantic anchors in a global context that pull misclassified neighbors toward the right label, removing them should impair self-correction.

4.1 Blocking the Incorrect Class Accelerates Self-Correction

We select ADE20K validation images where the unmodified model exhibits characteristic class confusions, such as ceiling patches misclassified as sky, and compare layerwise segmentation maps under the two blocking conditions against the unmodified model.

When we block the incorrectly assigned class (e.g. making sky invisible in an image where the ground truth is ceiling), misclassified tokens are corrected faster across layers compared to the unmodified model. Already in the early LLM layers, the segmentation maps show reduced confusion in the affected region, and by the final layer the model with incorrect classes knocked out produces fewer residual misclassifications than the unmodified model. This indicates that tokens carrying the incorrect label were actively reinforcing erroneous predictions through attention: by silencing them, the remaining contextual cues from correctly classified neighbors (e.g. walls, floors, and other indoor elements) dominate the attention field, enabling the model to resolve the ambiguity more efficiently.

Fig. 6: Global attention knockout: layerwise segmentation comparison. Top row: unmodified baseline. Middle row: incorrect class blocked (e.g. sky). Bottom row: correct class blocked (e.g. ceiling). Blocking the incorrect class accelerates the resolution of misclassified patches across layers, while blocking the correct class impairs self-correction and leaves more errors in mid-to-late layers.
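The knockout of Eq. (1) amounts to forcing the pre-softmax logits toward blocked key positions to −∞, after which softmax assigns those positions exactly zero weight. A minimal sketch in pure Python (the logit values and token indices are toy inputs, not from the experiments):

```python
import math

# Attention knockout, as in Eq. (1): every image token's logit toward a
# key position s in the blocked set B_c is set to -inf; softmax over the
# masked row then gives those columns zero attention weight. Note that
# rows belonging to blocked tokens are masked too (t ranges over all of I),
# so blocked tokens also cannot attend to each other.

def knockout(logits, blocked):
    """Return a copy of the logit matrix with blocked key columns at -inf."""
    return [[-math.inf if s in blocked else v for s, v in enumerate(row)]
            for row in logits]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]  # exp(-inf) == 0.0
    z = sum(exps)
    return [e / z for e in exps]

# Toy example: 4 image tokens; tokens 1 and 3 were predicted as class c.
logits = [[0.2, 1.0, -0.3, 0.5] for _ in range(4)]
blocked = {1, 3}
weights = [softmax(row) for row in knockout(logits, blocked)]
```

After the intervention, each row still sums to one, but all attention mass is redistributed over the non-blocked tokens, mirroring how the remaining contextual cues dominate the attention field.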
4.2 Blocking the Correct Class Impairs Self-Correction

The complementary condition reveals the opposite effect. When we block the correctly assigned class (e.g. making ceiling invisible), the model's ability to self-correct deteriorates markedly. Misclassified tokens persist longer across the layer progression, and in the mid-to-late layers the segmentation quality falls below that of the unmodified model, with more misclassified patches remaining than when no intervention is applied. This demonstrates that correctly classified tokens serve as semantic anchors: their attention signals help pull misclassified neighbors toward the correct label. Without these anchors, the model loses its primary self-correction mechanism and errors are left unresolved or even amplified.

4.3 Cross-Token Attention Drives Self-Refinement

Together, these two conditions provide direct evidence that the representation refinement documented in Sec. 3.1 is not a passive byproduct of residual connections or layer normalization, but is actively mediated by cross-token attention. The LLM layers leverage semantic context, attending to tokens of the correct class, to progressively resolve local classification errors. However, because self-correction relies on the presence of correctly classified anchors, this mechanism cannot recover fine-grained spatial details absent from the encoder features or overcome systematic encoder biases where no correct anchors exist. These findings confirm the self-refinement hypothesis and identify cross-token attention as its operative mechanism.

5 Causal vs. Bidirectional Attention

The preceding experiments established that cross-token attention drives self-refinement across LLM layers (Secs. 3.1 and 4.1).
However, under standard causal attention, image tokens are processed in raster order (left-to-right, top-to-bottom) and each token can only attend to preceding tokens in the sequence. The first image token cannot see any other image tokens, while the last token attends to all T image patches. This creates a structural asymmetry in which early tokens lack global spatial context, a limitation particularly relevant for segmentation, where every patch should ideally access scene-level information. We observed this effect in our layerwise probing experiments: tokens in the first row, and especially the top-left corner, are often misclassified in the qualitative visualizations and exhibited consistently lower classification accuracy that did not improve across LLM layers.

Image-only bidirectional attention. To test whether this positional bias limits segmentation quality, we modify the attention mask to grant bidirectional attention exclusively among image tokens while preserving causal attention for text generation (Fig. 2c). Let I denote the set of image token positions within the input sequence. The modified mask function is:

M(q, k) = (q ∈ I ∧ k ∈ I) ∨ (q ≥ k),   (2)

where q and k index the query and key positions in the full input sequence, the first clause grants bidirectional image-image attention, and the second is the standard causal constraint. When both tokens lie within the image region, attention is permitted regardless of their relative position; for all other pairs, the causal constraint applies. This design differs from the prefix-LM strategy employed by PaLiGemma [4], which grants bidirectional attention across all input tokens: image, system prompt, and task prefix alike. Our variant targets spatial self-refinement among image tokens specifically, isolating its effect on segmentation while preserving the sequential structure required for autoregressive text generation.

Training. We follow the standard LLaVA-1.5 two-stage procedure: (1) pretraining the adapter on 558K image-caption pairs with the LLM frozen, and (2) finetuning the full model on 665K visual instruction data. Both stages use the same attention type, either fully causal or fully bidirectional, so that the comparison reflects the cumulative effect of attention directionality across the entire training process. All MLLM hyperparameters, training data, and vision encoders (CLIP, DINOv2, SigLIP) are identical between the two conditions; only the attention mask differs.

Fig. 7: Context starvation under causal attention. Per-patch pixel accuracy for the first 50 tokens at an LLM layer of the MLLM. Accuracy gap: bidirectional minus causal accuracy. Early patches suffer severe context starvation under causal masking.

5.1 Early Tokens Suffer Context Starvation

We compare per-patch classification accuracy between the causal and bidirectional CLIP MLLM models at the same layer across the ADE20K validation set. To isolate the positional effect of the attention mask from the content-distribution bias shared by both models, we report the difference in pixel accuracy between the bidirectional and causal attention mechanisms evaluated at each patch position. Fig. 7 reports accuracy for the first 50 patch tokens in patch order. The gap is striking: patch 0, which under causal attention attends to no visual context, shows a +14.35 percentage-point pixel accuracy increase, a 23.2% relative improvement under bidirectional attention; the gap decays at later image patches.

5.2 Bidirectional Attention Sustains Recovery Across Layers

Fig. 8 compares layerwise linear probing mIoU for causal and bidirectional MLLMs across all three encoder configurations.
Bidirectional attention yields modest peak mIoU gains for CLIP (+0.42) and DINOv2 (+0.32), consistent with the per-patch analysis: the improvement is concentrated at the few context-starved positions rather than reflecting a broad representational change. SigLIP shows a larger effect: under causal attention, recovery peaks at layer 12 (mIoU 35.39) and then declines through deeper layers, while bidirectional attention sustains monotonic improvement to 41.26 at layer 32.

Fig. 8: Causal vs. bidirectional attention: layerwise probing on ADE20K across the MLLM stack for different vision encoders. Bidirectional attention yields modest peak mIoU gains for CLIP and DINOv2, while SigLIP shows a larger effect.

5.3 Self-Refinement Requires Access to Neighbors

The bidirectional attention experiment reveals context starvation as a real but sharply localized cost of causal masking. For the first few image tokens, the causal mask creates a persistent representation penalty that is not resolved across LLM layers; bidirectional attention eliminates it by granting immediate access to all visual neighbors. Beyond these early positions, causal and bidirectional attention produce comparable representations, indicating that self-refinement through partial context is sufficient once a token has access to even a modest number of visual neighbors. This finding reinforces the self-refinement narrative from Sec. 4.1: self-refinement depends on access to semantically consistent neighbors, and causal masking delays the formation of these contextual anchors for early tokens, whereas bidirectional attention restores immediate global context and enables refinement to operate uniformly across spatial positions.
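For concreteness, the image-bidirectional mask of Eq. (2) can be materialized as a boolean matrix. A minimal NumPy sketch, assuming the image tokens occupy one contiguous span of the input sequence, as in LLaVA-style inputs (the function name is ours):

```python
import numpy as np

def image_bidirectional_mask(seq_len: int, img_start: int, img_end: int) -> np.ndarray:
    """Boolean (seq_len, seq_len) mask for Eq. (2): entry [q, k] is True
    when query position q may attend to key position k.

    Image tokens occupy [img_start, img_end); they attend to each other
    bidirectionally, while every other pair keeps the causal rule q >= k.
    """
    q = np.arange(seq_len)[:, None]  # query positions, column vector
    k = np.arange(seq_len)[None, :]  # key positions, row vector
    causal = q >= k
    in_img = lambda x: (x >= img_start) & (x < img_end)
    return (in_img(q) & in_img(k)) | causal
```

In a transformer implementation, False entries would typically be mapped to additive −∞ attention biases before the softmax. Note that, unlike a prefix-LM mask, positions outside the image span (including the system prompt) remain strictly causal here.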
5.4 Effect on Language Understanding

The preceding sections show that bidirectional attention among image tokens improves segmentation, particularly for context-starved early patches. A natural concern is whether this modification degrades the model's language capabilities. Tab. 1 compares VQA performance across nine benchmarks for causal and bidirectional variants of all three encoder configurations. CLIP and SigLIP models maintain comparable performance to their causal baselines across most benchmarks, indicating that bidirectional image attention preserves language understanding. DINOv2 shows minor regressions, consistent with its weaker text alignment observed throughout our experiments. These results suggest that bidirectional attention among image tokens offers a favorable trade-off: it alleviates context starvation for segmentation without sacrificing language capabilities for text-aligned encoders.

Table 1: VQA benchmarks: causal vs. bidirectional LLaVA-1.5 (7B) across three vision encoders. All models use Vicuna-7B as the LLM backbone. Bidirectional attention preserves language understanding with segmentation gains that vary across encoders. DINOv2 shows broader regressions, consistent with its lack of text alignment.
Vision Encoder     Attention      GQA   MMB   MME-P  MME-C  MMMU  POPE  SQA-I  TextVQA  VizWiz
CLIP ViT-L/14      Causal         62.6  66.4  1483   284    35.3  86.8  68.7   46.9     56.0
                   Bidirectional  62.7  65.5  1538   288    36.2  86.9  69.7   47.0     57.5
DINOv2 ViT-L/14    Causal         62.1  57.7  1304   324    34.6  87.2  66.1   14.0     51.4
                   Bidirectional  60.5  55.3  1247   326    32.4  85.1  66.3   13.7     45.8
SigLIP SO400M/14   Causal         61.1  63.7  1414   275    34.7  84.4  70.4   50.2     58.2
                   Bidirectional  62.1  66.9  1402   298    34.9  84.8  70.7   53.6     53.3

6 Conclusion

Our probing and intervention framework reveals that adapter-style MLLMs introduce a representation drop-off at the adapter that degrades token-level separability, but LLM layers progressively recover segmentation quality through attention-mediated self-refinement. Attention knockout experiments confirm that correctly classified tokens act as semantic anchors whose attention signals pull misclassified neighbors toward the correct label, identifying cross-token attention as the operative refinement mechanism. Causal attention creates context starvation at early image token positions, which bidirectional attention among image tokens alleviates by restoring immediate global context and enabling refinement to operate uniformly across spatial positions. These findings provide a mechanistic account of how MLLMs process visual information for segmentation, informing the design of future segmentation-capable MLLMs and, we hope, providing a step toward more interpretable multimodal systems.

Limitations. We used LLaVA-type models as a starting point and varied the vision encoder across configurations for interpreting MLLMs, but our conclusions might not generalize to significantly different architectures.
Our linear probes may underestimate the actual segmentation capacity achievable with richer task heads, and the knockout interventions block attention globally across all layers, which does not isolate the contribution of individual layers to the refinement process. Future work could explore whether these findings extend to broader task types and model architectures.

Acknowledgements

Our work was partially funded by the ERC (853489 - DEXIM) and the Alfried Krupp von Bohlen und Halbach Foundation, which we thank for their support. The authors gratefully acknowledge the scientific support and resources of the AI service infrastructure LRZ AI Systems provided by the Leibniz Supercomputing Centre (LRZ) of the Bavarian Academy of Sciences and Humanities (BAdW), funded by Bayerisches Staatsministerium für Wissenschaft und Kunst (StMWK).

References

1. Alain, G., Bengio, Y.: Understanding intermediate layers using linear classifier probes. In: ICLR Workshop (2017). https://doi.org/10.48550/arXiv.1610.01644
2. Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-VL Technical Report (Feb 2025). https://doi.org/10.48550/arXiv.2502.13923
3. Banani, M.E., Raj, A., Maninis, K.K., Kar, A., Li, Y., Rubinstein, M., Sun, D., Guibas, L., Johnson, J., Jampani, V.: Probing the 3D Awareness of Visual Foundation Models. In: CVPR (2024). https://doi.org/10.48550/arXiv.2404.08636
4.
Beyer, L., Steiner, A., Pinto, A.S., Kolesnikov, A., Wang, X., Salz, D., Neumann, M., Alabdulmohsin, I., Tschannen, M., Bugliarello, E., Unterthiner, T., Keysers, D., Koppula, S., Liu, F., Grycner, A., Gritsenko, A., Houlsby, N., Kumar, M., Rong, K., Eisenschlos, J., Kabra, R., Bauer, M., Bošnjak, M., Chen, X., Minderer, M., Voigtlaender, P., Bica, I., Balazevic, I., Puigcerver, J., Papalampidi, P., Henaff, O., Xiong, X., Soricut, R., Harmsen, J., Zhai, X.: PaliGemma: A versatile 3B VLM for transfer (Oct 2024). https://doi.org/10.48550/arXiv.2407.07726
5. Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., Lei, J., Ma, T., Guo, B., Kalla, A., Marks, M., Greer, J., Wang, M., Sun, P., Rädle, R., Afouras, T., Mavroudi, E., Xu, K., Wu, T.H., Zhou, Y., Momeni, L., Hazra, R., Ding, S., Vaze, S., Porcher, F., Li, F., Li, S., Kamath, A., Cheng, H.K., Dollár, P., Ravi, N., Saenko, K., Zhang, P., Feichtenhofer, C.: SAM 3: Segment Anything with Concepts. In: ICLR (2026). https://doi.org/10.48550/arXiv.2511.16719
6. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging Properties in Self-Supervised Vision Transformers. In: ICCV (2021). https://doi.org/10.48550/arXiv.2104.14294
7. Chen, Y.C., Li, W.H., Sun, C., Wang, Y.C.F., Chen, C.S.: SAM4MLLM: Enhance Multi-Modal Large Language Model for Referring Expression Segmentation. In: ECCV (2024). https://doi.org/10.48550/arXiv.2409.10542
8. Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian, H., Liu, Z., Gu, L., Wang, X., Li, Q., Ren, Y., Chen, Z., Luo, J., Wang, J., Jiang, T., Wang, B., He, C., Shi, B., Zhang, X., Lv, H., Wang, Y., Shao, W., Chu, P., Tu, Z., He, T., Wu, Z., Deng, H., Ge, J., Chen, K., Zhang, K., Wang, L., Dou, M., Lu, L., Zhu, X., Lu, T., Lin, D., Qiao, Y., Dai, J., Wang, W.: Expanding Performance
Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling (Dec 2024). https://doi.org/10.48550/arXiv.2412.05271
9. Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention Mask Transformer for Universal Image Segmentation. In: CVPR (2022). https://doi.org/10.48550/arXiv.2112.01527
10. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The Cityscapes Dataset for Semantic Urban Scene Understanding. In: CVPR (2016). https://doi.org/10.48550/arXiv.1604.01685
11. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: ICLR (2021). https://doi.org/10.48550/arXiv.2010.11929
12. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The Pascal Visual Object Classes (VOC) Challenge. International Journal of Computer Vision 88(2), 303–338 (Jun 2010). https://doi.org/10.1007/s11263-009-0275-4
13. Fu, S., Bonnen, T., Guillory, D., Darrell, T.: Hidden in plain sight: VLMs overlook their visual representations. In: Conference on Language Modeling (COLM) (2025). https://doi.org/10.48550/arXiv.2506.08008
14. Fu, X., Hu, Y., Li, B., Feng, Y., Wang, H., Lin, X., Roth, D., Smith, N.A., Ma, W.C., Krishna, R.: BLINK: Multimodal Large Language Models Can See but Not Perceive. In: ECCV (2024). https://doi.org/10.48550/arXiv.2404.12390
15. Geva, M., Bastings, J., Filippova, K., Globerson, A.: Dissecting Recall of Factual Associations in Auto-Regressive Language Models. In: Bouamor, H., Pino, J., Bali, K. (eds.) Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. pp. 12216–12235. Association for Computational Linguistics, Singapore (Dec 2023).
https://doi.org/10.18653/v1/2023.emnlp-main.751
16. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: ICCV (2017). https://doi.org/10.48550/arXiv.1703.06870
17. Hu, R., Rohrbach, M., Darrell, T.: Segmentation from Natural Language Expressions. In: ECCV (2016). https://doi.org/10.48550/arXiv.1603.06180
18. Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollár, P., Girshick, R.: Segment Anything. In: ICCV (2023). https://doi.org/10.48550/arXiv.2304.02643
19. Lai, X., Tian, Z., Chen, Y., Li, Y., Yuan, Y., Liu, S., Jia, J.: LISA: Reasoning Segmentation via Large Language Model. In: CVPR (2024). https://doi.org/10.48550/arXiv.2308.00692
20. Li, S., Koh, P.W., Du, S.S.: Exploring How Generative MLLMs Perceive More Than CLIP with the Same Vision Encoder. In: Annual Meeting of the Association for Computational Linguistics (ACL) (2025). https://doi.org/10.48550/arXiv.2411.05195
21. Liang, W., Zhang, Y., Kwon, Y., Yeung, S., Zou, J.: Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning. In: NeurIPS (2022). https://doi.org/10.48550/arXiv.2203.02053
22. Liang, Y., Cai, Z., Xu, J., Huang, G., Wang, Y., Liang, X., Liu, J., Li, Z., Wang, J., Huang, S.L.: Unleashing Region Understanding in Intermediate Layers for MLLM-based Referring Expression Generation. In: NeurIPS (Nov 2024), https://openreview.net/forum?id=168NLzTpw8
23. Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved Baselines with Visual Instruction Tuning. In: CVPR (2024). https://doi.org/10.48550/arXiv.2310.03744
24. Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual Instruction Tuning. In: NeurIPS (2023). https://doi.org/10.48550/arXiv.2304.08485
25. Merullo, J., Castricato, L., Eickhoff, C., Pavlick, E.: Linearly Mapping from Image to Text Space. In: ICLR (2023).
https://doi.org/10.48550/arXiv.2209.15162
26. Neo, C., Ong, L., Torr, P., Geva, M., Krueger, D., Barez, F.: Towards Interpreting Visual Information Processing in Vision-Language Models. In: ICLR (2025). https://doi.org/10.48550/arXiv.2410.07149
27. Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P.Y., Li, S.W., Misra, I., Rabbat, M., Sharma, V., Synnaeve, G., Xu, H., Jegou, H., Mairal, J., Labatut, P., Joulin, A., Bojanowski, P.: DINOv2: Learning Robust Visual Features without Supervision. TMLR (2024). https://doi.org/10.48550/arXiv.2304.07193
28. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning. pp. 8748–8763. PMLR (2021), http://proceedings.mlr.press/v139/radford21a
29. Rasheed, H., Maaz, M., Mullappilly, S.S., Shaker, A., Khan, S., Cholakkal, H., Anwer, R.M., Xing, E., Yang, M.H., Khan, F.S.: GLaMM: Pixel Grounding Large Multimodal Model. In: CVPR (2024). https://doi.org/10.48550/arXiv.2311.03356
30. Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., Mintun, E., Pan, J., Alwala, K.V., Carion, N., Wu, C.Y., Girshick, R., Dollár, P., Feichtenhofer, C.: SAM 2: Segment Anything in Images and Videos. In: ICLR (2025). https://doi.org/10.48550/arXiv.2408.00714
31. Ren, Z., Huang, Z., Wei, Y., Zhao, Y., Fu, D., Feng, J., Jin, X.: PixelLM: Pixel Reasoning with Large Multimodal Model. In: CVPR (2024). https://doi.org/10.48550/arXiv.2312.02228
32.
Tong, S., Brown, E., Wu, P., Woo, S., Middepogu, M., Akula, S.C., Yang, J., Yang, S., Iyer, A., Pan, X., Wang, Z., Fergus, R., LeCun, Y., Xie, S.: Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs. In: NeurIPS (2024). https://doi.org/10.48550/arXiv.2406.16860
33. Tong, S., Liu, Z., Zhai, Y., Ma, Y., LeCun, Y., Xie, S.: Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs. In: CVPR (2024). https://doi.org/10.48550/arXiv.2401.06209
34. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., Lample, G.: LLaMA: Open and Efficient Foundation Language Models (Feb 2023). https://doi.org/10.48550/arXiv.2302.13971
35. Wu, S., Jin, S., Zhang, W., Xu, L., Liu, W., Li, W., Loy, C.C.: F-LMM: Grounding Frozen Large Multimodal Models. In: CVPR (2025). https://doi.org/10.48550/arXiv.2406.05821
36. Xia, Z., Han, D., Han, Y., Pan, X., Song, S., Huang, G.: GSVA: Generalized Segmentation via Multimodal Large Language Models. In: CVPR (2024). https://doi.org/10.48550/arXiv.2312.10103
37. Yao, H., Wu, W., Yang, T., Song, Y., Zhang, M., Feng, H., Sun, Y., Li, Z., Ouyang, W., Wang, J.: Dense Connector for MLLMs. In: NeurIPS (2024). https://doi.org/10.48550/arXiv.2405.13800
38. Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid Loss for Language Image Pre-Training. In: ICCV. pp. 11941–11952. IEEE, Paris, France (Oct 2023). https://doi.org/10.1109/ICCV51070.2023.01100
39. Zhang, C., Cho, J., Puspitasari, F.D., Zheng, S., Li, C., Qiao, Y., Kang, T., Shan, X., Zhang, C., Qin, C., Rameau, F., Lee, L.H., Bae, S.H., Hong, C.S.: A Survey on Segment Anything Model (SAM): Vision Foundation Model Meets Prompt Engineering (Oct 2024). https://doi.org/10.48550/arXiv.2306.06211
40.
Zhang, T., Li, X., Fei, H., Yuan, H., Wu, S., Ji, S., Loy, C.C., Yan, S.: OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding. In: NeurIPS (2024). https://doi.org/10.48550/arXiv.2406.19389
41. Zhang, Y., Ma, Z., Gao, X., Shakiah, S., Gao, Q., Chai, J.: GROUNDHOG: Grounding Large Language Models to Holistic Segmentation. In: CVPR (2024). https://doi.org/10.48550/arXiv.2402.16846
42. Zhang, Y., Unell, A., Wang, X., Ghosh, D., Su, Y., Schmidt, L., Yeung-Levy, S.: Why are Visually-Grounded Language Models Bad at Image Classification? In: NeurIPS (2024). https://doi.org/10.48550/arXiv.2405.18415
43. Zhang, Z., Ma, Y., Zhang, E., Bai, X.: PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model. In: ECCV (2024). https://doi.org/10.48550/arXiv.2403.14598
44. Zhou, B., Zhao, H., Puig, X., Xiao, T., Fidler, S., Barriuso, A., Torralba, A.: Semantic Understanding of Scenes through the ADE20K Dataset. IJCV 127, 302–321 (2019). https://doi.org/10.1007/s11263-018-1140-0
45. Zou, X., Yang, J., Zhang, H., Li, F., Li, L., Wang, J., Wang, L., Gao, J., Lee, Y.J.: Segment Everything Everywhere All at Once. In: NeurIPS (2023). https://doi.org/10.48550/arXiv.2304.06718

Supplementary Materials

A Implementation Details

A.1 MLLM Training

All MLLM variants follow the LLaVA-1.5 two-stage training procedure [23] using Vicuna-7B [34] as the LLM backbone. We train six configurations: three vision encoders (CLIP ViT-L/14@336 [28], DINOv2 ViT-L/14@336 [27], SigLIP SO400M/14@384 [38]) × two attention types (causal, bidirectional). The adapter is a two-layer MLP that projects vision encoder outputs into the LLM's embedding space. CLIP and DINOv2 operate at 336 × 336 resolution, producing 24 × 24 = 576 patch tokens; SigLIP operates at 384 × 384, producing 27 × 27 = 729 patch tokens. All models are trained in bfloat16 precision.
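The adapter described above can be sketched as a plain forward pass. This is an illustrative NumPy version with random weights; the GELU activation and the initialization scale are our assumptions for illustration (in the MLLM the weights are learned in Stage 1):

```python
import numpy as np

rng = np.random.default_rng(0)

def gelu(x: np.ndarray) -> np.ndarray:
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

class MLPAdapter:
    """Two-layer MLP projecting vision-encoder patch features (d_vis)
    into the LLM embedding space (d_llm)."""

    def __init__(self, d_vis: int, d_llm: int):
        self.w1 = rng.standard_normal((d_vis, d_llm)) * 0.02
        self.b1 = np.zeros(d_llm)
        self.w2 = rng.standard_normal((d_llm, d_llm)) * 0.02
        self.b2 = np.zeros(d_llm)

    def __call__(self, patches: np.ndarray) -> np.ndarray:
        # patches: (num_patches, d_vis) -> (num_patches, d_llm)
        return gelu(patches @ self.w1 + self.b1) @ self.w2 + self.b2
```

For CLIP ViT-L/14 at 336 × 336, for instance, (336/14)² = 576 patch features would be mapped into the LLM's embedding space and prepended to the text tokens.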
Stage 1 (Adapter pretraining). The adapter is trained on 558K image-caption pairs with both the vision encoder and LLM frozen. We use AdamW with a learning rate of 1 × 10⁻³, a cosine learning rate schedule, a warmup ratio of 3%, and no weight decay; training runs for 1 epoch.

Stage 2 (Visual instruction tuning). The full model (adapter + LLM) is finetuned on 665K visual instruction data, with the vision encoder remaining frozen. We use AdamW with a learning rate of 2 × 10⁻⁵, a cosine schedule, a warmup ratio of 3%, and no weight decay; training runs for 1 epoch.

For the bidirectional attention variants (Sec. 5), both stages use the bidirectional image-token attention mask defined in Eq. (2); the causal variants use the standard causal mask throughout. All other hyperparameters are identical between causal and bidirectional conditions.

A.2 Linear Probing Protocol

We train an independent linear probe at each extraction point: the vision encoder output, the adapter output, and intermediate LLM layers. The probe is a single linear layer mapping from the hidden dimension d to K classes, trained with plain cross-entropy loss without class reweighting or data augmentation. We use the AdamW optimizer (β₁ = 0.9, β₂ = 0.999) with a learning rate of 1 × 10⁻³, polynomial learning rate decay (power 0.9), no weight decay, and a batch size of 64. Each probe is trained for 20 epochs; we track validation mIoU after each epoch and retain the best-performing checkpoint.

The per-patch logits are reshaped into a 2D grid matching the original patch layout (e.g., 24 × 24 × K for 576 patches) and bilinearly upsampled to the full image resolution. The loss is computed against the full-resolution ground-truth segmentation mask, allowing it to reflect sub-patch boundary information through the interpolated logits.
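The reshape-and-upsample step can be sketched as follows. We hand-roll a small bilinear interpolation for self-containment; in practice a library routine such as `torch.nn.functional.interpolate` would be used, and the function names and pixel-center sampling convention here are our assumptions:

```python
import numpy as np

def bilinear_upsample(grid: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Bilinearly upsample a (h, w, K) logit grid to (out_h, out_w, K)."""
    h, w, _ = grid.shape
    # Sample at pixel centers (align_corners=False convention).
    ys = (np.arange(out_h) + 0.5) * h / out_h - 0.5
    xs = (np.arange(out_w) + 0.5) * w / out_w - 0.5
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 1)
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = np.clip(ys - y0, 0.0, 1.0)[:, None, None]
    wx = np.clip(xs - x0, 0.0, 1.0)[None, :, None]
    top = grid[y0][:, x0] * (1 - wx) + grid[y0][:, x1] * wx
    bot = grid[y1][:, x0] * (1 - wx) + grid[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def probe_segmentation(hidden: np.ndarray, W: np.ndarray, b: np.ndarray,
                       grid_hw: tuple, out_hw: tuple) -> np.ndarray:
    """Linear probe -> 2D logit grid -> full-resolution class map.

    hidden: (num_patches, d) frozen features; W: (d, K); b: (K,).
    """
    h, w = grid_hw
    logits = (hidden @ W + b).reshape(h, w, -1)  # e.g., (24, 24, K)
    full = bilinear_upsample(logits, *out_hw)    # e.g., (336, 336, K)
    return full.argmax(-1)
```

During training the cross-entropy is taken between the upsampled logits and the full-resolution mask, rather than against a downsampled mask, which is what lets the probe see sub-patch boundary information.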
Images are processed in a single forward pass without sliding-window or multi-scale evaluation. The linear probe is trained exclusively on the training split, and all reported metrics are computed on the held-out validation split, ensuring that the results reflect the generalization quality of the frozen representations rather than memorization by the probe. We keep these training conditions across all layers and all experiments.

Fig. 9: Layerwise linear probing on Cityscapes (left) and PASCAL VOC (right). mIoU across the MLLM stack for CLIP, DINOv2, and SigLIP encoders paired with Vicuna-7B.

A.3 Attention Knockout Setup

For the attention knockout experiments (Sec. 4), we use the per-layer linear probes trained in the standard probing experiment. We then identify the token set B_c corresponding to the class to be blocked and re-run the forward pass with the attention mask modified according to Eq. (1), extracting hidden states at every layer for evaluation with the same frozen probes. The knockout is implemented by registering forward pre-hooks on each LLM layer's self-attention module that set the corresponding pre-softmax attention logits to −∞ before the softmax computation. This ensures that any change in segmentation quality is attributable to the attention intervention.

B Extended Probing Results

B.1 Layerwise Probing on PASCAL VOC and Cityscapes

The main paper reports layerwise linear probing results on ADE20K (Fig. 3). Fig. 9 shows the corresponding results on Cityscapes and PASCAL VOC 2012. The progressive recovery across LLM layers is consistent across all three datasets and all three encoders, confirming that this pattern is not specific to ADE20K. DINOv2 and SigLIP also exhibit a consistent adapter drop-off across all datasets.

B.2 Causal vs. Bidirectional Probing

Tab. 2 summarizes the causal vs.
bidirectional comparison on ADE20K, complementing the layerwise curves in Fig. 8.

Table 2: Causal vs. bidirectional attention: layerwise probing on ADE20K. Adapter and Peak LLM report mIoU (%). ∆enc: Peak LLM mIoU minus vision encoder baseline (positive = surpasses encoder). Recovery: Peak LLM mIoU minus adapter mIoU.

Encoder             Attention      Adapter  Peak LLM  ∆enc    Recovery
CLIP ViT-L/14@336   Causal         33.22    40.74     +6.36   +7.52
                    Bidirectional  32.68    41.16     +6.78   +8.48
DINOv2 Large        Causal         42.90    44.50     −1.08   +1.60
                    Bidirectional  40.47    44.82     −0.76   +4.35
SigLIP SO400M/14    Causal         31.44    35.39     −3.31   +3.95
                    Bidirectional  32.39    41.26     +2.56   +8.87

Table 3: Context starvation under causal attention. Per-patch pixel accuracy for the first three tokens. ∆: bidirectional minus causal accuracy. % Impr.: relative improvement over causal. Early patches suffer severe context starvation under causal masking.

         Causal   Bidirectional  ∆        % Impr.
Patch 0  0.6195   0.7630         +0.1435  +23.2%
Patch 1  0.6851   0.7933         +0.1082  +15.8%
Patch 2  0.7069   0.7931         +0.0862  +12.2%

B.3 Context Starvation Per-Patch Accuracy

Tab. 3 reports per-patch pixel accuracy and relative improvement for the first three image tokens under causal and bidirectional attention, quantifying the context starvation effect discussed in Sec. 5.1.

B.4 Generalization Across MLLM Architectures

Fig. 3 in the main paper varies the vision encoder while keeping the LLM fixed (Vicuna-7B in the LLaVA-1.5 framework). To test whether the drop-off and recovery pattern generalizes beyond LLaVA-1.5, we repeat the layerwise linear probing experiment with two additional MLLMs: LLaVA-OneVision and DeepSeek-VL. Each model uses a different vision encoder: LLaVA-1.5 uses CLIP ViT-L/14, OneVision uses SigLIP, and DeepSeek-VL uses concatenated SigLIP and SAM-B embeddings. Fig. 10 reports ∆mIoU relative to the adapter output on ADE20K.
All three architectures exhibit the same qualitative pattern: a representation drop-off at the adapter followed by progressive recovery across LLM layers. Note that the curves span different numbers of layers because each MLLM uses a different LLM backbone with a different depth. These results confirm that the drop-off and self-refinement mechanism documented in Sec. 3 is not specific to a single MLLM but extends across different adapter-style architectures.

Fig. 10: Layerwise probing across different MLLMs. ∆mIoU relative to adapter output on ADE20K for LLaVA-1.5 (7B), LLaVA-OneVision (7B), and DeepSeek-VL (7B), each using its default vision encoder (CLIP for LLaVA-1.5, SigLIP for OneVision, SigLIP+SAM-B for DeepSeek-VL). Curves span different numbers of layers due to differences in LLM backbone depth. The adapter drop-off and LLM self-refinement pattern is consistent across architectures.

C Extended Attention Knockout Experiments

The main paper demonstrates the attention knockout analysis on individual images with a single class-confusion pair (ceiling vs. sky). Here we extend this analysis using the CLIP MLLM to more misclassified classes (e.g., sky, ceiling, and grass) and aggregate the results across many images to provide quantitative evidence that the semantic anchor mechanism generalizes.

C.1 Aggregate Knockout Metric

To quantify the knockout effect at scale, we track how many incorrectly predicted patches persist across layers. The metric below tracks predictions at every layer. For a given misclassified class c (e.g., sky), we classify all 576 patches at layer 0 using the layer-0 probe and count how many are incorrectly predicted as c, yielding n_c^(0). At each subsequent layer ℓ, we count the number of patches still predicted as c, yielding n_c^(ℓ). The rate for that image at layer ℓ is n_c^(ℓ) / n_c^(0). We average this ratio across all selected images.
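The persistence rate defined above can be computed directly from per-layer probe predictions. A minimal sketch, assuming predictions and ground truth are flat (num_patches,) class arrays per image (function names are ours); note that n_c^(ℓ) counts all patches incorrectly predicted as c at layer ℓ, so the ratio can exceed 100% when errors are amplified:

```python
import numpy as np

def n_incorrect(preds: np.ndarray, gt: np.ndarray, c: int) -> int:
    """Number of patches incorrectly predicted as class c."""
    return int(((preds == c) & (gt != c)).sum())

def remaining_rate(preds_layer: np.ndarray, preds_layer0: np.ndarray,
                   gt: np.ndarray, c: int):
    """n_c(l) / n_c(0) for one image; None if it never shows the error."""
    n0 = n_incorrect(preds_layer0, gt, c)
    if n0 == 0:
        return None
    return n_incorrect(preds_layer, gt, c) / n0

def aggregate_rate(per_image_rates) -> float:
    """Average per-image ratios, skipping images without the misclassification."""
    rates = [r for r in per_image_rates if r is not None]
    return float(np.mean(rates)) if rates else float("nan")
```

Averaging this ratio per layer under the three conditions (no knockout, knockout incorrect, knockout correct) yields the curves reported in Fig. 11.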
A value of 100% at layer 0 means all initially misclassified patches are still present; lower values indicate the model is correcting them. We then take the ground-truth class across all misclassified patches in the image as the dominant ground-truth class; this is the class that gets blocked in the knockout-correct condition.

C.2 Quantitative Results

Fig. 11 reports the aggregate knockout metric for sky, ceiling, and grass. Across all three classes, blocking the incorrect class accelerates self-correction, while blocking the correct class impairs it. These results show quantitatively that correctly classified tokens act as semantic anchors driving self-refinement.

C.3 Qualitative Examples

Fig. 12 shows representative per-image knockout visualizations for each of the three misclassified classes, illustrating the layerwise segmentation under all three conditions.

D Extended Qualitative Results

D.1 Layerwise Segmentation Predictions

Fig. 13 shows additional ADE20K validation images with linear probe predictions at the vision encoder output, the adapter output, and an intermediate LLM layer, extending the qualitative analysis from Fig. 4.
Fig. 11: Aggregate attention knockout across three misclassified classes. Percentage of incorrectly predicted patches remaining across LLM layers, normalized to 100% at layer 0 and averaged over all images exhibiting the misclassification. Panels: 'sky' (615 images), 'ceiling' (73 images), and 'grass' (100 images), each comparing no knockout, knockout incorrect, and knockout correct (GT class). Blocking the incorrect class (red) accelerates self-correction; blocking the correct class (blue) impairs it and can amplify errors beyond the initial count.

Fig. 12: Qualitative attention knockout examples. Layerwise segmentation predictions under three conditions (rows: no knockout, knockout incorrect, knockout correct) for three class-confusion pairs. Top: sky vs. windowpane. Middle: ceiling vs. wall. Bottom: grass vs. road. Blocking the incorrect class produces cleaner segmentation maps at earlier layers, while blocking the correct class leaves errors unresolved or amplified.

Fig. 13: Extended qualitative segmentation predictions across the MLLM stack.
Seven ADE20K validation images showing, from left to right: the input image, ground truth, and linear probe predictions at the vision encoder output, the adapter output, and an intermediate LLM layer. The representation drop-off at the adapter is visible as increased noise and class confusion compared to the vision encoder output, while deeper LLM layers produce more spatially coherent predictions.
