From Drop-off to Recovery: A Mechanistic Analysis of Segmentation in MLLMs


Authors: Boyong Wu, Sanghwan Kim, Zeynep Akata

Boyong Wu (1,2), Sanghwan Kim (1,2,3), and Zeynep Akata (1,2,3)
(1) Technical University of Munich  (2) Helmholtz Munich  (3) Munich Center for Machine Learning (MCML)

Fig. 1: Overview of main findings. (a) Layerwise linear probing on ADE20K: the adapter introduces a representation drop-off, but LLM layers progressively recover segmentation quality. (b) Attention knockout on conflicting class pairs: knocking out attention from correctly classified tokens degrades segmentation, confirming that cross-token self-refinement is driven by semantic anchors. (c) Per-token pixel accuracy at an intermediate LLM layer: causal attention starves early-position tokens of semantic anchors, while bidirectional attention among image tokens alleviates this bottleneck.

Abstract. Multimodal Large Language Models (MLLMs) are increasingly applied to pixel-level vision tasks, yet their intrinsic capacity for spatial understanding remains poorly understood. We investigate segmentation capacity through a layerwise linear probing evaluation across the entire MLLM pipeline: vision encoder, adapter, and LLM. We further conduct an intervention-based attention knockout analysis to test whether cross-token attention progressively refines visual representations, and an evaluation of bidirectional attention among image tokens on spatial consistency. Our analysis reveals that the adapter introduces a segmentation representation drop-off, but LLM layers progressively recover it through attention-mediated refinement, where correctly classified tokens steer misclassified neighbors toward the correct label. At early image token positions, this recovery is bounded by causal attention, which bidirectional attention among image tokens alleviates.
These findings provide a mechanistic account of how MLLMs process visual information for segmentation, informing the design of future segmentation-capable models.

1 Introduction

Detecting and segmenting content in visual scenes is foundational for applications in robotics, AR/VR, autonomous driving, and medical imaging [9, 16, 39]. Vision Transformers pretrained with contrastive image-text objectives or self-supervised distillation have emerged as powerful feature extractors: their patch-level embeddings capture rich dense semantics and transfer strongly to segmentation when paired with lightweight decoders or linear probes [3, 27, 28]. More recently, large-scale pretraining has yielded specialized vision encoders such as SAM [5, 18, 30] that further push segmentation performance. In parallel, Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in multimodal reasoning, instruction following, and language-conditioned control [2, 8, 23]. Among MLLMs, adapter-style architectures are notable for their simplicity and have demonstrated state-of-the-art performance by combining a pretrained vision encoder with a pretrained Large Language Model (LLM) through a learned adapter network that maps vision embeddings into the LLM's token space [23-25]. These capabilities have motivated a growing body of work that adapts MLLMs for pixel-level semantic segmentation tasks, arguing that language supervision can improve interpretability and enable compositional segmentation driven by Referring Expression Segmentation, which allows segmentation through instruction following beyond what traditional segmentation models support [17, 19].
While recent works have proposed MLLM-based methods specialized for semantic segmentation [7, 31, 36, 40], it remains unclear whether the MLLM as a whole is actually better at image segmentation than its underlying vision encoder under a controlled probing setup. Recent analyses suggest that, without task-specific tuning, MLLMs may underperform on classical vision tasks and may overlook visual evidence that is already present in their vision backbones [13, 14, 33, 42].

To address this gap in the literature, we propose a systematic, intervention-driven framework to dissect where segmentation competence arises or degrades across the MLLM stack. First, we perform linear probing of frozen embeddings at the vision encoder, the adapter, and every intermediate LLM layer (Sec. 3). This reveals a representation drop-off at the adapter: projecting into the LLM embedding space trades fine-grained spatial fidelity for cross-modal alignment, degrading token-level separability. However, downstream LLM layers progressively self-refine these visual representations and gradually recover segmentation quality. This recovery raises the question of whether it is actively driven by cross-token attention or is merely a byproduct of residual connections and normalization. To further study these mechanisms, we design attention knockout interventions on conflicting class pairs (e.g., ceiling vs. sky), which suppress specific attention pathways while leaving all other operations unchanged (Sec. 4). We find that correctly classified tokens act as semantic anchors whose attention signals pull misclassified neighbors toward the correct label, indicating that cross-token attention drives the observed self-refinement.
Finally, under standard causal attention, early image tokens can only attend to preceding tokens in sequence order and thus lack global spatial context, limiting self-refinement at those positions. We ablate this context starvation by introducing bidirectional attention exclusively among image tokens and by varying the vision encoder (Sec. 5). Bidirectional attention eliminates the positional bias at early patches, and the degree of recovery depends on the compatibility between the vision encoder's representation space and the LLM's embedding space, with text-aligned encoders (e.g., CLIP, SigLIP) benefiting more than vision-only ones (e.g., DINOv2).

Our contributions can be summarized as follows (Fig. 1):
- We reveal a representation drop-off at the adapter and progressive self-refinement across LLM layers by performing layerwise linear probing that analyzes segmentation competence across the MLLM stack.
- Our attention knockout experiments show that self-refinement is driven by cross-token attention, with correctly classified tokens acting as semantic anchors that guide their misclassified neighbors.
- We further find that applying bidirectional attention alleviates context starvation at early image tokens and that text-aligned encoders benefit more from LLM-layer refinement than vision-only encoders.

Across experiments, we observe that LLM layers provide structure- and constraint-aware refinement that improves local consistency and resolves certain class conflicts, but they do not recover fine-grained spatial details absent from the encoder. These findings clarify when and where MLLMs aid semantic segmentation and where they fall short, offering practical guidance for the design of segmentation-capable MLLMs.

2 Related Work

Vision Encoders for Segmentation.
Vision Transformers pretrained at scale with supervised, contrastive, or self-supervised objectives have become powerful feature extractors for dense prediction tasks [6, 11, 28, 30, 38]. Notably, patch-level features from these models already transfer strongly to segmentation and other dense tasks when decoded with lightweight heads or linear probes [3, 27], establishing that rich spatial information is present in the encoder representations themselves. Our work builds on this probing paradigm but extends it beyond the vision encoder in isolation: we probe representations at every stage of the full MLLM pipeline, from the encoder through the adapter and into each LLM layer, to understand whether the downstream components preserve, degrade, or enhance the spatial information already present in the encoder features.

MLLMs for Segmentation. Adapter-style MLLMs connect a pretrained vision encoder to a pretrained LLM through a learned projection [24, 25, 37], enabling multimodal reasoning without training either component from scratch. A growing line of work adapts this architecture for segmentation, leveraging instruction following and language-conditioned control to produce pixel-level outputs. A common strategy introduces special tokens that bridge language reasoning with mask prediction. LISA [19] pioneered this by connecting an MLLM to SAM via a learned [SEG] token, enabling reasoning segmentation from complex language instructions. GSVA [36] extended this to multi-target and empty-target cases with multiple [SEG] and [REJ] tokens. PixelLM [31] takes a SAM-free approach, generating multi-scale segment tokens decoded by a lightweight head. Other recent notable efforts include GLaMM [29], SAM4MLLM [7], OMG-LLaVA [40], and others [32, 35, 41, 43, 45], each proposing architectural variants for grounding or segmentation.
While these works report competitive segmentation results, they focus on system-level performance and do not investigate where within the MLLM stack segmentation competence actually resides. Our work complements this line of research by providing a diagnostic analysis of the representations these architectures produce.

Diagnostic Analyses of MLLM Representations. Recent analyses caution that MLLMs may underperform on standard vision tasks without task-specific finetuning. Zhang et al. [42] examine the performance of MLLMs on classification tasks and find that they can overlook visual evidence encoded by their own vision backbone. Other works in this area have shown that MLLMs substantially underperform their vision encoders on vision-centric tasks such as depth estimation and correspondence, identifying the LLM as the primary bottleneck; that analysis probes representations across the VLM but focuses on general perceptual tasks and does not perform causal interventions [13]. Conversely, Li et al. [20] show that generative MLLMs can extract visual information more effectively than CLIP from the same frozen encoder, suggesting a more nuanced picture. Liang et al. [22] demonstrate that intermediate MLLM layers can hold richer region-level descriptions than the final layer. We build on these findings but focus specifically on semantic segmentation, introduce attention knockout experiments to test whether LLM layers actively resolve ambiguity, and ablate architectural choices such as attention directionality and vision encoder selection. Our work contributes a segmentation-focused, causal-intervention view, complementing classification- and correspondence-oriented findings.

3 Layerwise Linear Probing

Models and notation.
Given an input image, the vision encoder divides it into a grid of T non-overlapping patches (e.g., T = 576 for a 336 × 336 image with patch size 14) and maps them to a sequence of token embeddings. The adapter, a learned two-layer MLP, projects each token embedding into the LLM's d-dimensional input space. Within each LLM layer, the full input sequence contains system tokens, image tokens, and text prompt tokens. We denote by X^(ℓ) ∈ R^(T×d) the representations corresponding to the T image patch tokens only at layer ℓ. Under standard causal attention, image tokens can only attend to preceding tokens in the sequence (i.e., system tokens and earlier image patches) and are therefore not influenced by the text prompt that follows them. Since each token corresponds to a fixed spatial position in the image, per-token predictions can be reshaped into a 2D grid and upsampled to the original resolution for pixel-level evaluation.

Fig. 2: Overview of the three analysis methods. (a) Layerwise linear probing: given an input image, the vision encoder produces patch token embeddings (brown), which are projected by the adapter into the LLM's embedding space. Inside the LLM, image tokens are processed jointly with text prompt tokens (yellow). At each layer ℓ, we extract only the image token representations and train an independent linear probe to predict per-patch semantic classes, reassembled into a 2D segmentation map. (b) Attention knockout: we selectively block attention to incorrectly classified tokens (left) or correctly classified tokens (right) across all LLM layers, testing whether cross-token attention drives self-refinement. (c) Bidirectional attention mask: image tokens attend to each other bidirectionally while all other token pairs retain causal masking, alleviating context starvation at early image positions.

Datasets. We evaluate on three standard segmentation benchmarks: ADE20K (150 classes, diverse indoor and outdoor scenes) [44]; PASCAL VOC 2012 (20 foreground classes, augmented training set) [12]; and Cityscapes (urban street scenes) [10]. We use standard train/validation splits and report mIoU as the primary metric and pixel accuracy (pAcc) as a secondary measure.

Probing protocol. We introduce a probing protocol to systematically evaluate where segmentation competence arises, degrades, or is refined across the MLLM stack. Our framework targets adapter-style MLLMs, which combine a frozen vision encoder, a trained adapter, and an LLM. This modular structure allows us to isolate and compare representations at three stages: the vision encoder output, the adapter output, and each intermediate LLM layer. To evaluate the segmentation quality of representations at each stage of the MLLM, we train an independent linear probe per layer [1]. The procedure is identical across all stages for the vision encoder, adapter, and LLM layers, and differs only in which hidden state is extracted and the dimension of that hidden state.
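Since each image token maps to a fixed patch position, the per-token probe predictions can be reassembled into a 2D label map and upsampled for pixel-level evaluation. A minimal sketch of that reshape-and-upsample step, using pure Python on a toy 2 × 2 patch grid (the function names are illustrative, not from the paper's codebase):

```python
# Reassemble per-token class predictions into a 2D label grid, then
# upsample by nearest-neighbor replication of each patch label.
# In the paper's setting, grid_size = 24 gives T = 576 tokens for a
# 336x336 image with patch size 14; here we use a tiny toy example.

def tokens_to_grid(preds, grid_size):
    """Row-major reshape of T = grid_size**2 per-token labels."""
    assert len(preds) == grid_size * grid_size
    return [preds[r * grid_size:(r + 1) * grid_size] for r in range(grid_size)]

def upsample_nearest(grid, patch_size):
    """Replicate each patch label over a patch_size x patch_size pixel block."""
    out = []
    for row in grid:
        pixel_row = [label for label in row for _ in range(patch_size)]
        out.extend([pixel_row[:] for _ in range(patch_size)])
    return out

# Toy example: a 2x2 "image" of patch labels, each patch covering 3x3 pixels.
grid = tokens_to_grid([0, 1, 2, 3], grid_size=2)
pixels = upsample_nearest(grid, patch_size=3)
```

Real pipelines would do this with tensor reshapes and interpolation, but the mapping from token order to spatial position is the same.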
For a target layer ℓ, we freeze the entire MLLM and extract X^(ℓ) for every image in the training set. Each token is independently classified by a linear probe trained with cross-entropy on the frozen features. Additional training and implementation details are provided in the supplementary material.

We train three MLLM variants by pairing Vicuna-7B [34] with CLIP ViT-L/14@336, DINOv2 Large@336, and SigLIP SO400M/14, each following the standard LLaVA-1.5 two-stage procedure [23]: pretraining the adapter on 558K image-caption pairs, then finetuning the full model on 665K visual instruction data. By comparing mIoU across layers under identical probe training conditions, we obtain a complete profile of how segmentation-relevant information evolves from the vision encoder through the adapter and into the LLM.

3.1 The Adapter Introduces a Representation Drop-off

We first compare the segmentation quality of features immediately before and after the adapter. Fig. 3 reports the mIoU for three vision encoders at the encoder output and the adapter output. Across all encoders, we observe a consistent drop in mIoU after the adapter projects the visual features into the LLM's embedding space. The magnitude of this drop varies: CLIP experiences a modest decline, while DINOv2 and SigLIP suffer larger degradations. This representation drop-off indicates that the adapter introduces a structural bottleneck, trading fine-grained spatial fidelity for cross-modal alignment with the language embedding space. Analogous representation gaps have been documented in contrastive vision-language spaces, where image and text embeddings occupy geometrically separated regions [21], while Fu et al. [13] show that VLMs can underperform their own vision encoders on tasks such as correspondence.
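The layerwise comparisons above score each probe with mIoU, the per-class intersection-over-union averaged over classes present in either map. A minimal sketch of that metric on flat label lists (toy inputs; a real evaluation accumulates a confusion matrix over the whole validation set):

```python
# mIoU: for each class c, IoU = |pred==c AND gt==c| / |pred==c OR gt==c|,
# averaged over classes that appear in prediction or ground truth.

def mean_iou(pred, gt, num_classes):
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

def pixel_accuracy(pred, gt):
    """Secondary metric: fraction of positions labeled correctly."""
    return sum(1 for p, g in zip(pred, gt) if p == g) / len(gt)

# Toy flattened label maps with three classes.
pred = [0, 0, 1, 1, 2, 2]
gt   = [0, 0, 1, 2, 2, 2]
```

On this toy pair, class 0 scores IoU 1.0, class 1 scores 0.5, class 2 scores 2/3, so mIoU is their average and pAcc is 5/6.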
3.2 LLM Layers Progressively Recover Segmentation Quality

Segmentation quality does not continue to degrade as features propagate through the LLM. On the contrary, as shown in Fig. 3, mIoU steadily recovers across the LLM layers. The recovery is characterized by a sharp increase in mIoU in the early LLM layers, followed by a plateau in the mid-to-late layers where peak performance is reached. This is a notable finding: the LLM layers, which were pretrained for language modeling and not for spatial reasoning, are able to refine the visual representations and restore segmentation-relevant structure that was lost at the adapter.

The drop-off and recovery pattern is consistent across all three encoders, but the strength of recovery varies. CLIP, which is pretrained with a contrastive objective that aligns visual features to text, exhibits the strongest recovery: its LLM-layer representations ultimately exceed the vision encoder baseline. SigLIP, which shares a similar contrastive pretraining objective, also shows meaningful recovery. DINOv2, a vision-only encoder with no text alignment, recovers less strongly. This suggests that the degree of compatibility between the vision encoder's representation space and the LLM's embedding space influences how effectively the LLM layers can refine the visual features.

Fig. 3: Layerwise linear probing results across the MLLM stack. mIoU on ADE20K for CLIP, DINOv2, and SigLIP encoders paired with Vicuna-7B, measured at the vision encoder output, adapter output, and each LLM layer. All three encoders exhibit a drop at the adapter followed by progressive recovery across LLM layers. Dashed lines mark the best-performing layer; values on the right indicate the total mIoU improvement across LLM layers.

3.3 Qualitative Evidence and Semantic Clustering

Fig. 4 shows segmentation predictions from the linear probe at different stages of the MLLM for representative ADE20K validation images. At the adapter output, predictions exhibit noisy boundaries and more frequent confusion between classes. At deeper LLM layers, these errors are progressively resolved: boundaries become more spatially coherent and class assignments stabilize. This visual evidence complements the quantitative mIoU improvements and suggests that the LLM layers perform a form of contextual refinement, leveraging global token interactions and semantics to re-impose structure, disambiguate local patch-level predictions, and yield net gains for specific conflicts.

Fig. 4: Qualitative segmentation predictions across the MLLM stack. From left to right: input image, ground truth, and linear probe predictions at the vision encoder output, adapter output, and an intermediate LLM layer. The representation drop-off appears at the adapter, but deeper layers resolve class confusions (e.g., wall vs. bed) and produce more spatially coherent predictions.

To further investigate how the LLM layers refine visual representations, we visualize the hidden states of individual image patches using UMAP projections at different depths of the model. Fig. 5 shows the 2D embeddings of all 576 patch tokens from a single image of the CLIP MLLM at the adapter output, an intermediate LLM layer, and the layer at which linear probing performance peaks. Each patch is colored by its semantic class; classes not among the four most prevalent are shown in gray. At the adapter output, patches from different semantic classes are heavily interleaved, confirming that the projected features lack clear category-level organization. As representations pass through the LLM, same-class patches progressively cluster together, and by layer 20, distinct semantic groups such as floor, ceiling, wall, and building occupy clearly separated regions of the embedding space; wall and building cluster close together because of semantic similarity. This provides a complementary, geometric perspective on the recovery and corroborates the mIoU improvements observed in the linear probing results and qualitative results: the LLM does not merely make features more linearly separable, but actively re-organizes them into semantically coherent clusters.

Fig. 5: UMAP projections of patch-level hidden states across the MLLM stack. Each point represents one of 576 image patch tokens, colored by semantic class. At the adapter output, classes are interleaved; by layer 20, same-class patches form distinct clusters, illustrating the progressive emergence of semantic structure through the LLM layers.

4 Attention Knockout

The layerwise probing results from Sec. 3.1 show that segmentation quality improves across LLM layers, suggesting that the model progressively resolves classification errors. To test whether this self-refinement is actively mediated by cross-token attention rather than being a passive byproduct of residual connections or layer normalization, we design a global attention knockout experiment. We adapt the attention knockout technique introduced by Geva et al. [15] for tracing information flow in auto-regressive language models, and applied to MLLMs by Neo et al. [26] to study how visual information is extracted at the output position. While both prior works use knockout to identify which tokens inform the model's generated text output, we repurpose the technique to probe a different question: whether cross-token attention among image patch tokens themselves drives the self-refinement of spatial representations across LLM layers.

Procedure. Given a test image, we first obtain predicted labels ŷ_t for each image patch token t. We select a target class c to block and identify the corresponding token set B_c = { s ∈ I : ŷ_s = c }, where I denotes the full set of image tokens. At every LLM layer ℓ, we mask out all attention from any image token to any token in B_c by setting the pre-softmax attention logits to −∞:

A^(ℓ)_{t←s} ← −∞   ∀ t ∈ I, s ∈ B_c.   (1)

This renders class c completely invisible: no image token, including tokens of class c themselves, can attend to the blocked tokens. Blocked tokens can still attend to all non-blocked tokens, so their representations continue to evolve, but without any self-reinforcement from same-class neighbors. We then extract hidden states at every layer and evaluate segmentation via the same per-layer linear probes used in the earlier experiment.

Experimental conditions. We focus on images where the unmodified model exhibits characteristic class confusions, e.g., patches of ceiling misclassified as sky. For each such image, we run two complementary conditions and compare the resulting layerwise segmentation against the unmodified model (Fig. 2b):

1. Block incorrect class. We block the class that the model incorrectly assigns to some tokens (e.g. block sky when the ground truth is ceiling). If attention to incorrectly classified tokens reinforces errors, removing their influence should accelerate self-correction through global context.
2. Block correct class. We block the class that is correctly assigned (e.g. block ceiling).
If correctly classified tokens serve as semantic anchors in a global context that pull misclassified neighbors toward the right label, removing them should impair self-correction.

4.1 Blocking the Incorrect Class Accelerates Self-Correction

We select ADE20K validation images where the unmodified model exhibits characteristic class confusions, such as ceiling patches misclassified as sky, and compare layerwise segmentation maps under the two blocking conditions against the unmodified model.

When we block the incorrectly assigned class (e.g. making sky invisible in an image where the ground truth is ceiling), misclassified tokens are corrected faster across layers compared to the unmodified model. Already in the early LLM layers, the segmentation maps show reduced confusion in the affected region, and by the final layer the model with incorrect classes knocked out produces fewer residual misclassifications than the unmodified model. This indicates that tokens carrying the incorrect label were actively reinforcing erroneous predictions through attention: by silencing them, the remaining contextual cues from correctly classified neighbors (e.g. walls, floors, and other indoor elements) dominate the attention field, enabling the model to resolve the ambiguity more efficiently.

Fig. 6: Global attention knockout: layerwise segmentation comparison. Top row: unmodified baseline. Middle row: incorrect class blocked (e.g. sky). Bottom row: correct class blocked (e.g. ceiling). Blocking the incorrect class accelerates the resolution of misclassified patches across layers, while blocking the correct class impairs self-correction and leaves more errors in mid-to-late layers.
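The knockout of Eq. (1) amounts to forcing the pre-softmax logits toward blocked key positions to −∞, after which softmax assigns those positions exactly zero weight. A minimal sketch in pure Python (the logit values and token indices are toy inputs, not from the experiments):

```python
import math

# Attention knockout, as in Eq. (1): every image token's logit toward a
# key position s in the blocked set B_c is set to -inf; softmax over the
# masked row then gives those columns zero attention weight. Note that
# rows belonging to blocked tokens are masked too (t ranges over all of I),
# so blocked tokens also cannot attend to each other.

def knockout(logits, blocked):
    """Return a copy of the logit matrix with blocked key columns at -inf."""
    return [[-math.inf if s in blocked else v for s, v in enumerate(row)]
            for row in logits]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]  # exp(-inf) == 0.0
    z = sum(exps)
    return [e / z for e in exps]

# Toy example: 4 image tokens; tokens 1 and 3 were predicted as class c.
logits = [[0.2, 1.0, -0.3, 0.5] for _ in range(4)]
blocked = {1, 3}
weights = [softmax(row) for row in knockout(logits, blocked)]
```

After the intervention, each row still sums to one, but all attention mass is redistributed over the non-blocked tokens, mirroring how the remaining contextual cues dominate the attention field.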
4.2 Blocking the Correct Class Impairs Self-Correction

The complementary condition reveals the opposite effect. When we block the correctly assigned class (e.g. making ceiling invisible), the model's ability to self-correct deteriorates markedly. Misclassified tokens persist longer across the layer progression, and in the mid-to-late layers the segmentation quality falls below that of the unmodified model, with more misclassified patches remaining than when no intervention is applied. This demonstrates that correctly classified tokens serve as semantic anchors: their attention signals help pull misclassified neighbors toward the correct label. Without these anchors, the model loses its primary self-correction mechanism and errors are left unresolved or even amplified.

4.3 Cross-Token Attention Drives Self-Refinement

Together, these two conditions provide direct evidence that the representation refinement documented in Sec. 3.1 is not a passive byproduct of residual connections or layer normalization, but is actively mediated by cross-token attention. The LLM layers leverage semantic context, attending to tokens of the correct class, to progressively resolve local classification errors. However, because self-correction relies on the presence of correctly classified anchors, this mechanism cannot recover fine-grained spatial details absent from the encoder features or overcome systematic encoder biases where no correct anchors exist. These findings confirm the self-refinement hypothesis and identify cross-token attention as its operative mechanism.

5 Causal vs. Bidirectional Attention

The preceding experiments established that cross-token attention drives self-refinement across LLM layers (Secs. 3.1 and 4.1).
However, under standard causal attention, image tokens are processed in raster order (left-to-right, top-to-bottom) and each token can only attend to preceding tokens in the sequence. The first image token cannot see any other image tokens, while the last token attends to all T image patches. This creates a structural asymmetry in which early tokens lack global spatial context, a limitation particularly relevant for segmentation, where every patch should ideally access scene-level information. We observed this effect in our layerwise probing experiments: tokens in the first row, and especially the top-left corner, are often misclassified in the qualitative visualizations and exhibited consistently lower classification accuracy that did not improve across LLM layers.

Image-only bidirectional attention. To test whether this positional bias limits segmentation quality, we modify the attention mask to grant bidirectional attention exclusively among image tokens while preserving causal attention for text generation (Fig. 2c). Let I denote the set of image token positions within the input sequence. The modified mask function is:

M(q, k) = (q ∈ I ∧ k ∈ I) ∨ (q ≥ k),   (2)

where q and k index the query and key positions in the full input sequence, the first clause grants bidirectional image-image attention, and the second is the standard causal constraint. When both tokens lie within the image region, attention is permitted regardless of their relative position; for all other pairs, the causal constraint applies. This design differs from the prefix-LM strategy employed by PaLiGemma [4], which grants bidirectional attention across all input tokens: image, system prompt, and task prefix alike. Our variant targets spatial self-refinement among image tokens specifically, isolating its effect on segmentation while preserving the sequential structure required for autoregressive text generation.

Training. We follow the standard LLaVA-1.5 two-stage procedure: (1) pretraining the adapter on 558K image-caption pairs with the LLM frozen, and (2) finetuning the full model on 665K visual instruction data. Both stages use the same attention type, either fully causal or fully bidirectional, so that the comparison reflects the cumulative effect of attention directionality across the entire training process. All MLLM hyperparameters, training data, and vision encoders (CLIP, DINOv2, SigLIP) are identical between the two conditions; only the attention mask differs.

Fig. 7: Context starvation under causal attention. Per-patch pixel accuracy for the first 50 tokens at an LLM layer of the MLLM. Accuracy gap: bidirectional minus causal accuracy. Early patches suffer severe context starvation under causal masking.

5.1 Early Tokens Suffer Context Starvation

We compare per-patch classification accuracy between the causal and bidirectional CLIP MLLM models at the same layer across the ADE20K validation set. To isolate the positional effect of the attention mask from the content-distribution bias shared by both models, we report the difference in pixel accuracy between the bidirectional and causal attention mechanisms evaluated at each patch position. Fig. 7 reports accuracy for the first 50 patch tokens in patch order. The gap is striking: patch 0, which under causal attention attends to no visual context, shows a +14.35 percentage-point pixel accuracy increase, a 23.2% relative improvement under bidirectional attention; the gap decays at later image patches.

5.2 Bidirectional Attention Sustains Recovery Across Layers

Fig. 8 compares layerwise linear probing mIoU for causal and bidirectional MLLMs across all three encoder configurations.
Bidirectional attention yields modest peak mIoU gains for CLIP (+0.42) and DINOv2 (+0.32), consistent with the per-patch analysis: the improvement is concentrated at the few context-starved positions rather than reflecting a broad representational change. SigLIP shows a larger effect: under causal attention, recovery peaks at layer 12 (mIoU 35.39) and then declines through deeper layers, while bidirectional attention sustains monotonic improvement to 41.26 at layer 32.

Fig. 8: Causal vs. bidirectional attention: layerwise probing on ADE20K across the MLLM stack for different vision encoders. Bidirectional attention yields modest peak mIoU gains for CLIP and DINOv2, while SigLIP shows a larger effect.

5.3 Self-Refinement Requires Access to Neighbors

The bidirectional attention experiment reveals context starvation as a real but sharply localized cost of causal masking. For the first few image tokens, the causal mask creates a persistent representation penalty that is not resolved across LLM layers; bidirectional attention eliminates it by granting immediate access to all visual neighbors. Beyond these early positions, causal and bidirectional attention produce comparable representations, indicating that self-refinement through partial context is sufficient once a token has access to even a modest number of visual neighbors. This finding reinforces the self-refinement narrative from Sec. 4.1: self-refinement depends on access to semantically consistent neighbors, and causal masking delays the formation of these contextual anchors for early tokens, whereas bidirectional attention restores immediate global context and enables refinement to operate uniformly across spatial positions.
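For concreteness, the image-bidirectional mask of Eq. (2) can be materialized as a boolean matrix. A minimal NumPy sketch, assuming the image tokens occupy one contiguous span of the input sequence, as in LLaVA-style inputs (the function name is ours):

```python
import numpy as np

def image_bidirectional_mask(seq_len: int, img_start: int, img_end: int) -> np.ndarray:
    """Boolean (seq_len, seq_len) mask for Eq. (2): entry [q, k] is True
    when query position q may attend to key position k.

    Image tokens occupy [img_start, img_end); they attend to each other
    bidirectionally, while every other pair keeps the causal rule q >= k.
    """
    q = np.arange(seq_len)[:, None]  # query positions, column vector
    k = np.arange(seq_len)[None, :]  # key positions, row vector
    causal = q >= k
    in_img = lambda x: (x >= img_start) & (x < img_end)
    return (in_img(q) & in_img(k)) | causal
```

In a transformer implementation, False entries would typically be mapped to additive −∞ attention biases before the softmax. Note that, unlike a prefix-LM mask, positions outside the image span (including the system prompt) remain strictly causal here.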
5.4 Effect on Language Understanding

The preceding sections show that bidirectional attention among image tokens improves segmentation, particularly for context-starved early patches. A natural concern is whether this modification degrades the model's language capabilities. Tab. 1 compares VQA performance across nine benchmarks for causal and bidirectional variants of all three encoder configurations. CLIP and SigLIP models maintain comparable performance to their causal baselines across most benchmarks, indicating that bidirectional image attention preserves language understanding. DINOv2 shows minor regressions, consistent with its weaker text alignment observed throughout our experiments. These results suggest that bidirectional attention among image tokens offers a favorable trade-off: it alleviates context starvation for segmentation without sacrificing language capabilities for text-aligned encoders.

Table 1: VQA benchmarks: causal vs. bidirectional LLaVA-1.5 (7B) across three vision encoders. All models use Vicuna-7B as the LLM backbone. Bidirectional attention preserves language understanding with segmentation gains that vary across encoders. DINOv2 shows broader regressions, consistent with its lack of text alignment.
Vision Encoder     Attention      GQA   MMB   MME-P  MME-C  MMMU  POPE  SQA-I  TextVQA  VizWiz
CLIP ViT-L/14      Causal         62.6  66.4  1483   284    35.3  86.8  68.7   46.9     56.0
                   Bidirectional  62.7  65.5  1538   288    36.2  86.9  69.7   47.0     57.5
DINOv2 ViT-L/14    Causal         62.1  57.7  1304   324    34.6  87.2  66.1   14.0     51.4
                   Bidirectional  60.5  55.3  1247   326    32.4  85.1  66.3   13.7     45.8
SigLIP SO400M/14   Causal         61.1  63.7  1414   275    34.7  84.4  70.4   50.2     58.2
                   Bidirectional  62.1  66.9  1402   298    34.9  84.8  70.7   53.6     53.3

6 Conclusion

Our probing and intervention framework reveals that adapter-style MLLMs introduce a representation drop-off at the adapter that degrades token-level separability, but LLM layers progressively recover segmentation quality through attention-mediated self-refinement. Attention knockout experiments confirm that correctly classified tokens act as semantic anchors whose attention signals pull misclassified neighbors toward the correct label, identifying cross-token attention as the operative refinement mechanism. Causal attention creates context starvation at early image token positions, which bidirectional attention among image tokens alleviates by restoring immediate global context and enabling refinement to operate uniformly across spatial positions. These findings provide a mechanistic account of how MLLMs process visual information for segmentation, informing the design of future segmentation-capable MLLMs and, we hope, providing a step toward more interpretable multimodal systems.

Limitations. We used LLaVA-type models as a starting point and varied the vision encoder across configurations for interpreting MLLMs, but our conclusions might not generalize to significantly different architectures.
Our linear probes may underestimate the actual segmentation capacity achievable with richer task heads, and the knockout interventions block attention globally across all layers, which does not isolate the contribution of individual layers to the refinement process. Future work could explore whether these findings extend to broader task types and model architectures.

Acknowledgements

Our work was partially funded by the ERC (853489 - DEXIM) and the Alfried Krupp von Bohlen und Halbach Foundation, which we thank for their support. The authors gratefully acknowledge the scientific support and resources of the AI service infrastructure LRZ AI Systems provided by the Leibniz Supercomputing Centre (LRZ) of the Bavarian Academy of Sciences and Humanities (BAdW), funded by Bayerisches Staatsministerium für Wissenschaft und Kunst (StMWK).

References

1. Alain, G., Bengio, Y.: Understanding intermediate layers using linear classifier probes. In: ICLR Workshop (2017). https://doi.org/10.48550/arXiv.1610.01644
2. Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-VL Technical Report (Feb 2025). https://doi.org/10.48550/arXiv.2502.13923
3. Banani, M.E., Raj, A., Maninis, K.K., Kar, A., Li, Y., Rubinstein, M., Sun, D., Guibas, L., Johnson, J., Jampani, V.: Probing the 3D Awareness of Visual Foundation Models. In: CVPR (2024). https://doi.org/10.48550/arXiv.2404.08636
4.
Beyer, L., Steiner, A., Pinto, A.S., Kolesnikov, A., Wang, X., Salz, D., Neumann, M., Alabdulmohsin, I., Tschannen, M., Bugliarello, E., Unterthiner, T., Keysers, D., Koppula, S., Liu, F., Grycner, A., Gritsenko, A., Houlsby, N., Kumar, M., Rong, K., Eisenschlos, J., Kabra, R., Bauer, M., Bošnjak, M., Chen, X., Minderer, M., Voigtlaender, P., Bica, I., Balazevic, I., Puigcerver, J., Papalampidi, P., Henaff, O., Xiong, X., Soricut, R., Harmsen, J., Zhai, X.: PaliGemma: A versatile 3B VLM for transfer (Oct 2024). https://doi.org/10.48550/arXiv.2407.07726
5. Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., Lei, J., Ma, T., Guo, B., Kalla, A., Marks, M., Greer, J., Wang, M., Sun, P., Rädle, R., Afouras, T., Mavroudi, E., Xu, K., Wu, T.H., Zhou, Y., Momeni, L., Hazra, R., Ding, S., Vaze, S., Porcher, F., Li, F., Li, S., Kamath, A., Cheng, H.K., Dollár, P., Ravi, N., Saenko, K., Zhang, P., Feichtenhofer, C.: SAM 3: Segment Anything with Concepts. In: ICLR (2026). https://doi.org/10.48550/arXiv.2511.16719
6. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging Properties in Self-Supervised Vision Transformers. In: ICCV (2021). https://doi.org/10.48550/arXiv.2104.14294
7. Chen, Y.C., Li, W.H., Sun, C., Wang, Y.C.F., Chen, C.S.: SAM4MLLM: Enhance Multi-Modal Large Language Model for Referring Expression Segmentation. In: ECCV (2024). https://doi.org/10.48550/arXiv.2409.10542
8. Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian, H., Liu, Z., Gu, L., Wang, X., Li, Q., Ren, Y., Chen, Z., Luo, J., Wang, J., Jiang, T., Wang, B., He, C., Shi, B., Zhang, X., Lv, H., Wang, Y., Shao, W., Chu, P., Tu, Z., He, T., Wu, Z., Deng, H., Ge, J., Chen, K., Zhang, K., Wang, L., Dou, M., Lu, L., Zhu, X., Lu, T., Lin, D., Qiao, Y., Dai, J., Wang, W.: Expanding Performance
Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling (Dec 2024). https://doi.org/10.48550/arXiv.2412.05271
9. Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention Mask Transformer for Universal Image Segmentation. In: CVPR (2022). https://doi.org/10.48550/arXiv.2112.01527
10. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The Cityscapes Dataset for Semantic Urban Scene Understanding. In: CVPR (2016). https://doi.org/10.48550/arXiv.1604.01685
11. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: ICLR (2021). https://doi.org/10.48550/arXiv.2010.11929
12. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The Pascal Visual Object Classes (VOC) Challenge. International Journal of Computer Vision 88(2), 303–338 (Jun 2010). https://doi.org/10.1007/s11263-009-0275-4
13. Fu, S., Bonnen, T., Guillory, D., Darrell, T.: Hidden in plain sight: VLMs overlook their visual representations. In: Conference on Language Modeling (COLM) (2025). https://doi.org/10.48550/arXiv.2506.08008
14. Fu, X., Hu, Y., Li, B., Feng, Y., Wang, H., Lin, X., Roth, D., Smith, N.A., Ma, W.C., Krishna, R.: BLINK: Multimodal Large Language Models Can See but Not Perceive. In: ECCV (2024). https://doi.org/10.48550/arXiv.2404.12390
15. Geva, M., Bastings, J., Filippova, K., Globerson, A.: Dissecting Recall of Factual Associations in Auto-Regressive Language Models. In: Bouamor, H., Pino, J., Bali, K. (eds.) Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. pp. 12216–12235. Association for Computational Linguistics, Singapore (Dec 2023).
https://doi.org/10.18653/v1/2023.emnlp-main.751
16. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: ICCV (2017). https://doi.org/10.48550/arXiv.1703.06870
17. Hu, R., Rohrbach, M., Darrell, T.: Segmentation from Natural Language Expressions. In: ECCV (2016). https://doi.org/10.48550/arXiv.1603.06180
18. Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollár, P., Girshick, R.: Segment Anything. In: ICCV (2023). https://doi.org/10.48550/arXiv.2304.02643
19. Lai, X., Tian, Z., Chen, Y., Li, Y., Yuan, Y., Liu, S., Jia, J.: LISA: Reasoning Segmentation via Large Language Model. In: CVPR (2024). https://doi.org/10.48550/arXiv.2308.00692
20. Li, S., Koh, P.W., Du, S.S.: Exploring How Generative MLLMs Perceive More Than CLIP with the Same Vision Encoder. In: Annual Meeting of the Association for Computational Linguistics (ACL) (2025). https://doi.org/10.48550/arXiv.2411.05195
21. Liang, W., Zhang, Y., Kwon, Y., Yeung, S., Zou, J.: Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning. In: NeurIPS (2022). https://doi.org/10.48550/arXiv.2203.02053
22. Liang, Y., Cai, Z., Xu, J., Huang, G., Wang, Y., Liang, X., Liu, J., Li, Z., Wang, J., Huang, S.L.: Unleashing Region Understanding in Intermediate Layers for MLLM-based Referring Expression Generation. In: NeurIPS (Nov 2024), https://openreview.net/forum?id=168NLzTpw8
23. Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved Baselines with Visual Instruction Tuning. In: CVPR (2024). https://doi.org/10.48550/arXiv.2310.03744
24. Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual Instruction Tuning. In: NeurIPS (2023). https://doi.org/10.48550/arXiv.2304.08485
25. Merullo, J., Castricato, L., Eickhoff, C., Pavlick, E.: Linearly Mapping from Image to Text Space. In: ICLR (2023).
https://doi.org/10.48550/arXiv.2209.15162
26. Neo, C., Ong, L., Torr, P., Geva, M., Krueger, D., Barez, F.: Towards Interpreting Visual Information Processing in Vision-Language Models. In: ICLR (2025). https://doi.org/10.48550/arXiv.2410.07149
27. Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P.Y., Li, S.W., Misra, I., Rabbat, M., Sharma, V., Synnaeve, G., Xu, H., Jegou, H., Mairal, J., Labatut, P., Joulin, A., Bojanowski, P.: DINOv2: Learning Robust Visual Features without Supervision. TMLR (2024). https://doi.org/10.48550/arXiv.2304.07193
28. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning. pp. 8748–8763. PMLR (2021), http://proceedings.mlr.press/v139/radford21a
29. Rasheed, H., Maaz, M., Mullappilly, S.S., Shaker, A., Khan, S., Cholakkal, H., Anwer, R.M., Xing, E., Yang, M.H., Khan, F.S.: GLaMM: Pixel Grounding Large Multimodal Model. In: CVPR (2024). https://doi.org/10.48550/arXiv.2311.03356
30. Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., Mintun, E., Pan, J., Alwala, K.V., Carion, N., Wu, C.Y., Girshick, R., Dollár, P., Feichtenhofer, C.: SAM 2: Segment Anything in Images and Videos. In: ICLR (2025). https://doi.org/10.48550/arXiv.2408.00714
31. Ren, Z., Huang, Z., Wei, Y., Zhao, Y., Fu, D., Feng, J., Jin, X.: PixelLM: Pixel Reasoning with Large Multimodal Model. In: CVPR (2024). https://doi.org/10.48550/arXiv.2312.02228
32.
Tong, S., Brown, E., Wu, P., Woo, S., Middepogu, M., Akula, S.C., Yang, J., Yang, S., Iyer, A., Pan, X., Wang, Z., Fergus, R., LeCun, Y., Xie, S.: Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs. In: NeurIPS (2024). https://doi.org/10.48550/arXiv.2406.16860
33. Tong, S., Liu, Z., Zhai, Y., Ma, Y., LeCun, Y., Xie, S.: Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs. In: CVPR (2024). https://doi.org/10.48550/arXiv.2401.06209
34. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., Lample, G.: LLaMA: Open and Efficient Foundation Language Models (Feb 2023). https://doi.org/10.48550/arXiv.2302.13971
35. Wu, S., Jin, S., Zhang, W., Xu, L., Liu, W., Li, W., Loy, C.C.: F-LMM: Grounding Frozen Large Multimodal Models. In: CVPR (2025). https://doi.org/10.48550/arXiv.2406.05821
36. Xia, Z., Han, D., Han, Y., Pan, X., Song, S., Huang, G.: GSVA: Generalized Segmentation via Multimodal Large Language Models. In: CVPR (2024). https://doi.org/10.48550/arXiv.2312.10103
37. Yao, H., Wu, W., Yang, T., Song, Y., Zhang, M., Feng, H., Sun, Y., Li, Z., Ouyang, W., Wang, J.: Dense Connector for MLLMs. In: NeurIPS (2024). https://doi.org/10.48550/arXiv.2405.13800
38. Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid Loss for Language Image Pre-Training. In: ICCV. pp. 11941–11952. IEEE, Paris, France (Oct 2023). https://doi.org/10.1109/ICCV51070.2023.01100
39. Zhang, C., Cho, J., Puspitasari, F.D., Zheng, S., Li, C., Qiao, Y., Kang, T., Shan, X., Zhang, C., Qin, C., Rameau, F., Lee, L.H., Bae, S.H., Hong, C.S.: A Survey on Segment Anything Model (SAM): Vision Foundation Model Meets Prompt Engineering (Oct 2024). https://doi.org/10.48550/arXiv.2306.06211
40.
Zhang, T., Li, X., Fei, H., Yuan, H., Wu, S., Ji, S., Loy, C.C., Yan, S.: OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding. In: NeurIPS (2024). https://doi.org/10.48550/arXiv.2406.19389
41. Zhang, Y., Ma, Z., Gao, X., Shakiah, S., Gao, Q., Chai, J.: GROUNDHOG: Grounding Large Language Models to Holistic Segmentation. In: CVPR (2024). https://doi.org/10.48550/arXiv.2402.16846
42. Zhang, Y., Unell, A., Wang, X., Ghosh, D., Su, Y., Schmidt, L., Yeung-Levy, S.: Why are Visually-Grounded Language Models Bad at Image Classification? In: NeurIPS (2024). https://doi.org/10.48550/arXiv.2405.18415
43. Zhang, Z., Ma, Y., Zhang, E., Bai, X.: PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model. In: ECCV (2024). https://doi.org/10.48550/arXiv.2403.14598
44. Zhou, B., Zhao, H., Puig, X., Xiao, T., Fidler, S., Barriuso, A., Torralba, A.: Semantic Understanding of Scenes through the ADE20K Dataset. IJCV 127, 302–321 (2019). https://doi.org/10.1007/s11263-018-1140-0
45. Zou, X., Yang, J., Zhang, H., Li, F., Li, L., Wang, J., Wang, L., Gao, J., Lee, Y.J.: Segment Everything Everywhere All at Once. In: NeurIPS (2023). https://doi.org/10.48550/arXiv.2304.06718

Supplementary Materials

A Implementation Details

A.1 MLLM Training

All MLLM variants follow the LLaVA-1.5 two-stage training procedure [23] using Vicuna-7B [34] as the LLM backbone. We train six configurations: three vision encoders (CLIP ViT-L/14@336 [28], DINOv2 ViT-L/14@336 [27], SigLIP SO400M/14@384 [38]) × two attention types (causal, bidirectional). The adapter is a two-layer MLP that projects vision encoder outputs into the LLM's embedding space. CLIP and DINOv2 operate at 336 × 336 resolution, producing 24 × 24 = 576 patch tokens; SigLIP operates at 384 × 384, producing 27 × 27 = 729 patch tokens. All models are trained in bfloat16 precision.
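The adapter described above can be sketched as a plain forward pass. This is an illustrative NumPy version with random weights; the GELU activation and the initialization scale are our assumptions for illustration (in the MLLM the weights are learned in Stage 1):

```python
import numpy as np

rng = np.random.default_rng(0)

def gelu(x: np.ndarray) -> np.ndarray:
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

class MLPAdapter:
    """Two-layer MLP projecting vision-encoder patch features (d_vis)
    into the LLM embedding space (d_llm)."""

    def __init__(self, d_vis: int, d_llm: int):
        self.w1 = rng.standard_normal((d_vis, d_llm)) * 0.02
        self.b1 = np.zeros(d_llm)
        self.w2 = rng.standard_normal((d_llm, d_llm)) * 0.02
        self.b2 = np.zeros(d_llm)

    def __call__(self, patches: np.ndarray) -> np.ndarray:
        # patches: (num_patches, d_vis) -> (num_patches, d_llm)
        return gelu(patches @ self.w1 + self.b1) @ self.w2 + self.b2
```

For CLIP ViT-L/14 at 336 × 336, for instance, (336/14)² = 576 patch features would be mapped into the LLM's embedding space and prepended to the text tokens.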
Stage 1 (Adapter pretraining). The adapter is trained on 558K image-caption pairs with both the vision encoder and LLM frozen. We use AdamW with a learning rate of 1 × 10⁻³, a cosine learning rate schedule, a warmup ratio of 3%, and no weight decay; training runs for 1 epoch.

Stage 2 (Visual instruction tuning). The full model (adapter + LLM) is finetuned on 665K visual instruction data, with the vision encoder remaining frozen. We use AdamW with a learning rate of 2 × 10⁻⁵, a cosine schedule, a warmup ratio of 3%, and no weight decay; training runs for 1 epoch.

For the bidirectional attention variants (Sec. 5), both stages use the bidirectional image-token attention mask defined in Eq. (2); the causal variants use the standard causal mask throughout. All other hyperparameters are identical between causal and bidirectional conditions.

A.2 Linear Probing Protocol

We train an independent linear probe at each extraction point: the vision encoder output, the adapter output, and intermediate LLM layers. The probe is a single linear layer mapping from the hidden dimension d to K classes, trained with plain cross-entropy loss without class reweighting or data augmentation. We use the AdamW optimizer (β₁ = 0.9, β₂ = 0.999) with a learning rate of 1 × 10⁻³, polynomial learning rate decay (power 0.9), no weight decay, and a batch size of 64. Each probe is trained for 20 epochs; we track validation mIoU after each epoch and retain the best-performing checkpoint.

The per-patch logits are reshaped into a 2D grid matching the original patch layout (e.g., 24 × 24 × K for 576 patches) and bilinearly upsampled to the full image resolution. The loss is computed against the full-resolution ground-truth segmentation mask, allowing it to reflect sub-patch boundary information through the interpolated logits.
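The reshape-and-upsample step can be sketched as follows. We hand-roll a small bilinear interpolation for self-containment; in practice a library routine such as `torch.nn.functional.interpolate` would be used, and the function names and pixel-center sampling convention here are our assumptions:

```python
import numpy as np

def bilinear_upsample(grid: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Bilinearly upsample a (h, w, K) logit grid to (out_h, out_w, K)."""
    h, w, _ = grid.shape
    # Sample at pixel centers (align_corners=False convention).
    ys = (np.arange(out_h) + 0.5) * h / out_h - 0.5
    xs = (np.arange(out_w) + 0.5) * w / out_w - 0.5
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 1)
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = np.clip(ys - y0, 0.0, 1.0)[:, None, None]
    wx = np.clip(xs - x0, 0.0, 1.0)[None, :, None]
    top = grid[y0][:, x0] * (1 - wx) + grid[y0][:, x1] * wx
    bot = grid[y1][:, x0] * (1 - wx) + grid[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def probe_segmentation(hidden: np.ndarray, W: np.ndarray, b: np.ndarray,
                       grid_hw: tuple, out_hw: tuple) -> np.ndarray:
    """Linear probe -> 2D logit grid -> full-resolution class map.

    hidden: (num_patches, d) frozen features; W: (d, K); b: (K,).
    """
    h, w = grid_hw
    logits = (hidden @ W + b).reshape(h, w, -1)  # e.g., (24, 24, K)
    full = bilinear_upsample(logits, *out_hw)    # e.g., (336, 336, K)
    return full.argmax(-1)
```

During training the cross-entropy is taken between the upsampled logits and the full-resolution mask, rather than against a downsampled mask, which is what lets the probe see sub-patch boundary information.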
Images are processed in a single forward pass without sliding-window or multi-scale evaluation. The linear probe is trained exclusively on the training split, and all reported metrics are computed on the held-out validation split, ensuring that the results reflect the generalization quality of the frozen representations rather than memorization by the probe. We keep these training conditions across all layers and all experiments.

Fig. 9: Layerwise linear probing on Cityscapes (left) and PASCAL VOC (right). mIoU across the MLLM stack for CLIP, DINOv2, and SigLIP encoders paired with Vicuna-7B.

A.3 Attention Knockout Setup

For the attention knockout experiments (Sec. 4), we use the per-layer linear probes trained in the standard probing experiment. We then identify the token set B_c corresponding to the class to be blocked and re-run the forward pass with the attention mask modified according to Eq. (1), extracting hidden states at every layer for evaluation with the same frozen probes. The knockout is implemented by registering forward pre-hooks on each LLM layer's self-attention module that set the corresponding pre-softmax attention logits to −∞ before the softmax computation. This ensures that any change in segmentation quality is attributable to the attention intervention.

B Extended Probing Results

B.1 Layerwise Probing on PASCAL VOC and Cityscapes

The main paper reports layerwise linear probing results on ADE20K (Fig. 3). Fig. 9 shows the corresponding results on Cityscapes and PASCAL VOC 2012. The progressive recovery across LLM layers is consistent across all three datasets and all three encoders, confirming that this pattern is not specific to ADE20K. DINOv2 and SigLIP also exhibit a consistent adapter drop-off across all datasets.

B.2 Causal vs. Bidirectional Probing

Tab. 2 summarizes the causal vs.
bidirectional comparison on ADE20K, complementing the layerwise curves in Fig. 8.

Table 2: Causal vs. bidirectional attention: layerwise probing on ADE20K. Adapter and Peak LLM report mIoU (%). ∆enc: Peak LLM mIoU minus vision encoder baseline (positive = surpasses encoder). Recovery: Peak LLM mIoU minus adapter mIoU.

Encoder             Attention      Adapter  Peak LLM  ∆enc    Recovery
CLIP ViT-L/14@336   Causal         33.22    40.74     +6.36   +7.52
                    Bidirectional  32.68    41.16     +6.78   +8.48
DINOv2 Large        Causal         42.90    44.50     −1.08   +1.60
                    Bidirectional  40.47    44.82     −0.76   +4.35
SigLIP SO400M/14    Causal         31.44    35.39     −3.31   +3.95
                    Bidirectional  32.39    41.26     +2.56   +8.87

Table 3: Context starvation under causal attention. Per-patch pixel accuracy for the first three tokens. ∆: bidirectional minus causal accuracy. % Impr.: relative improvement over causal. Early patches suffer severe context starvation under causal masking.

         Causal   Bidirectional  ∆        % Impr.
Patch 0  0.6195   0.7630         +0.1435  +23.2%
Patch 1  0.6851   0.7933         +0.1082  +15.8%
Patch 2  0.7069   0.7931         +0.0862  +12.2%

B.3 Context Starvation Per-Patch Accuracy

Tab. 3 reports per-patch pixel accuracy and relative improvement for the first three image tokens under causal and bidirectional attention, quantifying the context starvation effect discussed in Sec. 5.1.

B.4 Generalization Across MLLM Architectures

Fig. 3 in the main paper varies the vision encoder while keeping the LLM fixed (Vicuna-7B in the LLaVA-1.5 framework). To test whether the drop-off and recovery pattern generalizes beyond LLaVA-1.5, we repeat the layerwise linear probing experiment with two additional MLLMs: LLaVA-OneVision and DeepSeek-VL. Each model uses a different vision encoder: LLaVA-1.5 uses CLIP ViT-L/14, OneVision uses SigLIP, and DeepSeek-VL uses concatenated SigLIP and SAM-B embeddings. Fig. 10 reports ∆mIoU relative to the adapter output on ADE20K.
All three architectures exhibit the same qualitative pattern: a representation drop-off at the adapter followed by progressive recovery across LLM layers. Note that the curves span different numbers of layers because each MLLM uses a different LLM backbone with a different depth. These results confirm that the drop-off and self-refinement mechanism documented in Sec. 3 is not specific to a single MLLM but extends across different adapter-style architectures.

Fig. 10: Layerwise probing across different MLLMs. ∆mIoU relative to adapter output on ADE20K for LLaVA-1.5 (7B), LLaVA-OneVision (7B), and DeepSeek-VL (7B), each using its default vision encoder (CLIP for LLaVA-1.5, SigLIP for OneVision, SigLIP+SAM-B for DeepSeek-VL). Curves span different numbers of layers due to differences in LLM backbone depth. The adapter drop-off and LLM self-refinement pattern is consistent across architectures.

C Extended Attention Knockout Experiments

The main paper demonstrates the attention knockout analysis on individual images with a single class-confusion pair (ceiling vs. sky). Here we extend this analysis using the CLIP MLLM to more misclassified classes (e.g., sky, ceiling, and grass) and aggregate the results across many images to provide quantitative evidence that the semantic anchor mechanism generalizes.

C.1 Aggregate Knockout Metric

To quantify the knockout effect at scale, we track how many incorrectly predicted patches persist across layers. The metric below tracks predictions at every layer. For a given misclassified class c (e.g., sky), we classify all 576 patches at layer 0 using the layer-0 probe and count how many are incorrectly predicted as c, yielding n_c^(0). At each subsequent layer ℓ, we count the number of patches still predicted as c, yielding n_c^(ℓ). The rate for that image at layer ℓ is n_c^(ℓ) / n_c^(0). We average this ratio across all selected images.
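The persistence rate defined above can be computed directly from per-layer probe predictions. A minimal sketch, assuming predictions and ground truth are flat (num_patches,) class arrays per image (function names are ours); note that n_c^(ℓ) counts all patches incorrectly predicted as c at layer ℓ, so the ratio can exceed 100% when errors are amplified:

```python
import numpy as np

def n_incorrect(preds: np.ndarray, gt: np.ndarray, c: int) -> int:
    """Number of patches incorrectly predicted as class c."""
    return int(((preds == c) & (gt != c)).sum())

def remaining_rate(preds_layer: np.ndarray, preds_layer0: np.ndarray,
                   gt: np.ndarray, c: int):
    """n_c(l) / n_c(0) for one image; None if it never shows the error."""
    n0 = n_incorrect(preds_layer0, gt, c)
    if n0 == 0:
        return None
    return n_incorrect(preds_layer, gt, c) / n0

def aggregate_rate(per_image_rates) -> float:
    """Average per-image ratios, skipping images without the misclassification."""
    rates = [r for r in per_image_rates if r is not None]
    return float(np.mean(rates)) if rates else float("nan")
```

Averaging this ratio per layer under the three conditions (no knockout, knockout incorrect, knockout correct) yields the curves reported in Fig. 11.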
A value of 100% at layer 0 means all initially misclassified patches are still present; lower values indicate the model is correcting them. We then take the ground-truth class across all misclassified patches in the image as the dominant ground-truth class; this is the class that gets blocked in the knockout-correct condition.

C.2 Quantitative Results

Fig. 11 reports the aggregate knockout metric for sky, ceiling, and grass. Across all three classes, blocking the incorrect class accelerates self-correction, while blocking the correct class impairs it. These results show quantitatively that correctly classified tokens act as semantic anchors driving self-refinement.

C.3 Qualitative Examples

Fig. 12 shows representative per-image knockout visualizations for each of the three misclassified classes, illustrating the layerwise segmentation under all three conditions.

D Extended Qualitative Results

D.1 Layerwise Segmentation Predictions

Fig. 13 shows additional ADE20K validation images with linear probe predictions at the vision encoder output, the adapter output, and an intermediate LLM layer, extending the qualitative analysis from Fig. 4.
Fig. 11: Aggregate attention knockout across three misclassified classes. Percentage of incorrectly predicted patches remaining across LLM layers, normalized to 100% at layer 0 and averaged over all images exhibiting the misclassification. Panels: 'sky' (615 images), 'ceiling' (73 images), and 'grass' (100 images), each comparing no knockout, knockout incorrect, and knockout correct (GT class). Blocking the incorrect class (red) accelerates self-correction; blocking the correct class (blue) impairs it and can amplify errors beyond the initial count.

Fig. 12: Qualitative attention knockout examples. Layerwise segmentation predictions under three conditions (rows: no knockout, knockout incorrect, knockout correct) for three class-confusion pairs. Top: sky vs. windowpane. Middle: ceiling vs. wall. Bottom: grass vs. road. Blocking the incorrect class produces cleaner segmentation maps at earlier layers, while blocking the correct class leaves errors unresolved or amplified.

Fig. 13: Extended qualitative segmentation predictions across the MLLM stack.
Seven ADE20K validation images showing, from left to right: the input image, ground truth, and linear probe predictions at the vision encoder output, the adapter output, and an intermediate LLM layer. The representation drop-off at the adapter is visible as increased noise and class confusion compared to the vision encoder output, while deeper LLM layers produce more spatially coherent predictions.
