Visual Funnel: Resolving Contextual Blindness in Multimodal Large Language Models


📝 Original Info

  • Title: Visual Funnel: Resolving Contextual Blindness in Multimodal Large Language Models
  • ArXiv ID: 2512.10362
  • Date: 2025-12-11
  • Authors: Woojun Jung, Jaehoon Go, Mingyu Jeon, Sunjae Yoon, Junyeong Kim

📝 Abstract

Multimodal Large Language Models (MLLMs) demonstrate impressive reasoning capabilities but often fail to perceive fine-grained visual details, limiting their applicability in precision-demanding tasks. While methods that crop salient regions of an image offer a partial solution, we identify a critical limitation they introduce: "Contextual Blindness." This failure occurs due to a structural disconnect between high-fidelity details (from the crop) and the broader global context (from the original image), even when all necessary visual information is present. We argue that this limitation stems not from a lack of information 'Quantity,' but from a lack of 'Structural Diversity' in the model's input. To resolve this, we propose Visual Funnel, a training-free, two-step approach. Visual Funnel first performs Contextual Anchoring to identify the region of interest in a single forward pass. It then constructs an Entropy-Scaled Portfolio that preserves the hierarchical context, ranging from focal detail to broader surroundings, by dynamically determining crop sizes based on attention entropy and refining crop centers. Through extensive experiments, we demonstrate that Visual Funnel significantly outperforms naive single...
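
The abstract does not give the formulas behind the Entropy-Scaled Portfolio, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a 2D attention map is already available from the anchoring pass, takes the attention peak as the crop center, and lets the normalized Shannon entropy of the map set the size of the smallest crop (more diffuse attention gives a larger crop). The function names `normalized_entropy` and `crop_portfolio`, the `scales` multipliers, and the `min_frac`/`max_frac` bounds are all hypothetical choices made for illustration.

```python
import numpy as np


def normalized_entropy(attn):
    """Shannon entropy of an attention map, scaled to [0, 1].

    Assumes the map has more than one cell and a positive sum.
    """
    p = attn.astype(float).ravel()
    p = p / p.sum()                      # turn weights into a distribution
    nz = p[p > 0]                        # skip zeros: 0 * log(0) is taken as 0
    h = -(nz * np.log(nz)).sum()
    return float(h / np.log(p.size))     # divide by the maximum possible entropy


def crop_portfolio(attn, image_size, scales=(1.0, 2.0, 4.0),
                   min_frac=0.1, max_frac=0.5):
    """Return square (left, top, right, bottom) boxes of increasing size.

    Assumption (not from the paper): the focal crop side grows linearly
    with normalized entropy between min_frac and max_frac of the shorter
    image side, and each later crop multiplies that side by a fixed scale.
    """
    width, height = image_size
    rows, cols = attn.shape

    # Crop center: the attention peak, mapped from grid cells to pixels.
    r, c = np.unravel_index(np.argmax(attn), attn.shape)
    cx = (c + 0.5) * width / cols
    cy = (r + 0.5) * height / rows

    short_side = min(width, height)
    h = normalized_entropy(attn)
    base = (min_frac + (max_frac - min_frac) * h) * short_side

    boxes = []
    for s in scales:
        half = min(base * s, short_side) / 2
        # Shift the center if needed so the box stays inside the image.
        x = min(max(cx, half), width - half)
        y = min(max(cy, half), height - half)
        boxes.append((int(round(x - half)), int(round(y - half)),
                      int(round(x + half)), int(round(y + half))))
    return boxes


if __name__ == "__main__":
    attn = np.zeros((4, 4))
    attn[1, 2] = 1.0                     # sharply peaked attention
    for box in crop_portfolio(attn, (400, 400)):
        print(box)
```

In this sketch the crops would be passed to the model alongside the original image, so that the smallest crop carries the detail and the larger ones carry progressively more surrounding context. How the paper actually refines crop centers and maps entropy to crop size may differ.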

📄 Full Content

...(The full text is long and has been omitted here. Please see the site for the complete article.)
