A Knowledge-Distillation-Based Positional-Attention Transfer Technique for Long-Context Vision-Language Models
📝 Abstract
While large vision-language models (VLMs) demonstrate strong long-context understanding, their prevalent small-scale branches fail at language-image alignment because of a limited window size. We discover that knowledge distillation, anchored to large models, improves a student's effective window size and acts as a complement to Rotary Position Embeddings (RoPE). Building on this insight, we propose LAid, which directly targets the transfer of long-range attention mechanisms through two complementary components: (1) a progressive distance-weighted attention matching that dynamically emphasizes longer position differences during training, and (2) a learnable RoPE response-gain modulation that selectively amplifies position sensitivity where needed. Extensive experiments across multiple model families demonstrate that LAid-distilled models achieve up to 3.2× longer effective context windows compared to baseline small models, while maintaining or improving performance on standard VL benchmarks. Spectral analysis also suggests that LAid successfully preserves crucial low-frequency attention components that conventional methods fail to transfer. Our work not only provides practical techniques for building more efficient long-context VLMs but also offers theoretical insights into how positional understanding emerges and transfers during distillation.
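The first component above, progressive distance-weighted attention matching, can be made concrete with a minimal NumPy sketch. The function names, the `|i − j|^γ` weight form, and the linear γ schedule are our own illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def distance_weighted_attn_loss(attn_student, attn_teacher, gamma):
    """Weighted MSE between (L, L) attention maps; weights grow with |i - j|.

    gamma controls how strongly long-range position pairs are emphasized;
    ramping gamma up over training gives the "progressive" schedule.
    """
    L = attn_student.shape[-1]
    idx = np.arange(L)
    dist = np.abs(idx[:, None] - idx[None, :]) / max(L - 1, 1)  # normalized to [0, 1]
    w = (dist + 1e-6) ** gamma                                  # long range weighted up
    diff = (attn_student - attn_teacher) ** 2
    return float((w * diff).sum() / w.sum())

def gamma_schedule(step, total_steps, gamma_max=2.0):
    """Illustrative linear ramp: near-uniform weighting early (gamma ~ 0),
    long-range-heavy weighting late (gamma -> gamma_max)."""
    return gamma_max * step / total_steps
```

At `gamma = 0` the loss reduces to a plain mean-squared attention match; larger `gamma` progressively shifts the training signal toward the long-distance entries that small models reportedly fail to reproduce.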
📄 Content
The comprehensive understanding and full utilization of long context play a crucial role in building large vision-language models (VLMs). It brings better language-image alignment in both large scenes (Anthropic 2024; Bai et al. 2025; Kamath et al. 2025; Meta 2024; Xue et al. 2024; Zhu et al. 2025) and long storylines (Bai et al. 2025; Meta 2024; Tworkowski et al. 2023; Yu et al. 2024; Zhu et al. 2025), and it improves the coherence and depth of interaction in multi-round dialogue (Anthropic 2024; Bai et al. 2025; Ge et al. 2024; Kamath et al. 2025; Meta 2024; Tworkowski et al. 2023; Xu et al. 2024; Young et al. 2024; Zhu et al. 2025). Currently, large-scale VLMs (≥ 72B parameters) demonstrate window sizes of up to 128k tokens, e.g., Gemma 3 (Kamath et al. 2025), Qwen 2.5-VL (Bai et al. 2025), InternVL 3 (Zhu et al. 2025). To the best of our knowledge, we are the first to point out that VLMs' prevalent distilled branches (≤ 7B parameters) exhibit markedly constrained window sizes despite employing the same positional embedding, identical architecture, and the same training methodology. This window shrinkage is negligible when distilled models are applied to short-context evaluations, but it becomes a major obstacle during full-length inference.
Figure 1: Effective context window comparison. Left: the Visual Haystack task requiring retrieval from multi-image inputs (example query: "User: For the image with a traffic light, is there a couch? Assistant: Yes."). Right: Qwen2.5-VL accuracy across scales. Larger models (32B) sustain effective performance (>0.5) significantly longer than smaller counterparts (3B, 7B) despite identical architectures, revealing a scale-dependent RoPE awareness gap that our method targets.
Previous studies have shown that the context window of pre-trained large language models (LLMs) can be extended to more than 10 million tokens through various training-stage interventions. Position embedding extrapolation techniques are predominant: RoPE (Su et al. 2024) enables models to generalize to sequence lengths beyond their training range, ALiBi (Press, Smith, and Lewis 2022) introduces inductive biases that scale effectively to longer contexts, and LongRoPE (Ding et al. 2024) and FoPE (Hua et al. 2024) leverage non-uniform positional interpolation. Besides, fine-tuning on longer texts has shown remarkable effectiveness, as demonstrated by LongLLaMA (Tworkowski et al. 2023), which extended context windows through continued pre-training on carefully curated long documents, and Anthropic's Claude model (Anthropic 2024), which achieved a 100k-token context through specialized training regimes. These methods focus mainly on the training stage and require significant computational resources for model retraining or fine-tuning.
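As a concrete reminder of the mechanism these extrapolation techniques build on, the sketch below implements the basic RoPE rotation (Su et al. 2024) in NumPy and checks its defining property: the query-key score depends only on the relative offset between positions, which is what lets the encoding generalize to unseen absolute positions. This is an illustrative sketch, not any production model's implementation:

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Rotate dimension pairs (2k, 2k+1) of vector x by angle pos * theta_k,
    where theta_k = base^(-2k/d), following Su et al. 2024."""
    d = x.shape[-1]
    theta = base ** (-2.0 * np.arange(d // 2) / d)
    ang = pos * theta
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

# Defining property: the score q.k depends only on the offset m - n,
# not on the absolute positions m and n themselves.
rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)
s_near = rope_rotate(q, 5) @ rope_rotate(k, 3)        # offset 2 at positions (5, 3)
s_far = rope_rotate(q, 1005) @ rope_rotate(k, 1003)   # same offset, shifted by 1000
assert abs(s_near - s_far) < 1e-6
```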
However, VLMs face unique hurdles in handling long windows during training due to their multimodal nature: the visual components introduce substantial complexity in positional understanding across modalities, memory constraints become more severe due to image-token density, and the alignment between visual and textual elements at long distances requires fundamentally different mechanisms. More importantly, many foundational assumptions underpinning text-only context-extension techniques, such as uniform attention patterns and sequence-independent position encodings, break down when visual tokens with dense, spatially organized information are introduced into the context window. Few research efforts have shed light on post-training techniques that could extend context windows without expensive retraining.
To better motivate our work, we formalize the concept of Long-window Anchoring. Current state-of-the-art VLMs (Qwen2.5-VL (Bai et al. 2025), InternVL 3 (Zhu et al. 2025), Gemma (Kamath et al. 2025)) often develop models of varying parameter sizes (3B, 7B, 32B) through independent training from scratch, resulting in inconsistent window-size capabilities across model sizes. We propose using larger models (e.g., Qwen 2.5-VL 32B) as "anchors" that possess strong long-window capability, then employing post-training methods to align smaller models' long-window capability with these anchors. In this way, smaller models can inherit long-window capability without the prohibitive computational cost of training from scratch, while maintaining their efficiency advantages. Among various potential post-training methods, we begin with knowledge distillation as our fundamental approach: a proven technique for transferring capabilities between models of different sizes while keeping the smaller target model computationally efficient.
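As a reference point for that fundamental approach, the standard temperature-scaled distillation objective (Hinton-style KL between teacher and student output distributions) can be sketched as follows. Treating the 32B anchor's outputs as the teacher distribution is our reading of the setup; the code is illustrative, not the paper's implementation:

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kd_loss(student_logits, teacher_logits, T=2.0):
    """Temperature-scaled KL(teacher || student): the large "anchor" model
    provides teacher_logits, the small target model provides student_logits."""
    p_t = softmax(np.asarray(teacher_logits) / T)
    p_s = softmax(np.asarray(student_logits) / T)
    kl = (p_t * (np.log(p_t) - np.log(p_s))).sum(axis=-1)
    return float(kl.mean() * T * T)  # T^2 keeps the gradient scale comparable
```

The temperature `T` softens both distributions so the student also learns from the teacher's low-probability "dark knowledge" rather than only its top prediction.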
In this paper, we propose a new perspective for analyzing this phenomenon in Figure 1, where we uncover a fundamental distinction: larger VLMs inherently sustain stronger Visual Haystack performance at larger input image counts, decaying 5.2× more slowly than 3B models. This positional awareness gap persists