CASA: Cross-Attention via Self-Attention for Efficient Vision-Language Fusion

Reading time: 2 minutes

📝 Original Info

  • Title: CASA: Cross-Attention via Self-Attention for Efficient Vision-Language Fusion
  • ArXiv ID: 2512.19535
  • Date: 2025-12-22
  • Authors: Not listed in the provided data (check the original PDF or the arXiv page)

📝 Abstract

Vision-language models (VLMs) are commonly trained by inserting image tokens from a pretrained vision encoder into the textual stream of a language model. This allows text and image information to fully attend to one another within the model, but becomes extremely costly for high-resolution images, long conversations, or streaming videos, both in memory and compute. VLMs leveraging cross-attention are an efficient alternative to token insertion but exhibit a clear performance gap, in particular on tasks involving fine-grained visual details. We find that a key to improving such models is to also enable local text-to-text interaction in the dedicated cross-attention layers. Building on this, we propose CASA, Cross-Attention via Self-Attention, a simple and efficient paradigm which substantially reduces the gap with full token insertion on common image understanding benchmarks, while enjoying the same scalability as cross-attention models when applied to long-context multimodal tasks such as streaming video captioning. For samples and code, please see our project page at https://kyutai.org/casa .
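
To make the abstract's key claim concrete, below is a minimal sketch of what "enabling local text-to-text interaction in the dedicated cross-attention layers" could look like in PyTorch. This is not the authors' released implementation (see the project page above for that); the module name `FusionLayer`, the projection layout, and the causal text-to-text mask are illustrative assumptions. The idea shown: text queries attend over the image tokens and, causally, over the text tokens themselves, inside a single attention call.

```python
# Illustrative sketch only -- not the CASA implementation released by the authors.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusionLayer(nn.Module):
    """Hypothetical fusion layer: text queries attend jointly over image tokens
    (cross-attention) and causally over the text tokens themselves
    (local text-to-text interaction), in one attention call."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.kv_proj = nn.Linear(dim, 2 * dim, bias=False)
        self.out_proj = nn.Linear(dim, dim, bias=False)

    def _split_heads(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        return x.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)

    def forward(self, text: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        # text:  (batch, n_text, dim) -- hidden states from the language model
        # image: (batch, n_img, dim)  -- tokens from the pretrained vision encoder
        b, n_text, _ = text.shape
        n_img = image.shape[1]

        q = self._split_heads(self.q_proj(text))                  # (b, h, n_text, hd)
        k, v = self.kv_proj(torch.cat([image, text], dim=1)).chunk(2, dim=-1)
        k, v = self._split_heads(k), self._split_heads(v)         # (b, h, n_img + n_text, hd)

        # Boolean mask (True = attend): every text query sees all image tokens,
        # but only causally preceding text tokens.
        mask = torch.zeros(n_text, n_img + n_text, dtype=torch.bool, device=text.device)
        mask[:, :n_img] = True
        mask[:, n_img:] = torch.ones(n_text, n_text, device=text.device).tril().bool()

        out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
        out = out.transpose(1, 2).reshape(b, n_text, -1)
        return self.out_proj(out)


# Example usage with made-up sizes:
# layer = FusionLayer(dim=1024, n_heads=8)
# fused = layer(text_hidden_states, image_tokens)   # (batch, n_text, 1024)
```

Under this reading, image tokens never enter the language model's main token sequence, which preserves the scalability of cross-attention for long-context settings such as streaming video, while the causally masked text keys restore the text-to-text interaction the abstract identifies as the key missing ingredient.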


Reference

This content was AI-processed from open-access arXiv data.
