Linear Differential Vision Transformer: Learning Visual Contrasts via Pairwise Differentials

February 22, 2026

📝 Original Info

Title: Linear Differential Vision Transformer: Learning Visual Contrasts via Pairwise Differentials
ArXiv ID: 2511.00833
Date: 2025-11-02
Authors: Not provided (the paper states no author information, so this field has been left blank.)

📝 Abstract

Vision Transformers (ViTs) have become a universal backbone for both image recognition and image generation. Yet their Multi-Head Self-Attention (MHSA) layer still performs a quadratic query-key interaction for every token pair, spending the bulk of computation on visually weak or redundant correlations. We introduce Visual-Contrast Attention (VCA), a drop-in replacement for MHSA that injects an explicit notion of discrimination while reducing the theoretical complexity from O(N²C) to O(NnC) with n ≪ N. VCA first distils each head's dense query field into a handful of spatially pooled visual-contrast tokens, then splits them into a learnable positive and negative stream whose differential interaction highlights what truly separates one region from another. The module adds fewer than 0.3M parameters to a DeiT-Tiny backbone, requires no extra FLOPs, and is wholly architecture-agnostic. Empirically, VCA lifts DeiT-Tiny top-1 accuracy on ImageNet-1K from 72.2% to 75.6% (+3.4) and improves three strong hierarchical ViTs by up to 3.1%, while in class-conditional ImageNet generation it lowers FID-50K by 2.1 to 5.2 points across both diffusion (DiT) and flow (SiT) models. Extensive ablations confirm that (i) spatial pooling supplies low-variance global cues, (ii) dual positional embeddings are indispensable for contrastive reasoning, and (iii) combining the two in both stages yields the strongest synergy. VCA therefore offers a simple path towards faster and sharper Vision Transformers. The source code is available at https://github.com/LeapLabTHU/LinearDiff.
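To make the complexity argument in the abstract concrete, here is a minimal single-head NumPy sketch of the general idea: pool the query field into n tokens, form a positive and a negative stream, and take a difference of two softmax attention maps in each of two stages. This is an illustration under stated assumptions, not the authors' implementation. The function name `contrast_attention_sketch`, the average pooling, the additive embeddings `pos_emb` and `neg_emb`, and the fixed weight `lam` are all hypothetical choices; the abstract does not specify these details, and the released code at the URL above should be treated as the reference.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def contrast_attention_sketch(q, k, v, grid, pool, pos_emb, neg_emb, lam=0.5):
    """Hypothetical single-head sketch of a pooled, differential attention.

    q, k, v : arrays of shape (N, C), with N = grid * grid tokens in row-major order.
    pool    : side of the average-pooling window (assumed to divide grid).
    pos_emb, neg_emb : arrays of shape (n, C), n = (grid // pool) ** 2 (assumed additive).
    lam     : weight of the negative stream (an assumption, fixed here, not learned).
    """
    N, C = q.shape
    g = grid // pool
    scale = C ** -0.5

    # Pool the dense query field into n = g * g contrast tokens.
    pooled = q.reshape(g, pool, g, pool, C).mean(axis=(1, 3)).reshape(g * g, C)

    # Split into positive and negative streams via two embeddings.
    t_pos = pooled + pos_emb
    t_neg = pooled + neg_emb

    # Stage 1: n contrast tokens attend to N keys, cost O(N n C).
    a_pos = softmax(t_pos @ k.T * scale)            # (n, N)
    a_neg = softmax(t_neg @ k.T * scale)            # (n, N)
    summary = (a_pos - lam * a_neg) @ v             # (n, C)

    # Stage 2: N queries attend to n contrast tokens, cost O(N n C).
    b_pos = softmax(q @ t_pos.T * scale)            # (N, n)
    b_neg = softmax(q @ t_neg.T * scale)            # (N, n)
    return (b_pos - lam * b_neg) @ summary          # (N, C)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grid, pool, C = 8, 4, 16                        # N = 64 tokens, n = 4 contrast tokens
    q, k, v = (rng.standard_normal((grid * grid, C)) for _ in range(3))
    pos = rng.standard_normal(((grid // pool) ** 2, C))
    neg = rng.standard_normal(((grid // pool) ** 2, C))
    out = contrast_attention_sketch(q, k, v, grid, pool, pos, neg)
    print(out.shape)
```

Neither stage in the sketch ever forms an N × N attention map: the largest intermediate matrices are n × N and N × n, which is where the O(NnC) cost comes from. When the two streams are identical and `lam` is 1, the difference cancels and the output is zero, so any signal in this sketch comes purely from how the positive and negative streams disagree.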

