CAViT -- Channel-Aware Vision Transformer for Dynamic Feature Fusion

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Vision Transformers (ViTs) have demonstrated strong performance across a range of computer vision tasks by modeling long-range spatial interactions via self-attention. However, channel-wise mixing in ViTs remains static, relying on fixed multilayer perceptrons (MLPs) that lack adaptability to input content. We introduce CAViT, a dual-attention architecture that replaces the static MLP with a dynamic, attention-based mechanism for feature interaction. Each Transformer block in CAViT performs spatial self-attention followed by channel-wise self-attention, allowing the model to dynamically recalibrate feature representations based on global image context. This unified and content-aware token mixing strategy enhances representational expressiveness without increasing depth or complexity. We validate CAViT across five benchmark datasets spanning both natural and medical domains, where it outperforms the standard ViT baseline by up to +3.6% in accuracy, while reducing parameter count and FLOPs by over 30%. Qualitative attention maps reveal sharper and semantically meaningful activation patterns, validating the effectiveness of our attention-driven token mixing.


💡 Research Summary

Vision Transformers (ViTs) have become a cornerstone of modern computer‑vision research, yet their internal architecture exhibits a notable asymmetry: spatial token mixing is performed by dynamic multi‑head self‑attention (MHSA), while channel mixing relies on a static multilayer perceptron (MLP). The fixed MLP treats each channel independently of the image content, limiting the model’s ability to adapt to diverse visual structures such as textures, object parts, or medical lesions.

The paper introduces CAViT (Channel‑Aware Vision Transformer), a simple yet powerful modification that replaces the MLP in every Transformer block with a second self‑attention stage operating across the channel dimension. After the usual spatial MHSA, the tensor of shape B × (N + 1) × C (batch, tokens, channels) is split to isolate the CLS token, then the spatial tokens are transposed to B × C × N, effectively treating each channel as a token. The CLS token is reshaped to B × 1 × N and concatenated, yielding a sequence of length C + 1. A single‑head channel self‑attention (SHSA) is then applied, allowing each channel to attend to all other channels conditioned on the global image context. The dimensions are swapped back, the CLS token is re‑attached, and the result is fed into the next block.
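The dimension-swap described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: Q/K/V projections are taken as identity to keep it short, and `w_cls` is a hypothetical linear map from C to N so the CLS token can join the transposed sequence (the summary says the CLS token is "reshaped to B × 1 × N", which requires either C = N or such a projection).

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def channel_self_attention(x, w_cls):
    """Single-head channel self-attention (SHSA) over a ViT token tensor.

    x     : (B, N+1, C) -- CLS token followed by N spatial tokens.
    w_cls : (C, N) hypothetical linear map bringing the CLS token into the
            transposed sequence (the paper's exact CLS handling may differ).
    Q/K/V projections are identity here to keep the sketch minimal.
    """
    cls, tokens = x[:, :1, :], x[:, 1:, :]           # (B, 1, C), (B, N, C)
    N = tokens.shape[1]

    tokens_t = tokens.transpose(0, 2, 1)             # (B, C, N): channels become tokens
    cls_t = cls @ w_cls                              # (B, 1, N): CLS mapped into token space
    seq = np.concatenate([cls_t, tokens_t], axis=1)  # (B, C+1, N)

    # Single-head attention with identity projections: softmax(X X^T / sqrt(N)) X
    attn = softmax(seq @ seq.transpose(0, 2, 1) / np.sqrt(N))
    out = attn @ seq                                 # (B, C+1, N)

    # Swap dimensions back and re-attach the CLS token
    cls_out = out[:, :1, :] @ w_cls.T                # (B, 1, C)
    tokens_out = out[:, 1:, :].transpose(0, 2, 1)    # (B, N, C)
    return np.concatenate([cls_out, tokens_out], axis=1)

# Toy shapes: 2 images, 16 spatial tokens, 8 channels
x = rng.standard_normal((2, 17, 8))
w_cls = rng.standard_normal((8, 16)) / 4.0
y = channel_self_attention(x, w_cls)
print(y.shape)  # (2, 17, 8) -- same shape as the input, ready for the next block
```

The key point is that after the transpose each attention "token" is an entire channel's spatial response, so one softmax attention over the C + 1 rows lets every channel reweight itself against all others, conditioned on image content.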

Key design choices:

  • Dynamic channel mixing – By using attention rather than a linear MLP, channel interactions become data‑dependent and can adapt to each input image.
  • Single‑head attention – In the transposed space each channel already encodes global information, so a single head suffices and keeps FLOPs low.
  • CLS token handling – The CLS token participates in both spatial and channel attention, preserving its role as a global aggregator.
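The reported parameter savings are consistent with a back-of-envelope count. The sketch below assumes ViT-tiny-like dimensions (embedding C = 192, depth 12, N = 196 patch tokens for a 224×224 image with 16×16 patches, MLP expansion ratio 4) and that SHSA is implemented as four N × N projections in the transposed space; none of these specifics are stated in the summary, so treat the numbers as illustrative only.

```python
# Back-of-envelope: parameters freed by swapping each block's MLP for SHSA.
# Assumed (not from the paper): ViT-tiny dims C=192, depth=12, N=196 patch
# tokens, MLP expansion ratio 4, and SHSA as four N x N projections
# (Q, K, V, output) acting on the token dimension. Biases ignored.
C, depth, N, expand = 192, 12, 196, 4

mlp_per_block = 2 * C * (expand * C)   # two linear layers: C->4C and 4C->C
shsa_per_block = 4 * N * N             # Q, K, V, output projections

mlp_total = depth * mlp_per_block
shsa_total = depth * shsa_per_block
print(f"MLP params removed : {mlp_total / 1e6:.2f} M")
print(f"SHSA params added  : {shsa_total / 1e6:.2f} M")
print(f"Net reduction      : {(mlp_total - shsa_total) / 1e6:.2f} M")
```

Under these assumptions the estimate lands at roughly 1.7 M fewer parameters, in the same ballpark as the reported 5.75 M → 3.91 M drop, which supports the claim that the savings come almost entirely from deleting the MLP.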

Experiments were conducted on five classification benchmarks covering natural (CIFAR‑10, Cats‑vs‑Dogs) and medical domains (Malaria blood‑smear, PneumoniaMNIST, BreastMNIST). Both the baseline ViT‑tiny and the proposed CAViT‑tiny were trained with identical hyper‑parameters (SGD, lr = 0.001, 100 epochs, no aggressive augmentation) on an RTX 4090. Results show that CAViT‑tiny consistently matches or exceeds ViT‑tiny accuracy while reducing model size by ~32 % (5.75 M → 3.91 M parameters) and FLOPs by ~33 % (2.267 G → 1.52 G). Notable gains include +3.64 pp on CIFAR‑10, +2.50 pp on BreastMNIST, and +1.14 pp on PneumoniaMNIST; performance on Malaria remains unchanged but with far fewer resources.

Qualitative analysis using DINO‑style token‑level attention visualizations reveals that CAViT produces sharper, more semantically focused maps. In medical scans the model highlights lesion regions more precisely, while in natural images it attends to object boundaries rather than noisy edges. This demonstrates that channel‑wise attention enables content‑aware feature recalibration, improving both interpretability and discriminative power.

Ablation studies explore several variants: (i) removing channel attention (spatial MHSA only), (ii) using multi‑head channel attention, and (iii) omitting the CLS token from the channel‑attention stage. The full CAViT configuration (single‑head channel attention with CLS inclusion) yields the best trade‑off between accuracy and efficiency; multi‑head channel attention adds computational cost without measurable performance benefit, and excluding the CLS token slightly degrades results.

In summary, CAViT offers a unified, attention‑only Transformer block where both spatial and channel interactions are dynamically learned. The approach requires only a dimension‑swap operation and a lightweight SHSA layer, preserving the simplicity of the original ViT while delivering substantial gains in parameter efficiency, computational cost, and cross‑domain accuracy. The work suggests that future large‑scale vision foundation models could benefit from integrating channel‑aware self‑attention as a core component, leading to more adaptable and interpretable architectures.

