Vision KAN: Towards an Attention-Free Backbone for Vision with Kolmogorov-Arnold Networks

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Attention mechanisms have become a key module in modern vision backbones due to their ability to model long-range dependencies. However, their quadratic complexity in sequence length and the difficulty of interpreting attention weights limit both scalability and clarity. Recent attention-free architectures demonstrate that strong performance can be achieved without pairwise attention, motivating the search for alternatives. In this work, we introduce Vision KAN (ViK), an attention-free backbone inspired by Kolmogorov-Arnold Networks. At its core lies MultiPatch-RBFKAN, a unified token mixer that combines (a) a patch-wise nonlinear transform with Radial Basis Function-based KANs, (b) axis-wise separable mixing for efficient local propagation, and (c) a low-rank global mapping for long-range interaction. Designed as a drop-in replacement for attention modules, this formulation tackles the prohibitive cost of full KANs on high-resolution features by adopting a patch-wise grouping strategy with lightweight operators to restore cross-patch dependencies. Experiments on ImageNet-1K show that ViK achieves competitive accuracy with linear complexity, demonstrating the potential of KAN-based token mixing as an efficient and theoretically grounded alternative to attention.


💡 Research Summary

The paper introduces Vision KAN (ViK), an attention‑free vision backbone that replaces the quadratic‑cost self‑attention modules of modern Transformers with a function‑based token mixer derived from the Kolmogorov‑Arnold representation theorem. The core component, MultiPatch‑RBFKAN, combines three complementary operations: (a) a patch‑wise nonlinear transform using Radial Basis Function (RBF) expansions to approximate univariate functions within each non‑overlapping patch, (b) axis‑wise separable mixing implemented as horizontal and vertical depthwise convolutions followed by a soft‑weighted combination, and (c) a low‑rank global mapping that compresses the spatial token dimension to a small latent space and projects it back, thereby capturing long‑range dependencies with linear complexity.

The authors first review the limitations of self‑attention—quadratic memory and compute cost and limited interpretability—and motivate attention‑free alternatives such as MLP‑Mixer and MetaFormer. They then present the Kolmogorov‑Arnold theorem, which guarantees that any continuous multivariate function can be expressed as a composition of univariate functions. Building on recent KAN (Kolmogorov‑Arnold Network) work, they replace fixed activations with learnable basis expansions, choosing RBFs for their parallel GPU friendliness and clear visualizability.

ViK follows a hierarchical design similar to Vision Transformers: a convolutional patch embedding feeds four stages of ViK blocks, each reducing spatial resolution while increasing channel width. Inside each block, MultiPatch‑RBFKAN processes the feature map. First, the map is divided into p×p patches (p=4 by default). For each patch, an RBF‑KAN with M basis functions (M=8 in the main experiments) computes ϕ(x)=∑_{j=1}^{M} w_j·exp(−‖x−μ_j‖²/(2σ_j²)). This captures rich local non‑linearities without pairwise interactions.
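The patch-wise RBF transform above can be sketched in a few lines of NumPy. This is an illustrative toy, not the authors' implementation: the feature-map shape, basis centers, and widths are arbitrary, and one basis set is shared across all patches and channels for brevity (the paper's per-patch KANs would carry separate learnable parameters per patch).

```python
import numpy as np

def rbf_kan(x, mu, sigma, w):
    """Elementwise learnable RBF expansion:
    phi(x) = sum_j w_j * exp(-(x - mu_j)^2 / (2 * sigma_j^2))."""
    d = x[..., None] - mu                  # broadcast to (..., M)
    act = np.exp(-d**2 / (2 * sigma**2))   # M Gaussian basis responses
    return act @ w                         # weighted sum over the M bases

rng = np.random.default_rng(0)
M = 8                                      # number of RBF bases (paper default)
mu = np.linspace(-2, 2, M)                 # basis centers (toy values)
sigma = np.full(M, 0.5)                    # basis widths (toy values)
w = rng.normal(size=M)                     # mixing weights (would be learned)

feat = rng.normal(size=(16, 16, 64))       # H x W x C feature map (toy shape)
p = 4                                      # patch size
H, W, C = feat.shape
# View the map as a grid of non-overlapping p x p patches; the transform is
# elementwise, so per-patch parameters could be attached to the patch axes.
patches = feat.reshape(H // p, p, W // p, p, C)
out = rbf_kan(patches, mu, sigma, w).reshape(H, W, C)
print(out.shape)  # (16, 16, 64)
```

Because the transform acts on each element independently, it captures local non-linearity without any pairwise token interactions, which is what keeps this stage cheap.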

Next, to enable cross‑patch communication, the authors apply two depthwise convolutions—one horizontal, one vertical—producing DWh(y) and DWv(y). Global average pooling followed by a small MLP yields scalar weights α_h and α_v, normalized by softmax. The mixed output is α_h·DWh(y)+α_v·DWv(y), allowing the network to adaptively emphasize horizontal or vertical structures.
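The axis-wise mixing step can be sketched as follows. This is a minimal NumPy mock-up under assumed shapes: the "small MLP" is collapsed to a single linear map, and kernel size 3 is a guess, since the summary does not state it.

```python
import numpy as np

def dw_conv1d(x, kernel, axis):
    """Depthwise 1-D convolution along one spatial axis (same padding).
    x: (H, W, C); kernel: (k, C), one filter per channel."""
    k = kernel.shape[0]
    pad = [(0, 0)] * x.ndim
    pad[axis] = (k // 2, k // 2)
    xp = np.pad(x, pad)
    out = np.zeros_like(x)
    for i in range(k):
        sl = [slice(None)] * x.ndim
        sl[axis] = slice(i, i + x.shape[axis])
        out += xp[tuple(sl)] * kernel[i]   # per-channel weights broadcast over C
    return out

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
y = rng.normal(size=(16, 16, 64))          # H x W x C (toy shape)
kh = rng.normal(size=(3, 64))              # horizontal depthwise kernel
kv = rng.normal(size=(3, 64))              # vertical depthwise kernel
Wg = rng.normal(size=(64, 2)) * 0.1        # stand-in for the small gating MLP

dwh = dw_conv1d(y, kh, axis=1)             # DWh(y): mix along width
dwv = dw_conv1d(y, kv, axis=0)             # DWv(y): mix along height
alpha = softmax(y.mean(axis=(0, 1)) @ Wg)  # GAP -> scalar weights, softmaxed
mixed = alpha[0] * dwh + alpha[1] * dwv    # alpha_h * DWh(y) + alpha_v * DWv(y)
print(mixed.shape)
```

The softmax guarantees α_h + α_v = 1, so the gate only redistributes emphasis between the two axes rather than rescaling the features.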

Finally, a low‑rank global path reshapes each channel into a length‑N vector (N=H·W) and applies a projection P∈ℝ^{r×N} (r≪N) and its transpose Q∈ℝ^{N×r}. The operation y_global = Q·P·y injects global context at O(N·C·r) cost.
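A minimal sketch of the low-rank global path, with random matrices standing in for the learned projections (the shapes N=256, C=64, r=8 are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
H, W, C, r = 16, 16, 64, 8
N = H * W
y = rng.normal(size=(H, W, C)).reshape(N, C)  # flatten tokens: (N, C)

P = rng.normal(size=(r, N)) / np.sqrt(N)      # compress: N tokens -> r latents
Q = rng.normal(size=(N, r)) / np.sqrt(r)      # expand:   r latents -> N tokens

# Grouping as Q @ (P @ y) keeps the cost at O(N*C*r); materializing the
# N x N product Q @ P first would bring back quadratic cost.
y_global = Q @ (P @ y)
print(y_global.shape)  # (256, 64)
```

Every output token is a weighted blend of all N inputs routed through only r latent slots, which is how global context survives the compression.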

The overall per‑block complexity becomes O(N·C·(M·p² + k + r)), linear in the number of tokens N, in stark contrast to the O(N²·C) of self‑attention.
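To make the linear-vs-quadratic gap concrete, a back-of-the-envelope comparison using the summary's symbols (the constants C=64, k=3 here are illustrative; M, p, r follow the stated defaults):

```python
# Rough per-block multiply counts, constants ignored:
#   ViK:            N * C * (M * p^2 + k + r)   -- linear in N
#   self-attention: N^2 * C                     -- quadratic in N
C, M, p, k, r = 64, 8, 4, 3, 8

for N in (196, 784, 3136):          # 14x14, 28x28, 56x56 token grids
    vik = N * C * (M * p**2 + k + r)
    attn = N**2 * C
    print(N, vik, attn, round(attn / vik, 1))
```

The ratio grows linearly with N, so the advantage is modest at coarse resolutions but exceeds an order of magnitude on the 56×56 grid typical of early backbone stages.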

Experimental evaluation on ImageNet‑1K uses standard training recipes (300 epochs, AdamW, 224×224 inputs). ViK‑Small (13.5 M parameters, 1.6 GFLOPs) achieves 76.5 % top‑1 accuracy, comparable to ResMLP‑S12 (76.6 %) but with roughly half the FLOPs. ViK‑Base (24.9 M, 3.2 GFLOPs) reaches 80.3 % top‑1, surpassing ResNet‑50 (79.2 %) and matching larger Transformers while remaining more efficient.

Ablation studies explore the impact of basis function type, number of bases, and the presence of separable mixing and low‑rank mapping. RBF bases consistently outperform B‑spline, wavelet, and a plain MLP replacement (the latter drops accuracy by ~4.4 %). Increasing M from 4 to 8 improves performance, but gains saturate at M=8. Removing either the axis‑wise mixing or the low‑rank global path reduces accuracy to 74.6 % and 73.9 % respectively, confirming that all three components are essential.

To demonstrate interpretability, the authors visualize the learned univariate RBF functions across stages. Early layers exhibit highly oscillatory curves, reflecting sensitivity to fine‑grained texture, while deeper layers converge to smoother mappings, indicating abstraction toward semantic features. This provides a rare glimpse into the internal workings of a vision model beyond opaque attention weights.

The paper concludes that Kolmogorov‑Arnold‑theorem‑inspired function approximators can serve as a principled, efficient alternative to attention in vision backbones. Limitations include fixed patch size and static low‑rank dimension, which may affect scalability to very high‑resolution inputs or video streams. Future work is suggested on adaptive patch/grouping strategies, dynamic rank selection, and extensions to multimodal or temporal data. Overall, Vision KAN offers a compelling direction for building lightweight, interpretable, and theoretically grounded vision architectures.

