Interpretable Vision Transformers in Image Classification via SVDA
Vision Transformers (ViTs) have achieved state-of-the-art performance in image classification, yet their attention mechanisms often remain opaque and exhibit dense, unstructured behavior. In this work, we adapt our previously proposed SVD-Inspired Attention (SVDA) mechanism to the ViT architecture, introducing a geometrically grounded formulation that enhances interpretability, sparsity, and spectral structure. We use the interpretability indicators originally proposed with SVDA to monitor attention dynamics during training and to assess structural properties of the learned representations. Experimental evaluations on four widely used benchmarks – CIFAR-10, FashionMNIST, CIFAR-100, and ImageNet-100 – demonstrate that SVDA consistently yields more interpretable attention patterns without sacrificing classification accuracy. While the current framework offers descriptive insights rather than prescriptive guidance, our results establish SVDA as a comprehensive and informative tool for analyzing and developing structured attention models in computer vision. This work lays the foundation for future advances in explainable AI, spectral diagnostics, and attention-based model compression.
💡 Research Summary
The paper introduces SVD‑Inspired Attention (SVDA) as a drop‑in replacement for the standard dot‑product self‑attention in Vision Transformers (ViTs). SVDA decomposes the attention computation into a normalized query‑key product modulated by a learned diagonal spectral matrix Σ, mirroring the structure of a singular value decomposition. This formulation separates directional information (captured by ℓ2‑normalized Q and K) from spectral importance (encoded in Σ), yielding attention maps that are sparser, more structured, and easier to interpret without altering the overall ViT architecture, computational complexity, or parameter count (the increase is < 0.04%).
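The decomposition described above can be sketched in a few lines. This is a minimal numpy illustration, not the authors' implementation: the function name `svda_attention`, the per-head dimension, and the epsilon constant are assumptions, and Σ is represented as a plain vector of learnable diagonal entries.

```python
import numpy as np

def svda_attention(Q, K, V, sigma, eps=1e-8):
    """Hypothetical sketch of SVD-Inspired Attention (SVDA).

    Q, K, V: (tokens, d) per-head projections.
    sigma:   (d,) learned diagonal spectral weights (the diagonal of Sigma).

    Rows of Q and K are l2-normalized so they carry only directional
    information; per-dimension importance is encoded entirely in sigma,
    mirroring the U Sigma V^T structure of an SVD.
    """
    Qn = Q / (np.linalg.norm(Q, axis=-1, keepdims=True) + eps)
    Kn = K / (np.linalg.norm(K, axis=-1, keepdims=True) + eps)
    scores = (Qn * sigma) @ Kn.T                   # Q_n Sigma K_n^T
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)       # row-wise softmax
    return attn @ V, attn
```

Because Σ is a single length-d vector per head, the parameter overhead relative to standard dot-product attention is tiny, consistent with the < 0.04 % increase reported above.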
The authors evaluate SVDA‑augmented ViTs on four image‑classification benchmarks of increasing difficulty: FashionMNIST, CIFAR‑10, CIFAR‑100, and ImageNet‑100. All experiments use compact ViT backbones (4 layers × 4 heads for the first three datasets, 8 layers × 2 heads for ImageNet‑100) trained from scratch for 100 epochs with Adam. Two lightweight regularizers are added: an orthogonality penalty on the query/key projections (λ = 10⁻³) and a spectral‑entropy penalty on Σ (λ = 10⁻³).
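The two regularizers might look roughly as follows; this is a sketch under assumptions, since the paper summary does not give the exact formulas. The soft-orthogonality term is taken to be a Frobenius penalty on W^T W − I, and the spectral-entropy term the Shannon entropy of the normalized |Σ| entries, both weighted by λ = 10⁻³.

```python
import numpy as np

def orthogonality_penalty(W):
    """||W^T W - I||_F^2: pushes the query/key projection columns
    toward soft orthonormality (assumed form of the paper's penalty)."""
    G = W.T @ W
    return float(np.sum((G - np.eye(W.shape[1])) ** 2))

def spectral_entropy_penalty(sigma, eps=1e-8):
    """Shannon entropy of the normalized |sigma| distribution;
    minimizing it concentrates mass on few spectral dimensions."""
    p = np.abs(sigma) / (np.abs(sigma).sum() + eps)
    return float(-np.sum(p * np.log(p + eps)))

# Assumed composition into the training loss, with lambda = 1e-3 for both:
def total_loss(task_loss, W_q, W_k, sigma, lam=1e-3):
    reg = orthogonality_penalty(W_q) + orthogonality_penalty(W_k)
    return task_loss + lam * reg + lam * spectral_entropy_penalty(sigma)
```

An exactly orthonormal projection contributes zero orthogonality penalty, while a uniform Σ attains the maximum entropy log d, so both terms pull toward the structured regime the indicators below measure.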
Performance: Across all datasets, SVDA achieves virtually identical training and validation accuracy curves to the baseline ViT. The modest capacity of the models leads to over‑fitting on CIFAR‑100 and ImageNet‑100, but this effect is independent of the attention mechanism. Training time per epoch rises by roughly 10‑20 % for SVDA due to extra normalization and diagonal‑matrix operations, yet the number of multiply‑accumulate operations (MACs) actually drops (up to 36 % reduction on ImageNet‑100), indicating that current GPU kernels are not yet optimized for SVDA’s operations.
Interpretability Indicators: Building on their prior work, the authors compute six quantitative metrics at every epoch for each head and layer: (1) Spectral Entropy and Effective Rank (measuring the concentration of Σ’s spectrum), (2) Spectral Sparsity (fraction of near‑zero diagonal entries), (3) Angular Alignment (average cosine similarity between normalized Q and K), (4) Selectivity Index (degree of attention concentration on a few tokens), (5) Perturbation Robustness (stability of attention maps under input noise), and (6) a composite sparsity‑stability score. SVDA consistently yields lower spectral entropy, lower effective rank, higher sparsity (70‑85 % of Σ entries near zero), higher angular alignment (> 0.85), and higher selectivity compared with the baseline. Perturbation robustness improves as well, indicating that the learned attention is not only more structured but also more stable.
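Several of these indicators are direct functions of Σ and the normalized projections, so they are cheap to log every epoch. Below is a hedged numpy sketch of plausible definitions for four of the six metrics (spectral entropy, effective rank, spectral sparsity, angular alignment); the exact formulas and thresholds are assumptions, not reproduced from the paper.

```python
import numpy as np

def spectral_entropy(sigma, eps=1e-8):
    """Shannon entropy of the normalized |sigma| spectrum (lower = more
    concentrated, the regime SVDA is reported to reach)."""
    p = np.abs(sigma) / (np.abs(sigma).sum() + eps)
    return float(-np.sum(p * np.log(p + eps)))

def effective_rank(sigma, eps=1e-8):
    """exp(spectral entropy): a continuous count of active dimensions."""
    return float(np.exp(spectral_entropy(sigma, eps)))

def spectral_sparsity(sigma, tol=1e-3):
    """Fraction of near-zero diagonal entries (tol is an assumed cutoff);
    the paper reports 70-85 % for SVDA."""
    return float(np.mean(np.abs(sigma) < tol))

def angular_alignment(Qn, Kn):
    """Mean cosine similarity between matched l2-normalized query/key
    rows; SVDA reportedly exceeds 0.85."""
    return float(np.mean(np.sum(Qn * Kn, axis=-1)))
```

The remaining two indicators (selectivity and perturbation robustness) additionally require the attention maps themselves, computed on clean and noise-perturbed inputs.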
Analysis: The diagonal Σ learns to emphasize a small subset of latent dimensions, effectively performing an automatic dimensionality reduction within each attention head. The orthogonality regularizer encourages soft‑orthonormal query/key projections, which together with the diagonal modulation produce attention maps that align closely with semantic regions of the input image. Visualizations (not reproduced here) show that SVDA heads focus on meaningful object parts rather than diffuse, noisy patterns typical of vanilla ViTs.
Limitations and Future Work: The study uses relatively shallow ViTs, which limits generalization on large‑scale datasets; the observed over‑fitting on CIFAR‑100 and ImageNet‑100 is attributed to model capacity rather than SVDA itself. Moreover, the current implementation incurs a modest wall‑clock overhead despite lower theoretical FLOPs, suggesting the need for optimized kernels (e.g., fused normalization‑projection operations). Future directions include scaling SVDA to large pre‑trained ViT‑Base/ViT‑Large models, exploring SVDA‑driven model compression (pruning low‑importance spectral dimensions), and integrating the six interpretability metrics into a training‑time regularizer that explicitly optimizes for sparsity and stability.
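The compression direction mentioned above (pruning low-importance spectral dimensions) follows naturally from the diagonal Σ: since 70-85 % of its entries are near zero, dropping them should barely change the attention scores. A speculative one-function sketch of such magnitude-based pruning, not from the paper:

```python
import numpy as np

def prune_sigma(sigma, keep_fraction=0.5):
    """Keep only the largest-|sigma| spectral dimensions, zeroing the
    rest. A zeroed entry removes that latent dimension from the
    Q_n Sigma K_n^T product entirely. keep_fraction is a hypothetical
    knob; the paper proposes this direction but gives no recipe."""
    k = max(1, int(len(sigma) * keep_fraction))
    keep = np.argsort(np.abs(sigma))[::-1][:k]
    pruned = np.zeros_like(sigma)
    pruned[keep] = sigma[keep]
    return pruned
```

In a real implementation the corresponding columns of the query/key projections could then be deleted outright, reducing both parameters and MACs rather than merely masking them.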
Conclusion: By embedding a spectrally‑structured attention mechanism directly into ViTs, SVDA provides a principled way to obtain interpretable, sparse, and stable attention maps without sacrificing classification performance. The six diagnostic indicators enable continuous monitoring of attention dynamics during training, opening avenues for explainable AI, spectral diagnostics, and attention‑based model compression in computer vision.