Interpretable Vision Transformers in Monocular Depth Estimation via SVDA
Monocular depth estimation is a central problem in computer vision with applications in robotics, AR, and autonomous driving, yet the self-attention mechanisms that drive modern Transformer architectures remain opaque. We introduce SVD-Inspired Attention (SVDA) into the Dense Prediction Transformer (DPT), providing the first spectrally structured formulation of attention for dense prediction tasks. SVDA decouples directional alignment from spectral modulation by embedding a learnable diagonal matrix into normalized query-key interactions, enabling attention maps that are intrinsically interpretable rather than post-hoc approximations. Experiments on KITTI and NYU-v2 show that SVDA preserves or slightly improves predictive accuracy while adding only minor computational overhead. More importantly, SVDA unlocks six spectral indicators that quantify entropy, rank, sparsity, alignment, selectivity, and robustness. These reveal consistent cross-dataset and depth-wise patterns in how attention organizes during training, insights that remain inaccessible in standard Transformers. By shifting the role of attention from opaque mechanism to quantifiable descriptor, SVDA redefines interpretability in monocular depth estimation and opens a principled avenue toward transparent dense prediction models.
💡 Research Summary
The paper introduces SVD‑Inspired Attention (SVDA) into the Dense Prediction Transformer (DPT) for monocular depth estimation, aiming to make the self‑attention mechanism transparent without sacrificing performance. Traditional dot‑product attention mixes directional alignment (query–key similarity) with the magnitude of latent dimensions, which makes attention maps dense and hard to interpret. SVDA replaces the dot‑product with a spectrally structured formulation: queries and keys are ℓ₂‑normalized row‑wise, and a learnable diagonal matrix Σ (one per head) modulates each latent dimension before the softmax. Mathematically, the attention matrix becomes A = softmax((Q Σ Kᵀ)/√dₖ). This design separates “directional” information (the inner product of Q and K) from “spectral” importance (the diagonal entries of Σ), mirroring the decomposition used in Singular Value Decomposition.
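The formulation above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the authors' implementation: the parameterization of Σ through `log_sigma` (exponentiated to keep the diagonal positive) and the per-head tensor shapes are our assumptions, since the summary only gives the final formula A = softmax((Q Σ Kᵀ)/√dₖ) with row-wise ℓ₂-normalized Q and K.

```python
import torch
import torch.nn.functional as F

def svda_attention(q, k, v, log_sigma):
    """SVD-inspired attention sketch: l2-normalize Q and K row-wise, then
    scale each latent dimension by a learnable diagonal Sigma before softmax.

    q, k, v: (batch, heads, tokens, d_k)
    log_sigma: (heads, d_k) learnable parameter; exponentiating it is an
    assumed way to keep the Sigma spectrum positive.
    """
    d_k = q.shape[-1]
    q = F.normalize(q, dim=-1)   # unit-norm rows: pure directional information
    k = F.normalize(k, dim=-1)
    sigma = log_sigma.exp()      # (heads, d_k) diagonal spectrum
    # Q Sigma K^T reduces to (Q * sigma) K^T because Sigma is diagonal
    scores = (q * sigma[None, :, None, :]) @ k.transpose(-2, -1)
    attn = torch.softmax(scores / d_k ** 0.5, dim=-1)
    return attn @ v, attn
```

Because Σ is diagonal, the modulation costs only an elementwise multiply per head, which is why the parameter and runtime overhead reported later in the summary stays small.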
Because Σ is explicit and trainable, the model itself reveals which dimensions are active, which are suppressed, and how this evolves during training. The authors define six spectral indicators to quantify these properties: (1) Spectral Entropy (Hₛ) – entropy of the normalized Σ spectrum, measuring disorder; (2) Effective Rank – e^{Hₛ}, estimating the number of effectively used dimensions; (3) Angular Alignment – cosine similarity between normalized Q and K rows, reflecting semantic closeness; (4) Selectivity Index – a statistical measure of how sharply attention concentrates on specific tokens; (5) Spectral Sparsity – proportion of Σ entries whose magnitude falls below a small threshold, indicating prune‑able directions; (6) Perturbation Robustness – Frobenius norm difference between attention maps under a small input perturbation, assessing stability.
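Three of these indicators depend only on the learned Σ spectrum and can be computed directly from its diagonal entries. The sketch below follows the definitions as stated in the summary; the exact normalization of the spectrum and the sparsity threshold τ are assumptions, since the summary does not give closed forms.

```python
import torch

def spectral_indicators(sigma, eps=1e-12, tau=1e-3):
    """Compute indicators (1), (2), and (5) for one head's diagonal Sigma.

    sigma: 1-D tensor of diagonal Sigma entries.
    Normalizing |Sigma| to a probability distribution before the entropy,
    and the threshold tau for sparsity, are assumed conventions.
    """
    p = sigma.abs() / (sigma.abs().sum() + eps)            # normalized spectrum
    entropy = -(p * (p + eps).log()).sum()                 # (1) spectral entropy H_s
    eff_rank = entropy.exp()                               # (2) effective rank e^{H_s}
    sparsity = (sigma.abs() < tau).float().mean()          # (5) fraction of prune-able dims
    return entropy.item(), eff_rank.item(), sparsity.item()
```

For a uniform spectrum of dimension d, the entropy is log d and the effective rank is d; as Σ concentrates on a few dimensions during training, both fall and sparsity rises, which is the dynamic the next paragraphs describe.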
Experiments on two standard depth benchmarks, KITTI (outdoor) and NYU‑v2 (indoor), show that SVDA‑DPT matches or slightly improves the baseline DPT across all standard metrics (e.g., AbsRel improves from 0.058 to 0.056 on KITTI, and from 0.133 to 0.124 on NYU‑v2). The training curves are virtually identical, indicating that the spectral modification does not hinder convergence.
Beyond accuracy, the paper’s core contribution lies in the dynamics of the spectral indicators. Early training epochs exhibit high spectral entropy and effective rank, meaning the model initially spreads information across many latent dimensions. As training proceeds, entropy and rank steadily decline while spectral sparsity rises, showing that the network automatically prunes irrelevant dimensions and concentrates its representational capacity. Angular alignment and selectivity stabilize quickly, suggesting that the model establishes coherent directional relationships early on. Perturbation robustness improves with depth, indicating that deeper layers become less sensitive to small input noise.
Layer‑wise analysis reveals a consistent depth‑wise pattern: shallow layers retain higher entropy and rank, reflecting broad, global context gathering; mid‑layers begin to compress the spectrum; deep layers show low entropy, low rank, high selectivity, and strong robustness, focusing the refined depth cues onto the final prediction. These trends are observed on both datasets, underscoring that the spectral behavior is intrinsic to the DPT backbone rather than dataset‑specific.
Computationally, SVDA adds only a modest runtime overhead (~15 % slower) due to the extra ℓ₂‑normalization and diagonal scaling, while increasing parameter count by a negligible 0.01 %. Interestingly, the overall multiply‑accumulate count drops by ~6.8 % because the learned Σ often suppresses dimensions, reducing effective computation.
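The MAC reduction can be illustrated by dropping latent dimensions whose Σ entries fall below the sparsity threshold before forming Q Σ Kᵀ. This is a hypothetical post-hoc pruning sketch motivated by indicator (6)'s prune-able directions, not a step the paper itself describes:

```python
import torch

def prune_by_sigma(q, k, sigma, tau=1e-3):
    """Drop latent dimensions with |Sigma| below tau before the score
    computation, shrinking the multiply-accumulate count proportionally.

    q, k: (tokens, d_k); sigma: (d_k,). Hypothetical pruning illustration.
    """
    keep = sigma.abs() >= tau                          # active spectral directions
    scores = (q[:, keep] * sigma[keep]) @ k[:, keep].T
    return scores, int(keep.sum())
```

Dimensions with Σ entries at (or near) zero contribute nothing to the scores, so pruning them leaves the attention map essentially unchanged while reducing effective computation, consistent with the ~6.8 % MAC drop reported above.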
The authors argue that SVDA transforms attention from a black‑box component into a diagnostic lens. By exposing how attention spectra evolve, researchers can now monitor compression, sparsity, alignment, and robustness throughout training, facilitating model debugging, safety verification, and potentially guided pruning. The paper positions this work as a step toward trustworthy vision Transformers for safety‑critical applications such as robotics and autonomous driving, where understanding internal mechanisms is as important as raw performance.
In summary, the study demonstrates that integrating SVD‑Inspired Attention into a state‑of‑the‑art monocular depth estimator preserves accuracy while providing a principled, quantifiable framework for interpreting and analyzing attention. This bridges the gap between theoretical spectral decomposition and practical dense‑prediction vision models, opening avenues for transparent, explainable, and efficient Transformer‑based systems.