LOTFormer: Doubly-Stochastic Linear Attention via Low-Rank Optimal Transport
Transformers have proven highly effective across modalities, but standard softmax attention scales quadratically with sequence length, limiting long-context modeling. Linear attention mitigates this by approximating attention with kernel feature maps, yet most such mechanisms remain row-normalized and can over-concentrate mass on a few tokens, harming robustness and information flow. Doubly stochastic attention counteracts this by balancing token participation across both rows and columns, but existing approaches often add significant overhead. We propose LOTFormer, a linear-time doubly stochastic attention mechanism derived from an optimal-transport view of attention as a coupling between query and key measures. LOTFormer enforces a low-rank transport plan by conditioning on a learnable pivot measure with small support. We solve two entropic transport problems, queries-to-pivot and pivot-to-keys, and compose them into a conditional coupling that is provably doubly stochastic, has rank at most $r \ll n$, and applies to values in $O(nr)$ time without forming the full $n \times n$ matrix. The pivot locations and masses are learned end-to-end. Across vision and text benchmarks, LOTFormer delivers strong accuracy-efficiency trade-offs when plugged into standard backbones including Swin, DeiT, and BERT.
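The contrast the abstract draws between row-stochastic and doubly stochastic attention can be seen in a toy Sinkhorn-Knopp iteration (a minimal NumPy sketch; the 5×5 matrix, seed, and iteration count are illustrative, not from the paper):

```python
import numpy as np

# Alternately normalizing rows and columns of a positive matrix
# converges to a doubly stochastic matrix -- the balancing idea
# behind doubly stochastic attention.
rng = np.random.default_rng(0)
A = np.exp(rng.normal(size=(5, 5)))  # positive "attention scores"

for _ in range(200):
    A /= A.sum(axis=1, keepdims=True)  # make each row sum to 1
    A /= A.sum(axis=0, keepdims=True)  # make each column sum to 1

# Both marginals are now (approximately) uniform: no single token
# can monopolize the attention mass of the whole sequence.
```

A purely row-normalized map (softmax or row-stochastic linear attention) only constrains the first marginal, which is what permits the over-concentration the abstract describes.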
💡 Research Summary
The paper introduces LOTFormer, a novel attention mechanism that simultaneously addresses the quadratic complexity of standard soft‑max attention and the token over‑focusing problem inherent in row‑stochastic linear attention. The authors reinterpret attention as an optimal transport (OT) coupling between two empirical probability measures defined by the query and key vectors. Instead of directly computing an n × n transport plan, they introduce a learnable “pivot” measure σ with a small support size r ≪ n. Two entropic OT problems are solved: one between the query measure µ and the pivot σ, and another between the key measure ν and σ. Each OT problem admits a Sinkhorn‑scaled solution of the form Γ¹ = Diag(u) exp(QZᵀ/ε) Diag(v) and Γ² = Diag(ũ) exp(KZᵀ/ε) Diag(ṽ), where Z contains the r pivot vectors, so Γ¹ and Γ² are n × r couplings whose column marginals both equal σ. The final attention matrix is obtained by “gluing” these two couplings through the pivot: Γ = Γ¹ Diag(σ)⁻¹ (Γ²)ᵀ. By construction Γ has row and column marginals equal to µ and ν, making it doubly stochastic, and its rank is bounded by r. Consequently, applying Γ to the value matrix V can be performed as V′ = Γ¹ (Diag(σ)⁻¹ ((Γ²)ᵀ V)), which requires only O(n r) operations per value dimension and avoids forming the full n × n matrix.
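The pipeline above can be sketched end to end in NumPy. This is an illustrative reconstruction under the stated definitions, not the authors' implementation: the plain Sinkhorn solver, uniform marginals, and random matrices standing in for learned queries, keys, and pivots are all assumptions.

```python
import numpy as np

def sinkhorn(Kmat, a, b, iters=500):
    """Entropic OT scalings: returns u, v such that Diag(u) K Diag(v)
    has row marginals a and column marginals b."""
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(iters):
        u = a / (Kmat @ v)
        v = b / (Kmat.T @ u)
    return u, v

rng = np.random.default_rng(0)
n, r, d = 64, 4, 16                     # sequence length, pivot size, head dim
eps = 1.0                               # entropic regularization
Q = rng.normal(size=(n, d)) / d**0.5    # queries (learned in the real model)
Kf = rng.normal(size=(n, d)) / d**0.5   # keys
Z = rng.normal(size=(r, d)) / d**0.5    # pivot locations (learnable)
mu = np.full(n, 1.0 / n)                # query marginal
nu = np.full(n, 1.0 / n)                # key marginal
sigma = np.full(r, 1.0 / r)             # pivot masses (learnable; uniform here)

# Two small entropic OT problems: queries<->pivot and keys<->pivot.
K1 = np.exp(Q @ Z.T / eps)              # n x r Gibbs kernel
K2 = np.exp(Kf @ Z.T / eps)             # n x r Gibbs kernel
u1, v1 = sinkhorn(K1, mu, sigma)
u2, v2 = sinkhorn(K2, nu, sigma)
G1 = u1[:, None] * K1 * v1[None, :]     # coupling of (mu, sigma), n x r
G2 = u2[:, None] * K2 * v2[None, :]     # coupling of (nu, sigma), n x r

# Glued plan Gamma = G1 Diag(sigma)^-1 G2^T has rank <= r; apply it to
# values right-to-left in O(n r d) without forming the n x n matrix.
V = rng.normal(size=(n, d))
Vout = G1 @ ((G2.T @ V) / sigma[:, None])
```

The glued plan inherits its row marginal from Γ¹ and its column marginal from Γ², since both couplings share the pivot marginal σ, which the middle factor Diag(σ)⁻¹ cancels.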
The pivot locations and masses are learned jointly with the rest of the Transformer parameters, allowing the model to adapt the low‑rank transport structure to the data distribution. The authors also propose a practical treatment for the