Scaled Dot-Product Attention implements projection of inputs onto a common surface
Scaled dot-product attention (SDPA) is a fundamental component responsible for the success of large language models and other nonlinear signal processing applications. The rationale for SDPA has been based upon “query, key, value” concepts borrowed from database theory, but these concepts are difficult to reconcile with standard methods in mathematical signal processing. We show that SDPA can be rewritten in a different but mathematically equivalent form as a projection of the input vectors onto a common surface determined by the inputs themselves. SDPA therefore discovers nonlinear dependencies in the input that are time-dependent and context-dependent. The rewritten form of SDPA permits increased speed of both feedforward and learning algorithms but, more importantly, suggests potential extensions. In the context of language, we re-interpret the role of SDPA as finding a time-dependent contextual meaning determined by the surface on which the set of input vectors lies. Input token embeddings are then modified by the local context surface. This interpretation differs substantially from the concept of “self-attention”, and provides a strong justification for the use of SDPA for time-series data with time-varying local nonlinear dependencies.
💡 Research Summary
The paper presents a novel mathematical reinterpretation of Scaled Dot‑Product Attention (SDPA), the core mechanism behind modern Transformers. While the conventional “query‑key‑value” narrative originates from database indexing, the authors show that SDPA can be exactly rewritten as a projection operation in which each output is expressed as a weighted sum of value vectors, with the weights derived from a Gaussian kernel based on Euclidean distances between normalized query and key embeddings. Assuming layer normalization forces unit‑norm embeddings, the dot product qᵀk becomes 1 − ‖q − k‖²/2, allowing the softmax to be expressed as e^{−‖q−k‖²/(2σ²)} normalized across keys. Consequently, the attention matrix is a distance‑based similarity matrix, and the output is a projection of the input onto a low‑dimensional surface (or manifold) defined by neighboring inputs.
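The algebraic step above can be checked numerically. The sketch below (illustrative only; the dimensions, bandwidth `sigma2`, and random vectors are not from the paper) shows that for unit-norm queries and keys, softmax attention weights with temperature σ² coincide exactly with normalized Gaussian-kernel weights, because the constant term 1/σ² cancels in the normalization:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 5

# Unit-norm query and keys (the layer-normalization assumption).
q = rng.normal(size=d)
q /= np.linalg.norm(q)
K = rng.normal(size=(n, d))
K /= np.linalg.norm(K, axis=1, keepdims=True)

sigma2 = 0.25  # kernel bandwidth; plays the role of the softmax temperature

# Standard softmax attention weights with temperature sigma^2.
logits = K @ q / sigma2
softmax_w = np.exp(logits - logits.max())
softmax_w /= softmax_w.sum()

# Gaussian-kernel weights from Euclidean distances.
d2 = np.sum((K - q) ** 2, axis=1)
kernel_w = np.exp(-d2 / (2 * sigma2))
kernel_w /= kernel_w.sum()

# For unit vectors, q.k = 1 - |q - k|^2 / 2, so the two coincide.
assert np.allclose(softmax_w, kernel_w)
```

The agreement is exact (up to floating point), not approximate: the dot-product and distance views differ only by a per-query constant that normalization removes.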
The authors term this formulation “Projection SDPA”. In self‑attention (q = k = v) each token is projected onto the subspace spanned by all other tokens, effectively flattening the sequence onto a common low‑dimensional surface. In cross‑attention, the projection occurs onto a surface defined by a separate set of context vectors (e.g., source language embeddings in translation). Causal masking simply zeroes out future entries before row‑normalization, making the operation analogous to a finite‑impulse‑response (FIR) filter that uses recent samples only. The paper also discusses how a time‑decaying kernel could turn the operation into an infinite‑impulse‑response (IIR) filter, and how a continuous‑time version would involve double integrals over time‑shifted embeddings.
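The causal-masking/FIR analogy can be made concrete. This minimal self-attention sketch (again with illustrative dimensions and bandwidth, not the paper's settings) builds the distance-based similarity matrix, zeroes future entries, and row-normalizes, so each output row is a weighted average of the current and past tokens only, like a data-dependent FIR filter:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 6, 4

# Unit-norm token embeddings (q = k = v in self-attention).
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

sigma2 = 0.5
# Pairwise squared distances and Gaussian-kernel similarities.
d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
W = np.exp(-d2 / (2 * sigma2))

# Causal mask: zero out future entries, then row-normalize.
W *= np.tril(np.ones((n, n)))
W /= W.sum(axis=1, keepdims=True)

# Each output token is a weighted average of itself and earlier tokens.
Y = W @ X

assert np.allclose(W.sum(axis=1), 1.0)        # rows are valid weights
assert np.allclose(np.triu(W, k=1), 0.0)      # no dependence on the future
```

A time-decaying kernel, as the paper suggests, would replace the hard triangular mask with weights that fall off with temporal distance, moving from the FIR regime toward an IIR-like one.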
To validate the reformulation, the authors replace the standard SDPA in a Transformer with the projection version and train it on a Spanish‑to‑English translation task (118 k sentence pairs, 8 attention heads, sequence length 10, vocabulary 15 k). They fix σ = 0.01 for self‑attention and σ = 0.05 for cross‑attention. Training for 10 epochs on an A100 GPU shows that the projection variant runs faster (≈129 s vs. 174 s, a ~25 % speed gain) but attains slightly lower BLEU‑related accuracy and higher cross‑entropy loss. The authors attribute the modest performance drop to the fact that the Gaussian weighting does not perfectly replicate the softmax normalization dynamics, especially early in training.
The discussion emphasizes several implications: (1) the distance‑based view connects attention to classic signal‑processing concepts such as kernel smoothing and FIR filtering; (2) the lack of learnable parameters in the projection (aside from σ) suggests limited impact on model capacity, but also opens the door to learning σ or other kernel hyper‑parameters; (3) multi‑head attention can be seen as performing several independent projections onto different sub‑manifolds before recombination; (4) extending the method to IIR or continuous‑time formulations could improve modeling of long‑range dependencies but would increase computational complexity. Limitations noted include the O(N²) cost remaining unchanged, the need for per‑row normalization constants, and the narrow experimental scope (only a translation task).
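Implication (3) above can be illustrated with a short sketch. Here each head applies the same parameter-free Gaussian projection to its own slice of the embedding, and the head outputs are concatenated; the per-head feature split and shared bandwidth are simplifying assumptions for illustration, not the paper's exact construction (which would include learned per-head projections):

```python
import numpy as np

def gaussian_projection_head(X, sigma2):
    """One 'projection' head: Gaussian-kernel weighted average of tokens."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-d2 / (2 * sigma2))
    W /= W.sum(axis=1, keepdims=True)
    return W @ X

rng = np.random.default_rng(2)
n, d, heads = 6, 8, 4
X = rng.normal(size=(n, d))

# Split features across heads, project each sub-embedding onto its own
# sub-manifold independently, then concatenate the head outputs.
parts = np.split(X, heads, axis=1)
out = np.concatenate(
    [gaussian_projection_head(p, sigma2=0.5) for p in parts], axis=1
)

assert out.shape == X.shape  # multi-head output matches input shape
```

Note that, as the summary's limitations point out, each head still forms an n × n weight matrix, so the O(N²) cost is unchanged.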
In conclusion, the paper provides a rigorous equivalence between SDPA and a Gaussian‑kernel‑based projection operation, offering a fresh perspective that bridges deep learning and traditional signal‑processing theory. While the reformulation yields modest speed improvements, it does not surpass standard SDPA in accuracy on the tested task. Nonetheless, the theoretical insights open several avenues for future work, such as adaptive kernel learning, IIR extensions, and applications to other domains like time‑series forecasting or continuous‑time modeling.