Poly-attention: a general scheme for higher-order self-attention
The self-attention mechanism, at the heart of the Transformer model, is able to effectively model pairwise interactions between tokens. However, numerous recent works have shown that it is unable to perform basic tasks involving detecting triples of correlated tokens, or compositional tasks where multiple input tokens need to be referenced to generate a result. Some higher-dimensional alternatives to self-attention have been proposed to address this, including higher-order attention and Strassen attention, which can perform some of these polyadic tasks in exchange for slower, superquadratic running times. In this work, we define a vast class of generalizations of self-attention, which we call poly-attention mechanisms. Our mechanisms can incorporate arbitrary higher-order (tensor) computations as well as arbitrary relationship structures between the input tokens, and they include the aforementioned alternatives as special cases. We then systematically study their computational complexity and representational strength, including giving new algorithms and matching complexity-theoretic lower bounds on the time complexity of computing the attention matrix exactly as well as approximately, and tightly determining which polyadic tasks they can each perform. Our results give interesting trade-offs between different desiderata for these mechanisms, including a tight relationship between how expressive a mechanism is, and how large the coefficients in the model may be so that the mechanism can be approximated in almost-linear time. Notably, we give a new attention mechanism which can be computed exactly in quadratic time, and which can perform function composition for any fixed number of functions. Prior mechanisms, even for just composing two functions, could only be computed in superquadratic time, and our new lower bounds show that faster algorithms for them are not possible.
💡 Research Summary
The paper tackles a fundamental limitation of the standard self‑attention mechanism used in Transformers: it can only model pairwise (second‑order) interactions between tokens. Recent work has shown that many tasks—such as detecting triples of correlated tokens (Match3), computing parity, or performing multi‑step function composition—require higher‑order relationships that self‑attention cannot capture. Existing higher‑order alternatives, namely t‑tensor attention and Strassen attention, do indeed enable some of these polyadic tasks, but they incur super‑quadratic runtimes (O(n³) or O(n^t) for t ≥ 3) and are therefore impractical for large‑scale models.
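The Match3 task mentioned above is not defined in this summary; a common formulation from the literature on attention's limitations asks whether any three tokens sum to zero modulo some M. The brute-force reference below is our own illustrative sketch under that assumption (function and variable names are ours, not the paper's):

```python
import itertools

def match3(tokens, M):
    """Brute-force Match3 (illustrative): is there a triple of distinct
    positions i, j, k with tokens[i] + tokens[j] + tokens[k] == 0 (mod M)?
    O(n^3) by exhaustive search; the point of higher-order attention is to
    detect such triples within a single attention layer."""
    return any(
        (tokens[i] + tokens[j] + tokens[k]) % M == 0
        for i, j, k in itertools.combinations(range(len(tokens)), 3)
    )
```

Because the weight of a single attention head depends on one (query, key) pair at a time, no pairwise score can certify a three-way constraint like this, which is the motivation for the higher-order mechanisms discussed next.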
To address this, the authors introduce poly‑attention, a unifying framework that can express arbitrary higher‑order tensor computations and arbitrary token‑relationship structures. The core idea is to define an attention polynomial h(x₁,…,x_t) that is multilinear, has binary coefficients (0 or 1), and whose monomials correspond to inner products of order between 2 and a fixed degree k. Given query, key, and value matrices derived from the input X∈ℝ^{n×d} via learned linear maps, the poly‑attention output for token i is
Att_h(i) = Σ_{ℓ₂,…,ℓ_t} exp( h(Q^{(1)}_i, Q^{(2)}_{ℓ₂}, …, Q^{(t)}_{ℓ_t}) / d ) · V^{(2)}_{ℓ₂} ⊙ … ⊙ V^{(t)}_{ℓ_t},
where ⊙ denotes element‑wise product. This formulation subsumes:
- Standard self‑attention (h(x₁,x₂)=x₁·x₂);
- t‑tensor attention (h(x₁,…,x_t)=∏_j x_j);
- Strassen attention (h(x₁,x₂,x₃)=x₁·x₂ + x₂·x₃ + x₃·x₁).
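A naive reference implementation of the displayed (unnormalized) formula for t = 3 can make the definition concrete. This is our own NumPy sketch, with the cubic loop written for clarity rather than efficiency; all names are illustrative:

```python
import numpy as np

def poly_attention(h, Qs, Vs, d):
    """Naive poly-attention per the displayed formula, for t = 3.
    Qs = (Q1, Q2, Q3) and Vs = (V2, V3) are n x d arrays; h maps three
    d-vectors to a scalar. O(n^3 * d) time; unnormalized (no softmax
    denominator), matching the formula as displayed above."""
    Q1, Q2, Q3 = Qs
    V2, V3 = Vs
    n = Q1.shape[0]
    out = np.zeros_like(V2)
    for i in range(n):
        acc = np.zeros(V2.shape[1])
        for l2 in range(n):
            for l3 in range(n):
                w = np.exp(h(Q1[i], Q2[l2], Q3[l3]) / d)
                acc += w * (V2[l2] * V3[l3])   # elementwise product
        out[i] = acc
    return out

# The three special cases listed above, as attention polynomials:
pairwise = lambda x1, x2, x3: x1 @ x2                      # self-attention (x3 unused)
tensor   = lambda x1, x2, x3: np.sum(x1 * x2 * x3)         # 3-tensor attention
strassen = lambda x1, x2, x3: x1 @ x2 + x2 @ x3 + x3 @ x1  # Strassen attention
```

Swapping in the different polynomials changes only the scoring function h; the surrounding sum over (ℓ₂, ℓ₃) is identical, which is what makes the unified treatment possible.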
The paper provides a thorough complexity analysis for both exact and approximate computation of poly‑attention:
Exact algorithms:
- Standard self‑attention takes Θ(n²) time, and this is optimal.
- t‑tensor attention (t ≥ 3) requires Θ(n^t) time; no faster algorithm exists under standard fine‑grained complexity assumptions.
- Strassen attention can be computed in Θ(n^ω) time (ω≈2.37, the matrix‑multiplication exponent), and the authors prove a matching lower bound, showing that sub‑ω algorithms would violate conjectured hardness of problems such as APSP.
- General poly‑attention’s exact cost depends on the degree k and the number of monomials s; in the worst case it is Θ(n^t) or Θ(n^k).
Approximation algorithms:
When the absolute values of all query, key, and value entries are bounded by B, entry‑wise ε‑approximation becomes feasible in near‑linear time n^{1+o(1)}. The bound on B is:
- B = o(√log n) for self‑attention and Strassen attention,
- B = o((log n)^{1/k}) for a degree‑k polynomial.

These bounds are shown to be tight: once B exceeds them, no near‑linear‑time approximation is possible under fine‑grained hardness assumptions, and one must fall back on the super‑quadratic exact algorithms. The authors extend prior techniques (which handled only quadratic self‑attention) to the more intricate structure of Strassen and general poly‑attention.
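The prior technique alluded to here is the polynomial method: when all entries are bounded by B, the scores q·k/d are small, exp is well approximated by a low-degree Taylor polynomial, and a degree-p polynomial of an inner product admits a low-rank factorization, so attention can be applied without ever forming the n × n matrix. The sketch below is our own minimal illustration of this idea for standard self-attention, not the paper's exact algorithm:

```python
import numpy as np
from math import factorial

def taylor_features(Q, d, p):
    """Feature map phi with <phi(q), phi(k)> = sum_{m<=p} (q.k/d)^m / m!,
    i.e. the degree-p Taylor truncation of exp(q.k/d). Uses the identity
    (q.k)^m = <q tensor m, k tensor m>; rank is sum_m d^m, constant in n."""
    n = Q.shape[0]
    feats = []
    for m in range(p + 1):
        F = np.ones((n, 1))
        for _ in range(m):              # build flattened q^{tensor m} rows
            F = np.einsum('ni,nj->nij', F, Q).reshape(n, -1)
        feats.append(F / np.sqrt(float(d) ** m * factorial(m)))
    return np.hstack(feats)

def approx_attention(Q, K, V, d, p):
    """Unnormalized attention with exp replaced by its Taylor truncation.
    Associativity avoids the n x n matrix: Phi_Q @ (Phi_K.T @ V)."""
    PQ, PK = taylor_features(Q, d, p), taylor_features(K, d, p)
    return PQ @ (PK.T @ V)
```

With bounded entries the truncation error decays like B^{2(p+1)}/(p+1)!, which is why the entry bound B governs how small a degree, and hence how small a rank, suffices for an ε-approximation.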
A major contribution is the introduction of tree‑attention, a subclass of poly‑attention where the underlying attention polynomial has degree 2 and its monomials form a tree‑like dependency graph. Tree‑attention enjoys several remarkable properties:
- Quadratic exact computation: despite being higher‑order, it can be evaluated in O(n²) time, matching the runtime of standard self‑attention.
- Arbitrary‑fold function composition: for any constant r, tree‑attention can correctly implement r‑fold composition of functions (e.g., person → city → country → continent). This surpasses both 3‑tensor and Strassen attention, which can only handle 2‑fold composition and cannot be extended to three or more folds.
- Near‑linear approximation: under the same B‑bounds as above, tree‑attention can be approximated in n^{1+o(1)} time, achieving the optimal trade‑off between expressiveness and speed.
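For intuition on the quadratic runtime, consider the hypothetical path polynomial h(x₁, x₂, x₃) = x₁·x₂ + x₂·x₃, whose two monomials form a tree (a path) on the variables; this concrete instance is our choice for illustration. Since exp factorizes along the tree edges, the inner sum over ℓ₃ can be absorbed first, collapsing everything into two n × n matrix products:

```python
import numpy as np

def tree_attention_path(Q1, Q2, Q3, V2, V3, d):
    """Tree-attention for the path polynomial h = x1.x2 + x2.x3 (sketch,
    unnormalized). exp(h/d) factorizes edge by edge, so summing out l3
    first gives two plain matrix products: O(n^2 * d) total, matching
    the cost of standard self-attention despite being higher-order."""
    A = np.exp(Q1 @ Q2.T / d)          # edge (1,2): A[i, l2]
    B = np.exp(Q2 @ Q3.T / d)          # edge (2,3): B[l2, l3]
    W = B @ V3                          # W[l2] = sum_l3 B[l2,l3] V3[l3]
    return A @ (V2 * W)                 # sum_l2 A[i,l2] (V2[l2] * W[l2])
```

The same dynamic-programming idea extends to any tree: leaves are summed out first and their contributions propagate toward the root one matrix product per edge, which is the structural reason a cycle (as in Strassen attention) breaks the quadratic bound while a tree does not.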
The authors validate these theoretical claims with synthetic experiments. They construct benchmarks for Match3 detection, parity, and multi‑step function composition, comparing self‑attention, tensor attention, Strassen attention, and tree‑attention. Results show that tree‑attention attains dramatically higher accuracy on polyadic tasks while retaining quadratic runtime. Approximation experiments confirm that the error remains bounded as predicted when B respects the derived thresholds.
In summary, the paper establishes a general theory of higher‑order attention:
- It formalizes poly‑attention as a polynomial‑based extension of self‑attention.
- It rigorously characterizes exact and approximate computational limits, providing matching upper and lower bounds.
- It identifies tree‑attention as the “best of all worlds” – more expressive than self‑attention, capable of arbitrary constant‑depth function composition, yet computationally as cheap as the original quadratic algorithm.
These results open a clear pathway for integrating higher‑order relational reasoning into large language models and other sequence‑processing architectures without incurring prohibitive computational costs. Future work may explore hardware‑friendly implementations, learning strategies for the attention polynomial itself, and extensions to richer polynomial structures (e.g., cyclic or dense graphs) while preserving the favorable complexity‑expressivity trade‑offs demonstrated here.