PaTH Attention: Position Encoding via Accumulating Householder Transformations
The attention mechanism is a core primitive in modern large language models (LLMs) and AI more broadly. Since attention by itself is permutation-invariant, position encoding is essential for modeling structured domains such as language. Rotary position encoding (RoPE) has emerged as the de facto standard approach for position encoding and is part of many modern LLMs. However, in RoPE the key/query transformation between two elements in a sequence is only a function of their relative position and otherwise independent of the actual input. This limits the expressivity of RoPE-based transformers. This paper describes PaTH, a flexible data-dependent position encoding scheme based on accumulated products of Householder(-like) transformations, where each transformation is data-dependent, i.e., a function of the input. We derive an efficient parallel algorithm for training by exploiting a compact representation of products of Householder matrices, and implement a FlashAttention-style blockwise algorithm. Across both targeted synthetic benchmarks and moderate-scale real-world language modeling experiments, we find that PaTH improves upon RoPE and other recent baselines. Finally, we show that we can convert pretrained RoPE transformers into PaTH with continued pretraining.
💡 Research Summary
The paper introduces PaTH (Position Encoding via Accumulating Householder Transformations), a novel, data-dependent positional encoding for transformer attention. Traditional rotary position embeddings (RoPE) encode relative positions by applying a fixed rotation matrix that depends only on the distance between tokens, which limits expressivity and keeps RoPE-based transformers within the TC⁰ complexity class. PaTH replaces this static transformation with a sequence of Householder-like matrices $H_t = I - \beta_t w_t w_t^\top$, where $w_t$ and $\beta_t$ are functions of the current token representation $x_t$. The attention logit between query $i$ and key $j$ is computed as $\exp(k_j^\top (\prod_{t=j+1}^{i} H_t)\, q_i)$, allowing the positional transformation to adapt to the actual input content.
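As a deliberately naive illustration of this logit, the sketch below materializes each $H_t$ and multiplies them out explicitly (quadratic cost, for clarity only). The random `W` and `beta` stand in for the learned, data-dependent functions of $x_t$, and the left-to-right ordering of the product is one possible convention, not necessarily the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 8, 6

# Stand-ins for the data-dependent parameters w_t = f(x_t), beta_t = g(x_t).
W = rng.standard_normal((L, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)   # unit direction vectors w_t
beta = rng.uniform(0.1, 1.9, size=L)            # scalars beta_t

def householder(t):
    """Householder-like factor H_t = I - beta_t * w_t w_t^T."""
    w = W[t]
    return np.eye(d) - beta[t] * np.outer(w, w)

def path_logit(q, k, i, j):
    """Logit k_j^T (prod_{t=j+1}^{i} H_t) q_i, accumulated naively."""
    P = np.eye(d)
    for t in range(j + 1, i + 1):   # empty product (P = I) when i == j
        P = P @ householder(t)
    return k @ P @ q

q = rng.standard_normal(d)
k = rng.standard_normal(d)
score = np.exp(path_logit(q, k, i=5, j=1))
```

For adjacent-or-equal positions the product is empty and the logit reduces to the plain dot product $k_j^\top q_i$, which is the un-encoded attention score.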
To make this computation tractable, the authors employ the UT transform, which expresses the product of many Householder-like matrices compactly as $I - W^\top T^{-1} W$. Here $W$ stacks the $w_t$ vectors and $T$ is a triangular matrix built from the scalars $\beta_t$ and the inner products of the $w_t$. This representation enables $O(Ld^2)$ preprocessing and $O(B^2 d + B d^2)$ cost per block when combined with a FlashAttention-style blockwise algorithm. By masking $W$ and $T^{-1}$ appropriately, the authors can extract the product over any sub-interval of the sequence, as required for each query–key pair.
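The compact form can be checked numerically. The snippet below uses the standard UT-transform construction, in which $T^{-1}$ is the strictly upper-triangular part of $W W^\top$ plus $\mathrm{diag}(1/\beta_t)$; this is a sketch of the identity itself, not of the paper's blockwise implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
d, L = 8, 6

W = rng.standard_normal((L, d))                 # rows are the w_t vectors
W /= np.linalg.norm(W, axis=1, keepdims=True)
beta = rng.uniform(0.1, 1.9, size=L)

# Naive left-to-right product H_1 H_2 ... H_L of the Householder-like factors.
P = np.eye(d)
for t in range(L):
    P = P @ (np.eye(d) - beta[t] * np.outer(W[t], W[t]))

# UT transform: with T_inv = striu(W W^T) + diag(1/beta), the product
# collapses to I - W^T T_inv^{-1} W.
T_inv = np.triu(W @ W.T, k=1) + np.diag(1.0 / beta)
P_compact = np.eye(d) - W.T @ np.linalg.solve(T_inv, W)
```

Because `T_inv` is triangular with nonzero diagonal, the solve is cheap, and restricting to a slice of rows of `W` (and the corresponding submatrix of `T_inv`) yields the product over just that sub-interval, which is what the blockwise algorithm exploits.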