Transformers rely on positional encoding to compensate for the inherent permutation invariance of self-attention. Traditional approaches use absolute sinusoidal embeddings or learned positional vectors, while more recent methods emphasize relative encodings to better capture translation equivariance. In this work, we propose RollPE, a novel positional encoding mechanism based on traveling waves, implemented by applying a circular roll operation to the query and key tensors in self-attention. This operation induces a relative shift in phase across positions, allowing the model to compute attention as a function of positional differences rather than absolute indices. We show that this simple method significantly outperforms traditional absolute positional embeddings and is comparable to RoPE. We derive a continuous case of RollPE which implicitly imposes a topographic structure on the query and key space. We further derive a mathematical equivalence of RollPE to a particular configuration of RoPE. Viewing RollPE through the lens of traveling waves may allow us to simplify RoPE and relate it to processes of information flow in the brain.
The transformer architecture has achieved state-of-the-art performance across natural language processing, vision, and multimodal tasks. A central feature enabling this success is the self-attention mechanism, which models pairwise dependencies between tokens. However, because attention is inherently permutation-invariant, positional encoding is required to inject sequence order information.
Absolute encodings, such as fixed sinusoidal embeddings (Vaswani et al., 2017) or learned vectors (Dosovitskiy et al., 2020), assign each token a distinct position representation. While effective, these encodings lack relative awareness: the model must learn to infer positional differences implicitly. This motivated the widely praised Rotary Positional Embeddings (RoPE) (Su et al., 2024), which efficiently implement a relative positional encoding, or shift-equivariance, by rotating the query and key vectors. However, recent evidence suggests that strict equivariance is not a necessity for RoPE, bringing into question what makes a good positional encoding (Ostmeier et al., 2024; van de Geijn et al., 2025; Barbero et al., 2024).
We introduce rolling positional encodings (RollPE), a deliberately simple approach in which position is encoded by rolling the query and key tensors before computing their dot product. This operation is rather trivially a relative positional encoding, but it has a compelling interpretation as a traveling wave. These positional encodings induce a topographic arrangement in the query and key space, as detailed by Keller and Welling (2021), and, with additional smoothness constraints over this topographic structure, lead to the spatial loss of TDANNs (Margalit et al., 2024), which have been shown to reproduce hallmark aspects of the behavior of the ventral stream in the primate visual system (Lee et al., 2020). Furthermore, within neuroscience there is recent evidence that traveling waves play a significant role in the formation of long-term memory (Muller et al., 2018) and that humans encode visual events as multiplexed traveling waves (King and Wyart, 2021). While we propose RollPE as a toy model, it links directly to these recent trends in theoretical neuroscience.
In our initial experiments, these simple embeddings significantly outperform classic encodings and, through Multiplexed RollPE, we show once again, as suggested by van de Geijn et al. (2025), that this behavior does not appear to be due to their relative inductive bias. These embeddings behave similarly to RoPE; in fact, we can mathematically derive RollPE as a form of RoPE. We hypothesize that many of the properties and interpretations of RollPE apply transitively to RoPE, and thus that RoPE's success may derive from the implicit topographic structure and traveling waves that it and RollPE impose. Through this extended abstract, we motivate our ongoing work on viewing RoPE through the lens of traveling waves.
Let a sequence of hidden states be represented as
$$X = (x_1, x_2, \ldots, x_T), \qquad x_t \in \mathbb{R}^{n}, \tag{1}$$
from which per-token queries $q \in \mathbb{R}^{n}$ and keys $k \in \mathbb{R}^{n}$ are obtained through the usual learned projections.
We define the circular roll operator $\mathrm{Roll}_p : \mathbb{R}^n \to \mathbb{R}^n$, which maps the components of a vector $q$ via $q_i \mapsto q_{i'}$, where $i' = (i + p) \bmod n$. We can represent this in matrix form by letting the permutation matrix $S \in \mathbb{R}^{n \times n}$ be the 1-step roll matrix, so that
$$\mathrm{Roll}_p(q) = S^{p} q.$$
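To make the operator concrete, the following is a minimal NumPy sketch, not taken from the paper (the names `roll` and `S` are our own), showing $\mathrm{Roll}_p$ both as a component shift and as a power of the 1-step roll matrix.

```python
# Minimal sketch of the circular roll operator and its matrix form (illustrative only).
import numpy as np

n = 8                                 # dimension of a query/key vector
q = np.random.randn(n)

# Roll_p as a component permutation: component i moves to index (i + p) mod n.
def roll(q: np.ndarray, p: int) -> np.ndarray:
    return np.roll(q, p)

# The 1-step roll matrix S: a cyclic permutation matrix, so Roll_p(q) = S^p q.
S = np.roll(np.eye(n), 1, axis=0)

p = 3
assert np.allclose(roll(q, p), np.linalg.matrix_power(S, p) @ q)
```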
As a positional encoding, we apply this to both the queries and the keys when calculating the attention score, as done in Su et al. (2024):
$$a_{ij} = \mathrm{Roll}_i(q_i)^{\top}\, \mathrm{Roll}_j(k_j) = \left(S^{i} q_i\right)^{\top} S^{j} k_j.$$
Note that the attention score between tokens $i$ and $j$ now depends only on the relative displacement induced by the roll (proof in Appendix B); that is, RollPE is equivariant to position. One can extend this simple rolling positional encoding by multiplexing, i.e., representing queries as the superposition of multiple vectors rolled at different shift speeds. This gives Multiplexed RollPE (see Appendix C for details). While this yields better results, it requires more parameters and breaks equivariance.
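As an illustration of the relative-displacement property, here is a hedged NumPy sketch (the function name `rollpe_score` is ours, not the paper's) that computes the rolled dot-product score and checks numerically that it depends only on $j - i$.

```python
# Sketch of a RollPE-style attention score; verifies dependence on relative displacement.
import numpy as np

n = 16                                # dimension over which the roll acts
rng = np.random.default_rng(0)

def rollpe_score(q: np.ndarray, k: np.ndarray, i: int, j: int) -> float:
    """Score between a query at position i and a key at position j,
    with each vector rolled by its own position before the dot product."""
    return float(np.roll(q, i) @ np.roll(k, j))

q, k = rng.standard_normal(n), rng.standard_normal(n)

# Shifting both positions by the same offset leaves the score unchanged,
# i.e. the score is a function of (j - i) only.
s1 = rollpe_score(q, k, i=2, j=5)
s2 = rollpe_score(q, k, i=7, j=10)    # same displacement j - i = 3
assert np.isclose(s1, s2)
```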
Table 1: Accuracy on CIFAR100.

From the results in Table 1, we see that RollPE outperforms classic ViT positional encodings and performs similarly to RoPE. We also observe that Multiplexed RollPE outperforms RollPE, suggesting that strict shift-equivariance may not be necessary, as also argued in van de Geijn et al. (2025).
One obvious flaw in RollPE is the need for discrete positions. While this is often reasonable in vision and language, it limits applications to continuous data such as point clouds. The second flaw is that RollPE is inherently periodic with period $n$. While this is not a problem for low-context domains such as vision, where images are typically represented with on the order of 16 patches in each direction (Dosovitskiy et al., 2020), it becomes very limiting for language, where desirable context lengths are on the order of millions of tokens. Both problems can be addressed by generalizing the shift operator $S$ using its Lie algebra.
While Roll is only defined for integer shifts, we can extend it to arbitrary real-valued shifts $p$ by writing
$$S^{p} = \exp\!\left(p \log S\right),$$
where $\exp$ and $\log$ denote the matrix exponential and logarithm. Since $S$ is a permutation (and, in particular, a circulant) matrix, it is diagonalized by the discrete Fourier basis; in the corresponding real basis, $S^{p} = \exp(p \log S)$ acts as a collection of planar rotations whose angles grow linearly with $p$, which is precisely the structure of a particular configuration of RoPE.
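As a sketch of this continuous generalization (again NumPy/SciPy, with variable names of our own choosing rather than the paper's), one can compute the generator $\log S$ once and apply real-valued shifts via the matrix exponential. We assume an odd $n$ here so that the principal logarithm of the cyclic shift is real-valued.

```python
# Sketch of the continuous generalization S^p = exp(p log S). We use an odd n so that
# the principal matrix logarithm of the cyclic shift S is real (for even n, S has
# eigenvalue -1, which lies on the negative real axis, and the principal log fails).
import numpy as np
from scipy.linalg import expm, logm

n = 9
S = np.roll(np.eye(n), 1, axis=0)           # 1-step roll (cyclic permutation) matrix
L = logm(S).real                            # generator; .real drops numerical residue

# Integer shifts recover the discrete roll (up to numerical error) ...
assert np.allclose(expm(3 * L), np.linalg.matrix_power(S, 3), atol=1e-8)

# ... while real-valued shifts interpolate between integer rolls,
# lifting the restriction to discrete positions.
q = np.random.randn(n)
q_half_rolled = expm(0.5 * L) @ q           # a "half-step" roll of q
```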