
๐Ÿ“ Original Info

  • Title:
  • ArXiv ID: 2512.12167
  • Date:
  • Authors: Unknown

๐Ÿ“ Abstract

So far, expensive finetuning beyond the pretraining sequence length has been a requirement for effectively extending the context of language models (LMs). In this work, we break this key bottleneck by Dropping the Positional Embeddings of LMs after training (DroPE). Our simple method is motivated by three key theoretical and empirical observations. First, positional embeddings (PEs) serve a crucial role during pretraining, providing an important inductive bias that significantly facilitates convergence. Second, over-reliance on this explicit positional information is also precisely what prevents test-time generalization to sequences of unseen length, even when using popular PE-scaling methods. Third, positional embeddings are not an inherent requirement of effective language modeling and can be safely removed after pretraining following a short recalibration phase. Empirically, DroPE yields seamless zero-shot context extension without any long-context finetuning, quickly adapting pretrained LMs without compromising their capabilities in the original training context. Our findings hold across different models and dataset sizes, far outperforming previous specialized architectures and established rotary positional embedding scaling methods.

Full Content

Transformers established themselves as the predominant architecture for training foundation models at unprecedented scale in language and beyond (Brown et al., 2020; Dosovitskiy et al., 2020; Jumper et al., 2021; Team et al., 2023). The defining feature of transformers is abandoning explicit architectural biases such as convolutions and recurrences in favor of highly general self-attention layers (Vaswani et al., 2017), while injecting positional information about the sequence through positional embeddings (PEs) and causal masking. However, despite significant efforts to scale attention to long sequences on modern hardware (Dao et al., 2022; Liu and Abbeel, 2023; Liu et al., 2023a), this powerful layer is inherently bottlenecked by quadratic token-to-token operations, which makes pretraining at long sequence lengths computationally intractable at scale. As a result, enabling models to use contexts beyond their pretraining length without additional long-context fine-tuning (i.e., "zero-shot context extension") has emerged as a central challenge for the next generation of foundation models (Chi et al., 2023; Press et al., 2021).

When inference sequence lengths exceed the pretraining context, the performance of modern transformer-based LMs degrades sharply. This is directly caused by their use of explicit PEs such as the ubiquitous rotary positional embeddings (RoPE) (Su et al., 2024), which become out-of-distribution at unseen sequence lengths. To address this issue, careful scaling techniques that adapt RoPE frequencies on longer sequences were introduced (bloc97, 2023; Chen et al., 2023; Ding et al., 2024; Peng et al., 2023). However, despite their popularity, these methods still rely on an expensive, long-context finetuning phase to meaningfully use tokens beyond the original sequence length, failing to generalize out of the box (Lu et al., 2024a). Beyond RoPE transformers, alternative architectures and positional embedding schemes have shown early promise in reducing costs by attenuating the underlying quadratic computational burden (Choromanski et al., 2020; Wang et al., 2020; Xiong et al., 2021; Zaheer et al., 2020) or maintaining better out-of-context generalization (Kazemnejad et al., 2023; Puvvada et al., 2025; Yang et al., 2025b). Yet, these parallel efforts are still far from challenging established pipelines, introducing notable performance and stability trade-offs that prevent wide adoption.

In this work, we challenge the conventional role of RoPE in language modeling, and propose to overcome this inherent trade-off by Dropping the Positional Embeddings (DroPE) of LMs after pretraining. Our method is based on three key theoretical and empirical observations. First, explicit positional embeddings significantly facilitate pretraining convergence by baking in an important inductive bias that is difficult to recover from data alone. Second, over-reliance on positional embeddings is precisely what prevents test-time generalization to sequences of unseen length, with RoPE-scaling context extension methods focusing on recent tokens instead of ones deeper in the context to retain perplexity. Third, explicit PE is not an inherent requirement for effective language modeling and can be removed after pretraining, following a short recalibration phase which is performed at the original context length.

Empirically, DroPE models generalize zero-shot to sequences far beyond their training context, marking a sharp contrast to traditional positional scaling techniques. Moreover, we show that adapting RoPE models with DroPE does not compromise their original in-context capabilities, preserving both perplexity and downstream task performance. Our findings hold across LMs of different architectures and sizes up to 7B parameters pretrained on trillions of tokens, establishing a new standard for developing robust and scalable long-context transformers.

Contributions. In summary, our main contributions are as follows:

(1) In Section 3, we provide empirical and theoretical analysis of the role of positional embeddings in LM training, showing their importance in significantly accelerating convergence.

(2) In Section 4, we discuss why RoPE-scaling methods fail to reliably attend across far-away tokens when evaluated zero-shot on long sequences, showing that these approaches inevitably shift attention weights, hindering the model's test-time behavior.

(3) In Section 5, we introduce DroPE, a new method that challenges the conventional role of positional embeddings in transformers, motivated by our empirical and theoretical analyses of their role as a transient but critical training inductive bias.

(4) We demonstrate that DroPE enables zero-shot generalization of pretrained RoPE transformers far beyond their original sequence length, without any long-context finetuning. DroPE can be incorporated at no extra cost into established training pipelines, and can be used to inexpensively empower arbitrary pretrained LLMs in the wild.

We share our code to facilitate future work and extensions toward developing foundation models capable of handling orders-of-magnitude longer contexts.

Self-attention. Given hidden states $h_1, \dots, h_T$, each attention head computes queries, keys, and values $q_i = W_Q h_i$, $k_i = W_K h_i$, $v_i = W_V h_i$, and the causal attention output

$\alpha_{ij} = \mathrm{softmax}_{j \le i}\big( s_{ij} \big), \quad s_{ij} = \frac{q_i^\top k_j}{\sqrt{d_k}}, \quad \mathrm{attn}(x)_i = \sum_{j \le i} \alpha_{ij}\, v_j, \qquad (1)$

where $d_k$ is the head dimension. A multi-head attention block computes multiple attention outputs $\mathrm{attn}^{(1)}(x)_i, \dots, \mathrm{attn}^{(H)}(x)_i$, concatenates them, and projects to the model dimension: $W_O\,[\mathrm{attn}^{(1)}(x)_i, \dots, \mathrm{attn}^{(H)}(x)_i]$.

Language and positional embeddings. State-of-the-art autoregressive transformer LMs use information about sequence positions provided both implicitly via causal masking of the attention scores¹, and explicitly with positional embeddings. In particular, the modern literature has settled on the Rotary PE (RoPE) scheme (Su et al., 2024), providing relative positional information to each attention head by rotating $q_i$ and $k_j$ in 2D chunks before the inner product in Equation 1:

$s_{ij} = \frac{(R_i\, q_i)^\top (R_j\, k_j)}{\sqrt{d_k}} = \frac{q_i^\top R_{j-i}\, k_j}{\sqrt{d_k}}, \qquad R_\Delta = \mathrm{diag}\big(R(\Delta\,\omega_1), \dots, R(\Delta\,\omega_{d_k/2})\big).$

Here, each $R(\omega_m) \in \mathbb{R}^{2 \times 2}$ is a planar rotation of angle $\omega_m = b^{-2(m-1)/d_k}$ acting on the $(2m, 2m+1)$ subspace of $q_i$ and $k_j$. The base $b$ is commonly taken to be 10,000.
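To make the rotation concrete, the following minimal NumPy sketch applies RoPE to a single query/key vector and checks the relative-position property of the resulting scores; the function and variable names are ours, not taken from any released implementation.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Rotate a d_k-dimensional vector to encode absolute position `pos`.
    Each 2D chunk (2m, 2m+1) is rotated by the angle pos * omega_m, with
    omega_m = base ** (-2*m/d_k) for m = 0, ..., d_k/2 - 1 (0-indexed)."""
    d_k = x.shape[-1]
    omega = base ** (-2.0 * np.arange(d_k // 2) / d_k)
    ang = pos * omega
    cos, sin = np.cos(ang), np.sin(ang)
    out = np.empty_like(x)
    out[0::2] = cos * x[0::2] - sin * x[1::2]
    out[1::2] = sin * x[0::2] + cos * x[1::2]
    return out

# The RoPE attention score between positions i and j depends only on j - i:
d_k = 64
rng = np.random.default_rng(0)
q, k = rng.normal(size=d_k), rng.normal(size=d_k)
s_near = rope_rotate(q, 5) @ rope_rotate(k, 2) / np.sqrt(d_k)
s_far = rope_rotate(q, 105) @ rope_rotate(k, 102) / np.sqrt(d_k)
assert np.isclose(s_near, s_far)   # same relative distance, same score
```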

where ๐œ… ๐‘š โˆˆ [0, 1] interpolates between 0 and 1 as the base frequency ๐œ” ๐‘š grows (see Appendix A). These methods, referred to as RoPE-scaling, still require additional finetuning on long sequences, and don’t generalize to long-context downstream tasks out of the box (Lu et al., 2024b).

NoPE transformers. In a parallel line of work, there have been efforts to train transformers without PEs, commonly referred to as NoPE architectures (Haviv et al., 2022; Kazemnejad et al., 2023), to avoid the need for rescaling RoPE frequencies. While NoPE was shown to be a viable LM architecture, it has failed to gain traction due to degraded performance (Haviv et al., 2022; Yang et al., 2025b) compared to RoPE architectures. For an in-depth introduction to the above concepts, see Appendix A. While NoPE transformers were shown to be expressive enough for effective sequence modeling (Haviv et al., 2022; Kazemnejad et al., 2023), we find that they consistently underperform RoPE architectures throughout our experiments. As illustrated in Figure 3, NoPE transformers maintain visibly worse perplexity throughout training. These empirical results are consistent with past literature (Haviv et al., 2022; Yang et al., 2025b), yet the reasons why positional embeddings are key for effective language model training have never been fully understood.

From a purely mechanistic perspective, even without explicit positional embeddings, NoPE transformers can exploit the causal mask to encode positional information, maintaining the same expressivity as their RoPE counterparts (Haviv et al., 2022; Kazemnejad et al., 2023). Specifically, Kazemnejad et al. (2023) prove that the first attention layer in a NoPE transformer can perfectly reconstruct sequence positions, and subsequent layers can emulate the effects of relative or absolute positional embeddings.

As detailed in Section 3.1, rather than looking at theoretical expressivity, we investigate this empirical performance discrepancy from an optimization perspective, providing a theoretical analysis of the positional bias of NoPE transformers during training. The theoretical and empirical analysis in this section can be summarized in the following observation.

Observation 1. Positional information and attention non-uniformity, which are crucial for sequence modeling, develop at a bounded rate in NoPE transformers. In contrast, explicit PE methods, such as RoPE, provide a strong positional bias already at initialization.

At a high level, our analysis focuses on the rate at which NoPE and RoPE transformers can develop positional bias in their self-attention heads, which captures their non-uniformity. We quantify attention positional bias as a linear functional on the attention map:

Definition 3.1 (Attention positional bias). Given centered positional weights $p_{ij} \in \mathbb{R}$ with $\sum_{j \le i} p_{ij} = 0$, the positional bias of the attention weights $\alpha_{ij}$ is $\mathcal{A}_p(\alpha) = \sum_{i} \sum_{j \le i} p_{ij}\, \alpha_{ij}$.

Attention heads with a strong positional bias would maximize the average value of $\mathcal{A}_p$ across input sequences for some weights $p$. For example, a "diagonal" attention head, focusing mass on the current token, is exactly the maximizer of $\mathcal{A}_p$ with $p_{ij}$ equal to 1 on the diagonal and $-\frac{1}{i-1}$ otherwise.

Figure 4 | RoPE transformers have higher positional bias gradients at initialization. We compare the average gradient norm of $\mathcal{A}_p$ across layers for RoPE and NoPE transformers. In 4a we plot the gradient norms of the positional bias towards a diagonal head, and in 4b towards an off-diagonal, previous-token head. In both cases, the gradient norm is consistently higher for RoPE across layers, meaning that RoPE heads can learn these patterns faster.
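As a small illustration of Definition 3.1, the sketch below evaluates the positional-bias functional for the diagonal weights described above on two toy causal attention maps; the helper names are ours and the example is not taken from the paper's code.

```python
import numpy as np

def diagonal_bias_weights(T):
    """Centered weights favoring a diagonal head: for row i >= 1 (0-indexed),
    1 on the diagonal and -1/i on each earlier position (the paper's -1/(i-1)
    in 1-indexed notation); row 0 is all zeros so every causal row sums to 0."""
    p = np.zeros((T, T))
    for i in range(1, T):
        p[i, i] = 1.0
        p[i, :i] = -1.0 / i
    return p

def positional_bias(alpha, p):
    """A_p(alpha) = sum_i sum_{j<=i} p_ij * alpha_ij over the causal (lower-triangular) part."""
    mask = np.tril(np.ones_like(alpha))
    return float(np.sum(mask * p * alpha))

T = 6
uniform = np.tril(np.ones((T, T))) / np.arange(1, T + 1)[:, None]  # alpha_ij = 1/i
diagonal = np.eye(T)                                               # attends only to the current token
p = diagonal_bias_weights(T)
print(positional_bias(uniform, p))    # ~0: uniform attention carries no positional bias
print(positional_bias(diagonal, p))   # maximal value of A_p for these weights
```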

To validate the theory behind Observation 1, we empirically compare the gradients of the attention positional bias functional in attention heads of RoPE and NoPE transformers. Specifically, we measure the average gradient norm at initialization in the direction of two common language modeling patterns: diagonal attention heads, placing mass on the current token, and off-diagonal heads, capturing immediate previous token context. As illustrated in Figure 4, the gradient magnitudes of NoPE transformers are far lower than those of RoPE transformers, with the gap between the two growing in deeper layers. This means that diagonal and off-diagonal heads are slower to develop under NoPE, reflecting its difficulty in recovering positional information. In the next section, we theoretically analyze the causes of this gradient norm gap.

We detail our findings, summarized in Observation 1, with a series of formal results, bounding the rate at which positional bias can develop early in training. We provide full proofs and an extended analysis of these results in Appendix B.

Proposition 3.2. Let $\mathcal{M}$ be a NoPE transformer and let $x_1 = \dots = x_T$ be a constant input sequence. Then: (1) causal attention is uniform, $\alpha_{ij} = 1/i$; (2) query and key gradients vanish, $\partial \mathcal{L} / \partial W_Q = \partial \mathcal{L} / \partial W_K = 0$; (3) for all heads and any positional weights $p$, $\mathcal{A}_p = 0$ and $\nabla_\theta \mathcal{A}_p = 0$; and (4) the output is constant, $\mathcal{M}(x)_1 = \dots = \mathcal{M}(x)_T$.

The explicit positional information injected into attention heads in RoPE transformers circumvents this issue, enabling non-zero $\mathcal{A}_p$ gradients even on constant sequences.

Proposition 3.3. For a non-trivial RoPE attention head, even if the input sequence is constant, there are positional weights $p$ for which $\mathcal{A}_p > 0$ and $\|\nabla_\theta \mathcal{A}_p\| > 0$.
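The contrast between Propositions 3.2 and 3.3 is easy to check numerically: on a constant input sequence, a NoPE head produces exactly uniform causal attention, while a RoPE head does not. The toy verification below (our own code, not the paper's experiments) makes this concrete.

```python
import numpy as np

def causal_softmax(scores):
    """Row-wise softmax over the causal prefix j <= i."""
    T = scores.shape[0]
    alpha = np.zeros_like(scores)
    for i in range(T):
        e = np.exp(scores[i, : i + 1] - scores[i, : i + 1].max())
        alpha[i, : i + 1] = e / e.sum()
    return alpha

def rope(x, pos, base=10000.0):
    d = x.shape[-1]
    ang = pos * base ** (-2.0 * np.arange(d // 2) / d)
    out = np.empty_like(x)
    out[0::2] = np.cos(ang) * x[0::2] - np.sin(ang) * x[1::2]
    out[1::2] = np.sin(ang) * x[0::2] + np.cos(ang) * x[1::2]
    return out

T, d_k = 8, 16
rng = np.random.default_rng(0)
W_Q, W_K = rng.normal(size=(d_k, d_k)), rng.normal(size=(d_k, d_k))
h = rng.normal(size=d_k)            # a single hidden vector, repeated: constant sequence
q, k = W_Q @ h, W_K @ h

# NoPE: every logit is identical, so attention is uniform over the prefix (alpha_ij = 1/i).
nope_scores = np.full((T, T), q @ k / np.sqrt(d_k))
print(causal_softmax(nope_scores)[4, :5])   # five equal entries of 0.2

# RoPE: logits depend on j - i, so attention is generally non-uniform even here.
rope_scores = np.array([[rope(q, i) @ rope(k, j) / np.sqrt(d_k) for j in range(T)]
                        for i in range(T)])
print(causal_softmax(rope_scores)[4, :5])   # non-uniform entries
```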

NoPE transformers propagate embedding uniformity. At initialization, the entries of the embedding matrix are drawn i.i.d. from a distribution with a fixed small variance (commonly, $\sigma^2 = 0.02$). Therefore, the token embeddings are close to uniform at the beginning of training. The next theorem shows that for NoPE transformers, this uniformity persists throughout the network, and bounds the attention positional bias $\mathcal{A}_p$ and its gradients.

Theorem 3.4. Define the prefix-spread of the hidden states at layer $l$ as $\Delta^{(l)}_h := \max_i \max_{j \le i} \| h^{(l)}_j - \bar h^{(l)}_i \|$, where $h^{(l)}_j$ is the hidden state of token $j$ at layer $l$ and $\bar h^{(l)}_i$ is its prefix mean. For NoPE transformers, there exist $\varepsilon > 0$ and constants $C_1$, $C_2$, and $C_3$ such that if the initial embeddings satisfy $\Delta^{(1)}_h \le \varepsilon$, then for all layers $l \le L$ the prefix-spread $\Delta^{(l)}_h$, the attention positional bias $\mathcal{A}_p$, and its gradient norm $\|\nabla_\theta \mathcal{A}_p\|$ remain bounded in terms of $C_1$, $C_2$, and $C_3$, with high probability over the initialization distribution. The constants only depend on the number of layers and heads, and not on the sequence length.

The main idea in the proof of Theorem 3.4 is that uniformity in the embeddings causes uniformity in the attention maps, so $\alpha_{ij} \approx 1/i$. Uniform mixing of tokens cannot increase the prefix-spread; thus, uniformity persists throughout the network. This result explains the discrepancy between RoPE and NoPE transformers illustrated in Figure 4.

In summary, we demonstrate that while NoPE attention can learn positional bias, attention non-uniformity develops slowly early in training due to bounded $\mathcal{A}_p$ gradients at initialization.

State-of-the-art RoPE scaling methods fail to effectively generalize to sequences longer than those seen in training without additional long-context finetuning. While YaRN and other popular frequency scaling techniques do avoid perplexity degradation on long-context sequences (bloc97, 2023; Peng et al., 2023), they exhibit sharp performance drops on downstream tasks whenever important information is present deep in the sequence, beyond the training context (Liu et al., 2023b; Lu et al., 2024b). We empirically demonstrate this phenomenon, comparing the perplexity and needle-in-a-haystack (NIAH) (Hsieh et al., 2024; Kamradt, 2023) performance of a RoPE transformer scaled with YaRN to a cropped-context baseline. As illustrated in Figure 5, YaRN's zero-shot behavior closely matches that of simply cropping the sequence length to the pretraining context, maintaining constant perplexity but ignoring information present outside the cropped window.

The cause of this limitation lies in the way context extension methods scale different RoPE frequencies.

As detailed in Section 2, elaborated on in Appendix A, and illustrated in Figure 6, the scaling factors of PI (Chen et al., 2023), RoPE-NTK (bloc97, 2023), and YaRN (Peng et al., 2023) have a strong effect on low frequencies. In Section 4.1, we discuss why this aggressive scaling of low frequencies leads to the observed failures, yielding our second observation.

Observation 2. RoPE-scaling methods must compress low frequencies to keep positional phases in-distribution. This, in turn, shifts semantic attention heads at large relative distances, causing the observed failures on downstream tasks, preventing zero-shot context extension.

Effect of RoPE scaling. RoPE scaling methods modify the frequencies at inference time to evaluate sequences that are longer than those seen during pretraining. In each $(2m, 2m+1)$ subspace, the RoPE phase at relative distance $\Delta$ is $\phi_m(\Delta) = \omega_m\, \Delta$, so scaling the frequency to $\omega'_m = \gamma_m\, \omega_m$ is equivalent to using a phase $\phi'_m(\Delta) = \gamma_m\, \omega_m\, \Delta$.

As illustrated in Figure 6, most scaling methods leave high frequencies nearly unchanged ($\gamma_m \approx 1$), but all of them compress the low frequencies ($\gamma_m \approx 1/s$). As demonstrated both theoretically and empirically in Barbero et al. (2024), high RoPE frequencies are primarily used by positional heads, with attention patterns based on relative token positions (e.g., diagonal or previous-token heads). In contrast, low frequencies are predominantly used by semantic heads that attend based on query/key content. Consequently, positional heads are largely unaffected by scaling, but semantic attention is shifted. Moreover, the effect on low-frequency-dominated semantic heads is exacerbated for distant tokens, since the relative phase $\phi_m(\Delta)$ is larger, and thus the $1/s$ scaling factor has a greater effect. In other words, scaling warps low-frequency phases, shifting long-range attention in precisely the subspaces most used for semantic matching.

Figure 8 | RoPE scaling shifts semantic attention mass. Attention weights of the last token (query) with tokens from a retrieval target (keys) in a semantic head evaluated on a NIAH probe. Since the head uses low frequencies and the relative distance is non-trivial, the impact of YaRN is substantial, shifting attention mass between tokens.

In Figure 7 and Figure 8, we illustrate this behavior in practice. We start by selecting a positional attention head in a pretrained Qwen2.5-0.5B model by examining its average attention positional bias (Definition 3.1) across layers. In Figure 7, we show the average attention weights in this positional head under YaRN scaling with $s = 2$. Because high frequencies, which are least affected by YaRN, dominate positional heads, the average attention profiles are similar. In Figure 8, we then contrast this behavior with that of a semantic head for a long needle-in-a-haystack sequence, plotting the average attention of the last token (query) with tokens around the needle (keys). YaRN's aggressive scaling of low frequencies substantially shifts attention mass across tokens, reflecting the impact of frequency compression at longer ranges.

Why this is inevitable. In a standard RoPE setup, low-frequency phases never make a full cycle over the original context length: $\phi_m(C_\text{train}) = \omega_m\, C_\text{train} < 2\pi$ for small $\omega_m$. For example, for a standard RoPE base $b = 10^4$, a transformer with head dimension $d_k = 64$ will have at least five low frequencies for which $\phi_m(C_\text{train}) < 2\pi$, even at a training context of $C_\text{train} = 32{,}000$. If we leave $\omega_m$ unchanged at an extended length $C_\text{test} > C_\text{train}$, the new maximal relative phase $\phi_m(C_\text{test})$ is pushed outside the training regime and becomes out of distribution for the head. Therefore, to constrain phases to remain in range, any scaling method must choose $\gamma_m \le C_\text{train}/C_\text{test} = 1/s$, which becomes increasingly small as the extension factor $s$ grows. In other words, when applying a RoPE transformer to sequences longer than those seen in training, any post-hoc scaling method must compress the low frequencies. But this compression, in turn, shifts attention weights at long relative distances.
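The phase argument can be reproduced with a few lines of arithmetic: the sketch below computes the largest RoPE phase per frequency at the training and extended context lengths and the compression factor any post-hoc scaling must apply to keep those phases in range. It is a toy calculation with our own variable names, not code from the paper.

```python
import numpy as np

d_k, base = 64, 10000.0
C_train, s = 32_000, 4
C_test = s * C_train

# omega_m = base ** (-2*(m-1)/d_k) for m = 1, ..., d_k/2 (0-indexed below)
omega = base ** (-2.0 * np.arange(d_k // 2) / d_k)

phase_train = omega * C_train   # largest relative phase each frequency sees in training
phase_test = omega * C_test     # largest relative phase at the extended length

# The slowest frequency never completes a full 2*pi cycle within the training context,
# so phases beyond phase_train are out of distribution for that subspace.
print("cycles of the slowest frequency at C_train:", phase_train[-1] / (2 * np.pi))

# Keeping the extended phases inside the training range requires
# gamma_m * phase_test <= phase_train, i.e. gamma_m <= C_train / C_test = 1/s.
gamma_max = phase_train / phase_test
print("maximal admissible gamma:", gamma_max[0], "= 1/s =", 1 / s)
```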

Taken together, Observations 1 and 2 imply that providing explicit positional information with PE is a key component for effective LM training, but is also a fundamental barrier to long-context generalization. This raises a natural question: is it possible to harness the inductive bias from positional embeddings exclusively during training? We answer in the affirmative. In this section, we demonstrate that it is possible to drop all positional embeddings from a pretrained transformer and quickly recover the model’s in-context capabilities with a brief recalibration phase. Most notably, this simple new procedure (termed DroPE) unlocks strong zero-shot long context generalization to unseen sequence lengths, far beyond highly-tuned RoPE extensions and prior alternative architectures.

Observation 3. Positional embeddings can be removed after pretraining, allowing LMs to generalize zero-shot to unseen sequence lengths without compromising their in-context performance after short recalibration on a fraction of the training tokens at the original context size.

We extensively validate DroPE across different LM and dataset scales, showing it outperforms prior approaches both as a zero cost integration into pretraining recipes and as an inexpensive way to adapt any LM in the wild already pretrained on trillions of tokens. For all experiments in this paper, we provide full implementation details of each evaluated architecture and optimization phase, including comprehensive hyperparameter lists in Appendix C.
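Schematically, applying DroPE to a pretrained checkpoint amounts to switching off the rotary embedding in every attention layer and briefly resuming training at the original context length. The PyTorch sketch below illustrates this on a toy attention module; the class, flag, and function names are ours, and the recalibration loop is only indicated in comments, since the real hooks depend on the specific model implementation.

```python
import torch
import torch.nn as nn

class CausalSelfAttention(nn.Module):
    """Minimal single-head causal attention with a switchable RoPE (toy stand-in, not the paper's code)."""
    def __init__(self, d_model, use_rope=True, base=10000.0):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)
        self.use_rope = use_rope
        self.register_buffer("omega", base ** (-2.0 * torch.arange(d_model // 2) / d_model))

    def _rope(self, x):                            # x: (T, d)
        ang = torch.arange(x.shape[0]).unsqueeze(1) * self.omega
        cos, sin = ang.cos(), ang.sin()
        out = torch.empty_like(x)
        out[:, 0::2] = cos * x[:, 0::2] - sin * x[:, 1::2]
        out[:, 1::2] = sin * x[:, 0::2] + cos * x[:, 1::2]
        return out

    def forward(self, h):                          # h: (T, d)
        T, d = h.shape
        q, k, v = self.qkv(h).chunk(3, dim=-1)
        if self.use_rope:                          # the only positional signal besides the causal mask
            q, k = self._rope(q), self._rope(k)
        scores = (q @ k.T) / d ** 0.5
        scores = scores.masked_fill(torch.triu(torch.ones(T, T, dtype=torch.bool), 1), float("-inf"))
        return self.out(scores.softmax(-1) @ v)

def drope(model):
    """Step 1 of DroPE: drop the positional embeddings from every attention layer."""
    for module in model.modules():
        if isinstance(module, CausalSelfAttention):
            module.use_rope = False
    return model

model = nn.Sequential(CausalSelfAttention(64), CausalSelfAttention(64))
drope(model)                                 # RoPE is now disabled in every attention layer
print(model(torch.randn(16, 64)).shape)      # torch.Size([16, 64])

# Step 2 (sketch): resume standard LM training at the ORIGINAL context length for a short
# recalibration phase, e.g.:
#   for batch in recalibration_loader:
#       loss = lm_loss(model(batch)); loss.backward(); optimizer.step(); optimizer.zero_grad()
```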

Integrating DroPE at no extra cost. For our first set of experiments, we train from scratch different LMs with half a billion parameters on 16B FineWeb tokens (Penedo et al., 2024), over twice the Chinchilla-optimal rate (Hoffmann et al., 2022). We repeat this recipe for RoPE and NoPE transformers, as well as an ALiBi model (Press et al., 2021) and an RNoPE-SWA model (Yang et al., 2025b), which are alternative architectures specifically aimed at long-context capabilities. We implement DroPE by taking the 14B-token RoPE transformer checkpoint, removing positional embeddings from every layer, and resuming training for the final 2B tokens. Despite only recalibrating at the very end of training, at no extra cost, DroPE matches the final in-context validation perplexity of RoPE trained on the full 16B tokens, showing a clear edge over the NoPE baseline trained without positional embeddings throughout (Figure 2). We provide further analysis and ablations on the recalibration starting point in Appendix D.1, validating the importance of harnessing the inductive bias of RoPE for a substantial amount of training, in line with the core motivation of our new method.

To evaluate the long-context generalization of each method, we select three tasks from the RULER benchmark (Hsieh et al., 2024): (1) multi-query: retrieve needles for several listed keys, (2) multi-key: retrieve the needle for one specified key, and

(3) multi-value: retrieve all needles for one key with a single query. For the base RoPE transformer, we consider three context extension strategies: PI (Chen et al., 2023), NTK-RoPE (bloc97, 2023), and the popular YaRN (Peng et al., 2023).

Figure 10 | SmolLM-DroPE recalibration. We compare three recipes, using 30B, 60B, and 120B training tokens.

Given the extended training periods, only for these experiments, we also add QKNorm (Henry et al., 2020) after dropping the positional embeddings, which we find beneficial for mitigating training instabilities, as noted by OLMo et al. (2024b) (see Appendix D.3).

We start by analyzing how quickly our SmolLM-DroPE models can recover SmolLM's in-context performance across six different LM reasoning benchmarks (Bisk et al., 2020; Clark et al., 2018; Mihaylov et al., 2018; Sakaguchi et al., 2021; Zellers et al., 2019). As shown in Figures 9 and 10 as well as Table 5, even with our shortest training schedule, SmolLM-DroPE almost entirely matches SmolLM on every task, while with our longest schedule our new model manages to exceed its original performance. Furthermore, inspecting our model at every checkpoint throughout training, we find that DroPE recovers over 95% of SmolLM's performance after less than 5B tokens, representing a minuscule 0.8% of SmolLM's original budget.

We then evaluate our SmolLM-DroPE models' zero-shot length generalization on four different tasks from LongBench (Bai et al., 2023), a challenging benchmark even for closed-source LMs, including knowledge-extraction problems longer than 80 times SmolLM's pretraining context (2048 tokens). We compare our method with the base SmolLM and three RoPE extensions: PI, RoPE-NTK, and YaRN. As shown in Table 2, despite a significant difficulty spike compared to our prior evaluations, DroPE still displays a clear edge over prior approaches, improving the base SmolLM's average score by over 10 times. These gains are far beyond all prior zero-shot RoPE extensions currently used across modern LMs. We refer to Appendix D.2 for a fine-grained analysis of task performance as a function of extension factor.

Scaling to billion-parameter models. Given the remarkable efficiency of recalibration, we test DroPE's ability to scale to larger LMs in the wild, such as SmolLM-1.7B (Allal et al., 2024) and Llama2-7B (Touvron et al., 2023), which were trained on 1 trillion and 4 trillion tokens, respectively. For both of these models, we perform recalibration on 20B tokens, which represents only 2% of the pretraining budget for SmolLM-1.7B and only 0.5% for Llama2-7B. As demonstrated in Table 3, consistent with all our prior results at smaller scale, SmolLM-1.7B-DroPE and Llama2-7B-DroPE once again outperform state-of-the-art RoPE-scaling methods on long-context question-answering and summarization, providing strong evidence towards the scalability and immediate potential of DroPE. Overall, our in-context and out-of-context results demonstrate DroPE is an efficient and effective long-context extension method, which we believe can have meaningful implications for reducing the cost of training pipelines and for tackling the canonical context scalability challenges of transformers.

We complement this section with additional experimental results, including the entire LongBench benchmark, and a performance by query length breakdown in Appendix D.

Recent improvements to RoPE include variants based on Fourier and wavelet transforms (Hua et al., 2025; Oka et al., 2025) and methods such as $p$-RoPE (Barbero et al., 2025), RNoPE-SWA (Yang et al., 2025b), and SWAN-GPT (Puvvada et al., 2025), which occupy a middle ground between RoPE and NoPE. Our approach represents a fundamentally different paradigm, replacing RoPE with NoPE at different stages of training. These directions are complementary to ours and can be used in place of RoPE within the DroPE framework. Another orthogonal direction seeks length generalization while retaining a dedicated positional vector yet modifying its indexing or adaptivity (Wu et al., 2024; Zheng et al., 2024; zican Dong et al., 2024).

Our findings support a reinterpretation of positional embeddings in transformer LMs as a useful inductive bias that is essential for efficient training (Observation 1), but inherently constrains zero-shot context extension (Observation 2). Based on these findings, we propose DroPE, a new method rethinking the conventional role of PEs as a temporary scaffold that can and should be removed after serving their training-time purpose (Observation 3). We empirically validate DroPE across different models and data scales, showing its effectiveness and potential to be integrated as a new core component of future state-of-the-art training pipelines. More broadly, our work demonstrates that canonical trade-offs in LM design, between the training benefits of explicit positional embeddings and their cost to long-context generalization, can be overcome.

A multi-head block computes $\mathrm{MHA}(x)_i = W_O\, [\mathrm{attn}^{(1)}(x)_i, \dots, \mathrm{attn}^{(H)}(x)_i]$, where $[\,\cdot\,, \dots, \cdot\,]$ represents concatenation along the feature dimension. When clear from context, we omit layer and head indices.

Positional embeddings in transformers. The attention mechanism does not directly encode relative distances between queries and keys. Therefore, attention is invariant to prefix permutations: for any permutation $\sigma \in S_p$ of the first $p$ input tokens, $\mathrm{attn}(x_{\sigma^{-1}(1)}, \dots, x_{\sigma^{-1}(p)}, x_{p+1}, \dots, x_T)_i = \mathrm{attn}(x_1, \dots, x_T)_i$ for every $i > p$. In other words, pure attention is blind to token positions. To address this, Vaswani et al. (2017) introduced absolute positional embeddings, adding position information to the token embeddings before the first transformer block. More recently, many architectures replace absolute embeddings with relative schemes that inject pairwise positional information directly into the attention mechanism.
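The prefix-permutation invariance stated above can be verified directly for a single attention layer without positional embeddings; the short NumPy check below uses our own helper names and is only a toy illustration.

```python
import numpy as np

def causal_attention(x, W_Q, W_K, W_V):
    """Plain causal attention with no positional embeddings; x has shape (T, d)."""
    q, k, v = x @ W_Q.T, x @ W_K.T, x @ W_V.T
    d_k = q.shape[-1]
    out = np.zeros_like(v)
    for i in range(x.shape[0]):
        s = q[i] @ k[: i + 1].T / np.sqrt(d_k)
        a = np.exp(s - s.max())
        out[i] = (a / a.sum()) @ v[: i + 1]
    return out

rng = np.random.default_rng(0)
T, d, p = 10, 8, 6
x = rng.normal(size=(T, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

x_perm = np.concatenate([x[rng.permutation(p)], x[p:]], axis=0)   # shuffle only the first p tokens

y, y_perm = causal_attention(x, W_Q, W_K, W_V), causal_attention(x_perm, W_Q, W_K, W_V)
assert np.allclose(y[p:], y_perm[p:])   # outputs beyond the permuted prefix are unchanged
```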

where, ๐‘… โˆˆ ๐‘‚(๐‘‘ ๐‘˜ ) is a block-diagonal orthogonal matrix composed out of 2 ร— 2 rotation blocks:

In the standard RoPE parameterization, ๐œ” ๐‘š = ๐‘ -2 ๐‘š-1 ๐‘‘ ๐‘˜ with ๐‘ = 10,000.

Language model context extension. Generalizing to contexts longer than those seen during training is a key challenge for transformer-based language models. The key issue is that when applying a transformer on a longer context, the attention mechanism must operate over more tokens than it was trained to handle. This issue is exacerbated with explicit positional embeddings such as RoPE, whose phases become out-of-distribution at unseen relative distances. RoPE-scaling methods address this by rescaling each frequency, $\omega'_m = \gamma_m\, \omega_m$. PI (Chen et al., 2023) uses a uniform $\gamma_m = 1/s$.

NTK-RoPE (bloc97, 2023) instead rescales the RoPE base, so that low frequencies ($m \approx d_k/2$) are scaled similarly to PI while for high frequencies $\gamma_m \approx 1$. YaRN (Peng et al., 2023) interpolates between the two regimes per frequency, with tunable $p$ and $q$ parameters, originally chosen as $p = 1$, $q = 32$. See Figure 11 for a comparison between these different RoPE-scaling methods with $s = 2$, 3, and 4.
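Since the exact NTK-RoPE and YaRN expressions are not reproduced above, the sketch below follows the commonly used formulations: PI as uniform compression, "NTK-aware" scaling as a base rescaling, and a YaRN-style ramp between full compression and no compression. The specific ramp bounds and constants are assumptions of ours, not the paper's definitions.

```python
import numpy as np

def rope_frequencies(d_k, base=10000.0):
    return base ** (-2.0 * np.arange(d_k // 2) / d_k)

def pi_gamma(omega, s):
    """Position Interpolation: every frequency compressed by the same factor 1/s."""
    return np.full_like(omega, 1.0 / s)

def ntk_gamma(omega, s, d_k, base=10000.0):
    """'NTK-aware' scaling as commonly implemented: rescale the base so that the highest
    frequency is nearly untouched and the lowest is compressed roughly by 1/s."""
    new_omega = (base * s ** (d_k / (d_k - 2))) ** (-2.0 * np.arange(d_k // 2) / d_k)
    return new_omega / omega

def yarn_gamma(omega, s, C_train, low=1.0, high=32.0):
    """YaRN-style ramp (assumed form): interpolate between 1/s for frequencies that complete
    few cycles over the training context and 1 for those that complete many."""
    cycles = omega * C_train / (2 * np.pi)
    ramp = np.clip((cycles - low) / (high - low), 0.0, 1.0)
    return (1.0 - ramp) / s + ramp

omega = rope_frequencies(64)
for name, gamma in [("PI", pi_gamma(omega, 4)),
                    ("NTK", ntk_gamma(omega, 4, 64)),
                    ("YaRN", yarn_gamma(omega, 4, C_train=2048))]:
    print(f"{name}: gamma(highest freq) = {gamma[0]:.3f}, gamma(lowest freq) = {gamma[-1]:.3f}")
```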

In this section, we analyze the behavior of positional bias, or attention non-uniformity, in NoPE transformers and RoPE transformers early in training. We provide formal statements and proofs for all the results from Section 3, starting with Propositions 3.2 and 3.3, followed by Theorem 3.4. The notation of this section follows that of Appendix A.

Proposition 3.2. Let $\mathcal{M}$ be a NoPE transformer and let $x_1 = \dots = x_T$ be a constant input sequence. Then: (1) causal attention is uniform, $\alpha_{ij} = 1/i$; (2) query and key gradients vanish, $\partial \mathcal{L} / \partial W_Q = \partial \mathcal{L} / \partial W_K = 0$; (3) for all heads and any positional weights $p$, $\mathcal{A}_p = 0$ and $\nabla_\theta \mathcal{A}_p = 0$; and (4) the output is constant, $\mathcal{M}(x)_1 = \dots = \mathcal{M}(x)_T$.

Proof. Let $x_1, \dots, x_T$ be a constant input sequence, $x_1 = \dots = x_T$, and let $\mathcal{M}$ be a NoPE transformer, i.e. a transformer with no positional encodings and causal self-attention. The order of the proof is (4) ⇒ (1) ⇒ (2 + 3).

(4) Layer outputs, and thus model outputs, are constant. At the first layer, inputs are identical: $h_1 = \dots = h_T = h$. This means that for every attention head and every $1 \le j \le T$, $v_j \equiv v = W_V h$. Therefore, the output of the attention head is $\sum_{j \le i} \alpha_{ij}\, v_j = v \sum_{j \le i} \alpha_{ij} = v$, independent of $i$. Concatenating heads and applying $W_O$ preserves equality across positions. Residual connections, LayerNorm, and the MLP are position-wise (the same function is applied independently at each position), so identical inputs produce identical outputs at every position. Thus the layer output remains constant. By repeating this argument layer-by-layer, every subsequent layer receives identical inputs and outputs identical states, so in the end $\mathcal{M}(x)_1 = \dots = \mathcal{M}(x)_T$.

(1) Uniform causal attention. Using (4), we know that for every layer $1 \le l \le L$ the hidden states are identical, $h^{(l)}_1 = \dots = h^{(l)}_T = h$. Therefore, for every attention head and every $1 \le j \le T$, $q_j \equiv q := W_Q h$, $k_j \equiv k := W_K h$, $v_j \equiv v := W_V h$. Thus, for each $1 \le j \le i \le T$, the attention scores $s_{ij} = q^\top k / \sqrt{d_k} \equiv c$ are constant (independent of $i$ or $j$). Hence $\alpha_{ij} = \mathrm{softmax}(\underbrace{c, \dots, c}_{i \text{ entries}})_j = 1/i$.

(2 + 3) Vanishing $W_Q$, $W_K$ gradients. Since the inputs for every layer are constant, we know from (1) that every attention head has $\alpha_{ij} \equiv 1/i$, independent of $W_Q$ and $W_K$. Therefore $\partial \alpha_{ij} / \partial W_Q = \partial \alpha_{ij} / \partial W_K = 0$. Since the attention bias $\mathcal{A}_p$ depends on the parameters $\theta$ only through $\alpha_{ij}$, and the loss $\mathcal{L}$ depends on $W_Q$ and $W_K$ only through $\alpha_{ij}$, all these gradients vanish by the chain rule. Additionally, since the heads are uniform, the attention bias is zero to begin with: $\mathcal{A}_p = \sum_i \sum_{j \le i} p_{ij}\, \alpha_{ij} = \sum_i \frac{1}{i} \sum_{j \le i} p_{ij} = 0$. □

Note that part (4) of the proposition holds for RoPE transformers as well; parts (1), (2), and (3) do not. The relative rotations break attention uniformity, and thus changing the magnitudes of $W_Q$ and $W_K$ can affect the attention weights. This is formally demonstrated in the next section.

Proposition 3.3. For a non-trivial RoPE attention head, even if the input sequence is constant, there are positional weights $p$ for which $\mathcal{A}_p > 0$ and $\|\nabla_\theta \mathcal{A}_p\| > 0$.

Proof. Let $x_1 = \dots = x_T = x \in \mathbb{R}^d$ be the inputs to a RoPE attention head, and let $W_Q, W_K \in \mathbb{R}^{d_k \times d}$ be the query and key projection parameters. Since the projection maps are shared across tokens, the queries and keys are constant as well: $q_i \equiv q = W_Q x$ and $k_j \equiv k = W_K x$. Set the positional bias weights to be $p_{ij} = \alpha_{ij} - 1/i$. Since $\sum_{j \le i} \alpha_{ij} = 1$, we have $\sum_{j \le i} p_{ij} = 0$ as required. The positional bias $\mathcal{A}_p$ is

$\mathcal{A}_p = \sum_i \sum_{j \le i} \big(\alpha_{ij} - \tfrac{1}{i}\big)\,\alpha_{ij} = \sum_i \Big( \sum_{j \le i} \alpha_{ij}^2 - \tfrac{1}{i} \Big).$

By Cauchy-Schwarz, $\sum_{j \le i} \alpha_{ij}^2 \ge \tfrac{1}{i}\big(\sum_{j \le i} \alpha_{ij}\big)^2 = \tfrac{1}{i}$, with equality iff $\alpha_{ij} = 1/i$ is uniform. Therefore, $\mathcal{A}_p > 0$ unless $\alpha_{ij}$ is uniform for all $i$. The following lemma asserts that this is not the case.

Lemma B.2. For any non-degenerate RoPE head and input embeddings $x_1 = \dots = x_t = x$, there exists $i \ge 1$ such that $s_{i1}, \dots, s_{ii}$ and $\alpha_{i1}, \dots, \alpha_{ii}$ are not uniform.

The proof of Lemma B.2 is given at the end of this subsection. As for $\nabla_\theta \mathcal{A}_p$, rewrite $\mathcal{A}_p = \sum_i \big( \sum_{j \le i} \alpha_{ij}^2 - \tfrac{1}{i} \big)$, so the dependence on the parameters $\theta$ is entirely through the terms $F_i := \sum_{j \le i} \alpha_{ij}^2$. From the definition of RoPE, the logits are $s_{ij} = (R_i\, q)^\top (R_j\, k)/\sqrt{d_k}$. Consider scaling $q$ by a scalar $\lambda > 0$, $q \mapsto \lambda q$, which scales every logit by $\lambda$. For a fixed prefix $i$, let $Z_i(\lambda) := \sum_{j \le i} e^{\lambda s_{ij}}$ and $A_i(\lambda) := \log Z_i(\lambda)$ be the log-partition function. The second derivative of the log-partition function is the logit variance, which is strictly positive whenever the logits $s_{ij}$ are not all equal and $\alpha_{ij}(\lambda) > 0$. Thus, $A'_i(\lambda)$ is strictly increasing in $\lambda$. Hence, for any $i$ with non-constant logits, $\tfrac{d}{d\lambda} F_i(\lambda) > 0$, and in particular at $\lambda = 1$. By the chain rule for $q \mapsto \lambda q$, $\tfrac{d}{d\lambda} F_i(\lambda) = \nabla_q F_i(q)^\top q$. Thus $\nabla_q F_i(q) \ne 0$ (otherwise the dot product with $q$ could not be strictly positive). Finally, since $q = W_Q x$, $\nabla_{W_Q} F_i = \nabla_q F_i(q)\, x^\top$, and with $x \ne 0$ we get $\|\nabla_\theta F_i\| \ge \|\nabla_{W_Q} F_i\| > 0$. Therefore $\nabla_{W_Q} \mathcal{A}_p = \sum_i \nabla_q F_i(q)\, x^\top$ has strictly positive norm (a sum of nonzero matrices sharing the same nonzero right factor $x^\top$ cannot be the zero matrix unless all left factors vanish, which they don't for $i \ge 2$). □

To conclude this section, we now prove Lemma B.2.

Lemma B.2. For any non-degenerate RoPE head and input embeddings $x_1 = \dots = x_t = x$, there exists $i \ge 1$ such that $s_{i1}, \dots, s_{ii}$ and $\alpha_{i1}, \dots, \alpha_{ii}$ are not uniform.

Proof. RoPE acts as independent $2 \times 2$ rotations on disjoint coordinate pairs. Thus $s_{ij} = f(j - i)$ with pairwise distinct frequencies $\omega_m \in (0, 2\pi)$. Decompose $q = (q_1, \dots, q_M)$ and $k = (k_1, \dots, k_M)$ with $q_m, k_m \in \mathbb{R}^2$, so that

$f(\Delta) = \sum_{m=1}^{M} q_m^\top R(\Delta\,\omega_m)\, k_m.$

Let $J := \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}$, and define $A_m := q_m^\top k_m$ and $B_m := q_m^\top J k_m$. Then $f(\Delta) = \sum_{m=1}^{M} \big( A_m \cos(\Delta\,\omega_m) + B_m \sin(\Delta\,\omega_m) \big)$, where each term can be written as $C_m e^{i\Delta\omega_m} + \bar C_m e^{-i\Delta\omega_m}$ with $C_m := \tfrac{1}{2}(A_m - i B_m)$.

Assume $f(\Delta)$ is constant in $\Delta$ for $\Delta = 0, \dots, 2M = d_k$, and denote the constant value by $-\tfrac{1}{2} C_0$. Then we have $\sum_{m=-M}^{M} C_m e^{i\Delta\omega_m} \equiv 0$, where $C_{-m} := \bar C_m$ and $\omega_{-m} = -\omega_m$. Since $\{e^{-i\omega_M}, \dots, e^{-i\omega_1}, 1, e^{i\omega_1}, \dots, e^{i\omega_M}\}$ are all distinct, by Vandermonde's identity this means $C_m = \bar C_m = 0$ for $m = 1, \dots, M$, hence $A_m = B_m = 0$ for $m = 1, \dots, M$. Now $q_m^\top k_m = 0$ and $q_m^\top J k_m = 0$ for every block. If $k_m \ne 0$, then $\{k_m, J k_m\}$ spans $\mathbb{R}^2$, forcing $q_m = 0$. Thus for every block $m$, either $q_m = 0$ or $k_m = 0$, which results in a degenerate RoPE head, contradicting the assumption. Therefore, for $i \ge d_k + 1$ the attention logits $s_{ij}$ are not constant, and thus the attention weights $\alpha_{ij}$ are not constant. □

In this section, we prove Theorem 3.4. To do so, we first need to prove a sequence of Propositions and Lemmas. First, we restate the theorem here.

Theorem 3.4. Define the prefix-spread of the hidden states at layer $l$ as $\Delta^{(l)}_h := \max_i \max_{j \le i} \| h^{(l)}_j - \bar h^{(l)}_i \|$, where $h^{(l)}_j$ is the hidden state of token $j$ at layer $l$ and $\bar h^{(l)}_i$ is its prefix mean. For NoPE transformers, there exist $\varepsilon > 0$ and constants $C_1$, $C_2$, and $C_3$ such that if the initial embeddings satisfy $\Delta^{(1)}_h \le \varepsilon$, then for all layers $l \le L$ the prefix-spread $\Delta^{(l)}_h$, the attention positional bias $\mathcal{A}_p$, and its gradient norm $\|\nabla_\theta \mathcal{A}_p\|$ remain bounded in terms of $C_1$, $C_2$, and $C_3$, with high probability over the initialization distribution. The constants only depend on the number of layers and heads, and not on the sequence length.

Since all weight matrices are drawn from a Gaussian distribution with a fixed variance, there exists a constant $B$, depending only on the architecture, such that with high probability the operator norms of all weight matrices are bounded by $B$. The first step (Proposition B.3) bounds the deviation of the attention logits from their prefix mean, $\max_{j \le i} |s_{ij} - \bar s_i|$, by a constant multiple of the prefix-spread $\Delta^{(l)}_h$.

Proof. Notice that $s_{ij} - \bar s_i = \frac{1}{\sqrt{d_k}}\, q_i^\top (k_j - \bar k_i)$. Therefore, by Cauchy-Schwarz, $|s_{ij} - \bar s_i| \le \frac{1}{\sqrt{d_k}}\, \|q_i\|\, \|k_j - \bar k_i\|$. By the linearity of $W_K$ we get $\|k_j - \bar k_i\| = \|W_K (h_j - \bar h_i)\| \le \|W_K\|\, \|h_j - \bar h_i\| \le \|W_K\|\, \Delta^{(l)}_h \le B\, \Delta^{(l)}_h$. As for $\|q_i\| = \|W_Q h_i\|$, recall that the $h_i$ are the output of a normalization layer, and therefore (at initialization) $\|q_i\|$ is bounded by a constant depending only on $B$. Putting it all together bounds $|s_{ij} - \bar s_i|$ by a constant multiple of $\Delta^{(l)}_h$. To finish the proof, take a maximum over $j \le i$. □

To bound the effect on the attention probabilities, we need the following lemma.

Lemma B.4. For any $p, d \in \mathbb{R}^n$, $\|\mathrm{softmax}(p + d) - \mathrm{softmax}(p)\| \le \|d\|_\infty$.

Proof. A convex $C^2$ function whose Hessian satisfies $d^\top \nabla^2 f(x)\, d \le \|d\|_\infty^2$ for all $x, d \in \mathbb{R}^n$ has a gradient that is 1-Lipschitz with respect to $\|\cdot\|_\infty$ (see Theorem 2.1.6 in Nesterov (2013)). Take $f(x) = \log \sum_{i=1}^n e^{x_i}$; $f$ is $C^2$, convex, and $\nabla f(x) = \mathrm{softmax}(x)$. Therefore, all we need to show is that for all $x, d \in \mathbb{R}^n$, $d^\top \nabla \mathrm{softmax}(x)\, d = d^\top \nabla^2 f(x)\, d \le \|d\|_\infty^2$, and indeed this follows from a direct computation with the softmax Jacobian, as required. □

Using Lemma B.4, we can bound the uniformity of $\alpha_{ij}$ and the prefix-spread of the head outputs.

Proposition B.5. Let $u_i = \frac{1}{i}\mathbf{1} \in \mathbb{R}^i$. In any layer $l$, the attention weights are close to uniform, with $\|\alpha_i - u_i\|$ bounded by a constant multiple of $\Delta^{(l)}_h$ (Equation 12), and the prefix-spread of the head outputs is bounded accordingly (Equation 13).

Proof. To get Equation 12, let $a$ be the constant vector $(\bar s_i, \dots, \bar s_i) \in \mathbb{R}^i$ and let $c = s_i - a$. By Lemma B.4, $\|\alpha_i - u_i\| = \|\mathrm{softmax}(s_i) - \mathrm{softmax}(a)\| \le \|c\|_\infty$. Now, notice that $\|c\|_\infty = \max_{j \le i} |s_{ij} - \bar s_i|$, therefore Proposition B.3 gives us the desired inequality. Equation 13 follows by writing the head output as an $\alpha_i$-weighted average of the values and applying Equation 12.

We now bound the next layer's spread in terms of the current one. Denote by $\Delta^{(l)}_z := \max_i \max_{j \le i} \|z_j - \bar z_i\|$ the prefix-spread of an attention head's output. First, we give a bound for $\Delta^{(l)}_z$, and then use this bound to prove the entire propagation result. Before that, we need a short lemma.

Lemma B.6. For any sequence $(x_j)$ and $j \le i$,

To finish the proof, take the maximum over $i$ and $j \le i$. □

There exist constants $A_1, A_2$, depending only on $B$ and $H$, such that the spread at the next layer satisfies $\Delta^{(l+1)}_h \le A_1\, \Delta^{(l)}_h + A_2\, \big(\Delta^{(l)}_h\big)^2$.

Proof. From Proposition B.7, the single-head spread is bounded by a linear term $2B\,\Delta^{(l)}_h$ plus a quadratic term $2B^3 \sqrt{H}\, \big(\Delta^{(l)}_h\big)^2$. Concatenation and $W_O$ multiply by at most $\|W_O\|$ (up to a fixed constant depending on the number of heads). Adding the residual preserves a linear contribution in $\Delta^{(l)}_h$. The position-wise LayerNorm/MLP, being $B$-Lipschitz, scales the spread by at most $B$. Collecting the constants into $A_1$ and $A_2$ gives the desired result. □

We can now prove the full propagation result.

Theorem B.9. For any finite depth $L$, there exists $\varepsilon > 0$ (depending on $B$, $L$, and $H$) such that if $\Delta^{(1)}_h \le \varepsilon$, then for all $l \le L$ the prefix-spread $\Delta^{(l)}_h$ remains bounded by a constant multiple of $\varepsilon$.

This concludes the first part of the proof, regarding uniformity propagation across depth. Note that the bounds in the proof do not depend on the number of tokens in the input sequence.

A ๐‘ bound. Recall that,

where ๐‘ ๐‘– ๐‘— are centered positional weights, i.e. ๐‘—โ‰ค๐‘– ๐‘ ๐‘– ๐‘— = 0. For any such ๐‘ ๐‘– ๐‘— we have

Now, from direct computation and an application of Lemma B.10, we have

Let’s analyse the norm:

This concludes the proof of Theorem 3.4.

For our first set of experiments, we pretrain on FineWeb (Penedo et al., 2024) for over 16B tokens with a sequence length of 1024. We note this is well over 2 times the Chinchilla-optimal number of tokens from Hoffmann et al. (2022). We use a Qwen2 (Yang et al., 2024) tokenizer and follow the specifications (number of layers/hidden dimensions) of the 0.5B model from the same family. We implemented all our baselines on top of this architecture, pretraining them for the same large number of tokens. We use the AdamW optimizer (Loshchilov and Hutter, 2017) with a small warmup phase of 520 steps, a batch size of 1024, a peak learning rate of $3.0 \times 10^{-4}$, and a cosine decay thereafter. For DroPE we followed a similar optimization setup, but only training for 2B total tokens using a shorter warmup of 70 steps and a slightly larger learning rate of $1.0 \times 10^{-3}$ to compensate for the shorter training budget. We provide a full list of hyperparameters and training specifications for this setting in the left column of Table 4.
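The optimization schedule just described (a short linear warmup to the peak learning rate followed by cosine decay) can be written compactly. The sketch below plugs in the from-scratch numbers quoted above; the decay floor and the step-count arithmetic are assumptions of ours, not values reported in the text.

```python
import math

def lr_at_step(step, total_steps, warmup_steps=520, peak_lr=3.0e-4, min_lr=0.0):
    """Linear warmup to peak_lr over warmup_steps, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Rough step count: 16B tokens / (batch 1024 * sequence length 1024) ~ 15k optimizer steps.
total_steps = 16_000_000_000 // (1024 * 1024)
for step in (0, 519, 5_000, total_steps - 1):
    print(step, f"{lr_at_step(step, total_steps):.2e}")
```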

DroPE from a pretrained SmolLM. For the second part of our experimental evaluation, we use a SmolLM (Allal et al., 2024) with around 362 million parameters already extensively pretrained on the SmolLM corpus (Ben Allal et al., 2024) for over 600B tokens with a sequence length of 2048, almost 100 times the Chinchilla-optimal number. This model used a GPT2 (Radford et al., 2019) tokenizer and its architecture was designed to be similar to models of the Llama2 family (Touvron et al., 2023). While not all training details have been disclosed, Allal et al. (2024) explicitly mentions using the AdamW optimizer (Loshchilov and Hutter, 2017), a batch size of 512, a peak learning rate of $3.0 \times 10^{-3}$, and a cosine decay thereafter. For DroPE we again tried to follow a similar optimization setup across our different 30B/60B/120B training regimes, introducing a short warmup of 490 steps and a slightly lower learning rate of $1.0 \times 10^{-3}$, as we found their reported $3.0 \times 10^{-3}$ led to instabilities from the small batch size. Given the more extended training period, we used a simple QKNorm (Henry et al., 2020) after dropping the positional embeddings, which we found beneficial to mitigate sporadic instabilities from large gradients. We note that preliminary experiments showed that normalizing only the queries led to even faster learning and also successfully stabilized long training. We believe further exploration of this new Q-norm method could be an exciting direction for future work to train transformers without positional embeddings at even larger scales. We provide a full list of hyperparameters and training specifications for this setting in the right column of Table 4.

Needle-in-a-haystack. We evaluate long-context retrieval using the needle-in-a-haystack (NIAH) setup, which places a short "needle" inside a long distractor "haystack." Following prior work (Kamradt, 2023), our haystack is a random excerpt from Paul Graham's essays, and each needle is a seven-digit "magic number" paired with a short key/descriptor. We study three variants:

• Standard NIAH: We insert a single needle and prompt the model to retrieve it.

• Multi-Query NIAH: We insert multiple (key, value) pairs and prompt the model to return as many values as possible for a given list of keys. For example: The special magic numbers for whispering-workhorse and elite-butterfly mentioned in the provided text are:.

• Multi-Key NIAH: We insert multiple (key, value) pairs but query for a single key, e.g., The special magic number for elite-butterfly mentioned in the provided text is:

• Multi-Value NIAH: We associate multiple values with one key and ask for all of them without pointing to specific positions, e.g., What are all the special magic numbers for cloistered-colonization mentioned in the provided text?

Inserted needles and example targets are formatted in natural language, for instance, two examples include One of the special magic numbers for whispering-workhorse is: 1019173 and One of the special magic numbers for elite-butterfly is: 4132801. For the standard NIAH variant, we report the average success rate over all possible needle depths. For the multiple needles NIAH variants, we always insert four (key, value) needle pairs, placed at random sequence locations. Unless otherwise noted, we use greedy decoding (logit temperature = 0) for reproducibility.
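For concreteness, the snippet below assembles a toy multi-key NIAH probe in the format described above; the filler haystack and key names are placeholders of ours (the actual evaluation uses excerpts from Paul Graham's essays).

```python
import random

def build_multikey_niah(haystack_sentences, keys, seed=0):
    """Insert one (key, 7-digit value) needle per key at a random position and
    build a single-key retrieval prompt, following the format described above."""
    rng = random.Random(seed)
    needles = {key: str(rng.randrange(1_000_000, 10_000_000)) for key in keys}
    text = list(haystack_sentences)
    for key, value in needles.items():
        needle = f"One of the special magic numbers for {key} is: {value}."
        text.insert(rng.randrange(len(text) + 1), needle)
    query_key = rng.choice(keys)
    prompt = (" ".join(text) +
              f"\nThe special magic number for {query_key} mentioned in the provided text is:")
    return prompt, needles[query_key]

haystack = [f"This is filler sentence number {i}." for i in range(50)]
prompt, answer = build_multikey_niah(haystack, ["elite-butterfly", "whispering-workhorse"])
print(prompt[-120:], "| expected:", answer)
```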

Long-context evaluations. We use standard implementations of PI, RoPE-NTK, and YaRN. For tasks that require a fixed maximum context length (e.g., NIAH at 2× the training context), we set the extension factor $s$ manually. For settings that require reasoning across multiple context lengths and extended generations, we employ a dynamic scaling schedule that adjusts $\gamma$ as a function of the generation length, as detailed in Peng et al. (2023).

For DroPE, we follow Wang et al. (2024) and apply softmax temperature scaling when evaluating on longer sequences. In practice, we tune a single scalar logit scale (equivalently, the inverse temperature) on a held-out set at the target length. Analogous to Peng et al. (2023), we fit this coefficient by minimizing perplexity to obtain the optimal scaling. For the DroPE model trained from scratch, the best-performing scale is $\beta^\star = 1 + 0.412\,\ln(s)$, and for SmolLM-DroPE the optimal scale is $\beta^\star = 1 + 0.103\,\ln(s)$, where $s = C_\text{test}/C_\text{train}$ is the context extension factor. Unless otherwise specified, all other decoding settings are held fixed across lengths.
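In implementation terms, the length-dependent logit scale is a single multiplier on the pre-softmax attention scores. The sketch below uses the SmolLM-DroPE coefficient quoted above and our own function names; it is an illustration of the mechanism, not the evaluation code itself.

```python
import numpy as np

def length_scaled_softmax(scores, C_train, C_test, a=0.103):
    """Scale attention logits by beta = 1 + a*ln(s), with s = C_test/C_train, before the
    softmax. a = 0.103 is the coefficient reported above for SmolLM-DroPE; the from-scratch
    DroPE model uses a = 0.412."""
    s = max(1.0, C_test / C_train)
    beta = 1.0 + a * np.log(s)
    z = beta * scores
    z = z - z.max(axis=-1, keepdims=True)     # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Example: logits over an 8-token causal prefix, evaluated at 8x the training context.
logits = np.random.default_rng(0).normal(size=8)
print(length_scaled_softmax(logits, C_train=2048, C_test=8 * 2048))
```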

Language modeling benchmarks. We evaluate SmolLM and SmolLM-DroPE on six standard multiple-choice benchmarks using the LightEval harness (Habib et al., 2023): ARC-E/C: grade-school science QA split into Easy and Challenge sets, the latter defined by questions that defeat simple IR and co-occurrence baselines (Clark et al., 2018); HellaSwag: adversarially filtered commonsense sentence completion that is easy for humans but challenging for LMs (Zellers et al., 2019); OpenBookQA: combining a small "open book" of science facts with broad commonsense to answer 6K questions (Mihaylov et al., 2018).

• Lower learning rates ($3 \times 10^{-5}$, $3 \times 10^{-4}$). DroPE works effectively without QKNorm. At the lowest learning rate ($3 \times 10^{-5}$), the model without QKNorm achieves a slightly better final loss (2.713 vs. 3.102). Together with the $3 \times 10^{-4}$ setting (2.530 vs. 2.555), this indicates that QKNorm does not consistently improve performance in low-volatility regimes and is not the source of our gains.

• High learning rate ($10^{-3}$). At the highest learning rate, the model without QKNorm becomes unstable (loss spikes, gradient explosions), leading to poor convergence (final loss 6.334). In contrast, adding QKNorm stabilizes training and allows us to leverage the higher learning rate to achieve the best overall performance (final loss 2.496).

Figure 12 shows the corresponding training curves with and without QKNorm, highlighting the presence of loss spikes at higher learning rates, in line with observations reported in OLMo et al. (2024a). These results empirically demonstrate that the primary role of QKNorm is to act as a stabilizer that enables the use of a more aggressive, compute-efficient learning rate. Importantly, DroPE can still be applied without QKNorm by using a moderate learning rate (e.g., $3 \times 10^{-4}$), which is our default setting for all experiments except the longer SmolLM-360M recalibration phases.
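QKNorm, as used here purely for stabilization, normalizes queries and keys before their dot product. The sketch below shows one common variant (an RMS norm with a learnable per-dimension gain applied to each head's queries and keys); the exact norm and gain placement vary between implementations, so treat this as an assumed form rather than the paper's precise recipe.

```python
import torch
import torch.nn as nn

class QKNorm(nn.Module):
    """RMS-normalize per-head query/key vectors with a learnable gain before attention."""
    def __init__(self, head_dim, eps=1e-6):
        super().__init__()
        self.gain = nn.Parameter(torch.ones(head_dim))
        self.eps = eps

    def forward(self, x):                  # x: (..., head_dim)
        inv_rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * inv_rms * self.gain

# Inside an attention layer (sketch): normalize q and k, then compute scores as usual.
head_dim = 64
q_norm, k_norm = QKNorm(head_dim), QKNorm(head_dim)
q = torch.randn(2, 8, head_dim)            # (batch, seq, head_dim)
k = torch.randn(2, 8, head_dim)
scores = (q_norm(q) @ k_norm(k).transpose(-1, -2)) / head_dim ** 0.5
print(scores.shape)                        # torch.Size([2, 8, 8])
```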


For completeness, the centered-weight identities used above are: $\sum_{j \le i} \alpha_{ij}\big(s_{ij} - \sum_{p \le i} \alpha_{ip}\, s_{ip}\big) = \mathbb{E}_{j \sim \alpha_i}\big[s_{ij} - \mathbb{E}_{p \sim \alpha_i}[s_{ip}]\big] = 0$, and, for any centered weights $g_{ij}$ with $\sum_{j \le i} g_{ij} = 0$, $\sum_{j \le i} g_{ij}\,(a_j - \bar a_i) = \sum_{j \le i} g_{ij}\, a_j - \bar a_i \sum_{j \le i} g_{ij} = \sum_{j \le i} g_{ij}\, a_j$.


¹ Note the softmax in Equation 1 is taken on the first $i$ tokens, implementing a causal mask.
