Attention Needs to Focus: A Unified Perspective on Attention Allocation

LLM Usage Disclosure

This document was created with assistance from AI tools for language polishing. The content has been reviewed and edited by human authors. For more information on the extent and nature of AI usage, please contact the author.

Related Works

Position Encoding

Position encoding is essential in Transformer models since self-attention itself is permutation-invariant. Early studies mainly adopted absolute position embeddings (APE). The original Transformer used fixed sinusoidal functions, and GPT-2  later replaced them with trainable vectors added to token representations. Although these approaches worked well for sequences within the training length, they struggled to generalize once the context became longer.

Relative position encoding (RPE) offered a more flexible alternative. Transformer-XL  introduced learnable biases based on token distance, which allowed recurrent processing between segments. ALiBi  simplified the idea by applying fixed linear biases that decay with distance, making it possible to extrapolate to much longer inputs. These methods shifted attention from absolute indices to relative distances, laying the groundwork for position encodings that are better suited to scaling Transformers toward long and variable contexts.

Rotary Position Embeddings (RoPE)  marked another step forward by encoding relative positions as rotations applied to the query and key vectors. RoPE has since become standard in many LLMs. Building on this, several extensions have been proposed. For example, YaRN rescales the rotation angles to push LLaMA models to million-token contexts with limited retraining . LongRoPE2 searches for scaling factors through evolutionary methods to balance precision in both short and long ranges . XPOS combines blockwise causal attention with midrange relative biases to improve interpolation stability . Other designs, such as Fourier Position Embedding (FoPE) , emphasize frequency filtering to retain periodic structure and help models remain stable when context length grows.

In short, position encoding has progressed from absolute sinusoidal forms to relative, rotational, and frequency-based strategies. Recent work pays as much attention to the balance of decay, frequency preservation, and interpolation stability as it does to simply extending the window length.

Attention Sink

Attention sink refers to the phenomenon that initial tokens attract a disproportionately large share of attention during inference . Subsequent analysis links this behavior to softmax normalization, which forces LLMs to assign attention mass somewhere even when no key carries informative value .

To mitigate potential issues associated with this phenomenon, various solutions have been developed. Most models adapt to this pattern by treating the first token as the sink token and always keeping it during inference . Similarly, GPT-OSS  adds a learnable parameter to the denominator of the softmax function, which is equivalent to creating a “hidden” sink token with constant attraction, thereby alleviating the sink issue without requiring a physical placeholder token. Sigmoid attention  avoids the attention sink entirely, but at the cost of model performance and training stability.
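
The denominator trick can be made concrete with a minimal numpy sketch (an illustration of the idea, not GPT-OSS's actual implementation): adding a constant `exp(sink_logit)` term to the softmax denominator is exactly equivalent to appending a hidden sink token with that logit, so the real tokens' weights sum to less than one and the residual mass is absorbed invisibly.

```python
import numpy as np

def softmax_with_sink(scores, sink_logit=0.0):
    """Softmax whose denominator carries an extra exp(sink_logit) term.

    Equivalent to appending a hidden sink token with a constant logit:
    the real tokens' weights sum to < 1, and the residual mass
    1 - sum(weights) is absorbed by the hidden sink.
    """
    m = max(scores.max(), sink_logit)        # stabilize against overflow
    exp_s = np.exp(scores - m)
    denom = exp_s.sum() + np.exp(sink_logit - m)
    return exp_s / denom

scores = np.array([2.0, 1.0, 0.5])
w = softmax_with_sink(scores, sink_logit=1.0)
# w.sum() < 1: the missing mass went to the hidden sink
```

Because the sink logit is learnable, the model can decide per head how much "background" mass to discard instead of dumping it on the first token.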

Sparse and Efficient Attention

Another challenge comes from the quadratic cost of self-attention. Fixed-pattern sparsity, exemplified by Longformer , adopts sliding windows with global tokens, which improves efficiency but reduces flexibility. Linformer  takes a different route by approximating attention with low-rank projections, reducing the cost to linear time.

More recent work has looked at dynamic and normalization based mechanisms. Among them, StreamingLLM  introduced the notion of an “attention sink,” caching only a few initial tokens to enable million-token inference even for models trained on short windows. Sigmoid Attention  implemented a sigmoid variant of attention in a hardware-friendly way, achieving more than 15% faster inference without sacrificing accuracy. These approaches illustrate how relatively small design changes can make long-context inference practical.

A complementary direction is cache optimization. DuoAttention  splits heads into retrieval and streaming roles, keeping full key–value caches only for retrieval to reduce memory. RetroAttention  refreshes caches with new entries to improve coverage of early tokens. FlexPrefill  adapts the prefill stage by selecting query-aware sparse indices, balancing accuracy and efficiency during initialization. Together, these methods show how careful memory management can extend usable context length without prohibitive compute costs.

Research on sparse and efficient attention has moved from fixed patterns to adaptive and cache-aware designs. Instead of focusing solely on reducing asymptotic complexity, recent methods also aim to balance speed, memory footprint, and robustness in real-world deployment.

Detailed Experiment Settings for Reproducibility

Datasets

All experiments use subsets of FineWeb-Edu . We train on 10B-token (10BT) and 100B-token (100BT) pools; preprocessing applies a global shuffle (seed=42). The 15B and 30B budgets are obtained deterministically by taking the first 15B / 30B tokens from the start of the shuffled 100BT subset. Ablation runs use the full 10BT pool, which is shuffled once in preprocessing and then used in a fixed order during training. For all model sizes and runs, the training data and its order are identical (the same shuffled pool is used), ensuring data-level consistency across experiments.

Implementation details

We align our training recipe with prior work (Gated DeltaNet, Titans). All runs use the Llama 2 tokenizer (32k vocab), sequence length 4096, and a global batch of 0.5M tokens. Optimization uses adamw_torch_fused with $`(\beta_1,\beta_2)=(0.9,0.95)`$, weight decay $`0.01`$, peak LR $`4\times10^{-4}`$ and a cosine_with_min_lr schedule; warmup = 1024 steps for 10B/15B runs and 2048 steps for 30B. We initialize $`\tau=-1.0`$ and set the learnable bias window size to $`W=512`$. Ablation runs use the 10B subset with context length 512. For transparency and reproducibility: the 340M results are from models we trained under the above protocol; 730M (Titans) numbers are cited from the original papers and marked as reported.
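
For convenience, the hyperparameters above can be collected in a plain configuration dict (a sketch only; key names loosely follow Hugging Face TrainingArguments conventions, and the actual launcher fields in the Flame stack may differ):

```python
# Training hyperparameters from the protocol described above.
# Key names are illustrative, not the exact Flame launcher schema.
train_config = {
    "tokenizer": "llama-2",             # 32k vocabulary
    "seq_len": 4096,
    "global_batch_tokens": 524_288,     # 0.5M tokens per step
    "optim": "adamw_torch_fused",
    "adam_beta1": 0.9,
    "adam_beta2": 0.95,
    "weight_decay": 0.01,
    "learning_rate": 4e-4,              # peak LR
    "lr_scheduler_type": "cosine_with_min_lr",
    "warmup_steps": 1024,               # 2048 for the 30B-token runs
    "tau_init": -1.0,                   # Elastic-Softmax offset init
    "bias_window": 512,                 # learnable bias window W
}
```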

All models and baselines are implemented on top of the FLA  and Flame  stack, which provides Triton-based efficient-attention kernels and standard LM training utilities. For sigmoid-attention experiments we replace the FlashAttention kernel with the authors’ Flash-Sigmoid implementation integrated into the same stack.

Evaluation metrics

Normalized accuracy (acc_n).

Normalized accuracy adjusts raw accuracy for dataset-specific chance performance. We compute a chance baseline for each dataset (e.g., majority class or random-chance) and rescale accuracy relative to this baseline so that scores are comparable across datasets of different difficulty.
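
One common form of this rescaling maps chance-level performance to 0 and perfect accuracy to 1 (a sketch of the idea; the evaluation harness may differ in detail):

```python
def normalized_accuracy(acc, chance):
    """Rescale raw accuracy against a dataset-specific chance baseline.

    Chance-level performance maps to 0 and perfect accuracy to 1, so
    scores are comparable across datasets of different difficulty.
    """
    return (acc - chance) / (1.0 - chance)

# e.g. 4-way multiple choice (chance = 0.25):
# raw accuracy 0.40 rescales to (0.40 - 0.25) / 0.75 = 0.20
normalized_accuracy(0.40, 0.25)
```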

Attention density (density).

Attention density measures how much attention weight is assigned to non-first tokens. Concretely, we compute the mean attention mass assigned to keys other than the first token, averaged over all queries, heads, and layers. Higher attention density indicates richer attention allocation among tokens, while lower density indicates more focused and sparser attention.

Sink ratio (sink).

The sink ratio measures the mean attention weight that all queries assign to the first token, i.e., the sink token. It reports how severe the attention sink phenomenon is. For a standard Transformer using the softmax function, the sink ratio and the attention density sum to one.
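
Both metrics can be computed from a stored attention tensor in a few lines (a reference sketch assuming an attention array of shape `(layers, heads, queries, keys)` whose rows sum to one):

```python
import numpy as np

def density_and_sink(attn):
    """Compute attention density and sink ratio.

    attn: array of shape (layers, heads, queries, keys) whose rows are
          softmax-normalized attention weights.
    Returns (density, sink): mean mass on non-first keys vs. on the
    first key, averaged over all queries, heads, and layers.
    """
    sink = attn[..., 0].mean()               # mass placed on key 0
    density = attn[..., 1:].sum(-1).mean()   # mass on all other keys
    return density, sink

# For standard softmax attention each row sums to one, so the two
# averaged quantities also sum to one: density + sink == 1.
```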

Benchmarks

Dataset Sample Size
Wikitext 60,634
Lambada 60,000
PIQA 16,113
Hellaswag 70,000
WinoGrande 44,000
ARC 7,787 (Easy Set + Challenge Set)
SIQA 15,554
BoolQ 15,942

Table 1: The statistics of the benchmarks used in the overall experiments.

For our overall experiment, we compare models on eight common language modeling and commonsense reasoning tasks, summarized in Table 1:

Wikitext : A large linguistic corpus extracted from Wikipedia articles, containing over 100 million word tokens. It tests a model’s ability to predict the next word in a passage of text.

Lambada : The LAMBADA dataset tests a model’s capability of using broad discourse context to predict the last word of a passage extracted from books. It contains over 60,000 examples.

PIQA : The Physical Interaction: Question Answering (PIQA) dataset tests commonsense reasoning about physical interactions between two entities. It contains 16,113 multiple choice questions generated from crowd-sourcing.

Hellaswag : The HellaSwag dataset consists of 70,000 multiple choice questions about inferring what might happen next in a story. It requires commonsense reasoning to choose the most plausible ending.

WinoGrande : The WinoGrande dataset tests coreference resolution and commonsense reasoning with 44,000 examples obtained from books and websites.

ARC : The AI2 Reasoning Challenge (ARC) dataset contains 7,787 genuine grade-school level, multiple-choice science questions, grouped into an Easy Set (ARC-e) and a Challenge Set (ARC-c).

SIQA : The Social Interaction QA (SIQA) dataset contains 15,554 multiple choice questions that describe situations about people’s social interactions.

BoolQ : The Boolean Questions (BoolQ) dataset contains 15,942 English yes/no questions sampled from Google search queries to test a model’s ability to answer simple questions.

Figure 1: Learned position-dependent attention biases across all layers and attention heads. The max range of the bias is 1024, the same as the pre-training length.
Figure 2: Learned position-dependent attention biases across all layers and attention heads with RoPE. The max range of the bias is 512, while the pre-training length is 1024.

Learned Attention Bias

As demonstrated in Figure 1, the learnable attention biases exhibit effective values within a range of approximately 100 tokens. When models are pre-trained with RoPE or ALiBi, which inherently impose strong long-range decay, the learnable range becomes further restricted to within 20 tokens. Beyond this range, the attention weights severely diminished by RoPE prevent the model from learning meaningful positional biases, resulting instead in periodic fluctuations without useful patterns, as shown in Figure 2. This also explains why ALiBi slopes show negligible changes when trained from the standard initialization: the initial long-range decay is so pronounced that it prevents the model from learning useful positional information.

Figure 3: The learnable threshold of each attention head in each layer.

Elastic Softmax

To better understand the behavior of Elastic-Softmax, we visualize all learnable offsets $`\tau^{(h)}`$ across layers and heads. In Fig. 3, the vertical axis indexes Transformer layers (bottom to top), and the horizontal axis enumerates the attention heads within each layer. Each cell corresponds to a head–layer pair, with color intensity representing the magnitude of the learned offset.

Early layers

Offsets are generally large, producing aggressive suppression of weak attention scores. This enforces sparse local focus, anchoring each query to nearby tokens and preventing early layers from being dominated by global noise or sink concentration at the sequence start. In effect, these layers act as filters that stabilize shallow representations.

Middle layers

Offsets decrease to moderate values. This allows informative cross-token connections to form. Within each layer, different heads learn distinct thresholds: some sustain high values to enforce conservative structural focus, while others adopt lower values that facilitate phrase-level or mid-range integration. This head-wise heterogeneity indicates a functional division of labor.

Upper layers

Offsets exhibit increased head-wise variance rather than uniformly small values. Suppression becomes heterogeneous: some heads keep higher thresholds to preserve sparse, high-precision links, while others lower thresholds to admit context. This stage functions as a final refinement that selectively emphasizes the most relevant tokens and avoids sink-like redistribution.

Overall, the learned offsets follow a hierarchical focusing pattern: aggressive pruning in the lower layers, stable selective filtering in the middle, and diverse refinement in the upper layers. Through the subtraction-ReLU mechanism, Elastic-Softmax achieves true sparsification by eliminating negligible weights while preventing sink-like accumulation, thereby maintaining a balanced trade-off between coverage and focus that improves both representational stability and predictive accuracy.

Compatibility with FlashAttention

Our Elastic-Softmax remains compatible with the FlashAttention framework, but requires two sequential passes due to the elastic filtering.

Pass 1: Softmax Statistics

For each query position $`i`$ under head $`h`$, compute the numerically stable softmax weights:

\begin{equation}
\tilde{\alpha}_{ij}^{(h)} = \frac{\exp(s_{ij}^{(h)} - m_i)}{\sum_{k=1}^i \exp(s_{ik}^{(h)} - m_i)},
\quad 
m_i = \max_{1 \leq k \leq i} s_{ik}^{(h)}.
\end{equation}

This step is identical to the first stage of FlashAttention and only requires computing the max and normalization statistics. No filtering is applied yet.

Pass 2: Elastic Process

Apply the learnable offset $`\tau^{(h)}`$ and ReLU filter to obtain sparse attention:

\begin{equation}
\alpha_{ij}^{(h)} = \operatorname{ReLU}\!\left(
\tilde{\alpha}_{ij}^{(h)} + \frac{\tau^{(h)}}{i}
\right),
\quad
\text{Attn}_i^{(h)} = \sum_{j=1}^i \alpha_{ij}^{(h)} v_j^{(h)}.
\end{equation}

Here, the offset is distributed evenly across the $`i`$ candidate tokens, while the ReLU ensures non-negativity of the resulting weights, effectively setting suppressed positions to zero. This second pass involves only element-wise operations and a weighted sum, which can be fused into the FlashAttention kernel with minimal overhead.

Summary.

Although Elastic-Softmax requires two sequential passes, both reuse the same memory-efficient, blockwise data layout as FlashAttention and add only element-wise work per query. Thus, the overall complexity remains $`O(n^2)`$ in time and $`O(n)`$ in memory.
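
The two passes can be sketched for a single head as plain numpy (a reference implementation of the equations above, not the fused Triton kernel; shapes and variable names are illustrative):

```python
import numpy as np

def elastic_softmax_attention(s, v, tau):
    """Two-pass Elastic-Softmax for one attention head.

    s:   (n, n) causal attention scores (only the lower triangle is used)
    v:   (n, d) value vectors
    tau: scalar learnable offset for this head (e.g. initialized to -1.0)
    """
    n, d = v.shape
    out = np.zeros((n, d))
    for i in range(n):
        row = s[i, : i + 1]
        # Pass 1: numerically stable softmax statistics (as in FlashAttention)
        m = row.max()
        w = np.exp(row - m)
        w = w / w.sum()
        # Pass 2: spread the offset over the i+1 candidates and clip
        # negatives to zero -- suppressed positions become exactly sparse
        alpha = np.maximum(w + tau / (i + 1), 0.0)
        out[i] = alpha @ v[: i + 1]
    return out
```

With `tau = 0` the filter is inactive and the function reduces to standard causal softmax attention; with a negative `tau`, weights below the per-row threshold `|tau| / (i + 1)` are zeroed out.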

Statistical Properties of the Sink Token

Setup.

To examine how the sink signal is carried through the Transformer layers, we measure the statistics of hidden states and Q/K/V vectors under two input formats: (i) natural text and (ii) repeated tokens. We report results on Qwen3-4B and BLOOM, which use different RPEs (RoPE and ALiBi, respectively; Tables [tab:two_tables_side_by_side][tab:bloom_hs_side_by_side]).
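
The per-token statistics we report (the Norm and Var columns in the tables of this section) can be computed as follows (a minimal sketch, assuming the hidden states or Q/K/V rows of one layer are available as a 2-D array):

```python
import numpy as np

def token_stats(hidden):
    """Per-token L2 norm and per-token variance of a hidden-state matrix.

    hidden: (seq_len, d_model) hidden states (or Q/K/V rows) of one layer.
    Returns two arrays of shape (seq_len,): the L2 norm of each token's
    vector and the variance of its components.
    """
    norms = np.linalg.norm(hidden, axis=-1)
    variances = hidden.var(axis=-1)
    return norms, variances
```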

Key observation 1: a low-variance footprint in representations.

With natural text, token representations are heterogeneous: the norms and variances of Q/K/V vectors and hidden-states vary across positions, and spread broadly (Qwen3-4B: Tables [tab:nonrepeated_qkv], [tab:hs_postln_nonrep]; BLOOM: Tables [tab:bloom_qkv_nonrep], [tab:bloom_hs_nonrep]). In contrast, the sink token exhibits a distinctive representational pattern: its hidden-state variance and the variance of its value vectors are markedly smaller than those of neighboring tokens (Qwen3-4B: Tables [tab:repeated_qkv], [tab:hs_postln_sink]; BLOOM: Tables [tab:bloom_qkv_rep], [tab:bloom_hs_rep]). This indicates that the sink effect propagates through the model via a low-variance state that is easy to recognize.

Key observation 2: repetition collapses positional separability.

Under repeated-token inputs, all summary statistics flatten out: Q/K/V norms and variances, as well as hidden-state norms and variances, become nearly uniform across positions (Qwen3-4B: Tables [tab:repeated_qkv], [tab:hs_postln_sink]; BLOOM: Tables [tab:bloom_qkv_rep], [tab:bloom_hs_rep]). In effect, every position behaves like a sink token once semantic differences are removed, indicating that the model can no longer disambiguate identical tokens by their representations alone.

Implication for RPEs (RoPE/ALiBi).

This collapse appears under both RoPE and ALiBi, which suggests that relative position encodings chiefly act by shaping attention scores (i.e., injecting position-sensitive biases) rather than leaving persistent positional information in the hidden states. Consequently, representation-level positional separability is absent when inputs repeat, and sink-like dynamics emerge at all positions. These findings motivate our design to (i) strengthen positional discrimination at the scoring level via RoPE combined with learnable, head-wise distance biases, and (ii) suppress sink-driven background attention weight with Elastic-Softmax.

Qwen3-4B, natural text:

Idx ID Token Norm Var
0 12522 Once 7.1974 0.0202
1 5193 upon 16.5617 0.1071
2 264 a 15.0514 0.0885
3 882 time 16.9322 0.1120
4 11 , 14.6386 0.0837
5 304 in 16.5990 0.1076
6 264 a 15.3230 0.0917
7 4268 land 16.3703 0.1047
8 3041 far 16.5625 0.1072
9 11 , 16.9450 0.1121
10 3041 far 17.3868 0.1181
11 3123 away 17.6512 0.1217
12 11 , 16.5142 0.1065
13 1052 there 14.1547 0.0782
14 12163 lived 16.1331 0.1017

Qwen3-4B, repeated tokens:

Idx ID Token Norm Var
0 19309 sink 7.1961 0.0202
1 19309 sink 7.1961 0.0202
2 19309 sink 7.1961 0.0202
3 19309 sink 7.1961 0.0202
4 19309 sink 7.1960 0.0202
5 19309 sink 7.1959 0.0202
6 19309 sink 7.1959 0.0202
7 19309 sink 7.1961 0.0202
8 19309 sink 7.1906 0.0202
9 19309 sink 7.1959 0.0202
10 19309 sink 7.1962 0.0202
11 19309 sink 7.1959 0.0202
12 19309 sink 7.1961 0.0202
13 19309 sink 7.1962 0.0202
14 19309 sink 7.1961 0.0202

BLOOM, natural text:

Idx ID Token Norm Var
0 64393 Once 37.6291 1.3823
1 14591 upon 46.2054 2.0837
2 267 a 43.6723 1.8613
3 3509 time 46.7614 2.1333
4 15 , 42.5041 1.7633
5 361 in 42.2835 1.7447
6 267 a 42.9327 1.7988
7 11970 land 44.4018 1.9240
8 8372 far 45.2311 1.9961
9 15 , 46.5993 2.1190
10 8372 far 48.5840 2.3038
11 14723 away 44.8350 1.9613
12 15 , 39.5864 1.5297
13 2782 there 44.0638 1.8954
14 65532 lived 44.2725 1.9125

BLOOM, repeated tokens:

Idx ID Token Norm Var
0 66037 sink 37.6223 1.3818
1 66037 sink 37.6223 1.3818
2 66037 sink 37.6223 1.3818
3 66037 sink 37.6223 1.3818
4 66037 sink 37.6223 1.3818
5 66037 sink 37.6223 1.3818
6 66037 sink 37.6223 1.3818
7 66037 sink 37.6223 1.3818
8 66037 sink 37.6223 1.3818
9 66037 sink 37.6223 1.3818
10 66037 sink 37.6223 1.3818
11 66037 sink 37.6223 1.3818
12 66037 sink 37.6223 1.3818
13 66037 sink 37.6223 1.3818
14 66037 sink 37.6223 1.3818