Enhanced QKNorm normalization for neural transformers with the Lp norm
The normalization of query and key vectors is an essential part of the Transformer architecture: it ensures that learning remains stable regardless of the scale of these vectors. Several normalization approaches have been proposed. In this preliminary work, a generalization of the QKNorm normalization scheme is presented. The approach is based on the Lp norm, allowing non-Euclidean norms to be employed. Experimental results demonstrate the suitability of the method on a simple problem.
💡 Research Summary
The paper introduces a generalization of Query‑Key Normalization (QKNorm) for Transformer models by replacing the conventional ℓ2‑based vector normalization with an ℓp‑based normalization, where p ≥ 1 is a tunable hyper‑parameter. In standard QKNorm, each query and key vector is ℓ2‑normalized to unit length and a learnable scalar α replaces the fixed √dₖ scaling factor, thereby stabilizing the dot‑product attention logits. The authors argue that fixing the metric to Euclidean distance limits the ability to shape the geometry of the attention space. By adopting an ℓp norm, the magnitude of each component is weighted according to its absolute value raised to the power p, so larger p values emphasize the largest components while smaller p values distribute weight more evenly. This provides a continuous knob that can control how “spiky” or diffuse the attention distribution becomes.
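The effect of p on the normalized vector can be illustrated with a small sketch (an illustrative example, not code from the paper; the function name `lp_normalize` and the `eps` guard are our own choices):

```python
import numpy as np

def lp_normalize(x, p=2.0, eps=1e-8):
    """Normalize x by its l_p norm: x / ||x||_p (eps avoids division by zero)."""
    norm = np.sum(np.abs(x) ** p) ** (1.0 / p)
    return x / (norm + eps)

v = np.array([3.0, 1.0, 0.5])
# As p grows, ||v||_p shrinks toward max|v_i|, so the dominant component
# of the normalized vector approaches 1 while small components keep more
# even weight at small p -- the "spiky vs. diffuse" knob described above.
for p in (1.0, 2.0, 4.0):
    print(p, np.round(lp_normalize(v, p), 3))
```

For p = 2 this reduces to the standard unit-length normalization of QKNorm.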
The proposed method, termed QK‑Lp‑Norm, computes normalized queries ˆq(p) = q/‖q‖ₚ and keys ˆk(p) = k/‖k‖ₚ, then forms attention logits S(p) = α·ˆQ(p)·ˆK(p)ᵀ. Because the vectors are ℓp‑normalized, the inner product is bounded, preserving numerical stability. The final attention output is obtained by applying softmax to S(p) and multiplying by the value matrix V, exactly as in standard attention.
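The full attention computation can be sketched as follows (a minimal single-head version assuming row-wise Q, K, V matrices; in the paper α is a learnable scalar, here it is passed as a constant):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def lp_normalize(x, p, eps=1e-8):
    # Row-wise l_p normalization: each row divided by its l_p norm.
    norm = np.sum(np.abs(x) ** p, axis=-1, keepdims=True) ** (1.0 / p)
    return x / (norm + eps)

def qk_lp_attention(Q, K, V, p=2.0, alpha=1.0):
    """QK-Lp-Norm attention: S(p) = alpha * Qhat(p) @ Khat(p)^T,
    output = softmax(S(p)) @ V. With p=2 this is standard QKNorm."""
    Qh = lp_normalize(Q, p)
    Kh = lp_normalize(K, p)
    S = alpha * (Qh @ Kh.T)          # bounded logits, scaled by alpha
    return softmax(S, axis=-1) @ V
```

Because only the normalization exponent changes relative to QKNorm, the extra cost is a per-element power and root, which is consistent with the negligible overhead reported below.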
To evaluate the approach, the authors use a lightweight nanoGPT decoder‑only architecture (6 layers, 6 heads, embedding dimension 384) trained on the Tiny Shakespeare character‑level dataset. They perform a 10‑fold cross‑validation sweep over seven p values: {1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0}, resulting in 70 training runs on a single NVIDIA DGX A100 GPU. Validation cross‑entropy loss curves show that p > 2 consistently yields lower minimum loss than the baseline p = 2 (standard QKNorm). Specifically, the best average minima are 1.373 (p = 2.5), 1.365 (p = 3.0), 1.362 (p = 3.5), and 1.357 (p = 4.0) compared with 1.405 for p = 2. Moreover, the loss reaches its minimum earlier for p > 2, indicating faster convergence. Training time remains essentially unchanged across p values (≈ 360–363 seconds), confirming that the ℓp computation introduces negligible overhead.
The discussion interprets p as a “feature span” controller: higher p focuses attention on a smaller set of high‑magnitude components, effectively narrowing the set of features the model deems relevant. This aligns with the observed performance gains and stable training dynamics. The authors note, however, that their experiments are limited to a small character‑level model and a single dataset; scalability to large language models, multilingual corpora, or other modalities remains untested. They also acknowledge that extreme p values (approaching ∞) could discard useful information, a scenario not explored in depth.
In conclusion, the ℓp‑based QKNorm extends the design space of attention normalization without incurring additional computational cost, and empirically improves validation loss and convergence speed on a benchmark task. Future work is suggested in four directions: (1) extensive evaluation on larger models and diverse datasets, (2) investigation of interactions between the ℓp‑norm and other normalization schemes such as LayerNorm or Peri‑LN, (3) learning p as a differentiable parameter or developing adaptive strategies, and (4) theoretical analysis of how the ℓp‑norm influences attention head diversity and model expressivity.