SeeDNorm: Self-Rescaled Dynamic Normalization

Notice: This research summary and analysis were automatically generated using AI. For absolute accuracy, please refer to the original arXiv source.

Normalization layers are an essential component of neural networks. In transformers, the predominantly used RMSNorm constrains vectors to a unit hypersphere and then rescales each dimension with a learnable scaling coefficient $γ$ to maintain the representational capacity of the model. However, RMSNorm discards the input-norm information in the forward pass, and a static scaling factor $γ$ may be insufficient to accommodate the wide variability of input data and distributional shifts, thereby limiting further performance improvements, particularly in the zero-shot scenarios that large language models routinely encounter. To address this limitation, we propose SeeDNorm, which enhances the representational capability of the model by dynamically adjusting the scaling coefficient based on the current input, thereby preserving input-norm information and enabling data-dependent, self-rescaled dynamic normalization. During backpropagation, SeeDNorm retains RMSNorm's ability to dynamically adjust gradients according to the input norm. We provide a detailed analysis of the training optimization of SeeDNorm and propose corresponding solutions to address potential instability issues that may arise when applying it. We validate the effectiveness of SeeDNorm across models of varying sizes in large language model pre-training as well as supervised and unsupervised computer vision tasks. By introducing a minimal number of parameters and with negligible impact on model efficiency, SeeDNorm achieves consistently superior performance compared to previously common normalization layers such as RMSNorm and LayerNorm, as well as element-wise activation alternatives to normalization layers such as DyT.


💡 Research Summary

The paper introduces Self‑Rescaled Dynamic Normalization (SeeDNorm), a novel normalization layer designed to overcome the limitations of RMSNorm in large‑scale Transformer models. RMSNorm normalizes each token by its root‑mean‑square (RMS) and then rescales the result with a static, learnable vector γ. While this design stabilizes training, it discards the original input norm and relies on a fixed scaling factor, which can be sub‑optimal when data distributions shift or when models are evaluated in zero‑shot scenarios.
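For reference, here is a minimal NumPy sketch of the RMSNorm baseline described above (the function name and the ε stabilizer are illustrative, not taken from the paper):

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    # Normalize each token by its root-mean-square, then rescale
    # dimension-wise with the static learnable vector gamma.
    rms = np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)
    return (x / rms) * gamma

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8))   # two tokens, D = 8
gamma = np.ones(8)                # static scale, initialized to 1
y = rms_norm(x, gamma)            # each row now has (approximately) unit RMS
```

Note that γ is fixed after training: every input is rescaled by the same per-dimension coefficients, which is precisely the limitation SeeDNorm targets.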

SeeDNorm augments RMSNorm with two additional learnable vectors, β and α, that enable a data‑dependent, dynamic scaling of the normalized output. For an input token x∈ℝ^{1×D}, the layer first computes the inner product x·βᵀ (a scalar), passes it through a bounded non‑linearity σ (implemented as tanh), and multiplies the result by α, yielding a dynamic scaling vector s = σ(x·βᵀ)·α. The final output is

  y = (γ + s) ⊙ x/RMS(x) = (γ + σ(x·βᵀ)·α) ⊙ x/RMS(x),

where RMS(x) = √((1/D)·Σᵢ xᵢ²) and ⊙ denotes element-wise multiplication. Because σ(x·βᵀ) depends on the token itself, the effective scale varies with the input; setting α = 0 recovers standard RMSNorm.
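The layer can be sketched in a few lines of NumPy. This is an illustrative implementation, assuming the dynamic term s is added to the static RMSNorm scale γ before dimension-wise rescaling; the names and the ε stabilizer are not from the paper's reference code:

```python
import numpy as np

def seednorm(x, gamma, alpha, beta, eps=1e-6):
    # Dynamic scale: one scalar per token from the inner product x·βᵀ,
    # squashed by tanh, then broadcast through the per-dimension vector alpha.
    s = np.tanh(x @ beta)[..., None] * alpha          # shape (..., D)
    rms = np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)
    # Static gamma plus input-dependent s rescales the normalized token.
    return (gamma + s) * (x / rms)

D = 8
rng = np.random.default_rng(0)
x = rng.standard_normal((2, D))
gamma = np.ones(D)                    # static scale, as in RMSNorm
alpha = np.zeros(D)                   # alpha = 0 recovers plain RMSNorm
beta = 0.01 * rng.standard_normal(D)
y = seednorm(x, gamma, alpha, beta)
```

Initializing α at zero is a natural choice under this formulation: the layer starts out behaving exactly like RMSNorm and learns how much input-dependent rescaling to apply.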

