Inverse Depth Scaling From Most Layers Being Similar
Neural scaling laws relate loss to model size in large language models (LLMs), yet depth and width may contribute to performance differently, motivating a more detailed study. Here, we quantify how depth affects loss via analysis of LLMs and toy residual networks. We find that loss scales inversely with depth in LLMs, most likely because functionally similar layers reduce error through ensemble averaging rather than through compositional learning or the discretization of smooth dynamics. This regime is inefficient yet robust, and may arise from the architectural bias of residual networks and from target functions incompatible with smooth dynamics. The findings suggest that improving LLM efficiency may require architectural innovations that encourage compositional use of depth.
💡 Research Summary
The paper investigates how depth influences loss in large language models (LLMs) and proposes a refined scaling law that separates width and depth contributions. Traditional neural scaling laws relate loss to total parameter count and dataset size, but they do not explain the distinct roles of depth and width. The authors outline three conceptual regimes for depth usage: (1) compositional assembly, where layers build hierarchical abstractions; (2) procedural assembly, where residual networks approximate smooth dynamical systems (neural ODEs); and (3) ensemble averaging, where many layers act as redundant estimators whose errors cancel out via averaging.
To determine which regime dominates in practice, the authors analyze hidden‑state dynamics across layers in several Pythia models. They compute the angle between successive hidden states, θ(hₗ, hₗ₊₁), and find that for the overwhelming majority of tokens (≈99.6 %) the middle layers update the representation by small, roughly uniform angles, while only the first and last layers produce large rotations. Principal‑component analysis reveals two clusters corresponding to “early‑stop” tokens (mostly document beginnings) and “evenly‑updated” tokens, confirming that most layers behave similarly. Correlations between consecutive update directions are weak, contradicting the smooth‑dynamics assumption of procedural assembly.
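The per-layer angle statistic described above is straightforward to compute. Below is a minimal sketch (not the authors' code) of the measurement: the angle between a hidden state and its successor, illustrated on synthetic vectors where a small residual update produces the kind of small rotation reported for Pythia's middle layers. The dimension (512) and perturbation scale (0.05) are illustrative assumptions.

```python
import numpy as np

def update_angle(h_l, h_next):
    """Angle (radians) between successive hidden states h_l and h_{l+1}."""
    cos = np.dot(h_l, h_next) / (np.linalg.norm(h_l) * np.linalg.norm(h_next))
    return np.arccos(np.clip(cos, -1.0, 1.0))  # clip guards against rounding

# Toy illustration: a small residual update rotates the hidden state
# by a small angle, as reported for middle layers; dimension and
# update scale here are arbitrary choices, not values from the paper.
rng = np.random.default_rng(0)
h = rng.normal(size=512)
h_next = h + 0.05 * rng.normal(size=512)  # small, mostly orthogonal update
print(update_angle(h, h_next))
```

In an actual analysis, `h` and `h_next` would be the hidden states of one token at consecutive transformer layers; collecting these angles over many tokens and layers yields the distribution the authors cluster with PCA.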
The authors then decompose the classic scaling formula L = c_N N^{−α_N} + c_D D^{−α_D} + L₀ into width‑ and depth‑specific terms: L = c_m m^{−α_m} + c_ℓ ℓ^{−α_ℓ} + c_D D^{−α_D} + L₀, where m is width, ℓ is depth, and D is dataset size. Using ~200 data points from the Chinchilla family, they fit this model and obtain α_m ≈ 0.98, α_ℓ ≈ 1.2, and α_D ≈ 0.30. The depth‑dependent component thus scales roughly as ℓ^{−1}, indicating an inverse relationship between loss and depth.
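To make the fitting procedure concrete, here is a small sketch (with hypothetical coefficients chosen near the reported exponents, not the paper's actual fit or data) showing how the depth exponent can be recovered: generate losses from the decomposed law, subtract the depth‑independent terms at fixed width and data, and fit a line in log–log space.

```python
import numpy as np

# Hypothetical coefficients; exponents chosen near the paper's reported fit.
c_m, a_m = 50.0, 0.98
c_l, a_l = 8.0, 1.2
c_D, a_D = 400.0, 0.30
L0 = 1.7

def loss(m, l, D):
    """Decomposed scaling law: width, depth, and data terms plus a floor."""
    return c_m * m**-a_m + c_l * l**-a_l + c_D * D**-a_D + L0

# Vary depth l at fixed width m and data D, then isolate the depth term.
m, D = 2048.0, 1e11
l = np.array([4.0, 8.0, 16.0, 32.0, 64.0])
depth_term = loss(m, l, D) - (c_m * m**-a_m + c_D * D**-a_D + L0)

# A log-log linear fit recovers the depth exponent (slope ≈ -a_l).
slope, _ = np.polyfit(np.log(l), np.log(depth_term), 1)
print(slope)
```

With real loss measurements, all seven parameters would be fit jointly (e.g. via nonlinear least squares) rather than by subtracting known terms; the sketch only illustrates why a power-law depth term appears as a straight line in log–log coordinates.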
To validate the mechanism, they train toy residual networks under controlled data‑generating processes. When teacher weights are identical (smooth target dynamics) the depth exponent is ≈1; when teacher weights are independent (noisy dynamics) the same exponent emerges across temperature settings. This demonstrates that inverse depth scaling can arise both from procedural discretization of smooth dynamics and from ensemble‑averaging of noisy transformations, but the weak inter‑layer correlations observed in LLMs point toward the latter.
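The ensemble-averaging mechanism predicts the observed exponent for a simple statistical reason: averaging ℓ independent, equally noisy estimates of a target reduces the error variance as 1/ℓ. The following sketch (a statistical toy, not the paper's teacher–student setup) checks this numerically by measuring the excess mean-squared error as a function of ensemble size.

```python
import numpy as np

# If each of l "layers" contributes an independent noisy estimate of the
# target, their average has error variance proportional to 1/l -- the
# inverse depth scaling attributed to the ensemble-averaging regime.
rng = np.random.default_rng(0)
target = 1.0
depths = [1, 2, 4, 8, 16, 32, 64]
trials = 200_000

excess = []
for l in depths:
    # Average l unit-variance noisy estimates per trial.
    est = target + rng.normal(0.0, 1.0, size=(trials, l)).mean(axis=1)
    excess.append(np.mean((est - target) ** 2))  # empirical excess MSE

# Fit the scaling exponent of excess error vs. ensemble size.
slope, _ = np.polyfit(np.log(depths), np.log(excess), 1)
print(slope)
```

The fitted slope comes out close to −1, mirroring the ℓ^{−1} depth term: variance reduction alone reproduces the exponent without any layer needing a distinct functional role.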
Overall, the study concludes that most LLM layers function as an ensemble that reduces variance rather than as a hierarchy of increasingly abstract computations. Consequently, current residual architectures are depth‑inefficient. Improving LLM efficiency will likely require architectural innovations that encourage genuine compositional use of depth, such as mechanisms that enforce distinct functional roles for deeper layers or that reduce redundancy across the network.