Understanding Scaling Laws in Deep Neural Networks via Feature Learning Dynamics

The empirical success of deep learning is often attributed to scaling laws that predict consistent gains as model size, data, and compute grow; however, large models can exhibit training instability and diminishing returns, suggesting that scaling laws describe what success looks like but not when and why scaling succeeds or fails. A central obstacle is the lack of a rigorous understanding of feature learning at large depth. While muP characterizes feature-learning dynamics in the infinite-width limit and enables hyperparameter transfer across width, its depth extension (depth-muP) breaks down for residual blocks with more than one internal layer. We derive Neural Feature Dynamics (NFD) for ResNets with single-layer residual blocks, characterizing feature learning via a coupled forward-backward stochastic system in the joint infinite-width and infinite-depth limit. In this regime, NFD identifies when scaling-law trends persist and explains diminishing returns. It also reveals a vanishing mechanism induced by the 1/sqrt(depth) residual scaling under which the gradient-independence assumption (GIA), known to fail during training at finite depth, becomes provably valid again at infinite depth, yielding an analytically tractable regime for end-to-end feature learning. Motivated by this insight, we study two-layer residual blocks and show that the same mechanism causes feature-learning collapse in the first internal layer at large depth, providing a structural explanation for the empirical failure of depth-muP. Based on this diagnosis, we propose a depth-aware learning-rate correction that counteracts the collapse and empirically restores depth-wise hyperparameter transfer, yielding stronger performance in deeper ResNets.
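To make the analyzed setting concrete, here is a minimal PyTorch sketch of a ResNet with single-layer residual blocks and the 1/sqrt(depth) branch scaling the abstract refers to. The class and argument names (SingleLayerResNet, width, num_blocks) and the choice of linear layers with ReLU are illustrative assumptions, not the paper's code; the essential features are that each residual branch contains exactly one trainable layer and is scaled by 1/sqrt(L).

```python
import torch
import torch.nn as nn


class SingleLayerResNet(nn.Module):
    """ResNet with single-layer residual blocks and 1/sqrt(depth) scaling.

    Illustrative sketch of the architecture class analyzed by NFD;
    layer types and initialization details are assumptions, not the
    paper's exact setup.
    """

    def __init__(self, width: int, num_blocks: int):
        super().__init__()
        self.num_blocks = num_blocks
        # Each residual branch has exactly one internal trainable layer.
        self.blocks = nn.ModuleList(
            [nn.Linear(width, width, bias=False) for _ in range(num_blocks)]
        )
        self.act = nn.ReLU()

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # The 1/sqrt(L) factor keeps the sum of L residual contributions
        # O(1) as depth grows; this is the scaling behind the vanishing
        # mechanism discussed in the abstract.
        scale = 1.0 / self.num_blocks ** 0.5
        for block in self.blocks:
            h = h + scale * self.act(block(h))
        return h


# Example: depth 64 at width 128.
net = SingleLayerResNet(width=128, num_blocks=64)
x = torch.randn(32, 128)
out = net(x)  # shape: (32, 128)
```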


💡 Research Summary

The empirical success of deep learning is often attributed to scaling laws that predict consistent gains as model, data, and compute resources grow; however, large models can exhibit training instability and diminishing returns. This paper analyzes feature learning dynamics in the joint infinite-width and infinite-depth limit for ResNets with single-layer residual blocks, providing insights into when scaling-law trends persist and why diminishing returns occur. The authors derive Neural Feature Dynamics (NFD) to characterize feature learning via a coupled forward-backward stochastic system. NFD identifies that under 1/sqrt(depth) residual scaling, the gradient-independence assumption (GIA), which fails at finite depth during training, becomes valid again in the infinite-depth limit, making end-to-end feature learning analytically tractable. The paper also examines two-layer residual blocks and finds a similar mechanism causing feature-learning collapse in the first internal layer at large depths, explaining why depth-muP fails empirically. Based on this analysis, they propose a depth-aware learning-rate correction to counteract the collapse and restore depth-wise hyperparameter transfer, leading to improved performance in deeper ResNets.
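The abstract does not spell out the exact form of the depth-aware learning-rate correction, so the following is only a hypothetical sketch of how such a correction could be wired up: per-parameter-group learning rates that boost the first internal layer of each two-layer residual block, the layer where feature learning is said to collapse at large depth. The helper name depth_corrected_param_groups and the sqrt(L) boost factor are illustrative assumptions, not the paper's prescription.

```python
import torch


def depth_corrected_param_groups(blocks, base_lr: float):
    """Hypothetical depth-aware learning-rate correction (illustrative).

    Boosts the learning rate of the *first* internal layer of each
    two-layer residual block by sqrt(num_blocks) to counteract the
    feature-learning collapse described above. The sqrt(L) factor is
    an assumption for illustration, not the paper's exact rule.
    """
    boost = len(blocks) ** 0.5
    first_layers, second_layers = [], []
    for first, second in blocks:  # each block = (first_layer, second_layer)
        first_layers += list(first.parameters())
        second_layers += list(second.parameters())
    return [
        {"params": first_layers, "lr": base_lr * boost},
        {"params": second_layers, "lr": base_lr},
    ]


# Example: 64 two-layer residual blocks at width 128.
width, num_blocks = 128, 64
blocks = [
    (torch.nn.Linear(width, width), torch.nn.Linear(width, width))
    for _ in range(num_blocks)
]
optimizer = torch.optim.SGD(
    depth_corrected_param_groups(blocks, base_lr=0.1), lr=0.1
)
```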

