Convex Dominance in Deep Learning I: A Scaling Law of Loss and Learning Rate
Deep learning has a non-convex loss landscape, and its optimization dynamics are hard to analyze or control. Nevertheless, the dynamics can be empirically convex-like across various tasks, models, optimizers, and hyperparameters. In this work, we examine the applicability of convexity and Lipschitz continuity in deep learning, in order to precisely control the loss dynamics via learning-rate schedules. We show that deep learning quickly becomes weakly convex after a short period of training, and that the loss is predictable via an upper bound on the last iterate, which further informs the scaling of the optimal learning rate. Through the lens of convexity, we build scaling laws of learning rates and losses that extrapolate as much as 80× across training horizons and 70× across model sizes.
💡 Research Summary
The paper “Convex Dominance in Deep Learning I: A Scaling Law of Loss and Learning Rate” investigates why deep neural networks, despite having highly non‑convex loss surfaces, often exhibit optimization dynamics that resemble convex behavior. The authors first review prior observations—such as the prevalence of star‑convex trajectories, the rapid decay of negative Hessian eigenvalues, and local convexity in low‑dimensional slices—and argue that after a brief initial phase the loss landscape becomes “weakly convex.”
Building on this premise, they revisit classic convex-optimization results for stochastic gradient descent (SGD). Under the assumptions that the loss is convex (or star-convex along the optimization path) and that stochastic gradients have bounded second moments (‖g(w)‖² ≤ G²), they derive two key upper bounds: (2.3) for the averaged iterate and (2.4) for any single iterate, both expressed in terms of the learning-rate sequence {ηₜ}. These bounds show that the final loss depends on the sum of the ηₜ and the sum of the ηₜ².
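To make the dependence on Σηₜ and Σηₜ² concrete, here is a minimal sketch of the *standard* averaged-iterate bound for convex SGD, E[f(w̄) − f(w*)] ≤ (R² + G² Σηₜ²) / (2 Σηₜ), where R = ‖w₁ − w*‖ is the initial distance to the optimum. This is the textbook form of such a bound, not necessarily the paper's exact Eq. (2.3); the function name and the constant-schedule example below are illustrative assumptions.

```python
import numpy as np

def averaged_iterate_bound(etas, R, G):
    """Textbook convex-SGD suboptimality bound on the averaged iterate.

    E[f(w_bar) - f(w*)] <= (R^2 + G^2 * sum(eta_t^2)) / (2 * sum(eta_t))
    (a standard form; the paper's Eq. (2.3) may differ in constants).
    """
    etas = np.asarray(etas, dtype=float)
    return (R**2 + G**2 * np.sum(etas**2)) / (2.0 * np.sum(etas))

# With a constant schedule eta = R / (G * sqrt(T)), the bound evaluates
# to exactly R*G / sqrt(T) -- the classic O(1/sqrt(T)) rate.
T, R, G = 10_000, 1.0, 1.0
eta = R / (G * np.sqrt(T))
bound = averaged_iterate_bound(np.full(T, eta), R, G)
print(bound)  # R*G/sqrt(T) = 0.01 here
```

Note how the two sums pull in opposite directions: larger ηₜ shrink the R²/(2Σηₜ) term (faster forgetting of initialization) but inflate the G²Σηₜ²/(2Σηₜ) term (gradient noise), which is exactly the trade-off a learning-rate schedule must balance.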
The authors introduce the notion of a “Qualified Schedule.” A schedule sₜ(T)∈