Step by Step: Adaptive Gradient Descent for Training L-Lipschitz Neural Networks
We demonstrate that applying an eventual decay to the learning rate (LR) during empirical risk minimization (ERM) ensures Lipschitz regularity of the trained network: when the mean-squared-error loss of a two-layer neural network with Lipschitz activation functions is minimized by standard gradient descent (GD), the resulting network exhibits a small Lipschitz constant. Moreover, we show that this decay does not hinder the convergence rate of the empirical risk, now measured with the Huber loss, toward a critical point of the non-convex empirical risk. From these findings, we derive generalization bounds for two-layer neural networks trained with GD and a decaying LR that depend only sub-linearly on the number of trainable parameters, suggesting that the statistical behaviour of these networks is independent of overparameterization. We validate our theoretical results with a series of toy numerical experiments, where, surprisingly, we observe that networks trained with constant-step-size GD exhibit learning and regularity properties similar to those trained with a decaying LR. This suggests that neural networks trained with standard GD may already be highly regular learners.
💡 Research Summary
This paper investigates how a simple learning‑rate (LR) decay schedule can simultaneously guarantee Lipschitz regularity and good generalization for two‑layer multilayer perceptrons (MLPs) trained by standard gradient descent (GD) on the mean‑squared‑error (MSE) loss. The authors start from the observation that Lipschitz continuity of neural networks is highly desirable: it yields robustness to input perturbations, improves generalization bounds, and underpins Wasserstein‑GAN and optimal‑transport based models. Existing approaches enforce Lipschitzness either by adding explicit regularization terms to the loss or by constraining the network architecture, both of which increase implementation complexity or limit expressive power. The central question of the work is whether ordinary GD, equipped with an appropriately decaying step size, can achieve the same effect without any architectural changes.
Problem setting.
The study focuses on a single‑hidden‑layer MLP with p hidden units, input dimension d₀, and scalar output. The activation σ is assumed to be 1‑Lipschitz and differentiable almost everywhere (e.g., ReLU, tanh). Parameters are initialized independently from an isotropic sub‑Gaussian distribution with variance scaling (Xavier/He style). The training objective is the empirical MSE risk
R_S(θ)= (1/N)∑_{n=1}^N ‖f_θ(x_n)−y_n‖²,
while performance is evaluated with the 1‑Lipschitz Huber loss ℓ, which shares the same minimizers as MSE but enjoys better statistical properties.
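To make the setting concrete, here is a minimal NumPy sketch of the objects above: the single-hidden-layer MLP, the empirical MSE risk R_S, and the 1-Lipschitz Huber loss used for evaluation. The He-style initialization scale and the Huber threshold δ = 1 are illustrative choices, not quantities fixed by the paper.

```python
import numpy as np

def two_layer_mlp(x, W1, b1, w2):
    """Single-hidden-layer MLP with ReLU activation and scalar output.
    Shapes (illustrative): x (d0,), W1 (p, d0), b1 (p,), w2 (p,)."""
    return w2 @ np.maximum(W1 @ x + b1, 0.0)

def init_params(d0, p, rng):
    """Independent Gaussian init with He-style variance scaling (an
    instance of the isotropic sub-Gaussian initialization assumed above)."""
    W1 = rng.normal(0.0, np.sqrt(2.0 / d0), size=(p, d0))
    b1 = np.zeros(p)
    w2 = rng.normal(0.0, np.sqrt(2.0 / p), size=p)
    return W1, b1, w2

def empirical_mse_risk(X, y, W1, b1, w2):
    """R_S(theta) = (1/N) * sum_n (f_theta(x_n) - y_n)^2."""
    preds = np.array([two_layer_mlp(x, W1, b1, w2) for x in X])
    return np.mean((preds - y) ** 2)

def huber_loss(r, delta=1.0):
    """Huber loss on a residual r: quadratic for |r| <= delta, linear
    beyond. With delta = 1 its derivative is bounded by 1, so the loss
    is 1-Lipschitz; it shares its minimizers with the squared loss."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta))
```

Note how the Huber loss agrees with 0.5·r² near the origin but grows only linearly in the tails, which is what gives it the better statistical (Lipschitz) behaviour mentioned above.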
Learning‑rate decay framework.
A “rate function” G:
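The summary cuts off before the rate function G is defined, so the following is only a hedged sketch of the general idea: plain GD whose step size is constant for a warm-up phase and then decays. The specific schedule η_t = η₀·t₀/t after step t₀ is an illustrative stand-in, not the paper's G.

```python
import numpy as np

def lr_schedule(t, eta0=0.1, t_decay=100):
    """Eventually-decaying step size: constant up to t_decay, then ~1/t.
    Illustrative assumption standing in for the paper's rate function G."""
    return eta0 if t < t_decay else eta0 * t_decay / t

def gd_with_decay(grad, theta0, steps=500):
    """Standard gradient descent driven by the decaying LR above."""
    theta = np.array(theta0, dtype=float)
    for t in range(1, steps + 1):
        theta -= lr_schedule(t) * grad(theta)
    return theta
```

As a sanity check, running this on the convex toy objective f(θ) = ½‖θ‖² (gradient θ) drives the iterates toward the minimizer 0; on the non-convex empirical risk above, the claim is convergence to a critical point rather than a global minimum.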