Step by Step: Adaptive Gradient Descent for Training L-Lipschitz Neural Networks
We demonstrate that applying an eventual decay to the learning rate (LR) during empirical risk minimization (ERM) ensures Lipschitz regularity of the trained network: when the mean-squared-error loss of a two-layer neural network with Lipschitz activation functions is minimized by standard gradient descent (GD), the resulting network exhibits a small Lipschitz constant. Moreover, we show that this decay does not hinder the convergence rate of the empirical risk, now measured with the Huber loss, toward a critical point of the non-convex empirical risk. From these findings, we derive generalization bounds for two-layer neural networks trained with GD and a decaying LR that depend only sub-linearly on the number of trainable parameters, suggesting that the statistical behaviour of these networks is independent of overparameterization. We validate our theoretical results with a series of toy numerical experiments, where, surprisingly, we observe that networks trained with constant-step-size GD exhibit learning and regularity properties similar to those trained with a decaying LR. This suggests that neural networks trained with standard GD may already be highly regular learners.
💡 Research Summary
This paper investigates how a simple learning‑rate (LR) decay schedule can simultaneously guarantee Lipschitz regularity and good generalization for two‑layer multilayer perceptrons (MLPs) trained by standard gradient descent (GD) on the mean‑squared‑error (MSE) loss. The authors start from the observation that Lipschitz continuity of neural networks is highly desirable: it yields robustness to input perturbations, improves generalization bounds, and underpins Wasserstein‑GAN and optimal‑transport based models. Existing approaches enforce Lipschitzness either by adding explicit regularization terms to the loss or by constraining the network architecture, both of which increase implementation complexity or limit expressive power. The central question of the work is whether ordinary GD, equipped with an appropriately decaying step size, can achieve the same effect without any architectural changes.
Problem setting.
The study focuses on a single‑hidden‑layer MLP with p hidden units, input dimension d₀, and scalar output. The activation σ is assumed to be 1‑Lipschitz and differentiable almost everywhere (e.g., ReLU, tanh). Parameters are initialized independently from an isotropic sub‑Gaussian distribution with variance scaling (Xavier/He style). The training objective is the empirical MSE risk
R_S(θ)= (1/N)∑_{n=1}^N ‖f_θ(x_n)−y_n‖²,
while performance is evaluated with the 1‑Lipschitz Huber loss ℓ, which shares the same minimizers as MSE but enjoys better statistical properties.
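To make the setting concrete, here is a minimal NumPy sketch of the objects above: the single-hidden-layer MLP, the empirical MSE risk R_S, and the 1-Lipschitz Huber loss used for evaluation. The He-style initialization scale and the Huber threshold δ = 1 are illustrative choices, not quantities fixed by the paper.

```python
import numpy as np

def two_layer_mlp(x, W1, b1, w2):
    """Single-hidden-layer MLP with ReLU activation and scalar output.
    Shapes (illustrative): x (d0,), W1 (p, d0), b1 (p,), w2 (p,)."""
    return w2 @ np.maximum(W1 @ x + b1, 0.0)

def init_params(d0, p, rng):
    """Independent Gaussian init with He-style variance scaling (an
    instance of the isotropic sub-Gaussian initialization assumed above)."""
    W1 = rng.normal(0.0, np.sqrt(2.0 / d0), size=(p, d0))
    b1 = np.zeros(p)
    w2 = rng.normal(0.0, np.sqrt(2.0 / p), size=p)
    return W1, b1, w2

def empirical_mse_risk(X, y, W1, b1, w2):
    """R_S(theta) = (1/N) * sum_n (f_theta(x_n) - y_n)^2."""
    preds = np.array([two_layer_mlp(x, W1, b1, w2) for x in X])
    return np.mean((preds - y) ** 2)

def huber_loss(r, delta=1.0):
    """Huber loss on a residual r: quadratic for |r| <= delta, linear
    beyond. With delta = 1 its derivative is bounded by 1, so the loss
    is 1-Lipschitz; it shares its minimizers with the squared loss."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta))
```

Note how the Huber loss agrees with 0.5·r² near the origin but grows only linearly in the tails, which is what gives it the better statistical (Lipschitz) behaviour mentioned above.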
Learning‑rate decay framework.
A “rate function” G:
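The summary cuts off before the rate function G is defined, so the following is only a hedged sketch of the general idea: plain GD whose step size is constant for a warm-up phase and then decays. The specific schedule η_t = η₀·t₀/t after step t₀ is an illustrative stand-in, not the paper's G.

```python
import numpy as np

def lr_schedule(t, eta0=0.1, t_decay=100):
    """Eventually-decaying step size: constant up to t_decay, then ~1/t.
    Illustrative assumption standing in for the paper's rate function G."""
    return eta0 if t < t_decay else eta0 * t_decay / t

def gd_with_decay(grad, theta0, steps=500):
    """Standard gradient descent driven by the decaying LR above."""
    theta = np.array(theta0, dtype=float)
    for t in range(1, steps + 1):
        theta -= lr_schedule(t) * grad(theta)
    return theta
```

As a sanity check, running this on the convex toy objective f(θ) = ½‖θ‖² (gradient θ) drives the iterates toward the minimizer 0; on the non-convex empirical risk above, the claim is convergence to a critical point rather than a global minimum.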