Effect of Depth and Width on Local Minima in Deep Learning


In this paper, we analyze the effects of depth and width on the quality of local minima, without strong over-parameterization and simplification assumptions in the literature. Without any simplification assumption, for deep nonlinear neural networks with the squared loss, we theoretically show that the quality of local minima tends to improve towards the global minimum value as depth and width increase. Furthermore, with a locally-induced structure on deep nonlinear neural networks, the values of local minima of neural networks are theoretically proven to be no worse than the globally optimal values of corresponding classical machine learning models. We empirically support our theoretical observation with a synthetic dataset as well as MNIST, CIFAR-10 and SVHN datasets. When compared to previous studies with strong over-parameterization assumptions, the results in this paper do not require over-parameterization, and instead show the gradual effects of over-parameterization as consequences of general results.


💡 Research Summary

This paper investigates how the architectural parameters of depth and width influence the quality of local minima in deep neural networks, without relying on the strong over‑parameterization assumptions that dominate much of the existing theoretical literature. The authors consider the standard squared‑loss training objective for feed‑forward fully‑connected networks with ReLU, leaky‑ReLU, or absolute‑value activations, and they derive explicit expressions for the loss value at any differentiable local minimum.
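To make the setting concrete, the following is a minimal sketch of the training objective described above: a feed-forward fully-connected network with a pointwise activation (ReLU here; leaky ReLU or absolute value fit the same template) and a squared loss over the dataset. The function names and the 1/2 scaling convention are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def forward(X, weights, act=lambda z: np.maximum(z, 0.0)):
    """Feed-forward fully-connected network. `act` is the hidden-layer
    nonlinearity; ReLU by default, but leaky ReLU or absolute value
    (as in the paper's setting) work the same way."""
    h = X
    for W in weights[:-1]:
        h = act(h @ W)
    return h @ weights[-1]  # linear output layer

def squared_loss(X, y, weights):
    """Squared-loss training objective over the dataset
    (the 1/2 scaling is a convention assumed here)."""
    r = forward(X, weights) - y
    return 0.5 * float(np.sum(r * r))

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 3))       # 10 samples, input dim 3
y = rng.standard_normal((10, 1))
weights = [rng.standard_normal((3, 5)),  # one hidden layer of width 5
           rng.standard_normal((5, 1))]
print(squared_loss(X, y, weights))
```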

The analysis begins with a single‑hidden‑layer network. By introducing diagonal activation‑pattern matrices that capture which neurons are active on each training sample, the authors show that the loss at a local minimum can be written as the total data norm minus a sum of projection terms. Each projection term corresponds to the component of the target that lies in the column space generated by a particular hidden unit. As the hidden‑layer width d grows, the dimension of this column space expands, the projection term becomes larger, and consequently the loss becomes smaller. A probabilistic bound is then established for Gaussian‑distributed data: when the product of input dimension d_x and width d is smaller than the number of samples m, the loss is bounded by O((m−d_x d)/m); when d_x d exceeds 2m, the loss is zero with high probability. This demonstrates that even without the “every weight matrix is larger than the data set” condition, sufficient width guarantees near‑perfect fitting.
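The width argument can be illustrated numerically. In the sketch below the first-layer weights (and therefore the activation patterns) are held fixed, so the best achievable loss over the output layer is exactly the data norm minus the squared norm of the target's projection onto the span of the hidden features. Taking widths as nested prefixes of one wide weight matrix makes the column spaces nested, so the minimum loss is guaranteed to be non-increasing in width. This is a simplified illustration of the mechanism, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d_x = 200, 5                         # samples, input dimension
X = rng.standard_normal((m, d_x))
y = rng.standard_normal(m)              # arbitrary target

# One wide first layer; width-d networks use the first d columns,
# so smaller widths span subspaces of larger ones.
W_full = rng.standard_normal((d_x, 64))

def min_loss_at_width(d):
    """Best squared loss over output weights with the first layer fixed:
    ||y||^2 minus the squared norm of y's projection onto the column
    space of the post-activation features."""
    Phi = np.maximum(X @ W_full[:, :d], 0.0)   # ReLU features
    coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    r = y - Phi @ coef                          # residual after projection
    return float(r @ r)

losses = [min_loss_at_width(d) for d in (1, 4, 16, 64)]
print(losses)  # non-increasing as width grows
```

Because the feature spaces nest, each extra hidden unit can only enlarge the projection and shrink the residual, mirroring the width effect described above.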

The authors extend the argument to deep networks with an arbitrary number H of hidden layers. For each layer l and each neuron k, they define a matrix D^{(l)}_k that combines the weight vector, the activation pattern, and the previous layer’s activations. Theorem 1 shows that the loss at any differentiable local minimum equals the total data norm minus a sum over all layers and neurons of squared norms of projections of the target onto the orthogonal complement of the null space of D^{(l)}_k. Because each additional layer contributes another non‑negative term, increasing depth can only reduce the loss further. Moreover, widening any layer reduces the dimension of the corresponding null space, enlarging the projection term and again lowering the loss. Thus both depth and width act synergistically to push local minima toward the global optimum.
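The depth argument can be sketched in the same spirit. The code below accumulates feature columns layer by layer in a random deep ReLU network and, at each depth, projects the target onto the span of all features collected so far. Since the spans nest as depth grows, the residual is guaranteed to be non-increasing, echoing the "each additional layer contributes another non-negative projection term" mechanism. This uses simplified per-layer feature blocks, not the paper's exact D^{(l)}_k matrices.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d_x, width, H = 150, 4, 8, 4         # samples, input dim, layer width, depth
X = rng.standard_normal((m, d_x))
y = rng.standard_normal(m)

def residual_sq(basis, target):
    """||target||^2 minus the squared norm of its projection onto span(basis)."""
    coef, *_ = np.linalg.lstsq(basis, target, rcond=None)
    r = target - basis @ coef
    return float(r @ r)

h = X
blocks = []                              # feature columns contributed so far
losses_by_depth = []
for layer in range(H):
    W = rng.standard_normal((h.shape[1], width))
    h = np.maximum(h @ W, 0.0)           # this layer's ReLU activations
    blocks.append(h)
    basis = np.concatenate(blocks, axis=1)
    losses_by_depth.append(residual_sq(basis, y))
print(losses_by_depth)  # non-increasing as depth grows
```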

To validate the theory, the paper presents experiments on synthetic data and three benchmark image datasets (MNIST, CIFAR‑10, SVHN). Networks of varying depth and width are trained, and the training loss is recorded. The empirical results closely follow the theoretical predictions: as width increases, the loss drops sharply; adding more layers yields additional reductions, eventually reaching values indistinguishable from the global minimum. Importantly, these phenomena appear even when the total number of parameters is comparable to or smaller than the number of training examples, confirming that the observed benefits do not stem from extreme over‑parameterization.

The contributions of the work are threefold: (1) it provides a rigorous analysis, free of over‑parameterization assumptions, of how depth and width affect the landscape of local minima, (2) it unifies the treatment of shallow and deep networks under a common projection‑based framework, and (3) it supports the theoretical findings with both probabilistic bounds and extensive empirical evidence. By showing that deeper and wider networks inherently possess better local minima without requiring every weight matrix to be larger than the data set, the paper offers a more realistic theoretical foundation for the practical success of deep learning and suggests concrete guidelines for network architecture design.
