Dense Neural Networks are not Universal Approximators
We investigate the approximation capabilities of dense neural networks. While universal approximation theorems establish that sufficiently large architectures can approximate arbitrary continuous functions if there are no restrictions on the weight values, we show that dense neural networks do not possess this universality. Our argument is based on a model compression approach, combining the weak regularity lemma with an interpretation of feedforward networks as message passing graph neural networks. We consider ReLU neural networks subject to natural constraints on weights and input and output dimensions, which model a notion of dense connectivity. Within this setting, we demonstrate the existence of Lipschitz continuous functions that cannot be approximated by such networks. This highlights intrinsic limitations of neural networks with dense layers and motivates the use of sparse connectivity as a necessary ingredient for achieving true universality.
💡 Research Summary
The paper challenges the widely‑held belief that fully‑connected (dense) neural networks are universal approximators. Classical universal approximation theorems (e.g., Cybenko, Hornik, Leshno) guarantee that a feed‑forward ReLU network of sufficient width can approximate any continuous function on a compact domain, but they assume no constraints on the magnitude of the weights. The authors introduce a more realistic setting: a B‑strongly dense network, in which every weight and bias is bounded in absolute value by a fixed constant B, the input and output dimensions are fixed, the depth L is fixed, and only the hidden layer widths may grow arbitrarily. This model captures the practical limits of hardware (memory, power) and regularization that keep weights from exploding.
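The constrained setting can be sketched in a few lines of NumPy: a plain ReLU network whose weights and biases are clipped to [-B, B], with fixed input/output dimensions and depth, and only the hidden widths free to grow. The function names and the random initialization are illustrative, not from the paper.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def random_dense_network(widths, B, rng):
    """Sample a ReLU network with every weight and bias clipped to [-B, B].

    `widths` = [d_in, h_1, ..., h_{L-1}, d_out]: only the hidden widths h_i
    may grow; d_in, d_out, the depth, and the bound B stay fixed.
    (Illustrative construction, not the paper's formal definition.)
    """
    params = []
    for d_in, d_out in zip(widths[:-1], widths[1:]):
        W = np.clip(rng.normal(size=(d_out, d_in)), -B, B)
        b = np.clip(rng.normal(size=d_out), -B, B)
        params.append((W, b))
    return params

def forward(params, x):
    for i, (W, b) in enumerate(params):
        x = W @ x + b
        if i < len(params) - 1:   # ReLU on hidden layers only
            x = relu(x)
    return x

rng = np.random.default_rng(0)
net = random_dense_network([2, 64, 64, 1], B=1.0, rng=rng)
y = forward(net, np.array([0.5, -0.3]))
```

Widening the hidden layers (say from 64 to 64 000) changes nothing in this code except `widths`; the point of the paper is that, with B fixed, this extra width buys strictly less expressive power than the classical theorems suggest.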
To analyze such networks, the authors reinterpret a multilayer perceptron as a message‑passing graph neural network (MPNN). By treating each layer’s linear transformation and ReLU activation as a message‑passing step, the entire computation can be represented as an attributed kernel—a continuous analogue of an adjacency matrix—on the unit interval. This graph‑theoretic viewpoint enables the use of tools from dense graph limit theory.
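The reinterpretation is easy to see on a single layer: computing y = ReLU(Wx + b) is the same as one message-passing round on the bipartite graph whose edge (j → i) carries the weight W[i, j]. The sketch below checks this equivalence numerically; it is an informal illustration of the viewpoint, not the paper's kernel construction.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 4))
b = rng.normal(size=3)
x = rng.normal(size=4)

# Matrix form of one dense layer.
y_matrix = np.maximum(W @ x + b, 0.0)

# Message-passing form: each target node i aggregates the messages
# W[i, j] * x[j] from every source node j, then applies the node update
# (add bias, apply ReLU).
messages = [[W[i, j] * x[j] for j in range(4)] for i in range(3)]
y_mpnn = np.array([max(sum(m) + b[i], 0.0) for i, m in enumerate(messages)])

assert np.allclose(y_matrix, y_mpnn)
```

Stacking layers then corresponds to iterating message-passing rounds, which is what lets the whole network be packaged as a single attributed kernel amenable to graph-limit tools.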
The core technical contribution is the application of the Weak Regularity Lemma (Frieze–Kannan; Lovász–Szegedy) to these kernels. The lemma guarantees that any large kernel can be approximated, in cut‑norm, by a sum of O(1/ε²) simple step‑kernels, each corresponding to a small “block” of the original network. Translating back to neural networks, this means that no matter how wide the hidden layers become, a B‑strongly dense network can be compressed into a bounded‑size network whose size depends only on B, L and the desired error ε, not on the original width.
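The flavor of this compression can be illustrated by replacing a weight matrix with its block averages over a partition of the rows and columns. The regularity lemma guarantees that some partition with a number of classes depending only on ε achieves small cut-norm error; the toy sketch below just uses a random partition to show the shape of the compressed object (all names here are illustrative).

```python
import numpy as np

def step_kernel_approximation(W, k, rng):
    """Replace W by the step kernel of block averages over a k x k partition.

    Only the k x k matrix of block averages (plus the partition labels)
    needs to be stored -- a size independent of W's dimensions.  A random
    partition is used purely for illustration; the regularity lemma asserts
    the existence of a good partition, it does not say a random one works.
    """
    n, m = W.shape
    rows = rng.integers(0, k, size=n)   # partition of row indices
    cols = rng.integers(0, k, size=m)   # partition of column indices
    blocks = np.zeros((k, k))
    for a in range(k):
        for c in range(k):
            mask = (rows == a)[:, None] & (cols == c)[None, :]
            if mask.any():
                blocks[a, c] = W[mask].mean()
    # The step kernel approximates W via W[i, j] ~ blocks[rows[i], cols[j]].
    return blocks[np.ix_(rows, cols)], blocks

rng = np.random.default_rng(2)
W = rng.normal(size=(200, 200))
W_step, blocks = step_kernel_approximation(W, k=8, rng=rng)
```

Here a 200 × 200 layer is summarized by an 8 × 8 block matrix; in the paper's argument, applying this kind of compression layer by layer yields the bounded-size surrogate network whose size depends on B, L, and ε but not on the width.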
Using this compression, the authors prove Theorem 9: for any fixed depth L and weight bound B, there exist 1‑Lipschitz functions on a compact domain that no B‑strongly dense network of depth L can approximate to arbitrary accuracy, no matter how wide its hidden layers are made.