A note on sample complexity of learning binary output neural networks under fixed input distributions


We show that the sample complexity of learning a sigmoidal neural network constructed by Sontag (1992) to a given misclassification error, under a fixed purely atomic input distribution, can grow arbitrarily fast: for any prescribed rate of growth there is an input distribution whose sample complexity achieves that rate, and the bound is asymptotically tight. The rate can be superexponential, a non-recursive function, etc. We further observe that Sontag’s ANN is not Glivenko-Cantelli under any input distribution having a non-atomic part.


💡 Research Summary

The paper investigates the sample-complexity behavior of a classic sigmoidal artificial neural network (ANN) introduced by Sontag in 1992 when the input distribution is fixed and purely atomic. The authors first recall that Sontag’s network has infinite VC dimension, which makes it a canonical example for studying worst-case learning scenarios. By restricting the input space to a countable set of atoms and carefully assigning probability masses to these points, they demonstrate that the number of training examples required to achieve a prescribed misclassification error can be made to grow at essentially any rate. For any monotone increasing function f (polynomial, exponential, super-exponential, or even non-recursive), they construct a probability distribution over the atoms such that the sample complexity m(ε) necessary to guarantee error ε grows at least as fast as f as ε → 0. The construction works by making the masses of successive atoms decay extremely fast (e.g., doubly-exponential decay), which forces a learning algorithm to observe an astronomically large number of samples before it sees enough instances of the rare atoms to classify them correctly. The authors prove that this lower bound is asymptotically tight: there exist learning strategies that achieve the error bound with a number of samples matching the prescribed growth rate up to constant factors.
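The intuition behind the fast-decaying masses can be sketched numerically. The toy distribution below is an illustration only, not the paper's exact construction: atom k receives mass proportional to 2^(−2^k), and since a learner cannot classify an atom it has never observed, the expected number of i.i.d. draws needed to see atom k at least once is 1/p_k, which grows doubly exponentially in k.

```python
# Toy illustration (not the paper's construction): a purely atomic
# distribution whose atom masses decay doubly exponentially.
K = 5  # number of atoms in this sketch
raw = [2.0 ** -(2 ** k) for k in range(1, K + 1)]
Z = sum(raw)
p = [w / Z for w in raw]  # normalized atom masses

# To push the misclassification error below the mass of atom k, a learner
# must have seen atom k at least once; the expected number of i.i.d. draws
# before that happens is 1/p[k], which explodes with k.
for k, pk in enumerate(p, start=1):
    print(f"atom {k}: mass ~ {pk:.3e}, expected draws to observe it ~ {1 / pk:.3e}")
```

Replacing the doubly-exponential decay with any faster-growing schedule yields correspondingly worse sample complexity, which is the lever behind the "any prescribed rate of growth" claim.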

In the second part of the paper the authors turn to the Glivenko-Cantelli property, which requires that empirical risks converge uniformly to true risks over the hypothesis class. They show that Sontag’s network fails to be Glivenko-Cantelli under any input distribution that possesses a non-atomic (continuous) component. The proof exploits the fact that the hypothesis class generated by the network contains functions with arbitrarily many oscillations; consequently, even with arbitrarily large samples, the empirical distribution cannot uniformly approximate the true distribution over the decision sets induced by the class. This result implies that, for such distributions, empirical risk minimization does not guarantee convergence to the optimal risk, and the network cannot be learned in the classic PAC sense.
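The failure mode can be illustrated with a simpler stand-in class (this is a generic infinite-VC-dimension sketch, not Sontag's network): a class rich enough to contain the indicator of any finite set admits a hypothesis that fits the sample perfectly yet has true risk bounded away from zero under a non-atomic distribution, so the uniform gap between empirical and true risk never shrinks with n.

```python
import numpy as np

rng = np.random.default_rng(0)

def uniform_convergence_gap(n, n_test=100_000):
    """Empirical vs. true risk of a memorizing hypothesis (toy sketch)."""
    # Non-atomic input distribution: Uniform[0, 1]; target labels all points 1.
    sample = rng.random(n)
    memorizer = set(sample)  # h(x) = 1 iff x is a training point
    # Empirical risk: h agrees with the constant-1 label on every sample point.
    emp = float(np.mean([0.0 if x in memorizer else 1.0 for x in sample]))
    # Monte Carlo estimate of the true risk: a fresh uniform draw almost
    # surely misses the finite training set, so h predicts 0 and errs.
    fresh = rng.random(n_test)
    true = float(np.mean([0.0 if x in memorizer else 1.0 for x in fresh]))
    return emp, true

for n in (10, 1000, 100_000):
    emp, true = uniform_convergence_gap(n)
    print(n, emp, true)  # empirical risk stays 0 while true risk stays near 1
```

No matter how large n grows, the supremum over the class of |empirical risk − true risk| stays at 1, which is exactly the uniform-convergence failure that blocks the Glivenko-Cantelli property under distributions with a non-atomic part.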

The combined findings have several important implications. First, they highlight that sample‑complexity analyses that ignore the structure of the input distribution can be dramatically misleading; the same network can require a modest number of samples under one distribution and an astronomically large (or even non‑recursive) number under another. Second, the failure of the Glivenko‑Cantelli property for Sontag’s network demonstrates that infinite‑VC‑dimension models may lack uniform convergence even when the data distribution is “nice,” thereby challenging the conventional wisdom that VC‑dimension alone governs learnability. Finally, the authors discuss practical relevance: modern deep learning systems often operate on highly imbalanced or sparse data, which can resemble the atomic distributions studied here. In such regimes, standard generalization bounds may be overly optimistic, and practitioners should consider distribution‑aware sampling strategies, regularization, or model‑capacity control to mitigate the pathological sample‑complexity behavior identified in this work.

