Deep Networks Learn Deep Hierarchical Models
Original Paper Info
- Title: Deep Networks Learn Deep Hierarchical Models
- ArXiv ID: 2601.00455
- Date: 2026-01-01
- Authors: Amit Daniely
Abstract
We consider supervised learning with $`n`$ labels and show that layerwise SGD on residual networks can efficiently learn a class of hierarchical models. This model class assumes the existence of an (unknown) label hierarchy $`L_1 \subseteq L_2 \subseteq \dots \subseteq L_r = [n]`$, where labels in $`L_1`$ are simple functions of the input, while for $`i > 1`$, labels in $`L_i`$ are simple functions of simpler labels. Our class surpasses models that were previously shown to be learnable by deep learning algorithms, in the sense that it reaches the depth limit of efficient learnability. That is, there are models in this class that require polynomial depth to express, whereas previous models can be computed by log-depth circuits. Furthermore, we suggest that learnability of such hierarchical models might eventually form a basis for understanding deep learning. Beyond their natural fit for domains where deep learning excels, we argue that the mere existence of human "teachers" supports the hypothesis that hierarchical structures are inherently available. By providing granular labels, teachers effectively reveal "hints" or "snippets" of the internal algorithms used by the brain. We formalize this intuition, showing that in a simplified model where a teacher is partially aware of their internal logic, a hierarchical structure emerges that facilitates efficient learnability.

Summary & Analysis
1. **Proving Learnability of Hierarchical Models**: The paper demonstrates that deep learning algorithms, especially residual networks, can learn hierarchical models efficiently. This insight helps explain the success of deep learning.
2. **Brain-Like Hierarchical Structure**: Similar to how the human brain processes information hierarchically, this approach shows how complex concepts are learned step-by-step in fields like computer vision and natural language processing.
3. **Utilizing Internet Data for Learning**: The vast amount of labeled data available on the internet provides 'hints' that deep learning models can use as intermediate steps to learn more complex concepts.

Full Paper Content (ArXiv Source)
A central objective in deep learning theory is to demonstrate that gradient-based algorithms can efficiently learn a class of models sufficiently rich to capture reality. This effort began over a decade ago, coinciding with the undeniable empirical success of deep learning. Initial theoretical results demonstrated that deep learning algorithms can learn linear models, followed later by proofs for simple non-linear models.
This progress is remarkable, especially considering that until recently, no models were known to be provably learnable by deep learning algorithms. Moreover, the field was previously dominated by hardness results indicating severe limitations on the capabilities of neural networks. However, despite this progress, learning linear or simple non-linear models is insufficient to explain the practical success of deep learning.
In this paper, we advance this research effort by showing that deep learning algorithms, specifically layerwise SGD on residual networks, provably learn hierarchical models. We consider a supervised learning setting with $`n`$ possible labels, where each example is associated with a subset of these labels. Let $`\mathbf{f}^* \colon \mathcal{X} \to \{\pm 1\}^n`$ be the ground truth labeling function. We assume an unknown hierarchy of labels $`L_1 \subseteq L_2 \subseteq \dots \subseteq L_r = [n]`$ such that labels in $`L_1`$ are simple functions (specifically, polynomial thresholds) of the input, while for $`i > 1`$, any label in $`L_i`$ is a simple function of simpler labels (i.e., those in $`L_{i-1}`$).
We suggest that the learnability of hierarchical models offers a compelling basis for understanding deep learning. First, hierarchical models are natural in domains where neural networks excel. In computer vision, for instance, a first-level label might be "this pixel is red" (i.e., the input itself); a second-level label might be "curved line" or "dark region"; and a third-level label might be "leaf" or "rectangle", and so on. Similar hierarchies exist in text and speech processing. Indeed, this hierarchical structure motivated the development of successful architectures such as convolutional and residual networks.
Second, one might even argue further that the mere existence of human "teachers" supports the hypothesis that hierarchical labeling exists and can be supplied to the algorithm. Consider the classic problem of recognizing a car in an image. Early AI approaches (circa 1970s-80s) failed because they attempted to manually codify the cognitive algorithms used by the human brain. This was superseded by machine learning, which approximates functions based on input-output pairs. While this data-driven approach has surpassed human performance, the standard narrative of its success might be somewhat misleading.
We suggest that recent breakthroughs are not solely due to "learning from scratch", but also because models are trained on datasets containing a vast number of granular labels. These labels represent a middle ground between explicit programming and pure input-output learning; they serve as "hints" or intermediate steps for learning complex concepts. Although we lack full access to the brain's internal algorithm, we can provide "snippets" of its logic. By identifying lower-level features, such as windows, wheels, or geometric shapes, we effectively decompose the task into a hierarchy.
At a larger scale, we can consider the following perspective for the creation of LLMs. From the 1990s to the present, humanity created the internet (websites, forums, images, videos, etc.). As a byproduct, humanity implicitly provided an extensive number of labels and examples. Because these labels are so numerous, ranging from the very simple to the very complex, they are likely to possess a hierarchical structure. Following the creation of the internet, huge models were trained on these examples, succeeding largely as a result of this structure (alongside, of course, the extensive data volume and compute power). In a sense, the evolution of the internet and modern LLMs can be viewed as an enormous collective effort to create a circuit that mimics the human brain, in the sense that all labels of interest are effectively a composition of this circuit and a simple function.
We present a simplified formalization of this intuition. We model the human brain as a computational circuit, where each label (representing a "brain snippet") corresponds to a majority vote over a subset of the brain's neurons. To formalize the postulate that these labels are both granular and diverse, we assume that the specific collections of neurons defining each label are chosen at random prior to the learning process. We demonstrate that this setting yields a hierarchical structure that facilitates efficient learnability by residual networks. Crucially, neither the residual network architecture nor the training algorithm relies on knowledge of this underlying label hierarchy.
Finally, we note that hierarchical models surpass previous classes of models shown to be learnable by SGD. To the best of our knowledge, prior results were limited to models that can be realized by log-depth circuits. In contrast, hierarchical models reach the depth limit of efficient learnability. For any polynomial-sized circuit, we can construct a corresponding hierarchical model learnable by SGD on a ResNet, effectively computing the circuit as one of its labels.
Related Work
Linear, or fixed-representation, models are defined by a fixed (usually non-linear) feature mapping followed by a learned linear mapping. This includes kernel methods, random features, and others. Several papers in the last decade have shown that neural networks can provably learn various linear models. Several works consider model classes which go beyond fixed representations, but can still be efficiently learned by gradient-based methods on neural networks. One line of work shows learnability of parities under non-uniform distributions, or of other models directly expressible by neural networks of depth two. Closer to our approach are works that consider certain hierarchical models. As mentioned above, we believe that our work is another step towards models that can capture reality. From a more formal perspective, we improve over previous work in the sense that the models we consider can be arbitrarily deep. In contrast, all the mentioned papers consider models that can be realized by networks of logarithmic depth. In fact, with the exception of the work that considers compositions of permutations, depth two suffices to express all the above-mentioned models.
Another line of related work argues that deep learning is successful due to hierarchical structure. This series of papers gives an example of a hierarchical model that is efficiently learnable, but which is conjectured to require a deep architecture to express. There are additional attempts to argue that hierarchy is essential for deep learning.
Notation and Preliminaries
We denote vectors using bold letters (e.g., $`\mathbf{x} ,\mathbf{y},\mathbf{z},\mathbf{w},\mathbf{v}`$) and their coordinates using standard letters. For instance, $`x_i`$ denotes the $`i`$-th coordinate of $`\mathbf{x}`$. Likewise, we denote vector-valued functions and polynomials (i.e., those whose range is $`\mathbb{R}^d`$) using bold letters (e.g., $`\mathbf{f},\mathbf{g},\mathbf{h},\mathbf{p},\mathbf{q},\mathbf{r}`$), and their $`i`$-th coordinate using standard letters. We will freely use broadcasting operations. For instance, if $`\vec\mathbf{x} = (\mathbf{x} _1,\ldots,\mathbf{x} _n)`$ is a sequence of $`n`$ vectors in $`\mathbb{R}^d`$ and $`g`$ is a function from $`\mathbb{R}^d`$ to some set $`Y`$, then $`g(\vec\mathbf{x} )`$ denotes the sequence $`(g(\mathbf{x} _1),\ldots,g(\mathbf{x} _n))`$. Similarly, for a matrix $`A\in M_{q,d}`$, we denote $`A\vec\mathbf{x} = (A\mathbf{x} _1,\ldots,A\mathbf{x} _n)`$.
For a polynomial $`p:\mathbb{R}^n\to\mathbb{R}`$, we denote by $`\|p\|_{\mathrm{co}}`$ the Euclidean norm of the coefficient vector of $`p`$. We call $`\|p\|_{\mathrm{co}}`$ the coefficient norm of $`p`$. For $`\sigma:\mathbb{R}\to\mathbb{R}`$, we denote by $`\|\sigma\| = \sqrt{\mathbb{E}_{X\sim{\cal N}(0,1)}[\sigma^2(X)]}`$ the $`\ell^2`$ norm with respect to the standard Gaussian measure. We denote the Frobenius norm of a matrix $`A\in M_{n,m}`$ by $`\|A\|_F = \sqrt{\sum_{i,j}A^2_{ij}}`$, and the spectral norm by $`\|A\| = \max_{\|\mathbf{x} \|=1}\|A\mathbf{x} \|`$.
We denote by $`\mathbb{R}^{d,n}`$ the space of sequences of $`n`$
vectors in $`\mathbb{R}^d`$. More generally, for a set $`G`$, we let
$`\mathbb{R}^{d,G} = \{\vec\mathbf{x} = (\mathbf{x} _g)_{g\in G} : \forall g\in G,\; \mathbf{x} _g\in\mathbb{R}^d\}`$.
We denote the Euclidean unit ball by
$`\mathbb{B}^d = \{\mathbf{x} \in\mathbb{R}^d : \|\mathbf{x} \|\le 1\}`$.
We denote the point-wise (Hadamard) multiplication of vectors and
matrices by $`\odot`$ and the concatenation of vectors by
$`(\mathbf{x} |\mathbf{y})`$. For $`\mathbf{x} \in\mathbb{R}^n`$,
$`A\subseteq [n]`$, and $`\sigma\in \mathbb{Z}^n`$, we use the
multi-index notation $`\mathbf{x} ^{A} = \prod_{i\in A} x_i`$ and
$`\mathbf{x} ^{\sigma} = \prod_{i=1}^n x_i^{\sigma_i}`$. For
$`\mathbf{f}:{\cal X}\to \mathbb{R}^n`$ and $`L\subseteq[n]`$, we denote
by $`\mathbf{f}_L:{\cal X}\to \mathbb{R}^{|L|}`$ the restriction
$`\mathbf{f}_L=(f_{i_1},\ldots,f_{i_k})`$, where
$`L=\{i_1,\ldots,i_k\}`$ with $`i_1<\ldots<i_k`$. Fix a set $`{\cal X}\subseteq[-1,1]^d`$, a function
$`f:{\cal X}\to \{\pm 1\}`$, a positive integer $`K`$, and $`M>0`$. We
say that $`f`$ is a $`(K,M)`$-PTF if there is a degree $`\le K`$
polynomial $`p:\mathbb{R}^d\to\mathbb{R}`$ such that
$`\|p\|_\mathrm{co}\le M`$ and
$`\forall \mathbf{x} \in {\cal X},\;\;p(\mathbf{x} )f(\mathbf{x} )\ge 1`$.
More generally, we say that $`f`$ is a $`(K,M)`$-PTF of
$`\mathbf{h}:{\cal X}\to\mathbb{R}^s`$ if there is a degree $`\le K`$
polynomial $`p:\mathbb{R}^s\to\mathbb{R}`$ such that
$`\|p\|_\mathrm{co}\le M`$ and
$`\forall \mathbf{x} \in {\cal X},\;\;p(\mathbf{h}(\mathbf{x} ))f(\mathbf{x} )\ge 1`$.
An example of a $`(K,1)`$-PTF that we will use frequently is a function
$`f:\{\pm 1\}^d\to\{\pm 1\}`$ that depends on $`K`$ variables. Indeed,
Fourier analysis on $`\{\pm 1\}^d`$ tells us that $`f`$ is the restriction
of a degree $`\le K`$ polynomial $`p`$ with $`\|p\|_\mathrm{co}=1`$. For
this polynomial we have
$`\forall \mathbf{x} \in {\cal X},\;\;p(\mathbf{x} )f(\mathbf{x} )= 1`$.
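For a concrete instance of this fact, take $`d\ge 2`$, $`K=2`$ and $`f=\mathrm{AND}(x_1,x_2)`$ (with $`+1`$ playing the role of "true"). Its Fourier expansion is $`p(\mathbf{x} ) = -\frac{1}{2} + \frac{1}{2}x_1 + \frac{1}{2}x_2 + \frac{1}{2}x_1x_2`$, a degree-$`2`$ polynomial whose coefficient vector has Euclidean norm $`\sqrt{4\cdot\frac{1}{4}}=1`$ (this is Parseval's identity: $`\|p\|_\mathrm{co}^2=\sum_S\hat f^2(S)=\mathbb{E}[f^2]=1`$), and $`p(\mathbf{x} )f(\mathbf{x} )=f^2(\mathbf{x} )=1`$ on $`\{\pm 1\}^d`$; hence $`f`$ is a $`(2,1)`$-PTF.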
We will also need a more refined definition of PTFs, which allows us to
require the two-sided inequality $`B\ge p(\mathbf{x} )f(\mathbf{x} )\ge 1`$,
as well as some robustness to perturbations of $`\mathbf{x}`$. To this
end, for $`\mathbf{x} \in [-1,1]^d`$ and $`r>0`$ we define a notion of
$`r`$-perturbation of $`\mathbf{x}`$. Fix $`B\ge 1`$ and $`1\ge \xi>0`$. We say that $`f`$ is a
$`(K,M,B,\xi)`$-PTF if there is a degree $`\le K`$ polynomial
$`p:\mathbb{R}^d\to\mathbb{R}`$ such that $`\|p\|_\mathrm{co}\le M`$ and the two-sided, $`\xi`$-robust margin condition above holds on $`{\cal X}`$. Likewise, we say that $`f`$ is a $`(K,M,B,\xi)`$-PTF of
$`\mathbf{h}=(h_1,\ldots,h_s):{\cal X}\to[-1,1]^s`$ if there is a degree
$`\le K`$ polynomial $`p:\mathbb{R}^s\to\mathbb{R}`$ such that
$`\|p\|_\mathrm{co}\le M`$ and the same condition holds with $`p\circ\mathbf{h}`$ in place of $`p`$. Finally, we say that $`f`$ is a $`(K,M,B)`$-PTF (resp. $`(K,M,B)`$-PTF
of $`\mathbf{h}`$) if it is a $`(K,M,B,1)`$-PTF (resp. $`(K,M,B,1)`$-PTF
of $`\mathbf{h}`$). Let $`W\subseteq\mathbb{R}^d`$ be convex. We say that a differentiable
$`f:W\to \mathbb{R}`$ is $`\lambda`$-strongly-convex if for any
$`\mathbf{x} ,\mathbf{y}\in W`$ we have
$`f(\mathbf{y})\ge f(\mathbf{x} )+{\left\langle \nabla f(\mathbf{x} ),\mathbf{y}-\mathbf{x} \right\rangle}+\frac{\lambda}{2}\|\mathbf{y}-\mathbf{x} \|^2`$.
We note that if $`f`$ is $`\lambda`$-strongly convex and
$`\|\nabla f(\mathbf{x} )\|\le \epsilon`$ for some $`\mathbf{x} \in W`$, then
$`\mathbf{x}`$ minimizes $`f`$ up to an additive error of
$`\frac{\epsilon^2}{2\lambda}`$. Indeed, for any $`\mathbf{y}\in W`$ we
have $`f(\mathbf{y})\ge f(\mathbf{x} )-\epsilon\|\mathbf{y}-\mathbf{x} \|+\frac{\lambda}{2}\|\mathbf{y}-\mathbf{x} \|^2\ge f(\mathbf{x} )-\frac{\epsilon^2}{2\lambda}`$. The results we state next are standard and can be found in the literature. The Hermite polynomials
$`h_0,h_1,h_2,\ldots`$ are the sequence of orthonormal polynomials
corresponding to the standard Gaussian measure $`\mu`$ on
$`\mathbb{R}`$. That is, they are the sequence of orthonormal
polynomials obtained by the Gram-Schmidt process of
$`1,x,x^2,x^3,\ldots \in L^2(\mu)`$. The Hermite polynomials satisfy the
following standard identities: a three-term recurrence relation, an explicit generating function, orthonormality, and an orthogonality relation for correlated Gaussians
$`X,Y\sim{\cal N}\left(0,\begin{pmatrix}1&\rho\\\rho&1\end{pmatrix}\right)`$; these are recorded below.
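For completeness, with this orthonormal normalization these standard facts can be stated as follows:

- Recurrence: $`\sqrt{n+1}\,h_{n+1}(x) = x\,h_n(x) - \sqrt{n}\,h_{n-1}(x)`$, with $`h_0 = 1`$ and $`h_1(x)=x`$.
- Generating function: $`\sum_{n=0}^{\infty}\frac{t^n}{\sqrt{n!}}\,h_n(x) = e^{tx - t^2/2}`$.
- Orthogonality: $`\mathbb{E}_{X\sim{\cal N}(0,1)}\left[h_n(X)h_m(X)\right] = \delta_{nm}`$, and if $`X,Y\sim{\cal N}\left(0,\begin{pmatrix}1&\rho\\\rho&1\end{pmatrix}\right)`$ then $`\mathbb{E}\left[h_n(X)h_m(Y)\right] = \delta_{nm}\,\rho^n`$.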
Let $`{\cal X}\subseteq [-1,1]^d`$ be our instance space. We consider the multi-label setting, in which each instance can have anywhere
between $`0`$ and $`n`$ positive labels, and each training example comes
with a list of all of its positive labels. Hence, our goal is to learn
the labeling function $`\mathbf{f}^*:{\cal X}\to \{\pm 1\}^n`$ based on
a sample of i.i.d. labeled examples drawn from a distribution $`{\cal D}`$
on $`{\cal X}`$. Specifically, our goal is to find a predictor
$`\hat \mathbf{f}:{\cal X}\to\mathbb{R}^n`$ whose error,
$`\mathrm{Err}_{\cal D}(\hat{\mathbf{f}})=\Pr_{\mathbf{x} \sim{\cal D}}\left(\mathrm{sign}(\hat{\mathbf{f}}(\mathbf{x} ))\ne \mathbf{f}^*(\mathbf{x} ) \right)`$,
is small. We assume that there is a hierarchy of labels (unknown to the
algorithm), with the following convention:

- The first level of the hierarchy consists of labels which are simple ($`=`$ easy to learn) functions of the input. Specifically, each such label is a polynomial threshold function (PTF) of the input.
- Any label in the $`i`$-th level of the hierarchy is a simple function (again, a PTF) of labels from lower levels of the hierarchy.

We next give the formal definition of a hierarchy. Definition 1 (hierarchy). Let $`{\cal L}= \{L_1,\ldots,L_r\}`$ be a
collection of sets such that
$`L_1\subseteq L_2\subseteq\ldots\subseteq L_r = [n]`$. We say that
$`{\cal L}`$ is a hierarchy for $`\mathbf{f}^*:{\cal X}\to \{\pm 1\}^n`$
of complexity $`(r,K,M)`$ (or an $`(r,K,M)`$-hierarchy for short) if for
any $`j\in L_1`$ the function $`f^*_j`$ is a $`(K,M)`$-PTF, and for
$`i\ge 2`$ and $`j\in L_i`$ we have that
$`f^*_j = \tilde f_j\circ \mathbf{f}^*_{L_{i-1}}`$ for a
$`(K,M)`$-PTF $`\tilde f_j:\{\pm 1\}^{|L_{i-1}|}\to\{\pm 1\}`$. Example 2. Fix $`{\cal L}= \{L_1,\ldots,L_r\}`$ as in Definition
1, and recall that a boolean function
that depends on $`K`$ coordinates is a $`(K,1)`$-PTF. Hence, if for any
$`i\ge 2`$, any label $`j\in L_i`$ depends on at most $`K`$ labels from
$`L_{i-1}`$, and any label $`j\in L_1`$ is a $`(K,1)`$-PTF of the input,
then $`{\cal L}`$ is an $`(r,K,1)`$-hierarchy.
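To make Definition 1 and Example 2 concrete, here is a minimal Python sketch that builds a random $`(r,K,1)`$-hierarchy: level-one labels are $`K`$-juntas of a $`\{\pm 1\}`$-valued input, and each higher-level label is a $`K`$-junta of the labels in the previous level. The specific choice of juntas (signed majorities) and all names are illustrative assumptions of the sketch, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, r, per_level = 20, 3, 4, 5   # input dim, junta size K (odd), depth r, new labels per level

def signed_majority(n_in):
    """A random K-junta of a +/-1 vector: sign of a signed sum of K coordinates (K odd)."""
    idx = rng.choice(n_in, size=K, replace=False)
    signs = rng.choice([-1, 1], size=K)
    return lambda v: int(np.sign(signs @ v[idx]))

# labels in L_1 are juntas of the input; labels in L_i (i > 1) are juntas of f*_{L_{i-1}}
label_fns_by_level = []
n_prev = d                              # level-1 labels read the input directly
for _ in range(r):
    label_fns_by_level.append([signed_majority(n_prev) for _ in range(per_level)])
    n_prev = sum(len(fns) for fns in label_fns_by_level)   # |L_i| grows by per_level each level

def f_star(x):
    """The full label vector f*(x) in {+-1}^n, computed level by level."""
    labels = []
    for fns in label_fns_by_level:
        prev = np.array(x if not labels else labels)        # input for level 1, L_{i-1} afterwards
        labels.extend(fn(prev) for fn in fns)
    return np.array(labels)

x = rng.choice([-1, 1], size=d)
print(f_star(x))        # n = r * per_level labels, one {+-1} entry per label
```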
Assuming that $`K`$ is constant, our main result will show that given
$`\mathrm{poly}(n,d,M,1/\epsilon)`$ samples, a poly-time SGD algorithm
on a residual network of size $`\mathrm{poly}(n,d,M,1/\epsilon)`$ can
learn any function $`\mathbf{f}^*:{\cal X}\to\{\pm 1\}^n`$ with error at most
$`\epsilon`$, provided that $`\mathbf{f}^*`$ has a hierarchy of
complexity $`(r,K,M)`$ (the algorithm and the network do not depend on
the hierarchy, but only on $`r,K,M`$). One of the steps in the proof of this result is to show that any
$`(K,M)`$-PTF on a subset of $`[-1,1]^n`$ is necessarily a
$`(K,2M,B,\xi)`$-PTF for $`\xi = \frac{1}{2(n+1)^{\frac{K+1}{2}}KM}`$
and $`B=2(\max(n,d)+1)^{K/2}M`$ (see Lemma
29). This is enough for
establishing our main result as informally described above. Yet, in some
cases of interest, we can have much larger $`\xi`$ and smaller $`B`$. In
this case, we can guarantee learnability with a smaller network, fewer
samples, and less runtime. Hence, we next refine the definition of hierarchy
by adding $`B`$ and $`\xi`$ as parameters. Definition 3 (hierarchy). Let $`{\cal L}= \{L_1,\ldots,L_r\}`$ be a
collection of sets such that
$`L_1\subseteq L_2\subseteq\ldots\subseteq L_r = [n]`$. We say that
$`{\cal L}`$ is a hierarchy for $`\mathbf{f}^*:{\cal X}\to \{\pm 1\}^n`$
of complexity $`(r,K,M,B,\xi)`$ (or an $`(r,K,M,B,\xi)`$-hierarchy for
short) if for any $`j\in L_1`$ the function $`f^*_j`$ is a
$`(K,M,B)`$-PTF, and for $`i\ge 2`$ and $`j\in L_i`$ we have that
$`f^*_j = \tilde f_j\circ \mathbf{f}^*_{L_{i-1}}`$ for a
$`(K,M,B,\xi)`$-PTF $`\tilde f_j:\{\pm 1\}^{|L_{i-1}|}\to\{\pm 1\}`$. Fix a domain $`{\cal X}\subseteq\{\pm 1\}^d`$ and a sequence of
functions $`G^i:\{\pm 1\}^d\to\{\pm 1\}^d`$ for $`1\le i\le r`$. We
assume that $`G^0(\mathbf{x} ) = \mathbf{x}`$, and for any depth
$`i\in [r]`$ and coordinate $`j\in [d]`$, we have $`G^i_j(\mathbf{x} ) = h^i_j(G^{i-1}(\mathbf{x} ))`$, where $`h^i_j:\{\pm 1\}^d\to\{\pm 1\}`$ is a function that depends on
$`K`$ coordinates. We view the sequence $`G^1, \ldots, G^r`$ as a
computation circuit, or a model of a "brain." Suppose we wish to learn a function of the form $`f^* = h\circ G^r`$,
where $`h:\{\pm 1\}^d\to\{\pm 1\}`$ also depends only on $`K`$ inputs,
given access to labeled samples $`(\mathbf{x} ,f^*(\mathbf{x} ))`$. The
function $`f^*`$ can be extremely complex. For instance, $`G`$ could
compute a cryptographic function. In such cases, learning $`f^*`$ solely
from labeled examples $`(\mathbf{x} ,f^*(\mathbf{x} ))`$ is likely
intractable; if our access to $`f^*`$ is restricted to the black-box
scenario described above, the task appears impossible. On the other
extreme, if we had complete white-box access to $`f^*`$āmeaning a full
description of the circuit $`G`$āthe learning problem would become
trivial. However, if $`G`$ truly models a human brain, such transparent
access is unrealistic. Consider a middle ground between these black-box and white-box
scenarios. Assume we can query the labeler (the human whose brain is
modeled by $`G`$) for additional information. For instance, if $`f^*`$
is a function that recognizes cars in an image, we can ask the labeler
not only whether the image contains a car, but also to identify specific
features: wheels, windows, dark areas, curves, and whatever he thinks
is relevant. Each of these additional labels represents another simple
function computed over the circuit $`G`$. We model these auxiliary
labels as random majorities of randomly chosen $`G^i_j`$'s. We show that
with enough such labels, the resulting problem admits a low-complexity
hierarchy and is therefore efficiently learnable. Formally, fix an integer $`q`$. We assume that for every depth
$`i\in [r]`$, there are $`q`$ auxiliary labels $`f^*_{i,j}`$ for
$`1\le j\le q`$, each of which is a signed majority of an odd number of
components of $`G^i`$. Moreover, we assume these functions are random.
Specifically, prior to learning, the labeler independently samples
$`qr`$ functions such that for any $`i\in [r]`$ and $`j\in [q]`$, $`f^*_{i,j}(\mathbf{x} ) = \mathrm{sign}\left({\left\langle \mathbf{w}^{i,j},G^i(\mathbf{x} ) \right\rangle}\right)`$, where the weight vectors $`\mathbf{w}^{i,j}\in \mathbb{R}^{d}`$ are
independent and uniformly chosen from the set of vectors with exactly $`k`$ nonzero entries, each equal to $`\pm 1`$, for some odd integer $`k`$. Theorem 4. If $`q=\tilde\omega\left(k^2d\log(|{\cal X}|)\right)`$
then $`\mathbf{f}^*`$ has an
$`\left(r,K,O\left(kd^{K}\right),2k+1\right)`$-hierarchy
w.p. $`1-o(1)`$.
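A minimal sketch of this generative model, under one reading of the construction above; the explicit form of the circuit gates, the $`k`$-sparse sign vectors $`\mathbf{w}^{i,j}`$, and all names are assumptions of the sketch rather than the paper's code.

```python
import numpy as np

rng = np.random.default_rng(1)
d, K, r = 30, 3, 5        # circuit width, fan-in of each gate, circuit depth
q, k = 50, 5              # auxiliary labels per depth, (odd) majority size

# the "brain": G^i_j(x) = h^i_j(G^{i-1}(x)), each h^i_j a random K-junta of +/-1 values
gates = [[(rng.choice(d, K, replace=False), rng.choice([-1, 1], K)) for _ in range(d)]
         for _ in range(r)]

def circuit_layers(x):
    """Return [G^1(x), ..., G^r(x)] for x in {+-1}^d."""
    layers, v = [], x
    for layer in gates:
        v = np.array([int(np.sign(s @ v[idx])) for idx, s in layer])
        layers.append(v)
    return layers

# auxiliary labels: f*_{i,j}(x) = sign(<w^{i,j}, G^i(x)>), w^{i,j} a random k-sparse +/-1 vector
def sparse_sign_vector():
    w = np.zeros(d)
    w[rng.choice(d, k, replace=False)] = rng.choice([-1, 1], k)
    return w

W = [[sparse_sign_vector() for _ in range(q)] for _ in range(r)]

def auxiliary_labels(x):
    """The q*r granular labels the 'teacher' provides for input x."""
    layers = circuit_layers(x)
    return np.array([[int(np.sign(w @ layers[i])) for w in W[i]] for i in range(r)])

x = rng.choice([-1, 1], size=d)
print(auxiliary_labels(x).shape)   # (r, q)
```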
We next extend the notion of hierarchy to the common setting in which
the input and the output of the learned function form an ensemble of
vectors. Let $`G`$ be some set. We will refer to elements in $`G`$ as
locations. In the context of images a natural choice would be
$`G=[T_1]\times [T_2]`$, where $`T_1\times T_2`$ is the maximal size of
an input image. In the context of language a natural choice would be
$`G=[T]`$, where $`T`$ is the maximal number of tokens in the input. We
denote by $`\vec\mathbf{x} =(\mathbf{x} _g)_{g\in G}`$ an ensemble of
vectors and let
$`\mathbb{R}^{d,G} = \{\vec\mathbf{x} = (\mathbf{x} _g)_{g\in G} : \forall g\in G,\; \mathbf{x} _g\in\mathbb{R}^d\}`$. Fix $`{\cal X}\subseteq [-1,1]^d`$ and let $`{\cal X}^{G}`$ be our
instance space. Assume that there are $`n`$ labels. We consider the
setting in which each instance at each location can have anywhere
between $`0`$ and $`n`$ positive labels. In light of that, our goal is to
learn the labeling function
$`\mathbf{f}^*:{\cal X}^G\to \{\pm 1\}^{n,G}`$ based on a sample of i.i.d. labeled examples coming from a distribution $`{\cal D}`$ on
$`{\cal X}^G`$. We assume that there is a hierarchy of labels (unknown
to the algorithm), with the following convention:

- The first level of the hierarchy consists of labels which are simple ($`=`$ easy to learn) functions of the input. Specifically, each such label at location $`g`$ is a PTF of the input near $`g`$.
- Any label in the $`i`$-th level of the hierarchy is a simple function of labels from lower levels. Specifically, each such label at location $`g`$ is a PTF of lower-level labels, at locations near $`g`$.

We will capture the notion of proximity of locations in $`G`$ via a
proximity mapping, which designates $`w`$ nearby locations to any
element $`g\in G`$. We will always consider $`g`$ itself as a point near
$`g`$. This is captured in the following definition Definition 5 (proximity mapping). A proximity mapping of width
$`w`$ is a mapping $`\mathbf{e}=(e_1,\ldots,e_w):G\to G^{w}`$ such that
$`e_1(g)=g`$ for any $`g`$. For instance, if $`G=[T]`$, it is natural to choose
$`\mathbf{e}:G\to G^{2w+1}`$ such that
$`\{e_1(g),\ldots,e_{2w+1}(g)\} = \{g'\in [T] : |g'-g|\le w\}`$. Likewise,
if $`G=[T]\times [T]`$, it is natural to choose
$`\mathbf{e}:G\to G^{(2w+1)^2}`$ such that
$`\{e_1(g_1,g_2),\ldots,e_{(2w+1)^2}(g_1,g_2)\} = \{(g_1',g_2')\in [T]\times [T] : |g_1'-g_1|\le w\text{ and }|g_2'-g_2|\le w\}`$.
Given a proximity mapping $`\mathbf{e}`$ and
$`\vec\mathbf{x} \in\mathbb{R}^{d,G}`$ we define
$`E_g(\vec\mathbf{x} )`$ as the concatenation of all vectors
$`\mathbf{x} _{g'}`$ where $`g'`$ is close to $`g`$ according to
$`\mathbf{e}`$. Formally: Definition 6. Given a proximity mapping $`\mathbf{e}:G\to G^w`$,
$`g\in G`$ and $`\vec\mathbf{x} \in \mathbb{R}^{d,G}`$ we define
$`E_g(\vec{\mathbf{x} }) = (\mathbf{x} _{e_1(g)}|\ldots|\mathbf{x} _{e_w(g)})\in\mathbb{R}^{dw}`$.
Likewise, we let $`E(\vec\mathbf{x} )\in \mathbb{R}^{dw,G}`$ be
$`E(\vec\mathbf{x} )= (E_g(\vec{\mathbf{x} }))_{g\in G}`$.
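For concreteness, here is a minimal sketch of a proximity mapping and the induced $`E_g`$ for $`G=[T]`$. In the sketch `w` plays the role of a radius, so the width is $`2w+1`$; clamping out-of-range neighbours to the boundary is an assumption of the sketch, one of several reasonable conventions.

```python
import numpy as np

T, d, w = 8, 4, 1          # number of locations, per-location dimension, proximity radius

def e(g):
    """Proximity mapping for G = [T]: the 2w+1 locations at distance <= w from g,
    with e_1(g) = g; out-of-range neighbours are clamped to the boundary."""
    return [g] + [min(max(g + s, 0), T - 1) for s in range(-w, w + 1) if s != 0]

def E_g(X, g):
    """Concatenate the vectors at the locations near g: E_g(X) in R^{d*(2w+1)}."""
    return np.concatenate([X[gp] for gp in e(g)])

def E(X):
    """Apply E_g at every location: an element of R^{d*(2w+1), G}."""
    return np.stack([E_g(X, g) for g in range(T)])

X = np.random.default_rng(2).standard_normal((T, d))   # an ensemble (x_g)_{g in G}
print(E(X).shape)                                       # (T, d * (2w + 1))
```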
We next extend the definition of a hierarchy to accommodate the ensemble setting. Definition 7 (hierarchy). Let $`{\cal L}= \{L_1,\ldots,L_r\}`$ be a
collection of sets such that
$`L_1\subseteq L_2\subseteq\ldots\subseteq L_r = [n]`$. Let
$`\mathbf{e}:G\to G^w`$ be a proximity mapping. We say that
$`({\cal L},\mathbf{e})`$ is a hierarchy for
$`\mathbf{f}^*:{\cal X}^G\to \{\pm 1\}^{n,G}`$ of complexity
$`(r,K,M,B,\xi)`$ (or an $`(r,K,M,B,\xi)`$-hierarchy for short) if:

- For any $`j\in L_1`$ there is a $`(K,M,B,\xi)`$-PTF $`\tilde f_j:{\cal X}^w\to\{\pm 1\}`$ such that $`f^*_{j,g}(\vec\mathbf{x} )=\tilde f_j(E_g(\vec\mathbf{x} ))`$ for any $`\vec\mathbf{x} \in {\cal X}^G`$ and $`g\in G`$.
- For $`i\ge 2`$ and $`j\in L_i`$ there is a $`(K,M,B,\xi)`$-PTF $`\tilde f_j:\{\pm 1\}^{|L_{i-1}|w}\to\{\pm 1\}`$ such that $`f^*_{j,g}(\vec\mathbf{x} )=\tilde f_j(E_g(\mathbf{f}^*_{L_{i-1}}(\vec\mathbf{x} )))`$ for any $`\vec\mathbf{x} \in {\cal X}^G`$ and $`g\in G`$.

We note that the previous definition of hierarchy (i.e., Definitions
1 and
3) is the special case
$`w=|G|=1`$. Fix $`{\cal X}\subseteq [-1,1]^d`$, a location set $`G`$, a proximity
mapping $`\mathbf{e}:G\to G^w`$ of width $`w`$, some constant integer
$`K\ge1`$, and an activation function $`\sigma:\mathbb{R}\to\mathbb{R}`$
that is Lipschitz, bounded and is not a constant function. We will view
$`\sigma`$ and $`K`$ as fixed, and will allow big-$`O`$ notation to hide
constants that depend on $`\sigma`$ and $`K`$. We start by describing the residual network architecture that we will
consider. Let $`{\cal X}^G`$ be our instance space. The first layer
(actually, it is two layers, but it will be easier to consider it as one
layer) of the network computes the function $`\vec\mathbf{x} \mapsto W^1_2\,\sigma\left(W^1_1 E(\vec\mathbf{x} )+\mathbf{b}^1\right)`$ (applied location-wise, using the broadcasting conventions above). We assume that $`W^1_2\in \mathbb{R}^{n\times q}`$ is initialized to
$`0`$, while
$`(W^1_1,\mathbf{b}^1)\in \mathbb{R}^{q\times wd}\times \mathbb{R}^{q}`$
is initialized using $`\beta`$-Xavier initialization, as defined next. Definition 8 (Xavier Initialization). Fix $`1\ge \beta \ge 0`$. A
random pair
$`(W,\mathbf{b})\in \mathbb{R}^{q\times d}\times \mathbb{R}^{q}`$ has
$`\beta`$-Xavier distribution if the entries of $`W`$ are
i.i.d.Ā centered Gaussians of variance $`\frac{1-\beta^2}{d}`$, and
$`\mathbf{b}`$ is independent from $`W`$ and its entries are
i.i.d. centered Gaussians of variance $`\beta^2`$. The remaining layers ($`2\le k\le D-1`$) are of the form $`\vec\mathbf{z} \mapsto \vec\mathbf{z} + W^k_2\,\sigma\left(W^k_1 E(\vec\mathbf{z} )+\mathbf{b}^k\right)`$, where
$`(W^k_1,\mathbf{b}^k)\in \mathbb{R}^{q\times (wn)}\times\mathbb{R}^{q}`$
is initialized using $`\beta`$-Xavier initialization and
$`W^k_2\in \mathbb{R}^{n\times q}`$ is initialized to $`0`$. Finally,
the last layer computes $`\vec\mathbf{z} \mapsto W^D\vec\mathbf{z}`$ for an orthogonal matrix $`W^D\in \mathbb{R}^{n\times n}`$. We will
denote the collection of weight matrices by $`\vec W`$, and the function
computed by the network by $`\hat \mathbf{f}_{\vec W}`$. Fix a convex
loss function $`\ell:\mathbb{R}\to [0,\infty)`$; we extend it to a loss
$`\ell: \mathbb{R}^{G}\times \{\pm 1\}^G\to [0,\infty)`$ by averaging over locations. Likewise, for a function
$`\hat \mathbf{f}:{\cal X}^G\to \mathbb{R}^{n,G}`$ and $`j\in [n]`$ we
define the corresponding per-label empirical loss $`\ell_{S,j}(\hat\mathbf{f})`$ over the sample $`S`$. Finally, let $`\ell_S(\vec W)`$ denote the resulting empirical loss of the network $`\hat \mathbf{f}_{\vec W}`$. We will consider the following algorithm. Algorithm 9. At each step $`k=1,\ldots,D-1`$ optimize the objective
$`\ell_S(\vec W)+ \frac{\epsilon_\mathrm{opt}}{2}\|W^k_2\|^2`$ over
$`W^k_2`$, until a gradient of size $`\le\epsilon_\mathrm{opt}`$ is
reached. (As the $`k`$-th step objective is
$`\epsilon_\mathrm{opt}`$-strongly convex, the algorithm finds an
$`\frac{\epsilon_\mathrm{opt}}{2}`$-minimizer of it.)
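The following sketch illustrates one way to read the architecture and the layerwise procedure of Algorithm 9, restricted for simplicity to a single location (so $`E`$ is the identity). The exact layer formulas, the hinge-style surrogate loss, the toy labels, the use of plain SGD with a fixed number of steps (instead of the gradient-norm stopping rule), and the choice $`\sigma=\tanh`$ are all assumptions of the sketch rather than the paper's exact construction.

```python
import torch

d, n, q, D = 10, 6, 64, 5          # input dim, number of labels, width, number of layers
eps_opt, beta = 1e-3, 0.5          # ridge coefficient and Xavier parameter of Algorithm 9

def xavier_pair(fan_in):
    """beta-Xavier initialization (Definition 8): W_ij ~ N(0, (1-beta^2)/fan_in), b_i ~ N(0, beta^2)."""
    W = torch.randn(q, fan_in) * ((1 - beta ** 2) / fan_in) ** 0.5
    b = torch.randn(q) * beta
    return W, b

class Block(torch.nn.Module):
    """z -> [z +] W2 sigma(W1 z + b); W1 and b are frozen at initialization, only W2 is trained."""
    def __init__(self, fan_in, residual):
        super().__init__()
        W1, b = xavier_pair(fan_in)
        self.W1 = torch.nn.Parameter(W1, requires_grad=False)
        self.b = torch.nn.Parameter(b, requires_grad=False)
        self.W2 = torch.nn.Parameter(torch.zeros(n, q))      # initialized to 0
        self.residual = residual

    def forward(self, z):
        out = torch.tanh(z @ self.W1.T + self.b) @ self.W2.T
        return z + out if self.residual else out

blocks = [Block(d, residual=False)] + [Block(n, residual=True) for _ in range(D - 2)]
W_D = torch.linalg.qr(torch.randn(n, n)).Q                   # orthogonal output matrix

def forward(x):
    z = x
    for blk in blocks:
        z = blk(z)                    # untrained later blocks have W2 = 0, i.e. act as the identity
    return z @ W_D.T

# toy data: x in {+-1}^d, labels y in {+-1}^n (a placeholder, not a hierarchical f*)
g = torch.Generator().manual_seed(0)
X = torch.randint(0, 2, (256, d), generator=g).float() * 2 - 1
Y = torch.sign(X[:, :n] + 0.5)

# Algorithm 9 (sketch): for k = 1, ..., D-1, optimize only W2 of block k plus a ridge term
for blk in blocks:
    opt = torch.optim.SGD([blk.W2], lr=0.1)
    for _ in range(200):
        opt.zero_grad()
        margins = forward(X) * Y                             # per-label margins
        loss = (1 - margins).clamp(min=0).mean()             # hinge surrogate (the paper's loss differs)
        loss = loss + 0.5 * eps_opt * blk.W2.pow(2).sum()
        loss.backward()
        opt.step()
```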
We will consider a particular convex loss function $`\ell`$ (whose exact, somewhat non-standard form is chosen to facilitate the analysis). We are now ready to state our main result. Theorem 10 (Main). Assume that $`\mathbf{f}^*`$ has an
$`(r,K,M,B,\xi)`$-hierarchy and let
$`\gamma = \frac{1}{32}\min\left(\frac{1}{B},\xi\right)`$. Assume further that:

- $`D>r\cdot\left(\left\lceil\frac{\ln(8m|G|/\xi)}{\gamma} \right\rceil+1\right)`$
- $`\epsilon_\mathrm{opt}\le \frac{(1-e^{-\gamma})\xi}{16m^2|G|^2}`$

Then, there is a choice of $`\beta`$ and
$`q=\tilde O\left(\frac{(M+1)^4(wn)^{2K}}{\gamma^{4+2K}}\right)`$ such
that Algorithm 9 will learn a classifier with expected
error at most
$`\tilde O\left(\frac{D^2(M+1)^4(wn)^{2K+1}}{\gamma^{4+2K}m}\right)`$.

Hierarchical Learning by Resnets

In order to prove Theorem
10 it is enough to prove Theorem
11 below, which shows that there is a
choice of $`\beta`$ and
$`q=\tilde O\left(\frac{(M+1)^4(wn)^{2K}}{\gamma^{4+2K}}\right)`$ such
that Algorithm 9 will learn a classifier with empirical
large-margin error $`0`$ w.p. at least $`1-\frac{1}{m}`$. That is, we define the empirical $`\gamma`$-margin error $`\mathrm{Err}_{S,\gamma}`$ and show that Algorithm 9 will learn a classifier
$`\hat \mathbf{f}`$ with $`\mathrm{Err}_{S,1/2}(\hat\mathbf{f})=0`$ w.p.
at least $`1-\frac{1}{m}`$. Let us call such an algorithm $`(1/m)`$-consistent.
Given this guarantee, Theorem
10 will follow from a standard
parameter counting argument: The number of trained parameters is
$`p=Dqn`$, and their magnitude is bounded by
$`\frac{2n}{\epsilon_\mathrm{opt}}+1`$ due to the $`\ell^2`$
regularization term. Likewise, excluding the small probability event
that one of the initial weights has magnitude $`\ge \ln(Dq(n+d)wm)`$
(which happens w.p. $`\ll \frac{1}{m}`$, since all $`Dq(n+d)w`$ initial
weights are centered Gaussians with variance $`\le 1`$), it is not hard
to verify that as a composition of $`2D`$ layers, the network's output
is $`L`$-Lipschitz w.r.t. the trained parameters for
$`L=2^{\tilde O(D)}`$. Thus, the expected error of any
$`(1/m)`$-consistent algorithm is
$`\tilde O\left(\frac{p\log(L)}{m}\right) =\tilde O\left(\frac{Dp}{m}\right) = \tilde O\left(\frac{D^2qn}{m}\right)`$.
(See Lemma
23 for a precise
statement). Theorem 11 (Main - Restated). Let
$`\gamma = \frac{1}{32}\min\left(\frac{1}{B},\xi\right)`$. Assume that:

- $`\mathbf{f}^*`$ has an $`(r,K,M,B,\xi)`$-hierarchy $`({\cal L},\mathbf{e})`$
- $`D>r\cdot\left(\left\lceil\frac{\ln(8m|G|/\xi)}{\gamma} \right\rceil+1\right)`$
- $`\epsilon_\mathrm{opt}\le \frac{(1-e^{-\gamma})\xi}{16m^2|G|^2}`$

Then there is a choice of $`\beta`$ such that
w.p. $`1-2n mD |G|\exp\left(-\Omega\left(q\cdot\frac{ \gamma^{2K+4}}{(wn)^{2K}(M+1)^4}\right)\right)`$
over the initial choice of the weights, Algorithm 9 will learn a classifier
$`\hat\mathbf{f}:{\cal X}^G\to\mathbb{R}^{n,G}`$ with
$`\mathrm{Err}_{S,1/2}(\hat\mathbf{f})=0`$. For $`1\le k\le D`$, let
$`\hat \mathbf{f}^k:{\cal X}^G\to \mathbb{R}^{n,G}`$ be the function
computed by the network after the $`k`$-th layer is trained. Also, let
$`\Gamma^k:{\cal X}^G\to \mathbb{R}^{n,G}`$ be the function computed by
the layers $`1`$ to $`k`$ after the $`k`$-th layer is trained. For
$`k=0`$ we denote by $`\hat \mathbf{f}^0=\Gamma^0`$ the identity mapping
from $`{\cal X}^G`$ to $`\mathbb{R}^{d,G}`$. We note that when
Algorithm 9 trains the $`k`$-th layer we have
$`W^{k'}_2=0`$ for any $`k'>k`$. Hence, when the $`k`$-th layer is trained, the $`k'`$-th layer is simply the
identity function for any $`k'>k`$. As a result, we have
$`\hat \mathbf{f}^k(\vec\mathbf{x} )=W^D \Gamma^k(\vec\mathbf{x} )`$. Our first observation in the proof of Theorem
11 is that the $`k`$-th step of Algorithm 9 (i.e., obtaining $`\hat\mathbf{f}^k`$
from $`\hat \mathbf{f}^{k-1}`$) is essentially equivalent to learning a
linear classifier on top of a random-features extension of the data
representation
$`\vec\mathbf{x} \mapsto \hat \mathbf{f}^{k-1}(\vec\mathbf{x} )`$.
Specifically, define an input-space embedding
$`\Phi^{k-1}:{\cal X}^G\to \mathbb{R}^{q,G}`$, and for $`\mathbf{w}\in\mathbb{R}^{q}`$ a corresponding linear predictor $`\hat\mathbf{f}^k_{j,\mathbf{w}}`$. We have the following. Lemma 12. For any $`D-1\ge k\ge 1`$,
$`\hat \mathbf{f}_j^k = \hat\mathbf{f}^k_{j,\mathbf{w}}`$ where
$`\mathbf{w}`$ is an $`\frac{\epsilon_\mathrm{opt}}{2}`$-minimizer of
the convex objective $`\ell^k_{S,j}`$ over $`\mathbf{w}\in \mathbb{R}^{q}`$. In particular, if we denote by $`\hat W^k_2`$ the value of $`W^k_2`$
after the $`k`$-th layer is trained, then we have
$`\hat \mathbf{f}_j^k = \hat\mathbf{f}^k_{j,\mathbf{w}}`$ where
$`\mathbf{w}`$ is the $`j`$-th row of the matrix $`W = W^D\hat W^k_2`$.
It remains therefore to show that $`\mathbf{w}`$ minimizes
$`\ell^k_{S,j}`$. To this end, we note that at the $`k`$-th step
Algorithm 9 finds an
$`\frac{\epsilon_\mathrm{opt}}{2}`$-minimizer of the $`k`$-th step objective. As a result, $`\hat W:=W^D\hat W^k_2`$ is an
$`\frac{\epsilon_\mathrm{opt}}{2}`$-minimizer of the corresponding objective (written in terms of $`W^D W^k_2`$). In particular, $`\mathbf{w}=\hat W_{j\cdot}`$ must be an
$`\frac{\epsilon_\mathrm{opt}}{2}`$-minimizer of $`\ell^k_{S,j}`$.
Finally, since $`\ell^{k}_{S,j}`$ is $`\epsilon_\mathrm{opt}`$-strongly
convex, Equation
[eq:strongly_conv_guarantee]
implies the desired guarantee for any $`\mathbf{w}^*\in \mathbb{R}^{q}`$. ◻ With Lemma 12 at hand, we can present
the strategy of the proof. Since the labels in $`L_1`$ are PTFs of the
input, we will learn them when the first layer is trained. That is,
$`\hat\mathbf{f}^1`$ will predict the labels in $`L_1`$ correctly. The
reason is that, roughly speaking, PTFs are efficiently
learnable by training a linear classifier on top of a random-features
embedding. Since $`\hat\mathbf{f}^1`$ predicts the labels in $`L_1`$ correctly,
the labels in $`L_2`$ become a simple function of $`\hat\mathbf{f}^1`$;
concretely, a PTF of $`\mathrm{sign}(\hat\mathbf{f}^1)`$. It is therefore
tempting to try using the same reasoning as above in order to prove that
after training the next layer, we will learn the labels in $`L_2`$, and
more generally, that after $`r`$ layers are trained, the network will
predict all labels correctly. This, however, won't work quite so smoothly: a PTF
of $`\mathrm{sign}(\hat\mathbf{f}^1)`$ is not necessarily learnable by
training a linear classifier on top of a random-features embedding of
$`\hat\mathbf{f}^1`$. To circumvent this, we show that after the network
predicts a label $`j`$ correctly, the loss of this label keeps improving
when training additional layers, so after training an additional
$`O(B+1/\xi)`$ layers, the loss will be small enough to guarantee that
the labels in $`L_2`$ are PTFs of $`\hat\mathbf{f}^1`$ itself (and not just of
$`\mathrm{sign}(\hat\mathbf{f}^1)`$). Thus, after $`O(B+1/\xi)`$ layers
are trained, the network will predict the labels in $`L_2`$ correctly,
and more generally, after $`O(rB+r/\xi)`$ layers are trained, the
network will predict all the labels correctly. The course of the proof is as follows. We start with Lemma
14 which shows that if a
label $`j`$ is a large PTF of $`\hat\mathbf{f}^k`$ then
$`\hat \mathbf{f}^{k+1}`$ will predict it correctly. To be more
accurate, we show that if a robust version of
$`\ell_{S,j}(p\circ E\circ\hat\mathbf{f}^k)`$ is small for a
polynomial $`p`$, then $`\ell_{S,j}(\hat\mathbf{f}^{k+1})`$ is
small. We then continue with Lemma
15 which uses Lemma
14 to show that (i)
$`\ell_{S,j}(\hat\mathbf{f}^1)`$ is small for any $`j\in L_1`$, (ii)
for any $`j\in [n]`$, if $`\ell_{S,j}(\hat\mathbf{f}^k)`$ is small,
then it will shrink exponentially as we train deeper layers
and (iii) if $`\ell_{S,j}(\hat\mathbf{f}^k)`$ is very small for any
$`j\in L_{i-1}`$, then $`\ell_{S,j}(\hat\mathbf{f}^{k+1})`$ is small
for any $`j\in L_{i}`$. To carry out the first step, we will need some notation. First, we
define the $`\epsilon`$-robust version $`\ell^{\mathrm{rob},\epsilon}`$ of $`\ell`$. Note that for $`z\le 1`$ we have
$`\ell^{\mathrm{rob},\epsilon}(z) =\ell(z-\epsilon)`$, while for $`z< 0`$
we have $`\ell^{\mathrm{rob},\epsilon}(z) =\ell(z)=\infty`$. Denote the
Hermite expansion of $`\sigma`$ by $`\sigma = \sum_{i=0}^{\infty}a_i h_i`$. Let $`K'`$ be the minimal integer $`K'\ge K`$ such that $`a_{K'}\ne0`$
(such $`K'`$ exists as otherwise $`\sigma`$ is a polynomial, which
contradicts the assumption that it is bounded and non-constant). For
$`\epsilon>0`$ define
$`\beta(\epsilon) = \beta_{\sigma,K',K}(\epsilon)<1`$ as the minimal
positive number greater than $`\frac{3}{4}`$ such that the required approximation guarantee holds whenever
$`\beta_{\sigma,K',K}(\epsilon)\le \beta<1`$. Note that $`\beta(\epsilon)`$ is well defined as
$`h(\beta):=\frac{1-\beta^2}{\sqrt{1-2(1-\beta^2)^2}}`$ is continuous
near $`\beta=1`$ and equals $`0`$ at $`\beta=1`$. In fact, since
$`h`$ is differentiable near $`\beta=1`$ we have that
$`1- \beta(\epsilon) = \Omega\left(\epsilon2^{-K'}\frac{a_{K'}}{\|\sigma\|}\right)`$.
In particular, for fixed $`\sigma,K',K`$ we have that
$`1- \beta(\epsilon) = \Omega(\epsilon)`$. We also define a quantity $`\delta(\epsilon,\beta,q,M,n)`$, used below, and note that for fixed $`\sigma,K',K`$ and $`1-\beta = \Omega(\epsilon)`$
it admits a convenient bound. We will need the following lemma, proved at the end of Section
9, which shows that it is possible
to approximate a polynomial by composing a random layer and a linear
function. Lemma 13. Fix $`{\cal X}\subset [-1,1]^n`$, a degree $`K`$
polynomial $`p:{\cal X}\to [-1,1]`$, $`K'\ge K`$ and $`\epsilon>0`$. Let
$`(W,\mathbf{b})\in\mathbb{R}^{q\times n}\times \mathbb{R}^{q}`$ be
$`\beta`$-Xavier pair for $`1>\beta\ge \beta_{\sigma,K',K}(\epsilon)`$.
Then there is a vector
$`\mathbf{w}=\mathbf{w}(W,\mathbf{b})\in\mathbb{B}^{q}`$ with the stated approximation property. Turning to the proof of Lemma 14, it suffices to show the required inequality for any $`t`$ and $`g`$. Since
$`y^t_{j,g}\cdot p \circ E_g\circ \hat\mathbf{f}^k(\vec\mathbf{x} ^t)\ge\epsilon_1`$
(as otherwise we will have
$`\ell^{\mathrm{rob},\epsilon_1}_{S,j}(p \circ E_g\circ \hat\mathbf{f}^k) = \infty`$),
it is enough to show that w.p. $`1-\delta`$ there is
$`\tilde\mathbf{w}^*\in\mathbb{B}^d`$ such that for any $`t`$ and $`g`$. Indeed, in this case Equation
[eq:lem:pol_imp_small_loss]
holds true for
$`\mathbf{w}^* = \frac{\tilde\mathbf{w}^*}{1+\epsilon_1/2}`$. Finally,
since it is enough to show that w.p. $`1-\delta`$ there is
$`\tilde\mathbf{w}^*\in\mathbb{B}^d`$ such that for the polynomial
$`\tilde p(\mathbf{x} ^1|\ldots|\mathbf{x} ^w)=p(\mathbf{x} ^1|\ldots|\mathbf{x} ^w)-x^1_j`$
(note that
$`\tilde p(E(\hat\mathbf{f}^k(\vec\mathbf{x} )))=p(E(\hat\mathbf{f}^k(\vec\mathbf{x} ))) - \hat \mathbf{f}^{k}_{j}(\vec\mathbf{x} )`$
and that $`\|\tilde p\|_\mathrm{co}\le \|p\|_\mathrm{co}+1`$), and for
any $`t`$ and $`g`$. The existence of such $`\mathbf{w}^*`$
w.p. $`1-\delta`$ follows from Lemma
13 and a union bound over
$`X = \{E_g\circ\hat{\mathbf{f}}^{k}(\vec\mathbf{x} ^t) : g\in G,t\in [m]\}`$. ◻ We continue with the following lemma, which quantitatively describes how
the loss of the different labels improves as deeper and deeper layers are trained. Lemma 15. Let
$`\gamma = \frac{1}{32}\min\left(\frac{1}{B},\xi\right)`$. Assume that
$`1>\beta\ge\beta(\gamma/2)`$ and let
$`\delta=m|G|\delta(\gamma/2,\beta,q,\|p\|_\mathrm{co}+5,wn)`$. Then:

- For any $`j\in L_{1}`$, w.p. $`1-\delta`$, $`\ell_{S,j}(\mathbf{f}^1)\le \frac{1}{4m|G|} + \epsilon_\mathrm{opt}`$.
- Given that $`\ell_{S,j}(\mathbf{f}^{k})\le \frac{1}{2m|G|}`$, we have that $`\ell_{S,j}(\mathbf{f}^{k+1})\le e^{-\gamma}\ell_{S,j}(\mathbf{f}^{k})+\epsilon_\mathrm{opt}`$ w.p. $`1-\delta`$. Furthermore, if $`\epsilon_\mathrm{opt}\le \frac{1-e^{-\gamma}}{2m|G|}`$ then w.p. $`1-t\delta`$ we have $`\ell_{S,j}(\mathbf{f}^{k+t})\le e^{-\gamma t}\ell_{S,j}(\mathbf{f}^{k})+\frac{1-e^{-\gamma t}}{1-e^{-\gamma}}\epsilon_\mathrm{opt}`$.
- Given that $`\ell_{S,j'}(\mathbf{f}^{k})\le \frac{\xi}{8m^2|G|^2}`$ for any $`j'\in L_{i-1}`$, we have that $`\ell_{S,j}(\mathbf{f}^{k+1})\le \frac{1}{4m|G|}+\epsilon_\mathrm{opt}`$ for any $`j\in L_{i}`$ w.p. $`1-|L_i|\delta`$.

Before proving Lemma
15, we show that it
implies Theorem 11. Proof. (of Theorem 11) Choose $`\beta = \beta(\gamma/2)`$
(more generally, $`1>\beta\ge \beta(\gamma/2)`$ such that
$`1-\beta = \Omega(\gamma)`$). Denote
$`\delta=m|G|\delta(\gamma/2,\beta,q,M+5,wn)`$ and note that by Equation
[eq:beta_est] we have a suitable bound on $`\delta`$. Since
$`\epsilon_\mathrm{opt}\le \frac{(1-e^{-\gamma})\xi}{16m^2|G|^2}`$, we
have that if $`\ell_{S,j}(\mathbf{f}^{k})\le \frac{1}{2m|G|}`$ then w.p.
$`1-t\delta`$ the loss decays geometrically with $`t`$, as in Lemma 15. Choosing $`t_0 = \left\lceil\frac{\ln(8m|G|/\xi)}{\gamma} \right\rceil`$,
we get that the loss drops below $`\frac{\xi}{8m^2|G|^2}`$ w.p. $`1-t_0\delta`$. Hence, it is not hard to verify by induction on
$`1\le i\le r`$ that for any $`j\in L_i`$, if $`k\ge i (t_0+1)`$ then w.p. $`1-nk\delta`$ the label $`j`$ is predicted with the required margin. ◻ To prove Lemma 15 we will use the following
fact, which is an immediate consequence of the definition of the loss. Fact 16.

- If $`\ell_{S,j}(\hat\mathbf{f})\le \frac{\epsilon}{m|G|}`$ then for any $`t\in [m]`$ and $`g\in G`$ we have $`1 \ge \hat f_{j,g}(\vec\mathbf{x} ^t)\cdot f^*_{j,g}(\vec\mathbf{x} ^t)\ge \frac{(1-\epsilon)}{2B}`$.
- If $`\ell_{S,j}(\hat\mathbf{f})\le \frac{\epsilon}{4m^2|G|^2}`$ then for any $`t\in [m]`$ and $`g\in G`$ we have $`1 \ge \hat f_{j,g}(\vec\mathbf{x} ^t)\cdot f^*_{j,g}(\vec\mathbf{x} ^t)\ge (1-\epsilon)(1-\xi/2)`$.
- If for any $`t\in [m]`$ and $`g\in G`$ we have $`1 \ge \hat f_{j,g}(\vec\mathbf{x} ^t)\cdot f^*_{j,g}(\vec\mathbf{x} ^t)\ge \frac{1}{B}`$ then $`\ell^{\mathrm{rob},1/2B}_{S,j}(\hat\mathbf{f})\le \frac{1}{4m|G|}`$.

We next prove Lemma
15. Proof. (of lemma
15) Let $`p_1,\ldots p_n`$ be
polynomials that witness that $`({\cal L},e)`$ is an
$`(r,K,M,B,\xi)`$-hierarchy for $`\mathbf{f}^*`$. We start with the
first item. By the definition of hierarchy, we have that for any
$`t\in [m]`$ and $`g\in G`$,
$`B\ge p_j(E_g(\hat\mathbf{f}^0(\vec\mathbf{x} ^t)))f^*_{j,g}(\vec\mathbf{x} ^t)\ge 1`$.
Fact 16 implies that for
$`\tilde p_j = \frac{1}{B}p_j`$ we have
$`\ell^{\mathrm{rob},\gamma}_{S,j}(\tilde p_j\circ \hat\mathbf{f}^0)\le\ell^{\mathrm{rob},1/2B}_{S,j}(\tilde p_j\circ \hat\mathbf{f}^0)\le \frac{1}{4m |G|}`$.
The first item therefore follows from Lemma
14. The third item is proved similarly. If
$`\ell_{S,j'}(\mathbf{f}^{k})\le \frac{\xi}{8m^2|G|^2}`$ for any
$`j'\in L_{i-1}`$ then Fact
16 implies that for any
$`j'\in L_{i-1},\;t\in [m]`$ and $`g\in G`$ the corresponding two-sided margin bound holds. Hence, by the definition of hierarchy, we have that for any $`t\in [m]`$
and $`g\in G`$,
$`B\ge p_j(E_g(\hat\mathbf{f}^k(\vec\mathbf{x} ^t)))f^*_{j,g}(\vec\mathbf{x} ^t)\ge 1`$.
Fact 16 now implies that for
$`\tilde p_j = \frac{1}{B}p_j`$ we have
$`\ell^{\mathrm{rob},\gamma}_{S,j}(\tilde p_j\circ \hat\mathbf{f}^k)\le\ell^{\mathrm{rob},1/2B}_{S,j}(\tilde p_j\circ \hat\mathbf{f}^k)\le \frac{1}{4m |G|}`$.
The third item therefore follows from Lemma
14. It remains to prove the second item. Define
$`q:\mathbb{R}^{n}\to\mathbb{R}`$ by
$`q(\mathbf{x} ) = 1.5 x_j-0.5x_j^3`$. By Lemma
14 it is enough to bound the robust loss $`\ell^{\mathrm{rob},\gamma}_{S,j}(q \circ E\circ \hat\mathbf{f}^k)`$; this is Equation [eq:loss_improvments_proof]. To do so, we note that since
$`\ell_{S,j}(\hat\mathbf{f}^k)\le \frac{1}{2m|G|}`$, Fact
16 implies that
$`\forall t,g,\;y^t_{j,g}\hat f^k_{j,g}(\vec \mathbf{x} ^t)\ge 1/(4B)`$.
Now, since $`q`$ is odd, the per-example loss can be expressed through
$`\tilde q\left(y^t_{j,g}\hat f^k_{j,g}(\vec \mathbf{x} ^t)\right)`$, where $`\tilde q(x)=1.5x-0.5x^3`$. Equation
[eq:loss_improvments_proof]
therefore follows from the following claim. Claim 1. Let $`\tilde q(x) = 1.5x-0.5x^3`$. Then, for any
$`\frac{1}{4B}\le x\le 1`$ we have
$`\ell^{\mathrm{rob},\gamma}(\tilde q(x))=\ell(\tilde q(x)-\gamma)\le e^{-\gamma}\ell(x)`$. Proof. Denote $`x'=\min(x,1-\xi/2)`$ and note that
$`\ell(x)=\ell(x')`$; the claim then follows by a direct calculation. ◻ ◻ In this work, we argued that the availability of extensive and granular
labeling suggests that the target functions in modern deep learning are
inherently hierarchical, and we showed that deep learning (specifically,
SGD on residual networks) can exploit such hierarchical structure. Our
proof builds on a layerwise mechanism of the learning process, where
each layer acts simultaneously as a representation learner and a
predictor, iteratively refining the output of the previous layer. Our
results give rise to several perspectives, which we outline below:

- **Supervised learning is inherently tractable.** Contrary to worst-case hardness results, the existence of a teacher (and thus a hierarchy) implies that the problem is learnable in polynomial time, given the right supervision.
- **Very deep models are provably learnable.** Unlike previous theoretical works, we prove that ResNets can learn models that are realizable only by very deep circuits.
- **A middle ground between software engineering and learning.** Modern deep learning can be viewed as a relaxation of software engineering and a strengthening of classical learning. Instead of manually "codifying the brain's algorithm" (traditional AI) or learning blindly from input-output pairs (classical ML), we provide snippets of the brain's logic via related labels. This approach renders the learning task feasible without requiring full knowledge of the underlying circuit.
- **A modified narrative for learning theory.** Historically, the narrative governing learning theory, particularly from a computational perspective, has been the following: (i) Learning all functions is impossible. (ii) Upon closer inspection, we are interested only in functions that are efficiently computable. (iii) This function class is learnable using polynomial samples. (iv) Unfortunately, learning it requires exponential time. (v) Nevertheless, some simple function classes are learnable.

The aforementioned narrative, however, is at odds with practice. Our
work suggests that it might be possible to replace item (v) with the
following: "(v) Re-evaluating our scope, we are primarily interested
in functions that are efficiently computable by humans. (vi) We have
good reasons to believe that these functions are hierarchical.
(vii) As a result, they are learnable using polynomial time and
samples." Our work suggests using hierarchical models as a basis for understanding
neural networks. Significant future work is required to advance this
direction. First, theoretically, it would be useful to extend the scope
of hierarchical models. To this end, one might:

- Analyze attention mechanisms through the lens of hierarchical models.
- Extend hierarchical models to capture a "single-function hierarchy." This refers to a scenario where a function $`f`$ has "simple versions" that are easy to learn, the mastery of which renders $`f`$ itself easy to learn. This aligns with previous work on the learnability of non-linear models via gradient-based algorithms, as many of these studies assumed (often implicitly) such a hierarchical structure on the target model.
- Extend the inherent justification of hierarchical models by generalizing Theorem 4. That is, define formal models of teachers that are "partially aware" of their internal logic, and show that hierarchical labeling which facilitates efficient learnability can be provided by such teachers. Put differently, show that a "generic non-linear projection" of a hierarchical function is hierarchical itself.
- Identify low-complexity hierarchies for known algorithms. This could lead to new hierarchical architectures, and might even shed some light on how humans discovered these algorithms, and facilitate teaching them.

Second, on the empirical side, it would be valuable to:

- Build practical learning algorithms with principled optimization
procedures based more directly on the hierarchical learning
perspective.
- Empirically test the hypothesis that, given enough labels, real-world
data exhibits a hierarchical structure. In this respect, finding this
explicit hierarchical structure can be viewed as an interpretation of
the learned model. Finally, we address specific limitations of our results, which rely on
several assumptions. We outline the most prominent ones here, hoping
that future work will be able to relax these constraints. We begin with the technical assumptions. A clear direction for future
work is to improve our quantitative bounds; while polynomial, they are
likely far from optimal. Other technical constraints include the
assumption that the output matrix is orthogonal and that the number of
labels equals the dimension of the hidden layers. It would be more
natural to consider an arbitrary number of labels and an output matrix
initialized as a Xavier matrix (we note, however, that Xavier matrices
are āalmost orthogonalā). Finally, the loss function used in our
analysis is non-standard. Next, we address more inherent limitations. First, we assumed extremely
strong supervision: that each example comes with all positive labels it
possesses. In practice, one usually obtains only a single positive label
per example. We note that while it is straightforward to show that
hierarchical models are efficiently learnable with this standard
supervision, proving that gradient-based algorithms on neural networks
succeed in this setting remains an open problem. Another limitation is our assumption of layer-wise training, whereas in
reality, all layers are typically trained jointly. While this makes the
mathematical analysis more intricate, joint training is likely superior
for several reasons. First, empirically, it is the standard method.
Second, if the goal of training lower layers is merely to learn
representations, there is little utility in exhausting data to achieve
marginal improvements in the loss. Indeed, to ensure data efficiency, it
is preferable to utilize features as soon as they are sufficiently good
(i.e., once the gradient w.r.t. these features is large). In the sequel we denote by $`(\mathbb{R}^n)^{\otimes t}`$ the space of
order-$`t`$ real tensors all of whose axes have dimension $`n`$. We equip it
with the inner product
$`{\left\langle A,B \right\rangle} = \sum_{1\le i_1,\ldots,i_t\le n}A_{i_1,\ldots,i_t}B_{i_1,\ldots,i_t}`$.
For $`\mathbf{x} \in\mathbb{R}^n`$ we denote by
$`\mathbf{x} ^{\otimes t}\in (\mathbb{R}^n)^{\otimes t}`$ the tensor
whose $`(i_1,\ldots,i_t)`$ entry is $`\prod_{j=1}^t x_{i_j}`$. We note
that
$`{\left\langle \mathbf{x} ^{\otimes t},\mathbf{y}^{\otimes t} \right\rangle} = {\left\langle \mathbf{x} ,\mathbf{y} \right\rangle}^t`$.
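The last identity can be verified directly by expanding the inner product and factoring the resulting sum: $`{\left\langle \mathbf{x} ^{\otimes t},\mathbf{y}^{\otimes t} \right\rangle} = \sum_{1\le i_1,\ldots,i_t\le n}\prod_{j=1}^t x_{i_j}y_{i_j} = \prod_{j=1}^t\left(\sum_{i=1}^n x_i y_i\right) = {\left\langle \mathbf{x} ,\mathbf{y} \right\rangle}^t`$.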
We will use the Chernoff and Hoeffding inequalities. Lemma 17 (Hoeffding). Let $`X_1,\ldots,X_q\in [-B,B]`$ be
i.i.d. with mean $`\mu`$. Then, for any $`\epsilon > 0`$ we have $`\Pr\left(\left|\frac{1}{q}\sum_{i=1}^q X_i-\mu\right|\ge\epsilon\right)\le 2\exp\left(-\frac{q\epsilon^2}{2B^2}\right)`$. Hence, defining $`\epsilon = \delta\frac{\mu_+}{|\mu|}`$, we get a corresponding bound for
$`\epsilon\le \frac{\mu_+}{|\mu|}`$. A similar argument implies an analogous bound for $`\epsilon\le \frac{\mu_-}{|\mu|}`$. As a result, for
$`\epsilon\le \frac{\min\left(\mu_+,\mu_-\right)}{2|\mu|}`$ both bounds hold. ◻ We will use the following asymptotics of binomial coefficients, which
follows from Stirling's approximation. Lemma 20. We have
$`\frac{\binom{2k}{k}}{2^{2k}}\sim \frac{1}{\sqrt{\pi k}}`$. We will also need the following approximation of the sign function using
polynomials. Lemma 21. Let $`0<\xi<1`$ and $`\epsilon>0`$. There is a polynomial
$`p:\mathbb{R}\to\mathbb{R}`$ such that:

- $`p([-1,1])\subseteq [-1,1]`$
- For any $`x\in [-1,1]\setminus [-\xi,\xi]`$ we have $`|p(x)-\mathrm{sign}(x)|\le \epsilon`$.
- $`\deg(p) = O\left(\frac{\log(1/\epsilon)}{\xi}\right)`$
- $`p`$'s coefficients are all bounded by $`2^{O\left(\frac{\log(1/\epsilon)}{\xi}\right)}`$.

The existence of a polynomial that satisfies the first three properties
is shown in prior work; the bound on the coefficients (the last item) follows from
Lemma 2.8 therein.
Finally, we will use the following bound on the coefficient norm of a
composition of a polynomial with a linear function. Lemma 22. Fix a degree $`K`$ polynomial
$`p:\mathbb{R}^n\to \mathbb{R}`$ and $`A\in M_{n,m}`$ whose rows have
Euclidean norm at most $`R`$. Define
$`q(\mathbf{x} )= p(A\mathbf{x} )`$. Then,
$`\|q\|_\mathrm{co}\le \|p\|_\mathrm{co}R^K(n+1)^{K/2}`$ Proof. Let $`\mathbf{a}_i`$ be the $`i`$āth row of $`A`$. Denote
$`p(\mathbf{x} )= \sum_{\alpha\in \{0,\ldots,K\}^{n},\|\alpha\|_1\le K} b_\alpha \mathbf{x} ^\alpha`$
and
$`e_\alpha(\mathbf{x} ) = \prod_{i=1}^n {\left\langle \mathbf{a}_i,\mathbf{x} \right\rangle}^{\alpha_i}`$.
We have
$`q = \sum_{\alpha\in \{0,\ldots,K\}^{n},\|\alpha\|_1\le K} b_\alpha e_\alpha`$.
Hence, the claim follows by bounding the coefficient norm of each $`e_\alpha`$. ◻ It is well established that for "nicely behaved" function classes in
which functions are defined by a vector of parameters, the sample
complexity is proportional to the number of parameters. For instance, a
function class of the form
$`{\cal F}= \{\mathbf{x} \mapsto F(\mathbf{w},\mathbf{x} ) : \mathbf{w}\in[-B,B]^p\}`$
for a function $`F`$ that is $`L`$-Lipschitz in the first argument has
realizable large margin sample complexity of
$`\tilde O\left(\frac{p}{\epsilon}\right)`$. To be more precise, if
there is a function in $`{\cal F}`$ with $`\gamma`$-error $`0`$, then
any algorithm that is guaranteed to return a function with empirical
$`\gamma`$-error $`0`$ enjoys this aforementioned sample complexity
guarantee. We next slightly extend this fact, allowing $`F`$ to be
random and allowing the algorithm to fail with some small probability. Lemma 23. Suppose that
$`{\cal F}\subset (\mathbb{R}^{n})^{\cal X}`$ is a random function class
such that:

- There is a random function $`F:[-B,B]^p\times{\cal X}\to \mathbb{R}^n`$ such that $`{\cal F}= \{\mathbf{x} \mapsto F(\mathbf{w},\mathbf{x} ) : \mathbf{w}\in[-B,B]^p\}`$.
- W.p. $`1-\delta_1`$, for any $`\mathbf{x} \in{\cal X}`$, $`\mathbf{w}\mapsto F(\mathbf{w},\mathbf{x} )`$ is $`L`$-Lipschitz w.r.t. the $`\ell^\infty`$ norm.

Let $`{\cal A}`$ be an algorithm, and assume that for some
$`\mathbf{f}^*:{\cal X}\to\{\pm 1\}^n`$, $`{\cal A}`$ has the property
that on any $`m`$-point sample $`S`$ labeled by $`\mathbf{f}^*`$, it
returns $`\hat\mathbf{f}\in{\cal F}`$ with
$`\mathrm{Err}_{S,\gamma}(\hat\mathbf{f})=0`$ w.p. $`1-\delta_2`$ (where
the probability is over the randomness of $`F`$ and the internal
randomness of $`{\cal A}`$). Then, if $`S`$ is an i.i.d. sample labeled
by $`\mathbf{f}^*`$:

- $`\mathrm{Err}_{{\cal D}}(\hat\mathbf{f})\le\epsilon`$ w.p. at least $`1-(LB/\gamma)^{O(p)}(1-\epsilon)^m-\delta_1-\delta_2`$
- $`\mathbb{E}_S\mathrm{Err}_{{\cal D}}(\hat\mathbf{f}) \le O\left(\frac{p\ln(LB/\gamma) + \ln(m)}{m}\right) + \delta_1+\delta_2`$

Proof. (sketch) For $`\hat\mathbf{f}:{\cal X}\to \mathbb{R}^n`$ we
define the population margin error $`\mathrm{Err}_{{\cal D},\gamma}`$ analogously. It is not hard to see that w.p. $`1-\delta_1`$ there is
$`\tilde{\cal F}\subseteq{\cal F}`$ of size $`N=(LB/\gamma)^{O(p)}`$
such that for any $`\mathbf{g}\in {\cal F}`$ there is
$`\tilde \mathbf{g}\in\tilde{\cal F}`$ such that Let $`A`$ be the event that such $`\tilde{\cal F}`$ exists, that
$`{\cal A}`$ returns a function in $`{\cal F}`$ with
$`\mathrm{Err}_{S,\gamma}(\hat \mathbf{f})=0`$, and that for any
$`\tilde \mathbf{g}\in \tilde {\cal F}`$ with
$`\mathrm{Err}_{{\cal D},\gamma/2}(\tilde\mathbf{g})\ge \epsilon`$ we
have $`\mathrm{Err}_{S,\gamma/2}(\tilde\mathbf{g}) >0`$. We have that
the probability of $`A`$ is at least
$`1-\delta_1-\delta_2-N(1-\epsilon)^m`$. Given $`A`$ we have for any
$`\mathbf{g}\in{\cal F}`$, Thus, the probability that $`{\cal A}`$ return a function with error
$`\ge \epsilon`$ is at most $`N(1-\epsilon)^m +\delta_1+\delta_2`$ which
proves the first part of the lemma. As for the second part, we note that
we have a bound that depends on $`\epsilon`$. Optimizing over $`\epsilon`$ we get
$`\mathbb{E}_S\mathrm{Err}_{\cal D}(\hat\mathbf{f}) \le \frac{\ln(Nm)}{m} + \delta_1+\delta_2`$
which proves the second part. ◻ The results we state next are standard and can be found in the literature on kernel methods. Let
$`{\cal X}`$ be a set. A kernel is a function
$`k:{\cal X}\times {\cal X}\to\mathbb{R}`$ such that for every
$`x_1,\ldots,x_m\in {\cal X}`$ the matrix $`\{k(x_i,x_j)\}_{i,j}`$ is
positive semi-definite. A kernel space is a Hilbert space $`{\cal H}`$
of functions from $`{\cal X}`$ to $`\mathbb{R}`$ such that for every
$`x\in {\cal X}`$ the linear functional $`f\in{\cal H}\mapsto f(x)`$ is
bounded. The following theorem describes a one-to-one correspondence
between kernels and kernel spaces. Theorem 24. For every kernel $`k`$ there exists a unique kernel
space $`{\cal H}_k`$ such that for every $`x,x'\in {\cal X}`$,
$`k(x,x') = {\left\langle k(\cdot,x),k(\cdot,x') \right\rangle}_{{\cal H}_k}`$.
Likewise, for every kernel space $`{\cal H}`$ there is a kernel $`k`$
for which $`{\cal H}={\cal H}_k`$. We denote the norm and inner product in $`{\cal H}_k`$ by
$`\|\cdot\|_k`$ and $`{\left\langle \cdot,\cdot \right\rangle}_k`$. The
following theorem describes a tight connection between kernels and
embeddings of $`{\cal X}`$ into Hilbert spaces. Theorem 25. A function $`k:{\cal X}\times {\cal X}\to \mathbb{R}`$
is a kernel if and only if there exists a mapping
$`\Psi: {\cal X}\to{\cal H}`$ to some Hilbert space for which
$`k(x,x')={\left\langle \Psi(x),\Psi(x') \right\rangle}_{{\cal H}}`$. In
this case,
$`{\cal H}_k = \{f_{\Psi,\mathbf{v}} \mid \mathbf{v}\in{\cal H}\}`$
where
$`f_{\Psi,\mathbf{v}}(x) = {\left\langle \mathbf{v},\Psi(x) \right\rangle}_{\cal H}`$.
Furthermore,
$`\|f\|_{k} = \min\{\|\mathbf{v}\|_{\cal H}: f=f_{\Psi,\mathbf{v}}\}`$ and
the minimizer is unique. Let $`{\cal X}`$ be a measurable space and let
$`k:{\cal X}\times{\cal X}\to \mathbb{R}`$ be a kernel. A random
features scheme (RFS) for $`k`$ is a pair $`(\psi,\mu)`$ where $`\mu`$
is a probability measure on a measurable space $`\Omega`$, and
$`\psi:\Omega\times{\cal X}\to \mathbb{R}`$ is a measurable function,
such that $`k(\mathbf{x} ,\mathbf{x} ') = \mathbb{E}_{\omega\sim\mu}\left[\psi(\omega,\mathbf{x} )\psi(\omega,\mathbf{x} ')\right]`$ for all $`\mathbf{x} ,\mathbf{x} '\in{\cal X}`$. We often refer to $`\psi`$ (rather than $`(\psi,\mu)`$) as the RFS. We
define
$`\|\psi\|_\infty = \sup_{\mathbf{x} }\|\psi(\cdot,\mathbf{x} )\|_\infty`$,
and say that $`\psi`$ is $`C`$-bounded if $`\|\psi\|_\infty\le C`$. The
random $`q`$-embedding generated from $`\psi`$ is the random mapping $`\Psi_{\boldsymbol{\omega}}(\mathbf{x} ) = \left(\psi(\omega_1,\mathbf{x} ),\ldots,\psi(\omega_q,\mathbf{x} )\right)`$, where $`\omega_1,\ldots,\omega_q\sim \mu`$ are i.i.d.
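A minimal numerical sketch of a random features scheme and of the estimator $`f_{\boldsymbol{\omega}}`$ discussed below. The choice $`\psi(\omega,\mathbf{x} )=\tanh({\left\langle \omega,\mathbf{x} \right\rangle})`$ with $`\mu={\cal N}(0,I_d)`$, and the particular $`\check{f}`$, are illustrative assumptions of the sketch, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(3)
d, q = 5, 2000                       # input dimension, number of random features

# RFS (psi, mu): psi(omega, x) = tanh(<omega, x>), mu = N(0, I_d), so that
# k(x, x') = E_omega[psi(omega, x) psi(omega, x')]
def psi(omega, x):
    return np.tanh(omega @ x)

omegas = rng.standard_normal((q, d))             # omega_1, ..., omega_q ~ mu, i.i.d.

def Psi(x):
    """The random q-embedding Psi_omega(x) = (psi(omega_1, x), ..., psi(omega_q, x))."""
    return psi(omegas, x)

def k_q(x, xp):
    """The random q-kernel: <Psi(x), Psi(x')> / q, a Monte-Carlo estimate of k(x, x')."""
    return Psi(x) @ Psi(xp) / q

# a function with "dual representation" f(x) = E_omega[check_f(omega) psi(omega, x)];
# here check_f(omega) = <v, omega> for a fixed v, purely as an illustration
v = rng.standard_normal(d)
def check_f(omega):
    return omega @ v

def f_omega(x):
    """The random-features estimator f_omega(x) = (1/q) sum_i check_f(omega_i) psi(omega_i, x)."""
    return (check_f(omegas) * Psi(x)).mean()

x, xp = rng.standard_normal(d), rng.standard_normal(d)
fresh = rng.standard_normal((200000, d))         # a large fresh sample to estimate expectations
k_true = (np.tanh(fresh @ x) * np.tanh(fresh @ xp)).mean()
f_true = ((fresh @ v) * np.tanh(fresh @ x)).mean()
print(k_q(x, xp), k_true)                        # the two kernel values should be close
print(f_omega(x), f_true)                        # f_omega(x) concentrates around f(x)
```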
The random $`q`$-kernel corresponding to $`\Psi_{\boldsymbol{\omega}}`$ is
$`k_{\boldsymbol{\omega}}(\mathbf{x} ,\mathbf{x} ') =
\frac{{\left\langle \Psi_{\boldsymbol{\omega}}(\mathbf{x} ),\Psi_{\boldsymbol{\omega}}(\mathbf{x} ') \right\rangle}}{q}`$.
Likewise, the random $`q`$-kernel space corresponding to
$`\frac{1}{\sqrt{q}}\Psi_{\boldsymbol{\omega}}`$ is
$`{\cal H}_{k_{\boldsymbol{\omega}}}`$. We next discuss approximation of
functions in $`{\cal H}_k`$ by functions in
$`{\cal H}_{k_{\boldsymbol{\omega}}}`$. It would be useful to consider
the embedding $`\mathbf{x} \mapsto\Psi^\mathbf{x} `$, where $`\Psi^\mathbf{x} :=\psi(\cdot,\mathbf{x} )\in L^2(\Omega)`$. From [eq:ker_eq_inner] it holds that for any $`\mathbf{x} ,\mathbf{x} '\in{\cal X}`$, $`k(\mathbf{x} ,\mathbf{x} ') = {\left\langle \Psi^\mathbf{x} ,\Psi^{\mathbf{x} '} \right\rangle}_{L^2(\Omega)}`$. In particular, from Theorem 25, for every $`f\in{\cal H}_k`$ there is a unique function $`\check{f}\in L^2(\Omega)`$ such that $`\|\check{f}\|_{L^2(\Omega)} = \|f\|_{k}`$ and for every $`\mathbf{x} \in{\cal X}`$, $`f(\mathbf{x} ) = {\left\langle \check{f},\Psi^\mathbf{x} \right\rangle}_{L^2(\Omega)} = \mathbb{E}_{\omega\sim\mu}\check{f}(\omega)\psi(\omega,\mathbf{x} )`$. Let us denote $`f_{\boldsymbol{\omega}}(\mathbf{x} ) = \frac{1}{q}\sum_{i=1}^q {\left\langle \check{f}(\omega_i),\psi(\omega_i,\mathbf{x} ) \right\rangle}`$. From [eq:f_x_as_inner] we have that $`\mathbb{E}_{\boldsymbol{\omega}}\left[f_{\boldsymbol{\omega}}(\mathbf{x} )\right] = f(\mathbf{x} )`$. Furthermore, for every $`\mathbf{x}`$, the variance of $`f_{\boldsymbol{\omega}}(\mathbf{x} )`$ is at most $`\frac{1}{q}\mathbb{E}_{\omega\sim\mu}\left|\check{f}(\omega)\psi(\omega,\mathbf{x} )\right|^2 \le \frac{\|\psi\|_\infty^2\|f\|^2_{k}}{q}`$. An immediate consequence is the following corollary.

Corollary 26 (Function Approximation). For all $`\mathbf{x} \in{\cal X}`$, $`\mathbb{E}_{\boldsymbol{\omega}}|f(\mathbf{x} ) - f_{\boldsymbol{\omega}}(\mathbf{x} )|^2 \le \frac{\|\psi\|_\infty^2\|f\|^2_{k}}{q}`$.

Now, if $`{\cal D}`$ is a distribution on $`{\cal X}`$ we get that $`\mathbb{E}_{\boldsymbol{\omega}}\|f - f_{\boldsymbol{\omega}}\|_{2,{\cal D}} \le \frac{\|\psi\|_\infty\|f\|_{k}}{\sqrt{q}}`$. Thus, $`O\left(\frac{\|f\|_k^2}{\epsilon^2}\right)`$ random features suffice to guarantee that $`\mathbb{E}_{\boldsymbol{\omega}}\|f - f_{\boldsymbol{\omega}}\|_{2,{\cal D}}\le\epsilon`$.
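As a concrete illustration of this $`\ell^2`$ guarantee, the following sketch instantiates an RFS with the standard random Fourier features of the Gaussian kernel (an illustrative choice, not the activation-based scheme used later in the paper) and takes $`f = k(\cdot,\mathbf{x} _0)`$, so that $`\check f(\omega)=\psi(\omega,\mathbf{x} _0)`$ and $`\|f\|_k=1`$; the observed error decays like $`1/\sqrt{q}`$, as Corollary 26 predicts.

``` python
import numpy as np

rng = np.random.default_rng(0)
d = 5

# RFS for the Gaussian kernel k(x, x') = exp(-||x - x'||^2 / 2):
# psi((w, b), x) = sqrt(2) * cos(<w, x> + b),  w ~ N(0, I_d),  b ~ Unif[0, 2*pi].
x0 = rng.standard_normal(d)          # f = k(., x0): ||f||_k = 1 and f_check(omega) = psi(omega, x0)
xs = rng.standard_normal((200, d))   # test points
f_true = np.exp(-0.5 * np.sum((xs - x0) ** 2, axis=1))

for q in [100, 1_000, 10_000]:
    W = rng.standard_normal((q, d))
    b = rng.uniform(0.0, 2.0 * np.pi, size=q)
    feats_x0 = np.sqrt(2.0) * np.cos(W @ x0 + b)       # psi(omega_i, x0), shape (q,)
    feats_xs = np.sqrt(2.0) * np.cos(xs @ W.T + b)     # psi(omega_i, x),  shape (200, q)
    f_omega = feats_xs @ feats_x0 / q                  # f_omega(x) = (1/q) sum_i f_check(omega_i) psi(omega_i, x)
    err = np.sqrt(np.mean((f_true - f_omega) ** 2))
    print(f"q={q:6d}   empirical ||f - f_omega||_2 ~ {err:.4f}   (O(1/sqrt(q)) = {1/np.sqrt(q):.4f})")
```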
In this paper such an $`\ell^2`$ guarantee will not suffice, and we will need an approximation of functions in $`{\cal H}_k`$ by functions in $`{\cal H}_{k_{\boldsymbol{\omega}}}`$ w.r.t. the stronger $`\ell^\infty`$ norm. We next show this can be obtained, unfortunately with a quadratic growth in the required number of features. For $`z\in\mathbb{R}`$ we define $`{\left\langle z \right\rangle}_B = \begin{cases}z & |z|\le B\\0&\text{otherwise}\end{cases}`$. We will consider the following truncated version of $`f_{\boldsymbol{\omega}}`$:
$`f_{{\boldsymbol{\omega}},B}(\mathbf{x} ) = \frac{1}{q}\sum_{i=1}^q {\left\langle \check{f}(\omega_i) \right\rangle}_B\cdot\psi(\omega_i,\mathbf{x} )`$.
Now, if $`\psi`$ is $`C`$-bounded we have that $`f_{{\boldsymbol{\omega}},B}(\mathbf{x} )`$ is an average of $`q`$ i.i.d. $`CB`$-bounded random variables. By Hoeffding's inequality, $`f_{{\boldsymbol{\omega}},B}(\mathbf{x} )`$ concentrates around its mean, and likewise the truncation bias $`\left|f(\mathbf{x} ) - \mathbb{E}_{{\boldsymbol{\omega}}'} f_{{\boldsymbol{\omega}}',B}(\mathbf{x} )\right|`$ is at most $`\frac{\|\psi\|_\infty\|f\|^2_{k}}{B}`$. We get the following lemma.

Lemma 27. *Let $`f\in{\cal H}_k`$ with $`\|f\|_{k}\le M`$ and assume that $`\|\psi\|_\infty\le C`$. For $`B = \frac{2CM^2}{\epsilon}`$ we have $`\Pr\left(\left|f_{{\boldsymbol{\omega}},B}(\mathbf{x} )-f(\mathbf{x} )\right|>\epsilon\right) \le 2e^{-\frac{q\epsilon^4}{32M^4C^4}}`$. Furthermore, the norm of the weight vector defining $`f_{{\boldsymbol{\omega}},B}`$, i.e. $`\mathbf{w}= \frac{1}{q}\left({\left\langle \check{f}(\omega_1) \right\rangle}_B,\ldots,{\left\langle \check{f}(\omega_q) \right\rangle}_B\right)`$, satisfies $`\|\mathbf{w}\|\le \frac{2CM^2}{\epsilon\sqrt{q}}`$.*

This implies that for any $`\mathbf{x} \in [-1,1]^{n}`$ we have $`\left|\frac{\partial p }{\partial x_i}\left(\mathbf{x} \right)\right| \le K\sqrt{(n+1)^{K-1}}\|p\|_\mathrm{co}`$ for every $`i`$. Hence,
$`\|\nabla p(\mathbf{x} )\|_1\le nK\sqrt{(n+1)^{K-1}}\|p\|_\mathrm{co}\le K\sqrt{(n+1)^{K+1}}\|p\|_\mathrm{co}`$.
showing that $`p`$ is $`((n+1)^{\frac{K+1}{2}}K\|p\|_\mathrm{co})`$-Lipschitz in $`[-1,1]^{n}`$ w.r.t. the $`\|\cdot\|_\infty`$ norm. Likewise, for any $`\mathbf{x} \in [-1,1]^n`$ we have $`|p(\mathbf{x} )| \le \sum_{\alpha}| a_\alpha| \le (n+1)^{K/2}\|p\|_\mathrm{co}`$. ◻

Lemma 31. Assume that $`f:{\cal X}\to \{\pm 1\}`$ is a $`\left(K,M\right)`$-PTF, as witnessed by a polynomial $`p:\mathbb{R}^{n}\to\mathbb{R}`$ that is $`L`$-Lipschitz w.r.t. $`\|\cdot\|_\infty`$. If $`p`$ is bounded by $`B`$ in $`\cup_{\mathbf{x} \in {\cal X}}{\cal B}_{1/(2L)}(\mathbf{x} )`$, then $`f`$ is a $`\left(K,2M,2B,\frac{1}{2L}\right)`$-PTF witnessed by $`2p`$. If $`p`$ is bounded by $`B`$ in $`{\cal X}`$, then $`f`$ is a $`\left(K,2M,2B+1,\frac{1}{2L}\right)`$-PTF witnessed by $`2p`$.

Proof. We first note that the second item follows from the first. Indeed, if $`p`$ is bounded by $`B`$ in $`{\cal X}`$ then $`p`$ is bounded by $`B+1/2`$ in $`\cup_{\mathbf{x} \in {\cal X}}{\cal B}_{1/(2L)}(\mathbf{x} )`$. To prove the first item we need to show that for any $`\mathbf{x} \in {\cal X}`$ and $`\tilde\mathbf{x} \in{\cal B}_{1/(2L)}(\mathbf{x} )`$ we have $`2B\ge 2p(\tilde\mathbf{x} )f(\mathbf{x} ) \ge 1`$. The left inequality is clear. For the right inequality we assume that $`f(\mathbf{x} )=1`$ (the other case is similar). Since $`\|\mathbf{x} -\tilde\mathbf{x} \|_\infty\le \frac{1}{2L}`$ we have $`p(\tilde\mathbf{x} ) \ge p(\mathbf{x} ) - L\|\mathbf{x} -\tilde\mathbf{x} \|_\infty \ge 1-\frac{1}{2} = \frac{1}{2}`$, and hence $`2p(\tilde\mathbf{x} )f(\mathbf{x} )\ge 1`$. ◻

Assume now that $`{\cal X}\subseteq\{\pm 1\}^d`$, and that any label
$`j\in L_i`$ depends on at most $`K`$ labels from $`L_{i-1}`$ in at most $`K`$ locations (or on at most $`K`$ input locations if $`i=1`$). That is, for any $`j\in L_i`$, there is a function $`\tilde f_j :\{\pm 1\}^{wn}\to \{\pm 1\}`$ (or $`\tilde f_j :\{\pm 1\}^{dw}\to \{\pm 1\}`$ if $`i=1`$) that depends on at most $`K`$ coordinates from $`\{kn+l: 0\le k\le w-1,\;l\in L_{i-1}\}`$ (from $`[dw]`$ if $`i=1`$), for which the following holds. For any $`g\in G`$, $`f^*_{j,g}(\vec \mathbf{x} ) = \tilde f_j(E_{g}(\mathbf{f}^*(\vec\mathbf{x} )))`$ (or $`f^*_{j,g}(\vec \mathbf{x} ) = \tilde f_j(E_{g}(\vec\mathbf{x} ))`$ if $`i=1`$). As in Example 2, since any Boolean function depending on $`K`$ variables is a $`(K,1)`$-PTF, we have that the functions $`\tilde f_j`$ are $`(K,1)`$-PTFs, implying that $`({\cal L},\mathbf{e})`$ is an $`(r,K,1)`$-hierarchy. Lemma 28 implies that $`({\cal L},\mathbf{e})`$ is an $`(r,K,2,B,\xi)`$-hierarchy for $`\xi = \frac{1}{2K(wn+1)^{(K+1)/2}}`$ and $`B=2(w\max(n,d)+1)^{K/2}`$.
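To make the fact used above concrete, the following sketch computes the multilinear (Fourier) extension of a random Boolean function of $`K`$ bits and checks that it witnesses the $`(K,1)`$-PTF property: it agrees with the function on the cube (so $`F(\mathbf{x} )f(\mathbf{x} )=1`$) and, by Parseval, its coefficient norm is $`1`$. The random function and $`K=4`$ are arbitrary choices.

``` python
import itertools
import numpy as np

rng = np.random.default_rng(0)
K = 4

# A random Boolean function on K bits, given by its truth table over {+-1}^K.
cube = np.array(list(itertools.product([-1.0, 1.0], repeat=K)))   # shape (2^K, K)
f = rng.choice([-1.0, 1.0], size=len(cube))

# Fourier (multilinear) expansion: F(x) = sum_A a_A * prod_{i in A} x_i, a_A = E_x[f(x) x^A].
subsets = [A for r in range(K + 1) for A in itertools.combinations(range(K), r)]
coeffs = {A: np.mean(f * np.prod(cube[:, A], axis=1)) for A in subsets}

def F(x):
    return sum(a * np.prod(x[list(A)]) for A, a in coeffs.items())

# On the cube, F agrees with f, so F(x) * f(x) = 1 >= 1 ...
assert all(abs(F(x) - y) < 1e-9 for x, y in zip(cube, f))
# ... and the coefficient norm is 1 by Parseval, so f is a (K, 1)-PTF witnessed by F.
print("coefficient norm:", np.sqrt(sum(a * a for a in coeffs.values())))
```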
The following lemma shows that this can be substantially improved.

Lemma 32. Any Boolean function depending on $`K`$ coordinates is a $`(K,2,3,\xi)`$-PTF for $`\xi = \frac{1}{K2^{(K+2)/2}}`$. As a result, $`({\cal L},\mathbf{e})`$ is an $`(r,K,2,3,\xi)`$-hierarchy.

Lemma 32 follows from the following lemma together with Lemma 31.

Lemma 33. Let $`f:\{\pm 1\}^K\to \{\pm 1\}`$ and let $`F(\mathbf{x} )= \sum_{A\subseteq[K]}a_A\mathbf{x} ^A`$ be its standard multilinear extension. Then, $`F`$ is $`(K 2^{K/2})`$-Lipschitz in $`[-1,1]^K`$ w.r.t. the $`\|\cdot\|_\infty`$ norm.

Proof. For $`\mathbf{x} \in [-1,1]^K`$ we have $`\left|\frac{\partial F}{\partial x_i}(\mathbf{x} )\right| \le \sum_{i\in A\subseteq[K]}\left|a_A\right|`$ for every $`i`$. Hence, $`\|\nabla F(\mathbf{x} )\|_1 \le \sum_{A\subseteq[K]}|A|\left|a_A\right| \le K 2^{K/2}`$ by Cauchy-Schwarz. ◻

The following lemma shows that $`\xi`$ and $`B`$ can be improved even
further, at the expense of the degree and the coefficient norm.

Lemma 34. For any $`0<\xi<1`$ and $`B>1`$, any Boolean function $`f:\{\pm 1\}^K\to\{\pm 1\}`$ is a $`(K',M,B,\xi)`$-PTF, where $`\epsilon = \frac{B-1}{B+1}`$, $`K'=O\left(\frac{K^2+K\log(1/\epsilon)}{1-\xi}\right)`$ and $`M = 2^{O\left(\frac{K^2+K\log(1/\epsilon)}{1-\xi}\right)}`$.

Proof. Fix $`f:\{\pm 1\}^K\to\{\pm 1\}`$. We need to show that $`f`$
is a $`(K',M,B,\xi)`$-PTF. Let $`\epsilon = \frac{B-1}{B+1}`$. By Lemma
21 there is a uni-variate
polynomial $`q`$ of degree
$`O\left(\frac{K+\log(1/\epsilon)}{1-\xi}\right)`$ such that
$`q([-1,1])\subseteq [-1,1]`$, for any
$`y\in [-1,1]\setminus [-1+\xi,1-\xi]`$ we have
$`|q(y)-\mathrm{sign}(y)|\le \frac{\epsilon}{K2^{K/2}}`$, and the
coefficients of $`q`$ are all bounded by
$`2^{O\left(\frac{K+\log(1/\epsilon)}{1-\xi}\right)}`$. Consider now the
polynomial $`\tilde p(\mathbf{x} ) = F(q(\mathbf{x} ))`$ where $`F`$ is
the multilinear extension on $`f`$. It is not hard to verify that
$`\deg(\tilde p)\le \deg(q)K = O\left(\frac{K^2+K\log(1/\epsilon)}{1-\xi}\right)`$
and that
$`\|\tilde p\|_\mathrm{co}\le 2^{O\left(\frac{K^2+K\log(1/\epsilon)}{1-\xi}\right)}`$.
Finally, fix $`\mathbf{x} \in \{\pm 1\}^K`$ and
$`\tilde\mathbf{x} \in {\cal B}_\xi(\mathbf{x} )`$. Note that
$`\mathbf{x} = \mathrm{sign}(\tilde\mathbf{x} )`$. Since $`F`$ is $`K2^{K/2}`$-Lipschitz w.r.t. the $`\|\cdot\|_\infty`$ norm in $`[-1,1]^K`$ (Lemma 33) we have $`|\tilde p(\tilde \mathbf{x} )-f(\mathbf{x} )| = |F(q(\tilde \mathbf{x} ))-F(\mathrm{sign}(\tilde \mathbf{x} ))| \le K2^{K/2}\|q(\tilde \mathbf{x} )-\mathrm{sign}(\tilde \mathbf{x} )\|_\infty \le \epsilon`$. Since $`f(\mathbf{x} )\in \{\pm 1\}`$ this implies that $`1+\epsilon\ge \tilde p(\tilde \mathbf{x} )f(\mathbf{x} ) \ge 1-\epsilon`$. Taking $`p(x) = \frac{1}{1-\epsilon}\tilde p(x)`$ and noting that $`B = \frac{1+\epsilon}{1-\epsilon}`$ we get $`B\ge p(\tilde \mathbf{x} )f(\mathbf{x} ) \ge 1`$, which implies that $`f`$ is a $`(K',M,B,\xi)`$-PTF. ◻

In this section we will prove (a slightly extended version of) Theorem
4. We first recall and
slightly extend the setting. Fix a domain
$`{\cal X}\subseteq\{\pm 1\}^d`$ and a sequence of functions
$`G^i:\{\pm 1\}^d\to\{\pm 1\}^d`$ for $`1\le i\le r`$. We assume that
$`G^0(\mathbf{x} ) = \mathbf{x}`$, and for any depth $`i\in [r]`$ and
coordinate $`j\in [d]`$, we have where $`p^i_j:\{\pm 1\}^d\to\{\pm 1\}`$ is a function whose multi-linear
extension is a polynomial of degree at most $`K`$. Furthermore, we
assume this extension is $`L`$-Lipschitz in $`[-1,1]^d`$ with respect to
the $`\ell_\infty`$ norm (if $`p^i_j`$ depends on $`K`$ coordinates, as
in the problem description in section
3.1, Lemma
33 implies that this holds
with $`L=K2^{K/2}`$). Fix an integer $`q`$. We assume that for every
depth $`i\in [r]`$, there are $`q`$ auxiliary labels $`f^*_{i,j}`$ for
$`1\le j\le q`$, each of which is a signed Majority of an odd number of
components of $`G^i`$. Moreover, we assume these functions are random.
Specifically, prior to learning, the labeler independently samples $`qr`$ functions such that for any $`i\in [r]`$ and $`j\in [q]`$, $`f^*_{i,j}(\mathbf{x} ) = \mathrm{sign}\left(\sum_{l=1}^d w_l^{i,j}G^i_l(\mathbf{x} )\right)`$, where the weight vectors $`\mathbf{w}^{i,j}\in \mathbb{R}^{d}`$ are independent uniform vectors chosen from $`{\cal W}_{d,k} := \left\{\mathbf{w}\in \{-1,0,1\}^d : \sum_{l=1}^d |w_l|= k \right\}`$ for some odd integer $`k`$. The following theorem, which slightly
extends Theorem
4, shows that if
$`q\gg dL^2\log(|{\cal X}|)`$, then with high probability over the
choice of $`\mathbf{f}^*`$, the target function $`\mathbf{f}^*`$ has an
$`\left(r,K,O\left(kd^{K}\right),2k+1\right)`$-hierarchy. Theorem 35.
W.p. $`1-4drq|{\cal X}|e^{-\Omega\left(\frac{q}{L^2k^2d}\right)}`$ the function $`\mathbf{f}^*`$ has an $`\left(r,K,O\left(kd^{K}\right),2k+1\right)`$-hierarchy.

In order to prove Theorem 35 it is enough to show that for any $`i\in[r]`$ and $`j\in [q]`$, $`f^*_{i,j}`$ is a $`(K,O\left(kd^{K}\right),2k+1)`$-PTF of $`\Psi_{i-1}(\mathbf{x} ) = (f^*_{i-1,1}(\mathbf{x} ),\ldots,f^*_{i-1,q}(\mathbf{x} ))`$. By equations [eq:brain_dump_target_def] and [eq:brain_dump_circ_def] we have $`f^*_{i,j}(\mathbf{x} ) = \mathrm{sign}\left(\sum_{l=1}^d w_l^{i,j} p^i_l(G^{i-1}(\mathbf{x} ))\right) =: \mathrm{sign}\left(q(G^{i-1}(\mathbf{x} ))\right)`$. Hence, $`f^*_{i,j}`$ is a $`(K,k)`$-PTF of $`G^{i-1}`$, as witnessed by
$`q`$ (note that $`1\le|q(G^{i-1}(\mathbf{x} ))|\le k`$ since
$`q(G^{i-1}(\mathbf{x} ))`$ is a sum of $`k`$ numbers in $`\{\pm 1\}`$
and $`k`$ is odd. Likewise,
$`\|q\|_\mathrm{co}\le \sum_{l=1}^d |w_l^{i,j}|\cdot\|p^i_l\|_\mathrm{co}\stackrel{\|p^i_l\|_\mathrm{co}\le 1}{\le} \sum_{l=1}^d |w_l^{i,j}|=k`$).
Since $`q`$ is $`(kL)`$-Lipschitz and bounded by $`k`$, Lemma
31 implies that
$`f^*_{i,j}`$ is a $`(K,k,2k+1,1/(2kL))`$-PTF of $`G^{i-1}`$. Hence,
Theorem 35 follows from the following lemma
and a union bound on the $`rq`$ different $`f^*_{i,j}`$. Lemma 36. Let $`f:{\cal X}\to \{\pm 1\}`$ be a $`(K,M,B,\xi)`$-PTF
and let $`\mathbf{w}^1,\ldots,\mathbf{w}^q\in {\cal W}_{d,k}`$ be
independent and uniform. Define
$`\psi_i(\mathbf{x} ) = \mathrm{sign}({\left\langle \mathbf{w}^i,\mathbf{x} \right\rangle})`$.
Then, w.p. $`1-4d|{\cal X}|e^{-\Omega\left(\frac{\xi^2q}{d}\right)}`$
$`f`$ is $`\left(K,O\left(Md^{K}\right),B\right)`$-PTF of
$`\Psi=(\psi_1,\ldots,\psi_q)`$. Proof. Let $`W = [\mathbf{w}_1\cdots\mathbf{w}_q]\in M_{d,q}`$. We first show that w.h.p. $`W`$ approximately reconstructs $`\mathbf{x}`$ from $`\Psi(\mathbf{x} )`$.

Claim 2. Let
$`\alpha_{d,k} = \frac{k}{d}\cdot\frac{\binom{k-1}{(k-1)/2}}{2^{k-1}}`$.
For any $`\mathbf{x} \in\{\pm 1\}^d`$ and $`\frac{1}{4}\ge\epsilon>0`$
we have
$`\Pr\left(\left\|\frac{1}{q\alpha_{d,k}}W\Psi(\mathbf{x} )-\mathbf{x} \right\|_\infty\ge\epsilon\right)\le 4de^{-\Omega\left(\frac{\epsilon^2q}{d}\right)}`$ Before proving the claim, we show that it implies the lemma. Indeed, it
implies that
w.p. $`1-4d|{\cal X}|e^{-\Omega\left(\frac{\xi^2q}{d}\right)}`$ we have
that
$`\left\|\frac{1}{q\alpha_{d,k}}W\Psi(\mathbf{x} )-\mathbf{x} \right\|_\infty\le\frac{\xi}{2}`$
for any $`\mathbf{x} \in {\cal X}`$. Given this event, we have that for any $`\mathbf{x} \in{\cal X}`$ and $`j\in [d]`$, $`1-\xi\le \frac{1-\xi/2}{q\alpha_{d,k}}\left(W\Psi(\mathbf{x} )\odot \mathbf{x} \right)_j\le1`$. Thus, if $`p:{\cal X}\to\mathbb{R}`$ is a polynomial that witnesses that $`f`$ is a $`(K,M,B,\xi)`$-PTF, then we have $`B\ge p\left( \frac{1-\xi/2}{q\alpha_{d,k}}W\Psi(\mathbf{x} )\right)\cdot f(\mathbf{x} ) \ge 1`$. Hence, for $`q(\mathbf{y}):=p\left( \frac{1-\xi/2}{q\alpha_{d,k}}W\mathbf{y}\right)`$ we have that $`f`$ is a $`(K,\|q\|_\mathrm{co},B)`$-PTF of $`\Psi`$. By Lemma 22 and the fact that the norm of each row of $`\frac{1-\xi/2}{q\alpha_{d,k}}W`$ is at most $`\frac{1}{\sqrt{q}\alpha_{d,k}}`$ (since the entries of $`W`$ are in $`\{-1,1,0\}`$) we have $`\|q\|_\mathrm{co}\le \|p\|_\mathrm{co}\cdot \left(\frac{\sqrt{q+1}}{\sqrt{q}\alpha_{d,k}}\right)^{K}`$. This implies the lemma as $`\alpha_{d,k} = \Theta\left( \frac{\sqrt{k}}{d}\right)`$ by Lemma
20.

Proof. (of Claim 2) Fix a coordinate $`j\in [d]`$. It is enough to show that $`\Pr\left(\left|\frac{1}{q\alpha_{d,k}}\left(W\Psi(\mathbf{x} )\right)_j-x_j\right|\ge\epsilon\right)\le 4e^{-\Omega\left(\frac{\epsilon^2q}{d}\right)}`$. We note that $`\frac{1}{q\alpha_{d,k}}\left(W\Psi(\mathbf{x} )\right)_j = \frac{1}{q}\sum_{i=1}^q \frac{w^i_j\mathrm{sign}({\left\langle \mathbf{w}^i,\mathbf{x} \right\rangle})}{\alpha_{d,k}}`$. Denote $`X_i = w^i_j\mathrm{sign}({\left\langle \mathbf{w}^i,\mathbf{x} \right\rangle})`$. Note that $`X_1,\ldots,X_q`$ are i.i.d. A direct calculation gives $`\mathbb{E}X_i = \alpha_{d,k}\cdot x_j`$ and $`\Pr(X_i\ne 0) = \frac{k}{d}`$, and by Lemma 19 we get the claim. ◻ ◻

Fix a bounded activation $`\sigma:\mathbb{R}\to\mathbb{R}`$. Given
$`0\le \beta\le 1`$, called the bias magnitude, we define a kernel on $`\mathbb{R}^n`$ by $`k_{\sigma,\beta,n}(\mathbf{x} ,\mathbf{y}) = \mathbb{E}[\sigma(\mathbf{w}^\top\mathbf{x} +b)\sigma(\mathbf{w}^\top\mathbf{y}+b)]`$, where $`b\sim {\cal N}(0,\beta^2)`$ and $`\mathbf{w}\sim{\cal N}\left(0,\frac{1-\beta^2}{n}I_n\right)`$. Note that $`\psi((\mathbf{w},b),\mathbf{x} ) = \sigma(\mathbf{w}^\top\mathbf{x} +b)`$ is a RFS for $`k_{\sigma,\beta,n}`$. We next analyze the functions in the corresponding kernel space $`{\cal H}_{\sigma,\beta,n}`$. To this end, we will use the Hermite expansion of $`\sigma`$ in order to find an explicit expression of $`k_{\sigma,\beta,n}`$, as well as an explicit embedding $`\Psi_{\sigma,\beta,n}:\mathbb{R}^n\to \bigoplus_{s=0}^\infty \left(\mathbb{R}^{n+1}\right)^{\otimes s}`$ whose kernel is $`k_{\sigma,\beta,n}`$. Let $`\sigma = \sum_{s=0}^\infty a_sh_s`$ be the Hermite expansion of $`\sigma`$. For $`r\ge 1`$ denote $`a_s(r) = \sum_{j=0}^{\infty} a_{s+2j}\sqrt{\frac{(s+2j)!}{s!}} \frac{(r^2-1)^j}{j! 2^j}`$. Note that $`a_s(1)=a_s`$.

Lemma 37. *We have $`k_{\sigma,\beta,n}(\mathbf{x} ,\mathbf{y}) = \sum_{s=0}^\infty a_s\left(\sqrt{\frac{1-\beta^2}{n}\|\mathbf{x} \|^2 + \beta^2}\right)a_s\left(\sqrt{\frac{1-\beta^2}{n}\|\mathbf{y}\|^2 + \beta^2}\right)\left(\frac{1-\beta^2}{n}{\left\langle \mathbf{x} ,\mathbf{y} \right\rangle}+\beta^2\right)^s`$. Likewise, $`k_{\sigma,\beta,n}`$ is the kernel of the embedding $`\Psi_{\sigma,\beta,n}:\mathbb{R}^n\to \bigoplus_{s=0}^\infty \left(\mathbb{R}^{n+1}\right)^{\otimes s}`$ given by $`\Psi_{\sigma,\beta,n}(\mathbf{x} ) = \left(a_s\left(\sqrt{\frac{1-\beta^2}{n}\|\mathbf{x} \|^2 + \beta^2}\right)\cdot\begin{bmatrix}\sqrt{\frac{1-\beta^2}{n}}\mathbf{x} \\\beta\end{bmatrix}^{\otimes s}\right)_{s=0}^\infty`$.*

Proof. (of Lemma 37) We will prove the formula for
$`k_{\sigma,\beta,n}`$. It is not hard to verify that it implies that
$`k_{\sigma,\beta,n}`$ is the kernel of $`\Psi_{\sigma,\beta,n}`$ using
the fact that
$`{\left\langle \mathbf{x} ^{\otimes s},\mathbf{y}^{\otimes s} \right\rangle} = {\left\langle \mathbf{x} ,\mathbf{y} \right\rangle}^s`$.
By definition
$`k_{\sigma,\beta,n}(\mathbf{x} ,\mathbf{y}) = \mathbb{E}[\sigma(\mathbf{w}^\top\mathbf{x} +b)\sigma(\mathbf{w}^\top\mathbf{y}+b)]`$
where $`b\sim {\cal N}(0,\beta^2)`$ and
$`\mathbf{w}\sim{\cal N}\left(0,\frac{1-\beta^2}{n}I_n\right)`$. Let
$`X = \mathbf{w}^\top\mathbf{x} +b`$ and
$`Y = \mathbf{w}^\top\mathbf{y}+b`$. We note that $`(X,Y)`$ is a
centered Gaussian vector with correlation matrix $`\begin{pmatrix}
\frac{1-\beta^2}{n}\|\mathbf{x} \|^2+\beta^2 & \frac{1-\beta^2}{n}{\left\langle \mathbf{x} ,\mathbf{y} \right\rangle}+\beta^2\\
\frac{1-\beta^2}{n}{\left\langle \mathbf{x} ,\mathbf{y} \right\rangle}+\beta^2 & \frac{1-\beta^2}{n}\|\mathbf{y}\|^2+\beta^2
\end{pmatrix}`$. Denote
$`r_\mathbf{x} = \sqrt{\frac{1-\beta^2}{n}\|\mathbf{x} \|^2 + \beta^2}`$
and
$`r_\mathbf{y}= \sqrt{\frac{1-\beta^2}{n}\|\mathbf{y}\|^2 + \beta^2}`$.
Likewise let $`\tilde X = \frac{1}{r_\mathbf{x} }X`$ and
$`\tilde Y = \frac{1}{r_\mathbf{y}}Y`$. Note that $`(\tilde X,\tilde Y)`$ is a
centered Gaussian vector with correlation matrix $`\begin{pmatrix}
1 & \rho\\
\rho & 1
\end{pmatrix}`$ for
$`\rho = \frac{\frac{1-\beta^2}{n}{\left\langle \mathbf{x} ,\mathbf{y} \right\rangle}+\beta^2}{r_\mathbf{x} r_\mathbf{y}}`$
Now, by Lemma 38 we have Hence, Ā ā» Lemma 39. *Let $`r>0`$ such that $`|1-r^2|=:\epsilon<\frac{1}{2}`$.
We have Ā ā» Lemma 40. Assume that $`1-\beta^2<\frac{1}{2}`$ for $`\beta>0`$.
Let $`{\cal X}\subseteq [-1,1]^n`$. Let $`p:{\cal X}\to\mathbb{R}`$ be a
degree $`K`$ polynomial. Let $`K'\ge K`$. There is
$`g\in {\cal H}_{\sigma,\beta,n}({\cal X})`$ such that:

- $`g(\mathbf{x} ) = \frac{a_{K'}\left(\sqrt{\frac{1-\beta^2}{n}\|\mathbf{x} \|^2+\beta^2}\right)}{a_{K'}}p(\mathbf{x} )`$
- $`\|g\|_{\sigma,\beta,n} \le \frac{1}{a_{K'}\beta^{K'-K}}\left(\frac{n}{1-\beta^2}\right)^{K/2}\|p\|_{\mathrm{co}}`$
- $`\|g-p\|_\infty \le \|p\|_\infty \frac{\|\sigma\|}{a_{K'}}2^{(K'+2)/2}\frac{1-\beta^2}{\sqrt{1-2(1-\beta^2)^2}}`$

Proof. Write
$`p(\mathbf{x} )= \sum_{\alpha\in \{0,\ldots,K\}^{n},\|\alpha\|_1\le K} b_\alpha \mathbf{x} ^\alpha`$.
For $`\alpha\in \{0,\ldots,K\}^{n},\|\alpha\|_1\le K`$ we let
$`\tilde\alpha \in [n+1]^{K'}`$ be a sequence such that for any
$`i\in [n]`$ we have $`\tilde\alpha_j = i`$ for exactly $`\alpha_i`$
indices $`j\in [K']`$ and $`\tilde\alpha_j = n+1`$ for the remaining
$`K'-\|\alpha\|_1`$ indices. Let
$`A\in (\mathbb{R}^{n+1})^{\otimes K'}\subseteq \bigoplus_{s=0}^\infty(\mathbb{R}^{n+1})^{\otimes s}`$
be the tensor $`A_{\gamma} = \begin{cases}\frac{1}{a_{K'}\beta^{K'-\|\alpha\|_1}}\left(\frac{n}{1-\beta^2}\right)^{\|\alpha\|_1/2}b_\alpha & \gamma=\tilde\alpha\text{ for some }\alpha \\ 0&\text{otherwise}\end{cases}`$ and let $`g(\mathbf{x} ) = {\left\langle A,\Psi_{\sigma,\beta,n}(\mathbf{x} ) \right\rangle}`$. It is not hard to verify that $`g(\mathbf{x} ) = \frac{a_{K'}\left(\sqrt{\frac{1-\beta^2}{n}\|\mathbf{x} \|^2+\beta^2}\right)}{a_{K'}}p(\mathbf{x} )`$. By Theorem 25, $`g\in {\cal H}_{\sigma,\beta,n}`$ and satisfies $`\|g\|_{\sigma,\beta,n} \le \|A\|`$. Finally, since $`\frac{1}{\beta^{K'-\|\alpha\|_1}}\left(\frac{n}{1-\beta^2}\right)^{\|\alpha\|_1/2}\le \frac{1}{\beta^{K'-K}}\left(\frac{n}{1-\beta^2}\right)^{K/2}`$ we have $`\|A\|\le \frac{1}{a_{K'}\beta^{K'-K}}\left(\frac{n}{1-\beta^2}\right)^{K/2}\|p\|_{\mathrm{co}}`$. We therefore proved the first and the second items. To prove the last item we note that for any $`\mathbf{x} \in {\cal X}`$ we have $`|g(\mathbf{x} ) - p(\mathbf{x} )| = \frac{| p(\mathbf{x} )|}{a_{K'}}\left|a_{K'}\left(\sqrt{\frac{1-\beta^2}{n}\|\mathbf{x} \|^2+\beta^2}\right)-a_{K'}\right|`$. Define $`r = \sqrt{\frac{\|\mathbf{x} \|^2}{n}(1-\beta^2) + \beta^2}`$ and note that since $`0\le \|\mathbf{x} \|^2\le n`$ we have $`\beta^2\le r^2\le 1`$, so $`\epsilon:= |1- r^2 |\le 1-\beta^2 < \frac{1}{2}`$. Hence, by Lemma 39 we have $`|g(\mathbf{x} ) - p(\mathbf{x} )|\le \frac{| p(\mathbf{x} )|}{a_{K'}}\|\sigma\|2^{(K'+2)/2}\frac{1-\beta^2}{\sqrt{1-2(1-\beta^2)^2}}`$, which proves the last item. ◻

Combining Lemma 40 with Lemma 27 we get the following lemma.

Lemma 41. *Assume that $`1-\beta^2<\frac{1}{2}`$ for $`\beta>0`$.
Let $`{\cal X}\subset [-1,1]^n`$. Fix a degree $`K`$ polynomial
$`p:{\cal X}\to [-1,1]`$ and $`K'\ge K`$. Let
$`(W,\mathbf{b})\in\mathbb{R}^{q\times n}\times \mathbb{R}^{q}`$ be
$`\beta`$-Xavier pair. Then there is a vector $`\mathbf{w}=\mathbf{w}(W,\mathbf{b})\in\mathbb{R}^{q}`$ such that for every $`\mathbf{x} \in {\cal X}`$, $`\Pr\left(|{\left\langle \mathbf{w},\sigma(W\mathbf{x} +\mathbf{b}) \right\rangle}-p(\mathbf{x} )|\ge \epsilon + \frac{\|\sigma\|}{a_{K'}}2^{(K'+2)/2}\frac{1-\beta^2}{\sqrt{1-2(1-\beta^2)^2}}\right) \le \delta`$ with $`\delta = 2\exp\left(-{q}\cdot\frac{a^4_{K'}\beta^{4K'-4K}(1-\beta^2)^{2K}\epsilon^4}{32 n^{2K}\|p\|_{\mathrm{co}}^4\|\sigma\|_\infty^4}\right)`$. Moreover, $`\|\mathbf{w}\|\le \frac{2\|\sigma\|_\infty}{\epsilon\sqrt{{q}}}\cdot\frac{1}{a^2_{K'}\beta^{2K'-2K}}\left(\frac{n}{1-\beta^2}\right)^{K}\|p\|^2_{\mathrm{co}}`$.*

Recall also the definitions of $`\beta_{\sigma,K',K}(\epsilon)`$ and $`\delta_{\sigma,K',K}(\epsilon,\beta,q,M,n)`$. We can now prove Lemma 13, which we restate next.

Lemma 42. *(Lemma 13 restated) Fix
$`{\cal X}\subset [-1,1]^n`$, a degree $`K`$ polynomial
$`p:{\cal X}\to [-1,1]`$, $`K'\ge K`$ and $`\epsilon>0`$. Let
$`(W,\mathbf{b})\in\mathbb{R}^{q\times n}\times \mathbb{R}^{q}`$ be
$`\beta`$-Xavier pair for $`1>\beta\ge \beta_{\sigma,K',K}(\epsilon)`$.
Then there is a vector $`\mathbf{w}=\mathbf{w}(W,\mathbf{b})\in\mathbb{B}^{q}`$ such that for every $`\mathbf{x} \in {\cal X}`$, $`\Pr\left(|{\left\langle \mathbf{w},\sigma(W\mathbf{x} +\mathbf{b}) \right\rangle}-p(\mathbf{x} )|\ge \epsilon \right) \le \delta_{\sigma,K',K}(\epsilon,\beta,q,\|p\|_\mathrm{co},n)`$.*

Proof. Define $`\mathbf{w}`$ to be the projection of $`\mathbf{v}`$ on $`\mathbb{B}^q`$ (where $`\mathbf{v}`$ is the vector given by Lemma 41). We now split into cases. If
$`\frac{4\|\sigma\|_\infty}{\epsilon\sqrt{{q}}}\cdot\frac{1}{a^2_{K'}\beta^{2K'-2K}}\left(\frac{n}{1-\beta^2}\right)^{K}\|p\|^2_{\mathrm{co}}\le 1`$
then $`\mathbf{v}=\mathbf{w}`$ and
$`\delta=\delta_{\sigma,K',K}(\epsilon,\beta,q,\|p\|_\mathrm{co},n)`$,
so the lemma follows from Equation [eq:lem:rf_main_simp_1]. Otherwise, we have $`\delta_{\sigma,K',K}(\epsilon,\beta,q,\|p\|_\mathrm{co},n)= 1`$ and the lemma is trivially true. ◻

The research described in this paper was funded by the European Research
Council (ERC) under the European Unionās Horizon 2022 research and
innovation program (grant agreement No. 101041711), and the Simons
Foundation (as part of the Collaboration on the Mathematical and
Scientific Foundations of Deep Learning). The author thanks Elchanan
Mossel and Mariano Schain for useful comments.

We note that in practice it is often the case that an example possesses several positive labels (for instance, “dog” and “animal”). However, each training example usually comes with just one of its positive labels. We hope that future work will be able to handle this more realistic type of supervision.

Polynomial Threshold Functions
``` math
\begin{equation}
\label{eq:truncated_ball}
{\cal B}_r(\mathbf{x} ) = \left\{\tilde \mathbf{x} \in [-1,1]^d : \|\mathbf{x} -\tilde\mathbf{x} \|_\infty\le r \right\}
\end{equation}
```

``` math
\forall \mathbf{x} \in {\cal X}\;\forall \tilde \mathbf{x} \in {\cal B}_{\xi}(\mathbf{x} ),\;\;B\ge p(\tilde\mathbf{x} )f(\mathbf{x} )\ge 1
```

``` math
\forall \mathbf{x} \in {\cal X}\;\forall \mathbf{y}\in {\cal B}_\xi(\mathbf{h}(\mathbf{x} )),\;\;B\ge p(\mathbf{y})f(\mathbf{x} )\ge 1
```

Strong Convexity

``` math
f(\mathbf{y}) \ge f(\mathbf{x} ) + {\left\langle \mathbf{y}-\mathbf{x} ,\nabla f(\mathbf{x} ) \right\rangle} + \frac{\lambda}{2}\|\mathbf{y}-\mathbf{x} \|^2
```

``` math
\begin{eqnarray}
\label{eq:strongly_conv_guarantee}
f(\mathbf{x} ) &\le & f(\mathbf{y}) - \frac{\lambda}{2}\|\mathbf{y}-\mathbf{x} \|^2 + \|\mathbf{y}-\mathbf{x} \| \cdot \|\nabla f(\mathbf{x} )\|\nonumber
\\
&=& f(\mathbf{y})+\frac{\|\nabla f(\mathbf{x} )\|^2}{2\lambda} - \frac{1}{2\lambda}\left(\|\nabla f(\mathbf{x} )\|-\lambda\|\mathbf{y}-\mathbf{x} \|\right)^2
\\
&\le & f(\mathbf{y})+ \frac{\|\nabla f(\mathbf{x} )\|^2}{2\lambda}\nonumber
\\
&\le & f(\mathbf{y})+ \frac{\epsilon^2}{2\lambda}\nonumber
\end{eqnarray}
```

Hermite Polynomials

``` math
\begin{equation}
\label{eq:hermite_rec}
xh_{n}(x) = \sqrt{n+1}h_{n+1}(x) + \sqrt{n}h_{n-1}(x)\;\;,\;\;\;\;\;h_0(x)=1,\;h_1(x)=x
\end{equation}
```

``` math
h_{n+1}(x) = \frac{x}{\sqrt{n+1}}h_{n}(x) - \sqrt{\frac{n}{n+1}}h_{n-1}(x)
```

``` math
\begin{equation}
\label{eq:hermite_gen_fun}
e^{xt - \frac{t^2}{2}} = \sum_{n=0}^\infty \frac{h_n(x)t^n}{\sqrt{n!}}
\end{equation}
```

``` math
\begin{equation}
\label{eq:hermite_derivative}
h_n' = \sqrt{n}h_{n-1}
\end{equation}
```

``` math
\begin{equation}
\label{eq:hermite_prod_exp}
\mathbb{E}h_i(X)h_j(Y) = \delta_{ij}\rho^{i}
\end{equation}
```

The Hierarchical Model

``` math
S= \{(\mathbf{x} ^1,\mathbf{f}^*(\mathbf{x} ^1)),\ldots,(\mathbf{x} ^m,\mathbf{f}^*(\mathbf{x} ^m))\} \in \left({\cal X}\times \{\pm 1\}^{n}\right)^m
```

The “Brain Dump” Hierarchy

``` math
\begin{equation*}
\forall \mathbf{x} \in{\cal X}, \quad G^i_j(\mathbf{x} ) = h^i_j(G^{i-1}(\mathbf{x} )),
\end{equation*}
```

``` math
\begin{equation*}
f^*_{i,j}(\mathbf{x} ) = \mathrm{sign}\left(\sum_{l=1}^d w_l^{i,j}G^i_l(\mathbf{x} )\right),
\end{equation*}
```

``` math
{\cal W}_{d,k} := \left\{\mathbf{w}\in \{-1,0,1\}^d : \sum_{l=1}^d |w_l|= k \right\}
```

Extension to Sequential and Ensemble Models

``` math
S= \{(\vec\mathbf{x} ^1,\mathbf{f}^*(\vec\mathbf{x} ^1)),\ldots,(\vec\mathbf{x} ^m,\mathbf{f}^*(\vec\mathbf{x} ^m))\} \in \left({\cal X}^G\times \{\pm 1\}^{n,G}\right)^m
```

Algorithm and Main Result

``` math
\Psi_1(\vec\mathbf{x} ) = W^1_2\sigma(W^1_1 E(\vec\mathbf{x} )+\mathbf{b}^1)
```

``` math
\Psi_k(\vec\mathbf{x} ) = \vec\mathbf{x} + W^k_2\sigma(W^k_1 E(\vec\mathbf{x} )+\mathbf{b}^k)
```

``` math
\Psi_D(\vec\mathbf{x} ) = W^D\vec\mathbf{x}
```

``` math
\ell(\hat\mathbf{y},\mathbf{y}) = \frac{1}{|G|}\sum_{g\in G}\ell(\hat y_g\cdot y_g)
```

``` math
\ell_{S,j}(\hat\mathbf{f}) =\ell_{S,j}(\hat\mathbf{f}_j) = \frac{1}{m}\sum_{t=1}^m\ell\left(\hat \mathbf{f}_{j}(\vec\mathbf{x} ^t),\mathbf{y}^t_{j}\right)
```

``` math
\ell_{S}(\hat\mathbf{f}) = \sum_{j=1}^n\ell_{S,j}(\hat\mathbf{f})\;\;\;\;\;\text{ and }\;\;\;\;\; \ell_{S}(\vec W)=\ell_{S}\left(\hat\mathbf{f}_{\vec W}\right)
```

``` math
\begin{equation}
\label{eq:loss}
\ell = \ell_{1/(2B)}+\frac{1}{4m|G|}\ell_{1-\xi/2}\;\;\;\;\text{ for }\;\;\;\;\ell_{\eta}(z) = \begin{cases}1-\frac{z}{\eta} & 0\le z\le \eta\\
0 & \eta\le z\le 1
\\
\infty&\text{otherwise}\end{cases}
\end{equation}
```
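The loss just displayed is straightforward to transcribe into code; the following sketch is a direct implementation of the definition in [eq:loss], with $`B`$, $`\xi`$, $`m`$ and $`|G|`$ passed in as parameters (the numbers in the example call are arbitrary).

``` python
import numpy as np

def ell_eta(z, eta):
    # Ramp loss from eq. [eq:loss]: 1 - z/eta on [0, eta], 0 on [eta, 1], +inf outside [0, 1].
    z = np.asarray(z, dtype=float)
    out = np.where(z <= eta, 1.0 - z / eta, 0.0)
    return np.where((z < 0.0) | (z > 1.0), np.inf, out)

def ell(z, B, xi, m, G_size):
    # Combined per-example loss: ell_{1/(2B)} + (1 / (4 m |G|)) * ell_{1 - xi/2}.
    return ell_eta(z, 1.0 / (2.0 * B)) + ell_eta(z, 1.0 - xi / 2.0) / (4.0 * m * G_size)

# Example margins z = y * f_hat(x): margins close to 1 incur (near-)zero loss.
print(ell(np.array([0.05, 0.2, 0.9, 1.0]), B=3.0, xi=0.25, m=100, G_size=4))
```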
Proof of Theorem [thm:main_intro]
\begin{equation}
\label{eq:large_margin_sample_err_def}
\mathrm{Err}_{S,\gamma}(\hat{\mathbf{f}})=\frac{1}{m}\sum_{t=1}^m 1\left[\exists (i,g)\in [n]\times G\text{ s.t. }\hat f_{i,g}(\vec\mathbf{x} ^t)\cdot f^*_{i,g}(\vec\mathbf{x} ^t) < \gamma \right]
\end{equation}
\Psi_{k'}(\vec\mathbf{x} ) = \vec\mathbf{x} + W^{k'}_2\sigma(W^{k'}_1 E(\vec\mathbf{x} )+\mathbf{b}^{k'}) = \vec\mathbf{x}\Phi^{k-1}(\vec\mathbf{x} ) = \sigma(W^k_1E(\Gamma^{k-1}(\vec\mathbf{x} ))+\mathbf{b}^k) = \sigma(W^k_1E( (W^D)^{-1}\hat\mathbf{f}^k(\mathbf{x} ))+\mathbf{b}^k) =\hat\mathbf{f}^k_{j,\mathbf{w}}(\vec\mathbf{x} ) = \hat\mathbf{f}^{k-1}_{j}(\vec\mathbf{x} ) + \mathbf{w}^\top \Phi^{k-1}(\vec\mathbf{x} )\begin{eqnarray*}
\ell^{k}_{S,j}(\mathbf{w})&=&\ell_{S,j}\left(\hat\mathbf{f}^k_{j,\mathbf{w}}\right) + \frac{\epsilon_\mathrm{opt}}{2}\|\mathbf{w}\|^2
\end{eqnarray*}\ell_{S,j}(\hat \mathbf{f}^k) \le \ell^k_{S,j}(\mathbf{w}^*) + \frac{\epsilon_\mathrm{opt}}{2}\|\mathbf{w}^*\|^2 + \frac{\epsilon_\mathrm{opt}}{2}
```*
</div>
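Before the proof, here is a minimal sketch of the layerwise step the lemma describes: with the first sub-layer frozen at random values, the block output is linear in the trained output weights, so the regularized per-label objective is convex. The sketch substitutes a logistic surrogate and a tanh activation for the paper's loss and activation, and the toy target is an arbitrary choice.

``` python
import numpy as np

rng = np.random.default_rng(0)

def train_block_output_weights(prev_repr, f_prev, y, q=256, lam=1e-3, lr=0.5, steps=300):
    """One layerwise step for one label: the first sub-layer (W1, b) is random and frozen,
    so the block output f_prev(x) + <w, sigma(W1 x + b)> is linear in w and the regularized
    objective below is convex in w (here: a logistic surrogate plus ridge)."""
    m, n_in = prev_repr.shape
    W1 = rng.standard_normal((q, n_in)) / np.sqrt(n_in)   # frozen random weights
    b = rng.standard_normal(q)
    Phi = np.tanh(prev_repr @ W1.T + b)                   # fixed features for this step

    w = np.zeros(q)
    for _ in range(steps):
        margins = np.clip(y * (f_prev + Phi @ w), -30.0, 30.0)
        # gradient of (1/m) sum_t log(1 + exp(-margin_t)) + (lam/2) ||w||^2
        grad = -(Phi * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0) + lam * w
        w -= lr * grad
    return w, f_prev + Phi @ w                            # updated predictions for this label

# Toy usage: previous representation, previous predictions (zero), and +-1 labels.
m, n = 400, 10
prev_repr = rng.choice([-1.0, 1.0], size=(m, n))
y = prev_repr[:, 0] * prev_repr[:, 1]                     # a simple +-1 target
w, f_new = train_block_output_weights(prev_repr, np.zeros(m), y)
print("training accuracy after one layerwise step:", np.mean(np.sign(f_new) == y))
```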
<div class="proof">
*Proof.* When the $`k`$'th layer is trained, since all deeper layers
during this training phase are the identity function, the output of the
network as a function of $`W^k_2`$ (the parameters that are trained in
the $`k`$'th step) is
``` math
G(W^k_2,\vec\mathbf{x} ) = W^D\left(\Gamma_{k-1}(\vec\mathbf{x} ) + W^k_2\Phi^{k-1}(\vec\mathbf{x} )\right) = \hat{\mathbf{f}}^{k-1}(\vec\mathbf{x} ) + W^DW^k_2\Phi^{k-1}(\vec\mathbf{x} )L(W^k_2)=\frac{\epsilon_\mathrm{opt}}{2}\|W^k_2\|^2+\frac{1}{m}\sum_{t=1}^m\sum_{j=1}^n\ell(\hat{\mathbf{f}}^{k-1}(\vec\mathbf{x} ) + W^DW^k_2\Phi^{k-1}(\vec\mathbf{x} ),\mathbf{y}^t_j)\begin{eqnarray*}
L'(W)=L((W^D)^{-1}W)&=&\frac{\epsilon_\mathrm{opt}}{2}\|(W^D)^{-1}W\|^2+\frac{1}{m}\sum_{t=1}^m\sum_{j=1}^n\ell(\hat{\mathbf{f}}_j^{k-1}(\vec\mathbf{x} ) + W^d(W^D)^{-1}W\Phi^{k-1}(\vec\mathbf{x} ),\mathbf{y}^t_j)
\\
&\stackrel{W^D\text{ is orthogonal}}{=}&\frac{\epsilon_\mathrm{opt}}{2}\|W\|^2+\frac{1}{m}\sum_{t=1}^m\sum_{j=1}^n\ell(\hat{\mathbf{f}}_j^{k-1}(\vec\mathbf{x} ) + W\Phi^{k-1}(\vec\mathbf{x} ),\mathbf{y}^t_j)
\\
&=&\sum_{j=1}^n\left(\frac{\epsilon_\mathrm{opt}}{2}\|W_{j\cdot}\|^2+\frac{1}{m}\sum_{t=1}^m\ell(\hat{\mathbf{f}}_j^{k-1}(\vec\mathbf{x} ) + W_{j\cdot}\Phi^{k-1}(\vec\mathbf{x} ),\mathbf{y}^t_j)\right)
\\
&=&\sum_{j=1}^n\ell^k_{S,j}(W_{j\cdot})
\end{eqnarray*}\ell_{S,j}(\hat \mathbf{f}^k) \le \ell^k_{S,j}(\mathbf{w}^*) + \frac{\epsilon_\mathrm{opt}}{2}\|\mathbf{w}^*\|^2 + \frac{\epsilon_\mathrm{opt}}{2}
\begin{equation}
\label{eq:rob_loss}
\ell^{\mathrm{rob},\epsilon}(z) = \max(\ell(z),\ell(z-\epsilon)) = \max_{0\le t\le \epsilon}\ell(z-t)
\end{equation}\begin{equation}
\sigma = \sum_{s=0}^\infty a_sh_s
\end{equation}\frac{\|\sigma\|}{a_{K'}}2^{(K'+2)/2}\frac{1-\beta^2}{\sqrt{1-2(1-\beta^2)^2}}\le \frac{\epsilon}{2}\delta(\epsilon,\beta,q,M,n) = \delta_{\sigma,K',K}(\epsilon,\beta,q,M,n) = \begin{cases}
1 & \frac{4\|\sigma\|_\infty}{\epsilon\sqrt{{q}}}\cdot\frac{1}{a^2_{K'}\beta^{2K'-2K}}\left(\frac{n}{1-\beta^2}\right)^{K}M^2 > 1
\\
2\exp\left(-{q}\cdot\frac{a^4_{K'}\beta^{4K'-4K}(1-\beta^2)^{2K}\epsilon^4}{512 n^{2K}M^4\|\sigma\|_\infty^4}\right) & \text{otherwise}
\end{cases}\begin{equation}
\label{eq:beta_est}
\delta(\epsilon,\beta,q,M,n) = \exp\left(-\Omega\left(q\cdot\frac{ \epsilon^{2K+4}}{n^{2K}M^4}\right)\right)
\end{equation}\forall\mathbf{x} \in {\cal X},\;\Pr\left(|{\left\langle \mathbf{w},\sigma(W\mathbf{x} +\mathbf{b}) \right\rangle}-p(\mathbf{x} )|\ge \epsilon \right) \le \delta_{\sigma,K',K}(\epsilon,\beta,q,\|p\|_\mathrm{co},n)
```*
</div>
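The guarantee above is existential (some $`\mathbf{w}`$ works). As a rough numerical illustration, the sketch below samples a $`\beta`$-Xavier pair, builds the random features $`\sigma(W\mathbf{x} +\mathbf{b})`$, and simply fits $`\mathbf{w}`$ by least squares to a fixed degree-2 polynomial; the activation (tanh), the polynomial, and the sample sizes are illustrative assumptions rather than the construction used in the proof.

``` python
import numpy as np

rng = np.random.default_rng(1)
n, beta = 4, 0.9
sigma = np.tanh                      # a bounded activation (illustrative choice)

def p(X):                            # a fixed degree-2 polynomial with values in [-1, 1]
    return 0.5 * X[:, 0] * X[:, 1] + 0.25 * X[:, 2] ** 2 - 0.25

X_train = rng.uniform(-1.0, 1.0, size=(1000, n))
X_test = rng.uniform(-1.0, 1.0, size=(1000, n))

for q in [50, 200, 1000, 5000]:
    # beta-Xavier pair: rows of W ~ N(0, (1 - beta^2)/n * I_n), entries of b ~ N(0, beta^2).
    W = rng.standard_normal((q, n)) * np.sqrt((1.0 - beta ** 2) / n)
    b = rng.standard_normal(q) * beta
    w, *_ = np.linalg.lstsq(sigma(X_train @ W.T + b), p(X_train), rcond=None)
    err = np.max(np.abs(sigma(X_test @ W.T + b) @ w - p(X_test)))
    print(f"q={q:5d}   max_x |<w, sigma(Wx+b)> - p(x)| on test points ~ {err:.4f}")
```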
We are now ready to show that if there is a polynomial
$`p:\mathbb{R}^{wn}\to\mathbb{R}`$ such that
$`\ell^{\mathrm{rob},\epsilon_1}_{S,j}(p \circ E_g\circ \hat\mathbf{f}^k)`$
is small, then w.h.p.Ā $`\ell_{S,j}(\hat\mathbf{f}^{k+1})`$ will be small
as well.
<div id="lem:pol_imp_small_loss" class="lemma">
**Lemma 14**. *Fix $`\epsilon_1>0`$, $`1>\beta>\beta(\epsilon_1/2)`$ and
a polynomial $`p:\mathbb{R}^{wn}\to\mathbb{R}`$. Given that
$`\ell^{\mathrm{rob},\epsilon_1}_{S,j}(p \circ E_g\circ \hat\mathbf{f}^k)\le \epsilon`$,
we have that
$`\ell_{S,j}( \hat\mathbf{f}^{k+1})\le \epsilon + \epsilon_\mathrm{opt}`$
w.p. $`1-m|G|\delta(\epsilon_1/2,\beta,q,\|p\|_\mathrm{co}+1,wn)`$*
</div>
<div class="proof">
*Proof.* By lemma
<a href="#lem:each_layer_is_convex" data-reference-type="ref"
data-reference="lem:each_layer_is_convex">12</a> we have
$`\ell_{S,j}( \hat\mathbf{f}^{k+1})\le \ell_{S,j}\left(\hat\mathbf{f}^{k+1}_{j,\mathbf{w}^*}\right)+\frac{\epsilon_\mathrm{opt}}{2}\|\mathbf{w}^*\|^2+\frac{\epsilon_\mathrm{opt}}{2}`$
for any $`\mathbf{w}^*\in \mathbb{R}^q`$. Thus, it is enough to show
that w.p.
$`1-m|G|\delta(\epsilon_1/2,\beta,q,\|p\|_\mathrm{co}+1,wn)=:1-\delta`$
over the choice of $`W^k_1`$ there is $`\mathbf{w}^*\in\mathbb{B}^d`$
such that
$`\ell_{S,j}\left(\hat\mathbf{f}^{k+1}_{j,\mathbf{w}^*}\right)\le \epsilon`$.
By the definition of $`\ell^{\mathrm{rob},\epsilon_1}_{S,j}`$ it is
enough to show that w.p.Ā $`1-\delta`$ there is
$`\mathbf{w}^*\in\mathbb{B}^d`$ such that
``` math
\begin{equation}
\label{eq:lem:pol_imp_small_loss}
y^t_{j,g}\cdot p \circ E_g\circ \hat\mathbf{f}^k(\vec\mathbf{x} ^t)-\epsilon_1\le y^t_{j,g}\cdot\hat f^{k+1}_{j,g,\mathbf{w}^*}(\vec\mathbf{x} ^t)\le y^t_{j,g}\cdot p \circ E_g\circ \hat\mathbf{f}^k(\vec\mathbf{x} ^t)
\end{equation}\left| p \circ E_g\circ \hat\mathbf{f}^k(\vec\mathbf{x} ^t) - \hat f^{k+1}_{j,g,\tilde\mathbf{w}^*}(\vec\mathbf{x} ^t)\right|\le \frac{\epsilon_1}{2}\hat f^{k+1}_{j,g,\mathbf{w}^*}(\vec\mathbf{x} ) = \hat f^{k}_{j,g}(\vec\mathbf{x} ) + {\left\langle \mathbf{w}^*,\sigma(W^{k+1}_1E_g\circ\hat{\mathbf{f}}^{k}(\vec\mathbf{x} )+\mathbf{b}^{k+1}) \right\rangle}\left| \tilde p \circ E_g\circ \hat\mathbf{f}^k(\vec\mathbf{x} ^t) - {\left\langle \mathbf{w}^*,\sigma(W^{k+1}_1E_g\circ\hat{\mathbf{f}}^{k}(\vec\mathbf{x} ^t)+\mathbf{b}^{k+1}) \right\rangle}\right|\le \frac{\epsilon_1}{2}
\delta = m|G|\exp\left(-\Omega\left(q\cdot\frac{ \gamma^{2K+4}}{(wn)^{2K}(M+1)^4}\right)\right)\ell_{S,j}(\mathbf{f}^{k+t})\le e^{-\gamma t}\ell_{S,j}(\mathbf{f}^{k})+\frac{1}{1-e^{-\gamma}}\epsilon_\mathrm{opt}\le \frac{e^{-\gamma t}}{2m|G|} + \frac{\xi}{16m^2|G|^2}\ell_{S,j}(\mathbf{f}^{k+t_0})\le \frac{\xi}{8m^2|G|^2}\ell_{S,j}(\mathbf{f}^{k})\le \frac{\xi}{8m^2|G|^2}
1\ge y^t_{j',g}\hat f^k_{j',g}(\vec \mathbf{x} ^t)\ge (1-\xi/2)(1-\xi/2)\ge1-\xi\begin{equation}
\label{eq:loss_improvments_proof}
\ell^{\mathrm{rob},\gamma}_{S,j}(q\circ \hat\mathbf{f}^k)\le e^{-\gamma}\ell_{S,j}(\hat\mathbf{f}^k)
\end{equation}\ell\left(y^t_{j,g} q\left( \hat f^k_{j,g}(\vec \mathbf{x} ^t)\right)\right) = \ell\left( q\left( y^t_{j,g}\cdot\hat f^k_{j,g}(\vec \mathbf{x} ^t)\right)\right)\begin{equation}
\label{eq:proof_pushing_pol}
\tilde q(x')-x'=\frac{1}{2}x'(1-x'^2)=\frac{1}{2}x'(1-x')(1+x')\ge \frac{1}{2}x'(1-x') \ge \frac{1}{4}\min\left(1/4B,1/2\xi\right) \ge 2\gamma
\end{equation}\begin{eqnarray*}
\ell(\tilde q(x)-\gamma) &\stackrel{x'\le x}{\le}& \ell(\tilde q(x')-\gamma)
\\
&\stackrel{\text{Eq. \eqref{eq:proof_pushing_pol}}}{\le}&\ell(x'+\gamma)
\\
&=& \ell\left(\frac{1-x'-\gamma}{1-x'}x' + \frac{\gamma}{1-x'}\right)
\\
&\stackrel{\text{Convexity and }\frac{\gamma}{1-x'}\le 1}{\le}& \frac{1-x'-\gamma}{1-x'}\ell(x') + \frac{\gamma}{1-x'}\ell(1)
\\
&\stackrel{\ell(1)=0\text{ and }\ell(x')=\ell(x)}{=}& \frac{1-x'-\gamma}{1-x'}\ell(x)
\\
&\le& e^{-\gamma}\ell(x)
\end{eqnarray*}Conclusion and Future Work
More Preliminaries
Concentration of Measure
\Pr\left(\left|\frac{1}{q}\sum_{i=1}^q X_i - \mu\right|\ge \epsilon\right)\le 2e^{-\frac{q\epsilon^2}{2B^2}}
```*
</div>
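A quick Monte Carlo sanity check of the bound above (uniform variables on $`[-1,1]`$ serve as an arbitrary example of $`B`$-bounded variables):

``` python
import numpy as np

rng = np.random.default_rng(0)
q, B, eps, trials = 200, 1.0, 0.1, 20_000

# X_i i.i.d. with |X_i| <= B and mean mu = 0 (uniform on [-B, B]).
X = rng.uniform(-B, B, size=(trials, q))
empirical = np.mean(np.abs(X.mean(axis=1)) >= eps)
bound = 2.0 * np.exp(-q * eps ** 2 / (2.0 * B ** 2))
print(f"empirical tail: {empirical:.3f}   Hoeffding bound: {bound:.3f}")
```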
<div id="lem:chernoff" class="lemma">
**Lemma 18** (Chernoff). *Let $`X_1,\ldots,X_q\in \{0,1\}`$ be
i.i.d. with mean $`\mu`$. Then, for any $`0\le\epsilon\le \mu`$ we have
``` math
\Pr\left(\left|\frac{1}{q}\sum_{i=1}^q X_i - \mu\right|\ge \epsilon\right)\le 2e^{-\frac{q\epsilon^2}{3\mu}}
```*
</div>
We will also need the following version of Chernoff's bound.
<div id="lem:extended_chernoff" class="lemma">
**Lemma 19**. *Let $`X_1,\ldots,X_q\in \{-1,1,0\}`$ be i.i.d. random
variables with mean $`\mu`$. Then for
$`\epsilon\le \frac{\min\left(\Pr(X_i= 1),\Pr(X_i= -1)\right)}{2|\mu|}`$,
$`\Pr\left(\left|\frac{1}{q |\mu|}\sum_{i=1}^n X_i-\frac{\mu}{|\mu|}\right|\ge \epsilon\right)\le 4e^{-\frac{q\epsilon^2|\mu|^2}{12\Pr(X_i\ne 0)}}`$*
</div>
<div class="proof">
*Proof.* (of Lemma
<a href="#lem:extended_chernoff" data-reference-type="ref"
data-reference="lem:extended_chernoff">19</a>) Let
$`X^+_i = \max(X_i,0)`$ and $`\mu_+=\mathbb{E}X^+_i= \Pr(X_i=1)`$.
Similarly, let $`X^-_i = \max(-X_i,0)`$ and
$`\mu_-=\mathbb{E}X^-_i= \Pr(X_i=-1)`$. By Chernoff bound (Lemma
<a href="#lem:chernoff" data-reference-type="ref"
data-reference="lem:chernoff">18</a>) we have for $`0\le \delta\le 1`$
``` math
\Pr\left(\left|\frac{1}{q}\sum_{i=1}^n X^+_i-\mu_+\right|\ge \delta\mu_+\right)\le 2e^{-\frac{q\delta^2\mu_+}{3}}\Pr\left(\left|\frac{1}{q |\mu|}\sum_{i=1}^n X^+_i-\frac{\mu_+}{|\mu|}\right|\ge \delta\frac{\mu_+}{|\mu|}\right)\le 2e^{-\frac{q\delta^2\mu_+}{3}}\Pr\left(\left|\frac{1}{q |\mu|}\sum_{i=1}^n X^+_i-\frac{\mu_+}{|\mu|}\right|\ge \epsilon\right)\le 2e^{-\frac{q\epsilon^2|\mu|^2}{3\mu_+}} \le 2e^{-\frac{q\epsilon^2|\mu|^2}{3\Pr(X_i \ne 0)}}\Pr\left(\left|\frac{1}{q |\mu|}\sum_{i=1}^n X^-_i-\frac{\mu_-}{|\mu|}\right|\ge \epsilon\right) \le 2e^{-\frac{q\epsilon^2|\mu|^2}{3\Pr(X_i \ne 0)}}\begin{eqnarray*}
\Pr\left(\left|\frac{1}{q |\mu|}\sum_{i=1}^n X_i-\frac{\mu}{|\mu|}\right|\ge \epsilon\right) &\le& \Pr\left(\left|\frac{1}{q |\mu|}\sum_{i=1}^n X^+_i-\frac{\mu_+}{|\mu|}\right|\ge \frac{\epsilon}{2}\right) + \Pr\left(\left|\frac{1}{q |\mu|}\sum_{i=1}^n X^-_i-\frac{\mu_-}{|\mu|}\right|\ge \frac{\epsilon}{2}\right)
\\
&\le & 4e^{-\frac{q\epsilon^2|\mu|^2}{12\Pr(X_i\ne 0)}}
\end{eqnarray*}Misc Lemmas
\|q\|_\mathrm{co}\le \sum_{\alpha\in \{0,\ldots,K\}^{n},\|\alpha\|_1\le K} |b_\alpha | \cdot \|e_\alpha\| \stackrel{\text{C.S.}}{\le} \|p\|_\mathrm{co}\cdot \sqrt{\sum_{\alpha\in \{0,\ldots,K\}^{n},\|\alpha\|_1\le K} \|e_\alpha\|^2}\|e_\alpha\|^2 = \left\|\mathbf{a}^{\otimes\sigma_1}_1\otimes\ldots\otimes\mathbf{a}_n^{\otimes\sigma_n}\right\|^2 = \prod_{i=1}^n \|\mathbf{a}_1\|^{2\sigma_i}\le R^{2K}A Generalization Result
\mathrm{Err}_{{\cal D},\gamma}(\hat{\mathbf{f}})=\Pr_{\mathbf{x} \sim{\cal D}}\left(\exists i\in [n]\text{ s.t. }\hat f_{i}(\mathbf{x} )\cdot f_{i}(\mathbf{x} ) < \gamma \right)\forall\mathbf{x} \in{\cal X},\;\|\mathbf{g}(\mathbf{x} )-\tilde \mathbf{g}(\mathbf{x} )\|_\infty \le \frac{\gamma}{2}\mathrm{Err}_{\cal D}(\mathbf{g})\ge \epsilon \Rightarrow \mathrm{Err}_{{\cal D},\gamma/2}(\tilde\mathbf{g})\ge \epsilon \Rightarrow \mathrm{Err}_{S,\gamma/2}(\tilde\mathbf{g}) > 0 \Rightarrow \mathrm{Err}_{S,\gamma}(\mathbf{g}) > 0\mathbb{E}_S\mathrm{Err}_{{\cal D}}(\hat\mathbf{f}) \le \mathbb{E}_S[\mathrm{Err}_{{\cal D}}(\hat\mathbf{f})|A] + \Pr(A^\complement) \le \epsilon + N(1-\epsilon)^m +\delta_1+\delta_2Kernels
Random Features Schemes
\begin{equation}
\label{eq:ker_eq_inner}
\forall \mathbf{x} ,\mathbf{x} '\in{\cal X},\;\;\;\;k(\mathbf{x} ,\mathbf{x} ') =
\mathbb{E}_{\omega\sim \mu}\psi(\omega,\mathbf{x} )\psi(\omega,\mathbf{x} ')\,.
\end{equation}\Psi_{\boldsymbol{\omega}}(\mathbf{x} ) :=
\left(\psi({\omega_1},\mathbf{x} ),\ldots , \psi({\omega_q},\mathbf{x} ) \right) \,,\begin{equation}
\label{eqn:psi-embedding}
\mathbf{x} \mapsto\Psi^\mathbf{x} \; \mbox{ where } \;
\Psi^\mathbf{x} :=\psi(\cdot,\mathbf{x} )\in L^2(\Omega) \,.
\end{equation}\begin{equation}
\label{eq:f_norm_eq}
\|\check{f}\|_{L^2(\Omega)} = \|f\|_{k}
\end{equation}\begin{equation}
\label{eq:f_x_as_inner}
f(\mathbf{x} ) = {\left\langle \check{f},\Psi^\mathbf{x} \right\rangle}_{L^2(\Omega)} = \mathbb{E}_{\omega\sim\mu}\check{f}(\omega)\psi(\omega,\mathbf{x} )\,.
\end{equation}\begin{eqnarray*}
\frac{1}{q}\mathbb{E}_{\omega\sim\mu}
\left|\check{f}(\omega)\psi(\omega,\mathbf{x} )\right|^2
&\le &
\frac{\|\psi\|_\infty^2}{q}\mathbb{E}_{\omega\sim\mu}
\left|\check{f}(\omega)\right|^2
\\
&=& \frac{\|\psi\|_\infty^2\|f\|^2_{k}}{q}\,.
\end{eqnarray*}\mathbb{E}_{\boldsymbol{\omega}}\|f - f_{\boldsymbol{\omega}}\|_{2,{\cal D}} \stackrel{\text{Jensen}}{\le} \sqrt{\mathbb{E}_{\boldsymbol{\omega}}\|f - f_{\boldsymbol{\omega}}\|^2_{2,{\cal D}}} = \sqrt{\mathbb{E}_{\boldsymbol{\omega}}\mathbb{E}_{\mathbf{x} \sim{\cal D}}|f(\mathbf{x} ) - f_{\boldsymbol{\omega}}(\mathbf{x} )|^2} = \sqrt{\mathbb{E}_\mathbf{x} \mathbb{E}_{\boldsymbol{\omega}}|f(\mathbf{x} ) - f_{\boldsymbol{\omega}}(\mathbf{x} )|^2} \le \frac{\|\psi\|_\infty\|f\|_{k}}{\sqrt{q}}f_{{\boldsymbol{\omega}},B}(\mathbf{x} ) = \frac{1}{q}\sum_{i=1}^q
{\left\langle \check{f}(\omega_i) \right\rangle}_B\cdot\psi(\omega_i,\mathbf{x} )\begin{equation}
\label{eq:trunc_f_vomeg}
\Pr\left(\left|f_{{\boldsymbol{\omega}},B}(\mathbf{x} )-\mathbb{E}_{{\boldsymbol{\omega}}'} f_{{\boldsymbol{\omega}}',B}(x)\right|>\epsilon/2\right) \le 2e^{-\frac{q\epsilon^2}{8B^2C^2}}
\end{equation}\begin{eqnarray*}
\left|f(x) - \mathbb{E}_{{\boldsymbol{\omega}}'} f_{{\boldsymbol{\omega}}',B}(x)\right| &=& \left|\mathbb{E}\left(f_{\boldsymbol{\omega}}(x) - f_{{\boldsymbol{\omega}},B}(x)\right)\right|
\\
&=& \left|\mathbb{E}\left(\check{f}(\omega) - {\left\langle \check{f}(\omega) \right\rangle}_B\right)\cdot\psi(\omega,\mathbf{x} )\right|
\\
&=& \left|\mathbb{E}1_{|\check{f}(\omega)|>B} \check{f}(\omega)\psi(\omega,\mathbf{x} )\right|
\\
&\le& \sqrt{\Pr (|\check{f}(\omega)|>B) \mathbb{E}\left(\check{f}(\omega)\psi(\omega,\mathbf{x} )\right)^2}
\\
&\le& \|\psi\|_\infty\sqrt{\Pr (|\check{f}(\omega)|>B) \mathbb{E}\left(\check{f}(\omega)\right)^2}
\\
&=& \frac{\|\psi\|_\infty\|f\|^2_{k}}{B}
\end{eqnarray*}\Pr\left(\left|f_{{\boldsymbol{\omega}},B}(\mathbf{x} )-f(\mathbf{x} )\right|>\epsilon\right) \le 2e^{-\frac{q\epsilon^4}{32M^4C^4}}\begin{equation*}
\|\mathbf{w}\|\le \frac{2CM^2}{\epsilon\sqrt{q}}
\end{equation*}
```*
</div>
# Examples of Hierarchies and Proof Theorem <a href="#thm:brain_dump_intro" data-reference-type="ref"
data-reference="thm:brain_dump_intro">4</a>
Fix $`{\cal X}\subset [-1,1]^n`$, a proximity mapping
$`\mathbf{e}:G\to G^w`$, and a collection of sets
$`{\cal L}= \{L_1,\ldots,L_r\}`$ such that
$`L_1\subseteq L_2\subseteq\ldots\subseteq L_r = [n]`$. So far, we have
seen one formal example of a hierarchy: in the non-ensemble setting
(i.e. $`w=|G|=1`$) Example
<a href="#example:o(1)_supp" data-reference-type="ref"
data-reference="example:o(1)_supp">2</a> shows that if any label depends
on $`K`$ simpler labels, and the labels in the first level are
$`(K,1)`$-PTFs of the input, then $`{\cal L}`$ is an
$`(r,K,1)`$-hierarchy. In this section we expand our set of examples. We
first show (Lemma
<a href="#lem:cur_imp_ref_cur" data-reference-type="ref"
data-reference="lem:cur_imp_ref_cur">28</a>) that if
$`({\cal L},\mathbf{e})`$ is an $`(r,K,M)`$-hierarchy then it is an
$`(r,K,2M,B,\xi)`$-hierarchy for suitable $`B`$ and $`\xi`$. Then, in
section <a href="#sec:few_lables_dependence" data-reference-type="ref"
data-reference="sec:few_lables_dependence">8.1</a>, consider in more
detail the case that each label depends on a few simpler labels, in a
few locations, and show that the parameters obtained from Lemma
<a href="#lem:cur_imp_ref_cur" data-reference-type="ref"
data-reference="lem:cur_imp_ref_cur">28</a> can be improved in this
case. Finally, in section
<a href="#sec:brain_dump" data-reference-type="ref"
data-reference="sec:brain_dump">8.2</a> we prove Theorem
<a href="#thm:brain_dump_intro" data-reference-type="ref"
data-reference="thm:brain_dump_intro">4</a>, showing that if all the
labels are ārandom snippets" from a given circuit, and there is enough
of them, then the target function has a low-complexity hierarchy.
<div id="lem:cur_imp_ref_cur" class="lemma">
**Lemma 28**. *Any $`(r,K,M)`$-hierarchy of
$`\mathbf{f}^*:{\cal X}^G\to \{\pm 1\}^{n,G}`$ is also an
$`(r,K,2M,B,\xi)`$-hierarchy for
$`\xi = \frac{1}{2(wn+1)^{\frac{K+1}{2}}KM}`$ and
$`B = 2(w\max(n,d)+1)^{K/2}M`$*
</div>
Lemma <a href="#lem:cur_imp_ref_cur" data-reference-type="ref"
data-reference="lem:cur_imp_ref_cur">28</a> follows immediately from the
definition of hierarchy and the following lemma
<div id="lem:ptf_imp_ref_ptf" class="lemma">
**Lemma 29**. *Any $`(K,M)`$-PTF $`f:{\cal X}\to \{\pm 1\}`$ is a
$`(K,2M,B,\xi)`$-PTF for
$`\xi = \frac{1}{2(n+1)^{\frac{K+1}{2}}KM}`$ and $`B = 2(n+1)^{K/2}M`$*
</div>
Lemma <a href="#lem:ptf_imp_ref_ptf" data-reference-type="ref"
data-reference="lem:ptf_imp_ref_ptf">29</a> is implied by Lemmas
<a href="#lem:Lip_and_boundness_of_pol" data-reference-type="ref"
data-reference="lem:Lip_and_boundness_of_pol">30</a> and
<a href="#lem:PTF_from_lip_and_bounded" data-reference-type="ref"
data-reference="lem:PTF_from_lip_and_bounded">31</a>
<div id="lem:Lip_and_boundness_of_pol" class="lemma">
**Lemma 30**. *Let $`p:\mathbb{R}^n\to\mathbb{R}`$ be a degree $`K`$
polynomial. Then $`p`$ is
$`((n+1)^{\frac{K+1}{2}}K\|p\|_\mathrm{co})`$-Lipschitz in
$`[-1,1]^{n}`$ w.r.t.Ā the $`\|\cdot\|_\infty`$ norm and satisfies
$`|p(\mathbf{x} )|\le (n+1)^{K/2}\|p\|_\mathrm{co}`$ for any
$`\mathbf{x} \in [-1,1]^{n}`$.*
</div>
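Before the proof, both bounds are easy to check numerically for a random low-degree polynomial; in the sketch below the dimensions, degree, and sample sizes are arbitrary.

``` python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, K = 3, 2

# A random polynomial p(x) = sum_alpha a_alpha x^alpha over monomials of degree <= K.
monomials = [a for a in itertools.product(range(K + 1), repeat=n) if sum(a) <= K]
coeffs = rng.standard_normal(len(monomials))
co_norm = np.linalg.norm(coeffs)                                   # ||p||_co

def p(x):
    return sum(c * np.prod(x ** np.array(a)) for c, a in zip(coeffs, monomials))

X = rng.uniform(-1.0, 1.0, size=(2000, n))
print("max |p(x)| on samples:", max(abs(p(x)) for x in X),
      "   bound (n+1)^(K/2)||p||_co:", (n + 1) ** (K / 2) * co_norm)

# Crude check of the Lipschitz constant in the sup norm on random nearby pairs.
Y = np.clip(X + rng.uniform(-0.05, 0.05, size=X.shape), -1.0, 1.0)
ratio = max(abs(p(x) - p(y)) / np.max(np.abs(x - y)) for x, y in zip(X[:500], Y[:500]))
print("max |p(x)-p(y)| / ||x-y||_inf:", ratio,
      "   bound K(n+1)^((K+1)/2)||p||_co:", K * (n + 1) ** ((K + 1) / 2) * co_norm)
```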
<div class="proof">
*Proof.* Denote
$`p(\mathbf{x} )= \sum_{\alpha\in \{0,\ldots,K\}^{n},\|\alpha\|_1\le K} a_\alpha \mathbf{x} ^\alpha`$.
We have
``` math
\frac{\partial p }{\partial x_i}\left(\mathbf{x} \right)= \sum_{\alpha\in \{0,\ldots,K-1\}^{n},\|\alpha\|_1\le K-1} a_{\alpha+\mathbf{e}_i}\cdot(\alpha_i+1)\cdot \mathbf{x} ^\alpha\begin{eqnarray*}
\left|\frac{\partial p }{\partial x_i}\left(\mathbf{x} \right)\right| &\le & \sum_{\alpha\in \{0,\ldots,K-1\}^{n},\|\alpha\|_1\le K-1} \left|a_{\alpha+\mathbf{e}_i}\cdot(\alpha_i+1) \cdot\mathbf{x} ^\alpha\right|
\\
&\le & K\sum_{\alpha\in \{0,\ldots,K-1\}^{n},\|\alpha\|_1\le K-1} \left|a_{\alpha+\mathbf{e}_i}\right|
\\
&\le & K\sqrt{(n+1)^{K-1}}\|p\|_\mathrm{co}
\end{eqnarray*}p(\mathbf{x} ) \le \sum_{\alpha\in \{0,\ldots,K\}^{n},\|\alpha\|_1\le K}| a_\alpha| \le 2(n+1)^{K/2}\|p\|_\mathrm{co}
2B\ge 2p(\tilde\mathbf{x} )f(\mathbf{x} ) \ge 1\begin{eqnarray*}
p(\tilde\mathbf{x} ) &\ge& p(\mathbf{x} ) - |p(\tilde\mathbf{x} )-p(\mathbf{x} )|
\\
&\ge & p(\mathbf{x} ) - L\cdot\|\mathbf{x} -\tilde\mathbf{x} \|_\infty
\\
&\ge&1- \frac{L}{2L}
\\
&=&\frac{1}{2}
\end{eqnarray*}Each Label Depends on $`O(1)`$ Simpler Labels
\left|\frac{\partial F}{\partial x_i}\right| = \left|\sum_{i\in A\subseteq[K]}a_A\mathbf{x} ^A\right| \le \sum_{i\in A\subseteq[K]}\left|a_A\right|\le \sum_{i\in A\subseteq[K]}\left|a_A\right|\|\nabla F(\mathbf{x} )\|_1 \le \sum_{A\subseteq[K]}|A|\left|a_A\right| \stackrel{\text{Cauchy Schwartz}}{\le} K 2^{K/2}|\tilde p(\tilde \mathbf{x} )-f(\mathbf{x} )| = |\tilde p(\tilde \mathbf{x} )-f(\mathrm{sign}(\tilde \mathbf{x} ))| = |F(q(\tilde \mathbf{x} ))-F(\mathrm{sign}(\tilde \mathbf{x} ))| \le \|q(\tilde \mathbf{x} )-\mathrm{sign}(\tilde \mathbf{x} )\|_\infty \le \epsilon1+\epsilon\ge \tilde p(\tilde \mathbf{x} )f(\mathbf{x} ) \ge 1-\epsilonB\ge p(\tilde \mathbf{x} )f(\mathbf{x} ) \ge 1Proof of Theorem <a href="#thm:brain_dump_intro" data-reference-type=“ref”
\begin{equation}
\label{eq:brain_dump_circ_def}
\forall \mathbf{x} \in{\cal X}, \quad G^i_j(\mathbf{x} ) = p^i_j(G^{i-1}(\mathbf{x} )),
\end{equation}\begin{equation}
\label{eq:brain_dump_target_def}
f^*_{i,j}(\mathbf{x} ) = \mathrm{sign}\left(\sum_{l=1}^d w_l^{i,j}G^i_l(\mathbf{x} )\right),
\end{equation}{\cal W}_{d,k} := \left\{\mathbf{w}\in \{-1,0,1\}^d : \sum_{l=1}^d |w_l|= k \right\}\Psi_{i-1}(\mathbf{x} ) = (f^*_{i-1,1}(\mathbf{x} ),\ldots,f^*_{i-1,q}(\mathbf{x} ))f^*_{i,j}(\mathbf{x} ) = \mathrm{sign}\left(\sum_{l=1}^d w_l^{i,j} p^i_l(G^{i-1}(\mathbf{x} ))\right) =: \mathrm{sign}\left(q(G^{i-1}(\mathbf{x} ))\right)1-\xi\le \frac{1-\xi/2}{q\alpha_{d,k}}\left(W\Psi(\mathbf{x} )\odot \mathbf{x} \right)_j\le1B\ge p\left( \frac{1-\xi/2}{q\alpha_{d,k}}W\Psi(\mathbf{x} )\right)\cdot f(\mathbf{x} ) \ge 1\|q\|_\mathrm{co}\le \|p\|_\mathrm{co}\cdot \left(\frac{\sqrt{q+1}}{\sqrt{q}\alpha_{d,k}}\right)^{K}\frac{1}{q\alpha_{d,k}}\left(W\Psi(\mathbf{x} )\right)_j = \frac{1}{q}\sum_{i=1}^q \frac{w^i_j\mathrm{sign}({\left\langle \mathbf{w}^i,\mathbf{x} \right\rangle})}{\alpha_{d,k}}\begin{eqnarray*}
\Pr(X_i = x_j) &=& \frac{k}{2d}\left[\Pr(\mathrm{sign}({\left\langle \mathbf{w}^i,\mathbf{x} \right\rangle})=1|w_j=x_j) + \Pr(\mathrm{sign}({\left\langle \mathbf{w}^i,\mathbf{x} \right\rangle})=-1|w_j=-x_j)\right]
\\
&=& \frac{k}{2d 2^{k-1}}\left[\binom{k-1}{\ge (k-1)/2}+\binom{k-1}{\ge (k-1)/2}\right]
\\
&=& \frac{k}{2d }\left[1 + \frac{\binom{k-1}{ (k-1)/2}}{2^{k-1}}\right]
\end{eqnarray*}\begin{eqnarray*}
\Pr(X_i = -x_j) &=& \frac{k}{2d}\left[\Pr(\mathrm{sign}({\left\langle \mathbf{w}^i,\mathbf{x} \right\rangle})-1|w_j=x_j) + \Pr(\mathrm{sign}({\left\langle \mathbf{w}^i,\mathbf{x} \right\rangle})=1|w_j=-x_j)\right]
\\
&=& \frac{k}{2d 2^{k-1}}\left[\binom{k-1}{> (k-1)/2}+\binom{k-1}{> (k-1)/2}\right]
\\
&=& \frac{k}{2d }\left[1 - \frac{\binom{k-1}{ (k-1)/2}}{2^{k-1}}\right]
\end{eqnarray*}\mathbb{E}X_i = \left(\Pr(X_i= x_j)-\Pr(X_i=-x_j)\right)x_j = \alpha_{d,k}\cdot x_j\Pr(X_i\ne 0) = \Pr(X_i= x_j) + \Pr(X_i=-x_j)= \frac{k}{d}\frac{\min\left(\Pr(X_i = 1),\Pr(X_i = -1)\right)}{|\mathbb{E}X_i|} = \frac{k}{2d \alpha_{d,k} }\left[1 - \frac{\binom{k-1}{ (k-1)/2}}{2^{k-1}}\right] \ge \frac{k}{\alpha_{d,k}4d} \ge \frac{1}{2}\frac{|\mathbb{E}X_i|^2}{\Pr(\mathrm{sign}({\left\langle \mathbf{w},\mathbf{x} \right\rangle})w_i\ne 0)} = \frac{k}{d}\left(\frac{\binom{k-1}{(k-1)/2}}{2^{k-1}}\right)^2 \stackrel{\text{Lemma \ref{lem:binom_assimptotics}}}{=} \Theta\left( \frac{1}{d}\right)\Pr\left(\left|\frac{1}{q\alpha_{d,k}}\left(W\Psi(\mathbf{x} )\right)_j-x_j\right|\ge\epsilon\right)\le 4e^{-\Omega\left(\frac{\epsilon^2q}{d}\right)}Kernels From Random Neurons and Proof of Lemma <a href="#lem:rf_main_simp" data-reference-type=“ref”
\begin{equation}
k_{\sigma,\beta,n}(\mathbf{x} ,\mathbf{y}) = \mathbb{E}[\sigma(\mathbf{w}^\top\mathbf{x} +b)\sigma(\mathbf{w}^\top\mathbf{y}+b)]\;,\;\;\;\; b\sim {\cal N}(0,\beta^2),\;\mathbf{w}\sim{\cal N}\left(0,\frac{1-\beta^2}{n}I_n\right)
\end{equation}\begin{equation}
\label{eq:hermite_expan_of_sigma}
\sigma = \sum_{s=0}^\infty a_sh_s
\end{equation}\begin{equation}
a_s(r) = \sum_{j=0}^{\infty} a_{s+2j}\sqrt{\frac{(s+2j)!}{s!}} \frac{(r^2-1)^j}{j! 2^j}
\end{equation}k_{\sigma,\beta,n}(\mathbf{x} ,\mathbf{y}) = \sum_{s=0}^\infty a_s\left(\sqrt{\frac{1-\beta^2}{n}\|\mathbf{x} \|^2 + \beta^2}\right)a_s\left(\sqrt{\frac{1-\beta^2}{n}\|\mathbf{y}\|^2 + \beta^2}\right)\left(\frac{1-\beta^2}{n}{\left\langle \mathbf{x} ,\mathbf{y} \right\rangle}+\beta^2\right)^s\Psi_{\sigma,\beta,n}(\mathbf{x} ) = \left(a_s\left(\sqrt{\frac{1-\beta^2}{n}\|\mathbf{x} \|^2 + \beta^2}\right)\cdot\begin{bmatrix}\sqrt{\frac{1-\beta^2}{n}}\mathbf{x} \\\beta\end{bmatrix}^{\otimes s}\right)_{s=0}^\infty
```*
</div>
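Before continuing, the kernel formula of Lemma 37 can be verified numerically. The sketch below uses the toy activation $`\sigma(t)=t^2`$ (not bounded, but boundedness is not needed for the algebraic identity), for which $`a_0(r)=r^2`$, $`a_2(r)=\sqrt{2}`$ and all other $`a_s(r)`$ vanish, so the series reduces to two terms.

``` python
import numpy as np

rng = np.random.default_rng(0)
n, beta = 3, 0.8
x = rng.uniform(-1.0, 1.0, size=n)
y = rng.uniform(-1.0, 1.0, size=n)

# Monte Carlo estimate of k_{sigma,beta,n}(x, y) = E[sigma(w.x + b) sigma(w.y + b)]
# with b ~ N(0, beta^2), w ~ N(0, (1 - beta^2)/n * I_n), for the toy activation sigma(t) = t^2.
N = 2_000_000
w = rng.standard_normal((N, n)) * np.sqrt((1.0 - beta ** 2) / n)
b = rng.standard_normal(N) * beta
mc = np.mean((w @ x + b) ** 2 * (w @ y + b) ** 2)

# Hermite-expansion formula: sigma = h_0 + sqrt(2) h_2, so a_0(r) = r^2, a_2(r) = sqrt(2),
# giving k(x, y) = r_x^2 r_y^2 + 2 * ((1 - beta^2)/n * <x, y> + beta^2)^2.
rx2 = (1.0 - beta ** 2) / n * (x @ x) + beta ** 2
ry2 = (1.0 - beta ** 2) / n * (y @ y) + beta ** 2
rho = (1.0 - beta ** 2) / n * (x @ y) + beta ** 2
print(f"Monte Carlo: {mc:.5f}   closed form: {rx2 * ry2 + 2.0 * rho ** 2:.5f}")
```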
To prove Lemma <a href="#lem:ker_formula" data-reference-type="ref"
data-reference="lem:ker_formula">37</a> we will use the following lemma.
<div id="lem:hermite_scaling" class="lemma">
**Lemma 38**. *We have
$`h_s(ax) = \sum_{j=0}^{\lfloor s/2 \rfloor} \sqrt{\frac{s!}{(s-2j)!}} \frac{a^{s-2j}(a^2-1)^j}{j! 2^j} h_{s-2j}(x)`$*
</div>
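Before the proof, the identity is easy to check numerically with NumPy's probabilists' Hermite polynomials, using the normalization $`h_s = \mathrm{He}_s/\sqrt{s!}`$ that matches the recursion in [eq:hermite_rec]; $`s=5`$ and $`a=1.7`$ are arbitrary choices.

``` python
import math
import numpy as np
from numpy.polynomial import hermite_e as He

def h(s, x):
    # Normalized probabilists' Hermite polynomial h_s = He_s / sqrt(s!).
    return He.hermeval(x, np.eye(s + 1)[s]) / math.sqrt(math.factorial(s))

s, a = 5, 1.7
xs = np.linspace(-2.0, 2.0, 7)
lhs = h(s, a * xs)
rhs = sum(
    math.sqrt(math.factorial(s) / math.factorial(s - 2 * j))
    * a ** (s - 2 * j) * (a ** 2 - 1.0) ** j / (math.factorial(j) * 2 ** j)
    * h(s - 2 * j, xs)
    for j in range(s // 2 + 1)
)
print("max deviation:", np.max(np.abs(lhs - rhs)))   # ~1e-13, i.e. zero up to rounding
```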
<div class="proof">
*Proof.* By formula
<a href="#eq:hermite_gen_fun" data-reference-type="eqref"
data-reference="eq:hermite_gen_fun">[eq:hermite_gen_fun]</a> we have
``` math
\begin{eqnarray*}
\sum_{s=0}^\infty \frac{h_s(ax)t^s}{\sqrt{s!}} &=& e^{xat - \frac{t^2}{2}} \\
&=& e^{xat - \frac{(at)^2}{2} + \frac{(at)^2}{2} - \frac{t^2}{2}}
\\
&\stackrel{\text{Eq. }\eqref{eq:hermite_gen_fun}}{=}& e^{\frac{(at)^2}{2} - \frac{t^2}{2}}\left( \sum_{s=0}^\infty \frac{h_s(x)a^st^s}{\sqrt{s!}}\right)
\\
&=& e^{(a^2-1)\frac{t^2}{2}}\left( \sum_{s=0}^\infty \frac{h_s(x)a^st^s}{\sqrt{s!}}\right)
\\
&=& \left( \sum_{s=0}^\infty \frac{(a^2-1)^s}{s!2^s} t^{2s}\right)\left( \sum_{s=0}^\infty \frac{h_s(x)a^s}{\sqrt{s!}}t^s\right)
\\
&=& \sum_{s=0}^\infty \left(\sum_{j=0}^{\left\lfloor \frac{s}{2}\right\rfloor}
\frac{(a^2-1)^j}{j!2^j}\frac{h_{s-2j}(x)a^{s-2j}}{\sqrt{(s-2j)!}}
\right)t^{s}
\end{eqnarray*}\frac{h_s(ax)}{\sqrt{s!}} = \sum_{j=0}^{\left\lfloor \frac{s}{2}\right\rfloor}
\frac{(a^2-1)^j}{j!2^j}\frac{a^{s-2j}}{\sqrt{(s-2j)!}}h_{s-2j}(x)\begin{eqnarray*}
\sigma(rx) &=& \sum_{s=0}^\infty h_s(rx)
\\
&=& \sum_{s=0}^\infty \left(\sum_{j=0}^{\infty} a_{s+2j}\sqrt{\frac{(s+2j)!}{s!}} \frac{(r^2-1)^j}{j! 2^j} \right)r^{s}h_s(x)
\\
&=& :\sum_{s=0}^\infty a_s(r)r^sh_s(x)
\end{eqnarray*}\begin{eqnarray*}
k_{\sigma,\beta,n}(\mathbf{x} ,\mathbf{y}) &=& \mathbb{E}\sigma(r_\mathbf{x} \tilde X) \sigma(r_\mathbf{y}\tilde Y)
\\
&=&\sum_{i=0}^\infty \sum_{j=0}^\infty a_i(r_\mathbf{x} )r^i_\mathbf{x} a_j(r_\mathbf{y})r^j_\mathbf{x} \mathbb{E}h_i(\tilde X)h_j(\tilde Y)
\\
&\stackrel{\text{Eq. }\eqref{eq:hermite_prod_exp}}{=}&\sum_{s=0}^\infty a_s(r_\mathbf{x} )r^s_\mathbf{x} a_s(r_\mathbf{y})r^s_\mathbf{y}\rho^s
\\
&=&\sum_{s=0}^\infty a_s(r_\mathbf{x} ) a_s(r_\mathbf{y}) \left(\frac{1-\beta^2}{n}{\left\langle \mathbf{x} ,\mathbf{y} \right\rangle}+\beta^2\right)^s
\end{eqnarray*}|a_s(r)-a_s(1)| \le \|\sigma\|2^{(s+2)/2} \frac{\epsilon}{\sqrt{1-2\epsilon^2}}
```*
</div>
<div class="proof">
*Proof.* We have
``` math
\begin{eqnarray*}
|a_s(r)-a_s(1)| &=& \left|\sum_{j=1}^{\infty} a_{s+2j}\sqrt{\frac{(s+2j)!}{s!}} \frac{(r^2-1)^j}{j! 2^j}\right|
\\
&\stackrel{\text{Cauchy-Schwartz and }\|\sigma\|=\sqrt{\sum_{i=0}^\infty a_i^2}}{\le}& \|\sigma\| \sqrt{\sum_{j=1}^{\infty} \frac{(s+2j)!}{s!} \frac{(r^2-1)^{2j}}{(j!)^2 2^{2j}} }
\\
&\stackrel{(2j)!\le (j!2^j)^2}{\le}& \|\sigma\| \sqrt{\sum_{j=1}^{\infty} \frac{(s+2j)!}{s!(2j)!} (r^2-1)^{2j} }
\\
&=& \|\sigma\| \sqrt{\sum_{j=1}^{\infty} \binom{s+2j}{s} (r^2-1)^{2j} }
\\
&\le& \|\sigma\| \sqrt{\sum_{j=1}^{\infty} 2^{s+2j} (r^2-1)^{2j} }
\\
&=& \|\sigma\|2^{s/2} \sqrt{\sum_{j=1}^{\infty} (2r^2-2)^{2j} }
\\
&=& \|\sigma\|2^{s/2}|2r^2-2| \frac{1}{\sqrt{1-(2r^2-2)^2}}
\end{eqnarray*}
A_{\gamma} = \begin{cases}\frac{1}{a_{K'}\beta^{K'-\|\alpha\|_1}}\left(\frac{n}{1-\beta^2}\right)^{\|\alpha\|_1/2}b_\alpha & \gamma=\tilde\alpha\text{ for some }\alpha \\ 0&\text{otherwise}\end{cases}g(\mathbf{x} ) = {\left\langle A,\Psi_{\sigma,\beta,n}(\mathbf{x} ) \right\rangle}\begin{eqnarray*}
|g(\mathbf{x} ) - p(\mathbf{x} )| &=& | p(\mathbf{x} )|\cdot\left|\frac{a_{K'}\left(\sqrt{\frac{1-\beta^2}{n}\|\mathbf{x} \|^2+\beta^2}\right)}{a_{K'}}-1\right|
\\
&=& \frac{| p(\mathbf{x} )|}{a_{K'}}\left|a_{K'}\left(\sqrt{\frac{1-\beta^2}{n}\|\mathbf{x} \|^2+\beta^2}\right)-a_{K'}\right|
\end{eqnarray*}\beta^2\le r^2\le 1 \Rightarrow \epsilon:= |1- r^2 |\le 1-\beta^2 < \frac{1}{2}|g(\mathbf{x} ) - p(\mathbf{x} )|\le \frac{| p(\mathbf{x} )|}{a_{K'}}\|\sigma\|2^{(K'+2)/2}\frac{1-\beta^2}{\sqrt{1-2(1-\beta^2)^2}}\forall\mathbf{x} \in {\cal X},\;\;\Pr\left(|{\left\langle \mathbf{w},\sigma(W\mathbf{x} +\mathbf{b}) \right\rangle}-p(\mathbf{x} )|\ge \epsilon + \frac{\|\sigma\|}{a_{K'}}2^{(K'+2)/2}\frac{1-\beta^2}{\sqrt{1-2(1-\beta^2)^2}}\right) \le \delta\delta = 2\exp\left(-{q}\cdot\frac{a^4_{K'}\beta^{4K'-4K}(1-\beta^2)^{2K}\epsilon^4}{32 n^{2K}\|p\|_{\mathrm{co}}^4\|\sigma\|_\infty^4}\right)\|\mathbf{w}\|\le \frac{2\|\sigma\|_\infty}{\epsilon\sqrt{{q}}}\cdot\frac{1}{a^2_{K'}\beta^{2K'-2K}}\left(\frac{n}{1-\beta^2}\right)^{K}\|p\|^2_{\mathrm{co}}
```*
</div>
We next specialize Lemma
<a href="#lem:rf_main" data-reference-type="ref"
data-reference="lem:rf_main">41</a> for the needs of our paper and
explain how it implies Lemma
<a href="#lem:rf_main_simp" data-reference-type="ref"
data-reference="lem:rf_main_simp">13</a>. Recall that for $`\epsilon>0`$
we defined $`\frac{3}{4}
\le\beta_{\sigma,K',K}(\epsilon)<1`$ as the minimal number such that if
$`\beta_{\sigma,K',K}(\epsilon)\le \beta<1`$ then
``` math
\frac{\|\sigma\|}{a_{K'}}2^{(K'+2)/2}\frac{1-\beta^2}{\sqrt{1-2(1-\beta^2)^2}}\le \frac{\epsilon}{2}\delta_{\sigma,K',K}(\epsilon,\beta,q,M,n) = \begin{cases}
1 & \frac{4\|\sigma\|_\infty}{\epsilon\sqrt{{q}}}\cdot\frac{1}{a^2_{K'}\beta^{2K'-2K}}\left(\frac{n}{1-\beta^2}\right)^{K}M^2 > 1
\\
2\exp\left(-{q}\cdot\frac{a^4_{K'}\beta^{4K'-4K}(1-\beta^2)^{2K}\epsilon^4}{512 n^{2K}M^4\|\sigma\|_\infty^4}\right) & \text{otherwise}
\end{cases}\forall\mathbf{x} \in {\cal X},\;\Pr\left(|{\left\langle \mathbf{w},\sigma(W\mathbf{x} +\mathbf{b}) \right\rangle}-p(\mathbf{x} )|\ge \epsilon \right) \le \delta_{\sigma,K',K}(\epsilon,\beta,q,\|p\|_\mathrm{co},n)
```*
</div>
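For concreteness, the case distinction defining $`\delta_{\sigma,K',K}`$ above transcribes directly into a small helper; $`a_{K'}`$ and $`\|\sigma\|_\infty`$ must be supplied for the particular activation, and the numbers in the example loop are arbitrary.

``` python
import numpy as np

def delta(eps, beta, q, M, n, K, K_prime, a_Kp, sigma_inf):
    """delta_{sigma,K',K}(eps, beta, q, M, n) exactly as in the case distinction above;
    a_Kp is the K'-th Hermite coefficient of sigma and sigma_inf its sup norm."""
    scale = (n / (1.0 - beta ** 2)) ** K * M ** 2 / (a_Kp ** 2 * beta ** (2 * K_prime - 2 * K))
    if 4.0 * sigma_inf / (eps * np.sqrt(q)) * scale > 1.0:
        return 1.0
    return 2.0 * np.exp(
        -q * a_Kp ** 4 * beta ** (4 * K_prime - 4 * K) * (1.0 - beta ** 2) ** (2 * K) * eps ** 4
        / (512.0 * n ** (2 * K) * M ** 4 * sigma_inf ** 4)
    )

# The bound is vacuous until q is large enough, then decays exponentially in q:
for q in [10 ** 6, 10 ** 9, 10 ** 12]:
    print(q, delta(eps=0.3, beta=0.9, q=q, M=1.0, n=2, K=1, K_prime=1, a_Kp=0.6, sigma_inf=1.0))
```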
<div class="proof">
*Proof.* Fix $`\mathbf{x} \in {\cal X}`$. By Lemma
<a href="#lem:rf_main" data-reference-type="ref"
data-reference="lem:rf_main">41</a> there is a vector
$`\mathbf{v}\in\mathbb{R}^q`$ such that
``` math
\begin{equation}
\label{eq:lem:rf_main_simp_1}
\Pr\left(|{\left\langle \mathbf{v},\sigma(W\mathbf{x} +\mathbf{b}) \right\rangle}-p(\mathbf{x} )|\ge \epsilon \right)\le
\Pr\left(|{\left\langle \mathbf{v},\sigma(W\mathbf{x} +\mathbf{b}) \right\rangle}-p(\mathbf{x} )|\ge \frac{\epsilon}{2} + \frac{\|\sigma\|}{a_{K'}}2^{(K'+2)/2}\frac{1-\beta^2}{\sqrt{1-2(1-\beta^2)^2}}\right)
\le \delta
\end{equation}
\begin{equation*}
\delta = 2\exp\left(-{q}\cdot\frac{a^4_{K'}\beta^{4K'-4K}(1-\beta^2)^{2K}\epsilon^4}{512 n^{2K}\|p\|_{\mathrm{co}}^4\|\sigma\|_\infty^4}\right)
\end{equation*}
\|\mathbf{v}\|\le \frac{4\|\sigma\|_\infty}{\epsilon\sqrt{{q}}}\cdot\frac{1}{a^2_{K'}\beta^{2K'-2K}}\left(\frac{n}{1-\beta^2}\right)^{K}\|p\|^2_{\mathrm{co}}
```

Acknowledgments