Deep Networks Learn Deep Hierarchical Models

Reading time: 8 minutes
...

šŸ“ Original Paper Info

- Title: Deep Networks Learn Deep Hierarchical Models
- ArXiv ID: 2601.00455
- Date: 2026-01-01
- Authors: Amit Daniely

šŸ“ Abstract

We consider supervised learning with $n$ labels and show that layerwise SGD on residual networks can efficiently learn a class of hierarchical models. This model class assumes the existence of an (unknown) label hierarchy $L_1 \subseteq L_2 \subseteq \dots \subseteq L_r = [n]$, where labels in $L_1$ are simple functions of the input, while for $i > 1$, labels in $L_i$ are simple functions of simpler labels. Our class surpasses models that were previously shown to be learnable by deep learning algorithms, in the sense that it reaches the depth limit of efficient learnability. That is, there are models in this class that require polynomial depth to express, whereas previous models can be computed by log-depth circuits. Furthermore, we suggest that learnability of such hierarchical models might eventually form a basis for understanding deep learning. Beyond their natural fit for domains where deep learning excels, we argue that the mere existence of human ``teachers" supports the hypothesis that hierarchical structures are inherently available. By providing granular labels, teachers effectively reveal ``hints'' or ``snippets'' of the internal algorithms used by the brain. We formalize this intuition, showing that in a simplified model where a teacher is partially aware of their internal logic, a hierarchical structure emerges that facilitates efficient learnability.

šŸ’” Summary & Analysis

1. **Proving Learnability of Hierarchical Models**: The paper demonstrates that deep learning algorithms, especially residual networks, can learn hierarchical models efficiently. This insight helps explain the success of deep learning.
2. **Brain-Like Hierarchical Structure**: Similar to how the human brain processes information hierarchically, this approach shows how complex concepts are learned step-by-step in fields like computer vision and natural language processing.
3. **Utilizing Internet Data for Learning**: The vast amount of labeled data available on the internet provides 'hints' that deep learning models can use as intermediate steps to learn more complex concepts.

šŸ“„ Full Paper Content (ArXiv Source)

# Introduction

A central objective in deep learning theory is to demonstrate that gradient-based algorithms can efficiently learn a class of models sufficiently rich to capture reality. This effort began over a decade ago, coinciding with the undeniable empirical success of deep learning. Initial theoretical results demonstrated that deep learning algorithms can learn linear models, followed later by proofs for simple non-linear models.

This progress is remarkable, especially considering that until recently, no models were known to be provably learnable by deep learning algorithms. Moreover, the field was previously dominated by hardness results indicating severe limitations on the capabilities of neural networks. However, despite this progress, learning linear or simple non-linear models is insufficient to explain the practical success of deep learning.

In this paper, we advance this research effort by showing that deep learning algorithms—specifically layerwise SGD on residual networks—provably learn hierarchical models. We consider a supervised learning setting with $`n`$ possible labels, where each example is associated with a subset of these labels. Let $`\mathbf{f}^* \colon \mathcal{X} \to \{\pm 1\}^n`$ be the ground truth labeling function. We assume an unknown hierarchy of labels $`L_1 \subseteq L_2 \subseteq \dots \subseteq L_r = [n]`$ such that labels in $`L_1`$ are simple functions (specifically, polynomial thresholds) of the input, while for $`i > 1`$, any label in $`L_i`$ is a simple function of simpler labels (i.e., those in $`L_{i-1}`$).

We suggest that the learnability of hierarchical models offers a compelling basis for understanding deep learning. First, hierarchical models are natural in domains where neural networks excel. In computer vision, for instance, a first-level label might be "this pixel is red" (i.e. the input itself); a second-level label might be "curved line" or "dark region"; and a third-level label might be "leaf" or "rectangle", and so on. Similar hierarchies exist in text and speech processing. Indeed, this hierarchical structure motivated the development of successful architectures such as convolutional and residual networks.

Second, one might even argue further that the mere existence of human "teachers" supports the hypothesis that hierarchical labeling exists and can be supplied to the algorithm. Consider the classic problem of recognizing a car in an image. Early AI approaches (circa 1970s–80s) failed because they attempted to manually codify the cognitive algorithms used by the human brain. This was superseded by machine learning, which approximates functions based on input-output pairs. While this data-driven approach has surpassed human performance, the standard narrative of its success might be somewhat misleading.

We suggest that recent breakthroughs are not solely due to "learning from scratch", but also because models are trained on datasets containing a vast number of granular labels. These labels represent a middle ground between explicit programming and pure input-output learning; they serve as "hints" or intermediate steps for learning complex concepts. Although we lack full access to the brain's internal algorithm, we can provide "snippets" of its logic. By identifying lower-level features—such as windows, wheels, or geometric shapes—we effectively decompose the task into a hierarchy.

At a larger scale, we can consider the following perspective for the creation of LLMs. From the 1990s to the present, humanity created the internet (websites, forums, images, videos, etc.). As a byproduct, humanity implicitly provided an extensive number of labels and examples. Because these labels are so numerous—ranging from the very simple to the very complex—they are likely to possess a hierarchical structure. Following the creation of the internet, huge models were trained on these examples, succeeding largely as a result of this structure (alongside, of course, the extensive data volume and compute power). In a sense, the evolution of the internet and modern LLMs can be viewed as an enormous collective effort to create a circuit that mimics the human brain, in the sense that all labels of interest are effectively a composition of this circuit and a simple function.

We present a simplified formalization of this intuition. We model the human brain as a computational circuit, where each label (representing a "brain snippet") corresponds to a majority vote over a subset of the brain's neurons. To formalize the postulate that these labels are both granular and diverse, we assume that the specific collections of neurons defining each label are chosen at random prior to the learning process. We demonstrate that this setting yields a hierarchical structure that facilitates efficient learnability by residual networks. Crucially, neither the residual network architecture nor the training algorithm relies on knowledge of this underlying label hierarchy.

Finally, we note that hierarchical models surpass previous classes of models shown to be learnable by SGD. To the best of our knowledge, prior results were limited to models that can be realized by log-depth circuits. In contrast, hierarchical models reach the depth limit of efficient learnability. For any polynomial-sized circuit, we can construct a corresponding hierarchical model learnable by SGD on a ResNet, effectively computing the circuit as one of its labels.

Linear, or fixed-representation, models are defined by a fixed (usually non-linear) feature mapping followed by a learned linear mapping. This includes kernel methods, random features, and others. Several papers in the last decade have shown that neural networks can provably learn various linear models. Several works consider model classes which go beyond fixed representations, but can still be efficiently learned by gradient-based methods on neural networks. One line of work shows learnability of parities under non-uniform distributions, or of other models directly expressible by neural networks of depth two. Closer to our approach are works that consider certain hierarchical models. As mentioned above, we believe that our work is another step towards models that can capture reality. From a more formal perspective, we improve over previous work in the sense that the models we consider can be arbitrarily deep. In contrast, all the mentioned papers consider models that can be realized by networks of logarithmic depth. In fact, with the exception of the work that considers compositions of permutations, depth two suffices to express all the above-mentioned models.

Another line of related work argues that deep learning is successful due to hierarchical structure. This series of papers gives an example of a hierarchical model that is efficiently learnable, but that is conjectured to require a deep architecture to express. Additional attempts have been made to argue that hierarchy is essential for deep learning.

Notation and Preliminaries

We denote vectors using bold letters (e.g., $`\mathbf{x} ,\mathbf{y},\mathbf{z},\mathbf{w},\mathbf{v}`$) and their coordinates using standard letters. For instance, $`x_i`$ denotes the $`i`$-th coordinate of $`\mathbf{x}`$. Likewise, we denote vector-valued functions and polynomials (i.e., those whose range is $`\mathbb{R}^d`$) using bold letters (e.g., $`\mathbf{f},\mathbf{g},\mathbf{h},\mathbf{p},\mathbf{q},\mathbf{r}`$), and their $`i`$-th coordinate using standard letters. We will freely use broadcasting operations. For instance, if $`\vec\mathbf{x} = (\mathbf{x} _1,\ldots,\mathbf{x} _n)`$ is a sequence of $`n`$ vectors in $`\mathbb{R}^d`$ and $`g`$ is a function from $`\mathbb{R}^d`$ to some set $`Y`$, then $`g(\vec\mathbf{x} )`$ denotes the sequence $`(g(\mathbf{x} _1),\ldots,g(\mathbf{x} _n))`$. Similarly, for a matrix $`A\in M_{q,d}`$, we denote $`A\vec\mathbf{x} = (A\mathbf{x} _1,\ldots,A\mathbf{x} _n)`$.

For a polynomial $`p:\mathbb{R}^n\to\mathbb{R}`$, we denote by $`\|p\|_{\mathrm{co}}`$ the Euclidean norm of the coefficient vector of $`p`$. We call $`\|p\|_{\mathrm{co}}`$ the coefficient norm of $`p`$. For $`\sigma:\mathbb{R}\to\mathbb{R}`$, we denote by $`\|\sigma\| = \sqrt{\mathbb{E}_{X\sim{\cal N}(0,1)}[\sigma^2(X)]}`$ the $`\ell^2`$ norm with respect to the standard Gaussian measure. We denote the Frobenius norm of a matrix $`A\in M_{n,m}`$ by $`\|A\|_F = \sqrt{\sum_{i,j}A^2_{ij}}`$, and the spectral norm by $`\|A\| = \max_{\|\mathbf{x} \|=1}\|A\mathbf{x} \|`$.

We denote by $`\mathbb{R}^{d,n}`$ the space of sequences of $`n`$ vectors in $`\mathbb{R}^d`$. More generally, for a set $`G`$, we let $`\mathbb{R}^{d,G} = \{\vec\mathbf{x} = (\mathbf{x} _g)_{g\in G} : \forall g\in G,\; \mathbf{x} _g\in\mathbb{R}^d\}`$. We denote the Euclidean unit ball by $`\mathbb{B}^d = \{\mathbf{x} \in\mathbb{R}^d : \|\mathbf{x} \|\le 1\}`$. We denote the point-wise (Hadamard) multiplication of vectors and matrices by $`\odot`$ and the concatenation of vectors by $`(\mathbf{x} |\mathbf{y})`$. For $`\mathbf{x} \in\mathbb{R}^n`$, $`A\subseteq [n]`$, and $`\sigma\in \mathbb{Z}^n`$, we use the multi-index notation $`\mathbf{x} ^{A} = \prod_{i\in A} x_i`$ and $`\mathbf{x} ^{\sigma} = \prod_{i=1}^n x_i^{\sigma_i}`$. For $`\mathbf{f}:{\cal X}\to \mathbb{R}^n`$ and $`L\subseteq[n]`$, we denote by $`\mathbf{f}_L:{\cal X}\to \mathbb{R}^{|L|}`$ the restriction $`\mathbf{f}_L=(f_{i_1},\ldots,f_{i_k})`$, where $`L=\{i_1,\ldots,i_k\}`$ with $`i_1<\ldots<i_k`$.

Polynomial Threshold Functions

Fix a set $`{\cal X}\subseteq[-1,1]^d`$, a function $`f:{\cal X}\to \{\pm 1\}`$, a positive integer $`K`$, and $`M>0`$. We say that $`f`$ is a $`(K,M)`$-PTF if there is a degree $`\le K`$ polynomial $`p:\mathbb{R}^d\to\mathbb{R}`$ such that $`\|p\|_\mathrm{co}\le M`$ and $`\forall \mathbf{x} \in {\cal X},\;\;p(\mathbf{x} )f(\mathbf{x} )\ge 1`$. More generally, we say that $`f`$ is a $`(K,M)`$-PTF of $`\mathbf{h}:{\cal X}\to\mathbb{R}^s`$ if there is a degree $`\le K`$ polynomial $`p:\mathbb{R}^s\to\mathbb{R}`$ such that $`\|p\|_\mathrm{co}\le M`$ and $`\forall \mathbf{x} \in {\cal X},\;\;p(\mathbf{h}(\mathbf{x} ))f(\mathbf{x} )\ge 1`$. An example of a $`(K,1)`$-PTF that we will use frequently is a function $`f:\{\pm 1\}^d\to\{\pm 1\}`$ that depends on $`K`$ variables. Indeed, Fourier analysis on $`\{\pm 1\}^d`$ tells us that $`f`$ is a restriction of a degree $`\le K`$ polynomial $`p`$ with $`\|p\|_\mathrm{co}=1`$. For this polynomial we have $`\forall \mathbf{x} \in {\cal X},\;\;p(\mathbf{x} )f(\mathbf{x} )= 1`$.
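To make the definition concrete, here is a minimal numeric check of the $`(K,M)`$-PTF condition, using the $`K`$-junta example above. The choices of $`d`$, $`K`$, $`f`$, and the witness polynomial $`p`$ are illustrative only.

```python
# A minimal check that a K-junta f on {-1,+1}^d is a (K, 1)-PTF, witnessed by a
# single degree-K monomial. All concrete choices below are illustrative.
import itertools
import numpy as np

d, K, M = 5, 2, 1.0

def f(x):
    # f depends only on the first K coordinates: a parity, i.e. a K-junta.
    return int(np.prod(x[:K]))

def p(x):
    # Witness polynomial: the degree-K monomial x_1 * ... * x_K.
    # Its coefficient vector has a single nonzero entry 1, so ||p||_co = 1 <= M.
    return float(np.prod(x[:K]))

ok = all(p(np.array(x)) * f(np.array(x)) >= 1
         for x in itertools.product([-1, 1], repeat=d))
print("(K, M)-PTF condition p(x) f(x) >= 1 holds on all of {-1,+1}^d:", ok)
```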

We will also need a more refined definition of PTFs, which allows us to require the two-sided inequality $`B\ge p(\mathbf{x} )f(\mathbf{x} )\ge 1`$, as well as some robustness to perturbations of $`\mathbf{x}`$. To this end, for $`\mathbf{x} \in [-1,1]^d`$ and $`r>0`$ we define

MATH
\begin{equation}
\label{eq:truncated_ball}
    {\cal B}_r(\mathbf{x} ) = \left\{\tilde \mathbf{x} \in [-1,1]^d : \|\mathbf{x} -\tilde\mathbf{x} \|_\infty\le r \right\}
\end{equation}

Fix $`B\ge 1`$ and $`1\ge \xi>0`$. We say that $`f`$ is a $`(K,M,B,\xi)`$-PTF if there is a degree $`\le K`$ polynomial $`p:\mathbb{R}^d\to\mathbb{R}`$ such that $`\|p\|_\mathrm{co}\le M`$ and

MATH
\forall \mathbf{x} \in {\cal X}\;\forall \tilde \mathbf{x} \in {\cal B}_{\xi}(\mathbf{x} ),\;\;B\ge p(\tilde\mathbf{x} )f(\mathbf{x} )\ge 1

Likewise, we say that $`f`$ is a $`(K,M,B,\xi)`$-PTF of $`\mathbf{h}=(h_1,\ldots,h_s):{\cal X}\to[-1,1]`$ if there is a degree $`\le K`$ polynomial $`p:\mathbb{R}^s\to\mathbb{R}`$ such that $`\|p\|_\mathrm{co}\le M`$ and

MATH
\forall \mathbf{x} \in {\cal X}\;\forall \mathbf{y}\in {\cal B}_\xi(\mathbf{h}(\mathbf{x} )),\;\;B\ge p(\mathbf{y})f(\mathbf{x} )\ge 1

Finally, we say that $`f`$ is a $`(K,M,B)`$-PTF (resp. $`(K,M,B)`$-PTF of $`\mathbf{h}`$) if it is a $`(K,M,B,1)`$-PTF (resp. $`(K,M,B,1)`$-PTF of $`\mathbf{h}`$).

Strong Convexity

Let $`W\subseteq\mathbb{R}^d`$ be convex. We say that a differentiable $`f:W\to \mathbb{R}`$ is $`\lambda`$-strongly-convex if for any $`\mathbf{x} ,\mathbf{y}\in W`$ we have

MATH
f(\mathbf{y}) \ge f(\mathbf{x} ) + {\left\langle \mathbf{y}-\mathbf{x} ,\nabla f(\mathbf{x} ) \right\rangle} + \frac{\lambda}{2}\|\mathbf{y}-\mathbf{x} \|^2

We note that if $`f`$ is $`\lambda`$-strongly-convex and $`\|\nabla f(\mathbf{x} )\|\le \epsilon`$ for some $`\mathbf{x} \in W`$, then $`\mathbf{x}`$ minimizes $`f`$ up to an additive error of $`\frac{\epsilon^2}{2\lambda}`$. Indeed, for any $`\mathbf{y}\in W`$ we have

MATH
\begin{eqnarray}
\label{eq:strongly_conv_guarantee}
f(\mathbf{x} ) &\le & f(\mathbf{y}) - \frac{\lambda}{2}\|\mathbf{y}-\mathbf{x} \|^2 + \|\mathbf{y}-\mathbf{x} \| \cdot \|\nabla f(\mathbf{x} )\|\nonumber
\\
&=& f(\mathbf{y})+\frac{\|\nabla f(\mathbf{x} )\|^2}{2\lambda} - \frac{1}{2\lambda}\left(\|\nabla f(\mathbf{x} )\|-\lambda\|\mathbf{y}-\mathbf{x} \|\right)^2
\\
&\le & f(\mathbf{y})+ \frac{\|\nabla f(\mathbf{x} )\|^2}{2\lambda}\nonumber
\\
&\le & f(\mathbf{y})+ \frac{\epsilon^2}{2\lambda}\nonumber
\end{eqnarray}
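As a quick sanity check of this guarantee, the following sketch compares the suboptimality of a small-gradient point with the bound $`\frac{\epsilon^2}{2\lambda}`$ on an illustrative strongly convex quadratic; the matrix and the query point are arbitrary choices, not objects from the paper.

```python
# Numeric sanity check: small gradient at x  =>  f(x) <= min f + eps^2 / (2*lambda),
# on an illustrative lambda-strongly-convex quadratic.
import numpy as np

rng = np.random.default_rng(0)
lam = 0.5                                       # strong convexity parameter lambda
A = np.diag(rng.uniform(lam, 3.0, size=4))      # Hessian with eigenvalues >= lambda
x_star = rng.normal(size=4)                     # the exact minimizer

def f(x):
    return 0.5 * (x - x_star) @ A @ (x - x_star)

def grad(x):
    return A @ (x - x_star)

x = x_star + 1e-3 * rng.normal(size=4)          # a point with a small gradient
eps = np.linalg.norm(grad(x))
print("suboptimality     :", f(x) - f(x_star))
print("eps^2 / (2*lambda):", eps ** 2 / (2 * lam))
assert f(x) - f(x_star) <= eps ** 2 / (2 * lam) + 1e-12
```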

Hermite Polynomials

The results we state next are standard. The Hermite polynomials $`h_0,h_1,h_2,\ldots`$ are the sequence of orthonormal polynomials corresponding to the standard Gaussian measure $`\mu`$ on $`\mathbb{R}`$. That is, they are the sequence of orthonormal polynomials obtained by the Gram-Schmidt process of $`1,x,x^2,x^3,\ldots \in L^2(\mu)`$. The Hermite polynomials satisfy the following recurrence relation

MATH
\begin{equation}
\label{eq:hermite_rec}
    xh_{n}(x) = \sqrt{n+1}h_{n+1}(x) + \sqrt{n}h_{n-1}(x)\;\;,\;\;\;\;\;h_0(x)=1,\;h_1(x)=x
\end{equation}

or equivalently

MATH
h_{n+1}(x) = \frac{x}{\sqrt{n+1}}h_{n}(x) - \sqrt{\frac{n}{n+1}}h_{n-1}(x)

The generating function of the Hermite polynomials is

MATH
\begin{equation}
\label{eq:hermite_gen_fun}
    e^{xt - \frac{t^2}{2}} = \sum_{n=0}^\infty \frac{h_n(x)t^n}{\sqrt{n!}}
\end{equation}

We also have

MATH
\begin{equation}
\label{eq:hermite_derivative}
    h_n' = \sqrt{n}h_{n-1}
\end{equation}

Likewise, if $`X,Y\sim{\cal N}\left(0,\begin{pmatrix}1&\rho\\\rho&1\end{pmatrix}\right)`$

MATH
\begin{equation}
\label{eq:hermite_prod_exp}
    \mathbb{E}h_i(X)h_j(Y) = \delta_{ij}\rho^{i}
\end{equation}
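The identities above are easy to probe numerically. The sketch below builds the orthonormal Hermite polynomials from the recurrence and estimates $`\mathbb{E}h_i(X)h_j(Y)`$ by Monte Carlo for correlated standard Gaussians; the sample size and the value of $`\rho`$ are illustrative.

```python
# Monte Carlo illustration of the Hermite recurrence and of E[h_i(X) h_j(Y)] = delta_ij * rho^i.
import numpy as np

def hermite(n, x):
    # h_0 = 1, h_1 = x, and h_{m+1}(x) = (x*h_m(x) - sqrt(m)*h_{m-1}(x)) / sqrt(m+1).
    if n == 0:
        return np.ones_like(x)
    h_prev, h = np.ones_like(x), x
    for m in range(1, n):
        h_prev, h = h, (x * h - np.sqrt(m) * h_prev) / np.sqrt(m + 1)
    return h

rng = np.random.default_rng(0)
rho, N = 0.7, 1_000_000
X = rng.standard_normal(N)
Y = rho * X + np.sqrt(1 - rho ** 2) * rng.standard_normal(N)   # corr(X, Y) = rho

for i in range(4):
    for j in range(4):
        est = np.mean(hermite(i, X) * hermite(j, Y))
        target = rho ** i if i == j else 0.0
        print(f"E[h_{i}(X) h_{j}(Y)] ~ {est:+.3f}   (target {target:+.3f})")
```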

The Hierarchical Model

Let $`{\cal X}\subseteq [-1,1]^d`$ be our instance space. We consider the multi-label setting, in which each instance can have anything between $`0`$ and $`n`$ positive labels, and each training example comes with a list of all its positive labels. Hence, our goal is to learn the labeling function $`\mathbf{f}^*:{\cal X}\to \{\pm 1\}^n`$ based on a sample

MATH
S= \{(\mathbf{x} ^1,\mathbf{f}^*(\mathbf{x} ^1)),\ldots,(\mathbf{x} ^m,\mathbf{f}^*(\mathbf{x} ^m))\} \in \left({\cal X}\times \{\pm 1\}^{n}\right)^m

of i.i.d. labeled examples that come from a distribution $`{\cal D}`$ on $`{\cal X}`$. Specifically, our goal is to find a predictor $`\hat \mathbf{f}:{\cal X}\to\mathbb{R}^n`$ whose error, $`\mathrm{Err}_{\cal D}(\hat{\mathbf{f}})=\Pr_{\mathbf{x} \sim{\cal D}}\left(\mathrm{sign}(\hat{\mathbf{f}}(\mathbf{x} ))\ne \mathbf{f}^*(\mathbf{x} ) \right)`$, is small. We assume that there is a hierarchy of labels (unknown to the algorithm), with the convention that

  • The first level of the hierarchy consists of labels which are simple ($`=`$ easy to learn) functions of the input. Specifically, each such label is a polynomial threshold function (PTF) of the input.

  • Any label in the $`i`$’th level of the hierarchy is a simple function (again, a PTF) of labels from lower levels of the hierarchy.

We next give the formal definition of hierarchy.

Definition 1 (hierarchy). Let $`{\cal L}= \{L_1,\ldots,L_r\}`$ be a collection of sets such that $`L_1\subseteq L_2\subset\ldots\subseteq L_r = [n]`$. We say that $`{\cal L}`$ is a hierarchy for $`\mathbf{f}^*:{\cal X}\to \{\pm 1\}^n`$ of complexity $`(r,K,M)`$ (or $`(r,K,M)`$-hierarchy for short) if for any $`j\in L_1`$ the function $`f^*_j`$ is a $`(K,M)`$-PTF and for $`i\ge 2`$, and $`j\in L_i`$ we have that $`\mathbf{f}^*_j = \tilde f_j\circ \mathbf{f}^*_{L_{i-1}}`$ for a $`(K,M)`$-PTF $`\tilde f_j:\{\pm 1\}^{|L_{i-1}|}\to\{\pm 1\}`$.

Example 2. Fix $`{\cal L}= \{L_1,\ldots,L_r\}`$ as in Definition 1, and recall that a boolean function that depends on $`K`$ coordinates is a $`(K,1)`$-PTF. Hence, if for any $`i\ge 2`$, any label $`j\in L_i`$ depends on at most $`K`$ labels from $`L_{i-1}`$, and any label $`j\in L_1`$ is a $`(K,1)`$-PTF of the input, then $`{\cal L}`$ is an $`(r,K,1)`$-hierarchy.
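The toy sketch below instantiates Example 2 with two levels: level-1 labels are parities of $`K`$ input coordinates (hence $`(K,1)`$-PTFs of the input), and the remaining labels are parities of $`K`$ level-1 labels. All sizes and index choices are illustrative and not taken from the paper.

```python
# A toy two-level hierarchy: L_1 labels are K-juntas of the input, and the
# labels in L_2 \ L_1 are K-juntas of the L_1 labels. Choices are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d, K = 10, 2
L1 = [0, 1, 2]           # indices of first-level labels
L2 = [0, 1, 2, 3, 4]     # L1 is a subset of L2 = [n], with n = 5 labels in total

level1_coords = {j: rng.choice(d, size=K, replace=False) for j in L1}
level2_coords = {j: rng.choice(L1, size=K, replace=False) for j in L2 if j not in L1}

def f_star(x):
    y = np.zeros(len(L2), dtype=int)
    for j in L1:                              # parity of K input coordinates
        y[j] = np.prod(x[level1_coords[j]])
    for j, coords in level2_coords.items():   # parity of K level-1 labels
        y[j] = np.prod(y[coords])
    return y

x = rng.choice([-1, 1], size=d)
print("labels f*(x):", f_star(x))
```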

Assuming that $`K`$ is constant, our main result will show that given $`\mathrm{poly}(n,d,M,1/\epsilon)`$ samples, a poly-time SGD algorithm on a residual network of size $`\mathrm{poly}(n,d,M,1/\epsilon)`$ can learn any function $`\mathbf{f}^*:{\cal X}\to\{\pm 1\}^n`$ with error of $`\epsilon`$, provided that $`\mathbf{f}^*`$ has a hierarchy of complexity $`(r,K,M)`$ (the algorithm and the network do not depend on the hierarchy, but just on $`r,K,M`$).

One of the steps in the proof of this result is to show that any $`(K,M)`$-PTF on a subset of $`[-1,1]^n`$ is necessarily a $`(K,2M,B,\xi)`$-PTF for $`\xi = \frac{1}{2(n+1)^{\frac{K+1}{2}}KM}`$ and $`B=2(\max(n,d)+1)^{K/2}M`$ (see Lemma 29). This is enough for establishing our main result as informally described above. Yet, in some cases of interest, we can have a much larger $`\xi`$ and a smaller $`B`$. In this case, we can guarantee learnability with a smaller network, fewer samples, and less runtime. Hence, we next refine the definition of hierarchy by adding $`B`$ and $`\xi`$ as parameters.

Definition 3 (hierarchy). Let $`{\cal L}= \{L_1,\ldots,L_r\}`$ be a collection of sets such that $`L_1\subseteq L_2\subset\ldots\subseteq L_r = [n]`$. We say that $`{\cal L}`$ is a hierarchy for $`\mathbf{f}^*:{\cal X}\to \{\pm 1\}^n`$ of complexity $`(r,K,M,B,\xi)`$ (or $`(r,K,M,B,\xi)`$-hierarchy for short) if for any $`j\in L_1`$ the function $`f^*_j`$ is a $`(K,M,B)`$-PTF and for $`i\ge 2`$, and $`j\in L_i`$ we have that $`\mathbf{f}^*_j = \tilde f_j\circ \mathbf{f}^*_{L_{i-1}}`$ for a $`(K,M,B,\xi)`$-PTF $`\tilde f_j:\{\pm 1\}^{|L_{i-1}|}\to\{\pm 1\}`$.

The "Brain Dump" Hierarchy

Fix a domain $`{\cal X}\subseteq\{\pm 1\}^d`$ and a sequence of functions $`G^i:\{\pm 1\}^d\to\{\pm 1\}^d`$ for $`1\le i\le r`$. We assume that $`G^0(\mathbf{x} ) = \mathbf{x}`$, and for any depth $`i\in [r]`$ and coordinate $`j\in [d]`$, we have

MATH
\begin{equation*}
    \forall \mathbf{x} \in{\cal X}, \quad G^i_j(\mathbf{x} ) = h^i_j(G^{i-1}(\mathbf{x} )),
\end{equation*}

where $`h^i_j:\{\pm 1\}^d\to\{\pm 1\}`$ is a function that depends on $`K`$ coordinates. We view the sequence $`G^1, \ldots, G^r`$ as a computation circuit, or a model of a "brain."

Suppose we wish to learn a function of the form $`f^* = h\circ G^r`$, where $`h:\{\pm 1\}^d\to\{\pm 1\}`$ also depends only on $`K`$ inputs, given access to labeled samples $`(\mathbf{x} ,f^*(\mathbf{x} ))`$. The function $`f^*`$ can be extremely complex. For instance, $`G`$ could compute a cryptographic function. In such cases, learning $`f^*`$ solely from labeled examples $`(\mathbf{x} ,f^*(\mathbf{x} ))`$ is likely intractable; if our access to $`f^*`$ is restricted to the black-box scenario described above, the task appears impossible. On the other extreme, if we had complete white-box access to $`f^*`$—meaning a full description of the circuit $`G`$—the learning problem would become trivial. However, if $`G`$ truly models a human brain, such transparent access is unrealistic.

Consider a middle ground between these black-box and white-box scenarios. Assume we can query the labeler (the human whose brain is modeled by $`G`$) for additional information. For instance, if $`f^*`$ is a function that recognizes cars in an image, we can ask the labeler not only whether the image contains a car, but also to identify specific features: wheels, windows, dark areas, curves, and whatever he thinks is relevant. Each of these additional labels represents another simple function computed over the circuit $`G`$. We model these auxiliary labels as random majorities of randomly chosen $`G^i_j`$’s. We show that with enough such labels, the resulting problem admits a low-complexity hierarchy and is therefore efficiently learnable.

Formally, fix an integer $`q`$. We assume that for every depth $`i\in [r]`$, there are $`q`$ auxiliary labels $`f^*_{i,j}`$ for $`1\le j\le q`$, each of which is a signed Majority of an odd number of components of $`G^i`$. Moreover, we assume these functions are random. Specifically, prior to learning, the labeler independently samples $`qr`$ functions such that for any $`i\in [r]`$ and $`j\in [q]`$,

MATH
\begin{equation*}
    f^*_{i,j}(\mathbf{x} ) = \mathrm{sign}\left(\sum_{l=1}^d w_l^{i,j}G^i_l(\mathbf{x} )\right),
\end{equation*}

where the weight vectors $`\mathbf{w}^{i,j}\in \mathbb{R}^{d}`$ are independent uniform vectors chosen from

MATH
{\cal W}_{d,k} := \left\{\mathbf{w}\in \{-1,0,1\}^d : \sum_{l=1}^d |w_l|= k \right\}

for some odd integer $`k`$.
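The sketch below mirrors this construction: a depth-$`r`$ circuit of $`K`$-juntas, together with $`q`$ random signed majorities per depth whose weight vectors are drawn uniformly from $`{\cal W}_{d,k}`$. The concrete sizes are illustrative.

```python
# "Brain dump" labels: a depth-r circuit of K-juntas G^1, ..., G^r, plus q
# auxiliary labels per depth, each a signed majority over k random coordinates
# of G^i (k odd, so the majority never ties). Sizes below are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d, r, K, k, q = 20, 3, 3, 5, 4

# G^i_j computes a parity (a K-junta) of K coordinates of G^{i-1}.
junta = [[rng.choice(d, size=K, replace=False) for _ in range(d)] for _ in range(r)]

def circuit_levels(x):
    levels, g = [], x
    for i in range(r):
        g = np.array([np.prod(g[junta[i][j]]) for j in range(d)])
        levels.append(g)
    return levels

def sample_w():
    # Uniform over W_{d,k}: exactly k nonzero entries, each equal to +1 or -1.
    w = np.zeros(d, dtype=int)
    support = rng.choice(d, size=k, replace=False)
    w[support] = rng.choice([-1, 1], size=k)
    return w

W = [[sample_w() for _ in range(q)] for _ in range(r)]

def auxiliary_labels(x):
    levels = circuit_levels(x)
    return np.array([[int(np.sign(W[i][j] @ levels[i])) for j in range(q)]
                     for i in range(r)])

x = rng.choice([-1, 1], size=d)
print(auxiliary_labels(x))          # an r x q array of +/-1 auxiliary labels
```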

Theorem 4. If $`q=\tilde\omega\left(k^2d\log(|{\cal X}|)\right)`$ then $`\mathbf{f}^*`$ has an $`\left(r,K,O\left(kd^{K}\right),2k+1\right)`$-hierarchy w.p. $`1-o(1)`$.

Extension to Sequential and Ensemble Models

We next extend the notion of hierarchy to the common setting in which the input and the output of the learned function is an ensemble of vectors. Let $`G`$ be some set. We will refer to elements in $`G`$ as locations. In the context of images a natural choice would be $`G=[T_1]\times [T_2]`$, where $`T_1\times T_2`$ is the maximal size of an input image. In the context of language a natural choice would be $`G=[T]`$, where $`T`$ is the maximal number of tokens in the input. We denote by $`\vec\mathbf{x} =(\mathbf{x} _g)_{g\in G}`$ an ensemble of vectors and let $`\mathbb{R}^{d,G} = \{\vec\mathbf{x} = (\mathbf{x} _g)_{g\in G} : \forall g\in G,\; \mathbf{x} _g\in\mathbb{R}^d\}`$.

Fix $`{\cal X}\subseteq [-1,1]^d`$ and let $`{\cal X}^{G}`$ be our instance space. Assume that there are $`n`$ labels. We consider the setting in which each instance at each location can have anything between $`0`$ and $`n`$ positive labels. In light of that, our goal is to learn the labeling function $`\mathbf{f}^*:{\cal X}^G\to \{\pm 1\}^{n,G}`$ based on a sample

MATH
S= \{(\vec\mathbf{x} ^1,\mathbf{f}^*(\vec\mathbf{x} ^1)),\ldots,(\vec\mathbf{x} ^m,\mathbf{f}^*(\vec\mathbf{x} ^m))\} \in \left({\cal X}^G\times \{\pm 1\}^{n,G}\right)^m

of i.i.d. labeled examples coming from a distribution $`{\cal D}`$ on $`{\cal X}^G`$. We assume that there is a hierarchy of labels (unknown to the algorithm), with the convention that

  • The first level of the hierarchy consists of labels which are simple ($`=`$ easy to learn) functions of the input. Specifically, each such label at location $`g`$ is a PTF of the input near $`g`$.

  • Any label in the $`i`$’th level of the hierarchy is a simple function of labels from lower levels. Specifically, each such label at location $`g`$ is a PTF of lower level labels, at locations near $`g`$.

We will capture the notion of proximity of locations in $`G`$ via a proximity mapping, which designates $`w`$ nearby locations to any element $`g\in G`$. We will always consider $`g`$ itself as a point near $`g`$. This is captured in the following definition

Definition 5 (proximity mapping). A proximity mapping of width $`w`$ is a mapping $`\mathbf{e}=(e_1,\ldots,e_w):G\to G^{w}`$ such that $`e_1(g)=g`$ for any $`g`$.

For instance, if $`G=[T]`$, it is natural to choose $`\mathbf{e}:G\to G^{2w+1}`$ such that $`\{e_1(g),\ldots,e_{2w+1}(g)\} = \{g'\in T : |g'-g|\le w\}`$. Likewise, if $`G=[T]\times [T]`$, it is natural to choose $`\mathbf{e}:G\to G^{(2w+1)^2}`$ such that $`\{e_1(g_1,g_2),\ldots,e_{(2w+1)^2}(g_1,g_2)\} = \{(g_1',g_2')\in T\times T : |g_1'-g_1|\le w\text{ and }|g_2'-g_2|\le w\}`$. Given a proximity mapping $`\mathbf{e}`$ and $`\vec\mathbf{x} \in\mathbb{R}^{d,G}`$ we define $`E_g(\vec\mathbf{x} )`$ as the concatenation of all vectors $`\mathbf{x} _{g'}`$ where $`g'`$ is close to $`g`$ according to $`\mathbf{e}`$. Formally,

Definition 6. Given a proximity mapping $`\mathbf{e}:G\to G^w`$, $`g\in G`$ and $`\vec\mathbf{x} \in \mathbb{R}^{d,G}`$ we define $`E_g(\vec{\mathbf{x} }) = (\mathbf{x} _{e_1(g)}|\ldots|\mathbf{x} _{e_w(g)})\in\mathbb{R}^{dw}`$. Likewise, we let $`E(\vec\mathbf{x} )\in \mathbb{R}^{dw,G}`$ be $`E(\vec\mathbf{x} )= (E_g(\vec{\mathbf{x} }))_{g\in G}`$.
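For $`G=[T]`$, a minimal implementation of a proximity mapping and of the induced concatenation $`E_g`$ might look as follows; clipping indices at the boundary is an illustrative choice, not something the text specifies.

```python
# A width-(2w+1) proximity mapping for G = [T] and the induced map E, which
# concatenates the vectors at the locations near each g (Definition 6).
import numpy as np

T, d, w = 8, 3, 1

def e(g):
    # Neighbourhood of g, with e_1(g) = g listed first, as in Definition 5.
    others = [min(max(g + off, 0), T - 1) for off in range(-w, w + 1) if off != 0]
    return [g] + others

def E(xs):
    # xs has shape (T, d); E(xs)[g] concatenates the rows of xs at e(g).
    return np.stack([np.concatenate([xs[g2] for g2 in e(g)]) for g in range(T)])

xs = np.arange(T * d, dtype=float).reshape(T, d)
print(E(xs).shape)    # (T, (2*w + 1) * d)
print(E(xs)[0])       # location 0: itself first, then its (clipped) neighbours
```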

We next extend the definition of hierarchy to accommodate the ensemble setting.

Definition 7 (hierarchy). Let $`{\cal L}= \{L_1,\ldots,L_r\}`$ be a collection of sets such that $`L_1\subseteq L_2\subset\ldots\subseteq L_r = [n]`$. Let $`\mathbf{e}:G\to G^w`$ be a proximity function. We say that $`({\cal L},\mathbf{e})`$ is a hierarchy for $`\mathbf{f}^*:{\cal X}^G\to \{\pm 1\}^{n,G}`$ of complexity $`(r,K,M,B,\xi)`$ (or $`(r,K,M,B,\xi)`$-hierarchy for short) if

  • For any $`j\in L_1`$ there is a $`(K,M,B,\xi)`$-PTF $`\tilde f_j:{\cal X}^w\to\{\pm 1\}`$ such that $`f^*_{j,g}(\mathbf{x} )=\tilde f_j(E_g(\mathbf{x} ))`$ for any $`\mathbf{x} \in {\cal X}^G`$ and $`g\in G`$

  • For $`i\ge 2`$, and $`j\in L_i`$ there is a $`(K,M,B,\xi)`$-PTF $`\tilde f_j:\{\pm 1\}^{|L_{i-1}|w}\to\{\pm 1\}`$ such that $`f^*_{j,g}(\mathbf{x} )=\tilde f_j(E_g(\mathbf{f}^*_{L_{i-1}}(\mathbf{x} )))`$ for any $`\mathbf{x} \in {\cal X}^G`$ and $`g\in G`$

We note that the previous definition of hierarchy (i.e. Definitions 1 and 3) is the special case $`w=|G|=1`$.

Algorithm and Main Result

Fix $`{\cal X}\subseteq [-1,1]^d`$, a location set $`G`$, a proximity mapping $`\mathbf{e}:G\to G^{w}`$ of width $`w`$, some constant integer $`K\ge1`$, and an activation function $`\sigma:\mathbb{R}\to\mathbb{R}`$ that is Lipschitz, bounded, and not a constant function. We will view $`\sigma`$ and $`K`$ as fixed, and will allow big-$`O`$ notation to hide constants that depend on $`\sigma`$ and $`K`$.

We start by describing the residual network architecture that we will consider. Let $`{\cal X}^G`$ be our instance space. The first layer (actually, it is two layers, but it will be easier to consider it as one layer) of the network will compute the function

MATH
\Psi_1(\vec\mathbf{x} ) =  W^1_2\sigma(W^1_1 E(\vec\mathbf{x} )+\mathbf{b}^1)

We assume that $`W^1_2\in \mathbb{R}^{n\times q}`$ is initialized to $`0`$, while $`(W^1_1,\mathbf{b}^1)\in \mathbb{R}^{q\times wd}\times \mathbb{R}^{q}`$ is initialized using $`\beta`$-Xavier initialization as defined next.

Definition 8 (Xavier Initialization). Fix $`1\ge \beta \ge 0`$. A random pair $`(W,\mathbf{b})\in \mathbb{R}^{q\times d}\times \mathbb{R}^{q}`$ has $`\beta`$-Xavier distribution if the entries of $`W`$ are i.i.d. centered Gaussians of variance $`\frac{1-\beta^2}{d}`$, and $`\mathbf{b}`$ is independent from $`W`$ and its entries are i.i.d. centered Gaussians of variance $`\beta^2`$.
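A direct sampler for Definition 8 (the sizes $`q`$, $`d`$ and the value of $`\beta`$ below are illustrative):

```python
# Sampling a beta-Xavier pair (W, b): W has i.i.d. N(0, (1 - beta^2)/d) entries
# and b has i.i.d. N(0, beta^2) entries, independent of W.
import numpy as np

def xavier_pair(q, d, beta, rng):
    W = rng.normal(0.0, np.sqrt((1.0 - beta ** 2) / d), size=(q, d))
    b = rng.normal(0.0, beta, size=q)
    return W, b

rng = np.random.default_rng(0)
W, b = xavier_pair(q=64, d=16, beta=0.9, rng=rng)
print(W.shape, b.shape, W.var(), b.var())
```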

The remaining layers are of the form

MATH
\Psi_k(\vec\mathbf{x} ) =  \vec\mathbf{x} + W^k_2\sigma(W^k_1 E(\vec\mathbf{x} )+\mathbf{b}^k)

where $`(W^k_1,\mathbf{b}^k)\in \mathbb{R}^{q\times (wn)}\times\mathbb{R}^{q}`$ is initialized using $`\beta`$-Xavier initialization and $`W^k_2\in \mathbb{R}^{n\times q}`$ is initialized to $`0`$. Finally, the last layer computes

MATH
\Psi_D(\vec\mathbf{x} ) = W^D\vec\mathbf{x}

for an orthogonal matrix $`W^D\in \mathbb{R}^{n\times n}`$. We will denote the collection of weight matrices by $`\vec W`$, and the function computed by the network by $`\hat \mathbf{f}_{\vec W}`$. Fix a convex loss function $`\ell:\mathbb{R}\to [0,\infty)`$; we extend it to a loss $`\ell: \mathbb{R}^{G}\times \{\pm 1\}^G\to [0,\infty)`$ by averaging:

MATH
\ell(\hat\mathbf{y},\mathbf{y}) = \frac{1}{|G|}\sum_{g\in G}\ell(\hat y_g\cdot y_g)

Likewise, for a function $`\hat \mathbf{f}:{\cal X}^G\to \mathbb{R}^{n,G}`$ and $`j\in [n]`$ we define

MATH
\ell_{S,j}(\hat\mathbf{f}) =\ell_{S,j}(\hat\mathbf{f}_j) = \frac{1}{m}\sum_{t=1}^m\ell\left(\hat \mathbf{f}_{j}(\vec\mathbf{x} ^t),\mathbf{y}^t_{j}\right)

Finally, let

MATH
\ell_{S}(\hat\mathbf{f}) = \sum_{j=1}^n\ell_{S,j}(\hat\mathbf{f})\;\;\;\;\;\text{ and }\;\;\;\;\; \ell_{S}(\vec W)=\ell_{S}\left(\hat\mathbf{f}_{\vec W}\right)

We will consider the following algorithm

Algorithm 9. At each step $`k=1,\ldots,D-1`$, optimize the objective $`\ell_S(\vec W)+ \frac{\epsilon_\mathrm{opt}}{2}\|W^k_2\|^2`$ over $`W^k_2`$, until a gradient of size $`\le\epsilon_\mathrm{opt}`$ is reached. (As the $`k`$’th step objective is $`\epsilon_\mathrm{opt}`$-strongly convex, the algorithm finds an $`\frac{\epsilon_\mathrm{opt}}{2}`$-minimizer of it.)
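The sketch below is a compact, illustrative rendition of Algorithm 9 in the special case $`w=|G|=1`$: at step $`k`$ only $`W^k_2`$ is trained, on a ridge-regularized convex objective, until the gradient is small. The piecewise loss defined next (Equation [eq:loss]) is replaced here by a smooth convex margin loss (softplus), and the sizes, learning rate, and target labels are arbitrary choices made only so the code runs.

```python
# Layerwise training in the spirit of Algorithm 9 (w = |G| = 1), with a softplus
# margin loss standing in for the loss of Eq. (eq:loss). Purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
d, n, q, D, m = 8, 4, 64, 6, 200
beta, eps_opt, lr = 0.9, 1e-2, 0.05
sigma = np.tanh                                  # Lipschitz, bounded, non-constant

X = rng.choice([-1.0, 1.0], size=(m, d))
Y = X[:, :n] * X[:, 1:n + 1]                     # illustrative +/-1 target labels
WD = np.linalg.qr(rng.normal(size=(n, n)))[0]    # orthogonal output matrix W^D

def xavier(width, fan_in):
    return (rng.normal(0.0, np.sqrt((1 - beta ** 2) / fan_in), (width, fan_in)),
            rng.normal(0.0, beta, width))

layers = []
for k in range(D - 1):
    W1, b = xavier(q, d if k == 0 else n)
    layers.append({"W1": W1, "b": b, "W2": np.zeros((n, q))})

def gamma(k):
    """Gamma^k: output of the first k layers (untrained layers act as the identity)."""
    G = sigma(X @ layers[0]["W1"].T + layers[0]["b"]) @ layers[0]["W2"].T
    for j in range(1, k):
        G = G + sigma(G @ layers[j]["W1"].T + layers[j]["b"]) @ layers[j]["W2"].T
    return G

def loss_and_grad(W2, Phi, F_prev):
    """Softplus margin loss (a convex stand-in) plus ridge, and its gradient in W2."""
    F = F_prev + Phi @ W2.T @ WD.T                    # network output, shape (m, n)
    margins = F * Y
    loss = n * np.mean(np.logaddexp(0.0, 1.0 - margins)) + 0.5 * eps_opt * (W2 ** 2).sum()
    dF = -Y * 0.5 * (1.0 + np.tanh((1.0 - margins) / 2.0)) / m   # = -Y*sigmoid(1-margins)/m
    return loss, WD.T @ dF.T @ Phi + eps_opt * W2

for k in range(D - 1):                                # train one layer at a time
    G_prev = X if k == 0 else gamma(k)
    F_prev = np.zeros((m, n)) if k == 0 else gamma(k) @ WD.T
    Phi = sigma(G_prev @ layers[k]["W1"].T + layers[k]["b"])
    W2 = layers[k]["W2"]
    for _ in range(50_000):                           # safety cap on the inner loop
        _, g = loss_and_grad(W2, Phi, F_prev)
        if np.linalg.norm(g) <= eps_opt:              # stop once the gradient is small
            break
        W2 -= lr * g

print("training error:", np.mean(np.sign(gamma(D - 1) @ WD.T) != Y))
```

Only the block $`W^k_2`$ is updated at step $`k`$; since the output is linear in $`W^k_2`$ once the earlier layers are frozen, each step is a convex (and, with the ridge term, strongly convex) problem, mirroring the analysis via Lemma 12 below.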

We will consider the following loss function.

MATH
\begin{equation}
\label{eq:loss}
\ell = \ell_{1/(2B)}+\frac{1}{4m|G|}\ell_{1-\xi/2}\;\;\;\;\text{ for }\;\;\;\;\ell_{\eta}(z) = \begin{cases}1-\frac{z}{\eta} & 0\le z\le \eta\\
0 & \eta\le z\le 1
\\
\infty&\text{otherwise}\end{cases}
\end{equation}
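Written out directly, the loss of Equation [eq:loss] looks as follows; the values of $`B`$, $`\xi`$, $`m`$ and $`|G|`$ below are illustrative.

```python
# The piecewise loss ell_eta and the combined loss ell of Eq. (eq:loss).
import numpy as np

def ell_eta(z, eta):
    z = np.asarray(z, dtype=float)
    out = np.where(z < eta, 1.0 - z / eta, 0.0)          # 1 - z/eta on [0, eta], 0 on [eta, 1]
    return np.where((z < 0.0) | (z > 1.0), np.inf, out)  # +infinity otherwise

def ell(z, B=2.0, xi=0.5, m=100, G_size=1):
    return ell_eta(z, 1.0 / (2.0 * B)) + ell_eta(z, 1.0 - xi / 2.0) / (4.0 * m * G_size)

print(ell(np.array([-0.1, 0.1, 0.3, 0.8, 1.0])))
```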

We are now ready to state our main result.

Theorem 10 (Main). Assume that $`\mathbf{f}^*`$ has $`(r,K,M,B,\xi)`$-hierarchy and let $`\gamma = \frac{1}{32}\min\left(\frac{1}{B},\xi\right)`$. Assume that

  • $`D>r\cdot\left(\left\lceil\frac{\ln(8m|G|/\xi)}{\gamma} \right\rceil+1\right)`$

  • $`\epsilon_\mathrm{opt}\le \frac{(1-e^{-\gamma})\xi}{16m^2|G|^2}`$

Then, there is a choice of $`\beta`$ and $`q=\tilde O\left(\frac{(M+1)^4(wn)^{2K}}{\gamma^{4+2K}}\right)`$ such that algorithm 9 will learn a classifier with expected error at most $`\tilde O\left(\frac{D^2(M+1)^4(wn)^{2K+1}}{\gamma^{4+2K}m}\right)`$.

Proof of Theorem 10: Hierarchical Learning by ResNets

In order to prove Theorem 10 it is enough to prove Theorem 11 below, which shows that there is a choice of $`\beta`$ and $`q=\tilde O\left(\frac{(M+1)^4(wn)^{2K}}{\gamma^{4+2K}}\right)`$ such that algorithm 9 will learn a classifier with empirical large-margin error of $`0`$ w.p. $`1-\frac{1}{m}`$. That is, we define

MATH
\begin{equation}
\label{eq:large_margin_sample_err_def}
\mathrm{Err}_{S,\gamma}(\hat{\mathbf{f}})=\frac{1}{m}\sum_{t=1}^m 1\left[\exists (i,g)\in [n]\times G\text{ s.t. }\hat f_{i,g}(\vec\mathbf{x} ^t)\cdot f^*_{i,g}(\vec\mathbf{x} ^t) < \gamma \right]
\end{equation}

and show that algorithm 9 will learn a classifier $`\hat \mathbf{f}`$ with $`\mathrm{Err}_{S,1/2}(\hat\mathbf{f})=0`$ w.p. $`1-\frac{1}{m}`$. Let’s call such an algorithm $`(1/m)`$-consistent. Given this guarantee, Theorem 10 will follow from a standard parameter counting argument: The number of trained parameters is $`p=Dqn`$, and their magnitude is bounded by $`\frac{2n}{\epsilon_\mathrm{opt}}+1`$ due to the $`\ell^2`$ regularization term. Likewise, excluding the small probability event that one of the initial weights has magnitude $`\ge \ln(Dq(n+d)wm)`$ (which happens w.p. $`\ll \frac{1}{m}`$, since all $`Dq(n+d)w`$ initial weights are centered Gaussians with variance $`\le 1`$), it is not hard to verify that as a composition of $`2D`$ layers, the network’s output is $`L`$-Lipschitz w.r.t. the trained parameters for $`L=2^{\tilde O(D)}`$. Thus, the expected error of any $`(1/m)`$-consistent algorithm is $`\tilde O\left(\frac{p\log(L)}{m}\right) =\tilde O\left(\frac{Dp}{m}\right) = \tilde O\left(\frac{D^2qn}{m}\right)`$. (See Lemma 23 for a precise statement).

Theorem 11 (Main - Restated). Let $`\gamma = \frac{1}{32}\min\left(\frac{1}{B},\xi\right)`$. Assume that

  • $`\mathbf{f}^*`$ has $`(r,K,M,B,\xi)`$-hierarchy $`({\cal L},e)`$

  • $`D>r\cdot\left(\left\lceil\frac{\ln(8m|G|/\xi)}{\gamma} \right\rceil+1\right)`$

  • $`\epsilon_\mathrm{opt}\le \frac{(1-e^{-\gamma})\xi}{16m^2|G|^2}`$

There is a choice of $`\beta`$ such that w.p. $`1-2n mD |G|\exp\left(-\Omega\left(q\cdot\frac{ \gamma^{2K+4}}{(wn)^{2K}(M+1)^4}\right)\right)`$ over the initial choice of the weights, Algorithm 9 will learn a classifier $`\hat\mathbf{f}:{\cal X}^G\to\mathbb{R}^{n,G}`$ with $`\mathrm{Err}_{S,1/2}(\hat\mathbf{f})=0`$.

For $`1\le k\le D`$, let $`\hat \mathbf{f}^k:{\cal X}^G\to \mathbb{R}^{n,G}`$ be the function computed by the network after the $`k`$’th layer is trained. Also, let $`\Gamma^k:{\cal X}^G\to \mathbb{R}^{n,G}`$ be the function computed by the layers $`1`$ to $`k`$ after the $`k`$’th layer is trained. For $`k=0`$ we denote by $`\hat \mathbf{f}^0=\Gamma^0`$ the identity mapping from $`{\cal X}^G`$ to $`\mathbb{R}^{d,G}`$. We note that when algorithm 9 trains the $`k`$’th layer we have $`W^{k'}_2=0`$ for any $`k'>k`$. Hence,

MATH
\Psi_{k'}(\vec\mathbf{x} ) =  \vec\mathbf{x} + W^{k'}_2\sigma(W^{k'}_1 E(\vec\mathbf{x} )+\mathbf{b}^{k'}) = \vec\mathbf{x}

so when the $`k`$’th layer is trained the $`k'`$’th layer is simply the identity function for any $`k'>k`$. As a result, we have $`\hat \mathbf{f}^k(\mathbf{x} )=W^D \Gamma^k(\mathbf{x} )`$.

Our first observation in the proof of Theorem 11 is that the $`k`$’th step of algorithm 9 (i.e., obtaining $`\hat\mathbf{f}^k`$ from $`\hat \mathbf{f}^{k-1}`$) is essentially equivalent to learning a linear classifier on top of a random-features extension of the data representation $`\vec\mathbf{x} \mapsto \hat \mathbf{f}^{k-1}(\vec\mathbf{x} )`$. Specifically, define an input space embedding $`\Phi^{k-1}:{\cal X}^G\to \mathbb{R}^{q,G}`$ by

MATH
\Phi^{k-1}(\vec\mathbf{x} ) = \sigma(W^k_1E(\Gamma^{k-1}(\vec\mathbf{x} ))+\mathbf{b}^k) = \sigma(W^k_1E( (W^D)^{-1}\hat\mathbf{f}^{k-1}(\vec\mathbf{x} ))+\mathbf{b}^k)

For $`\mathbf{w}\in\mathbb{R}^{q}`$ we define

MATH
\hat\mathbf{f}^k_{j,\mathbf{w}}(\vec\mathbf{x} ) = \hat\mathbf{f}^{k-1}_{j}(\vec\mathbf{x} ) + \mathbf{w}^\top  \Phi^{k-1}(\vec\mathbf{x} )

We have that

Lemma 12. For any $`D-1\ge k\ge 1`$, $`\hat \mathbf{f}_j^k = \hat\mathbf{f}^k_{j,\mathbf{w}}`$ where $`\mathbf{w}`$ is an $`\frac{\epsilon_\mathrm{opt}}{2}`$-minimizer of the convex objective

MATH
\begin{eqnarray*}
\ell^{k}_{S,j}(\mathbf{w})&=&\ell_{S,j}\left(\hat\mathbf{f}^k_{j,\mathbf{w}}\right) + \frac{\epsilon_\mathrm{opt}}{2}\|\mathbf{w}\|^2
\end{eqnarray*}

over $`\mathbf{w}\in \mathbb{R}^{q}`$. Furthermore,

MATH
\ell_{S,j}(\hat \mathbf{f}^k) \le \ell^k_{S,j}(\mathbf{w}^*) + \frac{\epsilon_\mathrm{opt}}{2}\|\mathbf{w}^*\|^2 + \frac{\epsilon_\mathrm{opt}}{2}

Proof. When the $`k`$’th layer is trained, since all deeper layers during this training phase are the identity function, the output of the network as a function of $`W^k_2`$ (the parameters that are trained in the $`k`$’th step) is

MATH
G(W^k_2,\vec\mathbf{x} ) = W^D\left(\Gamma^{k-1}(\vec\mathbf{x} ) + W^k_2\Phi^{k-1}(\vec\mathbf{x} )\right) = \hat{\mathbf{f}}^{k-1}(\vec\mathbf{x} ) + W^DW^k_2\Phi^{k-1}(\vec\mathbf{x} )

In particular, if we denote by $`\hat W^k_2`$ the value of $`W^k_2`$ after the $`k`$’th layer is trained, then we have $`\hat \mathbf{f}_j^k = \hat\mathbf{f}^k_{j,\mathbf{w}}`$ where $`\mathbf{w}`$ is the $`j`$’th row of the matrix $`W = W^D\hat W^k_2`$. It remains therefore to show that $`\mathbf{w}`$ minimizes $`\ell^k_{S,j}`$. To this end, we note that at the $`k`$’th step algorithm 9 finds an $`\frac{\epsilon_\mathrm{opt}}{2}`$-minimizer of

MATH
L(W^k_2)=\frac{\epsilon_\mathrm{opt}}{2}\|W^k_2\|^2+\frac{1}{m}\sum_{t=1}^m\sum_{j=1}^n\ell(\hat{\mathbf{f}}^{k-1}(\vec\mathbf{x} ) + W^DW^k_2\Phi^{k-1}(\vec\mathbf{x} ),\mathbf{y}^t_j)

As a result, $`\hat W:=W^D\hat W^k_2`$ is an $`\frac{\epsilon_\mathrm{opt}}{2}`$-minimizer of

MATH
\begin{eqnarray*}
L'(W)=L((W^D)^{-1}W)&=&\frac{\epsilon_\mathrm{opt}}{2}\|(W^D)^{-1}W\|^2+\frac{1}{m}\sum_{t=1}^m\sum_{j=1}^n\ell(\hat{\mathbf{f}}_j^{k-1}(\vec\mathbf{x} ) + W^D(W^D)^{-1}W\Phi^{k-1}(\vec\mathbf{x} ),\mathbf{y}^t_j)
\\
&\stackrel{W^D\text{ is orthogonal}}{=}&\frac{\epsilon_\mathrm{opt}}{2}\|W\|^2+\frac{1}{m}\sum_{t=1}^m\sum_{j=1}^n\ell(\hat{\mathbf{f}}_j^{k-1}(\vec\mathbf{x} ) + W\Phi^{k-1}(\vec\mathbf{x} ),\mathbf{y}^t_j)
\\
&=&\sum_{j=1}^n\left(\frac{\epsilon_\mathrm{opt}}{2}\|W_{j\cdot}\|^2+\frac{1}{m}\sum_{t=1}^m\ell(\hat{\mathbf{f}}_j^{k-1}(\vec\mathbf{x} ) + W_{j\cdot}\Phi^{k-1}(\vec\mathbf{x} ),\mathbf{y}^t_j)\right)
\\
&=&\sum_{j=1}^n\ell^k_{S,j}(W_{j\cdot})
\end{eqnarray*}

In particular, $`\mathbf{w}=\hat W_{j\cdot}`$ must be an $`\frac{\epsilon_\mathrm{opt}}{2}`$-minimizer of $`\ell^k_{S,j}`$. Finally, since $`\ell^{k}_{S,j}`$ is $`\epsilon_\mathrm{opt}`$-strongly convex, Equation [eq:strongly_conv_guarantee] implies that for any $`\mathbf{w}^*\in \mathbb{R}^{q}`$,

MATH
\ell_{S,j}(\hat \mathbf{f}^k) \le \ell^k_{S,j}(\mathbf{w}^*) + \frac{\epsilon_\mathrm{opt}}{2}\|\mathbf{w}^*\|^2 + \frac{\epsilon_\mathrm{opt}}{2}

◻

With Lemma 12 at hand, we can present the strategy of the proof. Since the labels in $`L_1`$ are PTFs of the input, we will learn them when the first layer is trained. That is, $`\hat\mathbf{f}^1`$ will predict the labels in $`L_1`$ correctly. The reason for that is that, roughly speaking, PTFs are efficiently learnable by training a linear classifier on top of a random-features embedding.

Since $`\hat\mathbf{f}^1`$ predicts the labels in $`L_1`$ correctly, the labels in $`L_2`$ become a simple function of $`\hat\mathbf{f}^1`$. Concretely, a PTF of $`\mathrm{sign}(\hat\mathbf{f}^1)`$. It is therefore tempting to try using the same reasoning as above in order to prove that after training the next layer, we will learn the labels in $`L_2`$, and more generally, that after $`r`$ layers are trained, the network will predict all labels correctly. This however won’t work that smoothly: a PTF of $`\mathrm{sign}(\hat\mathbf{f}^1)`$ is not necessarily learnable by training a linear classifier on top of a random-features embedding of $`\hat\mathbf{f}^1`$. To circumvent this, we show that after the network predicts a label $`j`$ correctly, the loss of this label keeps improving when training additional layers, so after training an additional $`O(B+1/\xi)`$ layers, the loss will be small enough to guarantee that the labels in $`L_2`$ are PTFs of $`\hat\mathbf{f}^1`$ (and not just of $`\mathrm{sign}(\hat\mathbf{f}^1)`$). Thus, after $`O(B+1/\xi)`$ layers are trained, the network will predict the labels in $`L_2`$ correctly, and more generally, after $`O(rB+r/\xi)`$ layers are trained, the network will predict all the labels correctly.

The course of the proof will be as follows

  1. We start with Lemma 14 which shows that if a label $`j`$ is a large PTF of $`\hat\mathbf{f}^k`$ then $`\hat \mathbf{f}^{k+1}`$ will predict it correctly. To be more accurate, we show that if a robust version of $`\ell_{S,j}(p\circ E\circ\hat\mathbf{f}^k)`$ is small for a polynomial $`p`$, then $`\ell_{S,j}(\hat\mathbf{f}^{k+1})`$ is small.

  2. We then continue with Lemma 15 which uses Lemma 14 to show that (i) $`\ell_{S,j}(\hat\mathbf{f}^1)`$ is small for any $`j\in L_1`$, (ii) for any $`j\in [n]`$, if $`\ell_{S,j}(\hat\mathbf{f}^k)`$ is small, then it will shrink exponentially as we train deeper layers and (iii) if $`\ell_{S,j}(\hat\mathbf{f}^k)`$ is very small for any $`j\in L_{i-1}`$, then $`\ell_{S,j}(\hat\mathbf{f}^{k+1})`$ is small for any $`j\in L_{i}`$.

  3. Based on Lemma 15, we will prove Theorem 11.

To carry out the first step, we will need some notation. First, we define the $`\epsilon`$-robust version of $`\ell`$ as

MATH
\begin{equation}
\label{eq:rob_loss}
\ell^{\mathrm{rob},\epsilon}(z) = \max(\ell(z),\ell(z-\epsilon)) = \max_{0\le t\le \epsilon}\ell(z-t)
\end{equation}
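The robust loss is a one-liner; the sketch below applies it to a hinge loss, which stands in for the loss of Equation [eq:loss] purely for illustration.

```python
# The epsilon-robust loss of Eq. (eq:rob_loss): a worst case over downward shifts
# of the argument by at most epsilon. The hinge `ell` is an illustrative stand-in.
import numpy as np

def ell(z):
    return np.maximum(0.0, 1.0 - np.asarray(z, dtype=float))

def ell_rob(z, eps):
    # For a non-increasing ell, the maximum over 0 <= t <= eps of ell(z - t)
    # is attained at t = eps, i.e. ell_rob(z) = max(ell(z), ell(z - eps)).
    return np.maximum(ell(z), ell(z - eps))

print(ell_rob(np.array([0.2, 0.9, 1.5]), eps=0.3))
```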

Note that for $`z\le 1`$ we have $`\ell^{\mathrm{rob},\epsilon}(z) =\ell(z-\epsilon)`$ while for $`z< 0`$ we have $`\ell^{\mathrm{rob},\epsilon}(z) =\ell(z)=\infty`$. Denote the Hermite expansion of $`\sigma`$ by

MATH
\begin{equation}
    \sigma = \sum_{s=0}^\infty a_sh_s
\end{equation}

Let $`K'`$ be the minimal integer $`K'\ge K`$ such that $`a_{K'}\ne0`$ (such $`K'`$ exists as otherwise $`\sigma`$ is a polynomial, which contradicts the assumption that it is bounded and non-constant). For $`\epsilon>0`$ define $`\beta(\epsilon) = \beta_{\sigma,K',K}(\epsilon)<1`$ as the minimal positive number greater than $`\frac{3}{4}`$ such that if $`\beta_{\sigma,K',K}(\epsilon)\le \beta<1`$ then

MATH
\frac{\|\sigma\|}{a_{K'}}2^{(K'+2)/2}\frac{1-\beta^2}{\sqrt{1-2(1-\beta^2)^2}}\le \frac{\epsilon}{2}

Note that $`\beta(\epsilon)`$ is well defined as $`h(\beta):=\frac{1-\beta^2}{\sqrt{1-2(1-\beta^2)^2}}`$ is continuous near $`\beta=1`$ and equals $`0`$ at $`\beta=1`$. In fact, since $`h`$ is differentiable near $`\beta=1`$ we have that $`1- \beta(\epsilon) = \Omega\left(\epsilon2^{-K'}\frac{a_{K'}}{\|\sigma\|}\right)`$. In particular, for fixed $`\sigma,K',K`$ we have that $`1- \beta(\epsilon) = \Omega(\epsilon)`$. Define also

MATH
\delta(\epsilon,\beta,q,M,n) = \delta_{\sigma,K',K}(\epsilon,\beta,q,M,n) = \begin{cases}
1 & \frac{4\|\sigma\|_\infty}{\epsilon\sqrt{{q}}}\cdot\frac{1}{a^2_{K'}\beta^{2K'-2K}}\left(\frac{n}{1-\beta^2}\right)^{K}M^2 > 1
\\
2\exp\left(-{q}\cdot\frac{a^4_{K'}\beta^{4K'-4K}(1-\beta^2)^{2K}\epsilon^4}{512 n^{2K}M^4\|\sigma\|_\infty^4}\right) & \text{otherwise}
\end{cases}

Note that for fixed $`\sigma,K',K`$ and $`1-\beta = \Omega(\epsilon)`$ we have

MATH
\begin{equation}
\label{eq:beta_est}
\delta(\epsilon,\beta,q,M,n)  = \exp\left(-\Omega\left(q\cdot\frac{ \epsilon^{2K+4}}{n^{2K}M^4}\right)\right)
\end{equation}

We will need the following lemma, which is proved at the end of Section 9 and shows that it is possible to approximate a polynomial by composing a random layer with a linear function.

Lemma 13. Fix $`{\cal X}\subset [-1,1]^n`$, a degree $`K`$ polynomial $`p:{\cal X}\to [-1,1]`$, $`K'\ge K`$ and $`\epsilon>0`$. Let $`(W,\mathbf{b})\in\mathbb{R}^{q\times n}\times \mathbb{R}^{q}`$ be a $`\beta`$-Xavier pair for $`1>\beta\ge \beta_{\sigma,K',K}(\epsilon)`$. Then there is a vector $`\mathbf{w}=\mathbf{w}(W,\mathbf{b})\in\mathbb{B}^{q}`$ such that

MATH
\forall\mathbf{x} \in {\cal X},\;\Pr\left(|{\left\langle \mathbf{w},\sigma(W\mathbf{x} +\mathbf{b}) \right\rangle}-p(\mathbf{x} )|\ge \epsilon \right) \le  \delta_{\sigma,K',K}(\epsilon,\beta,q,\|p\|_\mathrm{co},n)

We are now ready to show that if there is a polynomial $`p:\mathbb{R}^{wn}\to\mathbb{R}`$ such that $`\ell^{\mathrm{rob},\epsilon_1}_{S,j}(p \circ E_g\circ  \hat\mathbf{f}^k)`$ is small, then w.h.p. $`\ell_{S,j}(\hat\mathbf{f}^{k+1})`$ will be small as well.

Lemma 14. Fix $`\epsilon_1>0`$, $`1>\beta>\beta(\epsilon_1/2)`$ and a polynomial $`p:\mathbb{R}^{wn}\to\mathbb{R}`$. Given that $`\ell^{\mathrm{rob},\epsilon_1}_{S,j}(p \circ E_g\circ  \hat\mathbf{f}^k)\le \epsilon`$, we have that $`\ell_{S,j}( \hat\mathbf{f}^{k+1})\le \epsilon + \epsilon_\mathrm{opt}`$ w.p. $`1-m|G|\delta(\epsilon_1/2,\beta,q,\|p\|_\mathrm{co}+1,wn)`$.

Proof. By Lemma 12 we have $`\ell_{S,j}( \hat\mathbf{f}^{k+1})\le \ell_{S,j}\left(\hat\mathbf{f}^{k+1}_{j,\mathbf{w}^*}\right)+\frac{\epsilon_\mathrm{opt}}{2}\|\mathbf{w}^*\|^2+\frac{\epsilon_\mathrm{opt}}{2}`$ for any $`\mathbf{w}^*\in \mathbb{R}^q`$. Thus, it is enough to show that w.p. $`1-m|G|\delta(\epsilon_1/2,\beta,q,\|p\|_\mathrm{co}+1,wn)=:1-\delta`$ over the choice of $`W^k_1`$ there is $`\mathbf{w}^*\in\mathbb{B}^q`$ such that $`\ell_{S,j}\left(\hat\mathbf{f}^{k+1}_{j,\mathbf{w}^*}\right)\le \epsilon`$. By the definition of $`\ell^{\mathrm{rob},\epsilon_1}_{S,j}`$ it is enough to show that w.p. $`1-\delta`$ there is $`\mathbf{w}^*\in\mathbb{B}^q`$ such that

MATH
\begin{equation}
\label{eq:lem:pol_imp_small_loss}
    y^t_{j,g}\cdot p \circ E_g\circ  \hat\mathbf{f}^k(\vec\mathbf{x} ^t)-\epsilon_1\le y^t_{j,g}\cdot\hat f^{k+1}_{j,g,\mathbf{w}^*}(\vec\mathbf{x} ^t)\le y^t_{j,g}\cdot p \circ E_g\circ  \hat\mathbf{f}^k(\vec\mathbf{x} ^t)
\end{equation}

for any $`t`$ and $`g`$. Since $`y^t_{j,g}\cdot p \circ E_g\circ \hat\mathbf{f}^k(\vec\mathbf{x} ^t)\ge\epsilon_1`$ (as otherwise we would have $`\ell^{\mathrm{rob},\epsilon_1}_{S,j}(p \circ E_g\circ \hat\mathbf{f}^k) = \infty`$), it is enough to show that w.p. $`1-\delta`$ there is $`\tilde\mathbf{w}^*\in\mathbb{B}^q`$ such that

MATH
\left| p \circ E_g\circ  \hat\mathbf{f}^k(\vec\mathbf{x} ^t) - \hat f^{k+1}_{j,g,\tilde\mathbf{w}^*}(\vec\mathbf{x} ^t)\right|\le \frac{\epsilon_1}{2}

for any $`t`$ and $`g`$. Indeed, in this case Equation [eq:lem:pol_imp_small_loss] holds true for $`\mathbf{w}^* = \frac{\tilde\mathbf{w}^*}{1+\epsilon_1/2}`$. Finally, since

MATH
\hat f^{k+1}_{j,g,\mathbf{w}^*}(\vec\mathbf{x} ) = \hat f^{k}_{j,g}(\vec\mathbf{x} ) + {\left\langle \mathbf{w}^*,\sigma(W^{k+1}_1E_g\circ\hat{\mathbf{f}}^{k}(\vec\mathbf{x} )+\mathbf{b}^{k+1}) \right\rangle}

it is enough to show that w.p. $`1-\delta`$ there is $`\tilde\mathbf{w}^*\in\mathbb{B}^q`$ such that

MATH
\left| \tilde p \circ E_g\circ  \hat\mathbf{f}^k(\vec\mathbf{x} ^t) - {\left\langle \mathbf{w}^*,\sigma(W^{k+1}_1E_g\circ\hat{\mathbf{f}}^{k}(\vec\mathbf{x} ^t)+\mathbf{b}^{k+1}) \right\rangle}\right|\le \frac{\epsilon_1}{2}

for the polynomial $`\tilde p(\mathbf{x} ^1|\ldots|\mathbf{x} ^w)=p(\mathbf{x} ^1|\ldots|\mathbf{x} ^w)-x^1_j`$ (note that $`\tilde p(E(\hat\mathbf{f}^k(\vec\mathbf{x} )))=p(E(\hat\mathbf{f}^k(\vec\mathbf{x} ))) - \hat \mathbf{f}^{k}_{j}(\vec\mathbf{x} )`$ and that $`\|\tilde p\|_\mathrm{co}\le \|p\|_\mathrm{co}+1`$), and for any $`t`$ and $`g`$. The existence of such $`\mathbf{w}^*`$ w.p. $`1-\delta`$ follows from Lemma 13 and a union bound over $`X = \{E_g\circ\hat{\mathbf{f}}^{k}(\vec\mathbf{x} ^t) : g\in G,t\in [m]\}`$. ◻

We continue with the following Lemma which quantitatively describes how the loss of the different labels improves when training deeper and deeper layers.

Lemma 15. Let $`\gamma = \frac{1}{32}\min\left(\frac{1}{B},\xi\right)`$. Assume that $`1>\beta\ge\beta(\gamma/2)`$ and let $`\delta=m|G|\delta(\gamma/2,\beta,q,M+5,wn)`$. Then,

  • For any $`j\in L_{1}`$, w.p. $`1-\delta`$, $`\ell_{S,j}(\mathbf{f}^1)\le \frac{1}{4m|G|} + \epsilon_\mathrm{opt}`$

  • Given that $`\ell_{S,j}(\mathbf{f}^{k})\le \frac{1}{2m|G|}`$ we have that $`\ell_{S,j}(\mathbf{f}^{k+1})\le e^{-\gamma}\ell_{S,j}(\mathbf{f}^{k})+\epsilon_\mathrm{opt}`$ w.p. $`1-\delta`$. Furthermore, if $`\epsilon_\mathrm{opt}\le \frac{1-e^{-\gamma}}{2m|G|}`$ then w.p. $`1-t\delta`$ we have $`\ell_{S,j}(\mathbf{f}^{k+t})\le e^{-\gamma t}\ell_{S,j}(\mathbf{f}^{k})+\frac{1-e^{-\gamma t}}{1-e^{-\gamma}}\epsilon_\mathrm{opt}`$.

  • Given that $`\ell_{S,j'}(\mathbf{f}^{k})\le \frac{\xi}{8m^2|G|^2}`$ for any $`j'\in L_{i-1}`$ we have that $`\ell_{S,j}(\mathbf{f}^{k+1})\le \frac{1}{4m|G|}+\epsilon_\mathrm{opt}`$ for any $`j\in L_{i}`$ w.p. $`1-|L_i|\delta`$

Before proving Lemma 15, we show that it implies Theorem 11.

Proof. (of Theorem 11) Choose $`\beta = \beta(\gamma/2)`$ (more generally, $`1>\beta\ge \beta(\gamma/2)`$ such that $`1-\beta = \Omega(\gamma)`$). Denote $`\delta=m|G|\delta(\gamma/2,\beta,q,M+5,wn)`$ and note that by Equation [eq:beta_est] we have

MATH
\delta  = m|G|\exp\left(-\Omega\left(q\cdot\frac{ \gamma^{2K+4}}{(wn)^{2K}(M+1)^4}\right)\right)

Since $`\epsilon_\mathrm{opt}\le \frac{(1-e^{-\gamma})\xi}{16m^2|G|^2}`$, we have that if $`\ell_{S,j}(\mathbf{f}^{k})\le \frac{1}{2m|G|}`$ then w.p. $`1-t\delta`$

MATH
\ell_{S,j}(\mathbf{f}^{k+t})\le e^{-\gamma t}\ell_{S,j}(\mathbf{f}^{k})+\frac{1}{1-e^{-\gamma}}\epsilon_\mathrm{opt}\le \frac{e^{-\gamma t}}{2m|G|} + \frac{\xi}{16m^2|G|^2}

Choosing $`t_0 = \left\lceil\frac{\ln(8m|G|/\xi)}{\gamma} \right\rceil`$ we get

MATH
\ell_{S,j}(\mathbf{f}^{k+t_0})\le \frac{\xi}{8m^2|G|^2}

w.p. $`1-t_0\delta`$. Hence, it is not hard to verify by induction on $`1\le i\le r`$ that for any $`j\in L_i`$, if $`k\ge i (t_0+1)`$ then

MATH
\ell_{S,j}(\mathbf{f}^{k})\le  \frac{\xi}{8m^2|G|^2}

w.p. $`1-nk\delta`$. ◻

To prove lemma 15 we will use the following fact which is an immediate consequence of the definition of the loss.

Fact 16.

  • If $`\ell_{S,j}(\hat\mathbf{f})\le \frac{\epsilon}{m|G|}`$ then for any $`t\in [m]`$ and $`g\in G`$ we have $`1 \ge \hat f_{j,g}(\vec\mathbf{x} ^t)\cdot f^*_{j,g}(\vec\mathbf{x} ^t)\ge \frac{(1-\epsilon)}{2B}`$

  • If $`\ell_{S,j}(\hat\mathbf{f})\le \frac{\epsilon}{4m^2|G|^2}`$ then for any $`t\in [m]`$ and $`g\in G`$ we have $`1 \ge \hat f_{j,g}(\vec\mathbf{x} ^t)\cdot f^*_{j,g}(\vec\mathbf{x} ^t)\ge (1-\epsilon)(1-\xi/2)`$

  • If for any $`t\in [m]`$ and $`g\in G`$ we have $`1 \ge \hat f_{j,g}(\vec\mathbf{x} ^t)\cdot f^*_{j,g}(\vec\mathbf{x} ^t)\ge \frac{1}{B}`$ then $`\ell^{\mathrm{rob},1/2B}_{S,j}(\hat\mathbf{f})\le \frac{1}{4m|G|}`$

We next prove lemma 15.

Proof. (of lemma 15) Let $`p_1,\ldots,p_n`$ be polynomials that witness that $`({\cal L},e)`$ is an $`(r,K,M,B,\xi)`$-hierarchy for $`\mathbf{f}^*`$. We start with the first item. By the definition of hierarchy, we have that for any $`t\in [m]`$ and $`g\in G`$, $`B\ge p_j(E_g(\mathbf{f}^0(\vec\mathbf{x} ^t)))f^*_{j,g}(\vec\mathbf{x} ^t)\ge 1`$. Fact 16 implies that for $`\tilde p_j = \frac{1}{B}p_j`$ we have $`\ell^{\mathrm{rob},\gamma}_{S,j}(\tilde p_j\circ \hat\mathbf{f}^0)\le\ell^{\mathrm{rob},1/2B}_{S,j}(\tilde p_j\circ \hat\mathbf{f}^0)\le \frac{1}{4m |G|}`$. The first item therefore follows from Lemma 14.

The third item is proved similarly. If $`\ell_{S,j'}(\mathbf{f}^{k})\le \frac{\xi}{8m^2|G|^2}`$ for any $`j'\in L_{i-1}`$ then Fact 16 implies that for any $`j'\in L_{i-1},\;t\in [m]`$ and $`g\in G`$ we have

MATH
1\ge y^t_{j',g}\hat f^k_{j',g}(\vec \mathbf{x} ^t)\ge (1-\xi/2)(1-\xi/2)\ge1-\xi

Hence, by the definition of hierarchy, we have that for any $`t\in [m]`$ and $`g\in G`$, $`B\ge p_j(E_g(\hat\mathbf{f}^k(\vec\mathbf{x} ^t)))f^*_{j,g}(\vec\mathbf{x} ^t)\ge 1`$. Fact 16 now implies that for $`\tilde p_j = \frac{1}{B}p_j`$ we have $`\ell^{\mathrm{rob},\gamma}_{S,j}(\tilde p_j\circ \hat\mathbf{f}^k)\le\ell^{\mathrm{rob},1/2B}_{S,j}(\tilde p_j\circ \hat\mathbf{f}^k)\le \frac{1}{4m |G|}`$. The third item therefore follows from Lemma 14.

It remains to prove the second item. Define $`q:\mathbb{R}^{n}\to\mathbb{R}`$ by $`q(\mathbf{x} ) = 1.5 x_j-0.5x_j^3`$. By lemma 14 it is enough to show that

MATH
\begin{equation}
\label{eq:loss_improvments_proof}
\ell^{\mathrm{rob},\gamma}_{S,j}(q\circ \hat\mathbf{f}^k)\le e^{-\gamma}\ell_{S,j}(\hat\mathbf{f}^k)
\end{equation}

To do so, we note that since $`\ell_{S,j}(\hat\mathbf{f}^k)\le \frac{1}{2m|G|}`$, Fact 16 implies that $`\forall t,g,\;y^t_{j,g}\hat f^k_{j,g}(\vec \mathbf{x} ^t)\ge 1/(4B)`$. Now, since $`q`$ is odd we have

MATH
\ell\left(y^t_{j,g} q\left( \hat f^k_{j,g}(\vec \mathbf{x} ^t)\right)\right) = \ell\left( q\left( y^t_{j,g}\cdot\hat f^k_{j,g}(\vec \mathbf{x} ^t)\right)\right)

Equation [eq:loss_improvments_proof] therefore follows from the following claim

Claim 1. Let $`\tilde q(x) = 1.5x-0.5x^3`$. Then, for any $`\frac{1}{4B}\le x\le 1`$ we have $`\ell^{\mathrm{rob},\gamma}(\tilde q(x))=\ell(\tilde q(x)-\gamma)\le e^{-\gamma}\ell(x)`$.

Proof. Denote $`x'=\min(x,1-\xi/2)`$ and note that $`\ell(x)=\ell(x')`$ and that

MATH
\begin{equation}
\label{eq:proof_pushing_pol}
\tilde q(x')-x'=\frac{1}{2}x'(1-x'^2)=\frac{1}{2}x'(1-x')(1+x')\ge \frac{1}{2}x'(1-x') \ge \frac{1}{4}\min\left(1/4B,1/2\xi\right) \ge 2\gamma
\end{equation}

Now, we have

MATH
\begin{eqnarray*}
\ell(\tilde q(x)-\gamma) &\stackrel{x'\le x}{\le}& \ell(\tilde q(x')-\gamma)
\\
&\stackrel{\text{Eq. \eqref{eq:proof_pushing_pol}}}{\le}&\ell(x'+\gamma) 
\\
&=& \ell\left(\frac{1-x'-\gamma}{1-x'}x' + \frac{\gamma}{1-x'}\right)
\\
&\stackrel{\text{Convexity and }\frac{\gamma}{1-x'}\le 1}{\le}& \frac{1-x'-\gamma}{1-x'}\ell(x') + \frac{\gamma}{1-x'}\ell(1)
\\
&\stackrel{\ell(1)=0\text{ and }\ell(x')=\ell(x)}{=}& \frac{1-x'-\gamma}{1-x'}\ell(x) 
\\
&\le& e^{-\gamma}\ell(x)
\end{eqnarray*}

◻

◻
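As a quick numeric sanity check of the inequality chain in Equation [eq:proof_pushing_pol], the sketch below verifies $`\tilde q(x')-x'\ge 2\gamma`$ on the relevant range, for illustrative values of $`B`$ and $`\xi`$.

```python
# Check that q~(x') - x' >= 2*gamma for x' in [1/(4B), 1 - xi/2],
# with gamma = min(1/B, xi)/32. B and xi are arbitrary illustrative values.
import numpy as np

B, xi = 3.0, 0.4
gamma = min(1.0 / B, xi) / 32.0
x = np.linspace(1.0 / (4.0 * B), 1.0 - xi / 2.0, 10_001)
q_tilde = 1.5 * x - 0.5 * x ** 3
print("min of q~(x') - x':", (q_tilde - x).min(), "   2*gamma:", 2 * gamma)
assert (q_tilde - x >= 2 * gamma).all()
```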

Conclusion and Future Work

In this work, we argued that the availability of extensive and granular labeling suggests that the target functions in modern deep learning are inherently hierarchical, and we showed that deep learning—specifically, SGD on residual networks—can exploit such hierarchical structure. Our proof builds on a layerwise mechanism of the learning process, where each layer acts simultaneously as a representation learner and a predictor, iteratively refining the output of the previous layer. Our results give rise to several perspectives, which we outline below:

  • Supervised Learning is inherently tractable. Contrary to worst-case hardness results, the existence of a teacher (and thus a hierarchy) implies that the problem is learnable in polynomial time, given the right supervision.

  • Very deep models are provably learnable. Unlike previous theoretical works, we prove that ResNets can learn models that are realizable only by very deep circuits.

  • A middle ground between Software Engineering and Learning. Modern deep learning can be viewed as a relaxation of software engineering and a strengthening of classical learning. Instead of manually "codifying the brain's algorithm" (traditional AI) or learning blindly from input-output pairs (classical ML), we provide snippets of the brain's logic via related labels. This approach renders the learning task feasible without requiring full knowledge of the underlying circuit.

  • A modified narrative for learning theory. Historically, the narrative governing learning theory, particularly from a computational perspective, has been the following: (i) Learning all functions is impossible. (ii) Upon closer inspection, we are interested only in functions that are efficiently computable. (iii) This function class is learnable using polynomial samples. (iv) Unfortunately, learning it requires exponential time. (v) Nevertheless, some simple function classes are learnable.

    The aforementioned narrative, however, is at odds with practice. Our work suggests that it might be possible to replace item (v) with the following: ā€œ(v) Re-evaluating our scope, we are primarily interested in functions that are efficiently computable by humans. (vi) We have good reasons to believe that these functions are hierarchical. (vii) As a result, they are learnable using polynomial time and samples.ā€

Our work suggests using hierarchical models as a basis for understanding neural networks. Significant future work is required to advance this direction. First, theoretically, it would be useful to extend the scope of hierarchical models. To this end, one might:

  • Analyze attention mechanisms through the lens of hierarchical models.

  • Extend hierarchical models to capture a ā€œsingle-function hierarchy.ā€ This refers to a scenario where a function $`f`$ has ā€œsimple versionsā€ that are easy to learn, the mastery of which renders $`f`$ itself easy to learn. This aligns with previous work on the learnability of non-linear models via gradient-based algorithms (e.g., ), as many of these studies assumed (often implicitly) such a hierarchical structure on the target model.

  • Extend the inherent justification of hierarchical models by generalizing Theorem 4. That is, define formal models of teachers that are ā€œpartially awareā€ of their internal logic, and show that hierarchical labeling which facilitates efficient learnability can be provided by such teachers. Put differently, show that a ā€œgeneric non-linear projectionā€ of a hierarchical function is hierarchical itself.

  • Identify low-complexity hierarchies for known algorithms. This could lead to new hierarchical architectures, and might even shed some light on how humans discovered these algorithms, and facilitate teaching them.

Second, on the empirical side, it would be valuable to:

  • Build practical learning algorithms with principled optimization procedures based more directly on the hierarchical learning perspective.

  • Empirically test the hypothesis that, given enough labels, real-world data exhibits a hierarchical structure. In this respect, finding this explicit hierarchical structure can be viewed as an interpretation of the learned model.

Finally, we address specific limitations of our results, which rely on several assumptions. We outline the most prominent ones here, hoping that future work will be able to relax these constraints.

We begin with the technical assumptions. A clear direction for future work is to improve our quantitative bounds; while polynomial, they are likely far from optimal. Other technical constraints include the assumption that the output matrix is orthogonal and that the number of labels equals the dimension of the hidden layers. It would be more natural to consider an arbitrary number of labels and an output matrix initialized as a Xavier matrix (we note, however, that Xavier matrices are ā€œalmost orthogonalā€). Finally, the loss function used in our analysis is non-standard.

Next, we address more inherent limitations. First, we assumed extremely strong supervision: that each example comes with all positive labels it possesses. In practice, one usually obtains only a single positive label per example. We note that while it is straightforward to show that hierarchical models are efficiently learnable with this standard supervision, proving that gradient-based algorithms on neural networks succeed in this setting remains an open problem.

Another limitation is our assumption of layer-wise training, whereas in reality, all layers are typically trained jointly. While this makes the mathematical analysis more intricate, joint training is likely superior for several reasons. First, empirically, it is the standard method. Second, if the goal of training lower layers is merely to learn representations, there is little utility in exhausting data to achieve marginal improvements in the loss. Indeed, to ensure data efficiency, it is preferable to utilize features as soon as they are sufficiently good (i.e., once the gradient w.r.t. these features is large).

More Preliminaries

In the sequel we denote by $`(\mathbb{R}^n)^{\otimes t}`$ the space of order $`t`$ real tensors all of whose axes have dimension $`n`$. We equip it with the inner product $`{\left\langle A,B \right\rangle} = \sum_{1\le i_1,\ldots,i_t\le n}A_{i_1,\ldots,i_t}B_{i_1,\ldots,i_t}`$. For $`\mathbf{x} \in\mathbb{R}^n`$ we denote by $`\mathbf{x} ^{\otimes t}\in (\mathbb{R}^n)^{\otimes t}`$ the tensor whose $`(i_1,\ldots,i_t)`$ entry is $`\prod_{j=1}^t x_{i_j}`$. We note that $`{\left\langle \mathbf{x} ^{\otimes t},\mathbf{y}^{\otimes t} \right\rangle} = {\left\langle \mathbf{x} ,\mathbf{y} \right\rangle}^t`$.
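As a quick numerical sanity check of the identity $`{\left\langle \mathbf{x} ^{\otimes t},\mathbf{y}^{\otimes t} \right\rangle} = {\left\langle \mathbf{x} ,\mathbf{y} \right\rangle}^t`$, the following sketch (using NumPy; the helper name `tensor_power` and the dimensions are our own choices, not from the paper) builds the order-$`t`$ tensors explicitly and compares the two sides.

```python
# Sketch: numerically check <x^{āŠ—t}, y^{āŠ—t}> = <x, y>^t for small n and t.
import numpy as np

def tensor_power(x, t):
    """Order-t tensor whose (i_1,...,i_t) entry is prod_j x_{i_j}."""
    out = np.array(1.0)
    for _ in range(t):
        out = np.multiply.outer(out, x)
    return out

rng = np.random.default_rng(0)
n, t = 4, 3
x, y = rng.standard_normal(n), rng.standard_normal(n)

lhs = np.sum(tensor_power(x, t) * tensor_power(y, t))  # entrywise inner product
rhs = np.dot(x, y) ** t
print(lhs, rhs)  # the two numbers should agree up to floating-point error
```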

Concentration of Measure

We will use the Chernoff and Hoeffding inequalities:

Lemma 17 (Hoeffding). Let $`X_1,\ldots,X_q\in [-B,B]`$ be i.i.d. with mean $`\mu`$. Then, for any $`\epsilon > 0`$ we have

``` math
\Pr\left(\left|\frac{1}{q}\sum_{i=1}^q  X_i - \mu\right|\ge \epsilon\right)\le 2e^{-\frac{q\epsilon^2}{2B^2}}
```

Lemma 18 (Chernoff). Let $`X_1,\ldots,X_q\in \{0,1\}`$ be i.i.d. with mean $`\mu`$. Then, for any $`0\le\epsilon\le \mu`$ we have

``` math
\Pr\left(\left|\frac{1}{q}\sum_{i=1}^q  X_i - \mu\right|\ge \epsilon\right)\le 2e^{-\frac{q\epsilon^2}{3\mu}}
```

We will also need the following version of Chernoff's bound.

Lemma 19. Let $`X_1,\ldots,X_q\in \{-1,1,0\}`$ be i.i.d. random variables with mean $`\mu`$. Then for $`\epsilon\le \frac{\min\left(\Pr(X_i= 1),\Pr(X_i= -1)\right)}{2|\mu|}`$ we have $`\Pr\left(\left|\frac{1}{q |\mu|}\sum_{i=1}^q X_i-\frac{\mu}{|\mu|}\right|\ge \epsilon\right)\le 4e^{-\frac{q\epsilon^2|\mu|^2}{12\Pr(X_i\ne 0)}}`$.

Proof. (of Lemma 19) Let $`X^+_i = \max(X_i,0)`$ and $`\mu_+=\mathbb{E}X^+_i= \Pr(X_i=1)`$. Similarly, let $`X^-_i = \max(-X_i,0)`$ and $`\mu_-=\mathbb{E}X^-_i= \Pr(X_i=-1)`$. By the Chernoff bound (Lemma 18) we have for $`0\le \delta\le 1`$

``` math
\Pr\left(\left|\frac{1}{q}\sum_{i=1}^q X^+_i-\mu_+\right|\ge \delta\mu_+\right)\le 2e^{-\frac{q\delta^2\mu_+}{3}}
```

Hence,

``` math
\Pr\left(\left|\frac{1}{q |\mu|}\sum_{i=1}^q X^+_i-\frac{\mu_+}{|\mu|}\right|\ge \delta\frac{\mu_+}{|\mu|}\right)\le 2e^{-\frac{q\delta^2\mu_+}{3}}
```

Defining $`\epsilon = \delta\frac{\mu_+}{|\mu|}`$ we get for $`\epsilon\le \frac{\mu_+}{|\mu|}`$

``` math
\Pr\left(\left|\frac{1}{q |\mu|}\sum_{i=1}^q X^+_i-\frac{\mu_+}{|\mu|}\right|\ge \epsilon\right)\le 2e^{-\frac{q\epsilon^2|\mu|^2}{3\mu_+}} \le 2e^{-\frac{q\epsilon^2|\mu|^2}{3\Pr(X_i \ne 0)}}
```

A similar argument implies that for $`\epsilon\le \frac{\mu_-}{|\mu|}`$ we have

``` math
\Pr\left(\left|\frac{1}{q |\mu|}\sum_{i=1}^q X^-_i-\frac{\mu_-}{|\mu|}\right|\ge \epsilon\right) \le 2e^{-\frac{q\epsilon^2|\mu|^2}{3\Pr(X_i \ne 0)}}
```

As a result, for $`\epsilon\le \frac{\min\left(\mu_+,\mu_-\right)}{2|\mu|}`$ we have

``` math
\begin{eqnarray*}
        \Pr\left(\left|\frac{1}{q |\mu|}\sum_{i=1}^q X_i-\frac{\mu}{|\mu|}\right|\ge \epsilon\right) &\le& \Pr\left(\left|\frac{1}{q |\mu|}\sum_{i=1}^q X^+_i-\frac{\mu_+}{|\mu|}\right|\ge \frac{\epsilon}{2}\right) + \Pr\left(\left|\frac{1}{q |\mu|}\sum_{i=1}^q X^-_i-\frac{\mu_-}{|\mu|}\right|\ge \frac{\epsilon}{2}\right)
        \\
        &\le & 4e^{-\frac{q\epsilon^2|\mu|^2}{12\Pr(X_i\ne 0)}}
\end{eqnarray*}
```

ā—»
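As an illustration only (the distribution parameters and sample sizes below are our own choices and are not part of the paper), the following sketch simulates Lemma 19: it draws i.i.d. variables in $`\{-1,0,1\}`$ and estimates how often the normalized sum $`\frac{1}{q|\mu|}\sum_i X_i`$ deviates from $`\mu/|\mu|`$ by more than $`\epsilon`$.

```python
# Sketch: Monte Carlo illustration of Lemma 19 for X_i in {-1, 0, 1}.
# The estimator sum(X_i) / (q|mu|) should concentrate around sign(mu) as q grows.
import numpy as np

rng = np.random.default_rng(1)
p_plus, p_minus = 0.06, 0.04             # Pr(X=1), Pr(X=-1); Pr(X=0) = 0.90
mu = p_plus - p_minus                    # mean of X_i
eps, trials = 0.25, 2_000

# Lemma 19 requires eps <= min(Pr(X=1), Pr(X=-1)) / (2|mu|)
assert eps <= min(p_plus, p_minus) / (2 * abs(mu))

for q in [2_000, 20_000, 200_000]:
    counts = rng.multinomial(q, [p_plus, p_minus, 1 - p_plus - p_minus], size=trials)
    sums = counts[:, 0] - counts[:, 1]   # sum of the X_i in each trial
    estimate = sums / (q * abs(mu))
    failure = np.mean(np.abs(estimate - np.sign(mu)) >= eps)
    print(f"q = {q:>7}: empirical Pr(|estimate - sign(mu)| >= {eps}) = {failure:.4f}")
```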

Misc Lemmas

We will use the following asymptotics of binomial coefficients, which follows from Stirling's approximation.

Lemma 20. We have $`\frac{\binom{2k}{k}}{2^{2k}}\sim \frac{1}{\sqrt{\pi k}}`$
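A quick numerical check of Lemma 20 (our own illustration, using only the Python standard library) compares $`\frac{\binom{2k}{k}}{2^{2k}}`$ with $`\frac{1}{\sqrt{\pi k}}`$ for a few values of $`k`$.

```python
# Sketch: compare the central binomial ratio with its Stirling approximation.
import math

for k in [5, 50, 500, 5000]:
    exact = math.comb(2 * k, k) / 4**k
    approx = 1.0 / math.sqrt(math.pi * k)
    print(f"k = {k:>5}: exact = {exact:.6f}, 1/sqrt(pi*k) = {approx:.6f}, ratio = {exact / approx:.4f}")
```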

We will also need the following approximation of the sign function using polynomials.

Lemma 21. Let $`0<\xi<1`$ and $`\epsilon>0`$. There is a polynomial $`p:\mathbb{R}\to\mathbb{R}`$ such that

  • $`p([-1,1])\subseteq [-1,1]`$

  • For any $`x\in [-1,1]\setminus [-\xi,\xi]`$ we have $`|p(x)-\mathrm{sign}(x)|\le \epsilon`$.

  • $`\deg(p) = O\left(\frac{\log(1/\epsilon)}{\xi}\right)`$

  • $`p`$’s coefficients are all bounded by $`2^{O\left(\frac{\log(1/\epsilon)}{\xi}\right)}`$

The existence of a polynomial that satisfies the first three properties is shown in . The bound on the coefficients (the last item) follows from Lemma 2.8 in (see also here). Finally, we will use the following bound on the coefficient norm of the composition of a polynomial with a linear function.

Lemma 22. Fix a degree $`K`$ polynomial $`p:\mathbb{R}^n\to \mathbb{R}`$ and $`A\in M_{n,m}`$ whose rows have Euclidean norm at most $`R`$. Define $`q(\mathbf{x} )= p(A\mathbf{x} )`$. Then, $`\|q\|_\mathrm{co}\le \|p\|_\mathrm{co}R^K(n+1)^{K/2}`$.

Proof. Let $`\mathbf{a}_i`$ be the $`i`$'th row of $`A`$. Denote $`p(\mathbf{x} )= \sum_{\alpha\in \{0,\ldots,K\}^{n},\|\alpha\|_1\le K} b_\alpha \mathbf{x} ^\alpha`$ and $`e_\alpha(\mathbf{x} ) = \prod_{i=1}^n {\left\langle \mathbf{a}_i,\mathbf{x} \right\rangle}^{\alpha_i}`$. We have $`q = \sum_{\alpha\in \{0,\ldots,K\}^{n},\|\alpha\|_1\le K} b_\alpha e_\alpha`$. Hence,

``` math
\|q\|_\mathrm{co}\le \sum_{\alpha\in \{0,\ldots,K\}^{n},\|\alpha\|_1\le K} |b_\alpha | \cdot \|e_\alpha\| \stackrel{\text{C.S.}}{\le} \|p\|_\mathrm{co}\cdot \sqrt{\sum_{\alpha\in \{0,\ldots,K\}^{n},\|\alpha\|_1\le K} \|e_\alpha\|^2}
```

Finally

``` math
\|e_\alpha\|^2 = \left\|\mathbf{a}^{\otimes\alpha_1}_1\otimes\ldots\otimes\mathbf{a}_n^{\otimes\alpha_n}\right\|^2 = \prod_{i=1}^n \|\mathbf{a}_i\|^{2\alpha_i}\le R^{2K}
```

ā—»

A Generalization Result

It is well established that for ā€œnicely behavedā€ function classes in which functions are defined by a vector of parameters, the sample complexity is proportional to the number of parameters. For instance, a function class of the form $`{\cal F}= \{\mathbf{x} \mapsto F(\mathbf{w},\mathbf{x} ) : \mathbf{w}\in[-B,B]^p\}`$ for a function $`F`$ that is $`L`$-Lipschitz in the first argument has realizable large-margin sample complexity of $`\tilde O\left(\frac{p}{\epsilon}\right)`$. To be more precise, if there is a function in $`{\cal F}`$ with $`\gamma`$-error $`0`$, then any algorithm that is guaranteed to return a function with empirical $`\gamma`$-error $`0`$ enjoys the aforementioned sample complexity guarantee. We next slightly extend this fact, allowing $`F`$ to be random and allowing the algorithm to fail with some small probability.

Lemma 23. Suppose that $`{\cal F}\subset (\mathbb{R}^{n})^{\cal X}`$ is a random function class such that

  • There is a random function $`F:[-B,B]^p\times{\cal X}\to \mathbb{R}^n`$ such that $`{\cal F}= \{\mathbf{x} \mapsto F(\mathbf{w},\mathbf{x} ) : \mathbf{w}\in[-B,B]^p\}`$

  • W.p. $`1-\delta_1`$, for any $`\mathbf{x} \in{\cal X}`$, $`\mathbf{w}\mapsto F(\mathbf{w},\mathbf{x} )`$ is $`L`$-Lipschitz w.r.t.Ā the $`\ell^\infty`$ norm.

Let $`{\cal A}`$ be an algorithm, and assume that for some $`\mathbf{f}^*:{\cal X}\to\{\pm 1\}^n`$, $`{\cal A}`$ has the property that on any $`m`$-point sample $`S`$ labeled by $`\mathbf{f}^*`$, it returns $`\hat\mathbf{f}\in{\cal F}`$ with $`\mathrm{Err}_{S,\gamma}(\hat\mathbf{f})=0`$ w.p. $`1-\delta_2`$ (where the probability is over the randomness of $`F`$ and the internal randomness of $`{\cal A}`$). Then if $`S`$ is an i.i.d. sample labeled by $`\mathbf{f}^*`$ we have

  • $`\mathrm{Err}_{{\cal D}}(\hat\mathbf{f})\le\epsilon`$ w.p. at least $`1-(LB/\gamma)^{O(p)}(1-\epsilon)^m-\delta_1-\delta_2`$

  • $`\mathbb{E}_S\mathrm{Err}_{{\cal D}}(\hat\mathbf{f}) \le O\left(\frac{p\ln(LB/\gamma) + \ln(m)}{m}\right) + \delta_1+\delta_2`$

Proof. (sketch) For $`\hat\mathbf{f}:{\cal X}\to \mathbb{R}^n`$ we define

``` math
\mathrm{Err}_{{\cal D},\gamma}(\hat{\mathbf{f}})=\Pr_{\mathbf{x} \sim{\cal D}}\left(\exists i\in [n]\text{ s.t. }\hat f_{i}(\mathbf{x} )\cdot f_{i}(\mathbf{x} ) < \gamma \right)
```

It is not hard to see that w.p. $`1-\delta_1`$ there is $`\tilde{\cal F}\subseteq{\cal F}`$ of size $`N=(LB/\gamma)^{O(p)}`$ such that for any $`\mathbf{g}\in {\cal F}`$ there is $`\tilde \mathbf{g}\in\tilde{\cal F}`$ such that

``` math
\forall\mathbf{x} \in{\cal X},\;\|\mathbf{g}(\mathbf{x} )-\tilde \mathbf{g}(\mathbf{x} )\|_\infty \le \frac{\gamma}{2}
```

Let $`A`$ be the event that such $`\tilde{\cal F}`$ exists, that $`{\cal A}`$ returns a function in $`{\cal F}`$ with $`\mathrm{Err}_{S,\gamma}(\hat \mathbf{f})=0`$, and that for any $`\tilde \mathbf{g}\in \tilde {\cal F}`$ with $`\mathrm{Err}_{{\cal D},\gamma/2}(\tilde\mathbf{g})\ge \epsilon`$ we have $`\mathrm{Err}_{S,\gamma/2}(\tilde\mathbf{g}) >0`$. We have that the probability of $`A`$ is at least $`1-\delta_1-\delta_2-N(1-\epsilon)^m`$. Given $`A`$ we have for any $`\mathbf{g}\in{\cal F}`$,

``` math
\mathrm{Err}_{\cal D}(\mathbf{g})\ge \epsilon \Rightarrow \mathrm{Err}_{{\cal D},\gamma/2}(\tilde\mathbf{g})\ge \epsilon \Rightarrow \mathrm{Err}_{S,\gamma/2}(\tilde\mathbf{g}) > 0 \Rightarrow \mathrm{Err}_{S,\gamma}(\mathbf{g}) > 0
```

Thus, the probability that $`{\cal A}`$ returns a function with error $`\ge \epsilon`$ is at most $`N(1-\epsilon)^m +\delta_1+\delta_2`$, which proves the first part of the lemma. As for the second part, we note that we have

``` math
\mathbb{E}_S\mathrm{Err}_{{\cal D}}(\hat\mathbf{f}) \le \mathbb{E}_S[\mathrm{Err}_{{\cal D}}(\hat\mathbf{f})|A] + \Pr(A^\complement) \le \epsilon + N(1-\epsilon)^m +\delta_1+\delta_2
```

Optimizing over $`\epsilon`$ we get $`\mathbb{E}_S\mathrm{Err}_{\cal D}(\hat\mathbf{f}) \le \frac{\ln(Nm)}{m} + \delta_1+\delta_2`$, which proves the second part. ā—»

Kernels

The results we state next can be found in Chapter 2 of . Let $`{\cal X}`$ be a set. A kernel is a function $`k:{\cal X}\times {\cal X}\to\mathbb{R}`$ such that for every $`x_1,\ldots,x_m\in {\cal X}`$ the matrix $`\{k(x_i,x_j)\}_{i,j}`$ is positive semi-definite. A kernel space is a Hilbert space $`{\cal H}`$ of functions from $`{\cal X}`$ to $`\mathbb{R}`$ such that for every $`x\in {\cal X}`$ the linear functional $`f\in{\cal H}\mapsto f(x)`$ is bounded. The following theorem describes a one-to-one correspondence between kernels and kernel spaces.

Theorem 24. For every kernel $`k`$ there exists a unique kernel space $`{\cal H}_k`$ such that for every $`x,x'\in {\cal X}`$, $`k(x,x') = {\left\langle k(\cdot,x),k(\cdot,x') \right\rangle}_{{\cal H}_k}`$. Likewise, for every kernel space $`{\cal H}`$ there is a kernel $`k`$ for which $`{\cal H}={\cal H}_k`$.

We denote the norm and inner product in $`{\cal H}_k`$ by $`\|\cdot\|_k`$ and $`{\left\langle \cdot,\cdot \right\rangle}_k`$. The following theorem describes a tight connection between kernels and embeddings of $`{\cal X}`$ into Hilbert spaces.

Theorem 25. A function $`k:{\cal X}\times {\cal X}\to \mathbb{R}`$ is a kernel if and only if there exists a mapping $`\Psi: {\cal X}\to{\cal H}`$ to some Hilbert space for which $`k(x,x')={\left\langle \Psi(x),\Psi(x') \right\rangle}_{{\cal H}}`$. In this case, $`{\cal H}_k = \{f_{\Psi,\mathbf{v}} \mid \mathbf{v}\in{\cal H}\}`$ where $`f_{\Psi,\mathbf{v}}(x) = {\left\langle \mathbf{v},\Psi(x) \right\rangle}_{\cal H}`$. Furthermore, $`\|f\|_{k} = \min\{\|\mathbf{v}\|_{\cal H}: f=f_{\Psi,\mathbf{v}}\}`$ and the minimizer is unique.
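To illustrate the ā€œifā€ direction of Theorem 25 concretely, the sketch below (the feature map, dimensions, and data are our own choices) defines a kernel through an explicit finite-dimensional embedding $`\Psi`$ and checks that the resulting Gram matrix is positive semi-definite.

```python
# Sketch: a kernel built from an explicit feature map, k(x, y) = <Psi(x), Psi(y)>,
# always yields PSD Gram matrices (Theorem 25, "if" direction).
import numpy as np

def Psi(x):
    # A hand-picked finite-dimensional feature map (constant, linear, and degree-2 monomials).
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 * x1, x1 * x2, x2 * x2])

def k(x, y):
    return np.dot(Psi(x), Psi(y))

rng = np.random.default_rng(2)
points = rng.uniform(-1, 1, size=(8, 2))
gram = np.array([[k(x, y) for y in points] for x in points])

eigvals = np.linalg.eigvalsh(gram)
print("smallest eigenvalue:", eigvals.min())   # should be >= 0 up to numerical error
```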

Random Features Schemes

Let $`{\cal X}`$ be a measurable space and let $`k:{\cal X}\times{\cal X}\to \mathbb{R}`$ be a kernel. A random features scheme (RFS) for $`k`$ is a pair $`(\psi,\mu)`$ where $`\mu`$ is a probability measure on a measurable space $`\Omega`$, and $`\psi:\Omega\times{\cal X}\to \mathbb{R}`$ is a measurable function, such that

``` math
\begin{equation}
\label{eq:ker_eq_inner}
\forall \mathbf{x} ,\mathbf{x} '\in{\cal X},\;\;\;\;k(\mathbf{x} ,\mathbf{x} ') =
  \mathbb{E}_{\omega\sim \mu}\psi(\omega,\mathbf{x} )\psi(\omega,\mathbf{x} ')\,.
\end{equation}
```

We often refer to $`\psi`$ (rather than $`(\psi,\mu)`$) as the RFS. We define $`\|\psi\|_\infty = \sup_{\mathbf{x} }\|\psi(\cdot,\mathbf{x} )\|_\infty`$, and say that $`\psi`$ is $`C`$-bounded if $`\|\psi\|_\infty\le C`$. The random $`q`$-embedding generated from $`\psi`$ is the random mapping

``` math
\Psi_{\boldsymbol{\omega}}(\mathbf{x} ) :=
\left(\psi({\omega_1},\mathbf{x} ),\ldots , \psi({\omega_q},\mathbf{x} ) \right) \,,
```

where $`\omega_1,\ldots,\omega_q\sim \mu`$ are i.i.d.Ā The random $`q`$-kernel corresponding to $`\Psi_{\boldsymbol{\omega}}`$ is $`k_{\boldsymbol{\omega}}(\mathbf{x} ,\mathbf{x} ') = \frac{{\left\langle \Psi_{\boldsymbol{\omega}}(\mathbf{x} ),\Psi_{\boldsymbol{\omega}}(\mathbf{x} ') \right\rangle}}{q}`$. Likewise, the random $`q`$-kernel space corresponding to $`\frac{1}{\sqrt{q}}\Psi_{\boldsymbol{\omega}}`$ is $`{\cal H}_{k_{\boldsymbol{\omega}}}`$. We next discuss approximation of functions in $`{\cal H}_k`$ by functions in $`{\cal H}_{k_{\boldsymbol{\omega}}}`$. It would be useful to consider the embedding

``` math
\begin{equation}
\label{eqn:psi-embedding}
\mathbf{x} \mapsto\Psi^\mathbf{x} \; \mbox{ where } \;
  \Psi^\mathbf{x} :=\psi(\cdot,\mathbf{x} )\in L^2(\Omega) \,.
\end{equation}
```

FromĀ [eq:ker_eq_inner] it holds that for any $`\mathbf{x} ,\mathbf{x} '\in{\cal X}`$, $`k(\mathbf{x} ,\mathbf{x} ') = {\left\langle \Psi^\mathbf{x} ,\Psi^{\mathbf{x} '} \right\rangle}_{L^2(\Omega)}`$. In particular, from TheoremĀ 25, for every $`f\in{\cal H}_k`$ there is a unique function $`\check{f}\in L^2(\Omega)`$ such that

``` math
\begin{equation}
\label{eq:f_norm_eq}
\|\check{f}\|_{L^2(\Omega)} = \|f\|_{k}
\end{equation}
```

and for every $`\mathbf{x} \in{\cal X}`$,

``` math
\begin{equation}
\label{eq:f_x_as_inner}
f(\mathbf{x} ) = {\left\langle \check{f},\Psi^\mathbf{x}  \right\rangle}_{L^2(\Omega)} =  \mathbb{E}_{\omega\sim\mu}\check{f}(\omega)\psi(\omega,\mathbf{x} )\,.
\end{equation}
```

Let us denote $`f_{\boldsymbol{\omega}}(\mathbf{x} ) = \frac{1}{q}\sum_{i=1}^q {\left\langle \check{f}(\omega_i),\psi(\omega_i,\mathbf{x} ) \right\rangle}`$. FromĀ [eq:f_x_as_inner] we have that $`\mathbb{E}_{\boldsymbol{\omega}}\left[f_{\boldsymbol{\omega}}(\mathbf{x} )\right] = f(\mathbf{x} )`$. Furthermore, for every $`\mathbf{x}`$, the variance of $`f_{\boldsymbol{\omega}}(\mathbf{x} )`$ is at most

``` math
\begin{eqnarray*}
\frac{1}{q}\mathbb{E}_{\omega\sim\mu}
  \left|\check{f}(\omega)\psi(\omega,\mathbf{x} )\right|^2
&\le &
\frac{\|\psi\|_\infty^2}{q}\mathbb{E}_{\omega\sim\mu}
  \left|\check{f}(\omega)\right|^2 
\\
&=& \frac{\|\psi\|_\infty^2\|f\|^2_{k}}{q}\,.
\end{eqnarray*}
```

An immediate consequence is the following corollary.

Corollary 26 (Function Approximation). For all $`\mathbf{x} \in{\cal X}`$, $`\mathbb{E}_{\boldsymbol{\omega}}|f(\mathbf{x} ) - f_{\boldsymbol{\omega}}(\mathbf{x} )|^2 \le \frac{\|\psi\|_\infty^2\|f\|^2_{k}}{q}`$.

Now, if $`{\cal D}`$ is a distribution on $`{\cal X}`$ we get that

``` math
\mathbb{E}_{\boldsymbol{\omega}}\|f - f_{\boldsymbol{\omega}}\|_{2,{\cal D}} \stackrel{\text{Jensen}}{\le}  \sqrt{\mathbb{E}_{\boldsymbol{\omega}}\|f - f_{\boldsymbol{\omega}}\|^2_{2,{\cal D}}}   = \sqrt{\mathbb{E}_{\boldsymbol{\omega}}\mathbb{E}_{\mathbf{x} \sim{\cal D}}|f(\mathbf{x} ) - f_{\boldsymbol{\omega}}(\mathbf{x} )|^2} = \sqrt{\mathbb{E}_\mathbf{x} \mathbb{E}_{\boldsymbol{\omega}}|f(\mathbf{x} ) - f_{\boldsymbol{\omega}}(\mathbf{x} )|^2} \le \frac{\|\psi\|_\infty\|f\|_{k}}{\sqrt{q}}
```

Thus, $`O\left(\frac{\|f\|_k^2}{\epsilon^2}\right)`$ random features suffice to guarantee that $`\mathbb{E}_{\boldsymbol{\omega}}\|f - f_{\boldsymbol{\omega}}\|_{2,{\cal D}}\le\epsilon`$. In this paper such an $`\ell^2`$ guarantee will not suffice, and we will need an approximation of functions in $`{\cal H}_k`$ by functions in $`{\cal H}_{k_{\boldsymbol{\omega}}}`$ w.r.t. the stronger $`\ell^\infty`$ norm. We next show this can be obtained, unfortunately with a quadratic growth in the required number of features. For $`z\in\mathbb{R}`$ we define $`{\left\langle z \right\rangle}_B = \begin{cases}z & |z|\le B\\0&\text{otherwise}\end{cases}`$. We will consider the following truncated version of $`f_{\boldsymbol{\omega}}`$

``` math
f_{{\boldsymbol{\omega}},B}(\mathbf{x} ) = \frac{1}{q}\sum_{i=1}^q
  {\left\langle \check{f}(\omega_i) \right\rangle}_B\cdot\psi(\omega_i,\mathbf{x} )
```

Now, if $`\psi`$ is $`C`$-bounded we have that $`f_{{\boldsymbol{\omega}},B}(\mathbf{x} )`$ is an average of $`q`$ i.i.d. $`CB`$-bounded random variables. By Hoeffding's inequality, we have

``` math
\begin{equation}
\label{eq:trunc_f_vomeg}
    \Pr\left(\left|f_{{\boldsymbol{\omega}},B}(\mathbf{x} )-\mathbb{E}_{{\boldsymbol{\omega}}'} f_{{\boldsymbol{\omega}}',B}(x)\right|>\epsilon/2\right) \le 2e^{-\frac{q\epsilon^2}{8B^2C^2}}
\end{equation}
```

Likewise, we have

``` math
\begin{eqnarray*}
    \left|f(x) -  \mathbb{E}_{{\boldsymbol{\omega}}'} f_{{\boldsymbol{\omega}}',B}(x)\right| &=& \left|\mathbb{E}\left(f_{\boldsymbol{\omega}}(x) -  f_{{\boldsymbol{\omega}},B}(x)\right)\right|
    \\
    &=& \left|\mathbb{E}\left(\check{f}(\omega) - {\left\langle \check{f}(\omega) \right\rangle}_B\right)\cdot\psi(\omega,\mathbf{x} )\right|
    \\
    &=& \left|\mathbb{E}1_{|\check{f}(\omega)|>B} \check{f}(\omega)\psi(\omega,\mathbf{x} )\right|
    \\
    &\le& \sqrt{\Pr (|\check{f}(\omega)|>B) \mathbb{E}\left(\check{f}(\omega)\psi(\omega,\mathbf{x} )\right)^2}
    \\
    &\le& \|\psi\|_\infty\sqrt{\Pr (|\check{f}(\omega)|>B) \mathbb{E}\left(\check{f}(\omega)\right)^2}
    \\
    &\stackrel{\text{Markov}}{\le}& \frac{\|\psi\|_\infty\|f\|^2_{k}}{B}
\end{eqnarray*}
```

We get that

Lemma 27. Let $`f\in{\cal H}_k`$ with $`\|f\|_{k}\le M`$ and assume that $`\|\psi\|_\infty\le C`$. For $`B = \frac{2CM^2}{\epsilon}`$ we have

``` math
\Pr\left(\left|f_{{\boldsymbol{\omega}},B}(\mathbf{x} )-f(\mathbf{x} )\right|>\epsilon\right) \le 2e^{-\frac{q\epsilon^4}{32M^4C^4}}
```

Furthermore, the norm of the weight vector defining $`f_{{\boldsymbol{\omega}},B}`$, i.e. $`\mathbf{w}= \frac{1}{q}\left({\left\langle \check{f}(\omega_1) \right\rangle}_B,\ldots,{\left\langle \check{f}(\omega_q) \right\rangle}_B\right)`$, satisfies

``` math
\begin{equation*}
   \|\mathbf{w}\|\le  \frac{2CM^2}{\epsilon\sqrt{q}}
\end{equation*}
```
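To make the random-features estimates above concrete, here is a small simulation in the spirit of Corollary 26. All concrete choices (the activation used for $`\psi`$, the witness $`\check f`$, the distribution of $`\omega`$, and the sample sizes) are ours and are for illustration only: we fix $`\check f`$, take $`f(\mathbf{x} )=\mathbb{E}_{\omega}\check f(\omega)\psi(\omega,\mathbf{x} )`$, and watch the error of $`f_{\boldsymbol{\omega}}(\mathbf{x} )`$ decay at roughly the $`1/\sqrt{q}`$ rate.

```python
# Sketch: random features f_omega(x) = (1/q) * sum_i fcheck(omega_i) * psi(omega_i, x)
# approximate f(x) = E_omega[fcheck(omega) * psi(omega, x)] at the O(1/sqrt(q)) rate.
import numpy as np

rng = np.random.default_rng(3)
n = 5

def sample_omega(q):
    # omega = (w, b) with w ~ N(0, I/n) and b ~ N(0, 1) (our choice of mu).
    return rng.standard_normal((q, n)) / np.sqrt(n), rng.standard_normal(q)

def psi(W, b, x):
    # Bounded random features: random tanh neurons (our choice); W has shape (q, n).
    return np.tanh(W @ x + b)

def fcheck(W, b):
    # A square-integrable witness fcheck(omega) (our choice).
    return np.sin(3.0 * b) + W[:, 0]

x = rng.uniform(-1, 1, size=n)

# Reference value f(x), estimated once with a very large sample of omega.
W_ref, b_ref = sample_omega(2_000_000)
f_x = np.mean(fcheck(W_ref, b_ref) * psi(W_ref, b_ref, x))

for q in [100, 1_000, 10_000, 100_000]:
    sq_errs = []
    for _ in range(200):                      # fresh draws of omega_1, ..., omega_q
        W, b = sample_omega(q)
        f_q = np.mean(fcheck(W, b) * psi(W, b, x))
        sq_errs.append((f_q - f_x) ** 2)
    print(f"q = {q:>6}: sqrt(E|f(x) - f_omega(x)|^2) ā‰ˆ {np.sqrt(np.mean(sq_errs)):.5f}")
```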

Examples of Hierarchies and Proof of Theorem 4

Fix $`{\cal X}\subset [-1,1]^n`$, a proximity mapping $`\mathbf{e}:G\to G^w`$, and a collection of sets $`{\cal L}= \{L_1,\ldots,L_r\}`$ such that $`L_1\subseteq L_2\subseteq\ldots\subseteq L_r = [n]`$. So far, we have seen one formal example of a hierarchy: in the non-ensemble setting (i.e. $`w=|G|=1`$), Example 2 shows that if any label depends on $`K`$ simpler labels, and the labels in the first level are $`(K,1)`$-PTFs of the input, then $`{\cal L}`$ is an $`(r,K,1)`$-hierarchy. In this section we expand our set of examples. We first show (Lemma 28) that if $`({\cal L},\mathbf{e})`$ is an $`(r,K,M)`$-hierarchy then it is an $`(r,K,2M,B,\xi)`$-hierarchy for suitable $`B`$ and $`\xi`$. Then, in section 8.1, we consider in more detail the case where each label depends on a few simpler labels, in a few locations, and show that the parameters obtained from Lemma 28 can be improved in this case. Finally, in section 8.2 we prove Theorem 4, showing that if all the labels are ā€œrandom snippetsā€ from a given circuit, and there are enough of them, then the target function has a low-complexity hierarchy.

Lemma 28. Any $`(r,K,M)`$-hierarchy of $`\mathbf{f}^*:{\cal X}^G\to \{\pm 1\}^{n,G}`$ is also an $`(r,K,2M,B,\xi)`$-hierarchy for $`\xi = \frac{1}{2(wn+1)^{\frac{K+1}{2}}KM}`$ and $`B = 2(w\max(n,d)+1)^{K/2}M`$.

Lemma 28 follows immediately from the definition of hierarchy and the following lemma.

Lemma 29. Any $`(K,M)`$-PTF $`f:{\cal X}\to \{\pm 1\}`$ is a $`(K,2M,B,\xi)`$-PTF for $`\xi = \frac{1}{2(n+1)^{\frac{K+1}{2}}KM}`$ and $`B = 2(n+1)^{K/2}M`$.

Lemma 29 is implied by Lemmas 30 and 31.

Lemma 30. Let $`p:\mathbb{R}^n\to\mathbb{R}`$ be a degree $`K`$ polynomial. Then $`p`$ is $`((n+1)^{\frac{K+1}{2}}K\|p\|_\mathrm{co})`$-Lipschitz in $`[-1,1]^{n}`$ w.r.t. the $`\|\cdot\|_\infty`$ norm and satisfies $`|p(\mathbf{x} )|\le (n+1)^{K/2}\|p\|_\mathrm{co}`$ for any $`\mathbf{x} \in [-1,1]^{n}`$.

Proof. Denote $`p(\mathbf{x} )= \sum_{\alpha\in \{0,\ldots,K\}^{n},\|\alpha\|_1\le K} a_\alpha \mathbf{x} ^\alpha`$. We have

``` math
\frac{\partial p }{\partial x_i}\left(\mathbf{x} \right)=  \sum_{\alpha\in \{0,\ldots,K-1\}^{n},\|\alpha\|_1\le K-1} a_{\alpha+\mathbf{e}_i}\cdot(\alpha_i+1)\cdot \mathbf{x} ^\alpha
```

This implies that for any $`\mathbf{x} \in [-1,1]^{n}`$ we have

``` math
\begin{eqnarray*}
    \left|\frac{\partial p }{\partial x_i}\left(\mathbf{x} \right)\right| &\le & \sum_{\alpha\in \{0,\ldots,K-1\}^{n},\|\alpha\|_1\le K-1} \left|a_{\alpha+\mathbf{e}_i}\cdot(\alpha_i+1) \cdot\mathbf{x} ^\alpha\right|
    \\
    &\le & K\sum_{\alpha\in \{0,\ldots,K-1\}^{n},\|\alpha\|_1\le K-1} \left|a_{\alpha+\mathbf{e}_i}\right|
    \\
    &\le & K\sqrt{(n+1)^{K-1}}\|p\|_\mathrm{co}
\end{eqnarray*}
```

Hence, $`\|\nabla p(\mathbf{x} )\|_1\le nK\sqrt{(n+1)^{K-1}}\|p\|_\mathrm{co}\le K\sqrt{(n+1)^{K+1}}\|p\|_\mathrm{co}`$. This shows that $`p`$ is $`((n+1)^{\frac{K+1}{2}}K\|p\|_\mathrm{co})`$-Lipschitz in $`[-1,1]^{n}`$ w.r.t. the $`\|\cdot\|_\infty`$ norm. Likewise, for any $`\mathbf{x} \in [-1,1]^n`$ we have

``` math
p(\mathbf{x} ) \le \sum_{\alpha\in \{0,\ldots,K\}^{n},\|\alpha\|_1\le K}| a_\alpha| \le 2(n+1)^{K/2}\|p\|_\mathrm{co}
```

ā—»

Lemma 31. Assume that $`f:{\cal X}\to \{\pm 1\}`$ is a $`\left(K,M\right)`$-PTF, as witnessed by a polynomial $`p:\mathbb{R}^{n}\to\mathbb{R}`$ that is $`L`$-Lipschitz w.r.t. $`\|\cdot\|_\infty`$.

  • If $`p`$ is bounded by $`B`$ in $`\cup_{\mathbf{x} \in {\cal X}}{\cal B}_{1/(2L)}(\mathbf{x} )`$, then $`f`$ is a $`\left(K,2M,2B,\frac{1}{2L}\right)`$-PTF witnessed by $`2p`$.

  • If $`p`$ is bounded by $`B`$ in $`{\cal X}`$, then $`f`$ is a $`\left(K,2M,2B+1,\frac{1}{2L}\right)`$-PTF witnessed by $`2p`$.

Proof. We first note that the second item follows from the first. Indeed, if $`p`$ is bounded by $`B`$ in $`{\cal X}`$ then $`p`$ is bounded by $`B+1/2`$ in $`\cup_{\mathbf{x} \in {\cal X}}{\cal B}_{1/(2L)}(\mathbf{x} )`$. To prove the first item we need to show that for any $`\mathbf{x} \in {\cal X}`$ and $`\tilde\mathbf{x} \in{\cal B}_{1/(2L)}(\mathbf{x} )`$ we have

``` math
2B\ge 2p(\tilde\mathbf{x} )f(\mathbf{x} ) \ge 1
```

The left inequality is clear. For the right inequality we assume that $`f(\mathbf{x} )=1`$ (the other case is similar). Since $`\|\mathbf{x} -\tilde\mathbf{x} \|_\infty\le \frac{1}{2L}`$ we have

``` math
\begin{eqnarray*}
p(\tilde\mathbf{x} ) &\ge& p(\mathbf{x} ) - |p(\tilde\mathbf{x} )-p(\mathbf{x} )|
\\
&\ge & p(\mathbf{x} ) - L\cdot\|\mathbf{x} -\tilde\mathbf{x} \|_\infty
\\
&\ge&1- \frac{L}{2L}
\\
&=&\frac{1}{2}
\end{eqnarray*}
```

ā—»

Each Label Depends on $`O(1)`$ Simpler Labels

Assume now that $`{\cal X}\subseteq\{\pm 1\}^d`$, and that any label $`j\in L_i`$ depends on at most $`K`$ labels from $`L_{i-1}`$ in at most $`K`$ locations (or on at most $`K`$ input locations if $`i=1`$). That is, for any $`j\in L_i`$, there is a function $`\tilde f_j :\{\pm 1\}^{wn}\to \{\pm 1\}`$ (or $`\tilde f_j :\{\pm 1\}^{dw}\to \{\pm 1\}`$ if $`i=1`$) that depends on at most $`K`$ coordinates from $`\{kn+l: 0\le k\le w-1,\;l\in L_{i-1}\}`$ (from $`[dw]`$ if $`i=1`$), for which the following holds. For any $`g\in G`$, $`f^*_{j,g}(\vec \mathbf{x} ) = \tilde f_j(E_{g}(\mathbf{f}^*(\vec\mathbf{x} )))`$ (or $`f^*_{j,g}(\vec \mathbf{x} ) = \tilde f_j(E_{g}(\vec\mathbf{x} ))`$ if $`i=1`$).

As in Example 2, since any Boolean function depending on $`K`$ variables is a $`(K,1)`$-PTF, we have that the functions $`\tilde f_j`$ are $`(K,1)`$-PTFs, implying that $`({\cal L},\mathbf{e})`$ is an $`(r,K,1)`$-hierarchy. Lemma 28 implies that $`({\cal L},\mathbf{e})`$ is an $`(r,K,2,B,\xi)`$-hierarchy for $`\xi = \frac{1}{2K(wn+1)^{(K+1)/2}}`$ and $`B=2(w\max(n,d)+1)^{K/2}`$. The following lemma shows that this can be substantially improved.

Lemma 32. Any Boolean function depending on $`K`$ coordinates is a $`(K,2,3,\xi)`$-PTF for $`\xi = \frac{1}{K2^{(K+2)/2}}`$. As a result, $`({\cal L},\mathbf{e})`$ is an $`(r,K,2,3,\xi)`$-hierarchy.

Lemma 32 follows from the following Lemma together with Lemma 31

Lemma 33. Let $`f:\{\pm 1\}^K\to \{\pm 1\}`$ and let $`F(\mathbf{x} )= \sum_{A\subseteq[K]}a_A\mathbf{x} ^A`$ be its standard multilinear extension. Then, $`F`$ is $`(K 2^{K/2})`$-Lipschitz in $`[-1,1]^K`$ w.r.t.Ā the $`\|\cdot\|_\infty`$ norm.

Proof. For $`\mathbf{x} \in [-1,1]^K`$ we have

``` math
\left|\frac{\partial F}{\partial x_i}(\mathbf{x} )\right| = \left|\sum_{i\in A\subseteq[K]}a_A\mathbf{x} ^{A\setminus\{i\}}\right| \le \sum_{i\in A\subseteq[K]}\left|a_A\right|
```

Hence,

``` math
\|\nabla F(\mathbf{x} )\|_1 \le \sum_{A\subseteq[K]}|A|\left|a_A\right| \stackrel{\text{Cauchy Schwartz}}{\le} K 2^{K/2}
```

ā—»
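The multilinear (Fourier) extension in Lemma 33 is easy to compute for small $`K`$. The sketch below (our own illustration; the random Boolean function and sampling scheme are arbitrary choices) builds $`F`$ from a truth table, verifies that it agrees with $`f`$ on the cube, and compares $`\|\nabla F(\mathbf{x} )\|_1`$ at random points of $`[-1,1]^K`$ with the $`K2^{K/2}`$ bound.

```python
# Sketch: Fourier/multilinear extension F(x) = sum_A a_A * prod_{i in A} x_i of a
# Boolean function f: {-1,1}^K -> {-1,1}, plus a check of the K * 2^(K/2) bound.
import itertools
import numpy as np

rng = np.random.default_rng(4)
K = 4

# A random Boolean function given by its truth table over {-1,1}^K.
cube = np.array(list(itertools.product([-1.0, 1.0], repeat=K)))
truth = rng.choice([-1.0, 1.0], size=len(cube))

subsets = [A for r in range(K + 1) for A in itertools.combinations(range(K), r)]
# Fourier coefficients a_A = E_{x uniform on the cube}[ f(x) * prod_{i in A} x_i ].
coeffs = {A: np.mean(truth * np.prod(cube[:, list(A)], axis=1)) for A in subsets}

def F(x):
    return sum(a * np.prod(x[list(A)]) for A, a in coeffs.items())

def grad_l1(x):
    g = 0.0
    for i in range(K):
        partial = sum(a * np.prod(x[[j for j in A if j != i]])
                      for A, a in coeffs.items() if i in A)
        g += abs(partial)
    return g

# F agrees with f on the cube, and the gradient bound of Lemma 33 holds on [-1,1]^K.
assert all(abs(F(c) - t) < 1e-9 for c, t in zip(cube, truth))
worst = max(grad_l1(rng.uniform(-1, 1, size=K)) for _ in range(2_000))
print(f"max observed ||grad F||_1 ā‰ˆ {worst:.3f}  vs  K * 2^(K/2) = {K * 2**(K/2):.3f}")
```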

The following Lemma shows that $`\xi`$ and $`B`$ can be improved even further, at the expense of the degree and the coefficient norm.

Lemma 34. For any $`0<\xi<1`$ and $`B>1`$, any Boolean function $`f:\{\pm 1\}^K\to \{\pm 1\}`$ is a $`(K',M,B,\xi)`$-PTF for $`K' = O\left(\frac{K^2+K\log\left(\frac{B+1}{B-1}\right)}{1-\xi}\right)`$ and $`M = 2^{O\left(\frac{K^2+K\log\left(\frac{B+1}{B-1}\right)}{1-\xi}\right)}`$.

Proof. Fix $`f:\{\pm 1\}^K\to\{\pm 1\}`$. We need to show that $`f`$ is a $`(K',M,B,\xi)`$-PTF. Let $`\epsilon = \frac{B-1}{B+1}`$. By Lemma 21 there is a univariate polynomial $`q`$ of degree $`O\left(\frac{K+\log(1/\epsilon)}{1-\xi}\right)`$ such that $`q([-1,1])\subseteq [-1,1]`$, for any $`y\in [-1,1]\setminus [-1+\xi,1-\xi]`$ we have $`|q(y)-\mathrm{sign}(y)|\le \frac{\epsilon}{K2^{K/2}}`$, and the coefficients of $`q`$ are all bounded by $`2^{O\left(\frac{K+\log(1/\epsilon)}{1-\xi}\right)}`$. Consider now the polynomial $`\tilde p(\mathbf{x} ) = F(q(\mathbf{x} ))`$, where $`F`$ is the multilinear extension of $`f`$ and $`q`$ is applied coordinate-wise. It is not hard to verify that $`\deg(\tilde p)\le \deg(q)K = O\left(\frac{K^2+K\log(1/\epsilon)}{1-\xi}\right)`$ and that $`\|\tilde p\|_\mathrm{co}\le 2^{O\left(\frac{K^2+K\log(1/\epsilon)}{1-\xi}\right)}`$. Finally, fix $`\mathbf{x} \in \{\pm 1\}^K`$ and $`\tilde\mathbf{x} \in {\cal B}_\xi(\mathbf{x} )`$. Note that $`\mathbf{x} = \mathrm{sign}(\tilde\mathbf{x} )`$. Since $`F`$ is $`K2^{K/2}`$-Lipschitz w.r.t. the $`\|\cdot\|_\infty`$ norm in $`[-1,1]^K`$ (Lemma 33) we have

``` math
|\tilde p(\tilde \mathbf{x} )-f(\mathbf{x} )| = |\tilde p(\tilde \mathbf{x} )-f(\mathrm{sign}(\tilde \mathbf{x} ))|  = |F(q(\tilde \mathbf{x} ))-F(\mathrm{sign}(\tilde \mathbf{x} ))| \le \|q(\tilde \mathbf{x} )-\mathrm{sign}(\tilde \mathbf{x} )\|_\infty \le \epsilon
```

Since $`f(\mathbf{x} )\in \{\pm 1\}`$ this implies that

``` math
1+\epsilon\ge \tilde p(\tilde \mathbf{x} )f(\mathbf{x} ) \ge 1-\epsilon
```

Taking $`p(x) = \frac{1}{1-\epsilon}\tilde p(x)`$ and noting that $`B = \frac{1+\epsilon}{1-\epsilon}`$ we get

``` math
B\ge  p(\tilde \mathbf{x} )f(\mathbf{x} ) \ge 1
```

which implies that $`f`$ is a $`(K',M,B,\xi)`$-PTF. ā—»

Proof of Theorem 4

In this section we will prove (a slightly extended version of) Theorem 4. We first recall and slightly extend the setting. Fix a domain $`{\cal X}\subseteq\{\pm 1\}^d`$ and a sequence of functions $`G^i:\{\pm 1\}^d\to\{\pm 1\}^d`$ for $`1\le i\le r`$. We assume that $`G^0(\mathbf{x} ) = \mathbf{x}`$, and for any depth $`i\in [r]`$ and coordinate $`j\in [d]`$, we have

``` math
\begin{equation}
\label{eq:brain_dump_circ_def}
    \forall \mathbf{x} \in{\cal X}, \quad G^i_j(\mathbf{x} ) = p^i_j(G^{i-1}(\mathbf{x} )),
\end{equation}
```

where $`p^i_j:\{\pm 1\}^d\to\{\pm 1\}`$ is a function whose multi-linear extension is a polynomial of degree at most $`K`$. Furthermore, we assume this extension is $`L`$-Lipschitz in $`[-1,1]^d`$ with respect to the $`\ell_\infty`$ norm (if $`p^i_j`$ depends on $`K`$ coordinates, as in the problem description in section 3.1, Lemma 33 implies that this holds with $`L=K2^{K/2}`$). Fix an integer $`q`$. We assume that for every depth $`i\in [r]`$, there are $`q`$ auxiliary labels $`f^*_{i,j}`$ for $`1\le j\le q`$, each of which is a signed Majority of an odd number of components of $`G^i`$. Moreover, we assume these functions are random. Specifically, prior to learning, the labeler independently samples $`qr`$ functions such that for any $`i\in [r]`$ and $`j\in [q]`$,

``` math
\begin{equation}
\label{eq:brain_dump_target_def}
    f^*_{i,j}(\mathbf{x} ) = \mathrm{sign}\left(\sum_{l=1}^d w_l^{i,j}G^i_l(\mathbf{x} )\right),
\end{equation}
```

where the weight vectors $`\mathbf{w}^{i,j}\in \mathbb{R}^{d}`$ are independent uniform vectors chosen from

``` math
{\cal W}_{d,k} := \left\{\mathbf{w}\in \{-1,0,1\}^d : \sum_{l=1}^d |w_l|= k \right\}
```

for some odd integer $`k`$. The following theorem, which slightly extends Theorem 4, shows that if $`q\gg dL^2\log(|{\cal X}|)`$, then with high probability over the choice of $`\mathbf{f}^*`$, the target function $`\mathbf{f}^*`$ has an $`\left(r,K,O\left(kd^{K}\right),2k+1\right)`$-hierarchy.

Theorem 35. W.p. $`1-4drq|{\cal X}|e^{-\Omega\left(\frac{q}{L^2k^2d}\right)}`$, the function $`\mathbf{f}^*`$ has an $`\left(r,K,O\left(kd^{K}\right),2k+1\right)`$-hierarchy.

In order to prove Theorem 35 it is enough to show that for any $`i\in[r]`$ and $`j\in [q]`$, $`f^*_{i,j}`$ is a $`(K,O\left(kd^{K}\right),2k+1)`$-PTF of

``` math
\Psi_{i-1}(\mathbf{x} ) = (f^*_{i-1,1}(\mathbf{x} ),\ldots,f^*_{i-1,q}(\mathbf{x} ))
```

By equations [eq:brain_dump_target_def] and [eq:brain_dump_circ_def] we have

``` math
f^*_{i,j}(\mathbf{x} ) = \mathrm{sign}\left(\sum_{l=1}^d w_l^{i,j} p^i_l(G^{i-1}(\mathbf{x} ))\right) =: \mathrm{sign}\left(q(G^{i-1}(\mathbf{x} ))\right)
```

Hence, $`f^*_{i,j}`$ is a $`(K,k)`$-PTF of $`G^{i-1}`$, as witnessed by $`q`$ (note that $`1\le|q(G^{i-1}(\mathbf{x} ))|\le k`$ since $`q(G^{i-1}(\mathbf{x} ))`$ is a sum of $`k`$ numbers in $`\{\pm 1\}`$ and $`k`$ is odd; likewise, $`\|q\|_\mathrm{co}\le \sum_{l=1}^d |w_l^{i,j}|\cdot\|p^i_l\|_\mathrm{co}\stackrel{\|p^i_l\|_\mathrm{co}\le 1}{\le} \sum_{l=1}^d |w_l^{i,j}|=k`$). Since $`q`$ is $`(kL)`$-Lipschitz and bounded by $`k`$, Lemma 31 implies that $`f^*_{i,j}`$ is a $`(K,k,2k+1,1/(2kL))`$-PTF of $`G^{i-1}`$. Hence, Theorem 35 follows from the following lemma and a union bound on the $`rq`$ different $`f^*_{i,j}`$.

Lemma 36. Let $`f:{\cal X}\to \{\pm 1\}`$ be a $`(K,M,B,\xi)`$-PTF and let $`\mathbf{w}^1,\ldots,\mathbf{w}^q\in {\cal W}_{d,k}`$ be independent and uniform. Define $`\psi_i(\mathbf{x} ) = \mathrm{sign}({\left\langle \mathbf{w}^i,\mathbf{x} \right\rangle})`$. Then, w.p. $`1-4d|{\cal X}|e^{-\Omega\left(\frac{\xi^2q}{d}\right)}`$, $`f`$ is a $`\left(K,O\left(Md^{K}\right),B\right)`$-PTF of $`\Psi=(\psi_1,\ldots,\psi_q)`$.

Proof. Let $`W = [\mathbf{w}^1\cdots\mathbf{w}^q]\in M_{d,q}`$. We first show that w.h.p. $`W`$ approximately reconstructs $`\mathbf{x}`$ from $`\Psi(\mathbf{x} )`$.

Claim 2. Let $`\alpha_{d,k} = \frac{k}{d}\cdot\frac{\binom{k-1}{(k-1)/2}}{2^{k-1}}`$. For any $`\mathbf{x} \in\{\pm 1\}^d`$ and $`\frac{1}{4}\ge\epsilon>0`$ we have $`\Pr\left(\left\|\frac{1}{q\alpha_{d,k}}W\Psi(\mathbf{x} )-\mathbf{x} \right\|_\infty\ge\epsilon\right)\le 4de^{-\Omega\left(\frac{\epsilon^2q}{d}\right)}`$

Before proving the claim, we show that it implies the lemma. Indeed, it implies that w.p.Ā $`1-4d|{\cal X}|e^{-\Omega\left(\frac{\xi^2q}{d}\right)}`$ we have that $`\left\|\frac{1}{q\alpha_{d,k}}W\Psi(\mathbf{x} )-\mathbf{x} \right\|_\infty\le\frac{\xi}{2}`$ for any $`\mathbf{x} \in {\cal X}`$. Given this event, we have that

``` math
1-\xi\le  \frac{1-\xi/2}{q\alpha_{d,k}}\left(W\Psi(\mathbf{x} )\odot \mathbf{x} \right)_j\le1
```

for any $`\mathbf{x} \in{\cal X}`$ and $`j\in [d]`$. Thus, if $`p:{\cal X}\to\mathbb{R}`$ is a polynomial that witnesses that $`f`$ is a $`(K,M,B,\xi)`$-PTF, then we have

``` math
B\ge p\left( \frac{1-\xi/2}{q\alpha_{d,k}}W\Psi(\mathbf{x} )\right)\cdot f(\mathbf{x} ) \ge 1
```

Hence, for $`q(\mathbf{y}):=p\left( \frac{1-\xi/2}{q\alpha_{d,k}}W\mathbf{y}\right)`$ we have that $`f`$ is a $`(K,\|q\|_\mathrm{co},B)`$-PTF of $`\Psi`$. By Lemma 22 and the fact that the norm of each row of $`\frac{1-\xi/2}{q\alpha_{d,k}}W`$ is at most $`\frac{1}{\sqrt{q}\alpha_{d,k}}`$ (since the entries of $`W`$ are in $`\{-1,1,0\}`$) we have

``` math
\|q\|_\mathrm{co}\le \|p\|_\mathrm{co}\cdot \left(\frac{\sqrt{q+1}}{\sqrt{q}\alpha_{d,k}}\right)^{K}
```

This implies the lemma as $`\alpha_{d,k} = \Theta\left( \frac{\sqrt{k}}{d}\right)`$ by Lemma 20.

Proof. (of Claim 2) Fix a coordinate $`j\in [d]`$. It is enough to show that $`\Pr\left(\left|\frac{1}{q\alpha_{d,k}}\left(W\Psi(\mathbf{x} )\right)_j-x_j\right|\ge\epsilon\right)\le 4e^{-\Omega\left(\frac{\epsilon^2q}{d}\right)}`$. We note that

``` math
\frac{1}{q\alpha_{d,k}}\left(W\Psi(\mathbf{x} )\right)_j = \frac{1}{q}\sum_{i=1}^q \frac{w^i_j\mathrm{sign}({\left\langle \mathbf{w}^i,\mathbf{x}  \right\rangle})}{\alpha_{d,k}}
```

Denote $`X_i = w^i_j\mathrm{sign}({\left\langle \mathbf{w}^i,\mathbf{x} \right\rangle})`$. Note that $`X_1,\ldots,X_q`$ are i.i.d.Ā We have

``` math
\begin{eqnarray*}
\Pr(X_i =  x_j)  &=& \frac{k}{2d}\left[\Pr(\mathrm{sign}({\left\langle \mathbf{w}^i,\mathbf{x}  \right\rangle})=1|w_j=x_j) +   \Pr(\mathrm{sign}({\left\langle \mathbf{w}^i,\mathbf{x}  \right\rangle})=-1|w_j=-x_j)\right]
\\
&=& \frac{k}{2d 2^{k-1}}\left[\binom{k-1}{\ge (k-1)/2}+\binom{k-1}{\ge (k-1)/2}\right]
\\
&=& \frac{k}{2d }\left[1 + \frac{\binom{k-1}{ (k-1)/2}}{2^{k-1}}\right]
\end{eqnarray*}
```

Similarly,

``` math
\begin{eqnarray*}
\Pr(X_i =  -x_j)  &=& \frac{k}{2d}\left[\Pr(\mathrm{sign}({\left\langle \mathbf{w}^i,\mathbf{x}  \right\rangle})=-1|w_j=x_j) +   \Pr(\mathrm{sign}({\left\langle \mathbf{w}^i,\mathbf{x}  \right\rangle})=1|w_j=-x_j)\right]
\\
&=& \frac{k}{2d 2^{k-1}}\left[\binom{k-1}{> (k-1)/2}+\binom{k-1}{> (k-1)/2}\right]
\\
&=& \frac{k}{2d }\left[1 - \frac{\binom{k-1}{ (k-1)/2}}{2^{k-1}}\right]
\end{eqnarray*}
```

As a result

``` math
\mathbb{E}X_i  = \left(\Pr(X_i= x_j)-\Pr(X_i=-x_j)\right)x_j  = \alpha_{d,k}\cdot x_j
```

And,

``` math
\Pr(X_i\ne 0)  = \Pr(X_i= x_j) + \Pr(X_i=-x_j)= \frac{k}{d}
```

this implies that

``` math
\frac{\min\left(\Pr(X_i = 1),\Pr(X_i = -1)\right)}{|\mathbb{E}X_i|} = \frac{k}{2d \alpha_{d,k} }\left[1 - \frac{\binom{k-1}{ (k-1)/2}}{2^{k-1}}\right] \ge \frac{k}{\alpha_{d,k}4d} \ge \frac{1}{2}
```

and that

``` math
\frac{|\mathbb{E}X_i|^2}{\Pr(\mathrm{sign}({\left\langle \mathbf{w},\mathbf{x}  \right\rangle})w_i\ne 0)} = \frac{k}{d}\left(\frac{\binom{k-1}{(k-1)/2}}{2^{k-1}}\right)^2 \stackrel{\text{Lemma \ref{lem:binom_assimptotics}}}{=} \Theta\left( \frac{1}{d}\right)
```

By Lemma 19 we have

``` math
\Pr\left(\left|\frac{1}{q\alpha_{d,k}}\left(W\Psi(\mathbf{x} )\right)_j-x_j\right|\ge\epsilon\right)\le 4e^{-\Omega\left(\frac{\epsilon^2q}{d}\right)}
```

ā—»

ā—»
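The reconstruction behind Claim 2 can be simulated directly. In the sketch below (dimensions, sparsity, and sample sizes are our own choices), we sample $`\mathbf{w}^1,\ldots,\mathbf{w}^q`$ uniformly from $`{\cal W}_{d,k}`$, form the features $`\psi_i(\mathbf{x} )=\mathrm{sign}({\left\langle \mathbf{w}^i,\mathbf{x} \right\rangle})`$, and check how well $`\frac{1}{q\alpha_{d,k}}W\Psi(\mathbf{x} )`$ recovers $`\mathbf{x}`$ in $`\ell^\infty`$.

```python
# Sketch: recover x in {-1,1}^d from q random "majority snippet" features
# psi_i(x) = sign(<w^i, x>), w^i uniform in W_{d,k}, via (1/(q*alpha)) * W Psi(x).
import math
import numpy as np

rng = np.random.default_rng(5)
d, k = 50, 5                                   # k odd, so <w, x> is never zero

def sample_W(q):
    W = np.zeros((d, q))
    for i in range(q):
        support = rng.choice(d, size=k, replace=False)
        W[support, i] = rng.choice([-1.0, 1.0], size=k)
    return W

alpha = (k / d) * math.comb(k - 1, (k - 1) // 2) / 2 ** (k - 1)
x = rng.choice([-1.0, 1.0], size=d)

for q in [1_000, 10_000, 100_000]:
    W = sample_W(q)
    psi = np.sign(W.T @ x)                     # q feature values, each in {-1, 1}
    x_hat = (W @ psi) / (q * alpha)
    print(f"q = {q:>6}: ell_inf reconstruction error = {np.max(np.abs(x_hat - x)):.4f}")
```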

Kernels From Random Neurons and Proof of Lemma 13

Fix a bounded activation $`\sigma:\mathbb{R}\to\mathbb{R}`$. Given $`0\le \beta\le 1`$, called the bias magnitude, we define a kernel on $`\mathbb{R}^n`$ by

``` math
\begin{equation}
    k_{\sigma,\beta,n}(\mathbf{x} ,\mathbf{y}) = \mathbb{E}[\sigma(\mathbf{w}^\top\mathbf{x} +b)\sigma(\mathbf{w}^\top\mathbf{y}+b)]\;,\;\;\;\; b\sim {\cal N}(0,\beta^2),\;\mathbf{w}\sim{\cal N}\left(0,\frac{1-\beta^2}{n}I_n\right)
\end{equation}
```

Note that $`\psi((\mathbf{w},b),\mathbf{x} ) = \sigma(\mathbf{w}^\top\mathbf{x} +b)`$ is a RFS for $`k_{\sigma,\beta,n}`$. We next analyze the functions in the corresponding kernel space $`{\cal H}_{\sigma,\beta,n}`$. To this end, we will use the Hermite expansion of $`\sigma`$ in order to find an explicit expression of $`k_{\sigma,\beta,n}`$, as well as an explicit embedding $`\Psi_{\sigma,\beta,n}:\mathbb{R}^n\to \bigoplus_{s=0}^\infty \left(\mathbb{R}^{n+1}\right)^{\otimes s}`$ whose kernel is $`k_{\sigma,\beta,n}`$. Let

``` math
\begin{equation}
\label{eq:hermite_expan_of_sigma}
    \sigma = \sum_{s=0}^\infty a_sh_s
\end{equation}
```

be the Hermite expansion of $`\sigma`$. For $`r\ge 1`$ denote

``` math
\begin{equation}
a_s(r) =     \sum_{j=0}^{\infty} a_{s+2j}\sqrt{\frac{(s+2j)!}{s!}} \frac{(r^2-1)^j}{j! 2^j}
\end{equation}
```

Note that $`a_s(1)=a_s`$

Lemma 37. We have

``` math
k_{\sigma,\beta,n}(\mathbf{x} ,\mathbf{y}) = \sum_{s=0}^\infty a_s\left(\sqrt{\frac{1-\beta^2}{n}\|\mathbf{x} \|^2 + \beta^2}\right)a_s\left(\sqrt{\frac{1-\beta^2}{n}\|\mathbf{y}\|^2 + \beta^2}\right)\left(\frac{1-\beta^2}{n}{\left\langle \mathbf{x} ,\mathbf{y} \right\rangle}+\beta^2\right)^s
```

Likewise, $`k_{\sigma,\beta,n}`$ is the kernel of the embedding $`\Psi_{\sigma,\beta,n}:\mathbb{R}^n\to \bigoplus_{s=0}^\infty \left(\mathbb{R}^{n+1}\right)^{\otimes s}`$ given by

``` math
\Psi_{\sigma,\beta,n}(\mathbf{x} ) = \left(a_s\left(\sqrt{\frac{1-\beta^2}{n}\|\mathbf{x} \|^2 + \beta^2}\right)\cdot\begin{bmatrix}\sqrt{\frac{1-\beta^2}{n}}\mathbf{x} \\\beta\end{bmatrix}^{\otimes s}\right)_{s=0}^\infty
```

To prove Lemma 37 we will use the following lemma.

Lemma 38. We have $`h_s(ax) = \sum_{j=0}^{\lfloor s/2 \rfloor} \sqrt{\frac{s!}{(s-2j)!}} \frac{a^{s-2j}(a^2-1)^j}{j! 2^j} h_{s-2j}(x)`$.

Proof. By formula [eq:hermite_gen_fun] we have

``` math
\begin{eqnarray*}
\sum_{s=0}^\infty \frac{h_s(ax)t^s}{\sqrt{s!}} &=& e^{xat - \frac{t^2}{2}} \\
&=& e^{xat - \frac{(at)^2}{2} +  \frac{(at)^2}{2} - \frac{t^2}{2}}     
\\
&\stackrel{\text{Eq. }\eqref{eq:hermite_gen_fun}}{=}& e^{\frac{(at)^2}{2} - \frac{t^2}{2}}\left( \sum_{s=0}^\infty \frac{h_s(x)a^st^s}{\sqrt{s!}}\right)
\\
&=& e^{(a^2-1)\frac{t^2}{2}}\left( \sum_{s=0}^\infty \frac{h_s(x)a^st^s}{\sqrt{s!}}\right)
\\
&=& \left( \sum_{s=0}^\infty \frac{(a^2-1)^s}{s!2^s} t^{2s}\right)\left( \sum_{s=0}^\infty \frac{h_s(x)a^s}{\sqrt{s!}}t^s\right)
\\
&=& \sum_{s=0}^\infty \left(\sum_{j=0}^{\left\lfloor \frac{s}{2}\right\rfloor}
\frac{(a^2-1)^j}{j!2^j}\frac{h_{s-2j}(x)a^{s-2j}}{\sqrt{(s-2j)!}}
\right)t^{s}
\end{eqnarray*}
```

Thus,

``` math
\frac{h_s(ax)}{\sqrt{s!}} = \sum_{j=0}^{\left\lfloor \frac{s}{2}\right\rfloor}
\frac{(a^2-1)^j}{j!2^j}\frac{a^{s-2j}}{\sqrt{(s-2j)!}}h_{s-2j}(x)
```

ā—»
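Lemma 38 is stated for the normalized probabilists' Hermite polynomials $`h_s = \mathrm{He}_s/\sqrt{s!}`$, so it can be checked numerically with NumPy's `hermite_e` module. The script below is our own sanity check of the identity, not part of the proof.

```python
# Sketch: check Lemma 38,
# h_s(a x) = sum_j sqrt(s!/(s-2j)!) * a^(s-2j) * (a^2-1)^j / (j! 2^j) * h_{s-2j}(x),
# where h_s(x) = He_s(x) / sqrt(s!) are normalized probabilists' Hermite polynomials.
import math
import numpy as np
from numpy.polynomial import hermite_e as He

def h(s, x):
    c = np.zeros(s + 1)
    c[s] = 1.0
    return He.hermeval(x, c) / math.sqrt(math.factorial(s))

rng = np.random.default_rng(6)
for s in [1, 3, 6]:
    for a in [0.3, 0.9, 1.7]:
        x = rng.standard_normal()
        lhs = h(s, a * x)
        rhs = sum(
            math.sqrt(math.factorial(s) / math.factorial(s - 2 * j))
            * a ** (s - 2 * j) * (a * a - 1) ** j / (math.factorial(j) * 2 ** j)
            * h(s - 2 * j, x)
            for j in range(s // 2 + 1)
        )
        print(f"s={s}, a={a}: lhs={lhs:+.6f}, rhs={rhs:+.6f}")
```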

Proof. (of Lemma 37) We will prove the formula for $`k_{\sigma,\beta,n}`$. It is not hard to verify that it implies that $`k_{\sigma,\beta,n}`$ is the kernel of $`\Psi_{\sigma,\beta,n}`$ using the fact that $`{\left\langle \mathbf{x} ^{\otimes s},\mathbf{y}^{\otimes s} \right\rangle} = {\left\langle \mathbf{x} ,\mathbf{y} \right\rangle}^s`$. By definition $`k_{\sigma,\beta,n}(\mathbf{x} ,\mathbf{y}) = \mathbb{E}[\sigma(\mathbf{w}^\top\mathbf{x} +b)\sigma(\mathbf{w}^\top\mathbf{y}+b)]`$ where $`b\sim {\cal N}(0,\beta^2)`$ and $`\mathbf{w}\sim{\cal N}\left(0,\frac{1-\beta^2}{n}I_n\right)`$. Let $`X = \mathbf{w}^\top\mathbf{x} +b`$ and $`Y = \mathbf{w}^\top\mathbf{y}+b`$. We note that $`(X,Y)`$ is a centered Gaussian vector with covariance matrix $`\begin{pmatrix} \frac{1-\beta^2}{n}\|\mathbf{x} \|^2+\beta^2 & \frac{1-\beta^2}{n}{\left\langle \mathbf{x} ,\mathbf{y} \right\rangle}+\beta^2\\ \frac{1-\beta^2}{n}{\left\langle \mathbf{x} ,\mathbf{y} \right\rangle}+\beta^2 & \frac{1-\beta^2}{n}\|\mathbf{y}\|^2+\beta^2 \end{pmatrix}`$. Denote $`r_\mathbf{x} = \sqrt{\frac{1-\beta^2}{n}\|\mathbf{x} \|^2 + \beta^2}`$ and $`r_\mathbf{y}= \sqrt{\frac{1-\beta^2}{n}\|\mathbf{y}\|^2 + \beta^2}`$. Likewise let $`\tilde X = \frac{1}{r_\mathbf{x} }X`$ and $`\tilde Y = \frac{1}{r_\mathbf{y}}Y`$. Note that $`(\tilde X,\tilde Y)`$ is a centered Gaussian vector with correlation matrix $`\begin{pmatrix} 1 & \rho\\ \rho & 1 \end{pmatrix}`$ for $`\rho = \frac{\frac{1-\beta^2}{n}{\left\langle \mathbf{x} ,\mathbf{y} \right\rangle}+\beta^2}{r_\mathbf{x} r_\mathbf{y}}`$. Now, by Lemma 38 we have

``` math
\begin{eqnarray*}
\sigma(rx) &=& \sum_{s=0}^\infty a_s h_s(rx)
\\
&=& \sum_{s=0}^\infty \left(\sum_{j=0}^{\infty} a_{s+2j}\sqrt{\frac{(s+2j)!}{s!}} \frac{(r^2-1)^j}{j! 2^j}  \right)r^{s}h_s(x)
\\
&=:& \sum_{s=0}^\infty a_s(r)r^sh_s(x)
\end{eqnarray*}
```

Hence,

``` math
\begin{eqnarray*}
    k_{\sigma,\beta,n}(\mathbf{x} ,\mathbf{y}) &=& \mathbb{E}\sigma(r_\mathbf{x} \tilde X) \sigma(r_\mathbf{y}\tilde Y)
    \\
    &=&\sum_{i=0}^\infty \sum_{j=0}^\infty a_i(r_\mathbf{x} )r^i_\mathbf{x} a_j(r_\mathbf{y})r^j_\mathbf{y} \mathbb{E}h_i(\tilde X)h_j(\tilde Y)
    \\
    &\stackrel{\text{Eq. }\eqref{eq:hermite_prod_exp}}{=}&\sum_{s=0}^\infty a_s(r_\mathbf{x} )r^s_\mathbf{x} a_s(r_\mathbf{y})r^s_\mathbf{y}\rho^s
    \\
    &=&\sum_{s=0}^\infty a_s(r_\mathbf{x} ) a_s(r_\mathbf{y}) \left(\frac{1-\beta^2}{n}{\left\langle \mathbf{x} ,\mathbf{y} \right\rangle}+\beta^2\right)^s
\end{eqnarray*}
```

ā—»
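For a simple activation the series of Lemma 37 can be summed in closed form, which gives a direct Monte Carlo check. For instance, for $`\sigma(z)=z^2`$ we have $`a_0=1`$, $`a_2=\sqrt{2}`$ and all other $`a_s=0`$, and the formula collapses to $`k(\mathbf{x} ,\mathbf{y}) = r_\mathbf{x} ^2 r_\mathbf{y}^2 + 2\left(\frac{1-\beta^2}{n}{\left\langle \mathbf{x} ,\mathbf{y} \right\rangle}+\beta^2\right)^2`$. This worked special case and the script below are our own illustration (note that $`z^2`$ is not bounded, so it is used here only to check the algebra).

```python
# Sketch: Monte Carlo check of the kernel formula of Lemma 37 for sigma(z) = z^2,
# where the series reduces to k(x, y) = r_x^2 * r_y^2 + 2 * ((1-beta^2)/n * <x,y> + beta^2)^2.
import numpy as np

rng = np.random.default_rng(7)
n, beta = 6, 0.8
x = rng.uniform(-1, 1, size=n)
y = rng.uniform(-1, 1, size=n)

samples = 1_000_000
w = rng.standard_normal((samples, n)) * np.sqrt((1 - beta**2) / n)
b = rng.standard_normal(samples) * beta
mc = np.mean((w @ x + b) ** 2 * (w @ y + b) ** 2)

rho = (1 - beta**2) / n * np.dot(x, y) + beta**2
r_x2 = (1 - beta**2) / n * np.dot(x, x) + beta**2
r_y2 = (1 - beta**2) / n * np.dot(y, y) + beta**2
closed_form = r_x2 * r_y2 + 2 * rho**2

print(f"Monte Carlo: {mc:.5f}, closed form: {closed_form:.5f}")
```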

Lemma 39. Let $`r>0`$ be such that $`|1-r^2|=:\epsilon<\frac{1}{2}`$. We have

``` math
|a_s(r)-a_s(1)| \le \|\sigma\|2^{(s+2)/2} \frac{\epsilon}{\sqrt{1-2\epsilon^2}}
```

Proof. We have

``` math
\begin{eqnarray*}
    |a_s(r)-a_s(1)| &=& \left|\sum_{j=1}^{\infty} a_{s+2j}\sqrt{\frac{(s+2j)!}{s!}} \frac{(r^2-1)^j}{j! 2^j}\right| 
    \\
    &\stackrel{\text{Cauchy-Schwartz and }\|\sigma\|=\sqrt{\sum_{i=0}^\infty a_i^2}}{\le}& \|\sigma\| \sqrt{\sum_{j=1}^{\infty} \frac{(s+2j)!}{s!} \frac{(r^2-1)^{2j}}{(j!)^2 2^{2j}} }
    \\
    &\stackrel{(2j)!\le (j!2^j)^2}{\le}& \|\sigma\| \sqrt{\sum_{j=1}^{\infty} \frac{(s+2j)!}{s!(2j)!} (r^2-1)^{2j} }
    \\
    &=& \|\sigma\| \sqrt{\sum_{j=1}^{\infty} \binom{s+2j}{s} (r^2-1)^{2j} }
    \\
    &\le& \|\sigma\| \sqrt{\sum_{j=1}^{\infty} 2^{s+2j} (r^2-1)^{2j} }
    \\
    &=& \|\sigma\|2^{s/2} \sqrt{\sum_{j=1}^{\infty}  (2r^2-2)^{2j} }
    \\
    &=& \|\sigma\|2^{s/2}|2r^2-2| \frac{1}{\sqrt{1-(2r^2-2)^2}}
\end{eqnarray*}
```

ā—»

Lemma 40. Assume that $`1-\beta^2<\frac{1}{2}`$ for $`\beta>0`$. Let $`{\cal X}\subseteq [-1,1]^n`$. Let $`p:{\cal X}\to\mathbb{R}`$ be a degree $`K`$ polynomial. Let $`K'\ge K`$. There is $`g\in {\cal H}_{\sigma,\beta,n}({\cal X})`$ such that

  1. $`g(\mathbf{x} ) = \frac{a_{K'}\left(\sqrt{\frac{1-\beta^2}{n}\|\mathbf{x} \|^2+\beta^2}\right)}{a_{K'}}p(\mathbf{x} )`$

  2. $`\|g\|_{\sigma,\beta,n} \le \frac{1}{a_{K'}\beta^{K'-K}}\left(\frac{n}{1-\beta^2}\right)^{K/2}\|p\|_{\mathrm{co}}`$

  3. $`\|g-p\|_\infty \le \|p\|_\infty \frac{\|\sigma\|}{a_{K'}}2^{(K'+2)/2}\frac{1-\beta^2}{\sqrt{1-2(1-\beta^2)^2}}`$

Proof. Write $`p(\mathbf{x} )= \sum_{\alpha\in \{0,\ldots,K\}^{n},\|\alpha\|_1\le K} b_\alpha \mathbf{x} ^\alpha`$. For $`\alpha\in \{0,\ldots,K\}^{n},\|\alpha\|_1\le K`$ we let $`\tilde\alpha \in [n+1]^{K'}`$ be a sequence such that for any $`i\in [n]`$ we have $`\tilde\alpha_j = i`$ for exactly $`\alpha_i`$ indices $`j\in [K']`$ and $`\tilde\alpha_j = n+1`$ for the remaining $`K'-\|\alpha\|_1`$ indices. Let $`A\in (\mathbb{R}^{n+1})^{\otimes K'}\subseteq \bigoplus_{s=0}^\infty(\mathbb{R}^{n+1})^{\otimes s}`$ be the tensor

``` math
A_{\gamma} = \begin{cases}\frac{1}{a_{K'}\beta^{K'-\|\alpha\|_1}}\left(\frac{n}{1-\beta^2}\right)^{\|\alpha\|_1/2}b_\alpha & \gamma=\tilde\alpha\text{ for some }\alpha \\  0&\text{otherwise}\end{cases}
```

and let

``` math
g(\mathbf{x} ) = {\left\langle A,\Psi_{\sigma,\beta,n}(\mathbf{x} ) \right\rangle}
```

It is not hard to verify that $`g(\mathbf{x} ) = \frac{a_{K'}\left(\sqrt{\frac{1-\beta^2}{n}\|\mathbf{x} \|^2+\beta^2}\right)}{a_{K'}}p(\mathbf{x} )`$. By Theorem 25, $`g\in {\cal H}_{\sigma,\beta,n}`$ and satisfies $`\|g\|_{\sigma,\beta,n} \le \|A\|`$. Finally, since $`\frac{1}{\beta^{K'-\|\alpha\|_1}}\left(\frac{n}{1-\beta^2}\right)^{\|\alpha\|_1/2}\le \frac{1}{\beta^{K'-K}}\left(\frac{n}{1-\beta^2}\right)^{K/2}`$ we have $`\|A\|\le \frac{1}{a_{K'}\beta^{K'-K}}\left(\frac{n}{1-\beta^2}\right)^{K/2}\|p\|_{\mathrm{co}}`$. This proves the first and second items. To prove the last item we note that for any $`\mathbf{x} \in {\cal X}`$ we have

``` math
\begin{eqnarray*}
    |g(\mathbf{x} ) - p(\mathbf{x} )| &=& | p(\mathbf{x} )|\cdot\left|\frac{a_{K'}\left(\sqrt{\frac{1-\beta^2}{n}\|\mathbf{x} \|^2+\beta^2}\right)}{a_{K'}}-1\right| 
    \\
    &=& \frac{| p(\mathbf{x} )|}{a_{K'}}\left|a_{K'}\left(\sqrt{\frac{1-\beta^2}{n}\|\mathbf{x} \|^2+\beta^2}\right)-a_{K'}\right|
\end{eqnarray*}
```

Define $`r = \sqrt{\frac{\|\mathbf{x} \|^2}{n}(1-\beta^2) + \beta^2}`$ and note that since $`0\le \|\mathbf{x} \|^2\le n`$ we have

``` math
\beta^2\le  r^2\le 1 \Rightarrow \epsilon:= |1- r^2 |\le 1-\beta^2 < \frac{1}{2}
```

Hence, by Lemma 39 we have

``` math
|g(\mathbf{x} ) - p(\mathbf{x} )|\le \frac{| p(\mathbf{x} )|}{a_{K'}}\|\sigma\|2^{(K'+2)/2}\frac{1-\beta^2}{\sqrt{1-2(1-\beta^2)^2}}
```

which proves the last item. ā—»

Combining Lemma 40 with Lemma 27 we get

Lemma 41. Assume that $`1-\beta^2<\frac{1}{2}`$ for $`\beta>0`$. Let $`{\cal X}\subset [-1,1]^n`$. Fix a degree $`K`$ polynomial $`p:{\cal X}\to [-1,1]`$ and $`K'\ge K`$. Let $`(W,\mathbf{b})\in\mathbb{R}^{q\times n}\times \mathbb{R}^{q}`$ be a $`\beta`$-Xavier pair. Then there is a vector $`\mathbf{w}=\mathbf{w}(W,\mathbf{b})\in\mathbb{R}^{q}`$ such that

``` math
\forall\mathbf{x} \in {\cal X},\;\;\Pr\left(|{\left\langle \mathbf{w},\sigma(W\mathbf{x} +\mathbf{b}) \right\rangle}-p(\mathbf{x} )|\ge \epsilon +  \frac{\|\sigma\|}{a_{K'}}2^{(K'+2)/2}\frac{1-\beta^2}{\sqrt{1-2(1-\beta^2)^2}}\right) \le \delta
```

for

``` math
\delta = 2\exp\left(-{q}\cdot\frac{a^4_{K'}\beta^{4K'-4K}(1-\beta^2)^{2K}\epsilon^4}{32 n^{2K}\|p\|_{\mathrm{co}}^4\|\sigma\|_\infty^4}\right)
```

Moreover

``` math
\|\mathbf{w}\|\le \frac{2\|\sigma\|_\infty}{\epsilon\sqrt{{q}}}\cdot\frac{1}{a^2_{K'}\beta^{2K'-2K}}\left(\frac{n}{1-\beta^2}\right)^{K}\|p\|^2_{\mathrm{co}}
```

We next specialize Lemma 41 for the needs of our paper and explain how it implies Lemma 13. Recall that for $`\epsilon>0`$ we defined $`\frac{3}{4}\le\beta_{\sigma,K',K}(\epsilon)<1`$ as the minimal number such that if $`\beta_{\sigma,K',K}(\epsilon)\le \beta<1`$ then

``` math
\frac{\|\sigma\|}{a_{K'}}2^{(K'+2)/2}\frac{1-\beta^2}{\sqrt{1-2(1-\beta^2)^2}}\le \frac{\epsilon}{2}
```

We also defined

``` math
\delta_{\sigma,K',K}(\epsilon,\beta,q,M,n) = \begin{cases}
1 & \frac{4\|\sigma\|_\infty}{\epsilon\sqrt{{q}}}\cdot\frac{1}{a^2_{K'}\beta^{2K'-2K}}\left(\frac{n}{1-\beta^2}\right)^{K}M^2 > 1
\\
2\exp\left(-{q}\cdot\frac{a^4_{K'}\beta^{4K'-4K}(1-\beta^2)^{2K}\epsilon^4}{512 n^{2K}M^4\|\sigma\|_\infty^4}\right) & \text{otherwise}
\end{cases}
```

We can now prove Lemma 13, which we restate next.

Lemma 42. (Lemma 13 restated) Fix $`{\cal X}\subset [-1,1]^n`$, a degree $`K`$ polynomial $`p:{\cal X}\to [-1,1]`$, $`K'\ge K`$ and $`\epsilon>0`$. Let $`(W,\mathbf{b})\in\mathbb{R}^{q\times n}\times \mathbb{R}^{q}`$ be a $`\beta`$-Xavier pair for $`1>\beta\ge \beta_{\sigma,K',K}(\epsilon)`$. Then there is a vector $`\mathbf{w}=\mathbf{w}(W,\mathbf{b})\in\mathbb{B}^{q}`$ such that

``` math
\forall\mathbf{x} \in {\cal X},\;\Pr\left(|{\left\langle \mathbf{w},\sigma(W\mathbf{x} +\mathbf{b}) \right\rangle}-p(\mathbf{x} )|\ge \epsilon \right) \le  \delta_{\sigma,K',K}(\epsilon,\beta,q,\|p\|_\mathrm{co},n)
```

Proof. Fix $`\mathbf{x} \in {\cal X}`$. By Lemma 41 there is a vector $`\mathbf{v}\in\mathbb{R}^q`$ such that

``` math
\begin{equation}
\label{eq:lem:rf_main_simp_1}
\Pr\left(|{\left\langle \mathbf{v},\sigma(W\mathbf{x} +\mathbf{b}) \right\rangle}-p(\mathbf{x} )|\ge \epsilon \right)\le
\Pr\left(|{\left\langle \mathbf{v},\sigma(W\mathbf{x} +\mathbf{b}) \right\rangle}-p(\mathbf{x} )|\ge \frac{\epsilon}{2} +  \frac{\|\sigma\|}{a_{K'}}2^{(K'+2)/2}\frac{1-\beta^2}{\sqrt{1-2(1-\beta^2)^2}}\right) 
\le \delta
\end{equation}
```

for

``` math
\begin{equation*}
    \delta = 2\exp\left(-{q}\cdot\frac{a^4_{K'}\beta^{4K'-4K}(1-\beta^2)^{2K}\epsilon^4}{512 n^{2K}\|p\|_{\mathrm{co}}^4\|\sigma\|_\infty^4}\right)
\end{equation*}
```

Moreover

``` math
\|\mathbf{v}\|\le \frac{4\|\sigma\|_\infty}{\epsilon\sqrt{{q}}}\cdot\frac{1}{a^2_{K'}\beta^{2K'-2K}}\left(\frac{n}{1-\beta^2}\right)^{K}\|p\|^2_{\mathrm{co}}
```

Define $`\mathbf{w}`$ to be the projection of $`\mathbf{v}`$ on $`\mathbb{B}^q`$. We now split into cases. If $`\frac{4\|\sigma\|_\infty}{\epsilon\sqrt{{q}}}\cdot\frac{1}{a^2_{K'}\beta^{2K'-2K}}\left(\frac{n}{1-\beta^2}\right)^{K}\|p\|^2_{\mathrm{co}}\le 1`$ then $`\mathbf{v}=\mathbf{w}`$ and $`\delta=\delta_{\sigma,K',K}(\epsilon,\beta,q,\|p\|_\mathrm{co},n)`$, so the Lemma follows from Equation [eq:lem:rf_main_simp_1]. Otherwise, we have $`\delta_{\sigma,K',K}(\epsilon,\beta,q,\|p\|_\mathrm{co},n)= 1`$ and the Lemma is trivially true. ā—»

Acknowledgments

The research described in this paper was funded by the European Research Council (ERC) under the European Union's Horizon 2022 research and innovation program (grant agreement No. 101041711), and the Simons Foundation (as part of the Collaboration on the Mathematical and Scientific Foundations of Deep Learning). The author thanks Elchanan Mossel and Mariano Schain for useful comments.

A Note of Gratitude

The copyright of this content belongs to the respective researchers. We deeply appreciate their hard work and contribution to the advancement of human civilization.

  1. We note that in practice it is often the case that an example possesses several positive labels (for instance, ā€œdogā€ and ā€œanimalā€). However, each training example usually comes with just one of its positive labels. We hope that future work will be able to handle this more realistic type of supervision. ↩︎
