Reliably Learning the ReLU in Polynomial Time


Authors: Surbhi Goel, Varun Kanade, Adam Klivans, Justin Thaler

Surbhi Goel (University of Texas at Austin), Varun Kanade (University of Oxford and Alan Turing Institute), Adam Klivans (University of Texas at Austin), and Justin Thaler (Georgetown University)

Abstract

We give the first dimension-efficient algorithms for learning Rectified Linear Units (ReLUs), which are functions of the form $\mathbf{x} \mapsto \max(0, \mathbf{w} \cdot \mathbf{x})$ with $\mathbf{w} \in \mathbb{S}^{n-1}$. Our algorithm works in the challenging Reliable Agnostic learning model of Kalai, Kanade, and Mansour [18], where the learner is given access to a distribution $\mathcal{D}$ on labeled examples but the labeling may be arbitrary. We construct a hypothesis that simultaneously minimizes the false-positive rate and the loss on inputs given positive labels by $\mathcal{D}$, for any convex, bounded, and Lipschitz loss function. The algorithm runs in polynomial time (in $n$) with respect to any distribution on $\mathbb{S}^{n-1}$ (the unit sphere in $n$ dimensions) and for any error parameter $\epsilon = \Omega(1/\log n)$ (this yields a PTAS for a question raised by F. Bach on the complexity of maximizing ReLUs). These results are in contrast to known efficient algorithms for reliably learning linear threshold functions, where $\epsilon$ must be $\Omega(1)$ and strong assumptions are required on the marginal distribution. We can compose our results to obtain the first set of efficient algorithms for learning constant-depth networks of ReLUs. Our techniques combine kernel methods and polynomial approximations with a "dual-loss" approach to convex programming. As a byproduct we obtain a number of applications, including the first set of efficient algorithms for "convex piecewise-linear fitting" and the first efficient algorithms for noisy polynomial reconstruction of low-weight polynomials on the unit sphere.

1 Introduction

Let $X = \mathbb{S}^{n-1}$, the set of all unit vectors in $\mathbb{R}^n$, and let $Y = [0,1]$.
We define a ReLU (Rectified Linear Unit) to be a function $f(\mathbf{x}) : X \to Y$ equal to $\max(0, \mathbf{w}\cdot\mathbf{x})$, where $\mathbf{w} \in \mathbb{S}^{n-1}$ is a fixed element of $\mathbb{S}^{n-1}$ and $\mathbf{w}\cdot\mathbf{x}$ denotes the standard inner product (see footnote 1). The ReLU is a key building block in the area of deep nets, where the goal is to construct a network or circuit of ReLUs that "fits" a training set with respect to various measures of loss. Recently, the ReLU has become the "activation function of choice" for practitioners in deep nets, as it leads to striking performance in various applications [23]. Surprisingly little is known about the computational complexity of learning even the shallowest of nets: a single ReLU. In this work, we provide the first set of efficient algorithms for learning a ReLU. The algorithms succeed with respect to any distribution $\mathcal{D}$ on $\mathbb{S}^{n-1}$, tolerate arbitrary labelings (equivalently viewed as adversarial noise), and run in polynomial time for any accuracy parameter $\epsilon = \Omega(1/\log n)$. This is in contrast to the problem of learning threshold functions, i.e., functions of the form $\mathrm{sign}(\mathbf{w}\cdot\mathbf{x})$, where only computational hardness results are known (unless stronger assumptions are made on the problem). Recall the following two fundamental machine-learning problems:

Problem 1.1 (Ordinary Least Squares Regression). Let $\mathcal{D}$ be a distribution on $\mathbb{S}^{n-1} \times [0,1]$. Given i.i.d. examples drawn from $\mathcal{D}$, find $\mathbf{w} \in \mathbb{S}^{n-1}$ that minimizes $\mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}[(\mathbf{w}\cdot\mathbf{x} - y)^2]$.

Problem 1.2 (Agnostically Learning a Threshold Function). Let $\mathcal{D}$ be a distribution on $\mathbb{S}^{n-1} \times \{0,1\}$. Given i.i.d. examples drawn from $\mathcal{D}$, find $\mathbf{w} \in \mathbb{S}^{n-1}$ that approximately minimizes $\Pr_{(\mathbf{x},y)\sim\mathcal{D}}[\mathrm{sign}(\mathbf{w}\cdot\mathbf{x}) \neq y]$.

The term agnostic above refers to the fact that the labeling on $\{0,1\}$ may be arbitrary.
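Problem 1.1 is the classically tractable one of the pair; as a quick illustration (ours, not taken from the paper), its empirical version on a finite sample reduces to a single least-squares solve, with renormalization onto the sphere as a common heuristic for the constraint $\mathbf{w} \in \mathbb{S}^{n-1}$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic sample: x_i uniform-ish on S^{n-1}, noisy linear labels.
n, m = 5, 2000
X = rng.normal(size=(m, n))
X /= np.linalg.norm(X, axis=1, keepdims=True)      # project rows onto the sphere
w_true = np.ones(n) / np.sqrt(n)                   # hidden unit-norm weight vector
y = X @ w_true + 0.01 * rng.normal(size=m)

# Empirical least squares: minimize (1/m) * sum_i (w . x_i - y_i)^2.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Heuristic: renormalize so the hypothesis lies in S^{n-1} as Problem 1.1 asks.
w_hat /= np.linalg.norm(w_hat)

print(np.linalg.norm(w_hat - w_true))  # small: near-recovery under mild noise
```

This is only the easy baseline; the difficulty discussed below comes from the thresholding in Problem 1.2 and in the ReLU itself.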
In this work, we relax the notion of success to improper learning, where the learner may output any polynomial-time computable hypothesis achieving a loss that is within $\epsilon$ of the optimal solution from the concept class. Taken together, these two problems are at the core of many important techniques from modern Machine Learning and Statistics. It is well known how to efficiently solve ordinary least squares and other variants of linear regression; we know of multiple polynomial-time solutions, all extensively used in practice [30]. In contrast, Problem 1.2 is thought to be computationally intractable due to the many existing hardness results in the literature [7, 11, 19, 21]. The ReLU is a hybrid function that lies "in between" a linear function and a threshold function in the following sense: restricted to inputs $\mathbf{x}$ such that $\mathbf{w}\cdot\mathbf{x} > 0$, the ReLU is linear, and for inputs $\mathbf{x}$ such that $\mathbf{w}\cdot\mathbf{x} \le 0$, the ReLU thresholds the value $\mathbf{w}\cdot\mathbf{x}$ and simply outputs zero. In this sense, we could view the ReLU as a "one-sided" threshold function. Since learning a ReLU has aspects of both linear regression and threshold learning, it is not straightforward to identify a notion of loss that captures both of these aspects.

1.1 Reliably Learning Real-Valued Functions

We introduce a natural model for learning ReLUs inspired by the Reliable Agnostic learning model that was introduced by Kalai et al. [18] in the context of Boolean functions. The goal will be to minimize both the false-positive rate and a loss function (for example, square loss) on points the distribution labels non-zero. In this work, we give efficient algorithms for learning a ReLU over the unit sphere with respect to any loss function that satisfies mild properties (convexity, monotonicity, boundedness, and Lipschitzness).
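To make the two competing objectives concrete, here is a minimal numpy sketch (function names are ours, hypothetical) of their empirical versions on a sample; in the realizable case, where the labels are produced by the same ReLU, both quantities vanish:

```python
import numpy as np

def relu(w, X):
    """A ReLU hypothesis x -> max(0, w . x), applied row-wise to X."""
    return np.maximum(0.0, X @ w)

def false_positive_rate(preds, y):
    """Empirical fraction of examples with y = 0 but a non-zero prediction."""
    return np.mean((preds != 0) & (y == 0))

def loss_on_positives(preds, y):
    """Empirical square loss restricted to examples with y > 0
    (averaged over the whole sample, not conditioned on y > 0)."""
    return np.mean(((preds - y) ** 2) * (y > 0))

rng = np.random.default_rng(0)
n, m = 4, 1000
w = np.zeros(n); w[0] = 1.0                        # a fixed unit-norm ReLU
X = rng.normal(size=(m, n))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = np.maximum(0.0, X @ w)                         # labels realizable by that ReLU

p = relu(w, X)
print(false_positive_rate(p, y), loss_on_positives(p, y))  # both 0.0 here
```

The formal, distributional versions of these two quantities are defined next.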
The Reliable Agnostic model is motivated by the Neyman-Pearson criterion, and is intended to capture settings in which false-positive errors are more costly than false-negative errors (e.g., spam detection) or vice versa. We observe that the asymmetric manner in which the Reliable Agnostic model [18] treats different types of errors naturally corresponds to the one-sided nature of a ReLU. In particular, there may be settings in which mistakenly predicting a positive value instead of zero carries a high cost. As a concrete example, imagine that inputs are comments on an online news article. Suppose that each comment is assigned a numerical score of quality or appropriateness, where the true scoring function is reasonably modeled by a linear function of the features of the comment. The newspaper wants to implement an automated system in which comments are either a) rejected outright if the score is below a threshold or b) posted in order of score, possibly after undergoing human review (see footnote 2). In this situation, it may be costlier to post (or subject to human review) a low-quality or inappropriate comment than it is to automatically reject a comment that is slightly above the threshold for posting. More formally, for a function $h$ and distribution $\mathcal{D}$ over $\mathbb{R}^n \times [0,1]$, define the following losses:
$$\mathcal{L}_{=0}(h) = \Pr_{(\mathbf{x},y)\sim\mathcal{D}}[h(\mathbf{x}) \neq 0 \wedge y = 0]$$
$$\mathcal{L}_{>0}(h) = \mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}[\ell(h(\mathbf{x}), y) \cdot \mathbb{I}(y > 0)].$$
Here, $\ell$ is a desired loss function, and $\mathbb{I}(y > 0)$ equals 0 if $y \le 0$ and 1 otherwise. These two quantities are respectively the false-positive rate and the expected loss (under $\ell$) on examples for which the true label $y$ is positive (see footnote 3). Let $\mathcal{C}$ be a class of functions mapping $\mathbb{S}^{n-1}$ to $[0,1]$ (e.g., $\mathcal{C}$ may be the class of all ReLUs).

Footnote 1: Throughout this manuscript, bold lower-case variables denote vectors; unbolded lower-case variables denote real numbers.
Let $\mathcal{C}^+ = \{c \in \mathcal{C} \mid \mathcal{L}_{=0}(c) = 0\}$. We say $\mathcal{C}$ is reliably learnable if there exists a learning algorithm $\mathcal{A}$ that (with high probability) outputs a hypothesis that 1) has false-positive rate at most $\epsilon$, and 2) on points with positive labels, has expected loss that is within $\epsilon$ of the best $c$ from $\mathcal{C}^+$. That is, the hypothesis must be both reliable and competitive with the optimal classifier from the class $\mathcal{C}^+$ (agnostic).

1.2 Our Contributions

We can now state our main theorem giving a polynomial-time algorithm (in $n$, the dimension) for reliably learning any ReLU. All of our results hold for loss functions $\ell$ that satisfy convexity, monotonicity, boundedness, and Lipschitzness. For brevity, we avoid making these requirements explicit in the theorem statements of this introduction, and we omit the dependence of the runtime on the failure probability $\delta$ of the algorithm or on the boundedness and Lipschitz parameters of the loss function. All theorem statements in subsequent sections do state explicitly to what class of loss functions they apply, as well as the runtime dependence on these additional parameters.

Theorem 1.3. Let $\mathcal{C} = \{\mathbf{x} \mapsto \max(0, \mathbf{w}\cdot\mathbf{x}) : \|\mathbf{w}\|_2 \le 1\}$ be the class of ReLUs with weight vectors $\mathbf{w}$ satisfying $\|\mathbf{w}\|_2 \le 1$. There exists a learning algorithm $\mathcal{A}$ that reliably learns $\mathcal{C}$ in time $2^{O(1/\epsilon)} \cdot n^{O(1)}$.

Remark 1.4. We can obtain the same complexity bounds for learning ReLUs in the standard agnostic model with respect to the same class of loss functions. This yields a PTAS (polynomial-time approximation scheme) for an optimization problem regarding ReLUs posed by Bach [3]. See Section 3.4 for details.

For the problem of learning threshold functions, all known polynomial-time algorithms require strong assumptions on the marginal distribution (e.g., Gaussian [19] or large-margin [31]).
In contrast, for ReLUs, we succeed with respect to any distribution on $\mathbb{S}^{n-1}$. We leave open the problem of improving the dependence of Theorem 1.3 on $\epsilon$. We note that for the problem of learning threshold functions (even assuming the marginal distribution is Gaussian), the run-time complexity must be at least $n^{\Omega(\log 1/\epsilon)}$ under the widely believed assumption that learning sparse parities is hard [22]. Further, the best known algorithms for agnostically learning threshold functions with respect to Gaussians run in time $n^{O(1/\epsilon^2)}$ [8, 19]. Contrast this to our result for learning ReLUs, where we give polynomial-time algorithms even for $\epsilon$ as small as $1/\log n$.

Footnote 2: For example, The New York Times recently announced that they are moving to a hybrid comment moderation system that combines human and automated review [10].

Footnote 3: We restrict $Y = [0,1]$ as it is a natural setting for the case of ReLUs. However, our results can easily be extended to larger ranges.

We can compose our results to obtain efficient algorithms for small-depth networks of ReLUs. For brevity, here we state results only for linear combinations of ReLUs (which are often called depth-two networks of ReLUs; see, e.g., [9]). Formal results for other types of networks can be found in Section 4.

Theorem 1.5. Let $\mathcal{C}$ be a depth-2 network of ReLUs with $k$ hidden units. Then $\mathcal{C}$ is reliably learnable in time $2^{O(\sqrt{k}/\epsilon)} \cdot n^{O(1)}$.

The above results are perhaps surprising in light of the hardness result due to Livni et al. [25], who showed that for $X = \{0,1\}^n$, learning the difference of even two ReLUs is as hard as learning a threshold function. We also obtain results for noisy polynomial reconstruction on the sphere (equivalently, agnostically learning a polynomial) with respect to a large class of loss functions:

Theorem 1.6.
Let $\mathcal{C}$ be the class of polynomials $p : \mathbb{S}^{n-1} \to [-1,1]$ in $n$ variables such that the total degree of $p$ is at most $d$, and the sum of squares of coefficients of $p$ (in the standard monomial basis) is at most $B$. Then $\mathcal{C}$ is agnostically learnable under any (unknown) distribution over $\mathbb{S}^{n-1} \times [-1,1]$ in time $\mathrm{poly}(n, d, B, 1/\epsilon)$.

Andoni et al. [1] were the first to give efficient algorithms for noisy polynomial reconstruction over non-Boolean domains. In particular, they gave algorithms that succeed on the unit cube but require an underlying product distribution and do not work in the agnostic setting (they also run in time exponential in the degree $d$). At a high level, the proofs of both Theorems 1.3 and 1.6 follow the same outline, but we do not know how to obtain one from the other.

1.3 Applications to Convex Piecewise Regression

We establish a novel connection between learning networks of ReLUs and a broad class of piecewise-linear regression problems studied in machine learning and optimization. The following problem was defined by Boyd and Magnani [27] as a generalization of the well-known MARS (multivariate adaptive regression splines) framework due to Friedman [13]:

Problem 1.7 (Convex Piecewise-Linear Regression: Max $k$-Affine). Let $\mathcal{C}$ be the class of functions of the form $f(\mathbf{x}) = \max(\mathbf{w}_1\cdot\mathbf{x}, \ldots, \mathbf{w}_k\cdot\mathbf{x})$ with $\mathbf{w}_1, \ldots, \mathbf{w}_k \in \mathbb{S}^{n-1}$, mapping $\mathbb{S}^{n-1}$ to $\mathbb{R}$. Let $\mathcal{D}$ be an (unknown) distribution on $\mathbb{S}^{n-1} \times [-1,1]$. Given i.i.d. examples drawn from $\mathcal{D}$, output $h$ such that $\mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}[(h(\mathbf{x}) - y)^2] \le \min_{c\in\mathcal{C}} \mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}[(c(\mathbf{x}) - y)^2] + \epsilon$.

Applying our learnability results for networks of ReLUs, we obtain the first polynomial-time algorithms for solving the above max-$k$-affine regression problem and the sum of max-$2$-affine regression problem when $k = O(1)$.
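Our algorithm for Problem 1.7 is kernel-based; purely for illustration, an alternating least-squares heuristic in the spirit of those surveyed by Boyd and Magnani can be sketched as follows (a heuristic with no approximation guarantee, not the algorithm behind Theorem 1.8 below; all names are ours):

```python
import numpy as np

def max_affine(Ws, X):
    """Evaluate f(x) = max_j (w_j . x) row-wise; Ws has shape (k, n)."""
    return (X @ Ws.T).max(axis=1)

def fit_max_affine(X, y, k, iters=30, seed=0):
    """Alternating least-squares heuristic for max-k-affine regression.
    Each round re-partitions points by their currently active affine piece,
    then refits each piece by least squares on its own points."""
    rng = np.random.default_rng(seed)
    Ws = rng.normal(size=(k, X.shape[1]))
    for _ in range(iters):
        assign = (X @ Ws.T).argmax(axis=1)       # active piece per point
        for j in range(k):
            idx = assign == j
            if idx.sum() >= X.shape[1]:          # enough points to refit piece j
                Ws[j], *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    return Ws

rng = np.random.default_rng(1)
n, m, k = 3, 4000, 2
X = rng.normal(size=(m, n)); X /= np.linalg.norm(X, axis=1, keepdims=True)
W_true = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
y = max_affine(W_true, X)                        # y = max(x_1, x_2)
Ws = fit_max_affine(X, y, k)
print(np.mean((max_affine(Ws, X) - y) ** 2))     # training squared error
```

Like $k$-means, this alternation can get stuck in local minima, which is precisely why provable algorithms for this problem were open.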
Boyd and Magnani specifically highlight the case of $k = O(1)$ and provide a variety of heuristics; we obtain the first provably efficient results.

Theorem 1.8. There is an algorithm $\mathcal{A}$ for solving the convex piecewise-linear fitting problem (cf. Problem 1.7) in time $2^{O((k/\epsilon)\log k)} \cdot n^{O(1)}$.

We can also use our results for learning networks of ReLUs to learn the so-called "leaky ReLUs" and "parameterized" ReLUs (PReLUs); see Section 4.3 for details. We obtain these results by composing various "ReLU gadgets," i.e., constant-depth networks of ReLUs with a small number of bounded-weight hidden units.

1.4 Hardness

We also prove the first hardness results for learning a single ReLU, via simple reductions from the problem of learning sparse parities with noise. These results highlight the difference between learning Boolean and real-valued functions and justify our focus on (1) input distributions over $\mathbb{S}^{n-1}$ and (2) learning problems that are not scale-invariant (for example, learning a linear threshold function over the Boolean domain is equivalent to learning over $\mathbb{S}^{n-1}$ in the distribution-free setting).

Theorem 1.9. Let $\mathcal{C}$ be the class of ReLUs over the domain $X = \{0,1\}^n$. Then any algorithm for reliably learning $\mathcal{C}$ in time $g(\epsilon)\cdot\mathrm{poly}(n)$ for any function $g$ will give a polynomial-time algorithm for learning $\omega(1)$-sparse parities with noise (for any $\epsilon = O(1)$).

Efficiently learning sparse parities (of any superconstant length) with noise is considered one of the most challenging problems in theoretical computer science.

1.5 Techniques and Related Work

We give a high-level overview of our proof. Let $\mathcal{C}$ be the class of all ReLUs, and let $S = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_m, y_m)\}$ be a training set of examples drawn i.i.d. from some arbitrary distribution $\mathcal{D}$ on $\mathbb{S}^{n-1} \times [-1,1]$.
To obtain our main result for reliably learning a single ReLU (cf. Theorem 1.3), our starting point is Optimization Problem 1 below.

Optimization Problem 1:
$$\begin{aligned} \underset{\mathbf{w}}{\text{minimize}} \quad & \sum_{i : y_i > 0} \ell(y_i, \max(0, \mathbf{w}\cdot\mathbf{x}_i)) \\ \text{subject to} \quad & \max(0, \mathbf{w}\cdot\mathbf{x}_i) = 0 \ \text{ for all } i \text{ such that } y_i = 0, \\ & \|\mathbf{w}\|_2 \le 1. \end{aligned}$$

In Optimization Problem 1, $\ell$ denotes the loss function used to define $\mathcal{L}_{>0}$. Using standard generalization error arguments, it is possible to show that (for reasonable choices of $\ell$) if $\mathbf{w}$ is an optimal solution to Optimization Problem 1 when run on a polynomial-size sample $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_m, y_m)$ drawn from $\mathcal{D}$, then it is sufficient to output the hypothesis $h(\mathbf{x}) := \max(0, \mathbf{w}\cdot\mathbf{x})$. Unfortunately, Optimization Problem 1 is not convex in $\mathbf{w}$, and hence it may not be possible to find an optimal solution in polynomial time. Instead, we will give an efficient approximate solution that will suffice for reliable learning. Our starting point will be to prove the existence of low-degree, low-weight polynomial approximators for every $c \in \mathcal{C}$. The polynomial method has a well-established history in computational learning theory (e.g., Kalai et al. [19] for agnostically learning halfspaces under distributional assumptions), and we can apply classical techniques from approximation theory and recent work due to Sherstov [32] to construct low-weight, low-degree approximators for any ReLU. We can then relax Optimization Problem 1 to the space of low-weight polynomials and follow the approach of Shalev-Shwartz et al. [31], who used tools from Reproducing Kernel Hilbert Spaces (RKHS) to learn low-weight polynomials efficiently (Shalev-Shwartz et al. focused on a relaxation of the 0/1 loss for halfspaces). The main challenge is to obtain reliability, i.e., to simultaneously minimize the false-positive rate and the loss dictated by the objective function.
To do this, we take a "dual-loss" approach and carefully construct two loss functions that will both be minimized with high probability. Proving that these losses generalize for a large class of objective functions is subtle and requires "clipping" in order to apply the appropriate Rademacher bound. Our final output hypothesis is $\max(0, h)$, where $h$ is a "clipped" version of the optimal low-weight, low-degree polynomial on the training data, appropriately kernelized.

Our learning algorithms for networks of ReLUs are obtained by generalizing a composition technique due to Zhang et al. [36], who considered networks of "smooth" activation functions computed by power series (we discuss this more in Section 4). Using a sequence of "gadget" reductions, we then show that even small-size networks of ReLUs are surprisingly powerful, yielding the first set of provably efficient algorithms for a variety of piecewise-linear regression problems in high dimension.

Note: A recent manuscript appearing on the arXiv due to R. Arora et al. [2] considers the complexity of training depth-2 networks of ReLUs with $k$ hidden units on a sample of size $m$, but when the dimension $n = 1$. They give a proper learning algorithm that runs in time exponential in $k$ and $m$. These concept classes, however, can be improperly learned in time polynomial in $k$ and $m$ using a straightforward reduction to piecewise-linear regression on the real line.

2 Preliminaries

2.1 Notation

The input space is denoted by $X$ and the output space by $Y$. In most of this paper, we consider settings in which $X = \mathbb{S}^{n-1}$, the unit sphere in $\mathbb{R}^n$ (see footnote 4), and $Y$ is either $[0,1]$ or $[-1,1]$. Let $B_n(0, r)$ denote the origin-centered ball of radius $r$ in $\mathbb{R}^n$. We denote vectors by boldface lowercase letters such as $\mathbf{w}$ or $\mathbf{x}$, and $\mathbf{w}\cdot\mathbf{x}$ denotes the standard scalar (dot) product.
By $\|\mathbf{w}\|$ we denote the standard $\ell_2$ (i.e., Euclidean) norm of the vector $\mathbf{w}$; when necessary, we will use subscripts to indicate other norms. If $f : \mathbb{S}^{n-1} \to \mathbb{R}$ is a real-valued function over the unit sphere, we say that a multivariate polynomial $p$ is an $\epsilon$-approximation to $f$ if $|p(\mathbf{x}) - f(\mathbf{x})| \le \epsilon$ for all $\mathbf{x} \in \mathbb{S}^{n-1}$. For a natural number $n \in \mathbb{N}$, $[n] = \{0, 1, \ldots, n\}$.

2.2 Concept Classes

Neural networks are composed of units: each unit has some $\mathbf{x} \in \mathbb{R}^n$ as input (for some value of $n$, and $\mathbf{x}$ may consist of outputs of other units), and the output is typically a linear function composed with a non-linear activation function, i.e., the output of a unit is of the form $f(\mathbf{w}\cdot\mathbf{x})$, where $\mathbf{w} \in \mathbb{R}^n$ and $f : \mathbb{R} \to \mathbb{R}$.

Definition 2.1 (Rectifier). The rectifier (denoted by $\sigma_{\mathrm{relu}}$) is an activation function defined as $\sigma_{\mathrm{relu}}(x) = \max(0, x)$.

Definition 2.2 (ReLU$(n, W)$). For $\mathbf{w} \in \mathbb{R}^n$, let $\mathrm{relu}_{\mathbf{w}} : \mathbb{R}^n \to \mathbb{R}$ denote the function $\mathrm{relu}_{\mathbf{w}}(\mathbf{x}) = \max(0, \mathbf{w}\cdot\mathbf{x})$. Let $W \in \mathbb{R}^+$; we denote by $\mathrm{ReLU}(n, W)$ the class of rectified linear units defined by $\{\mathrm{relu}_{\mathbf{w}} \mid \mathbf{w} \in B_n(0, W)\}$.

Our results on reliable learning focus on the class $\mathrm{ReLU}(n, 1)$. We define networks of ReLUs in Section 4, where we also present results on agnostic learning and reliable learning of networks of ReLUs.

Definition 2.3 ($\mathcal{P}(n, d, B)$). Let $B \in \mathbb{R}^+$ and $n, d \in \mathbb{N}$. We denote by $\mathcal{P}(n, d, B)$ the class of $n$-variate polynomials $p$ of total degree at most $d$ such that the sum of the squares of the coefficients of $p$ in the standard monomial basis is bounded by $B$.

2.3 Learning Models

We consider two learning models in this paper. The first is the standard agnostic learning model [14, 20] and the second is a generalization of the reliable agnostic learning framework [18]. We describe these models briefly; the reader may refer to the original articles for further details.
Footnote 4: All of our algorithms would also work under arbitrary distributions over the unit ball.

Definition 2.4 (Agnostic Learning [14, 20]). We say that a concept class $\mathcal{C} \subseteq Y^X$ is agnostically learnable with respect to loss function $\ell : Y' \times Y \to \mathbb{R}^+$ (where $Y \subseteq Y'$) if for every $\delta, \epsilon > 0$ there exists a learning algorithm $\mathcal{A}$ that for every distribution $\mathcal{D}$ over $X \times Y$ satisfies the following. Given access to examples drawn from $\mathcal{D}$, $\mathcal{A}$ outputs a hypothesis $h : X \to Y'$ such that with probability at least $1 - \delta$,
$$\mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}[\ell(h(\mathbf{x}), y)] \le \min_{c\in\mathcal{C}} \mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}[\ell(c(\mathbf{x}), y)] + \epsilon. \quad (1)$$
Furthermore, if $X \subseteq \mathbb{R}^n$ and $s$ is a parameter that captures the representation complexity (i.e., description length) of concepts $c$ in $\mathcal{C}$, we say that $\mathcal{C}$ is efficiently agnostically learnable to error $\epsilon$ if $\mathcal{A}$ can output an $h$ satisfying Equation (1) with running time polynomial in $n$, $s$, and $1/\delta$ (see footnote 5).

Next, we formally describe our extension of the reliable agnostic learning model introduced by Kalai et al. [18] to the setting of real-valued functions (see Section 1 for motivation). Suppose the data is distributed according to some distribution $\mathcal{D}$ over $X \times [0,1]$. For $Y' \supseteq [0,1]$, let $h : X \to Y'$ be some function and let $\ell : Y' \times [0,1] \to \mathbb{R}^+$ be a loss function. We define the following two losses for $h$ with respect to the distribution $\mathcal{D}$:
$$\mathcal{L}_{=0}(h; \mathcal{D}) = \Pr_{(\mathbf{x},y)\sim\mathcal{D}}[h(\mathbf{x}) \neq 0 \wedge y = 0] \quad (2)$$
$$\mathcal{L}_{>0}(h; \mathcal{D}) = \mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}[\ell(h(\mathbf{x}), y) \cdot \mathbb{I}(y > 0)], \quad (3)$$
where $\mathbb{I}(y > 0)$ is 1 if $y > 0$ and 0 otherwise. In words, $\mathcal{L}_{=0}$ considers the zero-one error on points where the target $y$ equals 0, and $\mathcal{L}_{>0}$ considers the loss (or risk) when $y > 0$. Both of these losses are defined with respect to the distribution $\mathcal{D}$, without conditioning on the events $y = 0$ or $y > 0$.
This is necessary to make efficient learning possible: if the probability of the events $y = 0$ or $y > 0$ is too small, it is impossible for learning algorithms to make any meaningful predictions conditioned on those events.

Definition 2.5 (Reliable Agnostic Learning). We say that a concept class $\mathcal{C} \subseteq [0,1]^X$ is reliably agnostically learnable (reliably learnable for short) with respect to loss function $\ell : Y' \times [0,1] \to \mathbb{R}^+$ (where $[0,1] \subseteq Y'$) if the following holds. For every $\delta, \epsilon > 0$, there exists a learning algorithm $\mathcal{A}$ such that, for every distribution $\mathcal{D}$ over $X \times [0,1]$, when $\mathcal{A}$ is given access to examples drawn from $\mathcal{D}$, $\mathcal{A}$ outputs a hypothesis $h : X \to Y'$ such that with probability at least $1 - \delta$, the following hold:
(i) $\mathcal{L}_{=0}(h; \mathcal{D}) \le \epsilon$,
(ii) $\mathcal{L}_{>0}(h; \mathcal{D}) \le \min_{c \in \mathcal{C}^+(\mathcal{D})} \mathcal{L}_{>0}(c) + \epsilon$,
where $\mathcal{C}^+(\mathcal{D}) = \{c \in \mathcal{C} \mid \mathcal{L}_{=0}(c; \mathcal{D}) = 0\}$. Furthermore, if $X \subseteq \mathbb{R}^n$ and $s$ is a parameter that captures the representation complexity of concepts $c$ in $\mathcal{C}$, we say that $\mathcal{C}$ is efficiently reliably agnostically learnable to error $\epsilon$ if $\mathcal{A}$ can output an $h$ satisfying the above conditions with running time that is polynomial in $n$, $s$, and $1/\delta$ (see footnote 5).

2.3.1 Loss Functions

We have defined agnostic and reliable learning in terms of general loss functions. Below we describe certain properties of loss functions that are required in order for our results to hold. Let $Y$ denote the range of concepts from the concept class; this will typically be $[-1,1]$ or $[0,1]$. Let $Y' \supseteq Y$. We consider loss functions of the form $\ell : Y' \times Y \to \mathbb{R}^+$ and define the following properties:

- We say that $\ell$ is convex in its first argument if for every $y \in Y$ the function $\ell(\cdot, y)$ is convex.
- We say that $\ell$ is monotone if for every $y \in Y$, if $y'' \le y' \le y$, then $\ell(y', y) \le \ell(y'', y)$, and if $y \le y' \le y''$, then $\ell(y', y) \le \ell(y'', y)$.
Note that this is weaker than requiring that $|y' - y| \le |y'' - y|$ implies $\ell(y', y) \le \ell(y'', y)$. This latter condition is not satisfied by several commonly used loss functions, e.g., hinge loss.

Footnote 5: The accuracy parameter $\epsilon$ is purposely omitted from the definition of efficiency; in our results we will explicitly state the dependence on $\epsilon$ and for what ranges of $\epsilon$ the running time remains polynomial in the remaining parameters.

- We say that $\ell$ is $b$-bounded on the interval $[u, v]$ if for every $y \in Y$, $\ell(y', y) \le b$ for $y' \in [u, v]$.
- We say that $\ell$ is $L$-Lipschitz on the interval $[u, v]$ if for every $y \in Y$, $\ell(\cdot, y)$ is $L$-Lipschitz on the interval $[u, v]$.

The results presented in this work hold for loss functions that are convex, monotone, bounded, and Lipschitz continuous on some suitable interval. (Monotonicity is not strictly a requirement for our results, but the sample complexity bounds may be worse for non-monotone loss functions; we point this out when relevant.) These restrictions are quite mild, and virtually every loss function commonly considered in (convex approaches to) machine learning satisfies these conditions. For instance, when $Y = Y' = [0,1]$, it is easy to see that any $\ell_p$ loss function is convex, monotone, bounded by 1, and $p$-Lipschitz for $p \ge 1$.

2.4 Kernel Methods

We make use of kernel methods in our learning algorithms. For completeness, we define kernels and a few important results concerning kernel methods. The reader may refer to Hofmann et al. [16] (or any standard text) for further details. Any function $K : X \times X \to \mathbb{R}$ is called a kernel [28]. A kernel $K$ is symmetric if $K(\mathbf{x}, \mathbf{x}') = K(\mathbf{x}', \mathbf{x})$ for all $\mathbf{x}, \mathbf{x}' \in X$; $K$ is positive definite if for all $n \in \mathbb{N}$ and all $\mathbf{x}_1, \ldots, \mathbf{x}_n \in X$, the $n \times n$ matrix $\mathbf{K}$, where $\mathbf{K}_{i,j} = K(\mathbf{x}_i, \mathbf{x}_j)$, is positive semi-definite.
For any positive definite kernel, there exists a Hilbert space $\mathcal{H}$ equipped with an inner product $\langle\cdot,\cdot\rangle$ and a function $\psi : X \to \mathcal{H}$ such that for all $\mathbf{x}, \mathbf{x}' \in X$, $K(\mathbf{x}, \mathbf{x}') = \langle\psi(\mathbf{x}), \psi(\mathbf{x}')\rangle$. We refer to $\psi$ as the feature map for $K$. By convention, we will use $\cdot$ to denote the standard inner product in $\mathbb{R}^n$ and $\langle\cdot,\cdot\rangle$ for the inner product in a Hilbert space $\mathcal{H}$. When $\mathcal{H} = \mathbb{R}^n$ for some finite $n$, we will use $\langle\cdot,\cdot\rangle$ and $\cdot$ interchangeably. We will use the following variant of the polynomial kernel:

Definition 2.6 (Multinomial Kernel). Define $\psi_d : \mathbb{R}^n \to \mathbb{R}^{N_d}$, where $N_d = 1 + n + \cdots + n^d$, indexed by tuples $(k_1, \ldots, k_j) \in [n]^j$ for each $j \in \{0, 1, \ldots, d\}$, where the entry of $\psi_d(\mathbf{x})$ corresponding to tuple $(k_1, \ldots, k_j)$ equals $x_{k_1} \cdots x_{k_j}$. (When $j = 0$ we have an empty tuple and the corresponding entry is 1.) Define the kernel $\mathrm{MK}_d$ via:
$$\mathrm{MK}_d(\mathbf{x}, \mathbf{x}') = \langle\psi_d(\mathbf{x}), \psi_d(\mathbf{x}')\rangle = \sum_{j=0}^{d} (\mathbf{x}\cdot\mathbf{x}')^j.$$
Also define $\mathcal{H}_{\mathrm{MK}_d}$ to be the corresponding Reproducing Kernel Hilbert Space (RKHS).

Observe that $\mathrm{MK}_d$ is the sum of standard polynomial kernels (cf. [35]) of degree $i$ for $i \in [d]$. However, the feature map conventionally used for a standard polynomial kernel has only $\binom{n+d}{d}$ entries and, under that definition, involves coefficients of size as large as $d^{\Theta(d)}$. The feature map $\psi_d$ used by $\mathrm{MK}_d$ avoids these coefficients by using $N_d$ entries as defined above (that is, entries of $\psi_d(\mathbf{x})$ are indexed by ordered subsets of $[n]$, while entries of the standard feature map are indexed by unordered subsets of $[n]$).

Let $q : \mathbb{R}^n \to \mathbb{R}$ be a multivariate polynomial of total degree $d$. We say that a vector $\mathbf{v} \in \mathcal{H}_{\mathrm{MK}_d}$ represents $q$ if $q(\mathbf{x}) = \langle\mathbf{v}, \psi_d(\mathbf{x})\rangle$ for all $\mathbf{x} \in \mathbb{S}^{n-1}$. Note that although the feature map $\psi_d$ is fixed, a polynomial $q$ will have many representations $\mathbf{v}$ as a vector in $\mathcal{H}_{\mathrm{MK}_d}$.
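The identity $\mathrm{MK}_d(\mathbf{x},\mathbf{x}') = \sum_{j=0}^{d} (\mathbf{x}\cdot\mathbf{x}')^j$ can be checked directly against the explicit (exponentially large) feature map on tiny inputs; a sketch of ours, with coordinates indexed $0, \ldots, n-1$:

```python
import numpy as np
from itertools import product

def mk_kernel(x, xp, d):
    """Multinomial kernel MK_d(x, x') = sum_{j=0}^d (x . x')^j."""
    return sum(np.dot(x, xp) ** j for j in range(d + 1))

def psi(x, d):
    """Explicit feature map: one entry per ordered tuple (k_1, ..., k_j),
    j = 0..d, with value x_{k_1} * ... * x_{k_j}.  The j = 0 term is the
    empty tuple, contributing the constant entry 1.  Exponential in d,
    so this is for illustration on tiny n and d only."""
    n = len(x)
    feats = []
    for j in range(d + 1):
        for ks in product(range(n), repeat=j):
            feats.append(np.prod([x[k] for k in ks]))
    return np.array(feats)

rng = np.random.default_rng(0)
x, xp = rng.normal(size=3), rng.normal(size=3)
d = 3
lhs = np.dot(psi(x, d), psi(xp, d))
rhs = mk_kernel(x, xp, d)
print(lhs, rhs)   # equal up to floating-point error
```

The point of the kernel trick here is exactly that the right-hand side costs $O(n + d)$ to evaluate while the explicit feature vector has $N_d = 1 + n + \cdots + n^d$ entries.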
Furthermore, observe that the Euclidean norms $\langle\mathbf{v},\mathbf{v}\rangle$ of these representations may not be equal. The following example will play an important role in our algorithms for learning ReLUs. Let $\mathbf{w} \in \mathbb{R}^n$ and let a univariate degree-$d$ polynomial $p(t) = \sum_{i=0}^{d} \beta_i t^i$ be given. Define the multivariate polynomial $p_{\mathbf{w}}(\mathbf{x}) := p(\mathbf{w}\cdot\mathbf{x})$. Consider the representation of $p_{\mathbf{w}}$ as an element of $\mathcal{H}_{\mathrm{MK}_d}$ defined as follows: the entry of index $(k_1, \ldots, k_j) \in [n]^j$ of the representation equals $\beta_j \cdot \prod_{i=1}^{j} w_{k_i}$ for $j \in [d]$. Abusing notation, we use $p_{\mathbf{w}}$ to denote both the multivariate polynomial and the vector in $\mathcal{H}_{\mathrm{MK}_d}$. The following lemma establishes that $p_{\mathbf{w}} \in \mathcal{H}_{\mathrm{MK}_d}$ is indeed a representation of the polynomial $p_{\mathbf{w}}$, and gives a bound on $\langle p_{\mathbf{w}}, p_{\mathbf{w}}\rangle$. The proof follows an analysis applied by Shalev-Shwartz et al. [31, Lemma 2.4] to a different kernel (cf. Remark 2.8 below).

Lemma 2.7. Let $p(t) = \sum_{i=0}^{d} \beta_i t^i$ be a given univariate polynomial with $\sum_{i=0}^{d} \beta_i^2 \le B$. For $\mathbf{w}$ such that $\|\mathbf{w}\| \le 1$, consider the polynomial $p_{\mathbf{w}}(\mathbf{x}) := p(\mathbf{w}\cdot\mathbf{x})$. Then $p_{\mathbf{w}}$ is represented by the vector $p_{\mathbf{w}} \in \mathcal{H}_{\mathrm{MK}_d}$ defined above. Moreover, $\langle p_{\mathbf{w}}, p_{\mathbf{w}}\rangle \le B$.

Proof. To see that $p_{\mathbf{w}}(\mathbf{x}) = \langle p_{\mathbf{w}}, \psi_d(\mathbf{x})\rangle$ for all $\mathbf{x} \in \mathbb{R}^n$, observe that
$$p_{\mathbf{w}}(\mathbf{x}) = p(\mathbf{w}\cdot\mathbf{x}) = \sum_{i=0}^{d} \beta_i \, (\mathbf{w}\cdot\mathbf{x})^i = \sum_{i=0}^{d} \sum_{(k_1,\ldots,k_i)\in[n]^i} \beta_i \, w_{k_1}\cdots w_{k_i}\, x_{k_1}\cdots x_{k_i} = \langle p_{\mathbf{w}}, \psi_d(\mathbf{x})\rangle.$$
Furthermore, we can compute
$$\langle p_{\mathbf{w}}, p_{\mathbf{w}}\rangle = \sum_{i=0}^{d} \sum_{(k_1,\ldots,k_i)\in[n]^i} \beta_i^2 \, w_{k_1}^2\cdots w_{k_i}^2 = \sum_{i=0}^{d} \beta_i^2 \Big(\sum_{k} w_k^2\Big)^i = \sum_{i=0}^{d} \beta_i^2 \, \|\mathbf{w}\|_2^{2i} \le \sum_{i=0}^{d} \beta_i^2 \le B.$$

Remark 2.8. Shalev-Shwartz et al. [31] proved a bound on the Euclidean norm of representations of polynomials of the form $p(\mathbf{w}\cdot\mathbf{x})$ in the RKHS corresponding to the kernel function $K(\mathbf{x}, \mathbf{y}) = \frac{1}{1 - \frac{1}{2}\langle\mathbf{x},\mathbf{y}\rangle}$.
This allowed them to represent functions computed by power series, as opposed to polynomials of (finite) degree $d$. However, for degree-$d$ polynomials, the use of their kernel results in a Euclidean norm bound that is a factor of $2^d$ worse than what we obtain from Lemma 2.7. This difference is central to our results on noisy polynomial reconstruction in Section 3.5, where we address this issue in more technical detail.

2.5 Generalization Bounds

We make use of the following standard generalization bound for hypothesis classes with small Rademacher complexity. Readers unfamiliar with Rademacher complexity may refer to the paper of Bartlett and Mendelson [4].

Theorem 2.9 (Bartlett and Mendelson [4]). Let $\mathcal{D}$ be a distribution over $X \times Y$ and let $\ell : Y' \times Y \to \mathbb{R}^+$ (where $Y \subseteq Y' \subseteq \mathbb{R}$) be a $b$-bounded loss function that is $L$-Lipschitz in its first argument. Let $\mathcal{F} \subseteq (Y')^X$ and for any $f \in \mathcal{F}$, let $\mathcal{L}(f; \mathcal{D}) := \mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}[\ell(f(\mathbf{x}), y)]$ and $\widehat{\mathcal{L}}(f; S) := \frac{1}{m}\sum_{i=1}^{m} \ell(f(\mathbf{x}_i), y_i)$, where $S = ((\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_m, y_m)) \sim \mathcal{D}^m$. Then for any $\delta > 0$, with probability at least $1 - \delta$ (over the random sample draw for $S$), simultaneously for all $f \in \mathcal{F}$, the following is true:
$$\left|\mathcal{L}(f; \mathcal{D}) - \widehat{\mathcal{L}}(f; S)\right| \le 4 \cdot L \cdot \mathcal{R}_m(\mathcal{F}) + 2 \cdot b \cdot \sqrt{\frac{\log(1/\delta)}{2m}},$$
where $\mathcal{R}_m(\mathcal{F})$ is the Rademacher complexity of the function class $\mathcal{F}$.

We will combine the following two theorems with Theorem 2.9 above to bound the generalization error of our algorithms for agnostic and reliable learning.

Theorem 2.10 (Kakade et al. [17]). Let $\mathcal{X}$ be a subset of a Hilbert space equipped with inner product $\langle\cdot,\cdot\rangle$ such that for each $\mathbf{x} \in \mathcal{X}$, $\langle\mathbf{x},\mathbf{x}\rangle \le X^2$, and let $\mathcal{W} = \{\mathbf{x} \mapsto \langle\mathbf{x},\mathbf{w}\rangle \mid \langle\mathbf{w},\mathbf{w}\rangle \le W^2\}$ be a class of linear functions. Then it holds that $\mathcal{R}_m(\mathcal{W}) \le X \cdot W \cdot \sqrt{\frac{1}{m}}$.
The following result as stated appears in [4] but is originally attributed to [24].

Theorem 2.11 (Bartlett and Mendelson [4], Ledoux and Talagrand [24]). Let $\psi : \mathbb{R} \to \mathbb{R}$ be Lipschitz with constant $L_\psi$ and suppose that $\psi(0) = 0$. Let $Y \subseteq \mathbb{R}$, and for a function $f \in Y^X$, let $\psi \circ f$ denote the standard composition of $\psi$ and $f$. Finally, for $\mathcal{F} \subseteq Y^X$, let $\psi \circ \mathcal{F} = \{ \psi \circ f : f \in \mathcal{F} \}$. It holds that $\mathcal{R}_m(\psi \circ \mathcal{F}) \le 2 \cdot L_\psi \cdot \mathcal{R}_m(\mathcal{F})$.

2.6 Approximation Theory

First, we show that the rectifier activation function $\sigma_{\mathrm{relu}}(x) = \max(0, x)$ can be $\epsilon$-approximated by a polynomial of degree $O(1/\epsilon)$. This result follows from Jackson's theorem (see, e.g., [29]). For convenience in later proofs, we will in fact require that the polynomial also take values in the range $[0, 1]$ on the interval $[-1, 1]$. This is easily achieved by starting from the polynomial obtained from Jackson's theorem and applying elementary transformations.

Lemma 2.12. Let $\sigma_{\mathrm{relu}}(x) = \max(0, x)$ and $\epsilon \in (0, 1)$. There exists a polynomial $p$ of degree $O(1/\epsilon)$ such that for all $x \in [-1, 1]$, $|\sigma_{\mathrm{relu}}(x) - p(x)| \le \epsilon$ and $p([-1, 1]) \subseteq [0, 1]$.

Proof. We can express $\sigma_{\mathrm{relu}}(x) = \max(0, x)$ as $\sigma_{\mathrm{relu}}(x) = (x + |x|)/2$. We know from Jackson's theorem [29] that there exists a polynomial $\tilde{p}$ of degree $O(1/\epsilon)$ such that for all $x \in [-1, 1]$, $\big| |x| - \tilde{p}(x) \big| \le \frac{\epsilon}{2 - \epsilon}$. Consider the polynomial $\bar{p}(x) = \frac{\tilde{p}(x) + x}{2}$, which satisfies, for any $x \in [-1, 1]$,
$$|\sigma_{\mathrm{relu}}(x) - \bar{p}(x)| = \left| \frac{|x| + x}{2} - \frac{\tilde{p}(x) + x}{2} \right| = \left| \frac{|x| - \tilde{p}(x)}{2} \right| \le \frac{\epsilon}{2(2 - \epsilon)}.$$
Finally, let $p(x) = \frac{2 - \epsilon}{2}\big( \bar{p}(x) - \frac{1}{2} \big) + \frac{1}{2}$. For $x \in [-1, 1]$ we have
$$|\sigma_{\mathrm{relu}}(x) - p(x)| \le \frac{\epsilon}{2}\,|\sigma_{\mathrm{relu}}(x)| + \frac{2 - \epsilon}{2}\,|\sigma_{\mathrm{relu}}(x) - \bar{p}(x)| + \frac{1}{2}\left| \frac{2 - \epsilon}{2} - 1 \right| \le \epsilon.$$
Furthermore, it is clearly the case that $p([-1, 1]) \subseteq [0, 1]$.
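The construction in the proof of Lemma 2.12 can be carried out numerically. In the sketch below, a Chebyshev least-squares fit to $|x|$ stands in for the (non-constructive) Jackson-theorem polynomial $\tilde{p}$; the degree 40 and $\epsilon = 0.1$ are illustrative choices of ours, not parameters from the paper.

```python
import numpy as np

eps = 0.1  # illustrative accuracy parameter
xs = np.linspace(-1.0, 1.0, 2001)

# Stand-in for the Jackson polynomial: a degree-40 Chebyshev least-squares
# fit to |x|, comfortably within the eps / (2 - eps) error needed below.
p_tilde = np.polynomial.Chebyshev.fit(xs, np.abs(xs), 40)

def p_bar(x):
    # p_bar(x) = (p_tilde(x) + x) / 2 approximates (|x| + x) / 2 = relu(x)
    return (p_tilde(x) + x) / 2.0

def p(x):
    # Affine rescaling from the proof, forcing p([-1, 1]) into [0, 1]
    return (2.0 - eps) / 2.0 * (p_bar(x) - 0.5) + 0.5

relu_vals = np.maximum(0.0, xs)
max_err = float(np.max(np.abs(relu_vals - p(xs))))
```

On the evaluation grid, `max_err` stays below `eps` and `p` stays inside $[0, 1]$, matching the two guarantees of the lemma.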
We remark that a consequence of the linear relationship between $\sigma_{\mathrm{relu}}(x)$ and $|x|$ is that the degree given by Jackson's theorem is essentially the lowest possible [29]. Lemma 2.12 asserts the existence of a (relatively) low-degree approximation $p$ to the rectifier activation function $\sigma_{\mathrm{relu}}$. We will also require a bound on the sum of the squares of the coefficients of $p$. Even though Lemma 2.12 is non-constructive, we are nonetheless able to obtain such a bound below via standard interpolation methods.

Lemma 2.13. Let $p(t) = \sum_{i=0}^d \beta_i t^i$ be a univariate polynomial of degree $d$. Let $M$ be such that $\max_{t \in [-1,1]} |p(t)| \le M$. Then
$$\sum_{i=0}^d \beta_i^2 \le (d + 1) \cdot (4e)^{2d} \cdot M^2.$$

Proof. Lemma 4.1 of Sherstov [32] states that for any polynomial satisfying the conditions in the statement of the lemma, the following holds for all $i \in \{0, \ldots, d\}$:
$$|\beta_i| \le (4e)^d \max_{j = 0, \ldots, d} \left| p\!\left( \frac{j}{d} \right) \right| \le (4e)^d \cdot M.$$
We then have that $\sum_{i=0}^d \beta_i^2 = \sum_{i=0}^d |\beta_i|^2 \le (d + 1) \cdot (4e)^{2d} \cdot M^2$.

Theorem 2.14. Let $\mathcal{C} = \mathrm{ReLU}(n, W)$ (for $W \ge 1$) and $\epsilon \in (0, 1)$. Let $X = S^{n-1}$. For $x, x' \in X$, consider the kernel $MK_d$, with $\mathcal{H}_{MK_d}$ and $\psi_d$ the corresponding RKHS and feature map (cf. Definition 2.6). Then for every $w \in \mathbb{R}^n$ with $\|w\| \le W$, there exists a multivariate polynomial $p_w$ of degree at most $O(W/\epsilon)$ such that, for every $x \in S^{n-1}$, $|\mathrm{relu}_w(x) - p_w(x)| \le \epsilon$. Furthermore, $p_w(S^{n-1}) \subseteq [0, W]$, and $p_w$, when viewed as a member of $\mathcal{H}_{MK_d}$ as described in Section 2.4, satisfies $\langle p_w, p_w \rangle \le W^2 \cdot 2^{O(W/\epsilon)}$.

Proof. Let $p$ be the univariate polynomial of degree $d = O(W/\epsilon)$ given by Lemma 2.12 that satisfies $|p(x) - \sigma_{\mathrm{relu}}(x)| \le \frac{\epsilon}{W}$ for $x \in [-1, 1]$. Let $p(x) = \sum_{i=0}^d \beta_i x^i$; then by Lemma 2.13, we have $\sum_{i=0}^d \beta_i^2 \le (d + 1) \cdot (4e)^{2d} = 2^{O(W/\epsilon)}$ (as $|p(x)| \le 1$ for $x \in [-1, 1]$).
Let $q$ be the univariate polynomial defined as $q(x) = W \cdot p(x/W)$ for $W \ge 1$. The degree of $q$ is $d$, the same as that of $p$, and if $\alpha_i$ are the coefficients of $q$, we have $\sum_{i=0}^d \alpha_i^2 \le W^2 \cdot \sum_{i=0}^d \beta_i^2 \le W^2 \cdot 2^{O(W/\epsilon)}$. Let $p_w(x) = q(w \cdot x)$. Note that
$$|p_w(x) - \mathrm{relu}_w(x)| = \big| W \cdot p(w \cdot x / W) - W \cdot \sigma_{\mathrm{relu}}(w \cdot x / W) \big| \le W \cdot \frac{\epsilon}{W} = \epsilon,$$
and $p_w(S^{n-1}) \subseteq q([-1, 1]) \subseteq [0, W]$. Finally, by applying Lemma 2.7, we get that $\langle p_w, p_w \rangle \le W^2 \cdot 2^{O(W/\epsilon)}$.

3 Reliably Learning the ReLU

In this section, we focus on the problem of reliably learning a single rectified linear unit with weight vector of norm bounded by 1, i.e., the concept class $\mathrm{ReLU}(n, 1)$. Specifically, our goal is to prove Theorem 1.3 from Section 1.2. Below we describe the algorithm and then give a full proof of Theorem 1.3.

3.1 Overview of the Algorithm and Its Analysis

In order to reliably learn ReLUs, it would suffice to solve Optimization Problem 1 (see Section 1). This mathematical program, however, is not convex; hence, we consider a suitable convex relaxation. The convex relaxation optimizes over polynomials of a suitable degree. Theorem 2.14 shows that any concept in $\mathrm{ReLU}(n, 1)$ can be uniformly approximated to error $\epsilon$ by a degree-$O(1/\epsilon)$ polynomial. It will be more convenient to view this polynomial as an element of the RKHS $\mathcal{H}_{MK_d}$ defined in Definition 2.6. Recall that the corresponding kernel is $MK_d(x, x') = \sum_{i=0}^d (x \cdot x')^i$ and the feature map is denoted $\psi_d$. Thus, instead of minimizing over $w$ directly as in Optimization Problem 1, Optimization Problem 2 (below) minimizes over $v \in \mathcal{H}_{MK_d}$ of suitably bounded norm. In particular, we know that for any $w$, the corresponding polynomial $p_w$ that $\epsilon$-approximates $\max(0, w \cdot x)$, when viewed as an element of $\mathcal{H}_{MK_d}$, satisfies $\langle p_w, p_w \rangle \le B = 2^{O(1/\epsilon)}$ (see Theorem 2.14).
Recall that $\langle p_w, \psi_d(x) \rangle = p_w(x)$. Thus, we have the following optimization problem:

Optimization Problem 2
$$\begin{aligned} \underset{v \in \mathcal{H}_{MK_d}}{\text{minimize}} \quad & \sum_{i : y_i > 0} \ell(\langle v, \psi_d(x_i) \rangle, y_i) \\ \text{subject to} \quad & \langle v, \psi_d(x_i) \rangle \le \epsilon \ \text{ for all } i \text{ such that } y_i = 0, \\ & \langle v, v \rangle \le B. \end{aligned}$$

Clearly, if $w$ is a feasible solution to Optimization Problem 1, then the corresponding element $p_w \in \mathcal{H}_{MK_d}$ is a feasible solution to Optimization Problem 2. We consider the value of the program for the feasible solution $p_w$. For every $x \in S^{n-1}$, $p_w(x) = \langle p_w, \psi_d(x) \rangle \in [0, 1]$. Assuming that the loss function $\ell$ is $L$-Lipschitz in its first argument in the interval $[0, 1]$, we have
$$\left| \sum_{i : y_i > 0} \ell(\mathrm{relu}_w(x_i), y_i) - \sum_{i : y_i > 0} \ell(\langle p_w, \psi_d(x_i) \rangle, y_i) \right| \le |\{ i \mid y_i > 0 \}| \cdot L \cdot \epsilon.$$
Thus, an optimal solution to Optimization Problem 2 achieves a loss on the training data that is within $|\{ i \mid y_i > 0 \}| \cdot L \cdot \epsilon$ of that achieved by the optimal solution to Optimization Problem 1.

While Optimization Problem 2 is convex, it is still not trivial to solve efficiently. For one, the RKHS $\mathcal{H}_{MK_d}$ has dimension $n^{\Theta(d)}$, so materializing vectors in it explicitly requires $n^{\Theta(d)}$ time, whereas Theorem 1.3 promises a learning algorithm with running time $2^{O(1/\epsilon)} \cdot n^{O(1)} \ll n^{O(d)}$. As in Shalev-Shwartz et al. [31], we apply the Representer Theorem (see, e.g., [6]) to guarantee that Optimization Problem 2 can be solved in time that is polynomial in the number of samples used. The Representer Theorem states that for any vector $v$, there exists a vector $v_\alpha = \sum_{i=1}^m \alpha_i \psi_d(x_i)$ for $\alpha_1, \ldots, \alpha_m \in \mathbb{R}$ such that the loss function of Optimization Problem 2 subject to the constraint $\langle v, v \rangle \le B$ does not increase when $v$ is replaced with $v_\alpha$. Crucially, we may further constrain these vectors $v_\alpha$ to obey the inequality $\langle v_\alpha, \psi_d(x_i) \rangle \le \epsilon$ for all $i$ such that $y_i = 0$.
Thus, Optimization Problem 2 can be reformulated in terms of the variable vector $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_m)$. This mathematical program is described as Optimization Problem 3 below.

Optimization Problem 3
$$\begin{aligned} \underset{\boldsymbol{\alpha} \in \mathbb{R}^m}{\text{minimize}} \quad & \sum_{i : y_i > 0} \ell\Big( \sum_{j=1}^m \alpha_j \, MK_d(x_j, x_i), \; y_i \Big) \\ \text{subject to} \quad & \sum_{j=1}^m \alpha_j \cdot MK_d(x_j, x_i) \le \epsilon \ \text{ for all } i \text{ such that } y_i = 0, \\ & \sum_{i,j=1}^m \alpha_i \cdot \alpha_j \cdot MK_d(x_i, x_j) \le B. \end{aligned}$$

Let $K$ denote the $m \times m$ Gram matrix whose $(i, j)$-th entry is $MK_d(x_i, x_j)$. Using the notation $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_m)$, the last constraint is equivalent to $\boldsymbol{\alpha}^T K \boldsymbol{\alpha} \le B$. As $K \succeq 0$, this defines a convex subset of $\mathbb{R}^m$. The remaining constraints are linear in $\boldsymbol{\alpha}$, and whenever the loss function $\ell$ is convex in its first argument, the resulting program is convex. Thus, Optimization Problem 3 can be solved in time polynomial in $m$.

3.2 Description of the Output Hypothesis

Let $\boldsymbol{\alpha}^*$ denote an optimal solution to Optimization Problem 3 and let $f(\cdot) = \sum_{i=1}^m \alpha_i^* MK_d(x_i, \cdot)$. To obtain strong bounds on the generalization error of our hypothesis, our algorithm does not simply output $f$ itself. The obstacle is that, although $f$ (viewed as an element of $\mathcal{H}_{MK_d}$) satisfies $\langle f, f \rangle \le B$, the best bound we can obtain on $|f(x)| = |\langle f, \psi_d(x) \rangle|$ for $x \in S^{n-1}$ is $\sqrt{B}$, by the Cauchy-Schwarz inequality. Observe that for many commonly used loss functions, such as the squared loss, this may result in a very poor Lipschitz constant and bound on the loss function when applied to $f$ in the interval $[-\sqrt{B}, \sqrt{B}]$ (recall that the only bound we have is $B = 2^{O(1/\epsilon)}$). Hence, a direct application of standard generalization bounds (cf. Section 2.5) yields a very weak bound on the generalization error of $f$ itself.
For example, suppose $y \in \{0, 1\}$ and consider the loss function $\ell(y', y) = \exp(-y'(2y - 1) + 1) - 1$ if $y'(2y - 1) \le 1$ and $\ell(y', y) = 0$ otherwise (this loss function is like the hinge loss, but with the linear part replaced by an exponential). The Lipschitz constant of $\ell$ on the interval $[-\sqrt{B}, \sqrt{B}]$ is exponentially large in $B$, which would lead to a sample complexity bound that is doubly exponentially large in $1/\epsilon$. To address this issue, we will "clip" the function so that it always outputs a value in $[0, 1]$:

Definition 3.1. Define the function $\mathrm{clip}_{a,b} : \mathbb{R} \to [a, b]$ as follows: $\mathrm{clip}_{a,b}(x) = a$ for $x \le a$, $\mathrm{clip}_{a,b}(x) = x$ for $a \le x \le b$, and $\mathrm{clip}_{a,b}(x) = b$ for $b \le x$.

The hypothesis $h$ output by our algorithm is as follows:
$$h(x) = \begin{cases} 0 & \text{if } \mathrm{clip}_{0,1}(f(x)) \le 2\epsilon, \\ \mathrm{clip}_{0,1}(f(x)) & \text{otherwise.} \end{cases}$$
We use a fact due to Ledoux and Talagrand on the Rademacher complexity of composed function classes (Theorem 2.11) to bound the generalization error. Clipping comes at a small cost, in the sense that it forces us to require that the loss function be monotone. However, we can handle non-monotone losses if the output hypothesis is not clipped, albeit with sample complexity bounds that depend polynomially on the Lipschitz constant and bound of the loss in the interval $[-\sqrt{B}, \sqrt{B}]$ as opposed to $[0, 1]$.

3.3 Formal Version of Theorem 1.3 and Its Proof

The rest of this section is devoted to the proof of Theorem 1.3 (or, more precisely, its formal variant Theorem 3.2 below, which makes explicit the conditions on the loss function $\ell$ that are required for the theorem to hold).
In particular, we show that whenever the sample size $m$ is a sufficiently large polynomial in $2^{O(1/\epsilon)}$, $n$, and $\log(1/\delta)$, the hypothesis $h$ output by the algorithm satisfies $\mathcal{L}_{=0}(h; \mathcal{D}) = O(\epsilon)$ and $\mathcal{L}_{>0}(h; \mathcal{D}) \le \min_{c \in \mathcal{C}^+(\mathcal{D})} \mathcal{L}_{>0}(c; \mathcal{D}) + O(\epsilon)$, where $\mathcal{C}^+(\mathcal{D}) = \{ \mathrm{relu}_w \in \mathrm{ReLU}(n, 1) \mid \mathcal{L}_{=0}(\mathrm{relu}_w; \mathcal{D}) = 0 \}$. Rescaling $\epsilon$ appropriately completes the proof of Theorem 3.2.

Theorem 3.2 (Formal Version of Theorem 1.3). Let $X = S^{n-1}$ and $Y = [0, 1]$. The concept class $\mathrm{ReLU}(n, 1)$ is reliably learnable with respect to any loss function that is convex, monotone, $L$-Lipschitz, and $b$-bounded in the interval $[0, 1]$. The sample complexity and running time of the algorithm are polynomial in $n$, $b$, $\log(1/\delta)$, and $2^{O(L/\epsilon)}$. In particular, $\mathrm{ReLU}(n, 1)$ is learnable in time polynomial in $n$, $b$, and $\log(1/\delta)$ up to accuracy $\epsilon \ge \epsilon_0 = \Theta(L/\log(n))$, where $L$ is the Lipschitz constant of the loss function in the interval $[0, 1]$.

Proof. In order to prove the theorem, we need to bound the following two losses for the output hypothesis $h$:
$$\mathcal{L}_{=0}(h; \mathcal{D}) = \Pr_{(x,y) \sim \mathcal{D}}[h(x) \neq 0 \wedge y = 0] \tag{4}$$
$$\mathcal{L}_{>0}(h; \mathcal{D}) = \mathbb{E}_{(x,y) \sim \mathcal{D}}[\ell(h(x), y) \cdot \mathbb{I}(y > 0)] \tag{5}$$
First, we analyze $\mathcal{L}_{=0}(h; \mathcal{D})$; to do so, it is useful to consider a slightly different loss function $\ell_{\epsilon\text{-zo}}(y', y)$ that is $(1/\epsilon)$-Lipschitz in its first argument. We define this loss separately for the cases $y > 0$ and $y = 0$. For $y > 0$, we define $\ell_{\epsilon\text{-zo}}(y', y) := 0$ for all $y'$. For $y = 0$, we define
$$\ell_{\epsilon\text{-zo}}(y', 0) := \begin{cases} 0 & \text{if } y' \le \epsilon, \\ \frac{y' - \epsilon}{\epsilon} & \text{if } \epsilon < y' \le 2\epsilon, \\ 1 & \text{if } y' > 2\epsilon. \end{cases}$$
For $f : X \to Y$, let $\mathcal{L}_{\epsilon\text{-zo}}(f; \mathcal{D}) := \mathbb{E}_{(x,y) \sim \mathcal{D}}[\ell_{\epsilon\text{-zo}}(f(x), y)]$. Let $d = O(1/\epsilon)$ be such that Theorem 2.14 applies for the class $\mathrm{ReLU}(n, 1)$, with $\psi_d$ and $\mathcal{H}_{MK_d}$ the corresponding feature map and Hilbert space.
Define $\mathcal{F}_B \subset \mathcal{H}_{MK_d}$ as the set of all $f \in \mathcal{H}_{MK_d}$ such that $\langle f, f \rangle \le B$. Observe that for all $x \in X = S^{n-1}$, $\langle \psi_d(x), \psi_d(x) \rangle = \sum_{i=0}^d (x \cdot x)^i = d + 1$. Moreover, the function $\mathrm{clip}_{0,1} : \mathbb{R} \to [0, 1]$ satisfies $\mathrm{clip}_{0,1}(0) = 0$, and $\mathrm{clip}_{0,1}$ is 1-Lipschitz. Thus, Theorems 2.10 and 2.11 imply the following:
$$\mathcal{R}_m(\mathcal{F}_B) \le \sqrt{\frac{(d+1) \cdot B}{m}}, \tag{6}$$
$$\mathcal{R}_m(\mathrm{clip}_{0,1} \circ \mathcal{F}_B) \le 2 \cdot \sqrt{\frac{(d+1) \cdot B}{m}}. \tag{7}$$
The loss function $\ell_{\epsilon\text{-zo}}$ is $(1/\epsilon)$-Lipschitz in its first argument and 1-bounded on all of $\mathbb{R}$, so in particular on the interval $[0, 1]$; the loss function $\ell$ (used for $\mathcal{L}_{>0}$) is $L$-Lipschitz in its first argument and $b$-bounded in the interval $[0, 1]$ (by assumption in the theorem statement). We assume the following bound on $m$ (note that it is polynomial in all the required factors):
$$m \ge \frac{1}{\epsilon^2}\left( 8 \max\{L, \epsilon^{-1}\} \sqrt{(d+1) \cdot B} + \max\{b, 1\} \cdot \sqrt{2 \log \frac{1}{\delta}} \right)^2. \tag{8}$$
Representative Sample Assumption: In the rest of the proof we assume that for the sample $S \sim \mathcal{D}^m$ used in the algorithm, the following hold for the loss functions $\ell_{\epsilon\text{-zo}}$ and $\ell$ and for all $f \in \mathcal{F}_B$:
$$|\mathcal{L}_{\epsilon\text{-zo}}(f; \mathcal{D}) - \widehat{\mathcal{L}}_{\epsilon\text{-zo}}(f; S)| \le \epsilon \tag{9}$$
$$|\mathcal{L}_{>0}(\mathrm{clip}_{0,1} \circ f; \mathcal{D}) - \widehat{\mathcal{L}}_{>0}(\mathrm{clip}_{0,1} \circ f; S)| \le \epsilon \tag{10}$$
Theorems 2.9, 2.10, and 2.11, together with the bounds on the Rademacher complexity given by (6) and (7) and the facts that $\ell_{\epsilon\text{-zo}}$ is $(1/\epsilon)$-Lipschitz and 1-bounded on $\mathbb{R}$ and that $\ell$ is $L$-Lipschitz and $b$-bounded on $[0, 1]$, imply that for $m$ satisfying (8) this is the case with probability at least $1 - 2\delta$; we allow the algorithm to fail with probability $2\delta$. Now consider the following bound on $\mathcal{L}_{=0}(h; \mathcal{D})$.
$$\mathcal{L}_{=0}(h; \mathcal{D}) = \Pr_{(x,y) \sim \mathcal{D}}[h(x) > 0 \wedge y = 0] \le \mathbb{E}_{(x,y) \sim \mathcal{D}}[\ell_{\epsilon\text{-zo}}(f(x), y)] \tag{11}$$
$$= \mathcal{L}_{\epsilon\text{-zo}}(f; \mathcal{D}) \le \widehat{\mathcal{L}}_{\epsilon\text{-zo}}(f; S) + \epsilon \le \epsilon. \tag{12}$$
Above, in (11) we use the fact that for any $x$ such that $h(x) > 0$, it must be the case that $f(x) > 2\epsilon$, and hence if $h(x) > 0$ and $y = 0$, then $\ell_{\epsilon\text{-zo}}(f(x), y) = 1$. Inequality (12) holds under the representative sample assumption using (9), together with the observation that $f$ is feasible for Optimization Problem 3, so $f(x_i) \le \epsilon$, and hence $\ell_{\epsilon\text{-zo}}(f(x_i), y_i) = 0$, for every $i$ with $y_i = 0$; thus $\widehat{\mathcal{L}}_{\epsilon\text{-zo}}(f; S) = 0$. (Note that we have already accounted for the fact that the algorithm may fail with probability $O(\delta)$.)

Next we bound $\mathcal{L}_{>0}(h; \mathcal{D})$. We observe that for a loss function $\ell$ that is convex in its first argument, monotone, $L$-Lipschitz, and $b$-bounded in the interval $[0, 1]$, the following holds for any $y \in (0, 1]$:
$$\ell(h(x), y) \le \ell(\mathrm{clip}_{0,1}(f(x)), y) + 2\epsilon L. \tag{13}$$
Whenever $f(x) > 2\epsilon$ or $f(x) < 0$, the statement is trivially true. If $f(x) \in [0, 2\epsilon]$, the statement follows from the $L$-Lipschitz continuity of $\ell(\cdot, y)$ in the interval $[0, 1]$. Let $w \in \mathbb{R}^n$ be such that $\mathcal{L}_{=0}(\mathrm{relu}_w; \mathcal{D}) = 0$ and let $p_w$ be the corresponding polynomial $\epsilon$-approximation in $\mathcal{H}_{MK_d}$ (cf. Theorem 2.14). Then consider the following:
$$\begin{aligned} \mathcal{L}_{>0}(h; \mathcal{D}) &= \mathbb{E}_{(x,y) \sim \mathcal{D}}[\ell(h(x), y) \cdot \mathbb{I}(y > 0)] \\ &\le \mathbb{E}_{(x,y) \sim \mathcal{D}}[\ell(\mathrm{clip}_{0,1}(f(x)), y) \cdot \mathbb{I}(y > 0)] + 2\epsilon L & (14) \\ &= \mathcal{L}_{>0}(\mathrm{clip}_{0,1} \circ f; \mathcal{D}) + 2\epsilon L \\ &\le \widehat{\mathcal{L}}_{>0}(\mathrm{clip}_{0,1} \circ f; S) + \epsilon + 2\epsilon L & (15) \\ &\le \widehat{\mathcal{L}}_{>0}(f; S) + \epsilon + 2\epsilon L & (16) \\ &\le \widehat{\mathcal{L}}_{>0}(p_w; S) + \epsilon + 2\epsilon L & (17) \\ &= \widehat{\mathcal{L}}_{>0}(\mathrm{clip}_{0,1} \circ p_w; S) + \epsilon + 2\epsilon L & (18) \\ &\le \mathcal{L}_{>0}(\mathrm{clip}_{0,1} \circ p_w; \mathcal{D}) + 2\epsilon + 2\epsilon L & (19) \\ &\le \mathcal{L}_{>0}(p_w; \mathcal{D}) + 2\epsilon + 2\epsilon L & (20) \\ &\le \mathcal{L}_{>0}(\mathrm{relu}_w; \mathcal{D}) + 2\epsilon + 3\epsilon L & (21) \end{aligned}$$
Step (14) is obtained simply by applying (13). Step (15) follows from the representative sample assumption using (10).
Step (16) follows from the monotone property of $\ell(\cdot, y)$; in particular, it must always be the case that either $y \le \mathrm{clip}_{0,1}(f(x)) \le f(x)$ or $f(x) \le \mathrm{clip}_{0,1}(f(x)) \le y$; thus $\ell(\mathrm{clip}_{0,1}(f(x)), y) \le \ell(f(x), y)$. Step (17) follows from the fact that $f$ is the optimal solution to Optimization Problem 3 and $p_w$ is a feasible solution. Steps (18) and (20) use the fact that $\mathrm{clip}_{0,1} \circ p_w = p_w$, as $p_w(S^{n-1}) \subseteq [0, 1]$. Step (19) follows under the representative sample assumption using (10). Finally, Step (21) follows since both $\mathrm{relu}_w(x) \in [0, 1]$ and $p_w(x) \in [0, 1]$ for $x \in S^{n-1}$, $|p_w(x) - \mathrm{relu}_w(x)| \le \epsilon$, and $\ell$ is $L$-Lipschitz in the interval $[0, 1]$. As the argument holds for any $w \in S^{n-1}$ satisfying $\mathcal{L}_{=0}(\mathrm{relu}_w; \mathcal{D}) = 0$, this completes the proof of the theorem after rescaling $\epsilon$ to $\epsilon/(2 + 3L)$ and $\delta$ to $\delta/2$.

Discussion: Dependence on the Lipschitz Constant

Theorem 3.2 gives a sample complexity and running time bound that is polynomial in $2^{O(L/\epsilon)}$ (in addition to being polynomial in other parameters). Recall that, here, $L$ is the Lipschitz constant of the loss function $\ell$ on the interval $[0, 1]$. For many loss functions, such as the $\ell_p$-loss for constant $p$, hinge loss, logistic loss, etc., the value of $L$ is a constant. Nonetheless, it is instructive to examine why we obtain such a dependence on $L$, and to identify some restricted settings in which this dependence can be avoided. The dependence of our running time and sample complexity bounds on $L$ arises from Steps (13) and (21) in the proof of Theorem 3.2, where the excess error compared to the optimal ReLU is bounded above by $O(L\epsilon)$. This requires us to start with a polynomial that is an $O(\epsilon/L)$-uniform approximation to the $\sigma_{\mathrm{relu}}$ activation function, to ensure excess error at most $\epsilon$.
We showed that such an approximating polynomial exists, with degree $O(L/\epsilon)$ and with coefficients whose squares sum to $2^{O(L/\epsilon)}$. It is sometimes possible to avoid this exponential dependence on $L$ in the setting of agnostic learning (as opposed to reliable learning). Indeed, in the case of agnostic learning there is no need to threshold the output at $2\epsilon$ (this thresholding contributed $2\epsilon L$ to our bound on the excess error established in Inequality (13)); simply clipping the output to be in the range of $Y$ suffices.

3.4 An Implication for Learning Convex Neural Networks

In recent work, Bach [3] considered convex relaxations of optimization problems related to learning neural networks with a single hidden layer and a non-decreasing homogeneous activation function. (His setting allows potentially uncountably many hidden units, along with a sparsity-inducing regularizer.) One specific problem raised in his paper [3, Sec. 6] is understanding the computational complexity of the following problem.

Problem 3.3 (Incremental Optimization Problem [3]). Let $\langle (x_i, y_i) \rangle_{i=1}^m \in (S^{n-1} \times [-1, 1])^m$. Find a $w \in S^{n-1}$ that maximizes $\frac{1}{m} \sum_{i=1}^m y_i \cdot \mathrm{relu}_w(x_i)$.

While Bach [3] considers the setting where $y_i \in \mathbb{R}$, rather than $[-1, 1]$, we focus on the case where $y_i \in [-1, 1]$. The problem as posed above is an optimization problem on a finite dataset that requires the output solution to be from a specific class, in this case a ReLU. In our setting, this can be rephrased as a (proper) learning problem where the goal is to output a hypothesis whose expected loss, defined by $\ell(y', y) = -y' \cdot y$, is not much larger than that of the best possible ReLU, given access to draws from a distribution over $S^{n-1} \times [-1, 1]$. Here, we relax this goal to improper learning, where the algorithm is permitted to output a hypothesis that is not itself a ReLU.
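For concreteness, the objective of Problem 3.3 is trivial to evaluate for any candidate $w$; the computational difficulty lies entirely in maximizing it over $S^{n-1}$. A minimal sketch (the helper name and the tiny dataset are our own illustrative choices):

```python
import numpy as np

def incremental_objective(w, X, y):
    # (1/m) * sum_i y_i * relu_w(x_i): the quantity Problem 3.3 maximizes
    return float(np.mean(y * np.maximum(0.0, X @ w)))

# Tiny worked example on S^1: three labeled points, one candidate direction.
X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
y = np.array([1.0, 0.5, 1.0])
w = np.array([1.0, 0.0])
val = incremental_objective(w, X, y)  # only x_1 activates: (1 * 1) / 3
```

Here `val` equals $1/3$, since $\mathrm{relu}_w$ is zero on $x_2$ and $x_3$.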
The same approach as used in the proof of Theorem 3.2 gives a polynomial-time approximation scheme for solving this problem to within $\epsilon$ of optimal, in time $2^{O(1/\epsilon)} \cdot n^{O(1)}$. We describe the modified algorithm and the minor differences in the proof below.

Optimization Problem 4
$$\begin{aligned} \underset{\boldsymbol{\alpha} \in \mathbb{R}^m}{\text{minimize}} \quad & \sum_{i=1}^m \ell\Big( \sum_{j=1}^m \alpha_j \, MK_d(x_j, x_i), \; y_i \Big) \\ \text{subject to} \quad & \sum_{i,j=1}^m \alpha_i \alpha_j \, MK_d(x_i, x_j) \le B. \end{aligned}$$

The loss function used is $\ell(y', y) = -y' y$. Let $\boldsymbol{\alpha}^*$ denote an optimal solution to Optimization Problem 4 and let $f(\cdot) = \sum_{i=1}^m \alpha_i^* MK_d(x_i, \cdot)$. In Problem 3.3, no reliability is required, and hence we do not threshold negative (or sufficiently small positive) values as was done in Section 3.2. Likewise, we do not clip the function $f$; this is because while the loss function $\ell(y', y) = -y' y$ is indeed convex in its first argument, 1-Lipschitz on $\mathbb{R}$, and $\sqrt{B}$-bounded on the interval $[-\sqrt{B}, \sqrt{B}]$ (for $y \in [-1, 1]$; note that $|f(x)| = |\langle f, \psi_d(x) \rangle| \le \sqrt{B}$ by the Cauchy-Schwarz inequality), it is very much not monotone. Thus, it is no longer the case that $\mathrm{clip}_{-1,1}(f)$ is a better hypothesis than $f$ itself. We observe that the proof of Theorem 3.2 only makes use of the monotonicity of $\ell$ to conclude that the expected loss of $\mathrm{clip}_{0,1} \circ f$ is at most that of $f$. As we no longer output a clipped hypothesis, this is not necessary.

Theorem 3.4. Given i.i.d. examples $(x_i, y_i)$ drawn from an (unknown) distribution $\mathcal{D}$ over $S^{n-1} \times [-1, 1]$, there is an algorithm that outputs a hypothesis $h$ such that $\mathbb{E}_{(x,y) \sim \mathcal{D}}[-y \cdot h(x)] \le \min_{w \in S^{n-1}} \mathbb{E}_{(x,y) \sim \mathcal{D}}[-y \cdot \mathrm{relu}_w(x)] + \epsilon$. The algorithm runs in time $2^{O(1/\epsilon)} \cdot n^{O(1)}$.
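For the specific loss $\ell(y', y) = -y'y$, the objective of Optimization Problem 4 is linear in $\boldsymbol{\alpha}$, so minimizing it over the ellipsoid $\boldsymbol{\alpha}^T K \boldsymbol{\alpha} \le B$ even admits a closed-form solution via Cauchy-Schwarz (writing $K = LL^T$ and maximizing a linear functional over a Euclidean ball). The sketch below uses that shortcut in place of a generic convex solver; the helper names and the small random dataset are our own.

```python
import numpy as np

def mk_gram(X, d):
    # Gram matrix K[i, j] = MK_d(x_i, x_j) = sum_{t=0}^d (x_i . x_j)^t
    G = X @ X.T
    return sum(G ** t for t in range(d + 1))

def solve_op4(X, y, d, B):
    # Objective: minimize -y^T K alpha subject to alpha^T K alpha <= B.
    # With K = L L^T and beta = L^T alpha, Cauchy-Schwarz gives the
    # optimum in closed form: alpha = y * sqrt(B / (y^T K y)).
    K = mk_gram(X, d)
    alpha = y * np.sqrt(B / (y @ K @ y))
    return alpha, K

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # points on S^{n-1}
y = rng.uniform(-1.0, 1.0, size=5)
alpha, K = solve_op4(X, y, d=3, B=4.0)
```

Any other feasible $\boldsymbol{\alpha}'$ (for instance, a random vector rescaled so $\boldsymbol{\alpha}'^T K \boldsymbol{\alpha}' = B$) achieves $y^T K \boldsymbol{\alpha}' \le y^T K \boldsymbol{\alpha}$, which is a quick sanity check on the closed form.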
3.5 Noisy Polynomial Reconstruction over $S^{n-1}$

In the noisy polynomial reconstruction problem, a learner is given access to examples drawn from a distribution and labeled according to a function $f(x) = p(x) + w(x)$, where $p$ is a polynomial and $w$ is an arbitrary function (corresponding to noise). We will consider a more general scenario, in which a learner is given sample access to an arbitrary distribution $\mathcal{D}$ on $S^{n-1} \times [-1, 1]$ and must output the best-fitting polynomial with respect to some fixed loss function. We say that the reconstruction is proper if, given a hypothesis $h$ encoding a multivariate polynomial, we can obtain any coefficient of our choosing in time polynomial in $n$.

Note that noisy polynomial reconstruction as defined above is equivalent to the problem of agnostically learning multivariate polynomials. We give an algorithm for noisy polynomial reconstruction whose runtime is $\mathrm{poly}(B, n, d, 1/\epsilon)$, where $B$ is an upper bound on the sum of the squared coefficients of the polynomial in the standard monomial basis. Throughout this section, we refer to the sum of the squared coefficients of $p$ as the weight of $p$. Analogous problems over the Boolean domain are thought to be computationally intractable. Andoni et al. [1] were the first to observe that over non-Boolean domains, the problem admits some non-trivial solutions. In particular, they gave an algorithm that runs in time $\mathrm{poly}(B, n, 2^d, 1/\epsilon)$ with the requirement that the underlying distribution be product over the unit cube (and that the noise function be structured).

Consider a multivariate polynomial $p$ of degree $d$ such that the sum of the squared coefficients is bounded by $B$. Denote the coefficient of the monomial $x_1^{i_1} \cdots x_n^{i_n}$ by $\beta(i_1, \ldots, i_n)$ for $(i_1, \ldots, i_n) \in \{0, \ldots, d\}^n$.
We have
$$p(x) = \sum_{\substack{(i_1, \ldots, i_n) \in \{0, \ldots, d\}^n \\ i_1 + \cdots + i_n \le d}} \beta(i_1, \ldots, i_n) \, x_1^{i_1} \cdots x_n^{i_n} \tag{22}$$
such that
$$\sum_{\substack{(i_1, \ldots, i_n) \in \{0, \ldots, d\}^n \\ i_1 + \cdots + i_n \le d}} \beta(i_1, \ldots, i_n)^2 \le B.$$
Let $M$ be the map that takes an ordered tuple $(k_1, \ldots, k_j) \in [n]^j$ for $j \in [d]$ to the tuple $(i_1, \ldots, i_n) \in \{0, \ldots, d\}^n$ such that $x_{k_1} \cdots x_{k_j} = x_1^{i_1} \cdots x_n^{i_n}$. Let $C(i_1, \ldots, i_n)$ be the number of distinct orderings of the corresponding multiset of indices; $C(i_1, \ldots, i_n)$ is a multinomial coefficient and can be computed from the multinomial theorem (cf. [34]). Observe that the number of tuples that $M$ maps to $(i_1, \ldots, i_n)$ is precisely $C(i_1, \ldots, i_n)$.

Recall that $\mathcal{H}_{MK_d}$ denotes the RKHS from Definition 2.6. Observe that the polynomial $p$ from Equation (22) is represented by the vector $v_p \in \mathcal{H}_{MK_d}$ defined as follows: for $j \in [d]$, entry $(k_1, \ldots, k_j)$ of $v_p$ equals
$$\frac{\beta(M(k_1, \ldots, k_j))}{C(M(k_1, \ldots, k_j))}.$$
It is easy to see that $v_p$ as defined represents $p$. Indeed,
$$\langle v_p, \psi_d(x) \rangle = \sum_{j=0}^d \sum_{(k_1, \ldots, k_j) \in [n]^j} \frac{\beta(M(k_1, \ldots, k_j))}{C(M(k_1, \ldots, k_j))} \, x_{k_1} \cdots x_{k_j} = \sum_{j=0}^d \sum_{\substack{(i_1, \ldots, i_n) \in \{0, \ldots, d\}^n \\ i_1 + \cdots + i_n = j}} C(i_1, \ldots, i_n) \, \frac{\beta(i_1, \ldots, i_n)}{C(i_1, \ldots, i_n)} \, x_1^{i_1} \cdots x_n^{i_n} = p(x).$$
Furthermore, we can compute
$$\langle v_p, v_p \rangle = \sum_{j=0}^d \sum_{(k_1, \ldots, k_j) \in [n]^j} \frac{\beta(M(k_1, \ldots, k_j))^2}{C(M(k_1, \ldots, k_j))^2} = \sum_{j=0}^d \sum_{\substack{(i_1, \ldots, i_n) \in \{0, \ldots, d\}^n \\ i_1 + \cdots + i_n = j}} C(i_1, \ldots, i_n) \, \frac{\beta(i_1, \ldots, i_n)^2}{C(i_1, \ldots, i_n)^2} \le \sum_{j=0}^d \sum_{\substack{(i_1, \ldots, i_n) \in \{0, \ldots, d\}^n \\ i_1 + \cdots + i_n = j}} \beta(i_1, \ldots, i_n)^2 \le B.$$

Overview of the Algorithm. Let $\mathcal{C}$ be the class of all multivariate polynomials and let $S = \{ (x_1, y_1), \ldots, (x_m, y_m) \}$ be a training set of examples drawn i.i.d.
from some arbitrary distribution $\mathcal{D}$ on $S^{n-1} \times [-1, 1]$. Similarly to Optimization Problem 2 in Section 3.1, we wish to solve Optimization Problem 5 below.

Optimization Problem 5
$$\begin{aligned} \underset{v \in \mathcal{H}_{MK_d}}{\text{minimize}} \quad & \sum_{i=1}^m \ell(\langle v, \psi_d(x_i) \rangle, y_i) \\ \text{subject to} \quad & \langle v, v \rangle \le B. \end{aligned}$$

Notice from the previous analysis that a degree-$d$ polynomial $p$ can be represented as a vector $v_p \in \mathcal{H}_{MK_d}$ such that $p(x) = \langle v_p, \psi_d(x) \rangle$ for all $x \in S^{n-1}$, and $\langle v_p, v_p \rangle \le B$. Thus, $v_p$ is a feasible solution to Optimization Problem 5. Optimization Problem 5 can easily be solved in time $\mathrm{poly}(n^d)$, but this runtime is not polynomial in $n$ and $d$. Instead, just as in Section 3.1, we use the Representer Theorem to solve Optimization Problem 5 in time that is polynomial in the number of samples used. Specifically, the Representer Theorem states that there is an optimal solution to Optimization Problem 5 of the form $v = \sum_{i=1}^m \alpha_i \psi_d(x_i)$ for some values $\alpha_1, \ldots, \alpha_m \in \mathbb{R}$. Thus, Optimization Problem 5 can be reformulated in terms of the variable vector $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_m)$. This mathematical program is described as Optimization Problem 6 below.

Optimization Problem 6
$$\begin{aligned} \underset{\boldsymbol{\alpha} \in \mathbb{R}^m}{\text{minimize}} \quad & \sum_{i=1}^m \ell\Big( \sum_{j=1}^m \alpha_j \, MK_d(x_j, x_i), \; y_i \Big) \\ \text{subject to} \quad & \sum_{i,j=1}^m \alpha_i \cdot \alpha_j \cdot MK_d(x_i, x_j) \le B. \end{aligned}$$

Via a standard analysis identical to that of Section 3.1, Optimization Problem 6 is a convex program and can be solved in time polynomial in $m$, $n$, and $d$. Let $\boldsymbol{\alpha}^*$ denote an optimal solution to Optimization Problem 6 and let $f(\cdot) = \sum_{i=1}^m \alpha_i^* MK_d(x_i, \cdot)$. The hypothesis $h$ output by our algorithm is $h(x) = \mathrm{clip}_{-1,1}(f(x))$. Observe that $h \in \mathrm{clip}_{-1,1} \circ \mathcal{C}$.

3.6 Proper Learning

As discussed in Section 3.2, we require clipping to avoid a weak bound on the generalization error for general loss functions.
If, however, we consider learning with respect to any $\ell_p$ loss for constant $p \ge 1$, it can be shown that we can do without clipping (with only a polynomial-factor increase in sample complexity). In this case, the learner $h = f$ is a proper learner in the following sense. Recalling the feature map $\psi_d$ associated with $MK_d$ from Definition 2.6, we can compute the coefficient $\beta(I)$ for $I = (i_1, \ldots, i_n) \in \{0, \ldots, d\}^n$ corresponding to the monomial $x_1^{i_1} \cdots x_n^{i_n}$:
$$\beta(I) = \sum_{i=1}^m \alpha_i^* \sum_{\substack{(k_1, \ldots, k_j) \in [n]^j, \; j \in \{0, \ldots, d\} \\ M(k_1, \ldots, k_j) = (i_1, \ldots, i_n)}} (x_i)_{k_1} \cdots (x_i)_{k_j} = \sum_{i=1}^m \alpha_i^* \, C(i_1, \ldots, i_n) \, (x_i)_1^{i_1} \cdots (x_i)_n^{i_n}.$$
The above can easily be computed, since we know $x_i$ for all $i \in [m]$, and the function $C$ can be computed efficiently using the multinomial theorem as discussed above. Hence, the hypothesis is itself a polynomial of degree at most $d$, any desired coefficient of which can be computed efficiently.

3.7 Formal Version of Theorem 1.6 and Its Proof

The rest of this section is devoted to the proof of Theorem 1.6 (or, more precisely, its formal variant Theorem 3.5 below, which makes explicit the conditions on the loss function $\ell$ that are required for the theorem to hold). In particular, we show that whenever the sample size $m$ is a sufficiently large polynomial in $d$, $n$, $B$, $1/\epsilon$, and $\log(1/\delta)$, the hypothesis $h$ output by the algorithm satisfies
$$\mathbb{E}_{(x,y) \sim \mathcal{D}}[\ell(h(x), y)] \le \mathrm{opt} + \epsilon,$$
where $\mathrm{opt}$ is the error of the best-fitting multivariate polynomial $p$ of degree $d$ whose sum of squared coefficients is bounded by $B$.

Theorem 3.5 (Formal Version of Theorem 1.6).
Let $\mathcal{P}(n, d, B)$ be the class of polynomials $p : S^{n-1} \to [-1, 1]$ in $n$ variables such that the total degree of $p$ is at most $d$ and the sum of squared coefficients of $p$ (in the standard monomial basis) is at most $B$. Let $\ell$ be any loss function that is convex, monotone, $L$-Lipschitz, and $b$-bounded in the interval $[-1, 1]$. Then $\mathcal{P}(n, d, B)$ is agnostically learnable under any (unknown) distribution over $S^{n-1} \times [-1, 1]$ with respect to the loss function $\ell$ in time $\mathrm{poly}(n, d, B, 1/\epsilon, L, b, \log \frac{1}{\delta})$. The learning algorithm is proper if the loss function $\ell$ equals $\ell_p$ for constant $p \ge 1$.

Proof. In order to prove the theorem, we need to bound $\mathcal{L}(h; \mathcal{D}) = \mathbb{E}_{(x,y) \sim \mathcal{D}}[\ell(h(x), y)]$. We know that for all $x \in S^{n-1}$, $\langle \psi_d(x), \psi_d(x) \rangle = d + 1$. Moreover, letting $v_p$ be the corresponding element of the RKHS for a polynomial $p \in \mathcal{C}$, we know from the previous analysis that $\langle v_p, v_p \rangle \le B$. In addition, the function $\mathrm{clip}_{-1,1} : \mathbb{R} \to [-1, 1]$ satisfies $\mathrm{clip}_{-1,1}(0) = 0$, and $\mathrm{clip}_{-1,1}$ is 1-Lipschitz. Thus, Theorems 2.10 and 2.11 imply the following:
$$\mathcal{R}_m(\mathcal{C}) \le \sqrt{\frac{(d+1) \cdot B}{m}}, \tag{23}$$
$$\mathcal{R}_m(\mathrm{clip}_{-1,1} \circ \mathcal{C}) \le 2 \cdot \sqrt{\frac{(d+1) \cdot B}{m}}. \tag{24}$$
By assumption, $\ell$ is $L$-Lipschitz in its first argument and $b$-bounded in the interval $[-1, 1]$. We assume the following bound on $m$ (note that it is polynomial in all the required factors):
$$m \ge \frac{1}{\epsilon^2}\left( 8 \max\{L, \epsilon^{-1}\} \sqrt{(d+1) \cdot B} + \max\{b, 1\} \cdot \sqrt{2 \log \frac{1}{\delta}} \right)^2. \tag{25}$$
In the rest of the proof we assume that for every $f \in \mathcal{P}(n, d, B)$, the following hold:
$$|\mathcal{L}(f; \mathcal{D}) - \widehat{\mathcal{L}}(f; S)| \le \epsilon, \tag{26}$$
$$|\mathcal{L}(\mathrm{clip}_{-1,1} \circ f; \mathcal{D}) - \widehat{\mathcal{L}}(\mathrm{clip}_{-1,1} \circ f; S)| \le \epsilon. \tag{27}$$
(27)

Using Theorem 2.9 together with the bounds on Rademacher complexity given by (23) and (24), and the $L$-Lipschitz continuity in its first argument and $b$-boundedness of $\ell$ on the interval $[-1,1]$, we get that the above inequalities hold with probability at least $1 - 2\delta$. We let the algorithm fail with probability $2\delta$. Now consider the following chain of inequalities to bound $\mathcal{L}(h; \mathcal{D})$. Letting $p$ be any polynomial in $\mathcal{P}(n, d, B)$,

$$\mathcal{L}(h; \mathcal{D}) \leq \hat{\mathcal{L}}(h; S) + \epsilon \qquad (28)$$
$$\leq \hat{\mathcal{L}}(f; S) + \epsilon \qquad (29)$$
$$\leq \hat{\mathcal{L}}(p; S) + \epsilon \qquad (30)$$
$$\leq \mathcal{L}(p; \mathcal{D}) + 2\epsilon. \qquad (31)$$

Above, in (28) we appeal to (27). In (29), we use the fact that $\mathcal{D}$ is a distribution over $\mathbb{S}^{n-1} \times [-1,1]$ and $\ell$ is monotone. In (30), we use the fact that the coefficient vector of $p$ is a feasible solution to Optimization Problem 5, and Optimization Problem 6 is a reformulation of Optimization Problem 5. Finally, in (31), we appeal to (26). The theorem now follows by replacing $\epsilon$ with $\epsilon/2$ and $\delta$ with $\delta/2$, and observing that the algorithm runs in time $\mathrm{poly}(m) = \mathrm{poly}(n, d, B, 1/\epsilon, L, b, \log\frac{1}{\delta})$. □

4 Networks of ReLUs

In this section, we extend our learnability results from a single ReLU to networks of ReLUs. The results in this section apply to the standard agnostic model of learning in the case that the output is a linear combination of hidden units. If the output layer is instead a single ReLU, our results can be extended to the reliable setting using techniques similar to those of Section 3. We use the same framework as Zhang et al. [36], who showed how to learn networks in which the activation function is computed exactly by a power series (with the sum of squared coefficients bounded by $B$) with respect to loss functions that are bounded on a domain that is a function of $B$. Their algorithm works by repeatedly composing the kernel of Shalev-Shwartz et al. [31] and optimizing in the corresponding RKHS.
Note, however, that since $\sigma_{\mathrm{relu}}$ is not differentiable at 0, there is no power series for $\sigma_{\mathrm{relu}}$, and the approach of Zhang et al. [36] cannot be used; their work applies to smooth activation functions whose shape is "Sigmoid-like" or "ReLU-like" but which do not approximate $\sigma_{\mathrm{relu}}$ in a precise mathematical sense. We generalize their results to activation functions that are approximated by polynomials. This allows us to capture many classes of activation functions, including ReLUs. Our clipping technique also allows us to work with respect to a broader class of loss functions.

Our results for learning networks of ReLUs have a number of new applications. First, we give the first efficient algorithms for learning "parameterized" ReLUs and "leaky" ReLUs. Second, we obtain the first polynomial-time approximation schemes for convex piecewise-linear regression (see Section 4.5 for details). As far as we are aware, no provably efficient algorithms were previously known for these types of multivariate piecewise-linear regression problems.

4.1 Notation

We use the following notation of Zhang et al. [36]. Consider a network with $D$ hidden layers and an output unit (we assume that the output is one-dimensional). Let $\sigma : \mathbb{R} \to \mathbb{R}$ denote the activation function applied at each unit of all the hidden layers. Let $n(i)$ denote the number of units in hidden layer $i$, with $n(0) = n$ (the input dimension), and let $w^{(i)}_{jk}$ be the weight of the edge between unit $k$ in layer $i$ and unit $j$ in layer $i+1$. We define $y^{(i)}_j$ to be the function that maps $\mathbf{x} \in \mathcal{X}$ to the output of unit $j$ in layer $i$:

$$y^{(i)}_j(\mathbf{x}) = \sigma\left(\sum_{k=1}^{n(i-1)} w^{(i-1)}_{jk} \cdot y^{(i-1)}_k(\mathbf{x})\right),$$

where $y^{(0)}_j(\mathbf{x}) = x_j$ for all $j$.
We similarly define $h^{(i)}_j$ to be the function that maps $\mathbf{x} \in \mathcal{X}$ to the input of unit $j$ in layer $i+1$:

$$h^{(i)}_j(\mathbf{x}) = \sum_{k=1}^{n(i)} w^{(i)}_{jk} \cdot y^{(i)}_k(\mathbf{x}).$$

Finally, we define the output of the network as a function $N : \mathbb{R}^n \to \mathbb{R}$:

$$N(\mathbf{x}) = \sum_{k=1}^{n(D)} w^{(D)}_{1k} \cdot y^{(D)}_k(\mathbf{x}).$$

For a better understanding of the above notation, consider a fully-connected network $N_1$ with a single hidden layer (these are also known as depth-2 networks) consisting of $k$ units:

$$N_1 : \mathbf{x} \mapsto \sum_{i=1}^{k} u_i \, \sigma(\mathbf{w}_i \cdot \mathbf{x}).$$

In this case, the output of unit $i \in [k]$ in the hidden layer is $y^{(1)}_i(\mathbf{x}) = \sigma(\mathbf{w}_i \cdot \mathbf{x})$, and the input to the same unit is $h^{(0)}_i(\mathbf{x}) = \mathbf{w}_i \cdot \mathbf{x}$.

We consider a class of networks with edge weights of bounded $\ell_1$ or $\ell_2$ norm, formalized as follows.

Definition 4.1 (Weight-bounded Networks). *Let $\mathcal{N}[\sigma, D, W, M]$ be the class of fully-connected networks with $D$ hidden layers and $\sigma$ as the activation function. Additionally, the weights are constrained such that $\sum_{j=1}^{n} (w^{(0)}_{ij})^2 \leq M^2$ for all units $i$ in the first hidden layer, and $\sum_{k=1}^{n(i)} |w^{(i)}_{jk}| \leq W$ for all units $j$ in all layers $i \in \{1, \ldots, D\}$. Also, the inputs to each unit are bounded in magnitude by $M$, i.e., $h^{(l)}_j(\mathbf{x}) \in [-M, M]$ with $M \geq 1$ for each $l < D$ and $j = 1, \ldots, n(l+1)$.*

We consider activation functions that can be approximated by polynomials with bounded sum of squared coefficients. We term them low-weight approximable activation functions, formalized as follows.

Definition 4.2 (Low-weight Approximable Functions). *For an activation function $\sigma : \mathbb{R} \to \mathbb{R}$, $\epsilon \in (0,1)$, $M \geq 1$, and $B \geq 1$, we say that a polynomial $p(t) = \sum_{i=0}^{d} \beta_i t^i$ is a degree-$d$, $(\epsilon, M, B)$-approximation to $\sigma$ if for every $t \in [-M, M]$, $|\sigma(t) - p(t)| \leq \epsilon$, and furthermore $\sum_{i=0}^{d} 2^i \beta_i^2 \leq B$.*
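As a sanity check on this notation, the following minimal sketch (the function names and the toy dimensions are ours, chosen purely for illustration) evaluates $N(\mathbf{x})$ layer by layer and verifies it against the explicit depth-2 form $N_1(\mathbf{x}) = \sum_i u_i \, \sigma(\mathbf{w}_i \cdot \mathbf{x})$:

```python
import random

def relu(t):
    return max(0.0, t)

def dot(w, y):
    return sum(wk * yk for wk, yk in zip(w, y))

def forward(x, weights, sigma):
    # weights[i][j][k] plays the role of w^{(i)}_{jk}: the weight from
    # unit k in layer i to unit j in layer i+1.  The last weight matrix
    # is the linear output unit (no activation), matching N(x) above.
    y = list(x)                       # y^{(0)}_j(x) = x_j
    for W in weights[:-1]:
        h = [dot(wj, y) for wj in W]  # h^{(i)}_j(x)
        y = [sigma(hj) for hj in h]   # y^{(i+1)}_j(x) = sigma(h^{(i)}_j(x))
    return dot(weights[-1][0], y)     # N(x)

# Depth-2 example: N1(x) = sum_i u_i * sigma(w_i . x)
random.seed(0)
w = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
u = [random.uniform(-1, 1) for _ in range(3)]
x = [0.5, -0.5, 0.5, -0.5]            # a unit vector in R^4

direct = sum(u[i] * relu(dot(w[i], x)) for i in range(3))
assert abs(forward(x, [w, [u]], relu) - direct) < 1e-12
```

The recursion mirrors the definitions exactly: each hidden layer first forms the inputs $h^{(i)}_j$ and then applies $\sigma$, while the output unit stays linear.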
4.2 Approximate Polynomial Networks

We first bound the error incurred when each activation function is replaced by a corresponding low-weight polynomial approximation.

Theorem 4.3 (Approximate Polynomial Network). *Let $\sigma$ be an activation function that is 1-Lipschitz⁷ and such that there exists a degree-$d$ polynomial $p$ that is an $(\frac{\epsilon}{W^D D}, 2M, B)$-approximation to $\sigma$, with $\epsilon \in (0,1)$ and $d, M, B \geq 1$. Then, for all $N \in \mathcal{N}[\sigma, D, W, M]$, there exists $\bar{N} \in \mathcal{N}[p, D, W, 2M]$ such that*

$$\sup_{\mathbf{x} \in \mathbb{S}^{n-1}} \left| N(\mathbf{x}) - \bar{N}(\mathbf{x}) \right| \leq \epsilon.$$

⁷ Note that this is not a restriction, as we have not explicitly constrained the weights $W$. Thus, to allow a Lipschitz constant $L$, we simply replace $W$ by $WL$.

Proof. Let $N \in \mathcal{N}[\sigma, D, W, M]$, and let $\bar{N}$ be the network with the same structure and weights as $N$ but with the activation function replaced by $p$. For $N$, let $h^{(i)}_j(\mathbf{x})$ be the input to unit $j$ of layer $i+1$ and $y^{(i)}_j(\mathbf{x})$ the output of unit $j$ of layer $i$, as defined previously. Correspondingly, for $\bar{N}$, let $\bar{h}^{(i)}_j(\mathbf{x})$ be the inputs to layer $i+1$ and $\bar{y}^{(i)}_j(\mathbf{x})$ the outputs of layer $i$. We prove by induction on the layer $i$ that for all units $j$ of layer $i$,

$$\sup_{\mathbf{x} \in \mathbb{S}^{n-1}} \left| h^{(i)}_j(\mathbf{x}) - \bar{h}^{(i)}_j(\mathbf{x}) \right| \leq \frac{i\epsilon}{W^{D-i} D}. \qquad (32)$$

For layer $i = 0$, we have $h^{(0)}_j(\mathbf{x}) = \bar{h}^{(0)}_j(\mathbf{x}) = \mathbf{w}^{(0)}_j \cdot \mathbf{x} \in [-M, M]$, which trivially satisfies (32). Now we prove that the desired property holds for layer $l$, assuming it holds for layer $l-1$: for all units $j$ in layer $l-1$,

$$\sup_{\mathbf{x} \in \mathbb{S}^{n-1}} \left| h^{(l-1)}_j(\mathbf{x}) - \bar{h}^{(l-1)}_j(\mathbf{x}) \right| \leq \frac{(l-1)\epsilon}{W^{D-l+1} D}. \qquad (33)$$

Note that this implies

$$\left| \bar{h}^{(l-1)}_j(\mathbf{x}) \right| \leq \left| h^{(l-1)}_j(\mathbf{x}) \right| + \frac{(l-1)\epsilon}{W^{D-l+1} D} \leq 2M.$$

Here the second inequality follows from the assumption that the inputs to each unit are bounded by $M$, together with $\epsilon < 1$ and $M \geq 1$.
We have for all $\mathbf{x}$ and $j$,

$$\left|h^{(l)}_j(\mathbf{x}) - \bar{h}^{(l)}_j(\mathbf{x})\right| = \left|\sum_{k=1}^{n(l)} w^{(l)}_{jk}\,\sigma\!\left(h^{(l-1)}_k(\mathbf{x})\right) - \sum_{k=1}^{n(l)} w^{(l)}_{jk}\,p\!\left(\bar{h}^{(l-1)}_k(\mathbf{x})\right)\right|$$
$$\leq \sum_{k=1}^{n(l)} \left|w^{(l)}_{jk}\right| \left|\sigma\!\left(h^{(l-1)}_k(\mathbf{x})\right) - p\!\left(\bar{h}^{(l-1)}_k(\mathbf{x})\right)\right|$$
$$\leq \sum_{k=1}^{n(l)} \left|w^{(l)}_{jk}\right| \left(\left|\sigma\!\left(h^{(l-1)}_k(\mathbf{x})\right) - \sigma\!\left(\bar{h}^{(l-1)}_k(\mathbf{x})\right)\right| + \frac{\epsilon}{W^D D}\right) \qquad (34)$$
$$\leq \sum_{k=1}^{n(l)} \left|w^{(l)}_{jk}\right| \left(\left|h^{(l-1)}_k(\mathbf{x}) - \bar{h}^{(l-1)}_k(\mathbf{x})\right| + \frac{\epsilon}{W^D D}\right) \qquad (35)$$
$$\leq \sum_{k=1}^{n(l)} \left|w^{(l)}_{jk}\right| \left(\frac{(l-1)\epsilon}{W^{D-l+1} D} + \frac{\epsilon}{W^D D}\right) \qquad (36)$$
$$\leq \|\mathbf{w}^{(l)}_j\|_1 \cdot \frac{l\epsilon}{W^{D-l+1} D} \leq \frac{l\epsilon}{W^{D-l} D}. \qquad (37)$$

Step (34) follows since $\bar{h}^{(l-1)}_k(\mathbf{x}) \in [-2M, 2M]$ and $p$ uniformly $\frac{\epsilon}{W^D D}$-approximates $\sigma$ on $[-2M, 2M]$. Step (35) follows from $\sigma$ being 1-Lipschitz. Step (36) follows from (33). Finally, Step (37) follows from $\|\mathbf{w}^{(l)}_j\|_1 \leq W$, which is given. This completes the inductive proof. We conclude by noting that $N(\mathbf{x}) = h^{(D)}_1(\mathbf{x})$ and $\bar{N}(\mathbf{x}) = \bar{h}^{(D)}_1(\mathbf{x})$. Thus, from the above we get

$$\sup_{\mathbf{x} \in \mathbb{S}^{n-1}} \left| N(\mathbf{x}) - \bar{N}(\mathbf{x}) \right| = \sup_{\mathbf{x} \in \mathbb{S}^{n-1}} \left| h^{(D)}_1(\mathbf{x}) - \bar{h}^{(D)}_1(\mathbf{x}) \right| \leq \epsilon.$$

This completes the proof. □

Given the above transformation to a polynomial network and the associated error bounds, we apply the main theorem of Zhang et al. [36], combined with the clipping technique from Section 3, to obtain the following result:

Theorem 4.4 (Learnability of Neural Networks). *Let $\sigma$ be an activation function that is 1-Lipschitz⁷ and such that there exists a degree-$d$ polynomial $p$ that is an $(\frac{\epsilon}{(L+1) \cdot W^D \cdot D}, 2M, B)$-approximation to $\sigma$, for $d, B, M \geq 1$. Let $\ell$ be a loss function that is convex, $L$-Lipschitz in the first argument, and $b$-bounded on $[-2M \cdot W, 2M \cdot W]$.*
*Then there exists an algorithm that outputs a predictor $\hat{f}$ such that, with probability at least $1-\delta$, for any (unknown) distribution $\mathcal{D}$ over $\mathbb{S}^{n-1} \times [-M \cdot W, M \cdot W]$,*

$$\mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}[\ell(\hat{f}(\mathbf{x}), y)] \leq \min_{N \in \mathcal{N}[\sigma, D, W, M]} \mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}[\ell(N(\mathbf{x}), y)] + \epsilon.$$

*The time complexity of the above algorithm is bounded by $n^{O(1)} \cdot B^{O(d)^{D-1}} \cdot \log(1/\delta)$, where $d$ is the degree of $p$ and $B$ is a bound on $\sum_{i=0}^{d} 2^i \beta_i^2$ (see Definition 4.2).*

Proof. From Theorem 4.3 we have that for every $N \in \mathcal{N}[\sigma, D, W, M]$ there is a network $\bar{N} \in \mathcal{N}[p, D, W, 2M]$ such that

$$\sup_{\mathbf{x} \in \mathbb{S}^{n-1}} \left| N(\mathbf{x}) - \bar{N}(\mathbf{x}) \right| \leq \frac{\epsilon}{L+1}.$$

Since the loss function $\ell$ is $L$-Lipschitz, this implies that

$$\ell(\bar{N}(\mathbf{x}), y) - \ell(N(\mathbf{x}), y) \leq L \cdot |\bar{N}(\mathbf{x}) - N(\mathbf{x})| \leq \frac{L}{L+1} \cdot \epsilon. \qquad (38)$$

Let $N_{\min} = \arg\min_{N \in \mathcal{N}[\sigma, D, W, M]} \mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}[\ell(N(\mathbf{x}), y)]$. By the above, we get that there exists $\bar{N}_{\min} \in \mathcal{N}[p, D, W, 2M]$ such that

$$\min_{\bar{N} \in \mathcal{N}[p, D, W, 2M]} \mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}[\ell(\bar{N}(\mathbf{x}), y)] \leq \mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}[\ell(\bar{N}_{\min}(\mathbf{x}), y)] \leq \mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}[\ell(N_{\min}(\mathbf{x}), y)] + \frac{L}{L+1} \cdot \epsilon = \min_{N \in \mathcal{N}[\sigma, D, W, M]} \mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}[\ell(N(\mathbf{x}), y)] + \frac{L}{L+1} \cdot \epsilon.$$

Now, from [36, Theorem 1], we know that there exists an algorithm that outputs a predictor $\hat{f}$ such that, with probability at least $1-\delta$, for any distribution $\mathcal{D}$,

$$\mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}[\ell(\hat{f}(\mathbf{x}), y)] \leq \min_{\bar{N} \in \mathcal{N}[p, D, W, 2M]} \mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}[\ell(\bar{N}(\mathbf{x}), y)] + \frac{\epsilon}{L+1}.$$

For loss functions that take on large values on the range of the predictor, we instead output the clipped version of the predictor, $\mathrm{clip}(\hat{f})$, in order to satisfy the requirements of the Rademacher bounds (as in Section 3). The running time of the algorithm is $\mathrm{poly}(n, (L+1)/\epsilon, \log(1/\delta), H^{(D)}(1))$, where $H(a) = \sqrt{\sum_{i=0}^{d} 2^i \beta_i^2 a^{2i}}$ and $H^{(D)}$ is obtained by composing $H$ with itself $D$ times. By simple algebra, we conclude that $H^{(D)}(1)$ is bounded by $B^{O(d)^{D-1}}$.
Combining the above inequalities, we have

$$\mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}[\ell(\hat{f}(\mathbf{x}), y)] \leq \min_{N \in \mathcal{N}[\sigma, D, W, M]} \mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}[\ell(N(\mathbf{x}), y)] + \epsilon.$$

This completes the proof. □

We can now state the learnability result for ReLU networks as follows.

Corollary 4.5 (Learnability of ReLU Networks). *There exists an algorithm that outputs a predictor $\hat{f}$ such that, with probability at least $1-\delta$, for any distribution $\mathcal{D}$ over $\mathbb{S}^{n-1} \times [-M \cdot W, M \cdot W]$ and any loss function $\ell$ that is convex, $L$-Lipschitz in the first argument, and $b$-bounded on $[-2M \cdot W, 2M \cdot W]$,*

$$\mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}[\ell(\hat{f}(\mathbf{x}), y)] \leq \min_{N \in \mathcal{N}[\sigma_{\mathrm{relu}}, D, W, M]} \mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}[\ell(N(\mathbf{x}), y)] + \epsilon.$$

*The time complexity of the above algorithm is bounded by $n^{O(1)} \cdot 2^{((L+1) \cdot M \cdot W^D \cdot D \cdot \epsilon^{-1})^D} \cdot \log(1/\delta)$.*

The proof of the corollary follows from applying Theorem 4.4 to the activation function $\sigma_{\mathrm{relu}}$, since $\sigma_{\mathrm{relu}}$ is 1-Lipschitz and low-weight approximable (from Theorems 2.12 and 2.13). We obtain the following corollary specifically for depth-2 networks.

Corollary 4.6. *Depth-2 networks with $k$ hidden units and activation function $\sigma_{\mathrm{relu}}$ whose weight vectors have $\ell_2$-norm bounded by 1 are agnostically learnable over $\mathbb{S}^{n-1} \times [-\sqrt{k}, \sqrt{k}]$ with respect to any loss function $\ell$ that is convex, $O(1)$-Lipschitz in the first argument, and $b$-bounded on $[-2\sqrt{k}, 2\sqrt{k}]$, in time $n^{O(1)} \cdot 2^{O(\sqrt{k}/\epsilon)} \cdot \log(1/\delta)$.*

The proof of the corollary follows from setting $L = 1$, $D = 1$, $M = 1$, and $W = \sqrt{k}$ in Corollary 4.5; $W = \sqrt{k}$ follows from bounding the $\ell_1$-norm of the output weights given the bound on their $\ell_2$-norm.

We remark here that the above analysis also holds for fully-connected networks with the sigmoid activation function $\sigma_{\mathrm{sig}}(x) = \frac{1}{1+e^{-x}}$. Note that $\sigma_{\mathrm{sig}}$ is 1-Lipschitz. The following lemma, due to Livni et al. [25, Lemma 2], exhibits a low-degree polynomial approximation for $\sigma_{\mathrm{sig}}$.
It is in turn based on a result of Shalev-Shwartz et al. [31, Lemma 2].

Lemma 4.7 (Livni et al. [25]). *For $\epsilon \in (0,1)$, there exists a polynomial $p(a) = \sum_{i=1}^{d} \beta_i a^i$ with $d = O(\log(1/\epsilon))$ such that for all $a \in [-1, 1]$, $|p(a) - \sigma_{\mathrm{sig}}(a)| \leq \epsilon$.*

Let $p(a) = \sum_{i=1}^{d} \beta_i a^i$ be the uniform $\epsilon$-approximation to $\sigma_{\mathrm{sig}}$ that is guaranteed to exist by the above lemma. Using a trick similar to that of Lemma 2.12, we can further ensure $p([-1,1]) \subseteq [0,1]$. Also, using Lemma 2.13, we can show that $\sum_{i=0}^{d} 2^i \beta_i^2$ is bounded by $(1/\epsilon)^{O(1)}$. This shows that $\sigma_{\mathrm{sig}}$ is low-weight approximable. Using Theorem 4.4, we state the following learnability result for depth-2 sigmoid networks.

Corollary 4.8. *Depth-2 networks with $k$ hidden units and sigmoid activation function whose weight vectors have $\ell_2$-norm bounded by 1 are agnostically learnable over $\mathbb{S}^{n-1} \times [-\sqrt{k}, \sqrt{k}]$ with respect to any loss function $\ell$ that is convex, $O(1)$-Lipschitz in the first argument, and $b$-bounded on $[-2\sqrt{k}, 2\sqrt{k}]$, in time $\mathrm{poly}(n, k, 1/\epsilon, \log(1/\delta))$.*

Observe that the above result is polynomial in all parameters. Livni et al. (cf. [25, Theorem 5]) state an incomparable result for learning sigmoids: their running time is superpolynomial in $n$ for $L = \omega(1)$, where $L$ is the bound on the $\ell_1$-norm of the weight vectors ($L$ may be as large as $\sqrt{k}$ in the setting of Corollary 4.8). They, however, work over the Boolean cube (whereas we work over the domain $\mathbb{S}^{n-1}$).

4.3 Application: Learning Parametric Rectified Linear Units

A Parametric Rectified Linear Unit (PReLU) is a generalization of the ReLU introduced by He et al. [15]. Compared to the ReLU, it has an additional parameter that is learned. Formally, it is defined as follows.

Definition 4.9 (Parametric Rectifier).
*The parametric rectifier (denoted $\sigma_{\mathrm{PReLU}}$) is an activation function defined as*

$$\sigma_{\mathrm{PReLU}}(x) = \begin{cases} x & \text{if } x \geq 0 \\ a \cdot x & \text{if } x < 0 \end{cases}$$

*where $a$ is a learnable parameter.*

Note that we can represent $\sigma_{\mathrm{PReLU}}(x) = \max(0, x) - a \cdot \max(0, -x) = \sigma_{\mathrm{relu}}(x) - a \cdot \sigma_{\mathrm{relu}}(-x)$, which is a depth-2 network of ReLUs. Therefore, we can state the following learnability result for a single PReLU parameterized by a weight vector $\mathbf{w}$, based on learning depth-2 ReLU networks.

Corollary 4.10. *Let a PReLU with parameter $a$ be such that $|a|$ is bounded by a constant and the weight vector $\mathbf{w}$ has $\ell_2$-norm bounded by 1. Then the PReLU is agnostically learnable over $\mathbb{S}^{n-1}$ with respect to any $O(1)$-Lipschitz loss function in time $n^{O(1)} \cdot 2^{O(1/\epsilon)} \cdot \log(1/\delta)$.*

The proof of the corollary follows from setting $L = 1$, $D = 1$, $M = 1$, and $W = O(1)$ in Corollary 4.5. The condition that $|a|$ be bounded by a constant is reasonable, as in practice the value of $a$ is very rarely above 1, as observed by He et al. [15]. Also note that Leaky ReLUs [26] are PReLUs with a fixed $a$ (usually 0.01); hence, we can agnostically learn them under the same conditions using an identical argument. Note that a network of PReLUs can likewise be learned as a network of ReLUs by replacing each PReLU in the network with a linear combination of two ReLUs, as described above.

4.4 Application: Learning the Piecewise Linear Transfer Function

Several functions have been used to relax the 0/1 loss in the context of learning linear classifiers; the best example is the sigmoid function discussed earlier. Here we consider the piecewise linear transfer function. Formally, it is defined as follows.

Definition 4.11 (Piecewise Linear Transfer Function). *The $C$-Lipschitz piecewise linear transfer function (denoted $\sigma_{\mathrm{pw}}$) is an activation function defined as*

$$\sigma_{\mathrm{pw}}(x) = \max\left(0, \min\left(\tfrac{1}{2} + Cx, 1\right)\right).$$
Note that we can represent

$$\sigma_{\mathrm{pw}}(x) = \max\left(0, \tfrac{1}{2} + Cx\right) - \max\left(0, -\tfrac{1}{2} + Cx\right) = \sigma_{\mathrm{relu}}\left(\tfrac{1}{2} + Cx\right) - \sigma_{\mathrm{relu}}\left(-\tfrac{1}{2} + Cx\right),$$

which is a depth-2 network of ReLUs. Therefore, we can state the following learnability result for a piecewise linear transfer function parameterized by a weight vector $\mathbf{w}$, following an argument similar to that of the previous section.

Corollary 4.12. *The class of $C$-Lipschitz piecewise linear transfer functions parameterized by a weight vector $\mathbf{w}$ with $\ell_2$-norm bounded by 1 is agnostically learnable over $\mathbb{S}^{n-1}$ with respect to any $O(1)$-Lipschitz loss function in time $n^{O(1)} \cdot 2^{O(C/\epsilon)} \cdot \log(1/\delta)$.*

The proof of the corollary follows from setting $L = 1$, $D = 1$, $M = 1$, and $W = O(C)$ in Corollary 4.5. Shalev-Shwartz et al. [31] solved the above problem for the $\ell_1$ loss in their Appendix A, giving a running time whose dependence on $C$ and $\epsilon$ is $\mathrm{poly}\left(\exp\left(\frac{C^2}{\epsilon^2} \log\frac{C}{\epsilon}\right)\right)$. Our approach gives an exponential improvement in terms of $\frac{C}{\epsilon}$ and works for general constant-Lipschitz loss functions.

4.5 Application: Convex Piecewise-Linear Fitting

In this section we use our learnability results for networks of ReLUs to give polynomial-time approximation schemes for convex piecewise-linear regression [27]. These problems have been studied in optimization and, notably, in machine learning in the context of Multivariate Adaptive Regression Splines (MARS [13]). Note that these are not the same as univariate piecewise or segmented regression problems, for which polynomial-time algorithms are known. Although our algorithms run in time exponential in $k$ (the number of affine functions), we note that no provably efficient algorithms were known prior to our work, even for the case $k = 2$.⁸
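The two ReLU decompositions used in the preceding applications, for $\sigma_{\mathrm{PReLU}}$ and $\sigma_{\mathrm{pw}}$, can be checked numerically on a grid; a minimal sketch (the function names and the particular values of $a$ and $C$ are ours, chosen for illustration):

```python
def relu(t):
    return max(0.0, t)

def prelu(t, a):
    # Definition 4.9: t if t >= 0, a*t otherwise
    return t if t >= 0 else a * t

def pw(t, C):
    # Definition 4.11: the C-Lipschitz piecewise linear transfer function
    return max(0.0, min(0.5 + C * t, 1.0))

a, C = 0.01, 3.0
for i in range(-200, 201):
    t = i / 100.0  # grid over [-2, 2]
    # PReLU as a linear combination of two ReLUs
    assert abs(prelu(t, a) - (relu(t) - a * relu(-t))) < 1e-12
    # sigma_pw as a difference of two shifted ReLUs
    assert abs(pw(t, C) - (relu(0.5 + C * t) - relu(-0.5 + C * t))) < 1e-12
```

Both identities hold pointwise, so each function is exactly a depth-2 ReLU network, as claimed.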
The key idea is to reduce piecewise-linear regression problems to an optimization problem on networks of ReLUs using simple ReLU "gadgets." We formally describe the problems and the gadgets in detail below.

4.5.1 Sum of Max 2-Affine

We start with a simple class of convex piecewise-linear functions represented as a sum of a fixed number of functions, each of which is a maximum of 2 affine functions. This is formally defined as follows.

Definition 4.13 (Sum of $k$ Max 2-Affine Fitting [27]). *Let $\mathcal{C}$ be the class of functions of the form $f(\mathbf{x}) = \sum_{i=1}^{k} \max(\mathbf{w}_{2i-1} \cdot \mathbf{x}, \mathbf{w}_{2i} \cdot \mathbf{x})$ with $\mathbf{w}_1, \ldots, \mathbf{w}_{2k} \in \mathbb{S}^{n-1}$, mapping $\mathbb{S}^{n-1}$ to $\mathbb{R}$. Let $\mathcal{D}$ be an (unknown) distribution on $\mathbb{S}^{n-1} \times [-k, k]$. Given i.i.d. examples drawn from $\mathcal{D}$, for any $\epsilon \in (0,1)$, find a function $h$ (not necessarily in $\mathcal{C}$) such that $\mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}[(h(\mathbf{x}) - y)^2] \leq \min_{c \in \mathcal{C}} \mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}[(c(\mathbf{x}) - y)^2] + \epsilon$.*

It is easy to see that $\max(a, b) = \max(0, a - b) + \max(0, b) - \max(0, -b) = \sigma_{\mathrm{relu}}(a - b) + \sigma_{\mathrm{relu}}(b) - \sigma_{\mathrm{relu}}(-b)$, where $\sigma_{\mathrm{relu}}(a) = \max(0, a)$. This is simply a linear combination of ReLUs. We can thus represent $\max(\mathbf{w}_1 \cdot \mathbf{x}, \mathbf{w}_2 \cdot \mathbf{x})$ as a depth-2 network (see Figure 1).

Figure 1: Representation of $\max(\mathbf{w}_1 \cdot \mathbf{x}, \mathbf{w}_2 \cdot \mathbf{x})$ as a depth-2 ReLU network: the inputs $\mathbf{w}_1 \cdot \mathbf{x}$ and $\mathbf{w}_2 \cdot \mathbf{x}$ feed three $\sigma_{\mathrm{relu}}$ units, which feed the output unit. Solid edges represent a weight of 1, dashed edges represent a weight of $-1$, and the absence of an edge represents a weight of 0.

Adding copies of this gadget, we can represent a sum of $k$ max 2-affine functions as a depth-2 network $N_\Sigma$ with $3k$ hidden units and activation function $\sigma_{\mathrm{relu}}$ satisfying the following properties:

• $\|\mathbf{w}^{(0)}_j\|_2 \leq 2$;
• $\|\mathbf{w}^{(1)}_1\|_1 \leq 3k$;
• each input to each unit is bounded in magnitude by 2.
The first property holds because $\|\mathbf{w}^{(0)}_j\|_2 \leq \max(\|\mathbf{w}_{2j-1} - \mathbf{w}_{2j}\|_2, \|\mathbf{w}_{2j-1}\|_2, \|\mathbf{w}_{2j}\|_2) \leq \|\mathbf{w}_{2j-1}\|_2 + \|\mathbf{w}_{2j}\|_2 \leq 2$, using the triangle inequality. The second holds because each of the $k$ max sub-networks contributes 3 to $\|\mathbf{w}^{(1)}_1\|_1$. The third is implied by the fact that each input to each unit is bounded by $|\max(\mathbf{w}_1 \cdot \mathbf{x}, -\mathbf{w}_1 \cdot \mathbf{x}, (\mathbf{w}_1 - \mathbf{w}_2) \cdot \mathbf{x})| \leq 2$.

Theorem 4.14. *Let $\mathcal{C}$ be as in Definition 4.13. Then there is an algorithm $\mathcal{A}$ that solves the sum of $k$ max 2-affine fitting problem in time $n^{O(1)} \cdot 2^{O(k^2/\epsilon)} \cdot \log(1/\delta)$.*

Proof. As per our construction, we know that there exists a network $N_\Sigma$ with activation function $\sigma_{\mathrm{relu}}$ and one hidden layer such that $\|\mathbf{w}^{(0)}_j\|_2 \leq 2$ and $\|\mathbf{w}^{(1)}_1\|_1 \leq 3k$. Also, the input to each unit is bounded in magnitude by 2. Thus, using Corollary 4.5 with $D = 1$, $M = 2$, and $W = 3k$, we get that there exists an algorithm that solves the sum of $k$ max 2-affine fitting problem in time $n^{O(1)} \cdot 2^{O(k^2/\epsilon)} \cdot \log(1/\delta)$. □

4.5.2 Max $k$-Affine

In this section, we move to a more general convex piecewise-linear function, represented as the maximum of $k$ affine functions. This is formally defined as follows.

Definition 4.15 (Max $k$-Affine Fitting [27]). *Let $\mathcal{C}$ be the class of functions of the form $f(\mathbf{x}) = \max(\mathbf{w}_1 \cdot \mathbf{x}, \ldots, \mathbf{w}_k \cdot \mathbf{x})$ with $\mathbf{w}_1, \ldots, \mathbf{w}_k \in \mathbb{S}^{n-1}$, mapping $\mathbb{S}^{n-1}$ to $\mathbb{R}$. Let $\mathcal{D}$ be a distribution on $\mathbb{S}^{n-1} \times [-1, 1]$. Given i.i.d. examples drawn from $\mathcal{D}$, for any $\epsilon \in (0,1)$, find a function $h$ (not necessarily in $\mathcal{C}$) such that $\mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}[(h(\mathbf{x}) - y)^2] \leq \min_{c \in \mathcal{C}} \mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}[(c(\mathbf{x}) - y)^2] + \epsilon$.*

Note that this form is universal, since any convex piecewise-linear function can be expressed as a max-affine function for some value of $k$. However, we focus on bounded $k$ and give learnability bounds in terms of $k$.
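The three-ReLU max gadget that underlies these constructions can be verified numerically; a minimal sketch (the function names and test points are ours, chosen for illustration):

```python
def relu(t):
    return max(0.0, t)

def max2(a, b):
    # max(a, b) = relu(a - b) + relu(b) - relu(-b)
    return relu(a - b) + relu(b) - relu(-b)

def sum_max2affine(ws, x):
    # f(x) = sum_i max(w_{2i-1}.x, w_{2i}.x), each max built from 3 ReLUs
    dot = lambda w, v: sum(wi * vi for wi, vi in zip(w, v))
    return sum(max2(dot(ws[2 * i], x), dot(ws[2 * i + 1], x))
               for i in range(len(ws) // 2))

# check the gadget on a few values
for a, b in [(1.5, -0.5), (-2.0, 3.0), (0.0, 0.0), (-1.0, -4.0)]:
    assert max2(a, b) == max(a, b)

# and the sum-of-max-2-affine network on a point of the sphere
ws = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]  # k = 2 pairs
x = (0.6, 0.8)
expected = max(0.6, 0.8) + max(-0.6, -0.8)
assert abs(sum_max2affine(ws, x) - expected) < 1e-12
```

The identity works because $\mathrm{relu}(b) - \mathrm{relu}(-b) = b$ for every $b$, so the gadget computes $\max(0, a-b) + b = \max(a, b)$.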
Observe that max $k$-affine can be expressed as a complete binary tree of height $\lceil \log k \rceil$ with a max operation at each internal unit and $\mathbf{w}_i \cdot \mathbf{x}$ for $i \in [k]$ at the $k$ leaf units (see Figure 2 for an example).

Figure 2: Tree structure for evaluating max $k$-affine with $k = 4$: the leaves $\mathbf{w}_1 \cdot \mathbf{x}, \ldots, \mathbf{w}_4 \cdot \mathbf{x}$ feed two max units, which feed a final max unit.

Note that if $k$ is not a power of 2, we can trivially add leaves with value $\mathbf{w}_1 \cdot \mathbf{x}$ to make the tree complete. Thus, the class of convex piecewise-linear functions can be expressed as a network of ReLUs with $\lceil \log k \rceil$ hidden layers by replacing each max unit in the tree with 3 ReLUs and adding an output unit. See Figure 3 for the construction for $k = 4$. More formally, we have a network $N_{\max}$ with $\lceil \log k \rceil$ hidden layers and one output unit, with $\sigma_{\mathrm{relu}}$ as the activation function. Hidden layer $i$ has $3 \cdot 2^{\lceil \log k \rceil - i}$ units. The weight vectors for the units in the first hidden layer are

$$\mathbf{w}^{(0)}_{3j-m} = \begin{cases} \mathbf{w}_{2j} - \mathbf{w}_{2j-1} & m = 0 \\ \mathbf{w}_{2j-1} & m = 1 \\ -\mathbf{w}_{2j-1} & m = 2 \end{cases}$$

for $j \in [2^{\lceil \log k \rceil - 1}]$. Further, the weight vectors into hidden layer $i \in \{2, \ldots, \lceil \log k \rceil\}$ of the network are

$$\mathbf{w}^{(i-1)}_{3j-m} = \begin{cases} e_{6j} + e_{6j-1} - e_{6j-2} - (e_{6j-3} + e_{6j-4} - e_{6j-5}) & m = 0 \\ e_{6j-3} + e_{6j-4} - e_{6j-5} & m = 1 \\ -(e_{6j-3} + e_{6j-4} - e_{6j-5}) & m = 2 \end{cases}$$

for $j \in [2^{\lceil \log k \rceil - i}]$. Here $e_i$ refers to the vector with 1 at position $i$ and 0 everywhere else. Finally, the weight vector for the output unit is $\mathbf{w}^{(\lceil \log k \rceil)}_1 = e_3 + e_2 - e_1$, which combines the three ReLUs of the topmost gadget into its max. The following properties of $N_{\max}$ are easy to deduce:

• $\|\mathbf{w}^{(0)}_j\|_2 \leq 2$;
• $\|\mathbf{w}^{(i)}_j\|_1 \leq 6$ for $i \in [\lceil \log k \rceil]$;
• the input to each unit is bounded in magnitude by 2.

Here, the first and third conditions are the same as in the previous section; the second holds by the values of the weights defined above. Using the above construction, we obtain the following result.

Theorem 4.16.
*Let $\mathcal{C}$ be as in Definition 4.15. Then there is an algorithm $\mathcal{A}$ that solves the max $k$-affine fitting problem in time $n^{O(1)} \cdot 2^{O(k/\epsilon)^{\lceil \log k \rceil}} \cdot \log(1/\delta)$.*

⁸ Boyd and Magnani [27] specifically focus on the case of small $k$, writing "Our interest, however, is in the case when the number of terms $k$ is relatively small, say no more than 10, or a few 10s."

Figure 3: Network with activation function $\sigma_{\mathrm{relu}}$ for evaluating max $k$-affine with $k = 4$: the inputs $\mathbf{w}_1 \cdot \mathbf{x}, \ldots, \mathbf{w}_4 \cdot \mathbf{x}$ feed two hidden layers of $\sigma_{\mathrm{relu}}$ units followed by an output unit. Solid edges represent a weight of 1, dashed edges represent a weight of $-1$, and the absence of an edge represents a weight of 0.

Proof. As per our construction, we know that there exists a network $N_{\max}$ with activation function $\sigma_{\mathrm{relu}}$ and $\lceil \log k \rceil$ hidden layers such that $\|\mathbf{w}^{(0)}_j\|_2 \leq 2$ and $\|\mathbf{w}^{(i)}_j\|_1 \leq 6$ for $i \in [\lceil \log k \rceil]$. Also, the input to each unit is bounded in magnitude by 2. Thus, using Corollary 4.5 with $D = \lceil \log k \rceil$, $M = 2$, and $W = 6$, we get that there exists an algorithm that solves the max $k$-affine fitting problem in time $n^{O(1)} \cdot 2^{(k/\epsilon)^{O(\log k)}} \cdot \log(1/\delta)$. □

5 Hardness of Learning ReLUs

We also establish the first hardness results for learning a single ReLU with respect to distributions supported on the Boolean hypercube $\{0,1\}^n$. The high-level "takeaway" from our hardness results is that learning functions of the form $\max(0, \mathbf{w} \cdot \mathbf{x})$ where $|\mathbf{w} \cdot \mathbf{x}| \in \omega(1)$ is as hard as solving notoriously difficult problems in computational learning theory. This justifies our focus in previous sections on input distributions supported on $\mathbb{S}^{n-1}$, and indicates that learning real-valued functions on the sphere is one avenue for avoiding the vast literature of hardness results on Boolean function learning. To begin, we recall the following problem from computational learning theory, widely thought to be computationally intractable.
Definition 5.1 (Learning Sparse Parity with Noise). *Let $\chi_S : \{0,1\}^n \to \{0,1\}$ be an unknown parity function on a subset $S$, $|S| \leq k$, of the $n$ input bits (i.e., any input whose restriction to $S$ has an odd number of ones is mapped to 1, and to 0 otherwise). Let $\mathcal{C}_k$ be the concept class of all parity functions on subsets $S$ of size at most $k$. Let $\mathcal{D}$ be a distribution on $\{0,1\}^n \times \{0,1\}$ and define $\mathrm{opt} = \min_{\chi \in \mathcal{C}_k} \Pr_{(\mathbf{x},y)\sim\mathcal{D}}[\chi(\mathbf{x}) \neq y]$. The Learning Sparse Parity with Noise problem is as follows: given i.i.d. examples drawn from $\mathcal{D}$, find $h$ such that $\Pr_{(\mathbf{x},y)\sim\mathcal{D}}[h(\mathbf{x}) \neq y] \leq \mathrm{opt} + \epsilon$.*

Our hardness assumption is as follows:

Assumption 5.2. *For every algorithm $\mathcal{A}$ that solves the Learning Sparse Parity with Noise problem, there exist $\epsilon = O(1)$ and $k \in \omega(1)$ such that $\mathcal{A}$ requires time $n^{\Omega(k)}$.*

Any algorithm breaking the above assumption would be a major result in theoretical computer science. The best known algorithms, due to Blum, Kalai, and Wasserman [5] and Valiant [33], run in time $2^{O(n/\log n)}$ and $n^{0.8k}$, respectively. Under this assumption, we can rule out polynomial-time algorithms for reliably learning ReLUs on distributions supported on $\{0,1\}^n$.

Theorem 5.3. *Let $\mathcal{C}$ be the class of ReLUs over the domain $\mathcal{X} = \{0,1\}^n$ with the added restriction that $\|\mathbf{w}\|_1 \leq 2k$. Any algorithm $\mathcal{A}$ for reliably learning $\mathcal{C}$ in time $g(\epsilon) \cdot \mathrm{poly}(n)$ for any function $g$ gives a polynomial-time algorithm for learning sparse parities with noise of size $k$ for $\epsilon = O(1)$.*

Proof. We will show how to use a reliable ReLU learner to agnostically learn conjunctions on $\{0,1\}^n$, and then use an observation due to Feldman and Kothari [12], who showed that agnostically learning conjunctions is at least as hard as the Learning Sparse Parity with Noise problem. Let $\mathcal{CO}_k$ be the concept class of all Boolean conjunctions of length at most $k$.
Notice that over the domain $\mathcal{X} = \{0,1\}^n$, the conjunction of literals $x_1, \ldots, x_k$ can be computed exactly as $\max(0, x_1 + \cdots + x_k - (k-1))$. Fix an arbitrary distribution $\mathcal{D}$ on $\{0,1\}^n \times \{0,1\}$ and define $\mathrm{opt} = \min_{c \in \mathcal{CO}_k} \Pr_{(\mathbf{x},y)\sim\mathcal{D}}[c(\mathbf{x}) \neq y]$. Kalai et al. [19] (Theorem 5) observed that in order to output a hypothesis $h$ with error $\mathrm{opt} + \epsilon$, it suffices to minimize (to within $\epsilon$) the following quantity:

$$\mathrm{opt}_1 = \min_{c \in \mathcal{CO}_k} \mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}[|c(\mathbf{x}) - y|].$$

Consider the transformed distribution $\mathcal{D}'$ on $\{0,1\}^n \times \{\epsilon, 1+\epsilon\}$ that adds a small positive $\epsilon$ to every $y$ output by $\mathcal{D}$. Note that this changes $\mathrm{opt}_1$ by at most $\epsilon$. Further, all labels in $\mathcal{D}'$ are now positive. Since every $c \in \mathcal{CO}_k$ is computed exactly by a ReLU, and the reliable learning model demands that we minimize $\mathcal{L}_{>0}(h; \mathcal{D}')$ over all ReLUs, algorithm $\mathcal{A}$ will find an $h$ such that $\mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}'}[|h(\mathbf{x}) - y|] \leq \mathrm{opt}_1 + \epsilon \leq \mathrm{opt} + 2\epsilon$. By appropriately rescaling $\epsilon$, we have shown how to agnostically learn conjunctions using the reliable learner $\mathcal{A}$. This completes the proof. □

The above proof also shows hardness of learning ReLUs agnostically. Note that the above hardness result holds if we require the learning algorithm to succeed on all domains where $|\mathbf{w} \cdot \mathbf{x}|$ can grow without bound with respect to $n$:

Corollary 5.4. *Let $\mathcal{A}$ be an algorithm that learns ReLUs on all domains $\mathcal{X} \subseteq \mathbb{R}^n$ where $\mathbf{w} \cdot \mathbf{x}$ may take on values that are $\omega(1)$ with respect to the dimension $n$. Then any such algorithm for reliably learning $\mathcal{C}$ in time $g(\epsilon) \cdot \mathrm{poly}(n)$ breaks the Learning Sparse Parity with Noise hardness assumption.*

Finally, we point out that Kalai et al. [18] proved that reliably learning conjunctions is also as hard as PAC learning DNF formulas.
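The ReLU encoding of a conjunction used in the reduction above (with the bias $-(k-1)$ folded into the weight vector) can be checked exhaustively for small $k$ and $n$; a minimal sketch with toy parameters of our choosing:

```python
from itertools import product

def relu(t):
    return max(0, t)

def conj_as_relu(bits, k):
    # AND(x_1, ..., x_k) = max(0, x_1 + ... + x_k - (k - 1)) on {0,1}^n
    return relu(sum(bits[:k]) - (k - 1))

k, n = 3, 5
for bits in product((0, 1), repeat=n):
    expected = 1 if all(bits[:k]) else 0
    assert conj_as_relu(bits, k) == expected
```

The sum of the $k$ relevant bits reaches $k$ only when all of them are 1, so subtracting $k-1$ leaves 1 in that case and a non-positive value (clipped to 0 by the ReLU) otherwise.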
Thus, by our above reduction, any efficient algorithm for reliably learning ReLUs would give an efficient algorithm for PAC learning DNF formulas (again, this would be considered a breakthrough result in computational learning theory).

6 Conclusions and Open Problems

We have given the first set of efficient algorithms for learning ReLUs in a natural learning model. ReLUs are both effective in practice and, unlike linear threshold functions (halfspaces), admit non-trivial learning algorithms for all distributions with respect to adversarial noise. We "sidestepped" the hardness results in Boolean function learning by focusing on problems that are not entirely scale-invariant with respect to the choice of domain (e.g., reliably learning ReLUs). The obvious open question is to improve the dependence of our main result on $1/\epsilon$. We have no reason to believe that $2^{O(1/\epsilon)}$ is the best possible.

Acknowledgements. The authors are grateful to Sanjeev Arora and Roi Livni for helpful feedback and useful discussions on this work.

References

[1] Alexandr Andoni, Rina Panigrahy, Gregory Valiant, and Li Zhang. Learning sparse polynomial functions. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2014, Portland, Oregon, USA, January 5-7, 2014, pages 500–510, 2014.

[2] Raman Arora, Amitabh Basu, Poorya Mianjy, and Anirbit Mukherjee. Understanding deep neural networks with rectified linear units, 2016. URL: https://arxiv.org/abs/1611.01491.

[3] Francis Bach. Breaking the curse of dimensionality with convex neural networks. 2014.

[4] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.

[5] Blum, Kalai, and Wasserman. Noise-tolerant learning, the parity problem, and the statistical query model.
JACM: Journal of the ACM , 50, 2003. [6] Nel lo C r istianini and John Sha w e-T a ylor. An i ntr o duction to supp ort ve ctor machines and other kernel-b ase d le arning metho ds . Cambridge Univ ersit y Press, 2000. [7] Amit Daniely . C omp lexit y theoretic limitations on learning halfspaces. In STOC , pages 105 – 117. A CM, 2016. [8] Ilia s Diak onik olas, Daniel M. Kane, and Jelani Nelson. Bound ed indep endence fo ols degree-2 threshold functions. In FOCS , pages 11 –20. IEEE Computer So ciet y , 2010. [9] Ronen Eldan and Ohad Shamir. Th e p ow er of depth for feedforw ard neu r al net wo rks. In Vitaly F eldman, Alexander Rakhlin, and Oh ad S hamir, editors, Pr o c e e d ings of the 29th Confer enc e on L e arning The ory, COL T 20 16, New Y ork, USA, June 23-26, 2016 , v olume 49 of JMLR Workshop and Confer enc e Pr o c e e d ings , p ages 907– 940. JMLR.org, 2016. [10] Bassey Etim. App ro ve or Reject: Can You Mo derate Fiv e New York Times Commen ts? The New Y o rk Times , 2016. Originally published Septem b er 20, 2016. Retrieve d Octob er 4, 2016 . [11] V. F eldman , P . Gopalan, S. Khot, and A. K. Po nnusw ami. On agnostic learning of parities, monomials, and halfspaces. SIAM J. Comput , 39(2):60 6–645, 200 9. [12] Vitaly F eldman and Pra ve sh Kothari. Agnostic learning of disju nctions on symmetric distri- butions. Journal of Machine L e arning R ese ar ch , 16:345 5–3467, 2015. [13] J erome H. F r iedm an. Multiv ariate adaptiv e regression splines. Ann. Statist , 1991. [14] David Haussler. Decision theoretic generalizations of the p ac mo del for neur al net and other learning applications. Inf. Comput. , 100(1):78 –150, 1992 . [15] K aiming He, Xiangyu Z hang, Shao qing Ren , and Jian Sun. Delvi n g deep in to rectifiers: Surp assin g human-lev el p erformance on imagenet classification. In Pr o c e e dings of the IEEE International Confer enc e on Computer V ision , pages 1026–1 034, 2015. 
[16] Thomas Hofmann, Bernhard Schölkopf, and Alexander J. Smola. Kernel methods in machine learning. The Annals of Statistics, pages 1171–1220, 2008.

[17] Sham M. Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. 2008.

[18] Adam Tauman Kalai, Varun Kanade, and Yishay Mansour. Reliable agnostic learning. Journal of Computer and System Sciences, 78(5):1481–1495, 2012.

[19] Adam Tauman Kalai, Adam R. Klivans, Yishay Mansour, and Rocco A. Servedio. Agnostically learning halfspaces. SIAM Journal on Computing, 37(6):1777–1805, 2008.

[20] Michael J. Kearns, Robert E. Schapire, and Linda M. Sellie. Toward efficient agnostic learning. Mach. Learn., 17(2-3):115–141, 1994.

[21] A. R. Klivans and A. A. Sherstov. Cryptographic hardness for learning intersections of halfspaces. J. Comput. Syst. Sci., 75(1):2–12, 2009.

[22] Adam Klivans and Pravesh Kothari. Embedding hard learning problems into Gaussian space. In RANDOM, 2014.

[23] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, May 2015.

[24] Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer, 1991.

[25] Roi Livni, Shai Shalev-Shwartz, and Ohad Shamir. On the computational efficiency of training neural networks. pages 855–863, 2014.

[26] Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, volume 30, 2013.

[27] Alessandro Magnani and Stephen P. Boyd. Convex piecewise-linear fitting. Optimization and Engineering, 10(1):1–17, 2009.

[28] James Mercer. Functions of positive and negative type, and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 209:415–446, 1909.

[29] D. J. Newman. Rational approximation to |x|. Michigan Math. J., 11(1):11–14, 1964.

[30] Philippe Rigollet. High-Dimensional Statistics. MIT, 1st edition, 2015.

[31] Shai Shalev-Shwartz, Ohad Shamir, and Karthik Sridharan. Learning kernel-based halfspaces with the 0-1 loss. SIAM J. Comput., 40(6):1623–1646, 2011.

[32] Alexander A. Sherstov. Making polynomials robust to noise. In Proceedings of the Forty-fourth Annual ACM Symposium on Theory of Computing, STOC '12, pages 747–758, New York, NY, USA, 2012. ACM.

[33] Gregory Valiant. Finding correlations in subquadratic time, with applications to learning parities and the closest pair problem. J. ACM, 62(2):13:1–13:45, May 2015.

[34] Wikipedia. Multinomial theorem — Wikipedia, the free encyclopedia, 2016. URL: https://en.wikipedia.org/wiki/Multinomial_theorem.

[35] Wikipedia. Polynomial kernel — Wikipedia, the free encyclopedia, 2016. URL: https://en.wikipedia.org/wiki/Polynomial_kernel.

[36] Yuchen Zhang, Jason Lee, and Michael Jordan. ℓ1 networks are improperly learnable in polynomial time. In ICML, 2016.
