Deep Online Convex Optimization with Gated Games
Authors: David Balduzzi
David Balduzzi (david.balduzzi@vuw.ac.nz)

Abstract—Methods from convex optimization are widely used as building blocks for deep learning algorithms. However, the reasons for their empirical success are unclear, since modern convolutional networks (convnets), incorporating rectifier units and max-pooling, are neither smooth nor convex. Standard guarantees therefore do not apply. This paper provides the first convergence rates for gradient descent on rectifier convnets. The proof utilizes the particular structure of rectifier networks, which consists in binary active/inactive gates applied on top of an underlying linear network. The approach generalizes to max-pooling, dropout and maxout; in other words, to precisely the neural networks that perform best empirically. The key step is to introduce gated games, an extension of convex games with similar convergence properties that captures the gating function of rectifiers. The main result is that rectifier convnets converge to a critical point at a rate controlled by the gated-regret of the units in the network. Corollaries of the main result include: (i) a game-theoretic description of the representations learned by a neural network; (ii) a logarithmic-regret algorithm for training neural nets; and (iii) a formal setting for analyzing conditional computation in neural nets that can be applied to recently developed models of attention.

Index Terms—convex optimization, deep learning, error backpropagation, game theory, no-regret learning.

1 INTRODUCTION

Deep learning algorithms have yielded impressive performance across a range of tasks, including object and voice recognition [1].
The workhorse underlying deep learning is error backpropagation [2]–[4], a decades-old algorithm that yields state-of-the-art performance on massive labeled datasets when combined with recent innovations such as rectifiers and dropout [5]. Backprop is gradient descent plus the chain rule. Gradient descent has convergence guarantees in settings that are smooth or convex or both. However, modern convnets are neither smooth nor convex.

Although it is well-known that convnets are not convex, it is perhaps under-emphasized that the spectacular recent results obtained by convnets on benchmarks such as ImageNet [6] rely on architectures that are not smooth. Starting with AlexNet in 2012, every winner of the ImageNet classification challenge has used rectifier (also known as rectilinear) activation functions [7]–[10]. Rectifiers and max-pooling are non-smooth functions that are used in essentially all modern convnets [11]–[16]. In fact, the representational power of rectifier nets derives precisely from their nondifferentiability: the number of nondifferentiable boundaries in the parameter space grows exponentially with depth [17]. It follows that none of the standard convergence guarantees from the optimization literature apply to modern convnets.

In this paper, we provide the first convergence rates for convolutional networks with rectifiers and max-pooling. To do so we introduce a new class of gated games which generalize the convex games studied by Stoltz and Lugosi in [18]. We reformulate learning in convnets as gated games and adapt results on convergence to correlated equilibria from convex to gated games.

1.1 Open Questions in the Foundations of Deep Learning

Theoretical questions about deep learning can be loosely grouped into four categories:

Q1. Representational power

The set of functions approximated by neural networks has been extensively studied.
Early results show that neural networks with a single hidden layer are universal function approximators [19], [20]. More recently, researchers have focused on the role of depth and rectifiers in function approximation [21]–[23].

Q2. Generalization guarantees

Standard guarantees from VC-theory apply to neural nets, although these are quite loose [24], [25]. Recent work by Hardt et al shows that the convergence rate of stochastic gradient methods has implications for generalization bounds in both convex and nonconvex settings [26]. Unfortunately, their results rely on a smoothness assumption¹ that does not hold for rectifiers or max-pooling. Thus, although suggestive, the results do not apply to modern convnets. Feng et al have initiated a promising direction based on ensemble robustness [27], although robustness cannot be evaluated analytically. A related problem is to better understand regularization methods such as dropout [28], [29]. Regret-bounds for dropout have been found in the setting of prediction with expert advice [30]. However, it is unclear how to extend these results to neural nets.

Q3. Local and global minima

The third problem is to understand how far the critical points found by backpropagation are from local minima and the global minimum. The problem is challenging since neural networks are not convex. There has been theoretical work studying conditions under which gradient descent converges to local minima on nonconvex problems [31], [32]. The assumptions required for these results are quite strong, and include smoothness assumptions that do not hold for rectifiers.

¹ A function f : H → R is smooth if there exists β > 0 such that ‖∇f(w) − ∇f(w′)‖ ≤ β‖w − w′‖ for all w, w′ ∈ H. Rectifiers are not smooth for any β.
It has also been observed that saddles slow down training, even when the algorithm does not converge to a saddle point; designing algorithms that avoid saddles is an area of active research [33]. Recent work by Choromanska et al suggests that most local optima in neural nets have error rates that are reasonably close to the global optimum [34], [35]. Searching for good local optima may therefore be of less practical importance than ensuring rapid convergence.

Q4. Convergence rates

The last problem, and the focus of this paper, is to understand the convergence of gradient-based methods on neural networks. Speeding up the training time of neural nets is a problem of enormous practical importance. Although there is a large body of empirical work on optimizing neural nets, there are no theoretical guarantees that apply to the methods used to train rectifier convnets, since these networks are neither smooth nor convex. Recent work has investigated the convergence of proximal methods for nonconvex nonsmooth problems [36], [37]. However, computing prox-operators appears infeasible for neural nets. Interesting results have been derived for variance-reduced gradient optimization in the nonconvex setting [38]–[40], although smoothness is still required.

1.2 Outline

Training modern convnets, with rectifiers and max-pooling, entails searching over a rich subset of a universal class of function approximators with a loss function that is neither smooth nor convex. There is little hope of obtaining useful convergence results at this level of generality. It is therefore necessary to utilize the structure of rectifier networks. Our strategy is to decompose neural nets into interacting optimizers that are easier to analyze individually than the net as a whole. In short, the strategy is to import techniques from game theory into deep learning.
1.2.1 Utilizing network structure

We make two observations about neural network structure. The first, section 2, is to reformulate linear networks as convex games where the players are the units in the network. Although the loss is not a convex function of the weights of the network as a whole, it is a convex function of the weights of the individual players. The observation connects the dynamics of games under no-regret learning to the dynamics of linear networks under backpropagation.

Linear networks are a special case. It is natural to ask whether neural networks with nonlinearities are also convex games. Unfortunately, the answer is no: introducing any nonlinearity breaks convexity.² Although the situation seems hopeless it turns out, remarkably, that game-theoretic convergence results can be imported, despite nonconvexity, for precisely the nonlinearities used in modern convnets.

The second observation, section 3.1, is that a rectifier network is a linear network equipped with gates that control whether units are active for a given input. If a unit is inactive during the feedforward sweep, it is also inactive during backpropagation, and therefore does not update its weights. This motivates generalizing convex games to gated games.

1.2.2 Gated games

In a classical game, each player chooses a series of actions and, on each round, incurs a convex loss. The regret of a player is the difference between its cumulative loss and what the cumulative loss would have been had the player chosen the best action in hindsight [41]. Players can be implemented with so-called no-regret algorithms that minimize their loss relative to the best action in hindsight. More precisely, a no-regret algorithm has sublinear cumulative regret.

² In short, the "nonlinearity" f would have to be affine, since we need all linear combinations of f to be convex, including f(a) and −f(a) for all a.
The regret per round therefore vanishes asymptotically.

Section 3 introduces gated games, where players only incur a convex loss on rounds for which they are active. After extending the definitions of regret and correlated equilibrium to gated games, proposition 3 shows that if players follow no-gated-regret strategies, then they converge to a correlated equilibrium. Gated players generalize the sleepy experts introduced by Blum in [42], see also [43].

A useful technical tool is path-sum games. These are games constructed over directed acyclic graphs with weighted edges. Lemma 4 shows that path-sums encode the dynamics of the feedforward and feedback sweeps of rectifier nets. Proposition 5 shows that path-sum games are gated games and proposition 6 extends the result to convolutional networks.

1.2.3 Summary of contributions

The main contributions of the paper are as follows:

C1. Theorem 1: Rectifier convnets converge to a critical point under backpropagation at a rate controlled by the gated-regret of the units in the network. Corollary 1 specializes the result to gradient descent.

To the best of our knowledge, there are no previous convergence rates applicable to neural nets with rectifier nonlinearities and max-pooling. Finding conditions that guarantee convergence to local minima is deferred to future work. The results derive from a detailed analysis of the internal structure of rectifier nets and their updates under backpropagation. They require no new ideas regarding optimization in general. Our methods provide the first rigorous explanation for how methods designed for convex optimization improve convergence rates on modern convnets. The results do not apply to all neural networks: they hold for precisely the neural networks that perform best empirically [7]–[16].
The philosophy underlying the paper is to decompose training neural nets into two distinct tasks: communication and optimization. Communication is handled by backpropagation, which sends the correct gradient information to players (units) in the net. Optimization is handled locally by the individual players. Note that although this philosophy is widely applied when designing and implementing neural nets, it has been under-utilized in the analysis of neural nets. The role of players in a convnet is encapsulated in the Gated Forecaster setting, section 4. Our results provide a dictionary that translates the guarantees applicable to any no-regret algorithm into a convergence rate for the network as a whole.

C2. Reformulate neural networks as games.

The primary conceptual contribution of the paper is to connect game theory to deep learning. An interesting consequence of the main result is corollary 2, which provides a compact description of the weights learned by a neural network via the signal underlying correlated equilibrium. More generally, neural nets are a basic example of a game with a structured communication protocol (the path-sums) which determines how players interact [44]. It may be fruitful to investigate broader classes of structured games.

It has been suggested that rectifiers perform well because they are nonnegative homogeneous, which has implications for regularization [45] and robustness to changes in initialization [23]. Our results provide a complementary explanation. Rectifiers simultaneously (i) introduce a nonlinearity into neural nets, providing them with enormous representational power, and (ii) act as gates that select subnetworks of an underlying linear neural network, so that convex methods are applicable with guarantees.

C3. Logarithmic regret algorithm.
As a concrete application of the gated forecaster framework, we adapt the Online Newton Step algorithm [46] to neural nets and show it has logarithmic regret, corollary 3. The resulting algorithm approximates Newton's method locally at the level of individual units of the network, rather than globally for the network as a whole. The local, unit-wise implementation reduces computational complexity and sidesteps the tendency of quasi-Newton methods to approach saddle points.

C4. Conditional computation.

A secondary conceptual contribution is to introduce a framework for conditional computation. Up to this point, we assumed the gate is a fixed property of the game. Concretely, gates correspond to rectifiers and max-pooling in convolutional networks, which are baked into the architecture before the network is exposed to data. It is natural to consider optimizing when players in a gated game are active, section 4.5. Recent work along these lines has applied reinforcement learning algorithms to find data-dependent dropout policies [47], [48]. Conditional computation is closely related to models of attention [49], [50]. Slightly further afield, long short-term memories (LSTMs) and Gated Recurrent Units (GRUs) use complicated sets of sigmoid-gates to control activity within memory cells [51], [52]. Unfortunately the resulting architectures are difficult to analyze; see [53] for a principled simplification of recurrent neural network architectures motivated by similar considerations to the present paper.

As a first step towards analyzing conditional computation in neural nets, we introduce the Conditional Gate (CoG) setting. CoGs are contextual bandits or contextual partial monitors that optimize when sets of units are active. CoGs are a second class of players that can be introduced into neural games, and may provide a useful tool when designing deep learning algorithms.
1.2.4 Caveat

Neural nets are typically trained on minibatches sampled i.i.d. from a dataset. In contrast, the analysis below provides guarantees in adversarial settings. Our results are therefore conservative. Extending the analysis to take advantage of stochastic settings is an important open problem. However, it is worth mentioning that neural nets are increasingly applied to data that is not sampled i.i.d. For example, adversarially trained generative networks have achieved impressive performance [54], [55]. Similarly, there has been spectacular progress applying neural nets to reinforcement learning [56], [57]. Activity within a neural network is not i.i.d. even when the inputs are, a phenomenon known as internal covariate shift [58].

Two relevant developments are batch-normalization [58] and optimistic mirror descent [59]–[61]. Batch normalization significantly reduces the training time of neural nets by actively reducing internal covariate shift. Optimistic mirror descent takes advantage of the fact that all players in a game are implementing no-regret learning to speed up convergence. It is interesting to investigate whether reducing internal covariate shift can be understood game-theoretically, and whether optimistic learning algorithms can be adapted to neural networks.

1.3 Related work

A number of papers have brought techniques from convex optimization into the analysis of neural networks. A line of work initiated by Bengio in [62] shows that allowing the learning algorithm to choose the number of hidden units can convert neural network optimization into a convex problem, see also [63]. A convex multi-layer architecture is developed in [64], [65]. Although these methods are interesting, they have not achieved the practical success of convnets. In this paper, we analyze convnets as they are, rather than proposing a more tractable, but potentially less useful, model.
Game theory was developed to model interactions between humans [66]. However, it may be more directly applicable as a toolbox for analyzing machina economicus, that is, interacting populations of algorithms that are optimizing objective functions [67]. We go one step further, and develop a game-theoretic analysis of the internal structure of backpropagation. The idea of decomposing deep learning algorithms into cooperating modules dates back to at least the work of Bottou [68]. A related line of work modeling biological neural networks from a game-theoretic perspective can be found in [69]–[72].

2 WARMUP: LINEAR NETWORKS

The paper combines disparate techniques and notation from game theory, convex optimization and deep learning, and is therefore somewhat dense. To get oriented, we start with linear neural networks. Linear nets provide a simple but nontrivial worked example. They are not convex. Their energy landscapes and dynamics under backpropagation have been extensively studied and turn out to be surprisingly intricate [73], [74].

2.1 Neural networks

Consider a neural network with L − 1 hidden layers. Let h_0 := x denote the input to the network. For each layer l ∈ {1, . . . , L}, set a_l = W_l · h_{l−1} and h_l = s(a_l), where s is a (typically nonlinear) function applied coordinatewise. Let W := {W_1, . . . , W_L} denote the set of weight matrices. A convenient shorthand for the output of the network is f_W(x). For simplicity, suppose the output layer consists in a single unit, so that the output is a scalar (the assumption is dropped in section 3).

Let (x^(i), y^(i))_{i=1}^n denote a sample of labeled data and let ℓ(f, y) be a loss function that is convex in the first argument. Training the neural network reduces to solving the optimization problem

  W* = argmin_W E_{(x,y)∼P̂} [ ℓ(f_W(x), y) ],    (1)

where P̂ is the empirical distribution over the data.
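As a minimal sketch (our own illustration, not from the paper) of the layer recursion a_l = W_l · h_{l−1}, h_l = s(a_l) and the objective of Eq. (1); the shapes, data, and squared loss below are hypothetical toy choices:

```python
import numpy as np

def forward(weights, x, s=lambda a: a):
    """Feedforward sweep: a_l = W_l h_{l-1}, h_l = s(a_l).
    The default identity s gives a linear network."""
    h = x
    for W in weights:
        h = s(W @ h)
    return h

def empirical_risk(weights, data, loss):
    """The objective of Eq. (1): expected loss under the empirical distribution."""
    return float(np.mean([loss(forward(weights, x), y) for x, y in data]))

# Hypothetical toy instance: a 4-3-1 linear net with squared loss on random data.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 4)), rng.standard_normal((1, 3))]
data = [(rng.standard_normal(4), rng.standard_normal(1)) for _ in range(5)]
risk = empirical_risk(weights, data, lambda f, y: float((f - y) ** 2))
```

For the identity nonlinearity, the output of `forward` coincides with the matrix product of the weight matrices applied to x, which is the linear-network formula of section 2.2.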
Training is typically performed using gradient descent:

  W ← W − E_{(x,y)∼P̂} [ η · ∇_W ℓ(f_W(x), y) ].    (2)

Since Eq. (1) is not a convex function of W, there are no guarantees that gradient descent will find the global minimum.

2.1.1 Backprop

Let us recall how backprop is extracted from gradient descent. Subscripts now refer to units, not layers. Setting E := E[ℓ], Eq. (2) can be written more concretely for a single weight as

  w_ij ← w_ij − η · ∂E/∂w_ij.

By the chain rule, the derivative decomposes as

  ∂E/∂w_ij = (∂E/∂a_j) · (∂a_j/∂w_ij),

where δ_j := ∂E/∂a_j is the backpropagated error. Backprop computes δ_j recursively via

  δ_j = Σ_{k : j→k} (∂E/∂a_k) · (∂a_k/∂a_j) = Σ_{k : j→k} δ_k · w_jk · h′_j.    (3)

2.2 Linear networks

In a linear network, the function s is the identity. It follows that the output of the network is

  f(x) = f_W(x) = ( Π_{l=1}^L W_l ) · x.

For linear networks, the backpropagation computation in Eq. (3) reduces to

  δ_j = Σ_{k : j→k} δ_k · w_jk.    (4)

It is convenient for our purposes to decompose δ_j slightly differently, by factorizing it into ∂E/∂f, the derivative of the loss with respect to the output, and ∂f/∂a_j, the sensitivity of the network's output to unit j:

  δ_j = ∂E/∂a_j = (∂E/∂f) · (∂f/∂a_j).

We now reformulate the forward- and back-propagation in linear nets in terms of path-sums [34], [72]:

Definition 1 (path-sums in linear nets). A path is a directed sequence of edges connecting two units. Let (i ⇝ j) denote the set of paths from unit i to unit j. The weight of a path, weight(ρ), is the product of the weights of the edges along the path.
• sum of paths from i to j:  σ_{i⇝j} := Σ_{ρ ∈ (i⇝j)} weight(ρ)

• sum of paths from the input to j:  σ_{•j} := Σ_{s ∈ in} x_s · Σ_{ρ ∈ (s⇝j)} weight(ρ)

• sum of paths from j to the output:  σ_{j•} := Σ_{ρ ∈ (j⇝out)} weight(ρ)

• sum of paths avoiding j:  σ_{−j} := Σ_{s ∈ in} x_s · Σ_{ρ | ρ ∈ (s⇝out) and j ∉ ρ} weight(ρ)

Proposition 1 (structure of linear nets). Let σ_in(j) = (σ_{•i})_{i : i→j}. For a linear network as above:

1) Feedforward computation of outputs. The output of unit j is a_j = σ_{•j} = ⟨w_j, σ_in(j)⟩.

2) Sensitivity of network output. The sensitivity of the network's output to unit j, denoted ∂f_W/∂a_j, is the sum of the weights of all paths from unit j to the output unit: ∂f_W/∂a_j = σ_{j•}.

3) Decomposition of network output. The output of a linear network decomposes, with respect to unit j, as f_W(x) = ⟨w_j, σ_in(j)⟩ · σ_{j•} + σ_{−j} = a_j · σ_{j•} + σ_{−j}.

4) Backpropagated errors. Let β := ∂E/∂f denote the derivative of E with respect to the output of the network. The backpropagated error signal received by unit j is δ_j = β · σ_{j•}.

5) Error gradients. Finally, ∇_{w_j} E = δ_j · σ_in(j) = β · σ_{j•} · σ_in(j).

Note that a_j is a linear function of the weight vector w_j of unit j, and that neither the path-sums from j nor the path-sums avoiding j depend on w_j.

Proof. Direct computations.

The output of a linear neural network is a polynomial function of its weights. This can be seen from the path-sum perspective by noting that the output of a linear net is the sum over all paths from the input layer to the output unit.

2.3 Game theory and online learning

In this subsection we reformulate linear neural networks as convex games.

Definition 2 (convex game). A convex game ([N], H, ℓ) consists of a set [N] := {1, . . . , N} of players, actions H = Π_{j=1}^N H_j and loss vector ℓ = (ℓ_1, . . . , ℓ_N). Player j picks actions w_j from a convex compact set H_j ⊂ R^{d_j}.
Player j's loss ℓ_j : H → R is convex in the jth argument. The classical games of von Neumann and Morgenstern [66] are a special case of convex games where H_i = △^{d_i} is the probability simplex over a finite set of actions [d_i] available to each agent i, and the loss is multilinear.

It is well known that, even in the linear case, the loss of a neural network is not a convex function of its weights or the weights of its individual layers. However, the loss of a linear network is a convex function of the weights of the individual units.

Proposition 2 (linear networks are convex games). The players correspond to the units, where we impose that weight vectors are chosen from compact, convex sets H_j ⊂ R^{n_j} for each unit j, where n_j is the in-degree of unit j. Let w_{−j} denote the set of all weights in a neural network except those of unit j. Define the loss of unit j as ℓ_j(w_j, w_{−j}, x, y) := ℓ(f_W(x), y). Then ℓ_j is a convex function of w_j for all j.

Proof. Note that the loss of every unit is the same and corresponds to the loss of the network as a whole; the notation is introduced to emphasize the relevant parameters. By proposition 1.3, the loss can be written

  ℓ_j(w_j, w_{−j}, x, y) = ℓ( ⟨w_j, σ_in(j)⟩ · σ_{j•} + σ_{−j}, y ),

where σ_in(j), σ_{j•} and σ_{−j} are functions of w_{−j} and x, and so constant with respect to w_j. It follows that the loss is the composite of an affine function of w_j with a convex function, and so is convex.

Remark 1 (any neural network is a game). Any neural network can be reformulated as a game by treating the individual units as players. However, in general the loss will not be a convex function of the players' actions, and so convergence guarantees are not available.
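Propositions 1 and 2 can be checked numerically on a tiny linear net (a hypothetical 4-2-1 example of our own, with squared loss E = (f − y)²). The sketch verifies the decomposition f_W(x) = a_j · σ_{j•} + σ_{−j} and the error gradient ∇_{w_j}E = β · σ_{j•} · σ_in(j) against finite differences:

```python
import numpy as np

# Hypothetical 4-2-1 linear net: hidden rows W1[j] are the weight vectors
# w_j, and v holds the output weights. Loss: E = (f - y)^2.
rng = np.random.default_rng(1)
W1 = rng.standard_normal((2, 4))
v = rng.standard_normal(2)
x = rng.standard_normal(4)
y = 0.3

def output(W1, v, x):
    return float(v @ (W1 @ x))             # linear net output f_W(x)

f = output(W1, v, x)
beta = 2 * (f - y)                         # beta = dE/df for squared loss

j = 0
sigma_in_j = x                             # first-layer unit: sigma_in(j) = x
a_j = float(W1[j] @ x)                     # Prop 1.1: feedforward output of j
sigma_j_out = v[j]                         # Prop 1.2: paths from j to output
sigma_minus_j = float(v[1] * (W1[1] @ x))  # paths avoiding j

# Prop 1.3: decomposition of the network output with respect to unit j
decomposed = a_j * sigma_j_out + sigma_minus_j

# Prop 1.4-1.5: backpropagated error and gradient of E w.r.t. w_j
delta_j = beta * sigma_j_out
grad_wj = delta_j * sigma_in_j

# Finite-difference check of the gradient
eps = 1e-6
num = np.zeros(4)
for d in range(4):
    Wp, Wm = W1.copy(), W1.copy()
    Wp[j, d] += eps
    Wm[j, d] -= eps
    num[d] = ((output(Wp, v, x) - y) ** 2
              - (output(Wm, v, x) - y) ** 2) / (2 * eps)
```

Note that, holding all other weights fixed, `f` is affine in `W1[j]`, so composing with the convex squared loss gives a loss convex in w_j, which is exactly the content of Proposition 2.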
The main conceptual contribution of the paper is to show that modern convnets form a class of games which, although not convex, are close enough that convergence results from game theory can be adapted to the setting.

As a concrete example, consider a network equipped with the mean-square error. The loss of unit j is

  ℓ_j(w) = ( σ_{j•} · ⟨w_j, σ_in(j)⟩ − (y − σ_{−j}) )².

Define the residue y_{−j} := (y − σ_{−j}) / σ_{j•}. Unit j's loss can be rewritten

  ℓ_j(w) = σ_{j•}² · ( ⟨w_j, σ_in(j)⟩ − y_{−j} )²  if σ_{j•} ≠ 0, and 0 else.

Thus, unit j performs linear regression on the residue, amplified by a scalar that reflects the network's sensitivity to j's output.

The goal of each player j in a game is to minimize its loss ℓ_j. Unfortunately, this is not realistic, since ℓ_j(w_j, w_{−j}) depends on the actions of the other players. If the game is repeated, then an attainable goal is for players to minimize their regret. A player's cumulative regret is the difference between the loss incurred over a series of plays and the loss that would have been incurred had the player consistently chosen the best play in hindsight:

  Regret_j(T) = sup_{w_j ∈ H_j} (1/T) Σ_{t=1}^T [ ℓ_j(w^t_1, . . . , w^t_j, . . . , w^t_N) − ℓ_j(w^t_1, . . . , w_j, . . . , w^t_N) ].

An algorithm has no-regret if Regret_j(T) → 0 as T → ∞. That is, an algorithm has no-regret (asymptotically) if its cumulative regret, T · Regret_j(T), grows sublinearly in T. It is important to note that no-regret guarantees hold against any sequence of actions by the other players in the game, be they stochastic, adversarial, or something else. A player with no-regret plays optimally given the actions of the other players. Examples of no-regret algorithms on convex losses include online gradient descent, the exponential weights algorithm, follow the regularized leader, AdaGrad, and online mirror descent.
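As an illustration of the no-regret property (our own toy sketch, not from the paper): projected online gradient descent on per-round quadratic losses over H = [−1, 1] has average regret that shrinks as T grows. The loss sequence, the step size, and the grid used to approximate the best action in hindsight are all illustrative choices:

```python
import numpy as np

def avg_regret(T, seed=0):
    """Average regret of projected OGD on losses l_t(w) = (w - z_t)^2."""
    rng = np.random.default_rng(seed)
    z = rng.uniform(-1, 1, size=T)       # arbitrary sequence of loss parameters
    w, plays = 0.0, np.empty(T)
    for t in range(T):
        plays[t] = w
        # gradient is 2(w - z_t); step with eta_t = 1/(2*sqrt(t+1)),
        # then project back onto H = [-1, 1]
        w = np.clip(w - (w - z[t]) / np.sqrt(t + 1), -1.0, 1.0)
    incurred = np.sum((plays - z) ** 2)
    # best fixed action in hindsight, approximated on a fine grid
    best = min(np.sum((u - z) ** 2) for u in np.linspace(-1, 1, 2001))
    return (incurred - best) / T

# Average regret decays with the horizon, as the sublinear regret
# bound for online gradient descent predicts.
r_small, r_large = avg_regret(100), avg_regret(10000)
```

The same decay, transported through the gated-game machinery of sections 3 and 4, is what controls the network-level convergence rate in Theorem 1.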
It was observed by Foster and Vohra [75] that, if players play according to no-regret online learning rules, then the average of the sequence of plays converges to a correlated equilibrium [76]. Proposition 3 below shows a more general result: no-gated-regret algorithms converge to correlated equilibrium at a rate that depends on the gated-regret.

Let us briefly recall the relevant notion of correlated equilibrium. A distribution P ∈ △(H) is an ε-coarse correlated equilibrium if, for every player j, it holds that

  E_{w∼P} [ ℓ_j(w) ] ≤ inf_{w_j ∈ H_j} E_{w∼P} [ ℓ_j(w_j, w_{−j}) ] + ε.    (5)

When ε = 0 we refer to a coarse correlated equilibrium. The ε-term in Eq. (5) quantifies the deviation of P from a coarse correlated equilibrium [77]. The notion of correlated equilibrium is weaker than Nash equilibrium: the set of correlated equilibria contains the convex hull of the set of Nash equilibria as a subset.

We thus have two perspectives on linear nets: as networks or as games. To train a network, we use algorithms such as gradient descent implemented via backpropagation. To play a game, the players use no-regret algorithms. Sections 3 and 4 show the two perspectives are equivalent in the more general setting of modern convnets. In particular, correlated equilibria of games map to critical points of energy landscapes. Our strategy is then to convert results about the convergence of convex games to correlated equilibria into results about the convergence of backpropagation on neural nets.

3 GATED GAMES AND CONVOLUTIONAL NETWORKS

This section presents a detailed analysis of rectifier nets. The key observation is that rectifiers act as gates, which leads directly to gated games. Gated games are not convex. However, they are close enough that results on convergence to correlated equilibria can easily be adapted to the setting.
The main technical work of the section is to introduce notation to handle the interaction between path-sums and gates. Path-sum games are then introduced as a class of gated games capturing the dynamics of rectifier nets, see proposition 5. Finally, we show how to extend the results to convnets.

3.1 Rectifier networks

Historically, neural networks typically used sigmoid σ(a) = 1/(1 + e^{−a}) or tanh τ(a) = (e^a − e^{−a})/(e^a + e^{−a}) nonlinearities. Alternatives were investigated by Jarrett et al in [11], who found that rectifiers ρ(a) = max(0, a) often perform much better than sigmoids in practice. Rectifiers are now the default nonlinearity in convnets [12]–[16]. There are many variants on the theme, including noisy rectifiers, ρ_N(a) = max(0, a + N(0, σ(a))), and leaky rectifiers, ρ_L(a) = a if a > 0 and 0.01a else, introduced in [12] and [14] respectively.

3.1.1 Rectifiers gate error backpropagation

The rectifier is convex and differentiable except at a = 0, with subgradient

  1(a) = dρ/da = 1 if a ≥ 0, and 0 else.

The subgradient acts as an indicator function, which motivates the choice of notation. Substituting 1_j for h′_j in the recursive backprop computation, Eq. (3), yields

  δ_j = ∂ℓ/∂a_j = Σ_{k : j→k} δ_k · w_jk · 1_j.    (6)

Rectifiers act as gates: the only difference between backprop in a linear network, Eq. (4), and a rectifier network, Eq. (6), is that some units are zeroed out during both the forward and backward sweeps. In the forward sweep, rectifiers zero out units which would have produced negative outputs; in the backward sweep, the rectifier subgradients zero out the exact same units by acting as indicator functions. Zeroed out (or inactive) units do not contribute to the feedforward sweep and do not receive an error signal during backpropagation.
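The gating in Eq. (6) can be made concrete with a two-unit NumPy sketch (the weights and input are hypothetical, chosen so that one unit is inactive): the inactive unit is zeroed in the forward sweep and receives exactly zero error in the backward sweep, which a finite-difference check confirms.

```python
import numpy as np

relu = lambda a: np.maximum(0.0, a)

# Hypothetical weights: on input x, unit 0 is inactive (a_0 < 0)
# and unit 1 is active (a_1 > 0).
W1 = np.array([[1.0, -2.0],
               [2.0,  0.5]])
v = np.array([1.5, -0.7])
x = np.array([1.0, 1.0])

a = W1 @ x                        # pre-activations: a = (-1.0, 2.5)
f = float(v @ relu(a))            # forward sweep through the gates

# Backward sweep, Eq. (6): take E = f, so beta = dE/df = 1.
gate = (a >= 0).astype(float)     # rectifier subgradients 1_j
delta = 1.0 * v * gate            # backpropagated errors delta_j
grad_W1 = np.outer(delta, x)      # per-unit gradients: delta_j times input

# The inactive unit's gradient really is zero: a small perturbation of
# its weights leaves the (gated) output unchanged.
eps = 1e-4
W1p = W1.copy()
W1p[0, 0] += eps
fd = (float(v @ relu(W1p @ x)) - f) / eps
```

Here `delta[0] == 0` for the inactive unit and `fd == 0` exactly, since the perturbed pre-activation stays below zero; the active unit's error is just β · v_1, matching Eq. (6).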
In effect, the rectifiers select a linear subnetwork of active units for use during the forward and backward sweeps.

3.2 Gated Games

We have seen that linear networks are convex games. Extending the result to rectifier networks requires generalizing convex games to the setting where only a subset of players are active on each round.

Definition 3 (gated games). Let P(S) denote the powerset of S. A gated game ([N], H, ℓ, A) is a convex game equipped with a gate A : H → P([N]). Players j ∈ A(w_{1:N}) are active. Each active player incurs a convex loss ℓ_j : H_{A(w_{1:N})} → R that depends on its action and the actions of the other active players, where H_{A(w_{1:N})} := Π_{j ∈ A(w_{1:N})} H_j. Inactive players do not incur a loss.

The gated forecaster setting formalizes the perspective of a player in a gated game:

Setting 1: GATED FORECASTER
  input: set E of other players
  initialize w^1 ∈ H
  for rounds t = 1, 2, . . . do
    Active players E^t ⊂ E reveal actions x^t_E := {x^t_i}_{i ∈ E^t}
    if gate(w^t, x^t) then
      Environment reveals convex loss ℓ^t
      Forecaster incurs loss ℓ^t(w^t, x^t_E)
      Forecaster updates weights: w^{t+1} ← weight(w^t, x^t_E, ℓ^t)
    else
      Forecaster inactive; no loss incurred
      Weights remain unchanged: w^{t+1} ← w^t

In neural nets, rectifier functions, max-pooling, and dropout all act as forms of gates that control (deterministically or probabilistically) whether units actively respond to an input. Importantly, inactive units do not receive any error under backpropagation, as discussed in section 3.1. In the gated setting, players only experience regret with respect to their actions when active.
We therefore introduce the gated-regret:
$$\text{GRegret}_i(T) = \frac{1}{T_i} \left( \sum_{\{t \in [T] : i \in \mathcal{A}(\mathbf{w}^t)\}} \ell^t_i(\mathbf{w}^t_i, \mathbf{w}^t_{-i}) \;-\; \inf_{\mathbf{w}_i \in \mathcal{H}_i} \sum_{\{t \in [T] : i \in \mathcal{A}(\mathbf{w}^t)\}} \ell^t_i(\mathbf{w}_i, \mathbf{w}^t_{-i}) \right),$$
where $T_i := |\{t \in [T] : i \in \mathcal{A}(\mathbf{w}^t)\}|$ is the number of rounds in which player $i$ is active.

Remark 2 (permanently inactive units). If a player is permanently inactive then, trivially, it has no gated-regret. This suggests there is a loophole in the definition that players can exploit. We make two comments. Firstly, players do not control when they are inactive; rather, they optimize over the rounds they are exposed to. Secondly, in practice, some units in rectifier networks do become inactive. The problem is mild: rectifier nets still outperform other architectures. Reducing the number of inactive units was one of the motivations for maxout units [78].

The next step is to extend correlated equilibrium to gated games. The intuition behind correlated equilibrium is that a signal $\mathbf{P}(\mathbf{w})$ is sent to all players, which guides their behavior. However, inactive players do not observe the signal. The signal received by player $j$ when active is the conditional distribution
$$\mathbf{P}_j := \mathbf{P}\left(\mathbf{w} \mid j \in \mathcal{A}(\mathbf{w})\right) = \begin{cases} \frac{\mathbf{P}(\mathbf{w})}{\mathbf{P}(\{\mathbf{w} \mid j \in \mathcal{A}(\mathbf{w})\})} & \text{if } j \in \mathcal{A}(\mathbf{w}) \\ 0 & \text{else.} \end{cases}$$

The following proposition extends the result that no-regret learning leads to coarse correlated equilibria from convex to gated games:

Proposition 3 (no gated-regret → correlated equilibrium). Let $\mathcal{G} = ([N], \mathcal{H}, \boldsymbol{\ell}, \mathcal{A})$ be a gated game and suppose that the players follow strategies with gated-regret $\leq \epsilon$. Then the empirical distribution $\hat{\mathbf{P}}$ of the actions played is an $\epsilon$-coarse correlated equilibrium.

The rate at which the gated-regret of the players decays thus controls the rate of convergence to a correlated equilibrium.

Proof. We adapt a theorem for two-player convex games by Hazan and Kale in [79] to our setting.
Since player $j$ has gated-regret $\leq \epsilon$, it follows that
$$\frac{1}{T_j} \sum_{\{t \in [T] : j \in \mathcal{A}(\mathbf{w}^t)\}} \left( \ell_j(\mathbf{w}^t) - \ell_j(\mathbf{w}_j, \mathbf{w}^t_{-j}) \right) \leq \epsilon \quad \forall \mathbf{w}_j \in \mathcal{H}_j.$$
The empirical distribution $\hat{\mathbf{P}}_j$ assigns probability $\frac{1}{T_j}$ to each joint action $\mathbf{w}^t$ occurring while player $j$ is active. We can therefore rewrite the above inequality as
$$\mathbb{E}_{\mathbf{w} \sim \hat{\mathbf{P}}_j}\left[ \ell_j(\mathbf{w}) \right] - \mathbb{E}_{\mathbf{w} \sim \hat{\mathbf{P}}_j}\left[ \ell_j(\mathbf{w}_j, \mathbf{w}_{-j}) \right] \leq \epsilon \quad \forall \mathbf{w}_j \in \mathcal{H}_j,$$
and the result follows.

3.3 Path-Sum Games

Let $G$ be a directed acyclic graph corresponding to a rectifier neural network with $N$ units that are not input units. We provide an alternate description of the dynamics of the feedforward and feedback sweeps on the neural net in terms of path-sums. The definitions are somewhat technical; the underlying intuition can be found in the discussions of linear and rectifier networks above.

Let $d_j := |\{i : i \to j\}|$ denote the indegree of node $j$. Every edge is assigned a weight. In addition, each source node $s$ (with no incoming edges) is assigned a weight $w_s$. The weights assigned to source nodes are used to encode the input to the neural net.

Recall that, given a path $\rho$, we write $\rho \in (j \rightsquigarrow k)$ if $\rho$ starts at node $j$ and finishes at node $k$. Given a set of nodes $\mathcal{A}$, write $\rho \subset \mathcal{A}$ if all the nodes along $\rho$ are elements of $\mathcal{A}$.

Definition 4 (path-sums in rectifier nets). The weight of a path is the product of the weights of the edges along the path. If a path starts at a source node $s$, then $w_s$ is included in the product. Given a set of nodes $\mathcal{A}$ and a node $j$, the path-sum $\sigma^{\mathcal{A}}_{\bullet j}$ is the sum of the weights of all paths in $\mathcal{A}$ from source nodes to $j$:
$$\sigma^{\mathcal{A}}_{\bullet j} := \sum_{s \in \text{source}} \;\; \sum_{\{\rho \,\mid\, \rho \subset \mathcal{A} \text{ and } \rho \in (s \rightsquigarrow j)\}} \text{weight}(\rho).$$
By convention, $\sigma^{\mathcal{A}}_{\bullet j}$ is zero if no such path exists (for example, if $j \notin \mathcal{A}$).

The set $\mathcal{A}$ of active units is defined inductively on $\kappa$, which tracks the length of the longest path from source units to a given unit. Let $\mathcal{A}^\kappa = \{\text{active units with longest path from a source} \leq \kappa\}$.
Source units are always active, so set $\mathcal{A}^0 = \{\text{all source units}\}$. Suppose unit $j$ has source-path-length $\kappa + 1$ and the elements of $\mathcal{A}^\kappa$ have been identified. Then $j$ is active if it corresponds to
• a linear unit, or
• a rectifier with $\sigma^{\mathcal{A}^\kappa \cup \{j\}}_{\bullet j} > 0$.

For simplicity we suppress the dependence of $\mathcal{A}$ on the weights from the notation. It is also convenient to drop the superscript $\mathcal{A}$ via the shorthand $\varsigma_{\bullet j} := \sigma^{\mathcal{A}}_{\bullet j}$.

The following proposition connects active path-sums to the feedforward and feedback sweeps in a neural network:

Proposition 4 (structure of rectifier nets). Let $\mathbf{w}_j = (w_i)_{\{i : i \to j\}}$ and $\boldsymbol{\varsigma}_{\text{in}(j)} := (\varsigma_{\bullet i})_{\{i : i \to j\}}$. Further, introduce notation $\boldsymbol{\varsigma}_{\text{out}} = (\varsigma_{\bullet o})_{o \in \text{out}}$ for the output layer. Then:

1. Feedforward outputs. If inputs to the network are encoded in source weights as above, then the output of unit $j$ in the neural network is $\varsigma_{\bullet j}$. Specifically, if $j$ is linear then $\varsigma_{\bullet j} = \langle \mathbf{w}_j, \boldsymbol{\varsigma}_{\text{in}(j)} \rangle$; if $j$ is a rectifier then $\varsigma_{\bullet j} = \max(0, \langle \mathbf{w}_j, \boldsymbol{\varsigma}_{\text{in}(j)} \rangle)$.

2. Decomposition of network output. Let $\boldsymbol{\varsigma}_{j \bullet}$ denote the sum of active path weights from $j$ to the output layer. The output of the network decomposes as
$$\boldsymbol{\varsigma}_{\text{out}} = \begin{cases} \boldsymbol{\varsigma}_{j \bullet} \cdot \langle \mathbf{w}_j, \boldsymbol{\varsigma}_{\text{in}(j)} \rangle + \boldsymbol{\sigma}^{\mathcal{A} \setminus \{j\}}_{\text{out}} & \text{if player } j \text{ is active} \\ \boldsymbol{\sigma}^{\mathcal{A} \setminus \{j\}}_{\text{out}} & \text{else,} \end{cases}$$
where $\boldsymbol{\sigma}^{\mathcal{A} \setminus \{j\}}_{\text{out}}$ is the sum over active paths from sources to outputs that do not intersect $j$.

3. Backpropagated errors. Suppose the network is equipped with error function $\ell(\boldsymbol{\varsigma}_{\text{out}}, \mathbf{y})$. Let $\mathbf{g} := \nabla_{\text{out}} \ell(\boldsymbol{\varsigma}_{\text{out}}, \mathbf{y})$ denote the gradient of $\ell$. The backpropagated error signal received by unit $j$ is $\delta_j = \langle \mathbf{g}, \boldsymbol{\varsigma}_{j \bullet} \rangle$.

4. Error gradients. Finally,
$$\left\langle \nabla_{\mathbf{w}_j} \ell(\boldsymbol{\varsigma}_{\text{out}}, \mathbf{y}), \mathbf{w}_j \right\rangle = \left\langle \mathbf{g}, \boldsymbol{\varsigma}_{\text{out}} - \boldsymbol{\sigma}^{\mathcal{A} \setminus \{j\}}_{\text{out}} \right\rangle = \langle \mathbf{g}, \boldsymbol{\varsigma}_{j \bullet} \rangle \cdot \langle \mathbf{w}_j, \boldsymbol{\varsigma}_{\text{in}(j)} \rangle = \delta_j \cdot \varsigma_{\bullet j}.$$

Proof. Direct computation, paralleling Proposition 1.

The output of a rectifier network is a piecewise polynomial function of its weights.
To see this, observe that the output of a rectifier net is the sum over all active paths from the input layer to the output unit; see also [34].

The next step is to construct a game played by the units of the neural network. It turns out there are two ways of doing so:

Definition 5 (path-sum games). The set of players is $\{0\} \cup [N]$. The zeroth player corresponds to the environment and is always active. The environment plays labeled datapoints $\mathbf{w}_0 = (\mathbf{w}_{\text{source}}, \mathbf{y}) \in \mathbb{R}^{|\text{source}| + |\text{label}|}$ and suffers no loss. The remaining $N$ players correspond to non-source units of $G$. Player $j$ plays weight vector $\mathbf{w}_j$ in compact convex $\mathcal{H}_j \subset \mathbb{R}^{d_j}$. The losses in the two games are:

• Path-sum prediction game (PS-Pred). Player $j$ incurs $\ell_j(\mathbf{w}) := \ell(\boldsymbol{\varsigma}_{\text{out}}, \mathbf{y})$ when active and no loss when inactive.
• Path-sum gradient game (PS-Grad). Player $j$ incurs $\langle \nabla \ell_j, \mathbf{w}_j \rangle$ when active, where $\nabla \ell_j := \nabla_{\mathbf{w}_j} \ell_j$, and no loss when inactive.

PS-Pred and PS-Grad are analogs of prediction with expert advice and the hedge setting, respectively. In the hedge setting, players receive linear losses and choose actions from the simplex; in PS-Grad, players receive linear losses. The results below hold for both games, although our primary interest is in PS-Pred. Note that PS-Grad has the important property that the loss of player $j$ is a linear function of player $j$'s action when it is active:
$$\langle \nabla \ell_j, \mathbf{w}_j \rangle = \begin{cases} \delta_j \cdot \langle \mathbf{w}_j, \boldsymbol{\varsigma}_{\text{in}(j)} \rangle & \text{if } j \text{ active} \\ 0 & \text{else.} \end{cases}$$
Finally, observe that the regret when playing PS-Grad upper bounds the regret under PS-Pred, since regret bounds for linear losses are the worst case amongst convex losses.

Remark 3 (minibatch games). It is possible to construct batch or minibatch games by allowing the environment to play sequences of moves on each round.

Proposition 5 (path-sum games are gated games). PS-Pred and PS-Grad are gated games if the error function $\ell(\boldsymbol{\varsigma}_{\text{out}}, \mathbf{y})$ is convex in its first argument. That is, rectifier nets are gated games.
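Definition 4 and Proposition 4.1 can be illustrated on a toy DAG. The sketch below (hypothetical node names and weights) computes path-sums recursively, runs the inductive construction of the active set, and confirms that the path-sum at the output matches the ordinary forward pass.

```python
# Sketch: path-sums and the inductive active set of Definition 4,
# checked against the ordinary forward pass (Proposition 4.1) on a
# tiny DAG: sources s1, s2 -> rectifier r -> linear output o.

edges = {("s1", "r"): 1.0, ("s2", "r"): -3.0, ("r", "o"): 2.0}
source_w = {"s1": 1.0, "s2": 0.5}         # input encoded as source weights

def path_sum(node, active):
    """sigma^A_{.j}: sum of the weights of all paths inside `active`
    from sources to `node`, with the source weight included."""
    if node not in active:
        return 0.0
    if node in source_w:
        return source_w[node]
    return sum(w * path_sum(i, active)
               for (i, j), w in edges.items() if j == node)

active = {"s1", "s2"}                      # sources are always active
if path_sum("r", active | {"r"}) > 0:      # rectifier active iff sum > 0
    active.add("r")
active.add("o")                            # linear units are always active

# Forward pass: r = max(0, 1*1.0 + (-3)*0.5) = 0, so the output is 0.
assert "r" not in active
assert path_sum("o", active) == 0.0

# With a different input the rectifier switches on and the path-sum
# at the output equals the forward pass: 2 * (1*1.0 + (-3)*0.1) = 1.4.
source_w["s2"] = 0.1
active2 = {"s1", "s2"}
if path_sum("r", active2 | {"r"}) > 0:
    active2.add("r")
active2.add("o")
assert "r" in active2
assert abs(path_sum("o", active2) - 1.4) < 1e-12
```

The same input can thus switch a unit in or out of the active set, changing which linear subnetwork the path-sums range over.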
The gating structure is essential; path-sum games are not convex, even for rectifiers with the mean-squared error: composing a rectifier with a quadratic can yield the nonconvex function $f(x) = (\max(0, x) - 1)^2$. Even simpler, the negative of a rectifier is not convex.

Proof. It is required to show that the losses under PS-Pred and PS-Grad, that is $\ell_j$ and $\langle \nabla \ell_j, \mathbf{w}_j \rangle$, are convex functions of $\mathbf{w}_j$ when player $j$ is active. Clearly each loss is a scalar-valued function.

By Proposition 4.2, when player $j$ is active the network loss has the form
$$\ell_j(\mathbf{w}) = \ell\left( \boldsymbol{\varsigma}_{j \bullet} \cdot \langle \mathbf{w}_j, \boldsymbol{\varsigma}_{\text{in}(j)} \rangle + \boldsymbol{\sigma}^{\mathcal{A} \setminus \{j\}}_{\text{out}}, \, \mathbf{y} \right) = \ell\left( \mathbf{c}_1 \cdot \langle \mathbf{w}_j, \boldsymbol{\varsigma}_{\text{in}(j)} \rangle + \mathbf{c}_2, \, \mathbf{y} \right).$$
The terms $\boldsymbol{\varsigma}_{j \bullet}$, $\boldsymbol{\varsigma}_{\text{in}(j)}$ and $\boldsymbol{\sigma}^{\mathcal{A} \setminus \{j\}}_{\text{out}}$ are all constants with respect to $\mathbf{w}_j$. Thus the network loss is an affine transformation of $\mathbf{w}_j$ (a dot product followed by multiplication by a constant and addition of a constant) composed with a convex function, and so is convex.

By Proposition 4.4, the gradient loss has the form
$$\nabla_{\mathbf{w}_j} \ell_j(\mathbf{w}) = \langle \mathbf{g}, \boldsymbol{\varsigma}_{j \bullet} \rangle \cdot \boldsymbol{\varsigma}_{\text{in}(j)} = c_1 \cdot \boldsymbol{\varsigma}_{\text{in}(j)}$$
when player $j$ is active, so the loss $\langle \nabla \ell_j, \mathbf{w}_j \rangle$ is linear in $\mathbf{w}_j$, since all the other terms are constants with respect to $\mathbf{w}_j$.

Remark 4 (dependence of loss on other players). We have shown that the loss of player $j$ is a convex function of player $j$'s action when player $j$ is active. Note that: (i) the loss of player $j$ depends on the actions chosen by the other players in the game; and (ii) the loss is not a convex function of the joint action of all the players. It is for these reasons that the game-theoretic analysis is essential.

The proposition does not merely hold for toy cases. The next section extends the result to maxout units, dropout, DropConnect, and convolutional networks with shared weights and max-pooling. Proposition 5 thus applies to convolutional networks as they are used in practice.
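The identity from Proposition 4.4 used in the proof, $\langle \nabla \ell_j, \mathbf{w}_j \rangle = \delta_j \cdot \langle \mathbf{w}_j, \boldsymbol{\varsigma}_{\text{in}(j)} \rangle$ for an active unit, can be checked numerically. The sketch below (a hypothetical one-rectifier net with squared error) compares the closed form against a finite-difference directional derivative.

```python
# Sketch: for an active rectifier, the PS-Grad loss <grad l_j, w_j>
# equals delta_j * <w_j, varsigma_in(j)> (Proposition 4.4).  We check
# the closed form against the directional derivative
# d/dt l((1 + t) * w_j) at t = 0 on a tiny network:
#   x -> rectifier(w) -> linear output (weight v), squared error.

def net_loss(w, x, v, y):
    h = max(0.0, sum(wi * xi for wi, xi in zip(w, x)))  # rectifier unit
    return (v * h - y) ** 2

w, x, v, y = [0.5, 1.0], [1.0, 2.0], 3.0, 1.0
a = sum(wi * xi for wi, xi in zip(w, x))   # pre-activation 2.5 > 0: active
g = 2.0 * (v * a - y)                      # gradient of loss at the output
delta = g * v                              # backpropagated error delta_j
ps_grad = delta * a                        # delta_j * <w_j, varsigma_in(j)>

eps = 1e-6                                 # finite-difference check
fd = (net_loss([(1 + eps) * wi for wi in w], x, v, y)
      - net_loss(w, x, v, y)) / eps
assert abs(fd - ps_grad) < 1e-3
```

Because the gate is fixed while the unit stays active, the loss is affine along the ray $(1 + t)\mathbf{w}_j$, which is why the two quantities agree.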
Finally, note that Proposition 5 does not hold for leaky rectifier units [14], or for units that are not piecewise linear, such as sigmoid or tanh units.

3.4 Convolutional Networks

We extend Proposition 5 from rectifier nets to convnets.

Proposition 6 (convnets are gated games). Let $\mathcal{N}$ be a convolutional network with any combination of linear, rectifier, maxout and max-pooling units. Then $\mathcal{N}$ is a gated game.

The proof consists in identifying the relevant players and gates for each case (maxout units, max-pooling, weight-tying in convolutional layers, dropout and DropConnect) in turn. We sketch the result below.

3.4.1 Maxout units

Maxout units were introduced in [78] to deal with the problem that rectifier units sometimes saturate at zero, resulting in them being insufficiently active, and to complement dropout. A maxout unit has $k_j$ weight vectors $\mathbf{w}_{j,1}, \ldots, \mathbf{w}_{j,k_j} \in \mathbb{R}^{d_j}$ and, given input $\mathbf{f}_j \in \mathbb{R}^{d_j}$, outputs
$$\text{maxout: } m(\mathbf{f}_j) := \max_{c \in [k_j]} \langle \mathbf{w}_{j,c}, \mathbf{f}_j \rangle.$$
Construct a new graph $\tilde{G}$ which has one node per input, linear and rectifier unit, and $k_j$ nodes per maxout unit. Players correspond to nodes of $\tilde{G}$ and are denoted by Greek letters. The extended graph inherits its edge structure from $G$: there is a connection between players $\alpha \to \beta$ in $\tilde{G}$ iff the underlying units in $G$ are connected. Path weights and path-sums are defined exactly as before, except that we work on $\tilde{G}$ instead of $G$. The definition of active units is modified as follows. The set $\mathcal{A}$ of active players is defined inductively. Let $\mathcal{A}^\kappa = \{\text{active players with longest source-path} \leq \kappa\}$. Source players are active ($\kappa = 0$). Player $\beta$ with source-path-length $\kappa + 1$ is active if it corresponds to
• a linear unit; or
• a rectifier with $\sigma^{\mathcal{A}^\kappa \cup \{\beta\}}_{\bullet \beta} > 0$; or
• a maxout player with $\sigma^{\mathcal{A}^\kappa \cup \{\beta\}}_{\bullet \beta} > \sigma^{\mathcal{A}^\kappa \cup \{\alpha\}}_{\bullet \alpha}$ for all other players $\alpha$ corresponding to the same maxout unit.
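A minimal sketch of the maxout gate: of the $k_j$ players representing one maxout unit, only the player achieving the maximum pre-activation is active. The weights below are illustrative, and ties are broken by index for simplicity (the definition above requires a strict maximum).

```python
# Sketch: a maxout unit as k competing players; the player achieving
# the maximum pre-activation <w_{j,c}, f_j> is active (Section 3.4.1),
# and only that player's weights receive error under backpropagation.

def maxout_active(weight_vectors, f):
    """Return the maxout output and the index of the active player."""
    pre = [sum(w_i * f_i for w_i, f_i in zip(w, f)) for w in weight_vectors]
    c = max(range(len(pre)), key=lambda i: pre[i])   # argmax piece
    return pre[c], c

W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]   # k = 3 illustrative pieces
out, winner = maxout_active(W, f=[0.2, 0.9])
assert (out, winner) == (0.9, 1)             # only player 1 is active
```

The gate thus selects one linear piece per input, exactly as the rectifier gate selects a linear subnetwork.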
3.4.2 Max-pooling

Max-pooling is heavily used in convnets as a form of dimensionality reduction. A max-pooling unit $j$ has no parameters and outputs the maximum of the outputs of the units from which it receives inputs:
$$\text{max-pooling: } \max_{\{i : i \to j\}} \sigma^{\mathcal{A}}_{\bullet i}.$$
Gates can be extended to max-pooling by adding the condition that, to be active, the output of unit $i$ must be greater than that of any other unit $i'$ that feeds (directly) into the same pooling unit. A unit may thus produce an output and still count as inactive because it is ignored by the max-pooling layer, and so has no effect on the output of the neural network. In particular, units that are ignored by max-pooling do not update their weights under backpropagation.

3.4.3 Convolutional layers

Units in a convolutional layer share weights. Conversely to maxout units, each of which corresponds to many players, weight-sharing units correspond to a single composite player. Suppose that rectifier units $j_1, \ldots, j_L$ share weight vector $\mathbf{w}_j$. Component $\alpha$ in layer $j$ is active if $\alpha \in \mathcal{A}_j$. Notice that, since players correspond to many units, two players may be connected by more than one edge. Player $j$ is active if any of its components is active, i.e. if $|\mathcal{A}_j| > 0$. The output of player $j$ is the sum of its active components: $\varsigma_j = \langle \mathbf{w}_j, \sum_{\alpha \in \mathcal{A}_j} \sigma^{\mathcal{A}}$