Adaptive Neural Networks for Efficient Inference

Tolga Bolukbasi (1), Joseph Wang (2), Ofer Dekel (3), Venkatesh Saligrama (1)
(1) Boston University, Boston, MA, USA; (2) Amazon, Cambridge, MA, USA; (3) Microsoft Research, Redmond, WA, USA. Correspondence to: Tolga Bolukbasi.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Abstract

We present an approach to adaptively utilize deep neural networks in order to reduce the evaluation time on new examples without loss of accuracy. Rather than attempting to redesign or approximate existing networks, we propose two schemes that adaptively utilize networks. We first pose an adaptive network evaluation scheme, where we learn a system to adaptively choose the components of a deep network to be evaluated for each example. By allowing examples correctly classified using early layers of the system to exit, we avoid the computational time associated with full evaluation of the network. We extend this to learn a network selection system that adaptively selects the network to be evaluated for each example. We show that computational time can be dramatically reduced by exploiting the fact that many examples can be correctly classified using relatively efficient networks and that complex, computationally costly networks are only necessary for a small fraction of examples. We pose a global objective for learning an adaptive early exit or network selection policy and solve it by reducing the policy learning problem to a layer-by-layer weighted binary classification problem. Empirically, these approaches yield dramatic reductions in computational cost, with up to a 2.8x speedup on state-of-the-art networks from the ImageNet image recognition challenge with minimal (<1%) loss of top-5 accuracy.

1. Introduction

Deep neural networks (DNNs) are among the most powerful and versatile machine learning techniques, achieving state-of-the-art accuracy in a variety of important applications, such as visual object recognition (He et al., 2016), speech recognition (Hinton et al., 2012), and machine translation (Sutskever et al., 2014). However, the power of DNNs comes at a considerable cost, namely, the computational cost of applying them to new examples. This cost, often called the test-time cost, has increased rapidly for many tasks (see Fig. 1) with ever-growing demands for improved performance in state-of-the-art systems. As a case in point, the Resnet152 (He et al., 2016) architecture, with 152 layers, realizes a substantial 4.4% accuracy gain in top-5 performance over GoogLeNet (Szegedy et al., 2015) on the large-scale ImageNet dataset (Russakovsky et al., 2015), but is about 14x slower at test time. The high test-time cost of state-of-the-art DNNs means that they can only be deployed on powerful computers equipped with massive GPU accelerators. As a result, technology companies spend billions of dollars a year on expensive and power-hungry computer hardware. Moreover, high test-time cost prevents DNNs from being deployed on resource-constrained platforms, such as those found in Internet of Things (IoT) devices, smart phones, and wearables. This problem has given rise to a concentrated research effort to reduce the test-time cost of DNNs. Most of the work on this topic focuses on designing more efficient network topologies and on compressing pre-trained models using various techniques (see related work below).

Figure 1. Performance versus evaluation complexity of the DNN architectures that won the ImageNet challenge over the past several years. Model evaluation times increase exponentially with respect to the increase in accuracy.

We propose a different approach, which leaves the original DNN intact and instead changes the way in which we apply the DNN to new
examples. We exploit the fact that natural data is typically a mix of easy examples and difficult examples, and we posit that the easy examples do not require the full power and complexity of a massive DNN.

We pursue two concrete variants of this idea. First, we propose an adaptive early-exit strategy that allows easy examples to bypass some of the network's layers. Before each expensive neural network layer (e.g., convolutional layers), we train a policy that determines whether the current example should proceed to the next layer or be diverted to a simple classifier for immediate classification. Our second approach, an adaptive network selection method, takes a set of pre-trained DNNs, each with a different cost/accuracy trade-off, and arranges them in a directed acyclic graph (Trapeznikov & Saligrama, 2013; Wang et al., 2015), with the cheapest model first and the most expensive one last. We then train an exit policy at each node in the graph, which determines whether we should rely on the current model's predictions or forward the example to the most beneficial next branch. In this context we pose a global objective for learning an adaptive early exit or network selection policy and solve it by reducing the policy learning problem to a layer-by-layer weighted binary classification problem.

We demonstrate the merits of our techniques on the ImageNet object recognition task, using a number of popular pretrained DNNs. The early exit technique speeds up the average test-time evaluation of GoogLeNet (Szegedy et al., 2015) and Resnet50 (He et al., 2016) by 20-30% within reasonable accuracy margins. The network cascade achieves a 2.8x speed-up over the pure Resnet50 model at 1% top-5 accuracy loss, and a 1.9x speed-up with no change in model accuracy.
We also show that our method can approximate an oracle policy that sees the true errors suffered for each instance.

In addition to reducing the average test-time cost of DNNs, it is worth noting that our techniques are compatible with the common design of large systems of mobile devices, such as smart phone networks or smart surveillance-camera networks. These systems typically include a large number of resource-constrained edge devices that are connected to a central and resource-rich cloud. One of the main challenges involved in designing these systems is determining whether the machine-learned models will run on the devices or in the cloud. Offloading all of the work to the cloud can be problematic due to network latency, limited cloud ingress bandwidth, cloud availability and reliability issues, and privacy concerns. Our approach can be used to design such a system, by deploying a small inaccurate model and an exit policy on each device and a large accurate model in the cloud. Easy examples would be handled by the devices, while difficult ones would be forwarded to the cloud. Our approach naturally generalizes to a fog computing topology, where resource-constrained edge devices are connected to a more powerful local gateway computer, which in turn is connected to a sequence of increasingly powerful computers along the path to the data center. Such designs also allow our method to be used in memory-constrained settings, since complex models are offloaded from the device.

2. Related Work

Past work on reducing the evaluation time of deep neural networks has centered on reductions in precision and arithmetic computational cost, design of efficient network structures, and compression or sparsification of networks to reduce the number of convolutions, neurons, and edges. The approach proposed in this paper is complementary.
Our approach does not modify network structure or training, and can be applied in tandem with these approaches to further reduce computational cost.

The early efforts to compress large DNNs used a large teacher model to generate an endless stream of labeled examples for a smaller student model (Bucila et al., 2006; Hinton et al., 2015). The wealth of labeled training data generated by the teacher model allowed the small student model to mimic its accuracy.

Reduced-precision networks (Gong et al., 2014; Courbariaux et al., 2015; Chen et al., 2015; Hubara et al., 2016a; Wu et al., 2016; Rastegari et al., 2016; Hubara et al., 2016b) have been extensively studied to reduce the memory footprint of networks and their test-time cost. Similarly, computationally efficient network structures have been proposed to reduce the computational cost of deep networks by exploiting efficient operations to approximate complex functions, such as the inception layers introduced in GoogLeNet (Szegedy et al., 2015).

Network sparsification techniques attempt to identify and prune away redundant parts of a large neural network. A common approach is to remove unnecessary nodes and edges from the network (Liu et al., 2015; Iandola et al., 2016; Wen et al., 2016). In convolutional neural networks, the expensive convolution layers can be approximated (Bagherinezhad et al., 2016) and redundant computation can be avoided (Figurnov et al., 2016b).

More recently, researchers have designed spatially adaptive networks (Figurnov et al., 2016a; Bengio et al., 2015) where nodes in a layer are selectively activated. Others have developed cascade approaches (Leroux et al., 2017; Odena et al., 2017) that allow early exits based on confidence feedback. Our approach can be seen as an instance of conditional computation, where we seek computational gains through layer-by-layer and network-level early exits.
However, we propose a general framework that optimizes a novel system risk incorporating both computational cost and accuracy. Our method does not require within-layer modifications and works with directed acyclic graphs that allow multiple model evaluation paths.

Our techniques for adaptive DNNs borrow ideas from the related sensor selection problem (Xu et al., 2013; Kusner et al., 2014; Wang et al., 2014; 2015; Trapeznikov & Saligrama, 2013; Nan et al., 2016; Wang & Saligrama, 2012). The goal of sensor selection is to adaptively choose sensor measurements or features for each example.

Figure 2. (Left) An example network selection system topology for the networks Alexnet (A), GoogLeNet (G) and Resnet (R). Green γ blocks denote the selection policy. The policy evaluates Alexnet, receives confidence feedback and decides either to jump directly to Resnet or to send the sample to the GoogLeNet->Resnet cascade. (Right) An example early exit system topology (based on Alexnet). The policy chooses one of the multiple exits available to it at each stage for feedback. If the sample is easy enough, the system sends it down to an exit; otherwise it sends the sample to the next layer.

3. Adaptive Early Exit Networks

Our first approach to reducing the test-time cost of deep neural networks is an early exit strategy. We first frame a global objective function, and reduce policy training for optimizing the system-wide risk to layer-by-layer weighted binary classification (WBC). We denote a labeled example as (x, y) ∈ R^d × {1, ..., L}, where d is the dimension of the data and {1, ..., L} is the set of classes represented in the data. We define the distribution generating the examples as 𝒳 × 𝒴. For a predicted label ŷ, we denote the loss L(ŷ, y).
In this paper we focus on the task of classification and, for exposition, focus in this section on the indicator loss L(ŷ, y) = 1[ŷ ≠ y]. In practice we upper bound the indicator functions with a logistic loss for computational efficiency.

As a running DNN example, we consider the AlexNet architecture (Krizhevsky et al., 2012), which is composed of 5 convolutional layers followed by 3 fully connected layers. During evaluation of the network, computing each convolutional layer takes more than 3 times longer than computing a fully connected layer, so we consider a system that allows an example to exit the network after each of the first 4 convolutional layers. Let ŷ(x) denote the label predicted by the network for example x, and assume that computing this prediction takes a constant time of T. Moreover, let σ_k(x) denote the output of the k-th convolutional layer for example x, and let t_k denote the time it takes to compute this value (from the time that x is fed to the input layer). Finally, let ŷ_k(x) be the predicted label if we exit after the k-th layer.

After computing the k-th convolutional layer, we introduce a decision function γ_k that determines whether the example should exit the network with a label of ŷ_k(x) or proceed to the next layer for further evaluation. The input to this decision function is the output of the corresponding convolutional layer σ_k(x), and the value of γ_k(σ_k(x)) is either −1 (indicating an early exit) or 1. This architecture is depicted on the right-hand side of Fig. 2.

Globally, our goal is to minimize the evaluation time of the network such that the error rate of the adaptive system is no more than some user-chosen value B greater than that of the full network:

\min_{\gamma_1,\ldots,\gamma_4} \; \mathbb{E}_{x \sim \mathcal{X}} \left[ T_{\gamma_1,\ldots,\gamma_4}(x) \right] \quad (1)

s.t.
\mathbb{E}_{(x,y) \sim \mathcal{X} \times \mathcal{Y}} \left[ \left( L(\hat{y}_{\gamma_1,\ldots,\gamma_4}(x), y) - L(\hat{y}(x), y) \right)_+ \right] \le B.

Here, T_{γ_1,...,γ_4}(x) is the prediction time for example x under the adaptive system, and ŷ_{γ_1,...,γ_4}(x) is the label the adaptive system predicts for example x. In practice, the time required to predict a label and the excess loss introduced by the adaptive system can be defined recursively. As in (Wang et al., 2015), we can reduce early exit policy training for minimizing the global risk to a WBC problem. The key idea is that, for each input, the policy must identify whether the future reward (expected future accuracy minus computational loss) outweighs the current-stage accuracy.

To this end, we first focus on the problem of learning the decision function γ_4, which determines whether an example should exit after the fourth convolutional layer or be classified using the entire network. The time it takes to predict the label of example x depends on this decision and can be written as

T_4(x, \gamma_4) = \begin{cases} T + \tau(\gamma_4) & \text{if } \gamma_4(\sigma_4(x)) = 1 \\ t_4 + \tau(\gamma_4) & \text{otherwise,} \end{cases} \quad (2)

where τ(γ_4) is the computational time required to evaluate the function γ_4. Our goal is to learn a system that trades off the evaluation time and the induced error:

\underset{\gamma_4 \in \Gamma}{\operatorname{argmin}} \; \mathbb{E}_{x \sim \mathcal{X}} \left[ T_4(x, \gamma_4) \right] + \lambda \, \mathbb{E}_{(x,y) \sim \mathcal{X} \times \mathcal{Y}} \left[ \left( L(\hat{y}_4(x), y) - L(\hat{y}(x), y) \right)_+ \mathbb{1}_{\gamma_4(\sigma_4(x)) = -1} \right], \quad (3)

where (z)_+ = max(z, 0) and λ ∈ R_+ is a trade-off parameter that balances evaluation time against error.
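As a concrete illustration of this evaluation scheme, the following minimal Python sketch walks an example through the layers, letting each policy γ_k divert it to a cheap exit classifier. The layers, exit heads, and policies below are toy stand-ins, not the paper's implementation:

```python
def adaptive_forward(x, layers, exit_heads, policies, full_head):
    """Evaluate a network with early exits.

    layers:     list of K layer functions sigma_k
    exit_heads: list of K cheap classifiers producing y_hat_k
    policies:   list of K decision functions gamma_k returning -1 (exit) or +1
    full_head:  classifier applied after all layers (the full network)
    """
    h = x
    for sigma_k, head_k, gamma_k in zip(layers, exit_heads, policies):
        h = sigma_k(h)               # pay t_k for this layer
        if gamma_k(h) == -1:         # policy judges the example "easy"
            return head_k(h)         # exit with y_hat_k(x)
    return full_head(h)              # otherwise pay the full cost T

# Toy usage: one layer; exit whenever the intermediate value is large.
layers = [lambda v: v * 2]
exit_heads = [lambda v: "easy"]
policies = [lambda v: -1 if v > 10 else 1]
full_head = lambda v: "hard"

print(adaptive_forward(8, layers, exit_heads, policies, full_head))  # easy
print(adaptive_forward(3, layers, exit_heads, policies, full_head))  # hard
```

In the real system, each γ_k is a learned function of σ_4(x)-style confidence features (Sections 3 and 5) rather than a hand-set threshold.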
Note that the function T_4(x, γ_4) can be expressed as a sum of indicator functions:

T_4(x, \gamma_4) = (T + \tau(\gamma_4)) \mathbb{1}_{\gamma_4(\sigma_4(x)) = 1} + (t_4 + \tau(\gamma_4)) \mathbb{1}_{\gamma_4(\sigma_4(x)) = -1} = T \mathbb{1}_{\gamma_4(\sigma_4(x)) = 1} + t_4 \mathbb{1}_{\gamma_4(\sigma_4(x)) = -1} + \tau(\gamma_4).

Substituting for T_4(x, γ_4) allows us to reduce the problem to an importance weighted binary learning problem:

\underset{\gamma_4 \in \Gamma}{\operatorname{argmin}} \; \mathbb{E}_{(x,y) \sim \mathcal{X} \times \mathcal{Y}} \left[ C_4(x, y) \mathbb{1}_{\gamma_4(\sigma_4(x)) \neq \beta_4(x)} \right] + \tau(\gamma_4), \quad (4)

where β_4(x) and C_4(x, y) are the optimal decision and cost at stage 4 for the example (x, y), defined as

\beta_4(x) = \begin{cases} -1 & \text{if } T > t_4 + \lambda \left( L(\hat{y}_4(x), y) - L(\hat{y}(x), y) \right)_+ \\ 1 & \text{otherwise} \end{cases}

and

C_4(x, y) = \left| T - t_4 - \lambda \left( L(\hat{y}_4(x), y) - L(\hat{y}(x), y) \right)_+ \right|.

Note that the regularization term τ(γ_4) is important both for choosing the optimal functional form of γ_4 and as a natural mechanism for defining the structure of the early exit system. Rather than limiting the family Γ to a single functional form, such as linear functions or a specific network architecture, we assume Γ is the union of multiple functional families, notably including the constant decision function γ_4(x) = 1 for all x ∈ 𝒳. Although this constant function does not allow adaptive network evaluation at that specific location, it also does not introduce any computational overhead, that is, τ(γ_4) = 0. By including this constant function in Γ, we guarantee that our technique can only decrease the test-time cost.

Empirically, we find that the most effective policies operate on classifier confidences such as classification entropy.
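The reduction to weighted binary classification can be made concrete in a few lines: for each training example we compare the cost of continuing (the full-network time T) against the cost of exiting (t_4 plus the penalized excess loss), label the cheaper action, and weight the example by the magnitude of the gap. The timings below are invented for illustration:

```python
def exit_pseudolabel_and_weight(T, t_k, lam, loss_exit, loss_full):
    """Stage-k exit decision as weighted binary classification.

    Continuing costs the full-network time T; exiting costs t_k plus
    lam times the excess loss (L(y_hat_k) - L(y_hat))_+.  The pseudo-label
    beta is the cheaper action (-1 = exit, +1 = continue); the importance
    weight C is the magnitude of the cost gap between the two actions.
    """
    excess = max(loss_exit - loss_full, 0.0)   # (L(y_hat_k) - L(y_hat))_+
    cost_continue = T
    cost_exit = t_k + lam * excess
    beta = -1 if cost_continue > cost_exit else 1
    weight = abs(cost_continue - cost_exit)
    return beta, weight

# Easy example: the exit head is already correct, so exiting saves T - t_k.
print(exit_pseudolabel_and_weight(T=10.0, t_k=4.0, lam=10.0,
                                  loss_exit=0.0, loss_full=0.0))  # (-1, 6.0)
# Hard example: the penalized excess loss outweighs the time saved.
print(exit_pseudolabel_and_weight(T=10.0, t_k=4.0, lam=10.0,
                                  loss_exit=1.0, loss_full=0.0))  # (1, 4.0)
```

The weighted examples (σ_4(x), β_4(x), C_4(x, y)) can then be fed to any importance-weighted binary classifier.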
Specifically, we consider the family of functions Γ as the union of three functional families: the aforementioned constant functions, linear classifiers on confidence features generated from linear classifiers applied to σ_4(x), and linear classifiers on confidence features generated from deep classifiers applied to σ_4(x). Rather than optimizing jointly over all three families, we leverage the fact that the optimal solution to Eqn. (4) can be found by optimizing over each of the three families independently. For each family, the policy evaluation time τ(γ_4) is constant, so solving (4) over a single family is equivalent to solving an unregularized learning problem. We exploit this by solving the three unregularized learning problems and taking the minimum over the three solutions.

In order to learn the sequence of decision functions, we consider a bottom-up training scheme, as previously proposed for sensor selection (Wang et al., 2015). In this scheme, we learn the deepest (in time) early exit block first and fix its outputs. Fixing the outputs of this trained function, we then train the early exit function immediately preceding the deepest early exit function (γ_3 in Fig. 2). For a general early exit system, we recursively define the future time, T_k(x, γ_k), and the future predicted label, ỹ_k(x, γ_k), as

T_k(x, \gamma_k) = \begin{cases} T + \tau(\gamma_k) & \text{if } \gamma_k(\sigma_k(x)) = 1, \; k = K \\ T_{k+1}(x, \gamma_{k+1}) + \tau(\gamma_k) & \text{if } \gamma_k(\sigma_k(x)) = 1, \; k < K \\ t_k + \tau(\gamma_k) & \text{otherwise} \end{cases}

and

\tilde{y}_k(x, \gamma_k) = \begin{cases} \hat{y}(x) & \text{if } k = K + 1 \\ \hat{y}(x) & \text{if } k = K \text{ and } \gamma_k(\sigma_k(x)) = 1 \\ \tilde{y}_{k+1}(x, \gamma_{k+1}) & \text{if } k < K \text{ and } \gamma_k(\sigma_k(x)) = 1 \\ \hat{y}_k(x) & \text{otherwise.} \end{cases}

Using these definitions, we can generalize Eqn. (4).
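The recursive definitions of T_k and ỹ_k transcribe directly into code once the downstream policies are fixed. The following sketch uses toy integer timings and placeholder labels, not the paper's code:

```python
def future_time_and_label(k, x, K, gammas, t, tau, T, sigma, y_exit, y_full):
    """Compute T_k(x) and tilde_y_k(x) given fixed downstream policies
    gamma_{k+1}, ..., gamma_K (stages are 1-indexed)."""
    if k > K:                               # past the last exit: full network
        return T, y_full(x)
    if gammas[k](sigma(k, x)) == -1:        # policy exits here
        return t[k] + tau[k], y_exit(k, x)
    later_time, later_label = future_time_and_label(
        k + 1, x, K, gammas, t, tau, T, sigma, y_exit, y_full)
    return later_time + tau[k], later_label  # continue: add tau(gamma_k)

# Toy system with K = 2 exits: gamma_1 continues, gamma_2 exits.
gammas = {1: lambda s: 1, 2: lambda s: -1}
t, tau, T = {1: 2, 2: 5}, {1: 1, 2: 1}, 10
time, label = future_time_and_label(1, "x", 2, gammas, t, tau, T,
                                    sigma=lambda k, x: x,
                                    y_exit=lambda k, x: f"exit-{k}",
                                    y_full=lambda x: "full")
print(time, label)  # 7 exit-2  (t_2 + tau_2, plus tau_1 for the skipped stage)
```

In training, these quantities supply the pseudo-labels and weights for stage k once stages k+1, ..., K are fixed.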
For a system with K early exit functions, the k-th early exit function can be trained by solving the supervised learning problem

\underset{\gamma_k \in \Gamma}{\operatorname{argmin}} \; \mathbb{E}_{(x,y) \sim \mathcal{X} \times \mathcal{Y}} \left[ C_k(x, y) \mathbb{1}_{\gamma_k(\sigma_k(x)) \neq \beta_k(x)} \right] + \tau(\gamma_k), \quad (5)

where the optimal decision β_k(x) and cost C_k(x, y) are defined as

\beta_k(x) = \begin{cases} -1 & \text{if } k < K \text{ and } T_{k+1}(x, \gamma_{k+1}) \ge t_k + \lambda \left( L(\hat{y}_k(x), y) - L(\tilde{y}_{k+1}(x), y) \right)_+ \\ -1 & \text{if } k = K \text{ and } T \ge t_k + \lambda \left( L(\hat{y}_k(x), y) - L(\hat{y}(x), y) \right)_+ \\ 1 & \text{otherwise} \end{cases}

and

C_k(x, y) = \begin{cases} \left| T_{k+1}(x, \gamma_{k+1}) - t_k - \lambda \left( L(\hat{y}_k(x), y) - L(\tilde{y}_{k+1}(x), y) \right)_+ \right| & \text{if } k < K \\ \left| T - t_k - \lambda \left( L(\hat{y}_k(x), y) - L(\hat{y}(x), y) \right)_+ \right| & \text{otherwise.} \end{cases}

Eqn. (5) allows for efficient training of an early exit system by sequentially training early exit decision functions from the bottom of the network upwards. Furthermore, by including constant functions in the family Γ and training early exit functions at all potential stages of the system, the early exit architecture can be discovered naturally. Finally, in the case of a single option at each exit, the layer-wise learning scheme is equivalent to jointly optimizing all the exits with respect to the full system risk.

4. Network Selection

As shown in Fig. 1, computational time has grown dramatically with respect to classification performance. Rather than attempting to reduce the complexity of state-of-the-art networks, we instead leverage this non-linear growth by extending the early exit strategy to the regime of network selection. Conceptually, we seek to exploit the fact that many examples are correctly classified by relatively efficient networks such as Alexnet and GoogLeNet, whereas only a small fraction of examples are correctly classified by computationally expensive networks such as Resnet152 yet incorrectly classified by GoogLeNet and Alexnet.
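Before formalizing the selection policies, the intended test-time behavior can be sketched as a simple cascade: evaluate the cheapest network first and escalate an example only when the routing policy is not satisfied with the current prediction. All names, timings, and the confidence threshold below are invented for illustration; in the paper the routing decisions are learned rather than fixed thresholds:

```python
def cascade_predict(x, models, times, routers):
    """Evaluate models along the selection graph until a router keeps the
    current prediction.  models: name -> predict fn; routers: name ->
    function of the prediction returning either the current name (keep)
    or the next model to try.  Returns (prediction, total elapsed time)."""
    name, elapsed = "A", 0
    while True:
        pred = models[name](x)
        elapsed += times[name]
        nxt = routers.get(name, lambda p: name)(pred)  # last model always exits
        if nxt == name:
            return pred, elapsed
        name = nxt

models = {
    "A": lambda x: ("cat", 0.9) if x == "easy" else ("cat", 0.3),
    "R": lambda x: ("dog", 0.99),
}
times = {"A": 2, "R": 9}
# Escalate to the expensive model only when confidence is low (assumed rule).
routers = {"A": lambda pred: "A" if pred[1] >= 0.5 else "R"}

print(cascade_predict("easy", models, times, routers))  # (('cat', 0.9), 2)
print(cascade_predict("hard", models, times, routers))  # (('dog', 0.99), 11)
```

The average cost of such a cascade depends on what fraction of examples exit at the cheap model, which is exactly what the learned policies trade off against accuracy.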
As an example, assume we have three pre-trained networks, N_1, N_2, and N_3. For an example x, denote the predictions of the networks as N_1(x), N_2(x), and N_3(x). Additionally, denote the evaluation times of the networks as τ(N_1), τ(N_2), and τ(N_3).

As in Fig. 2, the adaptive system is composed of two decision functions that determine which network is evaluated for each example. First, κ_1 : 𝒳 → {N_1, N_2, N_3} is applied after evaluation of N_1 to determine whether the classification decision from N_1 should be returned, or whether network N_2 or network N_3 should be evaluated for the example. For examples that are evaluated on N_2, κ_2 : 𝒳 → {N_2, N_3} determines whether the classification decision from N_2 should be returned or whether network N_3 should be evaluated.

Our goal is to learn the functions κ_1 and κ_2 that minimize the average evaluation time subject to a constraint on the average loss induced by adaptive network selection. As in the adaptive early exit case, we first learn κ_2 to trade off average evaluation time against induced error:

\min_{\kappa_2 \in \Gamma} \; \mathbb{E}_{x \sim \mathcal{X}} \left[ \tau(N_3) \mathbb{1}_{\kappa_2(x) = N_3} + \tau(\kappa_2) \right] + \lambda \, \mathbb{E}_{(x,y) \sim \mathcal{X} \times \mathcal{Y}} \left[ \left( L(N_2(x), y) - L(N_3(x), y) \right)_+ \mathbb{1}_{\kappa_2(x) = N_2} \right], \quad (6)

where λ ∈ R_+ is a trade-off parameter. As in the adaptive network usage case, this problem can be posed as an importance weighted supervised learning problem:

\min_{\kappa_2 \in \Gamma} \; \mathbb{E}_{(x,y) \sim \mathcal{X} \times \mathcal{Y}} \left[ W_2(x, y) \mathbb{1}_{\kappa_2(x) \neq \theta_2(x)} \right] + \tau(\kappa_2), \quad (7)

where θ_2(x) and W_2(x, y) are the optimal decision and cost at the second stage for the example/label pair (x, y), defined as

\theta_2(x) = \begin{cases} N_2 & \text{if } \tau(N_3) > \lambda \left( L(N_2(x), y) - L(N_3(x), y) \right)_+ \\ N_3 & \text{otherwise} \end{cases}

and

W_2(x, y) = \left| \tau(N_3) - \lambda \left( L(N_2(x), y) - L(N_3(x), y) \right)_+ \right|.

Once κ_2 has been trained according to Eqn.
(7), the time for examples that pass through N_2 and are routed by κ_2 can be defined as T_{κ_2}(x) = τ(N_2) + τ(κ_2) + τ(N_3) 1_{κ_2(x) = N_3}. As in the adaptive early exit case, we train and fix the last decision function, κ_2, then train the earlier function, κ_1. As before, we seek to trade off evaluation time against error:

\min_{\kappa_1 \in \Gamma} \; \mathbb{E}_{x \sim \mathcal{X}} \left[ \tau(N_3) \mathbb{1}_{\kappa_1(x) = N_3} + \tau(N_2) \mathbb{1}_{\kappa_1(x) = N_2} + \tau(\kappa_1) \right] + \lambda \, \mathbb{E}_{(x,y) \sim \mathcal{X} \times \mathcal{Y}} \left[ \left( L(N_2(x), y) - L(N_3(x), y) \right)_+ \mathbb{1}_{\kappa_1(x) = N_2} + \left( L(N_1(x), y) - L(N_3(x), y) \right)_+ \mathbb{1}_{\kappa_1(x) = N_1} \right]. \quad (8)

This can be reduced to a cost-sensitive learning problem:

\min_{\kappa_1 \in \Gamma} \; \mathbb{E}_{(x,y) \sim \mathcal{X} \times \mathcal{Y}} \left[ R_3(x, y) \mathbb{1}_{\kappa_1(x) = N_3} + R_2(x, y) \mathbb{1}_{\kappa_1(x) = N_2} + R_1(x, y) \mathbb{1}_{\kappa_1(x) = N_1} \right] + \tau(\kappa_1), \quad (9)

where the costs are defined as

R_1(x, y) = \left( L(N_1(x), y) - L(N_3(x), y) \right)_+
R_2(x, y) = \left( L(N_2(x), y) - L(N_3(x), y) \right)_+ + \tau(N_2)
R_3(x, y) = \tau(N_3).

Algorithm 1: Adaptive Network Learning Pseudocode

Input: Data (x_i, y_i), i = 1, ..., n; models S; routes E; model costs τ(·)
while ∃ an untrained policy π do
  (1) Choose the deepest policy decision j such that all downstream policies are trained
  for each example i ∈ {1, ..., n} do
    (2) Construct the weight vector w_i of costs per action from Eqn. (7)
  end for
  (3) π_j ← learn a classifier on ((x_1, w_1), ..., (x_n, w_n))
  (4) Evaluate π_j and update the route costs to model j: C(x_i, y_i, s_n, s_j) ← w_i^j(π_j(x_i)) + C(x_i, y_i, s_n, s_j)
end while
(5) Prune from the collection any model to which the policy routes no examples
Output: Policy functions π_1, ..., π_K

5. Experimental Section

We evaluate our method on the Imagenet 2012 classification dataset (Russakovsky et al., 2015), which has 1000 object classes.
We train using the 1.28 million training images and evaluate the system using the 50k validation images. We use the pre-trained models from the Caffe Model Zoo for Alexnet, GoogLeNet and Resnet50 (Krizhevsky et al., 2012; Szegedy et al., 2015; He et al., 2016). For preprocessing we follow the same routines proposed for these networks and verify the final network performances to within a small margin (<0.1%). Note that it is common to use ensembles of networks and multiple crops to achieve maximum performance. These methods add minimal gain in accuracy while increasing the system cost dramatically. As the speedup margin increases, it becomes trivial for the policy to show significant speedups within the same accuracy tolerance. We believe such speedups are not useful in practice, and we therefore focus on the single-crop, single-model case.

Temporal measurements: We measure network times using the built-in tool in the Caffe library on a server with an Nvidia Titan X Pascal and CuDNN 5. Since our focus is on the computational cost of the networks, we ignore data loading and preprocessing times. The reported times are actual measurements, including the policy overhead.

Policy form and meta-features: In addition to the outputs of the convolutional layers of earlier networks, we augment the feature space with the entropy of the prediction probabilities. We relax the indicators in equations (5) and (9) and learn a linear logistic regression model on these features for our policy. We experimented with pooled internal representations, but in practice, a simple linear policy that includes the entropy feature significantly outperforms more complex policy functions that exclude it.

5.1. Network Selection

Baselines: Our full system, depicted in Figure 2, starts with Alexnet.
Following the evaluation of Alexnet, the system determines for each example whether to return the prediction, route the example to GoogLeNet, or route the example to Resnet50. For examples that are routed to GoogLeNet, the system either returns the prediction output by GoogLeNet or routes the example to Resnet50. As baselines, we compare against a uniform policy and a myopic policy that learns a single threshold based on model confidence. We also report performance for different system topologies.

To provide a bound on the achievable performance, we show the performance of a soft oracle. The soft oracle has access to the classification labels and sends each example to the fastest model that correctly classifies it. Since having access to the labels is too strong, we soften the oracle with two constraints. First, it follows the same network topology and cannot make decisions without first observing the model feedback, so it incurs the same overhead. Second, it can only exit at a cheaper model if all later models agree on the true label. This second constraint is added because our goal is not to improve the prediction performance of the system but to reduce computational time, and we therefore prevent the oracle from "correcting" mistakes made by the most complex networks.

We sweep the cost trade-off parameter in the range 0.0 to 0.1 to achieve different budget points. Note that, due to the weights in our cost formulation, policy behavior can differ even when the pseudo-labels are identical. Conceptually, the weights balance the importance of samples that gain in classification loss in future stages against samples that gain in computational savings by exiting at early stages.

The results are shown in Figure 3. We see that both the full tree and the a->g->r50 cascade achieve a significant (2.8x) speedup over using Resnet50 while maintaining its accuracy within 1%.
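The soft oracle's second constraint (exit early only when all later models agree on the true label) can be written down directly; the per-model correctness flags below are hypothetical:

```python
def soft_oracle_exit(correct):
    """Index of the model where the soft oracle exits.

    correct[k] is True iff model k (ordered cheapest to most expensive)
    classifies the example correctly.  The oracle may stop at model k only
    if model k and every later model are correct; otherwise it falls
    through to the next model."""
    for k in range(len(correct)):
        if all(correct[k:]):
            return k
    return len(correct) - 1  # no model is correct: still pay for the last one

print(soft_oracle_exit([True, True, True]))    # 0: exit at the cheapest model
print(soft_oracle_exit([True, False, True]))   # 2: model 1 disagrees, no early exit
print(soft_oracle_exit([False, False, False])) # 2
```

The middle case shows why the oracle is "soft": even though the cheapest model happens to be correct, a later disagreement forbids the early exit, so the oracle never beats the accuracy of the most complex network.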
Figure 3. Performance of the network selection policy on Imagenet (left: top-5 error; right: top-1 error). Our full adaptive system (denoted with blue dots) significantly outperforms any individual network for almost all budget regions and is close to the performance of the oracle. The performances are reported on the validation set of the ImageNet dataset.

Figure 4. (Left) Different network selection topologies that we considered. Arrows denote possible jumps allowed to the policy. A, G and R denote Alexnet, GoogLeNet and Resnet50, respectively. (Right) Statistics for the proportion of total time spent on different networks and the proportion of samples that exit at each network. The top row is sampled at 2.0 ms and the bottom row at 2.8 ms of system evaluation time.

The classifier feedback for the policy has a dramatic impact on its performance. Although Alexnet introduces much less overhead than GoogLeNet (≈0.2 ms vs. ≈0.7 ms), the a->r50 policy performs significantly worse in low-budget regions. Our full tree policy learns to choose the best order for all budget regions. Furthermore, the policy matches the soft oracle performance in both the high- and low-budget regions.

Note that GoogLeNet is very well positioned at a 0.7 ms per-image budget, probably due to its efficiency-oriented architectural design with inception blocks (Szegedy et al., 2015). In low-budget regions, the overhead of the policy is a detriment: even when it learns to send almost half the samples to Alexnet instead of GoogLeNet with marginal loss in accuracy, the extra 0.23 ms of Alexnet overhead brings the balance point, ≈0.65 ms, very close to using only GoogLeNet at 0.7 ms. The ratio between network evaluation times is a significant factor for our system.
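This balance-point arithmetic can be checked with a one-line cost model. Only the ≈0.23 ms Alexnet overhead and ≈0.7 ms GoogLeNet time are quoted from the text; the 0.9 ms Resnet50 cost and the 50% exit fraction below are assumed values for illustration:

```python
def expected_cascade_time(tau_cheap, tau_expensive, exit_fraction):
    """Average per-example time of a two-model cascade: every example pays
    for the cheap model (policy overhead folded into tau_cheap); only the
    fraction that does not exit early also pays for the expensive model."""
    return tau_cheap + (1.0 - exit_fraction) * tau_expensive

# With ~0.23 ms of Alexnet-plus-policy overhead and an assumed 0.9 ms
# Resnet50 cost, exiting half the examples early lands near 0.68 ms,
# close to simply running GoogLeNet at ~0.7 ms.
print(round(expected_cascade_time(0.23, 0.9, 0.5), 2))  # 0.68
```

The model makes the observation in the text concrete: a cascade only pays off when the cheap model's cost is small relative to the expensive model's, or when the exit fraction is high.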
Fortunately, as mentioned before, for many applications the cost ratio between models can be very high (e.g., cloud-computing upload times, or the difference between ResNet and AlexNet).

We further analyzed the network usage and runtime proportion statistics for samples at different budget levels. Fig. 4 shows the results at three different budget levels. The full tree policy avoids using GoogLeNet altogether in high budget regions. This is the expected behavior, since the a->r50 policy performs just as well in those regions and including GoogLeNet in the decision adds too much overhead. At mid-level budgets the policy distributes samples more evenly. Note that the sum of the overheads is close to the useful runtime of the cheaper networks in this region. This is possible because the earlier networks are very lightweight.

5.2. Network Early Exits

To output a prediction following each convolutional layer, we train a single-layer linear classifier after global average pooling at each layer. We added global pooling to minimize the policy overhead of the earlier exits. For ResNet50 we added an exit after the output layers of 2a, 2c, 3a, 3d, 4a, and 4f. The dimensionality of the exit features after global average pooling is 256, 256, 512, 512, 1024, and 1024, in the same order as the layer names. For GoogLeNet we added exits after the concatenated outputs of every inception layer.

Table 1 shows the early-exit performance for the different networks. The gains are more marginal than those from network selection. Fig. 5 shows the accuracy gain per unit of evaluation time at different layers. Interestingly, the accuracy gain per unit time is more linear within the same architecture than across different architectures. This explains why the adaptive policy works better for network selection than for early exits.

Network        policy top-5   uniform top-5
GoogLeNet@1    9%             2%
GoogLeNet@2    22%            9%
GoogLeNet@5    33%            20%
ResNet50@1     8%             1%
ResNet50@2     18%            12%
ResNet50@5     22%            10%

Table 1. Early-exit performance at different accuracy/budget trade-offs for different networks. @x denotes an x% loss from full-model accuracy; reported numbers are percentage speed-ups.

Figure 5. Accuracy gains at different layers for early exits, for GoogLeNet (top) and ResNet50 (bottom).

5.3. Network Error Analysis

Fig. 6 shows the distribution over examples of the networks that correctly label each example. Notably, 50% and 77% of the examples are correctly classified by all networks with respect to top-1 and top-5 error, respectively. Similarly, 18% and 5% of the examples are incorrectly classified by all networks with respect to top-1 and top-5 error, respectively. These results verify our hypothesis that for a large fraction of the data there is no need for costly networks. In particular, for the 68% and 82% of the data with no change in top-1 and top-5 error, respectively, the use of any network other than AlexNet is unnecessary and only adds computational time.

Figure 6. Analysis of top-1 and top-5 errors for the different networks. The majority of samples are easily classified by AlexNet; only a minority require deeper networks.

Additionally, it is worth noting the balance between the examples incorrectly classified by all networks (18% and 5% for top-1 and top-5 error, respectively) and the fraction of examples correctly classified by GoogLeNet or ResNet but not by AlexNet (25.1% and 15.1% for top-1 and top-5 error, respectively). This behavior supports our observation that the entropy of classification decisions is an important feature in making policy decisions, as examples likely to be misclassified by AlexNet are likely to be classified correctly by a later network.

Note that our system is trained using the same data used to train the networks.
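A concrete shape for an early exit of the kind described in Section 5.2, together with the entropy feature highlighted above, can be sketched as follows. The feature-map size, classifier weights, and entropy threshold here are illustrative stand-ins, not the trained exits from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def global_avg_pool(feat):
    """Collapse a (C, H, W) feature map to a C-dim vector, as is
    done before each early-exit classifier to keep overhead low."""
    return feat.mean(axis=(1, 2))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def prediction_entropy(p):
    """Entropy of the exit's class distribution: low entropy marks
    a confident early prediction, a key feature for the policy."""
    return float(-(p * np.log(p + 1e-12)).sum())

# Illustrative exit in the style of ResNet50's 2a exit: 256-channel
# features, 1000 ImageNet classes. W and b are random stand-ins for
# the trained single-layer linear classifier.
feat = rng.standard_normal((256, 7, 7))
W = 0.01 * rng.standard_normal((256, 1000))
b = np.zeros(1000)

p = softmax(global_avg_pool(feat) @ W + b)
H = prediction_entropy(p)
exit_early = H < 0.5 * np.log(1000)  # hypothetical confidence threshold
```

Global pooling reduces the exit's input from C·H·W to C values, so the linear head costs only a C×1000 product per exit; this is the "minimal policy overhead" motivation stated above.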
Generally, the evaluation error of each network on its training data is significantly lower than its error on test data, and therefore our system is biased towards sending examples to the more complex networks, which show negligible training error. In practice, this problem is alleviated by using validation data to train the adaptive systems. In order to maintain the reported performance of the networks without expanding the training set, we instead use the same data to train both the networks and the adaptive systems; however, we note that the performance of our adaptive systems is generally better when they are trained on data excluded from network training.

6. Conclusion

We proposed two different schemes to adaptively trade off model accuracy with model evaluation time for deep neural networks. We demonstrated that significant gains in computational time are possible through our novel policy, with negligible loss in accuracy, on the ImageNet image recognition dataset. We posed a global objective for learning an adaptive early-exit or network selection policy and solved it by reducing the policy learning problem to a layer-by-layer weighted binary classification problem. We believe that adaptivity is very important in the age of growing data for models with high variance in computational time and quality. We also showed that our method approximates an oracle-based policy that has the benefit of access to the true error of every network on each instance.

Acknowledgements

This material is based upon work supported in part by NSF Grants CCF-1320566, CNS-1330008, and CCF-1527618; by the U.S. Department of Homeland Security, Science and Technology Directorate, Office of University Programs, under Grant Award 2013-ST-061-ED0001; and by ONR contract N00014-13-C-0288.
The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the NSF, U.S. DHS, ONR, or AF.

References

Bagherinezhad, Hessam, Rastegari, Mohammad, and Farhadi, Ali. LCNN: Lookup-based convolutional neural network. arXiv preprint, 2016.

Bengio, Emmanuel, Bacon, Pierre-Luc, Pineau, Joelle, and Precup, Doina. Conditional computation in neural networks for faster models. arXiv preprint arXiv:1511.06297, 2015.

Bucila, Cristian, Caruana, Rich, and Niculescu-Mizil, Alexandru. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 535–541. ACM, 2006.

Chen, Wenlin, Wilson, James T, Tyree, Stephen, Weinberger, Kilian Q, and Chen, Yixin. Compressing neural networks with the hashing trick. In ICML, pp. 2285–2294, 2015.

Courbariaux, Matthieu, Bengio, Yoshua, and David, Jean-Pierre. BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pp. 3123–3131, 2015.

Figurnov, Michael, Collins, Maxwell D, Zhu, Yukun, Zhang, Li, Huang, Jonathan, Vetrov, Dmitry, and Salakhutdinov, Ruslan. Spatially adaptive computation time for residual networks. arXiv preprint arXiv:1612.02297, 2016a.

Figurnov, Mikhail, Ibraimova, Aizhan, Vetrov, Dmitry P, and Kohli, Pushmeet. PerforatedCNNs: Acceleration through elimination of redundant convolutions. In Advances in Neural Information Processing Systems, pp. 947–955, 2016b.

Gong, Yunchao, Liu, Liu, Yang, Ming, and Bourdev, Lubomir. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115, 2014.

He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Hinton, Geoffrey, Deng, Li, Yu, Dong, Dahl, George E, Mohamed, Abdel-rahman, Jaitly, Navdeep, Senior, Andrew, Vanhoucke, Vincent, Nguyen, Patrick, Sainath, Tara N, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.

Hinton, Geoffrey, Vinyals, Oriol, and Dean, Jeff. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

Hubara, Itay, Courbariaux, Matthieu, Soudry, Daniel, El-Yaniv, Ran, and Bengio, Yoshua. Binarized neural networks. In Advances in Neural Information Processing Systems, pp. 4107–4115, 2016a.

Hubara, Itay, Courbariaux, Matthieu, Soudry, Daniel, El-Yaniv, Ran, and Bengio, Yoshua. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061, 2016b.

Iandola, Forrest N, Han, Song, Moskewicz, Matthew W, Ashraf, Khalid, Dally, William J, and Keutzer, Kurt. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.

Kusner, M, Chen, W, Zhou, Q, Xu, Z, Weinberger, K, and Chen, Y. Feature-cost sensitive learning with submodular trees of classifiers. In Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.

Leroux, Sam, Bohez, Steven, De Coninck, Elias, Verbelen, Tim, Vankeirsbilck, Bert, Simoens, Pieter, and Dhoedt, Bart. The cascading neural network: building the internet of smart things. Knowledge and Information Systems, pp. 1–24, 2017.
Liu, Baoyuan, Wang, Min, Foroosh, Hassan, Tappen, Marshall, and Pensky, Marianna. Sparse convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 806–814, 2015.

Nan, Feng, Wang, Joseph, and Saligrama, Venkatesh. Pruning random forests for prediction on a budget. In Advances in Neural Information Processing Systems 29, pp. 2334–2342, 2016.

Odena, Augustus, Lawson, Dieterich, and Olah, Christopher. Changing model behavior at test-time using reinforcement learning. arXiv preprint arXiv:1702.07780, 2017.

Rastegari, Mohammad, Ordonez, Vicente, Redmon, Joseph, and Farhadi, Ali. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542. Springer, 2016.

Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, Andrej, Khosla, Aditya, Bernstein, Michael, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc V. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104–3112, 2014.

Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott, Anguelov, Dragomir, Erhan, Dumitru, Vanhoucke, Vincent, and Rabinovich, Andrew. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9, 2015.

Trapeznikov, K and Saligrama, V. Supervised sequential classification under budget constraints. In International Conference on Artificial Intelligence and Statistics, pp. 581–589, 2013.
Wang, J., Bolukbasi, T., Trapeznikov, K., and Saligrama, V. Model selection by linear programming. In European Conference on Computer Vision, pp. 647–662, 2014.

Wang, Joseph and Saligrama, Venkatesh. Local supervised learning through space partitioning. In Advances in Neural Information Processing Systems (NIPS), pp. 91–99, 2012.

Wang, Joseph, Trapeznikov, Kirill, and Saligrama, Venkatesh. Efficient learning by directed acyclic graph for resource constrained prediction. In Advances in Neural Information Processing Systems, pp. 2152–2160, 2015.

Wen, Wei, Wu, Chunpeng, Wang, Yandan, Chen, Yiran, and Li, Hai. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pp. 2074–2082, 2016.

Wu, Jiaxiang, Leng, Cong, Wang, Yuhang, Hu, Qinghao, and Cheng, Jian. Quantized convolutional neural networks for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4820–4828, 2016.

Xu, Z., Kusner, M., Chen, M., and Weinberger, K. Cost-sensitive tree of classifiers. In Proceedings of the 30th International Conference on Machine Learning, pp. 133–141, 2013.