HD-CNN: Hierarchical Deep Convolutional Neural Network for Large Scale Visual Recognition

HD-CNN: Hierar chical Deep Con volutional Neural Netw ork f or Large Scale V isual Recognition Zhicheng Y an † , Hao Zhang ‡ , Robinson Piramuthu ∗ , V ignesh Jagadeesh ∗ , Dennis DeCoste ∗ , W ei Di ∗ , Y izhou Y u ? † Uni versity of Illinois at Urbana-Champaign, ‡ Carnegie Mellon Uni versity ∗ eBay Research Lab, ? The Uni versity of Hong K ong Abstract In image classiﬁcation, visual separability between dif- fer ent object cate gories is highly uneven, and some cate- gories ar e more dif ﬁcult to distinguish than others. Suc h dif- ﬁcult cate gories demand mor e dedicated classiﬁers. How- ever , e xisting deep convolutional neural networks (CNN) ar e trained as ﬂat N-way classiﬁers, and few efforts have been made to le vera ge the hierar chical structur e of cate- gories. In this paper , we intr oduce hierar chical deep CNNs (HD-CNNs) by embedding deep CNNs into a category hier- ar chy . An HD-CNN separates easy classes using a coarse cate gory classiﬁer while distinguishing difﬁcult classes us- ing ﬁne category classiﬁers. During HD-CNN training, component-wise pretr aining is followed by global ﬁnetun- ing with a multinomial logistic loss r e gularized by a coarse cate gory consistency term. In addition, conditional execu- tions of ﬁne category classiﬁers and layer parameter com- pr ession make HD-CNNs scalable for lar ge-scale visual r ecognition. W e achie ve state-of-the-art r esults on both CI- F AR100 and larg e-scale ImageNet 1000-class benchmark datasets. In our experiments, we b uild up thr ee dif fer ent HD-CNNs and the y lower the top-1 err or of the standard CNNs by 2 . 65% , 3 . 1% and 1 . 1% , r espectively . 1. Introduction Deep CNNs are well suited for lar ge-scale learning based visual recognition tasks because of its highly scalable train- ing algorithm, which only needs to cache a small chunk (mini-batch) of the potentially huge volume of training data during sequential scans (epochs). They have achiev ed in- creasingly better performance in recent years. As datasets become bigger and the number of object cat- egories becomes larger , one of the complications that come along is that visual separability between different object cat- egories is highly une ven. Some categories are much harder to distinguish than others. T ake the cate gories in CIF AR100 as an e xample. It is easy to tell an Apple from a Bus , but harder to tell an Apple from an Orange . In fact, both Ap- ples and Oranges belong to the same coarse category fruit and ve getables while Buses belong to another coarse cat- egory vehicles 1 , as deﬁned within CIF AR100. Nonethe- less, most deep CNN models now adays are ﬂat N-way clas- siﬁers, which share a set of fully connected layers. This makes us wonder whether such a ﬂat structure is adequate for distinguishing all the difﬁcult categories. A very natu- ral and intuitiv e alternativ e or ganizes classiﬁers in a hier- archical manner according to the divide-and-conquer strat- egy . Although hierarchical classiﬁcation has been prov en effecti ve for con ventional linear classiﬁers [38, 8, 37, 22], few attempts ha ve been made to exploit category hierarchies [3, 29] in deep CNN models. Since deep CNN models are large models themselves, organizing them hierarchically imposes the follo wing chal- lenges. First, instead of a handcrafted category hierarchy , how can we learn such a category hierarchy from the train- ing data itself so that cascaded inferences in a hierarchical classiﬁer will not degrade the ov erall accuracy while ded- icated ﬁne category classiﬁers exist for hard-to-distinguish categories? Second, a hierarchical CNN classiﬁer consists of multiple CNN models at dif ferent le vels. How can we lev erage the commonalities among these models and effec- tiv ely train them all? Third, it would also be slo wer and more memory-consuming to run a hierarchical CNN clas- siﬁer on a novel testing image. Ho w can we alleviate such limitations? In this paper , we propose a generic and principled hierar- chical architecture, Hierar chical Deep Con volutional Neu- ral Network (HD-CNN), that decomposes an image classi- ﬁcation task into two steps. An HD-CNN ﬁrst uses a coarse category CNN classiﬁer to separate easy classes from one another . More challenging classes are routed do wnstream to ﬁne category classiﬁers that focus on confusing classes. W e adopt a module design principle and every HD-CNN is built upon a b uilding block CNN. The b uilding block can be chosen to be any of the currently top ranked single CNNs. Thus HD-CNNs can always beneﬁt from the progress of single CNN design. An HD-CNN follo ws the coarse-to-ﬁne 4321 Root Coarse category 1: {white shark, numbfish , hammerhead, stingray, … } Coarse category 2: {toy terrier, greyhound, whippet, basenji, … } Coarse category K: {mink, cougar, bear, fox squirrel, otter, … } ." ." ." (a) Shared layers Fine component independent layers F 1 ." ." ." Coarse component independent layers B coarse prediction fine prediction 1 final prediction Image low-level features Probabilistic averaging layer Fine component independent layers F K fine prediction K (b) Figure 1: (a) A two-le vel cate gory hierarchy where the classes are taken from ImageNet 1000-class dataset. (b) Hierarchical Deep Con volutional Neural Network (HD-CNN) architecture. classiﬁcation paradigm and probabilistically integrates pre- dictions from the ﬁne category classiﬁers. Compared with the building block CNN, the corresponding HD-CNN can achiev e lower error at the cost of a manageable increase in memory footprint and classiﬁcation time. In summary , this paper has the following contributions. First, we introduce a nov el hierarchical architecture, called HD-CNN, for image classiﬁcation. Second, we develop a scheme for learning the two-lev el organization of coarse and ﬁne categories, and demonstrate v arious components of an HD-CNN can be independently pretrained. The com- plete HD-CNN is further ﬁne-tuned using a multinomial logistic loss regularized by a coarse category consistency term. Third, we make the HD-CNN scalable by compress- ing the layer parameters and conditionally e xecuting the ﬁne category classiﬁers. W e have performed e valuations on the medium-scale CIF AR100 dataset and the lar ge-scale ImageNet 1000-class dataset, and our method has achie ved state-of-the-art performance on both of them. 2. Related W ork Our work is inspired by progresses in CNN design and efforts on integrating a category hierarchy with linear clas- siﬁers. The main novelty of our method is a ne w scalable HD-CNN architecture that integrates a category hierarchy with deep CNNs. 2.1. Con volutional Neural Networks CNN-based models hold state-of-the-art performance in various computer vision tasks, including image classifca- tion [18], object detection [10, 13], and image parsing [7]. Recently , there has been considerable interest in enhanc- ing CNN components, including pooling layers [36], acti- vation units [11, 28], and nonlinear layers [21]. These en- hancements either improve CNN training [36], or expand the network learning capacity . This work boosts CNN per- formance from an orthogonal angle and does not redesign a speciﬁc part within any existing CNN model. Instead, we design a no vel generic hierarchical architecture that uses an existing CNN model as a building block. W e embed multi- ple building blocks into a lar ger hierarchical deep CNN. 2.2. Category Hierarch y for V isual Recognition In visual recognition, there is a vast literature exploiting category hierarchical structures [32]. For classiﬁcation with a lar ge number of classes using linear classiﬁers, a common strategy is to build a hierarch y or taxonomy of classiﬁers so that the number of classiﬁers ev aluated giv en a testing im- age scales sub-linearly in the number of classes [2, 9]. The hierarchy can be either predeﬁned[23, 33, 16] or learnt by top-down and bottom-up approaches [25, 12, 24, 20, 1, 6, 27]. In [5], the predeﬁned category hierarchy of ImageNet dataset is utilized to achie ve the trade-of fs between classiﬁ- cation accuracy and speciﬁcity . In [22], a hierarchical label tree is constructed to probabilistically combine predictions from leaf nodes. Such hierarchical classiﬁer achiev es sig- niﬁcant speedup at the cost of certain accuracy loss. One of the earliest attempts to introduce a category hi- erarchy in CNN-based methods is reported in [29] but their main goal is transferring knowledge between classes to im- prov e the results for classes with insuf ﬁcient training e x- amples. In [3], various label relations are encoded in a hi- erarchy . Improved accuracy is achiev ed only when a sub- set of training images are relabeled with internal nodes in the hierarchical class tree. They are not able to improve the accuracy in the original setting where all training im- ages are labeled with leaf nodes. In [34], a hierarchy of CNNs is introduced b ut they experimented with only two coarse cate gories mainly due to scalability constraints. HD- CNN e xploits the cate gory hierarchy in a nov el way that we embed deep CNNs into the hierarchy in a scalable manner and achiev es superior classiﬁcation results over the standard CNN. 3. Overview of HD-CNN 3.1. Notations The following notations are used belo w . A dataset con- sists of images { x i , y i } i . x i and y i denote the image data and label, respectiv ely . There are C ﬁne categories of im- ages in the dataset { S f j } C j =1 . W e will learn a category hier- archy with K coarse categories { S c k } K k =1 . 4322 3.2. HD-CNN Architectur e HD-CNN is designed to mimic the structure of category hierarchy where ﬁne categories are divided into coarse cat- egories as in Fig 1(a). It performs end-to-end classiﬁcation as illustrated in Fig 1(b). It mainly comprises four parts, namely shared layers, a single coarse category component B , multiple ﬁne category components { F k } K k =1 and a sin- gle probabilistic av eraging layer . On the left side of Fig 1 (b) are the shared layers. They receiv e raw image pixel as input and extract lo w-level features. The conﬁguration of shared layers are set to be the same as the preceding layers in the building block net. On the top of Fig 1(b) are independent layers of coarse category component B which reuses the conﬁguration of rear layers from the building block CNN and produces a ﬁne prediction { B f ij } C j =1 for an image x i . T o produce a predic- tion { B ik } K k =1 ov er coarse categories, we append a ﬁne-to- coarse aggregation layer which aggregates ﬁne predictions into coarse ones when a mapping from ﬁne categories to coarse ones P : [1 , C ] 7→ [1 , K ] is gi ven. The coarse cat- egory probabilities serve two purposes. First, they are used as weights for combining the predictions made by ﬁne cat- egory components. Second, when thresholded, they enable conditional executions of ﬁne category components whose corresponding coarse probabilities are sufﬁciently lar ge. In the bottom-right of Fig 1 (b) are independent layers of a set of ﬁne category classiﬁers { F k } K k =1 , each of which makes ﬁne category predictions. As each ﬁne component only excels in classifying a small set of cate gories, the y pro- duce a ﬁne prediction over a partial set of categories. The probabilities of other ﬁne categories absent in the partial set are implicitly set to zero. The layer conﬁgurations are mostly copied from the building block CNN except that in the ﬁnal classiﬁcation layer the number of ﬁlters is set to be the size of partial set instead of the full categories. Both coarse category component and ﬁne category com- ponents share common layers. The reason is three-fold. First, it is shown in [35] that preceding layers in deep networks response to class-agnostic low-le vel features such as corners and edges, while rear layers extract more class- speciﬁc features such as dog face and bird’ s legs. Since low-le vel features are useful for both coarse and ﬁne classi- ﬁcation tasks, we allo w the preceding layers to be shared by both coarse and ﬁne components. Second, it reduces both the total ﬂoating point operations and the memory footprint of network execution. Both are of practical signiﬁcance to deploy HD-CNN in real applications. Last but not the least, it can decrease the number of HD-CNN parameters which is critical to the success of HD-CNN training. On the right side of Fig 1 (b) is the probabilistic av erag- ing layer which recei ves ﬁne cate gory predictions as well as coarse category prediction and produces a weighted av erage as the ﬁnal prediction. p ( x i ) = P K k =1 B ik p k ( x i ) P K k =1 B ik (1) where B ik is the probability of coarse cate gory k for the image x i predicted by the coarse cate gory component B . p k ( x i ) is the ﬁne category prediction made by the ﬁne cat- egory component F k . W e stress that both coarse and ﬁne category components reuse the layer conﬁgurations from the building block CNN. This ﬂexible modular design allo ws us to choose the best module CNN as the building block, depending on the task at hand. 4. Learning a Category Hierar chy Our goal of b uilding a category hierarchy is grouping confusing ﬁne categories into the same coarse category for which a dedicated ﬁne category classiﬁer will be trained. W e employ a top-down approach to learn the hierarchy from the training data. W e randomly sample a held-out set of images with bal- anced class distribution from the training set. The rest of the training set is used to train a building block net. W e obtain a confusion matrix F by ev aluating the net on the held-out set. A distance matrix D is deriv ed as D = 1 − F and its diagonal entries are set to be zero. D is further trans- formed by D = 0 . 5 ∗ ( D + D T ) to be symmetric. The en- try D ij measures ho w easy it is to discriminate categories i and j . Spectral clustering is performed on D to cluster ﬁne categories into K coarse cate gories. The result is a two- lev el category hierarchy representing a many-to-one map- ping P d : [1 , C ] 7→ [1 , K ] from ﬁne to coarse categories. Here, the coarse categories are disjoint. Overlapping Coarse Categories With disjoint coarse cat- egories, the o verall classiﬁcation depends heavily on the coarse category classiﬁer . If an image is routed to an in- correct ﬁne category classiﬁer , then the mistake can not be corrected as the probability of ground truth label is implic- itly set to zero there. Removing the separability constraint between coarse categories can make the HD-CNN less de- pendent on the coarse category classiﬁer . Therefore, we add more ﬁne cate gories to the coarse cat- egories. For a certain ﬁne classiﬁer F k , we prefer to add those ﬁne categories { j } that are likely to be misclassﬁed into the coarse category k . Therefore, we estimate the like- lihood u k ( j ) that an image in ﬁne category j is misclassiﬁed into a coarse category k on the held-out set. u k ( j ) = 1    S f j    X i ∈ S f j B d ik (2) B d ik is the coarse category probability which is obtained by aggregating ﬁne category probabilities { B f ij } j according to 4323 the mapping P d : B d ik = P j | P d ( j )= k B f ij . W e threshold the likelihood u k ( j ) using a parametric variable u t = ( γ K ) − 1 and add to the partial set S c k all ﬁne categories { j } such that u k ( j ) ≥ u t . Note that each branching component gives a full set prediction when u t = 0 and a disjoint set prediction when u t = 1 . 0 . W ith overlapping coarse categories, the cat- egory hierarchy mapping P d is extended to be a many-to- many mapping P o and the coarse category predictions are updated accordingly B o ik = P j | k ∈ P o ( j ) B ij . Note the sum of { B o ik } K k =1 exceeds 1 and hence we perform L 1 normal- ization. The use of overlapping coarse categories was also shown to be useful for hierarchical linear classiﬁers [24]. 5. HD-CNN T raining As we embed ﬁne category components into HD-CNN, the number of parameters in rear layers gro ws linearly in the number of coarse categories. Given the same amount of training data, this increases the training complexity and the risk of over -ﬁtting. On the other hand, the training im- ages within the stochastic gradient descent mini-batch are probabilistically routed to different ﬁne category compo- nents. It requires larger mini-batch to ensure parameter gra- dients in the ﬁne category components are estimated by a sufﬁciently large number of training samples. Large train- ing mini-batch both increases the training memory footprint and slo ws do wn the training process. Therefore, we decom- pose the HD-CNN training into multiple steps instead of training the complete HD-CNN from scratch as outlined in Algorithm 1. 5.1. Pretraining HD-CNN W e sequentially pretrain the coarse category component and ﬁne category components. 5.1.1 Initializing the Coarse Category Component W e ﬁrst pretrain a building block CNN F p using the training set. As both the preceding and rear layers in coarse cate gory component resemble the layers in the building block CNN, we copy the weights of F p into coarse category component for initialization purpose. 5.1.2 Pretraining the Rear Layers of Fine Category Components Fine category components { F k } k can be independently pretrained in parallel. Each F k should specialize in classi- fying ﬁne categories within the coarse category S c k . There- fore, the pretraining of each F k only uses images { x i | i ∈ S c k } from the coarse category S c k . The shared preceding lay- ers are already initialized and kept ﬁxed in this stage. For Algorithm 1 HD-CNN training algorithm 1: procedur e H D - C N N T R A I N I N G 2: Step 1: Pretrain HD-CNN 3: Step 1.1: Initialize coarse category component 4: Step 1.2: Pretrain ﬁne category components 5: Step 2: Fine-tune the complete HD-CNN each F k , we initialize all the rear layers e xcept the last con- volutional layer by copying the learned parameters from the pretrained model F p . 5.2. Fine-tuning HD-CNN After both coarse category component and ﬁne category components are properly pretrained, we ﬁne-tune the com- plete HD-CNN. Once the category hierarchy as well as the associated mapping P o is learnt, each ﬁne category compo- nent focuses on classifying a ﬁx ed subset of ﬁne categories. During ﬁne-tuning, the semantics of coarse categories pre- dicted by the coarse category component should be kept consistent with those associated with ﬁne category compo- nents. Thus we add a coarse cate gory consistency term to regularize the con ventional multinomial logistic loss. Coarse category consistency The learnt ﬁne-to-coarse cat- egory mapping P : [1 , C ] 7→ [1 , K ] provides a way to spec- ify the target coarse category distribution { t k } . Speciﬁcally , t k is set to be the fraction of all the training images within the coarse category S c k under the assumption the distribution ov er coarse categories across the training dataset is close to that within a training mini-batch. t k = P j | k ∈ P ( j ) | S j | P K k 0 =1 P j | k 0 ∈ P ( j ) | S j | ∀ k ∈ [1 , K ] (3) The ﬁnal loss function we use for ﬁne-tuning the HD- CNN is shown belo w . E = − 1 n n X i =1 log ( p y i ) + λ 2 K X k =1 ( t k − 1 n n X i =1 B ik ) 2 (4) where n is the size of training mini-batch. λ is a re gulariza- tion constant and is set to λ = 20 . 6. HD-CNN T esting As we add ﬁne category components into the HD-CNN, the number of parameters, memory footprint and execution time in rear layers, all scale linearly in the number of coarse categories. T o ensure HD-CNN is scalable for large-scale visual recognition, we de velop conditional execution and layer parameter compression techniques. Conditional Execution . At test time, for a giv en image, it is not necessary to e valuate all ﬁne cate gory classiﬁers as most of them ha ve insigniﬁcant weights B ik as in Eqn 1. 4324 Their contrib utions to the ﬁnal prediction are negligible. Conditional executions of the top weighted ﬁne components can accelerate the HD-CNN classiﬁcation. Therefore, we threshold B ik using a parametric variable B t = ( β K ) − 1 and reset B ik to zero when B ik < B t . Those ﬁne category classiﬁers with B ik = 0 are not ev aluated. Parameter Compression . In HD-CNN, the number of pa- rameters in rear layers of ﬁne category classiﬁers grows lin- early in the number of coarse categories. Thus we compress the layer parameters at test time to reduce memory foot- print. Speciﬁcally , we choose the Product Quantization ap- proach [14] to compress the parameter matrix W ∈ R m × n by ﬁrst partitioning it horizontally into se gments of width s . W = [ W 1 , ..., W ( n/s ) ] . Then k -means clustering is used to cluster the ro ws in W i , ∀ i ∈ [1 , ( n/s )] . By only storing the nearest cluster indices in a 8-bit inte ger ma- trix I ∈ R m × ( n/s ) and cluster centers in a single-precision ﬂoating number matrix C ∈ R k × n , we can achiev e a com- pression factor (32 mn ) / (32 kn + 8 mn/s ) . The hyperpa- rameters for parameter compression are ( s, k ) . 7. Experiments 7.1. Overview W e ev aluate HD-CNN on the benchmark datasets CI- F AR100 [17] and ImageNet [4]. HD-CNN is implemented on the widely deployed Caffe [15] software. The network is trained by back propagation [18]. W e run all the testing experiments on a single NVIDIA T esla K40c card. 7.2. CIF AR100 Dataset The CIF AR100 dataset consists of 100 classes of natu- ral images. There are 50K training images and 10K testing images. W e follow [11] to preprocess the datasets (e.g. global contrast normalization and ZCA whitening). Ran- domly cropped and ﬂipped image patches of size 26 × 26 are used for training. W e adopt a NIN network 1 with three stacked layers [21]. W e denote it as CIF AR100-NIN which will be the HD-CNN building block. Fine category com- ponents share preceding layers from con v1 to pool1 which accounts for 6% of the total parameters and 29% of the total ﬂoating point operations. The remaining layers are used as independent layers. For building the category hierarchy , we randomly choose 10K images from the training set as held-out set. Fine cat- egories within the same coarse categories are visually more similar . W e pretrain the rear layers of ﬁne category compo- nents. The initial learning rate is 0 . 01 and it is decreased by a factor of 10 e very 6K iterations. Fine-tuning is performed for 20K iterations with large mini-batches of size 256. The 1 https://github.com/mavenlin/cuda- convnet/blob/ master/NIN/cifar- 100_def T able 1: 10-view testing errors on CIF AR100 dataset. No- tation CCC =coarse category consistenc y . Method Error Model av eraging ( 2 CIF AR100-NIN nets) 35 . 13 DSN [19] 34 . 68 CIF AR100-NIN-double 34 . 26 dasNet [30] 33 . 78 Base: CIF AR100-NIN 35 . 27 HD-CNN, no ﬁnetuning 33 . 33 HD-CNN, ﬁnetuning w/o CCC 33 . 21 HD-CNN, ﬁnetuning w/ CCC 32 . 62 initial learning rate is 0 . 001 and is reduced by a f actor of 10 once after 10K iterations. For ev aluation, we use 10-vie w testing [18]. W e ex- tract ﬁve 26 × 26 patches (the 4 corner patches and the center patch) as well as their horizontal reﬂections and av- erage their predictions. The CIF AR100-NIN net obtains 35 . 27% testing error . Our HD-CNN achie ves testing error of 32 . 62% which improves the building block net by 2 . 65% . Category hierarch y . During the construction of the cate- gory hierarchy , the number of coarse categories can be ad- justed by the clustering algorithm. W e can also make the coarse categories either disjoint or o verlapping by varying the hyperparameter γ . Thus we in vestigate their impacts on the classiﬁcation error . W e experiment with 5, 9, 14 and 19 coarse categories and v ary the value of γ . The best results are obtained with 9 overlapping coarse categories and γ = 5 as shown in Fig 2 left. A histogram of ﬁne category occur- rences in 9 o verlapping coarse categories is sho wn in Fig 2 right. The optimal v alue of coarse category number and hyperparameter γ are dataset dependent, mainly af fected by the inherent hierarchy within the categories. number of coarse categories 5 1 01 52 0 33 33.5 34 34.5 35 35.5 Disjoint coarse categories Overlapping coarse categories, γ = 5 Overlapping coarse categories, γ = 10 0 1 2 3 4 5 6 7 8 fine category occur r e nces 0 5 10 15 20 25 Figure 2: Left : HD-CNN 10-view testing error against the number of coarse categories on CIF AR100 dataset. Right : Histogram of ﬁne category occurrences in 9 ov erlapping coarse categories. Shared layers . The use of shared layers makes both com- 4325 putational complexity and memory footprint of HD-CNN sublinear in the number of ﬁne category classiﬁers when compared to the building block net. Our HD-CNN with 9 ﬁne category classiﬁers based on CIF AR100-NIN con- sumes less than three times as much memory as the build- ing block net without parameter compression. W e also want to inv estigate the impact of the use of shared layers on the classiﬁcation error , memory footprint and the net ex ecution time (T able 2). W e b uild another HD-CNN where coarse category component and all ﬁne category components use independent preceding layers initialized from a pretrained building block net. Under the single-view testing where only a central cropping is used, we observe a minor error increase from 34 . 36% to 34 . 50% . But using shared layers dramatically reduces the memory footprint from 1356 MB to 459 MB and testing time from 2 . 43 seconds to 0 . 28 sec- onds. Conditional executions . By v arying the hyperparameter β , we can effecti vely affect the number of ﬁne category com- ponents that will be ex ecuted. There is a trade-of f between ex ecution time and classiﬁcation error . A larger value of β leads to higher accurac y at the cost of e xecuting more com- ponents for ﬁne cate gorization. By enabling conditional ex- ecutions with hyperparameter β = 6 , we obtain a substan- tial 2 . 8 X speed up with merely a minor increase in error from 34 . 36% to 34 . 57% (T able 2). The testing time of HD-CNN is about 2 . 5 times as much as that of the building block net. Parameter compression . As ﬁne category CNNs hav e in- dependent layers from con v2 to cccp6 , we compress them and reduce the memory footprint from 447 MB to 286 MB with a minor increase in error from 34 . 57% to 34 . 73% . Comparison with a strong baseline. As our HD-CNN memory footprint is about two times as much as the building block model (T able 2), it is necessary to compare a stronger baseline of similar comple xity with HD-CNN. W e adapt CIF AR100-NIN and double the number of ﬁlters in all con- volutional layers which accordingly increases the memory footprint by three times. W e denote it as CIF AR100-NIN- double and obtain error 34 . 26% which is 1 . 01% lower than that of the b uilding block net but is 1 . 64% higher than that of HD-CNN. Comparison with model averaging . HD-CNN is funda- mentally different from model av eraging [18]. In model av eraging, all models are capable of classifying the full set of the categories and each one is trained independently . The main sources of their prediction differences are differ - ent initializations. In HD-CNN, each ﬁne category clas- siﬁer only excels at classifying a partial set of categories. T o compare HD-CNN with model averaging, we indepen- dently train two CIF AR100-NIN networks and take their av- eraged prediction as the ﬁnal prediction. W e obtain an error of 35 . 13% , which is about 2 . 51% higher than that of HD- CNN (T able 1). Note that HD-CNN is orthogonal to the model av eraging and an ensemble of HD-CNN networks can further improv e the performance. Coarse category consistency . T o verify the effecti veness of coarse category consistenc y term in our loss function (4), we ﬁne-tune a HD-CNN using the traditional multinomial logistic loss function. The testing error is 33 . 21% , which is 0 . 59% higher than that of a HD-CNN ﬁne-tuned with coarse category consistenc y (T able 1). Comparison with state-of-the-art. Our HD-CNN im- prov es on the current two best methods [19] and [30] by 2 . 06% and 1 . 16% respectively and sets ne w state-of-the-art results on CIF AR100 (T able 1). 7.3. ImageNet 1000-class Dataset The ILSVRC-2012 ImageNet dataset consists of about 1 . 2 million training images, 50 , 000 validation images. T o demonstrate the generality of HD-CNN, we experiment with two dif ferent building block nets. In both cases, HD- CNNs achie ve signiﬁcantly lower testing errors than the building block nets. 7.3.1 Network-In-Network Building Block Net W e choose a public 4-layer NIN net 2 as our ﬁrst b uild- ing block as it has greatly reduced number of parameters compared to AlexNet [18] but similar error rates. It is de- noted as ImageNet-NIN. In HD-CNN, various components share preceding layers from conv1 to pool3 which account for 26% of the total parameters and 82% of the total ﬂoating point operations. W e follow the training and testing protocols as in [18]. Original images are resized to 256 × 256 . Randomly cropped and horizontally reﬂected 224 × 224 patches are used for training. At test time, the net makes a 10-view av- eraged prediction. W e train ImageNet-NIN for 45 epochs. The top-1 and top-5 errors are 39 . 76% and 17 . 71% . T o b uild the category hierarchy , we take 100K training images as the held-out set and ﬁnd 89 overlapping coarse categories. Each ﬁne category CNN is ﬁne-tuned for 40K iterations while the initial learning rate 0 . 01 is decreased by a factor of 10 ev ery 15K iterations. Fine-tuning the complete HD-CNN is not performed as the required mini- batch size is signiﬁcantly higher than that for the building block net. Nev ertheless, we still achie ve top-1 and top-5 er- rors of 36 . 66% and 15 . 80% and improv e the building block net by 3 . 1% and 1 . 91% , respectively (T able 3). The class- wise top-5 error impro vement over the building block net is shown in Fig 4 left. Case studies W e want to in vestigate how HD-CNN corrects the mistakes made by the building block net. In Fig 3, we 2 https://gist.github.com/mavenlin/ d802a5849de39225bcc6 4326 her m it crab 0 1 tarantula fiddler c rab crayfish scorpion alligator lizar d 0 1 CC 82 CC 7 CC 10 CC 11 CC 6 0 1 crick et banded g eck o fiddler c rab alligator lizar d her m it crab 0 1 alligator lizar d banded g eck o her m it crab ringneck snak e tarantula 0 1 banded g eck o gar scorpion thunder s nak e her m it crab 0 1 ringneck snak e alligator lizar d fiddler c rab tarantula her m it crab hand blower 0 1 power drill dumbbell punching bag barbell plunger 0 1 CC 50 CC 81 CC 40 CC 74 CC 49 0 1 barbell plunger hair spray hand blower punching bag 0 1 barbell maraca hair spray punching bag hand blower 0 1 maraca hand blower hair spray dumbbell punching bag 0 1 dumbbell plunger hair spray punching bag hand blower digital clock 0 1 notebook cab pr ojector laptop odometer 0 1 CC 63 CC 53 CC 75 CC 72 CC 65 0 1 digital clock notebook pool table odometer iP od 0 1 iP od digital clock laptop odometer monitor 0 1 traffic light ski m ask P olar oid camera digital clock laptop 0 1 pr ojector iP od odometer laptop digital clock ice lolly 0 1 whistle cinema sunglass sunglasses bubble 0 1 CC 59 CC 50 CC 78 CC 75 CC 62 0 1 scor eboar d garbage truck ice lolly umbr ella maypole 0 1 maraca plunger sunglasses sunglass ice lolly 0 1 cinema maypole whistle bow tie ice lolly 0 1 maypole umbr ella sunglasses sunglass ice lolly (a) (b) (c) (d) (e) (f) (g) Figure 3: Case studies on ImageNet dataset. Each row represents a testing case. Column (a) : test image with ground truth label. Column (b) : top 5 guesses from the building block net ImageNet-NIN. Column (c) : top 5 Coarse Category ( CC ) probabilities. Column (d)-(f) : top 5 guesses made by the top 3 ﬁne category CNN components. Column (g) : ﬁnal top 5 guesses made by the HD-CNN. See text for details. T able 2: Comparison of testing errors, memory footprint and testing time between building block nets and HD-CNNs on CIF AR100 and ImageNet datasets. Statistics are col- lected under single-view testing. Three building block nets CIF AR100-NIN, ImageNet-NIN and ImageNet-VGG-16- layer are used. The testing mini-batch size is 50. No- tations: SL =Shared layers, CE =Conditional executions, PC =Parameter compression. Model top-1, top-5 Memory (MB) T ime (sec.) Base:CIF AR100-NIN 37 . 29 188 0.04 HD-CNN w/o SL 34 . 50 1356 2.43 HD-CNN 34 . 36 459 0.28 HD-CNN+CE 34 . 57 447 0.10 HD-CNN+CE+PC 34 . 73 286 0.10 Base:ImageNet-NIN 41 . 52 , 18 . 98 535 0.19 HD-CNN 37 . 92 , 16 . 62 3544 3.25 HD-CNN+CE 38 . 16 , 16 . 75 3508 0.52 HD-CNN+CE+PC 38 . 39 , 16 . 89 1712 0.53 Base:ImageNet-VGG- 16-layer 32 . 30 , 12 . 74 4134 1.04 HD-CNN+CE+PC 31 . 34 , 12 . 26 6863 5.28 collect four testing cases. In the ﬁrst case, the building block net fails to predict the label of the tin y hermit crab in the top 5 guesses. In HD-CNN, two coarse cate gories #6 and #11 receiv e most of the coarse probability mass. The ﬁne cat- egory component #6 specializes in classifying crab breeds and strongly suggests the ground truth label. By combin- ing the predictions from the top ﬁne category classiﬁers, the HD-CNN predicts hermit crab as the most probable label. In the second case, the ImageNet-NIN confuses the ground truth hand blower with other objects of close shapes and appearances, such as plunger and barbell . For HD-CNN, the coarse category component is also not conﬁdent about which coarse category the object belongs to and thus as- signs ev en probability mass to the top coarse categories. For the top 3 ﬁne category classiﬁers, #74 strongly predicts ground truth label while the other two #49 and #40 rank the ground truth label at the 2nd and 4th place respectiv ely . Overall, the HD-CNN ranks the ground truth label at the 1st place. This demonstrates HD-CNN needs to rely on multi- ple ﬁne category classiﬁers to make correct predictions for difﬁcult cases. Overlapping coarse categories .T o in vestigate the impact of overlapping coarse categories on the classiﬁcation, we train another HD-CNN with 89 ﬁne category classiﬁers us- ing disjoint coarse categories. It achie ves top-1 and top-5 errors of 38 . 44% and 17 . 03% respectiv ely , which is higher than those of the HD-CNN using ov erlapping coarse cate- gory hierarchy by 1 . 78% and 1 . 23% (T able 3). Conditional executions . By v arying the hyperparameter β , we can control the number of ﬁne category components that will be executed. There is a trade-off between ex ecu- tion time and classiﬁcation error as sho wn in Fig 4 right. A larger v alue of β leads to lower error at the cost of more ex ecuted ﬁne category components. By enabling condi- tional executions with hyperparameter β = 8 , we obtain a substantial 6 . 3 X speed up with merely a minor increase of single-view testing top-5 error from 16 . 62% to 16 . 75% (T able 2). With such speedup, the HD-CNN testing time is less than 3 times as much as that of the building block net. Parameter compr ession . W e compress independent layers con v4 and cccp7 as they account for 60% of the parame- 4327 T able 3: Comparisons of 10-view testing errors between ImageNet-NIN and HD-CNN. Notation CC =Coarse cate- gory . Method top-1, top-5 Base:ImageNet-NIN 39 . 76 , 17 . 71 Model av eraging ( 3 base nets) 38 . 54 , 17 . 11 HD-CNN, disjoint CC 38 . 44 , 17 . 03 HD-CNN 36 . 66 , 15 . 80 ters in ImageNet-NIN. Their parameter matrices are of size 1024 × 3456 and 1024 × 1024 and we use compression hyperparameters ( s, k ) = (3 , 128) and ( s, k ) = (2 , 256) . The compression factors are 4 . 8 and 2 . 7 . The compres- sion decreases the memory footprint from 3508 MB to 1712 MB and merely increases the top-5 error from 16 . 75% to 16 . 89% under single-view testing (T able 2). 0 200 400 600 800 1000 class −0.3 −0.2 −0.1 0.0 0.1 0.2 er r o r i mpr ovement log 2 β - 5 - 4 - 3 - 2 - 1 01234 Mean # of executed components 0 2 4 6 8 10 12 Top 5 error 0.15 0.155 0.16 0.165 0.17 0.175 0.18 Figure 4: Left : Class-wise HD-CNN top-5 error improve- ment over the building block net. Right : Mean number of ex ecuted ﬁne category classiﬁers and top-5 error against hy- perparameter β on the ImageNet validation dataset. Comparison with model averaging . As the HD-CNN memory footprint is about three times as much as the build- ing block net, we independently train three ImageNet-NIN nets and average their predictions. W e obtain top-5 error 17 . 11% which is 0 . 6% lower than the building block but is 1 . 31% higher than that of HD-CNN (T able 3). 7.3.2 V GG-16-layer Building Block Net The second building block net we use is a 16-layer CNN from [26]. W e denote it as ImageNet-VGG-16-layer 3 . The layers from con v1 1 to pool4 are shared and they account for 5 . 6% of the total parameters and 90% of the total ﬂoat- ing number operations. The remaining layers are used as independent layers in coarse and ﬁne category classiﬁers. W e follo w the training and testing protocols as in [26]. F or training, we ﬁrst sample a size S from the range [256 , 512] and resize the image so that the length of short edge is S . Then a randomly cropped and ﬂipped patch of size 224 × 224 is used for training. For testing, dense e valuation is performed on three scales { 256 , 384 , 512 } and the aver - 3 https://github.com/BVLC/caffe/wiki/Model- Zoo aged prediction is used as the ﬁnal prediction. Please refer to [26] for more training and testing details. On ImageNet validation set, ImageNet-VGG-16-layer achiev es top-1 and top-5 errors 24 . 79% and 7 . 50% respectiv ely . W e build a category hierarchy with 84 overlapping coarse categories. W e implement multi-GPU training on Caffe by exploiting data parallelism [26] and train the ﬁne category classiﬁers on two NVIDIA T esla K40c cards. The initial learning rate is 0 . 001 and it is decreased by a fac- tor of 10 e very 4K iterations. HD-CNN ﬁne-tuning is not performed. Due to large memory footprint of the building block net (T able 2), the HD-CNN with 84 ﬁne category clas- siﬁers cannot ﬁt into the memory directly . Therefore, we compress the parameters in layers fc6 and fc7 as they ac- count for over 85% of the parameters. Parameter matrices in fc6 and fc7 are of size 4096 × 25088 and 4096 × 4096 . Their compression hyperparameters are ( s, k ) = (14 , 64) and ( s, k ) = (4 , 256) . The compression factors are 29 . 9 and 8 respectiv ely . The HD-CNN obtains top-1 and top-5 er - rors 23 . 69% and 6 . 76% on ImageNet v alidation set and im- prov es over ImageNet-VGG-16-layer by 1 . 1% and 0 . 74% respectiv ely . T able 4: Errors on ImageNet validation set. Method top-1, top-5 GoogLeNet,multi-crop [31] N/A, 7 . 9 VGG-19-layer , dense [26] 24 . 8 , 7 . 5 VGG-16-layer+V GG-19-layer ,dense 24 . 0 , 7 . 1 Base:ImageNet-VGG-16-layer ,dense 24 . 79 , 7 . 50 HD-CNN,dense 23 . 69 , 6 . 76 Comparison with state-of-the-art . Currently , the tw o best nets on ImageNet dataset are GoogLeNet [31] (T able 4) and VGG 19-layer network [26]. Using multi-scale multi- crop testing, a single GoogLeNet net achie ves top-5 error 7 . 9% . With multi-scale dense e valuation, a single VGG 19- layer net obtains top-1 and top-5 errors 24 . 8% and 7 . 5% and improv es top-5 error of GoogLeNet by 0 . 4% . Our HD- CNN decreases top-1 and top-5 errors of V GG 19-layer net by 1 . 11% and 0 . 74% respectively . Furthermore, HD-CNN slightly outperforms the results of a veraging the predictions from VGG-16-layer and V GG-19-layer nets. 8. Conclusions and Future W ork W e demonstrated that HD-CNN is a ﬂexible deep CNN architecture to improve over e xisting deep CNN models. W e showed this empirically on both CIF AR-100 and Image- Net datasets using three dif ferent b uilding block nets. As part of future work, we plan to e xtend HD-CNN architec- tures to those with more than 2 hierarchical lev els and also verify our empirical results in a theoretical frame work. 4328 References [1] H. Bannour and C. Hudelot. Hierarchical image annota- tion using semantic hierarchies. In Pr oceedings of the 21st A CM International Conference on Information and Knowl- edge Manag ement , 2012. [2] S. Bengio, J. W eston, and D. Grangier . Label embedding trees for large multi-class tasks. In NIPS , 2010. [3] J. Deng, N. Ding, Y . Jia, A. Frome, K. Murphy , S. Bengio, Y . Li, H. Ne ven, and H. Adam. Large-scale object classiﬁca- tion using label relation graphs. In ECCV . 2014. [4] J. Deng, W . Dong, R. Socher , L. Li, K. Li, and F . Li. Ima- genet: A large-scale hierarchical image database. In CVPR , 2009. [5] J. Deng, J. Krause, A. C. Berg, and L. Fei-Fei. Hedging your bets: Optimizing accuracy-speciﬁcity trade-offs in large scale visual recognition. In CVPR , 2012. [6] J. Deng, S. Satheesh, A. C. Berg, and F . Li. Fast and bal- anced: Efﬁcient label tree learning for large scale object recognition. In Advances in Neural Information Pr ocessing Systems , 2011. [7] C. Farabet, C. Couprie, L. Najman, and Y . LeCun. Learning hierarchical features for scene labeling. IEEE T ransactions on P attern Analysis and Machine Intelligence , 2013. [8] R. Fergus, H. Bernal, Y . W eiss, and A. T orralba. Semantic label sharing for learning with many categories. In ECCV . 2010. [9] T . Gao and D. K oller . Discriminati ve learning of relax ed hierarchy for large-scale visual recognition. In ICCV , 2011. [10] R. Girshick, J. Donahue, T . Darrell, and J. Malik. Rich fea- ture hierarchies for accurate object detection and semantic segmentation. In CVPR , 2014. [11] I. Goodfello w , D. W arde-Farley , M. Mirza, A. Courville, and Y . Bengio. Maxout networks. In ICML , 2013. [12] G. Grifﬁn and P . Perona. Learning and using taxonomies for fast visual categorization. In CVPR , 2008. [13] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep con volutional networks for visual recognition. In ECCV . 2014. [14] H. Jegou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. IEEE T ransactions on P attern Analysis and Machine Intelligence , 2011. [15] Y . Jia. Caf fe: An open source conv olutional architecture for fast feature embedding, 2013. [16] Y . Jia, J. T . Abbott, J. Austerweil, T . Grifﬁths, and T . Dar- rell. V isual concept learning: Combining machine vision and bayesian generalization on concept hierarchies. In NIPS , 2013. [17] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Computer Science Department, University of T or onto, T ech. Rep , 2009. [18] A. Krizhe vsky , I. Sutske ver , and G. Hinton. Imagenet clas- siﬁcation with deep conv olutional neural networks. In NIPS , 2012. [19] C.-Y . Lee, S. Xie, P . Gallagher , Z. Zhang, and Z. T u. Deeply- supervised nets. arXiv pr eprint arXiv:1409.5185 , 2014. [20] L.-J. Li, C. W ang, Y . Lim, D. M. Blei, and L. Fei-Fei. Build- ing and using a semanti visual image hierarchy . In CVPR , 2010. [21] M. Lin, Q. Chen, and S. Y an. Network in network. CoRR , 2013. [22] B. Liu, F . Sadeghi, M. T appen, O. Shamir , and C. Liu. Prob- abilistic label trees for ef ﬁcient large scale image classiﬁca- tion. In CVPR , 2013. [23] M. Marszalek and C. Schmid. Semantic hierarchies for vi- sual object recognition. In CVPR , 2007. [24] M. Marszałek and C. Schmid. Constructing category hierar - chies for visual recognition. In ECCV . 2008. [25] R. Salakhutdinov , A. T orralba, and J. T enenbaum. Learning to share visual appearance for multiclass object detection. In CVPR , 2011. [26] K. Simonyan and A. Zisserman. V ery deep conv olu- tional networks for large-scale image recognition. CoRR , abs/1409.1556, 2014. [27] J. Sivic, B. C. Russell, A. Zisserman, W . T . Freeman, and A. A. Efros. Unsupervised discovery of visual object class hierarchies. In CVPR , 2008. [28] J. T . Springenberg and M. Riedmiller . Improving deep neu- ral networks with probabilistic maxout units. arXiv pr eprint arXiv:1312.6116 , 2013. [29] N. Sriv astav a and R. Salakhutdinov . Discriminativ e transfer learning with tree-based priors. In NIPS , 2013. [30] M. F . Stollenga, J. Masci, F . Gomez, and J. Schmidhuber . Deep networks with internal selective attention through feed- back connections. In NIPS , 2014. [31] C. Szegedy , W . Liu, Y . Jia, P . Sermanet, S. Reed, D. Anguelov , D. Erhan, V . V anhoucke, and A. Rabi- novich. Going deeper with con volutions. arXiv preprint arXiv:1409.4842 , 2014. [32] A.-M. T ousch, S. Herbin, and J.-Y . Audibert. Semantic hier- archies for image annotation: A surve y . P attern Recognition , 2012. [33] N. V erma, D. Mahajan, S. Sellamanickam, and V . Nair . Learning hierarchical similarity metrics. In CVPR , 2012. [34] T . Xiao, J. Zhang, K. Y ang, Y . Peng, and Z. Zhang. Error- driv en incremental learning in deep con volutional neural net- work for large-scale image classiﬁcation. In Pr oceedings of the A CM International Confer ence on Multimedia , 2014. [35] M. Zeiler and R. Fergus. V isualizing and understanding con- volutional netw orks. In ECCV . 2014. [36] M. D. Zeiler and R. Fergus. Stochastic pooling for regu- larization of deep conv olutional neural networks. In ICLR , 2013. [37] B. Zhao, F . Li, and E. P . Xing. Large-scale category structure aware image cate gorization. In NIPS , 2011. [38] A. Zweig and D. W einshall. Exploiting object hierarchy: Combining models from different category lev els. In ICCV , 2007. 4329

HD-CNN: Hierarchical Deep Convolutional Neural Network for Large Scale Visual Recognition

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment