Modular network for high accuracy object detection


Authors: Erez Yahalomi

Abstract

We present a novel modular object detection convolutional neural network that significantly improves the accuracy of object detection. The network consists of two stages in a hierarchical structure. The first stage is a single network that detects general classes. The second stage consists of several separate networks that refine the classification and localization of the objects of each general class. Compared to a state-of-the-art object detection network, the classification error of the modular network is reduced by approximately 3-5 times, from 12% to 2.5%-4.5%. The network is easy to implement and achieves 0.94 mAP. The network architecture is a platform for improving the accuracy of many detection networks and other types of deep learning networks. We show experimentally and theoretically that a deep learning network initialized by transfer learning becomes more accurate as the number of classes it is later trained to detect becomes smaller.

1 Introduction

There is a constant effort to increase the accuracy of deep learning networks for object detection. A major topic in object detection is fine-grained detection [10, 2, 27, 7] for distinguishing differences between similar object classes. In this paper, we present a novel, highly accurate deep learning network for computer vision object detection, in particular for fine-grained object detection. Our contribution is a new modular object detection network of two stages in a hierarchical structure, from detection of general classes to more detailed classes. The first stage is one deep learning object detection network that detects multi-class objects where the classes are general. The second stage consists of several separate deep learning object detection networks, each trained to detect only the fine-grained classes that belong to one of the general classes of the first stage.
Images in which the first stage detects an object of one of the general classes are passed on to the appropriate network in the second stage for more detailed identification of the object's type and location. We compared the results of our modular object detection network to a state-of-the-art object detection network trained to detect the same classes as the modular network. The experiments showed that the modular network has significantly higher accuracy. We show both experimentally and theoretically that a deep learning network designed to detect a smaller number of classes and initially trained by transfer learning is more accurate than a network trained to detect more classes. The modular network architecture suggested in this paper can be used to increase the accuracy of state-of-the-art object detection networks by integrating them as the building blocks of this modular network without changing the intensive optimizations carried out on their architecture. Other types of networks can improve their accuracy as well by being inserted as building blocks into this modular network platform.

2 Related Work

We are the first to propose a modular network with a hierarchical structure for fine-grained object detection that consists entirely of deep learning object detection networks [30]. An additional novelty of our modular network is that input images are passed on for detection to the appropriate second stage networks based on the object classes, and their confidence scores, detected by the first stage object detection network.

2.1 Object detection

Notable convolutional neural networks for object detection are [23, 17, 21, 28] and Faster R-CNN [22], which consists of a classification network and a region proposal network that divides the image into rectangular regions, followed by regression for additional accuracy in classification and location.
Most state-of-the-art object detection networks include a core image classification network, such as AlexNet [13], VGG [26], or ResNet [6]. These networks use transfer learning based on training on a large image data set, such as ImageNet [24, 3] or COCO [16].

2.2 Hierarchical structures

Hierarchical structures appear in many forms in computer vision [4, 25]. Jarrett et al. [14] present a hierarchical feature extraction and classification system with fast (feed-forward) processing. The hierarchy stacks one or several feature extraction stages, each of which consists of a filter bank layer, non-linear transformation layers, and a pooling layer. Salakhutdinov et al. [25] presented a hierarchical classification model that allows rare objects to borrow statistical strength from related objects that may have many training instances. They use a hierarchical classification model in which the parameters of each class are given by the sum of parameters along the tree.

3 The modular network

3.1 Modular network architecture

We present in this paper a new modular and hierarchical object detection network. The network consists of two stages. The first is a deep learning object detection network trained to detect predetermined general classes; the second stage consists of several deep learning object detection networks, each trained independently on more fine-grained classes belonging to a single general class of the first stage network. All the building block networks inside the modular network are trained on negative images as well. All the object detection networks inside the modular network are whole and separate object detection networks. Each deep learning network in the modular network independently goes through the complete object detection process of training and inference. The full image data set for inference is inserted into the first stage network.
If an object in an image is detected as belonging to one of the network's general classes, the unchanged image is passed on for inference by the second stage network trained to detect sub-classes of this class. The purpose of the second stage networks is to distinguish between objects of similar classes, producing more detailed classification and more accurate localization of each object in the image. Each sub-network inside the modular network is initialized by transfer learning weights [8, 12, 18, 19, 33] trained on the ImageNet database. Figure 1 shows the modular network of our experiment. The building blocks, or sub-networks, of the modular network are Faster R-CNN (frcnn for short) networks [22]. In the first stage, there is a single network trained to detect five general object classes. Based on the first stage network's output, images are passed on for fine-grained detection in the second stage, to the appropriate network trained to detect the detailed classes belonging to the general class determined by the first stage. One of the main reasons for better accuracy over a regular multi-class network is that each of the networks inside this modular network is designated to detect fewer classes than a regular multi-class network. For a very large number of classes, a possible further modification of the modular network would be to add additional hierarchical stages.

3.2 Algorithm and the modular network construction

To detect multiple classes, we use in the first stage an object detection network trained by transfer learning. Similar classes are merged into a general class. The first stage network is trained to detect general classes C_i and additional negative images that do not belong to any of these general classes. For each of the general classes, we train a second stage network on the same training images of this general class and on negative images.
This time, we sort and label the training images with fine-grained classes, all belonging to this general class. Images are then input into the first stage network for inference. Input images with an object detected as belonging to a general class, on the basis of the network's confidence score for the detected object, are passed on to the second stage network dedicated to that class for fine-grained object classification and localization.

Figure 1: A modular network whose first stage is a single deep learning network (frcnn) trained to detect five general classes. Its second stage consists of five separate frcnn networks, each trained to detect two distinct sub-classes of one of the general classes.

3.3 Advantages and risk of the modular network

The advantages of the modular network are as follows. In each of the sub-convolutional neural networks inside the modular network, there are fewer classes than in a regular network designated to detect the same number of classes as the whole modular network. Thus, there are more features, filters, and network parameters dedicated to the detection of each class, resulting in better accuracy. A small number of features per class reduces the ability to distinguish between similar classes and causes errors in detecting rare class objects. When the number of features is small, features are formed to identify multiple classes, adding errors in fine-grained object detection. Fewer classes in an object detection network mean potentially fewer bounding boxes of detected objects in the image, which gives fewer errors in identifying the objects and finding their locations. The advantage of the hierarchical structure of the modular network, compared to detection by many unconnected networks that each detect a few classes, is that the hierarchical structure drastically cuts down the number of required inferences, as the networks are arranged in a tree structure.
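The two-stage routing described in Section 3.2 can be sketched in a few lines. This is a minimal illustration, not the paper's code: the detection callables, the class names, and the 0.5 confidence threshold are all assumptions made for the example.

```python
# Minimal sketch of the modular network's two-stage inference routing.
# Each "network" is modeled as a callable returning a list of
# (class_name, confidence, bounding_box) detections.

def modular_inference(image, first_stage, second_stage_nets, threshold=0.5):
    """Detect general classes first, then pass the unchanged image to the
    fine-grained network of each detected general class."""
    results = []
    for general_class, confidence, _box in first_stage(image):
        if confidence < threshold or general_class not in second_stage_nets:
            continue  # low-confidence or negative detections are not routed
        results.extend(second_stage_nets[general_class](image))
    return results

# Toy stand-ins for trained Faster R-CNN building blocks:
first = lambda img: [("dog", 0.9, (10, 10, 200, 200))]
fine = {"dog": lambda img: [("Pekinese", 0.95, (12, 11, 198, 199))]}
print(modular_inference("image.jpg", first, fine))
```

In Modular Network v.2 (Section 3.3), the routing would instead forward the entire inference image set once any general class is detected, which reduces the cost of first-stage false negatives.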
A risk of the modular network is as follows. Assume we use the same type of object detection network as the multi-class network for the building block networks of the modular network. If the multi-class network has low accuracy, then it may be preferred, since the building block networks inside the modular network would need a very large improvement in accuracy over the multi-class network to make the modular network more accurate. The condition for the accuracy of the modular network to be better than that of a multi-class network is:

a < (a + Δ1)(a + Δ2)    (1)

where a represents the multi-class network accuracy, Δ1 is the improvement in accuracy of the first stage of the modular network compared to the multi-class network, and Δ2 is the improvement in accuracy of the second stage compared to the multi-class network. Most state-of-the-art object detection networks are accurate enough to be used as the building block network of the modular network, allowing a modular network with higher accuracy than the selected state-of-the-art object detection network.

A further risk of the modular network is the detection of false negatives by the first stage network. This may reduce accuracy, as some images with true objects may be omitted from the input of the second stage networks. To deal with this problem, we designed a second version of the modular network specifically for image sequences [5, 15, 20], where the same object is assumed to appear in more than one image. The network architecture of this version, denoted Modular Network v.2, is the same as Modular Network v.1; the difference is that once an object of a general class is detected in the first stage, the entire inference image set is sent for inference to the appropriate fine-grained network in the second stage. In this way, the loss of accuracy due to false negative detection in the first stage is reduced.

4 Convolutional neural network classification error model
This model describes how reducing the number of classes for detection in a convolutional neural network (CNN) reduces the network's classification error. Each of the building block networks inside the modular network has fewer classes than the regular multi-class network. Let x = {x_1 ... x_f} be the feature space. Let c be a set of classes, c = {c_0 ... c_n}. Every detection of an object in an image is defined by a set of features that are active if this object appears in the image. For example, the feature set {x_m ... x_p} identifies objects belonging to class C_1. N represents the total number of features of the designated classes that the CNN can identify. L and T are the numbers of features of the designated classes that the CNN can identify based on transfer learning and fine-tuning [33, 14, 25], respectively, where each feature belongs to a single class. U is the number of features that the CNN can identify that are common to several classes. N = L + T + U. When each of the designated classes has a similar number of training images, S, the number of features detecting a designated class, is approximated as

S ≈ (L + T)/n + U.

In this approximation, the number of features for detecting a single designated class is inversely related to n, the number of the CNN's designated classes. The smaller the n, the more features there are for detecting the designated class, making the detection of this class's objects more accurate. The parameters that determine sup K, the upper bound on the number of features of all the classes that a CNN can identify, include: r, the number of parameters in the CNN; a, the number of filters; d, the size of the filters; h, the number of filter channels; and q, the number of layers in the CNN. These parameters are constant for each network. In this model, every CNN has an upper bound on the total number of features, sup K(r, a, d, h, q), that it can identify without increasing the classification errors.
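The inverse relation between n and the per-class feature count can be illustrated numerically. The sketch below assumes the reading S ≈ (L + T)/n + U of the approximation above, and the values of L, T, and U are hypothetical, chosen only to show the trend.

```python
# Illustration of the per-class feature budget S ≈ (L + T)/n + U:
# fewer designated classes n leave more features per class.
# L, T, U values here are hypothetical.

def features_per_class(L, T, U, n):
    """Approximate number of features detecting one designated class."""
    return (L + T) / n + U

L, T, U = 4000, 2000, 500  # transfer-learned, fine-tuned, shared features
print(features_per_class(L, T, U, n=10))  # 1100.0
print(features_per_class(L, T, U, n=2))   # 3500.0
```

Going from ten designated classes to two roughly triples the per-class feature budget in this toy setting, which is the mechanism the model offers for the modular network's accuracy gain.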
A classification error caused by having more features than the upper bound can arise, for example, from two channels in the same filter, where the weight patterns formed in each channel detect different classes. Suppose each of these channels performs a convolution with its respective feature map, and the features of the different object classes on the different feature maps are located at similar locations in the input feature maps. The two output feature maps of the two channel patterns can then partially overlap in their locations. Let A and B be matrices representing the two channels' output feature maps. Some of the feature weights in matrix A can have the same pixel coordinates as the feature weights in matrix B:

Σ_{(i,j) ∈ G} (|A|_{i,j} + |B|_{i,j}) > Σ_{(i,j) ∈ G} |A|_{i,j}    (2)

In Eq. 2, i and j are the row and column coordinates of the elements in matrices A or B, and G is the set of all the coordinates that are active in both the A and B matrices. Eq. 2 indicates that when the elements at these coordinates from both feature maps are added, the sum no longer represents a feature map of the object detected in matrix A but a deformation caused by summing the features of two different classes' objects. This can cause classification errors. It follows that increasing the number of filters a in the layers of a deep learning network increases the network's accuracy, or lets it identify more classes without reducing accuracy, since the different feature kernels can be spread over more filters.

To estimate the classification error, the Bayes error is used [32, 29, 1, 9]. As an example, we analyzed the classification of two fine-grained classes, C_1 and C_0. According to the Bayes error estimation, there is a probability that feature x_i appears in the feature map when there is an object of class C_0 in the image. There is also a probability density that feature x_i is activated when an object of class C_1 is in the image.
The classification error caused by feature x_i is the smaller of these two probability densities. The sum of the smallest probability densities over all the features activated by objects of the two classes is the classification error. Assuming the probability densities of being activated by objects of classes C_1 or C_0 are known for each of the features in the network, the probability of classification error is described in Equation 3, where P(C_0) and P(C_1) are the prior probability densities of classes C_0 and C_1, respectively, and P(x_i | C_0) and P(x_i | C_1) are the conditional probability densities that feature x_i is active given that the class is C_0 or C_1, respectively. An additional criterion in Equation 3 is the significance of feature x_i in the classification, because if an active feature does not influence the classification of an object, it does not contribute to the classification probability of the object's class. The weights of feature x_i for classes C_0 and C_1 are denoted by w_i(C_0) and w_i(C_1), respectively. The values of the weights w_i(C_0) and w_i(C_1) are based on how many times feature x_i was essential for classification out of all the times this feature was activated by the class's objects.

P_error = Σ_{i=1}^{N_f} min( P(x_i | C_0) P(C_0) w_i(C_0), P(x_i | C_1) P(C_1) w_i(C_1) )    (3)

The probability densities of the features are given as discrete values, which we approximate as a continuous graph. The graphs in Figures 2a and 2b present the features' probability densities of being activated by objects of classes C_0 and C_1. The X-axis is the feature range, where i is the filter index number. The Y-axis presents the probability density that a feature is activated. In the graphs, all features with a probability of matching a particular class are in the same area on the X-axis.
Features with a probability of matching both classes are displayed in the graph in an area shared by both classes. The Bayes classification error is the sum, or integral, of the minimal probability densities of every feature within the mutual area, which is the overlap of the C_0 and C_1 curves.

Figure 2: Features vs. probability densities. (a) Ten-class network. (b) Two-class network. The X-axis in the graphs presents the network features, denoted i. The Y-axis is the features' probability density, denoted PDF.

The graph in Figure 2a illustrates the features' probability densities for identifying C_0 and C_1 in a network trained to detect ten different classes and negative images. The active features are nearly a quarter of the total features in the network. The area of misclassified features is significant compared to the total areas of the features of classes C_0 and C_1, which indicates a large classification error. This is because there are many classes and the number of features dedicated to each class is small. An additional cause is that with many classes, the total number of features exceeds the upper bound that is optimal for this network, resulting in false detections. The graph in Figure 2b illustrates a network trained to detect only two classes and negative images. Most of the features detected by this network are of classes C_0 and C_1. The misclassified feature area is small compared to the total area of both classes, indicating that the classification error is small. The number of features for each class is large, enabling the training of features that detect even more detailed features, further reducing the classification error. The first stage of the modular network is trained to detect general classes, and C_0 and C_1 are both included in C_g, a general class: C_g = C_0 ∪ C_1. This eliminates the error of misclassification between the two classes, resulting in a low classification error.
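The weighted Bayes error of Equation 3 can be evaluated numerically once the per-feature probabilities and weights are given. The sketch below uses hypothetical values; it illustrates the formula, not any trained network.

```python
# Numeric sketch of Equation 3: the weighted Bayes classification error.
# All probabilities and weights below are hypothetical illustration values.

def bayes_error(p_x_c0, p_x_c1, prior0, prior1, w0, w1):
    """P_error = sum_i min(P(x_i|C0) P(C0) w_i(C0), P(x_i|C1) P(C1) w_i(C1))."""
    return sum(
        min(p0 * prior0 * a, p1 * prior1 * b)
        for p0, p1, a, b in zip(p_x_c0, p_x_c1, w0, w1)
    )

# Three features: the first mostly signals C0, the last mostly C1, and the
# middle one is shared by both classes; the shared feature dominates the error.
p_c0 = [0.60, 0.30, 0.10]
p_c1 = [0.10, 0.30, 0.60]
err = bayes_error(p_c0, p_c1, 0.5, 0.5, w0=[1, 1, 1], w1=[1, 1, 1])
print(round(err, 3))  # 0.25: min terms 0.05 + 0.15 + 0.05
```

The shared middle feature contributes the largest min term, matching the figure's point that the overlap area between the class curves is what drives the classification error.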
Classification errors in this network are between general classes.

5 Experiments

5.1 Implementation

The original data set for training contains 522 original images, expanded to 46,044 training images by mirror symmetry, sharpness, brightness, and contrast augmentations, and used as the training data by both the modular and the multi-class networks. The images are distributed similarly between 10 classes, or five pairs of similar classes: Pekinese, spaniel, kayak, canoe, swan, duck, sport bike, mountain bike, Mars, Saturn, plus negative images with no labels. The test set contained an additional 125 original images, expanded by cross-validation to 647 test images. The size of the network input images is 800x800 pixels. The multi-class network is Faster R-CNN with a backbone classification network, VGG-16, initialized by transfer learning training on the ImageNet 2012 database. The building blocks of the modular network are also Faster R-CNN with a VGG-16 backbone and the same initialization. Faster R-CNN networks trained on 40-50 original images per class are prevalent for various object detection tasks [31, 11]. To compare the multi-class network and the modular network, both networks have the same hyper-parameter values, previously optimized on classes other than those the networks are designated to detect. The modular network and the multi-class network had fine-tuning training on all the network layers. Each of the networks was trained for 40 epochs, with learning rates of 0.001 for the first 10 epochs, 0.0001 for the next 10 epochs, and 0.00001 for the last 20 epochs. Both the modular and the multi-class networks ran inference on this test data. Most of the original images for the training and test sets were taken from the Caltech 101 image database, and the rest were randomly chosen from the internet.
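The learning-rate schedule described above can be written as a small step function; this is only a restatement of the stated schedule, not published training code.

```python
# The learning-rate schedule used for all networks in the experiments:
# 0.001 for epochs 0-9, 0.0001 for epochs 10-19, 0.00001 for epochs 20-39.

def learning_rate(epoch):
    if epoch < 10:
        return 1e-3
    if epoch < 20:
        return 1e-4
    return 1e-5

schedule = [learning_rate(e) for e in range(40)]
print(schedule[0], schedule[10], schedule[25])  # 0.001 0.0001 1e-05
```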
5.2 Experiment results

The Faster R-CNN multi-class object detection network was trained on the data set of 46,044 images with the ten classes and the negative images. The training loss [22] is 0.0229. The multi-class network inference results are 0.87 mAP and a 12% classification error. The modular network has two stages. The first stage network is trained on the same data set as the multi-class network, including the negative images, but the objects in the images are labeled with five general classes instead of the more detailed 10 classes of the multi-class network. The modular network's first stage classes are dog, planet, bike, boat, and bird. Each of these classes is a unification of a pair of similar classes from the 10 classes labeled for training by the multi-class network. The training loss of the modular network's first stage is 0.0216. In the second stage, each network is trained on one fine-grained pair of the multi-class network's classes plus negative images. For example, one network is trained on two dog species classes, Pekinese and spaniel, and negative images, with a training loss of 0.0151, while a second network is trained to detect two planets, Mars and Saturn, and negative images, with a training loss of 0.0170. The modular network v1 inference results are 0.94 mAP and a 4.5% classification error. The modular network v2 inference results are 0.95 mAP and a 2.5% classification error. These experimental results indicate that the modular network is significantly more accurate than the multi-class network. Table 1 shows the mean average precision, mAP, of the modular networks and the multi-class network, which is a regular Faster R-CNN network, tested on the same images.

Table 1: Object detection average precision

  Network                                          mAP
  Modular net v1                                   0.94
  Modular net v2                                   0.95
  Multi-class net (Faster R-CNN)                   0.87
  First stage of the Modular net, general classes  0.93
The modular network v1 mAP is calculated by taking into account the images detected as false negatives in the first stage of the modular network, which therefore do not appear in the mAP of the second stage. Each false negative's precision is rated as zero, and its weight in the calculation of the whole modular network's mAP is one divided by the total number of inference images in the modular network. In Table 1, the total modular networks' mAP is higher than the modular network's first stage mAP, even though their classification error is larger than the first stage classification error. This means the second stage fine-grained object detection networks have higher accuracy in detecting object locations than the first stage object detection network. Table 2 shows the experimental results for the networks' classification errors. The modular network's classification error is significantly reduced, to 4.5% and 2.5% for Modular net v1 and Modular net v2, respectively, compared to 12% for the multi-class network, the regular Faster R-CNN.

Table 2: Classification error

  Network                                          Classification error
  Modular net v1                                   4.5%
  Modular net v2                                   2.5%
  Multi-class net (Faster R-CNN)                   12%
  First stage of the Modular net, general classes  2.25%

Figure 3: The left column shows object detection images by the multi-class network, the center column shows images detected by the general-classes network, and the right column shows images detected by the fine-grained networks.

In the first three rows of the first column of Figure 3, the images detected by the multi-class network have classification errors, whereas the general-class network and the fine-grained network both detected the same objects correctly. The second row of images shows that the detection of object location is more accurate in the image detected by the fine-grained network (right) than in the image detected by the multi-class network (left).
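The false-negative correction to the modular network's mAP described above can be sketched as an average of per-image precisions in which each first-stage false negative enters as a zero with weight one over the total number of inference images. The precision values and counts below are hypothetical, not the experiment's data.

```python
# Sketch of the modular-network mAP correction: images lost as first-stage
# false negatives contribute a precision of zero, each weighted by one over
# the total number of inference images. Values here are hypothetical.

def modular_map(second_stage_precisions, n_false_negatives):
    """Average per-image precision, counting first-stage false negatives
    as zero-precision entries."""
    n_total = len(second_stage_precisions) + n_false_negatives
    return sum(second_stage_precisions) / n_total

precisions = [0.96] * 98  # images that reached the second stage
print(round(modular_map(precisions, n_false_negatives=2), 3))  # 0.941
```

Even a small false-negative rate in the first stage pulls the whole-network figure below the second-stage precision, which is why Modular Network v.2 forwards the full image set once a general class is detected.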
6 Discussion

Our experimental results show that a network with fewer classes is more accurate. The results show that most of the classification errors in the multi-class network are between similar classes. The accuracy of the modular network, for both v.1 and v.2, is higher by 7.5% and 9.5%, respectively, than that of the multi-class network, a regular object detection network. This is a reduction of the classification error by 2.7 and 4.8 times, respectively. We found that the accuracy of a network trained to detect only two similar objects is 9.5% higher than that of the multi-class network that detects 10 classes. The training results indicate that the training loss becomes smaller as the number of classes the network is trained to detect becomes smaller.

A fundamental question in machine learning is what kind of learning has higher accuracy: a network trained to detect only a few focused classes, or one trained to detect many classes over a wide range of subjects? We found that a network initially trained on a wide range of classes by transfer learning and later trained to detect fewer classes by fine-tuning on all the network layers is more accurate than a network initialized by transfer learning and later trained to detect a larger number of classes. Previous works on transfer learning [8, 33] determined that a network initially trained by transfer learning and later trained to detect designated classes is more accurate than one only trained to detect the same designated classes. From both findings we obtain: a network initially trained by transfer learning and then designated to detect a small number of classes is more accurate than one trained to detect a larger number of classes, with or without transfer learning.

7 Conclusion

The modular network presented in this paper significantly improves object detection performance in both classification and localization.
This is true especially for detection that requires differentiating between similar classes. This modular network improves state-of-the-art deep learning object detection networks without requiring changes to the networks' architecture or even hyper-parameters; adjusting the hyper-parameters may give even higher performance. We found that reducing the number of classes a convolutional neural network is trained to detect increases the network's accuracy. This modular network could serve as a platform for other types of deep learning networks, for example segmentation networks, improving their accuracy by implementing them as building blocks of the modular network.

References

[1] B. Juang, W. Hou, and C. Lee. Minimum classification error rate methods for speech recognition. IEEE Transactions on Speech and Audio Processing, 5, 1997.
[2] B. Hariharan, P. Arbelaez, R. B. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. 2015.
[3] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. ImageNet: A large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
[4] K. Fukushima. Neocognitron: A hierarchical neural network capable of visual pattern recognition. Neural Networks, 1, 1988.
[5] Dweepna Garg and Ketan Kotecha. Object Detection from Video Sequences Using Deep Learning: An Overview, pages 137–148. 2018. ISBN 978-981-10-4602-5. doi: 10.1007/978-981-10-4603-2_14.
[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[7] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[8] Minyoung Huh, Pulkit Agrawal, and Alexei A. Efros. What makes ImageNet good for transfer learning? arXiv preprint, 2016.
[9] J. Zhang and H. Deng. Gene selection for classification of microarray data based on the Bayes error. BMC Bioinformatics, 8, 2007.
[10] K. Jonathan, J. Hailin, Y. Jianchao, and F. Li. Fine-grained recognition without part annotations. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[11] Karol Zak. CNTK-hotel-pictures-classificator. https://github.com/karolzak/cntk-hotel-pictures-classificator, 2018.
[12] A. Karpathy. Large-scale video classification with convolutional neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.
[13] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[14] Christoph Käding, Erik Rodner, Alexander Freytag, and Joachim Denzler. Fine-tuning deep neural networks in continuous learning scenarios. pages 588–605, 2017. doi: 10.1007/978-3-319-54526-4_43.
[15] A. Lecler, L. Duron, Daniel Balvay, Julien Savatovsky, Olivier Bergès, Mathieu Zmuda, Edgard Farah, O. Galatoire, A. Bouchouicha, and Laure Fournier. Combining multiple magnetic resonance imaging sequences provides independent reproducible radiomics features. Scientific Reports, 9, 2019. doi: 10.1038/s41598-018-37984-8.
[16] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[17] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single shot multibox detector. pages 21–37, 2016.
[18] Maxime Oquab, Leon Bottou, Ivan Laptev, and Josef Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1717–1724, 2014.
[19] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22, 2009.
[20] Jielin Qiu, Ge Huang, and Tai Sing Lee. Visual sequence learning in hierarchical prediction networks and primate visual cortex. In Advances in Neural Information Processing Systems, pages 2658–2669, 2019.
[21] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.
[22] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149, June 2017. ISSN 1939-3539. doi: 10.1109/TPAMI.2016.2577031.
[23] Ross Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
[24] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. Int. J. Comput. Vision, 115(3), 2015.
[25] R. Salakhutdinov, A. Torralba, and J. Tenenbaum. Learning to share visual appearance for multiclass object detection. In CVPR 2011, pages 1481–1488, 2011.
[26] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[27] Bharat Singh, Tim K. Marks, Michael Jones, Oncel Tuzel, and Ming Shao. A multi-stream bi-directional recurrent neural network for fine-grained action detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1961–1970, 2016.
[28] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
[29] Bao Xiaomin and Wang Yaming. Apple image segmentation based on the minimum error Bayes decision. Transactions of the Chinese Society of Agricultural Engineering, 5, 2006.
[30] Erez Yahalomi. Deep learning networks for medical images. PhD research plan, March 2019.
[31] Erez Yahalomi, Michael Chernofsky, and Michael Werman. Detection of distal radius fractures trained by a small set of X-ray images and Faster R-CNN. In Intelligent Computing. CompCom 2019. Advances in Intelligent Systems and Computing, volume 997, pages 971–981, 2019.
[32] Shuang Hong Yang and Bao-Gang Hu. Discriminative feature selection by nonparametric Bayes error minimization. IEEE Transactions on Knowledge and Data Engineering, 24(8):1422–1434, 2012.
[33] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pages 3320–3328, 2014.
