Modular network for high accuracy object detection


Authors: Erez Yahalomi

Abstract

We present a novel modular object detection convolutional neural network that significantly improves the accuracy of object detection. The network consists of two stages in a hierarchical structure. The first stage is a single network that detects general classes. The second stage consists of several separate networks that refine the classification and localization of the objects of each general class. Compared to a state-of-the-art object detection network, the classification error of the modular network is reduced by approximately 3-5 times, from 12% to 2.5%-4.5%. The network is easy to implement and achieves 0.94 mAP. The network architecture is a platform for improving the accuracy of many detection networks and other types of deep learning networks. We show experimentally and theoretically that a deep learning network initialized by transfer learning becomes more accurate as the number of classes it is later trained to detect becomes smaller.

1 Introduction

There is a constant effort to increase the accuracy of deep learning networks for object detection. A major topic in object detection is fine-grained detection [10, 2, 27, 7] for distinguishing differences between similar object classes. In this paper, we present a novel, highly accurate deep learning network for computer vision object detection, in particular for fine-grained object detection. Our contribution is a new modular object detection network of two stages in a hierarchical structure, from detection of general classes to more detailed classes. The first stage is one deep learning object detection network that detects multi-class objects where the classes are general. The second stage consists of several separate deep learning object detection networks, each trained to detect only the fine-grained classes that belong to one of the general classes of the first stage.
Images in which the first stage detects an object of one of the general classes are passed on to the appropriate network in the second stage for more detailed identification of the object's type and location. We compared the results of our modular object detection network to a state-of-the-art object detection network trained to detect the same classes as the modular network. The experiments showed that the modular network has significantly higher accuracy. We show both experimentally and theoretically that a deep learning network designed to detect a smaller number of classes and initially trained by transfer learning is more accurate than a network trained to detect more classes. The modular network architecture suggested in this paper can be used to increase the accuracy of state-of-the-art object detection networks by integrating them as the building blocks of this modular network without changing the intensive optimizations carried out on their architecture. Other types of networks can improve their accuracy as well by being inserted as building blocks into this modular network platform.

2 Related Work

We are the first to propose a modular network with a hierarchical structure for fine-grained object detection that consists entirely of deep learning object detection networks [30]. An additional novelty of our modular network is that input images are passed on for detection to the appropriate second stage networks based on the object classes, and their confidence scores, detected by the first stage object detection network.

2.1 Object detection

Notable convolutional neural networks for object detection are [23, 17, 21, 28] and Faster R-CNN [22], which consists of a classification network and a region proposal network that divides the image into rectangular regions, followed by regression for additional accuracy in classification and location.
Most state-of-the-art object detection networks include a core image classification network, such as AlexNet [13], VGG [26], or ResNet [6]. These networks use transfer learning based on training on a large image data set, such as ImageNet [24, 3] or COCO [16].

2.2 Hierarchical structures

Hierarchical structures appear in many forms in computer vision [4, 25]. Jarrett et al. [14] present a hierarchical feature extraction and classification system with fast (feed-forward) processing. The hierarchy stacks one or several feature extraction stages, each of which consists of a filter bank layer, non-linear transformation layers, and a pooling layer. Salakhutdinov et al. [25] presented a hierarchical classification model that allows rare objects to borrow statistical strength from related objects that may have many training instances. They use a hierarchical classification model in which the parameters of each class are given by the sum of parameters along the tree.

3 The modular network

3.1 Modular network architecture

We present in this paper a new modular and hierarchical object detection network. The network consists of two stages. The first is a deep learning object detection network trained to detect predetermined general classes; the second stage consists of several deep learning object detection networks, each trained independently on more fine-grained classes belonging to a single general class of the first stage network. All the building block networks inside the modular network are trained on negative images as well. All the object detection networks inside the modular network are whole and separate object detection networks. Each deep learning network in the modular network independently goes through the complete object detection process of training and inference. The full image data set for inference is inserted into the first stage network.
If an object in an image is detected as belonging to one of the network's general classes, the unchanged image is passed on for inference by the second stage network trained to detect sub-classes of this class. The purpose of the second stage networks is to distinguish between objects of similar classes, producing more detailed classification and more accurate localization of each object in the image. Each sub-network inside the modular network is initialized by transfer learning weights [8, 12, 18, 19, 33] trained on the ImageNet database. Figure 1 shows the modular network of our experiment. The building blocks, or sub-networks, of the modular network are Faster R-CNN (frcnn for short) networks [22]. In the first stage, there is a single network trained to detect five general object classes. Based on the first stage network's output, images are passed on for fine-grained detection in the second stage, to the appropriate network trained to detect the detailed classes belonging to the general class determined by the first stage. One of the main reasons for better accuracy over a regular multi-class network is that each of the networks inside this modular network is designated to detect fewer classes than a regular multi-class network. For a very large number of classes, a possible further modification of the modular network would be to add additional hierarchical stages.

3.2 Algorithm and the modular network construction

To detect multiple classes, we use in the first stage an object detection network trained by transfer learning. Similar classes are merged into a general class. The first stage network is trained to detect general classes C_i and additional negative images that do not belong to any of these general classes. For each of the general classes, we train a second stage network on the same training images of this general class and on negative images.
This time, we sort and label the training images with fine-grained classes, all belonging to this general class. Images are then input into the first stage network for inference. Input images with an object detected as belonging to a general class, on the basis of the network's confidence score for the detected object, are passed on to the second stage network dedicated to that class for fine-grained object classification and localization.

Figure 1: A modular network whose first stage is a single deep learning network (frcnn) trained to detect five general classes. Its second stage consists of five separate frcnn networks, each trained to detect two distinct sub-classes of one of the general classes.

3.3 Advantages and risk of the modular network

The advantages of the modular network are as follows. In each of the sub-convolutional neural networks inside the modular network, there are fewer classes than in a regular network designated to detect the same number of classes as the whole modular network. Thus, there are more features, filters, and network parameters dedicated to the detection of each class, resulting in better accuracy. A small number of features per class reduces the ability to distinguish between similar classes and causes errors in detecting rare class objects. When the number of features is small, features are formed to identify multiple classes, adding errors in fine-grained object detection. Fewer classes in an object detection network mean potentially fewer bounding boxes of detected objects in the image, which gives fewer errors in identifying the objects and finding their locations. The advantage of the hierarchical structure of the modular network, compared to detection by many unconnected networks that each detect a few classes, is that the hierarchical structure drastically cuts down the number of required inferences, as the networks are arranged in a tree structure.
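The two-stage routing described in Section 3.2 can be sketched in a few lines. This is a minimal illustration, not the paper's code: the detection callables, the class names, and the 0.5 confidence threshold are all assumptions made for the example.

```python
# Minimal sketch of the modular network's two-stage inference routing.
# Each "network" is modeled as a callable returning a list of
# (class_name, confidence, bounding_box) detections.

def modular_inference(image, first_stage, second_stage_nets, threshold=0.5):
    """Detect general classes first, then pass the unchanged image to the
    fine-grained network of each detected general class."""
    results = []
    for general_class, confidence, _box in first_stage(image):
        if confidence < threshold or general_class not in second_stage_nets:
            continue  # low-confidence or negative detections are not routed
        results.extend(second_stage_nets[general_class](image))
    return results

# Toy stand-ins for trained Faster R-CNN building blocks:
first = lambda img: [("dog", 0.9, (10, 10, 200, 200))]
fine = {"dog": lambda img: [("Pekinese", 0.95, (12, 11, 198, 199))]}
print(modular_inference("image.jpg", first, fine))
```

In Modular Network v.2 (Section 3.3), the routing would instead forward the entire inference image set once any general class is detected, which reduces the cost of first-stage false negatives.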
A risk of the modular network is as follows. Assume we use the same type of object detection network as the multi-class network for the building block networks of the modular network. If the multi-class network has low accuracy, then it may be preferred, since the building block networks inside the modular network would need a very large improvement in accuracy over the multi-class network to make the modular network more accurate. The condition for the accuracy of the modular network to be better than that of a multi-class network is:

a < (a + Δ1)(a + Δ2)    (1)

where a represents the multi-class network accuracy, Δ1 is the improvement in accuracy of the first stage of the modular network compared to the multi-class network, and Δ2 is the improvement in accuracy of the second stage compared to the multi-class network. Most state-of-the-art object detection networks are accurate enough to be used as the building block network of the modular network, allowing a modular network with higher accuracy than the selected state-of-the-art object detection network.

A further risk of the modular network is the detection of false negatives by the first stage network. This may reduce accuracy, as some images with true objects may be omitted from the input of the second stage networks. To deal with this problem, we designed a second version of the modular network specifically for image sequences [5, 15, 20], where the same object is assumed to appear in more than one image. The network architecture of this version, denoted Modular Network v.2, is the same as Modular Network v.1; the difference is that once an object of a general class is detected in the first stage, the entire inference image set is sent for inference to the appropriate fine-grained network in the second stage. In this way, the loss of accuracy due to false negative detection in the first stage is reduced.

4 Convolutional neural network classification error model
This model describes how reducing the number of classes for detection in a convolutional neural network (CNN) reduces the network's classification error. Each of the building block networks inside the modular network has fewer classes than the regular multi-class network. Let x = {x_1 ... x_f} be the feature space. Let c be a set of classes, c = {c_0 ... c_n}. Every detection of an object in an image is defined by a set of features that are active if this object appears in the image. For example, the feature set {x_m ... x_p} identifies objects belonging to class C_1. N represents the total number of features of the designated classes that the CNN can identify. L and T are the numbers of features of the designated classes that the CNN can identify based on transfer learning and fine-tuning [33, 14, 25], respectively, where each feature belongs to a single class. U is the number of features that the CNN can identify that are common to several classes. N = L + T + U. When each of the designated classes has a similar number of training images, S, the number of features detecting a designated class, is approximated as

S ≈ (L + T)/n + U.

In this approximation, the number of features for detecting a single designated class is inversely related to n, the number of the CNN's designated classes. The smaller the n, the more features there are for detecting the designated class, making the detection of this class's objects more accurate. The parameters that determine sup K, the upper bound on the number of features of all the classes that a CNN can identify, include: r, the number of parameters in the CNN; a, the number of filters; d, the size of the filters; h, the number of filter channels; and q, the number of layers in the CNN. These parameters are constant for each network. In this model, every CNN has an upper bound on the total number of features, sup K(r, a, d, h, q), that it can identify without increasing the classification errors.
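The inverse relation between n and the per-class feature count can be illustrated numerically. The sketch below assumes the reading S ≈ (L + T)/n + U of the approximation above, and the values of L, T, and U are hypothetical, chosen only to show the trend.

```python
# Illustration of the per-class feature budget S ≈ (L + T)/n + U:
# fewer designated classes n leave more features per class.
# L, T, U values here are hypothetical.

def features_per_class(L, T, U, n):
    """Approximate number of features detecting one designated class."""
    return (L + T) / n + U

L, T, U = 4000, 2000, 500  # transfer-learned, fine-tuned, shared features
print(features_per_class(L, T, U, n=10))  # 1100.0
print(features_per_class(L, T, U, n=2))   # 3500.0
```

Going from ten designated classes to two roughly triples the per-class feature budget in this toy setting, which is the mechanism the model offers for the modular network's accuracy gain.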
A classification error caused by having more features than the upper bound can arise, for example, from two channels in the same filter, where the weight patterns formed in each channel detect different classes. Suppose each of these channels performs a convolution with its respective feature map, and the features of the different object classes on the different feature maps are located at similar locations in the input feature maps. The two output feature maps of the two channel patterns can then partially overlap in their locations. Let A and B be matrices representing the two channels' output feature maps. Some of the feature weights in matrix A can have the same pixel coordinates as the feature weights in matrix B:

Σ_{(i,j) ∈ G} (|A|_{i,j} + |B|_{i,j}) > Σ_{(i,j) ∈ G} |A|_{i,j}    (2)

In Eq. 2, i and j are the row and column coordinates of the elements in matrices A or B, and G is the set of all the coordinates that are active in both the A and B matrices. Eq. 2 indicates that when the elements at these coordinates from both feature maps are added, the sum no longer represents a feature map of the object detected in matrix A but a deformation caused by summing the features of two different classes' objects. This can cause classification errors. It follows that increasing the number of filters a in the layers of a deep learning network increases the network's accuracy, or lets it identify more classes without reducing accuracy, since the different feature kernels can be spread over more filters.

To estimate the classification error, the Bayes error is used [32, 29, 1, 9]. As an example, we analyzed the classification of two fine-grained classes, C_1 and C_0. According to the Bayes error estimation, there is a probability that feature x_i appears in the feature map when there is an object of class C_0 in the image. There is also a probability density that feature x_i is activated when an object of class C_1 is in the image.
The classification error caused by feature x_i is the smaller of these two probability densities. The sum of the smallest probability densities over all the features activated by objects of the two classes is the classification error. Assuming the probability densities of being activated by objects of classes C_1 or C_0 are known for each of the features in the network, the probability of classification error is described in Equation 3, where P(C_0) and P(C_1) are the prior probability densities of classes C_0 and C_1, respectively, and P(x_i | C_0) and P(x_i | C_1) are the conditional probability densities that feature x_i is active given that the class is C_0 or C_1, respectively. An additional criterion in Equation 3 is the significance of feature x_i in the classification, because if an active feature does not influence the classification of an object, it does not contribute to the classification probability of the object's class. The weights of feature x_i for classes C_0 and C_1 are denoted by w_i(C_0) and w_i(C_1), respectively. The values of the weights w_i(C_0) and w_i(C_1) are based on how many times feature x_i was essential for classification out of all the times this feature was activated by the class's objects.

P_error = Σ_{i=1}^{N_f} min( P(x_i | C_0) P(C_0) w_i(C_0), P(x_i | C_1) P(C_1) w_i(C_1) )    (3)

The probability densities of the features are given as discrete values, which we approximate as a continuous graph. The graphs in Figures 2a and 2b present the features' probability densities of being activated by objects of classes C_0 and C_1. The X-axis is the feature range, where i is the filter index number. The Y-axis presents the probability density that a feature is activated. In the graphs, all features with a probability of matching a particular class are in the same area on the X-axis.
Features with a probability of matching both classes are displayed in the graph in an area shared by both classes. The Bayes classification error is the sum, or integral, of the minimal probability densities of every feature within the mutual area, which is the overlap of the C_0 and C_1 curves.

Figure 2: Features vs. probability densities. (a) Ten-class network. (b) Two-class network. The X-axis in the graphs presents the network features, denoted i. The Y-axis is the features' probability density, denoted PDF.

The graph in Figure 2a illustrates the features' probability densities for identifying C_0 and C_1 in a network trained to detect ten different classes and negative images. The active features are nearly a quarter of the total features in the network. The area of misclassified features is significant compared to the total areas of the features of classes C_0 and C_1, which indicates a large classification error. This is because there are many classes and the number of features dedicated to each class is small. An additional cause is that with many classes, the total number of features exceeds the upper bound that is optimal for this network, resulting in false detections. The graph in Figure 2b illustrates a network trained to detect only two classes and negative images. Most of the features detected by this network are of classes C_0 and C_1. The misclassified feature area is small compared to the total area of both classes, indicating that the classification error is small. The number of features for each class is large, enabling the training of features that detect even more detailed features, further reducing the classification error. The first stage of the modular network is trained to detect general classes, and C_0 and C_1 are both included in C_g, a general class: C_g = C_0 ∪ C_1. This eliminates the error of misclassification between the two classes, resulting in a low classification error.
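The weighted Bayes error of Equation 3 can be evaluated numerically once the per-feature probabilities and weights are given. The sketch below uses hypothetical values; it illustrates the formula, not any trained network.

```python
# Numeric sketch of Equation 3: the weighted Bayes classification error.
# All probabilities and weights below are hypothetical illustration values.

def bayes_error(p_x_c0, p_x_c1, prior0, prior1, w0, w1):
    """P_error = sum_i min(P(x_i|C0) P(C0) w_i(C0), P(x_i|C1) P(C1) w_i(C1))."""
    return sum(
        min(p0 * prior0 * a, p1 * prior1 * b)
        for p0, p1, a, b in zip(p_x_c0, p_x_c1, w0, w1)
    )

# Three features: the first mostly signals C0, the last mostly C1, and the
# middle one is shared by both classes; the shared feature dominates the error.
p_c0 = [0.60, 0.30, 0.10]
p_c1 = [0.10, 0.30, 0.60]
err = bayes_error(p_c0, p_c1, 0.5, 0.5, w0=[1, 1, 1], w1=[1, 1, 1])
print(round(err, 3))  # 0.25: min terms 0.05 + 0.15 + 0.05
```

The shared middle feature contributes the largest min term, matching the figure's point that the overlap area between the class curves is what drives the classification error.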
Classification errors in this network are between general classes.

5 Experiments

5.1 Implementation

The original data set for training contains 522 original images, expanded to 46,044 training images by mirror symmetry, sharpness, brightness, and contrast augmentations, and used as the training data by both the modular and the multi-class networks. The images are distributed similarly between 10 classes, or five pairs of similar classes: Pekinese, spaniel, kayak, canoe, swan, duck, sport bike, mountain bike, Mars, Saturn, plus negative images with no labels. The test set contained an additional 125 original images, expanded by cross-validation to 647 test images. The size of the network input images is 800x800 pixels. The multi-class network is Faster R-CNN with a backbone classification network, VGG-16, initialized by transfer learning training on the ImageNet 2012 database. The building blocks of the modular network are also Faster R-CNN with a VGG-16 backbone and the same initialization. Faster R-CNN networks trained on 40-50 original images per class are prevalent for various object detection tasks [31, 11]. To compare the multi-class network and the modular network, both networks have the same hyper-parameter values, previously optimized on classes other than those the networks are designated to detect. The modular network and the multi-class network had fine-tuning training on all the network layers. Each of the networks was trained for 40 epochs, with learning rates of 0.001 for the first 10 epochs, 0.0001 for the next 10 epochs, and 0.00001 for the last 20 epochs. Both the modular and the multi-class networks ran inference on this test data. Most of the original images for the training and test sets were taken from the Caltech 101 image database, and the rest were randomly chosen from the internet.
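The learning-rate schedule described above can be written as a small step function; this is only a restatement of the stated schedule, not published training code.

```python
# The learning-rate schedule used for all networks in the experiments:
# 0.001 for epochs 0-9, 0.0001 for epochs 10-19, 0.00001 for epochs 20-39.

def learning_rate(epoch):
    if epoch < 10:
        return 1e-3
    if epoch < 20:
        return 1e-4
    return 1e-5

schedule = [learning_rate(e) for e in range(40)]
print(schedule[0], schedule[10], schedule[25])  # 0.001 0.0001 1e-05
```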
5.2 Experiment results

The Faster R-CNN multi-class object detection network was trained on the data set of 46,044 images with the ten classes and the negative images. The training loss [22] is 0.0229. The multi-class network inference results are 0.87 mAP and a 12% classification error. The modular network has two stages. The first stage network is trained on the same data set as the multi-class network, including the negative images, but the objects in the images are labeled with five general classes instead of the more detailed 10 classes of the multi-class network. The modular network's first stage classes are dog, planet, bike, boat, and bird. Each of these classes is a unification of a pair of similar classes from the 10 classes labeled for training by the multi-class network. The training loss of the modular network's first stage is 0.0216. In the second stage, each network is trained on one fine-grained pair of the multi-class network's classes plus negative images. For example, one network is trained on two dog species classes, Pekinese and spaniel, and negative images, with a training loss of 0.0151, while a second network is trained to detect two planets, Mars and Saturn, and negative images, with a training loss of 0.0170. The modular network v1 inference results are 0.94 mAP and a 4.5% classification error. The modular network v2 inference results are 0.95 mAP and a 2.5% classification error. These experimental results indicate that the modular network is significantly more accurate than the multi-class network. Table 1 shows the mean average precision, mAP, of the modular networks and the multi-class network, which is a regular Faster R-CNN network, tested on the same images.

Table 1: Object detection average precision

  Network                                          mAP
  Modular net v1                                   0.94
  Modular net v2                                   0.95
  Multi-class net (Faster R-CNN)                   0.87
  First stage of the Modular net, general classes  0.93
The modular network v1 mAP is calculated by taking into account the images detected as false negatives in the first stage of the modular network, which therefore do not appear in the mAP of the second stage. Each false negative's precision is rated as zero, and its weight in the calculation of the whole modular network's mAP is one divided by the total number of inference images in the modular network. In Table 1, the total modular networks' mAP is higher than the modular network's first stage mAP, even though their classification error is larger than the first stage classification error. This means the second stage fine-grained object detection networks have higher accuracy in detecting object locations than the first stage object detection network. Table 2 shows the experimental results for the networks' classification errors. The modular network's classification error is significantly reduced, to 4.5% and 2.5% for Modular net v1 and Modular net v2, respectively, compared to 12% for the multi-class network, the regular Faster R-CNN.

Table 2: Classification error

  Network                                          Classification error
  Modular net v1                                   4.5%
  Modular net v2                                   2.5%
  Multi-class net (Faster R-CNN)                   12%
  First stage of the Modular net, general classes  2.25%

Figure 3: The left column shows object detection images by the multi-class network, the center column shows images detected by the general-classes network, and the right column shows images detected by the fine-grained networks.

In the first three rows of the first column of Figure 3, the images detected by the multi-class network have classification errors, whereas the general-class network and the fine-grained network both detected the same objects correctly. The second row of images shows that the detection of object location is more accurate in the image detected by the fine-grained network (right) than in the image detected by the multi-class network (left).
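The false-negative correction to the modular network's mAP described above can be sketched as an average of per-image precisions in which each first-stage false negative enters as a zero with weight one over the total number of inference images. The precision values and counts below are hypothetical, not the experiment's data.

```python
# Sketch of the modular-network mAP correction: images lost as first-stage
# false negatives contribute a precision of zero, each weighted by one over
# the total number of inference images. Values here are hypothetical.

def modular_map(second_stage_precisions, n_false_negatives):
    """Average per-image precision, counting first-stage false negatives
    as zero-precision entries."""
    n_total = len(second_stage_precisions) + n_false_negatives
    return sum(second_stage_precisions) / n_total

precisions = [0.96] * 98  # images that reached the second stage
print(round(modular_map(precisions, n_false_negatives=2), 3))  # 0.941
```

Even a small false-negative rate in the first stage pulls the whole-network figure below the second-stage precision, which is why Modular Network v.2 forwards the full image set once a general class is detected.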
6 Discussion

Our experimental results show that a network with fewer classes is more accurate. The results show that most of the classification errors in the multi-class network are between similar classes. The accuracy of the modular network, for both v.1 and v.2, is higher by 7.5% and 9.5%, respectively, than that of the multi-class network, a regular object detection network. This is a reduction of the classification error by 2.7 and 4.8 times, respectively. We found that the accuracy of a network trained to detect only two similar objects is 9.5% higher than that of the multi-class network that detects 10 classes. The training results indicate that the training loss becomes smaller as the number of classes the network is trained to detect becomes smaller.

A fundamental question in machine learning is what kind of learning has higher accuracy: a network trained to detect only a few focused classes, or one trained to detect many classes over a wide range of subjects? We found that a network initially trained on a wide range of classes by transfer learning and later trained to detect fewer classes by fine-tuning on all the network layers is more accurate than a network initialized by transfer learning and later trained to detect a larger number of classes. Previous works on transfer learning [8, 33] determined that a network initially trained by transfer learning and later trained to detect designated classes is more accurate than one only trained to detect the same designated classes. From both findings we obtain: a network initially trained by transfer learning and then designated to detect a small number of classes is more accurate than one trained to detect a larger number of classes, with or without transfer learning.

7 Conclusion

The modular network presented in this paper significantly improves object detection performance in both classification and localization.
This is true especially for detection that requires differentiating between similar classes. This modular network improves state-of-the-art deep learning object detection networks without requiring changes to the networks' architecture or even hyper-parameters; adjusting the hyper-parameters may give even higher performance. We found that reducing the number of classes a convolutional neural network is trained to detect increases the network's accuracy. This modular network could serve as a platform for other types of deep learning networks, for example segmentation networks, improving their accuracy by implementing them as building blocks of the modular network.

References

[1] B. Juang, W. Hou, and C. Lee. Minimum classification error rate methods for speech recognition. IEEE Transactions on Speech and Audio Processing, 5, 1997.
[2] B. Hariharan, P. Arbelaez, R. B. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. 2015.
[3] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. ImageNet: A large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
[4] K. Fukushima. Neocognitron: A hierarchical neural network capable of visual pattern recognition. Neural Networks, 1, 1988.
[5] Dweepna Garg and Ketan Kotecha. Object Detection from Video Sequences Using Deep Learning: An Overview, pages 137–148. 2018. ISBN 978-981-10-4602-5. doi: 10.1007/978-981-10-4603-2_14.
[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[7] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[8] Minyoung Huh, Pulkit Agrawal, and Alexei A. Efros. What makes ImageNet good for transfer learning? arXiv preprint, 2016.
[9] J. Zhang and H. Deng. Gene selection for classification of microarray data based on the Bayes error. BMC Bioinformatics, 8, 2007.
[10] K. Jonathan, J. Hailin, Y. Jianchao, and F. Li. Fine-grained recognition without part annotations. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[11] Karol Zak. CNTK-hotel-pictures-classificator. https://github.com/karolzak/cntk-hotel-pictures-classificator, 2018.
[12] A. Karpathy. Large-scale video classification with convolutional neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.
[13] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[14] Christoph Käding, Erik Rodner, Alexander Freytag, and Joachim Denzler. Fine-tuning deep neural networks in continuous learning scenarios. pages 588–605, 2017. doi: 10.1007/978-3-319-54526-4_43.
[15] A. Lecler, L. Duron, Daniel Balvay, Julien Savatovsky, Olivier Bergès, Mathieu Zmuda, Edgard Farah, O. Galatoire, A. Bouchouicha, and Laure Fournier. Combining multiple magnetic resonance imaging sequences provides independent reproducible radiomics features. Scientific Reports, 9, 2019. doi: 10.1038/s41598-018-37984-8.
[16] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[17] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single shot multibox detector. pages 21–37, 2016.
[18] Maxime Oquab, Leon Bottou, Ivan Laptev, and Josef Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1717–1724, 2014.
[19] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22, 2009.
[20] Jielin Qiu, Ge Huang, and Tai Sing Lee. Visual sequence learning in hierarchical prediction networks and primate visual cortex. In Advances in Neural Information Processing Systems, pages 2658–2669, 2019.
[21] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.
[22] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149, June 2017. ISSN 1939-3539. doi: 10.1109/TPAMI.2016.2577031.
[23] Ross Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
[24] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. Int. J. Comput. Vision, 115(3), 2015.
[25] R. Salakhutdinov, A. Torralba, and J. Tenenbaum. Learning to share visual appearance for multiclass object detection. In CVPR 2011, pages 1481–1488, 2011.
[26] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[27] Bharat Singh, Tim K. Marks, Michael Jones, Oncel Tuzel, and Ming Shao. A multi-stream bi-directional recurrent neural network for fine-grained action detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1961–1970, 2016.
[28] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
[29] Bao Xiaomin and Wang Yaming. Apple image segmentation based on the minimum error Bayes decision. Transactions of the Chinese Society of Agricultural Engineering, 5, 2006.
[30] Erez Yahalomi. Deep learning networks for medical images. PhD research plan, March 2019.
[31] Erez Yahalomi, Michael Chernofsky, and Michael Werman. Detection of distal radius fractures trained by a small set of X-ray images and Faster R-CNN. In Intelligent Computing. CompCom 2019. Advances in Intelligent Systems and Computing, volume 997, pages 971–981, 2019.
[32] Shuang Hong Yang and Bao-Gang Hu. Discriminative feature selection by nonparametric Bayes error minimization. IEEE Transactions on Knowledge and Data Engineering, 24(8):1422–1434, 2012.
[33] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pages 3320–3328, 2014.
