A Supervised Neural Autoregressive Topic Model for Simultaneous Image Classification and Annotation


Authors: Yin Zheng, Yu-Jin Zhang, Hugo Larochelle

Yin Zheng
Department of Electronic Engineering, Tsinghua University, Beijing, China, 100084
y-zheng09@mails.tsinghua.edu.cn

Yu-Jin Zhang
Department of Electronic Engineering, Tsinghua University, Beijing, China, 100084
zhang-yj@mail.tsinghua.edu.cn

Hugo Larochelle
Département d'Informatique, Université de Sherbrooke, Sherbrooke (QC), Canada, J1K 2R1
hugo.larochelle@usherbrooke.ca

November 20, 2021

Abstract

Topic modeling based on latent Dirichlet allocation (LDA) has been a framework of choice to perform scene recognition and annotation. Recently, a new type of topic model called the Document Neural Autoregressive Distribution Estimator (DocNADE) was proposed and demonstrated state-of-the-art performance for document modeling. In this work, we show how to successfully apply and extend this model to the context of visual scene modeling. Specifically, we propose SupDocNADE, a supervised extension of DocNADE, that increases the discriminative power of the hidden topic features by incorporating label information into the training objective of the model. We also describe how to leverage information about the spatial position of the visual words and how to embed additional image annotations, so as to simultaneously perform image classification and annotation. We test our model on the Scene15, LabelMe and UIUC-Sports datasets and show that it compares favorably to other topic models such as the supervised variant of LDA.

1 Introduction

Image classification and annotation are two important tasks in computer vision. In image classification, one tries to describe the image globally with a single descriptive label (such as coast, outdoor, inside city, etc.), while annotation focuses on tagging the local content within the image (such as whether it contains "sky", a "car", a "tree", etc.).
Since these two problems are related, it is natural to attempt to solve them jointly. For example, an image labeled as street is more likely to be annotated with "car", "pedestrian" or "building" than with "beach" or "sea water". Although there has been a lot of work on image classification and annotation separately, less work has looked at solving these two problems simultaneously.

Work on image classification and annotation is often based on a topic model, the most popular being latent Dirichlet allocation or LDA [1]. LDA is a generative model for documents that originates from the natural language processing community but that has had great success in computer vision for scene modeling [1, 2]. LDA models a document as a multinomial distribution over topics, where a topic is itself a multinomial distribution over words. While the distribution over topics is specific to each document, the topic-dependent distributions over words are shared across all documents. Topic models can thus extract a meaningful, semantic representation from a document by inferring its latent distribution over topics from the words it contains. In the context of computer vision, LDA can be used by first extracting so-called "visual words" from images, converting the images into visual word documents and training an LDA topic model on the bags of visual words. Image representations learned with LDA have been used successfully for many computer vision tasks such as visual classification [3, 4], annotation [5, 6] and image retrieval [7, 8]. Although the original LDA topic model was proposed as an unsupervised learning method, supervised variants of LDA have been proposed [9, 2]. By modeling both the documents' visual words and their class labels, the discriminative power of the learned image representations can thus be improved.
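To make the LDA pipeline concrete, the following is a minimal sketch of inferring per-image topic proportions from bag-of-visual-words counts with scikit-learn. The count matrix here is random toy data standing in for real visual-word histograms, and the vocabulary and topic sizes are illustrative, not those used in the paper.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy bag-of-visual-words counts: 4 "images", vocabulary of 6 visual words.
rng = np.random.default_rng(0)
counts = rng.integers(0, 5, size=(4, 6))

# Fit a small LDA model and infer each image's distribution over topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(counts)  # rows = per-image topic proportions

print(theta.shape)  # one topic distribution per image
```

Each row of `theta` sums to one and can serve as the image's semantic representation for downstream classification or retrieval.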
At the heart of most topic models is a generative story in which the image's latent representation is generated first and the visual words are subsequently produced from this representation. The appeal of this approach is that the task of extracting the representation from observations is easily framed as a probabilistic inference problem, for which many general-purpose solutions exist. The disadvantage however is that as a model becomes more sophisticated, inference becomes less trivial and more computationally expensive. In LDA for instance, inference of the distribution over topics does not have a closed-form solution and must be approximated, either using variational approximate inference or MCMC sampling. Yet, the model is actually relatively simple, making certain simplifying independence assumptions such as the conditional independence of the visual words given the image's latent distribution over topics.

Recently, an alternative generative modeling approach for documents was proposed by Larochelle and Lauly [10]. Their model, the Document Neural Autoregressive Distribution Estimator (DocNADE), directly models the joint distribution of the words in a document, by decomposing it through the probability chain rule as a product of conditional distributions and modeling each conditional using a neural network. Hence, DocNADE doesn't incorporate any latent random variables over which potentially expensive inference must be performed. Instead, a document representation can be computed efficiently in a simple feed-forward fashion, using the value of the neural network's hidden layer. Larochelle and Lauly [10] also show that DocNADE is a better generative model of text documents and can extract a useful representation for text information retrieval. In this paper, we consider the application of DocNADE in the context of computer vision.
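The chain-rule decomposition can be sketched as below. This is a simplified toy version, not the paper's implementation: it uses randomly initialized weights, and a flat softmax over the vocabulary in place of DocNADE's binary word tree; the variable names (`W`, `c`, `U`, `b`) follow common NADE notation but are otherwise assumptions.

```python
import numpy as np

V, H = 8, 4                               # vocabulary size, hidden units
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(V, H))    # per-word input weights
c = np.zeros(H)                           # hidden bias
U = rng.normal(scale=0.1, size=(H, V))    # output weights
b = np.zeros(V)                           # output bias

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def doc_log_likelihood(words):
    """log p(v) = sum_i log p(v_i | v_<i), computed in one feed-forward pass."""
    acc = np.zeros(H)                     # running sum of preceding words' weights
    total = 0.0
    for v_i in words:
        h = sigmoid(c + acc)              # hidden layer depends only on v_<i
        total += np.log(softmax(b + h @ U)[v_i])
        acc += W[v_i]
    return total

print(doc_log_likelihood([3, 1, 5, 1]))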
More specifically, we propose a supervised variant of DocNADE (SupDocNADE), which models the joint distribution over an image's visual words, annotation words and class label. The model is illustrated in Figure 1. We investigate how to successfully incorporate spatial information about the visual words and highlight the importance of calibrating the generative and discriminative components of the training objective. Our results confirm that this approach can outperform the supervised variant of LDA and is a competitive alternative for scene modeling.

2 Related Work

Simultaneous image classification and annotation is often addressed using models extending the basic LDA topic model. Wang et al. [2] proposed a supervised LDA formulation to tackle this problem. Wang and Mori [11] opted instead for a maximum margin formulation of LDA (MMLDA). Our work also belongs to this line of work, extending topic models to a supervised computer vision problem: our contribution is to extend a different topic model, DocNADE, to this context.

What distinguishes DocNADE from other topic models is its reliance on a neural network architecture. Neural networks are increasingly used for the probabilistic modeling of images (see [12] for a review). In the realm of document modeling, Salakhutdinov and Hinton [13] proposed a Replicated Softmax model for bags of words. DocNADE is in fact inspired by that model and was shown to improve over its performance while being much more computationally efficient. Wan et al. [14] also proposed a hybrid model that combines LDA and a neural network. They applied their model to scene classification only, outperforming approaches based on LDA or on a neural network only. In our experiments, we show that our approach outperforms theirs. Generally speaking, we are not aware of any other work which has considered the problem of jointly classifying and annotating images using a hybrid topic model/neural network approach.
3 Document NADE

In this section, we describe the original DocNADE model. In Larochelle and Lauly [10], DocNADE was used to model documents of real words, belonging to some predefined vocabulary. To model image data, we assume that images have first been converted into a bag of visual words. A standard approach is to learn a vocabulary of visual words by performing K-means clustering on SIFT descriptors densely extracted from all training images. See Section 5.2 for more details about this procedure. From that point on, any image can thus be represented as a bag of visual words v = [v_1, v_2, ..., v_D], where each v_i is the index of the closest K-means cluster to the i-th SIFT descriptor extracted from the image and D is the number of extracted descriptors.

Figure 1: Illustration of SupDocNADE for joint classification and annotation of images. Visual and annotation words are extracted from images and modeled by SupDocNADE, which models the joint distribution of the words v = [v_1, ..., v_D] and class label y as p(v, y) = p(y | v) ∏_i p(v_i | v_1, ..., v_{i-1}). All conditionals p(y | v) and p(v_i | v_1, ..., v_{i-1}) are modeled using neural networks with shared weights. Each predictive word conditional p(v_i | v_1, ..., v_{i-1}) (noted v̂_i for brevity) follows a tree decomposition where each leaf is a possible word. At test time, the annotation words are not used (illustrated with a dotted box) to compute the image's topic feature representation.

DocNADE models the joint probability of the visual words p(v) by rewriting it as

p(v) = ∏_{i=1}^{D} p(v_i | v_{<i})
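The bag-of-visual-words conversion described above can be sketched as follows. Random vectors stand in for real densely extracted SIFT descriptors, and the vocabulary size K is illustrative rather than the value used in the experiments.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for SIFT descriptors (128-D each) pooled from all training images.
rng = np.random.default_rng(0)
train_desc = rng.normal(size=(500, 128))

# Learn a small visual-word vocabulary by K-means clustering.
K = 10
kmeans = KMeans(n_clusters=K, n_init=4, random_state=0).fit(train_desc)

# Convert one image's descriptors into a bag of visual words
# v = [v_1, ..., v_D]: each v_i is the index of the nearest centroid.
image_desc = rng.normal(size=(50, 128))  # D = 50 descriptors for this image
v = kmeans.predict(image_desc)

print(v.shape)  # one visual-word index per descriptor
```

Each image is thereafter handled purely as the index sequence `v`, which is the input DocNADE models autoregressively.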