A Supervised Neural Autoregressive Topic Model for Simultaneous Image Classification and Annotation


Authors: Yin Zheng, Yu-Jin Zhang, Hugo Larochelle

Yin Zheng
Department of Electronic Engineering, Tsinghua University, Beijing, China, 100084
y-zheng09@mails.tsinghua.edu.cn

Yu-Jin Zhang
Department of Electronic Engineering, Tsinghua University, Beijing, China, 100084
zhang-yj@mail.tsinghua.edu.cn

Hugo Larochelle
Département d'Informatique, Université de Sherbrooke, Sherbrooke (QC), Canada, J1K 2R1
hugo.larochelle@usherbrooke.ca

November 20, 2021

Abstract

Topic modeling based on latent Dirichlet allocation (LDA) has been a framework of choice to perform scene recognition and annotation. Recently, a new type of topic model called the Document Neural Autoregressive Distribution Estimator (DocNADE) was proposed and demonstrated state-of-the-art performance for document modeling. In this work, we show how to successfully apply and extend this model to the context of visual scene modeling. Specifically, we propose SupDocNADE, a supervised extension of DocNADE, that increases the discriminative power of the hidden topic features by incorporating label information into the training objective of the model. We also describe how to leverage information about the spatial position of the visual words and how to embed additional image annotations, so as to simultaneously perform image classification and annotation. We test our model on the Scene15, LabelMe and UIUC-Sports datasets and show that it compares favorably to other topic models such as the supervised variant of LDA.

1 Introduction

Image classification and annotation are two important tasks in computer vision. In image classification, one tries to describe the image globally with a single descriptive label (such as coast, outdoor, inside city, etc.), while annotation focuses on tagging the local content within the image (such as whether it contains "sky", a "car", a "tree", etc.).
Since these two problems are related, it is natural to attempt to solve them jointly. For example, an image labeled as street is more likely to be annotated with "car", "pedestrian" or "building" than with "beach" or "sea water". Although there has been a lot of work on image classification and annotation separately, less work has looked at solving these two problems simultaneously.

Work on image classification and annotation is often based on a topic model, the most popular being latent Dirichlet allocation or LDA [1]. LDA is a generative model for documents that originates from the natural language processing community but that has had great success in computer vision for scene modeling [1, 2]. LDA models a document as a multinomial distribution over topics, where a topic is itself a multinomial distribution over words. While the distribution over topics is specific to each document, the topic-dependent distributions over words are shared across all documents. Topic models can thus extract a meaningful, semantic representation from a document by inferring its latent distribution over topics from the words it contains. In the context of computer vision, LDA can be used by first extracting so-called "visual words" from images, converting the images into visual word documents and training an LDA topic model on the bags of visual words. Image representations learned with LDA have been used successfully for many computer vision tasks such as visual classification [3, 4], annotation [5, 6] and image retrieval [7, 8]. Although the original LDA topic model was proposed as an unsupervised learning method, supervised variants of LDA have been proposed [9, 2]. By modeling both the documents' visual words and their class labels, the discriminative power of the learned image representations can thus be improved.
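To make the LDA pipeline concrete, the following is a minimal sketch of inferring per-image topic proportions from bag-of-visual-words counts with scikit-learn. The count matrix here is random toy data standing in for real visual-word histograms, and the vocabulary and topic sizes are illustrative, not those used in the paper.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy bag-of-visual-words counts: 4 "images", vocabulary of 6 visual words.
rng = np.random.default_rng(0)
counts = rng.integers(0, 5, size=(4, 6))

# Fit a small LDA model and infer each image's distribution over topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(counts)  # rows = per-image topic proportions

print(theta.shape)  # one topic distribution per image
```

Each row of `theta` sums to one and can serve as the image's semantic representation for downstream classification or retrieval.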
At the heart of most topic models is a generative story in which the image's latent representation is generated first and the visual words are subsequently produced from this representation. The appeal of this approach is that the task of extracting the representation from observations is easily framed as a probabilistic inference problem, for which many general-purpose solutions exist. The disadvantage however is that as a model becomes more sophisticated, inference becomes less trivial and more computationally expensive. In LDA for instance, inference of the distribution over topics does not have a closed-form solution and must be approximated, either using variational approximate inference or MCMC sampling. Yet, the model is actually relatively simple, making certain simplifying independence assumptions such as the conditional independence of the visual words given the image's latent distribution over topics.

Recently, an alternative generative modeling approach for documents was proposed by Larochelle and Lauly [10]. Their model, the Document Neural Autoregressive Distribution Estimator (DocNADE), directly models the joint distribution of the words in a document, by decomposing it through the probability chain rule as a product of conditional distributions and modeling each conditional using a neural network. Hence, DocNADE doesn't incorporate any latent random variables over which potentially expensive inference must be performed. Instead, a document representation can be computed efficiently in a simple feed-forward fashion, using the value of the neural network's hidden layer. Larochelle and Lauly [10] also show that DocNADE is a better generative model of text documents and can extract a useful representation for text information retrieval. In this paper, we consider the application of DocNADE in the context of computer vision.
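The chain-rule decomposition can be sketched as below. This is a simplified toy version, not the paper's implementation: it uses randomly initialized weights, and a flat softmax over the vocabulary in place of DocNADE's binary word tree; the variable names (`W`, `c`, `U`, `b`) follow common NADE notation but are otherwise assumptions.

```python
import numpy as np

V, H = 8, 4                               # vocabulary size, hidden units
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(V, H))    # per-word input weights
c = np.zeros(H)                           # hidden bias
U = rng.normal(scale=0.1, size=(H, V))    # output weights
b = np.zeros(V)                           # output bias

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def doc_log_likelihood(words):
    """log p(v) = sum_i log p(v_i | v_<i), computed in one feed-forward pass."""
    acc = np.zeros(H)                     # running sum of preceding words' weights
    total = 0.0
    for v_i in words:
        h = sigmoid(c + acc)              # hidden layer depends only on v_<i
        total += np.log(softmax(b + h @ U)[v_i])
        acc += W[v_i]
    return total

print(doc_log_likelihood([3, 1, 5, 1]))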
More specifically, we propose a supervised variant of DocNADE (SupDocNADE), which models the joint distribution over an image's visual words, annotation words and class label. The model is illustrated in Figure 1. We investigate how to successfully incorporate spatial information about the visual words and highlight the importance of calibrating the generative and discriminative components of the training objective. Our results confirm that this approach can outperform the supervised variant of LDA and is a competitive alternative for scene modeling.

2 Related Work

Simultaneous image classification and annotation is often addressed using models extending the basic LDA topic model. Wang et al. [2] proposed a supervised LDA formulation to tackle this problem. Wang and Mori [11] opted instead for a maximum margin formulation of LDA (MMLDA). Our work also belongs to this line of work, extending topic models to a supervised computer vision problem: our contribution is to extend a different topic model, DocNADE, to this context.

What distinguishes DocNADE from other topic models is its reliance on a neural network architecture. Neural networks are increasingly used for the probabilistic modeling of images (see [12] for a review). In the realm of document modeling, Salakhutdinov and Hinton [13] proposed a Replicated Softmax model for bags of words. DocNADE is in fact inspired by that model and was shown to improve over its performance while being much more computationally efficient. Wan et al. [14] also proposed a hybrid model that combines LDA and a neural network. They applied their model to scene classification only, outperforming approaches based on LDA or on a neural network only. In our experiments, we show that our approach outperforms theirs. Generally speaking, we are not aware of any other work which has considered the problem of jointly classifying and annotating images using a hybrid topic model/neural network approach.
3 Document NADE

In this section, we describe the original DocNADE model. In Larochelle and Lauly [10], DocNADE was used to model documents of real words, belonging to some predefined vocabulary. To model image data, we assume that images have first been converted into a bag of visual words. A standard approach is to learn a vocabulary of visual words by performing K-means clustering on SIFT descriptors densely extracted from all training images. See Section 5.2 for more details about this procedure. From that point on, any image can thus be represented as a bag of visual words v = [v_1, v_2, ..., v_D], where each v_i is the index of the closest K-means cluster to the i-th SIFT descriptor extracted from the image and D is the number of extracted descriptors.

Figure 1: Illustration of SupDocNADE for joint classification and annotation of images. Visual and annotation words are extracted from images and modeled by SupDocNADE, which models the joint distribution of the words v = [v_1, ..., v_D] and class label y as p(v, y) = p(y | v) ∏_i p(v_i | v_1, ..., v_{i-1}). All conditionals p(y | v) and p(v_i | v_1, ..., v_{i-1}) are modeled using neural networks with shared weights. Each predictive word conditional p(v_i | v_1, ..., v_{i-1}) (noted v̂_i for brevity) follows a tree decomposition where each leaf is a possible word. At test time, the annotation words are not used (illustrated with a dotted box) to compute the image's topic feature representation.

DocNADE models the joint probability of the visual words p(v) by rewriting it as

p(v) = ∏_{i=1}^{D} p(v_i | v_{<i})
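The bag-of-visual-words conversion described above can be sketched as follows. Random vectors stand in for real densely extracted SIFT descriptors, and the vocabulary size K is illustrative rather than the value used in the experiments.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for SIFT descriptors (128-D each) pooled from all training images.
rng = np.random.default_rng(0)
train_desc = rng.normal(size=(500, 128))

# Learn a small visual-word vocabulary by K-means clustering.
K = 10
kmeans = KMeans(n_clusters=K, n_init=4, random_state=0).fit(train_desc)

# Convert one image's descriptors into a bag of visual words
# v = [v_1, ..., v_D]: each v_i is the index of the nearest centroid.
image_desc = rng.normal(size=(50, 128))  # D = 50 descriptors for this image
v = kmeans.predict(image_desc)

print(v.shape)  # one visual-word index per descriptor
```

Each image is thereafter handled purely as the index sequence `v`, which is the input DocNADE models autoregressively.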