Visual-Semantic Scene Understanding by Sharing Labels in a Context Network

Ishani Chakraborty, Rutgers University, New Jersey, USA 08854, ishanic@cs.rutgers.edu
Ahmed Elgammal, Rutgers University, New Jersey, USA 08854, elgammal@cs.rutgers.edu

Abstract

We consider the problem of naming objects in complex, natural scenes containing widely varying object appearance and subtly different names. Informed by cognitive research, we propose an approach based on sharing context-based object hypotheses between visual and lexical spaces. To this end, we present the Visual Semantic Integration Model (VSIM) that represents object labels as entities shared between semantic and visual contexts and infers a new image by updating labels through context switching. At the core of VSIM is a semantic Pachinko Allocation Model and a visual nearest neighbor Latent Dirichlet Allocation Model. For inference, we derive an iterative Data Augmentation algorithm that pools the label probabilities and maximizes the joint label posterior of an image. Our model surpasses the performance of state-of-the-art methods in several visual tasks on the challenging SUN09 dataset.

1 Introduction

The human visual system is expert at parsing a complex scene and naming objects within it. But how does a human mind navigate the complex visual layout of objects while using the lexical or semantic knowledge of the environment to precisely identify objects in the scene? While the exact mechanisms are yet unknown, sharing context-based object hypotheses across visual and lexical spaces is known to be one of the key guiding principles in cognition [5, 12].

Figure 1: Top-down facilitation of scene understanding by sharing labels through visual and semantic contexts.

Consider the famous Nighthawks painting in Figure 1. At first glance, the scene is hypothesized into individual objects such as a person, street, etc.
Some of these objects cohere semantically to create a scene context of a roadside bar with people (and other, possibly incorrect contexts). In effect, confidence in some objects can reduce, categories may become more specific (e.g., road as a sidewalk) and some new objects may appear (such as buildings, which may not be visually evident at first glance). We repeat this process of updating our object hypotheses by iterating through our visual and semantic contextual knowledge base.

To facilitate this joint inference, we conceptualize scene understanding as a top-down integration of lexical and visual spaces via object names. The lexical space is the vocabulary of all object names in the knowledge base, while the visual space maps the visual appearances to object names. Each space has a different contextual relationship between objects. In the lexical space, related objects typically appear together in a given environment (semantic context), while in the visual space, related objects are visually similar in their appearance (visual context).

To this end, we present the Visual Semantic Integration Model (VSIM) that connects the semantic and visual contexts through their shared object names. VSIM models scene interpretation as a top-down approach where semantically contextual labels are first created to represent a coherent scene composition. These labels are then reinterpreted with their visually contextual counterparts in the appearance space. Specifically, VSIM is a probabilistic, hierarchical model of latent context and observed features. In the first level, the image is modeled as a distribution over latent semantic contexts which determines the semantic labels that compose the scene. In the next level, each semantic label's visual context determines the appearance features that are finally the observed variables in the model.
Inference in VSIM is initiated in a bottom-up manner, where observed image regions are the only cues used to infer the semantic and visual object labels in the image. That is, the goal of VSIM is to infer the semantic object labels in an image, given its appearance features.

The general overview and our main contributions are as follows.

- For representing complex scene semantics, we introduce the Pachinko Allocation Model (PAM) to effectively capture the semantic hierarchy of concepts in natural images through a directed graphical structure.
- For representing visual context, we propose nearest neighbor based Latent Dirichlet Allocation (nnLDA) that finds discriminative visual concepts. nnLDA exploits the strength of nearest neighbor decisions within a structured generative LDA approach.
- To infer labels in a new image, we derive an iterative Data Augmentation algorithm that alternates between the two context spaces to correctly pool the label probabilities inferred from each space and maximize the label posterior for the image.
- Finally, our Visual Semantic Integration Model (VSIM) is motivated by the human cognitive process of shared context and represents a novel algorithmic formulation of that process. It mimics the cognitive process by representing object labels as entities shared between semantic and visual contexts and inferring a new image by updating labels through context switching. This is the most significant contribution of this paper and is conceptually different from previous approaches, where context has been used mostly as a filter to reduce false detections.

Our novel approach, combined with an appropriate probabilistic technique for inference, is able to surpass the state-of-the-art approaches for identifying diverse object categories in natural scenes.

2 Related Work

Context in cognition has been studied in psychophysics and linguistics. Particularly, studies by Bar et al.
[5] found evidence of an interactive context network in the brain that facilitates object prediction through so-called context frames of reference that bind visually or semantically related objects. Swinney's Cross-Modal Priming Task [12] proved that lexical access follows a multiple hypothesis model where listeners accessed multiple meanings for ambiguous words even when faced with strong biasing contexts. These findings provide a strong motivation towards modeling an interactive context network integrating visual and lexical spaces.

Mapping images to related text is gaining importance in large scale learning of web images. One strand of research is aimed at generating natural language sentences from objects and their inter-relations [7]. Our problem is related to the joint image and word sense discrimination encountered in image retrieval tasks. These works have analyzed polysemy in images returned from keyword searches, in terms of visual senses of keywords. However, the ambiguity in these tasks lies mostly in the visual domain since keywords are usually static, sparse and well-defined. Hence, the sense mapping between keywords and images is either abstracted through a single latent sense [9], picked up from knowledge sources (e.g., Wikipedia), or the image and text words are jointly modeled through a single latent variable [1]. As shown in the results, these simple correlations are not effective in mapping the rich interactions between semantic and visual space.

Figure 2: Flowchart of the overall approach.

Hierarchical context networks provide a nice framework for scene understanding due to the modular separation of concepts at different granularities. Mostly, previous work has used semantic networks as filters to remove incompatible object detections in the scene [3, 13]. A visual hierarchy of object classes is proposed in [6].
Our work is related to topic modeling algorithms for scene understanding [14, 2, 1, 8]. However, these models try to capture overlapping information between images and text to reinforce each other. In contrast, our method captures the complementary information in these contexts and exploits it to improve the quality of the inferred labels. To the best of our knowledge, no previous work has considered such a joint inference framework across dichotomous information spaces.

3 Modeling Context Network

Given an image, we wish to predict a set of objects that best fit the image content. VSIM models these object labels as the connection between two different context networks. The semantic context of labels is modeled in a Pachinko Allocation Model (PAM) through a hierarchy of semantic supertopics and subtopics. The visual context of labels is established through visual topics of a nearest neighbor Latent Dirichlet Allocation (nnLDA). Intuitively, the topic distributions encode the grouping between labels in these two contexts. We briefly give an overview of our approach.

3.1 Semantic context in lexical space

A complex natural scene may be composed of multiple sub-scenes, each with a distinct coherency of objects within them. Thus, scene context should be established through a hierarchy of semantics. Single-level topic models like Latent Dirichlet Allocation, which is commonly used for modeling context in images [14, 1], cannot encode such relationships. In VSIM, this complexity of semantic context is encoded with a probabilistic Directed Acyclic Graph (DAG) of topics known as the Pachinko Allocation Model (PAM). Unlike single-level LDA models, PAM explicitly models relations among words and topics through arbitrary, nested and possibly sparse dependencies. In natural scenes, this enables discovery of fine-grained and tightly coherent subscenes.

Figure 3 shows a subset of the semantic context network of supertopics and subtopics.
These topics are learnt from co-occurrence statistics of object labels in images from the SUN09 dataset. Not only do labels that occur together very frequently form strong clusters in subtopic space (bookshelf subtopic: books, notebook, table), but also the related subtopics are learnt as a higher level supertopic (a bookshelf can be found in isolation or can occur with a living room). Without such an explicit hierarchy of topic ontologies encoded in the PAM, such relations would get captured as "nonsensical" topics.

Figure 6: Illustration of the discovery of the visual topic manifold of {sea, river, snow, water, swimpool} in nnLDA. While nearest neighbors capture dense matchings, a topic manifold captures implicit, spatially extended and sparse relations between labels.

Figure 3: Part of the semantic context graph. Supertopics (gray nodes) and subtopics (shown by most frequent labels). Our interpretation of each subtopic is denoted in red.

Figure 4: Some of the visual topics. The top 5 labels are marked. Some clusters capture general-to-specific objects (door, shop window), casual coincidences (book, text) and intra-class variabilities (person/animals and person/bottles).

Figure 5: Left: predicted object labels using visual context alone (initial). Right: maximum a posteriori labels after joint inference (final). Middle: initial (green) and final (yellow) predicted semantic subtopic distribution.

3.2 Appearance context in visual space

Most latent variable models quantize rich image features to visual words to facilitate a multinomial (word count) modeling of image data. To avoid this lossy quantization while keeping the convenience of a multinomial inference, we represent images in a supervised feature space using a bag-of-labels formulation. Here, the feature space is spatially clustered using nearest neighbor groupings of features and a bag-of-labels is constructed from each grouping.
Such a nearest neighborhood is effective in finding dense similarities within a geometrically constrained location. However, it does not capture the implicit, spatially extended and sparse relations between labels. To learn such relations, we construct a bag-of-labels for each region and perform LDA on it. This topic model formulation enables the rich feature space to be projected into arbitrary topic manifolds (Figure 6), such that sparse and strong visual similarity in labels can be discovered. Some other topic distributions are shown in Figure 4.

3.3 Inference by label sharing between switching contexts

Labels are inferred in VSIM through joint inference in the semantic and visual space. We derive an iterative Data Augmentation algorithm which alternates between the two spaces to arrive at the joint inference.

Figure 7: Visual-Semantic Integration Model (VSIM).

Figure 8: Definition of variables in VSIM:
- I: number of images in the corpus
- R: variable number of regions per image
- NN: number of ε-nearest labels of an image region
- S: number of semantic supertopics
- T: number of semantic subtopics
- A: number of visual topics
- L: number of object labels
- zs, zt, z: semantic supertopic, subtopic and visual topic samples
- lz, lw: semantic and observed label samples
- (α0, θs): Dirichlet-Multinomial of supertopics, per image
- (αs, θt): Dirichlet-Multinomial of subtopics, per image
- (β, φ): Dirichlet-Multinomial over semantic labels, per subtopic
- (α, θ): Dirichlet-Multinomial of visual topics, per object label
- (ψ, γ): Dirichlet-Multinomial over labels, per visual topic

The room
The joint inference shows a behavior quite similar to the cog- nitiv e process described in the introduction. Some MAP labels show an increase in beliefs, such as ”bed”, ”cush- ion”, and ”curtain”. Some labels are updated to a more specific class e.g., ”water” is relabeled as ”sea”. Ambigu- ity between visually similar labels is reduced as contextu- ally more appropriate labels are enhanced. F or example, “car” is changed to “boat”, since “boat” is visually simi- lar to “car” but fits better with “sea”. The label probabil- ities at regions which don’t fit any context become more diffused, e.g., p(sidewalk) and hence can be thresholded out. As label probabilities at image regions conv erge, the semantic subtopic distribution becomes more peaky . By alternating between the two spaces, the scene finally con- ver ges to bedroom and seavie w related concepts (topic 11 and 13, resp.). 4 The V isual-Semantic Integration Model Fig. 7 sho ws the plate notation of the VSIM graphical model. The core of VSIM is a cascade of two models: the P AM model which generates semantic labels, followed by the nnLD A model which generates the observed la- bels. Context is represented through topics, which are Dirichlet-Multinomial distributions over labels. Hence, a semantic topic is learnt as a probabilistic group of labels that co-occur frequently in images. A visual topic is learnt when labels map to similar appearance features and thus get grouped frequently in different bags of labels. Finally , the structural hierarchy of supertopics and subtopics in the P AM are also probabilistically estimated by adapting the scale parameters of the subtopic Dirichlets based on the most likely paths discov ered by the model. For inference, we deri ve an iterati ve Data Augmenta- tion (DA) algorithm that alternates between the two con- text spaces to correctly pool the label probabilities in- ferred in each space and maximizes the label posteriors. 
Concisely, the inference is seeded with the most likely labels based on image features alone. This is achieved by passing the bag of labels through nnLDA inference. Within each iteration, label samples are drawn from the current distribution and used to estimate scene multinomials in the PAM. Then, these semantic multinomials update the label distribution. This information is propagated back to the visual model to update and normalize the observed labels at each image location. We use collapsed Gibbs sampling for estimating topic distributions in each iteration. In the following, we provide details of parameter estimation and inference in VSIM.

4.1 Semantic Context: Generating object labels

We model the co-occurrence context of object labels using a three-level Pachinko Allocation Model (PAM) [11]. Given an image corpus of size I, labels l of an image d are generated by sampling topics at two levels. The per-image supertopic multinomials θs are sampled from a symmetric Dirichlet hyperparameter α0, while subtopic multinomials θt are sampled for each supertopic from an asymmetric Dirichlet hyperparameter αs. The role of αs is crucial since it establishes the sparse DAG structure between super and subtopics. The label mixing multinomials φ per subtopic are sampled corpus-wide from a symmetric Dirichlet hyperparameter β. Finally, each label l in the image is sampled from a topic path (zs, zt). We refer to these object labels as the semantic labels.

- For each image d = {1 ... I}: θs ~ Dir(α0); θt ~ Dir(αs), s = {1 ... S}.
- For each subtopic zt = {1 ... T}: φt ~ Dir(β).
- For each label lz = {1 ... L}:
  - Sample a topic path: zs ~ Mult(θs), zt ~ Mult(θt_zs).
  - Sample a label from the subtopic: lz ~ Mult(φ_zt).

4.2 Appearance Context: Generating image regions

We formulate appearance context through a bag-of-labels representation.
To achieve this, we first project image regions and their corresponding labels into a supervised feature space and find their nearest labels in an ε-neighborhood. Thus, each groundtruth (semantic) label corresponds to a bag of observed labels based on its image features. Let $\{\vec f_1, \vec f_2, \cdots, \vec f_r\}$ be the features of r image regions in a feature space (e.g., SIFT) with the corresponding semantic labels $\{l_{z_1}, l_{z_2}, \cdots, l_{z_r}\}$. The bag-of-labels is computed as follows:

$$ l_{w_r} = \{\, l' \mid \forall j:\ \| f_r(l_{z_r}) - f_j(l') \| \le \epsilon \,\}, \quad (1) $$

where $\|\cdot\|$ is a distance norm in the feature space. Thus, a semantic label lz is associated with a set of observed labels lw = {l'}. This induces a many-to-many, bipartite relation between semantic labels and the observed labels (from the same label pool), which is then modeled effectively using an LDA. Specifically, the topic multinomials θ capture the visually polysemous relation between one semantic label lz and multiple topics, while the label mixing multinomials γ capture the visually synonymous relation between multiple topics and one observed label lw. The generative process is as follows.

- For each label lz = {1 ... L}: θ_l ~ Dir(α).
- For each visual topic z = {1 ... A}: γ_z ~ Dir(ψ).
- For each of the NN observed labels:
  - Sample a visual topic z ~ Mult(θ_lz).
  - Sample a label lw ~ Mult(γ_z).

4.3 Parameter Learning and Inference

The joint probability distribution over all the variables in the model, given hyperparameters, is

$$ J.P.D. = \Pr(\underbrace{z_s, z_t, l_z, z}_{\text{hidden}},\ \underbrace{l_w}_{\text{observed}} \mid \underbrace{\alpha_s, \theta_s, \theta_t, \phi, \theta, \gamma}_{\text{parameters}}) \quad (2) $$

For learning the parameters in the model, we use annotated images, hence lz is known. We use these groundtruth labels to ground the semantic labels during learning, due to which the semantic and the visual models become conditionally independent.
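The ε-neighborhood bag-of-labels construction of Eq. 1, which grounds the visual model during learning, can be sketched as below. The array names and toy feature vectors are illustrative assumptions; the paper builds one such bag per feature space (color, texton, SIFT).

```python
import numpy as np

def bag_of_labels(region_feature, all_features, all_labels, eps):
    """Collect every label whose feature lies within eps of the region's
    feature, i.e. lw_r = { l' : ||f_r - f_j(l')|| <= eps } (Eq. 1).
    A sketch under an assumed Euclidean norm."""
    dists = np.linalg.norm(all_features - region_feature, axis=1)
    return [lbl for lbl, d in zip(all_labels, dists) if d <= eps]

# Toy usage with made-up 2-D features:
feats = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
labels = ["road", "sidewalk", "sky"]
bag = bag_of_labels(np.array([0.0, 0.0]), feats, labels, eps=0.5)
# -> ["road", "sidewalk"]
```

Note that the resulting bag draws observed labels from the same label pool as the semantic labels, which is what induces the bipartite relation modeled by the nnLDA.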
We use collapsed Gibbs sampling for estimating topic distributions in each context space.

4.4 Estimating semantic topics

The joint distribution of the semantic model from the above J.P.D. formulation yields:

$$ JPD_{semantic} = \prod_I p(\theta_s \mid \alpha_0) \Big(\prod_S p(\theta_t \mid \alpha_s)\Big) \prod_R p(z_s \mid \theta_s)\, p(z_t \mid \theta_{t,z_s})\, p(l_z \mid \phi_{z_t}) \prod_T p(\phi \mid \beta) \quad (3) $$

The proposal for Gibbs sampling of supertopic and subtopic pairs for the i-th label is derived to be [11]:

$$ P(z_{s_i}{=}s, z_{t_i}{=}t \mid l_z{=}l, z_{s_{\sim i}}, z_{t_{\sim i}}, \alpha_0, \alpha_s, \beta) \propto \frac{n^d_s + \alpha_0}{\sum_S n^d_s + S\alpha_0} \cdot \frac{n^d_{st} + \alpha_{s_{st}}}{\sum_T n^d_{st} + \sum_T \alpha_{s_{st}}} \cdot \frac{n^t_l + \beta}{\sum_L n^t_l + L\beta}, \quad (4) $$

where $z_{s_i}$ and $z_{t_i}$ are the supertopic and subtopic assignments for $l_{z_i}$, and $z_{s_{\sim i}}$ and $z_{t_{\sim i}}$ are the topic assignments for all the remaining labels in the image. Excluding the current token, $n^d_s$ is the count of supertopic s in image d and $n^d_{st}$ is the number of times subtopic t is sampled from supertopic s within image d. $n^t_l$ denotes the number of times label l is assigned to subtopic t in the entire corpus.

4.5 Estimating topic hierarchy

We estimate αs within each Gibbs iteration of the PAM. These hyperparameter values capture the structural links between supertopics and subtopics. Therefore, the strength of these connections needs to be estimated in a data-driven manner. We use co-occurrence counts of super and subtopics to estimate αs. Specifically, we use moment matching to estimate the approximate MLE of αs. In this technique, the model mean and variance of each αs_st is computed by matching them to the sample mean and variance of topics' co-occurrence counts across all images.
$$ E[\alpha_{s_{st}}] = \frac{\alpha_{s_{st}}}{\sum_T \alpha_{s_{st}}} = \frac{1}{N} \sum_I \frac{n^d_{st}}{\sum_T n^d_{st}}, \qquad \log \sum_T \alpha_{s_{st}} = \frac{1}{T-1} \sum_{T-1} \log \left[ \frac{E[\alpha_{s_{st}}]\,(1 - E[\alpha_{s_{st}}])}{var[\alpha_{s_{st}}]} - 1 \right] \quad (6) $$

4.6 Estimating visual topics

Starting with the joint distribution of the visual model, the proposal distribution for the visual topic of the i-th label is derived to be:

$$ JPD_{visual} = \prod_{NN} p(z \mid \theta_{l_z})\, p(l_w \mid \gamma_z) \Big\{ \prod_L p(\theta_l \mid \alpha) \Big\} \Big\{ \prod_A p(\gamma_a \mid \psi) \Big\} $$

$$ P(z_i{=}a \mid l_w, l_z, z_{\sim i}, \alpha, \psi) \propto \frac{n^{l_z}_a + \alpha}{\sum_A n^{l_z}_a + A\alpha} \cdot \frac{n^a_{l_w} + \psi}{\sum_L n^a_{l_w} + L\psi}, \quad (7) $$

where $z_i$ is the visual topic of the i-th semantic label and $n^{l_z}_a$ is the number of times topic a is sampled for semantic label lz. $n^a_{l_w}$ denotes the count of an observed label lw being assigned to topic a across the entire corpus.

To get an intuitive insight into the counts that relate topics and labels, we consider a pair of labels (lz, lw) and a topic a. If both $n^{l_z}_a$ and $n^a_{l_w}$ are low, topics would be assigned at random. If $n^{l_z}_a$ is high but $n^a_{l_w}$ is low, it means that topic a is consistent with lz but the observed lw is an outlier. If $n^{l_z}_a$ is low but $n^a_{l_w}$ is high, the observed label has a generic appearance (e.g., a white wall), so it is matched to many objects. The signal is relevant and peaky only when $n^{l_z}_a$ and $n^a_{l_w}$ are both high, which implies that topic a would be consistently sampled for this (lz, lw) pair and they would be grouped together.

4.6.1 Inference

Given an image, the VSIM model needs to compute posterior probabilities over the semantic labels lz for each image region, conditioned on the bag of observed labels lw. This distribution over latent variables is as follows:

$$ P(l_z \mid l_w) = \sum_{z_s, z_t, z} P(l_z, z_s, z_t, z \mid l_w) = \sum_{z_s, z_t, z} P(l_z \mid z_s, z_t, z, l_w)\, P(z_s, z_t, z \mid l_w) \quad (8) $$

In the second equation, the first term denotes the conditional probability of lz given the augmented (observed and latent) data (zs, zt, z, lw).
The second term gives the predictive likelihood of latent data given observations. Based on the dependencies from the graphical model, it can be computed by marginalizing over lz, as follows:

$$ P(z_s, z_t, z \mid l_w) = \sum_{l_z} P(z_s, z_t \mid l_z)\, P(l_z \mid l_w)\, P(z \mid l_w, l_z) \quad (9) $$

The above formulation leads to a coupled inference problem. For solving $P(l_z \mid l_w)$, we need the predictive topic probabilities $P(z_s, z_t, a \mid l_w)$. However, since the link between the semantic topics and the observed lw passes through lz, it needs to be marginalized out.

We derive a Data Augmentation algorithm [15] to solve this inference problem. The idea of DA is similar to Expectation Maximization, but applies to posterior sampling. The general framework consists of an iterative sampling scheme with two steps: (1) a Data imputation step, in which the current guess of the posterior distribution $p(l_z \mid l_w)$ is used to generate multiple samples of the hidden variables (zs, zt, z) from the predictive distribution in Eq. 9, and (2) a Posterior sampling step, in which the posterior is updated to be a mixture of the $N_s$ augmented posteriors and approximated as the average of $p(l_z \mid l_w, z_s, z_t, a)$. Thus, a stationary posterior distribution is achieved through successive substitution. Formally, the two steps can be represented as follows.

4.7 Data Imputation

$$ \{z_s^{(t+1)}, z_t^{(t+1)}, z^{(t+1)}\} \sim \sum_{l_z} P(z_s, z_t \mid l_z)\, P(z \mid l_w, l_z)\, P^{(t)}(l_z \mid l_w) \quad (10) $$

In this step, we begin from the visual end of the model. We create a bag of labels for each image region and perform nnLDA inference. The Gibbs proposal during inference updates only the topic assignments of the new labels, while keeping the counts obtained from the learning phase fixed. The $\theta_r$ for a region is computed after a number of iterations of topic sampling is completed (100, in our case).
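A single constrained Gibbs draw of this kind, holding the learnt corpus counts fixed and resampling only the new tokens, might look as follows. The array names and the exact count bookkeeping are our assumptions for illustration, not the paper's precise proposal.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_visual_topic(lw, n_region_topic, n_topic_label, alpha, psi):
    """One collapsed Gibbs draw for the visual topic of an observed label lw.

    `n_region_topic[a]` counts topic assignments of new tokens in the current
    region; `n_topic_label[a, l]` holds corpus counts from the learning phase,
    which stay fixed during imputation. Illustrative sketch only.
    """
    A, L = n_topic_label.shape
    # Unnormalized proposal: region-level topic preference times the
    # (smoothed) corpus probability of emitting lw from each topic.
    p = ((n_region_topic + alpha) *
         (n_topic_label[:, lw] + psi) / (n_topic_label.sum(axis=1) + L * psi))
    p /= p.sum()
    return rng.choice(A, p=p)
```

After enough such draws, the normalized region-level topic counts give the region's topic proportions used downstream.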
$$ p(\tilde a \mid l_w, \tilde a_{\sim i}; M) = \left(\frac{\tilde n^r_{a,\sim i} + \alpha}{\sum_A \tilde n^r_a + A\alpha}\right) \left(\frac{n^a_{l_w} + \tilde n^a_{l_w,\sim i} + \psi}{\sum_L (n^a_{l_w} + \tilde n^a_{l_w,\sim i}) + L\psi}\right), \quad (11) $$

$$ P(\tilde a \mid l_w) = \tilde\theta^r_a = \frac{\tilde n^r_a + \alpha_a}{\sum_A (\tilde n^r_a + \alpha_a)} \quad (12) $$

Using the estimated topic proportions, we compute the likelihood of lz at each region:

$$ P(l_z \mid l_w) = \sum_A P(l_z{=}l \mid z{=}a)\, P(z{=}a \mid l_w) = \sum_A \frac{P(z{=}a \mid l_z{=}l)\, P(l_z{=}l)}{P(z{=}a)}\, P(z{=}a \mid l_w) = \sum_A \theta^l_a\, \frac{n_l}{n_a}\, \tilde\theta^r_a, \quad (13) $$

where $\theta^l_a$ is the learnt topic proportion for label l and $\tilde\theta^r_a$ is the estimated topic proportion for the new image region r. $n_l$ and $n_a$ are the corpus-wide counts of label l and topic a, respectively. From this multinomial distribution, we now draw samples of lz. These are used as observations for the semantic model. The Gibbs proposal for inference in the PAM is similar in form to Eq. 4 (the estimation proposal for sampling (zs, zt)); however, only the topic assignments for new labels are changed (as in nnLDA inference). (See the supplemental material for the derivation.) Thus, for each image, we are able to compute a complete set of topic samples (zs, zt, z).

4.8 Posterior Sampling

$$ P^{(t+1)}(l_z \mid l_w) \sim \frac{1}{N_s} \sum_{N_s} P(l_z \mid z_s^{(t+1)}, z_t^{(t+1)}, M) \cdot P^{(t)}(l_w) \quad (14) $$

In this step, we start from the semantic end of the model. The imputed topic samples, along with the learnt semantic parameters $\{\tilde\theta_s, \tilde\theta_t, \phi\}$, are used to generatively draw the semantic labels for an image. The average over multiple samples $N_s$ is used to update the $P(l_z \mid l_w)$ distribution. This new semantic label distribution is used to modulate the observed label distribution at each image region. After this step, we return to data imputation to use the new set of observed label probabilities.

5 Experiments

Dataset and Experiment Settings: We evaluate our proposed VSIM model by performing different visual tasks on the SUN09 dataset [3]. SUN09 is a collection of 8600 natural, indoor and outdoor images.
Each image contains an average of 7 different annotated objects and the average occupancy of each object is 5% of the image size. The frequencies of object categories follow a power law distribution. We consider the top 200 categories. 4367 images were used for learning the models and 4317 images were used for testing. For learning the model, we use the annotated ground-truth locations and labels provided with the dataset. In test images, we use the bounding boxes detected by the DPM detector [4] as image regions, but not their decisions. A 256 × 256 image has about 400 regions.

Feature Representation: Each image region is represented using three types of features, as described in [10]. Color is represented by normalized R, G, B histograms with their means and variances, for a 36-length vector. Texture is captured using a 40-filter texton bank; we use a codebook of 100 textons for a texton histogram. Dense SIFT features are used for discriminative patterns using a 400-word histogram. Each feature space is used to generate a bag-of-labels representation that feeds into the visual context model.

Table 1: Average precision improvement with nnLDA compared to NN.
Highest AP gain: pillow (+29.47), text (+15.37), desk (+14.83), armchair (+12.55), flowers (+12.45), cabinet (+12.01), fence (+11.19)
Least AP gain: shoes (−40.95), ingots (−13.19), fish (−8.80), chandelier (−7.30), monitor (−6.30), glass (−6.27), faucet (−6.06)
Mean Average Precision improvement = +4.19%

Model representation: The PAM is learnt with 20 supertopics and 50 subtopics. The supertopic Dirichlet α0 is set to a uniform value of 1. The subtopic Dirichlet αs is learnt during parameter estimation. For nnLDA, we choose a neighborhood radius ε empirically for each feature space and 50 visual topics are learnt. During parameter estimation, the Gibbs sampling is run for 1000 iterations in each model.
For posterior inference of topics, we use 100 iterations. During the imputation step, 500 samples are generated for the average distribution. The DA algorithm is run for 6 iterations. The final label posterior distributions are thresholded for label retrieval.

5.1 Initial label prediction: nnLDA versus nearest neighbors (NN)

To compare nnLDA with NN, we use average precision gain (AP gain) in label retrieval, in Table 1. The mean AP gain across 200 object categories is 4.19%, in which 168 objects show positive gain. It is interesting to note that the objects with maximum gain are categories with few training examples in the dataset (e.g., pillow, text) that are visually similar to frequent categories. In contrast, the objects with maximum loss are categories with distinct appearances (e.g., shoes, ingots), which might be losing distinctiveness through contextual groupings. The results highlight that nnLDA is better at handling the data imbalance problem and visually ambiguous objects.

5.2 Semantic scene prediction using VSIM

We compare the scene detection performance of our model vis-a-vis the groundtruth. The ground-truth scene multinomial is computed by grounding semantic labels with groundtruth labels and inferring the PAM super and subtopics. We estimate the multinomials from image regions. Symmetric Kullback-Leibler divergence between subtopic multinomials is used to evaluate how closely our joint inference fits the true distribution.

We also compare our performance to two baselines: Correspondence LDA (CorrLDA [1]) and the Total Scene Understanding model (TSU [8]). CorrLDA models both visual words and lexical words as children of the same topic. This implies that lexical labels and visual features must display similar contextual groupings. TSU models a single semantic topic for an image and assumes a one-to-one correspondence between object labels and visual words.
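The evaluation metric above, symmetric KL divergence between subtopic multinomials, can be sketched as below; the small epsilon smoothing is our addition to guard against zero entries, not part of the paper's protocol.

```python
import numpy as np

def symmetric_kl(p, q, eps=1e-12):
    """Symmetric Kullback-Leibler divergence KL(p||q) + KL(q||p) between two
    multinomials. `eps` (our assumption) avoids log/divide-by-zero."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

# Identical subtopic distributions score (near) zero; divergent ones score higher.
symmetric_kl([0.5, 0.5], [0.5, 0.5])  # ~0.0
```

Lower values indicate that the inferred subtopic multinomial sits closer to the ground-truth one, which is how Table 2's "Scene dist." column is read.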
We implemented the supervised versions of both these models and compare them with our performance. The KL divergence measure is lowest for the joint VSIM model (Table 2). Conceptually, this means our model accurately maps the visual-semantic space and therefore generalizes better on the test set. Algorithmically, it implies that labels predicted by our joint inference technique closely match the groundtruth labels.

5.3 Predicting top labels

We look at the 5 most confident labels predicted by the models and verify their presence in the groundtruth. The results are shown in Figure 9. Our performance improves over hcontext [3], both at the initial stage (without semantic context) and after joint inference. This is because, unlike hcontext, which relies on DPM outputs and filters out incompatible detections through a tree context, we are able to retrieve missed detections from the visual processing by reinforcing them later through semantics. Table 2 shows results of other baseline generative models. Qualitative results are shown in the supplemental material.

Figure 9: Accuracy of prediction of the top N (1-5) labels using VSIM on 200 categories versus hcontext on 107 categories and CorrLDA [1].

Figure 10: Mean precision of object categories sorted by training size. The precisions of every consecutive 25 objects are averaged. Compared to Felzenszwalb's DPM detector [4], our method is much less sensitive to training size.

Table 2: Model comparisons showing (1) Kullback-Leibler divergence between estimated and groundtruth scene parameters, and (2) average accuracy of prediction of the most confident label in an image.

                  TSU    CorrLDA  Init VSIM  Joint VSIM
Scene dist.       44.03  33.42    19.57      13.21
Top label pred.   0.29   0.36     0.63       0.87

5.4 Object detection

We use precision to report our scores and compare with the DPM detector [4] in Figure 10. The relation between precision and training/learning size is highlighted by sorting
object categories from most to least frequent and averaging their precisions over every 25 objects (for a smooth trend). We see that the DPM precision falls quickly as the size of the training set reduces. In contrast, our method generalizes better and performs favorably across all object categories. This advantage on impoverished data is due to a richer set of constraints that prevents overfitting in our model. We show that VSIM better handles the data-imbalance problem frequently seen in learning problems with natural categories, which follow a power-law distribution.

We also show the precisions of some objects and compare them to the hcontext detector in Table 3. We report the precisions at 0.25 False Positives Per Image (FPPI). Since our model can handle fewer training examples, we consider a larger number of object categories (200 vs. 107 in hcontext). The blanks in the table correspond to objects where hcontext gives no response.

Precision at 0.25 FPPI (Joint VSIM, Init VSIM, hcontext [3]):
a) people (0.74, 0.0, -), cars (0.68, 0.42, -), food (0.63, 0.0, -), picture (0.71, 0.25, 0.76)
b) boat (0.79, 0.17, -), truck (0.82, 0.3, 0.85), painting (0.62, 0.23, -), poster (0.55, 0.0, 0.57), shop window (0.73, 0.11, -), balcony (0.91, 0.64, 0.80)
c) videos (0.93, 0.24, 1.0), bottles (0.72, 0.34, 0.60), books (0.72, 0.28, 0.80), merchandise (0.93, 0.68, -)
d) cow (0.93, 0.0, -), fish (0.87, 0.45, -), deck chair (0.67, 0.0, -), umbrella (0.59, 0.07, 0.57)

Table 3: Precision scores of some SUN09 objects.

6 Conclusions

In this paper we have presented VSIM, a scene understanding system that captures both the semantics of a scene and the visual ambiguities that arise due to mapping into image space, within a single model.
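The smoothing behind Figure 10 in Section 5.4, sorting categories by training frequency and averaging precisions over consecutive blocks of 25 categories, can be sketched as follows (a minimal illustration; the function and variable names are our own):

```python
import numpy as np

def binned_precision(precisions, train_counts, bin_size=25):
    """Sort per-category precisions from most to least frequent
    training category, then average over consecutive bins of
    bin_size categories to expose the precision-vs-training-size
    trend. Categories beyond the last full bin are dropped."""
    order = np.argsort(train_counts)[::-1]  # most frequent category first
    ranked = np.asarray(precisions, dtype=float)[order]
    n_bins = len(ranked) // bin_size
    return ranked[:n_bins * bin_size].reshape(n_bins, bin_size).mean(axis=1)
```

A detector that is sensitive to training-set size (as the paper reports for DPM) shows binned precision falling off toward the rare-category bins, while a flat trend indicates insensitivity to training size.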
We explain how VSIM is biologically sound and show that it performs well statistically on a variety of visual tasks. We believe that VSIM maps the lexical-visual space accurately by sharing label hypotheses between semantic and appearance contexts, and hence is able to generalize well to new images. In future work, we want to develop this method to identify new objects and learn new contexts in the wild.

References

[1] D. Blei and M. Jordan. Modeling annotated data. In SIGIR, 2003.
[2] L. Cao and F. Li. Spatially coherent latent topic model for concurrent segmentation and classification of objects and scenes. In ICCV, 2007.
[3] M. Choi, J. Lim, A. Torralba, and A. Willsky. Exploiting hierarchical context on a large database of object categories. In CVPR, 2010.
[4] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. IEEE PAMI, 2010.
[5] M. Fenske, E. Aminoff, N. Gronau, and M. Bar. Top-down facilitation of visual object recognition: Object-based and context-based contributions. Progress in Brain Research, 2006.
[6] T. Gao and D. Koller. Discriminative learning of relaxed hierarchy for large-scale visual recognition. In ICCV, 2011.
[7] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. Berg, and T. Berg. Baby talk: Understanding and generating simple image descriptions. In CVPR, 2011.
[8] L. Li, R. Socher, and L. Fei-Fei. Towards total scene understanding: Classification, annotation and segmentation in an automatic framework. In CVPR, 2009.
[9] A. Lucchi and J. Weston. Joint image and word sense discrimination for image retrieval. 2012.
[10] T. Malisiewicz, J. Huang, and A. Efros. Detecting objects via multiple segmentations and latent topic models. Carnegie Mellon University Tech Report, 2006.
[11] D. Mimno, W. Li, and A. McCallum. Pachinko allocation: DAG-structured mixture models of topic correlations. In ICML, 2006.
[12] D. Swinney. Lexical access during sentence comprehension: (Re)consideration of context effects. Journal of Verbal Learning and Verbal Behavior, 1979.
[13] A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, and S. Belongie. Objects in context. In ICCV, 2007.
[14] E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. Describing visual scenes using transformed objects and parts. IJCV, 2008.
[15] M. Tanner and W. Wong. The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 1987.