Modeling Loosely Annotated Images with Imagined Annotations
Authors: Hong Tang¹, Nozha Boujemma², Yunhao Chen¹
¹ ADREM, Beijing Normal University  ² IMEDIA project, INRIA, France
ECCV-08 submission ID 46
Abstract. In this paper, we present an approach to learning latent semantic analysis models from loosely annotated images for automatic image annotation and indexing. The given annotation in training images is loose for two reasons: (1) ambiguous correspondences between visual features and annotated keywords; (2) incomplete lists of annotated keywords. The second reason motivates us to enrich the incomplete annotation in a simple way before learning topic models. In particular, some "imagined" keywords are poured into the incomplete annotation by measuring similarity between keywords. Both the given and imagined annotations are then used to learn probabilistic topic models for automatically annotating new images. We conduct experiments on a typical Corel dataset of images with loose annotations, and compare the proposed method with state-of-the-art discrete annotation methods (which use a set of discrete blobs to represent an image). The proposed method improves word-driven probabilistic latent semantic analysis (PLSA-words) up to a performance comparable with the best discrete annotation method, while a merit of PLSA-words, i.e., a wider semantic range, is still kept.

1 Introduction

Automatic image annotation is the process of using a computer program to assign keywords to images by learning from annotated images, where the training images might be annotated by hand. Because hand annotation is labor-intensive and subjective, keywords tend to be associated with whole images instead of specific regions, and are relevant only to the labelers. Such images are referred to as loosely annotated images; the Corel images used in [1] are an example. The major challenges are (1) the ambiguous correspondence between visual features and annotated keywords, and (2) the incompleteness of the annotated keywords.
Some efforts have been devoted to learning from loosely annotated images, for instance learning latent semantic models [1-3], translating from discrete visual features to keywords [4-5], using cross-media relevance models [6-7], learning statistical models for image annotation [8-11], and image annotation using multiple-instance learning [12]. In [13], the authors present a method to explicitly reduce the correspondence ambiguity between visual features and keywords before modeling the loosely annotated images. However, to the best of our knowledge, no effort has been explicitly dedicated to resolving the incompleteness of hand annotation before modeling. The likely reason is that it is almost as challenging a task as automatically annotating new images, and it may be infeasible to solve an equally difficult problem as an intermediate stage toward the final goal.

We approach the incompleteness of loosely annotated keywords in a simple way. In particular, we pick out missing keywords in annotated training images and associate them with "imagined" occurrence frequencies by averaging similarity measures between them and the annotated keywords. These retrieved missing keywords are referred to as "imagined" annotations. Then, word-driven probabilistic latent semantic analysis (PLSA-words [3]) is used to model both the given and imagined annotations. Finally, the learned models are used to automatically annotate new images.
Three example images and three kinds of annotations are illustrated in Fig. 1, where the second row corresponds to the imagined annotation of the images.

[Fig. 1 Three kinds of annotations of three example images. The first row is the loose annotation given in the training images. The second row is the imagined annotation derived from all given annotations. The third row is the automatic annotation when the images are used as test images.]

The rest of this paper is organized as follows. In Section 2, we formulate the problem of enriching the incomplete annotation in the framework of automatic image annotation. The proposed algorithm for solving the problem is presented in Section 3. Experimental results and discussions are given in Section 4. Conclusions are drawn in the last section.

2 The Problem

Given N training images D = {I_1, I_2, ..., I_N} with loose annotations, image I_i is represented as a pair of histograms (B_i, W_i). The first element is a blob-histogram with p bins B = {b_1, b_2, ..., b_p}, where the bins are quantized from visual features of segmented regions. The second is a word-histogram with q bins W = {w_1, w_2, ..., w_q}. W_i^0 and W_i^+ (resp. B_i^0 and B_i^+) denote the sets of bins whose values are equal to zero and non-zero, respectively. In other words, W_i^0 and W_i^+ are the sets of non-annotated and annotated keywords in image I_i, respectively. Generally speaking, the number of annotated keywords, |W_i^+|, is far less than that of non-annotated keywords, |W_i^0|; for example, |W_i^+| ≤ 5 and |W_i^0| ≈ 150 in our experiments. Some non-annotated keywords should be used to describe the semantics of an image, but are occasionally missing.
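As a concrete illustration of this notation, the following minimal sketch (with a hypothetical toy histogram; the counts are invented) splits the word-histogram of one image into the annotated set W_i^+ and the non-annotated set W_i^0:

```python
import numpy as np

# Hypothetical word-histogram of one training image I_i over a q-word vocabulary.
q = 8
W_i = np.array([0, 2, 0, 0, 1, 0, 1, 0])

# W_i^+ : annotated keywords (non-zero bins)
W_plus = np.flatnonzero(W_i > 0)
# W_i^0 : non-annotated keywords (zero bins)
W_zero = np.flatnonzero(W_i == 0)

print(list(W_plus))  # [1, 4, 6]
print(list(W_zero))  # [0, 2, 3, 5, 7]
```

In the Corel experiments described later, |W_i^+| ≤ 5 while |W_i^0| ≈ 150, so almost all bins of each word-histogram are zero.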
Our problem is to retrieve those missing keywords in a simple way before the training images are used to learn a model for automatic annotation of new images. To discriminate them from the annotated keywords given in the training images, we refer to them as "imagined" annotations, since the process of obtaining them is associative and the results are not checked. To retrieve these missing keywords, assume a model θ_D has been learned from the training images D. We might predict that a missing word w_j ∈ W_i^0 would be in the imagined annotations if the conditional probability of the keyword given training image I_i satisfies

    p(w_j \mid W_i^+, B_i, \theta_D) \ge \tau,    (1)

where τ (0 ≤ τ ≤ 1) is a threshold to be determined in experiments. Another way is to pick out the top K keywords with the highest conditional probabilities among all missing keywords W_i^0. As shown in Fig. 1, the top 5 keywords are picked out as imagined annotations for the example images.

As we know, automatic image annotation is to assign some keywords to a test image I_test using a computer program. In general, a keyword w_k might be assigned if it has a high conditional probability among the word vocabulary W:

    p(w_k \mid I_{test}, \theta) = p(w_k \mid B_{test}, \theta),    (2)

where B_test is the blob-histogram of the test image I_test, and θ is a learned model for automatic image annotation. Comparing the conditional probabilities in Eqs. (1) and (2), we might conclude that retrieving missing keywords is also a process of automatic image annotation, with the additional condition that the missing keywords depend on the annotated keywords W_i^+.
Therefore, it might be equally difficult to learn the two models (i.e., θ_D and θ). Since our final goal is to automatically annotate new images, we propose a simple algorithm to approximate the conditional probability in Eq. (1) instead of directly learning θ_D from the training dataset D.

In the following, we outline two kinds of latent variable models for automatic image annotation, since they are closely related to the proposed algorithm. One is the probabilistic latent semantic analysis model [3, 14]. The other uses the training image as the latent variable and annotates new images by summing over all training images, for example [4, 5, 6].

If the learned model θ is a Probabilistic Latent Semantic Analysis (PLSA) model with T topics, Eq. (2) becomes

    p(w_k \mid I_{test}, \theta) = \sum_{t=1}^{T} p(t \mid I_{test}) \, p(w_k \mid t),    (3)

where p(t | I_test) and p(w_k | t) are model parameters to be estimated; the first is a mixture coefficient over topics in the test image, the second a distribution over keywords in topic t. To estimate these parameters, one might maximize the log-likelihood of the annotated keywords in the N training images D:

    \max_{\theta} L(D, \theta) = \max_{\theta} \sum_{i=1}^{N} \sum_{k=1}^{q} w_{ik} \log \sum_{t=1}^{T} p(t \mid I_i) \, p(w_k \mid t),    (4)

where w_ik is the count of word w_k in image I_i, i.e., the value of the k-th bin in word-histogram W_i. It can be seen that the terms with w_ik = 0 do not change with the estimated model parameters; in other words, a missing keyword w_k is left out of consideration when estimating the parameters for image I_i. Generally speaking, Eq. (4) is a reasonable criterion for parameter estimation, since maximizing likelihood makes the joint probability of all given observations as high as possible.
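The point that zero counts contribute nothing to Eq. (4) can be checked numerically. The following sketch evaluates Eqs. (3) and (4) for randomly drawn (hypothetical) parameters; sizes and counts are toy values, not those of the Corel experiments:

```python
import numpy as np

# Hypothetical toy sizes: N images, q keywords, T topics.
N, q, T = 4, 6, 2
rng = np.random.default_rng(0)

W = rng.integers(0, 3, size=(N, q)).astype(float)  # word counts w_ik
p_t_given_I = rng.dirichlet(np.ones(T), size=N)    # p(t | I_i), shape N x T
p_w_given_t = rng.dirichlet(np.ones(q), size=T)    # p(w_k | t), shape T x q

# Eq. (3): p(w_k | I_i, theta) = sum_t p(t | I_i) p(w_k | t)
p_w_given_I = p_t_given_I @ p_w_given_t            # shape N x q

# Eq. (4): L(D, theta) = sum_i sum_k w_ik log p(w_k | I_i).
# Every term with w_ik = 0 vanishes, so missing keywords never influence the fit.
L = np.sum(W * np.log(p_w_given_I))
```

Since each row of p_t_given_I and p_w_given_t is a probability distribution, each row of p_w_given_I is one as well, and L is always non-positive.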
In the word-histogram of training image I_i, w_ik = 0 indicates that keyword w_k was not observed in image I_i; consequently, it is expected to be independent of the log-likelihood. Due to the incompleteness of the loose annotation, however, keyword w_k might still be significantly relevant to the content of the image even if w_ik = 0. Assume we have identified this kind of keyword using the estimated probability p(w_k | W_i^+, B_i, θ_D) in Eq. (1). The likelihood in Eq. (4) would then change with these keywords, and a "better" model would be expected. In the proposed algorithm, during the learning of topic models, the zero count w_ik = 0 is directly replaced with the conditional probability in Eq. (1) if it is larger than the given threshold τ.

The other kind of latent variable model for automatic image annotation learns or approximates the conditional probability in Eq. (2) by summing over all training images [11]:

    p(w_k \mid B, \theta) \approx \sum_{i=1}^{N} p(w_k, B \mid I_i) = \sum_{i=1}^{N} p(w_k \mid I_i) \, p(B \mid I_i),    (5)

where the keyword w_k and visual features B are assumed to be conditionally independent given image I_i. In this kind of method, an uninformative prior is actually assumed for each training image; consequently, the priors of keywords (and blobs) are unequal and depend on the given training dataset. In general, these methods are more straightforward and "simpler" to estimate than probabilistic topic models.
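A minimal sketch of this second family of models, under the conditional-independence assumption of Eq. (5); the count matrices are hypothetical toy values, and the blob likelihood p(B | I_i) is modeled here as a simple multinomial product for illustration:

```python
import numpy as np

# Hypothetical per-image word counts (N=3 images, q=5 words) and blob counts (p=4 blobs).
W = np.array([[2., 0., 1., 0., 0.],
              [0., 1., 0., 2., 0.],
              [1., 0., 0., 0., 1.]])
B = np.array([[3., 1., 0., 2.],
              [0., 2., 2., 1.],
              [1., 1., 3., 0.]])

# Per-image empirical distributions p(w_k | I_i) and p(b | I_i).
p_w_I = W / W.sum(axis=1, keepdims=True)
p_b_I = B / B.sum(axis=1, keepdims=True)

def annotate(b_test):
    """Eq. (5): score each keyword by summing over all training images,
    assuming words and blobs are conditionally independent given I_i."""
    lik = np.prod(p_b_I ** b_test, axis=1)  # multinomial-style p(B_test | I_i)
    return lik @ p_w_I                      # sum_i p(w_k | I_i) p(B_test | I_i)

scores = annotate(np.array([1., 0., 2., 0.]))
```

Because the scores aggregate raw empirical distributions, frequent keywords dominate — the bias toward frequent keywords discussed in the experiments.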
In some cases, they are more effective than PLSA because of this "simplicity". However, they are biased toward annotating images with frequent keywords (as shown in the later experiments). For the sake of simplicity, we use a similar method to approximate the conditional probability in Eq. (1) in the proposed algorithm: some missing keywords are picked out as imagined annotations in a simple way, and then both the given and imagined annotations are used to estimate the parameters of PLSA models. In this sense, the proposed algorithm is a combination of the two kinds of latent variable models mentioned above. Accordingly, one might expect the proposed algorithm to benefit from both.

3 The Proposed Algorithm

The proposed algorithm includes two stages: (1) obtaining imagined annotations by approximating the conditional probability of missing keywords given the training images and loose annotations; (2) modeling both the given and imagined annotations using PLSA-words [3]. For convenience, we refer to the proposed algorithm as Virtual-Word driven Probabilistic Latent Semantic Analysis (PLSA-vw). The two stages of PLSA-vw are detailed in the following two subsections.

3.1 Obtaining "Imagined" Annotations

Before the process is specified, we describe how the conditional probability in Eq. (1) is approximated. Assume that the conditional probability of a missing keyword w_j is the average of the joint probabilities between it and the keywords given in the loose annotation:

    p(w_j \mid W_i^+, B_i, \theta_D) \approx \frac{1}{|W_i^+|} \sum_{w_k \in W_i^+} p(w_j, w_k \mid D),    (6)

where |W_i^+| is the number of annotated keywords in training image I_i.
Please note that w_j ∈ W_i^0 and w_k ∈ W_i^+. The approximation in Eq. (6) is achieved by assuming an uninformative prior for the joint probability p(w_j, w_k | D), which is furthermore approximated by summing over all training images:

    p(w_j, w_k \mid D) \approx \sum_{i=1}^{N} p(w_j, w_k \mid I_i) \approx \sum_{i=1}^{N} p(w_j, I_i) \, p(w_k, I_i),    (7)

where keywords w_j and w_k are assumed to be conditionally independent given the training images. Finally, the joint probability that a keyword co-occurs with an image is approximated by

    p(w_j, I_i) = \frac{w_{ij}}{\sum_{i=1}^{N} w_{ij}},    (8)

where w_ij is the count of keyword w_j in image I_i. It is easy to see that the joint probability p(w_j, w_k | D) in Eq. (7) is actually approximated by an inner product between two normalized word-count vectors, i.e., a cosine-like similarity measure between keywords.

Let the count-matrix W (resp. B) be the set of word- (resp. blob-) histograms, where each row corresponds to a training image. The actual approximation method is an inverse process from Eq. (6) to Eq. (8), as shown in the following four steps:

Step 1: Compute the normalized word-count matrix W_norm = W ./ (e_N e_N^T W).

Step 2: Compute the similarity matrix between keywords W_sim = W_norm^T W_norm.

Step 3: Approximate the conditional probability matrix of keywords W_img = (W W_sim) ./ (W e_q e_q^T).
Step 4: Pick out all missing keywords: (1) if W(i, j) > 0, W_img(i, j) is replaced with zero, since w_j has already been annotated in image I_i; (2) if W_img(i, j) ≥ τ, w_j is a missing keyword in image I_i; otherwise, W_img(i, j) is set to zero.

In the above steps, e_x is an x-dimensional column vector whose elements are all equal to 1, and the matrix division ./ is performed entry-wise. All non-zero entries in the matrix W_img now correspond to keywords missing from the loose annotation, which are assumed relevant to the semantics of the images. All keywords retrieved from the training images in this way are referred to as "imagined" annotations, in contrast with the given annotations in the training images.

3.2 Modeling both Given and Imagined Annotations

To automatically annotate new images using the conditional probability of keywords given their visual representation, one might learn the model θ in Eq. (2) by maximizing the log-likelihood in Eq. (4). In our case, we have two kinds of observations, i.e., the given and imagined annotations. Generally speaking, the imagined annotations are not as reliable as the given ones. For example, as shown in Fig. 1, the imagined annotation of the third example image includes "Jet Sky Grass Runway Elephant", where "Jet" is obviously irrelevant to the image. As mentioned before, the imagined annotations are associative and are not checked by human supervision.
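The four steps of subsection 3.1 can be sketched in NumPy as follows, using a hypothetical toy word-count matrix and threshold τ (all values invented for illustration):

```python
import numpy as np

tau = 0.01
# Hypothetical word-count matrix W: N=3 training images, q=4 keywords.
W = np.array([[2., 0., 1., 0.],
              [0., 1., 1., 0.],
              [1., 0., 0., 2.]])
N, q = W.shape
ones_N = np.ones((N, 1))
ones_q = np.ones((q, 1))

# Step 1: column-normalize the counts, per Eq. (8): p(w_j, I_i) = w_ij / sum_i w_ij.
W_norm = W / (ones_N @ ones_N.T @ W)

# Step 2: cosine-like keyword similarity, per Eq. (7): W_sim = W_norm^T W_norm.
W_sim = W_norm.T @ W_norm

# Step 3: conditional probability of every keyword per image, per Eq. (6):
# average the similarity to the annotated keywords of each image.
W_img = (W @ W_sim) / (W @ ones_q @ ones_q.T)

# Step 4: keep only missing keywords whose probability reaches the threshold.
W_img[W > 0] = 0.0        # already annotated in the image
W_img[W_img < tau] = 0.0  # below threshold tau
```

The surviving non-zero entries of W_img are the imagined annotations; with the thresholding replaced by a top-K selection per row, this implements the alternative mentioned after Eq. (1).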
Therefore, we simply regard the approximated conditional probability of a missing keyword as its reliability in the imagined annotations. Furthermore, we use the approximated conditional probability as a real-valued word-count; typically, this real-valued word-count is less than one. We can therefore define a new word-count matrix for learning:

    W^* = W + W_{img},    (9)

where W and W_img are the word-count matrices of the given and imagined annotations, respectively.

It can be seen from the approximation process in subsection 3.1 that the imagined annotations W_img are obtained by bypassing the blob-count matrix B; in other words, these annotations have been imagined without consulting the visual features of the training images. To ensure that the imagination is also reflected in the visual features, we derive a new blob-count matrix for learning in the same way:

    B^* = B + B_{img},    (10)

where the imagined blob-count matrix B_img is obtained from matrix B using the same approximation method as in subsection 3.1. The changes needed in that process are: (1) replacing the matrices W and W_img with B and B_img, respectively; (2) accordingly rewriting the normalized word-count matrix W_norm and the similarity matrix W_sim as B_norm and B_sim; (3) replacing the number of keywords, q, with the number of blobs, p, in Step 3.

Then, PLSA-words is used to model the new observation matrices W^* and B^*. We outline PLSA-words in the following as two stages: (1) estimating the parameters of probabilistic latent semantic analysis models; (2) automatic image annotation of a set of new images.

Parameter estimation: Unlike Eq. (4), the likelihood is given by

    \max_{\theta} L(D, \theta) = \max_{\theta} \sum_{i=1}^{N} \sum_{j=1}^{q} W^*(i, j) \log \sum_{t=1}^{T} p(t \mid I_i) \, p(w_j \mid t),    (11)

where W^*(i, j) is the count of keyword w_j in training image I_i with both the given and imagined annotations. By maximizing Eq. (11), both the topic models of keywords, i.e., p(w_j | t), w_j ∈ W, and the mixture coefficients, i.e., p(t | I_i), t ∈ [1, T] and I_i ∈ D, of each training image can be estimated. Keeping p(t | I_i) untouched, maximizing

    \sum_{i=1}^{N} \sum_{j=1}^{p} B^*(i, j) \log \sum_{t=1}^{T} p(t \mid I_i) \, p(b_j \mid t)    (12)

yields the topic models of blobs, i.e., p(b_j | t), b_j ∈ B, where matrix B^* is given in Eq. (10). This process is referred to as folding-in of visual features in [3].

Automatic image annotation: Given the blob histograms of a set of test images B_test, a new matrix B^*_test is derived as B^*_test = B_test + B^img_test, where the imagined blob-count matrix of the test images is given by

    B^{img}_{test} = (B_{test} B_{sim}) \,./\, (B_{test} e_p e_p^T),    (13)

and the blob-similarity matrix B_sim has been derived from the training images rather than the test images. Then, keeping the learned topic models of blobs p(b_j | t), t ∈ [1, T], untouched, one can obtain the mixture coefficients p(t | I_test) of a test image I_test by maximizing Eq. (12) with B^* replaced by B^*_test. Finally, using the learned mixture coefficients p(t | I_test), one can compute the conditional probability of each keyword given any test image:

    p(w_j \mid B_{test}, \theta) = \sum_{t=1}^{T} p(t \mid I_{test}) \, p(w_j \mid t),    (14)

where the topic models of keywords, p(w_j | t), w_j ∈ W, have been estimated in the parameter estimation stage as shown in Eq. (11).
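The annotation stage can be sketched as below. B_sim, the mixture coefficients p(t | I_test), and the topic models p(w_j | t) are assumed to be already available from training; all numerical values here are hypothetical:

```python
import numpy as np

p = 3  # number of blobs (toy)
B_sim = np.array([[1.0, 0.2, 0.1],
                  [0.2, 1.0, 0.3],
                  [0.1, 0.3, 1.0]])   # blob-similarity learned from training images
B_test = np.array([[2., 1., 0.]])    # blob histogram of one test image
ones_p = np.ones((p, 1))

# Eq. (13): imagined blob counts, averaged over the blobs present in the image.
B_test_img = (B_test @ B_sim) / (B_test @ ones_p @ ones_p.T)
B_test_star = B_test + B_test_img    # B*_test = B_test + B_test^img

# Eq. (14): hypothetical mixture coefficients p(t | I_test) obtained via Eq. (12),
# combined with learned topic models p(w_j | t) over q=4 keywords, T=2 topics.
p_t_test = np.array([0.7, 0.3])
p_w_t = np.array([[0.4, 0.3, 0.2, 0.1],
                  [0.1, 0.1, 0.4, 0.4]])
scores = p_t_test @ p_w_t            # p(w_j | B_test, theta)
top = np.argsort(scores)[::-1]       # keywords ranked for annotation
```

In the experiments, the top 5 keywords of this ranking are reported as the predicted annotation.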
4 Experiments and Discussion

In this section, we evaluate the performance of the proposed algorithm from two aspects. First, we examine the relative improvement of PLSA-vw over PLSA-words in terms of image annotation, indexing, and retrieval. Second, the proposed method is compared with three typical discrete annotation methods, i.e., machine translation (MT) [4], translation table (TT) [5], and the cross-media relevance model (CMRM) [6]. Unlike PLSA-vw and PLSA-words, these methods use the image as the latent variable and sum over all training images to annotate new images [11]. Therefore, the annotation performance of these methods is heavily dominated by the empirical distribution of keywords in the training images; as shown in subsection 4.2, they are biased toward annotating images with frequent keywords.

4.1 Experimental Setting

We conducted experiments on a publicly available Corel dataset (http://kobus.ca/research/data/jmlr_2003), provided by Barnard et al. [1]. The dataset is organized into ten samples drawn from about 16,000 annotated images, for which segmentations, hand annotations, visual features, and blob representations are provided. Each sample is split into a training set and two test sets; the files of the three subsets carry no prefix (training), the prefix "test_1_", and the prefix "test_3_" (novel), respectively. We use the first test set (i.e., "test_1_") as the new images to be annotated. The average sizes of the word and blob vocabularies are 161 and 500, respectively. Moreover, the translation tables used in the machine translation method [4] are also provided in the dataset. The number of topics used in our experiments is 120, and the threshold τ in Eq. (1) is set to 0.01, determined through cross-validation with 10% held-out training images.
We use four indexes (AP, mAP, RP, and RSI) to evaluate the performance of the algorithms; they are related to image annotation, indexing, retrieval, and semantic range, respectively. Given a test image, the top 5 keywords with the highest conditional probability in Eq. (2) are used as the predicted annotation. The first index is the average Annotation Precision over all test images (AP), where the annotation precision of a test image is the ratio of the number of correctly annotated keywords to the number of keywords in the ground-truth annotation. The second index is mean Average Precision (mAP), the mean of the average retrieval precisions of all keywords. The average retrieval precision of a keyword is defined as the sum of the retrieval precisions of the correctly retrieved images at their ranks, divided by the total number of relevant images; here, an image is termed relevant if the querying keyword is in its ground-truth annotation. Therefore, mAP is sensitive to the entire ranking of image indexing by the conditional probability in Eq. (2). The third index, the average Retrieval Precision (RP) of using every keyword as a query, is insensitive to the entire ranking: given the top M retrieved images, the retrieval precision is the number of images containing the querying keyword divided by M. In our experiments, M is equal to 20. The last index is very similar to the "Range of Semantics Identified" (RSI), originally introduced by Barnard et al. in [13, 15].
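Two of these indexes can be computed directly from the predicted and ground-truth keyword sets. A minimal sketch with hypothetical toy annotations (the keyword sets below are invented, loosely echoing Fig. 1):

```python
# Hypothetical ground-truth and top-5 predicted keyword sets for two test images.
ground_truth = [{"sky", "water", "boat"}, {"zebra", "herd", "grass"}]
predicted    = [{"sky", "water", "tree", "coast", "sun"},
                {"zebra", "water", "sky", "herd", "plane"}]

# AP: mean ratio of correctly predicted keywords to ground-truth keywords.
ap = sum(len(p & g) / len(g)
         for p, g in zip(predicted, ground_truth)) / len(ground_truth)

# RSI: percentage of vocabulary keywords correctly annotated at least once
# (here the vocabulary is taken as the union of the ground-truth keywords).
vocab = set().union(*ground_truth)
hit = set().union(*(p & g for p, g in zip(predicted, ground_truth)))
rsi = 100.0 * len(hit) / len(vocab)
```

mAP and RP additionally require the full keyword-conditional ranking of all images, so they are omitted from this sketch.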
RSI measures how many keywords can be correctly identified during image annotation [13]. Since the number of keywords differs across samples in our experiments, we define RSI as the percentage of keywords that are correctly annotated at least once. All indexes mentioned above are averaged over the ten samples.

4.2 Results

Table 1 lists the performance of the automatic annotation methods used in our experiments, where the number in brackets is the variance over the ten samples. For each index (column), the best and second-best methods are marked in red and blue, respectively. Although PLSA-vw is not the best for any index and is only second-best for three of the four indexes, it does, as expected, benefit from the two kinds of latent variable models outlined in Section 2.

Table 1 Performance of the automatic annotation methods in terms of the four indexes (%). The number in brackets is the variance over the 10 samples. The best and second-best performances in each column are marked in red and blue, respectively.

                 AP            mAP           RP            RSI
    PLSA-vw      17.56 (1.34)  12.34 (1.72)  15.89 (3.59)  68.43 (1.96)
    PLSA-words   14.06 (1.85)  10.62 (1.98)  13.82 (2.68)  69.51 (2.35)
    TT           23.38 (1.36)  12.49 (1.29)  16.34 (2.16)  33.48 (1.72)
    MT           21.40 (1.19)   9.35 (0.94)  12.76 (1.34)  43.59 (2.35)
    CMRM         18.42 (0.98)  10.89 (1.34)  13.54 (1.50)  13.08 (0.88)

First of all, it can be seen that PLSA-vw improves over PLSA-words in terms of image annotation (AP), indexing (mAP), and retrieval (RP). As shown in Section 3, PLSA-words is used in the proposed algorithm to model both the given and imagined annotations; PLSA-vw therefore differs from PLSA-words only in its input, i.e., the imagined annotations. It can be concluded that the improvements of PLSA-vw over PLSA-words do originate from the imagined annotations.
To some extent, the experimental results show that it is useful, and might be necessary, to enrich the incomplete list of loosely annotated keywords for automatically annotating new images; this is the focus of this paper. At the same time, the improvements come at a small cost in semantic range, i.e., about 1% RSI in Table 1, although the cost is insignificant relative to the other methods.

In the following, we compare the proposed algorithm with the three discrete annotation methods. Although TT is better than PLSA-vw in terms of three of the four indexes (AP, mAP, and RP), PLSA-vw is comparable to TT for mAP and RP, and both are better than the other methods. In terms of AP, the difference between them is significant, up to 5%. However, the annotation performance of TT would be over-estimated if we ignored the remaining index, RSI. As shown in Table 1, the RSI of TT is 33.48%, meaning only 33.48% of the keywords are ever correctly annotated using TT; in other words, over 66% of the keywords play no role in automatic annotation. TT is therefore heavily biased toward using certain keywords to annotate images. To show which keywords TT is liable to use, the frequencies of the keywords in the first sample are shown in Fig. 2(a), where red crosses and blue dots denote whether a keyword is ever correctly annotated in the test images or not, respectively. It can be seen from Fig. 2(a) that TT correctly annotates only 2 of the keywords whose word-counts are less than 50.
In contrast, PLSA-vw correctly annotates more than 20 such keywords, as shown in Fig. 2(b). For the same index, AP, PLSA-vw outperforms PLSA-words by up to 3.5%, at an RSI cost of only about 1%. In this sense, the improvement of PLSA-vw over PLSA-words seems attractive, even though PLSA-vw is still not the best method overall.

[Fig. 2 Word-counts of all keywords in the first sample for (a) TT and (b) PLSA-vw. Some keywords are displayed along the horizontal axis. Keywords that are ever correctly annotated and those that are not are marked as red crosses and blue dots, respectively. Only two keywords with word-counts below 50 are ever correctly annotated using TT, while PLSA-vw correctly annotates more than 20 such keywords.]

5 Conclusions

In this paper, we present an approach to enriching the incomplete list of keywords in loosely annotated images. In particular, some missing keywords are "imagined" as real annotations for the training images by approximating the conditional probability of the missing keywords given the loosely annotated images. Experimental results show that the proposed algorithm can improve the performance of PLSA-words in terms of image annotation, indexing, and retrieval, while keeping the merit of PLSA-words over other discrete annotation methods, i.e., a wider semantic range. We therefore conclude that it is useful and necessary to enrich the incomplete list of keywords given in loosely annotated images. Although the proposed algorithm is rather simple, it makes a very strong assumption: the process of retrieving missing keywords is conditionally independent of the visual features, given the same imagination process for blobs.
Our future work is to remove this assumption and make the imagination of keywords depend directly on the visual features of images.

References

1. K. Barnard, P. Duygulu, D. Forsyth, N. Freitas, D. M. Blei and M. I. Jordan: Matching words and pictures. Journal of Machine Learning Research 3 (2003) 1107-1135.
2. D. Blei and M. Jordan: Modeling annotated images. In Proc. 26th Int. Conf. on Research and Development in Information Retrieval (Toronto), Aug. 2003.
3. F. Monay and D. Gatica-Perez: Modeling semantic aspects for cross-media image indexing. IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (2007) 1802-1817.
4. P. Duygulu, K. Barnard, N. Freitas and D. Forsyth: Object recognition as machine translation. In Proc. of the Europ. Conf. on Computer Vision, pages 97-112, 2002.
5. J. Y. Pan, H. J. Yang, P. Duygulu and C. Faloutsos: Automatic image captioning. In IEEE International Conference on Multimedia and Expo, Taipei, 2004.
6. J. Jeon, V. Lavrenko and R. Manmatha: Automatic image annotation and retrieval using cross-media relevance models. In Proc. 26th Int. Conf. on Research and Development in Information Retrieval (Toronto), Aug. 2003.
7. S. L. Feng, R. Manmatha and V. Lavrenko: Multiple Bernoulli relevance models for image and video annotation. In Proc. of IEEE Int. Conf. on Computer Vision and Pattern Recognition, Jun. 2004.
8. J. Li and J. Z. Wang: Automatic linguistic indexing of pictures by a statistical modeling approach. IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (2003) 1075-1088.
9. J. Li and J. Z. Wang: Real-time computerized annotation of pictures. In Proc. of 13th ACM Int. Conf. on Multimedia, California, 2006.
10. L. Fei-Fei and P. Perona: A Bayesian hierarchical model for learning natural scene categories. In Proc. of IEEE Int. Conf. on Computer Vision and Pattern Recognition, San Diego, Jun. 2005.
11. G. Carneiro, A. B. Chan, P. J. Moreno and N. Vasconcelos: Supervised learning of semantic classes for image annotation and retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (2007) 394-410.
12. C. B. Yang, M. Dong and F. Fotouhi: Region-based image annotation using multiple-instance learning. In Proc. of 13th ACM Int. Conf. on Multimedia, Singapore, 2006.
13. K. Barnard and Q. F. Fan: Reducing correspondence ambiguity in loosely labeled training data. In Proc. of Int. Conf. on Computer Vision and Pattern Recognition, Minneapolis, Jun. 2007.
14. T. Hofmann: Unsupervised learning by probabilistic latent semantic analysis. Machine Learning 42 (2001) 177-196.
15. K. Barnard, Q. F. Fan, R. Swaminathan, A. Hoogs, R. Collins, P. Rondot and J. Kaufhold: Evaluation of localized semantics: data, methodology, and experiments. International Journal of Computer Vision, to appear.