Modeling Social Annotation: a Bayesian Approach
Authors: Anon Plangprasopchok, Kristina Lerman
USC Information Sciences Institute

Collaborative tagging systems, such as Delicious, CiteULike, and others, allow users to annotate resources, e.g., Web pages or scientific papers, with descriptive labels called tags. The social annotations contributed by thousands of users can potentially be used to infer categorical knowledge, classify documents, or recommend new relevant information. Traditional text inference methods do not make best use of social annotation, since they do not take into account variations in individual users' perspectives and vocabulary. In a previous work, we introduced a simple probabilistic model that takes the interests of individual annotators into account in order to find hidden topics of annotated resources. Unfortunately, that approach had one major shortcoming: the number of topics and interests must be specified a priori. To address this drawback, we extend the model to a fully Bayesian framework, which offers a way to automatically estimate these numbers. In particular, the model allows the number of interests and topics to change as suggested by the structure of the data. We evaluate the proposed model in detail on synthetic and real-world data by comparing its performance to Latent Dirichlet Allocation on the topic extraction task. For the latter evaluation, we apply the model to infer topics of Web resources from social annotations obtained from Delicious in order to discover new resources similar to a specified one. Our empirical results demonstrate that the proposed model is a promising method for exploiting social knowledge contained in user-generated annotations.
Categories and Subject Descriptors: H.2.8 [DATABASE MANAGEMENT]: Database Applications—Data mining; I.5.1 [PATTERN RECOGNITION]: Models—Statistical

General Terms: Algorithms, Experimentation

Additional Key Words and Phrases: Collaborative Tagging, Probabilistic Model, Resource Discovery, Social Annotation, Social Information Processing

Author's address: USC Information Sciences Institute, 4676 Admiralty Way, Marina del Rey, CA 90292.

ACM Journal Name, Vol. x, No. y, zz 2010.

1. INTRODUCTION

A new generation of Social Web sites, such as Delicious, Flickr, CiteULike, YouTube, and others, allow users to share content and annotate it with metadata in the form of comments, notes, ratings, and descriptive labels known as tags. Social annotation captures the collective knowledge of thousands of users and can potentially be used to enhance a number of applications, including Web search, information personalization and recommendation, and even the synthesis of categorical knowledge [Schmitz 2006; Mika 2007]. In order to make best use of user-generated metadata, we need methods that effectively deal with the challenges of data sparseness and noise, as well as take into account variations in the vocabulary, interests, and level of expertise among individual users.

Consider specifically tagging, which has become a popular method for annotating content on the Social Web. When a user tags a resource, be it a Web page on the social bookmarking site Delicious, a scientific paper on CiteULike, or an image on the social photosharing site Flickr, the user is free to select any keyword, or tag, from an uncontrolled personal vocabulary to describe the resource. We can use tags to categorize resources, similar to the way documents are categorized using their text, although the usual problems of sparseness (few unique keywords per document), synonymy (different keywords may have the same meaning), and ambiguity (the same keyword has multiple meanings) will also be present in this domain. Dimensionality reduction techniques such as topic modeling [Hofmann 1999; Blei et al. 2003; Buntine et al. 2004], which project documents from word space to a dense topic space, can alleviate these problems to a certain degree. Specifically, such projections address the sparseness and synonymy challenges by combining "similar" words together in a topic. Similarly, the challenge of word ambiguity in a document is addressed by taking into account the senses of co-appearing words in that document. In other words, the sense of a word is determined jointly along with the other words in that document.

Straightforward application of the previously mentioned methods to social annotation would aggregate a resource's tags over all users, thereby losing important information about individual variation in tag usage, which can actually help the categorization task. Specifically, in social annotation, similar tags may have different meanings according to annotators' perspectives on the resource they are annotating [Lerman et al. 2007].
For example, if one searches for Web resources about car prices using the tag "jaguar" on Delicious, one receives back a list of resources containing documents about luxury cars and dealers, as well as guitar manuals, wildlife videos, and documents about Apple Computer's operating system. The above-mentioned methods would simply compute tag occurrences from annotations across all users, effectively treating all annotations as if they were coming from a single user. As a result, a resource annotated with the tag "jaguar" will be undesirably associated with every sense of the keyword, simply based on the number of times that keyword (tag) was used for each sense.

We claim that users express their individual interests and vocabulary through tags, and that we can use this information to learn a better topic model of tagged resources. For instance, we are likely to discover that users who are interested in luxury cars use the keyword "jaguar" to tag car-related Web pages, while those who are interested in wildlife use "jaguar" to tag wildlife-related Web pages. The additional information about user interests is essential, especially since social annotations are generally very sparse.¹

In a previous work [Plangprasopchok and Lerman 2007], we proposed a probabilistic model that takes into account interest variation among users to infer a more accurate topic model of tagged resources. In this paper we describe a Bayesian version of the model (Section 3). We explore its performance in detail on synthetic data (Section 4.1) and compare it to Latent Dirichlet Allocation (LDA) [Blei et al. 2003], a popular document modeling algorithm. We show that in domains with high tag ambiguity, variations among users can actually help discriminate between tag senses, leading to a better topic model.
Our model is, therefore, best suited to making sense of social metadata, since this domain is characterized both by a high degree of noise and ambiguity and by a highly diverse user population with varied interests.

¹ There are only 3.74 tags on average for a given photo on Flickr [Rattenbury et al. 2007]. In addition, from our observations of the Delicious data we obtained, each user applies 4 to 7 tags to a given URL, while the tag vocabulary on a resource stabilizes after a few bookmarks, as reported in [Golder and Huberman 2006].

As a second contribution of the paper, we incorporate a Hierarchical Dirichlet Process [Teh et al. 2006] to create an adaptive version of the proposed model (Section 5), which enables the learning method to automatically adjust the model parameters. This capability helps overcome one of the main difficulties of applying the original model to the data: namely, having to specify the right number of topics and interests.

Finally, the proposed models are validated on a real-world data set obtained from the social bookmarking site Delicious (Section 4.2 and Section 5.2). We first train the model on this data, then measure the quality of the learned topic model. Specifically, the learned topic model is used as a compressed description of each Web resource. We compute similarity between resources based on the compressed description and manually evaluate the results to show that the topic model obtained by the method proposed in this paper identifies more similar resources than the baseline.

2. MODELING SOCIAL ANNOTATION

In general, a social annotation system involves three entities: resources (e.g., Web pages on Delicious), users, and metadata. Although there are different forms of metadata, such as descriptions, notes, and ratings, we focus on tags only in this context.
We define a variable R for resources, U for users, and T for tags; their realizations are denoted r, u, and t respectively. A post (or bookmark) k on a resource r by a user u can be formalized as a tuple ⟨r, u, {t_1, t_2, ..., t_j}⟩_k, which can be further broken down into co-occurrences of j resource-user-tag triples ⟨r, u, t⟩. N_R, N_U, and N_T are the numbers of distinct resources, users, and tags respectively.

In addition to the observable variables defined above, we introduce two "hidden" or "latent" variables, which we will attempt to infer from the observed data. The first variable, Z, represents resource topics, which we view as categories or concepts of resources. From our previous example, the tag "jaguar" can be associated with the topics 'cars', 'animals', 'South America', 'computers', etc. The second variable, X, represents user interests, the degree to which users subscribe to these concepts. One user may be interested in collecting information about luxury cars before purchasing one, while another user may be interested in vintage cars.

A user u has her interest profile, ψ_u, which is a weight distribution over all possible interests x; ψ (without subscript) is an N_U × N_X matrix. Similarly, a resource r has its topic profile, φ_r, which is again a weight distribution over all possible topics z, whereas φ (without subscript) is an N_R × N_Z matrix. Thus, a resource about South American jaguars will have a higher weight on the 'animals' and 'South America' topics than on the 'cars' topic. Usage of tags for a certain interest-topic pair (x, z) is defined as a weight distribution over tags, θ_{x,z}; that is, some tags are more likely to occur for a given pair than others. The weight distribution over all tags, θ, can be viewed as an N_T × N_Z × N_X matrix.
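To make the notation concrete, a small helper (hypothetical, not from the paper) can expand bookmarks into the ⟨r, u, t⟩ triples used throughout the model; the post values below are invented, echoing the "jaguar" example:

```python
# Hypothetical helper: expand each bookmark <r, u, {t1..tj}>_k
# into its j resource-user-tag triples <r, u, t>.
def expand_posts(posts):
    """posts: list of (resource, user, tag-list) bookmarks."""
    return [(r, u, t) for (r, u, tags) in posts for t in tags]

# Invented example posts: two users bookmark the same resource.
posts = [
    ("r1", "u1", ["jaguar", "cars", "luxury"]),
    ("r1", "u2", ["jaguar", "wildlife"]),
]
triples = expand_posts(posts)

# N_R, N_U, N_T: numbers of distinct resources, users, and tags.
N_R = len({r for r, _, _ in triples})
N_U = len({u for _, u, _ in triples})
N_T = len({t for _, _, t in triples})
```

Note that the same tag contributed by different users yields distinct triples, which is exactly the per-user information the model exploits.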
We cast an annotation event as a stochastic process as follows:

—User u finds a resource r interesting and would like to bookmark it.
—For each tag that u generates for r:
—User u selects an interest x from her interest profile ψ_u; resource r selects a topic z from its topic profile φ_r.
—Tag t is then chosen based on the user's interest and the resource's topic from the interest-topic distribution over all tags, θ_{x,z}.

Fig. 1. Schematic diagrams representing: (a) the tag generation process in the social annotation domain; (b) the word generation process in the document modeling domain.

This process is depicted schematically in Figure 1(a). Specifically, a user u has an interest profile, represented by a vector of interests ψ_u. Meanwhile, a resource r has its own topic profile, represented by a vector of topics φ_r. Users who share the same interest (x) have the same tagging policy, the tagging profile "plate" shown in the diagram. For the "plate" corresponding to an interest x, each row corresponds to a particular topic z, and it gives θ_{x,z}, the distribution over all tags for that topic and interest.
The process can be compared to the word generation process in standard topic modeling approaches, e.g., LDA [Blei et al. 2003] and pLSA [Hofmann 2001], as shown in Figure 1(b). In topic modeling, words of a given document are generated according to a single policy, which assumes that all authors of documents in the corpus share the same tagging patterns. In other words, a single set of "similar" tags is used to represent a topic across all authors. In our "jaguar" example, for instance, we may find one topic to be strongly associated with the words "cars", "automotive", "parts", "jag", etc., while another topic may be associated with the words "animals", "cats", "cute", "black", etc., and still another with "guitar", "fender", "music", etc.

In social annotation, however, a resource can be annotated by many users, who may have different opinions, even on the same topic. Users who are interested in restoring vintage cars will have a different tagging profile than those who are interested in shopping for a luxury car. The 'cars' topic would then decompose under different tagging profiles into one that is highly associated with the words "restoration", "classic", "parts", "catalog", etc., and another that is associated with the words "luxury", "design", "performance", "brand", etc. The separation of tagging profiles for each group of users who share the same interest provides the machinery to address this issue and constitutes the major distinction between our work and standard topic modeling.

3. FINITE INTEREST TOPIC MODEL

In our previous work [Plangprasopchok and Lerman 2007], we proposed a probabilistic model that describes the social annotation process, extended from probabilistic Latent Semantic Analysis (pLSA) [Hofmann 2001]. However, the model inherited some shortcomings from pLSA.
First, the strategy for estimating parameters in both models, point estimation using the EM algorithm, has been criticized as being prone to local maxima [Griffiths and Steyvers 2004; Steyvers and Griffiths 2006]. In addition, there is no explicit way to extend these models to automatically infer the dimensions of the parameters, such as the number of components used to represent topics (N_Z) and interests (N_X).

Fig. 2. Graphical representation of the social annotation process. R, U, T, X and Z denote the variables "Resource", "User", "Tag", "Interest" and "Topic" respectively. ψ, φ and θ are the distributions of users over interests, resources over topics, and interest-topic pairs over tags respectively. N_t represents the number of tag occurrences for one bookmark (by a particular user, u, on a particular resource, r); D represents the number of all bookmarks in the social annotation system. The hyperparameters α, β, and η control the dispersions of categorical topics, interests and tags respectively.

We extend our previous Interest Topic Model (ITM) in the same way pLSA was upgraded to the Latent Dirichlet Allocation (LDA) model [Blei et al. 2003]. In other words, we implement the model under a Bayesian framework, which offers solutions [Blei et al. 2003; Griffiths and Steyvers 2004; Neal 2000] to the previously mentioned problems. By doing so, we introduce priors on top of the parameters ψ, φ and θ to make the model fully generative, i.e., the mechanism for generating these parameters is explicitly implemented. To keep the model analytically simple, we use symmetric Dirichlet priors.
Following the generative process described in Section 2, the model can be described as a stochastic process, depicted in graphical form [Buntine 1994] in Figure 2:

ψ_u ∼ Dirichlet(β/N_X, ..., β/N_X)  (generate user u's interest profile)
φ_r ∼ Dirichlet(α/N_Z, ..., α/N_Z)  (generate resource r's topic profile)
θ_{x,z} ∼ Dirichlet(η/N_T, ..., η/N_T)  (generate the tag profile for interest x and topic z)

For each tag t_i of a bookmark,

x_i ∼ Discrete(ψ_u)
z_i ∼ Discrete(φ_r)
t_i ∼ Discrete(θ_{x_i,z_i})

One possible way to estimate the parameters is to use Gibbs sampling [Gilks et al. 1996; Neal 2000]. Briefly, the idea behind Gibbs sampling is to iteratively use the parameters of the current state to estimate the parameters of the next state. In particular, each next-state parameter is sampled from the posterior distribution of that parameter given all other parameters in the previous state. The sampling process is repeated until the sampled parameters approach the target posterior distributions. This approach has been demonstrated to be simple to implement, yet competitively efficient, and to yield relatively good performance on the topic extraction task [Griffiths and Steyvers 2004; Rosen-Zvi et al. 2004].

Since we use Dirichlet priors, it is straightforward to integrate out ψ, φ and θ. Thus, we only need to sample the hidden variables x and z, and later estimate ψ, φ and θ once x and z approach their target posterior distribution. To derive the Gibbs sampling formulas for x and z, we first assume that all bookmarks are broken into N_K tuples. Each tuple is indexed by i, and we refer to the observable variables, resource, user and tag, of tuple i as r_i, u_i, t_i.
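The generative specification above can be sketched with ancestral sampling. The following is a minimal illustration for a single user-resource pair; the dimensions, hyperparameter values, and number of tags per bookmark are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
N_X, N_Z, N_T = 3, 4, 10          # interests, topics, tags (toy sizes)
alpha = beta = eta = 1.0          # illustrative hyperparameters

# Symmetric Dirichlet priors, as in the model specification.
psi_u = rng.dirichlet([beta / N_X] * N_X)                    # user interest profile
phi_r = rng.dirichlet([alpha / N_Z] * N_Z)                   # resource topic profile
theta = rng.dirichlet([eta / N_T] * N_T, size=(N_X, N_Z))    # tag profile per (x, z)

def sample_tag():
    x = rng.choice(N_X, p=psi_u)         # user picks an interest
    z = rng.choice(N_Z, p=phi_r)         # resource picks a topic
    t = rng.choice(N_T, p=theta[x, z])   # tag from the interest-topic policy
    return x, z, t

# One bookmark with five tag draws.
tags = [sample_tag()[2] for _ in range(5)]
```

A synthetic corpus generated this way is also the natural test bed for the inference procedure described next.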
We refer to the hidden variables, topic and interest, for this tuple as z_i and x_i respectively, with x and z representing the vectors of interests and topics over all tuples. We define N^{−i}_{r_i,z} as the number of all tuples having r = r_i and topic z, excluding the present tuple i. In words, if z = z_i, N^{−i}_{r_i,z} = N_{r_i,z} − 1; otherwise, N^{−i}_{r_i,z} = N_{r_i,z}. Similarly, N^{−i}_{z,x_i,t_i} is the number of all tuples having x = x_i, t = t_i and topic z, excluding the present tuple i; z_{−i} represents all topic assignments except that of tuple i. The Gibbs sampling formulas for z and x, whose derivations we provide in the Appendix, are as follows:

p(z_i | z_{−i}, x, t) = [(N^{−i}_{r_i,z} + α/N_Z) / (N_{r_i} + α − 1)] · [(N^{−i}_{z,x_i,t_i} + η/N_T) / (N^{−i}_{z,x_i} + η)]    (1)

p(x_i | x_{−i}, z, t) = [(N^{−i}_{u_i,x} + β/N_X) / (N_{u_i} + β − 1)] · [(N^{−i}_{x,z_i,t_i} + η/N_T) / (N^{−i}_{x,z_i} + η)]    (2)

Consider Eq. (1), which computes the probability of a certain topic for the present tuple. The equation is composed of two factors. Suppose that we are currently determining the probability that the topic of the present tuple i is j (z_i = j). The left factor determines the probability of topic j to which the resource r_i belongs, according to the present topic distribution of r_i. Meanwhile, the right factor determines the probability of tag t_i under topic j for users who have interest x_i. If resource r_i has many tags assigned to topic j, and the present tag t_i is "very important" to topic j according to users with interest x_i, there is a higher chance that tuple i will be assigned to topic j. A similar insight applies to Eq. (2). In particular, suppose that we are currently determining the probability that the interest of the present tuple i is k (x_i = k).
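One update of Eq. (1) can be sketched as follows. The count arrays mirror the paper's N notation, but their values here are invented toy counts (with the current tuple already removed, per the −i convention):

```python
import numpy as np

N_Z, N_T = 3, 5
alpha, eta = 1.0, 1.0

# Toy counts, current tuple excluded (the "-i" convention):
N_rz  = np.array([4.0, 1.0, 0.0])   # tuples with resource r_i and topic z
N_zxt = np.array([3.0, 0.0, 1.0])   # tuples with interest x_i, tag t_i, topic z
N_zx  = np.array([10.0, 2.0, 4.0])  # tuples with interest x_i and topic z
N_r   = N_rz.sum() + 1              # all tuples for r_i, including the current one

# Eq. (1), evaluated for every candidate topic z, then normalized.
p = ((N_rz + alpha / N_Z) / (N_r + alpha - 1)
     * (N_zxt + eta / N_T) / (N_zx + eta))
p /= p.sum()

rng = np.random.default_rng(1)
z_i = rng.choice(N_Z, p=p)          # resample the topic of tuple i
```

With these counts, topic 0 dominates both factors, so the sampler is most likely to keep the tuple on topic 0; Eq. (2) is implemented analogously with the user-interest counts.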
If user u_i has many tags assigned to interest k, and tag t_i is "very important" to topic z_i according to users with interest k, the tuple i will be assigned to interest k with higher probability.

In the model training process, we sample topic z and interest x in the current iteration using their assignments from the previous iteration. By sampling z and x using Eq. (1) and Eq. (2) for each tuple, the posterior distribution of topics and interests is expected to converge to the true posterior distribution after enough iterations. Although it is difficult to assess the convergence of a Gibbs sampler in some cases, as mentioned in [Sahu and Roberts 1999], we simply monitor it through the likelihood of the data given the model, which measures how well the estimated parameters fit the data. Once the likelihood reaches the stable state, it only slightly fluctuates from one iteration to the next, i.e., there is no systematic and significant increase or decrease in likelihood. We can use this as part of the stopping criterion. Specifically, we monitor likelihood changes over a number of consecutive iterations; if the average of these changes is less than some threshold, the estimation process terminates. More robust approaches to determining the stable state are discussed elsewhere, e.g., [Ritter and Tanner 1992]. The formula for the likelihood is defined as follows:

f(t; ψ, φ, θ) = ∏_{i=1}^{N_K} (N_{x_i,z_i,t_i} + η/N_T) / (N_{x_i,z_i} + η)    (3)

To avoid numerical precision problems in the model implementation, one usually uses the log likelihood, log(f(t; ψ, φ, θ)), instead. Note that we use the strategy mentioned in [Escobar and West 1995] (Section 6) to estimate α, β and η from the data. The sampling results in the stable state are used to estimate the model parameters.
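The stopping rule described above can be sketched as a small helper; the window size, threshold, and log-likelihood trace below are invented for illustration:

```python
# Stop when the average absolute change in log likelihood over a
# window of consecutive iterations falls below a threshold.
def should_stop(log_liks, window=5, threshold=1.0):
    """log_liks: log likelihood recorded at each Gibbs iteration."""
    if len(log_liks) < window + 1:
        return False                      # not enough history yet
    recent = log_liks[-(window + 1):]
    changes = [abs(b - a) for a, b in zip(recent, recent[1:])]
    return sum(changes) / window < threshold

# Invented trace: a burn-in climb followed by small fluctuations.
trace = [-5000.0, -3000.0, -2500.0, -2400.0, -2399.5, -2399.4,
         -2399.6, -2399.5, -2399.4, -2399.5]
```

In practice the threshold would be scaled to the magnitude of the log likelihood; the absolute values here are arbitrary.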
Again, we define N_{r,z} as the number of all tuples associated with resource r and topic z, with N_r, N_{x,u}, N_u, N_{x,u,t} and N_{x,z} defined in a similar way. From Eq. (18) and Eq. (19) in the Appendix, the formulas for computing these parameters are as follows:

φ_{r,z} = (N_{r,z} + α/N_Z) / (N_r + α)    (4)

ψ_{u,x} = (N_{u,x} + β/N_X) / (N_u + β)    (5)

θ_{x,z,t} = (N_{x,z,t} + η/N_T) / (N_{x,z} + η)    (6)

Parameter estimation via Gibbs sampling is less prone to the local maxima problem than the generic EM algorithm, as argued in [Rosen-Zvi et al. 2004]. In particular, this scheme does not estimate the parameters φ, ψ, and θ directly. Rather, they are integrated out, while the hidden variables z and x are iteratively sampled during the training process. The process estimates the "posterior distribution" over possible values of φ, ψ, and θ. At a stable state, z and x are drawn from this distribution and then used to estimate φ, ψ, and θ. Consequently, these parameters are estimated from a combination of "most probable solutions", which are obtained from multiple maxima. This clearly differs from the generic EM with point estimation, which we used in our previous work [Plangprasopchok and Lerman 2007]; the point estimation scheme estimates φ, ψ, and θ from a single local maximum.

Per training iteration, the computational complexity of Gibbs sampling is higher than that of EM, because we need to sample the hidden variables (z and x) for each data point (tuple), whereas EM only requires updating parameters, and in general the number of data points is larger than the dimension of the parameters. However, it has been reported in [Griffiths and Steyvers 2004] that, to reach the same performance, Gibbs sampling requires fewer floating point operations than the other popular approaches, Variational Bayes and Expectation Propagation [Minka 2001].
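Eqs. (4)-(6) amount to smoothed normalization of the stable-state count matrices. A minimal sketch, with toy counts for 2 resources, 2 topics, 2 users, 2 interests and 3 tags:

```python
import numpy as np

alpha = beta = eta = 1.0
N_Z = N_X = 2
N_T = 3

# Toy stable-state counts (values invented for illustration).
N_rz  = np.array([[5.0, 1.0], [0.0, 4.0]])   # resource x topic
N_ux  = np.array([[3.0, 3.0], [6.0, 0.0]])   # user x interest
N_xzt = np.ones((N_X, N_Z, N_T))             # interest x topic x tag

phi   = (N_rz + alpha / N_Z) / (N_rz.sum(1, keepdims=True) + alpha)   # Eq. (4)
psi   = (N_ux + beta / N_X) / (N_ux.sum(1, keepdims=True) + beta)     # Eq. (5)
theta = (N_xzt + eta / N_T) / (N_xzt.sum(2, keepdims=True) + eta)     # Eq. (6)
```

Because N_r = Σ_z N_{r,z} (and likewise for the other marginals), each estimate is a proper distribution over its last index.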
Moreover, to our knowledge, there is currently no explicit way to extend these approaches to automatically infer the size of the hidden variables, as Gibbs sampling can. Inference of these numbers is described in Section 5.

4. EVALUATION

In this section we evaluate the Interest Topic Model and compare its performance to LDA [Blei et al. 2003] on both synthetic and real-world data. The synthetic data set enables us to control the degree of tag ambiguity and individual user variation, and to examine in detail how both learning algorithms respond to these key challenges of learning from social metadata. The real-world data set, obtained from the social bookmarking site Delicious, demonstrates the utility of the proposed model.

The baseline in both comparisons is LDA, a probabilistic generative model originally developed for modeling text documents [Blei et al. 2003] and more recently extended to other domains, such as finding topics of scientific papers [Griffiths and Steyvers 2004], topic-author associations [Rosen-Zvi et al. 2004], user roles in a social network [McCallum et al. 2007], and collaborative filtering [Marlin 2004]. In this model, the distribution of a document over a set of topics is first sampled from a Dirichlet prior. To generate each word in the document, a topic is first sampled from that distribution; then, a word is selected from the distribution of that topic over words.

One can apply LDA to model how tags are generated for resources on social tagging systems. One straightforward approach is to ignore information about users, treating all tags as if they came from the same user. A resource can then be viewed as a document, the tags from all users who bookmarked it are treated as its words, and LDA is used to learn the parameters.
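The straightforward LDA reduction just described (discard the user field, pool all tags on a resource into one "document") can be sketched as follows; the post values are invented, reusing the bookmark format from the earlier example:

```python
from collections import Counter

# Invented posts: (resource, user, tag-list) bookmarks.
posts = [
    ("r1", "u1", ["jaguar", "cars"]),
    ("r1", "u2", ["jaguar", "wildlife"]),
    ("r2", "u1", ["python", "code"]),
]

# Build one bag-of-tags "document" per resource, ignoring users.
docs = {}
for r, _, tags in posts:          # the user field is discarded
    docs.setdefault(r, Counter()).update(tags)
```

The resulting per-resource tag counts are exactly the document-word counts a standard LDA implementation would consume; the per-user structure that ITM exploits is gone.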
ITM extends LDA by taking into account individual variations among users. In particular, a tag for a certain bookmark is chosen not only from the resource's topics but also from the user's interests. This allows each user group (with the same interest) to have its own policy, θ_{x,z,t}, for choosing tags to represent a topic. Each policy is then used to update resource topics as in Eq. (1). Consequently, φ_{r,z} is updated based on the interests of the users who actually annotated resource r, rather than from a single policy that ignores user information. We thus expect ITM to perform better than LDA when annotations are made by diverse user groups, and especially when tags are ambiguous.

4.1 Synthetic Data

To verify this intuition about ITM, we evaluated the performance of the learning algorithms on synthetic data. Our data set consists of 40 resources, 10 topics, 100 users, 10 interests, and 100 tags. We first separate resources into five groups, with resources in each group assigned topic weights from the same (Dirichlet) probability distribution, which forces each resource to favor 2-4 out of the ten topics. Rather than simulate the tagging behavior of user groups by generating individual tagging policy plates as in Figure 1(a), we simplify the generative process to simulate the impact of diversity in user interests on tagging. To this end, we represent user interests as distributions over topics.

We create data sets under different tag ambiguity and user interest variation levels. To make these settings tunable, we generate the distributions of topics over tags and the distributions of resources over topics using symmetric Dirichlet distributions with different parameter values. Intuitively, when sampling from a symmetric Dirichlet distribution² with a low parameter value, for example 0.01, the sampled distribution contributes weight (probability values greater than zero) to only a few elements. In contrast, the distribution will contribute weight to many elements when it is sampled from a Dirichlet distribution with a high parameter value. We use this parameter of the symmetric Dirichlet distribution to adjust user variation, i.e., how broad or narrow user interests are, and tag ambiguity, i.e., how many or how few topics each tag belongs to. Higher parameter values simulate the behavior of more ambiguous tags, such as "jaguar", which has multiple senses, i.e., weight allocated to many topics. Lower parameter values simulate low-ambiguity tags, such as "mammal", which have one or only a few senses. The parameter values used in the experiments are 1, 0.5, 0.1, 0.05 and 0.01.

² Samples drawn from a Dirichlet distribution are discrete probability distributions.

To generate tags for each simulated data set, user interest profiles ψ_u are first drawn from the symmetric Dirichlet distribution with the same parameter value. A similar procedure is done for the distributions of topics over words, θ. A resource will presumably be annotated by a user if the match between the resource's topics and the user's interests is greater than some threshold. The match is given by the inner product of the resource's topics and the user's interests, and we set the threshold at 1.5× the average match computed over all user-resource pairs. The rationale behind this choice of threshold is to ensure that a resource will be tagged by a user who is strongly interested in the topics of that resource. When the user-resource match is greater than the threshold, a set of tags (a post or bookmark) is generated according to the following procedure.
First, we compute the topic distribution from an element-wise product of the resource's topics and the user's interests. Next, we sample a topic from this distribution and produce a tag from the tag distribution of that topic. This guarantees that tags are generated only according to the user's interests. We repeat this process seven times in each post³ and eliminate redundant tags. The process of generating tags is summarized below:

for each resource-user pair (u, r) do
  m_{r,u} = φ_r · ψ_u   (compute the match score)
end for
m̄ = Average(m)
for each resource-user pair (r, u) do
  if m_{r,u} > 1.5 m̄ then
    topicPref = φ_r × ψ_u   (element-wise product)
    for i = 1 to 7 do
      z ∼ topicPref   (draw a topic from the topic preference)
      t^i_{r,u} ∼ θ_z   (sample the i-th tag for the (u, r) pair)
    end for
    Remove redundant tags
  end if
end for

³We chose seven because Delicious users in general use four to seven tags in each post.

ACM Journal Name, Vol. x, No. y, zz 2010.

[Figure 3: Deviations, Delta (Δ), between actual and learned topics on synthetic data sets for different regimes: (a) high tag ambiguity; (b) low tag ambiguity; (c) high interest spread; (d) low interest spread. LDA(10) and LDA(30) refer to LDA trained with 10 and 30 topics respectively; ITM(10,3) refers to ITM trained with 10 topics and 3 interests.]

We measure sensitivity to tag ambiguity and user interest variation for LDA and ITM on the synthetic data generated with different values of the symmetric Dirichlet parameters. One way to measure sensitivity is to determine how the learned topic distribution, φ_r^ITM or φ_r^LDA, deviates from the actual topic distribution of resource r, φ_r^Actual. Unfortunately, we cannot compare them directly, since the topic order of the learned topic distribution may not be the same as that of the actual one.⁴ An indirect way to measure this deviation is to compare distances between pairs of resources computed using the actual and learned topic distributions. We define this deviation as Δ. We calculate the distance between two distributions using the Jensen-Shannon divergence (JSD) [Lin 1991]. If a model accurately learned the resources' topic distributions, the distance between two resources computed using the learned distribution would equal the distance computed from the actual distribution. Hence, the lower Δ, the better the model performance. The deviation between the actual and learned topic distributions is

Δ = Σ_{r=1}^{N_R} Σ_{r'=r+1}^{N_R} | JSD(φ_r^Learned, φ_{r'}^Learned) − JSD(φ_r^Actual, φ_{r'}^Actual) |.   (7)

Δ is computed separately for each algorithm, Learned = ITM and Learned = LDA.

We ran both LDA and ITM to learn distributions of resources over topics, φ, for simulated data sets generated with different values of tag ambiguity and user interest variation. We set the number of topics to 10 for each model, and the number of interests to three for ITM. Both models were initialized with random topic and interest assignments and then trained using 1000 iterations. For the last 100 iterations, we used the topic and interest assignments in each iteration to compute φ (using Eq. (4) for ITM and Eq. (7) in [Griffiths and Steyvers 2004] for LDA).
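The deviation Δ defined above can be sketched in a few lines; this is our own illustration of Eq. (7), with hypothetical function names, not code from the paper:

```python
import numpy as np

def jsd(p, q):
    """Jensen-Shannon divergence (base-2 logs, so values lie in [0, 1])."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0  # 0 * log(0) terms contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def delta(phi_learned, phi_actual):
    """Eq. (7): sum over resource pairs of |JSD(learned) - JSD(actual)|."""
    n = len(phi_actual)
    total = 0.0
    for r in range(n):
        for r2 in range(r + 1, n):
            total += abs(jsd(phi_learned[r], phi_learned[r2])
                         - jsd(phi_actual[r], phi_actual[r2]))
    return total
```

A model whose learned topic distributions preserve all pairwise distances attains Δ = 0, regardless of how the topics are permuted, which is exactly why the pairwise comparison sidesteps topic exchangeability.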
The average⁵ of φ over this period is then used as the distributions of resources over topics. We ran the learning algorithm five times for each data set.

⁴This property of probabilistic topic models is called exchangeability of topics [Steyvers and Griffiths 2006].
⁵The reason for using the average of φ is that, in the stable state, the topic/interest assignments can still fluctuate from one iteration to another. To avoid estimating φ from an iteration that possibly has idiosyncratic topic/word assignments, one can average φ over multiple iterations [Steyvers and Griffiths 2006].

[Figure 4: The deviation Δ between actual and learned topics on synthetic data sets, under different degrees of tag ambiguity and user interest variation. The Δ of LDA(10) is shown on the left (a); that of ITM(10,3) on the right (b).]

Deviations between the learned topics and the actual ones of the simulated data sets are shown in Figure 3 and Figure 4. When the degree of tag ambiguity is high, ITM is superior to LDA over the entire range of user interest variation, as shown in Figure 3(a). This is because ITM exploits user information to help disambiguate tag senses; thus, it can learn better topics, closer to the actual ones, than LDA. In the other regime, when tag ambiguity is low, user information does not help and can even degrade ITM performance, especially when the degree of interest variation is low, as in Figure 3(b).
This is because a low amount of user interest variation reduces the statistical strength of the learned topics. Suppose, for example, that we have two similar resources: the first is bookmarked by one group of users, the second by another. If these two groups have very different interest profiles, ITM will tend to split the "actual" topic that describes those resources into two different topics, one for each group. Hence, each of these resources will be assigned to a different learned topic, resulting in a higher Δ for ITM.

When user interest variation is high (Figure 3(c)), ITM is superior to LDA for the same reason: it uses user information to disambiguate tag senses. Of course, there is no advantage to using ITM when the degree of tag ambiguity is very low, and it then yields performance similar to LDA. In the last regime, when interest variation is low (Figure 3(d)), ITM is superior to LDA for high degrees of tag ambiguity, even though its topics may lose some statistical strength. ITM's performance starts to degrade when the degree of tag ambiguity is low, for the same reason as in Figure 3(b). These results are summarized in the 3D plots in Figure 4.

We also ran LDA with 30 topics, in order to compare LDA to ITM when both models have the same complexity. As shown in Figure 3, with the same model complexity, ITM is preferable to LDA in all settings. In some cases, LDA with higher complexity (30 topics) is inferior to LDA with lower complexity (10 topics). We suspect that this degradation is caused by over-specification of the model with too many topics.

Regarding computational complexity, both LDA and ITM must sample the hidden variables for all data points using Gibbs sampling. For LDA, only the topic variable z needs to be sampled; for ITM, the interest variable x is also required.
The computational cost of each sampling step is proportional to the number of topics, N_Z, for z, and to the number of interests, N_X, for x. Let κ be a constant, and let N_K denote the number of data points (tuples). The computational cost of LDA per iteration can then be approximated as N_K × (κ × N_Z), and that of ITM as N_K × (κ × (N_Z + N_X)).

In summary, ITM is not superior to LDA at learning the topics associated with resources in every case. However, we showed that ITM is preferable to LDA in scenarios characterized by a high degree of tag ambiguity and some user interest variation, which is the case in the social annotation domain.

4.2 Real-World Data

In this section we validate the proposed model on real-world data obtained from the social bookmarking site Delicious. The hypothesis we make for evaluating the proposed approach is that a model that takes users into account can infer higher quality (more accurate) topics φ than a model that ignores user information.

The "standard" measure⁶ used for evaluating topic models is the perplexity score [Blei et al. 2003; Rosen-Zvi et al. 2004]. Specifically, it measures generalization performance: how well a model can predict unseen observations. In document topic modeling, a portion of the words in each document is set aside as testing data, while the rest is used as training data. The perplexity score is then computed from the conditional probability of the testing data given the training data. This evaluation is infeasible in the social annotation domain, where each bookmark contains relatively few tags compared to a document's words. Instead of using perplexity, we propose to directly measure the quality of the learned topics on a simplified resource discovery task.
The task is defined as follows: "given a seed resource, find other most similar resources" [Ambite et al. 2009]. Each resource is represented as a distribution over learned topics, φ, computed using Eq. (4). Topics learned by the better approach will have more discriminative power for categorizing resources. When using such distributions to rank resources by similarity to the seed, we would expect the more similar resources to be ranked higher than the less similar ones. Note that the similarity between a pair of resources A and B is computed using the Jensen-Shannon divergence (JSD) [Lin 1991] on their topic distributions, φ_A and φ_B.

To evaluate the approach, we collected data for five seeds: flytecomm,⁷ geocoder,⁸ wunderground,⁹ whitepages,¹⁰ and online-reservationz.¹¹ Flytecomm allows users to track flights given the airline and flight number or the departure and arrival airports; geocoder returns the geographic coordinates of a given address; wunderground gives weather information for a particular location (given by zip code, city and state, or airport); whitepages returns a person's phone numbers; and online-reservationz lists hotels available in a given city on given dates. We crawled Delicious to gather resources possibly related to each seed. The crawling strategy is as follows: for each seed,

—retrieve the 20 most popular tags associated with this resource;
—for each of the tags, retrieve other resources that have been annotated with the tag;

⁶In fact, topic model evaluation is currently in controversy, according to a personal communication by Hal Daumé at http://nlpers.blogspot.com/2008/06/evaluating-topic-models.html.
⁷http://www.flytecomm.com/cgi-bin/trackflight/
⁸http://geocoder.us
⁹http://www.wunderground.com/
¹⁰http://www.whitepages.com/
¹¹http://www.online-reservationz.com/
—for each resource, collect all bookmarks (resource-user-tag triples).

We wrote a special-purpose page scraper to extract this information from Delicious. In principle, we could continue to expand the collection of resources by gathering tags and retrieving more resources tagged with those keywords, but in practice, even after a small traversal, we already obtain millions of triples. In each corpus, each resource has at least one tag in common with the seed. Statistics on these data sets are presented in Table I.

Table I. Statistics of the five data sets used to evaluate the models' performance. Note that a triple is a resource, user, and tag co-occurrence.

Seed                  #Resources   #Users   #Tags    #Triples
Flytecomm                  3,562   34,594   14,297   2,284,308
Geocoder                   5,572   46,764   16,887   3,775,832
Wunderground               7,176   45,852   77,056   6,327,211
Whitepages                 6,455   12,357   64,591   2,843,427
Online-Reservationz          764   41,003    9,194     162,763

For each corpus, LDA is trained with 80 topics, while the numbers of topics and interests for ITM are set to 80 and 40 respectively. The topic and interest assignments are randomly initialized, and both models are then trained for 500 iterations.¹² For the last 100 iterations, we use the topic and interest assignments in each iteration to compute the distributions of resources over topics, φ. The average of φ over this period is then used as the distributions of resources over topics.

Next, the learned distributions of resources over topics, φ, are used to compute the similarity of the resources in each corpus to the seed. The performance of each model is evaluated by manually checking the 100 most similar resources produced by the model. A resource is judged to be similar if it provides an input form that takes semantically the same inputs as the seed and returns semantically the same data.
Hence, flightaware¹³ is judged similar to flytecomm because both take flight information and return flight status. Figure 5 shows the number of relevant resources identified within the top x resources returned by LDA and ITM. From the results, we can see that ITM is superior to LDA on three data sets: flytecomm, geocoder and online-reservationz. However, its performance for wunderground and whitepages is about the same as that of LDA. Although we have no empirical proof, we hypothesize that weather and directory services are of interest to all users, and are therefore bookmarked by a large variety of users, unlike resources for tracking flights or booking hotels online. As a result, ITM cannot exploit individual user differences to learn more accurate topics φ in the wunderground and whitepages cases.

¹²We discovered that the models converge very quickly. In particular, they appear to reach the stable state within 300 iterations in all data sets.
¹³http://flightaware.com/live/

[Figure 5: Performance of the different models on the five data sets: (a) Flytecomm, (b) Geocoder, (c) Wunderground, (d) Whitepages, (e) Online-reservationz. The x-axis represents the number of retrieved resources; the y-axis represents the number of relevant resources (those that have the same function as the seed). LDA(80) refers to LDA trained with 80 topics; ITM(80/40) refers to ITM trained with 80 topics and 40 interests. In the wunderground case, we could only run ITM with 30 interests due to memory limits.]

To illustrate the utility of ITM, we select examples of topics and interests induced by the model from the flytecomm corpus. For purposes of visualization, we first list in descending order the top tags that are highly associated with each topic, obtained from θ_z (aggregated over all interests in the topic z). For each topic, we then enumerate some interests, and present a list of top tags for each interest, obtained from θ_{x,z}. We manually label topics and interests (in italics) according to the meaning of their dominant tags.

Travel & Flights topic: travel, Travel, flights, airfare, airline, flight, airlines, guide, aviation, hotels, deals, reference, airplane
—Flight Tracking interest: travel, flight, airline, airplane, tracking, guide, flights, hotel, aviation, tool, packing, plane
—Deal & Booking interest: vacation, service, travelling, hotels, search, deals, europe, portal, tourism, price, compare, old
—Guide interest: travel, cool, useful, reference, world, advice, holiday, international, vacation, guide, information, resource

Video & p2p topic: video, download, bittorrent, p2p, youtube, media, torrent, torrents, movies, videos, Video, downloads, dvd, free, movie
—p2p Video interest: video, download, bittorrent, youtube, torrents, p2p, torrent, videos, movies, dvd, media,
googlevideo, downloads, pvr
—Media & Creation interest: video, media, movies, multimedia, videos, film, editing, vlog, remix, sharing, rip, ipod, television, videoblog
—Free Video interest: video, free, useful, videos, cool, downloads, hack, media, utilities, tool, hacks, flash, audio, podcast

Reference topic: reference, database, cheatsheet, Reference, resources, documentation, list, links, sql, lists, resource, useful, mysql
—Databases interest: reference, database, documentation, sql, info, databases, faq, technical, reviews, tech, oracle, manuals
—Tips & Productivity interest: reference, useful, resources, information, tips, howto, geek, guide, info, productivity, daily, computers
—Manual & Reference interest: resource, list, guide, resources, collection, help, directory, manual, index, portal, archive, bookmark

The three interests in the "Travel & Flights" topic have obviously different themes. The dominant one is about tracking the status of a flight, while the less dominant ones are about searching for travel deals and travel guides respectively. This implies that there are subsets of users who have different perspectives (or what we call interests) towards the same topic. Similarly, different interests also appear in the other topics, "Video & p2p" and "Reference."

Figure 6 presents examples of the topic distributions learned by LDA and ITM for three resources: the seed flytecomm, usatoday,¹⁴ and bookings.¹⁵ Although all are about travel, the first two resources have specific flight-tracking functionality, while the last one is about hotel and trip booking. In the distributions of resources over topics learned by LDA, shown in Figure 6(a), all resources have high weights on topics #1 and #2, which are about traveling deals and general aviation.
In the case of topics learned by ITM, shown in Figure 6(b), flytecomm and usatoday have high weight on topic #25, which is about tracking flights, while bookings does not. Consequently, ITM will be more helpful than LDA in identifying flight-tracking resources. This demonstrates the advantage of ITM in exploiting individual differences to learn more accurate topics.

5. INFINITE INTEREST TOPIC MODEL

In Section 3, we assumed that parameters such as N_Z and N_X (the numbers of topics and interests respectively) were fixed and known a priori. The choice of values for these parameters can conceivably affect model performance. The traditional way to determine these numbers is to train the model several times with different parameter values, and then select those that yield the best performance [Griffiths and Steyvers 2004].

¹⁴http://www.usatoday.com/travel/flights/delays/tracker-index.htm
¹⁵http://www.bookings.org/

[Figure 6 appears here. Representative topics include, for LDA: Topic #1 (travel, flight, airfare, airline, guide, hotel, cheap), Topic #2 (flight, wireless, aviation, japan, airplane, wifi, tracking), Topic #28 (statistics, stats, seo, traffic, analysis, marketing, test); and for ITM: Topic #13 (travel, flights, airfare, guide, shopping, airlines), Topic #25 (flight, aviation, airplane, tracking, airline, airlines), Topic #8 (maps, googlemaps, mapping, geography, earth, cool).]

Fig. 6. Topic distributions of three resources, flytecomm, usatoday and bookings, learned by (a) LDA and (b) ITM.
φ_{z,r} on the y-axis indicates the weight of topic z in the resource r, i.e., the degree to which r is about topic z.

In this work, we choose another solution: extending our finite model to have "countably" infinite numbers of topics and interests. By a "countably" infinite number of components, we mean that these numbers are flexible and can vary according to the number of observations. Intuitively, there is a higher chance that more topics and interests will be found in a data set that has more resources and users. Such an unbounded number of components can be handled within a Bayesian framework, as in previous work [Neal 2000; Rasmussen 2000; Teh et al. 2006]. This approach bypasses the problem of selecting values for these parameters.

Following [Neal 2000], we let both N_Z and N_X approach ∞. This gives the model the ability not only to select previously used topic/interest components but also to instantiate "unused" components when required. However, the model we derived in the previous section cannot be extended directly under this framework, due to the use of symmetric Dirichlet priors. As pointed out by [Teh et al. 2006], when the number of components grows, using the symmetric Dirichlet prior results in a very low (even zero) probability that a mixture component is shared across groups of data. In our context, that is, a certain topic is much more likely to be used within only one resource than to be shared by many of them. Considering Eq. (1), if we let N_Z approach ∞, we obtain the posterior probability of z as follows:

p(z_i = z_used | z_−i, x, t) = N_{r_i,z_−i} / (N_{r_i} + α − 1) · (N_{z_−i,x_i,t_i} + η/N_T) / (N_{z_−i,x_i} + η)   (8)

p(z_i = z_new | z_−i, x, t) = α / (N_{r_i} + α − 1) · 1/N_T   (9)
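The behavior of Eqs. (8)-(9) can be made concrete with a small sampler; this is our own illustration with hypothetical count arrays, not the paper's code. A topic with zero count in resource r_i receives zero probability from Eq. (8), so the sampler can only reuse the resource's own topics or open a new one via Eq. (9):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_topic(n_rz, n_zxt, n_zx, alpha, eta, n_t):
    """Sample z_i per Eqs. (8)-(9): reuse a topic already used in
    resource r_i, or instantiate a new one (returned as the last index).
    n_rz  : count of each topic within the current resource
    n_zxt : count of the current tag t_i under (z, x_i)
    n_zx  : total count under (z, x_i)
    """
    n_r = n_rz.sum()
    # Eq. (8): probability of each currently used topic component
    p_used = (n_rz / (n_r + alpha - 1.0)) * (n_zxt + eta / n_t) / (n_zx + eta)
    # Eq. (9): probability of instantiating a brand-new topic
    p_new = (alpha / (n_r + alpha - 1.0)) * (1.0 / n_t)
    p = np.append(p_used, p_new)
    p /= p.sum()
    return rng.choice(len(p), p=p)  # last index means "new topic"
```

With all within-resource counts at zero, the sampler is forced to open a new component, which is exactly the "each resource tends to own its components exclusively" pathology discussed next.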
(8), we can per ceive that the model only fa vors topic components that are only u sed with in the r esource r i . Meanwhile, for o ther com ponen ts that are not used by that resour ce, N z − i ,x i ,t i would equa l zero and thus result in zero pr obability in choosing them. Conseque ntly , the mod el only chooses to pic comp onents for a resour ce either f rom compon ents that are currently used by that resource, or it instantiates a new com ponen t for that resource with p robab ilities according to Eq . (8) a nd Eq. (9) respec ti vely . As mo re new comp onents ar e instantiated, each r esource tends to own its componen ts exclusi vely . From the pre vious sectio n, we can also perceive that each reso urce profile is gene rated indepen dently (using symmetric Dirichlet prior) — there is no mechanism to link the used compon ents ac ross different resources 16 . As men tioned in [T eh et al. 2 006], this is an undesired ch aracteristic, b ecause, in our co ntext, we would expect “similar” r esources to be described by the same set of “similar” topics. One possible way t o handle this problem is to use Hierarchical Dirichlet Process (HDP) [T eh et al. 2006 ] as the prior instead of the symmetric Dirichlet pr ior . The idea of HDP is to link compon ents at group-sp ecific level together by introd ucing global compon ents across all group s. Each group is on ly allowed to use some (or all) of th ese global comp onents and thu s, some of them are e xpected to be shared across se veral gr oups. W e adapt this idea b y co nsidering all tags of re source r to b elong to the resource gr oup r . Similar ly , all tags of user u belon g to th e u ser g roup u . Each of the resource g roups is assigned to some topic com ponen ts selected from the glo bal topic componen t pool. Similarly , each of the user gro ups is assigned to som e interest comp onents selected from the globa l intere st compon ent pool. This extension is depicted in Figure 7. 
Suppose that the number of all possible topic components is N_Z (which will later be set to approach ∞) and that of interest components is N_X. We can describe this extension as a stochastic process as follows. At the global level, the weight distributions of components are sampled according to

—(β_1, ..., β_{N_X}) ∼ Dirichlet(γ_x/N_X, ..., γ_x/N_X)   (generating global interest component weights)
—(α_1, ..., α_{N_Z}) ∼ Dirichlet(γ_z/N_Z, ..., γ_z/N_Z)   (generating global topic component weights)

where γ_x and γ_z are concentration parameters, which control the diversity of interests and topics at the global level. At the group-specific level,

—ψ_u ∼ Dirichlet(μ_x · β_1, ..., μ_x · β_{N_X})   (generating user u's interest profile)
—φ_r ∼ Dirichlet(μ_z · α_1, ..., μ_z · α_{N_Z})   (generating resource r's topic profile)

where μ_x and μ_z are concentration parameters, which control the diversity of interests and topics at the group-specific level. The remaining steps, involving the generation of tags for each bookmark, are the same as in the previous process.

¹⁶This behavior can easily be observed in multiple samples, each drawn independently from a Dirichlet distribution Dirichlet(α_1, ..., α_k). If α_i is "small" and k is "large", there is a high chance that samples obtained from this Dirichlet distribution will have no overlapping components, i.e., for any pair of samples, there is no case where the same component has a value greater than 0 in both at the same time. The lack of component overlap across samples becomes obvious when k → ∞. This is the problem found in the model with an infinite limit on N_Z and N_X.

Suppose that there is an infinite number of all possible topics, N_Z → ∞, and that a portion of them are currently used in some resources. By following [Teh et al.
2006], we can rewrite the global weight distribution of topic components, α, as (α_1, ..., α_{k_z}, α_u), where k_z is the number of currently used topic components and α_u = Σ_{k=k_z+1}^{N_Z} α_k is the total weight of all unused topic components. Similarly, we can write (α_1, ..., α_{k_z}, α_u) ∼ Dirichlet(γ_z/N_Z, ..., γ_z/N_Z, γ_{z_u}), where γ_{z_u} = (N_Z − k_z) · γ_z/N_Z. The same treatment is also applied to the interest components. Now we can generalize Eq. (1) and Eq. (2) for sampling the posterior probabilities of topic z and interest x with HDP priors as follows. For sampling the topic component assignment for data point i,

p(z_i = k | z_−i, x, t) = (N_{r_i,z_−i} + μ_z α_k) / (N_{r_i} + μ_z − 1) · (N_{z_−i,x_i,t_i} + η/N_T) / (N_{z_−i,x_i} + η)   (10)

p(z_i = k_new | z_−i, x, t) = μ_z α_u / (N_{r_i} + μ_z − 1) · 1/N_T   (11)

For sampling the interest component assignment for data point i,

p(x_i = j | x_−i, z, t) = (N_{u_i,x_−i} + μ_x β_j) / (N_{u_i} + μ_x − 1) · (N_{x_−i,z_i,t_i} + η/N_T) / (N_{x_−i,z_i} + η)   (12)

p(x_i = j_new | x_−i, z, t) = μ_x β_u / (N_{u_i} + μ_x − 1) · 1/N_T,   (13)

where k and j are indices of topic and interest components respectively. Through these equations, we allow the model to instantiate a new component from the pool of unused components.

Consider the case when a new topic component is instantiated and, for simplicity, let this new component be the last used component, indexed k'_z. We need to obtain the weight α_{k'_z} for this new component and also update the weight of all unused components, α_{u'}. From the unused component pool, we know that one of the unused components will be chosen as the newly used component k'_z with probability distribution (α_{k_z+1}/α_u, ..., α_{N_Z}/α_u), which can be sampled from Dirichlet(γ_z/N_Z, ..., γ_z/N_Z).
Suppose the component k'_z is to be chosen from one of these components and we collapse the remaining unused components. It will be chosen with probability α_{k'_z}/α_u, which can be sampled from Beta(γ_z/N_Z, γ_{z_u} − γ_z/N_Z), where Beta(·) is a Beta distribution. Now, suppose k'_z is chosen. The probability of choosing this component is updated to α_{k'_z}/α_u ∼ Beta(γ_z/N_Z + 1, γ_{z_u} − γ_z/N_Z). When N_Z → ∞, this reduces to α_{k'_z}/α_u ∼ Beta(1, γ_{z_u}). Hence, to update α_{k'_z}, we first draw a ∼ Beta(1, α_u). We then update α_{k'_z} ← a·α_u and α_{u'} ← (1 − a)·α_u. Similar steps are applied to the interest components.

Note that, comparing Eq. (10) to Eq. (8), the problem identified above is gone, since p(z_i = k | z_−i, x, t) never has zero probability, even when N_{r_i,z_−i} = 0. At the end of each iteration, we use the same method as [Teh et al. 2006] to sample α and β, and update the hyperparameters γ_z, γ_x, μ_z, μ_x using the method described in [Escobar and West 1995]. We refer to this infinite version of ITM as the "Interest Topic Model with Hierarchical Dirichlet Process" (HDPITM) for the rest of the paper.

As for computational complexity, although N_Z and N_X are both set to approach ∞, the computational cost of each iteration does not. Considering Eq. (10) and Eq. (11), sampling z_i only involves the currently instantiated topics plus one "collapsed topic", which represents all currently unused topics. Similarly, sampling x_i only involves the currently instantiated interests plus one. For a particular iteration, the computational cost of HDP can therefore be approximated as N_K × (κ × (N̄_Z + 1)).
The cost for HDPITM can be approximated as N_K × (κ × (N̄_Z + N̄_X + 2)), where N̄_Z and N̄_X are respectively the average numbers of topics and interests in that iteration.

5.1 Performance on the synthetic data

We ran both HDP and HDPITM to extract topic distributions, φ, on the simulated data sets. In each run the number of instantiated topics was initialized to ten, which equals the actual number of topics, for both HDP and HDPITM. The number of interests was initialized to three. As in the setting of Section 4.1, topic and interest assignments were randomly initialized and then trained using 1000 iterations. Subsequently, φ was computed from the last 100 iterations. The results are shown in Figure 8 (a) and (b) for HDP and HDPITM respectively. The behaviors of both models in the different settings are similar to those of LDA and ITM. In particular, HDPITM can exploit user information to help disambiguate tag senses, while HDP cannot. Hence, the performance of HDPITM is better than that of HDP when the tag ambiguity level is high. And since topics may lose some statistical strength when user interest variation is low, HDPITM is inferior to HDP in that regime, as in Figure 3(b) for the finite case.

Comparing plots (a) and (b) in Figure 3 and Figure 8, the performance of the infinite models is generally worse than that of the finite ones, even though we allow the former to adjust the topic/interest dimensions. One possible factor is that the model still allows the topic/interest dimensions (configuration) to change even when the trained model is in a "stable" state. That would prevent the model from optimizing its parameters for a particular configuration of topic/interest dimensions.
One piece of evidence supporting this claim is that, although the log likelihood seems to converge, the numbers of topics (for both models) and interests (for HDPITM only) still fluctuate slightly around certain values. Based on this speculation, we ran both HDP and HDPITM with a different strategy. In particular, we split model training into two periods. In the first period, we allow the model to adjust its configuration, i.e., the dimensions of topics and interests. In the second period, we still train the model but do not allow the dimensions of topics and interests to change. The first period is similar to the training process of plain HDP and HDPITM. The second is similar to that of plain LDA and ITM, using the latest configuration from the first period. In this experiment, we set the first period to 500 iterations; another 500 iterations were used for the second phase. Subsequently, φ is computed from the last 100 iterations of the second phase. We refer to this training strategy for HDP as HDP+LDA, and that for HDPITM as HDPITM+ITM. The overall improvement in performance using this strategy can be seen in Figure 8 (c) and (d), compared with (a) and (b). That is, both HDP+LDA and HDPITM+ITM produce φ with lower ∆ under this strategy. However, HDPITM+ITM performance under the condition of low user interest and low tag ambiguity is still inferior to HDP+LDA. This is simply because their structures are still the same as those of HDP and HDPITM respectively.

ACM Journal Name, Vol. x, No. y, zz 2010.
20 · A. Plangprasopchok and K. Lerman

Fig. 7. Graphical representation of the Interest Topic model with hierarchical Dirichlet process (HDPITM). [Diagram: latent variables X and Z over tuples of U, R, T, with parameters ψ, φ, θ and hyperparameters α, β, η, γ_X, γ_Z, μ_X, μ_Z.]

Fig. 8. The deviation ∆ between actual and learned topics on synthetic data sets, under different degrees of tag ambiguity and user interest variation. The ∆ of HDP is shown on the left (a); that of HDPITM is on the right (b). Panels (c) and (d) show the deviation produced by HDP+LDA and HDPITM+ITM respectively. For HDP+LDA, new topics can be instantiated, and thus the number of topics can change, during the first half of the run (HDP); all topics are then frozen (no new topic can be instantiated) during the second half (LDA). HDPITM+ITM is analogous but takes user information into account. See Section 5.1 for more detail.

5.2 Performance on the real-world data

In these experiments, we initialize the numbers of topics and interests to 100 and 20 (the number of interests is only applicable to HDPITM), and train the models on the same real-world data sets we used in Section 4.2. The topic and interest assignments are randomly initialized, and then both models are trained with a minimum of 400 and maximum of 600 iterations. For the first 100 iterations, we allow both models to instantiate a new topic or interest as required, under the constraint that the numbers of topics and interests do not exceed 400 and 80 respectively. If the model violates this constraint, it exits this phase early. For the remainder of the iterations, we do not allow the models to add new topics or interests (but these numbers can shrink if some topics/interests collapse during this phase).
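The constrained growth phase just described can be sketched as follows. `ToyModel`, `gibbs_step`, and the drift rates are hypothetical stand-ins for the real sampler; the caps (400 topics, 80 interests) and the 100-iteration growth limit mirror the setup in the text:

```python
MAX_TOPICS, MAX_INTERESTS = 400, 80   # caps from the experimental setup
GROWTH_ITERS = 100                    # iterations during which growth is allowed

class ToyModel:
    """Hypothetical stand-in for HDPITM: topic/interest counts drift
    upward each iteration while new dimensions may be instantiated."""
    def __init__(self):
        self.num_topics, self.num_interests = 100, 20
    def gibbs_step(self, allow_new):
        if allow_new:  # growth phase may instantiate new topics/interests
            self.num_topics += 7
            self.num_interests += 1

def growth_phase(model, max_iters=GROWTH_ITERS):
    """Run the growth phase, exiting early once a cap is violated.
    Returns the number of iterations actually used."""
    for it in range(max_iters):
        model.gibbs_step(allow_new=True)
        if model.num_topics > MAX_TOPICS or model.num_interests > MAX_INTERESTS:
            return it + 1  # early exit: constraint violated
    return max_iters

m = ToyModel()
iters_used = growth_phase(m)
```

After this phase the sampler would continue with `allow_new=False`, so the dimensions can only shrink if topics or interests collapse.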
Then, if the change in log likelihood, averaged over the 10 preceding iterations, is less than 2%, the training process enters the final learning phase. (See Figure 9 (f) for an example of the log likelihood during training iterations.) In fact, we found that the process enters the final phase early in all data sets. In the final phase, consisting of 100 iterations, we use the topic and interest assignments in each iteration to compute the distributions of resources over topics. The reason we limit the maximum numbers of topics, interests, and iterations over which these models are allowed to instantiate a new topic/interest is that the numbers of users and tags in our data sets are large, and many new topics and interests could be instantiated. This would require many more iterations to converge, and the models would require more memory than is available on the desktop machine we used in the experiments.[17] We would rather allow the model to "explore" the underlying structure of the data within the constraints; in other words, find a configuration best suited to the data under a limited exploration period and then fit the data within that configuration. At the end of the parameter estimation, the numbers of allocated topics of the HDP models for flytecomm, geocoder, wunderground, whitepages and online-reservationz were 171, 174, 197, 187 and 175 respectively. The numbers of allocated topics and interests in HDPITM were ⟨307, 43⟩, ⟨329, 44⟩, ⟨231, 81⟩, ⟨225, 78⟩ and ⟨207, 72⟩ respectively, larger than those inferred by HDP in all cases. These results suggest that user information allows HDPITM to discover more detailed structure. HDPITM performs somewhat better than HDP on the flytecomm, online-reservationz, and geocoder data sets. Its performance on wunderground and whitepages, however, is almost identical to HDP.
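One plausible reading of the stopping rule above (the exact implementation is not given in the text) checks the relative log-likelihood change, averaged over the 10 preceding iterations, against the 2% threshold:

```python
def has_converged(log_liks, window=10, tol=0.02):
    """True if the relative change in log likelihood, averaged over the
    `window` preceding iterations, falls below `tol` (2% by default)."""
    if len(log_liks) < window + 1:
        return False  # not enough history yet
    recent = log_liks[-(window + 1):]
    rel_changes = [abs((b - a) / a) for a, b in zip(recent, recent[1:])]
    return sum(rel_changes) / window < tol

# Toy trace (hypothetical values): log likelihood rises steeply, then flattens,
# roughly like Figure 9 (f).
trace = [-1.4e7 + 4e5 * i for i in range(10)] + [-1.0e7 - 100 * i for i in range(15)]
```

On such a trace the test fails while the likelihood is still climbing and passes once the trailing window is flat, triggering the switch to the final 100-iteration phase.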
As in Section 4.2, this is possibly due to high interest variation among users. We suspect that weather and directory services are of interest to all users, and are therefore bookmarked by a wide variety of users.

6. RELATED RESEARCH

Modeling social annotation is an emerging field, but it has intellectual roots in two other fields: document modeling and collaborative filtering. It is relevant to the former in that one can view a resource annotated by users with a set of tags as analogous to a document composed of words by the document's authors. Usually, the number of users involved in creating a document is much smaller than the number involved in annotating a resource. With regard to collaborative rating systems, annotations created by users in a social annotation system are analogous to object ratings in a recommendation system. However, users provide only one rating per object in a recommendation system, whereas they usually annotate an object with several keywords. There are therefore several relevant threads of research connecting our work to earlier work in these areas.

[17] At maximum, we could only allocate 1,300 MBytes of memory.

Fig. 9. Performance of different methods on the five data sets (a)-(e): (a) Flytecomm, (b) Geocoder, (c) Wunderground, (d) Whitepages, (e) Online-Reservationz. Each plot shows the number of relevant resources (those similar to the seed) within the top 100 results produced by HDP (the nonparametric version of LDA) and HDPITM (the nonparametric version of ITM), against the number of retrieved resources. Each model was initialized with 100 topics, and 20 interests for HDPITM. (f) shows the log likelihood of the HDPITM model during the parameter estimation period on the flytecomm data set; similar behavior is found for both HDP and HDPITM on all data sets.

In relation to document modeling, our work is conceptually motivated by the Author-Topic model (AT) [Rosen-Zvi et al. 2004], where we can view a user who annotates a resource as an author who composes a document. In particular, the model explains the process of document generation, governed by author profiles, in the form of distributions of authors over topics. However, this work is not directly applicable to social annotations. First, in the social annotation context, we know who generates a tag on a certain resource; therefore, the author selection process in AT, which selects one of the co-authors to be responsible for generating a certain document word, is not needed in our context. Second, co-occurrences of user-tag pairs for a certain bookmark are very sparse, i.e., there are fewer than 10 tags per bookmark. Thus, we need to group users who share the same interests together to avoid the sparseness problem. Third, AT has no direct way to estimate distributions of resources over topics, since there are only author-topic and topic-word associations in the model. One possible indirect way is to compute this from an
average over all distributions of authors over topics. Our model, instead, explicitly models this distribution, and since it uses profiles of groups of similar users, rather than those of an individual, the distributions are expected to be less biased.

Several recent works apply document modeling to social annotation. One study [Wu et al. 2006] applies the multi-way aspect model [Hofmann 2001; Popescul et al. 2001] to social annotations on Delicious. That model does not explicitly separate user interests and resource topics as our model does, and thus cannot exploit user variations to learn better distributions of resources over topics, as we showed in [Plangprasopchok and Lerman 2007]. [Zhou et al. 2008] introduced a generative model of the process of Web page creation and annotation. The model, called User Content Annotator (UCA), includes words found in Web documents, in addition to tags generated by users to annotate these documents. The authors explore this model in the context of improving IR performance. In that work, a bag of words (tags and content) is generated from two different sources: the document creator and the annotator. Although UCA takes documents' contents into account, unlike our model, it makes several assumptions which we believe do not hold for real-world data. The first assumption is that annotators conceptually agree with the original document's authors (and therefore share the same topic space), whereas ITM relaxes this assumption. The second assumption is that users and documents have the same type of distribution over topics, whereas ITM separates interests from topics. In fact, without documents' content, UCA is almost identical to the Author-Topic model [Rosen-Zvi et al. 2004], except that tag owners are explicitly known; thus, it shares AT's drawbacks.
Another technical drawback of UCA is the following: if a particular tagged Web document has no words (e.g., a Web service, Flickr photo, or YouTube video), UCA would then take into account the taggers only, and lose the variable d that represents the document. Further computation is required to infer p(z|d), the probability of a topic given a document, which is required for the content discovery task we are investigating.

Collaborative filtering was one of the first successful social applications. Collaborative filtering is a technology used by recommender systems to find users with similar interests by asking them to rate items. It then compares their ratings to find users with similar opinions, and recommends to users new items that similar users liked. Among recent works in the collaborative filtering area, [Jin et al. 2006] is most relevant to ours. In particular, that work describes a mixture model for collaborative filtering that takes into account users' intrinsic preferences about items. In this model, an item rating is generated from both the item type and the user's individual preference for that type. Intuitively, like-minded users would give similar ratings to the same item types (e.g., movie genres). When predicting the rating of a certain item for a certain user, the user's previous ratings on other items are used to infer a like-minded group of users. Then, the "common" rating on that item from the users of that group is the prediction. This collaborative rating process is very similar to that of collaborative tagging. The only technical difference is that each "item" can have multiple "ratings" (in our case, tags) from a single user. This is because an item usually has multiple subjects, and each subject can be represented using multiple terms. There exist, however, major differences between [Jin et al. 2006] and our work.
We use the probabilistic model to discover a "resource description" despite users annotating resources with potentially ambiguous tags. Our goal is not to predict how a user will tag a resource (analogous to predicting the rating a user will give to an item), or to discover like-minded groups of users, although our algorithm could also do this. The main purpose of our work is to recover the actual "resource description" from noisy observations generated by different users. In essence, we hypothesize that there is an actual description of a certain resource, and that users select and then annotate the resource with parts of that description according to their "interest" or "expertise". In this work, we also demonstrate that when individual differences are taken into account in the process, the inferred resource descriptions are not biased toward individual variation as much as those that do not take this issue into account. Another technical difference is that their model is not implemented in a fully Bayesian manner, and uses point estimation for its parameters, which has been criticized as susceptible to local maxima [Griffiths and Steyvers 2004; Steyvers and Griffiths 2006]. Moreover, it cannot be extended to allow the numbers of topics/interests to be flexible as ours can; thus, a strong assumption on the numbers of topics and interests is required. Rather than modeling social annotation, [Li et al. 2007] concentrates on an approach that helps users efficiently navigate the Social Web. Although that work shares some challenges with ours, e.g., tag ambiguity, the solution proposed there is rather different. In particular, it exploits user activity to resolve ambiguity: as a user selects more tags, the topic scope becomes more focused.
Consequently, the subsequently suggested tags are associated with fewer and fewer senses, helping to disambiguate the tag. Our approach does not rely on such user activity to disambiguate tag senses; instead, we exploit user interests to do this, since a tag sense is correlated with a group of users who share interests. At the application level, this approach and ours are also different. In particular, the former approach is suitable for situations where user activity and labeled data are available and can be exploited to filter information on the fly. Our approach, on the other hand, utilizes social annotation only. It is more suitable for batch jobs without user intervention; for example, the automatic resource discovery task for mashup applications [Ambite et al. 2009].

7. CONCLUSION

We have presented a probabilistic model of social annotation that takes into account the users who are creating the annotations. We argued that our model is able to learn a more accurate topic description of a corpus of annotated resources by exploiting individual variations in user interests and vocabulary to help disambiguate tags. Our experimental results on collections of annotated Web resources from the social bookmarking site Delicious show that our model can effectively exploit social annotation on the resource discovery task. One issue that our model does not address is tag bias, probably caused by the expressiveness of users with high interest in a certain domain. In general, a few users use many more tags than others in annotating resources. This biases the model toward these users' annotations, causing the learned topic distributions to deviate from the actual distributions. One possible way to compensate for this is to tie the number of tags to individual interests in the model. ITM also does not at present allow us to include other sources of evidence about documents, e.g., their contents.
It would be interesting to extend ITM to include content words, which would make the model more attractive for Information Retrieval tasks. Since our model is more computationally expensive than models that ignore user information, e.g., LDA, it is not practical to blindly apply our approach to all data sets. Specifically, our model cannot exploit individual variation in data that has low tag ambiguity and small individual variation, as shown in Section 4.1. In this case, our model can only produce a small improvement, or even performance similar to that of the simpler models. For practical reasons, a heuristic for determining the level of tag ambiguity and user variation would be very beneficial in deciding whether the complex model is preferable to the simpler one. Ratios between the number of tags and the number of users, or the number of resources, may provide some clues. Since we model the social annotation process by taking into account all essential entities, namely users, resources and tags, we can apply the model to other applications. For example, one can straightforwardly apply the model to personalize search [Wu et al. 2006; Lerman et al. 2007]. It can also be used to suggest tags to a user annotating a new resource, in the same spirit as rating predictions in collaborative filtering.

Appendix

We begin by deriving the Gibbs sampling equations for ITM in Section 3 from the joint probability of t, x and z over all tuples. Suppose that we have n tuples.
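The derivation below culminates in the count-ratio updates of Eqs. (20) and (21). As a preview, a collapsed Gibbs sweep built on those two updates can be sketched as follows (toy data; all names, shapes, and hyperparameter values are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy dimensions: U users, R resources, T distinct tags (so N_T = T),
# with truncation levels NZ topics and NX interests.
U, R, T, NZ, NX = 4, 5, 6, 3, 2
alpha, beta, eta = 1.0, 1.0, 1.0

# Toy (user, resource, tag) tuples, with one topic z and interest x each.
tuples = [(rng.integers(U), rng.integers(R), rng.integers(T)) for _ in range(50)]
z = rng.integers(NZ, size=len(tuples))
x = rng.integers(NX, size=len(tuples))

# Count tables appearing in Eqs. (20) and (21).
Nrz = np.zeros((R, NZ)); Nux = np.zeros((U, NX))
Nxzt = np.zeros((NX, NZ, T)); Nxz = np.zeros((NX, NZ))
for i, (u, r, t) in enumerate(tuples):
    Nrz[r, z[i]] += 1; Nux[u, x[i]] += 1
    Nxzt[x[i], z[i], t] += 1; Nxz[x[i], z[i]] += 1

def gibbs_sweep():
    """One pass over all tuples, resampling z_i by Eq. (20)
    and x_i by Eq. (21), using '-i' counts throughout."""
    for i, (u, r, t) in enumerate(tuples):
        # Remove tuple i from the topic-side counts; note Nrz[r].sum()
        # then equals N_r - 1, so the denominator below is N_r + alpha - 1.
        Nrz[r, z[i]] -= 1; Nxzt[x[i], z[i], t] -= 1; Nxz[x[i], z[i]] -= 1
        p = (Nrz[r] + alpha / NZ) / (Nrz[r].sum() + alpha) \
            * (Nxzt[x[i], :, t] + eta / T) / (Nxz[x[i]] + eta)
        z[i] = rng.choice(NZ, p=p / p.sum())
        Nrz[r, z[i]] += 1; Nxzt[x[i], z[i], t] += 1; Nxz[x[i], z[i]] += 1

        # Same removal trick for the interest-side counts, per Eq. (21).
        Nux[u, x[i]] -= 1; Nxzt[x[i], z[i], t] -= 1; Nxz[x[i], z[i]] -= 1
        q = (Nux[u] + beta / NX) / (Nux[u].sum() + beta) \
            * (Nxzt[:, z[i], t] + eta / T) / (Nxz[:, z[i]] + eta)
        x[i] = rng.choice(NX, p=q / q.sum())
        Nux[u, x[i]] += 1; Nxzt[x[i], z[i], t] += 1; Nxz[x[i], z[i]] += 1

for _ in range(5):
    gibbs_sweep()
```

Each factor is exactly one of the count ratios derived below: how well topic z describes resource r, how likely tag t is under interest x and topic z, and how strongly user u holds interest x.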
Their joint probability is defined as

p(t_i, x_i, z_i; i = 1:n)
  = \int p(t_i, x_i, z_i \mid \psi, \phi, \theta; i = 1:n) \, p(\psi, \phi, \theta) \, d\langle \psi, \phi, \theta \rangle
  = c \cdot \int \prod_{i=1:n} \left( \psi_{u_i, x_i} \cdot \phi_{r_i, z_i} \cdot \theta_{t_i, z_i, x_i} \right) \cdot \prod_{u,x} \psi_{u,x}^{\beta/N_X - 1} \cdot \prod_{r,z} \phi_{r,z}^{\alpha/N_Z - 1} \cdot \prod_{t,x,z} \theta_{t,x,z}^{\eta/N_T - 1} \, d\langle \psi, \phi, \theta \rangle
  = c \cdot \int \prod_{u,x} \psi_{u,x}^{\sum_i \delta_u(x_i,x) + \beta/N_X - 1} \, d\psi \cdot \int \prod_{r,z} \phi_{r,z}^{\sum_i \delta_r(z_i,z) + \alpha/N_Z - 1} \, d\phi \cdot \int \prod_{t,z,x} \theta_{t,z,x}^{\sum_i \delta_{z,x}(t_i,t) + \eta/N_T - 1} \, d\theta
  = c \cdot \prod_r \frac{\prod_z \Gamma(\sum_i \delta_r(z_i,z) + \alpha/N_Z)}{\Gamma(N_r + \alpha)} \cdot \prod_u \frac{\prod_x \Gamma(\sum_i \delta_u(x_i,x) + \beta/N_X)}{\Gamma(N_u + \beta)} \cdot \prod_{z,x} \frac{\prod_t \Gamma(\sum_i \delta_{z,x}(t_i,t) + \eta/N_T)}{\Gamma(N_{z,x} + \eta)}     (14)

where

c = \left( \frac{\Gamma(\alpha)}{\Gamma(\alpha/N_Z)^{N_Z}} \right)^{N_R} \cdot \left( \frac{\Gamma(\beta)}{\Gamma(\beta/N_X)^{N_X}} \right)^{N_U} \cdot \left( \frac{\Gamma(\eta)}{\Gamma(\eta/N_T)^{N_T}} \right)^{N_Z N_X}

and \delta_r(z_i, z) is a function which returns 1 if z_i = z and r_i = r, and 0 otherwise. N_r denotes the number of all tuples associated with resource r. Similarly, N_{x,z} denotes the number of all tuples associated with interest x and topic z. By rearranging Eq. (14), we obtain

p(t_i, x_i, z_i; i = 1:n)
  = \prod_r \frac{\Gamma(\alpha)}{\Gamma(N_r + \alpha)} \cdot \prod_{r,z} \frac{\Gamma(\sum_i \delta_r(z_i,z) + \alpha/N_Z)}{\Gamma(\alpha/N_Z)} \cdot \prod_u \frac{\Gamma(\beta)}{\Gamma(N_u + \beta)} \cdot \prod_{u,x} \frac{\Gamma(\sum_i \delta_u(x_i,x) + \beta/N_X)}{\Gamma(\beta/N_X)} \cdot \prod_{x,z} \frac{\Gamma(\eta)}{\Gamma(N_{x,z} + \eta)} \cdot \prod_{x,z,t} \frac{\Gamma(\sum_i \delta_{x,z}(t_i,t) + \eta/N_T)}{\Gamma(\eta/N_T)}     (15)

Suppose that we have a new tuple, which we index by k (say k = n + 1 for convenience). From Eq.
(15), we can derive the joint probability of this new tuple k and all previous tuples as follows:

p(t_k, x_k, z_k, t_i, x_i, z_i; i = 1:n)
  = \frac{\Gamma(\alpha)}{\Gamma(N_{r_k} + \alpha + 1)} \cdot \prod_{r \neq r_k} \frac{\Gamma(\alpha)}{\Gamma(N_r + \alpha)} \cdot \frac{\Gamma(\sum_i \delta_{r_k}(z_i, z_k) + \alpha/N_Z + 1)}{\Gamma(\alpha/N_Z)} \cdot \prod_{(r,z) \neq (r_k,z_k)} \frac{\Gamma(\sum_i \delta_r(z_i,z) + \alpha/N_Z)}{\Gamma(\alpha/N_Z)}
  \cdot \frac{\Gamma(\beta)}{\Gamma(N_{u_k} + \beta + 1)} \cdot \prod_{u \neq u_k} \frac{\Gamma(\beta)}{\Gamma(N_u + \beta)} \cdot \frac{\Gamma(\sum_i \delta_{u_k}(x_i, x_k) + \beta/N_X + 1)}{\Gamma(\beta/N_X)} \cdot \prod_{(u,x) \neq (u_k,x_k)} \frac{\Gamma(\sum_i \delta_u(x_i,x) + \beta/N_X)}{\Gamma(\beta/N_X)}
  \cdot \frac{\Gamma(\eta)}{\Gamma(N_{x_k,z_k} + \eta + 1)} \cdot \prod_{(x,z) \neq (x_k,z_k)} \frac{\Gamma(\eta)}{\Gamma(N_{x,z} + \eta)} \cdot \frac{\Gamma(\sum_i \delta_{x_k,z_k}(t_i, t_k) + \eta/N_T + 1)}{\Gamma(\eta/N_T)} \cdot \prod_{(x,z,t) \neq (x_k,z_k,t_k)} \frac{\Gamma(\sum_i \delta_{x,z}(t_i,t) + \eta/N_T)}{\Gamma(\eta/N_T)}     (16)

For the tuple k, suppose that we only know the values of x_k and t_k, while that of z_k is unknown. The joint probability of all tuples, excluding z_k, is as follows:

p(t_k, x_k, t_i, x_i, z_i; i = 1:n)
  = \frac{\Gamma(\alpha)}{\Gamma(N_{r_k} + \alpha)} \cdot \prod_{r \neq r_k} \frac{\Gamma(\alpha)}{\Gamma(N_r + \alpha)} \cdot \frac{\Gamma(\sum_i \delta_{r_k}(z_i, z_k) + \alpha/N_Z)}{\Gamma(\alpha/N_Z)} \cdot \prod_{(r,z) \neq (r_k,z_k)} \frac{\Gamma(\sum_i \delta_r(z_i,z) + \alpha/N_Z)}{\Gamma(\alpha/N_Z)}
  \cdot \frac{\Gamma(\beta)}{\Gamma(N_{u_k} + \beta + 1)} \cdot \prod_{u \neq u_k} \frac{\Gamma(\beta)}{\Gamma(N_u + \beta)} \cdot \frac{\Gamma(\sum_i \delta_{u_k}(x_i, x_k) + \beta/N_X + 1)}{\Gamma(\beta/N_X)} \cdot \prod_{(u,x) \neq (u_k,x_k)} \frac{\Gamma(\sum_i \delta_u(x_i,x) + \beta/N_X)}{\Gamma(\beta/N_X)}
  \cdot \frac{\Gamma(\eta)}{\Gamma(N_{x_k,z_k} + \eta)} \cdot \prod_{(x,z) \neq (x_k,z_k)} \frac{\Gamma(\eta)}{\Gamma(N_{x,z} + \eta)} \cdot \frac{\Gamma(\sum_i \delta_{x_k,z_k}(t_i, t_k) + \eta/N_T)}{\Gamma(\eta/N_T)} \cdot \prod_{(x,z,t) \neq (x_k,z_k,t_k)} \frac{\Gamma(\sum_i \delta_{x,z}(t_i,t) + \eta/N_T)}{\Gamma(\eta/N_T)}     (17)

By dividing Eq. (16) by Eq. (17), we obtain the posterior probability of z_k given all other variables as follows:
p(z_k \mid t_k, x_k, t_i, x_i, z_i; i = 1:n)
  = \frac{\Gamma(N_{r_k} + \alpha)}{\Gamma(N_{r_k} + \alpha + 1)} \cdot \frac{\Gamma(\sum_i \delta_{r_k}(z_i, z_k) + \alpha/N_Z + 1)}{\Gamma(\sum_i \delta_{r_k}(z_i, z_k) + \alpha/N_Z)} \cdot \frac{\Gamma(N_{x_k,z_k} + \eta)}{\Gamma(N_{x_k,z_k} + \eta + 1)} \cdot \frac{\Gamma(\sum_i \delta_{x_k,z_k}(t_i, t_k) + \eta/N_T + 1)}{\Gamma(\sum_i \delta_{x_k,z_k}(t_i, t_k) + \eta/N_T)}
  = \frac{\sum_i \delta_{r_k}(z_i, z_k) + \alpha/N_Z}{N_{r_k} + \alpha} \cdot \frac{\sum_i \delta_{x_k,z_k}(t_i, t_k) + \eta/N_T}{N_{x_k,z_k} + \eta}
  = \frac{N_{r_k,z_k} + \alpha/N_Z}{N_{r_k} + \alpha} \cdot \frac{N_{x_k,z_k,t_k} + \eta/N_T}{N_{x_k,z_k} + \eta}     (18)

Intuitively, we can see from Eq. (18) that the first factor, (N_{r_k,z_k} + \alpha/N_Z) / (N_{r_k} + \alpha), tells us how likely resource r is to be described by topic z, while the second factor, (N_{x_k,z_k,t_k} + \eta/N_T) / (N_{x_k,z_k} + \eta), tells us how likely tag t is to be chosen given interest x and topic z. Similarly, we can obtain the posterior probability of x_k as we did for z_k:

p(x_k \mid t_k, z_k, t_i, x_i, z_i; i = 1:n) = \frac{N_{u_k,x_k} + \beta/N_X}{N_{u_k} + \beta} \cdot \frac{N_{x_k,z_k,t_k} + \eta/N_T}{N_{x_k,z_k} + \eta}     (19)

We can now generalize Eq. (18) and Eq. (19) for sampling the posterior probabilities of topic z and interest x of the present tuple i given all other tuples. We define N_{r_i, z_{-i}} as the number of all tuples having r = r_i and topic z, excluding the present tuple i. Similarly, N_{z_{-i}, x_i, t_i} is the number of all tuples having x = x_i, t = t_i and topic z, excluding the present tuple i; z_{-i} represents all topic assignments except that of tuple i.

p(z_i \mid z_{-i}, x, t) = \frac{N_{r_i, z_{-i}} + \alpha/N_Z}{N_{r_i} + \alpha - 1} \cdot \frac{N_{z_{-i}, x_i, t_i} + \eta/N_T}{N_{z_{-i}, x_i} + \eta}     (20)

p(x_i \mid x_{-i}, z, t) = \frac{N_{u_i, x_{-i}} + \beta/N_X}{N_{u_i} + \beta - 1} \cdot \frac{N_{x_{-i}, z_i, t_i} + \eta/N_T}{N_{x_{-i}, z_i} + \eta}     (21)

ACKNOWLEDGMENTS

We would like to thank the anonymous reviewers for providing useful comments and suggestions to improve the manuscript. This material is based in part upon work supported by the National Science Foundation under Grant Numbers CMMI-0753124 and IIS-0812677. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

REFERENCES

AMBITE, J. L., DARBHA, S., GOEL, A., KNOBLOCK, C. A., LERMAN, K., PARUNDEKAR, R., AND RUSS, T. A. 2009. Automatically constructing semantic web services from online sources. In Proceedings of the International Semantic Web Conference. 17–32.
BLEI, D. M., NG, A. Y., AND JORDAN, M. I. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3, 993–1022.
BUNTINE, W., PERTTU, S., AND TUULOS, V. 2004. Using discrete PCA on web pages. In Proceedings of the ECML Workshop on Statistical Approaches to Web Mining.
BUNTINE, W. L. 1994. Operations for learning with graphical models. Journal of Artificial Intelligence Research 2, 159–225.
ESCOBAR, M. D. AND WEST, M. 1995. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association 90, 577–588.
GILKS, W., RICHARDSON, S., AND SPIEGELHALTER, D. 1996. Markov Chain Monte Carlo in Practice. Interdisciplinary Statistics. Chapman & Hall.
GOLDER, S. A. AND HUBERMAN, B. A. 2006. Usage patterns of collaborative tagging systems.
Journal of Information Science 32, 2 (April), 198–208.
GRIFFITHS, T. L. AND STEYVERS, M. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America 101, 5228–5235.
HOFMANN, T. 1999. Probabilistic latent semantic analysis. In Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence. 289–296.
HOFMANN, T. 2001. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning 42, 1–2, 177–196.
JIN, R., SI, L., AND ZHAI, C. 2006. A study of mixture models for collaborative filtering. Information Retrieval 9, 3, 357–382.
LERMAN, K., PLANGPRASOPCHOK, A., AND WONG, C. 2007. Personalizing image search results on Flickr. In Proceedings of the AAAI Workshop on Intelligent Web Personalization.
LI, R., BAO, S., YU, Y., FEI, B., AND SU, Z. 2007. Towards effective browsing of large scale social annotations. In Proceedings of the 16th International Conference on World Wide Web. 943–952.
LIN, J. 1991. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory 37, 1, 145–151.
MARLIN, B. 2004. Collaborative filtering: A machine learning perspective. M.S. thesis, University of Toronto, Toronto, Ontario, Canada.
MCCALLUM, A., WANG, X., AND CORRADA-EMMANUEL, A. 2007. Topic and role discovery in social networks with experiments on Enron and academic email. Journal of Artificial Intelligence Research 30, 249–272.
MIKA, P. 2007. Ontologies are us: A unified model of social networks and semantics. Web Semantics 5, 1, 5–15.
MINKA, T. P. 2001. Expectation propagation for approximate Bayesian inference. In Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence. 362–369.
NEAL, R. M. 2000.
Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics 9, 2, 249–265.
PLANGPRASOPCHOK, A. AND LERMAN, K. 2007. Exploiting social annotation for automatic resource discovery. In Proceedings of the AAAI Workshop on Information Integration on the Web.
POPESCUL, A., UNGAR, L., PENNOCK, D., AND LAWRENCE, S. 2001. Probabilistic models for unified collaborative and content-based recommendation in sparse-data environments. In Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence. 437–444.
RASMUSSEN, C. E. 2000. The infinite Gaussian mixture model. In Proceedings of Advances in Neural Information Processing Systems 12. 554–560.
RATTENBURY, T., GOOD, N., AND NAAMAN, M. 2007. Towards automatic extraction of event and place semantics from Flickr tags. In Proceedings of the 30th Annual ACM SIGIR Conference on Research and Development in Information Retrieval. 103–110.
RITTER, C. AND TANNER, M. A. 1992. Facilitating the Gibbs sampler: The Gibbs stopper and the griddy-Gibbs sampler. Journal of the American Statistical Association 87, 419, 861–868.
ROSEN-ZVI, M., GRIFFITHS, T., STEYVERS, M., AND SMYTH, P. 2004. The author-topic model for authors and documents. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence. 487–494.
SAHU, S. K. AND ROBERTS, G. O. 1999. On convergence of the EM algorithm and the Gibbs sampler. Statistics and Computing 9, 9–55.
SCHMITZ, P. 2006. Inducing ontology from Flickr tags. In Proceedings of the WWW Workshop on Collaborative Web Tagging.
STEYVERS, M. AND GRIFFITHS, T. 2006.
Probabilistic topic models. In Latent Semantic Analysis: A Road to Meaning, T. Landauer, D. McNamara, S. Dennis, and W. Kintsch, Eds.
TEH, Y. W., JORDAN, M. I., BEAL, M. J., AND BLEI, D. M. 2006. Hierarchical Dirichlet processes. Journal of the American Statistical Association 101, 1566–1581.
WU, X., ZHANG, L., AND YU, Y. 2006. Exploring social annotations for the semantic web. In Proceedings of the 15th International Conference on World Wide Web. 417–426.
ZHOU, D., BIAN, J., ZHENG, S., ZHA, H., AND GILES, C. L. 2008. Exploring social annotations for information retrieval. In Proceedings of the 17th International Conference on World Wide Web. 715–724.