Permutation Models for Collaborative Ranking
Authors: Truyen Tran, Svetha Venkatesh
Department of Computing, Curtin University, Australia
Feb 2010

Abstract

We study the problem of collaborative filtering where ranking information is available. Focusing on the core of the collaborative ranking process, the user and their community, we propose new models for representation of the underlying permutations and prediction of ranks. The first approach is based on the assumption that the user makes successive choices of items in a stage-wise manner. In particular, we extend the Plackett-Luce model in two ways: introducing parameter factoring to account for user-specific contributions, and modelling the latent community in a generative setting. The second approach relies on log-linear parameterisation, which relaxes the discrete-choice assumption but makes learning and inference much more involved. We propose MCMC-based learning and inference methods and derive linear-time prediction algorithms.

Keywords: permutation, ranking, collaborative filtering.

1 Introduction

Collaborative filtering is an important class of problems with the promise to deliver personalised services. Members of communities rate items in a service, and strong patterns exist between similar communities of users. These patterns can be exploited to produce ranked lists of items from a set of items not previously exposed to the user. Research in recommendation systems models user preferences through a numerical rating, for example, rating a movie as 4 or 5 stars. Although these users are forced into numeric scoring, the scores are assigned qualitatively and do not carry the assumed rigour of quantitative evaluation. This also limits the expressiveness of preferences. For example, a more intuitive way is to express the order of preferences for a set of items.
It may be easier to rank a set of movies, or the top 10 places visited, than to assign them numeric scores. Importantly, in recommendation systems the core value proposition is to recommend unseen items; this is where ranking, rather than actual rating, becomes significant.

This paper addresses the open problem of recommending a ranked list of items, or a preference list, without requiring intermediate ratings, in collaborative filtering systems. Each user provides a ranked list of items in decreasing order of preference. The list need not be complete; a user typically rates 10 or 20 items. The intuition in collaborative filtering is that the community as a whole may cover thousands of items, and as users belong to clusters within this community, the properties of rankings within such clusters can be transferred to a user for items that user has not seen. The technical issue is to model the ranked item set both for a user and the community, and to predict the rank of unseen items for each user.

Despite its importance, the collaborative ranking problem has only been attempted recently [10, 11, 7]. The papers [11, 7] consider pairwise preferences, ignoring the simultaneous interaction between items. Listwise approaches, studied in statistics (e.g. see [8, 9, 4]), often involve a relatively small set of items (e.g. in an election, typically fewer than a dozen candidates are considered). Further, statisticians are interested in the distribution of ranks in the population rather than in properties of individuals.
Collaborative ranking, on the other hand, differs in three ways: a) the scale is significantly different, as sometimes there are millions of items; b) the data is highly sparse, that is, users will typically express their preferences over only a few items; and c) the personalisation aspect is crucial, and thus the distribution of ranks per user is more important.

In this paper, focusing on the user, we study two approaches for modelling the rank or preference lists. Our first approach assumes that the user, when ranking items, makes successive choices in a stage-wise manner. We extend one of the most well-known methods, namely the Plackett-Luce model, to effectively model user-specific rank distributions in two ways. First, we introduce parameter factoring into user-specific and item-specific parameters. Second, we employ a generative framework which models the community the user belongs to as a latent layer, enabling richer modelling of the community structure in the ranking generation process. We provide algorithms for learning the model parameters and for ranking unseen items in linear time. The approach is detailed in Section 3.

The second approach relaxes the stage-wise choice assumption and models intrinsic features of the permutation in a log-linear setting. Potentials in the model capture the likelihood of an item appearing in a specific position and, for all item pairs, the likelihood of the first item being ordered before the second. Although exact learning and inference is intractable, we show that truncated MCMC techniques are effective for learning, and that prediction can be computed in linear time. The approach is described in Section 4.
The novelty in our contribution lies in the proposal of two approaches incorporating key aspects of collaborative ranking: the user, their specific communities, and the nature of the ranking list itself. The work contributes efficient methods for learning and prediction.

2 Preliminaries

Suppose that we have a data set of N users and M items, and each user u ∈ {1, 2, ..., N} provides a list of n_u ≤ M ranked items \pi^u = (\pi^u_1, \pi^u_2, ..., \pi^u_{n_u}), where \pi^u_i is the index of the item in position i. For notational simplicity, we will drop the explicit superscript u in \pi^u when there is no confusion, and use y = \pi_i when we mention the item y ∈ {1, 2, ..., M} in position i.

The goal is to effectively model the distribution P(\pi|u). The main difficulty is that the number of permutations is n_u!, which is only tractable for small n_u. A simplified way is to examine the ordering between only two items (e.g. see [11, 7]). Denote by s^u_{\pi_i} the scoring function when the item is positioned at i in the list \pi of user u. Let us consider the following quantity

    d^u_{ij} = \mathrm{sign}(j - i)\,(s^u_{\pi_i} - s^u_{\pi_j}).

Basically d^u_{ij} is positive when the scoring functions {s^u_y, s^u_{y'}} agree with the items' relative positions in the list, and negative otherwise. For simplicity, we assume the factoring s^u_y = \sum_{k=1}^{K} W_{uk} H_{ky}, where W ∈ R^{N×K} and H ∈ R^{K×M} for some K < min{M, N}. Thus the learning goal is to estimate {W, H} so that the {d^u_{ij}} are positive for all triples (u, i, j) in the training data, where 1 ≤ i < j ≤ n_u. This suggests a regularised loss function of the form

    R = \frac{1}{N} \sum_u \sum_{i=1}^{n_u} \sum_{j=i+1}^{n_u} L(d^u_{ij}) + \Omega(W, H),

where L(d^u_{ij}) is the user-specific loss and \Omega(W, H) = \alpha \sum_{u,k} W^2_{uk} + \beta \sum_{y,k} H^2_{ky} is the regularising component.
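As an illustration (not part of the original formulation), the pairwise part of this objective can be sketched in Python with the large-margin loss; the toy sizes N, M, K and the function name are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions: N users, M items, K latent factors.
N, M, K = 4, 6, 2
W = rng.normal(size=(N, K))   # user-specific parameters
H = rng.normal(size=(K, M))   # item-specific parameters

def hinge_pairwise_loss(user, ranked_items, W, H):
    """Sum of large-margin losses max(0, 1 - d_ij) over all pairs i < j
    in the user's ranked list (earlier position = more preferred)."""
    s = W[user] @ H[:, ranked_items]   # factored scores s_y^u for listed items
    loss = 0.0
    for i in range(len(ranked_items)):
        for j in range(i + 1, len(ranked_items)):
            d_ij = s[i] - s[j]         # sign(j - i) = +1 since j > i
            loss += max(0.0, 1.0 - d_ij)
    return loss

print(hinge_pairwise_loss(0, [3, 1, 5], W, H))
```

Training would minimise this quantity, plus the quadratic penalties on W and H, over all users.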
Popular choices of L(d^u_{ij}) are L(d^u_{ij}) = (1 - d^u_{ij})^2 in regression; max(0, 1 - d^u_{ij}) in the large-margin setting; and log(1 + exp{-d^u_{ij}}) in logistic regression.

3 Latent Discrete Choice Models

We now address the listwise models, starting from the assumption that the user makes the ranking decision in a stage-wise manner. We will focus on the Plackett-Luce model [9]

    P(\pi) = \prod_{i=1}^{M} \frac{e^{s_{\pi_i}}}{\sum_{j=i}^{M} e^{s_{\pi_j}}},        (1)

where s_{\pi_i} is the score associated with the item at position i in the permutation \pi. The probability that an item is chosen as the first in the list is e^{s_{\pi_1}} / \sum_{j=1}^{M} e^{s_{\pi_j}}. Once this item has been chosen, the probability that the next item is chosen as the second from the remaining M - 1 items is e^{s_{\pi_2}} / \sum_{j=2}^{M} e^{s_{\pi_j}}. The process repeats until all items have been chosen in appropriate positions.

However, this model is not suitable for collaborative ranking, because it does not carry any personalised information and lacks the concept of community among users. We now introduce our extensions, first by modelling the user-specific distribution P(\pi|u) (Section 3.1), and then by proposing community-generated choice making (Section 3.2).

3.1 Factored Benter-Plackett-Luce Model

In collaborative ranking, we are interested in modelling the choices by each user, and the permutation \pi given by a user is incomplete (i.e. the user often ranks a very small subset of items). We thus introduce a user-specific model as

    P(\pi|u) = \prod_{i=1}^{n_u} \frac{e^{s^u_{\pi_i}}}{\sum_{j=i}^{n_u} e^{s^u_{\pi_j}}}.

Thus s^u_{\pi_i} is the ranking score for the item at position i (under \pi) by user u. However, this model does not account for the order at the beginning of the list being more important than that at the end. We employ the technique by [1], introducing damping factors \rho_1 \ge \rho_2 \ge ... \ge \rho_{n_u} \ge 0 as follows:

    P(\pi|u) = \prod_{i=1}^{n_u} \frac{e^{\rho_i s^u_{\pi_i}}}{\sum_{j=i}^{n_u} e^{\rho_i s^u_{\pi_j}}}.

As an example, we may choose \rho_i = 1/\log(1 + i).

In the standard Plackett-Luce model, the set of parameters {s_y} can be estimated from a set of i.i.d. permutation samples. In our adaptation, however, this trick does not work because the score s^u_y would be undefined for unseen items. Instead, we propose to factor s^u_y as

    s^u_y = \sum_{k=1}^{K} W_{uk} H_{ky},

where W ∈ R^{N×K} and H ∈ R^{K×M} for some K < min{M, N} are parameter matrices. The y-th column of H can be considered the feature vector of item y, and the u-th row of W the parameter vector specific to user u. To learn the model parameters, maximum likelihood estimation can be carried out by maximising the following regularised log-likelihood with respect to {W, H}:

    \mathcal{L}(W, H) = \sum_u \log P(\pi|u) - \alpha \|W\|_F^2 - \beta \|H\|_F^2,

for \alpha, \beta > 0. It can be verified that the regularised log-likelihood is concave in either W or H, but not in both jointly. Once the model has been specified, the scores {s^u_y = \sum_{k=1}^{K} W_{uk} H_{ky}} can be used for sorting the items previously not seen by the user, where a larger s^u_y ranks the item higher in the list.

3.2 Latent Semantic Plackett-Luce Model

The model in the previous subsection lacks a generative interpretation: we do not know how the ranking is generated by the user. A principled way is to assume that the user belongs to hidden communities, and that those communities jointly generate the ranking. Recall that in the Plackett-Luce model, the choice of items is made stage-wise: the next item is chosen given that previously chosen items are ahead in the list. Denote by P_i(\pi|z, u) the probability of choosing the item for the i-th position by u with respect to community z, i.e.

    P_i(\pi|z, u) = \frac{e^{s^z_{\pi_i}}}{\sum_{j=i}^{n_u} e^{s^z_{\pi_j}}}.        (2)

Let P(z|u) be the probability that the user belongs to one of the communities z ∈ {1, 2, ..., K}; then the user-specific permutation distribution is defined as

    P(\pi|u) = \prod_{i=1}^{n_u} \sum_z P(z|u) P_i(\pi|z, u).        (3)

Due to the sum in the denominator of Equation 2, we might expect the computation of P(\pi|u) to take n_u(n_u - 1)K/2 time. However, we can compute it in n_u K time by precomputing a recursive array A^z_i = A^z_{i+1} + e^{s^z_{\pi_i}} for 1 ≤ i < n_u. If we start with A^z_{n_u} = e^{s^z_{\pi_{n_u}}}, then clearly A^z_i = \sum_{j=i}^{n_u} e^{s^z_{\pi_j}}, which is the denominator in Equation 2.

3.2.1 Learning using EM

There are two sets of parameters to estimate: the mixture coefficients {P(z|u)} and the community-specific item scores {s^z_y}. We describe an EM algorithm for learning these parameters, starting from the lower bound of the incomplete log-likelihood \mathcal{L} = \sum_u \log P(\pi|u):

    \mathcal{L} = \sum_u \sum_{i=1}^{n_u} \log \sum_z P(z|u) P_i(\pi|z, u)
                \ge \sum_u \sum_{i=1}^{n_u} \sum_z Q_i(z|\pi, u) \log P(z|u) P_i(\pi|z, u) = \mathcal{Q},

where Q_i(z|\pi, u) is defined at each E-step t + 1 as

    Q^{t+1}_i(z|\pi, u) \leftarrow \frac{P^t(z|u)\, P^t_i(\pi|z, u)}{P^t_i(\pi|u)}.

In the M-step, we fix Q_i(z|\pi, u) and estimate {P(z|u), s^z_y} by maximising \mathcal{Q}. We equip the lower bound with the constraint \sum_z P(z|u) = 1 through the Lagrangian function F = \mathcal{Q} + \sum_u \mu_u (\sum_z P(z|u) - 1), where {\mu_u} are Lagrange multipliers. Setting the gradient of the Lagrangian function

    \frac{\partial F}{\partial P(z|u)} = \sum_{i=1}^{n_u} Q_i(z|\pi, u) \frac{1}{P(z|u)} + \mu_u

to zero while maintaining \sum_z P(z|u) = 1 leads to

    P(z|u) \leftarrow \frac{\sum_{i=1}^{n_u} Q_i(z|\pi, u)}{\sum_z \sum_{i=1}^{n_u} Q_i(z|\pi, u)} = \frac{1}{n_u} \sum_{i=1}^{n_u} Q_i(z|\pi, u).

This closed-form update, however, does not apply to {s^z_y}.
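The likelihood of Equation 3 together with the suffix-sum recursion A^z_i can be sketched as follows; the toy sizes K and M, the mixture weights, and all function names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

K, M = 3, 8                       # hypothetical: K communities, M items
s = rng.normal(size=(K, M))       # community-specific item scores s_y^z
p_z = np.full(K, 1.0 / K)         # mixture weights P(z|u) for one user

def latent_pl_likelihood(pi, s, p_z):
    """P(pi|u) = prod_i sum_z P(z|u) e^{s[z,pi_i]} / sum_{j>=i} e^{s[z,pi_j]},
    computed in O(n*K) via the suffix sums A_i^z = sum_{j>=i} e^{s[z,pi_j]}."""
    e = np.exp(s[:, pi])                         # shape (K, n)
    A = np.cumsum(e[:, ::-1], axis=1)[:, ::-1]   # A[z, i] = sum_{j>=i} e[z, j]
    stage = e / A                                # P_i(pi|z,u) at every stage i
    return float(np.prod(p_z @ stage))           # prod_i sum_z P(z|u) P_i(pi|z,u)

pi = [2, 5, 0, 7]
print(latent_pl_likelihood(pi, s, p_z))
```

Because each mixture of stage probabilities sums to one over the candidate items remaining at that stage, these probabilities sum to one over all orderings of a given item set.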
Instead, we resort to a gradient-based method, where

    \frac{\partial \mathcal{Q}}{\partial s^z_y} = \sum_u \sum_{i=1}^{n_u} Q_i(z|\pi, u) \frac{\partial \log P_i(\pi|z, u)}{\partial s^z_y}
    = \sum_u \sum_{i=1}^{n_u} Q_i(z|\pi, u) \left\{ \delta_{y \pi_i} - \frac{\sum_{j=i}^{n_u} e^{s^z_{\pi_j}} \delta_{y \pi_j}}{\sum_{j=i}^{n_u} e^{s^z_{\pi_j}}} \right\},

where \delta_{y \pi_i} = 1 if y = \pi_i and 0 otherwise. Typically, we run only a few updates of s^z_y per M-step.

3.2.2 Prediction

Given that the models are fully specified, we want to output a ranked list of unseen items for each user u. However, finding the optimal ranking for an arbitrary set of items is generally intractable, and thus we resort to finding the rank of just one unseen item at a time, given that the seen items have been sorted. In other words, we fix the order of the old items and then introduce one new item into the model, assuming that this introduction does not change the relative order of the old items. The problem now reduces to finding the position of the new item among the old items. We repeat the process for all new items and determine their positions in the list. If two new items are placed in the same position, their relative ranks are determined by the likelihoods of their introductions.

Let \pi' be the new list after introducing a new item. Denote by \pi_{i:j} the set of items whose positions are from i to j under \pi. Suppose that the new item is placed between the (j-1)-th and the j-th items of the old list \pi, and thus it is in the j-th position of the new list \pi'. Then \pi'_{1:j-1} = \pi_{1:j-1} and \pi'_{j+1:n+1} = \pi_{j:n}. We want to find

    j^* = \arg\max_j P(\pi'_{1:j-1}, \pi'_j, \pi'_{j+1:n+1} \,|\, u),

where

    P(\pi'_{1:j-1}, \pi'_j, \pi'_{j+1:n+1} \,|\, u) = \left[ \prod_{i=1}^{j-1} \sum_z P(z|u) P_i(\pi'|z, u) \right] \left[ \sum_z P(z|u) P_j(\pi'|z, u) \right] \prod_{i=j+1}^{n+1} \sum_z P(z|u) P_i(\pi'|z, u).

Naive computation for finding the optimal j^* would cost n_u(n_u + 1)K/2 steps.
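The naive search can be sketched as follows: try every insertion position and re-evaluate the full list each time. The code is an illustrative sketch only (toy sizes and all names hypothetical), reusing the O(n_u K) likelihood of Equation 3:

```python
import numpy as np

rng = np.random.default_rng(2)

K, M = 3, 8                       # hypothetical sizes
s = rng.normal(size=(K, M))       # community-specific scores s_y^z
p_z = np.full(K, 1.0 / K)         # mixture weights P(z|u)

def latent_pl_likelihood(pi, s, p_z):
    """Likelihood of Equation 3 via suffix sums (O(n*K))."""
    e = np.exp(s[:, pi])
    A = np.cumsum(e[:, ::-1], axis=1)[:, ::-1]
    return float(np.prod(p_z @ (e / A)))

def best_insertion(old_list, new_item, s, p_z):
    """Naively try every position j for the new item, keeping the relative
    order of the old items, and return the most likely position."""
    best_j, best_p = None, -1.0
    for j in range(len(old_list) + 1):
        candidate = old_list[:j] + [new_item] + old_list[j:]
        p = latent_pl_likelihood(candidate, s, p_z)
        if p > best_p:
            best_j, best_p = j, p
    return best_j, best_p

print(best_insertion([2, 5, 0], 7, s, p_z))
```

Each of the n_u + 1 trials re-evaluates the whole list, which is what the recursive odds computation described next avoids.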
Here we provide a solution with just (n_u + 1)K steps. We proceed from left to right in a recursive manner, starting from j = 1. Recall that we can compute P(\pi'_{1:n+1}|u) in Equation 3 in (n_u + 1)K steps. Assume that we have computed the case where the position of the new item is j (under \pi'); we want to compute the case where the new position is j + 1 (under \pi''). Let us examine the odds

    O_j = \frac{P(\pi''_{1:j}, \pi''_{j+1}, \pi''_{j+2:n+1} \,|\, u)}{P(\pi'_{1:j-1}, \pi'_j, \pi'_{j+1:n+1} \,|\, u)}.

We have

    P(\pi''_{1:j}, \pi''_{j+1}, \pi''_{j+2:n+1} \,|\, u) = \left[ \prod_{i=1}^{j-1} \sum_z P(z|u) P_i(\pi''|z, u) \right] \left[ \sum_z P(z|u) P_j(\pi''|z, u) \right] \left[ \sum_z P(z|u) P_{j+1}(\pi''|z, u) \right] \prod_{i=j+2}^{n+1} \sum_z P(z|u) P_i(\pi''|z, u).

We now notice that \pi''_{1:j-1} = \pi'_{1:j-1} and \pi''_{j+2:n+1} = \pi'_{j+2:n+1}, and that P_i(\pi'|z) = P_i(\pi''|z) for all z and for i ∈ {1 : j-1} ∪ {j+2 : n_u+1}. The odds can therefore be simplified as

    O_j = \frac{\left[ \sum_z P(z|u) P_j(\pi''|z, u) \right] \left[ \sum_z P(z|u) P_{j+1}(\pi''|z, u) \right]}{\left[ \sum_z P(z|u) P_j(\pi'|z, u) \right] \left[ \sum_z P(z|u) P_{j+1}(\pi'|z, u) \right]},        (4)

which costs K time to evaluate. Consequently, the recursive process costs (n_u + 1)K time steps in total.

4 Log-linear Models

In this section, we propose a second approach to permutation modelling. The main difference from the first approach is that we do not make the discrete-choice assumption; this makes the parameterisation more flexible, but it complicates learning and inference. We now rely on the log-linear parameterisation. The generic conditional distribution is defined as

    P(\pi|u) = \frac{1}{Z(u)} \left[ \prod_{i=1}^{n_u} \phi_\pi(i, u) \right] \prod_{i=1}^{n_u - 1} \prod_{j=i+1}^{n_u} \phi_\pi(i, j),        (5)

where \phi_\pi(i, u) and \phi_\pi(i, j) are positive potential functions and Z(u) is the normalising constant (a.k.a. the partition function).
The position-wise potential \phi_\pi(i, u) captures the likelihood that a particular item y = \pi_i is placed in position i by user u. For example, we might expect a particular movie to be among the top 5% in the list of a user. The pairwise potential \phi_\pi(i, j), on the other hand, encodes the likelihood that the item y = \pi_i is preferred to the item y' = \pi_j. In what follows, we will make use of the energy notation, i.e. \phi_\pi(i, u) = \exp\{-E(\pi_i, u)\} and \phi_\pi(i, j) = \exp\{-E(\pi_i, \pi_j)\}. The energy of the permutation \pi is therefore the sum of component energies, i.e. E(\pi, u) = \sum_i E(\pi_i, u) + \sum_i \sum_{j>i} E(\pi_i, \pi_j).

4.1 MCMC for Inference

Inference in the above generic model is intractable due to the partition function Z(u), which requires \frac{1}{2} n_u^2 (n_u - 1)^2 (n_u - 2)! computational steps (there are n_u! permutations, each requiring \frac{1}{2} n_u (n_u - 1) steps for computing the product of potentials). We thus resort to MCMC methods. The key is to design a proposal distribution that helps the random walk quickly reach the high-density regions. There is also a trade-off here, because large steps mean significant distortion of the current permutation, resulting in more computational cost per move. We consider three types of local moves.

Item relocation. Randomly pick one item in the list and relocate it, keeping the relative order of the rest unchanged. For example, if the permutation is [A, B, C, D, E, F] and B is relocated to the place between E and F, the new permutation is [A, C, D, E, B, F]. Generally, this type of move costs O(n_u) operations per move due to the change in relative preference orders. In the example under consideration, the pairs BC, BD, BE change to CB, DB, EB.

Item swapping. Randomly pick two items and swap their positions, leaving the other items unchanged.
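The three local moves can be sketched as simple list operations; the function names and the example list are illustrative only:

```python
from itertools import permutations

def relocate(pi, i, j):
    """Move the item at position i to position j, keeping the relative
    order of all other items (the 'item relocation' move)."""
    pi = list(pi)
    item = pi.pop(i)
    pi.insert(j, item)
    return pi

def swap(pi, i, j):
    """Exchange the items at positions i and j ('item swapping')."""
    pi = list(pi)
    pi[i], pi[j] = pi[j], pi[i]
    return pi

def sublist_perms(pi, start, delta):
    """All delta! - 1 alternative lists obtained by permuting the sublist
    pi[start:start+delta] in place ('sublist permutation')."""
    out = []
    for sub in permutations(pi[start:start + delta]):
        cand = list(pi[:start]) + list(sub) + list(pi[start + delta:])
        if cand != list(pi):
            out.append(cand)
    return out

pi = ['A', 'B', 'C', 'D', 'E', 'F']
print(relocate(pi, 1, 4))   # B moved between E and F
```

Each move returns a new list containing exactly the same items, so a chain built from these proposals always stays inside the space of permutations.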
In the above example, if we swap B and E, the new permutation is [A, E, C, D, B, F]. This also costs O(n_u) operations per move.

Sublist permutation. Randomly pick a small sublist and try all permutations within this sublist. For example, the sublist [B, C, D] yields [C, B, D], [B, D, C], [D, C, B], [C, D, B], [D, B, C]. This costs \Delta! operations, where \Delta is the size of the sublist. When \Delta = 2, this reduces to the special case of item swapping.

Since the proposals are symmetric, the acceptance probability in the Metropolis-Hastings method is simply

    P = \min\{1, e^{-\Delta E}\},        (6)

where \Delta E is the change in model energy due to the proposed move.

4.2 Learning with Truncated MCMC

Learning using maximum likelihood is intractable due to the computation of Z(u) and its gradient, and thus MCMC-based learning can be employed. The assumption is that if we generate enough samples according to the model distribution, the gradient of the log-likelihood can be accurately estimated, and learning can proceed. However, this is clearly too expensive, because generally we would need a significantly large number of samples per gradient evaluation. Instead, Hinton [5] proposes a simple technique called Contrastive Divergence (CD) that has been shown to work well in standard Boltzmann machines. The idea is that instead of starting the Markov chain randomly and running it forever, we can start from the observed configuration and run for just a few steps. This is enough to relax the model away from the empirical distribution. Here we adopt CD, but we should stress in passing that the application of CD in the context of permutation modelling is novel. It is possible that we need only run one short Markov chain of length n_u with the item-swapping moves.
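A minimal Metropolis-Hastings chain with item-swapping proposals, run CD-style for only n_u steps from the observed list, can be sketched as follows. The position-wise energy of Section 4.5.1 is used as a stand-in energy; the scores, seeds, and function names are hypothetical:

```python
import math
import random

import numpy as np

rng = np.random.default_rng(3)
random.seed(3)

n = 6
s = rng.normal(size=n)                  # hypothetical scores s_y^u for the n listed items

def g(i, n):
    """Damping g(i,u) = (1 + n - 2i)/n, written for 0-based position i."""
    return (1 + n - 2 * (i + 1)) / n

def energy(pi):
    """Position-wise permutation energy E(pi,u) = -sum_i s_{pi_i} g(i,u);
    pairwise potentials are omitted for brevity."""
    return -sum(s[y] * g(i, n) for i, y in enumerate(pi))

def mh_swap_chain(pi, steps):
    """Metropolis-Hastings with symmetric item-swapping proposals:
    accept a swap with probability min(1, exp(-dE))."""
    pi = list(pi)
    E = energy(pi)
    for _ in range(steps):
        i, j = random.sample(range(len(pi)), 2)
        pi[i], pi[j] = pi[j], pi[i]     # propose the swap
        E_new = energy(pi)
        dE = E_new - E
        if dE <= 0 or random.random() < math.exp(-dE):
            E = E_new                   # accept
        else:
            pi[i], pi[j] = pi[j], pi[i] # reject: undo the swap
    return pi

# CD-style short chain: start from the "observed" list and run only n steps.
sample = mh_swap_chain(list(range(n)), steps=n)
print(sample)
```

In CD learning, the statistics of `sample` would be contrasted against those of the observed list to form an approximate gradient.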
4.3 Learning with Pseudo-likelihood

In standard graphical models, pseudo-likelihood is an efficient alternative to the full likelihood, and it is provably consistent given sufficient regularity in the model structure. However, this concept has no straightforward application in permutation models. We attempt to consider the pseudo-likelihood concept from a more abstract level. There is a close relationship between pseudo-likelihood and MCMC techniques. The difference is that in MCMC we randomly choose one local permutation configuration, while in pseudo-likelihood we consider all local configurations, and thus the process is deterministic. Using this idea, the (log) pseudo-likelihood can be written as

    \mathcal{L}_{pseudo} = \sum_u \sum_c \log P(\pi_c \,|\, \pi_{\neg c}, u), \quad \text{where} \quad
    P(\pi_c \,|\, \pi_{\neg c}, u) = \frac{\exp\{-E(\pi_c, \pi_{\neg c}, u)\}}{\sum_{\pi'_c} \exp\{-E(\pi'_c, \pi'_{\neg c}, u)\}},

and c denotes the index of the local structure and \neg c denotes the rest of the items, whose relative positions remain unchanged. We briefly discuss three types of local structure.

Item relocation. All items are considered, each with the following local distribution

    P(\pi_i \,|\, \pi_{\neg i}, u) = \frac{\exp\{-E(\pi_{1:i-1}, \pi_i, \pi_{i+1:n}, u)\}}{\sum_{j=1}^{n} \exp\{-E(\pi'_{1:j-1}, \pi'_j, \pi'_{j+1:n}, u)\}}

for 1 ≤ i ≤ n_u. Since the denominator is a sum over n_u positions, each requiring n_u - 1 pairwise energies, naively computing P(\pi_i \,|\, \pi_{\neg i}, u) would take n_u(n_u - 1) steps. However, we can compute the denominator in a single pass. Suppose the item y = \pi_i moves from the current position j (under \pi') to j + 1 (under \pi''); then the change in energy is

    \Delta E_j(\pi' \to \pi'', u) = E(\pi''_j, \pi''_{j+1}, u) - E(\pi'_j, \pi'_{j+1}, u),

which costs constant time to compute. We can start with j = 1, updating model energies in one pass.

Item swapping. We have n_u(n_u - 1)/2 item pairs for each user u.
So the local distribution is

    P(\pi_{i,j} \,|\, \pi_{\neg i,j}, u) = \frac{1}{1 + \exp\{-\Delta E_{ij}(u)\}}

for 1 ≤ i < j ≤ n_u, where \Delta E_{ij}(u) is the change in energy resulting from swapping the items y = \pi_i and y' = \pi_j.

Sublist permutation. We have n_u + 1 - \Delta local distributions of the following form

    P(\pi_{i:i+\Delta-1} \,|\, \pi_{\neg(i:i+\Delta-1)}, u) = \frac{\exp\{-E(\pi_{1:i-1}, \pi_{i:i+\Delta-1}, \pi_{i+\Delta:n}, u)\}}{\sum_{\pi'_{j:j+\Delta-1}} \exp\{-E(\pi'_{1:j-1}, \pi'_{j:j+\Delta-1}, \pi'_{j+\Delta:n}, u)\}}

for 1 ≤ i ≤ n_u + 1 - \Delta.

4.4 Prediction

We employ the same technique described earlier for the Latent Plackett-Luce model (Section 3.2.2): we fix the relative order of the items the user has already seen and introduce the new item into the list. We then search for the best position of the new item, where the best position is the one with the lowest permutation energy. Computationally, this is similar to the pseudo-likelihood with item relocation, except that we now choose the most probable position instead of summing over all positions. Thus, we can find the best position in a single pass.

4.5 Parameterisation Case Studies

We now specify the parameters for the log-linear modelling. We will focus on two special cases, one with factored position-wise parameters and the other with pairwise parameters.

4.5.1 Factored Position-wise Parameters

Let us start from the idea of augmenting each item with a score s^u_y, which we assume has the factored form s^u_y = \sum_{k=1}^{K} W_{uk} H_{ky}. Ignoring the pairwise potentials in Equation 5, the position-wise potential can be defined as \phi_\pi(i, u) = \exp\{s^u_{\pi_i} g(i, u)\}, where g(i, u) is a monotonically decreasing function in i. This case is attractive because an MCMC step with position swapping costs only constant time: if we swap two items at positions l and m, the change in energy is \Delta E_{lm}(u) = (s^u_{\pi_l} - s^u_{\pi_m})(g(l, u) - g(m, u)).
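The constant-time swap delta can be checked against a full energy recomputation. The sketch below uses hypothetical scores and positions, with the choice g(i, u) = (1 + n_u - 2i)/n_u discussed next:

```python
import numpy as np

rng = np.random.default_rng(4)

n = 7
s = rng.normal(size=n)           # hypothetical scores s_y^u for the n listed items

def g(i, n):
    """g(i,u) = (1 + n - 2i)/n with 1-based position i."""
    return (1 + n - 2 * i) / n

def energy(pi):
    """Position-wise energy E(pi,u) = -sum_i s_{pi_i} g(i,u)."""
    return -sum(s[y] * g(i + 1, n) for i, y in enumerate(pi))

pi = list(rng.permutation(n))
l, m = 1, 5                      # 0-based positions to swap

# Constant-time delta: (s_{pi_l} - s_{pi_m})(g(l) - g(m)),
# which for this g equals 2 (s_{pi_l} - s_{pi_m}) (m - l) / n.
dE_fast = (s[pi[l]] - s[pi[m]]) * (g(l + 1, n) - g(m + 1, n))

pi2 = list(pi)
pi2[l], pi2[m] = pi2[m], pi2[l]
dE_full = energy(pi2) - energy(pi)

print(dE_fast, dE_full)
```

Only the two swapped positions contribute to the difference, which is why the delta needs no pass over the rest of the list.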
In addition, prediction is rather simple, as we just need to use s^u_y for sorting. In particular, we are interested in the case g(i, u) = (1 + n_u - 2i)/n_u, since it has a nice interpretation:

    P(\pi|u) = \frac{1}{Z(u)} \exp\left\{ \frac{1}{n_u} \sum_{i=1}^{n_u} s^u_{\pi_i} (1 + n_u - 2i) \right\}
             = \frac{1}{Z(u)} \exp\left\{ \frac{1}{n_u} \sum_{i=1}^{n_u - 1} \sum_{j=i+1}^{n_u} (s^u_{\pi_i} - s^u_{\pi_j}) \right\},

which basically says that when y = \pi_i is preferred to y' = \pi_j, we should have s^u_y > s^u_{y'}.

4.5.2 Pairwise Parameters

We now consider the second special case, where the pairwise potential is simply \phi_\pi(i, j) = \exp\{\lambda_{y y'}\}, subject to y = \pi_i and y' = \pi_j. Note that \lambda_{y y'} \ne \lambda_{y' y}. Since there can be as many as M^2 parameters, which is often too large for robust estimation, we keep only the parameters of the item pairs whose number of co-occurrences in the training data is larger than a certain threshold. To account for missing pairs, we also use the position-wise potential \phi_\pi(i, u) = \exp\{\gamma_{\pi_i} g(i, u)\}, with an extra parameter \gamma_y per item (here y = \pi_i). The distribution is now defined as

    P(\pi|u) = \frac{1}{Z(u)} \exp\left\{ \sum_{i=1}^{n_u} \gamma_{\pi_i} g(i, u) + \sum_{i=1}^{n_u - 1} \sum_{j=i+1}^{n_u} \lambda_{\pi_i \pi_j} \right\}.

For example, the threshold may be set to 5, and we can use g(i, u) = 1 - i/n_u. Note that there is no user-specific parameter; however, the distribution is still user-dependent because the number of items n_u and the ranking are user-specific. In MCMC, suppose we swap the items at positions l and m, where l < m; the change in energy is

    \Delta E_{lm}(u) = (\gamma_{\pi_l} - \gamma_{\pi_m})\{g(l, u) - g(m, u)\} + \lambda_{\pi_l \pi_m} - \lambda_{\pi_m \pi_l} + \sum_{l<i<m} (\lambda_{\pi_l \pi_i} - \lambda_{\pi_i \pi_l} + \lambda_{\pi_i \pi_m} - \lambda_{\pi_m \pi_i}),

since only the pairs involving positions l, m, and the items between them change their relative order.
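For illustration, the energy implied by this distribution can be evaluated directly. The sketch below uses random toy parameters, with a random mask standing in for the co-occurrence threshold; all names and sizes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(5)

M = 6
gamma = rng.normal(size=M)                 # per-item position parameters gamma_y
lam = {}                                   # sparse pairwise parameters lambda_{y,y'}
for y in range(M):
    for yp in range(M):
        if y != yp and rng.random() < 0.5: # toy stand-in for the co-occurrence
            lam[(y, yp)] = rng.normal()    # threshold: keep only "frequent" pairs

def g(i, n):
    """g(i,u) = 1 - i/n_u, written for 0-based position i."""
    return 1 - (i + 1) / n

def energy(pi):
    """E(pi,u) = -sum_i gamma_{pi_i} g(i,u) - sum_{i<j} lambda_{pi_i,pi_j};
    pairs pruned by the threshold contribute zero."""
    n = len(pi)
    E = -sum(gamma[y] * g(i, n) for i, y in enumerate(pi))
    for i in range(n):
        for j in range(i + 1, n):
            E -= lam.get((pi[i], pi[j]), 0.0)
    return E

print(energy([2, 0, 4, 1]))
```

The full evaluation is quadratic in the list length, which is why the constant-per-pair swap delta above matters for MCMC.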