Incorporating Side Information in Probabilistic Matrix Factorization with Gaussian Processes


Authors: Ryan Prescott Adams, George E. Dahl, Iain Murray

University of Toronto

Abstract. Probabilistic matrix factorization (PMF) is a powerful method for modeling data associated with pairwise relationships, finding use in collaborative filtering, computational biology, and document analysis, among other areas. In many domains, there is additional information that can assist in prediction. For example, when modeling movie ratings, we might know when the rating occurred, where the user lives, or what actors appear in the movie. It is difficult, however, to incorporate this side information into the PMF model. We propose a framework for incorporating side information by coupling together multiple PMF problems via Gaussian process priors. We replace scalar latent features with functions that vary over the space of side information. The GP priors on these functions require them to vary smoothly and share information. We successfully use this new method to predict the scores of professional basketball games, where side information about the venue and date of the game is relevant to the outcome.

1. Introduction. Many data that we wish to analyze are best modeled as the result of a pairwise interaction. The pair in question might describe an interaction between items from different sets, as in collaborative filtering, or might describe an interaction between items from the same set, as in social network link prediction. The salient feature of these dyadic data modeling tasks is that the observations are the result of interactions: in the popular Netflix prize example, one is given user/movie pairs with associated ratings and must predict the ratings of unseen pairs. Other examples of this sort of relational data include biological pathway analysis, document modeling, and transportation route discovery.
One approach to relational data treats the observations as a matrix and then uses a probabilistic low-rank approximation to the matrix to discover structure in the data. This approach was pioneered by Hofmann (1999) to model word co-occurrences in text data. These probabilistic matrix factorization (PMF) models have generated a great deal of interest as powerful methods for modeling dyadic data. See Srebro (2004) for a discussion of approaches to machine learning based on matrix factorization and Salakhutdinov and Mnih (2008a) for a current view on applying PMF in practice.

One difficulty with the PMF model is that there are often more data available about the observations than simply the identities of the participants. Often the interaction itself will have additional labels that are relevant to the prediction task. In collaborative filtering, for example, the date of a rating is known to be important (Koren, 2009). Incorporating this side information directly as part of the low-rank feature model, however, limits the effect to only simple, linear interactions. In this paper we present a generalization of probabilistic matrix factorization that replaces scalar latent features with functions whose inputs are the side information. By placing Gaussian process priors on these latent functions, we achieve a flexible nonparametric Bayesian model that incorporates side information by introducing dependencies between PMF problems.

2. The Dependent PMF Model. In this section we present the dependent probabilistic matrix factorization (DPMF) model. The objective of DPMF is to tie together several related probabilistic matrix factorization problems and exploit side information by incorporating it into the latent features. We introduce the standard PMF model first and then show how it can be extended to include this side information.

2.1.
Probabilistic Matrix Factorization. In the typical probabilistic matrix factorization framework, we have two sets, M and N, of sizes M and N. For a collaborative filtering application, M might be a set of films and N might be a set of users. M and N may also be the same set, as in the basketball application we explore later in this paper. In general, we are interested in the outcomes of interactions between members of these two sets. Again, in the collaborative filtering case, the interaction might be a rating of film m by user n. In our basketball application, the observations are the scores of a game between teams m and n. Our goal is to use the observed interactions to predict unobserved interactions. This can be viewed as a matrix completion task: we have an M × N matrix Z in which only some entries are observed, and we must predict some of the unavailable entries.

One approach is to use a generative model for Z. If this model describes useful interactions between the rows and columns of Z, then inference can provide predictions of the unobserved entries. A typical formulation draws Z from a distribution that is parameterized by an unobserved matrix Y. This matrix Y is of rank K ≪ M, N, so that we may write Y = U V^T, where U and V are M × K and N × K matrices, respectively. A common approach is to say that the rows of U and V are independent draws from two K-dimensional Gaussian distributions (Salakhutdinov and Mnih, 2008b). We then interpret these K-dimensional vectors as latent features that are distributed representations of the members of M and N. We denote these vectors as u_m and v_n for the (transposed) m-th row of U and n-th row of V, respectively, so that Y_{m,n} = u_m^T v_n.

Fig 1: (a) Standard probabilistic matrix factorization: the basic low-rank model uses a matrix Y to parameterize a distribution on random matrices, from which Z (containing the observations) is taken to be a sample; the matrix Y is the product of two rank-K matrices U and V. (b) Dependent probabilistic matrix factorization: the low-rank matrices are "slices" of functions over x (coming out of the page), and these functions have Gaussian process priors.

The distribution linking Y and Z is application-specific. For ratings data it may be natural to use an ordinal regression model. For binary data, such as in link prediction, a Bernoulli logistic model may be appropriate. PMF models typically assume that the entries of Z are independent given Y, although this is not necessary. In Section 4 we will use a conditional likelihood model that explicitly includes dependencies.

2.2. Latent Features as Functions. We now generalize the PMF model to include side information about the interactions. Let X denote the space of such side information, and x denote a point in X. The time of a game or of a movie rating are good examples of such side information, but it could also involve various features of the interaction, features of the participants, or general nuisance parameters. To enable dependence on such side information, we extend the standard PMF model by replacing the latent feature vectors u_m and v_n with latent feature functions u_m(x) : X → R^K and v_n(x) : X → R^K. The matrix Y is now a function Y(x) such that Y_{m,n}(x) = u_m(x)^T v_n(x), or alternatively, Y(x) = U(x) V(x)^T, where the Z(x) matrix is drawn according to a distribution parameterized by Y(x). We model each Z(x) as conditionally independent, given Y(x).
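To make the low-rank construction concrete, here is a minimal NumPy sketch; the sizes and random draws are illustrative, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, K = 6, 8, 2  # illustrative set sizes, with K << M, N

# Rows of U and V are independent draws from K-dimensional Gaussians,
# interpreted as latent features of the members of the two sets.
U = rng.normal(size=(M, K))
V = rng.normal(size=(N, K))

# Y = U V^T parameterizes the distribution over the observed matrix Z,
# with entry Y[m, n] given by the inner product of the latent features.
Y = U @ V.T
```

Entry Y[m, n] equals u_m^T v_n, and the rank of Y is at most K.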
This representation, illustrated in Figures 1 and 2, allows the latent features to vary according to x, capturing the idea that the side information should be relevant to the distributed representation. We use a multi-task variant of the Gaussian process as a prior for these vector functions to construct a nonparametric Bayesian model of the latent features.

2.3. Multi-Task Gaussian Process Priors. When incorporating functions into Bayesian models, we often have general beliefs about the functions rather than knowledge of a specific basis. In these cases, the Gaussian process is a useful prior, allowing for the general specification of a distribution on functions from X to R via a positive-definite covariance kernel C(x, x') : X × X → R and a mean function μ(x) : X → R. For a general review of Gaussian processes for machine learning, see Rasmussen and Williams (2006). In this section we restrict our discussion to GP priors for the functions u_m(x), but we treat the v_n(x) functions identically.

It is reasonable to consider the feature function u_m(x) to be independent of another individual's feature function u_{m'}(x), but we would like each of the components within a particular function u_m(x) to have a structured prior. Rather than use independent Gaussian processes for each of the K scalar component functions in u_m(x), we use a multi-task GP approach in the vein of Teh et al. (2005) and Bonilla et al. (2008). We perform a pointwise linear transformation of K independent latent functions using a matrix L_{Σ_U} that is the Cholesky decomposition of an inter-task covariance matrix Σ_U, i.e., Σ_U = L_{Σ_U} L_{Σ_U}^T. The covariance functions C_k^U(x, x') (and relevant hyperparameters θ_k^U) are shared across the members of M, and constant mean functions μ^U(x) are added to each of the functions after the linear transformation.
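A sketch of this multi-task construction for a single member m, assuming a one-dimensional side-information space, a squared-exponential correlation function, and a made-up inter-task covariance Σ_U (none of these particular values come from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
K, T = 3, 50                       # latent components, side-information points
x = np.linspace(0.0, 10.0, T)

# Shared correlation function C(x, x') with length scale ell (plus jitter
# for numerical stability of the Cholesky factorization).
ell = 2.0
C = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / ell**2)
L_C = np.linalg.cholesky(C + 1e-8 * np.eye(T))

# K independent latent GP draws, one per component function.
f = (L_C @ rng.normal(size=(T, K))).T          # shape (K, T)

# Pointwise linear transformation by the Cholesky factor of the inter-task
# covariance couples the K components; a constant mean is added afterwards.
Sigma_U = np.array([[1.0, 0.6, 0.2],
                    [0.6, 1.0, 0.4],
                    [0.2, 0.4, 1.0]])
L_Sigma = np.linalg.cholesky(Sigma_U)
mu_U = np.zeros(K)
u_m = L_Sigma @ f + mu_U[:, None]              # u_m(x), shape (K, T)
```

Each column of u_m is the feature vector at one side-information point, with correlations across x induced by C and across components by Σ_U.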
The intra-feature sharing of the covariance function, mean function, and hyperparameters is intended to capture the idea that the characteristic variations of features will tend to be consistent across members of the set. If a feature function learns to capture, for example, whether or not users in a collaborative filtering problem enjoy Christmas movies, then we might expect them to share an annual periodic variation. The inter-feature covariance matrix Σ_U, on the other hand, captures the idea that some features are informative about others and that this information can be shared. Salakhutdinov and Mnih (2008b) applied this idea to scalar features.

Fig 2: Illustration of the generative view of the DPMF model. (a) The "left hand" latent vector function u_m(x) and (b) the "right hand" latent vector function v_n(x) are drawn from Gaussian processes. (c) The pointwise inner product of these functions yields the latent function Y_{m,n}(x). (d) The observed data, in this case ordinal values between 1 and 5, depend on Y_{m,n}(x).

We refer to this model as dependent probabilistic matrix factorization, because it ties together a set of PMF problems which are indexed by X. This yields a useful spectrum of possible behaviors: as the length scales of the GP become large and the side information in x becomes uninformative, our approach reduces to a single PMF problem; as the length scales become small and the side information becomes highly informative, each unique x has its own PMF model.
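This spectrum can be seen directly in a one-dimensional squared-exponential correlation function (the length scales below are made-up extremes): a very long length scale makes two side-information points effectively share one PMF problem, while a very short one decouples them:

```python
import numpy as np

def se_corr(x1, x2, ell):
    # Squared-exponential correlation between two side-information points.
    return np.exp(-0.5 * (x1 - x2) ** 2 / ell**2)

gap = 4.0                                 # two observations 4 units apart
r_long = se_corr(0.0, gap, ell=1000.0)    # side information uninformative
r_short = se_corr(0.0, gap, ell=0.1)      # each x gets its own PMF problem
```

With the long length scale the two latent feature slices are nearly perfectly correlated; with the short one they are essentially independent.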
The marginal distribution of a PMF problem for a given x is the same as that in Salakhutdinov and Mnih (2008b). Additionally, by having each of the K feature functions use different hyperparameters, the variation over X can be shared differently across the features. In a sports modeling application, one feature might correspond to coaches and others to players. Player personnel may change at different timescales than coaches, and this can be captured in the model.

2.4. Constructing Correlation Functions. One of the appeals of using a fully-Bayesian approach is that it in principle allows hierarchical inference of hyperparameters. In the DPMF case, we may not have a strong preconception as to precisely how useful the side information is for prediction. The relevance of side information can be captured by the length scales of the covariance functions on X (Rasmussen and Williams, 2006). If X is a D-dimensional real space, then a standard choice is the automatic relevance determination (ARD) covariance function:

    C_ARD(x, x') = exp{ −(1/2) Σ_{d=1}^{D} (x_d − x'_d)² / ℓ_d² },    (1)

where in our notation there would be 2K sets of length scales that correspond to the covariance hyperparameters {θ_k^U} and {θ_k^V}. While the ARD prior is a popular choice, some DPMF applications may have temporal data that cause feature functions to fluctuate periodically, as in the previous Christmas movie example. For this situation it may be appropriate to include a periodic kernel such as

    C_per(x, x') = exp{ −2 sin²( (x − x')/2 ) / ℓ² }.    (2)

Note that we have defined both of these as correlation functions (unit marginal variances), and allow for variation in function amplitudes to be captured via Σ_U and Σ_V, as in Bonilla et al. (2008).

2.5. Reducing Multimodality.
When performing inference, overcompleteness can cause difficulties by introducing spurious modes into the posterior distribution. It is useful to construct parameterizations that avoid multimodality. Although we have developed our notation as U V^T, in practice we restrict the right-hand factor to be positive via a component-wise transformation ψ(r) = ln(1 + e^r) of V, so that Y(x) = U(x) ψ(V(x)^T). In product models such as this, there are many posterior modes corresponding to sign flips in the functions (Adams and Stegle, 2008). Restricting the sign of one factor improves inference without making the model less expressive.

2.6. Summary of Model. For clarity, we present an end-to-end generative view of the DPMF model: 1) Two sets of K GP hyperparameters, denoted {θ_k^U} and {θ_k^V}, come from top-hat priors; 2) K(M + N) functions are drawn from the 2K Gaussian processes; these are denoted below as {f_{k,m}^U} and {f_{k,n}^V}; 3) Two K-dimensional mean vectors μ_U and μ_V come from vague Gaussian priors; 4) Two K × K cross-covariance matrices Σ_U and Σ_V are drawn from uninformative priors on positive definite matrices; 5) The "horizontally sliced" functions {f_m^U} and {f_n^V} are transformed with the Cholesky decomposition of the appropriate cross-covariance matrix, and the mean vectors are added; 6) The transformation ψ(·) is applied element-wise to the resulting {v_n(x)} to make them strictly positive; 7) The inner product of u_m(x) and ψ(v_n(x)) computes y_{m,n}(x); 8) The matrix Y(x) parameterizes a model for the entries of Z(x). Ignoring the vague priors, this is given by:

    Z(x) ~ p(Z | Y(x))
    Y(x) = U(x) ψ(V(x)^T)
    u_m(x) = L_{Σ_U} f_m^U + μ_U
    v_n(x) = L_{Σ_V} f_n^V + μ_V
    f_{k,m}^U ~ GP(x_m, θ_k^U)
    f_{k,n}^V ~ GP(x_n, θ_k^V).

2.7. Related Models.
There have been several proposals for incorporating side information into probabilistic matrix factorization models, some of which have used Gaussian processes. The Gaussian process latent variable model (GPLVM) is a nonlinear dimensionality reduction method that can be viewed as a kernelized version of PCA. Lawrence and Urtasun (2009) observe that PMF can be viewed as a particular case of PCA and use the GPLVM as a "kernel trick" on the inner products that produce Y from U V^T. The latent representations are optimized with stochastic gradient descent. This model differs from ours in that we use the GP to map from observed side information to the latent space, while theirs maps from the latent space into the matrix entries. Lawrence and Urtasun (2009) also mention the use of movie-specific metadata to augment the latent space in their collaborative filtering application. We additionally note that our model allows arbitrary link functions between the latent matrix Y and the observations Z, including dependent distributions, as discussed in Section 4.2.

Another closely-related factorization model is the stochastic relational model (SRM) (Yu et al., 2007). Rather than representing M and N as finite sets, the SRM uses arbitrary spaces as index sets. The GP provides a distribution over maps from this "identity space" to the latent feature space. The SRM differs from the DPMF in that the input space for our Gaussian process corresponds to the observations themselves, and not just to the participants in the relation. Additionally, we allow each member of M and N to have K functions, each with a different GP prior that may have different dependencies on X.

A potential advantage of the DPMF model we present here, relative to the GPLVM and SRM, is that the GP priors need only be defined on the data associated with the observations for a single individual.
As inference in Gaussian processes has cubic computational complexity, it is preferable to have more independent GPs with fewer data in each one than a few GPs that are each defined on many thousands of input points.

There has also been work on explicitly incorporating temporal information into the collaborative filtering problem, most notably by the winner of the Netflix prize. Koren (2009) included a simple drift model for the latent user features and baseline ratings. When rolled into the SVD learning method, this temporal information significantly improved predictive accuracy.

3. MCMC Inference and Prediction. As discussed in Section 2.1, the typical objective when using probabilistic matrix factorization is to predict unobserved entries in the matrix. For the DPMF model, as in the Bayesian PMF model (Salakhutdinov and Mnih, 2008a), inference and prediction are not possible in closed form. We can use Markov chain Monte Carlo (MCMC), however, to sample from the posterior distribution of the various parameters and latent variables in the model. We can then use these samples to construct a Monte Carlo estimate of the predictive distribution. If the entries of interest in Z(x) can be easily sampled given Y(x), as is typically the case, then samples from the posterior on Y(x) allow us to straightforwardly generate predictive samples, which are the quantities of interest, and integrate out all of the latent variables.
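A minimal sketch of such a Monte Carlo predictive estimate, using stand-in posterior samples of Y_{m,n}(x*) and a fixed Gaussian observation noise rather than output from an actual sampler:

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-ins for S posterior samples of the latent value Y_mn(x*), as they
# might come out of an MCMC run; in practice these come from the sampler.
S = 1000
y_samples = rng.normal(loc=2.0, scale=0.3, size=S)
sigma = 1.0  # assumed Gaussian observation noise

# One predictive draw of Z_mn(x*) per posterior sample: averaging over
# these draws integrates the latent variables out of the prediction.
z_samples = rng.normal(loc=y_samples, scale=sigma)
pred_mean = z_samples.mean()
```

The empirical distribution of z_samples approximates the posterior predictive distribution of the unobserved entry.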
In the DPMF model, we define the state of the Markov chain with: 1) the values of the latent feature functions U(x) and V(x), evaluated at the observations; 2) the hyperparameters {θ_k^U, θ_k^V}_{k=1}^{K} associated with the feature-wise covariance functions, typically capturing the relevance of the side information to the latent features; 3) the feature cross-covariances Σ_U and Σ_V; 4) the feature function means μ_U and μ_V; 5) any parameters controlling the conditional likelihood of Z(x) given Y(x). Note that, due to the convenient marginalization properties of the Gaussian process, it is only necessary to represent the values of the feature functions at places (in the space of side information) where there have been observations.

3.1. Slice Sampling. When performing approximate inference via Markov chain Monte Carlo, one constructs a transition operator on the state space that leaves the posterior distribution invariant. The transition operator is used to simulate a Markov chain. Under mild conditions, the distribution over resulting states evolves to be closer and closer to the true posterior distribution (e.g., Neal (1993)). While a generic operator, such as Metropolis–Hastings or Hamiltonian Monte Carlo, can be implemented, we seek efficient methods that do not require extensive tuning. To that end, we use MCMC methods based on slice sampling (Neal, 2003) when performing inference in the DPMF model. Some of the variables and parameters required special treatment, detailed in the next two subsections, for slice sampling to work well.

3.2. Elliptical Slice Sampling. Sampling from the posterior distribution over latent functions with Gaussian process priors is often a difficult task and can be slow to mix, due to the structure imposed by the GP prior.
In this case, we have several collections of functions in U(x) and V(x) that do not lend themselves easily to typical methods such as Gibbs sampling. Recently, a method has been developed specifically to enable efficient slice sampling of complicated Gaussian process models with no tuning or gradients (Murray et al., 2010). This method, called elliptical slice sampling (ESS), takes advantage of invariances in the Gaussian distribution to make transitions that are never vetoed by the highly-structured GP prior, even when there are a large number of such functions, as in the DPMF.

3.3. Sampling GP Hyperparameters. As discussed in Section 2.4, the length scales in the covariance (correlation) functions of the Gaussian processes play a critical role in the DPMF model. It is through these hyperparameters that the model weighs the effect of side information on the predictions. In the DPMF model, a set of hyperparameters θ_k^U (or θ_k^V) affects M (or N) functions. The typical approach to this would be to fix the relevant functions {f_{k,m}^U}_{m=1}^{M} and sample from the conditional posterior:

    p(θ_k^U | {f_{k,m}^U}_{m=1}^{M}) ∝ p(θ_k^U) ∏_{m=1}^{M} N(f_{k,m}^U ; 0, Ξ_{k,m}^U),

where Ξ_{k,m}^U is the matrix that results from applying the covariance function with hyperparameters θ_k^U to the set of side information for m. In practice, the Markov chain on this distribution can mix very slowly, due to the strong constraints arising from the M functions, despite the relative weakness of the conditional likelihood on the data. Therefore, we use an approach similar to Christensen and Waagepetersen (2002), which mixes faster in our application but still leaves the posterior distribution on the hyperparameters invariant. The model contains several draws from a Gaussian process of the form f_{k,n}^V ~ GP(x_n, θ_k^V).
Consider a vector of evaluations of one of these latent functions that is marginally distributed as f ~ N(m, Ξ_θ). Under the generative process, the distribution over the latent values is strongly dependent on the hyperparameters θ that specify the covariance. As a result, the posterior conditional distribution over the hyperparameters for fixed latent values will be strongly peaked, leading to slow mixing of a Markov chain that updates θ for fixed f. Several authors have found it useful to reparameterize Gaussian models so that, under the prior, the latent values are independent of each other and of the hyperparameters. This can be achieved by setting ν = L_θ^{-1}(f − m), where L_θ is a matrix square root, such as the Cholesky decomposition, of the covariance Ξ_θ. Under the new prior representation, ν is drawn from a spherical unit Gaussian for all θ.

We slice sample the GP hyperparameters after reparameterizing all vectors of latent function evaluations. As the hyperparameters change, the function values f = m + L_θ ν will also change to satisfy the covariance structure of the new settings. Having observed data, some f settings are very unlikely; in the reparameterized model the likelihood terms will restrict how much the hyperparameters can change. In the application we consider, with very noisy data, these updates work much better than updating the hyperparameters for fixed f. In problems where the data strongly restrict the possible changes in f, more advanced reparameterizations are possible (Christensen et al., 2006). We have developed related slice sampling methods that are easy to apply (Murray and Adams, 2010).

4. DPMF for Basketball Outcomes. To demonstrate the utility of the DPMF approach, we use our method to model the scores of games in the National Basketball Association (NBA) in the years 2002 to 2009.
This task is an appealing one to study for several reasons: 1) it is of a medium size, with approximately ten thousand observations; 2) it provides a natural censored-data evaluation setup via a "rolling predictions" problem; 3) expert human predictions are available via betting lines; 4) the properties of teams vary over time as players are traded, retire, and are injured; 5) other side information, such as which team is playing at home, is clearly relevant to game outcomes. For these reasons, using basketball as a testbed for probabilistic models is not a new idea. In the statistics literature there have been previous studies of collegiate basketball outcomes by Schwertman et al. (1991), Schwertman et al. (1996), and Carlin (1996), although with smaller data sets and narrower models.

We use the DPMF to model the scores of games. The observations in the matrix Z(x) are the actual scores of the games with side information x. Z_{m,n}(x) is the number of points scored by team m against team n, and Z_{n,m}(x) is the number of points scored by team n against team m. We model these with a bivariate Gaussian distribution, making this a somewhat unusual PMF-type model in that we see two matrix entries with each observation and we place a joint distribution over them. We use a single variance for all observations and allow for a correlation between the scores of the two teams. While the Gaussian model is not a perfect match for the data (scores are non-negative integers), each team in the NBA tends to score about 100 points per game, with a standard deviation of about ten, so that very little mass is assigned to negative numbers. Even though both sets in this dyadic problem are the same, i.e., M = N, we use different latent feature functions for U(x) and V(x).
This makes the U(x) functions correspond to offense and the V(x) functions to defense, since one of them contributes to "points for" and the other contributes to "points against". This specialization allows the Gaussian process hyperparameters to have different values for offense and defense, enabling the side information to modulate the number of points scored and conceded in potentially different ways.

4.1. Problem Setup. To use NBA basketball score prediction as a task to determine the value that using side information in our framework provides relative to the standard PMF model, we set up a rolling censored-data problem. We divided each of the eight seasons into four-week blocks. For each four-week block, the models were asked to make predictions about the games during that interval using only information from the past. In other words, when making predictions for the month of February 2005, the model could only train on data from 2002 through January 2005. We rolled each of these intervals over the entire data set, retraining the model each time and evaluating the predictions. We used three metrics for evaluation: 1) mean predictive log probability from a Rao–Blackwellized estimator; 2) error in the binary winner-prediction task; 3) root mean-squared error (RMSE) of the two-dimensional score vector.

The winner accuracies and RMSE can be compared against human experts, as determined by the betting lines associated with the games. Sports bookmakers assign in advance two numbers to each game, the spread and the over/under. The spread is a number that is added to the score of a specified team to yield a bettor an even-odds return.
For example, if the spread between the LA Lakers and the Cleveland Cavaliers is "−4.5 for the Lakers", then a single-unit bet on the Lakers yields a single-unit return if the Lakers win by 4.5 points or more (the half-point prevents ties, or pushes). If the Lakers lose or beat the Cavaliers by fewer than 4.5 points, then a single-unit bet on the Cavaliers would win a single-unit return. The over/under determines the threshold of a single-unit bet on the sum of the two scores. For example, if the over/under is 210.5 and the final score is 108 to 105, then a bettor who "took the over" with a single-unit bet would win a single-unit return, while a score of 99 to 103 would cause a loss (or a win for a bettor who "took the under"). From the point of view of model evaluation, these are excellent predictions, as the spread and over/under themselves are set by the bookmakers to balance bets on each side. This means that expert humans exploit any data available (e.g., referee identities and injury reports, which are not available to our model) to exert market forces that refine the lines to high accuracy. The sign of the spread indicates the favorite to win. To determine the implied score predictions themselves, we can solve a simple linear system:

    [ 1   1 ] [ away score ]   [ over/under  ]
    [ 1  −1 ] [ home score ] = [ home spread ].    (3)

4.2. Basketball-Specific Model Aspects. As mentioned previously, the conditional likelihood function that parameterizes the distribution over the entries in Z(x) in terms of Y(x) is problem-specific. In this application, we use

    [ Z_{m,n}(x) ]      ( [ Y_{m,n}(x) ]   [ σ²    ρσ² ] )
    [ Z_{n,m}(x) ]  ~ N ( [ Y_{n,m}(x) ] , [ ρσ²   σ²  ] ),    (4)

where σ ∈ R₊ and ρ ∈ (−1, 1) parameterize the bivariate Gaussian on scores and are included as part of inference. (A typical value for the correlation coefficient was ρ = 0.4.)
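For instance, with hypothetical line values, the implied score predictions follow from solving the 2 × 2 system in Equation 3:

```python
import numpy as np

# Hypothetical betting line: a negative home spread means the home team
# is favored by that many points.
over_under = 210.5
home_spread = -4.5

# away + home = over/under;  away - home = home spread.
A = np.array([[1.0,  1.0],
              [1.0, -1.0]])
b = np.array([over_under, home_spread])
away_score, home_score = np.linalg.solve(A, b)
```

Here the implied prediction is 107.5 points for the home team and 103.0 for the away team.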
This bivariate Gaussian likelihood allows us to easily compute the predictive log probabilities of the censored test data using a Rao–Blackwellized estimator. To do this, we sample and store predictive state from the Markov chain to construct a Gaussian mixture model. Given the predictive samples of the latent function at the new time, we can compute the means for the distribution in Equation 4. The covariance parameters are also being sampled, and this forms one component in a mixture model with equal component weights. Over many samples, we form a good predictive estimate.

4.3. Nonstationary Covariance. In the DPMF models incorporating temporal information, we are attempting to capture fluctuations in the latent features due to personnel changes, etc. One unusual aspect of this particular application is that we expect the notion of time scale to vary depending on whether it is the off-season. The timescale appropriate during the season is almost certainly inappropriate to describe the variation during the 28 weeks between the end of one regular season and the start of another. To handle this nonstationarity of the data, we introduced an additional parameter that is the effective number of weeks between seasons, which we expect to be smaller than the true number of weeks. We include this as a hyperparameter in the covariance functions and include it as part of inference, using the same slice sampling technique described in Section 3.3. A histogram of inferred gaps for K = 4 is shown in Figure 3. Note that most of the mass is below the true number of weeks.

4.4. Experimental Setup and Results. We compared several different model variants to evaluate the utility of side information. We implemented the standard fully-Bayesian PMF model using the same likelihood as above, generating predictive log probabilities using the same method as for the DPMF.
We constructed DPMFs with temporal information, binary home/away information, and both of these together. We applied each of these models using different numbers of latent features, K, from one to five. We ran ten separate Markov chains to predict each censored interval. Within a single year, we initialized the Markov state from the ending state of the previous chain, for a "warm start". The "cold start" at the beginning of the year ran for 1000 burn-in iterations, while warm starts ran for 100 iterations in each of the ten chains. After burning in and thinning by a factor of four, 100 samples of each predictive score were kept from each chain, resulting in 1000 components in the predictive Gaussian mixture model. To prevent the standard PMF model from being too heavily influenced by older data, we only provided data to it from the current season and the previous two seasons. To prevent an advantage for the DPMF, we also limited its data in the same way.

[Fig 3: Histogram of the between-season gap (weeks).]

Sampling from the covariance hyperparameters in the model is relatively expensive, due to the need to compute multiple Cholesky decompositions. To improve efficiency in this regard, we ran an extensive Markov chain to burn in these hyperparameters and then fixed them for all further sampling. Without hyperparameter sampling, the remainder of the MCMC state can be iterated in approximately three minutes on a single core of a modern workstation. We performed this burn-in of hyperparameters on a span of games from 2002 to 2004 and ultimately used the learned parameters for prediction, so there is a mild amount of "cheating" on 2004 and before, as those data have technically been seen already.
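One way to realize the effective between-season gap of Section 4.3 is to warp the calendar-time axis before applying a standard stationary covariance, compressing each off-season to its effective length. This is a minimal sketch under our own construction (the paper treats the effective gap as a sampled hyperparameter; here it is fixed, and `warp_weeks` and `squared_exp_kernel` are illustrative names, with the squared-exponential chosen only as a familiar stand-in for the paper's covariance functions).

```python
import numpy as np

def warp_weeks(t, season_starts, true_gap=28.0, effective_gap=4.0):
    """Map calendar weeks to an 'effective' timeline in which each
    28-week off-season preceding a season start is compressed to
    `effective_gap` weeks. Games only occur in-season, so the warp
    need only be monotonic across game times."""
    t = np.asarray(t, dtype=float)
    shift = np.zeros_like(t)
    for start in season_starts:
        # times at or after this season's start lose the compressed weeks
        shift = shift + (t >= start) * (true_gap - effective_gap)
    return t - shift

def squared_exp_kernel(t1, t2, lengthscale=5.0):
    """Stationary squared-exponential covariance on the warped timeline."""
    d = np.subtract.outer(np.asarray(t1, float), np.asarray(t2, float))
    return np.exp(-0.5 * (d / lengthscale) ** 2)

# End of one season at week 5; next season starts at week 33 after a
# 28-week off-season. Under the warp the two games are 4 weeks apart:
warped = warp_weeks([5.0, 33.0], season_starts=[33.0])
K = squared_exp_kernel(warped, warped)
```

Under this warp, the last game of one season and the first game of the next are far more correlated than the calendar clock would suggest, which is exactly the behavior the inferred gaps in Figure 3 support.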
We believe this effect is very small, however, as the covariance hyperparameters (the only state carried over) are only loosely connected to the data. Results from these evaluations are provided in Table 1. The DPMF model demonstrated notable improvements over the baseline Bayesian PMF model, and the inclusion of more information improved predictions over the time and home/away information alone. The predictions in 2009 are less consistent, which we attribute to variance in evaluation estimates from fewer intervals being available, as the season is still in progress. The effect of the number of latent features K in the complex model is much less clear. Figure 4 shows the joint predictions for four different games between the Cleveland Cavaliers and the Los Angeles Lakers, using K = 3 with time and home/away available. The differences between the predictions illustrate that the model is incorporating information both from home advantage and variation over time.

5. Discussion. In this paper we have presented a nonparametric Bayesian variant of probabilistic matrix factorization that induces dependencies between observations via side information using Gaussian processes. This model has the convenient property that, conditioned on the side information, the marginal distributions are equivalent to those in well-studied existing PMF models.
While Gaussian processes and MCMC often carry a significant computational cost, we have developed a framework that can make useful predictions on real problems in a practical amount of time: hours for most of the predictions in the basketball problem we have studied.

[Fig 4: Contour plots showing the predictive densities for four games between the Cleveland Cavaliers and the Los Angeles Lakers, using K = 3 with home/away and temporal information available. (a) LA at home, week 10, 2004; (b) Cle at home, week 14, 2004; (c) LA at home, week 10, 2008; (d) Cle at home, week 12, 2008. The "o" shows the expert-predicted score and the "x" shows the true outcome. The diagonal line indicates the winner threshold. Note the substantial differences between home and away, even when the times are close to each other.]

There are several interesting ways in which this work could be extended. One notable issue that we have overlooked, and which would be relevant for many applications, is that the Gaussian processes as specified in Section 2.4 only allow for smooth variation in latent features. This slow variation may be inappropriate for many models: if a star NBA player has a season-ending injury, we would expect that to be reflected better in something like a changepoint model (see, e.g., Barry and Hartigan (1993)) than a GP model.
Also, we have not addressed the issue of how to select the number of latent features, K, or how to sample from this parameter as part of the model. Nonparametric Bayesian models such as those proposed by Meeds et al. (2007) may give insight into this problem. Finally, we should note that other authors have explored other kinds of structured latent factors (e.g., Sutskever et al. (2009)), and there may be interesting ways to combine the features of these approaches with the DPMF.

Acknowledgements. The authors wish to thank Amit Gruber, Geoffrey Hinton and Rich Zemel for valuable discussions. The idea for placing basketball scores directly into the matrix was originally suggested by Danny Tarlow. RPA is funded by the Canadian Institute for Advanced Research.

Table 1: Evaluations of PMF and DPMF algorithms with various numbers of latent factors. The PMF model is the fully-Bayesian approach of Salakhutdinov and Mnih (2008a), with our application-specific likelihood. DPMF(t) is the DPMF with only temporal information, DPMF(h) has only binary home/away indicators, and DPMF(t,h) has both temporal and home/away information. (a) Mean predictive log probabilities of test data. (b) Error rates of winner prediction are on the left, RMSEs of scores are on the right. Expert human predictions are shown on the bottom.
(a) Mean log probabilities for rolling score prediction

                 2002    2003    2004    2005    2006    2007    2008    2009     All
PMF        K=1  -7.644  -7.587  -7.649  -7.580  -7.699  -7.733  -7.634  -7.653  -7.647
           K=2  -7.623  -7.587  -7.654  -7.581  -7.704  -7.738  -7.638  -7.673  -7.650
           K=3  -7.615  -7.586  -7.652  -7.581  -7.698  -7.734  -7.637  -7.666  -7.646
           K=4  -7.619  -7.585  -7.653  -7.581  -7.703  -7.734  -7.635  -7.667  -7.647
           K=5  -7.641  -7.589  -7.653  -7.580  -7.700  -7.736  -7.638  -7.667  -7.650
DPMF(t)    K=1  -7.652  -7.535  -7.564  -7.559  -7.660  -7.665  -7.618  -7.703  -7.620
           K=2  -7.620  -7.551  -7.580  -7.544  -7.675  -7.658  -7.611  -7.724  -7.621
           K=3  -7.620  -7.560  -7.605  -7.549  -7.673  -7.669  -7.611  -7.635  -7.615
           K=4  -7.618  -7.549  -7.585  -7.548  -7.673  -7.670  -7.608  -7.651  -7.613
           K=5  -7.640  -7.558  -7.591  -7.554  -7.669  -7.670  -7.609  -7.651  -7.618
DPMF(h)    K=1  -7.639  -7.549  -7.624  -7.549  -7.670  -7.706  -7.606  -7.627  -7.621
           K=2  -7.587  -7.553  -7.626  -7.551  -7.670  -7.707  -7.613  -7.640  -7.618
           K=3  -7.580  -7.542  -7.618  -7.539  -7.667  -7.706  -7.602  -7.637  -7.612
           K=4  -7.587  -7.545  -7.623  -7.547  -7.673  -7.704  -7.612  -7.652  -7.618
           K=5  -7.594  -7.541  -7.619  -7.544  -7.669  -7.709  -7.606  -7.643  -7.616
DPMF(t,h)  K=1  -7.656  -7.515  -7.562  -7.534  -7.659  -7.662  -7.602  -7.670  -7.607
           K=2  -7.585  -7.515  -7.560  -7.520  -7.646  -7.639  -7.591  -7.695  -7.594
           K=3  -7.579  -7.516  -7.563  -7.524  -7.651  -7.643  -7.575  -7.586  -7.580
           K=4  -7.584  -7.511  -7.546  -7.520  -7.640  -7.643  -7.582  -7.620  -7.581
           K=5  -7.593  -7.515  -7.569  -7.512  -7.634  -7.640  -7.589  -7.637  -7.586

(b) Left: percentage error on rolling winner prediction

                 2002   2003   2004   2005   2006   2007   2008   2009    All
PMF        K=1   38.9   39.4   41.8   37.6   41.0   38.2   36.7   36.7   38.8
           K=2   37.8   38.5   42.1   37.1   40.7   37.9   36.4   38.6   38.6
           K=3   37.2   38.7   42.4   37.0   40.5   37.5   36.8   38.1   38.5
           K=4   37.5   38.2   41.7   36.7   40.3   37.8   36.1   37.6   38.2
           K=5   39.1   38.1   41.4   37.1   41.0   37.8   36.1   38.6   38.6
DPMF(t)    K=1   37.8   37.7   37.9   37.3   39.1   34.4   34.4   35.7   36.8
           K=2   37.4   38.0   37.5   39.2   40.3   34.3   33.6   37.6   37.3
           K=3   37.2   38.9   39.0   36.5   38.1   36.3   34.0   32.4   36.5
           K=4   37.7   38.6   37.4   36.0   38.8   35.7   34.6   35.7   36.8
           K=5   38.6   39.4   38.0   35.9   38.4   37.0   34.1   36.7   37.2
DPMF(h)    K=1   37.3   37.7   39.3   34.3   38.1   37.9   34.3   29.5   36.1
           K=2   36.8   37.1   38.4   34.3   38.3   37.6   34.7   31.0   36.0
           K=3   36.8   37.4   38.4   34.9   36.8   38.6   34.1   30.5   35.9
           K=4   36.6   37.8   38.4   35.1   37.9   38.1   34.6   31.4   36.2
           K=5   35.1   36.9   38.5   34.5   37.4   38.0   34.1   30.0   35.6
DPMF(t,h)  K=1   37.2   36.1   36.7   35.3   38.6   33.7   32.0   37.1   35.8
           K=2   36.1   37.0   36.3   35.4   38.8   33.6   32.5   36.7   35.8
           K=3   37.0   34.9   34.4   33.8   37.0   34.2   31.7   30.0   34.1
           K=4   36.0   35.2   34.1   35.0   37.4   33.2   31.5   30.5   34.1
           K=5   35.3   35.6   34.0   33.2   36.2   33.5   31.7   32.9   34.0
Expert           30.9   32.2   29.7   31.4   33.3   30.6   29.8   29.4   30.9

(b) Right: RMSE on rolling score prediction

                 2002   2003   2004   2005   2006   2007   2008   2009    All
PMF        K=1  16.66  15.92  16.80  15.84  17.16  17.39  16.38  16.86  16.63
           K=2  16.38  15.91  16.82  15.85  17.16  17.41  16.33  16.96  16.61
           K=3  16.35  15.89  16.81  15.85  17.12  17.38  16.34  16.92  16.59
           K=4  16.34  15.90  16.81  15.84  17.15  17.39  16.34  16.93  16.59
           K=5  16.41  15.93  16.81  15.83  17.14  17.40  16.35  16.93  16.61
DPMF(t)    K=1  16.73  15.68  16.07  15.54  16.73  16.93  16.42  17.46  16.46
           K=2  16.37  15.70  16.30  15.29  16.84  16.75  16.21  17.57  16.39
           K=3  16.39  15.77  16.49  15.50  16.97  16.86  16.05  16.67  16.34
           K=4  16.37  15.63  16.30  15.46  16.90  16.90  16.11  16.91  16.33
           K=5  16.39  15.71  16.37  15.48  16.90  16.85  16.16  16.84  16.35
DPMF(h)    K=1  16.62  15.88  16.40  15.59  16.87  17.01  16.14  16.70  16.41
           K=2  16.11  15.93  16.44  15.60  16.88  17.04  16.15  16.79  16.37
           K=3  15.91  15.92  16.25  15.42  16.81  16.87  16.05  16.73  16.25
           K=4  15.92  15.88  16.35  15.51  16.88  16.98  16.13  16.90  16.33
           K=5  16.08  15.89  16.28  15.50  16.85  17.05  16.08  16.77  16.32
DPMF(t,h)  K=1  16.76  15.55  16.07  15.46  16.69  16.91  16.26  17.26  16.38
           K=2  16.08  15.58  16.04  15.19  16.59  16.61  16.07  17.25  16.19
           K=3  15.90  15.72  15.89  15.35  16.69  16.57  15.86  16.71  16.10
           K=4  15.93  15.62  15.75  15.20  16.52  16.60  16.03  16.81  16.07
           K=5  16.05  15.66  15.96  15.21  16.57  16.60  16.00  16.97  16.14
Expert          14.91  14.41  14.55  14.70  15.36  15.18  14.95  15.49  14.95

References.

T. Hofmann. Probabilistic latent semantic analysis.
In 15th Conference on Uncertainty in Artificial Intelligence, 1999.
N. Srebro. Learning with Matrix Factorizations. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, August 2004.
R. Salakhutdinov and A. Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In 25th International Conference on Machine Learning, 2008a.
Y. Koren. Collaborative filtering with temporal dynamics. In 15th International Conference on Knowledge Discovery and Data Mining, 2009.
R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems 20, 2008b.
C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.
Y. W. Teh, M. Seeger, and M. I. Jordan. Semiparametric latent factor models. In 10th International Workshop on Artificial Intelligence and Statistics, 2005.
E. Bonilla, K. M. Chai, and C. K. I. Williams. Multi-task Gaussian process prediction. In Advances in Neural Information Processing Systems 20, 2008.
R. P. Adams and O. Stegle. Gaussian process product models for nonparametric nonstationarity. In 25th International Conference on Machine Learning, 2008.
N. D. Lawrence and R. Urtasun. Non-linear matrix factorization with Gaussian processes. In 26th International Conference on Machine Learning, 2009.
K. Yu, W. Chu, S. Yu, V. Tresp, and Z. Xu. Stochastic relational models for discriminative link prediction. In Advances in Neural Information Processing Systems 19, 2007.
R. M. Neal. Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto, September 1993.
R. M. Neal. Slice sampling. Annals of Statistics, 31(3):705–767, 2003.
I. Murray, R. P. Adams, and D. J. C. MacKay. Elliptical slice sampling.
In 13th International Conference on Artificial Intelligence and Statistics, 2010.
O. F. Christensen and R. Waagepetersen. Bayesian prediction of spatial count data using generalized linear mixed models. Biometrics, 58(2):280–286, 2002.
O. F. Christensen, G. O. Roberts, and M. Sköld. Robust Markov chain Monte Carlo methods for spatial generalized linear mixed models. Journal of Computational and Graphical Statistics, 15(1):1–17, 2006.
I. Murray and R. P. Adams. Slice sampling covariance parameters using surrogate data. Under review, 2010.
N. C. Schwertman, T. A. McCready, and L. Howard. Probability models for the NCAA regional basketball tournaments. The American Statistician, 45(1), 1991.
N. C. Schwertman, K. L. Schenk, and B. C. Holbrook. More probability models for the NCAA regional basketball tournaments. The American Statistician, 50(1), 1996.
B. P. Carlin. Improved NCAA basketball tournament modeling via point spread and team strength information. The American Statistician, 50(1), 1996.
D. Barry and J. A. Hartigan. A Bayesian analysis for change point problems. Journal of the American Statistical Association, 88(421):309–319, 1993.
E. Meeds, Z. Ghahramani, R. M. Neal, and S. T. Roweis. Modeling dyadic data with binary latent factors. In Advances in Neural Information Processing Systems 19, 2007.
I. Sutskever, R. Salakhutdinov, and J. Tenenbaum. Modelling relational data using Bayesian clustered tensor factorization. In Advances in Neural Information Processing Systems 22, 2009.

Department of Computer Science
University of Toronto
10 King's College Road
Toronto, Ontario M5S 3G4, Canada
E-mail: rpa@cs.toronto.edu; gdahl@cs.toronto.edu; murray@cs.toronto.edu
