Student-t Processes as Alternatives to Gaussian Processes

We investigate the Student-t process as an alternative to the Gaussian process as a nonparametric prior over functions. We derive closed form expressions for the marginal likelihood and predictive distribution of a Student-t process, by integrating away an inverse Wishart process prior over the covariance kernel of a Gaussian process model. We show surprising equivalences between different hierarchical Gaussian process models leading to Student-t processes, and derive a new sampling scheme for the inverse Wishart process, which helps elucidate these equivalences. Overall, we show that a Student-t process can retain the attractive properties of a Gaussian process (a nonparametric representation, analytic marginal and predictive distributions, and easy model selection through covariance kernels) but has enhanced flexibility, and predictive covariances that, unlike a Gaussian process, explicitly depend on the values of training observations. We verify empirically that a Student-t process is especially useful in situations where there are changes in covariance structure, or in applications like Bayesian optimization, where accurate predictive covariances are critical for good performance. These advantages come at no additional computational cost over Gaussian processes.

Authors: Amar Shah, Andrew Gordon Wilson, Zoubin Ghahramani (University of Cambridge)

Appearing in Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS) 2014, Reykjavik, Iceland. JMLR: W&CP volume 33. Copyright 2014 by the authors.

1 INTRODUCTION

Gaussian processes are rich distributions over functions, which provide a Bayesian nonparametric approach to regression. Owing to their interpretability, non-parametric flexibility, large support, consistency, simple exact learning and inference procedures, and impressive empirical performances [Rasmussen, 1996], Gaussian processes as kernel machines have steadily grown in popularity over the last decade.

At the heart of every Gaussian process (GP) is a parametrized covariance kernel, which determines the properties of likely functions under a GP. Typically simple parametric kernels, such as the Gaussian (squared exponential) kernel, are used, and their parameters are determined through marginal likelihood maximization, having analytically integrated away the Gaussian process. However, a fully Bayesian nonparametric treatment of regression would place a nonparametric prior over the Gaussian process covariance kernel, to represent uncertainty over the kernel function, and to reflect the natural intuition that the kernel does not have a simple parametric form.

Likewise, given the success of Gaussian process kernel machines, it is also natural to consider more general families of elliptical processes [Fang et al., 1989], such as Student-t processes, where any collection of function values has a desired elliptical distribution, with a covariance matrix constructed using a kernel. As we will show, the Student-t process can be derived by placing an inverse Wishart process prior on the kernel of a Gaussian process.
Given their intuitive value, it is not surprising that various forms of Student-t processes have been used in different applications [Yu et al., 2007, Zhang and Yeung, 2010, Xu et al., 2011, Archambeau and Bach, 2010]. However, the connections between these models, and the theoretical properties of these models, remain largely unknown. Similarly, the practical utility of such models remains uncertain. For example, Rasmussen and Williams [2006] wonder whether "the Student-t process is perhaps not as exciting as one might have hoped".

In short, our paper answers in detail many of the "what, when and why?" questions one might have about Student-t processes (TPs), inverse Wishart processes, and elliptical processes in general. Specifically:

• We precisely define and motivate the inverse Wishart process [Dawid, 1981] as a prior over covariance matrices of arbitrary size.

• We propose a Student-t process, which we derive from hierarchical Gaussian process models. We derive analytic forms for the marginal and predictive distributions of this process, and analytic derivatives of the marginal likelihood.

• We show that the Student-t process is the most general elliptically symmetric process with analytic marginal and predictive distributions.

• We derive a new way of sampling from the inverse Wishart process, which intuitively resolves the seemingly bizarre marginal equivalence between inverse Wishart and inverse Gamma priors for covariance kernels in hierarchical GP models.

• We show that the predictive covariances of a TP depend on the values of training observations, even though the predictive covariances of a GP do not.

• We show that, contrary to the Student-t process described in Rasmussen and Williams [2006], an analytic TP noise model can be used which separates signal and noise analytically.

• We demonstrate non-trivial differences in behaviour between the GP and TP on a variety of applications. We specifically find the TP more robust to change-points and model misspecification, to have notably improved predictive covariances, to have useful "tail-dependence" between distant function values (which is orthogonal to the choice of kernel), and to be particularly promising for Bayesian optimization, where predictive covariances are especially important.

We begin by introducing the inverse Wishart process in section 2. We then derive a Student-t process by using an inverse Wishart process over covariance kernels (section 3), and discuss the properties of this Student-t process in section 4. Finally, we demonstrate the Student-t process on regression and Bayesian optimization problems in section 5.

2 INVERSE WISHART PROCESS

In this section we argue that the inverse Wishart distribution is an attractive choice of prior for covariance matrices of arbitrary size. The Wishart distribution is a probability distribution over Π(n), the set of real valued, n × n, symmetric, positive definite matrices. Its density function is defined as follows.

Definition. A random Σ ∈ Π(n) is Wishart distributed with parameters ν > n − 1, K ∈ Π(n), and we write Σ ∼ W_n(ν, K), if its density is given by

$$p(\Sigma) = c_n(\nu, K)\, |\Sigma|^{(\nu - n - 1)/2} \exp\left(-\tfrac{1}{2}\mathrm{Tr}\left(K^{-1}\Sigma\right)\right), \quad (1)$$

where $c_n(\nu, K) = \left(|K|^{\nu/2}\, 2^{\nu n/2}\, \Gamma_n(\nu/2)\right)^{-1}$.
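As a quick numerical sanity check (ours, not from the paper), the parameterization in (1) coincides with SciPy's Wishart with degrees of freedom ν and scale matrix K, so the density can be verified directly; the scale matrix below is an arbitrary illustrative choice.

```python
import numpy as np
from scipy.stats import wishart
from scipy.special import multigammaln

def wishart_logpdf(Sigma, nu, K):
    """Log density of W_n(nu, K) as in equation (1)."""
    n = K.shape[0]
    log_c = (-(nu / 2) * np.linalg.slogdet(K)[1]
             - (nu * n / 2) * np.log(2) - multigammaln(nu / 2, n))
    return (log_c + ((nu - n - 1) / 2) * np.linalg.slogdet(Sigma)[1]
            - 0.5 * np.trace(np.linalg.solve(K, Sigma)))

rng = np.random.default_rng(0)
n, nu = 3, 5.0
A = rng.standard_normal((n, n))
K = A @ A.T + n * np.eye(n)              # arbitrary positive definite scale
Sigma = wishart(df=nu, scale=K).rvs(random_state=rng)

# scipy.stats.wishart uses exactly the parameterization of equation (1)
print(wishart_logpdf(Sigma, nu, K))
print(wishart(df=nu, scale=K).logpdf(Sigma))   # should agree
```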
The Wishart distribution defined with this parameterization is consistent under marginalization. If Σ ∼ W_n(ν, K), then any n₁ × n₁ principal submatrix Σ₁₁ is W_{n₁}(ν, K₁₁) distributed. This property makes the Wishart distribution appear to be an attractive choice of prior over covariance matrices. Unfortunately, the Wishart distribution suffers from a flaw which makes it impractical for nonparametric Bayesian modelling.

Suppose we wish to model a covariance matrix using ν⁻¹Σ, so that its expected value is E[ν⁻¹Σ] = K, and var[ν⁻¹Σ_ij] = ν⁻¹(K_ij² + K_ii K_jj). Since we require ν > n − 1, we must let ν → ∞ to define a process which has positive semidefinite Wishart distributed marginals of arbitrary size. However, as ν → ∞, ν⁻¹Σ tends to the constant matrix K almost surely. Thus the requirement ν > n − 1 prohibits defining a useful process which has Wishart marginals of arbitrary size.

Nevertheless, the inverse Wishart distribution does not suffer from this problem. Dawid [1981] parametrized the inverse Wishart distribution as follows:

Definition. A random Σ ∈ Π(n) is inverse Wishart distributed with parameters ν ∈ R₊, K ∈ Π(n), and we write Σ ∼ IW_n(ν, K), if its density is given by

$$p(\Sigma) = c_n(\nu, K)\, |\Sigma|^{-(\nu + 2n)/2} \exp\left(-\tfrac{1}{2}\mathrm{Tr}\left(K\Sigma^{-1}\right)\right), \quad (2)$$

with $c_n(\nu, K) = \dfrac{|K|^{(\nu + n - 1)/2}}{2^{(\nu + n - 1)n/2}\, \Gamma_n\left((\nu + n - 1)/2\right)}$.

If Σ ∼ IW_n(ν, K), then Σ has a mean and covariance only when ν > 2, and E[Σ] = (ν − 2)⁻¹K. Both the Wishart and the inverse Wishart distributions place prior mass on every Σ ∈ Π(n). Furthermore, Σ ∼ W_n(ν, K) if and only if Σ⁻¹ ∼ IW_n(ν − n + 1, K⁻¹).

Dawid [1981] shows that the inverse Wishart distribution defined as above is consistent under marginalization. If Σ ∼ IW_n(ν, K), then any principal submatrix Σ₁₁ will be IW_{n₁}(ν, K₁₁) distributed. Note the key difference in the parameterizations of the two distributions: the parameter ν does not need to depend on the size of the matrix in the inverse Wishart distribution. These properties are desirable and motivate defining a process which has inverse Wishart marginals of arbitrary size.

Let X be some input space and k : X × X → R a positive definite kernel function.

Definition. σ is an inverse Wishart process on X with parameters ν ∈ R₊ and base kernel k : X × X → R if, for any finite collection x₁, ..., xₙ ∈ X, σ(x₁, ..., xₙ) ∼ IW_n(ν, K), where K ∈ Π(n) with K_ij = k(x_i, x_j). We write σ ∼ IWP(ν, k).

Figure 1: Five samples (blue solid) from GP(h, κ) (left) and TP(ν, h, κ) (right), with ν = 5, h(x) = cos(x) (red dashed) and κ(x_i, x_j) = 0.01 exp(−20(x_i − x_j)²). The grey shaded area represents a 95% predictive interval under each model.

In the next section we use the inverse Wishart process as a nonparametric prior over kernels in a hierarchical Gaussian process model.
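Before moving on, here is a small sketch (ours) of drawing IWP marginals. It uses the mapping between Dawid's parameterization and the standard one: IW_n(ν, K) corresponds to SciPy's inverse Wishart with df = ν + n − 1 and scale K, so ν is free of the matrix size, and the consistency property can be checked empirically on principal submatrices. The squared exponential base kernel is an illustrative choice.

```python
import numpy as np
from scipy.stats import invwishart

def iwp_marginal_sample(xs, nu, kernel, rng):
    """Draw sigma(x_1,...,x_n) ~ IW_n(nu, K), K_ij = kernel(x_i, x_j).
    Dawid's IW_n(nu, K) is the standard inverse Wishart with
    degrees of freedom nu + n - 1 and scale matrix K."""
    K = kernel(xs[:, None], xs[None, :])
    n = len(xs)
    return invwishart(df=nu + n - 1, scale=K).rvs(random_state=rng)

se_kernel = lambda a, b: np.exp(-0.5 * (a - b) ** 2)  # illustrative base kernel
rng = np.random.default_rng(1)
xs = np.linspace(0, 3, 6)
nu = 4.0

# E[Sigma] = K / (nu - 2): Monte Carlo check on the top-left 2x2 block,
# whose law is IW_2(nu, K_11) by consistency under marginalization.
draws = np.array([iwp_marginal_sample(xs, nu, se_kernel, rng)[:2, :2]
                  for _ in range(10000)])
K11 = se_kernel(xs[:2, None], xs[None, :2])
print(draws.mean(axis=0))        # approximately K11 / (nu - 2)
print(K11 / (nu - 2))
```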
3 DERIVING THE STUDENT-t PROCESS

Gaussian processes (GPs) are popular nonparametric Bayesian distributions over functions. A thorough guide to GPs has been provided by Rasmussen and Williams [2006]. GPs are characterized by a mean function and a kernel function. Practitioners tend to use parametric kernel functions and learn their hyperparameters using maximum likelihood or sampling based methods.

We propose placing an inverse Wishart process prior on the kernel function, leading to a Student-t process. For a base kernel k_θ parameterized by θ, and a continuous mean function φ : X → R, our generative approach is as follows:

$$\sigma \sim \mathrm{IWP}(\nu, k_\theta), \qquad y \mid \sigma \sim \mathcal{GP}(\phi, (\nu - 2)\sigma). \quad (3)$$

Since the inverse Wishart distribution is a conjugate prior for the covariance matrix of a Gaussian likelihood, we can analytically marginalize σ in the generative model of (3). For any collection of data y = (y₁, ..., yₙ)ᵀ with φ = (φ(x₁), ..., φ(xₙ))ᵀ,

$$p(y \mid \nu, K) = \int p(y \mid \Sigma)\, p(\Sigma \mid \nu, K)\, d\Sigma \propto \int |\Sigma|^{-(\nu + 2n + 1)/2} \exp\left(-\tfrac{1}{2}\mathrm{Tr}\left(\left(K + \tfrac{(y - \phi)(y - \phi)^\top}{\nu - 2}\right)\Sigma^{-1}\right)\right) d\Sigma \propto \left(1 + \tfrac{1}{\nu - 2}(y - \phi)^\top K^{-1}(y - \phi)\right)^{-(\nu + n)/2}. \quad (4)$$

Definition. y ∈ Rⁿ is multivariate Student-t distributed with parameters ν ∈ R₊ \ [0, 2], φ ∈ Rⁿ and K ∈ Π(n) if it has density

$$p(y) = \frac{\Gamma\left(\frac{\nu + n}{2}\right)}{((\nu - 2)\pi)^{n/2}\, \Gamma\left(\frac{\nu}{2}\right)}\, |K|^{-1/2} \left(1 + \frac{(y - \phi)^\top K^{-1}(y - \phi)}{\nu - 2}\right)^{-\frac{\nu + n}{2}}. \quad (5)$$

We write y ∼ MVT_n(ν, φ, K).

We can easily compute the mean and covariance of the MVT using the generative derivation: E[y] = E[E[y | Σ]] = φ and cov[y] = E[E[(y − φ)(y − φ)ᵀ | Σ]] = E[(ν − 2)Σ] = K. We prove the following lemma in the Supplementary Material.

Lemma 1. The multivariate Student-t is consistent under marginalization.

We define a Student-t process as follows.

Definition. f is a Student-t process on X with parameters ν > 2, mean function Φ : X → R, and kernel function k : X × X → R if any finite collection of function values has a joint multivariate Student-t distribution, i.e. (f(x₁), ..., f(xₙ))ᵀ ∼ MVT_n(ν, φ, K), where K ∈ Π(n) with K_ij = k(x_i, x_j), and φ ∈ Rⁿ with φ_i = Φ(x_i). We write f ∼ TP(ν, Φ, k).
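Since this MVT parameterization differs from the standard multivariate t (here K is the covariance whenever ν > 2, rather than a shape matrix), a direct implementation of (5) is worth having. The following sketch (ours, not the authors') checks it against SciPy's multivariate t with shape matrix (ν − 2)K/ν, which is the equivalent standard parameterization.

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import multivariate_t

def mvt_logpdf(y, nu, phi, K):
    """Log density of MVT_n(nu, phi, K) as in equation (5);
    K is the covariance matrix (requires nu > 2)."""
    n = len(y)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L, y - phi)
    beta = alpha @ alpha                      # (y-phi)^T K^{-1} (y-phi)
    return (gammaln((nu + n) / 2) - gammaln(nu / 2)
            - (n / 2) * np.log((nu - 2) * np.pi)
            - np.log(np.diag(L)).sum()
            - ((nu + n) / 2) * np.log1p(beta / (nu - 2)))

rng = np.random.default_rng(2)
n, nu = 4, 7.0
A = rng.standard_normal((n, n))
K = A @ A.T + n * np.eye(n)                  # arbitrary covariance
phi = rng.standard_normal(n)
y = rng.standard_normal(n)

print(mvt_logpdf(y, nu, phi, K))
# Standard parameterization: MVT_n(nu, phi, K) is a multivariate t with
# df = nu and shape = (nu - 2) K / nu, so that its covariance is K.
print(multivariate_t(loc=phi, shape=(nu - 2) * K / nu, df=nu).logpdf(y))
```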
4 TP PROPERTIES & RELATION TO OTHER PROCESSES

In this section we discuss the conditional distribution of the TP, the relationship between GPs and TPs, another covariance prior which leads to the same TP, elliptical processes, and a sampling scheme for the IWP which gives insight into this equivalence. Finally, we consider modelling noisy functions with a TP.

4.1 Relation to Gaussian process

The Student-t process generalizes the Gaussian process. A GP can be seen as a limiting case of a TP, as shown in Lemma 2, which is proven in the Supplementary Material.

Lemma 2. Suppose f ∼ TP(ν, Φ, k) and g ∼ GP(Φ, k). Then f tends to g in distribution as ν → ∞.

The ν parameter controls how heavy tailed the process is. Smaller values of ν correspond to heavier tails. As ν gets larger, the tails converge to Gaussian tails. This is illustrated in the prior sample draws shown in Figure 1. Notice that the samples from the TP tend to have more extreme behaviour than those from the GP.

ν also controls the nature of the dependence between variables which are jointly Student-t distributed, and not just their marginal distributions. In Figure 2 we show plots of samples which all have Gaussian marginals but different joint distributions. Notice how the tail dependency of these distributions is controlled by ν. For example, the dependencies between y(x_p) and y(x_q) are different depending on whether y is a TP or a GP, even if the TP and GP have the same kernel.

Figure 2: Uncorrelated bivariate samples from a Student-t copula with ν = 3 (left), a Student-t copula with ν = 10 (centre) and a Gaussian copula (right). All marginal distributions are N(0, 1) distributed.

4.2 Conditional distribution

The conditional distribution for a multivariate Student-t has an analytic form, which we state in Lemma 3 and prove in the Supplementary Material.

Lemma 3. Suppose y ∼ MVT_n(ν, φ, K) and let y₁ and y₂ represent the first n₁ and remaining n₂ entries of y respectively. Then

$$y_2 \mid y_1 \sim \mathrm{MVT}_{n_2}\left(\nu + n_1,\; \tilde{\phi}_2,\; \frac{\nu + \beta_1 - 2}{\nu + n_1 - 2} \times \tilde{K}_{22}\right), \quad (6)$$

where $\tilde{\phi}_2 = K_{21}K_{11}^{-1}(y_1 - \phi_1) + \phi_2$, $\beta_1 = (y_1 - \phi_1)^\top K_{11}^{-1}(y_1 - \phi_1)$ and $\tilde{K}_{22} = K_{22} - K_{21}K_{11}^{-1}K_{12}$. Note that E[y₂ | y₁] = φ̃₂ and cov[y₂ | y₁] = (ν + β₁ − 2)/(ν + n₁ − 2) × K̃₂₂.

As ν tends to infinity, this predictive distribution tends to a Gaussian process predictive distribution, as we would expect given Lemma 2. Perhaps less intuitively, this predictive distribution also tends to a Gaussian process predictive as n₁ tends to infinity.

The predictive mean has the same form as for a Gaussian process, conditioned on having the same kernel k, with the same hyperparameters. The key difference is in the predictive covariance, which now explicitly depends on the training observations. Indeed, a somewhat disappointing feature of the Gaussian process is that for a given kernel, the predictive covariance of new samples does not depend on training observations.

Importantly, since the marginal likelihood of the TP in (5) differs from the marginal likelihood of the GP, both the predictive mean and predictive covariance of a TP will differ from those of a GP after learning kernel hyperparameters.

The scaling constant of the multivariate Student-t predictive covariance has an intuitive explanation. Note that β₁ is distributed as the sum of squares of n₁ independent MVT₁(ν, 0, 1) variables, and hence E[β₁] = n₁. If the observed value of β₁ is larger than n₁, the predictive covariance is scaled up, and vice versa. The magnitude of the scaling is controlled by ν. The fact that the predictive covariance of the multivariate Student-t depends on the data is one of the key benefits of this distribution over a Gaussian one.
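The conditional in (6) translates directly into code. The sketch below (our illustration, with an arbitrary squared exponential kernel and synthetic data) computes the TP posterior predictive at test inputs and exposes the β₁-dependent factor by which the GP predictive covariance is rescaled.

```python
import numpy as np

def tp_predict(X1, y1, X2, nu, kernel, mean=lambda x: np.zeros(len(x))):
    """TP posterior predictive, equation (6).
    Returns (mean, covariance, degrees of freedom) of y2 | y1."""
    K11 = kernel(X1[:, None], X1[None, :])
    K12 = kernel(X1[:, None], X2[None, :])
    K22 = kernel(X2[:, None], X2[None, :])
    L = np.linalg.cholesky(K11)
    resid = np.linalg.solve(L, y1 - mean(X1))
    V = np.linalg.solve(L, K12)
    n1 = len(y1)
    beta1 = resid @ resid                     # (y1-phi1)^T K11^{-1} (y1-phi1)
    phi2 = V.T @ resid + mean(X2)
    K22_tilde = K22 - V.T @ V                 # GP predictive covariance
    scale = (nu + beta1 - 2) / (nu + n1 - 2)  # data-dependent TP rescaling
    return phi2, scale * K22_tilde, nu + n1

kernel = lambda a, b: np.exp(-0.5 * (a - b) ** 2 / 0.5 ** 2)  # illustrative
rng = np.random.default_rng(3)
X1 = np.linspace(-2, 2, 10)
y1 = np.sin(2 * X1) + 0.1 * rng.standard_normal(10)           # synthetic data
X2 = np.array([0.5, 3.0])

mu, cov, dof = tp_predict(X1, y1, X2, nu=5.0, kernel=kernel)
print(mu, np.sqrt(np.diag(cov)), dof)
# With beta1 > n1 the TP predictive is wider than the GP's, and narrower
# when beta1 < n1; a GP's predictive covariance ignores y1 entirely.
```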
4.3 Another Covariance Prior

Despite the apparent flexibility of the inverse Wishart distribution, we illustrate in Lemma 4 the surprising result that a multivariate Student-t distribution can be derived using a much simpler covariance prior, which has been considered previously [Yu et al., 2007]. The proof can be found in the Supplementary Material.

Lemma 4. Let K ∈ Π(n), φ ∈ Rⁿ, ν > 2, ρ > 0 and

$$r^{-1} \sim \Gamma(\nu/2, \rho/2), \qquad y \mid r \sim \mathcal{N}_n(\phi,\; r(\nu - 2)K/\rho); \quad (7)$$

then marginally y ∼ MVT_n(ν, φ, K).

From (7), r⁻¹ | y ∼ Γ((ν + n)/2, (ρ/2)(1 + β/(ν − 2))), and hence E[(ν − 2)r/ρ | y] = (ν + β − 2)/(ν + n − 2). This is exactly the factor by which K̃₂₂ is scaled in the MVT conditional distribution in (6).

This result is surprising because we previously integrated over an infinite dimensional nonparametric object (the IWP) to derive the Student-t process, yet here we show that we can integrate over a single scale parameter (inverse Gamma) to arrive at the same marginal process. We provide some insight into why these distinct priors lead to the same marginal multivariate Student-t distribution in section 4.5.

4.4 Elliptical Processes

We now show that both Gaussian and Student-t processes are elliptically symmetric, and that the Student-t process is the more general elliptical process.

Definition. y ∈ Rⁿ is elliptically symmetric if and only if there exist µ ∈ Rⁿ, a nonnegative random variable R, an n × d matrix Ω with maximal rank d, and u uniformly distributed on the unit sphere in R^d, independent of R, such that y =_D µ + RΩu, where =_D denotes equality in distribution.

An overview of elliptically symmetric distributions, and the following lemma, can be found in Fang et al. [1989].

Lemma 5. Suppose R₁ ∼ χ²(n) and R₂ ∼ Γ⁻¹(ν/2, 1/2) independently. If R = √R₁, then y is Gaussian distributed. If R = √((ν − 2)R₁R₂), then y is MVT distributed.

Elliptically symmetric distributions characterize a large class of distributions which are unimodal and where the likelihood of a point decreases with its distance from this mode. These properties are natural assumptions we often want to encode in our prior distribution, making elliptical distributions ideal for multivariate modelling tasks. The idea naturally extends to infinite dimensional objects.

Definition. Let Y = {y_i} be a countable family of random variables. It is an elliptical process if any finite subset of them is jointly elliptically symmetric.

Not all elliptical distributions have densities (e.g. Lévy alpha-stable distributions). Even fewer elliptical processes have densities, and the set of those that do is characterized in Theorem 6, due to Kelker [1970].

Theorem 6. Suppose Y = {y_i} is an elliptical process. Any finite collection z = {z₁, ..., zₙ} ⊂ Y has a density if and only if there exists a non-negative random variable r such that z | r ∼ N_n(µ, rΩΩᵀ).

A simple corollary of this theorem describes the only two cases where an elliptical process has an analytically representable density function (its proof is included in the Supplementary Material).

Corollary 7. Suppose Y = {y_i} is an elliptical process. Any finite collection z = {z₁, ..., zₙ} ⊂ Y has an analytically representable density if and only if Y is either a Gaussian process or a Student-t process.

Since the Student-t process generalizes the Gaussian process, it is the most general elliptical process which has an analytically representable density. The TP is thus an expressive tool for nonparametric Bayesian modelling.

With analytic expressions for the predictive distributions, the same computational costs as a Gaussian process, and increased flexibility, the Student-t process can be used as a drop-in replacement for a Gaussian process in many applications.
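Lemma 5 gives a constructive recipe that is easy to simulate. The following sketch (ours, for intuition) draws y = RΩu with the two radial choices and confirms that the resulting samples match a Gaussian and an MVT respectively, via Kolmogorov-Smirnov tests on the first coordinate.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, nu, m = 3, 5.0, 50000
A = rng.standard_normal((n, n))
Omega = np.linalg.cholesky(A @ A.T + n * np.eye(n))  # so K = Omega Omega^T

u = rng.standard_normal((m, n))
u /= np.linalg.norm(u, axis=1, keepdims=True)        # uniform on the unit sphere

R1 = stats.chi2(df=n).rvs(m, random_state=rng)
R2 = stats.invgamma(a=nu / 2, scale=0.5).rvs(m, random_state=rng)

y_gauss = np.sqrt(R1)[:, None] * u @ Omega.T                 # R = sqrt(R1)
y_mvt = np.sqrt((nu - 2) * R1 * R2)[:, None] * u @ Omega.T   # R = sqrt((nu-2) R1 R2)

# The first coordinate of MVT_n(nu, 0, K) is a Student-t with nu dof
# scaled by sqrt(K_11 (nu - 2) / nu); similarly N(0, K_11) for the GP case.
K11 = (Omega @ Omega.T)[0, 0]
print(stats.kstest(y_gauss[:, 0], stats.norm(scale=np.sqrt(K11)).cdf).pvalue)
print(stats.kstest(y_mvt[:, 0],
                   stats.t(df=nu, scale=np.sqrt(K11 * (nu - 2) / nu)).cdf).pvalue)
```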
4.5 A New Way to Sample the IWP

We show that the density of an inverse Wishart distribution depends only on the eigenvalues of a positive definite matrix. To the best of our knowledge this change of variables has not been computed previously. This decomposition offers a novel way of sampling from an inverse Wishart distribution, and insight into why the Student-t process can be derived using an inverse Gamma or an inverse Wishart process covariance prior.

Let Ξ(n) be the set of all n × n orthogonal matrices. A matrix is orthogonal if it is square, real valued, and its rows and columns are orthogonal unit vectors. Orthogonal matrices are compositions of rotations and reflections, which are volume preserving operations. Symmetric positive definite (SPD) matrices can be represented through a diagonal and an orthogonal matrix:

Theorem 8. Let Σ ∈ Π(n), the set of SPD n × n matrices. Suppose {λ₁, ..., λₙ} are the eigenvalues of Σ. There exists Q ∈ Ξ(n) such that Σ = QΛQᵀ, where Λ = diag(λ₁, ..., λₙ).

Now suppose Σ ∼ IW_n(ν, I). We compute the density of an IW using the representation in Theorem 8, being careful to include the Jacobian of the change of variable, J(Σ; Q, Λ), given in Edelman and Rao [2005]. From (2), and using the facts that QᵀQ = I and |AB| = |BA|,

$$p(\Sigma)\, d\Sigma = p(Q\Lambda Q^\top)\, |J(\Sigma; Q, \Lambda)|\, d\Lambda\, dQ \propto |Q\Lambda Q^\top|^{-(\nu + 2n)/2} \exp\left(-\tfrac{1}{2}\mathrm{Tr}\left((Q\Lambda Q^\top)^{-1}\right)\right) \prod_{1 \le i < j \le n} |\lambda_i - \lambda_j|\, d\Lambda\, dQ = \prod_{i=1}^{n} \lambda_i^{-(\nu + 2n)/2}\, e^{-1/(2\lambda_i)} \prod_{1 \le i < j \le n} |\lambda_i - \lambda_j|\, d\Lambda\, dQ, \quad (8)$$

which depends on Σ only through its eigenvalues. This result provides a geometric interpretation of what a sample from IW_n(ν, I) looks like. We first uniformly at random pick an orthogonal set of basis vectors in Rⁿ, and then stretch these basis vectors using an exchangeable set of scalar random variables. An analogous interpretation holds for the Wishart distribution.

Recall from Lemma 5 that if u is uniformly distributed on the unit sphere in Rⁿ and R ∼ χ²(n) independently, then √R u ∼ N_n(0, I). By (4) and Lemma 5, if we sample Q and Λ from the generative process above, then √((ν − 2)R) QΛ^{1/2}u is marginally a draw from MVT(ν, 0, I). Since the diagonal elements of Λ are exchangeable, Q is orthogonal and sampled uniformly over Ξ(n), and u is spherically symmetric, we must have that QΛ^{1/2}u =_D √R′ u for some positive scalar random variable R′, by symmetry. By Lemma 5 we know R′ ∼ Γ⁻¹(ν/2, 1/2). In summary, the action of QΛ^{1/2} on u is equivalent in distribution to a rescaling by an inverse Gamma variate.
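To see this equivalence numerically, the sketch below (ours) draws y both ways, through an inverse Wishart covariance and through a single inverse Gamma scale, and compares the two sample sets with a two-sample Kolmogorov-Smirnov test on one coordinate; the kernel matrix is an arbitrary illustrative choice.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, nu, m = 3, 6.0, 10000
A = rng.standard_normal((n, n))
K = A @ A.T + n * np.eye(n)

# Route 1: Sigma ~ IW_n(nu, K) (standard df = nu + n - 1),
#          y | Sigma ~ N(0, (nu - 2) Sigma), as in the generative model (3).
Sigmas = stats.invwishart(df=nu + n - 1, scale=K).rvs(m, random_state=rng)
y_iw = np.array([rng.multivariate_normal(np.zeros(n), (nu - 2) * S)
                 for S in Sigmas])

# Route 2 (Lemma 4, rho = 1): r ~ InvGamma(nu/2, 1/2), y | r ~ N(0, r (nu-2) K).
r = stats.invgamma(a=nu / 2, scale=0.5).rvs(m, random_state=rng)
L = np.linalg.cholesky(K)
y_ig = np.sqrt((nu - 2) * r)[:, None] * (rng.standard_normal((m, n)) @ L.T)

# Both should be MVT_n(nu, 0, K); compare first coordinates.
print(stats.ks_2samp(y_iw[:, 0], y_ig[:, 0]).pvalue)   # large p-value expected
```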
Figure 3: Scatter plots of points drawn from various 2-dimensional processes. Here ν = 2.1 and K_ij = 0.8δ_ij + 0.2. Top-left: MVT₂(ν, 0, K) + MVT₂(ν, 0, 0.5I). Top-right: MVT₂(ν, 0, K + 0.5I) (our model). Bottom-left: MVT₂(ν, 0, K) + N₂(0, 0.5I). Bottom-right: N₂(0, K + 0.5I).

4.6 Modelling Noisy Functions

It is common practice to assume that outputs are the sum of a latent Gaussian process and independent Gaussian noise. An advantage of such a model is the fact that the sum of independent Gaussian distributions is Gaussian distributed, and hence such a Gaussian process model remains analytic in the presence of noise. Unfortunately, the sum of two independent MVTs is analytically intractable.

This problem was encountered by Rasmussen and Williams [2006], who went on to dismiss the multivariate Student-t process for practical purposes. Our approach is to incorporate the noise into the kernel function, for example, letting k = k_θ + δ, where k_θ is a parametrized kernel and δ is a diagonal kernel function. Such a model is not equivalent to adding independent noise, since the scaling parameter ν will have an effect on the squared exponential kernel as well as the noise kernel. Zhang and Yeung [2010] propose a similar method for handling noise; however, they incorrectly assume that the latent function and noise are independent under this model. The noise will be uncorrelated with the latent function, but not independent. As ν → ∞, this model tends to a GP with independent Gaussian noise.

In Figure 3, we consider samples from various two dimensional processes when ν is small and the signal to noise ratio is small. Here we see that the TP with noise incorporated into its kernel behaves similarly to a TP with independent Student-t noise.

There have been several attempts to make GP regression robust to heavy tailed noise that rely on approximate inference [Neal, 1997, Vanhatalo et al., 2009]. It is hence attractive that our proposed method can model heavy tailed noise whilst retaining an analytic inference scheme. This is a novel finding to the best of our knowledge.

5 APPLICATIONS

In this section we compare TPs to GPs for regression and Bayesian optimization.

5.1 Regression

Consider a set of observations {x_i, y_i}, i = 1, ..., n, for x_i ∈ X and y_i ∈ R. Analogous to Gaussian process regression, we assume the following generative model:

$$f \sim \mathcal{TP}(\nu, \Phi, k_\theta), \qquad y_i = f(x_i) \;\text{ for } i = 1, ..., n. \quad (9)$$

In this work we consider parametric kernel functions. A key task when using such kernels is learning the parameters of the chosen kernel, which are called the hyperparameters of the model. We include derivatives of the marginal log likelihood of the TP with respect to the hyperparameters in the Supplementary Material.
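The hyperparameter objective itself is cheap to implement from the log marginal likelihood given in Appendix B. Here is a minimal Cholesky-based sketch (ours), using an assumed squared exponential kernel with an added diagonal (delta) term of the kind used in the experiments below.

```python
import numpy as np
from scipy.special import gammaln

def tp_log_marginal_likelihood(y, nu, phi, K):
    """log p(y | nu, K) for y ~ MVT_n(nu, phi, K), as in Appendix B."""
    n = len(y)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L, y - phi)
    beta = alpha @ alpha
    return (-(n / 2) * np.log((nu - 2) * np.pi)
            - np.log(np.diag(L)).sum()
            + gammaln((nu + n) / 2) - gammaln(nu / 2)
            - ((nu + n) / 2) * np.log1p(beta / (nu - 2)))

def se_plus_delta(X, signal_var, lengthscale, noise_var):
    """Assumed kernel: squared exponential plus a delta (diagonal) term."""
    d2 = (X[:, None] - X[None, :]) ** 2
    return (signal_var * np.exp(-0.5 * d2 / lengthscale ** 2)
            + noise_var * np.eye(len(X)))

rng = np.random.default_rng(6)
X = np.linspace(0, 5, 30)
y = np.sin(X) + 0.1 * rng.standard_normal(30)    # synthetic data
K = se_plus_delta(X, signal_var=1.0, lengthscale=1.0, noise_var=0.01)
print(tp_log_marginal_likelihood(y, nu=5.0, phi=np.zeros(30), K=K))
# This objective (plus the analytic derivatives of Appendix B) can be fed
# to a gradient-based optimizer, or to HMC/slice sampling over (theta, nu).
```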
5.1.1 Experiments

We test the Student-t process as a regression model on a number of datasets. We sample hyperparameters using Hamiltonian Monte Carlo [Neal, 2011] and use a kernel function which is a sum of a squared exponential and a delta kernel function (k_θ = k_SE + k_δ). The results for all of these experiments are summarized in Table 1.

Table 1: Predictive Mean Squared Errors (MSE) and Log Likelihoods (LL) of regression experiments. The TP consistently has the lowest MSE and highest LL.

| Dataset | GP MSE | GP LL | TP MSE | TP LL |
|---------|--------|-------|--------|-------|
| Synth A | 2.24 ± 0.09 | -1.66 ± 0.04 | 2.29 ± 0.08 | -1.00 ± 0.03 |
| Synth B | 9.53 ± 0.03 | -1.45 ± 0.02 | 5.69 ± 0.03 | -1.30 ± 0.02 |
| Snow | 10.2 ± 0.08 | 4.00 ± 0.12 | 10.5 ± 0.07 | 25.7 ± 0.18 |
| Spatial | 6.89 ± 0.04 | 4.34 ± 0.22 | 5.71 ± 0.03 | 44.4 ± 0.4 |
| Wine | 4.84 ± 0.08 | -1.4 ± 1 | 4.20 ± 0.06 | 113 ± 2 |

Synthetic Data A. We sample 100 functions from a GP prior with Gaussian noise and fit both GPs and TPs to the data, with the goal of predicting test points. For each function we train on 80 data points and test on 20. The TP, which generalizes the GP, has superior predictive uncertainty in this example.

Synthetic Data B. We construct data by drawing 100 functions from a GP with a squared exponential kernel and adding Student-t noise independently. The posterior distribution of one sample is shown in Figure 4. The predictive means are also not identical, since the posterior distributions of the hyperparameters differ between the TP and the GP. Here the TP has a superior predictive mean, since after hyperparameter training it is better able to model Student-t noise, as well as better predictive uncertainty.

Figure 4: Posterior distributions of one sample from Synthetic Data B under a GP prior (left) and a TP prior (right). The solid line is the posterior mean, the shaded area represents a 95% predictive interval, circles are training points and crosses are test points.

Whistler Snowfall Data¹. Daily snowfall amounts in Whistler have been recorded for the years 2010 and 2011. This data exhibits clear changepoint-type behaviour due to seasonality, which the TP handles much better than the GP.

Spatial Interpolation Data². This dataset contains rainfall measurements at 467 locations (100 observed and 367 to be estimated) in Switzerland on 8 May 1986.

Wine Data. This dataset, due to Cortez et al. [2009], consists of 12 attributes of various red wines, including acidity, density, pH and alcohol level. Each wine is given a corresponding quality score between 0 and 10. We choose a random subset of 400 wines: 360 for training and 40 for testing.

¹ The snowfall dataset can be found at http://www.climate.weatheroffice.ec.gc.ca.
² The spatial interpolation data can be found at http://www.ai_geostats.org under SIC97.

5.2 Bayesian Optimization

Machine learning algorithms often require tuning parameters, which control learning rates and abilities, via optimizing an objective function. One can model this objective function using a Gaussian process, under a powerful iterative optimization procedure known as Gaussian process Bayesian optimization [Brochu et al., 2010]. To pick where to query the objective function next, one can optimize the expected improvement (EI) over the running optimum, the probability of improving the current best, or a GP upper confidence bound.

Figure 5: Posterior distribution of a function to maximize under a GP prior (top) and acquisition functions (bottom). The solid green line is the acquisition function for a GP; the dotted red and dashed black lines are for TP priors with ν = 15 and ν = 5 respectively. All other hyperparameters are kept the same.

5.2.1 Method

In this paper we work with the EI criterion and, for reasons described in Snoek et al. [2012], we use an ARD Matérn 5/2 kernel, defined as

$$k_{M52}(x, x') = \theta_0 \left(1 + \sqrt{5 r^2(x, x')} + \tfrac{5}{3} r^2(x, x')\right) \exp\left(-\sqrt{5 r^2(x, x')}\right), \quad (10)$$

where $r^2(x, x') = \sum_{d=1}^{D} (x_d - x'_d)^2 / \theta_d^2$.

We assume that the function we wish to optimize over is f : R^D → R and is drawn from a multivariate Student-t process with scale parameter ν > 2, constant mean µ, and kernel function a linear sum of an ARD Matérn 5/2 kernel and a delta function kernel. Our goal is to find where f attains its minimum.

Let X_N = {x_n, f_n}, n = 1, ..., N, be our current set of N observations and f_best = min{f₁, ..., f_N}. To compress notation, we let θ represent the parameters θ, ν, µ. Let the acquisition function a_EI(x; X_N, θ) denote the expected improvement over the current best value from choosing to sample at point x, given current observations X_N and hyperparameters θ. Note that the distribution of f(x) | X_N, θ is MVT₁(ν + N, µ̃(x; X_N), τ̃(x; X_N, ν)²), where the forms of µ̃ and τ̃ are derived in (6). Let γ̃ = (f_best − µ̃)/τ̃.
Then

$$a_{EI}(x; X_N, \theta) = \mathbb{E}\left[\max\left(f_{best} - f(x),\, 0\right) \mid X_N, \theta\right] = \int_{-\infty}^{f_{best}} (f_{best} - y)\, \frac{1}{\tilde{\tau}}\, \lambda_{\nu+N}\!\left(\frac{y - \tilde{\mu}}{\tilde{\tau}}\right) dy = \tilde{\gamma}\tilde{\tau}\, \Lambda_{\nu+N}(\tilde{\gamma}) + \tilde{\tau}\left(1 + \frac{\tilde{\gamma}^2 - 1}{\nu + N - 1}\right) \lambda_{\nu+N}(\tilde{\gamma}), \quad (11)$$

where λ_ν and Λ_ν are the density and distribution functions of a MVT₁(ν, 0, 1) distribution respectively.

The parameters θ are all sampled from the posterior using slice sampling, similar to the method used in Snoek et al. [2012]. Suppose we have H sets of posterior samples {θ_h}, h = 1, ..., H. We set

$$\tilde{a}_{EI}(x; X_N) = \frac{1}{H} \sum_{h=1}^{H} a_{EI}(x; X_N, \theta_h) \quad (12)$$

as our approximate marginalized acquisition function. The choice of the next place to sample is x_next = argmax_{x ∈ R^D} ã_EI(x; X_N), which we find by using gradient descent based methods starting from a dense set of points in the input space.
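Equation (11) is straightforward to evaluate with standard Student-t routines, once the MVT₁(ν, 0, 1) parameterization (variance one, so a standard t rescaled by √((ν − 2)/ν)) is accounted for. The sketch below (ours) implements it and checks it against Monte Carlo; the posterior values are illustrative.

```python
import numpy as np
from scipy import stats

def mvt1_pdf(z, df):
    """Density of MVT_1(df, 0, 1): a standard t_df rescaled to unit variance."""
    c = np.sqrt(df / (df - 2.0))
    return stats.t.pdf(z * c, df) * c

def mvt1_cdf(z, df):
    return stats.t.cdf(z * np.sqrt(df / (df - 2.0)), df)

def expected_improvement_tp(f_best, mu_tilde, tau_tilde, nu, N):
    """Closed form EI under the TP posterior, equation (11)."""
    df = nu + N
    gamma = (f_best - mu_tilde) / tau_tilde
    return (gamma * tau_tilde * mvt1_cdf(gamma, df)
            + tau_tilde * (1.0 + (gamma ** 2 - 1.0) / (df - 1.0))
            * mvt1_pdf(gamma, df))

# Monte Carlo check of (11) with illustrative posterior values.
f_best, mu, tau, nu, N = 0.3, 0.5, 0.8, 5.0, 20
rng = np.random.default_rng(7)
df = nu + N
samples = mu + tau * np.sqrt((df - 2.0) / df) * stats.t(df).rvs(10**6,
                                                               random_state=rng)
print(expected_improvement_tp(f_best, mu, tau, nu, N))
print(np.maximum(f_best - samples, 0.0).mean())   # should be close
```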
To get more intuition on how ν changes the behaviour of the acquisition function, we study an example in Figure 5. Here we fix all hyperparameters other than ν and plot the acquisition functions, varying ν. In this example, it is clear that in certain scenarios the TP prior and GP prior will lead to very different proposals given the same information.

5.2.2 Experiments

We compare a TP prior with a Matérn plus a delta function kernel to a GP prior with the same kernel, for Bayesian optimization. To integrate away uncertainty, we slice sample the hyperparameters [Neal, 2003]. We consider 3 functions: a 1-dimensional synthetic sinusoidal, the 2-dimensional Branin-Hoo function, and a 6-dimensional Hartmann function. All the results are shown in Figure 6.

Figure 6: Function evaluations for the synthetic function (left), the Branin-Hoo function (centre) and the Hartmann function (right). Evaluations under a Student-t process prior (solid line) and a Gaussian process prior (dashed line) are shown. Error bars represent the standard deviation of 50 runs. In each panel we are minimizing an objective function; the vertical axis represents the running minimum function value.

Sinusoidal synthetic function. In this experiment we aimed to find the minimum of f(x) = −(x − 1)² sin(3x + 5x⁻¹ + 1) in the interval [5, 10]. The function has 2 local minima in this interval. TP optimization clearly outperforms GP optimization in this problem; the TP was able to come to within 0.1% of the minimum in 8.1 ± 0.4 iterations, whilst the GP took 10.7 ± 0.6 iterations.

Branin-Hoo function. This function is a popular benchmark for optimization methods [Jones, 2001] and is defined on the set {(x₁, x₂) : 0 ≤ x₁ ≤ 15, −5 ≤ x₂ ≤ 15}. We initialized the runs with 4 initial observations, one for each corner of the input square.

Hartmann function. This is a function with 6 local minima in [0, 1]⁶ [Picheny et al., 2013]. The runs are initialised with 6 observations at corners of the unit cube in R⁶.

Notice that the TP tends to behave more like a step function, whereas the Gaussian process' rate of improvement is somewhat more constant. The reason for this behaviour is that the TP tends to more thoroughly explore any modes which it has found before moving away from these modes. This phenomenon seems more prevalent in higher dimensions.

6 CONCLUSIONS

We have shown that the inverse Wishart process (IWP) is an appropriate prior over covariance matrices of arbitrary size. We used an IWP prior over a GP kernel and showed that marginalizing over the IWP results in a Student-t process (TP). The TP has consistent marginals, closed form conditionals, and contains the Gaussian process as a special case. We also proved that the TP is the only elliptical process other than the GP which has an analytically representable density function. The TP prior was applied in regression and Bayesian optimization tasks, showing improved performance over GPs with no additional computational costs.

The take home message for practitioners should be that the TP has many if not all of the benefits of GPs, but with increased modelling flexibility at no extra cost. Our work suggests that it could be useful to replace GPs with TPs in almost any application. The added flexibility of the TP is orthogonal to the choice of kernel, and could complement recent expressive closed form kernels [Wilson and Adams, 2013, Wilson et al., 2013] in future work.

References

C. Archambeau and F. Bach. Multiple Gaussian Process Models. Advances in Neural Information Processing Systems, 2010.

E. Brochu, M. Cora, and N. de Freitas. A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Applications to Active User Modeling and Hierarchical Reinforcement Learning. arXiv, 2010.

P. Cortez, A. Cerdeira, F. Almeida, T. Matos, and J. Reis. Modeling Wine Preferences by Data Mining from Physicochemical Properties. Decision Support Systems, Elsevier, 2009.

A. P. Dawid. Spherical Matrix Distributions and a Multivariate Model. J. R. Statistical Society B, 1977.

A. P. Dawid. Some Matrix-Variate Distribution Theory: Notational Considerations and a Bayesian Application. Biometrika, 1981.

A. Edelman and N. Raj Rao. Random Matrix Theory. Acta Numerica, 1:1–65, 2005.

K. T. Fang, S. Kotz, and K. W. Ng. Symmetric Multivariate and Related Distributions. Chapman & Hall, 1989.

D. R. Jones. A Taxonomy of Global Optimization Methods Based on Response Surfaces. Journal of Global Optimization, 21(4):345–383, 2001.

D. Kelker. Distribution Theory of Spherical Distributions and a Location-Scale Parameter. Sankhya, Ser. A, 1970.

R. M. Neal. Monte Carlo Implementation of Gaussian Process Models for Bayesian Regression and Classification. Technical Report No. 9702, Dept of Statistics, University of Toronto, 1997.

R. M. Neal. Slice Sampling. Annals of Statistics, 31(3):705–767, 2003.

R. M. Neal. Handbook of Markov Chain Monte Carlo. Chapman & Hall/CRC, 2011.

V. Picheny, T. Wagner, and D. Ginsbourger. A Benchmark of Kriging-Based Infill Criteria for Noisy Optimization. Structural and Multidisciplinary Optimization, 48(3):607–626, 2013.

C. E. Rasmussen. Evaluation of Gaussian Processes and other Methods for Non-Linear Regression. PhD thesis, Graduate Department of Computer Science, University of Toronto, 1996.

C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
J. Snoek, H. Larochelle, and R. Adams. Practical Bayesian Optimization of Machine Learning Algorithms. Advances in Neural Information Processing Systems, 2012.

J. Vanhatalo, P. Jylanki, and A. Vehtari. Gaussian Process Regression with Student-t Likelihood. Advances in Neural Information Processing Systems, pages 1910–1918, 2009.

A. G. Wilson and R. P. Adams. Gaussian Process Covariance Kernels for Pattern Discovery and Extrapolation. Proceedings of the 30th International Conference on Machine Learning, 2013.

A. G. Wilson, E. Gilboa, A. Nehorai, and J. P. Cunningham. GPatt: Fast Multidimensional Pattern Extrapolation with Gaussian Processes. arXiv, 2013.

Z. Xu, F. Yan, and Y. Qi. Sparse Matrix-Variate t Process Blockmodel. 2011.

S. Yu, V. Tresp, and K. Yu. Robust Multi-Task Learning with t-Processes. 2007.

Y. Zhang and D. Y. Yeung. Multi-Task Learning using Generalized t Process. Proceedings of the 13th Conference on Artificial Intelligence and Statistics, 2010.

Supplementary Material

In Appendix A, we provide proofs of lemmas and corollaries from our paper. We describe the derivatives of the log marginal likelihood of the Student-t process, which are useful for hyperparameter learning, in Appendix B. In Appendix C we offer more insight as to why two seemingly different covariance priors for a Gaussian process prior lead to the same marginal distribution.

A Proofs

Lemma 1. The multivariate Student-t is consistent under marginalization.

Proof. Assume the generative process of equation (3) of the main text. Σ₁₁ is IW_{n₁}(ν, K₁₁) distributed for any principal submatrix Σ₁₁ of Σ. Furthermore, y₁ | Σ₁₁ ∼ N_{n₁}(φ₁, (ν − 2)Σ₁₁), since the Gaussian distribution is consistent under marginalization. Hence y₁ ∼ MVT_{n₁}(ν, φ₁, K₁₁).

Lemma 2. Suppose f ∼ TP(ν, Φ, k) and g ∼ GP(Φ, k). Then f tends to g in distribution as ν → ∞.

Proof. It is sufficient to show convergence in density for any finite collection of inputs. Let y ∼ MVT_n(ν, φ, K) and set β = (y − φ)ᵀK⁻¹(y − φ); then

$$p(y) \propto \left(1 + \frac{\beta}{\nu - 2}\right)^{-(\nu + n)/2} \to e^{-\beta/2} \quad \text{as } \nu \to \infty.$$

Hence the distribution of y tends to a N_n(φ, K) distribution as ν → ∞.

Lemma 3. Suppose y ∼ MVT_n(ν, φ, K) and let y₁ and y₂ represent the first n₁ and remaining n₂ entries of y respectively. Then

$$y_2 \mid y_1 \sim \mathrm{MVT}_{n_2}\left(\nu + n_1,\; \tilde{\phi}_2,\; \frac{\nu + \beta_1 - 2}{\nu + n_1 - 2} \times \tilde{K}_{22}\right), \quad (13)$$

where $\tilde{\phi}_2 = K_{21}K_{11}^{-1}(y_1 - \phi_1) + \phi_2$, $\beta_1 = (y_1 - \phi_1)^\top K_{11}^{-1}(y_1 - \phi_1)$ and $\tilde{K}_{22} = K_{22} - K_{21}K_{11}^{-1}K_{12}$.

Proof. Let β₂ = (y₂ − φ̃₂)ᵀK̃₂₂⁻¹(y₂ − φ̃₂). Note that β₁ + β₂ = (y − φ)ᵀK⁻¹(y − φ). We have

$$p(y_2 \mid y_1) = \frac{p(y_1, y_2)}{p(y_1)} \propto \left(1 + \frac{\beta_1 + \beta_2}{\nu - 2}\right)^{-(\nu + n)/2} \left(1 + \frac{\beta_1}{\nu - 2}\right)^{(\nu + n_1)/2} \propto \left(1 + \frac{\beta_2}{\beta_1 + \nu - 2}\right)^{-(\nu + n)/2}.$$

Comparing this expression to the definition of an MVT density function gives the required result.
Lemma 4. Let K ∈ Π(n), φ ∈ Rⁿ, ν > 2, ρ > 0 and

$$r^{-1} \sim \Gamma(\nu/2, \rho/2), \qquad y \mid r \sim \mathcal{N}_n(\phi,\; r(\nu - 2)K/\rho); \quad (14)$$

then marginally y ∼ MVT_n(ν, φ, K).

Proof. Let β = (y − φ)ᵀK⁻¹(y − φ). We can analytically marginalize out the scalar r:

$$p(y) = \int p(y \mid r)\, p(r)\, dr \propto \int \exp\left(-\frac{\rho\beta}{2(\nu - 2)r}\right) r^{-n/2} \exp\left(-\frac{\rho}{2r}\right) r^{-(\nu + 2)/2}\, dr \propto \left(1 + \frac{\beta}{\nu - 2}\right)^{-\frac{\nu + n}{2}} \int \exp\left(-\frac{1}{2r}\right) r^{-(\nu + n + 2)/2}\, dr \propto \left(1 + \frac{\beta}{\nu - 2}\right)^{-\frac{\nu + n}{2}}.$$

Hence y ∼ MVT_n(ν, φ, K). Note the redundancy in ρ: without loss of generality, let ρ = 1.

Corollary 7. Suppose Y = {y_i} is an elliptical process. Any finite collection z = {z₁, ..., zₙ} ⊂ Y has an analytically representable density if and only if Y is either a Gaussian process or a Student-t process.

Proof. By Theorem 6, we need to be able to analytically solve ∫ p(z | r) p(r) dr, where z | r ∼ N_n(µ, rΩΩᵀ). This is possible either when r is a constant with probability 1, or when r ∼ Γ⁻¹(ν/2, 1/2), the conjugate prior. These lead to the Gaussian and Student-t processes respectively.

B Marginal Likelihood Derivatives

Being able to analytically compute the derivative of the likelihood with respect to the hyperparameters is useful for hyperparameter learning, e.g. maximum likelihood or Hamiltonian (Hybrid) Monte Carlo. The log marginal likelihood is

$$\log p(y \mid \nu, K_\theta) = -\frac{n}{2}\log((\nu - 2)\pi) - \frac{1}{2}\log|K_\theta| + \log\left(\frac{\Gamma\left(\frac{\nu + n}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)}\right) - \frac{\nu + n}{2}\log\left(1 + \frac{\beta}{\nu - 2}\right),$$

where β = (y − φ)ᵀK_θ⁻¹(y − φ), and its derivative with respect to a hyperparameter is

$$\frac{\partial}{\partial\theta} \log p(y \mid \nu, \phi, K_\theta) = \frac{1}{2}\mathrm{Tr}\left(\left(\frac{\nu + n}{\nu + \beta - 2}\, \alpha\alpha^\top - K_\theta^{-1}\right) \frac{\partial K_\theta}{\partial\theta}\right),$$

where α = K_θ⁻¹(y − φ). We may also learn ν using gradient based methods and the following derivative:

$$\frac{\partial}{\partial\nu} \log p(y \mid \nu, K_\theta) = -\frac{n}{2(\nu - 2)} + \frac{1}{2}\psi\left(\frac{\nu + n}{2}\right) - \frac{1}{2}\psi\left(\frac{\nu}{2}\right) - \frac{1}{2}\log\left(1 + \frac{\beta}{\nu - 2}\right) + \frac{(\nu + n)\beta}{2(\nu - 2)^2 + 2\beta(\nu - 2)}, \quad (15)$$

where ψ is the digamma function.
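As a quick check of (15) (ours, not from the paper), the analytic ∂/∂ν derivative can be compared against a central finite difference of the log marginal likelihood; the kernel matrix and data below are arbitrary.

```python
import numpy as np
from scipy.special import gammaln, digamma

def log_ml(y, nu, phi, K):
    """log p(y | nu, K) for the MVT_n(nu, phi, K) marginal."""
    n = len(y)
    L = np.linalg.cholesky(K)
    a = np.linalg.solve(L, y - phi)
    beta = a @ a
    return (-(n / 2) * np.log((nu - 2) * np.pi) - np.log(np.diag(L)).sum()
            + gammaln((nu + n) / 2) - gammaln(nu / 2)
            - ((nu + n) / 2) * np.log1p(beta / (nu - 2)))

def dlog_ml_dnu(y, nu, phi, K):
    """Analytic derivative with respect to nu, equation (15)."""
    n = len(y)
    beta = (y - phi) @ np.linalg.solve(K, y - phi)
    return (-n / (2 * (nu - 2))
            + 0.5 * digamma((nu + n) / 2) - 0.5 * digamma(nu / 2)
            - 0.5 * np.log1p(beta / (nu - 2))
            + (nu + n) * beta / (2 * (nu - 2) ** 2 + 2 * beta * (nu - 2)))

rng = np.random.default_rng(8)
n, nu = 5, 6.0
A = rng.standard_normal((n, n))
K = A @ A.T + n * np.eye(n)
phi = np.zeros(n)
y = rng.standard_normal(n)

eps = 1e-5
fd = (log_ml(y, nu + eps, phi, K) - log_ml(y, nu - eps, phi, K)) / (2 * eps)
print(dlog_ml_dnu(y, nu, phi, K), fd)   # should agree to high precision
```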
C More Insight Into the Inverse Wishart Process and Inverse Gamma Priors

As a reminder, we define a Wishart distribution as follows.

Definition. A random Σ ∈ Π(n) is Wishart distributed with parameters ν > n − 1, K ∈ Π(n), and we write Σ ∼ W_n(ν, K), if its density is given by

$$p(\Sigma) = c_n(\nu, K)\, |\Sigma|^{(\nu - n - 1)/2} \exp\left(-\tfrac{1}{2}\mathrm{Tr}\left(K^{-1}\Sigma\right)\right), \quad (16)$$

where $c_n(\nu, K) = \left(|K|^{\nu/2}\, 2^{\nu n/2}\, \Gamma_n(\nu/2)\right)^{-1}$.

C.1 The Multivariate Gamma Function

The function in the normalizing constant of the Wishart distribution is called the multivariate gamma function, and is defined as follows.

Definition. The multivariate gamma function, Γ_n(·), is a generalization of the gamma function, defined as

$$\Gamma_n(a) = \int_{S > 0} |S|^{a - (n+1)/2} \exp\left(-\mathrm{Tr}(S)\right) dS, \quad (17)$$

where S > 0 means S is positive definite.

In the following lemma we give an explicit relationship between the multivariate gamma function and the gamma function.

Lemma A.

$$\Gamma_n(a) = \pi^{n(n-1)/4} \prod_{j=1}^{n} \Gamma\left(a + (1 - j)/2\right). \quad (18)$$

Proof. Decomposing S into the scalar block S₁₁, the off-diagonal block S₁₂, and the Schur complement S₂₂.₁ = S₂₂ − S₂₁S₁₁⁻¹S₁₂, we have

$$\Gamma_n(a) = \int_{S > 0} |S|^{a - (n+1)/2} \exp\left(-\mathrm{Tr}(S)\right) dS = \int S_{11}^{a - (n+1)/2} e^{-S_{11}}\, |S_{22.1}|^{a - (n+1)/2} \exp\left(-\mathrm{Tr}(S_{22.1})\right) \exp\left(-\mathrm{Tr}\left(S_{21}S_{11}^{-1}S_{12}\right)\right) dS_{11}\, dS_{12}\, dS_{22.1}$$
$$= \int_{S_{11} > 0} (\pi S_{11})^{(n-1)/2}\, S_{11}^{a - (n+1)/2}\, e^{-S_{11}}\, dS_{11} \times \int_{S_{22.1} > 0} |S_{22.1}|^{a - (n+1)/2} \exp\left(-\mathrm{Tr}(S_{22.1})\right) dS_{22.1} = \pi^{(n-1)/2}\, \Gamma(a)\, \Gamma_{n-1}(a - 1/2).$$

This recursive relationship, and the fact that Γ₁(b) = Γ(b), imply

$$\Gamma_n(a) = \prod_{j=1}^{n} \pi^{(j-1)/2}\, \Gamma(a - (j - 1)/2) = \pi^{n(n-1)/4} \prod_{j=1}^{n} \Gamma(a + (1 - j)/2),$$

which is as required.

A simple corollary of this result will be key later.

Corollary B.

$$\frac{\Gamma_n(a)}{\Gamma_n(a - 1/2)} = \frac{\Gamma(a)}{\Gamma(a - n/2)}. \quad (19)$$

C.2 Two Different Covariance Priors

The two generative processes we are interested in are

$$r^{-1} \sim \Gamma(\nu/2, 1/2), \qquad y_1 \sim \mathcal{N}_n(0, (\nu - 2)rK);$$
$$\Omega \sim \mathcal{W}_n(\nu + n - 1, K^{-1}), \qquad y_2 \sim \mathcal{N}_n(0, (\nu - 2)\Omega^{-1}),$$

where n ∈ N, ν > 2 and K is an n × n symmetric, positive definite matrix.

The marginal distribution for y₁ is

$$p(y_1) = \int p(y_1 \mid r)\, p(r)\, dr = \int (2\pi r(\nu - 2))^{-n/2}\, |K|^{-1/2} \exp\left(-\frac{y_1^\top K^{-1} y_1}{2(\nu - 2)r}\right) \frac{r^{-\nu/2 - 1} \exp(-1/(2r))}{2^{\nu/2}\, \Gamma(\nu/2)}\, dr$$
$$= \frac{(2\pi(\nu - 2))^{-n/2}\, |K|^{-1/2}}{2^{\nu/2}\, \Gamma(\nu/2)} \int r^{-(\nu + n)/2 - 1} \exp\left(-\left(1 + \frac{y_1^\top K^{-1} y_1}{\nu - 2}\right) \Big/ 2r\right) dr$$
$$= (\pi(\nu - 2))^{-n/2}\, |K|^{-1/2} \left(1 + \frac{y_1^\top K^{-1} y_1}{\nu - 2}\right)^{-(\nu + n)/2} \frac{\Gamma\left((\nu + n)/2\right)}{\Gamma\left(\nu/2\right)}. \quad (20)$$

The marginal distribution for y₂ is

$$p(y_2) = \int p(y_2 \mid \Omega)\, p(\Omega)\, d\Omega = \int (2\pi(\nu - 2))^{-n/2}\, |\Omega|^{1/2} \exp\left(-\frac{y_2^\top \Omega y_2}{2(\nu - 2)}\right) c_n(\nu + n - 1, K^{-1})\, |\Omega|^{(\nu - 2)/2} \exp\left(-\tfrac{1}{2}\mathrm{Tr}(K\Omega)\right) d\Omega$$
$$= (2\pi(\nu - 2))^{-n/2}\, c_n(\nu + n - 1, K^{-1}) \int |\Omega|^{(\nu - 1)/2} \exp\left(-\tfrac{1}{2}\mathrm{Tr}\left(\left(K + \frac{y_2 y_2^\top}{\nu - 2}\right)\Omega\right)\right) d\Omega$$
$$= (2\pi(\nu - 2))^{-n/2}\, c_n(\nu + n - 1, K^{-1})\, c_n\left(\nu + n, \left(K + \frac{y_2 y_2^\top}{\nu - 2}\right)^{-1}\right)^{-1}$$
$$= (\pi(\nu - 2))^{-n/2}\, |K|^{-1/2} \left(1 + \frac{y_2^\top K^{-1} y_2}{\nu - 2}\right)^{-(\nu + n)/2} \frac{\Gamma_n\left((\nu + n)/2\right)}{\Gamma_n\left((\nu + n - 1)/2\right)}. \quad (21)$$

Both marginal distributions are equivalent given the result in Corollary B.
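As a final numerical aside (ours), Corollary B, and hence the equality of (20) and (21), can be checked directly with SciPy's log multivariate gamma function:

```python
import numpy as np
from scipy.special import gammaln, multigammaln

# Corollary B: Gamma_n(a) / Gamma_n(a - 1/2) = Gamma(a) / Gamma(a - n/2)
a, n = 7.3, 5
lhs = multigammaln(a, n) - multigammaln(a - 0.5, n)
rhs = gammaln(a) - gammaln(a - n / 2)
print(lhs, rhs)   # equal up to floating point error

# This is exactly why the Gamma_n ratio in (21) reduces to the Gamma ratio
# in (20): take a = (nu + n) / 2, so that a - n/2 = nu / 2.
nu = 4.6
a = (nu + n) / 2
print(multigammaln(a, n) - multigammaln(a - 0.5, n),
      gammaln(a) - gammaln(nu / 2))
```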
