Parameter estimation in high dimensional Gaussian distributions

In order to compute the log-likelihood for high dimensional spatial Gaussian models, it is necessary to compute the determinant of the large, sparse, symmetric positive definite precision matrix, Q. Traditional methods for evaluating the log-likelihood for very large models may fail due to the massive memory requirements. We present a novel approach for evaluating such likelihoods when the matrix-vector product, Qv, is fast to compute. In this approach we utilise matrix functions, Krylov subspaces, and probing vectors to construct an iterative method for computing the log-likelihood.

Authors: Erlend Aune, Daniel P. Simpson

Parameter Estimation in High Dimensional Gaussian Distributions

Erlend Aune (NTNU, Norway) and Daniel P. Simpson (NTNU, Norway)

May 28, 2018

Abstract

In order to compute the log-likelihood for high dimensional spatial Gaussian models, it is necessary to compute the determinant of the large, sparse, symmetric positive definite precision matrix, Q. Traditional methods for evaluating the log-likelihood for very large models may fail due to the massive memory requirements. We present a novel approach for evaluating such likelihoods when the matrix-vector product, Qv, is fast to compute. In this approach we utilise matrix functions, Krylov subspaces, and probing vectors to construct an iterative method for computing the log-likelihood.

1 Introduction

In computational and, in particular, spatial statistics, increasing possibilities for observing large amounts of data leave the statistician in want of computational techniques capable of extracting useful information from such data. Large data sets arise in many applications, such as modelling seismic data acquisition (Buland and Omre (2003)); analysing satellite data for ozone intensity, temperature and cloud formations (McPeters et al. (1996)); or constructing global climate models (Lindgren et al. (2011)). Most models in spatial statistics are based around multivariate Gaussian distributions, which have the probability density function

    p(x) = (2π)^{-k/2} det(Q)^{1/2} exp( -(1/2) (x - µ)^T Q (x - µ) ),

where the precision matrix Q is the inverse of the covariance matrix. In this paper, we assume that the precision matrix is sparse, which essentially enforces a Markov property on the Gaussian random field. These models have better computational properties than those based on the covariance, and there are modelling reasons to prefer using Q directly (Lindgren et al. (2011)).
We note that Rue and Tjelmeland (2002) showed that it is possible to approximate general Gaussian random fields by Gaussian Markov random fields (GMRFs), that is, Gaussian random vectors with sparse precision matrices. Throughout this paper, we will consider the common Gauss-linear model, in which our data is a noisy observation of a linear transformation of a true random field, that is,

    y = A(θ) x + ε_1,    (1)

where the matrix A(θ) models the observation of the true underlying field x, known as the 'forward model', while ε_1 ~ N(0, Q_1^{-1}) is the observation noise. In order to complete the model, we require a prior distribution on x, which we take to be Gaussian with mean µ and precision matrix Q_x(η), and appropriate hyperpriors are given for the mean µ, the hyperparameters in the prior η and the hyperparameters in the forward model θ.

Given a set of data, we wish to infer η, θ and x. The ways in which this is done can vary, but in the end, several determinant evaluations are needed. One way to do this is to alternately estimate x and (η, θ) using the distributions p(x | y, η, θ) and p(η, θ | x, y), updating each consecutively. That is:

1. Find argmax_x p(x | y, η, θ).
2. Find argmax_{η,θ} p(η, θ | x, y).
3. Repeat until convergence.

In a Gauss-linear model, the first step involves a linear solve, while the second is an optimisation over the space in which η and θ reside. The distribution is

    p(η, θ | x, y) ∝ p(y | x, η, θ) p(η, θ | x) = p(y | x, θ) p(x | η) p(η) p(θ).    (2)

The log of the last line in (2) gives the objective function for the hyperparameters in this case:

    Φ(η, θ) = -log[ p(y | x, θ) p(x | η) p(η) p(θ) ].    (3)

We also have that p(y | x, θ) p(x | η), as a function of x, is N(µ_p, Q_p), where Q_p = Q_x(η) + A(θ)^T Q_1 A(θ) and µ_p = Q_p^{-1}(Q_x µ + A^T Q_1 y).
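On a small dense problem, the precision-side formulas Q_p = Q_x + A^T Q_1 A and µ_p = Q_p^{-1}(Q_x µ + A^T Q_1 y) can be checked against standard covariance-side (Kalman-type) conditioning. The following sketch is ours, not the paper's code; all sizes and variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 5, 3                              # latent and data dimensions (illustrative)
A = rng.standard_normal((m, n))          # forward model A(theta)
M = rng.standard_normal((n, n))
Qx = M @ M.T + n * np.eye(n)             # prior precision Q_x(eta), SPD
Q1 = np.diag(rng.uniform(1.0, 2.0, m))   # observation noise precision Q_1
mu = rng.standard_normal(n)
y = rng.standard_normal(m)

# Precision-side formulas from the text:
Qp = Qx + A.T @ Q1 @ A
mu_p = np.linalg.solve(Qp, Qx @ mu + A.T @ Q1 @ y)

# Covariance-side (Kalman-type) conditioning, for comparison:
S = np.linalg.inv(Qx)                                        # prior covariance
K = S @ A.T @ np.linalg.inv(A @ S @ A.T + np.linalg.inv(Q1)) # gain
mu_ref = mu + K @ (y - A @ mu)                               # posterior mean
Sp_ref = S - K @ A @ S                                       # posterior covariance
```

The two routes agree to machine precision, which is the content of the N(µ_p, Q_p) statement above; the precision-side route is the one that scales to sparse Q.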
The separated expression, p(y | x, θ) p(x | η), in Φ is usually preferable, but this is problem dependent. Expanding (3), we get

    Φ(η, θ) = C - (1/2) log det(Q_1) + (1/2)(y - A(θ)x)^T Q_1 (y - A(θ)x)
              - (1/2) log det(Q_x(η)) + (1/2)(x - µ)^T Q_x (x - µ) - log p(η) - log p(θ).    (4)

In this expression, the only term which is difficult to evaluate is log det(Q_x(η)), and it is needed for estimating the hyperparameters, both for point estimates and for Gaussian or Laplace approximations of the hyperparameters (see Carlin and Louis (2000) for details on such approximations). It is this evaluation and its use in optimisation that we address in this article.

2 Determinant evaluations

The most common way to compute the log-determinant of a sparse precision or covariance matrix is to 1) reorder Q to optimise for Cholesky factorisation, 2) perform a Cholesky factorisation of the reordered matrix Q = LL^T, 3) extract the diagonal entries l = diag(L) and 4) set the log-determinant as log det Q = Σ_{j=1}^n 2 log(l_j) (this comes from the identity det Q = det L det L^T = (det L)^2). The algorithm takes very few lines to program, given a good sparse matrix sorting routine, such as METIS (Karypis and Kumar (1999)), and a fast sparse Cholesky factoring implementation, such as CHOLMOD (Davis and Hager (1999), Chen et al. (2008)). Problems occur, however, when there are massive amounts of fill-in in the Cholesky factorisation even after reordering the matrix in question. For a Gaussian Markov random field, the dimensionality of the underlying parameter space affects the storage cost of computing the Cholesky factorisation: in R^1 the cost is O(n); in R^2, O(n^{3/2}); in R^3, O(n^2) (Rue and Held (2005)).
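The four steps above fit in a few lines. Here dense NumPy Cholesky stands in for the fill-reducing reordering plus sparse CHOLMOD factorisation; this is a sketch of the standard identity, not the authors' implementation:

```python
import numpy as np

def logdet_cholesky(Q):
    """log det(Q) for SPD Q via Q = L L^T and 2 * sum(log(l_jj)).

    In the sparse setting, one would first apply a fill-reducing
    reordering (e.g. METIS) and then a sparse Cholesky factorisation
    (e.g. CHOLMOD); a symmetric permutation P leaves
    det(P Q P^T) = det(Q) unchanged, so the identity still applies.
    """
    L = np.linalg.cholesky(Q)                    # step 2: Q = L L^T
    return 2.0 * np.sum(np.log(np.diag(L)))      # steps 3-4
```

For example, `logdet_cholesky(np.diag([1.0, 2.0, 4.0]))` returns log(8).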
2.1 Alternative approximations

The starting point for an alternative, less memory intensive procedure for computing the log-determinant comes from the identity

    log det Q = tr log Q.    (5)

In the symmetric positive definite case, this identity is proved by noting that det Q = Π_{i=1}^n λ_i, where {λ_i} are the eigenvalues of Q, and that log Q = V log(Λ) V^T with Λ = diag(λ) and V containing the eigenvectors of Q. Furthermore, tr(V log(Λ) V^T) = tr(log(Λ) V^T V) = tr log Λ, which gives the identity.

How can this be useful in computation? A trivial observation shows that

    tr log Q = Σ_{j=1}^n e_j^T log(Q) e_j,

where e_j = (0, ..., 1, ..., 0)^T with the one in position j. While it is cumbersome to carry out this computation, it is the basis for stochastic estimators of the log-determinant. Such estimators have been studied for the trace of a matrix, the trace of the inverse of a matrix and, in our case, the trace of the logarithm of a matrix. Details can be found in Hutchinson (1989) and Bai and Golub (1997). The Hutchinson stochastic estimator is described as follows: 1) let v_j, j = 1, ..., s, be vectors with independent entries satisfying P(v_k = 1) = P(v_k = -1) = 1/2; 2) set

    tr log Q ≈ (1/s) Σ_{j=1}^s v_j^T log(Q) v_j.    (6)

As this is a Monte Carlo method, it is possible to compute confidence regions for the estimators, using either Monte Carlo techniques or the Hoeffding inequality (Bai and Golub (1997)). The Hutchinson estimator was formulated for approximating tr Q^{-1}, in which case we replace the log Q in (6) with Q^{-1}. The Hutchinson estimator requires a large number of random vectors v_j to be sufficiently accurate for optimisation. The memory requirements are low, but we may have to wait an eternity for one determinant approximation.
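Equation (6) can be sketched directly. In this illustration log(Q)v is computed by a dense eigendecomposition, which is exactly what the paper's method avoids in high dimensions; the function name and setup are ours:

```python
import numpy as np

def hutchinson_tr_log(Q, s, rng):
    """Estimate tr(log Q) = log det(Q) with s Rademacher probe vectors.

    log(Q) v is formed via the dense eigendecomposition
    Q = V diag(lam) V^T, purely for illustration on small matrices.
    """
    lam, V = np.linalg.eigh(Q)
    n = Q.shape[0]
    est = 0.0
    for _ in range(s):
        v = rng.choice([-1.0, 1.0], size=n)        # P(+1) = P(-1) = 1/2
        est += v @ (V @ (np.log(lam) * (V.T @ v)))  # v^T log(Q) v
    return est / s
```

For diagonal Q the estimator is exact for any number of samples (the off-diagonal cross terms that drive the Monte Carlo variance vanish), which makes a convenient sanity check.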
The question, then, is: can we choose the v_j s in a clever way, so that we require far fewer vectors? In recent publications, Bekas et al. (2007) and Tang and Saad (2010) explored the use of probing vectors for extracting the diagonal of a matrix or its inverse. In the first of these, the diagonal of a sparse matrix is extracted, and under mild assumptions this is relatively straightforward. The second relies on approximate sparsity of the inverse, where the approximate sparsity pattern of Q^{-1} is determined by a power of the original matrix, i.e. Q^p. Assuming such a sparsity structure, it is possible to compute the probing vectors {v_j}_{j=1}^s by a neighbour colouring of the graph induced by Q^p (see e.g. Culberson (1992) for a survey of greedy graph colouring algorithms). A heuristic suggested in Tang and Saad (2010) for finding the power p in Q^p is to solve Qx = e_j and set p = min{ d(l, j) : |x_l| < ε }, where d gives the graph distance. In our case, we may compute log(Q) e_j and use the same heuristic. A nice feature is that the probing vectors need not be stored, but may be computed cheaply on the fly. If we pre-compute them, they are sparse, and do not need much storage. Since what we need for each probing vector is v_j^T log(Q) v_j, we observe that the computation is highly parallel with low communication costs: each node gets one probing vector, computes v_j^T log(Q) v_j, and sends back the result. In essence, this leads to linear speedup in the number of processors available, with proportionality close to unity. Next, we need to consider the evaluation of log(Q) v_j.
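A minimal version of the probing construction: colour the graph of Q^p greedily and take one indicator vector per colour class. If log(Q) were exactly sparse with the pattern of Q^p, then Σ_c v_c^T log(Q) v_c would recover the trace exactly, because same-coloured nodes are non-adjacent in that graph and their cross terms vanish. Everything below is our sketch under that assumption, using dense boolean matrix powers that only make sense at toy sizes:

```python
import numpy as np

def probing_vectors(Q, p):
    """0/1 probing vectors from a greedy colouring of the graph of Q^p.

    Two indices receive different colours whenever (Q^p)_{ij} != 0,
    i.e. whenever they are within graph distance p in the graph of Q.
    """
    n = Q.shape[0]
    pattern = (np.abs(Q) > 0).astype(int)
    reach = np.linalg.matrix_power(pattern, p) > 0   # adjacency of Q^p
    colour = -np.ones(n, dtype=int)
    for i in range(n):
        # greedy: smallest colour not used by an already-coloured neighbour
        used = {colour[j] for j in range(n)
                if reach[i, j] and j != i and colour[j] >= 0}
        c = 0
        while c in used:
            c += 1
        colour[i] = c
    return [(colour == c).astype(float) for c in range(colour.max() + 1)]
```

On a tridiagonal matrix with p = 1 this produces the two-colouring of a path graph, and summing v^T M v over the probing vectors reproduces tr(M) exactly, since all nonzeros of M sit within distance 1.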
The matrices we consider have real positive spectrum, and it is possible to evaluate log(Q) v_j through Cauchy's integral formula,

    log(Q) v_j = (1/(2πi)) ∮_Γ log(z) (zI - Q)^{-1} v_j dz,

where Γ is a suitable curve enclosing the spectrum of Q and avoiding the branch cut of the logarithm. Direct quadrature over such curves can be terribly inefficient, but through clever conformal mappings, Hale et al. (2008) developed midpoint quadrature rules that converge rapidly with an increasing number of quadrature points, at the cost of needing estimates of the extremal eigenvalues of Q. In fact,

    ||log Q - f_N(Q)|| = O( e^{-2π²N / (log(λ_max/λ_min) + 6)} ),

with f_N as below. This essentially gives us the rational approximation

    log(Q) v_j ≈ f_N(Q) v_j = Σ_{l=1}^N α_l (Q - σ_l I)^{-1} v_j,    α_l, σ_l ∈ C.    (7)

In effect, we need to solve a family of shifted linear systems to approximate log(Q) v_j. How we compute f_N(Q) v_j is problem dependent, but in high dimensions we usually have to rely on iterative methods, such as Krylov methods. A Krylov subspace K_k(Q, v) is defined by

    K_k(Q, v) = span{v, Qv, Q²v, ..., Q^k v},

and a thorough introduction to the use and theory of Krylov methods can be found in Saad (2003). Which Krylov method we use depends heavily on the quality and performance of the available preconditioners for the matrix Q. While the Krylov method of choice is problem dependent, some are particularly well suited to computing the approximation in (7). These methods are based on the fact that K_k(Q, v) = K_k(Q - σ_l I, v), and after some simple algebra we obtain the coefficients for the shifted systems without computing new matrix-vector products; see Jegerlehner (1996) and Frommer (2003) for details. We have employed the method CG-M of Jegerlehner (1996) in our computations.
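The conformally mapped rules of Hale et al. (2008) require elliptic-function quadrature nodes. As a structurally identical stand-in, the sketch below instead discretises the standard real integral representation log(A) = ∫₀¹ (A - I)[t(A - I) + I]^{-1} dt (see e.g. Higham (2008)) with Gauss-Legendre quadrature, so that log(Q)v again becomes a weighted sum of shifted solves exactly as in (7). This quadrature choice and the function name are our illustration, not the paper's rule:

```python
import numpy as np

def log_times_v(Q, v, N=20):
    """Approximate log(Q) v as a weighted sum of N shifted linear solves.

    Discretises log(A) = int_0^1 (A - I) [t (A - I) + I]^{-1} dt with
    an N-point Gauss-Legendre rule mapped from [-1, 1] to [0, 1].
    Each quadrature node costs one linear solve, mirroring (7).
    """
    nodes, weights = np.polynomial.legendre.leggauss(N)
    t = 0.5 * (nodes + 1.0)          # map nodes to [0, 1]
    w = 0.5 * weights
    n = Q.shape[0]
    B = Q - np.eye(n)
    out = np.zeros_like(v, dtype=float)
    for ti, wi in zip(t, w):
        out += wi * (B @ np.linalg.solve(ti * B + np.eye(n), v))
    return out
```

For well-conditioned Q the Gauss-Legendre rule converges geometrically in N; the conformally mapped rules matter when λ_max/λ_min is large, which is the regime of interest in the paper.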
One possible difficulty in employing the method is that we have complex shifts. This is remedied by using a variant, Conjugate Orthogonal CG-M (COCG-M), which entails using the symmetric bilinear form (x, y) = x^T y instead of the usual inner product (x, y) = x^H y in the Krylov iterations. See van der Vorst and Melissen (1990) for a description of the COCG method. In practice, little complex arithmetic is needed, since the complex, shifted coefficients are computed from the real ones obtained by the CG method used to solve Qx = y.

3 Examples

In this section we explore how optimisation fares under different conditions. For this purpose, we assume essentially the simplest possible model, but we plan to use the outlined approach on a seismic case in the future. The model we assume is the SPDE

    τ(κ - Δ)u = W,

in 2D, which we observe directly. We will explore how optimisation works (on τ, κ) for different values of κ and different distance colourings of the graph. Note that the number of colours needed is essentially independent of the granularity of the discretisation: a fine grid yields approximately the same number of colours as a coarse grid. The initial suspicion is that small κ, corresponding to long range, will be harder to optimise in the following sense: we need more Krylov iterations for the COCG-M routine to converge, and we need a larger distance colouring to cover the increasing range, resulting in more probing vectors. We use a modified Newton method for optimisation and compare using the exact determinant to using the approximation outlined above. For the COCG-M method, we use a relative tolerance of 10^{-3} for computing log(Q) v_j. Note that a prior on the parameters will stabilise the results as usual. The results are given in Table 1.
Table 1: Optimisation of (κ, τ) for different distance colourings

             Exact              2-dist             4-dist             6-dist             8-dist             10-dist
  κ = 1      (0.927, 1.015)     (1.06, 0.98)       (0.933, 1.013)     (0.927, 1.015)     ...                ...
  κ = 0.5    (0.455, 1.010)     (0.605, 0.961)     (0.471, 1.005)     (0.457, 1.009)     (0.455, 1.010)     ...
  κ = 0.1    (0.0842, 1.003)    (0.208, 0.940)     (0.122, 0.984)     (0.0983, 0.996)    (0.0891, 1.000)    (0.0859, 1.002)
  κ = 0.05   (0.0398, 1.001)    (0.138, 0.941)     (0.0762, 0.980)    (0.0567, 0.992)    (0.0475, 0.997)    (0.0434, 0.999)
  κ = 0.01   (0.00644, 0.998)   (0.0565, 0.947)    (0.0292, 0.978)    (0.0197, 0.987)    (0.0143, 0.992)    (0.0117, 0.994)

In Table 1, "..." indicates that the optimisation yields the same result as the previous entry. We see that as we increase the graph neighbourhood in our colourings, the results get closer and closer to those obtained with the exact determinant. We also see that the estimates are monotone: κ decreases and τ increases with larger distance colourings. Lastly, we see that some of the estimates are better when using the approximation. This should not necessarily be taken as a good sign, as it may only be a result of approximation errors. As increasing k in the k-distance colourings yields better and better approximations, one could be led to use a "large" k throughout the optimisation, whatever method one uses to determine k. Our results indicate, however, that we only need very good approximations in the last iterations of the optimisation procedure. In effect, we may use 2-distance colourings in the beginning, and move to 5- or larger-distance colourings in the last steps. Computing the colourings is cheap and one colouring only requires the storage of one vector, so we may store a couple of different colourings and use them as required in the optimisation procedure.
Lastly, we present an example that cannot be done using black-box Cholesky factorisations: a 3-D version of the model above, with κ = 0.5 and 2 million discretisation points. We use a 1-distance colouring for the first iterations and increase to 2- and 6-distance colourings in the last iterations. This gives us temporary k-distance estimates, which we also report. The estimates, progressively from the 1-, 2- and 6-distance colourings, were (7.627, 0.382), (1.401, 0.801), and (0.561, 0.988). It took 24 hours to complete the optimisation. From this we conclude that the method is slow, but it can be used for parameter estimation in high-dimensional problems where other alternatives are impossible due to memory limitations. We must, however, be careful to use enough colours to capture the essentials of the determinant. The estimated memory use for Cholesky factorisation in the determinant evaluations is 155 GB with METIS sorting of the graph, and much higher without; few computing servers have this amount of memory on a single node. The memory consumption of the approximation was at most 3 GB, and even this may be lowered considerably with some clever memory management. In a computing cluster, the time for this optimisation would have been much lower, due to the linear speedup in the number of nodes. We have tried the approximation on different types of matrices, and we found that the examples above are among the toughest for determinant approximation. Since we can compute reasonable approximations for this class, we expect that these likelihood evaluations will work for a large class of the precision matrices in use in statistics.
4 Discussion

We have showed that the determinant approximations discussed show promise for likelihood evaluations in models where we cannot perform Cholesky factorisations or Kronecker decompositions of the precision matrices. This may prove useful in high dimensional models where approximate likelihoods are not sufficient for accurate inference. It remains to utilise the approximations on a real world dataset.

References

Bai, Z. and Golub, G. (1997). Bounds for the trace of the inverse and the determinant of symmetric positive definite matrices. Annals of Numerical Mathematics, 4:29-38.

Bekas, C., Kokiopoulou, E., and Saad, Y. (2007). An estimator for the diagonal of a matrix. Applied Numerical Mathematics, 57(11-12):1214-1229.

Buland, A. and Omre, H. (2003). Bayesian linearized AVO inversion. Geophysics, 68:185-198.

Carlin, B. and Louis, T. (2000). Bayes and Empirical Bayes Methods for Data Analysis. CRC Press.

Chen, Y., Davis, T., Hager, W., and Rajamanickam, S. (2008). Algorithm 887: CHOLMOD, supernodal sparse Cholesky factorization and update/downdate. ACM Transactions on Mathematical Software (TOMS), 35(3):22.

Culberson, J. (1992). Iterated greedy graph coloring and the difficulty landscape. Technical report, University of Alberta.

Davis, T. and Hager, W. (1999). Modifying a sparse Cholesky factorization. SIAM Journal on Matrix Analysis and Applications, 20(3):606-627.

Frommer, A. (2003). BiCGStab(l) for families of shifted linear systems. Computing, 70(2):87-109.

Hale, N., Higham, N. J., and Trefethen, L. N. (2008). Computing A^α, log(A) and related matrix functions by contour integrals. SIAM Journal on Numerical Analysis, 46:2505-2523.

Higham, N. J. (2008). Functions of Matrices: Theory and Computation. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA.

Hutchinson, M. (1989). A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics - Simulation and Computation, 18:1059-1076.

Jegerlehner, B. (1996). Krylov space solvers for shifted linear systems. arXiv preprint arXiv:hep-lat/9612014.

Karypis, G. and Kumar, V. (1999). A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1):359.

Lindgren, F., Lindstrøm, J., and Rue, H. (2011). An explicit link between Gaussian fields and Gaussian Markov random fields: the SPDE approach. Journal of the Royal Statistical Society, Series B, to appear.

McPeters, R., et al. (1996). Nimbus-7 Total Ozone Mapping Spectrometer (TOMS) Data Products User's Guide. NASA, Scientific and Technical Information Branch.

Rue, H. and Held, L. (2005). Gaussian Markov Random Fields. Chapman & Hall.

Rue, H. and Tjelmeland, H. (2002). Fitting Gaussian Markov random fields to Gaussian fields. Scandinavian Journal of Statistics, 29:31-49.

Saad, Y. (2003). Iterative Methods for Sparse Linear Systems, 2nd ed. SIAM.

Tang, J. and Saad, Y. (2010). A probing method for computing the diagonal of the matrix inverse. Technical report, Minnesota Supercomputing Institute for Advanced Computational Research.

van der Vorst, H. and Melissen, J. (1990). A Petrov-Galerkin type method for solving Ax = b, where A is symmetric complex. IEEE Transactions on Magnetics, 26(2):706-708.
