Improved Estimation of High-dimensional Ising Models
Mladen Kolar, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, mladenk@cs.cmu.edu
Eric P. Xing, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, epxing@cs.cmu.edu

Abstract

We consider the problem of jointly estimating the parameters and the structure of binary-valued Markov Random Fields, in contrast to earlier work that focuses on only one of the two problems. We formulate the problem as maximization of an $\ell_1$-regularized surrogate likelihood, which allows us to find a sparse solution. Our optimization technique efficiently incorporates the cutting-plane algorithm in order to obtain a tighter outer bound on the marginal polytope, which improves both the parameter estimates and the approximation to the marginals. On synthetic data, we compare our algorithm on the two estimation tasks to existing methods. We analyze the method in the high-dimensional setting, where the number of dimensions p is allowed to grow with the number of observations n. The rate of convergence of the estimate is shown to depend explicitly on the sparsity of the underlying graph.

1 Introduction

Undirected graphical models, also known as Markov random fields (MRFs), have been successfully applied in a variety of domains, including natural language processing, computer vision, image analysis, spatial data analysis and statistical physics. In most of these domains, the structure of the graphical model is constructed by hand. However, in certain complex domains we have little expertise about the interactions between features in the data, and we need a method that automatically selects a model that represents the data well. We propose an algorithm that learns a sparse model that fits the data well and has an easily interpretable structure.

Let $X = (X_1, \ldots, X_p)^T$ be a random vector with distribution $P$ that can be represented by an undirected graph $G = (V, E)$. Each vertex in the set $V$ is associated with one component of the random vector $X$. The edge set $E$ of the graph $G$ encodes conditional independence assumptions among subsets of the p-dimensional random vector $X$: $X_i$ is conditionally independent of $X_j$ given the other variables if $(i, j) \notin E$. Let $\mathcal{D} = \{x^{(i)} = (x^{(i)}_1, \ldots, x^{(i)}_p)\}_{i=1}^n$ be an i.i.d. sample. Our task is to estimate the edge set $E$ as well as the parameters of the distribution that generated the sample. The main contribution of our paper is twofold: the development of an efficient algorithm that estimates the undirected graphical model from data, and the high-dimensional asymptotic analysis of its estimates. We find that the rate of convergence of the estimate depends explicitly on the sparsity of the underlying graphical model.

Under the assumption that $X$ is Gaussian, estimation of the graph structure is equivalent to estimation of the zeros in the inverse covariance matrix $\Sigma^{-1}$. Several methods have been proposed that estimate the graph structure from $\mathcal{D}$. [4] proposed a method that tests for partial correlations that are not significantly different from zero, which can be applied when p is small.
In the high-dimensional setting, when p is large, the estimation is more complicated, but under the assumption that the graph is sparse, several methods can be employed successfully for structure recovery (e.g. [10, 1, 7, 12]). If the random variable $X$ is discrete, the problem of structure estimation becomes even harder, as the likelihood cannot be optimized efficiently due to the difficulty of evaluating the log-partition function. Many recently proposed methods make use of $\ell_1$ regularization to learn sparse undirected models. [15] used a pseudo-likelihood approach, based on the local conditional likelihood at each node, which results in a consistent estimate of the structure, but not necessarily of the parameters. [9] proposed to optimize the $\ell_1$-penalized log-likelihood only over the set of active variables and to iteratively enlarge the set until optimality is achieved; however, the resulting graph structure is not necessarily sparse. [1] used the log-determinant relaxation of the log-partition function [19] to obtain a surrogate likelihood that can be easily optimized.

In this paper we are interested in learning both the parameters and the structure of a binary-valued Markov Random Field with pairwise interactions from observed data. An important insight into the problem of structure estimation is that even if a sparse graph is estimated, inference in the model is not necessarily tractable (e.g. inference in a grid, in which each node has only 4 neighbors, is not tractable). Since we are interested in using the estimated model for inference, we use the insight obtained from [16]: we use the same approximate procedure for estimating the parameters and for inference, as the bias introduced in the estimation phase can compensate for the error in the inference phase. Our method is most closely related to [1], as we also propose to optimize the $\ell_1$-penalized surrogate likelihood based on the log-determinant relaxation. However, in order to obtain a better estimate, we efficiently incorporate the cutting-plane algorithm [13] for obtaining a tighter outer bound on the marginal polytope into our optimization procedure. Using a better approximation to the log-partition function is crucial in reducing the approximation error of the estimate. In many applications, the ambient dimensionality of the model p is larger than the sample size n, and the classical asymptotic analysis, where the model is fixed and the sample size increases, does not give good insight into the model behavior. We analyze our estimate in the high-dimensional setting, i.e. we allow the dimension p to increase with the sample size n. In doing so, we allow the procedure to select a more complex model that can represent a larger class of distributions.

2 Preliminaries

In this section we briefly introduce MRFs. We explain why exact inference in discrete MRFs is, in general, intractable, and how to formulate approximate inference as a convex optimization problem. In Section 3, we derive our learning algorithm using the methods presented here.

Markov Random Fields. Consider an undirected graph $G$ with vertex set $V = \{1, 2, \ldots, p\}$ and edge set $E \subseteq V \times V$. A Markov random field consists of a random vector $X = (X_1, \ldots, X_p)^T \in \mathcal{X}^p$, where the random variable $X_s$ is associated with vertex $s \in V$.
In this paper we consider binary pairwise MRFs, the Ising model, in which the probability distribution factorizes as $P(x) = \exp(\langle \theta, \phi(x) \rangle - A(\theta))$ with $X \in \{-1, 1\}^p$. Here $\langle \theta, \phi(x) \rangle$ denotes the dot product between the parameter vector $\theta \in \mathbb{R}^d$ and the potentials $\phi(x) \in \mathbb{R}^d$, and $A(\theta) = \log \sum_{x \in \mathcal{X}^p} \exp(\langle \theta, \phi(x) \rangle)$ is the log-partition function. Since we are considering binary pairwise MRFs, the potentials are functions over nodes and edges of the form $\phi(x_v) = x_v$ and $\phi(x_u, x_v) = x_u x_v$. For future use, we introduce shorthand notation for the mean parameters, $\eta_v = \mathbb{E}_\theta[\phi(x_v)]$ and $\eta_{vv'} = \mathbb{E}_\theta[\phi(x_v, x_{v'})]$.

Log-partition function. Evaluating the log-partition function involves summing over exponentially many terms and, in general, is intractable. The log-partition function can be expressed as an optimization problem, in a variational formulation, using its Fenchel-Legendre conjugate dual $A^*(\eta)$:

$$A(\theta) = \sup_{\eta \in \mathcal{M}} \{\langle \theta, \eta \rangle - A^*(\eta)\}, \qquad (1)$$

where $\mathcal{M} := \{\eta \in \mathbb{R}^d \mid \exists\, p(X) \text{ s.t. } \eta = \mathbb{E}[\phi(x)]\}$ is the set of realizable mean parameters $\eta$, and the dual function $A^*(\eta) = -H(p(x; \eta(\theta)))$ equals the negative entropy of the distribution parametrized by the mean parameters $\eta$. For each parameter $\theta$ there is a corresponding mean parameter $\eta \in \mathcal{M}$ that maximizes (1). The relation is given as

$$\eta = \nabla A(\theta) = \mathbb{E}_\theta[\phi(x)]. \qquad (2)$$

[18, 19] list other properties of the log-partition function.

Log-partition relaxations. The log-partition function written in equation (1) defines an optimization problem restricted to the set $\mathcal{M}$. Since the set $\mathcal{M}$ is a polytope, it can be represented as an intersection of a finite number of hyperplanes; however, the number of hyperplanes needed to describe the set $\mathcal{M}$ grows exponentially with the number of nodes p. It is important to find an outer bound on $\mathcal{M}$ that can be easily characterized and is as tight as possible. One outer bound can be obtained using the set of points that satisfy local consistency conditions:

$$\mathrm{LOCAL}(G) := \left\{ \eta \in [-1, 1]^d \;\middle|\; \forall (u,v) \in E:\; \eta_{uv} - \eta_u + \eta_v \le 1,\; \eta_{uv} - \eta_v + \eta_u \le 1,\; \eta_u + \eta_v - \eta_{uv} \le 1 \right\}.$$

By construction, we have $\mathcal{M} \subseteq \mathrm{LOCAL}(G)$, but points in $\mathrm{LOCAL}(G)$ do not necessarily represent mean parameters of any probability distribution. Another outer bound can be obtained by observing that the second moment matrix $M_1(\eta) = \mathbb{E}_\theta[(1\;\, x)^T (1\;\, x)]$ is positive semi-definite. Therefore we have $\mathcal{M} \subseteq \mathrm{SDEF}_1(G) := \{\eta \in \mathbb{R}^d \mid M_1(\eta) \succeq 0\}$.

The outer bound on the polytope $\mathcal{M}$ can be further tightened by relating it to the cut polytope $\mathrm{CUT}(G)$ (e.g. [3]) and using known relaxations of the cut polytope. The mapping between $\mathcal{M}$ and the cut polytope is done via the suspension graph $G' = (V', E')$ of the graph $G$, where $V' = V \cup \{p+1\}$ and $E' = E \cup \{(v, p+1) \mid v \in V\}$. The suspension graph is created by adding an additional node $p+1$ to the graph $G$ and connecting each node $v \in V$ to the newly created node $p+1$.

Definition 1. Let $w_{uv}$ denote the weight of an edge in $G'$. The linear bijection $\xi_{\mathrm{cut}}$ that maps points $\eta \in \mathcal{M}$ to points $w \in \mathrm{CUT}(G')$ is given by $w_{v, p+1} = \frac{1}{2}(\eta_v + 1)$ for $v \in V$ and $w_{uv} = \frac{1}{2}(1 - \eta_{uv})$ for $(u, v) \in E$.
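To make the mapping $\xi_{\mathrm{cut}}$ of Definition 1 concrete, here is a minimal sketch; the function name and data layout are our own illustration, not code from the paper.

```python
def suspension_cut_weights(eta_node, eta_edge, edges):
    """Map mean parameters eta to edge weights on the suspension graph G'.

    eta_node : length-p sequence of node means eta_v
    eta_edge : dict {(u, v): eta_uv} of pairwise mean parameters
    edges    : list of (u, v) pairs (0-indexed) forming the edge set E of G
    Returns a dict of weights; index p stands for the added suspension node (p+1 in the paper).
    """
    p = len(eta_node)
    w = {}
    for v in range(p):                        # edges connecting each node to the added node
        w[(v, p)] = 0.5 * (eta_node[v] + 1.0)
    for (u, v) in edges:                      # original edges of G
        w[(u, v)] = 0.5 * (1.0 - eta_edge[(u, v)])
    return w
```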
Using a separation algorithm, it is possible to separate a class of inequalities that define the cut polytope and add them to the inequalities defining $\mathrm{LOCAL}(G)$ to tighten the outer bound. Efficient separation algorithms are known for several classes of inequalities [3], but in this paper we use the simplest one, cycle inequalities. Cycle inequalities can be written as

$$\sum_{uv \in C \setminus F} w_{uv} + \sum_{uv \in F} (1 - w_{uv}) \ge 1, \qquad (3)$$

where $C$ is a cycle in $G'$, $F \subseteq C$, and $|F|$ is odd. The class of cycle inequalities can be efficiently separated using Dijkstra's shortest path algorithm in $O(n^2 \log n + n|E|)$ time.

To further relax the problem (1), we approximate the negative entropy $A^*(\eta)$ by a function $B^*(\eta)$ to obtain the approximation $B(\theta)$ to $A(\theta)$:

$$B(\theta) = \sup_{\eta \in \mathrm{OUT}(G)} \{\langle \theta, \eta \rangle - B^*(\eta)\}, \qquad (4)$$

where $\mathrm{OUT}(G)$ is a convex and compact set acting as an outer bound on $\mathcal{M}$. Even though it is not required, many of the existing entropy approximations are strictly convex (e.g. the convexified Bethe entropy [17], the Gaussian-based log-determinant relaxation [19] or the reweighted Kikuchi approximations [20]), which guarantees uniqueness of the solution to (4).

Inference in MRFs. The inference task in MRFs refers to finding or approximating the marginal probabilities (or the mean vector). It can be seen from equation (2) that the log-partition function plays an important role in inference, and a good approximation to it is essential for obtaining good estimates of the marginals. Many known approximate inference algorithms can be explained using the framework described above: they create an outer bound $\mathrm{OUT}(G)$ and use an entropy approximation to estimate the marginals (e.g. the log-determinant relaxation [19], Belief Propagation [21] and tree-reweighted sum-product (TRW) [17], to name a few). Recent work proposed a cutting-plane algorithm [13] that iteratively tightens the outer bound on the polytope $\mathcal{M}$ using cycle inequalities and empirically obtains improved estimates of the marginals.

3 Structure learning and parameter estimation

In this section we address the problem of structure learning and parameter estimation. Given a sample $\mathcal{D}$, we obtain our estimate of the edge set $E$ and the parameters $\theta$ associated with the edges as a maximizer of the $\ell_1$-penalized surrogate log-likelihood:

$$\hat{\theta}^n = \arg\max_{\theta \in \mathbb{R}^d} \; \ell(\theta; \mathcal{D}) - \lambda_n \sum_{uv \in E} |\theta_{uv}| = \arg\max_{\theta \in \mathbb{R}^d} \; \langle \theta, \hat{\eta}^n \rangle - B(\theta) - \lambda_n \sum_{uv \in E} |\theta_{uv}|, \qquad (5)$$

where $\hat{\eta}^n$ denotes the mean parameter estimated from the sample and $\lambda_n$ is a tuning parameter that sets the strength of the penalty. Note that a solution to the problem (5) simultaneously gives the estimate of the edge set $E$ and the parameter vector $\theta$, since if $\hat{\theta}_{uv} = 0$ then the estimated graph does not contain the edge between nodes $u$ and $v$. Since the $\ell_1$ penalty shrinks the edge parameters towards 0, the estimation is biased towards sparse models.
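As a small illustration of the quantities appearing in (5), the empirical mean parameters $\hat{\eta}^n$ are simply sample averages of the potentials, and the estimated edge set is read off the non-zero entries of $\hat{\theta}^n$. The sketch below uses our own function names and a numerical tolerance that is not part of the paper's formulation.

```python
import numpy as np

def empirical_mean_parameters(X, edges):
    """X : (n, p) array with entries in {-1, +1}; edges : list of (u, v) pairs.
    Returns the sample averages of the node and edge potentials."""
    eta_node = X.mean(axis=0)                              # eta_v  = mean of x_v
    eta_edge = {(u, v): float(np.mean(X[:, u] * X[:, v]))  # eta_uv = mean of x_u * x_v
                for (u, v) in edges}
    return eta_node, eta_edge

def edge_set_from_estimate(theta_edge, tol=1e-8):
    """Edges whose estimated parameter is non-zero (tol guards against solver round-off)."""
    return {(u, v) for (u, v), value in theta_edge.items() if abs(value) > tol}
```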
In the remainder of this section we explain our algorithm for efficiently solving (5). Since the problem (5) is convex, one could use, for example, the subgradient method for non-differentiable functions [2]. However, computing the gradient of the log-likelihood, $\frac{\partial \ell(\theta; \mathcal{D})}{\partial \theta} = \hat{\eta}^n - \mathbb{E}_\theta[\phi(x)]$, requires inference over the model with the current values of the parameters. Since the inference is only approximate, the accuracy of the computed gradient depends heavily on the approximation used. Note again that sparsity of the graph does not necessarily imply that inference is tractable. Applying the cutting-plane algorithm [13] for approximate inference could achieve good accuracy; however, it would be computationally prohibitive, since at each iteration of the subgradient algorithm the cutting-plane algorithm would have to be run anew to obtain the mean parameters. Hence, the computational inefficiency stems from the fact that for each parameter $\theta$ we have to compute the corresponding mean parameter $\eta$. We present a way to exploit the structure of the log-determinant relaxation to obtain both $\theta$ and $\eta$, yielding a computationally efficient algorithm. We will use the convenient matrix representation of the parameters:

$$R(\theta) = \begin{pmatrix} 0 & \theta_1 & \theta_2 & \cdots & \theta_p \\ \theta_1 & 0 & \theta_{12} & \cdots & \theta_{1p} \\ \vdots & & & \ddots & \vdots \\ \theta_p & \theta_{1p} & \theta_{2p} & \cdots & 0 \end{pmatrix}.$$

Algorithm 1 describes our proposed method for structure learning.

Algorithm 1 Structure learning with the cutting-plane algorithm
1: $\mathrm{OUT}(G) \leftarrow \mathrm{LOCAL}(G)$
2: repeat
3: $\quad \hat{\theta}^n, \hat{\eta} \leftarrow \max_{\theta} \{\langle \theta, \hat{\eta}^n \rangle - \lambda_n \sum_{vv'} |\theta_{vv'}| - \max_{\eta \in \mathrm{OUT}(G)} \{\langle \theta, \eta \rangle - B^*(\eta)\}\}$
4: $\quad w \leftarrow$ CreateSuspensionGraph($\hat{\eta}$)
5: $\quad C \leftarrow$ SeparateCycleInequalities($w$)
6: $\quad \mathrm{OUT}(G) \leftarrow \mathrm{OUT}(G) \cap C$
7: until $C = \emptyset$

To obtain the solution of (5) efficiently, we use the variational representation of $B(\theta)$ based on the log-determinant relaxation and jointly optimize over $\theta$ and $\eta$. Starting with $\mathrm{LOCAL}(G)$ as an outer bound on $\mathcal{M}$, the algorithm alternates between finding the best parameters and tightening the outer bound $\mathrm{OUT}(G)$ by incorporating the cycle inequalities that are violated by the current mean parameters $\eta$. The algorithm is similar, in spirit, to the cutting-plane algorithm [13] for inference. However, the optimization in line 3 of Algorithm 1 is done over all parameters jointly, which produces some technical challenges. We proceed with a procedure for solving the optimization problem (5).

The formulation in line 3 arose from using the variational form (4) of $B(\theta)$ in the surrogate likelihood (5). The idea behind the log-determinant relaxation [19] is to upper bound the log-partition function using the Gaussian-based entropy approximation:

$$A(\theta) \le \sup_{\eta \in \mathrm{OUT}(G)} \tfrac{1}{2} \log\det(R(\eta) + \mathrm{diag}(m)) + \langle \theta, \eta \rangle,$$

where $m = (1, \tfrac{4}{3}, \ldots, \tfrac{4}{3})^T \in \mathbb{R}^{p+1}$. To make use of this upper bound, we have to rewrite it so that both the parameters $\theta$ and $\eta$ can be extracted from it. Before we rewrite the upper bound, notice that after $k$ iterations of the repeat-until loop in Algorithm 1 we have added $k$ cycle inequalities. It will be useful to rewrite equation (3) in matrix form as $\mathrm{Tr}(A_k R(\eta)) \ge b_k$, where $A_k$ is a symmetric matrix encoding the $k$-th inequality.

Lemma 2. After $k$ iterations of the algorithm, the log-partition relaxation can be written as

$$B(\theta) = \frac{p}{2} \log\left(\frac{e\pi}{2}\right) - \frac{1}{2}(p+1) - \frac{1}{2} \max_{\nu, \alpha \ge 0} \left\{ \nu^T m - \alpha^T b + \log\det\left(-R(\theta) - \mathrm{diag}(\nu) + \sum_{i=1}^k \alpha_i A_i\right) \right\}. \qquad (6)$$

To obtain the mean vector $\eta^*$ corresponding to $\theta$, we take the off-diagonal elements of the matrix $Z = (-R(\theta) - \mathrm{diag}(\nu) + \sum_i \alpha_i A_i)^{-1}$, defined for the optimal $\nu, \alpha$, i.e. $R(\eta^*) = Z - \mathrm{diag}(Z)$.
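The mean-vector extraction described in Lemma 2 is a direct matrix computation; the following sketch transcribes it, with variable names of our own choosing.

```python
import numpy as np

def mean_parameters_from_duals(R_theta, nu, alpha, A_list):
    """Recover R(eta*) from the optimal dual variables, as in Lemma 2.

    R_theta : (p+1, p+1) parameter matrix R(theta)
    nu      : length-(p+1) vector of optimal dual variables
    alpha   : length-k vector of optimal cycle-inequality multipliers
    A_list  : list of k symmetric matrices A_i encoding the added cycle inequalities
    """
    M = -R_theta - np.diag(nu)
    for a_i, A_i in zip(alpha, A_list):
        M = M + a_i * A_i
    Z = np.linalg.inv(M)            # Z = (-R(theta) - diag(nu) + sum_i alpha_i A_i)^(-1)
    return Z - np.diag(np.diag(Z))  # R(eta*) = Z - diag(Z): the off-diagonals are the means
```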
Algorithm 2 Finding the best parameters
1: $W, \alpha \leftarrow$ Initialize()
2: repeat
3: $\quad W \leftarrow \max_{W \in \mathcal{W}} \log\det(W + R(\hat{\eta}^n) + \mathrm{diag}(m)) - \mathrm{Tr}(W \sum_i \alpha_i A_i)$, with $\mathcal{W}$ defined in equation (9)
4: $\quad Y \leftarrow (W + R(\hat{\eta}^n) + \mathrm{diag}(m))^{-1} - \sum_i \alpha_i A_i$
5: $\quad \alpha \leftarrow \max_{\alpha \ge 0} \{\log\det(Y + \sum_i \alpha_i A_i) - \alpha^T b\}$
6: until stopping criterion

Proof. Due to lack of space, we leave out the technical details of the proof and give only a sketch of the main idea. The proof is similar to the proof of Lemma 5 in [1]. We start from equation (4), where the Gaussian-based entropy is used as $B^*(\eta)$, and rewrite it in Lagrangian form with $\alpha_i$ as the Lagrange multiplier for the $i$-th cycle inequality. The lemma follows from rewriting the Lagrangian in its dual form.

Using Lemma 2, we can rewrite the problem in line 3 of Algorithm 1 so that both $\theta$ and $\eta$ can be extracted. Defining $Y := -R(\theta) - \mathrm{diag}(\nu)$ and dropping constant terms, the optimization problem is written as

$$\max_{Y,\, \alpha \ge 0} \; -\mathrm{Tr}\big(Y (R(\hat{\eta}^n) + \mathrm{diag}(m))\big) + \log\det\Big(Y + \sum_i \alpha_i A_i\Big) - \alpha^T b - \lambda_n \sum_{uv} |Y_{uv}|. \qquad (7)$$

With $\alpha = 0$, the problem (7) is identical to the problem analysed in [1, 7], where $Y$ can be found using block coordinate descent. However, due to the terms that arise from the added cycle inequalities, the Lagrange multipliers $\alpha \neq 0$, and we need a different method to obtain the optimal $Y$ and $\alpha$. We propose Algorithm 2 for solving the problem (7). The algorithm iterates between solving for $Y$ and for $\alpha$. For fixed $\alpha$, the dual of (7) is given as

$$\max_{W \in \mathcal{W}} \; \log\det(W + R(\hat{\eta}^n) + \mathrm{diag}(m)) - \mathrm{Tr}\Big(W \sum_i \alpha_i A_i\Big), \qquad (8)$$

where

$$\mathcal{W} = \left\{ W \;\middle|\; W_{uv} = 0 \text{ for } u = 1,\, v = 1, \text{ or } u = v; \quad |W_{uv}| \le \lambda_n \text{ otherwise} \right\}. \qquad (9)$$

To solve for $Y$ we apply the subgradient method, in which we optimize the dual variable $W$ with a projection step onto the box constraint defined in (9). The gradient direction is given as $\nabla_W = (R(\hat{\eta}^n) + \mathrm{diag}(m) + W)^{-1} - \sum_i \alpha_i A_i$, and the step size can be set as $\gamma_t = \gamma_0 / \sqrt{t}$. For fixed $Y$, the problem (7) reduces to the one in line 5 of Algorithm 2. Since the dimension of $\alpha$, corresponding to the number of cycle inequalities, is not large, we can apply any optimization method to find the optimal $\alpha$.

Algorithm 2 is guaranteed to converge, which can be shown from the standard results for block coordinate optimization (e.g. [14]). Using the same results, we could analyze the convergence rate, but that is beyond the scope of this paper. The computational complexity of Algorithm 2 is roughly $O(p^3)$, so the overall complexity of structure learning is $O(p^3)$. The method has lower complexity than the method of [1] and the same complexity as the graphical lasso [7]. A useful trick for boosting the performance of the structure learning algorithm is to use a warm-start strategy for Algorithm 2, i.e. to initialize $Y$ and $\alpha$ to the optimal solution of the previous iteration.

An important advantage of our algorithm over other block-coordinate descent algorithms [1, 7] is that we can readily modify it to the case when the regularization is given as a constraint on the $\ell_1$ ball in which the parameters lie, e.g. $\sum_{uv \in E} |\theta_{uv}| \le C$. In that case, we can solve the primal problem directly, to obtain $Y$, using an efficient algorithm for projection onto the $\ell_1$ ball [6].
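For that constrained variant, a standard sort-based Euclidean projection onto the $\ell_1$ ball can be used; the sketch below is our own illustration in the spirit of [6], not the paper's implementation.

```python
import numpy as np

def project_onto_l1_ball(theta, radius):
    """Euclidean projection of theta onto {x : ||x||_1 <= radius}, with radius > 0."""
    if np.abs(theta).sum() <= radius:
        return theta.copy()
    u = np.sort(np.abs(theta))[::-1]                 # magnitudes, descending
    css = np.cumsum(u)
    k = np.arange(1, len(u) + 1)
    rho = np.nonzero(u * k > css - radius)[0][-1]    # last index kept above the threshold
    tau = (css[rho] - radius) / (rho + 1.0)          # soft-threshold level
    return np.sign(theta) * np.maximum(np.abs(theta) - tau, 0.0)
```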
The algorithm for efficient projection onto the $\ell_1$ ball was also exploited for learning sparse inverse covariance matrices under a Gaussian assumption on the data [5]. Finally, the structure learning algorithm could be extended to non-binary MRFs using a generalization of the log-determinant relaxation [19] and projecting the marginal polytope onto different binary marginal polytopes [13].

4 Asymptotic analysis

In this section, we state our main theoretical result on the convergence rate of the $\ell_1$-penalized log-likelihood estimate. As opposed to the algorithm described in the last section, where we used the log-determinant approximation, our theoretical result is applicable to any strongly convex surrogate for the log-partition function. The asymptotic analysis presented here is high-dimensional in nature, i.e. we analyse the estimate when both the model dimension p and the sample size n tend to infinity. Traditionally, asymptotic analysis is performed for a fixed model, letting the sample size n increase; however, that type of analysis does not reflect the situation that occurs in many real data sets, where the dimensionality p is larger than the sample size n (e.g. gene arrays, fMRI). To gain insight into the behaviour of the estimate, it is therefore important to perform the high-dimensional analysis.

The first question to ask is whether the estimate $\hat{\theta}^n$ converges to $\theta^*$, the true parameter associated with the distribution. Unfortunately, in general the answer is no, since we are using an approximation to the log-partition function. Note that we could obtain such a consistency result if we were willing to assume that the true graph lies in a restricted class of models, e.g. trees, for which the approximation is exact.

The next best thing we can hope for is that our procedure produces an estimate that is close to the best parameter $\hat{\theta}$ in the class defined by the surrogate $B(\theta)$. In general, we cannot tell how far the best parameter in the class, $\hat{\theta}$, is from the true $\theta^*$; however, by letting the dimension of the model p increase with the sample size and using a good approximation $B(\theta)$, we are able to represent an increasing number of distributions and, hence, reduce the approximation error.

Naturally, the next question is whether the estimate $\hat{\theta}^n$ at least converges to $\hat{\theta}$. This convergence would obviously hold if the model were fixed; however, we are dealing with models of increasing dimensionality. In order to guarantee the closeness of the estimate $\hat{\theta}^n$ to the best parameter $\hat{\theta}$, we have to assume that the model is sparse, and we show how the rate of convergence depends explicitly on the sparsity.

We proceed with the main result. Let $\eta^*$ be the true mean vector corresponding to $\theta^*$. Since we use a strictly convex conjugate dual pair $B$ and $B^*$, the gradient mapping $\nabla B$ is one-to-one and onto the relative interior of the constraint set $\mathrm{OUT}(G)$ [11]. In the limit of an infinite amount of data, the asymptotic value of the parameter estimate is given by $\hat{\theta} = \nabla^{-1} B(\eta^*)$. Let $S = \{\beta \mid \hat{\theta}_\beta \neq 0\}$ be the index set of non-zero elements of $\hat{\theta}$, and let $\bar{S}$ be its complement. Note that the set $S$ indexes nodes and edges in the graph and that its size is related to the sparsity of the estimated graph.
Denote by $s = |S|$ the number of non-zero elements. The following theorem gives the asymptotic behaviour of the parameter estimate $\hat{\theta}^n$. To prove the theorem we follow a method of Rothman et al. [12].

Theorem 3. Let $\hat{\theta}^n$ be the maximizer of (5). If $B(\theta)$ is strongly convex and $\lambda_n \asymp \sqrt{\frac{\log p}{n}}$, then

$$\left\| \hat{\theta}^n - \hat{\theta} \right\|_2 = O_P\!\left(\sqrt{\frac{(p+s)\log p}{n}}\right).$$

Proof. Let $G: \mathbb{R}^d \to \mathbb{R}$ be the map defined as

$$G(\delta) := \ell(\hat{\theta} + \delta; \mathcal{D}) - \lambda_n \sum_{uv} |\hat{\theta}_{uv} + \delta_{uv}| - \ell(\hat{\theta}; \mathcal{D}) + \lambda_n \sum_{uv} |\hat{\theta}_{uv}|.$$

Using the Taylor expansion and the fact that $\nabla B(\hat{\theta}) = \eta^*$, we have

$$B(\hat{\theta} + \delta) - B(\hat{\theta}) = \langle \eta^*, \delta \rangle + \frac{1}{2} \delta^T \nabla^2 [B(\hat{\theta} + \alpha\delta)]\, \delta, \quad \text{for some } \alpha \in [0, 1].$$

Now we may write $G$ as

$$G(\delta) = \langle \delta, \hat{\eta}^n - \eta^* \rangle - \frac{1}{2} \delta^T \nabla^2 [B(\hat{\theta} + \alpha\delta)]\, \delta - \lambda_n \sum_{vv'} \left( |\hat{\theta}_{vv'} + \delta_{vv'}| - |\hat{\theta}_{vv'}| \right). \qquad (10)$$

By construction, our estimate $\hat{\delta}^n = \hat{\theta}^n - \hat{\theta}$ maximizes $G$, and we have $G(\hat{\delta}^n) \ge G(0) = 0$. The proof continues by showing that for some $L > 0$ and $\|\delta\|_2 = L$ we have $G(\delta) < 0$, which implies that $\|\hat{\delta}^n\|_2 \le L$ by the concavity of $G$. By appropriately choosing $L$, we show that $\delta$ converges to 0. We proceed by bounding each term in (10). The first term can be written as

$$|\langle \delta, \hat{\eta}^n - \eta^* \rangle| \le \Big| \sum_{uv \in E} \delta_{uv} (\hat{\eta}^n_{uv} - \eta^*_{uv}) \Big| + \Big| \sum_{v \in V} \delta_v (\hat{\eta}^n_v - \eta^*_v) \Big| =: (*) + (**). \qquad (11)$$

Using the union bound and Hoeffding's inequality, with probability tending to 1 we have

$$\max_{uv} |\hat{\eta}^n_{uv} - \eta^*_{uv}| \le C_1 \sqrt{\frac{\log p}{n}},$$

and hence the bound $(*) \le C_1 \sqrt{\frac{\log p}{n}} \sum_{uv} |\delta_{uv}|$. Using the Cauchy-Schwarz inequality and Hoeffding's inequality, the second term is bounded as

$$(**) \le \Big( \sum_v (\hat{\eta}^n_v - \eta^*_v)^2 \Big)^{1/2} \Big( \sum_v \delta_v^2 \Big)^{1/2} \le C_2 \sqrt{\frac{p \log p}{n}} \Big( \sum_v \delta_v^2 \Big)^{1/2}.$$

The Hessian of a strongly convex function is positive definite, so the second term in (10) can be bounded as

$$\mathrm{Tr}\big( \delta^T \nabla^2 B(\hat{\theta} + \alpha\delta)\, \delta \big) \ge C_3 \|\delta\|_2^2,$$

where $C_3$ is a constant that depends on the minimum eigenvalue of the Hessian. Using the triangle inequality on the third term, we have

$$\sum_{vv'} \left( |\hat{\theta}_{vv'} + \delta_{vv'}| - |\hat{\theta}_{vv'}| \right) \ge \sum_{vv' \in \bar{S}} |\delta_{vv'}| - \sum_{vv' \in S} |\delta_{vv'}|.$$

Now we define the constant $L$ as

$$L = M\left( \sqrt{\frac{s \log p}{n}} + \sqrt{\frac{p \log p}{n}} \right) \to 0.$$

Taking $\lambda_n = \frac{C_1}{\epsilon} \sqrt{\frac{\log p}{n}}$, we obtain an upper bound on $G$:

$$G(\delta) \le C_1 \sqrt{\frac{\log p}{n}} \Big(1 - \frac{1}{\epsilon}\Big) \sum_{uv \in \bar{S}} |\delta_{uv}| + C_1 \sqrt{\frac{\log p}{n}} \Big(1 + \frac{1}{\epsilon}\Big) \sum_{uv \in S} |\delta_{uv}| + C_2 \sqrt{\frac{p \log p}{n}} \Big( \sum_v \delta_v^2 \Big)^{1/2} - C_3 \|\delta\|_2^2.$$

The first term is negative for small $\epsilon$, so we can remove it from the upper bound. Using the fact that $\sum_{vv' \in S} |\delta_{vv'}| \le \sqrt{s} \big( \sum_{vv' \in S} \delta_{vv'}^2 \big)^{1/2}$, the upper bound becomes

$$G(\delta) \le \Big( \frac{C_1(1+\epsilon)}{\epsilon M} - C_3 \Big) \Big( \sum_{vv'} \delta_{vv'}^2 \Big)^{1/2} + \Big( \frac{C_2}{M} - C_3 \Big) \Big( \sum_v \delta_v^2 \Big)^{1/2} < 0$$

for sufficiently large $M$, and the theorem follows.

5 Experimental Results

In this section we compare the performance of our estimation method for sparse graphs with the methods of Banerjee et al. [1] and Wainwright et al. [15] on simulated data. In order to assess the performance, we compare speed, accuracy of structure selection, and accuracy of the estimated parameters. The method of Wainwright et al. [15] produces two estimates, $\tilde{\theta}_{uv}$ and $\tilde{\theta}_{vu}$, for each edge parameter. There are two ways in which we can symmetrize the solution:

$$\hat{\theta}_{uv} = \begin{cases} \tilde{\theta}_{uv} & \text{if } |\tilde{\theta}_{uv}| < |\tilde{\theta}_{vu}| \\ \tilde{\theta}_{vu} & \text{if } |\tilde{\theta}_{uv}| \ge |\tilde{\theta}_{vu}| \end{cases} \;\; \text{("Wainwright min")}, \qquad \hat{\theta}_{uv} = \begin{cases} \tilde{\theta}_{uv} & \text{if } |\tilde{\theta}_{uv}| > |\tilde{\theta}_{vu}| \\ \tilde{\theta}_{vu} & \text{if } |\tilde{\theta}_{uv}| \le |\tilde{\theta}_{vu}| \end{cases} \;\; \text{("Wainwright max")}.$$
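The two symmetrization rules amount to keeping, for each edge, the estimate of smaller or larger magnitude; a one-function sketch with our own naming:

```python
def symmetrize_edge(theta_uv, theta_vu, rule="min"):
    """Combine the two neighborhood estimates for one edge.

    rule="min" keeps the estimate of smaller magnitude ("Wainwright min"),
    rule="max" keeps the estimate of larger magnitude ("Wainwright max");
    ties go to theta_vu, as in the definitions above.
    """
    if rule == "min":
        return theta_uv if abs(theta_uv) < abs(theta_vu) else theta_vu
    return theta_uv if abs(theta_uv) > abs(theta_vu) else theta_vu
```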
We have implemented the method of Wainwright et al. [15] using a coordinate descent algorithm for logistic regression [8]. To compare with the method of Banerjee et al. [1], we used the "COVSEL" package available from the authors' website.

We use three types of graphs in the experiments: (a) 4-nearest-neighbor grid models, (b) sparse random graphs, and (c) sparse random graphs with dense subgraphs. The grid model represents a simple sparse graph that does not allow exact inference due to its large tree width. To create a random sparse graph, we choose a total number of edges and add them between random pairs of nodes, taking into account the maximum node degree. Random graphs with dense subgraphs are globally sparse, i.e. they have few edges, but there are local subgraphs that are very dense, with strong interactions between nodes. To generate them, we first create the dense components and then randomly add edges between different components. For a given graph, we assign each node a parameter $\theta_v \sim U[-1, 1]$ and each edge a parameter $\theta_{uv} \sim U[-\xi, \xi]$, where $\xi$ is the coupling strength. For a given distribution $P_{\theta^*}$ of the model, we generate random data sets $\{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}$ of i.i.d. points using Gibbs sampling (a small sampling sketch is given at the end of this subsection). Every experimental result is averaged over 50 runs.

We first comment on the speed of convergence of the methods. We decided not to plot speed comparisons because of the use of different programming languages; however, from our limited experience we observe that the method of Wainwright et al. [15] is the fastest, and its running time does not depend much on the underlying sparsity of the graph. Our method compared favorably to the method of Banerjee et al. [1]; however, increasing the density of the graph or increasing the coupling strength $\xi$ resulted in an increase of the number of added cycle inequalities needed for the approximation, and in slower convergence. From these comments on speed, we can suspect that there is a trade-off between speed and accuracy.

Next, we compare the accuracy of edge selection. For this experiment we create random sparse graphs with p = 50 nodes and 100 edges, with coupling strength $\xi = 0.5$. We then vary the sample size n from 100 to 1000. Figure 1 shows the precision and recall of edges included in the graph for a regularization parameter set as $\lambda_n = 2\sqrt{\frac{\log p}{n}}$. We have excluded the method of Banerjee et al. [1] from the figure, since for this experiment the estimated structure was identical. From the plots we can see that our method and "Wainwright min" produce similar results and perform better than "Wainwright max". Similar conclusions can be drawn for the grid model (results not shown). To shed some more light on these results, it is instructive to discuss the speed at which edges are included in the model as a function of the parameter $\lambda_n$. "Wainwright max" includes edges the fastest, and the resulting graph is the densest, which can explain its low precision: many spurious edges are included in the model. "Wainwright min" includes edges most conservatively, while our solution produces graphs whose density falls between the "Wainwright min" and "Wainwright max" estimates.
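A minimal sketch of the Gibbs sampler used to generate the synthetic data described above; the node conditional follows from the Ising factorization, while the burn-in and thinning values are our own illustrative choices, not settings reported in the paper.

```python
import numpy as np

def gibbs_sample_ising(theta_node, theta_edge, n_samples, burn_in=1000, thin=10, seed=0):
    """Approximate i.i.d. samples from an Ising model with x_v in {-1, +1}.

    theta_node : length-p array of node parameters theta_v
    theta_edge : dict {(u, v): theta_uv} of edge parameters
    """
    rng = np.random.default_rng(seed)
    p = len(theta_node)
    nbrs = [[] for _ in range(p)]                 # adjacency lists for the node conditionals
    for (u, v), w in theta_edge.items():
        nbrs[u].append((v, w))
        nbrs[v].append((u, w))
    x = rng.choice(np.array([-1, 1]), size=p)
    samples = []
    for it in range(burn_in + n_samples * thin):
        for v in range(p):
            # P(x_v = +1 | rest) = 1 / (1 + exp(-2 * (theta_v + sum_u theta_uv * x_u)))
            field = theta_node[v] + sum(w * x[u] for u, w in nbrs[v])
            x[v] = 1 if rng.random() < 1.0 / (1.0 + np.exp(-2.0 * field)) else -1
        if it >= burn_in and (it - burn_in) % thin == 0:
            samples.append(x.copy())
    return np.array(samples)
```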
Next, we move on to a harder problem: estimating the structure of graphs that have strong couplings between nodes. For this experiment, we construct a random graph with two dense subgraphs, each a fully connected graph of size 8. The graphs have a total of p = 50 nodes and 100 edges. We vary the coupling strength $\xi$ over the interval [1, 10]. The structure is estimated from a sample of size n = 500, and the results are presented in Figure 2. In this experiment we start to notice a difference between the models estimated by our method and by the method of [1]. As the coupling strength increases, the mean parameter is no longer captured within the constraints defined by the cycle inequalities, and our method has to add the violated cycle inequalities to the constraint set $\mathrm{OUT}(G)$, which produces a different estimate.

The final set of experiments measures how well the estimated model fits the observed data, for which purpose we use the surrogate log-likelihood. While the true measure of fit should be the log-likelihood, it is not possible to use it because of the problem of evaluating the log-partition function. Furthermore, in practice we always use the surrogate log-likelihood when choosing a model from data, so our experiments can be justified.

Figure 1: Comparing edge recovery on a sparse random graph: p = 50, ξ = 0.5, N_edges = 100. Panels (a) recall and (b) precision versus the sample size n, for "Wainwright min", our method, and "Wainwright max".

Figure 2: Comparing edge recovery on a graph with dense subgraphs: p = 50, n = 500, N_edges = 100. Panels (a) recall and (b) precision versus the coupling strength ξ, for "Wainwright min", our method, Banerjee, and "Wainwright max".

Figure 3 shows the fit for data generated from the grid model. As in the estimation of the structure, for this sparse model there is no difference between our method and the method of Banerjee et al. [1]. One explanation of this result is that the outer bound on the polytope $\mathcal{M}$, implicitly defined through the log-determinant acting as a log-barrier, captures the estimated mean parameter and all the cycle inequalities are satisfied. Next, similarly to the structure estimation experiment, we generate a graph with dense subgraphs and present the results in Figure 3. Again, we can see that our method starts to perform better as we increase the strength of the couplings. In Figure 3 we also plot the fit for parameters estimated by the method of Wainwright et al. [15]; however, since that method uses a different pseudo-likelihood, it performs worse than the other two methods.

To summarize, we have shown that on the task of estimating the structure our method performs similarly to the method of Banerjee et al. [1], and that both methods estimate a structure whose density falls between the "Wainwright min" and "Wainwright max" estimates. When measuring how well the learned models fit the data, we observe that our method outperforms the method of Banerjee et al. [1] when the model has dense subgraphs and strong interactions between nodes.
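For completeness, the edge-recovery precision and recall reported in Figures 1 and 2 can be computed as follows (a sketch with our own function names):

```python
def edge_precision_recall(estimated_edges, true_edges):
    """Precision and recall of recovered edges; inputs are sets of (u, v) pairs with u < v."""
    est, true = set(estimated_edges), set(true_edges)
    tp = len(est & true)
    precision = tp / len(est) if est else 1.0
    recall = tp / len(true) if true else 1.0
    return precision, recall
```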
Figure 3: Test surrogate log-likelihood versus the coupling strength ξ, with θ_uv ∈ [−ξ, ξ], for "Wainwright min", our method, Banerjee, and "Wainwright max". (a) Sparse grid, p = 50, n = 500; (b) graph with dense subgraphs, p = 50, n = 200, N_edges = 100.

6 Conclusion

In this paper we have presented a method for jointly learning the structure and the parameters of an undirected MRF from data. Our method is useful when there are strong correlations between nodes in the graph or when the true graph is not too sparse. If the true graph is very sparse, our algorithm efficiently finds the estimate without performing the cutting-plane step. We showed how to incorporate the class of cycle inequalities into the algorithm; these are particularly valuable as they can be separated efficiently and added as needed to improve the solution. Furthermore, our algorithm can be efficiently combined with the algorithm for projection onto the $\ell_1$ ball in cases when the parameters are constrained to lie in the $\ell_1$ ball. We have analyzed the convergence rate of an estimate based on maximizing a penalized surrogate likelihood in high-dimensional settings. We have given a rate that depends on the number of non-zero elements of the best parameter for the surrogate likelihood. Such analysis provides insight into the performance of the method in the high-dimensional setting, when both p and n are allowed to grow.

7 Acknowledgements

This material is based upon work supported by an NSF CAREER Award to EPX under grant No. DBI-0546594, and NSF grant IIS-0713379. EPX is also supported by an Alfred P. Sloan Research Fellowship in Computer Science.

References

[1] O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation. J. Mach. Learn. Res., 9:485-516, 2008.
[2] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, September 1999.
[3] M. M. Deza and M. Laurent. Geometry of Cuts and Metrics (Algorithms and Combinatorics). Springer, June 1997.
[4] M. Drton and M. D. Perlman. Model selection for Gaussian concentration graphs. Biometrika, 91(3):591-602, 2004.
[5] J. Duchi, S. Gould, and D. Koller. Projected subgradient methods for learning sparse Gaussians. In Proceedings of the Twenty-Fourth Conference on Uncertainty in AI (UAI), 2008.
[6] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the l1-ball for learning in high dimensions. pages 272-279, 2008.
[7] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, page kxm045, 2007.
[8] J. Friedman, T. Hastie, and R. Tibshirani. Regularized paths for generalized linear models via coordinate descent. Manuscript, 2008.
[9] S.-I. Lee, V. Ganapathi, and D. Koller. Efficient structure learning of Markov networks using l1-regularization. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 817-824. MIT Press, Cambridge, MA, 2007.
[10] N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. Annals of Statistics, 34:1436, 2006.
[11] T. R. Rockafellar.
Convex Analysis, volume 28. Princeton University Press, Princeton, N.J., 1970.
[12] A. Rothman, P. Bickel, E. Levina, and J. Zhu. Sparse permutation invariant covariance estimation, 2008.
[13] D. Sontag and T. Jaakkola. New outer bounds on the marginal polytope. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1393-1400. MIT Press, Cambridge, MA, 2008.
[14] P. Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl., 109(3):475-494, 2001.
[15] M. Wainwright, P. Ravikumar, and J. Lafferty. High-dimensional graphical model selection using ℓ1-regularized logistic regression. In B. Schölkopf, J. Platt, and T. Hoffman, editors, NIPS, pages 1465-1472. MIT Press, 2006.
[16] M. J. Wainwright. Estimating the "wrong" graphical model: Benefits in the computation-limited setting. J. Mach. Learn. Res., 7:1829-1859, 2006.
[17] M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky. A new class of upper bounds on the log partition function. IEEE Transactions on Information Theory, 51(7):2313-2335, July 2005.
[18] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference, 2003.
[19] M. J. Wainwright and M. I. Jordan. Log-determinant relaxation for approximate inference in discrete Markov random fields. IEEE Transactions on Signal Processing, 54(6):2099-2109, June 2006.
[20] W. Wiegerinck. Approximations with reweighted generalized belief propagation. In AISTATS, pages 421-428. Society for Artificial Intelligence and Statistics, 2005.
[21] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory, 51(7):2282-2312, 2005.