Improved Estimation of High-dimensional Ising Models
Mladen Kolar, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, mladenk@cs.cmu.edu
Eric P. Xing, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, epxing@cs.cmu.edu

Abstract

We consider the problem of jointly estimating the parameters and the structure of binary-valued Markov Random Fields, in contrast to earlier work that focuses on only one of the two problems. We formulate the problem as maximization of an $\ell_1$-regularized surrogate likelihood, which allows us to find a sparse solution. Our optimization technique efficiently incorporates the cutting-plane algorithm in order to obtain a tighter outer bound on the marginal polytope, which improves both the parameter estimates and the approximation to the marginals. On synthetic data, we compare our algorithm on the two estimation tasks to existing methods. We analyze the method in the high-dimensional setting, where the number of dimensions p is allowed to grow with the number of observations n. The rate of convergence of the estimate is shown to depend explicitly on the sparsity of the underlying graph.

1 Introduction

Undirected graphical models, also known as Markov random fields (MRFs), have been successfully applied in a variety of domains, including natural language processing, computer vision, image analysis, spatial data analysis and statistical physics. In most of these domains, the structure of the graphical model is constructed by hand. However, in certain complex domains we have little expertise about the interactions between features in the data, and we need a method that automatically selects a model that represents the data well. We propose an algorithm that learns a sparse model that fits the data well and has an easily interpretable structure.

Let $X = (X_1, \ldots, X_p)^T$ be a random vector with distribution $P$ that can be represented by an undirected graph $G = (V, E)$. Each vertex in the set $V$ is associated with one component of the random vector $X$. The edge set $E$ of the graph $G$ encodes conditional independence assumptions among subsets of the p-dimensional random vector $X$: $X_i$ is conditionally independent of $X_j$ given the other variables if $(i, j) \notin E$. Let $\mathcal{D} = \{x^{(i)} = (x^{(i)}_1, \ldots, x^{(i)}_p)\}_{i=1}^n$ be an i.i.d. sample. Our task is to estimate the edge set $E$ as well as the parameters of the distribution that generated the sample. The main contribution of our paper is twofold: the development of an efficient algorithm that estimates the undirected graphical model from data, and the high-dimensional asymptotic analysis of its estimates. We find that the rate of convergence of the estimate depends explicitly on the sparsity of the underlying graphical model.

Under the assumption that $X$ is Gaussian, estimation of the graph structure is equivalent to estimation of the zeros in the inverse covariance matrix $\Sigma^{-1}$. Several methods have been proposed that estimate the graph structure from $\mathcal{D}$. [4] proposed a method that tests for partial correlations that are not significantly different from zero, which can be applied when p is small.
In the high-dimensional setting, when p is large, the estimation is more complicated, but under the assumption that the graph is sparse, several methods can be employed successfully for structure recovery (e.g. [10, 1, 7, 12]). If the random variable $X$ is discrete, the problem of structure estimation becomes even harder, as the likelihood cannot be optimized efficiently due to the difficulty of evaluating the log-partition function. Many recently proposed methods make use of $\ell_1$ regularization to learn sparse undirected models. [15] used a pseudo-likelihood approach, based on the local conditional likelihood at each node, which results in a consistent estimate of the structure, but not necessarily of the parameters. [9] proposed to optimize the $\ell_1$-penalized log-likelihood only over the set of active variables and to iteratively enlarge the set until optimality is achieved; however, the resulting graph structure is not necessarily sparse. [1] used the log-determinant relaxation of the log-partition function [19] to obtain a surrogate likelihood that can be easily optimized.

In this paper we are interested in learning both the parameters and the structure of a binary-valued Markov Random Field with pairwise interactions from observed data. An important insight into the problem of structure estimation is that even if a sparse graph is estimated, inference in the model is not necessarily tractable (e.g. inference in a grid, in which each node has only 4 neighbors, is not tractable). Since we are interested in using the estimated model for inference, we use the insight obtained from [16]: we use the same approximate procedure for estimating the parameters and for inference, as the bias introduced in the estimation phase can compensate for the error in the inference phase. Our method is most closely related to [1], as we also propose to optimize the $\ell_1$-penalized surrogate likelihood based on the log-determinant relaxation. However, in order to obtain a better estimate, we efficiently incorporate the cutting-plane algorithm [13] for obtaining a tighter outer bound on the marginal polytope into our optimization procedure. Using a better approximation to the log-partition function is crucial in reducing the approximation error of the estimate. In many applications, the ambient dimensionality of the model p is larger than the sample size n, and the classical asymptotic analysis, where the model is fixed and the sample size increases, does not give good insight into the model behavior. We analyze our estimate in the high-dimensional setting, i.e. we allow the dimension p to increase with the sample size n. In doing so, we allow the procedure to select a more complex model that can represent a larger class of distributions.

2 Preliminaries

In this section we briefly introduce MRFs. We explain why exact inference in discrete MRFs is, in general, intractable, and how to formulate approximate inference as a convex optimization problem. In Section 3, we derive our learning algorithm using the methods presented here.

Markov Random Fields. Consider an undirected graph $G$ with vertex set $V = \{1, 2, \ldots, p\}$ and edge set $E \subseteq V \times V$. A Markov random field consists of a random vector $X = (X_1, \ldots, X_p)^T \in \mathcal{X}^p$, where the random variable $X_s$ is associated with vertex $s \in V$.
In this paper we consider binary pairwise MRFs, the Ising model, in which the probability distribution factorizes as $P(x) = \exp(\langle \theta, \phi(x) \rangle - A(\theta))$ with $X \in \{-1, 1\}^p$. Here $\langle \theta, \phi(x) \rangle$ denotes the dot product between the parameter vector $\theta \in \mathbb{R}^d$ and the potentials $\phi(x) \in \mathbb{R}^d$, and $A(\theta) = \log \sum_{x \in \mathcal{X}^p} \exp(\langle \theta, \phi(x) \rangle)$ is the log-partition function. Since we are considering binary pairwise MRFs, the potentials are functions over nodes and edges of the form $\phi(x_v) = x_v$ and $\phi(x_u, x_v) = x_u x_v$. For future use, we introduce shorthand notation for the mean parameters, $\eta_v = \mathbb{E}_\theta[\phi(x_v)]$ and $\eta_{vv'} = \mathbb{E}_\theta[\phi(x_v, x_{v'})]$.

Log-partition function. Evaluating the log-partition function involves summing over exponentially many terms and, in general, is intractable. The log-partition function can be expressed as an optimization problem, in a variational formulation, using its Fenchel-Legendre conjugate dual $A^*(\eta)$:

$$A(\theta) = \sup_{\eta \in \mathcal{M}} \{\langle \theta, \eta \rangle - A^*(\eta)\}, \qquad (1)$$

where $\mathcal{M} := \{\eta \in \mathbb{R}^d \mid \exists\, p(X) \text{ s.t. } \eta = \mathbb{E}[\phi(x)]\}$ is the set of realizable mean parameters $\eta$, and the dual function $A^*(\eta) = -H(p(x; \eta(\theta)))$ equals the negative entropy of the distribution parametrized by the mean parameters $\eta$. For each parameter $\theta$ there is a corresponding mean parameter $\eta \in \mathcal{M}$ that maximizes (1). The relation is given as

$$\eta = \nabla A(\theta) = \mathbb{E}_\theta[\phi(x)]. \qquad (2)$$

[18, 19] list other properties of the log-partition function.

Log-partition relaxations. The log-partition function written in equation (1) defines an optimization problem restricted to the set $\mathcal{M}$. Since the set $\mathcal{M}$ is a polytope, it can be represented as an intersection of a finite number of hyperplanes; however, the number of hyperplanes needed to describe the set $\mathcal{M}$ grows exponentially with the number of nodes p. It is important to find an outer bound on $\mathcal{M}$ that can be easily characterized and is as tight as possible. One outer bound can be obtained using the set of points that satisfy local consistency conditions:

$$\mathrm{LOCAL}(G) := \left\{ \eta \in [-1, 1]^d \;\middle|\; \forall (u,v) \in E:\; \eta_{uv} - \eta_u + \eta_v \le 1,\; \eta_{uv} - \eta_v + \eta_u \le 1,\; \eta_u + \eta_v - \eta_{uv} \le 1 \right\}.$$

By construction, we have $\mathcal{M} \subseteq \mathrm{LOCAL}(G)$, but points in $\mathrm{LOCAL}(G)$ do not necessarily represent mean parameters of any probability distribution. Another outer bound can be obtained by observing that the second moment matrix $M_1(\eta) = \mathbb{E}_\theta[(1\;\, x)^T (1\;\, x)]$ is positive semi-definite. Therefore we have $\mathcal{M} \subseteq \mathrm{SDEF}_1(G) := \{\eta \in \mathbb{R}^d \mid M_1(\eta) \succeq 0\}$.

The outer bound on the polytope $\mathcal{M}$ can be further tightened by relating it to the cut polytope $\mathrm{CUT}(G)$ (e.g. [3]) and using known relaxations of the cut polytope. The mapping between $\mathcal{M}$ and the cut polytope is done via the suspension graph $G' = (V', E')$ of the graph $G$, where $V' = V \cup \{p+1\}$ and $E' = E \cup \{(v, p+1) \mid v \in V\}$. The suspension graph is created by adding an additional node $p+1$ to the graph $G$ and connecting each node $v \in V$ to the newly created node $p+1$.

Definition 1. Let $w_{uv}$ denote the weight of an edge in $G'$. The linear bijection $\xi_{\mathrm{cut}}$ that maps points $\eta \in \mathcal{M}$ to points $w \in \mathrm{CUT}(G')$ is given by $w_{v, p+1} = \frac{1}{2}(\eta_v + 1)$ for $v \in V$ and $w_{uv} = \frac{1}{2}(1 - \eta_{uv})$ for $(u, v) \in E$.
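To make the mapping $\xi_{\mathrm{cut}}$ of Definition 1 concrete, here is a minimal sketch; the function name and data layout are our own illustration, not code from the paper.

```python
def suspension_cut_weights(eta_node, eta_edge, edges):
    """Map mean parameters eta to edge weights on the suspension graph G'.

    eta_node : length-p sequence of node means eta_v
    eta_edge : dict {(u, v): eta_uv} of pairwise mean parameters
    edges    : list of (u, v) pairs (0-indexed) forming the edge set E of G
    Returns a dict of weights; index p stands for the added suspension node (p+1 in the paper).
    """
    p = len(eta_node)
    w = {}
    for v in range(p):                        # edges connecting each node to the added node
        w[(v, p)] = 0.5 * (eta_node[v] + 1.0)
    for (u, v) in edges:                      # original edges of G
        w[(u, v)] = 0.5 * (1.0 - eta_edge[(u, v)])
    return w
```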
Using a separation algorithm, it is possible to separate a class of inequalities that define the cut polytope and add them to the inequalities defining $\mathrm{LOCAL}(G)$ to tighten the outer bound. Efficient separation algorithms are known for several classes of inequalities [3], but in this paper we use the simplest one, cycle inequalities. Cycle inequalities can be written as

$$\sum_{uv \in C \setminus F} w_{uv} + \sum_{uv \in F} (1 - w_{uv}) \ge 1, \qquad (3)$$

where $C$ is a cycle in $G'$, $F \subseteq C$, and $|F|$ is odd. The class of cycle inequalities can be efficiently separated using Dijkstra's shortest path algorithm in $O(n^2 \log n + n|E|)$ time.

To further relax the problem (1), we approximate the negative entropy $A^*(\eta)$ by a function $B^*(\eta)$ to obtain the approximation $B(\theta)$ to $A(\theta)$:

$$B(\theta) = \sup_{\eta \in \mathrm{OUT}(G)} \{\langle \theta, \eta \rangle - B^*(\eta)\}, \qquad (4)$$

where $\mathrm{OUT}(G)$ is a convex and compact set acting as an outer bound on $\mathcal{M}$. Even though it is not required, many of the existing entropy approximations are strictly convex (e.g. the convexified Bethe entropy [17], the Gaussian-based log-determinant relaxation [19] or the reweighted Kikuchi approximations [20]), which guarantees uniqueness of the solution to (4).

Inference in MRFs. The inference task in MRFs refers to finding or approximating the marginal probabilities (or the mean vector). It can be seen from equation (2) that the log-partition function plays an important role in inference, and a good approximation to it is essential for obtaining good estimates of the marginals. Many known approximate inference algorithms can be explained using the framework described above: they create an outer bound $\mathrm{OUT}(G)$ and use an entropy approximation to estimate the marginals (e.g. the log-determinant relaxation [19], Belief Propagation [21] and tree-reweighted sum-product (TRW) [17], to name a few). Recent work proposed a cutting-plane algorithm [13] that iteratively tightens the outer bound on the polytope $\mathcal{M}$ using cycle inequalities and empirically obtains improved estimates of the marginals.

3 Structure learning and parameter estimation

In this section we address the problem of structure learning and parameter estimation. Given a sample $\mathcal{D}$, we obtain our estimate of the edge set $E$ and the parameters $\theta$ associated with the edges as a maximizer of the $\ell_1$-penalized surrogate log-likelihood:

$$\hat{\theta}^n = \arg\max_{\theta \in \mathbb{R}^d} \; \ell(\theta; \mathcal{D}) - \lambda_n \sum_{uv \in E} |\theta_{uv}| = \arg\max_{\theta \in \mathbb{R}^d} \; \langle \theta, \hat{\eta}^n \rangle - B(\theta) - \lambda_n \sum_{uv \in E} |\theta_{uv}|, \qquad (5)$$

where $\hat{\eta}^n$ denotes the mean parameter estimated from the sample and $\lambda_n$ is a tuning parameter that sets the strength of the penalty. Note that a solution to the problem (5) simultaneously gives the estimate of the edge set $E$ and the parameter vector $\theta$, since if $\hat{\theta}_{uv} = 0$ then the estimated graph does not contain the edge between nodes $u$ and $v$. Since the $\ell_1$ penalty shrinks the edge parameters towards 0, the estimation is biased towards sparse models.
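As a small illustration of the quantities appearing in (5), the empirical mean parameters $\hat{\eta}^n$ are simply sample averages of the potentials, and the estimated edge set is read off the non-zero entries of $\hat{\theta}^n$. The sketch below uses our own function names and a numerical tolerance that is not part of the paper's formulation.

```python
import numpy as np

def empirical_mean_parameters(X, edges):
    """X : (n, p) array with entries in {-1, +1}; edges : list of (u, v) pairs.
    Returns the sample averages of the node and edge potentials."""
    eta_node = X.mean(axis=0)                              # eta_v  = mean of x_v
    eta_edge = {(u, v): float(np.mean(X[:, u] * X[:, v]))  # eta_uv = mean of x_u * x_v
                for (u, v) in edges}
    return eta_node, eta_edge

def edge_set_from_estimate(theta_edge, tol=1e-8):
    """Edges whose estimated parameter is non-zero (tol guards against solver round-off)."""
    return {(u, v) for (u, v), value in theta_edge.items() if abs(value) > tol}
```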
In the remainder of this section we explain our algorithm for efficiently solving (5). Since the problem (5) is convex, one could use, for example, the subgradient method for non-differentiable functions [2]. However, computing the gradient of the log-likelihood, $\frac{\partial \ell(\theta; \mathcal{D})}{\partial \theta} = \hat{\eta}^n - \mathbb{E}_\theta[\phi(x)]$, requires inference over the model with the current values of the parameters. Since the inference is only approximate, the accuracy of the computed gradient depends heavily on the approximation used. Note again that sparsity of the graph does not necessarily imply that inference is tractable. Applying the cutting-plane algorithm [13] for approximate inference could achieve good accuracy; however, it would be computationally prohibitive, since at each iteration of the subgradient algorithm the cutting-plane algorithm would have to be run anew to obtain the mean parameters. Hence, the computational inefficiency stems from the fact that for each parameter $\theta$ we have to compute the corresponding mean parameter $\eta$. We present a way to exploit the structure of the log-determinant relaxation to obtain both $\theta$ and $\eta$, yielding a computationally efficient algorithm. We will use the convenient matrix representation of the parameters:

$$R(\theta) = \begin{pmatrix} 0 & \theta_1 & \theta_2 & \cdots & \theta_p \\ \theta_1 & 0 & \theta_{12} & \cdots & \theta_{1p} \\ \vdots & & & \ddots & \vdots \\ \theta_p & \theta_{1p} & \theta_{2p} & \cdots & 0 \end{pmatrix}.$$

Algorithm 1 describes our proposed method for structure learning.

Algorithm 1 Structure learning with the cutting-plane algorithm
1: $\mathrm{OUT}(G) \leftarrow \mathrm{LOCAL}(G)$
2: repeat
3: $\quad \hat{\theta}^n, \hat{\eta} \leftarrow \max_{\theta} \{\langle \theta, \hat{\eta}^n \rangle - \lambda_n \sum_{vv'} |\theta_{vv'}| - \max_{\eta \in \mathrm{OUT}(G)} \{\langle \theta, \eta \rangle - B^*(\eta)\}\}$
4: $\quad w \leftarrow$ CreateSuspensionGraph($\hat{\eta}$)
5: $\quad C \leftarrow$ SeparateCycleInequalities($w$)
6: $\quad \mathrm{OUT}(G) \leftarrow \mathrm{OUT}(G) \cap C$
7: until $C = \emptyset$

To obtain the solution of (5) efficiently, we use the variational representation of $B(\theta)$ based on the log-determinant relaxation and jointly optimize over $\theta$ and $\eta$. Starting with $\mathrm{LOCAL}(G)$ as an outer bound on $\mathcal{M}$, the algorithm alternates between finding the best parameters and tightening the outer bound $\mathrm{OUT}(G)$ by incorporating the cycle inequalities that are violated by the current mean parameters $\eta$. The algorithm is similar, in spirit, to the cutting-plane algorithm [13] for inference. However, the optimization in line 3 of Algorithm 1 is done over all parameters jointly, which produces some technical challenges. We proceed with a procedure for solving the optimization problem (5).

The formulation in line 3 arose from using the variational form (4) of $B(\theta)$ in the surrogate likelihood (5). The idea behind the log-determinant relaxation [19] is to upper bound the log-partition function using the Gaussian-based entropy approximation:

$$A(\theta) \le \sup_{\eta \in \mathrm{OUT}(G)} \tfrac{1}{2} \log\det(R(\eta) + \mathrm{diag}(m)) + \langle \theta, \eta \rangle,$$

where $m = (1, \tfrac{4}{3}, \ldots, \tfrac{4}{3})^T \in \mathbb{R}^{p+1}$. To make use of this upper bound, we have to rewrite it so that both the parameters $\theta$ and $\eta$ can be extracted from it. Before we rewrite the upper bound, notice that after $k$ iterations of the repeat-until loop in Algorithm 1 we have added $k$ cycle inequalities. It will be useful to rewrite equation (3) in matrix form as $\mathrm{Tr}(A_k R(\eta)) \ge b_k$, where $A_k$ is a symmetric matrix encoding the $k$-th inequality.

Lemma 2. After $k$ iterations of the algorithm, the log-partition relaxation can be written as

$$B(\theta) = \frac{p}{2} \log\left(\frac{e\pi}{2}\right) - \frac{1}{2}(p+1) - \frac{1}{2} \max_{\nu, \alpha \ge 0} \left\{ \nu^T m - \alpha^T b + \log\det\left(-R(\theta) - \mathrm{diag}(\nu) + \sum_{i=1}^k \alpha_i A_i\right) \right\}. \qquad (6)$$

To obtain the mean vector $\eta^*$ corresponding to $\theta$, we take the off-diagonal elements of the matrix $Z = (-R(\theta) - \mathrm{diag}(\nu) + \sum_i \alpha_i A_i)^{-1}$, defined for the optimal $\nu, \alpha$, i.e. $R(\eta^*) = Z - \mathrm{diag}(Z)$.
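The mean-vector extraction described in Lemma 2 is a direct matrix computation; the following sketch transcribes it, with variable names of our own choosing.

```python
import numpy as np

def mean_parameters_from_duals(R_theta, nu, alpha, A_list):
    """Recover R(eta*) from the optimal dual variables, as in Lemma 2.

    R_theta : (p+1, p+1) parameter matrix R(theta)
    nu      : length-(p+1) vector of optimal dual variables
    alpha   : length-k vector of optimal cycle-inequality multipliers
    A_list  : list of k symmetric matrices A_i encoding the added cycle inequalities
    """
    M = -R_theta - np.diag(nu)
    for a_i, A_i in zip(alpha, A_list):
        M = M + a_i * A_i
    Z = np.linalg.inv(M)            # Z = (-R(theta) - diag(nu) + sum_i alpha_i A_i)^(-1)
    return Z - np.diag(np.diag(Z))  # R(eta*) = Z - diag(Z): the off-diagonals are the means
```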
Algorithm 2 Finding the best parameters
1: $W, \alpha \leftarrow$ Initialize()
2: repeat
3: $\quad W \leftarrow \max_{W \in \mathcal{W}} \log\det(W + R(\hat{\eta}^n) + \mathrm{diag}(m)) - \mathrm{Tr}(W \sum_i \alpha_i A_i)$, with $\mathcal{W}$ defined in equation (9)
4: $\quad Y \leftarrow (W + R(\hat{\eta}^n) + \mathrm{diag}(m))^{-1} - \sum_i \alpha_i A_i$
5: $\quad \alpha \leftarrow \max_{\alpha \ge 0} \{\log\det(Y + \sum_i \alpha_i A_i) - \alpha^T b\}$
6: until stopping criterion

Proof. Due to lack of space, we leave out the technical details of the proof and give only a sketch of the main idea. The proof is similar to the proof of Lemma 5 in [1]. We start from equation (4), where the Gaussian-based entropy is used as $B^*(\eta)$, and rewrite it in Lagrangian form with $\alpha_i$ as the Lagrange multiplier for the $i$-th cycle inequality. The lemma follows from rewriting the Lagrangian in its dual form.

Using Lemma 2, we can rewrite the problem in line 3 of Algorithm 1 so that both $\theta$ and $\eta$ can be extracted. Defining $Y := -R(\theta) - \mathrm{diag}(\nu)$ and dropping constant terms, the optimization problem is written as

$$\max_{Y,\, \alpha \ge 0} \; -\mathrm{Tr}\big(Y (R(\hat{\eta}^n) + \mathrm{diag}(m))\big) + \log\det\Big(Y + \sum_i \alpha_i A_i\Big) - \alpha^T b - \lambda_n \sum_{uv} |Y_{uv}|. \qquad (7)$$

With $\alpha = 0$, the problem (7) is identical to the problem analysed in [1, 7], where $Y$ can be found using block coordinate descent. However, due to the terms that arise from the added cycle inequalities, the Lagrange multipliers $\alpha \neq 0$, and we need a different method to obtain the optimal $Y$ and $\alpha$. We propose Algorithm 2 for solving the problem (7). The algorithm iterates between solving for $Y$ and for $\alpha$. For fixed $\alpha$, the dual of (7) is given as

$$\max_{W \in \mathcal{W}} \; \log\det(W + R(\hat{\eta}^n) + \mathrm{diag}(m)) - \mathrm{Tr}\Big(W \sum_i \alpha_i A_i\Big), \qquad (8)$$

where

$$\mathcal{W} = \left\{ W \;\middle|\; W_{uv} = 0 \text{ for } u = 1,\, v = 1, \text{ or } u = v; \quad |W_{uv}| \le \lambda_n \text{ otherwise} \right\}. \qquad (9)$$

To solve for $Y$ we apply the subgradient method, in which we optimize the dual variable $W$ with a projection step onto the box constraint defined in (9). The gradient direction is given as $\nabla_W = (R(\hat{\eta}^n) + \mathrm{diag}(m) + W)^{-1} - \sum_i \alpha_i A_i$, and the step size can be set as $\gamma_t = \gamma_0 / \sqrt{t}$. For fixed $Y$, the problem (7) reduces to the one in line 5 of Algorithm 2. Since the dimension of $\alpha$, corresponding to the number of cycle inequalities, is not large, we can apply any optimization method to find the optimal $\alpha$.

Algorithm 2 is guaranteed to converge, which can be shown from the standard results for block coordinate optimization (e.g. [14]). Using the same results, we could analyze the convergence rate, but that is beyond the scope of this paper. The computational complexity of Algorithm 2 is roughly $O(p^3)$, so the overall complexity of structure learning is $O(p^3)$. The method has lower complexity than the method of [1] and the same complexity as the graphical lasso [7]. A useful trick for boosting the performance of the structure learning algorithm is to use a warm-start strategy for Algorithm 2, i.e. to initialize $Y$ and $\alpha$ to the optimal solution of the previous iteration.

An important advantage of our algorithm over other block-coordinate descent algorithms [1, 7] is that we can readily modify it to the case when the regularization is given as a constraint on the $\ell_1$ ball in which the parameters lie, e.g. $\sum_{uv \in E} |\theta_{uv}| \le C$. In that case, we can solve the primal problem directly, to obtain $Y$, using an efficient algorithm for projection onto the $\ell_1$ ball [6].
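For that constrained variant, a standard sort-based Euclidean projection onto the $\ell_1$ ball can be used; the sketch below is our own illustration in the spirit of [6], not the paper's implementation.

```python
import numpy as np

def project_onto_l1_ball(theta, radius):
    """Euclidean projection of theta onto {x : ||x||_1 <= radius}, with radius > 0."""
    if np.abs(theta).sum() <= radius:
        return theta.copy()
    u = np.sort(np.abs(theta))[::-1]                 # magnitudes, descending
    css = np.cumsum(u)
    k = np.arange(1, len(u) + 1)
    rho = np.nonzero(u * k > css - radius)[0][-1]    # last index kept above the threshold
    tau = (css[rho] - radius) / (rho + 1.0)          # soft-threshold level
    return np.sign(theta) * np.maximum(np.abs(theta) - tau, 0.0)
```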
The algorithm for efficient projection onto the $\ell_1$ ball was also exploited for learning sparse inverse covariance matrices under a Gaussian assumption on the data [5]. Finally, the structure learning algorithm could be extended to non-binary MRFs using a generalization of the log-determinant relaxation [19] and projecting the marginal polytope onto different binary marginal polytopes [13].

4 Asymptotic analysis

In this section, we state our main theoretical result on the convergence rate of the $\ell_1$-penalized log-likelihood estimate. As opposed to the algorithm described in the last section, where we used the log-determinant approximation, our theoretical result is applicable to any strongly convex surrogate for the log-partition function. The asymptotic analysis presented here is high-dimensional in nature, i.e. we analyse the estimate when both the model dimension p and the sample size n tend to infinity. Traditionally, asymptotic analysis is performed for a fixed model, letting the sample size n increase; however, that type of analysis does not reflect the situation that occurs in many real data sets, where the dimensionality p is larger than the sample size n (e.g. gene arrays, fMRI). To gain insight into the behaviour of the estimate, it is therefore important to perform the high-dimensional analysis.

The first question to ask is whether the estimate $\hat{\theta}^n$ converges to $\theta^*$, the true parameter associated with the distribution. Unfortunately, in general the answer is no, since we are using an approximation to the log-partition function. Note that we could obtain such a consistency result if we were willing to assume that the true graph lies in a restricted class of models, e.g. trees, for which the approximation is exact.

The next best thing we can hope for is that our procedure produces an estimate that is close to the best parameter $\hat{\theta}$ in the class defined by the surrogate $B(\theta)$. In general, we cannot tell how far the best parameter in the class, $\hat{\theta}$, is from the true $\theta^*$; however, by letting the dimension of the model p increase with the sample size and using a good approximation $B(\theta)$, we are able to represent an increasing number of distributions and, hence, reduce the approximation error.

Naturally, the next question is whether the estimate $\hat{\theta}^n$ at least converges to $\hat{\theta}$. This convergence would obviously hold if the model were fixed; however, we are dealing with models of increasing dimensionality. In order to guarantee the closeness of the estimate $\hat{\theta}^n$ to the best parameter $\hat{\theta}$, we have to assume that the model is sparse, and we show how the rate of convergence depends explicitly on the sparsity.

We proceed with the main result. Let $\eta^*$ be the true mean vector corresponding to $\theta^*$. Since we use a strictly convex conjugate dual pair $B$ and $B^*$, the gradient mapping $\nabla B$ is one-to-one and onto the relative interior of the constraint set $\mathrm{OUT}(G)$ [11]. In the limit of an infinite amount of data, the asymptotic value of the parameter estimate is given by $\hat{\theta} = \nabla^{-1} B(\eta^*)$. Let $S = \{\beta \mid \hat{\theta}_\beta \neq 0\}$ be the index set of non-zero elements of $\hat{\theta}$, and let $\bar{S}$ be its complement. Note that the set $S$ indexes nodes and edges in the graph and that its size is related to the sparsity of the estimated graph.
Denote by $s = |S|$ the number of non-zero elements. The following theorem gives the asymptotic behaviour of the parameter estimate $\hat{\theta}^n$. To prove the theorem we follow a method of Rothman et al. [12].

Theorem 3. Let $\hat{\theta}^n$ be the maximizer of (5). If $B(\theta)$ is strongly convex and $\lambda_n \asymp \sqrt{\frac{\log p}{n}}$, then

$$\left\| \hat{\theta}^n - \hat{\theta} \right\|_2 = O_P\!\left(\sqrt{\frac{(p+s)\log p}{n}}\right).$$

Proof. Let $G: \mathbb{R}^d \to \mathbb{R}$ be the map defined as

$$G(\delta) := \ell(\hat{\theta} + \delta; \mathcal{D}) - \lambda_n \sum_{uv} |\hat{\theta}_{uv} + \delta_{uv}| - \ell(\hat{\theta}; \mathcal{D}) + \lambda_n \sum_{uv} |\hat{\theta}_{uv}|.$$

Using the Taylor expansion and the fact that $\nabla B(\hat{\theta}) = \eta^*$, we have

$$B(\hat{\theta} + \delta) - B(\hat{\theta}) = \langle \eta^*, \delta \rangle + \frac{1}{2} \delta^T \nabla^2 [B(\hat{\theta} + \alpha\delta)]\, \delta, \quad \text{for some } \alpha \in [0, 1].$$

Now we may write $G$ as

$$G(\delta) = \langle \delta, \hat{\eta}^n - \eta^* \rangle - \frac{1}{2} \delta^T \nabla^2 [B(\hat{\theta} + \alpha\delta)]\, \delta - \lambda_n \sum_{vv'} \left( |\hat{\theta}_{vv'} + \delta_{vv'}| - |\hat{\theta}_{vv'}| \right). \qquad (10)$$

By construction, our estimate $\hat{\delta}^n = \hat{\theta}^n - \hat{\theta}$ maximizes $G$, and we have $G(\hat{\delta}^n) \ge G(0) = 0$. The proof continues by showing that for some $L > 0$ and $\|\delta\|_2 = L$ we have $G(\delta) < 0$, which implies that $\|\hat{\delta}^n\|_2 \le L$ by the concavity of $G$. By appropriately choosing $L$, we show that $\delta$ converges to 0. We proceed by bounding each term in (10). The first term can be written as

$$|\langle \delta, \hat{\eta}^n - \eta^* \rangle| \le \Big| \sum_{uv \in E} \delta_{uv} (\hat{\eta}^n_{uv} - \eta^*_{uv}) \Big| + \Big| \sum_{v \in V} \delta_v (\hat{\eta}^n_v - \eta^*_v) \Big| =: (*) + (**). \qquad (11)$$

Using the union bound and Hoeffding's inequality, with probability tending to 1 we have

$$\max_{uv} |\hat{\eta}^n_{uv} - \eta^*_{uv}| \le C_1 \sqrt{\frac{\log p}{n}},$$

and hence the bound $(*) \le C_1 \sqrt{\frac{\log p}{n}} \sum_{uv} |\delta_{uv}|$. Using the Cauchy-Schwarz inequality and Hoeffding's inequality, the second term is bounded as

$$(**) \le \Big( \sum_v (\hat{\eta}^n_v - \eta^*_v)^2 \Big)^{1/2} \Big( \sum_v \delta_v^2 \Big)^{1/2} \le C_2 \sqrt{\frac{p \log p}{n}} \Big( \sum_v \delta_v^2 \Big)^{1/2}.$$

The Hessian of a strongly convex function is positive definite, so the second term in (10) can be bounded as

$$\mathrm{Tr}\big( \delta^T \nabla^2 B(\hat{\theta} + \alpha\delta)\, \delta \big) \ge C_3 \|\delta\|_2^2,$$

where $C_3$ is a constant that depends on the minimum eigenvalue of the Hessian. Using the triangle inequality on the third term, we have

$$\sum_{vv'} \left( |\hat{\theta}_{vv'} + \delta_{vv'}| - |\hat{\theta}_{vv'}| \right) \ge \sum_{vv' \in \bar{S}} |\delta_{vv'}| - \sum_{vv' \in S} |\delta_{vv'}|.$$

Now we define the constant $L$ as

$$L = M\left( \sqrt{\frac{s \log p}{n}} + \sqrt{\frac{p \log p}{n}} \right) \to 0.$$

Taking $\lambda_n = \frac{C_1}{\epsilon} \sqrt{\frac{\log p}{n}}$, we obtain an upper bound on $G$:

$$G(\delta) \le C_1 \sqrt{\frac{\log p}{n}} \Big(1 - \frac{1}{\epsilon}\Big) \sum_{uv \in \bar{S}} |\delta_{uv}| + C_1 \sqrt{\frac{\log p}{n}} \Big(1 + \frac{1}{\epsilon}\Big) \sum_{uv \in S} |\delta_{uv}| + C_2 \sqrt{\frac{p \log p}{n}} \Big( \sum_v \delta_v^2 \Big)^{1/2} - C_3 \|\delta\|_2^2.$$

The first term is negative for small $\epsilon$, so we can remove it from the upper bound. Using the fact that $\sum_{vv' \in S} |\delta_{vv'}| \le \sqrt{s} \big( \sum_{vv' \in S} \delta_{vv'}^2 \big)^{1/2}$, the upper bound becomes

$$G(\delta) \le \Big( \frac{C_1(1+\epsilon)}{\epsilon M} - C_3 \Big) \Big( \sum_{vv'} \delta_{vv'}^2 \Big)^{1/2} + \Big( \frac{C_2}{M} - C_3 \Big) \Big( \sum_v \delta_v^2 \Big)^{1/2} < 0$$

for sufficiently large $M$, and the theorem follows.

5 Experimental Results

In this section we compare the performance of our estimation method for sparse graphs with the methods of Banerjee et al. [1] and Wainwright et al. [15] on simulated data. In order to assess the performance, we compare speed, accuracy of structure selection, and accuracy of the estimated parameters. The method of Wainwright et al. [15] produces two estimates, $\tilde{\theta}_{uv}$ and $\tilde{\theta}_{vu}$, for each edge parameter. There are two ways in which we can symmetrize the solution:

$$\hat{\theta}_{uv} = \begin{cases} \tilde{\theta}_{uv} & \text{if } |\tilde{\theta}_{uv}| < |\tilde{\theta}_{vu}| \\ \tilde{\theta}_{vu} & \text{if } |\tilde{\theta}_{uv}| \ge |\tilde{\theta}_{vu}| \end{cases} \;\; \text{("Wainwright min")}, \qquad \hat{\theta}_{uv} = \begin{cases} \tilde{\theta}_{uv} & \text{if } |\tilde{\theta}_{uv}| > |\tilde{\theta}_{vu}| \\ \tilde{\theta}_{vu} & \text{if } |\tilde{\theta}_{uv}| \le |\tilde{\theta}_{vu}| \end{cases} \;\; \text{("Wainwright max")}.$$
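The two symmetrization rules amount to keeping, for each edge, the estimate of smaller or larger magnitude; a one-function sketch with our own naming:

```python
def symmetrize_edge(theta_uv, theta_vu, rule="min"):
    """Combine the two neighborhood estimates for one edge.

    rule="min" keeps the estimate of smaller magnitude ("Wainwright min"),
    rule="max" keeps the estimate of larger magnitude ("Wainwright max");
    ties go to theta_vu, as in the definitions above.
    """
    if rule == "min":
        return theta_uv if abs(theta_uv) < abs(theta_vu) else theta_vu
    return theta_uv if abs(theta_uv) > abs(theta_vu) else theta_vu
```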
We have implemented the method of Wainwright et al. [15] using a coordinate descent algorithm for logistic regression [8]. To compare with the method of Banerjee et al. [1], we used the "COVSEL" package available from the authors' website.

We use three types of graphs in the experiments: (a) 4-nearest-neighbor grid models, (b) sparse random graphs, and (c) sparse random graphs with dense subgraphs. The grid model represents a simple sparse graph that does not allow exact inference due to its large tree width. To create a random sparse graph, we choose a total number of edges and add them between random pairs of nodes, taking into account the maximum node degree. Random graphs with dense subgraphs are globally sparse, i.e. they have few edges, but there are local subgraphs that are very dense, with strong interactions between nodes. To generate them, we first create the dense components and then randomly add edges between different components. For a given graph, we assign each node a parameter $\theta_v \sim U[-1, 1]$ and each edge a parameter $\theta_{uv} \sim U[-\xi, \xi]$, where $\xi$ is the coupling strength. For a given distribution $P_{\theta^*}$ of the model, we generate random data sets $\{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}$ of i.i.d. points using Gibbs sampling (a small sampling sketch is given at the end of this subsection). Every experimental result is averaged over 50 runs.

We first comment on the speed of convergence of the methods. We decided not to plot speed comparisons because of the use of different programming languages; however, from our limited experience we observe that the method of Wainwright et al. [15] is the fastest, and its running time does not depend much on the underlying sparsity of the graph. Our method compared favorably to the method of Banerjee et al. [1]; however, increasing the density of the graph or increasing the coupling strength $\xi$ resulted in an increase of the number of added cycle inequalities needed for the approximation, and in slower convergence. From these comments on speed, we can suspect that there is a trade-off between speed and accuracy.

Next, we compare the accuracy of edge selection. For this experiment we create random sparse graphs with p = 50 nodes and 100 edges, with coupling strength $\xi = 0.5$. We then vary the sample size n from 100 to 1000. Figure 1 shows the precision and recall of edges included in the graph for a regularization parameter set as $\lambda_n = 2\sqrt{\frac{\log p}{n}}$. We have excluded the method of Banerjee et al. [1] from the figure, since for this experiment the estimated structure was identical. From the plots we can see that our method and "Wainwright min" produce similar results and perform better than "Wainwright max". Similar conclusions can be drawn for the grid model (results not shown). To shed some more light on these results, it is instructive to discuss the speed at which edges are included in the model as a function of the parameter $\lambda_n$. "Wainwright max" includes edges the fastest, and the resulting graph is the densest, which can explain its low precision: many spurious edges are included in the model. "Wainwright min" includes edges most conservatively, while our solution produces graphs whose density falls between the "Wainwright min" and "Wainwright max" estimates.
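A minimal sketch of the Gibbs sampler used to generate the synthetic data described above; the node conditional follows from the Ising factorization, while the burn-in and thinning values are our own illustrative choices, not settings reported in the paper.

```python
import numpy as np

def gibbs_sample_ising(theta_node, theta_edge, n_samples, burn_in=1000, thin=10, seed=0):
    """Approximate i.i.d. samples from an Ising model with x_v in {-1, +1}.

    theta_node : length-p array of node parameters theta_v
    theta_edge : dict {(u, v): theta_uv} of edge parameters
    """
    rng = np.random.default_rng(seed)
    p = len(theta_node)
    nbrs = [[] for _ in range(p)]                 # adjacency lists for the node conditionals
    for (u, v), w in theta_edge.items():
        nbrs[u].append((v, w))
        nbrs[v].append((u, w))
    x = rng.choice(np.array([-1, 1]), size=p)
    samples = []
    for it in range(burn_in + n_samples * thin):
        for v in range(p):
            # P(x_v = +1 | rest) = 1 / (1 + exp(-2 * (theta_v + sum_u theta_uv * x_u)))
            field = theta_node[v] + sum(w * x[u] for u, w in nbrs[v])
            x[v] = 1 if rng.random() < 1.0 / (1.0 + np.exp(-2.0 * field)) else -1
        if it >= burn_in and (it - burn_in) % thin == 0:
            samples.append(x.copy())
    return np.array(samples)
```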
Next, we move on to a harder problem: estimating the structure of graphs that have strong couplings between nodes. For this experiment, we construct a random graph with two dense subgraphs, each a fully connected graph of size 8. The graphs have a total of p = 50 nodes and 100 edges. We vary the coupling strength $\xi$ over the interval [1, 10]. The structure is estimated from a sample of size n = 500, and the results are presented in Figure 2. In this experiment we start to notice a difference between the models estimated by our method and by the method of [1]. As the coupling strength increases, the mean parameter is no longer captured within the constraints defined by the cycle inequalities, and our method has to add the violated cycle inequalities to the constraint set $\mathrm{OUT}(G)$, which produces a different estimate.

The final set of experiments measures how well the estimated model fits the observed data, for which purpose we use the surrogate log-likelihood. While the true measure of fit should be the log-likelihood, it is not possible to use it because of the problem of evaluating the log-partition function. Furthermore, in practice we always use the surrogate log-likelihood when choosing a model from data, so our experiments can be justified.

Figure 1: Comparing edge recovery on a sparse random graph: p = 50, ξ = 0.5, N_edges = 100. Panels (a) recall and (b) precision versus the sample size n, for "Wainwright min", our method, and "Wainwright max".

Figure 2: Comparing edge recovery on a graph with dense subgraphs: p = 50, n = 500, N_edges = 100. Panels (a) recall and (b) precision versus the coupling strength ξ, for "Wainwright min", our method, Banerjee, and "Wainwright max".

Figure 3 shows the fit for data generated from the grid model. As in the estimation of the structure, for this sparse model there is no difference between our method and the method of Banerjee et al. [1]. One explanation of this result is that the outer bound on the polytope $\mathcal{M}$, implicitly defined through the log-determinant acting as a log-barrier, captures the estimated mean parameter and all the cycle inequalities are satisfied. Next, similarly to the structure estimation experiment, we generate a graph with dense subgraphs and present the results in Figure 3. Again, we can see that our method starts to perform better as we increase the strength of the couplings. In Figure 3 we also plot the fit for parameters estimated by the method of Wainwright et al. [15]; however, since that method uses a different pseudo-likelihood, it performs worse than the other two methods.

To summarize, we have shown that on the task of estimating the structure our method performs similarly to the method of Banerjee et al. [1], and that both methods estimate a structure whose density falls between the "Wainwright min" and "Wainwright max" estimates. When measuring how well the learned models fit the data, we observe that our method outperforms the method of Banerjee et al. [1] when the model has dense subgraphs and strong interactions between nodes.
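For completeness, the edge-recovery precision and recall reported in Figures 1 and 2 can be computed as follows (a sketch with our own function names):

```python
def edge_precision_recall(estimated_edges, true_edges):
    """Precision and recall of recovered edges; inputs are sets of (u, v) pairs with u < v."""
    est, true = set(estimated_edges), set(true_edges)
    tp = len(est & true)
    precision = tp / len(est) if est else 1.0
    recall = tp / len(true) if true else 1.0
    return precision, recall
```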
Figure 3: Test surrogate log-likelihood versus the coupling strength ξ, with θ_uv ∈ [−ξ, ξ], for "Wainwright min", our method, Banerjee, and "Wainwright max". (a) Sparse grid, p = 50, n = 500; (b) graph with dense subgraphs, p = 50, n = 200, N_edges = 100.

6 Conclusion

In this paper we have presented a method for jointly learning the structure and the parameters of an undirected MRF from data. Our method is useful when there are strong correlations between nodes in the graph or when the true graph is not too sparse. If the true graph is very sparse, our algorithm efficiently finds the estimate without performing the cutting-plane step. We showed how to incorporate the class of cycle inequalities into the algorithm; these are particularly valuable as they can be separated efficiently and added as needed to improve the solution. Furthermore, our algorithm can be efficiently combined with the algorithm for projection onto the $\ell_1$ ball in cases when the parameters are constrained to lie in the $\ell_1$ ball. We have analyzed the convergence rate of an estimate based on maximizing a penalized surrogate likelihood in high-dimensional settings. We have given a rate that depends on the number of non-zero elements of the best parameter for the surrogate likelihood. Such analysis provides insight into the performance of the method in the high-dimensional setting, when both p and n are allowed to grow.

7 Acknowledgements

This material is based upon work supported by an NSF CAREER Award to EPX under grant No. DBI-0546594, and NSF grant IIS-0713379. EPX is also supported by an Alfred P. Sloan Research Fellowship in Computer Science.

References

[1] O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation. J. Mach. Learn. Res., 9:485-516, 2008.
[2] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, September 1999.
[3] M. M. Deza and M. Laurent. Geometry of Cuts and Metrics (Algorithms and Combinatorics). Springer, June 1997.
[4] M. Drton and M. D. Perlman. Model selection for Gaussian concentration graphs. Biometrika, 91(3):591-602, 2004.
[5] J. Duchi, S. Gould, and D. Koller. Projected subgradient methods for learning sparse Gaussians. In Proceedings of the Twenty-Fourth Conference on Uncertainty in AI (UAI), 2008.
[6] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the l1-ball for learning in high dimensions. pages 272-279, 2008.
[7] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, page kxm045, 2007.
[8] J. Friedman, T. Hastie, and R. Tibshirani. Regularized paths for generalized linear models via coordinate descent. Manuscript, 2008.
[9] S.-I. Lee, V. Ganapathi, and D. Koller. Efficient structure learning of Markov networks using l1-regularization. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 817-824. MIT Press, Cambridge, MA, 2007.
[10] N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. Annals of Statistics, 34:1436, 2006.
[11] T. R. Rockafellar.
Convex Analysis, volume 28. Princeton University Press, Princeton, N.J., 1970.
[12] A. Rothman, P. Bickel, E. Levina, and J. Zhu. Sparse permutation invariant covariance estimation, 2008.
[13] D. Sontag and T. Jaakkola. New outer bounds on the marginal polytope. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1393-1400. MIT Press, Cambridge, MA, 2008.
[14] P. Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl., 109(3):475-494, 2001.
[15] M. Wainwright, P. Ravikumar, and J. Lafferty. High-dimensional graphical model selection using ℓ1-regularized logistic regression. In B. Schölkopf, J. Platt, and T. Hoffman, editors, NIPS, pages 1465-1472. MIT Press, 2006.
[16] M. J. Wainwright. Estimating the "wrong" graphical model: Benefits in the computation-limited setting. J. Mach. Learn. Res., 7:1829-1859, 2006.
[17] M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky. A new class of upper bounds on the log partition function. IEEE Transactions on Information Theory, 51(7):2313-2335, July 2005.
[18] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference, 2003.
[19] M. J. Wainwright and M. I. Jordan. Log-determinant relaxation for approximate inference in discrete Markov random fields. IEEE Transactions on Signal Processing, 54(6):2099-2109, June 2006.
[20] W. Wiegerinck. Approximations with reweighted generalized belief propagation. In AISTATS, pages 421-428. Society for Artificial Intelligence and Statistics, 2005.
[21] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory, 51(7):2282-2312, 2005.