Hardness of parameter estimation in graphical models

We consider the problem of learning the canonical parameters specifying an undirected graphical model (Markov random field) from the mean parameters. For graphical models representing a minimal exponential family, the canonical parameters are uniquely determined by the mean parameters, so the problem is feasible in principle. The goal of this paper is to investigate the computational feasibility of this statistical task. Our main result shows that parameter estimation is in general intractable: no algorithm can learn the canonical parameters of a generic pair-wise binary graphical model from the mean parameters in time bounded by a polynomial in the number of variables (unless RP = NP). Indeed, such a result has been believed to be true (see [1]) but no proof was known. Our proof gives a polynomial time reduction from approximating the partition function of the hard-core model, known to be hard, to learning approximate parameters. Our reduction entails showing that the marginal polytope boundary has an inherent repulsive property, which validates an optimization procedure over the polytope that does not use any knowledge of its structure (as required by the ellipsoid method and others).

Authors: Guy Bresler, David Gamarnik, Devavrat Shah (Laboratory for Information and Decision Systems, Department of EECS, and Sloan School of Management, Massachusetts Institute of Technology; {gbresler, gamarnik, devavrat}@mit.edu)

1 Introduction

Graphical models are a powerful framework for succinct representation of complex high-dimensional distributions. As such, they are at the core of machine learning and artificial intelligence, and are used in a variety of applied fields including finance, signal processing, communications, and biology, as well as the modeling of social and other complex networks. In this paper we focus on binary pairwise undirected graphical models, a rich class of models with wide applicability. This is a parametric family of probability distributions, and for the models we consider, the canonical parameters $\theta$ are uniquely determined by the vector $\mu$ of mean parameters, which consist of the node-wise and pairwise marginals.

Two primary statistical tasks pertaining to graphical models are inference and parameter estimation. A basic inference problem is the computation of marginals (or conditional probabilities) given the model, that is, the forward mapping $\theta \mapsto \mu$. Conversely, the backward mapping $\mu \mapsto \theta$ corresponds to learning the canonical parameters from the mean parameters. The backward mapping is defined only for $\mu$ in the marginal polytope $\mathcal{M}$ of realizable mean parameters, and this is important in what follows. The backward mapping captures maximum likelihood estimation of parameters; the study of the statistical properties of maximum likelihood estimation for exponential families is a classical and important subject.

In this paper we are interested in the computational tractability of these statistical tasks. A basic question is whether or not these maps can be computed efficiently (namely, in time polynomial in the problem size). As far as inference goes, it is well known that approximating the forward map (inference) is computationally hard in general.
This was shown by Luby and Vigoda [2] for the hard-core model, a simple pairwise binary graphical model (defined in (2.1)). More recently, remarkably sharp results have been obtained, showing that computing the forward map for the hard-core model is tractable if and only if the system exhibits the correlation decay property [3, 4].

In contrast, to the best of our knowledge, no analogous hardness result exists for the backward mapping (parameter estimation), despite its seeming intractability [1]. Tangentially related hardness results have been previously obtained for the problem of learning the graph structure underlying an undirected graphical model. Bogdanov et al. [5] showed hardness of determining graph structure when there are hidden nodes, and Karger and Srebro [6] showed hardness of finding the maximum likelihood graph with a given treewidth. Computing the backward mapping, in comparison, requires estimation of the parameters when the graph is known.

Our main result, stated precisely in the next section, establishes hardness of approximating the backward mapping for the hard-core model. Thus, despite the problem being statistically feasible, it is computationally intractable. The proof is by reduction, showing that the backward map can be used as a black box to efficiently estimate the partition function of the hard-core model. The reduction, described in Section 4, uses the variational characterization of the log-partition function as a constrained convex optimization over the marginal polytope of realizable mean parameters. The gradient of the function to be minimized is given by the backward mapping, and we use a projected gradient optimization method. Since approximating the partition function of the hard-core model is known to be computationally hard, the reduction implies hardness of approximating the backward map.

The main technical difficulty in carrying out the argument arises because the convex optimization is constrained to the marginal polytope, an intrinsically complicated object. Indeed, even determining membership (or evaluating the projection) to within a crude approximation of the polytope is NP-hard [7]. Nevertheless, we show that it is possible to do the optimization without using any knowledge of the polytope structure, as is normally required by ellipsoid, barrier, or projection methods. To this end, we prove that the polytope boundary has an inherent repulsive property that keeps the iterates inside the polytope without actually enforcing the constraint. The consequence of the boundary repulsion property is stated in Proposition 4.6 of Section 4, which is proved in Section 5.

Our reduction has a close connection to the variational approach to approximate inference [1]. There, the conjugate-dual representation of the log-partition function leads to a relaxed optimization problem defined over a tractable bound for the marginal polytope and with a simple surrogate to the entropy function. What our proof shows is that accurate approximation of the gradient of the entropy obviates the need to relax the marginal polytope.

We mention a related work of Kearns and Roughgarden [8] showing a polynomial-time reduction from inference to determining membership in the marginal polytope.
Note that such a reduction does not establish hardness of parameter estimation: the empirical marginals obtained from samples are guaranteed to be in the marginal polytope, so an efficient algorithm could hypothetically exist for parameter estimation without contradicting the hardness of marginal polytope membership.

After completion of our manuscript, we learned that Montanari [9] has independently and simultaneously obtained similar results showing hardness of parameter estimation in graphical models from the mean parameters. His high-level approach is similar to ours, but the details differ substantially.

2 Main result

In order to establish hardness of learning parameters from marginals for pairwise binary graphical models, we focus on a specific instance of this class of graphical models, the hard-core model. Given a graph $G = (V, E)$ (where $V = \{1, \dots, p\}$), the collection of independent set vectors $\mathcal{I}(G) \subseteq \{0,1\}^V$ consists of vectors $\sigma$ such that $\sigma_i = 0$ or $\sigma_j = 0$ (or both) for every edge $\{i,j\} \in E$. Each vector $\sigma \in \mathcal{I}(G)$ is the indicator vector of an independent set. The hard-core model assigns nonzero probability only to independent set vectors, with

$$P_\theta(\sigma) = \exp\Big(\sum_{i \in V} \theta_i \sigma_i - \Phi(\theta)\Big) \quad \text{for each } \sigma \in \mathcal{I}(G). \tag{2.1}$$

This is an exponential family with vector of sufficient statistics $\phi(\sigma) = (\sigma_i)_{i \in V} \in \{0,1\}^p$ and vector of canonical parameters $\theta = (\theta_i)_{i \in V} \in \mathbb{R}^p$. In the statistical physics literature the model is usually parameterized in terms of the node-wise fugacity (or activity) $\lambda_i = e^{\theta_i}$. The log-partition function

$$\Phi(\theta) = \log \sum_{\sigma \in \mathcal{I}(G)} \exp\Big(\sum_{i \in V} \theta_i \sigma_i\Big)$$

serves to normalize the distribution; note that $\Phi(\theta)$ is finite for all $\theta \in \mathbb{R}^p$. Here and throughout, all logarithms are to the natural base.

The set $\mathcal{M}$ of realizable mean parameters plays a major role in the paper, and is defined as

$$\mathcal{M} = \{\mu \in \mathbb{R}^p \mid \text{there exists a } \theta \text{ such that } \mathbb{E}_\theta[\phi(\sigma)] = \mu\}.$$

For the hard-core model (2.1), the set $\mathcal{M}$ is a polytope equal to the convex hull of the independent set vectors $\mathcal{I}(G)$ and is called the marginal polytope. The marginal polytope's structure can be rather complex, and one indication of this is that the number of half-space inequalities needed to represent $\mathcal{M}$ can be very large, depending on the structure of the graph $G$ underlying the model [10, 11].

The model (2.1) is a regular minimal exponential family, so for each $\mu$ in the interior $\mathcal{M}^\circ$ of the marginal polytope there corresponds a unique $\theta(\mu)$ satisfying the dual matching condition $\mathbb{E}_{\theta(\mu)}[\phi(\sigma)] = \mu$.
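Since (2.1) sums over at most $2^p$ independent set vectors, the forward mapping can be evaluated by brute-force enumeration on very small graphs. The following Python sketch is illustrative only and is not part of the paper's text; the function names and the 4-cycle example are our own assumptions.

```python
from itertools import product
from math import exp, log

def independent_sets(p, edges):
    """All independent set indicator vectors of the graph (brute force, 2^p terms)."""
    return [s for s in product([0, 1], repeat=p)
            if all(s[i] == 0 or s[j] == 0 for i, j in edges)]

def forward_map(theta, edges):
    """Forward mapping theta -> mu for the hard-core model (2.1), by enumeration.
    Returns the vector of node marginals and the log-partition function Phi(theta)."""
    p = len(theta)
    sets = independent_sets(p, edges)
    weights = [exp(sum(theta[i] * s[i] for i in range(p))) for s in sets]
    Z = sum(weights)                                   # Z = e^{Phi(theta)}
    mu = [sum(w * s[i] for w, s in zip(weights, sets)) / Z for i in range(p)]
    return mu, log(Z)

# Example: the 4-cycle at theta = 0 (fugacity 1), uniform over its 7 independent sets.
mu, Phi = forward_map([0.0] * 4, edges=[(0, 1), (1, 2), (2, 3), (0, 3)])
print(mu, Phi)   # each marginal equals 2/7; Phi = log 7
```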
We are concerned with approximation of the backward mapping $\mu \mapsto \theta$, and we use the following notion of approximation.

Definition 2.1. We say that $\hat{y} \in \mathbb{R}$ is a $\delta$-approximation to $y \in \mathbb{R}$ if $y(1-\delta) \le \hat{y} \le y(1+\delta)$. A vector $\hat{v} \in \mathbb{R}^p$ is a $\delta$-approximation to $v \in \mathbb{R}^p$ if each entry $\hat{v}_i$ is a $\delta$-approximation to $v_i$.

We next define the appropriate notion of efficient approximation algorithm.

Definition 2.2. A fully polynomial randomized approximation scheme (FPRAS) for a mapping $f_p : X_p \to \mathbb{R}$ is a randomized algorithm that, for each $\delta > 0$ and input $x \in X_p$, with probability at least $3/4$ outputs a $\delta$-approximation $\hat{f}_p(x)$ to $f_p(x)$, and moreover the running time is bounded by a polynomial $Q(p, \delta^{-1})$.

Our result uses the complexity classes RP and NP, defined precisely in any complexity text (such as [12]). The class RP consists of problems solvable by efficient (randomized polynomial) algorithms, and NP consists of many seemingly difficult problems with no known efficient algorithms. It is widely believed that NP ≠ RP. Assuming this, our result says that there cannot be an efficient approximation algorithm for the backward mapping in the hard-core model (and thus also for the more general class of binary pairwise graphical models). We recall that approximating the backward mapping entails taking a vector $\mu$ as input and producing an approximation of the corresponding vector of canonical parameters $\theta$ as output.

It should be noted that even determining whether a given vector $\mu$ belongs to the marginal polytope $\mathcal{M}$ is known to be an NP-hard problem [7]. However, our result shows that the problem is NP-hard even if the input vector $\mu$ is known a priori to be an element of the marginal polytope $\mathcal{M}$.

Theorem 2.3. Assuming NP ≠ RP, there does not exist an FPRAS for the backward mapping $\mu \mapsto \theta$.

As discussed in the introduction, Theorem 2.3 is proved by showing that the backward mapping can be used as a black box to efficiently estimate the partition function of the hard-core model, known to be hard. This uses the variational characterization of the log-partition function as well as a projected gradient optimization method. Proving validity of the projected gradient method requires overcoming a substantial technical challenge: we show that the iterates remain within the marginal polytope without explicitly enforcing this (in particular, we do not project onto the polytope). The bulk of the paper is devoted to establishing this fact, which may be of independent interest.

In the next section we give necessary background on conjugate duality and the variational characterization, as well as review the result we will use on hardness of computing the log-partition function. The proof of Theorem 2.3 is then given in Section 4.

3 Background

3.1 Exponential families and conjugate duality

We now provide background on exponential families (as can be found in the monograph by Wainwright and Jordan [1]) specialized to the hard-core model (2.1) on a fixed graph $G = (V, E)$. General theory on conjugate duality justifying the statements of this subsection can be found in Rockafellar's book [13].

The basic relationship between the canonical and mean parameters is expressed via conjugate (or Fenchel) duality. The conjugate dual of the log-partition function $\Phi(\theta)$ is

$$\Phi^*(\mu) := \sup_{\theta \in \mathbb{R}^p} \big\{ \langle \mu, \theta \rangle - \Phi(\theta) \big\}.$$

Note that for our model $\Phi(\theta)$ is finite for all $\theta \in \mathbb{R}^p$, and furthermore the supremum is uniquely attained. On the interior $\mathcal{M}^\circ$ of the marginal polytope, $-\Phi^*$ is the entropy function. The log-partition function can then be expressed as

$$\Phi(\theta) = \sup_{\mu \in \mathcal{M}} \big\{ \langle \theta, \mu \rangle - \Phi^*(\mu) \big\}, \tag{3.1}$$

with

$$\mu(\theta) = \arg\max_{\mu \in \mathcal{M}} \big\{ \langle \theta, \mu \rangle - \Phi^*(\mu) \big\}. \tag{3.2}$$

The forward mapping $\theta \mapsto \mu$ is specified by the variational characterization (3.2) or alternatively by the gradient map $\nabla \Phi : \mathbb{R}^p \to \mathcal{M}$.

As mentioned earlier, for each $\mu$ in the interior $\mathcal{M}^\circ$ there is a unique $\theta(\mu)$ satisfying the dual matching condition $\mathbb{E}_{\theta(\mu)}[\phi(\sigma)] = (\nabla \Phi)(\theta(\mu)) = \mu$. For mean parameters $\mu \in \mathcal{M}^\circ$, the backward mapping $\mu \mapsto \theta(\mu)$ to the canonical parameters is given by

$$\theta(\mu) = \arg\max_{\theta \in \mathbb{R}^p} \big\{ \langle \mu, \theta \rangle - \Phi(\theta) \big\},$$

or by the gradient $\nabla \Phi^*(\mu) = \theta(\mu)$.
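On small instances the backward mapping can likewise be computed directly from this variational characterization: $F_\mu(\theta) = \langle \mu, \theta\rangle - \Phi(\theta)$ is concave with gradient $\mu - \mu(\theta)$, so plain gradient ascent converges to $\theta(\mu)$. A minimal sketch of ours, reusing `forward_map` from the sketch above (the step size and iteration count are ad hoc choices):

```python
def backward_map(mu, edges, steps=2000, lr=0.5):
    """Backward mapping mu -> theta(mu) by maximizing the concave function
    F(theta) = <mu, theta> - Phi(theta), whose gradient is mu - mu(theta)."""
    p = len(mu)
    theta = [0.0] * p
    for _ in range(steps):
        m, _ = forward_map(theta, edges)              # current marginals mu(theta)
        theta = [t + lr * (mu[i] - m[i]) for i, t in enumerate(theta)]
    return theta

# Round-trip check on the 4-cycle: theta -> mu -> theta.
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
mu_true, _ = forward_map([0.5, -0.3, 0.2, 0.0], edges)
print([round(t, 3) for t in backward_map(mu_true, edges)])  # ~ [0.5, -0.3, 0.2, 0.0]
```

This brute-force oracle takes time exponential in $p$; the point of the paper is that no polynomial-time analogue with entry-wise approximation guarantees can exist unless RP = NP.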
The latter representation will be the more useful one for us.

3.2 Hardness of inference

We describe an existing result on the hardness of inference and state the corollary we will use. The result says that, subject to widely believed conjectures in computational complexity, no efficient algorithm exists for approximating the partition function of certain hard-core models. Recall that the hard-core model with fugacity $\lambda$ is given by (2.1) with $\theta_i = \ln \lambda$ for each $i \in V$.

Theorem 3.1 ([3, 4]). Suppose $d \ge 3$ and $\lambda > \lambda_c(d) = \frac{(d-1)^{d-1}}{(d-2)^d}$. Assuming NP ≠ RP, there exists no FPRAS for computing the partition function of the hard-core model with fugacity $\lambda$ on regular graphs of degree $d$. In particular, no FPRAS exists when $\lambda = 1$ and $d \ge 6$.

We remark that the source of hardness is the long-range dependence property of the hard-core model for $\lambda > \lambda_c(d)$. It was shown in [14] that for $\lambda < \lambda_c(d)$ the model exhibits decay of correlations and there is an FPRAS for the log-partition function (in fact there is a deterministic approximation scheme as well). We note that a number of hardness results are known for the hard-core and Ising models, including [15, 16, 3, 2, 4, 17, 18, 19]. The result stated in Theorem 3.1 suffices for our purposes. From this section we will need only the following corollary, proved in the Appendix. The proof, standard in the literature, uses the self-reducibility of the hard-core model to express the partition function in terms of marginals computed on subgraphs.

Corollary 3.2. Consider the hard-core model (2.1) on graphs of degree at most $d$ with parameters $\theta_i = 0$ for all $i \in V$. Assuming NP ≠ RP, there exists no FPRAS $\hat{\mu}(\mathbf{0})$ for the vector of marginal probabilities $\mu(\mathbf{0})$, where error is measured entry-wise as per Definition 2.1.

4 Reduction by optimizing over the marginal polytope

In this section we describe our reduction and prove Theorem 2.3. We define polynomial constants

$$\epsilon = p^{-8}, \qquad q = p^5, \qquad s = \Big(\frac{\epsilon^2}{p}\Big)^2, \tag{4.1}$$

which we will leave as $\epsilon$, $q$, and $s$ to clarify the calculations. Also, given the asymptotic nature of the results, we assume that $p$ is larger than a universal constant so that certain inequalities are satisfied.

Proposition 4.1. Fix a graph $G$ on $p$ nodes. Let $\hat{\theta} : \mathcal{M}^\circ \to \mathbb{R}^p$ be a black box giving a $\gamma$-approximation for the backward mapping $\mu \mapsto \theta$ for the hard-core model (2.1). Using $1/\epsilon\gamma^2$ calls to $\hat{\theta}$, and computation bounded by a polynomial in $p, 1/\gamma$, it is possible to produce a $4\gamma p^{7/2}/q\epsilon^2$-approximation $\hat{\mu}(\mathbf{0})$ to the marginals $\mu(\mathbf{0})$ corresponding to all-zero parameters.

We first observe that Theorem 2.3 follows almost immediately.

Proof of Theorem 2.3. A standard median amplification trick (see, e.g., [20]) allows one to decrease the probability $1/4$ of erroneous output by an FPRAS to below $\epsilon\gamma^2/p$ using $O(\log(p/\epsilon\gamma^2))$ function calls. Thus the assumed FPRAS for the backward mapping can be made to give a $\gamma$-approximation $\hat{\theta}$ to $\theta$ on $1/\epsilon\gamma^2$ successive calls, with probability of no erroneous outputs equal to at least $3/4$. By taking $\gamma = \tilde{\gamma} q\epsilon^2 p^{-7/2}/2$ in Proposition 4.1 we get a $\tilde{\gamma}$-approximation to $\mu(\mathbf{0})$ with computation bounded by a polynomial in $p, 1/\tilde{\gamma}$. In other words, the existence of an FPRAS for the mapping $\mu \mapsto \theta$ gives an FPRAS for the marginals $\mu(\mathbf{0})$, and by Corollary 3.2 this is not possible if NP ≠ RP.
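The median amplification step invoked at the start of this proof is standard; for concreteness, here is a minimal sketch of ours, with a hypothetical `estimate` standing for one run of the randomized scheme (for vector-valued outputs the median would be taken coordinate-wise):

```python
import statistics

def amplified(estimate, k):
    """Boost a randomized estimator that is correct with probability >= 3/4:
    run it 2k+1 times and return the median. The median is wrong only if at
    least k+1 individual runs are wrong, which by a Chernoff bound happens
    with probability exponentially small in k."""
    return statistics.median(estimate() for _ in range(2 * k + 1))
```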
We now work towards proving Proposition 4.1, the goal being to estimate the vector of marginals $\mu(\mathbf{0})$ for some fixed graph $G$. The desired marginals are given by the solution to the optimization (3.2) with $\theta = \mathbf{0}$:

$$\mu(\mathbf{0}) = \arg\min_{\mu \in \mathcal{M}} \Phi^*(\mu). \tag{4.2}$$

We know from Section 3 that for $x \in \mathcal{M}^\circ$ the gradient is $\nabla \Phi^*(x) = \theta(x)$; that is, the backward mapping amounts to a first-order (gradient) oracle. A natural approach to solving the optimization problem (4.2) is to use a projected gradient method. For reasons that will become clear later, instead of projecting onto the marginal polytope $\mathcal{M}$, we project onto the shrunken marginal polytope $\mathcal{M}_1 \subset \mathcal{M}$ defined as

$$\mathcal{M}_1 = \{\mu \in \mathcal{M} \cap [q\epsilon, \infty)^p : \mu + \epsilon \cdot e_i \in \mathcal{M} \text{ for all } i\}, \tag{4.3}$$

where $e_i$ is the $i$th standard basis vector. As mentioned before, projecting onto $\mathcal{M}_1$ is NP-hard, and this must therefore be avoided if we are to obtain a polynomial-time reduction. Nevertheless, we temporarily assume that it is possible to do the projection and address this difficulty later. With this in mind, we propose to solve the optimization (4.2) by a projected gradient method with fixed step size $s$,

$$x_{t+1} = P_{\mathcal{M}_1}\big(x_t - s\nabla\Phi^*(x_t)\big) = P_{\mathcal{M}_1}\big(x_t - s\theta(x_t)\big). \tag{4.4}$$

In order for the method (4.4) to succeed, a first requirement is that the optimum is inside $\mathcal{M}_1$. The following lemma is proved in the Appendix.

Lemma 4.2. Consider the hard-core model (2.1) on a graph $G$ with maximum degree $d$ on $p \ge 2^{d+1}$ nodes and canonical parameters $\theta = \mathbf{0}$. Then the corresponding vector of mean parameters $\mu(\mathbf{0})$ is in $\mathcal{M}_1$.

One of the benefits of operating within $\mathcal{M}_1$ is that the gradient is bounded by a polynomial in $p$, and this will allow the optimization procedure to converge in a polynomial number of steps. The following lemma amounts to a rephrasing of Lemmas 5.3 and 5.4 in Section 5, and the proof is omitted.

Lemma 4.3. We have the gradient bound $\|\nabla\Phi^*(x)\|_\infty = \|\theta(x)\|_\infty \le p/\epsilon = p^9$ for any $x \in \mathcal{M}_1$.

Next, we state general conditions under which an approximate projected gradient algorithm converges quickly. Better convergence rates are possible using the strong convexity of $\Phi^*$ (shown in Lemma 4.5 below), but this lemma suffices for our purposes. The proof is standard (see [21] or Theorem 3.1 in [22] for a similar statement) and is given in the Appendix for completeness.

Lemma 4.4 (Projected gradient method). Let $G : C \to \mathbb{R}$ be a convex function defined over a compact convex set $C$ with minimizer $x^* \in \arg\min_{x \in C} G(x)$. Suppose we have access to an approximate gradient oracle $\widehat{\nabla} G(x)$ for $x \in C$ with error bounded as $\sup_{x \in C} \|\widehat{\nabla} G(x) - \nabla G(x)\|_1 \le \delta/2$. Let $L = \sup_{x \in C} \|\widehat{\nabla} G(x)\|$. Consider the projected gradient method $x_{t+1} = P_C(x_t - s\widehat{\nabla} G(x_t))$ starting at $x_1 \in C$ and with fixed step size $s = \delta/2L^2$. After $T = 4\|x_1 - x^*\|^2 L^2/\delta^2$ iterations the average $\bar{x}_T = \frac{1}{T}\sum_{t=1}^T x_t$ satisfies $G(\bar{x}_T) - G(x^*) \le \delta$.

To translate accuracy in approximating the function value $\Phi^*(x^*)$ into accuracy in approximating $x^*$ itself, we use the fact that $\Phi^*$ is strongly convex.
The proof (in the Appendix) uses the equivalence between strong convexity of $\Phi^*$ and strong smoothness of the Fenchel dual $\Phi$, the latter being easy to check. Since we only require the implication of the lemma, we defer the definitions of strong convexity and strong smoothness to the appendix where they are used.

Lemma 4.5. The function $\Phi^* : \mathcal{M}^\circ \to \mathbb{R}$ is $p^{-3/2}$-strongly convex. As a consequence, if $\Phi^*(x) - \Phi^*(x^*) \le \delta$ for $x \in \mathcal{M}^\circ$ and $x^* = \arg\min_{y \in \mathcal{M}^\circ} \Phi^*(y)$, then $\|x - x^*\| \le 2p^{3/2}\delta$.

At this point all the ingredients are in place to show that the updates (4.4) rapidly approach $\mu(\mathbf{0})$, but a crucial difficulty remains to be overcome. The assumed black box $\hat\theta$ for approximating the mapping $\mu \mapsto \theta$ is only defined for $\mu$ inside $\mathcal{M}$, and thus it is not at all obvious how to evaluate the projection onto the closely related polytope $\mathcal{M}_1$. Indeed, as shown in [7], even approximate projection onto $\mathcal{M}$ is NP-hard, and no polynomial-time reduction can require projecting onto $\mathcal{M}_1$ (assuming P ≠ NP).

The goal of the subsequent Section 5 is to prove Proposition 4.6 below, which states that the optimization procedure can be carried out without any knowledge about $\mathcal{M}$ or $\mathcal{M}_1$. Specifically, we show that thresholding coordinates suffices; that is, instead of projecting onto $\mathcal{M}_1$ we may project onto the translated non-negative orthant $[q\epsilon, \infty)^p$. Writing $P_{\ge}$ for this projection, we show that the original projected gradient method (4.4) has identical iterates $x_t$ to the much simpler update rule

$$x_{t+1} = P_{\ge}\big(x_t - s\theta(x_t)\big). \tag{4.5}$$

Proposition 4.6. Choose constants as per (4.1). Suppose $x_1 \in \mathcal{M}_1$, and consider the iterates $x_{t+1} = P_{\ge}(x_t - s\hat\theta(x_t))$ for $t \ge 1$, where $\hat\theta(x_t)$ is a $\gamma$-approximation of $\theta(x_t)$ for all $t \ge 1$. Then $x_t \in \mathcal{M}_1$ for all $t \ge 1$, and thus the iterates are the same using either $P_{\ge}$ or $P_{\mathcal{M}_1}$.

The next section is devoted to the proof of Proposition 4.6. We now complete the reduction.

Proof of Proposition 4.1. We start the gradient update procedure $x_{t+1} = P_{\ge}(x_t - s\hat\theta(x_t))$ at the point $x_1 = (\frac{1}{2p}, \frac{1}{2p}, \dots, \frac{1}{2p})$, which we claim is within $\mathcal{M}_1$ for any graph $G$ for $p = |V|$ large enough. To see this, note that $(\frac1p, \frac1p, \dots, \frac1p)$ is in $\mathcal{M}$, because it is a convex combination (with weight $1/p$ each) of the independent set vectors $e_1, \dots, e_p$. Hence $x_1 + \frac{1}{2p} \cdot e_i \in \mathcal{M}$, and additionally $x_{1,i} = \frac{1}{2p} \ge q\epsilon$, for all $i$.

We establish that $x_t \in \mathcal{M}_1$ for each $t \ge 1$ by induction, having verified the base case $t = 1$ in the preceding paragraph. Let $x_t \in \mathcal{M}_1$ for some $t \ge 1$. At iteration $t$ of the update rule we make a call to the black box $\hat\theta(x_t)$ giving a $\gamma$-approximation to the backward mapping $\theta(x_t)$, compute $x_t - s\hat\theta(x_t)$, and then project onto $[q\epsilon, \infty)^p$. Proposition 4.6 ensures that $x_{t+1} \in \mathcal{M}_1$. Therefore, the update $x_{t+1} = P_{\ge}(x_t - s\hat\theta(x_t))$ is the same as $x_{t+1} = P_{\mathcal{M}_1}(x_t - s\hat\theta(x_t))$.

We can now apply Lemma 4.4 with $G = \Phi^*$, $C = \mathcal{M}_1$, $\delta = 2\gamma p^2/\epsilon$, and $L = \sup_{x \in C} \|\widehat{\nabla} G(x)\|_2 \le \sqrt{p}\,(p/\epsilon) = p^{3/2}/\epsilon$. After $T = 4\|x_1 - x^*\|^2 L^2/\delta^2 \le 4p(p^3/\epsilon^2)/(4\gamma^2 p^4/\epsilon^2) = 1/\gamma^2$ iterations the average $\bar{x}_T = \frac1T \sum_{t=1}^T x_t$ satisfies $\Phi^*(\bar{x}_T) - \Phi^*(x^*) \le \delta$. Lemma 4.5 implies that $\|\bar{x}_T - x^*\| \le 2\delta p^{3/2}$, and since $x^*_i \ge q\epsilon$, we get the entry-wise bound $|\bar{x}_{T,i} - x^*_i| \le 2\delta p^{3/2}\, x^*_i / q\epsilon$ for each $i \in V$. Hence $\bar{x}_T$ is a $4\gamma p^{7/2}/q\epsilon^2$-approximation for $x^*$.
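Putting the pieces together, the reduction of Proposition 4.1 is just the thresholded update (4.5) followed by iterate averaging. The sketch below is schematic only: it reuses `backward_map` from Section 3 as a stand-in for the black box $\hat\theta$, and because the paper's constants (4.1) are far too small to demonstrate anything at toy scale, the step size, floor, and iteration counts are illustrative values of ours.

```python
def estimate_marginals(theta_hat, p, steps, s, floor):
    """Sketch of Proposition 4.1: approximate mu(0) = argmin Phi* by projected
    gradient descent, where the gradient at x is theta(x) (the backward map) and
    the projection onto M1 is replaced by a coordinate-wise floor, as in (4.5)."""
    x = [1.0 / (2 * p)] * p                         # starting point x1 in M1
    avg = [0.0] * p
    for _ in range(steps):
        g = theta_hat(x)                            # black-box gradient of Phi* at x
        x = [max(xi - s * gi, floor) for xi, gi in zip(x, g)]
        avg = [a + xi / steps for a, xi in zip(avg, x)]
    return avg                                      # averaged iterates approximate mu(0)

edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
mu0 = estimate_marginals(lambda x: backward_map(x, edges, steps=500), p=4,
                         steps=200, s=0.05, floor=1e-3)
print([round(m, 3) for m in mu0])                   # approaches (2/7, 2/7, 2/7, 2/7)
```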
5 Proof of Proposition 4.6

In Subsection 5.1 we prove estimates on the parameters $\theta$ corresponding to $\mu$ close to the boundary of $\mathcal{M}_1$, and then in Subsection 5.2 we use these estimates to show that the boundary of $\mathcal{M}_1$ has a certain repulsive property that keeps the iterates inside.

5.1 Bounds on gradient

We start by introducing some helpful notation. For a node $i$, let $N(i) = \{j \in [p] : (i,j) \in E\}$ denote its neighbors. We partition the collection of independent set vectors as $\mathcal{I} = S_i \cup S_{-i} \cup S_i^{\oslash}$, where

$$\begin{aligned}
S_i &= \{\sigma \in \mathcal{I} : \sigma_i = 1\} = \{\text{independent sets containing } i\}, \\
S_{-i} &= \{\sigma - e_i : \sigma \in S_i\} = \{\text{independent sets to which } i \text{ can be added}\}, \\
S_i^{\oslash} &= \{\sigma \in \mathcal{I} : \sigma_j = 1 \text{ for some } j \in N(i)\} = \{\text{independent sets conflicting with } i\}.
\end{aligned}$$

For a collection of independent set vectors $S \subseteq \mathcal{I}$ we write $P(S)$ as shorthand for $P_\theta(\sigma \in S)$ and

$$f(S) = P(S)\cdot e^{\Phi(\theta)} = \sum_{\sigma\in S} \exp\Big(\sum_{j\in V} \theta_j\sigma_j\Big).$$

We can then write the marginal at node $i$ as $\mu_i = P(S_i)$, and since $S_i$, $S_{-i}$, $S_i^{\oslash}$ partition $\mathcal{I}$, the space of all independent sets of $G$,

$$1 = P(S_i) + P(S_{-i}) + P(S_i^{\oslash}).$$

For each $i$ let $\nu_i = P(S_i^{\oslash}) = P(\text{a neighbor of } i \text{ is in } \sigma)$. The following lemma specifies a condition on $\mu_i$ and $\nu_i$ that implies a lower bound on $\theta_i$.

Lemma 5.1. If $\mu_i + \nu_i \ge 1 - \delta$ and $\nu_i \le 1 - \zeta\delta$ for $\zeta > 1$, then $\theta_i \ge \ln(\zeta - 1)$.

Proof. Let $\alpha = e^{\theta_i}$, and observe that $f(S_i) = \alpha f(S_{-i})$. We want to show that $\alpha \ge \zeta - 1$. The first condition $\mu_i + \nu_i \ge 1 - \delta$ implies that

$$f(S_i) + f(S_i^{\oslash}) \ge (1-\delta)\big(f(S_i) + f(S_i^{\oslash}) + f(S_{-i})\big) = (1-\delta)\big(f(S_i) + f(S_i^{\oslash}) + \alpha^{-1}f(S_i)\big),$$

and rearranging gives

$$f(S_i^{\oslash}) + f(S_i) \ge \frac{1-\delta}{\delta}\,\alpha^{-1}f(S_i). \tag{5.1}$$

The second condition $\nu_i \le 1 - \zeta\delta$ reads $f(S_i^{\oslash}) \le (1-\zeta\delta)\big(f(S_i) + f(S_i^{\oslash}) + f(S_{-i})\big)$, or

$$f(S_i^{\oslash}) \le \frac{1-\zeta\delta}{\zeta\delta}\, f(S_i)\big(1 + \alpha^{-1}\big). \tag{5.2}$$

Combining (5.1) and (5.2) and simplifying results in $\alpha \ge \zeta - 1$.
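For the reader's convenience, one way to carry out this last simplification (our expansion; the paper omits it): dividing (5.1) and (5.2) by $f(S_i) > 0$ and chaining the two resulting bounds on $f(S_i^{\oslash})/f(S_i)$ gives

$$\frac{1-\delta}{\delta}\,\alpha^{-1} - 1 \;\le\; \frac{f(S_i^{\oslash})}{f(S_i)} \;\le\; \frac{1-\zeta\delta}{\zeta\delta}\,\big(1+\alpha^{-1}\big),$$

and multiplying the outer inequality through by $\zeta\delta\alpha > 0$ yields

$$\zeta(1-\delta) - \zeta\delta\alpha \;\le\; (1-\zeta\delta)(\alpha+1) \;=\; \alpha + 1 - \zeta\delta\alpha - \zeta\delta,$$

which simplifies to $\zeta \le \alpha + 1$, i.e., $\alpha \ge \zeta - 1$.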
We now use the preceding lemma to show that if a coordinate is close to the boundary of the shrunken marginal polytope $\mathcal{M}_1$, then the corresponding parameter is large.

Lemma 5.2. Let $r$ be a positive real number. If $\mu \in \mathcal{M}_1$ and $\mu + r\epsilon\cdot e_i \notin \mathcal{M}$, then $\theta_i \ge \ln\big(\frac{q}{r} - 1\big)$.

Proof. We would like to apply Lemma 5.1 with $\zeta = q/r$ and $\delta = r\epsilon$, which requires showing that (a) $\nu_i \le 1 - q\epsilon$ and (b) $\mu_i + \nu_i \ge 1 - r\epsilon$. To show (a), note that if $\mu \in \mathcal{M}_1$, then $\mu_i \ge q\epsilon$ by definition of $\mathcal{M}_1$. It follows that $\nu_i \le 1 - \mu_i \le 1 - q\epsilon$.

We now show (b). Since $\mu_i = P(S_i)$, $\nu_i = P(S_i^{\oslash})$, and $1 = P(S_i) + P(S_i^{\oslash}) + P(S_{-i})$, (b) is equivalent to $P(S_{-i}) \le r\epsilon$. We assume that $\mu + r\epsilon\cdot e_i \notin \mathcal{M}$ and suppose for the sake of contradiction that $P(S_{-i}) > r\epsilon$. Writing $\eta_\sigma = P(\sigma)$ for $\sigma \in \mathcal{I}$, so that $\mu = \sum_{\sigma\in\mathcal{I}} \eta_\sigma\cdot\sigma$, we define a new probability measure

$$\eta'_\sigma = \begin{cases} \eta_\sigma + \eta_{\sigma - e_i} & \text{if } \sigma \in S_i, \\ 0 & \text{if } \sigma \in S_{-i}, \\ \eta_\sigma & \text{otherwise.} \end{cases}$$

One can check that $\mu' = \sum_{\sigma\in\mathcal{I}} \eta'_\sigma \sigma$ has $\mu'_j = \mu_j$ for each $j \ne i$ and $\mu'_i = \mu_i + P(S_{-i}) > \mu_i + r\epsilon$. The point $\mu'$, being a convex combination of independent set vectors, must be in $\mathcal{M}$, and hence so must $\mu + r\epsilon\cdot e_i$. But this contradicts the hypothesis and completes the proof of the lemma.

The proofs of the next two lemmas are similar in spirit to Lemma 8 in [23] and are given in the Appendix. The first lemma gives an upper bound on the parameters $(\theta_i)_{i\in V}$ corresponding to an arbitrary point in $\mathcal{M}_1$.

Lemma 5.3. If $\mu + \epsilon\cdot e_i \in \mathcal{M}$, then $\theta_i \le p/\epsilon$. Hence if $\mu \in \mathcal{M}_1$, then $\theta_i \le p/\epsilon$ for all $i$.

The next lemma shows that if a component $\mu_i$ is not too small, the corresponding parameter $\theta_i$ is also not too negative. As before, this allows us to bound from below the parameters corresponding to an arbitrary point in $\mathcal{M}_1$.

Lemma 5.4. If $\mu_i \ge q\epsilon$, then $\theta_i \ge -p/q\epsilon$. Hence if $\mu \in \mathcal{M}_1$, then $\theta_i \ge -p/q\epsilon$ for all $i$.

5.2 Finishing the proof of Proposition 4.6

We sketch the remainder of the proof here; full detail is given in Section D of the Supplement. Starting with an arbitrary $x_t$ in $\mathcal{M}_1$, our goal is to show that $x_{t+1} = P_{\ge}(x_t - s\hat\theta(x_t))$ remains in $\mathcal{M}_1$. The proof will then follow by induction, because our initial point $x_1$ is in $\mathcal{M}_1$ by the hypothesis.

The argument considers separately each hyperplane constraint for $\mathcal{M}$ of the form $\langle h, x\rangle \le 1$. The distance of $x$ from the hyperplane is $1 - \langle h, x\rangle$. Now, the definition of $\mathcal{M}_1$ implies that if $x \in \mathcal{M}_1$, then $x + \epsilon\cdot e_i \in \mathcal{M}$ for all coordinates $i$, and thus $1 - \langle h, x\rangle \ge \epsilon\|h\|_\infty$ for all constraints. We call a constraint $\langle h, x\rangle \le 1$ critical if $1 - \langle h, x\rangle < \epsilon\|h\|_\infty$, and active if $\epsilon\|h\|_\infty \le 1 - \langle h, x\rangle < 2\epsilon\|h\|_\infty$. For $x_t \in \mathcal{M}_1$ there are no critical constraints, but there may be active constraints.

We first show that inactive constraints can at worst become active for the next iterate $x_{t+1}$, which requires only that the step size is not too large relative to the magnitude of the gradient (Lemma 4.3 gives the desired bound). Then we show (using the gradient estimates from Lemmas 5.2, 5.3, and 5.4) that the active constraints have a repulsive property and that $x_{t+1}$ is no closer than $x_t$ to any active constraint, that is, $\langle h, x_{t+1}\rangle \le \langle h, x_t\rangle$. The argument requires care, because the projection $P_{\ge}$ may prevent coordinates $i$ from decreasing despite $x_{t,i} - s\hat\theta_i(x_t)$ being very negative if $x_{t,i}$ is already small. These arguments together show that $x_{t+1}$ remains in $\mathcal{M}_1$, completing the proof.

6 Discussion

This paper addresses the computational tractability of parameter estimation for the hard-core model. Our main result shows hardness of approximating the backward mapping $\mu \mapsto \theta$ to within a small polynomial factor. This is a fairly stringent form of approximation, and it would be interesting to strengthen the result to show hardness even for a weaker form of approximation. A possible goal would be to show that there exists a universal constant $c > 0$ such that approximation of the backward mapping to within a factor $1 + c$ in each coordinate is NP-hard.

Acknowledgments

GB thanks Sahand Negahban for helpful discussions. We also thank Andrea Montanari for sharing his unpublished manuscript [9]. This work was supported in part by NSF grants CMMI-1335155 and CNS-1161964, and by Army Research Office MURI Award W911NF-11-1-0036.

References

[1] M. Wainwright and M. Jordan, "Graphical models, exponential families, and variational inference," Foundations and Trends in Machine Learning, vol. 1, no. 1-2, pp. 1–305, 2008.

[2] M. Luby and E. Vigoda, "Fast convergence of the Glauber dynamics for sampling independent sets," Random Structures and Algorithms, vol. 15, no. 3-4, pp. 229–241, 1999.
[3] A. Sly and N. Sun, "The computational hardness of counting in two-spin models on d-regular graphs," in FOCS, pp. 361–369, IEEE, 2012.

[4] A. Galanis, D. Stefankovic, and E. Vigoda, "Inapproximability of the partition function for the antiferromagnetic Ising and hard-core models," arXiv preprint arXiv:1203.2226, 2012.

[5] A. Bogdanov, E. Mossel, and S. Vadhan, "The complexity of distinguishing Markov random fields," Approximation, Randomization and Combinatorial Optimization, pp. 331–342, 2008.

[6] D. Karger and N. Srebro, "Learning Markov networks: Maximum bounded tree-width graphs," in Symposium on Discrete Algorithms (SODA), pp. 392–401, 2001.

[7] D. Shah, D. N. Tse, and J. N. Tsitsiklis, "Hardness of low delay network scheduling," IEEE Transactions on Information Theory, vol. 57, no. 12, pp. 7810–7817, 2011.

[8] T. Roughgarden and M. Kearns, "Marginals-to-models reducibility," in Advances in Neural Information Processing Systems, pp. 1043–1051, 2013.

[9] A. Montanari, "Computational implications of reducing data to sufficient statistics." Unpublished, 2014.

[10] M. Deza and M. Laurent, Geometry of Cuts and Metrics. Springer, 1997.

[11] G. M. Ziegler, "Lectures on 0/1-polytopes," in Polytopes—Combinatorics and Computation, pp. 1–41, Springer, 2000.

[12] C. H. Papadimitriou, Computational Complexity. John Wiley and Sons Ltd., 2003.

[13] R. T. Rockafellar, Convex Analysis, vol. 28. Princeton University Press, 1997.

[14] D. Weitz, "Counting independent sets up to the tree threshold," in Proceedings of the Thirty-Eighth Annual ACM Symposium on Theory of Computing, pp. 140–149, ACM, 2006.

[15] M. Dyer, A. Frieze, and M. Jerrum, "On counting independent sets in sparse graphs," SIAM Journal on Computing, vol. 31, no. 5, pp. 1527–1541, 2002.

[16] A. Sly, "Computational transition at the uniqueness threshold," in FOCS, pp. 287–296, 2010.

[17] F. Jaeger, D. Vertigan, and D. Welsh, "On the computational complexity of the Jones and Tutte polynomials," Math. Proc. Cambridge Philos. Soc., vol. 108, no. 1, pp. 35–53, 1990.

[18] M. Jerrum and A. Sinclair, "Polynomial-time approximation algorithms for the Ising model," SIAM Journal on Computing, vol. 22, no. 5, pp. 1087–1116, 1993.

[19] S. Istrail, "Statistical mechanics, three-dimensionality and NP-completeness: I. Universality of intractability for the partition function of the Ising model across non-planar surfaces," in STOC, pp. 87–96, ACM, 2000.

[20] M. R. Jerrum, L. G. Valiant, and V. V. Vazirani, "Random generation of combinatorial structures from a uniform distribution," Theoretical Computer Science, vol. 43, pp. 169–188, 1986.

[21] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, vol. 87. Springer, 2004.

[22] S. Bubeck, "Theory of convex optimization for machine learning." Available at http://www.princeton.edu/~sbubeck/pub.html.

[23] L. Jiang, D. Shah, J. Shin, and J. Walrand, "Distributed random access algorithm: Scheduling and congestion control," IEEE Transactions on Information Theory, vol. 56, no. 12, pp. 6182–6207, 2010.

[24] D. P. Bertsekas, Nonlinear Programming. Athena Scientific, 1999.

[25] S. M. Kakade, S. Shalev-Shwartz, and A. Tewari, "Regularization techniques for learning with matrices," J. Mach. Learn. Res., vol. 13, pp. 1865–1890, June 2012.
[26] J. M. Borwein and J. D. Vanderwerff, Convex Functions: Constructions, Characterizations and Counterexamples. No. 109, Cambridge University Press, 2010.

Supplementary Material

A Miscellaneous proofs

A.1 Proof of Corollary 3.2

The proof is standard and uses the self-reducibility of the hard-core model, meaning that conditioning on $\sigma_i = 0$ amounts to removing node $i$ from the graph. Fix a graph $G$ and parameters $\theta = \mathbf{0}$. We show that given an algorithm to approximately compute the marginals for induced subgraphs $H \subseteq G$, it is possible to approximate the partition function $e^{\Phi(\mathbf{0})}$, denoted here by $Z$. We first claim that

$$Z = \prod_{i=1}^{p} \frac{1}{1 - \mu_i(G \setminus [i-1])}. \tag{A.1}$$

The graph $G \setminus [i-1]$ is obtained by removing the nodes labeled $1, 2, \dots, i-1$, and $\mu_i(G\setminus[i-1])$ is the marginal at node $i$ for this graph. We use induction on the number of nodes. The base case with one node is trivial: $Z = 1 + e^0 = 2 = 1/(1-\mu)$. Suppose now that the formula (A.1) holds for graphs on $k$ nodes and that $|V| = k+1$. Let $Z_0$ and $Z_1$ denote the partition function summation restricted to $\sigma_1 = 0$ or $\sigma_1 = 1$, respectively. Thus

$$Z = Z_0 + Z_1 = Z_0\Big(\frac{Z_0 + Z_1}{Z_0}\Big) = \frac{Z_0}{1 - \mu_1}.$$

Now $Z_0$ is the partition function of the new graph obtained by deleting vertex 1, and the inductive assumption proves the formula. From (A.1) we see that in order to compute a $\gamma$-approximation to $Z^{-1}$, it suffices to compute a $\gamma/p$-approximation to each of the marginals. Now for small $\gamma$, a $\gamma$-approximation to $Z^{-1}$ gives a $2\gamma$-approximation to $Z$, and this completes the proof.
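The telescoping identity (A.1) translates directly into a procedure for computing $Z$ from a marginal oracle. A brute-force Python sketch of ours, reusing `forward_map` from Section 2 as an exact stand-in for the oracle (in the reduction the oracle would instead be the assumed FPRAS):

```python
def partition_from_marginals(p, edges, marginal_oracle):
    """Compute Z = e^{Phi(0)} via self-reducibility (A.1): repeatedly remove the
    lowest-numbered node and multiply the factors 1/(1 - mu_i(G \\ [i-1]))."""
    Z = 1.0
    for i in range(p):
        sub_nodes = list(range(i, p))                    # nodes 0..i-1 removed
        sub_edges = [(u, v) for u, v in edges if u >= i and v >= i]
        Z /= 1.0 - marginal_oracle(sub_nodes, sub_edges, i)
    return Z

def exact_oracle(sub_nodes, sub_edges, i):
    """Exact marginal of node i at theta = 0 on the induced subgraph."""
    relabel = {v: k for k, v in enumerate(sub_nodes)}
    mu, _ = forward_map([0.0] * len(sub_nodes),
                        [(relabel[u], relabel[v]) for u, v in sub_edges])
    return mu[relabel[i]]

edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
print(partition_from_marginals(4, edges, exact_oracle))  # 7.0: the 4-cycle has 7 ind. sets
```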
A.2 Proof of Lemma 4.2

We wish to show that $\mu(\mathbf{0}) \in \mathcal{M}_1$ for a graph $G = (V,E)$ of maximum degree $d$ and $p \ge 2^{d+1}$. Consider a particular node $i \in V$ with neighbors $N(i)$, and let $d_i = |N(i)|$ denote its degree. We use the notation $S_i, S_{-i}, S_i^{\oslash}$ defined in Subsection 5.1. A collection of independent set vectors $S \subseteq \mathcal{I}(G)$ is assigned probability $P(S) = |S|/|\mathcal{I}(G)|$ for our choice $\theta = \mathbf{0}$, so it suffices to argue about cardinalities.

We first claim that $|S_i| \ge 2^{-d}|S_i^{\oslash}|$. This follows by observing that each set in $S_i^{\oslash}$ gets mapped to a set in $S_i$ by removing the neighbors $N(i)$ and adding node $i$, and moreover at most $2^d$ sets are mapped to the same set in $S_i$. Next, we note that $|S_i| = |S_{-i}|$, since the removal of node $i$ is a bijection from $S_i$ to $S_{-i}$ and hence they are of the same cardinality. Combining these observations with the fact that $P(S_i) + P(S_{-i}) + P(S_i^{\oslash}) = 1$, we get the estimate $\mu_i = P(S_i) \ge 1/(2^d + 2) \ge 2^{-d-1}$.

Next, we show for each coordinate $i$ that the vector $\mu' = \mu + 2^{-d-1} e_i$ is in $\mathcal{M}$, which will complete the proof that $\mu(\mathbf{0})$ is in $\mathcal{M}_1$. Let $\eta_\sigma = P_{\mathbf{0}}(\sigma)$ denote the probability assigned to $\sigma$ under the distribution with parameters $\theta = \mathbf{0}$, so that $\mu = \sum_{\sigma \in \mathcal{I}(G)} \eta_\sigma \cdot \sigma$. Similarly to the proof of Lemma 5.2, we define a new probability measure

$$\eta'_\sigma = \begin{cases} \eta_\sigma + 2^{-d-1}/|S_i| & \text{if } \sigma \in S_i, \\ \eta_\sigma - 2^{-d-1}/|S_{-i}| & \text{if } \sigma \in S_{-i}, \\ \eta_\sigma & \text{otherwise.} \end{cases}$$

This is a valid probability distribution because $\eta_\sigma = 1/|\mathcal{I}(G)| \ge 2^{-d-1}/|S_{-i}|$ for $\sigma \in S_{-i}$, which is exactly the bound $P(S_{-i}) = \mu_i \ge 2^{-d-1}$ established above. One can check that $\mu' = \sum_{\sigma\in\mathcal{I}} \eta'_\sigma \sigma$ has $\mu'_j = \mu_j$ for each $j \ne i$ and $\mu'_i = \mu_i + 2^{-d-1}$. The point $\mu'$, being a convex combination of independent set vectors, must be in $\mathcal{M}$, and hence so must $\mu + 2^{-d-1} e_i$.

B Proofs for projected gradient method

B.1 Proof of Lemma 4.4

The proof here is a slight modification of the proof of Theorem 3.1 in [22]. Observe first that if $P$ is the projection onto a convex set, then $P$ is a contraction: $\|P(x) - P(y)\|_2 \le \|x - y\|_2$ (cf. Prop. 2.1.3 in [24]). Using the convexity inequality $G(x) - G(x^*) \le \nabla G(x)^T(x - x^*)$, the definition $\eta = \sup_{x\in C} \|\widehat\nabla G(x) - \nabla G(x)\|_1$, the update formula $x_{t+1} = P_C(x_t - s\widehat\nabla G(x_t))$, and the bound $\|x_t - x^*\|_\infty \le 1$ (coordinates lie in $[0,1]$), it follows that

$$\begin{aligned}
G(x_t) - G(x^*) &\le \nabla G(x_t)^T(x_t - x^*) \\
&= \widehat\nabla G(x_t)^T(x_t - x^*) + \big(\widehat\nabla G(x_t) - \nabla G(x_t)\big)^T(x_t - x^*) \\
&\le \widehat\nabla G(x_t)^T(x_t - x^*) + \eta\|x_t - x^*\|_\infty \\
&\le \frac{1}{s}(x_t - x_{t+1})^T(x_t - x^*) + \eta \\
&= \frac{1}{2s}\big(\|x_t - x^*\|_2^2 + \|x_t - x_{t+1}\|_2^2 - \|x_{t+1} - x^*\|_2^2\big) + \eta \\
&\le \frac{1}{2s}\big(\|x_t - x^*\|_2^2 - \|x_{t+1} - x^*\|_2^2\big) + \frac{s}{2}\|\widehat\nabla G(x_t)\|_2^2 + \eta.
\end{aligned}$$

Adding the preceding inequality for $t = 1$ to $t = T$, the sum telescopes and we get

$$\sum_{t=1}^{T} \big[G(x_t) - G(x^*)\big] \le \frac{R^2}{2s} + \frac{s}{2}L^2 T + \eta T = RL\sqrt{T} + \eta T. \tag{B.1}$$

Here we used the definitions $R = \|x_1 - x^*\|$ and $L = \sup_{x\in C}\|\widehat\nabla G(x)\|$, and the last equality is by the choice $s = \frac{R}{L\sqrt{T}}$. Now defining $\bar x_T = \frac1T \sum_{t=1}^T x_t$, dividing (B.1) through by $T$ and using the convexity of $G$ to apply Jensen's inequality gives

$$G(\bar x_T) - G(x^*) \le \frac{RL}{\sqrt{T}} + \eta.$$

Thus in order to make the right-hand side smaller than $\delta$ it suffices to take $T = 4R^2L^2/\delta^2$ and $\eta = \delta/2$.

B.2 Proof of Lemma 4.5

We start by showing that the gradient $\nabla\Phi$ is $p^{3/2}$-Lipschitz. Recall that $\nabla\Phi(\theta) = \mu(\theta)$. We prove a bound on $|\mu_i(\theta) - \mu_i(\theta')|$ by changing one coordinate of $\theta$ at a time. Let $\theta^{(r)} = (\theta_1, \dots, \theta_r, \theta'_{r+1}, \dots, \theta'_p)$. The triangle inequality gives

$$|\mu_i(\theta) - \mu_i(\theta')| \le \sum_{r=0}^{p-1} \big|\mu_i(\theta^{(r)}) - \mu_i(\theta^{(r+1)})\big|.$$

A direct calculation shows that

$$\frac{\partial}{\partial\theta_r}\mu_i(\theta) = P(\sigma_i = \sigma_r = 1) - \mu_i(\theta)\mu_r(\theta).$$

Since this is uniformly bounded by one in absolute value, we obtain the inequality $|\mu_i(\theta) - \mu_i(\theta')| \le \|\theta - \theta'\|_1$, or $\|\mu(\theta) - \mu(\theta')\|_1 \le p\|\theta - \theta'\|_1$. Hence

$$\|\mu(\theta) - \mu(\theta')\|_2 \le \|\mu(\theta) - \mu(\theta')\|_1 \le p\|\theta - \theta'\|_1 \le p^{3/2}\|\theta - \theta'\|_2,$$

i.e., $\nabla\Phi$ is $p^{3/2}$-Lipschitz. Now the function $\nabla\Phi$ being $p^{3/2}$-Lipschitz implies that $\Phi$ is $p^{3/2}$-strongly smooth, where $\Phi$ is $\beta$-strongly smooth if

$$\Phi(x+\Delta) - \Phi(x) \le \langle\nabla\Phi(x), \Delta\rangle + \tfrac{1}{2}\beta\|\Delta\|^2.$$

To see this, we write

$$\begin{aligned}
\Phi(x+\Delta) - \Phi(x) &= \int_0^1 \langle\nabla\Phi(x+\tau\Delta), \Delta\rangle\, d\tau \\
&= \langle\nabla\Phi(x), \Delta\rangle + \int_0^1 \langle\nabla\Phi(x+\tau\Delta) - \nabla\Phi(x), \Delta\rangle\, d\tau \\
&\le \langle\nabla\Phi(x), \Delta\rangle + p^{3/2}\int_0^1 \tau\|\Delta\|^2\, d\tau = \langle\nabla\Phi(x), \Delta\rangle + \tfrac{1}{2}p^{3/2}\|\Delta\|^2.
\end{aligned}$$

Now Theorem 6 from [25] or Chapter 5 of [26] implies that $\Phi^*$, being the Fenchel conjugate of $\Phi$, is $p^{-3/2}$-strongly convex, meaning

$$\Phi^*(x+\Delta) - \Phi^*(x) \ge \langle\nabla\Phi^*(x), \Delta\rangle + \tfrac{1}{2}p^{-3/2}\|\Delta\|^2.$$

This gives the desired bound on $\|x - x^*\|$ in terms of $\Phi^*(x) - \Phi^*(x^*)$.
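The key identity in this proof, $\partial\mu_i/\partial\theta_r = P(\sigma_i = \sigma_r = 1) - \mu_i\mu_r$ (the covariance of the indicator variables), is easy to sanity-check numerically. A small sketch of ours, reusing `forward_map` and `independent_sets` from Section 2:

```python
from math import exp

def cov_exact(theta, edges, i, r):
    """Covariance P(sigma_i = sigma_r = 1) - mu_i mu_r, by enumeration."""
    p = len(theta)
    sets = independent_sets(p, edges)
    w = [exp(sum(theta[k] * s[k] for k in range(p))) for s in sets]
    Z = sum(w)
    joint = sum(wk * s[i] * s[r] for wk, s in zip(w, sets)) / Z
    mu, _ = forward_map(theta, edges)
    return joint - mu[i] * mu[r]

def dmu_finite_diff(theta, edges, i, r, h=1e-5):
    """Central finite-difference estimate of d mu_i / d theta_r."""
    tp, tm = list(theta), list(theta)
    tp[r] += h
    tm[r] -= h
    return (forward_map(tp, edges)[0][i] - forward_map(tm, edges)[0][i]) / (2 * h)

edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
theta = [0.3, -0.2, 0.1, 0.0]
print(cov_exact(theta, edges, 0, 2), dmu_finite_diff(theta, edges, 0, 2))  # agree
```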
C Proofs of gradient bounds

C.1 Proof of Lemma 5.3

We suppose for the sake of deriving a contradiction that $\theta_i > p/\delta$. Let $\bar\mu = \mu + \delta \cdot e_i$, and let $\eta'$ be a probability measure such that $\bar\mu = \sum_{\sigma \in \mathcal{I}} \eta'_\sigma \sigma$. Now $\eta'(S_i) = \bar\mu_i \ge \delta$, and we define the non-negative measure $\gamma$ (summing to less than one) with support $S_i$ as

$$\gamma_\sigma = \begin{cases} \eta'_\sigma \cdot \dfrac{\delta}{\eta'(S_i)} & \text{if } \sigma \in S_i, \\ 0 & \text{otherwise.} \end{cases}$$

In this way, $\gamma_\sigma \le \eta'_\sigma$ and $\gamma(S_i) = \delta$. We define a new probability measure

$$\eta_\sigma = \begin{cases} \eta'_\sigma - \gamma_\sigma & \text{if } \sigma \in S_i, \\ \eta'_\sigma + \gamma_{\sigma + e_i} & \text{if } \sigma \in S_{-i}, \\ \eta'_\sigma & \text{otherwise,} \end{cases} \tag{C.1}$$

and one may check that $\mu = \sum_{\sigma \in \mathcal{I}} \eta_\sigma \sigma$ and $\eta(S_{-i}) \ge \gamma(S_i) = \delta$. We use the definitions in Subsection 5.1 to get

$$F_\mu(\theta) := \langle \mu, \theta\rangle - \log\Big(\sum_{\sigma \in \mathcal{I}} \exp(\langle \sigma, \theta\rangle)\Big) = \sum_{\rho \in \mathcal{I}} \eta_\rho \log \frac{\exp(\langle \rho, \theta\rangle)}{\sum_\sigma \exp(\langle \sigma, \theta\rangle)}$$
$$\overset{(a)}{\le} \sum_{\rho \in S_{-i}} \eta_\rho \log \frac{\exp(\langle \rho, \theta\rangle)}{f(S_{-i}) + e^{\theta_i} f(S_{-i}) + f(S_i^{\oslash})} \overset{(b)}{\le} \sum_{\rho \in S_{-i}} \eta_\rho \log \frac{f(S_{-i})}{e^{\theta_i} f(S_{-i})} = -\eta(S_{-i})\,\theta_i \overset{(c)}{<} -p \overset{(d)}{\le} -\log|\mathcal{I}| = F_\mu(\mathbf{0}).$$

Here (a) follows by restricting the sum to $S_{-i} \subseteq \mathcal{I}(G)$ (each term is non-positive) and from the fact that $\sum_\sigma \exp(\langle\sigma,\theta\rangle) = f(S_{-i}) + e^{\theta_i} f(S_{-i}) + f(S_i^{\oslash})$; (b) follows by retaining only the term $e^{\theta_i} f(S_{-i})$ in the denominator and replacing $\exp(\langle\rho,\theta\rangle)$ for $\rho \in S_{-i}$ with $f(S_{-i}) = \sum_{\rho \in S_{-i}} \exp(\langle\rho,\theta\rangle)$, thereby increasing the argument of the logarithm; (c) uses the fact that $\eta(S_{-i}) \ge \delta$ and the assumption that $\theta_i > p/\delta$; and (d) follows from the crude bound $|\mathcal{I}| \le 2^p$ on the number of independent sets and $\log 2 < 1$. Finally, the relation $\theta(\mu) = \arg\max_\theta F_\mu(\theta)$ from Section 3 contradicts $F_\mu(\theta) < F_\mu(\mathbf{0})$.

C.2 Proof of Lemma 5.4

We suppose for the sake of contradiction that $\theta_i < -p/\delta$ and show that $\theta$ cannot be the vector of canonical parameters corresponding to $\mu$. Since $\mu \in \mathcal{M}$, there exists a non-negative measure $\eta$ so that $\mu = \sum_{\sigma\in\mathcal{I}} \eta_\sigma \sigma$, and furthermore $\eta(S_i) = \mu_i \ge \delta$. Now arguments similar to the proof of Lemma 5.3 above give

$$F_\mu(\theta) = \langle\mu,\theta\rangle - \log\Big(\sum_\sigma \exp(\langle\sigma,\theta\rangle)\Big) = \sum_{\rho\in\mathcal{I}} \eta_\rho \log \frac{\exp(\langle\rho,\theta\rangle)}{\sum_\sigma \exp(\langle\sigma,\theta\rangle)}$$
$$\le \sum_{\rho\in S_i} \eta_\rho \log \frac{\exp(\langle\rho,\theta\rangle)}{f(S_{-i}) + e^{\theta_i} f(S_{-i}) + f(S_i^{\oslash})} \le \sum_{\rho\in S_i} \eta_\rho \log \frac{e^{\theta_i} f(S_{-i})}{f(S_{-i}) + e^{\theta_i} f(S_{-i}) + f(S_i^{\oslash})}$$
$$\le \sum_{\rho\in S_i} \eta_\rho\, \theta_i = \eta(S_i)\,\theta_i < -\delta\cdot p/\delta = -p \le -\log|\mathcal{I}| = F_\mu(\mathbf{0}).$$

As before, this contradicts the relation $\theta(\mu) = \arg\max_\theta F_\mu(\theta)$.

D Proof of Proposition 4.6

Starting with $x_t$ in $\mathcal{M}_1$, our goal is to show that $x_{t+1} = P_{\ge}(x_t - s\hat\theta(x_t))$ remains in $\mathcal{M}_1$. The proof will then follow by induction, because our initial point $x_1$ is in $\mathcal{M}_1$ by the hypothesis.

We will use the fact that all hyperplane constraints for $\mathcal{M}$, except for the non-negativity constraints $x_i \ge 0$, can be written as $\langle h, x\rangle \le 1$ for a vector $h \in [0,1]^p$. This can be justified using the fact that $e_i \in \mathcal{M}$ for each $i$, together with the property that for any $\mu \in \mathcal{M}$, any coordinate of $\mu$ can be set to zero while remaining in $\mathcal{M}$. Given our current iterate $x_t$, we call a constraint $\langle h, x\rangle \le 1$ active if

$$1 - 2\epsilon\|h\|_\infty < \langle h, x_t\rangle \le 1 - \epsilon\|h\|_\infty \tag{D.1}$$

and critical if

$$1 - \epsilon\|h\|_\infty < \langle h, x_t\rangle. \tag{D.2}$$

Observe that an active constraint has a coordinate $i$ (namely $i$ with $h_i = \|h\|_\infty$) with $\langle h, x_t + 2\epsilon\cdot e_i\rangle = \langle h, x_t\rangle + 2h_i\epsilon > 1$, and similarly a critical constraint has a coordinate $i$ with $\langle h, x_t + \epsilon\cdot e_i\rangle = \langle h, x_t\rangle + h_i\epsilon > 1$.
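The classification (D.1)-(D.2) is purely coordinate-wise arithmetic. A self-contained sketch (ours) making the three cases explicit:

```python
def classify_constraint(h, x, eps):
    """Classify a constraint <h, x> <= 1 at the iterate x, following (D.1)-(D.2):
    'critical'  if 1 - <h,x> <  eps * ||h||_inf  (never happens for x in M1),
    'active'    if eps * ||h||_inf <= 1 - <h,x> < 2 * eps * ||h||_inf,
    'inactive'  otherwise."""
    slack = 1.0 - sum(hj * xj for hj, xj in zip(h, x))
    hmax = max(h)                  # h has entries in [0, 1]
    if slack < eps * hmax:
        return "critical"
    if slack < 2 * eps * hmax:
        return "active"
    return "inactive"
```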
For $x_t \in \mathcal{M}_1$ there are (by definition) no critical constraints, but there may be active constraints. We will first show that inactive constraints can at worst become active for the next iterate $x_{t+1}$, which requires only that the step size is not too large relative to the magnitude of the gradient. Then we show that the active constraints have a repulsive property and that $x_{t+1}$ is no closer than $x_t$ to any active constraint, that is, $\langle h, x_{t+1}\rangle \le \langle h, x_t\rangle$. Thus, if $x_t$ is in $\mathcal{M}_1$, then there are no critical constraints for $x_{t+1}$ and every coordinate $i$ satisfies $\langle h, x_{t+1} + \epsilon \cdot e_i\rangle \le 1$ for all constraint vectors $h$. Since the projection $P_{\ge}$ ensures that $x_{t+1,i} \ge q\epsilon$, the update $x_{t+1}$ is in $\mathcal{M}_1$. We now focus on inactive constraints.

Inactive constraints. We consider an inactive constraint $h$, meaning that $\langle h, x_t\rangle + 2\epsilon\|h\|_\infty \le 1$. By assumption the step size is $s = (\epsilon^2/p)^2$, so the increment in any coordinate $j$ is bounded as

$$x_{t+1,j} - x_{t,j} \le s|\hat\theta_j(x_t)| \le s|\hat\theta_j(x_t) - \theta_j(x_t)| + s|\theta_j(x_t)| \le (1+\gamma)s|\theta_j(x_t)| \le \epsilon/p,$$

using Lemma 5.3 and the fact that $\gamma \le 1$. These bounds give

$$\langle h, x_{t+1}\rangle = \langle h, x_t\rangle + \langle h, x_{t+1} - x_t\rangle \le \langle h, x_t\rangle + \sum_j h_j(x_{t+1,j} - x_{t,j}) \le \langle h, x_t\rangle + p(\epsilon/p)\|h\|_\infty \le 1 - \epsilon\|h\|_\infty,$$

which shows that the constraint is not critical for $x_{t+1}$ and at worst becomes active.

Active constraints. The rough idea is that if a coordinate $i$ cannot be increased by $2\epsilon$ while remaining in $\mathcal{M}$, then the parameter $\theta_i$ must be sufficiently large, and the next iterate $x_{t+1}$ will decrease enough to overcome the possible increase in other coordinates. This argument does not work, however, because it might be the case that $x_{t,i} = q\epsilon$, which prevents any decrease (i.e., $x_{t+1,i} \ge x_{t,i}$) due to the projection $P_{\ge}$. Instead, we start by showing that if some coordinate cannot be increased by $2\epsilon$, then there must be a reasonably large coordinate which cannot be increased by $4p\epsilon$.

Lemma D.1. If $h$ is an active constraint, then there is a coordinate $\ell \in V$ with $x_t + (4p\epsilon)e_\ell \notin \mathcal{M}$ and $x_{t,\ell} \ge 2q\epsilon$.

Proof. If $h$ is active then $1 - 2\epsilon\|h\|_\infty < \langle h, x_t\rangle$. Using the fact that $h_j \le 1$ for all $j$ we have

$$1 - 2\epsilon \le 1 - 2\epsilon\|h\|_\infty < \langle h, x_t\rangle. \tag{D.3}$$

Let $B \subseteq V$ consist of coordinates $j$ with small entries $x_{t,j} \le 2\epsilon q$. Then

$$\langle h, x_t\rangle = \sum_{j \in B} h_j x_{t,j} + \sum_{j \in B^c} h_j x_{t,j} \le |B|(2\epsilon q) + \sum_{j \in B^c} h_j x_{t,j} \le \frac{2}{p} + \sum_{j \in B^c} h_j x_{t,j}. \tag{D.4}$$

The last inequality used the crude estimate $|B| \le p$. Combining (D.3) and (D.4) and rearranging gives

$$\sum_{j \in B^c} h_j x_{t,j} \ge 1 - 2\epsilon - 2/p \ge 1 - 3/p,$$

and it follows that there is an $\ell \in B^c$ for which $h_\ell \ge h_\ell x_{t,\ell} \ge 1/2p$. Adding $h_\ell \cdot (4p\epsilon) \ge 2\epsilon$ to both sides of (D.3) shows that $x_t + (4p\epsilon)e_\ell$ violates the inequality $\langle h, x\rangle \le 1$. This proves the lemma, since $x_{t,\ell} > 2q\epsilon$ for $\ell \in B^c$.

We are now ready to prove that $\langle h, x_{t+1}\rangle \le \langle h, x_t\rangle$. Let $\ell$ be the coordinate promised by Lemma D.1, with $x_t + (4p\epsilon)e_\ell \notin \mathcal{M}$ and $x_{t,\ell} \ge 2q\epsilon$. From Lemma 5.2, we know that

$$\theta_\ell(x_t) \ge \log\Big(\frac{q}{4p} - 1\Big) \ge 3\log p,$$

for $p$ large enough. By definition of $\hat\theta$ being a $\gamma$-approximation to $\theta$, $\hat\theta_\ell(x_t) \ge (1-\gamma)\theta_\ell(x_t)$. Therefore, since $\gamma \to 0$ as $p \to \infty$, it follows that for $p$ large enough $\hat\theta_\ell(x_t) \ge \log p$. This implies

$$x_{t+1,\ell} - x_{t,\ell} \le -\min\big(s\hat\theta_\ell(x_t),\, s\log p\big) \le -s\log p. \tag{D.5}$$

Here we used the fact that $x_{t,\ell} \ge q\epsilon + s\log p$, so the projection $P_{\ge}$ does not affect this coordinate. Denote by $D$ the set of coordinates

$$D = \big\{j \in [p] : \langle h, x_t\rangle + \tfrac{q\epsilon}{2}\, h_j > 1\big\}.$$

These coordinates have non-positive increment: for $j \in D$ we have $x_t + \frac{q\epsilon}{2} e_j \notin \mathcal{M}$, so Lemma 5.2 (applied with $r = q/2$) implies that $\theta_j \ge \ln(q/(q/2) - 1) = 0$, and hence $\hat\theta_j \ge (1-\gamma)\theta_j \ge 0$, or $x_{t+1,j} - x_{t,j} \le 0$ for $j \in D$.
In contrast, coordinates in $D^c$ might increase, but by a limited amount: since $x_t \in \mathcal{M}_1$, all coordinates $j \in D^c$ satisfy $x_{t,j} \ge q\epsilon$, and Lemma 5.4 gives the bound $\theta_j \ge -p/q\epsilon$, or

$$x_{t+1,j} - x_{t,j} \le (1+\gamma)\,|{-s\theta_j}| \le 2sp/q\epsilon \quad \text{for all } j \in D^c. \tag{D.6}$$

Additionally, by the definition of $D$ and the fact that increasing coordinate $\ell$ by $4p\epsilon$ violates $\langle h, x\rangle \le 1$, if $j \in D^c$, then $4p\epsilon h_\ell > q\epsilon h_j/2$, or

$$h_j < 8p h_\ell/q \quad \text{for all } j \in D^c. \tag{D.7}$$

Using the crude bound $|D^c| \le p$ together with (D.6) and (D.7) gives

$$\sum_{j \in D^c} h_j (x_{t+1,j} - x_{t,j}) \le |D^c| \cdot \frac{8p h_\ell}{q} \cdot \frac{2sp}{q\epsilon} \le s\,\frac{4p^2}{q^2\epsilon}\, h_\ell \le 4s h_\ell. \tag{D.8}$$

Counting the contributions from $D^c$ in (D.8) in addition to $D$ (none) and $\ell$ (negative, as per (D.5)), it follows that

$$\langle h, x_{t+1}\rangle = \langle h, x_t\rangle + \langle h, x_{t+1} - x_t\rangle \le \langle h, x_t\rangle + s h_\ell (4 - \ln p) \le \langle h, x_t\rangle.$$

Here we have used the fact that $p$ is large enough ($p \ge e^4$ suffices for this last step). In words, we move away from any active hyperplane constraint. This completes the proof.
