
Lower Bounds on Active Learning for Graphical Model Selection

Jonathan Scarlett and Volkan Cevher
Laboratory for Information and Inference Systems (LIONS)
École Polytechnique Fédérale de Lausanne (EPFL)
Email: {jonathan.scarlett, volkan.cevher}@epfl.ch

Abstract

We consider the problem of estimating the underlying graph associated with a Markov random field, with the added twist that the decoding algorithm can iteratively choose which subsets of nodes to sample based on the previous samples, resulting in an active learning setting. Considering both Ising and Gaussian models, we provide algorithm-independent lower bounds for high-probability recovery within the class of degree-bounded graphs. Our main results are minimax lower bounds for the active setting that match the best known lower bounds for the passive setting, which in turn are known to be tight in several cases of interest. Our analysis is based on Fano's inequality, along with novel mutual information bounds for the active learning setting, and the application of restricted graph ensembles. While we consider ensembles that are similar or identical to those used in the passive setting, we require different analysis techniques, with a key challenge being bounding a mutual information quantity associated with observed subsets of nodes, as opposed to full observations.

(Appearing in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS) 2017, Fort Lauderdale, Florida, USA. JMLR: W&CP volume 54. Copyright 2017 by the authors.)

1 Introduction

Graphical models are a widely-used tool for providing compact representations of the conditional independence relations between random variables, and arise in areas such as image processing [1], statistical physics [2], computational biology [3], natural language processing [4], and social network analysis [5]. The problem of graphical model selection consists of recovering the graph structure given a number of independent samples from the underlying distribution. While this problem is NP-hard in general [6], there exist a variety of methods guaranteeing exact recovery with high probability on restricted graph classes, with a particularly common restriction being bounded degree.

Several variations of graphical model selection with active learning have appeared in the literature. In this paper, we adopt the formulation given in [7], in which the recovery algorithm may adaptively choose which nodes to sample based on the previous samples. The goal is to recover the underlying graph subject to a constraint on the total number of node observations. As discussed in [7], this variation is of interest in several applications; for example, in sensor networks one may be able to choose which sensors to activate, rather than simultaneously activating every sensor at every time instant. Only upper bounds were provided in [7], and the problem of finding lower bounds was left as an open problem.

1.1 Contributions

In this paper, we complement the work of [7] by providing algorithm-independent lower bounds on active learning for graphical model selection. Our main findings are summarized as follows:
1. For both Ising models and Gaussian models, we provide lower bounds that essentially match the best known lower bounds for the passive setting [8, 9], in terms of the minimax probability of error with respect to the class of bounded-degree graphs. The passive learning bounds are known to be tight in several cases of interest, and our results show that active learning does not help significantly in the minimax sense in such cases.

2. We provide a class of Gaussian graphical models where the average degree, as opposed to the maximal degree, dictates the lower bounds, and where we match upper bounds based on the average degree in [7]. Hence, we identify a graph class where the average degree is provably the quantity dictating the fundamental limits. Moreover, we provide a class of Ising models where the maximal degree provably remains the key quantity dictating the performance, hence revealing that one cannot always improve the dependence from the maximal to the average degree.

Our analysis uses a variation of Fano's inequality for the active learning setting, along with novel mutual information bounds proved using techniques analogous to those used in channel coding with noiseless feedback [10]. We apply the resulting bound to a variety of restricted graph ensembles in which the graphs are difficult to distinguish from each other, with notable examples being (i) isolated edges that are difficult to detect, and (ii) cliques with a single edge removed such that the removal is difficult to detect. While the ensembles that we use are similar or identical to those used in the passive setting, analyzing them in the active setting requires new techniques, particularly for bounding a mutual information quantity associated with partial observations instead of full observations.

1.2 Related Work

In the same way that feedback often provides little or no gain in the capacity for channel coding [10], it is often observed that active learning provides little or no gain in the information-theoretic sample complexity of inference and learning problems. For example, in the compressive sensing problem, it has been shown that the improvement amounts to at most a logarithmic factor [11]. For the group testing problem, under a broad range of scalings of the sparsity level, not even the constant factors improve [12, 13]. On the other hand, active learning is known to strictly improve the sample complexity in several cases of interest [14, 15]. Moreover, it should be noted that even when adaptivity does not help asymptotically in an information-theoretic sense, it can still help in the sense of leading to simpler and less computationally expensive algorithms, and also in improving the non-asymptotic performance [13, 15, 16].

Active learning for graphical model selection has been studied in several contexts [7, 17, 18], the most relevant to ours being that of Dasarathy et al. [7]. A general algorithm was proposed therein using abstract subroutines for neighborhood selection and neighborhood verification, and applications to the Gaussian setting revealed cases where the total number of node observations is improved from O(d_max · p log p) to O((1 + d̄_max) p log p). Here d̄_max is the average of the node-wise maximal degrees, where the node-wise maximal degree of a node is defined as the highest degree among the node and all of its neighbors.
This quantity can be significantly smaller than d_max, in which case the improvement in the sample complexity is substantial.

Information-theoretic lower bounds for the passive setting were given in [8, 19–24] for the Ising model, and in [9, 25, 26] for the Gaussian model. Let ñ be the sample complexity with respect to the number of p-dimensional observations. The best minimax lower bounds for degree-bounded graphs are summarized as follows for the Ising model [8]:

    ñ = Ω( max{ log p / (λ tanh λ),  e^{λd} log(pd) / (λd e^λ),  d log(p/d) } ),    (1)

where p is the number of nodes, d is the maximal degree, and λ is the inverse temperature of the Ising model (see Section 2 for precise definitions). For the Gaussian model, the best known lower bounds for degree-bounded graphs are [9]

    ñ = Ω( max{ log p / τ²,  d log(p/d) / log(1 + dτ) } ),    (2)

where τ corresponds to the smallest allowed off-diagonal magnitude in the normalized inverse covariance matrix (see Section 2 for details).

A wide range of polynomial-time algorithms have been proposed for the passive learning of graphical models; see [19, 20, 24, 27–32] for Ising models, and [20, 33–35] for Gaussian models. The best performance bounds among these algorithms match those of (1)–(2) in several cases of interest, though there are other cases where gaps remain, or where the results are difficult to compare due to differences in the underlying assumptions (e.g., additional coherence assumptions).

1.3 Structure of the Paper

In Section 2, we formally define the Ising and Gaussian graphical models, and formulate the active learning problem. Our main results are presented and discussed in Section 3. The proofs are given in Section 4.1 (Fano's inequality), Section 4.2 (Ising model), and Section 4.3 (Gaussian model). In Section 5, we discuss the role of the average vs. maximal degree, and we conclude our work in Section 6.

2 Active Learning for Graphical Model Selection

2.1 Preliminaries

We consider a collection of p random variables (X_1, …, X_p) whose joint distribution is encoded by a graphical model G = (V, E) with vertex set V = {1, …, p} and undirected edge set E. The elements of V are referred to as nodes or variables interchangeably. We use the standard terminology that the degree of a node i ∈ V is the number of edges in E containing i, and that a clique is a fully-connected subset of V of cardinality at least two.

We consider two classes of joint probability distributions encoded by G, namely, Ising models and Gaussian models. These are described as follows.

Ising model: In the ferromagnetic Ising model [36, 37], each vertex is associated with a binary random variable X_i ∈ {−1, 1}, and the corresponding joint distribution is described by the probability mass function

    P_G(x) = (1/Z) exp( λ Σ_{(i,j)∈E} x_i x_j ),    (3)

where Z is a normalizing constant called the partition function. Here λ > 0 is a parameter of the distribution, sometimes called the inverse temperature. In the context of Ising model selection, we write G_d as G_{d,λ} to emphasize that the results depend on λ. Although we let λ be a constant here, our lower bounds remain valid in the minimax sense when one considers the larger class in which the edges have differing parameters {λ_ij} in the range [λ_min, λ_max], provided that λ_min ≤ λ ≤ λ_max.
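To make the model concrete, the following minimal Python sketch (our own illustration, not part of the paper) evaluates the PMF in (3) by brute-force enumeration of the 2^p configurations for a small graph; the edge list and the value of λ below are arbitrary choices made purely for illustration.

```python
import itertools
import math

def ising_pmf(p, edges, lam):
    """Brute-force evaluation of the ferromagnetic Ising model (3):
    P_G(x) proportional to exp(lam * sum_{(i,j) in E} x_i * x_j), x in {-1,+1}^p."""
    weights = {}
    for x in itertools.product([-1, 1], repeat=p):
        weights[x] = math.exp(lam * sum(x[i] * x[j] for (i, j) in edges))
    Z = sum(weights.values())  # partition function
    return {x: w / Z for x, w in weights.items()}

# Toy example: p = 3 nodes, a single edge (0, 1), inverse temperature 0.4.
pmf = ising_pmf(3, [(0, 1)], 0.4)
assert abs(sum(pmf.values()) - 1.0) < 1e-12
# Configurations agreeing on the edge are more likely than disagreeing ones.
print(pmf[(1, 1, 1)], pmf[(1, -1, 1)])
```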
Gaussian model: In the Gaussian graphical model [37], each vertex is associated with a random variable X_i ∈ R, and the corresponding joint distribution is

    (X_1, …, X_p) ∼ N(0, Σ),    (4)

where 0 is the vector of zeros, and Σ is a covariance matrix whose inverse Σ^{−1} contains non-zeros only in the diagonal entries and the indices corresponding to pairs in E. By the Hammersley–Clifford theorem [37], this implies the Markov property for the graph, namely, that a given node is conditionally independent of the rest of the graph given its neighbors. The joint density function corresponding to (4) is denoted by P_G, overloading the notation used above for the Ising model.

A typical restriction on the entries of Θ = Σ^{−1} is that |Θ_ij| / √(Θ_ii Θ_jj) is lower bounded by some constant τ > 0 [7, 9]. We consider the simplest special case of this in which the lower bound always holds with equality:

    Θ_ij = 1 if i = j;   ±τ if (i, j) ∈ E;   0 otherwise.    (5)

We write G_d as G_{d,τ} to emphasize that the results depend on τ. Similarly to the Ising model, our lower bounds remain valid in the minimax case when we consider the larger class with |Θ_ij| / √(Θ_ii Θ_jj) ∈ [τ_min, τ_max] and τ_min ≤ τ ≤ τ_max.

[Figure 1: Illustration of the active learning problem for graphical model selection. In each round, the decoder chooses the nodes to sample (Z^{(i+1)}), receives the corresponding sample X^{(i)} from the model G, and eventually outputs an estimate Ĝ.]

2.2 Problem Statement

The problem of graphical model selection with active learning proceeds in rounds i = 1, 2, …, as illustrated in Figure 1. In the i-th round, the algorithm selects a subset of V to observe, encoded by a binary vector Z^{(i)} ∈ {0, 1}^p equaling one for observed nodes and zero for non-observed nodes. The resulting sample (or observation) is a p-dimensional vector X^{(i)} such that:

• The joint distribution of the entries of X^{(i)} corresponding to the entries where Z^{(i)} is one coincides with the corresponding joint distribution of the vector (X_1, …, X_p) ∼ P_G, with independence between rounds;

• The values of the entries of X^{(i)} corresponding to the entries where Z^{(i)} is zero are deterministically given by ∗, a symbol indicating that the node was not observed.

For convenience, we let N denote the maximum possible number of active learning rounds (e.g., we can simply set N = n), and use the convention that for values of i beyond the actual (possibly random) final round, X^{(i)} = (∗, …, ∗). Letting |Z^{(i)}| denote the number of entries where Z^{(i)} is one, we refer to Σ_{i=1}^N |Z^{(i)}| as the total number of node observations used throughout the course of the algorithm, and we impose an upper bound on its maximum allowed value, denoted by n. Note that this quantity differs from ñ in (1)–(2) by a factor of p.

After the final round, the algorithm constructs an estimate Ĝ of G, and the error probability is given by

    P_e(G) := P[Ĝ ≠ G].    (6)

We consider the class G_d of degree-bounded graphs, in which all nodes have degree at most d. Specifically, we are interested in bounds on the minimax (worst-case) error probability for graphs in this class:

    P_e := max_{G ∈ G_d} P[Ĝ ≠ G],    (7)

where the dependence on the total number of node observations n is kept implicit. Note that when we consider the Gaussian setting, the maximum in (7) is not only over the graph G, but also implicitly over the signs (+1 or −1) in the second case of (5).

We are interested in characterizing the sample complexity, meaning the number of node observations n needed in order to achieve P_e ≤ δ for some target error probability δ > 0.
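As a minimal illustration of this observation model (our own sketch, with hypothetical helper names such as `run_active_learner`; nothing here is taken from [7] or from the paper), the loop below lets a learner supply a mask Z^{(i)} each round, masks the unobserved coordinates with a stand-in for ∗, and charges the observed nodes against the budget n.

```python
import numpy as np

STAR = None  # stands in for the unobserved symbol "*"

def sample_round(draw_full, z):
    """One round: draw a fresh p-dimensional sample from P_G and keep only
    the coordinates where the 0/1 mask z equals one."""
    x = draw_full()
    return [x[j] if z[j] == 1 else STAR for j in range(len(z))]

def run_active_learner(draw_full, choose_mask, budget):
    """Generic active loop: choose_mask maps the history of (mask, observation)
    pairs to the next mask (or None to stop); the total number of observed
    nodes is charged against the node-observation budget n."""
    history, used = [], 0
    while True:
        z = choose_mask(history)
        if z is None or used + sum(z) > budget:
            break
        history.append((z, sample_round(draw_full, z)))
        used += sum(z)
    return history, used

# Toy usage: p = 4 i.i.d. Rademacher nodes; observe nodes {0, 1} for 3 rounds.
rng = np.random.default_rng(0)
draw = lambda: rng.choice([-1, 1], size=4)
hist, used = run_active_learner(draw, lambda h: [1, 1, 0, 0] if len(h) < 3 else None, budget=10)
print(len(hist), "rounds,", used, "node observations")
```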
3 Main Results

In this section, we state and discuss our main results, namely, minimax lower bounds on the sample complexity for G_d. We note that the proofs are based on graph ensembles in which the maximal degree and average degree are approximately equal; however, in Section 5, we discuss variations of these ensembles in which these two notions differ significantly.

3.1 Ising Model

Theorem 1. For Ising graphical models with λd ≥ 1, in order to recover any graph in G_{d,λ} with probability at least 1 − δ, it is necessary that the total number of node observations, n, satisfies

    n ≥ max{ 2p log p / (λ tanh λ),  e^{λd} log(pd) / (2λd e^λ),  pd log(p/(8d)) / (4 log 2) } × (1 − δ − o(1)).    (8)

Proof. See Section 4.2.

The second bound in (8) reveals that the sample complexity is very large when λd → ∞ at a rate that is not too slow, due to the exponential term e^{λd}. On the other hand, when λ = O(1/d), the first bound gives a sample complexity of Ω(d² p log p), since tanh λ = O(λ) as λ → 0. Finally, in any case, the third bound gives n = Ω(pd log(p/d)). These observations coincide with those for the lower bounds on passive learning in [8] (see (1), with n = ñp), suggesting that active learning does not help much in the minimax sense for G_{d,λ}. Note that compared to [8], we lose a factor of p in the second bound, but this factor is insignificant compared to e^{λd} provided that λd ≫ log p.

3.2 Gaussian Model

Theorem 2. For Gaussian graphical models with d = o(p), in order to recover any graph in G_{d,τ} with probability at least 1 − δ, it is necessary that the total number of node observations, n, satisfies

    n ≥ max{ 4p log p / log(1/(1 − τ²)),  2pd log(p/d) / log( 1 + ((d+1)τ/(1−τ))² ) } × (1 − δ − o(1)).    (9)

Proof. See Section 4.3.

When τ = o(1), the first bound behaves as Ω((1/τ²) p log p), whereas when τ is a constant, the second bound behaves as Ω((1/log d) · pd log(p/d)). Both of these scaling laws are identical to the necessary conditions for passive learning in [9] (see (2), with n = ñp), again suggesting that active learning does not help much in the minimax sense for G_{d,τ}.

While the above findings indicate that active learning does not help much in the minimax sense for G_d, we discuss a more restricted class of graphs in Section 5 for which active learning helps when τ is a constant. Specifically, similarly to the upper bound in [7], the linear dependence on the maximal degree d in the second term of (9) is improved to the average degree.
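For a rough sense of scale only, the two theorems can be evaluated numerically; the sketch below (our own, with arbitrarily chosen parameter values and the o(1) terms dropped) simply plugs numbers into (8) and (9).

```python
import math

def ising_bound(p, d, lam, delta=0.1):
    """Evaluate the three terms of (8), dropping the o(1) correction."""
    t1 = 2 * p * math.log(p) / (lam * math.tanh(lam))
    t2 = math.exp(lam * d) * math.log(p * d) / (2 * lam * d * math.exp(lam))
    t3 = p * d * math.log(p / (8 * d)) / (4 * math.log(2))
    return max(t1, t2, t3) * (1 - delta)

def gaussian_bound(p, d, tau, delta=0.1):
    """Evaluate the two terms of (9), dropping the o(1) correction."""
    t1 = 4 * p * math.log(p) / math.log(1 / (1 - tau**2))
    t2 = 2 * p * d * math.log(p / d) / math.log(1 + ((d + 1) * tau / (1 - tau))**2)
    return max(t1, t2) * (1 - delta)

# Example: p = 1000 nodes, maximal degree d = 5 (so that lam * d >= 1 for lam = 0.2).
print(f"Ising    (lam = 0.2): n >= {ising_bound(1000, 5, 0.2):.3e}")
print(f"Gaussian (tau = 0.2): n >= {gaussian_bound(1000, 5, 0.2):.3e}")
```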
4 Proofs of Main Results

4.1 Fano's Inequality for Active Learning

We first apply Fano's inequality [10] along with a novel mutual information bound for active learning in graphical model selection. The proof bears some resemblance to that of the converse bound for channel coding with noiseless feedback [10, Sec. 7.12].

For z ∈ {0, 1}^p, we let G(z) denote the subgraph of G obtained by keeping only the nodes corresponding to entries where z equals one, and denote the resulting joint distribution by P_{G(z)}. More generally, for a joint distribution Q on p random variables labeled {1, …, p}, we let Q(z) denote the joint marginal distribution corresponding to the entries where z is one.

In the following lemma, we let G be uniformly random on some subset of G_d, and define the average error probability

    P̄_e := P[Ĝ ≠ G] = E[P_e(G)],    (10)

where, in contrast with (7), the probability is now additionally over G. Clearly any lower bound on the sample complexity for achieving P̄_e ≤ δ implies the same lower bound for achieving P_e ≤ δ, since P_e is defined with respect to the worst case.

Lemma 1. Let G be uniform over a restricted graph class T ⊆ G_d. In order to achieve P̄_e ≤ δ, it is necessary that

    1 ≥ [ log|T| / Σ_{i=1}^N I(G; X^{(i)} | Z^{(i)}) ] · ( 1 − δ − log 2 / log|T| ),    (11)

where N is the maximum possible number of active learning rounds. Moreover, if there exists a p-dimensional joint distribution Q such that D(P_{G(z)} ∥ Q(z)) ≤ ε(z) for all G ∈ T and z ∈ {0, 1}^p, where ε(z) is some non-negative function, then we have

    I(G; X^{(i)} | Z^{(i)}) ≤ E[ε(Z^{(i)})]    (12)

for all i.

The proof is given in the supplementary material. The high-level steps are as follows: (i) bound the error probability in terms of I(G; X) using Fano's inequality; (ii) use the chain rule to write I(G; X) = Σ_{i=1}^N I(X^{(i)}; G | X^{(1)}, …, X^{(i−1)}); (iii) upper bound the summands via steps analogous to the proof of the channel coding theorem with feedback [10, Sec. 7.12]; (iv) relate the divergence D(P_{G(z)} ∥ Q(z)) to I(G; X^{(i)} | Z^{(i)}) using similar steps to [22].

4.2 Proof of Theorem 1 (Ising Model)

4.2.1 First Bound for the Ising Model

We use the following ensemble, in which every node has degree one.

Ensemble 1 [Isolated edges ensemble]:
• Each graph in T consists of ⌊p/2⌋ node-disjoint edges that may otherwise be arbitrary.

The total number of graphs is C(p, 2) C(p−2, 2) ⋯ C(4, 2) C(2, 2) (or similarly when p is an odd number), which is lower bounded by C(⌊p/2⌋, 2)^{⌊p/2⌋}, yielding

    log|T| ≥ ⌊p/2⌋ log C(⌊p/2⌋, 2) = (p log p)(1 + o(1)).    (13)

To obtain a mutual information bound of the form (12), we choose Q = P_{G_0} with G_0 being the empty graph, and note that for a fixed z ∈ {0, 1}^p containing n(z) ones, G(z) consists of at most n(z)/2 node-disjoint edges. Since the divergence corresponding to graphs differing in a single edge is upper bounded by λ tanh λ [22], and since the divergence is additive for independent products, we obtain D(P_{G(z)} ∥ P_{G_0(z)}) ≤ (n(z)/2) λ tanh λ, and hence (12) becomes

    I(G; X^{(i)} | Z^{(i)}) ≤ (1/2) E[n(Z^{(i)})] λ tanh λ.    (14)

Summing over i and noting that Σ_{i=1}^N n(Z^{(i)}) ≤ n with probability one, since the algorithm can only use up to n node observations, we obtain

    Σ_{i=1}^N I(G; X^{(i)} | Z^{(i)}) ≤ (n/2) λ tanh λ.    (15)

Substitution into (11) yields the necessary condition

    n ≥ [ 2p log p / (λ tanh λ) ] (1 − δ − o(1)),    (16)

where the numerator arises from (13).
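The single-edge divergence bound used above is easy to verify numerically; the short sketch below (our own check, not from the paper) computes the exact KL divergence between the two-node Ising model with one edge and the empty graph, and compares it with λ tanh λ.

```python
import itertools
import math

def ising_edge_kl(lam):
    """Exact KL divergence between a two-node Ising model with a single edge
    (parameter lam) and the edgeless model (two independent uniform spins)."""
    configs = list(itertools.product([-1, 1], repeat=2))
    w = [math.exp(lam * x[0] * x[1]) for x in configs]
    Z = sum(w)
    p1 = [wi / Z for wi in w]  # single-edge model
    p0 = 0.25                  # empty graph: uniform over the 4 configurations
    return sum(q * math.log(q / p0) for q in p1)

# The divergence is upper bounded by lam * tanh(lam), as used in Section 4.2.1.
for lam in [0.1, 0.5, 1.0, 2.0]:
    assert ising_edge_kl(lam) <= lam * math.tanh(lam) + 1e-12
    print(lam, round(ising_edge_kl(lam), 6), round(lam * math.tanh(lam), 6))
```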
4.2.2 Second Bound for the Ising Model

We use the following ensemble from [8].

Ensemble 2(m) [Clique-minus-one ensemble]:
• Form ⌊p/m⌋ arbitrary node-disjoint cliques containing m nodes each, to form a base graph G_0.
• Each graph in T is obtained by removing a single edge from G_0.

We choose m = d + 1, so that the maximal degree is d. The total number of graphs is ⌊p/m⌋ C(m, 2), which yields

    log|T| = (log(pd))(1 + o(1)).    (17)

We obtain a bound of the form (12) by choosing Q = P_{G_0} with G_0 as in the ensemble definition. The divergence associated with the full graphs satisfies D(P_G ∥ P_{G_0}) ≤ 4λd e^λ / e^{λd} when λd ≥ 1 [8, Lemma 2]. Since G(z) and G_0(z) are subgraphs of G and G_0 on a common node set, we trivially have D(P_{G(z)} ∥ P_{G_0(z)}) ≤ D(P_G ∥ P_{G_0}), and hence D(P_{G(z)} ∥ P_{G_0(z)}) satisfies the same upper bound as D(P_G ∥ P_{G_0}) regardless of z. Hence, (12) yields

    I(G; X^{(i)} | Z^{(i)}) ≤ 4λd e^λ / e^{λd}.    (18)

Since the node observation budget is n, the active learning can be done in at most n/2 rounds without loss of optimality (i.e., excluding trivial cases where only one node is observed), and we have

    Σ_{i=1}^N I(G; X^{(i)} | Z^{(i)}) ≤ 2nλd e^λ / e^{λd}.    (19)

Substitution into (11) yields the necessary condition

    n ≥ [ e^{λd} log(pd) / (2λd e^λ) ] (1 − δ − o(1)),    (20)

where the numerator arises from (17).

4.2.3 Third Bound for the Ising Model

We use the following straightforward ensemble, which was also used in [8].

Ensemble 3 [Complete ensemble]:
• T contains all graphs with maximal degree at most d, i.e., T = G_d.

It was shown in [8] that log|T| ≥ (dp/4) log(p/(8d)). To bound the mutual information in (11), we note that the following holds when z^{(i)} contains n(z^{(i)}) ones, and hence n(z^{(i)}) nodes are observed in the i-th round:

    I(G; X^{(i)} | Z^{(i)} = z^{(i)}) ≤ n(z^{(i)}) log 2.    (21)

This is because the remaining p − n(z^{(i)}) nodes are deterministically equal to ∗, whereas the n(z^{(i)}) observed nodes are binary and hence each reveal at most log 2 nats (i.e., one bit) of information. Summing (21) over i and averaging over Z^{(i)}, we obtain

    Σ_{i=1}^N I(G; X^{(i)} | Z^{(i)}) ≤ n log 2,    (22)

and substitution into (11) yields the desired result.

4.3 Proof of Theorem 2 (Gaussian Model)

4.3.1 First Bound for the Gaussian Model

We re-use Ensemble 1 above and apply the same analysis, with the only difference being the bounding of the divergence D(P_{G_1} ∥ P_{G_0}) when G_1 contains one edge and G_0 contains no edges. When an edge is present, we let the resulting 2 × 2 covariance matrix and its inverse be given by

    Σ_1 = [ 1  τ ;  τ  1 ],    Σ_1^{−1} = (1/(1 − τ²)) [ 1  −τ ;  −τ  1 ],    (23)

whereas for the graph without the edge we simply have Σ_0 = Σ_0^{−1} = I. Both of these choices are consistent with the normalization in (5), since the ratio |Θ_ij| / √(Θ_ii Θ_jj) equals τ on the edge.

The divergence between two zero-mean Gaussian vectors of dimension k is

    D(P_1 ∥ P_0) = (1/2) [ Tr(Σ_0^{−1} Σ_1) − k + log( det Σ_0 / det Σ_1 ) ],    (24)

and with the above covariance matrices and k = 2, this simplifies to

    D(P_1 ∥ P_0) = (1/2) log( 1 / (1 − τ²) ).    (25)

Hence, in analogy with (16), we obtain

    n ≥ [ 4p log p / log(1/(1 − τ²)) ] (1 − δ − o(1)).    (26)
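The simplification from (24) to (25) can be checked numerically; the sketch below (our own, assuming the unit-diagonal form of Σ_1 in (23)) compares the matrix formula against the closed form.

```python
import numpy as np

def gaussian_kl(S1, S0):
    """D(N(0, S1) || N(0, S0)) via the formula in (24)."""
    k = S1.shape[0]
    return 0.5 * (np.trace(np.linalg.inv(S0) @ S1) - k
                  + np.log(np.linalg.det(S0) / np.linalg.det(S1)))

tau = 0.3
Sigma1 = np.array([[1.0, tau], [tau, 1.0]])   # single edge, cf. (23)
Sigma0 = np.eye(2)                            # no edge
closed_form = 0.5 * np.log(1 / (1 - tau**2))  # cf. (25)
assert np.isclose(gaussian_kl(Sigma1, Sigma0), closed_form)
print(gaussian_kl(Sigma1, Sigma0), closed_form)
```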
4.3.2 Second Bound for the Gaussian Model

We make use of the following ensemble, which is similar to one in [9], but with multiple cliques as opposed to only a single one. It can also be thought of as a generalization of Ensemble 1, which corresponds to m = 2.

Ensemble 4(m) [Disjoint cliques ensemble]:
• Each graph in T consists of ⌊p/m⌋ disjoint cliques of m nodes that may otherwise be arbitrary.

The total number of graphs is C(p, m) C(p−m, m) ⋯ C(2m, m) C(m, m) (or analogously when m does not divide p), which is lower bounded by C(⌊p/2⌋, m)^{(1/2)⌊p/m⌋}, yielding

    log|T| ≥ (1/2) ⌊p/m⌋ log C(⌊p/2⌋, m) = ( (p/2) log(p/m) )(1 + o(1)),    (27)

assuming that m = o(p) and hence log C(⌊p/2⌋, m) = ( m log(p/m) )(1 + o(1)). We choose m = d + 1 so that the maximal degree is d, yielding

    log|T| ≥ ( (p/2) log(p/d) )(1 + o(1)).    (28)

As in [9], we let the inverse covariance matrix associated with a single clique be given by

    Σ_1^{−1} = [ 1+a  a  ⋯  a ;  a  1+a  ⋯  a ;  ⋮ ;  a  a  ⋯  1+a ],    (29)

for a > 0, yielding a covariance matrix given by

    Σ_1 = (1/(1 + ma)) [ 1+(m−1)a  −a  ⋯  −a ;  −a  1+(m−1)a  ⋯  −a ;  ⋮ ;  −a  −a  ⋯  1+(m−1)a ].    (30)

We set a = τ/(1 − τ) to ensure that the ratio of off-diagonals to diagonals in Σ_1^{−1} is τ, in accordance with (5). Note that this form of the inverse covariance matrix is slightly different from that in (5), but the difference only amounts to scaling all observations by a factor of √(1 + a), and hence the recovery problem is unchanged regardless of which form is assumed.

To obtain a bound of the form (12), we let Q be jointly Gaussian with mean zero and identity covariance matrix, defining Σ_0 = Σ_0^{−1} = I accordingly.

We first study the behavior of the divergence D(P_{G(z)} ∥ Q(z)) when all of the non-zero values of z correspond to nodes within a single clique in G; in this case, z contains m̃ ∈ {1, …, m} non-zero entries. Letting Σ̃_1 denote an arbitrary sub-matrix of Σ_1 corresponding to m̃ ∈ {1, …, m} nodes, a straightforward computation gives

    det Σ̃_1 = (1 + (m − m̃)a) / (1 + ma) = 1 − m̃a/(1 + ma),    (31)

    Tr(Σ̃_1) = m̃ (1 + (m − 1)a) / (1 + ma) = m̃ ( 1 − a/(1 + ma) ).    (32)

Defining Σ̃_0 analogously simply gives Σ̃_0 = Σ̃_0^{−1} = I, and hence (24) with k = m̃ gives

    D(P̃_1 ∥ P̃_0) = (1/2) [ −log( 1 − m̃a/(1 + ma) ) − m̃a/(1 + ma) ]    (33)

for P̃_0 ∼ N(0, Σ̃_0) and P̃_1 ∼ N(0, Σ̃_1).
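As a sanity check (ours, not the paper's), the closed form (33) can be compared against a direct evaluation of (24) on a sub-matrix of (30); the clique size, number of observed nodes, and value of a below are arbitrary.

```python
import numpy as np

def gaussian_kl(S1, S0):
    """D(N(0, S1) || N(0, S0)) via (24)."""
    k = S1.shape[0]
    return 0.5 * (np.trace(np.linalg.inv(S0) @ S1) - k
                  + np.log(np.linalg.det(S0) / np.linalg.det(S1)))

m, m_obs, a = 6, 4, 0.5   # clique size, observed nodes, a = tau / (1 - tau)
Sigma1 = np.linalg.inv(np.eye(m) + a * np.ones((m, m)))  # clique covariance, cf. (29)-(30)
sub = Sigma1[:m_obs, :m_obs]                             # marginal of the m_obs observed nodes
closed_form = 0.5 * (-np.log(1 - m_obs * a / (1 + m * a)) - m_obs * a / (1 + m * a))  # cf. (33)
assert np.isclose(gaussian_kl(sub, np.eye(m_obs)), closed_form)
print(gaussian_kl(sub, np.eye(m_obs)), closed_form)
```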
Suppose now that a single measurement consists of n(z) nodes indexed by z ∈ {0, 1}^p. For a fixed graph G ∈ T, this amounts to observing m̃_j nodes from each clique j = 1, …, ⌊p/m⌋, for some integers {m̃_j} such that Σ_{j=1}^{⌊p/m⌋} m̃_j = n(z). Since the divergence is additive for independent products, we obtain

    D(P_{G(z)} ∥ Q(z)) = (1/2) Σ_{j=1}^{⌊p/m⌋} [ −log( 1 − m̃_j a/(1 + ma) ) − m̃_j a/(1 + ma) ].    (34)

To simplify the subsequent exposition, we write the summation as

    Σ_{j=1}^{⌊p/m⌋} β_j f(β_j),    (35)

where β_j = m̃_j a/(1 + ma) and f(β) = ( −log(1 − β) − β ) / β. We consider the maximization of (35) subject to 0 ≤ β_j ≤ ma/(1 + ma) and Σ_j β_j = n(z) a/(1 + ma), where these constraints follow immediately from 0 ≤ m̃_j ≤ m and Σ_j m̃_j = n(z). It is easy to verify that the function f(β) is increasing in β, and therefore the maximal value of (35) is obtained by setting as many values of β_j as possible to the maximum value ma/(1 + ma), and letting an additional value of β_j equal the remainder (if any). This amounts to setting as many values of m̃_j as possible to m, and letting an additional value of m̃_j equal the remainder.

The corresponding maximum value is

    Σ_{j=1}^{⌊p/m⌋} β_j f(β_j) = ⌊n(z)/m⌋ · (ma/(1 + ma)) f( ma/(1 + ma) ) + (ra/(1 + ma)) f( ra/(1 + ma) )    (36)
                              ≤ (n(z)/m) · (ma/(1 + ma)) f( ma/(1 + ma) ),    (37)

where r denotes the remainder value (i.e., the additional value of m̃_j mentioned above), and (37) follows by writing f(ra/(1 + ma)) ≤ f(ma/(1 + ma)) using the above-mentioned monotonicity of f.

Roughly speaking, we have argued that given a budget of n(z) nodes to observe, the ones that yield a graph that is "furthest" from the empty graph are those that correspond to ⌊n(z)/m⌋ complete m-cliques, with any remainder also concentrated within a single clique. Intuitively, this is because taking measurements from a variety of different cliques yields more independent nodes, thus being closer to the behavior of the empty graph in which all nodes are independent.

Upper bounding the summation on the right-hand side of (34) by the maximum value (37), we obtain

    D(P_{G(z)} ∥ Q(z)) ≤ (n(z)/(2m)) [ −log( 1 − ma/(1 + ma) ) − ma/(1 + ma) ].    (38)

Applying the inequality −log(1 − β/(1+β)) − β/(1+β) ≤ (1/2) log(1 + β²), we can weaken (38) to

    D(P_{G(z)} ∥ Q(z)) ≤ (n(z)/(4m)) log( 1 + (ma)² ).    (39)

We obtain from (39) and (12) that

    I(G; X^{(i)} | Z^{(i)}) ≤ ( E[n(Z^{(i)})]/(4m) ) log( 1 + (ma)² ),    (40)

and summing over i and again noting that Σ_{i=1}^N n(Z^{(i)}) ≤ n with probability one, we obtain

    Σ_{i=1}^N I(G; X^{(i)} | Z^{(i)}) ≤ (n/(4m)) log( 1 + (ma)² ).    (41)

Substitution into (11) yields the necessary condition

    n ≥ [ 2pd log(p/d) / log( 1 + ((d+1)τ/(1−τ))² ) ] (1 − δ − o(1)),    (42)

where the numerator arises from (28), and we have set m = d + 1 and a = τ/(1 − τ).
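The elementary inequality used in passing from (38) to (39) is stated without proof; a quick numerical scan (our own check, not from the paper) is given below.

```python
import numpy as np

# Check that -log(1 - b/(1+b)) - b/(1+b) <= 0.5 * log(1 + b^2) for b > 0,
# which is the inequality used to weaken (38) into (39).
b = np.linspace(1e-6, 1e3, 200_000)
lhs = -np.log(1 - b / (1 + b)) - b / (1 + b)
rhs = 0.5 * np.log(1 + b**2)
assert np.all(lhs <= rhs + 1e-12)
print("max(lhs - rhs) =", float(np.max(lhs - rhs)))  # non-positive
```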
5 Discussion: Average Degree vs. Maximal Degree

The question of whether the maximal degree d_max or the average degree d_avg dictates the performance of active graphical model selection was raised in [7],¹ where it was suggested that it is the latter in the Gaussian case if τ is bounded away from zero. Our results are proved by considering restricted ensembles for which d_avg = d_max(1 + o(1)), and hence it is not immediately clear which is more fundamental. We proceed by discussing the two for both Ising models and Gaussian models.

¹ More precisely, [7] considers the quantity d̄_max defined in Section 1.2, but this coincides with d_avg for all ensembles considered in this paper, at least up to a multiplicative 1 + o(1) term.

We first remark that the first terms in each of (8) and (9) do not contain d, and they were proved by considering an ensemble where every node has degree exactly one. Moreover, the third term in (8) is trivially obtained by counting the number of graphs with maximal degree d, without any further restrictions, and it is unclear how to adapt this to gain insight on the role of the average degree. Hence, to provide a distinction between d_max and d_avg, we focus only on the second terms in (8) and (9).

For the Ising model, the second term in (8) was obtained by considering ⌊p/(d+1)⌋ cliques of size d + 1, and considering graphs obtained by subsequently removing a single edge, cf. Section 4.2.2. In the supplementary material, we describe an analogous ensemble in which these cliques have different sizes, and show that the term e^{λd_max} still arises in the resulting sample complexity bound. Intuitively, this is because even if all cliques except the largest are known perfectly and an edge is removed from the largest one, it is still very difficult to identify that edge. Hence, regarding this exponential term (which is the main feature of the bound), it is d_max that dictates the performance here.

For the Gaussian model, the second term in (9) was obtained by considering graphs containing ⌊p/(d+1)⌋ cliques of size d + 1, cf. Section 4.3.2. In the supplementary material, we provide a natural extension of this ensemble which instead uses cliques of differing sizes (d_1 + 1, …, d_K + 1) such that Σ_{k=1}^K (d_k + 1) = p. We make the mild assumption that each of these degrees behaves as d_k = o(p). The most straightforward extension of the proof of Theorem 2 yields a bound of the form n = Ω( p d_min log(p/d_max) / log(1 + τ d_max) ), where d_min is the minimum degree. This bound is rarely tight, but it can be improved by a genie argument: reveal to the decoder all of the smallest cliques, up to a total of (1 − α)p nodes for some α ∈ (0, 1). The decoder is then left to estimate the remaining cliques among αp nodes. In the supplementary material, we show that as long as α is bounded away from zero and one, this approach yields a sample complexity lower bound of the form n = Ω( p d^{(α)}_min log(p/d_max) / log(1 + τ d_max) ), where d^{(α)}_min is the minimum degree among the remaining αp nodes. If the top αp node degrees in the graph coincide to within a constant factor, then we have d^{(α)}_min = Θ(d_avg), and we thus match the O((1 + d_avg) p log p) upper bound from [7] for fixed τ, up to a logarithmic factor.

These observations support the idea proposed in [7] that the average degree is the more fundamental quantity in the Gaussian setting with fixed τ. Note, however, that the assumptions are slightly different, due to the coherence assumption made in [7] and the above assumption on the top αp node degrees.

6 Conclusion

We have provided lower bounds on active learning for graphical model selection. Using a variety of restricted graph ensembles, we recovered bounds analogous to those for the passive setting, suggesting that active learning does not help much in the minimax sense for the degree-bounded class G_d. Moreover, we identified an ensemble for the Ising model in which the maximal degree remains the crucial quantity, and another ensemble for the Gaussian model in which the average degree is the more important quantity. We note that our analysis also readily extends to the edge-bounded class G_k, in which all graphs have at most k edges, analogously to previous works such as [8, 23].

An important direction for further research is to characterize the gain (if any) that can be achieved by active learning in the case of random graphs (e.g., Erdős–Rényi [20, 25], power law [21]), in which the maximal and average degrees can differ considerably. Moreover, it would be of interest to understand the role of active learning when the edges have differing parameters {λ_ij} in the Ising model, or when the values τ_ij = |Θ_ij| / √(Θ_ii Θ_jj) differ in the Gaussian model.

Acknowledgment

This work was supported in part by the European Commission under Grant ERC Future Proof, SNF 200021-146750 and SNF CRSII2-147633, and by the 'EPFL Fellows' programme (Horizon 2020 grant 665667).
References

[1] S. Geman and D. Geman, "Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images," IEEE Trans. Patt. Analysis and Mach. Intel., no. 6, pp. 721–741, 1984.
[2] R. J. Glauber, "Time-dependent statistics of the Ising model," J. Math. Phys., vol. 4, no. 2, pp. 294–307, 1963.
[3] R. Durbin, S. R. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge Univ. Press, 1998.
[4] C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing. MIT Press, 1999.
[5] S. Wasserman and K. Faust, Social Network Analysis: Methods and Applications. Cambridge Univ. Press, 1994, vol. 8.
[6] D. M. Chickering, "Learning Bayesian networks is NP-complete," in Learning from Data. Springer, 1996, pp. 121–130.
[7] G. Dasarathy, A. Singh, M.-F. Balcan, and J. H. Park, "Active learning algorithms for graphical model selection," in Int. Conf. Art. Intel. Stats. (AISTATS), 2016.
[8] N. Santhanam and M. Wainwright, "Information-theoretic limits of selecting binary graphical models in high dimensions," IEEE Trans. Inf. Theory, vol. 58, no. 7, pp. 4117–4134, July 2012.
[9] W. Wang, M. Wainwright, and K. Ramchandran, "Information-theoretic bounds on model selection for Gaussian Markov random fields," in IEEE Int. Symp. Inf. Theory, 2010.
[10] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, Inc., 2006.
[11] E. Arias-Castro, E. J. Candes, and M. A. Davenport, "On the fundamental limits of adaptive sensing," IEEE Trans. Inf. Theory, vol. 59, no. 1, pp. 472–481, Jan. 2013.
[12] J. Scarlett and V. Cevher, "Phase transitions in group testing," in Proc. ACM-SIAM Symp. Disc. Alg. (SODA), 2016.
[13] L. Baldassini, O. Johnson, and M. Aldridge, "The capacity of adaptive group testing," in IEEE Int. Symp. Inf. Theory, July 2013, pp. 2676–2680.
[14] R. M. Castro and R. D. Nowak, "Minimax bounds for active learning," IEEE Trans. Inf. Theory, vol. 54, no. 5, pp. 2339–2353, 2008.
[15] J. Haupt, R. M. Castro, and R. Nowak, "Distilled sensing: Adaptive sampling for sparse detection and estimation," IEEE Trans. Inf. Theory, vol. 57, no. 9, pp. 6222–6235, 2011.
[16] Y. Polyanskiy, H. V. Poor, and S. Verdú, "Feedback in the non-asymptotic regime," IEEE Trans. Inf. Theory, vol. 57, no. 8, pp. 4903–4925, 2011.
[17] K. P. Murphy, "Active learning of causal Bayes net structure," technical report, UC Berkeley, 2001.
[18] S. Tong and D. Koller, "Active learning for structure in Bayesian networks," in Int. Joint Conf. Art. Intel., 2001.
[19] G. Bresler, E. Mossel, and A. Sly, "Reconstruction of Markov random fields from samples: Some observations and algorithms," in Appr., Rand. and Comb. Opt. Algorithms and Techniques. Springer Berlin Heidelberg, 2008, pp. 343–356.
[20] A. Anandkumar, V. Y. F. Tan, F. Huang, and A. S. Willsky, "High-dimensional structure estimation in Ising models: Local separation criterion," Ann. Stats., vol. 40, no. 3, pp. 1346–1375, 2012.
[21] R. Tandon and P. Ravikumar, "On the difficulty of learning power law graphical models," in IEEE Int. Symp. Inf. Theory, 2013.
[22] K. Shanmugam, R. Tandon, A. Dimakis, and P. Ravikumar, "On the information theoretic limits of learning Ising models," in Adv. Neur. Inf. Proc. Sys. (NIPS), 2014.
[23] J. Scarlett and V. Cevher, "On the difficulty of selecting Ising models with approximate recovery," 2016, accepted to IEEE Trans. Sig. Inf. Proc. over Networks.
[24] D. Vats and J. M. Moura, "Necessary conditions for consistent set-based graphical model selection," in IEEE Int. Symp. Inf. Theory, 2011, pp. 303–307.
[25] A. Anandkumar, V. Y. F. Tan, F. Huang, and A. S. Willsky, "High-dimensional Gaussian graphical model selection: Walk summability and local separation criterion," J. Mach. Learn. Res., vol. 13, pp. 2293–2337, 2012.
[26] V. Jog and P.-L. Loh, "On model misspecification and KL separation for Gaussian graphical models," in IEEE Int. Symp. Inf. Theory, 2015.
[27] R. Wu, R. Srikant, and J. Ni, "Learning loosely connected Markov random fields," Stoch. Sys., vol. 3, no. 2, pp. 362–404, 2013.
[28] A. Jalali, C. C. Johnson, and P. K. Ravikumar, "On learning discrete graphical models using greedy methods," in Adv. Neur. Inf. Proc. Sys. (NIPS), 2011.
[29] A. Ray, S. Sanghavi, and S. Shakkottai, "Greedy learning of graphical models with small girth," in Allerton Conf. Comm., Control, and Comp., 2012.
[30] G. Bresler, D. Gamarnik, and D. Shah, "Structure learning of antiferromagnetic Ising models," in Adv. Neur. Inf. Proc. Sys. (NIPS), 2014.
[31] G. Bresler, "Efficiently learning Ising models on arbitrary graphs," in ACM Symp. Theory Comp. (STOC), 2015.
[32] P. Ravikumar, M. J. Wainwright, J. D. Lafferty, and B. Yu, "High-dimensional Ising model selection using ℓ1-regularized logistic regression," Ann. Stats., vol. 38, no. 3, pp. 1287–1319, 2010.
[33] P. Ravikumar, M. J. Wainwright, G. Raskutti, and B. Yu, "High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence," Elec. J. Stats., vol. 5, pp. 935–980, 2011.
[34] N. Meinshausen and P. Bühlmann, "High-dimensional graphs and variable selection with the Lasso," Ann. Stats., vol. 34, no. 3, pp. 1436–1462, June 2006.
[35] E. Yang, A. C. Lozano, and P. K. Ravikumar, "Elementary estimators for graphical models," in Adv. Neur. Inf. Proc. Sys. (NIPS), 2014, pp. 2159–2167.
[36] E. Ising, "Beitrag zur Theorie des Ferromagnetismus," Zeitschrift für Physik A Hadrons and Nuclei, vol. 31, no. 1, pp. 253–258, 1925.
[37] S. L. Lauritzen, Graphical Models. Clarendon Press, 1996.

Supplementary Material
"Lower Bounds on Active Learning for Graphical Model Selection" (Scarlett and Cevher, AISTATS 2017)

A Proof of Lemma 1

We start with the following form of Fano's inequality [22, Lemma 1]:

    1 ≥ [ log|T| / I(G; X) ] · ( 1 − δ − log 2 / log|T| ),    (43)

where X = (X^{(1)}, …, X^{(N)}). This remains valid in the active learning setting, since it only relies on the fact that G → X → Ĝ forms a Markov chain. Despite this common starting point, we bound the mutual information significantly differently.
Defining X^{(1,i)} = (X^{(1)}, …, X^{(i)}), we have²

    I(G; X) = Σ_{i=1}^N I(X^{(i)}; G | X^{(1,i−1)})    (44)
            = Σ_{i=1}^N I(X^{(i)}; G | X^{(1,i−1)}, Z^{(i)})    (45)
            = Σ_{i=1}^N [ H(X^{(i)} | X^{(1,i−1)}, Z^{(i)}) − H(X^{(i)} | X^{(1,i−1)}, Z^{(i)}, G) ]    (46)
            = Σ_{i=1}^N [ H(X^{(i)} | X^{(1,i−1)}, Z^{(i)}) − H(X^{(i)} | Z^{(i)}, G) ]    (47)
            ≤ Σ_{i=1}^N [ H(X^{(i)} | Z^{(i)}) − H(X^{(i)} | G, Z^{(i)}) ]    (48)
            = Σ_{i=1}^N I(G; X^{(i)} | Z^{(i)}),    (49)

where (44) follows from the chain rule, (45) follows since Z^{(i)} is a function of X^{(1,i−1)}, (47) follows since X^{(i)} is conditionally independent of X^{(1,i−1)} given (G, Z^{(i)}), and (48) follows since conditioning reduces entropy. This completes the proof of (11).

² Here H represents entropy in the discrete case (e.g., Ising), and differential entropy in the continuous case (e.g., Gaussian).

Conditioned on Z^{(i)} = z^{(i)}, the only variables in X^{(i)} conveying information about G are those corresponding to entries where z^{(i)} is one, since the others deterministically equal ∗. By applying the mutual information upper bound of [22] (see the proof of Corollary 2 therein) to the restricted graph G(z^{(i)}) with an auxiliary distribution Q(z^{(i)}), we obtain that

    D(P_{G(z^{(i)})} ∥ Q(z^{(i)})) ≤ ε(z^{(i)}) for all G ∈ T   ⟹   I(G; X^{(i)} | Z^{(i)} = z^{(i)}) ≤ ε(z^{(i)}).    (50)

Note that conditioned on Z^{(i)} = z^{(i)}, the graph G may no longer be uniform on T; the preceding claim remains valid since the proof of [22, Cor. 2] applies to general graph distributions that need not be uniform. Finally, the inequality in (12) follows by averaging both sides of the mutual information bound in (50) over Z^{(i)}.
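As a toy illustration of the bound (50) (our own construction; the two-graph class and the value of λ are arbitrary), one can compute I(G; X) exactly for G uniform over the empty graph and a single-edge graph on two nodes, with one fully observed sample, and compare it against the largest divergence max_G D(P_G ∥ Q) with Q = P_{G_0}.

```python
import itertools
import math

def ising_pmf(edges, lam, p=2):
    xs = list(itertools.product([-1, 1], repeat=p))
    w = [math.exp(lam * sum(x[i] * x[j] for (i, j) in edges)) for x in xs]
    Z = sum(w)
    return {x: wi / Z for x, wi in zip(xs, w)}

lam = 0.8
P0 = ising_pmf([], lam)          # empty graph G_0
P1 = ising_pmf([(0, 1)], lam)    # single-edge graph G_1
mix = {x: 0.5 * P0[x] + 0.5 * P1[x] for x in P0}  # marginal of X under uniform G

# Exact mutual information I(G; X) for G uniform on {G_0, G_1}, one full observation.
mi = 0.5 * sum(P0[x] * math.log(P0[x] / mix[x]) for x in P0) \
   + 0.5 * sum(P1[x] * math.log(P1[x] / mix[x]) for x in P1)

# Bound of the form (50) with Q = P_{G_0}: the largest divergence over the class.
radius = sum(P1[x] * math.log(P1[x] / P0[x]) for x in P1)
assert mi <= radius <= lam * math.tanh(lam) + 1e-12
print(mi, radius, lam * math.tanh(lam))
```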
B Ensemble and Sample Complexity for Comparing the Average Degree and Maximal Degree (Ising Model)

Formalizing the discussion on the Ising model in Section 5, we introduce the following analog of Ensemble 2, consisting of some number L of variable-size cliques, each with an edge removed.

Ensemble 2a(m_1, …, m_L) [Variable-size edge-removed cliques ensemble]:
• Form L arbitrary node-disjoint cliques of sizes (m_1, …, m_L), to obtain a base graph G_0.
• Each graph in T is obtained by removing a single edge from each of the L cliques.

We have the following.

Lemma 2. Fix the integers L and (m_1, …, m_L) with Σ_{j=1}^L m_j = p, and let G be drawn uniformly from Ensemble 2a(m_1, …, m_L). Then in order to achieve P̄_e ≤ δ, it is necessary that

    n ≥ [ e^{λ d_max} log( d_max(d_max + 1) ) / (2λ d_max e^λ) ] · ( 1 − δ − log 2 / log(d_max + 1) ),    (51)

where d_max = max_{j=1,…,L} m_j − 1.

Proof. We consider a genie argument, in which the decoder is informed of all of the removed edges from the cliques, except for the largest clique, whose size is d_max + 1. In this case, the analysis reduces to that of Ensemble 2(d_max + 1) on a graph with p = d_max + 1 nodes. The result now follows immediately from (20), recalling that the o(1) remainder term therein is equal to log 2 / log|T| from (11).

C Ensemble and Sample Complexity for Comparing the Average Degree and Maximal Degree (Gaussian Model)

Formalizing the discussion on the Gaussian model in Section 5, we introduce the following ensemble, consisting of some number L of variable-size cliques.

Ensemble 4a(m_1, …, m_L) [Disjoint variable-size cliques ensemble]:
• Each graph in T consists of L disjoint cliques of sizes (m_1, …, m_L) that may otherwise be arbitrary.

We have the following.

Lemma 3. Fix the integers L and (m_1, …, m_L) with Σ_{j=1}^L m_j = p and max_{j=1,…,L} m_j = o(p), and let G be drawn uniformly from Ensemble 4a(m_1, …, m_L). Then for any α ∈ (0, 1) (not depending on p), in order to achieve P̄_e ≤ δ, it is necessary that

    n ≥ [ 2αp d^{(α)}_min log(p/d_max) / log( 1 + ((d_max + 1)τ/(1−τ))² ) ] · (1 − δ − o(1)),    (52)

where d_max = max_{j=1,…,L} m_j − 1, and d^{(α)}_min is the minimum degree among the αp nodes having the largest degree.³

³ This is the same for all graphs in the ensemble, so d^{(α)}_min is well-defined here.

Proof. We again consider a genie argument, in which the decoder is informed of all of the cliques except the largest ones, such that these remaining cliques form a total of αp nodes.⁴ Assuming without loss of generality that the m_j are in decreasing order, the analysis reduces to the study of Ensemble 4a on a graph with αp nodes, and cliques of sizes (m_1, …, m_{L′}), where L′ ≤ L is defined such that Σ_{j=1}^{L′} m_j = αp.

⁴ Since m_j = o(p) for all j, we can safely ignore rounding and assume that the total is exactly αp.

For this reduced ensemble, the total number of graphs is C(αp, m_1) C(αp − m_1, m_2) ⋯ C(αp − Σ_{j=1}^{L′−2} m_j, m_{L′−1}) C(m_{L′}, m_{L′}). We let L″ be the largest integer such that Σ_{j=1}^{L″} m_j ≤ αp/2, and write

    log|T| ≥ Σ_{j=1}^{L″} log C(⌊αp/2⌋, m_j)    (53)
          = Σ_{j=1}^{L″} ( m_j log( αp/(2m_j) ) )(1 + o(1))    (54)
          ≥ ( (αp/2) log( αp/(2m_1) ) )(1 + o(1))    (55)
          = ( (αp/2) log(p/d_max) )(1 + o(1)),    (56)

where (54) follows since m_j = o(αp) by assumption, (55) follows by first applying m_j ≤ m_1 inside the logarithm and then applying the definition of L″, and (56) follows since m_1 = d_max + 1 by definition.

We now follow the analysis of Section 4.3.2, and note that if a single measurement consists of n(z) nodes indexed by z ∈ {0, 1}^p, and this corresponds to observing m̃_j nodes from each clique j = 1, …, L′, then we have the following analog of (34):

    D(P_{G(z)} ∥ Q(z)) = (1/2) Σ_{j=1}^{L′} [ −log( 1 − m̃_j a/(1 + m_j a) ) − m̃_j a/(1 + m_j a) ],    (57)

where Q(z) and a are defined as in Section 4.3.2. Defining β_j = m̃_j a/(1 + m_j a) and f(β) = ( −log(1 − β) − β ) / β, we can write the right-hand side of (57) as

    Σ_{j=1}^{L′} β_j f(β_j).    (58)

As a result, we consider the maximization of (58) subject to 0 ≤ β_j ≤ m_j a/(1 + m_j a) and Σ_j β_j (1 + m_j a) = n(z) a, where these constraints follow immediately from 0 ≤ m̃_j ≤ m_j and Σ_j m̃_j = n(z). While the optimal choices of {β_j} for this maximization problem are unclear, we observe that the final objective value can only increase if we relax the second constraint to Σ_j β_j (1 + m^{(α)}_min a) ≤ n(z) a, where m^{(α)}_min = m_{L′} = d^{(α)}_min + 1. With this modification, we find, similarly to the maximization in Section 4.3.2, that the maximum is achieved by setting β_j to its maximum value m_j a/(1 + m_j a) (i.e., m̃_j = m_j) for as many of the largest cliques as is permitted by the constraint Σ_j β_j (1 + m^{(α)}_min a) ≤ n(z) a. Since each clique under consideration has at least m^{(α)}_min nodes, this amounts to at most n(z)/m^{(α)}_min cliques.
Moreover, since βf(β) is increasing in β, the corresponding values of β_j f(β_j) are upper bounded by ( m_max a/(1 + m_max a) ) f( m_max a/(1 + m_max a) ), where m_max = m_1. Combining these observations, we obtain the following analog of (37):

    Σ_{j=1}^{L′} β_j f(β_j) ≤ ( n(z)/m^{(α)}_min ) · ( m_max a/(1 + m_max a) ) f( m_max a/(1 + m_max a) ),    (59)

and accordingly, using the same subsequent steps, we obtain the following analog of (41):

    Σ_{i=1}^N I(G; X^{(i)} | Z^{(i)}) ≤ ( n/(4 m^{(α)}_min) ) log( 1 + (m_max a)² ).    (60)

The proof is concluded using (11) along with the cardinality bound in (56), and recalling that m^{(α)}_min = d^{(α)}_min + 1, m_max = d_max + 1, and a = τ/(1 − τ).
