Lifted Tree-Reweighted Variational Inference


Authors: Hung Hai Bui, Tuyen N. Huynh, David Sontag

Hung Hai Bui (Natural Language Understanding Lab, Nuance Communications, Sunnyvale, CA, USA; bui.h.hung@gmail.com), Tuyen N. Huynh (John von Neumann Institute, Vietnam National University, Ho Chi Minh City; tuyen.huynh@jvn.edu.vn), David Sontag (Courant Institute of Mathematical Sciences, New York University; dsontag@cs.nyu.edu)

Abstract

We analyze variational inference for highly symmetric graphical models such as those arising from first-order probabilistic models. We first show that for these graphical models, the tree-reweighted variational objective lends itself to a compact lifted formulation which can be solved much more efficiently than the standard TRW formulation for the ground graphical model. Compared to earlier work on lifted belief propagation, our formulation leads to a convex optimization problem for lifted marginal inference and provides an upper bound on the partition function. We provide two approaches for improving the lifted TRW upper bound. The first is a method for efficiently computing maximum spanning trees in highly symmetric graphs, which can be used to optimize the TRW edge appearance probabilities. The second is a method for tightening the relaxation of the marginal polytope using lifted cycle inequalities and novel exchangeable cluster consistency constraints.

1 Introduction

Lifted probabilistic inference focuses on exploiting symmetries in probabilistic models for efficient inference [5, 2, 3, 10, 17, 18, 21]. Work in this area has demonstrated the possibility of performing very efficient inference in highly-connected, large tree-width, but symmetric models, such as those arising in the context of relational (first-order) probabilistic models and exponential family random graphs [19].
These models also arise frequently in probabilistic programming languages, an area of increasing importance as demonstrated by DARPA's PPAML program (Probabilistic Programming for Advancing Machine Learning). Even though lifted inference can sometimes offer order-of-magnitude improvements in performance, approximation is still necessary. [Footnote: The main article appeared in UAI 2014; this version also includes the supplementary material.] A topic of particular interest is the interplay between lifted inference and variational approximate inference. Lifted loopy belief propagation (LBP) [13, 21] was one of the first attempts at exploiting symmetry to speed up loopy belief propagation; subsequently, counting belief propagation (CBP) [16] provided additional insights into the nature of symmetry in BP. Nevertheless, these works were largely procedural and specific to the choice of message-passing algorithm (in this case, loopy BP). More recently, Bui et al. [3] proposed a general framework for lifting a broad class of convex variational techniques by formalizing the notion of symmetry (defined via automorphism groups) of graphical models and of the corresponding variational optimization problems themselves, independent of any specific methods or solvers.

Our goal in this paper is to extend the lifted variational framework of [3] to address the important case of approximate marginal inference. In particular, we show how to lift the tree-reweighted (TRW) convex formulation of marginal inference [28]. As far as we know, our work presents the first lifted convex variational marginal inference, with the following benefits over previous work: (1) a lifted convex upper bound on the log-partition function, (2) a new tightening of the relaxation of the lifted marginal polytope exploiting exchangeability, and (3) a convergent inference algorithm.
We note that convex upper bounds on the log-partition function immediately lead to concave lower bounds on the log-likelihood, which can serve as useful surrogate loss functions in learning and parameter estimation [29, 13].

To achieve the above goal, we first analyze the symmetry of the TRW log-partition and entropy bounds. Since TRW bounds depend on the choice of the edge appearance probabilities $\rho$, we prove that the quality of the TRW bound is not affected if one works only with suitably symmetric $\rho$. Working with symmetric $\rho$ gives rise to an explicit lifted formulation of the TRW optimization problem that is equivalent but much more compact. This convex objective function can be convergently optimized via a Frank-Wolfe (conditional gradient) method where each Frank-Wolfe iteration solves a lifted MAP inference problem. We then discuss the optimization of the edge-appearance vector $\rho$, effectively yielding a lifted algorithm for computing maximum spanning trees in symmetric graphs.

As in Bui et al.'s framework, our work can benefit from any tightening of the local polytope, such as the use of cycle inequalities [1, 23]. In fact, each method for relaxing the marginal polytope immediately yields a variant of our algorithm. Notably, in the case of exchangeable random variables, radically sharper tightening (sometimes even an exact characterization of the lifted marginal polytope) can be obtained via a set of simple and elegant linear constraints which we call exchangeable polytope constraints. We provide extensive simulation studies comparing the behaviors of different variants of our algorithm with exact inference (when available) and lifted LBP, demonstrating the advantages of our approach. The supplementary material [4] provides additional proofs and algorithm details.

2 Background

We begin by reviewing variational inference and the tree-reweighted (TRW) approximation.
We focus on inference in Markov random fields, which are distributions in the exponential family given by $\Pr(x; \theta) = \exp\{\langle \Phi(x), \theta \rangle - A(\theta)\}$, where $A(\theta)$ is called the log-partition function and serves to normalize the distribution. We assume that the random variables $x \in \mathcal{X}^n$ are discrete-valued, and that the features $(\Phi_i)$, $i \in I$, factor according to the graphical model structure $G$; $\Phi$ can be non-pairwise and is assumed to be overcomplete. This paper focuses on the inference tasks of estimating the marginal probabilities $p(x_i)$ and approximating the log-partition function. Throughout the paper, the domain $\mathcal{X}$ is the binary domain $\{0, 1\}$; however, except for the construction of exchangeable polytope constraints in Section 6, this restriction is not essential.

Variational inference approaches view the log-partition function as a convex optimization problem over the marginal polytope, $A(\theta) = \sup_{\mu \in \mathcal{M}(G)} \langle \mu, \theta \rangle - A^*(\mu)$, and seek tractable approximations of the negative entropy $A^*$ and the marginal polytope $\mathcal{M}$ [27]. Formally, $-A^*(\mu)$ is the entropy of the maximum entropy distribution with moments $\mu$. Observe that $-A^*(\mu)$ is upper bounded by the entropy of the maximum entropy distribution consistent with any subset of the expected sufficient statistics $\mu$. To arrive at the TRW approximation [26], one uses a subset given by the pairwise moments of a spanning tree.¹ Hence, for any distribution $\rho$ over spanning trees, an upper bound on $-A^*$ is obtained by taking a convex combination of tree entropies: $-B^*(\tau, \rho) = \sum_{s \in V(G)} H(\tau_s) - \sum_{e \in E(G)} I(\tau_e)\,\rho_e$. Since $\rho$ is a distribution over spanning trees, it must belong to the spanning tree polytope $\mathbb{T}(G)$, with $\rho_e$ denoting the edge appearance probability of $e$.
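To make the exponential-family notation concrete, here is a minimal sketch (not part of the paper) that computes the log-partition function $A(\theta) = \log \sum_x \exp\langle \Phi(x), \theta\rangle$ by brute-force enumeration for a tiny pairwise binary MRF; the chain structure and parameter values are illustrative.

```python
import itertools
import math

def log_partition(n, unary, pairwise, edges):
    """Brute-force A(theta) = log sum_x exp<Phi(x), theta> for binary x."""
    total = 0.0
    for x in itertools.product([0, 1], repeat=n):
        score = sum(unary[i][x[i]] for i in range(n))
        score += sum(pairwise[e][x[u]][x[v]] for e, (u, v) in enumerate(edges))
        total += math.exp(score)
    return math.log(total)

# Tiny 3-node chain with illustrative parameters.
edges = [(0, 1), (1, 2)]
unary = [[0.0, 0.5]] * 3                   # theta_i(x_i)
pairwise = [[[0.2, 0.0], [0.0, 0.2]]] * 2  # theta_e(x_u, x_v)
A = log_partition(3, unary, pairwise, edges)
```

This exhaustive sum is exponential in $n$, which is exactly why the variational reformulation below, and its lifted version, matter.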
Combined with a relaxation of the marginal polytope $\text{OUTER} \supset \mathcal{M}$, an upper bound $B$ on the log-partition function is obtained:

$A(\theta) \le B(\theta, \rho) = \sup_{\tau \in \text{OUTER}(G)} \langle \tau, \theta \rangle - B^*(\tau, \rho)$  (1)

[Footnote 1: If the original model contains non-pairwise potentials, they can be represented as cliques in the graphical model, and the bound based on spanning trees still holds.]

We note that $B^*$ is linear w.r.t. $\rho$, and for $\rho \in \mathbb{T}(G)$, $B^*$ is convex w.r.t. $\tau$. On the other hand, $B$ is convex w.r.t. $\rho$ and $\theta$. The optimal solution $\tau^*(\rho, \theta)$ of the optimization problem (1) can be used as an approximation to the mean parameter $\mu(\theta)$. Typically, the local polytope LOCAL given by pairwise consistency constraints is used as the relaxation OUTER; in this paper, we also consider tightenings of the local polytope.

Since (1) holds for any edge appearance $\rho$ in the spanning tree polytope $\mathbb{T}$, the TRW bound can be further improved by optimizing over $\rho$:

$\inf_{\rho \in \mathbb{T}(G)} B(\theta, \rho)$  (2)

The resulting $\rho^*$ is then plugged into (1) to find the marginal approximation. In practice, one might choose to work with some fixed choice of $\rho$, for example the uniform distribution over all spanning trees. [14] proposed using the most uniform edge weights, $\arg\inf_{\rho \in \mathbb{T}(G)} \sum_{e \in E} (\rho_e - \frac{|V|-1}{|E|})^2$, which can be found via conditional gradient, where each direction-finding step solves a maximum spanning tree problem.

Several algorithms have been proposed for optimizing the TRW objective (1) given fixed edge appearance probabilities. [27] derived the tree-reweighted belief propagation algorithm from the fixed point conditions. [8] show how to solve the dual of the TRW objective, which is a geometric program. Although this algorithm has the advantage of guaranteed convergence, it is non-trivial to generalize this approach to use tighter relaxations of the marginal polytope, which we show is essential for lifted inference.
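The entropy combination inside the bound (1) is simple to evaluate given pseudo-marginals. The sketch below (our own illustration, not the paper's code) computes $-B^*(\tau, \rho) = \sum_s H(\tau_s) - \sum_e \rho_e I(\tau_e)$ for pairwise binary pseudo-marginals; on a single edge with $\rho_e = 1$ the combination reduces to the exact joint entropy, since $H(\tau_u) + H(\tau_v) - I(\tau_e) = H(\tau_e)$.

```python
import math

def H(p):
    """Entropy of a discrete distribution given as a flat list."""
    return -sum(q * math.log(q) for q in p if q > 0)

def mutual_info(joint):
    """I(u;v) = H(u) + H(v) - H(u,v) for a 2x2 joint table."""
    pu = [joint[0][0] + joint[0][1], joint[1][0] + joint[1][1]]
    pv = [joint[0][0] + joint[1][0], joint[0][1] + joint[1][1]]
    h_joint = H([joint[a][b] for a in (0, 1) for b in (0, 1)])
    return H(pu) + H(pv) - h_joint

def trw_entropy(node_marg, edge_joint, rho):
    """-B*(tau, rho) = sum_s H(tau_s) - sum_e rho_e I(tau_e)."""
    return (sum(H(p) for p in node_marg)
            - sum(r * mutual_info(j) for j, r in zip(edge_joint, rho)))

# Single edge, rho_e = 1: the bound equals the exact joint entropy.
joint = [[0.4, 0.1], [0.1, 0.4]]
node_marg = [[0.5, 0.5], [0.5, 0.5]]
val = trw_entropy(node_marg, [joint], [1.0])
```

For general graphs with $\rho_e < 1$, the same function gives the convex combination of tree entropies that upper-bounds $-A^*$.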
[14] use an explicit set of spanning trees and then use dual decomposition to solve the optimization problem. However, as we show in the next section, to maintain symmetry it is essential that one not work directly with spanning trees but rather use symmetric edge appearance probabilities. [23] optimize TRW over the local and cycle polytopes using a Frank-Wolfe (conditional gradient) method, where each iteration requires solving a linear program. We follow this latter approach in our paper.

To optimize the edge appearance in (2), [26] proposed using conditional gradient. They observed that $\frac{\partial B(\theta,\rho)}{\partial \rho_e} = -\frac{\partial B^*(\tau^*,\rho)}{\partial \rho_e} = -I(\tau^*_e)$, where $\tau^*$ is the solution of (1). The direction-finding step in conditional gradient reduces to solving $\sup_{\rho \in \mathbb{T}} \langle \rho, I \rangle$, again equivalent to finding the maximum spanning tree with the edge mutual information $I(\tau^*_e)$ as weights. We discuss the corresponding lifted problem in Section 5.

3 Lifted Variational Framework

We build on the key elements of the lifted variational framework introduced in [3]. The automorphism group of a graphical model, or more generally of an exponential family, is defined as the group $\mathbb{A}$ of permutation pairs $(\pi, \gamma)$ where $\pi$ permutes the set of variables and $\gamma$ permutes the set of features in such a way that they preserve the feature function: $\Phi^{\gamma^{-1}}(x^\pi) = \Phi(x)$. Note that this construction of $\mathbb{A}$ is based entirely on the structure of the model and does not depend on the particular choice of the model parameters; nevertheless, the group stabilizes² (preserves) the key characteristics of the exponential family such as the marginal polytope $\mathcal{M}$, the log-partition function $A$, and the entropy $-A^*$. As shown in [3], the automorphism group is particularly useful for exploiting symmetries when parameters are tied.
For a given parameter-tying partition $\Delta$ such that $\theta_i = \theta_j$ for $i, j$ in the same cell³ of $\Delta$, the group $\mathbb{A}$ gives rise to a subgroup called the lifting group $\mathbb{A}_\Delta$ that stabilizes the tied-parameter vector $\theta$ as well as the exponential family. The orbit partition of the lifting group can be used to formulate equivalent but more compact variational problems. More specifically, let $\varphi = \varphi(\Delta)$ be the orbit partition induced by the lifting group on the feature index set $I = \{1 \ldots m\}$, let $\mathbb{R}^m[\varphi]$ denote the symmetrized subspace $\{r \in \mathbb{R}^m \text{ s.t. } r_i = r_j\ \forall i, j \text{ in the same cell of } \varphi\}$, and define the lifted marginal polytope $\mathcal{M}[\varphi]$ as $\mathcal{M} \cap \mathbb{R}^m[\varphi]$. Then (see Theorem 4 of [3])

$\sup_{\mu \in \mathcal{M}} \langle \theta, \mu \rangle - A^*(\mu) = \sup_{\mu \in \mathcal{M}[\varphi]} \langle \theta, \mu \rangle - A^*(\mu)$  (3)

In practice, we need to work with convex variational approximations of the LHS of (3), where $\mathcal{M}$ is relaxed to an outer bound $\text{OUTER}(G)$ and $A^*$ is approximated by a convex function $B^*(\mu)$. We now state a similar result for lifting general convex approximations.

Theorem 1. If $B^*(\mu)$ is convex and stabilized by the lifting group $\mathbb{A}_\Delta$, i.e., for all $(\pi, \gamma) \in \mathbb{A}_\Delta$, $B^*(\mu^\gamma) = B^*(\mu)$, then $\varphi$ is the lifting partition for the approximate variational problem:

$\sup_{\mu \in \text{OUTER}(G)} \langle \theta, \mu \rangle - B^*(\mu) = \sup_{\mu \in \text{OUTER}[\varphi]} \langle \theta, \mu \rangle - B^*(\mu)$  (4)

The importance of Theorem 1 is that it shows that it is equivalent to optimize over a subset of $\text{OUTER}(G)$ where pseudo-marginals in the same orbit are restricted to take the same value. As we will show in Section 4.2, this allows us to combine many of the terms in the objective, which is where the computational gains derive from.

[Footnote 2: Formally, $G$ stabilizes $x$ if $x^g = x$ for all $g \in G$. Footnote 3: If $\Delta = \{\Delta_1 \ldots \Delta_K\}$ is a partition of $S$, then each subset $\Delta_k \subset S$ is called a cell.]

A sketch of the proof of Theorem 1 is as follows. Consider a single pseudo-marginal vector $\mu$.
Since the objective value is the same for every $\mu^\gamma$ with $(\pi, \gamma) \in \mathbb{A}_\Delta$, and since the objective is concave, the average of these, $\frac{1}{|\mathbb{A}_\Delta|} \sum_{(\pi,\gamma) \in \mathbb{A}_\Delta} \mu^\gamma$, must have at least as good an objective value. Furthermore, note that this averaged vector lives in the symmetrized subspace. Thus, it suffices to optimize over $\text{OUTER}[\varphi]$.

4 Lifted Tree-Reweighted Problem

4.1 Symmetry of TRW Bounds

We now show that Theorem 1 can be used to lift the TRW optimization problem (1). Note that the applicability of Theorem 1 is not immediately obvious, since $B^*$ depends on the distribution over trees implicit in $\rho$. In establishing that the condition in Theorem 1 holds, we need to be careful so that the choice of the distribution over trees $\rho$ does not destroy the symmetry of the problem.

The result below ensures that, with no loss in optimality, $\rho$ can be assumed to be suitably symmetric. More specifically, let $\varphi_E = \varphi_E(\Delta)$ be the set of $G$'s edge orbits induced by the action of the lifting group $\mathbb{A}_\Delta$; the edge weights $\rho_e$ for every $e$ in the same edge orbit can be constrained to be the same, i.e., $\rho$ can be restricted to $\mathbb{T}[\varphi_E]$.

Theorem 2. For any $\rho \in \mathbb{T}$, there exists a symmetrized $\hat{\rho} \in \mathbb{T}[\varphi_E]$ that yields at least as good an upper bound, i.e., $B(\theta, \hat{\rho}) \le B(\theta, \rho)$ for all $\theta \in \Theta[\Delta]$. As a consequence, in optimizing the edge appearance, $\rho$ can be restricted to the symmetrized spanning tree polytope $\mathbb{T}[\varphi_E]$: for all $\theta \in \Theta[\Delta]$, $\inf_{\rho \in \mathbb{T}} B(\theta, \rho) = \inf_{\rho \in \mathbb{T}[\varphi_E]} B(\theta, \rho)$.

Proof. Let $\rho$ be the argmin of the LHS, and define $\hat{\rho} = \frac{1}{|\mathbb{A}_\Delta|} \sum_{\pi \in \mathbb{A}_\Delta} \rho^\pi$ so that $\hat{\rho} \in \mathbb{T}[\varphi_E]$. For all $(\pi, \gamma) \in \mathbb{A}_\Delta$ and for all tied parameters $\theta \in \Theta[\Delta]$, $\theta^\pi = \theta$, so $B(\theta, \rho^\pi) = B(\theta^\pi, \rho^\pi)$. By Theorem 1 of [3], $\pi$ must be an automorphism of the graph $G$. By Lemma 7 (see the supplementary material), $B(\theta^\pi, \rho^\pi) = B(\theta, \rho)$. Thus $B(\theta, \rho^\pi) = B(\theta, \rho)$. Since $B$ is convex w.r.t.
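The averaging step in this proof sketch is easy to verify numerically. The toy example below (illustrative, not from the paper) applies all cyclic permutations of a vector, standing in for the orbit under the lifting group, and checks the Jensen-type conclusion for a concave, permutation-invariant objective: the symmetrized average scores at least as well as the original vector.

```python
import math

def neg_entropy_obj(mu):
    """A concave, permutation-invariant objective: elementwise entropy."""
    return sum(-m * math.log(m) for m in mu if m > 0)

mu = [0.7, 0.2, 0.1]
n = len(mu)
# Orbit of mu under the cyclic group (a stand-in for the lifting group).
perms = [mu[i:] + mu[:i] for i in range(n)]
avg = [sum(p[j] for p in perms) / n for j in range(n)]

# Every permuted copy has the same objective value; the symmetrized
# average does at least as well, and lives in the symmetrized subspace.
vals = [neg_entropy_obj(p) for p in perms]
```

Here the average is the uniform vector, exactly the fixed point of the permutation action, mirroring how the averaged pseudo-marginal lands in $\text{OUTER}[\varphi]$.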
$\rho$, by Jensen's inequality we have that $B(\theta, \hat{\rho}) \le \frac{1}{|\mathbb{A}_\Delta|} \sum_{\pi \in \mathbb{A}_\Delta} B(\theta, \rho^\pi) = B(\theta, \rho)$.

Using a symmetric choice of $\rho$, the TRW bound $B^*$ then satisfies the condition of Theorem 1, enabling the applicability of the general lifted variational inference framework.

Theorem 3. For a fixed $\rho \in \mathbb{T}[\varphi_E]$, $\varphi$ is the lifting partition for the TRW variational problem:

$\sup_{\tau \in \text{OUTER}(G)} \langle \tau, \theta \rangle - B^*(\tau, \rho) = \sup_{\tau \in \text{OUTER}[\varphi]} \langle \tau, \theta \rangle - B^*(\tau, \rho)$  (5)

4.2 Lifted TRW Problems

We give the explicit lifted formulation of the TRW optimization problem (5). As in [3], we restrict $\tau$ to $\text{OUTER}[\varphi]$ by introducing a lifted variable $\bar{\tau}_j$ for each cell $\varphi_j$ and, for all $i \in \varphi_j$, enforcing that $\tau_i = \bar{\tau}_j$. Effectively, we substitute every occurrence of $\tau_i$, $i \in \varphi_j$, by $\bar{\tau}_j$; in vector form, $\tau$ is substituted by $D\bar{\tau}$, where $D$ is the characteristic matrix of the partition $\varphi$: $D_{ij} = 1$ if $i \in \varphi_j$ and $0$ otherwise. This results in the lifted form of the TRW problem:

$\sup_{D\bar{\tau} \in \text{OUTER}} \langle \bar{\tau}, \bar{\theta} \rangle - \bar{B}^*(\bar{\tau}, \bar{\rho})$  (6)

where $\bar{\theta} = D^\top \theta$; $\bar{B}^*$ is obtained from $B^*$ via the above substitution; and $\bar{\rho}$ is the edge appearance per edge orbit: for every edge orbit $\bar{e}$ and every edge $e \in \bar{e}$, $\rho_e = \bar{\rho}_{\bar{e}}$. Using an alternative but equivalent form, $B^* = -\sum_{v \in V} (1 - \sum_{e \in Nb(v)} \rho_e) H(\tau_v) - \sum_{e \in E} \rho_e H(\tau_e)$, we obtain the following explicit form:

$\bar{B}^*(\bar{\tau}, \bar{\rho}) = -\sum_{\bar{v} \in \bar{V}} \Big( |\bar{v}| - \sum_{\bar{e} \in N(\bar{v})} |\bar{e}|\, d(\bar{v}, \bar{e})\, \bar{\rho}_{\bar{e}} \Big) H(\bar{\tau}_{\bar{v}}) - \sum_{\bar{e} \in \bar{E}} |\bar{e}|\, \bar{\rho}_{\bar{e}}\, H(\bar{\tau}_{\bar{e}})$  (7)

Intuitively, the above can be viewed as a combination of node and edge entropies defined on the nodes and edges of the lifted graph $\bar{G}$. The nodes of $\bar{G}$ are the node orbits of $G$, while its edges are the edge orbits of $G$. $\bar{G}$ is not a simple graph: it can have self-loops or multi-edges between the same node pair (see Fig. 1).
We encode the incidence on this graph as follows: $d(\bar{v}, \bar{e}) = 0$ if $\bar{v}$ is not incident to $\bar{e}$; $d(\bar{v}, \bar{e}) = 1$ if $\bar{v}$ is incident to $\bar{e}$ and $\bar{e}$ is not a loop; and $d(\bar{v}, \bar{e}) = 2$ if $\bar{e}$ is a loop incident to $\bar{v}$. The entropy at the node orbit $\bar{v}$ is defined as

$H(\bar{\tau}_{\bar{v}}) = -\sum_{t \in \mathcal{X}} \bar{\tau}_{\bar{v}:t} \ln \bar{\tau}_{\bar{v}:t}$

and the entropy at the edge orbit $\bar{e}$ is

$H(\bar{\tau}_{\bar{e}}) = -\sum_{t,h \in \mathcal{X}} \bar{\tau}_{\overline{\{e_1:t,\, e_2:h\}}} \ln \bar{\tau}_{\overline{\{e_1:t,\, e_2:h\}}}$

where $\{e_1, e_2\}$, $e_1, e_2 \in V$, is a representative (any element) of $\bar{e}$; $\{e_1:t, e_2:h\}$ is an assignment of the ground edge $\{e_1, e_2\}$; and $\overline{\{e_1:t, e_2:h\}}$ is the assignment orbit. As in [3], we write $\overline{\{e_1:t, e_2:t\}}$ as $\bar{e}:t$ and, for $t < h$, $\overline{\{e_1:t, e_2:h\}}$ as $\bar{a}:(t,h)$, where $\bar{a}$ is the arc orbit $\overline{(e_1, e_2)}$.

When OUTER is the local or cycle polytope, the constraints $D\bar{\tau} \in \text{OUTER}$ yield the lifted local (or cycle) polytope, respectively. For these constraints, we use the same form given in [3]. In Section 6, we describe a set of additional constraints for further tightening when some clusters of nodes are exchangeable.

Figure 1: Left: ground graphical model; same-colored nodes and edges have the same parameters. Right: lifted graph showing 2 node orbits ($\bar{b}$ and $\bar{r}$) and 3 edge orbits. Numbers on the lifted graph represent the incidence degree $d(\bar{v}, \bar{e})$ between an edge orbit and a node orbit.

Example. Consider the MRF shown in Fig. 1 (left) with 10 binary variables, which we denote $B_i$ (for the blue nodes) and $R_i$ (for the red nodes). The node and edge coloring denotes shared parameters. Let $\theta_b$ and $\theta_r$ be the single-node potentials used for the blue and red nodes, respectively. Let $\theta_{re}$ be the edge potential used for the red edges connecting the blue and red nodes, $\theta_{be}$ the edge potential used for the blue edges $(B_i, B_{i+1})$, and $\theta_{ke}$ the edge potential used for the black edges $(B_i, B_{i+2})$. There are two node orbits: $\bar{b} = \{B_1, \ldots, B_5\}$ and $\bar{r} = \{R_1, \ldots, R_5\}$.
There are three edge orbits: $\bar{r}_e$ for the red edges, $\bar{b}_e$ for the blue edges, and $\bar{k}_e$ for the black edges. The sizes of the node and edge orbits are all 5 (e.g., $|\bar{b}| = |\bar{b}_e| = 5$), and $d(\bar{b}, \bar{r}_e) = d(\bar{r}, \bar{r}_e) = 1$, $d(\bar{b}, \bar{b}_e) = d(\bar{b}, \bar{k}_e) = 2$. Suppose that $\rho$ corresponds to a uniform distribution over spanning trees, which satisfies the symmetry needed by Theorem 2. We then have $\bar{\rho}_{\bar{r}_e} = 1$ and $\bar{\rho}_{\bar{b}_e} = \bar{\rho}_{\bar{k}_e} = \frac{2}{5}$. Putting all of this together, the lifted TRW entropy is given by $\bar{B}^*(\bar{\tau}, \bar{\rho}) = 8H(\bar{\tau}_{\bar{b}}) - 5H(\bar{\tau}_{\bar{r}_e}) - 2H(\bar{\tau}_{\bar{b}_e}) - 2H(\bar{\tau}_{\bar{k}_e})$.

We illustrate the expansion of the entropy of the red edge orbit, $H(\bar{\tau}_{\bar{r}_e})$. This edge orbit has 2 corresponding arc orbits: $\overline{rb}_a = \overline{\{(R_i, B_i)\}}$ and $\overline{br}_a = \overline{\{(B_i, R_i)\}}$. Thus, $H(\bar{\tau}_{\bar{r}_e}) = -\bar{\tau}_{\bar{r}_e:00} \ln \bar{\tau}_{\bar{r}_e:00} - \bar{\tau}_{\bar{r}_e:11} \ln \bar{\tau}_{\bar{r}_e:11} - \bar{\tau}_{\overline{rb}_a:01} \ln \bar{\tau}_{\overline{rb}_a:01} - \bar{\tau}_{\overline{br}_a:01} \ln \bar{\tau}_{\overline{br}_a:01}$. Finally, the linear term in the objective is given by

$\langle \bar{\tau}, \bar{\theta} \rangle = 5\langle \bar{\tau}_{\bar{b}}, \theta_b \rangle + 5\langle \bar{\tau}_{\bar{r}}, \theta_r \rangle + 5\langle \bar{\tau}_{\bar{r}_e}, \theta_{re} \rangle + 5\langle \bar{\tau}_{\bar{b}_e}, \theta_{be} \rangle + 5\langle \bar{\tau}_{\bar{k}_e}, \theta_{ke} \rangle$

where, as an example, $\langle \bar{\tau}_{\bar{r}_e}, \theta_{re} \rangle = \bar{\tau}_{\bar{r}_e:00}\theta_{re,00} + \bar{\tau}_{\bar{r}_e:11}\theta_{re,11} + \bar{\tau}_{\overline{br}_a:01}\theta_{re,01} + \bar{\tau}_{\overline{rb}_a:01}\theta_{re,10}$.

4.3 Optimization using Frank-Wolfe

What remains is to describe how to optimize Eq. 6. Our lifted tree-reweighted algorithm is based on Frank-Wolfe, also known as the conditional gradient method [7, 11]. First, we initialize with a pseudo-marginal vector corresponding to the uniform distribution, which is guaranteed to be in the lifted marginal polytope. Next, we solve the linear program whose objective is given by the gradient of the objective in Eq. 6 evaluated at the current point, and whose constraint set is OUTER. When using the lifted cycle relaxation, we solve this linear program using a cutting-plane algorithm [3, 23].
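The coefficients in the worked example follow mechanically from Eq. (7). The sketch below (our own check, with the orbit sizes, incidence degrees, and edge appearances transcribed from the example) recomputes them, recovering $\bar{B}^* = 8H(\bar{\tau}_{\bar{b}}) - 5H(\bar{\tau}_{\bar{r}_e}) - 2H(\bar{\tau}_{\bar{b}_e}) - 2H(\bar{\tau}_{\bar{k}_e})$.

```python
# Orbit data for the Fig. 1 example: two node orbits (b, r) and three
# edge orbits (re, be, ke), all of size 5.
node_size = {"b": 5, "r": 5}
edge_size = {"re": 5, "be": 5, "ke": 5}
# Incidence degrees d(v, e): be and ke are self-loops on b in the lifted graph.
d = {("b", "re"): 1, ("r", "re"): 1, ("b", "be"): 2, ("b", "ke"): 2}
rho = {"re": 1.0, "be": 0.4, "ke": 0.4}  # uniform-spanning-tree appearances

# Eq. (7): B* = - sum_v (|v| - sum_e |e| d(v,e) rho_e) H(tau_v)
#              - sum_e |e| rho_e H(tau_e).
node_coef = {
    v: node_size[v] - sum(edge_size[e] * d[(v, e)] * rho[e]
                          for (vv, e) in d if vv == v)
    for v in node_size
}
edge_coef = {e: edge_size[e] * rho[e] for e in edge_size}
# node_coef["b"] = -8 (so -(-8) H(tau_b) = +8 H(tau_b)), node_coef["r"] = 0,
# and the edge terms contribute -5, -2, -2, matching the example.
```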
We then perform a line search to find the optimal step size using golden section search (a type of binary search that finds the maximum of a unimodal function), and finally repeat using the new pseudo-marginal vector. We warm-start each linear program using the optimal basis found in the previous run, which makes the LP solves extremely fast after the first couple of iterations. Although we use a generic LP solver in our experiments, it is also possible to use dual decomposition to derive efficient algorithms specialized to graphical models [24].

5 Lifted Maximum Spanning Tree

Optimizing the TRW edge appearance probability $\rho$ requires finding the maximum spanning tree (MST) in the ground graphical model. For lifted TRW, we need to perform MST while using only information from the node and edge orbits, without referring to the ground graph. In this section, we present a lifted MST algorithm for symmetric graphs which works at the orbit level.

Suppose that we are given a weighted graph $(G, w)$, its automorphism group $\mathbb{A} = \text{Aut}(G)$, and its node and edge orbits. We aim to derive an algorithm analogous to Kruskal's algorithm, but with complexity depending only on the number of node/edge orbits of $G$. However, if the algorithm has to return an actual spanning tree of $G$, then clearly its complexity cannot be less than $O(|V|)$. Instead, we consider an equivalent problem: solving a linear program over the spanning tree polytope

$\sup_{\rho \in \mathbb{T}(G)} \langle \rho, w \rangle$  (8)

The same mechanism for lifting convex optimization problems (Lemma 1 in [3]) applies to this problem. Let $\varphi_E$ be the edge orbit partition; then an equivalent lifted problem is

$\sup_{\rho \in \mathbb{T}[\varphi_E]} \langle \rho, w \rangle$  (9)

Since $\rho_e$ is constrained to be the same for edges in the same orbit, it is now possible to solve (9) with complexity depending only on the number of orbits.
Any solution $\rho$ of the LP (8) can be turned into a solution $\bar{\rho}$ of (9) by letting $\bar{\rho}(\bar{e}) = \frac{1}{|\bar{e}|} \sum_{e' \in \bar{e}} \rho(e')$.

5.1 Lifted Kruskal's Algorithm

Kruskal's algorithm first sorts the edges in order of decreasing weight. Then, starting from an empty graph, at each step it greedily attempts to add the next edge while maintaining the property that the used edges form a forest (containing no cycle). The forest obtained at the end of this algorithm is a maximum-weight spanning tree.

Imagine how Kruskal's algorithm would operate on a weighted graph $G$ with non-trivial automorphisms. Let $\bar{e}_1, \ldots, \bar{e}_k$ be the list of edge orbits sorted in order of decreasing weight (the weights $w$ on all edges in the same orbit must, by definition, be the same). The main question therefore is how many edges in each edge orbit $\bar{e}_i$ will be added to the spanning tree by Kruskal's algorithm. Let $G_i$ be the subgraph of $G$ formed by the set of all the edges and nodes in $\bar{e}_1, \ldots, \bar{e}_i$. Let $V(G)$ and $C(G)$ denote the set of nodes and the set of connected components of a graph, respectively. Then (see the supplementary material for proof):

Lemma 4. The number of edges in $\bar{e}_i$ appearing in the MST found by Kruskal's algorithm is $\delta^{(i)}_V - \delta^{(i)}_C$, where $\delta^{(i)}_V = |V(G_i)| - |V(G_{i-1})|$ and $\delta^{(i)}_C = |C(G_i)| - |C(G_{i-1})|$. Thus a solution of the linear program (9) is $\bar{\rho}(\bar{e}_i) = \frac{\delta^{(i)}_V - \delta^{(i)}_C}{|\bar{e}_i|}$.

5.2 Lifted Counting of the Number of Connected Components

We note that counting the number of nodes can be done simply by adding up the sizes of the node orbits. The remaining difficulty is how to count the number of connected components of a given graph⁴ $G$ using only information at the orbit level. Let $\bar{G}$ be the lifted graph of $G$. Then (see the supplementary material for proof):

Lemma 5. If $\bar{G}$ is connected, then all connected components of $G$ are isomorphic.
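Lemma 4 can be sanity-checked on the ground graph of the Fig. 1 example. The sketch below (illustrative code, ours; it works on the ground graph rather than at the orbit level) processes the edge orbits in a fixed weight order, tracks $|V(G_i)|$ and $|C(G_i)|$ with a union-find structure, and recovers $\bar{\rho}(\bar{e}_i) = (\delta^{(i)}_V - \delta^{(i)}_C)/|\bar{e}_i|$.

```python
class UnionFind:
    def __init__(self):
        self.parent = {}
    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def lifted_rho(orbits):
    """For each edge orbit (in decreasing weight order), return
    rho_bar = (delta_V - delta_C) / |orbit| as in Lemma 4."""
    uf, seen, rho = UnionFind(), set(), []
    for orbit in orbits:
        dV = dC = 0
        for (u, v) in orbit:
            for x in (u, v):
                if x not in seen:          # a new node is a new component
                    seen.add(x)
                    dV += 1
                    dC += 1
            if uf.find(u) != uf.find(v):   # Kruskal accepts: merge
                uf.union(u, v)
                dC -= 1
        rho.append((dV - dC) / len(orbit))
    return rho

# Ground graph of the Fig. 1 example: red pendant edges (B_i, R_i),
# blue 5-cycle (B_i, B_{i+1}), black 5-cycle (B_i, B_{i+2}).
red   = [(f"B{i}", f"R{i}") for i in range(5)]
blue  = [(f"B{i}", f"B{(i+1) % 5}") for i in range(5)]
black = [(f"B{i}", f"B{(i+2) % 5}") for i in range(5)]
rho = lifted_rho([red, blue, black])  # orbits in decreasing-weight order
```

With this weight order, all 5 red edges enter the MST ($\bar{\rho} = 1$), 4 of the 5 blue edges enter ($\bar{\rho} = 4/5$), and no black edge does; the per-orbit counts sum to $|V| - 1 = 9$ as expected.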
Thus if, furthermore, $G_0$ is a connected component of $G$, then $|C(G)| = |V(G)| / |V(G_0)|$.

To find just one connected component, we can choose an arbitrary node $u$ and compute $\bar{G}[u]$, the lifted graph fixing $u$ (see Section 8.1 in [3]), then search for the connected component in $\bar{G}[u]$ that contains $\{u\}$. Finally, if $\bar{G}$ is not connected, we simply apply Lemma 5 to each connected component of $\bar{G}$. The final lifted Kruskal's algorithm combines Lemmas 4 and 5 while keeping track of the set of connected components of $\bar{G}_i$ incrementally. The full algorithm is given in the supplementary material.

[Footnote 4: Since we are only interested in connectivity in this subsection, the weights of $G$ play no role. Thus, orbits in this subsection can also be generated by the automorphism group of the unweighted version of $G$.]

6 Tightening via Exchangeable Polytope Constraints

One type of symmetry often found in first-order probabilistic models is large sets of exchangeable random variables. In certain cases, exact inference with exchangeable variables is possible via lifted counting elimination and its generalization [17, 2]. The drawback of these exact methods is that they do not apply to many models (e.g., those with transitive clauses). Lifted variational inference methods do not have this drawback; however, the local and cycle relaxations can be shown to be loose in the exchangeable setting, a potentially serious limitation compared to earlier work.

To remedy this situation, we now show how to take advantage of highly symmetric subsets of variables to tighten the relaxation of the lifted marginal polytope. We call a set of random variables $\chi$ an exchangeable cluster iff $\chi$ can be arbitrarily permuted while preserving the probability model. Mathematically, the lifting group $\mathbb{A}_\Delta$ acts on $\chi$ and the image of the action is isomorphic to $S(\chi)$, the symmetric group on $\chi$.
The distribution of the random variables in $\chi$ is also exchangeable in the usual sense.

Our method for tightening the relaxation of the marginal polytope is based on lift-and-project, wherein we introduce auxiliary variables specifying the joint distribution of a large cluster of variables, and then enforce consistency between the cluster distribution and the pseudo-marginal vector [20, 24, 27]. In the ground model, one typically works with small clusters (e.g., triplets) because the number of variables grows exponentially with cluster size. The key (and nice) difference in the lifted case is that we can make use of very large clusters of highly symmetric variables: while the grounded relaxation would clearly blow up, the corresponding lifted relaxation can still remain compact.

Specifically, for an exchangeable cluster $\chi$ of arbitrary size, one can add cluster consistency constraints for the entire cluster and still maintain tractability. To keep the exposition simple, we assume that the variables are binary. Let $C$ denote a $\chi$-configuration, i.e., a function $C: \chi \to \{0, 1\}$. The set $\{\tau^\chi_C \mid \forall \text{ configuration } C\}$ is the collection of $\chi$-cluster auxiliary variables. Since $\chi$ is exchangeable, all nodes in $\chi$ belong to the same node orbit; we call this node orbit $\bar{v}(\chi)$. Similarly, $\bar{e}(\chi)$ and $\bar{a}(\chi)$ denote the single edge and arc orbit that contain all edges and arcs in $\chi$, respectively. Let $u_1, u_2$ be two distinct nodes in $\chi$. To enforce consistency between the cluster $\chi$ and the edge $\{u_1, u_2\}$ in the ground model, we introduce the constraints

$\exists \tau^\chi: \sum_{C \text{ s.t. } C(u_i) = s_i} \tau^\chi_C = \tau_{u_1:s_1,\, u_2:s_2} \quad \forall s_i \in \{0, 1\}$  (10)

These constraints correspond to using intersection sets of size two, which can be shown to give the exact characterization of the marginal polytope involving variables in $\chi$ if the graphical model only has pairwise potentials.
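For a tiny cluster, constraint (10) can be checked by direct enumeration. The sketch below (illustrative, ours) builds an exchangeable distribution over $\chi$-configurations for $n = 3$, where $\tau^\chi_C$ depends only on the number of ones in $C$, and marginalizes it onto a pair $\{u_1, u_2\}$ as on the left-hand side of (10); the illustrative count-level weights are our own choice.

```python
import itertools
from math import comb

n = 3
# An exchangeable cluster distribution: tau_C depends only on the number
# of ones in C. Illustrative count-level weights for k = 0..n ones.
count_prob = [0.3, 0.2, 0.2, 0.3]
tau = {C: count_prob[sum(C)] / comb(n, sum(C))
       for C in itertools.product([0, 1], repeat=n)}

def pairwise(u1, u2):
    """LHS of constraint (10): marginalize the cluster onto {u1, u2}."""
    return {(s1, s2): sum(p for C, p in tau.items()
                          if C[u1] == s1 and C[u2] == s2)
            for s1 in (0, 1) for s2 in (0, 1)}

t = pairwise(0, 1)
```

By exchangeability, the resulting table is the same for any pair of distinct nodes and is symmetric in its two arguments, which is exactly why only the three lifted variables $\bar{\tau}_{\bar{e}(\chi):00}$, $\bar{\tau}_{\bar{e}(\chi):11}$, $\bar{\tau}_{\bar{a}(\chi):01}$ are needed.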
If higher-order potentials are present, a tighter relaxation could be obtained by using larger intersection sets together with the techniques described below.

The constraints in (10) can be methodically lifted by replacing occurrences of ground variables with lifted variables at the orbit level. First, observe that in place of the ground variables $\tau_{u_1:s_1,\, u_2:s_2}$, the lifted local relaxation has three corresponding lifted variables: $\bar{\tau}_{\bar{e}(\chi):00}$, $\bar{\tau}_{\bar{e}(\chi):11}$ and $\bar{\tau}_{\bar{a}(\chi):01}$. Second, we consider the orbits of the set of configurations $C$. Since $\chi$ is exchangeable, there can be only $|\chi| + 1$ $\chi$-configuration orbits; each orbit contains all configurations with precisely $k$ ones, for $k = 0 \ldots |\chi|$. Thus, instead of the $2^{|\chi|}$ ground auxiliary variables, we only need $|\chi| + 1$ lifted cluster variables. Further manipulation leads to the following set of constraints, which we call the lifted exchangeable polytope constraints.

Theorem 6. Let $\chi$ be an exchangeable cluster of size $n$; let $\bar{e}(\chi)$ and $\bar{a}(\chi)$ be the single edge and arc orbit of the graphical model that contain all edges and arcs in $\chi$, respectively; and let $\bar{\tau}$ be the lifted marginals. Then there exist $c^\chi_k \ge 0$, $k = 0 \ldots n$, such that

$\sum_{k=0}^{n-2} \frac{(n-k)(n-k-1)}{n(n-1)}\, c^\chi_k = \bar{\tau}_{\bar{e}(\chi):00}$

$\sum_{k=0}^{n-2} \frac{(k+1)(k+2)}{n(n-1)}\, c^\chi_{k+2} = \bar{\tau}_{\bar{e}(\chi):11}$

$\sum_{k=0}^{n-2} \frac{(n-k-1)(k+1)}{n(n-1)}\, c^\chi_{k+1} = \bar{\tau}_{\bar{a}(\chi):01}$

Proof. See the supplementary material.

In contrast to the lifted local and cycle relaxations, the number of variables and constraints in the lifted exchangeable relaxation depends linearly on the domain size of the first-order model. From the lifted local constraints given by [3], $\bar{\tau}_{\bar{e}(\chi):00} + \bar{\tau}_{\bar{e}(\chi):11} + 2\bar{\tau}_{\bar{a}(\chi):01} = 1$. Substituting in the expressions involving $c^\chi_k$, we get $\sum_{k=0}^n c^\chi_k = 1$.
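The coefficients in Theorem 6 are the probabilities of the joint states of two distinct nodes given that the cluster contains exactly $k$ ones, so any count distribution $c$ yields pairwise lifted marginals that satisfy the local normalization above. The sketch below (ours, with an arbitrary illustrative $c$) verifies this numerically.

```python
def pairwise_from_counts(c):
    """Map a count distribution c_0..c_n over an exchangeable cluster to
    the pairwise lifted marginals via the Theorem 6 coefficients."""
    n = len(c) - 1
    t00 = sum((n - k) * (n - k - 1) / (n * (n - 1)) * c[k]
              for k in range(n - 1))
    t11 = sum((k + 1) * (k + 2) / (n * (n - 1)) * c[k + 2]
              for k in range(n - 1))
    t01 = sum((n - k - 1) * (k + 1) / (n * (n - 1)) * c[k + 1]
              for k in range(n - 1))
    return t00, t11, t01

# An arbitrary count distribution for a cluster of size n = 4.
c = [0.1, 0.2, 0.4, 0.2, 0.1]
t00, t11, t01 = pairwise_from_counts(c)
```

The normalization $t_{00} + t_{11} + 2t_{01} = \sum_k c_k = 1$ follows from the identity $(n-k)(n-k-1) + k(k-1) + 2k(n-k) = n(n-1)$.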
Intuitively, $c^\chi_k$ represents the approximation of the marginal probability $\Pr(\sum_{i \in \chi} x_i = k)$ of having precisely $k$ ones in $\chi$.

As proved by [2], the groundings of unary predicates in Markov Logic Networks (MLNs) give rise to exchangeable clusters. Thus, for MLNs, the above theorem immediately suggests a tightening of the relaxation: for every unary predicate of an MLN, add a new set of constraints as above to the existing lifted local (or cycle) optimization problem. Although it is not the focus of our paper, we note that this should also improve the lifted MAP inference results of [3]. For example, in the case of a symmetric complete graphical model, lifted MAP inference using the linear program given by these new constraints would find the exact $k$ that maximizes $\Pr(x_\chi)$, hence recovering the same solution as counting elimination. Marginal inference may still be inexact due to the tree-reweighted entropy approximation. We re-emphasize that the complexity of variational inference with lifted exchangeable constraints is guaranteed to be polynomial in the domain size, unlike exact methods based on lifted counting elimination and variable elimination.

7 Experiments

In this section, we provide an empirical evaluation of our lifted tree-reweighted (LTRW) algorithm. As a baseline we use a dampened version of the lifted belief propagation (LBP-Dampening) algorithm from [21]. Our lifted algorithm has all of the same advantages of the tree-reweighted approach over belief propagation, which we will illustrate in the results: (1) a convex objective that can be convergently solved to optimality, (2) upper bounds on the partition function, and (3) the ability to easily improve the approximation by tightening the relaxation.
Our evaluation includes four variants of the LTRW algorithm, corresponding to different outer bounds: lifted local polytope (LTRW-L), lifted cycle polytope (LTRW-C), lifted local polytope with exchangeable polytope constraints (LTRW-LE), and lifted cycle polytope with exchangeable constraints (LTRW-CE).

Figure 2: An example of the ground graphical model for the Clique-Cycle model (domain size = 3), over the ground atoms Q1(1..3), Q2(1..3), Q3(1..3).

The conditional gradient optimization of the lifted TRW objective terminates when the duality gap is less than $10^{-4}$ or when a maximum number of 1000 iterations is reached. To solve the LP problem during conditional gradient, we use Gurobi (http://www.gurobi.com/).

We evaluate all the algorithms using several first-order probabilistic models. We assume that no evidence has been observed, which results in a large amount of symmetry. Even without evidence, performing marginal inference in first-order probabilistic models can be very useful for maximum likelihood learning [13]. Furthermore, the fact that our lifted tree-reweighted variational approximation provides an upper bound on the partition function enables us to maximize a lower bound on the likelihood [29], which we demonstrate in Sec. 7.5. To find the lifted orbit partition, we use the renaming group as in [3], which exploits the symmetry of the unobserved constants in the model.

Rather than optimize over the spanning tree polytope, which is computationally intensive, most TRW implementations use a single fixed choice of edge appearance probabilities, e.g., one obtained using the matrix-tree theorem. In these experiments, we initialize the lifted edge appearance probabilities $\bar\rho$ to be the most uniform per-orbit edge-appearance probabilities by solving the optimization problem $\inf_{\bar\rho \in \mathbb{T}[\varphi_E]} \big\| \bar\rho - \frac{|V|-1}{|E|} \big\|^2$ using conditional gradient.
Each direction-finding step of this conditional gradient solves a lifted MST problem of the form $\sup_{\bar\rho' \in \mathbb{T}[\varphi_E]} \big\langle -2\big(\bar\rho - \frac{|V|-1}{|E|}\big),\, \bar\rho' \big\rangle$ using our lifted Kruskal's algorithm, where $\bar\rho$ is the current solution. After this initialization, we fix the lifted edge appearance probabilities and do not attempt to optimize them further.

7.1 Test models

Fig. 3 describes the four test models in MLN syntax. We focus on the repulsive case, since for attractive models all TRW variants and lifted LBP give similar results. The parameter $W$ denotes the weight that will be varied during the experiments. In all models except Clique-Cycle, $W$ acts like the "local field" potential in an Ising model; a negative (or positive) value of $W$ means the corresponding variable tends to be in the 0 (or 1) state. Complete-Graph is equivalent to an Ising model on the complete graph of size $n$ (the domain size) with homogeneous parameters. Exact marginals and the log-partition function can be computed in closed form using lifted counting elimination. The weight of the interaction clause is set to $-0.1$ (repulsive). Friends-Smokers (negated) is a variant of the Friends-Smokers model [21] where the weight of the final clause is set to $-1.1$ (repulsive). We use the method in [2] to compute the exact marginal for the Cancer predicate and the exact value of the log-partition function. Lovers-Smokers is the same MLN used in [3] with a full transitive clause and where we vary the prior of the Loves predicate. Clique-Cycle is a model with 3 cliques and 3 bipartite graphs in between. Its corresponding ground graphical model is shown in Fig. 2.

7.2 Accuracy of Marginals

Fig. 4 shows the marginals computed by all the algorithms, as well as exact marginals, on the Complete-Graph and Friends-Smokers models.
We do not know how to efficiently perform exact inference in the remaining two models, and thus do not measure accuracy for them. The result on complete graphs illustrates the clear benefit of tightening the relaxation: LTRW-Local and LBP are inaccurate for moderate $W$, whereas cycle constraints and, especially, exchangeable constraints drastically improve accuracy. As discussed earlier, for the case of symmetric complete graphical models, the exchangeable constraints suffice to exactly characterize the marginal polytope. As a result, the approximate marginals computed by LTRW-LE and LTRW-CE are almost the same as the exact marginals; the very small difference is due to the entropy approximation. On the Friends-Smokers (negated) model, all LTRW variants give accurate marginals, while lifted LBP, even with very strong dampening (0.9 weight given to previous iterations' messages), fails to converge for $W < 2$. We observed that LTRW-LE gives the best trade-off between accuracy and running time for this model. Note that we do not compare to ground versions of the lifted TRW algorithms because, by Theorem 3, the marginals and log-partition function are the same for both.

7.3 Quality of Log-Partition Upper Bounds

Fig. 5 plots the values of the upper bounds obtained by the LTRW algorithms on the four test models. The results clearly show the benefits of adding each type of constraint to the LTRW, with the best upper bound obtained by tightening the lifted local polytope with both lifted cycle and exchangeable constraints. For the Complete-Graph and Friends-Smokers models, the log-partition approximation using exchangeable polytope constraints is very close to exact. In addition, we illustrate lifted LBP's approximation of the log-partition function on the Complete-Graph (note it is non-convex and not an upper bound).

Complete-Graph:
  W      V(x)
  -0.1   [x ≠ y ∧ (V(x) ⇔ V(y))]

Friends-Smokers (Negated):
  W      [x ≠ y ∧ ¬Friends(x, y)]
  1.4    ¬Smokes(x)
  2.3    ¬Cancer(x)
  1.5    Smokes(x) ⇒ Cancer(x)
  -1.1   [x ≠ y ∧ Smokes(x) ∧ Friends(x, y) ⇒ Smokes(y)]

Lovers-Smokers:
  W      [x ≠ y ∧ Loves(x, y)]
  100    Male(x) ⇔ ¬Female(x)
  2      Male(x) ∧ Smokes(x)
  1      Female(x) ∧ Smokes(x)
  0.5    [x ≠ y ∧ Male(x) ∧ Female(y) ∧ Loves(x, y)]
  1      [x ≠ y ∧ Loves(x, y) ∧ (Smokes(x) ⇔ Smokes(y))]
  -100   [x ≠ y ∧ y ≠ z ∧ z ≠ x ∧ Loves(x, y) ∧ Loves(y, z) ∧ Loves(x, z)]

Clique-Cycle:
  W      x ≠ y ∧ (Q1(x) ⇔ ¬Q2(y))
  W      x ≠ y ∧ (Q2(x) ⇔ ¬Q3(y))
  W      x ≠ y ∧ (Q3(x) ⇔ ¬Q1(y))
  -W     x ≠ y ∧ (Q1(x) ⇔ Q1(y))
  -W     x ≠ y ∧ (Q2(x) ⇔ Q2(y))
  -W     x ≠ y ∧ (Q3(x) ⇔ Q3(y))

Figure 3: Test models.

Figure 5: Approximations of the log-partition function on the four test models from Fig. 3: Complete Graph (domain size = 100), Friends-Smokers (domain size = 200), Clique-Cycle (domain size = 100), and Lovers-Smokers (domain size = 200); best viewed in color.
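For the Complete-Graph model, "closed form using lifted counting elimination" amounts to summing over the number of ones $k$ rather than over all $2^n$ configurations. A small sketch (tiny $n$ and illustrative weights so the brute-force check is feasible; groundings are assumed to be over unordered pairs):

```python
import itertools
import math

n, W, w_pair = 6, 0.5, -0.1  # tiny domain and illustrative weights

def log_potential(k):
    # every configuration with k ones has the same log-potential:
    # W per "on" atom, plus w_pair per unordered pair of agreeing atoms
    return W * k + w_pair * (math.comb(k, 2) + math.comb(n - k, 2))

# counting elimination: group the 2^n configurations by their number of ones
logZ_lifted = math.log(sum(math.comb(n, k) * math.exp(log_potential(k))
                           for k in range(n + 1)))

# brute-force check over all ground configurations
logZ_ground = math.log(sum(
    math.exp(W * sum(x) + w_pair * sum(x[i] == x[j]
                                       for i in range(n) for j in range(i + 1, n)))
    for x in itertools.product([0, 1], repeat=n)))
assert abs(logZ_lifted - logZ_ground) < 1e-9
```

The lifted sum has $n+1$ terms instead of $2^n$, which is what makes the exact baseline feasible at domain size 100.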
Figure 4: Left: marginal accuracy for the Complete-Graph model (domain size = 100). Right: marginal accuracy for $\Pr(\mathit{Cancer}(x))$ in Friends-Smokers (negated) (domain size = 200). Lifted TRW variants use different outer bounds: L = local, C = cycle, LE = local+exchangeable, CE = cycle+exchangeable (best viewed in color).

7.4 Running time

As shown in Table 1, lifted variants of TRW are orders of magnitude faster than the ground version. Interestingly, lifted TRW with local constraints is observed to be faster as the domain size increases; this is probably because, as the domain size increases, the distribution becomes more peaked, so marginal inference becomes more similar to MAP inference. Lifted TRW with local and exchangeable constraints requires a smaller number of conditional gradient iterations and is thus faster; however, note that its running time slowly increases, since the exchangeable constraint set grows linearly with domain size.

Domain size | 10     | 20     | 30      | 100  | 200
TRW-L       | 138370 | 609502 | 1525140 | -    | -
LTRW-L      | 3255   | 3581   | 3438    | 1626 | 1416
LTRW-LE     | 681    | 703    | 721     | 1033 | 1307

Table 1: Ground vs. lifted TRW runtime on Complete-Graph (milliseconds).

LBP's lack of convergence makes it difficult to perform a meaningful timing comparison with LBP. For example, LBP did not converge for about half of the values of $W$ in the Lovers-Smokers model, even after using very strong dampening. We did observe that when LBP converges, it is much faster than LTRW.
We hypothesize that this is due to the message-passing nature of LBP, which is based on a fixed-point update, whereas our algorithm is based on Frank-Wolfe.

7.5 Application to Learning

We now describe an application of our algorithm to the task of learning relational Markov networks for inferring protein-protein interactions from noisy, high-throughput experimental assays [12]. This is equivalent to learning the parameters of an exponential family random graph model [19] where the edges in the random graph represent the protein-protein interactions. Despite fully observed data, maximum likelihood learning is challenging because of the intractability of computing the log-partition function and its gradient. In particular, this relational Markov network has over 330K random variables (one for each possible interaction among 813 proteins) and ternary potentials. However, Jaimovich et al. [13] observed that the partition function in relational Markov networks is highly symmetric, and used lifted LBP to efficiently perform approximate learning in running time that is independent of the domain size. They used their lifted inference algorithm to visualize the (approximate) likelihood landscape for different values of the parameters, which among other uses characterizes the robustness of the model to parameter changes.

Figure 6: Log-likelihood lower bound obtained using lifted TRW with the cycle and exchangeable constraints (CE) for the same protein-protein interaction data used in [13] (left; c.f. Fig. 7 in [13]). Improvement in the lower bound after tightening the local constraints (L) with CE (right).
We use precisely the same procedure as [13], substituting lifted BP with our new lifted TRW algorithms. The model has three parameters: $\theta_1$, used in the single-node potential to specify the prior probability of a protein-protein interaction; $\theta_{111}$, part of the ternary potentials, which encourages cliques of three interacting proteins; and $\theta_{011}$, also part of the ternary potentials, which encourages chain-like structures where proteins A and B interact, B and C interact, but A and C do not (see the supplementary material for the full model specification as an MLN). We follow their two-step estimation procedure, first estimating $\theta_1$ in the absence of the other parameters (the maximum likelihood, BP, and TRW estimates of this parameter coincide, and estimation can be performed in closed form: $\theta_1^* = -5.293$). Next, for each setting of $\theta_{111}$ and $\theta_{011}$ we estimate the log-partition function using lifted TRW with cycle+exchangeable vs. local constraints only. Since TRW gives an upper bound on the log-partition function, these provide lower bounds on the likelihood.

Our results are shown in Fig. 6, and should be compared to Fig. 7 of [13]. The overall shapes of the likelihood landscapes are similar. However, the lifted LBP estimates of the likelihood have several local optima, which cause gradient-based learning with lifted LBP to reach different solutions depending on the initial setting of the parameters. In contrast, since TRW is convex, any gradient-based procedure would reach the global optimum, and thus learning is much easier. Interestingly, we see that our estimates of the likelihood have a significantly smaller range over these parameter settings than that estimated by lifted LBP. Moreover, the high-likelihood parameter settings extend to larger values of $\theta_{111}$. For all algorithms there is a sudden decrease in the likelihood at $\theta_{011} > 0$ (not shown in the figure).
8 Discussion and Conclusion

Lifting partitions used by lifted and counting BP [21, 16] can be coarser than orbit partitions. In graph-theoretic terms, these partitions are called equitable partitions. If each equitable partition cell is thought of as a distinct node color, then among nodes with the same color, their neighbors must have the same color histogram. It is known that orbit partitions are always equitable; however, the converse is not always true [9].

Since equitable partitions can be computed more efficiently and potentially lead to more compact lifted problems, the following question naturally arises: can we use equitable partitions in lifting the TRW problem? Unfortunately, a complete answer is non-trivial. We point out here a theoretical barrier due to the interplay between the spanning tree polytope and the equitable partition of a graph. Let $\varepsilon$ be the coarsest equitable partition of the edges of $G$. We give an example graph in the supplementary material (see Example 9) where the symmetrized spanning tree polytope corresponding to the equitable partition $\varepsilon$, $\mathbb{T}[\varepsilon] = \mathbb{T}(G) \cap \mathbb{R}^{|E|}_{[\varepsilon]}$, is an empty set. When $\mathbb{T}[\varepsilon]$ is empty, the consequence is that if we want $\rho$ to be within $\mathbb{T}$ so that $B(\cdot, \rho)$ is guaranteed to be a convex upper bound of the log-partition function, we cannot restrict $\rho$ to be consistent with the equitable partition. In lifted and counting BP, $\rho \equiv 1$, so it is clearly consistent with the equitable partition; however, one loses the convexity and upper-bound guarantees as a result. This suggests that there might be a trade-off between the compactness of the lifting partition and the quality of the entropy approximation, a topic deserving the attention of future work.

In summary, we presented a formalization of lifted marginal inference as a convex optimization problem and showed that it can be efficiently solved using a Frank-Wolfe algorithm.
Compared to previous lifted variational inference algorithms, in particular lifted belief propagation, our approach comes with convergence guarantees, upper bounds on the partition function, and the ability to improve the approximation (e.g., by introducing additional constraints) at the cost of a small additional running time. A limitation of our lifting method is that as the amount of soft evidence (the number of distinct individual objects) approaches the domain size, the behavior of lifted inference approaches ground inference. The wide difference in running time between ground and lifted inference suggests that significant efficiency can be gained by solving an approximation of the original problem that is more symmetric [25, 15, 22, 6]. One of the most interesting open questions raised by our work is how to use the variational formulation to perform approximate lifting. Since our lifted TRW algorithm provides an upper bound on the partition function, it is possible that one could use the upper bound to guide the choice of approximation when deciding how to re-introduce symmetry into an inference task.

Acknowledgements: Work by DS supported by the DARPA PPAML program under AFRL contract no. FA8750-14-C-0005.

References

[1] F. Barahona and A. R. Mahjoub. On the cut polytope. Mathematical Programming, 36:157–173, 1986.
[2] Hung Hai Bui, Tuyen N. Huynh, and Rodrigo de Salvo Braz. Lifted inference with distinct soft evidence on every object. In AAAI-2012, 2012.
[3] Hung Hai Bui, Tuyen N. Huynh, and Sebastian Riedel. Automorphism groups of graphical models and lifted variational inference. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence (UAI-2013). AUAI Press, 2013.
[4] Hung Hai Bui, Tuyen N. Huynh, and David Sontag. Lifted tree-reweighted variational inference. arXiv pre-print, 2014. http://arxiv.org/abs/1406.4200.
[5] R. de Salvo Braz, E. Amir, and D. Roth. Lifted first-order probabilistic inference. In Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI '05), pages 1319–1325, 2005.
[6] Rodrigo de Salvo Braz, Sriraam Natarajan, Hung Bui, Jude Shavlik, and Stuart Russell. Anytime lifted belief propagation. In 6th International Workshop on Statistical Relational Learning (SRL 2009), 2009.
[7] Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95–110, 1956.
[8] A. Globerson and T. Jaakkola. Convergent propagation algorithms via oriented trees. In Uncertainty in Artificial Intelligence, 2007.
[9] Chris Godsil and Gordon Royle. Algebraic Graph Theory. Springer, 2001.
[10] Vibhav Gogate and Pedro Domingos. Probabilistic theorem proving. In Proceedings of the Twenty-Seventh Annual Conference on Uncertainty in Artificial Intelligence (UAI-11), pages 256–265, 2011.
[11] Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th ICML, volume 28, pages 427–435. JMLR Workshop and Conference Proceedings, 2013.
[12] Ariel Jaimovich, Gal Elidan, Hanah Margalit, and Nir Friedman. Towards an integrated protein-protein interaction network: a relational Markov network approach. Journal of Computational Biology, 13(2):145–164, 2006.
[13] Ariel Jaimovich, Ofer Meshi, and Nir Friedman. Template based inference in symmetric relational Markov random fields. In Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence, pages 191–199. AUAI Press, 2007.
[14] Jeremy Jancsary and Gerald Matz. Convergent decomposition solvers for tree-reweighted free energies. Journal of Machine Learning Research - Proceedings Track, 15:388–398, 2011.
[15] K. Kersting, Y. El Massaoudi, B. Ahmadi, and F. Hadiji. Informed lifting for message-passing. In Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI-10). AAAI Press, 2010.
[16] Kristian Kersting, Babak Ahmadi, and Sriraam Natarajan. Counting belief propagation. In Proceedings of the 25th Annual Conference on Uncertainty in AI (UAI '09), 2009.
[17] B. Milch, L. S. Zettlemoyer, K. Kersting, M. Haimes, and L. P. Kaelbling. Lifted probabilistic inference with counting formulas. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence (AAAI '08), pages 1062–1068, 2008.
[18] Mathias Niepert. Markov chains on orbits of permutation groups. In UAI-2012, 2012.
[19] Garry Robins, Pip Pattison, Yuval Kalish, and Dean Lusher. An introduction to exponential random graph (p*) models for social networks. Social Networks, 29(2):173–191, 2007.
[20] H. D. Sherali and W. P. Adams. A hierarchy of relaxations between the continuous and convex hull representations for zero-one programming problems. SIAM Journal on Discrete Mathematics, 3(3):411–430, 1990. doi: 10.1137/0403036.
[21] Parag Singla and Pedro Domingos. Lifted first-order belief propagation. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence (AAAI '08), pages 1094–1099, 2008.
[22] Parag Singla, Aniruddh Nath, and Pedro Domingos. Approximate lifted belief propagation. In Workshop on Statistical Relational Artificial Intelligence (StaR-AI 2010), 2010.
[23] D. Sontag and T. Jaakkola. New outer bounds on the marginal polytope. In Advances in Neural Information Processing Systems 21. MIT Press, 2008.
[24] David Sontag, T. Meltzer, A. Globerson, T. Jaakkola, and Y. Weiss. Tightening LP relaxations for MAP using message passing. In Proceedings of the 24th Annual Conference on Uncertainty in AI (UAI '08), 2008.
[25] Guy Van den Broeck and Adnan Darwiche. On the complexity and approximation of binary evidence in lifted inference. In Advances in Neural Information Processing Systems, pages 2868–2876, 2013.
[26] M. Wainwright, T. Jaakkola, and A. Willsky. A new class of upper bounds on the log partition function. IEEE Transactions on Information Theory, 51:2313–2335, 2005.
[27] Martin Wainwright and Michael Jordan. Graphical Models, Exponential Families, and Variational Inference. Now Publishers, 2008.
[28] Martin J. Wainwright, Tommi Jaakkola, and Alan S. Willsky. Tree-based reparameterization framework for analysis of sum-product and related algorithms. IEEE Transactions on Information Theory, 49:1120–1146, 2003.
[29] C. Yanover, O. Schueler-Furman, and Y. Weiss. Minimizing and learning energy functions for side-chain prediction. Journal of Computational Biology, 15(7):899–911, 2008.

Supplementary Materials for "Lifted Tree-Reweighted Variational Inference"

We present (1) proofs not given in the main paper, (2) full pseudo-code for the lifted Kruskal's algorithm for finding maximum spanning trees in symmetric graphs, and (3) additional details of the protein-protein interaction model.

Proof of Theorem 1.

Proof. The lifting group $\mathbb{A}_\Delta$ stabilizes both the objective function and the constraints of the convex optimization problem on the LHS of Eq. (4). The equality is then established using Lemma 1 in [3].

We state and prove a lemma about the symmetry of the bounds $B$ and $B^*$ that will be used in subsequent proofs.

Lemma 7. Let $\pi$ be an automorphism of the graphical model $G$. Then $B^*(\tau, \rho) = B^*(\tau^\pi, \rho^\pi)$ and $B(\theta, \rho) = B(\theta^\pi, \rho^\pi)$.

Proof. The intuition is that since the entropy bound $B^*$ is defined on the graph structure of the graphical model $G$, it inherits the symmetry of $G$.
This can be verified by viewing the graph automorphism $\pi$ as a bijection from nodes to nodes and edges to edges, so that $B^*(\tau^\pi, \rho^\pi)$ simply rearranges the summation inside $B^*(\tau, \rho)$:
$$B^*(\tau, \rho) = -\sum_{v \in V(G)} H(\tau_v) + \sum_{e \in E(G)} I(\tau_e)\,\rho_e = -\sum_{v \in V(G)} H(\tau_{\pi(v)}) + \sum_{e \in E(G)} I(\tau_{\pi(e)})\,\rho_{\pi(e)} = B^*(\tau^\pi, \rho^\pi)$$
We now show that the same symmetry applies to the log-partition upper bound $B$. Let $\mathrm{OUTER}(G)$ be an outer bound of the marginal polytope $\mathbb{M}(G)$ such that $\mathbb{M}(G) \subset \mathrm{OUTER}(G) \subset \mathrm{LOCAL}(G)$. Note that $\pi$ acts on and stabilizes OUTER, i.e., $\mathrm{OUTER}^\pi = \mathrm{OUTER}$. Thus
$$B(\theta, \rho) = \sup_{\tau \in \mathrm{OUTER}} \langle \theta, \tau \rangle - B^*(\tau, \rho) = \sup_{\tau^\pi \in \mathrm{OUTER}} \langle \theta, \tau \rangle - B^*(\tau, \rho) = \sup_{\tau \in \mathrm{OUTER}} \langle \theta, \tau^{\pi^{-1}} \rangle - B^*(\tau^{\pi^{-1}}, \rho) = \sup_{\tau \in \mathrm{OUTER}} \langle \theta^\pi, \tau \rangle - B^*(\tau, \rho^\pi) = B(\theta^\pi, \rho^\pi)$$

Proof of Theorem 3.

Proof. We will show that the condition of Theorem 1 holds. Let us fix a $\rho \in \mathbb{T}[\varphi_E]$. Then $\rho^\pi = \rho$ for all $(\pi, \gamma) \in \mathbb{A}_\Delta$. Thus $B^*(\tau^\pi, \rho) = B^*(\tau^\pi, \rho^\pi)$. On the other hand, by Lemma 7, $B^*(\tau^\pi, \rho^\pi) = B^*(\tau, \rho)$. Thus $B^*(\tau^\pi, \rho) = B^*(\tau, \rho)$. Note that in the case of the overcomplete representation, the action of the group $\mathbb{A}_\Delta$ is the permuting action of $\pi$; thus, the TRW bound $B^*(\tau, \rho)$ (for fixed $\rho \in \mathbb{T}[\varphi_E]$) is stabilized by the lifting group $\mathbb{A}_\Delta$.

Conditional Gradient (Frank-Wolfe) Algorithm for Lifted TRW

The pseudo-code is given in Algorithm 1. Step 2 essentially solves a lifted MAP problem, for which we use the same algorithms presented in [3] with Gurobi as the main linear programming engine. Step 3 solves a 1-D constrained convex problem via golden section search to find the optimal step size.

Proof of Lemma 4.

Proof. Just after considering all edges in $\mathbf{e}_1 \cup \ldots \cup \mathbf{e}_i$, Kruskal's algorithm must have formed a spanning forest of $G_i$ (the forest is spanning since, if there remained an edge that could be used without forming a cycle, Kruskal's algorithm would have used it already). Since the forest is spanning, the number of edges used by Kruskal's algorithm at this point is precisely $|V(G_i)| - |C(G_i)|$. Similarly, just after considering all edges in $\mathbf{e}_1 \cup \ldots \cup \mathbf{e}_{i-1}$, the number of edges used is $|V(G_{i-1})| - |C(G_{i-1})|$. Therefore, the number of $\mathbf{e}_i$-edges used must be
$$|V(G_i)| - |C(G_i)| - \big[|V(G_{i-1})| - |C(G_{i-1})|\big] = \delta_V^{(i)} - \delta_C^{(i)}$$
which is the difference between the number of new nodes (which must be non-negative) and the number of new connected components (which could be negative) induced by considering the edges in $\mathbf{e}_i$. Any MST solution $\rho$ can be turned into a solution $\bar\rho$ of (9) by letting $\bar\rho(\mathbf{e}) = \frac{1}{|\mathbf{e}|}\sum_{e' \in \mathbf{e}} \rho(e')$. Thus, we obtain a solution $\bar\rho(\mathbf{e}_i) = \frac{\delta_V^{(i)} - \delta_C^{(i)}}{|\mathbf{e}_i|}$.

Proof of Lemma 5.

We first need an intermediate result.

Lemma 8. If $\mathbf{u}$ and $\mathbf{v}$ are two distinct node orbits, and $\mathbf{u}$ and $\mathbf{v}$ are reachable from each other in the lifted graph $\bar G$, then for any $u \in \mathbf{u}$, there is some $v \in \mathbf{v}$ such that $v$ is reachable from $u$.

Proof. Induction on the length of the $\mathbf{u}$-$\mathbf{v}$ path. Base case: if $\mathbf{u}$ and $\mathbf{v}$ are adjacent, there exists an edge orbit $\mathbf{e}$ incident to both $\mathbf{u}$ and $\mathbf{v}$. Therefore, there exists a ground edge $\{u', v'\}$ in $\mathbf{e}$ such that $u' \in \mathbf{u}$ and $v' \in \mathbf{v}$. The automorphism mapping $u' \mapsto u$ will map $v' \mapsto v$ for some node $v$. Clearly $\{u, v\}$ is an edge, and $v \in \mathbf{v}$. Main case: assume the statement is true for all pairs of orbits with path length $\le n$. Suppose $\mathbf{u}$-$\mathbf{v}$ is a path of length $n+1$. Take the orbit $\mathbf{z}$ immediately before $\mathbf{v}$ in the path, so that $\mathbf{u}$-$\mathbf{z}$ is a path of length $n$, and $\mathbf{z}$ and $\mathbf{v}$ are adjacent. By the inductive assumption, there exists $z \in \mathbf{z}$ such that $u$ is connected to $z$.
Algorithm 1: Conditional gradient for optimizing the lifted TRW problem
1: $k = 0$; $\bar\tau^{(0)} \leftarrow$ uniform
2: Direction finding via lifted MAP: $s_k = \arg\max_{s_k \in \mathbb{O}} \langle s_k,\, \bar\theta - \nabla_{\bar\tau} B^*(\bar\tau_k, \bar\rho) \rangle$
3: Step-size finding via golden section search: $\lambda_k = \arg\max_{\lambda \in [0,1]} \lambda \langle s_k - \bar\tau_k, \bar\theta \rangle - B^*(\bar\tau_k(1-\lambda) + s_k\lambda,\, \bar\rho)$
4: Update: $\bar\tau_{k+1} = \bar\tau_k(1-\lambda_k) + s_k\lambda_k$
5: $k \leftarrow k + 1$
6: if not converged, go to 2

Applying the same argument as in the base case, there exists $v \in \mathbf{v}$ such that $\{z, v\}$ is an edge. Thus $u$ is connected to $v$.

We now return to the main proof of Lemma 5.

Proof. If the ground graph $G$ has only one component, then the claim is trivially true. Let $G_1$ and $G_2$ be two distinct connected components of $G$, let $u_1$ be a node in $G_1$, and let $\mathbf{u}$ be the orbit containing $u_1$. Let $v_2$ be any node in $G_2$. Since the lifted graph $\bar G$ is connected, all orbits are reachable from one another in $\bar G$. By the above lemma, there must be some node $u_2 \in \mathbf{u}$ reachable from $v_2$, hence $u_2 \in G_2$ (if $v_2 \in \mathbf{u}$ then we just take $u_2 = v_2$). This establishes that the node orbit $\mathbf{u}$ intersects both $G_1$ and $G_2$. Note that $u_1 \ne u_2$, since otherwise $G_1 = G_2$. Let $\pi$ be an automorphism that takes $u_1$ to $u_2$. We now show that $\pi(G_1) = G_2$. Since $\pi$ maps edges to edges and non-edges to non-edges, it is sufficient to show that $\pi(V(G_1)) = V(G_2)$. Let $z_1$ be a node of $G_1$ and $z_2 = \pi(z_1)$. Since $G_1$ is connected, there exists a path from $u_1$ to $z_1$. But $\pi$ must map this path to a path from $u_2$ to $z_2$, hence $z_2 \in V(G_2)$. Thus $\pi(V(G_1)) \subset V(G_2)$. Now, let $z_2$ be a node of $G_2$ and let $z_1 = \pi^{-1}(z_2)$. By a similar argument, $\pi^{-1}$ must map the path from $u_2$ to $z_2$ to a path from $u_1$ to $z_1$, hence $z_1 \in V(G_1)$. Thus $V(G_2) \subset \pi(V(G_1))$, and we have indeed shown that $\pi(V(G_1)) = V(G_2)$. Hence all connected components of $G$ are isomorphic.
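The structure of Algorithm 1 above (a linear oracle for the direction, golden section search for the step size, a duality-gap stopping test) can be sketched generically. The toy objective below, projecting a point onto the probability simplex, is only a stand-in for the lifted TRW objective:

```python
import numpy as np

def golden_section(phi, a=0.0, b=1.0, tol=1e-8):
    """Minimize a 1-D unimodal function phi on [a, b]."""
    gr = (5 ** 0.5 - 1) / 2
    while b - a > tol:
        c, d = b - gr * (b - a), a + gr * (b - a)
        if phi(c) < phi(d):
            b = d
        else:
            a = c
    return (a + b) / 2

target = np.array([0.2, 0.3, 0.5])   # illustrative; lies inside the simplex
f = lambda x: float(np.sum((x - target) ** 2))

x = np.full(3, 1.0 / 3.0)            # start at the uniform point
for _ in range(500):
    g = 2 * (x - target)             # gradient of f at x
    s = np.zeros(3)
    s[int(np.argmin(g))] = 1.0       # linear oracle: best simplex vertex
    if g @ (x - s) < 1e-10:          # Frank-Wolfe duality gap -> converged
        break
    lam = golden_section(lambda t: f((1 - t) * x + t * s))
    x = (1 - lam) * x + lam * s      # convex-combination update

assert f(x) < 1e-4                   # converged near the projection
```

In the lifted setting the simplex vertices are replaced by the vertices of the lifted outer bound, and the linear oracle becomes the lifted MAP LP of Step 2.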
Given one connected component $G_0$, the number of connected components of $G$ is $|C(G)| = |V(G)| / |V(G_0)|$.

Lifted Kruskal's Algorithm

See Algorithm 2. The algorithm keeps track of the list of connected components of the lifted graph $\bar G_i$ in a disjoint-set data structure, just as Kruskal's algorithm does. Line 17 and line 20 follow from Lemma 5 and Lemma 4, respectively. The number of ground nodes of a lifted graph is computed as the sum of the sizes of its node orbits.

Proof of Theorem 6.

We give two different proofs of this theorem. The first proof demonstrates how the constraints are derived methodically by lifting the ground constraints of the exact marginal polytope. The second proof is more intuitive and illustrates what each variable in the system of constraints represents conceptually.

Proof 1.

Proof. First consider the configurations $C$ such that $C(u_1, u_2) = (0, 0)$. From Eq. (10), after substituting the ground variables $\tau^\chi_C$ by the lifted variables $\bar\tau^\chi_{|C|}$, where $|C|$ denotes the number of 1's in the configuration, we obtain the lifted constraint
$$\exists\, \bar\tau^\chi_0 \ldots \bar\tau^\chi_n : \sum_{C \;\text{s.t.}\; C(u_1,u_2)=(0,0)} \bar\tau^\chi_{|C|} = \bar\tau_{e(\chi):00}$$
Let us now simplify the summation. Since $C(u_1, u_2) = (0, 0)$, $|C|$ can range from $0$ to $n-2$. For each value of $|C| = k$ there are $\binom{n-2}{k}$ different configurations. As a result, we can compactly write the above lifted constraint as
$$\sum_{k=0}^{n-2} \binom{n-2}{k} \bar\tau^\chi_k = \bar\tau_{e(\chi):00}$$
Note that every edge $\{u_1, u_2\}$ results in exactly the same constraint. Similarly, when $C(u_1, u_2) = (1, 1)$ we obtain the lifted constraint
$$\sum_{k=0}^{n-2} \binom{n-2}{k} \bar\tau^\chi_{k+2} = \bar\tau_{e(\chi):11}$$
and when $C(u_1, u_2) = (0, 1)$ we obtain
$$\sum_{k=0}^{n-2} \binom{n-2}{k} \bar\tau^\chi_{k+1} = \bar\tau_{a(\chi):01}$$
There is no need to consider the $(1, 0)$ case, since we can always re-order the pair $(u_1, u_2)$.
Algorithm 2: Lifted Kruskal's algorithm
Find: $\bar\rho$, a solution of (9) at the orbit level
Input: lifted graph $\bar G$ and its set of edge orbits
1: sort edge orbits in decreasing weight order ($\mathbf{e}_1 \ldots \mathbf{e}_k$)
2: $C = \emptyset$  {set of connected components of the lifted graph $\bar G_i$, as a disjoint set}
3: $GC$ = empty map  {hashmap from elements of $C$ to their number of ground components}
4: $numGC = 0$  {number of ground connected components of $\bar G_i$}
5: $numGV = 0$  {number of ground nodes of $\bar G_i$}
6: $GV_{max}$ = #GroundNode($\bar G$)  {number of ground nodes of $\bar G$}
7: for $i = 1, \ldots, k$ do
8:   $C_{old}$ = find elements in $C$ containing the end nodes of $\mathbf{e}_i$
9:   $GC_{old} = \sum_{S \in C_{old}} GC(S)$
10:  $\delta_V = \sum_{\mathbf{v} :\, \mathbf{v} \in Nb(\mathbf{e}_i),\, \mathbf{v} \notin C_{old}} |\mathbf{v}|$
11:  $numGV = numGV + \delta_V$
12:  $H$ = union of the members of $C_{old}$ and $\mathbf{e}_i$
13:  $C \leftarrow$ remove the members of $C_{old}$ and add $H$
14:  $\mathbf{u}$ = pick a node orbit of $H$; $u$ = representative($\mathbf{u}$)
15:  $H_{fixed}$ = lifted graph of $H$ after fixing the node $u$
16:  $K$ = the component containing $\{u\}$ in $H_{fixed}$
17:  $GC(H) \leftarrow$ #GroundNode($H$) / #GroundNode($K$)
18:  $\delta_C = GC(H) - GC_{old}$
19:  $numGC \leftarrow numGC + \delta_C$
20:  $\bar\rho(\mathbf{e}_i) = \frac{1}{|\mathbf{e}_i|}[\delta_V - \delta_C]$
21:  if $numGV = GV_{max}$ and $numGC = 1$ then  {no new ground nodes, one ground connected component}
22:    break  {future $\delta_V$ and $\delta_C$ must be 0}
23:  end if
24: end for
25: $\bar\rho(\mathbf{e}_j) = 0$ for all $j = i+1, \ldots, k$
26: return $\bar\rho$

Finally, letting $c^\chi_k = \binom{n}{k} \bar\tau^\chi_k$, we arrive at the set of constraints of Theorem 6.

Proof 2.

Proof. Let $C(0,0)$ denote the number of $(0,0)$ edges in $\chi$, where a $(0,0)$ edge is an edge whose two end-nodes both receive the assignment $0$. To show that the first equality holds, we compute the expectation of $C(0,0)$ in two different ways. First, $\mathbb{E}[C(0,0)]$ is the sum, over all edges, of the probability that the edge is $(0,0)$.
Due to exchangeability, these probabilities are all the same and equal to $\bar{\tau}_{e(\chi):00}$, so
$$E[C(0,0)] = \frac{n(n-1)}{2}\, \bar{\tau}_{e(\chi):00}.$$
Second, $C(0,0)$ conditioned on the event that there are precisely $k$ 1's in $\chi$ is $\frac{(n-k)(n-k-1)}{2}$ for $k \le n-2$, and 0 if $k > n-2$. Now let $c_{\chi_k}$ be the probability of this event; then
$$E[C(0,0)] = E\Big[E\big[C(0,0) \,\big|\, \textstyle\sum_{x_i \in \chi} x_i = k\big]\Big] = \sum_{k=0}^{n-2} \frac{(n-k)(n-k-1)}{2}\, c_{\chi_k}.$$
This shows that the first equality holds. The second equality is obtained similarly by considering the expectation of $C(1,1)$, the number of $(1,1)$ edges in $\chi$. The last equality is obtained by considering the expectation of $C(0,1)$, the number of $(0,1)$ arcs.

Protein-Protein Interaction (PPI) Model

The PPI model we considered is exactly the same as an exponential family random graph model with 3 graph statistics: edge count, 2-chain count (a 2-chain is a triangle with one missing edge), and triangle count. The model specification in MLN (with an additional clause enforcing undirectedness of the random graph) is
$$\begin{array}{ll} \frac{1}{2}\theta_{11} & r(x, y) \\ \frac{1}{2}\theta_{110} & r(x, y) \wedge r(x, z) \wedge\, !r(y, z) \\ \frac{1}{6}\theta_{111} & r(x, y) \wedge r(y, z) \wedge r(x, z) \\ -\infty & !(r(x, y) \leftrightarrow r(y, x)) \end{array}$$
For this model, the edge appearance probabilities $\bar{\rho}$ are initialized so that the edges corresponding to the hard clause are always selected.

Figure 7: Log-partition bounds for the PPI model (domain size 813): $\log Z$ as a function of $\theta_{011}$ for LTRW-L, LTRW-C, LTRW-LE, and LTRW-CE.

Figure 8: $G$ and its equitable partition. No element of the spanning tree polytope is uniform in the same cell of the equitable partition.
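The double-counting argument in Proof 2 can be verified by brute force for a small $n$: pick any exchangeable distribution (one whose probabilities depend only on the number of 1's) and compute $E[C(0,0)]$ both ways. A sketch with made-up weights, purely for illustration:

```python
from itertools import product
import random

# Sanity check of the double-counting argument for small n, under an arbitrary
# exchangeable distribution: P(chi) depends only on the number of 1's in chi.
n = 5
random.seed(0)
w = [random.random() for _ in range(n + 1)]          # one weight per count of 1's
configs = list(product([0, 1], repeat=n))
Z = sum(w[sum(c)] for c in configs)
p = {c: w[sum(c)] / Z for c in configs}

# Way 1: sum over all n(n-1)/2 edges of P(both ends are 0); by exchangeability
# every term equals tau_00 = P(x_0 = 0, x_1 = 0).
tau_00 = sum(p[c] for c in configs if c[0] == 0 and c[1] == 0)
way1 = n * (n - 1) / 2 * tau_00

# Way 2: condition on k = number of 1's; given k, the number of (0,0) edges
# is exactly (n-k)(n-k-1)/2.
c_k = [sum(p[c] for c in configs if sum(c) == k) for k in range(n + 1)]
way2 = sum((n - k) * (n - k - 1) / 2 * c_k[k] for k in range(n + 1))

assert abs(way1 - way2) < 1e-9  # the two expectations agree
```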
The set $\{r(x, y) \mid x \text{ fixed},\, y \text{ varies}\}$ is an exchangeable cluster, so exchangeable constraints can be enforced on the orbit corresponding to this cluster. The log-partition upper bounds of lifted TRW with different outer bounds are shown in Fig. 7. The left panel shows the region with negative $\theta_{011}$ and the right panel the region with positive $\theta_{011}$ ($\theta_{111}$ is held fixed at $-0.021$). For negative $\theta_{011}$, the combination of cycle and exchangeable constraints (CE) is crucial for improving the upper bound. For positive $\theta_{011}$, exchangeable constraints (LE) are already sufficient to yield the best bound.

Equitable Partition and the Spanning Tree Polytope

Example 9. We give an example where the symmetrized spanning tree polytope corresponding to the (edge) equitable partition $\varepsilon$, $T[\varepsilon] = T(G) \cap \mathbb{R}^{|E|}[\varepsilon]$, is empty. The example graph $G$ is shown in Fig. 8. There are two edge types in the coarsest equitable partition: solid and dashed. The dashed edge in the middle is a bridge: it must appear in every spanning tree, so every $\rho \in T$ must assign weight 1 to this edge. If $T[\varepsilon] \neq \emptyset$, then there is some $\rho \in T[\varepsilon]$ that assigns the same weight to all dashed edges; therefore it assigns weight 1 to the two dashed edges on the left- and right-hand sides as well. The remaining solid edges have equal weight, and since the total weight is $|V| - 1 = 7$, each solid edge has weight $(7-3)/6 = 2/3$. Now consider the triangle on the left. This triangle has total weight $1 + 4/3 = 7/3 > 2$, which violates the spanning tree polytope constraint that the edges within any 3-vertex set have total weight at most 2. Thus, for this graph, $T[\varepsilon]$ must be empty.
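The arithmetic in Example 9 can be spelled out explicitly. The edge counts below (3 dashed, 6 solid, 8 nodes, one dashed and two solid edges per triangle) are read off the description of Fig. 8; the script merely replays the derivation with exact rationals:

```python
from fractions import Fraction

# Edge counts for the graph of Fig. 8, as described in Example 9.
n_nodes, n_dashed, n_solid = 8, 3, 6

# The middle dashed edge is a bridge, so rho = 1 on it; a symmetric point of
# T[eps] then assigns rho = 1 to every dashed edge.
rho_dashed = Fraction(1)

# Any point of the spanning tree polytope has total weight |V| - 1 = 7,
# so the 6 solid edges split the remaining weight equally.
rho_solid = (Fraction(n_nodes - 1) - n_dashed * rho_dashed) / n_solid
assert rho_solid == Fraction(2, 3)

# Each triangle contains 1 dashed and 2 solid edges; the spanning tree
# polytope requires total weight <= |S| - 1 = 2 inside any 3-vertex set S.
triangle_weight = rho_dashed + 2 * rho_solid
print(triangle_weight)       # 7/3
assert triangle_weight > 2   # constraint violated, hence T[eps] is empty
```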
