Exact or approximate inference in graphical models: why the choice is dictated by the tree width, and how variable elimination can be exploited

Nathalie Peyrard (a), Marie-Josée Cros (a), Simon de Givry (a), Alain Franc (b), Stéphane Robin (c,d), Régis Sabbadin (a), Thomas Schiex (a), Matthieu Vignes (a,e)

(a) INRA UR 875 MIAT, Chemin de Borde Rouge, 31326 Castanet-Tolosan, France
(b) INRA UMR 1202, Biodiversité, Gènes et Communautés, 69 route d'Arcachon, Pierroton, 33612 Cestas Cedex, France
(c) AgroParisTech, UMR 518 MIA, 16 rue Claude Bernard, Paris 5e, France
(d) INRA, UMR 518 MIA, 16 rue Claude Bernard, Paris 5e, France
(e) Institute of Fundamental Sciences, Massey University, Palmerston North, New Zealand

Abstract

Probabilistic graphical models offer a powerful framework to account for the dependence structure between variables, which is represented as a graph. However, the dependence between variables may render inference tasks intractable. In this paper we review techniques exploiting the graph structure for exact inference, borrowed from optimisation and computer science. They are built on the principle of variable elimination, whose complexity is dictated in an intricate way by the order in which variables are eliminated. The so-called tree width of the graph characterises this algorithmic complexity: low-treewidth graphs can be processed efficiently. The first message that we illustrate is therefore that, for inference in graphical models, the number of variables is not the limiting factor, and it is worth checking the tree width before turning to approximate methods. We show how algorithms providing an upper bound of the tree width can be exploited to derive a 'good' elimination order enabling exact inference.
The second message is that when the tree width is too large, algorithms for approximate inference linked to the principle of variable elimination, such as loopy belief propagation and variational approaches, can lead to accurate results while being much less time consuming than Monte-Carlo approaches. We illustrate the techniques reviewed in this article on benchmarks of inference problems in genetic linkage analysis and computer vision, as well as on hidden variable restoration in coupled Hidden Markov Models.

Keywords: computational inference; marginalisation; mode evaluation; message passing; variational approximations.

1 Introduction

Graphical models (Lauritzen, 1996; Bishop, 2006; Koller and Friedman, 2009; Barber, 2012; Murphy, 2012) are formed by variables linked to each other by stochastic relationships. They enable the modelling of dependencies in possibly high-dimensional heterogeneous data and the capture of uncertainty. Graphical models have been applied in a wide range of areas where elementary units locally interact with each other, like image analysis (Solomon and Breckon, 2011), speech recognition (Baker et al., 2009), bioinformatics (Liu et al., 2009; Maathuis et al., 2010; Höhna et al., 2014) and ecology (Illian et al., 2013; Bonneau et al., 2014; Carriger and Barron, 2016), to name a few. Real applications involve a large number of random variables with a complex dependency structure. As a consequence, inference tasks such as the calculation of a normalisation constant, of a marginal distribution or of the mode of the joint distribution can be challenging. Three main approaches exist to evaluate such quantities for a given distribution $\Pr$ defining a graphical model: (a) compute them in an exact manner; (b) use a stochastic algorithm to sample from the distribution $\Pr$ to get (unbiased) estimates; (c) derive an approximation of $\Pr$ for which the exact calculation is possible.
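The contrast between approaches (a) and (b) can be sketched on a toy model. The two-variable potential table below is entirely made up for the illustration; the Monte-Carlo estimator shown is the simplest possible one (uniform sampling), not one of the algorithms cited later.

```python
import itertools
import random

# Toy model: two binary variables, one made-up pairwise potential table.
random.seed(0)
psi = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 4.0}

# (a) exact: enumerate the whole state space (feasible only for small models).
Z = sum(psi[x] for x in itertools.product((0, 1), repeat=2))

# (b) stochastic: estimate Z as |state space| times the mean potential
# under uniform sampling of realisations (unbiased, but noisy).
samples = [psi[(random.randint(0, 1), random.randint(0, 1))] for _ in range(200000)]
Z_hat = 4 * sum(samples) / len(samples)
```

Here exact enumeration gives $Z = 8$, while the sampling estimate only approaches it, illustrating "approximate answer to an exact problem".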
Even if appealing, exact computation on $\Pr$ can lead to very time- and memory-consuming procedures for large problems. The second approach is probably the most widely used by statisticians and modellers. Stochastic algorithms such as Markov chain Monte-Carlo (MCMC) (Robert and Casella, 2004), Gibbs sampling (Geman and Geman, 1984; Casella and George, 1992) and particle filtering (Gordon et al., 1993) have become standard tools in many fields of application using statistical models. The last approach includes variational approximation techniques (Wainwright and Jordan, 2008), which are starting to become common practice in computational statistics. In essence, approaches of type (b) provide an approximate answer to an exact problem, whereas approaches of type (c) provide an exact answer to an approximate problem. In this paper, we focus on approaches of types (a) and (c), and we review techniques for exact or approximate inference in graphical models borrowed from both optimisation and computer science. They are computationally efficient, yet not always standard in the statistician's toolkit. The characterisation of the structure of the graph $G$ associated to a graphical model (precise definitions are given in Section 2) enables both to determine whether the exact calculation of the quantities of interest (marginal distribution, normalisation constant, mode) can be implemented efficiently and to derive a class of operational algorithms. When the exact calculation cannot be achieved efficiently, a similar analysis of the problem enables the practitioner to design algorithms that compute an approximation of the desired quantities with an acceptable associated complexity. Our aim is to provide the reader with the key elements to understand the power of these tools for statistical inference in graphical models. The central algorithmic tool we focus on in this paper is the variable elimination concept (Bertelè and Brioschi, 1972).
In Section 3 we adopt a unified algebraic presentation of the different inference tasks (marginalisation, normalising constant or mode evaluation) to emphasise that each of them can be solved using a particular case of a variable elimination scheme. Consequently, the work done to demonstrate that variable elimination is efficient for one task carries over to the other ones. The key ingredient for designing efficient algorithms based on variable elimination is the clever use of distributivity between algebraic operators. For instance, distributivity of the product ($\times$) over the sum ($+$) enables us to write $(a \times b) + (a \times c) = a \times (b + c)$; evaluating the left-hand side of this equality requires two multiplications and one addition, while evaluating the right-hand side requires one multiplication and one addition. Similarly, since $\max(a + b, a + c) = a + \max(b, c)$, it is more efficient to compute the right-hand side from an algorithmic point of view. Distributivity enables us to minimise the number of operations. To perform variable elimination, associativity and commutativity properties are also required, and the underlying algebra is that of a semi-ring (from which some notations will be borrowed). Inference algorithms using the distributivity property have been known and published in the Artificial Intelligence and Machine Learning literature under different names, such as sum-product or max-sum (Pearl, 1988; Bishop, 2006). They are typical examples of variable elimination procedures. Variable elimination relies on the choice of an order of elimination of the variables, via successive marginalisation or maximisation operations. The calculations are performed according to this ordering when applying distributivity. The topology of the graph $G$ provides key information to optimally organise the calculations so as to minimise the number of elementary operations to perform.
For example, when the graph is a tree, the most efficient elimination order corresponds to eliminating recursively the vertices of degree one. One starts from the leaves towards the root, and inner nodes of higher degree successively become leaves. The notion of an optimal elimination order for inference in an arbitrary graphical model is closely linked to the notion of tree width of the associated graph $G$. We will see in Section 3 why inference algorithms based on variable elimination with the best elimination order have complexity linear in $n$, the number of variables/nodes in the graph (i.e. the size of the graph), but exponential in the treewidth. Therefore the treewidth is one of the main characteristics of $G$ that determine whether exact inference is possible in practice or not. This notion has led to the development of several works for solving apparently complex inference problems, which have then been applied in biology (e.g. Tamura and Akutsu 2014). More details on these methodological and applied results are provided in the Conclusion section. The concept of treewidth has been proposed in parallel in computer science (Bodlaender, 1994) and in discrete mathematics and graph minor theory (see Robertson and Seymour 1986; Lovász 2005). Discrete mathematics existence theorems (Robertson and Seymour, 1986) establish that there exists an algorithm for computing the treewidth of any graph with complexity polynomial in $n$ (but exponential in the treewidth), and the degree of the polynomial is determined. However, this result does not tell how to derive and implement the algorithm, apart from some very specific cases such as trees, chordal graphs, and series-parallel graphs (Duffin, 1965). Section 4 introduces the reader to several state-of-the-art algorithms that provide an upper bound of the treewidth, together with an associated elimination order.
These algorithms are therefore useful tools to test whether exact inference is achievable and, if applicable, to derive an exact inference algorithm based on variable elimination. Their behaviour is illustrated on benchmarks borrowed from combinatorial optimisation competitions. Variable elimination also leads to message passing algorithms (Pearl, 1988), which are now common tools in computer science and machine learning for marginal or mode evaluation. More recently, these algorithms have been reinterpreted as a way to re-parameterise the original graphical model into an updated one with different potential functions that still represents the same joint distribution (Koller and Friedman, 2009). We explain in Section 5 how re-parameterisation can be used as a pre-processing tool to obtain a new parameterisation with which inference becomes simpler. Message passing is not the only way to perform re-parameterisation, and we discuss alternative efficient algorithms proposed in the context of Constraint Satisfaction Problems (CSP, see Rossi et al. 2006). These latter ones have, to the best of our knowledge, not yet been exploited in the context of graphical models. As emphasised above, efficient exact inference algorithms can only be designed for graphical models with limited treewidth, i.e. much less than the number of vertices. Although this is not the case for many graphs, the principles of variable elimination and message passing for a tree can be applied to any graph, leading to heuristic inference algorithms. The most famous heuristic is the Loopy Belief Propagation algorithm (LBP, see Kschischang et al. 2001). We recall in Section 6 the result that establishes LBP as a variational approximation method. Variational methods rely on the choice of a distribution which renders inference easier. They approximate the original complex graphical model.
The approximate distribution is chosen within a class of models for which efficient inference algorithms exist, that is, models with small treewidth (0, 1 or 2 in practice). We review some standard choices of approximate distributions, each of which corresponds to a different underlying treewidth. Finally, Section 7 illustrates the techniques reviewed in the article on the case of Coupled Hidden Markov Models (CHMM, see Brand 1997). We first compare them on the problem of mode inference in a CHMM devoted to the study of pest propagation. Then we exemplify the use of different variational methods for EM-based parameter estimation in CHMM.

2 Graphical Models

2.1 Model definition

Consider a stochastic system defined by a set of random variables $X = (X_1, \ldots, X_n)^\top$. Each variable $X_i$ takes values in $\Lambda_i$. A realisation of $X$ is denoted $x = (x_1, \ldots, x_n)^\top$, with $x_i \in \Lambda_i$. The set of all possible realisations is called the state space, and is denoted $\Lambda = \prod_{i=1}^n \Lambda_i$. If $A$ is a subset of $V = \{1, \ldots, n\}$, then $X_A$, $x_A$ and $\Lambda_A$ are respectively the subset of random variables $\{X_i, i \in A\}$, a possible realisation $\{x_i, i \in A\}$ of $X_A$, and the state space of $X_A$. If $p$ is the joint probability distribution of $X$ on $\Lambda$, we denote, for all $x \in \Lambda$, $p(x) = \Pr(X = x)$. Note that we focus here on discrete variables (we will discuss inference in the case of continuous variables on examples in Section 8). A joint distribution $p$ on $\Lambda$ is said to be a probabilistic graphical model (Lauritzen, 1996; Bishop, 2006; Koller and Friedman, 2009) indexed on a set $\mathcal{B}$ of parts of $V$ if there exists a set $\Psi = \{\psi_B\}_{B \in \mathcal{B}}$ of maps from $\Lambda_B$ to $\mathbb{R}^+$, called potential functions, indexed by $\mathcal{B}$, such that $p$ can be expressed in the following factorised form:

$$p(x) = \frac{1}{Z} \prod_{B \in \mathcal{B}} \psi_B(x_B), \qquad (1)$$

where $Z = \sum_{x \in \Lambda} \prod_{B \in \mathcal{B}} \psi_B(x_B)$ is the normalising constant, also called the partition function.
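The factorisation (1) can be made concrete by brute-force enumeration on a toy model. The scopes and potential tables below are made up for the example; the point is only that $Z$ and $p$ are fully determined by the local potentials.

```python
import itertools
from functools import reduce

# Toy model: three binary variables, scopes B = {{0,1}, {1,2}},
# with arbitrary made-up positive potentials.
domains = {0: [0, 1], 1: [0, 1], 2: [0, 1]}
psi = {
    (0, 1): lambda x: 2.0 if x[0] == x[1] else 1.0,  # potential on (X0, X1)
    (1, 2): lambda x: 3.0 if x[1] == x[2] else 0.5,  # potential on (X1, X2)
}

def unnormalised(x):
    """Product over B of psi_B(x_B), i.e. Z * p(x)."""
    return reduce(lambda a, b: a * b, (f(x) for f in psi.values()), 1.0)

# Partition function Z: sum the unnormalised product over the whole state space.
state_space = list(itertools.product(*(domains[i] for i in sorted(domains))))
Z = sum(unnormalised(x) for x in state_space)

def p(x):
    return unnormalised(x) / Z
```

This exhaustive sum is exponential in $n$; the rest of the paper is precisely about avoiding it via variable elimination.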
The elements $B \in \mathcal{B}$ are the scopes of the potential functions and $|B|$ is the arity of the potential function $\psi_B$. The set of scopes of all the potential functions involving variable $X_i$ is denoted $\mathcal{B}_i = \{B \in \mathcal{B} : i \in B\}$. One desirable property of graphical models is that of Markov local independence: if $p(x)$ can be expressed as in (1), then a variable $X_i$ is (stochastically) independent of all others in $X$ conditionally on the set of variables $X_{(\cup_{B \in \mathcal{B}_i} B) \setminus i}$. The set $X_{(\cup_{B \in \mathcal{B}_i} B) \setminus i}$ is called the Markov blanket of $X_i$, or its neighbourhood (Koller and Friedman, 2009, chapter 4). It is denoted $N_i$. These conditional independences can be represented by a graph with one vertex per variable in $X$. The question of encoding the independence properties associated with a given distribution into a graph structure has been widely described (e.g. Koller and Friedman 2009, chapters 3 and 4), and we will not discuss it here. We consider the classical graph $G = (V, E)$ associated to the decomposition dictated in (1), where an edge is drawn between two vertices $i$ and $j$ if there exists $B \in \mathcal{B}$ such that $i$ and $j$ are both in $B$. Such a representation of a graphical model is actually not as rich as the representation in (1). For instance, if $n = 3$, the two cases $\mathcal{B} = \{\{1, 2, 3\}\}$ and $\mathcal{B} = \{\{1, 2\}, \{2, 3\}, \{3, 1\}\}$ are represented by the same graph $G$, namely a clique (i.e. a fully connected set of vertices) of size 3. Without loss of generality, we could impose in the definition of a graphical model that scopes $B$ correspond to cliques of $G$. In the above example where $\mathcal{B} = \{\{1, 2\}, \{2, 3\}, \{3, 1\}\}$, this can be done by defining $\psi'_{1,2,3} = \psi_{12}\, \psi_{23}\, \psi_{13}$. The original structure is then lost, and $\psi'$ is more costly to store than the original potential functions.
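The loss of information when passing from scopes to the graph $G$ can be checked mechanically. The helper below is a sketch (not from the paper) that derives the edge set of $G$ from a set of scopes; the two decompositions just discussed yield exactly the same graph.

```python
import itertools

def graph_from_scopes(scopes):
    """Edge set of G = (V, E): i -- j whenever i and j share some scope B."""
    edges = set()
    for B in scopes:
        for i, j in itertools.combinations(sorted(B), 2):
            edges.add((i, j))
    return edges

# Both decompositions produce the same 3-clique, so G cannot distinguish them.
edges = graph_from_scopes([{1, 2}, {2, 3}, {3, 1}])  # -> {(1, 2), (1, 3), (2, 3)}
```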
The factor graph representation goes beyond this limitation of the representation $G$: this graphical representation is a bipartite graph with one vertex per potential function and one vertex per variable, with edges only between functions and variables. An edge is present between a function vertex (also called factor vertex) and a variable vertex if and only if the variable is in the scope of the potential function. Figure 1 displays examples of the two graphical representations. Several families of probabilistic graphical models exist (Koller and Friedman, 2009; Murphy, 2012). They can be grouped into directed and undirected ones. The most classical directed framework is that of the Bayesian network (Pearl, 1988; Jensen and Nielsen, 2007). In a Bayesian network, potential functions are conditional probabilities of a variable given its parents. In such models, trivially $Z = 1$. There is a representation by a directed graph where an edge is directed from a parent vertex to a child vertex (see Figure 1 (a)). The undirected graphical representation $G$ is obtained by moralisation, i.e. by adding an edge between two parents of a same variable. Undirected probabilistic graphical models (see Figure 1 (c)) are equivalent to Markov Random Fields (MRF, Li 2001) as soon as the potential functions take values in $\mathbb{R}^+ \setminus \{0\}$. In a Markov random field (MRF), a potential function is not necessarily a probability distribution: $\psi_B$ is not required to be normalised (as opposed to a Bayesian network model).

Deterministic graphical models. Although the terminology of 'graphical models' is often used to refer to probabilistic graphical models, the idea of describing a joint interaction on a set of variables through local functions has also been used in Artificial Intelligence to concisely describe Boolean functions or cost functions, with no normalisation constraint.
Throughout this article we regularly refer to these deterministic graphical models, and we explain how the algorithms devoted to their optimisation can be directly applied to compute the mode in a probabilistic graphical model.

Figure 1: From left to right: (a) Graphical representation of a directed graphical model where potential functions define the conditional probability of each variable given its parents' values; (b) The corresponding factor graph, where every potential function is represented as a factor (square vertex) connected to the variables that are involved in it; (c) Graphical representation of an undirected graphical model. It is impossible from this graph to distinguish a graphical model defined by a unique potential function on vertices 3, 4 and 5 from a model defined by 3 pairwise potential functions over each pair (3,4), (3,5) and (4,5); (d) The corresponding factor graph, which unambiguously defines the potential functions, here three pairwise potential functions.

In a deterministic graphical model with only Boolean (0/1) potential functions, each potential function describes a constraint between variables. If the potential function takes value 1, the corresponding realisation is said to satisfy the constraint; if it takes value 0, the realisation does not satisfy it. The graphical model is known as a 'constraint network'. It describes a joint Boolean function on all variables that takes value 1 if and only if all constraints are satisfied. The problem of finding a realisation that satisfies all the constraints, called a solution of the constraint network, is the 'Constraint Satisfaction Problem' (CSP, Rossi et al. 2006). This framework is used to model and solve combinatorial optimisation problems, and there is a wide variety of software tools to solve it.
CSPs have been extended to describe joint cost functions, decomposed as a sum of local cost functions $f_B$, in the 'Weighted Constraint Network' (Rossi et al., 2006) or 'Cost Function Network':

$$f(x) = \sum_{B \in \mathcal{B}} f_B(x_B).$$

In this case, cost functions take finite or infinite integer or rational values: infinity enables the expression of hard constraints, while finite values encode costs for unsatisfied soft constraints. The problem of finding a realisation of minimum cost is the 'Weighted Constraint Satisfaction Problem' (WCSP), which is NP-hard. It is easy to observe that any probabilistic graphical model can be translated into a weighted constraint network, and vice versa, using a simple $-\ln(\cdot)$ transformation:

$$f_B(x_B) = -\ln \psi_B(x_B), \quad \text{with } f_B(x_B) = +\infty \Leftrightarrow \psi_B(x_B) = 0.$$

Therefore the WCSP is equivalent to finding a realisation with maximal probability in a probabilistic graphical model. With this equivalence, it becomes possible to use exact WCSP resolution algorithms developed in this field for mode evaluation or for the computation of $Z$, the normalising constant, in a probabilistic graphical model. See for instance Viricel et al. (2016) for an application to a problem of protein design.

2.2 Inference tasks in probabilistic graphical models

Computations on probabilities and potentials rely on two fundamental types of operations. Firstly, multiplication (or addition in the log domain) is used to combine potentials to define a joint potential distribution. Secondly, sum or max/min can be used to eliminate variables and compute marginals or modes of the joint distribution on subsets of variables. The precise identity of these two basic operations is not important for the inference algorithms based on variable elimination. We therefore adopt a presentation using generic operators to emphasise this property of the algorithms. We denote by $\otimes$ the combination operator and by $\oplus$ the elimination operator.
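The two generic operators can be sketched directly on tabular potentials. The implementation below is a minimal illustration, not the paper's algorithm: potentials are dictionaries mapping realisations of a scope to values, and the operators $\otimes$ and $\oplus$ are passed in as ordinary Python callables.

```python
import itertools

def combine(psiA, scopeA, psiB, scopeB, domains, otimes):
    """Tabular combine: (psiA otimes psiB) over the union of the two scopes."""
    scope = sorted(set(scopeA) | set(scopeB))
    out = {}
    for x in itertools.product(*(domains[i] for i in scope)):
        assign = dict(zip(scope, x))
        out[x] = otimes(psiA[tuple(assign[i] for i in scopeA)],
                        psiB[tuple(assign[i] for i in scopeB)])
    return out, scope

def eliminate(psi, scope, var, domains, oplus):
    """Tabular eliminate: fold variable `var` out of psi with the oplus operator."""
    rest = [i for i in scope if i != var]
    out = {}
    for x in itertools.product(*(domains[i] for i in rest)):
        assign = dict(zip(rest, x))
        acc = None
        for v in domains[var]:
            assign[var] = v
            val = psi[tuple(assign[i] for i in scope)]
            acc = val if acc is None else oplus(acc, val)
        out[x] = acc
    return out, rest

# Demo with made-up unary potentials: combine with x, then sum variable 1 out.
dom = {0: [0, 1], 1: [0, 1]}
c, sc = combine({(0,): 2.0, (1,): 3.0}, [0],
                {(0,): 5.0, (1,): 7.0}, [1], dom, lambda a, b: a * b)
m, _ = eliminate(c, sc, 1, dom, lambda a, b: a + b)
```

Swapping the callables (e.g. `max` for the sum) changes the task without touching the elimination machinery, which is exactly the point of the generic presentation.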
To be able to apply the variable elimination algorithm, the only requirement is that $(\mathbb{R}^+, \oplus, \otimes)$ defines a commutative semi-ring. Specifically, the semi-ring algebra offers distributivity: $(a \otimes b) \oplus (a \otimes c) = a \otimes (b \oplus c)$. For instance, this corresponds to the distributivity of the product operation over the sum operation, i.e. $(a \times b) + (a \times c) = a \times (b + c)$, or to the distributivity of the max operation over the sum operation, i.e. $\max(a + b, a + c) = a + \max(b, c)$, or to the distributivity of the max operation over the product operation, i.e. $\max(a \times b, a \times c) = a \times \max(b, c)$. We extend the definition of the two abstract operators $\otimes$ and $\oplus$ to operators on potential functions, as follows:

Combine operator: the combination of two potential functions $\psi_A$ and $\psi_B$ is a new function $\psi_A \otimes \psi_B : \Lambda_{A \cup B} \to \mathbb{R}^+$ defined as $(\psi_A \otimes \psi_B)(x_{A \cup B}) = \psi_A(x_A) \otimes \psi_B(x_B)$.

Elimination operator: the elimination of variable $X_i$, $i \in B$, from a potential function $\psi_B$ is a new function $(\oplus_{x_i} \psi_B) : \Lambda_{B \setminus \{i\}} \to \mathbb{R}^+$ defined as $(\oplus_{x_i} \psi_B)(x_{B \setminus \{i\}}) = \oplus_{x_i} (\psi_B(x_{B \setminus \{i\}}, x_i))$. For $\oplus = +$, $(\oplus_{x_i} \psi_B)(x_{B \setminus \{i\}})$ represents the marginal sum $\sum_{x_i} \psi_B(x_{B \setminus \{i\}}, x_i)$.

Classical counting and optimisation tasks in graphical models can now be entirely written with these two operators. For simplicity, we denote by $\oplus_{x_B}$, where $B \subset V$, a sequence of eliminations $\oplus_{x_i}$ for all $i \in B$, the result being insensitive to the order in a commutative semi-ring. Similarly, $\otimes_{B \in \mathcal{B}}$ represents the successive combination of all potential functions $\psi_B$, with $B \in \mathcal{B}$.

Counting task. Under this name we group all tasks that involve summing over the state space of a subset of variables in $X$. This includes the computation of the partition function $Z$ or of any marginal distribution, as well as entropy evaluation.
For $A \subset V$ and $\bar{A} = V \setminus A$, the marginal distribution $p_A$ of $X_A$ associated to the joint distribution $p$ is defined as:

$$p_A(x_A) = \sum_{x_{\bar{A}} \in \Lambda_{\bar{A}}} p(x_A, x_{\bar{A}}) = \frac{1}{Z} \sum_{x_{\bar{A}} \in \Lambda_{\bar{A}}} \prod_{B \in \mathcal{B}} \psi_B(x_B).$$

The function $p_A$ then satisfies ($Z$ being a constant function):

$$p_A \otimes Z = \oplus_{x_{\bar{A}}} \otimes_{B \in \mathcal{B}} \psi_B,$$

where $\otimes$ combines functions using $\times$ and $\oplus$ eliminates variables using $+$. Marginal evaluation is also interesting in the case where some variables are observed. If $x_O$ ($O \subset V$) are the values of the observed variables, the marginal conditional distribution can be computed by restricting the domains of the variables $X_O$ to the observed values. This is typically the kind of computational task required in the E-step of an EM algorithm, for parameter estimation of models with hidden data.

Optimisation task. The most common optimisation task in a graphical model corresponds to the evaluation of the most probable state $x^*$ of the random vector $X$, defined as

$$x^* = \arg\max_{x \in \Lambda} p(x) = \arg\max_{x \in \Lambda} \prod_{B \in \mathcal{B}} \psi_B(x_B) = \arg\max_{x \in \Lambda} \sum_{B \in \mathcal{B}} \ln \psi_B(x_B).$$

The maximum itself is $\oplus_x (\otimes_{B \in \mathcal{B}} \ln \psi_B(x_B))$ with $\oplus$ and $\otimes$ set to $\max$ and $+$, respectively. The computation of the mode $x^*$ does not require the computation of the normalising constant $Z$; however, evaluating the mode probability value $p(x^*)$ does. Another optimisation task of interest is the computation of the max-marginals of each variable $X_i$, defined as $p^*(x_i) = \max_{x_{V \setminus i}} p(x)$. Therefore counting and optimisation tasks can be interpreted as two instantiations of the same computational task expressed in terms of combination and elimination operators, namely $\oplus_{x_A} \otimes_{B \in \mathcal{B}} \psi_B$, where $A \subseteq V$. When the combination operator $\otimes$ and the elimination operator $\oplus$ are set to $\times$ and $+$, respectively, this computational problem is known as a sum-product problem in the Artificial Intelligence literature (Pearl, 1988; Bishop, 2006, chapter 8).
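That counting and optimisation are two instantiations of the same task can be seen on a two-variable toy table (the numbers below are made up): the same elimination loop computes a sum-marginal or a max-marginal, depending only on the choice of $\oplus$.

```python
# Made-up joint table over two binary variables (X0, X1).
psi = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.3, (1, 1): 0.2}

def eliminate_x1(psi, oplus):
    """Eliminate X1 from psi(x0, x1), leaving a function of x0 alone."""
    return {x0: oplus([psi[(x0, x1)] for x1 in (0, 1)]) for x0 in (0, 1)}

marginal = eliminate_x1(psi, sum)  # counting task:     oplus = +
maxmarg = eliminate_x1(psi, max)   # optimisation task: oplus = max
```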
When $\oplus$ and $\otimes$ are set to $\max$ and to the sum operator, respectively, it is a max-sum problem (Bishop, 2006, chapter 8). In practice, this means that tasks such as solving the E-step of the EM algorithm or computing the mode in a graphical model belong to the same family of computational problems. We will see in Section 3 that there exists an exact algorithm solving this general task, which exploits the distributivity of the combination and elimination operators to perform operations in a smart order. From this generic algorithm, known as variable elimination (Bertelè and Brioschi, 1972) or bucket elimination (Dechter, 1999), one can deduce exact algorithms to solve counting and optimisation tasks in a graphical model by instantiating the operators $\oplus$ and $\otimes$.

Deterministic graphical models. The Constraint Satisfaction Problem is a $\vee$-$\wedge$ problem, as it can be defined using $\vee$ (logical 'or') as the elimination operator and $\wedge$ (logical 'and') as the combination operator over Booleans. The weighted CSP is a min-+ problem, as it uses $\min$ as the elimination operator and $+$ (or bounded variants of $+$) as the combination operator. Several other variants exist (Rossi et al., 2006), including generic algebraic variants (Schiex et al., 1995; Bistarelli et al., 1997; Cooper, 2004; Pralet et al., 2007; Kohlas, 2003).

2.3 Example: coupled HMM

We now introduce the example of Coupled Hidden Markov Models (CHMM), which can be seen as extensions of Hidden Markov Chain (HMC) models to several chains in interaction. In Section 7 we will use this framework to illustrate the behaviour of exact and approximate algorithms based on variable elimination.

Table 1: Definitions of the Combine ($\otimes$) and the Elimination ($\oplus$) operators for classical tasks on probabilistic and deterministic graphical models.
  Task                                      | $\oplus$ | $\otimes$
  ------------------------------------------|----------|----------
  Marginal evaluation                       | $+$      | $\times$
  Mode evaluation                           | $\max$   | $+$
  Existence of a solution in a CSP          | $\vee$   | $\wedge$
  Evaluation of the minimum cost in a WCSP  | $\min$   | $+$

Figure 2: Graphical representation of an HMM. Hidden variables correspond to vertices 1, 3, 5, 7, and observed variables to vertices 2, 4, 6, 8.

An HMC (Figure 2) is defined by two sequences of random variables $O$ and $H$ of the same length $T$. A realisation $o = (o_1, \ldots, o_T)^\top$ of the variables $O = (O_1, \ldots, O_T)^\top$ is observed, while the states of the variables $H = (H_1, \ldots, H_T)^\top$ are unknown (hidden). In the HMC model, the assumption is made that $O_i$ is independent of $H_{V \setminus \{i\}}$ and $O_{V \setminus \{i\}}$ given the hidden variable $H_i$. These independences are modelled by pairwise potential functions $\psi_{H_i, O_i}$, $\forall\, 1 \leq i \leq T$. Furthermore, hidden variable $H_i$ is independent of $H_1, \ldots, H_{i-2}$ and $O_1, \ldots, O_{i-1}$ given the hidden variable $H_{i-1}$. These independences are modelled by pairwise potential functions $\psi_{H_{i-1}, H_i}$, $\forall\, 1 < i \leq T$. The model is then fully defined by specifying an additional potential function $\psi_{H_1}(h_1)$ to model the initial distribution. In the classical HMC formulation (Rabiner, 1989), these potential functions are normalised conditional probability distributions, i.e. $\psi_{H_{i-1}, H_i}(h_{i-1}, h_i) = \Pr(H_i = h_i \mid H_{i-1} = h_{i-1})$, $\psi_{O_i, H_i}(o_i, h_i) = \Pr(O_i = o_i \mid H_i = h_i)$ and $\psi_{H_1}(h_1) = \Pr(H_1 = h_1)$. As a consequence, the normalising constant $Z$ is equal to 1, as it is in Bayesian networks. Consider now that there is more than one hidden chain: $I$ signals are observed at times $t \in \{1, \ldots, T\}$ and we denote by $O_t^i$ the variable corresponding to observed signal $i$ at time $t$. Variable $O_t^i$ depends on some hidden state $H_t^i$.
The Coupled HMM (CHMM) framework assumes dependency between two hidden chains at two consecutive time steps (see Brand 1997): $H_t^i$ depends not only on $H_{t-1}^i$, it may also depend on some $H_{t-1}^j$ for $j \neq i$. The set of the indices of the chains upon which $H_t^i$ depends (except $i$) is denoted $L_i$. This results in the graphical structure displayed in Figure 3, where $L_2 = \{1, 3\}$ and $L_1 = L_3 = \{2\}$. Such models have been considered in a series of domains such as bioinformatics (Choi et al., 2013), electroencephalogram analysis (Zhong and Ghosh, 2002) and speech recognition (Nock and Ostendorf, 2003). In a CHMM setting, the joint distribution of the hidden variables $H = (H_t^i)_{i,t}$ and observed variables $O = (O_t^i)_{i,t}$ factorises as

$$\Pr(h, o) \propto \prod_{i=1}^{I} \psi_{init}(h_1^i) \left( \prod_{i=1}^{I} \prod_{t=2}^{T} \psi_M(h_{t-1}^i, h_{t-1}^{L_i}, h_t^i) \right) \times \left( \prod_{i=1}^{I} \prod_{t=1}^{T} \psi_E(h_t^i, o_t^i) \right), \qquad (2)$$

where $\psi_{init}$ is the initial distribution, $\psi_M$ encodes the local transition function of $H_t^i$ and $\psi_E$ encodes the emission of the observed signal given the corresponding hidden state. A fairly comprehensive exploration of these models can be found in (Murphy, 2002).

Figure 3: Graphical representation of a coupled HMM with 3 hidden chains.

The potential functions $\psi_{init}$, $\psi_M$ and $\psi_E$ can be parameterised by a set of parameters denoted $\theta$. A classical problem for CHMMs is twofold: (a) estimate $\theta$ and (b) compute the mode of the conditional distribution of the hidden variables given the observations. Estimation can be performed using an EM algorithm and, as mentioned previously, the E-step of the algorithm and the mode computation task belong to the same family of computational tasks in graphical models. Both can be solved using variable elimination, as we show in the next section.
Beforehand, we present a reasonably simple example of a CHMM that will be used to illustrate the different inference algorithms introduced in this work. It models the dynamics of a pest that can spread over a landscape composed of $I$ crop fields organised on a regular grid. The spatial neighbourhood of field $i$, denoted $L_i$, is the set of the four closest fields (three on the borders, and two in the corners of the grid). $H_t^i \in \{0, 1\}$ ($1 \leq i \leq I$, $1 \leq t \leq T$) is the state of crop field $i$ at time $t$: state 0 (resp. 1) represents the absence (resp. presence) of the pest in the field. Variable $H_t^i$ depends on $H_{t-1}^i$ and on the $H_{t-1}^j$, for $j \in L_i$. The conditional probabilities of survival and appearance of the pest in field $i$ are parameterised by 3 parameters: $\epsilon$, the probability of contamination from outside the landscape (long-distance dispersal); $\rho$, the probability that the pest spreads from an infected field $j \in L_i$ to field $i$ between two consecutive times; and $\nu$, the probability of persistent infection of a field between two consecutive times. We assume that contamination events from all neighbouring fields are independent. Then, if $C_t^i$ is the number of contaminated neighbours of field $i$ at time $t$ (i.e. $C_t^i = \sum_{j \in L_i} H_t^j$), the contamination potential of field $i$ at time $t$ writes:

$$\psi_M(0, h_{t-1}^{L_i}, 1) = \Pr(H_t^i = 1 \mid H_{t-1}^i = 0, h_{t-1}^j, j \in L_i) = \epsilon + (1 - \epsilon)\big(1 - (1 - \rho)^{C_{t-1}^i}\big),$$

and its persistence in a contaminated state writes:

$$\psi_M(1, h_{t-1}^{L_i}, 1) = \Pr(H_t^i = 1 \mid h_{t-1}^i = 1, h_{t-1}^j, j \in L_i) = \nu + (1 - \nu)\Big[\epsilon + (1 - \epsilon)\big(1 - (1 - \rho)^{C_{t-1}^i}\big)\Big].$$

The $(H_t^i)$'s are hidden variables, but monitoring observations are available. A binary variable $O_t^i$ is observed: it takes value 1 if the pest was declared present in the field, and 0 otherwise. Errors of detection are possible. False negative observations occur since, even if the pest is there, it can be difficult to notice and may be missed.
Conversely, false positive observations occur when the pest is mixed up with another one. We define the corresponding emission potentials as ψ_E(0, 1) = Pr(O^i_t = 0 | H^i_t = 1) = f_n and ψ_E(1, 0) = Pr(O^i_t = 1 | H^i_t = 0) = f_p, respectively.

3 Variable elimination for exact inference

We now describe the principle of variable elimination to solve the general inference tasks presented in Section 2.2. We first recall the Viterbi algorithm for Hidden Markov Chains (Rabiner, 1989), a classical example of variable elimination for optimisation (mode evaluation). Then, we formally describe the variable elimination procedure in the general graphical model framework. The key element is the choice of an ordering for the sequential elimination of the variables. It is closely linked to the notion of treewidth of the graphical representation of the model. We explain how the complexity of a variable elimination algorithm is fully characterised by this notion. We also describe the extension to the elimination of blocks of variables.

3.1 Case of hidden Markov chain models

As a didactic introduction to exact inference on graphical models by variable elimination, we consider a well-studied stochastic process: the discrete Hidden Markov Chain model (HMC). A classical inference task for HMC is to identify the most likely values of the variables H given a realisation o of the variables O. The problem is to compute arg max_h Pr(H = h | O = o), or equivalently the argument of:

\max_{h_1, \ldots, h_T} \Big[ \big(\psi_{H_1}(h_1)\,\psi_{O_1,H_1}(o_1, h_1)\big) \prod_{i=2}^{T} \big(\psi_{H_{i-1},H_i}(h_{i-1}, h_i)\,\psi_{O_i,H_i}(o_i, h_i)\big) \Big] \quad (3)

The number of possible realisations of H is exponential in T. Nevertheless, this optimisation problem can be solved in a number of operations linear in T using the well-known Viterbi algorithm (Rabiner, 1989).
This algorithm, based on dynamic programming, performs successive eliminations (by maximisation) of all hidden variables, starting with H_T, iteratively considering the H_i's for i = T−1, T−2, ..., and finishing with H_1. It successively computes the most likely sequence of hidden variables. Using distributivity between the max and the product operators, the elimination of variable H_T can be done by rewriting (3) as:

\max_{h_1, \ldots, h_{T-1}} \Big[ \psi_{H_1}(h_1)\,\psi_{O_1,H_1}(o_1, h_1) \prod_{i=2}^{T-1} \psi_{H_{i-1},H_i}(h_{i-1}, h_i)\,\psi_{O_i,H_i}(o_i, h_i) \underbrace{\max_{h_T} \psi_{H_{T-1},H_T}(h_{T-1}, h_T)\,\psi_{O_T,H_T}(o_T, h_T)}_{\text{New potential function}} \Big]

The new potential function created by maximising over H_T depends only on variable H_{T−1}. The same principle can then be applied to H_{T−1} and so forth. This is a simple application of the general variable elimination algorithm that we describe in the next section.

3.2 General principle of variable elimination

In Section 2, we have seen that counting and optimisation tasks can be formalised by the same generic algebraic formulation

\oplus_{x_A} \big( \otimes_{B \in \mathcal{B}} \psi_B \big) \quad (4)

where A ⊆ V. The trick behind variable elimination (Bertelé and Brioshi, 1972) relies on a clever use of the distributivity property. Indeed, evaluating (a ⊗ b) ⊕ (a ⊗ c) as a ⊗ (b ⊕ c) requires fewer operations, hence eliminating a in the second form leads to fewer algebraic operations. Since distributivity applies both to counting and to optimising tasks, variable elimination can be applied to both tasks. It also means that if variable elimination is efficient for one task, it will also be efficient for the other. As in the HMC example, the principle of the variable elimination algorithm for counting or optimising consists in eliminating variables one by one in an expression of the problem like (4).
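To make the elimination concrete, here is a minimal max-product sketch that eliminates H_T, then H_{T−1}, and so on, keeping back-pointers to recover the argmax. The plain-list encoding of the potentials (psi_init[k], psi_trans[k][l], psi_emit[k][o]) is our own choice for readability, not the paper's.

```python
def viterbi(psi_init, psi_trans, psi_emit, obs):
    """Mode of Pr(H | O = obs) in a hidden Markov chain, by variable
    elimination (max-product) as in the rewriting of (3)."""
    T, K = len(obs), len(psi_init)
    msg = [1.0] * K      # current "new potential", a function of one variable
    back = [None] * T    # back-pointers to recover the argmax
    for t in range(T - 1, 0, -1):   # eliminate the last remaining hidden variable
        new_msg, ptr = [], []
        for k in range(K):
            scores = [psi_trans[k][l] * psi_emit[l][obs[t]] * msg[l] for l in range(K)]
            best = max(range(K), key=scores.__getitem__)
            ptr.append(best)
            new_msg.append(scores[best])
        msg, back[t] = new_msg, ptr
    # last potential, a function of H_1 only
    first = [psi_init[k] * psi_emit[k][obs[0]] * msg[k] for k in range(K)]
    h = [max(range(K), key=first.__getitem__)]
    for t in range(1, T):           # backtrack to rebuild the full mode
        h.append(back[t][h[-1]])
    return h
```

Each pass of the inner loop builds the "new potential function" of the rewriting above; the back-pointers implement the mode recovery by induction discussed in Section 3.2.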
The elimination of the first variable, say X_i, i ∈ A, is performed by merging all potential functions involving X_i and applying the operator ⊕_{x_i} to these potential functions. Using commutativity and associativity of both operators, (4) can be rewritten as:

\oplus_{x_A} \big( \otimes_{B \in \mathcal{B}} \psi_B \big) = \oplus_{x_{A \setminus \{i\}}} \oplus_{x_i} \Big[ \big( \otimes_{B \in \mathcal{B} \setminus \mathcal{B}_i} \psi_B \big) \otimes \big( \otimes_{B \in \mathcal{B}_i} \psi_B \big) \Big],

where B_i is the subset of B formed by the scopes that contain i. Then, using distributivity of ⊗ over ⊕, we obtain:

\oplus_{x_A} \big( \otimes_{B \in \mathcal{B}} \psi_B \big) = \oplus_{x_{A \setminus \{i\}}} \Big[ \big( \otimes_{B \in \mathcal{B} \setminus \mathcal{B}_i} \psi_B \big) \otimes \underbrace{\big( \oplus_{x_i} \otimes_{B \in \mathcal{B}_i} \psi_B \big)}_{\text{New potential function } \psi_{N_i}} \Big]

This shows that the elimination of X_i results in a new graphical model, where variable X_i and the potential functions ψ_B, B ∈ B_i = {B′ ∈ B, i ∈ B′}, do not appear anymore. They are replaced by a new potential ψ_{N_i} which does not involve X_i, but depends on its neighbours in G. The graph associated to the new graphical model is in a sense similar to the one of the original model. It is updated as follows: vertex X_i is removed, and the neighbours X_{N_i} of X_i are now connected together in a clique because they are all in the scope of ψ_{N_i}. The new edges between the neighbours of X_i are called fill-in edges. For instance, when eliminating variable X_1 in the graph of Figure 4 (left), potential functions ψ_{1,2}, ψ_{1,3}, ψ_{1,4} and ψ_{1,5} are replaced by ψ_{2,3,4,5} = ⊕_{x_1} (ψ_{1,2} ⊗ ψ_{1,3} ⊗ ψ_{1,4} ⊗ ψ_{1,5}). The new graph is shown in Figure 4 (right).

Figure 4: Elimination of variable X_1 replaces the four pairwise potential functions involving variable X_1 with a new potential ψ_{N_1}, involving the four neighbours of vertex 1 in the original graph. The new edges created between these four vertices are called fill-in edges (dashed edges in the middle figure).
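The rewriting above can be implemented directly. In the sketch below, each ψ_B is a pair (scope, function); combine and agg play the roles of ⊗ and ⊕ (sum-product by default, max-product with agg=max). This representation is chosen for readability, not efficiency.

```python
def eliminate(potentials, var, domains, combine=lambda a, b: a * b, agg=sum):
    """One step of variable elimination: merge every potential whose scope
    contains `var`, aggregate `var` out, return the new potential list.
    A potential is (scope, func); scope is a tuple of variable names."""
    touching = [p for p in potentials if var in p[0]]
    others = [p for p in potentials if var not in p[0]]
    # scope of psi_{N_i}: the neighbours of var in the merged potentials
    new_scope = tuple(sorted({v for s, _ in touching for v in s} - {var}))

    def new_func(*vals):
        assign = dict(zip(new_scope, vals))
        total = []
        for x in domains[var]:
            assign[var] = x
            term = 1
            for scope, f in touching:
                term = combine(term, f(*(assign[v] for v in scope)))
            total.append(term)
        return agg(total)   # aggregate var out (sum, max, ...)

    return others + [(new_scope, new_func)]
```

Eliminating every variable with agg=sum yields the normalising constant Z; the same calls with agg=max yield the maximum of the unnormalised distribution, illustrating that only the ⊕ operator changes between the two tasks.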
Interpretation for marginalisation, maximisation and finding the mode of a distribution. When the first elimination step is applied with ⊕ = + and ⊗ = ×, the probability distribution defined by this new graphical model is the marginal distribution p_{V\{i}}(x_{V\{i}}) of the original distribution p (up to a constant). The complete elimination can be obtained by successively eliminating all variables in X_A. The result is a graphical model over X_{V\A}, which specifies the marginal distribution p_{V\A}(x_{V\A}). When A = V, the result is a model with a single constant potential function with value Z. If instead ⊕ is max and ⊗ = × (or + with a log transformation of the potential functions) and A = V, the last potential function obtained after elimination of the last variable is equal to the maximum of the non-normalised distribution. So evaluating Z or the maximal probability of a graphical model can both be obtained with the same variable elimination algorithm, just changing the definition of the ⊕ (and, if needed, ⊗) operator(s). Lastly, if one is interested in the mode itself, an additional computation is required. The mode is actually obtained by induction: if x*_{V\{i}} is the mode of the graphical model obtained after the elimination of the first variable X_i, then the mode of p can be defined as (x*_{V\{i}}, x*_i), where x*_i is a value in Λ_i that maximises ⊗_{B∈B} ψ_B(x*_{V\{i}}, x_i). This maximisation is straightforward to derive because x_i can take only |Λ_i| values. x*_{V\{i}} itself is obtained by completing the mode of the graphical model obtained after elimination of the second variable, and so on. We stress that the procedure requires keeping the intermediary potential functions ψ_{N_i} created during the successive eliminations.
Complexity of the intermediary potential functions and variable elimination ordering: a prelude to the treewidth. When eliminating a variable X_i, the computationally expensive task is the computation of the intermediate ψ_{N_i}. It requires computing the product ⊗_{B∈B_i} ψ_B(x_B) of several potential functions for all elements of Λ_{N_i ∪ {i}}, the state space of X_{N_i ∪ {i}}. The time and space complexity of the operation are entirely determined by the cardinality |N_i| of the set of indices N_i. If K = max_{j∈V} |Λ_j|, the time complexity (i.e. number of elementary operations performed) is in O(K^{|N_i|+1}) and the space complexity (i.e. memory space needed) is in O(K^{|N_i|}). Complexity is therefore exponential in |N_i|, the number of neighbours of the eliminated variable in the current graphical model. The total complexity of the variable elimination is then exponential in the maximum cardinality of |N_i| over all successive eliminations. However, note that it is linear in n, which means that a large n is not necessarily an obstacle to exact inference. Because the graphical model changes at each elimination step, this number usually depends on the order in which variables are eliminated. As a consequence, the prerequisite to applying variable elimination is to decide on an ordering of the elimination of the variables. As illustrated in Figure 5, two different orders can lead to two different N_i subsets. The key message is that the choice of the order is crucial: it dictates the efficiency of the variable elimination procedure. We now illustrate and formalise this intuition.

3.3 When is variable elimination efficient?

We can now understand why the Viterbi algorithm is an efficient algorithm for mode evaluation in a HMC. The graph associated to a HMC is comb-shaped: the hidden variables form a line and each observed variable is a leaf of the comb (see Figure 2).
So it is possible to design an elimination order where the current variable to eliminate has a unique neighbour in the graphical representation of the current model: for instance H_T > H_{T−1} > ... > H_1. By convention, the first eliminated variable is the largest according to this ordering (note that the variables O_t do not have to be eliminated since their value is known). Following this elimination order, when eliminating a variable using ⊕, the resulting graphical model has one fewer vertex than the previous one and no fill-in edge. Indeed, the new potential function ψ_{N_i} is a function of a single variable since |N_i| = 1. The Viterbi algorithm has a space complexity of O(TK) and a time complexity of O(TK²). More generally, variable elimination is very efficient, i.e. leads to transitional N_i sets of small cardinality, on graphical models whose graph representation is a tree. More specifically, for such graph structures, it is always possible to design an elimination order where the current variable to eliminate has only one neighbour in the graphical representation of the current model. Another situation where variable elimination can be efficient is when the graph associated to the graphical model is chordal (any cycle of length four or more has a chord, i.e. an edge connecting two non-adjacent vertices of the cycle), and when the size of the largest clique is low. The rationale for this interesting property is explained intuitively here. In Figure 4, new edges are created between the neighbours of the eliminated vertex. If this neighbourhood is a clique, no new edge is added. A vertex whose neighbourhood is a clique is called a simplicial vertex. Chordal graphs have the property that there exists an elimination order of the vertices such that every vertex during the elimination process is simplicial (Habib and Limouzy, 2009).
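This characterisation suggests a simple (if naive) test: repeatedly look for a simplicial vertex and eliminate it. The sketch below implements this strategy; it succeeds exactly on chordal graphs. The adjacency encoding (dict of vertex → set of neighbours) is our own.

```python
def is_simplicial(adj, v):
    """A vertex is simplicial when its neighbourhood is a clique."""
    nb = list(adj[v])
    return all(b in adj[a] for i, a in enumerate(nb) for b in nb[i + 1:])

def perfect_elimination_order(adj):
    """Greedily eliminate simplicial vertices; returns an order on chordal
    graphs and None otherwise (a chordless cycle never becomes simplicial)."""
    adj = {v: set(nb) for v, nb in adj.items()}   # work on a copy
    order = []
    while adj:
        v = next((u for u in adj if is_simplicial(adj, u)), None)
        if v is None:
            return None   # stuck: the graph is not chordal
        order.append(v)
        for u in adj[v]:
            adj[u].discard(v)
        del adj[v]
    return order
```

On a tree, the simplicial vertices are exactly the degree-one vertices, so this procedure reproduces the leaf-by-leaf elimination discussed above.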
Consequently, there exists an elimination order such that no fill-in edges are created. Thus, the size of the transitional N_i's is dictated by the size of the clique formed by the neighbours of i. Let us note that a tree is a chordal graph whose maximal cliques are exactly its edges. Hence, for a tree, simplicial vertices are vertices of degree one. The elimination of degree-one vertices on a tree is an example of simplicial elimination on a chordal graph. For arbitrary graphs, if the maximal scope size of the intermediate ψ_{N_i} functions created during variable elimination is too large, then the memory and time required for storage and computation quickly exceed computer capacities. Depending on the chosen elimination order, this maximal scope can be reasonable from a computational point of view, or too large. So, again, the choice of the elimination order is crucial. In the case of a CHMM, we can imagine two different elimination orders: either time slice per time slice, or chain by chain (we omit the observed variables, which are known and do not have to be eliminated). For the first order, starting from the oriented graph of Figure 3, we first moralise it. Then, elimination of the variables H^i_T of the last time step does not add any fill-in edges. However, when eliminating the variables H^i_{T−1} for 1 ≤ i ≤ I − 1, due to the temporal dependences between chains, we create an intermediate potential function depending on I + 1 variables (H^I_{T−1} and the H^i_{T−2} of all chains). When successively eliminating temporal slices, the maximal size of the intermediate potential functions created is I + 1. For the second elimination order, still starting from the moralised version of the oriented graph, after eliminating all variables H^1_t for 1 ≤ t ≤ T − 1, we create an intermediate potential function depending on T + 1 variables (H^1_T and the H^2_t for all t).
When successively eliminating chains, the maximal size of the intermediate potential functions created is T + 1. So, depending on the values of I and T, we will not select the same elimination order.

3.4 The treewidth to characterise variable elimination complexity

The lowest complexity achievable when performing variable elimination is characterised by a parameter called the treewidth of the graph associated to the original graphical model. This concept has been repeatedly discovered and redefined. The treewidth of a graph is sometimes called its induced width (Dechter and Pearl, 1988), its minimum front size (Liu, 1992), its k-tree number (Arnborg, 1985), its dimension (Bertelé and Brioshi, 1972), and is also equal to the min-max clique number of G minus one (Arnborg, 1985), to name a few. The treewidth is also a key notion in the theory of graph minors (Robertson and Seymour, 1986; Lovász, 2005). We insist here on two definitions. The first one (Bodlaender, 1994) relies on the notion of induced graph (see Definition 1 below). It highlights the close relationship between fill-in edges and the intermediate N_i sets created during variable elimination. The second one (Robertson and Seymour, 1986; Bodlaender, 1994) is the most commonly used characterisation of the treewidth, using so-called tree decompositions, also known as junction trees, which are key tools to derive variable elimination algorithms. It underlies the block-by-block elimination procedure described in Section 3.5.

Definition 1 (induced graph) Let G = (V, E) be a graph defined by a set of vertices indexed on V and a set E of edges. Given an ordering π of the vertices of G, the induced graph G^ind_π is defined in a constructive way as follows. First, G and G^ind_π have the same vertices. Then, for each edge in E, an oriented edge is added in G^ind_π going from the first of the two nodes according to π toward the second.
Then each vertex i of V is considered one after the other, following the order defined by π. When vertex i is treated, an oriented edge is created between all pairs of neighbours of i in G that follow i in the ordering defined by π. Again, the edge goes from the first of the two nodes according to π toward the second. The induced graph G^ind_π is also called the fill graph of G, and the process of computing it is sometimes referred to as "playing the elimination game" on G, as it just simulates elimination on G using the variable ordering π (see an example in Figure 5). This graph is chordal (Vandenberghe and Andersen, 2014). It is known that every chordal graph G has at least one vertex ordering π such that G^ind_π = G (omitting the fact that the edges of G^ind_π are directed), called a perfect elimination ordering (Fulkerson and Gross, 1965).

Figure 5: A graph and two elimination orders. Left: the graph. Middle: induced graph associated to the elimination order (7 > 6 > 5 > 4 > 3 > 2 > 1); vertices are eliminated from the largest to the smallest; the maximum size of the N_i sets created during elimination is 2 (maximum number of outgoing edges) and only one (dashed) fill-in edge is added, when vertex 4 is eliminated. Right: induced graph associated to the elimination order (7 > 5 > 3 > 1 > 6 > 4 > 2); the maximum size of the N_i sets created during elimination is 3 and 5 (dashed) fill-in edges are used.

The second notion that enables to define the treewidth is that of tree decomposition. Intuitively, a tree decomposition of a graph G organises the vertices of G in clusters of vertices which are linked by edges such that the graph obtained is a tree. Specific constraints on the way the vertices of G are associated to clusters of the decomposition tree are required. These constraints ensure that the resulting tree decomposition has properties useful for building variable elimination algorithms.
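"Playing the elimination game" is straightforward to code. The sketch below eliminates vertices in the given order, connects the remaining neighbours of each eliminated vertex (the fill-in edges), and returns the largest N_i encountered, i.e. the induced width TW_π(G) of Definition 3.

```python
def induced_width(adj, order):
    """Simulate elimination of the vertices of `adj` in `order` and return
    the size of the largest neighbour set N_i met (= TW_pi(G)). Sketch;
    adj maps vertex -> set of neighbours."""
    adj = {v: set(nb) for v, nb in adj.items()}   # work on a copy
    width = 0
    for v in order:
        nb = adj.pop(v)                # N_i: current neighbours of v
        width = max(width, len(nb))
        for a in nb:                   # make N_i a clique (fill-in edges)
            adj[a] |= nb - {a}
            adj[a].discard(v)
    return width
```

On a path graph any leaf-to-root order gives width 1, while any order on a four-cycle gives width 2, matching the treewidth values of trees and cycles.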
Definition 2 (tree decomposition) Given a graph G = (V, E), a tree decomposition of G, T, is a tree (C, E_T), where C = {C_1, ..., C_l} is a family of subsets of V (called clusters), and E_T is a set of edges between the subsets C_i, satisfying the following properties:

• The union of all clusters C_k equals V (each vertex of G is associated with at least one vertex of T).

• For every edge (i, j) in E, there is at least one cluster C_k that contains both i and j.

• If clusters C_k and C_l both contain a vertex i of G, then all clusters C_s of T on the (unique) path between C_k and C_l contain i as well: the clusters containing vertex i form a connected subset of T. This is known as the running intersection property.

The concept of tree decomposition is illustrated in Figure 6.

Figure 6: Left: graphical representation of a graphical model. Right: tree decomposition over the clusters C_1 = {1, 2, 4}, C_2 = {1, 3, 4}, C_3 = {3, 4, 5}, C_4 = {5, 6} and C_5 = {5, 7}. Each edge between two clusters is labelled by their shared variables.

Definition 3 (treewidth) The two following definitions of the treewidth, derived respectively from the notion of induced graph and from that of tree decomposition, are equivalent:

• The treewidth TW_π(G) of a graph G for the ordering π is the maximum number of outgoing edges of a vertex in the induced graph G^ind_π. The treewidth TW(G) of a graph G is the minimum treewidth over all possible orderings π.

• The width of a tree decomposition (C, E_T) is the size of the largest C_i ∈ C minus 1, and the treewidth TW(G) of a graph is the minimum width among all its tree decompositions.

It is not trivial to establish the equivalence (see Meseguer et al. 2006, chapter 7, and Schiex 1999).
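The three properties of Definition 2 can be checked mechanically. The sketch below verifies them for a candidate decomposition; in the usage note, the graph edges and the tree edges for the Figure 6 example are our reconstruction from the cluster contents and edge labels, not taken verbatim from the figure.

```python
def is_tree_decomposition(vertices, edges, clusters, tree_edges):
    """Check the three properties of Definition 2. clusters: dict
    name -> set of graph vertices; tree_edges: pairs of cluster names
    (the tree shape itself is assumed, not checked). Sketch."""
    # 1. every vertex of G appears in some cluster
    if set().union(*clusters.values()) != set(vertices):
        return False
    # 2. every graph edge is contained in some cluster
    for i, j in edges:
        if not any({i, j} <= c for c in clusters.values()):
            return False
    # 3. running intersection: clusters holding v form a connected subtree
    for v in vertices:
        holding = {name for name, c in clusters.items() if v in c}
        comp = {next(iter(holding))}          # grow one connected component
        grew = True
        while grew:
            grew = False
            for a, b in tree_edges:
                if a in comp and b in holding and b not in comp:
                    comp.add(b); grew = True
                if b in comp and a in holding and a not in comp:
                    comp.add(a); grew = True
        if comp != holding:
            return False
    return True
```

For instance, with the clusters of Figure 6 and tree edges C_1–C_2, C_2–C_3, C_3–C_4, C_3–C_5 (read off the shared-variable labels), the checker accepts the decomposition; adding an edge (6, 7) to the graph makes it fail property 2.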
The term TW_π(G) is exactly the cardinality of the largest set N_i created during variable elimination with elimination order π. For example, in Figure 5, the middle and right graphs are the two induced graphs for two different orderings, and TW_π(G) is equal to 2 with the first ordering and to 3 with the second. It is easy to see that in this example TW(G) = 2. The treewidth of the graph of the HMC model, and of any tree, is equal to 1. It has been established that finding a minimum treewidth ordering π for a graph G, finding a minimum width tree decomposition, and computing the treewidth of a graph are of equivalent complexity. For an arbitrary graph, computing the treewidth is not an easy task. Section 4 is dedicated to this question, both from a theoretical and from a practical point of view. The treewidth is therefore a key indicator to answer the driving question of this review: will variable elimination be efficient for a given graphical model? For instance, the principle of variable elimination was applied to the exact computation of the normalising constant of a Markov random field on a small r by c lattice in Reeves and Pettitt (2004). For this regular graph, it is known that the treewidth is equal to min(r, c). So exact computation through variable elimination is possible for lattices with a small value of min(r, c) (even if max(r, c) is large). It is however well beyond computer capacities for real challenging problems in image analysis. In this case variable elimination can be used to define heuristic computational solutions, such as the algorithm of Friel et al. (2009), which relies on the merging of exact computations on small sub-lattices of the original lattice.
3.5 Tree decomposition and block-by-block elimination

Given a graphical model and a tree decomposition of its graph, a possible alternative to solve counting or optimisation tasks is to eliminate variables by successive blocks instead of one after the other. To do so, the block-by-block elimination procedure (Bertelé and Brioshi, 1972) relies on the tree decomposition characterisation of the treewidth. The underlying idea is to apply the variable elimination procedure on the tree decomposition, eliminating one cluster of the tree at each step. First, a root cluster C_r ∈ C is chosen and used to define an order of elimination of the clusters, by progressing from the leaves toward the root. Every eliminated cluster corresponds to a leaf of the current intermediate tree. Then each potential function ψ_B is assigned to the cluster C_i in C closest to the root among those such that B ⊂ C_i. Such a cluster always exists, otherwise either the running intersection property would not be satisfied or the graph of the decomposition would not be a tree. More precisely, the procedure starts with the elimination of any leaf cluster C_i of T, with parent C_j in T. Let us note B(C_i) = {B ∈ B, ψ_B assigned to C_i}. Here again, commutativity and distributivity are used to rewrite expression (4) (with A = V) as follows:

\oplus_{x} \otimes_{B \in \mathcal{B}} \psi_B = \oplus_{x_{V \setminus (C_i \setminus C_j)}} \Big[ \big( \otimes_{B \in \mathcal{B} \setminus \mathcal{B}(C_i)} \psi_B \big) \otimes \underbrace{\big( \oplus_{x_{C_i \setminus C_j}} \otimes_{B \in \mathcal{B}(C_i)} \psi_B \big)}_{\text{New potential function}} \Big]

Note that only the variables with indices in C_i \ C_j ≡ C_i ∩ (V \ C_j) are eliminated, even if it is common to say that the cluster has been eliminated. For instance, in the example depicted in Figure 6, if the first eliminated cluster is C_1, the new potential function is ⊕_{x_2} ψ_{1,2}(x_1, x_2) ⊗ ψ_{2,4}(x_2, x_4); it depends only on variables X_1 and X_4. Cluster elimination continues until no cluster is left.
The interest of this procedure is that the intermediate potential function created after each cluster elimination may have a scope much smaller than the treewidth, leading to better space complexity (Bertelé and Brioshi, 1972, chapter 4). However, the time complexity is increased. In summary, the lowest achievable complexity when performing variable elimination is reached for elimination orders in which the cardinalities of the intermediate sets N_i are smaller than or equal to the treewidth of G. This treewidth can be determined by considering cluster sizes in tree decompositions of G. Furthermore, any tree decomposition T can be used to build an elimination order, and vice versa. Indeed, an elimination order can be defined by using a cluster elimination order based on T, and by choosing an arbitrary order to eliminate the variables with indices in the subsets C_i \ C_j. Conversely, it is easy to build a tree decomposition from a given vertex ordering π. Since the induced graph G^ind_π is chordal, its maximum cliques can be identified in polynomial time. Each such clique defines a cluster C_i of the tree decomposition. The edges of T can be identified as the edges of any maximum spanning tree in the graph with vertices C_i and edges (C_i, C_j) weighted by |C_i ∩ C_j|.

Deterministic Graphical Models. To our knowledge, the notion of treewidth and its properties were first identified in combinatorial optimisation by Bertelé and Brioshi (1972). It was then coined "dimension", a graph parameter which was later shown to be equivalent to the treewidth (Bodlaender, 1998). Variable elimination itself is related to Fourier-Motzkin elimination (Fourier, 1827), a variable elimination algorithm which benefits from the linearity of the handled formulas.
Variable elimination has been repeatedly rediscovered: as non-serial dynamic programming (Bertelé and Brioshi, 1972), in the Davis-Putnam procedure for Boolean satisfiability problems (SAT, Davis and Putnam 1960), as bucket elimination for the CSP and WCSP (Dechter, 1999), in the Viterbi and Forward-Backward algorithms for HMM (Rabiner, 1989), and many more. There exist other situations where the choice of an elimination order has a deep impact on the complexity of the computations, such as the Gauss elimination scheme for a system of linear equations, or the Cholesky factorisation of very large sparse matrices, in which cases the equivalence between elimination and decomposition was also used (see Bodlaender et al. 1995).

4 Treewidth approximation for exact inference

As already mentioned, the complexity of the counting and the optimisation tasks on graphical models is strongly linked to the treewidth TW(G) of the underlying graph G. If one could guess (one of) the optimal vertex ordering(s) π*, leading to TW_{π*}(G) = TW(G), then one would be able to achieve the "optimal complexity" O(K^{TW(G)} n) for solving these tasks exactly; we recall that K is the maximal domain size of a variable in the graphical model. However, the first obstacle to overcome is that the treewidth of a given graph cannot be evaluated easily: the treewidth computation problem is known to be NP-hard (Arnborg et al., 1987). If one has to spend more time finding an optimal vertex ordering than computing probabilities in the underlying graphical model, the utility of exact treewidth computation appears limited. Therefore, an alternative line of research is to look for algorithms computing a vertex ordering π leading to a suboptimal width, TW_π(G) ≥ TW(G), but more efficient in terms of computational time.
In the following, we describe and empirically compare heuristics which simultaneously provide a vertex ordering and an upper bound on the treewidth. Inference relying on this ordering is still exact. It is not optimal in terms of time complexity but, on some problems, the inference can still be performed in reasonable time. A broad class of heuristic approaches is that of greedy algorithms (Bodlaender and Koster, 2010). They use the same iterative approach as the variable elimination algorithm (Section 3), except that they only manipulate the graph structure: they do not perform any actual combination/elimination computation. Starting from an empty vertex ordering and an initial graph G, they repeatedly select the next vertex to add to the ordering by locally optimising one of the following criteria:

• select a vertex with minimum degree in the current graph;

• select a vertex with minimum number of fill-in edges in the current graph.

After each vertex selection, the current graph is modified by removing the selected vertex and making a clique on its neighbours. The new edges added by this clique creation are fill-in edges. A vertex with no fill-in edges is a simplicial vertex (see Section 3.3). Fast implementations of minimum degree algorithms have been developed, see e.g. AMD (Amestoy et al., 1996), with time complexity in O(nm) (Heggernes et al., 2001) for an input graph G with n vertices and m edges. The minimum fill-in heuristic tends to be slower to compute but yields slightly better treewidth approximations in practice. Moreover, if a perfect elimination ordering (i.e., adding no fill-in edges) exists, this heuristic will find it. Thus, it recognises chordal graphs, and it returns the optimal treewidth in this particular case. This can be easily established from results in Bodlaender et al. (2005).
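Both greedy criteria fit in a dozen lines. The sketch below returns an elimination order together with the width upper bound it induces; criterion="degree" picks a minimum-degree vertex and criterion="fill" a minimum fill-in vertex, with ties broken arbitrarily. This is an illustration of the principle, not the AMD implementation.

```python
def greedy_order(adj, criterion="degree"):
    """Greedy elimination ordering on adj (dict vertex -> set of
    neighbours): repeatedly pick the cheapest vertex, eliminate it and
    connect its neighbours. Returns (order, width upper bound). Sketch."""
    adj = {v: set(nb) for v, nb in adj.items()}   # work on a copy

    def fill_in(v):
        nb = list(adj[v])
        return sum(1 for i, a in enumerate(nb)
                   for b in nb[i + 1:] if b not in adj[a])

    cost = (lambda v: len(adj[v])) if criterion == "degree" else fill_in
    order, width = [], 0
    while adj:
        v = min(adj, key=cost)
        nb = adj.pop(v)
        width = max(width, len(nb))
        for a in nb:                   # clique on the neighbours (fill-in)
            adj[a] |= nb - {a}
            adj[a].discard(v)
        order.append(v)
    return order, width
```

On a four-cycle either criterion returns the exact treewidth 2, and on a star (a tree) the minimum-degree rule returns the exact treewidth 1, consistent with the chordal-graph guarantee mentioned above.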
Notice that there exist linear-time O(n + m) algorithms to detect chordal graphs, such as the Maximum Cardinality Search (MCS) greedy algorithm (Tarjan and Yannakakis, 1984). MCS builds an elimination order based on the cardinality of the already processed neighbours. However, the treewidth approximation it returns is usually worse than that of the previous heuristic approaches. A simple way to improve the treewidth bound found by these greedy algorithms is to choose between candidate vertices with the same value of the selected criterion by using a second criterion, such as minimum fill-in first and then maximum degree, or to choose at random and iterate on the resulting randomised algorithms, as done in Kask et al. (2011). We compared the mean treewidth upper bound found by these four approaches (minimum degree, minimum fill-in, MCS and randomised iterative minimum fill-in) on a set of five WCSP and MRF benchmarks used as combinatorial optimisation problems in various solver competitions. ParityLearning is an optimisation variant of the minimal disagreement parity CSP problem, originally contributed to the DIMACS benchmark and used in the Minizinc challenge (Optimization Research Group, 2012). Linkage is a genetic linkage analysis benchmark (Elidan and Globerson, 2010). GeomSurf and SceneDecomp are respectively geometric surface labelling and scene decomposition problems in computer vision (Andres et al., 2013). For each problem it is possible to vary the number of vertices and potential functions. The number of instances per problem, as well as their mean characteristics, are given in Table 2. Results are reported in Figure 7 (left). The randomised iterative minimum fill-in algorithm used a maximum of 30,000 iterations or 180 seconds (respectively 10,000 iterations and 60 seconds for ParityLearning and Linkage), compared to a maximum of 0.37 second used by the non-iterative approaches.
The minimum fill-in algorithm (using maximum degree for breaking ties) performed better than the other greedy approaches. Its randomised iterative version offers slightly improved performance, at the price of some computation time. Then, on the same benchmark, we compared three exact methods for the task of mode evaluation that exploit either the minimum fill-in ordering or its randomised iterative version: variable elimination (ELIM), BTD (de Givry et al., 2006), and AND/OR Search (Marinescu and Dechter, 2006). ELIM and BTD exploit the minimum fill-in ordering, while AND/OR Search used its randomised iterative version. In addition, BTD and AND/OR Search exploit a tree decomposition during a Depth-First Branch and Bound method in order to get a good trade-off between memory space and search effort. Just like variable elimination, they have a worst-case time complexity exponential in the treewidth. All methods were allocated a maximum of 1 hour and 4 GB of RAM on an AMD Opteron 6176 at 2.3 GHz. The results are reported in Figure 7 (right), and show that BTD was able to solve more problems than the two other methods for a fixed CPU time. However, the best performing method heavily depends on the problem category. On ParityLearning, ELIM was the fastest method, but it ran out of memory on 83% of the total set of instances, while BTD (resp. AND/OR Search) used less than 1.7 GB (resp. 4 GB).

Table 2: Characteristics of the five optimisation problems used as benchmark. For a given problem, several instances are available, corresponding to different numbers of variables (equal to the number of vertices in the underlying graph) and different numbers of potential functions.

Type/Name            | Nb of instances | Mean nb of vertices | Mean nb of potential functions
CSP/ParityLearning   | 7               | 659                 | 1246
MRF/Linkage          | 22              | 917                 | 1560
MRF/GeomSurf-3       | 300             | 505                 | 2140
MRF/GeomSurf-7       | 300             | 505                 | 2140
MRF/SceneDecomp      | 715             | 183                 | 672
The randomised iterative minimum fill-in heuristic used by AND/OR Search in preprocessing consumed a fixed amount of time (≈ 180 seconds, included in the CPU time measurements), larger than the cost of a simple minimum fill-in heuristic run. BTD was faster than AND/OR Search on most of the instances, except on two problem categories (ParityLearning and Linkage).

To perform this comparison, we ran the following implementation of each method. The version of ELIM was the one implemented in the combinatorial optimisation solver TOOLBAR 2.3 (options -i -T3, available at mulcyber.toulouse.inra.fr/projects/toolbar). The version of BTD was the one implemented in the combinatorial optimisation solver TOULBAR2 0.9.7 (options -B=1 -O=-3 -nopre). Toulbar2 is available at www7.inra.fr/mia/T/toulbar2. This software won the UAI 2010 (Elidan and Globerson, 2010) and 2014 (Gogate, 2014) Inference Competitions on the MAP task. AND/OR Search was the version implemented in the open-source version 1.1.2 of DAOOPT (Otten et al., 2012) (options -y -i 35 --slsX=20 --slsT=10 --lds 1 -m 4000 -t 30000 --orderTime=180 for benchmarks from computer vision, and -y -i 25 --slsX=10 --slsT=6 --lds 1 -m 4000 -t 10000 --orderTime=60 for the other benchmarks), which won the Probabilistic Inference Challenge 2011 (Elidan and Globerson, 2011), albeit with a different closed-source version (Otten et al., 2012).

5 From Variable Elimination to Message Passing

On tree-structured graphical models, message passing algorithms extend the variable elimination algorithm by efficiently computing every marginal (or max-marginal) simultaneously, whereas variable elimination computes only one. On general graphical models, message passing algorithms can still be applied. They either provide approximate results efficiently, or have an exponential running cost.
We also present a less classical interpretation of message passing algorithms: it may be conceptually interesting to view these algorithms as performing a re-parametrisation of the original graphical model, i.e. a rewriting of the potentials without modifying the joint distribution. Instead of producing external messages, the re-parametrisation produces an equivalent MRF, where marginals can be easily accessed, and which can be better adapted than the original one for initialising further processing.

Figure 7: Left: Comparison of treewidth upper bounds provided by MCS (red), minimum degree (green), minimum fill-in (blue) and randomised iterative minimum fill-in (cyan) for the 5 categories of problems. Right: Mode evaluation by three exact methods exploiting the minimum fill-in ordering or its randomised iterative version. Number of instances solved (x-axis) within a given CPU time in seconds (log10 scale, y-axis) for ELIM (red), BTD (green), and AND/OR Search (blue).

5.1 Message passing and belief propagation

5.1.1 Message passing when the graph is a tree

Message passing algorithms over trees (Pearl, 1988) can be described as an extension of variable elimination, where the marginals or max-marginals of all variables are computed in a double pass of the algorithm. We depict the principle here first when G is a tree, and for marginal computation. At the beginning of the first pass (the forward pass), each leaf X_i is marked as "processed" and all other variables are "unprocessed". Then each leaf is successively visited and the new potential ψ_{N_i} is considered as a "message" sent from X_i to X_{pa(i)} (the parent of X_i in the tree), denoted µ_{i→pa(i)}. This message is a potential function over X_{pa(i)} only (scope of size 1). Messages are moved upward to nodes in the subgraph defined by the unmarked variables. A variable is marked as processed once it has received its messages.
When only one variable remains unmarked (defining the root of the tree), the combination of all the functions on this variable (messages and possibly an original potential function involving only the root variable) is equal to the unnormalised marginal distribution of this variable. This results directly from the fact that the operations performed in this forward pass of message passing are equivalent to variable elimination. To compute the marginal of another variable, one can redirect the tree using this variable as a new root. Some subtrees will remain unchanged (in terms of direction from the root of the subtree to the leaves) in this new tree, and the messages in these subtrees do not need to be recomputed. The second pass (backward pass) of the message passing algorithm exploits the fact that messages are shared between several marginal computations, to organise all these computations in a clever way: in order to compute the marginals of all variables, it is enough in the second pass to send messages from the root towards the leaves. Each marginal is then computed by combining the downward messages with the upward messages arriving at a particular vertex. One application is the well-known Forward-Backward algorithm (Rabiner, 1989).

Formally, in the message passing algorithm for marginal evaluation over a tree (V, E), messages µ_{i→j} are defined for each edge (i, j) ∈ E in a leaves-to-root-to-leaves order; there are 2|E| such messages, one for each edge direction. Messages µ_{i→j} are functions of x_j, which are computed iteratively by the following algorithm:

1. First, messages leaving the leaves of the tree are computed: for each i ∈ V, where i is a leaf of the tree, and for j the unique parent of i, for all (x_i, x_j) ∈ Λ_i × Λ_j:

   µ_{i→j}(x_j) ← Σ_{x'_i} ψ_{ij}(x'_i, x_j) ψ_i(x'_i).

   Mark all leaves as processed.

2. Then, messages are sent upward through all edges.
Message updates are performed iteratively, from a marked node i to its only unmarked neighbour j, through edge (i, j) ∈ E. Message updates take the following form, for all x_j ∈ Λ_j:

   µ_{i→j}(x_j) ← (1/K) Σ_{x'_i} ψ_{ij}(x'_i, x_j) ψ_i(x'_i) Π_{k≠j, (k,i)∈E} µ_{k→i}(x'_i),   (5)

   where K = Σ_{x_j} Σ_{x'_i} ψ_{ij}(x'_i, x_j) ψ_i(x'_i) Π_{k≠j, (k,i)∈E} µ_{k→i}(x'_i).

   In theory it is not necessary to normalise the messages, but this can be useful to avoid numerical problems. Mark node j as processed. See Figure 8 for an illustration.

3. Send the messages downward (from the root to the leaves). This second phase of message updates takes the following form:

   • Unmark the root node.

   • While there remains a marked node, send update (5) from an unmarked node to one of its marked neighbours, and unmark the corresponding neighbour.

4. After the three above steps, messages have been transmitted through all edges in both directions. Finally, marginal distributions over variables and over pairs of variables (linked by an edge) are computed as follows, for all (x_i, x_j) ∈ Λ_i × Λ_j:

   p_i(x_i) ← (1/K_i) ψ_i(x_i) Π_{j, (j,i)∈E} µ_{j→i}(x_i),
   p_ij(x_i, x_j) ← (1/K_ij) ψ_i(x_i) ψ_j(x_j) ψ_{ij}(x_i, x_j) Π_{k≠j, (k,i)∈E} µ_{k→i}(x_i) Π_{l≠i, (l,j)∈E} µ_{l→j}(x_j),

where K_i and K_ij are suitable normalising constants.

Figure 8: Example of message update on a tree. In this example, nodes t, v and w are marked, while node s is still unmarked. µ_{t→s} is a function of all the incoming messages to node t, except µ_{s→t}.

Max-product and Max-sum algorithms can be equivalently defined on a tree, for exact computation of the max-marginals of a joint distribution or of its logarithm (see chapter 8 of Bishop 2006).
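The two-pass sum-product algorithm above can be sketched as follows for a tree-structured pairwise model; the data structures, function names and the small binary chain used for illustration are our own choices, not part of the algorithm specification.

```python
import numpy as np

def sum_product_tree(n, edges, psi1, psi2):
    """Exact marginals on a tree-structured pairwise graphical model.

    n     : number of variables 0..n-1
    edges : list of tree edges (i, j)
    psi1  : psi1[i] is the 1-D array of singleton potentials of variable i
    psi2  : psi2[(i, j)] is the 2-D array of the pairwise potential on edge (i, j)
    Returns the list of normalised marginals p_i.
    """
    nbrs = {i: set() for i in range(n)}
    for i, j in edges:
        nbrs[i].add(j)
        nbrs[j].add(i)

    def pairwise(i, j):  # psi_ij viewed as a (|dom_i|, |dom_j|) array
        return psi2[(i, j)] if (i, j) in psi2 else psi2[(j, i)].T

    msgs = {}

    def message(i, j):  # mu_{i -> j}, computed recursively, leaves first
        if (i, j) not in msgs:
            prod = psi1[i].copy()
            for k in nbrs[i] - {j}:
                prod *= message(k, i)
            m = pairwise(i, j).T @ prod   # sum over x_i
            msgs[(i, j)] = m / m.sum()    # normalise to avoid under/overflow
        return msgs[(i, j)]

    marginals = []
    for i in range(n):
        p = psi1[i].copy()
        for j in nbrs[i]:
            p *= message(j, i)
        marginals.append(p / p.sum())
    return marginals

# Binary chain 0 - 1 - 2 (same numbers as the worked example of Section 5.3)
psi1 = [np.array([3.0, 1.0]), np.array([2.0, 6.0]), np.array([3.0, 4.0])]
psi2 = {(0, 1): np.array([[3.0, 2.0], [5.0, 4.0]]),
        (1, 2): np.array([[4.0, 8.0], [4.0, 1.0]])}
marginals = sum_product_tree(3, [(0, 1), (1, 2)], psi1, psi2)
print(marginals[0])  # ≈ [0.624 0.376], i.e. [1368, 824] / 2192
```

Replacing the sum in `message` by a maximum turns this sketch into the max-product variant mentioned above.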
In algebraic language, updates as defined by formula (5) take the general form: for all x_j ∈ Λ_j,

   µ_{i→j}(x_j) = ⊕_{x'_i} ψ_{ij}(x'_i, x_j) ⊗ ψ_i(x'_i) ⊗ (⊗_{k≠j, (k,i)∈E} µ_{k→i}(x'_i)).

5.1.2 Message passing when the factor graph is a tree

In some cases, the graph underlying the model may not be a tree, while the corresponding factor graph is one, with factors potentially involving more than two variables (see Figure 9 for an example). In these cases, message passing algorithms can still be defined, and they lead to exact computations of the marginals (or max-marginals). However, their complexity becomes exponential in the size of the largest factor minus 1. The message passing algorithm on a tree-structured factor graph exploits the same idea of shared messages as in the case of tree-structured graphical models, except that two different kinds of messages are computed:

• Factor-to-variable messages: messages from a factor B (we identify the factor with the subset B of variables of the potential function ψ_B it represents) towards a variable i, µ_{B→i}(x_i).

• Variable-to-factor messages: messages from a variable i towards a factor B, ν_{i→B}(x_i).

These are updated in a leaves-to-root direction and then backward, as above, but two different updating rules are used instead of (5): for all x_i ∈ Λ_i,

   µ_{B→i}(x_i) ← Σ_{x_{B\{i}}} ψ_B(x_B) Π_{j∈B\{i}} ν_{j→B}(x_j),
   ν_{i→B}(x_i) ← Π_{B'≠B, i∈B'} µ_{B'→i}(x_i).

Figure 9: Left: A graphical model whose structure is not a tree. Right: The corresponding factor graph, which is a tree. For applying message passing, the root is variable 1, while variables 4 and 5 are leaves. For the left branch, the first message sent is ν_{4→{2,4}}(x_4), followed by µ_{{2,4}→2}(x_2).
Then, the marginal probabilities are obtained by local marginalisation, as in Step 4 of the algorithm of Subsection 5.1.1 above:

   p_i(x_i) ← (1/K_i) ψ_i(x_i) Π_{B, i∈B} µ_{B→i}(x_i),  for all x_i ∈ Λ_i,

where K_i is again a normalising constant.

5.2 When the factor graph is not a tree

When the factor graph of the graphical model is not a tree, the two-pass message passing algorithm can no longer be applied as is, because of the loops. Yet, for general graphical models, this message passing approach can be generalised in two different ways.

• A tree decomposition can be computed, as previously discussed in Section 3.5. Message passing can then be applied on the resulting cluster tree, handling each cluster as a cross-product of variables, following a block-by-block approach. This yields an exact algorithm, for which computations can be expensive (exponential in the treewidth) and space intensive (exponential in the separator size). A typical example of such an algorithm is the algebraic exact message passing algorithm (Shafer and Shenoy, 1988; Shenoy and Shafer, 1990).

• Alternatively, the Loopy Belief Propagation (LBP) algorithm (Frey and MacKay, 1998) is another extension of message passing, in which message updates are repeated, in arbitrary order through all edges (possibly many times through each edge), until a termination condition is met. The algorithm returns approximations of the marginal probabilities (over variables and pairs of variables). The quality of the approximation and the convergence to steady-state messages are not guaranteed, hence the importance of the termination condition. However, it has been observed that LBP often provides good estimates of the marginals in practice. A deeper analysis of the Loopy Belief Propagation algorithm is postponed to Section 6.

5.3 Message Passing and re-parametrisation

It is possible to use message passing as a re-parametrisation technique.
In this case, the computed messages are directly used to reformulate the original graphical model into a new, equivalent graphical model with the same graphical structure. By "equivalent" we mean that the potential functions are not the same, but they define the same joint distribution as the original graphical model.

Several methods for re-parametrisation have been proposed, both in the field of probabilistic graphical models (Koller and Friedman, 2009, chapters 10 and 13) and in the field of deterministic graphical models (Cooper et al., 2010). They all share the same advantage: the re-parameterised formulation can be computed to satisfy precise requirements. It can be designed so that the re-parameterised potential functions contain some information of interest (marginal distributions on singletons p_i(x_i) or on pairs, max-marginals p*(x_i), or their approximations). It can also be optimised in order to tighten a bound on the probability of a MAP assignment (Kolmogorov, 2006; Schiex, 2000; Cooper et al., 2010; Huang and Koller, 2013) or on the partition function (Wainwright et al., 2005; Liu and Ihler, 2011; Viricel et al., 2016). Originally naive bounds can thus be tightened by re-parametrisation.

An additional advantage of the re-parametrised distribution appears in the context of incremental updates, where we have to perform inference based on the observation of some of the variables, and new observations (new evidence) are introduced incrementally. Since the re-parameterised model already includes the result of previous inferences, it is more efficient (in terms of the number of messages to send) to perform the updated inference starting from this expression of the joint distribution than from the original one (Koller and Friedman, 2009, chapter 10).
The idea behind re-parametrisation is conceptually very simple: when a message µ_{i→j} is computed, instead of keeping it as a message, it is possible to combine any potential function involving X_j with µ_{i→j}, using ⊗. To preserve the joint distribution defined by the original graphical model, we then need to divide another potential function involving X_j by the same message µ_{i→j}, using the inverse of ⊗.

Example for the computation of the max-marginals. We illustrate here how re-parametrisation can be exploited to extract directly all (unnormalised) max-marginals p*(x_i) from the order-1 potentials of the new model. In this case ψ_ij is divided by µ_{i→j}, while ψ_j is multiplied by µ_{i→j}. The same procedure can be run replacing max by + in the message definition, to obtain all singleton marginals p_i(x_i) instead.

Let us consider a graphical model with 3 binary variables. The potential functions defining the graphical model are:

   ψ_1(x_1) = (3, 1),  ψ_2(x_2) = (2, 6),  ψ_3(x_3) = (3, 4),

   ψ_12(x_1, x_2) = [ 3  2 ; 5  4 ],  ψ_23(x_2, x_3) = [ 4  8 ; 4  1 ],

where rows are indexed by the first variable and columns by the second. Since the graph of the model is a single path, and is thus tree-structured, we just need two passes of messages. We use vertex 2 as the root. The first messages, from the leaves to the root, are:

   µ_{1→2}(x_2) = max_{x_1} ψ_1(x_1) ψ_12(x_1, x_2),
   µ_{3→2}(x_2) = max_{x_3} ψ_3(x_3) ψ_23(x_2, x_3).

We obtain:

   µ_{1→2}(0) = max(3 × 3, 1 × 5) = 9,   µ_{1→2}(1) = max(3 × 2, 1 × 4) = 6,
   µ_{3→2}(0) = max(3 × 4, 4 × 8) = 32,  µ_{3→2}(1) = max(3 × 4, 4 × 1) = 12.

Potentials ψ_12 and ψ_23 are divided respectively by µ_{1→2} and µ_{3→2}, while ψ_2 is multiplied by these two same messages.
For instance:

   ψ'_2(0) = ψ_2(0) µ_{1→2}(0) µ_{3→2}(0) = 2 × 9 × 32 = 576,
   ψ'_2(1) = ψ_2(1) µ_{1→2}(1) µ_{3→2}(1) = 6 × 6 × 12 = 432,
   ψ'_12(x_1, 0) = ψ_12(x_1, 0) / µ_{1→2}(0),  ψ'_12(x_1, 1) = ψ_12(x_1, 1) / µ_{1→2}(1).

All the updated potentials are:

   ψ'_1(x_1) = ψ_1(x_1) = (3, 1),  ψ'_2(x_2) = (576, 432),  ψ'_3(x_3) = ψ_3(x_3) = (3, 4),

   ψ'_12(x_1, x_2) = [ 3/9  2/6 ; 5/9  4/6 ] = [ 1/3  1/3 ; 5/9  2/3 ],
   ψ'_23(x_2, x_3) = [ 4/32  8/32 ; 4/12  1/12 ] = [ 1/8  1/4 ; 1/3  1/12 ].

Then the messages from the root towards the leaves are computed using these updated potentials:

   µ_{2→1}(x_1) = max_{x_2} ψ'_2(x_2) ψ'_12(x_1, x_2) = (192, 320),
   µ_{2→3}(x_3) = max_{x_2} ψ'_2(x_2) ψ'_23(x_2, x_3) = (144, 144).

Finally, potentials ψ'_12 and ψ'_23 are divided respectively by µ_{2→1} and µ_{2→3}, while ψ'_1 and ψ'_3 are multiplied by µ_{2→1} and µ_{2→3} respectively, leading to the re-parameterised potentials:

   ψ''_1(x_1) = (3 × 192, 1 × 320) = (576, 320),  ψ''_2(x_2) = (576, 432),
   ψ''_3(x_3) = (3 × 144, 4 × 144) = (432, 576),

   ψ''_12(x_1, x_2) = [ (1/3)/192  (1/3)/192 ; (5/9)/320  (2/3)/320 ] = [ 1/576  1/576 ; 1/576  1/480 ],
   ψ''_23(x_2, x_3) = [ (1/8)/144  (1/4)/144 ; (1/3)/144  (1/12)/144 ] = [ 1/1152  1/576 ; 1/432  1/1728 ].

Then we can directly read the (unnormalised) max-marginals from the singleton potentials. For instance:

   max_{x_2, x_3} ψ_1(0) ψ_2(x_2) ψ_3(x_3) ψ_12(0, x_2) ψ_23(x_2, x_3) = 576 = ψ''_1(0).

We can check that the original graphical model and the re-parameterised one define the same joint distribution by comparing the (unnormalised) probability of each possible state (see Table 3).

Table 3: The unnormalised probabilities of the eight possible states in the original and re-parameterised models. One can check that the re-parameterised version describes the same joint distribution as the original one.
x1 x2 x3   Original (ψ_1 × ψ_2 × ψ_3 × ψ_12 × ψ_23)   Re-parameterised (ψ''_1 × ψ''_2 × ψ''_3 × ψ''_12 × ψ''_23)
0  0  0    3 × 2 × 3 × 3 × 4 = 216                    576 × 576 × 432 × (1/576) × (1/1152) = 216
0  0  1    3 × 2 × 4 × 3 × 8 = 576                    576 × 576 × 576 × (1/576) × (1/576) = 576
0  1  0    3 × 6 × 3 × 2 × 4 = 432                    576 × 432 × 432 × (1/576) × (1/432) = 432
0  1  1    3 × 6 × 4 × 2 × 1 = 144                    576 × 432 × 576 × (1/576) × (1/1728) = 144
1  0  0    1 × 2 × 3 × 5 × 4 = 120                    320 × 576 × 432 × (1/576) × (1/1152) = 120
1  0  1    1 × 2 × 4 × 5 × 8 = 320                    320 × 576 × 576 × (1/576) × (1/576) = 320
1  1  0    1 × 6 × 3 × 4 × 4 = 288                    320 × 432 × 432 × (1/480) × (1/432) = 288
1  1  1    1 × 6 × 4 × 4 × 1 = 96                     320 × 432 × 576 × (1/480) × (1/1728) = 96

Re-parametrisation to compute pairwise or cluster joint distributions. One possibility is to incorporate the messages into the binary (pairwise) potentials, in order to extract directly the pairwise joint distributions, as described in Koller and Friedman (2009, chapter 10): ψ_ij is replaced by ψ_ij µ_{i→j} µ_{j→i}, while ψ_i is divided by µ_{j→i} and ψ_j by µ_{i→j}. If, for example, sum-product messages are computed, each re-parameterised pairwise potential ψ_ij can be shown to be equal to the (unnormalised) marginal distribution of (X_i, X_j) (or an approximation of it if the graph is loopy).

In tree-structured problems, the resulting graphical model is said to be calibrated, to emphasise the fact that all pairs of binary potentials sharing a common variable agree on the marginal distribution of this common variable (here x_i):

   ⊕_{x_j} ψ_ij = ⊕_{x_k} ψ_ik.

In the loopy case, if an exact approach using tree decomposition is followed, the domains of the messages have a size exponential in the size of the intersections of pairs of clusters, and the re-parametrisation will create new potentials of this size. These messages are included inside the clusters. Each resulting cluster potential is then the (unnormalised) marginal of the joint distribution on the cluster variables.
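The equality of the two parametrisations listed in Table 3 can be checked mechanically; the arrays below simply transcribe the potentials of the worked example (variable names are ours).

```python
import numpy as np

# Original potentials of the 3-variable example (binary variables)
psi1 = np.array([3.0, 1.0])
psi2 = np.array([2.0, 6.0])
psi3 = np.array([3.0, 4.0])
psi12 = np.array([[3.0, 2.0], [5.0, 4.0]])
psi23 = np.array([[4.0, 8.0], [4.0, 1.0]])

# Re-parameterised potentials after the two max-product passes
r1 = np.array([576.0, 320.0])
r2 = np.array([576.0, 432.0])
r3 = np.array([432.0, 576.0])
r12 = np.array([[1 / 576, 1 / 576], [1 / 576, 1 / 480]])
r23 = np.array([[1 / 1152, 1 / 576], [1 / 432, 1 / 1728]])

# Both parametrisations give the same unnormalised probability on each state
for x1 in (0, 1):
    for x2 in (0, 1):
        for x3 in (0, 1):
            orig = psi1[x1] * psi2[x2] * psi3[x3] * psi12[x1, x2] * psi23[x2, x3]
            repar = r1[x1] * r2[x2] * r3[x3] * r12[x1, x2] * r23[x2, x3]
            assert abs(orig - repar) < 1e-6, (x1, x2, x3)

# The re-parameterised singleton potentials are the unnormalised max-marginals,
# e.g. the maximum over (x2, x3) with x1 = 0 equals r1[0]:
max_marg = max(psi1[0] * psi2[x2] * psi3[x3] * psi12[0, x2] * psi23[x2, x3]
               for x2 in (0, 1) for x3 in (0, 1))
print(max_marg)  # 576.0
```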
Again, a re-parameterised graphical model on a tree decomposition is calibrated, and any two intersecting clusters agree on their marginals. This is exploited in the Lauritzen-Spiegelhalter and Jensen sum-product-divide algorithms (Lauritzen and Spiegelhalter, 1988; Jensen et al., 1990). Besides its interest for incremental updates in this context, the re-parameterised graphical model using a tree decomposition allows us to locally compute exact marginals for any set of variables lying in a same cluster.

If a local "loopy" approach is used instead, re-parameterisations do not change scopes, but provide a re-parameterised model from which estimates of the marginals of the original model can be read directly. For MAP, such re-parameterisations can follow clever update rules to provide convergent re-parameterisations maximising a well-defined criterion. Typical examples of this process are the sequential version of the tree-reweighted algorithm (TRW-S, Kolmogorov 2006), or the Max-Product Linear Programming algorithm (MPLP, Globerson and Jaakkola 2008), which aims at optimising a bound on the unnormalised probability of the mode. These algorithms can be exact on graphical models with loops, provided the potential functions are all submodular (submodularity is often described as the discrete version of convexity, see for instance Topkis 1978; Cohen et al. 2004).

Re-parametrisation in deterministic graphical models. Re-parameterising message passing algorithms have also been used in deterministic graphical models. They are then known as "local consistency" enforcing or constraint propagation algorithms. On one side, a local consistency property defines the targeted calibration property. On the other side, the enforcing algorithm uses so-called Equivalence Preserving Transformations to transform the original network into an equivalent network, i.e. one defining the same joint function, which satisfies the desired calibration/local consistency property.
Similar to LBP, Arc Consistency (Waltz, 1972; Rossi et al., 2006) is the most usual form of local consistency, and is related to Unit Propagation in SAT (Biere et al., 2009). Arc consistency is exact on trees, and it is usually incrementally maintained during an exact tree search, using re-parametrisation. Because of the idempotency of logical operators (they can be applied several times without changing the result obtained after the first application), local consistencies always converge to a unique fixed point.

Local consistency properties and algorithms for Weighted CSPs are closely related to message passing for MAP. They are however always convergent, thanks to suitable calibration properties (Schiex, 2000; Cooper and Schiex, 2004; Cooper et al., 2010), and they also solve tree-structured problems, or problems where all potential functions are submodular.

These algorithms can be directly used to tackle the max-prod and sum-prod problems in an MRF. The re-parametrised MRF is then often more informative than the original one. For instance, under the simple condition that all potential functions whose scope is larger than 1 are bounded by 1, a trivial upper bound of the normalising constant Z is Π_i Σ_{x_i} ψ_i(x_i). This naive upper bound can be considerably tightened by re-parameterising the MRF using a soft arc consistency algorithm (Viricel et al., 2016).

6 Heuristics and approximations for inference

We have mainly discussed methods for exact inference in graphical models. They are useful if a variable elimination order with small treewidth is available. In many real-life applications, interaction networks are seldom tree-shaped, and their treewidth can be large (e.g. a grid of pixels in image analysis). Consequently, exact methods cannot be applied anymore. However, inspiration can be drawn from them to derive heuristic methods for inference that can be applied to any graphical model.
What is meant here by a heuristic method is an algorithm that is (a priori) not derived from the optimisation of a particular criterion; an algorithm that is, is rather termed an approximation method. Nevertheless, we shall soften this distinction, and show that well-performing message passing-based heuristics can sometimes be interpreted as approximation methods. For the marginalisation task, the most widespread heuristic derived from variable elimination and message passing principles is Loopy Belief Propagation. In the last decade, a better understanding of these heuristics has been reached, and they can now be re-interpreted as particular instances of variational approximation methods (Wainwright and Jordan, 2008). A variational approximation of a distribution p is defined as the best approximation of p in a class Q of distributions that are tractable (for inference), according to the Kullback-Leibler divergence. Depending on the application (e.g. discrete or continuous variables), several choices for Q can be considered. The connection with variable elimination principles and treewidth is not obvious at first sight. However, as we just emphasised, LBP can be cast in the variational framework, and the treewidth of the chosen variational distribution depends on the nature of the variables: (i) in the case of discrete variables, the treewidth needs to be low: in most cases, the class Q is formed by distributions of independent variables (mean field approximation), with associated treewidth equal to 0, and some works consider a class Q with associated treewidth equal to 1 (see Section 6.1); (ii) in the case of continuous variables, the treewidth of the variational distribution can be the same as in the original model: Q is in general chosen to be the class of multivariate Gaussian distributions, for which numerous inference tools are available.
We recall here the two key components of a variational approximation method: the Kullback-Leibler divergence and the choice of a class of tractable distributions. We then explain how LBP can be interpreted as a variational approximation method. Finally, we recall the rare examples where some statistical properties of an estimator obtained using a variational approximation have been established. In Section 7 we will illustrate how variational methods can be used to derive approximate EM algorithms for estimation in CHMM.

6.1 Variational approximations

The Kullback-Leibler divergence

   KL(q || p) = Σ_x q(x) log [ q(x) / p(x) ]

measures the dissimilarity between two probability distributions p and q. KL is not symmetric, hence it is not a distance. It is positive, and it is null if and only if p and q are equal. Let us now consider that q is constrained to belong to a family Q which does not include p. The solution q* of arg min_{q∈Q} KL(q || p) is then the best approximation of p in Q according to the KL divergence. It is called the variational distribution. If Q is a set of tractable distributions for inference, then the marginals, mode or normalising constant of q* can be used as approximations of the same quantities of p.

Variational approximations were originally defined in the field of statistical mechanics, as approximations of the minimum of the free energy F(q):

   F(q) = − Σ_x q(x) log Π_{B∈B} ψ_B(x_B) + Σ_x q(x) log q(x).

They are also known as Kikuchi approximations or Cluster Variational Methods (CVM, Kikuchi 1951). Minimising F(q) is equivalent to minimising KL(q || p), since

   F(q) = − Σ_x q(x) log p(x) − log(Z) + Σ_x q(x) log q(x) = KL(q || p) − log(Z).

The mean field approximation is the most naive approximation among the family of Kikuchi approximations.
Let us consider a binary Potts model on n vertices, whose joint distribution is

   p(x) = (1/Z) exp( Σ_i a_i x_i + Σ_{(i,j)∈E} b_ij x_i x_j ).

We can derive its mean field approximation, corresponding to the class Q_MF of fully factorised distributions (i.e. with an associated graph of treewidth equal to 0):

   Q_MF = { q such that q(x) = Π_{i∈V} q_i(x_i) }.

Since the variables are binary, Q_MF corresponds to joint distributions of independent Bernoulli variables with respective parameters q_i =def q_i(1). Namely, for all q in Q_MF, we can write q(x) = Π_i q_i^{x_i} (1 − q_i)^{1−x_i}. The optimal approximation (in terms of Kullback-Leibler divergence) within this class of distributions is characterised by the set of q_i's which minimise KL(q || p). Denoting E_q the expectation with respect to q, KL(q || p) − log Z is equal to

   E_q [ Σ_i ( X_i log q_i + (1 − X_i) log(1 − q_i) ) − Σ_i a_i X_i − Σ_{(i,j)∈E} b_ij X_i X_j ]
   = Σ_i [ q_i log q_i + (1 − q_i) log(1 − q_i) ] − Σ_i a_i q_i − Σ_{(i,j)∈E} b_ij q_i q_j.

This expectation has a simple form because of the specific structure of q. Minimising it with respect to q_i gives the fixed-point relation that each optimal q_i^MF must satisfy:

   log [ q_i^MF / (1 − q_i^MF) ] = a_i + Σ_{j:(i,j)∈E} b_ij q_j^MF,

leading to

   q_i^MF = exp( a_i + Σ_{j:(i,j)∈E} b_ij q_j^MF ) / [ 1 + exp( a_i + Σ_{j:(i,j)∈E} b_ij q_j^MF ) ].

It is interesting to note that this expression is very close to the conditional probability that X_i = 1 given all the other variables in the neighbourhood of i:

   Pr(X_i = 1 | x_{N_i}) = exp( a_i + Σ_{j:(i,j)∈E} b_ij x_j ) / [ 1 + exp( a_i + Σ_{j:(i,j)∈E} b_ij x_j ) ].

The variational distribution q_i^MF can thus be interpreted as this conditional distribution, with the neighbouring variables fixed to their expected values under the distribution q^MF. This explains the name of "mean field" approximation. Note that in general q_i is not equal to the marginal p_i(1).
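The mean field fixed-point relation above is straightforward to iterate. The sketch below, on a small hypothetical 4-vertex cycle (a 2×2 grid), compares the mean field marginals with the exact ones obtained by brute-force enumeration; all parameter values are illustrative.

```python
import math
import itertools

def mean_field_potts(a, edges, n_iter=200):
    """Mean field approximation q_i ≈ Pr(X_i = 1) for a binary Potts model
    p(x) ∝ exp( sum_i a_i x_i + sum_(i,j) b_ij x_i x_j ).

    a     : dict of singleton coefficients a_i
    edges : dict mapping edges (i, j) to coupling coefficients b_ij
    """
    nbrs = {i: {} for i in a}
    for (i, j), b in edges.items():
        nbrs[i][j] = b
        nbrs[j][i] = b
    q = {i: 0.5 for i in a}  # initial guess
    for _ in range(n_iter):
        for i in a:
            s = a[i] + sum(b * q[j] for j, b in nbrs[i].items())
            q[i] = 1.0 / (1.0 + math.exp(-s))  # fixed-point update
    return q

# 2x2 grid with uniform (hypothetical) parameters
a = {i: -0.5 for i in range(4)}
edges = {(0, 1): 1.0, (2, 3): 1.0, (0, 2): 1.0, (1, 3): 1.0}
q = mean_field_potts(a, edges)

# Exact marginals by brute force, feasible on this tiny model
Z = 0.0
p1 = {i: 0.0 for i in a}
for x in itertools.product((0, 1), repeat=4):
    w = math.exp(sum(a[i] * x[i] for i in a)
                 + sum(b * x[i] * x[j] for (i, j), b in edges.items()))
    Z += w
    for i in a:
        p1[i] += w * x[i]
print(q[0], p1[0] / Z)  # mean field estimate vs exact marginal
```

On this example the two values are close but not equal, illustrating the remark above that q_i generally differs from the true marginal p_i(1).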
The choice of the class Q is indeed a critical trade-off between opposite desirable properties: it must be large enough to guarantee a good approximation, and small enough to contain only distributions for which inference is manageable. In the next section, a particular choice for Q, the Bethe class, is emphasised. In particular, it enables us to link the LBP heuristics to variational methods. Other choices are possible and have been used. For instance, in the structured mean field setting (Ghahramani and Jordan, 1997; Wainwright and Jordan, 2008), the distribution of a factorial Hidden Markov Model is approximated in a variational approach: the multivariate hidden state is decoupled, and the variational distribution q of the conditional distribution of hidden states is that of independent Markov chains (here again, the treewidth is equal to 1). The Chow-Liu algorithm (Chow and Liu, 1968) computes the minimum of KL(p || q) for a distribution q whose associated graph is a spanning tree of the graph of p. This amounts to computing the best approximation of p among graphical models with treewidth equal to 1. Finally, an alternative to treewidth reduction is to choose the variational approximation in the class of exponential distributions. This has been applied to Gaussian process classification (Kim and Ghahramani, 2006), using a multivariate Gaussian approximation of the posterior distribution of the hidden field. This method relies on the use of the EP algorithm (Minka, 2001). In this algorithm, KL(p || q) is minimised instead of KL(q || p). The choice of minimising one or the other depends on their computational tractability.
6.2 LBP heuristics as a variational method

If p and q are pairwise MRFs whose associated graph G = (V, E) is the same and is a tree, then

   q(x) = [ Π_{(i,j)∈E} q(x_i, x_j) ] / [ Π_{i∈V} q(x_i)^{d_i − 1} ],

where {q(x_i, x_j)} and {q(x_i)} are coherent sets of order-2 and order-1 marginals of q, respectively, and d_i is the degree of vertex i in the tree. In this particular case, the free energy is expressed as (see Heskes et al. 2004; Yedidia et al. 2005)

   F(q) = − Σ_{(i,j)∈E} Σ_{x_i,x_j} q(x_i, x_j) log ψ_ij(x_i, x_j) − Σ_{i∈V} Σ_{x_i} q(x_i) log ψ_i(x_i)
          + Σ_{(i,j)∈E} Σ_{x_i,x_j} q(x_i, x_j) log q(x_i, x_j) − Σ_{i∈V} (d_i − 1) Σ_{x_i} q(x_i) log q(x_i).

The Bethe approximation consists in applying to an arbitrary graphical model the same formula of the free energy as the one used for a tree, and then in minimising it over the variables {q(x_i, x_j)} and {q(x_i)}, under the constraints that they are probability distributions and that q(x_i) is the marginal of q(x_i, x_j). By extension, the Bethe approximation can be interpreted as a variational method associated with the family Q_Bethe of unnormalised distributions that can be expressed as q(x) = Π_{(i,j)∈E} q(x_i, x_j) / Π_{i∈V} q(x_i)^{d_i − 1}, with {q(x_i, x_j)} and {q(x_i)} coherent sets of order-2 and order-1 marginals. Yedidia et al. (2005) established that the fixed points of LBP (when they exist; convergence is still not well understood, see Weiss 2000 and Mooij and Kappen 2007) are stationary points of the problem of minimising the Bethe free energy, or equivalently KL(q || p) with q in the class Q_Bethe. Furthermore, Yedidia et al. (2005) showed that for any class of distributions Q corresponding to a particular CVM method, it is possible to define a generalised BP algorithm whose fixed points are stationary points of the problem of minimising KL(q || p) in Q.
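A minimal LBP sketch for a pairwise model illustrates the behaviour discussed in this section: on a graph with a single loop and mild couplings, its fixed point gives marginal estimates close to, but not equal to, the exact ones. The message schedule, stopping rule and toy model below are our own illustrative choices.

```python
import itertools
import numpy as np

def loopy_bp(psi1, psi2, n_iter=100):
    """Loopy belief propagation on a binary pairwise model.

    psi1 : list of singleton potentials (1-D arrays)
    psi2 : dict mapping edges (i, j) to pairwise potentials (2-D arrays)
    Returns approximate marginals (exact when the graph is a tree).
    """
    n = len(psi1)
    nbrs = {i: [] for i in range(n)}
    for i, j in psi2:
        nbrs[i].append(j)
        nbrs[j].append(i)
    msgs = {(i, j): np.ones(2) for i, j in psi2}
    msgs.update({(j, i): np.ones(2) for i, j in psi2})

    def pairwise(i, j):
        return psi2[(i, j)] if (i, j) in psi2 else psi2[(j, i)].T

    for _ in range(n_iter):           # repeated sweeps over all directed edges
        for i, j in list(msgs):
            prod = psi1[i].copy()
            for k in nbrs[i]:
                if k != j:
                    prod *= msgs[(k, i)]
            m = pairwise(i, j).T @ prod
            msgs[(i, j)] = m / m.sum()

    beliefs = []
    for i in range(n):
        b = psi1[i].copy()
        for k in nbrs[i]:
            b *= msgs[(k, i)]
        beliefs.append(b / b.sum())
    return beliefs

# A single loop (4-cycle) with moderate attractive couplings
psi1 = [np.array([1.0, 2.0]) for _ in range(4)]
coupling = np.array([[2.0, 1.0], [1.0, 2.0]])
psi2 = {(0, 1): coupling, (1, 2): coupling, (2, 3): coupling, (0, 3): coupling}
beliefs = loopy_bp(psi1, psi2)

# Exact marginal of X_0 by enumeration, for comparison
Z, m0 = 0.0, np.zeros(2)
for x in itertools.product((0, 1), repeat=4):
    w = np.prod([psi1[i][x[i]] for i in range(4)])
    w *= np.prod([psi2[e][x[e[0]], x[e[1]]] for e in psi2])
    Z += w
    m0[x[0]] += w
print(beliefs[0], m0 / Z)  # LBP estimate vs exact marginal
```

On stronger couplings or denser loops, the fixed point may not exist or the estimates may degrade, consistent with the convergence caveats cited above.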
The drawback of the LBP algorithm and its extensions (Yedidia et al., 2005) is that they are not associated with any theoretical bound on the error made on the marginal approximations. Nevertheless, LBP is increasingly used for inference in graphical models for its good behaviour in practice (Murphy et al., 1999). It is implemented in software packages for inference in graphical models such as libDAI (Mooij, 2010) or OpenGM2 (Andres et al., 2012).

6.3 Statistical properties of variational estimates

Maximum-likelihood parameter estimation in graphical models is often intractable because it may require computing marginals or normalising constants. A computationally efficient alternative to Monte-Carlo estimates is provided by variational estimates, obtained using a variational approximation of the model. From a statistical point of view, because variational estimation is only an approximation of maximum-likelihood estimation, the resulting parameter estimates do not benefit from the typical properties of maximum-likelihood estimates (MLE), such as consistency or asymptotic normality. Unfortunately, no general theory exists for variational estimates, and results are available only for some specific models (see e.g. Hall et al. 2011 for consistency in the Poisson log-normal model and Blei et al. 2017 for some other examples). From a more general point of view, in a Bayesian context, Wang and Titterington (2005) and Wang and Titterington (2006) studied the properties of variational estimates. They proved that the approximate conditional distributions are centred on the true posterior mean, but with too small a variance. Celisse et al. (2012) proved the consistency of the (frequentist) variational estimates of the Stochastic Block Model (SBM), while Gazal et al. (2012) empirically established the accuracy of their Bayesian counterpart.
Variational Bayes estimates are also proposed by Jaakkola and Jordan (2000) for logistic regression, and the approximate posterior also turns out to be very accurate. A heuristic explanation for these two positive examples (SBM and logistic regression) is that, in both cases, the class Q used for the approximate conditional (or posterior) distribution q is sought so as to asymptotically contain the true conditional distribution.

7 Illustration on CHMM

In this last section, we illustrate how the different algorithms discussed perform in practice for marginal inference in the CHMM framework when the model parameters are known, and how concretely they can be exploited in the EM algorithm to perform parameter estimation.

7.1 Comparison of exact variable elimination, variational inference and Gibbs sampling in practice

We compared the following inference algorithms on the problem of computing the marginals of all the hidden variables of the CHMM model of pest propagation described in Section 2.3, conditional on the observed variables. We simulated 10 datasets with the following parameter values: ρ = 0.2, ν = 0.5, = 0.15, f_n = 0.3 and f_p = 0.1. For each dataset, we ran the following algorithms, using the libDAI software (Mooij, 2010): junction tree (JT, exact method using the principles of tree decomposition and block-by-block elimination); loopy belief propagation (LBP); mean field approximation (MF); and Gibbs sampling (GS, Geman and Geman 1984), with 10,000 runs, each with a burn-in of 100 iterations and then 10,000 iterations. We compared the algorithms on three criteria: running time (time variable), mean absolute difference between the true marginal probability of state 0 and the estimated one, over all hidden variables (diff-marg variable), and percentage of hidden variables which are not restored to their true value with the mode of the estimated marginal (error-resto variable).
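The Gibbs sampling baseline can be sketched as follows on a toy model (a small binary Ising-like grid, not the pest-propagation CHMM nor the libDAI implementation used in our experiments; the grid size, coupling strength and iteration counts are illustrative). Each sweep resamples every site from its full conditional given its neighbours, and marginals are estimated by empirical frequencies after burn-in:

```python
import numpy as np

# Toy 3x3 Ising-like grid; Gibbs sampling estimates P(X_i = 1) for every site.
n = 3
beta = 0.4                        # coupling strength (assumed value for illustration)
rng = np.random.default_rng(1)

def neighbours(r, c):
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        if 0 <= r + dr < n and 0 <= c + dc < n:
            yield r + dr, c + dc

x = rng.integers(0, 2, size=(n, n))
counts = np.zeros((n, n))
burn_in, n_iter = 100, 10_000
for it in range(burn_in + n_iter):
    for r in range(n):
        for c in range(n):
            # Full conditional of site (r, c): spins are coded as 2*x - 1 in {-1, +1}.
            s = sum(2 * x[rr, cc] - 1 for rr, cc in neighbours(r, c))
            p1 = 1.0 / (1.0 + np.exp(-2.0 * beta * s))
            x[r, c] = rng.random() < p1
    if it >= burn_in:
        counts += x

marginals = counts / n_iter       # Monte-Carlo estimates of P(X_i = 1)
print(marginals)
```

The Monte-Carlo error decreases only as the inverse square root of the number of retained sweeps, which is one reason why GS is so much slower than the message-passing methods in the comparison below.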
The results (see Table 4) are presented for increasing values of n, the number of rows (and also of columns) of the square grid of fields (i.e. I = n²).

Table 4: Comparison of Junction Tree (JT), Loopy Belief Propagation (LBP), Mean Field (MF) and Gibbs sampling (GS) inference algorithms on the CHMM model of pest propagation: (a) running time, in seconds; (b) mean difference between the true and the estimated marginal of state 0 (when JT cannot be run we use GS marginals as the true ones); (c) percentage of hidden variables not restored to their true value when using the mode of the marginals.

(a) time (s)
              JT         LBP        MF         GS
  n = 3       0.04       0.04       0.03       1.05
  n = 5       −          0.19       0.14       3.30
  n = 10      −          1.07       0.65       13.99
  n = 100     −          219.31     134.31     3,499.6
  n = 200     −          1,026.2    746.68     29,341.0

(b) diff-marg
              LBP        MF         GS
  n = 3       0.001      0.032      0.032
  n = 5       0.003      0.037      −
  n = 10      0.003      0.032      −
  n = 100     0.003      0.032      −
  n = 200     0.003      0.032      −

(c) error-resto (%)
              JT         LBP        MF         GS
  n = 3       20.00      19.80      19.26      20.19
  n = 5       −          18.60      19.27      18.93
  n = 10      −          17.87      17.70      17.83
  n = 100     −          18.19      18.39      18.20
  n = 200     −          18.18      18.40      18.18

Beyond n = 3, JT cannot be run, so for computing diff-marg we used the GS marginals instead of the true marginals. These results illustrate well the fact that approximate inference methods based on the principle of variable elimination are very time efficient compared to Monte-Carlo methods (less than 4 minutes for a problem with I = 10,000 hidden variables), while still being very accurate. Furthermore, even a naive variational method like the mean field one can be interesting if accurate marginal estimates are not required and we are only interested in preserving their mode.

7.2 Variational approximation for estimation in CHMM

We now illustrate how variational approximations have been used for parameter estimation with an EM algorithm in the case of CHMM.
Exact EM algorithm

CHMM are examples of incomplete data models, as they involve variables (O, H), and only the variables O are observed. Maximum likelihood inference for such a model aims at finding the value of the parameters θ which maximises the (log-)likelihood of the observed data o, i.e. at solving max_θ log Pr_θ(o). The most popular algorithm to achieve this task is the EM algorithm (Dempster et al., 1977). One of its formulations reads as an iterative maximisation procedure of the following functional:

F(θ, q) = E_q(log Pr_θ(o, H)) − E_q(log q(H)) = log Pr_θ(o) − KL(q(H) || Pr_θ(H | o)),

where q stands for any distribution on the hidden variables H, and E_q stands for the expectation under the arbitrary distribution q. The EM algorithm consists in alternately maximising F(θ, q) with respect to q (E-step) and to θ (M-step). The solution of the E-step is q(h) = Pr_θ(h | o), since the Kullback-Leibler divergence is then minimal, and even null in this case. When replacing q(h) by Pr_θ(h | o) in F, we obtain that the M-step amounts to maximising E[log Pr_θ(o, H) | o]. Exact computation of Pr_θ(h | o) can be performed by observing that (2) can be rewritten as

Pr_θ(h, o) ∝ ( ψ_0^init(h_1) ∏_{t=2}^T ψ_0^M(h_{t−1}, h_t) ) × ( ∏_{i=1}^I ∏_{t=1}^T ψ^E(h_t^i, o_t^i) ),

where ψ_0^init is the global initial distribution, equal to ∏_{i=1}^I ψ^init(h_1^i), and ψ_0^M is the global transition probability, equal to ∏_{i=1}^I ψ^M(h_{t−1}^i, h_{t−1}^{N_i}, h_t^i), with N_i the set of chains coupled with chain i. This rewriting is equivalent to merging all the hidden variables of a given time step. It corresponds to the graphical model given in Figure 10. Denoting K the number of possible values for each hidden variable, we end up with a regular hidden Markov model with K^I possible hidden states.
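The merging step can be sketched numerically: given per-chain transition kernels, the merged chain has a K^I × K^I transition matrix, which makes the exponential blow-up explicit. The coupling rule below is a toy parametrisation invented for illustration (each chain's transition depends on its own previous state and on its two ring neighbours), not the pest-propagation model of this paper:

```python
import itertools
import numpy as np

# Merging the I hidden chains of a coupled HMM yields one chain with K**I states.
K, I = 2, 4                        # K**I = 16 merged states; already 2**20 for I = 20
states = list(itertools.product(range(K), repeat=I))

def local_transition(h_prev, i, h_i_new):
    """P(h^i_t = h_i_new | h_{t-1}); toy coupling on a ring, for illustration only."""
    neigh = h_prev[(i - 1) % I] + h_prev[(i + 1) % I]
    p_one = 0.2 + 0.25 * neigh + 0.2 * h_prev[i]   # stays in [0.2, 0.9]
    return p_one if h_i_new == 1 else 1.0 - p_one

# Global K**I x K**I transition matrix of the merged chain:
P = np.zeros((K**I, K**I))
for a, h_prev in enumerate(states):
    for b, h_new in enumerate(states):
        P[a, b] = np.prod([local_transition(h_prev, i, h_new[i]) for i in range(I)])

assert np.allclose(P.sum(axis=1), 1.0)   # each row is a probability distribution
print(P.shape)                            # (16, 16)
```

A forward-backward recursion on this merged chain costs O(T K^{2I}) per sequence, which is exactly the complexity stated below and explains why the exact E-step is only feasible for small I.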
Both Pr_θ(h | o) and its mode can then be computed in an exact manner, with the forward-backward recursion, or with the Viterbi algorithm for the mode evaluation. Both procedures have the same complexity: O(T K^{2I}). The exact calculation can therefore be achieved provided that K^I remains small enough, but becomes intractable when the number of signals I exceeds a few tens.

Figure 10: Graphical representation of Pr(h, o) for a coupled HMM when merging the hidden variables at each time step.

Several variational approximations for the EM algorithm

For more complex graphical structures, explicitly determining Pr_θ(h | o) can be too expensive to perform exactly. A first approach to derive an approximate E-step is to seek a variational approximation of Pr_θ(h | o), assuming that q(h) is restricted to a family Q of tractable distributions, as described in Section 6.1. The choice of Q is critical, and requires achieving an acceptable balance between approximation accuracy and computational efficiency. Choosing Q typically amounts to breaking down some dependencies in the original distribution to end up with a tractable distribution. In the case of CHMM, the simplest class is that of fully factorised distributions (i.e. the mean field approximation), that is

Q_0 = { q : q(h) = ∏_{i=1}^I ∏_{t=1}^T q_it(h_t^i) }.

Such an approximation of Pr_θ(h | o) corresponds to the graphical model of Figure 11. Intuitively, this approximation replaces the stochastic influence between the hidden variables by its mean value.

Figure 11: Graphical representation for the mean-field approximation of Pr(h, o) in a coupled HMM.
Observed variables are indicated in light grey since they are not part of the variational distribution, which is a distribution only on the hidden variables.

As suggested in Wainwright and Jordan (2008), a less drastic approximation of Pr_θ(h | o) can be obtained using the distribution family of independent heterogeneous Markov chains:

Q_M = { q : q(h) = ∏_i ∏_t q_it(h_t^i | h_{t−1}^i) },

which is consistent with the graphical representation of independent HMM, as depicted in Figure 12. An alternative is to use the Bethe approximation of F(θ, q). Then the LBP algorithm can be used to provide an approximation of the conditional marginal distributions on singletons and pairs of variables (no other marginals are involved in the E-step of EM). This approach has been proposed in Heskes et al. (2004). The advantage of this approach, compared to the variational approximations based on the families Q_0 or Q_M, is that it provides an approximation of the joint conditional distribution of pairs of hidden variables within a same time step, instead of assuming that they are independent.

Figure 12: Graphical representation for the approximation of Pr(h, o) in a coupled HMM by independent heterogeneous Markov chains. Observed variables are indicated in light grey since they are not part of the variational distribution, which is a distribution only on the hidden variables.

8 Conclusion and discussion

This tutorial on variable elimination for exact and approximate inference is an introduction to the basic concepts of variable elimination, message passing and their links with variational methods. It introduces these fields to statisticians confronted with inference in graphical models. The main message is that exact inference should not be systematically ruled out.
Before looking for an efficient approximate method, it is wise to try to evaluate the treewidth of the graphical model. In practice, this question is not easy to answer. Nevertheless, several algorithms exist that provide an upper bound on the treewidth together with an associated variable elimination order (minimum degree, minimum fill-in, maximum cardinality search, ...). Even if this ordering is not optimal, it can be used to perform exact inference if the bound is small enough. Examples where the low treewidth of the graphical model has been successfully exploited to perform exact inference in problems apparently too complex are numerous. Korhonen and Parviainen (2013) simplified the NP-hard problem of learning the structure of a Bayesian network from data when the underlying network has "low" treewidth. They proposed an exact score-based algorithm to learn the graph structure using dynamic programming. Berg et al. (2014) compared their approach with an encoding of the algorithm in the framework of Maximum Satisfiability and improved performances on classical Machine Learning datasets with networks of up to 29 nodes. Akutsu et al. (2009) tackled the problem of Boolean acyclic network completion. More specifically, the aim is to achieve the smallest number of modifications in the network, so that the distribution is consistent with the binary observations at the nodes. The authors established the general NP-completeness of the problem, even for tree-structured networks. They however reported that these problems can be solved in polynomial time for networks with bounded treewidth and in-degree, and with enough samples (in the order of at least the log of the number of nodes).
Their findings were applied (Tamura and Akutsu, 2014) to obtain the sparsest possible set of modifications in the activation and inhibition functions of a signalling network (comprising 57 nodes and 154 edges) after a hypothesised cell-state alteration in colorectal cancer patients. Xing (2004) introduced two Bayesian probabilistic graphical models for genomic data analysis, devoted to (i) the identification of motifs and cis-regulatory modules from transcriptional regulatory sequences, and (ii) haplotype inference from genotypes of SNPs (Single Nucleotide Polymorphisms). Inference for these two high-dimensional models on hybrid distributions is very complex to compute. The author noted that exact computation (e.g. of MAP or marginal distributions) might be feasible for models of bounded treewidth, if a good variable ordering were available. However, the question of how to find this ordering is not addressed, and an approximate generalised mean field inference algorithm is developed instead. Finally, the reader can find in Berger et al. (2008) more illustrations of how the notion of treewidth can help simplify the parametrisation of many algorithms in bioinformatics. For the reader interested in testing the inference algorithms presented in this article, the list provided by Kevin Murphy (https://www.cs.ubc.ca/~murphyk/Software/bnsoft.html), even though slightly outdated, gives a good idea of the variety of existing software packages, most of them being dedicated to a particular family of graphical models (directed, or undirected). One of the reasons why variable elimination based techniques for inference in graphical models are not widespread outside the communities of researchers in Computer Science and Machine Learning is probably that no software exists that is both generic and equipped with an easy interface from R, Python or Matlab. Obviously this tutorial is not exhaustive, since we chose to focus on fundamental concepts.
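The greedy minimum-degree heuristic mentioned above, which returns an elimination order together with an upper bound on the treewidth, can be sketched in a few lines (a simplified illustration; production tools use more refined heuristics such as minimum fill-in, and the test graph below is an arbitrary 3x3 grid, whose treewidth is 3):

```python
# Greedy minimum-degree elimination: repeatedly remove a vertex of smallest degree,
# connecting its neighbours into a clique; the largest clique created upper-bounds
# the treewidth, and the removal sequence is a usable elimination order.

def min_degree_order(n, edges):
    adj = {v: set() for v in range(n)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    order, width = [], 0
    remaining = set(range(n))
    while remaining:
        v = min(remaining, key=lambda u: len(adj[u]))   # minimum-degree vertex
        width = max(width, len(adj[v]))                  # size of the clique created
        neigh = list(adj[v])
        for i in range(len(neigh)):                      # make v's neighbours a clique
            for j in range(i + 1, len(neigh)):
                adj[neigh[i]].add(neigh[j])
                adj[neigh[j]].add(neigh[i])
        for u in neigh:
            adj[u].discard(v)
        del adj[v]
        remaining.remove(v)
        order.append(v)
    return order, width   # width is an upper bound on the treewidth

# A 3x3 grid graph, vertices numbered row by row:
n = 3
edges = [(r * n + c, r * n + c + 1) for r in range(n) for c in range(n - 1)]
edges += [(r * n + c, (r + 1) * n + c) for r in range(n - 1) for c in range(n)]
order, ub = min_degree_order(n * n, edges)
print(order, ub)
```

If the bound returned this way is small, the associated order can directly drive exact variable elimination; only when it is large does one need to fall back on the approximate methods reviewed above.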
While many important results on treewidth and graphical models are now several decades old, the area is still lively, and we now broaden our discussion to a few recent works which tackle some challenges related to the computation of the treewidth. Because they offer efficient algorithms, graphical models with bounded treewidth are an attractive target when the aim is to learn a model that best represents a given sample. In Kumar and Bach (2012), the problem of learning the structure of an undirected graphical model with bounded treewidth is approximated by a convex optimisation problem. The resulting algorithm has polynomial time complexity. As discussed in Kumar and Bach (2012), this algorithm is useful to derive tractable candidate distributions in a variational approach, enabling one to go beyond the usual variational distributions with treewidth zero or one. For optimisation (MAP), other exact techniques are offered by tree search algorithms such as Branch and Bound (Lawler and Wood, 1966), which recursively consider possible conditionings of variables. These techniques often exploit limited variable elimination processing to prevent exhaustive search, either using message-passing-like algorithms (Cooper et al., 2010) to compute bounds that can be used for pruning, or by performing "on-the-fly" elimination of variables with small degree (Larrosa, 2000). Beyond pairwise potential functions, the time needed for simple update rules of message passing becomes exponential in the size of the scope of the potential functions. However, for specific potential functions involving many (or all) variables, exact messages can be computed in reasonable time, even in the context of convergent message passing for optimisation. This can be done using polytime graph optimisation algorithms such as shortest path or min-cost flow algorithms.
Such functions are known as global potential functions (Vicente et al., 2008; Werner, 2008) in probabilistic graphical models, and as global cost functions (Lee and Leung, 2009; Allouche et al., 2012; Lee and Leung, 2012) in deterministic Cost Function Networks. Different problems appear with continuous variables, where counting requires the integration of functions. Here again, for specific families of distributions, exact (analytic) computations can be obtained, for instance when conjugate distributions are involved. For message passing, several solutions have been proposed. For instance, a recent message passing scheme proposed by Noorshams and Wainwright (2013) relies on the combination of orthogonal series approximations of the messages and the use of stochastic updates. We refer the reader to the references in Noorshams and Wainwright (2013) for a state of the art of alternative methods dealing with continuous-variable message passing. Variational methods are also largely exploited for continuous variables, in particular in Signal Processing (Smidl and Quinn, 2006). Finally, we have excluded Monte-Carlo methods from the scope of our review. However, the combination of the inference methods presented in this article with stochastic methods for inference is a new area that researchers are starting to explore. Recent sampling algorithms have been proposed that use exact optimisation algorithms to sample points with high probability in the context of estimating the partition function. Additional control in the sampling method is needed to avoid biased estimations: this may be achieved by hashing functions enforcing a fair sampling (Ermon et al., 2014) or by randomly perturbed potential functions using a suitable noise distribution (Hazan et al., 2013).
More recently, Monte-Carlo and variational approaches have been combined to propose Discrete Particle Variational Inference (Saeedi et al., 2017), an algorithm that benefits from the accuracy of the former and the rapidity of the latter. We hope this review will enable more cross-fertilisations of this sort, combining statistics and computer science, stochastic and deterministic algorithms for inference in graphical models.

References

Akutsu, T., T. Tamura, and K. Horimoto (2009). Completing networks using observed data. In R. Gavaldà, G. Lugosi, T. Zeugmann, and S. Zilles (Eds.), Algorithmic Learning Theory (ALT 2009), Volume 5809 of LNCS, Berlin, Heidelberg, pp. 126–140. Springer.

Allouche, D., C. Bessiere, P. Boizumault, S. de Givry, P. Gutierez, S. Loudni, J. Metivier, and T. Schiex (2012). Decomposing global cost functions. In Proceedings of the National Conference on Artificial Intelligence, AAAI-2012, pp. 407–413.

Amestoy, P., T. A. Davis, and I. S. Duff (1996). An approximate minimum degree ordering algorithm. SIAM Journal on Matrix Analysis and Applications 17(4), 886–905.

Andres, B., T. Beier, and J. H. Kappes (2013). OpenGM benchmark - CVPR'2013 section. See http://hci.iwr.uni-heidelberg.de/opengm2/?l0=benchmark.

Andres, B., T. Beier, and J. H. Kappes (2012). OpenGM: A C++ library for discrete graphical models. arXiv.

Arnborg, S. (1985). Efficient algorithms for combinatorial problems on graphs with bounded decomposability – a survey. BIT, Computer Science and Numerical Mathematics 25, 2–23.

Arnborg, S., D. G. Corneil, and A. Proskurowski (1987). Complexity of finding embeddings in a k-tree. SIAM Journal on Algebraic and Discrete Methods 8, 277–284.

Baker, J., L. Deng, J. Glass, S. Khudanpur, C.-H. Lee, N. Morgan, and D. O'Shaughnessy (2009). Developments and directions in speech recognition and understanding, Part 1 [DSP education]. IEEE Signal Processing Magazine 26(3), 75–80.

Barber, D. (2012).
Bayesian Reasoning and Machine Learning. Cambridge University Press.

Berg, J., M. Järvisalo, and B. Malone (2014). Learning optimal bounded treewidth Bayesian networks via maximum satisfiability. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, Volume 33, pp. 86–95.

Berger, B., R. Singh, and J. Xu (2008). Graph algorithms for biological systems analysis. In Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '08, Philadelphia, PA, USA, pp. 142–151. SIAM.

Bertelé, U. and F. Brioschi (1972). Nonserial Dynamic Programming. Academic Press.

Biere, A., M. Heule, and H. van Maaren (2009). Handbook of Satisfiability, Volume 185. IOS Press.

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.

Bistarelli, S., U. Montanari, and F. Rossi (1997). Semiring-based constraint solving and optimization. Journal of the ACM 44(2), 201–236.

Blei, D. M., A. Kucukelbir, and J. D. McAuliffe (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association 112(518), 859–877.

Bodlaender, H. L. (1994). A tourist guide through treewidth. Developments in Theoretical Computer Science 1.

Bodlaender, H. L. (1998). A partial k-arboretum of graphs with bounded treewidth. Theoretical Computer Science 209(1), 1–45.

Bodlaender, H. L., J. R. Gilbert, H. Hafsteinsson, and T. Kloks (1995). Approximating treewidth, pathwidth, frontsize, and shortest elimination tree. Journal of Algorithms 18, 238–255.

Bodlaender, H. L., A. Koster, and F. v. d. Eijkhof (2005). Preprocessing rules for triangulation of probabilistic networks. Computational Intelligence 21(3), 286–305.

Bodlaender, H. L. and A. M. C. A. Koster (2010). Treewidth computations I. Upper bounds. Information and Computation 208(3), 259–275.

Bonneau, M., S. Gaba, N. Peyrard, and R. Sabbadin (2014).
Reinforcement learning-based design of sampling policies under cost constraints in Markov random fields: Application to weed map reconstruction. Computational Statistics & Data Analysis 72, 30–44.

Brand, M. (1997). Coupled hidden Markov models for modeling interacting processes. Technical Report 405, MIT.

Carriger, F. and M. Barron (2016). A practical probabilistic graphical modeling tool for weighing ecological risk-based evidence. Soil and Sediment Contamination: An International Journal 25(4), 476–487.

Casella, G. and E. I. George (1992). Explaining the Gibbs sampler. The American Statistician 46(3), 167–174.

Celisse, A., J.-J. Daudin, and L. Pierre (2012). Consistency of maximum-likelihood and variational estimators in the stochastic block model. Electronic Journal of Statistics 6, 1847–99.

Choi, H., D. Fermin, A. Nesvizhskii, D. Ghosh, and Z. Qin (2013). Sparsely correlated hidden Markov models with application to genome-wide location studies. Bioinformatics 29(5), 533–541.

Chow, C. K. and C. Liu (1968). Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory 14(3), 462–467.

Cohen, D., M. Cooper, P. Jeavons, and A. Krokhin (2004). A maximal tractable class of soft constraints. Journal of Artificial Intelligence Research 22, 1–22.

Cooper, M. C. (2004). Cyclic consistency: a local reduction operation for binary valued constraints. Artificial Intelligence 155(1-2), 69–92.

Cooper, M. C., S. de Givry, M. Sanchez, T. Schiex, M. Zytnicki, and T. Werner (2010). Soft arc consistency revisited. Artificial Intelligence 174, 449–478.

Cooper, M. C. and T. Schiex (2004). Arc consistency for soft constraints. Artificial Intelligence 154(1-2), 199–227.

Davis, M. and H. Putnam (1960). A computing procedure for quantification theory. Journal of the ACM 7(3), 210–215.

de Givry, S., T. Schiex, and G. Verfaillie (2006).
Exploiting tree decomposition and soft local consistency in weighted CSP. In Proceedings of the National Conference on Artificial Intelligence, AAAI-2006, pp. 22–27.

Dechter, R. (1999). Bucket elimination: A unifying framework for reasoning. Artificial Intelligence 113(1–2), 41–85.

Dechter, R. and J. Pearl (1988). Network-based heuristics for constraint satisfaction problems. In L. Kanal and V. Kumar (Eds.), Search in Artificial Intelligence, Chapter 11, pp. 370–425. Springer-Verlag.

Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B 39, 1–38.

Duffin, R. J. (1965). Topology of series-parallel networks. Journal of Mathematical Analysis and Applications 10(2), 303–313.

Elidan, G. and A. Globerson (2010). UAI inference challenge 2010. See www.cs.huji.ac.il/project/UAI10.

Elidan, G. and A. Globerson (2011). The probabilistic inference challenge. See http://www.cs.huji.ac.il/project/PASCAL/index.php.

Ermon, S., C. Gomes, A. Sabharwal, and B. Selman (2014). Low-density parity constraints for hashing-based discrete integration. In Proceedings of the 31st International Conference on Machine Learning, pp. 271–279.

Fourier, J. (1827). Mémoires de l'Académie des sciences de l'Institut de France 7, Chapter Histoire de l'Académie, partie mathématique (1824). Gauthier-Villars.

Frey, B. and D. MacKay (1998). A revolution: Belief propagation in graphs with cycles. In Advances in Neural Information Processing Systems, pp. 479–485. MIT Press.

Friel, N., A. N. Pettitt, R. Reeves, and E. Wit (2009). Bayesian inference in hidden Markov random fields for binary data defined on large lattices. Journal of Computational and Graphical Statistics 18, 243–261.

Fulkerson, D. and O. Gross (1965). Incidence matrices and interval graphs. Pacific Journal of Mathematics 15(3), 835–855.

Gazal, S., J.-J. Daudin, and S. Robin (2012).
Accuracy of variational estimates for random graph mixture models. Journal of Statistical Computation and Simulation 82(6), 849–862.

Geman, S. and D. Geman (1984, November). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence 6(6), 721–741.

Ghahramani, Z. and M. Jordan (1997). Factorial hidden Markov models. Machine Learning 29(2-3), 245–273.

Globerson, A. and T. Jaakkola (2008). Fixing max-product: Convergent message passing algorithms for MAP LP-relaxations. In Advances in Neural Information Processing Systems, pp. 553–560.

Gogate, V. (2014). UAI 2014 inference competition. See www.hlt.utdallas.edu/~vgogate/uai14-competition.

Gordon, N. J., D. J. Salmond, and A. F. M. Smith (1993). A novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEEE Proceedings on Radar and Signal Processing 140(2), 107–113.

Habib, M. and V. Limouzy (2009). On some simplicial elimination schemes for chordal graphs. Electronic Notes in Discrete Mathematics 32, 125–132.

Hall, P., J. T. Ormerod, and M. Wand (2011). Theory of Gaussian variational approximation for a Poisson mixed model. Statistica Sinica, 369–389.

Hazan, T., S. Maji, and T. Jaakkola (2013). On sampling from the Gibbs distribution with random maximum a-posteriori perturbations. In Advances in Neural Information Processing Systems, pp. 1268–1276.

Heggernes, P., S. Eisenstat, G. Kumfert, and A. Pothen (2001). The computational complexity of the minimum degree algorithm. In 14th Norwegian Computer Science Conference, Tromsø, Norway.

Heskes, T., O. Zoeter, and W. Wiegerinck (2004). Approximate expectation maximization. Advances in Neural Information Processing Systems 16, 353–360.

Höhna, S., T. Heath, B. Boussau, M. Landis, F. Ronquist, and J. Huelsenbeck (2014). Probabilistic graphical model representation in phylogenetics.
Systematic Biology 63(5), 753–771.

Huang, H. and D. Koller (2013). Subproblem-tree calibration: A unified approach to max-product message passing. In International Conference on Machine Learning.

Illian, J., S. Martino, S. Sørbye, J. Gallego-Fernández, M. Zunzunegui, M. Esquivias, and J. J. Travis (2013). Fitting complex ecological point process models with integrated nested Laplace approximation. Methods in Ecology and Evolution 4(4), 305–315.

Jaakkola, T. and M. Jordan (2000). Bayesian parameter estimation via variational methods. Statistics and Computing 10(1), 25–37.

Jensen, F., K. Olesen, and S. Andersen (1990). An algebra of Bayesian belief universes for knowledge-based systems. Networks 20(5), 637–659.

Jensen, F. V. and T. D. Nielsen (2007). Bayesian Networks and Decision Graphs (2nd ed.). Springer Publishing Company, Incorporated.

Kask, K., A. Gelfand, L. Otten, and R. Dechter (2011). Pushing the power of stochastic greedy ordering schemes for inference in graphical models. In Proceedings of the National Conference on Artificial Intelligence, AAAI-2011.

Kikuchi, R. (1951). A theory of cooperative phenomena. Physical Review 81(6), 988–1003.

Kim, H. and Z. Ghahramani (2006). Bayesian Gaussian process classification with the EM-EP algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(12), 1948–1959.

Kohlas, J. (2003). Information Algebras: Generic Structures for Inference. Springer Science & Business Media.

Koller, D. and N. Friedman (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press.

Kolmogorov, V. (2006). Convergent tree-reweighted message passing for energy minimization. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(10), 1568–1583.

Korhonen, J. and P. Parviainen (2013). Exact learning of bounded tree-width Bayesian networks.
In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, Volume 31, pp. 370–378.

Kschischang, F. R., B. J. Frey, and H.-A. Loeliger (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory 47(2), 498–519.

Kumar, K. S. S. and F. Bach (2012). Convex relaxations for learning bounded treewidth decomposable graphs. In Proceedings of the International Conference on Machine Learning, Atlanta, United States.

Larrosa, J. (2000, September). Boosting search with variable elimination. In Principles and Practice of Constraint Programming - CP 2000, Volume 1894 of LNCS, Singapore, pp. 291–305.

Lauritzen, S. and D. Spiegelhalter (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society - Series B 50, 157–224.

Lauritzen, S. L. (1996). Graphical Models. Clarendon Press.

Lawler, E. and D. Wood (1966). Branch-and-bound methods: A survey. Operations Research 14(4), 699–719.

Lee, J. and K. L. Leung (2009). Towards efficient consistency enforcement for global constraints in weighted constraint satisfaction. In International Conference on Artificial Intelligence, Volume 9, pp. 559–565.

Lee, J. H. M. and K. L. Leung (2012). Consistency techniques for global cost functions in weighted constraint satisfaction. Journal of Artificial Intelligence Research 43, 257–292.

Li, S. Z. (2001). Markov Random Field Modeling in Image Analysis. Springer-Verlag.

Liu, J. W. H. (1992). The multifrontal method for sparse matrix solution: Theory and practice. SIAM Review 34, 82–109.

Liu, Q. and A. T. Ihler (2011). Bounding the partition function using Hölder's inequality. In Proceedings of the 28th International Conference on Machine Learning, pp. 849–856.

Liu, Y., J. Carbonell, V. Gopalakrishnan, and P. Weigele (2009).
Conditional Graphical Mod- els for Protein Structural Motif Recognition. Journal of Computational Biology 16 (5), 639– 657. Lov ´ asz, L. (2005). Graph minor theory . Bulletin of the American Mathematical Society 43 , 75–86. 44 Maathuis, M., D. Colombo, M. Kalisch, and P . B ¨ uhlmann (2010). Predicting causal ef fects in large-scale systems from observ ational data. Natur e Methods 7 , 247–248. Marinescu, R. and R. Dechter (2006). Memory intensi ve branch-and-bound search for graphical models. In pr oceedings of the National Confer ence on Artificial Intelligence, AAAI-2006 , pp. 1200–1205. Meseguer , P ., F . Rossi, and T . Schiex (2006). Soft constraints processing. In F . Rossi, P . v an Beek, and T . W alsh (Eds.), Handbook of Constr aint Pr ogramming , Chapter 9. Else vier . Minka, T . (2001). A family of algorithms for appr oximate Bayesian inference . Ph. D. thesis, MIT . Mooij, J. M. (2010, August). libD AI: A free and open source C++ library for discrete approxi- mate inference in graphical models. J ournal of Machine Learning Resear ch 11 , 2169–2173. Mooij, J. M. and H. J. Kappen (2007). Suf ficient conditions for conv ergence of the sum-product algorithm. IEEE T r ansactions on Information Theory 53 (12), 4422–4437. Murphy , K. (2002). Dynamic Bayesian networks: r epr esentation, infer ence and learning . Ph. D. thesis, Uni versity of California, Berkele y . Murphy , K. (2012). Machine Learning: A Pr obabilistic P erspective . The MIT Press. Murphy , K., Y . W eiss, and M. Jordan (1999). Loopy belief propagation for approximate infer- ence: An empirical study . In Pr oceedings of the 15th confer ence on Uncertainty in Artificial Intelligence , pp. 467–475. Nock, H. and M. Ostendorf (2003). Parameter reduction schemes for loosely coupled HMMs. Computer Speech & Langua ge 17 (2), 233–262. Noorshams, N. and M. J. W ainwright (2013). Belief propagation for continuous state spaces: stochastic message-passing with quantitativ e guarantees. 
J ournal of Machine Learning Re- sear ch 14 (1), 2799–2835. Optimization Research Group, N. (2012). Minizinc challenge 2012. See http://www. minizinc.org/challenge2012/challenge.html . Otten, L., A. Ihler , K. Kask, and R. Dechter (2012). W inning the P ASCAL 2011 MAP challenge with enhanced AND/OR branch-and-bound. In NIPS DISCML W orkshop , Lake T ahoe, USA. Pearl, J. (1988). Pr obabilistic Reasoning in Intellig ent Systems, Networks of Plausible Infer ence . Palo Alto: Morgan Kaufmann. Pralet, C., G. V erfaillie, and T . Schiex (2007). An algebraic graphical model for decision with uncertainties, feasibilities, and utilities. J ournal of Artificial Intelligence Resear ch , 421–489. Rabiner , L. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Pr oceedings of the IEEE 77 (2), 257–286. 45 Ree ves, R. and A. N. Pettitt (2004). Ef ficient recursions for general factorisable models. Biometrika 91 , 751–757. Robert, C. and G. Casella (2004). Monte Carlo Statistical Methods . Ne w Y ork, Springer - V erlag. Robertson, N. and P . Seymour (1986). Graph minors. ii. algorithmic aspects of tree-width. J our- nal of Algorithms 7 (3), 309–322. Rossi, F ., P . van Beek, and T . W alsh (Eds.) (2006). Handbook of Constraint Pr ogramming . Else vier . Saeedi, A., T . Kulkarni, V . Mansinghka, and S. J. Gershman (2017). V ariational particle approx- imations. J ournal of Machine Learning Resear ch 18 (69), 1–29. Schiex, T . (1999). A note on CSP graph parameters. T echnical Report 1999/03, INRA. Schiex, T . (2000, September). Arc consistency for soft constraints. In International Confer ence on Principles and Practice of Constr aint Pr ogramming - CP-2000 , Sing apore, pp. 411–424. Schiex, T ., H. Far gier , and G. V erfaillie (1995, August). V alued constraint satisfaction problems: hard and easy problems. In Pr oceedings of the 14 th International Joint Confer ence on Artificial Intelligence , Montr ´ eal, Canada, pp. 631–637. 
Shafer , G. and P . Shenoy (1988). Local computations in hyper -trees. W orking paper 201, School of business, Uni versity of Kansas. Shenoy , P . and G. Shafer (1990). Axioms for probability and belief-function propagation. In Pr oceedings of the 6th Confer ence on Uncertainty in Artificial Intelligence , Cambridge, USA, pp. 169–198. Smidi, V . and A. Quinn (2006). The Variational Bayes method in Signal Pr ocessing . Signal and Communication T echnologies. Springer . Solomon, C. and T . Breckon (2011). Fundamentals of Digital Image Pr ocessing: A Practical Appr oach with Examples in Matlab . John W iley & Sons, Ltd. T amura, T . and T . Akutsu (2014). Biological Data Mining and Its Applications in Healthcar e , Chapter Theory and method of completion for a Boolean regulatory network using observed data, pp. 123–146. W orld Scientific. T arjan, R. and M. Y annakakis (1984). Simple linear-time algorithms to test chordality of graphs, test acyclicity of hyper graphs and selectiv ely reduce acyclic hypergraphs. SIAM Journal of Computing 13 (3), 566–579. T opkis, D. (1978). Minimizing a submodular function on a lattice. Operations Resear ch 26 , 305–321. V andenberghe, L. and M. S. Andersen (2014). Chordal graphs and semidefinite optimization. F oundations and T r ends in Optimization 1 (4), 241–433. 46 V icente, S., V . Kolmogoro v , and C. Rother (2008). Graph cut based image segmentation with connecti vity priors. In Computer V ision and P attern Recognition,CVPR 2008 , Alaska, USA, pp. 1–8. V iricel, C., D. Simoncini, S. Barbe, and T . Schiex (2016). Guaranteed weighted counting for af finity computation: Beyond determinism and structure. In International Conference on Prin- ciples and Practice of Constr aint Pr ogramming - CP-2016 , T oulouse France, pp. 733–750. W ainwright, M., T . Jaakkola, and A. W illsky (2005). A new class of upper bounds on the log partition function. Information Theory , IEEE T r ansactions on 51 (7), 2313–2335. W ainwright, M. J. and M. 
I. Jordan (2008). Graphical models, exponential families, and varia- tional inference. F oundations and T rends in Mac hine Learning 1 (1–2), 1–305. W altz, D. L. (1972). Generating semantic descriptions from drawings of scenes with shado ws. T echnical Report AI271, M.I.T ., Cambridge MA. W ang, B. and D. Titterington (2005). Inadequacy of interval estimates corresponding to varia- tional Bayesian approximations. In Pr oceedings of the 10th International W orkshop in Artifi- cial Intelligence and Statistics , pp. 373–380. W ang, B. and M. T itterington, D. (2006). Con ver gence properties of a general algorithm for calculating v ariational Bayesian estimates for a normal mixture model. Bayesian Analysis 1 (3), 625–50. W eiss, Y . (2000). Correctness of local probability propagation in graphical models with loops. Neural Computation 12 (1). W erner , T . (2008). High-arity interactions, polyhedral relaxations, and cutting plane algorithm for soft constraint optimisation (MAP-MRF). In Pr oceedings of the Conference on Computer V ision and P attern Reco gnition, CVPR-2008 , Alaska, USA, pp. 1–8. Xing, P . (2004). Pr obabilistic gr aphical models and algorithms for genomic analysis . Phd thesis, Uni versity of California, Berkele y , CA, USA. Y edidia, J., W . Freeman, and Y . W eiss (2005). Constructing free energy approximations and generalized belief propagation algorithms. IEEE T ransactions on Information Theory 51 (7), 2282–2312. Zhong, S. and J. Ghosh (2002). HMMs and coupled HMMs for multi-channel EEG classification. In Pr oceedings of the IEEE International Joint Conference on Neural Networks , V olume 2, Honolulu, Haw ai, pp. 1254–1159. 47