On the Relationship between Sum-Product Networks and Bayesian Networks


Authors: Han Zhao, Mazen Melibari, Pascal Poupart

Han Zhao (han.zhao@uwaterloo.ca), Mazen Melibari (mmelibar@uwaterloo.ca), Pascal Poupart (ppoupart@uwaterloo.ca)
David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, ON N2L 3G1, Canada

Abstract

In this paper, we establish some theoretical connections between Sum-Product Networks (SPNs) and Bayesian Networks (BNs). We prove that every SPN can be converted into a BN in linear time and space in terms of the network size. The key insight is to use Algebraic Decision Diagrams (ADDs) to compactly represent the local conditional probability distributions at each node in the resulting BN by exploiting context-specific independence (CSI). The generated BN has a simple directed bipartite graphical structure. We show that by applying the Variable Elimination (VE) algorithm to the generated BN with ADD representations, we can recover the original SPN, where the SPN can be viewed as a history record, or caching, of the VE inference process. To help state the proof clearly, we introduce the notion of normal SPNs and present a theoretical analysis of the consistency and decomposability properties. We conclude the paper with some discussion of the implications of the proof and establish a connection between the depth of an SPN and a lower bound on the tree-width of its corresponding BN.

1. Introduction

Sum-Product Networks (SPNs) have recently been proposed as tractable deep models (Poon & Domingos, 2011) for probabilistic inference. They distinguish themselves from other types of probabilistic graphical models (PGMs), including Bayesian Networks (BNs) and Markov Networks (MNs), by the fact that inference can be done exactly in linear time with respect to the size of the network.
This has generated a lot of interest, since inference is often a core task for parameter estimation and structure learning, and it typically needs to be approximated to ensure tractability because probabilistic inference in BNs and MNs is #P-complete (Roth, 1996).

The relationship between SPNs and BNs, and more broadly with PGMs, is not clear. Since the introduction of SPNs in the seminal paper of Poon & Domingos (2011), it has been well understood that SPNs and BNs are equally expressive in the sense that they can represent any joint distribution over discrete variables,¹ but it is not clear how to convert SPNs into BNs, nor whether a blow-up may occur in the conversion process. The common belief is that there exists a distribution such that the smallest BN that encodes this distribution is exponentially larger than the smallest SPN that encodes the same distribution. The key behind this belief lies in SPNs' ability to exploit context-specific independence (CSI) (Boutilier et al., 1996).

While the above belief is correct for classic BNs with tabular conditional probability distributions (CPDs) that ignore CSI, and for BNs with tree-based CPDs due to the replication problem (Pagallo, 1989), it is not clear whether it is correct for BNs with more compact representations of the CPDs. The other direction is clear for classic BNs with tabular representation: given a BN with tabular representation of its CPDs, we can build an SPN that represents the same joint probability distribution in time and space complexity that may be exponential in the tree-width of the BN. Briefly, this is done by first constructing a junction tree and then translating it into an SPN.² However, to the best of our knowledge, it is still unknown how to convert an SPN into a BN, and whether the conversion will lead to a blow-up when more compact representations than tables and trees are used for the CPDs.
We prove in this paper that by adopting Algebraic Decision Diagrams (ADDs) (Bahar et al., 1997) to represent the CPDs at each node in a BN, every SPN can be converted into a BN in linear time and space complexity in the size of the SPN. The generated BN has a simple bipartite structure, which facilitates the analysis of the structure of an SPN in terms of the structure of the generated BN. Furthermore, we show that by applying the Variable Elimination (VE) algorithm (Zhang & Poole, 1996) to the generated BN with ADD representations of its CPDs, we can recover the original SPN in linear time and space with respect to the size of the SPN.

¹ Joint distributions over continuous variables are also possible, but we will restrict ourselves to discrete variables in this paper.
² http://spn.cs.washington.edu/faq.shtml

Our contributions can be summarized as follows. First, we present a constructive algorithm and a proof for the conversion of SPNs into BNs using ADDs to represent the local CPDs. The conversion process is bounded by a linear function of the size of the SPN in both time and space. This gives a new perspective for understanding the probabilistic semantics implied by the structure of an SPN through the generated BN. Second, we show that by executing VE on the generated BN, we can recover the original SPN in linear time and space complexity in the size of the SPN. Combined with the first point, this establishes a clear relationship between SPNs and BNs. Third, we introduce the subclass of normal SPNs and show that every SPN can be transformed into a normal SPN in quadratic time and space. Compared with general SPNs, the structure of normal SPNs exhibits more intuitive probabilistic semantics, and hence normal SPNs are used as a bridge in the conversion of general SPNs to BNs.
Fourth, our construction and analysis provide a new direction for learning the parameters/structure of BNs, since the SPNs produced by the algorithms that learn SPNs (Dennis & Ventura, 2012; Gens & Domingos, 2013; Peharz et al., 2013; Rooshenas & Lowd, 2014) can be converted into BNs.

2. Related Work

Exact probabilistic reasoning has a close connection with propositional logic and weighted model counting (Roth, 1996; Gomes et al., 2008; Bacchus et al., 2003; Sang et al., 2005). The model counting problem, #SAT, is the problem of computing the number of models for a given propositional formula, i.e., the number of distinct truth assignments of the variables for which the formula evaluates to TRUE. In its weighted version, each Boolean variable $X$ has a weight $\Pr(x) \in [0, 1]$ when set to TRUE and a weight $1 - \Pr(x)$ when set to FALSE. The weight of a truth assignment is the product of the weights of its literals. The weighted model counting problem then asks for the sum of the weights of all satisfying truth assignments. There are two important streams of research in exact weighted model counting and exact probabilistic reasoning that relate to SPNs: DPLL-style exhaustive search (Birnbaum & Lozinskii, 2011) and approaches based on knowledge compilation, e.g., Binary Decision Diagrams (BDDs), Decomposable Negation Normal Forms (DNNFs) and Arithmetic Circuits (ACs) (Bryant, 1986; Darwiche, 2001; 2000).

The SPN, as an inference machine, has a close connection with the broader field of knowledge representation and knowledge compilation. In knowledge compilation, the reasoning process is divided into two phases: an offline compilation phase and an online query-answering phase. In the offline phase, the knowledge base, either a propositional theory or a belief network, is compiled into some tractable target language.
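The weighted model counting problem described above can be checked by brute force on a tiny formula. The following sketch (hypothetical code, not from the paper) enumerates all truth assignments, keeps the satisfying ones, and sums their weights:

```python
from itertools import product

def weighted_model_count(formula, weights):
    """Brute-force weighted model count over all truth assignments.
    `formula` maps an assignment dict to True/False; `weights[X]` is the
    weight Pr(x) of setting X to TRUE (and 1 - Pr(x) for FALSE)."""
    variables = sorted(weights)
    total = 0.0
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if formula(assignment):
            w = 1.0
            for var, val in assignment.items():
                w *= weights[var] if val else 1.0 - weights[var]
            total += w
    return total

# Hypothetical example: the formula X1 OR X2 with Pr(x1)=0.3, Pr(x2)=0.5.
# The only non-model is X1 = X2 = FALSE, so the count is 1 - 0.7*0.5 = 0.65.
wmc = weighted_model_count(lambda a: a["X1"] or a["X2"], {"X1": 0.3, "X2": 0.5})
```

When the formula is a tautology the count is 1, which is exactly the normalization property of a probability distribution; this is the sense in which weighted model counting and probabilistic inference coincide.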
In the online phase, the compiled target model is used to answer a large number of queries efficiently. The key motivation of knowledge compilation is to shift the computation that is common to many queries from the online phase to the offline phase. As an example, ACs have been studied and used extensively in both knowledge representation and probabilistic inference (Darwiche, 2000; Huang et al., 2006; Chavira et al., 2006). Rooshenas & Lowd (2014) recently showed that ACs and SPNs can be converted into each other without an exponential blow-up in either time or space. As a direct result, ACs and SPNs share the same expressiveness for probabilistic reasoning.

Another representation closely related to SPNs in propositional logic and knowledge representation is the deterministic Decomposable Negation Normal Form (d-DNNF) (Darwiche & Marquis, 2001). Propositional formulas in d-DNNF are represented by a directed acyclic graph (DAG) structure to enable the reusability of sub-formulas. The terminal nodes of the DAG are literals and the internal nodes are AND or OR operators. Like SPNs, d-DNNF formulas can be queried to answer satisfiability and model counting problems. We refer interested readers to Darwiche & Marquis (2001) and Darwiche (2001) for more detailed discussions.

Since their introduction by Poon & Domingos (2011), SPNs have generated a lot of interest as a tractable class of models for probabilistic inference in machine learning. Discriminative learning techniques for SPNs have been proposed and applied to image classification (Gens & Domingos, 2012). Later, automatic structure learning algorithms were developed to build tree-structured SPNs directly from data (Dennis & Ventura, 2012; Peharz et al., 2013; Gens & Domingos, 2013; Rooshenas & Lowd, 2014).
SPNs have also been applied to various fields and have generated promising results, including activity modeling (Amer & Todorovic, 2012), speech modeling (Peharz et al., 2014) and language modeling (Cheng et al., 2014). Theoretical work investigating the influence of the depth of SPNs on expressiveness exists (Delalleau & Bengio, 2011), but is quite limited. As discussed later, our results reinforce previous theoretical results about the depth of SPNs and provide further insights about the structure of SPNs by examining the structure of equivalent BNs.

3. Preliminaries

We start by introducing the notation used in this paper. We use $1{:}N$ to abbreviate the set $\{1, 2, \ldots, N\}$. We use a capital letter $X$ to denote a random variable and a bold capital letter $\mathbf{X}_{1:N}$ to denote a set of random variables $\mathbf{X}_{1:N} = \{X_1, \ldots, X_N\}$. Similarly, a lowercase letter $x$ is used to denote a value taken by $X$, and a bold lowercase letter $\mathbf{x}_{1:N}$ denotes a joint value taken by the corresponding vector $\mathbf{X}_{1:N}$ of random variables. We may omit the subscript $1{:}N$ from $\mathbf{X}_{1:N}$ and $\mathbf{x}_{1:N}$ if it is clear from the context. For a random variable $X_i$, we use $x_i^j$, $j \in 1{:}J$, to enumerate the values taken by $X_i$. For simplicity, we use $\Pr(x)$ to mean $\Pr(X = x)$ and $\Pr(\mathbf{x})$ to mean $\Pr(\mathbf{X} = \mathbf{x})$. We use calligraphic letters to denote graphs (e.g., $\mathcal{G}$). In particular, BNs, SPNs and ADDs are denoted respectively by $\mathcal{B}$, $\mathcal{S}$ and $\mathcal{A}$. For a DAG $\mathcal{G}$ and a node $v$ in $\mathcal{G}$, we use $\mathcal{G}_v$ to denote the subgraph of $\mathcal{G}$ induced by $v$ and all its descendants. Let $V$ be a subset of the nodes of $\mathcal{G}$; then $\mathcal{G}|_V$ is the subgraph of $\mathcal{G}$ induced by the node set $V$. Similarly, we use $\mathbf{X}|_A$ or $\mathbf{x}|_A$ to denote the restriction of a vector to a subset $A$. We use node and vertex, and arc and edge, interchangeably when we refer to a graph. Other notation will be introduced when needed.
To ensure that the paper is self-contained, we briefly review some background material on Bayesian Networks, Algebraic Decision Diagrams and Sum-Product Networks. Readers who are already familiar with these models can skip the following subsections.

3.1. Bayesian Network

Consider a problem whose domain is characterized by a set of random variables $X_{1:N}$ with finite support. The joint probability distribution over $X_{1:N}$ can be characterized by a Bayesian Network, which is a DAG where nodes represent the random variables and edges represent probabilistic dependencies among the variables. In a BN, we also use the terms "node" and "variable" interchangeably. For each variable in a BN, there is a local conditional probability distribution (CPD) over the variable given its parents in the BN.

The structure of a BN encodes conditional independencies among its variables. Let $X_1, X_2, \ldots, X_N$ be a topological ordering of all the nodes in a BN,³ and let $\pi_{X_i}$ be the set of parents of node $X_i$ in the BN. Each variable in a BN is conditionally independent of all its non-descendants given its parents. Hence, the joint probability distribution over $X_{1:N}$ admits the factorization in Eq. 1:

$$\Pr(X_{1:N}) = \prod_{i=1}^{N} \Pr(X_i \mid X_{1:i-1}) = \prod_{i=1}^{N} \Pr(X_i \mid \pi_{X_i}) \qquad (1)$$

³ A topological ordering of the nodes in a DAG is a linear ordering of its nodes such that each node appears after all its parents in the ordering.

Given the factorization, one can use various inference algorithms to do probabilistic reasoning in BNs. See Wainwright & Jordan (2008) for a comprehensive survey.

3.2. Algebraic Decision Diagram

We first give a formal definition of Algebraic Decision Diagrams (ADDs) for variables with Boolean domains and then extend the definition to domains corresponding to arbitrary finite sets.

Definition 1 (Algebraic Decision Diagram (Bahar et al., 1997)).
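The factorization in Eq. 1 can be illustrated on the smallest interesting BN, $X_1 \rightarrow X_2$. The CPD entries below are hypothetical (not from the paper); the point is only that the joint is the product of the local CPDs:

```python
# A minimal sketch of the chain-rule factorization in Eq. 1 for the BN
# X1 -> X2, with hypothetical CPD entries.
pr_x1 = {1: 0.6, 0: 0.4}                    # Pr(X1)
pr_x2_given_x1 = {1: {1: 0.7, 0: 0.3},      # Pr(X2 | X1 = 1)
                  0: {1: 0.2, 0: 0.8}}      # Pr(X2 | X1 = 0)

def joint(x1, x2):
    """Pr(X1 = x1, X2 = x2) = Pr(X1 = x1) * Pr(X2 = x2 | X1 = x1)."""
    return pr_x1[x1] * pr_x2_given_x1[x1][x2]

# The joint sums to 1 over all four instantiations.
total = sum(joint(x1, x2) for x1 in (0, 1) for x2 in (0, 1))
```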
An Algebraic Decision Diagram (ADD) is a graphical representation of a real function of Boolean input variables, $f : \{0,1\}^N \mapsto \mathbb{R}$, where the graph is a rooted DAG. There are two kinds of nodes in an ADD. Terminal nodes, whose out-degree is 0, are associated with real values. Internal nodes, whose out-degree is 2, are associated with Boolean variables $X_n$, $n \in 1{:}N$. For each internal node $X_n$, the left out-edge is labeled with $X_n = \text{FALSE}$ and the right out-edge is labeled with $X_n = \text{TRUE}$.

We extend the original definition of an ADD by allowing it to represent not only functions of Boolean variables, but also any function of discrete variables with a finite set as domain. This can be done by allowing each internal node $X_n$ to have $|\mathcal{X}_n|$ out-edges and labeling each edge with $x_n^j$, $j \in 1{:}|\mathcal{X}_n|$, where $\mathcal{X}_n$ is the domain of variable $X_n$ and $|\mathcal{X}_n|$ is the number of values $X_n$ takes. Such an ADD represents a function $f : \mathcal{X}_1 \times \cdots \times \mathcal{X}_N \mapsto \mathbb{R}$, where $\times$ denotes the Cartesian product of two sets. Henceforth, we will use our extended definition of ADDs throughout the paper.

For our purposes, we will use an ADD as a compact graphical representation of the local CPD associated with each node in a BN. This is a key insight of our constructive proof presented later. Compared with a tabular representation or a decision-tree representation of local CPDs, CPDs represented by ADDs can fully exploit CSI (Boutilier et al., 1996) and effectively avoid the replication problem (Pagallo, 1989) of the decision-tree representation.

We give an example in Fig. 1, where the tabular representation, decision-tree representation and ADD representation of a function of 4 Boolean variables are presented. Another advantage of using ADDs to represent local CPDs is that arithmetic operations, such as multiplying ADDs and summing out a variable from an ADD, can be implemented efficiently in polynomial time.
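A minimal sketch (hypothetical class names, not the paper's implementation) of an ADD consistent with the function in Fig. 1: internal nodes branch on a variable, terminal nodes hold real values, and the subgraph testing $X_3$ is built once and shared, rather than replicated as in the decision tree:

```python
# Terminal and internal ADD nodes; `low` is the FALSE branch, `high` is TRUE.
class Terminal:
    def __init__(self, value):
        self.value = value
    def eval(self, assignment):
        return self.value

class Internal:
    def __init__(self, var, low, high):
        self.var, self.low, self.high = var, low, high
    def eval(self, assignment):
        child = self.high if assignment[self.var] else self.low
        return child.eval(assignment)

# Shared subgraph: the X3 test is reused by both the X1=FALSE branch and
# the X2=FALSE branch, which is exactly what the decision tree cannot do.
x4 = Internal("X4", Terminal(0.4), Terminal(0.6))
x3 = Internal("X3", x4, Terminal(0.3))
x2 = Internal("X2", x3, Terminal(0.1))
add = Internal("X1", x3, x2)

# Spot checks against the tabular representation in Fig. 1(a):
assert add.eval({"X1": 1, "X2": 1, "X3": 0, "X4": 1}) == 0.1
assert add.eval({"X1": 0, "X2": 1, "X3": 0, "X4": 1}) == 0.6
```

The sharing is what distinguishes case (c) from case (b) in Fig. 1: the decision tree needs two isomorphic copies of the $X_3$ subtree, while the ADD keeps one.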
This will allow us to use ADDs in the Variable Elimination (VE) algorithm to recover the original SPN after its conversion to a BN with CPDs represented by ADDs. Readers are referred to Bahar et al. (1997) for more detailed and thorough discussions of ADDs.

3.3. Sum-Product Network

Before introducing SPNs, we first define the notion of network polynomial, which plays an important role in our proof.

Figure 1. Different representations of the same Boolean function. The tabular representation cannot exploit CSI and the decision-tree representation cannot reuse isomorphic subgraphs. The ADD representation can fully exploit CSI by sharing isomorphic subgraphs, which makes it the most compact of the three representations. In Fig. 1(b) and Fig. 1(c), the left and right branches of each internal node correspond respectively to FALSE and TRUE. (a) Tabular representation:

X1 X2 X3 X4  f(·)    X1 X2 X3 X4  f(·)
 0  0  0  0  0.4      1  0  0  0  0.4
 0  0  0  1  0.6      1  0  0  1  0.6
 0  0  1  0  0.3      1  0  1  0  0.3
 0  0  1  1  0.3      1  0  1  1  0.3
 0  1  0  0  0.4      1  1  0  0  0.1
 0  1  0  1  0.6      1  1  0  1  0.1
 0  1  1  0  0.3      1  1  1  0  0.1
 0  1  1  1  0.3      1  1  1  1  0.1

(b) Decision-tree representation and (c) ADD representation: rooted graphs over $X_1, X_2, X_3, X_4$ with terminal values 0.1, 0.3, 0.4 and 0.6; in (c), the isomorphic subtrees of (b) are merged into shared subgraphs.

We use $I[X = x]$ to denote an indicator that returns 1 when $X = x$ and 0 otherwise. To simplify notation, we write $I_x$ for $I[X = x]$.

Definition 2 (Network Polynomial (Poon & Domingos, 2011)). Let $f(\cdot) \geq 0$ be an unnormalized probability distribution over a Boolean random vector $\mathbf{X}_{1:N}$. The network polynomial of $f(\cdot)$ is the multilinear function $\sum_{\mathbf{x}} f(\mathbf{x}) \prod_{n=1}^{N} I_{x_n}$ of the indicator variables, where the summation is over all possible instantiations of the Boolean random vector $\mathbf{X}_{1:N}$.
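To make Definition 2 concrete, here is a small brute-force illustration (hypothetical code, not from the paper) for an arbitrary unnormalized distribution $f$ over two Boolean variables:

```python
# A hypothetical unnormalized distribution f over (X1, X2).
f = {(0, 0): 0.5, (0, 1): 1.5, (1, 0): 2.0, (1, 1): 1.0}

def network_poly(i_x1, i_nx1, i_x2, i_nx2):
    """Evaluate sum_x f(x) * I_{x1-value} * I_{x2-value} at the given
    indicator settings (i_nx1, i_nx2 are the negated-value indicators)."""
    ind = {1: (i_x1, i_x2), 0: (i_nx1, i_nx2)}
    total = 0.0
    for (x1, x2), fx in f.items():
        total += fx * ind[x1][0] * ind[x2][1]
    return total

# With the indicators of one instantiation set to 1, the polynomial
# recovers f at that point; with all indicators set to 1, it sums f.
assert network_poly(1, 0, 1, 0) == f[(1, 1)]
assert network_poly(1, 1, 1, 1) == 5.0
```

Setting all indicators of a variable to 1 sums that variable out; this marginalization property is what makes the network polynomial useful for inference.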
Intuitively, the network polynomial is a Boolean expansion (Boole, 1847) of the unnormalized probability distribution $f(\cdot)$. For example, the network polynomial of a BN $X_1 \rightarrow X_2$ is $\Pr(x_1, x_2) I_{x_1} I_{x_2} + \Pr(x_1, \bar{x}_2) I_{x_1} I_{\bar{x}_2} + \Pr(\bar{x}_1, x_2) I_{\bar{x}_1} I_{x_2} + \Pr(\bar{x}_1, \bar{x}_2) I_{\bar{x}_1} I_{\bar{x}_2}$.

Definition 3 (Sum-Product Network (Poon & Domingos, 2011)). A Sum-Product Network (SPN) over Boolean variables $X_{1:N}$ is a rooted DAG whose leaves are the indicators $I_{x_1}, \ldots, I_{x_N}$ and $I_{\bar{x}_1}, \ldots, I_{\bar{x}_N}$ and whose internal nodes are sums and products. Each edge $(v_i, v_j)$ emanating from a sum node $v_i$ has a non-negative weight $w_{ij}$. The value of a product node is the product of the values of its children. The value of a sum node is $\sum_{v_j \in Ch(v_i)} w_{ij}\, \mathrm{val}(v_j)$, where $Ch(v_i)$ is the set of children of $v_i$ and $\mathrm{val}(v_j)$ is the value of node $v_j$. The value of an SPN $\mathcal{S}[I_{x_1}, I_{\bar{x}_1}, \ldots, I_{x_N}, I_{\bar{x}_N}]$ is the value of its root.

The scope of a node in an SPN is defined as the set of variables that have indicators among the node's descendants: for any node $v$ in an SPN, if $v$ is a terminal node, say, an indicator variable over $X$, then $\mathrm{scope}(v) = \{X\}$; otherwise $\mathrm{scope}(v) = \bigcup_{\tilde{v} \in Ch(v)} \mathrm{scope}(\tilde{v})$. Poon & Domingos (2011) further define the following properties of an SPN:

Definition 4 (Complete). An SPN is complete iff each sum node has children with the same scope.

Definition 5 (Consistent). An SPN is consistent iff no variable appears negated in one child of a product node and non-negated in another.

Definition 6 (Decomposable). An SPN is decomposable iff for every product node $v$, $\mathrm{scope}(v_i) \cap \mathrm{scope}(v_j) = \emptyset$ for all $v_i, v_j \in Ch(v)$, $i \neq j$.

Clearly, decomposability implies consistency in SPNs. An SPN is said to be valid iff it defines an (unnormalized) probability distribution.
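As an illustration of the scope computation and of Definitions 4 and 6, the checks can be sketched as follows (hypothetical helper names, not from the paper; the consistency check of Definition 5 would additionally need to track which indicators are negated):

```python
# A bare-bones SPN node: kind is "sum", "product", or "leaf" (an indicator).
class Node:
    def __init__(self, kind, children=(), weights=(), var=None):
        self.kind = kind
        self.children = list(children)
        self.weights = list(weights)
        self.var = var  # variable name for an indicator leaf

def scope(v):
    if v.kind == "leaf":
        return {v.var}
    return set().union(*(scope(c) for c in v.children))

def complete(v):
    """Every sum node has children with identical scopes (Definition 4)."""
    ok = all(complete(c) for c in v.children)
    if v.kind == "sum":
        ok = ok and len({frozenset(scope(c)) for c in v.children}) <= 1
    return ok

def decomposable(v):
    """Children of every product node have pairwise disjoint scopes (Def. 6)."""
    if not all(decomposable(c) for c in v.children):
        return False
    if v.kind == "product":
        seen = set()
        for c in v.children:
            s = scope(c)
            if seen & s:
                return False
            seen |= s
    return True

# A two-variable example in the spirit of Fig. 2: a sum over a product of
# univariate sum nodes, one over X1 and one over X2.
ix1, inx1 = Node("leaf", var="X1"), Node("leaf", var="X1")
ix2, inx2 = Node("leaf", var="X2"), Node("leaf", var="X2")
s1 = Node("sum", [ix1, inx1], [6, 4])
s2 = Node("sum", [ix2, inx2], [6, 14])
root = Node("sum", [Node("product", [s1, s2])], [10])
assert complete(root) and decomposable(root)
```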
Poon & Domingos (2011) proved that if an SPN is complete and consistent, then it is valid. Note that this is a sufficient, but not necessary, condition. In this paper, we focus only on complete and consistent SPNs, as we are interested in their associated probabilistic semantics. For a complete and consistent SPN $\mathcal{S}$, each node $v$ in $\mathcal{S}$ defines a network polynomial $f_v(\cdot)$ that corresponds to the sub-SPN rooted at $v$. The network polynomial defined by the root of the SPN can then be computed recursively, by taking a weighted sum of the network polynomials defined by the sub-SPNs rooted at the children of each sum node, and a product of the network polynomials defined by the sub-SPNs rooted at the children of each product node. The probability distribution induced by an SPN $\mathcal{S}$ is defined as $\Pr_{\mathcal{S}}(\mathbf{x}) \triangleq \frac{f_{\mathcal{S}}(\mathbf{x})}{\sum_{\mathbf{x}} f_{\mathcal{S}}(\mathbf{x})}$, where $f_{\mathcal{S}}(\cdot)$ is the network polynomial defined by the root of $\mathcal{S}$. An example of a complete and consistent SPN is given in Fig. 2.

Figure 2. A complete and consistent SPN over Boolean variables $X_1, X_2$. This SPN is also decomposable, since every product node has children whose scopes do not intersect. (The root sum node has weights 10, 6 and 9 over three product nodes, each multiplying one sum node over $I_{x_1}, I_{\bar{x}_1}$ with one sum node over $I_{x_2}, I_{\bar{x}_2}$, as written out below.) The network polynomial defined by (the root of) this SPN is

$$f(X_1, X_2) = 10(6 I_{x_1} + 4 I_{\bar{x}_1})(6 I_{x_2} + 14 I_{\bar{x}_2}) + 6(6 I_{x_1} + 4 I_{\bar{x}_1})(2 I_{x_2} + 8 I_{\bar{x}_2}) + 9(9 I_{x_1} + I_{\bar{x}_1})(2 I_{x_2} + 8 I_{\bar{x}_2}) = 594 I_{x_1} I_{x_2} + 1776 I_{x_1} I_{\bar{x}_2} + 306 I_{\bar{x}_1} I_{x_2} + 824 I_{\bar{x}_1} I_{\bar{x}_2}$$

and the probability distribution induced by $\mathcal{S}$ is $\Pr_{\mathcal{S}} = \frac{594}{3500} I_{x_1} I_{x_2} + \frac{1776}{3500} I_{x_1} I_{\bar{x}_2} + \frac{306}{3500} I_{\bar{x}_1} I_{x_2} + \frac{824}{3500} I_{\bar{x}_1} I_{\bar{x}_2}$.

4. Main Results

In this section, we first state the main results obtained in this paper and then provide detailed proofs with some discussion of the results. To keep the presentation simple, we assume without loss of generality that all the random variables are Boolean unless explicitly stated. It is straightforward to extend our analysis to discrete random variables with finite support.

For an SPN $\mathcal{S}$, let $|\mathcal{S}|$ be the size of the SPN, i.e., the number of nodes plus the number of edges in the graph. For a BN $\mathcal{B}$, the size $|\mathcal{B}|$ is defined as the size of the graph plus the size of all the CPDs in $\mathcal{B}$ (the size of a CPD depends on its representation, which will be clear from the context). The main theorems are:

Theorem 1. There exists an algorithm that converts any complete and decomposable SPN $\mathcal{S}$ over Boolean variables $X_{1:N}$ into a BN $\mathcal{B}$ with CPDs represented by ADDs in time $O(N|\mathcal{S}|)$. Furthermore, $\mathcal{S}$ and $\mathcal{B}$ represent the same distribution and $|\mathcal{B}| = O(N|\mathcal{S}|)$.

As will become clear later, Thm. 1 immediately leads to the following corollary:

Corollary 2. There exists an algorithm that converts any complete and consistent SPN $\mathcal{S}$ over Boolean variables $X_{1:N}$ into a BN $\mathcal{B}$ with CPDs represented by ADDs in time $O(N|\mathcal{S}|^2)$. Furthermore, $\mathcal{S}$ and $\mathcal{B}$ represent the same distribution and $|\mathcal{B}| = O(N|\mathcal{S}|^2)$.

Remark 1. The BN $\mathcal{B}$ generated from $\mathcal{S}$ in Theorem 1 and Corollary 2 has a simple bipartite DAG structure, where all the source nodes are hidden variables and the terminal nodes are the Boolean variables $X_{1:N}$.

Remark 2. Assuming sum nodes alternate with product nodes in the SPN $\mathcal{S}$, the depth of $\mathcal{S}$ is proportional to the maximum in-degree of the nodes in $\mathcal{B}$, which, as a result, is proportional to a lower bound on the tree-width of $\mathcal{B}$.

Theorem 3.
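The arithmetic in the Fig. 2 example can be verified directly (a quick numerical check, not from the paper): evaluating the network polynomial at one instantiation means setting that instantiation's pair of indicators to 1 and the rest to 0, while setting all indicators to 1 yields the normalizer.

```python
# The network polynomial of the Fig. 2 SPN, as a function of the four
# indicator variables (ix1 = I_x1, inx1 = I_{bar x1}, and so on).
def f(ix1, inx1, ix2, inx2):
    return (10 * (6*ix1 + 4*inx1) * (6*ix2 + 14*inx2)
            + 6 * (6*ix1 + 4*inx1) * (2*ix2 + 8*inx2)
            + 9 * (9*ix1 + 1*inx1) * (2*ix2 + 8*inx2))

coeffs = [f(1, 0, 1, 0), f(1, 0, 0, 1), f(0, 1, 1, 0), f(0, 1, 0, 1)]
partition = sum(coeffs)
assert coeffs == [594, 1776, 306, 824] and partition == 3500
assert f(1, 1, 1, 1) == 3500   # all indicators at 1 sums out both variables
```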
Given the BN $\mathcal{B}$ with ADD representations of its CPDs generated from a complete and decomposable SPN $\mathcal{S}$ over Boolean variables $X_{1:N}$, the original SPN $\mathcal{S}$ can be recovered by applying the Variable Elimination algorithm to $\mathcal{B}$ in time $O(N|\mathcal{S}|)$.

Remark 3. The combination of Theorems 1 and 3 shows that, for distributions for which SPNs allow a compact representation and efficient inference, BNs with ADDs also allow a compact representation and efficient inference (i.e., there is no exponential blow-up).

To make the upcoming proofs concise, we first define a normal form for SPNs and show that every complete and consistent SPN can be transformed into a normal SPN in quadratic time and space without changing the network polynomial. We then derive the proofs for normal SPNs. Note that we only focus on SPNs that are complete and consistent; hence, when we refer to an SPN, we assume that it is complete and consistent without explicitly stating this.

4.1. Normal Form

For an SPN $\mathcal{S}$, let $f_{\mathcal{S}}(\cdot)$ be the network polynomial defined at the root of $\mathcal{S}$. Define the height of an SPN to be the length of the longest path from the root to a terminal node.

Definition 7. An SPN is said to be normal if:

1. It is complete and decomposable.
2. For each sum node in the SPN, the weights of the edges emanating from the sum node are nonnegative and sum to 1.
3. Every terminal node in the SPN is a univariate distribution over a Boolean variable, and the size of the scope of every sum node is at least 2 (sum nodes whose scope is of size 1 are reduced into terminal nodes).

Theorem 4. For any complete and consistent SPN $\mathcal{S}$, there exists a normal SPN $\mathcal{S}'$ such that $\Pr_{\mathcal{S}}(\cdot) = \Pr_{\mathcal{S}'}(\cdot)$ and $|\mathcal{S}'| = O(|\mathcal{S}|^2)$.

To show this, we first prove the following lemmas.

Lemma 5. For any complete and consistent SPN $\mathcal{S}$ over $X_{1:N}$, there exists a complete and decomposable SPN $\mathcal{S}'$ over $X_{1:N}$ such that $f_{\mathcal{S}}(\mathbf{x}) = f_{\mathcal{S}'}(\mathbf{x})$ for all $\mathbf{x}$, and $|\mathcal{S}'| = O(|\mathcal{S}|^2)$.

Proof. Let $\mathcal{S}$ be a complete and consistent SPN.
If it is also decomposable, then we simply set $\mathcal{S}' = \mathcal{S}$ and we are done. Otherwise, let $v_1, \ldots, v_M$ be an inverse topological ordering of all the nodes in $\mathcal{S}$, including both terminal and internal nodes, such that for any $v_m$, $m \in 1{:}M$, all the ancestors of $v_m$ in the graph appear after $v_m$ in the ordering. Let $v_m$ be the first product node in the ordering that violates decomposability. Let $v_{m_1}, v_{m_2}, \ldots, v_{m_l}$ be the children of $v_m$, where $m_1 < m_2 < \cdots < m_l < m$ (due to the inverse topological ordering). Let $(v_{m_i}, v_{m_j})$, $i < j$, $i, j \in 1{:}l$, be the first ordered pair of nodes such that $\mathrm{scope}(v_{m_i}) \cap \mathrm{scope}(v_{m_j}) \neq \emptyset$, and let $X \in \mathrm{scope}(v_{m_i}) \cap \mathrm{scope}(v_{m_j})$. Consider $f_{v_{m_i}}$ and $f_{v_{m_j}}$, the network polynomials defined by the sub-SPNs rooted at $v_{m_i}$ and $v_{m_j}$.

Expand the network polynomials $f_{v_{m_i}}$ and $f_{v_{m_j}}$ into a sum-of-products form by applying the distributive law between products and sums. For example, if $f(X_1, X_2) = (I_{x_1} + 9 I_{\bar{x}_1})(4 I_{x_2} + 6 I_{\bar{x}_2})$, then the expansion of $f$ is $f(X_1, X_2) = 4 I_{x_1} I_{x_2} + 6 I_{x_1} I_{\bar{x}_2} + 36 I_{\bar{x}_1} I_{x_2} + 54 I_{\bar{x}_1} I_{\bar{x}_2}$. Since $\mathcal{S}$ is complete, the sub-SPNs rooted at $v_{m_i}$ and $v_{m_j}$ are also complete, which means that all the monomials in the expansion of $f_{v_{m_i}}$ share the same scope, and the same applies to $f_{v_{m_j}}$. Since $X \in \mathrm{scope}(v_{m_i}) \cap \mathrm{scope}(v_{m_j})$, every monomial in the expansions of $f_{v_{m_i}}$ and $f_{v_{m_j}}$ must contain an indicator variable over $X$, either $I_x$ or $I_{\bar{x}}$. Furthermore, since $\mathcal{S}$ is consistent, the sub-SPN rooted at $v_m$ is also consistent. Consider $f_{v_m} = \prod_{k=1}^{l} f_{v_{m_k}} = f_{v_{m_i}} f_{v_{m_j}} \prod_{k \neq i,j} f_{v_{m_k}}$.
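As a quick sanity check of the sum-of-products expansion example above (hypothetical code, not the paper's): multiplying the two univariate factors by the distributive law gives one monomial per pair of literals, with the expected coefficients 4, 6, 36 and 54.

```python
from itertools import product

# Factors of f(X1, X2) = (I_x1 + 9*I_{bar x1}) * (4*I_x2 + 6*I_{bar x2}),
# each represented as a map from literal label to coefficient.
factor1 = {"x1": 1, "nx1": 9}
factor2 = {"x2": 4, "nx2": 6}

# Distributive law: the coefficient of each monomial is the product of the
# coefficients of its two literals.
expansion = {(l1, l2): c1 * c2
             for (l1, c1), (l2, c2) in product(factor1.items(), factor2.items())}
assert expansion == {("x1", "x2"): 4, ("x1", "nx2"): 6,
                     ("nx1", "x2"): 36, ("nx1", "nx2"): 54}
```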
Because $v_m$ is consistent, each monomial in the expansions of $f_{v_{m_i}}$ and $f_{v_{m_j}}$ must contain the same indicator variable of $X$, either $I_x$ or $I_{\bar{x}}$; otherwise there would be a term $I_x I_{\bar{x}}$ in $f_{v_m}$, which violates the consistency assumption. Without loss of generality, assume each monomial in the expansions of $f_{v_{m_i}}$ and $f_{v_{m_j}}$ contains $I_x$. Then we can re-factorize $f_{v_m}$ in the following way:

$$f_{v_m} = \prod_{k=1}^{l} f_{v_{m_k}} = I_x^2 \, \frac{f_{v_{m_i}}}{I_x} \, \frac{f_{v_{m_j}}}{I_x} \prod_{k \neq i,j} f_{v_{m_k}} = I_x \, \frac{f_{v_{m_i}}}{I_x} \, \frac{f_{v_{m_j}}}{I_x} \prod_{k \neq i,j} f_{v_{m_k}} = I_x \, \tilde{f}_{v_{m_i}} \tilde{f}_{v_{m_j}} \prod_{k \neq i,j} f_{v_{m_k}} \qquad (2)$$

where we use the fact that indicator variables are idempotent, i.e., $I_x^2 = I_x$, and $\tilde{f}_{v_{m_i}}$ ($\tilde{f}_{v_{m_j}}$) is the function obtained by factoring $I_x$ out of $f_{v_{m_i}}$ ($f_{v_{m_j}}$). Eq. 2 means that, to make $v_m$ decomposable, we can simply remove all the indicator variables $I_x$ from the sub-SPNs rooted at $v_{m_i}$ and $v_{m_j}$ and later link $I_x$ to $v_m$ directly. Such a transformation will not change the network polynomial $f_{v_m}$, as shown by Eq. 2, but it will remove $X$ from $\mathrm{scope}(v_{m_i}) \cap \mathrm{scope}(v_{m_j})$.

In principle, we could apply this transformation to all ordered pairs $(v_{m_i}, v_{m_j})$, $i < j$, $i, j \in 1{:}l$, with nonempty intersections of scope. However, this is not algorithmically efficient, and more importantly, for local components containing $I_x$ in $f_{v_m}$ that are reused by other nodes $v_n$ outside of $\mathcal{S}_{v_m}$, we cannot remove $I_x$ from them; otherwise the network polynomial of each such $v_n$ would be changed by the removal. In this case, we need to duplicate the local components to ensure that local transformations with respect to $f_{v_m}$ do not affect the network polynomials $f_{v_n}$. We present the transformation in Alg. 1, which transforms a complete and consistent SPN $\mathcal{S}$ into a complete and decomposable SPN $\mathcal{S}'$.

Algorithm 1: Decomposition Transformation
Input: a complete and consistent SPN $\mathcal{S}$.
Output: a complete and decomposable SPN $\mathcal{S}'$.
1: Let $v_1, v_2, \ldots, v_M$ be an inverse topological ordering of the nodes in $\mathcal{S}$
2: for $m = 1$ to $M$ do
3:   if $v_m$ is a non-decomposable product node then
4:     $\Omega(v_m) \leftarrow \bigcup_{i \neq j} \mathrm{scope}(v_{m_i}) \cap \mathrm{scope}(v_{m_j})$
5:     $V \leftarrow \{v \in \mathcal{S}_{v_m} \mid \mathrm{scope}(v) \cap \Omega(v_m) \neq \emptyset\}$
6:     $\mathcal{S}_V \leftarrow \mathcal{S}_{v_m}|_V$
7:     $D(v_m) \leftarrow$ descendants of $v_m$
8:     for each node $v \in \mathcal{S}_V \setminus \{v_m\}$ do
9:       if $Pa(v) \setminus D(v_m) \neq \emptyset$ then
10:        Create $p \leftarrow v \otimes \prod_{X \in \Omega(v_m) \cap \mathrm{scope}(v)} I_{x^*}$
11:        Connect $p$ to every $f \in Pa(v) \setminus D(v_m)$
12:        Disconnect $v$ from every $f \in Pa(v) \setminus D(v_m)$
13:      end if
14:    end for
15:    for each node $v \in \mathcal{S}_V$ in bottom-up order do
16:      Disconnect $\tilde{v} \in Ch(v)$ whenever $\mathrm{scope}(\tilde{v}) \subseteq \Omega(v_m)$
17:    end for
18:    Connect $\prod_{X \in \Omega(v_m)} I_{x^*}$ to $v_m$ directly
19:  end if
20: end for
21: Delete all nodes unreachable from the root of $\mathcal{S}$
22: Delete all product nodes with out-degree 0
23: Contract all product nodes with out-degree 1

Informally, Alg. 1 works using the following identity:

$$f_{v_m} = \Bigg( \prod_{X \in \Omega(v_m)} I_{x^*} \Bigg) \prod_{k=1}^{l} \frac{f_{v_{m_k}}}{\prod_{X \in \Omega(v_m) \cap \mathrm{scope}(v_{m_k})} I_{x^*}} \qquad (3)$$

where $\Omega(v_m) \triangleq \bigcup_{i,j \in 1:l,\, i \neq j} \mathrm{scope}(v_{m_i}) \cap \mathrm{scope}(v_{m_j})$, i.e., $\Omega(v_m)$ is the union of all the variables shared between pairs of children of $v_m$, and $I_{x^*}$ is the indicator variable of $X \in \Omega(v_m)$ appearing in $\mathcal{S}_{v_m}$. Based on the analysis above, we know that for each $X \in \Omega(v_m)$ only one kind of indicator variable $I_{x^*}$ appears inside $\mathcal{S}_{v_m}$; otherwise $v_m$ would not be consistent. In Line 6, $\mathcal{S}_{v_m}|_V$ is defined as the sub-SPN of $\mathcal{S}_{v_m}$ induced by the node set $V$, i.e., a subgraph of $\mathcal{S}_{v_m}$ where the node set is restricted to $V$. In Lines 5-6, we first extract the induced sub-SPN $\mathcal{S}_V$ from $\mathcal{S}_{v_m}$, rooted at $v_m$, using the set of nodes that have nonempty scope intersections with $\Omega(v_m)$. We disconnect the nodes in $\mathcal{S}_V$ from their children if the children are indicator variables of a subset of $\Omega(v_m)$ (Lines 15-17).
At Line 18, we build a new product node by multiplying all the indicator variables in $\Omega(v_m)$ and link it to $v_m$ directly. To keep unchanged the network polynomials of nodes outside $\mathcal{S}_{v_m}$ that use nodes in $\mathcal{S}_V$, we create a duplicate node $p$ for each such node $v$ and link $p$ to all the parents of $v$ outside of $\mathcal{S}_{v_m}$, at the same time deleting the original links (Lines 9-13).

In summary, Lines 15-17 ensure that $v_m$ is decomposable by removing all the shared indicator variables in $\Omega(v_m)$. Line 18, together with Eq. 3, guarantees that $f_{v_m}$ is unchanged after the transformation. Lines 9-13 create the duplicates necessary to ensure that other network polynomials are not affected. Lines 21-23 simplify the transformed SPN to make it more compact. An example is depicted in Fig. 3 to illustrate the transformation process.

Figure 3. Transformation process described in Alg. 1 to construct a complete and decomposable SPN from a complete and consistent SPN. The product node $v_m$ in the left SPN is not decomposable. The induced sub-SPN $\mathcal{S}_{v_m}$ is highlighted in blue and $\mathcal{S}_V$ is highlighted in green. The node $v_{m_2}$, highlighted in red, is reused by $v_n$, which is outside $\mathcal{S}_{v_m}$. To compensate for $v_{m_2}$, we create a new product node $p$ in the right SPN and connect it to the indicator variable $I_{x_3}$ and to $v_{m_2}$. Dashed gray lines in the right SPN denote deleted edges and nodes, while red edges and nodes are those added during Alg. 1.

We now analyze the size of the SPN constructed by Alg. 1. For a graph $\mathcal{S}$, let $V(\mathcal{S})$ be the number of nodes in $\mathcal{S}$ and $E(\mathcal{S})$ the number of edges in $\mathcal{S}$. Note that in Lines 8-17 we only focus on nodes that appear in the induced SPN $\mathcal{S}_V$, which clearly satisfies $|\mathcal{S}_V| \leq |\mathcal{S}_{v_m}|$.
Furthermore, we create a new product node p at Line 10 iff v is reused by other nodes which do not appear in S_{v_m}. This means that the number of nodes created during each iteration between Lines 2 and 20 is bounded by V(S_V) ≤ V(S_{v_m}). Line 10 also creates 2 new edges to connect p to v and the indicator variables. Lines 11 and 12 first connect edges to p and then delete edges from v, hence these two steps do not increase the number of edges. So the increase in the number of edges is bounded by 2V(S_V) ≤ 2V(S_{v_m}). Combining the increases in both nodes and edges, during each outer iteration the increase in size is bounded by 3|S_V| ≤ 3|S_{v_m}| = O(|S|). There are at most M = V(S) outer iterations, hence the total increase in size is bounded by O(M|S|) = O(|S|²).

Lemma 6. For any complete and decomposable SPN S over X_{1:N} that satisfies condition 2 of Def. 7, $\sum_x f_S(x) = 1$.

Proof. We give a proof by induction on the height of S. Let R be the root of S.

• Base case. SPNs of height 0 are indicator variables over some Boolean variable, whose network polynomials immediately satisfy Lemma 6.

• Induction step. Assume Lemma 6 holds for any SPN of height ≤ k. Consider an SPN S of height k+1. We consider the following two cases:

  – The root R of S is a product node. In this case the network polynomial f_S(·) of S is defined as f_S = ∏_{v ∈ Ch(R)} f_v. We have

$$\sum_x f_S(x) = \sum_x \prod_{v \in Ch(R)} f_v(x|_{\mathrm{scope}(v)}) \qquad (4)$$
$$= \prod_{v \in Ch(R)} \sum_{x|_{\mathrm{scope}(v)}} f_v(x|_{\mathrm{scope}(v)}) \qquad (5)$$
$$= \prod_{v \in Ch(R)} 1 = 1 \qquad (6)$$

where x|_{scope(v)} means that x is restricted to the set scope(v). Eq. 5 follows from the decomposability of R and Eq. 6 follows from the induction hypothesis.

  – The root R of S is a sum node. The network polynomial is $f_S = \sum_{v \in Ch(R)} w_{R,v} f_v$.
We have

$$\sum_x f_S(x) = \sum_x \sum_{v \in Ch(R)} w_{R,v} f_v(x) \qquad (7)$$
$$= \sum_{v \in Ch(R)} w_{R,v} \sum_x f_v(x) \qquad (8)$$
$$= \sum_{v \in Ch(R)} w_{R,v} = 1 \qquad (9)$$

Eq. 8 follows from the commutative and associative laws of addition and Eq. 9 follows from the induction hypothesis.

Corollary 7. For any complete and decomposable SPN S over X_{1:N} that satisfies condition 2 of Def. 7, Pr_S(·) = f_S(·).

Lemma 8. For any complete and decomposable SPN S, there exists an SPN S' in which the weights of the edges emanating from every sum node are nonnegative and sum to 1, such that Pr_S(·) = Pr_{S'}(·) and |S'| = |S|.

Proof. Alg. 2 runs in one pass over S to construct the required SPN S'. We proceed to prove that the SPN S' returned by Alg. 2 satisfies Pr_{S'}(·) = Pr_S(·) and |S'| = |S|, and that S' satisfies condition 2 of Def. 7. It is clear that |S'| = |S| because we only modify the weights of S to construct S' at Line 7.

Algorithm 2 Weight Normalization
Input: SPN S
Output: SPN S'
 1: S' ← S
 2: val(I_x) ← 1, ∀ I_x ∈ S
 3: Let v_1, ..., v_M be an inverse topological ordering of the nodes in S
 4: for m = 1 to M do
 5:   if v_m is a sum node then
 6:     val(v_m) ← Σ_{v ∈ Ch(v_m)} w_{v_m,v} val(v)
 7:     w'_{v_m,v} ← w_{v_m,v} val(v) / val(v_m), ∀ v ∈ Ch(v_m)
 8:   else if v_m is a product node then
 9:     val(v_m) ← ∏_{v ∈ Ch(v_m)} val(v)
10:   end if
11: end for

Based on Lines 6 and 7, it is also straightforward to verify that for each sum node v in S', the weights of the edges emanating from v are nonnegative and sum to 1. We now show that Pr_{S'}(·) = Pr_S(·). By Corollary 7, Pr_{S'}(·) = f_{S'}(·). Hence it is sufficient to show that f_{S'}(·) = Pr_S(·). Before deriving a proof, it is helpful to note that for each node v ∈ S, val(v) = Σ_{x|_{scope(v)}} f_v(x|_{scope(v)}). We give a proof by induction on the height of S.
• Base case. SPNs of height 0 are indicator variables, which automatically satisfy Lemma 8.

• Induction step. Assume Lemma 8 holds for any SPN of height ≤ k. Consider an SPN S of height k+1. Let R be the root node of S with out-degree l. We discuss the following two cases.

  – R is a product node. Let R_1, ..., R_l be the children of R and S_1, ..., S_l be the corresponding sub-SPNs. By induction, Alg. 2 returns S'_1, ..., S'_l that satisfy Lemma 8. Since R is a product node, we have

$$f_{S'}(x) = \prod_{i=1}^{l} f_{S'_i}(x|_{\mathrm{scope}(R_i)}) \qquad (10)$$
$$= \prod_{i=1}^{l} \Pr_{S_i}(x|_{\mathrm{scope}(R_i)}) \qquad (11)$$
$$= \prod_{i=1}^{l} \frac{f_{S_i}(x|_{\mathrm{scope}(R_i)})}{\sum_{x|_{\mathrm{scope}(R_i)}} f_{S_i}(x|_{\mathrm{scope}(R_i)})} \qquad (12)$$
$$= \frac{\prod_{i=1}^{l} f_{S_i}(x|_{\mathrm{scope}(R_i)})}{\sum_x \prod_{i=1}^{l} f_{S_i}(x|_{\mathrm{scope}(R_i)})} \qquad (13)$$
$$= \frac{f_S(x)}{\sum_x f_S(x)} = \Pr_S(x) \qquad (14)$$

Eq. 11 follows from the induction hypothesis and Eq. 13 follows from the distributive law, due to the decomposability of S.

  – R is a sum node with weights w_1, ..., w_l ≥ 0. We have

$$f_{S'}(x) = \sum_{i=1}^{l} w'_i f_{S'_i}(x) \qquad (15)$$
$$= \sum_{i=1}^{l} \frac{w_i\, \mathit{val}(R_i)}{\sum_{j=1}^{l} w_j\, \mathit{val}(R_j)} \Pr_{S_i}(x) \qquad (16)$$
$$= \sum_{i=1}^{l} \frac{w_i\, \mathit{val}(R_i)}{\sum_{j=1}^{l} w_j\, \mathit{val}(R_j)} \cdot \frac{f_{S_i}(x)}{\sum_x f_{S_i}(x)} \qquad (17)$$
$$= \sum_{i=1}^{l} \frac{w_i\, \mathit{val}(R_i)}{\sum_{j=1}^{l} w_j\, \mathit{val}(R_j)} \cdot \frac{f_{S_i}(x)}{\mathit{val}(R_i)} \qquad (18)$$
$$= \frac{\sum_{i=1}^{l} w_i f_{S_i}(x)}{\sum_{j=1}^{l} w_j\, \mathit{val}(R_j)} = \frac{f_S(x)}{\sum_x f_S(x)} \qquad (19)$$
$$= \Pr_S(x) \qquad (20)$$

where Eq. 16 follows from the induction hypothesis, and Eqs. 18 and 19 follow from the fact that val(v) = Σ_{x|_{scope(v)}} f_v(x|_{scope(v)}) for all v ∈ S.

This completes the proof, since Pr_{S'}(·) = f_{S'}(·) = Pr_S(·).

Given a complete and decomposable SPN S, we now show by construction that the last condition in Def. 7 can be satisfied in time and space O(|S|).

Lemma 9. Given a complete and decomposable SPN S, there exists an SPN S' satisfying condition 3 of Def. 7 such that Pr_{S'}(·) = Pr_S(·) and |S'| = O(|S|).

Proof.
We give a proof by construction. First, if S is not weight normalized, apply Alg. 2 to normalize the weights (i.e., so that the weights of the edges emanating from each sum node sum to 1). Now check each sum node v in S in bottom-up order. If |scope(v)| = 1, by Corollary 7 we know the network polynomial f_v is a probability distribution over its scope, say {X}. Reduce v to a terminal node that is the distribution over X induced by its network polynomial, and disconnect v from all its children. The last step is to remove all the unreachable nodes from S to obtain S'. Note that this step can only decrease the size of S, hence |S'| = O(|S|).

Proof of Thm. 4. The combination of Lemmas 5, 8 and 9 completes the proof of Thm. 4.

An example of a normal SPN constructed from the SPN in Fig. 2 is depicted in Fig. 4.

Figure 4. Transformation of an SPN into normal form. Terminal nodes that are probability distributions over a single variable are represented by a double circle.

4.2. SPN to BN

In order to construct a BN from an SPN, we require the SPN to be in normal form; otherwise we first transform it into normal form using Alg. 1 and 2. Let S be a normal SPN over X_{1:N}. Before showing how to construct a corresponding BN, we first give some intuition. One useful view is to associate each sum node in an SPN with a hidden variable. For example, consider a sum node v ∈ S with out-degree l. Since S is normal, we have Σ_{i=1}^{l} w_i = 1 and w_i ≥ 0, ∀ i ∈ 1:l. This naturally suggests that we can associate with each sum node v ∈ S a hidden discrete random variable H_v with multinomial distribution Pr_v(H_v = i) = w_i, i ∈ 1:l.
Therefore, S can be thought of as defining a joint probability distribution over X_{1:N} and H = {H_v | v ∈ S, v is a sum node}, where X_{1:N} are the observable variables and H are the hidden variables. When doing inference with an SPN, we implicitly sum out all the hidden variables H and compute Pr_S(x) = Σ_h Pr_S(x, h). Associating each sum node in an SPN with a hidden variable not only gives us a conceptual understanding of the probability distribution defined by an SPN, but also helps to elucidate one of the key properties implied by the structure of an SPN, summarized below:

Proposition 10. Given a normal SPN S, let p be a product node in S with l children. Let v_1, ..., v_k be sum nodes which lie on a path from the root of S to p. Then

$$\Pr_S\!\left(x|_{\mathrm{scope}(p)} \,\middle|\, H_{v_1} = v_1^*, \ldots, H_{v_k} = v_k^*\right) = \prod_{i=1}^{l} \Pr_S\!\left(x|_{\mathrm{scope}(p_i)} \,\middle|\, H_{v_1} = v_1^*, \ldots, H_{v_k} = v_k^*\right) \qquad (21)$$

where H_v = v* means that the sum node v selects its v*-th branch, x|_A denotes restricting x to the set A, and p_i is the i-th child of the product node p.

Proof. Consider the sub-SPN S_p rooted at p. S_p can be obtained by restricting H_{v_1} = v_1^*, ..., H_{v_k} = v_k^*, i.e., going from the root of S along the path H_{v_1} = v_1^*, ..., H_{v_k} = v_k^*. Since p is a decomposable product node, S_p admits the above factorization by the definition of a product node and Corollary 7. Note that there may exist multiple paths from the root to p in S. Each such path admits the factorization stated in Eq. 21.

Eq. 21 highlights two key insights implied by the structure of an SPN that will allow us to construct an equivalent BN with ADDs. First, CSI is efficiently encoded by the structure of an SPN, via Proposition 10. Second, the DAG structure of an SPN allows multiple assignments of hidden variables to share the same factorization, which effectively avoids the replication problem present in decision trees.
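To make the hidden-variable view concrete, the following Python sketch evaluates a tiny hypothetical normal SPN (a root sum node over two product branches on Boolean X1, X2; all parameters are invented for illustration). It checks that conditioning on the branch selected by H factorizes the product node's scope, in the shape of Proposition 10, while marginally the mixture couples X1 and X2:

```python
from itertools import product

# A hypothetical normal SPN: the root sum node (hidden variable H with
# Pr(H = h) = w[h]) has two product-node children, each multiplying
# univariate distributions over X1 and X2. All numbers are made up.
w   = [0.4, 0.6]
px1 = [[0.3, 0.7], [0.8, 0.2]]   # Pr(X1 = x | H = h)
px2 = [[0.5, 0.5], [0.1, 0.9]]   # Pr(X2 = x | H = h)

def joint(x1, x2):
    """Pr_S(x): evaluating the SPN implicitly sums out H."""
    return sum(w[h] * px1[h][x1] * px2[h][x2] for h in range(2))

# Conditioned on the branch selected by H, the product node's scope
# factorizes into its children's distributions.
for h, x1, x2 in product(range(2), repeat=3):
    pr_xh = w[h] * px1[h][x1] * px2[h][x2]           # Pr(x1, x2, H = h)
    assert abs(pr_xh / w[h] - px1[h][x1] * px2[h][x2]) < 1e-12

# Marginally, however, X1 and X2 are dependent: the mixture couples them.
pX1 = [sum(joint(x1, x2) for x2 in range(2)) for x1 in range(2)]
pX2 = [sum(joint(x1, x2) for x1 in range(2)) for x2 in range(2)]
assert abs(joint(0, 0) - pX1[0] * pX2[0]) > 1e-3

# The implicitly defined distribution is properly normalized.
assert abs(sum(joint(a, b) for a, b in product(range(2), repeat=2)) - 1.0) < 1e-12
```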
Based on the observations above, and with the help of the normal form for SPNs, we now proceed to prove the first main result of this paper: Thm. 1. First, we present the algorithm that constructs the structure of a BN B from S in Alg. 3.

Algorithm 3 Build BN Structure
Input: normal SPN S
Output: BN B = (B_V, B_E)
 1: R ← root of S
 2: if R is a terminal node over variable X then
 3:   Create an observable variable X
 4:   B_V ← B_V ∪ {X}
 5: else
 6:   for each child R_i of R do
 7:     if a BN has not been built for S_{R_i} then
 8:       Recursively build the BN structure for S_{R_i}
 9:     end if
10:   end for
11:   if R is a sum node then
12:     Create a hidden variable H_R associated with R
13:     B_V ← B_V ∪ {H_R}
14:     for each observable variable X ∈ S_R do
15:       B_E ← B_E ∪ {(H_R, X)}
16:     end for
17:   end if
18: end if

In a nutshell, Alg. 3 creates an observable variable X in B for each terminal node over X in S (Lines 2-4). For each internal sum node v in S, Alg. 3 creates a hidden variable H_v associated with v and builds directed edges from H_v to all observable variables X appearing in the sub-SPN rooted at v (Lines 11-17). The BN B created by Alg. 3 has a directed bipartite structure, with a layer of hidden variables pointing to a layer of observable variables. A hidden variable H points to an observable variable X in B iff X appears in the sub-SPN rooted at H in S. We now present Alg. 4 and 5 to build ADDs for each observable variable X and hidden variable H in B.
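Before moving to the CPDs, the structure-building pass of Alg. 3 can be sketched in a few lines of Python. The Node class and its fields are a hypothetical SPN representation invented for illustration; only the bipartite-structure logic mirrors the algorithm:

```python
# Minimal sketch of the structure-building pass of Alg. 3, assuming a
# toy DAG representation of an SPN (hypothetical Node class).
class Node:
    def __init__(self, kind, children=(), var=None):
        self.kind = kind               # 'sum', 'product', or 'terminal'
        self.children = list(children)
        self.var = var                 # variable name for terminal nodes

def scope(node):
    """Observable variables appearing in the sub-SPN rooted at `node`."""
    if node.kind == 'terminal':
        return {node.var}
    return set().union(*(scope(c) for c in node.children))

def build_bn_structure(root):
    """One observable variable per terminal node, one hidden variable per
    sum node, and an edge H_v -> X for every X in the scope of v."""
    variables, edges, seen = set(), set(), set()
    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        if node.kind == 'terminal':
            variables.add(node.var)
            return
        for child in node.children:
            visit(child)
        if node.kind == 'sum':
            for x in scope(node):
                edges.add((id(node), x))   # hidden H_node points to X
    visit(root)
    return variables, edges

# A root sum node over two product nodes, each over terminals on X1, X2.
t = lambda v: Node('terminal', var=v)
root = Node('sum', [Node('product', [t('X1'), t('X2')]),
                    Node('product', [t('X1'), t('X2')])])
vars_, edges = build_bn_structure(root)
assert vars_ == {'X1', 'X2'}
assert len(edges) == 2       # the single hidden variable points to X1 and X2
```

The resulting edge set is exactly the directed bipartite structure described above: every edge runs from a hidden variable to an observable one.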
Algorithm 4 Build CPD using ADD, observable variable
Input: normal SPN S, variable X
Output: ADD A_X
 1: if an ADD has already been created for S and X then
 2:   A_X ← retrieve the ADD from the cache
 3: else
 4:   R ← root of S
 5:   if R is a terminal node then
 6:     A_X ← decision stump rooted at R
 7:   else if R is a sum node then
 8:     Create a node H_R in A_X
 9:     for each R_i ∈ Ch(R) do
10:      Link BuildADD(S_{R_i}, X) as the i-th child of H_R
11:    end for
12:  else if R is a product node then
13:    Find the child S_{R_i} such that X ∈ scope(R_i)
14:    A_X ← BuildADD(S_{R_i}, X)
15:  end if
16:  Store A_X in the cache
17: end if

Algorithm 5 Build CPD using ADD, hidden variable
Input: normal SPN S, variable H
Output: ADD A_H
 1: Find the sum node H in S
 2: A_H ← decision stump rooted at H in S

For each hidden variable H, Alg. 5 builds A_H as a decision stump (a decision tree with one variable) obtained by finding H and its associated weights in S. Now consider the ADDs built by Alg. 4 for the observable variables, and let X be the current observable variable under consideration. Essentially, Alg. 4 is a recursive algorithm applied to each node in S whose scope intersects {X}. There are three cases. If the current node is a terminal node, then it must be a probability distribution over X; in this case we simply return the decision stump at the current node. If the current node R is a sum node, then by the completeness of S all the children of R share the same scope as R. We first create a node H_R, corresponding to the hidden variable associated with R, in A_X (Line 8), then recursively apply Alg. 4 to all the children of R and link the results to H_R (Lines 9-11). If the current node is a product node, then by the decomposability of S there is a unique child of R whose scope intersects {X}. We recursively apply Alg. 4 to this child and return the resulting ADD (Lines 12-15). Equivalently, Alg.
4 can be understood in the following way: we extract the sub-SPN induced by {X} and contract all the product nodes in it to obtain A_X. (In graph theory, the contraction of a node v in a DAG is the operation that connects each parent of v to each child of v and then deletes v from the graph.) Note that the contraction of product nodes will not add more edges to A_X, since the out-degree of each product node in the induced sub-SPN must be 1, due to the decomposability of the product node. We illustrate the application of Alg. 3, 4 and 5 on the normal SPN in Fig. 4, which results in the BN B with CPDs represented by ADDs shown in Fig. 5. We now show that Pr_S(x) = Pr_B(x), ∀ x.

Lemma 11. Given a normal SPN S, the ADDs constructed by Alg. 4 and 5 encode the local CPDs at each node of B.

Proof. It is easy to verify that for each hidden variable H in B, A_H represents a local CPD, since A_H is a decision stump with normalized weights. For any observable variable X in B, let Pa(X) be the set of parents of X. By Alg. 3, every node in Pa(X) is a hidden variable. Furthermore, for every H, we have H ∈ Pa(X) iff there exists a terminal node over X in S that appears in the sub-SPN rooted at H. Hence, given any joint assignment h of Pa(X), there is a path in A_X from the root to a terminal node that is consistent with the joint assignment of the parents. Also, the leaves of A_X contain normalized weights corresponding to the probabilities of X (see Def. 7), induced by the creation of decision stumps over X in Lines 5-6 of Alg. 4.

Theorem 12. For any normal SPN S over X_{1:N}, the BN B constructed by Alg. 3, 4 and 5 encodes the same probability distribution, i.e., Pr_S(x) = Pr_B(x), ∀ x.

Proof. Again, we give a proof by induction on the height of S.

• Base case. The height of SPN S is 0.
In this case, S is a single terminal node over some X, and B is a single observable node with the decision stump A_X constructed from that terminal node by Lines 5-6 of Alg. 4. It is clear that Pr_S(x) = Pr_B(x), ∀ x.

• Induction step. Assume Pr_B(x) = Pr_S(x), ∀ x, for any S of height ≤ k, where B is the corresponding BN constructed by Alg. 3, 4 and 5 from S. Consider an SPN S of height k+1. Let R be the root of S and let R_i, i ∈ 1:l, be the children of R in S. We consider the following two cases:

  – R is a product node. Let scope(R_t) = X_t, t ∈ 1:l. Claim: there is no edge between S_i and S_j, i ≠ j, where S_i (S_j) is the sub-SPN rooted at R_i (R_j). If there were such an edge, say from v_j to v_i with v_j ∈ S_j and v_i ∈ S_i, then scope(v_i) ⊆ scope(v_j) ⊆ scope(R_j). On the other hand, scope(v_i) ⊆ scope(R_i). So we would have ∅ ≠ scope(v_i) ⊆ scope(R_i) ∩ scope(R_j), which contradicts the decomposability of the product node R. Hence the constructed BN B is a forest of l disconnected components, and each component B_t corresponds to the sub-SPN S_t rooted at R_t, ∀ t ∈ 1:l, each of height ≤ k. By the induction hypothesis we have Pr_{B_t}(x_t) = Pr_{S_t}(x_t), ∀ t ∈ 1:l.

Figure 5. Construction of a BN with CPDs represented by ADDs from an SPN. On the left, the induced sub-SPNs used to create A_{X_1} and A_{X_2} by Alg. 4 are indicated in blue and green, respectively. The decision stump used to create A_H by Alg. 5 is indicated in red.
Considering the whole BN B, we have:

$$\Pr_B(x) = \prod_t \Pr_{B_t}(x_t) = \prod_t \Pr_{S_t}(x_t) = \Pr_S(x) \qquad (22)$$

where the first equality is due to the d-separation rule in BNs, noting that each component B_t is disconnected from all the other components. The second equality follows from the induction hypothesis. The last equality follows from the definition of a product node.

  – R is a sum node. In this case, due to the completeness of S, all the children of R share the same scope as R. By the construction process presented in Alg. 3, 4 and 5, there is a hidden variable H corresponding to R that takes l different values in B. Let w_{1:l} be the weights of the edges emanating from R in S. For the t-th branch of R, we use H_t to denote the set of hidden variables in B that also appear in B_t, and let H_{-t} = H \ H_t, where H is the set of all hidden variables in B except H. First, we show the following identity:

$$\Pr_B(x \mid H = h_t) = \sum_{h_t} \sum_{h_{-t}} \Pr_B(x, h_t, h_{-t} \mid H = h_t) \qquad (23)$$
$$= \sum_{h_t} \sum_{h_{-t}} \Pr_B(x, h_t \mid H = h_t, h_{-t}) \Pr_B(h_{-t} \mid H = h_t) \qquad (24)$$
$$= \sum_{h_t} \sum_{h_{-t}} \Pr_B(x, h_t \mid H = h_t) \Pr_B(h_{-t} \mid H = h_t) \qquad (25)$$
$$= \sum_{h_t} \Pr_B(x, h_t \mid H = h_t) \sum_{h_{-t}} \Pr_B(h_{-t} \mid H = h_t) \qquad (26)$$
$$= \sum_{h_t} \Pr_B(x, h_t \mid H = h_t) \qquad (27)$$
$$= \sum_{h_t} \Pr_{B_t}(x, h_t) = \Pr_{B_t}(x) \qquad (28)$$

Using this identity, we have

$$\Pr_B(x) = \sum_{t=1}^{l} \Pr_B(h_t) \Pr_B(x \mid H = h_t) \qquad (29)$$
$$= \sum_{t=1}^{l} w_t \Pr_{B_t}(x) \qquad (30)$$
$$= \sum_{t=1}^{l} w_t \Pr_{S_t}(x) \qquad (31)$$
$$= \Pr_S(x) \qquad (32)$$

Eq. 25 follows from the fact that X and H_t are independent of H_{-t} given H = h_t, i.e., we take advantage of the CSI described by the ADDs of X. Eq. 26 follows from the fact that H_{-t} appears only in the second term. Combined with the fact that H = h_t is given as evidence in B, this gives us the induced subgraph B_t referred to in Eq. 28. Eq. 30 follows from Eq. 28 and Eq.
31 follows from the induction hypothesis.

Combining the base case and the induction step completes the proof of Thm. 12.

We now bound the size of B:

Theorem 13. |B| = O(N|S|), where the BN B is constructed by Alg. 3, 4 and 5 from a normal SPN S over X_{1:N}.

Proof. For each observable variable X in B, A_X is constructed by first extracting from S the induced sub-SPN S_X containing all nodes whose scope includes X, and then contracting all the product nodes in S_X to obtain A_X. By decomposability, each product node in S_X has out-degree 1; otherwise the original SPN S would violate the decomposability property. Since contracting product nodes does not increase the number of edges in S_X, we have |A_X| ≤ |S_X| ≤ |S|. For each hidden variable H in B, A_H is a decision stump constructed from the internal sum node corresponding to H in S. Hence, we have Σ_H |A_H| ≤ |S|.

Now consider the size of the graph B. Note that only terminal nodes and sum nodes have corresponding variables in B. It is clear that the number of nodes in B is bounded by the number of nodes in S. Furthermore, a hidden variable H points to an observable variable X in B iff X appears in the sub-SPN rooted at H in S, i.e., there is a path from the sum node corresponding to H to one of the terminal nodes over X. For a sum node H (which corresponds to a hidden variable H ∈ B) with scope size s, each edge emanating from H in S corresponds to directed edges in B at most s times, since there are exactly s observable variables that are children of H in B. Clearly s ≤ N, so each edge emanating from a sum node in S is counted at most N times in B. Edges from product nodes do not occur in the graph of B; instead, they have been accounted for in the ADD representations of the local CPDs of B. So the size of the graph B is bounded by Σ_H scope(H) × deg(H) ≤ Σ_H N deg(H) ≤ 2N|S|.
There are N observable variables in B. So the total size of B, including the size of the graph and the sizes of all the ADDs, is bounded by N|S| + |S| + 2N|S| = O(N|S|).

We now give the time complexity of Alg. 3, 4 and 5.

Theorem 14. For any normal SPN S over X_{1:N}, Alg. 3, 4 and 5 construct an equivalent BN in time O(N|S|).

Proof. First consider Alg. 3. Alg. 3 recursively visits each node and its children in S if they have not been visited (Lines 6-10). For each node v in S, Lines 7-9 cost at most 2 · out-degree(v). If v is a sum node, then Lines 11-17 create a hidden variable and connect it to all observable variables that appear in the sub-SPN rooted at v, which is clearly bounded by the number of observable variables, N. So the total cost of Alg. 3 is bounded by Σ_v 2 · out-degree(v) + Σ_{v is a sum node} N ≤ 2V(S) + 2E(S) + N·V(S) ≤ 2|S| + N|S| = O(N|S|). Note that we assume inserting an element into a set can be done in O(1) using hashing. The analysis of Alg. 4 and 5 follows the same lines as the proof of Thm. 13. The time complexity of Alg. 4 and 5 is then bounded by N|S| + |S| = O(N|S|).

Proof of Thm. 1. The combination of Thm. 12, 13 and 14 proves Thm. 1.

Proof of Corollary 2. Given a complete and consistent SPN S, we can first transform it into a normal SPN S' with |S'| = O(|S|²) by Thm. 4, if it is not already normal. The analysis then follows from Thm. 1.

4.3. BN to SPN

It is known that a BN with CPDs represented by tables can be converted into an SPN by first converting the BN into a junction tree and then translating the junction tree into an SPN. The size of the generated SPN, however, will be exponential in the tree-width of the original BN, since the tabular representation of CPDs is ignorant of CSI.
As a result, the generated SPN loses the power to compactly represent BNs that have high tree-width but CSI in their local CPDs. Alternatively, one can compile a BN with ADDs into an AC (Chavira & Darwiche, 2007) and then convert the AC into an SPN (Rooshenas & Lowd, 2014). However, in Chavira & Darwiche (2007)'s compilation approach, the variables appearing along any path from the root to a leaf in each ADD must be consistent with a pre-defined global variable ordering. This global variable ordering may, to some extent, restrict the compactness of the ADDs, as the most compact representations of different ADDs normally have different topological orderings. Interested readers are referred to Chavira & Darwiche (2007) for more details on this topic.

In this section, we focus on BNs with ADDs that are constructed by Alg. 4 and 5 from normal SPNs. We show that when applying VE to those BNs with ADDs, we can recover the original normal SPNs. The key insight is that the structure of the original normal SPN naturally defines a global variable ordering that is consistent with the topological ordering of every ADD constructed. More specifically, since all the ADDs constructed by Alg. 4 are induced sub-SPNs of the original SPN S with product nodes contracted, the topological ordering of the nodes in S can be used as the pre-defined variable ordering for all the ADDs.
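Numerically, the recovery claim can be previewed end to end on a toy bipartite BN: one hidden variable H over three values pointing to X1 and X2. All parameters below are hypothetical, and plain probability tables stand in for the ADDs, so only the computed values, not the structure sharing, are reproduced:

```python
# Toy bipartite BN: Pr(H), Pr(X1 | H), Pr(X2 | H); hypothetical numbers.
wH  = [0.5, 0.2, 0.3]
aX1 = [[0.6, 0.4], [0.6, 0.4], [0.9, 0.1]]
aX2 = [[0.3, 0.7], [0.2, 0.8], [0.2, 0.8]]

# VE, eliminating H: multiply the two factors that mention H, then sum
# H out with the weights of Pr(H). The resulting table is exactly the
# network polynomial of a sum node over three product nodes.
prod = {(h, x1, x2): aX1[h][x1] * aX2[h][x2]
        for h in range(3) for x1 in range(2) for x2 in range(2)}
recovered = {(x1, x2): sum(wH[h] * prod[h, x1, x2] for h in range(3))
             for x1 in range(2) for x2 in range(2)}

# Sanity checks: a proper distribution, equal to direct enumeration.
assert abs(sum(recovered.values()) - 1.0) < 1e-12
for x1 in range(2):
    for x2 in range(2):
        direct = sum(wH[h] * aX1[h][x1] * aX2[h][x2] for h in range(3))
        assert abs(recovered[x1, x2] - direct) < 1e-12
```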
Algorithm 6 Multiplication of two symbolic ADDs, ⊗
Input: symbolic ADDs A_{X_1}, A_{X_2}
Output: symbolic ADD A_{X_1,X_2} = A_{X_1} ⊗ A_{X_2}
 1: R_1 ← root of A_{X_1}, R_2 ← root of A_{X_2}
 2: if R_1 and R_2 are both variable nodes then
 3:   if R_1 = R_2 then
 4:     Create a node R = R_1 in A_{X_1,X_2}
 5:     for each r ∈ dom(R) do
 6:       A^r_{X_1} ← Ch(R_1)|_r
 7:       A^r_{X_2} ← Ch(R_2)|_r
 8:       A^r_{X_1,X_2} ← A^r_{X_1} ⊗ A^r_{X_2}
 9:       Link A^r_{X_1,X_2} as the r-th child of R in A_{X_1,X_2}
10:    end for
11:  else
12:    A_{X_1,X_2} ← create a symbolic node ⊗
13:    Link A_{X_1} and A_{X_2} as the two children of ⊗
14:  end if
15: else if R_1 is a variable node and R_2 is ⊗ then
16:  if R_1 appears as a child of R_2 then
17:    A_{X_1,X_2} ← A_{X_2}
18:    A^{R_1}_{X_1,X_2} ← A_{X_1} ⊗ A^{R_1}_{X_2}
19:  else
20:    Link A_{X_1} as a new child of R_2
21:    A_{X_1,X_2} ← A_{X_2}
22:  end if
23: else if R_1 is ⊗ and R_2 is a variable node then
24:  if R_2 appears as a child of R_1 then
25:    A_{X_1,X_2} ← A_{X_1}
26:    A^{R_2}_{X_1,X_2} ← A_{X_2} ⊗ A^{R_2}_{X_1}
27:  else
28:    Link A_{X_2} as a new child of R_1
29:    A_{X_1,X_2} ← A_{X_1}
30:  end if
31: else
32:  A_{X_1,X_2} ← create a symbolic node ⊗
33:  Link A_{X_1} and A_{X_2} as the two children of ⊗
34: end if
35: Merge connected product nodes in A_{X_1,X_2}

Algorithm 7 Summing out a hidden variable H from A using A_H, ⊕
Input: symbolic ADDs A and A_H
Output: symbolic ADD with H summed out
 1: if H appears in A then
 2:   Label each edge emanating from H with the weight obtained from A_H
 3:   Replace H by a symbolic ⊕ node
 4: end if

In order to apply VE to a BN with ADDs, we need to show how to apply the two common operations used in VE, i.e., multiplication of two factors and summing out a hidden variable, to ADDs. For this purpose, we use a symbolic ADD as an intermediate representation during the VE inference process, by allowing symbolic operations such as +, −, ×, / to appear as internal nodes of ADDs. In this sense, an ADD can be viewed as a special type of symbolic ADD in which all the internal nodes are variables.
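The semantics of the two operations (though not the structure sharing that makes symbolic ADDs compact) can be mimicked with flat factor tables; in the sketch below, `multiply` plays the role of ⊗ and `sum_out` the role of ⊕, with the weights of A_H folded in by multiplying a Pr(H) factor first. The dict-based representation is invented for illustration:

```python
# Factors as dicts: a key is a sorted tuple of (variable, value) pairs.

def multiply(f1, f2):
    """Analogue of ⊗ (Alg. 6): pointwise product of two factors, merging
    assignments that agree on their shared variables."""
    out = {}
    for a1, v1 in f1.items():
        for a2, v2 in f2.items():
            d1, d2 = dict(a1), dict(a2)
            if all(d1[k] == d2[k] for k in d1.keys() & d2.keys()):
                out[tuple(sorted({**d1, **d2}.items()))] = v1 * v2
    return out

def sum_out(f, var):
    """Analogue of ⊕ (Alg. 7): marginalize `var` out of a factor."""
    out = {}
    for a, v in f.items():
        rest = tuple(kv for kv in a if kv[0] != var)
        out[rest] = out.get(rest, 0.0) + v
    return out

# Usage: fold Pr(H) into Pr(X1 | H), then sum H out to get Pr(X1).
pH  = {(('H', 0),): 0.4, (('H', 1),): 0.6}
pX1 = {(('H', 0), ('X1', 0)): 0.3, (('H', 0), ('X1', 1)): 0.7,
       (('H', 1), ('X1', 0)): 0.8, (('H', 1), ('X1', 1)): 0.2}
marg = sum_out(multiply(pH, pX1), 'H')
assert abs(marg[(('X1', 0),)] - 0.6) < 1e-12   # 0.4*0.3 + 0.6*0.8
assert abs(marg[(('X1', 1),)] - 0.4) < 1e-12   # 0.4*0.7 + 0.6*0.2
```

Unlike the table view, the symbolic ADD operations below preserve shared sub-structure, which is what keeps the recovered SPN linear in size.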
The same trick was applied by Chavira & Darwiche (2007) in their compilation approach. For example, given symbolic ADDs A_{X_1} over X_1 and A_{X_2} over X_2, Alg. 6 returns a symbolic ADD A_{X_1,X_2} over X_1, X_2 such that

$$A_{X_1,X_2}(x_1, x_2) \triangleq (A_{X_1} \otimes A_{X_2})(x_1, x_2) = A_{X_1}(x_1) \times A_{X_2}(x_2).$$

To simplify the presentation, we choose the inverse topological ordering of the hidden variables in the original SPN S as the elimination order used in VE. This helps to avoid situations where a multiplication is applied to a sum node in symbolic ADDs. Other elimination orders could be used, but a more detailed discussion of sum nodes would be needed.

Given two symbolic ADDs A_{X_1} and A_{X_2}, Alg. 6 recursively visits nodes in A_{X_1} and A_{X_2} simultaneously. In general, there are 3 cases: 1) the roots of A_{X_1} and A_{X_2} are both variable nodes (Lines 2-14); 2) one of the two roots is a variable node and the other is a product node (Lines 15-30); 3) both roots are product nodes, or at least one of them is a sum node (Lines 31-34). We discuss these 3 cases in turn.

If both roots of A_{X_1} and A_{X_2} are variable nodes, there are two subcases to consider. First, if they are nodes labeled with the same variable (Lines 3-10), then the computation related to the common variable is shared and the multiplication is recursively applied to all the children; otherwise we simply create a symbolic product node ⊗ and link A_{X_1} and A_{X_2} as its two children (Lines 11-14). Once we find R_1 ∈ A_{X_1} and R_2 ∈ A_{X_2} such that R_1 ≠ R_2, there is no common node shared by the sub-ADDs rooted at R_1 and R_2. To see this, note that Alg. 6 recursively calls itself as long as the roots of A_{X_1} and A_{X_2} are labeled with the same variable. Let R be the last variable shared by the roots of A_{X_1} and A_{X_2} in Alg. 6. Then R_1 and R_2 must be children of R in the original SPN S.
Since R_1 does not appear in A_{X_2}, we have X_2 ∉ scope(R_1); otherwise R_1 would occur in A_{X_2} and would be a new shared variable below R, contradicting the fact that R is the last shared variable. Since R_1 is the root of the sub-ADD of A_{X_1} rooted at R, no variable whose scope contains X_2 can occur as a descendant of R_1, otherwise the scope of R_1 would also contain X_2, which is again a contradiction. On the other hand, each node appearing in A_{X_2} corresponds to a variable whose scope intersects {X_2} in the original SPN, hence no node in A_{X_2} appears in A_{X_1}. The same analysis applies to R_2. Hence no node is shared between A_{X_1} and A_{X_2}.

If one of the two roots, say R_1, is a variable node and the other root, say R_2, is a product node, then we consider two subcases. If R_1 appears as a child of R_2, then we recursively multiply R_1 with the child of R_2 that is labeled with the same variable as R_1 (Lines 16-18). If R_1 does not appear as a child of R_2, then we link the ADD rooted at R_1 as a new child of the product node R_2 (Lines 19-22). Again, let R be the last shared node between A_{X_1} and A_{X_2} during the multiplication process. Then both R_1 and R_2 are children of R, which corresponds to a sum node in the original SPN S. Furthermore, both R_1 and R_2 lie in the same branch of R in S. In this case, since scope(R_1) ⊆ scope(R), scope(R_1) must be a strict subset of scope(R); otherwise we would have scope(R_1) = scope(R) and R_1 would also appear in A_{X_2}, contradicting the fact that R is the last shared node between A_{X_1} and A_{X_2}. Hence we only need to discuss the two cases where either the scopes are disjoint (Lines 16-18) or the scope of one root is a strict subset of the other (Lines 19-22).
If the two roots are both product nodes, or at least one of them is a sum node, then we simply create a new product node and link A_{X_1} and A_{X_2} as its children. The above analysis also applies here, since sum nodes in a symbolic ADD are created by summing out already-processed variable nodes, and we eliminate all the hidden variables using the inverse topological ordering.

The last step of Alg. 6 (Line 35) simplifies the symbolic ADD by merging all connected product nodes without changing the function it encodes. This can be done in the following way: suppose ⊗_1 and ⊗_2 are two connected product nodes in a symbolic ADD A, where ⊗_1 is the parent of ⊗_2; then we remove the link between ⊗_1 and ⊗_2 and connect ⊗_1 to every child of ⊗_2. It is easy to verify that this operation removes links between connected product nodes while keeping the encoded function unchanged.

To sum out one hidden variable H, Alg. 7 simply replaces H in A by a symbolic sum node ⊕ and labels each edge of ⊕ with the weight obtained from A_H.

We now present the Variable Elimination (VE) algorithm used to recover the original SPN S in Alg. 8, taking Alg. 6 and Alg. 7 as the two operations ⊗ and ⊕, respectively.

Algorithm 8 Variable Elimination for a BN with ADDs
Input: BN B with ADDs for all observable variables and hidden variables
Output: original SPN S
 1: π ← the inverse topological ordering of all the hidden variables present in the ADDs
 2: Φ ← {A_X | X is an observable variable}
 3: for each hidden variable H in π do
 4:   P ← {A_X | H appears in A_X}
 5:   Φ ← Φ \ P ∪ {⊕_H ⊗_{A ∈ P} A}
 6: end for
 7: return Φ

In each iteration of Alg. 8, we select one hidden variable H in the ordering π, multiply all the ADDs A_X in which H appears using Alg. 6, and then sum out H using Alg. 7. The algorithm keeps going until all the hidden variables have
The final symbolic ADD gives us the SPN S, which can be used to build the BN B. Note that the SPN returned by Alg. 8 may not be literally equal to the original SPN, since during the multiplication of two symbolic ADDs we effectively remove redundant nodes by merging connected product nodes. Hence, the SPN returned by Alg. 8 could have a smaller size while representing the same probability distribution. An example is given in Fig. 6 to illustrate the recovery process. The BN in Fig. 6 is the one constructed in Fig. 5.

Note that Alg. 6 and 7 apply only to ADDs constructed from normal SPNs by Alg. 4 and 5, because such ADDs naturally inherit the topological ordering of sum nodes (hidden variables) in the original SPN S. Otherwise we need to pre-define a global variable ordering of all the sum nodes and then arrange each ADD so that its topological ordering is consistent with the pre-defined ordering. Note also that Alg. 6 and 7 should be implemented with caching of repeated operations in order to ensure that directed acyclic graphs are preserved. Alg. 8 suggests that an SPN can also be viewed as a history record or caching of the sums and products computed during inference when applied to the resulting BN with ADDs.

We now bound the run time of Alg. 8.

Theorem 15. Alg. 8 builds SPN S from BN B with ADDs in O(N|S|).

Proof. First, it is easy to verify that Alg. 6 takes at most |A_{X_1}| + |A_{X_2}| operations to compute the multiplication of A_{X_1} and A_{X_2}. More importantly, the size of the generated A_{X_1,X_2} is also bounded by |S|. This is because all the common nodes and edges in A_{X_1} and A_{X_2} are shared (not duplicated) in A_{X_1,X_2}. Also, all the other nodes and edges which are not shared between A_{X_1} and A_{X_2} will be in two branches of a product node in S; otherwise they would be shared by A_{X_1} and A_{X_2}, as they would have the same scope containing both X_1 and X_2.
This means that A_{X_1,X_2} can be viewed as a sub-SPN of S induced by the node set {X_1, X_2} with some product nodes contracted out. So we have |A_{X_1,X_2}| ≤ |S|.

Figure 6. Multiply A_{X_1} and A_{X_2} that contain H using Alg. 6 and then sum out H by applying Alg. 7. The final SPN is isomorphic to the SPN in Fig. 5.

Now consider the for loop (Lines 3-6) in Alg. 8. The loop ends once we have summed out all the hidden variables and there is only one ADD left. Note that there may be only one ADD in Φ during some intermediate steps, in which case we do not have to do any multiplication. In such steps, we only need to perform the sum-out procedure without multiplying ADDs. Since there are N ADDs at the beginning of the loop and only one ADD after it, there are exactly N − 1 multiplications during the for loop, which cost at most (N − 1)|S| operations. Furthermore, in each iteration exactly one hidden variable is summed out. So the total cost of summing out all the hidden variables in Lines 3-6 is bounded by |S|. Overall, the operations in Alg. 8 are bounded by (N − 1)|S| + |S| = O(N|S|).

Proof of Thm. 3. Thm. 15 and the analysis above prove Thm. 3.

5. Discussion

Thm. 1 together with Thm. 3 establishes a relationship between BNs and SPNs: SPNs are no more powerful than BNs with ADD representation.
Informally, a model is considered to be more powerful than another if there exists a distribution that it can encode in polynomial size in some input parameter N, while the other model requires exponential size in N to represent the same distribution. The key is to recognize that the CSI encoded by the structure of an SPN, as stated in Proposition 21, can also be encoded explicitly with ADDs in a BN. We can also view an SPN as an inference machine that efficiently records the history of the inference process when applied to a BN. Based on this perspective, an SPN is actually storing the calculations to be performed (sums and products), which allows online inference queries to be answered quickly. The same idea also exists in other fields, including propositional logic (d-DNNF) and knowledge compilation (AC).

The constructed BN has a simple bipartite structure, no matter how deep the original SPN is. However, we can relate the depth of an SPN to a lower bound on the tree-width of the corresponding BN obtained by our algorithm. Without loss of generality, assume that product layers alternate with sum layers in the SPN under consideration. Let the height of the SPN, i.e., the length of the longest path from the root to a terminal node, be K. By our assumption, there will be at least ⌊K/2⌋ sum nodes on the longest path. Accordingly, in the BN constructed by Alg. 3, the observable variable corresponding to the terminal node on the longest path will have in-degree at least ⌊K/2⌋. Hence, after moralizing the BN into an undirected graph, the clique size of the moral graph is bounded below by ⌊K/2⌋ + 1. Note that for any undirected graph the clique size minus 1 is always a lower bound on the tree-width. We thus conclude that the tree-width of the constructed BN has a lower bound of ⌊K/2⌋.
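As a sanity check, the counting argument above can be traced step by step in code (a hypothetical helper, not from the paper):

```python
def treewidth_lower_bound(K: int) -> int:
    """Lower bound floor(K/2) on the tree-width of the constructed BN,
    for an SPN of height K with alternating sum and product layers."""
    num_sum_nodes = K // 2        # at least floor(K/2) sum nodes on the longest path
    in_degree = num_sum_nodes     # in-degree of the deepest observable variable
    clique_size = in_degree + 1   # moralization connects all its parents pairwise
    return clique_size - 1        # tree-width >= clique size - 1

# e.g. an SPN of height 7 yields a BN whose tree-width is at least 3
```

Each intermediate quantity matches one step of the derivation: sum nodes on the longest path, in-degree after the construction of Alg. 3, clique size after moralization, and finally the tree-width bound.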
In other words, the deeper the SPN, the larger the tree-width of the BN constructed by our algorithm and the more complex the probability distributions that can be encoded. This observation is consistent with the conclusion drawn in (Delalleau & Bengio, 2011), where the authors prove that there exist families of distributions that can be represented much more efficiently with a deep SPN than with a shallow one, i.e., with substantially fewer hidden internal sum nodes. Note that we only give a proof that there exists an algorithm that can convert an SPN into a BN without any exponential blow-up. There may exist other techniques to convert an SPN into a BN with a more compact representation and also a smaller tree-width.

High tree-width is usually taken to indicate high inference complexity, but this is not always true, as there may exist a lot of CSI between variables, which can reduce inference complexity. CSI is precisely what enables SPNs and BNs with ADDs to compactly represent, and tractably perform inference in, distributions with high tree-width. In contrast, in a Restricted Boltzmann Machine, which is an undirected bipartite Markov network, CSI may not be present or not exploited, which is why practitioners have to resort to approximate algorithms such as contrastive divergence (Carreira-Perpinan & Hinton, 2005). Similarly, approximate inference is required in bipartite diagnostic BNs such as the Quick Medical Reference network (Shwe et al., 1991), since causal independence is insufficient to reduce the complexity, while CSI is not present or not exploited.

6. Conclusion

In this paper, we establish a precise connection between BNs and SPNs by providing a constructive algorithm to transform between these two models.
To simplify the proof, we introduce the notion of normal SPN and describe the relationship between consistency and decomposability in SPNs. We analyze the impact of the depth of SPNs on the tree-width of the corresponding BNs. Our work also provides a new direction for future research on SPNs and BNs. Structure and parameter learning algorithms for SPNs can now be used to indirectly learn BNs with ADDs. In the resulting BNs, correlations are expressed not by links directly between observed variables, but rather through hidden variables that are ancestors of correlated observed variables. The structure of the resulting BNs can be used to study probabilistic dependencies and causal relationships between the variables of the original SPNs. It would also be interesting to explore the opposite direction, since there is already a large literature on parameter and structure learning for BNs. One could learn a BN from data and then exploit CSI to convert it into an SPN.

References

Amer, Mohamed R and Todorovic, Sinisa. Sum-product networks for modeling activities with stochastic structure. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 1314–1321. IEEE, 2012.

Bacchus, Fahiem, Dalmao, Shannon, and Pitassi, Toniann. Algorithms and complexity results for #SAT and Bayesian inference. In Foundations of Computer Science, 2003. Proceedings. 44th Annual IEEE Symposium on, pp. 340–351. IEEE, 2003.

Bahar, R Iris, Frohm, Erica A, Gaona, Charles M, Hachtel, Gary D, Macii, Enrico, Pardo, Abelardo, and Somenzi, Fabio. Algebraic decision diagrams and their applications. Formal Methods in System Design, 10(2-3):171–206, 1997.

Birnbaum, Elazar and Lozinskii, Eliezer L. The good old Davis-Putnam procedure helps counting models. arXiv preprint arXiv:1106.0218, 2011.

Boole, George. The Mathematical Analysis of Logic. Philosophical Library, 1847.
Boutilier, Craig, Friedman, Nir, Goldszmidt, Moises, and Koller, Daphne. Context-specific independence in Bayesian networks. In Proceedings of the Twelfth International Conference on Uncertainty in Artificial Intelligence, pp. 115–123. Morgan Kaufmann Publishers Inc., 1996.

Bryant, Randal E. Graph-based algorithms for Boolean function manipulation. Computers, IEEE Transactions on, 100(8):677–691, 1986.

Carreira-Perpinan, Miguel A and Hinton, Geoffrey E. On contrastive divergence learning. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, pp. 33–40. Citeseer, 2005.

Chavira, Mark and Darwiche, Adnan. Compiling Bayesian networks using variable elimination. In IJCAI, pp. 2443–2449, 2007.

Chavira, Mark, Darwiche, Adnan, and Jaeger, Manfred. Compiling relational Bayesian networks for exact inference. International Journal of Approximate Reasoning, 42(1):4–20, 2006.

Cheng, Wei-Chen, Kok, Stanley, Pham, Hoai Vu, Chieu, Hai Leong, and Chai, Kian Ming A. Language modeling with sum-product networks. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.

Darwiche, Adnan. A differential approach to inference in Bayesian networks. In UAI, pp. 123–132, 2000.

Darwiche, Adnan. Decomposable negation normal form. Journal of the ACM (JACM), 48(4):608–647, 2001.

Darwiche, Adnan and Marquis, Pierre. A perspective on knowledge compilation. In IJCAI, volume 1, pp. 175–182. Citeseer, 2001.

Delalleau, Olivier and Bengio, Yoshua. Shallow vs. deep sum-product networks. In Advances in Neural Information Processing Systems, pp. 666–674, 2011.

Dennis, Aaron and Ventura, Dan. Learning the architecture of sum-product networks using clustering on variables. In Advances in Neural Information Processing Systems, pp. 2042–2050, 2012.

Gens, Robert and Domingos, Pedro. Discriminative learning of sum-product networks.
In Advances in Neural Information Processing Systems, pp. 3248–3256, 2012.

Gens, Robert and Domingos, Pedro. Learning the structure of sum-product networks. In Proceedings of The 30th International Conference on Machine Learning, pp. 873–880, 2013.

Gomes, Carla P, Sabharwal, Ashish, and Selman, Bart. Model counting. 2008.

Huang, Jinbo, Chavira, Mark, and Darwiche, Adnan. Solving MAP exactly by searching on compiled arithmetic circuits. In AAAI, volume 6, pp. 3–7, 2006.

Pagallo, Giulia. Learning DNF by decision trees. In IJCAI, volume 89, pp. 639–644, 1989.

Peharz, Robert, Geiger, Bernhard C, and Pernkopf, Franz. Greedy part-wise learning of sum-product networks. In Machine Learning and Knowledge Discovery in Databases, pp. 612–627. Springer, 2013.

Peharz, Robert, Kapeller, Georg, Mowlaee, Pejman, and Pernkopf, Franz. Modeling speech with sum-product networks: Application to bandwidth extension. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pp. 3699–3703. IEEE, 2014.

Poon, Hoifung and Domingos, Pedro. Sum-product networks: A new deep architecture. In Proc. 12th Conf. on Uncertainty in Artificial Intelligence, pp. 2551–2558, 2011.

Rooshenas, Amirmohammad and Lowd, Daniel. Learning sum-product networks with direct and indirect variable interactions. In Proceedings of The 31st International Conference on Machine Learning, pp. 710–718, 2014.

Roth, Dan. On the hardness of approximate reasoning. Artificial Intelligence, 82(1):273–302, 1996.

Sang, Tian, Beame, Paul, and Kautz, Henry A. Performing Bayesian inference by weighted model counting. In AAAI, volume 5, pp. 475–481, 2005.

Shwe, Michael A, Middleton, B, Heckerman, DE, Henrion, M, Horvitz, EJ, Lehmann, HP, and Cooper, GF. Probabilistic diagnosis using a reformulation of the INTERNIST-1/QMR knowledge base.
Methods of Information in Medicine, 30(4):241–255, 1991.

Wainwright, Martin J and Jordan, Michael I. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.

Zhang, Nevin Lianwen and Poole, David. Exploiting causal independence in Bayesian network inference. Journal of Artificial Intelligence Research, 5:301–328, 1996.
