Learning Mixtures of DAG Models

Bo Thiesson, Christopher Meek, David Maxwell Chickering, and David Heckerman
Microsoft Research
Redmond WA, 98052-6399
{thiesson,meek,dmax,heckerma}@microsoft.com

Abstract

We describe computationally efficient methods for learning mixtures in which each component is a directed acyclic graphical model (mixtures of DAGs or MDAGs). We argue that simple search-and-score algorithms are infeasible for a variety of problems, and introduce a feasible approach in which parameter and structure search is interleaved and expected data is treated as real data. Our approach can be viewed as a combination of (1) the Cheeseman–Stutz asymptotic approximation for model posterior probability and (2) the Expectation–Maximization algorithm. We evaluate our procedure for selecting among MDAGs on synthetic and real examples.

1 Introduction

For almost a decade, statisticians and computer scientists have used directed acyclic graph (DAG) models for learning from data (e.g., Cooper & Herskovits, 1992; Spirtes, Glymour, & Scheines, 1993; Spiegelhalter, Dawid, Lauritzen, & Cowell, 1993; Buntine, 1994; and Heckerman, Geiger, & Chickering, 1995). In this paper, we consider mixtures of DAG models (MDAG models) and methods for choosing among models in this class. MDAG models generalize DAG models, and should more accurately model domains containing multiple distinct populations. In general, our hope is that the use of MDAG models will lead to better predictions and more accurate insights into causal relationships. In this paper, we concentrate on prediction.

We take a decidedly Bayesian perspective on the problem of learning MDAG models.
In principle, learning is straightforward: we compute the posterior probability of each model in the class given data and use this criterion to average over the models or to select one or more models. From a computational perspective, however, learning is extremely difficult. One problem is that the number of possible model structures grows super-exponentially with the number of random variables for the domain. A second problem is that all available methods for computing the posterior probability of an MDAG model, including Monte-Carlo and large-sample approximations, are slow. In combination, these problems make simple search-and-score learning algorithms intractable for MDAG models.

In this paper, we introduce a heuristic method for MDAG model selection that addresses both of these difficulties. The method is not guaranteed to find the MDAG model with the highest probability, but experiments that we present suggest that it often identifies a good one. Our approach handles missing data and component DAG models that contain hidden or latent variables. Our approach can be used to learn DAG models (single-component MDAG models) from incomplete data as well.

2 Multi-DAG models and mixtures of DAG models

In this section, we describe DAG, multi-DAG, and MDAG models. First, however, let us introduce some notation. We denote a random variable by an upper-case letter (e.g., X, Y, X_i, Θ), and the value of a corresponding random variable by that same letter in lower case (e.g., x, y, x_i, θ). When X is discrete, we use |X| to denote the number of values of X, and sometimes refer to a value of X as a state. We denote a set of random variables by a bold-face capitalized letter or letters (e.g., X, Y, Pa_i).
We use a corresponding bold-face lower-case letter or letters (e.g., x, y, pa_i) to denote an assignment of value to each random variable in a given set. When X = x we say that X is in configuration x. We use p(X = x | Y = y) (or p(x | y) as a shorthand) to denote the probability or probability density that X = x given Y = y. We also use p(x | y) to denote the probability distribution (both mass functions and density functions) for X given Y = y. Whether p(x | y) refers to a probability, a probability density, or a probability distribution should be clear from context.

Suppose our problem domain consists of random variables X = (X_1, ..., X_n). A DAG model for X is a graphical factorization of the joint probability distribution of X. The model consists of two components: a structure and a set of local distribution families. The structure b for X is a directed acyclic graph that represents conditional-independence assertions through a factorization of the joint distribution for X:

    p(x) = ∏_{i=1}^n p(x_i | pa_i^(b))    (1)

where pa_i^(b) is the configuration of the parents of X_i in structure b consistent with x. The local distribution families associated with the DAG model are those in Equation 1. In this discussion, we assume that the local distribution families are parametric. Using θ_b to denote the collective parameters for all local distributions, we rewrite Equation 1 as

    p(x | θ_b) = ∏_{i=1}^n p(x_i | pa_i^(b), θ_b)    (2)

With one exception to be discussed in Section 6, the parametric family corresponding to the variable X_i will be determined by (1) whether X_i is discrete or continuous and (2) the model structure. Consequently, we suppress the parametric family in our notation, and refer to the DAG model simply by its structure b.
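As a concrete illustration of the factorization in Equation 1, the joint probability of a configuration is simply a product of one local-distribution lookup per node. The following sketch is not from the paper; the two-node network X1 → X2 and its numbers are invented for illustration:

```python
# Joint probability of a discrete DAG model via the factorization of
# Equation 1: p(x) = prod_i p(x_i | pa_i).  Toy network X1 -> X2 with
# made-up conditional probability tables (all numbers are illustrative).

cpts = {
    # p(x1): no parents, so the parent configuration is the empty tuple
    "X1": {(): {0: 0.6, 1: 0.4}},
    # p(x2 | x1): one table per configuration of the parent X1
    "X2": {(0,): {0: 0.9, 1: 0.1},
           (1,): {0: 0.2, 1: 0.8}},
}
parents = {"X1": [], "X2": ["X1"]}

def joint(assignment):
    """p(x) = product over nodes of p(x_i | pa_i), as in Equation 1."""
    p = 1.0
    for var, tables in cpts.items():
        pa_config = tuple(assignment[pa] for pa in parents[var])
        p *= tables[pa_config][assignment[var]]
    return p

print(joint({"X1": 0, "X2": 1}))  # 0.6 * 0.1
```

Because the factorization touches each node once, the joint over any configuration is computed in time linear in the number of nodes, regardless of the graph's density.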
Let b^h denote the assertion or hypothesis that the "true" joint distribution can be represented by the DAG model b and has precisely the conditional-independence assertions implied by b. We find it useful to include the structure hypothesis explicitly in the factorization of the joint distribution when we compare model structures. In particular, we write

    p(x | θ_b, b^h) = ∏_{i=1}^n p(x_i | pa_i, θ_b, b^h)    (3)

This notation often makes it unnecessary to use the argument b in the term pa_i^(b), and we use the simpler expression where possible.

The structure of a DAG model encodes a limited form of conditional independence that we call context-non-specific conditional independence. In particular, if the structure implies that two sets of random variables Y and Z are independent given some configuration of random variables W, then Y and Z are also independent given every other configuration of W. In a more general form of conditional independence, two sets of random variables may be independent given one configuration of W, and dependent given another configuration of W.

A multi-DAG model, called a Bayesian multinet by Geiger & Heckerman (1996), is a generalization of the DAG model that can encode context-specific conditional independence. In particular, a multi-DAG model for X and distinguished random variable C is a set of component DAG models for X, each of which encodes the joint distribution for X given a state of C, and a distribution for C. Thus, the multi-DAG model for X and C encodes a joint distribution for X and C, and can encode context-specific conditional independence among these random variables, because the structure of each component DAG model may be different. Let s and θ_s denote the structure and parameters of a multi-DAG model for X and C.
In addition, let b_c and θ_c denote the structure and parameters of the c-th DAG-model component of the multi-DAG model. Also, let s^h denote the hypothesis that the "true" joint distribution for X and C can be represented by the MDAG model s and has precisely the conditional-independence assertions implied by s. Then, the joint distribution for X and C encoded by this multi-DAG model is given by

    p(c, x | θ_s, s^h) = p(c | θ_s, s^h) p(x | c, θ_s, s^h) = π_c p(x | θ_c, b_c^h)    (4)

where θ_s = (θ_1, ..., θ_{|C|}, π_1, ..., π_{|C|}) are the parameters of the multi-DAG model, π_c = p(c | θ_s, s^h), and b_c^h is a shorthand for the conjunction of the events s^h and C = c. As with DAG models, we sometimes use the structure alone to refer to the multi-DAG model.

In what follows, we assume that the distinguished random variable has a multinomial distribution. In addition, with one exception to be discussed in Section 6, we limit the structure of the component DAG models and the parametric families for the local distributions as follows. When X_i ∈ X is a discrete random variable, we require that every random variable in Pa_i (for every component model) also be discrete, and that the local distribution families for X_i be a set of multinomial distributions, one for each configuration of Pa_i. When X_i ∈ X is a continuous random variable, we require that the local distribution family for X_i be a set of linear regressions over X_i's continuous parents with Gaussian error, one regression for each configuration of X_i's discrete parents. Lauritzen (1992) refers to this set of restrictions as a conditional-Gaussian distribution for a DAG model.

In this paper, we concentrate on the special case where the distinguished random variable C is hidden.
In this situation, we are interested in the joint distribution for X, given by

    p(x | θ_s, s^h) = ∑_{c=1}^{|C|} π_c p(x | θ_c, b_c^h)    (5)

This joint distribution is a mixture of distributions determined by the component DAG models, and has mixture weights π_1, ..., π_{|C|}. Thus, when C is hidden, we say that the multi-DAG model for X and C is a mixture of DAG models (or MDAG model) for X.

An important subclass of DAG models is the Gaussian DAG model (e.g., Shachter & Kenley, 1989). In this subclass, the local distribution family for every random variable given its parents is a linear regression with Gaussian noise. It is well known that a Gaussian DAG model for X_1, ..., X_n uniquely determines a multivariate-Gaussian distribution for those random variables. In such a model, the structure of the DAG model (in part) determines the "shape" of the multivariate-Gaussian distribution. Thus, the MDAG model class includes mixtures of multivariate-Gaussian distributions in which each component may have a different shape.

3 Learning multi-DAG models

In this and the following two sections, we consider a Bayesian approach for learning multi-DAG models and MDAG models. Let us assume that our data is exchangeable so that we can reason as if the data is a random sample from a true joint distribution. In addition, let us assume that the true joint distribution for X is encoded by some multi-DAG model, and that we are uncertain about both its structure and parameters. We define a discrete random variable S^h whose states s^h correspond to the possible true model hypotheses, and encode our uncertainty about structure using the probability distribution p(s^h). In addition, for each model s, we define a continuous vector-valued random variable Θ_s, whose configurations θ_s correspond to the possible true parameters.
We encode our uncertainty about Θ_s using the probability density function p(θ_s | s^h).

Given a random sample d = (x_1, ..., x_N) from the true distribution for X, we compute the posterior distributions for each s^h and θ_s using Bayes' rule. We can use the model posterior probability for various forms of model comparison, including model averaging (e.g., Bernardo & Smith, 1994). In this work, we limit ourselves to the selection of a model with a high posterior probability. In what follows, we concentrate on model selection using the posterior model probability. To simplify the discussion, we assume that all possible model structures are equally likely, a priori, in which case our selection criterion is the marginal likelihood:

    p(d | s^h) = ∫ p(d | θ_s, s^h) p(θ_s | s^h) dθ_s    (6)

3.1 The marginal likelihood criterion

Consider a DAG model b that encodes a conditional-Gaussian distribution for X. Let Θ_i, i = 1, ..., n, denote the random variables corresponding to the parameters of the local distribution family for X_i. Buntine (1994) and Heckerman and Geiger (1995) have shown that, if (1) the parameters Θ_1, ..., Θ_n are mutually independent given b^h, (2) the parameter priors p(Θ_i | b^h) are conjugate for all i, and (3) the data d is complete for C and X, then the log marginal likelihood has a closed form that can be computed efficiently.

This observation extends to multi-DAG models. Let Θ_ic denote the set of random variables corresponding to the local distribution family of X_i in component c. Also, let Π denote the set of random variables (Π_1, ..., Π_{|C|-1}) corresponding to the mixture weights. If (1) Π, Θ_11, ..., Θ_n1, ..., Θ_1|C|, ..., Θ_n|C| are mutually independent given s^h, (2) the parameter priors p(Θ_ic | s^h) are conjugate for all i and c, and (3) the data d is complete, then the marginal likelihood p(d | s^h) has a closed form. In particular,

    log p(d | s^h) = log p(d_C) + ∑_{c=1}^{|C|} log p(d_{X,C=c} | b_c^h)    (7)

where d_C is the data restricted to the variable C, and d_{X,C=c} is the data restricted to the variables X and those cases in which C = c. The term p(d_C) is the marginal likelihood of a trivial DAG model having only a single discrete node C. The terms in the sum are log marginal likelihoods for the component DAG models of the multi-DAG. Hence, p(d | s^h) has a closed form.

3.2 Structure search

An important issue regarding model selection is the search for models (structures) with high posterior probabilities. Consider the problem of finding the DAG model with the highest marginal likelihood from the set of all models in which each node has no more than k parents. Chickering (1996) has shown that this problem is NP-hard for k > 1. It follows immediately that the problem of finding the multi-DAG model with the highest marginal likelihood from the set of all multi-DAGs in which each node in each component has no more than k parents is NP-hard. Consequently, researchers use heuristic search algorithms including greedy search, greedy search with restarts, best-first search, and Monte-Carlo methods.

One consolation is that various model-selection criteria, including log marginal likelihood (under the assumptions just described), are factorable.
We say that a criterion crit(s, d) for a multi-DAG structure s is factorable if it can be written as follows:

    crit(s, d) = f(d_C) + ∑_{c=1}^{|C|} ∑_{i=1}^n g_c(d_{X_i, Pa_i^c})    (8)

where d_C is the data restricted to the variable C, Pa_i^c are the parents of X_i in component c, d_{X_i, Pa_i^c} is the data restricted to the random variables X_i and Pa_i^c and to those cases in which C = c, and f and g_c are functions. When a criterion is factorable, search is more efficient for two reasons. One, the component DAG models have non-interacting subcriteria, so that we may search for a good DAG structure for each component separately. Two, as we search for a good structure in any one component, we need not reevaluate the criterion for the whole component. For example, in a greedy search for a good DAG structure, we iteratively transform the graph by choosing the transformation that improves the model criterion the most, until no such transformation is possible. Typical transformations include the removal, reversal, and addition of an arc (constrained so that the resulting graph is acyclic). Given a factorable criterion, we only need to reevaluate g_c for X_i if its parents have changed.

4 Learning MDAGs: A simple approach

When learning multi-DAG models given complete data, the marginal likelihood has a closed form. In contrast, when learning MDAGs, the assumption that data is complete does not hold, because the distinguished random variable C is hidden. When data is incomplete, no tractable closed form for the marginal likelihood is available. Nonetheless, we can approximate the marginal likelihood using either Monte-Carlo or large-sample methods (e.g., DiCiccio, Kass, Raftery, and Wasserman, 1995).
Thus, a straightforward class of algorithms for choosing an MDAG model is to search among structures as before (e.g., perform greedy search), using some approximation for the marginal likelihood. We shall refer to this class as simple search-and-score algorithms.

As we shall see, simple search-and-score algorithms for MDAG model selection are computationally infeasible in practice. Nonetheless, let us consider one approximation for the marginal likelihood that will help motivate a tractable class of algorithms that we consider in the next section. The approximation that we examine is a large-sample approximation first proposed by Cheeseman & Stutz (1995):

    p(d | s^h) ≈ p(d' | s^h) · p(d | θ̃_s, s^h) / p(d' | θ̃_s, s^h)    (9)

where d' is any completion of the data set d and θ̃_s is the MAP configuration of the parameters. The approximation is a heuristic one, but Chickering & Heckerman (1997) give an argument that it may perform well in practice.^1 Furthermore, they provide an empirical study, using multinomial mixtures, that shows the approximation to be quite good. In all experiments, it was at least as accurate and sometimes more accurate than the standard approximation obtained using Laplace's method (e.g., Tierney & Kadane, 1986).

An important idea behind the Cheeseman–Stutz approximation is that we treat data completed by the EM algorithm as if it were real data. This same idea underlies the M step of the EM algorithm. As we shall see in the next section, this idea also can be applied to structure search.

5 Learning MDAGs: A practical approach

Simple search-and-score algorithms for selecting MDAG models are inefficient for two reasons. One is that computing approximations for the marginal likelihood is slow (DiCiccio et al., 1995). Another is that these approximations do not factor.
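To make Equation 9 concrete, the sketch below computes the Cheeseman–Stutz score in log form for a toy two-component Bernoulli mixture, taking d' to be the data completed by EM (fractional counts). The single binary variable, the data, and the Beta(2,2)/Dirichlet(2,2) priors are illustrative assumptions for this sketch, not the paper's setup:

```python
from math import lgamma, log

data = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]   # toy binary sample (made up)
pi, p = [0.5, 0.5], [0.7, 0.2]          # initial mixture parameters

def responsibilities():
    """E step: posterior probability of each component for each case."""
    r = []
    for x in data:
        w = [pi[c] * (p[c] if x else 1 - p[c]) for c in (0, 1)]
        z = sum(w)
        r.append([wc / z for wc in w])
    return r

# EM to a (roughly) MAP configuration theta~ under Beta(2,2)/Dirichlet(2,2)
for _ in range(25):
    r = responsibilities()
    Nc = [sum(ri[c] for ri in r) for c in (0, 1)]
    hc = [sum(ri[c] * x for ri, x in zip(r, data)) for c in (0, 1)]
    pi = [(Nc[c] + 1.0) / (len(data) + 2.0) for c in (0, 1)]
    p = [(hc[c] + 1.0) / (Nc[c] + 2.0) for c in (0, 1)]

r = responsibilities()                   # completion d' at theta~
Nc = [sum(ri[c] for ri in r) for c in (0, 1)]
hc = [sum(ri[c] * x for ri, x in zip(r, data)) for c in (0, 1)]

def beta_marg(h, t, a=2.0, b=2.0):
    """Closed-form log marginal of (fractional) counts under Beta(a, b)."""
    return (lgamma(a + b) - lgamma(a + b + h + t)
            + lgamma(a + h) - lgamma(a) + lgamma(b + t) - lgamma(b))

# log p(d' | s^h): fractional counts plugged into the closed form
log_p_dprime = beta_marg(Nc[0], Nc[1]) + sum(
    beta_marg(hc[c], Nc[c] - hc[c]) for c in (0, 1))
# log p(d | theta~, s^h): observed-data likelihood at theta~
log_p_d = sum(log(sum(pi[c] * (p[c] if x else 1 - p[c]) for c in (0, 1)))
              for x in data)
# log p(d' | theta~, s^h): completed-data likelihood at theta~
log_p_dprime_theta = sum(
    Nc[c] * log(pi[c]) + hc[c] * log(p[c]) + (Nc[c] - hc[c]) * log(1 - p[c])
    for c in (0, 1))

cheeseman_stutz = log_p_dprime + log_p_d - log_p_dprime_theta
print(cheeseman_stutz)
```

Note that the second and third terms differ by the entropy of the posterior over the hidden component, so the correction to log p(d' | s^h) is always non-negative.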
Consequently, every time a transformation is applied to a structure during search, the entire structure may need to be rescored. In this section, we consider a heuristic approach that addresses both of these problems.

^1 Chickering & Heckerman (1997) discuss a version of the Cheeseman–Stutz approximation that has a correction for dimension.

Figure 1: A schematic of our approach for MDAG model selection:
1. Pick an initial structure.
2. Run EM for a while (parameter search).
3. Compute expected sufficient statistics of the complete model.
4. Search over structures, pretending that the expected sufficient statistics are real sufficient statistics of complete data.

The basic idea behind the approach is that we interleave parameter search with structure search. A schematic of this approach is shown in Figure 1. First, we choose some initial model and parameter values. Then, we perform several iterations of the EM algorithm to find fairly good values for the parameters of the structure. Next, we use these parameter values and the current model to compute expected sufficient statistics for a complete MDAG (one that encodes no conditional-independence facts). We call these statistics for the current model s, parameters θ_s, and data d the expected complete model sufficient statistics, and denote the quantity by ECMSS(d, θ_s, s). A detailed discussion of the computation of this quantity is given in the Appendix. Next, we treat these expected sufficient statistics as if they were sufficient statistics from a complete data set, and perform structure search. Because we pretend the data set is complete, the model scores have a closed form and are factorable, making structure search efficient. After structure search, we reestimate the parameters for the new structure to be the MAP parameters given the expected sufficient statistics.
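One pass of the Figure 1 procedure can be sketched end to end on a toy problem: two binary observables X1, X2 and a two-state hidden C. The candidate component structures (empty vs. X1 → X2), the data, and the Beta/Dirichlet smoothing are all invented for this sketch, and for brevity EM is run on an initial complete model only; later passes would repeat the cycle on the selected structures:

```python
from math import lgamma

cases = [(0, 0), (0, 0), (1, 1), (1, 1), (1, 1),
         (0, 1), (1, 0), (0, 0), (1, 1), (0, 0)]     # toy data
K = 2
pi = [0.5, 0.5]
# complete-model parameters: one joint table over (x1, x2) per component
theta = [{(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4},
         {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.4, (1, 1): 0.1}]

def e_step():
    """Responsibilities r[i][c] of each component for each case."""
    r = []
    for x in cases:
        w = [pi[c] * theta[c][x] for c in range(K)]
        z = sum(w)
        r.append([wc / z for wc in w])
    return r

# Steps 1-2: parameter search -- a few EM iterations on the complete model
for _ in range(10):
    r = e_step()
    Nc = [sum(ri[c] for ri in r) for c in range(K)]
    pi = [(Nc[c] + 1.0) / (len(cases) + K) for c in range(K)]
    for c in range(K):
        theta[c] = {xy: (sum(ri[c] for ri, x in zip(r, cases) if x == xy) + 0.5)
                    / (Nc[c] + 2.0) for xy in theta[c]}

# Step 3: ECMSS -- expected counts n_c(x1, x2) of the complete model
r = e_step()
ecmss = [{xy: sum(ri[c] for ri, x in zip(r, cases) if x == xy)
          for xy in theta[0]} for c in range(K)]

def fam(h, t, a=1.0, b=1.0):
    """Closed-form family score for (fractional) binary counts h, t."""
    return (lgamma(a + b) - lgamma(a + b + h + t)
            + lgamma(a + h) - lgamma(a) + lgamma(b + t) - lgamma(b))

# Step 4: per-component structure search, treating ECMSS as real counts.
# The score factors as in Equation 8, so each family is scored on its own.
structures = []
for c in range(K):
    n = ecmss[c]
    s_x1 = fam(n[(1, 0)] + n[(1, 1)], n[(0, 0)] + n[(0, 1)])          # X1 family
    empty = s_x1 + fam(n[(0, 1)] + n[(1, 1)], n[(0, 0)] + n[(1, 0)])  # X2, no parent
    arc = s_x1 + fam(n[(0, 1)], n[(0, 0)]) + fam(n[(1, 1)], n[(1, 0)])  # X2 | X1
    structures.append("X1->X2" if arc > empty else "empty")

print(pi, structures)
```

The MAP reestimation for the chosen structures, and further iterations of the cycle, follow the same pattern using the expected counts of the selected families.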
Finally, the EM, the ECMSS(d, θ_s, s) computation, the structure search, and the parameter-reestimation steps are iterated until some convergence criterion is satisfied.

In the remainder of this section, we discuss variations of the approach. In addition, we examine the criterion used for model search, the initialization of both the structure and parameters, and an approach for determining the number of mixture components and the number of states of any hidden variables in the component models.

Our search criterion is the log marginal likelihood of the expected complete model sufficient statistics:

    crit(s' | d, θ̃_s, s) = log p(ECMSS(d, θ̃_s, s) | s'^h)    (10)

where s' is the model being evaluated and (s, θ̃_s) are the model and parameters used to compute the expected complete model sufficient statistics.

We use (s, θ̃_s) to compute sufficient statistics for the complete model, because we want all possible dependencies in the data to be reflected in the statistics. If we were to compute sufficient statistics for an incomplete (constrained) model, then models visited during search that violate these constraints would not be supported by the data.

The criterion in Equation 10 is related to the Cheeseman–Stutz approximation for the marginal likelihood, which we can rewrite as

    log p(d | s'^h) ≈ log p(ECMSS(d, θ̃_s', s') | s'^h) + log [ p(d | θ̃_s', s'^h) / p(ECMSS(d, θ̃_s', s') | θ̃_s', s'^h) ]    (11)

Although the argument of Chickering & Heckerman (1997) suggests that Equation 11 is a more accurate approximation for the log marginal likelihood than is Equation 10, we use the less accurate criterion for two practical reasons. One, if we were to include the likelihood-ratio "correction term" in Equation 11, then the criterion would not factor.
Two, if we were to use just the first term in Equation 11, then we would still need to compute the MAP configuration θ̃_s' for every structure that we evaluate. In contrast, by using Equation 10, we compute the MAP configuration θ̃_s only once. Despite these shortcuts, experiments described in Section 6 suggest that the use of the criterion in Equation 10 guides the structure search to good models.

Our approach requires that both an initial structure and an initial parameterization be chosen. First, let us consider structural initialization. We initialize the structure of each component model by placing an arc from every hidden variable to every observable variable, with the exception that nodes corresponding to continuous random variables do not point to nodes corresponding to discrete random variables. A simpler choice for an initial graph is one in which every component consists of an empty graph (that is, a graph containing no arcs). However, with such an initialization and for a restricted set of priors, we conjecture that our approach would be unable to discover connections between hidden and observable variables.

Next, let us consider parameter initialization. When the mixture components contain no hidden continuous variables, we initialize parameters for a component DAG structure b as follows. First, we remove all hidden nodes and adjacent arcs from b, creating model b'. Next, we determine θ̃_b', the MAP configuration for θ_b' given data d. Since the data is complete with respect to b', we can compute this MAP in closed form. Then, we create a conjugate distribution for θ_b' whose configuration of maximum value agrees with the MAP configuration just computed and whose equivalent sample sizes are specified by the user.
Next, for each non-hidden node X_i in b and for each configuration of X_i's hidden discrete parents, we initialize the parameters of the local distribution family for X_i by drawing from the conjugate distribution just described. For each hidden discrete node X_i in b and for each configuration of X_i's (possible) parents, we initialize the multinomial parameters associated with the local distribution family of X_i to be some fixed distribution (e.g., uniform). When the mixture components contain hidden continuous variables, we use the simpler approach of initializing parameters at random (i.e., by drawing from a distribution such as the prior). Methods for initializing the parameters of the distinguished random variable C include (1) setting the parameters to be equal, (2) setting the parameters to their prior means, and (3) drawing the parameters from a Dirichlet distribution.

As we have mentioned, our approach has several variations. One source of variation is the heuristic algorithm used for search once ECMSS(d, θ_s, s) is computed. The options are the same as those for the simple search-and-score algorithms, and include greedy search, greedy search with restarts, best-first search, and Monte-Carlo methods. In preliminary studies, we have found greedy search to be effective; and in our analysis of real data in Section 6, we use this technique.

Another source of variation is the schedule used to alternate between parameter and structure search. With respect to parameter search, we can run EM to convergence, for one step, for some fixed number of steps, or for a number of steps that depends on how many times we have performed the search phase.
With respect to structure search, we can perform model-structure transformations for some fixed number of steps, for some number of steps that depends on how many times we have performed the search phase, or until we find a local maximum. Finally, we can iterate the steps consisting of EM, the computation of ECMSS(d, θ_s, s), and structure search until either (1) the MDAG structure does not change across two consecutive search phases, or (2) the approximate marginal likelihood of the resulting MDAG structure does not increase. Under the second schedule, the algorithm is guaranteed to terminate, because the marginal likelihood cannot increase indefinitely. Under the first schedule, we do not know of a proof that the algorithm will terminate. In our experiments with greedy structure search, however, we have found that this schedule halts.

We find it convenient to describe these schedules using a regular grammar, where E, M, E_c, and S denote an E step, an M step, the computation of ECMSS(d, θ_s, s), and structure search, respectively. For example, we use ((EM)* E_c S* M)* to denote the case where, within each outer iteration, we (1) run EM to convergence, (2) compute the expected complete model sufficient statistics, (3) run structure search to convergence, and (4) perform an M step. Another schedule we examine is ((EM)^10 E_c S* M)*. In this schedule, we run EM for only 10 steps before computing the expected complete model sufficient statistics.^2 In a technical report that is a companion to this paper (Thiesson, Meek, Chickering, and Heckerman, 1997), we evaluate various combinations of these schedules.

^2 When the structure search leaves the model structure unchanged, we force another iteration of the outer loop in which we run EM to convergence rather than for 10 steps.
If the model structure changes in this forced iteration, we continue to iterate with 10 EM steps.

Our experiments indicate that, although it is not necessary to run EM to convergence between structure searches, a single EM step between structure searches selects models that have lower prediction accuracy. We have found that the schedule ((EM)^10 E_c S* M)* works well for a variety of problems.

Finally, the algorithm as described can compare neither models that contain different random variables nor models in which the same random variable has a different number of states. Nonetheless, we can perform an additional search over the number of states of each discrete hidden variable by applying the algorithm in Figure 1 to initial models with different numbers of states for the hidden variables. We can discard a discrete hidden variable from a model by setting its number of states to one. After the best MDAG for each initialization is identified, we select the overall best structure using some criterion. Because only a relatively small number of alternatives are considered, we can use a computationally expensive approximation for the marginal likelihood such as the Cheeseman–Stutz approximation or a Monte-Carlo method.

6 Example

In this section, we evaluate the predictive performance of MDAG models on real data. In addition, we evaluate some of the assumptions underlying our method for learning these models. In the domain that we consider, all the observable random variables are continuous. Consequently, we focus our attention on mixtures of Gaussian DAG models. To accommodate the outliers contained in the data set that we analyze, each of the mixture models that we consider has a noise component in addition to one or more Gaussian components.
The noise component is modeled as a multivariate uniform distribution, and can be viewed as an empty DAG model in which the distribution function for each of the random variables is uniform.

We compare the predictive performance of (1) mixtures of DAG models (MDAG/n), (2) mixtures of multivariate-Gaussian distributions for which the covariance matrices are diagonal (MDIAG/n), and (3) mixtures of multivariate-Gaussian distributions for which the covariance matrices are full (MFULL/n). The MDIAG/n and MFULL/n model classes correspond to MDAG models with fixed empty structures and fixed complete structures, respectively, for all Gaussian components. The /n suffix indicates the existence of a uniform noise component.

We perform an outer search to identify the number of components within each mixture model as described in Section 5. In particular, we first learn a two-component model (one Gaussian and one noise component), and then increase by one the number of Gaussian mixture components until the model score is clearly decreasing. We choose the best number of components using the Cheeseman–Stutz criterion. Then, we measure the predictive ability of the chosen models using the logarithmic scoring rule of Good (1952):

    (1/|d_test|) ∑_{l ∈ d_test} log p(x_l | s^h)    (12)

where d_test is a set of test cases and |d_test| is the number of test cases. We approximate p(x_l | s^h) by p(x_l | θ̃_s, s^h), the likelihood evaluated at the MAP parameter configuration.^3

When learning MDAG/n models, we use the ((EM)^10 E_c S* M)* search schedule; and when learning MDIAG/n and MFULL/n models, we run the EM algorithm to convergence. In all experiments, we deem EM to have converged when the ratio of the change in log likelihood from the preceding step to the change in log likelihood from the initialization falls below 10^-6.
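Equation 12 is simply the average per-case log-density on the test set. A minimal sketch (the univariate Gaussian here stands in for a learned model; all names and numbers are illustrative):

```python
import math

def log_score(test_cases, log_density):
    """Average log predictive density over test cases (Equation 12)."""
    return sum(log_density(x) for x in test_cases) / len(test_cases)

def gauss_logpdf(x, mu=0.0, var=1.0):
    """log N(x; mu, var), standing in for log p(x_l | theta~, s^h)."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)

print(log_score([-0.5, 0.0, 0.5], gauss_logpdf))
```

Higher (less negative) scores indicate better predictive fit, and averaging over cases makes scores comparable across test sets of different sizes.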
We initialize structure and parameters for our search procedures as described in Section 5, with equivalent sample sizes equal to 200.

The example we consider addresses the digital encoding of handwritten digits (Hinton, Dayan, & Revow, 1997). In this domain, there are 64 random variables corresponding to the gray-scale values [0, 255] of scaled and smoothed 8-pixel × 8-pixel images of handwritten digits obtained from the CEDAR U.S. postal service database (Hull, 1994). Applications of joint prediction include image compression and digit classification. The sample sizes for the digits ("0" through "9") range from 1293 to 1534. For each digit, we use 1100 samples for training, and the remaining samples for testing.

We use a relatively diffuse Normal–Wishart parameter prior for each of the Gaussian components of MDIAG/n and MFULL/n models. In the notation of DeGroot (1970), our prior has ν = 2, all values in μ set to 64 as a rough assessment of the average gray-scale value over pixels, α = ν + 64, and τ set to the identity matrix. We choose α to be the sum of ν and the number of observed variables so that we can compute the MAP parameter values in closed form. The parameter priors for the Gaussian components of the MDAG/n models are Normal–Wishart priors specified using the hyperparameters described above and the methods described in Heckerman and Geiger (1995). We use a uniform prior on the number of components in the mixture and, when learning MDAG/n models, a uniform prior on the structure of the component DAG models. Because we know that the values of each of the 64 variables are constrained to the range [0, 255], we fix the parameters in the uniform distribution of the noise model accordingly. Finally, the hyperparameters {α_0, ..., α_k} of the Dirichlet prior on the mixture weights (i.e., the distinguished variable) are α_0 = 0.01 for the noise component, and α_1 = ... = α_k = 0.99/k for the Gaussian components.

The predictive logarithmic score on the test set for each digit is shown in Figure 2. The number of Gaussian components k and the model dimension d for the best model in each class are displayed in Table 1.

³We are currently implementing a Monte-Carlo method to average over the parameter configurations.

Figure 2: Logarithmic predictive scores on the test sets for the digit data.

Table 1: Number of Gaussian components and parameters in the learned models for the digit data.

        MDAG/n       MFULL/n      MDIAG/n
Digit   k    d       k    d       k    d
"0"     5    1812    2    4290    8    1032
"1"     7    2910    2    4290    5    645
"2"     6    1816    1    2145    6    774
"3"     4    1344    1    2145    6    774
"4"     7    2115    2    4290    6    774
"5"     9    2702    1    2145    6    774
"6"     9    2712    2    4290    5    645
"7"     9    3168    2    4290    4    516
"8"     6    1868    2    4290    5    645
"9"     8    2955    2    4290    5    645

Figure 2 indicates that MDAG/n models, on average, improve the predictive accuracy by 8% over MFULL/n models and 20% over MDIAG/n models. Note that the gains in predictive accuracy over MFULL/n models are obtained while reducing the average number of parameters by one third.

Using a P6 200MHz computer, the time taken to learn the MDAG/n, MFULL/n, and MDIAG/n models for a single digit (including the time needed to find the optimal number of components) is, on average, 6.0, 1.5, and 1.9 hours, respectively. These times could be improved by using a more clever search for the optimal number of mixture components.

To better understand the differences in the distributions that these mixture models represent, we examine the individual Gaussian components of the learned MDAG/n, MFULL/n, and MDIAG/n models for the digit "7". The first row of Figure 3 shows the means for each of the components of each of the models.
The mean values for the variables in each component are displayed in an 8 × 8 grid in which the shade of gray indicates the value of the mean. The displays indicate that each of the components of each type of model captures a distinctive type of seven. They do not, however, reveal any of the dependency structure in the component models. To help visualize these dependencies, we drew four samples from each component for each type of model. The grid for each sample is shaded to indicate the sampled values. Whereas the samples from the MDIAG/n components do look like sevens, they are mottled. This is not surprising, because each of the variables is conditionally independent given the component. The samples for the MFULL/n components are not mottled, but indicate that multiple types of sevens are being represented in one component. That is, several of the samples look blurred and appear to have multiple sevens superimposed. Generally, samples from each MDAG/n component look like sevens of the same distinct style, all of which closely resemble the mean.

Figure 4: The Cheeseman–Stutz criterion for each intermediate model obtained during structure search when learning a three-component model for the digit "7". The abrupt increases around steps 1, 350, and 540 occur when structure search transitions to a new component.

Let us turn our attention to the evaluation of one of the key assumptions underlying our learning method. As we have discussed, the criterion used to guide structure search (Equation 10) is only a heuristic approximation to the true model posterior.
To investigate the quality of this approximation, we can evaluate the model posterior using the Cheeseman–Stutz approximation (what we believe to be a more accurate approximation) for intermediate models visited during structure search. If the heuristic criterion is good, then the Cheeseman–Stutz criterion should increase as structure search progresses. We perform this evaluation when learning a three-component MDAG model for the digit "7" using the ((EM)^10 E^c S* M)* schedule. For 149 out of the 964 model transitions, the Cheeseman–Stutz approximation decreased. Overall, however, as shown in Figure 4, the Cheeseman–Stutz score progresses upward to apparent convergence. We obtain similar results for other data sets. These results suggest that the heuristic criterion (Equation 10) is a useful guide for structure search.

Using statistics from this same experiment, we are able to estimate the time it would take to learn the MDAG model using the simple search-and-score approach described in Section 4. We find that the time to learn the three-component MDAG model for the digit "7", using the Cheeseman–Stutz approximation for model comparison, is approximately 6 years on a P6 200MHz computer, thus substantiating our previous claim about the intractability of simple search-and-score approaches.

Finally, a natural question is whether the Cheeseman–Stutz approximation for the marginal likelihood is accurate for model selection. The answer is important, because the MDAG models we select and evaluate are chosen using this approximation.
Some evidence for the reasonableness of the approximation is provided by the fact that, as we vary the number of components of the MDAG models, the Cheeseman–Stutz and predictive scores roughly rise and fall in synchrony, usually peaking at the same number of components.

Figure 3: Means and samples from the components of the learned MDAG/n, MFULL/n, and MDIAG/n models for digit "7", with the mixture weight of each component.

7 Structure learning: A preliminary study

As we mentioned in the introduction, many computer scientists and statisticians are using statistical-inference techniques to learn the structure of DAG models from observational (i.e., non-experimental) data. Pearl & Verma (1991) and Spirtes et al. (1993) have argued that, under a set of simple (and sometimes reasonable) assumptions, the structures so learned can be used to infer cause-and-effect relationships. An interesting possibility is that these results can be generalized so that we may use the structure of learned MDAG models to infer causal relationships in mixed populations (populations in which subgroups have different causal relationships). In this section, we present a preliminary investigation into how well our approach can learn MDAG structure.

We perform our analysis as follows. First, we construct a "gold-standard" MDAG model, and use the model to generate data sets of varying size. Then, for each data set, we use our approach to learn an MDAG model (without a noise component). Finally, we compare the structure of the learned model to that of the gold-standard model, and measure the minimum number of arc manipulations (additions, deletions, and reversals) needed to transform each learned component structure to the corresponding gold-standard structure.
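A simple version of this arc-difference count can be sketched as follows. The function name is ours; treating edges as ordered (parent, child) pairs, it counts a reversed arc once. It does not implement the refinement, used in our counts below, of excluding manipulations that lead to a different structure representing the same family of distributions.

```python
def arc_difference(learned, gold):
    """Count the arc manipulations (additions, deletions, reversals)
    needed to turn the `learned` DAG structure into the `gold` one.
    Edges are ordered (parent, child) pairs; each reversal counts once.
    Distribution-equivalent structures are NOT treated specially here."""
    learned, gold = set(learned), set(gold)
    # arcs present in both structures but pointing the other way
    reversals = sum(1 for (u, v) in learned
                    if (v, u) in gold and (u, v) not in gold)
    # arcs in the learned structure absent (in either direction) from gold
    deletions = sum(1 for (u, v) in learned
                    if (u, v) not in gold and (v, u) not in gold)
    # arcs in gold absent (in either direction) from the learned structure
    additions = sum(1 for (u, v) in gold
                    if (u, v) not in learned and (v, u) not in learned)
    return reversals + deletions + additions
```

For example, transforming {A→B, B→C} into {B→A, B→C, C→D} requires one reversal (A→B) and one addition (C→D), for a count of 2.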
The gold-standard model is an MDAG model for five continuous random variables. The model has three mixture components. The structure of the first and third components (COMP1 and COMP3) is identical; this structure is shown in Figure 5a. The structure of the second component (COMP2) is shown in Figure 5b. The DAGs are parameterized so that there is some spatial overlap. In particular, all unconditional means in COMP1 and COMP2 are zero; all means in COMP3 are equal to five; and all linear coefficients and conditional variances are one (see Shachter & Kenley, 1989).

Figure 5: (a) The graphical structure for the first and third components in the gold-standard MDAG. (b) The graphical structure for the second component.

We construct a data set of size N = 3000 by sampling 1000 cases from each component of the gold-standard model. We then iteratively subsample this data, creating data sets of size N = 93, 186, 375, 750, 1500, and 3000.

Table 2 shows the results of learning models from the six data sets using the ((EM)^10 E^c S* M)* schedule. The columns of the table contain the number of components k in the learned MDAG, the sum of the mixture weights in the three largest components, and the minimum number of arc manipulations (additions, deletions, and reversals) needed to transform each learned component structure to the corresponding gold-standard structure, for the three components with the largest mixture weights. Arc manipulations that lead to a model with different structures but the same family of distributions are not included in the count.

Table 2: Performance on the task of structure learning as a function of sample size.

Sample        Weight of three    Arc differences
size     k    largest comp.      COMP1  COMP2  COMP3
93       2    1.00               -      4      0
186      2    1.00               -      2      0
375      3    1.00               1      1      0
750      5    0.98               1      1      0
1500     3    1.00               0      3      0
3000     5    0.99               1      1      0
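Sampling from a linear Gaussian DAG component with the parameterization above is a simple ancestral pass through the variables. The sketch below is ours: the arc set is illustrative rather than the exact structure of Figure 5, and the function name is not from the paper; only the parameterization (unit coefficients, unit conditional variances, common unconditional mean) follows the text.

```python
import numpy as np

def sample_linear_gaussian_dag(parents, mean, n, rng):
    """Draw n samples from one linear Gaussian DAG component in the
    Shachter & Kenley (1989) style: unit linear coefficients, unit
    conditional variances, and the same unconditional mean for every
    variable.  `parents` maps each variable to its parent list and must
    be given in topological order."""
    data = {}
    for v, pa in parents.items():
        # Centered parent contributions keep every unconditional mean
        # equal to `mean`, as in the gold-standard parameterization.
        data[v] = (mean + sum(data[p] - mean for p in pa)
                   + rng.standard_normal(n))
    return data

# An illustrative five-variable structure (not Figure 5's exact arcs):
structure = {"X1": [], "X2": ["X1"], "X3": ["X1"],
             "X4": ["X2", "X3"], "X5": ["X4"]}
rng = np.random.default_rng(0)
comp1 = sample_linear_gaussian_dag(structure, 0.0, 1000, rng)  # means zero
comp3 = sample_linear_gaussian_dag(structure, 5.0, 1000, rng)  # means five
```

Concatenating 1000 cases from each component (and hiding the component label) yields a synthetic data set of the kind used in this experiment.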
All learned MDAG structures are close to that of the gold-standard model. In addition, although not apparent from the table, for sample sizes larger than 375 the structure of every learned component has only additional arcs in comparison with the gold-standard model. Finally, it is interesting to note that, essentially, the structure is recovered for a sample size as low as 375.

8 Related work

DAG models (single-component MDAG models) with hidden variables generalize many well-known statistical models including linear factor analysis, latent factor models (e.g., Clogg, 1995), and probabilistic principal component analysis (Tipping & Bishop, 1997). MDAG models generalize a variety of mixture models including naive-Bayes models used for clustering (e.g., Clogg, 1995; Cheeseman and Stutz, 1995), mixtures of factor analytic models (Hinton, Dayan, & Revow, 1997), and mixtures of probabilistic principal component analytic models (Tipping & Bishop, 1997).

There is also work related to our learning methods. The idea of interleaving parameter and structure search to learn graphical models has been discussed by Meilă, Jordan, & Morris (1997), Singh (1997), and Friedman (1997). Meilă et al. (1997) consider the problem of learning mixtures of DAG models for discrete random variables where each component is a spanning tree. Similar to our approach, they treat expected data as real data to produce a completed data set for structure search. Unlike our work, they replace heuristic model search with a polynomial algorithm for finding the "best" spanning-tree components given the completed data. Also, unlike our work, they use likelihood as a selection criterion, and thus do not take into account the complexity of the model.

Singh (1997) concentrates on learning a single DAG model for discrete random variables with incomplete data.
He does not consider continuous variables or mixtures of DAG models. In contrast to our approach, Singh (1997) uses a Monte-Carlo method to produce completed data sets for structure search.

Friedman (1997, 1998) describes general algorithms for learning DAG models given incomplete data, and provides theoretical justification for some of his methods. Similar to our approach and the approach of Meilă et al. (1997), Friedman treats expected data as real data to produce completed data sets. In contrast to our approach, Friedman obtains the expected sufficient statistics for a new model using the current model. Most of these statistics are calculated by performing probabilistic inference in the current DAG model, although some of the statistics are obtained from a cache of previous inferences. In our approach, we need to perform inference only once on every case that has missing values in order to compute the expected complete model sufficient statistics. After these statistics are computed, model scores for arbitrary structures can be computed without additional inference.

9 Discussion and future work

We have described mixtures of DAG models, a class of models that is more general than DAG models, and have presented a novel heuristic method for choosing good models in this class. Although evaluations for more examples (especially ones containing discrete variables) are needed, our preliminary evaluations suggest that model selection within this expanded model class can lead to substantially improved predictions. This result is fortunate, as our evaluations also show that simple search-and-score algorithms, in which models are evaluated one at a time using Monte-Carlo or large-sample approximations for model posterior probability, are intractable for some real problems.
One important observation from our evaluations is that the (practical) selection criterion that we introduce, the marginal likelihood of the complete-model sufficient statistics, is a good guide for model search. An interesting question is: Why? We hope that this work will stimulate theoretical work to answer this question and perhaps uncover better approximations for guiding model search. Friedman (1998) has some initial insight.

A possibly related challenge for theoretical study has to do with the apparent accuracy of the Cheeseman–Stutz approximation for the marginal likelihood. As we have discussed, in experiments with multinomial mixtures, Chickering & Heckerman (1997) have found the approximation to be at least as accurate and sometimes more accurate than the standard Laplace approximation. Our evaluations have also provided some evidence that the Cheeseman–Stutz approximation is an accurate criterion for model selection.

In our evaluations, we have not considered situations where the component DAG models themselves contain hidden variables. In order to learn models in this class, methods for structure search are needed. In such situations, the number of possible models is significantly larger than the number of possible DAGs over a fixed set of variables. Without constraining the set of possible models with hidden variables, for instance by restricting the number of hidden variables, the number of possible models is infinite. On a positive note, Spirtes et al. (1993) have shown that constraint-based methods under suitable assumptions can sometimes indicate the existence of a hidden common cause between two variables. Thus, it may be possible to use constraint-based methods to suggest an initial set of plausible models containing hidden variables that can then be subjected to a Bayesian analysis.
In Section 7, we saw that we can recover the structure of an MDAG model to a fair degree of accuracy. This observation raises the intriguing possibility that we can infer causal relationships from a population consisting of subgroups governed by different causal relationships. One important issue that needs to be addressed first, however, has to do with structural identifiability. For example, two MDAG models may superficially have different structures, but may otherwise be statistically equivalent. Although the criteria for structural identifiability among single-component DAG models are well known, such criteria are not well understood for MDAG models.

Appendix: Expected complete model sufficient statistics

In this appendix, we examine complete model sufficient statistics more closely. We shall limit our discussion to multi-DAG models for which the component DAG models have conditional-Gaussian distributions. The extension to the noise component is straightforward.

Consider a multi-DAG model for random variables C and X. Let Γ denote the set of continuous variables in X, γ denote a configuration of Γ, and n_c denote the number of variables in Γ. Let Δ denote the set of all discrete variables (including the distinguished variable C), and m denote the number of possible configurations of Δ. In addition, let d = y_1, ..., y_N, where y_i is the configuration of the observed variables in case i. Note that different variables may be observed in different cases. Finally, as in Dempster et al. (1977), let x_i denote the i-th complete case, that is, the configuration of X and C in the i-th case.

Now, consider the complete model sufficient statistics for a complete case, which we denote T(x). For the multi-DAG model, T(x) is a vector ⟨⟨N_1, R_1, S_1⟩, ..., ⟨N_m, R_m, S_m⟩⟩ of m triples, where the N_j are scalars, the R_j are vectors of length n_c, and the S_j are square matrices of size n_c × n_c. In particular, if the discrete variables in x take on the j-th configuration, then N_j = 1, R_j = γ, and S_j = γ′γ, and N_k = 0, R_k = 0, and S_k = 0 for k ≠ j.

When we do not have a complete data set, we compute the expected complete model sufficient statistics ECMSS(d, θ_s, s), given by

ECMSS(d, θ_s, s) = Σ_{i=1}^{N} E(T(x_i) | y_i, θ_s, s^h)    (13)

The expectation is taken with respect to the joint distribution over the random variables C and X given θ_s, s^h, and the observations for the current case. The expectation of T(x) is computed by performing probabilistic inference in the multi-DAG model. Such inference is a simple extension of the work of Lauritzen (1992). The sum of expectations involves simply scalar, vector, or matrix additions (as appropriate) in each triple in each of the coordinates of the vector.

Note that, in the computation as we have described it, we require a statistic triple for every possible configuration of the discrete variables. In practice, however, we can use a sparse representation in which we store triples only for those complete observations that are consistent with the data.

References

Bernardo, J. and Smith, A. (1994). Bayesian Theory. John Wiley and Sons, New York.

Buntine, W. (1994). Operations for learning with graphical models. Journal of Artificial Intelligence Research, 2:159–225.

Cheeseman, P. and Stutz, J. (1995). Bayesian classification (AutoClass): Theory and results. In Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., and Uthurusamy, R., editors, Advances in Knowledge Discovery and Data Mining, pages 153–180. AAAI Press, Menlo Park, CA.

Chickering, D. (1996). Learning Bayesian Networks from Data. PhD thesis, University of California, Los Angeles, CA.

Clogg, C. (1995). Latent class models. In Handbook of Statistical Modeling for the Social and Behavioral Sciences, pages 311–359. Plenum Press, New York.

Cooper, G. and Herskovits, E. (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309–347.

DiCiccio, T., Kass, R., Raftery, A., and Wasserman, L. (July, 1995). Computing Bayes factors by combining simulation and asymptotic approximations. Technical Report 630, Department of Statistics, Carnegie Mellon University, PA.

Friedman, N. (1997). Learning belief networks in the presence of missing values and hidden variables. In Proceedings of the Fourteenth International Conference on Machine Learning. Morgan Kaufmann, San Mateo, CA.

Friedman, N. (1998). The Bayesian structural EM algorithm. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann, San Mateo, CA. To appear.

Geiger, D. and Heckerman, D. (1996). Beyond Bayesian networks: Similarity networks and Bayesian multinets. Artificial Intelligence, 82:45–74.

Good, I. (1952). Rational decisions. J. R. Statist. Soc. B, 14:107–114.

Heckerman, D. and Geiger, D. (1995). Likelihoods and priors for Bayesian networks. Technical Report MSR-TR-95-54, Microsoft Research, Redmond, WA.

Heckerman, D., Geiger, D., and Chickering, D. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20:197–243.

Hinton, G., Dayan, P., and Revow, M. (1997). Modeling the manifolds of images of handwritten digits. IEEE Transactions on Neural Networks, 8:65–74.

Hull, J. (1994). A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16:550–554.

Lauritzen, S. (1992). Propagation of probabilities, means, and variances in mixed graphical association models. Journal of the American Statistical Association, 87:1098–1108.

Meilă, M., Jordan, M., and Morris, Q. (1997). Estimating dependency structure as a hidden variable. Technical Report 1611, Massachusetts Institute of Technology, Artificial Intelligence Laboratory.

Shachter, R. and Kenley, C. (1989). Gaussian influence diagrams. Management Science, 35:527–550.

Singh, M. (1997). Learning Bayesian networks from incomplete data. In Proceedings AAAI-97, Fourteenth National Conference on Artificial Intelligence, Providence, RI, pages 534–539. AAAI Press, Menlo Park, CA.

Spiegelhalter, D., Dawid, A., Lauritzen, S., and Cowell, R. (1993). Bayesian analysis in expert systems. Statistical Science, 8:219–282.

Spirtes, P., Glymour, C., and Scheines, R. (1993). Causation, Prediction, and Search. Springer-Verlag, New York.

Thiesson, B., Meek, C., Chickering, D., and Heckerman, D. (December, 1997). Learning mixtures of DAG models. Technical Report MSR-TR-97-30, Microsoft Research, Redmond, WA.

Tierney, L. and Kadane, J. (1986). Accurate approximations for posterior moments and marginal densities. Journal of the American Statistical Association, 81:82–86.

Tipping, M. and Bishop, C. (1997). Mixtures of probabilistic principal component analysers. Technical Report NCRG-97-003, Neural Computing Research Group.