Asymptotic Model Selection for Directed Networks with Hidden Variables
Dan Geiger
Computer Science Department, Technion, Haifa 32000, Israel
dang@cs.technion.ac.il

David Heckerman
Microsoft Research, Bldg 9S, Redmond, WA 98052-6399
heckerma@microsoft.com

Christopher Meek
Department of Philosophy, Carnegie Mellon University
meek@cmu.edu

* This work was done at Microsoft Research.

Abstract

We extend the Bayesian Information Criterion (BIC), an asymptotic approximation for the marginal likelihood, to Bayesian networks with hidden variables. This approximation can be used to select models given large samples of data. The standard BIC, as well as our extension, punishes the complexity of a model according to the dimension of its parameters. We argue that the dimension of a Bayesian network with hidden variables is the rank of the Jacobian matrix of the transformation between the parameters of the network and the parameters of the observable variables. We compute the dimensions of several networks, including the naive Bayes model with a hidden root node.

1 Introduction

Learning Bayesian networks from data extends their applicability to situations where data is easily obtained and expert knowledge is expensive. Consequently, it has been the subject of much research in recent years (see, e.g., Heckerman [1995b] and Buntine [1996]). Researchers have pursued two types of approaches for learning Bayesian networks: one that uses independence tests to direct a search among valid models, and another that uses a score to search for the best-scoring network, a procedure known as model selection. Scores based on exact Bayesian computations have been developed by (e.g.) Cooper and Herskovits (1992), Spiegelhalter et al. (1993), Buntine (1994), and Heckerman et al. (1995), and scores based on minimum description length (MDL) have been developed in Lam and Bacchus (1993) and Suzuki (1993).

We consider a Bayesian approach to model selection. Suppose we have a set $X = \{X_1, \ldots, X_n\}$ of discrete variables and a set $D = \{x_1, \ldots, x_N\}$ of cases, where each case is an instance of some or of all the variables in $X$. Let $(S, \theta_s)$ be a Bayesian network, where $S$ is the network structure of the Bayesian network, a directed acyclic graph such that each node of $S$ is associated with a random variable $X_i$, and $\theta_s$ is a set of parameters associated with the network structure. Let $S^h$ stand for the hypothesis that the true or objective joint distribution of $X$ can be encoded in the network structure $S$. Then a Bayesian measure of the goodness-of-fit of network structure $S$ to $D$ is $p(S^h \mid D) \propto p(S^h)\, p(D \mid S^h)$, where $p(D \mid S^h)$ is known as the marginal likelihood of $D$ given $S^h$.

The problem of model selection among Bayesian networks with hidden variables, that is, networks with variables whose values are not observed, is more difficult than model selection among networks without hidden variables. First, the space of possible networks becomes infinite, and second, scoring each network is computationally harder because one must account for all possible values of the missing variables (Cooper and Herskovits, 1992). Our goal is to develop a Bayesian scoring approach for networks that include hidden variables.
Obtaining such a score that is computationally effective and conceptually simple will allow us to select a model from among a set of competing models.

Our approach is to use an asymptotic approximation of the marginal likelihood. This asymptotic approximation is known as the Bayesian Information Criterion (BIC) (Schwarz, 1978), and is equivalent to Rissanen's (1987) minimum description length (MDL). Such an asymptotic approximation has been carried out for Bayesian networks by Herskovits (1991) and Bouckaert (1995) when no hidden variables are present. Bouckaert (1995) shows that the marginal likelihood of data $D$ given a network structure $S$ is given by

$$\log p(D \mid S^h) = H(S, D)\, N - \frac{\dim(S)}{2} \log N + O(1) \quad (1)$$

where $N$ is the sample size of the data, $H(S, D)$ is the entropy of the probability distribution obtained by projecting the frequencies of observed cases into the conditional probability tables of the Bayesian network $S$, and $\dim(S)$ is the number of parameters in $S$.

Eq. 1 reveals the qualitative preferences made by the Bayesian approach. First, with sufficient data, a network structure that is an I-map of the true distribution is more likely than a network structure that is not an I-map of the true distribution. Second, among all network structures that are I-maps of the true distribution, the one with the minimum number of parameters is more likely.

Eq. 1 was derived from an explicit formula for the probability of a network given data by letting the sample size $N$ run to infinity and using a Dirichlet prior for its parameters. Nonetheless, Eq. 1 does not depend on the selected prior. In Section 3, we use Laplace's method to rederive Eq. 1 without assuming a Dirichlet prior. Our derivation is a standard application of asymptotic Bayesian analysis. This derivation is useful for gaining intuition for the hidden-variable case.

In Section 4, we provide an approximation to the marginal likelihood for Bayesian networks with hidden variables, and give a heuristic argument for this approximation using Laplace's method. We obtain the following equation:

$$\log p(D \mid S^h) \approx \log p(D \mid \hat{\theta}_s, S^h) - \frac{\dim(S, \hat{\theta}_s)}{2} \log N \quad (2)$$

where $\hat{\theta}_s$ is the maximum likelihood (ML) value for the parameters of the network and $\dim(S, \hat{\theta}_s)$ is the dimension of $S$ at the ML value for $\theta_s$. The dimension of a model can be interpreted in two equivalent ways. First, it is the number of free parameters needed to represent the parameter space near the maximum likelihood value. Second, it is the rank of the Jacobian matrix of the transformation between the parameters of the network and the parameters of the observable (non-hidden) variables. In any case, the dimension depends on the value of $\hat{\theta}_s$, in contrast to Eq. 1, where the dimension is fixed throughout the parameter space.

In Section 5, we compute the dimensions of several network structures, including the naive Bayes model with a hidden class node. In Section 6, we demonstrate that the scoring function used in AutoClass sometimes diverges from $p(D \mid S^h)$ asymptotically. In Sections 7 and 8, we describe how our approach can be extended to Gaussian and sigmoid networks.

2 Background

We introduce the following notation for a Bayesian network.
Let $r_i$ be the number of states of variable $X_i$, $Pa_i$ be the set of variables corresponding to the parents of node $X_i$, and $q_i = \prod_{X_l \in Pa_i} r_l$ be the number of states of $Pa_i$. We use the integer $j$ to index the states of $Pa_i$; that is, we write $Pa_i = pa_i^j$ to denote that the parents of $X_i$ are assigned their $j$th state. We use $\theta_{ijk}$ to denote the true probability, or parameter, that $X_i = x_i^k$ given that $Pa_i = pa_i^j$. Note that $\sum_{k=1}^{r_i} \theta_{ijk} = 1$; also, we assume $\theta_{ijk} > 0$. In addition, we use $\theta_{ij} = \{\theta_{ijk} \mid 1 \le k \le r_i\}$ to denote the parameters associated with node $i$ for a given instance $pa_i^j$ of the parents $Pa_i$, and $\theta_i = \{\theta_{ij} \mid 1 \le j \le q_i\}$ to denote the parameters associated with node $i$. Thus, $\theta_s = \{\theta_i \mid 1 \le i \le n\}$. When $S$ is unambiguous, we use $\theta$ instead of $\theta_s$.

To compute $p(D \mid S^h)$ in closed form, several assumptions are usually made. First, the data $D$ is assumed to be a random sample from some Bayesian network $(S, \theta_s)$. Second, for each network structure, the parameter sets $\theta_1, \ldots, \theta_n$ are mutually independent (global independence [Spiegelhalter and Lauritzen, 1990]), and the parameter sets $\theta_{i1}, \ldots, \theta_{iq_i}$ for each $i$ are assumed to be mutually independent (local independence [Spiegelhalter and Lauritzen, 1990]). Third, if a node has the same parents in two distinct networks, then the distributions of the parameters associated with this node are identical in both networks (parameter modularity [Heckerman et al., 1994]). Fourth, each case is complete. Fifth, the prior distribution of the parameters associated with each node is Dirichlet; that is, $p(\theta_{ij} \mid S^h) \propto \prod_k \theta_{ijk}^{\alpha_{ijk}}$, where $\alpha_{ijk}$ can be interpreted as the equivalent number of cases seen in which $X_i = x_i^k$ and $Pa_i = pa_i^j$.

Using these assumptions, Cooper and Herskovits (1992) obtained the following exact formula for the marginal likelihood:

$$p(D \mid S^h) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + N_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + N_{ijk})}{\Gamma(\alpha_{ijk})}$$

where $N_{ijk}$ is the number of cases in $D$ in which $X_i = x_i^k$ and $Pa_i = pa_i^j$, $N_{ij} = \sum_k N_{ijk}$, and $\alpha_{ij} = \sum_k \alpha_{ijk}$. We call this expression the Cooper–Herskovits scoring function.

The last two assumptions are made for the sake of convenience; namely, the parameter distributions before and after data are seen are in the same family: the Dirichlet family. Geiger and Heckerman (1995) provide a characterization of the Dirichlet distribution which shows that the fifth assumption is implied by the first three assumptions together with one additional assumption: if $S_1$ and $S_2$ are equivalent Bayesian networks (i.e., they represent the same sets of joint distributions), then the events $S_1^h$ and $S_2^h$ are equivalent as well (hypothesis equivalence [Heckerman et al., 1995]). This assumption was made explicit because it does not hold for causal networks, where two arcs with opposing directions correspond to distinct hypotheses [Heckerman, 1995a]. To satisfy these assumptions, Heckerman et al. (1995) show that one must use $\alpha_{ijk} = \alpha\, q(X_i = x_i^k, Pa_i = pa_i^j)$ in the Cooper–Herskovits scoring function, where $q(X_1, \ldots, X_n)$ is the joint probability distribution of $X$ obtained from an initial or prior Bayesian network specified by the user, and $\alpha$ is the user's effective sample size, or confidence, in the prior network.
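To make the Cooper–Herskovits formula concrete, the following sketch (ours, not the paper's) evaluates it for the simplest possible case: a single binary variable with no parents, a uniform Dirichlet prior ($\alpha_{ijk} = 1$, hence $\alpha_{ij} = 2$), and data with 7 cases in one state and 3 in the other. We use Mathematica, as the paper itself does in Section 5; the function name chScore is ours.

    (* Hypothetical sketch: Cooper-Herskovits score for one binary variable
       with no parents, uniform Dirichlet prior (alpha_ijk = 1, alpha_ij = 2),
       and counts n1 and n2 for the two states. *)
    chScore[n1_, n2_] :=
      Gamma[2]/Gamma[2 + n1 + n2] *
        (Gamma[1 + n1]/Gamma[1]) * (Gamma[1 + n2]/Gamma[1])

    chScore[7, 3]   (* = 7! 3!/11! = 1/1320 *)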
The Cooper–Herskovits scoring function does not lend itself to a qualitative analysis. Nonetheless, by letting $N$ grow to infinity yet keeping $N_{ij}/N$ and $N_{ijk}/N$ finite, Eq. 1 can be derived by expanding $\Gamma(\cdot)$ using Stirling's approximation. This derivation hinges on the assumptions of global and local independence and on a Dirichlet prior, although, as we show, the result still holds without these assumptions. Intuitively, with a large sample size $N$, the data washes away any contribution of the prior.

3 Asymptotics Without Hidden Variables

We shall now rederive Herskovits' (1991) and Bouckaert's (1995) asymptotic result. The technique we use is Laplace's method, which is to expand the log likelihood of the data around the maximum likelihood value, and then approximate the peak using a multivariate-normal distribution.

Our derivation bypasses the need to compute $p(D_N \mid S^h)$ for data $D_N$ of a sample size $N$, which requires the assumptions discussed in the previous section. Instead, we compute $\lim_{N \to \infty} p(D_N \mid S^h)$. Furthermore, our derivation only assumes that the prior for $\theta$ around the maximum likelihood value is positive. Finally, we argue in the next section that our derivation can be extended to Bayesian networks with hidden variables.

We begin by defining $f(\theta) \equiv \log p(D_N \mid \theta, S^h)$. Thus,

$$p(D_N \mid S^h) = \int p(D_N \mid \theta, S^h)\, p(\theta \mid S^h)\, d\theta = \int \exp\{f(\theta)\}\, p(\theta \mid S^h)\, d\theta \quad (3)$$

Assuming $f(\theta)$ has a maximum, the ML value $\hat{\theta}$, we have $f'(\hat{\theta}) = 0$. Using a Taylor-series expansion of $f(\theta)$ around the ML value, we get

$$f(\theta) \approx f(\hat{\theta}) + \tfrac{1}{2} (\theta - \hat{\theta})^t f''(\hat{\theta}) (\theta - \hat{\theta}) \quad (4)$$

where $f''(\hat{\theta})$ is the Hessian of $f$: the square matrix of second derivatives with respect to every pair of variables $\{\theta_{ijk}, \theta_{i'j'k'}\}$. Consequently, from Eqs. 3 and 4,

$$\log p(D_N \mid S^h) \approx f(\hat{\theta}) + \log \int \exp\left\{\tfrac{1}{2} (\theta - \hat{\theta})^t f''(\hat{\theta}) (\theta - \hat{\theta})\right\} p(\theta \mid S^h)\, d\theta \quad (5)$$

We assume that $-f''(\hat{\theta})$ is positive definite and that, as $N$ grows to infinity, the peak in a neighborhood around the maximum becomes sharper. Consequently, if we ignore the prior, we get a normal distribution around the peak. Furthermore, if we assume that the prior $p(\theta \mid S^h)$ is not zero around $\hat{\theta}$, then as $N$ grows it can be assumed constant and so removed from the integral in Eq. 5. The remaining integral is approximated by the formula for multivariate-normal distributions:

$$\int \exp\left\{\tfrac{1}{2} (\theta - \hat{\theta})^t f''(\hat{\theta}) (\theta - \hat{\theta})\right\} d\theta \approx \frac{(2\pi)^{d/2}}{\sqrt{\det\left[-f''(\hat{\theta})\right]}} \quad (6)$$

where $d$ is the number of parameters in $\theta$, $d = \sum_{i=1}^{n} (r_i - 1)\, q_i$. As $N$ grows to infinity, the above approximation becomes more precise because the entire mass becomes concentrated around the peak. Plugging Eq. 6 into Eq. 5 and noting that each entry of $-f''(\hat{\theta})$ grows linearly in $N$, so that $\det[-f''(\hat{\theta})]$ is proportional to $N^d$, yields the BIC:

$$\log p(D_N \mid S^h) \approx \log p(D_N \mid \hat{\theta}, S^h) - \frac{d}{2} \log N \quad (7)$$

A careful derivation in this spirit shows that the error in this approximation does not depend on $N$ [Schwarz, 1978].
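As a sanity check (ours, not the paper's), Eq. 7 can be compared against an exactly computable marginal likelihood. For a single binary variable with a uniform prior, the integral in Eq. 3 is a Beta function, so both sides are available in closed form. In the Mathematica sketch below, the helper names exact and bic are ours, and the ML value is held at $\hat{\theta} = 0.7$ while the sample size grows; the gap between the two quantities settles toward a constant, consistent with an $O(1)$ error.

    (* Hypothetical check of Eq. 7: exact log marginal likelihood of n1
       successes in n Bernoulli trials under a uniform prior, versus the
       BIC with d = 1. *)
    exact[n1_, n_] := Log[Integrate[t^n1 (1 - t)^(n - n1), {t, 0, 1}]]
    bic[n1_, n_] := n1 Log[n1/n] + (n - n1) Log[1 - n1/n] - (1/2) Log[n]

    N[exact[7 #, 10 #] - bic[7 #, 10 #]] & /@ {1, 10, 100}
    (* the differences approach a constant as the sample size grows *)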
For Bayesian networks, the function $f(\theta)$ is known. Thus, all the assumptions about this function can be verified. First, we note that $f''(\theta)$ is a block-diagonal matrix where each block $A_{ij}$ corresponds to variable $X_i$ and a particular instance $j$ of $Pa_i$, and is of size $(r_i - 1) \times (r_i - 1)$. Let us examine one such $A_{ij}$. To simplify notation, assume that $X_i$ has three states. Let $w_1$, $w_2$, and $w_3$ denote $\theta_{ijk}$ for $k = 1, 2, 3$, where $i$ and $j$ are fixed. We consider only those cases in $D_N$ where $Pa_i = pa_i^j$, and examine only the observations of $X_i$. Let $D'_N$ denote the set of $N$ values of $X_i$ obtained in this process. With each observation $l$, we associate two indicator functions $x_l$ and $y_l$: $x_l$ is one if $X_i$ takes its first value in case $l$ and is zero otherwise; similarly, $y_l$ is one if $X_i$ takes its second value in case $l$ and is zero otherwise. The log-likelihood function of $D'_N$ is given by

$$\lambda(w_1, w_2) = \log \prod_{l=1}^{N} w_1^{x_l}\, w_2^{y_l}\, (1 - w_1 - w_2)^{1 - x_l - y_l} \quad (8)$$

To find the maximum, we set the first derivatives of this function to zero. The resulting equations are called the maximum likelihood equations:

$$\lambda_{w_1}(w_1, w_2) = \sum_{l=1}^{N} \left[ \frac{x_l}{w_1} - \frac{1 - x_l - y_l}{1 - w_1 - w_2} \right] = 0$$

$$\lambda_{w_2}(w_1, w_2) = \sum_{l=1}^{N} \left[ \frac{y_l}{w_2} - \frac{1 - x_l - y_l}{1 - w_1 - w_2} \right] = 0$$

The only solution to these equations is $w_1 = \bar{x} = \sum_l x_l / N$, $w_2 = \bar{y} = \sum_l y_l / N$, which is the maximum likelihood value. The Hessian of $\lambda(w_1, w_2)$ at the ML value is given by

$$\lambda''(w_1, w_2) = \begin{pmatrix} \lambda_{w_1 w_1} & \lambda_{w_1 w_2} \\ \lambda_{w_2 w_1} & \lambda_{w_2 w_2} \end{pmatrix} = -N \begin{pmatrix} \frac{1}{\bar{x}} + \frac{1}{1 - \bar{x} - \bar{y}} & \frac{1}{1 - \bar{x} - \bar{y}} \\ \frac{1}{1 - \bar{x} - \bar{y}} & \frac{1}{\bar{y}} + \frac{1}{1 - \bar{x} - \bar{y}} \end{pmatrix} \quad (9)$$

The negative Hessian $-\lambda''$ decomposes into $N$ times the sum of two matrices: a diagonal matrix with positive numbers $1/\bar{x}$ and $1/\bar{y}$ on the diagonal, and a constant matrix in which all elements equal the positive number $1/(1 - \bar{x} - \bar{y})$. Because these two matrices are positive definite and non-negative definite, respectively, $-\lambda''$ is positive definite. This argument also holds when $X_i$ has more than three values.

Because the maximum likelihood equations have a single solution, the negative Hessian is positive definite, and the peak becomes sharper as $N$ increases (Eq. 9), all the conditions for the general derivation of the BIC are met. Plugging the maximum likelihood value into Eq. 7, which is correct to $O(1)$, yields Eq. 1.

4 Asymptotics With Hidden Variables

Let us now consider the situation where $S$ contains hidden variables. In this case, we cannot use the derivation in the previous section, because the log-likelihood function $\log p(D_N \mid S^h, \theta)$ does not necessarily tend toward a peak as the sample size increases. Instead, the log-likelihood function can tend toward a ridge. Consider, for example, a network with one arc $H \to X$, where $H$ has two values $h$ and $\bar{h}$ and $X$ has two values $x$ and $\bar{x}$. Assume that only values of $X$ are observed; that is, $H$ is hidden. Then the likelihood function is given by $\prod_l w^{x_l} (1 - w)^{1 - x_l}$, where $w = \theta_h \theta_{x|h} + (1 - \theta_h) \theta_{x|\bar{h}}$ and $x_l$ is the indicator function that equals one if $X$ gets value $x$ in case $l$ and zero otherwise. The parameter $w$ is the true probability that $X = x$ unconditionally. The ML value is unique in terms of $w$: it attains its maximum when $w = \sum_l x_l / N$. Nonetheless, any solution for $\theta$ to the equation

$$\sum_l x_l / N = \theta_h \theta_{x|h} + (1 - \theta_h) \theta_{x|\bar{h}}$$

will maximize the likelihood of the data. In this sense, the network structure $H \to X$ has only one non-redundant parameter. In this section, we provide an informal argument describing how to identify a set of non-redundant parameters for any Bayesian network with hidden variables.
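Anticipating the machinery developed next, this one-parameter claim can be verified mechanically: the map from the three network parameters to the single observable parameter $w$ has a Jacobian of rank 1. In the Mathematica sketch below (ours), th, txh, and txb stand for $\theta_h$, $\theta_{x|h}$, and $\theta_{x|\bar{h}}$, and the evaluation point is arbitrary.

    (* Hypothetical sketch: the network H -> X has dimension 1. *)
    w = th txh + (1 - th) txb;
    J = Outer[D, {w}, {th, txh, txb}];   (* = {{txh - txb, th, 1 - th}} *)
    MatrixRank[J /. {th -> 1/3, txh -> 1/5, txb -> 2/7}]   (* returns 1 *)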
Given a Bayesian network for domain $X$ with observable variables $O \subset X$, let $W$ denote the parameters of the true joint probability distribution of $O$. Corresponding to every value of $\theta$ is a value of $W$. That is, $S$ defines a (smooth) map $g$ from $\theta$ to $W$. The range of $g$ is a curved manifold $M$ in the space defined by $W$.¹ Now consider $g(\hat{\theta})$, the image of all ML values of $\theta$. In a small region around $g(\hat{\theta})$, the manifold $M$ will resemble Euclidean space with some dimension $d$. That is, in a small region around $g(\hat{\theta})$, $M$ will look like $R^d$ with orthogonal coordinates $\Phi = \{\phi_1, \ldots, \phi_d\}$. Thus, the log-likelihood function written as a function of $\Phi$, namely $\log p(D_N \mid \Phi)$, will become peaked as the sample size increases, and we can apply the BIC approximation:

$$\log p(D_N \mid S^h) \approx \log p(D_N \mid \hat{\Phi}, S^h) - \frac{d}{2} \log N \quad (10)$$

Note that $\log p(D_N \mid \hat{\Phi}, S^h) = \log p(D_N \mid \hat{\theta}, S^h)$.

¹ For terminology and basic facts in differential geometry, see Spivak (1979).

It remains to understand what $d$ is and how it can be found. When considering a linear transformation $j: R^n \to R^m$, the transformation is represented by an $m \times n$ matrix. The dimension $d$ of the image of $j$ equals the rank of the matrix. When $k: R^n \to R^m$ is a smooth mapping, it can be approximated locally as a linear transformation, where the Jacobian matrix $J(x)$ serves as the linear transformation matrix for the neighborhood of $x \in R^n$. The dimension of the image of $k$ in a small region around $k(x)$ is the rank of $J(x)$ (Spivak, 1979). This observation holds when the rank of the Jacobian matrix does not change in a small ball around $x$, in which case $x$ is called a regular point.

Returning to our problem, the mapping from $\theta$ to $W$ is a polynomial function of $\theta$. Thus, as the next theorem shows, the rank of the Jacobian matrix $[\partial W / \partial \theta]$ is almost everywhere some fixed constant $d$, which we call the regular rank of the Jacobian matrix. This rank is the number of non-redundant parameters of $S$; that is, the dimension of $S$.

Theorem 1. Let $\theta$ be the parameters of a network $S$ for variables $X$ with observable variables $O \subset X$. Let $W$ be the parameters of the true joint distribution of the observable variables. If each parameter in $W$ is a polynomial function of $\theta$, then $\mathrm{rank}\left[\frac{\partial W}{\partial \theta}(\theta)\right] = d$ almost everywhere, where $d$ is a constant.

Proof: Because the mapping from $\theta$ to $W$ is polynomial, each entry in the matrix $J(\theta) = \left[\frac{\partial W}{\partial \theta}(\theta)\right]$ is a polynomial in $\theta$. When diagonalizing $J$, the leading elements of the first $d$ rows remain polynomials in $\theta$, whereas all other rows, which are dependent given every value of $\theta$, become identically zero. The rank of $J(\theta)$ falls below $d$ only for values of $\theta$ that are roots of some of the polynomials in the diagonalized matrix. The set of all such roots has measure zero. $\Box$

Our heuristic argument for Eq. 10 does not provide us with the error term. If the image manifold is too curved, it might be possible that the local region will never become "sufficiently flat" to obtain an $O(1)$ bound on the error of the approximate marginal likelihood. We conjecture that, for manifolds corresponding to Bayesian networks with hidden variables, the local region will always be sufficiently flat.
Researchers have shown that $O(1)$ bounds are attainable for a variety of statistical models (e.g., Schwarz, 1978, and Haughton, 1988). Although the arguments of these researchers do not directly apply to our case, it may be possible to extend their methods to prove our conjecture.

5 Computations of the Rank

We have argued that the second term of the BIC for Bayesian networks with hidden variables is the rank of the Jacobian matrix of the transformation between the parameters of the network and the parameters of the observable variables. In this section, we explain how to compute this rank, and demonstrate the approach with several examples.

Theorem 1 suggests a randomized algorithm for calculating the rank. Compute the Jacobian matrix $J(\theta)$ symbolically from the equation $W = g(\theta)$; this computation is possible since $g$ is a vector of polynomials in $\theta$. Then assign a random value to $\theta$ and diagonalize the numeric matrix $J(\theta)$. Theorem 1 guarantees that, with probability 1, the resulting rank is the regular rank of $J$. For every network, select, say, ten values for $\theta$, and determine $r$ to be the maximum of the resulting ranks. In all our experiments, none of the randomly chosen values for $\theta$ accidentally reduced the rank.

We now demonstrate the computation of the needed rank for a naive Bayes model with one hidden variable $H$ and two feature variables $X_1$ and $X_2$. Assume all three variables are binary. The set of parameters $W = g(\theta)$ is given by

$$w_{x_1 x_2} = \theta_h \theta_{x_1|h} \theta_{x_2|h} + (1 - \theta_h)\, \theta_{x_1|\bar{h}} \theta_{x_2|\bar{h}}$$

$$w_{\bar{x}_1 x_2} = \theta_h (1 - \theta_{x_1|h})\, \theta_{x_2|h} + (1 - \theta_h)(1 - \theta_{x_1|\bar{h}})\, \theta_{x_2|\bar{h}}$$

$$w_{x_1 \bar{x}_2} = \theta_h \theta_{x_1|h} (1 - \theta_{x_2|h}) + (1 - \theta_h)\, \theta_{x_1|\bar{h}} (1 - \theta_{x_2|\bar{h}})$$

The $3 \times 5$ Jacobian matrix for this transformation is given in Figure 1, where $\theta_{\bar{x}_i|h} = 1 - \theta_{x_i|h}$ ($i = 1, 2$). The columns correspond to differentiation with respect to $\theta_{x_1|h}$, $\theta_{x_2|h}$, $\theta_{x_1|\bar{h}}$, $\theta_{x_2|\bar{h}}$, and $\theta_h$, respectively. A symbolic computation of the rank of this matrix can be carried out, and it shows that the regular rank is equal to the dimension of the matrix, namely 3. Nonetheless, as we have argued, in order to compute the regular rank, one can simply choose random values for $\theta$ and diagonalize the resulting numerical matrix. We have done so for naive Bayes models with one binary hidden root node and $n \le 7$ binary observable non-root nodes. The size of the associated matrices is $(1 + 2n) \times (2^n - 1)$. The regular rank for $n = 3, \ldots, 7$ was found to be $1 + 2n$. We conjecture that $1 + 2n$ is the regular rank for all $n > 2$. For $n = 1, 2$, the rank is 1 and 3, respectively, which is the size of the full parameter space over one and two binary variables. The rank cannot be greater than $1 + 2n$ because this is the maximum possible dimension of the Jacobian matrix. In fact, we have proven a lower bound of $2n$ as well.

Theorem 2. Let $S$ be a naive Bayes model with one binary hidden root node and $n > 2$ binary observable non-root nodes. Then

$$2n \le r \le 2n + 1$$

where $r$ is the regular rank of the Jacobian matrix between the parameters of the network and the parameters of the feature variables.

The proof is obtained by diagonalizing the Jacobian matrix symbolically, and showing that there are at least $2n$ independent rows.

$$\begin{pmatrix} \theta_h \theta_{x_2|h} & \theta_h \theta_{x_1|h} & (1-\theta_h)\,\theta_{x_2|\bar{h}} & (1-\theta_h)\,\theta_{x_1|\bar{h}} & \theta_{x_1|h}\theta_{x_2|h} - \theta_{x_1|\bar{h}}\theta_{x_2|\bar{h}} \\ -\theta_h \theta_{x_2|h} & \theta_h \theta_{\bar{x}_1|h} & -(1-\theta_h)\,\theta_{x_2|\bar{h}} & (1-\theta_h)\,\theta_{\bar{x}_1|\bar{h}} & \theta_{\bar{x}_1|h}\theta_{x_2|h} - \theta_{\bar{x}_1|\bar{h}}\theta_{x_2|\bar{h}} \\ \theta_h \theta_{\bar{x}_2|h} & -\theta_h \theta_{x_1|h} & (1-\theta_h)\,\theta_{\bar{x}_2|\bar{h}} & -(1-\theta_h)\,\theta_{x_1|\bar{h}} & \theta_{x_1|h}\theta_{\bar{x}_2|h} - \theta_{x_1|\bar{h}}\theta_{\bar{x}_2|\bar{h}} \end{pmatrix}$$

Figure 1: The Jacobian matrix for a naive Bayesian network with two binary feature nodes.
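The randomized rank computation is easy to reproduce. The Mathematica sketch below (ours, not the paper's listing) builds the map $W = g(\theta)$ for a binary naive Bayes model with $n$ features, drops one joint probability (the $2^n$ cells sum to one), and evaluates the Jacobian at a random rational point. The names ph[i] and qh[i] stand for $\theta_{x_i|h}$ and $\theta_{x_i|\bar{h}}$.

    (* Hypothetical sketch of the randomized rank algorithm for a naive
       Bayes model with one binary hidden root and n binary features. *)
    naiveBayesRank[n_] := Module[{th, p, q, w, params},
      p = Array[ph, n];   (* ph[i] = P(Xi = 1 | h)    *)
      q = Array[qh, n];   (* qh[i] = P(Xi = 1 | hbar) *)
      w = Most[Table[
        th Product[If[s[[i]] == 1, p[[i]], 1 - p[[i]]], {i, n}] +
          (1 - th) Product[If[s[[i]] == 1, q[[i]], 1 - q[[i]]], {i, n}],
        {s, Tuples[{1, 0}, n]}]];
      params = Join[{th}, p, q];
      MatrixRank[Outer[D, w, params] /.
        Thread[params -> Table[RandomInteger[{1, 999}]/1000, {2 n + 1}]]]]

    naiveBayesRank /@ {2, 3, 4, 5}   (* {3, 7, 9, 11}: rank 2n + 1 for n >= 3 *)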
The computation for $3 \le n \le 7$ shows that, for naive Bayes models with a binary hidden root node, there are no redundant parameters. Therefore, the best way to represent a probability distribution that is representable by such a model is to use the network representation explicitly. Nonetheless, this result does not hold for all models. For example, consider the following W structure:

$$A \to C \leftarrow H \to D \leftarrow B$$

where $H$ is hidden. Assuming all five variables are binary, the space over the observables is representable by 15 parameters, and the number of parameters of the network is 11. In this example, we could not compute the rank symbolically. Instead, we used the following Mathematica code. There are 16 functions (only 15 of which are independent) defined by $W = g(\theta)$. In the Mathematica code, we use fijkl for the true joint probability $w_{a=i, b=j, c=k, d=l}$, cij for the true conditional probability $\theta_{c=0 \mid a=i, h=j}$, dij for $\theta_{d=0 \mid b=i, h=j}$, a for $\theta_{a=0}$, b for $\theta_{b=0}$, and h0 for $\theta_{h=0}$. The first function is given by

    f0000[a_, b_, h0_, c00_, ..., c11_, d00_, ..., d11_] :=
      a*b*(h0*c00*d00 + (1 - h0)*c01*d01)

and the other functions are similarly written. The Jacobian matrix is computed by the command Outer, which has three arguments: the first is D, which stands for the differentiation operator; the second is a list of functions; and the third is a list of variables.

    J[a_, b_, h0_, c00_, ..., c11_, d00_, ..., d11_] :=
      Outer[D,
        {f0000[a, b, h0, c00, c01, ..., d11],
         f0001[a, b, h0, c00, ..., c11, d00, ..., d11],
         ...,
         f1111[a, b, h0, c00, ..., c11, d00, ..., d11]},
        {a, b, h0, c00, c01, c10, c11, d00, d01, d10, d11}]

The next command produces a diagonalized matrix at a random point with a precision of 30 decimal digits. This precision was selected so that matrix elements equal to zero would be correctly identified as such.

    N[RowReduce[J[a, b, h0, c00, ..., c11, d00, ..., d11] /.
        {a -> Random[Integer, {1, 999}]/1000,
         b -> Random[Integer, {1, 999}]/1000,
         ...,
         d11 -> Random[Integer, {1, 999}]/1000}], 30]

The result of this Mathematica program was a diagonalized matrix with 9 non-zero rows and 7 rows containing all zeros. The same counts were obtained in ten runs of the program. Hence, the regular rank of this Jacobian matrix is 9 with probability 1.

The interpretation of this result is that, around almost every value of $\theta$, one can locally represent the hidden W structure with only 9 parameters. In contrast, if we encode the distribution using the network parameters ($\theta$) of the W structure, then we must use 11 parameters. Thus, two of the network parameters are locally redundant.
The BIC approximation punishes this W structure according to its most efficient representation, which uses 9 parameters, and not according to the representation given by the W structure, which requires 11 parameters.

It is interesting to note that the dimension of the W structure is 10 if $H$ has three or four states, and 11 if $H$ has 5 states. We do not know how to predict when the dimension changes as a result of increasing the number of hidden states without computing the dimension explicitly. Nonetheless, the dimension cannot increase beyond 12, because we can average out the hidden variable in the W structure (e.g., using arc reversals) to obtain another network structure that has only 12 parameters.

6 AutoClass

The AutoClass clustering algorithm developed by Cheeseman and Stutz (1995) uses a naive Bayes model.² Each state of the hidden root node $H$ represents a cluster or class, and each observable node represents a measurable feature. The number of classes $k$ is unknown a priori. AutoClass computes an approximation of the marginal likelihood of a naive Bayes model given the data using increasing values of $k$. When this probability reaches a peak for a specific $k$, that $k$ is selected as the number of classes. Cheeseman and Stutz (1995) use the following formula to approximate the marginal likelihood:

$$\log p(D \mid S) \approx \log p(D_c \mid S) + \log p(D \mid S, \hat{\theta}_s) - \log p(D_c \mid S, \hat{\theta}_s)$$

where $D_c$ is a database consistent with the expected sufficient statistics as computed by the EM algorithm. Although Cheeseman and Stutz suggested this approximation in the context of simple AutoClass models, it can be used to score any Bayesian network with discrete variables, as well as other models [Chickering and Heckerman, 1996]. We call this approximation the CS scoring function.

² The algorithm can handle conditional dependencies among continuous variables.

Using the BIC approximation for $p(D_c \mid S)$, we obtain

$$\log p(D \mid S) \approx \log p(D \mid S, \hat{\theta}_s) - \frac{d'}{2} \log N$$

where $d'$ is the number of parameters of the network. (Given a naive Bayes model with $k$ classes and $n$ observable variables each with $b$ states, $d' = nk(b - 1) + k - 1$.) Therefore, the CS scoring function will converge asymptotically to the BIC, and hence to $p(D \mid S)$, whenever $d'$ is equal to the regular rank $d$ of $S$. Given our conjecture in the previous section, we believe that the CS scoring function will converge to $p(D \mid S)$ when the number of classes is two. Nonetheless, $d'$ is not always equal to $d$. For example, when $b = 2$, $k = 3$, and $n = 4$, the number of parameters is 14, but the regular rank of the Jacobian matrix is 13. We computed this rank using Mathematica as described in the previous section. Consequently, the CS scoring function will not always converge to $p(D \mid S)$. This example is the only one that we have found so far, and we believe that incorrect results are obtained only for rare combinations of $b$, $k$, and $n$. Nonetheless, a simple modification to the CS scoring function yields an approximation that will asymptotically converge to $p(D \mid S)$:

$$\log p(D \mid S) \approx \log p(D_c \mid S) + \log p(D \mid S, \hat{\theta}_s) - \log p(D_c \mid S, \hat{\theta}_s) - \frac{d}{2} \log N + \frac{d'}{2} \log N$$

Chickering and Heckerman (1996) show that this scoring function is often a better approximation for $p(D \mid S)$ than is the BIC.
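The $b = 2$, $k = 3$, $n = 4$ counterexample can be checked with the same randomized recipe. The Mathematica sketch below (ours) generalizes the earlier naiveBayesRank to $k$ classes; the names cl and pf are ours, and with probability 1 the computed rank is the regular rank.

    (* Hypothetical sketch: parameter count d' versus regular rank d for a
       naive Bayes model with k = 3 classes and n = 4 binary features. *)
    k = 3; n = 4;
    c = Array[cl, k - 1];            (* free class probabilities *)
    p = Array[pf, {k, n}];           (* pf[j, i] = P(Xi = 1 | class j) *)
    classProb[j_] := If[j < k, c[[j]], 1 - Total[c]];
    w = Most[Table[
      Sum[classProb[j] Product[If[s[[i]] == 1, p[[j, i]], 1 - p[[j, i]]],
          {i, n}], {j, k}],
      {s, Tuples[{1, 0}, n]}]];
    params = Join[c, Flatten[p]];    (* d' = n k (b - 1) + k - 1 = 14 *)
    MatrixRank[Outer[D, w, params] /.
      Thread[params -> Table[RandomInteger[{1, 999}]/1000, {Length[params]}]]]
    (* returns 13: one fewer than d' = 14 *)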
7 Gaussian Networks

In this section, we consider the case where each of the variables $X = \{X_1, \ldots, X_n\}$ is continuous. As before, let $(S, \theta_s)$ be a Bayesian network, where $S$ is the network structure of the Bayesian network and $\theta_s$ is a set of parameters associated with the network structure. A Gaussian network is one in which the joint likelihood is that of a multivariate Gaussian distribution that is a product of local likelihoods. Each local likelihood is the linear regression model

$$p(x_i \mid pa_i, \theta_i, S) = N\!\left(m_i + \sum_{X_j \in Pa_i} b_{ji} x_j,\; v_i\right)$$

where $N(\mu, v)$ is a normal (Gaussian) distribution with mean $\mu$ and variance $v > 0$, $m_i$ is a conditional mean of $X_i$, $b_{ji}$ is a coefficient that represents the strength of the relationship between variable $X_j$ and $X_i$, $v_i$ is a variance,³ and $\theta_i$ is the set of parameters consisting of $m_i$, $v_i$, and the $b_{ji}$. The parameter set $\theta_s$ of a Gaussian network with structure $S$ is the set of all $\theta_i$.

³ $m_i$ is the mean of $X_i$ conditional on all parents being zero, $b_{ji}$ corresponds to the partial regression coefficient of $X_i$ on $X_j$ given the other parents of $X_i$, and $v_i$ corresponds to the residual variance of $X_i$ given the parents of $X_i$.

To apply the techniques developed in this paper, we also need to specify the parameters of the observable variables. Given that the joint distribution is multivariate normal and that multivariate normal distributions are closed under marginalization, we only need to specify a vector of means for the observed variables and a covariance matrix over the observed variables. In addition, we need to specify how to transform the parameters of the network to the observable parameters. The transformation of the means and the transformation to obtain the observable covariance matrix can be accomplished via the trek-sum rule (for a discussion, see Glymour et al. 1987). Using the trek-sum rule, it is easy to show that the observable parameters are all sums of products of the network parameters. Given that the mapping from $\theta_s$ to the observable parameters $W$ is a polynomial function of $\theta_s$, it follows from Thm. 1 that the rank of the Jacobian matrix $[\partial W / \partial \theta_s]$ is almost everywhere some fixed constant $d$, which we again call the regular rank of the Jacobian matrix. This rank is the number of non-redundant parameters of $S$; that is, the dimension of $S$.

Let us consider two Gaussian models. We use Mathematica code similar to the code in Section 5 to compute their dimensions, because we cannot perform the computation symbolically. As in the previous experiments, none of the randomly chosen values of $\theta_s$ accidentally reduces the rank.

Our first example is the naive Bayes model in which the hidden variable $H$ is the parent of the four observed variables $X_1, \ldots, X_4$. There are 14 network parameters: 5 conditional variances, 5 conditional means, and 4 linear parameters. The marginal distribution for the observed variables also has 14 parameters: 4 means, 4 variances, and 6 covariances. Nonetheless, the analysis of the rank of the Jacobian matrix tells us that the dimension of this model is 12. This follows from the fact that this model imposes tetrad constraints (see Glymour et al. 1987).
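A sketch (ours) of the corresponding rank computation follows. By the trek-sum rule, the observable moments of this model are $E[X_i] = m_i + b_i m_H$, $\mathrm{Var}[X_i] = v_i + b_i^2 v_H$, and $\mathrm{Cov}[X_i, X_j] = b_i b_j v_H$; the symbol names in the code are ours.

    (* Hypothetical sketch: dimension of the Gaussian naive Bayes model
       with hidden root H and observed children X1..X4. *)
    m = Array[mm, 4]; v = Array[vv, 4]; b = Array[bb, 4];
    means = Table[m[[i]] + b[[i]] mh, {i, 4}];
    vars = Table[v[[i]] + b[[i]]^2 vh, {i, 4}];
    covs = Flatten[Table[b[[i]] b[[j]] vh, {i, 4}, {j, i + 1, 4}]];
    W = Join[means, vars, covs];        (* 14 observable parameters *)
    params = Join[{mh, vh}, m, v, b];   (* 14 network parameters    *)
    MatrixRank[Outer[D, W, params] /.
      Thread[params -> Table[RandomInteger[{1, 999}]/1000, {14}]]]
    (* returns 12: the two independent tetrad constraints remove two
       dimensions *)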
In this model, the three tetrad constraints that hold in the distribution over the observed variables are

$$\mathrm{cov}(X_1, X_2)\,\mathrm{cov}(X_3, X_4) - \mathrm{cov}(X_1, X_3)\,\mathrm{cov}(X_2, X_4) = 0$$

$$\mathrm{cov}(X_1, X_4)\,\mathrm{cov}(X_2, X_3) - \mathrm{cov}(X_1, X_3)\,\mathrm{cov}(X_2, X_4) = 0$$

$$\mathrm{cov}(X_1, X_4)\,\mathrm{cov}(X_2, X_3) - \mathrm{cov}(X_1, X_2)\,\mathrm{cov}(X_3, X_4) = 0$$

two of which are independent. These two independent tetrad constraints lead to the reduction of dimensionality.

Our second example is the W structure described in Section 5, where each of the variables is continuous. There are 14 network parameters: 5 conditional means, 5 conditional variances, and 4 linear parameters. The marginal distribution for the observed variables has 14 parameters, whereas the analysis of the rank of the Jacobian matrix tells us that the dimension of this model is 12. This coincides with the intuition that many values for the variance of $H$ and the linear parameters for $C \leftarrow H$ and $H \to D$ produce the same model for the observable variables, but once any two of these parameters are appropriately set, the third parameter is uniquely determined by the marginal distribution for the observable variables.

8 Sigmoid Networks

Finally, let us consider the case where each of the variables $X = \{X_1, \ldots, X_n\}$ is binary (discrete), and each local likelihood is the generalized linear model

$$p(x_i \mid pa_i, \theta_i, S) = \mathrm{Sig}\!\left(a_i + \sum_{X_j \in Pa_i} b_{ji} x_j\right)$$

where $\mathrm{Sig}(x)$ is the sigmoid function $\mathrm{Sig}(x) = \frac{1}{1 + e^{-x}}$. These models, which we call sigmoid networks, are useful for learning relationships among discrete variables, because they capture non-linear relationships among variables yet employ only a small number of parameters [Neal, 1992; Saul et al., 1996].

Using techniques similar to those in Section 5, we can compute the rank of the Jacobian matrix $[\partial W / \partial \theta_s]$. We cannot apply Thm. 1 to conclude that this rank is almost everywhere some fixed constant, because the local likelihoods are non-polynomial sigmoid functions. Nonetheless, the claim of Thm. 1 holds also for analytic transformations, hence a regular rank exists for sigmoid networks as well (as confirmed by our experiments).

Our experiments show expected reductions in rank for several sigmoid networks. For example, consider the two-level network in which each of the two hidden variables $H_1$ and $H_2$ is a parent of each of the four observed variables $X_1, \ldots, X_4$. This network has 14 parameters. In each of 10 trials, we found the rank of the Jacobian matrix to be 14, indicating that this model has dimension 14. In contrast, consider the three-level network obtained by adding a hidden variable $H_3$ as a parent of both $H_1$ and $H_2$. This network has 17 parameters, whereas the dimension we compute is 15. This reduction is expected, because we could encode the dependency between the two variables in the middle level by removing the variable in the top layer and adding an arc between these two variables, producing a network with 15 parameters.
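The same randomized recipe applies here, with two changes: the map from $\theta_s$ to $W$ now involves sigmoids rather than polynomials, and the rank is computed numerically at high precision (the paper uses 30 digits for the same reason). Below is a sketch (ours) of the two-level case; all symbol names are ours.

    (* Hypothetical sketch: rank for the two-level sigmoid network with
       hidden H1, H2 and observed X1..X4, each Xi a child of both H's. *)
    sig[x_] := 1/(1 + Exp[-x]);
    bern[prob_, val_] := If[val == 1, prob, 1 - prob];
    pX[i_, x_, h1_, h2_] := bern[sig[cc[i] + dd[i] h1 + ee[i] h2], x];
    w = Most[Table[
      Sum[bern[sig[a1], h1] bern[sig[a2], h2] *
          Product[pX[i, s[[i]], h1, h2], {i, 4}],
        {h1, 0, 1}, {h2, 0, 1}],
      {s, Tuples[{1, 0}, 4]}]];
    params = Join[{a1, a2}, Array[cc, 4], Array[dd, 4], Array[ee, 4]];
    MatrixRank[Outer[D, w, params] /.
      Thread[params -> RandomReal[{-2, 2}, 14, WorkingPrecision -> 30]]]
    (* returned 14 in our trials, matching the dimension reported above *)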
References

[Bouckaert, 1995] Bouckaert, R. (1995). Bayesian belief networks: From construction to inference. PhD thesis, University of Utrecht.

[Buntine, 1996] Buntine, W. (1996). A guide to the literature on learning graphical models. IEEE KDE, to appear.

[Buntine, 1994] Buntine, W. (1994). Operations for learning with graphical models. Journal of Artificial Intelligence Research, 2:159–225.

[Cheeseman and Stutz, 1995] Cheeseman, P. and Stutz, J. (1995). Bayesian classification (AutoClass): Theory and results. In Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., and Uthurusamy, R., editors, Advances in Knowledge Discovery and Data Mining. AAAI Press, Menlo Park, CA.

[Chickering and Heckerman, 1996] Chickering, D. and Heckerman, D. (1996). Efficient approximations for the marginal likelihood of incomplete data given a Bayesian network. In Proceedings of Twelfth Conference on Uncertainty in Artificial Intelligence, Portland, OR. Morgan Kaufmann.

[Cooper and Herskovits, 1992] Cooper, G. and Herskovits, E. (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309–347.

[Geiger and Heckerman, 1995] Geiger, D. and Heckerman, D. (1995). A characterization of the Dirichlet distribution with application to learning Bayesian networks. In Proceedings of Eleventh Conference on Uncertainty in Artificial Intelligence, Montreal, QU, pages 196–207. Morgan Kaufmann.

[Glymour et al., 1987] Glymour, C., Scheines, R., Spirtes, P., and Kelly, K. (1987). Discovering Causal Structure. Academic Press.

[Haughton, 1988] Haughton, D. (1988). On the choice of a model to fit data from an exponential family. Annals of Statistics, 16:342–355.

[Heckerman, 1995a] Heckerman, D. (1995a). A Bayesian approach for learning causal networks. In Proceedings of Eleventh Conference on Uncertainty in Artificial Intelligence, Montreal, QU, pages 285–295. Morgan Kaufmann.

[Heckerman, 1995b] Heckerman, D. (1995b). A tutorial on learning Bayesian networks. Technical Report MSR-TR-95-06, Microsoft, Redmond, WA. Revised January 1996.

[Heckerman et al., 1994] Heckerman, D., Geiger, D., and Chickering, D. (1994). Learning Bayesian networks: The combination of knowledge and statistical data. In Proceedings of Tenth Conference on Uncertainty in Artificial Intelligence, Seattle, WA, pages 293–301. Morgan Kaufmann.

[Heckerman et al., 1995] Heckerman, D., Geiger, D., and Chickering, D. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20:197–243.

[Herskovits, 1991] Herskovits, E. (1991). Computer-based probabilistic network construction. PhD thesis, Medical Information Sciences, Stanford University, Stanford, CA.

[Lam and Bacchus, 1993] Lam, W. and Bacchus, F. (1993). Using causal information and local measures to learn Bayesian networks. In Proceedings of Ninth Conference on Uncertainty in Artificial Intelligence, Washington, DC, pages 243–250. Morgan Kaufmann.

[Neal, 1992] Neal, R. (1992). Connectionist learning of belief networks. Artificial Intelligence, 56:71–113.

[Rissanen, 1987] Rissanen, J. (1987). Stochastic complexity (with discussion). Journal of the Royal Statistical Society, Series B, 49:223–239 and 253–265.

[Saul et al., 1996] Saul, L., Jaakkola, T., and Jordan, M. (1996). Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4:61–76.

[Schwarz, 1978] Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6:461–464.

[Spiegelhalter et al., 1993] Spiegelhalter, D., Dawid, A., Lauritzen, S., and Cowell, R. (1993). Bayesian analysis in expert systems. Statistical Science, 8:219–282.
[Spiegelhalter and Lauritzen, 1990] Spiegelhalter, D. and Lauritzen, S. (1990). Sequential updating of conditional probabilities on directed graphical structures. Networks, 20:579–605.

[Spivak, 1979] Spivak, M. (1979). A Comprehensive Introduction to Differential Geometry, Vol. 1, 2nd edition. Publish or Perish, Berkeley, CA.

[Suzuki, 1993] Suzuki, J. (1993). A construction of Bayesian networks from databases based on an MDL scheme. In Proceedings of Ninth Conference on Uncertainty in Artificial Intelligence, Washington, DC, pages 266–273. Morgan Kaufmann.