Sparse Nested Markov models with Log-linear Parameters

Sparse Nested Mark o v Mo dels with Log-linear P arameters Ily a Shpitser Mathematical Sciences Univ ersity of Southampton i.shpitser@soton.ac.uk Robin J. Ev ans Statistical Lab oratory Cam bridge Universit y rje42@cam.ac.uk Thomas S. Ric hardson Departmen t of Statistics Univ ersity of W ashington thomasr@u.w ashington.edu James M. Robins Departmen t of Epidemiology Harv ard Univ ersity robins@hsph.harv ard.edu Abstract Hidden v ariables are ubiquitous in practi- cal data analysis, and therefore mo deling marginal densities and doing inference with the resulting models is an imp ortan t problem in statistics, machine learning, and causal inference. Recen tly , a new type of graphi- cal mo del, called the nested Marko v mo del, w as developed which captures equality con- strain ts found in marginals of directed acyclic graph (D AG) mo dels. Some of these con- strain ts, suc h as the so called ‘V erma con- strain t’, strictly generalize conditional inde- p endence. T o make mo deling and inference with nested Marko v mo dels practical, it is necessary to limit the n umber of parameters in the mo del, while still correctly capturing the constrain ts in the marginal of a DA G mo del. Placing suc h limits is similar in spirit to sparsity metho ds for undirected graphical mo dels, and regression mo dels. In this paper, w e give a log-linear parameterization which allo ws sparse mo deling with nested Marko v mo dels. W e illustrate the adv antages of this parameterization with a sim ulation study . 1 In tro duction Analysis of complex multidimensional data is often made diﬃcult by the twin problems of hidden v ari- ables, and a dearth of data relative to the dimension of the mo del. The former problem motiv ates the study of marginal and/or latent mo dels, while the latter has resulted in the dev elopment of sparsity metho ds. A particularly app ealing mo del for multidimensional data analysis is the Ba yesian net work or directed acyclic graph (DA G) mo del [10], where random v ari- ables are represented as vertices in the graph, with directed edges (arrows) b et ween them. The p opular- it y of D AG mo dels stems from their well understo od theory , and from the fact that they elicit an in tuitive causal interpretation: an arrow from a v ariable A to a v ariable B in a D AG mo del can b e interpreted, in a w ay whic h can b e made precise, to mean that A is a ‘direct cause’ of B . D AG mo dels assume all v ariables are observ ed, and a laten t v ariable mo del based on D AGs simply re- laxes this assumption. Ho wev er, latent v ariables in tro- duce a n umber of problems: it is diﬃcult to correctly mo del the laten t state, and the resulting marginal den- sities are quite challenging to work with. An alter- nativ e is to enco de constraints found in marginals of D AG mo dels directly; a recen t approac h in this spirit is the nested Marko v mo del [15]. The adv an tage of the nested Marko v mo del is that it correctly captures the conditional indep endences and other equalit y con- strain ts found in marginals of DA G mo dels. How ev er, the discrete parameterization of nested Marko v mo d- els has the disadv antage of b eing unable to represent constrain ts in v arious marginals of DA Gs c oncisely , that is with few non-zero parameters. This implies that mo del selection metho ds based on scoring (via the BIC score [13] for instance) often prefer simpler mo dels which fail to capture indep endences correctly , but whic h contain many fewer parameters [15]. More generally , in high dimensional data analyses there is often such a shortage of samples that clas- sical statistical inference techniques do not work. T o address these issues, sparsity metho ds hav e b een de- v elop ed, which drive as many parameters in the sta- tistical mo del to zero as p ossible, while still providing a reasonable ﬁt to the data. Sparsit y me thods hav e b een developed for regression mo dels [16], undirected graphical mo dels [8, 9], and even some marginal mo d- els [4]. It is not natural to apply sparsity techniques to ex- isting parameterizations of nested Marko v mo dels, b e- cause the parameters are context (or strata) sp eciﬁc. 1 2 3 4 5 6 7 (a) 1 2 3 4 5 (b) Figure 1: (a) A DA G with no des 6 and 7 represent- ing hidden v ariables. (b) An ADMG representing the same conditional indep endences as (a) among the v ari- ables corresp onding to 1 , 2 , 3 , 4 , 5. In this pap er, we develop a log-linear parameteriza- tion for discrete nested Mark ov mo dels, where the pa- rameters represent (generalizations of ) log o dds-ratios within ‘kernels’ (informally ‘interv entional’ densities). These can b e viewed as interaction parameters, of the kind commonly set to zero b y sparsity metho ds. Our parameterization allo ws us to represent distributions con taining ‘V erma constrain ts’ in a sparse wa y , while main taining adv antages of nested Mark ov models, and a voiding the disadv an tages of using marginals of D AG mo dels directly . 2 Disadv antages of the M¨ obius P arameterization of Nested Marko v Mo dels One drawbac k of the standard parameterization of nested Mark o v models is that parameters are v ariation dep enden t; that is, ﬁxing the v alue of one parameter constrains the ‘legal’ v alues of other parameters. This is in direct contrast with parameterizations of D AG mo dels where parameters asso ciated with a particu- lar Marko v factor (a conditional density for a v ariable giv en all its paren ts in the DA G) do not dep end on parameters asso ciated with other Marko v factors. W e illustrate another diﬃcult y with an example. Here, and in subsequen t discussions, we will need to draw distinctions b et ween vertices in graphs, and corre- sp onding random v ariables in distributions or ‘kernels.’ W e use the following notation: v (low ercase) denotes a v ertex, X v the corresp onding random v ariable, and x v a v alue assignmen t to this v ariable. Lik ewise A (upp ercase) denotes a v ertex set, X A the corresp ond- ing random v ariable set, and x A an assignment to this set. Consider the marginal DA G sho wn in Fig. 1 (a). W e wish to av oid representing this domain with a DA G directly , in order not to commit to a particular state space of the unobserv ed v ariables X 6 and X 7 , and b e- cause, ev en if w e w ere willing to mak e suc h an assump- tion, the margin ov er ( X 1 , X 2 , X 3 , X 4 , X 5 ) obtained from a density that factorizes according to this DA G can b e complicated to work with [7]. T o use nested Marko v mo dels for this domain, we ﬁrst construct an acyclic directed mixed graph (ADMG) that represents this DA G marginal, using the latent pro jection algorithm [17]. This graph is shown in Fig. 1 (b); directed arrows in the resulting ADMG repre- sen t directed paths in the D AG where an y intermediate no des are unobserved (in this case there are no such paths, and all directed edges in the ADMG are directly inherited from the D AG). Similarly , bidirected arro ws in the ADMG, such as 2 ↔ 5, represent marginally d- connected paths in the DA G which start and end with arro wheads p oin ting a wa y from the path, in this case 2 ← 6 → 5. If we now use the nested M¨ obius parameters, describ ed in more detail in subsequen t sections, to parameter- ize the resulting ADMG, we will quickly discov er that this results in a mo del of higher dimension relativ e to the dimension of DA G mo dels which share their sk ele- ton with this ADMG. F or example, the binary nested Mark ov mo del of the graph in Fig. 1 (b) has 16 param- eters, while b oth binary DA G mo dels corresp onding to graphs in Fig. 7 (a) and (b) hav e 11 parameters each. This leads to a w orry that a structure learning al- gorithm that tries to use nested M¨ obius parameters to recov er an ADMG from data by means of a score metho d, suc h as BIC [13], which rewards ﬁt and pa- rameter parsimony , may prefer at low sample sizes in- correct indep endence mo dels given by DA Gs in pref- erence to correct mo dels given b y ADMGs, simply b e- cause the DA G mo dels comp ensate for their p oor ﬁt of the data with a m uch smaller parameter count. In fact, this precise issue has b een observ ed in simulation studies rep orted in [15]. Addressing this problem with a M¨ obius parameteriza- tion is not easy , b ecause M¨ obius parameters are strata or context-speciﬁc; in other words, the parameteriza- tion is not indep endent of how the states are lab eled. F or instance, some of the M¨ obius parameters repre- sen ting confounding b et ween X 2 , X 4 and X 5 are: 1 θ { 2 , 4 , 5 } ( x 1 , x 3 ) = p (0 4 , 0 5 | x 3 , 0 2 , x 1 ) p (0 2 | x 1 ) for all v alues of x 1 , x 3 . In a binary mo del, this giv es 4 parameters. The kinds of regularities in the true gen- erativ e pro cess, whic h w e ma y w ant to exploit to cre- ate a dimension reduction in our mo del, typically in- v olve a lac k of in teractions among v ariables, or a laten t confounder with a low dimensional state space. Suc h regularities may often not translate into constraints naturally expressible in terms of M¨ obius parameters. T o av oid this diﬃculty , we need to construct parame- ters for nested Mark ov models which represen t v arious 1 T o sav e space, here and elsewhere we will write 1 i for an assignment of X i to 1, and 0 i for an assignment to 0. t yp es of interactions among v ariables directly . In fact, parameters represen ting in teractions are w ell kno wn in log-linear mo dels, of which undirected graphical mo d- els and certain regression mo dels form a sp ecial case. 3 Log-linear P arameters for Undirected Mo dels W e will use undirected graphical mo dels, also known as Marko v random ﬁelds, to illustrate log-linear mod- els. A Marko v random ﬁeld ov er a multiv ariate binary state space X V , is a set of densities p ( x V ) represented b y an undirected graph G with vertices V , where p ( x V ) = exp   X C ∈ cl( G ) ( − 1) k x C k 1 λ C   ; here cl( G ) is the collection of (not necessarily maximal) cliques in the undirected graph, k · k 1 is the L 1 -norm, and λ C is a lo g-line ar p ar ameter . Note that the pa- rameter λ ∅ ensures the expression is normalized. Consider the undirected graph sho wn in Fig. 2. In this graph, all subsets of { 1 , 2 , 3 } , { 2 , 4 } , and { 4 , 5 , 6 } are cliques. The mo del represents densities where, condi- tional upon its adjacent no des, each no de is indepen- den t of all others. The log-linear parameter(s) corre- sp onding to eac h such subset of size k can b e viewed as represen ting k -w ay interactions among appropriate v ariables in the mo del. Setting some such interaction parameters to zero in a consistent wa y results in a mo del whic h still asserts the same conditional indepen- dences, but has a smaller parameter count, and with all strata in each clique treated symmetrically . F or in- stance, if we were to set all parameters for cliques of size k > 2 to zero, so that there remained only param- eters corresp onding to v ertices and individual edges, w e would obtain a mo del known as a Boltzmann ma- c hine [1]. A similar idea had b een used to give a sparse parameterization for discrete D AG mo dels [12]. In the remainder of the pap er, we describ e nested Mark ov mo dels, and giv e a log-linear parameteriza- tion for these mo dels which contains similar parame- ters that may b e set to zero. While in Marko v ran- dom ﬁeld mo dels the parameters are asso ciated with sets of nodes which form cliques in the corresp onding undirected graph, in nested Marko v mo dels parame- ters will b e asso ciated with sp ecial sets of no des in the corresp onding ADMG called intrinsic sets . F urther, log-linear parameterizations of this t yp e can often in- corp orate individual-level contin uous baseline co v ari- ates [5]. 1 2 3 4 5 6 Figure 2: An undirected graph representing a log- linear mo del. 4 Graphs, Kernels, and Nested Mark o v Mo dels W e now introduce the relev an t bac kground needed to deﬁne the nested Mark ov mo del. A dir e cte d mixe d gr aph G ( V , E ) is a graph with a s et of v ertices V and a set of edges E , where the edges ma y b e directed ( → ) or bidirected ( ↔ ). A directed cycle is a path of the form x → · · · → y along with an edge y → x . An acyclic directed mixed graph (ADMG) is a mixed graph containing no directed cycles. An example is giv en in Fig. 1 (b). Let a , b and d b e v ertices in a mixed graph G . If b → a then we say that b is a p ar ent of a , and a is a child of b . If a ↔ b then a is said to b e a sp ouse of b . A vertex a is said to b e an anc estor of a v ertex d if either there is a directed path a → · · · → d from a to d , or a = d ; similarly d is said to b e a desc endant of a . The sets of paren ts, c hildren, sp ouses, ancestors and descendants of a in G are written pa G ( a ), c h G ( a ), sp G ( a ), an G ( a ), de G ( a ) resp ectively . W e apply these deﬁnitions dis- junctiv ely to sets, e.g. an G ( A ) = S a ∈ A an G ( a ). 4.1 Conditional ADMGs A c onditional acyclic directed mixed graph (CADMG) G ( V , W , E ) is an ADMG with a vertex set V ∪ W , where V ∩ W = ∅ , sub ject to the restriction that for all w ∈ W , pa G ( w ) = ∅ = sp G ( w ). The vertices in V are the r andom vertices, and those in W are called ﬁxe d . Whereas an ADMG with vertex set V represents a join t densit y p ( x V ), a conditional ADMG represents the Marko v structure of a conditional densit y , or k ernel q V ( x V | x W ). F ollo wing [8, p.46], w e deﬁne a kernel to b e a non-negative function q V ( x V | x W ) satisfying: X x V q V ( x V | x W ) = 1 for all x W . (1) W e use the term ‘k ernel’ and write q V ( ·|· ) (rather than p ( ·|· )) to emphasize that these functions, though they satisfy (1) and thus most prop erties of conditional den- sities, are not in general formed via the usual op eration of conditioning on the ev ent X W = x W . T o conform with standard notation for densities, for ev ery A ⊆ V let q V ( x A | x W ) ≡ X V \ A q V ( x V | x W ) , q V ( x V \ A | x W ∪ A ) ≡ q V ( x V | x W ) q V ( x A | x W ) . F or a CADMG G ( V , W , E ) w e consider collections of random v ariables ( X v ) v ∈ V indexed by v ariables ( X w ) w ∈ W ; throughout this pap er the random v ari- ables take v alues in ﬁnite discrete sets ( X v ) v ∈ V and ( X w ) w ∈ W . F or A ⊆ V ∪ W we let X A ≡ × u ∈ A ( X u ), and X A ≡ ( X v ) v ∈ A . That we will alwa ys hold the v ariables in W ﬁxed is precisely why w e do not p ermit edges b et w een vertices in W . An ADMG G ( V , E ) may b e seen as a CADMG in whic h W = ∅ . In this manner, though we will state sub- sequen t deﬁnitions for CADMGs, they also apply to ADMGs. The induc e d sub gr aph of a CADMG G ( V , W, E ) giv en b y a set A , denoted G A , consists of G ( V ∩ A, W ∩ A, E A ) where E A is the set of edges in G with b oth endp oin ts in A . In forming G A , the status of the vertices in A with regard to whether they are in V or W is pre- serv ed. 4.2 Districts and Mark ov Blank ets A set C is c onne cte d in G if ev ery pair of v ertices in C is joined by a path such that every vertex on the path is in C . F or a given CADMG G ( V , W , E ), denote by ( G ) ↔ the CADMG formed by removing all directed edges from G . A set connected in ( G ) ↔ is called bidir e cte d c onne cte d . F or a v ertex x ∈ V , the district (or c-comp onen t) of x , denoted by dis G ( x ), is the maximal bidirected con- nected set containing x . F or instance in the ADMG sho wn in Fig. 1 (b), the district of no de 2 is { 2 , 4 , 5 } . Districts in a CADMG form a partition of V ; vertices in W are excluded by deﬁnition. In a DA G G ( V , E ) the set of districts is the set of all single element sets { v } ⊆ V . A set of vertices A in G is called anc estr al if a ∈ A ⇒ an G ( a ) ⊆ A . In a CADMG G ( V , W , E ), if A is an ancestral subset of V ∪ W in G , t ∈ A ∩ V , and ch G ( t ) ∩ A = ∅ , then the Markov blanket of t in A is deﬁned as: m b G ( t, A ) ≡ pa G  dis G A ( t )  ∪  dis G A ( t ) \ { t }  . 4.3 The ﬁxing op eration and ﬁxable vertices W e no w introduce a ‘ﬁxing’ op eration on a CADMG whic h has the eﬀect of transforming a random v ertex 1 2 3 4 5 (a) 1 2 3 4 (b) Figure 3: (a) The graph from Fig. 1 (b) after ﬁxing 3. (b) An ADMG inducing a non-trivial nested Marko v mo del. in to a ﬁxed vertex, thereby changing the graph. Ho w- ev er, this op eration is only applicable to a subset of the vertices, which we term the (p oten tially) ﬁxable v ertices. Deﬁnition 1 Given a CADMG G ( V , W, E ) the set of ﬁxable v ertices is F ( G ) ≡ { v | v ∈ V , dis G ( v ) ∩ de G ( v ) = { v }} . In words, v is ﬁxable in G if there is no vertex v ∗ that is b oth a descendant of v and in the same district as v . F or the graph in Fig. 1 (b), the vertex 2 is not ﬁxable, b ecause 4 is b oth its descendant and in the same district; all the other v ertices are ﬁxable. Deﬁnition 2 Given a CADMG G ( V , W, E ) , and a kernel q V ( X V | X W ) , with every r ∈ F ( G ) we asso ciate a ﬁxing transformation φ r on the p air ( G , q V ( X V | X W )) deﬁne d as fol lows: φ r ( G ) ≡ G ∗ ( V \ { r } , W ∪ { r } , E r ) , wher e E r is the subset of e dges in E that do not have arr owhe ads into r , and φ r ( q V ( x V | x W ); G ) ≡ q V ( x V | x W ) q V ( x r | x mb G ( r, an G (dis G ( r ))) ) Returning to the ADMG in Fig. 1 (b), ﬁxing 3 in the graph means removing the edge 2 → 3, while ﬁxing x 3 in p ( x 1 , x 2 , x 3 , x 4 , x 5 ) means dividing this marginal densit y by q 1 , 2 , 3 , 4 , 5 ( x 3 | x 2 ) = p ( x 3 | x 2 ). The resulting CADMG, shown in Fig. 3 (a), represents the resulting k ernel q 1 , 2 , 4 , 5 ( x 1 , x 2 , x 4 , x 5 | x 3 ). W e use ◦ to indicate comp osition of operations in the natural w ay , so that: φ r ◦ φ s ( G ) ≡ φ r ( φ s ( G )) and φ r ◦ φ s ( q V ( X V | X W ); G ) ≡ φ r ( φ s ( q V ( X V | X W ); G ) ; φ s ( G )) . 4.4 Reac hable and Intrinsic Sets In order to deﬁne the nested Marko v mo del, we will need to deﬁne sp ecial classes of v ertex sets in ADMGs. Deﬁnition 3 A CADMG G ( V , W ) is reachable fr om an ADMG G ∗ ( V ∪ W ) if ther e is an or dering of the ver- tic es in W = h w 1 , . . . , w k i , such that for j = 1 , . . . , k , w 1 ∈ F ( G ∗ ) and for j = 2 , . . . , k , w j ∈ F ( φ w j − 1 ◦ · · · ◦ φ w 1 ( G ∗ )) . A subgraph is reachable if, under some ordering, each v ertex w i is ﬁxable in φ w i − 1 ( · · · φ w 2 ( φ w 1 ( G ∗ )) · · · ). Fixing operations do not in general comm ute, and thus only some orderings are v alid for ﬁxing a particular set. F or example, in the ADMG shown in Fig. 1 (b), the set { 2 , 3 } may b e ﬁxed, but only under the ordering where 3 is ﬁxed ﬁrst, to yield the CADMG shown in Fig. 3 (a), and then 2 is ﬁxed in this CADMG. Fixing 2 ﬁrst in Fig. 1 (b) is not v alid, b ecause 4 is b oth a descendan t of 2 and in the same district as 2 in that graph, and th us 2 is not ﬁxable. If a CADMG G ( V , W ) is reachable from G ∗ ( V ∪ W ), w e say that the set V is reac hable in G ∗ . A reac hable set ma y be obtained by ﬁxing v ertices using more than one v alid sequence. W e will denote an y v alid comp o- sition of ﬁxing op erations that ﬁxes a set A by φ A if applied to the graph, and by φ X A if applied to a k ernel. With a sligh t abuse of notation (though justiﬁed as we will later see) we suppress the precise ﬁxing sequence c hosen. Deﬁnition 4 A set of vertic es S is in trinsic in G if it is a district in a r e achable sub gr aph of G . The set of intrinsic sets in an ADMG G is denote d by I ( G ) . F or example, in the graph in Fig. 1 (b), the set { 2 , 4 , 5 } is intrinsic (and reachable), while the set { 1 , 2 , 4 , 5 } is reac hable but not intrinsic. In any DA G G ( V , E ), I ( G ) = {{ x }| x ∈ V } , while in an y bidirected graph G , I ( G ) is equal to the set of all connected sets in G . 4.5 Nested Marko v Mo dels Just as for DA G mo dels, nested Marko v mo dels may b e deﬁned via one of several equiv alent Marko v prop- erties. These prop erties are all neste d in the sense that they apply recursively to either reac hable or intrinsic sets derived from an ADMG. In particular, there is a nested analogue of the global Marko v prop ert y for D AGs (d-separation), the lo cal Marko v prop ert y for D AGs (whic h asserts that v ariables are indep enden t of non-descendan ts given paren ts), and the moralization- based prop ert y for DA Gs. These deﬁnitions app ear and are prov en equiv alent in [11]. It is p ossible to as- so ciate a unique ADMG with a particular marginal D AG mo del, and a nested Mark ov mo del asso ciated with this ADMG will reco ver all indep endences which hold in the marginal D AG [11]. W e now deﬁne a nested factorization on probability distributions represen ted b y ADMGs using special sets of no des called ‘recursive heads’ and ‘tails.’ Deﬁnition 5 F or an intrinsic set S ∈ I ( G ) of a CADMG G , deﬁne the r e cursive he ad (rh) as: rh( S ) ≡ { x ∈ S | ch G ( x ) ∩ S = ∅} . Deﬁnition 6 The tail asso ciate d with a r e cursive he ad H of an intrinsic set S in a CADMG G is given by: tail( H ) ≡ ( S \ H ) ∪ pa G ( S ) . In the graph in Fig. 1 (b), the recursive head of the in trinsic set { 2 , 4 , 5 } is equal to the set itself, while the tail is { 1 , 3 } . A kernel q V ( X V | X W ) satisﬁes the he ad factorization pr op erty for a CADMG G ( V , W, E ) if there exist k ernels { f S ( X H | X tail( H ) ) | S ∈ I ( G ) , H = rh G ( S ) } such that q V ( X V | X W ) = Y H ∈ J V K G S :rh G ( S )= H f S ( X H | X tail( H ) ) (2) where J V K G is a partition of V into heads giv en in [14]. Let G ( G ) ≡ { ( G ∗ , w ∗ ) | G ∗ = φ w ∗ ( G ) } for an ADMG G . That is, G ( G ) is the set of v alid ﬁxing sequences and the CADMGs that they reach. The same graph ma y b e reached by more than one sequence w ∗ . W e say that a distribution p ( x V ) obeys the neste d he ad factor- ization pr op erty for G if for all ( G ∗ , w ∗ ) ∈ G ( G ), the k ernel φ w ∗ ( p ( X V ); G ) ob eys the head factorization for φ w ∗ ( G ) ≡ G ∗ . W e denote the set of such distributions b y P n h ( G ). Nested Marko v mo dels hav e b een deﬁned via a nested district factorization criterion [15], and a n umber of Marko v prop erties [11]. The head factor- ization is another wa y of deﬁning the nested Marko v mo del due to the following result. Theorem 7 The set P n h ( G ) is the neste d Markov mo del of G . Our decision to suppress the precise ﬁxing sequence from the ﬁxing op eration applied to sets is justiﬁed, due to the follo wing result. Theorem 8 If p ( x V ) is in the neste d Markov mo del of G , then for any r e achable set A in G , any valid ﬁxing se quenc e on V \ A gives the same CADMG over A , and the same kernel q A ( x A | x V \ A ) obtaine d fr om p ( x V ) . 4.6 A M¨ obius Parameterization of Binary Nested Marko v Mo dels W e now give the original parameterization for binary nested Marko v mo dels. The approach generalizes in a straightforw ard wa y to ﬁnite discrete state spaces. Multiv ariate binary distributions in the nested Mark o v mo del for an ADMG G may b e parameterized by the follo wing: Deﬁnition 9 The nested M¨ obius parameters asso ci- ate d with a CADMG G ar e a set of functions: Q G ≡  q S ( X H = 0 | x tail( H ) ) for H = rh( S ) , S ∈ I ( G )  . In tuitively , a parameter q S ( X H = 0 | x tail( H ) ) is the probabilit y that the v ariable set X H assumes v alues 0 in a kernel obtained from p ( x V ) by ﬁxing X V \ S , and conditioning on X tail( H ) . As a shorthand, w e will denote the parameter q S ( X H = 0 | x tail( H ) ) by θ H ( x tail( H ) ). Deﬁnition 10 L et ν : V ∪ W 7→ { 0 , 1 } b e an as- signment of values to the variables indexe d by V ∪ W . Deﬁne ν ( T ) to b e the values assigne d to variables in- dexe d by a subset T ⊆ V ∪ W . L et ν − 1 (0) = { v | v ∈ V , ν ( v ) = 0 } . A distribution P ( X V | X W ) is said to b e parameter- ized b y the set Q G , for CADMG G if: p ( X V = ν ( V ) | X W = ν ( W )) = X B : ν − 1 (0) ∩ V ⊆ B ⊆ V ( − 1) | B \ ν − 1 (0) | × Y H ∈ J B K G θ H ( X tail( H ) = ν (tail( H ))) (3) wher e the empty pr o duct is deﬁne d to b e 1 . F or example, the graph sho wn in Fig. 3 (b) rep- resen ting a mo del o ver binary random v ariables X 1 , X 2 , X 3 , X 4 is parameterized by the follo wing sets of parameters: θ 1 = p (0 1 ) θ 2 ( x 1 ) = p (0 2 | x 1 ) θ 1 , 3 ( x 2 ) = p (0 3 | x 2 , 0 1 ) p (0 1 ) θ 3 ( x 2 ) = X x 1 p (0 3 | x 2 , x 1 ) p ( x 1 ) θ 2 , 4 ( x 1 , x 3 ) = p (0 4 | x 3 , 0 2 , x 1 ) p (0 2 | x 1 ) θ 4 ( x 3 ) = X x 2 p (0 4 | x 3 , x 2 , x 1 ) p ( x 2 | x 1 ) . The total num b er of parameters is 1 + 2 + 2 + 2 + 4 + 2 = 13, which is 2 few er than a saturated parameterization of a 4 no de binary mo del, which contains 2 4 − 1 = 15 parameters. The tw o missing parameters reﬂect the fact that θ 4 ( x 3 ) do es not dep end on x 1 , which is a constrain t induced by the absence of the edge from 1 to 4 in Fig. 3 (b). Note that this constraint is not a conditional indep endence. In fact, no conditional indep endences o ver v ariables corresp onding to v ertices 1 , 2 , 3 , 4 are advertised in Fig. 3 (b). This parameterization maps θ H parameters to prob- abilities in a CADMG via the inv erse M¨ obius trans- form given by (3), and generalizes b oth the standard Mark ov parameterization of DA Gs in terms of param- eters of the form p ( x i = 0 | pa( x i )), and the parame- terization of bidirected graph mo dels given in [3]. 5 A Log-linear P arameterization of Nested Mark o v Mo dels W e begin by deﬁning a set of ob jects whic h are func- tions of the observ ed density , and which will serve as our parameters. Deﬁnition 11 L et G ( V , E ) b e an ADMG and p ( x V ) a density over a set of binary r andom variables X V in the neste d Markov mo del of G . F or any S ∈ I ( G ) , let M = S ∪ pa G ( S ) , A ⊆ M (with A ∩ S 6 = ∅ ), and let q S ( x S | x M \ S ) = φ V \ S ( p ( x V ); G ) b e the asso ciate d kernel. Then deﬁne λ M A = 1 2 | M | X x M ( − 1) k x A k 1 log q S ( x S | x M \ S ) , to b e the nested log-linear parameter associated with A in S . F urther let Λ( G ) b e the c ol le ction { λ M A | S ∈ I ( G ) , M = S ∪ pa G ( S ) , rh G ( S ) ⊆ A ⊆ M } of these lo g-line ar p ar ameters. We c al l Λ( G ) the nested ingen uous parameterization of G . This parameterization is based on the graphical con- cepts of recursive heads and corresp onding tails. W e call the parameterization ‘nested ingen uous’ due to its similarit y to a marginal log-linear parameterization called ingenuous in [6], and in contrast to other log- linear parameterizations which may exist for nested Mark ov mo dels. Marginal mo del parameterizations of this type were ﬁrst introduced in [2]. This deﬁnition extends easily to non-binary discrete data, in whic h case some parameters λ M A b ecome collections of pa- rameters. As an example, consider the graph shown in Fig. 3 (b) which represents a binary nested Marko v mo del. The nested ingenuous parameters asso ciated with the marginal p ( x 1 ) and conditional p ( x 2 | x 1 ) are λ 1 1 = 1 2 log p (0 1 ) p (1 1 ) λ 21 2 = 1 4 log p (0 2 | 0 1 ) · p (0 2 | 1 1 ) p (1 2 | 0 1 ) · p (1 2 | 1 1 ) λ 21 21 = 1 4 log p (0 2 | 0 1 ) · p (1 2 | 1 1 ) p (1 2 | 0 1 ) · p (0 2 | 1 1 ) whereas parameters asso ciated with the k ernel q 4 ( x 4 | x 3 ) = P x 2 p ( x 4 | x 3 , x 2 , x 1 ) p ( x 2 | x 1 ) are λ 43 4 = 1 4 log q 4 (0 4 | 0 3 ) · q 4 (0 4 | 1 3 ) q 4 (1 4 | 0 3 ) · q 4 (1 4 | 1 3 ) λ 43 43 = 1 4 log q 4 (0 4 | 0 3 ) · q 4 (1 4 | 1 3 ) q 4 (1 4 | 0 3 ) · q 4 (0 4 | 1 3 ) A parameter λ M A , where M is the union of a head H and its tail T , can b e viewed, by analogy with similar clique parameters in undirected log-linear models, as a | A | -w ay in teraction b et ween the v ertices in A , within the kernel corresp onding to M . F or instance the k ernel q 2 , 4 ( x 2 , x 4 | x 1 , x 3 ) = p ( x 4 | x 3 , x 2 , x 1 ) p ( x 2 | x 1 ), 2 mak es an appearance in 4 parameters in a binary mo del: λ 1234 24 , λ 1234 124 , λ 1234 234 , and λ 1234 1234 . If we set λ 1234 1234 to 0, w e claim there is no 4-w ay interaction b et w een X 1 , X 2 , X 3 , X 4 in the k ernel. It can b e sho wn that while the M¨ obius parameteriza- tion of the graph in Fig. 3 (b) is v ariation dep enden t, the nested ingen uous parameterization of the same graph is v ariation indep enden t. This is not true in general. In particular b oth parameterizations for the graph in Fig. 1 (b) are v ariation dep endent. 6 Main Results In this section w e prov e that the nested ingenuous pa- rameters indeed parameterize discrete nested Marko v mo dels. W e start with an intermediate result. Lemma 12 L et H ⊆ M and q ( x H | x M \ H ) b e a ker- nel. Then q is smo othly p ar ameterize d by the c ol- le ction of NLL p ar ameters { λ M A | H ⊆ A ⊆ M } to- gether with the ( | H | − 1) -dimensional mar gins of q , q ( x H \{ v } | x M \ H ) , v ∈ H . Pr o of: The pro of is essentially iden tical to the pro of of Lemma 4.4 in [6].  6.1 The Main Result W e now deﬁne a partial order on heads and use this order to inductiv ely establish the main result. Deﬁnition 13 L et ≺ I ( G ) b e the p artial or der on he ads, H i , of intrinsic sets, S i , in G such that H i ≺ I ( G ) H j whenever S i ⊂ S j . Theorem 14 The neste d ingenuous p ar ameterization of an ADMG G with no des V p ar ameterizes pr e cisely those distributions p ( x V ) ob eying the neste d glob al Markov pr op erty with r esp e ct to G . 2 This kernel, viewed causally , is p ( x 2 , x 4 | do( x 1 , x 3 )). Pr o of: Let ≺ I ( G ) b e the partial ordering on heads giv en in Deﬁnition 13. W e pro ceed by induction on this ordering. F or the base case, w e know that sin- gleton heads { h } with empty tails are parameterized b y λ h h . If a singleton head has a non-empty tail, the conclusion follo ws immediately by Lemma 12. No w, supp ose that w e wish to ﬁnd the kernel with a non-singleton head H † and a tail T † corresp onding to the intrinsic set S † . Assume, by inductive hypothe- sis, that we ha ve already obtained the kernels with all heads H such that H ≺ I ( G ) H † . W e claim this is suﬃcien t to give the ( | H † | − 1)-dimensional marginal k ernels q S † ( x H † \{ v } | x T † ), for all v ∈ H † . Fix a particular v ∈ H † . The set S † \ { v } is reachable, since V \ S † is a set with a v alid ﬁxing sequence, and an y v ∈ H † has no c hildren in S † in φ V \ S † ( G ) so is ﬁxable in φ V \ S † ( G ). Theorem 7 and Theorem 8 imply that for ev ery reachable set A , (2) holds. Hence: q S † ( x S † \{ v } | x V \ ( S † \{ v } ) ) = Y H ∈ J S † \{ v } K G S :rh G ( S )= H q S ( x H | x tail( H ) ) . (4) F or any S such that rh G ( S ) = H , and H ⊆ S † \ { v } , H ≺ I ( G ) H † , hence by the induction hypothesis, the k ernel q S ( x S | x pa ( S ) \ S ) is already obtained, and all k er- nels whic h app ear in (4) can be derived by condi- tioning from some suc h q S ( x S | x pa ( S ) \ S ). The desired k ernel q S † ( x H † \{ v } | x T † ) can itself b e obtained from q S † ( x S † \{ v } | x pa( S † ) \ S † ) b y conditioning. W e can rep eat this argument for any v ∈ H † . Fi- nally , the nested ingenuous parameterization con tains λ H † ∪ T † A for H † ⊆ A ⊆ H † ∪ T † . The result then follows b y Lemma 12.  7 Sim ulations T o illustrate the utilit y of setting higher order parame- ters to zero (‘remo ving’), w e present a sim ulation study based on the ADMG in Fig. 5 (b). This graph is a sp ecial case of t wo bidirected c hains of k vertices eac h, with a path of directed edges alternating betw een the c hains, for k = 4. The num b er of parameters in the relev an t binary nested Marko v mo del grows exp onen- tially with k in graphs of this t yp e. Consider also the laten t v ariable mo del deﬁned by re- placing each bidirected edge with an indep enden t la- ten t v ariable shown in Fig. 5 (a), so that 1 ↔ 3 b e- comes 1 ← 9 → 3. If the state space of each laten t v ari- able is the same and ﬁxed, then the num ber of param- eters in this hidden v ariable DA G mo del grows only linearly in k . This suggests that the nested Mark ov mo del may include higher order parameters whic h are 0 5 10 15 20 0.00 0.04 0.08 0.12 10 20 30 40 50 60 0.00 0.02 0.04 0.06 20 40 60 80 0.00 0.01 0.02 0.03 0.04 0.05 Figure 4: Histograms showing the increase in deviance asso ciated with setting to zero any nested log-linear parameters with eﬀects higher than orders (from left to righ t) seven, six and ﬁve resp ectively . This corresp onds to remo ving 6, 18 and 35 parameters resp ectively; the relev an t χ 2 densit y is plotted in each case. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 (a) 1 2 3 4 5 6 7 8 (b) Figure 5: (a) A hidden v ariable D AG used to generate samples for Section 7. (b) The laten t pro jection of this generating D AG. not really necessary in this case (though the higher order parameters may b ecome necessary again if the state space of laten t v ariables grows). W e generated distributions from the latent v ariable mo del asso ciated with the DA G in Fig. 5 (a) as fol- lo ws: each of the six laten t v ariables tak es one of three states with equal probability , and each observed v ari- able takes the v alue 0 with a probability generated as an indep enden t uniform random v ariable on (0 , 1), conditional up on each p ossible v alue of its parents. F or each of 1,000 distributions pro duced independently using this metho d, we generated a dataset of size 5,000. W e then ﬁtted the nested mo del generated by the graph in Fig. 5 (b) to each dataset b y maximum lik e- liho od estimation, using a v ariation of an algorithm found in [5], and measured the increase in deviance asso ciated with zeroing any nested ingenuous param- eters corresp onding to eﬀects ab o ve a certain order. If these parameters w ere truly zero, we would exp ect the increase to follow a χ 2 -distribution with an ap- propriate n umber of degrees of freedom; the ﬁrst tw o histograms in Fig. 4 demonstrate that the distribution of actual increases in deviance lo oks m uc h lik e the rele- v an t χ 2 -distribution if we remov e interactions of order 6 and higher. The third histogram shows that this starts to break down slightly when 5-wa y interactions are also zero ed. These results suggest that higher order parameters are often not useful for explaining ﬁnite datasets, and more parsimonious mo dels can b e obtained by remov- ing them; a similar simulation was p erformed for the Mark ov case in [6]. 7.1 Distinguishing Graphs The use of score-based search metho ds for recov ering nested Marko v mo dels had b een inv estigated [15]. It w as found that relatively large sample sizes were re- quired to reliably recov er the correct graph, even in examples with only 4 or 5 binary no des and after en- suring that the underlying distributions were appro x- imately faithful to the true graph. One phenomenon iden tiﬁed was that incorrect but more parsimonious graphs, esp ecially DA Gs, tended to hav e low er BIC scores than the correct mo dels, which include higher order parameters. Although BIC is guaranteed to b e smaller on the correct mo del asymptotically , in ﬁnite samples it applies strong p enalties for having addi- tional parameters with little explanatory p o w er. Here we present a simulation to show how the new parameterization can help to o vercome this diﬃculty . Using the metho d describ ed in the previous subsec- tion, we generated 1,000 multiv ariate binary distribu- tions which were nested Marko v with resp ect to the graph in Fig. 1 (b). F or each distribution w e gen- erated a dataset, and ﬁtted the data to the correct mo del, which has 16 parameters, as well as the tw o D AGs giv en in Fig. 7 (a) and (b), which e ac h hav e 1e3 1e4 1e5 1e6 0.0 0.2 0.4 0.6 0.8 1.0 Sample Size Figure 6: F rom the experiment in Section 7.1: in red, the prop ortion of times graph in Fig. 1 (b) had low er BIC than the D AGs in Fig. 7, for v arying sample sizes; in blac k, the prop ortion of times some restricted v er- sion of this mo del had a low er BIC than an y restricted v ersions of either DA G. 11 parameters. This was rep eated at v arious sample sizes. The plot in Fig. 6 shows, in red, the prop ortion of times in which the BIC score for the correct mo del w as low er than that for eac h of the DA Gs, at v arious sample sizes. The correct graph only has the lo west BIC score of the three graphs on less than 3% of runs at sample size of n = 1,000, increasing to around 50% for n = 20,000. In addition to the full mo dels, we ﬁtted the datasets to v ersions of the mo dels with higher order parameters remo ved; the graph in Fig. 1 (b) can b e restricted b y zeroing the 5-wa y parameter (leaving 15 free parame- ters), the 4-wa y and and ab o v e (13 params), or 3-wa y and abov e (10 params). Similarly w e can restrict the D AGs to hav e no 3-wa y eﬀects, giving each mo del 10 free parameters. Fig. 6 sho ws, in black, the proportion of times that one of these restricted v ersions of the true mo del had a low er BIC than an y v ersion of either D A G mo del. W e see that the correct graph has the low est score in 40% of runs at n = 1,000, rising to around 70% at n = 20,000. Note that these results should not b e compared directly to those in [15], since the single ground truth la w used in that pap er was generated so as to ensure faithfulness to the correct graph, whereas w e are randomly sampling m ultiple la ws without b oth- ering to ensure an y particular prop erties in these laws other than consistency with the underlying D AG. These results suggest that these submo dels of the nested mo del ma y b e adv an tageous in recov ering the correct graphical structure using score-based metho ds. 1 2 3 4 5 (a) 1 2 3 4 5 (b) Figure 7: (a) and (b) tw o DA Gs with the same skeleton as the graph in Fig. 1 (b). Note that determining which higher order parameters should be set to zero for a given data set and sample size remains non-trivial. Automatic selection migh t b e p ossible with an L 1 -p enalized approach [16, 4]. 8 Discussion and Conclusions W e hav e introduced a new log-linear parameterization of nested Marko v mo dels o ver discrete state spaces. The log-linear parameters corresp ond to ‘in teractions’ in kernels obtained after an iterative application of truncation and marginalization steps (informally ‘in- teractions in interv en tional densities’). By contrast the M¨ obius parameters [15] corresp ond to context sp eciﬁc eﬀects in kernels (informally ‘context sp eciﬁc causal eﬀects’). W e hav e shown by means of a simulation study that in cases where data is generated from a marginal of a DA G with ‘weak confounders’, we can reduce the dimension of the mo del by ignoring higher order in- teraction parameters, while retaining the adv antages of nested Marko v mo dels compared to mo deling weak confounding directly in a D AG. Though there is no eﬃcient, closed form mapping from ingen uous parameters to either M¨ obius parameters or standard probabilities, this is a smaller disadv antage than it may seem. This is because in cases where the ingenuous parameterization was used to select a particular submo del based on a dataset, we ma y still reparameterize and use M¨ obius parameters, or even standard join t probabilities if desired. Moreov er, this reparameterization step need only b e p erformed once, compared to m ultiple calls to a ﬁtting procedure whic h iden tiﬁed the particular graph corresp onding to our submo del in the ﬁrst place. The ingenuous and the M¨ obius parameterizations are th us complementary . The natural application of the ingen uous parameterization is in learning graph struc- ture from data in situations where many samples are not a v ailable, but we exp ect most confounding to b e w eak. The natural application of the M¨ obius param- eterization is the supp ort of probabilistic and causal inference in a particular graph [14, 15], in cases where an eﬃcient mapping from parameters to joint proba- bilities is imp ortan t. References [1] D. H. Ackley , G. E. Hin ton, and T. J. Seino wski. A learning algorithm for boltzmann mac hines. Co gnitive Scienc e , 9(1):147–169, 1985. [2] W. P . Bergsma and T. Rudas. Marginal mo dels for categorical data. Annals of Statistics , 30(1):140–159, 2002. [3] M. Drton and T. S. Richardson. Binary mo dels for marginal indep endence. Journal of the R oyal Statisti- c al Society (Series B) , 70(2):287–309, 2008. [4] R. J. Ev ans. Paramet rizations of Discr ete Gr aphic al Mo dels . PhD thesis, Department of Statistics, Univer- sit y of W ashington, 2011. [5] R. J. Ev ans and A. F orcina. Two algorithms for ﬁtting constrained marginal models. Computational Statis- tics & Data Analysis , 66:1–7, 2013. [6] R. J. Ev ans and T. S. Ric hardson. Marginal log-linear parameters for graphical Marko v models. Journal of the R oyal Statistic al So ciety (Series B) (to app e ar) , 2013. [7] D. Geiger, D. Heck erman, H. King, and C. Meek. Stratiﬁed exp onen tial families: Graphical mo dels and mo del selection. Annals of Statistics , 29(2):505–529, 2001. [8] S. L. Lauritzen. Gr aphic al Mo dels . Oxford, U.K.: Clarendon, 1996. [9] N. Meinshausen and P . B ¨ uhlmann. High-dimensional graphs and v ariable selection with the lasso. Annals of Statistics , 34(3):1436–1462, 2006. [10] J. Pearl. Pr ob abilistic R e asoning in Intel ligent Sys- tems . Morgan and Kaufmann, San Mateo, 1988. [11] T. S. Ric hardson, J. M. Robins, and I. Shpitser. Nested Mark ov properties for acyclic directed mixed graphs. In 28th Confer enc e on Unc ertainty in Artiﬁ- cial Intel ligenc e (UAI-12) . A UAI Press, 2012. [12] T. Rudas, W. P . Bergsma, and R. Nemeth. Parame- terization and estimation of path mo dels for categor- ical data. In Pr o c e e dings in Computational Statistics, 17th Symp osium , pages 383–394. Physica-V erlag HD, 2006. [13] G. E. Sch warz. Estimating the dimension of a mo del. Annals of Statistics , 6:461–464, 1978. [14] I. Shpitser, T. S. Richardson, and J. M. Robins. An eﬃcien t algorithm for computing interv entional dis- tributions in latent v ariable causal mo dels. In 27th Confer enc e on Unc ertainty in Artiﬁcial Intel ligenc e (UAI-11) . AUAI Press, 2011. [15] I. Shpitser, T. S. Richardson, J. M. Robins, and R. J. Ev ans. P arameter and structure learning in nested Mark ov mo dels. In UAI Workshop on Causal Struc- tur e Le arning 2012 , 2012. [16] R. Tibshirani. Regression shrink age and selection via the lasso. Journal of the R oyal Statistic al Society (Se- ries B) , 58(1):267–288, 1996. [17] T. S. V erma and Judea Pearl. Equiv alence and synthe- sis of causal mo dels. T ec hnical Rep ort R-150, Depart- men t of Computer Science, Universit y of California, Los Angeles, 1990.

Sparse Nested Markov models with Log-linear Parameters

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment