Recoverability of Joint Distribution from Missing Data
Jin Tian
Department of Computer Science, Iowa State University, Ames, IA 50014
jtian@iastate.edu
Abstract

A probabilistic query may not be estimable from observed data corrupted by missing values if the data are not missing at random (MAR). It is therefore of theoretical interest and practical importance to determine in principle whether a probabilistic query is estimable from missing data when the data are not MAR. We present an algorithm that systematically determines whether the joint probability is estimable from observed data with missing values, assuming that the data-generation model is represented as a Bayesian network that contains unobserved latent variables and that not only encodes the dependencies among the variables but also explicitly portrays the mechanisms responsible for the missingness process. The result significantly advances the existing work.

1 Introduction

Missing data occur when some variable values are missing from recorded observations. This is a common problem across many disciplines, including artificial intelligence, machine learning, statistics, economics, and the health and social sciences. Missing data pose a major obstacle to valid statistical and causal inferences in a broad range of applications. There is a vast literature on dealing with missing data in diverse fields; we refer to [1, 2] for a review of related work.

Most work in machine learning assumes data are missing at random (MAR) [3, 4], under which likelihood-based inference (as well as Bayesian inference) can be carried out while ignoring the mechanism that leads to missing data. In principle, however, to analyze data with missing values we need to understand the mechanisms that lead to missing data, in particular whether the fact that variables are missing is related to the underlying values of the variables in the data set. Indeed, some work in machine learning explicitly incorporates the missing data mechanism into the model [5, 6, 7, 8]. Recently, Mohan et al. [1] used directed acyclic graphs (DAGs), called m-graphs, to encode the missing data model by representing both the conditional independence relations among variables and the mechanisms responsible for the missingness process. M-graphs provide a general framework for inference with missing data when the MAR assumption does not hold and the data are categorized as missing not at random (MNAR). Whether a hypothesized DAG model or MAR assumption is testable with missing data is investigated in [9, 10]. A graphical version of MAR defined in terms of graphical structures is discussed in [10].

One important research question under this graphical model framework is: is a target probabilistic query estimable from observed data corrupted by missing values, given a missing data model represented as an m-graph? It is known that when the data are MAR, the joint distribution is estimable. On the other hand, when the data are MNAR, a probabilistic query may or may not be estimable depending on the query and the exact missing data mechanisms. For example, consider a single random variable X and assume that whether the values of X are missing is related to the values of X themselves (e.g., in a salary survey, people with low income are less likely to reveal their income). The model is MNAR, and we cannot estimate P(X) without bias even if an infinite amount of data is collected.
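As a quick numeric illustration of this point (our own sketch, not from the paper; the population and missingness parameters are invented), the complete-case estimate of the mean income is biased when low earners are more likely to withhold their value:

import random

random.seed(0)

# Incomes X; whether X is reported depends on X itself, so the data are MNAR.
population = [random.lognormvariate(10.5, 0.5) for _ in range(100_000)]
reported = [x for x in population
            if random.random() > (0.8 if x < 30_000 else 0.2)]  # R_X depends on X

true_mean = sum(population) / len(population)
naive_mean = sum(reported) / len(reported)  # complete-case estimate
print(f"true E[X]          = {true_mean:10,.0f}")
print(f"complete-case mean = {naive_mean:10,.0f}")  # biased upward

No amount of additional data fixes this; the bias is a property of the missingness mechanism, not of the sample size.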
In practice it is important to determine in principle whether a target query is estimable from missing data. Several sufficient graphical conditions have been derived under which probability queries of the form P(x, y) or P(y | x) are estimable [1]. Mohan and Pearl [2] extended those results and further developed conditions for recovering causal queries of the form P(y | do(x)). Shpitser et al. [11] formulated the problem as a causal inference problem and developed a systematic algorithm for estimating the joint distribution when the model contains no unobserved latent variables.

In this paper we develop an algorithm for systematically determining the recoverability of the joint distribution from missing data in m-graphs that may contain latent variables. The result is significantly more general than the sufficient conditions in [1, 2]. Compared to the result in [11], we allow latent variables in the model and treat the problem in a purely probabilistic framework without appealing to causality theory.

The paper is organized as follows. Section 2 defines the notion of m-graphs as introduced in [1]. Section 3 formally defines the notion of recoverability and briefly reviews previous work. In Section 4 we present our algorithm for recovering the joint distribution. Section 5 concludes the paper.

2 Missing data model as a Bayesian network

Bayesian networks are widely used for representing data-generation models [12, 13]. Mohan et al. [1] used DAGs, called m-graphs, to represent both the data-generation model and the mechanisms responsible for the missingness process. In this section we define m-graphs, mostly following the notation in [1].

Let G be a DAG over a set of variables V ∪ L ∪ R, where V is the set of observable variables, L is the set of unobserved latent variables, and R is the set of missingness indicator variables introduced to represent the mechanisms responsible for missingness. We assume that V is partitioned into V_o and V_m, where V_o is the set of variables that are observed in all data cases and V_m is the set of variables that are missing in some data cases and observed in others. (We assume the V variables can be partitioned into V_o and V_m based on domain knowledge or modeling assumptions; in many applications we know that some variables are always observed.) Every variable V_i ∈ V_m is associated with a variable R_{V_i} ∈ R such that, in any observed data case, R_{V_i} = 1 if the value of the corresponding V_i is missing and R_{V_i} = 0 if V_i is observed. We require that R variables may not be parents of variables in V, since R variables are missingness indicators and we assume that the data-generation process over the V variables does not depend on the missingness mechanism. For any set S ⊆ V_m, let R_S denote the set of R variables corresponding to the variables in S.

The DAG G provides a compact representation of the missing data model P(V, L, R) = P(V, L) P(R | V, L) and will be called an m-graph of the model. The m-graph depicts both the dependency relationships among the variables in V ∪ L and the missingness mechanisms, and it encodes conditional independence relationships that can be read off the graph by the d-separation criterion [14], such that every d-separation in the graph G implies a conditional independence in the distribution P. See Figure 1 for examples of m-graphs, in which we use solid circles to represent always-observed variables in V_o and R, and hollow circles to represent partially observed variables in V_m.
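As a minimal illustration of the definition (our own sketch; the edge set is our reading of Figure 1(a), namely X → R_Y → R_X ← Y), an m-graph can be encoded as a plain DAG whose node set is partitioned into V_o, V_m, L, and R:

# Hypothetical encoding of an m-graph as plain Python data.
mgraph = {
    "V_o": set(),                  # variables observed in all data cases
    "V_m": {"X", "Y"},             # partially observed variables
    "L":   set(),                  # unobserved latent variables
    "R":   {"R_X", "R_Y"},         # missingness indicators
    "edges": {("X", "R_Y"), ("R_Y", "R_X"), ("Y", "R_X")},
}

def parents(g, node):
    """Parent set of `node` in the DAG."""
    return {u for (u, v) in g["edges"] if v == node}

# The structural requirement that R variables are never parents of V variables:
V = mgraph["V_o"] | mgraph["V_m"]
assert not any(u in mgraph["R"] and v in V for (u, v) in mgraph["edges"])
print(parents(mgraph, "R_X"))  # {'R_Y', 'Y'}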
We often use bidirected edges as shorthand notation to denote the existence of an L variable that is a common parent of two other variables; see Figure 1(b) for an example.

3 Recoverability

Given an m-graph and observed data with missing values, it is important to know whether we can in principle compute a consistent estimate of a given probabilistic query q (e.g., P(x | y)). If q is deemed not estimable (i.e., not recoverable), then it is not estimable even from an infinite amount of data. Next we formally define the notion of recoverability.

In any observation, let S ⊆ V_m be the set of observed variables (i.e., the values of the variables in V_m \ S are missing). Then the observed data are governed by the distribution P(V_o, S, R_{V_m \ S} = 1, R_S = 0). Formally:

Definition 1 (Recoverability) Given an m-graph G, a target probabilistic query q is said to be recoverable if q can be expressed in terms of the set of observed positive probabilities {P(V_o, S, R_{V_m \ S} = 1, R_S = 0) : S ⊆ V_m}; that is, if q_{M_1} = q_{M_2} for every pair of models P_{M_1}(V, R) and P_{M_2}(V, R) compatible with G such that P_{M_1}(V_o, S, R_{V_m \ S} = 1, R_S = 0) = P_{M_2}(V_o, S, R_{V_m \ S} = 1, R_S = 0) > 0 for all S ⊆ V_m.

This collection of observed probabilities {P(V_o, S, R_{V_m \ S} = 1, R_S = 0) : S ⊆ V_m} has been called the manifest distribution, and the problem of recovering probabilistic queries from the manifest distribution has been studied in [1, 9, 2, 11].

Example 1 In Fig. 1(a), the manifest distribution is the collection {P(X, Y, R_X = 0, R_Y = 0), P(X, R_X = 0, R_Y = 1), P(Y, R_X = 1, R_Y = 0), P(R_X = 1, R_Y = 1)}.

3.1 Previous work

When data are MAR, it is known that the joint distribution is recoverable. We have R ⊥⊥ V_m | V_o, where X ⊥⊥ Y | Z denotes that X is conditionally independent of Y given Z (see [1, 10] for a graphical definition of MAR), and the joint is recoverable as P(V) = P(V_m | V_o) P(V_o) = P(V_m | V_o, R = 0) P(V_o).

When data are MNAR, the joint P(V) may or may not be recoverable depending on the m-graph G. Mohan and Pearl [2] presented a sufficient condition for recovering probabilistic queries, including the joint, by using sequential factorizations (extending the ordered factorizations in [1]). The basic idea is to find an order of variables in V ∪ R, called an admissible sequence, such that P(V) can be decomposed into an ordered factorization, or a sum over one, in which every factor is recoverable using conditional independence relationships.

Example 2 We want to recover P(X, Y) given the m-graph in Fig. 1(a). The order X < R_Y < Y induces the following sum over a factorization:

P(x, y) = \sum_{r_Y} P(x | r_Y, y) P(r_Y | y) P(y) = P(y) \sum_{r_Y} P(x | r_Y) P(r_Y),    (1)

where both P(y) = P(y | R_Y = 0) and P(x | r_Y) = P(x | r_Y, R_X = 0) are recoverable.
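The recovery formula in Eq. (1) translates directly into an estimator. Below is a minimal simulation sketch (ours; the parameters, and the reading of Fig. 1(a) as X → R_Y → R_X ← Y, are our own) that estimates P(x, y) using only quantities observable under the missingness process:

import random

random.seed(1)

# Ground-truth model: X and Y independent; R_Y depends on X; R_X on Y and R_Y.
def sample():
    x = random.random() < 0.3
    y = random.random() < 0.6
    r_y = random.random() < (0.7 if x else 0.1)
    r_x = random.random() < ((0.6 if y else 0.3) if r_y
                             else (0.4 if y else 0.1))
    return x, y, r_x, r_y  # a value of X (or Y) is usable only when r_x (r_y) is 0

data = [sample() for _ in range(200_000)]

def p(pred, cond=lambda d: True):
    """Empirical P(pred | cond), querying only observable events."""
    sub = [d for d in data if cond(d)]
    return sum(pred(d) for d in sub) / len(sub)

# Eq. (1): P(x=1, y=1) = P(y=1 | R_Y=0) * sum_{r_Y} P(x=1 | r_Y, R_X=0) P(r_Y)
p_y = p(lambda d: d[1], lambda d: not d[3])
est = p_y * sum(p(lambda d: d[0], lambda d: d[3] == r and not d[2]) *
                p(lambda d: d[3] == r)
                for r in (False, True))
print(f"recovered P(x=1, y=1) ≈ {est:.3f}   (truth: {0.3 * 0.6:.3f})")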
The main issue with the sequential factorization approach is that in general it is not clear whether an admissible sequence exists or how to find one (even deciding whether a given order is admissible does not appear to be easy). Several sufficient conditions for recovering the joint P(V) are given in [1, 2], which may handle problems for which no admissible sequence exists. For example, one condition states that P(V) is recoverable if no variable X is a parent of its corresponding R_X and there are no edges between the R variables.

If the m-graph contains no latent variables (L = ∅), the model is called a Markov model. In a Markov model, applying the chain rule of Bayesian networks, we have

P(v_o, v_m, R = 0) = P(v) P(R = 0 | v) = \prod_i P(v_i | pa_i) \prod_j P(R_{V_j} = 0 | pa_{R_{V_j}})|_{R=0},    (2)

where Pa_i and Pa_{R_{V_j}} denote the parents of V_i and R_{V_j} in G, respectively. From Eq. (2) we obtain that P(V) = \prod_i P(v_i | pa_i) is recoverable if every factor P(R_{V_j} = 0 | pa_{R_{V_j}}) is recoverable. Shpitser et al. [11] developed a systematic algorithm, MID, for recovering P(V) by trying to recover every P(R_{V_j} = 0 | pa_{R_{V_j}}) using a subroutine DIR. The MID/DIR algorithm is based on formulating the recoverability problem as a causal inference problem and using techniques developed for the identification of causal effects. Allowing latent variables in the model, however, makes the recoverability problem substantially different (and more difficult).

Figure 1: (a) An m-graph that is MNAR; P(X, Y) is recoverable. (b) An m-graph containing latent variables. We use solid circles to represent always-observed variables in V_o and R, and hollow circles to represent partially observed variables in V_m. We use bidirected edges to denote the existence of a latent L variable as a common parent of two variables.

Example 3 In Fig. 1(b), applying the chain rule of Bayesian networks, we have

P(x, y, z, R_{X,Y,Z} = 0) = [\sum_{l_1} P(y | x, l_1) P(z | l_1) P(l_1)] [\sum_{l_2} P(x | l_2) P(R_Z = 0 | l_2) P(l_2)] P(R_X = 0 | y) P(R_Y = 0 | z),    (3)

while

P(x, y, z) = P(x) [\sum_{l_1} P(y | x, l_1) P(z | l_1) P(l_1)].    (4)

It is clear that we cannot recover P(x, y, z) by trying to recover every P(R_{V_j} = 0 | pa_{R_{V_j}}), since P(v) and P(R = 0 | v) no longer decouple as in Eq. (2); the MID algorithm is no longer applicable.

In this paper we deal with the problem of recovering the joint P(V) in models that may contain latent variables. We treat the problem in a purely probabilistic framework without appealing to causality theory.

4 Recoverability of the joint

In this section we develop an algorithm that systematically determines the recoverability of the joint P(V). First we reformulate the given observed probabilities.

Proposition 1 Given the manifest distribution {P(V_o, S, R_{V_m \ S} = 1, R_S = 0) : S ⊆ V_m}, the probability P(V_o, S, R_{V_m \ S}, R_S = 0) is recoverable for all S ⊆ V_m.

Proof: For any assignment of values r_{V_m \ S}, let T be the set of variables in V_m \ S for which r_T = 0; then r_{V_m \ (S ∪ T)} = 1. We have that P(V_o, S, R_{V_m \ S}, R_S = 0) is recoverable as

P(v_o, s, r_{V_m \ S}, R_S = 0) = \sum_t P(v_o, s ∪ t, R_{V_m \ (S ∪ T)} = 1, R_{S ∪ T} = 0).    (5)

It turns out that it is much easier to work with the set of probabilities P(V_o, S, R_{V_m \ S}, R_S = 0) than with P(V_o, S, R_{V_m \ S} = 1, R_S = 0). Therefore, in the following, to recover P(V) we attempt to express P(V) in terms of the set of observed probabilities {P(V_o, S, R_{V_m \ S}, R_S = 0) : S ⊆ V_m}.
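Eq. (5) is just a marginalization over the observed tables. A small numeric sketch (ours; the four manifest tables are invented but sum to one) for Fig. 1(a) with binary X and Y:

# The manifest distribution, keyed by the set of observed V_m variables.
manifest = {
    frozenset({"X", "Y"}): {(0, 0): 0.10, (0, 1): 0.25,
                            (1, 0): 0.02, (1, 1): 0.05},  # P(X, Y, R_X=0, R_Y=0)
    frozenset({"X"}): {0: 0.18, 1: 0.12},                 # P(X, R_X=0, R_Y=1)
    frozenset({"Y"}): {0: 0.08, 1: 0.14},                 # P(Y, R_X=1, R_Y=0)
    frozenset(): {(): 0.06},                              # P(R_X=1, R_Y=1)
}

# Recover P(Y=y, R_X=0, R_Y=0) for S = {Y}: here r_X = 0, so T = {X} and
# Eq. (5) sums the fully observed table over x.
def p_y_rx0_ry0(y):
    return sum(manifest[frozenset({"X", "Y"})][(x, y)] for x in (0, 1))

# For r_X = 1, T is empty and the manifest entry is used as-is.
def p_y_rx1_ry0(y):
    return manifest[frozenset({"Y"})][y]

print(p_y_rx0_ry0(1), p_y_rx1_ry0(1))  # 0.30 0.14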
Example 4 In Fig. 1(a), instead of the manifest distribution given in Example 1, we work with the set of observed distributions {P(X, Y, R_X = 0, R_Y = 0), P(X, R_X = 0, R_Y), P(Y, R_X, R_Y = 0), P(R_X, R_Y)}.

4.1 Utility lemmas

The basic idea is to express P(V) and each P(V_o, S, R_{V_m \ S}, R_S = 0) in a "canonical" factorization of Bayesian networks in terms of c-components [15]. Next we introduce some useful concepts, mostly following the notation in [15].

Let G be a Bayesian network structure over O ∪ L, where O = {O_1, ..., O_n} is the set of observed variables and L = {L_1, ..., L_{n'}} is the set of unobserved latent variables. We often use the notation G(O, L) when we want to make clear which variables in G are latent; for example, an m-graph may be denoted G(V ∪ R, L). The observed probability distribution P(O) can be expressed as

P(o) = \sum_l \prod_{\{i | O_i ∈ O\}} P(o_i | pa_{O_i}) \prod_{\{i | L_i ∈ L\}} P(l_i | pa_{L_i}),    (6)

where the summation ranges over all the L variables. For any set S ⊆ O, define the quantity Q[S] to be the function

Q[S](o) = \sum_l \prod_{\{i | O_i ∈ S\}} P(o_i | pa_{O_i}) \prod_{\{i | L_i ∈ L\}} P(l_i | pa_{L_i}).    (7)

In particular, Q[O] = P(o). Q[S] is a function of some subset of the variables in O; for convenience, we often write Q[S](o) as Q[S].

The set of observed variables O can be partitioned into c-components by assigning two variables O_i and O_j to the same c-component if and only if O_i has an unobserved parent L_i and O_j has an unobserved parent L_j such that either L_i = L_j or there exists a path between L_i and L_j in G on which (i) every internal node is in L, or (ii) every node in O is head-to-head (→ O_i ←). Note that an observable variable with no latent parent forms a c-component by itself. The importance of the c-component partition lies in the fact that if O is partitioned into c-components S_1, ..., S_k, then each Q[S_i], called a c-factor, is computable in terms of P(O). The following result is from [16, 15]:

Lemma 1 Given a DAG G(O, L), assuming that O is partitioned into c-components S_1, ..., S_k, we have:
(i) P(O) decomposes as P(o) = \prod_i Q[S_i].
(ii) Let a topological order over O be O_1 < ... < O_n, let O_{≤i} = {O_1, ..., O_i} be the set of variables ordered up to and including O_i, for i = 1, ..., n, and let O_{≤0} = ∅. Then each Q[S_j], j = 1, ..., k, is computable from P(O) and is given by

Q[S_j] = \prod_{\{i | O_i ∈ S_j\}} P(o_i | o_{≤i−1}).    (8)

Based on Lemma 1, if P(v) decomposes into a product of c-factors Q[S_i], then P(v) is recovered if each Q[S_i] is recovered.

Example 5 In Fig. 1(b), considering the "normal" Bayesian network over V ∪ L (the part of the model without R variables), we have

P(x, y, z) = P(x) [\sum_{l_1} P(y | x, l_1) P(z | l_1) P(l_1)] = P(x) Q[{Y, Z}].    (9)

Therefore P(v) can be recovered if both P(x) and Q[{Y, Z}] are recovered.

We can also express the given observed distribution P(V_o, S, R_{V_m \ S}, R_S = 0) for S ⊆ V_m in its "canonical" c-factor factorization based on Lemma 1. Assume that V ∪ R in G(V ∪ R, L) is partitioned into c-components B_1, ..., B_k; then

P(v_o, v_m, r) = \prod_i Q[B_i],    (10)

and

P(v_o, s, r_{V_m \ S}, R_S = 0) = \sum_{v_m \ s} \prod_i Q[B_i]|_{R_S = 0}.    (11)
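Under the bidirected-edge shorthand, in which every latent variable is a root with observable children, the c-component partition used in Eqs. (9)-(11) reduces to a union-find over shared latent parents. A minimal sketch (ours; it assumes that semi-Markovian form, and the edge set is our reading of Fig. 1(b) from Eq. (3)):

from itertools import combinations

def c_components(observed, latents, edges):
    """Partition `observed` into c-components, assuming every latent is a
    root: observables sharing a latent parent are merged, transitively."""
    parent = {o: o for o in observed}  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for l in latents:
        children = [v for (u, v) in edges if u == l and v in observed]
        for a, b in combinations(children, 2):
            parent[find(a)] = find(b)

    comps = {}
    for o in observed:
        comps.setdefault(find(o), set()).add(o)
    return list(comps.values())

# Fig. 1(b) as we read it from Eq. (3): L1 -> {Y, Z}, L2 -> {X, R_Z}.
edges = {("L1", "Y"), ("L1", "Z"), ("X", "Y"), ("L2", "X"), ("L2", "R_Z"),
         ("Y", "R_X"), ("Z", "R_Y")}
print(c_components({"X", "Y", "Z", "R_X", "R_Y", "R_Z"}, {"L1", "L2"}, edges))
# -> {X, R_Z}, {Y, Z}, {R_X}, {R_Y} (in some order), matching Eq. (10) here.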
To further utilize Lemma 1, we will consider the variables in V_m \ S as latent variables and assume that V_o ∪ S ∪ R in G(V_o ∪ S ∪ R, L ∪ (V_m \ S)) is partitioned into c-components C_1, ..., C_m. Then

P(v_o, s, r_{V_m \ S}, R_S = 0) = \prod_i Q[C_i]|_{R_S = 0}.    (12)

Example 6 In Fig. 1(b) we have (for convenience, we use Q[Y, Z] to denote Q[{Y, Z}] and Q[X, R_Z = 0] to denote Q[{X, R_Z}]|_{R_Z = 0})

P(z, r_X, r_Y, R_Z = 0) = \sum_{x,y} Q[X, R_Z = 0] Q[Y, Z] P(r_X | y) P(r_Y | z)    (13)
= P(r_Y | z) \sum_{x,y} Q[X, R_Z = 0] Q[Y, Z] P(r_X | y).    (14)

In light of Lemma 1, we ask a similar question: given the expression in Eq. (12), is a factor Q[C_i] computable in terms of the given P(v_o, s, r_{V_m \ S}, R_S = 0)? The main difference from the situation in Lemma 1 is that the variables in R_S take a fixed value. Next we extend Lemma 1 to the situation in which we are given not P(O) but P(O \ S, S = 0) for some S ⊂ O. For any set C, let An(C) denote the union of C and the set of ancestors of the variables in C.

Lemma 2 Given a DAG G(O, L), assuming that O is partitioned into c-components S_1, ..., S_k, we have, for any S ⊆ O:

(i) P(O \ S, S = 0) = \prod_i Q[S_i]|_{S=0}.    (15)

(ii) If S_j ∩ An(S) = ∅, that is, S_j contains no ancestors of S, then Q[S_j]|_{S=0} is computable from P(O \ S, S = 0). In this case, letting a topological order over O be O_1 < ... < O_n such that non-ancestors of S are ordered after ancestors of S, i.e., An(S) < O \ An(S), Q[S_j]|_{S=0} is given by

Q[S_j]|_{S=0} = \prod_{\{i | O_i ∈ S_j\}} P(o_i | o_{≤i−1})|_{S=0}.    (16)

Proof: By Lemma 1, (i) holds and each Q[S_j]|_{S=0} can be expressed by Eq. (16). If S_j contains no ancestors of S, then all variables in S_j are ordered after S. As a consequence, S ⊆ O_{≤i−1}, and therefore each term P(o_i | o_{≤i−1})|_{S=0} in Eq. (16) is computable from P(O \ S, S = 0).

Example 7 Consider the m-graph in Fig. 1(a). Eq. (12) becomes, for S = {Y},

P(y, r_X, R_Y = 0) = P(r_X | R_Y = 0, y) P(y) [\sum_x P(R_Y = 0 | x) P(x)].    (17)

R_X, Y, and R_Y each forms a c-component by itself in G({Y, R_X, R_Y}, {X}). By Lemma 2, the c-factors P(r_X | R_Y = 0, y) and P(y) are computable from P(y, r_X, R_Y = 0) because neither R_X nor Y is an ancestor of R_Y. We also obtain that [\sum_x P(R_Y = 0 | x) P(x)] is recoverable, by virtue of both P(r_X | R_Y = 0, y) and P(y) being recoverable. Now for S = {X}, with Y considered a latent variable,

P(x, r_Y, R_X = 0) = P(r_Y | x) P(x) [\sum_y P(R_X = 0 | r_Y, y) P(y)],    (18)

and none of the three c-factors is computable from P(x, r_Y, R_X = 0), because R_Y, X, and R_X are all ancestors of R_X. For S = ∅, with both X and Y considered latent variables, we have

P(r_X, r_Y) = [\sum_x P(r_Y | x) P(x)] [\sum_y P(r_X | r_Y, y) P(y)] = Q[R_Y] Q[R_X],    (19)

and both Q[R_Y] and Q[R_X] are computable from P(R_X, R_Y) based on Lemma 1 (or 2).
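Condition (ii) of Lemma 2, which Example 7 applies repeatedly, is a mechanical ancestor check. A minimal sketch (ours; the edge set X → R_Y → R_X ← Y is our reading of Fig. 1(a)):

def ancestors(nodes, edges):
    """An(nodes): the given nodes plus all their ancestors in the DAG."""
    result = set(nodes)
    changed = True
    while changed:
        changed = False
        for u, v in edges:
            if v in result and u not in result:
                result.add(u)
                changed = True
    return result

def q_factor_computable(s_j, zero_set, edges):
    """Lemma 2(ii): Q[S_j], evaluated at S = 0, is computable from
    P(O \\ S, S = 0) whenever S_j contains no ancestor of the fixed set S."""
    return not set(s_j) & ancestors(zero_set, edges)

edges = {("X", "R_Y"), ("R_Y", "R_X"), ("Y", "R_X")}   # Fig. 1(a), our reading
print(q_factor_computable({"R_X"}, {"R_Y"}, edges))    # True, as used for Eq. (17)
print(q_factor_computable({"X"}, {"R_X"}, edges))      # False, as noted for Eq. (18)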
Algorithm REJ
1. Let the c-components of G_V be A_1, ..., A_k.
2. For every Q[A_i]: call REQ(G, P, Q[A_i]).
3. P(V) is recoverable as P(v) = \prod_i Q[A_i] if every Q[A_i] is recoverable.

Function REQ(G', P', Q[C])
OUTPUT: expression for Q[C], or FAIL.
1. Assume that Q[C] is a function over W. Let S = W ∩ V_m and O = V_o ∪ S ∪ R.
2. IF C forms a c-component in G'(O, L ∪ (V_m \ S)) and C ∩ An(R_S) = ∅, THEN RETURN Q[C] recoverable as given in Lemma 2.
3. Let T = (An(C) ∪ An(R_S)) ∩ O and D = O \ T. IF D ≠ ∅, THEN let G'' be the graph resulting from removing D from G' and RETURN REQ(G'', \sum_D P', Q[C]).
4. (a) For each c-component C_i of G'(O, L ∪ (V_m \ S)) such that C_i ∩ An(R_S) = ∅: Q[C_i] is recovered by Lemma 2. Let G'' be the graph resulting from removing C_i from G' and RETURN REQ(G'', P'/Q[C_i], Q[C]).
   (b) For each c-component C_i of G'(O, L ∪ (V_m \ S)) that does not contain C and has Pa(C_i) ∩ V_m ≠ S: assume Q[C_i] = \sum_{v_m \ s} \prod_j Q[B_j], where each B_j is a c-component of G(V ∪ R, L). IF Q[C_i] is recovered by REQ(G, P, Q[C_i]), or every Q[B_j] is recovered by REQ(G, P, Q[B_j]), THEN let G'' be the graph resulting from removing C_i from G' and RETURN REQ(G'', P'/Q[C_i], Q[C]).
5. RETURN FAIL.

Figure 2: Algorithm for recovering P(V).

4.2 Recoverability algorithm

Equipped with Lemmas 1 and 2, we are now ready to develop a systematic algorithm for recovering the joint P(V). The basic idea is first to decompose P(v) into a product of c-factors Q[S_i], and then to attempt to recover each Q[S_i] by applying Lemma 2 to the observed probabilities P(V_o, S, R_{V_m \ S}, R_S = 0).

Example 8 In the m-graph in Fig. 1(a), P(x, y) = P(x) P(y) is recovered if both P(x) and P(y) are recovered. P(y) can be recovered from P(y, r_X, R_Y = 0), as shown in Example 7. However, P(x) is not computable from P(x, r_Y, R_X = 0) (see Eq. (18)) because X is an ancestor of R_X. On the other hand, Q[R_X] has been shown to be computable from P(R_X, R_Y) in Example 7. We rewrite Eq. (18) as

P(x, r_Y, R_X = 0) / Q[R_X]|_{R_X = 0} = P(r_Y | x) P(x).    (20)

Now P(x) is computable from the recoverable quantity on the left-hand side of the equation as

P(x) = \sum_{r_Y} P(x, r_Y, R_X = 0) / Q[R_X]|_{R_X = 0}.    (21)

Intuitively, P(r_Y | x) and P(x) are c-factors of the subgraph over {R_Y, X, Y} formed by removing the variable R_X from the original m-graph, and both are recoverable by Lemma 1 (or 2).
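Eq. (21) can be checked numerically. The following continues the simulation sketch from Section 3 (ours, with invented parameters; redefined here so the snippet stands alone):

import random

random.seed(1)

def sample():                        # same illustrative model for Fig. 1(a)
    x = random.random() < 0.3
    y = random.random() < 0.6
    r_y = random.random() < (0.7 if x else 0.1)
    r_x = random.random() < ((0.6 if y else 0.3) if r_y else (0.4 if y else 0.1))
    return x, y, r_x, r_y

data = [sample() for _ in range(200_000)]

def p(pred, cond=lambda d: True):
    sub = [d for d in data if cond(d)]
    return sum(pred(d) for d in sub) / len(sub)

# Q[R_X] at R_X = 0 equals P(R_X=0 | r_Y), estimable from P(R_X, R_Y) alone
# (Example 7, Eq. (19)).
def q_rx0(r_y):
    return p(lambda d: not d[2], lambda d: d[3] == r_y)

# Eq. (21): P(x=1) = sum_{r_Y} P(x=1, r_Y, R_X=0) / Q[R_X]|_{R_X=0}
p_x = sum(p(lambda d: d[0] and d[3] == r and not d[2]) / q_rx0(r)
          for r in (False, True))
print(f"recovered P(x=1) ≈ {p_x:.3f}   (truth: 0.300)")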
For any set C, let G_C denote the subgraph of G composed only of the variables in C. We propose a systematic algorithm, REJ, for recovering P(V), presented in Fig. 2. REJ works by first decomposing P(V) into a product of c-factors Q[S_i] and then attempting to recover each Q[S_i] using a subroutine REQ. REQ works by applying Lemma 2 and systematically reducing the problem to simpler ones in subgraphs.

Theorem 1 Algorithm REJ is sound.

The proof of Theorem 1 is given in the Supplementary Material.

Example 9 In the m-graph in Fig. 1(b), we want to recover P(x, y, z) = P(x) Q[Y, Z]. First we attempt to recover Q[Y, Z] from

P(x, y, z, R_{X,Y,Z} = 0) = Q[X, R_Z = 0] Q[Y, Z] P(R_X = 0 | y) P(R_Y = 0 | z).    (22)

Step 4 of REQ calls for recovering P(R_Y = 0 | z) from P(z, r_X, r_Y, R_Z = 0), given in Eq. (14). We have that P(r_Y | z) is recoverable from the base case of Step 2 by Lemma 2. Step 4 of REQ also calls for recovering P(R_X = 0 | y) from

P(y, r_X, r_Z, R_Y = 0) = P(r_X | y) \sum_{x,z} Q[X, R_Z] Q[Y, Z] P(R_Y = 0 | z).    (23)

Again, P(r_X | y) is recoverable from the base case of Step 2 by Lemma 2. Then Step 4 reduces the problem to recovering Q[Y, Z] from

P(x, y, z, R_{X,Y,Z} = 0) / (P(R_Y = 0 | z) P(R_X = 0 | y)) = Q[X, R_Z = 0] Q[Y, Z].    (24)

We obtain that Q[Y, Z] is recoverable from the base case of Step 2 by Lemma 2. Next we use REQ to recover P(x) from

P(x, r_Y, r_Z, R_X = 0) = Q[X, R_Z] \sum_{y,z} Q[Y, Z] P(R_X = 0 | y) P(r_Y | z).    (25)

Step 3 reduces the problem to recovering P(x) from, by summing out R_Z and R_Y,

P(x, R_X = 0) = P(x) \sum_{y,z} Q[Y, Z] P(R_X = 0 | y).    (26)

We have shown that both Q[Y, Z] and P(R_X = 0 | y) are recoverable, and therefore P(x) is recoverable. In conclusion, P(x, y, z) is recoverable.

It is natural to ask whether the algorithm REJ presented in Fig. 2 is complete, that is, whether an output of FAIL implies that P(V) is not recoverable. We are not able to answer this difficult question at this point. We find the algorithm promising in that it has pinned down situations in which recoverability seems impossible. We consider the result a significant advance over the existing sufficient conditions in the literature [1, 2].

5 Conclusion

It is of theoretical interest and practical importance to determine in principle whether a probabilistic query is estimable from missing data when the data are not MAR. In this paper we present an algorithm for systematically determining the recoverability of the joint distribution from observed data with missing values, given an m-graph with latent variables. The result is significantly more general than the sufficient conditions in [1, 2]. Compared to the result in [11] for Markov models, we allow latent variables in the model and treat the problem in a purely probabilistic framework without appealing to causality theory. Our algorithm is of course applicable to Markov models, for which it can be simplified. We have also developed new, simple sufficient conditions that can be used to quickly recover the joint in Markov models; these results are presented in the Supplementary Material.

Future work includes developing algorithms for recovering arbitrary probabilistic queries such as P(x | y), and for recovering causal queries such as P(y | do(x)). It is also an interesting research direction to investigate how to estimate distribution parameters from a finite amount of data once the joint is determined to be recoverable [17].

References

[1] Karthika Mohan, Judea Pearl, and Jin Tian. Graphical models for inference with missing data. In Advances in Neural Information Processing Systems (NIPS 2013), pages 1277–1285, 2013.

[2] K. Mohan and J. Pearl. Graphical models for recovering probabilistic and causal queries from missing data. In M. Welling, Z. Ghahramani, C. Cortes, and N. Lawrence, editors, Advances in Neural Information Processing Systems 27 (NIPS Proceedings), pages 1520–1528, 2014.

[3] Donald B. Rubin. Inference and missing data. Biometrika, 63(3):581–592, 1976.

[4] Roderick J. A. Little and Donald B. Rubin. Statistical Analysis with Missing Data. Wiley, 2002.

[5] Manfred Jaeger. The AI&M procedure for learning from incomplete data. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), pages 225–232, 2006.

[6] Benjamin Marlin, Richard S. Zemel, Sam Roweis, and Malcolm Slaney. Collaborative filtering and the missing at random assumption. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 2007.
[7] Benjamin M. Marlin and Richard S. Zemel. Collaborative prediction and ranking with non-random missing data. In Proceedings of the Third ACM Conference on Recommender Systems, pages 5–12. ACM, 2009.

[8] Benjamin M. Marlin, Richard S. Zemel, Sam T. Roweis, and Malcolm Slaney. Recommender systems, missing data and statistical model estimation. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2011.

[9] Karthika Mohan and Judea Pearl. On the testability of models with missing data. In Proceedings of AISTATS 2014, 2014.

[10] J. Tian. Missing at random in graphical models. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS 2015), 2015.

[11] I. Shpitser, K. Mohan, and J. Pearl. Missing data as a causal and probabilistic problem. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 2015.

[12] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, New York, 2000.

[13] Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

[14] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo, CA, 1988.

[15] J. Tian and J. Pearl. On the testable implications of causal models with hidden variables. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 2002.

[16] J. Tian and J. Pearl. A general identification condition for causal effects. In Proceedings of the Eighteenth National Conference on Artificial Intelligence (AAAI), pages 567–573, Menlo Park, CA, 2002. AAAI Press/The MIT Press.

[17] G. Van den Broeck, K. Mohan, A. Choi, A. Darwiche, and J. Pearl. Efficient algorithms for Bayesian network parameter learning from incomplete data. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 2015.

6 Supplementary Material

6.1 Proof of Theorem 1

Theorem 2 Algorithm REJ is sound.

Proof: Based on Lemma 1, REJ is sound if REQ is sound. Next we show the soundness of REQ. Step 1 specifies the smallest S such that Q[C] could potentially form a c-factor in G(O, L ∪ (V_m \ S)); the algorithm then attempts to recover Q[C] from the observed P(V_o, S, R_{V_m \ S}, R_S = 0). Step 2 is the base case, which is sound by Lemma 2. In Step 3, summing D out of both sides of the following (Eq. (12) in the main paper)

P(v_o, s, r_{V_m \ S}, R_S = 0) = \prod_i Q[C_i]|_{R_S = 0}    (27)

is graphically equivalent to removing the variables in D, by the chain rule of Bayesian networks, since all D variables can be ordered after the T variables. In Step 4 we attempt to recover another c-factor Q[C_i], either from P(V_o, S, R_{V_m \ S}, R_S = 0) by Lemma 2 in 4(a), or from other observed distributions by calling REQ in 4(b). If Q[C_i] is recovered, then given

P(v_o, s, r_{V_m \ S}, R_S = 0) = Q[C_i] \prod_{j ≠ i} Q[C_j]|_{R_S = 0},    (28)

we obtain

P(v_o, s, r_{V_m \ S}, R_S = 0) / Q[C_i] = \prod_{j ≠ i} Q[C_j]|_{R_S = 0}.    (29)

Now the problem of recovering Q[C] is reduced to recovering Q[C] in the subgraph resulting from removing C_i from G, with associated distribution P(v_o, s, r_{V_m \ S}, R_S = 0) / Q[C_i]. We fail to recover Q[C] if Q[C] cannot be recovered by Lemma 2 and the problem cannot be reduced to a smaller one by Steps 3 or 4.
6.2 Recoverability in Markov models

If the m-graph contains no latent variables (L = ∅), the model is called a Markov model. In this situation we have

P(v) = \prod_i P(v_i | pa_i),    (30)

and

P(v_o, v_m, R = 0) = P(v) P(R = 0 | v) = \prod_i P(v_i | pa_i) \prod_j P(R_{V_j} = 0 | pa_{R_{V_j}})|_{R=0},    (31)

where Pa_i and Pa_{R_{V_j}} denote the parents of V_i and R_{V_j} in G, respectively. We have that P(V) is recoverable if every factor P(v_i | pa_i) is recoverable; alternatively, from Eq. (31), P(V) is recoverable if every factor P(R_{V_j} = 0 | pa_{R_{V_j}}) is recoverable.

The algorithm REJ recovers P(v) by attempting to recover every P(v_i | pa_i). We observe that, to recover all P(v_i | pa_i), the subroutine REQ typically needs to recover many P(R_{V_j} = 0 | pa_{R_{V_j}}) for R_{V_j} a descendant of some V_i (the opposite is not true). Therefore, for the purpose of recovering P(v), it is often more efficient to recover all P(R_{V_j} = 0 | pa_{R_{V_j}}) instead. We observe that the MID algorithm in [11] appears to recover P(v) by recovering every P(R_{V_j} = 0 | pa_{R_{V_j}}).

There exist simple sufficient conditions by which we can quickly recover P(R_{V_j} = 0 | pa_{R_{V_j}}). For example, a necessary condition for P(R_{V_j} = 0 | pa_{R_{V_j}}) to be recoverable is that V_j is not a parent of R_{V_j} [1]. We summarize several sufficient conditions in the following proposition.

Proposition 2 Let Pa^o_{R_{V_j}}, Pa^m_{R_{V_j}}, and Pa^r_{R_{V_j}} be the parents of R_{V_j} in G that are V_o variables, V_m variables, and R variables, respectively. P(R_{V_j} | pa^o_{R_{V_j}}, pa^m_{R_{V_j}}, pa^r_{R_{V_j}}) is not recoverable if V_j is a parent of R_{V_j}; otherwise, P(R_{V_j} | pa^o_{R_{V_j}}, pa^m_{R_{V_j}}, pa^r_{R_{V_j}})|_{R=0} is recoverable if one of the following holds:
1. Pa^m_{R_{V_j}} = ∅.
2. Pa^m_{R_{V_j}} is a subset of the set of V_m variables corresponding to Pa^r_{R_{V_j}}.
3. R_{V_j} has no child.
4. None of R_{Pa^m_{R_{V_j}}} is a descendant of R_{V_j}.

Proof: Conditions 1 and 2 are straightforward and are used extensively in [1, 9, 2]. Conditions 3 and 4: P(R_{V_j} | pa^o_{R_{V_j}}, pa^m_{R_{V_j}}, pa^r_{R_{V_j}}) is a c-factor in P(V_o, S, R_{V_m \ S}, R_S = 0) for S = Pa^m_{R_{V_j}}, and it is recoverable by Lemma 2 since R_{V_j} is not an ancestor of R_S.

Based on conditions 3 and 4 above, we present the following sufficient condition for recovering P(V).

Theorem 3 In a Markov model, P(V) is recoverable if no variable X is a parent of its corresponding R_X, and, for each R_X, either R_X has no child or none of the R variables corresponding to its V_m parents is a descendant of R_X.

Example 10 P(X, Y, Z) is recoverable in Fig. 3 by Theorem 3.

Figure 3: P(X, Y, Z) is recoverable.

In general, we propose a systematic algorithm, REJ-M, for recovering P(V) in Markov models, presented in Fig. 4.

Algorithm REJ-M
1. For every V_j ∈ V_m: recover P(R_{V_j} = 0 | pa_{R_{V_j}})|_{R=0} by Proposition 2 if applicable; otherwise call REQ(G, P, P(R_{V_j} = 0 | pa_{R_{V_j}})).
2. P(V) is recoverable if every P(R_{V_j} = 0 | pa_{R_{V_j}}) is recoverable.

Figure 4: Algorithm for recovering P(V) in Markov models.
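Step 1 of REJ-M applies Proposition 2; the graphical condition of Theorem 3 is likewise mechanical to check. A minimal sketch (ours; the two example graphs are hypothetical, not the paper's figures):

def descendants(node, edges):
    """All descendants of `node` in the DAG, excluding the node itself."""
    result, frontier = set(), {node}
    while frontier:
        frontier = {v for (u, v) in edges if u in frontier} - result
        result |= frontier
    return result

def theorem3_recoverable(v_m, edges):
    """Sufficient condition of Theorem 3 on a Markov m-graph.
    `v_m` maps each partially observed variable to its R indicator."""
    for x, r_x in v_m.items():
        if (x, r_x) in edges:                        # X must not be a parent of R_X
            return False
        if not any(u == r_x for (u, _) in edges):    # R_X childless: condition holds
            continue
        vm_parents = {u for (u, v) in edges if v == r_x and u in v_m}
        if any(v_m[q] in descendants(r_x, edges) for q in vm_parents):
            return False                             # some R_P is a descendant of R_X
    return True

vm = {"X": "R_X", "Y": "R_Y"}
ok = {("X", "Y"), ("Y", "R_X"), ("X", "R_Y")}        # both indicators are childless
bad = {("X", "Y"), ("Y", "R_X"), ("R_X", "R_Y")}     # R_Y descends from R_X
print(theorem3_recoverable(vm, ok), theorem3_recoverable(vm, bad))  # True False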
Example 11 In the m-graph in Fig. 5(a), we attempt to recover P(A, B, C, D) by recovering P(R_A = 0 | d, b, R_B = 0) and P(R_B = 0 | d, a). First, P(R_A = 0 | d, b, R_B = 0) is easily recovered by condition 2 or 3 of Proposition 2. Next we call REQ(G, P, P(R_B = 0 | d, a)), which attempts to recover P(R_B = 0 | d, a) from (S = {A})

P(c, d, a, r_B, R_A = 0) = P(a) P(d | c) P(r_B | d, a) [\sum_b P(b) P(c | a, b) P(R_A = 0 | d, b, r_B)].    (32)

In Step 4 of REQ we attempt to recover P(d | c) from

P(c, d, r_A, r_B) = P(d | c) \sum_{a,b} P(a) P(r_B | d, a) P(b) P(c | a, b) P(r_A | d, b, r_B),    (33)

which shows that P(d | c) is recoverable by Lemma 2. The problem is reduced to recovering P(R_B = 0 | d, a) in Fig. 5(b) from

P(c, d, a, r_B, R_A = 0) / P(d | c) = P(a) P(r_B | d, a) [\sum_b P(b) P(c | a, b) P(R_A = 0 | d, b, r_B)].    (34)

C is not an ancestor of R_B or R_A in Fig. 5(b), and Step 3 of REQ reduces the problem to recovering P(R_B = 0 | d, a) in Fig. 5(c) from

\sum_c P(c, d, a, r_B, R_A = 0) / P(d | c) = P(a) P(r_B | d, a) [\sum_b P(b) P(R_A = 0 | d, b, r_B)].    (35)

Fig. 5(c) is the same as Fig. 1(a), for which P(r_B | d, a) is recoverable as shown in Example 7. In fact, Q[R_A = 0] = \sum_b P(b) P(R_A = 0 | d, b, r_B) can be recovered from P(c, d, r_A, r_B) via

\sum_c P(c, d, r_A, r_B) / P(d | c) = [\sum_a P(a) P(r_B | d, a)] [\sum_b P(b) P(r_A | d, b, r_B)],    (36)

from which Q[R_A] is recoverable by Lemma 2 (or 1). Finally, the problem is reduced to recovering P(R_B = 0 | d, a) from

(1 / Q[R_A = 0]) \sum_c P(c, d, a, r_B, R_A = 0) / P(d | c) = P(a) P(r_B | d, a),    (37)

from which it is clear that P(R_B = 0 | d, a) is recoverable by Lemma 2.

Figure 5: P(A, B, C, D) is recoverable.

This example was used to demonstrate the MID algorithm in [11]. We suspect that our REJ-M algorithm works in a way somewhat similar to MID, but we use a purely probabilistic framework without appealing to causality techniques.