Entropic Causal Inference
Murat Kocaoglu (1,*), Alexandros G. Dimakis (1,†), Sriram Vishwanath (1,‡) and Babak Hassibi (2,§)

1 Department of Electrical and Computer Engineering, The University of Texas at Austin, USA
2 Department of Electrical Engineering, California Institute of Technology, USA
* mkocaoglu@utexas.edu, † dimakis@austin.utexas.edu, ‡ sriram@ece.utexas.edu, § hassibi@systems.caltech.edu

November 16, 2016

Abstract

We consider the problem of identifying the causal direction between two discrete random variables using observational data. Unlike previous work, we keep the most general functional model but make an assumption on the unobserved exogenous variable: Inspired by Occam's razor, we assume that the exogenous variable is simple in the true causal direction. We quantify simplicity using Rényi entropy. Our main result is that, under natural assumptions, if the exogenous variable has low H_0 entropy (cardinality) in the true direction, it must have high H_0 entropy in the wrong direction. We establish several algorithmic hardness results about estimating the minimum-entropy exogenous variable. We show that the problem of finding the exogenous variable with minimum entropy is equivalent to the problem of finding the minimum joint entropy given n marginal distributions, also known as the minimum entropy coupling problem. We propose an efficient greedy algorithm for the minimum entropy coupling problem that, for n = 2, provably finds a local optimum. This gives a greedy algorithm for finding the exogenous variable with minimum H_1 (Shannon entropy). Our greedy entropy-based causal inference algorithm has performance similar to the state-of-the-art additive noise models on real datasets. One advantage of our approach is that we make no use of the values of the random variables but only their distributions. Our method can therefore be used for causal inference on both ordinal and categorical data, unlike additive noise models.

1 Introduction

Causality has been studied under several frameworks, including potential outcomes [22] and structural equation modeling [19]. Under the Pearlian framework [19] it is possible to discover some causal directions between variables using only observational data with conditional independence tests. The PC algorithm [27] and its variants fully characterize which causal directions can be learned in the general case. For large graphs, the GES algorithm [3] provides a score-based test to greedily identify the highest-scoring causal graph given the data. Unfortunately, these approaches do not guarantee recovery of the true causal direction between every pair of variables, since typically the data could have been generated by several statistically equivalent causal graphs.

A general solution to the causal inference problem is to conduct experiments, also called interventions. An intervention forces the value of a variable without affecting the other system variables. This removes the effect of its causes, effectively creating a new causal graph. These changes in the causal graph create a post-interventional distribution among variables, which can be used to identify some additional causal relations in the original graph. The procedure can be applied repeatedly to fully identify any causal graph [8], [9], [11], [25].
Unfortunately, for many problems, it can be very difficult to create interventions since they require additional experiments after the original data collection. Researchers would still like to discover causal relations between variables using only observational data, so-called data-driven causality. Several recent works [2, 24] have developed such methods. To be able to make any conclusions on causal directions in this case, additional assumptions must be made about the mechanisms that generate the data.

In this paper we focus on the simplest causal discovery problem, which involves only two variables. The two causal graphs X → Y and X ← Y are statistically indistinguishable, so conditional independence tests cannot make any causal inference from observational data without interventions. Statistical indistinguishability easily follows from the fact that any joint distribution on two variables p(x, y) can be factorized both as p(x)p(y|x) and p(y)p(x|y).

The most popular assumption for two-variable data-driven causality is the additive noise model (ANM) [26]. In ANM, any outside factor is assumed to affect the effect variable additively, which leads to the equation Y = f(X) + E, E ⊥⊥ X. Although restrictive, this assumption leads to strong theoretical guarantees in terms of identifiability, and provides state-of-the-art accuracy on real datasets. [26] showed that if f is linear and the noise is non-Gaussian, the causal direction is identifiable. [10] showed that when f is non-linear, irrespective of the noise, identifiability holds in a non-adversarial setting of system parameters. [20] extended ANM to discrete variables.

Another approach is to exploit the postulate that the cause and the mechanism are in general independently assigned by nature. The notion of independence here is vague, and one needs to assign maps or conditional distributions to random variables to argue about independence of cause and mechanism. In this direction an information-geometry based approach is suggested [12]. Independence of cause and mechanism is captured by treating the log-slope of the function as a random variable, and assuming that it is independent from the cause. In the case of a deterministic relation Y = f(X), there are theoretical guarantees on identifiability. However, this assumption is restrictive for real data.

Previous work exploited these two ideas, additive noise and independence of cause and mechanism, to draw data-driven causal conclusions about problems in a diverse range of areas from astronomy to neuroscience [24], [23]. [24] uses the same idea that the cause and effect are independent in the time series of a linear filter. They suggest the spectral independence criterion, which is robust to time shifts. [2] uses kernel space embeddings with the assumption that the cause distribution p(x) and mechanism p(y|x) are selected independently to distinguish cause from effect. As noted by [2], although conceptually proposed before, using the Kolmogorov complexity of the factorizations of the joint distribution, p(y|x)p(x) and p(x|y)p(y), as a criterion for deciding causal direction had not been used successfully until now. The use of information theory as a tool for causal discovery is currently gaining increasing attention.
This is through different approaches, e.g., for time-series data, Granger causality and directed information can be used [7, 5, 21]; see also [14]. However, researchers have not used entropy as a measure of simplicity in the causal discovery literature, probably because the entropies H(Y|X) and H(X|Y) do not give us any more information than H(X) and H(Y), due to the symmetry H(Y) + H(X|Y) = H(X) + H(Y|X). In our work, as we will explain, we minimize H(E), which initially sounds similar but is fundamentally different from H(Y|X). Entropy has found some additional uses in the causality literature recently: In [6], the authors use maximum mutual information between X, Y in order to quantify the causal strength of a known causal graph.

The work that is most similar to ours in spirit is [16], which also drops the additive noise assumption. Their approach and setup are different in many ways: The authors work with continuous data. To be able to handle this generic form, they have to make strong assumptions on the exogenous variable, the function, and the distribution of the cause: [16] assumes that the exogenous variable is a standard Gaussian, a Gaussian mixture prior for the cause, and a Gaussian process as the prior of the function.

1.1 Our contributions

In this paper, we propose a novel approach to the causal identifiability problem for discrete variables. Similar to [16], we keep the most general functional model, but only put an assumption on the exogenous (background) variable. Based on Occam's razor, we employ a simplicity assumption on the unobserved exogenous variable. We use Rényi entropy, which is defined as

H_a(X) = \frac{1}{1-a} \log\Big(\sum_i p_i^a\Big),

for a random variable X with state probabilities p_i. We focus on two special cases of Rényi entropy: H_0, which corresponds to the logarithm of the number of states, and H_1, which corresponds to Shannon entropy; our framework can be extended, however. Specifically, if the true causal direction is X → Y, then the random variable Y is an arbitrary function of X and an exogenous variable E: Y = f(X, E), where E is independent from the cause X. Our key assumption is that the exogenous variable E is simple, i.e., has low Rényi entropy. The postulate is that for any model in the wrong direction X = f′(Y, Ẽ), the exogenous variable Ẽ has high Rényi entropy. We are able to prove this result for the H_0 special case of Rényi entropy, assuming generic distributions for X, Y. Furthermore, we empirically show that using H_1 (Shannon entropy) we obtain practical causality tests that work with high probability on synthetic datasets and slightly outperform the previous state of the art on real datasets.

Our assumption is an entropic interpretation of Occam's razor, motivated by what E represents in the causal model. The exogenous variable captures the combined effect of all the variables not included in the system model which affect the distribution of Y. Our causal assumption can be stated as "there should not be too much complexity not included in the causal model". For a → 1, i.e., Shannon entropy, H(X) + H(E) and H(Y) + H(Ẽ) are the number of random bits required to generate an input for the causal system X → Y and X ← Y, respectively.
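To make the Rényi entropy definition above concrete, here is a minimal Python sketch (ours, not the authors' code) that evaluates H_a, with H_0 and H_1 treated as the a → 0 and a → 1 special cases:

```python
# Sketch (not from the paper): Rényi entropy H_a of a discrete distribution,
# with the H_0 (log-cardinality) and H_1 (Shannon) special cases.
import numpy as np

def renyi_entropy(p, a):
    """H_a(p) = (1/(1-a)) * log2(sum_i p_i^a), with limits at a = 0 and a = 1."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # zero-probability states carry no mass
    if a == 0:                         # H_0: log of the support size (cardinality)
        return np.log2(p.size)
    if a == 1:                         # H_1: Shannon entropy, the a -> 1 limit
        return -np.sum(p * np.log2(p))
    return np.log2(np.sum(p ** a)) / (1.0 - a)

p = [0.7, 0.2, 0.1]
print(renyi_entropy(p, 0), renyi_entropy(p, 1), renyi_entropy(p, 2))
```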
The simplest explanation of an observed joint distribution, i.e., the direction which requires nature to generate the smaller number of random bits, is selected as the true causal model. More precisely, we have the following:

Assumption 1. Entropy of the exogenous variable E is small in the true causal direction.

The notions of simplicity that we consider are H_0, which is log-cardinality, and H_1, which is Shannon entropy. One significant advantage of using Shannon entropy as a simplicity metric is that it can be estimated more robustly in the presence of measurement errors, unlike cardinality H_0. We prove an identifiability result for H_0 entropy, i.e., the cardinality of E: If the probability values are not adversarially chosen, for most functions, the true causal direction is identifiable under Assumption 1. Based on experimental evidence, we conjecture that a similar identifiability result must hold for Shannon entropy H_1.

To use our framework we need algorithms that explain a dataset by finding an exogenous variable E with minimum cardinality H_0 and minimum Shannon entropy H_1. Since the entropies of X and Y can be very different, any metric to determine the true causal direction cannot only consider the entropy of the exogenous variable without incorporating the entropy of the cause. We explain the exogenous variable in both directions and declare the causal direction to be the one with the smaller joint entropy, H_a(X) + H_a(E) versus H_a(Y) + H_a(Ẽ). Our method can be applied for any Rényi entropy H_a, but in this paper we only use a = 0 and a = 1. Unfortunately, minimizing H_0(E) seems very hard for real datasets since it offers no noise robustness. For Shannon entropy we can do much better for real data. The first step in obtaining a practical algorithm is showing that the minimum H_1 explanation is equivalent to the following problem: For n random variables with given marginal distributions, find a joint distribution with the minimum Shannon entropy that is consistent with the given marginals. This problem is called minimum Shannon entropy coupling and is known to be NP hard [13]. We propose a greedy approximation algorithm for this problem that empirically performs very well. We also prove that, for n = 2, our algorithm always produces a local minimum.

In summary, our contributions in this paper include:

• We show identifiability for generic low-entropy causal models under Assumption 1 with H_0.
• We show that the problems of identifying the minimum cardinality (H_0) exogenous variable and identifying the minimum Shannon entropy (H_1) exogenous variable given a joint distribution are both NP hard.
• We design a novel greedy algorithm for the minimum entropy coupling problem, which turns out to be equivalent to the problem of finding the exogenous variable with minimum H_1 entropy.
• We empirically validate the conjecture that the causal direction is identifiable under Assumption 1 with H_1, using experiments on synthetic datasets.
• We empirically show that our causal inference algorithm based on Shannon entropy minimization has slightly better performance than the existing best algorithms on a real causal dataset.
Interestingly, our algorithm uses only the probability distributions rather than the actual values of the random variables, and hence is applicable to categorical variables.

1.2 Background and Notation

A tuple M = (X, U, F, D, p) is a causal model when 1) F = {f_i} are deterministic functions, 2) X = {X_i} are a set of endogenous (observed) variables and U = {U_i} are a set of exogenous (latent) variables with X_i = f_i(Pa_i, U_i), ∀i, where Pa_i are the endogenous parents and U_i is the exogenous parent of X_i in the directed acyclic graph D, and 3) the U are mutually independent with respect to p. The observable variable set X has a joint distribution implied by the distributions of U and the functional relations f_i. D is then a Bayesian network for the induced joint distribution of the endogenous variables. A standard assumption employed in Pearl's model, causal sufficiency, is also used here: Every exogenous variable is a direct parent of at most one endogenous variable.

In this paper, we consider a simple two-variable causal system which contains only two endogenous variables X, Y. Assume X causes Y, which is represented as X → Y. The model is determined by only one exogenous variable E and a function f, where Y = f(X, E). The probability distributions of X and E, together with f, determine the distribution of Y. This model is shown by the tuple M = ({X, Y}, E, f, X → Y, p_{X,E}). Notice that we do not assign an exogenous variable to X, since it is the source node in the graph.

We denote the set {1, 2, ..., n} by [n]. In a summation Σ_i x_i, the index is meant to run through every possible value. log refers to the logarithm base 2. For two variables X, Y, Y|X and X|Y denote the conditional probability distribution matrices, i.e., Y|X(i, j) = p(y = i | x = j) and X|Y(i, j) = p(x = i | y = j). The statistical independence of two random variables X and E is shown by X ⊥⊥ E. For notational convenience, the probability distribution of the random variable X is shown by p(x) as well as p_X(x). x shows the distribution of X in vector form, i.e., x_i = x(i) = P(X = i). The (n−1)-simplex is the set of points x in n-dimensional Euclidean space that satisfy Σ_i x(i) = 1. card is the cardinality of a set.
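As a small illustration of this notation, the following sketch (ours, added for exposition; the joint table J is an illustrative example, not from the paper) builds the conditional matrices Y|X and X|Y from a joint distribution, with each column a conditional distribution as defined above:

```python
# Sketch, following the notation of Section 1.2: given a joint table J with
# J[x, y] = p(X = x, Y = y), build Y|X and X|Y, where (Y|X)[i, j] = p(y=i | x=j),
# so each *column* of either matrix is a conditional distribution.
import numpy as np

def conditional_matrices(J):
    J = np.asarray(J, dtype=float)
    y_given_x = (J / J.sum(axis=1, keepdims=True)).T  # column j is P(Y | X = j)
    x_given_y = J / J.sum(axis=0, keepdims=True)      # column j is P(X | Y = j)
    return y_given_x, x_given_y

J = np.array([[0.30, 0.10],
              [0.05, 0.55]])
Y_X, X_Y = conditional_matrices(J)
print(Y_X.sum(axis=0), X_Y.sum(axis=0))   # each column sums to 1
```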
2 Causal Model with Minimum Cardinality Exogenous Variable

Consider the causal model M = ({X, Y}, E_0, f_0, X → Y, p_{X,E}). The task is to identify the underlying causal graph X → Y using independent identically distributed samples {(x_i, y_i)}_i. Assuming causal sufficiency, this task reduces to deciding whether X causes Y or Y causes X. To isolate the identifiability problem from estimation errors due to finite samples, we assume that the joint distribution of (X, Y) is available. Most proofs are deferred to the Appendix.

One way to identify that X causes Y is by showing that although there exist a function f and a random variable E with Y = f(X, E), X ⊥⊥ E, there is no function, random variable pair (g, Ẽ) such that X = g(Y, Ẽ), Y ⊥⊥ Ẽ. However, without more assumptions, this is not possible: For any joint distribution one can find valid causal models for both X → Y and X ← Y. This is widely known, although for completeness we provide a proof (Lemma 4 in the Appendix).

Even when the true causal graph is known, one can create different constructions of f, E with Y = f(X, E), X ⊥⊥ E. There is no way to distinguish the true causal model. However, even though we cannot recover the actual function and the exogenous variable, we can still show identifiability. First, we give an equivalent characterization of a causal model on two variables.

Definition 1 (Block Partition Matrices). Consider a matrix M ∈ {0,1}^{n²×m}. Let m_{i,j} represent the (i + (j−1)n)-th row of M. Let S_{i,j} = {k ∈ [m] : m_{i,j}(k) ≠ 0}. M is called a block partition matrix if it belongs to

C := { M : M ∈ {0,1}^{n²×m}, ∪_{i∈[n]} S_{i,j} = [m], S_{i,j} ∩ S_{l,j} = ∅, ∀ i ≠ l }.

C thus stands for the 0,1 matrices with n² rows and m columns where each block of n rows corresponds to a partitioning of the set [m]. We make the following key observation:

Lemma 1. Given discrete random variables X, Y with distribution p(x, y), there exists a causal model M = ({X, Y}, E, f, X → Y, p_{X,E}), E ∈ ℰ, with card(E) = m if and only if there exist M ∈ C and e ∈ ℝ^m_+ with Σ_i e(i) = 1 that satisfy vec(Y|X) = Me.

In other words, the existence of a causal pair X → Y is equivalent to the existence of a block partition matrix M and a vector e of proper dimensions with vec(Y|X) = Me. For simplicity, assume |𝒳| = |𝒴| = n. We later remove this constraint. We first show that any joint distribution can be explained using a variable E with n(n−1)+1 states.

Lemma 2 (Upper Bound on Minimum Cardinality of E). Let X ∈ 𝒳, Y ∈ 𝒴 be two random variables with joint probability distribution p_{X,Y}(x, y), where |𝒳| = |𝒴| = n. Then there exists a causal model Y = f(X, E), X ⊥⊥ E, that induces p_{X,Y}, where E has support size n(n−1)+1.

We can show that, if the columns of Y|X are uniformly sampled points in the (n−1)-dimensional simplex, then n(n−1) states are also necessary for E (see Proposition 3 in the Appendix). This shows that, unless designed by nature through the causal mechanism, the exogenous variable cannot have small cardinality. Based on this observation, the hope is to prove that in the wrong causal direction, say the truth is X → Y and we find an Ẽ ⊥⊥ Y such that X = g(Y, Ẽ) for some g, the exogenous variable Ẽ has to have large cardinality. In the next section, we show this is actually true, under mild conditions on f.

2.1 Identifiability for H_0 entropy

In a causal system Y = f(X, E), nature chooses the random variables X, E and the function f, and the conditional probability distributions are then determined by these. We are interested in the cardinality of variables Ẽ ⊥⊥ Y in the wrong causal direction X = g(Y, Ẽ). Considering X|Y, we can show that the same lower bound of n(n−1) still holds even though nature now chooses E and X randomly, rather than choosing the columns of X|Y directly. A mild assumption on f is needed to avoid degenerate cases (for counterexamples, see the Appendix).

Definition 2 (Generic Function). Let Y = f(X, E), where the variables X, Y, E have supports 𝒳, 𝒴, ℰ, respectively. Let S_{y,x} = f_x^{-1}(y) ⊂ ℰ be the inverse map for (x, y), i.e., S_{y,x} = {e ∈ ℰ : y = f(x, e)}. A function f is called "generic" if for each (x_1, x_2, y) triple f_{x_1}^{-1}(y) ≠ f_{x_2}^{-1}(y), and for every (x, y) pair f_x^{-1}(y) ≠ ∅.
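Definition 2 is easy to check mechanically when f is given as a lookup table. A short sketch (ours; the table F and the function name are illustrative, not from the paper):

```python
# Sketch of the genericity check in Definition 2: f is given as a table F with
# F[x][e] = f(x, e).  f is "generic" if the inverse images S_{y,x} = f_x^{-1}(y)
# are nonempty for every (x, y) and differ across distinct x for every y.
from itertools import combinations

def is_generic(F, n_y):
    n_x, n_e = len(F), len(F[0])
    inv = [[frozenset(e for e in range(n_e) if F[x][e] == y)
            for y in range(n_y)] for x in range(n_x)]
    if any(not inv[x][y] for x in range(n_x) for y in range(n_y)):
        return False                     # some f_x^{-1}(y) is empty
    return all(inv[x1][y] != inv[x2][y]
               for x1, x2 in combinations(range(n_x), 2)
               for y in range(n_y))

# f(x, e) with 2 x-states, 4 e-states, 2 y-states:
F = [[0, 0, 1, 1],
     [0, 1, 0, 1]]
print(is_generic(F, n_y=2))             # True: all inverse images differ
```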
In other words, f is called generic if the y-th row in the x_1-th block of the matrix M in the decomposition vec(Y|X) = Me is different from the y-th row in the x_2-th block, and both are nonzero. This is not a restrictive condition; for example, if the p(y|x) are all different, no two rows of M can be the same. For any given conditional distribution, if the probabilities are perturbed by arbitrarily small continuous noise, the corresponding f will be generic almost surely. We have the following main identifiability result:

Theorem 1 (Identifiability). Consider the causal model M = ({X, Y}, E_0, f_0, X → Y, p_{X,E_0}) where the random variables X, Y have n states, E_0 ⊥⊥ X has θ states, and f is a generic function. If the distributions of X and E are uniformly randomly selected from the (n−1)- and (θ−1)-simplices, then with probability 1, any Ẽ ⊥⊥ Y that satisfies X = g(Y, Ẽ) for some deterministic function g has cardinality at least n(n−1).

Theorem 1 implies that the causal direction is identifiable when the exogenous variable has cardinality < n(n−1):

Corollary 1. Assume that there exists an algorithm A that, given n random variables {Z_i}, i ∈ [n] with distributions {p_i}, i ∈ [n], each with n states, outputs the distribution of the random variable E with minimum cardinality and functions {f_i, i ∈ [n]} where Z_i = f_i(E). Consider the causal pair X → Y where Y = f(X, E_0). Assume that the cardinality of E_0 is less than n(n−1) and f is generic. Then A can be used to identify the true causal direction with probability 1, if X, E_0 are selected uniformly randomly from the proper dimensional simplices.

Proof. Feed the sets of conditional distributions {P(Y|X = i) : i ∈ [n]} and {P(X|Y = i) : i ∈ [n]} to A to obtain E, Ẽ. From Theorem 1, with probability 1, A identifies Ẽ with card(Ẽ) ≥ n(n−1). Then, since card(E) ≤ card(E_0) < card(Ẽ), comparing cardinalities gives the true direction.

Corollary 1 gives an algorithm for finding the true causal direction: Estimate E, Ẽ with minimum H_0 entropy and declare X → Y if |Ẽ| > |E| and declare X ← Y if |Ẽ| < |E|. The result easily extends to the case where X and Y are allowed to have different numbers of states:

Proposition 1 (Inference algorithm). Suppose X → Y. Let X ∈ 𝒳, Y ∈ 𝒴, |𝒳| = n, |𝒴| = m. Assume that A is the algorithm that finds the exogenous variables E, Ẽ with minimum cardinality. Then, if the underlying exogenous variable E_0 satisfies |E_0| < n(m−1), with probability 1 we have |X| + |E| < |Y| + |Ẽ|.

The proof follows from Corollary 1 and by extending the proof of Theorem 1 to different cardinalities for X, Y. Unfortunately, it turns out there does not exist an efficient algorithm A, unless P = NP:

Theorem 2. Given a conditional distribution matrix Y|X, identifying E ⊥⊥ X with minimum support size such that there exists a function f with Y = f(X, E) is NP hard.

The hardness of this problem sets us to search for alternative approaches.
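Since the exact minimum-cardinality oracle A is NP hard by Theorem 2, any practical use of Corollary 1 and Proposition 1 must substitute a heuristic. The sketch below (ours) stands in for A with the constructive peeling from the proof of Lemma 2, which only upper-bounds the minimum cardinality, and then applies the comparison rule of Proposition 1:

```python
# Sketch of the H_0 inference rule in Corollary 1 / Proposition 1, with the
# NP-hard oracle A replaced by the Lemma 2 construction (an upper bound on
# the minimal support size, not the true minimum).
import numpy as np

def support_size_upper_bound(cond, tol=1e-12):
    """Greedy peeling from the proof of Lemma 2: repeatedly remove the
    minimum over columns of each column's maximum entry."""
    G = np.array(cond, dtype=float)      # columns are conditional distributions
    count = 0
    while True:
        r = G.max(axis=0).min()          # min over columns of the column max
        if r <= tol:
            return count
        for j in range(G.shape[1]):
            G[G[:, j].argmax(), j] -= r  # peel r off one maximal entry per column
        count += 1

def infer_direction_H0(Y_X, X_Y):
    card_E  = support_size_upper_bound(Y_X)   # explains X -> Y
    card_Et = support_size_upper_bound(X_Y)   # explains X <- Y
    nx, ny = X_Y.shape[0], Y_X.shape[0]
    return "X->Y" if nx + card_E < ny + card_Et else "X<-Y"

J = np.array([[0.30, 0.10],
              [0.05, 0.55]])                   # toy joint p(x, y)
Y_X = (J / J.sum(axis=1, keepdims=True)).T
X_Y =  J / J.sum(axis=0, keepdims=True)
print(infer_direction_H0(Y_X, X_Y))            # n = 2 is too small for the guarantee
```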
3 Causal Model with Minimum H_1 Entropy

In this section, we propose a way to identify the causal model that explains the observational data with minimum Shannon entropy (entropy, in short). The entropy of a causal model is measured by the number of random bits required to generate its input. In the causal graph X → Y, where Y = f(X, E), we identify the exogenous variable E ⊥⊥ X with minimum entropy. We show that this corresponds to a known problem which has been shown to be NP hard. Later we propose a greedy algorithm.

Notice that H(E) is different from the conditional entropy H(Y|X). Certainly, since Y = f(X, E), H(Y|X) ≤ H(E). The key is that since E is forced to be independent from X, H(E) cannot be lowered to H(Y|X). To see this, we can write H(Y|X) = Σ_i p_X(i) H(Y|X = i), whereas since the conditional probability distribution of Y|X = i is the same as the distribution of f_i(E) for some function f_i, we have H(E) ≥ max_i H(Y|X = i).

3.1 Finding E with minimum entropy

Consider the equation Y = f(X, E), X ⊥⊥ E. Let f_x : ℰ → 𝒴 be the function mapping E to Y when X = x, i.e., f_x(E) := f(x, E). Then P(Y = y|X = x) = P(f_x(E) = y|X = x) = P(f_x(E) = y). The last equality follows from the fact that X ⊥⊥ E. Thus, we can treat the conditional distributions P(Y|X = x) as distributions that emerge by applying some function f_x to some unobserved variable E. Then the problem of identifying E with minimum entropy given the joint distribution p(x, y) becomes equivalent to: given the distributions of the variables f_i(E), find the distribution with minimum entropy (the distribution of E) such that there exist functions f_i which map this distribution to the observed distributions of Y|X = i. It can be shown that H(E) ≥ H(f_1(E), f_2(E), ..., f_n(E)). Regarding f_i(E) as a random variable U_i, the best lower bound on H(E) can be obtained by minimizing H(U_1, U_2, ..., U_n). We can show that we can always construct an E that achieves this minimum. Thus the problem of finding the exogenous variable E with minimum entropy given the joint distribution p(x, y) is equivalent to the problem of finding the minimum entropy joint distribution of the random variables U_i = (Y|X = i), given the marginal distributions p(Y|X = i):

Theorem 3 (Minimum Entropy Causal Model). Assume that there exists an algorithm A that, given n random variables {Z_i}, i ∈ [n] with distributions {p_i}, i ∈ [n], each with n states, outputs the joint distribution over the Z_i consistent with the given marginals with minimum entropy. Then A can be used to find the causal model M = ({X, Y}, E, X → Y, p_{X,E}) with minimum input entropy, given any joint distribution p_{X,Y}.

The problem of minimizing entropy subject to marginal constraints is non-convex. In fact, it is shown in [13] that minimizing the joint entropy of a set of variables given their marginals is NP hard. Thus we have the following corollary:

Corollary 2. Finding the causal model M = ({X, Y}, E, f, X → Y, p_{E,X}) with minimum H(E) that induces a given distribution p(x, y) is NP hard.

For this, we propose a greedy algorithm. Using entropy to identify E instead of cardinality, despite both turning out to be NP hard, is useful since entropy is more robust to noise in the data. In real data, we estimate the probability values from samples, and noise is unavoidable.
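A small numeric illustration (ours) of why H(E) cannot be lowered to H(Y|X), as discussed at the start of this section: the conditional entropy averages the column entropies of Y|X, while any valid E must have entropy at least the largest column entropy:

```python
# Numeric illustration: H(Y|X) = sum_i p_X(i) H(Y|X=i) can be far below
# the constraint H(E) >= max_i H(Y|X=i) that independence of E forces.
import numpy as np

def shannon(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

px  = np.array([0.9, 0.1])
Y_X = np.array([[0.5, 0.99],        # column j is P(Y | X = j)
                [0.5, 0.01]])
col_entropies = [shannon(Y_X[:, j]) for j in range(Y_X.shape[1])]
H_Y_given_X   = float(np.dot(px, col_entropies))   # ~ 0.91 bits
print(H_Y_given_X, max(col_entropies))             # any E needs >= 1 bit here
```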
3.2 A Conjecture on Identifiability with H_1 Entropy

We have the following conjecture, supported by artificial- and real-data experiments in Section 4.

[Figure 1: three panels. (a) "Minimum Joint Entropy by Greedy Algorithm": average, minimum, and maximum gap H*(X_1, X_2, ..., X_n) − max_i H(X_i) versus log_2(n). (b) "Accuracy for Low Entropy E": probability of success versus H(E)/log(n), for n = 4, 8, 16, 32, 64, 128. (c) "Accuracy vs. Decision Percentage": accuracy versus decision rate, with 68% and 95% confidence intervals.]

Figure 1: (a) Performance of the greedy joint entropy minimization algorithm: n distributions, each with n states, are randomly generated for each value of n. As can be seen, the minimum joint entropy obtained by the greedy algorithm is at most 1 bit away from the largest marginal max_i H(X_i). (b) Identifiability with entropy: We generate distributions of X, Y by randomly selecting f, X, E. Probability of success is the fraction of points where H(X, E) < H(Y, Ẽ). As observed, larger n drives the probability of success to 1 when H(E) ≤ log n, supporting Conjecture 1. (c) Real data performance: Decision rate is the fraction of samples for which the algorithm makes a decision for a causal direction. A decision is made when |H(X, E) − H(Y, Ẽ)| > t log_2 n, where t determines the decision rate. Confidence intervals are also provided.

Conjecture 1. Consider the causal model M = ({X, Y}, E, f, X → Y, p_{X,E}) where the discrete random variables X, Y have n states and E ⊥⊥ X has θ states. If the distribution of X is uniformly randomly selected from the (n−1)-dimensional simplex, the distribution of E is uniformly selected from the probability distributions that satisfy H_1(E) ≤ log n + O(1), and f is randomly selected from all functions f : [n] × [θ] → [n], then with high probability any Ẽ ⊥⊥ Y that satisfies X = g(Y, Ẽ) for some deterministic g entails H(X) + H(E) < H(Y) + H(Ẽ).

Proposition 2 (Assuming Conjecture 1). Assume there exists an algorithm A that, given n random variables {Z_i}, i ∈ [n] with distributions {p_i}, i ∈ [n], each with n states, outputs the distribution of the random variable E with minimum entropy and functions {f_i}, i ∈ [n] where Z_i = f_i(E). Consider the causal pair X → Y where Y = f(X, E_0), the cardinality of E_0 is cn for some constant c, and f is selected randomly. Then A can be used to identify the true causal direction with high probability, if X, E_0 are uniformly random samples from the proper dimensional simplices.

3.3 Greedy Entropy Minimization Algorithm

Given m discrete random variables with n states, we provide a heuristic algorithm to minimize their joint entropy given their marginal distributions. The main idea is the following: Each marginal probability constraint must be satisfied. For example, for the case of two variables with distributions p_1, p_2, the i-th row of the joint distribution matrix should sum to p_1(i).
The contribution of a probability mass to the joint entropy only increases when the mass is divided into smaller chunks: −p_1(i) log p_1(i) ≤ −a log a − b log b when p_1(i) = a + b, for a, b ≥ 0. Thus, we try to keep large probability masses intact to ensure that their contribution to the joint distribution is minimized. We propose Algorithm 1. The sorting step is only to simplify the presentation. Hence, although the given algorithm runs in time O(m²n² log n), it can easily be reduced to O(max(mn log n, m²n)) by dropping the sorting step. The algorithm simply proceeds by removing the largest probability mass it can at each round. This makes sure the large probability masses remain intact. One can easily construct the joint distribution using a variant: Instead of sorting, at each step, find r = min_i {max_j {p_i(j)}} and assign r to the element with coordinates (a_i), where a_i = arg max_j p_i(j).

Algorithm 1 Joint Entropy Minimization Algorithm
1: Input: Marginal distributions of m variables, each with n states, in matrix form M = [p_1^T; p_2^T; ...; p_m^T].
2: e = [ ]
3: Sort each row of M in decreasing order.
4: Find the minimum of the maxima of the rows: r ← min_i(p_i(1))
5: while r > 0 do
6:   e ← [e, r]
7:   Update the maximum of each row: p_i(1) ← p_i(1) − r, ∀i
8:   Sort each row of M in decreasing order.
9:   r ← min_i(p_i(1))
10: end while
11: return e.

Lemma 3. Greedy entropy minimization outputs a point with entropy at most log m + log n.

Lemma 3 follows from the fact that the algorithm returns a support of size at most m(n−1)+1. We also prove that, when there are two variables with n dimensions, the algorithm returns a point that satisfies the KKT conditions of the optimization problem, which implies that it is a local optimum (see Proposition 4 in the Appendix).

4 Experiments

In this section, we test the performance of our algorithms on real and artificial data. First, we test the greedy entropy minimization algorithm and show that it performs close to the trivial lower bound. Then, we test our conjecture of identifiability using entropy. Lastly, we test our entropy-minimization based causal identification technique on real data.

In order to test our algorithms, we sample points in proper dimensional simplices, which correspond to distributions for X and E. The distribution of points is uniform for selecting the distribution of X. It is well known that a vector [x_i/Z]_i is uniformly randomly distributed over the simplex if the x_i are i.i.d. exponential random variables with parameter 1 and Z = Σ_i x_i [18]. To sample low-entropy distributions for E, instead of exponential, we use a heavy-tailed distribution for sampling each coordinate. Specifically, we use [e_i/Z]_i, where the e_i are i.i.d. log-normal random variables with parameter σ. We observe that this allows us to sample a variety of distributions with small entropy.

Performance of Greedy Entropy Minimization: We sample distributions for n random variables {X_i}, i ∈ [n], each with n states, and apply Algorithm 1 to minimize their joint entropy. We compare our greedy joint entropy minimization algorithm with the simple lower bound of max_i H(X_i). Figure 1a shows the average, maximum, and minimum excess bits relative to this lower bound.
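A direct Python sketch of Algorithm 1 (ours, written from the pseudocode above; argmax peeling replaces the sorting step, as in the variant described in the text), reproducing the Figure 1a comparison against the lower bound max_i H(X_i):

```python
# Sketch of Algorithm 1 (greedy joint entropy minimization): given m marginals,
# greedily peel off the largest mass r that fits under every marginal at once.
# Returns the distribution e of the coupling's atoms; its Shannon entropy
# upper-bounds the minimum coupling entropy.
import numpy as np

def greedy_min_joint_entropy(marginals, tol=1e-12):
    M = np.array(marginals, dtype=float)     # one row per marginal
    e = []
    while True:
        r = M.max(axis=1).min()              # min over rows of each row's max
        if r <= tol:
            break
        e.append(r)
        for i in range(M.shape[0]):
            M[i, M[i].argmax()] -= r         # remove r from a maximal entry of each row
    return np.array(e)

def shannon(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(0)
n = 16
marginals = rng.exponential(size=(n, n))
marginals /= marginals.sum(axis=1, keepdims=True)   # uniform on the simplex [18]
e = greedy_min_joint_entropy(marginals)
gap = shannon(e) - max(shannon(p) for p in marginals)
print(f"greedy joint entropy exceeds max_i H(X_i) by {gap:.3f} bits")
```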
Contrary to the pessimistic bound of log n bits, the joint entropy is at most 1 bit away from max_i H(X_i) for the given range of n.

Verifying the Entropy-Based Identifiability Conjecture: In this section we empirically verify Conjecture 1. The distributions for X are uniformly randomly sampled from the simplex in n dimensions. We also select f randomly (see implementation details). For the log-normal parameter σ used for sampling the distribution of E from the n(n−1)-dimensional simplex, we sweep the integer values from 2 to 8. This allows us to get distribution samples from different regimes. We only consider the samples which satisfy H(E) ≤ log n.

After sampling E, X, f, we identify the corresponding Y|X and X|Y for Y = f(X, E). We apply greedy entropy minimization on the columns of the induced distributions Y|X, X|Y to get the estimates E, Ẽ for both causal models Y = f(X, E) and X = g(Y, Ẽ), respectively. Figure 1b shows the variation of the success probability, i.e., the fraction of samples which satisfy H(X) + H(E) < H(Y) + H(Ẽ). As observed, as n is increased, the probability of success converges to 1 when H(E) ≤ log n, which supports the conjecture.

Experiments on Real Cause-Effect Pairs: We test our entropy-based causal inference algorithm on the CauseEffectPairs repository [15]. ANMs have been reported to achieve an accuracy of 63% with a confidence interval of ±10% [17]. We also use the binomial confidence intervals as in [4]. The cause-effect pairs show very different characteristics. From the scatter plots, one can observe that they can be a mix of continuous and discrete variables. The challenge in applying our framework to this dataset is choosing the correct quantization. A small number of quantization levels may result in loss of information regarding the joint distribution, and a very large number of states might be computationally hard to work with. We pick the same number of states for both X and Y, and use a uniform quantization that ensures each state of the variables has ≥ 10 samples on average. From the samples, we estimate the conditional transition matrices Y|X and X|Y and feed the columns to the greedy entropy minimization algorithm (Algorithm 1), which outputs an approximation of the smallest-entropy exogenous variable. Later we compare H(X, E) and H(Y, Ẽ) and declare the model with the smallest input entropy to be the true model, based on Conjecture 1. For a causal pair, we invoke the algorithm if |H(X, E) − H(Y, Ẽ)| ≥ t log(n) for a threshold parameter t, which determines the decision rate. Accuracy becomes unstable for very small decision rates, since the number of evaluated pairs becomes too small. At 100% decision rate, the algorithm achieves 64.21%, which is slightly better than the 63% performance of ANM as reported in [17]. In addition, our algorithm only uses probability values, and is applicable to categorical as well as ordinal variables.

Acknowledgements

This research has been supported by NSF Grants CCF 1344179, 1344364, 1407278, 1422549 and ARO YIP W911NF-14-1-0258.

References

[1] Richard Caron and Tim Traynor. The zero set of a polynomial, 2005. [Online].
[2] Zhitang Chen, Kun Zhang, Laiwan Chan, and Bernhard Schölkopf. Causal discovery via reproducing kernel Hilbert space embeddings.
Neural Computation, 26:1484–1517, 2014.
[3] David Maxwell Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3:507–554, 2002.
[4] C. Clopper and E. S. Pearson. The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika, 26:404–413, 1934.
[5] Jalal Etesami and Negar Kiyavash. Discovering influence structure. In IEEE ISIT, 2016.
[6] Weihao Gao, Sreeram Kannan, Sewoong Oh, and Pramod Viswanath. Causal strength via Shannon capacity: Axioms, estimators and applications. In Proceedings of the 33rd International Conference on Machine Learning, 2016.
[7] Clive W. Granger. Investigating causal relations by econometric models and cross-spectral methods. Econometrica: Journal of the Econometric Society, pages 424–438, 1969.
[8] Alain Hauser and Peter Bühlmann. Characterization and greedy learning of interventional Markov equivalence classes of directed acyclic graphs. Journal of Machine Learning Research, 13(1):2409–2464, 2012.
[9] Alain Hauser and Peter Bühlmann. Two optimal strategies for active learning of causal networks from interventional data. In Proceedings of the Sixth European Workshop on Probabilistic Graphical Models, 2012.
[10] Patrik O. Hoyer, Dominik Janzing, Joris Mooij, Jonas Peters, and Bernhard Schölkopf. Nonlinear causal discovery with additive noise models. In Proceedings of NIPS 2008, 2008.
[11] Antti Hyttinen, Frederick Eberhardt, and Patrik Hoyer. Experiment selection for causal discovery. Journal of Machine Learning Research, 14:3041–3071, 2013.
[12] Dominik Janzing, Joris Mooij, Kun Zhang, Jan Lemeire, Jakob Zscheischler, Povilas Daniušis, Bastian Steudel, and Bernhard Schölkopf. Information-geometric approach to inferring causal directions. Artificial Intelligence, 182-183:1–31, 2012.
[13] Mladen Kovacevic, Ivan Stanojevic, and Vojin Senk. On the hardness of entropy minimization and related problems. In IEEE Information Theory Workshop, 2012.
[14] MCI2016. Munich workshop on causal inference and information theory 2016. https://www.lnt.ei.tum.de/events/munich-workshop-on-causal-inference-and-information-theory-2016/. Accessed: 2016-09-14.
[15] J. M. Mooij, D. Janzing, J. Zscheischler, and B. Schölkopf. Cause effect pairs repository, 2016. [Online].
[16] Joris M. Mooij, Oliver Stegle, Dominik Janzing, Kun Zhang, and Bernhard Schölkopf. Probabilistic latent variable models for distinguishing between cause and effect. In Proceedings of NIPS 2010, 2010.
[17] Joris M. Mooij, Jonas Peters, Dominik Janzing, Jakob Zscheischler, and Bernhard Schölkopf. Distinguishing cause from effect using observational data: methods and benchmarks. Journal of Machine Learning Research, 17(32):1–102, 2016.
[18] Shmuel Onn and Ishay Weissman. Generating uniform random vectors over a simplex with implications to the volume of a certain polytope and to multivariate extremes. Annals of Operations Research, 189:331–342, 2011.
[19] Judea Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2009.
[20] Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Causal inference on discrete data using additive noise models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33:2436–2450, 2011.
[21] Christopher Quinn, Negar Kiyavash, and Todd Coleman. Directed information graphs. IEEE Transactions on Information Theory, 61:6887–6909, Dec. 2015.
[22] Donald Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688–701, 1974.
[23] Bernhard Schölkopf, David W. Hogg, Dun Wang, Daniel Foreman-Mackey, Dominik Janzing, Carl-Johann Simon-Gabriel, and Jonas Peters. Removing systematic errors for exoplanet search via latent causes. In Proceedings of the 32nd International Conference on Machine Learning, 2015.
[24] Naji Shajarisales, Dominik Janzing, Bernhard Schölkopf, and Michel Besserve. Telling cause from effect in deterministic linear dynamical systems. In Proceedings of the 32nd International Conference on Machine Learning, 2015.
[25] Karthikeyan Shanmugam, Murat Kocaoglu, Alex Dimakis, and Sriram Vishwanath. Learning causal graphs with small interventions. In NIPS 2015, 2015.
[26] S. Shimizu, P. O. Hoyer, A. Hyvärinen, and A. J. Kerminen. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7:2003–2030, 2006.
[27] Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, Prediction, and Search. A Bradford Book, 2001.

5 Appendix

5.1 Proof of Lemma 1

(⇒) Assume there exists a causal model M = ({X, Y}, E, f, X → Y, p_{X,E}). Without loss of generality, assume ℰ = [m] and p_E = [e_1, e_2, ..., e_m]. We have

p(y|x) = Σ_{e∈ℰ} p(y|x, e) p(e|x) = Σ_{e∈ℰ} p(y|x, e) p(e),   (1)

since E ⊥⊥ X. Define the matrices Y|X_k(i, j) := P(Y = i | X = j, E = k). Then, from (1), we can decompose the conditional probability distribution matrix Y|X as follows:

Y|X = Σ_{k∈ℰ} e_k Y|X_k.   (2)

Since f is deterministic, each value of X is mapped to exactly one value of Y when E is conditioned on. Thus each column of Y|X_k has exactly a single 1, with the remaining entries being zeros. Thus, each entry of Y|X is a subset sum of p_E. Let S_{i,j} represent this subset, i.e., p(y = i | x = j) = Σ_{k∈S_{i,j}} e_k. Notice that the x-th column of Y|X is the conditional distribution P(Y|X = x) = P(f(x, E)|X = x) = P(f(x, E)) = P(f_x(E)), where f_x(E) := f(x, E). Since f_x is a deterministic function, each value in its domain maps to exactly one value in its range. This implies that the sets S_{y,x} = f_x^{-1}(y) are disjoint for fixed x, i.e., S_{i,j} ∩ S_{l,j} = ∅, ∀l ≠ i. Also, since each value of E must be mapped to a value of Y by f_x, the union of the S_{i,j} over i must be the whole support [m]. Define m_{i,j} to be the length-m vector which is 1 in the columns indexed by S_{i,j}. Construct M from the rows m_{i,j} such that m_{i,j} is the (i + (j−1)n)-th row of M. By construction, M is a block partition matrix. Pick e = [e_1, e_2, ..., e_m]. Then we have vec(Y|X) = Me.

(⇐) For the reverse direction, assume there exist matrices M, e with block partition M and Σ_i e_i = 1 such that vec(Y|X) = Me. Define E to be the random variable independent from X with probability distribution p_E := e and support ℰ = [m]. Let w_i be the i-th column of M. Then we have vec(Y|X) = Σ_i e_i w_i. Now de-vectorize Y|X and the w_i to obtain

Y|X = Σ_{i=1}^m e_i U_i,   (3)

where w_i = vec(U_i).
Each column of U_i comes from a distinct size-n block of M, and since M is block partition, each column of U_i contains a single 1, ∀i. Thus each U_i represents a valid map f_i : 𝒳 → 𝒴, where f_i(x) is given by the nonzero row of the x-th column of U_i. A function f with two input variables X, E, where E ∈ ℰ = [m], is completely determined by the set of functions {f(X, 1), f(X, 2), ..., f(X, m)}. Let f(X, e) be the function described by the matrix U_e and f be the function determined by {f(X, e), e ∈ ℰ}. Apply the constructed f to X, E to get Z = f(X, E). Then we have P(Y = y|X = x) = P(Z = y|X = x), ∀x, since both Z and Y induce the same conditional distribution Y|X. For any set of realizations of (X, Y) = {x_i, y_i}, we can construct a set of realizations of E, {e_i}, based on the conditional distribution P(E|Z = y_i, X = x_i). Then we have y_i = f(x_i, e_i). Also, since P(Z = y, X = x) = P(Y = y, X = x), this process induces the same joint distributions between the variables X, Y, E and X, Z, E, i.e., P(X = x, E = e, Z = y) = P(X = x, E = e, Y = y), and a conditional independence statement implied by one holds for the other. Thus X ⊥⊥ E.

5.2 Unidentifiability without assumptions

Here we prove that we can fit causal models in both directions X → Y, Y → X given any joint distribution.

Lemma 4. Let X ∈ 𝒳, Y ∈ 𝒴 be discrete random variables with an arbitrary joint distribution p(x, y), where |𝒳| = m, |𝒴| = n. Then there exist two causal models M_1 = ({X, Y}, E, f, X → Y, p_{X,E}) and M_2 = ({X, Y}, Ẽ, g, X ← Y, p_{Y,Ẽ}) with E ⊥⊥ X and Ẽ ⊥⊥ Y that induce the same joint distribution p(x, y).

We prove this by construction. Consider the conditional probability transition matrix Y|X. Without loss of generality, assume 𝒳 = [m], 𝒴 = [n]. From Lemma 1, it is sufficient to show that there exist M ∈ C and e such that vec(Y|X) = Me. For now, assume that each entry of Y|X is a rational number. Scale the fractional form of each term in Y|X so that each denominator becomes the same as the least common multiple of the denominators. Denote this least common multiple by λ. Let ε be 1/λ. Let G_0 = Y|X and apply the following procedure for i from 1 to λ in order to construct {F_i, i ∈ [λ]}: Set G_i to G_{i−1} − εF_i, where F_i is the n × m, {0,1} matrix containing only a single 1, placed at the largest entry of every column of the current G_{i−1}. The procedure is called successful if G_λ is the all-zero matrix.

The above procedure iteratively removes 1 from the numerator of the fractional form of one probability value per column (of Y|X). Since each column sums to 1, the numerators of each column sum to λ. Thus the procedure is successful, and we have

Y|X = Σ_{k=1}^λ ε F_k.   (4)

Let e = [ε, ε, ..., ε] be a length-λ vector. Each entry of Y|X is a subset sum of e. Also, since each F_i contains a single 1 per column by construction, the subsets of e are disjoint within a column. Thus, we can construct M ∈ C such that vec(Y|X) = Me. If the entries are not rational, we can still find a small enough ε to complete the above procedure. The same process can be implemented with X|Y to obtain Ẽ, g such that X = g(Y, Ẽ).
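The constructive proof above is mechanical enough to run. A sketch (ours) using exact rational arithmetic, for a rational conditional matrix given row by row:

```python
# Sketch of the constructive proof of Lemma 4: a rational conditional matrix
# Y|X is decomposed as sum_k (1/lam) * F_k, where each F_k is a 0/1 matrix
# with a single 1 per column.  Exact fractions avoid floating-point drift.
from fractions import Fraction
from math import lcm

def lemma4_decomposition(cond):
    G = [[Fraction(v) for v in col] for col in zip(*cond)]   # columns of Y|X
    lam = lcm(*[f.denominator for col in G for f in col])
    eps = Fraction(1, lam)
    Fs = []
    for _ in range(lam):
        picks = [max(range(len(col)), key=lambda i: col[i]) for col in G]
        Fs.append(picks)                     # F_k encoded as the argmax row per column
        for col, i in zip(G, picks):
            col[i] -= eps
    assert all(f == 0 for col in G for f in col)   # the procedure is successful
    return eps, Fs

eps, Fs = lemma4_decomposition([[Fraction(1, 2), Fraction(1, 3)],
                                [Fraction(1, 2), Fraction(2, 3)]])
print(eps, len(Fs))   # 1/6 and 6 terms: an E with 6 equiprobable states suffices
```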
5.3 Proof of Lemma 2

We prove this by construction for any given joint probability distribution. Consider the decomposition described in the proof of Lemma 4. Assume without loss of generality that 𝒳 = 𝒴 = [n]. Now, instead of taking out ε = 1/λ at each step, at step i remove the minimum over columns of the maximum probability values of the current G_i matrix from the maximum-probability locations. Thus, at each iteration i, at least one entry of the matrix G_i is zeroed out. Since the sum of the values in each column remains the same after each iteration, after at most n(n−1) steps each column must consist of the same single nonzero value. Thus the algorithm finalizes in at most n(n−1)+1 steps. Notice that this algorithm is the same as the entropy minimization algorithm we propose in Algorithm 1. For a more detailed explanation of the algorithm steps, see Algorithm 1. In the case when the matrix has zero entries, it can be shown that one can always find a decomposition with nnz − 1 terms, where nnz is the number of non-zero elements in the matrix.

5.4 Proposition 3

Proposition 3. Let the columns of Y|X be n points independently sampled from the uniform distribution over the (n−1)-simplex. Then, with probability 1, there is no pair (M, e) with vec(Y|X) = Me for M ∈ {0,1}^{n²×m} when m < n(n−1).

Proof. We use the following technique to generate uniformly randomly sampled points on the simplex in n dimensions [18]:

Lemma 5. Let x_i ∼ U[0,1] for i ∈ [n−1] be i.i.d. random variables, ordered such that x_i ≥ x_j for i > j. Let x_0 = 0 and x_n = 1. Then u = [u_i]_{i∈[n]}, where u_i = x_i − x_{i−1}, ∀i ∈ [n], is a random vector uniformly distributed over the (n−1)-simplex.

First, we construct vec(Y|X) directly using n(n−1) uniform i.i.d. random variables: Consider x̃_i for i ∈ [n²], where each consecutive block {x̃_{jn+1}, x̃_{jn+2}, ..., x̃_{jn+n}}, j ∈ {0, 1, ..., n−1}, of size n is sampled from the generative model in Lemma 5 before reordering. Thus the x̃_i are i.i.d. with x̃_i ∼ U[0,1] for n(n−1) indices by construction. Order the x̃_i within each block in ascending order to get the x_i in accordance with Lemma 5. Defining the vectors x = [x_i] and x̃ = [x̃_i], we have x = Px̃ for some permutation matrix P. From x, we can construct z̃ = vec(Y|X) using the same map used in Lemma 5 for each block (the map that obtains the u_i from the x_i in Lemma 5). This construction is a linear map H, where each submatrix of H is full rank.

For the sake of contradiction, let z̃ = M̃e be a decomposition where M̃ ∈ {0,1}^{n²×m} with m < n(n−1). Ignoring the entries where z̃_i = 1 − x_{i−1}, we can relabel z̃ to get z with z ∈ [0,1]^{n(n−1)}. Correspondingly, we can write z = Me, where M is the submatrix of M̃ obtained by ignoring the corresponding rows. The construction of z from x above yields z = Wx for a full rank matrix W. Notice that any subset of rows is linearly independent due to the specific structure of H. Let r be the rank of M. Clearly r < n(n−1). Then some of the (at least n(n−1) − r) rows of M can be written as a unique linear combination of r linearly independent rows. Consider one such row m_0, where m_0 = Σ_{i=1}^r α_i m_i and the m_i are a set of linearly independent rows of M. Define a = [−1, α_1, α_2, ..., α_r]^T.
Take the r+1 rows of W corresponding to the selected z_i's to form W_r. Then we have a^T W_r x = a^T W_r P x̃ = 0. Recall that any subset of rows of W is full rank, and P is a permutation matrix. Hence, the left nullspace of W_r P is empty, implying that x̃ has to be orthogonal to the vector a^T W_r P ≠ 0. This is a probability-0 event for any nonzero a, since each x̃_i is independently sampled from a continuous distribution. The probability that all such constraints are satisfied is zero, since it is less than the probability that one particular constraint is satisfied. This argument holds for any fixed M. Since there are finitely many such {0,1} matrices, the probability that there exists such an M is 0. Thus, unless M has rank at least n(n−1), there does not exist a decomposition Me = z, with probability 1.

5.5 Proof of Theorem 1

For sampling from the simplex, we use the following model:

Lemma 6 ([18]). Let x_i for i ∈ [n] be independent, exponential random variables with mean 1. Then the random vector

[ x_1/Σ_i x_i, x_2/Σ_i x_i, ..., x_n/Σ_i x_i ]

is uniformly distributed over the (n−1)-simplex.

Assume the above generative model for sampling the distributions of X and E: Let {x_i, i ∈ [n]} and {e_i, i ∈ [θ]} be sets of independent identically distributed exponential random variables with mean 1. Assign P(X = i) = x_i/Σ_j x_j and P(E = k) = e_k/Σ_j e_j.

Let p = vec(Y|X). Then, as shown in Section 2.1, we can write

p = Me.   (5)

Notice that the j-th block of n rows of p gives the conditional probability distribution of Y given X = j. Let S_{i,j} represent the set of indices of e that contribute to the probability of observing Y = i given X = j, i.e., p_{i,j} = (1/Σ_j e_j) Σ_{k∈S_{i,j}} e_k. Thus the i-th row in the j-th block, or equivalently the (i + (j−1)n)-th row of M, is 1 in the indices S_{i,j}. The fact that f is generic implies each row of M is distinct and non-empty.

For the sake of contradiction, assume that there exists Ẽ ⊥⊥ Y with cardinality m < n(n−1), with some deterministic function g such that X = g(Y, Ẽ) induces the same joint distribution p(x, y). By Lemma 1, this implies that there exist M̃ ∈ {0,1}^{n²×m} and ẽ such that

q = M̃ẽ,   (6)

where q = vec(X|Y) and ẽ is the probability distribution vector of Ẽ. Let q_{i,j} = P(X = i|Y = j). Then from Bayes' rule we have

q_{i,j} = \frac{p_{j,i} x_i}{\sum_i p_{j,i} x_i} = \frac{x_i \sum_{k \in S_{j,i}} e_k}{\sum_i x_i \sum_{k \in S_{j,i}} e_k}.   (7)

Notice that the denominators Σ_j x_j and Σ_j e_j in each term disappear due to cancellation of numerator and denominator. Thus we have

q = \left[ \frac{p_{1,1} x_1}{\sum_k p_{1,k} x_k}, \frac{p_{1,2} x_2}{\sum_k p_{1,k} x_k}, \cdots, \frac{p_{1,n} x_n}{\sum_k p_{1,k} x_k}, \frac{p_{2,1} x_1}{\sum_k p_{2,k} x_k}, \cdots, \frac{p_{n,n} x_n}{\sum_k p_{n,k} x_k} \right]^T.   (8)–(9)

From (6), drop the rows of q that contain x_n in the numerator, as well as the corresponding rows of M̃. The new linear system becomes

q̄ = M̄ẽ,   (10)

where q̄ and M̄ are the described submatrices of q and M̃, respectively. M̄ has n(n−1) rows and m columns. We have

rank(M̄) ≤ m < n(n−1).   (11)

Since the rank of M̄ is less than n(n−1), the rows of M̄ are linearly dependent. This implies there is at least one set of coefficients {α_i}, not identically zero, that satisfies αM̄ = 0, where

α = [α_{1,1}, α_{1,2}, ..., α_{1,n−1}, α_{2,1}, ..., α_{n,n−1}].   (12)
Then,

αq̄ = αM̄ẽ = 0.   (13)

Hence, the elements of q̄ should satisfy this linear equation. This linear equation in terms of the q_{i,j} can be written as:

\sum_{i=1}^{n} \frac{\sum_{j=1}^{n-1} \alpha_{i,j} p_{i,j} x_j}{\sum_{j=1}^{n} p_{i,j} x_j} = \frac{\sum_{j=1}^{n-1} \alpha_{1,j} p_{1,j} x_j}{\sum_{j=1}^{n} p_{1,j} x_j} + \frac{\sum_{j=1}^{n-1} \alpha_{2,j} p_{2,j} x_j}{\sum_{j=1}^{n} p_{2,j} x_j} + \cdots + \frac{\sum_{j=1}^{n-1} \alpha_{n,j} p_{n,j} x_j}{\sum_{j=1}^{n} p_{n,j} x_j} = 0.   (14)

Slightly abusing notation, relabel p_{i,j} as p_{i,j} = Σ_{k∈S_{i,j}} e_k, since the Σ_k e_k terms cancel in the expression above. We know from Section 2.1 that S_{i,j} ∩ S_{k,j} = ∅ for k ≠ i. Additionally, due to the assumption that f is generic, we have S_{i,j} ≠ S_{i,k}. To prove a contradiction, in the following we show that for any given non-zero α, this equation is non-zero with probability 1. To show this, we show that after equating the denominators, each term brings a unique monomial. Hence, the result is a polynomial where each α_{i,j} is accompanied by at least one unique monomial. Thus the polynomial cannot be identically zero, and the probability of choosing a root of this polynomial is zero. For now, assume that the denominator is finite. Later, we will show the denominator is almost surely finite to complete the argument.

Lemma 7. If the x_i and e_i are i.i.d. exponential random variables with mean 1 and the function f is generic, then (14) holds with probability 0, i.e., P((14) holds) = 0.

Proof. Multiply each term by the denominators of the others to get a single fraction. Then, with a finite denominator, the numerator should be zero for the equation to hold. Define c_1 to be the coefficient of x_n^{n−1} x_1 in the numerator after equating the denominators. Since x_n only appears due to terms from the denominators, we have

c_1 = α_{1,1}(p_{1,1} p_{2,n} p_{3,n} ... p_{n,n}) + α_{2,1}(p_{1,n} p_{2,1} p_{3,n} ... p_{n,n}) + ... + α_{n,1}(p_{1,n} p_{2,n} p_{3,n} ... p_{n,1}).   (15)

Since S_{1,1} ≠ S_{1,n} due to f being generic, we have

(a) either ∃ e_i ∈ S_{1,1}, e_i ∉ S_{1,n},
(b) or ∃ e_i ∉ S_{1,1}, e_i ∈ S_{1,n}.

Without loss of generality, assume some e_{i_1} ∈ S_{i,1}, after a potential relabeling. This is possible since the S_{i,1} are disjoint for different i and non-empty. (Then, as we will see, e_{i_1}² only appears together with α_{i,1} in (15).) Similarly, assume some e_{i_n} ∈ S_{i,n}. Consider the multiplier of the coefficient α_{i,1}. Assume case (a) holds for S_{i,1}, i.e., ∃ e_i ∈ S_{i,1}, e_i ∉ S_{i,n}. Then α_{i,1} is accompanied by the term e_i², since e_i ∈ S_{j,n} for some j ≠ i. Also, it is easy to see that no other term contains e_i², since S_{i,1} does not appear again and no S_{j,1}, j ≠ i, contains e_i. Assume case (b) holds for the multiplier of α_{i,1}, i.e., ∃ e_{i_n} ∈ S_{i,n}, e_{i_n} ∉ S_{i,1}. Then every term except α_{i,1} is accompanied by either e_{i_n} or e_{i_n}², since p_{i,n} appears in every other term.

The above argument implies that every term is different from the rest. This implies that every distinct α_{i,j} is accompanied by a different monomial of the form x_n^{n−1} x_j Π_{k∈T_{i,j}} e_k, for some T_{i,j} ⊂ [θ], where the T_{i,j} are distinct for different i. Thus, since at least one α_{i,j} is nonzero, the resulting polynomial in the numerator is not identically zero. Then the numerator is a non-zero polynomial in the terms {x_1, x_2, ..., x_n, e_1, e_2, ..., e_θ}.
We know that the roots of a nonzero polynomial defined over a compact domain have Lebesgue measure zero [1]. Hence the probability that the numerator is zero is 0. Then we have

$$P((14) = 0) = P(\text{Numerator of (14)} = 0 \ \text{or}\ \text{Denominator of (14)} = \infty) \le P(\text{Numerator of (14)} = 0) + P(\text{Denominator of (14)} = \infty).$$

$P(\text{Numerator of (14)} = 0)$ is shown to be zero by the argument above. It remains to argue that the denominator cannot be infinite. To this end, denote the denominator random variable by $\xi$. We can write

$$P(\xi < \infty) = 1 - P(\limsup_{n \to \infty} \varepsilon_n), \tag{16}$$

where $\varepsilon_n$ is the event $\{\xi \ge t_n\}$ for a sequence $t_n$ with $\lim_{n\to\infty} t_n = \infty$. Pick $t_n = n^2$. Since $\xi$ is nonnegative, we can apply Markov's inequality to get

$$P(\varepsilon_n) \le \frac{E[\xi]}{t_n}. \tag{17}$$

Clearly $E[\xi] < \infty$. Since $\sum_n P(\varepsilon_n) \le E[\xi] \sum_n \frac{1}{t_n} < \infty$, applying the Borel–Cantelli lemma we have

$$P(\limsup_{n\to\infty} \varepsilon_n) = 0, \tag{18}$$

which implies $P(\xi < \infty) = 1$. Thus we have

$$P((14) = 0) \le 0 \ \Rightarrow\ P((14) = 0) = 0. \tag{19}$$

Now we can prove the main theorem. Assume, for the sake of contradiction, that there exists a pair $(g, \tilde{E})$ satisfying $X = g(Y, \tilde{E})$ with $\tilde{E} \perp\!\!\!\perp Y$. By Lemma 1, this is equivalent to the statement that there exists a pair $(\tilde{M}, \tilde{e})$ satisfying (6). Define the events $\varepsilon_1(\tilde{M}_k, \tilde{e}) = \{q = \tilde{M}_k \tilde{e}\}$ and $\varepsilon_2(\tilde{M}_k) = \{\alpha_k q = 0\}$, where $\tilde{M}_k$ is a fixed $\{0,1\}^{n^2\times m}$ matrix and $\alpha_k$ is one set of coefficients imposed by the linear dependence of the rows of $\tilde{M}_k$. Notice that here $q$ is a random vector, determined by $\{x_i\}, i \in [n]$, and $\{e_i\}, i \in [\theta]$. Clearly $\varepsilon_1$ implies $\varepsilon_2$, thus $P(\varepsilon_1) \le P(\varepsilon_2)$. Now we can write:

$$P(\exists (g, \tilde{E}) \text{ such that } X = g(Y, \tilde{E})) = P(\exists (\tilde{M}, \tilde{e}) \text{ such that } q = \tilde{M}\tilde{e}) \tag{20}$$
$$= P(\exists (\tilde{M}, \tilde{e}) \text{ such that } \varepsilon_1 \text{ is true}) \tag{21}$$
$$\le P(\exists (\tilde{M}, \tilde{e}) \text{ such that } \varepsilon_2 \text{ is true}) \tag{22}$$
$$= P(\exists \tilde{M} \text{ such that } \varepsilon_2 \text{ is true}) \tag{23}$$
$$\le \sum_k P(\text{given } \tilde{M}_k,\ \alpha_k q = 0) \tag{24}$$
$$= \sum_k P(\alpha_k q = 0) = 0. \tag{25}$$

(20) follows from the fact that the two representations are equivalent by Lemma 1. (22) is due to the fact that $\varepsilon_1$ implies $\varepsilon_2$. (23) is due to the fact that $\varepsilon_2$ does not depend on $\tilde{e}$, but only on $\tilde{M}$. (24) follows from the union bound over all matrices $\tilde{M}_k$. The last equality follows from Lemma 7 and the fact that there are finitely many $\tilde{M}_k \in \{0,1\}^{n^2 \times m}$ with $m < n(n-1)$ columns.
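To make the representation used throughout this proof concrete, the following Python sketch (our own illustration, with our variable names and zero-based indexing) builds $M$ from a randomly drawn $f$ and checks that $p = \mathrm{vec}(Y|X) = Me$ stacks the conditional distributions of $Y$ given $X$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, theta = 3, 7                     # |X| = |Y| = n, |E| = theta

# A random function f: [n] x [theta] -> [n], so that Y = f(X, E).
f = rng.integers(n, size=(n, theta))

# Distribution of E (normalized exponentials, as in Lemma 6).
e = rng.exponential(size=theta)
e /= e.sum()

# M in {0,1}^{n^2 x theta}: with zero-based indices, row i + j*n
# (row i of block j) is 1 exactly on S_{i,j} = {k : f(j, k) = i}.
M = np.zeros((n * n, theta))
for j in range(n):
    for k in range(theta):
        M[f[j, k] + j * n, k] = 1.0

p = M @ e                           # p = vec(Y|X)
for j in range(n):                  # block j holds P(Y | X = j)
    assert np.isclose(p[j * n:(j + 1) * n].sum(), 1.0)
print(p.reshape(n, n).T)            # column j is P(Y | X = j)
```

Since the sets $S_{i,j}$ partition $[\theta]$ for each fixed $j$, every block of $p$ automatically sums to 1, which the assertion verifies.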
5.6 A counterexample when $S_{i,j} = S_{i,k}$

The following counterexample shows that, without the genericity assumption, the identifiability result of Theorem 1 does not hold. Consider the following system:

$$\begin{bmatrix} p_{1,1}\\ p_{2,1}\\ p_{3,1}\\ p_{1,2}\\ p_{2,2}\\ p_{3,2}\\ p_{1,3}\\ p_{2,3}\\ p_{3,3} \end{bmatrix} = \begin{bmatrix} 1&0&0\\ 0&1&0\\ 0&0&1\\ 1&0&0\\ 0&1&0\\ 0&0&1\\ 1&0&0\\ 0&1&0\\ 0&0&1 \end{bmatrix} \begin{bmatrix} e_1\\ e_2\\ e_3 \end{bmatrix} \tag{26}$$

In the reverse direction, we can fit the following system, which has smaller cardinality for the exogenous variable:

$$\begin{bmatrix} q_{1,1}\\ q_{2,1}\\ q_{3,1}\\ q_{1,2}\\ q_{2,2}\\ q_{3,2}\\ q_{1,3}\\ q_{2,3}\\ q_{3,3} \end{bmatrix} = \begin{bmatrix} 1&0&0\\ 0&1&0\\ 0&0&1\\ 1&0&0\\ 0&1&0\\ 0&0&1\\ 1&0&0\\ 0&1&0\\ 0&0&1 \end{bmatrix} \begin{bmatrix} x_1/(x_1+x_2+x_3)\\ x_2/(x_1+x_2+x_3)\\ x_3/(x_1+x_2+x_3) \end{bmatrix} \tag{27}$$

Notice that the $e_i$ terms completely disappear in this symmetric case. Thus an exogenous variable with $n$ states is sufficient to describe the reverse conditional probability distribution matrix, independent of the cardinality of the exogenous variable in the true direction, i.e., of the value of $\theta$. Our main theorem shows that, under the genericity condition that no $S_{i,j}$ coincides exactly with another $S_{i,k}$, no such case can arise, and the reverse direction requires an exogenous variable with at least $n(n-1)$ states.

5.7 A counterexample when $p_{i,j} = 0$ is allowed

A critical component of the proof was that each linear equation implied by the rank deficiency of the system had unique nonzero coefficients. Here we provide a counterexample to the theorem when this condition is violated because zero entries are allowed in the conditional distribution. Consider the following system:

$$\begin{bmatrix} p_{1,1}\\ p_{2,1}\\ p_{1,2}\\ p_{2,2} \end{bmatrix} = \begin{bmatrix} 1&1&1&1\\ 0&0&0&0\\ 1&1&0&0\\ 0&0&1&1 \end{bmatrix} \begin{bmatrix} e_1\\ e_2\\ e_3\\ e_4 \end{bmatrix} \tag{28}$$

In the reverse direction, we can fit the following system, which has smaller cardinality for the exogenous variable:

$$\begin{bmatrix} q_{1,1}\\ q_{2,1}\\ q_{1,2}\\ q_{2,2} \end{bmatrix} = \begin{bmatrix} 1&0\\ 0&1\\ 0&0\\ 1&1 \end{bmatrix} \begin{bmatrix} \dfrac{x_1}{x_1 + x_2(e_1+e_2)}\\[4pt] \dfrac{x_2(e_1+e_2)}{x_1 + x_2(e_1+e_2)} \end{bmatrix} \tag{29}$$

Notice that this holds independent of the selection of the $e_i$; hence the theorem cannot be extended to the case where $p_{i,j} = 0$ is allowed.

5.8 Proof of Theorem 2

We first define the decomposition problem and show that it is NP-hard.

Definition 3 (Decomposition problem). For a given nonnegative matrix $M$ with column sums equal to 1, consider decompositions $M = \sum_{x \in \mathcal{X}} x F_x$, where the $F_x$ are $\{0,1\}$ matrices with a single 1 per column, $\sum_{x \in \mathcal{X}} x = 1$, and $x \ge 0$. Identify a decomposition that minimizes $\mathrm{card}(\mathcal{X})$.

Lemma 8. The decomposition problem is NP-hard.

Proof. We reduce from the subset sum problem:

Definition 4 (Subset sum problem). For a given set of integers $V$ and an integer $a$, decide whether there exists a subset $S$ of $V$ such that $\sum_{u \in S} u = a$.

Subset sum is a well-known NP-complete problem. Consider any instance of the subset sum problem with the set $V = \{u_1, u_2, \dots, u_m\}$ and an integer $a$. Assume without loss of generality that $u_i \neq 0$ for all $i$; if not, one can work with the set of nonzero values in $V$. Construct the $m \times 2$ matrix $M$ with $M(i,1) = u_i$ for all $i \in [m]$, and $M(1,2) = a$, $M(2,2) = -a + \sum_{i \in [m]} u_i$, $M(i,2) = 0$ for all $i \in \{3, \dots, m\}$. Update $M$ by dividing each element by $\sum_{i \in V} u_i$; this does not change the answer to the subset sum problem. Now the column sum of $M$ is 1 for both columns.

It can be shown that the decomposition of Lemma 2 can always be applied here to get a decomposition with $|\mathcal{X}| = m + 1$ (the number of nonzero entries minus 1). We also know that any decomposition needs to touch each nonzero element in each column at least once; hence the column with the largest number of nonzero elements yields the lower bound $|\mathcal{X}^*| \ge m$.

We show that the optimal decomposition size is $m$ if and only if there is a subset of $V$ that sums to $a$. First, assume $\exists S \subset V$ such that $\sum_{u \in S} u = a$, and consider the following decomposition: Let $x_k = u_k$ (normalized as above) for $k \in [m]$. Let $F_k(k,1) = 1$ be the only nonzero element in the first column of $F_k$. Pick the nonzero element in the second column of $F_k$ based on the membership of $u_k$ in $S$: let $F_k(1,2) = 1$ if $u_k \in S$ and $F_k(2,2) = 1$ if $u_k \in V \setminus S$. This decomposition is optimal due to the lower bound.

Now we show that if the optimum has size $m$, then there exists a subset summing to $a$. Recall that each $F_k$ has a single 1 per column and the decomposition has $m$ terms. Since the first column of $M$ has $m$ nonzero entries, each term in the decomposition must equal an element of this column, i.e., an element of the given set. Since the $F_k$ are $\{0,1\}$ matrices with a single 1 per column by construction, every element of $M$ must be a subset sum of the decomposition terms. Hence $a$ is a subset sum of the set elements. This shows that a subset sum exists if and only if the optimal decomposition has size $m$. Thus, if we could find the optimal decomposition size in all instances of the decomposition problem, we could solve all instances of the subset sum problem.
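To make the reduction concrete, here is a small Python sketch of the matrix construction (our own illustration; the instance `V`, `a` is an arbitrary example, and the construction needs $m \ge 2$ and $0 \le a \le \sum_i u_i$ for the entries to be valid probabilities):

```python
import numpy as np

def reduction_matrix(V, a):
    """Lemma 8 reduction: build the m-by-2 matrix with M(i,1) = u_i,
    M(1,2) = a, M(2,2) = sum(V) - a, zeros below, then normalize so
    that both column sums equal 1."""
    m = len(V)
    M = np.zeros((m, 2))
    M[:, 0] = V            # first column: the set elements
    M[0, 1] = a            # M(1,2) = a
    M[1, 1] = sum(V) - a   # M(2,2) = -a + sum_i u_i
    return M / sum(V)      # normalize; subset-sum answer is unchanged

V, a = [3, 5, 2, 7], 10    # example instance: {3, 7} sums to 10
M = reduction_matrix(V, a)
print(M)
print(M.sum(axis=0))       # both column sums are 1
```

For this instance, an optimal decomposition has size $m = 4$ exactly because the subset $\{3, 7\}$ certifies the subset sum.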
Now we give a definition related to decomposability:

Definition 5 ($\alpha$-decomposability). A matrix $M$ is called $\alpha$-decomposable if it can be written as a convex combination of $\alpha$ $\{0,1\}$ matrices, each with a single 1 per column.

By Lemma 1, identifying the exogenous variable with minimum cardinality is equivalent to finding a minimum-size $e$ with $M \in \mathcal{C}$ such that $\mathrm{vec}(Y|X) = Me$. Notice that the matrix $Y|X$ is $\alpha$-decomposable if and only if there exists $e$ of size $\alpha$ along with $M \in \mathcal{C}$ such that $\mathrm{vec}(Y|X) = Me$. Consider any decomposition problem. If we had an algorithm that could solve all instances of the problem of identifying $E$ with minimum support, we could feed the normalized matrix from the decomposition problem to this algorithm as $Y|X$ and thereby solve the decomposition problem.

5.9 Proof of Theorem 3

Consider $m$ random variables $\{U_1, U_2, \dots, U_m\}$, each with $n$ states. Let $\{p_1, p_2, \dots, p_m\}$ denote their marginal probability distributions. Consider the following optimization problem:

$$H^*(U_1, U_2, \dots, U_m) = \min_{p(u_1, u_2, \dots, u_m)} H(U_1, U_2, \dots, U_m) \quad \text{s.t.} \quad p(u_i) = p_i, \ \forall i \tag{30}$$

In words, the optimization problem finds the joint distribution of the $m$ variables with minimum entropy, subject to a marginal distribution constraint for each variable. Notice that this problem is non-convex: it minimizes a concave function subject to linear constraints.

Let $X, E$ be discrete, independent random variables with $X \in [m]$ and $E \in [t]$. Recall that every conditional distribution $P(Y \mid X = x)$ can be written as the distribution of a function of the random variable $E$. Let us investigate the underlying probability space, in order to construct the probability space over which the joint distribution is optimized. We can consider the product probability space

$$\mathcal{P} = (\Omega = \Omega_X \times \Omega_E = [m] \times [t],\ \mathcal{F} = 2^{\Omega},\ p) \tag{31}$$

for random variables $X : \Omega_X \to [m]$, $E : \Omega_E \to [t]$ given by the coordinate maps $X(\omega_x) = \omega_x$ and $E(\omega_e) = \omega_e$. Then for $\omega = (\omega_x, \omega_e) \in \Omega$, $p(\omega) = p_X(\omega_x)\, p_E(\omega_e)$. Consider the equation $Y = f(X, E)$. Let $f_x : \Omega_E \to \Omega_Y$ be the function mapping $E$ to $Y$ when $X = x$, i.e., $f_x(E) := f(x, E)$. Then

$$P(Y = y \mid X = x) = P(f_x(E) = y \mid X = x) \tag{32}$$
$$= P(f_x(E) = y), \tag{33}$$

where the last equality follows from the fact that $X \perp\!\!\!\perp E$.

Let $\mathcal{F}_{\omega_x}$ be the sigma algebra generated by the random variable $f(\omega_x, E)$. Formally, the sigma algebras generated by the $f(x, E)$ are sub-sigma-algebras of disjoint but identical sigma algebras: even though the $\mathcal{F}_{\omega_x} \subseteq 2^{(\omega_x, \Omega_E)}$ are disjoint for different $\omega_x$, the slices $(\omega_x, \Omega_E)$ are identical copies for every $\omega_x \in \Omega_X$. Thus the sigma algebras generated by the $f(x, E) = f_x(E)$ can be thought of as sub-sigma-algebras of the same sigma algebra, namely the one generated by $E$; equivalently, we can construct $\mathcal{F}_{\omega_x} \subseteq 2^{\Omega_E}$. This construction allows us to talk about the joint probability distribution of the random variables $\{f_x(E) : x \in \mathcal{X}\}$. Let $U_i = f_i(E)$. Then we have

$$H(E) = H(E \mid U_1, U_2, \dots, U_m) + H(U_1, U_2, \dots, U_m) - H(U_1, U_2, \dots, U_m \mid E) \ge H(U_1, U_2, \dots, U_m). \tag{34}$$

The inequality follows from the facts that $H(U_1, U_2, \dots, U_m \mid E) = 0$ and $H(E \mid U_1, U_2, \dots, U_m) \ge 0$. Thus the best lower bound on the entropy of the exogenous variable $E$ is obtained by solving the optimization problem below:

$$H(E) \ge \min_{p(u_1, u_2, \dots, u_m)} H(U_1, U_2, \dots, U_m) \quad \text{subject to} \quad p(u_i) = p_i,\ i = 1, \dots, m. \tag{37}$$

Since we are picking $E$ with the minimum possible entropy without restricting (not observing) the functions $f_x$, we can actually construct an $E$ that achieves this minimum. Let the optimal joint distribution be $p^*(u_1, u_2, \dots, u_m)$. We can construct $E$ with $n^m$ states, with state probabilities equal to $p^*(u)$ for each configuration of $u$. This $E$ has the same entropy as the joint entropy, since the probability values are the same. Thus,

$$H(E^*) = \min_{p(u_1, u_2, \dots, u_m)} H(U_1, U_2, \dots, U_m) \quad \text{subject to} \quad p(u_i) = p_i.$$

The functions $f_i$, where $U_i$ has the same distribution as $f_i(E)$, can be constructed from the distributions of $E$ and $U_i$. This determines the function $f$, hence the causal model $\mathcal{M}$ that induces the conditional distribution $Y|X$. Note that $E \perp\!\!\!\perp X$ by construction. (For a complete argument on how one can always generate $E \perp\!\!\!\perp X$ based on a set of samples of $X, Y$, see the proof of Lemma 1.)
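The construction in the last paragraph is mechanical enough to sketch in code. Below is our own minimal Python illustration (the function name and the example coupling are ours): given an optimal joint table $p^*$, the states of $E$ are the cells of the table and $f_i$ is the $i$-th coordinate map, so $H(E)$ equals the joint entropy and each $f_i(E)$ has the required marginal.

```python
import numpy as np

def exogenous_from_coupling(p_star):
    """Given a joint distribution p*(u_1,...,u_m) as an m-way array,
    return the distribution of E over its cells and the coordinate
    maps f_i (as lookup tables), so that U_i = f_i(E)."""
    m = p_star.ndim
    cells = np.array(list(np.ndindex(p_star.shape)))  # one row per state of E
    e = p_star.reshape(-1)                            # P(E = k)
    f = [cells[:, i] for i in range(m)]               # f_i(k) = i-th coordinate
    return e, f

def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

# A feasible (not necessarily optimal) coupling of two binary marginals.
p_star = np.array([[0.5, 0.0],
                   [0.1, 0.4]])
e, f = exogenous_from_coupling(p_star)
print(np.isclose(entropy(e), entropy(p_star.reshape(-1))))  # H(E) = joint entropy
print(np.bincount(f[0], weights=e), p_star.sum(axis=1))     # marginal of U_1
```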
5.10 Proof of Corollary 2

Assume there exists a black box that can find the causal model with the minimum-entropy exogenous variable $E$. Consider an arbitrary instance of the problem of minimizing the joint entropy of a set of variables $\{U_1, U_2, \dots, U_n\}$ subject to marginal constraints. Construct the matrix $M = [p_1, p_2, \dots, p_n]$, where $p_i$ is the distribution of variable $U_i$ as a column vector. Feed $M$ into this black box as a hypothetical conditional distribution $Y|X$. By the proof of Theorem 3, the $H(E)$ output by this black box gives the minimum joint entropy of the variables $U_i$. Hence, finding the causal model with minimum entropy would solve this NP-hard problem.

5.11 Proof of Proposition 2

Given a joint distribution $p(x,y)$, consider the following two-stage algorithm. In stage 1, feed the set of conditional distributions $\{P(Y \mid X = i) : i \in [n]\}$ to algorithm $\mathcal{A}$ to obtain $E$ with minimum entropy. The algorithm outputs a variable $E$ along with functions $\{f_1, f_2, \dots, f_n\}$ such that the distribution of $f_j(E)$ is the same as the conditional distribution of $Y$ given $X = j$, for all $j$. This set of $f_j$ determines $f$, where $Y = f(X, E)$ and $E \perp\!\!\!\perp X$. Since the algorithm minimizes the entropy of $E$, $H(E) \le H(E_0)$. In stage 2, feed the conditional distributions $\{P(X \mid Y = i) : i \in [n]\}$ to algorithm $\mathcal{A}$ to obtain $\tilde{E}$. By the conjecture, any $\tilde{E}$ satisfies $H(X) + H(E_0) < H(Y) + H(\tilde{E})$. Since $H(E) \le H(E_0)$, we have $H(X) + H(E) < H(Y) + H(\tilde{E})$, which can be used for identifiability.
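This decision rule is easy to state in code. The sketch below is our own Python rendering (not the paper's MATLAB implementation); `min_entropy_exogenous` is a placeholder for any algorithm $\mathcal{A}$ that returns the distribution of a minimum-entropy (or greedily minimized, as in the next subsection) exogenous variable for a stack of conditional distributions:

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def infer_direction(joint, min_entropy_exogenous):
    """Declare X -> Y if H(X) + H(E) < H(Y) + H(E~), where E and E~ are
    the exogenous variables found for the two candidate directions."""
    p_x, p_y = joint.sum(axis=1), joint.sum(axis=0)
    cond_y_given_x = joint / p_x[:, None]        # row i: P(Y | X = i)
    cond_x_given_y = (joint / p_y[None, :]).T    # row j: P(X | Y = j)
    H_e = entropy(min_entropy_exogenous(cond_y_given_x))
    H_e_rev = entropy(min_entropy_exogenous(cond_x_given_y))
    return 'X->Y' if entropy(p_x) + H_e < entropy(p_y) + H_e_rev else 'Y->X'
```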
5.12 Greedy Entropy Minimization Outputs a Local Optimum

Proposition 4. For two variables, Algorithm 1 always returns a local optimum.

Consider random variables $U, V$ with marginal distributions $p_u, p_v$. We first show that the KKT conditions for problem (30) imply that the optimal solution is quasi-orthogonal:

$$p^*(u,v) = M(u,v) = U(u,v)\,\delta_{u,v}, \tag{38}$$

where $U$ is a rank-1 matrix and $\delta_{u,v}$ is an indicator for the support of the optimal joint distribution. We call such an $M$ a masked submatrix of a rank-1 matrix.

Let the $(i,j)$-th entry of the joint distribution be $x_{i,j}$. We have $n^2$ variables $\{x_{i,j}, i \in [n], j \in [n]\}$ to optimize over. For a general optimization problem

$$\min_x f_0(x) \quad \text{s.t.} \quad h_i(x) = 0,\ i \in [p]; \qquad f_i(x) \le 0,\ i \in [m],$$

the Lagrangian becomes

$$L(x, \lambda, v) = f_0(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{i=1}^{p} v_i h_i(x), \tag{39}$$

which gives the KKT conditions

$$f_i(x^*) \le 0,\ i \in [m]; \quad h_i(x^*) = 0,\ i \in [p]; \quad \lambda_i^* \ge 0,\ i \in [m]; \quad \lambda_i^* f_i(x^*) = 0,\ i \in [m]; \quad \nabla L(x^*, \lambda^*, v^*) = 0.$$

These imply that, for each fixed $i$, either $f_i(x^*) = 0$ or $\lambda_i^* = 0$. The optimization problem we have is

$$\min_{x_{i,j}} \sum_{i,j} -x_{i,j} \log x_{i,j} \quad \text{s.t.} \quad \sum_j x_{i,j} = p_u(i)\ \forall i, \quad \sum_i x_{i,j} = p_v(j)\ \forall j, \quad x_{i,j} \ge 0\ \forall i,j.$$

Substituting the corresponding $f_i, h_i$, we reach the following conclusion: at an optimal point, either $x_{i,j} = 0$ or $1 - \log x_{i,j} + v_i^{(1)} + v_j^{(2)} = 0$, the latter implying $x_{i,j} = 2^{1 + v_i^{(1)} + v_j^{(2)}}$. Hence the optimal joint satisfies $p^*_{i,j} = u_i v_j$ for some vectors $u, v$ whenever $p^*_{i,j} \neq 0$.

Next we show that such $u, v$ can be constructed from any output of the algorithm:

Theorem 4. For any matrix $M$ output by Algorithm 1, there exist $u, v$ such that $M$ is a masked submatrix of $uv^T$.

Proof of Proposition 4. Consider the following variant of Algorithm 1 for two distributions $p, q$: Initialize $M$ as the $n \times n$ zero matrix. In every iteration, find $m_1 = \max_i p(i)$ and $m_2 = \max_i q(i)$. Let $a = \arg\max_i p(i)$ and $b = \arg\max_i q(i)$. Assign $r = \min\{m_1, m_2\}$ to the $(a,b)$-th entry of $M$. Update $p(a) \leftarrow p(a) - r$ and $q(b) \leftarrow q(b) - r$. Repeat the process until $p = 0$ and $q = 0$. This variant constructs the joint distribution matrix rather than the distribution of $E$ directly.
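The two-marginal variant just described is straightforward to implement. The following Python sketch is our own rendering of it (the paper's implementation is in MATLAB):

```python
import numpy as np

def greedy_coupling(p, q, tol=1e-12):
    """Greedily build a joint matrix M with row sums p and column sums q:
    put r = min(max p, max q) at the entry indexed by the two argmaxes,
    subtract r from both marginals, repeat until both are exhausted."""
    p, q = p.astype(float).copy(), q.astype(float).copy()
    M = np.zeros((len(p), len(q)))
    while p.max() > tol and q.max() > tol:
        a, b = p.argmax(), q.argmax()
        r = min(p[a], q[b])
        M[a, b] += r
        p[a] -= r
        q[b] -= r
    return M

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.6, 0.3, 0.1])
M = greedy_coupling(p, q)
print(M)
print(M.sum(axis=1), M.sum(axis=0))  # recovers the two marginals
```

Each assignment exhausts either a row or a column constraint, so the loop terminates after at most $2n - 1$ steps; this is exactly the no-cycle support structure established below.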
We need the following definitions:

Definition 6. An ordered set of coordinates $((i_1, j_1), (i_2, j_2), \dots, (i_m, j_m))$ of a matrix $M$ is called a path on $M$ if $i_{t+1} = i_t$ or $j_{t+1} = j_t$ for all $t \in [m-1]$.

Definition 7. An ordered set of coordinates $((i_1, j_1), (i_2, j_2), \dots, (i_m, j_m))$ of a matrix $M$ is called a cycle on $M$ if $i_{t+1} = i_t$ or $j_{t+1} = j_t$ for all $t \in [m-1]$, and $i_1 = i_m$ or $j_1 = j_m$.

First we prove a structural characterization of the matrix $M$.

Lemma 9. Any matrix $M$ output by Algorithm 1 contains no cycles.

Proof. Consider a bipartite graph $G$ with $n$ left and $n$ right vertices, constructed as follows: $G$ has an edge $(i,j)$ if and only if $M(i,j)$ is nonzero. It is easy to see that the matrix $M$ has no cycles if and only if the corresponding bipartite graph $G$ has no cycles. We claim that, for an $M$ output by Algorithm 1, the corresponding $G$ cannot have any cycles. Construct the bipartite graph in the same order in which the matrix $M$ is created by Algorithm 1. Notice that when the algorithm assigns a value to $M(i,j)$, it satisfies either the $i$-th row constraint or the $j$-th column constraint; no other value can then be assigned to that row or column. We call this zeroing out the corresponding row/column. In the bipartite representation, say an edge $(i,j)$ is added at time $t$. We call a vertex of the added edge closed if the corresponding dimension (row or column) is zeroed out. Thus each edge added to the bipartite graph closes one of its endpoints, and a closed vertex cannot participate in the formation of any further edges. This captures the fact that, once zeroed out, a row/column cannot be assigned a nonzero value again by the algorithm.

Assume that at time $t$ a cycle is formed in the bipartite graph. Then at time $t-1$ there must exist a path with a single edge missing. Label the edges of this path $\{e_1, e_2, \dots, e_m\}$; there is a time order in which these edges were constructed. Let $e_k$ be the first edge formed on this path. One of its endpoints must be closed. If that closed endpoint is interior to the path, then a later edge of the path was attached to a closed vertex, which is a contradiction. If instead $e_k$ lies at the end of the path and the closed endpoint is also the endpoint of the path, then the edge added at time $t$ to form the cycle is attached to this closed vertex, which is again a contradiction.

Consider a matrix $M$ output by the algorithm. We prove that $M$ is a masked submatrix of $uv^T$ by construction. Let the first entry selected by Algorithm 1 have coordinates $(i_0, j_0)$. Since the algorithm zeroes out either the row or the column, $(i_0, j_0)$ is the only nonzero entry in either its column or its row. Assume without loss of generality that the selection of $(i_0, j_0)$ by Algorithm 1 zeroes out row $i_0$. Initialize by assigning $u_{i_0} = 1$ and $v_{j_0} = M(i_0, j_0)$. Let $S_{j_0}$ represent the nonzero row indices of column $j_0$. Now assign the values of $u(k)$ for $k \in S_{j_0}$ such that $u(S_{j_0})\, v_{j_0} = M(S_{j_0}, j_0)$, where $u(S)$ stands for the subvector containing the entries with indices in $S$. Repeat this procedure, exhausting either a row or a column at each step.

Lemma 10. The above construction never encounters an entry $(i,j)$ for which both $u_i$ and $v_j$ were assigned before.

Proof. Assume otherwise. Then, due to the process of assigning the entries of $u, v$, there must be two paths from $(i,j)$ to the starting point $(i_0, j_0)$: one path that follows column $j$, and one path that follows row $i$. In other words, there exist $k, l \in [n]$ such that there are two paths $((i,j), (k,j), \dots, (i_0, j_0))$ and $((i,j), (i,l), \dots, (i_0, j_0))$. Combining these paths, we get the cycle $((i,j), (k,j), \dots, (i_0, j_0), \dots, (i,l))$, which is a contradiction by Lemma 9.

Hence, the construction assigns every element of $u$ and $v$ at most once, and it thus produces a valid assignment for $u, v$.

5.13 Implementation Details and Sampling Distributions with Diverse Entropy Values

We implemented our algorithms in MATLAB. In order to sample from a wide range of entropy values, whenever we need to sample a distribution from the $(n-1)$-dimensional simplex, we generate $n$ independent identically distributed log-Gaussian (lognormal) random variables with parameter $\sigma$ and normalize them to sum to 1.
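For intuition, here is a small Python sketch of this sampling scheme (our own illustration; the paper's code is MATLAB). Larger $\sigma$ tends to concentrate mass on a few states, so sweeping $\sigma$ produces distributions with widely varying entropies:

```python
import numpy as np

def sample_simplex_lognormal(n, sigma, rng):
    """Sample a distribution by normalizing n i.i.d. log-Gaussian
    (lognormal) random variables; sigma controls the entropy spread."""
    w = rng.lognormal(mean=0.0, sigma=sigma, size=n)
    return w / w.sum()

def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

rng = np.random.default_rng(2)
for sigma in range(2, 9):  # the integer sweep used for the experiments
    samples = [sample_simplex_lognormal(16, sigma, rng) for _ in range(1000)]
    print(sigma, round(float(np.mean([entropy(p) for p in samples])), 2))
```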
For verifying the conjecture on artificial data, we generate the distributions for $E$ by sweeping the $\sigma$ parameter over the integer values from 2 to 8. The distribution of $X$ is sampled uniformly at random over the simplex. For the true causal direction, the function $f$ is sampled as follows: we generate $\theta$ matrices $F_i$, each with a single 1 per column placed uniformly at random among the $n$ rows. Each $F_e$ represents the function $f_e(X) := f(X, e)$; together with $e$, this determines the conditional probability distribution matrix $Y|X$.

The number of states used for quantization is chosen as $n = \min\{N/10, 512\}$, where $N$ is the number of samples for the particular cause-effect pair. Hence we pick $n$ to ensure that each state receives at least 10 samples on average; the upper bound of 512 limits the computational complexity.