Unsupervised Discovery of El Niño Using Causal Feature Learning on Microlevel Climate Data

Krzysztof Chalupka, Computation and Neural Systems, Caltech
Tobias Bischoff, Environmental Science and Engineering, Caltech
Pietro Perona, Electrical Engineering, Caltech
Frederick Eberhardt, Humanities and Social Sciences, Caltech

Abstract

We show that the climate phenomena of El Niño and La Niña arise naturally as states of macro-variables when our recent causal feature learning framework (Chalupka et al., 2015, 2016) is applied to micro-level measures of zonal wind (ZW) and sea surface temperatures (SST) taken over the equatorial band of the Pacific Ocean. The method identifies these unusual climate states on the basis of the relation between ZW and SST patterns without any input about past occurrences of El Niño or La Niña. The simpler alternatives of (i) clustering the SST fields while disregarding their relationship with ZW patterns, or (ii) clustering the joint ZW-SST patterns, do not discover El Niño. We discuss the degree to which our method supports a causal interpretation and use a low-dimensional toy example to explain its success over other clustering approaches. Finally, we propose a new robust and scalable alternative to our original algorithm (Chalupka et al., 2016), which circumvents the need for high-dimensional density learning.

1 INTRODUCTION

The accurate characterization of macro-level climate phenomena is crucial to an understanding of climate dynamics, long-term climate evolution and forecasting. Modern climate science models, despite their complexity, rely on an accurate and valid aggregation of micro-level measurements into macro-phenomena. While many aspects of the climate may indeed be subject fundamentally to chaotic dynamics, many large-scale phenomena are deemed amenable to precise modeling.
The El Niño–Southern Oscillation (ENSO) is arguably the most studied climate phenomenon at the inter-annual time scale, but much about its dynamics relating zonal winds (ZW) and sea surface temperatures (SST) remains poorly understood.

Figure 1: El Niño vs. neutral conditions from Di Liberto (2014). Top: An illustration of the state of the atmosphere and surface during typical El Niño conditions. Here, the colors indicate SST deviations from the neutral state, with red being a positive and blue being a negative deviation. Bottom: Similar to the top panel but now showing neutral conditions of the Walker circulation (neither El Niño nor La Niña).

We apply our recent causal feature learning (CFL) framework (Chalupka et al., 2016) to learn causal macro-variables from the equatorial Pacific climate data. Our goal is threefold:

• apply CFL to real-world data, developing new practical algorithms as needed,
• test whether CFL can, without supervision, learn the ground truth that El Niño is an important macro-variable state in the ZW-SST system’s dynamics,
• explore the theoretical and practical difference between CFL and clustering methods.

Figure 2: Niño 3.4 SST anomalies for the time period 1950–2005. The figure was adapted from McPhaden et al. (2006). Red shadings indicate El Niño years and blue shadings indicate La Niña years. The two dashed lines indicate the threshold for strong El Niño or La Niña events.

From the climate-science point of view, our research shows that CFL can be successfully used for an unbiased automated extraction of climate macro-variables, which would otherwise require tedious hand-crafting by domain experts. Moreover, the framework can directly suggest (computationally) expensive climate experiments (for example, through climate simulations) that could differentiate between true causes and mere correlations efficiently.
Closer inspection of the output of CFL can also yield insights about new climate macro-phenomena (or important variants of existing ones) that inspire new physical models of the climate. Python code that reproduces our results and figures is available online at http://vision.caltech.edu/~kchalupk/code.html.

1.1 EL NIÑO–SOUTHERN OSCILLATION

El Niño is a weather pattern that is principally characterized by the state of eastern Pacific near-surface winds (ZW, zonal wind), sea surface temperature (SST) patterns, and the associated state of the atmospheric Walker circulation (see, for example, Holton et al., 1989; Trenberth, 1997). The Walker circulation (see Fig. 1) is characterized by warm air rising over Indonesia and Papua New Guinea and cooler subsiding air over the eastern Pacific cold tongue region just west of equatorial South America (Lau and Yang, 2003). Near the surface, easterly winds (winds blowing from the east) drive water from east to west, resulting in oceanic upwelling near the coast of equatorial South America (and downwelling east of Indonesia) that brings with it cold and nutrient-rich waters from the deep oceans. During the ENSO warm phase, commonly referred to as El Niño (because it often occurs around and after Christmas), the Walker circulation weakens, ultimately resulting in weaker upwelling in the eastern Pacific and thus in positive SST anomalies. Fig. 1 illustrates these phenomena.

ENSO-related weather in the tropics includes droughts and flooding, and may have a direct impact on fisheries through reduced nutrient upwelling (e.g., Glantz, 2001). Atmospheric waves (ripples in wind, SST and rainfall patterns) generated by the change in circulation and SST anomalies in the tropics make their way across the planet with dramatic impact (e.g., Ropelewski and Halpert, 1987; Changnon, 1999). Cashin et al. (2015) show that the economic impact of El Niño varies across regions.
Economic activity may decline briefly in Australia, Chile, Indonesia, India, Japan, New Zealand, and South Africa after an El Niño event. Enhanced growth may be registered in other countries, such as the United States.

The ENSO cold phase, usually referred to as La Niña, is the opposing phase of El Niño, with enhanced upwelling and colder SSTs in the eastern Pacific. Currently, predicting the strength of El Niño and La Niña events remains a difficult challenge for climate scientists, as the period may vary between 3 and 7 years (see Fig. 2); as a consequence, accurate forecasts are only possible less than a year in advance (e.g., Landsea and Knaff, 2000).

The National Oceanic and Atmospheric Administration (NOAA) defines El Niño as a positive three-month running mean SST anomaly of more than 0.5 °C from normal (for the 1971–2000 base period) in the Niño 3.4 region (120°W–170°W, 5°N–5°S, see also Fig. 4). Similarly, La Niña conditions are defined as negative anomalies below −0.5 °C. Conditions in between −0.5 °C and 0.5 °C are called neutral. This is illustrated using red and blue shadings in Fig. 2. Strong El Niño/La Niña events are defined as SST anomalies exceeding 1.5 °C in magnitude. However, the definitions for El Niño and La Niña have evolved over time. For example, regions other than the Niño 3.4 region or other averaging conventions have been used in the specification of the SST anomalies.

1.2 CAUSAL FEATURES AND MACRO-VARIABLES

Climate experts view zonal winds as drivers of SST patterns. We take the view that if El Niño and La Niña are indeed genuine macro-level climate phenomena in their own right (and not just arbitrary quantities defined by convention), then they must consist of macro-level features of the relation between the high-dimensional micro-level ZW and SST patterns that can be detected by an unsupervised method.
That is, it must be possible to identify El Niño and La Niña from a mass of air pressure and sea temperature readings, using a method that has no independent information about when such periods occurred.

In Chalupka et al. (2016) we developed a theoretically precise account of causal relations of macro-variables that supervene on micro-variables, and proposed an unsupervised method for their discovery, which we called Causal Feature Learning (CFL). We adopt the framework (summarized below) with a few interpretational adjustments for our climate setting. The method (originally inspired by the neuroscience setting and only tested on synthetic data) was designed to establish claims such as “The presence of faces (in an image) causes specific neural processes in the brain”, where a neural process identifies a class of spike trains across a large number of neurons recorded by electrodes. An ability to characterize such neural processes would provide the basis to explain, for example, what constitutes face recognition in the brain. There we considered as input visual stimuli (in the form of still images) and as output electrode recordings of the neural response of 1000 neurons (in the form of spike trains).

Formally, let an input (micro-)variable $X$ take values in a high-dimensional domain $\mathcal{X}$ (in Chalupka et al. (2016), the pixel space of an image; in our case here, ZW maps) and the output (micro-)variable $Y$ take values in the high-dimensional domain $\mathcal{Y}$ (the space of neural spike trains then, the SST patterns here). The basic idea underlying our set-up is that the causal macro-variable relation is defined in terms of the coarsest aggregation of the micro-level spaces that preserves the probabilistic relations under intervention (hence, causal) between the micro-level spaces. Conceptually, macro-level causal variables group together micro-level states that make no causal difference. In Chalupka et al.
(2016) we started by defining a micro-level manipulation (similar to Pearl’s $do(\cdot)$-operator (Pearl, 2000)):

Definition 1 (Micro-level Manipulation). A micro-level manipulation is the operation $\mathrm{man}(X = x)$ that changes the value of the micro-variable $X$ to $x \in \mathcal{X}$, while not (directly) affecting any other variables. We write $\mathrm{man}(x)$ if the manipulated variable $X$ is clear from context.

The micro-level manipulation is then used to define what we refer to as the fundamental causal partition:

Definition 2 (Fundamental Causal Partition, Causal Class). Given the pair $(\mathcal{X}, \mathcal{Y})$, the fundamental causal partition of $\mathcal{X}$, denoted by $\Pi_c(\mathcal{X})$, is the partition induced by the equivalence relation $\sim_X$ such that

$$x_1 \sim_X x_2 \iff \forall y \;\; P(y \mid \mathrm{man}(x_1)) = P(y \mid \mathrm{man}(x_2)).$$

Similarly, the fundamental causal partition of $\mathcal{Y}$, denoted by $\Pi_c(\mathcal{Y})$, is the partition induced by the equivalence relation $\sim_Y$ such that

$$y_1 \sim_Y y_2 \iff \forall x \;\; P(y_1 \mid \mathrm{man}(x)) = P(y_2 \mid \mathrm{man}(x)).$$

A cell of a causal partition is a causal class of $X$ or $Y$.

Figure 3: The Causal Coarsening Theorem, adapted from Chalupka et al. (2016). In this plot, the observational input macro-variable (top, gray) has four states, and has a well-defined joint with the observational output macro-variable (with six states). In each case, the causal macro-variable states are a coarsening of the observational states. For example, the input causal macro-variable merges the two top observational states, e.g. $P(Y \mid x_1) \neq P(Y \mid x_2)$, but $P(Y \mid \mathrm{man}(x_1)) = P(Y \mid \mathrm{man}(x_2))$.

The fundamental causal partitions then naturally give rise to the macro-level cause variable $C$ and effect variable $E$ that stand in a bijective relation to the cells of $\Pi_c(\mathcal{X})$ and $\Pi_c(\mathcal{Y})$, respectively. Thus, the macro-variable cause $C$ ignores all the micro-level changes in $X$ that do not have an effect on the probabilities over $Y$, and the macro-level
effect $E$ ignores all the micro-level details in $Y$ that occur with the same probability given a manipulation to any $X = x$.

With these definitions there is no reason a priori to think that macro-variables are common phenomena. In fact, quite the opposite: the conditions that the probability distributions over $X$ and $Y$ must satisfy to give rise to non-trivial macro-variables $C$ and $E$ can easily be described as a measure-zero event when taken in their strict form. Consequently, our view is that to the extent that macro-variables are discussed in a scientific domain, there must be a presupposition that such strong conditions are satisfied at least approximately.

In the present context, our climate data consisting of ZW and SST measurements (we give a detailed description of the data in Section 1.3 below) is entirely observational. That is, the data is naturally sampled from $P(SST, ZW)$ and not created by a (hypothetical) experimentalist from $P(SST \mid \mathrm{man}(ZW = z))$ for different values of $z$. Nevertheless, we can identify the observational macro-variables that characterize the probabilistic relation between ZW and SST by replacing the probabilities in Definition 2 with observational probabilities $P(y \mid x)$:

Definition 3 (Fundamental Observational Partition, Observational Class). Given the pair $(\mathcal{X}, \mathcal{Y})$, the fundamental observational partition of $\mathcal{X}$, denoted by $\Pi_o(\mathcal{X})$, is the partition induced by the equivalence relation $\sim_X$ such that

$$x_1 \sim_X x_2 \iff \forall y \;\; P(y \mid x_1) = P(y \mid x_2).$$

Similarly, the fundamental observational partition of $\mathcal{Y}$, denoted by $\Pi_o(\mathcal{Y})$, is the partition induced by the equivalence relation $\sim_Y$ such that

$$y_1 \sim_Y y_2 \iff \forall x \;\; P(y_1 \mid x) = P(y_2 \mid x).$$

A cell of an observational partition is an observational class of $X$ or $Y$.

Figure 4: A micro-variable climate dataset. Top: A week’s average ZW field. Bottom: A week’s average SST field over the same region. In addition, the Niño 3.4 region is marked. Our dataset comprises 36 years’ worth of overlapping weekly averages over the presented region.

In Chalupka et al. (2016) we showed that the fundamental causal partition is almost always a coarsening of the corresponding fundamental observational partition, as illustrated in Fig. 3. We thus have some reason to expect that any macro-variables we do identify from our observational climate data will capture all the distinctions that are causal, but may in addition make some distinctions that do not support a causal inference. We return to this point in Section 6, where we discuss in more detail what causal insights can be drawn from this work. Our results should be seen as a step towards a characterization of macro-level causal variables for climate science, but we fully acknowledge that a complete causal characterization of the equatorial Pacific climate dynamics is beyond the scope of this paper.

1.3 DATASET

The data used for this study is based on the daily-averaged version of the NCEP-DOE Reanalysis 2 product for the time period 1979–2014 inclusive (Kanamitsu et al., 2002), a data product provided by the US National Centers for Environmental Prediction (NCEP) and the Department of Energy (DOE). Reanalysis data sets are generated by fitting a complex climate model to all available data for a given period of time, thus generating estimates for times and locations that were not originally observed. In addition, we used the Geophysical Observational Analysis Tool (http://www.goat-geo.org) to interpolate the SST and zonal wind fields onto a 2.5° × 2.5° spatial grid for easier analysis. We chose to focus on the (140°, 280°)E × (−10°, +10°)N equatorial band of the Pacific Ocean.
From the raw dataset, we extracted the zonal (west-to-east) wind component and SST data in this region (specifically, we extracted the fields at the 1000 hPa level near the surface). Finally, we smoothed the data by computing a running weekly average in each domain. The resulting dataset contains 13140 zonal wind and 13140 corresponding SST maps, each a 9 × 55 matrix. Fig. 4 shows sample data points.

2 PACIFIC MACRO-VARIABLES

To apply CFL in practice, we adapted our unsupervised causal feature learning algorithm (Chalupka et al., 2016) to more realistic scenarios. The new solution (Sec. 3) is more robust and applicable to high-dimensional real-world data. We start with a description of the results.

Throughout the article, we will refer to zonal wind macro-variables as $W$, and to temperature macro-variables as $T$. We first chose to search for four-state macro-variables (though we experiment with varying this number in Sec. 4.1) and considered a zero time delay (see footnote 1) between $W$ and $T$. In the CFL framework, each macro-variable state corresponds to a cell of a partition of the respective micro-variable input space. Fig. 5 visualizes the $W$ and $T$ we learned by plotting the difference between each macro-variable cell’s mean and the ZW (SST) mean across the whole dataset. The visualized states are easy to describe: for example, when W=WEqt there is a larger-than-average westerly wind component in the west-equatorial region, a feature often associated with the causes of El Niño (see Fig. 1). Indeed, Table 1 shows that the El Niño cell of $T$ only arises in connection with W=WEqt. In addition, W=WEqt is positively correlated with T=Warm. Throughout the rest of the article, we will mostly focus on the $T$ macro-variable. Our first goal is to quantitatively justify calling T=1 “El Niño” and calling T=2 “La Niña”.
Qualitatively, the warm and cold water tongues that reach westward across the Pacific and that are often used to describe the two phenomena are evident in Fig. 5.

Figure 5: Macro-variables discovered by Alg. 1. For each state, the average difference from the dataset mean is shown. Left: Four states of $W$, the zonal wind macro-variable. We named the states “Easterly Equatorial” (EEqt), “Westerly Equatorial” (WEqt), “Easterly North of Equator” (EN) and “Easterly South of Equator” (ES). Right: Four states of $T$, the SST macro-variable. We named the states “Cold [American Coastal Waters]”, “El Niño”, “La Niña” and “Warm [American Coastal Waters]”. The main text provides additional justification for calling T=1 and T=2 “El Niño” and “La Niña”, respectively.

Footnote 1: A zero time delay implies that CFL will attempt to relate the weekly moving ZW average to the weekly moving SST average. The question of different time delays turns out to be a very subtle issue in the study of El Niño, as El Niño is not a periodic event, nor does it have a fixed duration (see Fig. 2). A careful discussion of other delays is not feasible in a short article, and the zero time delay was deemed a reasonable starting point by domain experts we consulted.

Following the standard definition of El Niño (see Section 1.1), we use the SST anomaly in the Niño 3.4 region to detect its presence (Trenberth, 1997). The anomaly is computed with respect to the climatological mean, that is, the mean temperature during the same week of the year over all the weeks in our dataset. We will call a weekly average anomaly exceeding +0.5 °C a mild episode, and an anomaly exceeding +1.5 °C a strong episode. The definition of La Niña is analogous, with negative thresholds. Fig.
6 shows that in the T=1 and T=2 cells, over 75% of all the points exceed the threshold for a mild (positive and negative, respectively) anomaly, and over 50% of the points exceed the strong threshold. The situation is different in the Warm and Cold cells, where almost no points exceed the strong threshold while the number of points falling in these non-anomalous cells is about 30% of the total. Since this macro-variable contains a state capturing a high proportion of El Niño-like patterns, we will say that this state has a “high precision” of detecting El Niño, while similarly, state T=2 has a high La Niña precision.

Figure 6: T=1 and T=2 are El Niño and La Niña. Top: Each plot shows the cumulative histogram of the Niño 3.4 anomalies, computed over all the weekly SST averages that belong to the given state of $T$. The dashed lines show the ±0.5 and ±1.5 “mild” and “strong” anomaly thresholds. Bottom: The minimal manipulations needed to transition from a given $T$-state into another (the exact procedure to obtain the plots is described in the text).

Formally, we define the precision of a macro-variable state as follows:

Definition 4 (precision). Let $T = \{T_1, \cdots, T_K\}$ be a partition of the set of all the SST maps used in our experiments. Let $n34 : SST \to \mathbb{R}$ be the function that computes the Niño 3.4 anomaly for a given map. Then, let

$$c_\theta(T_k) = \begin{cases} \frac{1}{|T_k|} \, |\{t \in T_k \text{ s.t. } n34(t) > \theta\}| & \text{if } \theta > 0 \\ \frac{1}{|T_k|} \, |\{t \in T_k \text{ s.t. } n34(t) < \theta\}| & \text{if } \theta < 0 \end{cases}$$

be the function that computes, for a given cell $T_k$ of the partition, the fraction of its members whose anomaly is greater than (if $\theta > 0$) or less than (if $\theta < 0$) a given threshold $\theta$. Finally, call the four numbers $\max_k c_{.5}(T_k)$, $\max_k c_{1.5}(T_k)$, $\max_k c_{(-.5)}(T_k)$, $\max_k c_{(-1.5)}(T_k)$ the mild/strong-El Niño and mild/strong-La Niña precision of the macro-variable $T$.
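Definition 4 can be sketched in a few lines of Python. The snippet below is a minimal illustration rather than the code used for the experiments: the helper names `cell_fraction` and `precision` are hypothetical, each cell is represented directly by the list of Niño 3.4 anomalies of its members (that is, $n34$ is assumed to have been applied already), and the anomaly values are made up for the example.

```python
def cell_fraction(anomalies, theta):
    """c_theta(T_k): fraction of a cell's members beyond the threshold theta.

    `anomalies` holds the Nino 3.4 anomaly n34(t) of every map t in the cell
    (hypothetical representation; the maps themselves are not needed here).
    """
    if theta == 0 or not anomalies:
        raise ValueError("theta must be non-zero and the cell non-empty")
    if theta > 0:
        hits = sum(1 for a in anomalies if a > theta)
    else:
        hits = sum(1 for a in anomalies if a < theta)
    return hits / len(anomalies)


def precision(partition, theta):
    """max_k c_theta(T_k), taken over all cells of the partition."""
    return max(cell_fraction(cell, theta) for cell in partition)


# Made-up anomalies (in degrees C) for a four-cell partition.
partition = [
    [-0.2, 0.1, 0.3, -0.4],    # a "Cold"-like cell
    [0.7, 1.6, 0.9, 0.4],      # an "El Nino"-like cell
    [-0.8, -1.7, -0.6, -1.6],  # a "La Nina"-like cell
    [0.2, 0.6, -0.1, 0.0],     # a "Warm"-like cell
]

mild_el_nino = precision(partition, 0.5)     # 0.75
strong_el_nino = precision(partition, 1.5)   # 0.25
mild_la_nina = precision(partition, -0.5)    # 1.0
strong_la_nina = precision(partition, -1.5)  # 0.5
```

Taking the maximum over cells means that a partition scores well as soon as one of its cells is dominated by anomalous maps, which is the sense in which a single state can be called “the” El Niño or La Niña state.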
Together, the precisions indicate how well the partition $T$ separates the mild and strong El Niño and La Niña anomalies from other structures in the data. In Fig. 6, for example, $c_{.5}(T) \approx .75$ and $c_{1.5}(T) \approx .25$ (both because of T=1), $c_{(-.5)}(T) \approx .85$ and $c_{(-1.5)}(T) \approx .5$ (both because of T=2), where $c_\theta(T)$ is shorthand for $\max_k c_\theta(T_k)$. Thus, $T$ has high mild-El Niño precision and high mild-La Niña precision.

As further evidence that Alg. 1 recovered El Niño and La Niña, we show minimal state-to-state manipulations in Fig. 6. Take the La Niña → El Niño plot as an example. To compute it, we took all the SST maps for which T=La Niña, and for each found the closest (in Euclidean distance) map for which T=El Niño. We then averaged these differences. One of the insights the figure offers is that low SSTs in the Niño 3.4 region really are the distinguishing feature of T=La Niña. Similarly, an important difference between T=Warm and T=El Niño is the characteristic tongue of warm water extending into the Niño 3.4 region. Adding this tongue is necessary to switch from T=Cold to T=El Niño, but not to switch from T=Cold or T=La Niña to T=Warm.

Figure 7: Alg. 1 vs. clustering. In this toy example, the data is sampled from the distribution $P(X) = U(\{1/5; 2/5\} \cup \{3/5; 4/5\})$, $P(Y \mid X) = P(Y) = U(\{1/5; 2/5\} \cup \{3/5; 4/5\})$. The clusters in the $X$, $Y$, and joint $X, Y$ space are evident. However, since $X$ and $Y$ are independent, we expect Alg. 1 to find only one macro-level class of $X$. Indeed, (properly regularized) regression gives $f(x) = \text{const} \;\; \forall x$, so $W(x) = 0 \;\; \forall x$. Incidentally, since the density of $Y$ is similar in the neighborhood of each sample $y$ (see data $Y$-projection on the right), $T(y) = 0 \;\; \forall y$.

The CFL framework allows us to interpret $W$ and $T$ as standard probabilistic random variables with distributions we can estimate.
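Because $W$ and $T$ are discrete, their conditional distribution can be estimated by counting co-occurrences of macro-variable states. The following is a minimal sketch under stated assumptions: the states are taken to be available as two aligned label sequences (one pair per weekly average), and the function name `conditional_table` and the toy labels are illustrative rather than taken from the released code.

```python
from collections import Counter


def conditional_table(w_labels, t_labels):
    """Estimate P(T = t | W = w) from aligned label sequences by counting."""
    if len(w_labels) != len(t_labels):
        raise ValueError("label sequences must have the same length")
    pair_counts = Counter(zip(w_labels, t_labels))
    w_counts = Counter(w_labels)
    t_states = sorted(set(t_labels))
    # A missing (w, t) pair has count 0, so its conditional probability is 0.0.
    return {
        w: {t: pair_counts[(w, t)] / w_counts[w] for t in t_states}
        for w in sorted(w_counts)
    }


# Toy labels: one (W, T) pair per weekly average.
w_labels = ["WEqt", "WEqt", "WEqt", "WEqt", "ES", "ES", "ES", "ES"]
t_labels = ["Warm", "Warm", "Warm", "El Nino", "Cold", "Cold", "Cold", "Warm"]

table = conditional_table(w_labels, t_labels)
# table["WEqt"] == {"Cold": 0.0, "El Nino": 0.25, "Warm": 0.75}
# table["ES"]   == {"Cold": 0.75, "El Nino": 0.0, "Warm": 0.25}
```

Each row of the resulting table is a distribution over the states of $T$ and can be read as a statement of the form “given this wind state, the SST state is warm with probability 0.75”.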
Table 1 offers a probabilistic description of the system we learned. “When the equatorial zonal wind is unusually westerly, there is a 75% chance that the eastern Pacific is warm, and a 25% chance that El Niño arises” and “When the North-equatorial zonal wind is predominantly westerly, but the South-equatorial easterly, then the eastern Pacific is most likely to be cold” are example insights about the equatorial Pacific wind-SST system offered by CFL. We emphasize that both the macro-variables and the probabilities are learned from the data in an entirely unsupervised manner, without any a priori input about what constitutes ENSO events (except the fact that we restrict the SST and ZW fields to the equatorial Pacific region).

3 CFL: A ROBUST ALGORITHM

The practical bottleneck of the original CFL algorithm (Chalupka et al., 2016) is the need for joint density estimation of $p(X, Y)$. Density estimation is notoriously hard, especially in high dimensions. We modified the original algorithm to avoid explicit density estimation. An additional advantage of our approach (Alg. 1) is that it is very robust with respect to input space dimensionality: input data is only used explicitly in regression, which can be implemented using any algorithm that easily handles high-dimensional inputs (we used neural nets).

Let $\mathcal{X}$, $\mathcal{Y}$ denote the micro-variable input and output space, respectively. Our algorithm is based on the insight that CFL only needs to detect the two equivalences

$$p(Y \mid x_1) = p(Y \mid x_2) \quad \text{for any } x_1, x_2 \in \mathcal{X} \tag{1}$$

and

$$p(y_1 \mid x) = p(y_2 \mid x) \quad \text{for any } y_1, y_2 \in \mathcal{Y},\; x \in \mathcal{X}, \tag{2}$$

instead of actually computing the conditionals $p(Y \mid X)$.

Algorithm 1: Unsupervised Causal Feature Learning
input: $D = \{(x_1, y_1), \cdots, (x_N, y_N)\}$; Cluster, a clustering algorithm
output: $W(x)$, $T(y)$, the causal class of each $x$, $y$.
1 Regress $f \leftarrow \arg\min_f \sum_i (f(x_i) - y_i)^2$;
2 Let $W(x_i) \leftarrow \mathrm{Cluster}(f(x_1), \cdots, f(x_N))[x_i]$;
3 Let $\mathrm{Range}(W) = \{0, \cdots, N\}$;
4 Let $Y_w \leftarrow \{y \mid W(x) = w \text{ and } (x, y) \in D\}$;
5 Let $g(y) \leftarrow [\mathrm{kNN}(y, Y_0), \cdots, \mathrm{kNN}(y, Y_N)]$;
6 Let $T(y_i) \leftarrow \mathrm{Cluster}(g(y_1), \cdots, g(y_N))[y_i]$;

If Eq. (1) holds, we also have $E[Y \mid x_1] = E[Y \mid x_2]$. Computing conditional expectations is much easier than learning the full conditional: $f(X) = E[Y \mid X]$ minimizes $E[(Y - f(X))^2]$, so learning the conditional expectation amounts to regressing $Y$ on $X$ under the mean-squared error measure. Unfortunately, equal conditional expectations do not imply equal conditional distributions. However, arguably the practical risk of encountering differing conditionals with identical means is lower than the risk of failing at high-dimensional density learning. For this reason, we use $E[Y \mid x_1] = E[Y \mid x_2]$ as a heuristic indicator of the equivalence of the conditionals in Eq. (1) (see Line 2 in Alg. 1). For a more robust heuristic one could use more than just equal expectations to decide distribution equality. A promising direction would be to use a Mixture Density Network (Bishop, 1994) to approximate $P(Y \mid x)$ with a mixture of Gaussians for each $x$, and then cluster the mixtures.

Clustering the conditional expectations gives us the macro-variable class $W(x)$ of each input $x$. By construction (Chalupka et al., 2015), we have $p(Y \mid x) = P(Y \mid W(x))$ and by assumption the range of $W$ is small. Instead of checking whether Eq. (2) holds for a given pair $y_1, y_2$ over all the $x \in \mathcal{X}$, it is thus enough to check whether $p(y_1 \mid W = w) = p(y_2 \mid W = w)$ for each value $w \in \mathrm{Range}(W)$. For each given $w$ we have a subset $Y_w \subset \mathcal{Y}$ which consists of all the $y$’s whose corresponding $x$’s have causal class $w$. Consequently, Eq.
(2) does not depend on the exact densities conditional on the micro-state, but only on the densities conditional on the macro-level state.

Thus, instead of trying to evaluate any given $p(y \mid w)$, Line 5 computes the distance of $y$ to the $k$-th nearest neighbor in $Y_w$. This idea is based on a principle that underlies a whole class of nonparametric density estimation algorithms (Fukunaga and Hostetler, 1973; Mack and Rosenblatt, 1979): where the density is high, samples from the distribution are closer to each other than where the density is low. This is illustrated in Fig. 7. On the right, we plotted the projection of the data onto the $y$-space. In this projection, the distance of $y_1$ to its third-nearest neighbor is roughly the same as the distance of $y_2$ to its third-nearest neighbor. Indeed, this is the case for all the $y$’s, because they are generated from a distribution that assigns equal density to all of them.

Table 1: Each row shows $P(T \mid W = w)$ for a given $w$.

        Cold     El Niño   La Niña   Warm
EEqt    2/3      0         1/3       0
WEqt    0        1/4       0         3/4
EN      ∼1/10    0         1/4       ∼2/3
ES      3/4      0         0         1/4

Figure 8: Changes in macro-variable precision as we vary the number of states in CFL, clustering, and CFL on reshuffled data (“Rand CFL”). With two states, it is impossible to differentiate El Niño and La Niña from other weather features, be it dynamic (CFL) or spatio-structural (clustering). Increasing the number of states reveals differences between the algorithms.

In Chalupka et al. (2016) we represented each $y$ by an estimate of $[p(y \mid x_1), \cdots, p(y \mid x_N)]$, where $N$ is the number of datapoints. The new approach represents each $y$ sample by its “k-nn representation”, one scalar value for each $w \in \mathrm{Range}(W)$ (Line 5). Clustering these representations gives us the causal state $T(y)$ for each $y$.

Algorithm 1 relies on a successful regression $f$ that minimizes the mean squared error $E[(f(x) - y)^2]$.
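The k-nn representation of Line 5 can be sketched as follows. This is an illustrative sketch, not the released implementation: the function names and the one-dimensional toy data are assumptions, Euclidean distance is assumed, and a point that itself belongs to $Y_w$ counts as its own nearest neighbor (at distance zero).

```python
import math


def kth_nn_distance(y, points, k):
    """Distance from y to its k-th nearest neighbor among `points`."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    distances = sorted(math.dist(y, p) for p in points)
    return distances[k - 1]


def knn_representation(y, classes, k):
    """g(y): one k-NN distance per macro-level class Y_w (Line 5 of Alg. 1)."""
    return [kth_nn_distance(y, y_w, k) for y_w in classes]


# Two made-up classes of one-dimensional outputs, given as 1-tuples.
y_0 = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,)]
y_1 = [(10.0,), (10.5,), (11.0,)]

g = knn_representation((2.0,), [y_0, y_1], k=2)  # [1.0, 8.5]
```

A small entry in $g(y)$ indicates that $y$ lies in a high-density region of the corresponding $Y_w$, so two outputs with similar representations have similar (relative) densities under every macro-level cause, which is what Eq. (2) asks for.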
In our experiments, we used the Theano (Bastien et al., 2012) and Lasagne packages to implement and train a fully-connected neural network with three hidden layers (Bishop, 1995) in Python. The data was sufficiently simple (compared to, e.g., image datasets used to evaluate state-of-the-art neural nets in vision) that no regularization technique beyond simple weight decay and early stopping was necessary to minimize the validation error.

Figure 9: t-SNE (Van der Maaten and Hinton, 2008) embedding of the k-nn representation of SST data. The blue dots show, for varying K, the state of $T$ with largest $c_{(-.5)}$ precision (see Def. 4). The red dots show the state with largest $c_{.5}$. Thus, the blue dots are “the” La Niña cluster for each K, and the red dots “the” El Niño cluster.

4 ROBUSTNESS OF THE RESULTS

In this section, we describe two additional studies we performed to ensure our algorithm behaves as expected, and that the results are robust with respect to changing the experimental parameters.

4.1 VARYING THE NUMBER OF STATES

Our choice of discovering four-state macro-variables was rather arbitrary. To check how varying the number of states changes the macro-variable precision (Def. 4), we repeated our experimental procedure, varying the number of states K from 2 to 16 (both in the ZW and SST space). Fig. 8 shows the precisions for each case. As expected, a low number of states (K=2, 3) doesn’t allow the algorithm to precisely detect El Niño and La Niña. With K > 4, however, a slowly growing trend persists at high precision values. El Niño and La Niña remain important features as K changes.
There are several possible behaviors of the algorithm given the slowly growing precision of the macro-variables with growing K: (1) the El Niño and La Niña states remain roughly constant, (2) CFL sub-divides the El Niño and La Niña states, (3) CFL finds better El Niño and La Niña regions, (4) a mix of the above. Fig. 9 suggests that (2) is true. As K grows, the clusters that most precisely detect the mild El Niño and mild La Niña phenomena form a chain of strict subsets.

Table 2: Conditional probabilities $P(T \mid W)$ when Alg. 1 is applied to randomly (in time) reshuffled ZW and SST data.

      T1      T2     T3     T4
W1    .075    .40    .25    .27
W2    .083    .39    .25    .27
W3    .084    .39    .26    .27
W4    .080    .40    .24    .27

4.2 RESHUFFLED DATA

As a sanity check, we ran Alg. 1 on randomly reshuffled (across the time dimension) ZW and SST data. We asked the algorithm to find K=4, …, 16-state ZW and SST macro-variables. Table 2 shows $P(T \mid W)$, where $W$ and $T$ are the input and output macro-variables discovered in the randomized dataset with K = 4. Note that $P(T \mid W = W1)$, $P(T \mid W = W2)$, $P(T \mid W = W3)$ and $P(T \mid W = W4)$ are all approximately equal. This is exactly as expected, since by reshuffling the data we removed any probabilistic dependence between the inputs and the outputs. Applying Definition 2 to this data indicates that the algorithm implicitly only discovered one true input state, even though we explicitly asked it to look for a four-state macro-variable. The cardinality of the output macro-variable is three or four states, depending on whether .25 is close enough to .27 to apply Def. 2 to merge the last two columns.

We performed the same reshuffled analysis for each K and computed as before the precision for the mild and strong El Niño and the mild and strong La Niña. Fig. 8, large dotted lines, shows that in each case none of the clusters contains a significant proportion of either El Niño or La Niña patterns.
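The question of whether macro-variable states should be merged can be made concrete by comparing the rows of the conditional table. The sketch below is an illustration only and not part of Alg. 1: it greedily groups input states whose rows lie within a total-variation tolerance of each other, where the tolerance `tol = 0.05` and the helper names are assumptions. The two tables are copied from Table 2 and Table 1, with the approximate entries ∼1/10 and ∼2/3 entered as 0.10 and 0.67.

```python
def total_variation(p, q):
    """Total-variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def merge_states(table, tol):
    """Greedily group states whose rows are within `tol` of a group's first row."""
    groups = []
    for state, row in table.items():
        for group in groups:
            if total_variation(row, table[group[0]]) <= tol:
                group.append(state)
                break
        else:
            # No existing group is close enough: start a new one.
            groups.append([state])
    return groups


# P(T | W) on reshuffled data (Table 2).
reshuffled = {
    "W1": [0.075, 0.40, 0.25, 0.27],
    "W2": [0.083, 0.39, 0.25, 0.27],
    "W3": [0.084, 0.39, 0.26, 0.27],
    "W4": [0.080, 0.40, 0.24, 0.27],
}
# P(T | W) on the original data (Table 1), approximate entries rounded.
original = {
    "EEqt": [2 / 3, 0.0, 1 / 3, 0.0],
    "WEqt": [0.0, 0.25, 0.0, 0.75],
    "EN": [0.10, 0.0, 0.25, 0.67],
    "ES": [0.75, 0.0, 0.0, 0.25],
}

tol = 0.05  # assumed tolerance, chosen for illustration
reshuffled_groups = merge_states(reshuffled, tol)  # one group with all four states
original_groups = merge_states(original, tol)      # four separate groups
```

Under this (assumed) tolerance the four reshuffled input states collapse into a single class, while the four states learned from the original data remain distinct.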
This experiment offers two insights:

• Alg. 1 passes the sanity check. When the inputs and outputs are independent, the input macro-variable is trivial: it has a single state.

• When SST patterns are clustered according to their probability of occurrence (e.g. as the W variable does in Table 2), El Niño and La Niña are not identified as macro-level climate states. We will return to this point in the Discussion.

5 WHY NOT NAIVE CLUSTERING?

It is instructive to compare our results with unsupervised clustering. Fig. 8 shows the precision coefficients for k-means clustering with k = 4, ..., 16 (small dotted line), alongside our CFL results. Whereas CFL detects both El Niño and La Niña with high precision using only four states, k-means struggles to achieve a similar result even for larger K.

Barring particularities of the data (which we consider in the Discussion), there is in general no reason for CFL to give the same results as clustering. Consider the example in Fig. 7. Arguably, a reasonable clustering algorithm should find four linearly separable clusters in the joint X, Y space, and two clusters each in the X space and the Y space. However, the variables are probabilistically independent. In contrast, CFL would find only a one-state input variable, since all values of X imply the same distribution over Y. Additionally, since P(Y | X) = P(Y) is constant across all the samples, CFL would also find only a one-state output variable. The figure illustrates that Alg. 1 does precisely that (as should the original algorithm in Chalupka et al. (2016)).

6 DISCUSSION

The CFL framework we developed in Chalupka et al. (2015, 2016) aspires to solve an important problem in causal reasoning: how to automatically form macro-level variables from micro-level observations. In this work we have shown, for the first time, that these algorithms can be successfully applied to real-life data.
We have recovered well-known, complex climate phenomena (El Niño, La Niña) as macro-variable states directly from climate data, in an entirely unsupervised manner. In order to do so, we developed a new, practical version of the original CFL algorithm.

We emphasize that our experiments use observational climate data, and we have to be cautious about causal conclusions. It is not even clear a priori whether the ZW → SST causal direction is a reasonable choice: it is known that wind patterns cause changes in SST, and that SST in turn affects the wind by changing the atmospheric pressure. Feedback loops are commonplace in climate dynamics. The Causal Coarsening Theorems in Chalupka et al. (2015, 2016) provide the basis for efficient learning of causal relationships from observational macro-variables, although some experiments are still required. In addition, the theorems were only shown to hold for variables that are not subject to feedback. However, we are hopeful that an extension accounting for feedback can be proven. While real climate experiments are generally not feasible, such a theorem would provide the basis for large-scale experiments with detailed climate models, for example, to check whether interventionally shifting from the W = 0 zonal wind state to W = 1 in the climate model increases the likelihood of El Niño (i.e. of SST ending up in state T = 1). Connecting the CFL framework with such experiments is an exciting future direction, as it would also make it possible to use the macro-variables we have found to inform policy that aims to influence climate phenomena.

Our experiments comparing CFL with clustering showed that, as the number of clusters grows, k-means approaches but never exceeds CFL's precision in detecting El Niño and La Niña. One explanation for this finding is that while clustering looks for spatial features in the data, CFL looks for relational probabilistic features. Fig. 8 suggests that when the number of clusters is small, there are strong spatial features in the data that supersede El Niño and La Niña in their distinctiveness. In contrast, CFL already detects El Niño with high precision with only four clusters. This indicates that either (1) there is something unique about P(El Niño | W) and P(La Niña | W), or (2) there is something unique about P(El Niño) and P(La Niña). Since we ruled out the second hypothesis in Sec. 4.2, our results overall indicate that the El Niño and La Niña phenomena not only constitute interesting spatial features of the SST map, but are also crucially characterized by the dynamic interplay between zonal winds and sea surface temperatures.

Even when working with purely observational data, CFL offers an important causal insight not revealed by clustering methods: it guards against learning variables with ambiguous manipulation effects (Spirtes and Scheines, 2004). An illustrative example of an ambiguous macro-variable is total cholesterol. Low-density lipoprotein (LDL, commonly called "bad cholesterol") and high-density lipoprotein (HDL, "good cholesterol") can be aggregated to count total cholesterol (TC), but TC has an ambiguous effect on heart disease because the effects of LDL and HDL differ. The Causal Coarsening Theorem guarantees that each state of the observational macro-variable is causally unambiguous: no mixing of HDL and LDL can occur. In the case of our El Niño setup, this means that two ZW states within the same cell are guaranteed to have the same effect on the SST macro-variable.

Finally, we note that there is still significant debate among climate scientists about what exactly constitutes El Niño and what its causes are. For example, recent research has shown that there may be multiple different types of El Niño states (Kao and Yu, 2009; Johnson, 2013) that all fall under NOAA's definition.
Our results suggest that the current definition described in Section 1.1 coincides well with states of the probabilistic macro-variable discovered by CFL. In addition, Sec. 4.1 indicates that finer-grained structure does exist within the El Niño and La Niña clusters when they are analyzed from the relational-probabilistic standpoint. We leave this line of research as an important future direction.

Acknowledgements

KC's and PP's work was supported by the ONR MURI grant N00014-10-1-0933 and the Gordon and Betty Moore Foundation.

References

F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I. J. Goodfellow, A. Bergeron, N. Bouchard, and Y. Bengio. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.

C. M. Bishop. Neural networks for pattern recognition. Oxford University Press, 1995.

C. M. Bishop. Mixture density networks. 1994.

P. A. Cashin, K. Mohaddes, and M. Raissi. Fair weather or foul? The macroeconomic effects of El Niño. 2015.

K. Chalupka, P. Perona, and F. Eberhardt. Visual Causal Feature Learning. In Thirty-First Conference on Uncertainty in Artificial Intelligence, pages 181–190. AUAI Press, 2015.

K. Chalupka, P. Perona, and F. Eberhardt. Multi-Level Cause-Effect Systems. In The 19th International Conference on Artificial Intelligence and Statistics, 2016.

S. A. Changnon. Impacts of 1997-98 El Niño-generated weather in the United States. Bulletin of the American Meteorological Society, 80(9):1819, 1999.

T. Di Liberto. The Walker Circulation: ENSO's atmospheric buddy, 2014.

K. Fukunaga and L. D. Hostetler. Optimization of k nearest neighbor density estimates. IEEE Transactions on Information Theory, 19(3):320–326, 1973.

M. H. Glantz. Currents of change: impacts of El Niño and La Niña on climate and society. Cambridge University Press, 2001.

J. R. Holton, R. Dmowska, and S. G. Philander. El Niño, La Niña, and the southern oscillation, volume 46. Academic Press, 1989.

N. C. Johnson. How many ENSO flavors can we distinguish? Journal of Climate, 26(13):4816–4827, 2013.

M. Kanamitsu, W. Ebisuzaki, J. Woollen, S.-K. Yang, J. J. Hnilo, M. Fiorino, and G. L. Potter. NCEP-DOE AMIP-II reanalysis (R-2). Bulletin of the American Meteorological Society, 83(11):1631–1643, 2002.

H.-Y. Kao and J.-Y. Yu. Contrasting eastern-Pacific and central-Pacific types of ENSO. Journal of Climate, 22(3):615–632, 2009.

C. W. Landsea and J. A. Knaff. How much skill was there in forecasting the very strong 1997-98 El Niño? Bulletin of the American Meteorological Society, 81(9):2107–2119, 2000.

K. M. Lau and S. Yang. Walker circulation. Encyclopedia of Atmospheric Sciences, pages 2505–2510, 2003.

Y. P. Mack and M. Rosenblatt. Multivariate k-nearest neighbor density estimates. Journal of Multivariate Analysis, 9(1):1–15, 1979.

M. J. McPhaden, S. E. Zebiak, and M. H. Glantz. ENSO as an integrating concept in earth science. Science, 314(5806):1740–1745, 2006.

J. Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2000.

C. F. Ropelewski and M. S. Halpert. Global and regional scale precipitation patterns associated with the El Niño/Southern Oscillation. Monthly Weather Review, 115(8):1606–1626, 1987.

P. Spirtes and R. Scheines. Causal inference of ambiguous manipulations. Philosophy of Science, 71(5):833–845, 2004.

K. E. Trenberth. The definition of El Niño. Bulletin of the American Meteorological Society, 78(12):2771–2777, 1997.

L. Van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
