Combining observational and experimental data to find heterogeneous treatment effects

Combining observational and e xperimental data to ﬁnd heter og eneous treatment effects Ale xander P e ysakhovich F acebook Ar tiﬁcial Intelligence Research 770 Broadwa y New Y ork, NY 10003 Akos Lada F acebook News F eed 1 Hack er W ay Menlo P ark, CA 94025 ABSTRA CT Ev ery design c hoice will hav e diﬀerent eﬀects on diﬀeren t units. Ho wev er traditional A/B tests are often underpow- ered to iden tify these heterogeneous eﬀects. This is es- pecially true when the set of unit-level attributes is hi gh- dimensional and our priors are weak ab out whic h particular co v ariates are important. Ho wev er, there are often ob ser- v ational data sets av ailable that are orders of magnitude larger. W e prop ose a metho d to combine these tw o data sources to estimate heterogeneous treatmen t eﬀects. First, w e use observ ational time series data to estimate a map- ping from cov ariates to u nit-level eﬀects. These estimates are likely biased but under some conditions the bias pre- serv es unit-lev el relative rank orderings. If these conditions hold, we only need suﬃcient exp erimen tal data to identify a monotonic, one-dimensional transformation from observ a- tionally predic ted treatmen t eﬀects to real treatmen t eﬀects. This reduces p ow er demands greatly and makes the detec- tion of heterogeneous eﬀects muc h easier. As an application, w e sho w ho w our method can b e used to impro ve F aceb o ok page recommendations. K eywords Heterogeneous treatment eﬀects; Personalization 1. INTR ODUCTION Experimentation and data-grounded approac hes to the design of pro ducts, websites and services hav e b ecome im- mensely p opular in the internet industry [28, 3]. This is for a very go o d reason: when decision-makers employ exper- imen tation they hav e a far greater chanc e of making go o d decisions than with observ ation alone [22, 4, 2 0, 19, 6]. Stan- dard exp erimentation techniques are often used to ev aluate whether a particular p olicy works on av erage but decision- mak ers and analysts are increasingly in terested in how their designs will aﬀect diﬀerent groups [13]. Understanding this heterogeneit y is particularly imp ortan t when a design is to o expensive to give to the whole p opulation (and so should Submitted to WWW ’17 P erth, Austr alia A CM ISBN N A. DOI: NA be giv en to the subset that will beneﬁt the most) or when it has positive eﬀects for some, but may not b e appropriate for others. Our con tribution is to discuss how analysts can com bine experimental with observ ational data to help solv e the p ersonalization problem. T raditional metho ds hav e allow ed analysts to sp ecify ex- an te interesting subgroups [13, 17] and look for diﬀerences. More modern work in machine learning, statistics and econo- metrics has begun to focus on streamlinin g this process. The methods automatically ﬁnd heterogeneous groups using a v ariet y of tricks such as non-parametric pro cedures [10, 26], regularized regression [18], trees [14, 25], causal trees and forests [2, 27], deep neural netw orks [24], ‘virtual twin’ anal- yses [11] and mixtures of mo dels [15]. Ho wev er, these automated searches are diﬃcult. In many important domains a verage eﬀects are small (relative to noise), the set of unit-level cov ariates is high-dimensional and we lac k a priori knowledge ab out which cov ariates are imp or- tan t predictors of heterogeneity . The combination of these conditions means that to identify heterogeneous eﬀects pre- cisely exp eriments hav e to b e very large and, in general, prohibitiv ely costly [21]. In contrast to exp ensiv e exp eri- men tal data, observ ational data is often av ailable in muc h larger quantities especially in applications suc h as medicine, online commerce or so cial media. Our contribution is to in- v estigate whether we can combin e this observ ational data with exp eriments to help learn the mapping from unit-level features to heterogeneous eﬀects. Our general approach can b e describ ed by the heuristic “larger correlations suggest larger causal eﬀects.” W e learn a mapping from unit-level cov ariates to the size of the unit- lev el causal eﬀect that we get from using time series observ a- tional data. W e then use this mapping as a rank-preserving transformation of the true causal eﬀects. Of course our heuristic is not alwa ys applicable and we discuss the sta- tistical assumptions on the data generating pro cess under whic h the statement ab ov e is true. Statistical assumptions are useful but often they are ab- stract and so analysts need guidelines for when they are lik ely to hold or not. In general, this requires domain kno wl- edge and reasoning. W e fo cus on a real world application: recommendations of pages to users on F aceb o ok. W e presen t a stylized mo del of user behavior and argue that under rea- sonable assumptions on this behavioral mo del the required statistical assumptions are satisﬁed. Of course, the ﬁnal ar- biter of suc h questions is data so we re-analyze an existing product A/B test to show that indeed observ ational data seems to provide a useful prior for exp erimentally estimated causal eﬀects. W e note that our observ ational approach is not in tended to supplant the approac hes su rvey ed abov e. Rather, w e view observ ational data as a complemen t to, not a substitute for, randomized trials. In practice, we suggest that the observ a- tionally predicted treatment eﬀect for eac h unit b e added as a feature into an analyst’s fav orite heterogeneous treatmen t eﬀect pro cedure. There is a cost here: adding another feature raises the complexit y of the machine learning procedure. There is also a b eneﬁt: if the conditions on the data generating pro cesses w e outline here are satisﬁed the b eneﬁts in predictiv e p ow er ma y b e large. W e argue that if the cov ariate space is already large, the eﬀect on mo del complexity is negligible [12] and so the cost is likely to b e small while the b eneﬁt in terms of reducing exp erimental pow er requiremen ts may b e quite large. 2. THE BASIC SETUP W e work with the following situation: w e ha ve units in- dexed b y i , time indexed by t , a con tinuous v ariable of in- terest x and a contin uous outcome v ariable y . W e hav e a treatmen t whic h can change the v alue of x by one and we are interested in personalizing the cho ice of whether a par- ticular unit (i.e. group or individual) with a given cov ariate proﬁle should or should not receive the treatment (p erhaps the treatment has a cost or ﬁnite supply). T o mo del this we assume eac h unit has a linear 1 response: when we raise x i b y one, y will increase by β i . T o complete the mo del, we assume there is a (p otentially high dimen- sional) space of co v ariates C and each unit has a cov ariate proﬁle c i . Note that this cov ariate is assumed ﬁxed per unit and not aﬀected by p olicy choices. There is a (p otentiall y probabilistic) mapping f from co- v ariates to treatmen t eﬀects β i = f ( c i ) and the key to doing goo d p ersonalization is to learn this mapping. One wa y to estimate f is to run a very large exp erimen t. W e can raise x by one unit in treatment (lea ving the control as is) and estimate the function E ( f ( c i )) = E ( y | tr eated, c i ) − E ( y | contr ol, c i ) using oﬀ-the-shelf metho ds [26, 18, 14, 25, 2, 27, 24, 11, 15]. Ho wev er, when treatment eﬀects are relatively small and cov ariate space is large this tech nique will require quite large and expensive exp eriments. On the other hand, highly granular observ ational data is often av ailable in large quantit ies in online applications and increasingly also in the domains of medicine and public policy . W e no w discuss ho w to incorporate this observ ational data to help make the pro cess of estimating heterogenous causal eﬀects easier. Suppose that the real data generating pro cess is giv en by the structural equations x t i = θ i +  t i + U t i ψ i and y t i = µ i + x t i β i + U t i γ i + η t i . 1 W e assume linearity b ecause in man y cases of interest our treatmen t will ha ve relativel y small eﬀects on x and thus w e are interested in the locally linear approximation of the true response function. Here we hav e that  and η are error terms indep enden t of all the other v ariables, U is some time v arying unobserved v ariables (whic h in general can be a v ector but for ease of no- tation we write as a scalar quan tity), µ and θ are unobserv ed v ariables that are ﬁxed at the unit level. F or simplicity , all of these v ariables are mean 0 with ﬁnite v ariances. Note that this is without loss of generalit y as we could include some observ ed v ariables V t i , then the wa y to apply the method b elow is simply to use not the original x and y but x and y conditional on V (ie. the residuals). F or the rest of this paper w e assume all v ariables other than x and y are unobserv ed b y the analyst. Suppose that for a large set of units i we hav e obser- v ational data for x and y coming from a time series with perio ds t ∈ { 1 , . . . , T } where T is large. Denote each daily data p oint for each unit as ( x t i , y t i ). This is the standard panel data set up, how ever standard panel data techniques are generally in terested in learning the av erage causal eﬀect, not individualized ones [1]. What happ ens if we take this data and run a separate linear regression, including an intercept, for eac h unit? F or simplicit y we assume that T is very large, so we can work with the p opulation quantities free of uncertain ty . First note that the panel asp ect of our data means we can ignore µ i and θ i in the analyses that follo w. This is b ecause E ( x ) = θ i and E ( y ) = µ i + β i µ i . Both of these quantities are ﬁxed within units and thus pick ed up b y the intercept term of eac h unit-speciﬁc linear regression. Let us focus on the more interesting question: the co eﬃ- cien t on x in eac h of our unit level regressions. W e will call this our observational ly estimate d ca usal eﬀect ˆ β i . W e don’t observ e U and so our estimated ˆ β i will not be equal to β i due to omitted v ariable bias [1, 6]. W e abuse notation a bit and let X i be demeaned vector of observ ations of x for unit i (similarly for Y i ). Recall that the co eﬃcien t ˆ β i is the solution to the least squares problem ˆ β i = ( X 0 i X i ) − 1 ( X 0 i Y i ) . W e can substitute the structural equation for y (ignoring µ and θ b ecause they hav e b een pick ed up by removing the means) into the abov e to get E ( ˆ β i ) = E (( X 0 i X i ) − 1 ( X 0 i X i β + U i γ i + η i )) whic h after some algebra turns in to E ( ˆ β i ) = β i + E (( X 0 i X i ) − 1 ( X 0 i U i ) γ i ) = β i + C ov ( x i , u i ) V ar ( x i ) γ i where the last step comes from the fact that x and u are scalar. This is an imp ortant example of the diﬀerence b etw een prediction problems and in terven tional problems [6], [23]. x i ˆ β i is the b est unbiase d line ar pr e dictor of y i but it is not an estimate of the causal eﬀect β for the same reason that kno wing the num b er of units of ice cream sold on a given da y can predict how many p eople will drown in swimming po ols on that da y but that that banning ice cream (an in ter- v ention) will likely not aﬀect drownings b y nearly as muc h as the observ ational regression would suggest. The bias term here is the quantit y of interest for us. Sup- pose that we ha v e tw o units i , j and we ha v e that β i > β j . If C ov ( x i ,u i ) V ar ( x i ) γ i > C ov ( x j ,u j ) V ar ( x j ) γ j then we are guaran teed to ha ve ˆ β i > ˆ β j . If this condition holds for any such i , j then ˆ β deriv ed from observ ational time series data will b e a rank preserving estimate of β . In other w ords, if “bigger causal ef- fects imply bigger bias” then “larger correlation eﬀects imply larger causation.” 2 In such a case ˆ β i is a suﬃcient statistic for targeting bud- get constrained in terven tions (ie. when we can only aﬀord to give the in terven tion to some p ercent age of the p opula- tion). This also means that if we can learn a function g whic h maps cov ariates c i to ˆ β i then this function will b e a monotonic transformation of the true heterogeneous eﬀects function f . In practice, we suggest learning g by estimating a set of { ˆ β 1 , . . . , ˆ β N } for a large num b er of units using time series data and then using an y standard machine learning tech- nique to learn E ( ˆ β | c ) . Note that the existence of time series data is what mak e our job here possible because it allows us to remov e time in- v arian t unit-level eﬀects. Without multiple observ ations per unit, it would theoretically b e p ossible to p erform a simi- lar procedure but the assumptions required w ould be muc h stronger as we would need muc h more complex conditions on the co v ariance structure b etw een µ , θ , u and β to hold. Recall that our motiv ation is environmen ts where obser- v ational data is plentiful but exp erimental data is sparse. Th us, if our assumptions ab ov e are satisﬁed, once we ha ve g we only need to use exp erimental data to learn a one di- mensional transformation h from ˆ β to β rather than the full, high dimensional function f (ﬁgure 1). Figure 1: The relationship b etw een the true hetero- geneous treatmen t eﬀect function f and the obser- v ational one g . There is no wa y to chec k whether the assumptions un- derlying our pro cedure hold or not using observ ational data only . How ev er, we are not advocating that observ ational data replace experiments. When analyzing real results, an- alysts will likely already b e using a technique to lo ok for heterogeneous treatment eﬀects using C as a feature space. Our argumen t is simply that the analyst should add ˆ g ( c ) (the unit level predicted v alue of ˆ β ), to the set of unit-lev el features. In the worst case, this increases the complexity of the heterogeneous eﬀect mo del slightly (assuming C is al- ready high dimensional). In the best case, our assumptions hold exactly , ˆ g ( c ) is a p erfect monotonic transformation of f ( c ) and th us using ˆ g ( c ) as a feature greatly reduces exper- imen tal p o wer requirements. 2 Note that this is a suﬃcien t, but not necessarily condition. It is necessary th at ∂ bias ∂ β > − 1 ho wev er this do esn’t make for as catch y one sentence explanations or as simple to describe behavioral assumptions. 3. APPLICA TION: F A CEBOOK P A GE REC- OMMEND A TIONS W e now w alk through an example where observ ational es- timates can be helpful to analysts interested in understand- ing heterogeneous eﬀects in the real world. W e fo cus on personalization of F aceb o ok’s page recommendation engine. P ages are non-user entities on F aceb o ok than can post con- ten t (e.g. news sites, blogs, certain celebrities). People can ‘Lik e’ a page to connect to it. Liking a page makes p osts that the page creates eligible to app ear in the liker’s News F eed. Helping users connect to conten t they care ab out can greatly improv e their exp erience on the site, therefore F ace- bo ok employs recommendations to suggest relev ant pages to users. These recommendations can sho w up in a user’s News F eed in the form of ‘Pages Y ou May Like’ units (ﬁgure 2 for an example). F acilitating these connections has co sts. There is opportu- nit y cost (v arious ‘Recommended Page’ units tak e up space on the News F eed) and there is a related user exp erience cost (users who do not wan t or need more connections can be incon venienced or anno y ed). Thus, unlike in the standard recommender system setting where the question of interest is which p age should we suggest [5], an equally imp ortant question for us to address is: which users’ experience will be improv ed the most by additio nal page recommendations? Figure 2: An example page recommendation unit on F aceb o ok from 2014. In the fall of 2015 F acebo ok tested a new type of rec- ommendation - this unit would sho w a ‘represen tative‘ p ost from a page in a user’s news F eed (with a header of ‘Rec- ommended F or Y ou’). T o construct these recommendations F acebo ok’s standard recommender systems were used and only user-page pairs that had very high similarity scores ac- cording to this system w ere eligible for this new recommen- dation system. Other asp ects of the recommendation unit’s design were also ﬁne tuned such that the recommendation w as ligh tw eight and did not detract from user exp erience. T o gauge whether this unit impro ved user experience, F ace- bo ok p erformed a standard A/B test on randomly c hosen users. Approx imately 8 million people were eligible for this test (eligibility required that the underlying recommender systems were conﬁdent in their suggestions of p otential new pages) and approximately 400 , 000 were randomly chosen to see the new recommendation units. W e will now show that in this example observ ationally estimated causal eﬀects (from time series data) do indeed predict true causal eﬀects estimated via an interv ention. 3.1 Beha vioral Model First we discuss a micro economic mo del of the world to get an intuition for whether the heuristic of “larger correlations imply larger causal eﬀects” requires very stringen t assump- tions to hold or whether it is p ossible for this to be true in a st ylized but realistic model. The ultimate judge of the eﬀectiv eness of our assumption will b e data, but the mo del here is useful for gaining an in tuition about our chances of success. W e deﬁne x t i as a user’s p age inventory supply . This is the n umber of p osts that all pages that he or she is connected to (ie. has selected to b e a fan of ) ha ve made on a given da y . Note that not all of a users’ page in ven tory is necessarily view ed or engaged with by that user. F or our outcome v ariable y t i , w e will use o verall time spent on F acebo ok by that user, on an y desktop or mobile device. W e will call this user i ’s demand for F aceb o ok on da y t . W e c ho ose time sp en t as our measure of demand b ecause there are man y wa ys that a piece of page in ven tory could be inter- esting to a user: page in ven tory can b e consumed passiv ely (e.g. just by reading an article or watc hing a video). It can also lead a user to engage with similar conten t, the user can discuss the article in the comments section or in a group or reshare it to their friends. Asking whether a piece of in ven- tory increases the total time sp en t on F acebo ok captures all of these v arious wa ys that piece of conten t can improv e a user’s exp erience. Th us, we are interested in learning the mapping from c to β b ecause individuals with large β are ones who can b eneﬁt the most from additional page in ven tory in te rms of increase their demand (as opp osed to users who would not ﬁnd page posts v ery in teresting or those that already hav e more than enough inv entory and would not beneﬁt from any more). Let us consider a simple mo del for demand and supply . Suppose there is a single dimension on which individuals dif- fer, which we’ll call their aﬃnity for F aceb ook denoted a i . This dimension aﬀects baseline demand for F aceb o ok (b e- cause more activ e p eople hav e more friends and ﬁnd more use from the platform) as well as their supply of page in- v entory (because more active p eople fan more pages). F or clarit y , w e also assume that higher a implies higher β . This w ould b e true in a wo rld where eﬀects were diﬀerent b e- cause latent heterogeneity is on a multiplicativ e scale. This dimension is laten t and unobserved in our data directly . W e also assume there is a time v arying unobserv able e t whic h is a prop erty of a da y . W e call this the da y’s ‘even t w orthyness’ - high e days means that more things hav e hap- pened on that day . This is another unobserved v ariable and we assume e aﬀects x and y as follows. First, higher e days increase the demand y through channels that are un- related to page inv entory supply (for example, by making it more attractiv e to come to F acebo ok to talk ab out the da y’s ev ents either due to the individuals’ own choices or due to the fact that more friends are p osting and discussing). Sec- ond, higher e aﬀects page inv entory supply b ecause higher e days cause pages to create more con tent. W riting this in terms of our linear structural equations giv es: x t i = D ( a i ) +  t i + e t ψ ( a i ) and y t i = µ ( a i ) + x t i β ( a i ) + e t γ ( a i ) + η t i . Running unit lev el regressions gives estimates ˆ β ( a i ) = β ( a i ) + γ ( a i ) C ov ( x i , e i ) V ar ( x i ) where the bias term can b e rewritten by substituting the structural equation for x as γ ( a i ) ψ ( a i ) σ e ψ ( a i ) 2 + σ  . Since higher a implies higher β we also need the bias term to b e increasing in a. It seems reasonable to assume that in our mo del ψ and γ are b oth increasing in a . The former because we hav e made the assumption that more active p eo- ple fan more pages (and thus their inv entory supply is more aﬀected by the o ccurrence of some even t) and the latter b e- cause p eople with higher baseline aﬃnit y are more likely to ha ve more friends and also to be more active on F aceb o ok in general (thus even ts are more likely to generate additional demand for a user with h igh aﬃnity than o ne with low aﬃn- it y). Therefore, to get the bias term to grow in a we need that in general γ ( a ) ψ ( a ) grows at a faster rate than ψ ( a ) 2 . Thus, w e need γ ( a ) to grow faster in a than ψ ( a ) . This would imply in our case that the impact of aﬃnit y on baseline demand is larger than the impact of increasing aﬃnit y on page inv entory supply . There are many things that aﬀect F aceb o ok demand (for example friends, groups joined, baseline like of F aceb o ok, local communit y norms, quality of in ternet connection and page in ven tory supply). Higher a likely increases all of these things and this total eﬀect is what is captured in γ . On the other hand, ψ captures only one piece of that sum. Thus, in our mo del the growth conditions ab ov e seem quite reason- able. Again, we note that the purpose of this mo deling exer- cise is to gain in tuition, not claim that we’v e mo deled the true data generating process. Rather by putting a st ylized structure on the data generating pro cess w e can understand whether there is an y cha nce that our statistical conditions hold and that our observ ationally estimated eﬀect is related to our true causal eﬀect. The ultimate arbiter will alwa ys be the data and that is what we turn to next. 3.2 Estimating ˆ β ( c ) W e now estimate our observ ational causal eﬀect. T o do this, w e take a random, deidentiﬁed, sample of 120 million F acebo ok users. W e then compute ˆ β i for each of the users b y running the panel regressions as describ ed ab ov e on 60 da ys of ( y t i , x t i ) pairs with the deﬁnitions of y and x as in our b ehavioral mo del. This gives us our set of user-level coeﬃcients ˆ β i . W e estimate the function E ( ˆ β | c i ) using oﬀ the shelf mac hine learning. Due to outliers, skew and noise in the estimates of ˆ β we c hange the regression problem to a classiﬁcation problem. Instead of learning a predictor for contin uous heterogenous treatmen t eﬀects we perform a vers ion of quantile regression [1] and give a user a lab el of 1 if their estimated ˆ β i is in the top 20% of all estima ted ˆ β i and 0 otherwise. This also allows us to gauge our mac hine learning using AUC as a metric (whic h is more interpreta ble than MSE) as w ell as remov es outliers (some estimated eﬀects are v ery large or very small due to the fact that the ﬁrst stage pro cedure is quite noisy). F or c we tak e a large set of user-level co v ariates (these are standard cov ariates that analysts at F acebo ok hav e found useful including but not limited to a verage amounts of in- v entory of v arious kinds, engagements p er conten t type, time spent, etc - the full set of features is to o long to en umerate here) and use the F aceb o ok machine learning stack which uses gradient b o osted decision trees as feature transformers follo wed by a ﬁnal linear la yer (see [16] for more details) to train a classiﬁer using these features. The classiﬁer giv es us our function ˆ g . Note that now the output of the classiﬁer is the probability that a unit with co v ariate proﬁle c i is in the top quan tile of treatmen t eﬀects. This is ok b ecause we are not looking for actual n umbers (because we don’t b elieve that β = ˆ β ) but rather a rank ordering of units, which this probabilit y provides. 3.3 T esting ˆ β ( c ) W e now turn to v erifying whether these observ ational esti- mates predict treatmen t eﬀects in the pro duct test described abov e. As our outcome measure we lo ok at the total time spent ov er 1 week of the experiment. T o impro ve statistical pow er w e transform the v ariable by taking natural logs and diﬀerence out 1 w eek of pre-treatment time sp ent p er unit (so our outcome v ariable is ∆ y rather than just y ). Because our outcome measure is so highly autocorrelated and right sk ewed these t wo transformations increase statistical p ow er dramatically . W e now ask: is there an increasing relation- ship betw een our learned E ( ˆ β | c ) and the actual treatmen t eﬀect in the exp eriment? Figure 3 shows that the answer is y es. Stratifying the ex- perimental p opulation by predicted treatment eﬀects sho ws an increasing relationship betw een ˆ g ( c ) and the actual treat- men t eﬀect (the av erage diﬀerence in treatmen t and con- trol). Note that the error bars are quite large b ecause ev en though there are approximate ly 8 million people in the con- trol group, there are only approximatel y 400 , 000 p eople in the treatment group and the exp eriment inv olv es a very small change in the user experience. W e can also show this eﬀect in a linear regression. W e regress demand during the experimental p erio d on an in ter- cept, a treatment 0 , 1 dummy , the predicted treatment ef- fect, the in teraction of the treatment dumm y and the obser- v ationally predicted treatmen t eﬀect and the pre-exp eriment unit-lev el demand (again this last term is just to reduce v ari- ance). W e ﬁnd a highly signiﬁcant interaction (p < .01; th us conﬁrming the visual impression of Figure 3). 4. CONCLUSION Estimating heterogeneous eﬀ ects is an important but v ery diﬃcult problem. It is particularly diﬃcult when the set of unit-lev el cov ariates is large and priors ab out which ones are imp ortan t are weak. W e hav e shown a method that uses observ ational data to get an estimate for true heterogeneity in causal eﬀects. Our metho d requires assumptions on the data generating pro cess, but w e ha ve discussed ho w analysts can think through whether these assumptions are likely to be satisﬁed in practice. In addition, w e hav e argued that the cost of including an additional v ariable (the observ ationally predicted treatmen t eﬀect) i n exp erimental analyses is lik ely small but the gain is potentially large. W e ha ve sho wn that this is true on a real world task. There are many p ossible extensions to our metho d. F rom the applied side we think that incorporating observ ational and experimental data in to a single end-to-end training pro- ● ● ● 0.00 0.01 Bottom 1/3 Middle 1/3 T op 1/3 Predicted T reatment Effect From Observational Data Actual T reatment Effect (log scale) From Experiment Figure 3: There is a monotonic relationship betw een predicted treatmen t eﬀects and actual treatment ef- fects. cedure (rather than step-wise as we do) is an interesting direction for future work. F rom the statistical side we hav e made strong assumptions ab out the data generating pro cess to get our results. Ho wev er whether there are weak er as- sumptions which yield weak er but nonetheless useful results about the relationship b etw een observ ationally and experi- men tally estimated treatment eﬀects remains op en. The metho d presented here b ears conceptual resemblance to recent work on combining surv ey results from a probabil- it y sample and a biased sample using ‘data enriched’ linear regression [8]. It seems to us that b oth our work here and the results discussed there ﬁt into the muc h larger growing literature on semisup ervised learning [29], [9] or multitask learning [7]. The idea b ehind suc h approac hes is that ma- c hine learning can b e made more eﬃcien t by p erforming m ultiple tasks at the same time. The intuitio n behind why this can o ccur is that repre- sen tations of the data learned from doing one task can b e useful for other related tasks (for example the embedding of images learned from classifying whether a picture contains a cat may also b e useful to detect whether an picture con- tains a dog). Viewed in this framew ork, our met ho d uses the observ ational task to represents each unit by a scalar (the observ ationally estimated causal eﬀect), Ho wev er, we could also imagine training the map from c to ˆ β using a neural net work and, instead of using the predictions as the feature of interest for our exp erimental heterogeneous eﬀects clas- siﬁer we could use some intermediate representation of the unit features learned from the neural net work. This seems lik e a promising av en ue for future w ork. It’s hard to make goo d decisions without data and recen t y ears hav e seen an explosion in the scope and complexity of data that decision-makers can use. Ho wev er, data b y itself is useless without methods to pro cess it. W e hop e that our research contributes to both the theoretical and applied aspects of this vital discussion. 5. REFERENCES [1] J. D. Angrist and J.-S. Pischk e. Mostly harmless e c onometrics: An empiricist’s c ompa nion . Princeton univ ersity press, 2008. [2] S. Athey and G. Im b ens. Recursive partitioning for heterogeneous causal eﬀects. Pr o c e e dings of the National A c ademy of Scienc es , 113(27):7353–7360, 2016. [3] E. Baksh y , D. Eckles, and M. S. Bernstein. Designing and deploying online ﬁeld exp eriment s. In Pr o c e e dings of the 23r d ACM c onfer enc e on the World Wide Web . A CM, 2014. [4] A. Banerjee and E. Duﬂo. Po or e c onomics: A r adica l r ethinking of the way to ﬁght glob al p overty . PublicAﬀairs, 2012. [5] J. Bennett and S. Lanning. The netﬂix prize. In Pr o c ee dings of KDD cup and workshop , volume 2007, page 35, 2007. [6] L. Bottou, J. Peters, J. Q. Candela, D. X. Charles, M. Chick ering, E. Portugaly , D. Ray , P . Y. Simard, and E. Snelson. Counterfactual reasoning and learning systems: the example of computational adv ertising. Journal of Machine L e arning R ese ar ch , 14(1):3207–3260, 2013. [7] R. Caruana. Multitask learning. In L earning to learn , pages 95–133. Springer, 1998. [8] A. Chen, A. B. Owen, M. Shi, et al. Data enric hed linear regression. Ele ctr onic Journal of Statistics , 9(1):1078–1112, 2015. [9] R. Collob ert and J. W eston. A uniﬁed arc hitecture for natural language processing: Deep neural netw orks with multitask learning. In Pr o c e e dings of the 25th international c onfer enc e on Machine le arning , pages 160–167. ACM, 2008. [10] R. K. Crump, V. J. Hotz, G. W. Im b ens, and O. A. Mitnik. Nonparametric tests for treatmen t eﬀect heterogeneit y . The R eview of Ec onomics and Statistics , 90(3):389–405, 2008. [11] J. C. F oster, J. M. T aylor, and S. J. Ruberg. Subgroup iden tiﬁcation from randomized clinical trial data. Statistics in me dicine , 30(24):2867–2880, 2011. [12] J. F riedman, T. Hastie, and R. Tibshirani. The elements of statistic al le arning , v olume 1. Springer series in statistics Springer, Berlin, 2001. [13] M. Gail and R. Simon. T esting for qualitativ e in teractions betw een treatmen t eﬀects and patien t subsets. Biometrics , pages 361–372, 1985. [14] D. P . Green and H. L. Kern. Mo deling heterogeneous treatmen t eﬀects in surv ey exp erimen ts with bay esian additiv e regression trees. Public Opinion Quarterly , page nfs036, 2012. [15] J. Grimmer, S. Messing, and S. J. W estw oo d. Estimating heterogeneous treatmen t eﬀects and the eﬀects of heterogeneous treatments with ensemble methods. Unpublishe d manuscript, Stanfor d University, Stanfor d, CA , 2014. [16] X. He, J. Pan, O. Jin, T. Xu, B. Liu, T. Xu, Y. Shi, A. Atallah, R. Herbric h, S. Bow ers, et al. Practical lessons from predicting clicks on ads at facebo ok. In Pr o c ee dings of the Eighth International Workshop on Data Mining for Online A dvertising , pages 1–9. A CM, 2014. [17] J. J. Heckman and E. Vytlacil. Structural equations, treatmen t eﬀects, and econometric p olicy ev aluation. Ec onometric a , 73(3):669–738, 2005. [18] K. Imai, M. Ratko vic, et al. Estimating treatmen t eﬀect heterogeneity in randomized program ev aluation. The Annals of Applie d Statistics , 7(1):443–470, 2013. [19] R. Koha vi, R. M. Henne, and D. Sommerﬁeld. Practical guide to controlled exp erimen ts on the web: listen to yo ur customers not to the hippo. In Pr o c ee dings of the 13th A CM SIGKDD international c onfer enc e on Know le dge disc overy and data mining , pages 959–967. A CM, 2007. [20] R. J. LaLonde. Ev aluating the econometric ev aluations of training programs with experimental data. The Am eric an ec onomic r eview , pages 604–620, 1986. [21] M. W. Lipsey . Design sensitivity: Statistic al p ower for exp erimental r ese ar ch , volume 19. Sage, 1990. [22] M. N. Meyer. Two cheers for corp orate experimentation: The a/b illusion and the virtues of data-driv en inno v ation. J. on T ele c omm. & High T e ch. L. , 13:273, 2015. [23] J. Pearl. Causality . Cambridg e universit y press, 2009. [24] U. Shalit, F. Johansson, and D. Sontag. Bounding and minimizing counterfactual error. arXiv prep rint arXiv:1606.03976 , 2016. [25] X. Su, C.-L. Tsai, H. W ang, D. M. Nic kerson, and B. Li. Subgroup analysis via recursive partitioning. The Journal of Machine L e arning R ese ar ch , 10:141–158, 2009. [26] M. T addy , M. Gardner, L. Chen, and D. Drap er. A nonparametric bay esian analysis of heterogeneous treatmen t eﬀects in digital exp erimentation. arXiv pr eprint arXiv:1412.8563 , 2014. [27] S. W ager and S. Athey . Estimation and inference of heterogeneous treatment eﬀects using random forests. arXiv pr eprint arXiv:1510.04342 , 2015. [28] Y. Xu, N. Chen, A. F ernandez, O. Sinno, and A. Bhasin. F rom infrastructure to culture: A/b testing c hallenges in large scale so cial net works. In Pr o c ee dings of the 21th A CM SIGKDD International Confer enc e on Know le dge Disc overy and Data Mining , pages 2227–2236. A CM, 2015. [29] X. Zhu. Semi-sup ervised learning literature surv ey . 2005.

Combining observational and experimental data to find heterogeneous treatment effects

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment