Long-term causal effects via behavioral game theory
Panos Toulis (Econometrics and Statistics, University of Chicago, Booth School) and David C. Parkes (Harvard University, School of Engineering and Applied Science)

November 7, 2016

Abstract

Planned experiments are the gold standard in reliably comparing the causal effect of switching from a baseline policy to a new policy. One critical shortcoming of classical experimental methods, however, is that they typically do not take into account the dynamic nature of response to policy changes. For instance, in an experiment where we seek to understand the effects of a new ad pricing policy on auction revenue, agents may adapt their bidding in response to the experimental pricing changes. Thus, causal effects of the new pricing policy after such an adaptation period, the long-term causal effects, are not captured by the classical methodology even though they clearly are more indicative of the value of the new policy. Here, we formalize a framework to define and estimate long-term causal effects of policy changes in multiagent economies. Central to our approach is behavioral game theory, which we leverage to formulate the ignorability assumptions that are necessary for causal inference. Under such assumptions we estimate long-term causal effects through a latent space approach, where a behavioral model of how agents act conditional on their latent behaviors is combined with a temporal model of how behaviors evolve over time.

1 Introduction

A multiagent economy is comprised of agents interacting under specific economic rules. A common problem of interest is to experimentally evaluate changes to such rules, also known as treatments, on an objective of interest. For example, an online ad auction platform is a multiagent economy, where one problem is to estimate the effect of raising the reserve price on the platform's revenue.

Assessing causality of such effects is a challenging problem because there is a conceptual discrepancy between what needs to be estimated and what is available in the data, as illustrated in Figure 1. What needs to be estimated is the causal effect of a policy change, which is defined as the difference between the objective value when the economy is treated, i.e., when all agents interact under the new rules, relative to when the same economy is in control, i.e., when all agents interact under the baseline rules. This definition of causal effects is logically necessitated by the designer's task, which is to select either the treatment or the control policy based on their estimated revenues, and then apply that policy to all agents in the economy. The long-term causal effect is the causal effect defined after the system has stabilized, and is more representative of the value of policy changes in dynamical systems. Thus, in Figure 1 the long-term causal effect is the difference between the objective values at the top and bottom endpoints, marked as the "targets of inference". What is available in the experimental data, however, typically comes from designs such as the so-called A/B test, where we randomly assign some agents to the treated economy (new rules B) and the others to the control economy (baseline rules A), and then compare the outcomes.
In Figure 1 the experimental data are depicted as the solid time series in the middle of the plot, marked as the "observed data". Therefore, the challenge in estimating long-term causal effects is that we generally need to perform two inferential tasks simultaneously, namely, (i) infer outcomes across possible experimental assignments (y-axis in Figure 1), and (ii) infer long-term outcomes from short-term experimental data (x-axis in Figure 1).

[Figure 1: The two inferential tasks for causal inference in multiagent economies. First, infer agent actions across treatment assignments (y-axis), particularly the assignment where all agents are in the treated economy (top assignment, $Z = \mathbf{1}$) and the assignment where all agents are in the control economy (bottom assignment, $Z = \mathbf{0}$). Second, infer across time, from $t_0$ (last observation time) to the long-term $T$. What we seek in order to evaluate the causal effect of the new treatment is the difference between the objectives (e.g., revenue) at the two inferential target endpoints.]

The first task is commonly known as the "fundamental problem of causal inference" (Holland, 1986; Rubin, 2011) because it underscores the impossibility of observing in the same experiment the outcomes for both policy assignments that define the causal effect; i.e., we cannot observe in the same experiment both the outcomes when all agents are treated and the outcomes when all agents are in control, the assignments denoted by $Z = \mathbf{1}$ and $Z = \mathbf{0}$, respectively, in Figure 1. In fact, the role of experimental design, as conceived by Fisher (1935), is exactly to quantify the uncertainty about such causal effects, which cannot be observed due to the aforementioned fundamental problem, by using standard errors that can be observed in a carefully designed experiment.

The second task, however, is unique to causal inference in dynamical systems, such as the multiagent economies we study in this paper, and has received limited attention so far. Here, we argue that it is crucial to study long-term causal effects, i.e., effects measured after the system has stabilized, because such effects are more representative of the value of policy changes. If our analysis focused only on the observed-data part depicted in Figure 1, then policy evaluation would reflect transient effects that might differ substantially from the long-term effects. For instance, raising the reserve price in an auction might increase revenue in the short term, but as agents adapt their bids, or switch to another platform altogether, the long-term effect could be a net decrease in revenue (Holland and Miller, 1991).

1.1 Related work and our contributions

There have been several important projects related to causal inference in multiagent economies. For instance, Ostrovsky and Schwarz (2011) evaluated the effects of an increase in the reserve price of Yahoo! ad auctions on revenue. Auctions were randomly assigned to an increased reserve price treatment, and the effect was estimated using difference-in-differences (DID), which is a popular econometric method (Card and Krueger, 1994; Donald and Lang, 2007; Ostrovsky and Schwarz, 2011). The DID method compares the difference in outcomes before and after the intervention for both the treated and control units, the ad auctions in this experiment, and then compares the two differences.
In relation to Figure 1, DID extrapolates across assignments (y-axis) and across time (x-axis) by making a strong additivity assumption (Abadie, 2005; Angrist and Pischke, 2008, Section 5.2), specifically, by assuming that the dependence of revenue on reserve price and time is additive.

In a structural approach, Athey et al. (2011) studied the effects of auction format (ascending versus sealed bid) on competition for timber tracts. Their approach was to estimate agent valuations from observed data (agent bids) in one auction format and then impute counterfactual bid distributions in the other auction format, under the assumption of equilibrium play in the observed data. In relation to Figure 1, their approach extrapolates across assignments by assuming that agents' individual valuations for tracts are independent of the treatment assignment, and extrapolates across time by assuming that the observed agent bids are already in equilibrium. Similar approaches are followed in econometrics for the estimation of general equilibrium effects (Heckman et al., 1998; Heckman and Vytlacil, 2005).

In a causal graph approach (Pearl, 2000), Bottou et al. (2013) studied effects of changes in the algorithm that scores Bing ads on the ad platform's revenue. Their approach was to create a directed acyclic graph (DAG) among related variables, such as queries, bids, and prices. Through a "Causal Markov" assumption they could predict counterfactuals for revenue, using only data from the control economy (observational study). In relation to Figure 1, their approach is non-experimental and extrapolates across assignments and across time by assuming a DAG as the correct data model, which is also assumed to be stable with respect to treatment assignment, and by estimating counterfactuals through the fitted model.

Our work differs from prior work in that it takes into account the short-term aspect of experimental data to evaluate long-term causal effects, which is the key conceptual and practical challenge that arises in empirical applications. In contrast, classical econometric methods, such as DID, assume strong linear trends from short-term to long-term, whereas structural approaches typically assume that the experimental data are already long-term, as they are observed in equilibrium. We refer the reader to Sections 2 and 3 of the supplement for more detailed comparisons.

In summary, our key contribution is that we develop a formal framework that (i) articulates the distinction between short-term and long-term causal effects, (ii) leverages behavioral game-theoretic models for causal analysis of multiagent economies, and (iii) explicates theory that enables valid inference of long-term causal effects.

2 Definitions

Consider a set of agents $\mathcal{I}$ and a set of actions $\mathcal{A}$, indexed by $i$ and $a$, respectively. The experiment designer wants to run an experiment to evaluate a new policy against the baseline policy relative to an objective. In the experiment, each agent is assigned to one policy, and the experimenter observes how agents act over time.
Formally, let $Z = (Z_i)$ be the $|\mathcal{I}| \times 1$ assignment vector, where $Z_i = 1$ denotes that agent $i$ is assigned to the new policy and $Z_i = 0$ denotes that $i$ is assigned to the baseline policy; as a shorthand, $Z = \mathbf{1}$ denotes that all agents are assigned to the new policy, and $Z = \mathbf{0}$ denotes that all agents are assigned to the baseline policy, where $\mathbf{1}, \mathbf{0}$ generally denote appropriately-sized vectors of ones and zeros, respectively. In the simplest case, the experiment is an A/B test, where $Z$ is uniformly random on $\{0, 1\}^{|\mathcal{I}|}$ subject to $\sum_i Z_i = |\mathcal{I}|/2$.

After the initial assignment $Z$, agents play actions at discrete time points from $t = 0$ to $t = t_0$. Let $A_i(t; Z) \in \mathcal{A}$ be the random variable that denotes the action of agent $i$ at time $t$ under assignment $Z$. The population action $\alpha_j(t; Z) \in \Delta^{|\mathcal{A}|}$, where $\Delta^p$ denotes the $p$-dimensional simplex, is the frequency of actions at time $t$ under assignment $Z$ among agents assigned to game $j$; for example, assuming two actions $\mathcal{A} = \{a_1, a_2\}$, then $\alpha_1(0; Z) = [0.2, 0.8]$ denotes that, under assignment $Z$, 20% of agents assigned to the new policy play action $a_1$ at $t = 0$, while the rest play $a_2$. We assume that the objective value for the experimenter depends on the population action, in a similar way that, say, auction revenue depends on agents' aggregate bidding. The objective value in policy $j$ at time $t$ under assignment $Z$ is denoted by $R(\alpha_j(t; Z))$, where $R: \Delta^{|\mathcal{A}|} \to \mathbb{R}$. For instance, suppose in the previous example that $a_1$ and $a_2$ produce revenue \$10 and $-\$2$, respectively, each time they are played; then $R$ is linear and $R([0.2, 0.8]) = 0.2 \cdot \$10 - 0.8 \cdot \$2 = \$0.4$.

Definition 1. The average causal effect on objective $R$ at time $t$ of the new policy relative to the baseline is denoted by $\mathrm{CE}(t)$ and is defined as
$$\mathrm{CE}(t) = E\left(R(\alpha_1(t; \mathbf{1})) - R(\alpha_0(t; \mathbf{0}))\right). \quad (1)$$

Suppose that $(t_0, T]$ is the time interval required for the economy to adapt to the experimental conditions. The exact definition of $T$ is important, but we defer this discussion to Section 3.1. The designer concludes that the new policy is better than the baseline if $\mathrm{CE}(T) > 0$. Thus, $\mathrm{CE}(T)$ is the long-term average causal effect and is a function of two objective values, $R(\alpha_1(T; \mathbf{1}))$ and $R(\alpha_0(T; \mathbf{0}))$, which correspond to the two inferential target endpoints in Figure 1. Neither value is observed in the experiment because agents are randomly split between policies, and their actions are observed only for the short-term period $[0, t_0]$. Thus we need to (i) extrapolate across assignments by pivoting from the observed assignment to the counterfactuals $Z = \mathbf{1}$ and $Z = \mathbf{0}$; and (ii) extrapolate across time, from the short-term data in $[0, t_0]$ to the long-term $t = T$. We perform these two extrapolations based on a latent space approach, which is described next.
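Since the objective $R$ and the contrast in Eq. (1) drive everything that follows, here is a minimal Python sketch (ours, not from the paper) of the linear-revenue example above; the long-term population actions below are hypothetical placeholders, since in practice they must be inferred.

```python
import numpy as np

# Linear objective R(alpha) = c . alpha, as in the two-action example:
# action a1 yields $10 and a2 yields -$2 each time it is played.
c = np.array([10.0, -2.0])

def R(alpha, c):
    """Objective value for a population action alpha on the simplex."""
    assert np.isclose(alpha.sum(), 1.0), "population action must lie on the simplex"
    return float(c @ alpha)

# Hypothetical long-term population actions under Z = 1 (all treated)
# and Z = 0 (all control); these are exactly the unobserved quantities
# that the paper's method must infer.
alpha_1_T = np.array([0.2, 0.8])
alpha_0_T = np.array([0.5, 0.5])

CE_T = R(alpha_1_T, c) - R(alpha_0_T, c)   # sample of the contrast in Eq. (1)
print(CE_T)  # 0.4 - 4.0 = -3.6
```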
2.1 Behavioral and temporal models

We assume a latent behavioral model of how agents select actions, inspired by models from behavioral game theory. The behavioral model is used to predict agent actions conditional on agent behaviors, and is combined with a temporal model to predict behaviors in the long term. The two models are ultimately used to estimate agent actions in the long term, and thus to estimate long-term causal effects. As the choice of the latent space is not unique, in Section 3.1 we discuss why we chose to use behavioral models from game theory.

Let $B_i(t; Z)$ denote the behavior that agent $i$ adopts at time $t$ under experimental assignment $Z$. The following assumption puts a constraint on the space of possible behaviors that agents can adopt, which will simplify the subsequent analysis.

Assumption 1 (Finite set of possible behaviors). There is a fixed and finite set of behaviors $\mathcal{B}$ such that for every time $t$, assignment $Z$, and agent $i$, it holds that $B_i(t; Z) \in \mathcal{B}$; i.e., every agent can only adopt a behavior from $\mathcal{B}$.

The set of possible behaviors $\mathcal{B}$ essentially defines a $|\mathcal{B}| \times |\mathcal{A}|$ collection of probabilities that is sufficient to compute the likelihood of actions played conditional on adopted behavior; we refer to this collection as the behavioral model.

Definition 2 (Behavioral model). The behavioral model for policy $j$ defined by a set $\mathcal{B}$ of behaviors is the collection of probabilities $P(A_i(t; Z) = a \mid B_i(t; Z) = b, G_j)$, for every action $a \in \mathcal{A}$ and every behavior $b \in \mathcal{B}$, where $G_j$ denotes the characteristics of policy $j$.

As an example, a non-sophisticated behavior $b_0$ could imply that $P(A_i(t; Z) = a \mid b_0, G_j) = 1/|\mathcal{A}|$, i.e., that an agent adopting $b_0$ simply plays actions at random. Conditioning on policy $j$ in Definition 2 allows an agent to choose its actions based on expected payoffs, which depend on the policy characteristics. For instance, in the application of Section 4 we consider a behavioral model where an agent picks actions in a two-person game according to expected payoffs calculated from the game-specific payoff matrix; in that case $G_j$ is simply the payoff matrix of game $j$.

The population behavior $\beta_j(t; Z) \in \Delta^{|\mathcal{B}|}$ denotes the frequency at time $t$ under assignment $Z$ of the behaviors adopted by agents assigned to policy $j$. Let $\mathcal{F}_t$ denote the entire history of population behaviors in the experiment up to time $t$. A temporal model of behaviors is defined as follows.

Definition 3 (Temporal model). For an experimental assignment $Z$, a temporal model for policy $j$ is a collection of parameters $\phi_j(Z), \psi_j(Z)$ and densities $(\pi, f)$ such that, for all $t$,
$$\beta_j(0; Z) \sim \pi(\cdot\,; \phi_j(Z)), \qquad \beta_j(t; Z) \mid \mathcal{F}_{t-1}, G_j \sim f(\cdot \mid \psi_j(Z), \mathcal{F}_{t-1}). \quad (2)$$

A temporal model defines the distribution of population behavior as a time series with a Markovian structure, subject to $\pi$ and $f$ being stable with respect to $Z$. In other words, regardless of how agents are assigned to games, the population behavior in a game will evolve according to a fixed model described by $f$ and $\pi$. The model parameters $\phi, \psi$ may still depend on the treatment assignment $Z$.
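To illustrate the interface between Definitions 2 and 3, the following sketch (ours, with made-up numbers) encodes a behavioral model as a $|\mathcal{B}| \times |\mathcal{A}|$ row-stochastic matrix and a toy Markovian temporal model on the behavior simplex; the specific transition rule is hypothetical and only fixes ideas.

```python
import numpy as np

# Behavioral model (Definition 2): row b gives P(action = a | behavior b, G_j).
# Made-up numbers for |B| = 2 behaviors and |A| = 3 actions.
behavioral_model = np.array([
    [1/3, 1/3, 1/3],   # b0: plays uniformly at random
    [0.7, 0.2, 0.1],   # b1: favors the first action
])

def expected_population_action(beta):
    """Map a population behavior beta in Delta^|B| to the implied
    expected population action in Delta^|A|."""
    return beta @ behavioral_model

# Temporal model (Definition 3): a toy Markovian update on the simplex
# that mixes toward a fixed point; psi plays the role of the parameter.
def temporal_step(beta, psi=0.8, target=np.array([0.3, 0.7])):
    return psi * beta + (1 - psi) * target

beta = np.array([0.9, 0.1])          # beta_j(0; Z), drawn from pi in the paper
for t in range(5):                    # evolve population behavior over time
    beta = temporal_step(beta)
print(beta, expected_population_action(beta))
```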
3 Estimation of long-term causal effects

Here we develop the assumptions that are necessary for inference of long-term causal effects.

Assumption 2 (Stability of initial behaviors). Let $\rho_Z = \sum_{i \in \mathcal{I}} Z_i / |\mathcal{I}|$ be the proportion of agents assigned to the new policy under assignment $Z$. Then, for every possible $Z$,
$$\rho_Z\, \beta_1(0; Z) + (1 - \rho_Z)\, \beta_0(0; Z) = \beta(0), \quad (3)$$
where $\beta(0)$ is a fixed population behavior invariant to $Z$.

Assumption 3 (Behavioral ignorability). The assignment is independent of population behavior at time $t$, conditional on policy and behavioral history up to $t$; i.e., for every $t > 0$ and policy $j$,
$$Z \perp \beta_j(t; Z) \mid \mathcal{F}_{t-1}, G_j.$$

Remarks. Assumption 2 implies that the agents do not anticipate the assignment $Z$, as they have "made up their minds" to adopt a population behavior $\beta(0)$ before the experiment. It follows that the population behavior $\beta_1(t; Z)$ marginally corresponds to $\rho_Z |\mathcal{I}|$ draws from $|\mathcal{B}|$ bins of total size $|\mathcal{I}|\, \beta(0)$. The bin selection probabilities at every draw depend on the experimental design; for instance, in an A/B experiment where $\rho_Z = 0.5$, the population behavior at $t = 0$ can be sampled uniformly such that $\beta_1(0; Z) + \beta_0(0; Z) = 2\beta(0)$. Quantities such as that in Eq. (3) are crucial in causal inference because they can be used as a pivot for extrapolation across assignments. Assumption 3 states that the treatment assignment does not add information about the population behavior at time $t$ if we already know the full behavioral history up to $t$ and the policy to which the agents are assigned; hence, the treatment assignment is conditionally ignorable. This ignorability assumption precludes, for instance, an agent adopting a different behavior depending on whether it was assigned with friends or foes in the experiment.

Algorithm 1 is the main methodological contribution of this paper. It is a Bayesian procedure, as it puts priors on the parameters $\phi, \psi$ of the temporal model and then marginalizes these parameters out.

Algorithm 1: Estimation of long-term causal effects.
Input: $Z$, $T$, $\mathcal{A}$, $\mathcal{B}$, $G_1$, $G_0$, $\mathcal{D}_1 = \{\alpha_1(t; Z) : t = 0, \ldots, t_0\}$, $\mathcal{D}_0 = \{\alpha_0(t; Z) : t = 0, \ldots, t_0\}$.
Output: Estimate of the long-term causal effect $\mathrm{CE}(T)$ in Eq. (1).
1: By Assumption 3, define $\phi_j \equiv \phi_j(Z)$ and $\psi_j \equiv \psi_j(Z)$.
2: Set $\mu_1 \leftarrow 0$, $\mu_0 \leftarrow 0$, $\nu_1 \leftarrow 0$, $\nu_0 \leftarrow 0$.
3: for iter = 1, 2, ... do
4:   For $j = 0, 1$: sample $\phi_j, \psi_j$ from the prior, and sample $\beta_j(0; Z)$ conditional on $\phi_j$.
5:   Calculate $\beta(0) = \rho_Z\, \beta_1(0; Z) + (1 - \rho_Z)\, \beta_0(0; Z)$.
6:   for $j = 0, 1$ do
7:     Set $\beta_j(0; j\mathbf{1}) = \beta(0)$.
8:     Sample $B_j = \{\beta_j(t; j\mathbf{1}) : t = 0, \ldots, T\}$ given $\psi_j$ and $\beta_j(0; j\mathbf{1})$.   # temporal model
9:     Sample $\alpha_j(T; j\mathbf{1})$ conditional on $\beta_j(T; j\mathbf{1})$.   # behavioral model
10:    Set $\mu_j \leftarrow \mu_j + P(\mathcal{D}_j \mid B_j, G_j) \cdot R(\alpha_j(T; j\mathbf{1}))$.
11:    Set $\nu_j \leftarrow \nu_j + P(\mathcal{D}_j \mid B_j, G_j)$.
12:  end for
13: end for
14: Return the estimate $\widehat{\mathrm{CE}}(T) = \mu_1/\nu_1 - \mu_0/\nu_0$.

Theorem 1 (Estimation of long-term causal effects). Suppose that behaviors evolve according to a known temporal model, and actions are distributed conditionally on behaviors according to a known behavioral model. Suppose that Assumptions 1, 2, and 3 hold for these models. Then, for every policy $j \in \{0, 1\}$, as the iterations of Algorithm 1 increase,
$$\mu_j/\nu_j \to E\left(R(\alpha_j(T; j\mathbf{1})) \mid \mathcal{D}_j\right).$$
The output $\widehat{\mathrm{CE}}(T)$ of Algorithm 1 asymptotically estimates the long-term causal effect, i.e.,
$$E(\widehat{\mathrm{CE}}(T)) = E\left(R(\alpha_1(T; \mathbf{1})) - R(\alpha_0(T; \mathbf{0}))\right) \equiv \mathrm{CE}(T).$$
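To make the structure of Algorithm 1 concrete, here is a schematic, self-contained Python sketch of its self-normalized importance-sampling loop (Steps 3-14). The prior, temporal model, behavioral model, and objective below are toy stand-ins, not the paper's models (the application in Section 4 uses a VAR(1) temporal model and a quantal-response behavioral model), and, as in Section 4.1, the two games share parameters.

```python
import numpy as np

rng = np.random.default_rng(1)

# --- Toy stand-ins (hypothetical; swap in the models of Section 4) ---
def sample_prior():
    """Sample (phi, psi): phi parametrizes the initial-behavior Dirichlet,
    psi the mixing rate of the toy temporal model."""
    return rng.uniform(0.5, 5.0, size=2), rng.uniform(0.1, 0.9)

def temporal_history(beta0, psi, T, target=np.array([0.4, 0.6])):
    """Sample {beta(t): t = 0..T} under a toy Markovian temporal model."""
    history, beta = [beta0], beta0
    for _ in range(T):
        mean = psi * beta + (1 - psi) * target
        beta = rng.dirichlet(50 * mean)     # noisy step on the simplex
        history.append(beta)
    return history

Q = {0: np.array([[1/3, 1/3, 1/3], [0.7, 0.2, 0.1]]),   # behavioral model, game 0
     1: np.array([[1/3, 1/3, 1/3], [0.1, 0.2, 0.7]])}   # behavioral model, game 1

def sample_action(beta, j, n_agents=20):
    """Sample a population action alpha given population behavior beta."""
    p = beta @ Q[j]
    return rng.multinomial(n_agents, p) / n_agents

def log_lik(data, history, j, n_agents=20):
    """log P(D_j | B_j, G_j), up to a constant in the data: multinomial
    likelihood of the observed short-term actions given the sampled history."""
    ll = 0.0
    for alpha, beta in zip(data, history):   # compares the first t0 + 1 points
        p = beta @ Q[j]
        counts = np.round(n_agents * alpha).astype(int)
        ll += counts @ np.log(p)
    return ll

def R(alpha, c=np.array([1.0, 2.0, 0.5])):
    return float(c @ alpha)

# --- Algorithm 1 ---
def estimate_CE(D, rho_Z, T, iters=2000):
    mu = {0: 0.0, 1: 0.0}
    nu = {0: 0.0, 1: 0.0}
    for _ in range(iters):
        phi, psi = sample_prior()                                # Step 4
        beta_obs = {j: rng.dirichlet(phi) for j in (0, 1)}
        beta0 = rho_Z * beta_obs[1] + (1 - rho_Z) * beta_obs[0]  # Step 5 (pivot)
        for j in (0, 1):
            B = temporal_history(beta0, psi, T)                  # Steps 7-8
            alpha_T = sample_action(B[-1], j)                    # Step 9
            w = np.exp(log_lik(D[j], B, j))                      # P(D_j | B_j, G_j)
            mu[j] += w * R(alpha_T)                              # Step 10
            nu[j] += w                                           # Step 11
    return mu[1] / nu[1] - mu[0] / nu[0]                         # Step 14

# Fake observed data for t = 0..t0 with t0 = 2 (purely illustrative):
D = {j: [sample_action(np.array([0.5, 0.5]), j) for _ in range(3)] for j in (0, 1)}
print(estimate_CE(D, rho_Z=0.5, T=5))
```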
Remarks. Theorem 1 shows that $\widehat{\mathrm{CE}}(T)$ consistently estimates the long-term causal effect in Eq. (1). We note that it is also possible to derive the variance of this estimator with respect to the randomization distribution of the assignment $Z$. To do so, we first create a set of assignments $\mathcal{Z}$ by repeatedly sampling $Z$ according to the experimental design. Then we adapt Algorithm 1 so that (i) Step 4 is removed, and (ii) in Step 5, $\beta(0)$ is sampled from its posterior distribution conditional on the observed data, which can be obtained from the original Algorithm 1. The empirical variance of the outputs of the adapted algorithm over $\mathcal{Z}$ estimates the variance of the output $\widehat{\mathrm{CE}}(T)$ of the original algorithm. We leave the full characterization of this variance estimation procedure for future work.

As Theorem 1 relies on Assumptions 2 and 3, it is worth noting that these assumptions may be hard, but not impossible, to test in practice. For example, one idea for testing Assumption 3 is to use data from multiple experiments on a single game $j$. If fitting the temporal model (2) on such data yields parameter estimates $(\phi_j(Z), \psi_j(Z))$ that depend on the experimental assignment $Z$, then Assumption 3 would be unjustified. A similar test could be used for Assumption 2 as well.

3.1 Discussion

Methodologically, our approach is aligned with the idea that for long-term causal effects we need a model for outcomes that leverages structural information pertaining to how outcomes are generated and how they evolve. In our application, such structural information is the microeconomic information that dictates which agent behaviors are successful in a given policy and how these behaviors evolve over time. In particular, Step 1 of the algorithm relies on Assumptions 2 and 3 to infer that the model parameters $\phi_j, \psi_j$ are stable with respect to treatment assignment. Step 5 of the algorithm is the key estimation pivot, which uses Assumption 2 to extrapolate from the experimental assignment $Z$ to the counterfactual assignments $Z = \mathbf{1}$ and $Z = \mathbf{0}$, as required in our problem. Having pivoted to such a counterfactual assignment, it is then possible to use the temporal model parameters $\psi_j$, which are unaffected by the pivot under Assumption 3, to sample population behaviors up to the long-term $T$, and subsequently sample agent actions at $T$ (Steps 8 and 9).

Thus, a lot of burden is placed on the behavioral game-theoretic model to predict agent actions, and the accuracy of such models is still not settled (Hahn et al., 2015). However, it does not seem necessary that such prediction be completely accurate, but rather that the behavioral models can pull relevant information from the data that would otherwise be inaccessible without game theory, thereby improving over classical methods. A formal assessment of such improvement, e.g., using information theory, is open for future work. An empirical assessment can be supported by the extensive literature in behavioral game theory (Stahl and Wilson, 1994; McKelvey and Palfrey, 1995), which has been successful in predicting human actions in real-world experiments (Wright and Leyton-Brown, 2010).

Another limitation of our approach is Assumption 1, which posits that there is a finite set of predefined behaviors. A nonparametric approach where behaviors are estimated on the fly might do better. In addition, the long-term horizon $T$ also needs to be defined a priori. We should be careful how $T$ interacts with the temporal model, since such a model implies a time $T_0$ at which the population behavior reaches stationarity. Thus, if $T_0 \le T$ we implicitly assume that the long-term causal effect of interest pertains to a stationary regime (e.g., Nash equilibrium), but if $T_0 > T$ we assume that the effect pertains to a transient regime, and the policy evaluation might therefore be misguided.
4 Application to data from a behavioral experiment

In this section, we apply our methodology to experimental data from Rapoport and Boebel (1992), as reported by McKelvey and Palfrey (1995). The experiment consisted of a series of zero-sum two-agent games, and aimed at examining the hypothesis that human players play according to minimax solutions of the game, the so-called minimax hypothesis initially suggested by von Neumann and Morgenstern (1944). Here we repurpose the data in a slightly artificial way, including how we construct the designer's objective. This enables a suitable demonstration of our approach.

Each game in the experiment was a simultaneous-move game with five discrete actions for the row player and five actions for the column player. The structure of the payoff matrix, given in the supplement in Table 1, is parametrized by two values, namely $W$ and $L$; the experiment used two different versions of payoff matrices, corresponding to payments by the row agent to the column agent when the row agent won ($W$) or lost ($L$): modulo a scaling factor, Rapoport and Boebel (1992) used $(W, L) = (\$10, -\$6)$ for game 0 and $(W, L) = (\$15, -\$1)$ for game 1.

Forty agents, $\mathcal{I} = \{1, 2, \ldots, 40\}$, were randomized to one game design (20 agents per game), and each agent played once as row and once as column, matched against two different agents. Every match-up between a pair of agents lasted for two periods of 60 rounds, with each round consisting of a selection of an action from each agent and a payment. Thus, each agent played for four periods and 240 rounds in total. If $Z$ is the entire assignment vector of length 40, then $Z_i = 1$ means that agent $i$ was assigned to game 1 with payoff matrix $(W, L) = (\$15, -\$1)$, and $Z_i = 0$ means that $i$ was assigned to game 0 with payoff matrix $(W, L) = (\$10, -\$6)$.

In adapting the data, we take advantage of the randomization in the experiment and ask a question in regard to long-term causal effects. In particular, assuming that agents pay a fee for each action taken, which accounts for the revenue of the game, we ask the following question: "What is the long-term causal effect on revenue if we switch from payoffs $(W, L) = (\$10, -\$6)$ of game 0 to payoffs $(W, L) = (\$15, -\$1)$ of game 1?" The games induced by the two aforementioned payoff matrices represent the two different policies we wish to compare. To evaluate our method, we consider the last period as long-term, and hold out the data from this period. We define the causal estimand in Eq. (1) as
$$\mathrm{CE} = c^\top\left(\alpha_1(T; \mathbf{1}) - \alpha_0(T; \mathbf{0})\right), \quad (4)$$
where $T = 3$ and $c$ is a vector of coefficients. The interpretation is that, given an element $c_a$ of $c$, an agent playing action $a$ is assumed to pay a constant fee $c_a$. To check the robustness of our method, we test Algorithm 1 over multiple values of $c$.

4.1 Implementation of Algorithm 1 and results

Here we demonstrate how Algorithm 1 can be applied to estimate the long-term causal effect in Eq. (4) on the Rapoport and Boebel dataset. To this end, we clarify Algorithm 1 step by step, and give more details in the supplement.
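Before walking through the steps, here is a short sketch (ours) that builds the two $5 \times 5$ payoff matrices from the win/loss pattern of Table 1 in the supplement; the 0/1 mask encoding is our own convenience.

```python
import numpy as np

# Win/loss pattern from Table 1 of the supplement: entry 1 means the row
# agent pays W (the row agent won), 0 means the row agent pays L.
WIN = np.array([
    [1, 0, 0, 0, 0],
    [0, 0, 1, 1, 1],
    [0, 1, 0, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 0, 0],
])

def payoff_matrix(W, L):
    """5 x 5 payoff matrix G_j given the two payment values (W, L)."""
    return np.where(WIN == 1, W, L).astype(float)

G0 = payoff_matrix(10.0, -6.0)   # game 0: (W, L) = ($10, -$6)
G1 = payoff_matrix(15.0, -1.0)   # game 1: (W, L) = ($15, -$1)
```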
Step 1: Model parameters. For simplicity we assume that the models in the two games share common parameters, and thus $(\phi_1, \psi_1, \lambda_1) = (\phi_0, \psi_0, \lambda_0) \equiv (\phi, \psi, \lambda)$, where $\lambda$ are the parameters of the behavioral model to be described in Step 9. Having common parameters also acts as regularization and thus helps estimation.

Step 4: Sampling parameters and initial behaviors. As explained later, we assume that there are 3 different behaviors, and thus $\phi, \psi, \lambda$ are vectors with 3 components. Let $x \sim U(m, M)$ denote that every component of $x$ is uniform on $(m, M)$, independently. We choose diffuse priors for our parameters, specifically, $\phi \sim U(0, 10)$, $\psi \sim U(-5, 5)$, and $\lambda \sim U(-10, 10)$. Given $\phi$ we sample the initial behaviors as Dirichlet, i.e., $\beta_1(0; Z) \sim \mathrm{Dir}(\phi)$ and $\beta_0(0; Z) \sim \mathrm{Dir}(\phi)$, independently.

Steps 5 & 7: Pivot to counterfactuals. Since we have a completely randomized experiment (A/B test), it holds that $\rho_Z = 0.5$ and therefore $\beta(0) = 0.5(\beta_1(0; Z) + \beta_0(0; Z))$. Now we can pivot to the counterfactual population behaviors under $Z = \mathbf{1}$ and $Z = \mathbf{0}$ by setting $\beta_1(0; \mathbf{1}) = \beta_0(0; \mathbf{0}) = \beta(0)$.

Step 8: Sample counterfactual behavioral history. As the temporal model, we adopt the lag-one vector autoregressive model, also known as VAR(1). We transform the population behavior into a new variable $w_t = \mathrm{logit}(\beta_1(t; \mathbf{1})) \in \mathbb{R}^2$ (and do the same for $\beta_0(t; \mathbf{0})$), where $y = \mathrm{logit}(x)$ is defined as the map $\Delta^m \to \mathbb{R}^{m-1}$ with $y[i] = \log(x[i+1]/x[1])$, taking $x[1] \neq 0$ without loss of generality. Such a transformation, which has a unique inverse, is necessary because population behaviors are constrained on the simplex and thus form so-called compositional data (Aitchison, 1986; Grunwald et al., 1993). The VAR(1) model implies that
$$w_t = \psi[1]\mathbf{1} + \psi[2]\, w_{t-1} + \psi[3]\, \epsilon_t, \quad (6)$$
where $\psi[k]$ is the $k$th component of $\psi$ and $\epsilon_t \sim N(0, I)$ is i.i.d. standard bivariate normal. Eq. (6) is used to sample the behavioral history $B_j$ in Step 8 of Algorithm 1.
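A small sketch (ours) of the Step 8 machinery, implementing the logit map defined above and the VAR(1) update of Eq. (6); the parameter values are made up.

```python
import numpy as np

rng = np.random.default_rng(2)

def logit(x):
    """Map x in the interior of Delta^m to R^(m-1): y[i] = log(x[i+1]/x[0]),
    with the paper's first component as the reference (0-indexed here)."""
    return np.log(x[1:] / x[0])

def inv_logit(y):
    """Unique inverse map R^(m-1) -> Delta^m."""
    z = np.concatenate(([1.0], np.exp(y)))
    return z / z.sum()

def sample_history(beta0, psi, T):
    """Sample {beta(t): t = 0..T} under the VAR(1) model of Eq. (6)."""
    w = logit(beta0)
    history = [beta0]
    for _ in range(T):
        eps = rng.standard_normal(w.shape)
        w = psi[0] * np.ones_like(w) + psi[1] * w + psi[2] * eps
        history.append(inv_logit(w))
    return history

psi = np.array([0.1, 0.9, 0.3])        # hypothetical draw of the VAR(1) parameters
beta0 = np.array([0.5, 0.3, 0.2])      # pivoted initial behavior, beta(0)
B = sample_history(beta0, psi, T=3)
```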
Step 9: Behavioral model. For the behavioral model, we adopt the quantal $p$-response (QL$_p$) model (Stahl and Wilson, 1994), which has been successful in predicting human actions in real-world experiments (Wright and Leyton-Brown, 2010). We choose $p = 3$ behaviors, namely $\mathcal{B} = \{b_0, b_1, b_2\}$, of increasing sophistication, parametrized by $\lambda = (\lambda[1], \lambda[2], \lambda[3]) \in \mathbb{R}^3$. Let $G_j$ denote the $5 \times 5$ payoff matrix of game $j$, and let the term strategy denote a distribution over all actions. An agent with behavior $b_0$ plays the uniform strategy,
$$P(A_i(t; Z) = a \mid B_i(t; Z) = b_0, G_j) = 1/5.$$
An agent of level 1 (row player) assumes it is playing only against level-0 agents and thus expects per-action profit $u_1 = (1/5)\, G_j \mathbf{1}$ (for the column player we use the transpose of $G_j$). The level-1 agent then plays a strategy proportional to $e^{\lambda[1] u_1}$, where $e^x$ for a vector $x$ denotes element-wise exponentiation, $e^x = (e^{x[k]})$. The precision parameter $\lambda[1]$ determines how much an agent insists on maximizing expected utility; for example, if $\lambda[1] = \infty$ the agent plays the action with maximum expected payoff (best response), while if $\lambda[1] = 0$ the agent acts as a level-0 agent. An agent of level 2 (row player) assumes it is playing only against level-1 agents with precision $\lambda[2]$, and therefore expects to face a strategy proportional to $e^{\lambda[2] u_1}$; thus its expected per-action profit is $u_2 \propto G_j e^{\lambda[2] u_1}$, and it plays a strategy proportional to $e^{\lambda[3] u_2}$.

Given $G_j$ and $\lambda$ we calculate a $5 \times 3$ matrix $Q_j$ whose $k$th column is the strategy played by an agent with behavior $b_{k-1}$. The expected population action is therefore $\bar{\alpha}_j(t; Z) = Q_j \beta_j(t; Z)$. The population action $\alpha_j(t; Z)$ is distributed as a normalized multinomial random variable with expectation $\bar{\alpha}_j(t; Z)$, and so $P(\alpha_j(t; \mathbf{1}) \mid \beta_j(t; \mathbf{1}), G_j) = \mathrm{Multi}(|\mathcal{I}| \cdot \alpha_j(t; \mathbf{1});\, \bar{\alpha}_j(t; \mathbf{1}))$, where $\mathrm{Multi}(n; p)$ is the multinomial density of observations $n = (n_1, \ldots, n_K)$ with probabilities $p = (p_1, \ldots, p_K)$. Hence, the full likelihood for the observed actions in game $j$ in Steps 10 and 11 of Algorithm 1 is given by the product
$$P(\mathcal{D}_j \mid B_j, G_j) = \prod_{t=0}^{T-1} \mathrm{Multi}(|\mathcal{I}| \cdot \alpha_j(t; j\mathbf{1});\, \bar{\alpha}_j(t; j\mathbf{1})).$$
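The following sketch (ours) computes the QL$_3$ strategy matrix $Q_j$ via quantal best-responses and evaluates the multinomial likelihood of Steps 10 and 11. It assumes scipy for the multinomial density, shows only the row-player version, and normalizes the opponent strategy at each level, one common convention for resolving the proportionality in $u_2$.

```python
import numpy as np
from scipy.stats import multinomial

def qbr(u, precision):
    """Quantal best-response: softmax of expected utilities u with the
    given precision (max subtracted for numerical stability)."""
    z = np.exp(precision * (u - u.max()))
    return z / z.sum()

def ql3_matrix(G, lam):
    """5 x 3 matrix Q_j: column k is the strategy of a (row) agent with
    behavior b_k under QL3 with precisions lam = (l1, l2, l3)."""
    n = G.shape[0]
    s0 = np.full(n, 1.0 / n)                # level-0: uniform strategy
    u1 = G @ s0                             # expected profit vs level-0
    s1 = qbr(u1, lam[0])                    # level-1 strategy
    u2 = G @ qbr(u1, lam[1])                # level-2 faces a level-1 opponent
    s2 = qbr(u2, lam[2])                    # level-2 strategy
    return np.column_stack([s0, s1, s2])

def log_lik(alphas, betas, G, lam, n_agents=20):
    """log P(D_j | B_j, G_j) = sum_t log Multi(n * alpha_t; alpha_bar_t)."""
    Q = ql3_matrix(G, lam)
    ll = 0.0
    for alpha, beta in zip(alphas, betas):
        alpha_bar = Q @ beta                # expected population action
        counts = np.round(n_agents * alpha).astype(int)
        ll += multinomial.logpmf(counts, n=counts.sum(), p=alpha_bar)
    return ll

# Example usage with a made-up 5 x 5 payoff matrix:
rng = np.random.default_rng(5)
G = np.where(rng.random((5, 5)) < 0.4, 15.0, -1.0)
Q = ql3_matrix(G, lam=np.array([0.5, 0.5, 0.5]))
print(Q.sum(axis=0))   # each column is a strategy summing to 1
```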
Running Algorithm 1 on the Rapoport and Boebel dataset yields the estimates shown in Figure 2, for 25 different fee vectors $c$, where each component $c_a$ is sampled uniformly at random from $(0, 1)$. We also test difference-in-differences (DID), which estimates the causal effect as
$$\hat{\tau}_{\mathrm{did}} = [R(\alpha_1(2; Z)) - R(\alpha_1(0; Z))] - [R(\alpha_0(2; Z)) - R(\alpha_0(0; Z))],$$
and a naive method ("naive" in the plot), which ignores the dynamical aspect and estimates the long-term causal effect as $\hat{\tau}_{\mathrm{nai}} = R(\alpha_1(2; Z)) - R(\alpha_0(2; Z))$. Our estimates ("LACE" in the plot) are closer to the truth (mse = 0.045) than the estimates from the naive method (mse = 0.185) and from DID (mse = 0.361). This illustrates that our method can pull game-theoretic information from the data for long-term causal inference, whereas the other methods cannot.

[Figure 2: Estimates of long-term effects from different methods, corresponding to 25 random objective coefficients $c$ in Eq. (4). For the estimates of our method we ran Algorithm 1 for 100 iterations.]

5 Conclusion

One critical shortcoming of statistical methods of causal inference is that they typically do not assess the long-term effect of policy changes. Here we combined causal inference and game theory to build a framework for the estimation of such long-term effects in multiagent economies. Central to our approach is behavioral game theory, which provides a natural latent space model of how agents act and how their actions evolve over time. Such models enable predictions of how agents would act under various policy assignments and at various time points, which is key for valid causal inference. Working with data from an actual behavioral experiment, we showed how our framework can be applied to estimate the long-term effect of changing the payoff structure of a normal-form game.

Our framework could be extended in future work by incorporating learning (e.g., fictitious play, bandits, no-regret learning) to better model the dynamic response of multiagent systems to policy changes. Another interesting extension would be to use our framework for the optimal design of experiments in such systems, which needs to account for heterogeneity in agent learning capabilities and for the intrinsic dynamical properties of the systems' responses to experimental treatments.

Acknowledgements

The authors wish to thank Leon Bottou, the organizers and participants of CODE@MIT'15, GAMES'16, the Workshop on Algorithmic Game Theory and Data Science (EC'15), and the anonymous NIPS reviewers for their valuable feedback. Panos Toulis has been supported in part by the 2012 Google US/Canada Fellowship in Statistics. David C. Parkes was supported in part by NSF grant CCF-1301976 and the SEAS TomKat fund.

References

Abadie, A. (2005). Semiparametric difference-in-differences estimators. The Review of Economic Studies, 72(1), 1-19.

Aitchison, J. (1986). The Statistical Analysis of Compositional Data. Springer.

Angrist, J. D. and Pischke, J.-S. (2008). Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press.

Athey, S., Levin, J., and Seira, E. (2011). Comparing open and sealed bid auctions: Evidence from timber auctions. The Quarterly Journal of Economics, 126(1), 207-257.

Bottou, L., Peters, J., Quiñonero-Candela, J., Charles, D. X., Chickering, D. M., Portugaly, E., Ray, D., Simard, P., and Snelson, E. (2013). Counterfactual reasoning and learning systems. Journal of Machine Learning Research, 14, 3207-3260.

Brodersen, K. H., Gallusser, F., Koehler, J., Remy, N., and Scott, S. L. (2014). Inferring causal impact using Bayesian structural time-series models. Annals of Applied Statistics.

Card, D. and Krueger, A. B. (1994). Minimum wages and employment: A case study of the fast food industry in New Jersey and Pennsylvania. American Economic Review, 84(4), 772-793.

Dash, D. (2005). Restructuring dynamic causal systems in equilibrium. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics (AISTATS 2005), pages 81-88.

Dash, D. and Druzdzel, M. (2001). Caveats for causal reasoning with equilibrium models. In Symbolic and Quantitative Approaches to Reasoning with Uncertainty, pages 192-203. Springer.

Donald, S. G. and Lang, K. (2007). Inference with difference-in-differences and other panel data. The Review of Economics and Statistics, 89(2), 221-233.

Fisher, R. A. (1935). The Design of Experiments. Oliver & Boyd.

Granger, C. W. (1988). Some recent developments in a concept of causality. Journal of Econometrics, 39(1), 199-211.

Grunwald, G. K., Raftery, A. E., and Guttorp, P. (1993). Time series of continuous proportions. Journal of the Royal Statistical Society, Series B (Methodological), pages 103-116.

Hahn, P. R., Goswami, I., and Mela, C. F. (2015). A Bayesian hierarchical model for inferring player strategy types in a number guessing game. The Annals of Applied Statistics, 9(3), 1459-1483.

Heckman, J. J. and Vytlacil, E. (2005). Structural equations, treatment effects, and econometric policy evaluation. Econometrica, 73(3), 669-738.

Heckman, J. J., Lochner, L., and Taber, C. (1998). General equilibrium treatment effects: A study of tuition policy. American Economic Review, 88(2), 381-386.

Holland, J. H. and Miller, J. H. (1991). Artificial adaptive agents in economic theory. The American Economic Review, pages 365-370.

Holland, P. W. (1986). Statistics and causal inference. Journal of the American Statistical Association, 81(396), 945-960.

McKelvey, R. D. and Palfrey, T. R. (1995). Quantal response equilibria for normal form games. Games and Economic Behavior, 10(1), 6-38.

Ostrovsky, M. and Schwarz, M. (2011). Reserve prices in internet advertising auctions: A field experiment. In Proceedings of the 12th ACM Conference on Electronic Commerce, pages 59-60. ACM.
Pearl, J. (2000). Causality: Models, Reasoning and Inference. Cambridge University Press.

Rapoport, A. and Boebel, R. B. (1992). Mixed strategies in strictly competitive games: A further test of the minimax hypothesis. Games and Economic Behavior, 4(2), 261-283.

Rubin, D. B. (2011). Causal inference using potential outcomes. Journal of the American Statistical Association.

Stahl, D. O. and Wilson, P. W. (1994). Experimental evidence on players' models of other players. Journal of Economic Behavior & Organization, 25(3), 309-327.

von Neumann, J. and Morgenstern, O. (1944). Theory of Games and Economic Behavior. Princeton University Press.

Wright, J. R. and Leyton-Brown, K. (2010). Beyond equilibrium: Predicting human behavior in normal-form games. In Proceedings of the 24th AAAI Conference on Artificial Intelligence.

A Proof of Theorem 1

Theorem 1 (Estimation of long-term causal effects). Suppose that behaviors evolve according to a known temporal model, and actions are distributed conditionally on behaviors according to a known behavioral model. Suppose that Assumptions 1, 2, and 3 hold for these models. Then, for every policy $j \in \{0, 1\}$, as the iterations of Algorithm 1 increase,
$$\mu_j/\nu_j \to E\left(R(\alpha_j(T; j\mathbf{1})) \mid \mathcal{D}_j\right).$$
The output $\widehat{\mathrm{CE}}(T)$ of Algorithm 1 asymptotically estimates the long-term causal effect, i.e.,
$$E(\widehat{\mathrm{CE}}(T)) = E\left(R(\alpha_1(T; \mathbf{1})) - R(\alpha_0(T; \mathbf{0}))\right) \equiv \mathrm{CE}(T).$$

Proof. Fix a policy $j$ in Algorithm 1 and drop the subscript $j$ in the notation of the algorithm, writing
$$\omega \equiv (\phi_j, \psi_j, B_j), \qquad \alpha \equiv \alpha_j(T; j\mathbf{1}), \qquad P(\mathcal{D} \mid \omega) \equiv P(\mathcal{D}_j \mid B_j, G_j).$$
The way Algorithm 1 is defined, as the iterations increase the variable $\mu$ estimates
$$\lim \mu = \int R(\alpha)\, P(\mathcal{D} \mid \omega)\, p(\alpha, \omega)\, d\omega\, d\alpha.$$
We now rewrite this integral as follows:
$$\lim \mu = \int R(\alpha)\, P(\mathcal{D} \mid \omega)\, p(\alpha, \omega)\, d\omega\, d\alpha$$
$$= \int R(\alpha)\, P(\mathcal{D} \mid \alpha, \omega)\, p(\alpha, \omega)\, d\omega\, d\alpha \qquad [\,P(\mathcal{D} \mid \alpha, \omega) = P(\mathcal{D} \mid \omega)\,]$$
$$= \int R(\alpha)\, P(\alpha, \omega \mid \mathcal{D})\, P(\mathcal{D})\, d\omega\, d\alpha \qquad [\text{by Bayes' theorem}]$$
$$= P(\mathcal{D}) \int R(\alpha)\, P(\alpha \mid \mathcal{D})\, d\alpha \qquad [\,\omega \text{ is marginalized out}\,]$$
$$= P(\mathcal{D})\, E(R(\alpha) \mid \mathcal{D}).$$
The first equality used, $P(\mathcal{D} \mid \alpha, \omega) = P(\mathcal{D} \mid \omega)$, holds by definition of the behavioral model: the history of latent behaviors is sufficient for the likelihood of observed actions. Another way to phrase this is that, conditional on the latent behaviors, the observed actions are independent of any other variable. Similarly, as the iterations increase the variable $\nu$ estimates
$$\lim \nu = \int P(\mathcal{D} \mid \omega)\, p(\alpha, \omega)\, d\omega\, d\alpha = \int P(\mathcal{D} \mid \alpha, \omega)\, p(\alpha, \omega)\, d\omega\, d\alpha = \int P(\alpha, \omega \mid \mathcal{D})\, P(\mathcal{D})\, d\omega\, d\alpha = P(\mathcal{D}) \int P(\alpha \mid \mathcal{D})\, d\alpha = P(\mathcal{D}),$$
where the second equality again uses $P(\mathcal{D} \mid \alpha, \omega) = P(\mathcal{D} \mid \omega)$ and the third uses Bayes' theorem. By the continuous mapping theorem we conclude that
$$\lim \mu/\nu = E(R(\alpha) \mid \mathcal{D}).$$
Thus $E(\lim \mu_1/\nu_1) = E(R(\alpha_1(T; \mathbf{1})))$ and $E(\lim \mu_0/\nu_0) = E(R(\alpha_0(T; \mathbf{0})))$, and so
$$E(\lim \mu_1/\nu_1) - E(\lim \mu_0/\nu_0) = E(R(\alpha_1(T; \mathbf{1}))) - E(R(\alpha_0(T; \mathbf{0}))),$$
i.e., Algorithm 1 consistently estimates the long-term causal effect.
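As a quick numerical sanity check of the identity $\lim \mu/\nu = E(R(\alpha) \mid \mathcal{D})$ established above, here is a small Monte Carlo experiment (ours) in a conjugate toy model where the posterior expectation is available in closed form.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy conjugate model: theta ~ Beta(2, 2); data D = k successes in n trials;
# the "outcome" alpha | theta ~ Bernoulli(theta), and R(alpha) = alpha,
# so E(R(alpha) | D) equals the posterior mean E(theta | D).
n, k = 10, 7
mu = nu = 0.0
for _ in range(200_000):
    theta = rng.beta(2, 2)                  # sample parameters from the prior
    alpha = rng.binomial(1, theta)          # sample outcome given parameters
    w = theta**k * (1 - theta)**(n - k)     # likelihood weight P(D | theta)
    mu += w * alpha
    nu += w
print(mu / nu)             # self-normalized estimate of E(alpha | D)
print((2 + k) / (4 + n))   # closed-form posterior mean: 9/14 ~ 0.643
```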
B Connection of assumptions to policy invariance

Assumption 3 in our framework is related to policy invariance assumptions in the econometrics of policy effects (Heckman and Vytlacil, 2005; Heckman et al., 1998). Intuitively, policy invariance posits that, given the choice of policy by an agent, the initial process that resulted in this choice does not affect the outcome. For example, given that an individual chooses to participate in a tax benefit program, the way the individual was assigned to the program (e.g., lottery, recommendation, or point of a gun) does not alter the outcome that will be observed for that individual. Our assumption is different because we have a temporal evolution of population behavior and there is no free choice by an agent about the assignment, since we assume a randomized experiment. But our assumption shares the essential aspect of conditional ignorability of assignment that is crucial in causal inference.

C Discussion of related methods

Consider the estimand for the Rapoport-Boebel experiment (Rapoport and Boebel, 1992):
$$\tau = c^\top\left(\alpha_1(T; \mathbf{1}) - \alpha_0(T; \mathbf{0})\right).$$
Here we discuss how standard methods would estimate this estimand. Our goal is to illustrate the fundamental assumptions underpinning each method, and to compare them with our Assumptions 2 and 3. To illustrate, we will assume a specific value $c = (0, 1, 0, 2, 0, 0, 0, 0, 1, 1)^\top$. In discussing these methods, we will mostly be concerned with how point estimates compare to the true value of the estimand, which here is $\tau = \$0.054$ using the experimental data in Table 2.

The naive approach would be to consider only the latest observed time point ($t_0 = 2$) under the experimental assignment $Z$, and use the observed population actions under $Z$ as an estimate for $\tau$; i.e.,
$$\hat{\tau}_{\mathrm{naive}} = c^\top\left(\alpha_1(t_0; Z) - \alpha_0(t_0; Z)\right) = -\$0.051.$$
But for this estimate to be unbiased for $\tau$, we generally require that
$$\alpha_1(t_0; Z) - \alpha_0(t_0; Z) = \alpha_1(T; \mathbf{1}) - \alpha_0(T; \mathbf{0}).$$
The naive estimate therefore makes a direct extrapolation from $t = t_0$ to $t = T$ and from the observed assignment $Z$ to the counterfactual assignments $Z = \mathbf{1}$ and $Z = \mathbf{0}$. This ignores, among other things, the dynamic nature of agent actions.

A more sophisticated approach is to analyze the agent actions as a time series. For example, Brodersen et al. (2014) developed a method to estimate the effects of ad campaigns on website visits. Their method was based on the idea of "synthetic controls", i.e., they created a time series using different sources of information that would act as the counterfactual to the observed time series after the intervention. However, their problem is macroeconometric and they work with observational data. Thus, there is neither experimental randomized assignment to games, nor strategic interference between agents, nor dynamic agent actions. More crucially, they do not study long-term equilibrium effects. By construction, in our problem we can leverage behavioral game theory to make more informed predictions of counterfactuals at time points after the intervention at which the distribution of outcomes has stabilized.

Another approach, common in econometrics, is the difference-in-differences (DID) estimator (Card and Krueger, 1994; Donald and Lang, 2007; Ostrovsky and Schwarz, 2011).
In our case, this method is not perfectly applicable because there are no observations before the intervention, but we can still entertain the idea by considering period $t = 1$ as the pre-intervention period. The DID estimator compares the difference in outcomes before and after the intervention for both the treated and control groups. In our application, this estimator takes the value
$$\hat{\tau}_{\mathrm{did}} = \underbrace{c^\top(\alpha_1(t_0; Z) - \alpha_1(1; Z))}_{\text{change in revenue, game 1}} - \underbrace{c^\top(\alpha_0(t_0; Z) - \alpha_0(1; Z))}_{\text{change in revenue, game 0}} = -\$0.164. \quad (5)$$
This estimate is also far from the true value, similar to the naive estimate. The DID estimator is unbiased for $\tau$ only if there is an additive structure in the actions (Abadie, 2005; Angrist and Pischke, 2008, Section 5.2), e.g.,
$$\alpha_j(t; Z) = \mu_j + \lambda_t + \epsilon_{jt},$$
where $\mu_j$ is a policy-specific parameter, $\lambda_t$ is a temporal parameter, and $\epsilon_{jt}$ is noise. The DID estimator thus captures a linear trend in the data by assuming a common parameter for both treatment arms ($\lambda_t$) that is canceled out in the subtraction in Eq. (5). The extent to which an additivity assumption is reasonable depends on the application; however, by definition, it implies ignorability of the assignment (i.e., $Z$ does not appear in the model of $\alpha_j(t; Z)$), and thus it relies on assumptions that are stronger than ours (Abadie, 2005; Angrist and Pischke, 2008).
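A compact sketch (ours) of the two baselines just described; the arrays stand in for observed population actions $\alpha_j(t; Z)$ per game and period, here filled with random placeholder data rather than the Table 2 values.

```python
import numpy as np

def naive_estimate(alpha, c, t0=2):
    """tau_naive = c . (alpha_1(t0; Z) - alpha_0(t0; Z))."""
    return float(c @ (alpha[1][t0] - alpha[0][t0]))

def did_estimate(alpha, c, t0=2, pre=1):
    """tau_did, Eq. (5): difference of pre/post revenue changes across games,
    using period t = 1 as the pre-intervention period."""
    change_1 = c @ (alpha[1][t0] - alpha[1][pre])   # change in revenue, game 1
    change_0 = c @ (alpha[0][t0] - alpha[0][pre])   # change in revenue, game 0
    return float(change_1 - change_0)

# alpha[j] is a (periods x |A|) array of population actions for game j;
# c is the fee vector used in the estimand above.
rng = np.random.default_rng(4)
alpha = {j: rng.dirichlet(np.ones(10), size=3) for j in (0, 1)}  # placeholder data
c = np.array([0, 1, 0, 2, 0, 0, 0, 0, 1, 1], dtype=float)
print(naive_estimate(alpha, c), did_estimate(alpha, c))
```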
In a structural approach, Athey et al. (2011) studied the effects of timber auction format (ascending versus sealed bid) on competition for timber tracts. They estimated bidder valuations from observed data in one auction and imputed counterfactual bid distributions in the other auction, under the assumption of equilibrium play in both auctions. This approach makes two critical implicit assumptions that together are stronger than Assumption 3. First, the bidder valuation distribution is assumed to be a primitive that can be used to impute counterfactuals in other treatment assignments. In other words, the assignment is independent of bidder values, and thus it is strongly ignorable. Second, although imputation is performed for potential outcomes in equilibrium, which captures the notion of long-term effects, inference is performed under the assumption of equilibrium play in the observed outcomes, and thus temporal dynamic behavior is assumed away.

Another popular approach to causality is through directed acyclic graphs (DAGs) between the variables of interest (Pearl, 2000). For example, Bottou et al. (2013) studied the causal effects of the machine learning algorithm that scores online ads in the Bing search engine on the search engine's revenue. Their approach was to create a full DAG of the system, including variables such as queries, bids, and prices, and to make a Causal Markov assumption for the DAG. This allows one to predict counterfactuals for the revenue under manipulations of the scoring algorithm, using only observed data generated from the assumed DAG. However, a key assumption of the DAG approach is that the underlying structural equation model is stable under the treatment assignment, and only edges coming from parents of the manipulated variable need to be removed; as before, assignment is considered strongly ignorable. As pointed out by Dash and Druzdzel (2001), this might be implausible in equilibrium systems. Consider, for example, a system where $X \to Y \leftarrow Z$, and a manipulation that sets the distribution of $Y$ independently of $X, Z$. After the manipulation, the two edges need to be removed. However, if in equilibrium it is required that $Y \approx XZ$, then the two arrows should be reversed after the manipulation. Proper causal inference in equilibrium systems through causal graphs remains an open area without a well-established methodology (Dash, 2005).

Finally, we note that there exists the concept of Granger causality (Granger, 1988), which remains important in econometrics. The central idea in Granger causality is predictability, in particular the ability of lagged iterates of a time series $x(t)$ to predict future values of the outcome of interest, which in our case is the population action $\alpha_j(t; Z)$. This causality concept does not take into account the randomization from the experimental design, which is key in statistical causal inference.

D Application: Rapoport and Boebel (1992) data

The following tables report the payoff matrix structure (Table 1, used by Rapoport and Boebel) and the observed data (Table 2), as reported by McKelvey and Palfrey (1995).

Table 1: Normal-form game in the experiment of Rapoport and Boebel (1992); the values $L$ and $W$ are specified as described in the body of the paper.

        a'1   a'2   a'3   a'4   a'5
a1      W     L     L     L     L
a2      L     L     W     W     W
a3      L     W     L     L     W
a4      L     W     L     W     L
a5      L     W     W     L     L

Table 2: Experimental data of Rapoport and Boebel (1992), as reported by McKelvey and Palfrey (1995). The data include the frequency of actions for the row agent and the column agent in the experiment, broken down by game and period. The last period of each game (period 4, shaded gray in the original) is assumed to be long-term and is therefore held out of the data analysis, used only to measure predictive performance. (Note: there are five total actions available to every player according to the payoff structure in Table 1. The frequencies for actions $a_5, a'_5$ can be inferred because $\sum_{i=1}^{5} a_i = 1$ and $\sum_{i=1}^{5} a'_i = 1$.)

                 row agent                    column agent
Game  Period   a1     a2     a3     a4       a'1    a'2    a'3    a'4
1     1        0.308  0.307  0.113  0.120    0.350  0.218  0.202  0.092
1     2        0.293  0.272  0.162  0.100    0.333  0.177  0.190  0.140
1     3        0.273  0.350  0.103  0.123    0.353  0.133  0.258  0.102
1     4        0.295  0.292  0.113  0.135    0.372  0.192  0.222  0.063
2     1        0.258  0.367  0.105  0.143    0.332  0.115  0.245  0.140
2     2        0.290  0.347  0.118  0.110    0.355  0.198  0.208  0.108
2     3        0.355  0.313  0.082  0.100    0.355  0.215  0.187  0.110
2     4        0.323  0.270  0.093  0.105    0.343  0.243  0.168  0.107
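To make the held-out structure of Table 2 concrete, here is a short sketch (ours) that encodes the row-agent frequencies for game 1 and recovers the implied frequency of the fifth action from the simplex constraint; extending it to the remaining columns of Table 2 is mechanical.

```python
import numpy as np

# Row-agent frequencies (a1..a4) for game 1, periods 1-4, from Table 2.
game1_row = np.array([
    [0.308, 0.307, 0.113, 0.120],
    [0.293, 0.272, 0.162, 0.100],
    [0.273, 0.350, 0.103, 0.123],
    [0.295, 0.292, 0.113, 0.135],
])

# Recover a5 from the simplex constraint sum_i a_i = 1.
a5 = 1.0 - game1_row.sum(axis=1, keepdims=True)
game1_row = np.hstack([game1_row, a5])

observed = game1_row[:3]   # periods 1-3: used for fitting
held_out = game1_row[3:]   # period 4: treated as long-term and held out
print(game1_row.round(3))
```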
E More details on Bayesian computation

Here we offer more details about the choices made in implementing Algorithm 1 in Section 4.1 of the main paper. For convenience, we repeat the content of Section 4.1 and then expand with further details.

Step 1: Model parameters. For simplicity we assume that the models in the two games share common parameters, and thus $(\phi_1, \psi_1, \lambda_1) = (\phi_0, \psi_0, \lambda_0) \equiv (\phi, \psi, \lambda)$, where $\lambda$ are the parameters of the behavioral model to be described in Step 9. Having common parameters also acts as regularization and thus helps estimation. We emphasize that this simplification is not necessary, as we could have two different sets of parameters for each game. It is crucial, however, that the parameters are stable with respect to the treatment assignment, because we need to extrapolate from the observed assignment to the counterfactual ones.

Step 4: Sampling parameters and initial behaviors. As explained later, we assume that there are 3 different behaviors, and thus $\phi, \psi, \lambda$ are vectors with 3 components. Let $x \sim U(m, M)$ denote that every component of $x$ is uniform on $(m, M)$, independently. We choose diffuse priors for our parameters, specifically, $\phi \sim U(0, 10)$, $\psi \sim U(-5, 5)$, and $\lambda \sim U(-10, 10)$. Given $\phi$ we sample the initial behaviors in the two games as $\beta_1(0; Z) \sim \mathrm{Dir}(\phi)$ and $\beta_0(0; Z) \sim \mathrm{Dir}(\phi)$, independently. Regarding the particular choices of these distributions, we first note that $\phi$ needs to have positive components because it is used as an argument of the Dirichlet distribution. Larger values than 10 could be used for the components of $\phi$, but the implied Dirichlet distributions would not differ significantly from the ones we use in our experiments. Regarding $\lambda$, we note that its components are used in quantities of the form $e^{\lambda[k] u}$, and so it is reasonable to bound them; the interval $[-5, 5]$ is diffuse enough given the values of $u$ implied by the payoff matrix in Table 1. Finally, the prior for the temporal model parameters, $\psi$, is also diffuse enough. An alternative would be to use a multivariate normal distribution as the prior for $\psi$, but this would not alter the procedure significantly.

Steps 5 & 7: Pivot to counterfactuals. Since we have a completely randomized experiment (A/B test), it holds that $\rho_Z = 0.5$ and therefore $\beta(0) = 0.5(\beta_1(0; Z) + \beta_0(0; Z))$. Now we can pivot to the counterfactual population behaviors under $Z = \mathbf{1}$ and $Z = \mathbf{0}$ by setting $\beta_1(0; \mathbf{1}) = \beta_0(0; \mathbf{0}) = \beta(0)$.

Step 8: Sample counterfactual behavioral history. As the temporal model, we adopt the lag-one vector autoregressive model, also known as VAR(1). We transform the population behavior into a new variable $w_t = \mathrm{logit}(\beta_1(t; \mathbf{1})) \in \mathbb{R}^2$ (and do the same for $\beta_0(t; \mathbf{0})$); here the map $y = \mathrm{logit}(x)$ is defined as the function $\Delta^m \to \mathbb{R}^{m-1}$ such that, for vectors $y = (y_1, \ldots, y_{m-1})$ and $x = (x_1, \ldots, x_m)$ with $\sum_i x_i = 1$ and $x_1 \neq 0$ without loss of generality, $y_i = \log(x_{i+1}/x_1)$ for $i = 1, \ldots, m-1$. Such a transformation, which has a unique inverse, is necessary because population behaviors are constrained on the simplex and thus form so-called compositional data (Aitchison, 1986; Grunwald et al., 1993). The VAR(1) model implies that
$$w_t = \psi[1]\mathbf{1} + \psi[2]\, w_{t-1} + \psi[3]\, \epsilon_t, \quad (6)$$
where $\psi[k]$ is the $k$th component of $\psi$ and $\epsilon_t \sim N(0, I)$ is i.i.d. standard bivariate normal. Eq. (6) is used to sample the behavioral history $B_j$ from $t = 0$ to $t = T$, as described in Step 8 of Algorithm 1. Such sampling is straightforward: we simply sample the random noises $\epsilon_t$ for every $t \in \{0, \ldots, T\}$ and then compute each $w_t$ successively. Given the sample $\{w_t : t = 0, \ldots, T\}$, we can then transform back to calculate the population behaviors $\beta_1(t; \mathbf{1}) = \mathrm{logit}^{-1}(w_t)$ for $t = 0, \ldots, T$; for $B_0$ we repeat the same procedure with a new sample of $\epsilon_t$, since the two games share the same temporal model parameters $\psi$.

Step 9: Behavioral model. Here we restate the specifics of the behavioral model in more detail. In QL$_p$, agents possess increasing levels of sophistication.
Following earlier work (Wright and Leyton-Brown, 2010), we adopt $p = 3$, and thus consider a behavioral space with three different behaviors, $\mathcal{B} = \{b_0, b_1, b_2\}$. Recall that a behavior $b \in \mathcal{B}$ represents the distribution of actions that an agent will play conditional on adopting that behavior. In QL$_p$, such distributions depend on an assumption of quantal response, which is defined as follows. Let $u \in \mathbb{R}^{|\mathcal{A}|}$ denote a vector such that $u_a$ is the expected utility of an agent taking action $a \in \mathcal{A}$, and let $G_j$ denote the payoff matrix in game $j$ as in Table 1. If an agent is facing another agent with strategy (distribution over actions) $b$, then $u = G_j b$. The quantal best-response with parameter $x$ determines the distribution of actions that the agent will take facing expected utilities $u$, and is defined as
$$\mathrm{QBR}(u; x) = \mathrm{expit}(x u),$$
where, for a vector $y$ with elements $y_i$, $\mathrm{expit}(y)$ is a vector with elements $\exp(y_i)/\sum_i \exp(y_i)$. The parameter $x \geq 0$ is called the precision of the quantal best-response. If $x$ is very large, then the response is close to the classical Nash best-response, whereas if $x = 0$ the agent ignores the utilities and randomizes among actions.

Let $\lambda = (\lambda[1], \lambda[2], \lambda[3])$ be the precision parameters. Let $\alpha(b)$ denote the distribution over actions implied for an agent who adopts behavior $b$. Given $\lambda$, the model QL$_3$ calculates $\alpha(b_k)$, for $k = 0, 1, 2$, as follows:

- Agents who adopt $b_0$, termed level-0 agents, have precision 0, and thus pick an action at random from the action space $\mathcal{A}$. Thus $\alpha(b_0) = \mathrm{QBR}(u; 0) = (1/|\mathcal{A}|)\mathbf{1}$, regardless of the argument $u$.

- An agent who adopts $b_1$, termed a level-1 agent, has precision $\lambda[1]$ and assumes it is playing against a level-0 agent. Thus the agent faces a vector of utilities $u_1 = G_j\, \alpha(b_0)$, and so $\alpha(b_1) = \mathrm{QBR}(u_1; \lambda[1])$.

- An agent who adopts $b_2$, termed a level-2 agent, has precision $\lambda[3]$ and assumes it is playing against a level-1 agent with precision $\lambda[2]$. Thus, it estimates that it is facing the strategy $\alpha_1^{(2)} = \mathrm{QBR}(u_1; \lambda[2])$, where $u_1 = G_j\, \alpha(b_0)$ as above. The expected utility vector of the level-2 agent is $u_2 = G_j\, \alpha_1^{(2)}$, and thus $\alpha(b_2) = \mathrm{QBR}(u_2; \lambda[3])$.

Given $G_j$ and $\lambda$, we can therefore write down a $5 \times 3$ matrix $Q_j = [\alpha(b_0), \alpha(b_1), \alpha(b_2)]$ whose $k$th column is the distribution over actions played by an agent conditional on adopting behavior $b_{k-1}$. Conditional on the population behavior $\beta_j(t; Z)$, the expected population action is $\bar{\alpha}_j(t; Z) = Q_j \beta_j(t; Z)$. The population action $\alpha_j(t; Z)$ is distributed as a multinomial with expectation $\bar{\alpha}_j(t; Z)$, and so $P(\alpha_j(t; \mathbf{1}) \mid \beta_j(t; \mathbf{1}), G_j) = \mathrm{Multi}(|\mathcal{I}| \cdot \alpha_j(t; \mathbf{1});\, \bar{\alpha}_j(t; \mathbf{1}))$, where $\mathrm{Multi}(n; p)$ is the multinomial density of observations $n = (n_1, \ldots, n_K)$ with expected frequencies $p = (p_1, \ldots, p_K)$. Hence, the full likelihood for the observed actions in game $j$ required in Steps 10 and 11 of Algorithm 1 is given by the product
$$P(\mathcal{D}_j \mid B_j, \lambda_j, G_j) = \prod_{t=0}^{T-1} \mathrm{Multi}(|\mathcal{I}| \cdot \alpha_j(t; j\mathbf{1});\, \bar{\alpha}_j(t; j\mathbf{1})).$$