Designing and Deploying Online Field Experiments


Authors: Eytan Bakshy, Dean Eckles, Michael S. Bernstein

ABSTRACT

Online experiments are widely used to compare specific design alternatives, but they can also be used to produce generalizable knowledge and inform strategic decision making. Doing so often requires sophisticated experimental designs, iterative refinement, and careful logging and analysis. Few tools exist that support these needs. We thus introduce a language for online field experiments called PlanOut. PlanOut separates experimental design from application code, allowing the experimenter to concisely describe experimental designs, whether common "A/B tests" and factorial designs, or more complex designs involving conditional logic or multiple experimental units. These latter designs are often useful for understanding causal mechanisms involved in user behaviors. We demonstrate how experiments from the literature can be implemented in PlanOut, and describe two large field experiments conducted on Facebook with PlanOut. For common scenarios in which experiments are run iteratively and in parallel, we introduce a namespaced management system that encourages sound experimental practice.

Keywords: A/B testing; online experiments; toolkits; methodology

Categories and Subject Descriptors: H.5.3 [Group and Organization Interfaces]: Evaluation/methodology

1. INTRODUCTION

Randomized field experiments are central to contemporary design and development processes for Internet services. In the most popular case, practitioners use "A/B tests" that randomly assign users to one of two variations of a service. Doing so often allows designers and developers to quickly identify the better of the two. The Internet industry has distinct advantages in how organizations can use experiments to make decisions: developers can introduce numerous variations on the service without substantial engineering or distribution costs, and observe how a large random sample of users (rather than a convenience sample) behave when randomly assigned to these variations. So, in many ways, experimentation with Internet services is easy.

For some organizations, randomized experiments play a central role throughout the design and decision-making process. Experiments may be used to explore a design space [19], better attribute outcomes to causes [3, 14], and estimate effects that help decision makers understand how people react to changes and use their services [23, 35]. In this way, even early stages of the user-centered design process can be informed by field experiments. Such trials are not intended to optimize short-term objectives through a pick-the-winner process; they instead aim to provide more lasting, generalizable knowledge. While experiments that achieve these objectives often draw from experimental designs used in the behavioral and social sciences, available tools do little to support the design, deployment, or analysis of these more sophisticated experiments.
The realities surrounding the deployment of routine experiments can make their evaluation quite complex. Online experimentation is highly iterative, such that preliminary results are used to rapidly run follow-up experiments. This necessitates changing or launching new experiments. Changing live experiments can easily result in statistical inferences that are incorrect. From a development perspective, running follow-up experiments can be time consuming and error prone because experimental logic is often mixed in with application code. Online experimentation is also distributed across individuals and teams, and over time. This can make it difficult to run experiments simultaneously without interacting with others' experiments or complicating application logic. Combined, these features mean that experimentation can become so embedded in application code that only a few engineers can correctly modify a particular experiment without introducing errors in its design or future analysis.

In this work, we discuss how Internet-scale field experiments can be designed and deployed with PlanOut, a domain-specific language for experimentation used at Facebook. Designers and engineers working with PlanOut can view any aspect of a service as tunable via parameters: e.g., a flag signaling whether a banner is visible, a variable encoding the number of items in an aggregation, or a string that corresponds to the text of a button. Experimental logic is encapsulated in simple scripts that assign values to parameters. Basic random assignment primitives can be combined to reliably implement complex experimental designs, including those that involve assignment of multiple experimental units to multiple factors and treatments with continuous values. PlanOut scripts can also be used as a concise summary of an experiment's design and manipulations, which makes it easier to communicate about, and replicate, experiments.

PlanOut is enough by itself to author one-off, isolated experiments quickly. However, if experiments must be iterated upon, or related experiments must run in parallel, additional infrastructure is necessary. Such systems can be used to manage experiments and logging, and to prevent interference between experiments. In this paper, we introduce the architecture of a management system for such situations, and illustrate how iterative experiments can be soundly run and analyzed. In sum, this paper contributes:

• A characterization of online experimentation based on parametrization of user experiences,
• The PlanOut language, which provides a high-ceiling, low-threshold toolkit for parameter-based experiments,
• Guidelines for managing and analyzing iterative and distributed experiments.

These contributions together articulate a perspective on how online field experiments should be conceptualized and implemented. This perspective advocates the use of short, centralized scripts to describe assignment procedures at a high level. It results in experimental practice that is more agile and encourages the production of generalizable, scientific knowledge.

The paper is structured as follows. After reviewing related work in Section 2, we introduce the PlanOut language in Section 3 and show how it can be used to design both simple and complex experiments. Then, in Sections 5 and 6, we describe how distributed, iterative experiments can be managed, logged, and analyzed.
Finally, we discuss the broader implications and limitations of our work in Section 7.

2. RELATED WORK

The design and analysis of experiments is a developed area within statistics that is regularly taught, with domain-specific elements, to students in industrial engineering, psychology, marketing, human–computer interaction, and other fields [8, 16, 23, 31]. Tools for producing design matrices for factorial and fractional factorial designs are available (e.g., in the R packages DoE, Design, and rms). Such packages are useful for designing small-scale studies, but since the design matrix is created a priori, they are not well suited for online settings where newly created experimental units must be assigned in real time and assignment may depend on unit characteristics not available in advance.

The prevalence of randomized experiments in the Internet industry and the tools developed there are only partially represented in the scholarly literature. Mao et al. [24] created experimental frameworks for crowdsourcing sites such as Amazon Mechanical Turk. Several papers by Kohavi et al. present recommendations on how to implement and instrument experiments [19], as well as common pitfalls in analyzing experiments [12, 18].

Existing experimentation tools include associating experiments with "layers" (at Google and Microsoft [18, 32]) or "universes" (at Facebook), such that all conditions in the same layer are mutually exclusive. Some tools (e.g., Google Analytics, Adobe Target) include mechanisms for associating condition identifiers with configuration information (e.g., dictionaries of parameters and their values).

While the types of experiments we focus on in this paper are designed to inform product decision-making, other experiments are run simply to optimize a single outcome variable. For example, another active area of development focuses on implementing heuristics for optimizing stochastic functions (e.g., multi-armed bandit optimization) [22, 30].

3. THE PLANOUT LANGUAGE

The PlanOut language separates experimental design from application logic and focuses experimenters on the core aspect of an experiment: how units (e.g., users, items, cookies) are randomly assigned to conditions, as defined by parameters (e.g., settings for user interface elements, references to ranking algorithms). PlanOut promotes a mental model in which every aspect of the site is parameterizable, and experiments are a way of evaluating user experiences defined by those parameters.

This approach encourages experimenters to decompose large changes into smaller components that can be manipulated independently. In doing so, experimenters are better equipped to attribute changes in user behavior to specific features and thus inform design decisions. Decomposition also allows experimenters to more easily iterate on experiments so that some features remain fixed while others change.

Experimenters use PlanOut by writing a PlanOut script, which may then be executed via an API for each unit (e.g., user or user–story combination). Each script indicates which inputs are used for assignment, how random assignment should occur, and the names of parameters that can be accessed via the API and logged. PlanOut scripts are executed sequentially. In later sections, we introduce a number of scripts for both standard and complex experiments.
From a systems perspective, PlanOut is a way of serializing experiment definitions (e.g., as JSON) so they can be easily stored and executed on multiple platforms, such as layers of backend and frontend services, and mobile devices. Serialized PlanOut code (Figure 1c) can be generated and edited through a domain-specific language (DSL) (Figure 1a) or graphical user interfaces (Figure 1b). The DSL is presented in the remaining sections, but many of these examples could alternatively be formulated through the use of graphical interfaces.

The PlanOut DSL and its syntax are minimal. The primary contribution of the language is in providing a parsimonious set of operations for thinking about, designing, and implementing experiments. Because of this, we spend little time discussing the language itself or its built-in operators. A complete list of operators can be found in the documentation for the reference implementation of PlanOut. (An open source implementation of a PlanOut interpreter and API is available at https://github.com/facebook/planout.)

Figure 1: A factorial experiment in PlanOut. (a) A PlanOut language script:

    button_color = uniformChoice(
      choices=['#3c539a', '#5f9647', '#b33316'],
      unit=cookieid);
    button_text = weightedChoice(
      choices=['Sign up', 'Join now'],
      weights=[0.8, 0.2],
      unit=cookieid);

(b) A graphical interface for specifying simple PlanOut experiments. (c) A JSON representation of the serialized PlanOut code:

    {"op": "seq", "seq": [
      {"op": "set", "var": "button_color",
       "value": {"op": "uniformChoice",
                 "unit": {"op": "get", "var": "cookieid"},
                 "choices": ["#3c539a", "#5f9647", "#b33316"]}},
      {"op": "set", "var": "button_text",
       "value": {"op": "weightedChoice",
                 "unit": {"op": "get", "var": "cookieid"},
                 "weights": [0.8, 0.2],
                 "choices": ["Sign up", "Join now"]}}
    ]}

(d) An illustration of the proportion of cookieids allocated to each parameterization; because we use weightedChoice() to assign button_text, more cookies are assigned to "Sign up" than to "Join now".

3.1 Functionality

We begin our discussion of PlanOut by giving several illustrative examples of how scripts and operators work. We first describe a simple A/B test and show how it can be generalized into factorial designs. Then we consider how experimental designs that involve multiple types of units in the randomization of a user interface element can be used to estimate different effects. Finally, we discuss conditional evaluation (e.g., for pre-stratification), and other operators.

3.1.1 A/B test

The most common type of experiment involves uniform random selection: for example, randomly assigning users to one of several button colors or text options. This can be accomplished via the uniformChoice operator:

    button_color = uniformChoice(
      choices=['#3c539a', '#5f9647', '#b33316'],
      unit=cookieid);

Here, each cookieid is assigned deterministically to one of three possible button colors. In application code, the experimenter will later be able to evaluate this PlanOut script for a particular cookieid (e.g., in the case of a Web-based sign-up form), and retrieve the runtime value through the variable name button_color.
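To make the evaluation step concrete, the following is a minimal sketch of how the A/B test above might be evaluated from application code using the open source Python reference implementation. The class name is invented, and the import paths reflect that package's documented interface as we understand it; treat this as illustrative rather than canonical.

    # Sketch only: assumes the open source PlanOut reference implementation
    # (https://github.com/facebook/planout); import paths may differ by version.
    from planout.experiment import SimpleExperiment
    from planout.ops.random import UniformChoice

    class SignupButtonExperiment(SimpleExperiment):
        """Assigns each cookieid to one of three button colors."""
        def assign(self, params, cookieid):
            params.button_color = UniformChoice(
                choices=['#3c539a', '#5f9647', '#b33316'],
                unit=cookieid)

    # Application code evaluates the experiment for a particular cookie and
    # retrieves the parameter by name; assignment is deterministic per cookieid.
    exp = SignupButtonExperiment(cookieid=42)
    print(exp.get('button_color'))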
3.1.2 Multifactor experiment

Creating a full factorial experiment means setting multiple variables that are evaluated independently. For example, suppose we wanted to manipulate not just the color of a button but also its text. The script for this experiment is given in Figure 1a, and includes two operators, a uniformChoice and a weightedChoice. We use weightedChoice to assign (on average) 80% of the cookies to have the button text "Sign up", and 20% to have the text "Join now". Setting these two parameters as shown in Figure 1 generates 2 × 3 = 6 conditions, whose proportions are summarized in Figure 1d.

3.1.3 Conditional execution

Many experiments cannot be described through fully factorial designs, e.g., in the case where some values of one parameter may only be valid when another parameter is set to a particular value, or assignment probabilities are dependent on another variable. PlanOut thus includes operators for conditional control flow, such as if/else, boolean operations (i.e., and, or, not), comparison operations (e.g., ==, >=), and array indexing.

Consider a scenario where we wish to control the population of users receiving a new translation feature so that a higher proportion of US users receive the feature. To accomplish this, one could pass in both a userid and country to the PlanOut interpreter, and use conditional logic:

    if (country == 'US') {
      has_translate = bernoulliTrial(p=0.2, unit=userid);
    } else {
      has_translate = bernoulliTrial(p=0.05, unit=userid);
    }

or alternatively, via array indexing:

    strata_p = [0.05, 0.2];
    has_translate = bernoulliTrial(
      p=strata_p[country == 'US'], unit=userid);

Here, arrays are zero-indexed and true/false evaluate to 1 and 0, respectively.

3.1.4 Experiments with multiple and nested units

Many effects are better understood through randomization of units other than the user. For instance, while most standard A/B tests are between-subjects designs, where users are randomly assigned to different experiences, some effects may be more precisely estimated through a within-subjects design. These experiments can be implemented by transitioning from a single experimental unit (e.g., viewerid) to tuples of units.

Consider an experiment that manipulates whether a story in users' News Feed has its comment box collapsed (Figure 2a) or expanded (Figure 2b). If an experimenter wanted to assign 5% of all News Feed items to have a collapsed comment box, so that users must click to see comments attached to a story, one could define such an experiment by:

    collapse_story = bernoulliTrial(p=0.05,
      unit=[viewerid, storyid]);

The bernoulliTrial operator returns 1 with probability p, and 0 otherwise. By making unit a tuple, [viewerid, storyid], one achieves a fully randomized within-subjects design, where each user sees, in expectation, an independent 5% of posts with collapsed comment boxes. This type of design may be used to estimate the effect of collapsing individual stories on individual viewers' responses to that story (e.g., likes, comments).

Figure 2: (a) A News Feed story whose comment box is collapsed and (b) an expanded comment box. Different causal effects can be estimated by randomizing the state of the comment box over different experimental units (e.g., source users, viewers, stories).

Other ways of assigning units can be used to estimate different quantities. If one were to instead assign viewerids to conditions, 5% of users would see all stories collapsed, which could have large effects on how viewers interact with all stories and how many stories they would consume. Had storyid been the randomized unit, particular stories would be collapsed or not collapsed for all viewers.
Because viewers are expected to comment at a higher rate when the box is expanded, randomizing over just storyid can produce herding effects. These differences result from interference or "spillovers" across conditions [1, 29], and highlight how supporting multiple experimental units can be useful for evaluating how user interface elements affect complex user dynamics.

3.1.5 Extensions

PlanOut is extensible. If an assignment procedure cannot be implemented using built-in operators, developers may write custom operators in a native language (e.g., PHP or Python), including those that integrate with other services. PlanOut experiments at Facebook often use custom operators that interface with gating infrastructure, graph cluster randomization [34], and other experimentation systems. Classes for random assignment are also extensible, so that procedures can be easily implemented using the hashing methods described below.

3.2 Random assignment implementation

Many operators involve generating pseudo-random numbers deterministically based on input data (e.g., a user generally should be assigned to the same button color each time they load a particular page). A sound assignment procedure maps experimental units to parameters in a way that is deterministic, as good as random, and, unless by design, independent of other parameter values from the same or other experiments. And because experiments can be linked across multiple service layers (e.g., ranking and user interfaces), it is important that any pseudo-random operation can be kept consistent across loosely coupled services that may be written in different languages.

The PlanOut interpreter implements procedures that automatically fulfill these requirements. Rather than using a pseudo-random number generator, or directly hashing units into numbers, the interpreter "salts" inputs to a hash function so that each assignment (unless otherwise specified) is independent of other assignments, both within and across experiments. Because the procedure is based on standard hash functions (i.e., SHA-1), it is deterministic and platform-independent. At a low level, this is done by prepending each unit with a unique experiment identifier and variable-specific salt. Thus, the hash used to assign a variable such as button_color is not just the input ID (e.g., 42), but instead, e.g., user_signup.my_exp.button_color.42, where user_signup is the namespace of the experiment and my_exp is the identifier of the particular experiment (namespaces and experiments are more specifically defined in the following sections).
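As a rough illustration of this style of salted hashing (and not Facebook's actual implementation), the following self-contained Python sketch maps a namespace, experiment, parameter, and unit to a deterministic uniform draw; the number of hash digits used and the function names are illustrative choices.

    import hashlib

    LONG_SCALE = float(0xFFFFFFFFFFFFFFF)  # 15 hex digits; illustrative constant

    def salted_uniform(namespace, experiment, parameter, unit):
        """Deterministically map namespace.experiment.parameter.unit to [0, 1)."""
        key = '{}.{}.{}.{}'.format(namespace, experiment, parameter, unit)
        digest = hashlib.sha1(key.encode('utf-8')).hexdigest()
        return int(digest[:15], 16) / LONG_SCALE

    def uniform_choice(choices, **kwargs):
        """Pick one of `choices` as a function of the salted hash."""
        index = int(salted_uniform(**kwargs) * len(choices))
        return choices[min(index, len(choices) - 1)]

    # The same unit always maps to the same value, but changing the parameter
    # name (or experiment, or namespace) yields an effectively independent draw.
    print(uniform_choice(['#3c539a', '#5f9647', '#b33316'],
                         namespace='user_signup', experiment='my_exp',
                         parameter='button_color', unit=42))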
4. EXAMPLE EXPERIMENTS

The preceding examples show how PlanOut is a low-threshold language for implementing basic experiments. In this section we demonstrate that PlanOut also has a high ceiling, in that complex, scientific experiments can be concisely specified in only a few lines of code.

4.1 Examples from prior research

We begin by demonstrating how one could implement two published experiments from the social computing literature.

4.1.1 Experimenting with goal-setting

In an influential application of social psychological theory to online systems, Beenen et al. [6] experimented with strategies for encouraging users to contribute ratings to the movie recommendation service and online community MovieLens. In Study 2, they randomly assigned users to an email that sets a goal for users to rate movies. Users were either identified as being part of a group, and having a group-level goal, or as having an individual goal. The goal was either specific or a "do your best" goal; if specific, it was a number of movies to be rated in a week, scaled by the size of the group. This experiment could be implemented as:

    group_size = uniformChoice(choices=[1, 10], unit=userid);
    specific_goal = bernoulliTrial(p=0.8, unit=userid);
    if (specific_goal) {
      ratings_per_user_goal = uniformChoice(
        choices=[8, 16, 32, 64], unit=userid);
      ratings_goal = group_size * ratings_per_user_goal;
    }

This experiment could then be analyzed in terms of the per-person specific goal, as in Beenen et al. [6]. There are multiple other ways to implement this experiment, some of which correspond to different choices about how to split logic between PlanOut and the application code. For example, the actual text used in the emails could be constructed in the PlanOut code, while the implementation above follows from the judgment that it would be better to do so in the application logic. Such choices can also depend on other available tools, such as tools for automatically creating translation tasks for new strings used in an online service.

Figure 3: A within-subjects experimental design that deterministically randomizes social cues presented to users. Example from Bakshy et al. [3] when (a) 1 of 3 and (b) 3 of 3 cues are shown.

4.1.2 A social cues experiment with complex inputs

Consider an experiment on the effects of placing social cues alongside advertisements from Bakshy et al. [3]. A small percentage of user segments were allocated to this experiment (see Section 5.1); for these users, some social cues were removed from ads. For instance, if a user in the experiment had three friends that "like" a particular Facebook page being advertised, then this user would be randomly assigned to see one, two, or three friends associated with the page (Figure 3). This experiment can be written as follows:

    num_cues = randomInteger(
      min=1, max=min(length(liking_friends), 3),
      unit=[userid, pageid]);
    friends_shown = sample(
      choices=liking_friends, draws=num_cues,
      unit=[userid, pageid]);

The input data to the PlanOut experiment would be userid, pageid, and liking_friends, an array of friends associated with the page. The script specifies that each user–page pair is randomly assigned to some number of cues, num_cues, between one and the maximum number of displayable cues (i.e., no more than three, but no greater than the number of friends eligible to be displayed alongside the ad). That number is then used to randomly sample num_cues draws from liking_friends. That is, the script determines both the number of cues to display and the specific array of friends to display.

4.2 Experiments deployed using PlanOut

The following represent a sample of experiments that have been designed and deployed using PlanOut at Facebook.

4.2.1 Voter turnout experiment

In an extension and replication of prior experiments with voter turnout [7, 9], an experiment assigned all voting-aged US Facebook users to encouragements to vote in the 2012 Presidential Election. This experiment involved assigning users to both a banner at the top of the screen and eligibility for seeing social stories about friends' self-reported voting behavior in News Feed. We show a subset of the parameters set by this experiment:
    has_banner = bernoulliTrial(p=0.97, unit=userid);
    cond_probs = [0.5, 0.98];
    has_feed_stories = bernoulliTrial(
      p=cond_probs[has_banner], unit=userid);
    button_text = uniformChoice(
      choices=["I'm a voter", "I'm voting"], unit=userid);

We can see that has_banner is 1 for 97% of users. Then, we define cond_probs to be the conditional probability that one would show feed stories given that has_banner is either 0 or 1. We assign has_feed_stories using a bernoulliTrial() with p=cond_probs[has_banner], so that those with the banner have a high chance of also being able to see the feed stories, and those without a banner have an equal probability of being able to see or not see the feed stories. Finally, the button text for the call to action in the banner is subject to experimental manipulation (provided that the user is in the has_banner=1 condition). Analyses can then examine effects of the banner, effects of social stories about voting, and their interaction with verb/noun phrasing [9].

4.2.2 Continuous-treatment encouragement design

Encouragement designs [17] randomize an inducement to a behavior of interest so as to evaluate the inducement or study the behavior's downstream effects. In online services, it is common to encourage users to engage with a particular entity, user, or piece of content. For instance, if having more ties on Facebook is hypothesized to increase long-term engagement, one could establish a causal relationship by randomizing whether or not some users receive recommendations for additional friends.

Evidence suggests that users who receive more feedback on Facebook are more likely to become engaged with the site [10]. If there is a (forward) causal relationship between these variables, then changes to the site that affect how much feedback users receive can in turn affect user engagement and content production. The following experiment examines this hypothesized effect by randomizing encouragements for friends to engage with a source user's content. It also illustrates random assignment involving multiple units. As mentioned in Section 3.1.4, expanding or collapsing News Feed stories can affect engagement with stories. The script below randomly assigns each source user to a proportion, such that, on average, that proportion of the source's friends see a collapsed comment box when stories they produce appear in News Feed.

    prob_collapse = randomFloat(min=0.0, max=1.0,
      unit=sourceid);
    collapse = bernoulliTrial(p=prob_collapse,
      unit=[storyid, viewerid]);

Each source user is assigned to a probability prob_collapse in [0, 1]. Then, each story–viewer pair is assigned to have a collapsed comment box with probability prob_collapse. To carry out this assignment, we invoke the PlanOut script from the part of the News Feed rendering logic that determines whether stories' comment boxes should be expanded or collapsed, using sourceid, storyid, and viewerid as inputs (more discussion of the application interface is covered in Section 5.3.1).

There are a number of possible ways this experiment can be analyzed. First, one can identify the effect of modulating feedback encouragements on the total amount of feedback a user's stories get. Second, we can test our original hypothesis that feedback causes users to engage more with the site (e.g., log in more often or produce more content).
One can look at the effect of the assignment to different values of prob_collapse on users' engagement levels, or use an instrumental variables analysis [17, 27], which combines estimates of the effects of the encouragement on feedback received and on engagement, to estimate the effect of feedback on users' engagement.
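As a simplified illustration of the instrumental variables logic just described (using the randomized prob_collapse as the instrument), the following Python sketch computes a Wald-style ratio estimate on hypothetical per-source-user data; the column names and the synthetic data are invented for illustration, and a real analysis would use the logged exposures and outcomes instead.

    import numpy as np
    import pandas as pd

    # Hypothetical per-source-user data: the randomized encouragement
    # intensity (prob_collapse), feedback received, and later engagement.
    df = pd.DataFrame({
        'prob_collapse': np.random.uniform(0, 1, 10000),
        'feedback':      np.random.poisson(5, 10000),
        'engagement':    np.random.normal(10, 2, 10000),
    })

    # Reduced form: effect of the randomized encouragement on engagement.
    reduced_form = np.polyfit(df['prob_collapse'], df['engagement'], 1)[0]

    # First stage: effect of the randomized encouragement on feedback received.
    first_stage = np.polyfit(df['prob_collapse'], df['feedback'], 1)[0]

    # Wald-style IV estimate of the effect of feedback on engagement,
    # valid under standard instrumental variables assumptions.
    print(reduced_form / first_stage)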
5. RUNNING EXPERIMENTS

We have discussed ways of designing and executing randomized assignment procedures, but have not described how experiments are managed, tested, and logged. Here we define an experiment to refer to a PlanOut script combined with a target population for which that script was launched at a specific point in time. From the perspective of the experimenter and logging infrastructure, different experiments are considered separately.

In the following subsections, we describe a broader technical context for running experiments. This supporting infrastructure includes: a system for managing and logging experiments, including a segmentation layer which maps units to experiments; a launch system which provides default values for parameters not subject to experimental manipulation; an API for retrieving parameters; and a logging architecture which simplifies data analysis tasks.

5.1 Namespace model of site parameters

Field experiments with Internet services frequently involve the manipulation of persistent parameters that are the subject of multiple experiments, whether conducted serially or in parallel. We use a namespace model of parameters to support these practices.

Experimentation is frequently iterative; as in scientific research, a single experiment is often not definitive and so requires follow-up experiments that manipulate the same parameters. A second experiment with a near-identical design may be used to more precisely estimate effects, or might include new variations suggested by the first's results or other design work. Continual redesigns and development might also change the effects of the parameters, thus motivating the need for additional experiments.

Likewise, multiple experiments manipulating the same aspects of a service are frequently run in parallel, sometimes by different teams with minimal explicit coordination. Two teams may manipulate (a) the same features of the same service, (b) independent features of the same service (e.g., font size and items per page), or (c) different layers of the same service (e.g., link colors and the ranking model which selects which items are to be displayed). In these cases, it is helpful to have an experimentation system that can keep track of and/or restrict which parameters are set by each experiment. This requires that experimentation tools be cross-platform and handle allocation of units to multiple experiments started at different times by different teams.

The solution to support these practices within the PlanOut framework is to use parameter namespaces (or namespaces for short). This is a natural extension of thinking of the experimentation system as the means by which parameter values are read in application code; often these parameters are an enduring part of the service, such that, over time, many experiments will set a particular parameter. Each namespace is centered on a primary unit (e.g., users). A new experiment is created within a new or existing namespace by allocating some portion of the population to a PlanOut script. (Any two experiments which set the same parameter must be mutually exclusive of one another, i.e., must assign parameter values only for disjoint sets of units. More generally, consider the graph of experiments in which two experiments are neighbors if they set any of the same parameters. It is then natural to require that all experiments in the same connected component be mutually exclusive. This motivates the idea of using namespaces to group parameters that are expected to be set by experiments in the same connected component of this graph.)

5.2 Experiment management

Experiments can be managed as follows: for each namespace, hash each primary unit to one of a large number (e.g., 10,000) of segments, and then allocate individual segments to experiments. This segment-to-experiment mapping may be maintained in a database or other data storage system. Similar to segmentation systems discussed in prior work [32], when a new experiment is created, it is allocated a random set of segments. These segments are deallocated once the experiment is complete.

Figure 4: Blocks of segments (e.g., multiple buckets of user ids) are assigned to experiments, which map experimental units to parameters in a way that is uncorrelated with the segmentation.

Each experiment's script makes no reference to these segments, such that random assignment to parameter values within each experiment is independent of the segmentation (Figure 4). (This corresponds to what Kohavi et al. [18] call "local randomization". It requires that each experiment have its own dedicated control group, which they identify as the approach's lone disadvantage.) This feature is accomplished via the hashing method described in Section 3.2, and reduces the risk of carryover effects [8] that might occur if whole segments from one experiment were all assigned to the same parameterizations and subsequently reallocated to a different experiment [18, 19].

Namespaces can be represented in experimentation management tools as a dashboard of currently running experiments along with a listing of parameters for that namespace. When an experiment is complete, its segments can become available for future experiments. When iteratively experimenting, new versions of an experiment can be created and allocated new segments. For example, when experimenters need to increase precision by experimenting with more units, engineers can duplicate the experiment definition and allocate additional segments to the experiment. Increases in size are often coupled with changes to the experiment definition, e.g., adding new parameterizations similar to promising versions. Such iteration is generally preferable to modifying the existing experiment, which can frequently produce problems in analysis (see Section 6).
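To make the segment bookkeeping above concrete, here is a minimal, self-contained Python sketch of a namespace that hashes a primary unit into one of 10,000 segments and maps allocated segments to experiments; the class and method names are illustrative stand-ins, not the interface of Facebook's management system.

    import hashlib
    import random

    NUM_SEGMENTS = 10000

    class Namespace:
        """Illustrative namespace: maps primary units to segments, and segments
        to experiments, so that concurrent experiments stay mutually exclusive."""
        def __init__(self, name):
            self.name = name
            self.available = set(range(NUM_SEGMENTS))
            self.segment_to_experiment = {}

        def _segment(self, unit):
            # Deterministic unit -> segment hash, salted by the namespace name.
            key = '{}.{}'.format(self.name, unit).encode('utf-8')
            return int(hashlib.sha1(key).hexdigest(), 16) % NUM_SEGMENTS

        def add_experiment(self, experiment, num_segments):
            # Allocate a random subset of the currently unallocated segments.
            chosen = random.sample(sorted(self.available), num_segments)
            self.available -= set(chosen)
            for seg in chosen:
                self.segment_to_experiment[seg] = experiment

        def get_experiment(self, unit):
            # Returns the experiment for this unit, or None (use launch defaults).
            return self.segment_to_experiment.get(self._segment(unit))

    ns = Namespace('user_signup')
    ns.add_experiment('my_exp', num_segments=500)   # roughly 5% of the population
    print(ns.get_experiment(42))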
5.2.1 Parameter defaults

In some cases, all units will have a parameter set by an experiment. For example, an experimenter working with a new parameter in a new namespace may simply allocate all segments to an experiment. But in other cases, some units will not have a particular parameter set by any experiment. Then the value of the parameter used for those units can be set in some other way. Often, this would reflect the status quo and/or what values are currently believed to be optimal. This same value is likely assigned by other experiments (e.g., if they include "control" conditions). (From a statistical perspective, it may seem that not counting these units as part of a control group makes for inefficient analysis. This is sometimes true, but doing so would place additional requirements for standard analyses to be correct, e.g., requirements on the history of values assigned to units not in experiments. When these requirements are satisfied, an analyst could make use of that data as needed.)

More technically, if a request for a parameter is made but that parameter is not set by an experiment (i.e., the unit is not assigned to an active experiment, or the unit is assigned to an active experiment that does not set the parameter), then some specified launch value is used. If this launch value is not specified, or the experimentation service fails, the default specified in the application code is used instead. Other extensions, such as having launch values vary depending on unit characteristics (e.g., country), may also be implemented in a straightforward way.

5.2.2 Workflow for iterative experimentation

We summarize how the tools presented here fit together by describing an iterative experiment (Figure 5).

Figure 5: An illustration of how namespaces are used to launch experiments. Week 1: launch an initial PlanOut script with many conditions; week 2: launch the same script to more segments; week 4: launch a new script with fewer conditions to more segments. Segments (vertical regions) are randomly allocated to experiments; here we order the segments by which experiment they were allocated to. Segments not allocated to experiments use the default parameter values. Results from each experiment are generally analyzed separately. Horizontal lines distinguish conditions in each experiment; dotted lines indicate the conditions removed in experiment 3.

First, we created a namespace for a particular user interface, implemented front-end code to retrieve parameter values from that namespace immediately before rendering UI elements, and set the default launch values to settings that had previously been hard-coded in PHP. In our case, logging for the outcomes of interest was already instrumented, so no additional instrumentation was needed.

Then, we scripted an initial PlanOut-based experiment and launched it to a small set of users (experiment 1 in Figure 5). After one week of observing that the experiment did not cause statistically significant decreases in key metrics, this same script was launched to roughly 8 times the number of users (experiment 2). Results were initially analyzed using internal tools, and then in greater detail using R. Because the experiment was expected to have long-term effects, results from each experiment were analyzed separately even though they used the same script. We found that the primary outcomes were clearly worse under parameterizations involving a particular parameter value, and that higher values of a second parameter appeared to increase one outcome at the expense of another outcome.

Based on these results, we created a new PlanOut script that did not include the clearly suboptimal parameterizations, and extended the range of the parameter we hypothesized to represent an important tradeoff. This new script was launched in a third experiment to an additional segment of users (experiment 3 in Figure 5), and analyzed for several weeks.
After considering longer-term results from the three experiments, we decided on a parameterization to use as a default for all users. We de-allocated all segments, set a new default parameterization, and created a new, smaller experiment that assigned users to the new and original parameterizations with equal probability. This fourth experiment, commonly referred to as a "backtest", is used to evaluate the efficacy of the launch decision after a long period of time.

While this type of backtest is easy to implement and avoids the potential for downstream errors in analysis, there are a number of possible ways we could have run the backtest. If a prior experiment (e.g., experiment 3) had reasonable power for the comparison between the old and new defaults, we could continue to run that experiment and disable all other parameterizations. Users assigned to disabled parameterizations would take on the new default parameterization, but would not be used in the subsequent analysis of the experiment. This approach can be particularly attractive if the experiment is expected to have time-varying effects, or if one wanted to minimize changes to individual users' experiences. A third option would involve running a new experiment in the unallocated segments, if there are enough such segments after experiments 1–3.

5.3 Integration with application code

5.3.1 Application programming interface

At runtime, application code needs to retrieve parameter values within a namespace for particular units. This can be done by constructing an object which interfaces with management and logging components. For example, retrieving the parameter collapse_story associated with a particular viewer–story pair in the comment_box namespace might be invoked by instantiating an experiment object for the input units, and requesting the needed parameters:

    exp = getExp('comment_box',
      {'viewerid': vid, 'storyid': sty})
    collapse_story = exp.get('collapse_story')

The management system would then map the units to a segment within the comment_box namespace, which gets mapped to an experiment and its respective PlanOut script. The script is executed with the input data, and if the script sets a parameter requested by get(), the value is returned and the event is logged. If the requested parameter is not set, then the parameter default (described in Section 5.2.1) is used. Because assignment procedures bear some computational cost, and are generally deterministic, parameter assignments can be cached.
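The fallback-and-caching behavior just described might look roughly like the following self-contained Python sketch; the class name, the launch-value store, and the cache policy are illustrative assumptions rather than the actual interface.

    class ParameterFetcher:
        """Illustrative sketch: retrieve experiment-assigned parameters with
        caching and fallback to launch values or application defaults."""
        def __init__(self, script, launch_values):
            self.script = script            # callable returning {param: value}, or None
            self.launch_values = launch_values
            self._cache = None              # assignments are deterministic, so cache them

        def get(self, name, app_default=None):
            if self._cache is None:
                self._cache = self.script() if self.script else {}
            if name in self._cache:
                return self._cache[name]
            # Parameter not set by any active experiment: use the launch value
            # if one is configured, otherwise the default from application code.
            return self.launch_values.get(name, app_default)

    # A unit not allocated to any experiment falls through to defaults.
    fetcher = ParameterFetcher(script=None, launch_values={'button_color': '#3c539a'})
    print(fetcher.get('button_color'))        # launch value
    print(fetcher.get('collapse_story', 0))   # application-code default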
5.3.2 Testing experiments and parameter values

When designing experiments, engineers often need to be able to test a service under a range of parameter values. While PlanOut scripts can be difficult to interact with directly (as they are not written in a native language, like PHP), they can still be tested and debugged in situ with a small amount of additional infrastructure. In particular, by providing developers with a way to override or "freeze" parameters so that they maintain a prespecified value throughout a PlanOut script's execution, one can test assignment to conditions even if few units (or combinations of units) are assigned to them, without modifying any application or PlanOut code.

This functionality can be surfaced to Web developers via URL query parameters. Freezing the has_feed_stories parameter to 1 in the voter turnout experiment described in Section 4.2.1 (running within the vote2012 namespace) may then be accomplished by accessing a URL like:

    http://...php?ns_vote2012=has_feed_stories:1

Freezing also allows one to test downstream effects of different inputs. Overriding userid or has_banner may in turn change whether feed stories are shown. Combinations of parameters can also be frozen by specifying a list of parameters to be set, e.g., has_banner:1,has_feed_stories:0. Overrides for mobile applications or backend services may alternatively be set through server-side management tools.

5.4 Logging

Logging occurs automatically when get() is called, so that there is a record of the exposure of units to the experiment. By default, the namespace, the experiment name, all input data, and variables set by the PlanOut script are logged. This type of exposure logging has a number of benefits, including simplifying downstream data analysis and increasing statistical power by distinguishing between units that may have been affected by assignment and those that are known to be unaffected. (For example, many users who are assigned to be in an experiment may not actually arrive at the part of the site that triggers the manipulation, and thus their outcome data should not be included in analysis in the normal way.) As with experimental assignment, caching of prior exposures helps reduce load on experimental infrastructure. It is also sometimes desirable to log auxiliary information not related to the assignment, including user characteristics or events (i.e., "conversions"). This might be done through a separate method (e.g., log()) in the experiment object.

Exposure logging, combined with the management system, prevents a number of common pitfalls we have observed at Facebook. These benefits are discussed in the following section.
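As a rough sketch of the exposure-logging behavior described above (illustrative only; the logging schema and method names are assumptions), an experiment object might record a single exposure event the first time any parameter is requested, and expose a separate method for auxiliary events:

    import json
    import time

    class LoggedExperiment:
        """Illustrative exposure logging: the first get() call records the
        namespace, experiment name, input units, and all assigned parameters."""
        def __init__(self, namespace, name, units, assignments):
            self.namespace = namespace
            self.name = name
            self.units = units
            self.assignments = assignments
            self._exposure_logged = False

        def _write(self, record):
            print(json.dumps(record))  # stand-in for a real logging pipeline

        def get(self, param, default=None):
            if not self._exposure_logged:
                self._exposure_logged = True   # log each exposure only once
                self._write({'event': 'exposure', 'time': int(time.time()),
                             'namespace': self.namespace, 'experiment': self.name,
                             'inputs': self.units, 'params': self.assignments})
            return self.assignments.get(param, default)

        def log_event(self, event, extras=None):
            # Auxiliary (non-exposure) events, e.g., conversions.
            self._write({'event': event, 'experiment': self.name,
                         'inputs': self.units, 'extras': extras or {}})

    exp = LoggedExperiment('comment_box', 'collapse_v1',
                           {'viewerid': 7, 'storyid': 1001}, {'collapse_story': 1})
    exp.get('collapse_story')
    exp.log_event('comment', {'chars': 42})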
6. ANALYZING EXPERIMENTS

While domain-specific aims and especially complex experimental designs will often require custom analysis, much analysis of online experiments can be automated in support of their routine and valid use in decision making. This eliminates common sources of errors in analysis, makes results more directly comparable, and reduces the burden of running additional experiments. This kind of automation is easy to accomplish with PlanOut experiments because their scripts directly encode a representation of their parameters, values, and design. A complete description of accompanying systems for analyzing experiments is beyond the scope of this paper, but we discuss how the design of PlanOut interacts with common analyses, automated or not.

Logging only users who have received the treatment (versus analyzing the entire population who could potentially have been exposed) improves statistical inference in two ways. First, it can substantially decrease the variance of estimated treatment effects when the number of users exposed to an experimental manipulation is small relative to the number who are assigned (e.g., in the case of a less commonly used or new feature). This reduces the risk of Type II errors, in which an experiment has an effect, but experimenters are unable to detect that effect. Second, exposure logging focuses experimenters on a more relevant sub-population whose outcomes are plausibly affected by the treatment.

Explicit logging of labeled input units and assigned parameter values affords flexibility in terms of automated analysis. Many relatively simple designs can be fruitfully analyzed by computing summary statistics for outcomes of interest for each unique combination of parameter values. Since parameters are logged for each exposure, analyses of full or fractional factorial designs that make use of this structure can also be automated. For example, questions about main effects of factors (e.g., "does button color have any average effect?") can be answered via an analysis of variance. Representation of the experiment in terms of parameters can also make it easier to automatically use this structure in estimating expected outcomes for each condition. For example, systems can fit penalized regression models with the main effects and all relevant interactions, thus "borrowing strength" across conditions that have the same values for some parameters [15]. These types of model-based inference also help reduce the risk of not being able to detect clinically significant changes. We have found that when these forms of analyses are not possible, engineers and decision makers tend to avoid more complex experimental designs, like factorial designs, because they tend to be underpowered, even though they have a number of benefits for improving understanding and identifying optimal treatments.

6.1 Analyzing iterative experiments

Iteration on experiments with PlanOut occurs primarily through creating new experiments that are variants of previous experiments. By default, these experiments are then analyzed separately, which avoids several problems that can occur when attempting to pool data from before and after a change in experimental design. For example, adding additional users to an existing experiment but assigning them to new conditions means that these users are first exposed to their treatment more recently than other users; comparisons with other conditions can be biased by, e.g., novelty effects or cumulative treatment effects.

There are some cases where two similar experiments can be analyzed together. For example, if two experiments' PlanOut scripts are identical but have different numbers of segments allocated to them and are started at different times, they can be pooled together to increase power, though this changes what is estimated to a weighted average of potentially time-varying effects. This may result in underestimation or overestimation of treatment effects, and highlights the ways in which gradual product rollouts might be better represented as experiments. More sophisticated automatic selection of analyses that pool data across experiments remains an area for future work.

6.2 Units of analysis

Many online experiments use a small number of standard types of units, for which outcomes of interest may already be available in data repositories. For example, most experiments at Facebook involve random assignment of user IDs, and the standard desired analysis involves analysis of behaviors associated with user IDs. Other cases can be more complex. For example, an experiment may randomly assign users and advertisements, or userid–advertisementid pairs, to parameter values, but it may be necessary for an analysis to account for dependence in multiple observations of the same user or ad to obtain correct confidence intervals and hypothesis tests [2, 11]. Inspection of a script can identify the units for which different parameters are randomized, which can be used in subsequent selection of methods for statistical inference.
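To illustrate the kind of automated per-condition summaries and main-effects analysis discussed in this section, the following Python sketch operates on a hypothetical exposure log joined with an outcome; the column names and synthetic data are invented, and an ordinary least squares fit with statsmodels is only one of several reasonable ways to examine main effects and interactions.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical exposure log joined with an outcome (column names are invented).
    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        'button_color': rng.choice(['#3c539a', '#5f9647', '#b33316'], 5000),
        'button_text': rng.choice(['Sign up', 'Join now'], 5000, p=[0.8, 0.2]),
    })
    df['signup'] = rng.binomial(1, 0.05, 5000)

    # Summary statistics for each unique combination of parameter values.
    print(df.groupby(['button_color', 'button_text'])['signup']
            .agg(['mean', 'std', 'count']))

    # Main effects and their interaction, in the spirit of an analysis of variance.
    model = smf.ols('signup ~ C(button_color) * C(button_text)', data=df).fit()
    print(model.summary())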
7. DISCUSSION

Randomized field experiments are a widely used tool in the Internet industry to inform decision-making about product design, user interfaces, ranking and recommendation systems, performance engineering, and more. Effective use of these experiments for understanding user behavior and choosing among product designs can be aided by new experimentation tools. We developed PlanOut to support such scalable, randomized parameterization of the user experience. Our goal has been to motivate the design of PlanOut using our experiences as experimenters and by demonstrating its ability to specify experimental designs, both simple and complex, from our work and the literature.

One aim of conceptualizing experiments in terms of parameters and enabling more complex experimental designs is that online experiments can be more effectively used for understanding causal mechanisms and investigating general design principles, rather than simply choosing among already built alternatives. That is, PlanOut aims to support uses of randomized experiments more familiar in the sciences than in the Internet industry. For example, the primary purpose of the social cues experiment described in Section 4.1.2 is not to decide whether it is better to show fewer social cues alongside ads (doing so was expected to and did reduce desired behaviors), but to estimate quantities that are useful for understanding an existing service, allocating design and engineering resources, and anticipating effects of future changes.

In addition to being a means for deploying Internet-scale field experiments, we have found PlanOut to be a useful aid for describing and collaborating on the design of complex experiments, well before they are deployed. We hope others will also find the notation to be a clear way to describe their experiments, whether in face-to-face settings or in documentation of published research. The PlanOut language itself may also be applicable to other types of experiments, such as those conducted on Amazon Mechanical Turk [25].

There are important limitations to online experiments in general and PlanOut in particular. As others have argued, randomized experiments cannot effectively replace all other methods for learning from current and potential users [28] and anticipating effects of future interventions [13]. Most notably, random assignment of users to a new version of a service requires that that version is built and of sufficient quality. Nielsen [28] additionally argues that A/B tests encourage short-term focus and do not lead to behavioral insights. While this is perhaps a fair critique of many widespread experimentation practices, PlanOut is designed to run experiments that lead to behavioral insights, modeling, and long-term learning.

Even though field experiments are the gold standard for causal inference, their results are also subject to limitations. Because the underlying effects can be heterogeneous and dynamic, results of a field experiment from one time and one population may not generalize to new times and populations [5, 23, 35]. One hope that we have for PlanOut is that it encourages more sophisticated behavioral experiments that allow estimation of parameters that are more likely to generalize to future interfaces. PlanOut and the associated infrastructure also make it easy to replicate prior experiments.
Finally, standard experimental designs and analyses do not account for one unit's outcomes being affected by the assignment of other units (e.g., because of peer influence and network effects) [1, 29]. In the presence of such interference, user behavior can substantially change post-launch as connected users interact with one another [34].

PlanOut has more specific limitations with respect to designs where one unit's assignment depends on the assignment of a large number of other units. Random assignment schemes that involve optimizing global characteristics of the experimental design are thus more difficult to implement directly using built-in operators. This includes pre-stratified or block-randomized designs [8, 16] that use sampling without replacement in a prior, offline assignment procedure, but in online experiments these designs usually offer minimal precision gains. (The difference between the variance of a difference in means from a pre-stratified design and that of a post-stratified estimator with an unstratified design is of order 1/n^2 [26]; this difference is thus of little importance for large experiments.) Assignment schemes such as graph cluster randomization [34], which involves partitioning the social network of users, require offline computation. In such cases, PlanOut may simply be a useful framework for providing a consistent interface for accessing information about units (e.g., the results of the graph partitioning) that has been computed offline through a custom operator, and then assigning parameter values based on that information. Sequential experimentation techniques, such as multi-armed bandit heuristics [30], are another such example where custom operators are generally needed.

We have only briefly discussed how online experiments should be analyzed. From our experience, the availability of easy-to-use tools for routine analysis of experiments has been a major factor in the adoption of randomized experiments across product groups at Facebook, so the automation of analysis deserves further attention. This also suggests another area for future research: How should we best evaluate new tools for designing, running, and analyzing experiments? We have primarily done so by appealing to prior work, our own professional experiences, and by demonstrating the expressiveness of the PlanOut language. Deciding whether an experiment was "successful" or effective can depend on broader organizational context and hard-to-trace consequences as an experiment's results diffuse throughout an organization. Some of the most effective experiments directly inform decisions to set the parameters they manipulate, but other well-designed experiments can be effective through broader, longer-term influences on the beliefs of designers, developers, scientists, and managers.

8. ACKNOWLEDGEMENTS

As described here, PlanOut is only a small piece of the broader set of experimentation tools created by our colleagues. At Facebook, PlanOut runs as part of QuickExperiment, a framework developed by Breno Roberto and Wesley May. The perspective we take on how experiments should be logged and managed is greatly influenced by previous tools at Facebook and conversations with Daniel Ting, Wojciech Galuba, and Wesley May. The design of PlanOut was influenced by conversations with John Fremlin, Brian Karrer, Cameron Marlow, Itamar Rosenn, and those already mentioned. Finally, we would like to thank Brian Davison, René Kizilcec, Winter Mason, Solomon Messing, Daniel Ting, and John Myles White for their comments on this paper. René Kizilcec built the PlanOut GUI in Figure 1(b). John Fremlin built the PlanOut DSL to JSON compiler.
9. REFERENCES

[1] Aronow, P., and Samii, C. Estimating average causal effects under interference between units. Manuscript, 2013.
[2] Bakshy, E., and Eckles, D. Uncertainty in online experiments with dependent data: An evaluation of bootstrap methods. In Proc. of the 19th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, ACM (2013).
[3] Bakshy, E., Eckles, D., Yan, R., and Rosenn, I. Social influence in social advertising: Evidence from field experiments. In Proc. of the 13th ACM Conference on Electronic Commerce, ACM (2012), 146–161.
[4] Bakshy, E., Rosenn, I., Marlow, C., and Adamic, L. The role of social networks in information diffusion. In Proc. of the 21st International Conference on World Wide Web, ACM (2012), 519–528.
[5] Bareinboim, E., and Pearl, J. Transportability of causal effects: Completeness results. In Proc. of the Twenty-Sixth National Conference on Artificial Intelligence, AAAI (2012).
[6] Beenen, G., Ling, K., Wang, X., Chang, K., Frankowski, D., Resnick, P., and Kraut, R. E. Using social psychology to motivate contributions to online communities. In Proc. of the 2004 ACM Conference on Computer Supported Cooperative Work, CSCW '04, ACM (2004), 212–221.
[7] Bond, R. M., Fariss, C. J., Jones, J. J., Kramer, A. D. I., Marlow, C., Settle, J. E., and Fowler, J. H. A 61-million-person experiment in social influence and political mobilization. Nature 489, 7415 (2012), 295–298.
[8] Box, G. E., Hunter, J. S., and Hunter, W. G. Statistics for Experimenters: Design, Innovation, and Discovery, vol. 13. Wiley Online Library, 2005.
[9] Bryan, C. J., Walton, G. M., Rogers, T., and Dweck, C. S. Motivating voter turnout by invoking the self. Proc. of the National Academy of Sciences 108, 31 (2011), 12653–12656.
[10] Burke, M., Marlow, C., and Lento, T. Feed me: Motivating newcomer contribution in social network sites. In Proc. of the SIGCHI Conference on Human Factors in Computing Systems, CHI '09, ACM (2009), 945–954.
[11] Cameron, A., Gelbach, J., and Miller, D. Robust inference with multi-way clustering. Journal of Business & Economic Statistics 29, 2 (2011), 238–249.
[12] Crook, T., Frasca, B., Kohavi, R., and Longbotham, R. Seven pitfalls to avoid when running controlled experiments on the web. In Proc. of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM (2009), 1105–1114.
[13] Deaton, A. Instruments, randomization, and learning about development. Journal of Economic Literature (2010), 424–455.
[14] Farahat, A., and Bailey, M. C. How effective is targeted advertising? In Proc. of the 21st International Conference on World Wide Web, ACM (2012), 111–120.
[15] Gelman, A. Analysis of variance: Why it is more important than ever. The Annals of Statistics 33, 1 (2005), 1–53.
[16] Gerber, A. S., and Green, D. P. Field Experiments: Design, Analysis, and Interpretation. W. W. Norton, 2012.
[17] Holland, P. W. Causal inference, path analysis, and recursive structural equations models. Sociological Methodology 18 (1988), 449–484.
[18] Kohavi, R., Deng, A., Frasca, B., Longbotham, R., Walker, T., and Xu, Y. Trustworthy online controlled experiments: Five puzzling outcomes explained. In Proc. of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM (2012), 786–794.
[19] Kohavi, R., Longbotham, R., Sommerfield, D., and Henne, R. Controlled experiments on the web: Survey and practical guide. Data Mining and Knowledge Discovery 18, 1 (2009), 140–181.
[20] Kulkarni, C., and Chi, E. All the news that's fit to read: A study of social annotations for news reading. In Proc. of the SIGCHI Conference on Human Factors in Computing Systems, ACM (2013), 2407–2416.
[21] Lewis, R. A., Rao, J. M., and Reiley, D. H. Here, there, and everywhere: Correlated online behaviors can lead to overestimates of the effects of advertising. In Proc. of the 20th International Conference on World Wide Web, ACM (2011), 157–166.
[22] Li, L., Chu, W., Langford, J., and Schapire, R. E. A contextual-bandit approach to personalized news article recommendation. In Proc. of the 19th International Conference on World Wide Web, ACM (2010), 661–670.
[23] Manzi, J. Uncontrolled: The Surprising Payoff of Trial-and-Error for Business, Politics, and Society. Basic Books, 2012.
[24] Mao, A., Chen, Y., Gajos, K. Z., Parkes, D., Procaccia, A. D., and Zhang, H. TurkServer: Enabling synchronous and longitudinal online experiments. Proc. HCOMP '12 (2012).
[25] Mason, W., and Suri, S. Conducting behavioral research on Amazon's Mechanical Turk. Behavior Research Methods 44, 1 (2012), 1–23.
[26] Miratrix, L. W., Sekhon, J. S., and Yu, B. Adjusting treatment effect estimates by post-stratification in randomized experiments. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 75, 2 (2013), 369–396.
[27] Morgan, S. L., and Winship, C. Counterfactuals and Causal Inference: Methods and Principles for Social Research. Cambridge University Press, 2007.
[28] Nielsen, J. Putting A/B testing in its place, 2005. http://www.nngroup.com/articles/putting-ab-testing-in-its-place
[29] Rubin, D. B. Statistics and causal inference: Comment: Which ifs have causal answers. Journal of the American Statistical Association 81, 396 (1986), 961–962.
[30] Scott, S. L. A modern Bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry 26, 6 (2010), 639–658.
[31] Shadish, W. R., and Cook, T. D. The renaissance of field experimentation in evaluating interventions. Annual Review of Psychology 60, 1 (2009), 607–629.
[32] Tang, D., Agarwal, A., O'Brien, D., and Meyer, M. Overlapping experiment infrastructure: More, better, faster experimentation. In Proc. of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM (2010), 17–26.
[33] Taylor, S. J., Bakshy, E., and Aral, S. Selection effects in online sharing: Consequences for peer adoption. In Proc. of the Fourteenth ACM Conference on Electronic Commerce, EC '13, ACM (2013), 821–836.
[34] Ugander, J., Karrer, B., Backstrom, L., and Kleinberg, J. M. Graph cluster randomization: Network exposure to multiple universes. In Proc. of KDD, ACM (2013).
[35] Watts, D. Everything Is Obvious: *Once You Know the Answer. Crown Publishing Group, 2011.
