Counterfactual Estimation and Optimization of Click Metrics for Search Engines
Lihong Li¹, Shunbao Chen¹, Jim Kleban²*, Ankur Gupta¹
¹ Microsoft Inc., Redmond, WA 98052. {lihongli,shchen,ankurg}@microsoft.com
² Facebook Inc., Seattle, WA 98101. jim.kleban@gmail.com

ABSTRACT
Optimizing an interactive system against a predefined online metric is particularly challenging when the metric is computed from user feedback such as clicks and payments. The key challenge is its counterfactual nature: in the case of Web search, any change to a component of the search engine may result in a different search result page for the same query, but we normally cannot infer reliably from search logs how users would react to the new result page. Consequently, it appears impossible to accurately estimate online metrics that depend on user feedback, unless the new engine is run to serve users and compared with a baseline in an A/B test. This approach, while valid and successful, is unfortunately expensive and time-consuming. In this paper, we propose to address this problem using causal inference techniques under the contextual-bandit framework. This approach effectively allows one to run (potentially infinitely) many A/B tests offline from search logs, making it possible to estimate and optimize online metrics quickly and inexpensively. Focusing on an important component of a commercial search engine, we show how these ideas can be instantiated and applied, and obtain very promising results that suggest the wide applicability of these techniques.
Categories and Subject Descriptors
G.3 [Mathematics of Computing]: Probability and Statistics—Experimental Design; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval; H.3.3 [Information Storage and Retrieval]: Online Information Services

General Terms
Experimentation, Performance

Keywords
Experimental design, counterfactual analysis, information retrieval, contextual bandits

* This work was done when J. Kleban was with Microsoft.

1. INTRODUCTION
The standard approach to evaluating the ranking quality of a search engine is to evaluate its ranking results on a set of human-labeled examples and compute relevance metrics like mean average precision (MAP) [1] and normalized discounted cumulative gain (NDCG) [17]. Such an approach has been highly successful at facilitating easy comparison and improvement of ranking functions (e.g., [6, 32, 34]). However, such offline relevance metrics have a few limitations. First, there can be a mismatch between users' actual information need and the relevance judgments of human labelers. For example, for the query "tom cruise," it is natural for a judge to give a high relevance score to the actor's official website, http://tomcruise.com. However, search logs from a commercial search engine suggest the opposite: users who issue that query are often more interested in news about the actor, not the official website.¹ Second, in some applications like personalized search [25] and recency search [11], judges simply lack the information to provide sensible labels. Third, user experience with a search engine relies on both the ranking function and other modules like user interfaces. Relevance labels for query-document pairs only reflect one aspect of a search engine's overall quality.
Finally, an important factor in today's search engines is monetization performance (from advertising), which cannot be easily judged by human labelers.

All the challenges above imply a strong need for considering user feedback in evaluating, and potentially optimizing, a search engine. For example, user behavior like clicks is used to infer personalized relevance for evaluation purposes [31], and to compare two ranking systems by interleaving [8]. Unfortunately, metrics that depend on user feedback are hard to estimate offline, due to their counterfactual nature. For example, suppose we are interested in measuring the time-to-first-click metric. When we change any part of the search engine, the final search engine result page (SERP) for a particular query may be different, and hence users' click behavior may change as well. Based on search logs, it is often challenging to infer what a user would have done for a SERP different from the one in the log. Prediction errors of state-of-the-art user click models (e.g., [9, 14] and their variants) are likely much larger than the typical improvements of click-based metrics in today's commercial search engines, which are already highly optimized. Therefore, offline evaluation based on such user models may not always be reliable.

In practice, the common solution is to run a controlled experiment (a.k.a. an A/B test). Specifically, one randomly splits users into two statistically identical groups, known as control and treatment, respectively. Users in the control group are served by a baseline search engine, while users in the treatment group are served by a modified engine (which often differs from the baseline in one component). The experiment may last for days or weeks, at the end of which online metrics (like click-through rate and time to first click) of the two systems are calculated.

¹ The example is from Imed Zitouni.
One then reaches a conclusion about whether the modified engine is better than the baseline at a certain statistical significance level.

Controlled experiments have proved very successful in practice (e.g., [18]), allowing engineering and business decisions to be made in a data-driven manner. However, these experiments usually require nontrivial engineering resources and are time-consuming: since the experiments are run on real users, significant effort is needed to avoid surprises. Furthermore, when trying to optimize an online click metric, one often takes a guess-then-check approach: an easy-to-compute proxy metric (like NDCG) is used offline to obtain a modified engine, which is hoped to do well later in the controlled experiment in terms of the click metric of real interest. Because it is only an approximation, the proxy metric can be misleading in determining which modified system to run in the experiment. Combined with the long turnaround time of A/B tests, this indirect optimization procedure can be rather inefficient.

In this paper, we advocate the use of causal inference techniques from statistics to perform unbiased offline evaluation of click metrics for search engines. Compared to A/B tests, offline evaluation allows multiple models to be evaluated on the same search log, without the need to run them online. Effectively, this technique makes it possible to run many A/B tests simultaneously, leading to a substantial increase in experimentation agility, and even to optimize against the online metrics directly. To the best of our knowledge, this work is the first to validate the possibility of offline evaluation in a live, commercial search engine, and to use such offline evaluation as a subroutine for offline optimization. The rest of the paper is organized as follows.
Section 2 describes the contextual bandit as a general framework that captures a number of interactive problems, including many in Web search. Section 3 describes the basic technical idea of offline evaluation, and discusses solutions to a few important issues that arise in practice. Section 4 gives details of a case study in a commercial search engine. Section 5 discusses related work. Finally, Section 6 concludes the paper.

2. CONTEXTUAL-BANDIT FORMULATION
The contextual-bandit formalism [2, 20] generalizes classic multi-armed bandits by introducing contextual information into the interaction loop between a learner and the environment it is situated in. It has proved useful for modeling many important applications where such interaction is present, such as online advertising [20] and content recommendation [21]. Formally, we denote by A = {1, 2, ..., K} a set of actions. A contextual bandit describes a round-by-round interaction between a learner and the environment. At each round:

• The environment chooses contextual information x ∈ X and a reward signal r_a ∈ [0, 1] for each action a ∈ A, where X is the set of possible contexts. Let r = (r_1, ..., r_K) be the reward vector. Only x is revealed to the learner. It is assumed that (x, r) is drawn i.i.d. from some unknown distribution D.

• Upon observing x, the learner chooses an action a ∈ A, and in return observes the corresponding reward r_a.

A common goal for a contextual-bandit learning algorithm is to optimize its action-selection policy, denoted π : X → A, in order to maximize the expected reward it receives through interaction with the environment. For convenience, we call V(π) := E_{(x,r)∼D}[ r_{π(x)} ] the value of policy π, which measures how much per-round reward the policy receives on average.
If we run π to choose actions for T rounds and observe the corresponding reward in every round, the value of π can be estimated by averaging the observed rewards, and this estimate converges almost surely to V(π) as T increases.

As an example, consider federated search, where a search engine needs to decide, given a query, whether (and where) to include vertical search results like news and images on the final SERP (e.g., [10]). Here, the context contains the submitted query, user profile, and possibly other information. The actions are the ways to combine the vertical search results with Web search results. The reward is often a click-through signal (or one of its variants) on the vertical results. A basic problem in federated search is then to optimize the policy that decides what to do with vertical search results given the current context, in order to maximize the average reward. In Section 4, we will study in more detail another important component of a search engine.

An important observation about contextual bandits is that only the rewards of chosen actions are observed. An online-learning algorithm must therefore find a good exploration/exploitation tradeoff, a defining challenge in bandit problems. For offline policy evaluation, such partial observability raises a related difficulty. Data in a contextual bandit is often in the form of (x, a, r_a), where a is the action chosen for context x when collecting the data, and r_a is the corresponding reward. If this data is used to evaluate a policy π that chooses a different action π(x) ≠ a, then we simply do not have the reward signal to evaluate the policy in that context. The focus of this paper is to study how unbiased offline policy evaluation can be done when optimizing search engines.
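The interaction protocol above is easy to simulate. The following minimal sketch uses an invented toy context distribution and reward function (both are illustrative assumptions, not part of the paper): a fixed policy π is run for T rounds, only the chosen action's reward is observed, and the running average of observed rewards estimates V(π).

```python
import random

K = 3  # size of the action set A = {0, ..., K-1}

def environment_step(rng):
    """Draw (x, r_vec) i.i.d. from a toy distribution D (invented for illustration)."""
    x = rng.randrange(5)                                    # context
    r_vec = [1.0 if a == x % K else 0.2 for a in range(K)]  # reward of each action
    return x, r_vec

def policy(x):
    """A fixed deterministic policy pi: X -> A (here: always play action 0)."""
    return 0

def estimate_value(T, seed=0):
    """Run pi for T rounds; only the chosen action's reward is revealed.
    The average converges almost surely to V(pi) as T grows."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(T):
        x, r_vec = environment_step(rng)
        a = policy(x)
        total += r_vec[a]   # partial observability: r_vec[a] is all we see
    return total / T

print(round(estimate_value(100_000), 3))  # close to V(pi) = 0.4*1.0 + 0.6*0.2 = 0.52
```

For this toy distribution V(π) can be computed in closed form (0.52), so the convergence of the running average is easy to check; the federated-search setting replaces the toy pieces with real queries, layout actions, and vertical-click rewards.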
3. UNBIASED OFFLINE EVALUATION
3.1 Basic Techniques
As observed previously [5], offline policy evaluation can be interpreted as a causal inference problem, an important research topic in statistics. Here, we try to infer the average reward V(π) (the causal effect) if policy π is used to choose actions in the bandit problem (the intervention). The approach we take in this paper relies on randomized data collection, which is known to be a critical condition for drawing valid causal conclusions [27]. Data collection proceeds as follows. At each round:

• The environment chooses (x, r) i.i.d. from some unknown distribution D, and reveals only the context x.

• Based on x, one computes a multinomial distribution p := (p_1, p_2, ..., p_K) over the actions A. A random action a is drawn according to this distribution, and the corresponding reward r_a and probability mass p_a are logged. Note that p may depend on x; how to select the distribution p is discussed later.

At the end of the process, we have a set D containing data of the form (x, a, r_a, p_a). We will call this kind of data "exploration data," since every action has some nonzero probability of being explored in the collection process. In statistics, the probabilities p_a are also known as propensity scores. To evaluate the value of a policy π offline, the following estimator is used:

    V̂_offline(π) := (1/|D|) Σ_{(x,a,r_a,p_a) ∈ D} r_a · I(π(x) = a) / p_a ,    (1)

where I(C) is the indicator function that evaluates to 1 if condition C holds true, and 0 otherwise. The key observation behind this estimator is that, for any context x, if one chooses action a randomly according to the distribution p, then

    r_{π(x)} = E_{a∼p}[ r_a · I(π(x) = a) / p_a ].
With this equality, one can show that the offline estimator is unbiased [21]: E_D[ V̂_offline(π) ] = V(π) for any π, provided that every component of p is nonzero. In other words, as long as we can randomize action selection, we can construct an unbiased estimate for any policy without even running it on users. This benefit is highly desirable, since the offline evaluator allows one to simulate many A/B tests in a fast and inexpensive way.

3.2 How to Randomize Data
The unbiasedness guarantee holds for any probability distribution p, as long as none of its components is zero. However, the variance of our offline evaluator depends critically on this distribution. The evaluator gives more accurate (and thus more reliable) estimates when the variance is lower. It follows from the definitions that the offline evaluator has variance

    Var[ V̂_offline(π) ] = (1/|D|) · E_{(x,r)∼D}[ r²_{π(x)} · (1/p_{π(x)} − 1) ].

Therefore, the variance is smaller when we place more probability mass on the actions chosen by policy π. In reality, however, one typically does not know π ahead of time when data are being collected, and there may be multiple policies to be evaluated on the same exploration data. One natural choice, adopted by some authors [21, 22], is to minimize the worst-case variance, leading to a uniform distribution: p_a ≡ 1/K for all a. There are at least two limitations to this choice. First, choosing an action uniformly at random may be too risky for user experience, unless one knows a priori that every action is reasonably good. Second, when improving an existing policy that is already working reasonably well, any improvement is likely not to differ too much from it. Minimizing the worst-case variance may not yield the best variance reduction in reality.
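Equation 1 and its unbiasedness are straightforward to check numerically. In the sketch below, the toy context and reward distributions are invented for illustration: exploration tuples (x, a, r_a, p_a) are logged under a fixed randomized logging distribution, a different policy π is then evaluated offline, and a naive variant that simply averages rewards over matching impressions (the "biased" estimator revisited in Section 4.4) is included for contrast.

```python
import random

K = 3
CONTEXTS = range(5)

def reward(x, a):
    """Deterministic toy reward: action x % K is correct; payoff grows with x."""
    return (x + 1) / 5.0 if a == x % K else 0.0

def collect(n, p_vec, seed=0):
    """Randomized data collection: log (x, a, r_a, p_a) with propensity p_a."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.randrange(5)
        a = rng.choices(range(K), weights=p_vec)[0]
        data.append((x, a, reward(x, a), p_vec[a]))
    return data

def v_offline(policy, data):
    """Equation (1): inverse-propensity-scored estimate of V(pi)."""
    return sum(r * (policy(x) == a) / p for x, a, r, p in data) / len(data)

def v_biased(policy, data):
    """Naive variant: average reward over impressions whose logged action
    agrees with pi; this ignores the sampling bias in D."""
    matched = [r for x, a, r, p in data if policy(x) == a]
    return sum(matched) / len(matched)

pi = lambda x: x % K                            # target policy, never run online
data = collect(200_000, [0.5, 0.3, 0.2])
true_v = sum(reward(x, pi(x)) for x in CONTEXTS) / len(CONTEXTS)   # = 0.6
print(true_v, round(v_offline(pi, data), 3), round(v_biased(pi, data), 3))
```

Because the logging distribution is non-uniform, the inverse-propensity estimate lands on the true value 0.6 while the naive average drifts away (here towards roughly 0.58), which is exactly the effect labeled "Offline (biased)" in Figure 2.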
These two concerns suggest a more conservative data collection procedure that can be more effective than the uniform distribution. Intuitively, given a baseline policy (like the existing policy in production), we may inject randomization into it to generate randomized actions that are close to the baseline policy. The precise way of doing this depends on the problem at hand. In Section 4.2, we describe a sensible approach that has worked well, and which is expected to be useful in other scenarios.

Finally, in order to prevent unbounded variance, it is helpful to have propensity scores that are not too close to 0, by ensuring a lower bound p_min > 0. If such a condition cannot be met (say, due to system constraints), one can still replace p_a with max{p_min, p_a} in Equation 1, as done by other authors [29]. Such a thresholding trick may introduce a small bias into the offline estimator, but can drastically decrease its variance, so that the overall mean squared error is reduced.

3.3 How to Verify Propensity Scores
As shown in Equation 1, it is necessary to compute and log the propensity scores p_a. Any errors in the calculation and/or logging of the scores can lead to a bias in the final offline estimator. Furthermore, since the reciprocal of the score is used in the estimate, even a small error in the score can lead to a much larger bias in the offline estimator when p_a is close to 0. When the system is complex, it is sometimes nontrivial to get the scores correct, even if a uniform distribution (i.e., p_a ≡ 1/K) is used. It is thus important to verify propensity scores before trusting the offline estimates. In our experience, this verification turns out to be one of the most critical, and sometimes challenging, steps in doing offline evaluation.

One solution² is to obtain and log a randomization seed whenever an action is chosen. Specifically, in each round of data collection (e.g., Section 3.2), we choose a seed s (which may be a function of the context x, a timestamp, etc.) and use it to reset the internal state of a pseudo-random number generator. Then, we use the generator to select a random action from the multinomial distribution p. The final data have the form (x, s, a, p, r_a), containing more information to facilitate offline verification. When we want to verify the propensity scores, we may simply use the seed s to reproduce the randomized data collection process, and check consistency among s, a, and p.

² Proposed by Leon Bottou et al., in private communication.

An alternative, somewhat simpler approach does not require resetting the pseudo-random number generator in every round of data collection. Instead, it runs simple statistical tests to detect inconsistency. Such an approach has been quite useful in our experience, although we note that it detects some but not all data issues.

One such test, which we call an arithmetic mean test, is to compare the number of times a particular action a* ∈ A appears in the data to the expected number of occurrences conditioned on the logged propensity scores. Concentration inequalities like Hoeffding's [15] can be used to estimate whether the gap between the two quantities is statistically significant. If the gap is significant, it indicates errors in the randomized data collection process.

Another test, which we have found useful, is based on the following observation: for any context x and action a*,

    E_{a∼p}[ I(a = a*)/p_{a*} + I(a ≠ a*)/(1 − p_{a*}) ] = 2.

Therefore, we may compare the empirical mean of the above random variable on the data with its expected value, 2. Again, the statistical significance of the gap can be estimated with concentration inequalities. Since the condition above involves harmonic means of propensity scores, we call this a harmonic mean test.
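Both tests can be sketched in a few lines. In the sketch below, the data layout (each impression carries the chosen action and the full logged propensity vector) and the Hoeffding-based thresholds are our own illustrative choices: correctly logged data passes both tests, while data whose logged scores disagree with the true sampling distribution fails.

```python
import math
import random

def arithmetic_mean_test(data, a_star, delta=1e-6):
    """Compare the observed count of a_star with its expected count under the
    logged propensities; Hoeffding's inequality bounds the plausible gap."""
    n = len(data)
    observed = sum(1 for a, p in data if a == a_star)
    expected = sum(p[a_star] for _, p in data)
    bound = math.sqrt(0.5 * n * math.log(2.0 / delta))  # for [0,1]-valued terms
    return abs(observed - expected) <= bound

def harmonic_mean_test(data, a_star, delta=1e-6):
    """The statistic I(a=a*)/p_a* + I(a!=a*)/(1-p_a*) has expectation exactly 2
    under correct logging, for every context."""
    n = len(data)
    stat = sum((1.0 / p[a_star]) if a == a_star else (1.0 / (1.0 - p[a_star]))
               for a, p in data) / n
    b = max(max(1.0 / p[a_star], 1.0 / (1.0 - p[a_star])) for _, p in data)
    bound = b * math.sqrt(0.5 * math.log(2.0 / delta) / n)
    return abs(stat - 2.0) <= bound

rng = random.Random(1)
true_p = [0.5, 0.3, 0.2]
good = [(rng.choices([0, 1, 2], weights=true_p)[0], true_p) for _ in range(100_000)]
bad = [(a, [0.2, 0.3, 0.5]) for a, _ in good]  # same actions, wrong scores logged
print(arithmetic_mean_test(good, 0), harmonic_mean_test(good, 0))
print(arithmetic_mean_test(bad, 0), harmonic_mean_test(bad, 0))
```

On the corrupted log, action 0 appears far more often than its logged score 0.2 predicts, so the arithmetic test fails, and the harmonic statistic lands near 3.1 instead of 2; both signal a logging error, as in Section 3.3.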
3.4 How to Construct Confidence Intervals
Equation 1 gives a point estimate, which in itself is not very useful without variance information: when we compare the offline estimates of two policies' values, we must resort to reliable confidence intervals to infer whether the difference between the two point estimates is significant. Based on various concentration inequalities, Bottou et al. [5] developed a series of very interesting confidence intervals. The widths of the intervals can be used to gain helpful insights into the data collection process. While these results are interesting, they are not necessarily the best candidates to use empirically, due to their worst-case nature. As suggested by the same authors [5], in reality it may be better to use normal approximation theory to obtain confidence intervals. This is the approach we take in this work. Specifically, from the exploration data set D, we can compute an unbiased estimate σ̂ of the standard deviation of the random variable r_a · I(π(x) = a)/p_a. A 95% confidence interval can then be constructed as

    V̂_offline ± 1.96 · σ̂ / √|D| ,

and so on.

4. CASE STUDY
4.1 Speller
Speller is a critical component of a search engine, enabling it to translate queries with typing and phonetic errors into their correct forms, so that the engine can match and rank relevant Web results and instant answers even when the user-typed query is misspelled. Spelling correction for Web queries is a hard problem, particularly because there is no fixed dictionary of terms, and new words and entities keep emerging on the Web. Further, one person's typo can be another person's correct query.
For example, a user who typed "CCN" may have intended to type "CNN" and made a typo, or may really have intended "CCN." Typically, noisy channel models address this by computing the probability of each interpretation being the true intent, given that "CCN" was typed, using the popularities of "CCN" and "CNN" as well as how likely a user is to make this exact typo. With additional features, machine learning can be used to rank these candidates [13].

The problem we focus on here is to train a model that selects a subset of candidates from an already computed set of rewritten queries for a given input query. The motivation for selecting multiple candidates is to mitigate the risk of picking a bad correction early in the lifetime of the query [28]. After fetching the results for each of these formulations, we would either predict a single best rewrite or merge the results of multiple rewrites.

While training the ranker of rewritten query candidates on human-labeled corrections works for a large number of queries, in cases such as the above, a judge may be at a complete loss as to what the user's real intent was, or what the likelihoods of the two intents CCN and CNN are. Hence, it is desirable to learn which spelling correction actually serves the user's intent implicitly from user behavior. Furthermore, this approach is much cheaper than having humans judge the queries offline. For a given spelling correction algorithm, the user's satisfaction can be measured by modeling how the user interacted with the search engine in a real search session. That said, not every technique or algorithmic improvement to a search algorithm can be exposed to users to measure its goodness (or badness), given that:

1. It can pose a risk to the relevance and quality of results seen by the users in the experiment.
2. User query volume restricts the total number of experiments that can be run per unit time.
3. The cost of failure is higher online, since typically more investment is required to make code robust enough to put in front of users.
4. Online experimentation has more noise, and it is harder to analyze the results of an experiment against a baseline, since the queries, users, and sessions on which the two algorithms run differ.

The concerns above lead to the need for an offline evaluation system, even when the labels have been collected via online data in the first place.

Using the terminology of Section 2, the context includes the user-typed query and the rewritten candidates (together with their features); an action is a decision about which candidate(s) to use; and the reward is a metric derived from user clicks on the final search result page. Due to business sensitivity, the metric is not revealed, although it suffices to say that it measures the goodness of the final SERP for the present user. From now on, this metric is referred to as the target metric. Clearly, the actions affect the final SERP, which in turn affects user clicks and the reward.

4.2 Data Collection
Our data collection process was run on a small fraction of random users (identified by browser cookies) for a week in late 2013, yielding about 15M impressions. This is the exploration data set D used in the following experiments. To avoid an adverse user experience, we require that the top candidate be included in the action for every impression. Other candidates were sent randomly: for i ∈ {2, ..., L},

    Pr(q_i is chosen) := 1 / (1 + exp(λ_1 (s_1 − s_i) + λ_2)),

where λ_1 and λ_2 are parameters tuned to strike a good balance between aggressiveness of exploration and potential negative user experience. In our case, the parameters were chosen so that the path-selection probabilities fell in [0.1, 0.9], and so that an offline metric based on judge labels was not severely affected. From these probabilities, the propensity scores of actions can be computed.

There are a few benefits to this randomization scheme. First, by always including q_1 in the selected set, the final SERP is usually reasonably good, so users included in the data collection process would not notice much decrease in relevance quality. Second, the scheme is motivated by the intuition that candidates with higher scores tend to be better, and so are more likely to be chosen by a good policy. Biasing data collection towards such more promising candidates is likely to reduce variance, following the discussion in Section 3.2.

After the data were collected, we ran the arithmetic and harmonic mean tests described in Section 3.3 and found no major issues. It is worth mentioning that the tests helped detect data issues in earlier versions of the data collection, leading to fixes that eventually improved data quality.

4.3 Bootstrapping Histogram
In Section 3.4, we advocate the use of the asymptotic normal approximation to construct confidence intervals. The underlying assumption is that, when the amount of data is large, the estimator in Equation 1 is approximately normally distributed. Here, we verify this assumption empirically with bootstrapping. In particular, we sampled impressions from D with replacement to get a bootstrap set D_1 of the same size as D. The sampling procedure was repeated B = 1000 times, so that we ended up with B data sets: D_1, D_2, ..., D_B. On each of them, we computed the unbiased offline estimate of click-through rate (CTR) for some fixed policy π. Finally, we had B estimates of CTR for the same policy. Since we were dealing with a large data set, we implemented the online bootstrap [24], which is essentially identical to the standard bootstrap at the size of our data.
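The resampling procedure is easy to reproduce on a toy exploration log. In the sketch below, the data-generating process is invented, and B and n are scaled far down from the paper's 1000 replicates over 15M impressions: each bootstrap replicate resamples impressions with replacement and recomputes the Equation 1 estimate, and the spread of the replicates approximates the sampling distribution of the estimator.

```python
import random
import statistics

def ips_ctr(data, policy):
    """Equation (1) applied to a click indicator: offline CTR estimate."""
    return sum(r * (policy(x) == a) / p for x, a, r, p in data) / len(data)

def bootstrap_estimates(data, policy, B=200, seed=0):
    """Standard bootstrap: resample |D| impressions with replacement, B times."""
    rng = random.Random(seed)
    n = len(data)
    return [ips_ctr([data[rng.randrange(n)] for _ in range(n)], policy)
            for _ in range(B)]

# Toy exploration log: 3 actions logged with fixed propensities; a click occurs
# exactly when the logged action matches the "right" action for the context.
rng = random.Random(0)
p_vec = [0.5, 0.3, 0.2]
data = []
for _ in range(5_000):
    x = rng.randrange(5)
    a = rng.choices([0, 1, 2], weights=p_vec)[0]
    data.append((x, a, 1.0 if a == x % 3 else 0.0, p_vec[a]))

pi = lambda x: x % 3
point = ips_ctr(data, pi)
reps = bootstrap_estimates(data, pi)
print(round(statistics.mean(reps), 3), round(statistics.stdev(reps), 4))
```

The replicate mean sits on the point estimate and a histogram of `reps` is close to normal, mirroring Figure 1. For a data set as large as D, the online bootstrap of [24] weights each impression by an independent Poisson(1) draw instead of resampling, producing essentially the same replicates in a single pass.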
Figure 1 gives the histogram of the B estimates of CTR.³ It is rather close to a normal distribution, as expected. The result thus validates the form of the confidence intervals given in Section 3.4. Given the size of our data, the confidence intervals computed this way were all tiny, usually on the order of 10⁻² or less. We therefore did not include these tiny intervals in the plots.

³ All CTRs and target metrics reported in the paper are normalized (i.e., multiplied by a constant) for confidentiality reasons.

[Figure 1: Bootstrapping histogram (x-axis: click-through rate).]

4.4 Accuracy against Ground Truths
We now investigate the accuracy of the offline evaluator. During the same period of time when D was collected, we also ran another candidate selection policy π_0. The online statistics of π_0 could then be used as "ground truth" to validate the accuracy of the offline estimator computed from the exploration data D. First, we examine the estimate of the target metric for each day of the week. Figure 2 is a scatter plot of the online (true) vs. offline (estimated) values. As expected, the offline estimates are highly accurate, centering around the online ground-truth values. Also included in the plot is a biased version of the offline estimate, labeled "Offline (biased)," which uses the following variant of Equation 1:

    V̂_offline^(biased)(π) := [ Σ_{(x,a,r_a,p_a) ∈ D} r_a · I(π(x) = a) ] / [ Σ_{(x,a,r_a,p_a) ∈ D} I(π(x) = a) ].

This estimator, which is not uncommon in practice, ignores the sampling bias in D. The effect can be seen in its much larger estimation error relative to the online ground truths, confirming the need for the reciprocal propensity scores in Equation 1. We now look at the CTR metric more closely. Figure 3 plots how daily CTR varies within one week. Again, the offline estimates match the online values very accurately.
The gap between the two curves is not statistically significant, as the widths of the 95% confidence intervals are roughly 0.01. Figure 4 gives the fraction of clicks contributed by URLs at different positions on the SERP. Figure 5 measures how many clicks were received within a given time after a user submitted the query. All results show the high accuracy of the offline evaluator. We have also run the same comparison on other metrics and observed similar results.

4.5 Offline Optimization
While offline evaluation can be very useful on its own, reducing the need for a substantial fraction of online A/B testing, it can also be used as a subroutine for offline optimization of policies.

[Figure 2: Scatter plot of the online vs. offline metric values. Each point corresponds to one of the seven days in the data collection period.]
[Figure 3: Daily CTR.]

We used a different data set, similarly collected in fall 2013, and split it randomly into a training set (2/3) and an evaluation set (1/3). From the training set, we extracted training data whose (binary) labels were determined by whether the sent candidate contributed positively to the target metric. This way, we obtained a binary classification problem, in which one tries to predict whether a rewritten query will contribute to the target metric, conditioned on the original query. Boosted logistic regression was used to learn a model. When learning such a model, a few meta-parameters had to be chosen to balance target-metric improvement and capacity constraints. One could have used prediction accuracy on a hold-out set to select them, but accuracy does not necessarily correlate with the target metric that we eventually aim to optimize.
Fortunately, the reliable offline evaluator can be used to select these parameters so as to optimize the target metric directly while respecting the capacity constraints.

[Figure 4: CTR as a function of position on the SERP.]
[Figure 5: CTR as a function of time (in seconds).]

Based on the offline evaluation results, we picked a model and ran an A/B test. In a two-week experiment done in late 2013, the model showed a statistically significant improvement over the existing baseline, demonstrating the power of the offline evaluator. Below, we describe a few successful cases where the new model improves upon the baseline.

In the first example, the user-typed query was "umecka and zinc." Our policy successfully identified the corrected query "umcka and zinc," which is about medical treatments for cold-symptom relief. The SERP included an Amazon page about a product (Umcka Cold and Flu Chewable) and customer reviews comparing it to zinc supplements (another common treatment for similar purposes). Analyzing the user clicks on the SERP suggested that the user found the needed information. In contrast, the baseline, like other similar commercial search engines, appeared to miss the correction, and only showed results containing an exact match to "umecka."

In the second example, the original query submitted by the user was "catalina left attorney." The new policy correctly suggested "catalina leff attorney," which appeared to be the right correction: there is an attorney in San Diego whose name is Catalina Leff. The baseline failed to identify this correction, and only showed results containing "left," which was probably not what the user really intended.
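The parameter-selection step described above can be sketched as a grid search scored by the offline evaluator. Everything in the sketch is an invented stand-in for the real system: a one-dimensional candidate score as the context, a single threshold as the only meta-parameter, a trigger-rate cap as the capacity constraint, and a synthetic exploration log in place of the actual Speller data and target metric.

```python
import random

def v_offline(policy, data):
    """Equation (1): unbiased offline estimate of the target metric."""
    return sum(r * (policy(x) == a) / p for x, a, r, p in data) / len(data)

def select_parameter(data, thresholds, capacity):
    """Pick the threshold whose induced policy has the best offline estimate,
    among thresholds whose trigger rate respects the capacity constraint."""
    best = (None, float("-inf"))
    for t in thresholds:
        policy = lambda x, t=t: int(x > t)   # action 1 = send the rewrite
        rate = sum(policy(x) for x, _, _, _ in data) / len(data)
        if rate <= capacity:
            v = v_offline(policy, data)
            if v > best[1]:
                best = (t, v)
    return best

# Synthetic exploration log: the context is a candidate score s in [0,1]; the
# logging policy sends the rewrite with probability 0.1 + 0.8*s (so propensities
# stay in [0.1, 0.9]); the hypothetical target metric rewards sending exactly
# when s > 0.5.
rng = random.Random(0)
data = []
for _ in range(50_000):
    s = rng.random()
    p1 = 0.1 + 0.8 * s
    a = 1 if rng.random() < p1 else 0
    r = 1.0 if a == int(s > 0.5) else 0.0
    data.append((s, a, r, p1 if a == 1 else 1.0 - p1))

print(select_parameter(data, [0.3, 0.5, 0.7], capacity=0.6))
```

The threshold 0.3 is rejected for violating the capacity constraint, and among the remaining candidates the offline evaluator correctly prefers 0.5. In the paper this role is played by the boosted-logistic-regression meta-parameters and the real target metric; the point of the sketch is only that the selection criterion is the offline estimate itself rather than hold-out accuracy.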
RELATED WORK

There is a long history of evaluation-methodology research in the information retrieval community [26]. The dominant approach is to collect relevance judgments for a collection of query-document pairs, and compare different retrieval/ranking functions against metrics like mean average precision [1] and normalized discounted cumulative gain [17]. This approach has been very successful as a low-cost evaluation scheme. However, several authors have pointed out its limitations (e.g., [3, 33]), in addition to the ones discussed in Sections 1 and 4.1; an alternative, "user-centered" evaluation emphasizes interaction between the user and the search engine. One challenge with the user-centered approach is the relatively high cost of system evaluation and comparison. Our work, therefore, provides a promising solution that has shown success in an actual search engine.

In industry, people have also measured various online metrics (such as CTR and time to first click) to monitor and compare systems while running them to serve users. Randomized controlled experiments (e.g., [18]) are the standard way to measure and compare such online metrics. More recently, interleaving has become an attractive technique to quickly identify the winner when comparing two systems [8]. Both techniques require running a system on real users, while our offline approach can be more efficient and less expensive.

As mentioned earlier, the offline evaluation technique is closely related to causal inference in statistics [16], in which one aims to infer, from observational data, the counterfactual effect of changing the policy (more often called an "intervention" in the statistics literature) on some measurement. Such counterfactual methods have recently shown promise in important Web applications such as advertising [5, 7, 19, 20, 30] and content recommendation [22].
In this work, we formulate the problem in the contextual-bandit framework as in [20, 22], which is a natural way to model such interactive machine-learning problems. Furthermore, although offline evaluation was applied to recency search in the past [23], to the best of our knowledge, this work is the first to demonstrate the effectiveness of counterfactual analytic techniques for Web search, including head-to-head comparisons and offline optimization in a commercial search engine.

6. CONCLUSIONS

In this work, we formulate a class of optimization problems in search engines as a contextual bandit problem, and focus on the offline policy evaluation problem. Our approach uses counterfactual analytic techniques to obtain an unbiased estimate of the true policy value, without the need to run the policy on real users. Using data collected from a commercial search engine, we verified the reliability of such an evaluation, and also showed a successful application of it to offline policy optimization.

The promising results in this paper suggest a number of interesting directions for future research. The action set in Speller is tractably small when we only consider a short list of candidates. The set of actions in a ranking problem, defined naively, consists of all permutations of URLs. This exponentially large set can cause the variance of the estimator to be large. It would be interesting to see how to leverage successful ideas in related work [5, 12] to address this issue. Another direction worth investigating is direct optimization of policies based on exploration data with, for instance, the offset tree algorithm [4].

Acknowledgements
This work benefited from helpful discussions with Leon Bottou, Chris Burges, Nick Craswell, Susan Dumais, Jianfeng Gao, John Platt, Ryen White, and Yinzhe Yu.

7. REFERENCES
[1] Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, 1999.
[2] Andrew G.
Barto and P. Anandan. Pattern-recognizing stochastic learning automata. IEEE Transactions on Systems, Man, and Cybernetics, 15(3):360–375, 1985.
[3] Nicholas J. Belkin. Some(what) grand challenges for information retrieval. ACM SIGIR Forum, 42(1):47–54, 2008.
[4] Alina Beygelzimer and John Langford. The offset tree for learning with partial labels. In Proceedings of the Fifteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 129–138, 2009.
[5] Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis Xavier Charles, D. Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. Journal of Machine Learning Research, 14:3207–3260, 2013.
[6] Christopher J. C. Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Gregory N. Hullender. Learning to rank using gradient descent. In Proceedings of the Twenty-Second International Conference on Machine Learning, pages 89–96, 2005.
[7] David Chan, Rong Ge, Ori Gershony, Tim Hesterberg, and Diane Lambert. Evaluating online ad campaigns in a pipeline: Causal models at scale. In Proceedings of the Sixteenth ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 7–15, 2010.
[8] Olivier Chapelle, Thorsten Joachims, Filip Radlinski, and Yisong Yue. Large scale validation and analysis of interleaved search evaluation. ACM Transactions on Information Systems, 30(1), 2012.
[9] Olivier Chapelle and Ya Zhang. A dynamic Bayesian network click model for Web search ranking. In Proceedings of the Eighteenth International Conference on World Wide Web, pages 1–10, 2009.
[10] Fernando Diaz. Integration of news content into Web results. In Proceedings of the Second ACM International Conference on Web Search and Data Mining, pages 182–191, 2009.
[11] Anlei Dong, Yi Chang, Zhaohui Zheng, Gilad Mishne, Jing Bai, Ruiqiang Zhang, Karolina Buchner, Ciya Liao, and Fernando Diaz. Towards recency ranking in Web search. In Proceedings of the Third International Conference on Web Search and Web Data Mining, pages 11–20, 2010.
[12] Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. In Proceedings of the Twenty-Eighth International Conference on Machine Learning, pages 1097–1104, 2011.
[13] Jianfeng Gao, Xiaolong Li, Daniel Micol, Chris Quirk, and Xu Sun. A large scale ranker-based system for search query spelling correction. In Proceedings of the Twenty-Third International Conference on Computational Linguistics, pages 358–366, 2010.
[14] Fan Guo, Chao Liu, Anitha Kannan, Tom Minka, Michael J. Taylor, Yi-Min Wang, and Christos Faloutsos. Click chain model in Web search. In Proceedings of the Eighteenth International Conference on World Wide Web, pages 11–20, 2009.
[15] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
[16] Paul W. Holland. Statistics and causal inference. Journal of the American Statistical Association, 81(6):945–960, 1986.
[17] Kalervo Järvelin and Jaana Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4):422–446, 2002.
[18] Ron Kohavi, Roger Longbotham, Dan Sommerfield, and Randal M. Henne. Controlled experiments on the web: Survey and practical guide. Data Mining and Knowledge Discovery, 18:140–181, 2009.
[19] Diane Lambert and Daryl Pregibon. More bang for their bucks: Assessing new features for online advertisers. SIGKDD Explorations, 9(2):100–107, 2007.
[20] John Langford, Alexander L. Strehl, and Jennifer Wortman. Exploration scavenging.
In Proceedings of the Twenty-Fifth International Conference on Machine Learning, pages 528–535, 2008.
[21] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the Nineteenth International Conference on World Wide Web, pages 661–670, 2010.
[22] Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the Fourth International Conference on Web Search and Web Data Mining, pages 297–306, 2011.
[23] Taesup Moon, Wei Chu, Lihong Li, Zhaohui Zheng, and Yi Chang. An online learning framework for refining recency search results with user click feedback. ACM Transactions on Information Systems, 30(4), 2012.
[24] Nikunj C. Oza and Stuart Russell. Online bagging and boosting. In Proceedings of the Eighth International Workshop on Artificial Intelligence and Statistics, pages 105–112, 2001.
[25] James Pitkow, Hinrich Schütze, Todd Cass, Rob Cooley, Don Turnbull, Andy Edmonds, Eytan Adar, and Thomas Breuel. Personalized search. Communications of the ACM, 45(9):50–55, 2002.
[26] Stephen Robertson. On the history of evaluation in IR. Journal of Information Science, 34(4):439–456, 2008.
[27] Donald B. Rubin. Bayesian inference for causal effects: The role of randomization. The Annals of Statistics, 6(1):34–58, 1978.
[28] Daniel Sheldon, Milad Shokouhi, Martin Szummer, and Nick Craswell. LambdaMerge: Merging the results of query reformulations. In Proceedings of the Fourth International Conference on Web Search and Web Data Mining, pages 795–804, 2011.
[29] Alexander L. Strehl, John Langford, Lihong Li, and Sham M. Kakade. Learning from logged implicit exploration data. In Advances in Neural Information Processing Systems 23, pages 2217–2225, 2011.
[30] Liang Tang, Romer Rosales, Ajit Singh, and Deepak Agarwal. Automatic ad format selection via contextual bandits. In Proceedings of the Twenty-Second ACM International Conference on Information & Knowledge Management, pages 1587–1594, 2013.
[31] Jaime Teevan, Susan T. Dumais, and Eric Horvitz. Potential for personalization. ACM Transactions on Computer-Human Interaction, 17(1), 2010.
[32] TREC. The Text REtrieval Conference. http://trec.nist.gov.
[33] Andrew H. Turpin and William Hersh. Why batch and user evaluations do not give the same results. In Proceedings of the Twenty-Fourth ACM SIGIR Conference on Research and Development in Information Retrieval, pages 225–231, 2001.
[34] Zhaohui Zheng, Hongyuan Zha, Tong Zhang, Olivier Chapelle, Keke Chen, and Gordon Sun. A general boosting method and its application to learning ranking functions for web search. In Advances in Neural Information Processing Systems 20, pages 1000–1007, 2008.