Large-scale Validation of Counterfactual Learning Methods: A Test-Bed

Damien Lefortier∗ (Facebook & University of Amsterdam) dlefortier@fb.com
Adith Swaminathan (Cornell University, Ithaca, NY) adith@cs.cornell.edu
Xiaotao Gu (Tsinghua University, Beijing, China) gxt13@mails.tsinghua.edu.cn
Thorsten Joachims (Cornell University, Ithaca, NY) tj@cs.cornell.edu
Maarten de Rijke (University of Amsterdam) derijke@uva.nl

Abstract

The ability to perform effective off-policy learning would revolutionize the process of building better interactive systems, such as search engines and recommendation systems for e-commerce, computational advertising and news. Recent approaches for off-policy evaluation and learning in these settings appear promising [1, 2]. With this paper, we provide real-world data and a standardized test-bed to systematically investigate these algorithms using data from display advertising. In particular, we consider the problem of filling a banner ad with an aggregate of multiple products the user may want to purchase. This paper presents our test-bed and the sanity checks we ran to ensure its validity, and shows results comparing state-of-the-art off-policy learning methods like doubly robust optimization [3], POEM [2], and reductions to supervised learning using regression baselines. Our results show experimental evidence that recent off-policy learning methods can improve upon state-of-the-art supervised learning techniques on a large-scale real-world data set.

1 Introduction

Effective learning methods for optimizing policies based on logged user-interaction data have the potential to revolutionize the process of building better interactive systems.
Unlike the industry standard of using expert judgments for training, such learning methods could directly optimize user-centric performance measures; they would not require interactive experimental control like online algorithms, and they would not be subject to the data bottlenecks and latency inherent in A/B testing. Recent approaches for off-policy evaluation and learning in these settings appear promising [1, 2, 4], but they highlight the need for accurately logging the propensities of the logged actions. With this paper, we provide the first public dataset that contains accurately logged propensities for the problem of Batch Learning from Bandit Feedback (BLBF). We use data from Criteo, a leader in the display advertising space. In addition to providing the data, we propose an evaluation methodology for running BLBF learning experiments and a standardized test-bed that allows the research community to systematically investigate BLBF algorithms.

∗ This work was done while working at Criteo.

At a high level, a BLBF algorithm operates in the contextual bandit setting and solves the following learning task:

1. Take as input: {π0, ⟨x_i, y_i, δ_i⟩ for i = 1...n}. Here π0 encodes the system from which the logs were collected, x denotes the input to the system, y denotes the output predicted by the system, and δ is a number encoding the observed online metric for the predicted output;
2. Produce as output: π, a new policy that maps x ↦ y;
3. Such that π will perform well (according to the metric δ) if it were deployed online.

We elaborate on the definitions of x, y, δ, π0 as logged in our dataset in the next section. Since past research on BLBF was limited by the availability of an appropriate dataset, we hope that our test-bed will spur research on several aspects of BLBF and off-policy evaluation, including the following:

1. New training objectives, learning algorithms, and regularization mechanisms for BLBF;
2. Improved model selection procedures (analogous to cross-validation for supervised learning);
3. Effective and tractable policy classes π ∈ Π for the specified task x ↦ y;
4. Algorithms that can scale to massive amounts of data.

The rest of this paper is organized as follows. In Section 2, we describe our standardized test-bed for the evaluation of off-policy learning methods. Then, in Section 3, we describe a set of sanity checks that we used on our dataset to ensure its validity and that can be applied generally when gathering data for off-policy learning and evaluation. Finally, in Section 4, we show results comparing state-of-the-art off-policy learning methods like doubly robust optimization [3], POEM [2], and reductions to supervised learning using regression baselines. Our results show, for the first time, experimental evidence that recent off-policy learning methods can improve upon state-of-the-art supervised learning techniques on a large-scale real-world data set.

2 Dataset

We create our test-bed using data from display advertising, similar to the Kaggle challenge hosted by Criteo in 2014 to compare CTR prediction algorithms.¹ However, in this paper, we do not aim to build clickthrough or conversion prediction models for bidding in real-time auctions [5, 6]. Instead, we consider the problem of filling a banner ad with an aggregate of multiple products the user may want to purchase. This part of the system takes place after the bidding agent has won the auction. In this context, each ad has one of many banner types, which differ in the number of products they contain and in their layout, as shown in Figure 1. The task is to choose the products to display in the ad, knowing the banner type, in order to maximize the number of clicks. This task is thus very different from the Kaggle challenge.
In this setting of choosing the best products to fill the banner ad, we can easily gather exploration data in which the placement of the products in the banner ad is randomized, without incurring a prohibitive cost, unlike in Web search, where such exploration is much more costly (see, e.g., [7, 8]). Our logging policy uses randomization aggressively, while being very different from a uniformly random policy.

¹ https://www.kaggle.com/c/criteo-display-ad-challenge

Each banner type corresponds to a different look & feel of the banner ad. Banner ads can differ in the number of products, size, geometry (vertical, horizontal, ...), background color and in the data shown (with or without a product description or a call to action); these we call the fixed attributes. Banner types may also have dynamic aspects such as some form of pagination (multiple pages of products) or an animation. Some examples are shown in Figure 1.

Figure 1: Four examples of ads used in display advertising: a vertical ad, a grid, and two horizontal ads (mock-ups).

Throughout the paper, we label positions in each banner type from 1 to N, from left to right and from top to bottom. Thus 1 is the top left position. For each user impression, we denote the user context by c, the number of slots in the banner type by l_c, and the candidate pool of products by P_c. Each context c and product p pair is described by features φ(c, p). The input x to the system encodes c, P_c, {φ(c, p) : p ∈ P_c}. The logging policy π0 stochastically selects products to construct a banner by first computing non-negative scores f_p for all candidate products p ∈ P_c, and then using a Plackett-Luce ranking model (i.e., sampling without replacement from the multinomial distribution defined by the f_p scores):

$$P(\mathrm{slot}_1 = p) = \frac{f_p}{\sum_{p' \in P_c} f_{p'}}, \qquad P(\mathrm{slot}_2 = p' \mid \mathrm{slot}_1 = p) = \frac{f_{p'}}{\sum_{p^\dagger \in P_c,\, p^\dagger \neq p} f_{p^\dagger}}, \qquad \text{etc.} \tag{1}$$

The propensity of a chosen banner ad ⟨p_1, p_2, ...⟩ is P(slot_1 = p_1) · P(slot_2 = p_2 | slot_1 = p_1) · .... With these propensities in hand, we can counterfactually evaluate any banner-filling policy in an unbiased way using inverse propensity scoring [9]. The following was logged, committing to a single feature encoding φ(c, p) and a single π0 that produces the scores f for the entire duration of data collection:

• the feature vector φ(c, p) for all products in the candidate set P_c;
• the selected products sampled from π0 via the Plackett-Luce model, together with the propensity of the selection;
• the click/no-click feedback and its location(s) in the banner.

The format of this data is:

example ${exID}: ${hashID} ${wasAdClicked} ${propensity} ${nbSlots} ${nbCandidates} ${displayFeat1}:${v_1} ...
${wasProduct1Clicked} exid:${exID} ${productFeat1_1}:${v1_1} ...
...
${wasProductMClicked} exid:${exID} ${productFeatM_1}:${vM_1} ...

Each impression is represented by M + 1 lines, where M is the cardinality of P_c and the first line is a header containing summary information. Note that the first ${nbSlots} candidates correspond to the displayed products, ordered by position (consequently, the ${wasProductMClicked} information for all other candidates is irrelevant).

There are 35 features. Display features are context features or banner type features, which are constant for all candidate products in a given impression. Each unique quadruplet of feature IDs ⟨1, 2, 3, 5⟩ corresponds to a unique banner type. Product features are based on the similarity and/or complementarity of the candidate products with historical products seen by the user on the advertiser's website. We also included interaction terms between some of these features directly in the dataset to limit the amount of feature engineering required to get a good policy. Features 1 and 2 are numerical, while all other features are categorical.
Some categorical features are multi-valued, meaning they can take more than one value for the same product (order does not matter). Note that the example ID increases with time, allowing temporal slices for evaluation [10], although we do not enforce this for our test-bed. Importantly, non-clicked examples were sub-sampled aggressively to reduce the dataset size: we kept only a random 10% sub-sample of them. One therefore needs to account for this sub-sampling during learning and evaluation; the evaluator we provide with the test-bed does so.

The result is a dataset of over 103 million ad impressions. In this dataset, we have:

• 8500+ banner types, with the top 10 banner types representing 30% of the total number of ad impressions, the top 50 about 65%, and the top 100 about 80%;
• between 1 and 6 displayed products (inclusive);
• over 21M impressions for 1-slot banners, over 35M for 2-slot, almost 23M for 3-slot, 7M for 4-slot, 3M for 5-slot, and over 14M for 6-slot banners;
• a candidate pool P_c that is roughly 10 times (upper bound) larger than the number of products to display in the ad.

This dataset is hosted on Amazon AWS (35GB gzipped / 256GB raw). Details for accessing and processing the data are available at http://www.cs.cornell.edu/~adith/Criteo/.

3 Sanity Checks

The work-horse of counterfactual evaluation is Inverse Propensity Scoring (IPS) [11, 9]. IPS requires accurate propensities and, to a crude approximation, produces estimates whose variance scales roughly with the range of the inverse propensities. In Table 1, we report the number of impressions and the average and largest inverse propensities, partitioned by ${nbSlots}. When constructing confidence intervals for importance-weighted estimates like IPS, we often appeal to asymptotic normality of large sample averages [12].
However, if the inverse propensities are very large relative to the number of samples (as we can see for ${nbSlots} ≥ 4), the asymptotic normality assumption will probably be violated.

Table 1: Number of impressions and propensity statistics computed for slices of traffic with k-slot banners, 1 ≤ k ≤ 6. The estimated sample size (N̂) corrects for the 10% sub-sampling of unclicked impressions.

#Slots              1         2         3         4         5         6
#Impressions        2.13e+07  3.55e+07  2.27e+07  6.92e+06  2.95e+06  1.40e+07
N̂                  2.03e+08  3.39e+08  2.15e+08  6.14e+07  2.65e+07  1.30e+08
Avg(InvPropensity)  11.96     3.29e+02  1.87e+04  2.29e+06  2.62e+07  3.51e+09
Max(InvPropensity)  5.36e+05  3.38e+08  3.23e+10  9.78e+12  2.03e+12  2.34e+15

There are simple statistical tests that can be run to detect some issues with inaccurately logged propensities [13]. These arithmetic and harmonic tests, however, require that the candidate actions available for each impression are fixed a priori. In our scenario, we have a context-dependent candidate set that precludes running these tests, so we propose a more general class of diagnostics that can detect some systematic biases and issues in propensity-logged datasets.

Some notation: x_i ~iid Pr(X); y_i ~ π0(Y | x_i); δ_i ~ Pr(Δ | x_i, y_i). The propensity for the logging policy π0 to take the logged action y_i in context x_i is denoted q_i ≡ π0(y_i | x_i). If the propensities are correctly logged, then the expected importance weight should be 1 for any new banner-filling policy π(Y | x). Formally, we have:

$$\hat{C}(\pi) = \frac{1}{N} \sum_{i=1}^{N} \frac{\pi(y_i \mid x_i)}{q_i} \simeq 1. \tag{2}$$

The IPS estimate for a new policy is simply:

$$\hat{R}(\pi) = \frac{1}{N} \sum_{i=1}^{N} \delta_i \frac{\pi(y_i \mid x_i)}{q_i}. \tag{3}$$

These equations are valid when π0 has full support, as our logging system does: π0(y | x) > 0 for all x, y.
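The diagnostic of Eq. (2) and the IPS estimate of Eq. (3) reduce to a few lines of array arithmetic. The following is our own illustrative sketch, not the released evaluator:

```python
import numpy as np

def ips_diagnostics(new_probs, logged_props, rewards):
    """Given, for each logged impression i, the new policy's probability
    pi(y_i | x_i), the logged propensity q_i, and the reward delta_i,
    return (C_hat, R_hat): the control variate of Eq. (2), which should
    be close to 1 under correct logging, and the IPS estimate of Eq. (3)."""
    w = np.asarray(new_probs) / np.asarray(logged_props)  # importance weights
    return w.mean(), (np.asarray(rewards) * w).mean()
```

When the new policy coincides with the logging policy, every importance weight is 1, so Ĉ = 1 exactly and R̂ is just the average logged reward.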
The self-normalized estimator [14, 4] is:

$$\hat{R}_{\mathrm{snips}}(\pi) = \frac{\hat{R}(\pi)}{\hat{C}(\pi)}. \tag{4}$$

Remember that we sub-sampled non-clicked impressions. Sub-sampling is indicated by the binary random variable o_i:

$$o_i \sim \Pr(O = 1 \mid \delta) = \begin{cases} 0.1 & \text{if } \delta = 0, \\ 1 & \text{otherwise.} \end{cases} \tag{5}$$

The IPS estimate and the diagnostic above are not computable in our case, since they require all data-points before sub-sampling. We therefore use the following straightforward modification that uses only our N sub-sampled data-points. First, we estimate the number of data-points before sub-sampling, N̂, using only the samples where o_i = 1:

$$\hat{N} = \sum_{i=1}^{N} \frac{\mathbb{1}\{o_i = 1\}}{\Pr(O = 1 \mid \delta_i)} = \#\{\delta = 1\} + 10\,\#\{\delta = 0\}. \tag{6}$$

N̂ is an unbiased estimate of N = Σ_{i=1}^N 1, since E_{(x_i, y_i, δ_i)} E_{o_i ~ Pr(O | δ_i)} [1{o_i = 1} / Pr(O = 1 | δ_i)] = E_{(x_i, y_i, δ_i)} 1 = 1. Next, consider estimating R(π) = E_{(x_i, y_i, δ_i)} [δ_i π(y_i | x_i) / q_i] as:

$$\hat{R}(\pi) = \frac{1}{\hat{N}} \sum_{i=1}^{N} \delta_i \frac{\pi(y_i \mid x_i)}{q_i} \frac{\mathbb{1}\{o_i = 1\}}{\Pr(O = 1 \mid \delta_i)}. \tag{7}$$

Again, E_{(x_i, y_i, δ_i)} E_{o_i ~ Pr(O | δ_i)} [δ_i π(y_i | x_i)/q_i · 1{o_i = 1}/Pr(O = 1 | δ_i)] = E_{(x_i, y_i, δ_i)} [δ_i π(y_i | x_i)/q_i]. Hence, the sum in the numerator of R̂(π) is, in expectation, N·R(π), while the normalizing constant N̂ is, in expectation, N. Ratios of expectations are not equal to the expectation of a ratio, so we expect a small bias in this estimate, but it is easy to show that it is asymptotically consistent. Finally, consider estimating C(π) = E_{(x_i, y_i)} [π(y_i | x_i)/q_i] = 1 as:

$$\hat{C}(\pi) = \frac{1}{\hat{N}} \sum_{i=1}^{N} \frac{\pi(y_i \mid x_i)}{q_i} \frac{\mathbb{1}\{o_i = 1\}}{\Pr(O = 1 \mid \delta_i)}. \tag{8}$$

Again, E_{(x_i, y_i, δ_i)} E_{o_i ~ Pr(O | δ_i)} [π(y_i | x_i)/q_i · 1{o_i = 1}/Pr(O = 1 | δ_i)] = E_{(x_i, y_i, δ_i)} [π(y_i | x_i)/q_i] = 1. The sum in the numerator of Ĉ(π) is, in expectation, N, as is the denominator. Again, we expect this estimate to have a small bias but to remain asymptotically consistent.
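The sub-sampling-corrected estimators of Eqs. (6)-(8) can be sketched as follows, operating only on the retained impressions. This is our own illustrative code under the stated 10% keep rate for non-clicked impressions, not the released evaluator:

```python
import numpy as np

def subsampled_estimates(new_probs, logged_props, clicks, keep_prob=0.1):
    """Compute (N_hat, R_hat, C_hat) of Eqs. (6)-(8) from the retained
    (sub-sampled) impressions. Non-clicked impressions were kept with
    probability keep_prob, so each retained no-click record stands in
    for 1/keep_prob original records."""
    w = np.asarray(new_probs) / np.asarray(logged_props)  # importance weights
    clicks = np.asarray(clicks, dtype=float)
    inv_keep = np.where(clicks > 0, 1.0, 1.0 / keep_prob)  # 1 / Pr(O=1 | delta)
    n_hat = inv_keep.sum()                         # Eq. (6): #clicks + 10 * #no-clicks
    r_hat = (clicks * w * inv_keep).sum() / n_hat  # Eq. (7)
    c_hat = (w * inv_keep).sum() / n_hat           # Eq. (8)
    return n_hat, r_hat, c_hat
```

For example, one retained click and one retained no-click yield N̂ = 1 + 10 = 11, matching the closed form in Eq. (6).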
The computable variant of the self-normalized IPS estimator simply uses the computable R̂(π) and Ĉ(π) in its definition: R̂_snips(π) = R̂(π)/Ĉ(π).

We use a family of new policies π_ε, parametrized by 0 ≤ ε ≤ 1, to diagnose Ĉ(π) and the expected behavior of the IPS estimates R̂(π). The policy π_ε behaves like a uniformly random ranking policy with probability ε, and like the logging policy with probability 1 − ε. Formally, for an impression with context x_i, |Y| possible actions (e.g., rankings of candidate products), and logged action y_i, the probability of choosing y_i under the new policy π_ε is:

$$\pi_\epsilon(y_i \mid x_i) = \epsilon \frac{1}{|\mathcal{Y}|} + (1 - \epsilon)\, \pi_0(y_i \mid x_i). \tag{9}$$

As we vary ε away from 0, the new policy looks increasingly different from the logging policy π0 on the logged impressions. In Tables 2, 3 and 4, we report Ĉ(π_ε) and a 99% confidence interval assuming asymptotic normality, for different choices of ε. We also report the IPS-estimated clickthrough rates for these policies, R̂(π_ε), their standard error (99% CI), and finally their self-normalized IPS estimates [14, 4].

As we pick policies that differ from the logging policy, we see that the estimated variance of the IPS estimates (as reflected in their approximate 99% confidence intervals) increases. Moreover, the control variate Ĉ(π_ε) is systematically under-estimated. This should caution us not to rely on a single point-estimate (e.g., only IPS or SNIPS). SNIPS can often provide a better bias-variance trade-off in these estimates, but can fail catastrophically when the variance is very high, due to the systematic under-estimation of Ĉ(π). Moreover, in these very-high-variance situations (e.g., when k ≥ 3 and ε ≥ 2^−2), the constructed confidence intervals are not reliable: C(π_ε) clearly does not lie in the computed intervals. Based on these sanity checks, we focus the evaluation set-up in Section 4 on the 1-slot case.
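The ε-mixture policy of Eq. (9) and the self-normalized estimate of Eq. (4) used in these diagnostics can be sketched in a few lines (an illustrative sketch with our own function names, not the released code):

```python
import numpy as np

def mixture_prob(eps, n_actions, logging_prob):
    """Eq. (9): probability that the eps-mixture policy assigns to the
    logged action: uniform over |Y| actions with probability eps,
    the logging policy's probability with weight 1 - eps."""
    return eps / n_actions + (1.0 - eps) * logging_prob

def snips(new_probs, logged_props, rewards):
    """Self-normalized IPS of Eq. (4): R_hat / C_hat, i.e., the
    importance-weighted reward sum divided by the weight sum."""
    w = np.asarray(new_probs) / np.asarray(logged_props)
    return (np.asarray(rewards) * w).sum() / w.sum()
```

At ε = 0 the mixture recovers the logging policy's probability, and at ε = 1 it is uniform over the |Y| actions; SNIPS is invariant to a common rescaling of the importance weights, which is why learnt policies cannot "cheat" by shrinking their weights.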
4 Benchmarking Learning Algorithms

4.1 Evaluation

Estimates based on importance sampling have considerable variance when the number of slots increases. We would thus need tens of millions of impressions to estimate the CTR of slot-filling policies with high precision. To limit the risk of people "over-fitting to the variance" by querying far away from our logging policy, we propose the following estimates for any policy:

• Report the inverse propensity scoring (IPS) estimate [9] R̂(π), as well as the self-normalized (SN) estimate [4] R̂(π)/Ĉ(π) for the new policy (self-normalized, so that learnt policies cannot cheat by not having their importance weights sum to 1);
• Compute the standard error of the IPS estimate (appealing to asymptotic normality), and report this error as an "approximate confidence interval".

Table 2: Diagnostics and IPS-estimated clickthrough rates for different policies π_ε evaluated on slices of traffic with k-slot banners, k ∈ {1, 2}. ε interpolates between the logging policy (ε = 0) and the uniform random policy (ε = 1). Error bars are 99% confidence intervals under a normal distribution.

#Slots = 1:
ε       Ĉ(π_ε)           R̂(π_ε)×10^4      R̂(π_ε)×10^4 / Ĉ(π_ε)
0       1.000 ± 0.000    53.604 ± 0.129   53.604
2^-10   1.000 ± 0.000    53.598 ± 0.129   53.599
2^-9    1.000 ± 0.000    53.593 ± 0.130   53.595
2^-8    1.000 ± 0.000    53.582 ± 0.131   53.585
2^-7    1.000 ± 0.000    53.560 ± 0.138   53.567
2^-6    1.000 ± 0.000    53.516 ± 0.163   53.531
2^-5    0.999 ± 0.000    53.428 ± 0.236   53.457
2^-4    0.999 ± 0.001    53.251 ± 0.416   53.311
2^-3    0.998 ± 0.001    52.899 ± 0.802   53.017
2^-2    0.996 ± 0.003    52.194 ± 1.589   52.428
2^-1    0.991 ± 0.006    50.785 ± 3.171   51.241
1       0.982 ± 0.012    47.966 ± 6.338   48.836

#Slots = 2:
ε       Ĉ(π_ε)           R̂(π_ε)×10^4      R̂(π_ε)×10^4 / Ĉ(π_ε)
0       1.000 ± 0.000    52.554 ± 0.099   52.554
2^-10   1.000 ± 0.000    52.541 ± 0.099   52.545
2^-9    1.000 ± 0.000    52.529 ± 0.101   52.536
2^-8    1.000 ± 0.000    52.503 ± 0.107   52.517
2^-7    0.999 ± 0.000    52.453 ± 0.129   52.481
2^-6    0.999 ± 0.001    52.351 ± 0.193   52.407
2^-5    0.998 ± 0.002    52.148 ± 0.346   52.260
2^-4    0.996 ± 0.003    51.742 ± 0.671   51.965
2^-3    0.991 ± 0.006    50.929 ± 1.331   51.370
2^-2    0.983 ± 0.012    49.305 ± 2.657   50.166
2^-1    0.966 ± 0.024    46.056 ± 5.312   47.693
1       0.931 ± 0.048    39.557 ± 10.623  42.473

Table 3: Diagnostics for different policies π_ε evaluated on slices of traffic with k-slot banners, k ∈ {3, 4}. Error bars are 99% confidence intervals under a normal distribution.

#Slots = 3:
ε       Ĉ(π_ε)           R̂(π_ε)×10^4      R̂(π_ε)×10^4 / Ĉ(π_ε)
0       1.000 ± 0.000    64.298 ± 0.137   64.298
2^-10   1.000 ± 0.000    64.296 ± 0.148   64.305
2^-9    1.000 ± 0.000    64.294 ± 0.179   64.312
2^-8    0.999 ± 0.000    64.291 ± 0.268   64.326
2^-7    0.999 ± 0.001    64.284 ± 0.480   64.354
2^-6    0.998 ± 0.001    64.269 ± 0.930   64.410
2^-5    0.996 ± 0.003    64.240 ± 1.844   64.523
2^-4    0.991 ± 0.006    64.182 ± 3.681   64.750
2^-3    0.982 ± 0.011    64.066 ± 7.359   65.211
2^-2    0.965 ± 0.023    63.834 ± 14.716  66.157
2^-1    0.930 ± 0.045    63.370 ± 29.430  68.156
1       0.860 ± 0.090    62.443 ± 58.860  72.643

#Slots = 4:
ε       Ĉ(π_ε)           R̂(π_ε)×10^4      R̂(π_ε)×10^4 / Ĉ(π_ε)
0       1.000 ± 0.000    141.114 ± 0.366  141.114
2^-10   1.000 ± 0.001    141.065 ± 0.366  141.082
2^-9    1.000 ± 0.001    141.015 ± 0.366  141.049
2^-8    1.000 ± 0.002    140.916 ± 0.368  140.984
2^-7    0.999 ± 0.003    140.717 ± 0.378  140.853
2^-6    0.998 ± 0.006    140.320 ± 0.413  140.590
2^-5    0.996 ± 0.012    139.526 ± 0.534  140.065
2^-4    0.992 ± 0.024    137.937 ± 0.863  139.007
2^-3    0.985 ± 0.049    134.761 ± 1.610  136.867
2^-2    0.969 ± 0.097    128.407 ± 3.161  132.484
2^-1    0.938 ± 0.194    115.700 ± 6.295  123.288
1       0.877 ± 0.389    90.285 ± 12.577  102.960
This is provided in our evaluation software alongside the dataset online. In this way, learning algorithms must reason about bias/variance explicitly to reliably achieve better estimated CTR.

Table 4: Diagnostics for different policies π_ε evaluated on slices of traffic with k-slot banners, k ∈ {5, 6}. Error bars are 99% confidence intervals under a normal distribution.

#Slots = 5:
ε       Ĉ(π_ε)           R̂(π_ε)×10^4       R̂(π_ε)×10^4 / Ĉ(π_ε)
0       1.000 ± 0.000    125.965 ± 0.530   125.965
2^-10   0.999 ± 0.000    125.899 ± 0.532   125.976
2^-9    0.999 ± 0.001    125.833 ± 0.538   125.988
2^-8    0.998 ± 0.001    125.702 ± 0.563   126.011
2^-7    0.995 ± 0.001    125.439 ± 0.653   126.057
2^-6    0.990 ± 0.002    124.913 ± 0.931   126.149
2^-5    0.980 ± 0.004    123.861 ± 1.624   126.337
2^-4    0.961 ± 0.007    121.756 ± 3.119   126.725
2^-3    0.922 ± 0.014    117.548 ± 6.172   127.549
2^-2    0.843 ± 0.029    109.131 ± 12.314  129.428
2^-1    0.686 ± 0.057    92.298 ± 24.613   134.475
1       0.373 ± 0.115    58.631 ± 49.221   157.307

#Slots = 6:
ε       Ĉ(π_ε)           R̂(π_ε)×10^4       R̂(π_ε)×10^4 / Ĉ(π_ε)
0       1.000 ± 0.000    90.620 ± 0.206    90.620
2^-10   1.000 ± 0.000    90.579 ± 0.207    90.622
2^-9    0.999 ± 0.000    90.537 ± 0.210    90.625
2^-8    0.998 ± 0.000    90.454 ± 0.222    90.629
2^-7    0.996 ± 0.001    90.289 ± 0.264    90.638
2^-6    0.992 ± 0.001    89.957 ± 0.389    90.657
2^-5    0.985 ± 0.002    89.293 ± 0.691    90.694
2^-4    0.969 ± 0.004    87.967 ± 1.336    90.769
2^-3    0.938 ± 0.008    85.313 ± 2.649    90.928
2^-2    0.877 ± 0.017    80.006 ± 5.287    91.279
2^-1    0.753 ± 0.033    69.392 ± 10.568   92.154
1       0.506 ± 0.066    48.164 ± 21.135   95.185

4.2 Methods

Consider a 1-slot banner-filling task defined using our dataset. This 21M slice of traffic can be modeled as a logged contextual bandit problem with a small number of arms. This slice is further randomly divided into a 33-33-33% train-validate-test split.
The following methods are benchmarked in the code accompanying this dataset release. All these methods use a linear policy class π ∈ Π_lin to map x ↦ y (i.e., they score candidates using a linear scorer w · φ(c, p)), but differ in their training objectives. Their hyper-parameters are chosen to maximize R̂(π) on the validation set, and their test-set estimates are reported in Table 5.

1. Random: A policy that picks p ∈ P_c uniformly at random to display.
2. Regression: A reduction to supervised learning that predicts δ for every candidate action. The number of training epochs (ranging from 1...40), the regularization for Lasso (ranging from 10^−8...10^−4), and the learning rate for SGD (0.1, 1, 10) are the hyper-parameters.
3. IPS: Directly optimizes R̂(π) evaluated on the training split. This implementation uses a reduction to weighted one-against-all multi-class classification, as employed in [3]. The hyper-parameters are the same as in the Regression approach.
4. DRO [3]: Combines the Regression method with IPS using the doubly robust estimator to perform policy optimization. Again uses a reduction to weighted one-against-all multi-class classification, with the same set of hyper-parameters.
5. POEM [2]: Directly trains a stochastic policy following the counterfactual risk minimization principle, thus reasoning about differences in the variance of the IPS estimate R̂(π). The hyper-parameters are variance regularization, L2 regularization, propensity clipping, and the number of training epochs.

The results of the learning experiments are summarized in Table 5. For more details and the specifics of the experiment setup, visit the dataset website.

Table 5: Test-set performance of policies learnt using different counterfactual learning baselines. Error bars are 99% confidence intervals under a normal distribution. The confidence interval for SNIPS is constructed using the delta method [12].

Approach     R̂(π)×10^4         R̂(π)×10^4 / Ĉ(π)   Ĉ(π)
Random       44.676 ± 2.112    45.446 ± 0.001      0.983 ± 0.021
π0           53.540 ± 0.224    53.540 ± 0.000      1.000 ± 0.000
Regression   48.353 ± 3.253    48.162 ± 0.001      1.004 ± 0.041
IPS          54.125 ± 2.517    53.672 ± 0.001      1.008 ± 0.016
DRO          57.356 ± 14.008   57.086 ± 0.005      1.005 ± 0.025
POEM         58.040 ± 3.407    57.480 ± 0.001      1.010 ± 0.018

Differences between the Random and π0 numbers here and those in Table 2 arise because they are computed on a 33% subset; we do expect their confidence intervals to overlap. We see that the Regression approach, which loosely corresponds to predicting the CTR for each candidate using supervised machine learning, can be substantially improved upon by many recent off-policy learning algorithms that effectively use the logged propensities. We also note that very limited hyper-parameter tuning was performed for methods like POEM and DRO; for instance, POEM can conceivably be improved by employing the doubly robust estimator. We leave such algorithm tuning to future work.

5 Conclusions

In this paper, we have introduced a standardized test-bed to systematically investigate off-policy learning algorithms using real-world data. We presented this test-bed, the sanity checks we ran to ensure its validity, and results comparing state-of-the-art off-policy learning methods (doubly robust optimization [3] and POEM [2]) to regression baselines on a 1-slot banner-filling task. Our results show experimental evidence that recent off-policy learning methods can improve upon state-of-the-art supervised learning techniques on a large-scale real-world data set.

The results we presented are for the 1-slot banner-filling task. For future work, there are several dimensions along which challenging, interesting, and relevant off-policy learning problems can be set up on the collected data:
• Size of the action space: increase the number of slots in the banner.
• Feedback granularity: use global feedback (was there a click somewhere in the banner?) or per-item feedback (which item in the banner was clicked?).
• Contextualization: learn a separate model for each banner type, or learn a contextualized model across multiple banner types.

Acknowledgments

We thank Alexandre Gilotte and Thomas Nedelec at Criteo for their help in creating the dataset. This work was funded in part through NSF Awards IIS-1247637, IIS-1615706, IIS-1513692.

References

[1] L. Bottou, J. Peters, J. Q. Candela, D. X. Charles, M. Chickering, E. Portugaly, D. Ray, P. Y. Simard, and E. Snelson, "Counterfactual reasoning and learning systems: the example of computational advertising," Journal of Machine Learning Research, pp. 3207-3260, 2013.
[2] A. Swaminathan and T. Joachims, "Batch learning from logged bandit feedback through counterfactual risk minimization," Journal of Machine Learning Research, pp. 1731-1755, 2015.
[3] M. Dudík, J. Langford, and L. Li, "Doubly robust policy evaluation and learning," in ICML, pp. 1097-1104, 2011.
[4] A. Swaminathan and T. Joachims, "The self-normalized estimator for counterfactual learning," in NIPS, pp. 3231-3239, 2015.
[5] O. Chapelle, E. Manavoglu, and R. Rosales, "Simple and scalable response prediction for display advertising," Transactions on Intelligent Systems and Technology, Article 61, 2014.
[6] F. Vasile, D. Lefortier, and O. Chapelle, "Cost-sensitive learning for utility optimization in online advertising auctions," arXiv preprint arXiv:1603.03713, 2016.
[7] A. Vorobev, D. Lefortier, G. Gusev, and P. Serdyukov, "Gathering additional feedback on search results by multi-armed bandits with respect to production ranking," in WWW, pp. 1177-1187, 2015.
[8] D. Lefortier, P. Serdyukov, and M. de Rijke, "Online exploration for detecting shifts in fresh intent," in CIKM, pp. 589-598, 2014.
[9] L. Li, W. Chu, J. Langford, and X. Wang, "Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms," in WSDM, pp. 297-306, 2011.
[10] H. B. McMahan, G. Holt, D. Sculley, M. Young, D. Ebner, J. Grady, L. Nie, T. Phillips, E. Davydov, D. Golovin, et al., "Ad click prediction: a view from the trenches," in KDD, pp. 1222-1230, 2013.
[11] P. R. Rosenbaum and D. B. Rubin, "The central role of the propensity score in observational studies for causal effects," Biometrika, pp. 41-55, 1983.
[12] A. B. Owen, Monte Carlo theory, methods and examples. 2013.
[13] L. Li, S. Chen, J. Kleban, and A. Gupta, "Counterfactual estimation and optimization of click metrics in search engines: A case study," in WWW, pp. 929-934, 2015.
[14] T. Hesterberg, "Weighted average importance sampling and defensive mixture distributions," Technometrics, pp. 185-194, 1995.
