Off-policy evaluation for slate recommendation
Adith Swaminathan (Microsoft Research, Redmond) adswamin@microsoft.com
Akshay Krishnamurthy (University of Massachusetts, Amherst) akshay@cs.umass.edu
Alekh Agarwal (Microsoft Research, New York) alekha@microsoft.com
Miroslav Dudík (Microsoft Research, New York) mdudik@microsoft.com
John Langford (Microsoft Research, New York) jcl@microsoft.com
Damien Jose (Microsoft, Redmond) dajose@microsoft.com
Imed Zitouni (Microsoft, Redmond) izitouni@microsoft.com

Abstract

This paper studies the evaluation of policies that recommend an ordered set of items (e.g., a ranking) based on some context, a common scenario in web search, ads, and recommendation. We build on techniques from combinatorial bandits to introduce a new practical estimator that uses logged data to estimate a policy's performance. A thorough empirical evaluation on real-world data reveals that our estimator is accurate in a variety of settings, including as a subroutine in a learning-to-rank task, where it achieves competitive performance. We derive conditions under which our estimator is unbiased (these conditions are weaker than prior heuristics for slate evaluation) and experimentally demonstrate a smaller bias than parametric approaches, even when these conditions are violated. Finally, our theory and experiments also show exponential savings in the amount of required data compared with general unbiased estimators.

1 Introduction

In recommendation systems for e-commerce, search, or news, we would like to use the data collected during operation to test new content-serving algorithms (called policies) along metrics such as revenue and number of clicks [4, 25]. This task is called off-policy evaluation. General approaches, namely inverse propensity scores (IPS) [13, 18], require unrealistically large amounts of logged data to evaluate whole-page metrics that depend on multiple recommended items, which happens when showing ranked lists.
Submitted to 31st Conference on Neural Information Processing Systems (NIPS 2017).

Figure 1: Off-policy evaluation of two whole-page user-satisfaction metrics (negative time-to-success and utility rate) on proprietary search engine data. Average RMSE of different estimators over 50 runs on a log-log scale. Our method (PI) achieves the best performance with moderate data sizes. The unbiased IPS method suffers high variance, and direct modeling (DM) of the metrics suffers high bias. OnPolicy is the expensive choice of deploying the policy, for instance, in an A/B test.

The key challenge is that the number of possible lists (called slates) is combinatorially large. As a result, the policy being evaluated is likely to choose different slates from those recorded in the logs most of the time, unless it is very similar to the data-collection policy. This challenge is fundamental [34], so any off-policy evaluation method that works with large slates needs to make some structural assumptions about the whole-page metric or the user behavior.

Previous work on off-policy evaluation and whole-page optimization improves the probability of a match between logging and evaluation by restricting attention to small slate spaces [35, 26], introducing assumptions that allow for partial matches between the proposed and observed slates [27], or assuming that the policies used for logging and evaluation are similar [4, 32]. Another line of work constructs parametric models of slate quality [8, 16, 14] (see also Sec. 4.3 of [17]). While these approaches require less data, they can have large bias, and their use in practice requires an expensive trial-and-error cycle involving weeks-long A/B tests to develop new policies [20]. In this paper we
design a method more robust to problems with bias and with only modest data requirements, with the goal of substantially shortening this cycle and accelerating the policy development process.

We frame the slate recommendation problem as a combinatorial generalization of contextual bandits [3, 23, 13]. In combinatorial contextual bandits, for each context, a policy selects a slate consisting of component actions, after which a reward for the entire slate is observed. In web search, the context is the search query augmented with a user profile, the slate is the search results page consisting of a list of retrieved documents (actions), and example reward metrics are page-level measures such as time-to-success, NDCG (position-weighted relevance), or other measures of user satisfaction. As input we receive contextual bandit data obtained by some logging policy, and our goal is to estimate the reward of a new target policy. This off-policy setup differs from online learning in contextual bandits, where the goal is to adaptively maximize the reward in the presence of an explore-exploit trade-off [5].

Inspired by work in combinatorial and linear bandits [7, 31, 11], we propose an estimator that makes only a weak assumption about the evaluated metric, while exponentially reducing the data requirements in comparison with IPS. Specifically, we posit a linearity assumption, stating that the slate-level reward (e.g., time-to-success in web search) decomposes additively across actions, but the action-level rewards are not observed. Crucially, the action-level rewards are allowed to depend on the context, and we do not require that they be easily modeled from the features describing the context. In fact, our method is completely agnostic to the representation of contexts. We make the following contributions:

1.
The pseudoinverse estimator (PI) for off-policy evaluation: a general-purpose estimator from the combinatorial bandit literature, adapted for off-policy evaluation. When ranking ℓ out of m items under the linearity assumption, PI typically requires O(ℓm/ε²) samples to achieve error at most ε, an exponential gain over the m^Ω(ℓ) sample complexity of IPS. We provide distribution-dependent bounds based on the overlap between logging and target policies.

2. Experiments on real-world search ranking datasets: The strong performance of the PI estimator provides, to our knowledge, the first demonstration of high-quality off-policy evaluation of whole-page metrics, comprehensively outperforming prior baselines (see Fig. 1).

3. Off-policy optimization: We provide a simple procedure for learning to rank (L2R) using the PI estimator to impute action-level rewards for each context. This allows direct optimization of whole-page metrics via pointwise L2R approaches, without requiring pointwise feedback.

Related work. Large state spaces have typically been studied in the online, or on-policy, setting. Some works assume specific parametric (e.g., linear) models relating the metrics to the features describing a slate [2, 31, 15, 10, 29]; this can lead to bias if the model is inaccurate (e.g., we might not have access to sufficiently predictive features). Others posit the same linearity assumption as we do, but further assume a semi-bandit feedback model where the rewards of all actions on the slate are revealed [19, 22, 21]. While much of the research focuses on the on-policy setting, the off-policy paradigm studied in this paper is often preferred in practice since it might not be possible to implement the low-latency updates needed for online learning, or we might be interested in many different metrics and require a manual review of their trade-offs before deploying new policies.
At a technical level, the PI estimator has been used in online learning [7, 31, 11], but the analysis there is tailored to the specific data collection policies used by the learner. In contrast, we provide distribution-dependent bounds without any assumptions on the logging or target policy.

2 Setting and notation

In combinatorial contextual bandits, a decision maker repeatedly interacts as follows:

1. the decision maker observes a context x drawn from a distribution D(x) over some space X;
2. based on the context, the decision maker chooses a slate s = (s_1, ..., s_ℓ) consisting of actions s_j, where a position j is called a slot, the number of slots is ℓ, actions at position j come from some space A_j(x), and the slate s is chosen from a set of allowed slates S(x) ⊆ A_1(x) × ··· × A_ℓ(x);
3. given the context and slate, a reward r ∈ [−1, 1] is drawn from a distribution D(r | x, s); rewards in different rounds are independent, conditioned on contexts and slates.

The context space X can be infinite, but the set of actions is finite. We assume |A_j(x)| = m_j for all contexts x ∈ X and define m := max_j m_j as the maximum number of actions per slot. The goal of the decision maker is to maximize the reward. The decision maker is modeled as a stochastic policy π that specifies a conditional distribution π(s | x) (a deterministic policy is a special case). The value of a policy π, denoted V(π), is defined as the expected reward when following π:

V(π) := E_{x∼D} E_{s∼π(·|x)} E_{r∼D(·|x,s)} [r].   (1)

To simplify derivations, we extend the conditional distribution π into a distribution over triples (x, s, r) as π(x, s, r) := D(r | x, s) π(s | x) D(x). With this shorthand, we have V(π) = E_π[r].
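As a concrete reading of Eq. (1), the following minimal sketch estimates V(π) by Monte Carlo, which is exactly the expensive OnPolicy baseline discussed later. The context distribution, policy, and reward model below are illustrative toy choices, not part of the paper's setup:

```python
import random

def on_policy_value(draw_context, policy, draw_reward, n=20000, rng=None):
    """Monte Carlo estimate of V(pi) = E_x E_{s~pi(.|x)} E_{r~D(.|x,s)} [r]."""
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(n):
        x = draw_context(rng)             # x ~ D(x)
        s = policy(x, rng)                # s ~ pi(.|x)
        total += draw_reward(x, s, rng)   # r ~ D(r|x,s)
    return total / n

# Toy instance: 2 slots, binary actions per slot; reward is additive in the actions.
draw_context = lambda rng: rng.choice(["query_a", "query_b"])
policy = lambda x, rng: (rng.randint(0, 1), rng.randint(0, 1))   # uniform over slates
draw_reward = lambda x, s, rng: float(s[0] + s[1])               # deterministic reward

v_hat = on_policy_value(draw_context, policy, draw_reward)
# Each slot contributes 0.5 in expectation, so v_hat should be close to 1.0.
```

Deploying a new policy to collect such samples is what an A/B test does; the rest of the paper estimates the same quantity from logs of a different policy µ.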
To finish this section, we introduce notation for the expected reward for a given context and slate, which we call the slate value, and denote as:

V(x, s) := E_{r∼D(·|x,s)} [r].   (2)

Example 1 (Cartesian product). Consider the optimization of a news portal where the reward is the whole-page advertising revenue. Context x is the user profile, the slate is the news-portal page with slots corresponding to news sections,[1] and actions are the articles. The set of valid slates is the Cartesian product S(x) = ∏_{j≤ℓ} A_j(x). The number of valid slates is exponential in ℓ: |S(x)| = ∏_{j≤ℓ} m_j.

Example 2 (Ranking). Consider web search and ranking. Context x is the query along with the user profile. Actions correspond to search items (such as webpages). The policy chooses ℓ of m items, where the set A(x) of m items for a context x is chosen from a corpus by a filtering step (e.g., a database query). We have A_j(x) = A(x) for all j ≤ ℓ, but the allowed slates S(x) have no repetitions. The number of valid slates is exponential in ℓ: |S(x)| = m!/(m − ℓ)! = m^Ω(ℓ). A reward could be the negative time-to-success, i.e., the negative of the time taken by the user to find a relevant item.

2.1 Off-policy evaluation and optimization

In the off-policy setting, we have access to the logged data (x_1, s_1, r_1), ..., (x_n, s_n, r_n) collected using a past policy µ, called the logging policy. Off-policy evaluation is the task of estimating the value of a new policy π, called the target policy, using the logged data. Off-policy optimization is the harder task of finding a policy π̂ that achieves maximal reward. There are two standard approaches for off-policy evaluation. The direct method (DM) uses the logged data to train a (parametric) model r̂(x, s) for predicting the expected reward for a given context and slate.
V(π) is then estimated as

V̂_DM(π) = (1/n) Σ_{i=1}^n Σ_{s∈S(x_i)} r̂(x_i, s) π(s | x_i).   (3)

[1] For simplicity, we do not discuss the more general setting of showing multiple articles in each news section.

The direct method is often biased due to the mismatch between model assumptions and ground truth. The second approach, which is provably unbiased (under modest assumptions), is the inverse propensity score (IPS) estimator [18]. The IPS estimator re-weights the logged data according to ratios of slate probabilities under the target and logging policy. It has two common variants:

V̂_IPS(π) = (1/n) Σ_{i=1}^n r_i · π(s_i | x_i)/µ(s_i | x_i),
V̂_wIPS(π) = [Σ_{i=1}^n r_i · π(s_i | x_i)/µ(s_i | x_i)] / [Σ_{i=1}^n π(s_i | x_i)/µ(s_i | x_i)].   (4)

wIPS generally has better variance with an asymptotically zero bias. The variance of both estimators grows linearly with π(s | x)/µ(s | x), which can be Ω(|S(x)|). This is prohibitive when |S(x)| = m^Ω(ℓ).

3 Our approach

The IPS estimator is minimax optimal [34], so its exponential variance is unavoidable in the worst case. We circumvent this hardness by positing an assumption on the structure of rewards. Specifically, we assume that the slate-level reward is a sum of unobserved action-level rewards that depend on the context, the action, and the position on the slate, but not on the other actions on the slate.

Formally, we consider slate indicator vectors in R^{ℓm} whose components are indexed by pairs (j, a) of slots and possible actions in them. A slate s is described by an indicator vector 1_s ∈ R^{ℓm} whose entry at position (j, a) is equal to 1 if the slate s has action a in slot j, i.e., if s_j = a. The above assumption is formalized as follows:

Assumption 1 (Linearity Assumption).
For each context x ∈ X there exists an (unknown) intrinsic reward vector φ_x ∈ R^{ℓm} such that the slate value satisfies V(x, s) = 1_s^T φ_x = Σ_{j=1}^ℓ φ_x(j, s_j).

The slate indicator vector can be viewed as a feature vector representing the slate, and φ_x can be viewed as a context-specific weight vector. The assumption states that the value of a slate is a linear function of its feature representation. However, note that this linear dependence is allowed to be completely different across contexts, because we make no assumptions on how φ_x depends on x, and in fact our method does not even attempt to accurately estimate φ_x. Being agnostic to the form of φ_x is the key departure from the direct method and parametric bandits.

While Assumption 1 rules out interactions among different actions on a slate,[2] its ability to vary intrinsic rewards arbitrarily across contexts captures many common metrics in information retrieval, such as the normalized discounted cumulative gain (NDCG) [6], a common metric in web ranking:

Example 3 (NDCG). For a slate s, we first define DCG(x, s) := Σ_{j=1}^ℓ (2^{rel(x,s_j)} − 1)/log_2(j + 1), where rel(x, a) is the relevance of document a on query x. Then NDCG(x, s) := DCG(x, s)/DCG*(x), where DCG*(x) = max_{s∈S(x)} DCG(x, s), so NDCG takes values in [0, 1]. Thus, NDCG satisfies Assumption 1 with φ_x(j, a) = (2^{rel(x,a)} − 1) / (log_2(j + 1) · DCG*(x)).

In addition to Assumption 1, we also make the standard assumption that the logging policy puts non-zero probability on all slates that can be potentially chosen by the target policy. This assumption is also required for IPS; otherwise unbiased off-policy evaluation is impossible [24].

Assumption 2 (Absolute Continuity). The off-policy evaluation problem satisfies absolute continuity if µ(s | x) > 0 whenever π(s | x) > 0, with probability one over x ∼ D.
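The decomposition in Example 3 can be verified numerically: NDCG computed from its definition equals the sum of intrinsic rewards φ_x(j, s_j) for every valid slate. A small sketch, where the relevance grades and the candidate set are made up:

```python
import math
from itertools import permutations

def dcg(rels, slate):
    """DCG(x, s) with 0-indexed slots j, so the discount is log2(j + 2)."""
    return sum((2 ** rels[a] - 1) / math.log2(j + 2) for j, a in enumerate(slate))

def ndcg(rels, slate, all_slates):
    best = max(dcg(rels, s) for s in all_slates)
    return dcg(rels, slate) / best

# Toy query: rank l=3 out of m=4 documents, judged on a 5-point scale.
rels = {0: 3, 1: 0, 2: 4, 3: 1}
slates = list(permutations(range(4), 3))

# Intrinsic reward vector of Assumption 1: phi_x(j, a) per (slot, action) pair.
best = max(dcg(rels, s) for s in slates)
phi = {(j, a): (2 ** rels[a] - 1) / (math.log2(j + 2) * best)
       for j in range(3) for a in rels}

# NDCG decomposes additively across slots for every valid slate.
for s in slates:
    direct = ndcg(rels, s, slates)
    additive = sum(phi[(j, a)] for j, a in enumerate(s))
    assert abs(direct - additive) < 1e-12
```

Note that φ_x depends on DCG*(x) and hence on the whole candidate set, which is why it can vary arbitrarily across contexts while the per-slate value stays linear.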
3.1 The pseudoinverse estimator

Using Assumption 1, we can now apply techniques from the combinatorial bandit literature to our problem. In particular, our estimator closely follows the recipe of Cesa-Bianchi and Lugosi [7], albeit with some differences to account for the off-policy and contextual nature of our setup.

Under Assumption 1, we can view the recovery of φ_x for a given context x as a linear regression problem. The covariates 1_s are drawn according to µ(· | x), and the reward follows a linear model, conditional on s and x, with φ_x as the "weight vector". Thus, we can write the MSE of an estimate w as E_{s∼µ(·|x)} E_{r∼D(·|s,x)} [(1_s^T w − r)²], or more compactly as E_µ[(1_s^T w − r)² | x], using our definition of µ as a distribution over triples (x, s, r). We estimate φ_x by the MSE minimizer with the smallest norm, which can be written in closed form as

φ̄_x = E_µ[1_s 1_s^T | x]† E_µ[r 1_s | x],   (5)

where M† is the Moore-Penrose pseudoinverse of a matrix M. Note that this idealized "estimator" φ̄_x uses conditional expectations over s ∼ µ(· | x) and r ∼ D(· | s, x).

[2] We discuss limitations of Assumption 1 and directions to overcome them in Sec. 5.

To simplify the notation, we write Γ_{µ,x} := E_µ[1_s 1_s^T | x] ∈ R^{ℓm×ℓm} to denote the (uncentered) covariance matrix for our regression problem, appearing on the right-hand side of Eq. (5). We also introduce notation for the second term in Eq. (5) and its empirical estimate: θ_{µ,x} := E_µ[r 1_s | x], and θ̂_i := r_i 1_{s_i}. Thus, our regression estimator (5) is simply φ̄_x = Γ†_{µ,x} θ_{µ,x}. Under Assumptions 1 and 2, it is easy to show that V(x, s) = 1_s^T φ̄_x = 1_s^T Γ†_{µ,x} θ_{µ,x}.
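The idealized estimator in Eq. (5) can be checked by brute-force enumeration on a tiny slate space: computing Γ and θ exactly under µ, with a reward that truly is linear, 1_s^T φ̄_x reproduces every slate value even though Γ is rank-deficient. A sketch under toy assumptions (the slate space, logging distribution, and hidden φ_x are all made up):

```python
import numpy as np
from itertools import product

l, m = 2, 3                       # 2 slots, 3 actions per slot (Cartesian slates)
slates = list(product(range(m), repeat=l))

def indicator(s):
    """Slate indicator 1_s in R^{l*m}; entry (j, a) is 1 iff s_j = a."""
    v = np.zeros(l * m)
    for j, a in enumerate(s):
        v[j * m + a] = 1.0
    return v

rng = np.random.default_rng(0)
mu = rng.dirichlet(np.ones(len(slates)))          # logging distribution mu(s|x)
phi = rng.normal(size=l * m)                      # hidden intrinsic rewards phi_x
value = {s: indicator(s) @ phi for s in slates}   # V(x,s) = 1_s^T phi_x

# Exact moments under mu: Gamma = E[1_s 1_s^T | x], theta = E[r 1_s | x].
Gamma = sum(p * np.outer(indicator(s), indicator(s)) for p, s in zip(mu, slates))
theta = sum(p * value[s] * indicator(s) for p, s in zip(mu, slates))

phi_bar = np.linalg.pinv(Gamma) @ theta           # minimum-norm MSE minimizer
# 1_s^T phi_bar recovers V(x,s) for every slate supported by mu.
```

Note that φ̄_x generally differs from the hidden φ_x (Γ is singular, so φ_x is not identified), yet all slate values are recovered; this is exactly why the method needs no accurate estimate of φ_x itself.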
Replacing θ_{µ,x} with θ̂_i motivates the following estimator for V(π), which we call the pseudoinverse estimator or PI:

V̂_PI(π) = (1/n) Σ_{i=1}^n Σ_{s∈S} π(s | x_i) 1_s^T Γ†_{µ,x_i} θ̂_i = (1/n) Σ_{i=1}^n r_i · q^T_{π,x_i} Γ†_{µ,x_i} 1_{s_i}.   (6)

In Eq. (6) we have expanded the definition of θ̂_i and introduced the notation q_{π,x} for the expected slate indicator under π conditional on x, q_{π,x} := E_π[1_s | x]. The summation over s required to obtain q_{π,x_i} in Eq. (6) can be replaced by a small sample. We can also derive a weighted variant of PI:

V̂_wPI(π) = [Σ_{i=1}^n r_i · q^T_{π,x_i} Γ†_{µ,x_i} 1_{s_i}] / [Σ_{i=1}^n q^T_{π,x_i} Γ†_{µ,x_i} 1_{s_i}].   (7)

We prove the following unbiasedness property in Appendix A.

Proposition 1. If Assumptions 1 and 2 hold, then the estimator V̂_PI is unbiased, i.e., E_{µ^n}[V̂_PI] = V(π), where the expectation is over the n logged examples sampled i.i.d. from µ.

As special cases, PI reduces to IPS when ℓ = 1, and simplifies to Σ_{i=1}^n r_i / n when π = µ (see Appendix C). To build further intuition, we consider the settings of Examples 1 and 2, and simplify the PI estimator to highlight the improvement over IPS.

Example 4 (PI for a Cartesian product when µ is a product distribution). The PI estimator for the Cartesian product slate space, when µ factorizes across slots as µ(s | x) = ∏_j µ(s_j | x), simplifies to V̂_PI(π) = (1/n) Σ_{i=1}^n r_i · (Σ_{j=1}^ℓ π(s_{ij} | x_i)/µ(s_{ij} | x_i) − ℓ + 1), by Prop. 2 in Appendix D. Note that unlike IPS, which divides by probabilities of whole slates, the PI estimator only divides by probabilities of actions appearing in individual slots. Thus, the magnitude of each term of the outer summation is only O(ℓm), whereas the IPS terms are m^Ω(ℓ).

Example 5 (PI for rankings with ℓ = m and uniform logging). In this case, V̂_PI(π) = (1/n) Σ_{i=1}^n r_i · (Σ_{j=1}^ℓ π(s_{ij} | x_i)/(1/(m − 1)) − m + 2), by Prop. 4 in Appendix E.1.
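As a check on the closed form of Example 4, the sketch below implements the IPS estimator of Eq. (4) and the simplified product-logging form of PI on made-up logs; per Appendix C, both collapse to the plain average of logged rewards when π = µ. All data and distributions here are toy assumptions:

```python
import random

def ips(data, pi_slate, mu_slate):
    """Eq. (4): importance weighting with whole-slate probabilities."""
    return sum(r * pi_slate(s, x) / mu_slate(s, x) for x, s, r in data) / len(data)

def pi_product(data, pi_slot, mu_slot):
    """Example 4: PI when mu factorizes over slots; needs per-slot marginals only."""
    n, l = len(data), len(data[0][1])
    return sum(
        r * (sum(pi_slot(j, s[j], x) / mu_slot(j, s[j], x) for j in range(l)) - l + 1)
        for x, s, r in data) / n

# Toy logs: l=3 slots, m=4 actions per slot, uniform product logging.
rng = random.Random(1)
l_, m_ = 3, 4
data = []
for _ in range(2000):
    x = rng.randint(0, 1)                              # context
    s = tuple(rng.randrange(m_) for _ in range(l_))    # slate ~ uniform product mu
    data.append((x, s, float(sum(s)) / (l_ * (m_ - 1))))  # additive reward in [0,1]

unif_slot = lambda j, a, x: 1.0 / m_
unif_slate = lambda s, x: (1.0 / m_) ** len(s)

# Sanity check: with target = logging, both estimators equal the average reward.
avg_r = sum(r for _, _, r in data) / len(data)
```

The practical difference appears when π ≠ µ: IPS divides by the whole-slate probability (1/m)^ℓ, while each PI term divides only by per-slot probabilities 1/m, which is the source of the exponential variance gap discussed above.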
The summands are again O(ℓm) = O(m²).

3.2 Deviation analysis

So far, we have shown that PI is unbiased under our assumptions and overcomes the deficiencies of IPS in specific examples. We now derive a finite-sample error bound, based on the overlap between π and µ. We use Bernstein's inequality, for which we define the variance and range terms:

σ² := E_{x∼D}[ q^T_{π,x} Γ†_{µ,x} q_{π,x} ],   ρ := sup_x sup_{s: µ(s|x)>0} q^T_{π,x} Γ†_{µ,x} 1_s.   (8)

The quantity σ² bounds the variance whereas ρ bounds the range. They capture the "average" and "worst-case" mismatch between µ and π. They equal one when π = µ (see Appendix C), and yield the following deviation bound:

Theorem 1. Under Assumptions 1 and 2, let σ² and ρ be defined as in Eq. (8). Then, for any δ ∈ (0, 1), with probability at least 1 − δ,

V̂_PI(π) − V(π) ≤ sqrt(2σ² ln(2/δ)/n) + 2(ρ + 1) ln(2/δ)/(3n).

We observe that this finite-sample bound is structurally different from the regret bounds studied in prior works on combinatorial bandits. The bound incorporates the extent of overlap between π and µ, so that we have higher confidence in our estimates when the logging and evaluation policies are similar, an important consideration in off-policy evaluation. While the bound might look complicated, it simplifies if we consider the class of ε-uniform logging policies. Formally, for any policy µ, define µ_ε(s | x) = (1 − ε) µ(s | x) + ε ν(s | x), with ν being the uniform distribution over the set S(x). For suitably small ε, such logging policies are widely used in practice. We have the following corollary for these policies, proved in Appendix E:

Corollary 1. In the settings of Example 1 or Example 2, if the logging is done with µ_ε for some ε > 0, we have |V̂_PI(π) − V(π)| ≤ O(sqrt(ε^{-1} ℓm/n)).

Again, this turns the Ω(m^ℓ) data dependence of IPS into O(mℓ).
The key step in the proof is the bound on a certain norm of Γ†_ν, similar to the bounds of Cesa-Bianchi and Lugosi [7], but our results are a bit sharper.

4 Experiments

We empirically evaluate the performance of the pseudoinverse estimator for ranking problems. We first show that PI outperforms prior works in a comprehensive semi-synthetic study using a public dataset. We then use our estimator for off-policy optimization, i.e., to learn ranking policies, competitively with supervised learning that uses more information. Finally, we demonstrate substantial improvements on proprietary data from search engine logs for two user-satisfaction metrics used in practice: time-to-success and utility rate, which do not satisfy the linearity assumption. More detailed results are deferred to Appendices F and G. All of our code is available online.[3]

4.1 Semi-synthetic evaluation

Our semi-synthetic evaluation uses labeled data from the Microsoft Learning to Rank Challenge dataset [30] (MSLR-WEB30K) to create a contextual bandit instance. Queries form the contexts x and actions a are the available documents. The dataset contains over 31K queries, each with up to 1251 judged documents, where the query-document pairs are judged on a 5-point scale, rel(x, a) ∈ {0, ..., 4}. Each pair (x, a) has a feature vector f(x, a), which can be partitioned into title and body features (f_title and f_body). We consider two slate rewards: NDCG from Example 3, and the expected reciprocal rank, ERR [9], which does not satisfy linearity, and is defined as ERR(x, s) := Σ_{r=1}^ℓ (1/r) ∏_{i=1}^{r−1} (1 − R(s_i)) R(s_r), where R(a) = (2^{rel(x,a)} − 1)/2^{maxrel} with maxrel = 4.

To derive several distinct logging and target policies, we first train two lasso regression models, called lasso_title and lasso_body, and two regression tree models, called tree_title and tree_body, to predict relevances from f_title and f_body, respectively.
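The ERR metric defined above can be sketched directly from its definition; the relevance grades in the example are made up, with maxrel = 4 as in MSLR:

```python
def err(rels, maxrel=4):
    """Expected reciprocal rank for a ranked list of relevance grades rels."""
    stop = [(2 ** g - 1) / 2 ** maxrel for g in rels]   # R(s_i): stop prob. at rank i
    out, p_reach = 0.0, 1.0
    for i, r_i in enumerate(stop, start=1):
        out += p_reach * r_i / i      # (1/i) * prod_{k<i}(1 - R(s_k)) * R(s_i)
        p_reach *= 1.0 - r_i
    return out

# A perfectly relevant document in slot 1 dominates: err([4, 0, 0]) is 15/16.
```

Because each term multiplies the "continue" probabilities of all earlier slots, ERR couples actions across slots, which is precisely why it violates Assumption 1 and serves as a stress test for PI's bias.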
To create the logs, queries x are sampled uniformly, and the set A(x) consists of the top m documents according to tree_title. The logging policy is parametrized by a model, either tree_title or lasso_title, and a scalar α ≥ 0. It samples from a multinomial distribution over documents, p_α(a | x) ∝ 2^{−α⌊log_2 rank(x,a)⌋}, where rank(x, a) is the rank of document a for query x according to the corresponding model. Slates are constructed slot-by-slot, sampling without replacement according to p_α. Varying α interpolates between uniformly random and deterministic logging. Thus, all logging policies are based on the models derived from f_title. We consider two deterministic target policies based on the two models derived from f_body, i.e., tree_body and lasso_body, which select the top ℓ documents according to the corresponding model. The four base models are fairly distinct: on average fewer than 2.75 documents overlap among the top 10 (see Appendix H).

[3] https://github.com/adith387/slates_semisynth_expts
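The multinomial logging scheme above can be sketched as follows. The rank ordering would come from one of the scoring models (tree_title or lasso_title); here the document list and names are illustrative placeholders:

```python
import math
import random

def slate_probs(num_docs, alpha):
    """p_alpha(a|x) proportional to 2^(-alpha * floor(log2 rank)), ranks 1..num_docs."""
    w = [2.0 ** (-alpha * math.floor(math.log2(rank)))
         for rank in range(1, num_docs + 1)]
    z = sum(w)
    return [wi / z for wi in w]

def sample_slate(ranked_docs, l, alpha, rng):
    """Fill slots one by one, sampling without replacement according to p_alpha."""
    docs = list(ranked_docs)
    probs = slate_probs(len(docs), alpha)
    slate = []
    for _ in range(l):
        a, = rng.choices(range(len(docs)), weights=probs)
        slate.append(docs.pop(a))
        probs.pop(a)          # choices() renormalizes the remaining weights
    return slate

rng = random.Random(0)
docs = [f"doc{i}" for i in range(10)]   # top-m docs for a query, best first
slate = sample_slate(docs, 5, alpha=1.0, rng=rng)
```

At α = 0 all weights are 1 and logging is uniform; as α grows, probability mass concentrates on the top-ranked documents, approaching deterministic logging.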
Figure 2: Top: RMSE of various estimators under four experimental conditions (NDCG, m=100, ℓ=10, uniform and tree logging; ERR, m=100, ℓ=10 and m=10, ℓ=5; see Appendix F for all 40 conditions). Middle: CDF of normalized RMSE at 600k samples; each plot aggregates over 10 logging-target combinations; closer to top-left is better. Bottom: Same as middle but at 60k samples.

We compare the weighted estimator wPI with the direct method (DM) and weighted IPS (wIPS). (Weighted variants outperformed the unweighted ones.) We implement two variants of DM: regression trees and lasso, each trained on the first n/2 examples and using the remaining n/2 examples for evaluation according to Eq. (3). We also include an aspirational baseline, OnPolicy, which corresponds to deploying the target policy as in an A/B test and returning the average of observed rewards. This is the expensive alternative we wish to avoid.

We evaluate the estimators by recording the root mean square error (RMSE) as a function of the number of samples, averaged over at least 25 independent runs. We do this for 40 different experimental conditions, considering two reward metrics, two slate-space sizes, and 10 combinations of target and logging policies (including the choice of α). The top row of Fig. 2 shows results for four representative conditions (see Appendix F for all results), while the middle and bottom rows aggregate across conditions. To produce the aggregates, we shift and rescale the RMSE of all methods, at 600k (middle row) or 60k (bottom row) samples, so the best performance is at 0.001 and the worst is at 1.0 (excluding OnPolicy). (We use 0.001 instead of 0.0 to allow plotting on a log scale.)
The aggregate plots display the cumulative distribution function of these normalized RMSE values across 10 target-logging combinations, keeping the metric and the slate-space size fixed.

The pseudoinverse estimator wPI easily dominates wIPS across all experimental conditions, as can be seen in Fig. 2 (top) and in Appendix F. While wIPS and IPS are (asymptotically) unbiased even without the linearity assumption, they both suffer from a large variance caused by the slate size. The variance, and hence the mean square error, of wIPS and IPS grows exponentially with the slate size, so they perform poorly beyond the smallest slate sizes. DM performs well in some cases, especially with few samples, but often plateaus or degrades eventually as it overfits to the logging distribution, which is different from the target. While wPI does not always outperform DM methods (e.g., Fig. 2, top row, second from right), it is the only method that works robustly across all conditions, as can be seen in the aggregate plots. In general, choosing between DM and wPI is largely a matter of bias-variance tradeoff. DM can be particularly good with very small data sizes, because of its low variance, and in those settings it is often the best choice. However, PI performs comprehensively better given enough data (see Fig. 2, middle row).

In the top row of Fig. 2, we see that, as expected, wPI is biased for the ERR metric, since ERR does not satisfy linearity. The right two panels also demonstrate the effect of varying m and ℓ. While wPI deteriorates somewhat for the larger slate space, it still gives a meaningful estimate. In contrast, wIPS fails to produce any meaningful estimate in the larger slate space and its RMSE barely improves with more data. Finally, the left two plots in the top row show that wPI is fairly insensitive to the amount of stochasticity in logging, whereas DM improves with more overlap between logging and target.
4.2 Semi-synthetic policy optimization

We now show how to use the pseudoinverse estimator for off-policy optimization. We leverage pointwise learning to rank (L2R) algorithms, which learn a scoring function for query-document pairs by fitting to relevance labels. We call this the supervised approach, as it requires relevance labels. Instead of requiring relevance labels, we use the pseudoinverse estimator to convert the page-level reward into per-slot reward components, the estimates of φ_x(j, a), and these become targets for regression. Thus, the pseudoinverse estimator enables pointwise L2R to optimize whole-page metrics even without relevance labels.

Given a contextual bandit dataset {(x_i, s_i, r_i)}_{i≤n} collected by the logging policy µ, we begin by creating the estimates of φ_{x_i}: φ̂_i = Γ†_{µ,x_i} θ̂_i, turning the i-th contextual bandit example into ℓm regression examples. The trained regression model is used to create a slate, starting with the highest-scoring slot-action pair, and continuing greedily (excluding the pairs with already chosen slots or actions). This procedure is detailed in Appendix G. Note that without the linearity assumption, our imputed regression targets might not lead to the best possible learned policy, but we still expect to adapt somewhat to the slate-level metric.

We use the MSLR-WEB10K dataset [30] to compare our approach with benchmarked results [33] for NDCG@3 (i.e., ℓ = 3).[4] This dataset contains 10k queries, over 1.2M relevance judgments, and up to 908 judged documents per query. The state-of-the-art listwise L2R method on this dataset is a highly tuned variant of LambdaMART [1] (with an ensemble of 1000 trees, each with up to 70 leaves). We use the provided 5-fold split and always train on bandit data collected by uniform logging from four folds, while evaluating with supervised data on the fifth.
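The two mechanical steps of this procedure, imputing per-(slot, action) regression targets and greedily assembling a slate from predicted scores, can be sketched as follows. The matrix shapes and toy scores are illustrative; in the paper the scores would come from the trained regression model:

```python
import numpy as np

def impute_targets(gamma_pinv, slate_indicator, reward):
    """phi_hat_i = Gamma^dagger (r_i 1_{s_i}): one target per (slot, action) pair."""
    return gamma_pinv @ (reward * slate_indicator)

def greedy_slate(scores):
    """Repeatedly take the best-scoring (slot, action) pair, then exclude both."""
    scores = np.array(scores, dtype=float, copy=True)   # rows: slots, cols: actions
    l, m = scores.shape
    assignment = [None] * l
    for _ in range(l):
        j, a = np.unravel_index(np.nanargmax(scores), scores.shape)
        assignment[j] = int(a)
        scores[j, :] = np.nan     # slot j is now filled
        scores[:, a] = np.nan     # action a is now used
    return assignment

# Toy predicted phi_x(j, a) for l=3 slots and m=4 actions.
scores = [[0.9, 0.1, 0.2, 0.0],
          [0.8, 0.7, 0.3, 0.1],
          [0.5, 0.6, 0.4, 0.2]]
slate = greedy_slate(scores)
```

Note the greedy pass is a heuristic assignment, not an exact maximizer of the total score; an exact solution would be a bipartite matching between slots and actions.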
We compare our approach, titled PI-OPT, against the supervised approach (SUP), trained to predict the gains, equal to 2^{rel(x,a)} − 1, computed using annotated relevance judgments in the training folds (predicting raw relevances was inferior). Both PI-OPT and SUP train gradient boosted regression trees (with 1000 trees, each with up to 70 leaves). Additionally, we also experimented with the ERR metric. The average test-set performance (computed using ground-truth relevance judgments for each test set) across the 5 folds is reported in Table 1. Our method PI-OPT is competitive with the supervised baseline SUP for NDCG, and is substantially superior for ERR. A different transformation instead of gains might yield a stronger supervised baseline for ERR, but this only illustrates the key benefit of PI-OPT: the right pointwise targets are automatically inferred for any whole-page metric. Both PI-OPT and SUP are slightly worse than LambdaMART for NDCG@3, but they are arguably not as highly tuned, and PI-OPT only uses the slate-level metric.

Table 1: Comparison of L2R approaches optimizing NDCG@3 and ERR@3. LambdaMART is a tuned listwise approach. SUP and PI-OPT use the same pointwise L2R learner; SUP uses 8 × 10^5 relevance judgments, PI-OPT uses 10^7 samples (under uniform logging) with page-level rewards.

Metric  | LambdaMART | Uniformly random | SUP   | PI-OPT
NDCG@3  | 0.457      | 0.152            | 0.438 | 0.421
ERR@3   | n/a        | 0.096            | 0.311 | 0.321

4.3 Real-world experiments

We finally evaluate all methods using logs collected from a popular search engine. The dataset consists of search queries, for which the logging policy randomly (non-uniformly) chooses a slate of

[4] Our dataset here differs from the dataset MSLR-WEB30K used in Sec. 4.1. There our goal was to study realistic problem dimensions, e.g., constructing length-10 rankings out of 100 candidates.
Here, we use MSLR-WEB10K because it is the largest dataset with public benchmark numbers from state-of-the-art approaches (specifically LambdaMART).

size ℓ = 5 from a small pre-filtered set of documents of size m ≤ 8. After preprocessing, there are 77 unique queries and 22K total examples, meaning that for each query, we have logged impressions for many of the available slates. As before, we create the logs by sampling queries uniformly at random, and using a logging policy that samples uniformly from the slates shown for this query. We consider two page-level metrics: time-to-success (TTS) and UTILITYRATE. TTS measures the number of seconds between presenting the results and the first satisfied click from the user, defined as any click for which the user stays on the linked page for sufficiently long. The TTS value is capped and scaled to [0, 1]. UTILITYRATE is a more complex page-level metric of user satisfaction. It captures the interaction of a user with the page as a timeline of events (such as clicks) and their durations. The events are classified as revealing a positive or negative utility to the user, and their contribution is proportional to their duration. UTILITYRATE takes values in [−1, 1]. We evaluate a target policy based on a logistic regression classifier trained to predict clicks, using the predicted probabilities to score slates. We restrict the target policy to pick among the slates in our logs, so we know the ground-truth slate-level reward. Since we know the query distribution, we can calculate the target policy's value exactly, and measure RMSE relative to this true value. We compare our estimator (PI) with three baselines similar to those from Sec. 4.1: DM, IPS, and ONPOLICY. DM uses regression trees over roughly 20,000 slate-level features. Fig.
1 from the introduction shows that PI provides a consistent multiplicative improvement in RMSE over IPS, which suffers due to high variance. Starting at moderate sample sizes, PI also outperforms DM, which suffers due to substantial bias.

5 Discussion

In this paper we have introduced a new estimator (PI) for off-policy evaluation in combinatorial contextual bandits under a linearity assumption on the slate-level rewards. Our theoretical and empirical analysis demonstrates the merits of the approach. The empirical results show a favorable bias-variance tradeoff. Even on datasets and metrics where our assumptions are violated, the PI estimator typically outperforms all baselines. Its performance, especially at smaller sample sizes, could be further improved by designing doubly-robust variants [12] and possibly also by incorporating weight clipping [34]. One promising approach to relax Assumption 1 is to posit a decomposition over pairs (or tuples) of slots to capture higher-order interactions such as diversity. More generally, one could replace slate spaces by arbitrary compact convex sets, as done in linear bandits. In these settings, the pseudoinverse estimator could still be applied, but a tight sample-complexity analysis is open for future research.

References

[1] Nima Asadi and Jimmy Lin. Training efficient tree-based models for document ranking. In European Conference on Advances in Information Retrieval, 2013.
[2] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 2002.
[3] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 2002.
[4] Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis Charles, Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. Counterfactual reasoning and learning systems: The example of computational advertising.
Journal of Machine Learning Research, 2013.
[5] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 2012.
[6] Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In International Conference on Machine Learning, 2005.
[7] Nicolò Cesa-Bianchi and Gábor Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 2012.
[8] Olivier Chapelle and Ya Zhang. A dynamic Bayesian network click model for web search ranking. In International Conference on World Wide Web, 2009.
[9] Olivier Chapelle, Donald Metzler, Ya Zhang, and Pierre Grinspan. Expected reciprocal rank for graded relevance. In Conference on Information and Knowledge Management, 2009.
[10] Wei Chu, Lihong Li, Lev Reyzin, and Robert E. Schapire. Contextual bandits with linear payoff functions. In Artificial Intelligence and Statistics, 2011.
[11] Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. The price of bandit information for online optimization. In Advances in Neural Information Processing Systems, 2008.
[12] Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. In International Conference on Machine Learning, 2011.
[13] Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. Doubly robust policy evaluation and optimization. Statistical Science, 2014.
[14] Georges E. Dupret and Benjamin Piwowarski. A user browsing model to predict search engine click data from past observations. In SIGIR Conference on Research and Development in Information Retrieval, 2008.
[15] Sarah Filippi, Olivier Cappé, Aurélien Garivier, and Csaba Szepesvári. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems, 2010.
[16] Fan Guo, Chao Liu, Anitha Kannan, Tom Minka, Michael Taylor, Yi-Min Wang, and Christos Faloutsos. Click chain model in web search. In International Conference on World Wide Web, 2009.
[17] Katja Hofmann, Lihong Li, Filip Radlinski, et al. Online evaluation for information retrieval. Foundations and Trends in Information Retrieval, 2016.
[18] Daniel G. Horvitz and Donovan J. Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 1952.
[19] Satyen Kale, Lev Reyzin, and Robert E. Schapire. Non-stochastic bandit slate problems. In Advances in Neural Information Processing Systems, 2010.
[20] Ron Kohavi, Roger Longbotham, Dan Sommerfield, and Randal M. Henne. Controlled experiments on the web: survey and practical guide. Knowledge Discovery and Data Mining, 2009.
[21] Akshay Krishnamurthy, Alekh Agarwal, and Miroslav Dudík. Efficient contextual semi-bandit learning. Advances in Neural Information Processing Systems, 2016.
[22] Branislav Kveton, Zheng Wen, Azin Ashkan, and Csaba Szepesvári. Tight regret bounds for stochastic combinatorial semi-bandits. In Artificial Intelligence and Statistics, 2015.
[23] John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, 2008.
[24] John Langford, Alexander Strehl, and Jennifer Wortman. Exploration scavenging. In International Conference on Machine Learning, 2008.
[25] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In International Conference on World Wide Web, 2010.
[26] Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In International Conference on Web Search and Data Mining, 2011.
[27] Lihong Li, Imed Zitouni, and Jin Young Kim. Toward predicting the outcome of an A/B experiment for search relevance. In International Conference on Web Search and Data Mining, 2015.
[28] Kaare Brandt Petersen, Michael Syskind Pedersen, et al. The matrix cookbook. Technical University of Denmark, 2008.
[29] Lijing Qin, Shouyuan Chen, and Xiaoyan Zhu. Contextual combinatorial bandit and its application on diversified online recommendation. In International Conference on Data Mining, 2014.
[30] Tao Qin and Tie-Yan Liu. Introducing LETOR 4.0 datasets. 2013.
[31] Paat Rusmevichientong and John N. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 2010.
[32] Adith Swaminathan and Thorsten Joachims. Counterfactual risk minimization: Learning from logged bandit feedback. In International Conference on Machine Learning, 2015.
[33] Niek Tax, Sander Bockting, and Djoerd Hiemstra. A cross-benchmark comparison of 87 learning to rank methods. Information Processing and Management, 2015.
[34] Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dudík. Optimal and adaptive off-policy evaluation in contextual bandits. In International Conference on Machine Learning, 2017.
[35] Yue Wang, Dawei Yin, Luo Jie, Pengyuan Wang, Makoto Yamada, Yi Chang, and Qiaozhu Mei. Beyond ranking: Optimizing whole-page presentation. In International Conference on Web Search and Data Mining, pages 103–112, 2016.

A Proof of Proposition 1

Lemma 2. If Assumption 1 holds and µ(s|x) > 0, then V(x,s) = 1_s^T Γ†_{µ,x} θ_{µ,x}.

Proof. Fix one x for the entirety of the proof. Recall from Sec. 3.1 that V(x,s) = 1_s^T φ_x. Let N = |supp µ(·|x)| be the size of the support of µ(·|x), and let M ∈ {0,1}^{N×mℓ} denote the binary matrix with rows 1_s^T for each s ∈ supp µ(·|x). Thus Mφ_x is the vector enumerating V(x,s) over s for which µ(s|x) > 0.
Let Null(M) denote the null space of M and Π the projection onto Null(M). Let φ*_x = (I − Π)φ_x. Then clearly Mφ_x = Mφ*_x, and hence, for any s ∈ supp µ(·|x),

    V(x,s) = 1_s^T φ*_x.    (9)

We will now show that φ*_x = Γ†_{µ,x} θ_{µ,x}, which will complete the proof. Recall from Sec. 3.1 that

    θ_{µ,x} = Γ_{µ,x} φ_x.    (10)

Next note that Γ_{µ,x} is symmetric positive semidefinite by definition, so

    Null(Γ_{µ,x}) = {v : v^T Γ_{µ,x} v = 0} = {v : 1_s^T v = 0 for all s ∈ supp µ(·|x)} = Null(M),

where the first step follows by positive semidefiniteness of Γ_{µ,x}, the second step is from the definition of Γ_{µ,x}, and the final step from the definition of M. Since Null(Γ_{µ,x}) = Null(M), we have from Eq. (10) that θ_{µ,x} = Γ_{µ,x} φ*_x, but, importantly, this also implies φ*_x ⊥ Null(Γ_{µ,x}), so by the definition of the pseudoinverse, Γ†_{µ,x} θ_{µ,x} = φ*_x. This proves Lemma 2, since for any s with µ(s|x) > 0, we argued that V(x,s) = 1_s^T φ*_x = 1_s^T Γ†_{µ,x} θ_{µ,x}.

Proof of Prop. 1. Note that it suffices to analyze the expectation of a single term in the estimator, that is

    Σ_{s∈S} π(s|x_i) 1_s^T Γ†_{µ,x_i} θ̂_i.

First note that E_{(s_i,r_i)∼µ(·,·|x_i)} θ̂_i = θ_{µ,x_i}, because

    E_{(s_i,r_i)∼µ(·,·|x_i)} θ̂_i(j,a) = E_{(s_i,r_i)∼µ(·,·|x_i)} [r_i 1{s_{i,j} = a}] = θ_{µ,x_i}(j,a).

The remainder follows by Lemma 2:

    E[Σ_{s∈S} π(s|x_i) 1_s^T Γ†_{µ,x_i} θ̂_i]
      = E_{x_i∼D}[Σ_{s∈S} π(s|x_i) 1_s^T Γ†_{µ,x_i} E_{(s_i,r_i)∼µ(·,·|x_i)} θ̂_i]
      = E_{x_i∼D}[Σ_{s∈S} π(s|x_i) 1_s^T Γ†_{µ,x_i} θ_{µ,x_i}]
      = E_{x_i∼D}[Σ_{s∈S} π(s|x_i) V(x_i,s)]
      = V(π).

B Proof of Theorem 1

Proof. The proof is based on an application of Bernstein's inequality to the centered sum

    Σ_{i=1}^n [q_{π,x_i}^T Γ†_{µ,x_i} θ̂_i − V(π)].

The fact that this quantity is centered follows directly from Prop. 1.
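The identity underlying Prop. 1, namely Lemma 2's statement that the pseudoinverse recovers the slate values on the support of µ, can be sanity-checked numerically on a tiny slate space. This is an illustrative sketch (not from the paper); the product slate space, the support set, and the random φ are our choices:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
ell, m = 2, 3  # 2 slots, 3 actions per slot (product slate space)
slates = list(itertools.product(range(m), repeat=ell))

def indicator(s):
    """1_s: the (ell*m)-dimensional indicator vector of slate s."""
    v = np.zeros(ell * m)
    for j, a in enumerate(s):
        v[j * m + a] = 1.0
    return v

# Logging policy mu supported on a subset of slates; intrinsic values phi.
support = slates[:6]
p = rng.dirichlet(np.ones(len(support)))
phi = rng.normal(size=ell * m)
Gamma = sum(pi * np.outer(indicator(s), indicator(s)) for pi, s in zip(p, support))
theta = Gamma @ phi  # theta_{mu,x} = Gamma_{mu,x} phi_x

phi_star = np.linalg.pinv(Gamma) @ theta
for s in support:  # Lemma 2: slate values agree on the support of mu
    assert np.isclose(indicator(s) @ phi_star, indicator(s) @ phi)
```

Note that phi_star generally differs from phi off the support; only the slate values 1_s^T φ for s in supp µ(·|x) are recovered, exactly as the lemma states.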
We must compute both the second moment and the range to apply Bernstein's inequality. By independence, we can focus on just one term, so we will drop the subscript i. First, bound the variance:

    Var[q_{π,x}^T Γ†_{µ,x} θ̂]
      ≤ E_µ[(q_{π,x}^T Γ†_{µ,x} θ̂)²]
      = E_µ[(q_{π,x}^T Γ†_{µ,x} r 1_s)²]
      ≤ E_µ[(q_{π,x}^T Γ†_{µ,x} 1_s)²]
      = E_{x∼D}[q_{π,x}^T Γ†_{µ,x} E_{s∼µ(·|x)}[1_s 1_s^T] Γ†_{µ,x} q_{π,x}]
      = E_{x∼D}[q_{π,x}^T Γ†_{µ,x} Γ_{µ,x} Γ†_{µ,x} q_{π,x}]
      = E_{x∼D}[q_{π,x}^T Γ†_{µ,x} q_{π,x}] = σ².

Thus the per-term variance is at most σ². We now bound the range, again focusing on one term:

    |q_{π,x}^T Γ†_{µ,x} θ̂ − V(π)| ≤ |q_{π,x}^T Γ†_{µ,x} θ̂| + 1 = |q_{π,x}^T Γ†_{µ,x} r 1_s| + 1 ≤ |q_{π,x}^T Γ†_{µ,x} 1_s| + 1 ≤ ρ + 1.

The first step here is the triangle inequality, coupled with the fact that since rewards are bounded in [−1,1], so is V(π). The second step is from the definition of θ̂, while the third follows because r ∈ [−1,1]. The final step follows from the definition of ρ. Now we may apply Bernstein's inequality, which says that for any δ ∈ (0,1), with probability at least 1 − δ,

    Σ_{i=1}^n [q_{π,x_i}^T Γ†_{µ,x_i} θ̂_i − V(π)] ≤ √(2nσ² ln(2/δ)) + 2(ρ+1) ln(2/δ)/3.

The theorem follows by dividing by n.

C Pseudoinverse estimator when π = µ

In this section we show that when the target policy coincides with logging (i.e., π = µ), we have σ² = ρ = 1, i.e., the bound of Theorem 1 is independent of the number of actions and slots. Indeed, in Claim 2 we will see that the estimator actually simplifies to taking an empirical average of rewards, which are bounded in [−1,1]. Before proving Claim 2 we prove one supporting claim:

Claim 1. For any policy µ and context x, we have q_{µ,x}^T Γ†_{µ,x} 1_s = 1 for all s ∈ supp µ(·|x).

Proof. To simplify the exposition, write q and Γ instead of the more verbose q_{µ,x} and Γ_{µ,x}. The bulk of the proof is in deriving an explicit expression for Γ†.
We begin by expressing Γ in a suitable basis. Since Γ is the matrix of second moments and q is the vector of first moments of 1_s, the matrix Γ can be written as Γ = V + qq^T, where V is the covariance matrix of 1_s, i.e.,

    V := E_{s∼µ(·|x)} (1_s − q)(1_s − q)^T.

Assume that the rank of V is r and consider the eigenvalue decomposition of V,

    V = Σ_{i=1}^r λ_i u_i u_i^T = UΛU^T,

where λ_i > 0 and the vectors u_i are orthonormal; we have grouped the eigenvalues into the diagonal matrix Λ := diag(λ_1, ..., λ_r) and the eigenvectors into the matrix U := (u_1 u_2 ... u_r).

We next argue that q ∉ Range(V). To see this, note that the all-ones vector 1 is in the null space of V because, for any valid slate s, we have 1_s^T 1 = ℓ, and thus also for the convex combination q we have q^T 1 = ℓ, which means that

    1^T V 1 = E_{s∼µ(·|x)} [1^T (1_s − q)(1_s − q)^T 1] = 0.

Now, since 1 ⊥ Range(V) and q^T 1 = ℓ, we have that q ∉ Range(V). In particular, we can write q in the form

    q = Σ_{i=1}^r β_i u_i + α n = (U n) (β; α),    (11)

where α ≠ 0 and n ∈ Null(V) is a unit vector. Note that n ⊥ u_i since u_i ⊥ Null(V). Thus, the second-moment matrix Γ can be written as (using block notation [A, B; C, D] for a 2×2 block matrix)

    Γ = V + qq^T = (U n) [Λ + ββ^T, αβ; αβ^T, α²] (U n)^T.    (12)

Let Q ∈ R^{(r+1)×(r+1)} denote the middle matrix in the factorization of Eq. (12):

    Q := [Λ + ββ^T, αβ; αβ^T, α²].    (13)

This matrix is a representation of Γ with respect to the basis {u_1, ..., u_r, n}. Since q ∉ Range(V), the rank of Γ, and hence that of Q, is r + 1. Thus, Q is invertible and

    Γ† = (U n) Q^{-1} (U n)^T.    (14)

To obtain Q^{-1}, we use the following identity (see [28]):

    [A_11, A_12; A_21, A_22]^{-1} = [M^{-1}, −M^{-1} A_12 A_22^{-1}; −A_22^{-1} A_21 M^{-1}, A_22^{-1} A_21 M^{-1} A_12 A_22^{-1} + A_22^{-1}],    (15)

where M := A_11 − A_12 A_22^{-1} A_21 is the Schur complement of A_22. The identity of Eq. (15) holds whenever A_22 and its Schur complement M are both invertible.
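Before completing the algebra, Claim 1 itself can be checked numerically on a small example. This is an illustrative sketch under our own choices of slate space and logging distribution, not the paper's code:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
ell, m = 2, 3  # 2 slots, 3 actions per slot
slates = list(itertools.product(range(m), repeat=ell))

def indicator(s):
    """1_s: the (ell*m)-dimensional indicator vector of slate s."""
    v = np.zeros(ell * m)
    for j, a in enumerate(s):
        v[j * m + a] = 1.0
    return v

# A logging distribution supported on a strict subset of slates.
support = slates[:5]
p = rng.dirichlet(np.ones(len(support)))
Gamma = sum(pi * np.outer(indicator(s), indicator(s)) for pi, s in zip(p, support))
q = sum(pi * indicator(s) for pi, s in zip(p, support))  # first moments

Gdag = np.linalg.pinv(Gamma)
for s in support:
    # Claim 1: q^T Gamma^dag 1_s = 1 on the support of mu.
    assert np.isclose(q @ Gdag @ indicator(s), 1.0)
```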
In the block representation of Eq. (13), we have A_22 = α² ≠ 0 and M = (Λ + ββ^T) − (αβ) α^{-2} (αβ^T) = Λ, so Eq. (15) can be applied to obtain Q^{-1}:

    Q^{-1} = [Λ + ββ^T, αβ; αβ^T, α²]^{-1}
           = [Λ^{-1}, −Λ^{-1}(αβ)α^{-2}; −α^{-2}(αβ^T)Λ^{-1}, α^{-2}(αβ^T)Λ^{-1}(αβ)α^{-2} + α^{-2}]
           = [Λ^{-1}, −α^{-1}Λ^{-1}β; −α^{-1}β^TΛ^{-1}, α^{-2}(1 + β^TΛ^{-1}β)].    (16)

Next, we evaluate Γ†q, using the factorizations in Eqs. (14) and (11), and substituting Eq. (16) for Q^{-1}:

    Γ†q = (U n) Q^{-1} (U n)^T (U n) (β; α)
        = (U n) Q^{-1} (β; α)
        = (U n) (Λ^{-1}β − Λ^{-1}β; −α^{-1}β^TΛ^{-1}β + α^{-1}(1 + β^TΛ^{-1}β))
        = (U n) (0; α^{-1})
        = α^{-1} n.

To finish the proof, we consider any s ∈ supp µ(·|x) and the decomposition of 1_s in the basis {u_1, ..., u_r, n}. First, note that (1_s − q) ⊥ Null(V) since

    Null(V) = {v : E_{s∼µ(·|x)} [((1_s − q)^T v)²] = 0} = {v : (1_s − q)^T v = 0 for all s ∈ supp µ(·|x)}.

Thus, (1_s − q) ∈ Range(V). Therefore, we obtain

    q^T Γ†_{µ,x} 1_s = α^{-1} n^T 1_s = α^{-1} n^T (1_s − q) + α^{-1} n^T q = 0 + α^{-1} α = 1,

where the third equality follows because (1_s − q) ⊥ n and the decomposition in Eq. (11) shows that n^T q = α.

Claim 2. If π = µ then σ² = ρ = 1 and V̂_PI(π) = V̂_PI(µ) = (1/n) Σ_{i=1}^n r_i.

Proof. From Claim 1,

    q_{µ,x}^T Γ†_{µ,x} q_{µ,x} = E_{s∼µ(·|x)} [q_{µ,x}^T Γ†_{µ,x} 1_s] = 1.

Taking the expectation over x then yields σ² = 1. The equality ρ = 1 follows immediately from plugging Claim 1 into the definition of ρ. The final statement of Claim 2 follows by applying Claim 1 to a single term of V̂_PI(µ): q_{µ,x_i}^T Γ†_{µ,x_i} r_i 1_{s_i} = r_i.

D A product slate space under a product logging distribution

Proposition 2. Consider the product slate space where S(x) = A_1(x) × ··· × A_ℓ(x), and assume that the logging policy picks any s ∈ S(x) with non-zero probability and factorizes across the slots as µ(s|x) = Π_j µ(s_j|x).
For any policy π, any s ∈ S(x), and any r ∈ [−1,1], we then have

    q_{π,x}^T Γ†_{µ,x} r 1_s = r · (Σ_{j=1}^ℓ π(s_j|x)/µ(s_j|x) − ℓ + 1).    (17)

Proof. The proof uses Claim 1 and the identities introduced in its proof. As in the proof of Claim 1, write q and Γ instead of q_{µ,x} and Γ_{µ,x}, and let V := E_{s∼µ(·|x)} (1_s − q)(1_s − q)^T. Thus, Γ = V + qq^T. It suffices to show that for any s, s' ∈ S(x),

    1_{s'}^T Γ† 1_s = Σ_{j=1}^ℓ 1{s'_j = s_j}/µ(s_j|x) − ℓ + 1.    (18)

Pick s, s' ∈ S(x) = supp µ(·|x). By Claim 1, we have q^T Γ† 1_s = 1_{s'}^T Γ† q = q^T Γ† q = 1, so

    1_{s'}^T Γ† 1_s = (1_{s'} − q)^T Γ† (1_s − q) + q^T Γ† 1_s + 1_{s'}^T Γ† q − q^T Γ† q
                    = (1_{s'} − q)^T Γ† (1_s − q) + 1.    (19)

Similar to the reasoning at the end of the proof of Claim 1, we know that (1_s − q) ∈ Range(V) and (1_{s'} − q) ∈ Range(V). The factorization of Γ† in Eqs. (14) and (16) therefore yields

    (1_{s'} − q)^T Γ† (1_s − q) = (1_{s'} − q)^T UΛ^{-1}U^T (1_s − q) = (1_{s'} − q)^T V† (1_s − q),    (20)

where the last step follows from the fact that V = UΛU^T, and so V† = UΛ^{-1}U^T. To finish the proof, we study the structure of V and V†. First, let q_j denote the block of q corresponding to the j-th slot. Its a-th entry corresponds to the probability µ(s_j = a|x). Since the values s_j are conditionally independent given x, the covariance matrix V takes the block-diagonal form

    V = diag_{j=1,...,ℓ} V_j, where V_j = diag_{a∈A_j(x)}(q_{j,a}) − q_j q_j^T

is the covariance matrix of the multinomial distribution described by q_j. Thus,

    V† = diag_{j=1,...,ℓ} V†_j.    (21)

It can be directly verified that the pseudoinverse of V_j takes the form

    V†_j = P_j diag_{a∈A_j(x)}(q_{j,a}^{-1}) P_j,    (22)

where P_j := I_j − 1_j 1_j^T / m_j, I_j is the m_j × m_j identity matrix, and 1_j is the m_j-dimensional all-ones vector. To verify that Eq.
(22) holds, first note that P_j is the projection matrix onto Range(V_j). Then set V'_j := P_j diag_{a∈A_j(x)}(q_{j,a}^{-1}) P_j, and directly verify that V'_j V_j = P_j and V_j V'_j = P_j. The first identity can be verified as follows:

    V'_j V_j = P_j diag_{a∈A_j(x)}(q_{j,a}^{-1}) P_j V_j = P_j diag_{a∈A_j(x)}(q_{j,a}^{-1}) V_j = P_j (I_j − 1_j q_j^T) = P_j.

The second identity follows similarly. Combining Eqs. (20), (21) and (22) yields

    (1_{s'} − q)^T V† (1_s − q)
      = Σ_{j=1}^ℓ (1_{s'_j} − q_j)^T V†_j (1_{s_j} − q_j)
      = Σ_{j=1}^ℓ (1_{s'_j} − q_j)^T P_j diag_{a∈A_j(x)}(q_{j,a}^{-1}) P_j (1_{s_j} − q_j)
      = Σ_{j=1}^ℓ (1_{s'_j} − q_j)^T diag_{a∈A_j(x)}(q_{j,a}^{-1}) (1_{s_j} − q_j)
      = Σ_{j=1}^ℓ (1{s'_j = s_j} q_{j,s_j}^{-1} − 1)
      = Σ_{j=1}^ℓ 1{s'_j = s_j}/µ(s_j|x) − ℓ.

Plugging this back into Eq. (19) then proves Eq. (18).

E Proof of Corollary 1

For a given logging policy µ and context x, let

    ρ̄_{µ,x} := sup_{s∈supp µ(·|x)} 1_s^T Γ†_{µ,x} 1_s.

This quantity can be viewed as a norm of Γ†_{µ,x} with respect to the set of slates chosen by µ with non-zero probability. It can be used to bound σ² and ρ, and thus to bound the error of V̂_PI:

Proposition 3. For any logging policy µ and target policy π that is absolutely continuous with respect to µ, we have σ² ≤ ρ ≤ sup_x ρ̄_{µ,x}.

Proof. Recall that

    σ² = E_{x∼D}[q_{π,x}^T Γ†_{µ,x} q_{π,x}],    ρ = sup_x sup_{s∈supp µ(·|x)} |q_{π,x}^T Γ†_{µ,x} 1_s|.

To see that σ² ≤ ρ, note that

    q_{π,x}^T Γ†_{µ,x} q_{π,x} = E_{s∼π(·|x)}[q_{π,x}^T Γ†_{µ,x} 1_s] ≤ ρ,

where the last inequality follows by the absolute continuity of π with respect to µ. It remains to show that ρ ≤ sup_x ρ̄_{µ,x}. First, by positive semi-definiteness of Γ†_{µ,x} and from the definition of ρ̄_{µ,x}, we have that for any slates s, s' ∈ supp µ(·|x) and any z ∈ {−1, 1},

    z 1_{s'}^T Γ†_{µ,x} 1_s ≤ (1_s^T Γ†_{µ,x} 1_s + 1_{s'}^T Γ†_{µ,x} 1_{s'})/2 ≤ max{1_s^T Γ†_{µ,x} 1_s, 1_{s'}^T Γ†_{µ,x} 1_{s'}} ≤ ρ̄_{µ,x}.
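The closed form of Eq. (18) for product logging distributions can be confirmed numerically against a direct pseudoinverse computation. This is an illustrative sketch with our own small slate space and random per-slot marginals, not the paper's code:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
ell, m = 2, 3
mu_slots = [rng.dirichlet(np.ones(m)) for _ in range(ell)]  # per-slot marginals
slates = list(itertools.product(range(m), repeat=ell))

def indicator(s):
    """1_s: the (ell*m)-dimensional indicator vector of slate s."""
    v = np.zeros(ell * m)
    for j, a in enumerate(s):
        v[j * m + a] = 1.0
    return v

Gamma = np.zeros((ell * m, ell * m))
for s in slates:  # mu(s|x) = prod_j mu(s_j|x): product logging distribution
    p = np.prod([mu_slots[j][a] for j, a in enumerate(s)])
    Gamma += p * np.outer(indicator(s), indicator(s))

Gdag = np.linalg.pinv(Gamma)
for s in slates:
    for sp in slates:
        # Eq. (18): sum_j 1{s'_j = s_j} / mu(s_j|x) - ell + 1
        closed = sum((sp[j] == s[j]) / mu_slots[j][s[j]] for j in range(ell)) - ell + 1
        assert np.isclose(indicator(sp) @ Gdag @ indicator(s), closed)
```

Multiplying by a reward r and averaging over s' ∼ π(·|x) recovers Eq. (17), so for product logging the PI estimator reduces to per-slot importance weights and needs no explicit pseudoinverse.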
Therefore, for any π absolutely continuous with respect to µ and any s ∈ supp µ(·|x), we have

    |q_{π,x}^T Γ†_{µ,x} 1_s| = max_{z∈{−1,1}} E_{s'∼π(·|x)}[z 1_{s'}^T Γ†_{µ,x} 1_s] ≤ ρ̄_{µ,x}.

Taking a supremum over x and s ∈ supp µ(·|x), we obtain ρ ≤ sup_x ρ̄_{µ,x}.

We next derive bounds on ρ̄_{µ,x} for uniformly random policies in the ranking example. Then we prove a translation theorem, which allows translating the bound for uniform distributions into a bound for ε-uniform distributions. Finally, we put these results together to prove Corollary 1.

E.1 Uniform logging distribution over rankings

Let 1_j ∈ R^{ℓm} be the vector that is all ones on the actions in the j-th position and zeros elsewhere. Similarly, let 1_a ∈ R^{ℓm} be the vector that is all ones on the action a in all positions and zeros elsewhere. Finally, let 1 ∈ R^{ℓm} be the all-ones vector. We also use I_j = diag(1_j) to denote the diagonal matrix with all ones on the actions in the j-th position and zeros elsewhere.

Proposition 4. Consider the ranking setting where for each x there is a set A(x) such that A_j(x) = A(x), and where all slates s ∈ A(x)^ℓ without repetitions are legal. Let ν denote the uniform logging policy over these slates. If ℓ < m, then ρ̄_{ν,x} = mℓ − ℓ + 1 and

    Γ†_{ν,x} = (1/ℓ² − (m−1)/(m(m−ℓ))) · 11^T + (m−1) I − ((m−1)/m) Σ_j 1_j 1_j^T + ((m−1)/(m−ℓ)) Σ_a 1_a 1_a^T,

and for ℓ = m, we have ρ̄_{ν,x} = m² − 2m + 2 and

    Γ†_{ν,x} = (1/m) · 11^T + (m−1) I − ((m−1)/m) Σ_j 1_j 1_j^T − ((m−1)/m) Σ_a 1_a 1_a^T.

For ℓ = m, we have for any policy π, any s ∈ S(x), and any r ∈ [−1,1] that

    q_{π,x}^T Γ†_{ν,x} r 1_s = r · (Σ_{j=1}^ℓ π(s_j|x)/(1/(m−1)) − m + 2).    (23)

Proof. Throughout the proof we write Γ instead of the more verbose Γ_{ν,x}. Note that for ranking and the uniform distribution we have

    Γ(j,a; k,a') = 1/m           if j = k and a = a',
                   1/(m(m−1))    if j ≠ k and a ≠ a',
                   0             otherwise.
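These entries, and the value ρ̄_{ν,x} = mℓ − ℓ + 1 claimed in Prop. 4 for ℓ < m, can be checked by direct enumeration of rankings. This is an illustrative sketch on a small instance, not the paper's code:

```python
import itertools
import numpy as np

ell, m = 2, 4  # rankings: ell distinct actions out of m, with ell < m
slates = list(itertools.permutations(range(m), ell))

def indicator(s):
    """1_s: the (ell*m)-dimensional indicator vector of slate s."""
    v = np.zeros(ell * m)
    for j, a in enumerate(s):
        v[j * m + a] = 1.0
    return v

Gamma = sum(np.outer(indicator(s), indicator(s)) for s in slates) / len(slates)

# The entries of Gamma match the displayed formula.
for j, k in itertools.product(range(ell), repeat=2):
    for a, b in itertools.product(range(m), repeat=2):
        if j == k and a == b:
            expect = 1 / m
        elif j != k and a != b:
            expect = 1 / (m * (m - 1))
        else:
            expect = 0.0
        assert np.isclose(Gamma[j * m + a, k * m + b], expect)

# And rho_bar = m*ell - ell + 1 as in Prop. 4 (here 4*2 - 2 + 1 = 7).
Gdag = np.linalg.pinv(Gamma)
rho_bar = max(indicator(s) @ Gdag @ indicator(s) for s in slates)
assert np.isclose(rho_bar, m * ell - ell + 1)
```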
Thus, for any z,

    z^T Γ z = (1/m) Σ_{j,a} z²_{j,a} + (1/(m(m−1))) Σ_{j≠k, a≠a'} z_{j,a} z_{k,a'}
            = (1/m) ||z||²_2 + (1/(m(m−1))) [(z^T 1)² − Σ_j (z^T 1_j)² − Σ_a (z^T 1_a)² + ||z||²_2]
            = (1/(m(m−1))) [(z^T 1)² − Σ_j (z^T 1_j)² − Σ_a (z^T 1_a)² + m ||z||²_2].    (24)

Let 1_J ∈ R^ℓ and 1_A ∈ R^m be all-ones vectors in the respective spaces, and let I_J ∈ R^{ℓ×ℓ} and I_A ∈ R^{m×m} be identity matrices in the respective spaces. We can rewrite the quadratic form described by Γ as

    m(m−1) Γ = 11^T − Σ_j 1_j 1_j^T − Σ_a 1_a 1_a^T + m I
             = (1_J 1_J^T) ⊗ (1_A 1_A^T) − I_J ⊗ (1_A 1_A^T) − (1_J 1_J^T) ⊗ I_A + m · I_J ⊗ I_A
             = ℓm · (1_J 1_J^T/ℓ) ⊗ (1_A 1_A^T/m) − m · I_J ⊗ (1_A 1_A^T/m) − ℓ · (1_J 1_J^T/ℓ) ⊗ I_A + m · I_J ⊗ I_A
             = ℓ(m−1) · (1_J 1_J^T/ℓ) ⊗ (1_A 1_A^T/m) − m · I_J ⊗ (1_A 1_A^T/m − I_A) − ℓ · (1_J 1_J^T/ℓ) ⊗ (I_A − 1_A 1_A^T/m)
             = ℓ(m−1) · (1_J 1_J^T/ℓ) ⊗ (1_A 1_A^T/m) + m · (I_J − 1_J 1_J^T/ℓ) ⊗ (I_A − 1_A 1_A^T/m) + (m−ℓ) · (1_J 1_J^T/ℓ) ⊗ (I_A − 1_A 1_A^T/m).    (25)

Next, we would like to argue that Eq. (25) is an eigendecomposition. For this, we just need to show that each of the three Kronecker products in Eq. (25) equals a projection matrix in R^{ℓm}, and that the ranges of the projection matrices are orthogonal. The first property follows because if P_1 and P_2 are projection matrices then so is P_1 ⊗ P_2. The second property follows because for P_1, P'_1 (square, of the same dimension) and P_2, P'_2 (square, of the same dimension) such that either the ranges of P_1 and P'_1 are orthogonal or the ranges of P_2 and P'_2 are orthogonal, the ranges of P_1 ⊗ P_2 and P'_1 ⊗ P'_2 are orthogonal. Now we are ready to derive the pseudoinverse. We distinguish two cases.

Case ℓ < m: We directly invert the eigenvalues in Eq.
(25) to obtain

    Γ† = (m/ℓ) · (1_J 1_J^T/ℓ) ⊗ (1_A 1_A^T/m) + (m−1) · (I_J − 1_J 1_J^T/ℓ) ⊗ (I_A − 1_A 1_A^T/m) + ((m−1)/(1 − ℓ/m)) · (1_J 1_J^T/ℓ) ⊗ (I_A − 1_A 1_A^T/m)
       = (1/ℓ²) · 11^T + (m−1) · (I_J + 1_J 1_J^T/(m−ℓ)) ⊗ (I_A − 1_A 1_A^T/m)
       = (1/ℓ² − (m−1)/(m(m−ℓ))) · 11^T + (m−1) I − ((m−1)/m) Σ_j 1_j 1_j^T + ((m−1)/(m−ℓ)) Σ_a 1_a 1_a^T.

Recall that Eq. (25) involves m(m−1)Γ. To obtain ρ̄, we evaluate 1_{s'}^T Γ† 1_s for any s, s' ∈ S(x). We write A_s for the set of actions appearing on the slate s:

    1_{s'}^T Γ† 1_s
      = (1/ℓ² − (m−1)/(m(m−ℓ))) (1_{s'}^T 1)(1^T 1_s) + (m−1) 1_{s'}^T 1_s − ((m−1)/m) Σ_j (1_{s'}^T 1_j)(1_j^T 1_s) + ((m−1)/(m−ℓ)) Σ_a (1_{s'}^T 1_a)(1_a^T 1_s)
      = (1/ℓ² − (m−1)/(m(m−ℓ))) ℓ² + Σ_j 1{s'_j = s_j}/(1/(m−1)) − ((m−1)/m) ℓ + ((m−1)/(m−ℓ)) Σ_a 1{a ∈ A_{s'}} 1{a ∈ A_s}    (26)
      = 1 − (m−1)(ℓ² + mℓ − ℓ²)/(m(m−ℓ)) + Σ_j 1{s'_j = s_j}/(1/(m−1)) + ((m−1)/(m−ℓ)) |A_{s'} ∩ A_s|
      = 1 − ((m−1)/(m−ℓ)) ℓ + Σ_j 1{s'_j = s_j}/(1/(m−1)) + ((m−1)/(m−ℓ)) |A_s ∩ A_{s'}|,

where Eq. (26) follows because 1^T 1_s = ℓ and 1_j^T 1_s = 1 for any valid slate s. By setting s' = s, we obtain ρ̄ = 1 + ℓ(m−1) = mℓ − ℓ + 1.

Case ℓ = m: Again, we directly invert the eigenvalues in Eq. (25) to obtain

    Γ† = (1/ℓ²) · 11^T + (m−1) · (I_J − 1_J 1_J^T/ℓ) ⊗ (I_A − 1_A 1_A^T/m)
       = (1/m) · 11^T + (m−1) I − ((m−1)/m) Σ_j 1_j 1_j^T − ((m−1)/m) Σ_a 1_a 1_a^T.

We finish the theorem by evaluating 1_{s'}^T Γ† 1_s:

    1_{s'}^T Γ† 1_s
      = (1/m) (1_{s'}^T 1)(1^T 1_s) + (m−1) 1_{s'}^T 1_s − ((m−1)/m) Σ_j (1_{s'}^T 1_j)(1_j^T 1_s) − ((m−1)/m) Σ_a (1_{s'}^T 1_a)(1_a^T 1_s)
      = (1/m) m² + Σ_j 1{s'_j = s_j}/(1/(m−1)) − ((m−1)/m) m − ((m−1)/m) m
      = Σ_j 1{s'_j = s_j}/(1/(m−1)) − m + 2.

We obtain ρ̄ = m² − 2m + 2 by setting s' = s, and Eq. (23) by taking an expectation over s' ∼ π(·|x).

E.2 Proof of Corollary 1

We need one last technical result in order to establish the corollary.

Claim 3.
Let A, B be two symmetric positive semi-definite matrices with Null(A) ⊆ Null(B). Then

    max_{z ⊥ Null(B), z ≠ 0} (z^T B† z)/(z^T A† z) ≤ max_{z ⊥ Null(B), z ≠ 0} (z^T A z)/(z^T B z).

We now provide the proof of Corollary 1, following which we will prove Claim 3.

Proof of Corollary 1. The corollary follows by Prop. 3. The key step is to bound ρ̄_{µ_ε,x}, for which we invoke Claim 3. Specifically, we apply the claim with A = Γ_{ν,x} and B = Γ_{µ_ε,x}. Since

    Γ_{µ_ε,x} = (1−ε) Γ_{µ,x} + ε Γ_{ν,x} = (1−ε) E_{s∼µ(·|x)}[1_s 1_s^T] + ε E_{s∼ν(·|x)}[1_s 1_s^T],

we observe that Null(Γ_{ν,x}) = Null(Γ_{µ_ε,x}), because the support of µ(·|x) is always included in the support of ν(·|x). Now we can invoke Claim 3 with these choices to see that

    ρ̄_{µ_ε,x} = sup_{s∈supp µ_ε(·|x)} 1_s^T Γ†_{µ_ε,x} 1_s
      ≤ (sup_{s∈supp ν(·|x)} 1_s^T Γ†_{ν,x} 1_s) · (sup_{s∈supp µ_ε(·|x)} (1_s^T Γ†_{µ_ε,x} 1_s)/(1_s^T Γ†_{ν,x} 1_s))
      ≤ ρ̄_{ν,x} · max_{z ⊥ Null(Γ_{µ_ε,x}), z≠0} (z^T Γ†_{µ_ε,x} z)/(z^T Γ†_{ν,x} z)
      ≤ ρ̄_{ν,x} · max_{z ⊥ Null(Γ_{µ_ε,x}), z≠0} (z^T Γ_{ν,x} z)/(z^T Γ_{µ_ε,x} z)
      ≤ ρ̄_{ν,x} · (z^T Γ_{ν,x} z)/(ε z^T Γ_{ν,x} z) = ρ̄_{ν,x}/ε.

For the product slate space, using Eq. (18), which was proved within the proof of Prop. 2, we have

    ρ̄_{ν,x} = sup_{s∈supp ν(·|x)} 1_s^T Γ†_{ν,x} 1_s = sup_{s∈supp ν(·|x)} (Σ_{j=1}^ℓ 1/(1/m) − ℓ + 1) = ℓm − ℓ + 1.

For the ranking slate space, using Prop. 4, we also have ρ̄_{ν,x} = O(ℓm), so for both the product slate space and ranking slate space, we obtain ρ̄_{µ_ε,x} = O(ℓm/ε). Finally, plugging this upper bound and Prop. 3 into the statement of Theorem 1 completes the proof.

We finally prove Claim 3.

Proof of Claim 3. Let U be the square root of the matrix A, i.e., U is a symmetric positive semidefinite matrix with the same eigenvectors as A, but with eigenvalues that are the square roots of the corresponding eigenvalues of A. Similarly, let V be the square root of the matrix B.
Thus, we have A = UU and A† = U†U†, and similarly for B and V. Let Π_A = U†U = UU† denote the projection onto the range of A, and Π_B the projection onto the range of B. Since Null(A) ⊆ Null(B), we have Range(A) ⊇ Range(B). We prove the claim as follows:

    max_{z ⊥ Null(B), z≠0} (z^T B† z)/(z^T A† z)
      = max_{z ⊥ Null(B), z≠0} (z^T U† U B† U U† z)/(z^T U† U† z)    (27)
      ≤ max_{y≠0} (y^T U B† U y)/(y^T y)    (28)
      = max_{y≠0} (y^T U V† V† U y)/(y^T y) = max_{||y||_2=1} ||V† U y||²_2    (29)
      = max_{||y||_2=1} ||U V† y||²_2    (30)
      = max_{y≠0} (y^T V† U U V† y)/(y^T y) = max_{y ⊥ Null(B), y≠0} (y^T V† A V† y)/(y^T y)    (31)
      = max_{z ⊥ Null(B), z≠0} (z^T V V† A V† V z)/(z^T V V z)    (32)
      = max_{z ⊥ Null(B), z≠0} (z^T A z)/(z^T B z).    (33)

In Eq. (27) we substitute U†U† = A† and also use the fact that UU† = Π_A and Π_A z = z because z ∈ Range(B) ⊆ Range(A). Eq. (28) is obtained by substituting y = U†z and relaxing the maximization to be over y ≠ 0. In Eq. (29) we substitute V†V† = B†. In Eq. (30) we use the fact that the operator norms of a matrix and its transpose are equal. In Eq. (31) we substitute A = UU and note that it suffices to consider y ⊥ Null(B) because Null(V†AV†) = Null(B). In Eq. (32) we use the fact that z ↦ Vz is a bijection on Range(B), which is the orthogonal complement of Null(B), so we can substitute Vz = y. Finally, in Eq. (33) we substitute B = VV and use the fact that V†V = Π_B and Π_B z = z because z ∈ Range(B).
[Four log-log RMSE panels omitted; x-axis: number of logged samples (n), y-axis: log10(RMSE); legend: OnPolicy, wPI, wIPS, PI, IPS, DM: lasso, DM: tree.]

Figure 3: RMSE of value estimators for an increasing logged dataset under a uniform logging policy with slate space (10, 5). Target is lasso-body (top panel) and tree-body (bottom panel). Metrics are NDCG (left) and ERR (right).

F Supplementary plots for off-policy evaluation on semi-synthetic data

We experimented with several configurations of slate spaces, logging and target policies, and whole-page metrics in the semi-synthetic evaluation setup. This section details the plots for all configurations. The key parameters were:

1. Metric: NDCG or ERR. NDCG satisfies the linearity assumption, while ERR does not.
2. Slate space: (m, ℓ) = (100, 10) or (10, 5).
3. Logging policy: Unif, lasso-title, or tree-title.
4. Target policy: lasso-body or tree-body.
5. Temperature α: uniform, slightly peaked, or very peaked. Uniform corresponds to α = 0. For the small slate spaces with (m, ℓ) = (10, 5), α = 1.0 creates a slightly peaked logging distribution, while α = 2.0 creates a severely peaked logging distribution. For the larger slate spaces with (m, ℓ) = (100, 10), α = 0.5 is moderately peaked while α = 1.0 is severely peaked.

The plots in Figures 3–12 detail the top row of Figure 2 for all combinations of these parameters.
Figure 4: RMSE of value estimators for an increasing logged dataset under a uniform logging policy with slate space (100, 10). Target is lasso_body (top panel) and tree_body (bottom panel). Metrics are NDCG (left) and ERR (right).
Figure 5: RMSE of value estimators for an increasing logged dataset under a moderately peaked logging policy (lasso_title, α = 1.0) with slate space (10, 5). Target is lasso_body (top panel) and tree_body (bottom panel). Metrics are NDCG (left) and ERR (right).

Figure 6: RMSE of value estimators for an increasing logged dataset under a moderately peaked logging policy (tree_title, α = 1.0) with slate space (10, 5). Target is lasso_body (top panel) and tree_body (bottom panel). Metrics are NDCG (left) and ERR (right).
Figure 7: RMSE of value estimators for an increasing logged dataset under a severely peaked logging policy (lasso_title, α = 2.0) with slate space (10, 5). Target is lasso_body (top panel) and tree_body (bottom panel). Metrics are NDCG (left) and ERR (right).

Figure 8: RMSE of value estimators for an increasing logged dataset under a severely peaked logging policy (tree_title, α = 2.0) with slate space (10, 5). Target is lasso_body (top panel) and tree_body (bottom panel). Metrics are NDCG (left) and ERR (right).

Figure 9: RMSE of value estimators for an increasing logged dataset under a moderately peaked logging policy (lasso_title, α = 0.5) with slate space (100, 10). Target is lasso_body (top panel) and tree_body (bottom panel). Metrics are NDCG (left) and ERR (right).
Figure 10: RMSE of value estimators for an increasing logged dataset under a moderately peaked logging policy (tree_title, α = 0.5) with slate space (100, 10). Target is lasso_body (top panel) and tree_body (bottom panel). Metrics are NDCG (left) and ERR (right).

Figure 11: RMSE of value estimators for an increasing logged dataset under a severely peaked logging policy (lasso_title, α = 1.0) with slate space (100, 10). Target is lasso_body (top panel) and tree_body (bottom panel). Metrics are NDCG (left) and ERR (right).
Figure 12: RMSE of value estimators for an increasing logged dataset under a severely peaked logging policy (tree_title, α = 1.0) with slate space (100, 10). Target is lasso_body (top panel) and tree_body (bottom panel). Metrics are NDCG (left) and ERR (right).

G Off-policy optimization

For the off-policy optimization experiments, we compare two methods, SUP and PI-OPT, on the MSLR-WEB10K dataset. Both SUP and PI-OPT use all the features f(x, a) for query-document pairs (x, a) in the training fold. SUP uses regression targets as outlined in Section 4.2. We also experimented with regression to raw relevance judgments; this variant is denoted SUP-rel. For PI-OPT, each query-document-position triplet produces a regression example (x, a, j) with a concatenated feature vector f(x, a, j) := [f(x, a); 1_j], where 1_j is an l-dimensional one-hot encoding of position j. Every logged sample with query x yields an estimate φ̂_j(x, a) for every candidate document a and position j. These are our natural regression targets.
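The position-augmented feature construction above can be sketched in a few lines (the helper name and array representation are ours, not the paper's):

```python
import numpy as np

def position_features(f_xa, j, l):
    """Concatenate the query-document features f(x, a) with a one-hot
    encoding of position j (0-indexed), giving the PI-OPT regression
    input f(x, a, j) := [f(x, a); 1_j]."""
    onehot = np.zeros(l)
    onehot[j] = 1.0
    return np.concatenate([f_xa, onehot])
```

For example, position_features(np.array([0.5, 2.0]), 1, 3) yields [0.5, 2.0, 0.0, 1.0, 0.0].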
There is one further optimization that is computationally tractable when the set of queries {x} is finite. By averaging all estimated φ̂_j(x, a) for a particular query, we can create a lower-variance regression target that remains an unbiased estimate of φ_j(x, a).

Both SUP and PI-OPT employ gradient-boosted regression trees (ensembles of n = 1000 trees, each with up to 70 leaves) to predict their corresponding regression targets. With a trained model, SUP constructs slates in a straightforward way: for any input query x, we score all candidate documents a ∈ A(x) using the trained model f(x, a) ↦ score(a) and sort the scores in descending order. Rankings are constructed using the top-l scoring candidates in order. For PI-OPT, we score every document-position pair (a, j) ∈ A(x) × {1, ..., l} via f(x, a, j) ↦ score(a, j). We then greedily pick the highest-scoring pair (a, j) and insert document a into slot j of the slate. After eliminating all invalid pairs (∗, j) and (a, ∗), we repeat this greedy procedure until all positions in the slate are filled. This gives us a computationally efficient, albeit approximate, maximizer of argmax_s Σ_{j=1}^{l} score(s_j, j).

H Overlap between base-rankers

We use four different base-rankers, lasso_title, lasso_body, tree_title, and tree_body, in our semi-synthetic experiments to instantiate logging and target policies. In Table 2, we report how similar the top-l rankings (l = 10) retrieved by these rankers are. We report two metrics for every pair of rankers: the average fraction of documents retrieved in common by both rankers (and its standard deviation), and the Kendall's tau computed over the union of documents retrieved by either ranker (documents retrieved by one ranker but not the other are assumed to be ranked at l + 1 in the other ranking).
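These two similarity measures can be sketched as follows. This is a hypothetical implementation, not the paper's code: the paper does not spell out tie handling, so the convention below (tau counted over strictly ordered pairs, normalized by all pairs, with missing documents tied at rank l + 1) is an assumption:

```python
from itertools import combinations

def overlap(r1, r2):
    """Fraction of documents retrieved by both rankers (r1, r2: top-l lists)."""
    return len(set(r1) & set(r2)) / len(r1)

def kendall_tau(r1, r2):
    """Kendall's tau over the union of two top-l lists; a document missing
    from one list is treated as ranked at position l + 1 there."""
    l = len(r1)
    def rank(r, d):
        return r.index(d) if d in r else l  # 0-indexed, so l means "l + 1"
    docs = sorted(set(r1) | set(r2))
    conc = disc = 0
    for d1, d2 in combinations(docs, 2):
        s = (rank(r1, d1) - rank(r1, d2)) * (rank(r2, d1) - rank(r2, d2))
        if s > 0:
            conc += 1
        elif s < 0:
            disc += 1
    total = len(docs) * (len(docs) - 1) // 2
    return (conc - disc) / total if total else 0.0
```

Identical lists give tau = 1.0, fully reversed lists give -1.0, matching the intuition that the negative values in Table 2 indicate substantially disagreeing rankers.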
Table 2: Differences between the base-rankers lasso_title, lasso_body, tree_title, and tree_body, as measured by average overlap of the retrieved document sets and by Kendall's tau.

Pair                       | Overlap (Avg. ± Std.Dev.) | Kendall's τ (Avg. ± Std.Dev.)
(lasso_title, tree_title)  | 0.523 ± 0.216             | -0.041 ± 0.307
(lasso_body, tree_body)    | 0.426 ± 0.236             | -0.221 ± 0.322
(tree_title, tree_body)    | 0.270 ± 0.198             | -0.394 ± 0.236
(tree_title, lasso_body)   | 0.274 ± 0.203             | -0.405 ± 0.239
(lasso_title, tree_body)   | 0.250 ± 0.199             | -0.421 ± 0.231
(lasso_title, lasso_body)  | 0.262 ± 0.202             | -0.415 ± 0.233
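As a final aside, the greedy slate-assembly step of PI-OPT from Appendix G is compact enough to sketch; the function name and score-matrix representation are ours, not the paper's:

```python
import numpy as np

def greedy_slate(scores):
    """Greedily fill an l-slot slate from an (m, l) matrix of score(a, j)
    values: repeatedly take the best remaining (document, slot) pair and
    invalidate its row (document used) and column (slot filled).
    Approximates argmax_s sum_j score(s_j, j)."""
    scores = np.array(scores, dtype=float)
    m, l = scores.shape
    slate = [None] * l
    for _ in range(l):
        a, j = np.unravel_index(np.argmax(scores), scores.shape)
        slate[j] = int(a)
        scores[a, :] = -np.inf  # document a can no longer be placed
        scores[:, j] = -np.inf  # slot j can no longer be filled
    return slate
```

For example, greedy_slate([[3, 1], [2, 5], [0, 0]]) first places document 1 in slot 2 (score 5), then document 0 in slot 1, returning [0, 1].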