Importance Sampling with Unequal Support
Philip S. Thomas and Emma Brunskill
Carnegie Mellon University

Abstract

Importance sampling is often used in machine learning when training and testing data come from different distributions. In this paper we propose a new variant of importance sampling that can reduce the variance of importance sampling-based estimates by orders of magnitude when the supports of the training and testing distributions differ. After motivating and presenting our new importance sampling estimator, we provide a detailed theoretical analysis that characterizes both its bias and variance relative to the ordinary importance sampling estimator (in various settings, which include cases where ordinary importance sampling is biased, while our new estimator is not, and vice versa). We conclude with an example of how our new importance sampling estimator can be used to improve estimates of how well a new treatment policy for diabetes will work for an individual, using only data from when the individual used a previous treatment policy.

Introduction

A key challenge in artificial intelligence is to estimate the expectation of a random variable. Instances of this problem arise in areas ranging from planning and decision making (e.g., estimating the expected sum of rewards produced by a policy for decision making under uncertainty) to probabilistic inference. Although the estimation of an expected value is straightforward if we can generate many independent and identically distributed (i.i.d.) samples from the relevant probability distribution (which we refer to as the target distribution), we may not have generative access to the target distribution. Instead, we might only have data from a different distribution that we call the sampling distribution.
For example, in off-policy evaluation for reinforcement learning, the goal is to estimate the expected sum of rewards that a decision policy will produce, given only data gathered using some other policy. Similarly, in supervised learning, we may wish to predict the performance of a regressor or classifier if it were to be applied to data that comes from a distribution that differs from the distribution of the available data (e.g., we might predict the accuracy of a classifier for hand-written letters given that observed letter frequencies come from English, using a corpus of labeled letters collected from German documents).

More precisely, we consider the problem of estimating θ := E[h(X)], where h is a real-valued function and the expectation is over the random variable X, which is a sample from the target distribution. As input we assume access to n i.i.d. samples from a sampling distribution that is different from the target distribution. A classical approach to this problem is to use importance sampling (IS), which reweights the observed samples to account for the difference between the target and sampling distributions (Kahn, 1955). Importance sampling produces an unbiased but often high-variance estimate of θ.

We introduce importance sampling with unequal support (US)—a simple new importance sampling estimator that can drastically reduce the variance of importance sampling when the supports of the sampling and target distributions differ. This setting with unequal support can occur, for example, in our earlier example where German documents might include symbols like ß that the classifier will not encounter. US essentially performs importance sampling only on the data that falls within the support of the target distribution, and then scales this estimate by a constant that reflects the relative support of the target and sampling distributions.
US typically has lower variance than ordinary importance sampling (sometimes by orders of magnitude), and is unbiased in the important setting where at least one sample falls within the support of the target distribution. If no samples do, then none of the available data could have been generated by the target distribution, and so it is unclear what would make for a reasonable estimate. Furthermore, the conditionally unbiased nature of US is sufficient to allow for its use with concentration inequalities like Hoeffding's inequality to construct confidence bounds on θ. By contrast, weighted importance sampling (Rubinstein, 1981) is another variant of importance sampling that can reduce variance, but which introduces bias that makes it incompatible with Hoeffding's inequality.

Problem Setting and Importance Sampling

Let f and g be probability density functions (PDFs) for two distributions that we call the target distribution and sampling distribution, respectively. Let h : R → R be called the evaluation function. Let θ := E_f[h(X)], where E_f denotes the expected value given that f is the PDF of the random variable(s) in the expectation (in this case, just X). Let F := {x ∈ R : f(x) ≠ 0}, G := {x ∈ R : g(x) ≠ 0}, and H := {x ∈ R : h(x) ≠ 0} be the supports of the target distribution, the sampling distribution, and the evaluation function, respectively.

In this paper we discuss techniques for estimating θ given n ∈ N_{>0} i.i.d. samples, X_n := {X_1, ..., X_n}, from the sampling distribution, and we focus on the setting where F ∩ H ⊂ G—where the joint support of F and H is a strict subset of the support of G. The importance sampling estimator,

    IS(X_n) := t + (1/n) Σ_{i=1}^n (f(X_i)/g(X_i)) (h(X_i) − t),    (1)

is a widely used estimator of θ, where t = 0 (we consider non-zero values of t later). If F ∩ H ⊆ G, then IS(X_n) is a consistent and unbiased estimator of θ.
That is, IS(X_n) → θ almost surely, and E_g[IS(X_n)] = θ (we review this latter result in Property 1 in the supplemental document).

A control variate is a constant, t ∈ R, that is subtracted from each h(X_i) and then added back to the final estimate, as in (1) (Hammersley, 1960; Hammersley and Handscomb, 1964). Although control variates, t(X_i), that depend on the sample, X_i, can be beneficial, for our later purposes we only consider constant control variates. Intuitively, including a constant control variate equates to estimating θ′ := E_f[h′(X)] using importance sampling without a control variate, where h′(x) = h(x) − t, and then adding t to the resulting estimate to get an estimate of θ.

Later we show that the variance of importance sampling increases with θ², and so applying importance sampling to h results in higher variance than applying importance sampling to h′ with t ≈ θ, since then θ′ ≈ 0. That is, by inducing a kind of normalization, a control variate can reduce the variance of estimates without introducing bias—a property that has made the inclusion of control variates a popular topic in some recent works using importance sampling (Dudík et al., 2011; Jiang and Li, 2016; Thomas and Brunskill, 2016). Although we discuss control variates more later, for simplicity our derivations focus on importance sampling estimators without control variates. There are also other extensions of the importance sampling estimator that can reduce variance—notably the weighted importance sampling estimator, which we compare to later, and which can provide large reductions of variance and mean squared error, but which introduces bias.

An Illustrative Example

In this section we present an example that highlights the peculiar behavior of the IS estimator when F ∩ H ≠ G. Let g(x) = 0.5 if x ∈ [0, 2] and g(x) = 0 otherwise, and let f(x) = 1 if x ∈ [0, 1] and f(x) = 0 otherwise. So, F = [0, 1] and G = [0, 2]. Let h(x) = 1 if x ∈ [0, 1] and h(x) = 0 otherwise, so that H = [0, 1]. Notice that θ = 1.

Since the sampling and target distributions are both uniform, an obvious estimator of θ (if f and g are known but h is not) would be the average of the points that fall within F. Let (#X_i ∈ F) denote the number of samples in X_n that are in F. Formally, the obvious estimator is

    θ̂ := (1/(#X_i ∈ F)) Σ_{i=1}^n 1_F(X_i) h(X_i),

where 1_A(x) = 1 if x ∈ A and 1_A(x) = 0 otherwise. Given our knowledge of h, it is straightforward to show that this estimator is equal to 1 if (#X_i ∈ F) > 0 and is undefined otherwise—it is exactly correct (has zero bias and variance) as long as at least one sample falls within F. If no samples fall within F, then we have only observed data that will never occur under the target distribution, and so we have no useful information about θ. In this case, we might define our obvious estimator to return an arbitrary value, e.g., zero.

Perhaps surprisingly, the importance sampling estimator does not degenerate to this obvious estimator:

    IS(X_n) = (1/n) Σ_{i=1}^n 1_F(X_i) 2 h(X_i) = 2(#X_i ∈ F)/n.

Since E_g[(#X_i ∈ F)/n] = 1/2, this estimate is correct in expectation, but does not have zero variance given that at least one sample falls within F. If more than 1/2 of the samples fall within F, this estimate will be an over-estimate of θ, and if fewer than 1/2 of the samples fall within F, it will be an under-estimate. Although correct on average, the importance sampling estimator has unnecessary additional variance relative to the obvious estimator.
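The contrast between the two estimators is easy to see numerically. The following sketch (with our own helper names, assuming the uniform densities of this example) implements the ordinary IS estimator from (1) alongside the obvious estimator:

```python
import random

def is_estimate(xs, f, g, h, t=0.0):
    """Ordinary importance sampling, eq. (1), with optional constant control variate t."""
    return t + sum(f(x) / g(x) * (h(x) - t) for x in xs) / len(xs)

# The example: g uniform on [0, 2], f uniform on [0, 1], h = 1 on [0, 1].
f = lambda x: 1.0 if 0.0 <= x <= 1.0 else 0.0
g = lambda x: 0.5 if 0.0 <= x <= 2.0 else 0.0
h = lambda x: 1.0 if 0.0 <= x <= 1.0 else 0.0

random.seed(0)
xs = [random.uniform(0.0, 2.0) for _ in range(10)]

# The "obvious" estimator averages h over the samples that land in F = [0, 1].
in_f = [x for x in xs if x <= 1.0]
obvious = sum(h(x) for x in in_f) / len(in_f)  # exactly 1 whenever in_f is nonempty

# IS instead returns 2(#X_i in F)/n, which varies around theta = 1 from run to run.
print(obvious, is_estimate(xs, f, g, h))
```

Running this a few times with different seeds shows the obvious estimator pinned at 1 while the IS estimate jumps above and below it, matching the discussion above.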
Importance Sampling with Unequal Support

We propose a new importance sampling estimator, importance sampling with unequal support (ISUS, or US for brevity), that does degenerate to the obvious estimator for our illustrative example. Intuitively, US prunes from X_n the samples that are outside F (or more generally, outside some set C, that we define later) to construct a new data set, X′_n, that has fewer samples. This new data set can be viewed as (#X_i ∈ F) i.i.d. samples from a different sampling distribution—a distribution with PDF g′, which is simply g, but truncated to only have support on F and re-normalized to integrate to one. US then applies ordinary importance sampling to this new data set.

For generality, we allow US to prune from X_n all of the points that are not in a set, C, which can be defined many different ways, including C := F (as in our previous example). Our only requirement is that F ∩ H ⊆ C ⊆ G. In order to compute US, we must compute a value,

    c := ∫_C g(x) dx,

which is the probability that a sample from the sampling distribution will be in C. In general, C should be chosen to be as small as possible while still ensuring that both 1) F ∩ H ⊆ C ⊆ G (so that informative samples are not discarded) and 2) c can be computed. Ideally, we would select C = F ∩ H; however, in some cases c cannot be computed for this value of C. For example, in our later experiments we consider a problem where h and H are not known, but F is, and so we can compute c using C = F, but not C = F ∩ H.

Let k(X_n) := Σ_{i=1}^n 1_C(X_i) be the number of X_i that are in C. The US estimator is then defined as:

    US(X_n) := (c/k(X_n)) Σ_{i=1}^n (f(X_i)/g(X_i)) h(X_i),    (2)

if k(X_n) > 0, and US(X_n) := 0 if k(X_n) = 0. This is equivalent to applying importance sampling to the pruned data set, X′_n, since then g′(x) = g(x)/c for x ∈ C.
Also, in (2) we sum over all n samples rather than just the k(X_n) samples in C because f(X_i)h(X_i) = 0 for all X_i not in C.

Theoretical Analysis of US

We begin with two simple theorems that elucidate the relationship between IS and US. The proofs of both theorems are straightforward, but deferred to the supplemental document. First, Theorem 1 shows that, when C = G, US degenerates to IS. One case where C = G is when the supports of the target distribution and evaluation function are both equal to the support of the sampling distribution, i.e., when F = H = G, and so C = G necessarily.

Theorem 1. If C = G, then US(X_n) = IS(X_n).

Theorem 2 shows that, if we replace c in the definition of US with an empirical estimate, ĉ(X_n) := k(X_n)/n, then US and IS are equivalent. This provides some intuition for why US tends to outperform IS when C ⊂ G—IS is US, but using an empirical estimate of c (the probability that a sample falls within C) in place of its known value.

Theorem 2. If we replace c with an empirical estimate, ĉ(X_n) := k(X_n)/n, then US(X_n) = IS(X_n).

In Table 1 we summarize more theoretical results that clarify the differences between IS and US in several settings. The first setting (denoted by a † in Table 1) is the standard setting where we consider the ordinary expected value and variance of the two estimators. The second setting (denoted by a ‡ in Table 1) conditions on the event that at least one sample falls within C, that is, the event that k(X_n) > 0. This is a reasonable setting to consider if one takes the view that no estimate should be returned if all of the samples are outside C. That is, if the pruned data set, X′_n, is empty, then no estimate should be produced or considered (just as IS does not produce an estimate when n = 0—when there are no samples at all). Finally, the third setting (denoted by a ⋆ in Table 1) conditions on the event that k(X_n) = κ—that a specific constant number of the n samples are in C.

Table 1 and the theorems that it references use additional symbols that we review here. Let ρ := Pr(k(X_n) > 0) = 1 − (1 − c)^n be the probability that at least one of n samples is in C. Let Var_g(·) denote the variance given that the random variables within the parentheses are sampled from the distribution with PDF g. Let

    v := Var_g( (f(X)/g(X)) h(X) | X ∈ C )

be the conditional variance of the importance sampling estimate when using a single sample and given that the sample is in C. Let B(n, c) denote the binomial distribution with parameters n and c, and let E_B(n,c) denote the expected value given that κ ∼ B(n, c).

Although the proofs of the claims in Table 1 are some of the primary contributions of this work, we defer them to the supplemental document because they are straightforward (though lengthy) and do not provide further insights into the results. The primary result of Table 1 is that US is unbiased and often has lower variance in the key setting of interest: when at least one sample is in the support of the target distribution—when k(X_n) > 0. We find this setting compelling because, when no samples are in F, little can be inferred about E_f[h(X)]. In this setting (denoted by ‡ in Table 1) US is an unbiased estimator, while IS is not (although the bias of IS does go to zero as n → ∞).¹

To understand the source of this bias, consider the bias of IS given that k(X_n) = κ—the ⋆ setting in Table 1. In this case, E_g[IS(X_n)] = (κ/(cn))θ. Recall that IS uses an empirical estimate of c, i.e., ĉ ≈ κ/n (as discussed in Theorem 2). When this estimate is correct, terms in (κ/(cn))θ cancel, making IS unbiased. Thus, the bias of IS when conditioning on the event that k(X_n) > 0 stems from IS's use of an estimate of c.
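Equation (2) and the Theorem 2 identity are both direct to transcribe. The following sketch (our helper names; c and a membership test for C must be supplied by the user) implements US for the illustrative example and checks that substituting the empirical ĉ recovers ordinary IS:

```python
import random

def us_estimate(xs, f, g, h, c, in_C):
    """US, eq. (2): reweight as in IS, but rescale by the known c = integral of g over C."""
    k = sum(1 for x in xs if in_C(x))
    if k == 0:
        return 0.0  # arbitrary choice when no sample lands in C
    return (c / k) * sum(f(x) / g(x) * h(x) for x in xs)

# Illustrative example again, with C = F = [0, 1], so c = 0.5.
f = lambda x: 1.0 if 0.0 <= x <= 1.0 else 0.0
g = lambda x: 0.5 if 0.0 <= x <= 2.0 else 0.0
h = lambda x: 1.0 if 0.0 <= x <= 1.0 else 0.0
in_C = lambda x: 0.0 <= x <= 1.0

random.seed(1)
xs = [random.uniform(0.0, 2.0) for _ in range(20)]
print(us_estimate(xs, f, g, h, c=0.5, in_C=in_C))  # exactly theta = 1 here

# Theorem 2: substituting the empirical estimate c_hat = k/n for c recovers ordinary IS.
k, n = sum(map(in_C, xs)), len(xs)
is_est = sum(f(x) / g(x) * h(x) for x in xs) / n
assert abs(us_estimate(xs, f, g, h, c=k / n, in_C=in_C) - is_est) < 1e-12
```

On this example US returns exactly 1 whenever any sample lands in C, mirroring the obvious estimator, while the assertion confirms that IS is just US with c estimated from the data.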
Next we discuss the variance of the two estimators given that at least one sample falls within C, i.e., in the ‡ setting. First consider how the variances of IS and US change as c → 0—that is, as the difference between the supports of the sampling and target distributions increases. Specifically, let c_i := 1/i for i ∈ N_{>0}. We then have that:

    Var(IS(X_n) | k(X_n) > 0, c_i) ≥ c_i v/(nρ) = v/(nρi) ≥ v/(ni),

since ρ ∈ (0, 1], and

    Var(US(X_n) | k(X_n) > 0, c_i) = (v/i²) E_B(n,c)[1/κ | κ > 0] ≤ v/i²,

since E_B(n,c)[κ⁻¹ | κ > 0] ≤ 1. Thus, as i → ∞ (and so c → 0), and given some fixed n and v, the variance of US goes to zero much faster than the variance of IS. The variance of US (as a function of i) converges to zero linearly (or faster) with a rate of at most 1, while the variance of IS converges to zero sublinearly (at best, logarithmically).

Next, note that the variance of US in this setting is independent of θ², but the variance of IS increases with θ² (see Property 3 in the supplemental document, applied to Theorem 9). To ameliorate this issue, a control variate, t, can be used to center the data so that θ ≈ 0. However, since θ is not known a priori, selecting t = θ is not practical. The term that scales with θ² in the variance of IS given that k(X_n) > 0 therefore means that the variance of IS depends on the quality of the control variate—poor control variates can cause IS to have high variance. By contrast, the variance of US in this setting does not have a term that scales with θ², and so the quality of the control variate is less important.²

There is a rare case where IS can have lower variance than US. First, we assume that the control variate is perfect so that θ = 0 (which, as discussed before, is impractical) and consider the term that scales with v.
From this term, it is clear that US will have lower variance than IS if:

    c² E_B(n,c)[κ⁻¹ | κ > 0] ≤ c/(nρ).    (3)

¹ If we do not condition on the event that k(X_n) > 0, then US is a biased estimator of θ. This is because it is unclear how to define US(X_n) when k(X_n) = 0, and we chose (arbitrarily) to define it to be 0. However, the bias of US(X_n) in this setting converges quickly to zero, since ρ (the probability that at least one sample falls within C) converges quickly to one as n → ∞.

² The quality of the control variate can still impact the variance of estimates though, since it can change v.

Table 1: Theoretical properties of the IS and US estimators. † = given no conditions. ‡ = conditioned on the event that k(X_n) > 0—that at least one sample is in C. ⋆ = conditioned on the event that k(X_n) = κ—that exactly κ of n samples are in C. All theorems require the assumption that F ∩ H ⊆ G. The consistency results follow immediately from the fact that the biases and variances all converge to zero as n → ∞ (Thomas and Brunskill, 2016, Lemma 3).

    IS:
      E_g[·]†:  θ  (Property 1)
      E_g[·]‡:  (1/ρ)θ  (Theorem 6)
      E_g[·]⋆:  (κ/(cn))θ  (Theorem 5)
      Variance†:  (1/n)(cv + θ²(1/c − 1))  (Theorem 11)
      Variance‡:  cv/(nρ) + θ²(cρ(n−1) + ρ − cn)/(cnρ²)  (Theorem 9)
      Strongly consistent:  Yes († and ‡)

    US:
      E_g[·]†:  ρθ  (Theorem 7)
      E_g[·]‡:  θ  (Theorem 4)
      E_g[·]⋆:  θ  (Theorem 3)
      Variance†:  ρc²v E_B(n,c)[κ⁻¹ | κ > 0] + θ²ρ(1 − ρ)  (Theorem 10)
      Variance‡:  c²v E_B(n,c)[κ⁻¹ | κ > 0]  (Theorem 8)
      Strongly consistent:  Yes († and ‡)

Notice that inequality (3) depends only on n and c, which must both be known in order to implement US, and so we can test a priori whether US will have lower variance than IS. That is, if (3) holds, then US will have lower variance than IS, given that k(X_n) > 0. However, if (3) does not hold, it does not mean that IS will have lower variance than US unless the perfect (typically unknown) control variate is used so that θ = 0.
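Because (3) involves only n and c, the a priori check can be carried out exactly by summing over the binomial distribution. A sketch (the function names are ours, not from the paper):

```python
from math import comb

def inv_kappa_given_positive(n, c):
    """E[1/kappa | kappa > 0] for kappa ~ Binomial(n, c), by direct summation."""
    rho = 1.0 - (1.0 - c) ** n  # Pr(kappa > 0)
    ev = sum((1.0 / k) * comb(n, k) * c ** k * (1.0 - c) ** (n - k)
             for k in range(1, n + 1))
    return ev / rho

def us_beats_is(n, c):
    """Inequality (3): with theta = 0, US has lower variance than IS (given k > 0) iff True."""
    rho = 1.0 - (1.0 - c) ** n
    return c ** 2 * inv_kappa_given_positive(n, c) <= c / (n * rho)

for c in (0.05, 0.25, 0.5, 0.9):
    print(c, us_beats_is(n=10, c=c))
```

For n = 10 the check favors US for small c, while for c near 1 (supports nearly equal) the inequality can fail, consistent with the rare case described above.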
Application to Illustrative Example

Because neither method is always superior, here we consider the application of IS and US to the illustrative example to see when each method works best, and by how much. We consider the setting where C = F, but modify the example slightly. First, although the target distribution is always uniform, we allow for its support to be scaled. Specifically, we define the support of f to be [0, F_max], where F_max ∈ (0, 2]. When F_max is small, it corresponds to significant differences in support, while large F_max corresponds to small differences (when F_max = 2, C = F = G and so the two estimators are equivalent). We also modify h to allow for various values of θ. Specifically, we define h(x) = −1 + θ if x < F_max/2 and h(x) = 1 + θ if x ≥ F_max/2. Notice that, although we defined h in terms of θ, θ remains E_f[h(X)], and also that using this definition of h with θ = 0 is an instance that is particularly favorable to IS.

For this example, it is straightforward to verify that v = 4/F²_max for any value of θ, and c = F_max/2. Given these two values (and θ), we can compute the bias and variance of each estimator. The biases and variances of the two estimators for various settings are depicted in Figure 1. Notice that US is always competitive with IS, although the reverse is not true. In particular, when F_max is small (so that c is small), or when θ is large, US can have orders of magnitude lower variance than IS. Also, as n increases, the two estimators become increasingly similar, since the empirical estimate of c used by IS becomes increasingly accurate, although US is still vastly superior to IS even when n is large if c is correspondingly small.
This matches our theoretical analysis from the previous section: we expect US to perform better when c is small (by our convergence rate analysis) or when θ² is large (due to US's lesser dependence on the quality of the control variate), and we expect the two estimators to become increasingly similar as n → ∞ (because ĉ becomes increasingly similar to c). Notice also that gains are not only obtained when c is so small relative to n that no samples are expected to fall within C (a relatively uninteresting setting). For example, the right-most plot in Figure 1 shows that with F_max = 0.5, where Pr(k(X_n) > 0) = ρ = 1 − (3/4)^50 ≈ 1, the MSE of US is approximately 0.086, while the MSE of IS is approximately 6.08—US has roughly 1/70 the MSE of IS (1/8 the RMSE).

Perhaps surprisingly, there are cases where IS has lower variance than US (even when both are unbiased, since θ = 0). For example, consider the plot with θ = 0 and n = 10, and the position on the horizontal axis that corresponds to F_max = 1.0. This is one case where IS is marginally better than US (it has lower variance in both settings, and neither estimator is biased). Intuitively, the IS estimator includes the points outside the support of F, although they have associated values, f(X_i)h(X_i)/g(X_i) = 0, which pull the importance sampling estimate towards zero. In this case, when θ = 0, this extra pull towards zero happens to be beneficial. However, to remain unbiased given the pull towards zero, IS also increases the magnitudes of the weights associated with points in F, which incurs additional variance. When F_max is small enough, this additional variance outweighs the variance reduction that results from the extra pull towards zero, and so US is again superior. This intuition is supported by the fact that in Figure 1 IS does not outperform US for small F_max or θ ≥ 1, since then a pull towards zero is detrimental.
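The modified example is simple enough to simulate directly, which gives an independent check on these trends. A Monte Carlo sketch (our own helper, assuming C = F and the densities defined above):

```python
import random

def mc_variances(f_max, theta, n, trials=20000, seed=0):
    """Monte Carlo variances of IS (unconditioned) and US (given k > 0) for the modified example."""
    rng = random.Random(seed)
    c = f_max / 2.0  # C = F = [0, F_max], g uniform on [0, 2]
    is_vals, us_vals = [], []
    for _ in range(trials):
        xs = [rng.uniform(0.0, 2.0) for _ in range(n)]
        # f/g = 2/F_max on F; h = theta - 1 on the lower half of F, theta + 1 on the upper half
        ws = [(2.0 / f_max) * ((theta - 1.0) if x < f_max / 2.0 else (theta + 1.0))
              if x <= f_max else 0.0
              for x in xs]
        k = sum(1 for x in xs if x <= f_max)
        is_vals.append(sum(ws) / n)
        if k > 0:
            us_vals.append(c * sum(ws) / k)
    var = lambda vs: sum((x - sum(vs) / len(vs)) ** 2 for x in vs) / len(vs)
    return var(is_vals), var(us_vals)

# Small c and large theta: US should win by a wide margin.
var_is, var_us = mc_variances(f_max=0.5, theta=10.0, n=10)
print(var_is, var_us)
```

With F_max = 0.5 and θ = 10, the theoretical values (Table 1) give an IS variance near 30 and a US variance below 1, and the simulation should land close to both.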
Finally, we consider the use of IS and US to create high-confidence upper and lower bounds on θ using a concentration inequality (Massart, 2007) like Hoeffding's inequality (Hoeffding, 1963). If b denotes the range of the function f(x)h(x)/g(x) for x ∈ G, then using Hoeffding's inequality, we have that

    IS(X_n) − b √(ln(1/δ)/(2n))

is a 1 − δ confidence lower bound on θ. Similarly, we can use US with Hoeffding's inequality to create a 1 − δ confidence lower bound:

    US(X_n) − cb √(ln(1/δ)/(2k(X_n))),

since the range of the k(X_n) i.i.d. random variables averaged by US(X_n) is cb. Notice that, if k(X_n) = 0, then this second estimator is undefined (one might define the lower bound to be a known lower bound on θ in this setting). Although we expect that k(X_n) ≈ cn, the resulting c in the denominator of the US-based bound is within the square root, while the c in the numerator is not, and so the bound constructed using US should tend to be tighter when c is small.

Figure 1: The variances of IS and US across various settings of n and θ (panels correspond to θ ∈ {0, 1, 10} and n ∈ {10, 50}). At a glance, notice that the red and green curves (US) tend to be below the black curves (IS), particularly when considering the logarithmic scale of the vertical axes. The dotted lines show the variance conditioned on the event that k(X_n) > 0. The green line shows the mean squared error of the US estimator (without any conditions), which shows that the variance reduction of US is not completely offset by increased bias (compare the solid black and green curves). When θ = 0 the green line obscures the solid red line. The plot on the right shows a zoomed-in view of the θ = 10, n = 50 plot without the logarithmic vertical axis.
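In code, the two lower bounds differ only in the range constant (b versus cb) and in which count enters the square root. A sketch (our helper names; evenly spaced points stand in for i.i.d. draws so that k is deterministic, which is a simplification for illustration only):

```python
from math import log, sqrt

def hoeffding_is_lb(xs, f, g, h, b, delta=0.1):
    """1 - delta Hoeffding lower bound on theta from ordinary IS; b bounds f*h/g on G."""
    n = len(xs)
    est = sum(f(x) / g(x) * h(x) for x in xs) / n
    return est - b * sqrt(log(1.0 / delta) / (2.0 * n))

def hoeffding_us_lb(xs, f, g, h, b, c, in_C, delta=0.1):
    """1 - delta Hoeffding lower bound from US; None when no sample lands in C."""
    k = sum(1 for x in xs if in_C(x))
    if k == 0:
        return None
    est = (c / k) * sum(f(x) / g(x) * h(x) for x in xs)
    return est - c * b * sqrt(log(1.0 / delta) / (2.0 * k))

# Illustrative example: f*h/g takes values in {0, 2}, so b = 2, and c = 0.5 with C = F.
f = lambda x: 1.0 if 0.0 <= x <= 1.0 else 0.0
g = lambda x: 0.5 if 0.0 <= x <= 2.0 else 0.0
h = lambda x: 1.0 if 0.0 <= x <= 1.0 else 0.0
xs = [0.05 + 0.1 * i for i in range(20)]  # grid stand-in for samples; k = 10

lb_is = hoeffding_is_lb(xs, f, g, h, b=2.0)
lb_us = hoeffding_us_lb(xs, f, g, h, b=2.0, c=0.5, in_C=lambda x: x <= 1.0)
print(lb_is, lb_us)  # the US lower bound is larger, i.e., tighter
```

Here both estimates equal 1, but the US bound's half-width is cb√(ln(1/δ)/(2k)) rather than b√(ln(1/δ)/(2n)), which is smaller whenever k > c²n.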
Application to Diabetes Treatment

We applied US and IS to the problem of predicting the effectiveness of altering the treatment policy for type 1 diabetes for a particular individual. That is, we would like to use prior data from when the individual was treated with one treatment policy to estimate how well a related policy would work. The treatment policy is parameterized by two numbers, CR and CF, and dictates how much insulin a person should inject prior to eating a meal in order to keep his or her blood glucose close to optimum levels. CR and CF are typically specified by a diabetologist and tweaked during follow-up visits every 3–6 months. If follow-up visits are not an option, recent research has suggested using reinforcement learning algorithms to tune CR and CF (Bastani, 2014).

Here we focus on a sub-problem of improving CR and CF—using data collected from an initial range of admissible values of CR and CF to predict how well a new range of values for CR and CF would perform. When collecting data, CR and CF are drawn uniformly from an initial admissible range, and then used for one day (which we view as one episode of a Markov decision process). The performance during each day is measured using an objective function similar to the reward function proposed by Bastani (2014), which measures the deviation of blood glucose from optimum levels, with larger penalties for low blood glucose levels. We refer to the measure of how good the outcome was from one day as the return associated with that day, with larger values being better. Using approximately 30 days of data, our goal is to estimate the expected return if a different distribution of CR and CF were to be used.

We consider a specific in silico person—a person simulated using a metabolic simulator. We used the subject "Adult#003" in the Type 1 Diabetes Metabolic Simulator (T1DMS) (Dalla Man et al.
, 2014)—a simulator that has been approved by the US Food and Drug Administration as a substitute for animal trials in pre-clinical testing of treatment policies for type 1 diabetes. During each day, the subject is given three or four meals of randomized sizes at randomized times, similar to the experimental setup proposed by Bastani (2014). As a result of this randomness, and the stochastic nature of the T1DMS model, applying the same values of CR and CF can produce different returns if used for multiple days.

After analyzing the performance of many CR and CF pairs, we selected an initial range that results in good performance: CR ∈ [8.5, 11] and CF ∈ [10, 15]. Using a large number of samples, we computed an estimate of the expected return if different CR and CF values are used for a single day—this estimate is depicted in Figure 2. As described by Bastani (2014), when the value of CR is set appropriately, performance is robust to changes in CF. We therefore focus on possible changes to CR. Specifically, we consider new treatment policies where CF remains sampled from the uniform distribution over [10, 15], but where CR is sampled from the truncated normal distribution over [CR_min, 11], with mean 11 and standard deviation 11 − CR_min. This distribution places the largest probability densities at the upper end of the range of CR, which favors better policies. As CR_min increases towards 11, the supports of the sampling distribution and target distribution become increasingly different (c = (11 − CR_min)/2.5) and the expected return increases.

Figure 2: The first and second plots show an estimate of the expected return for various CR and CF, from two different angles. The second plot includes points depicting the returns observed from using different values of CR and CF for a day—notice the high variance. The two plots on the right depict the bias, variance, and MSE of IS, US, and WIS (without any conditioning) for various values of c, both without (third plot) and with (fourth plot) a control variate. The curves for US are largely obscured by the corresponding curves for WIS. Notice that the variance of IS approaches 0.06, which is enormous given that the difference between the best and worst CR and CF pairs possible under the sampling policy is approximately 0.06.

For each value of CR_min (each of which corresponds to a value of c), we performed 2,433 trials, each of which involved sampling 30 days of data from the sampling distribution and then using IS, US, and weighted importance sampling (WIS) to estimate the expected return if CR and CF were sampled from the target distribution. Figure 2 displays the bias, variance, and mean squared error (MSE) of these 2,433 estimates, using an estimate of ground truth computed using Monte Carlo sampling. Figure 2 also shows the impact of providing a constant control variate to all the estimators: the chosen control variate was the expected return under the sampling distribution.

Notice that we see the same trend as in the illustrative example—for small c (the best treatment policies, which have small ranges of CR), US significantly outperforms IS. Furthermore, when a decent control variate is not used, the benefits of US are increased, even when controlling for the resulting bias by measuring the mean squared error. We also computed the biases and variances given that k(X_n) > 0, and observed similar results (not shown), which favored US slightly more. Notice that WIS and US perform very similarly. Indeed, if the sampling and target distributions are both uniform, it is straightforward to verify that WIS and US are equivalent. In other experiments (not shown) we found that WIS yields lower variance than US when the target distribution is modified to be even less like the uniform distribution.
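WIS is the self-normalized variant of IS, and the equivalence noted above is easy to verify numerically. A sketch (our helper names, using uniform target and sampling densities on the real line; the evaluation function is arbitrary for the check):

```python
import random

def wis_estimate(xs, f, g, h):
    """Weighted (self-normalized) importance sampling: biased but consistent."""
    ws = [f(x) / g(x) for x in xs]
    total = sum(ws)
    if total == 0.0:
        return 0.0
    return sum(w * h(x) for w, x in zip(ws, xs)) / total

# Uniform target on [0, F_max], uniform sampling on [0, 2]; any h works for the check.
f_max = 0.5
f = lambda x: 1.0 / f_max if 0.0 <= x <= f_max else 0.0
g = lambda x: 0.5 if 0.0 <= x <= 2.0 else 0.0
h = lambda x: x * x

rng = random.Random(3)
xs = [rng.uniform(0.0, 2.0) for _ in range(50)]

# US with C = F, so c = F_max / 2; all nonzero weights equal 2/F_max, so both
# estimators reduce to the average of h over the samples that land in F.
k = sum(1 for x in xs if x <= f_max)
us = (f_max / 2.0 / k) * sum(f(x) / g(x) * h(x) for x in xs)
assert abs(wis_estimate(xs, f, g, h) - us) < 1e-9
```

The equivalence holds because, with uniform densities, every retained sample carries the same weight, so WIS's normalization and US's c/k rescaling coincide.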
However, it is often important to be able to produce confidence intervals around estimates (especially when data is limited), and since WIS is biased, it cannot be used with standard concentration inequalities. We used Hoeffding's inequality to compute a 90% confidence interval around the estimates produced by IS and US (without control variates and with CR_min = 10.375, so that c = 1/4) using various numbers of samples (days of data). The mean confidence intervals are depicted in Figure 3, which also shows a Monte Carlo estimate of θ, as well as deterministic domain-specific upper and lower bounds on h(X) (denoted by "h range" in the legend). If k(X_n) = 0, then US is not defined, and so the confidence intervals shown for US are averaged only over the instances where k(X_n) > 0. To show how often US returns a solution, Figure 3 also shows ρ—the probability that US will produce a confidence bound—using the right vertical axis for scale.

Figure 3: Confidence bounds using IS and US.

US produces a much tighter confidence interval than IS in all cases. Furthermore, the setting where US often does not return a bound corresponds to the setting where IS produces a confidence interval that is outside the deterministic bound on h(X)—a trivial confidence interval. In additional experiments (not shown) we defined the bounds to be truncated to always be within the deterministic bounds on h(X), and defined the bound produced using US to be conservative (equal to the deterministic bounds) when k(X_n) = 0. In this experiment we saw similar results—the confidence intervals produced using US were much tighter than those using IS.

Conclusion and Future Work

We have presented a simple new variant of importance sampling, US.
Our analytical and empirical results suggest that US can significantly outperform ordinary importance sampling, and we provide an a priori calculation to check for the rare cases where it can perform slightly worse. Unlike some other IS estimators that have been developed to reduce variance (like WIS), US is unbiased under mild conditions that still permit the easy computation of confidence intervals.

References

M. Bastani. Model-free intelligent diabetes management using machine learning. Master's thesis, Department of Computing Science, University of Alberta, 2014.

C. Dalla Man, F. Micheletto, D. Lv, M. Breton, B. Kovatchev, and C. Cobelli. The UVA/PADOVA type 1 diabetes simulator: New features. Journal of Diabetes Science and Technology, 8(1):26–34, 2014.

M. Dudík, J. Langford, and L. Li. Doubly robust policy evaluation and learning. In Proceedings of the Twenty-Eighth International Conference on Machine Learning, pages 1097–1104, 2011.

J. M. Hammersley. Monte Carlo methods for solving multivariable problems. Annals of the New York Academy of Sciences, 86(3):844–874, 1960.

J. M. Hammersley and D. C. Handscomb. Monte Carlo Methods. Methuen & Co. Ltd., London, 1964.

W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.

N. Jiang and L. Li. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, 2016.

H. Kahn. Use of different Monte Carlo sampling techniques. Technical Report P-766, The RAND Corporation, September 1955.

P. Massart. Concentration Inequalities and Model Selection. Springer, 2007.

R. Rubinstein. Simulation and the Monte Carlo Method. Wiley, New York, 1981.

P. S. Thomas and E. Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, 2016.
Supplemental Document

In this supplemental document we prove the various properties and theorems referenced earlier (particularly those in Table 1).

Property 1. If $F \cap H \subseteq G$ then $\mathbf{E}_g[\mathrm{IS}(X^n)] = \theta$.

Proof.
$$\mathbf{E}_g[\mathrm{IS}(X^n)] \overset{(a)}{=} \mathbf{E}_g\left[\frac{f(X)}{g(X)}h(X)\right] = \int_G g(x)\,\frac{f(x)}{g(x)}\,h(x)\,dx \overset{(b)}{=} \int_{F\cap H} f(x)\,h(x)\,dx = \mathbf{E}_f[h(X)] = \theta,$$
where (a) holds because $\mathrm{IS}(X^n)$ is the mean of $n$ independent and identically distributed random variables, and (b) holds because $\forall x \in G \setminus (F \cap H)$, $f(x)h(x) = 0$.

We now provide a proof of Theorem 1, which states that if $C = G$, then $\mathrm{US}(X^n) = \mathrm{IS}(X^n)$.

Proof. In this setting, $c = \int_G g(x)\,dx = 1$, and since every $X_i$ must be within $C$, $k(X^n) = n$. So,
$$\mathrm{US}(X^n) = \frac{c}{k(X^n)} \sum_{i=1}^n \frac{f(X_i)}{g(X_i)}h(X_i) = \frac{1}{n}\sum_{i=1}^n \frac{f(X_i)}{g(X_i)}h(X_i).$$

We now provide a proof of Theorem 2, which states that if we replace $c$ with an empirical estimate, $\hat{c}(X^n) := n^{-1}k(X^n)$, then $\mathrm{US}(X^n) = \mathrm{IS}(X^n)$.

Proof. Using the empirical estimate, $\hat{c}(X^n)$, in place of $c$ within US we have:
$$\mathrm{US}(X^n) = \frac{\hat{c}(X^n)}{k(X^n)}\sum_{i=1}^n \frac{f(X_i)}{g(X_i)}h(X_i) = \frac{k(X^n)}{n\,k(X^n)}\sum_{i=1}^n \frac{f(X_i)}{g(X_i)}h(X_i) = \frac{1}{n}\sum_{i=1}^n \frac{f(X_i)}{g(X_i)}h(X_i) = \mathrm{IS}(X^n).$$

Theorem 3. If $F \cap H \subseteq G$ and $\kappa \in \mathbb{N}_{>0}$, then $\mathbf{E}_g[\mathrm{US}(X^n) \mid k(X^n) = \kappa] = \theta$.

Proof. Let $\Pr_g(X \in C)$ denote the probability that a sample, $X$, from the sampling distribution is in $C$.
$$\begin{aligned}
\mathbf{E}_g[\mathrm{US}(X^n) \mid k(X^n) = \kappa]
&= \mathbf{E}_g\left[\frac{c}{\kappa}\sum_{i=1}^n \frac{f(X_i)}{g(X_i)}h(X_i) \,\middle|\, k(X^n) = \kappa\right]\\
&\overset{(a)}{=} \mathbf{E}_g\left[\frac{c}{\kappa}\sum_{i=1}^{\kappa} \frac{f(X_i)}{g(X_i)}h(X_i) \,\middle|\, \forall i \in \{1,\dots,\kappa\},\ X_i \in C\right]\\
&\overset{(b)}{=} \mathbf{E}_g\left[c\,\frac{f(X)}{g(X)}h(X) \,\middle|\, X \in C\right]\\
&\overset{(c)}{=} \int_C \frac{g(x)}{\Pr_g(X \in C)}\,c\,\frac{f(x)}{g(x)}\,h(x)\,dx\\
&\overset{(d)}{=} \int_C \frac{g(x)}{c}\,c\,\frac{f(x)}{g(x)}\,h(x)\,dx\\
&= \int_C f(x)\,h(x)\,dx\\
&\overset{(e)}{=} \mathbf{E}_f[h(X)] = \theta,
\end{aligned}$$
where (a) holds because $f(X_i) = 0$ for all but $\kappa$ of the terms in the summation, and so (by re-ordering the $X_i$ so that these $\kappa$ terms have indices $1,\dots,\kappa$) we need only sum to $\kappa$ rather than $n$; (b) holds because the summation is over $\kappa$ independent and identically distributed random variables; (c) holds by the definition of conditional expectations; (d) holds because $\Pr_g(X \in C) = c$; and (e) holds because $F \cap H \subseteq C$.

Theorem 4. If $F \cap H \subseteq G$ then $\mathbf{E}_g[\mathrm{US}(X^n) \mid k(X^n) > 0] = \theta$.

Proof.
$$\begin{aligned}
\mathbf{E}_g[\mathrm{US}(X^n) \mid k(X^n) > 0]
&= \sum_{\kappa=1}^{n} \frac{\Pr(k(X^n) = \kappa)}{\Pr(k(X^n) > 0)}\,\mathbf{E}_g[\mathrm{US}(X^n) \mid k(X^n) = \kappa]\\
&\overset{(a)}{=} \sum_{\kappa=1}^{n} \frac{\Pr(k(X^n) = \kappa)}{\Pr(k(X^n) > 0)}\,\theta
= \theta \sum_{\kappa=1}^{n} \frac{\Pr(k(X^n) = \kappa)}{\Pr(k(X^n) > 0)} = \theta,
\end{aligned}$$
where (a) holds because, by Theorem 3, $\mathbf{E}[\mathrm{US}(X^n) \mid k(X^n) = \kappa] = \theta$.

Theorem 5. If $F \cap H \subseteq G$ and $\kappa \in \mathbb{N}_{>0}$, then
$$\mathbf{E}_g[\mathrm{IS}(X^n) \mid k(X^n) = \kappa] - \theta = \left(\frac{\kappa}{cn} - 1\right)\theta. \qquad (4)$$

Proof. Following roughly the same steps as used to prove Theorem 3 we have that:
$$\begin{aligned}
\mathbf{E}_g[\mathrm{IS}(X^n) \mid k(X^n) = \kappa]
&= \mathbf{E}_g\left[\frac{1}{n}\sum_{i=1}^n \frac{f(X_i)}{g(X_i)}h(X_i) \,\middle|\, k(X^n) = \kappa\right]\\
&= \mathbf{E}_g\left[\frac{1}{n}\sum_{i=1}^{\kappa} \frac{f(X_i)}{g(X_i)}h(X_i) \,\middle|\, \forall i \in \{1,\dots,\kappa\},\ X_i \in C\right]\\
&= \mathbf{E}_g\left[\frac{\kappa}{n}\,\frac{f(X_1)}{g(X_1)}h(X_1) \,\middle|\, X_1 \in C\right]\\
&= \int_C \frac{g(x)}{c}\,\frac{\kappa}{n}\,\frac{f(x)}{g(x)}\,h(x)\,dx
= \frac{\kappa}{cn}\,\mathbf{E}_f[h(X)] = \frac{\kappa}{cn}\,\theta,
\end{aligned}$$
and so (4) follows.

Theorem 6. If $F \cap H \subseteq G$ then $\mathbf{E}_g[\mathrm{IS}(X^n) \mid k(X^n) > 0] = \frac{1}{1 - (1-c)^n}\,\theta$.

Proof. Recall from Property 1 that $\mathbf{E}_g[\mathrm{IS}(X^n)] = \theta$. By marginalizing over whether or not $k(X^n) > 0$, we also have that:
$$\mathbf{E}_g[\mathrm{IS}(X^n)] = \Pr(k(X^n) > 0)\,\mathbf{E}_g[\mathrm{IS}(X^n) \mid k(X^n) > 0] + \Pr(k(X^n) = 0)\,\mathbf{E}_g[\mathrm{IS}(X^n) \mid k(X^n) = 0].$$
So,
$$\mathbf{E}_g[\mathrm{IS}(X^n) \mid k(X^n) > 0] = \frac{\theta - \Pr(k(X^n) = 0)\,\mathbf{E}_g[\mathrm{IS}(X^n) \mid k(X^n) = 0]}{\Pr(k(X^n) > 0)} \overset{(a)}{=} \frac{\theta}{1 - (1-c)^n},$$
where (a) holds because $\mathbf{E}_g[\mathrm{IS}(X^n) \mid k(X^n) = 0] = 0$ and $\Pr(k(X^n) > 0) = 1 - \Pr(k(X^n) = 0) = 1 - (1-c)^n$.

Theorem 7. If $F \cap H \subseteq G$, then $\mathbf{E}_g[\mathrm{US}(X^n)] = (1 - (1-c)^n)\,\theta$.

Proof.
$$\mathbf{E}_g[\mathrm{US}(X^n)] = \underbrace{\Pr(k(X^n) > 0)}_{=1-(1-c)^n}\,\underbrace{\mathbf{E}_g[\mathrm{US}(X^n) \mid k(X^n) > 0]}_{=\theta,\ \text{by Theorem 4}} + \Pr(k(X^n) = 0)\,\underbrace{\mathbf{E}_g[\mathrm{US}(X^n) \mid k(X^n) = 0]}_{=0} = (1 - (1-c)^n)\,\theta.$$

Before continuing, recall the following property (which we prove for completeness):

Property 2. Let $X_1,\dots,X_n$ be $n$ independent and identically distributed random variables, each with finite mean and variance. Then,
$$\mathbf{E}\left[\left(\frac{1}{n}\sum_{i=1}^n X_i\right)^{\!2}\right] = \frac{1}{n}\,\mathrm{Var}(X_1) + \mathbf{E}[X_1]^2.$$

Proof. Recall that
$$\mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^n X_i\right) = \mathbf{E}\left[\left(\frac{1}{n}\sum_{i=1}^n X_i\right)^{\!2}\right] - \mathbf{E}\left[\frac{1}{n}\sum_{i=1}^n X_i\right]^2.$$
So, by rearranging terms:
$$\mathbf{E}\left[\left(\frac{1}{n}\sum_{i=1}^n X_i\right)^{\!2}\right] = \frac{1}{n^2}\,\mathrm{Var}\left(\sum_{i=1}^n X_i\right) + \frac{1}{n^2}\,\mathbf{E}\left[\sum_{i=1}^n X_i\right]^2.$$
Since the $X_i$ are independent and identically distributed, we therefore have that:
$$\mathbf{E}\left[\left(\frac{1}{n}\sum_{i=1}^n X_i\right)^{\!2}\right] = \frac{1}{n^2}\,n\,\mathrm{Var}(X_1) + \frac{1}{n^2}\,n^2\,\mathbf{E}[X_1]^2 = \frac{1}{n}\,\mathrm{Var}(X_1) + \mathbf{E}[X_1]^2.$$

Theorem 8. If $F \cap H \subseteq G$ then $\mathrm{Var}_g(\mathrm{US}(X^n) \mid k(X^n) > 0) = c^2 v\,\mathbf{E}_{B(n,c)}\!\left[\frac{1}{\kappa} \,\middle|\, \kappa > 0\right]$.

Proof.
$$\begin{aligned}
\mathrm{Var}_g(\mathrm{US}(X^n) \mid k(X^n) > 0)
&= \mathbf{E}_g[\mathrm{US}(X^n)^2 \mid k(X^n) > 0] - \mathbf{E}_g[\mathrm{US}(X^n) \mid k(X^n) > 0]^2\\
&= \mathbf{E}_g[\mathrm{US}(X^n)^2 \mid k(X^n) > 0] - \theta^2\\
&= \left(\sum_{\kappa=1}^{n} \frac{\Pr(k(X^n) = \kappa)}{\Pr(k(X^n) > 0)}\,\mathbf{E}_g[\mathrm{US}(X^n)^2 \mid k(X^n) = \kappa]\right) - \theta^2. \qquad (5)
\end{aligned}$$
We will write $y$ to denote a vector in $\mathbb{R}^n$, the elements of which are $y_1,\dots,y_n \in \mathbb{R}$. We also write $y_{i:j}$ to denote the $i$th through $j$th entries of $y$, i.e., $y_{i:j} := [y_i, y_{i+1}, \dots, y_{j-1}, y_j]$. Let $G^n_\kappa = \{y \in G^n : k(y) = \kappa\}$ be the set of all possible tuples of $n$ samples where exactly $\kappa$ are in $C$.
We also overload the definition of $g$ by defining $g(y) := \prod_{i=1}^n g(y_i)$. Using this notation, we have that:
$$\begin{aligned}
\mathbf{E}_g[\mathrm{US}(X^n)^2 \mid k(X^n) = \kappa]
&= \int_{G^n_\kappa} \frac{g(y)}{\Pr(k(X^n) = \kappa)}\,\mathrm{US}(y)^2\,dy\\
&\overset{(a)}{=} \frac{\binom{n}{\kappa}}{\Pr(k(X^n) = \kappa)} \int_{C^\kappa} \int_{(G \setminus C)^{n-\kappa}} g(y)\,\mathrm{US}(y)^2\,dy_{1:\kappa}\,dy_{\kappa+1:n}\\
&\overset{(b)}{=} \frac{\binom{n}{\kappa}}{\Pr(k(X^n) = \kappa)} \int_{C^\kappa} \int_{(G \setminus C)^{n-\kappa}} g(y_{1:\kappa})\,g(y_{\kappa+1:n})\,\mathrm{US}(y_{1:\kappa})^2\,dy_{1:\kappa}\,dy_{\kappa+1:n}\\
&= \frac{\binom{n}{\kappa}}{\binom{n}{\kappa}\,c^\kappa (1-c)^{n-\kappa}} \int_{C^\kappa} g(y_{1:\kappa})\,\mathrm{US}(y_{1:\kappa})^2\,dy_{1:\kappa}\,\underbrace{\int_{(G \setminus C)^{n-\kappa}} g(y_{\kappa+1:n})\,dy_{\kappa+1:n}}_{=(1-c)^{n-\kappa}}\\
&= \frac{\binom{n}{\kappa}(1-c)^{n-\kappa}}{\binom{n}{\kappa}\,c^\kappa (1-c)^{n-\kappa}} \int_{C^\kappa} g(y_{1:\kappa})\left(\frac{c}{\kappa}\sum_{i=1}^{\kappa} \frac{f(y_i)}{g(y_i)}h(y_i)\right)^{\!2} dy_{1:\kappa}\\
&= \frac{c^2}{c^\kappa} \int_{C^\kappa} g(y_{1:\kappa})\left(\frac{1}{\kappa}\sum_{i=1}^{\kappa} \frac{f(y_i)}{g(y_i)}h(y_i)\right)^{\!2} dy_{1:\kappa}\\
&\overset{(c)}{=} c^2 \int_{C^\kappa} \frac{g(y_{1:\kappa})}{\Pr(k(X^\kappa) = \kappa)}\left(\frac{1}{\kappa}\sum_{i=1}^{\kappa} \frac{f(y_i)}{g(y_i)}h(y_i)\right)^{\!2} dy_{1:\kappa}\\
&= c^2\,\mathbf{E}_g\left[\left(\frac{1}{\kappa}\sum_{i=1}^{\kappa} \frac{f(X_i)}{g(X_i)}h(X_i)\right)^{\!2} \,\middle|\, X^\kappa \in C^\kappa\right]\\
&\overset{(d)}{=} c^2\left(\frac{1}{\kappa}\,v + \mathbf{E}\left[\frac{f(X)}{g(X)}h(X) \,\middle|\, X \sim g,\ X \in C\right]^2\right)\\
&= c^2\left(\frac{1}{\kappa}\,v + \left(\int_C \frac{g(x)}{c}\,\frac{f(x)}{g(x)}\,h(x)\,dx\right)^{\!2}\right)
= \frac{c^2}{\kappa}\,v + \theta^2, \qquad (6)
\end{aligned}$$
where (a) comes from 1) the fact that there are $\binom{n}{\kappa}$ ways of ordering $n$ elements such that $\kappa$ are in $C$ and $n-\kappa$ are in $G \setminus C$, and 2) the fact that US does not depend on the order of its inputs; (b) comes from 1) the property that $\mathrm{US}(y)$ does not change if additional samples are appended to $y$ that are not in $C$, and 2) the fact that $g(y)$ can be decomposed into $g(y_{1:\kappa})\,g(y_{\kappa+1:n})$ since it represents the joint probability density function for $n$ independent and identically distributed random variables; (c) comes from the fact that $\Pr(k(X^\kappa) = \kappa) = c^\kappa$; and (d) comes from Property 2.

Combining (5) with (6) we have that
$$\begin{aligned}
\mathrm{Var}_g(\mathrm{US}(X^n) \mid k(X^n) > 0)
&= \sum_{\kappa=1}^{n} \frac{\Pr(k(X^n) = \kappa)}{\Pr(k(X^n) > 0)}\left(\frac{c^2}{\kappa}\,v + \theta^2\right) - \theta^2\\
&= c^2 v \sum_{\kappa=1}^{n} \frac{\Pr(k(X^n) = \kappa)}{\Pr(k(X^n) > 0)}\,\frac{1}{\kappa} + \theta^2 \underbrace{\sum_{\kappa=1}^{n} \frac{\Pr(k(X^n) = \kappa)}{\Pr(k(X^n) > 0)}}_{=1} - \theta^2\\
&= c^2 v \sum_{\kappa=1}^{n} \frac{\Pr(k(X^n) = \kappa)}{\Pr(k(X^n) > 0)}\,\frac{1}{\kappa}
= c^2 v\,\mathbf{E}_{B(n,c)}\!\left[\frac{1}{\kappa} \,\middle|\, \kappa > 0\right].
\end{aligned}$$

Theorem 9. If $F \cap H \subseteq G$ then
$$\mathrm{Var}_g(\mathrm{IS}(X^n) \mid k(X^n) > 0) = \frac{vc}{n\rho} + \theta^2\,\frac{c\rho(n-1) + \rho - cn}{cn\rho^2}.$$

Proof. At a high level, this proof is similar to the proof of Theorem 8, but uses the property that $\mathrm{IS}(X^n) = \frac{k(X^n)}{cn}\,\mathrm{US}(X^n)$.
$$\begin{aligned}
\mathrm{Var}_g(\mathrm{IS}(X^n) \mid k(X^n) > 0)
&= \mathbf{E}_g[\mathrm{IS}(X^n)^2 \mid k(X^n) > 0] - \mathbf{E}_g[\mathrm{IS}(X^n) \mid k(X^n) > 0]^2\\
&\overset{(a)}{=} \mathbf{E}_g[\mathrm{IS}(X^n)^2 \mid k(X^n) > 0] - \left(\frac{\theta}{1-(1-c)^n}\right)^{\!2}\\
&= \left(\sum_{\kappa=1}^{n} \frac{\Pr(k(X^n) = \kappa)}{\Pr(k(X^n) > 0)}\,\mathbf{E}_g[\mathrm{IS}(X^n)^2 \mid k(X^n) = \kappa]\right) - \left(\frac{\theta}{1-(1-c)^n}\right)^{\!2}, \qquad (7)
\end{aligned}$$
where (a) comes from Theorem 6. Also,
$$\mathbf{E}_g[\mathrm{IS}(X^n)^2 \mid k(X^n) = \kappa] \overset{(a)}{=} \mathbf{E}_g\left[\left(\frac{k(X^n)}{cn}\,\mathrm{US}(X^n)\right)^{\!2} \,\middle|\, k(X^n) = \kappa\right] = \frac{\kappa^2}{c^2 n^2}\,\mathbf{E}_g[\mathrm{US}(X^n)^2 \mid k(X^n) = \kappa] \overset{(b)}{=} \frac{\kappa^2}{c^2 n^2}\left(\frac{c^2}{\kappa}\,v + \theta^2\right), \qquad (8)$$
where (a) holds because $\mathrm{IS}(X^n) = \frac{k(X^n)}{cn}\,\mathrm{US}(X^n)$ and (b) follows from (6). Using the shorthand $\rho := \Pr(k(X^n) > 0) = 1 - (1-c)^n$ and combining (7) with (8) we have that:
$$\begin{aligned}
\mathrm{Var}_g(\mathrm{IS}(X^n) \mid k(X^n) > 0)
&= \sum_{\kappa=1}^{n} \frac{\Pr(k(X^n) = \kappa)}{\Pr(k(X^n) > 0)}\,\frac{\kappa^2}{c^2 n^2}\left(\frac{c^2}{\kappa}\,v + \theta^2\right) - \left(\frac{\theta}{1-(1-c)^n}\right)^{\!2}\\
&= \frac{v}{n^2 \rho} \underbrace{\left(\sum_{\kappa=1}^{n} \Pr(k(X^n) = \kappa)\,\kappa\right)}_{=\mathbf{E}_{B(n,c)}[\kappa] = nc} + \frac{\theta^2}{c^2 n^2 \rho} \underbrace{\left(\sum_{\kappa=1}^{n} \Pr(k(X^n) = \kappa)\,\kappa^2\right)}_{=\mathbf{E}_{B(n,c)}[\kappa^2] = nc((n-1)c+1)} - \frac{\theta^2}{\rho^2}\\
&= \frac{vc}{n\rho} + \theta^2\,\frac{(n-1)c + 1}{cn\rho} - \frac{\theta^2}{\rho^2}
= \frac{vc}{n\rho} + \theta^2\,\frac{c\rho(n-1) + \rho - cn}{cn\rho^2}.
\end{aligned}$$

Theorem 10. If $F \cap H \subseteq G$ then
$$\mathrm{Var}_g(\mathrm{US}(X^n)) = \rho c^2 v\,\mathbf{E}_{B(n,c)}\!\left[\frac{1}{\kappa} \,\middle|\, \kappa > 0\right] + \theta^2 \rho (1 - \rho).$$

Proof.
$$\begin{aligned}
\mathrm{Var}_g(\mathrm{US}(X^n))
&= \mathbf{E}_g[\mathrm{US}(X^n)^2] - \mathbf{E}_g[\mathrm{US}(X^n)]^2
\overset{(a)}{=} \mathbf{E}_g[\mathrm{US}(X^n)^2] - \rho^2\theta^2\\
&= \left(\sum_{\kappa=0}^{n} \Pr(k(X^n) = \kappa)\,\mathbf{E}_g[\mathrm{US}(X^n)^2 \mid k(X^n) = \kappa]\right) - \rho^2\theta^2\\
&= \Pr(k(X^n) = 0)\,\underbrace{\mathbf{E}_g[\mathrm{US}(X^n)^2 \mid k(X^n) = 0]}_{=0} + \left(\sum_{\kappa=1}^{n} \Pr(k(X^n) = \kappa)\,\mathbf{E}_g[\mathrm{US}(X^n)^2 \mid k(X^n) = \kappa]\right) - \rho^2\theta^2\\
&\overset{(b)}{=} \rho \sum_{\kappa=1}^{n} \frac{\Pr(k(X^n) = \kappa)}{\rho}\left(\frac{c^2}{\kappa}\,v + \theta^2\right) - \rho^2\theta^2\\
&= \rho c^2 v \left(\sum_{\kappa=1}^{n} \frac{\Pr(k(X^n) = \kappa)}{\rho}\,\frac{1}{\kappa}\right) + \rho\theta^2 \underbrace{\left(\sum_{\kappa=1}^{n} \frac{\Pr(k(X^n) = \kappa)}{\rho}\right)}_{=1} - \rho^2\theta^2\\
&= \rho c^2 v\,\mathbf{E}_{B(n,c)}\!\left[\frac{1}{\kappa} \,\middle|\, \kappa > 0\right] + \theta^2 \rho (1 - \rho),
\end{aligned}$$
where (a) comes from Theorem 7, and (b) comes from (6) and from multiplying one term by $\rho/\rho = 1$.

Theorem 11. If $F \cap H \subseteq G$ then
$$\mathrm{Var}_g(\mathrm{IS}(X^n)) = \frac{1}{n}\left(cv + \theta^2\left(\frac{1}{c} - 1\right)\right).$$

Proof.
$$\begin{aligned}
\mathrm{Var}_g(\mathrm{IS}(X^n))
&\overset{(a)}{=} \frac{1}{n}\,\mathrm{Var}_g(\mathrm{IS}(X))
= \frac{1}{n}\left(\mathbf{E}_g[\mathrm{IS}(X)^2] - \mathbf{E}_g[\mathrm{IS}(X)]^2\right)\\
&\overset{(b)}{=} \frac{1}{n}\left(\mathbf{E}_g[\mathrm{IS}(X)^2] - \theta^2\right)\\
&= \frac{1}{n}\left(\Pr(X \in C \mid X \sim g)\,\mathbf{E}_g[\mathrm{IS}(X)^2 \mid X \in C] + \Pr(X \notin C \mid X \sim g)\,\underbrace{\mathbf{E}_g[\mathrm{IS}(X)^2 \mid X \notin C]}_{=0} - \theta^2\right)\\
&= \frac{1}{n}\left(c\,\mathbf{E}_g[\mathrm{IS}(X)^2 \mid X \in C] - \theta^2\right)
\overset{(c)}{=} \frac{1}{n}\left(c\left(v + \frac{\theta^2}{c^2}\right) - \theta^2\right)
= \frac{1}{n}\left(cv + \theta^2\left(\frac{1}{c} - 1\right)\right),
\end{aligned}$$
where (a) holds because $\mathrm{IS}(X^n)$ is the mean of $n$ independent and identically distributed random variables, (b) comes from Property 1, and (c) comes from applying (8) with $n = 1$ and $\kappa = 1$.

Property 3. $c\rho(n-1) + \rho - cn \geq 0$.

Proof. Recall that $\rho := 1 - (1-c)^n$, so we have that:
$$\begin{aligned}
c\rho(n-1) + \rho - cn
&= c(1-(1-c)^n)(n-1) + 1 - (1-c)^n - cn\\
&= (cn - c)(1 - (1-c)^n) + 1 - (1-c)^n - cn\\
&= cn - cn(1-c)^n - c + c(1-c)^n + 1 - (1-c)^n - cn\\
&= (1-c)^n(-cn + c - 1) - c + 1. \qquad (9)
\end{aligned}$$
We will show by induction that (9) is non-negative for all $n \geq 1$. First, notice that for the base case where $n = 1$, (9) is equal to zero. For the inductive step we will show that (9) is non-negative for $n+1$ given that it is non-negative for $n$:
$$\begin{aligned}
(1-c)^{n+1}(-c(n+1) + c - 1) - c + 1
&= (1-c)(1-c)^n(-cn + c - 1) - (1-c)^{n+1}c + (-c+1)(1 - c + c)\\
&= (1-c)\underbrace{\left((1-c)^n(-cn + c - 1) - c + 1\right)}_{(a)} - (1-c)^{n+1}c + c(1-c),
\end{aligned}$$
where (a) is non-negative by the inductive hypothesis, and so we need only show that $-(1-c)^{n+1}c + c(1-c) \geq 0$. Since $-(1-c)^{n+1}c + c(1-c) = c\left((1-c) - (1-c)^{n+1}\right)$, and $1-c \geq (1-c)^{n+1}$ because $c \in (0,1]$, we conclude.
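The closed-form bias and variance expressions above can be checked numerically. Below is a small Monte Carlo sketch on a toy instance of our own choosing (not from the paper): $g = \text{Uniform}(0,1)$, $f = \text{Uniform}(0,c)$, $h(x) = x$, so that $C = [0,c]$, $\theta = c/2$, and $v = \mathrm{Var}_g(f(X)h(X)/g(X) \mid X \in C) = 1/12$; as in the proof of Theorem 7, US is treated as 0 on trials with $k(X^n) = 0$.

```python
import numpy as np
from math import comb

# Toy instance (assumed, for checking the closed forms only):
# g = Uniform(0,1), f = Uniform(0,c), h(x) = x, C = [0,c].
n, c = 30, 0.25
theta = c / 2                  # E_f[h(X)]
v = 1.0 / 12.0                 # Var_g(f(X)h(X)/g(X) | X in C) = Var(x/c), x ~ U(0,c)
rho = 1.0 - (1.0 - c) ** n     # Pr(k(X^n) > 0)

# Closed forms (Theorems 7, 10, 11); E[1/kappa | kappa > 0] under Binomial(n, c).
pk = [comb(n, k) * c**k * (1 - c) ** (n - k) for k in range(n + 1)]
E_inv_kappa = sum(pk[k] / k for k in range(1, n + 1)) / rho
bias_US = -((1 - c) ** n) * theta                                   # Theorem 7
var_US = rho * c**2 * v * E_inv_kappa + theta**2 * rho * (1 - rho)  # Theorem 10
var_IS = (c * v + theta**2 * (1 / c - 1)) / n                       # Theorem 11

# Monte Carlo check; US is defined as 0 on the (rare) trials with k = 0.
rng = np.random.default_rng(1)
x = rng.uniform(size=(100_000, n))
wh = np.where(x < c, x / c, 0.0)                 # (f/g) * h per sample
k = np.count_nonzero(x < c, axis=1)              # samples in C per trial
IS = wh.mean(axis=1)
US = np.where(k > 0, c * wh.sum(axis=1) / np.maximum(k, 1), 0.0)

print(f"US bias: theory {bias_US:+.2e}, MC {US.mean() - theta:+.2e}")
print(f"US var:  theory {var_US:.6f}, MC {US.var():.6f}")
print(f"IS var:  theory {var_IS:.6f}, MC {IS.var():.6f}")
```

In this instance the unconditional variance of US is several times smaller than that of IS, at the cost of the small bias $-(1-c)^n\theta$ given by Theorem 7.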