Bias Correction of Learned Generative Models using Likelihood-Free Importance Weighting
Authors: Aditya Grover, Jiaming Song, Alekh Agarwal
Bias Correction of Learned Generative Models using Likelihood-Free Importance Weighting

Aditya Grover 1, Jiaming Song 1, Alekh Agarwal 2, Kenneth Tran 2, Ashish Kapoor 2, Eric Horvitz 2, Stefano Ermon 1
1 Stanford University, 2 Microsoft Research, Redmond

Abstract

A learned generative model often produces biased statistics relative to the underlying data distribution. A standard technique to correct this bias is importance sampling, where samples from the model are weighted by the likelihood ratio under the model and true distributions. When the likelihood ratio is unknown, it can be estimated by training a probabilistic classifier to distinguish samples from the two distributions. We employ this likelihood-free importance weighting method to correct for the bias in generative models. We find that this technique consistently improves standard goodness-of-fit metrics for evaluating the sample quality of state-of-the-art deep generative models, suggesting reduced bias. Finally, we demonstrate its utility on representative applications in a) data augmentation for classification using generative adversarial networks, and b) model-based policy evaluation using off-policy data.

1 Introduction

Learning generative models of complex environments from high-dimensional observations is a long-standing challenge in machine learning. Once learned, these models are used to draw inferences and to plan future actions. For example, in data augmentation, samples from a learned model are used to enrich a dataset for supervised learning [1]. In model-based off-policy policy evaluation (henceforth MBOPE), a learned dynamics model is used to simulate and evaluate a target policy without real-world deployment [2], which is especially valuable for risk-sensitive applications [3].
In spite of the recent successes of deep generative models, existing theoretical results show that learning distributions in an unbiased manner is either impossible or has prohibitive sample complexity [4, 5]. Consequently, the models used in practice are inherently biased,¹ and can lead to misleading downstream inferences. In order to address this issue, we start from the observation that many typical uses of generative models involve computing expectations under the model. For instance, in MBOPE, we seek to find the expected return of a policy under a trajectory distribution defined by this policy and a learned dynamics model.

A classical recipe for correcting the bias in expectations, when samples from a different distribution than the ground truth are available, is to importance weight the samples according to the likelihood ratio [6]. If the importance weights were exact, the resulting estimates would be unbiased. But in practice, the likelihood ratio is unknown and needs to be estimated, since the true data distribution is unknown and even the model likelihood is intractable or ill-defined for many deep generative models, e.g., variational autoencoders [7] and generative adversarial networks [8]. Our proposed solution to estimate the importance weights is to train a calibrated, probabilistic classifier to distinguish samples from the data distribution and the generative model. As shown in prior work, the output of such classifiers can be used to extract density ratios [9]. Appealingly, this estimation procedure is likelihood-free since it only requires samples from the two distributions.

¹ We call a generative model biased if it produces biased statistics relative to the true data distribution.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
Together, the generative model and the importance weighting function (specified via a binary classifier) induce a new unnormalized distribution. While exact density estimation and sampling from this induced distribution are intractable, we can derive a particle-based approximation which permits efficient sampling via resampling-based methods. We derive conditions on the quality of the weighting function such that the induced distribution provably improves the fit to the data distribution.

Empirically, we evaluate our bias reduction framework on three main sets of experiments. First, we consider goodness-of-fit metrics for evaluating the sample quality of a likelihood-based and a likelihood-free state-of-the-art (SOTA) model on the CIFAR-10 dataset. All these metrics are defined as Monte Carlo estimates from the generated samples. By importance weighting samples, we observe a bias reduction of 23.35% and 13.48% averaged across commonly used sample quality metrics on PixelCNN++ [10] and SNGAN [11] models respectively.

Next, we demonstrate the utility of our approach on the task of data augmentation for multi-class classification on the Omniglot dataset [12]. We show that, while naively extending the model with samples from a data augmentation generative adversarial network [1] is not very effective for multi-class classification, we can improve classification accuracy from 66.03% to 68.18% by importance weighting the contributions of each augmented data point. Finally, we demonstrate bias reduction for MBOPE [13]. A typical MBOPE approach is to first estimate a generative model of the dynamics using off-policy data and then evaluate the policy via Monte Carlo [2, 14]. Again, we observe that correcting the bias of the estimated dynamics model via importance weighting reduces RMSE for MBOPE by 50.25% on 3 MuJoCo environments [15].

2 Preliminaries

Notation.
Unless explicitly stated otherwise, we assume that probability distributions admit absolutely continuous densities on a suitable reference measure. We use uppercase notation X, Y, Z to denote random variables and lowercase notation x, y, z to denote specific values in the corresponding sample spaces X, Y, Z. We use boldface for multivariate random variables and their vector values.

Background. Consider a finite dataset D_train of instances x drawn i.i.d. from a fixed (unknown) distribution p_data. Given D_train, the goal of generative modeling is to learn a distribution p_θ to approximate p_data. Here, θ denotes the model parameters, e.g., weights in a neural network for deep generative models. The parameters can be learned via maximum likelihood estimation (MLE), as in the case of autoregressive models [16], normalizing flows [17], and variational autoencoders [7, 18], or via adversarial training, e.g., using generative adversarial networks [8, 19] and variants.

Monte Carlo Evaluation. We are interested in use cases where the goal is to evaluate or optimize expectations of functions under some distribution p (either equal or close to the data distribution p_data). Assuming access to samples from p as well as some generative model p_θ, one extreme is to evaluate the sample average using the samples from p alone. However, this ignores the availability of p_θ, through which we have virtually unlimited access to generated samples (ignoring computational constraints) and hence could improve the accuracy of our estimates when p_θ is close to p. We begin by presenting a direct motivating use case of data augmentation using generative models for training classifiers which generalize better.

Example Use Case. Sufficient labeled training data for learning classification and regression systems is often expensive to obtain or susceptible to noise.
Data augmentation seeks to overcome this shortcoming by artificially injecting new datapoints into the training set. These new datapoints are derived from an existing labeled dataset, either by manual transformations (e.g., rotations, flips for images), or alternatively, learned via a generative model [1, 20].

Consider a supervised learning task over a labeled dataset D_cl. The dataset consists of feature and label pairs (x, y), each of which is assumed to be sampled independently from a data distribution p_data(x, y) defined over X × Y. Further, let Y ⊆ R^k. In order to learn a classifier f_ψ : X → R^k with parameters ψ, we minimize the expectation of a loss ℓ : Y × R^k → R over the dataset D_cl:

E_{p_data(x,y)} [ℓ(y, f_ψ(x))] ≈ (1/|D_cl|) Σ_{(x,y) ∈ D_cl} ℓ(y, f_ψ(x)).   (1)

E.g., ℓ could be the cross-entropy loss. A generative model for the task of data augmentation learns a joint distribution p_θ(x, y). Several algorithmic variants exist for learning the model's joint distribution, and we defer the specifics to the experiments section. Once the generative model is learned, it can be used to optimize the expected classification loss in Eq. (1) under a mixture distribution of the empirical data distribution and the generative model distribution, given as:

p_mix(x, y) = m p_data(x, y) + (1 − m) p_θ(x, y)   (2)

for a suitable choice of the mixture weights m ∈ [0, 1].

Notice that, while the eventual task here is optimization, reliably evaluating the expected loss of a candidate parameter ψ is an important ingredient. We focus on this basic question first in advance of leveraging the solution for data augmentation. Further, even if evaluating the expectation once is easy, optimization requires us to do repeated evaluation (for different values of ψ), which is significantly more challenging.
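Sampling from the mixture in Eq. (2) reduces to a biased coin flip per datapoint. A minimal sketch of this, where the two sampler callables are hypothetical stand-ins for the empirical data and the learned model:

```python
import random

def sample_p_mix(sample_data, sample_model, m):
    """Draw one (x, y) pair from p_mix in Eq. (2): with probability m
    from the empirical data distribution, else from the generative model."""
    return sample_data() if random.random() < m else sample_model()
```

Training on batches drawn this way is how the augmented version of the loss in Eq. (1) is typically optimized in practice.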
Also observe that the distribution p under which we seek expectations is the same as p_data here, and we rely on the generalization of p_θ to generate transformations of an instance in the dataset which are not explicitly present, but plausibly observed in other, similar instances [21].

3 Likelihood-Free Importance Weighting

Whenever the distribution p, under which we seek expectations, differs from p_θ, model-based estimates exhibit bias. In this section, we start out by formalizing bias for Monte Carlo expectations and subsequently propose a bias reduction strategy based on likelihood-free importance weighting (LFIW).

We are interested in evaluating expectations of a class of functions of interest f ∈ F w.r.t. the distribution p. For any given f : X → R, we have E_{x∼p}[f(x)] = ∫ p(x) f(x) dx. Given access to samples from a generative model p_θ, if we knew the densities for both p and p_θ, then a classical scheme to evaluate expectations under p using samples from p_θ is importance sampling [6]. We reweight each sample from p_θ according to its likelihood ratio under p and p_θ and compute a weighted average of the function f over these samples:

E_{x∼p}[f(x)] = E_{x∼p_θ} [(p(x)/p_θ(x)) f(x)] ≈ (1/T) Σ_{i=1}^{T} w(x_i) f(x_i)   (3)

where w(x_i) := p(x_i)/p_θ(x_i) is the importance weight for x_i ∼ p_θ. The validity of this procedure requires a proposal p_θ such that for all x ∈ X where p_θ(x) = 0, we also have f(x) p(x) = 0.²

To apply this technique to reduce the bias of a generative sampler p_θ w.r.t. p, we require knowledge of the importance weights w(x) for any x ∼ p_θ. However, we typically only have sampling access to p via finite datasets; for instance, in the data augmentation example above, p = p_data is the unknown distribution used to learn p_θ.
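When both densities are known in closed form, the estimator in Eq. (3) is a few lines of NumPy. The distributions below are illustrative choices of ours, not from the paper: target p = N(1, 1), proposal p_θ = N(0, 2), and f(x) = x:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(0)
T = 200_000
x = rng.normal(0.0, 2.0, size=T)                        # x_i ~ p_theta = N(0, 2)
w = normal_pdf(x, 1.0, 1.0) / normal_pdf(x, 0.0, 2.0)   # w(x_i) = p(x_i) / p_theta(x_i)
estimate = np.mean(w * x)                               # Eq. (3) with f(x) = x
# estimate should be close to E_{x~p}[x] = 1
```

Note that the proposal N(0, 2) has full support, so the validity condition below Eq. (3) is satisfied.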
Hence we need a scheme to learn the weights w(x) using samples from p and p_θ, which is the problem we tackle next. To do so, we consider a binary classification problem over X × Y where Y = {0, 1} and the joint distribution is denoted as q(x, y). Let γ = q(y = 0)/q(y = 1) > 0 denote any fixed odds ratio. To specify the joint q(x, y), we additionally need the conditional q(x | y), which we define as follows:

q(x | y) = p_θ(x) if y = 0, and p(x) otherwise.   (4)

Since we only assume sample access to p and p_θ(x), our strategy is to estimate the conditional above by learning a probabilistic binary classifier. To train the classifier, we only require datasets of samples from p_θ(x) and p(x), and we estimate γ as the ratio of the sizes of the two datasets. Let c_φ : X → [0, 1] denote the probability assigned by the classifier with parameters φ to a sample x belonging to the positive class y = 1. As shown in prior work [9, 22], if c_φ is Bayes optimal, then the importance weights can be obtained via this classifier as:

w_φ(x) = p(x)/p_θ(x) = γ c_φ(x)/(1 − c_φ(x)).   (5)

² A stronger sufficient, but not necessary, condition that is independent of f states that the proposal p_θ is valid if its support is larger than that of p, i.e., for all x ∈ X, p_θ(x) = 0 implies p(x) = 0.

Figure 1: Importance Weight Estimation using Probabilistic Classifiers. (a) Setup: a univariate Gaussian (blue) is fit to samples from a mixture of two Gaussians (red). (b-d) Estimated class probabilities (with 95% confidence intervals based on 1000 bootstraps) for n = 50, 100, and 1000 points, where n is the number of points used for training the generative model and the multilayer perceptron.

In practice, we do not have access to a Bayes optimal classifier and hence the estimated importance weights will not be exact.
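The classifier-to-ratio conversion in Eq. (5) can be sketched end to end on toy data. The setup here is ours for illustration: p = N(1, 1) and p_θ = N(0, 1), whose true log-ratio x − 0.5 is linear in x, so a two-parameter logistic regression (fit by plain gradient descent) can approach the Bayes optimum; the paper's classifiers are calibrated deep networks, not this toy model:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
x_gen  = rng.normal(0.0, 1.0, size=n)    # samples from p_theta (label y = 0)
x_real = rng.normal(1.0, 1.0, size=n)    # samples from p       (label y = 1)

# Design matrix with a bias column, and the binary labels.
X = np.column_stack([np.ones(2 * n), np.concatenate([x_gen, x_real])])
y = np.concatenate([np.zeros(n), np.ones(n)])

# Fit the probabilistic classifier c_phi by gradient descent on cross-entropy.
phi = np.zeros(2)
for _ in range(2000):
    c = 1.0 / (1.0 + np.exp(-X @ phi))
    phi -= 0.5 * X.T @ (c - y) / len(y)

gamma = n / n   # odds ratio q(y=0)/q(y=1): the ratio of the two dataset sizes

def w_hat(x):
    """Importance weight via Eq. (5): w_phi(x) = gamma * c_phi(x) / (1 - c_phi(x))."""
    c = 1.0 / (1.0 + np.exp(-(phi[0] + phi[1] * x)))
    return gamma * c / (1.0 - c)
```

For these two Gaussians the exact ratio is p(x)/p_θ(x) = exp(x − 0.5), so w_hat(0.5) should be close to 1.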
Consequently, we can hope to reduce the bias, as opposed to eliminating it entirely. Our default LFIW estimator is given as:

E_{x∼p}[f(x)] ≈ (1/T) Σ_{i=1}^{T} ŵ_φ(x_i) f(x_i)   (6)

where ŵ_φ(x_i) = γ c_φ(x_i)/(1 − c_φ(x_i)) is the importance weight for x_i ∼ p_θ estimated via c_φ(x).

Practical Considerations. Besides imperfections in the classifier, the quality of the generative model also dictates the efficacy of importance weighting. For example, images generated by deep generative models often possess distinct artifacts which can be exploited by the classifier to give highly-confident predictions [23, 24]. This could lead to very small importance weights for some generated images, and consequently greater relative variance in the importance weights across the Monte Carlo batch. Below, we present some practical variants of the LFIW estimator to offset this challenge.

1. Self-normalization: The self-normalized LFIW estimator for Monte Carlo evaluation normalizes the importance weights across a sampled batch:

E_{x∼p}[f(x)] ≈ Σ_{i=1}^{T} [ŵ_φ(x_i) / Σ_{j=1}^{T} ŵ_φ(x_j)] f(x_i), where x_i ∼ p_θ.   (7)

2. Flattening: The flattened LFIW estimator interpolates between uniform importance weights and the default LFIW weights via a power scaling parameter α ≥ 0:

E_{x∼p}[f(x)] ≈ (1/T) Σ_{i=1}^{T} ŵ_φ(x_i)^α f(x_i), where x_i ∼ p_θ.   (8)

For α = 0, there is no bias correction, and α = 1 returns the default estimator in Eq. (6). For intermediate values of α, we can trade off bias reduction with any undesirable variance introduced.

3. Clipping: The clipped LFIW estimator specifies a lower bound β ≥ 0 on the importance weights:

E_{x∼p}[f(x)] ≈ (1/T) Σ_{i=1}^{T} max(ŵ_φ(x_i), β) f(x_i), where x_i ∼ p_θ.   (9)

When β = 0, we recover the default LFIW estimator in Eq. (6). Finally, we note that these estimators are not exclusive and can be combined, e.g., flattened or clipped weights can be normalized.
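Given the sampled function values and estimated weights, the three variants are essentially one-liners. A sketch (the function and argument names are ours):

```python
import numpy as np

def lfiw_estimate(f_vals, w, variant="self_normalized", alpha=1.0, beta=0.0):
    """Practical LFIW Monte Carlo estimates of E_{x~p}[f(x)], given
    f_vals = f(x_i) and w = w_hat_phi(x_i) for samples x_i ~ p_theta."""
    f_vals, w = np.asarray(f_vals, float), np.asarray(w, float)
    if variant == "self_normalized":          # Eq. (7): weights sum to 1
        return float(np.sum(w * f_vals) / np.sum(w))
    if variant == "flattened":                # Eq. (8): power-scaled weights
        return float(np.mean(w ** alpha * f_vals))
    if variant == "clipped":                  # Eq. (9): weights lower-bounded at beta
        return float(np.mean(np.maximum(w, beta) * f_vals))
    raise ValueError(f"unknown variant: {variant}")
```

With α = 1 or β = 0 the latter two reduce to the default estimator in Eq. (6), and, as the text notes, flattened or clipped weights can in turn be fed into the self-normalized form.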
Confidence intervals. Since we have real and generated data coming from a finite dataset and a parametric model respectively, we propose a combination of empirical and parametric bootstraps to derive confidence intervals around the estimated importance weights. See Appendix A for details.

Synthetic experiment. We visually illustrate our importance weighting approach in a toy experiment (Figure 1a). We are given a finite set of samples drawn from a mixture of two Gaussians (red). The model family is a unimodal Gaussian, illustrating mismatch due to a parametric model. The mean and variance of the model are estimated by the empirical means and variances of the observed data. Using the estimated model parameters, we then draw samples from the model (blue).

Algorithm 1: SIR for the Importance Resampled Generative Model p_{θ,φ}
Input: generative model p_θ, importance weight estimator ŵ_φ, budget T
1: Sample x_1, x_2, ..., x_T independently from p_θ
2: Estimate importance weights ŵ(x_1), ŵ(x_2), ..., ŵ(x_T)
3: Compute Ẑ ← Σ_{t=1}^{T} ŵ(x_t)
4: Sample j ∼ Categorical(ŵ(x_1)/Ẑ, ŵ(x_2)/Ẑ, ..., ŵ(x_T)/Ẑ)
5: return x_j

In Figure 1b, we show the probability assigned by a binary classifier to a point being from the true data distribution. Here, the classifier is a single-hidden-layer multilayer perceptron. The classifier is not Bayes optimal, which can be seen from the gaps between the optimal probability curve (black) and the estimated class probability curve (green). However, as we increase the number of real and generated examples n in Figures 1c-d, the classifier approaches optimality. Furthermore, its uncertainty shrinks with increasing data, as expected. In summary, this experiment demonstrates how a binary classifier can mitigate the bias due to a mismatched generative model.
4 Importance Resampled Generative Modeling

In the previous section, we described a procedure to augment any base generative model p_θ with an importance weight estimator ŵ_φ for debiased Monte Carlo evaluation. Here, we use this augmentation to induce an importance resampled generative model with density p_{θ,φ} given as:

p_{θ,φ}(x) ∝ p_θ(x) ŵ_φ(x)   (10)

where the partition function is expressed as Z_{θ,φ} = ∫ p_θ(x) ŵ_φ(x) dx = E_{p_θ}[ŵ_φ(x)].

Density Estimation. Exact density estimation requires a handle on the density of the base model p_θ (typically intractable for models such as VAEs and GANs) and estimates of the partition function. Exactly computing the partition function is intractable. If p_θ permits fast sampling and importance weights are estimated via LFIW (requiring only a forward pass through the classifier network), we can obtain unbiased estimates via a Monte Carlo average, i.e., Z_{θ,φ} ≈ (1/T) Σ_{i=1}^{T} ŵ_φ(x_i) where x_i ∼ p_θ. To reduce the variance, a potentially large number of samples is required. Since samples are obtained independently, the terms in the Monte Carlo average can be evaluated in parallel.

Sampling-Importance-Resampling. While exact sampling from p_{θ,φ} is intractable, we can instead sample from a particle-based approximation to p_{θ,φ} via sampling-importance-resampling [25, 26] (SIR). We define the SIR approximation to p_{θ,φ} via the following density:

p^SIR_{θ,φ}(x; T) := E_{x_2, x_3, ..., x_T ∼ p_θ} [ T ŵ_φ(x) / (ŵ_φ(x) + Σ_{i=2}^{T} ŵ_φ(x_i)) ] p_θ(x)   (11)

where T > 0 denotes the number of independent samples (or "particles"). For any finite T, sampling from p^SIR_{θ,φ} is tractable, as summarized in Algorithm 1. Moreover, any expectation w.r.t. the SIR approximation to the induced distribution can be evaluated in closed form using the self-normalized LFIW estimator (Eq. 7).
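Algorithm 1 translates directly into code. The sketch below uses an illustrative setup of ours in which the exact ratio is known, so the resampled draws should approach the target: p_θ = N(0, 1) and ŵ(x) = exp(x − 0.5), the exact ratio to a target N(1, 1):

```python
import numpy as np

rng = np.random.default_rng(0)

def sir_sample(sample_p_theta, w_hat, T):
    """One draw from the SIR approximation to p_{theta,phi} (Algorithm 1)."""
    xs = sample_p_theta(T)            # step 1: x_1, ..., x_T ~ p_theta
    w = w_hat(xs)                     # step 2: estimated importance weights
    Z_hat = np.sum(w)                 # step 3: unnormalized weight sum
    j = rng.choice(T, p=w / Z_hat)    # step 4: j ~ Categorical(w / Z_hat)
    return xs[j]                      # step 5

# Note: Z_hat / T from step 3 is also the Monte Carlo estimate of Z_{theta,phi}.
draws = np.array([sir_sample(lambda T: rng.normal(0.0, 1.0, size=T),
                             lambda x: np.exp(x - 0.5), T=100)
                  for _ in range(2000)])
# the mean of draws should be near 1, the mean of the target N(1, 1)
```

Increasing the particle budget T trades computation for a closer match to p_{θ,φ}, per Eq. (12).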
In the limit of T → ∞, we recover the induced distribution p_{θ,φ}:

lim_{T→∞} p^SIR_{θ,φ}(x; T) = p_{θ,φ}(x)  ∀ x.   (12)

Next, we analyze conditions under which the resampled density p_{θ,φ} provably improves the model fit to p_data. In order to do so, we further assume that p_data is absolutely continuous w.r.t. p_θ and p_{θ,φ}. We define the change in KL via the importance resampled density as:

Δ(p_data, p_θ, p_{θ,φ}) := D_KL(p_data, p_{θ,φ}) − D_KL(p_data, p_θ).   (13)

Substituting Eq. 10 in Eq. 13, we can simplify the above quantity as:

Δ(p_data, p_θ, p_{θ,φ}) = E_{x∼p_data} [−log(p_θ(x) ŵ_φ(x)) + log Z_{θ,φ} + log p_θ(x)]   (14)
                        = −E_{x∼p_data} [log ŵ_φ(x)] + log E_{x∼p_θ} [ŵ_φ(x)].   (15)

Table 1: Goodness-of-fit evaluation on the CIFAR-10 dataset for PixelCNN++ and SNGAN. Standard errors computed over 10 runs. Higher IS is better; lower FID and KID scores are better.

Model       Evaluation              IS (↑)           FID (↓)          KID (↓)
-           Reference               11.09 ± 0.1263   5.20 ± 0.0533    0.008 ± 0.0004
PixelCNN++  Default (no debiasing)  5.16 ± 0.0117    58.70 ± 0.0506   0.196 ± 0.0001
            LFIW                    6.68 ± 0.0773    55.83 ± 0.9695   0.126 ± 0.0009
SNGAN       Default (no debiasing)  8.33 ± 0.0280    20.40 ± 0.0747   0.094 ± 0.0002
            LFIW                    8.57 ± 0.0325    17.29 ± 0.0698   0.073 ± 0.0004

The above expression provides a necessary and sufficient condition for any positive real-valued function (such as the LFIW classifier in Section 3) to improve the KL divergence fit to the underlying data distribution: the fit improves exactly when Δ ≤ 0, i.e., when E_{x∼p_data}[log ŵ_φ(x)] ≥ log E_{x∼p_θ}[ŵ_φ(x)]. In practice, an unbiased estimate of the LHS of this inequality can be obtained via Monte Carlo averaging of log-importance weights based on D_train. The empirical estimate of the RHS is however biased.³ To remedy this shortcoming, we consider the following necessary but insufficient conditions.

Proposition 1. If Δ(p_data, p_θ, p_{θ,φ}) ≤ 0, then the following conditions hold:

E_{x∼p_data} [ŵ_φ(x)] ≥ E_{x∼p_θ} [ŵ_φ(x)],   (16)
E_{x∼p_data} [log ŵ_φ(x)] ≥ E_{x∼p_θ} [log ŵ_φ(x)].   (17)
The conditions in Eq. 16 and Eq. 17 follow directly via Jensen's inequality applied to the LHS and RHS of Eq. 15 respectively. Here, we note that estimates for the expectations in Eqs. 16-17 based on Monte Carlo averaging of (log-) importance weights are unbiased.

5 Application Use Cases

In all our experiments, the binary classifier for estimating the importance weights was a calibrated deep neural network trained to minimize the cross-entropy loss. The self-normalized LFIW in Eq. (7) worked best. Additional analysis on the estimators and experiment details are in Appendices B and C.

5.1 Goodness-of-fit testing

In the first set of experiments, we highlight the benefits of importance weighting for a debiased evaluation of three popularly used sample quality metrics, viz. Inception Score (IS) [27], Fréchet Inception Distance (FID) [28], and Kernel Inception Distance (KID) [29]. All these scores can be formally expressed as empirical expectations with respect to the model. For all these metrics, we can simulate the population-level unbiased case as a "reference score" wherein we artificially set both the real and generated sets of samples used for evaluation to be finite, disjoint sets derived from p_data.

We evaluate the three metrics for two state-of-the-art models trained on the CIFAR-10 dataset, viz. an autoregressive model PixelCNN++ [10] learned via maximum likelihood estimation and a latent variable model SNGAN [11] learned via adversarial training. For evaluating each metric, we draw 10,000 samples from the model. In Table 1, we report the metrics with and without the LFIW bias correction. The consistent debiased evaluation of these metrics via self-normalized LFIW suggests that the SIR approximation to the importance resampled distribution (Eq. 11) is a better fit to p_data.
5.2 Data Augmentation for Multi-Class Classification

We consider data augmentation via Data Augmentation Generative Adversarial Networks (DAGAN) [1]. While DAGAN was motivated by and evaluated for the task of meta-learning, it can also be applied to multi-class classification scenarios, which is the setting we consider here. We trained a DAGAN on the Omniglot dataset of handwritten characters [12]. The DAGAN training procedure is described in the Appendix. The dataset is particularly relevant because it contains 1600+ classes but only 20 examples from each class and hence could potentially benefit from augmented data.

³ If Ẑ is an unbiased estimator for Z, then log Ẑ is a biased estimator for log Z via Jensen's inequality.

Figure 2: Qualitative evaluation of importance weighting for data augmentation. (a-f) Top row shows held-out data samples from a specific class in Omniglot. Bottom row shows generated samples from the same class ranked in decreasing order of importance weights.

Table 2: Classification accuracy on the Omniglot dataset. Standard errors computed over 5 runs.

Dataset    D_cl             D_g              D_g w/ LFIW      D_cl + D_g       D_cl + D_g w/ LFIW
Accuracy   0.6603 ± 0.0012  0.4431 ± 0.0054  0.4481 ± 0.0056  0.6600 ± 0.0040  0.6818 ± 0.0022

Once the model has been trained, it can be used for data augmentation in many ways. In particular, we consider ablation baselines that use various combinations of the real training data D_cl and generated data D_g for training a downstream classifier. When the generated data D_g is used, we can either use the data directly with uniform weighting for all training points, or choose to importance weight (LFIW) the contributions of the individual training points to the overall loss. The results are shown in Table 2.
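Importance weighting the contributions of the augmented points amounts to scaling each generated example's loss term in Eq. (1) by its estimated weight, with real examples keeping weight 1. A sketch of such a weighted cross-entropy objective (the names and shapes are ours, not from the paper's implementation):

```python
import numpy as np

def weighted_cross_entropy(log_probs, labels, weights):
    """Mean per-example negative log-likelihood, where generated examples
    carry their LFIW weights and real examples carry weight 1.

    log_probs: (batch, classes) log class probabilities from the classifier f_psi
    labels:    (batch,) integer class labels
    weights:   (batch,) per-example weights
    """
    nll = -log_probs[np.arange(len(labels)), labels]   # per-example NLL
    return float(np.mean(np.asarray(weights) * nll))
```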
While generated data (D_g) alone cannot, as expected, be used to obtain competitive performance relative to the real data (D_cl) on this task, the bias it introduces for evaluation and subsequent optimization overshadows even the naive data augmentation (D_cl + D_g). In contrast, we can obtain significant improvements by importance weighting the generated points (D_cl + D_g w/ LFIW).

Qualitatively, we can observe the effect of importance weighting in Figure 2. Here, we show true and generated samples for 6 randomly chosen classes (a-f) in the Omniglot dataset. The generated samples are ranked in decreasing order of the importance weights. There is no way to formally test the validity of such rankings, and this criterion can also prefer points which have high density under p_data but are unlikely under p_θ, since we are looking at ratios. Visual inspection suggests that the classifier is able to appropriately downweight poorer samples, as shown in Figure 2 (a, b, c, d - bottom right). There are also failure modes, such as the lowest-ranked generated images in Figure 2 (e, f - bottom right), where the classifier weights reasonable generated samples poorly relative to others. This could be due to particular artifacts, such as a tiny disconnected blurry speck in Figure 2 (e - bottom right), which could be more revealing to a classifier distinguishing real and generated data.

5.3 Model-based Off-policy Policy Evaluation

So far, we have seen use cases where the generative model was trained on data from the same distribution we wish to use for Monte Carlo evaluation. We can extend our debiasing framework to more involved settings where the generative model is a building block for specifying the full data generation process, e.g., trajectory data generated via a dynamics model along with an agent policy.
In particular, we consider the setting of off-policy policy evaluation (OPE), where the goal is to evaluate policies using experiences collected from a different policy. Formally, let (S, A, r, P, η, T) denote an (undiscounted) Markov decision process with state space S, action space A, reward function r, transition P, initial state distribution η, and horizon T. Assume π_e : S × A → [0, 1] is a known policy that we wish to evaluate. The probability of generating a certain trajectory τ = {s_0, a_0, s_1, a_1, ..., s_T, a_T} of length T with policy π_e and transition P is given as:

p*(τ) = η(s_0) Π_{t=0}^{T−1} π_e(a_t | s_t) P(s_{t+1} | s_t, a_t).   (18)

The return R(τ) on a trajectory is the sum of the rewards across the state-action pairs in τ: R(τ) = Σ_{t=1}^{T} r(s_t, a_t), where we assume a known reward function r.

Table 3: Off-policy policy evaluation on MuJoCo tasks. Standard error is over 10 Monte Carlo estimates, where each estimate contains 100 randomly sampled trajectories.

Environment      v(π_e) (Ground truth)   ṽ(π_e)        v̂(π_e) (w/ LFIW)   v̂_80(π_e) (w/ LFIW)
Swimmer          36.7 ± 0.1              100.4 ± 3.2   25.7 ± 3.1          47.6 ± 4.8
HalfCheetah      241.7 ± 3.56            204.0 ± 0.8   217.8 ± 4.0         219.1 ± 1.6
HumanoidStandup  14170 ± 53              8417 ± 28     9372 ± 375          9221 ± 381

Figure 3: Estimation error δ(v) = v(π_e) − v̂_H(π_e) for different values of H (minimum 0, maximum 100) on Swimmer, HalfCheetah, and HumanoidStandup. Shaded area denotes standard error over different random seeds.

We are interested in the value of a policy, defined as v(π_e) = E_{τ∼p*(τ)}[R(τ)]. Evaluating π_e requires the (unknown) transition dynamics P. The dynamics model is a conditional generative model of the next state s_{t+1} conditioned on the previous state-action pair (s_t, a_t).
If we have access to historical logged data D_τ of trajectories τ = {s_0, a_0, s_1, a_1, ...} from some behavioral policy π_b : S × A → [0, 1], then we can use this off-policy data to train a dynamics model P_θ(s_{t+1} | s_t, a_t). The policy π_e can then be evaluated under this learned dynamics model as ṽ(π_e) = E_{τ∼p̃(τ)}[R(τ)], where p̃ uses P_θ instead of the true dynamics in Eq. (18). However, the trajectories sampled with P_θ could significantly deviate from samples from P due to compounding errors [30].

In order to correct for this bias, we can use likelihood-free importance weighting on entire trajectories of data. The binary classifier c(s_t, a_t, s_{t+1}) for estimating the importance weights in this case distinguishes between triples of true and generated transitions. For any true triple (s_t, a_t, s_{t+1}) extracted from the off-policy data, the corresponding generated triple (s_t, a_t, ŝ_{t+1}) only differs in the final transition state, i.e., ŝ_{t+1} ∼ P_θ(ŝ_{t+1} | s_t, a_t). Such a classifier allows us to obtain the importance weights ŵ(s_t, a_t, ŝ_{t+1}) for every predicted state transition (s_t, a_t, ŝ_{t+1}). The importance weights for the trajectory τ can be derived from the importance weights of these individual transitions as:

p*(τ)/p̃(τ) = Π_{t=0}^{T−1} [P(s_{t+1} | s_t, a_t) / P_θ(s_{t+1} | s_t, a_t)] ≈ Π_{t=0}^{T−1} ŵ(s_t, a_t, ŝ_{t+1}).   (19)

Our final LFIW estimator is given as:

v̂(π_e) = E_{τ∼p̃(τ)} [ Π_{t=0}^{T−1} ŵ(s_t, a_t, ŝ_{t+1}) · R(τ) ].   (20)

We consider three continuous control tasks in the MuJoCo simulator [15] from OpenAI Gym [31] (in increasing number of state dimensions): Swimmer, HalfCheetah, and HumanoidStandup. High-dimensional state spaces make it challenging to learn a reliable dynamics model in these environments.
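The per-trajectory reweighting in Eqs. (19)-(20) can be sketched as follows, here combined with the self-normalization of Eq. (7) (the function names are ours; w_hat stands for the transition classifier's weight estimate):

```python
import numpy as np

def trajectory_weight(w_hat, transitions):
    """Product of per-transition importance weights, as in Eq. (19).
    transitions: iterable of (s_t, a_t, s_next) triples under the learned model."""
    return float(np.prod([w_hat(s, a, s_next) for (s, a, s_next) in transitions]))

def lfiw_policy_value(w_hat, trajectories, returns):
    """Self-normalized version of the value estimator in Eq. (20):
    a weighted average of trajectory returns R(tau)."""
    w = np.array([trajectory_weight(w_hat, tr) for tr in trajectories])
    return float(np.sum(w * np.asarray(returns)) / np.sum(w))
```

The multiplicative structure of the weights is what drives the variance discussed below: a single over- or under-confident transition weight scales the whole trajectory's contribution.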
We train behavioral and evaluation policies using Proximal Policy Optimization [32] with different hyperparameters for the two policies. The dataset collected via trajectories from the behavioral policy is used to train an ensemble neural network dynamics model. We then use the trained dynamics model to evaluate ṽ(π_e) and its importance-weighted version v̂(π_e), and compare them with the ground-truth returns v(π_e). Each estimate is averaged over a set of 100 trajectories with horizon T = 100. Specifically, for v̂(π_e), we also average the estimate over 10 classifier instances trained with different random seeds on different trajectories. We further consider performing importance weighting over only the first H steps, and use uniform weights for the remainder, which we denote as v̂_H(π_e). This allows us to interpolate between ṽ(π_e) ≡ v̂_0(π_e) and v̂(π_e) ≡ v̂_T(π_e). Finally, as in the other experiments, we used the self-normalized variant (Eq. (7)) of the importance weighted estimator in Eq. (20).

We compare the policy evaluations under different environments in Table 3. These results show that the rewards estimated with the trained dynamics model differ from the ground truth by a large margin. By importance weighting the trajectories, we obtain much more accurate policy evaluations. As expected, we also see that while LFIW leads to higher returns on average, the imbalance in trajectory importance weights due to the multiplicative weights of the state-action pairs can lead to higher variance in the importance weighted returns. In Figure 3, we demonstrate that policy evaluation becomes more accurate as more timesteps are used for LFIW evaluations, up to around 80-100 timesteps, which empirically validates the benefits of importance weighting using a classifier.
Given that our estimates have a large variance, it would be worthwhile in future work to combine our approach with other variance reduction techniques such as (weighted) doubly robust estimation [33], as well as to incorporate these estimates within a framework such as MAGIC to further blend with model-free OPE [14]. In Appendix C.5.1, we also consider a stepwise LFIW estimator for MBOPE, which applies importance weighting at the level of every decision as opposed to entire trajectories.

Overall. Across all our experiments, we observe that importance weighting the generated samples leads to uniformly better results, whether in terms of evaluating the quality of samples or their utility in downstream tasks. Since the technique is a black-box wrapper around any generative model, we expect it to benefit a diverse set of tasks in follow-up works. However, some caution should also be exercised with these techniques, as evident from the results of Table 1. Note that in this table, the confidence intervals (computed using the reported standard errors) around the model scores after importance weighting still do not contain the reference scores obtained from the true model. This would not have been the case if our debiased estimator were completely unbiased, and this observation reiterates our earlier claim that LFIW reduces bias rather than eliminating it completely. Indeed, when such a mismatch is observed, a good diagnostic is either to learn a more powerful classifier that better approximates the Bayes optimum, or to find additional data from p_data in case the generative model fails the full support assumption.

6 Related Work & Discussion

Density ratios enjoy widespread use across machine learning, e.g., for handling covariate shift, class imbalance, etc. [9, 34]. In generative modeling, estimating these ratios via binary classifiers is frequently used for defining learning objectives and two-sample tests [19, 35-41].
In particular, such classifiers have been used to define learning frameworks such as generative adversarial networks [8, 42], likelihood-free Approximate Bayesian Computation (ABC) [43], and earlier work in unsupervised-as-supervised learning [44] and noise contrastive estimation [43], among others. Recently, [45] used importance weighting to reweigh datapoints based on differences between training and test data distributions, i.e., dataset bias. The key difference is that these works are explicitly interested in learning the parameters of a generative model. In contrast, we use the binary classifier for estimating importance weights to correct for the model bias of any fixed generative model.

Recent concurrent works [46-48] use MCMC and rejection sampling to explicitly transform or reject the generated samples. These methods require extra computation beyond training a classifier, in rejecting the samples or running Markov chains to convergence, unlike the proposed importance weighting strategy. For many model-based Monte Carlo evaluation use cases (e.g., data augmentation, MBOPE), this extra computation is unnecessary. If samples or density estimates are explicitly needed from the induced resampled distribution, we presented a particle-based approximation to the induced density, where the number of particles is a tunable knob allowing a trade-off between statistical accuracy and computational efficiency. Finally, we note that resampling-based techniques have been extensively studied in the context of improving variational approximations for latent-variable generative models [49-52].

7 Conclusion

We identified bias with respect to a target data distribution as a fundamental challenge restricting the use of deep generative models as proposal distributions for Monte Carlo evaluation. We proposed a bias correction framework based on importance sampling. The importance weights are learned in a likelihood-free fashion via a binary classifier.
Empirically, we find the bias correction to be useful across a surprising variety of tasks, including goodness-of-fit sample quality tests, data augmentation, and model-based off-policy policy evaluation. The ability to characterize the bias of a deep generative model is an important step towards using these models to guide decisions in high-stakes applications under uncertainty [53, 54], such as healthcare [55-57] and robust anomaly detection [58, 59].

Acknowledgments

This project was initiated when AG was an intern at Microsoft Research. We are thankful to Daniel Levy, Rui Shu, Yang Song, and members of the Reinforcement Learning, Deep Learning, and Adaptive Systems and Interaction groups at Microsoft Research for helpful discussions and comments on early drafts. This research was supported by NSF (#1651565, #1522054, #1733686), ONR, AFOSR (FA9550-19-1-0024), and FLI.

References

[1] Antreas Antoniou, Amos Storkey, and Harrison Edwards. Data augmentation generative adversarial networks. arXiv preprint arXiv:1711.04340, 2017.
[2] Shie Mannor, Duncan Simester, Peng Sun, and John N Tsitsiklis. Bias and variance approximation in value function estimates. Management Science, 53(2):308-322, 2007.
[3] Philip S Thomas. Safe reinforcement learning. PhD thesis, University of Massachusetts Libraries, 2015.
[4] Murray Rosenblatt. Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics, pages 832-837, 1956.
[5] Sanjeev Arora, Andrej Risteski, and Yi Zhang. Do GANs learn the distribution? Some theory and empirics. In International Conference on Learning Representations, 2018.
[6] Daniel G Horvitz and Donovan J Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 1952.
[7] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.
[9] Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Density ratio estimation in machine learning. Cambridge University Press, 2012.
[10] Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P Kingma. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017.
[11] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
[12] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332-1338, 2015.
[13] Doina Precup, Richard S. Sutton, and Satinder P. Singh. Eligibility traces for off-policy policy evaluation. In International Conference on Machine Learning, 2000.
[14] Philip Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, 2016.
[15] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In International Conference on Intelligent Robots and Systems. IEEE, 2012.
[16] Benigno Uria, Marc-Alexandre Côté, Karol Gregor, Iain Murray, and Hugo Larochelle. Neural autoregressive distribution estimation. The Journal of Machine Learning Research, 17(1):7184-7220, 2016.
[17] Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
[18] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
[19] Shakir Mohamed and Balaji Lakshminarayanan. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, 2016.
[20] Alexander J Ratner, Henry Ehrenberg, Zeshan Hussain, Jared Dunnmon, and Christopher Ré. Learning to compose domain-specific transformations for data augmentation. In Advances in Neural Information Processing Systems, 2017.
[21] Shengjia Zhao, Hongyu Ren, Arianna Yuan, Jiaming Song, Noah Goodman, and Stefano Ermon. Bias and generalization in deep generative models: An empirical study. In Advances in Neural Information Processing Systems, 2018.
[22] Aditya Grover and Stefano Ermon. Boosted generative models. In AAAI Conference on Artificial Intelligence, 2018.
[23] Augustus Odena, Vincent Dumoulin, and Chris Olah. Deconvolution and checkerboard artifacts. Distill, 2016. doi: 10.23915/distill.00003. URL http://distill.pub/2016/deconv-checkerboard.
[24] Augustus Odena. Open questions about generative adversarial networks. Distill, 4(4):e18, 2019.
[25] Jun S Liu and Rong Chen. Sequential Monte Carlo methods for dynamic systems. Journal of the American Statistical Association, 93(443):1032-1044, 1998.
[26] Arnaud Doucet, Simon Godsill, and Christophe Andrieu. On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing, 10(3):197-208, 2000.
[27] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234-2242, 2016.
[28] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626-6637, 2017.
[29] Mikołaj Bińkowski, Dougal J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. arXiv preprint arXiv:1801.01401, 2018.
[30] Stéphane Ross and Drew Bagnell. Efficient reductions for imitation learning. In International Conference on Artificial Intelligence and Statistics, 2010.
[31] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
[32] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[33] Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. More robust doubly robust off-policy evaluation. In International Conference on Machine Learning, 2018.
[34] Jonathon Byrd and Zachary C Lipton. What is the effect of importance weighting in deep learning? arXiv preprint, 2018.
[35] Mihaela Rosca, Balaji Lakshminarayanan, David Warde-Farley, and Shakir Mohamed. Variational approaches for auto-encoding generative adversarial networks. arXiv preprint arXiv:1706.04987, 2017.
[36] Arthur Gretton, Karsten M Borgwardt, Malte Rasch, Bernhard Schölkopf, and Alex J Smola. A kernel method for the two-sample-problem. In Advances in Neural Information Processing Systems, 2007.
[37] Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint, 2015.
[38] David Lopez-Paz and Maxime Oquab. Revisiting classifier two-sample tests. arXiv preprint arXiv:1610.06545, 2016.
[39] Ivo Danihelka, Balaji Lakshminarayanan, Benigno Uria, Daan Wierstra, and Peter Dayan. Comparison of maximum likelihood and GAN-based training of Real NVPs. arXiv preprint arXiv:1705.05263, 2017.
[40] Daniel Jiwoong Im, He Ma, Graham Taylor, and Kristin Branson. Quantitatively evaluating GANs with divergences proposed for training. arXiv preprint arXiv:1803.01045, 2018.
[41] Ishaan Gulrajani, Colin Raffel, and Luke Metz. Towards GAN benchmarks which require generalization. In International Conference on Learning Representations, 2019.
[42] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271-279, 2016.
[43] Michael U Gutmann and Aapo Hyvärinen. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research, 13(Feb):307-361, 2012.
[44] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning, volume 1. Springer Series in Statistics, New York, NY, USA, 2001.
[45] Maurice Diesendruck, Ethan R Elenberg, Rajat Sen, Guy W Cole, Sanjay Shakkottai, and Sinead A Williamson. Importance weighted generative networks. arXiv preprint arXiv:1806.02512, 2018.
[46] Ryan Turner, Jane Hung, Yunus Saatci, and Jason Yosinski. Metropolis-Hastings generative adversarial networks. arXiv preprint arXiv:1811.11357, 2018.
[47] Samaneh Azadi, Catherine Olsson, Trevor Darrell, Ian Goodfellow, and Augustus Odena. Discriminator rejection sampling. arXiv preprint arXiv:1810.06758, 2018.
[48] Chenyang Tao, Liqun Chen, Ricardo Henao, Jianfeng Feng, and Lawrence Carin. Chi-square generative adversarial network. In International Conference on Machine Learning, 2018.
[49] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.
[50] Tim Salimans, Diederik Kingma, and Max Welling. Markov chain Monte Carlo and variational inference: Bridging the gap. In International Conference on Machine Learning, 2015.
[51] Christian A Naesseth, Scott W Linderman, Rajesh Ranganath, and David M Blei. Variational sequential Monte Carlo. arXiv preprint arXiv:1705.11140, 2017.
[52] Aditya Grover, Ramki Gummadi, Miguel Lazaro-Gredilla, Dale Schuurmans, and Stefano Ermon. Variational rejection sampling. In International Conference on Artificial Intelligence and Statistics, 2018.
[53] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, 2016.
[54] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, 2017.
[55] Matthieu Komorowski, A Gordon, LA Celi, and A Faisal. A Markov decision process to suggest optimal treatment of severe infections in intensive care. In Neural Information Processing Systems Workshop on Machine Learning for Health, 2016.
[56] Zhengyuan Zhou, Daniel Miller, Neal Master, David Scheinker, Nicholas Bambos, and Peter Glynn. Detecting inaccurate predictions of pediatric surgical durations. In International Conference on Data Science and Advanced Analytics, 2016.
[57] Aniruddh Raghu, Matthieu Komorowski, Leo Anthony Celi, Peter Szolovits, and Marzyeh Ghassemi. Continuous state-space models for optimal sepsis treatment: a deep reinforcement learning approach. arXiv preprint arXiv:1705.08422, 2017.
[58] Eric Nalisnick, Akihiro Matsukawa, Yee Whye Teh, Dilan Gorur, and Balaji Lakshminarayanan. Do deep generative models know what they don't know? arXiv preprint, 2018.
[59] Hyunsun Choi and Eric Jang. Generative ensembles for robust anomaly detection. arXiv preprint arXiv:1810.01392, 2018.
[60] Bradley Efron and Robert J Tibshirani. An introduction to the bootstrap. CRC Press, 1994.
[61] Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In International Conference on Machine Learning, 2005.
[62] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning, 2017.
[63] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: a system for large-scale machine learning. In Operating Systems Design and Implementation, 2016.
[64] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the Inception architecture for computer vision. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[65] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
[66] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, 2016.
[67] Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, and Yuhuai Wu. OpenAI Baselines. GitHub repository, 2017.
[68] Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017.

Appendices

A Confidence Intervals via Bootstrap

The bootstrap is a widely-used tool in statistics for deriving confidence intervals by fitting ensembles of models on resampled data points. If the dataset is finite, e.g., D_train, then the bootstrapped dataset is obtained via random sampling with replacement, and confidence intervals are estimated via the empirical bootstrap. For a parametric model generating the dataset, e.g., p_θ, a fresh bootstrapped dataset is resampled from the model, and confidence intervals are estimated via the parametric bootstrap. See [60] for a detailed review.
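The empirical bootstrap can be sketched in a few lines. This is a generic percentile-bootstrap sketch rather than the paper's exact procedure: `statistic` is any function of a dataset, and `n_boot`, `alpha`, and `seed` are illustrative defaults.

```python
import random

def empirical_bootstrap_ci(data, statistic, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample the dataset with replacement,
    recompute the statistic on each resample, and take empirical quantiles."""
    rng = random.Random(seed)
    n = len(data)
    stats = sorted(
        statistic([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

For the parametric bootstrap, the resampling line would instead draw a fresh dataset of size n from the fitted model p_θ.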
In training a binary classifier, we can estimate the confidence intervals by retraining the classifier on a fresh sample of points from p_θ and a resampling of the training dataset D_train (with replacement). Repeating this process over multiple runs and then taking a suitable quantile gives us the corresponding confidence intervals.

B Bias-Variance of Different LFIW Estimators

As discussed in Section 3, bias reduction using LFIW can suffer from issues where the importance weights are too small due to highly confident predictions of the binary classifier. Across a batch of Monte Carlo samples, this can increase the corresponding variance. Inspired by the importance sampling literature, we proposed additional mechanisms in Eqs. (7-9) to mitigate this additional variance at the cost of reduced debiasing. We now look at the empirical bias-variance trade-off of these different estimators via a simple experiment below.

Our setup follows the goodness-of-fit testing experiments in Section 5. The statistics we choose to estimate are simply the 2048 activations of the prefinal layer of the Inception Network, averaged across the test set of 10,000 samples of CIFAR-10. That is, the true statistics s = {s_1, s_2, ..., s_2048} are given by:

$$s_j = \frac{1}{|D_{\text{test}}|} \sum_{x \in D_{\text{test}}} a_j(x) \quad (21)$$

where a_j is the j-th prefinal-layer activation of the Inception Network. Note that the set of statistics s is fixed (computed once on the test set). To estimate these statistics, we will use different estimators. For example, the default estimator involving no reweighting is given as:

$$\hat{s}_j = \frac{1}{T} \sum_{i=1}^{T} a_j(x_i), \quad x_i \sim p_\theta \quad (22)$$

Note that ŝ_j is a random variable since it depends on the T samples drawn from p_θ. Similar to Eq. (22), the other variants of the LFIW estimators proposed in Section 3 can be derived using Eqs. (7-9).
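As a concrete illustration, the unweighted estimator in Eq. (22) and the self-normalized variant (Eq. (7)) can be sketched as follows. Here `stat` is a hypothetical stand-in for a single activation function a_j, and `weights` for classifier-derived LFIW weights ŵ; neither name comes from the paper's code.

```python
def plain_estimate(samples, stat):
    """Unweighted Monte Carlo estimate of E[stat(x)] over model samples
    (cf. Eq. (22))."""
    return sum(stat(x) for x in samples) / len(samples)

def lfiw_estimate(samples, weights, stat):
    """Importance-weighted estimate using raw (unnormalized) LFIW weights."""
    return sum(w * stat(x) for x, w in zip(samples, weights)) / len(samples)

def self_normalized_estimate(samples, weights, stat):
    """Self-normalized variant (cf. Eq. (7)): weights are divided by their
    sum, so they need only be known up to a multiplicative constant."""
    z = sum(weights)
    return sum(w * stat(x) for x, w in zip(samples, weights)) / z
```

Self-normalization trades a small amount of bias for robustness to poorly scaled weights, which is consistent with its strong showing in Tables 4 and 5.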
For any LFIW estimate ŝ_j, we can use the standard decomposition of the expected mean-squared error into terms corresponding to the (squared) bias and the variance, as shown below:

$$\mathbb{E}[(s_j - \hat{s}_j)^2] = s_j^2 - 2 s_j \mathbb{E}[\hat{s}_j] + \mathbb{E}[\hat{s}_j^2] \quad (23)$$
$$= s_j^2 - 2 s_j \mathbb{E}[\hat{s}_j] + (\mathbb{E}[\hat{s}_j])^2 + \mathbb{E}[\hat{s}_j^2] - (\mathbb{E}[\hat{s}_j])^2 \quad (24)$$
$$= \underbrace{(s_j - \mathbb{E}[\hat{s}_j])^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}[\hat{s}_j^2] - (\mathbb{E}[\hat{s}_j])^2}_{\text{Variance}}. \quad (25)$$

In Table 4, we report the bias and variance terms of the estimators averaged over 10 draws of T = 10,000 samples, further averaging over all 2048 statistics corresponding to s. We observe that self-normalization performs consistently well and is the best or second best in terms of bias and MSE in all cases. The flattened estimator with no debiasing (corresponding to α = 0) has lower bias and higher variance than the self-normalized estimator. Amongst the flattening estimators, lower values of α seem to provide the best bias-variance trade-off. The clipped estimators do not perform well in this setting, with lower values of β slightly preferable over larger values.

Table 4: Bias-variance analysis for PixelCNN++ and SNGAN when T = 10,000. Standard errors over the absolute values of bias and variance evaluations are computed over the 2048 activation statistics. Lower absolute value of bias, lower variance, and lower MSE are better.

Model       Evaluation              |Bias| (↓)          Variance (↓)            MSE (↓)
PixelCNN++  Self-norm               0.0240 ± 0.0014     0.0002935 ± 7.22e-06    0.0046 ± 0.00031
            Flattening (α = 0)      0.0330 ± 0.0023     9.1e-06 ± 2.6e-07       0.0116 ± 0.00093
            Flattening (α = 0.25)   0.1042 ± 0.0018     5.1e-06 ± 1.5e-07       0.0175 ± 0.00138
            Flattening (α = 0.5)    0.1545 ± 0.0022     8.4e-06 ± 3.7e-07       0.0335 ± 0.00246
            Flattening (α = 0.75)   0.1626 ± 0.0022     3.19e-05 ± 2e-06        0.0364 ± 0.00259
            Flattening (α = 1.0)    0.1359 ± 0.0018     0.0002344 ± 1.619e-05   0.0257 ± 0.00175
            Clipping (β = 0.001)    0.1359 ± 0.0018     0.0002344 ± 1.619e-05   0.0257 ± 0.00175
            Clipping (β = 0.01)     0.1357 ± 0.0018     0.0002343 ± 1.618e-05   0.0256 ± 0.00175
            Clipping (β = 0.1)      0.1233 ± 0.0017     0.000234 ± 1.611e-05    0.0215 ± 0.00149
            Clipping (β = 1.0)      0.1255 ± 0.0030     0.0002429 ± 1.606e-05   0.0340 ± 0.00230
SNGAN       Self-norm               0.0178 ± 0.0008     1.98e-05 ± 5.9e-07      0.0016 ± 0.00023
            Flattening (α = 0)      0.0257 ± 0.0010     9.1e-06 ± 2.3e-07       0.0026 ± 0.00027
            Flattening (α = 0.25)   0.0096 ± 0.0007     8.4e-06 ± 3.1e-07       0.0011 ± 8e-05
            Flattening (α = 0.5)    0.0295 ± 0.0006     1.15e-05 ± 6.4e-07      0.0017 ± 0.00011
            Flattening (α = 0.75)   0.0361 ± 0.0006     1.93e-05 ± 1.39e-06     0.0020 ± 0.00012
            Flattening (α = 1.0)    0.0297 ± 0.0005     3.76e-05 ± 3.08e-06     0.0015 ± 7e-05
            Clipping (β = 0.001)    0.0297 ± 0.0005     3.76e-05 ± 3.08e-06     0.0015 ± 7e-05
            Clipping (β = 0.01)     0.0297 ± 0.0005     3.76e-05 ± 3.08e-06     0.0015 ± 7e-05
            Clipping (β = 0.1)      0.0296 ± 0.0005     3.76e-05 ± 3.08e-06     0.0015 ± 7e-05
            Clipping (β = 1.0)      0.1002 ± 0.0018     3.03e-05 ± 2.18e-06     0.0170 ± 0.00171

We repeat the same experiment with T = 5,000 samples and report the results in Table 5. While the variance increases as expected (by almost an order of magnitude), the estimator bias remains roughly the same.

Table 5: Bias-variance analysis for PixelCNN++ and SNGAN when T = 5,000. Standard errors over the absolute values of bias and variance evaluations are computed over the 2048 activation statistics. Lower absolute value of bias, lower variance, and lower MSE are better.

Model       Evaluation              |Bias| (↓)          Variance (↓)            MSE (↓)
PixelCNN++  Self-norm               0.023 ± 0.0014      0.0005086 ± 1.317e-05   0.0049 ± 0.00033
            Flattening (α = 0)      0.0330 ± 0.0023     1.65e-05 ± 4.6e-07      0.0116 ± 0.00093
            Flattening (α = 0.25)   0.1038 ± 0.0018     9.5e-06 ± 3e-07         0.0174 ± 0.00137
            Flattening (α = 0.5)    0.1539 ± 0.0022     1.74e-05 ± 8e-07        0.0332 ± 0.00244
            Flattening (α = 0.75)   0.1620 ± 0.0022     6.24e-05 ± 3.83e-06     0.0362 ± 0.00256
            Flattening (α = 1.0)    0.1360 ± 0.0018     0.0003856 ± 2.615e-05   0.0258 ± 0.00174
            Clipping (β = 0.001)    0.1360 ± 0.0018     0.0003856 ± 2.615e-05   0.0258 ± 0.00174
            Clipping (β = 0.01)     0.1358 ± 0.0018     0.0003856 ± 2.615e-05   0.0257 ± 0.00173
            Clipping (β = 0.1)      0.1234 ± 0.0017     0.0003851 ± 2.599e-05   0.0217 ± 0.00148
            Clipping (β = 1.0)      0.1250 ± 0.0030     0.0003821 ± 2.376e-05   0.0341 ± 0.00232
SNGAN       Self-norm               0.0176 ± 0.0008     3.88e-05 ± 9.6e-07      0.0016 ± 0.00022
            Flattening (α = 0)      0.0256 ± 0.0010     1.71e-05 ± 4.3e-07      0.0027 ± 0.00027
            Flattening (α = 0.25)   0.0099 ± 0.0007     1.44e-05 ± 3.7e-07      0.0011 ± 8e-05
            Flattening (α = 0.5)    0.0298 ± 0.0006     1.62e-05 ± 5.3e-07      0.0017 ± 0.00012
            Flattening (α = 0.75)   0.0366 ± 0.0006     2.38e-05 ± 1.11e-06     0.0021 ± 0.00012
            Flattening (α = 1.0)    0.0302 ± 0.0005     4.56e-05 ± 2.8e-06      0.0015 ± 7e-05
            Clipping (β = 0.001)    0.0302 ± 0.0005     4.56e-05 ± 2.8e-06      0.0015 ± 7e-05
            Clipping (β = 0.01)     0.0302 ± 0.0005     4.56e-05 ± 2.8e-06      0.0015 ± 7e-05
            Clipping (β = 0.1)      0.0302 ± 0.0005     4.56e-05 ± 2.81e-06     0.0015 ± 7e-05
            Clipping (β = 1.0)      0.1001 ± 0.0018     5.19e-05 ± 2.81e-06     0.0170 ± 0.0017

C Additional Experimental Details

C.1 Calibration

Figure 4: Calibration of classifiers for density ratio estimation.

We found in all our cases that the binary classifiers used for training the model were highly calibrated by default and did not require any further recalibration. See, for instance, the calibration of the binary classifier used for the goodness-of-fit experiments in Figure 4. We performed the analysis on a held-out set of real and generated samples and used 10 bins for computing the calibration statistics. We believe the default calibration behavior is largely due to the fact that our binary classifiers distinguishing real and fake data do not require the very complex neural network architectures and training tricks that lead to miscalibration in multi-class classification.
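The 10-bin calibration analysis mentioned above can be reproduced with a simple binning routine. The expected calibration error (ECE) summary shown here is a standard aggregate of the per-bin gaps, added for illustration; the appendix itself inspects the per-bin curve (Figure 4) rather than a single number.

```python
def calibration_bins(probs, labels, n_bins=10):
    """Bucket predicted probabilities into equal-width bins and return
    (mean confidence, empirical accuracy, count) for each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p = 1.0 falls in the last bin
        bins[idx].append((p, y))
    stats = []
    for b in bins:
        if b:
            conf = sum(p for p, _ in b) / len(b)
            acc = sum(y for _, y in b) / len(b)
            stats.append((conf, acc, len(b)))
    return stats

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: count-weighted average |confidence - accuracy| over the bins."""
    n = len(probs)
    return sum(count * abs(conf - acc)
               for conf, acc, count in calibration_bins(probs, labels, n_bins)) / n
```

A well-calibrated classifier has per-bin confidence close to per-bin accuracy, i.e., an ECE near zero.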
As shown in [61], shallow networks are well-calibrated, and [62] further argue that a major reason for miscalibration is the use of a softmax loss, as is typical for multi-class problems.

C.2 Synthetic experiment

The classifier used in this case is a multi-layer perceptron with a single hidden layer of 100 units, trained to minimize the cross-entropy loss using first-order optimization methods. The dataset used for training the classifier consists of an equal number of samples (denoted as n in Figure 1) drawn from the generative model and the data distribution.

C.3 Goodness-of-fit testing

We used the TensorFlow implementation of the Inception Network [63] to ensure the sample quality metrics are comparable with prior work. For a semantic evaluation of differences in sample quality, this test is performed in the feature space of a pretrained classifier, such as the prefinal activations of the Inception Network [64]. For example, the Inception Score (IS) for a generative model p_θ given a classifier d(·) can be expressed as:

$$\text{IS} = \exp\left(\mathbb{E}_{x \sim p_\theta}[\text{KL}(d(y \mid x) \,\|\, d(y))]\right).$$

The FID score is another metric which, unlike the Inception Score, also takes into account real data from p_data. Mathematically, the FID between sets S and R, sampled from distributions p_θ and p_data respectively, is defined as:

$$\text{FID}(S, R) = \|\mu_S - \mu_R\|_2^2 + \text{Tr}\left(\Sigma_S + \Sigma_R - 2\sqrt{\Sigma_S \Sigma_R}\right)$$

where (μ_S, Σ_S) and (μ_R, Σ_R) are the empirical means and covariances computed based on S and R respectively. In a similar vein, KID compares statistics between samples in a feature space defined via a combination of kernels and a pretrained classifier. The standard kernel used is a radial-basis function kernel with a fixed bandwidth of 1. As desired, each score is optimized when the data and model distributions match. We used the open-sourced model implementations of PixelCNN++ [27] and SNGAN [11].
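For intuition, the FID formula above simplifies nicely in one dimension, where the trace term collapses to (σ_S − σ_R)². The toy sketch below illustrates only this scalar case; real FID is computed on 2048-dimensional Inception features with full covariance matrices and a matrix square root, which this deliberately avoids.

```python
import math

def fid_1d(model_samples, real_samples):
    """FID between two 1-D samples under Gaussian assumptions; in 1-D
    Tr(Sigma_S + Sigma_R - 2*sqrt(Sigma_S * Sigma_R)) = (sd_S - sd_R)^2."""
    def mean_std(v):
        m = sum(v) / len(v)
        var = sum((x - m) ** 2 for x in v) / len(v)
        return m, math.sqrt(var)
    mu_s, sd_s = mean_std(model_samples)
    mu_r, sd_r = mean_std(real_samples)
    return (mu_s - mu_r) ** 2 + (sd_s - sd_r) ** 2
```

As the formula requires, the score is zero exactly when the two empirical Gaussians coincide.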
Following the observation by [38], we found that training a binary classifier on top of the feature space of any pretrained image classifier was useful for removing low-level artifacts in the generated images when classifying an image as real or fake. We hence learned a multi-layer perceptron (with a single hidden layer of 1000 units) on top of the 2048-dimensional feature space of the Inception Network. Learning was done using the Adam optimizer with the default hyperparameters, a learning rate of 0.001, and a batch size of 64. We observed relatively fast convergence when training the binary classifier (in fewer than 20 epochs) on both PixelCNN++- and SNGAN-generated data, and the best validation set accuracy across the first 20 epochs was used for final model selection.

C.4 Data Augmentation

Our codebase was implemented using the PyTorch library [65]. We built on top of the open-source implementation of DAGAN (https://github.com/AntreasAntoniou/DAGAN.git) [1]. A DAGAN learns to augment data by training a conditional generative model G_θ : X × Z → X based on a training dataset D_cl. This dataset is the same as the one we used for training the generative model and the binary classifier for density ratio estimation. The generative model is learned via a minimax game with a critic. For any conditioning datapoint x_i ∈ D_train and noise vector z ∼ p(z), the critic learns to distinguish the generated data G_θ(x_i, z) paired with x_i against another pair (x_i, x_j). Here, the point x_j is chosen such that x_i and x_j have the same label in D_cl, i.e., y_i = y_j. Hence, the critic learns to classify pairs of (real, real) and (real, generated) points while encouraging the generated points to be of the same class as the point being conditioned on. For the generated data, the label y is assumed to be the same as the class of the point that was used for generating the data. We refer the reader to [1] for further details.
Given a DAGAN model, we additionally require training a binary classifier for estimating importance weights and a multi-class classifier for subsequent classification. The architecture for both of these use cases follows prior work in meta-learning on Omniglot [66]. We train the DAGAN on the 1200 classes reserved for training in prior works. For each class, we consider a 15/5/5 split of the 20 examples for training, validation, and testing. Except for the final output layer, the architecture consists of 4 blocks of 3x3 convolutions with 64 filters, followed by batch normalization [64], a ReLU non-linearity, and 2x2 max pooling. Learning was done for 100 epochs using the Adam optimizer with default parameters, a learning rate of 0.001, and a batch size of 32.

C.5 Model-based Off-policy Policy Evaluation

For this set of experiments, we used TensorFlow [63] and OpenAI Baselines (https://github.com/openai/baselines.git) [67]. We evaluate over three environments, viz. Swimmer, HalfCheetah, and HumanoidStandup (Figure 5). Both HalfCheetah and Swimmer reward the agent for gaining higher horizontal velocity; HumanoidStandup rewards the agent for gaining more height via standing up. In all three environments, the initial state distributions are obtained by adding small random perturbations around a certain state. The dimensions of the state and action spaces are shown in Table 6.

Figure 5: Environments in the OPE experiments: (a) Swimmer, (b) HalfCheetah, (c) HumanoidStandup.

Our policy network has two fully connected layers of 64 neurons with tanh activations for each layer, whereas our transition model / classifier has three hidden layers of 500 neurons with swish activations [68]. We obtain our evaluation policy by training with PPO for 1M timesteps, and our behavior policy by training with PPO for 500k timesteps.
Then we train the dynamics model P_θ for 100k iterations with a batch size of 128. Our classifier is trained for 10k iterations with a batch size of 250, where we concatenate (s_t, a_t, s_{t+1}) into a single vector.

Table 6: Statistics for the environments.

Environment        State dimensionality    Action dimensionality
Swimmer            8                       2
HalfCheetah        17                      6
HumanoidStandup    376                     17

Table 7: Off-policy policy evaluation on MuJoCo tasks. Standard error is over 10 Monte Carlo estimates, where each estimate contains 100 randomly sampled trajectories. Here, we perform stepwise LFIW over transition triplets.

Environment        v(π_e) (Ground truth)   ṽ(π_e)         v̂(π_e) (w/ LFIW)   v̂_80(π_e) (w/ LFIW)
Swimmer            36.7 ± 0.1              100.4 ± 3.2    19.4 ± 4.3         48.3 ± 4.0
HalfCheetah        241.7 ± 3.6             204.0 ± 0.8    229.1 ± 4.9        214.9 ± 3.9
HumanoidStandup    14170 ± 5.3             8417 ± 28      10612 ± 794        9950 ± 640

Figure 6: Estimation error δ(v) = v(π_e) − v̂_H(π_e) for different values of H (minimum 0, maximum 100) on Swimmer, HalfCheetah, and HumanoidStandup. The shaded area denotes standard error over different random seeds; each seed uses 100 sampled trajectories. Here, we use LFIW over transition triplets.

C.5.1 Stepwise LFIW

Here, we consider performing LFIW over the transition triplets, where each transition triplet (s_t, a_t, s_{t+1}) is assigned its own importance weight. This is in contrast to assigning a single importance weight to the entire trajectory, obtained by multiplying the importance weights of all transitions in the trajectory. The importance weight for a transition triplet is defined as:

$$\frac{p^*(s_t, a_t, s_{t+1})}{\tilde{p}(s_t, a_t, s_{t+1})} \approx \hat{w}(s_t, a_t, s_{t+1}), \quad (26)$$

so the corresponding LFIW estimator is given as:

$$\hat{v}(\pi_e) = \mathbb{E}_{\tau \sim \tilde{p}(\tau)}\left[\sum_{t=0}^{T-1} \hat{w}(s_t, a_t, \hat{s}_{t+1}) \cdot r(s_t, a_t)\right]. \quad (27)$$

We describe this as the "stepwise" LFIW approach for off-policy policy evaluation. We perform self-normalization over the weights of each triplet. From the results in Table 7 and Figure 6, stepwise LFIW also reduces bias for OPE compared to not using LFIW. Compared to the "trajectory-based" LFIW described in Eq. (20), the stepwise estimator has slightly higher variance and weaker performance for H = 20, 40, but outperforms the trajectory-level estimator when H = 100 on the HalfCheetah and HumanoidStandup environments.
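A minimal sketch of the stepwise estimator in Eq. (27) for a single trajectory, again assuming a hypothetical `classifier` callable returning the probability that a transition is real; the self-normalization across trajectories mentioned above is omitted here for brevity.

```python
def stepwise_lfiw_return(classifier, trajectory, rewards, eps=1e-6):
    """Stepwise LFIW return (cf. Eq. (27)): each reward r(s_t, a_t) is scaled
    by its own transition-level weight instead of a single trajectory-level
    product of weights."""
    total = 0.0
    for (s, a, s_next), r in zip(trajectory, rewards):
        c = min(max(classifier(s, a, s_next), eps), 1.0 - eps)
        total += (c / (1.0 - c)) * r
    return total
```

Because each weight multiplies only one reward term, the stepwise estimator avoids the exponential blow-up of trajectory-level weight products, at the cost of the behavior observed above for intermediate horizons.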