The Informed Sampler: A Discriminative Approach to Bayesian Inference in Generative Computer Vision Models



Varun Jampani (varun.jampani@tuebingen.mpg.de)
Max Planck Institute for Intelligent Systems, Spemannstraße 41, 72076 Tübingen, Germany

Sebastian Nowozin (sebastian.nowozin@microsoft.com)
Microsoft Research Cambridge, 21 Station Road, Cambridge, CB1 2FB, United Kingdom

Matthew Loper (mloper@tuebingen.mpg.de)
Max Planck Institute for Intelligent Systems, Spemannstraße 41, 72076 Tübingen, Germany

Peter V. Gehler (pgehler@tuebingen.mpg.de)
Max Planck Institute for Intelligent Systems, Spemannstraße 41, 72076 Tübingen, Germany

Abstract

Computer vision is hard because of a large variability in lighting, shape, and texture; in addition, the image signal is non-additive due to occlusion. Generative models promised to account for this variability by accurately modelling the image formation process as a function of latent variables with prior beliefs. Bayesian posterior inference could then, in principle, explain the observation. While intuitively appealing, generative models for computer vision have largely failed to deliver on that promise due to the difficulty of posterior inference. As a result the community has favoured efficient discriminative approaches. We still believe in the usefulness of generative models in computer vision, but argue that we need to leverage existing discriminative or even heuristic computer vision methods. We implement this idea in a principled way with an informed sampler, and in careful experiments demonstrate it on challenging generative models which contain renderer programs as their components. We concentrate on the problem of inverting an existing graphics rendering engine, an approach that can be understood as "Inverse Graphics".
The informed sampler, using simple discriminative proposals based on existing computer vision technology, achieves significant improvements of inference.

Copyright 2015 by the author(s).

1. Introduction

A conceptually elegant view on computer vision is to consider a generative model of the physical image formation process. The observed image becomes a function of unobserved variables of interest (for example, the presence and positions of objects) and nuisance variables (for example, light sources and shadows). When building such a generative model, we can think of a scene description θ that produces an image I = G(θ) using a deterministic rendering engine G, or, more generally, results in a distribution over images, p(I | θ). Given an image observation Î and a prior over scenes p(θ), we can then perform Bayesian inference to obtain updated beliefs p(θ | Î). This view has been advocated since the late 1970s (Horn, 1977; Grenander, 1976; Zhu and Mumford, 1997; Mumford and Desolneux, 2010; Mansinghka et al., 2013; Yuille and Kersten, 2006). Now, 30 years later, we would argue that the generative approach has largely failed to deliver on its promise. The few successes of the idea have been in limited settings. In the successful examples, either the generative model was restricted to a few high-level latent variables, e.g. (Oliver et al., 2000), or restricted to a set of image transformations in a fixed reference frame, e.g. (Black et al., 2000), or it modelled only a limited aspect such as object shape masks (Eslami et al., 2012), or, in the worst case, the generative model was merely used to generate training data for a discriminative model (Shotton et al., 2011). With all its intuitive appeal, its beauty and simplicity, it is fair to say that the track record of generative models in computer vision is poor.
As a result, the field of computer vision is now dominated by efficient but data-hungry discriminative models, the use of empirical risk minimization for learning, and energy minimization on heuristic objective functions for inference.

Figure 1. An example "inverse graphics" problem. A graphics engine renders a 3D body mesh and a depth image using an artificial camera. By Inverse Graphics we refer to the process of estimating the posterior probability over possible bodies given the depth image.

Why did generative models not succeed? There are two key problems that need to be addressed: the design of an accurate generative model, and the inference therein. Modern computer graphics systems that leverage dedicated hardware produce a stunning level of realism at high frame rates. We believe that these systems will find their way into the design of generative models and will open up exciting modelling opportunities. This observation motivates the research question of this paper: the design of a general technique for efficient posterior inference in accurate computer graphics systems. As such it can be understood as an instance of Inverse Graphics (Baumgart, 1974), illustrated in Figure 1 with one of our applications.

The key problem in the generative world view is the difficulty of posterior inference at test time. This difficulty stems from a number of reasons: first, the parameter θ is typically high-dimensional, and so is the posterior. Second, given θ, the image formation process realizes complex and dynamic dependency structures, for example when objects occlude or self-occlude each other. These intrinsic ambiguities result in multi-modal posterior distributions. Third, while most renderers are real-time, each simulation of the forward process is expensive and prevents exhaustive enumeration.
We believe in the usefulness of generative models for computer vision tasks, but argue that in order to overcome the substantial inference challenges we have to devise techniques that are general and allow reuse across several different models and novel scenarios. At the same time, we want to maintain correctness in terms of the probabilistic estimates that they produce. One way to improve inference efficiency is to leverage existing computer vision features and discriminative models to aid inference in the generative model. In this paper we propose the informed sampler, a Markov chain Monte Carlo (MCMC) method with discriminative proposal distributions. It can be understood as an instance of a data-driven MCMC method (Zhu et al., 2000), and our aim is to design a method that is general enough to be applied across different problems, rather than tailored to a particular application.

During sampling, the informed sampler leverages computer vision features and algorithms to make informed proposals for the state of latent variables; these proposals are accepted or rejected based on the generative model. The informed sampler is simple and easy to implement, but it enables inference in generative models that were out of reach for current uninformed samplers. We demonstrate this claim on challenging models that incorporate rendering engines, object occlusion, ill-posedness, and multi-modality. We carefully assess convergence statistics for the samplers to investigate the truthfulness of their probabilistic estimates. In our experiments we use existing computer vision technology: our informed sampler uses standard histogram-of-oriented-gradients (HOG) features (Dalal and Triggs, 2005) and the OpenCV library (Bradski and Kaehler, 2008) to produce informed proposals.
Likewise, one of our models is an existing computer vision model: the BlendSCAPE model, a parametric model of human bodies (Hirshberg et al., 2012).

In Section 2 we discuss related work, and we explain our informed sampler approach in Section 3. Section 4 presents baseline methods and the experimental setup. We then present an experimental analysis of the informed sampler on three diverse problems: estimating camera extrinsics (Section 5), occlusion reasoning (Section 6), and estimating body shape (Section 7). We conclude with a discussion of future work in Section 8.

2. Related Work

This work stands at the intersection of computer vision, computer graphics, and machine learning; it builds on previous approaches, which we discuss below. There is a vast literature on approaches that solve computer vision applications by means of generative models. We mention some works that also use an accurate graphics process as the generative model. This includes applications such as indoor scene understanding (Del Pero et al., 2012), human pose estimation (Lee and Cohen, 2004), hand pose estimation (de La Gorce et al., 2008), and many more. Most of these works are, however, interested in inferring MAP solutions rather than the full posterior distribution.

Our method is similar in spirit to Data-Driven Markov Chain Monte Carlo (DDMCMC) methods, which use a bottom-up approach to help convergence of MCMC sampling. DDMCMC methods have been used in image segmentation (Tu and Zhu, 2002), object recognition (Zhu et al., 2000), and human pose estimation (Lee and Cohen, 2004). The idea of making Markov samplers data dependent is very general, but in the works mentioned above it led to highly problem-specific implementations, mostly using approximate likelihood functions. It is due to this specialization on a problem domain that the proposed samplers are not easily transferable to new problems.
This is what we focus on in our work: to provide a simple yet efficient and general inference technique for problems where an accurate forward process exists. Because our method is general, we believe it is easy to adapt to a variety of new models and tasks.

The idea of inverting graphics (Baumgart, 1974) in order to understand scenes also has roots in the computer graphics community under the term "inverse rendering". The goal of inverse rendering, however, is to derive a direct mathematical model for the forward light transport process and then to analytically invert it. The work of Ramamoorthi and Hanrahan (2001) falls in this category: the authors formulate the light reflection problem as a convolution, to then understand the inverse light transport problem as a deconvolution. While this is a very elegant way to pose the problem, it does require a specification of the inverse process, a requirement generative modelling approaches try to circumvent.

Our approach can also be viewed as an instance of a probabilistic programming approach. In the recent work of Mansinghka et al. (2013), the authors combine graphics modules in a probabilistic programming language to formulate an approximate Bayesian computation. Inference is then implemented using Metropolis-Hastings (MH) sampling. This approach is appealing in its generality and elegance; however, we show that for our graphics problems a plain MH sampling approach is not sufficient to achieve reliable inference, and that our proposed informed sampler achieves robust convergence in these challenging models. The work of Stuhlmüller et al. (2013) is similar to our proposed inference method in that knowledge about the forward process is learned as "stochastic inverses" and then applied for MCMC sampling in a Bayesian network.
In the present work we devise an MCMC sampler that we show works both for a multi-modal problem and for inverting an existing piece of image rendering code. In summary, our method can be understood in a similar context as the above-mentioned papers, including Mansinghka et al. (2013).

3. The Informed Sampler

In general, inference about the posterior distribution is challenging because for a complex model p(Î | θ) no closed-form simplifications can be made. This is especially true in the case that we consider, where p(Î | θ) corresponds to a graphics engine rendering images. Despite this apparent complexity, we observe the following: for many computer vision applications there exist well-performing discriminative approaches that, given the image, predict some parameters θ or distributions thereof. These do not correspond to the posterior distribution that we are interested in, but intuitively the availability of discriminative inference methods should make the task of inferring p(θ | Î) easier. Furthermore, a physically accurate generative model can be used in an offline stage prior to inference to generate as many samples as we would like, or can afford computationally. Again, intuitively, this should allow us to prepare and summarize useful information about the distribution in order to accelerate test-time inference.

Concretely, in our case we will use a discriminative method to provide a global proposal density T_G(· | Î), which we then use in a valid MCMC inference method. In the remainder of the section we first review Metropolis-Hastings Markov chain Monte Carlo (MCMC) and then discuss our proposed informed samplers.

3.1. Metropolis-Hastings MCMC

The goal of any sampler is to realize independent and identically distributed samples from a given probability distribution. MCMC sampling, due to Metropolis et al.
(1953), is a particular instance that generates a sequence of random variables by simulating a Markov chain. Sampling from a target distribution π(·) consists of repeating the following two steps (Liu, 2001):

1. Propose a transition using a proposal distribution T and the current state θ_t:

   θ̄ ∼ T(· | θ_t).

2. Accept or reject the transition based on the Metropolis-Hastings (MH) acceptance rule:

   θ_{t+1} = θ̄ if rand(0, 1) < min{ 1, [π(θ̄) T(θ̄ → θ_t)] / [π(θ_t) T(θ_t → θ̄)] }, and θ_{t+1} = θ_t otherwise.

Different MCMC techniques mainly differ in the implementation of the proposal distribution T.

3.2. Informed Proposal Distribution

We use a common mixture kernel for Metropolis-Hastings sampling:

   T_α(· | Î, θ_t) = α T_L(· | θ_t) + (1 − α) T_G(· | Î).   (1)

Here T_L is an ordinary local proposal distribution, for example a multivariate Normal distribution centered around the current sample θ_t, and T_G is a global proposal distribution independent of the current state. We inject knowledge by conditioning the global proposal distribution T_G on the image observation. We learn the informed proposal T_G(· | Î) discriminatively in an offline training stage, using a non-parametric density estimator described below. The mixture parameter α ∈ [0, 1] controls the contribution of each proposal; for α = 1 we recover MH.

Algorithm 1: Learning a global proposal T_G(θ | I)
1. Simulate {(θ^(i), I^(i))}_{i=1,...,n} from p(I | θ) p(θ)
2. Compute a feature representation v(I^(i))
3. Perform k-means clustering of {v(I^(i))}_i
4. For each cluster C_j ⊂ {1, ..., n}, fit a kernel density estimate KDE(C_j) to the vectors θ_{C_j}

For α = 0 the proposal T_α is identical to T_G(· | Î), and the resulting Metropolis sampler is a valid metropolized independence sampler (Liu, 2001).
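For illustration, the two steps above with the mixture kernel of Eq. (1) can be sketched in a few lines of Python. Everything here is a toy stand-in, not the paper's implementation: a 1-D bimodal density plays the role of the renderer-based posterior, and a fixed Gaussian mixture (here chosen to coincide with the target, mimicking the "perfect global proposal" case) plays the role of T_G(· | Î).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy target: unnormalised bimodal density with modes at +-3 (a stand-in
# for the intractable renderer-based posterior).
def log_target(x):
    return np.logaddexp(-0.5 * (x - 3.0) ** 2, -0.5 * (x + 3.0) ** 2)

# Toy global proposal T_G: here it happens to match the target exactly,
# so global moves mix between the modes easily.
def sample_global():
    mu = 3.0 if rng.random() < 0.5 else -3.0
    return rng.normal(mu, 1.0)

def log_global_density(x):
    return (np.logaddexp(-0.5 * (x - 3.0) ** 2, -0.5 * (x + 3.0) ** 2)
            - np.log(2.0 * np.sqrt(2.0 * np.pi)))

SIGMA_L = 0.5   # std of the local symmetric Gaussian proposal T_L
ALPHA = 0.7     # mixture coefficient alpha from Eq. (1)

def log_mixture_density(dst, src):
    # T_alpha(dst | src) = alpha N(dst; src, sigma^2) + (1-alpha) T_G(dst)
    log_local = (-0.5 * ((dst - src) / SIGMA_L) ** 2
                 - np.log(SIGMA_L * np.sqrt(2.0 * np.pi)))
    return np.logaddexp(np.log(ALPHA) + log_local,
                        np.log(1.0 - ALPHA) + log_global_density(dst))

def inf_mh_step(theta):
    # step 1: propose from the mixture kernel
    if rng.random() < ALPHA:
        prop = rng.normal(theta, SIGMA_L)   # local move from T_L
    else:
        prop = sample_global()              # global move from T_G
    # step 2: MH accept/reject with the full mixture kernel density
    log_acc = (log_target(prop) + log_mixture_density(theta, prop)
               - log_target(theta) - log_mixture_density(prop, theta))
    return prop if np.log(rng.random()) < min(0.0, log_acc) else theta

chain = [0.0]
for _ in range(5000):
    chain.append(inf_mh_step(chain[-1]))
chain = np.asarray(chain)
```

Note that the acceptance ratio uses the full mixture density T_α in both directions; using only the component that generated the proposal would not leave π invariant.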
With α = 0 we call this baseline method Informed Independent MH (INF-INDMH). For intermediate values, α ∈ (0, 1), we combine local with global moves in a valid Markov chain. We call this method Informed Metropolis-Hastings (INF-MH).

3.3. Discriminatively Learning T_G

The key step in the construction of T_G is to include discriminative information about the observation Î. Ideally, we would hope for T_G to propose global moves which improve mixing and even allow mixing between multiple modes, whereas the local proposal T_L is responsible for exploring the density locally. To see that this is in principle possible, consider the case of a perfect global proposal, that is, T_G(· | Î) = p(· | Î), the exact posterior over θ. In that case we would get independent samples with α = 0 because every proposal is accepted. In practice T_G is only an approximation to p(· | Î); if the approximation is good enough, then the mixture of local and global proposals will have a high acceptance rate and explore the density rapidly.

In principle we can use any conditional density estimation technique for learning a proposal T_G from samples. Typically, high-dimensional density estimation is difficult, and even more so in the conditional case; however, in our case we have the true generating process available to provide example pairs (θ, I). Therefore we use a simple but scalable non-parametric density estimation method based on clustering a feature representation of the observed image, v(Î) ∈ R^d. For each cluster we then estimate an unconditional density over θ using kernel density estimation (KDE). We chose this simple setup since it can easily be reused in many different scenarios; in the experiments we solve diverse problems using the same method. This method yields a valid transition kernel for which detailed balance holds.
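The clustering-plus-KDE construction just described (Algorithm 1) reduces to a few library calls. The sketch below uses scikit-learn's KMeans and KernelDensity; the 2-D parameter space, the toy feature map standing in for v(I) (the real experiments use HOG features of rendered images), and all constants are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(1)

# Toy stand-ins: theta is 2-D with a uniform prior, and feature() plays
# the role of v(I) computed from a rendered image I = G(theta).
def simulate_prior(n):
    return rng.uniform(-1.0, 1.0, size=(n, 2))

def feature(theta):
    # hypothetical noisy, nonlinear "image feature" of theta
    return (np.c_[np.sin(3.0 * theta[:, 0]), theta[:, 1] ** 2]
            + 0.01 * rng.normal(size=(len(theta), 2)))

# Algorithm 1: simulate, featurise, cluster, fit one KDE per cluster.
thetas = simulate_prior(20000)
feats = feature(thetas)
kmeans = KMeans(n_clusters=50, n_init=4, random_state=0).fit(feats)
kdes = {}
for j in range(kmeans.n_clusters):
    members = thetas[kmeans.labels_ == j]
    kdes[j] = KernelDensity(bandwidth=0.05).fit(members)  # small bandwidth

# Test time: map the observation's features to a cluster once, then draw
# global proposals from the KDE of that cluster (this is T_G(. | I_hat)).
def sample_T_G(obs_feat, n=1):
    j = int(kmeans.predict(obs_feat.reshape(1, -1))[0])
    return kdes[j].sample(n, random_state=0)

proposals = sample_T_G(feats[0], n=5)
```

The per-cluster KDEs are fit offline; at test time only one nearest-centroid lookup is needed before sampling, which keeps the global proposal cheap.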
In addition to the KDE estimate for the global transition kernel, we also experimented with a random forest approach that maps observations to transition kernels T_G; more details are given in Section 7.

For the feature representation we leverage successful discriminative features and heuristics developed in the computer vision community. Different task-specific feature representations can be used in order to provide invariance to small changes in θ and to nuisance parameters. The main inference method remains the same across problems.

Algorithm 2: INF-MH
Input: observed image Î
T_L ← local proposal distribution (Gaussian)
c ← cluster for v(Î)
T_G ← KDE(c) (as obtained by Algorithm 1)
T = α T_L + (1 − α) T_G
Initialize θ_1
for t = 1 to N − 1 do
  1. Sample θ̄ ∼ T(·)
  2. γ = min{ 1, [π(θ̄ | Î) T(θ̄ → θ_t)] / [π(θ_t | Î) T(θ_t → θ̄)] }
  if rand(0, 1) < γ then θ_{t+1} = θ̄ else θ_{t+1} = θ_t
end for

We construct the KDE for each cluster with a relatively small kernel bandwidth in order to accurately represent the high-probability regions of the posterior. This is similar in spirit to using only high-probability regions as "darts" in the Darting Monte Carlo sampling technique of Sminchisescu and Welling (2011). We summarize the offline training in Algorithm 1.

At test time, this method has the advantage that, given an image Î, we only need to identify the corresponding cluster once, using v(Î), in order to sample efficiently from the kernel density T_G. We show the full procedure in Algorithm 2. This method yields a transition kernel that is a mixture of a reversible symmetric Metropolis-Hastings kernel and a metropolized independence sampler; the combined transition kernel T is hence also reversible. Because the measure of each kernel dominates the support of the posterior, the kernel is ergodic and has the correct stationary distribution (Brooks et al., 2011).
This ensures correctness of the inference; in the experiments we investigate the efficiency of the different methods in terms of convergence statistics.

4. Setup and Baseline Methods

In the remainder of the paper we demonstrate the proposed method in three different experimental setups. For all experiments, we use four parallel chains initialized at different random locations sampled from the prior. The reported numbers are median statistics over multiple test images, except when noted otherwise.

4.1. Baseline Methods

Metropolis-Hastings (MH). Described above; corresponds to α = 1. We use a symmetric diagonal Gaussian proposal distribution centered at θ_t.

Metropolis-Hastings within Gibbs (MHWG). We use a Metropolis-Hastings scheme within a Gibbs sampler; that is, we draw from one-dimensional conditional distributions for proposing moves, and the Markov chain is updated along one dimension at a time. We further use a blocked variant of this MHWG sampler, where we update blocks of dimensions at a time, and denote it by BMHWG.

Parallel Tempering (PT). We use Parallel Tempering to address the problem of sampling from multi-modal distributions (Geyer, 1991; Swendsen and Wang, 1986). This technique is also known as "replica exchange MCMC sampling" (Hukushima and Nemoto, 1996). We run several parallel chains at different temperatures T, sampling π(·)^{1/T}, and at each sampling step propose to exchange two randomly chosen chains. In our experiments we run three chains at temperature levels T ∈ {1, 3, 27}, which were found to work best out of all combinations in {1, 3, 9, 27} for all experiments individually. The highest temperature level corresponds to an almost flat distribution.
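A minimal parallel-tempering sweep, per-chain MH on π(·)^{1/T} followed by a random pairwise exchange proposal, might look as follows. The 1-D bimodal target, step size, and sweep count are toy assumptions, not the experiment's actual configuration; only the temperature ladder {1, 3, 27} matches the text.

```python
import numpy as np

rng = np.random.default_rng(2)

def log_target(x):
    # toy bimodal stand-in for the posterior: narrow modes at +-3
    return np.logaddexp(-0.5 * ((x - 3.0) / 0.3) ** 2,
                        -0.5 * ((x + 3.0) / 0.3) ** 2)

TEMPS = [1.0, 3.0, 27.0]   # temperature ladder used in the experiments
states = [0.0, 0.0, 0.0]   # one state per tempered chain

def pt_sweep(states):
    # 1. an ordinary MH move per chain, targeting pi(.)^(1/T)
    for i, T in enumerate(TEMPS):
        prop = states[i] + rng.normal(0.0, 1.0)
        log_acc = (log_target(prop) - log_target(states[i])) / T
        if np.log(rng.random()) < log_acc:
            states[i] = prop
    # 2. propose to exchange the states of two randomly chosen chains;
    #    accepting with min(1, exp(log_swap)) preserves the joint target
    i, j = rng.choice(len(TEMPS), size=2, replace=False)
    log_swap = ((1.0 / TEMPS[i] - 1.0 / TEMPS[j])
                * (log_target(states[j]) - log_target(states[i])))
    if np.log(rng.random()) < log_swap:
        states[i], states[j] = states[j], states[i]
    return states

cold = []   # trace of the T = 1 chain, the one whose samples are kept
for _ in range(20000):
    states = pt_sweep(states)
    cold.append(states[0])
cold = np.asarray(cold)
```

The hot chain (T = 27) sees an almost flat density and roams between the modes; exchanges then transport mode information down to the cold chain, which a plain MH chain with the same step size would cross only rarely.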
Regeneration Sampler (REG-MH). We implemented a regenerative MCMC method (Mykland et al., 1995) that performs adaption (Gilks et al., 1998) of the proposal distribution during sampling. We use the mixture kernel (Eq. 1) as the proposal distribution and adapt only the global part T_G(· | Î). This is initialized as the prior over θ, and at times of regeneration we fit a KDE to the already drawn samples. For comparison we used the same mixture coefficient α as for INF-MH (more details of this technique in Appendix A).

4.2. MCMC Diagnostics

We use established methods for monitoring the convergence of our MCMC method (Kass et al., 1998; Flegal et al., 2008). In particular, we report the following diagnostics. We compare the different samplers with respect to the number of iterations instead of wall-clock time: the forward graphics process significantly dominates the runtime, so the iterations in our experiments correspond linearly to runtime.

Acceptance Rate (AR). The ratio of accepted samples to the total Markov chain length. The higher the acceptance rate, the fewer samples we need to approximate the posterior. The acceptance rate indicates how well the proposal distribution approximates the true distribution locally.

Potential Scale Reduction Factor (PSRF). The PSRF diagnostic (Gelman and Rubin, 1992; Brooks and Gelman, 1998) is derived by comparing within-chain variances with between-chain variances of sample statistics. For this, it requires independent runs of multiple chains (four in our case) in parallel. Because our sample θ is multi-dimensional, we estimate the PSRF for each parameter dimension separately and take the maximum as the final PSRF value. A value close to one indicates that all chains characterize the same distribution. This does not imply convergence; the chains may all collectively miss a mode. However, a PSRF value much larger than one is a certain sign of lack of convergence of the chain.
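The within-chain versus between-chain comparison can be written down directly from the standard Gelman-Rubin formula; the sketch below follows that common form (not the authors' code) and takes the maximum over parameter dimensions as described above.

```python
import numpy as np

def psrf(chains):
    """Gelman-Rubin potential scale reduction factor.

    chains: array of shape (m, n, d) -- m parallel chains, n samples
    each, d parameter dimensions. Returns the max PSRF over dimensions.
    """
    m, n, d = chains.shape
    chain_means = chains.mean(axis=1)                  # (m, d)
    within = chains.var(axis=1, ddof=1).mean(axis=0)   # W, shape (d,)
    between = n * chain_means.var(axis=0, ddof=1)      # B, shape (d,)
    var_hat = (n - 1) / n * within + between / n       # pooled estimate
    return float(np.sqrt(var_hat / within).max())

rng = np.random.default_rng(3)
# Four chains sampling the same distribution -> PSRF close to 1 ...
same = rng.normal(size=(4, 2000, 6))
# ... versus chains stuck at different locations -> PSRF >> 1.
stuck = same + np.arange(4).reshape(4, 1, 1) * 5.0
```

As the text notes, a value near 1 does not prove convergence (all four chains could miss the same mode), but a large value is conclusive evidence of non-convergence.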
The PSRF also indicates how well the sampler visits different modes of a multi-modal distribution.

Root Mean Square Error (RMSE). During our experiments we have access to the input parameters θ* that generated the image. To assess whether the posterior distribution covers the "correct" value, we report the RMSE between the posterior expectation E_{p(·|Î)}[G(·)] and the value G(θ*) of the generating input. Since noise is added to the observation, we do not have access to the ground-truth posterior expectation, and therefore this measure is only an indicator. Under convergence, all samplers would agree on the same correct value.

4.3. Parameter Selection

For each sampler we individually selected the hyper-parameters that gave the best PSRF value after 10k iterations. In cases where the PSRF did not differ between multiple values, we chose the one with the highest acceptance rate. We include a detailed analysis of the baseline samplers and parameter selection in the supplementary material.

5. Experiment: Estimating Camera Extrinsics

We implement the following simple graphics scenario to create a challenging multi-modal problem. We render a cubic room of edge length 2, with a point light source in the center of the room at (0, 0, 0), viewed by a camera somewhere inside the room. The camera parameters are described by its (x, y, z) position and its orientation, specified by yaw, pitch, and roll angles. The inference task is to estimate the posterior over these 6D camera parameters θ. See Figure 2 for two example renderings. Posterior inference is a highly multi-modal problem: because the room is cubic and thus symmetric, there are 24 different camera parameter settings that result in the same image. This is also shown in Figure 2, where we plot the position and orientation (but not camera roll) of all camera parameters that create the same image.
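The unnormalized posterior that the samplers evaluate in this experiment combines a Gaussian image likelihood N(Î | G(θ), σ²) with uniform priors. A minimal sketch follows, with a tiny hypothetical stub in place of the real room renderer G (the actual model renders 200 × 200 depth images); the wrapped-uniform angle prior contributes only a constant and so drops out of the unnormalized log-posterior.

```python
import numpy as np

SIGMA = 0.02  # observation noise std used in the experiment

def render(theta):
    # Hypothetical 3x3 stand-in for the graphics engine G(theta); the
    # real G renders a 200x200 image of the cubic room.
    x, y, z, yaw, pitch, roll = theta
    return np.outer(np.sin([x, y, z]), np.cos([yaw, pitch, roll]))

def log_posterior(theta, observed):
    pos = np.asarray(theta[:3])
    # uniform prior over positions in [-1, 1]^3; outside -> zero density.
    # Angles have a wrapped uniform prior over [-pi, pi], i.e. a constant
    # that can be ignored here.
    if np.any(np.abs(pos) > 1.0):
        return -np.inf
    resid = observed - render(theta)
    # isotropic Gaussian likelihood, up to an additive constant
    return -0.5 * np.sum(resid ** 2) / SIGMA ** 2

true_theta = [0.2, -0.5, 0.1, 0.3, -1.0, 2.0]
obs = render(true_theta) + np.random.default_rng(4).normal(0.0, SIGMA, size=(3, 3))
```

Each evaluation of this density requires one call to the renderer, which is why the forward process dominates the runtime of all samplers.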
A rendering of a 200 × 200 image with 32-bit resolution, using a single core on an Intel Xeon 2.66 GHz machine, takes about 11 ms on average. A small amount of isotropic Gaussian noise is added to the rendered image G(θ), using a standard deviation of σ = 0.02.

Figure 2. Two rendered room images with possible camera positions and headings that produce the same image. Not shown are the orientations; in the left example all six headings can be rolled by 90, 180, and 270 degrees for the same image.

The posterior distribution we try to infer then reads

   p(θ | Î) ∝ p(Î | θ) p(θ) = N(Î | G(θ), σ²) Uniform(θ).

The uniform prior over the location parameters ranges between −1.0 and 1.0, and the prior over the angle parameters is modelled with a wrapped uniform distribution over [−π, π].

To learn the informed part of the proposal distribution from data, we computed a histogram-of-oriented-gradients (HOG) descriptor (Dalal and Triggs, 2005) from the image, using 9 gradient orientations and cells of size 20 × 20, yielding a feature vector v(I) ∈ R^900. We generated 300k training images using a uniform prior over the camera extrinsic parameters, and performed k-means clustering with 5k cluster centers based on the HOG feature vectors. For each cluster cell, we then computed and stored a KDE for the 6-dimensional camera parameters, following the steps in Algorithm 1. As test data, we created 30 images using extrinsic parameters sampled uniformly at random over their range.

5.1. Results

We show results in Figure 3. We observe that both MH and PT yield low acceptance rates compared to the other methods. However, parallel tempering appears to overcome the multi-modality better and improves over MH in terms of convergence. The same holds for the regeneration technique: we observe many regenerations, good convergence, and a good acceptance rate.
Both INF-INDMH and INF-MH converge quickly.

In this experimental setup we have access to the exact modes; there are 24 different ones. We analyze how quickly the samplers visit the modes and whether or not they capture all of them. For every test instance the pairwise distances between the modes change, so we define "visiting a mode" as follows. We compute a Voronoi tessellation with the modes as centers; a mode is visited if a sample falls into its corresponding Voronoi cell, that is, if it is closer to that mode than to any other. (Sampling uniformly at random would quickly find the modes, depending on the cell sizes, but is not a valid sampler.) We also experimented with balls of different radii around the modes and found behaviour similar to what we report here.

Figure 3 (right) shows results for the various samplers. We find that INF-MH discovers the different modes quicker than the baseline samplers. Sampling only from the global proposal distribution, INF-INDMH, initially visits more modes (it is not being held back by local steps) but is dominated by INF-MH over some range. This indicates that the mixture kernel takes advantage of both local and global moves whenever either one of them explores more slowly. Also, in most examples all samplers miss some modes under our definition; the average number of discovered modes is 21 for INF-MH and even lower for MH.

Figure 4 shows the effect of the mixture coefficient α on the informed sampler INF-MH. Since there is no significant difference in PSRF values for 0 ≤ α ≤ 0.7, we chose α = 0.7 due to its high acceptance rate. Likewise, the parameters of the baseline samplers were chosen based on the PSRF and acceptance-rate metrics. See the supplementary material for the analysis of the baseline samplers and the parameter selection.

Figure 4. Role of the mixture coefficient.
PSRFs and acceptance rates corresponding to various mixture coefficients (α) of INF-MH sampling in the 'Estimating Camera Extrinsics' experiment.

We also tested the MHWG sampler and found that it did not converge even after 100k iterations, with a PSRF value around 3. This is to be expected, since single-variable updates cannot traverse the multi-modal posterior fast enough due to the high correlation of the camera parameters. In Figure 5 we plot the median auto-correlation of samples obtained by the different sampling techniques, separately for each of the six extrinsic camera parameters. The informed sampling approaches (INF-MH and INF-INDMH) produce samples which are more independent than those of the other baseline samplers.

As expected, some knowledge of the multi-modal structure of the posterior needs to be available for a sampler to perform well. The methods INF-INDMH and INF-MH have this information and perform better than the baseline methods and REG-MH.

6. Experiment: Occluding Tiles

In a second experiment we render images depicting a fixed number of six square tiles, each placed at a random location (x, y) in the image, at a random depth z, and with a random orientation θ.

Figure 3. Results of the 'Estimating Camera Extrinsics' experiment. Acceptance rates (left), PSRFs (middle), and average number of modes visited (right) for different sampling methods. We plot the median/average statistics over 30 test examples.

Figure 5. Auto-correlation of samples obtained by different sampling techniques in the camera extrinsics experiment, for each of the six extrinsic camera parameters.

We blur the image and add a small amount of Gaussian noise (σ = 0.02). An example is depicted in Figure 6(a); note that all tiles are of the same size, but tiles that are farther away appear smaller. A rendering of one 200 × 200 image takes about 25 ms on average.
Here, as prior, we again use the uniform distribution over the 3D cube for the tile location parameters, and a wrapped uniform distribution over [−π/4, π/4] for the tile orientation angle. To avoid label-switching issues, each tile is given a fixed colour that is not changed during inference.

We chose this experiment because it resembles the "dead leaves model" of Lee et al. (2001) and has properties that are commonplace in computer vision: it is a scene composed of several objects that are independent except for occlusion, which complicates the problem. If occlusion did not exist, the task could readily be solved using a standard OpenCV (Bradski and Kaehler, 2008) rectangle-finding algorithm (minAreaRect). The output of such an algorithm is shown in Figure 6(c), and we use this algorithm as a discriminative source of information.

This problem is higher dimensional than the previous one (24 dimensions, due to 6 tiles with 4 parameters each). Inference becomes more challenging with increasing dimension, and our approach without modification does not scale well. One way to approach this problem is to factorize the joint distribution into blocks and learn informed proposals separately. In the present experiment, we observed that both the baseline samplers and plain informed sampling fail when proposing all parameters jointly. Since the tiles are independent except for occlusion, we can approximate the full joint distribution as a product of block distributions, where each block corresponds to the parameters of a single tile. To estimate the full posterior distribution, we learn global proposal distributions for each block separately and use a block-Gibbs-like scheme in our sampler, where we propose changes to one tile at a time, alternating between tiles.
The experimental protocol is the same as before: we render 500k images, apply the OpenCV algorithm to fit rectangles, and take the four parameters it finds as features for clustering (10k clusters). Again, KDE distributions are fit to each cluster, and at test time we assign the observed image to its corresponding cluster. The KDE in the chosen cluster determines the global sampler T_G for that tile. We then use T_G to propose an update to all 4 parameters of the tile. We refer to this procedure as INF-BMHWG. Empirically we find α = 0.8 to be optimal for INF-BMHWG sampling.

6.1. Results

An example result is shown in Figure 6. We found that the MH and INF-MH samplers fail entirely on this problem. Both use a proposal distribution for the entire state, and due to the high dimensionality there is almost no acceptance (< 1%), so they do not reach convergence. The MHWG sampler, updating one dimension at a time, is the best among the baseline samplers, with an acceptance rate of around 42%, followed by a block sampler that samples each tile separately. The OpenCV algorithm produces a reasonable initial guess but fails in occlusion cases. The blockwise informed sampler INF-BMHWG converges more quickly, with higher acceptance rates (≈ 53%) and lower reconstruction error. The median curves for 10 test examples are shown in Figure 7; INF-BMHWG produces by far the lowest reconstruction errors.

Figure 6. A visual result in the 'Occluding Tiles' experiment. (a) A sample rendered image, (b) ground truth squares, and most probable estimates from 5000 samples obtained by (c) the MHWG sampler (best baseline) and (d) the INF-BMHWG sampler. (f) Posterior expectation of the square boundaries obtained by INF-BMHWG sampling. (The first 2000 samples are discarded as burn-in.)
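The offline training stage (render from the prior, extract discriminative features, cluster, fit a KDE per cluster) can be sketched on a hypothetical 1-D toy problem; the feature function, cluster count, and bandwidth below are all illustrative stand-ins for the paper's renderer, 10k clusters, and learned KDEs:

```python
import random

# Offline stage: sample parameters from the prior, "render" a feature
# per sample, cluster the features (tiny 1-D k-means, k = 2), and fit
# a Gaussian KDE over the *parameters* that landed in each cluster.
# At test time the observed feature selects a cluster, and that
# cluster's KDE is the global proposal T_G.

random.seed(1)

def render_feature(theta):
    # Stand-in forward model: a noisy, monotone feature of theta.
    return theta + random.gauss(0.0, 0.1)

# 1) Training set from a bimodal prior (mimicking multi-modality).
thetas = [random.gauss(-2.0, 0.3) if random.random() < 0.5
          else random.gauss(2.0, 0.3) for _ in range(2000)]
feats = [render_feature(t) for t in thetas]

# 2) k-means (k = 2) on the features.
centers = [min(feats), max(feats)]
for _ in range(20):
    groups = [[], []]
    for f in feats:
        groups[0 if abs(f - centers[0]) < abs(f - centers[1]) else 1].append(f)
    centers = [sum(g) / len(g) for g in groups]

def assign(f):
    return 0 if abs(f - centers[0]) < abs(f - centers[1]) else 1

# 3) Gaussian KDE per cluster: store the member parameters.
cluster_params = [[], []]
for t, f in zip(thetas, feats):
    cluster_params[assign(f)].append(t)

BW = 0.2  # KDE bandwidth (a hypothetical fixed choice)

def kde_sample(cluster):
    return random.choice(cluster_params[cluster]) + random.gauss(0.0, BW)

# 4) Test time: observed feature -> cluster -> global proposal draw.
observed = render_feature(2.0)
proposal = kde_sample(assign(observed))
```

Sampling a KDE is just "pick a training point in the cluster, add bandwidth noise", which is what makes these global proposals cheap to draw from at every MCMC step.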
Also, in Figure 6(f) the posterior distribution is visualized: fully visible tiles are well localized, while the position and orientation of occluded tiles are more uncertain. Figure 12 in the appendix shows more visual results. Although the model is relatively simple, all baseline samplers perform poorly, and the discriminative information is crucial to enable accurate inference. Here the discriminative information is provided by a readily available heuristic in the OpenCV library.

This experiment illustrates a variation of the informed sampling strategy that can be applied to sampling from high-dimensional distributions. Inference in general high-dimensional distributions is an active area of research and intrinsically difficult. The occluding tiles experiment is simple but illustrates the point: all non-block baseline samplers fail. Block sampling is a common strategy in such scenarios, and many computer vision problems have such block structure. Again the informed sampler improves in convergence speed over the baseline methods. Other techniques that produce better fits to the conditional (block-)marginals should give faster convergence.

7. Experiment: Estimating Body Shape

The last experiment is motivated by a real-world problem: estimating the 3D body shape of a person from a single static depth image. With the recent availability of cheap active depth sensors, the use of RGBD data has become ubiquitous in computer vision (Shao et al., 2013; Jungong et al., 2013).

To represent a human body we use the BlendSCAPE model (Hirshberg et al., 2012), which updates the originally proposed SCAPE model (Anguelov et al., 2005) with better training and blend weights. This model produces a 3D mesh of a human body, as shown in Figure 8, as a function of shape and pose parameters. The shape parameters allow us to represent bodies of many builds and sizes, and include a statistical characterization (being roughly Gaussian).
These parameters control directions in a deformation space learned via PCA from a corpus of roughly 2000 3D mesh models registered to scans of human bodies. The pose parameters are joint angles which indirectly control the local orientations of predefined parts of the model. Our model uses 57 pose parameters and any number of shape parameters to produce a 3D mesh with 10,777 vertices. We use the first 7 SCAPE components to represent the shape of a person. The camera viewpoint, orientation, and pose of the person are held fixed. Thus the rendering process takes θ ∈ R^7, generates a 3D mesh representation, and projects it through a virtual depth camera to create a depth image of the person. This can be done at various resolutions; we chose 430 × 260, with depth values represented as 32-bit numbers in the interval [0, 4]. On average, a full render pass takes about 28 ms. We add Gaussian noise with a standard deviation of 0.02 to the created depth image; see Figure 8 (left) for an example.

We used very simple low-level features for the feature representation. In order to learn the global proposal distribution we compute depth histogram features on a 15 × 10 grid over the image: for each cell we record the mean and variance of the depth values. Additionally we add the height and width of the body silhouette as features, resulting in a feature vector v(I) ∈ R^302. As normalization, each feature dimension is divided by its maximum value in the training set. We used 400k training images sampled from the standard normal prior distribution and 10k clusters to learn the KDE proposal distributions in each cluster cell.

For this experiment we also experimented with a different conditional density estimation approach, using a forest of random regression trees (Breiman et al., 1984; Breiman, 2001).
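The simple depth features described above (per-cell mean and variance on a grid, plus silhouette height and width) can be sketched as follows; the grid is reduced to 3 × 2 and the image is synthetic, so the layout of the feature vector, not the numbers, is the point:

```python
# Depth-image features: per-cell mean and variance of depth on a
# grid, plus silhouette height and width. Grid size reduced from the
# paper's 15 x 10 to 3 x 2 for this toy example.

def depth_features(img, rows=3, cols=2, background=4.0):
    h, w = len(img), len(img[0])
    feats = []
    for r in range(rows):
        for c in range(cols):
            cell = [img[y][x]
                    for y in range(r * h // rows, (r + 1) * h // rows)
                    for x in range(c * w // cols, (c + 1) * w // cols)]
            mean = sum(cell) / len(cell)
            var = sum((v - mean) ** 2 for v in cell) / len(cell)
            feats += [mean, var]
    # Silhouette = pixels closer than the background depth.
    fg = [(y, x) for y in range(h) for x in range(w) if img[y][x] < background]
    if fg:
        ys, xs = [p[0] for p in fg], [p[1] for p in fg]
        feats += [max(ys) - min(ys) + 1, max(xs) - min(xs) + 1]
    else:
        feats += [0, 0]
    return feats

# A 6 x 4 "depth image": background at depth 4.0, a 2 x 2 body at 1.5.
img = [[4.0] * 4 for _ in range(6)]
for y in (2, 3):
    for x in (1, 2):
        img[y][x] = 1.5

v = depth_features(img)
```

With a 3 × 2 grid this yields 6 × 2 + 2 = 14 dimensions; the paper's 15 × 10 grid gives 150 × 2 + 2 = 302, matching v(I) ∈ R^302.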
Figure 7. Results of the 'Occluding Tiles' experiment: acceptance rates (left), PSRFs (middle), and RMSEs (right) for the different sampling methods. Median results for 10 test examples.

In the previous experiments, using the KDE estimates, the discriminative information entered only through the feature representation. If there were no relation between some observed features and the variables we are trying to infer, we would require a large number of samples to reliably estimate the densities in the different clusters. A regression forest can adaptively partition the parameter space based on the observed features and is able to ignore uninformative features, which may lead to better fits of the conditional densities. It can thus be understood as an adaptive version of the k-means clustering technique, which relies solely on the chosen metric (Euclidean in our case). In particular, we use the same features as for k-means clustering, but grow the regression trees using a mean-squared-error criterion for scoring the split functions. A forest of 10 binary trees of depth 15 is grown, with the constraint of a minimum of 40 training points per leaf node. For each leaf node, a KDE is then trained as before. At test time the regression forest yields a mixture of KDEs as the global proposal distribution. We denote this method INF-RFMH in the experiments.

Instead of using one KDE model per cluster, we could also explore a regression approach, for example using a discriminative linear regression model to map observations into proposal distributions. By using informative covariates in the regression model, one should be able to overcome the curse of dimensionality.
Such a semi-parametric approach would allow us to capture explicit parametric dependencies of the variables (for example, linear dependencies) and combine them with non-parametric estimates of the residuals. We are exploring this technique as future work.

Again, we chose the parameters for all samplers individually, based on empirical mixing rates. For the informed samplers, we chose α = 0.8 and a local proposal standard deviation of 0.05. The full analysis for all samplers is included in the supplementary material.

7.1. Results

We tested the different approaches on 10 test images generated from parameters drawn from the standard normal prior distribution. Figure 9 summarizes the results of the sampling methods. We make the following observations. The baseline methods MH, MHWG, and PT show inferior convergence, and MH and PT also suffer from low acceptance rates. Sampling only from the distribution of the discriminative step (INF-INDMH) is not enough, because the low acceptance rate indicates that the global proposals do not represent the correct posterior distribution. However, combined with a local proposal in a mixture kernel, we achieve a higher acceptance rate, faster convergence, and a decrease in RMSE. The regression forest approach converges more slowly than INF-MH. In this example, the regeneration sampler REG-MH does not improve over the simpler baseline methods. We attribute this to rare regenerations, which may be improved with more specialized methods.

We believe that our simple choice of depth image representation can also be improved significantly. For example, features could be computed from identified body parts, something that the simple histogram features do not take into account. In the computer vision literature some discriminative approaches for pose estimation do exist, most prominently the influential work on pose recovery in parts for the Kinect Xbox system (Shotton et al., 2011).
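The regression-forest proposal (INF-RFMH) compared above can be sketched on a toy 1-D problem; the tree depth, leaf size, bandwidth, and the noisy feature model are all illustrative stand-ins for the paper's 10-tree, depth-15 forest over 302-dimensional features:

```python
import random

# Toy sketch of the regression-forest proposal: each tree greedily
# splits on the feature with a mean-squared-error criterion, each leaf
# stores its training parameters and acts as a KDE. At test time the
# forest routes the observed feature to one leaf per tree, giving a
# mixture of KDEs as the global proposal.

BW = 0.15          # leaf KDE bandwidth (hypothetical)
MIN_LEAF = 20      # minimum training points per leaf

def grow(pairs, depth):
    feats = sorted(set(f for f, _ in pairs))
    if depth == 0 or len(pairs) < 2 * MIN_LEAF or len(feats) < 2:
        return [p for _, p in pairs]            # leaf: store parameters
    def sse(ps):
        m = sum(ps) / len(ps)
        return sum((p - m) ** 2 for p in ps)
    best = None
    for t in feats[1:]:
        left = [p for f, p in pairs if f < t]
        right = [p for f, p in pairs if f >= t]
        if len(left) < MIN_LEAF or len(right) < MIN_LEAF:
            continue
        score = sse(left) + sse(right)          # MSE split criterion
        if best is None or score < best[0]:
            best = (score, t)
    if best is None:
        return [p for _, p in pairs]
    t = best[1]
    return (t,
            grow([(f, p) for f, p in pairs if f < t], depth - 1),
            grow([(f, p) for f, p in pairs if f >= t], depth - 1))

def leaf(tree, f):
    while isinstance(tree, tuple):
        tree = tree[1] if f < tree[0] else tree[2]
    return tree

random.seed(2)
# The feature is an informative, noisy copy of the parameter.
data = [(th + random.gauss(0, 0.05), th)
        for th in (random.uniform(-1, 1) for _ in range(300))]
forest = [grow(random.choices(data, k=len(data)), depth=3)
          for _ in range(3)]

def propose(f_obs):
    """Draw from the mixture of the routed leaves' KDEs."""
    params = leaf(random.choice(forest), f_obs)
    return random.choice(params) + random.gauss(0.0, BW)

x = propose(0.7)
```

Because an uninformative feature never yields a variance-reducing split, the trees simply ignore it, which is the adaptivity argument made above for preferring the forest over plain k-means cells.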
In future work we plan to use similar methods to deal with pose variation and complicated dependencies between parameters and observations.

7.2. 3D Mesh Reconstruction

In Figure 8 we show a sample 3D body mesh reconstruction result using the INF-MH sampler after only 1000 iterations. We visualize the difference between the posterior mean and the ground truth 3D mesh in terms of mesh edge directions. One can observe that most differences are in the belly region and the feet of the person. The retrieved posterior distribution allows us to assess the model uncertainty. To visualize the posterior variance we record the standard deviation of the edge directions over all mesh edges; this is backprojected to produce the visualization in Figure 8 (right). We see that the posterior variance is higher in regions of higher error; that is, our model predicts its own uncertainty correctly (Dawid, 1982). In a real-world body scanning scenario this information is beneficial: for example, when scanning from multiple viewpoints, or in an experimental design scenario, it helps in selecting the next best pose and viewpoint to record. Figure 13 shows more 3D mesh reconstruction results using our sampling approach.

Figure 8. Inference of body shape from a depth image. A sample test result showing the 3D mesh reconstruction from the first 1000 samples obtained using our INF-MH sampling method. We visualize the angular error (in degrees) between the estimated and ground truth edges, projected onto the mesh.

Figure 9. Results of the 'Body Shape' experiment: acceptance rates (left), PSRFs (middle), and RMSEs (right) for the different sampling methods. Median results over 10 test examples.

7.3. Body Measurements

Predicting body measurements has many applications, including clothing, sizing, and ergonomic design.
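One way to quantify measurement uncertainty is to push posterior shape samples through a linear shape-to-measurement map; the sketch below assumes a hypothetical 2-parameter, 2-measurement model with made-up weights W and offsets b standing in for a regression fit on a measured corpus:

```python
import random

# Mapping a posterior over shape parameters to a posterior over body
# measurements, assuming a known linear model. W and b are invented
# for illustration; in practice they would come from a regression fit
# on measured subjects.

W = [[0.8, 0.1], [0.2, 0.9]]     # 2 measurements x 2 shape params
b = [170.0, 95.0]                # e.g. height (cm), chest (cm) offsets

def measure(theta):
    return [sum(wi * t for wi, t in zip(row, theta)) + bi
            for row, bi in zip(W, b)]

# Pretend posterior samples over 2 shape parameters (from a sampler).
random.seed(4)
post = [[random.gauss(0.5, 0.1), random.gauss(-0.3, 0.2)]
        for _ in range(4000)]

# Push each sample through the linear map: the result is a full
# posterior over measurements, not just a point estimate.
meas = [measure(t) for t in post]
height_mean = sum(m[0] for m in meas) / len(meas)
chest_mean = sum(m[1] for m in meas) / len(meas)
```

Because the map is applied per sample, any summary of the measurement posterior (means, quantiles, box plots as in Figure 10) falls out of the same transformed sample set.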
Given pixel observations, one may wish to infer a distribution over measurements (such as height and chest circumference). Fortunately, our original shape training corpus includes a host of 47 different per-subject measurements obtained by professional anthropometrists; this allows us to relate shape parameters to measurements. Among many possible forms of regression, regularized linear regression (Zou and Hastie, 2005) was found to best predict measurements from shape parameters. This linear relationship allows us to transform any posterior distribution over SCAPE parameters into a posterior over measurements, as shown in Figure 10. We report results for three randomly chosen subjects (S1, S2, and S3) on three of the 47 measurements. The dashed lines correspond to ground truth values. Our estimate not only faithfully recovers the true value but also yields a characterization of the full conditional posterior.

7.4. Incomplete Evidence

Another advantage of using a generative model is the ability to reason with missing observations. We perform a simple experiment by occluding a portion of the observed depth image. We use the same inference and learning code, with the same parametrization and features as in the non-occlusion case, but retrain the model to account for the changes in the forward process. The result of INF-MH, computed on the first 10k samples, is shown in Figure 11. The 3D reconstruction is reasonable even under large occlusion; the error and the edge direction variance increase, as expected.

Figure 10. Body measurements with quantified uncertainty. Box plots of three body measurements for three test subjects, computed from the first 10k samples obtained by the INF-MH sampler. Dotted lines indicate measurements corresponding to the ground truth SCAPE parameters.
8. Discussion and Conclusions

This work proposes a method to incorporate discriminative methods into Bayesian inference in a principled way. We augment a sampling technique with discriminative information to enable inference with globally accurate generative models, and we discuss empirical results on three challenging and diverse computer vision experiments. We carefully analyse the convergence behaviour of several different baselines and find that the informed sampler performs well across all scenarios. The sampler is applicable to general problems; in this work we leverage the accurate forward process for offline training, a setting frequently found in computer vision applications. The main focus is the generality of the approach: this inference technique should be applicable to many different problems rather than tailored to a particular one.

We show that even for very simple scenarios, most baseline samplers perform poorly or fail completely. By including a global image-conditioned proposal distribution that is informed through discriminative inference, we can improve sampling performance. We deliberately use simple learning techniques (KDEs on k-means cluster cells and a forest of regression trees) to enable easy reuse in other applications; using stronger and more tailored discriminative models should lead to better performance. We see this as a way to combine top-down inference with bottom-up proposals in a probabilistic setting. There are several avenues for future work.

Figure 11. Inference with incomplete evidence. Mean 3D mesh and the corresponding errors and uncertainties (standard deviations) in mesh edge directions, for the same test case as in Figure 8, computed from the first 10k samples of our INF-MH sampling method with (bottom row) an occlusion mask in the image evidence.
(Blue indicates small values and red indicates high values.)

We understand this method as an initial step toward general inference techniques for accurate generative computer vision models. Identifying conditional dependence structure should improve results; for example, Stuhlmüller et al. (2013) recently used structure in Bayesian networks to identify such dependencies. One assumption in our work is that the generative model is accurate. Relaxing this assumption to allow for more general scenarios, where the generative model is known only approximately, is important future work. In particular, for high-level computer vision problems such as scene or object understanding there are no accurate generative models available yet, but there is a clear trend towards physically more accurate 3D representations of the world. This more general setting differs from the one we consider in this paper, but we believe that some of our ideas can be carried over. For example, we could create the informed proposal distributions from manually annotated data that is readily available in many computer vision data sets. Another problem domain is trans-dimensional models, which require different sampling techniques such as reversible jump MCMC methods (Green, 1995; Brooks et al., 2011). We are investigating general techniques to "inform" such samplers in ways similar to those described in this manuscript.

We believe that generative models are useful in many computer vision scenarios and that the interplay between computer graphics and computer vision is a prime candidate for studying probabilistic inference and probabilistic programming (Mansinghka et al., 2013). However, current inference techniques need to be improved on many fronts: efficiency, ease of use, and generality.
Our method is a step in this direction: the informed sampler leverages the power of existing discriminative and heuristic techniques to enable a principled Bayesian treatment in rich generative models. Our emphasis is on generality; we aimed to create a method that can be easily reused in other scenarios with existing code bases. The presented results are a successful example of the inversion of an involved rendering pass. In the future we plan to investigate ways to combine existing computer vision techniques with principled generative models, with the aim of being general rather than problem specific.

Appendix

A. Regeneration Sampler (REG-MH)

Adapting the proposal distribution using existing MCMC samples is not straightforward, as this would potentially violate the Markov property of the chain (Atchadé and Rosenthal, 2005). One approach is to identify times of regeneration at which the chain can be restarted and the proposal distribution adapted using the samples drawn previously. Several approaches to identify good regeneration times in a general Markov chain have been proposed (Athreya and Ney, 1978; Nummelin, 1978). We build on Mykland et al. (1995), who proposed two splitting methods for finding regeneration times. Here we briefly describe the method that we implemented in this study.

Let the present state of the sampler be x and let the independent global proposal distribution be T_G. When y ∼ T_G is accepted according to the MH acceptance rule, the probability of a regeneration is given by

    r(x, y) = max{c/w(x), c/w(y)}   if w(x) > c and w(y) > c,
              max{w(x)/c, w(y)/c}   if w(x) < c and w(y) < c,
              1                     otherwise,                  (2)

where c > 0 is an arbitrary constant and w(x) = π(x)/T_G(x). The value of c can be set to maximize the regeneration probability.
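Equation (2) translates directly into code; a small sketch, with the weight function w and the constant c as defined above:

```python
# Regeneration probability of equation (2), following Mykland et al.
# (1995): w(x) = pi(x) / T_G(x), and c > 0 is an arbitrary split
# constant that can be tuned to maximize the regeneration probability.

def regen_prob(w_x, w_y, c=1.0):
    if w_x > c and w_y > c:
        return max(c / w_x, c / w_y)
    if w_x < c and w_y < c:
        return max(w_x / c, w_y / c)
    return 1.0
```

Note that whenever the weights straddle c (one above, one below), a regeneration is certain; this is what makes the choice of c a trade-off between frequent but uninformative splits and rare ones.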
At every sampling step, if a sample from the independent proposal distribution is accepted, we compute the regeneration probability using equation (2). If a regeneration occurs, the present sample is discarded and replaced with one from the independent proposal distribution T_G. We use the same mixture proposal distribution as in our informed sampling approach: we initialize the global proposal T_G with the prior distribution and, at times of regeneration, fit a KDE to the existing samples; this becomes the new adapted distribution T_G. Refer to Mykland et al. (1995) for more details of this regeneration technique. In the work of Ahn et al. (2013) this regeneration technique is used with success in a Darting Monte Carlo sampler.

B. Additional Qualitative Results

B.1. Occluding Tiles

In Figure 12 more qualitative results of the occluding tiles experiment are shown. The informed sampling approach (INF-BMHWG) is better than the best baseline (MHWG). This remains a very challenging problem, since the posterior for the parameters of occluded tiles is flat over a large region. Some of the posterior variance of the occluded tiles is already captured by the informed sampler.

B.2. Body Shape

Figure 13 shows more results of 3D mesh reconstruction using posterior samples obtained by our informed sampler INF-MH.

References

S. Ahn, Y. Chen, and M. Welling. Distributed and adaptive darting Monte Carlo through regenerations. In Proceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS), 2013.

D. Anguelov, P. Srinivasan, D. Koller, S. Thrun, J. Rodgers, and J. Davis. SCAPE: shape completion and animation of people. In ACM Transactions on Graphics (TOG), volume 24, pages 408–416. ACM, 2005.

Y. F. Atchadé and J. S. Rosenthal. On adaptive Markov chain Monte Carlo algorithms. Bernoulli, 11(5):815–828, 2005.

K. B. Athreya and P. Ney. A new approach to the limit theory of recurrent Markov chains.
Transactions of the American Mathematical Society, 245:493–501, 1978.

B. G. Baumgart. Geometric Modeling for Computer Vision. PhD thesis, Stanford University, 1974.

M. J. Black, D. J. Fleet, and Y. Yacoob. Robustly estimating changes in image appearance. Computer Vision and Image Understanding, 78(1):8–31, 2000.

G. Bradski and A. Kaehler. Learning OpenCV: Computer Vision with the OpenCV Library. O'Reilly, 2008.

L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, Belmont, 1984.

S. Brooks and A. Gelman. General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics, 7:434–455, 1998.

S. Brooks, A. Gelman, G. Jones, and X.-L. Meng. Handbook of Markov Chain Monte Carlo. CRC Press, 2011.

N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, volume 1, pages 886–893, 2005.

P. A. Dawid. The well-calibrated Bayesian. Journal of the American Statistical Association, 77(379):605–610, 1982.

M. de La Gorce, N. Paragios, and D. J. Fleet. Model-based hand tracking with texture, shading and self-occlusions. In Computer Vision and Pattern Recognition, 2008.

L. Del Pero, J. Bowdish, D. Fried, B. Kermgard, E. Hartley, and K. Barnard. Bayesian geometric modeling of indoor scenes. In Computer Vision and Pattern Recognition, 2012.

S. M. A. Eslami, N. Heess, and J. M. Winn. The shape Boltzmann machine: A strong model of object shape. In Computer Vision and Pattern Recognition, pages 406–413. IEEE, 2012.

J. M. Flegal, M. Haran, and G. L. Jones. Markov chain Monte Carlo: Can we trust the third significant figure? Statistical Science, 23(2):250–260, 2008.

A. Gelman and D. Rubin.
Inference from iterative simulation using multiple sequences. Statistical Science, 7:457–511, 1992.

C. J. Geyer. Markov chain Monte Carlo maximum likelihood. In Proceedings of the 23rd Symposium on the Interface, pages 156–163, 1991.

W. R. Gilks, G. O. Roberts, and S. K. Sahu. Adaptive Markov chain Monte Carlo through regeneration. Journal of the American Statistical Association, 93(443):1045–1054, 1998.

P. J. Green. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82(4):711–732, 1995.

U. Grenander. Pattern Synthesis: Lectures in Pattern Theory. Springer, New York, 1976.

D. A. Hirshberg, M. Loper, E. Rachlin, and M. J. Black. Coregistration: simultaneous alignment and modeling of articulated 3D shape. In European Conference on Computer Vision, pages 242–255, 2012.

B. K. P. Horn. Understanding image intensities. Artificial Intelligence, 8:201–231, 1977.

K. Hukushima and K. Nemoto. Exchange Monte Carlo method and application to spin glass simulations. Journal of the Physical Society of Japan, 65(6):1604–1608, 1996.

H. Jungong, S. Ling, X. Dong, and J. Shotton. Enhanced computer vision with Microsoft Kinect sensor: A review. IEEE Transactions on Cybernetics, 43(5):1318–1334, 2013.

R. E. Kass, B. P. Carlin, A. Gelman, and R. M. Neal. Markov chain Monte Carlo in practice: A roundtable discussion. The American Statistician, 52:93–100, 1998.

A. B. Lee, D. Mumford, and J. Huang. Occlusion models for natural images: A statistical study of a scale-invariant dead leaves model. International Journal of Computer Vision, 41(1-2):35–59, 2001.

M. W. Lee and I. Cohen. Proposal maps driven MCMC for estimating human body pose in static images. In Computer Vision and Pattern Recognition, 2004.

J. S. Liu. Monte Carlo Strategies in Scientific Computing. Springer Series in Statistics. New York, 2001.

V. Mansinghka, T. D. Kulkarni, Y. N. Perov, and J. Tenenbaum.
Approximate Bayesian image interpretation using generative probabilistic graphics programs. In Advances in Neural Information Processing Systems, pages 1520–1528, 2013.

N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21:1087, 1953.

D. Mumford and A. Desolneux. Pattern Theory: The Stochastic Analysis of Real-World Signals. 2010.

P. Mykland, L. Tierney, and B. Yu. Regeneration in Markov chain samplers. Journal of the American Statistical Association, 90(429):233–241, 1995.

E. Nummelin. A splitting technique for Harris recurrent Markov chains. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 43(4):309–318, 1978.

N. Oliver, B. Rosario, and A. Pentland. A Bayesian computer vision system for modeling human interactions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):831–843, 2000.

R. Ramamoorthi and P. Hanrahan. A signal-processing framework for inverse rendering. In Computer Graphics and Interactive Techniques, pages 117–128. ACM, 2001.

L. Shao, J. Han, D. Xu, and J. Shotton. Computer vision for RGB-D sensors: Kinect and its applications. IEEE Transactions on Cybernetics, 43(5):1314–1317, 2013.

J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from single depth images. In Computer Vision and Pattern Recognition, 2011.

C. Sminchisescu and M. Welling. Generalized darting Monte Carlo. Pattern Recognition, 44(10):2738–2748, 2011.

A. Stuhlmüller, J. Taylor, and N. Goodman. Learning stochastic inverses. In Advances in Neural Information Processing Systems, pages 3048–3056, 2013.

R. H. Swendsen and J.-S. Wang. Replica Monte Carlo simulation of spin-glasses. Physical Review Letters, 57(21):2607, 1986.

Z.
Tu and S.-C. Zhu. Image segmentation by data-driven Markov chain Monte Carlo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):657–673, 2002.

A. Yuille and D. Kersten. Vision as Bayesian inference: analysis by synthesis? Trends in Cognitive Sciences, 10(7):301–308, 2006.

S. C. Zhu and D. Mumford. Learning generic prior models for visual computation. In Computer Vision and Pattern Recognition, pages 463–469. IEEE, 1997.

S.-C. Zhu, R. Zhang, and Z. Tu. Integrating bottom-up/top-down for object recognition by data driven Markov chain Monte Carlo. In Computer Vision and Pattern Recognition, volume 1, pages 738–745, 2000.

H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.

Figure 12. Qualitative results of the occluding tiles experiment. From left to right: (a) given image, (b) ground truth tiles, and most probable estimates from 5000 samples obtained by (c) the MHWG sampler (best baseline) and (d) our INF-BMHWG sampler. (f) Posterior expectation of the tile boundaries obtained by INF-BMHWG sampling. (The first 2000 samples are discarded as burn-in.)

Figure 13. Qualitative results for the body shape experiment. Shown are the 3D mesh reconstruction results with the first 1000 samples obtained using the INF-MH informed sampling method. (Blue indicates small values and red indicates high values.)

Supplementary Material for The Informed Sampler: A Discriminative Approach to Bayesian Inference in Generative Computer Vision Models
C. Baseline Results and Analysis

On the next pages of this supplementary material we give an in-depth performance analysis of the various samplers and the effect of their hyperparameters. For each sampler individually, we choose the hyperparameters with the lowest PSRF value after 10k iterations. If the differences in PSRF between multiple values are not significant, we choose the value with the highest acceptance rate.

D. Experiment: Estimating Camera Extrinsics

D.1. Parameter Selection

Metropolis Hastings (MH): Figure 14(a) shows the median acceptance rates and PSRF values corresponding to various proposal standard deviations of plain MH sampling. Mixing improves and the acceptance rate drops as the standard deviation increases. A standard deviation of 0.3 is selected for this sampler.

Metropolis Hastings Within Gibbs (MHWG): As mentioned in the main paper, the MHWG sampler with one-dimensional updates did not converge for any value of the proposal standard deviation. This problem has highly correlated camera parameters and is multi-modal in nature, both of which this sampler struggles with.

Parallel Tempering (PT): For PT sampling, we took the best-performing MH sampler and used different temperature chains to improve its mixing. Figure 14(b) shows the results corresponding to different combinations of temperature levels.
The sampler with temperature levels [1, 3, 27] performed best in terms of both mixing and acceptance rate.

Effect of Mixture Coefficient in Informed Sampling (INF-MH): Figure 14(c) shows the effect of the mixture coefficient (α) on the informed sampling INF-MH. Since there is no significant difference in PSRF values for 0 ≤ α ≤ 0.7, we chose 0.7 due to its high acceptance rate.

E. Experiment: Occluding Tiles

E.1. Parameter Selection

Metropolis Hastings (MH): Figure 15(a) shows the results of MH sampling. The results show poor convergence for all proposal standard deviations and a rapid decrease of the acceptance rate with increasing standard deviation. This is due to the high-dimensional nature of the problem. We selected a standard deviation of 1.1.

Blocked Metropolis Hastings Within Gibbs (BMHWG): The results of BMHWG are shown in Figure 15(b). In this sampler we update only one block of tile variables (of dimension four) in each sampling step. The results show much better performance than plain MH. The optimal proposal standard deviation for this sampler is 0.7.

Metropolis Hastings Within Gibbs (MHWG): Figure 15(c) shows the results of MHWG sampling. This sampler is better than BMHWG and converges much more quickly. Here a standard deviation of 0.9 is found to be best.

Figure 14. Results of the 'Estimating Camera Extrinsics' experiment. PSRFs and acceptance rates corresponding to (a) various standard deviations of MH, (b) various temperature level combinations of PT sampling, and (c) various mixture coefficients of INF-MH sampling.

Parallel Tempering (PT): Figure 15(d) shows the results of PT sampling with various temperature combinations. The results show no improvement in acceptance rate over plain MH sampling, and again the temperature levels [1, 3, 27] are found to be optimal.
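The PT scheme with the temperature ladder [1, 3, 27] selected above can be sketched on a toy 1-D bimodal target; the target, proposal widths, and iteration counts below are illustrative stand-ins:

```python
import math
import random

# Parallel tempering sketch: chain i targets pi(x)^(1/T_i); after the
# within-chain MH moves, adjacent chains propose to swap their states.
# The bimodal 1-D target stands in for the multi-modal camera posterior.

TEMPS = [1.0, 3.0, 27.0]

def log_post(x):
    # Two well-separated modes at -2 and +2.
    p = (math.exp(-0.5 * ((x + 2) / 0.2) ** 2)
         + math.exp(-0.5 * ((x - 2) / 0.2) ** 2))
    return math.log(max(p, 1e-300))  # guard against underflow

def mh_step(x, temp, std=0.3):
    y = x + random.gauss(0.0, std)
    if math.log(random.random()) < (log_post(y) - log_post(x)) / temp:
        return y
    return x

def swap(states, i):
    # Swap acceptance between chains i and i+1 (beta = 1/T).
    bi, bj = 1.0 / TEMPS[i], 1.0 / TEMPS[i + 1]
    log_a = (bi - bj) * (log_post(states[i + 1]) - log_post(states[i]))
    if math.log(random.random()) < log_a:
        states[i], states[i + 1] = states[i + 1], states[i]

random.seed(6)
states = [-2.0, -2.0, -2.0]
cold = []
for it in range(30000):
    states = [mh_step(x, t) for x, t in zip(states, TEMPS)]
    swap(states, it % (len(TEMPS) - 1))
    cold.append(states[0])
```

The cold chain alone cannot cross the energy barrier between the modes; the hottest chain crosses it freely, and accepted swaps hand mode-hopping states down the ladder.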
Effect of Mixture Coefficient in Informed Sampling (INF-BMHWG) Figure 15(e) shows the effect of the mixture coefficient (α) on blocked informed INF-BMHWG sampling. Since there is no significant difference in PSRF values for 0 ≤ α ≤ 0.8, we chose 0.8 due to its high acceptance rate.

F. Experiment: Estimating Body Shape

F.1. Parameter Selection

Metropolis Hastings (MH) Figure 16(a) shows the results of MH sampling with various proposal standard deviations. The value 0.1 is found to be best.

Figure 15. Results of the 'Occluding Tiles' experiment. PSRFs and acceptance rates corresponding to various standard deviations of (a) MH, (b) BMHWG, and (c) MHWG; (d) various temperature level combinations of PT sampling; and (e) various mixture coefficients of our informed INF-BMHWG sampling.

Metropolis Hastings Within Gibbs (MHWG) For MHWG sampling we select a proposal standard deviation of 0.3. Results are shown in Figure 16(b).

Parallel Tempering (PT) As before (results in Figure 16(c)), the temperature levels [1, 3, 27] were selected due to their slightly higher acceptance rate.

Effect of Mixture Coefficient in Informed Sampling (INF-MH) Figure 16(d) shows the effect of α on PSRF and acceptance rate. Since there are no significant differences in PSRF values for 0 ≤ α ≤ 0.8, we choose 0.8.

Figure 16. Results of the 'Body Shape Estimation' experiment. PSRFs and acceptance rates corresponding to various standard deviations of (a) MH and (b) MHWG; (c) various temperature level combinations of PT sampling; and (d) various mixture coefficients of the informed INF-MH sampling.
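The role of the mixture coefficient α tuned throughout these experiments can be illustrated with a minimal sketch of the informed mixture kernel. The function below is hypothetical: `informed_sample` stands in for the discriminatively trained, observation-conditioned proposal, and a full sampler would also need its density to compute the MH acceptance ratio:

```python
import numpy as np

def inf_mh_proposal(x, alpha, std, informed_sample, rng):
    """Draw from an INF-MH-style mixture kernel.

    With probability `alpha`, take a global sample from the informed
    (learned) proposal; otherwise make a local Gaussian random-walk move.
    """
    if rng.random() < alpha:
        return informed_sample()                   # global, informed move
    return x + std * rng.standard_normal(x.shape)  # local exploration
```

Setting α = 0 recovers the plain local sampler, while larger α relies more on the informed proposal; the experiments above find PSRF largely insensitive over a wide range of α, so the value with the highest acceptance rate is chosen.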
G. Results Overview

Figure 17. Summary of the statistics for the three experiments: (a) Estimating Camera Extrinsics, (b) Occluding Tiles, and (c) Estimating Body Shape. Shown for several baseline methods and the informed samplers are the acceptance rates (left), PSRFs (middle), and RMSE values (right). All results are medians over multiple test examples.
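The PSRF values reported throughout are the Gelman-Rubin potential scale reduction factor, computed from multiple independent chains. A minimal sketch of the standard statistic (for scalar chains; the function name is illustrative):

```python
import numpy as np

def psrf(chains):
    """Potential scale reduction factor (Gelman-Rubin R-hat).

    `chains` has shape (n_chains, n_samples); values near 1 indicate that
    the chains have mixed and the sampler has converged.
    """
    chains = np.asarray(chains, dtype=float)
    n = chains.shape[1]
    B = n * chains.mean(axis=1).var(ddof=1)  # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # mean within-chain variance
    var_hat = (n - 1) / n * W + B / n        # pooled variance estimate
    return np.sqrt(var_hat / W)
```

Chains stuck in different modes inflate the between-chain variance B relative to W and push the PSRF above 1, which is why it is a natural diagnostic for the multi-modal posteriors in these experiments.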