A Compilation Target for Probabilistic Programming Languages
Authors: Brooks Paige, Frank Wood
A Compilation Target for Probabilistic Programming Languages

Brooks Paige (brooks@robots.ox.ac.uk), Frank Wood (fwood@robots.ox.ac.uk)
University of Oxford, Department of Engineering Science, Oxford, UK

Abstract

Forward inference techniques such as sequential Monte Carlo and particle Markov chain Monte Carlo for probabilistic programming can be implemented in any programming language by creative use of standardized operating system functionality including processes, forking, mutexes, and shared memory. Exploiting this we have defined, developed, and tested a probabilistic programming intermediate representation language we call probabilistic C, which itself can be compiled to machine code by standard compilers and linked to operating system libraries, yielding an efficient, scalable, portable probabilistic programming compilation target. This opens up a new hardware and systems research path for optimizing probabilistic programming systems.

1. Introduction

Compilation is source-to-source transformation. We use the phrase intermediate representation to refer to a target language for a compiler. This paper introduces a C-language library that makes possible a C-language intermediate representation for probabilistic programming languages that can itself be compiled to executable machine code. We call this intermediate language probabilistic C. Probabilistic C can be compiled normally and uses only macros and POSIX operating system libraries (Open Group, 2004a) to implement general-purpose, scalable, parallel probabilistic programming inference. Note that in this paper we do not show how to compile any existing probabilistic programming languages (i.e. IBAL (Pfeffer, 2001), BLOG (Milch et al., 2007), Church (Goodman et al., 2008), Figaro (Pfeffer, 2009), Venture (Mansinghka et al., 2014), or Anglican (Wood et al.
, 2014)) to this intermediate representation; instead we leave this to future work, noting there is a wealth of readily available resources on language-to-language compilation that could be leveraged to do this. We instead characterize the performance of the intermediate representation itself by writing programs directly in probabilistic C and then testing them on computer architectures and programs that illustrate the capacities and trade-offs of both the forward inference strategies probabilistic C employs and the operating system functionality on which it depends.

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

Probabilistic C programs compile to machine-executable meta-programs that perform inference over the original program via forward methods such as sequential Monte Carlo (Doucet et al., 2001) and particle MCMC variants (Andrieu et al., 2010). Such inference methods can be implemented in a sufficiently general way so as to support inference over the space of probabilistic program execution traces using just POSIX operating system primitives.

1.1. Related work

The characterization of probabilistic programming inference we consider here is the process of sampling from the a posteriori distribution of execution traces arising from stochastic programs constrained to reflect observed data. This is the view taken by the Church (Goodman et al., 2008), Venture (Mansinghka et al., 2014), and Anglican (Wood et al., 2014) programming languages, among others. In such languages, models for observed data can be described purely in terms of a forward generative process. Markov chain Monte Carlo (MCMC) is used by these systems to sample from the posterior distribution of program execution traces. Single-site Metropolis-Hastings (MH) (Goodman et al., 2008) and particle MCMC (PMCMC) (Wood et al.
, 2014) are two such approaches. In the latter it was noted that a fork-like operation is a fundamental requirement of forward inference methods for probabilistic programming, where fork is the standard POSIX operating system primitive (Open Group, 2004b). Kiselyov & Shan (2009) also noted that delimited continuations, a user-level generalization of fork, could be used for inference, albeit in a restricted family of models.

#include "probabilistic.h"

int main(int argc, char **argv) {
  double var = 2;
  double mu = normal_rng(1, 5);
  observe(normal_lnp(9, mu, var));
  observe(normal_lnp(8, mu, var));
  predict("mu,%f\n", mu);
  return 0;
}

Figure 1. This program performs posterior inference over the unknown mean mu of a Gaussian, conditioned on two data points. The predict directive formats output using standard printf semantics.

2. Probabilistic Programming

Any program that makes a random choice over the course of its execution implicitly defines a prior distribution over its random variables; running the program can be interpreted as drawing a sample from the prior. Inference in probabilistic programs involves conditioning on observed data, and characterizing the posterior distribution of the random variables given data. We introduce probabilistic programming capabilities into C by providing a library with two primary functions: observe, which conditions the program execution trace given the log-likelihood of a data point, and predict, which marks expressions for which we want posterior samples. Any random number generator and sampling library can be used for making random choices in the program, any numeric log-likelihood value can be passed to an observe, and any C expression which can be printed can be reported using predict.
The library includes a single macro which renames main and wraps it in another function that runs the original in an inner loop of the forward inference algorithms to be described. Although C is a comparatively low-level language, it can nonetheless represent many well-known generative models concisely and transparently. Figure 1 shows a simple probabilistic C program for estimating the posterior distribution for the mean of a Gaussian, conditioned on two observed data points y_1, y_2, corresponding to the model

  mu ~ N(1, 5),    y_1, y_2 ~ iid N(mu, 2).   (1)

We observe the data y_1, y_2 and predict the posterior distribution of mu. The functions normal_rng and normal_lnp in Figure 1 return (respectively) a normally-distributed random variate and the log probability density of a particular value, with mean and variance parameters mu and var. The observe statement requires only the log-probability of the data points 8 and 9 conditioned on the current program state; no other information about the likelihood function or the generative process is needed. In this program we predict the posterior distribution of a single value mu.

#include "probabilistic.h"

#define K 3
#define N 11

/* Markov transition matrix */
static double T[K][K] = { { 0.1, 0.5, 0.4 },
                          { 0.2, 0.2, 0.6 },
                          { 0.15, 0.15, 0.7 } };

/* Observed data */
static double data[N] = { NAN, .9, .8, .7, 0, -.025,
                          -5, -2, -.1, 0, 0.13 };

/* Prior distribution on initial state */
static double initial_state[K] = { 1.0/3, 1.0/3, 1.0/3 };

/* Per-state mean of Gaussian emission distribution */
static double state_mean[K] = { -1, 1, 0 };

/* Generative program for a HMM */
int main(int argc, char **argv) {
  int states[N];
  for (int n=0; n < N; n++) {
    /* [...] */
    if (n > 0) {
      observe(normal_lnp(data[n], state_mean[states[n]], 1));
    }
    predict("state[%d],%d\n", n, states[n]);
  }
  return 0;
}

Figure 2. A hidden Markov model (HMM) with 3 underlying states and Gaussian emissions, observed at 10 discrete time points.
We observe 10 data points and predict the marginal distribution over the latent state at each time point. A hidden Markov model example is shown in Figure 2, in which N = 10 observed data points y_{1:N} are drawn from an underlying Markov chain with K latent states, each with a Gaussian emission distribution with mean mu_k, and a (known) K x K state transition matrix T, such that

  z_0 ~ Discrete([1/K, ..., 1/K])   (2)
  z_n | z_{n-1} ~ Discrete(T_{z_{n-1}})   (3)
  y_n | z_n ~ N(mu_{z_n}, sigma^2).   (4)

Bayesian nonparametric models can also be represented concisely; in Figure 3 we show a generative program for an infinite mixture of Gaussians. We use a Chinese restaurant process (CRP) to sequentially sample non-negative integer partition assignments z_n for each data point y_1, ..., y_N. For each partition, mean and variance parameters mu_{z_n}, sigma^2_{z_n} are drawn from a normal-gamma prior; the data points y_n themselves are drawn from a normal distribution with these parameters, defining a full generative model

  z_n ~ CRP(alpha, z_1, ..., z_{n-1})   (5)
  1/sigma^2_{z_n} ~ Gamma(1, 1)   (6)
  mu_{z_n} | sigma^2_{z_n} ~ N(0, sigma^2_{z_n})   (7)
  y_n | z_n, mu_{z_n}, sigma^2_{z_n} ~ N(mu_{z_n}, sigma^2_{z_n}).   (8)

This program also demonstrates the additional library function memoize, which can be used to implement stochastic memoization as described in (Goodman et al., 2008).
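The stochastic memoization just mentioned can be sketched as follows: the first call with a given argument draws a fresh random value, and every later call with the same argument returns the cached draw. The fixed-size table and int-keyed interface here are simplifications for illustration, not the actual probabilistic.h API.

```c
/* Sketch of stochastic memoization (after Goodman et al., 2008);
 * a simplified int-keyed cache, not the real memoize() interface. */
#include <stdbool.h>
#include <stdlib.h>

#define MAX_KEYS 64

typedef int (*draw_fn)(int key);

typedef struct {
    draw_fn draw;                /* underlying stochastic procedure */
    bool    known[MAX_KEYS];     /* has this key been sampled yet?  */
    int     value[MAX_KEYS];     /* cached draws                    */
} mem_func;

void memoize(mem_func *m, draw_fn f) {
    m->draw = f;
    for (int i = 0; i < MAX_KEYS; i++) m->known[i] = false;
}

int mem_call(mem_func *m, int key) {
    if (!m->known[key]) {        /* first use: sample and cache */
        m->value[key] = m->draw(key);
        m->known[key] = true;
    }
    return m->value[key];        /* later uses: reuse the draw */
}

/* e.g. a random class assignment, as in the CRP program */
static int random_class(int key) { (void)key; return rand() % 3; }
```

This is what lets the program in Figure 3 ask repeatedly for the class of observation n and always get the same (random) answer within one execution trace.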
#include "probabilistic.h"

#define N 10

// Observed data
static double data[N] = { 1.0, 1.1, 1.2, -1.0, -1.5, -2.0,
                          0.001, 0.01, 0.005, 0.0 };

// Struct holding mean and variance parameters for each cluster
typedef struct theta {
  double mu;
  double var;
} theta;

// Draws a sample of theta from a normal-gamma prior
theta draw_theta() {
  double variance = 1.0 / gamma_rng(1, 1);
  return (theta) { normal_rng(0, variance), variance };
}

// Get the class id for a given observation index
static polya_urn_state urn;
void get_class(int *index, int *class_id) {
  *class_id = polya_urn_draw(&urn);
}

int main(int argc, char **argv) {
  double alpha = 1.0;
  polya_urn_new(&urn, alpha);
  mem_func mem_get_class;
  memoize(&mem_get_class, get_class, sizeof(int), sizeof(int));
  theta params[N];
  bool known_params[N] = { false };
  int class;
  for (int n=0; n [...]

[...] For n > 1 we have the recursive identity

  p(x_{1:n} | y_{1:n}) = p(x_{1:n-1} | y_{1:n-1}) g(y_n | x_{1:n}) f(x_n | x_{1:n-1}),   (12)

and we sample from p(x_{1:N} | y_{1:N}) by iteratively sampling from each p(x_{1:n} | y_{1:n}), in turn, from 1 through N. At each n, we construct an importance sampling distribution by proposing from some distribution q(x_n | x_{1:n-1}, y_{1:n}); in probabilistic programs we find it convenient to propose directly from the executions of the program, i.e. each sequence of random variates x_n is jointly sampled from the program execution state dynamics

  x_n^l ~ f(x_n | x_{1:n-1}^{a_{n-1}^l})   (13)

where a_{n-1}^l is an "ancestor index," the particle index 1, ..., L of the parent (at time n-1) of x_n^l. The unnormalized particle importance weights at each observation y_n are simply the observe data likelihood

  w~_n^l = g(y_n | x_{1:n}^l)   (14)

which can be normalized as

  w_n^l = w~_n^l / sum_{j=1}^L w~_n^j.
(15)

Algorithm 1 Parallel SMC program execution
Assume: N observations, L particles
  launch L copies of the program (parallel)
  for n = 1 ... N do
    wait until all L reach observe y_n (barrier)
    update unnormalized weights w~_n^{1:L} (serial)
    if ESS < tau then
      sample number of offspring O_n^{1:L} (serial)
      set weight w~_n^{1:L} = 1 (serial)
      for l = 1 ... L do
        fork or exit (parallel)
      end for
    else
      set all number of offspring O_n^l = 1 (serial)
    end if
    continue program execution (parallel)
  end for
  wait until L program traces terminate (barrier)
  predict from L samples from p^(x_{1:N}^{1:L} | y_{1:N}) (serial)

After each step n, we now have a weighted set of execution traces which approximate p(x_{1:n} | y_{1:n}). As the program continues, traces which do not correspond well with the data will have weights which become negligibly small, leading in the worst case to all weight concentrated on a single execution trace. To counteract this deficiency, we resample from our current set of L execution traces after each observation y_n, according to their weights w_n^l. This is achieved by sampling a count O_n^l for the number of "offspring" of a given execution trace l to be included at time n+1; any sampling scheme must ensure E[O_n^l] = w_n^l. Sampling offspring counts O_n^l is equivalent to sampling ancestor indices a_n^l. Program execution traces with no offspring are killed; program execution traces with more than one offspring are forked multiple times. After resampling, all weights w_n^l = 1. We only resample if the effective sample size

  ESS ~ 1 / sum_l (w_n^l)^2   (16)

is less than some threshold value tau; we choose tau = L/2. In probabilistic C, each observe statement forms a barrier: parallel execution cannot continue until all particles have arrived at the observe and have reported their current unnormalized weight.
As execution traces arrive at the observe barrier, they take the number of particles which have already reached the current observe as a (temporary) unique identifier. Program execution is then blocked while the effective sample size is computed and the numbers of offspring are sampled. The offspring counts are stored in a shared memory block; once they are computed, each particle uses the identifier assigned when reaching the observe barrier to retrieve (asynchronously) from shared memory the number of children to fork. Particles with no offspring wait for any child processes to complete execution, and terminate; particles with only one offspring do not fork any children but continue execution as normal.

The SMC algorithm is outlined in Algorithm 1, with annotations for which steps are executed in parallel, serially, or form a barrier. After a single SMC sweep is complete, we sample values for each predict, and then (if desired) repeat the process, running a new independent particle filter, to draw an additional batch of samples.

3.3. Particle Metropolis-Hastings

Particle Markov chain Monte Carlo, introduced in Andrieu et al. (2010), uses sequential Monte Carlo to generate high-dimensional proposal distributions for MCMC. The simplest formulation is the particle independent Metropolis-Hastings (PIMH) algorithm. After running a single particle filter sweep, we compute an estimate of the marginal likelihood,

  Z^ = p(y_{1:N}) ~ prod_{n=1}^N [ (1/L) sum_{l=1}^L w~_n^l ].   (17)

We then run another iteration of sequential Monte Carlo which we use as an MH proposal; we estimate the marginal likelihood Z^' of the new proposed particle set, and then with probability min(1, Z^'/Z^) we accept the new particle set and output a new set of predict samples, otherwise outputting the same predict samples as in the previous iteration.
The inner loop of Algorithm 2 is otherwise substantially similar to SMC.

3.4. Particle Gibbs

Particle Gibbs is a particle MCMC technique which also has SMC at its core, with better theoretical statistical convergence properties than PIMH but additional computational overhead. We initialize particle Gibbs by running a single sequential Monte Carlo sweep, and then alternate between (1) sampling a single execution trace x^_{1:N} from the set of L weighted particles, and (2) running a "conditional" SMC sweep, in which we generate L - 1 new particles in addition to the retained x^_{1:N}. The implementation based on operating system primitives is described in Algorithms 3 and 4. The challenge here is that we must "retain" an execution trace, which we can later revisit to resume and branch arbitrarily many times. This is achieved by spawning off a "control" process at every observation point, which from then on manages the future of that particular execution state.

Algorithm 2 Parallel PIMH program execution
Assume: M iterations, N observations, L particles
  for m = 1 ... M do
    launch L copies of the program (parallel)
    for n = 1 ... N do
      wait until all L reach an observe (barrier)
      update unnormalized weights w~_n^{1:L} (serial)
      if ESS < tau then
        update proposal evidence estimate Z^' (serial)
        sample number of offspring O_n^{1:L} (serial)
        set weight w~_n^{1:L} = 1 (serial)
        for l = 1 ... L do
          fork or exit (parallel)
        end for
      else
        set all number of offspring O_n^l = 1 (serial)
      end if
      continue program execution (parallel)
    end for
    wait until L program traces terminate (barrier)
    accept or reject new particle set (serial)
    predict from L samples from p^(x_{1:N}^{1:L} | y_{1:N}) (serial)
    store current particle set x and evidence Z^ (serial)
    continue to next iteration (parallel)
  end for

As before, processes arrive at an observe barrier, and when all particles have reached the observe we compute weights, and sample offspring counts O_n^l.
Particles with O_n^l = 0 terminate, but new child processes are no longer spawned right away. Instead, all remaining particles fork a new process whose execution path immediately diverges from the main codebase and enters the retain-and-branch loop in Algorithm 4. This new process takes responsibility for actually spawning the O_n^l new children. The spawned child processes (and the original process which arrived at the observe barrier) wait (albeit briefly) at a new barrier marking the end of observe n, not continuing execution until all new child processes have been launched.

Program execution continues to the next observe, during which the retain/branch process waits until a full particle set reaches the end of the program. Once final weights w~_N^{1:L} are computed, we sample (according to weight) from the final particle set to select a single particle to retain during the next SMC iteration. When the particle is selected, a signal is broadcast to all retain/branch loops indicating which process ids correspond to the retained particle; all except the retained trace exit.

The retain/branch loop now goes into waiting again (this time for a branch signal), and we begin SMC from the top of the program. As we arrive at each observe n, we only sample L - 1 new offspring to consider: we guarantee that

Algorithm 3 Parallel Particle Gibbs program execution
Assume: M iterations, N observations, L particles
  for m = 1 ... M do
    L' <- L if m = 1, otherwise L - 1
    launch L' copies of the program (parallel)
    for n = 1 ... N do
      wait until all L' reach an observe (barrier)
      compute weights for all particles (serial)
      if m > 1 then
        signal num offspring to retained trace (serial)
      end if
      for l = 1 ... L' do
        spawn retain/branch process [Algo.
4] (parallel)
      end for
      wait until L particles finish branching (barrier)
      continue program execution (parallel)
    end for
    wait until L program traces terminate (barrier)
    predict from L samples from p^(x_{1:N}^{1:L} | y_{1:N}) (serial)
    select and signal particle to retain (serial)
    wait until N processes are ready to branch (barrier)
    continue to next iteration (parallel)
  end for

at least one offspring is spawned from the retained particle at n (namely, the retained execution state at n+1). However, depending on the weights, sampling offspring will often cause us to want more than a single child from the retained particle. So, we signal to the retained particle execution state at time n the number of children to spawn; the retain/branch loop returns to its entry point and resumes waiting, to see if the previously retained execution state will be retained yet again. Note that in particle Gibbs, we must resample (select offspring and reset weights w_n^l = 1) after every observation in order to be able to properly align the retained particle on the next iteration through the program.

4. Experiments

We now turn to benchmarking probabilistic C against existing probabilistic programming engines, and evaluate the relative strengths of the inference algorithms in Section 3. We find that compilation improves performance by approximately 100 times over interpreted versions of the same inference algorithm. We also find evidence that suggests that optimizing operating systems to support probabilistic programming usage could yield significant performance improvements as well.
The programs and models we use in our experiments are chosen to be sufficiently simple that we can compute the exact posterior distribution of interest analytically, allowing us to evaluate correctness of inference. Given the true posterior distribution p, we measure sampler performance by the KL-divergence KL(p^ || p), where p^ is our Monte Carlo estimate. The first benchmark program we consider is a hidden Markov model (HMM) very similar to that of Figure 2, where we predict the marginal distributions of each latent state. The HMM used in our experiments here is larger; it has the same model structure, but with K = 10 states and 50 sequential observations, and each state k = 1, ..., 10 has a Gaussian emission distribution with mu_k = k - 1 and sigma^2 = 4. The second benchmark is the CRP mixture of Gaussians program in Figure 3, where we predict the total number of distinct classes.

Algorithm 4 Retain and Branch inner loop
Assume: input initial C > 0 children to spawn
  is_retained <- false
  while true do
    if C = 0 and not is_retained then
      discard this execution trace, exit
    else {C >= 0}
      spawn C new children
    end if
    wait for signal which resets is_retained
    if is_retained then
      wait for signal which resets C
    else
      discard this execution trace, exit
    end if
  end while

4.1. Comparative performance of inference engines

We begin by benchmarking against two existing probabilistic programming engines: Anglican, as described in Wood et al. (2014), which also implements particle Gibbs but is an interpreted language based on Scheme, implemented in Clojure, and running on the JVM; and probabilistic-js (footnote 1), a compiled system implementing the inference approach in Wingate et al. (2011), which runs Metropolis-Hastings over each individual random choice in the program execution trace.
The interpreted particle Gibbs engine is multithreaded, and we run it with 100 particles and 8 simultaneous threads; the Metropolis-Hastings engine only runs on a single core. In Figure 4 we compare inference performance in both of these existing engines to our particle Gibbs backend, running with 100 and 1000 particles, in an 8-core cloud computing environment on Amazon EC2, running on Intel Xeon E5-2680 v2 processors. Our compiled probabilistic C implementation of particle Gibbs runs over 100 times faster than the existing interpreted engine, generating good sample sets on the order of tens of seconds.

1. https://github.com/dritchie/probabilistic-js

Figure 4. Performance plots of KL(p^ || p) against elapsed time for (left) HMM and (right) CRP models, run in an 8-core computing environment on Amazon EC2. In both models we see that compiling particle Gibbs, compared to running within an interpreter, leads to a large (approx. 100x) constant-factor speed increase. The Metropolis-Hastings sampler converges at a similar rate as particle Gibbs with 100 particles in the HMM, but appears to mix more slowly asymptotically for the CRP. In both models, increasing to 1000 particles in particle Gibbs yields somewhat faster convergence at the expense of a longer waiting time until the first sample set arrives.
Figure 5. Comparison of SMC, PIMH, and PG for 100 and 1000 particles in (left) the HMM, and (right) the CRP, plotting KL(p^ || p) against elapsed time. The relative computational and statistical efficiency of the PIMH and PG algorithms varies across models and number of particles. The SMC algorithm is quick to draw initial samples, but has no guarantee of convergence to the target density; we see this in its apparently poor asymptotic behavior in the HMM with 100 particles. Colors are the same across plots.

The probabilistic C inference engine implements particle Gibbs, SMC, and PIMH sampling, which we compare in Figure 5 using both 100 and 1000 particles. SMC is run indefinitely by simply repeatedly drawing independent sets of particles; in contrast to the PMCMC algorithms, this is known to be a biased estimator even as the number of iterations goes to infinity (Whiteley et al., 2013), and is not recommended as a general-purpose approach.

Figures 4 and 5 plot wall-clock time against KL-divergence. We use all generated samples as our empirical posterior distribution in order to produce as fair a comparison as possible. In all engines, results are reasonably stable across runs; the shaded band covers the 25th to 75th percentiles over multiple runs, with the median marked as a dark line. A sampler drawing from the target density will show an approximately linear decrease in KL-divergence on these log-log plots; a steeper slope corresponds to greater statistical efficiency.
The methods based on sequential Monte Carlo do not provide any estimate of the posterior distribution until completing a single initial particle filter sweep; for large numbers of particles this may be a non-trivial amount of time. In contrast, the MH sampler begins drawing samples effectively immediately, although it may take a large number of individual steps before converging to the correct stationary distribution; individual Metropolis-Hastings samples are likely to be more autocorrelated, but producing each one is faster.

4.2. Performance characteristics across multiple cores

As the probabilistic C inference engine offloads much of the computation to underlying operating system calls, we characterize the limitations of the OS implementation by comparing time to completion as we vary the number of cores. Tests for the hidden Markov model across core count (all on EC2, all with identical Intel Xeon E5-2680 v2 processors) are shown in Figure 6.

Figure 6. Effect of system architecture on runtime performance. Here we run the HMM code on EC2 instances with identical processors, varying the number of cores (horizontal axis, 2 to 32) and the number of particles (individual bars, 10 to 10,000), and report runtime to produce 10,000 samples. Despite adding more cores, after 16 cores performance begins to degrade. Similarly, adding more particles eventually degrades performance for any fixed number of cores. In combination these suggest the availability of operating system optimizations that could improve overall performance.

5. Discussion

Probabilistic C is a method for performing inference in probabilistic programs. Methodologically it derives from the forward methods for performing inference in statistical models based on sequential Monte Carlo and particle Markov chain Monte Carlo.
We have shown that it is possible to efficiently and scalably implement this particular kind of inference strategy using existing, standard compilers and POSIX-compliant operating system primitives.

What most distinguishes Probabilistic C from prior art is that it is highly compatible with modern computer architectures, all the way from operating systems to central processing units (in particular their virtual memory operations), and, further, that it delineates a future research program for scaling the performance of probabilistic programming systems to large-scale problems by investigating systems optimizations of existing computer architectures. Note that this is distinct from, but compatible with, approaches to optimizing probabilistic programming systems by compilation optimizations, stochastic hardware, and dependency tracking with efficient updating of local execution trace subgraphs. It may, in the future, be possible to delineate model complexity and hardware architecture regimes in which each approach is optimal; we assert that, for now, it is unclear what those regimes are or will be. This paper is but one step towards such a delineation.

Several interesting research questions remain: (1) Is it more sensible to write custom memory management and use threads than fork and processes as we have done? The main contribution of this paper is to establish a probabilistic programming system implementation against a standardized, portable abstraction layer. It might be possible to eke out greater performance by capitalizing on the fact that modern architectures are optimised for parallel threads more so than parallel processes; however, exploiting this would entail implementing memory management de facto equivalent in action to fork, which may lead to lower portability. (2) Would shifting architectures to small page sizes help? There is a bias towards large page size support in modern computer architectures.
It may be that the system use characteristics of probabilistic programming systems provide a counterargument to that bias, or inspire the creation of tuned end-to-end systems. Forking itself is lightweight until variable assignment, which usually requires manipulation of entire page tables. Large pages require large amounts of amortisation in order to absorb the cost of copying upon stochastic variable assignment; smaller pages could potentially yield higher efficiencies. (3) What characteristics of process synchronisation can be improved specifically for probabilistic programming systems? This is both a systems and a machine learning question. From a machine learning perspective, we believe it may be possible to construct efficient sequential Monte Carlo algorithms that do not synchronize individual threads at observe barriers and instead synchronize in a queue. On a systems level it begs questions about what page replacement strategies to consider; perhaps entirely changing the page replacement schedule to reflect rapid process rather than thread multiplexing across cores.

Probabilistic C does not prevent programmers from accidentally writing programs that are statistically incoherent. Many probabilistic programming languages (including Church, Anglican, and Venture) are nearly purely functional and so disallow program variable value reassignment. This ensures a well-defined joint distribution on program variables. Probabilistic C offers no such guarantees. For this reason we are not, in this paper, making a claim about a new probabilistic programming language per se; rather, we describe probabilistic C as an intermediate representation compilation target that implements a particular style of inference that is natively parallel and possible to optimize by system architecture choice.
It is possible (as helpfully pointed out by Vikash Mansinghka) that compiler optimization techniques, such as checking whether or not the program is natively in "static single assignment" form (Appel, 1998), can help avoid statistically incoherent programs being accepted at compilation. This we leave as future work. Further (as helpfully pointed out by Noah Goodman), programming language constructs such as delimited continuations (Felleisen, 1988) can be thought of as user-level abstractions of fork, and might provide similar functionality at the user rather than system level in the context of statistically safer languages.

References

Andrieu, Christophe, Doucet, Arnaud, and Holenstein, Roman. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269-342, 2010.

Appel, Andrew W. SSA is functional programming. SIGPLAN Notices, 33(4):17-20, 1998.

Doucet, Arnaud, De Freitas, Nando, Gordon, Neil, et al. Sequential Monte Carlo methods in practice. Springer New York, 2001.

Felleisen, Matthias. The theory and practice of first-class prompts. In Proceedings of the 15th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pp. 180-190. ACM, 1988.

Goodman, Noah D., Mansinghka, Vikash K., Roy, Daniel M., Bonawitz, Keith, and Tenenbaum, Joshua B. Church: a language for generative models. In Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence (UAI 2008), pp. 220-229, 2008.

Kiselyov, Oleg and Shan, Chung-chieh. Monolingual probabilistic programming using generalized coroutines. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI 2009), 2009.

Mansinghka, Vikash, Selsam, Daniel, and Perov, Yura. Venture: a higher-order probabilistic programming platform with programmable inference.
arXiv preprint arXiv:1404.0099, 2014.

Milch, Brian, Marthi, Bhaskara, Russell, Stuart, Sontag, David, Ong, Daniel L., and Kolobov, Andrey. BLOG: probabilistic models with unknown objects. In Statistical Relational Learning, pp. 373, 2007.

Open Group, The. IEEE Std 1003.1, 2004 Edition, 2004a. URL http://pubs.opengroup.org/onlinepubs/009695399.

Open Group, The. IEEE Std 1003.1, 2004 Edition, 2004b. URL http://pubs.opengroup.org/onlinepubs/009695399/functions/fork.html.

Pfeffer, Avi. IBAL: a probabilistic rational programming language. In IJCAI, pp. 733-740, 2001.

Pfeffer, Avi. Figaro: an object-oriented probabilistic programming language. Charles River Analytics Technical Report, pp. 137, 2009.

Smith, Jonathan M. and Maguire, Jr., Gerald Q. Effects of copy-on-write memory management on the response time of UNIX fork operations. Computing Systems, 1(3):255-278, 1988.

Whiteley, Nick, Lee, Anthony, and Heine, Kari. On the role of interaction in sequential Monte Carlo algorithms. arXiv preprint arXiv:1309.2918, 2013.

Wingate, David, Stuhlmueller, Andreas, and Goodman, Noah D. Lightweight implementations of probabilistic programming languages via transformational compilation. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, pp. 131, 2011.

Wood, Frank, van de Meent, Jan Willem, and Mansinghka, Vikash. A new approach to probabilistic programming inference. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics, 2014.