SIMD Parallel MCMC Sampling with Applications for Big-Data Bayesian Analytics



Alireza S. Mahani
Scientific Computing Group, Sentrana Inc., Washington DC, USA
Phone: +1-202-507-4524

Mansour T.A. Sharabiani
Department of Epidemiology and Biostatistics, Imperial College London, UK

Abstract

Computational intensity and the sequential nature of estimation techniques for Bayesian methods in statistics and machine learning, combined with their increasing application to big-data analytics, necessitate both the identification of potential opportunities to parallelize techniques such as Monte Carlo Markov Chain (MCMC) sampling, and the development of general strategies for mapping such parallel algorithms to modern CPUs in order to push performance toward the compute-based and/or memory-based hardware limits. Two opportunities for Single-Instruction Multiple-Data (SIMD) parallelization of MCMC sampling for probabilistic graphical models are presented. In exchangeable models with many observations, such as Bayesian Generalized Linear Models (GLMs), child-node contributions to the conditional posterior of each node can be calculated concurrently. In undirected graphs with discrete nodes, concurrent sampling of conditionally-independent nodes can be transformed into a SIMD form. High-performance libraries with multi-threading and vectorization capabilities can be readily applied to such SIMD opportunities to gain decent speedup, while a series of high-level source-code and runtime modifications provide further performance boost by reducing parallelization overhead and increasing data locality for Non-Uniform Memory Access (NUMA) architectures. For big-data Bayesian GLM graphs, the end result is a routine for evaluating the conditional posterior and its gradient vector that is 5 times faster than a naive implementation using (built-in) multi-threaded Intel MKL BLAS, and comes within striking distance of the memory-bandwidth-induced hardware limit.
Using multi-threading for cache-friendly, fine-grained parallelization can outperform coarse-grained alternatives, which are often less cache-friendly, a likely scenario in modern predictive analytics workflows such as Hierarchical Bayesian GLM, variable selection, and ensemble regression and classification. The proposed optimization strategies improve the scaling of performance with the number of cores and the width of vector units (applicable to many-core SIMD processors such as Intel Xeon Phi and Graphics Processing Units), resulting in cost-effectiveness, energy efficiency ('green computing'), and higher speed on multi-core x86 processors.

Keywords: GPU, Hierarchical Bayesian, Intel Xeon Phi, logistic regression, OpenMP, vectorization

1. Introduction

Many inference problems in statistics and machine learning are best expressed in the language of probabilistic graphical models, where a probability distribution function (PDF) over a high-dimensional parameter space can be motivated as the product of a collection of terms, each a function of a subset of the parameters. Inference in such models requires summarizing the joint PDF, which can be quite complex and lacking closed-form integrals. Monte Carlo Markov Chain (MCMC) sampling techniques offer a practical way to summarize complex PDFs for which exact sampling algorithms are not available. For many real-world problems, MCMC sampling can be very time-consuming due to a combination of large data sets, high dimensionality of the joint PDF, lack of conjugacy between likelihood and prior functions, and poor mixing of the MCMC chain. Fast MCMC sampling is, therefore, important for wider adoption of probabilistic models in real-world applications.

Email address: alireza.mahani@sentrana.com (Alireza S. Mahani)
Preprint submitted to Computational Statistics and Data Analysis, November 21, 2014
For many years, software developers could rely on faster processors for improved performance, without any need for code modification. In the past decade, however, chip manufacturers have warned of single-core performance saturation as CPU clock rates and instruction-level parallelism reach their physical limits (Jeffers and Reinders (2013)). Instead, we are seeing a steady rise in core counts and in the width of vector units, culminating in the emergence of many-core Single-Instruction-Multiple-Data (SIMD) architectures such as Graphics Processing Units (GPUs) [1] and Intel Xeon Phi [2]. Today, a CPU purchased for around $3000 can offer a theoretical peak double-precision performance of more than 300 GFLOPS [3], and many-core processors selling for nearly the same price have broken the TeraFLOP barrier [4]. For scientific computing applications to take full advantage of such enormous performance potential within a single compute node, the entire parallelism potential of an algorithm must be efficiently exposed to the compiler, and eventually to the hardware.

Most efforts on parallelizing MCMC have focused on identifying high-level parallelism opportunities such as concurrent sampling of conditionally-independent nodes (Wilkinson (2010)). Such coarse-grained parallelism is often mapped to a distributed-memory cluster or to multiple cores on a shared-memory node. As such, the vectorization capabilities of the processor are implicitly assumed to be the responsibility of libraries and compilers, resulting in a systemic under-utilization of vectorization in scientific computing applications. Furthermore, with increasing data sizes and a widening gap between floating-point performance and memory bandwidth, modern processors have seen an architectural change in memory layout from Symmetric Multi-Processors (SMPs) to Non-Uniform Memory Access (NUMA) designs to better scale total memory bandwidth with the rising core count.
The software-level, shared-memory view of this asymmetric memory is a convenient programming feature but can lead to critical memory-bandwidth under-utilization for data-intensive applications.

This paper seeks to expand our ability to make efficient use of multicore x86 processors, today's most ubiquitous computing platform, to rapidly generate samples from the posterior distribution of model parameters. Our focus is two-fold: 1) identifying opportunities for SIMD parallelism in MCMC sampling of graphical models, and 2) efficiently mapping such SIMD opportunities to multi-threading and vectorization parallel modes on x86 processors. Using examples from directed and undirected graphs, we show that off-the-shelf, multi-threaded and vectorized high-performance libraries (along with vectorizing compilers) provide a decent speedup with small programming effort. Additionally, a series of high-level source-code and runtime modifications lead to significant additional speedup, even approaching hardware-induced performance limits. Vectorization of SIMD parallelism opportunities is often complementary to coarse-grained parallel modes, creating a multiplicative performance boost. Moreover, we illustrate the counter-intuitive result that, given a limited number of cores available, efficient fine-grained (SIMD) multi-threading can outperform coarse-grained multi-threading over a range of data sizes where L3 cache utilization is the dominating factor. In the process, we propose a general strategy for optimally combining a sequence of maps to minimize multi-threading overhead while maximizing vectorization coverage. This strategy naturally allows for data locality at the memory and cache level, and significantly reduces the cost of complex synchronizations such as reduction on arrays.
Furthermore, a differential update strategy for MCMC sampling of graphical models is proposed which, where applicable, can lead to a significant reduction in data movement as well as in the amount of computation, and is applicable to graphs with continuous as well as discrete nodes.

The remainder of this paper is organized as follows. In Section 2 we lay the theoretical foundation for the paper, including a summary of previous research on parallel MCMC and an overview of two single-chain parallelization techniques for parallel MCMC of Directed Acyclic Graphs (DAGs). In Section 3 we perform a detailed performance analysis and optimization of parallel MCMC for Bayesian GLM (focusing on logistic regression). In Section 4 we discuss several extensions of the ideas developed in Section 3, including Hierarchical Bayesian (HB) GLM, calculation of derivatives, the Ising model, batch RNG, distributed and many-core computing, and compile-time loop unrolling. Section 5 contains a summary of our results and pointers for future research.

[1] http://www.nvidia.com/object/what-is-gpu-computing.html
[2] http://tinyurl.com/co9e8hy
[3] http://ark.intel.com/products/64595/
[4] http://tinyurl.com/oexotqv

2. Parallel MCMC for Graphical Models

2.1. Previous Work on Parallel MCMC

A straightforward method for parallel MCMC is to run multiple chains (Wilkinson (2010)). Since each chain must go through the burn-in period individually, multi-chain parallelization is less suitable for complex models with poor convergence, where a big fraction of time might be spent in the burn-in phase. The most common single-chain MCMC parallelization strategy is concurrent sampling of conditionally-independent nodes within the Gibbs sampling framework (Wilkinson (2010)). This is also known as graph coloring or chromatic sampling. Example applications include Hierarchical LDA models (Newman et al. (2009)) and Markov Random Fields (Gonzalez et al. (2011)).
In molecular dynamics simulations, the assumption of short-range intermolecular forces provides for similar domain decompositions (Ren and Orkoulas (2007)). This parallelization strategy is reviewed in Section 2.3. In synchronous Gibbs sampling (Geman and Geman (1984)), all variables are updated simultaneously, rather than sequentially, in each Gibbs cycle, regardless of whether conditional independence holds. Despite its theoretical and practical shortcomings, this embarrassingly parallelizable approach remains popular in large-scale MCMC problems such as Latent Dirichlet Allocation (LDA) models for topical analysis of text documents (Newman et al. (2009)). Parallel message-passing algorithms are discussed in Gonzalez et al. (2009, 2011, 2012) with application to discrete factor graphs. Tibbits et al. (2011) present CPU and GPU parallelization of multivariate slice sampling (Neal (2003)) with application to linear Gaussian processes. Xu and Ihler (2011) propose a parallel 'look-ahead' sampler for discrete-valued, densely-connected graphical models such as Boltzmann Machines and LDA. This approach takes advantage of the observation that, in densely-connected networks, the contribution from any given node to the conditional posterior of a target node is small. A similar strategy, called 'pre-fetching', has been proposed (Brockwell (2006); Strid (2010)) in the context of one-block Metropolis-Hastings algorithms. Yu and Xu (2009) propose an approximate parallel Gibbs sampler, with a GPU implementation, for DNA motif finding, where parallelization is achieved by replacing sequential random steps with a deterministic, pre-ordered search of the parameter space.

In terms of target platforms, MCMC parallelization work has primarily focused on shared-memory multi-threading (Xu and Ihler (2011); Tibbits et al. (2011)), distributed-memory parallelization using explicit or implicit message-passing (Ren and Orkoulas (2007); Ahmed et al. (2012); Low et al.
(2012); Gonzalez et al. (2009); Newman et al. (2009)), and more recently, many-core architectures such as GPUs (Yu and Xu (2009); Tibbits et al. (2011); da Silva (2011); Dumont (2011)). This leaves two areas under-explored: efficient use of vector instructions for SIMD parallelization, and NUMA awareness for efficient utilization of memory bandwidth. Addressing these two areas is important in light of trends towards higher core counts, multi-socket processors with NUMA architecture, and the increasing width of vector units. Today's desktop processors are increasingly dual-socket, while quad-socket setups are available for high-end servers. Since each socket has a separate memory controller, a processor's access to memory attached to other sockets is mediated through a slower path such as Intel QPI [5], thus creating non-uniform access to memory. On the other hand, vector units have steadily widened over recent generations, reaching 128 bits for SSE4 and 256 bits for the AVX instruction set extensions. Intel Xeon Phi co-processors have doubled the width yet again, to 512 bits. The combined effect of vectorization and NUMA awareness for a dual-socket x86 system can reach an order of magnitude: a fully-vectorized code can run up to 4 times faster for double-precision calculations, while minimizing cross-socket communication on a dual-socket processor can double the performance of a memory-bound application.

Our paper aims to help better utilize the single-server capacities of a modern x86 processor for parallel MCMC sampling. In addition to achieving more efficient execution on x86 processors, our performance optimization techniques pave the way for further speedup on many-core processors, since the underlying principles of reducing parallel overhead and increasing data locality apply to all architectures. This strategy has been dubbed 'transforming and tuning' (Jeffers and Reinders (2013)).

2.2.
Gibbs Sampling of Graphical Models

A graphical model consists of nodes (or vertices) connected by links (or edges) [6]. In a probabilistic graphical model, each node represents a random variable (or a group of random variables), and links express probabilistic relationships between these variables. The graph represents a decomposition of the joint probability distribution over all variables into a product of factors, each consisting of a subset of variables. In DAGs, links have directionality indicated by arrows, and no closed path exists such that one can always travel in the direction of the arrows. For a DAG with K nodes, the joint distribution is given by the following factorization:

p(x) = \prod_{k=1}^{K} p(x_k | pa_k),   (1)

where pa_k represents the set of parents of node x_k, and x = {x_1, ..., x_K}. For undirected graphs, the joint distribution is expressed as a product of potential functions ψ_C(x_C) over the maximal cliques of the graph:

p(x) = (1/Z) \prod_C ψ_C(x_C),   (2)

where Z is the normalization factor, also known as the partition function.

Inference and prediction in graphical models often require calculating the expected values of functions of x with respect to the joint distribution p(x). In many cases, closed forms for the integrals do not exist, and approximate techniques such as MCMC can be used to approximate the joint distribution with a finite sample. A popular MCMC technique is Gibbs sampling (Geman and Geman (1984)), which is a special case of the Metropolis-Hastings (MH) algorithm (Hastings (1970)) with 100% acceptance rate. In Gibbs sampling, we iterate through the stochastic nodes and draw samples from the probability distribution for each node, conditioned on the latest samples drawn for all remaining nodes. Below is an outline of the Gibbs sampling algorithm:

[5] http://tinyurl.com/7hhc7rj
[6] Our introduction to graphical models borrows heavily from Section 8.1 of Bishop's excellent textbook (Bishop (2006)).
1. Initialize all nodes x = {x_1, ..., x_K}.
2. For iteration t = 1, ..., T:
   • Sample x_1^{t+1} ~ p(x_1 | x_2^t, x_3^t, ..., x_K^t).
   • Sample x_2^{t+1} ~ p(x_2 | x_1^{t+1}, x_3^t, ..., x_K^t).
   • ...
   • Sample x_k^{t+1} ~ p(x_k | x_1^{t+1}, ..., x_{k-1}^{t+1}, x_{k+1}^t, ..., x_K^t).
   • ...
   • Sample x_K^{t+1} ~ p(x_K | x_1^{t+1}, x_2^{t+1}, ..., x_{K-1}^{t+1}).

Gibbs sampling is appealing because it reduces the problem of sampling from a high-dimensional distribution to a series of low-dimensional sampling problems, which are often easier, especially when the resulting low-dimensional distributions have standard sampling algorithms (e.g. multivariate Gaussian, Gamma, etc.). When conditional distributions cannot be directly sampled, a hybrid approach can be used where an MCMC algorithm such as Metropolis (Metropolis and Ulam (1949)) or the slice sampler (Neal (2003)) is applied within the Gibbs framework (Robert and Casella (2004)).

2.3. SIMD Parallel MCMC for DAGs

Here we present two canonical parallelization opportunities for DAGs to provide a foundation for our analysis of Bayesian GLM. Equivalent ideas for undirected graphs are discussed in Section 4.3 in the context of the Ising model and the Boltzmann machine. According to Eq. 1, a given node appears in the joint distribution in two ways: a) a self-term, i.e. the term that expresses the probability of the node conditioned on its parents, and b) child terms, i.e. terms corresponding to the conditional probabilities of the node's children. Therefore, the conditional distributions appearing in the Gibbs sampling algorithm can be decomposed as follows:

p(x_k | -) ∝ p(x_k | pa_k) × \prod_{j \, s.t. \, x_k ∈ pa_j} p(x_j | pa_j),   (3)

where the first factor is the self-term and the product constitutes the child terms. This decomposition allows us to identify the following two single-chain parallelization modes for Gibbs sampling of DAGs.

Parallel Sampling of Conditionally-Independent Nodes: Consider two nodes x_1 and x_2.
If their joint conditional distribution is factorizable, i.e. if p(x_1, x_2 | -) = p_1(x_1 | -) p_2(x_2 | -), then the nodes are called conditionally-independent and can be sampled simultaneously in each iteration of Gibbs sampling. From Eq. 3, we can infer the following theorem about conditional independence:

Theorem 2.1. Two nodes are conditionally-independent (and can be sampled in parallel) if a) neither node is a parent of the other, AND b) the nodes have no common children.

Proof. Condition (a) means that neither node appears in the self-term of the other node. Condition (b) means that neither node appears in any of the child terms of the other node. Therefore, neither node appears in the conditional distribution of the other node. This is sufficient to prove that the nodes are conditionally-independent and can be sampled concurrently in Gibbs sampling.

The above theorem can be used to determine whether any pair of nodes is conditionally-independent. However, constructing the most efficient partitioning of a graph into node blocks, each consisting of conditionally-independent nodes, is a non-trivial, NP-hard problem (Wilkinson (2010)). For specific graph structures, such as HB models (Rossi et al. (2005)), conditionally-independent node blocks can be constructed rather easily. This parallelization mode can be difficult to vectorize, especially for non-conjugate, complex posteriors (such as GLM) where drawing a sample for each node involves multiple function calls, including random number generation (RNG) and a sampling routine. As we will see for Ising models, however, for simpler distributions this parallel mode can be partially vectorized using the strategy discussed in Section 3.4, with a significant contribution towards the overall performance of the sampling routine.
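As an illustration of this parallel mode, the following sketch (ours, not the paper's benchmarked code) samples one block of conditionally-independent nodes with an OpenMP loop. The per-node conditional posterior is a stand-in N(mu, 1), and the per-node seeding scheme is a simplifying assumption; a production sampler would use properly partitioned RNG sub-streams:

```cpp
#include <cassert>
#include <cmath>
#include <random>
#include <vector>

// Sample one block of K conditionally-independent nodes concurrently.
// Each node draws from its own conditional posterior; here a stand-in
// N(mu, 1.0) is used so the sketch stays self-contained.
std::vector<double> sample_block(int K, double mu) {
    std::vector<double> x(K);
    #pragma omp parallel for schedule(static)
    for (int k = 0; k < K; k++) {
        // Hypothetical per-node seeding; avoids shared RNG state across threads.
        std::mt19937_64 rng(1000003ULL * (std::uint64_t)(k + 1));
        std::normal_distribution<double> cond_post(mu, 1.0);
        x[k] = cond_post(rng);  // one Gibbs draw for node k
    }
    return x;
}
```

Without OpenMP the pragma is simply ignored and the loop runs serially, which keeps the sketch portable.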
Parallel Calculation of the Conditional Posterior: First, consider the conjugate case, where the product of the prior and each child-term contribution retains the functional form of the prior. For example, in LDA models a Dirichlet prior is used on the distribution of topics in each document. The likelihood contribution of each token in a document is a categorical distribution, with their collective contribution being a multinomial distribution whose parameter vector is the sum of contributions from all tokens. Similarly, with a conjugate Dirichlet prior on the distribution of words in each topic, the contribution from each document (and the tokens within each document) takes an additive form towards the posterior. In such cases, calculating the parameters of the posterior distribution can be parallelized, followed by a reduction.

Second, consider the case where conjugacy does not exist, and rather than drawing independent samples from the posterior we must create a Markov chain that converges to the posterior. For most MCMC sampling techniques, such as the slice sampler (Neal (2003)) or the Metropolis sampler (Metropolis and Ulam (1949)), the majority of time during each sampling cycle is spent on evaluating the conditional log-posterior function (e.g. within the Gibbs sampling framework) or its derivatives (Section 4.2). Therefore, any speedup in function evaluation translates, nearly 1-to-1, into MCMC speedup. Similar to the conjugate case, the additive contributions of child terms can be calculated in parallel, followed by a reduction step. In both cases, when the model is exchangeable (Bernardo (1996)), as in the case of i.i.d. observations, the contributions from child nodes are symmetric, resulting in a SIMD computation.
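The additive form of conjugate updates can be sketched for the Dirichlet-multinomial case as follows. This is an illustrative sketch (function and variable names are ours): token contributions are accumulated into per-thread partial counts and merged in a final step, sidestepping a costly reduction on arrays:

```cpp
#include <cassert>
#include <vector>

// Dirichlet-multinomial conjugate update: the posterior Dirichlet parameters
// are the prior parameters plus per-category token counts. Count accumulation
// is an additive reduction over child-node (token) contributions, so it can
// be computed in parallel before a final merge.
std::vector<double> dirichlet_posterior(const std::vector<int>& tokens,
                                        const std::vector<double>& alpha_prior) {
    const int V = (int)alpha_prior.size();
    std::vector<double> alpha_post(alpha_prior);
    std::vector<long> counts(V, 0);
    #pragma omp parallel
    {
        std::vector<long> local(V, 0);       // thread-private partial counts
        #pragma omp for schedule(static) nowait
        for (long n = 0; n < (long)tokens.size(); n++)
            local[tokens[n]]++;              // each token adds +1 to one category
        #pragma omp critical                 // merge partial counts (the reduction)
        for (int v = 0; v < V; v++) counts[v] += local[v];
    }
    for (int v = 0; v < V; v++) alpha_post[v] += (double)counts[v];
    return alpha_post;
}
```

When compiled without OpenMP the pragmas are ignored and the same code runs serially.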
Non-conjugate cases are typically better candidates for SIMD parallelization because 1) they tend to take a larger fraction of each sampling cycle, and 2) in conjugate cases, a larger number of observations is needed before the cost of calculating the conjugate posterior parameters dominates the cost of generating a random deviate from the posterior. GLM models are good candidates for SIMD Gibbs sampling due to the non-conjugacy of the prior and likelihood functions, and due to the high computational cost of each function evaluation, especially for long and wide data and in the presence of transcendental functions (Section 3.1).

Figure 1: Two parallelization modes for MCMC sampling of probabilistic DAGs. (a) Parallel sampling of conditionally-independent nodes: such nodes cannot have common children, or have a parent-child relationship amongst themselves. (b) Parallel computation of the conditional posterior: contributions from child nodes can be calculated concurrently. If parent-child relationships are symmetric, this parallelization mode takes a SIMD form.

Figure 1 summarizes the two single-chain parallelization modes for DAGs discussed above. In relative terms, parallel sampling of conditionally-independent nodes can be considered coarse-grained, while parallel evaluation of the log-likelihood can be considered fine-grained. In the remainder of this paper, we use two examples, Bayesian GLM and the Ising model, to demonstrate how each of these modes can be parallelized on multi-core x86 processors using multi-threading and vectorization.

3. SIMD Parallel MCMC for Bayesian GLM on Multi-core x86

3.1. Setup

3.1.1. Running Example: Bayesian Logistic Regression

GLMs (Nelder and Baker (1972)) are the workhorse of statistics and machine learning, with applications in areas such as risk analysis (Sobehart et al. (2000); Fenton and Neil (2004); Antonio and Beirlant (2007)) and public health (Hastie and Tibshirani (1987); Sharabiani et al. (2011)), among others.
They can be extended to handle data sparseness and heterogeneity via the HB framework (Rossi et al. (2005); Gelman and Hill (2007)), or to account for repeated measurements and longitudinal data via Generalized Linear Mixed Models (McCulloch (2006)). For our purposes, GLM models have the following log-likelihood function:

L(β) = \sum_{n=1}^{N} f(x_n^t β; y_n),   (4)

where x_n is the vector of features or explanatory variables for observation n and has length K, y_n is the response or target variable, β is the coefficient vector of length K, and N is the number of observation points. Logistic regression is an important member of the GLM family, where each binary response variable y_n is assumed to follow a Bernoulli distribution whose probability is 1 / (1 + exp(-x_n^t β)). The log-likelihood becomes

L(β) = - \sum_{n=1}^{N} { (1 - y_n) x_n^t β + log(1 + exp(-x_n^t β)) },   (5)

with y_n being a binary response variable with possible values {0, 1}. In a Bayesian setting, we impose a prior distribution on β, such as a multivariate Gaussian with parameters α and µ, leading to the DAG representation of Fig. 2. Unless we are dealing with very small data sizes, calculation of the prior contribution is often computationally negligible compared to the likelihood calculation, and hence we focus on the latter throughout this paper.

Figure 2: DAG representing the logistic regression problem. We use the plate notation to describe an array of nodes.

3.1.2. Hardware

We use a dual-socket 2.6GHz Intel Xeon E5-2670 [7] with 8 cores per socket (16 total cores), 20MB of L3 cache per processor (shared among 8 cores), 256KB of L2 cache per core, and 32KB of L1 cache per core (separately for data and instructions). Total installed RAM is 128GB, with a maximum memory bandwidth of 51.2GB/sec per socket.
Since each processor has its own memory controller, the dual-socket setup creates a Non-Uniform Memory Access (NUMA) environment (Section 8.8.2 of the Intel Optimization Reference Manual [8]). This processor supports the AVX instruction set extensions, using 256-bit wide floating-point registers YMM0-YMM15. Hyper-threading is turned off for our experiments.

3.1.3. Software

All code is written in C/C++ [9]. (The only C++ feature used is template metaprogramming for compile-time loop unrolling. See Section 4.6.) We use the Intel software stack for compiling and micro-benchmarking as well as for high-performance mathematical routines:

• Intel Math Kernel Library (MKL) (part of Intel Composer XE 2013): We use MKL for vectorized and multi-threaded BLAS calls (dgemv/ddot), vectorized transcendental functions (as part of the Vector Math Library, or VML [10]), and vectorized random number generation (RNG) as part of the Vector Statistical Library, or VSL [11]. Switching between single-threaded and multi-threaded MKL is done via the link option '-mkl=sequential' or '-mkl=parallel' [12].

• Intel C++ Compiler (also part of Intel Composer XE 2013): Combined with Intel VML, the compiler allows us to vectorize functions of transcendental functions. All code tested in this paper was compiled using the optimization flag -O2.

[7] http://ark.intel.com/products/64595/
[8] http://tinyurl.com/7ty2lsx
[9] During performance measurement, all BLAS calls are made to FORTRAN routines. For presentation brevity, however, we show calls to the C interface of BLAS in this paper.
[10] http://tinyurl.com/nc9yjhh
[11] http://tinyurl.com/pevz69r
[12] For more on how to compile and link using the Intel C++ compiler, see http://tinyurl.com/lctboe3 .
• Intel Performance Counter Monitor (PCM) [13], an open-source C++ wrapper library for reading micro-benchmarking (core and uncore) events from Intel's Performance Monitoring Units, including cache hit rate, clock cycles lost to memory stalls, and instructions per cycle.

3.1.4. Parallel Programming

Our overall approach is to maintain software portability and developer productivity while achieving good performance. For multi-threading, we use the OpenMP API and Intel's implementation of it (as part of Intel Composer XE 2013). OpenMP offers a mix of high-level and low-level compiler directives and runtime functions for creating and managing threads, and provides an efficient alternative to working directly with Pthreads. OpenMP is perhaps today's most widely used API for multi-threading in HPC applications. Other notable options for multi-threading include Cilk Plus [14] (which also handles vectorization) and Threading Building Blocks [15] from Intel, Chapel [16] from Cray, X10 [17] from IBM, and Phoenix MapReduce [18] from Stanford. For vectorization, we choose 'guided auto-vectorization' [19], i.e. we use pragmas and code modifications to guide the compiler to vectorize the code. We find this option to strike a balance between performance and portability/productivity. Other options include vector intrinsics and assembly code.

Another parallelization option is distributed computing on a server cluster using the Message-Passing Interface (MPI) [20], MapReduce [21] or Hadoop [22]. Given the fine-grained nature of our SIMD parallelization approach and the significant network latency and bandwidth restrictions of distributed computing, however, focusing on a single server is the logical choice. Our parallelization concepts can nevertheless be combined with coarse-grained options and thus incorporated into distributed computing frameworks. See Section 4.5.
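The 'guided auto-vectorization' option mentioned above can be illustrated with a minimal sketch (ours, using the portable OpenMP 4.0 simd construct rather than Intel-specific pragmas): the programmer asserts that the loop is free of vectorization hazards such as pointer aliasing, and the compiler emits vector instructions for both the transcendental map and the sum reduction:

```cpp
#include <cassert>
#include <cmath>

// Guided auto-vectorization of a transcendental map + reduction.
// The simd pragma tells the compiler the loop iterations are independent,
// so it may vectorize both the exp() call and the accumulation into r.
double logistic_map_sum(const double* xbeta, int n) {
    double r = 0.0;
    #pragma omp simd reduction(+:r)
    for (int i = 0; i < n; i++)
        r += 1.0 / (1.0 + std::exp(-xbeta[i]));  // element-wise logistic
    return r;
}
```

A compiler without OpenMP simply ignores the pragma, so the code remains portable; with a vectorizing compiler and a vector math library, the exp() calls map to vectorized transcendental routines.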
Note that while we use Intel's hardware and software, the optimization lessons learned are applicable to other platforms, including open-source software. More importantly, the software itself is portable, since we use standard APIs (the C/C++ standard library, OpenMP, BLAS).

3.2. Baseline Implementation

To calculate the log-likelihood in Eq. 5, we can consolidate the x_n's into the rows of an N × K matrix X, and use a matrix-vector multiplication routine such as BLAS dgemv to calculate Xβ. We refer to this first map operation (McCool et al. (2012)) as the Linear Algebra (LA) map. This is followed by an element-wise transformation of Xβ involving transcendental functions, and a final reduction step. We call this second map operation the Transcendental (TR) map. The result is the loglike routine in Fig. 3. Since each map operation finishes before the start of the next map, this strategy is a Sequence of Maps (SOM) pattern (McCool et al. (2012)). Using an off-the-shelf, optimized BLAS library such as Intel MKL provides automatic vectorization and multi-threading of the dgemv routine. Further parallelization can be achieved by multi-threading the TR map using the omp parallel for pragma. This same loop can also be vectorized, but doing so requires access to a vector math library for transcendental functions, as well as compiler support for vectorization of functions of transcendental functions. The Intel Composer suite offers both capabilities. The simd pragma in Fig. 3 directs the Intel C++ compiler to ignore any potential vectorization barriers (such as pointer aliasing between Xbeta and y) and generate vector instructions for the TR map [23].

Fig. 4 (left) shows the performance of our baseline implementation, measured in 'Clock Cycles Per Row' (CPR), defined as the total time (measured in CPU clock cycles) for each function evaluation, divided by N, the number of rows in X [24]. Time per function evaluation is an average over many iterations.
All iterations share the same X and y (allowing for caching), but β is refreshed from one iteration to the next. This mimics the real-world application in MCMC sampling.

[13] http://tinyurl.com/q4er8qp
[14] http://www.cilkplus.org/
[15] https://www.threadingbuildingblocks.org/
[16] http://chapel.cray.com/
[17] http://x10-lang.org/
[18] http://mapreduce.stanford.edu/
[19] http://tinyurl.com/qbtpc38
[20] http://www.mcs.anl.gov/research/projects/mpi/
[21] http://research.google.com/archive/mapreduce.html
[22] http://hadoop.apache.org/
[23] SIMD constructs have been incorporated into the OpenMP 4.0 API: http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf
[24] This is inspired by the Clock Cycles Per Element (CPE) metric, defined in Bryant and O'Hallaron (2011), Chapter 5.

--------------------------------------------------------------------------
double loglike(double beta[], double X[], double y[], int N, int K) {
  double r = 0.0;
  double *Xbeta = new double[N];
  /* Linear Algebra (LA) Map */
  cblas_dgemv(CblasRowMajor, CblasNoTrans, N, K, 1.0, X, K, beta, 1, 0.0, Xbeta, 1);
  /* Transcendental (TR) Map + Reduction */
  #pragma omp parallel for reduction(+:r) num_threads(NTHD) schedule(static)
  #pragma simd reduction(+:r)
  for (int n = 0; n < N; n++)
    r -= (1.0 - y[n]) * Xbeta[n] + log(1.0 + exp(-Xbeta[n])); /* per Eq. 5 */
  delete[] Xbeta;
  return r;
}
--------------------------------------------------------------------------

4.3. Ising Model

The energy function of the Ising model is

E(s) = - \sum_i b_i s_i - \sum_{<i,j>} w_{ij} s_i s_j,   (21)

where s is the state vector, w_{ij} is the interaction weight between neighbors i and j, and the b_i's control the bias terms. The second sum runs over pairs of i and j that are neighbors. In image de-noising, the bias terms reflect the pixel intensities of the noisy image, serving to induce the output to represent the input, while the w_{ij}'s enhance correlation among neighboring pixels to reduce noise. The configuration probability is given by the Boltzmann distribution:

P(s) = e^{-E(s)} / Z,   (22)

where the normalizing factor Z is called the partition function, and is given by Z = \sum_s e^{-E(s)}.
In Gibbs sampling of the Ising model, the probability of node i is given by:

P(s_i = 1) = 1 / (1 + exp(−z_i)),   (23)

where z_i = b_i + Σ_j w_ij s_j and the sum is over the neighbors j of node i. This graph has a 2-color decomposition (Fig. 21), and nodes within each color can be sampled concurrently. This is the coarse parallelism discussed in Section 2.3. On the other hand, since the sum in z_i contains only 4 terms, the fine-grained parallelism opportunity is unattractive. It is clear, therefore, that multi-core multi-threading must be dedicated to the only available parallel mode. But the outstanding question is, can vectorization be applied here as well? Recall that in HB GLM, due to the complexity of the sampling steps for each β_k and the availability of fine-grained parallelism, we did not consider vectorization for coarse-grained parallelism. But in the Ising model, the sampling steps for each node are significantly simpler, as described below:

1. Calculate z_i, the contribution from the neighbors. This consists of 4 multiplications and 5 additions.
2. Calculate P(s_i = 1) according to Eq. 23. This involves 1 exponentiation, 1 division, and 1 addition.
3. Generate a sample for s_i. This consists of generating a random deviate from the uniform distribution and comparing it to the probability calculated in step 2.

As always, the ideal scenario is to vectorize all steps. However, we face two challenges in doing so:

1. Memory Access Patterns: In step 1, fetching the neighbor states of a given node requires a gather memory operation. Handling boundary conditions adds further complexity to the code. Both of these can render vectorization inefficient or impossible.
2. RNG: Making an external function call to a pre-compiled RNG library in step 3 can act as a vectorization blocker.

Here again we can take advantage of the PLF strategy for incremental vectorization.
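In scalar form, the three sampling steps can be sketched roughly as follows. This is an illustrative sketch only, not the paper's actual code: the array names, the 0/1 state encoding, the free boundary handling, and the uniform coupling weight w are our assumptions; the uniform deviate u would come from whatever RNG strategy is in use.

```cpp
#include <cmath>

// One Gibbs update for node (i, j) of an H-by-W lattice Ising model,
// with states in {0, 1} (hypothetical sketch of steps 1-3).
// s: states, b: biases, w: uniform coupling weight, u: a uniform(0,1)
// deviate (e.g. taken from a pre-generated batch).
inline void gibbs_update(double s[], const double b[], double w,
                         int H, int W, int i, int j, double u) {
    const int idx = i * W + j;
    // Step 1: neighbor contribution z_i = b_i + sum_j w_ij s_j
    double z = b[idx];
    if (i > 0)     z += w * s[idx - W];
    if (i < H - 1) z += w * s[idx + W];
    if (j > 0)     z += w * s[idx - 1];
    if (j < W - 1) z += w * s[idx + 1];
    // Step 2: P(s_i = 1) = 1 / (1 + exp(-z_i))  (Eq. 23)
    const double p = 1.0 / (1.0 + std::exp(-z));
    // Step 3: compare a uniform deviate to the probability
    s[idx] = (u < p) ? 1.0 : 0.0;
}
```

Within a colored partition, this update can be applied to many nodes concurrently, which is the loop the optimizations in this section aim to vectorize.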
Given the memory access pattern challenge of step 1 and its small contribution to the total computation per sample, we choose to leave it scalar. As for step 3, we propose a batch RNG approach.

Batch RNG

Batch generation of the uniform random deviates used in step 3 offers two advantages. Firstly, it removes the RNG call from the sampling loop, allowing the loop to be vectorized more efficiently. Secondly, it speeds up RNG by reducing function call overhead and allowing the use of high-performance libraries that utilize vector instructions or other forms of parallelization. Examples include Intel VSL [32], Nvidia's cuRAND [33], and the SPRNG project [34]. Applying the batch RNG strategy is more involved for complex distributions; see Section 4.4.

Data Layout Optimization

Vectorization efficiency for step 2 depends on data alignment for s_i (and b_i). If the entire network occupies a contiguous block of memory, then the data for each (colored) graph partition will not be contiguous. Such non-unit-stride memory accesses have a particularly negative impact on vectorization. To overcome this, we can extract each sub-graph and write it to new arrays, perform Gibbs sampling on these contiguous-memory arrays, and transform the results back to the original arrangement at the end (Fig. 21).

Fig. 22 shows the impact of 3 optimization techniques on the performance of Gibbs sampling for an Ising model used in denoising a 560-by-516 binary image (1000 samples). We followed the specification and parameters in Bishop (2006) (Chapter 8). Switching to batch RNG provides a 4.9x speedup for the RNG part of the code (84.1 → 17.1 clock cycles per sample). Vectorization results in a 2.3x speedup in the non-RNG part of the code (75.4 → 32.5). Finally, switching to unit-stride memory access led to a 1.5x speedup in the non-RNG part (32.5 → 21.7). The cumulative impact of all 3 optimization techniques (on the entire code) is a 4.3x speedup.

4.3.2.
Boltzmann Machine

A close cousin of the Ising model is the Boltzmann machine (Ackley et al. (1985)), where nodes are densely connected rather than having the lattice connectivity of the Ising model. As a result, 1) there is no conditional independence among nodes and hence the graph cannot be colored for parallel sampling, and 2) each node is potentially connected to a large number of nodes, creating a SIMD parallelization opportunity for the calculation of the z_i's. A variant of the Boltzmann machine is the Restricted Boltzmann Machine (RBM) (Smolensky (1986)), a key ingredient in building deep belief nets (Hinton et al. (2006)), which are seeing increasing application in supervised and unsupervised learning of generative models for image (Krizhevsky et al. (2012)), video (Le et al. (2011)) and speech (Mohamed et al. (2011)) data. RBMs have a bipartite graph structure, where visible and hidden nodes are densely connected, but no connections among nodes of the same layer are allowed. This leads to conditional independence of the hidden nodes, allowing for their parallel sampling. In choosing between the fine- and coarse-grained parallel modes to apply vectorization, we must consider these factors: 1) The fine-grained parallel mode consists of a summation over contributions from nodes connected to a given node. Its arithmetic complexity is low, but vectorization is easy. 2) The coarse-grained parallel mode contains the inner summation loop. Efficient vectorization of this step within the coarse-grained parallelism would require a loop unroll (see Section 4.6). 3) Steps 2 and 3 from the Ising model are repeated here, and we can use the strategies discussed before. Note that dense connectivity can alleviate or remove the problem of suboptimal memory access patterns.

[32] http://tinyurl.com/pevz69r
[33] https://developer.nvidia.com/curand
[34] http://www.sprng.org/

Figure 21: Left: Illustration of graph coloring for the lattice Ising model.
All nodes of the same color are independent of each other, conditioned on the values of all nodes of the other color. Therefore, Gibbs sampling of this graph can proceed in two steps, in each of which all nodes of the same color can be sampled concurrently. Right: Default memory layout for the graph, showing that nodes of the same color will not occupy contiguous blocks of memory. However, an explicit mapping of each subset can create unit-stride memory access for Gibbs sampling of each colored subset.

Figure 22: Cumulative impact of three optimization techniques on the performance of Gibbs sampling for a square-lattice Ising model used in image de-noising. Image dimensions are 560 by 516. 1000 samples were generated. The energy function followed the specification and parameters in Chapter 8 of Bishop (2006). First, we switched from one-at-a-time to batch RNG. Second, sampling step 2 was vectorized. Third, graph nodes belonging to each color were copied to contiguous blocks of memory, allowing for unit-stride access during vectorization.

4.3.3. Differential Update and Atomic Vector Operations

We introduced the differential update strategy in the context of Bayesian GLM models (Section 3.4). This strategy can also be used with discrete-node graphs such as Ising models, Boltzmann machines and LDA models. For example, consider the image de-noising example introduced in this section. We measured transition rates for binary pixels in the model, finding that after the first few iterations fewer than 10% of the pixels switch value. If we switch to a differential update approach, where we maintain intermediate data structures z and p and update them each time a node has a state transition, we see a nearly 2x speedup. However, combining this differential update approach with vectorization can be a challenge.
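As an illustration, a scalar version of this bookkeeping for the lattice model might look as follows. This is a hypothetical sketch under our own assumptions (0/1 states, uniform coupling weight w, free boundaries), not the paper's code: z[] caches each node's z_i = b_i + Σ_j w_ij s_j term and is patched only on an actual state transition.

```cpp
// Differential update sketch for an H-by-W lattice Ising model
// (hypothetical). z[] caches each node's z_i; when a node's state
// actually changes, only its neighbors' cached z values are patched.
inline void set_state(double s[], double z[], double w,
                      int H, int W, int i, int j, double s_new) {
    const int idx = i * W + j;
    const double ds = s_new - s[idx];
    if (ds == 0.0) return;          // no transition: nothing to do
    s[idx] = s_new;
    // Patch the cached z of each neighbor (free boundary conditions).
    if (i > 0)     z[idx - W] += w * ds;
    if (i < H - 1) z[idx + W] += w * ds;
    if (j > 0)     z[idx - 1] += w * ds;
    if (j < W - 1) z[idx + 1] += w * ds;
}
```

With fewer than 10% of pixels flipping per sweep, most calls return immediately at the ds == 0 test, which is the source of the observed speedup.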
If two nodes are updated simultaneously and they share neighbors, they will attempt to update the same elements of z and p, requiring atomic operations inside vectorized code. Unfortunately, current versions of x86 processors do not support atomic vector operations (Kumar et al. (2008)), and it is likely that future support will be rather inefficient. Similar opportunities and challenges exist for RBMs, where conditional independence allows hidden nodes to be updated concurrently, but they could attempt to write to the same locations in intermediate arrays. In non-collapsed Gibbs sampling of LDA, a differential update strategy is commonly used, where matrices representing counts of words by topic and topics by document are maintained and updated after each new topic assignment for a token (Xu and Ihler (2011)). Vectorization of parallel sampling for topic assignments will face the same challenge of needing to perform atomic operations on these count matrices.

4.4. Batch RNG for Complex Distributions

In Section 4.3 we saw that using a batch RNG for uniform deviates in Gibbs sampling of the Ising model led to a nearly 2x speedup (Fig. 22) and paved the way for vectorization of concurrent sampling of nodes of the same color. Distributions such as uniform and normal are easy to handle in batch mode due to their linear transformation properties, allowing a deviate with standard parameters to be converted to a deviate with arbitrary parameters:

X ∼ U(0, 1)  →  a + (b − a)X ∼ U(a, b),   (24)

where U(a, b) is the uniform distribution on the real interval [a, b). Similarly, for the normal distribution we have:

X ∼ N(0, 1)  →  μ + σX ∼ N(μ, σ),   (25)

where N(μ, σ) is the normal distribution with mean μ and standard deviation σ. If a distribution does not enjoy such properties, we cannot pre-generate its random deviates in advance without knowing the values of the parameters needed.
The best we can do is to pre-generate the 'component' random deviates that are used in the algorithm for sampling from the desired distribution. As a concrete example, consider the Dirichlet distribution, which is used in LDA models as a conjugate prior to the categorical and multinomial distributions of topics in documents and words in topics (Blei (2012)):

f(x_1, ..., x_K; α_1, ..., α_K) = (1 / B(α)) ∏_{i=1}^{K} x_i^{α_i − 1},   (26)

where x_i ∈ [0, 1), i = 1, ..., K, Σ_i x_i = 1, and α_i > 0, i = 1, ..., K. Dirichlet deviates can easily be constructed from Gamma deviates [35], which in turn require uniform (and possibly normal) deviates. We took the Gamma sampling algorithm described in Press (2007) [36] and adapted it to use a buffer of uniform and normal random deviates, generated in batch. This Gamma sampler was then used in the Dirichlet sampler. For reference, we used a baseline Gamma sampler which generated its uniform and normal deviates one at a time. In both versions, Intel VSL routines were used for the uniform and normal deviates. We tested the batch and one-at-a-time versions of the Dirichlet sampler on the Enron email data set from the UCI repository (Bache and Lichman (2013)), specifically for sampling θ, the probability of topics (100) for each document (39861). A flat (Dirichlet) Jeffreys prior was used for all documents. The one-at-a-time Dirichlet sampler had a performance of 1263 clock cycles per Gamma sample, while the batch version had a performance of 198 clock cycles, a speedup of 6.4x. Fig. 23 summarizes the speedup offered by batch vs. one-at-a-time RNG for the uniform, normal, Gamma, and Dirichlet distributions.

[35] First, we draw y_i's from Gamma(α_i, 1). Next we normalize: x_i = y_i / Σ_i y_i.
[36] The algorithm uses the rejection method of Marsaglia and Tsang (2000) for α > 1, and a lemma described in Ripley (2009) to cast an α < 1 problem into an α > 1 problem.
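The structure of such a batch-buffered sampler can be sketched as follows. This is an illustrative sketch, not the paper's implementation: std::mt19937 stands in for Intel VSL as the base generator, the buffer class and all names are our own, and the Gamma sampler follows the Marsaglia-Tsang rejection method with Ripley's α < 1 reduction mentioned in footnote 36.

```cpp
#include <cmath>
#include <random>
#include <vector>

// Buffers of base deviates, filled in blocks (one library call per
// block in a real batch RNG; here std::mt19937 is a stand-in for VSL).
struct DeviateBuffer {
    std::vector<double> unif, norm;
    std::size_t iu = 0, in = 0;
    std::mt19937 gen{42};
    explicit DeviateBuffer(std::size_t n) : unif(n), norm(n) { refill(); }
    void refill() {
        std::uniform_real_distribution<double> U(0.0, 1.0);
        std::normal_distribution<double> N(0.0, 1.0);
        for (auto& u : unif) u = U(gen);
        for (auto& x : norm) x = N(gen);
        iu = in = 0;
    }
    double next_unif() { if (iu == unif.size()) refill(); return unif[iu++]; }
    double next_norm() { if (in == norm.size()) refill(); return norm[in++]; }
};

// Marsaglia-Tsang rejection sampler for Gamma(alpha, 1), consuming
// its base deviates from the buffer rather than calling the RNG.
double rgamma(double alpha, DeviateBuffer& buf) {
    if (alpha < 1.0)  // Ripley's lemma: reduce to the alpha > 1 case
        return rgamma(alpha + 1.0, buf) *
               std::pow(buf.next_unif(), 1.0 / alpha);
    const double d = alpha - 1.0 / 3.0, c = 1.0 / std::sqrt(9.0 * d);
    for (;;) {
        double x = buf.next_norm(), v = 1.0 + c * x;
        if (v <= 0.0) continue;
        v = v * v * v;
        double u = buf.next_unif();
        if (u < 1.0 - 0.0331 * x * x * x * x) return d * v;
        if (std::log(u) < 0.5 * x * x + d * (1.0 - v + std::log(v)))
            return d * v;
    }
}

// Dirichlet via normalized Gamma deviates (footnote 35).
std::vector<double> rdirichlet(const std::vector<double>& alpha,
                               DeviateBuffer& buf) {
    std::vector<double> x(alpha.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < alpha.size(); i++)
        sum += (x[i] = rgamma(alpha[i], buf));
    for (auto& xi : x) xi /= sum;
    return x;
}
```

Because the number of base deviates consumed per Gamma sample is random, the buffer must be over-provisioned and periodically refilled, which is the source of the ~10% overhead noted in Fig. 23.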
Distribution    One-At-A-Time (CCPS)    Batch (CCPS)    Batch (Speedup)
Uniform           76                      18               4.2
Normal           949                      39              24.3
Gamma           1170                     162               7.2
Dirichlet       1263                     198               6.4

Figure 23: Comparison of one-at-a-time and batch RNG methods for 4 probability distributions. For the uniform, normal and Gamma distributions, 10 million random deviates were generated. For Gamma, the shape and rate parameters used are 2.0 and 3.0, respectively. For Dirichlet, we generated samples in the context of an LDA model as described in the text. CCPS: Clock Cycles Per Sample. For Dirichlet, in order to normalize for the effect of K, CCPS is calculated per Gamma sample generated. For uniform and normal deviates, the Intel VSL library was used, with methods VSL_RNG_METHOD_UNIFORM_STD and VSL_RNG_METHOD_GAUSSIAN_BOXMULLER, respectively. For Gamma deviates, we used the algorithm described in Press (2007). For Dirichlet, we used the standard method based on Gamma deviates. Batch numbers include the time for copying data from the RNG buffer to its destination, which imposes around 6 clock cycles per sample in overhead. The numbers for batch Gamma and Dirichlet include roughly a 10% additional overhead for generating extra uniform and normal deviates, since the exact number of base deviates needed per Gamma sample is not known in advance. A more optimized code can re-fill the buffer periodically to reduce waste.

Our results suggest that there is value in RNG libraries offering batch generation capabilities. In the above example, we could not use the Gamma sampler offered by Intel VSL since its interface does not offer an option of passing buffers of uniform or normal deviates, forcing us to write the Gamma sampler from scratch.

4.5. Beyond Single-Node x86

The focus of this paper has been on maximizing the performance of a single, shared-memory x86 compute node for parallel MCMC sampling. Going beyond single-node x86 can be driven by three needs: 1) more memory, 2) more memory bandwidth, and 3) more computing power.
In distributed-memory clusters, total memory and compute power scale linearly with node count, and so does the total memory bandwidth between each node's cores and its private memory. However, typically-slower inter-node connections mean that inter-node costs of synchronization and data exchange are higher than their intra-node counterparts. As such, the two principles of minimizing parallel overhead and minimizing data movement must be followed more rigorously in distributed computing. In fact, we can consider our NUMA adjustments, discussed in Section 3.4, as a precursor to distributed computing. Asymmetric treatment of cross-socket vs. within-socket communication, while continuing to take advantage of the convenience of shared-memory programming, can be considered halfway between writing symmetric, shared-memory parallel code and the standard message-passing-based approach to distributed computing on a cluster.

Many-core architectures such as Intel Xeon Phi and GPUs offer another option for adding memory bandwidth and compute power. They can be used as an alternative to, or in conjunction with, distributed computing. They are best suited for floating-point SIMD parallelism since they dedicate more transistors to floating-point operations and fewer to out-of-order execution, branch prediction, and speculative execution (Kirk and Wen-mei (2012); Jeffers and Reinders (2013)). Intel Xeon Phi co-processors can be considered an extension of x86 processors with a higher core count (up to 60) and wider vector units (512-bit, holding 8 doubles). Fig. 24 compares the compute and memory performance of an Intel Xeon Phi 7120P and an Nvidia Tesla K40 against our x86 test system. Based on these numbers, we can say, in broad terms, that compute-bound problems can gain >3.5x speedup, while memory-bound problems can gain between 2x and 3.5x, by switching from an x86 processor to a many-core co-processor.
As with x86, approaching these hardware limits requires tuning to minimize parallel overhead and maximize data locality. There are two other factors beyond the raw performance of co-processors that must be considered in the cost/benefit analysis of using co-processors (also referred to as 'devices'): 1) memory capacity, and 2) data transfer and parallelization overhead. As we see in Fig. 24, memory capacity is significantly lower for the current generation of co-processors, compared to modern x86 servers. Therefore, as data sizes exceed device memory, the device code may become significantly more complex to accommodate the need to move data in and out of device memory. In our GLM example, at K = 50, the Tesla K40 card will run out of space for N larger than 31.6 million rows, while the same limit on the host system will be reached at data sizes 32 times larger. Unless the code is ported completely to the device, so that there is no need to send data back and forth between the host and device memory, we will incur data transfer costs, due to non-zero latency and finite bandwidth. On GPU cards, each kernel launch incurs an additional cost, while on Xeon Phi, where we may launch more than 100 threads, the thread management overhead can be significantly higher than the numbers we discussed for the host system.

                                        Intel Xeon E5-2670   Intel Xeon     Nvidia Tesla
                                        (dual-socket)        Phi 7120P      K40
Memory Bandwidth (GB/sec)               102.4                352 (3.44x)    288 (2.81x)
Peak Floating-Point Perf. (GFLOPS/sec)  332.8                1208 (3.63x)   1430 (4.30x)
Memory (GB)                             (up to) 384          16 (1:24)      12 (1:32)

Figure 24: Comparison of a dual-socket Intel Xeon E5-2670, an Intel Xeon Phi 7120P, and an Nvidia Tesla K40 in terms of memory bandwidth and peak floating-point performance. Numbers in parentheses indicate speedup of the co-processor over the x86 reference system (2 x Intel Xeon E5-2670) for the given dimension. Sources: http://tinyurl.com/p47ytt3 (Intel), http://tinyurl.com/oexotqv (Nvidia)
For example, in Bayesian GLM we may want to use the device only for function evaluation and run the sampler on the host, or there might be other parameters besides β that are sampled. As a result, each function evaluation may incur the data transfer overhead associated with 1) sending β from host to device, 2) launching the kernel on the device, and 3) sending the log-likelihood result from device back to the host. To measure all these overhead components, we used an empty kernel inside a for loop, as seen in Fig. 25 (left). (This test was performed on a Tesla K20 card.) This loop took nearly 20 µsec per iteration, equal to 52K CPU clock cycles. If N = 10K, this translates into a CPR overhead of 5.2. Recall that, using the differential update approach on the CPU, we reached a CPR of less than 2.0 using all 16 cores (Fig. 13). Therefore, even ignoring the memory and compute limits of the device, the data transfer and kernel launch overheads alone remove any incentive to port the code to the device, unless we port a sufficient portion of the code to remove the need for such frequent kernel launches and data transfers. Fig. 25 (right) shows the expected speedup as a result of device hardware limits as well as kernel launch and data transfer overhead. This analysis can be used to make an informed decision on whether or not to port the code (and how much of it) to the device.

Finally, development and maintenance costs associated with code complexity must be factored into a total cost of ownership analysis of using co-processors. CUDA [37] is a convenient programming language for Nvidia GPUs, yet it is not open-source, while OpenCL [38] has the opposite pros and cons. Even CUDA code for the logistic regression log-likelihood will inevitably be more complex than the CPU code. For example, in CUDA kernels, threads within a block can communicate inside a kernel, while blocks cannot.
On the other hand, there is an upper limit on the number of threads that can be launched within a block. Such limitations may force us to break up the CUDA kernel into two steps: one in which a partial reduction is performed within blocks, and a second kernel where the final sum is calculated. Optimal selection of grid and block geometry is another design decision that may require runtime and/or compile-time code branching.

4.6. Compile-Time Loop Unroll

In implementing the GLM log-likelihood, we encountered a nested loop in the LA map whose vectorization in the SOM and PLF patterns was delegated to the Intel MKL library as part of the dgemv call (see Figs. 3 and 7). We also mentioned that, to transform this log-likelihood function to a MOS pattern, we can fuse the outer loop of the LA map with the TR loop, which leaves us with a ddot call inside the fused loop to represent the LA step (Fig. 6). But the presence of this function call prevents a full vectorization of the outer loop, leading to poor performance. This can be verified by inspecting the assembly code generated by the Intel compiler, as described below.

[37] http://www.nvidia.com/object/cuda_home_new.html
[38] https://www.khronos.org/opencl/

------------------------------------------
for (int i=0; i<1000; i++) {
  cudaMemcpy(betaDev, beta, K*sizeof(double),
             cudaMemcpyHostToDevice);
  doNothing<<<1,1>>>();
  cudaMemcpy(&f, fpDev, 1*sizeof(double),
             cudaMemcpyDeviceToHost);
}
------------------------------------------

Figure 25: Left: CUDA code used to measure kernel launch and data transfer overhead for Bayesian GLM, assuming an incomplete code port. Right: Expected speedup from porting the log-likelihood function to an Nvidia Tesla K20, as a function of N, for K = 50. We assume both host and device code are compute-bound, hence a 4.3x maximum speedup (horizontal line) for the device from Fig. 24, which is the speedup assuming a full code port.
Using the code in the left panel, we measured the overhead associated with an incomplete code port to be a constant 20 µsec, amortized over N to produce an effective CPR overhead. The vertical line represents the memory boundary for the device, beyond which it cannot host X in its 12GB global memory.

Figure 26: Performance comparison of SOM and several MOS methods for implementing the inner product x_n' β. Parameters: K = 10, N = 1000. All implementations are vectorized but single-threaded. CPR: MOS-ddot 86.7, MOS-RegularLoop 53.7, MOS-KnownSpanLoop 42.0, MOS-FullyUnrolledLoop 32.9, SOM 32.3.

The assembly code for the function in Fig. 6 (with the fused loop vectorized) consists of three loops, each containing calls to the AVX-vectorized transcendental functions __svml_exp4 and __svml_log4. [39] [40] The reason for breaking the loop into 3 parts is to take advantage of efficient data-movement instructions for aligned data in the main loop, while allowing scalar data-movement instructions to handle misalignment in the first loop, and array lengths that are incomplete multiples of 4 in the last loop (since each AVX vector instruction handles 4 doubles in its 256-bit registers). The evidence for partial vectorization comes from seeing 4 calls to ddot inside the main loop: since ddot is scalar (i.e., it produces only one element per call as input to the transcendental functions), it must be called 4 times to generate the full input for the vector instructions. As we see in Fig. 26, the resulting performance ('MOS - ddot', CPR of 86.7) is much worse than the SOM pattern of Fig. 3 ('SOM', CPR of 32.3).

Since ddot is a simple routine, we can replace it with a for loop ('MOS - RegularLoop'), with significant performance improvement. We can further optimize the code by making the loop span a compile-time constant ('MOS - KnownSpanLoop').
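For concreteness, this fused MOS loop with a compile-time inner span might be sketched as follows. This is a hypothetical, single-threaded sketch (names and the row-major layout of X are our assumptions); making K a template parameter is one way to fix the span at compile time.

```cpp
#include <cmath>

// Sketch of the fused MOS loop with the ddot call replaced by an
// inlined inner product whose span K is a compile-time constant
// (the 'KnownSpanLoop' idea). X is N-by-K, row-major.
template <int K>
double loglike_mos(const double beta[], const double X[],
                   const double y[], int N) {
    double r = 0.0;
    for (int n = 0; n < N; n++) {
        double xb = 0.0;
        for (int k = 0; k < K; k++)   // span known at compile time
            xb += X[n * K + k] * beta[k];
        r += y[n] * xb - std::log(1.0 + std::exp(xb));  // fused TR step
    }
    return r;
}
```

With K fixed at compile time, the compiler can unroll and vectorize the inner reduction, which is what distinguishes 'KnownSpanLoop' from 'RegularLoop'.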
The best result, however, is achieved when we fully unroll the loop ('MOS - FullyUnrolledLoop'), producing essentially the same performance as the SOM pattern. Loop unrolling does not necessarily require hand-coding by the programmer. Instead, a source-to-source compilation stage can be utilized, or we can utilize the template meta-programming facilities of C++ compilers. Fig. 27 shows example code that generates a fully-unrolled version of the ddot BLAS routine for calculating the inner product of two vectors.

[39] To focus on vectorization, we compiled the code without the OpenMP pragma.
[40] For more on generating vectorized code, see the Web Aside http://csapp.cs.cmu.edu/public/waside/waside-simd.pdf to Bryant and O'Hallaron's excellent textbook (Bryant and O'Hallaron (2011)).

--------------------------------------------------------------------------
template <int i>
struct inprod {
  static inline double apply(const double x[], const double y[]) {
    return inprod<i-1>::apply(x,y) + x[i-1]*y[i-1];
  }
};
template <>
struct inprod<1> {
  static inline double apply(const double x[], const double y[]) {
    return x[0]*y[0];
  }
};
template <int K>
inline double ddot_unrolled(const double x[], const double y[]) {
  return inprod<K>::apply(x,y);
}
--------------------------------------------------------------------------

Figure 27: Using C++ template meta-programming to implement a compile-time loop unroll of the ddot BLAS call, used in the code shown in Fig. 6.

In the logistic regression example, the only parameter subspace for which we saw a significant advantage of 'MOS - FullyUnrolledLoop' over SOM is for short and narrow data combined with multi-threading. For example, if we use the same parameters as in Fig. 26 but 4 threads instead of 1, the CPR numbers for these two methods become 12.0 and 19.4, respectively. In addition to being inflexible in handling external function calls, the MOS pattern faces another significant performance challenge, i.e.
excessive synchronization costs in the presence of the reduction-to-array operations needed for calculating the derivatives of the log-likelihood function (Section 4.2). Moving many tuning decisions from runtime to compile time is generally a fruitful strategy for HPC. A recent example of this is the Stan compiler for Bayesian DAGs. In contrast to previous compilers such as WinBUGS and JAGS, Stan performs a full compilation of the DAG into C++ code. This is likely one reason for its superior performance (in addition to its use of more efficient samplers in HMC and NUTS). Our manual loop unroll pushes the JIT compilation idea further by requiring the compilation to occur for each new set of values of the problem dimensions (specifically K here). This is a minor nuisance but can offer performance gains, especially for vectorization.

5. Concluding Remarks

5.1. Summary

In this paper, we presented two opportunities for SIMD parallel MCMC sampling of probabilistic graphical models: 1) parallel calculation of the log-likelihood and its derivatives for exchangeable models with lack of conjugacy, such as GLM, and 2) parallel sampling of conditionally-independent nodes for graphs with discrete nodes and conjugacy, such as the Ising model. In addition, we offered strategies for efficient implementation of these opportunities on multi-core x86 processors, using a combination of vectorizing compilers and libraries and a series of high-level changes to the source code and runtime environment. These strategies included Partial Loop Fusion to minimize parallel overhead, NUMA-aware adjustments to minimize cross-socket communication, cache fusion to maximize cache utilization, and batch RNG for efficient random number generation and code vectorization.
For Generalized Linear Models, one of the most commonly-used classes of models in predictive analytics, our implementation of the log-likelihood function and its gradient vector reached within 25% of the hardware limit induced by total memory bandwidth in the big-data regime, outperforming a fully vectorized and (built-in) multi-threaded - but naive - implementation using Intel MKL by more than 5x. There are two conceptual extremes with regard to utilizing pre-compiled optimized libraries as components of an HPC application: 1) dropping in the pre-compiled library components as they are, and getting suboptimal aggregate results, or 2) discarding all pre-compiled components and writing everything from scratch, which can significantly raise the development costs of the project. Our research suggests a middle ground: high-level modifications layered on top of pre-compiled libraries, allowing for significant performance improvement while incurring only modest development costs.

5.2. Future Directions

There are several important areas for further research.

1. Portable Implementation of Optimization Techniques: In order to turn research into action, we must identify robust and portable methods for implementing our proposed optimization techniques. For example, our NUMA adjustment relied on the first-touch policy used by Linux. Also, in cache fusion, the optimal number of chunks is tied to the size of the L2 cache. Such hardware and OS dependencies must be addressed before a widely-usable library can be released.

2. High-Performance GLM Hessian: We offered an efficient implementation of the log-likelihood function and its first derivative for GLM.
However, due to their log-concavity [ref], GLMs are suitable for optimization and sampling techniques that utilize the second derivative - the Hessian matrix - of the log-likelihood (or log-posterior) function, such as Newton optimization (Nocedal and Wright (2006)) (equivalently known as Iteratively Re-Weighted Least Squares, or IRWLS, for GLM models) or RM-HMC sampling (Girolami and Calderhead (2011)). Developing an equally-efficient implementation of the Hessian for GLMs will be an important contribution to the field of (Bayesian and non-Bayesian) predictive analytics. We expect the Hierarchical Loop Structure framework of Section 4.2 to remain effective, but additional opportunities for optimization, such as taking advantage of the symmetry of the Hessian matrix, must be thoroughly studied.

3. Hierarchical Bayesian GLM: Many Bayesian applications of GLM involve an HB framework. Our research highlights the subtleties of combining the coarse- and fine-grained parallel modes available for HB GLMs. These subtleties, as well as additional complexities arising from highly-imbalanced regression groups, must be carefully studied and encoded into a high-performance library for MCMC sampling of HB GLM models (or into generic DAG compilers such as Stan).

4. Batch RNG: We demonstrated the value of batch RNG for improving the efficiency and vectorization of MCMC sampling for models such as Ising, RBM, and LDA. For complex distributions, existing libraries cannot be utilized in batch mode in the context of MCMC sampling, since the parameters of the distributions change from one iteration to the next. (This is in contrast to purely Monte Carlo - non-Markovian - settings, where the sampled distributions remain stationary throughout the simulation.) Developing batch RNG libraries - i.e.
libraries that generate buffers of random deviates for a set of core distributions, such as uniform and normal, and utilize these buffers for generating random deviates from complex distributions - will be an important contribution towards high-performance MCMC sampling for big-data applications such as automated, semantic analysis of web data.

5. Vectorization and Atomic Operations: Atomic operations arise in parallel sampling of conditionally-independent nodes (such as in non-collapsed LDA and in RBMs), where nodes may simultaneously attempt to update the same intermediate data structures. Until hardware support for vectorized atomic operations becomes available, software options can be explored. For example, one possibility is to accumulate updates in temporary buffers such that each vector instruction produces non-overlapping updates to the buffer. The higher flexibility offered by vector intrinsics might be required to implement such solutions.

References

Ackley, D. H., Hinton, G. E., Sejnowski, T. J., 1985. A learning algorithm for Boltzmann machines. Cognitive Science 9 (1), 147-169.
Ahmed, A., Aly, M., Gonzalez, J., Narayanamurthy, S., Smola, A. J., 2012. Scalable inference in latent variable models. In: Proceedings of the Fifth ACM International Conference on Web Search and Data Mining. ACM, pp. 123-132.
Antonio, K., Beirlant, J., 2007. Actuarial statistics with generalized linear mixed models. Insurance: Mathematics and Economics 40 (1), 58-76.
Babapulle, M. N., Joseph, L., Bélisle, P., Brophy, J. M., Eisenberg, M. J., 2004. A hierarchical Bayesian meta-analysis of randomised clinical trials of drug-eluting stents. The Lancet 364 (9434), 583-591.
Bache, K., Lichman, M., 2013. UCI Machine Learning Repository. URL http://archive.ics.uci.edu/ml
Bernardo, J. M., 1996. The concept of exchangeability and its applications. Far East Journal of Mathematical Sciences 4, 111-122.
Bishop, C. M., 2006.
Pattern Recognition and Machine Learning. Springer.

Blei, D. M., 2012. Probabilistic topic models. Communications of the ACM 55 (4), 77–84.

Bovet, D. P., Cesati, M., 2005. Understanding the Linux Kernel. O'Reilly Media, Inc.

Brockwell, A., 2006. Parallel Markov chain Monte Carlo simulation by pre-fetching. Journal of Computational and Graphical Statistics 15 (1), 246–261.

Broquedis, F., Furmento, N., Goglin, B., Namyst, R., Wacrenier, P.-A., 2009. Dynamic task and data placement over NUMA architectures: an OpenMP runtime perspective. In: Evolving OpenMP in an Age of Extreme Parallelism. Springer, pp. 79–92.

Brush, S. G., 1967. History of the Lenz-Ising model. Reviews of Modern Physics 39 (4), 883.

Bryant, R., O'Hallaron, D. R., 2011. Computer Systems: A Programmer's Perspective, Second Edition. Prentice Hall.

Carlin, B. P., Gelfand, A. E., Smith, A. F., 1992. Hierarchical Bayesian analysis of changepoint problems. Applied Statistics, 389–405.

Chandra, R., 2001. Parallel Programming in OpenMP. Morgan Kaufmann.

da Silva, A. R. F., 2011. cudaBayesreg: parallel implementation of a Bayesian multilevel model for fMRI data analysis. Journal of Statistical Software 44 (4), 1–24.

Descombes, X., Kruggel, F., von Cramon, D. Y., 1998. Spatio-temporal fMRI analysis using Markov random fields. IEEE Transactions on Medical Imaging 17 (6), 1028–1039.

Diaconescu, R. E., Zima, H. P., 2007. An approach to data distributions in Chapel. International Journal of High Performance Computing Applications 21 (3), 313–335.

Duane, S., Kennedy, A. D., Pendleton, B. J., Roweth, D., 1987. Hybrid Monte Carlo. Physics Letters B 195 (2), 216–222.

Dumont, M. D., 2011. Markov chain Monte Carlo on the GPU. Ph.D. thesis, Rochester Institute of Technology.

Fenton, N., Neil, M., 2004. Combining evidence in risk analysis using Bayesian networks. Agena White Paper W704.

Gelman, A., Hill, J., 2007. Data Analysis Using Regression and Multilevel/Hierarchical Models.
Cambridge University Press.

Geman, S., Geman, D., 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence (6), 721–741.

Gilks, W. R., Wild, P., 1992. Adaptive rejection sampling for Gibbs sampling. Applied Statistics, 337–348.

Girolami, M., Calderhead, B., 2011. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 73 (2), 123–214.

Gonzalez, J., Low, Y., Gretton, A., Guestrin, C., 2011. Parallel Gibbs sampling: From colored fields to thin junction trees. In: International Conference on Artificial Intelligence and Statistics. pp. 324–332.

Gonzalez, J. E., Low, Y., Gu, H., Bickson, D., Guestrin, C., 2012. PowerGraph: Distributed graph-parallel computation on natural graphs. In: Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI). pp. 17–30.

Gonzalez, J. E., Low, Y., Guestrin, C., O'Hallaron, D., 2009. Distributed parallel inference on large factor graphs. In: Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence. AUAI Press, pp. 203–212.

Hastie, T., Tibshirani, R., 1987. Non-parametric logistic and proportional odds regression. Applied Statistics 36 (3), 260–276.

Hastings, W. K., 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57 (1), 97–109.

Hinton, G. E., Osindero, S., Teh, Y.-W., 2006. A fast learning algorithm for deep belief nets. Neural Computation 18 (7), 1527–1554.

Hoffman, M. D., Gelman, A., 2011. The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. arXiv preprint.

Hopfield, J. J., 1982. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences 79 (8), 2554–2558.

Jeffers, J., Reinders, J., 2013.
Intel Xeon Phi Coprocessor High Performance Programming. Newnes.

Kirk, D. B., Wen-mei, W. H., 2012. Programming Massively Parallel Processors: A Hands-on Approach. Newnes.

Krizhevsky, A., Sutskever, I., Hinton, G. E., 2012. ImageNet classification with deep convolutional neural networks. In: NIPS. Vol. 1. p. 4.

Kumar, S., Kim, D., Smelyanskiy, M., Chen, Y.-K., Chhugani, J., Hughes, C. J., Kim, C., Lee, V. W., Nguyen, A. D., 2008. Atomic vector operations on chip multiprocessors. In: ACM SIGARCH Computer Architecture News. Vol. 36. IEEE Computer Society, pp. 441–452.

Le, Q. V., Zou, W. Y., Yeung, S. Y., Ng, A. Y., 2011. Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In: Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, pp. 3361–3368.

Lee, T. S., Mumford, D., 2003. Hierarchical Bayesian inference in the visual cortex. JOSA A 20 (7), 1434–1448.

Low, Y., Bickson, D., Gonzalez, J., Guestrin, C., Kyrola, A., Hellerstein, J. M., 2012. Distributed GraphLab: a framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment 5 (8), 716–727.

Majo, Z., Gross, T. R., 2011. Memory management in NUMA multicore systems: trapped between cache contention and interconnect overhead. In: ACM SIGPLAN Notices. Vol. 46. ACM, pp. 11–20.

Marsaglia, G., Tsang, W. W., 2000. A simple method for generating gamma variables. ACM Transactions on Mathematical Software (TOMS) 26 (3), 363–372.

McCool, M., Reinders, J., Robison, A., 2012. Structured Parallel Programming: Patterns for Efficient Computation. Elsevier.

McCulloch, C. E., 2006. Generalized linear mixed models. Wiley Online Library.

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., Teller, E., 1953. Equation of state calculations by fast computing machines. The Journal of Chemical Physics 21 (6), 1087–1092.

Metropolis, N., Ulam, S., 1949.
The Monte Carlo method. Journal of the American Statistical Association 44 (247), 335–341.

Mohamed, A.-r., Sainath, T. N., Dahl, G., Ramabhadran, B., Hinton, G. E., Picheny, M. A., 2011. Deep belief networks using discriminative features for phone recognition. In: Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on. IEEE, pp. 5060–5063.

Molina da Cruz, E. H., Alves, Z., Carissimi, A., Navaux, P. O. A., Ribeiro, C. P., Mehaut, J., 2011. Using memory access traces to map threads and data on hierarchical multi-core platforms. In: Parallel and Distributed Processing Workshops and Phd Forum (IPDPSW), 2011 IEEE International Symposium on. IEEE, pp. 551–558.

Neal, R. M., 1995. Bayesian learning for neural networks. Ph.D. thesis, University of Toronto.

Neal, R. M., 2003. Slice sampling. Annals of Statistics, 705–741.

Nelder, J. A., Baker, R., 1972. Generalized linear models. Wiley Online Library.

Newman, D., Asuncion, A., Smyth, P., Welling, M., 2009. Distributed algorithms for topic models. The Journal of Machine Learning Research 10, 1801–1828.

Nocedal, J., Wright, S., 2006. Numerical Optimization: Springer Series in Operations Research and Financial Engineering. Springer-Verlag.

Perez, P., et al., 1998. Markov random fields and images. CWI Quarterly 11 (4), 413–437.

Polikar, R., 2006. Ensemble based systems in decision making. Circuits and Systems Magazine, IEEE 6 (3), 21–45.

Press, W. H., 2007. Numerical Recipes 3rd Edition: The Art of Scientific Computing. Cambridge University Press.

Qi, Y., Minka, T. P., 2002. Hessian-based Markov chain Monte Carlo algorithms.

Ren, R., Orkoulas, G., 2007. Parallel Markov chain Monte Carlo simulations. The Journal of Chemical Physics 126 (21), 211102.

Ribeiro, C. P., Méhaut, J., Carissimi, A., 2010. Memory affinity management for numerical scientific applications over multi-core multiprocessors with hierarchical memory.
In: Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on. IEEE, pp. 1–4.

Ripley, B. D., 2009. Stochastic Simulation. Vol. 316. John Wiley & Sons.

Robert, C. P., Casella, G., 2004. Monte Carlo Statistical Methods. Vol. 319. Citeseer.

Rossi, P. E., Allenby, G. M., McCulloch, R. E., 2005. Bayesian Statistics and Marketing. J. Wiley & Sons.

Sharabiani, M. T. A., Vermeulen, R., Scoccianti, C., Hosnijeh, F. S., Minelli, L., Sacerdote, C., Palli, D., Krogh, V., Tumino, R., Chiodini, P., et al., 2011. Immunologic profile of excessive body weight. Biomarkers 16 (3), 243–251.

Smolensky, P., 1986. Information processing in dynamical systems: Foundations of harmony theory.

Sobehart, J. R., Stein, R., Mikityanskaya, V., Li, L., 2000. Moody's public firm risk model: A hybrid approach to modeling short term default risk. Moody's Investors Service, Global Credit Research, Rating Methodology, March.

Strid, I., 2010. Efficient parallelisation of Metropolis–Hastings algorithms using a prefetching approach. Computational Statistics & Data Analysis 54 (11), 2814–2835.

Sutter, J. M., Kalivas, J. H., 1993. Comparison of forward selection, backward elimination, and generalized simulated annealing for variable selection. Microchemical Journal 47 (1), 60–66.

Tibbits, M. M., Haran, M., Liechty, J. C., 2011. Parallel multivariate slice sampling. Statistics and Computing 21 (3), 415–430.

Wilkinson, D. J., 2010. Parallel Bayesian computation. In: Kontoghiorghes, E. J. (Ed.), Handbook of Parallel Computing and Statistics. CRC Press.

Xu, T., Ihler, A. T., 2011. Multicore Gibbs sampling in dense, unstructured graphs. In: International Conference on Artificial Intelligence and Statistics. pp. 798–806.

Yu, L., Xu, Y., 2009. A parallel Gibbs sampling algorithm for motif finding on GPU. In: Parallel and Distributed Processing with Applications, 2009 IEEE International Symposium on. IEEE, pp. 555–558.
