Using RngStreams for Parallel Random Number Generation in C++ and R
The RngStreams software package provides one viable solution to the problem of creating independent random number streams for simulations in parallel processing environments. Techniques are presented for effectively using RngStreams with C++ programs…
Authors: Andrew T. Karl, R, y Eubank
USING RNGSTREAMS F OR P ARALLEL RANDOM NUMBER GENERA TION IN C++ AND R ANDREW T. KARL A dsur go LLC, Denver, CO RAND Y EUBANK AND JELENA MILO V ANO VIC A ND MARK REISER AND DENNIS YOUNG Arizo na State University, T emp e, A Z Abstract. The RngStrea ms soft wa re pac k age provide s one viable solution to the problem of creating independen t random n umber streams for sim ulations in parallel pro cessing en vironments. T echniques are presen ted for effectiv ely using Rng Streams with C++ programs that are paralleli zed via OpenMP or MPI . W ays to access the backbone generator from RngStreams in R through the parallel and rst ream pack ages are also described. The ideas in the paper are illustrated with b oth a sim ple running example and a Monte Carlo int egration application. NOTICE This is a preprint of a n article app ear ing in Computational Statistics (in press). The final publication is av ailable at http://dx.doi.org/ 10.100 7/s00180-014-0492-3 1. Introduction Sim ulation studies are in many resp ects the ideal applicatio n for para llel pro cess- ing. They a re “naturally parallel” in the sense that co mputations may gener ally be conducted witho ut the nee d for inter-pro cesso r comm unication or mo difica tion of the bas e algorithm. Thu s, near linea r sp eedup can be realized in many cases that o ccur in practice. How ev er, in order for a division of lab or acr oss m ultiple pro cessor s to be pro ductive, the r andom num ber streams being used by each pro- cessor must b ehav e a s if they are indep e nden t in a pro babilistic sense. F ortunately , metho ds now exist that, when used prop erly , as sure the requis ite stream indep e n- dence. W e explore one s uch a ppr oach in this pap er that is av ailable thro ug h the RngStr eams softw are pac k age. The effect of stre a m dep endence is easy to demonstrate as s een in the output from an R sess ion exhibited below. > x < - N U L L > f o r ( i i n 1 : 2 0 0 ) { + s e t . s e e d ( 1 2 3 ) 1 2 KARL, EUBANK, MILOV ANOVIC, REIS ER, AND YOUNG + . R a n d o m . s e e d [ 3 : 6 2 6 ] < - a s . i n t e g e r ( . R a n d o m . s e e d [ 3 : 6 2 6 ] + + ( i - 1 ) * 1 0 0 ) + z T e m p < - r n o r m ( 1 0 0 0 ) + x < - c ( x , s q r t ( 1 0 0 0 ) * m e a n ( z T e m p ) ) + } > s e t . s e e d ( 1 2 3 ) > y < - N U L L > f o r ( i i n 1 : 2 0 0 ) { + z T e m p < - r n o r m ( 1 0 0 0 ) + y < - c ( y , s q r t ( 1 0 0 0 ) * m e a n ( z T e m p ) ) + } > c a t ( s h a p i r o . t e s t ( x ) $ p . v a l u e , s h a p i r o . t e s t ( y ) $ p . v a l u e , " \ n " ) 0 . 0 0 0 1 4 3 8 9 3 6 0 . 4 4 6 6 1 8 9 F ollowing along the lines o f Hechenleitner and En tacher (2003), we hav e ge nerated sets of 1 000 random n umbers from the standard normal distribution using 200 different random num b er streams deriv ing fro m R ’s default Mer senne Twister (MT) random uniform g enerator . The streams are constructed by using seeds (vectors of leng th 624) that a re spaced 1 00 units apart in each comp onent. The mean from each str eam is then rescaled to pro duce num b ers that hav e ostensibly b een sampled from a standard normal distribution. How ever, the Shapiro- Wilk test firmly rejects the normality hypothesis. In co n trast, data fro m a s imilar pro cess that p ermits the normal evolution o f the MT sta tes easily pas s the Shapiro- Wilk ev aluatio n. Although the rela tionship be t ween str eams will lik ely b e mu ch mor e complex in situations that arise in pra ctice, the message r emains the same: int er- stream dependence ca n sabota g e a para llel r a ndom n um b er generation scheme. Hill (20 10) reviews av aila ble methods for parallel random n umber g eneration, including random spa cing (workers are initialized to ra ndomly s pa ced p os itio ns on the p er io d o f the same genera tor by assigning differen t, rando mly gener ated see ds ), sequence splitting (dividing a s equence into non ov erlapping contiguous blocks), cycle divisio n (a more inv olved technique for sequence splitting where the p erio d of a generator is deter minis tica lly divided in to segments) and par ametrization (pa- rameters ass o ciated with a generator are v aried to produce differen t strea ms). The random spac ing metho d carries a risk of overlap b etw ee n r andom strea ms on differ- ent pro ce s sors with an unlucky sampling of seeds , although the r isk is small when using generators with large p erio ds. MT provides o ne example of a lo ng-p erio d generator with the perio d 2 19937 − 1 (Matsumoto and Nishim ur a, 1998). This featur e a long with its wides pread av ail- ability ha s made MT a p opula r choice for parallel generatio n via random spacing. In spite of this, there are reas ons that make the dev elopment and use of other parallel generation metho ds worthwhile. F o r example, it is generally advisa ble to rep eat simulations with different generato rs to ensure that the re s ults are not an artifact of the structure of a par ticula r gener ator (Gentle, 200 3). So, there is v alue in having o ther a v aila ble generator options fo r this purp o se. In addition, a method such as cycle div ision comes with a guar antee tha t the s treams will not co llide , whereas random spac ing only renders such even ts unlikely . Lemieux (2009), for example, advocates ag ainst r a ndom spacing and r ecommends using genera tors that are guaranteed not to ov erlap. As a case in p oint, she directs us to the work of Matsumoto et al. (2007) who sho w that many modern rando m n umber genera tors USING RNGSTREAMS WITH C++ AND R 3 exhibit corr elations if their initial states a re chosen using another linear gener a tor with a similar modulus. Issues that arise when using linea r c ongruential gener ators for pa r allel random nu mber generation hav e be en well do cumented. Random spac ing and s equence splitting may result in rando m num b e r streams that ex hibit undesirable dep endence prop erties (e.g., Cenacchi and De Matteis (1970), De Matteis and P agnutti (198 8), Anderson and Titterington (1970) , and Etacher (1999)). Similar results have bee n quantified for so me nonlinear generator s. F or example, De Matteis and Pagn utti (1990) s how that long r ange correlation is present in random n umber streams pro- duced b y sequence splitting with a ny generator of the form x n = f ( x n − 1 ) mo d 2 m if the generator is s uc h that the perio d halves when the mo dulus halves. Many of the problems with co ngruential generator s can b e av oided by using combined multiple recursive gener ators (CMR G) with cycle division. A MR G is roughly equiv alent to a mult iple recursive gener ator with a p erio d that can b e as large a s the pr o duct of the perio ds of the genera tors used in the c ombination (L’Ecuyer and T ezuk a , 19 91). When co m bined appropria tely (e.g., as in (3) b elow), the lattice structure inherent to mult iple recurs ive g enerators is effectiv ely blurred (L’Ecuyer , 1996). With a go od c hoice for the parameters (e.g., L’Ecuyer (1999)), these c o mb ined generator s have also far ed well when sub jected to sta tistical tests as in L’Ecuyer and Simar d (2007). As a case in p oint, the MRG32k3a gener ator that pr ovides the bac kb one g enerator for the RngStr eams pack ag e developed by L’Ecuyer et al. (20 01) is known to hav e go od theore tical a nd statistica l prop erties. It is cited b y Lemieux (2009), alo ng with MT, as a gener ator tha t ca n be “sa fely used”. Of course, cycle division be comes an o ption only if it ca n b e done efficiently . F o r m ultiple r ecursive genera tors L’Ecuyer (199 0) has shown that there are co mputable matrices tha t ca n be used to adv a nce the state o f the generator to an y sp e c ified po int in its a sso ciated random num b er stream. This featur e is what makes the combination of ge ne r ators ea s y to use in a cycle division para dig m. The Rng Stream s pack a ge g ives an implemen tation of this approach in the con text of its MRG32k3a generator . Haramoto et al. (20 08) hav e recently develop e d a jump a head method for MT. The num ber g eneration step is t wo to three times faster for each str eam than with Rng Stream s ; but, the sequence splitting its e lf is roug hly 1 000 times slow er. Another noteworthy pack a g e that provides functionality fo r pa rallel random nu mber gener ation is the SPRNG pack age of Masca gni a nd Sriniv asan (2000). In contrast to RngSt reams , it pro duces differ ent random n umber sequences for the pro cesses v ia par ametrization of a s ingle type of gener a tor. F or example, one of the pack a ge’s featured genera tors is a m ultiplicativ e lagged Fibonacci genera tor whose states fall in to 2 1007 different equiv a le nce classes o f generators (each of p erio d 2 81 ) that pro vide the differen t streams. In this pap er w e focus on the use of the RngSt reams pack ag e. There are s everal reasons for this. F o r example, L’E cuyer et al. (2001) in referenc e to the SPR NG pack- age state that it is “not supp orted by the sa me theoretical a nalysis for the qualit y and indep endence of the differe nt streams” as R ngStre ams . There are also practical difficulties that arise in using SPRNG due to its ties to MP I and a rather complex implemen tation that necessitates the crea tion and linking of a compiled libra ry . In 4 KARL, EUBANK, MILOV ANOVIC, REIS ER, AND YOUNG contrast, the R ngStr eams pack age is quite compact with all its so urce co de av ailable from http:// www.ir o.umontreal.ca/ ~ lecuye r/myf tp/streams00/ . The C++ implementation of RngStr eams provides an ob ject o riented framework where o b jects are created which have a n a sso ciated metho d Ra ndU01 that pro duces uniform random deviates. The ov erall r esult is that R ngStre ams is eas y to incorp o- rate into C/ C++ progra ms using Op enMP o r MPI thro ug h simple include direc tives. This prop erty extends its applications fr om cluster down to desktop environments. General tr e atment s o f parallel co mputing in R are provided by Schmidberger et al. (2009) and Eug s ter et al. (2011). In contrast to these pa pe r s, the R segments of this article are dir e cted tow ard the sp ecific ta sk of para llel random num b er generation in the language . In this r e gard, the RngS treams generator s be c ome av aila ble in R through the r stream pack a g e (Leydold, 201 2). They a re also the default generator for the pa rallel pack age that pr ovides multithreading capabilities and now comes as part of the R base soft ware. The go al o f this article is to illustra te how easily RngSt reams can be employ ed in parallel s ettings in C ++ with O penMP and MPI and in R (R Core T eam, 2013) through the rstrea m a nd paralle l pac k a g es. Our initia l presentation in volv es o nly the most ele mentary feature s of these APIs. How ever, it is our belie f tha t these simple progra ms contain fundamental techniques that can provide a foundation for safe parallel random num b er genera tion in muc h mor e complex co de that might evolve in practice. An application o f Monte Carlo integration in the context of an item trait model is used to suppor t this con tention. In the next section we desc rib e the bas ic features of the RngSt reams genera- tor. Then, in subsequent sections we addr ess its use in C++ and R . These sections are connected throug h a common example that allows us to illustr ate b oth the differences and similarities in us ing RngStr eams under the different APIs. 2. RngStreams The “ ba ckbone” genera tor for RngS treams , often r eferred to as MR G32k3 a , is a combination of the t wo multiple recursive genera tors that pro duce the states x 1 ,n = (14035 80 × x 1 ,n − 2 − 81 0 728 × x 1 ,n − 3 ) mo d (4 29496 7087) , (1) x 2 ,n = (52761 2 × x 2 ,n − 1 − 13 7 0589 × x 2 ,n − 3 ) mo d (4 29494 4443) (2) at the n th step of the recur sion given initial seeds ˜ x i, 0 = ( x i, − 2 , x i, − 1 , x i, 0 ) T , i = 1 , 2 . The tw o states are comb ined to pro duce the unifor m random devia te u n via the rule z n = ( x 1 ,n − x 2 ,n ) mo d 42949 67087 , (3) u n = z n / 4294 96708 8 , if z n > 0 , 42949 67087 / 4294967088 , if z n = 0 . This pa rticular genera tor is capable of pro ducing a ra ndom num b er stream of p erio d length (ro ughly) 2 191 under the c ondition that x 1 , 0 , x 1 , − 1 , x 1 , − 2 are not all 0 and a re all less than 429 49670 87 and x 2 , 0 , x 2 , − 1 , x 2 , − 2 are a ll less than 4294944 443 and not all 0 . F or use in a parallel context this long c y cle is div ide d into 2 64 non-ov erlapping streams eac h of length 2 127 . The key to accessing different streams pro duced b y (1)-(2) is the technique de- scrib ed in L’E cuyer (1990) that a llows mo vemen t b etw een str eams throug h linea r transformatio ns of the initial seeds using known matrices. Spec ifically , if ˜ x 1 , 0 , ˜ x 2 , 0 USING RNGSTREAMS WITH C++ AND R 5 are the initial seeds/sta tes for the tw o generator s, a t the n th step of the recur sion the state for generator i is ˜ x i,n = ( A n i mo d m i ) ˜ x i, 0 mo d m i , i = 1 , 2 , for kno wn 3 × 3 matr ices A 1 , A 2 with m 1 = 429496 7087 , m 2 = 429494 4443. The p ow er s of the matr ices A 1 , A 2 can be computed explicitly and, in particular the matrices A 127 1 , A 127 2 that ar e needed for mo vemen t betw een the diff erent strea ms of the ge n- erator hav e been calculated and are co n tained in an anonymous na mes pace in the RngStr eam.c pp file. This feature is used in the M PI section o f our pap er. The initial state of the pack ag e is set using the function SetPa ckage Seed that has prototype s t a t i c b o o l S e t P a c k a g e S e e d ( c o n s t u n s i g n e d l o n g s e e d [ 6 ] ) with se ed containing the six initial seeds for the ge ner ator that m ust b e s upplied by the user. The fir st RngS tream ob ject that is created will us e this initial seed. Subsequent o b jects will then b e started o n the next str eam; i.e., they will hav e initial “seeds” that hav e b een adv anced 2 127 states from the initial state of the previous ob ject’s g enerator . Our a im in the se quel is to p erfo rm some simple exp eriments that will illustrate ho w the s eeds are changed for the different pro cesses that a re working concurrently in a par allel c o mputing en vironment. 3. OpenMP OpenMP represents the standard API for shared memory pa r allel process ing. An ov erview of the v ar ious OpenMP functions is provided, e.g ., in Chapman et al. (2001). Here we will use only the API features that allow us to pa rallelize a blo ck of co de. The listing below illustrates the use o f Rn gStrea ms in an OpenMP pr ogra m. Listing 1. ranO penMP .cpp / / r a n O p e n M P . c p p # i n c l u d e < o m p . h > # i n c l u d e " R n g S t r e a m . h " # i n c l u d e < i o s t r e a m > i n t m a i n ( ) { i n t n P = o m p _ g e t _ n u m _ p r o c s ( ) ; o m p _ s e t _ n u m _ t h r e a d s ( n P ) ; / / s e t n u m b e r o f t h r e a d s u n s i g n e d l o n g s e e d [ 6 ] = { 1 8 0 6 5 4 7 1 6 6 , 3 3 1 1 2 9 2 3 5 9 , 6 4 3 4 3 1 7 7 2 , 1 1 6 2 4 4 8 5 5 7 , 3 3 3 5 7 1 9 3 0 6 , 4 1 6 1 0 5 4 0 8 3 } ; R n g S t r e a m : : S e t P a c k a g e S e e d ( s e e d ) ; R n g S t r e a m R n g A r r a y [ n P ] ; / / a r r a y o f R n g S t r e a m o b j e c t s i n t m y R a n k ; # p r a g m a o m p p a r a l l e l p r i v a t e ( m y R a n k ) { m y R a n k = o m p _ g e t _ t h r e a d _ n u m ( ) ; # p r a g m a o m p c r i t i c a l { s t d : : c o u t < < R n g A r r a y [ m y R a n k ] . R a n d U 0 1 ( ) < < " " ; } } s t d : : c o u t < < s t d : : e n d l ; r e t u r n 0 ; } 6 KARL, EUBANK, MILOV ANOVIC, REIS ER, AND YOUNG The first step is to include the Ope nMP hea der file omp. h via an include dir ec- tive. The num b er of pro ces sors nP is determined and us ed to set the num b er of threads (with omp get num pro cs and omp se t n um thread s , resp ectively). The SetPac kageS eed function is then used to set the seeds for the Rng Stream s pack age. This particula r choice of seeds was made to provide results that can b e compared to those o btained from R in Sectio n 5. The desired ra ndom deviates are pr o duced with an a rray o f Rn gStrea m o b jects: one ob ject for each thread. The RngStr eam class default constructor is designed so that the first ob ject will have the seed designated with Set Packag eSeed while subsequent ob jects will hav e initial states/seeds hav e b een adv anced 2 127 states from that of their pr e decessor. An alternative a pproach that one might try here is to simply crea te a unique RngStr eam o b ject for ea ch thread with a pri vate sp ecification. In suc h instances the default constructor will a lso b e called. How ever, we hav e exp erienc e d ca ses where a race condition can lead to threa ds with duplicate seeds. The ar ray a pproach ensures that all o b jects a re created co rrectly while effectively explo iting the shar ed memory feature that allo ws all thr e ads access to the a rray elements. A random devia te is g e nerated from each of the Rng Stream o b jects using the class metho d Rand U01 that in vok es the backbone gener ator. These v alues will ser ve as subsequen t comparison benchmarks. The actual generatio n of ra ndo m num ber s in the parallel reg ion is quite simple with each thread using the o b ject in Rng Array that corresp onds to its unique thread num b er as signed b y the Ope nMP API . The resulting v a lues are printed out from within a critic al re gion to av oid garbled output. W e compile and execute L isting 1 using the gnu compiler suite on a four pro cessor mac hine in the follo wing manner. $ g + + - f o p e n m p r a n O p e n M P . c p p R n g S t r e a m . c p p - o r a n O p e n M P $ . / r a n O p e n M P 0 . 3 4 1 1 0 6 0 . 3 1 2 3 9 9 0 . 1 6 6 3 7 4 0 . 1 4 9 4 3 3 4. MPI The Mess age Passing Interface or MPI provides a num b er of functions that can be used for inter-pro cessor co mm unication; see, e.g., Quinn (2003). Our de- velopmen t here will use only the simple communication paradigm where a master and worker pr o cesses trans mit and receive mess a ges via the M PI se nd and recv functions. As noted at the beginning o f this sectio n, the seeds/s tates of RngStr eam ob jects are adv a nced by multiplication inv olving known matrices that a re av a ilable as par t of the pack age. A function that ca n be used to carry out these multiplications is matVec ModM that has prototype v o i d M a t V e c M o d M ( c o n s t d o u b l e A [ 3 ] [ 3 ] , c o n s t d o u b l e s [ 3 ] , d o u b l e v [ 3 ] , d o u b l e m ) This function will p erform multiplication of a 3 × 3 a rray A times a three-element array s mo dulo m and return the result in the three-ele ment ar ray v . F or our purp oses we will take m to b e either m 1 = 429496 7 087 or m 2 = 4294 9 4443 , A to be either A 127 1 or A 127 2 and s to be a seed vector that was allo cated to a previous pro cess. The mo duli and matr ices (as well as MatVe cModM ) are cont ained in an USING RNGSTREAMS WITH C++ AND R 7 anonymous na mespace within RngStr eam.cp p under the names m1 , m2 , A1 p127 and A2p127 . Thus, the following function will ma nu ally a dv ance the input ar ray s eedIn 2 127 states to see dOut which o ccupies the same relative lo cation in the next stream. s t a t i c v o i d A d v a n c e S e e d ( u n s i g n e d l o n g s e e d I n [ 6 ] , u n s i g n e d l o n g s e e d O u t [ 6 ] ) { d o u b l e t e m p I n [ 6 ] ; d o u b l e t e m p O u t [ 6 ] ; f o r ( i n t i = 0 ; i < 6 ; i + + ) t e m p I n [ i ] = s e e d I n [ i ] ; M a t V e c M o d M ( A 1 p 1 2 7 , t e m p I n , t e m p O u t , m 1 ) ; M a t V e c M o d M ( A 2 p 1 2 7 , & t e m p I n [ 3 ] , & t e m p O u t [ 3 ] , m 2 ) ; f o r ( i n t i = 0 ; i < 6 ; i + + ) s e e d O u t [ i ] = t e m p O u t [ i ] ; } W e wra pp ed this function with the ne c e ssary supp orting co de from Rn gStrea m.cpp in a supplemental class RngSt reamSu pp whose header file RngStr eamSup p.h app ears in the c o de b elow. Our MPI implementation o f the Rn gStre ams methodo logy now takes the following form. Listing 2. ranMPI.cpp / / r a n M P I . c p p # i n c l u d e < i o s t r e a m > # i n c l u d e " m p i . h " # i n c l u d e " R n g S t r e a m . h " # i n c l u d e " R n g S t r e a m S u p p . h " i n t m a i n ( ) { u n s i g n e d l o n g s e e d [ 6 ] = { 1 8 0 6 5 4 7 1 6 6 , 3 3 1 1 2 9 2 3 5 9 , 6 4 3 4 3 1 7 7 2 , 1 1 6 2 4 4 8 5 5 7 , 3 3 3 5 7 1 9 3 0 6 , 4 1 6 1 0 5 4 0 8 3 } ; M P I : : I n i t ( ) ; / / s t a r t M P I i n t m y R a n k = M P I : : C O M M _ W O R L D . G e t _ r a n k ( ) ; / / g e t p r o c e s s r a n k i n t n P = M P I : : C O M M _ W O R L D . G e t _ s i z e ( ) ; / / g e t n u m b e r o f p r o c e s s e s i f ( m y R a n k = = 0 ) { / / g e n e r a t e d e v i a t e f o r m a s t e r p r o c e s s R n g S t r e a m : : S e t P a c k a g e S e e d ( s e e d ) ; R n g S t r e a m R n g ; d o u b l e U = R n g . R a n d U 0 1 ( ) ; / / n o w s e n d s e e d s t o t h e o t h e r p r o c e s s e s u n s i g n e d l o n g t e m p S e e d [ 6 ] ; f o r ( i n t i = 1 ; i < n P ; i + + ) { R n g S t r e a m S u p p : : A d v a n c e S e e d ( s e e d , t e m p S e e d ) ; M P I : : C O M M _ W O R L D . S e n d ( t e m p S e e d , 6 , M P I : : U N S I G N E D _ L O N G , i , 0 ) ; f o r ( i n t j = 0 ; j < 6 ; j + + ) s e e d [ j ] = t e m p S e e d [ j ] ; } s t d : : c o u t < < U < < " " ; / / c o l l e c t t h e o t h e r r a n d o m d e v i a t e s f o r ( i n t i = 1 ; i < n P ; i + + ) { M P I : : C O M M _ W O R L D . R e c v ( & U , 1 , M P I : : D O U B L E , i , 0 ) ; s t d : : c o u t < < U < < " " ; } s t d : : c o u t < < s t d : : e n d l ; } e l s e { 8 KARL, EUBANK, MILOV ANOVIC, REIS ER, AND YOUNG u n s i g n e d l o n g m y S e e d [ 6 ] ; / / r e c e i v e y o u r s e e d i f y o u a r e n o t t h e m a s t e r p r o c e s s M P I : : C O M M _ W O R L D . R e c v ( m y S e e d , 6 , M P I : : U N S I G N E D _ L O N G , 0 , 0 ) ; R n g S t r e a m : : S e t P a c k a g e S e e d ( m y S e e d ) ; R n g S t r e a m R n g ; d o u b l e U = R n g . R a n d U 0 1 ( ) ; / / s e n d d e v i a t e t o m a s t e r p r o c e s s M P I : : C O M M _ W O R L D . S e n d ( & U , 1 , M P I : : D O U B L E , 0 , 0 ) ; } M P I : : F i n a l i z e ( ) ; / / e n d M P I r e t u r n 0 ; } The idea behind L is ting 2 is str aightforw ard. The master pro cess uses an ini- tial seed in c o njunction with Advan ceSeed to determine seeds that will pro duce “indep endent ” ra ndo m num b er streams for the other pro ces ses. These seeds are communicated to the pro cesses using the send and recv functions. The pro cesses then use them to initialize their particular RngStr eam ob jects and genera te a ran- dom uniform v alue. The comm unication pro cess is then reversed and the work er pro cesses send their random num ber s back to the master for prin ting. A t ypical compile and run sequence for o ur MPI pr ogram migh t loo k lik e $ m p i c x x r a n M P I . c p p R n g S t r e a m . c p p R n g S t r e a m S u p p . c p p - o r a n M P I $ m p i e x e c - n p 4 r a n M P I 0 . 1 6 6 3 7 4 0 . 3 4 1 1 0 6 0 . 3 1 2 3 9 9 0 . 1 4 9 4 3 3 This output agrees w ith our prior results from the OpenMP program. An alternative a pproach that lea ds to the same result as in r anMPI. cpp is to mimic the array form ulation that was used in ranOpen MP.cpp . This reduces the amount of communication with the exp ense of more memor y usage fo r the individual pro cesses. W e illustrate this approach in Section 6. 5. RngStreams in R The Rn gStrea ms backbone gener ator is now one of the r andom num ber gener ator options in R . F or example, > R N G k i n d ( " L ' E c u y e r - C M R G " ) > s e t . s e e d ( 1 2 3 ) > r u n i f ( 1 ) [ 1 ] 0 . 1 6 6 3 7 4 2 sets the R uniform random num b er gene r ator as MRG32k3a, sets the session’s seed and repro duces one of the rando m deviates fro m the pr evious t w o sections. A t this point it is p erhaps worth while to mention how we have aligned the s eeds that are b eing used in our R a nd C++ co de. The se t.see d function in R uses a single int eger to s et the seed for the current uniform r andom nu mber genera tor. The result can then be accessed through the integer vector .Random .seed . In terms of our example with the MRG32k3a gener ator, the vector ha s seven elements: the fir s t one indicates the type o f generator and the remaining six are the seeds. Th us, > s e t . s e e d ( 1 2 3 ) > c a t ( . R a n d o m . s e e d [ 2 : 7 ] , " \ n " ) 1 8 0 6 5 4 7 1 6 6 - 9 8 3 6 7 4 9 3 7 6 4 3 4 3 1 7 7 2 1 1 6 2 4 4 8 5 5 7 - 9 5 9 2 4 7 9 9 0 - 1 3 3 9 1 3 2 1 3 reveals R ’s r epresentation for the six seeds that are r e q uired for MRG32k3a. Initia lly the pre s ence of negative in tegers seems a bit puzzling; the Rng Stream s seeds are USING RNGSTREAMS WITH C++ AND R 9 stored a s uns igned long in the co de for the pa ck ag e. Howev er , in R they are treated as 32-bit unsigned int egers internally but o therwise viewed as signed in tegers with a t wo’s-complement representation. Th us, one must add 2 32 to a ny negative int egers that a ppea r in .Ran dom.s eed to see the actual seeds that were used with the generator. F or example, > . R a n d o m . s e e d [ 3 ] + 2 ^ { 3 2 } [ 1 ] 3 3 1 1 2 9 2 3 5 9 gives us the second seed that app ear ed in our C+ + co de. Althoug h this may appe ar to be an esoteric detail, it is essen tial for cross-langua ge output comparison. Another way to access the MRG32k3a gener ator is through the rs tream pack a ge that allows us to c reate instances of a der ived class rstre am.mrg 32k3a for the back- bo ne g enerator. By default, subsequent rstr eam.mr g32k3a ob jects a re initialized to the next str eam. In fact, rstr eam will return an er ror if more than one genera tor requests a sp ecific seed in the same R session. T o ov e rride this behavior one m ust sp ecify the option force.s eed = TRUE . The following co de initializes four genera- tors and use s them with the rst ream. sample function to each pro duce a uniform random deviate. > l i b r a r y ( r s t r e a m ) > r n g L i s t < - c ( n e w ( " r s t r e a m . m r g 3 2 k 3 a " , s e e d = c ( 1 8 0 6 5 4 7 1 6 6 , + 3 3 1 1 2 9 2 3 5 9 , 6 4 3 4 3 1 7 7 2 , 1 1 6 2 4 4 8 5 5 7 , 3 3 3 5 7 1 9 3 0 6 , + 4 1 6 1 0 5 4 0 8 3 ) , f o r c e . s e e d = T R U E ) , + r e p l i c a t e ( 3 , n e w ( " r s t r e a m . m r g 3 2 k 3 a " ) ) ) > s a p p l y ( r n g L i s t , r s t r e a m . s a m p l e ) [ 1 ] 0 . 1 6 6 3 7 4 2 0 . 3 4 1 1 0 6 4 0 . 3 1 2 3 9 9 3 0 . 1 4 9 4 3 3 4 Note that we ha ve a gain pro duced the four indep endent streams that ar ose in the previous t wo sections . The MRG32k3a generator is a built in fixture of the para llel pack age that o ffer s essentially the sa me parallel computing functionality as can be obtained separately from the sn ow (Tierney et al., 201 3) and multicor e (Urbanek, 2011) pack age s . In addition, paral lel is now bundled within the R base distribution which sugges ts it will be one of the key ingredients in the evolution of R ’s future pa r allel pro cessing capabilities. The function ma keClus ter from the parall el pack age will cr eate a (default, so ck et) cluster whose seed ca n then b e set with cluste rSetR NGStream . The co de below illustrates this proce ss using a cluster of size four. > l i b r a r y ( p a r a l l e l ) > c l < - m a k e C l u s t e r ( 4 ) > c l u s t e r S e t R N G S t r e a m ( c l , 1 2 3 ) > p a r S a p p l y ( c l , r e p ( 1 , 4 ) , r u n i f ) [ 1 ] 0 . 1 6 6 3 7 4 2 0 . 3 4 1 1 0 6 4 0 . 3 1 2 3 9 9 3 0 . 1 4 9 4 3 3 4 > s t o p C l u s t e r ( c l ) The rando m n umbers were genera ted using p arSapp ly that gives a parallel ver- sion of sappl y ; there ar e para llel versions of la pply and apply as well as se veral other related o ptions. Upon completion of our task, the clus ter is ter minated with stopCl uster . The output co nfirms that w e ar e us ing the MR G32k3 a generator and that the s eeds are being adv ance d corr ectly for each no de in the cluster. The other wa y tha t parallel provides multithreading ability (for non-Windows machines) is through mclap ply , which furnishes a par allelized v ersio n of lappl y 10 KARL, EUBANK, MILOV ANOVIC, REIS ER, AND YOUNG suited for a sha red memory environmen t. An attempt to replica te o ur c lus ter r esults using this approach pr o duces > l i b r a r y ( p a r a l l e l ) > R N G k i n d ( " L ' E c u y e r - C M R G " ) > s e t . s e e d ( 1 2 3 ) > u n l i s t ( m c l a p p l y ( r e p ( 1 , 3 ) , r u n i f ) ) [ 1 ] 0 . 3 4 1 1 0 6 4 0 . 3 1 2 3 9 9 3 0 . 9 7 1 2 7 2 7 > r u n i f ( 1 ) [ 1 ] 0 . 1 6 6 3 7 4 2 which illustrates that the seeds for new str eams ar e not b eing adv anced as exp ected. Note that the call mclapply(rep(1,3),runif ) pro duces thr ee calls to runif with n=1 and collects the r esults in a lis t. This issue is easy enoug h to fix with Listing 3. mclapply with built-in L’Ecuyer CMRG > l i b r a r y ( p a r a l l e l ) > R N G k i n d ( " L ' E c u y e r - C M R G " ) > s e t . s e e d ( 1 2 3 ) > u n l i s t ( m c l a p p l y ( r e p ( 1 , 3 ) , r u n i f , m c . c o r e s = 3 ) ) [ 1 ] 0 . 3 4 1 1 0 6 4 0 . 3 1 2 3 9 9 3 0 . 1 4 9 4 3 3 4 > r u n i f ( 1 ) [ 1 ] 0 . 1 6 6 3 7 4 2 It is imp ortant to s pec ify the desire d num b er of additional cores (beyond the master pro cess) via the mc.co res option. This v alue is not a utomatically set to the length of the first a rgument of mcl apply . On Windows platforms, mc.cor es must be set to 1 since mcla pply relies on the Unix fork comma nd, whic h is not av ailable in Windows. It is p ossible to gain mor e dir e c t control over the b ehavior of the random streams on each core using the rstr eam pack age. This provides a v aluable option when repro ducibility acro ss multiple languages and pla tforms is desired. F or example, Listing 4. mclapply with rstream > l i b r a r y ( r s t r e a m ) > l i b r a r y ( p a r a l l e l ) > r n g L i s t < - c ( n e w ( " r s t r e a m . m r g 3 2 k 3 a " , + s e e d = c ( 1 8 0 6 5 4 7 1 6 6 , 3 3 1 1 2 9 2 3 5 9 , 6 4 3 4 3 1 7 7 2 , + 1 1 6 2 4 4 8 5 5 7 , 3 3 3 5 7 1 9 3 0 6 , 4 1 6 1 0 5 4 0 8 3 ) , + f o r c e . s e e d = T R U E ) , + r e p l i c a t e ( 3 , n e w ( " r s t r e a m . m r g 3 2 k 3 a " ) ) ) > u n l i s t ( m c l a p p l y ( r n g L i s t , r s t r e a m . s a m p l e , m c . c o r e s = 4 ) ) [ 1 ] 0 . 1 6 6 3 7 4 2 0 . 3 4 1 1 0 6 4 0 . 3 1 2 3 9 9 3 0 . 1 4 9 4 3 3 4 returns the results we had anticipated. On Windo ws platforms, where mc.core s m ust be 1, the pro g ram in Listing 3 will pro duce num ber s fro m a sing le str e am fro m RngStr eams . By contrast, the co de from Listing 4 will faithfully r epro duce the r e- sults fro m a multicore non-Windows platform during serial execution on a Windows machine. A simila r approach of explicitly creating the str e ams with rstr eam ma y also be used with the cluster approach in order to maximize transparency . USING RNGSTREAMS WITH C++ AND R 11 6. An applica tion In this section the basic ideas in Sections 3 – 5 are applied to a r eal pro blem inv o lving Mon te Car lo in tegration. The r esulting co de listings now b ecome rather lengthy with the consequence that w e can only discuss a few key a s pec ts here. The s etting is that of Bo ck a nd Lieb erma n (1970) whe r e a latent trait mo del (L TM) is b eing fit to data o btained from sub ject res p o ns es on the LSA T . Ther e are five dic hotomous v a riables Y = ( Y 1 , . . . , Y 5 ) whose observed v alues y = ( y 1 , . . . , y 5 ) pro duce 2 5 = 3 2 p o ssible resp onse patterns. If y is one suc h pattern, it is assumed that the conditional probability of seeing y given the v alue z o f a laten t v ariable Z ∼ N (0 , 1) tak es the form (4) p ( y | z ) = Π 5 i =1 p i ( z ) y i (1 − p i ( z )) 1 − y i , where (5) p i ( z ) = 1 / (1 + exp { α i + β z } ) with α = ( α 1 , . . . , α 5 ) a vector of v ariable sp ecific intercepts a nd β a co mmon slop e parameter. This leads to p ( y ) := E z [ p ( y | z )] = Z ∞ −∞ p ( y | z ) φ ( z ) dz (6) with φ the s ta ndard normal density . W e will fo cus on approximation of this integral via Monte Carlo metho ds. Other int egra ls s uch as those obtained by differentiation of (6) with respect to α and β tha t arise in the computation of ma rginal maximum likelihoo d estimators can b e handled ana logously . The choice of α, β is arbitrary for our purp oses a nd, accor dingly , we take them to b e the marginal likelihoo d estimates returned from the gr m function in the R ltm pack a ge (Rizop oulos, 200 6); it is these v alues that w ill be seen in our co de. What requires somewhat more justification is our fo cus on an example with a sin- gle latent v ar iable. In general, there ma y be man y latent v ar ia bles in an L TM that make the integral in (6) multidim ensional. As detailed in Section 1.1 of Lemieux (2009), the num b er of p oints in pro duct-rule qua drature must gr ow exponentially with the dimens io n of the in tegrand to main tain a constant ra te of decay for the approximation error. In contrast, the exp ected error from a Monte Carlo in tegral estimator is indep endent of dimensio n and is a lwa ys of o rder 1 / √ n with n the num- ber of sampling points. Thus, o ne concludes that Mon te Carlo int egr a tion is mo r e appropria te in higher dimensions and, for ex ample, univ ariate in tegrals a r e b etter and mo r e pr ecisely ev a luated by deterministic quadra ture. O n the other hand, the basic pre mis e that underlies Monte Carlo in tegration re ma ins the same reg a rdless of the dimension: rep eatedly gener ate v alues from a probability distribution, ev al- uate the in tegrand a t these v alues and av er age the results in some general sense. Consequently , there is little conceptua l do wnside to our use of a sing le laten t v a ri- able in this insta nce while the payoff in terms o f co de length and mathematical simplification is rather substan tial. F or any s pecifie d pattern y , a Monte Ca rlo estimator of (6) will typically have the form (7) probF unc( y ) = n X i =1 W ( z i ) p ( y | z i ) 12 KARL, EUBANK, MILOV ANOVIC, REIS ER, AND YOUNG for z 1 , . . . , z n standard normal deviates and W a suitable weight function. The serial asp ect of our computation problem is primarily ab out ev aluatio n of (7). W e will therefore de a l with that issue first b efore turning to the pa rallelization step. In bo th cases the R and C ++ c o de will b e developed in tandem to b etter illustrate the similarities and differ ences that a rise when attempting to solve this type of problem in the tw o different lang uages. The C++ co de (included in funcl ass.cp p in the supplementary materia l) that ev aluates (7) is Listing 5. probF unc in C++ T N T : : A r r a y 1 D < d o u b l e > f u n C l a s s : : p r o b F u n c ( c o n s t T N T : : A r r a y 1 D < d o u b l e > & Z , c o n s t T N T : : A r r a y 2 D < i nt > & p a t n ) { / / p a t t e r n p r o b a b i l i t y a r r a y u s e d i n i n t e r m e d i a t e c a l c u l a t i o n s T N T : : A r r a y 2 D < d o u b l e > p r o b M a t ( p a t n . d i m 1 ( ) , p a t n . d i m 2 ( ) , 0 . ) ; / / w o r k a r r a y T N T : : A r r a y 1 D < d o u b l e > t e m p ( p a t n . d i m 1 ( ) , 0 . ) ; / / a r r a y t h a t w i l l h o l d t h e c e l l p r o b a b i l i t i e s T N T : : A r r a y 1 D < d o u b l e > p r o b ( p a t n . d i m 1 ( ) , 0 . ) ; / / s u m a c r o s s n o r m a l d e v i a t e s f o r ( i n t k = 0 ; k < Z . d i m ( ) ; k + + ) { p r o b M a t = p r o b M a t F u n c ( Z [ k ] , p a t n ) ; / / m u l t i p l y t o g e t t h e p r o b a b i l i t y f o r e a c h p a t t e r n f o r ( i n t i = 0 ; i < p a t n . d i m 1 ( ) ; i + + ) { t e m p [ i ] = 1 . ; f o r ( i n t j = 0 ; j < p a t n . d i m 2 ( ) ; j + + ) t e m p [ i ] * = p r o b M a t [ i ] [ j ] ; t e m p [ i ] * = 2 . * d n o r m ( Z [ k ] ) / ( d n o r m ( Z [ k ] / 2 . ) * ( d o u b l e ) Z . d i m ( ) ) ; } p r o b + = t e m p ; } r e t u r n p r o b ; } This function takes and retur ns arguments from the T emplate Numeric a l T o o lkit (TNT) that provides implemen tations of vector (as Arra y1D ob jects) and ma tr ix (as Array 2D ob jects) classes with ov erloaded addition, multiplication, etc., op er- ators. These array cla sses are implemen ted as templates which has the conse- quence tha t a ll co de is placed in header files that ca n be downloaded directly from http:/ /math .nist.gov/tnt . The call to p robMat Func that app ear s here corresp onds to a function that ev a lu- ates p i ( z ) y i (1 − p i ( z )) 1 − y i across all (32 in this instance ) p ossible re s po nse patter ns for a given v alue of z and returns the result as an array with columns repr e sentin g the different v ar iables in the mo del. The v alues that are us e d for the z argument a r e passed into the function in the Array1 D ob ject Z . Thes e a re chosen to be pseudo- random normal deviates with 0 mean and standar d devia tion 2. This feature in conjunction with the w eights that are applied at the end o f the most interior (or i ) lo o p pro duces a s imple importance sampler estimate of the tar get integral. The co de for probFun c , probM atFunc a nd several utilit y functions (e.g., the standard normal density and quantile function designa ted as dno rm and qnorm , resp ectively , USING RNGSTREAMS WITH C++ AND R 13 in o ur co de) hav e b een collec ted into a C++ class funCla ss that con tains a clas s mem b er to hold the v alue s of the co efficients. The R version of Listing 5 (see L TM parall el.R in the supplemen t) is embo died in three functions. Fir st, there is > P F u n c < - f u n c t i o n ( z , c o e f V e c , p a t n ) { + n c e l l s < - n r o w ( p a t n ) + p r o b M a t < - P r o b M a t F u n c ( z , c o e f V e c , p a t n ) + p < - m a t r i x ( 0 , n c e l l s , 1 ) + # m u l t i p l y t h e p r o b a b i l i t i e s a c r o s s v a r i a b l e s + p < - a s . m a t r i x ( d n o r m ( z ) * a p p l y ( p r o b M a t , 1 , p r o d ) , n c e l l s , 1 ) + p + } that ev a lua tes (4) across all r esp onse patterns (i.e., y v alues) for a given v alue of the laten t v ar iable (i.e., z ). The ca ll to ProbMa tFunc in this instance is for the R analog of its C++ names ake in Listing 5. The patn arg umen t is a n input array that holds the resp onse patterns as in the C++ case. How ever, in co nt ras t to the C ++ developmen t, in this instance the co efficient vector that is needed for ev alua ting (5) is tr eated as an input v ariable, named coef Vec , that is passed to this and other functions that use it dir ectly from a driv er program. Next, the function PFunc is vectorized via > V e c P F u n c < - V e c t o r i z e ( P F u n c , v e c t o r i z e . a r g s = " z " ) to ha ndle the arr ay case and thereby allow us to use it with ap ply for “av eraging ” across normal deviates as in Listing 6. ProbF unc in R > P r o b F u n c p a r L a p p l y < - f u n c t i o n ( n S i m , n P , c o e f V e c , p a t n ) { + Z < - q n o r m ( r u n i f ( r o u n d ( n S i m / n P , 0 ) ) , 0 , 2 ) + p r o b < - c o l M e a n s ( a p p l y ( V e c P F u n c ( Z , c o e f V e c = c o e f V e c , + p a t n = p a t n ) , + 1 , f u n c t i o n ( y ) y / d n o r m ( Z , 0 , 2 ) ) ) + } The Z a rray that app ear s in this listing is, again, a vector of pseudo-r andom no rmal deviates with 0 mean and standa rd deviation 2 that is used in the same weigh ting scheme as for the C++ code. The C ++ MPI co de (av a ilable from dr iverLT M mpi.cp p in the supplement) that calculates (7) in parallel lo o k s lik e R n g S t r e a m : : S e t P a c k a g e S e e d ( s e e d ) ; R n g S t r e a m R n g A r r a y [ n P ] ; / / a r r a y f o r r a n d o m n u m b e r s T N T : : A r r a y 1 D < d o u b l e > Z ( n S i m [ m y R a n k ] , 0 . ) ; / / f u n C l a s s o b j e c t f u n C l a s s f ( p A l p h a , b e t a , p a t n . d i m 2 ( ) ) ; / / a r r a y t h a t w i l l h o l d a p p r o x i m a t e p r o b a b i l i t i e s T N T : : A r r a y 1 D < d o u b l e > p r o b ( p a t n . d i m 1 ( ) , 0 . ) ; / / a p p r o x i m a t e t h e c e l l p r o b a b i l i t i e s o n e a c h p r o c e s s f o r ( i n t i = 0 ; i < n S i m [ m y R a n k ] ; i + + ) Z [ i ] = 2 . * f . q n o r m ( R n g A r r a y [ m y R a n k ] . R a n d U 0 1 ( ) ) ; 14 KARL, EUBANK, MILOV ANOVIC, REIS ER, AND YOUNG p r o b = f . p r o b F u n c ( Z , p a t n ) ; / / n o w g a t h e r u p t h e r e s u l t s o n t h e m a s t e r p r o c e s s d o u b l e * t e m p ; t e m p = n e w d o u b l e [ n P * p a t n . d i m 1 ( ) ] ; d o u b l e * p D a t a _ = n e w d o u b l e [ p a t n . d i m 1 ( ) ] ; f o r ( i n t i = 0 ; i < p a t n . d i m 1 ( ) ; i + + ) p D a t a _ [ i ] = p r o b [ i ] ; M P I : : C O M M _ W O R L D . G a t h e r ( p D a t a _ , p a t n . d i m 1 ( ) , M P I : : D O U B L E , t e m p , p a t n . d i m 1 ( ) , M P I : : D O U B L E , 0 ) ; i f ( m y R a n k = = 0 ) { s t d : : c o u t < < " I n t e g r a t i o n r e s u l t s : " < < s t d : : e n d l ; f o r ( i n t i = 0 ; i < p r o b . d i m ( ) ; i + + ) { f o r ( i n t j = 1 ; j < n P ; j + + ) p r o b [ i ] + = t e m p [ j * p r o b . d i m ( ) + i ] ; p r o b [ i ] / = ( d o u b l e ) n P ; s t d : : c o u t < < p r o b [ i ] < < s t d : : e n d l ; } s t d : : c o u t < < " t i m e w i t h " < < n P < < " p r o c e s s e s w a s " < < M P I : : W t i m e ( ) - w t i m e < < s t d : : e n d l ; } Here we hav e employ ed the scheme fr o m Section 3 for o btaining indep endent streams: eac h pro cesses creates the same array of Rng Stream ob jects and then uses the array element that co rresp ond to its r ank. The num b er of normal deviates that m ust b e generated by the different pr o cesses is por tioned out in an array nS im that is used in the same manner. The funClas s ob ject f then provides the means for each pro cess to transform the uniform dev iates obtaine d from the RngStre am ob ject into standard normals and ca ll prob Func to approximate the integral. The use of a co mmon RngStream o b ject arr ay ha s allow ed us to circum ven t the need for sending informatio n to the pro ce s ses. It is only necessary to c o llect up the results they pro duce. One means to accomplish this directly , without lo oping, is by use of MP I ’s Gat her function whic h is the path we ha ve chosen in this instance. All pro cesses call this function with a po int er argument for the quantit y that is to be pa ssed to the master (or 0) pro cess. This data is then re ceived (b y the master) in a p ointer o f dimension sufficient to hold the collective r esult (i.e., the pro duct of the num ber of pro cesses nP a nd the leng th prob.dim( ) of the prob array). The po int er tha t is passed from the w orker pro cesses needs to b e o ne that p oints to the actual data that is held in the prob a rray . This resides in a p ointer named data that is a priv ate member of the Array1D class. How ever, its conten t is accessible through an ov erloaded [] op erator and we used this fea ture to ma n ually transfer the information in data to the po inter pDat a that was passed to G ather . Once all the integral appro ximations ar e collected, the master pro cess averages them acros s pro cesses to obtain the final approximation. The OpenMP analo g of our MPI co de is shown in the next listing, and is a v a ilable from driverL TM openmp .cpp in the supplemen t. / / p o i n t e r t o A r r a y 1 D o b j e c t s t h a t w i l l h o l d o u t p u t / / f r o m e a c h p r o c e s s T N T : : A r r a y 1 D < d o u b l e > * t e m p V e c = n e w T N T : : A r r a y 1 D < d o u b l e > [ n P ] ; USING RNGSTREAMS WITH C++ AND R 15 / / p o i n t e r f o r r a n d o m n u m b e r s a n d f u n C l a s s o b j e c t d o u b l e * p Z ; f u n C l a s s * p f ; / / i n i t i a l i z e a r r a y f o r c e l l p r o b a b i l i t i e s T N T : : A r r a y 1 D < d o u b l e > p r o b ( p a t n . d i m 1 ( ) , 0 . ) ; / / a p p r o x i m a t e t h e c e l l p r o b a b i l i t i e s # p r a g m a o m p p a r a l l e l p r i v a t e ( m y R a n k , p Z , p f ) { m y R a n k = o m p _ g e t _ t h r e a d _ n u m ( ) ; p Z = n e w d o u b l e [ n S i m [ m y R a n k ] ] ; p f = n e w f u n C l a s s ( p A l p h a , b e t a , p a t n . d i m 2 ( ) ) ; f o r ( i n t i = 0 ; i < n S i m [ m y R a n k ] ; i + + ) p Z [ i ] = 2 . * p f - > q n o r m ( R n g A r r a y [ m y R a n k ] . R a n d U 0 1 ( ) ) ; t e m p V e c [ m y R a n k ] = p f - > p r o b F u n c ( T N T : : A r r a y 1 D < d o u b l e > ( n S i m [ m y R a n k ] , p Z ) , p a t n ) ; d e l e t e [ ] p Z ; d e l e t e p f ; } f o r ( i n t i = 0 ; i < n P ; i + + ) { p r o b + = t e m p V e c [ i ] ; } s t d : : c o u t < < " I n t e g r a t i o n r e s u l t s : " < < s t d : : e n d l ; f o r ( i n t i = 0 ; i < p a t n . d i m 1 ( ) ; i + + ) { p r o b [ i ] / = ( d o u b l e ) n P ; s t d : : c o u t < < p r o b [ i ] < < s t d : : e n d l ; } s t d : : c o u t < < " t i m e w i t h " < < n P < < " p r o c e s s e s w a s " < < o m p _ g e t _ w t i m e ( ) - w t i m e < < s t d : : e n d l ; An array of RngSt ream ob jects is used to generate the random uniforms for each pro cess exactly as in Section 3. What differs from the MPI treatment is the way we have used p ointers for the funClass ob ject, the ar ray of standa r d normals , and storage of the output from p robFun . By designating the first t wo of these p ointers as private in the p rivate cla use at the start of the parallel section, we a void a ny race c onditions that migh t a rise in ca lling c o nstructors. Each thread can safely create the ob ject a nd ar r ay they will need to perfor m their task. The now familia r mapping of a pr o cesses’ rank to a n array ele men t index is used to store the A rray1 D ob jects that each thread pr o duces with probFu nc in a safe lo cation provided by the Array1 D p ointer. After all threads colla ps e back to the ma s ter at the end of the parallel reg ion, their output is av eraged using an ov e r loaded addition assignment op erator for the Array1D class . In the R setting our pa rallel implemen tation takes the form > l i b r a r y ( p a r a l l e l ) > D r i v e r L T M p a r L a p p l y < - f u n c t i o n ( p a t n , c o e f V e c , n S i m , n P ) { + c l < - m a k e C l u s t e r ( n P ) + c l u s t e r S e t R N G S t r e a m ( c l , 1 2 3 ) 16 KARL, EUBANK, MILOV ANOVIC, REIS ER, AND YOUNG + c l u s t e r E x p o r t ( c l , c ( " p a t n " , " c o e f V e c " , " n S i m " , " n P " ) ) + c l u s t e r E x p o r t ( c l , c ( " P r o b M a t F u n c " , " P F u n c " , " V e c P F u n c " , + " P r o b F u n c p a r L a p p l y " ) ) + r e s u l t < - p a r L a p p l y ( c l , s e q _ l e n ( n P ) , + f u n c t i o n ( . . . ) P r o b F u n c p a r L a p p l y ( n S i m , + n P , c o e f V e c , p a t n ) ) + s t o p C l u s t e r ( c l ) + # n o w a v e r a g e a c r o s s t h r e a d s + p r o b < - v e c t o r ( m o d e = " d o u b l e " , l e n g t h = n r o w ( p a t n ) ) + f o r ( i i n 1 : n P ) { + p r o b < - p r o b + r e s u l t [ [ i ] ] + } + p r o b / n P + } in a cluster con text and > l i b r a r y ( p a r a l l e l ) > l i b r a r y ( r s t r e a m ) > D r i v e r L T M m c l a p p l y < - f u n c t i o n ( p a t n , c o e f V e c , n S i m , n P ) { + r n g L i s t < - c ( n e w ( " r s t r e a m . m r g 3 2 k 3 a " , + s e e d = c ( 1 8 0 6 5 4 7 1 6 6 , 3 3 1 1 2 9 2 3 5 9 , 6 4 3 4 3 1 7 7 2 , + 1 1 6 2 4 4 8 5 5 7 , 3 3 3 5 7 1 9 3 0 6 , 4 1 6 1 0 5 4 0 8 3 ) , + f o r c e . s e e d = T R U E ) , + r e p l i c a t e ( n P - 1 , n e w ( " r s t r e a m . m r g 3 2 k 3 a " ) ) ) + r e s u l t < - m c l a p p l y ( r n g L i s t , P r o b F u n c m c l a p p l y , n S i m , n P , + c o e f V e c , p a t n , m c . c o r e s = n P ) + # n o w a v e r a g e a c r o s s p r o c e s s e s + p r o b < - v e c t o r ( m o d e = " d o u b l e " , l e n g t h = n r o w ( p a t n ) ) + f o r ( i i n 1 : n P ) { + p r o b < - p r o b + r e s u l t [ [ i ] ] + } + p r o b / n P + } with > P r o b F u n c m c l a p p l y < - f u n c t i o n ( r n g O b j , n S i m , n P , c o e f V e c , p a t n ) { + Z < - q n o r m ( r s t r e a m . s a m p l e ( r n g O b j , r o u n d ( n S i m / n P , 0 ) ) , 0 , 2 ) + p r o b < - c o l M e a n s ( a p p l y ( V e c P F u n c ( Z , c o e f V e c = c o e f V e c , + p a t n = p a t n ) , + 1 , f u n c t i o n ( y ) y / d n o r m ( Z , 0 , 2 ) ) ) + } for sha r ed memory pur p os es. Note that the v ario us constants and functions m ust be comm unicated to the w orkers in a cluster w ith cluster Export and that w e hav e used the rstre am pa ck ag e in the same manner as Section 5 to ensure prop er stream initialization when us ing mclapp ly . Both parLappl y and mclapply re tur n lists that ho ld the output fro m each of the workers. The list element a re then av eraged to obtain the final approximation. The pr ograms pr in t matching output for the integral approximations when the nu mber of thr eads is held consta n t. F or example, using four threa ds with the R mclapp ly function pro duces > D r i v e r L T M m c l a p p l y ( p a t n , c o e f V e c , n S i m , 4 ) [ 1 ] 0 . 0 1 0 5 9 1 0 3 2 0 . 0 1 4 4 6 5 3 3 9 0 . 0 0 3 3 2 1 0 6 8 0 . 0 0 7 4 5 9 9 3 0 0 . 0 0 8 4 9 7 4 0 1 USING RNGSTREAMS WITH C++ AND R 17 [ 6 ] 0 . 0 1 9 0 8 7 2 3 9 0 . 0 0 4 3 8 2 2 0 0 0 . 0 1 6 3 0 1 1 7 0 0 . 0 0 4 3 4 8 3 0 7 0 . 0 0 9 7 6 7 3 6 0 [ 1 1 ] 0 . 0 0 2 2 4 2 4 6 8 0 . 0 0 8 3 4 1 6 6 7 0 . 0 0 5 7 3 7 6 5 8 0 . 0 2 1 3 4 3 2 8 3 0 . 0 0 4 9 0 0 1 6 1 [ 1 6 ] 0 . 0 3 1 0 1 1 4 9 1 0 . 0 1 2 7 6 9 3 2 0 0 . 0 2 8 6 8 3 0 1 1 0 . 0 0 6 5 8 5 2 7 4 0 . 0 2 4 4 9 6 2 9 5 [ 2 1 ] 0 . 0 1 6 8 4 9 3 1 5 0 . 0 6 2 6 7 7 0 8 3 0 . 0 1 4 3 8 9 9 0 4 0 . 0 9 1 0 6 8 9 2 3 0 . 0 0 8 6 2 2 1 6 5 [ 2 6 ] 0 . 0 3 2 0 7 3 2 4 1 0 . 0 0 7 3 6 3 6 3 0 0 . 0 4 6 6 0 1 9 7 0 0 . 0 1 8 8 4 0 8 4 4 0 . 1 1 9 2 3 7 4 4 2 [ 3 1 ] 0 . 0 2 7 3 7 5 4 8 2 0 . 3 1 0 2 4 5 4 3 1 . The C+ + OpenMP prog ram with four threads pro duces iden tical output, as ex- pec ted: $ g + + - f o p e n m p R n g S t r e a m . c p p f u n c l a s s . c p p R n g S t r e a m S u p p . c p p d r i v e r L T M _ o p e n m p . c p p - o o p e n m p _ l t m $ . / o p e n m p _ l t m 4 I n t e g r a t i o n r e s u l t s : 0 . 0 1 0 5 9 1 0 3 2 0 . 0 1 4 4 6 5 3 3 9 0 . 0 0 3 3 2 1 0 6 8 0 . 0 0 7 4 5 9 9 3 0 0 . 0 0 8 4 9 7 4 0 1 0 . 0 1 9 0 8 7 2 3 9 0 . 0 0 4 3 8 2 2 0 0 0 . 0 1 6 3 0 1 1 7 0 0 . 0 0 4 3 4 8 3 0 7 0 . 0 0 9 7 6 7 3 6 0 0 . 0 0 2 2 4 2 4 6 8 0 . 0 0 8 3 4 1 6 6 7 0 . 0 0 5 7 3 7 6 5 8 0 . 0 2 1 3 4 3 2 8 3 0 . 0 0 4 9 0 0 1 6 1 0 . 0 3 1 0 1 1 4 9 1 0 . 0 1 2 7 6 9 3 2 0 0 . 0 2 8 6 8 3 0 1 1 0 . 0 0 6 5 8 5 2 7 4 0 . 0 2 4 4 9 6 2 9 5 0 . 0 1 6 8 4 9 3 1 5 0 . 0 6 2 6 7 7 0 8 3 0 . 0 1 4 3 8 9 9 0 4 0 . 0 9 1 0 6 8 9 2 3 0 . 0 0 8 6 2 2 1 6 5 0 . 0 3 2 0 7 3 2 4 1 0 . 0 0 7 3 6 3 6 3 0 0 . 0 4 6 6 0 1 9 7 0 0 . 0 1 8 8 4 0 8 4 4 0 . 1 1 9 2 3 7 4 4 2 0 . 0 2 7 3 7 5 4 8 2 0 . 3 1 0 2 4 5 4 3 1 . Though not shown, the C++ MPI and R pa rLappl y pr ogra ms a lso pro duce the same output. When the num b er o f threads ar e changed, the approximations change as w ell since the r andom n umbers are generated from differ e n t streams. W e carried out a few p erforma nc e co mparisons using the MPI and OpenMP C++ co de and the mcap ply and parL apply R co de fo r the Mo n te Car lo integration ap- plication. The av erag e sp eedups (i.e., the ratios of serial to pa rallel r un times) a nd run times (in parent heses) are g iven in T able 1. In all cases we used fiv e runs with 10 5 sampling po ints (i.e., nSim = 10 5 ) and used 1, 2, 4 and 8 pr o cessors. While all of the programs p erfor m w ell, near-linea r sp e edup is a chiev ed with mc lapply in R (using the rstre am pac k age) and in with MP I in C++ . As exp ected, the C++ progra ms execute at least an order of mag nitude faster than the R progr ams. T able 1. Sp eedups and (run times, in seconds) for C++ and R co de Pro cesso rs mclapply parLapply MPI Op enMP 1 1.000 1.000 1.000 1.000 (77.941 ) (82.541 ) (2.254) (2.19 9 ) 2 2.070 1.950 2.014 1.430 (37.651 ) (42.319 ) (1.119) (1.53 8 ) 4 4.117 3.597 4.002 3.043 (18.930 ) (22.944 ) (0.563) (0.72 3 ) 8 7.884 5.549 7.545 6.595 (9.886) (14.874 ) (0.299) (0.33 3 ) As a clo sing thoug ht, we men tion that the accura cy obtained with 10 5 samples can be ro ughly compar ed to tha t of an 18 (uniformly spaced) p oint tra pezo idal quadrature rule in one dimension. How ever, a g rid of 1 0 10 po int s for a pro duct trap ezoidal rule would be required to remain comp etitive with the same Mont e Carlo sc heme in 8 dimens ions, for example. 18 KARL, EUBANK, MILOV ANOVIC, REIS ER, AND YOUNG 7. Summar y The RngStr eams softw a re pack age provides one freely a v aila ble so lution for cre- ating independent r andom n um b er streams for simulation exp eriments that ar e conducted in a pa rallel computing environment. The g oal of this paper has been to provide an introduction to its use in both distributed a nd shared memory set- tings. W e have provided minimal working examples along with a more detailed application to illustrate the p otential of RngStr eams . While RngStre ams is easy to use, some care is neces sary to ens ure that streams are distr ibuted correctly to different pro cess es. W e have accomplished this by us- ing an array of Rng Strea m o b jects where the array index is emplo yed to uniquely determine which pro cess may access each ob ject. This scheme is gua r anteed to pro duce indep endent str e ams in b oth Op enMP and MPI . In the la tter context, we hav e descr ib e d another option that creates seeds on the master pro c ess and then distributes them to a ll the worker proces ses. There are memor y savings from this latter approach that co me with the cost of additional inter-pro cessor co mm unica- tion. In R w e demo nstrated the us e of RngStre ams through b oth the pa ralle l and rstrea m pack ages. W e explored the built-in functiona lit y for RngS treams in para llel and demonstra ted how r stream may b e used within parall el to improv e the p orta- bilit y of the soft ware. Ackno wledgements Some of the computations w ere carried o ut on the Saguaro cluster at Arizona State Universit y . The authors are grateful for commen ts fro m the anonymous ref- erees that lead to improvemen ts in the pap er. References Anderson, N. and Titteringto n, D. (1970), “Cross correla tion b etw een simultane- ously generated sequence s of pseudo-r andom unifor m deviates,” Statist. Comput. , 16, 11–15. Bo ck, R. and Lieb e r man, M. (19 70), “Fitting a resp onse mo del for n dic hotomously scored items,” Psyc hometrika , 35, 1 7 9–19 7 . Cenacchi, G. and De Matteis, A. (1970), “P seudo-Random Numbers for Compara - tive Monte Car lo Calculations,” Numer. Math. , 16, 11–15. Chapman, B., Jo st, G., a nd v an der Pas, R. (2001 ), Using Op en MP: Portable Shar e d Memory Par al lel Pr o gr amming , Ca mb ridge: MIT Press. De Matteis, A. and P agnutti, S. (198 8), “ Parallelization of Random Number Gen- erators and Long-Range Co rrelation,” Numer. Math. , 53, 595– 608. — (1990 ), “ L ong-Range Corr elation in Linear and Non-Linear Ra ndom Number Generators ,” Par al lel Comput. , 14, 2 07–21 0. Etacher, K. (19 99), “O n the CRA Y-system r andom num ber genera to r,” Simulation , 72, 163–169 . Eugster, M. J. A., Knaus , J., Porzelius, C., Schmidberger, M., a nd Vicedo, E. (2011), “Hands - on T utorial for Parallel Co mputing with R,” Comput S tat , 26, 219–2 39. Gent le, J. (2003), Rand om Numb er Gener ation and Monte Carlo Metho ds , New Y ork: Springer . USING RNGSTREAMS WITH C++ AND R 19 Haramoto, H., Mats umo to, M., Nishimura, T., Panneton, F., and L’Ecuyer, P . (2008), “Efficie nt jump ahead for F 2 -linear random num b er g enerators ,” IN- F ORMS J Comput , 20, 385–3 90. Hechenleitn er, B. and En tacher, C. (2 003), “Pitfalls When Using Parallel Streams in O MNeT++ Sim ulations,” in Pr o c e e dings of I nter-domain Performanc e and Simulation Workshop, Salzbur g, Austria, 11-20 . Hill, D. R. C. (20 1 0), “ Practica l Distribution of Random Streams for Sto chas- tic High Performance Computing,” in Int ernational Confer enc e on H igh Perfor- manc e Computing and Simulation (HPCS) , pp. 1–8 . L’Ecuyer, P . (1990), “ Random Num ber s for Sim ulation,” Comm ACM , 33, 8 5–98 . — (1996), “Co mbined multiple r ecursive random nu mber gener ators,” Op er R es , 44, 816–822 . — (1999), “Goo d parameter s and implementation for com bined mult iple recurs ive random n um b er genera tors,” Op er Re s , 47, 159–16 4. L’Ecuyer, P . and Simar d, R. (2007), “T estU01: A C Library for Empirical T esting of Random Number Gener ators,” AC M T r ans. Math. Softw. , 33, 22. L’Ecuyer, P ., Simard, R., Chen, J ., and Kelton, W. (2 0 01), “An Ob ject-Oriented Random-Number Pack age With Many Long Stre ams and Substreams ,” T ech. rep., Univ ersity o f Mo nt real. L’Ecuyer, P . a nd T ezuk a, S. (1 9 91), “Structural prop er ties for tw o clas s es o f random nu mber generator s,” Math Comp , 57, 735–746 . Lemieux, C. (2009 ), Monte Carlo and Quasi-Monte Carlo Sampling , New Y ork: Springer. Leydold, J . (2012), rstr e am: St r e ams of R andom Nu mb ers , R pack age version 1.3.2 . Mascagni, M. and Sr iniv asan, A. (2000 ), “Algorithm 806: SPRNG: A Scalable Library for Pseudo r andom Number Generation,” ACM T r ans Math Softwa r e , 26, 436–461 . Matsumoto, M. and Nishimura, T. (1998), “Mers e nne Twister: A 623 - Dimensionally Eq uidistributed Uniform Pse udo-Random Num be r Gener ator,” ACM T r ansactions on Mo deling and Computer Simulation , 8, 3–30. Matsumoto, M., W ada, I., Kuramoto, A., and Ashihar a, H. (20 0 7), “Common De- fects in Initialization of Pseudora ndo m Number Generato r s,” ACM T r ans actions on Mo deling and Computer Simulation , 17 (4):15. Quinn, M. (2003 ), Pa r al lel Pr o gr amming in C with MPI and Op enMP , New Y ork: McGraw Hill. R Co r e T eam (201 3 ), R: A L anguage and Envir onment for Statistic al Computing , R F oundation for Statistical Computing, Vienna, Austria. Rizop oulos, D. (2006 ), “ltm: An R pack age for Latent V aria ble Mo delling a nd Item Resp onse Theory Analyses,” Journal of St atistic al Softwar e , 17, 1–25. Schm idb erger , M., Mor gan, M., Eddelbuettel, D., Y u, H., Tierney , L., a nd Ma ns- mann, U. (2009), “Sta te of the Art in Parallel Co mputing with R,” J Stat Soft- war e , 31, 1. Tierney , L., Ro ssini, A. J., Li, N., a nd Sevcik ov a , H. (2013 ), snow: Simple Network of Workstations , R pack a ge version 0.3-12. Urbanek, S. (201 1 ), multic or e: Par al lel pr o c essing of R c o de on machines with multiple c or es or CPUs , R pack age v ersion 0.1-7.
Original Paper
Loading high-quality paper...
Comments & Academic Discussion
Loading comments...
Leave a Comment