Enhancing R with Advanced Compilation Tools and Methods
Authors: Duncan Temple Lang
Statistical Science 2014, Vol. 29, No. 2, 181–200. DOI: 10.1214/13-STS462. © Institute of Mathematical Statistics, 2014.

Abstract. I describe an approach to compiling common idioms in R code directly to native machine code and illustrate it with several examples. Not only can this yield significant performance gains, but it allows us to use new approaches to computing in R. Importantly, the compilation requires no changes to R itself, but is done entirely via R packages. This allows others to experiment with different compilation strategies and even to define new domain-specific languages within R. We use the Low-Level Virtual Machine (LLVM) compiler toolkit to create the native code and perform sophisticated optimizations on the code. By adopting this widely used software within R, we leverage its ability to generate code for different platforms such as CPUs and GPUs, and will continue to benefit from its ongoing development. This approach potentially allows us to develop high-level R code that is also fast, that can be compiled to work with different data representations and sources, and that could even be run outside of R. The approach aims to both provide a compiler for a limited subset of the R language and also to enable R programmers to write other compilers. This is another approach to help us write high-level descriptions of what we want to compute, not how.

Key words and phrases: Programming language, efficient computation, compilation, extensible compiler toolkit.

1. BACKGROUND & MOTIVATION

Computing with data is in a very interesting period at present and this has significant implications for how we choose to go forward with our computing platforms and education in statistics and related fields.
We are simultaneously (i) leveraging higher-level, interpreted languages such as R, MATLAB, Python and recently Julia, (ii) dealing with increasing volume and complexity of data, and (iii) exploiting, and coming to terms with, technologies for parallel computing including shared and nonshared multi-core processors and GPUs (Graphics Processing Units). These challenge us to innovate and significantly enhance our existing computing platforms and to develop new languages and systems so that we are able to meet not just tomorrow's needs, but those of the next decade.

Duncan Temple Lang is Associate Professor, Department of Statistics, University of California at Davis, 4210 Math Sciences Building, Davis, California 95616, USA (e-mail: duncan@r-project.org). This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in Statistical Science, 2014, Vol. 29, No. 2, 181-200. This reprint differs from the original in pagination and typographic detail.

Statisticians play an important role in the "Big Data" surge, and therefore must pay attention to logistical and performance details of statistical computations that we could previously ignore. We need to think about how best to meet our own computing needs for the near future and also how best to be able to participate in multi-disciplinary efforts that require serious computing involving statistical ideas and methods. Are we best served with our own computing platform such as R (R Core Team (2013))? Do we need our own system? Can we afford the luxury of our own system, given the limited resources our field has to develop, maintain and innovate with that system?
Alternatively, would we be better off reimplementing R or an R-like environment on a system that is developed and supported by other larger communities, for example, Python or Java? Or can we leave it to others to build a new, fast computing environment that we can leverage within the field of statistics? Would "doing it right" give us an opportunity to escape some of the legacy in our current systems and position ourselves for more innovation? Would developing a new system splinter our already small community and reduce our effectiveness in disseminating new statistical methods as effectively as we do through R's excellent package mechanism?

I have wrestled with these questions for over a decade. I don't believe there is a simple answer as to what is the best way to proceed. It is as much a social issue as a technical one. The R community is an amazing and valuable phenomenon. There is a large R code base of packages and scripts in widespread use for doing important science. Even if new systems do emerge and replace R, this will take several years. We need significant improvements in performance to make R competitive with other systems, at least for the near future. We must improve R's performance to allow us to continue to deal with larger and more complex data and problems. In this paper, I discuss one direct approach to improve the performance of R code that is extensible and enables many people to further improve it.

The essence of the approach I am suggesting is conceptually quite simple and emerged in numerous other languages and platforms around the same time we first started implementing it for R. The idea is that we compile R code directly to low-level native machine instructions that will run on a CPU or GPU or any device we can target.
Instead of insisting that R code be evaluated by the one and only R interpreter, we may generate the code to perform the equivalent computations in a quite different way. We can dynamically compile fast native code by combining information about the code, the data and its representation being processed by that code, the available computing "hardware" (i.e., CPU or GPU or multi-core), the location of the different sources of the data, whether we need to handle missing values (NAs) or not, and so on. This is a form of just-in-time (JIT) compilation. It leverages additional knowledge about the context in which code will be run. It maps the code to low-level machine instructions rather than having it evaluated by a high-level interpreter.

The approach presented here is quite different from how programmers typically improve performance for R code. They manually implement the slow, computationally intensive parts in C/C++, and call these routines from R. I call this "programming around R." Instead, I am trying to "compile around R." In this approach, statisticians use familiar R idioms to express the desired computations, but the compiler infrastructure generates optimized instructions to realize the intended computations as efficiently as possible on the hardware platform in use. The input from the statistician or analyst to this process is R code, not low-level C/C++ code. This is good because humans can more easily understand, debug, adapt and extend code written in R. Furthermore, the compiler can "understand" what is intended in the high-level code and optimize in quite different ways. This also allows the code to be optimized in very different ways in the future. The high-level code says what to do, but not how to do it. How is left to the compiler.
What makes this approach feasible and practical now is the availability of the Low-Level Virtual Machine Compiler Infrastructure (LLVM) (Lattner and Adve (2004)). LLVM is the winner of the 2012 Association for Computing Machinery System Software Award (the same award conferred on the S language in 1999) and is a highly extensible compiler toolkit. LLVM is a C++ library we can integrate into R (and other languages) to generate native code for various different CPUs, GPUs and other output targets. The ability to integrate this very adaptable and extensible tool into high-level languages is a "game changer" for developing compilation tools and strategies. We can use a technology that will continue to evolve and will be developed by domain experts. We can adapt these to our purposes with our domain knowledge. We do this within an extensible R package rather than in the R interpreter itself. This leaves the compilation infrastructure in "user" space, allowing development of any new compilation strategies to be shared without any changes to the R interpreter. This contrasts with R's byte-code compiler and the byte-code interpreter which is part of the fixed R executable.

The Rllvm (Temple Lang (2010b)) package provides R bindings to the LLVM C++ API. We can use this to generate arbitrary native code. The RLLVMCompile (Temple Lang and Buffalo (2011)) package provides a simple "compiler" that attempts to translate R code to native code by mapping the R code to instructions in LLVM, leaving LLVM to optimize the code and generate the native machine code.
Before we explore some examples of how we can use LLVM in R to improve performance and change our computational strategies for certain types of problems, it is worth thinking a little about the potential implications of fast R code:

• Alternative data models. On a practical level, if we can compile scalar (i.e., nonvectorized) R functions so that they are almost as fast as C/C++ code, we can use them to process individual observations in a streaming or updating manner. This means we can escape the highly-vectorized or "all data in memory" approaches that R strongly rewards.

• Exporting code to alternative execution environments. We can write R code and then export it to run in other different systems, for example, databases, Python, JavaScript and Web browsers. We can map the R code to LLVM's intermediate representation (IR). We can then use emscripten (Zakai (2010)) to compile this directly to JavaScript code. For other systems, we can share the IR code from R and they can use their own LLVM bindings to compile it to native code for the particular hardware.

• Richer data structures. R provides a small number of primitive data types, for example, vectors, lists, functions. We can currently use these to create new aggregate or composite data structures. However, we can only introduce new and different data structures such as linked lists and suffix trees as opaque data types programmed in native (C/C++) code. When we compile code to native instructions, we also have the opportunity to have that new code use these different data structures and to represent the data differently from R. The same R code can be merged with descriptions of new data types to yield quite different native instructions that are better suited to particular problems.

• Templating concepts.
Our ability to create native code from R code allows us to think about R functions or expressions differently. They are descriptions of what is to be done, without the specifics of how to do them. An R compiler can rewrite them or generate code that will behave differently from the R interpreter but give the same results (hopefully). The functions are "templates." The compiler can use knowledge of the particular representation of the data the functions will process to generate the native code in a more intelligent manner. For example, the code may access elements of a two-dimensional data set (rows and columns). There are two very different representations of this in R: data frames and matrices. How the individual elements are accessed for each is very different. The compiler can generate specialized code for each of these and might even change the order of the computations to improve efficiency (cache coherency) for these representations. The function is not tied to a particular data representation.

In summary, compiling R programs through LLVM yields novel computational potentials that are directly relevant to improving statistical learning and communication in the big data era. Compiling high-level code to native code is used in many systems. Julia is an interesting modern project doing this. NumPy (Jones et al. (2001)) in Python is another. Several years ago, Ross Ihaka and I explored using LISP (Ihaka and Temple Lang (2008)) as the platform for a new statistical computing platform. The same ideas have been used there for many years and the performance gains are very impressive.

A very important premise underlying the approach in this paper is that the R project and its large code base are important, and will be for at least another 5 years.
Users are not likely to immediately change to a new system, even if it is technically superior. For that reason, it is important to improve the performance of R now. It is also important to allow developers outside of the core R development team to contribute to this effort and to avoid many forked/parallel projects. For this, we need an extensible system within R, and not one that requires continual changes to the centralized R source code.

In addition to focusing on the immediate and near-term future and improving R, we also need to be exploring new language and computing paradigms within the field of statistics. Julia is an interesting modern project doing this. We need to foster more experiments so new ideas emerge. To do this, we also need to increase the quantity and quality of computing within our curricula.

Section 2 constitutes the majority of this paper. In it I explore different examples of computing in R and how compiling code can make the computations more efficient, both by simply obtaining faster execution speed and also by allowing us to change how we approach the problem. In Section 3 I discuss some additional general strategies we can exploit to improve computations in the future. I briefly discuss other exciting research projects to improve R's efficiency and contrast them with the LLVM approach. I outline in Section 6 a road map of the ongoing work on the LLVM approach and other related projects as part of our future activities. The aim of this is to illustrate the feasibility of the entire approach.

In this paper I focus on reasonably standard R code and approaches that can be improved by generating native code. The semantics of the generated code are the same as the original R code.
The approach also allows us to develop new languages and semantics within the compilation framework and to explore different computational models. However, the examples discussed in this paper stay within R's existing computational model in order to anchor the discussion and avoid too many degrees of freedom becoming a distraction. I do hope that we will explore new semantics and language features within R via compilation.

2. ENHANCING R WITH ADVANCED COMPILATION TOOLS AND METHODS

In this section we'll explore some examples of how we can write code in R and compile it to machine code. These explore different strategies and illustrate how we can approach computations differently when we have the option to compile code rather than only interpret it. We have chosen the problems for several reasons. They are each reasonably simple to state, and they illustrate the potential benefits of compilation. Like most benchmarks, some of the examples may not reflect typical use cases or how we would do things in R. However, most of these problems are very concrete and practical, and represent ways we would like to be able to program in R, were it not for the performance issues. In this way, the examples illustrate how we can continue to use R with an additional computational model and can overcome some of the interpreter's performance issues while still using essentially the same R code.

Note. In the following subsections, we present absolute and comparative timings for different approaches and implementations of the different tasks. These timings were performed on three different machines. We used a MacBook Pro running OS X (2.66 GHz Intel Core i7, 8 GB 1067 MHz DDR3) and also two Linux machines. The first of these Linux machines is an older 2.8 GHz AMD Opteron, 16 GB.
The second is a much more recent and faster machine: a 3.50 GHz Intel Core i7-2700K, with 32 GB of RAM. Additionally, the different machines have different compilers used to compile R itself and these may impact the timings. We used GCC 4.2.1 on OS X, GCC 4.3.4 on the first Linux machine and both GCC 4.8.0 and clang on the second Linux box. In all cases, R was compiled with the -O3 optimization level flag. The absolute times are quite different across machines, as we expect, and the within-machine relative performance of our LLVM-generated code to native code differs between OS X and Linux. However, the within-machine results are very similar across Linux machines. Finally, our current steps to optimize the native code we generate with LLVM are quite simple and we expect to improve these in the near future.

We have not included the time to compile the code in our measurements. There are two steps in this: compiling the R code to intermediate representation (IR) and then compiling the IR to native code. The former can be done once, and the latter for each R session and is done in LLVM's C++ code. There are several reasons for omitting these steps in the timings. First, our focus is on tasks that take a long time to run in R, for example, many hours or days. Compilation time will be on the order of, at most, a few minutes and so the compilation time is negligible. Second, we expect that the compiled code will be reused in multiple calls and so the overhead of compiling will be amortized across calls. We have also ignored the time to byte-compile R functions, or compile and install C/C++ code to be used in R packages.

2.1 The Fibonacci Sequence

The Fibonacci sequence is an interesting mathematical sequence of integers defined by the recurrence/recursive relation

    F_n = F_{n-1} + F_{n-2},  n >= 2,

with F_0 = 0 and F_1 = 1.
We can implement this as an R function in an easy and obvious manner as

    fib = function(n) {
        if (n < 2L)
            n
        else
            fib(n - 1L) + fib(n - 2L)
    }

For simplicity, we don't verify that n is nonnegative, assuming the caller will provide meaningful inputs. This maps the mathematical description of the sequence to a computational form essentially in a one-to-one manner. This is a good thing, as it makes the code and computations easy to understand, debug and maintain. However, this is a scalar function and not vectorized, meaning that it computes the value of the Fibonacci sequence for a single integer value rather than element-wise for a vector of inputs. This makes it slow in R if we want to compute multiple values from the sequence, for example, apply it to each element of a vector. Instead of implementing the function in this natural form, to gain performance, we would look to other implementations. Since the sequence is described by a recurrence relationship, there is a simple closed-form formula for computing the nth value of the sequence which can be easily implemented in R as a vectorized function of n. Alternatively, we might use memoization to remember results computed in previous calls to the function to avoid repeating computations. We might even use a lookup table of precomputed results for common input values, or some combination of these approaches. The key is that to get good performance, we have to think about the problem quite differently. Instead, we'll explore how we make the simple implementation above perform better and hope to avoid having to change our entire approach or rely on R's other vectorized operations.
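For reference, the closed-form and memoization alternatives mentioned above can each be sketched in a few lines of base R. These are our own illustrative implementations (the names fibClosed and fibMemo are not from the paper); the closed form uses Binet's formula and is exact only while the floating-point rounding error stays below one half.

```r
# Vectorized closed-form Fibonacci via Binet's formula.
# Exact in double precision only for moderate n (roughly n <= 70).
fibClosed = function(n) {
    phi = (1 + sqrt(5)) / 2       # golden ratio
    round(phi^n / sqrt(5))        # nearest integer is F_n
}

# Memoized recursive version: a closure caches previously computed values.
fibMemo = local({
    cache = c(0, 1)               # cache[i + 1] holds F_i
    function(n) {
        if (n < length(cache))
            return(cache[n + 1L])
        ans = fibMemo(n - 1L) + fibMemo(n - 2L)
        cache[n + 1L] <<- ans
        ans
    }
})

fibClosed(0:10)   # 0 1 1 2 3 5 8 13 21 34 55
fibMemo(30)       # 832040, matching fib(30)
```

Both avoid the exponential recursion of fib(), but only by restating the problem; the point of the compilation approach is to keep the natural recursive form and still get good performance.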
We use the function compileFunction() in the RLLVMCompile package to create a native compiled version of the fib() function with

    fib.ll = compileFunction(fib, Int32Type,
                             list(n = Int32Type))

We have to specify the type of the return value and also the type of the input(s), that is, n in this case. For this function, both the return type and the input are regular integer values corresponding to the 32-bit integer type Int32Type. We could use a 64-bit integer by using the type Int64Type if we wanted to deal with larger numbers. In fact, we can create two separate and different versions of this function with different types with two calls to compileFunction(). This is a simple illustration of how easy it is to adapt the same R code to different situations and create different compiled routines with different characteristics.

compileFunction() can return an R function which we can invoke directly. However, by default, it currently returns an object representing the compiled routine in LLVM. We can invoke the routine using this object and the .llvm() function, analogous to the .Call() and .C() functions in R. So

    .llvm(fib.ll, 30)

calls our compiled routine and returns the value 832,040. Unlike the .Call()/.C() functions, the .llvm() function knows the expected types of the routine's parameters and so coerces the inputs to the types expected by the routine. In this case, it converts the R value 30 from what is a numeric value to an integer.

After verifying that the routine gives the correct results, we can explore the performance of the code. This recursive function is very computationally intensive. When calculating, for example, fib(30), we calculate fib(28) twice [once for each of fib(30 - 1) and fib(30 - 2)] and, similarly, we compute fib(27) multiple times and so on.
This repetition is one of the reasons the code is so slow. We'll compare the time to evaluate fib(30) using three different versions of the fib() function: the original interpreted function, the LLVM-compiled routine (fib.ll) and a version of fib() that is compiled by R's byte-compiler. The LLVM-compiled routine is the fastest. Table 1 shows the elapsed times for each and a ratio of the time for each function relative to the time for the interpreted function. These convey the relative speedup factor.

    Table 1
    Timings for computing Fibonacci sequence values

                               OS X              Linux 1            Linux 2
                           Time   Speedup     Time    Speedup    Time    Speedup
    Interpreted R code     80.49    1.00     112.70     1.0     51.780     1.0
    Byte-compiled R code   31.70    2.53      45.85     2.5     21.620     2.4
    Rllvm-compiled code     0.12  653.90       0.21   526.4      0.097   531.0

    These are the timings for a call to fib(30) using the regular R function, the byte-compiled version and the LLVM-compiled version. To improve the accuracy of the timings, we calculated the duration for 20 replications of the two slower functions and 200 replications of the LLVM-compiled routine and divided the duration by 10. The LLVM-compiled version is clearly much faster.

We see that on a MacBook Pro, the LLVM-compiled routine is 600 times faster than the R interpreter. On a Linux machine, the speedup is a bit smaller, but still very significant at a factor of 500. While we have attempted to reduce the variability of these timings, we have observed different speedups ranging from 400 to 600 on OS X and 230 to 540 on Linux. The timings we report here are the most recent (rather than the "best").

Again, this is a simple example and not necessarily very representative of how we would calculate the Fibonacci sequence in production code.
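The interpreted-versus-byte-compiled part of this comparison can be reproduced with base R alone. The following sketch uses the built-in compiler package and system.time(); the argument 25 and the variable names are our own choices to keep the illustration quick, and absolute times will of course differ by machine.

```r
library(compiler)

fib = function(n) {
    if (n < 2L) n else fib(n - 1L) + fib(n - 2L)
}

t.interp = system.time(fib(25))["elapsed"]

# Rebind fib to its byte-compiled form so that the recursive calls
# inside the body also resolve to the compiled version.
fib = cmpfun(fib)
stopifnot(fib(25) == 75025)      # same result, different evaluator

t.byte = system.time(fib(25))["elapsed"]
c(interpreted = t.interp, byte.compiled = t.byte)
```

Note the rebinding step: byte-compiling into a separate name would leave the recursive calls resolving to the interpreted global fib, defeating the comparison.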
However, the ability to express an algorithm in its natural mathematical form makes it easier to program, verify and extend. We would very much like to be able to write code in this manner, without sacrificing good run-time performance.

2.2 2-Dimensional Random Walk

Ross Ihaka developed a very instructive example of writing straightforward R code compared with clever, highly vectorized R code as a means to illustrate profiling in R and how to make code efficient. The task is simulating a two-dimensional random walk. It is very natural to write this as a loop with N iterations corresponding to the N steps of the walk. For each step, we toss a coin to determine whether we move horizontally or vertically. Given that choice, we toss another coin to determine whether to move left or right, or up or down. Then we calculate and store the new location. We'll call this the naïve approach and the code is shown in Figure 1. After several refinements based on profiling and nontrivial knowledge of R, Ihaka defines a very efficient R implementation of the random walk, shown in Figure 2. It removes the explicit loop, samples all N steps in one call to sample(), and determines the positions using two calls to the cumsum() function. This makes very good use of several of R's vectorized functions which are implemented in C code and therefore fast.

We manually compiled the naïve implementation using Rllvm and, similarly, used R's byte-compiler to create two compiled versions of this function. We then simulated a 10 million step random walk using each of the original naïve function, the byte-code compiled function, the fully vectorized version and the LLVM-compiled version. Table 2 shows the relative speedups.
    rw2d1 = function(n = 100) {
        xpos = ypos = numeric(n)
        truefalse = c(TRUE, FALSE)
        plusminus1 = c(1, -1)
        for(i in 2:n) {
            # Decide whether we are moving
            # horizontally or vertically.
            if (sample(truefalse, 1)) {
                xpos[i] = xpos[i-1] + sample(plusminus1, 1)
                ypos[i] = ypos[i-1]
            } else {
                xpos[i] = xpos[i-1]
                ypos[i] = ypos[i-1] + sample(plusminus1, 1)
            }
        }
        list(x = xpos, y = ypos)
    }

Fig. 1. The naïve implementation of the 2-D random walk.

    rw2d5 = # Sample from 4 directions, separately.
    function(n = 100000) {
        xsteps = c(-1, 1, 0, 0)
        ysteps = c( 0, 0, -1, 1)
        dir = sample(1:4, n - 1, replace = TRUE)
        xpos = c(0, cumsum(xsteps[dir]))
        ypos = c(0, cumsum(ysteps[dir]))
        list(x = xpos, y = ypos)
    }

Fig. 2. The fast, vectorized implementation of the 2-D random walk.

We see that the manually vectorized R function is 175 times faster than the naïve implementation, illustrating how important vectorization is to make R code efficient. However, we also see that compiling the naïve implementation with LLVM outperforms even the vectorized version, taking between about 55% and 65% of the time of the vectorized version. This is probably due to the compiled code using a single loop, while the vectorized version has two calls to cumsum() and hence at least one additional C-level loop over the N steps.

    Table 2
    Timings for simulating a 2-D random walk

                              OS X              Linux 1            Linux 2
                          Time   Speedup     Time   Speedup     Time   Speedup
    Interpreted R code   171.08     1.0     196.6      1.0     100.3      1.0
    Byte-compiled code   123.92     1.4     120.8      1.6      60.51     1.66
    Vectorized R code      0.97   176.5       1.8    106.8       0.63   159.46
    Rllvm-compiled code    0.52   329.3       1.1    180.3       0.40   250.12

    We generate 10 million steps for each approach. We compare a manually vectorized implementation in R with a naïve version written in R, and with byte-compiled and LLVM-compiled versions of that naïve function. The vectorized version is 175 times faster than the regular R function. However, the LLVM-compiled version outperforms the vectorized version, most likely by removing one C-level loop.

2.3 Sampling a Text File

Suppose we have one or more large comma-separated value (CSV) files. For example, we can download airline traffic delay data for each year as an approximately 650 megabyte CSV file from the Research and Innovative Technology Administration (RITA), part of the Bureau of Transportation, at http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236. Rather than working with the entire data set, we might choose to take a random sample of the observations. (We don't concern ourselves here with the appropriateness of a simple random sample.) We'll also assume that we know the number of observations in the CSV file.

How do we efficiently extract a sample of the lines from the file? We could use UNIX shell tools, but it is difficult to randomly generate and specify the lines to sample. Sampling the indices is something we want to do in R, but then passing all of these to a shell command is awkward, at best. Alternatively, we could do the entire sampling in R. We could read the entire file into memory [via the readLines() function] and then subset the ones we want. However, this requires a significant amount of memory. We first store all of the lines, then make a copy of the ones we want and then discard the larger vector. This may not be feasible, as we may not have enough memory, or it may simply be too slow.

We can think of different strategies. One is to first identify the indices of all of the lines we want in our sample, and then read the file in chunks until we get to a line that is in our sample.
We store that line and continue to read from where we are up to the next line in our sample, and so on. To make this work, we need to be able to continue to read from where we currently are in the file. We can use an R file connection to do this.

Our first step is to generate the vector of the line numbers we want to sample using, for example,

    lineNum = sort(sample(1:N, sampleSize))

where N is the number of lines in the CSV file. We have sorted the line numbers, as we will read the sample lines in the file sequentially. The next step is to determine how many lines there are between successive lines in our sample. We can compute this in R with

    lineSkips = diff(c(0, lineNum))

which gives a vector of the pairwise differences between successive elements. For example, suppose the first two lines we want to sample are 60 and 200. The first two elements in lineSkips will be 60 and 140. We can then read the first two lines in our sample with

    con = file("2012.csv", "r")
    readLines(con, 60)[60]
    readLines(con, 140)[140]

Each element of lineSkips tells us how many lines to read to get the next line in our sample. So next we need a function that can read that many lines and return the last of these. The following function does this:

    readTo = function(numLines, con)
        readLines(con, numLines)[numLines]

The final step to obtaining our entire sample is to call readTo() for each element of lineSkips, for example,

    readSelectedLines = function(lineSkip, file)
        sapply(lineSkip, readTo, file)

To obtain our sample, we call readSelectedLines(), passing it the variable lineSkips and our open connection:

    con = file("2012.csv", "r")
    sample = readSelectedLines(lineSkips, con)

Each of these functions is concise and efficient since sapply() is essentially implemented as a C-level loop within the R interpreter.
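The pieces above can be exercised end-to-end on a small synthetic file. This self-contained sketch uses a throwaway temporary file with known line contents (the file contents, N and sample size are ours, chosen only so the result can be checked); the result is bound to sampled rather than sample to avoid masking the base function.

```r
# Build a small throwaway file with known, numbered line contents.
path = tempfile(fileext = ".csv")
writeLines(sprintf("row%d", 1:1000), path)

N = 1000L
sampleSize = 10L

set.seed(1)
lineNum   = sort(sample(1:N, sampleSize))
lineSkips = diff(c(0, lineNum))   # lines to read to reach each sampled line

readTo = function(numLines, con)
    readLines(con, numLines)[numLines]

readSelectedLines = function(lineSkip, file)
    sapply(lineSkip, readTo, file)

con = file(path, "r")
sampled = readSelectedLines(lineSkips, con)
close(con)

# Each sampled line is exactly the line whose number we drew.
identical(sampled, sprintf("row%d", lineNum))
```

Because the connection stays open across the sapply() iterations, each readTo() call resumes reading where the previous one stopped, which is precisely what makes the skip counts in lineSkips correct.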
Using the connection and readLines() to read blocks of lines in readTo() is efficient, as it uses C code within R. Unfortunately, it does involve reading, allocating, storing, subsetting and discarding a potentially large character vector returned by each call to readLines(). However, we only want a single line at the end of that vector in each call. While each call involves significantly fewer lines than reading the entire file, allocating a large character vector still slows the computations, as it extensively involves the memory manager in R. A different approach that avoids the memory issue is to change the readTo() function so that it reads each line individually and then returns the last one. We could change it to

    readTo = function(numLines, con)
    {
        ans = ""
        for(i in 1:numLines)
            ans = readLines(con, 1)
        ans
    }

Again, this is straightforward and easy to understand. Unfortunately, it is extremely slow, as we are now looping in R over almost every line in the file. The idea of reading one line at a time would work well if we could avoid the overhead of the R loop mechanism. We can do this if we compile this new version of readTo() into native code. We can almost do this now, but we need an equivalent of readLines() that reads a single line of a file. This is exactly what the standard C routine fgets() does. Similar to a connection, we pass fgets() a pointer to an opaque C-level FILE data structure, and it puts the contents of the next line it reads into a location in memory that we also provide.
For simplicity of exposition, we will define our own function Fgets() in R as a proxy to call fgets():

    Fgets = function(file)
        fgets(ptr, 1000L, file)

This is R code and it just assumes there is a function named fgets() and that ptr is somehow (the address of) an array in memory with 1000 character elements, that is, space for a long string. We won't run this code in R, so these variables [ptr, file and fgets()] don't actually have to exist in R. Instead, we will allocate them in LLVM for the compiled, native routine we generate from Fgets(). We compile the Fgets() function in an LLVM module, a collection of routines and variables, using compileFunction(). We also define the module-level "global" variable ptr to be the pointer to the array we want, after creating the actual array of 1000 characters as another global variable. When compiling Fgets(), we also need to tell the compiler about the signature of the external fgets() routine so that it can make the call to fgets() correctly. We do this via

    mod = Module()
    FILEType = pointerType(Int32Type)
    declareFunction(list(StringType, StringType,
                         Int32Type, FILEType),
                    "fgets", mod)

[While we have done this explicitly, we could automate this step using the RCIndex (Temple Lang (2010a)) package to obtain the signature programmatically.] We also need to tell the LLVM run-time engine how to locate the fgets() routine, which we do with

    llvmAddSymbol("fgets")

Note that in our Fgets() function, we assumed that the longest line was less than 1000 characters. We could specify a different length if we knew or suspected otherwise. Similarly, we didn't provide any error checking about whether we had reached the end of the file. This is because we are assuming that the caller knows the total number of lines and is sampling only up to, at most, that number.
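Putting the pieces above together, the setup might look roughly as follows. This is a pseudocode-level sketch assembled solely from the snippets shown in this section; Module(), declareFunction(), llvmAddSymbol() and compileFunction() are the Rllvm/RLLVMCompile functions named in the text, and the exact argument conventions (for example, how compileFunction() is told the return and parameter types for Fgets()) may differ from what is shown here.

```r
library(Rllvm)           # assumed available; API as used in the text

mod = Module()
FILEType = pointerType(Int32Type)

## Declare the external C routine fgets() so calls to it compile correctly.
declareFunction(list(StringType, StringType, Int32Type, FILEType),
                "fgets", mod)

## Let the LLVM run-time engine resolve the fgets symbol.
llvmAddSymbol("fgets")

## The R proxy; ptr is a module-level global (a 1000-character buffer)
## created separately, as described in the text.
Fgets = function(file)
    fgets(ptr, 1000L, file)

## Compile the proxy into native code, following the signature style
## used for compileFunction() later in the paper (assumed form).
Fgets.c = compileFunction(Fgets, StringType, list(file = FILEType))
```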
This is an example of the context-specific shortcuts we can make when compiling the code for a particular situation, rather than writing general, robust code that can be used in many different situations. We could also tell the compiler to add these tests for us, if we wanted, but we can avoid the extra computations when we know they are redundant. How do we obtain the instance of the FILE data type to pass to the compiled Fgets() routine? We can use the C routine fopen() and, again, write an R function that mimics it and then compile it. However, the RLLVMCompile package has a function to automate the creation of that proxy function in R, if we know the signature of the C routine of interest. So this example illustrates how we can dynamically create bindings to existing compiled routines in different libraries. In the case of FILE, we can also use the existing function CFILE() in the RCurl (Temple Lang (2002)) package.

Table 3: Timings for sampling a CSV file

                                       OS X            Linux 1          Linux 2
                                       Time   Speedup  Time    Speedup  Time   Speedup
Interpreted R loop & readLines()       68.93  1.0      103.25  1.0      42.78  1.0
Rllvm-compiled loop & Fgets()          3.27   21.09    6.54    15.8     2.59   16.5
C code (FastCSVSample)                 3.0    22.97    6.28    16.4     2.40   17.8

We use vectorized code in R to read blocks of data and extract the final line of each block. The LLVM approach compiles simple R functions that read one line at a time. FastCSVSample does the same thing with manually written C code. The compiled approaches avoid the memory usage related to readLines() and see a nontrivial speedup. The C code in the FastCSVSample package outperforms the LLVM-compiled version, but both approaches outperform the approach using R's connections and readLines() functionality, which are also implemented with C code.
So now we can read a single line from an open FILE object in R via our compiled Fgets() routine. We can redefine our readTo() function as

    readTo = function(numLines, con)
    {
        ans = ""
        for(i in 1:numLines)
            ans = Fgets(con)
        ans
    }

This is almost identical to the original function above, but replaces the call readLines(con, 1) with Fgets(con). Now we can compile this into native code via compileFunction() and the resulting code will be quite fast. We now have a fast replacement for reading up to the next line in our sample. The last step is to make readSelectedLines() fast. Recall that this was implemented simply as sapply(lineSkip, readTo, file). When we compile this as returning an R character vector, our compiler recognizes the sapply() call, converts it into a loop in native code, and populates and returns a new R character vector. In summary, we have compiled three R functions [Fgets(), readTo() and readSelectedLines()] and these now allow us to read one line at a time and use the minimal amount of memory to collect the lines for our sample, but using two loops in native code rather than in R. We can now compare the performance of our R-based approach using readLines() to consume chunks of lines and our compiled version that reads one line at a time. In addition to these two approaches, we also have a manual C implementation, essentially equivalent to our LLVM-compiled approach, in the FastCSVSample package (Temple Lang (2013)). Our timings are based on extracting a sample of one hundred thousand lines uniformly from a CSV file that contains one hundred million lines, using the same lines for each approach. The elapsed times are given in Table 3.
We see that our compiled approach of reading one line at a time is around twenty times faster than collecting many unnecessary lines with readLines() and looping in R, even with sapply(). The difference between the LLVM and native C approaches may be inherent, but is also possibly due to different optimization techniques that we may be able to enhance with LLVM. In short, we can outperform R's native vectorized code by compiling our relatively straightforward R code. The exposition of this example may make it seem more complicated than it is. Essentially, we want to efficiently read one line of a file at a time in order to get to the next line in our sample. We compiled the Fgets() function for this and then compiled two other functions in R to perform loops over the number of lines. The important implications from this example are that we can sidestep R's memory management, get fine-grained control over computations using dynamically generated routines, and use existing native routines and data structures, such as fgets() and FILE, in our R code that will be compiled. We could already dynamically call native routines directly from R using, for example, rdyncall (Adler (2012)) or Rffi (Temple Lang (2011)). What is important here is that we are also compiling the iterations and not doing them in R.

2.4 Fusing Loops

Consider computing the log-likelihood for a given vector of observations x and a density function, say, dnorm(). In R, we can write the log-likelihood very efficiently as

    sum(log(dnorm(x, mu, sigma)))

Indeed, we could reduce this to sum(dnorm(x, mu, sigma, log = TRUE)), but the purpose of this example is to consider a general sequence of calls to vectorized R functions.
Each of the functions dnorm(), log() and sum() is built into R and implemented in C code, and two of them use the very efficient .Primitive() mechanism. As a result, this code seems to be as fast as it can be. This is true given the way R interprets the expression, one sub-expression at a time. However, there are two ways we can make this more efficient by compiling such an expression. The first is by reducing the number of loops (in C) from three to one. Generally, if we have n nested calls to element-wise, vectorized functions, we can reduce n loops to one. Second, we can typically eliminate at least one allocation of a potentially large vector. A third way we might speed up the computations is to use parallel capabilities such as a GPU or multiple cores or CPUs. We won't discuss this here, but it is conceptually quite straightforward to do generally when we are compiling the code dynamically. Indeed, the ability to programmatically combine a particular function with a general parallel strategy makes it more expedient than writing it ourselves in C/C++. How does R evaluate the expression above? It uses three separate loops. Ignoring pedantic details, R essentially evaluates the call to dnorm() and so loops over all of the elements of x and computes the density at each of those values. It stores these values in a newly allocated vector and then returns that. This becomes the input to the call to log(). R then iterates over the elements of this vector and computes the log() of each individual value. In this case, R may recognize that it doesn't need to create a new vector in which to return the results, but that it can reuse the input vector, since it is essentially anonymous.
The final step in the overall expression is the call to sum(), which iterates over the elements of the vector it receives and returns a single scalar value. Importantly, there are three loops over three vectors, all of the same length, and we allocate one new and large vector. We could use a single loop and avoid allocating this intermediate vector by rewriting the code as

    normalLogLik = function(x, mu = 0, sigma = 1)
    {
        ans = 0
        for(val in x)
            ans = ans + log(dnorm(val, mu, sigma))
        ans
    }

Instead of the vectorized calls in R, we have put scalar function calls inside a single loop. We have combined the calls to dnorm() and log() together. Then we took the result for each element and added it to the cumulative sum. This combination of operations is called loop fusion and, for large vectors, it can yield significant performance improvements. This new scalar version is faster by avoiding the loops and allocation. Of course, it is evaluated in R and so will be much slower. We could write this in C, but it would be very specific to the log-likelihood for a Normal density. Generally, we would have to write implementations for various sequences of calls, for example, for different density functions [i.e., sum(log(pdf(x, ...)))], and expressions involving other functions [e.g., prod(dchisq(x^2, p))]. This isn't practical. However, given our ability to dynamically generate native code, we can compile any expression such as our original expression sum(log(dnorm(x, mu, sigma))) into the native equivalent of our scalar code above. To compile the normalLogLik() function above, we need to be able to call scalar versions of the log() and dnorm() routines. The log() function is available in the ubiquitous math library (libm) and we can just refer to it. The Normal density function is not standard.
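Before turning to compilation, a plain-R sanity check confirms that the fused scalar loop above computes the same value as the vectorized expression, up to floating-point rounding:

```r
## The fused loop gives the same log-likelihood as the vectorized form.
normalLogLik = function(x, mu = 0, sigma = 1)
{
    ans = 0
    for(val in x)
        ans = ans + log(dnorm(val, mu, sigma))
    ans
}

set.seed(1)
x = rnorm(1e5, mean = 2, sd = 3)
vectorized = sum(log(dnorm(x, 2, 3)))
fused = normalLogLik(x, 2, 3)
stopifnot(abs(vectorized - fused) < 1e-8 * abs(vectorized))
```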
We can arrange for our native code to invoke R's dnorm() function for each scalar value in the vector. This is both awkward and inefficient. Instead, we can write our own version of dnorm() directly in R. While this would be slow to invoke many times in R, we will compile our dnorm() and normalLogLik() functions together into a single module and both will be fast. Another possible approach, in this case, is to take advantage of the good design and modularity of the Rmath library. It provides the routine dnorm4() as a regular native routine (unconnected with R's data types, etc.) and so we can invoke it, just as we do the log() routine.

For reasons that are not quite clear at present, on the OS X machine our loop-fused version takes about 40% longer than the R code for 10 million observations and 27% longer for 100 million observations. Again, we suspect that we will be able to improve the LLVM-compiled code by exploring more of its optimization facilities. However, on the Linux machines we do see a speedup, even for 10 million observations, where the LLVM loop-fused code runs in about 75% of the time of the R code. The timings and relative performances are given in Table 4.

Table 4: Times and relative performance for fusing loops

                                      OS X    Linux 1  Linux 2
Rllvm-compiled fused loops            0.73    1.69     0.84
Interpreted R vectorized functions    0.52    2.28     1.07
Regular R time / Rllvm time           0.71    1.35     1.27

The first two rows show the times for fusing the loops by compiling the R code with LLVM and for using a sequence of calls to R's vectorized functions. The final row shows the ratio of the two times within each machine. Fusing the loops is slower on OS X, but faster on Linux.
Regardless of the exact numbers, the results indicate that compiling our own code is competitive with manually writing vectorized routines in R, and that we can outperform these built-in C routines. A difference between the two approaches is that R uses the .Primitive() mechanism rather than a standard function call, which we have to make via the .llvm() function. However, not only do we reduce three loops to one, but we also avoid dealing with missing values (NAs) and additional parameters such as base for the log() function. So we should be doing even better. If we have access to multiple cores or GPUs, we may be able to execute this code much more efficiently simply via parallel execution. By fusing the loop operations together, we also avoid three separate transitions from the host to the GPU, and avoid transferring memory between the two systems more times than we need. We explicitly wrote the normalLogLik() function to show how to fuse the loops. We could also have written the original expression sum(log(dnorm(x, mu, sigma))) as

    Reduce(`+`, Map(log,
                    Map(dnorm, x,
                        MoreArgs = list(mu, sigma))))

By explicitly using these functional programming concepts, it is easy for us to see how to fuse loops and rewrite the code into the loop above. The RLLVMCompile package can recognize such an expression and compile it to the loop-fused instructions. We can either require R programmers to write code in this form in order to gain the performance of native code, or we can try to make the compiler recognize the vectorized nested function call idiom of the form f(g(h(x))).
2.5 Computing Distances between Observations

Distances between pairs of observations are important in common statistical techniques such as clustering, multi-dimensional scaling, support vector machines and many methods that use the "kernel trick" (Schölkopf and Smola (2001)). R provides the dist() function, which allows us to compute the distance between all pairs of observations in a matrix or data frame, using any of six different metrics. The core computations are implemented in C and are fast. However, there are some issues and rigidities. The dist() function insists that the data passed to the C code be represented as a matrix, and so will make a copy of the data if a data frame is given by the caller. For large data sets, this can be a significant issue, as we will essentially have two copies of the data in memory. Also, the dist() function only accepts a single data set and computes the distances between all pairs of observations within it. In contrast, a reasonably common situation is that we start with two separate data sets, X and Y, and want to compute the distance between each observation in X and each observation in Y, but not the distances between pairs of observations within X or within Y. Not only do we risk having three copies of the data in memory (the two separate data frames, the two combined into one data frame and then converted to a matrix), but the dist() function will also perform many unnecessary computations for these within-same-set observations that we will discard. If we have two data sets with n1 and n2 observations, respectively, the dist() function computes (n1 + n2) × (n1 + n2 − 1)/2 distances. We are only interested in n1 × n2 of these.
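To make the waste concrete, a quick calculation for the sizes used later in Table 5 (8000 and 1000 observations) shows that only about 20% of the distances dist() computes are of interest, matching the 80% discard figure quoted in this section:

```r
## Fraction of dist() computations actually needed when combining
## two data sets of n1 and n2 observations.
n1 = 8000; n2 = 1000
total  = (n1 + n2) * (n1 + n2 - 1) / 2   # distances dist() computes
wanted = n1 * n2                         # cross-set distances we keep
round(wanted / total, 3)                 # about 0.198: roughly 80% is wasted
stopifnot(abs(wanted / total - 0.198) < 0.001)
```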
As n1 and n2 diverge, the number of unnecessary computations increases, and this is especially burdensome if the number of variables for each observation is large. Another rigidity is that the choice of distance metric is fixed. If we wanted to introduce a new distance metric, it would be useful to be able to reuse the C code underlying dist(). We could do this with a function pointer in C, but the code for dist() would need to be modified to support this. Accordingly, if we want to introduce a new metric, we have to copy or re-implement the entire C code. The C code underlying dist() can use parallel capabilities (OpenMP) if they are detected when R is compiled. We cannot use GPUs or change the parallel strategy within an R session without rewriting the C code. As a result, we would like to be able to express the computations in R and select a different strategy for parallelizing the computations at run-time. In short, as useful as dist() is, we would like it to be much more flexible. We want to be able to compute the distances between two sets of observations, not within a single data set; use a data frame or a matrix or perhaps some other data representation without making a copy of the data; introduce new metrics within the same infrastructure; and use different parallel computing approaches. The current dist() function in R cannot help us meet these goals and is essentially static/fixed code. The package pdist (Wong (2013)) provides a way to compute pairwise distances between two data sets. This avoids the redundant computations. Unfortunately, it only supports the Euclidean metric and also insists on matrices being passed to the C code. Also, it has no support for parallel computing.
If we could write the basics of the dist() function in R and make it fast, we could address all of the enhancements we listed above, as well as make the code more comprehensible and accessible to users. The basic approach to computing the distance between each pair of observations in two data sets X and Y can be expressed in R with the following quite specific/rigid function (written to aid compiling):

    dist = function(X, Y,
                    nx = nrow(X), ny = nrow(Y),
                    p = ncol(X))
    {
        ans = numeric(nx * ny)
        ctr = 1L
        for(i in 1:nx) {
            for(j in 1:ny) {
                total = 0.0
                posX = i
                posY = j
                for(k in 1:p) {
                    total = total + (X[posX] - Y[posY])^2
                    posX = posX + nx
                    posY = posY + ny
                }
                ans[ctr] = sqrt(total)
                ctr = ctr + 1L
            }
        }
        ans
    }

The basic steps are to loop over each observation in the first data set (X) and then to loop over each observation in the other data set (Y). For each pair of observations, we compute the distance between them via the third nested loop. We could have made this simpler (and more general) by using a vectorized R expression or calling a function to do this final loop. However, we have inlined the computations directly for a reason. Suppose we had written this part of the computation as (X[i,] - Y[j,])^2. Unfortunately, in R, this would cause us to create two new intermediate vectors, one for each of the specific rows in the two data sets. This is because the row of each data set is not a simple vector containing the elements of interest that we can pass to the subtraction function (-). Instead, we have to arrange the data in each row of the matrix or data frame into a new vector of contiguous values. This is where R is convenient, but inefficient. This does not happen in the C code for R's builtin dist() routine, or in ours, as it uses matrices and knows how to access the elements individually rather than creating a new temporary vector.
We use this same approach in our loop. We also could allocate the vectors for the row values just once and reuse them for each observation, but we would still have to populate them for each different observation. To avoid the intermediate vectors, our code explicitly accesses the individual elements X[i, k] and Y[j, k] directly. A matrix in R is merely a vector with the elements of the matrix arranged sequentially in column order. Therefore, the first element of observation i in X is at position i in the vector. The second element of the ith observation is at position i + nrow(X), and so on. To compute the distance between the two observations, we loop over the p variables present in each of the observations and compute the difference. The code illustrates these computations for the Euclidean distance. We could easily change this to implement other distance metrics. We could do this by changing the code either manually or programmatically, by replacing the expression (X[posX] - Y[posY])^2 with, for example, abs(X[posX] - Y[posY]). Rewriting code programmatically is a powerful feature that allows us to treat R code as a template. We can compile this three-level nested loop R code via RLLVMCompile to native instructions.

Table 5: Timings for computing pair-wise distances

                                            OS X    Linux 1  Linux 2
Rllvm-compiled code                         8.72    11.94    6.22
R dist() function (calling native code)     14.74   79.65    27.37
Speedup factor                              1.69    6.67     4.4

This shows the total elapsed time for distance computations with 40 variables and 8000 and 1000 observations in the two data sets. In the R approach with the dist() function, there is extra memory allocation and 80% of the distances computed are discarded. We outperform the native R implementation on both platforms.
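The column-major layout described above can be checked directly in plain R: the linear index i + (k − 1) * nrow(X) addresses element X[i, k], which is why the loop can step across a row with a stride of nrow(X):

```r
## A matrix is a vector in column order: element (i, k) sits at
## linear position i + (k - 1) * nrow(X).
X = matrix(1:12, nrow = 3)   # 3 observations, 4 variables

i = 2L; k = 3L
stopifnot(X[i + (k - 1L) * nrow(X)] == X[i, k])

## Walking across row i with a stride of nrow(X), as the dist()
## loop above does with posX = posX + nx:
pos = i + (0:(ncol(X) - 1L)) * nrow(X)
stopifnot(identical(X[pos], X[i, ]))
```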
Our compiler currently works primarily with primitive data types and has limited support for working directly with R objects, for example, knowing the dimensions of an R matrix. Accordingly, we arrange to pass the matrices and their dimensions to the routine and currently have to explicitly specify the signature:

    distc = compileFunction(dist, REALSXPType,
                            list(X = DoublePtrType, Y = DoublePtrType,
                                 nx = Int32Type, ny = Int32Type,
                                 p = Int32Type))

In the future, we will allow the caller to specify just the two data sets (X and Y). However, we are making the representation as matrices more explicit here, which is valuable information for the compiler. Now that we have the native code, we can compare it to R code that computes the same distances but does so by combining the two data sets, calling dist(), and converting the result to the sub-matrix of interest. This comparison favors our code, since this is the form of our inputs and the expected form of the output. However, these are quite reasonable. We timed the functions to compute the distances for two data sets of 8000 and 1000 observations, each with 40 variables. In this case, 80% of the distances computed using R's dist() function are irrelevant and discarded. Table 5 shows the results and illustrates that, by doing fewer computations, we do indeed outperform the native C code in R on both platforms. If we had used data sets with similar numbers of observations, the results would have been less dramatic. However, with 3000 observations in Y, the LLVM-generated native code was still three times faster on Linux and only 18% slower on OS X. Comparing the results above to similar native code in the pdist package, the timings again show that the native C code in pdist outperforms our LLVM-compiled code, 60% faster on one machine and 9 times faster on another.
This illustrates that there is room for significant improvement in our LLVM compilation. However, the fact that we can outperform R's native approach is encouraging. That we can readily adapt this to different purposes and different computational strategies indicates significant opportunities and potential. As a final note, we could remove the third loop and insert a call to a function that computes the distance between the two observations, for example, euclidean(X[i,], Y[j,]). The compiler could recognize that X and Y are matrices and arrange for the compiled version of the euclidean() function to access the elements as we have displayed above, that is, without computing the intermediate vector for each row. If we tell the compiler that X and/or Y are data frames, it would generate different code to access the elements so as to avoid these intermediate vectors. Since the compiler has the opportunity to compile both the code for the main loop and the metric function together, and knows the representations of the inputs, it can create better code than if we wrote these separately and more rigidly.

3. POSSIBLE COMPILATION ENHANCEMENT STRATEGIES

The examples in the previous section explored different ways we could change the way we compute in R with new facilities for generating native code. We considered compiling R code to native routines, reusing existing native routines within these generated routines, and changing the computational strategies we employ within R to embrace these new approaches. There are many other simple examples we could consider to improve the performance of R code. One is the ability to write functions that focus on scalar operations and then to create vectorized versions of these automatically. Given a scalar function f(), we can write a vectorized version as sapply(x, f, ...) or with mapply().
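In plain R, the scalar-function idiom just described looks like this (logit() is merely an illustrative scalar function, not one from the paper):

```r
## A scalar function and its "vectorized" wrapper via sapply().
logit = function(p)
    log(p / (1 - p))

logitVec = function(x)
    sapply(x, logit)

p = c(0.1, 0.5, 0.9)
stopifnot(all(abs(logitVec(p) - log(p / (1 - p))) < 1e-12))
```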
The compiler can then turn this into a native loop. Indeed, many of the performance gains are achieved by making looping faster. They also potentially reduce the necessity of using vectorized code in R and so hopefully make programming in R more intuitive for new users. In addition to handling loops, there are several other aspects of R's evaluation model that we might be able to improve by choosing different compilation strategies in different contexts. The idea is that the R user compiling the code may have more information about the computations, the data and its representation, or the available computing resources than the compiler can obtain by examining the code. This extra information is important. The programmer may be able to give hints to a compiler, or choose a different compiler function/implementation altogether, to control how the code is understood and the native instructions are generated. The following are some reasonably obvious and general improvements we might be able to infer or make in certain situations. Guided by the R user, different compilers may yield different code, and even different semantics, for the same code.

3.1 Omitting Checks for NA Values

Many of the C routines in R loop over the elements of a vector and must check each element to see if it is a missing value (NA). This is general purpose code, and so this test is a fixed part of almost every computation involving that routine. However, when we dynamically generate the code, we may know that there are no missing values in the data set on which we will run that code, and so omit the code that performs these additional, redundant tests. Similarly, in our example of sampling a CSV file, we knew the number of lines in the file and we knew that each call to the fgets() routine would succeed.
As a result, we did not have to check the return status of the call for reading at the end of the file. We also assumed that the largest line was less than 1000 characters and didn't validate this in each iteration. The same applies when we are accessing elements of a vector, as to whether we first need to check that the index is within the extent of the array, that is, bounds checking. When we can verify this conceptually (within a loop over a vector), or by declaration by the user, we can omit these checks. These tests are typically simple and not computationally expensive. However, they can become significant when the instructions are invoked very often, for example, in a loop over the elements of a large vector.

3.2 Memory Allocation

In our example discussing loop fusion, we saw that not only could we reduce the number of overall iterations in a computation, but we also could reduce memory usage. We avoided creating a vector for the result of the call to dnorm() [and log()]. There are potential opportunities to further reduce memory usage. R uses the concept of pass-by-value in calls to functions. In theory, R makes a copy of each argument in a call to a function. (Lazy evaluation means that some arguments are never evaluated and so not copied.) However, the R interpreter is smarter than this and only copies an object when it is modified, and only if it is not part of another object. When compiling R code, we want to be able to determine that an object is not modified and avoid copying it. By analyzing code, we can detect whether parameters can be considered read-only and so reduce memory consumption in cases where R cannot verify that it is safe to avoid copying an object. We can identify this within regular R code; however, we would have to modify the interpreter to make use of this information.
When generating native code with, for example, compileFunction(), we can make use of this information dynamically, bypassing the R interpreter.

Another example where we can reduce the memory footprint of code is when we can reuse the same memory from a different computation. For example, consider a simple bootstrap computation something like the following R pseudo-code:

    for(i in 1:B) {
        d.star = data[sample(1:n, n, replace = TRUE), ]
        ans[[i]] = T(d.star, ...)
    }

In R's computational model, we will allocate a new data frame d.star for each bootstrap sample. This is unnecessary. We can reuse the same memory for each sample, as each sample has the same structure and only differs in the values in each cell. By analyzing the sequence of commands rather than executing each one separately without knowledge of the others, we can take advantage of this opportunity to reuse the memory. We can also reuse the same vector to store the result of the repeated calls to sample(). It is reasonably clear that this is what we would do if we wrote this code in C, reusing the same data structure instances. However, this is not possible within R, as the individual computations are not as connected as they are in the large-picture C code. When we dynamically generate native code, we can utilize this large-scale information.

Similarly, some R scripts create a large object, perform several computations on it and then move to other tasks. Code analysis can allow us to identify that the object is no longer being used and so we can insert calls to remove the object. However, we may be able to recognize that the object is no longer needed, but that subsequent tasks can reuse the same data format and representation. In that case, we can reuse the memory or at least parts of it.
3.3 Data Representations

The small number of fundamental data types in R makes computational reasoning quite simple, both to use and to implement. Of course, the choice of data type and structure can be important for many computations. Sequences, for example, 1:n or seq(along = x), are common in R code and these are represented in R as explicit vectors containing all of the values in the sequence. We have seen that we can avoid creating the sequence vector and populating it when it is used as a loop counter. Similarly, we can represent a regular sequence with the start, end and stride, that is, the increment between elements. When generating the code, we then access elements of such a sequence using appropriate calculations specialized to that sequence type.

In many cases, R's simple data types cause us to use an integer when we only need a byte, or even just a few bits, to represent a few possible values/states. The snpStats (Clayton (2011)) package does this successfully, using bytes to reduce the memory footprint for large genomic data. Again, the operations to subset data in this different format need to be modified from the default. Doing this element-wise in R is excessively slow. However, when we generate native code, we are free to use different ways to access the individual elements. This idea is important. We specify what to do in the code, but not precisely how to do it. When generating the code, we combine the code and information about how to represent the data and generate different code strategies and realizations. This is somewhat similar to template functions in C++, but more dynamic due to run-time compilation/generation with more contextual information.

There are several other aspects of R code that we can compile, for example, matching named arguments at compile time rather than at run time.

4. OVERVIEW OF GENERATING CODE WITH LLVM

In this section we will briefly describe the basic ideas of how we generate code with LLVM, Rllvm and RLLVMCompile. This is a little more technical and low level than our examples, and readers do not need to understand this material to understand the main ideas of this paper or to use the compiler or the compiled code. We are describing it here to illustrate how other R programmers can readily experiment with these tools to generate code in different ways.

We'll use the Fibonacci sequence and the fib() function example again, as it illustrates a few different aspects of generating code. Our fib() function in R expects an integer value and returns an integer. The body consists of a single if–else expression. This contains a condition to test and two blocks of code, one of which will be evaluated depending on the outcome of that condition. To map this code to LLVM concepts, we need to create different instruction blocks.

    ; Module ID = 'fib'

    define i32 @fib(i32 %n) {
    entry:
      %0 = icmp slt i32 %n, 2
      br i1 %0, label %"body.n < 2L", label %body.last

    "body.n < 2L":                        ; preds = %entry
      ret i32 %n

    body.last:                            ; preds = %entry
      %"n - 1L" = sub i32 %n, 1
      %1 = call i32 @fib(i32 %"n - 1L")
      %"n - 2L" = sub i32 %n, 2
      %2 = call i32 @fib(i32 %"n - 2L")
      %"fib(n - 1L) + fib(n - 2L)" = add i32 %1, %2
      ret i32 %"fib(n - 1L) + fib(n - 2L)"
    }

Fig. 3. Intermediate Representation for the compiled fib() routine. We create the Function and Block objects and create and insert LLVM instruction objects corresponding to the expressions and sub-expressions in the R function. The result is this low-level description in intermediate form which LLVM can optimize and compile to native code for different targets, for example, a CPU or GPU.
The different blocks have a label (e.g., body.last) and correspond to different parts of an if statement or possibly parts of a loop, generally.

Each block contains one or more instructions. When we call the routine, the evaluation starts in the first instruction block and executes each of its instructions sequentially. The end of each instruction block has a terminator which identifies the next block to which to jump, or returns from the routine. Jumping between blocks allows us to implement conditional branching, loops, etc. For our fib() function, we start with an entry block that might create any local variables for the computations. In our function, this block simply contains code to evaluate the condition n < 2 and, depending on the value of this test, the instruction to branch to one of two other blocks corresponding to the expressions in the if and the else parts. In the if block (i.e., n is less than 2), we add a single instruction to return the value of the variable n. In the block corresponding to the else part, we add several low-level instructions. We start by computing n - 1L and then call fib() with that value and store the result in a local variable. Then we calculate n - 2L, call fib() and store that result. Then we add these two local intermediate results and store the result. Finally we return that result. Figure 3 shows the code in what is called Intermediate Representation (IR) form that LLVM uses. This illustrates the low-level computations.

While the code for this function is reasonably simple, there are many details involved in generating the native code, such as defining the routine and its parameters, creating the instruction blocks, loading and storing values, and creating instructions to perform subtraction, call the fib() function and return a value.
The LLVM C++ API (Application Programming Interface) provides numerous classes and methods that allow us to create instances of these conceptual items such as Functions, Blocks, many different types of instructions and so on. The Rllvm package provides an R interface to these C++ classes and methods and allows us to create and manipulate these objects directly within R. For example, the following code shows how we can define the function, the entry instruction block and generate the call fib(n - 1):

    mod = Module()
    f = Function("fib", Int32Type, list(n = Int32Type), mod)
    start = Block(f)
    ir = IRBuilder(start)
    parms = getParameters(f)
    n.minus.1 = binOp(ir, Sub, parms$n, createConstant(ir, 1L))
    createCall(ir, f, n.minus.1)

We don't want to write this code manually ourselves in R, although Rllvm enables us to do so. Instead, we want to programmatically transform the R code in the fib() function to create the LLVM objects. The RLLVMCompile package does this. Since R functions are regular R objects which we can query and manipulate directly in R, we can traverse the expressions in the body of a function, analyze each one and perform a simple-minded translation from R concepts to LLVM concepts. This is the basic way compileFunction() generates the code, using customizable handler functions for the different types of expressions. These recognize calls to functions, accessing global variables, arithmetic operations, if statements, loops and so on. They use the functions in Rllvm to create the corresponding LLVM objects and instructions.

Once the compiler has finished defining the instructions for our routine, LLVM has a description of what we want to do in the form of these blocks and instructions. This description is in this intermediate representation (IR).
We can look at this "code" and it will look something similar to that shown in Figure 3. The IR code shows the somewhat low-level details of the blocks and instructions as we described above. We see the three blocks labeled entry, "body.n < 2L" and body.last. Again, it is not important to understand these details to be able to use the compiled routine translated from the R function. I show it here to illustrate the different steps in the compilation process and to indicate that an R programmer can choose to change any of these steps.

Next, we instruct LLVM to verify and optimize the code. At this point, we can call the new routine via the .llvm() function in Rllvm which corresponds to a method in the LLVM API. The first time the code is used, LLVM generates the native code from the IR form.

5. CONTRASTS WITH RELATED RESEARCH

There have been several projects exploring how to improve the performance of R code. We discuss some of these in this section.

Byte-Compiler: One of the most visible projects is the byte-code compiler developed by Luke Tierney (Tierney (2001)). This consists of an R package that provides the compiler and some support in the core R interpreter to execute the resulting byte-compiled code. The compiler maps the R code to instructions in the same spirit as LLVM's instructions and intermediate representation. These instructions are at a higher level than LLVM's and are more specific to R. The typical speedup provided by the byte-compiler is a factor of between 2 and 5, with much larger speedups on some problems. This may not be sufficient to obviate the need for writing code in C/C++. We probably need to see a factor of more than 10 and closer to 100 for common tasks. The byte-code compiler is written in R and so others can adapt and extend it.
However, the details of how the resulting byte-code is evaluated are tightly embedded in the C-language implementation of the R engine. This means that if one wants to change the byte-code interpreter, one has to modify the R interpreter itself. While one can do this with a private version of R, one cannot make these changes available to others without them also compiling a modified version of R. In other words, the byte-code interpreter is not extensible at run-time or by regular R users. Furthermore, the R core development team does not always greet suggested enhancements and patches with enthusiasm. Therefore, this approach tends to be the work of one person and so has limited resources.

Ra JIT Compiler: The Ra extension to R and the associated jit package is another approach to using JIT (Just-in-Time) compilation in R. This focuses on compiling loops and arithmetic expressions in loops. Like the byte-code compiler above, Ra requires almost no change to existing R code by R users; the only change needed is the call to the function jit() before evaluating the code. The performance gain on some problems can apparently be as high as a factor of 27. (See http://www.milbo.users.sonic.net/ra/times9.html.) Unfortunately, this is no longer maintained on CRAN, the primary central repository of R packages. This approach suffers from the fact that it requires a modified version of the R interpreter, again compiled from the C-level source. This places a burden on the author of Ra to continually update Ra as R itself changes. Also, it requires users to trust Ra and take the time to build the relevant binary installations of Ra.
As we mentioned previously, important motivating goals in our work are to avoid modifying R itself, to allow other people to build on and adapt our tools, and to directly leverage the ongoing work of domain experts in compiler technology by integrating their tools to perform the compilation. Our approach differs from both the byte-compiler and Ra in these respects.

R on the Java Virtual Machine: There are several projects working on developing an implementation of R using the Java programming language and virtual machine. One is FastR (https://github.com/allr/fastr), which is being developed by a collaboration between researchers at Purdue, Oracle and INRIA. Another is Renjin (https://code.google.com/p/renjin/). Having R run on the Java virtual machine offers several benefits. There are many interesting large-data projects implemented in Java, for example, Apache's Hadoop and Mahout. Integration between R code and such projects and their functionality would be much tighter and more effective if they were all on the same platform and shared the same computational engine. Importantly, R would benefit from passively acquiring features in Java and its libraries, for example, security and threads. One very interesting development is that researchers at Oracle, collaborating with the developers of FastR, are creating a tool, Graal (http://openjdk.java.net/projects/graal/), for compiling the code ordinarily interpreted by a high-level virtual machine, for example, FastR. This could yield the performance gains we seek, but by passively leveraging the general work of others merely by using widespread technologies.
This contrasts with the ongoing development of R by a relatively small community and having to actively and manually import new technologies, features and ideas from other languages, systems and communities.

Translating R code to C: Another approach is to translate R code to C/C++ code. This is attractive, as it would give us similar speedups to those we can get with LLVM, potentially produces human-readable code, and allows us to leverage the standard tools for these languages such as compilers, linkers and, importantly, debuggers. We can also potentially reuse (some of) the generated code outside of R. Simon Urbanek's r2c package (Urbanek (2007)) is an example of exploring translation of R code to C code.

The Rcpp (Eddelbuettel and François (2011)) (and inline) packages are widely used in R to improve performance. The packages provide a way to include high-level C++ code within R code and to compile and call it within the R code. The C++ code uses an R-like syntax to make the C++ code relatively easier to write. This has been a valuable addition to R for obtaining performant code. However, approaches that compile R code directly are preferable if they can get the same performance. The first reason is that the programmer does not have to program in C++, and the C++ code is harder for other programmers to read and understand. The second reason is that the C++ code is essentially opaque to any R code analysis or compiler. If we do manage to compile R code effectively in general or implement an automated parallel computing strategy for R code, the C++ code cannot easily be part of this. For example, if we can map the R code to run as a GPU kernel on many cores, we cannot easily combine the R and C++ code to take advantage of these cores.
Parallel Speedup: There are several interesting projects that have aimed at improving the performance of R exclusively by running the code in parallel. This is very important and in some sense orthogonal to compilation of R code. If we speed up the computations on a single CPU, that speedup will benefit running code on each CPU. However, we also want to compile R code to take advantage of multiple CPUs/GPUs. We hope to be able to integrate ideas from these projects into our compilation strategies. Unfortunately, some of them are no longer active projects, for example, pR and taskPR. This illustrates one aspect we have observed in the R community. Some researchers implement some ideas in R, sometimes as a PhD thesis, and then move on to other projects. One of the terrific aspects of R is the ongoing commitment to support the R community. This is probably a very significant reason for R's widespread use and an important consideration when developing new environments and languages. It is one of the forces motivating our continued work within R, even if developing a new system would be intellectually more stimulating.

A very important aspect of all this work is to recognize that there are many positive ways to make R faster and more efficient. While one of these approaches may dominate others in the future, it is very important that we pursue comparative approaches and continue to motivate each other's work. There is much to be learned from these different approaches that will improve the others.

6. FUTURE WORK

Compiling (subsets of) R code and other Domain Specific Languages (DSLs) within R using LLVM is a promising approach that is certainly worth vigorously pursuing in the near term.
The work is currently in its infancy; we started it in the summer of 2010, but have only recently returned to it after an almost three-year hiatus due to other projects (ours and other people's). However, the foundations of many of the important components are in place; that is, the Rllvm package and the basics of the extensible and adaptable compiler mechanism in RLLVMCompile should allow us and others to make relatively quick progress, programming almost entirely in R to develop compilation strategies. Even so, there are many other tasks to do to make these transparent and reliable, and many related projects that will make them more powerful and convenient.

One of the immediate tasks we will undertake is to program some rich examples explicitly in R code. We are implementing R code versions of recursive partitioning trees, random forests and boosting. We also plan to explore compiling code for the Expectation Maximization (EM) algorithm and particle filters to run on GPUs. The aim is to share these sample R projects with the other researchers investigating different compilation strategies for R so that we can compare approaches on substantive and real tasks we want to program in pure R code.

We plan to add some of the functionality available in LLVM that does not yet have bindings in Rllvm. This includes topics such as different optimization passes and adding metadata to the instructions.

We have also developed the initial infrastructure to compile R code as kernel routines that can be used on GPUs, that is, PTX (Parallel Thread Execution) code. Being able to generate kernel functions from R code, along with the existing R-CUDA bindings to manage memory and launch kernels from the host device, allows us to program GPUs directly within high-level R code.
This contrasts with the low-level C code developed for existing R packages that target GPUs, for example, gputools (Buckner et al. (2009)) and rgpu (Kempenaar and Dijkstra (2010)).

We will also be exploring different approaches to compiling the R code to run in parallel and distributed settings. We think that being able to use information about the distribution of the data to generate/compile the code will be important so that we can minimize the movement of data and keep the CPUs/GPUs busy on the actual computations rather than transferring the inputs and outputs to and from the computations.

Being able to write R code that directly calls C routines is very powerful. As we saw in relation to the fgets() routine in Section 2.3, we need to specify the signature for the routines we want to call. It is preferable to be able to programmatically identify these signatures rather than require R programmers to explicitly specify them. The RCIndex package is an R interface to libclang (Carruth et al. (2007)), the parsing facilities for the clang compiler. This already allows us to read C and C++ code in R and to identify the different elements it contains. This allows us not only to determine the signatures of routines, but also to discover different data structures, enumerated constants, etc. We can also go further and understand more about how the routines manipulate their arguments and whether they perform the memory management or leave it to the caller.

As we saw in each of our examples, information about the types of each parameter and local variable is a necessity for being able to compile using LLVM. Currently, the R programmer must specify this information not only for the function she is compiling, but also for all of the functions it calls.
Again, we want to make this transparent, or at least only require the R programmer to specify this information when there is ambiguity. To this end, we are working on a type inference package for R. This starts with a known set of fundamental functions and their signatures. From this, we can determine the signatures of many higher-level calls. As always, we cannot deal with many features of the language such as nonstandard evaluation, but we most likely can get much of the type information we need programmatically. Since R's types are so flexible, with different return types based on not only the types of the inputs but also the content of the inputs, we need a flexible way to specify types. Perhaps the existing TypeInfo package (Temple Lang and Gentleman (2005)) or the lambda.r package will help here. To analyze code for type information and for variable dependencies, we will build upon the CodeDepends (Temple Lang, Peng and Nolan (2007)) and codetools (Tierney (2011)) packages.

While these are some of the related activities we envisage working on, we also encourage others to collaborate with us or work independently using LLVM and optionally Rllvm and RLLVMCompile so that our community ends up with better tools.

7. CONCLUSION

We have described one approach to making some parts of the R language fast. We leverage the compiler toolkit infrastructure LLVM to generate native code. This allows us to incorporate technical knowledge from another community, both now and in the future. We can generate code for CPUs, GPUs and other targets. We can dynamically specialize R functions to different computational approaches, data representations and sources, and contextual knowledge, giving us a new and very flexible approach to thinking about high-level computing.
We are developing a simple but extensible and customizable compiler in R that can translate R code to native code. Not only does this make the code run fast, but it also allows us to compute in quite different ways than when we interpret the R code in the usual way. We can even outperform some of R's own native code.

In no way should this work be considered a general compiler for all of the R language. There are many aspects of the R language we have not yet dealt with or considered. Vectorized subsetting, recycling, lazy evaluation and nonstandard evaluation are examples. We, or others, can add facilities to the compiler to support these when they make sense and are feasible.

The initial results from this simple approach are very encouraging. An important implication of this and other efforts to make R code efficient is that we can benefit from writing high-level code that describes what to compute, not how. We then use smart interpreters or compilers to generate efficient code, simultaneously freeing R programmers to concentrate on their tasks and leveraging domain expertise for executing the code. We hope others will be able to use these basic building blocks to improve matters and also to explore quite different approaches and new languages within the R environment.

ACKNOWLEDGMENTS

Vincent Buffalo made valuable contributions to designing and developing the RLLVMCompile package in the initial work. Vincent Carey has provided important ideas, insights, advice and motivation, and I am very grateful to him for organizing this collection of papers and the session at the 2012 Joint Statistical Meetings. Also, I appreciate the very useful comments on the initial draft of this paper by the three reviewers and also John Chambers.
SUPPLEMENTARY MATERIAL

The code for the examples in this paper, along with the timing results and their meta-data, is available from https://github.com/duncantl/RllvmTimings as a git repository. The versions of the Rllvm and RLLVMCompile packages involved in the timings can also be retrieved from their respective git repositories. The specific code used is associated with the git tag StatSciPaper.

REFERENCES

Adler, D. (2012). rdyncall: Improved foreign function interface (FFI) and dynamic bindings to C libraries. R package version 0.7.5.

Buckner, J., Wilson, J., Seligman, M., Athey, B., Watson, S. and Meng, F. (2009). The gputools package enables GPU computing in R. Bioinformatics 26 135–135. Available at http://cran.r-project.org/web/packages/gputools/index.html.

Carruth, C., Christopher, E., Gregor, D., Korobeynikov, A., Kremenek, T., McCall, J., Rosier, C. and Smith, R. (2007). libclang: C/C++ translation unit parser library. Available at http://clang.llvm.org.

Clayton, D. (2011). snpStats: SnpMatrix and XSnpMatrix classes and methods. R package version 1.5.0.

Eddelbuettel, D. and François, R. (2011). Rcpp: Seamless R and C++ integration. Journal of Statistical Software 40 1–18.

Ihaka, R. and Temple Lang, D. (2008). Back to the future: Lisp as a base for a statistical computing system. In Proceedings in Computational Statistics 21–33. Springer, Heidelberg.

Jones, E., Oliphant, T. and Peterson, P. et al. (2001). SciPy: Open source scientific tools for Python.

Kempenaar, M. and Dijkstra, M. (2010). R/GPU: Using the Graphics Processing Unit to speed up bioinformatics analysis with R. Available at https://trac.nbic.nl/rgpu/.

Lattner, C. and Adve, V. (2004). LLVM: A compilation framework for lifelong program analysis and transformation. In Proc. of the 2004 International Symposium on Code Generation and Optimization (CGO'04), San Jose, CA, USA 75–88. IEEE Computer Society, Washington, DC. Available at http://llvm.org/.

R Core Team (2013). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Schölkopf, B. and Smola, A. (2001). Learning with Kernels. MIT Press, Cambridge, MA.

Temple Lang, D. (2002). RCurl: General network (http/ftp/...) client interface for R. R package version 1.95-4.

Temple Lang, D. (2010a). RCIndex: R interface to the clang parser's C API. Available at https://github.com/omegahat/RClangSimple.

Temple Lang, D. (2010b). Rllvm: R interface to the Low-Level Virtual Machine API. Available at https://github.com/duncantl/Rllvm.

Temple Lang, D. (2011). Rffi: Interface to libffi to dynamically invoke arbitrary compiled routines at run-time without compiled bindings. R package version 0.3-0.

Temple Lang, D. (2013). FastCSVSample: An R package to sample lines from a text file. Available at https://github.com/duncantl/FastCSVSample.

Temple Lang, D. and Buffalo, V. (2011). RLLVMCompile: A simple LLVM-based compiler for R code. Available at https://github.com/duncantl/RLLVMCompile.

Temple Lang, D. and Gentleman, R. (2005). TypeInfo: Optional type specification prototype. R package version 1.27.0.

Temple Lang, D., Peng, R. and Nolan, D. (2007). CodeDepends: Analysis of R code for reproducible research and code comprehension. Available at https://github.com/duncantl/CodeDepends.

Tierney, L. (2001). Compiling R: A preliminary report. In Proceedings of the 2nd International Workshop on Distributed Statistical Computing, March 15–17, 2001 (K. Hornik and F. Leisch, eds.). Technische Universität Wien, Vienna, Austria.

Tierney, L. (2011). codetools: Code analysis tools for R. R package version 0.2-8.

Urbanek, S. (2007). r2c: R to C compiler. Available at http://www.rforge.net/r2c/index.html.

Wong, J. (2013). pdist: Partitioned distance function. R package version 1.2.

Zakai, A. (2010). Emscripten: An LLVM-to-JavaScript compiler. Available at https://github.com/kripken/emscripten/wiki.