Enhancing R with Advanced Compilation Tools and Methods
Authors: Duncan Temple Lang
Statistical Science 2014, Vol. 29, No. 2, 181–200. DOI: 10.1214/13-STS462. © Institute of Mathematical Statistics, 2014.

Abstract. I describe an approach to compiling common idioms in R code directly to native machine code and illustrate it with several examples. Not only can this yield significant performance gains, but it allows us to use new approaches to computing in R. Importantly, the compilation requires no changes to R itself, but is done entirely via R packages. This allows others to experiment with different compilation strategies and even to define new domain-specific languages within R. We use the Low-Level Virtual Machine (LLVM) compiler toolkit to create the native code and perform sophisticated optimizations on the code. By adopting this widely used software within R, we leverage its ability to generate code for different platforms such as CPUs and GPUs, and will continue to benefit from its ongoing development. This approach potentially allows us to develop high-level R code that is also fast, that can be compiled to work with different data representations and sources, and that could even be run outside of R. The approach aims to both provide a compiler for a limited subset of the R language and also to enable R programmers to write other compilers. This is another approach to help us write high-level descriptions of what we want to compute, not how.

Key words and phrases: Programming language, efficient computation, compilation, extensible compiler toolkit.

1. BACKGROUND & MOTIVATION

Computing with data is in a very interesting period at present and this has significant implications for how we choose to go forward with our computing platforms and education in statistics and related fields.
We are simultaneously (i) leveraging higher-level, interpreted languages such as R, MATLAB, Python and recently Julia, (ii) dealing with increasing volume and complexity of data, and (iii) exploiting, and coming to terms with, technologies for parallel computing including shared and nonshared multi-core processors and GPUs (Graphics Processing Units). These challenge us to innovate and significantly enhance our existing computing platforms and to develop new languages and systems so that we are able to meet not just tomorrow's needs, but those of the next decade.

Duncan Temple Lang is Associate Professor, Department of Statistics, University of California at Davis, 4210 Math Sciences Building, Davis, California 95616, USA (e-mail: duncan@r-project.org). This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in Statistical Science, 2014, Vol. 29, No. 2, 181-200. This reprint differs from the original in pagination and typographic detail.

Statisticians play an important role in the "Big Data" surge, and therefore must pay attention to logistical and performance details of statistical computations that we could previously ignore. We need to think about how best to meet our own computing needs for the near future and also how best to be able to participate in multi-disciplinary efforts that require serious computing involving statistical ideas and methods. Are we best served with our own computing platform such as R (R Core Team (2013))? Do we need our own system? Can we afford the luxury of our own system, given the limited resources our field has to develop, maintain and innovate with that system?
Alternatively, would we be better off reimplementing R or an R-like environment on a system that is developed and supported by other larger communities, for example, Python or Java? Or can we leave it to others to build a new, fast computing environment that we can leverage within the field of statistics? Would "doing it right" give us an opportunity to escape some of the legacy in our current systems and position ourselves for more innovation? Would developing a new system splinter our already small community and reduce our effectiveness in disseminating new statistical methods as effectively as we do through R's excellent package mechanism?

I have wrestled with these questions for over a decade. I don't believe there is a simple answer as to what is the best way to proceed. It is as much a social issue as a technical one. The R community is an amazing and valuable phenomenon. There is a large R code base of packages and scripts in widespread use for doing important science. Even if new systems do emerge and replace R, this will take several years. We need significant improvements in performance to make R competitive with other systems, at least for the near future. We must improve R's performance to allow us to continue to deal with larger and more complex data and problems. In this paper, I discuss one direct approach to improve the performance of R code that is extensible and enables many people to further improve it.

The essence of the approach I am suggesting is conceptually quite simple and emerged in numerous other languages and platforms around the same time we first started implementing it for R. The idea is that we compile R code directly to low-level native machine instructions that will run on a CPU or GPU or any device we can target.
Instead of insisting that R code be evaluated by the one and only R interpreter, we may generate the code to perform the equivalent computations in a quite different way. We can dynamically compile fast native code by combining information about the code, the data and its representation being processed by that code, the available computing "hardware" (i.e., CPU or GPU or multi-core), the location of the different sources of the data, whether we need to handle missing values (NAs) or not, and so on. This is a form of just-in-time (JIT) compilation. It leverages additional knowledge about the context in which code will be run. It maps the code to low-level machine instructions rather than having it evaluated by a high-level interpreter.

The approach presented here is quite different from how programmers typically improve performance for R code. They manually implement the slow, computationally intensive parts in C/C++, and call these routines from R. I call this "programming around R." Instead, I am trying to "compile around R." In this approach, statisticians use familiar R idioms to express the desired computations, but the compiler infrastructure generates optimized instructions to realize the intended computations as efficiently as possible on the hardware platform in use. The input from the statistician or analyst to this process is R code, not low-level C/C++ code. This is good because humans can more easily understand, debug, adapt and extend code written in R. Furthermore, the compiler can "understand" what is intended in the high-level code and optimize in quite different ways. This also allows the code to be optimized in very different ways in the future. The high-level code says what to do, but not how to do it. How is left to the compiler.
What makes this approach feasible and practical now is the availability of the Low-Level Virtual Machine Compiler Infrastructure (LLVM) (Lattner and Adve (2004)). LLVM is the winner of the 2012 Association for Computing Machinery System Software Award (the same award conferred on the S language in 1999) and is a highly extensible compiler toolkit. LLVM is a C++ library we can integrate into R (and other languages) to generate native code for various different CPUs, GPUs and other output targets. The ability to integrate this very adaptable and extensible tool into high-level languages is a "game changer" for developing compilation tools and strategies. We can use a technology that will continue to evolve and will be developed by domain experts. We can adapt these to our purposes with our domain knowledge. We do this within an extensible R package rather than in the R interpreter itself. This leaves the compilation infrastructure in "user" space, allowing development of any new compilation strategies to be shared without any changes to the R interpreter. This contrasts with R's byte-code compiler and the byte-code interpreter which is part of the fixed R executable.

The Rllvm (Temple Lang (2010b)) package provides R bindings to the LLVM C++ API. We can use this to generate arbitrary native code. The RLLVMCompile (Temple Lang and Buffalo (2011)) package provides a simple "compiler" that attempts to translate R code to native code by mapping the R code to instructions in LLVM, leaving LLVM to optimize the code and generate the native machine code.
Before we explore some examples of how we can use LLVM in R to improve performance and change our computational strategies for certain types of problems, it is worth thinking a little about the potential implications of fast R code:

• Alternative data models. On a practical level, if we can compile scalar (i.e., nonvectorized) R functions so that they are almost as fast as C/C++ code, we can use them to process individual observations in a streaming or updating manner. This means we can escape the highly-vectorized or "all data in memory" approaches that R strongly rewards.

• Exporting code to alternative execution environments. We can write R code and then export it to run in other different systems, for example, databases, Python, JavaScript and Web browsers. We can map the R code to LLVM's intermediate representation (IR). We can then use emscripten (Zakai (2010)) to compile this directly to JavaScript code. For other systems, we can share the IR code from R and they can use their own LLVM bindings to compile it to native code for the particular hardware.

• Richer data structures. R provides a small number of primitive data types, for example, vectors, lists, functions. We can currently use these to create new aggregate or composite data structures. However, we can only introduce new and different data structures such as linked lists and suffix trees as opaque data types programmed in native (C/C++) code. When we compile code to native instructions, we also have the opportunity to have that new code use these different data structures and to represent the data differently from R. The same R code can be merged with descriptions of new data types to yield quite different native instructions that are better suited to particular problems.

• Templating concepts.
Our ability to create native code from R code allows us to think about R functions or expressions differently. They are descriptions of what is to be done, without the specifics of how to do them. An R compiler can rewrite them or generate code that will behave differently from the R interpreter but give the same results (hopefully). The functions are "templates." The compiler can use knowledge of the particular representation of the data the functions will process to generate the native code in a more intelligent manner. For example, the code may access elements of a two-dimensional data set (rows and columns). There are two very different representations of this in R: data frames and matrices. How the individual elements are accessed for each is very different. The compiler can generate specialized code for each of these and might even change the order of the computations to improve efficiency (cache coherency) for these representations. The function is not tied to a particular data representation.

In summary, compiling R programs through LLVM yields novel computational potentials that are directly relevant to improving statistical learning and communication in the big data era. Compiling high-level code to native code is used in many systems. Julia is an interesting modern project doing this. NumPy (Jones et al. (2001)) in Python is another. Several years ago, Ross Ihaka and I explored using LISP (Ihaka and Temple Lang (2008)) as the platform for a new statistical computing platform. The same ideas have been used there for many years and the performance gains are very impressive.

A very important premise underlying the approach in this paper is that the R project and its large code base are important, and will be for at least another 5 years.
Users are not likely to immediately change to a new system, even if it is technically superior. For that reason, it is important to improve the performance of R now. It is also important to allow developers outside of the core R development team to contribute to this effort and to avoid many forked/parallel projects. For this, we need an extensible system within R, and not one that requires continual changes to the centralized R source code.

In addition to focusing on the immediate and near-term future and improving R, we also need to be exploring new language and computing paradigms within the field of statistics. Julia is an interesting modern project doing this. We need to foster more experiments so new ideas emerge. To do this, we also need to increase the quantity and quality of computing within our curricula.

Section 2 constitutes the majority of this paper. In it I explore different examples of computing in R and how compiling code can make the computations more efficient, both by simply obtaining faster execution speed and also by allowing us to change how we approach the problem. In Section 3 I discuss some additional general strategies we can exploit to improve computations in the future. I briefly discuss other exciting research projects to improve R's efficiency and contrast them with the LLVM approach. I outline in Section 6 a road map of the ongoing work on the LLVM approach and other related projects as part of our future activities. The aim of this is to illustrate the feasibility of the entire approach.

In this paper I focus on reasonably standard R code and approaches that can be improved by generating native code. The semantics of the generated code are the same as the original R code.
The approach also allows us to develop new languages and semantics within the compilation framework and to explore different computational models. However, the examples discussed in this paper stay within R's existing computational model in order to anchor the discussion and avoid too many degrees of freedom becoming a distraction. I do hope that we will explore new semantics and language features within R via compilation.

2. ENHANCING R WITH ADVANCED COMPILATION TOOLS AND METHODS

In this section we'll explore some examples of how we can write code in R and compile it to machine code. These explore different strategies and illustrate how we can approach computations differently when we have the option to compile code rather than only interpret it. We have chosen the problems for several reasons. They are each reasonably simple to state, and they illustrate the potential benefits of compilation. Like most benchmarks, some of the examples may not reflect typical use cases or how we would do things in R. However, most of these problems are very concrete and practical, and represent ways we would like to be able to program in R, were it not for the performance issues. In this way, the examples illustrate how we can continue to use R with an additional computational model and can overcome some of the interpreter's performance issues while still using essentially the same R code.

Note. In the following subsections, we present absolute and comparative timings for different approaches and implementations of the different tasks. These timings were performed on three different machines. We used a MacBook Pro running OS X (2.66 GHz Intel Core i7, 8 GB 1067 MHz DDR3) and also two Linux machines. The first of these Linux machines is an older 2.8 GHz AMD Opteron, 16 GB.
The second is a much more recent and faster machine: a 3.50 GHz Intel Core i7-2700K, with 32 GB of RAM. Additionally, the different machines have different compilers used to compile R itself and these may impact the timings. We used GCC 4.2.1 on OS X, GCC 4.3.4 on the first Linux machine and both GCC 4.8.0 and clang on the second Linux box. In all cases, R was compiled with the -O3 optimization level flag. The absolute times are quite different across machines, as we expect, and the within-machine relative performance of our LLVM-generated code to native code differs between OS X and Linux. However, the within-machine results are very similar across Linux machines. Finally, our current steps to optimize the native code we generate with LLVM are quite simple and we expect to improve these in the near future.

We have not included the time to compile the code in our measurements. There are two steps in this: compiling the R code to intermediate representation (IR) and then compiling the IR to native code. The former can be done once, and the latter for each R session and is done in LLVM's C++ code. There are several reasons for omitting these steps in the timings. First, our focus is on tasks that take a long time to run in R, for example, many hours or days. Compilation time will be on the order of, at most, a few minutes and so the compilation time is negligible. Second, we expect that the compiled code will be reused in multiple calls and so the overhead of compiling will be amortized across calls. We have also ignored the time to byte-compile R functions, or compile and install C/C++ code to be used in R packages.

2.1 The Fibonacci Sequence

The Fibonacci sequence is an interesting mathematical sequence of integers defined by the recurrence/recursive relation

    F_n = F_{n-1} + F_{n-2},  n >= 2,

with F_0 = 0 and F_1 = 1.
We can implement this as an R function in an easy and obvious manner as

    fib = function(n) {
        if (n < 2L)
            n
        else
            fib(n - 1L) + fib(n - 2L)
    }

For simplicity, we don't verify that n is nonnegative, assuming the caller will provide meaningful inputs. This maps the mathematical description of the sequence to a computational form essentially in a one-to-one manner. This is a good thing, as it makes the code and computations easy to understand, debug and maintain. However, this is a scalar function and not vectorized, meaning that it computes the value of the Fibonacci sequence for a single integer value rather than element-wise for a vector of inputs. This makes it slow in R if we want to compute multiple values from the sequence, for example, apply it to each element of a vector. Instead of implementing the function in this natural form, to gain performance, we would look to other implementations. Since the sequence is described by a recurrence relationship, there is a simple closed-form formula for computing the nth value of the sequence which can be easily implemented in R as a vectorized function of n. Alternatively, we might use memoization to remember results computed in previous calls to the function to avoid repeating computations. We might even use a lookup table of precomputed results for common input values, or some combination of these approaches. The key is that to get good performance, we have to think about the problem quite differently. Instead, we'll explore how we make the simple implementation above perform better and hope to avoid having to change our entire approach or rely on R's other vectorized operations.
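For reference, the closed-form and memoization alternatives mentioned above can each be sketched in a few lines of base R. These are our own illustrative implementations (the names fibClosed and fibMemo are not from the paper); the closed form uses Binet's formula and is exact only while the floating-point rounding error stays below one half.

```r
# Vectorized closed-form Fibonacci via Binet's formula.
# Exact in double precision only for moderate n (roughly n <= 70).
fibClosed = function(n) {
    phi = (1 + sqrt(5)) / 2       # golden ratio
    round(phi^n / sqrt(5))        # nearest integer is F_n
}

# Memoized recursive version: a closure caches previously computed values.
fibMemo = local({
    cache = c(0, 1)               # cache[i + 1] holds F_i
    function(n) {
        if (n < length(cache))
            return(cache[n + 1L])
        ans = fibMemo(n - 1L) + fibMemo(n - 2L)
        cache[n + 1L] <<- ans
        ans
    }
})

fibClosed(0:10)   # 0 1 1 2 3 5 8 13 21 34 55
fibMemo(30)       # 832040, matching fib(30)
```

Both avoid the exponential recursion of fib(), but only by restating the problem; the point of the compilation approach is to keep the natural recursive form and still get good performance.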
We use the function compileFunction() in the RLLVMCompile package to create a native compiled version of the fib() function with

    fib.ll = compileFunction(fib, Int32Type,
                             list(n = Int32Type))

We have to specify the type of the return value and also the type of the input(s), that is, n in this case. For this function, both the return type and the input are regular integer values corresponding to the 32-bit integer type Int32Type. We could use a 64-bit integer by using the type Int64Type if we wanted to deal with larger numbers. In fact, we can create two separate and different versions of this function with different types with two calls to compileFunction(). This is a simple illustration of how easy it is to adapt the same R code to different situations and create different compiled routines with different characteristics.

compileFunction() can return an R function which we can invoke directly. However, by default, it currently returns an object representing the compiled routine in LLVM. We can invoke the routine using this object and the .llvm() function, analogous to the .Call() and .C() functions in R. So

    .llvm(fib.ll, 30)

calls our compiled routine and returns the value 832,040. Unlike the .Call()/.C() functions, the .llvm() function knows the expected types of the routine's parameters and so coerces the inputs to the types expected by the routine. In this case, it converts the R value 30 from what is a numeric value to an integer.

After verifying that the routine gives the correct results, we can explore the performance of the code. This recursive function is very computationally intensive. When calculating, for example, fib(30), we calculate fib(28) twice [once for each of fib(30 - 1) and fib(30 - 2)] and, similarly, we compute fib(27) multiple times and so on.
This repetition is one of the reasons the code is so slow. We'll compare the time to evaluate fib(30) using three different versions of the fib() function: the original interpreted function, the LLVM-compiled routine (fib.ll) and a version of fib() that is compiled by R's byte-compiler. The LLVM-compiled routine is the fastest. Table 1 shows the elapsed times for each and a ratio of the time for each function relative to the time for the interpreted function. These convey the relative speedup factor.

    Table 1
    Timings for computing Fibonacci sequence values

                               OS X              Linux 1            Linux 2
                           Time   Speedup     Time    Speedup    Time    Speedup
    Interpreted R code     80.49    1.00     112.70     1.0     51.780     1.0
    Byte-compiled R code   31.70    2.53      45.85     2.5     21.620     2.4
    Rllvm-compiled code     0.12  653.90       0.21   526.4      0.097   531.0

    These are the timings for a call to fib(30) using the regular R function, the byte-compiled version and the LLVM-compiled version. To improve the accuracy of the timings, we calculated the duration for 20 replications of the two slower functions and 200 replications of the LLVM-compiled routine and divided the duration by 10. The LLVM-compiled version is clearly much faster.

We see that on a MacBook Pro, the LLVM-compiled routine is 600 times faster than the R interpreter. On a Linux machine, the speedup is a bit smaller, but still very significant at a factor of 500. While we have attempted to reduce the variability of these timings, we have observed different speedups ranging from 400 to 600 on OS X and 230 to 540 on Linux. The timings we report here are the most recent (rather than the "best").

Again, this is a simple example and not necessarily very representative of how we would calculate the Fibonacci sequence in production code.
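The interpreted-versus-byte-compiled part of this comparison can be reproduced with base R alone. The following sketch uses the built-in compiler package and system.time(); the argument 25 and the variable names are our own choices to keep the illustration quick, and absolute times will of course differ by machine.

```r
library(compiler)

fib = function(n) {
    if (n < 2L) n else fib(n - 1L) + fib(n - 2L)
}

t.interp = system.time(fib(25))["elapsed"]

# Rebind fib to its byte-compiled form so that the recursive calls
# inside the body also resolve to the compiled version.
fib = cmpfun(fib)
stopifnot(fib(25) == 75025)      # same result, different evaluator

t.byte = system.time(fib(25))["elapsed"]
c(interpreted = t.interp, byte.compiled = t.byte)
```

Note the rebinding step: byte-compiling into a separate name would leave the recursive calls resolving to the interpreted global fib, defeating the comparison.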
However, the ability to express an algorithm in its natural mathematical form makes it easier to program, verify and extend. We would very much like to be able to write code in this manner, without sacrificing good run-time performance.

2.2 2-Dimensional Random Walk

Ross Ihaka developed a very instructive example of writing straightforward R code compared with clever, highly vectorized R code as a means to illustrate profiling in R and how to make code efficient. The task is simulating a two-dimensional random walk. It is very natural to write this as a loop with N iterations corresponding to the N steps of the walk. For each step, we toss a coin to determine whether we move horizontally or vertically. Given that choice, we toss another coin to determine whether to move left or right, or up or down. Then we calculate and store the new location. We'll call this the naïve approach and the code is shown in Figure 1. After several refinements based on profiling and nontrivial knowledge of R, Ihaka defines a very efficient R implementation of the random walk, shown in Figure 2. It removes the explicit loop, samples all N steps in one call to sample(), and determines the positions using two calls to the cumsum() function. This makes very good use of several of R's vectorized functions which are implemented in C code and therefore fast.

We manually compiled the naïve implementation using Rllvm and, similarly, used R's byte-compiler to create two compiled versions of this function. We then simulated a 10 million step random walk using each of the original naïve function, the byte-code compiled function, the fully vectorized version and the LLVM-compiled version. Table 2 shows the relative speedups.
    rw2d1 = function(n = 100) {
        xpos = ypos = numeric(n)
        truefalse = c(TRUE, FALSE)
        plusminus1 = c(1, -1)
        for(i in 2:n) {
            # Decide whether we are moving
            # horizontally or vertically.
            if (sample(truefalse, 1)) {
                xpos[i] = xpos[i-1] + sample(plusminus1, 1)
                ypos[i] = ypos[i-1]
            } else {
                xpos[i] = xpos[i-1]
                ypos[i] = ypos[i-1] + sample(plusminus1, 1)
            }
        }
        list(x = xpos, y = ypos)
    }

Fig. 1. The naïve implementation of the 2-D random walk.

    rw2d5 = # Sample from 4 directions, separately.
    function(n = 100000) {
        xsteps = c(-1, 1, 0, 0)
        ysteps = c( 0, 0, -1, 1)
        dir = sample(1:4, n - 1, replace = TRUE)
        xpos = c(0, cumsum(xsteps[dir]))
        ypos = c(0, cumsum(ysteps[dir]))
        list(x = xpos, y = ypos)
    }

Fig. 2. The fast, vectorized implementation of the 2-D random walk.

We see that the manually vectorized R function is 175 times faster than the naïve implementation, illustrating how important vectorization is to make R code efficient. However, we also see that compiling the naïve implementation with LLVM outperforms even the vectorized version, taking between about 55% and 65% of the time of the vectorized version. This is probably due to the compiled code using a single loop, while the vectorized version has two calls to cumsum() and hence at least one additional C-level loop over the N steps.

    Table 2
    Timings for simulating a 2-D random walk

                              OS X              Linux 1            Linux 2
                          Time   Speedup     Time   Speedup     Time   Speedup
    Interpreted R code   171.08     1.0     196.6      1.0     100.3      1.0
    Byte-compiled code   123.92     1.4     120.8      1.6      60.51     1.66
    Vectorized R code      0.97   176.5       1.8    106.8       0.63   159.46
    Rllvm-compiled code    0.52   329.3       1.1    180.3       0.40   250.12

    We generate 10 million steps for each approach. We compare a manually vectorized implementation in R with a naïve version written in R, and with byte-compiled and LLVM-compiled versions of that naïve function. The vectorized version is 175 times faster than the regular R function. However, the LLVM-compiled version outperforms the vectorized version, most likely by removing one C-level loop.

2.3 Sampling a Text File

Suppose we have one or more large comma-separated value (CSV) files. For example, we can download airline traffic delay data for each year as an approximately 650 megabyte CSV file from the Research and Innovative Technology Administration (RITA), part of the Bureau of Transportation, at http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236. Rather than working with the entire data set, we might choose to take a random sample of the observations. (We don't concern ourselves here with the appropriateness of a simple random sample.) We'll also assume that we know the number of observations in the CSV file.

How do we efficiently extract a sample of the lines from the file? We could use UNIX shell tools, but it is difficult to randomly generate and specify the lines to sample. Sampling the indices is something we want to do in R, but then passing all of these to a shell command is awkward, at best. Alternatively, we could do the entire sampling in R. We could read the entire file into memory [via the readLines() function] and then subset the ones we want. However, this requires a significant amount of memory. We first store all of the lines, then make a copy of the ones we want and then discard the larger vector. This may not be feasible, as we may not have enough memory, or it may simply be too slow.

We can think of different strategies. One is to first identify the indices of all of the lines we want in our sample, and then read the file in chunks until we get to a line that is in our sample.
We store that line and continue to read from where we are up to the next line in our sample, and so on. To make this work, we need to be able to continue to read from where we currently are in the file. We can use an R file connection to do this.

Our first step is to generate the vector of the line numbers we want to sample using, for example,

    lineNum = sort(sample(1:N, sampleSize))

where N is the number of lines in the CSV file. We have sorted the line numbers, as we will read the sample lines in the file sequentially. The next step is to determine how many lines there are between successive lines in our sample. We can compute this in R with

    lineSkips = diff(c(0, lineNum))

which gives a vector of the pairwise differences between successive elements. For example, suppose the first two lines we want to sample are 60 and 200. The first two elements in lineSkips will be 60 and 140. We can then read the first two lines in our sample with

    con = file("2012.csv", "r")
    readLines(con, 60)[60]
    readLines(con, 140)[140]

Each element of lineSkips tells us how many lines to read to get the next line in our sample. So next we need a function that can read that many lines and return the last of these. The following function does this:

    readTo = function(numLines, con)
        readLines(con, numLines)[numLines]

The final step to obtaining our entire sample is to call readTo() for each element of lineSkips, for example,

    readSelectedLines = function(lineSkip, file)
        sapply(lineSkip, readTo, file)

To obtain our sample, we call readSelectedLines(), passing it the variable lineSkips and our open connection:

    con = file("2012.csv", "r")
    sample = readSelectedLines(lineSkips, con)

Each of these functions is concise and efficient since sapply() is essentially implemented as a C-level loop within the R interpreter.
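The pieces above can be exercised end-to-end on a small synthetic file. This self-contained sketch uses a throwaway temporary file with known line contents (the file contents, N and sample size are ours, chosen only so the result can be checked); the result is bound to sampled rather than sample to avoid masking the base function.

```r
# Build a small throwaway file with known, numbered line contents.
path = tempfile(fileext = ".csv")
writeLines(sprintf("row%d", 1:1000), path)

N = 1000L
sampleSize = 10L

set.seed(1)
lineNum   = sort(sample(1:N, sampleSize))
lineSkips = diff(c(0, lineNum))   # lines to read to reach each sampled line

readTo = function(numLines, con)
    readLines(con, numLines)[numLines]

readSelectedLines = function(lineSkip, file)
    sapply(lineSkip, readTo, file)

con = file(path, "r")
sampled = readSelectedLines(lineSkips, con)
close(con)

# Each sampled line is exactly the line whose number we drew.
identical(sampled, sprintf("row%d", lineNum))
```

Because the connection stays open across the sapply() iterations, each readTo() call resumes reading where the previous one stopped, which is precisely what makes the skip counts in lineSkips correct.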
Using the connection and readLines() to read blocks of lines in readTo() is efficient, as it uses C code within R. Unfortunately, it does involve reading, allocating, storing, subsetting and discarding a potentially large character vector returned by each call to readLines(). However, we only want a single line at the end of that vector in each call. While each call involves significantly fewer lines than reading the entire file, allocating a large character vector still slows the computations, as it extensively involves the memory manager in R. A different approach that avoids the memory issue is to change the readTo() function so that it reads each line individually and then returns the last one. We could change it to

    readTo = function(numLines, con)
    {
        ans = ""
        for(i in 1:numLines)
            ans = readLines(con, 1)
        ans
    }

Again, this is straightforward and easy to understand. Unfortunately, it is extremely slow, as we are now looping in R over almost every line in the file. The idea of reading one line at a time would work well if we could avoid the overhead of the R loop mechanism. We can do this if we compile this new version of readTo() into native code. We can almost do this now, but we need an equivalent of readLines() that reads a single line of a file. This is exactly what the standard C routine fgets() does. Similar to a connection, we pass fgets() a pointer to an opaque C-level FILE data structure, and it puts the contents of the next line it reads into a location in memory that we also provide.
For simplicity of exposition, we will define our own function Fgets() in R as a proxy to call fgets():

    Fgets = function(file)
        fgets(ptr, 1000L, file)

This is R code and it just assumes there is a function named fgets() and that ptr is somehow (the address of) an array in memory with 1000 character elements, that is, space for a long string. We won't run this code in R, so these variables [ptr, file and fgets()] don't actually have to exist in R. Instead, we will allocate them in LLVM for the compiled, native routine we generate from Fgets(). We compile the Fgets() function in an LLVM module, a collection of routines and variables, using compileFunction(). We also define the module-level "global" variable ptr to be the pointer to the array we want, after creating the actual array of 1000 characters as another global variable. When compiling Fgets(), we also need to tell the compiler about the signature of the external fgets() routine so that it can make the call to fgets() correctly. We do this via

    mod = Module()
    FILEType = pointerType(Int32Type)
    declareFunction(list(StringType, StringType,
                         Int32Type, FILEType),
                    "fgets", mod)

[While we have done this explicitly, we could automate this step using the RCIndex (Temple Lang (2010a)) package to obtain the signature programmatically.] We also need to tell the LLVM run-time engine how to locate the fgets() routine, which we do with

    llvmAddSymbol("fgets")

Note that in our Fgets() function, we assumed that the longest line was less than 1000 characters. We could specify a different length if we knew or suspected otherwise. Similarly, we didn't provide any error checking about whether we had reached the end of the file. This is because we are assuming that the caller knows the total number of lines and is sampling only up to, at most, that number.
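Putting the pieces above together, the setup might look roughly as follows. This is a pseudocode-level sketch assembled solely from the snippets shown in this section; Module(), declareFunction(), llvmAddSymbol() and compileFunction() are the Rllvm/RLLVMCompile functions named in the text, and the exact argument conventions (for example, how compileFunction() is told the return and parameter types for Fgets()) may differ from what is shown here.

```r
library(Rllvm)           # assumed available; API as used in the text

mod = Module()
FILEType = pointerType(Int32Type)

## Declare the external C routine fgets() so calls to it compile correctly.
declareFunction(list(StringType, StringType, Int32Type, FILEType),
                "fgets", mod)

## Let the LLVM run-time engine resolve the fgets symbol.
llvmAddSymbol("fgets")

## The R proxy; ptr is a module-level global (a 1000-character buffer)
## created separately, as described in the text.
Fgets = function(file)
    fgets(ptr, 1000L, file)

## Compile the proxy into native code, following the signature style
## used for compileFunction() later in the paper (assumed form).
Fgets.c = compileFunction(Fgets, StringType, list(file = FILEType))
```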
This is an example of the context-specific shortcuts we can make when compiling the code for a particular situation, rather than writing general, robust code that can be used in many different situations. We could also tell the compiler to add these tests for us, if we wanted, but we can avoid the extra computations when we know they are redundant. How do we obtain the instance of the FILE data type to pass to the compiled Fgets() routine? We can use the C routine fopen() and, again, write an R function that mimics it and then compile it. However, the RLLVMCompile package has a function to automate the creation of that proxy function in R, if we know the signature of the C routine of interest. So this example illustrates how we can dynamically create bindings to existing compiled routines in different libraries. In the case of FILE, we can also use the existing function CFILE() in the RCurl (Temple Lang (2002)) package.

Table 3: Timings for sampling a CSV file

                                       OS X            Linux 1          Linux 2
                                       Time   Speedup  Time    Speedup  Time   Speedup
Interpreted R loop & readLines()       68.93  1.0      103.25  1.0      42.78  1.0
Rllvm-compiled loop & Fgets()          3.27   21.09    6.54    15.8     2.59   16.5
C code (FastCSVSample)                 3.0    22.97    6.28    16.4     2.40   17.8

We use vectorized code in R to read blocks of data and extract the final line of each block. The LLVM approach compiles simple R functions that read one line at a time. FastCSVSample does the same thing with manually written C code. The compiled approaches avoid the memory usage related to readLines() and see a nontrivial speedup. The C code in the FastCSVSample package outperforms the LLVM-compiled version, but both approaches outperform the approach using R's connections and readLines() functionality, which are also implemented with C code.
So now we can read a single line from an open FILE object in R via our compiled Fgets() routine. We can redefine our readTo() function as

    readTo = function(numLines, con)
    {
        ans = ""
        for(i in 1:numLines)
            ans = Fgets(con)
        ans
    }

This is almost identical to the original function above, but replaces the call readLines(con, 1) with Fgets(con). Now we can compile this into native code via compileFunction() and the resulting code will be quite fast. We now have a fast replacement for reading up to the next line in our sample. The last step is to make readSelectedLines() fast. Recall that this was implemented simply as sapply(lineSkip, readTo, file). When we compile this as returning an R character vector, our compiler recognizes the sapply() call, converts it into a loop in native code, and populates and returns a new R character vector. In summary, we have compiled three R functions [Fgets(), readTo() and readSelectedLines()] and these now allow us to read one line at a time and use the minimal amount of memory to collect the lines for our sample, but using two loops in native code rather than in R. We can now compare the performance of our R-based approach using readLines() to consume chunks of lines and our compiled version that reads one line at a time. In addition to these two approaches, we also have a manual C implementation, essentially equivalent to our LLVM-compiled approach, in the FastCSVSample package (Temple Lang (2013)). Our timings are based on extracting a sample of one hundred thousand lines uniformly from a CSV file that contains one hundred million lines, using the same lines for each approach. The elapsed times are given in Table 3.
We see that our compiled approach of reading one line at a time is around twenty times faster than collecting many unnecessary lines with readLines() and looping in R, even with sapply(). The difference between the LLVM and native C approaches may be inherent, but is also possibly due to different optimization techniques that we may be able to enhance with LLVM. In short, we can outperform R's native vectorized code by compiling our relatively straightforward R code. The exposition of this example may make it seem more complicated than it is. Essentially, we want to efficiently read one line of a file at a time in order to get to the next line in our sample. We compiled the Fgets() function for this and then compiled two other functions in R to perform loops over the number of lines. The important implications from this example are that we can sidestep R's memory management, get fine-grained control over computations using dynamically generated routines, and use existing native routines and data structures, such as fgets() and FILE, in our R code that will be compiled. We could already dynamically call native routines directly from R using, for example, rdyncall (Adler (2012)) or Rffi (Temple Lang (2011)). What is important here is that we are also compiling the iterations and not doing them in R.

2.4 Fusing Loops

Consider computing the log-likelihood for a given vector of observations x and a density function, say, dnorm(). In R, we can write the log-likelihood very efficiently as

    sum(log(dnorm(x, mu, sigma)))

Indeed, we could reduce this to sum(dnorm(x, mu, sigma, log = TRUE)), but the purpose of this example is to consider a general sequence of calls to vectorized R functions.
Each of the functions dnorm(), log() and sum() is built into R and implemented in C code, and two of them use the very efficient .Primitive() mechanism. As a result, this code seems to be as fast as it can be. This is true given the way R interprets the expression, one sub-expression at a time. However, there are two ways we can make this more efficient by compiling such an expression. The first is by reducing the number of loops (in C) from three to one. Generally, if we have n nested calls to element-wise, vectorized functions, we can reduce n loops to one. Second, we can typically eliminate at least one allocation of a potentially large vector. A third way we might speed up the computations is to use parallel capabilities such as a GPU or multiple cores or CPUs. We won't discuss this here, but it is conceptually quite straightforward to do generally when we are compiling the code dynamically. Indeed, the ability to programmatically combine a particular function with a general parallel strategy makes it more expedient than writing it ourselves in C/C++. How does R evaluate the expression above? It uses three separate loops. Ignoring pedantic details, R essentially evaluates the call to dnorm() and so loops over all of the elements of x and computes the density at each of those values. It stores these values in a newly allocated vector and then returns that. This becomes the input to the call to log(). R then iterates over the elements of this vector and computes the log() of each individual value. In this case, R may recognize that it doesn't need to create a new vector in which to return the results, but that it can reuse the input vector, since it is essentially anonymous.
The final step in the overall expression is the call to sum(), which iterates over the elements of the vector it receives and returns a single scalar value. Importantly, there are three loops over three vectors, all of the same length, and we allocate one new and large vector. We could use a single loop and avoid allocating this intermediate vector by rewriting the code as

    normalLogLik = function(x, mu = 0, sigma = 1)
    {
        ans = 0
        for(val in x)
            ans = ans + log(dnorm(val, mu, sigma))
        ans
    }

Instead of the vectorized calls in R, we have put scalar function calls inside a single loop. We have combined the calls to dnorm() and log() together. Then we took the result for each element and added it to the cumulative sum. This combination of operations is called loop fusion and, for large vectors, it can yield significant performance improvements. This new scalar version is faster by avoiding the loops and allocation. Of course, it is evaluated in R and so will be much slower. We could write this in C, but it would be very specific to the log-likelihood for a Normal density. Generally, we would have to write implementations for various sequences of calls, for example, for different density functions [i.e., sum(log(pdf(x, ...)))], and expressions involving other functions [e.g., prod(dchisq(x^2, p))]. This isn't practical. However, given our ability to dynamically generate native code, we can compile any expression such as our original expression sum(log(dnorm(x, mu, sigma))) into the native equivalent of our scalar code above. To compile the normalLogLik() function above, we need to be able to call scalar versions of the log() and dnorm() routines. The log() function is available in the ubiquitous math library (libm) and we can just refer to it. The Normal density function is not standard.
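Before turning to compilation, a plain-R sanity check confirms that the fused scalar loop above computes the same value as the vectorized expression, up to floating-point rounding:

```r
## The fused loop gives the same log-likelihood as the vectorized form.
normalLogLik = function(x, mu = 0, sigma = 1)
{
    ans = 0
    for(val in x)
        ans = ans + log(dnorm(val, mu, sigma))
    ans
}

set.seed(1)
x = rnorm(1e5, mean = 2, sd = 3)
vectorized = sum(log(dnorm(x, 2, 3)))
fused = normalLogLik(x, 2, 3)
stopifnot(abs(vectorized - fused) < 1e-8 * abs(vectorized))
```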
We can arrange for our native code to invoke R's dnorm() function for each scalar value in the vector. This is both awkward and inefficient. Instead, we can write our own version of dnorm() directly in R. While this would be slow to invoke many times in R, we will compile our dnorm() and normalLogLik() functions together into a single module and both will be fast. Another possible approach, in this case, is to take advantage of the good design and modularity of the Rmath library. It provides the routine dnorm4() as a regular native routine (unconnected with R's data types, etc.) and so we can invoke it, just as we do the log() routine.

For reasons that are not quite clear at present, on the OS X machine our loop-fused version takes about 40% longer than the R code for 10 million observations and 27% longer for 100 million observations. Again, we suspect that we will be able to improve the LLVM-compiled code by exploring more of its optimization facilities. However, on the Linux machines we do see a speedup, even for 10 million observations, where the LLVM loop-fused code runs in about 75% of the time of the R code. The timings and relative performances are given in Table 4.

Table 4: Times and relative performance for fusing loops

                                      OS X    Linux 1  Linux 2
Rllvm-compiled fused loops            0.73    1.69     0.84
Interpreted R vectorized functions    0.52    2.28     1.07
Regular R time / Rllvm time           0.71    1.35     1.27

The first two rows show the times for fusing the loops by compiling the R code with LLVM and for using a sequence of calls to R's vectorized functions. The final row shows the ratio of the two times within each machine. Fusing the loops is slower on OS X, but faster on Linux.
Regardless of the exact numbers, the results indicate that compiling our own code is competitive with manually writing vectorized routines in R, and that we can outperform these built-in C routines. A difference between the two approaches is that R uses the .Primitive() mechanism rather than a standard function call, which we have to make via the .llvm() function. However, not only do we reduce three loops to one, but we also avoid dealing with missing values (NAs) and additional parameters such as base for the log() function. So we should be doing even better. If we have access to multiple cores or GPUs, we may be able to execute this code much more efficiently simply via parallel execution. By fusing the loop operations together, we also avoid three separate transitions from the host to the GPU, and avoid transferring memory between the two systems more times than we need. We explicitly wrote the normalLogLik() function to show how to fuse the loops. We could also have written the original expression sum(log(dnorm(x, mu, sigma))) as

    Reduce(`+`, Map(log,
                    Map(dnorm, x,
                        MoreArgs = list(mu, sigma))))

By explicitly using these functional programming concepts, it is easy for us to see how to fuse loops and rewrite the code into the loop above. The RLLVMCompile package can recognize such an expression and compile it to the loop-fused instructions. We can either require R programmers to write code in this form in order to gain the performance of native code, or we can try to make the compiler recognize the vectorized nested function call idiom of the form f(g(h(x))).
2.5 Computing Distances between Observations

Distances between pairs of observations are important in common statistical techniques such as clustering, multi-dimensional scaling, support vector machines and many methods that use the "kernel trick" (Schölkopf and Smola (2001)). R provides the dist() function, which allows us to compute the distance between all pairs of observations in a matrix or data frame, using any of six different metrics. The core computations are implemented in C and are fast. However, there are some issues and rigidities. The dist() function insists that the data passed to the C code be represented as a matrix, and so will make a copy of the data if a data frame is given by the caller. For large data sets, this can be a significant issue, as we will essentially have two copies of the data in memory. Also, the dist() function only accepts a single data set and computes the distances between all pairs of observations within it. In contrast, a reasonably common situation is that we start with two separate data sets, X and Y, and want to compute the distance between each observation in X and each observation in Y, but not the distances between pairs of observations within X or within Y. Not only do we risk having three copies of the data in memory (the two separate data frames, the two combined into one data frame and then converted to a matrix), but the dist() function will also perform many unnecessary computations for these within-same-set observations that we will discard. If we have two data sets with n1 and n2 observations, respectively, the dist() function computes (n1 + n2) × (n1 + n2 − 1)/2 distances. We are only interested in n1 × n2 of these.
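To make the waste concrete, a quick calculation for the sizes used later in Table 5 (8000 and 1000 observations) shows that only about 20% of the distances dist() computes are of interest, matching the 80% discard figure quoted in this section:

```r
## Fraction of dist() computations actually needed when combining
## two data sets of n1 and n2 observations.
n1 = 8000; n2 = 1000
total  = (n1 + n2) * (n1 + n2 - 1) / 2   # distances dist() computes
wanted = n1 * n2                         # cross-set distances we keep
round(wanted / total, 3)                 # about 0.198: roughly 80% is wasted
stopifnot(abs(wanted / total - 0.198) < 0.001)
```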
As n1 and n2 diverge, the number of unnecessary computations increases, and this is especially burdensome if the number of variables for each observation is large. Another rigidity is that the choice of distance metric is fixed. If we wanted to introduce a new distance metric, it would be useful to be able to reuse the C code underlying dist(). We could do this with a function pointer in C, but the code for dist() would need to be modified to support this. Accordingly, if we want to introduce a new metric, we have to copy or re-implement the entire C code. The C code underlying dist() can use parallel capabilities (OpenMP) if they are detected when R is compiled. We cannot use GPUs or change the parallel strategy within an R session without rewriting the C code. As a result, we would like to be able to express the computations in R and select a different strategy for parallelizing the computations at run-time. In short, as useful as dist() is, we would like it to be much more flexible. We want to be able to compute the distances between two sets of observations, not within a single data set; use a data frame or a matrix or perhaps some other data representation without making a copy of the data; introduce new metrics within the same infrastructure; and use different parallel computing approaches. The current dist() function in R cannot help us meet these goals and is essentially static/fixed code. The package pdist (Wong (2013)) provides a way to compute pairwise distances between two data sets. This avoids the redundant computations. Unfortunately, it only supports the Euclidean metric and also insists on matrices being passed to the C code. Also, it has no support for parallel computing.
If we could write the basics of the dist() function in R and make it fast, we could address all of the enhancements we listed above, as well as make the code more comprehensible and accessible to users. The basic approach to computing the distance between each pair of observations in two data sets X and Y can be expressed in R with the following quite specific/rigid function (written to aid compiling):

    dist = function(X, Y,
                    nx = nrow(X), ny = nrow(Y),
                    p = ncol(X))
    {
        ans = numeric(nx * ny)
        ctr = 1L
        for(i in 1:nx) {
            for(j in 1:ny) {
                total = 0.0
                posX = i
                posY = j
                for(k in 1:p) {
                    total = total + (X[posX] - Y[posY])^2
                    posX = posX + nx
                    posY = posY + ny
                }
                ans[ctr] = sqrt(total)
                ctr = ctr + 1L
            }
        }
        ans
    }

The basic steps are to loop over each observation in the first data set (X) and then to loop over each observation in the other data set (Y). For each pair of observations, we compute the distance between them via the third nested loop. We could have made this simpler (and more general) by using a vectorized R expression or calling a function to do this final loop. However, we have inlined the computations directly for a reason. Suppose we had written this part of the computation as (X[i,] - Y[j,])^2. Unfortunately, in R, this would cause us to create two new intermediate vectors, one for each of the specific rows in the two data sets. This is because the row of each data set is not a simple vector containing the elements of interest that we can pass to the subtraction function (-). Instead, we have to arrange the data in each row of the matrix or data frame into a new vector of contiguous values. This is where R is convenient, but inefficient. This does not happen in the C code for R's builtin dist() routine, or in ours, as it uses matrices and knows how to access the elements individually rather than creating a new temporary vector.
We use this same approach in our loop. We also could allocate the vectors for the row values just once and reuse them for each observation, but we would still have to populate them for each different observation. To avoid the intermediate vectors, our code explicitly accesses the individual elements X[i, k] and Y[j, k] directly. A matrix in R is merely a vector with the elements of the matrix arranged sequentially in column order. Therefore, the first element of observation i in X is at position i in the vector. The second element of the ith observation is at position i + nrow(X), and so on. To compute the distance between the two observations, we loop over the p variables present in each of the observations and compute the difference. The code illustrates these computations for the Euclidean distance. We could easily change this to implement other distance metrics. We could do this by changing the code either manually or programmatically, by replacing the expression (X[posX] - Y[posY])^2 with, for example, abs(X[posX] - Y[posY]). Rewriting code programmatically is a powerful feature that allows us to treat R code as a template. We can compile this three-level nested loop R code via RLLVMCompile to native instructions.

Table 5: Timings for computing pair-wise distances

                                            OS X    Linux 1  Linux 2
Rllvm-compiled code                         8.72    11.94    6.22
R dist() function (calling native code)     14.74   79.65    27.37
Speedup factor                              1.69    6.67     4.4

This shows the total elapsed time for distance computations with 40 variables and 8000 and 1000 observations in the two data sets. In the R approach with the dist() function, there is extra memory allocation and 80% of the distances computed are discarded. We outperform the native R implementation on both platforms.
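The column-major layout described above can be checked directly in plain R: the linear index i + (k − 1) * nrow(X) addresses element X[i, k], which is why the loop can step across a row with a stride of nrow(X):

```r
## A matrix is a vector in column order: element (i, k) sits at
## linear position i + (k - 1) * nrow(X).
X = matrix(1:12, nrow = 3)   # 3 observations, 4 variables

i = 2L; k = 3L
stopifnot(X[i + (k - 1L) * nrow(X)] == X[i, k])

## Walking across row i with a stride of nrow(X), as the dist()
## loop above does with posX = posX + nx:
pos = i + (0:(ncol(X) - 1L)) * nrow(X)
stopifnot(identical(X[pos], X[i, ]))
```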
Our compiler currently works primarily with primitive data types and has limited support for working directly with R objects, for example, knowing the dimensions of an R matrix. Accordingly, we arrange to pass the matrices and their dimensions to the routine and currently have to explicitly specify the signature:

    distc = compileFunction(dist, REALSXPType,
                            list(X = DoublePtrType, Y = DoublePtrType,
                                 nx = Int32Type, ny = Int32Type,
                                 p = Int32Type))

In the future, we will allow the caller to specify just the two data sets (X and Y). However, we are making the representation as matrices more explicit here, which is valuable information for the compiler. Now that we have the native code, we can compare it to R code that computes the same distances but does so by combining the two data sets, calling dist(), and converting the result to the sub-matrix of interest. This comparison favors our code, since this is the form of our inputs and the expected form of the output. However, these are quite reasonable. We timed the functions to compute the distances for two data sets of 8000 and 1000 observations, each with 40 variables. In this case, 80% of the distances computed using R's dist() function are irrelevant and discarded. Table 5 shows the results and illustrates that, by doing fewer computations, we do indeed outperform the native C code in R on both platforms. If we had used data sets with similar numbers of observations, the results would have been less dramatic. However, with 3000 observations in Y, the LLVM-generated native code was still three times faster on Linux and only 18% slower on OS X. Comparing the results above to similar native code in the pdist package, the timings again show that the native C code in pdist outperforms our LLVM-compiled code, 60% faster on one machine and 9 times faster on another.
This illustrates that there is room for significant improvement in our LLVM compilation. However, the fact that we can outperform R's native approach is encouraging. That we can readily adapt this to different purposes and different computational strategies indicates significant opportunities and potential. As a final note, we could remove the third loop and insert a call to a function that computes the distance between the two observations, for example, euclidean(X[i,], Y[j,]). The compiler could recognize that X and Y are matrices and arrange for the compiled version of the euclidean() function to access the elements as we have displayed above, that is, without computing the intermediate vector for each row. If we tell the compiler that X and/or Y are data frames, it would generate different code to access the elements so as to avoid these intermediate vectors. Since the compiler has the opportunity to compile both the code for the main loop and the metric function together, and knows the representations of the inputs, it can create better code than if we wrote these separately and more rigidly.

3. POSSIBLE COMPILATION ENHANCEMENT STRATEGIES

The examples in the previous section explored different ways we could change the way we compute in R with new facilities for generating native code. We considered compiling R code to native routines, reusing existing native routines within these generated routines, and changing the computational strategies we employ within R to embrace these new approaches. There are many other simple examples we could consider to improve the performance of R code. One is the ability to write functions that focus on scalar operations and then to create vectorized versions of these automatically. Given a scalar function f(), we can write a vectorized version as sapply(x, f, ...) or with mapply().
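In plain R, the scalar-function idiom just described looks like this (logit() is merely an illustrative scalar function, not one from the paper):

```r
## A scalar function and its "vectorized" wrapper via sapply().
logit = function(p)
    log(p / (1 - p))

logitVec = function(x)
    sapply(x, logit)

p = c(0.1, 0.5, 0.9)
stopifnot(all(abs(logitVec(p) - log(p / (1 - p))) < 1e-12))
```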
The compiler can then turn this into a native loop. Indeed, many of the performance gains are achieved by making looping faster. They also potentially reduce the necessity of using vectorized code in R and so hopefully make programming in R more intuitive for new users. In addition to handling loops, there are several other aspects of R's evaluation model that we might be able to improve by choosing different compilation strategies in different contexts. The idea is that the R user compiling the code may have more information about the computations, the data and its representation, or the available computing resources than the compiler can obtain by examining the code. This extra information is important. The programmer may be able to give hints to a compiler, or choose a different compiler function/implementation altogether, to control how the code is understood and the native instructions are generated. The following are some reasonably obvious and general improvements we might be able to infer or make in certain situations. Guided by the R user, different compilers may yield different code, and even different semantics, for the same code.

3.1 Omitting Checks for NA Values

Many of the C routines in R loop over the elements of a vector and must check each element to see if it is a missing value (NA). This is general purpose code, and so this test is a fixed part of almost every computation involving that routine. However, when we dynamically generate the code, we may know that there are no missing values in the data set on which we will run that code, and so omit the code that performs these additional, redundant tests. Similarly, in our example of sampling a CSV file, we knew the number of lines in the file and we knew that each call to the fgets() routine would succeed.
As a result, we did not have to check the return status of the call for reading at the end of the file. We also assumed that the largest line was less than 1000 characters and didn't validate this in each iteration. The same applies when we are accessing elements of a vector, as to whether we first need to check that the index is within the extent of the array, that is, bounds checking. When we can verify this conceptually (within a loop over a vector), or by declaration by the user, we can omit these checks. These tests are typically simple and not computationally expensive. However, they can become significant when the instructions are invoked very often, for example, in a loop over the elements of a large vector.

3.2 Memory Allocation

In our example discussing loop fusion, we saw that not only could we reduce the number of overall iterations in a computation, but we also could reduce memory usage. We avoided creating a vector for the result of the call to dnorm() [and log()]. There are potential opportunities to further reduce memory usage. R uses the concept of pass-by-value in calls to functions. In theory, R makes a copy of each argument in a call to a function. (Lazy evaluation means that some arguments are never evaluated and so not copied.) However, the R interpreter is smarter than this and only copies an object when it is modified, and only if it is not part of another object. When compiling R code, we want to be able to determine that an object is not modified and avoid copying it. By analyzing code, we can detect whether parameters can be considered read-only and so reduce memory consumption in cases where R cannot verify that it is safe to avoid copying an object. We can identify this within regular R code; however, we would have to modify the interpreter to make use of this information.
When generating native code with, for example, compileFunction(), we can make use of this information dynamically, bypassing the R interpreter.

Another example where we can reduce the memory footprint of code is when we can reuse the same memory from a different computation. For example, consider a simple bootstrap computation something like the following R pseudo-code:

    for(i in 1:B) {
        d.star = data[sample(1:n, n, replace = TRUE), ]
        ans[[i]] = T(d.star, ...)
    }

In R's computational model, we will allocate a new data frame d.star for each bootstrap sample. This is unnecessary. We can reuse the same memory for each sample, as each sample has the same structure and only differs in the values in each cell. By analyzing the sequence of commands rather than executing each one separately without knowledge of the others, we can take advantage of this opportunity to reuse the memory. We can also reuse the same vector to store the result of the repeated calls to sample(). It is reasonably clear that this is what we would do if we wrote this code in C, reusing the same data structure instances. However, this is not possible within R, as the individual computations are not as connected as they are in the large-picture C code. When we dynamically generate native code, we can utilize this large-scale information.

Similarly, some R scripts create a large object, perform several computations on it and then move to other tasks. Code analysis can allow us to identify that the object is no longer being used and so we can insert calls to remove the object. However, we may be able to recognize that the object is no longer needed, but that subsequent tasks can reuse the same data format and representation. In that case, we can reuse the memory or at least parts of it.
3.3 Data Representations

The small number of fundamental data types in R makes computational reasoning quite simple, both to use and to implement. Of course, the choice of data type and structure can be important for many computations. Sequences, for example, 1:n or seq(along = x), are common in R code and these are represented in R as explicit vectors containing all of the values in the sequence. We have seen that we can avoid creating the sequence vector and populating it when it is used as a loop counter. Similarly, we can represent a regular sequence with the start, end and stride, that is, the increment between elements. When generating the code, we then access elements of such a sequence using appropriate calculations specialized to that sequence type.

In many cases, R's simple data types cause us to use an integer when we only need a byte, or even just a few bits, to represent a few possible values/states. The snpStats (Clayton (2011)) package does this successfully, using bytes to reduce the memory footprint for large genomic data. Again, the operations to subset data in this different format need to be modified from the default. Doing this element-wise in R is excessively slow. However, when we generate native code, we are free to use different ways to access the individual elements. This idea is important. We specify what to do in the code, but not precisely how to do it. When generating the code, we combine the code and information about how to represent the data and generate different code strategies and realizations. This is somewhat similar to template functions in C++, but more dynamic due to run-time compilation/generation with more contextual information.

There are several other aspects of R code that we can compile, for example, matching named arguments at compile time rather than at run time.

4. OVERVIEW OF GENERATING CODE WITH LLVM

In this section we will briefly describe the basic ideas of how we generate code with LLVM, Rllvm and RLLVMCompile. This is a little more technical and low level than our examples, and readers do not need to understand this material to understand the main ideas of this paper or to use the compiler or the compiled code. We are describing it here to illustrate how other R programmers can readily experiment with these tools to generate code in different ways.

We'll use the Fibonacci sequence and the fib() function example again, as it illustrates a few different aspects of generating code. Our fib() function in R expects an integer value and returns an integer. The body consists of a single if–else expression. This contains a condition to test and two blocks of code, one of which will be evaluated depending on the outcome of that condition. To map this code to LLVM concepts, we need to create different instruction blocks.

    ; Module ID = 'fib'

    define i32 @fib(i32 %n) {
    entry:
      %0 = icmp slt i32 %n, 2
      br i1 %0, label %"body.n < 2L", label %body.last

    "body.n < 2L":                        ; preds = %entry
      ret i32 %n

    body.last:                            ; preds = %entry
      %"n - 1L" = sub i32 %n, 1
      %1 = call i32 @fib(i32 %"n - 1L")
      %"n - 2L" = sub i32 %n, 2
      %2 = call i32 @fib(i32 %"n - 2L")
      %"fib(n - 1L) + fib(n - 2L)" = add i32 %1, %2
      ret i32 %"fib(n - 1L) + fib(n - 2L)"
    }

Fig. 3. Intermediate Representation for the compiled fib() routine. We create the Function and Block objects and create and insert LLVM instruction objects corresponding to the expressions and sub-expressions in the R function. The result is this low-level description in intermediate form which LLVM can optimize and compile to native code for different targets, for example, a CPU or GPU.
The different blocks have a label (e.g., body.last) and correspond to different parts of an if statement or possibly parts of a loop, generally.

Each block contains one or more instructions. When we call the routine, the evaluation starts in the first instruction block and executes each of its instructions sequentially. The end of each instruction block has a terminator which identifies the next block to which to jump, or returns from the routine. Jumping between blocks allows us to implement conditional branching, loops, etc. For our fib() function, we start with an entry block that might create any local variables for the computations. In our function, this block simply contains code to evaluate the condition n < 2 and, depending on the value of this test, the instruction to branch to one of two other blocks corresponding to the expressions in the if and the else parts. In the if block (i.e., n is less than 2), we add a single instruction to return the value of the variable n. In the block corresponding to the else part, we add several low-level instructions. We start by computing n - 1L and then call fib() with that value and store the result in a local variable. Then we calculate n - 2L, call fib() and store that result. Then we add these two local intermediate results and store the result. Finally we return that result. Figure 3 shows the code in what is called Intermediate Representation (IR) form that LLVM uses. This illustrates the low-level computations.

While the code for this function is reasonably simple, there are many details involved in generating the native code, such as defining the routine and its parameters, creating the instruction blocks, loading and storing values, and creating instructions to perform subtraction, call the fib() function and return a value.
The LLVM C++ API (Application Programming Interface) provides numerous classes and methods that allow us to create instances of these conceptual items such as Functions, Blocks, many different types of instructions and so on. The Rllvm package provides an R interface to these C++ classes and methods and allows us to create and manipulate these objects directly within R. For example, the following code shows how we can define the function, the entry instruction block and generate the call fib(n - 1):

    mod = Module()
    f = Function("fib", Int32Type, list(n = Int32Type), mod)
    start = Block(f)
    ir = IRBuilder(start)
    parms = getParameters(f)
    n.minus.1 = binOp(ir, Sub, parms$n, createConstant(ir, 1L))
    createCall(ir, f, n.minus.1)

We don't want to write this code manually ourselves in R, although Rllvm enables us to do so. Instead, we want to programmatically transform the R code in the fib() function to create the LLVM objects. The RLLVMCompile package does this. Since R functions are regular R objects which we can query and manipulate directly in R, we can traverse the expressions in the body of a function, analyze each one and perform a simple-minded translation from R concepts to LLVM concepts. This is the basic way compileFunction() generates the code, using customizable handler functions for the different types of expressions. These recognize calls to functions, accessing global variables, arithmetic operations, if statements, loops and so on. They use the functions in Rllvm to create the corresponding LLVM objects and instructions.

Once the compiler has finished defining the instructions for our routine, LLVM has a description of what we want to do in the form of these blocks and instructions. This description is in this intermediate representation (IR).
We can look at this "code" and it will look something similar to that shown in Figure 3. The IR code shows the somewhat low-level details of the blocks and instructions as we described above. We see the three blocks labeled entry, "body.n < 2L" and body.last. Again, it is not important to understand these details to be able to use the compiled routine translated from the R function. I show it here to illustrate the different steps in the compilation process and to indicate that an R programmer can choose to change any of these steps.

Next, we instruct LLVM to verify and optimize the code. At this point, we can call the new routine via the .llvm() function in Rllvm which corresponds to a method in the LLVM API. The first time the code is used, LLVM generates the native code from the IR form.

5. CONTRASTS WITH RELATED RESEARCH

There have been several projects exploring how to improve the performance of R code. We discuss some of these in this section.

Byte-Compiler: One of the most visible projects is the byte-code compiler developed by Luke Tierney (Tierney (2001)). This consists of an R package that provides the compiler and some support in the core R interpreter to execute the resulting byte-compiled code. The compiler maps the R code to instructions in the same spirit as LLVM's instructions and intermediate representation. These instructions are at a higher level than LLVM's and are more specific to R. The typical speedup provided by the byte-compiler is a factor of between 2 and 5, with much larger speedups on some problems. This may not be sufficient to obviate the need for writing code in C/C++. We probably need to see a factor of more than 10 and closer to 100 for common tasks. The byte-code compiler is written in R and so others can adapt and extend it.
However, the details of how the resulting byte-code is evaluated are tightly embedded in the C-language implementation of the R engine. This means that if one wants to change the byte-code interpreter, one has to modify the R interpreter itself. While one can do this with a private version of R, one cannot make these changes available to others without them also compiling a modified version of R. In other words, the byte-code interpreter is not extensible at run-time or by regular R users. Furthermore, the R core development team does not always greet suggested enhancements and patches with enthusiasm. Therefore, this approach tends to be the work of one person and so has limited resources.

Ra JIT Compiler: The Ra extension to R and the associated jit package is another approach to using JIT (Just-in-Time) compilation in R. This focuses on compiling loops and arithmetic expressions in loops. Like the byte-code compiler above, Ra requires almost no change to existing R code by R users; the only change needed is the call to the function jit() before evaluating the code. The performance gain on some problems can apparently be as high as a factor of 27. (See http://www.milbo.users.sonic.net/ra/times9.html.) Unfortunately, this is no longer maintained on CRAN, the primary central repository of R packages. This approach suffers from the fact that it requires a modified version of the R interpreter, again compiled from the C-level source. This places a burden on the author of Ra to continually update Ra as R itself changes. Also, it requires users to trust Ra and take the time to build the relevant binary installations of Ra.
As we mentioned previously, important motivating goals in our work are to avoid modifying R itself, to allow other people to build on and adapt our tools, and to directly leverage the ongoing work of domain experts in compiler technology by integrating their tools to perform the compilation. Our approach differs from both the byte-compiler and Ra in these respects.

R on the Java Virtual Machine: There are several projects working on developing an implementation of R using the Java programming language and virtual machine. One is FastR (https://github.com/allr/fastr), which is being developed by a collaboration between researchers at Purdue, Oracle and INRIA. Another is Renjin (https://code.google.com/p/renjin/). Having R run on the Java virtual machine offers several benefits. There are many interesting large-data projects implemented in Java, for example, Apache's Hadoop and Mahout. Integration between R code and such projects and their functionality would be much tighter and more effective if they were all on the same platform and shared the same computational engine. Importantly, R would benefit from passively acquiring features in Java and its libraries, for example, security and threads. One very interesting development is that researchers at Oracle, collaborating with the developers of FastR, are creating a tool, Graal (http://openjdk.java.net/projects/graal/), for compiling the code ordinarily interpreted by a high-level virtual machine, for example, FastR. This could yield the performance gains we seek, but by passively leveraging the general work of others merely by using widespread technologies.
This contrasts with the ongoing development of R by a relatively small community and having to actively and manually import new technologies, features and ideas from other languages, systems and communities.

Translating R code to C: Another approach is to translate R code to C/C++ code. This is attractive, as it would give us similar speedups to those we can get with LLVM, potentially produces human-readable code, and allows us to leverage the standard tools for these languages such as compilers, linkers and, importantly, debuggers. We can also potentially reuse (some of) the generated code outside of R. Simon Urbanek's r2c package (Urbanek (2007)) is an example of exploring translation of R code to C code.

The Rcpp (Eddelbuettel and François (2011)) (and inline) packages are widely used in R to improve performance. The packages provide a way to include high-level C++ code within R code and to compile and call it within the R code. The C++ code uses an R-like syntax to make the C++ code relatively easier to write. This has been a valuable addition to R for obtaining performant code. However, approaches that compile R code directly are preferable if they can get the same performance. The first reason is that the programmer does not have to program in C++, and the C++ code is harder for other programmers to read and understand. The second reason is that the C++ code is essentially opaque to any R code analysis or compiler. If we do manage to compile R code effectively in general or implement an automated parallel computing strategy for R code, the C++ code cannot easily be part of this. For example, if we can map the R code to run as a GPU kernel on many cores, we cannot easily combine the R and C++ code to take advantage of these cores.
Parallel Speedup: There are several interesting projects that have aimed at improving the performance of R exclusively by running the code in parallel. This is very important and in some sense orthogonal to compilation of R code. If we speed up the computations on a single CPU, that speedup will benefit running code on each CPU. However, we also want to compile R code to take advantage of multiple CPUs/GPUs. We hope to be able to integrate ideas from these projects into our compilation strategies. Unfortunately, some of them are no longer active projects, for example, pR and taskPR. This illustrates one aspect we have observed in the R community. Some researchers implement some ideas in R, sometimes as a PhD thesis, and then move on to other projects. One of the terrific aspects of R is the ongoing commitment to support the R community. This is probably a very significant reason for R's widespread use and an important consideration when developing new environments and languages. It is one of the forces motivating our continued work within R, even if developing a new system would be intellectually more stimulating.

A very important aspect of all this work is to recognize that there are many positive ways to make R faster and more efficient. While one of these approaches may dominate others in the future, it is very important that we pursue comparative approaches and continue to motivate each other's work. There is much to be learned from these different approaches that will improve the others.

6. FUTURE WORK

Compiling (subsets of) R code and other Domain Specific Languages (DSLs) within R using LLVM is a promising approach that is certainly worth vigorously pursuing in the near term.
The work is currently in its infancy; we started it in the summer of 2010, but have only recently returned to it after an almost three-year hiatus due to other projects (ours and other people's). However, the foundations of many of the important components are in place; that is, the Rllvm package and the basics of the extensible and adaptable compiler mechanism in RLLVMCompile should allow us and others to make relatively quick progress, programming almost entirely in R to develop compilation strategies. Even so, there are many other tasks to do to make these transparent and reliable, and many related projects that will make them more powerful and convenient.

One of the immediate tasks we will undertake is to program some rich examples explicitly in R code. We are implementing R code versions of recursive partitioning trees, random forests and boosting. We also plan to explore compiling code for the Expectation Maximization (EM) algorithm and particle filters to run on GPUs. The aim is to share these sample R projects with the other researchers investigating different compilation strategies for R so that we can compare approaches on substantive and real tasks we want to program in pure R code.

We plan to add some of the functionality available in LLVM that does not yet have bindings in Rllvm. This includes topics such as different optimization passes and adding metadata to the instructions.

We have also developed the initial infrastructure to compile R code as kernel routines that can be used on GPUs, that is, PTX (Parallel Thread Execution) code. Being able to generate kernel functions from R code, along with the existing R-CUDA bindings to manage memory and launch kernels from the host device, allows us to program GPUs directly within high-level R code.
This contrasts with the low-level C code developed for existing R packages that target GPUs, for example, gputools (Buckner et al. (2009)) and rgpu (Kempenaar and Dijkstra (2010)).

We will also be exploring different approaches to compiling the R code to run in parallel and distributed settings. We think that being able to use information about the distribution of the data to generate/compile the code will be important so that we can minimize the movement of data and keep the CPUs/GPUs busy on the actual computations rather than transferring the inputs and outputs to and from the computations.

Being able to write R code that directly calls C routines is very powerful. As we saw in relation to the fgets() routine in Section 2.3, we need to specify the signature for the routines we want to call. It is preferable to be able to programmatically identify these signatures rather than require R programmers to explicitly specify them. The RCIndex package is an R interface to libclang (Carruth et al. (2007)), the parsing facilities for the clang compiler. This already allows us to read C and C++ code in R and to identify the different elements it contains. This allows us not only to determine the signatures of routines, but also to discover different data structures, enumerated constants, etc. We can also go further and understand more about how the routines manipulate their arguments and whether they perform the memory management or leave it to the caller.

As we saw in each of our examples, information about the types of each parameter and local variable is a necessity for being able to compile using LLVM. Currently, the R programmer must specify this information not only for the function she is compiling, but also for all of the functions it calls.
Again, we want to make this transparent, or at least only require the R programmer to specify this information when there is ambiguity. To this end, we are working on a type inference package for R. This starts with a known set of fundamental functions and their signatures. From this, we can determine the signatures of many higher-level calls. As always, we cannot deal with many features of the language such as nonstandard evaluation, but we most likely can get much of the type information we need programmatically. Since R's types are so flexible, with different return types based on not only the types of the inputs but also the content of the inputs, we need a flexible way to specify types. Perhaps the existing TypeInfo package (Temple Lang and Gentleman (2005)) or the lambda.r package will help here. To analyze code for type information and for variable dependencies, we will build upon the CodeDepends (Temple Lang, Peng and Nolan (2007)) and codetools (Tierney (2011)) packages.

While these are some of the related activities we envisage working on, we also encourage others to collaborate with us or work independently using LLVM and optionally Rllvm and RLLVMCompile so that our community ends up with better tools.

7. CONCLUSION

We have described one approach to making some parts of the R language fast. We leverage the compiler toolkit infrastructure LLVM to generate native code. This allows us to incorporate technical knowledge from another community, both now and in the future. We can generate code for CPUs, GPUs and other targets. We can dynamically specialize R functions to different computational approaches, data representations and sources, and contextual knowledge, giving us a new and very flexible approach to thinking about high-level computing.
We are developing a simple but extensible and customizable compiler in R that can translate R code to native code. Not only does this make the code run fast, but it also allows us to compute in quite different ways than when we interpret the R code in the usual way. We can even outperform some of R's own native code.

In no way should this work be considered a general compiler for all of the R language. There are many aspects of the R language we have not yet dealt with or considered. Vectorized subsetting, recycling, lazy evaluation and nonstandard evaluation are examples. We, or others, can add facilities to the compiler to support these when they make sense and are feasible.

The initial results from this simple approach are very encouraging. An important implication of this and other efforts to make R code efficient is that we can benefit from writing high-level code that describes what to compute, not how. We then use smart interpreters or compilers to generate efficient code, simultaneously freeing R programmers to concentrate on their tasks and leveraging domain expertise for executing the code. We hope others will be able to use these basic building blocks to improve matters and also to explore quite different approaches and new languages within the R environment.

ACKNOWLEDGMENTS

Vincent Buffalo made valuable contributions to designing and developing the RLLVMCompile package in the initial work. Vincent Carey has provided important ideas, insights, advice and motivation, and I am very grateful to him for organizing this collection of papers and the session at the 2012 Joint Statistical Meetings. Also, I appreciate the very useful comments on the initial draft of this paper by the three reviewers and also John Chambers.
SUPPLEMENTARY MATERIAL

The code for the examples in this paper, along with the timing results and their meta-data, is available from https://github.com/duncantl/RllvmTimings as a git repository. The versions of the Rllvm and RLLVMCompile packages involved in the timings can also be retrieved from their respective git repositories. The specific code used is associated with the git tag StatSciPaper.

REFERENCES

Adler, D. (2012). rdyncall: Improved foreign function interface (FFI) and dynamic bindings to C libraries. R package version 0.7.5.

Buckner, J., Wilson, J., Seligman, M., Athey, B., Watson, S. and Meng, F. (2009). The gputools package enables GPU computing in R. Bioinformatics 26 135–135. Available at http://cran.r-project.org/web/packages/gputools/index.html.

Carruth, C., Christopher, E., Gregor, D., Korobeynikov, A., Kremenek, T., McCall, J., Rosier, C. and Smith, R. (2007). libclang: C/C++ translation unit parser library. Available at http://clang.llvm.org.

Clayton, D. (2011). snpStats: SnpMatrix and XSnpMatrix classes and methods. R package version 1.5.0.

Eddelbuettel, D. and François, R. (2011). Rcpp: Seamless R and C++ integration. Journal of Statistical Software 40 1–18.

Ihaka, R. and Temple Lang, D. (2008). Back to the future: Lisp as a base for a statistical computing system. In Proceedings in Computational Statistics 21–33. Springer, Heidelberg.

Jones, E., Oliphant, T. and Peterson, P. et al. (2001). SciPy: Open source scientific tools for Python.

Kempenaar, M. and Dijkstra, M. (2010). R/GPU: Using the Graphics Processing Unit to speed up bioinformatics analysis with R. Available at https://trac.nbic.nl/rgpu/.

Lattner, C. and Adve, V. (2004). LLVM: A compilation framework for lifelong program analysis and transformation. In Proc. of the 2004 International Symposium on Code Generation and Optimization (CGO'04), San Jose, CA, USA 75–88. IEEE Computer Society, Washington, DC. Available at http://llvm.org/.

R Core Team (2013). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Schölkopf, B. and Smola, A. (2001). Learning with Kernels. MIT Press, Cambridge, MA.

Temple Lang, D. (2002). RCurl: General network (http/ftp/...) client interface for R. R package version 1.95-4.

Temple Lang, D. (2010a). RCIndex: R interface to the clang parser's C API. Available at https://github.com/omegahat/RClangSimple.

Temple Lang, D. (2010b). Rllvm: R interface to the Low-Level Virtual Machine API. Available at https://github.com/duncantl/Rllvm.

Temple Lang, D. (2011). Rffi: Interface to libffi to dynamically invoke arbitrary compiled routines at run-time without compiled bindings. R package version 0.3-0.

Temple Lang, D. (2013). FastCSVSample: An R package to sample lines from a text file. Available at https://github.com/duncantl/FastCSVSample.

Temple Lang, D. and Buffalo, V. (2011). RLLVMCompile: A simple LLVM-based compiler for R code. Available at https://github.com/duncantl/RLLVMCompile.

Temple Lang, D. and Gentleman, R. (2005). TypeInfo: Optional type specification prototype. R package version 1.27.0.

Temple Lang, D., Peng, R. and Nolan, D. (2007). CodeDepends: Analysis of R code for reproducible research and code comprehension. Available at https://github.com/duncantl/CodeDepends.

Tierney, L. (2001). Compiling R: A preliminary report. In Proceedings of the 2nd International Workshop on Distributed Statistical Computing, March 15–17, 2001 (K. Hornik and F. Leisch, eds.). Technische Universität Wien, Vienna, Austria.

Tierney, L. (2011). codetools: Code analysis tools for R. R package version 0.2-8.

Urbanek, S. (2007). r2c: R to C compiler. Available at http://www.rforge.net/r2c/index.html.

Wong, J. (2013). pdist: Partitioned distance function. R package version 1.2.

Zakai, A. (2010). Emscripten: An LLVM-to-JavaScript compiler. Available at https://github.com/kripken/emscripten/wiki.