iotools: High-Performance I/O Tools for R
Authors: Taylor Arnold, Michael Kane, Simon Urbanek
Abstract: The iotools package provides a set of tools for Input/Output (I/O) intensive data processing in R (R Core Team, 2014). Efficient parsing methods are included to minimize copying and avoid the use of intermediate string representations. Methods support "chunk-wise" operations for computing on streaming input, including arbitrarily large files. We present a set of examples for iotools, as well as benchmarks comparing similar functions provided both in core R and in other contributed packages.

Introduction

When processing large data sets on a single machine, the performance bottleneck is often getting data from the hard drive into the format required by the programming environment. The associated latency comes from a combination of two sources. First, there is hardware latency from moving data from the hard drive to RAM. This is especially the case with "spinning" disk drives, which can have throughput speeds several orders of magnitude lower than those of RAM. Hardware approaches for addressing latency have been an active area of research and development for as long as hard drives have existed. Solid state drives and redundant arrays of inexpensive disks (RAID) now provide throughput comparable to RAM; they are readily available on commodity systems; and they continue to improve.

The second source is the software latency associated with transforming data from its representation on the disk to the format required by the programming environment. This translation drags down performance for many R users, especially in the context of larger data sources. The code below uses the microbenchmark (Mersmann, 2014) package to compare the time needed to read, parse, and create a data.frame with the time needed to simply read the data from disk. The file is a comma-separated value (CSV) file with 29 columns and 7,009,728 rows.
It takes roughly 190 times longer to perform the former compared to the latter, indicating there may be room for improvement.

> library(microbenchmark)
> col_classes = c(rep("integer", 8), "character", "integer", "character",
+                 rep("integer", 5), "character", "character",
+                 rep("integer", 4), "character", rep("integer", 6))
>
> f = "2008.csv"
> microbenchmark(s=read.csv(f, colClasses=col_classes), unit="s", times=5)
Unit: seconds
 expr     min       lq     mean   median       uq      max neval
    s 91.7096 92.08713 92.43579 92.24694 92.92127 93.21401     5
> microbenchmark(s=readBin(f, "raw", file.info(f)$size), times=5, unit="s")
Unit: seconds
 expr       min        lq      mean    median        uq       max neval
    s 0.4596488 0.4817199 0.4857666 0.4940194 0.4940351 0.4994098     5

This is not to say read.csv and its associated functions are poorly written. On the contrary, they are robust and do an excellent job inferring data format and shape characteristics. They allow users to import and examine a data set without knowing how many rows it has, how many columns it has, or its column types. Because of these functions, statisticians using R are able to focus on data exploration and modeling instead of file formats and schemas. While these functions are sufficient for processing relatively small data sets, larger ones require a different approach.

For large files, data are often processed on a single machine by extracting consecutive rows, or "chunks," from the file system: a chunk is read, processed, and then the next chunk is retrieved. The results from processing each chunk are then aggregated and returned. Small, manageable subsets are streamed from the disk to the processor, requiring only the memory needed to represent a single chunk. This approach is common not only on single machines but also in distributed environments, with technologies like Spark (Zaharia et al., 2010) and Hadoop MapReduce (Dean and Ghemawat, 2008). Clusters of commodity machines are able to process vast amounts of data one chunk at a time.
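The chunk-wise pattern just described can be sketched with the package's own primitives. This is a minimal sketch, not the package's recommended interface: it assumes the col_classes vector defined above and a copy of the file with the header line removed, and it simply counts rows as a stand-in for real per-chunk work.

```r
library(iotools)

# A chunk reader yields raw vectors that always end on a record boundary.
reader <- chunk.reader(file("2008_noheader.csv", "rb"))

total_rows <- 0
repeat {
  chunk <- read.chunk(reader)          # raw vector holding whole lines
  if (length(chunk) == 0) break        # a zero-length chunk signals EOF
  df <- dstrsplit(chunk, sep = ",", col_types = col_classes)
  total_rows <- total_rows + nrow(df)  # aggregate the per-chunk results
}
```

Only one chunk's worth of parsed data is ever held in memory, which is what allows files larger than RAM to be processed on a single machine.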
Statistical methodology is compatible with this computational approach and is justified in a variety of statistical/machine learning contexts, including Hartigan (1975), Kleiner et al. (2011), Guha et al. (2012), and Matloff (2014), to name a few. However, R currently does not address this common computing pattern. Packages such as bigmemory (Kane et al., 2013) and ff (Adler et al., 2014) provide data structures using their own binary format on disk, accessed through memory-mapped files. The data structures they provide are not native R objects. They do not exhibit copy-on-write behavior, and, in general, they cannot be seamlessly integrated with R's plethora of user-contributed packages. The readr package (Wickham and Francois, 2015) provides fast importing of data.frame objects, but it does not support chunk-wise operations for arbitrarily large files. The foreach package (Weston and Revolution Analytics, 2014), and its associated iterators package (Revolution Analytics, 2014), provide a general framework for chunked processing but do not provide the low-level connection-based utilities for transforming binary data stored on disk into objects native to R.

The iotools package provides in-place stream processing for any data source represented as a connection. Users of the package can import text and binary data into R and process large data sets in chunks. The package can be several orders of magnitude faster than R's native facilities. It provides general tools for quickly processing large data sets in consecutive chunks, both in- and out-of-core, and provides a basis for speeding up distributed computing frameworks including Hadoop Streaming (The Apache Software Foundation, 2013) and Spark.
The rest of this paper introduces the use of the iotools package for quickly importing data from disk to R and processing those data. Examples center around calculating OLS slope coefficients via the normal equations. This particular calculation was chosen because it balances read/write times with processing time.

A note on the data used in this paper

Examples in this paper make use of the "Airline on-time performance" data set (RITA, 2009), which was released for the 2009 American Statistical Association (ASA) Section on Statistical Computing and Statistical Graphics biannual data exposition. The data set includes commercial flight arrival and departure information from October 1987 to April 2008 for those carriers with at least 1% of domestic U.S. flights in a given year. In total, there is information for over 120 million flights, with 29 variables related to flight time, delay time, departure airport, arrival airport, and so on. Uncompressed, the data set is 12 gigabytes (GB) in size.

It should be noted that the 12 GB Airline On-time data set will likely not be considered "big" by many readers. However, in designing the examples, two principles were considered before sheer data size. First, the data set is publicly available. The code included in the Supplemental Material of this paper is capable of downloading the data set and running the benchmarks. Users are encouraged to engage with the data themselves by trying the code examples and developing their own analyses. Second, the data set is large enough to investigate the performance properties of iotools along with its associated scaling behavior. Together, the data set and the code available with this paper provide a set of accessible and reproducible examples, forming a basis for instruction and subsequent development.

I/O Methods and Formatters

R's file operations make use of the Standard C input/output library, including fread and fwrite.
Data are read in, elements are parsed, and parsed values populate data structures. The iotools package also uses the Standard C library, but it makes use of "bulk" binary operations, including memchr and strchr. These functions take advantage of hardware-specific single instruction, multiple data (SIMD) operations and tend to be faster than their Standard I/O counterparts, which combine fread with search functions in user space. As a result, iotools is able to find and retrieve data at a higher rate. In addition, an entire data set or chunk is buffered, rather than scanned and transformed line-by-line as in the read.table function. Thus, by buffering chunks of data and making use of low-level system functions, iotools is able to provide more performant data ingestion than what is available in base R as well as other packages.

Importing data with read.csv.raw and dstrsplit

In this section the iotools import functionality is applied to the airline data files, each of which is CSV-formatted. Files begin with a header, and column types are consistent across each of the 22 files. Each file corresponds to a full year of data, except the first year (1987), where the data start on October 14th. Importing the 1987 flights with iotools is shown below. The readAsRaw function takes either a connection or a file name and returns the contents as a raw type. The dstrsplit function parses a raw vector according to the specified column types and returns a data.frame. Since these functions may be considered "lower-level," the read.csv.raw function was written for importing data in a manner similar to read.table. It supports similar parameters but runs faster.
> # read.table with column types specified
> microbenchmark(read.table("1987.csv", header=TRUE, sep=",",
+                           colClasses=col_classes), times=5, unit="s")
Unit: seconds
 expr      min       lq     mean   median       uq      max neval
    s 12.85797 12.93201 13.50017 13.10885 13.91895 14.68304     5
>
> # iotools with column types specified
> microbenchmark(dstrsplit(readAsRaw(file.path(path, "1987.csv")), sep=",",
+                          col_types=col_classes), times=5, unit="s")
Unit: seconds
 expr      min      lq     mean   median       uq      max neval
    s 2.671205 2.728222 2.733358 2.740685 2.761237 2.765441     5
>
> # read.table with column types inferred
> microbenchmark(read.table("1987.csv", header=TRUE, sep=","), times=5, unit="s")
Unit: seconds
 expr      min       lq     mean   median       uq      max neval
    s 15.48836 15.58127 15.59946 15.60643 15.63559 15.68568     5
>
> # iotools with column types inferred
> microbenchmark(read.csv.raw("1987.csv", header=TRUE, sep=","),
+                times=5, unit="s")
Unit: seconds
 expr      min       lq     mean   median       uq      max neval
    s 2.705287 2.781435 2.798451 2.806583 2.832832 2.866119     5

Figure 1: Timings using read.csv and dstrsplit (time in seconds against the number of lines read).

The performance of read.csv and dstrsplit is compared in Figure 1. The benchmark measures the import times for 1,000,000 to 7,000,000 lines.[1] The visualization shows that importing data using read.csv takes about five times longer than dstrsplit.

The dstrsplit function takes either a raw or character vector and splits it into a data frame according to a specified separator. The columns may be of type logical, integer, numeric, character, raw, complex, POSIXct, and NA, where NA indicates the column should be skipped in the output. It may be considered a building block for read.csv.raw as well as other computing infrastructures, including Hadoop, pipes, and database connections, to name a few.
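The behavior of dstrsplit is easiest to see on a small in-memory input. This is an illustrative sketch with made-up values, using the character-vector form mentioned above:

```r
library(iotools)

# Each element of the character vector is one record; columns are parsed
# according to col_types, in order.
x <- c("1,3.5,apple", "2,7.25,banana")
df <- dstrsplit(x, sep = ",",
                col_types = c("integer", "numeric", "character"))
str(df)   # a data.frame with one integer, one numeric, one character column
```

The same call works unchanged when x is a raw vector produced by readAsRaw or a chunk reader, which is what makes dstrsplit reusable across file, pipe, and Hadoop inputs.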
It should be noted that factor types are not supported. It will be shown later that dstrsplit can be used in a streaming context, in which data are read sequentially. As a result, the set of factor levels cannot be deduced until the entire sequence is read. However, in most cases a caller knows the schema and is willing to specify the factor levels, or is willing to use a single pass through the data to find them.

[1] Benchmarks were performed on a MacBook Pro with a 2.7 GHz Intel Core i7 processor (4 physical cores, 8 logical cores), 16 GB of 1600 MHz DDR3 RAM, and flash storage.

Figure 2: Time to import a data.frame by element type (load, dstrsplit, read_csv, and read.csv compared for integer, numeric, complex, logical, character, raw, and POSIXct columns). Note readr does not support the complex and raw types.

Figure 2 shows the time needed to import a data file with 1,000,000 rows and 25 columns using load, dstrsplit, read_csv (from the readr package), and read.table. Imports were performed for each of R's native types to see how their different size requirements affect performance. The benchmarks show that, except for the POSIXct type, load is fastest. This is unsurprising, since load stores the binary representation of an R object, and importing consists of copying the file to memory and registering the object in R. The performance of read_csv is very close to that of iotools. When comparing with read_csv we have found three things of note. First, the difference in times is constant.
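The single-pass workaround for factor levels can be sketched as follows. This is an illustrative sketch, not part of the package: it assumes the col_classes vector from the earlier benchmarks and treats column 9 (read as character) as the column to convert.

```r
library(iotools)

# Pass over the data once, reading the eventual factor column as character,
# and collect the observed levels.
df <- dstrsplit(readAsRaw("1987.csv"), sep = ",", skip = 1,
                col_types = col_classes)
lvls <- sort(unique(df[[9]]))

# Convert with an explicit level set. Reusing the same `lvls` for every
# subsequent chunk or file keeps the integer codes consistent throughout.
df[[9]] <- factor(df[[9]], levels = lvls)
```

In a streaming setting the first pass would itself be chunked, unioning the levels seen in each chunk before the conversion pass.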
As the number of lines to read increases, the slopes of the read times are the same. Second, where read_csv maintains a slight edge for the supported numerical types on OS X, iotools has a slight edge on the Linux machines we tested. Third, the read_csv function was provided a connection explicitly in these benchmarks, so that all functions being examined are given the same input. When a file name is provided to read_csv, it achieves slightly better performance than the values shown here, since it imports from a memory-mapping of the file.

Processing and Checkpointing with as.output

While optimized I/O operations are a convenience when performing explorations and analyses on small data sets fitting in RAM, they are an imperative when working with big data. In this class of data challenges we often deal with individual files whose aggregate is too large to fit in RAM. Furthermore, in distributed applications we may need to load and process subsets from different machines, later combining the results in some useful way.

Let us assume we are tasked with finding the slope coefficients for the linear regression

    ArrDelay ~ DayOfWeek + DepTime + Month + DepDelay.   (1)

The slope estimates are formed by creating the model matrix and applying the normal equations to derive the coefficients. As a first task, we perform the simple preprocessing step of aggregating all of the files into a single file holding the model matrix of the entire airline data set. The slope coefficients will be calculated in a second, separate step. Separating the processing from the model fitting in this case is mostly for the sake of example. However, in many real-world data challenges it is a good idea.
Separated steps provide checkpointing: if a problem arises while fitting the model, whether from a bug in the code or an interruption in computing services, the model matrix does not need to be recalculated. Also, the analysis can be changed based on the transformed data, thereby saving a step in similar analyses. In the case of our regression, we can derive many different models involving the described variables by including or excluding them in the model-fitting step.

The example below shows how to write the model matrices to a single file, named airline_mm.io. However, we could have processed sets of files just as easily with iotools. To emphasize that iotools is complementary to existing packages, we will show its use with the "pipes" included in the tidyr package (Wickham, 2016). The code reads each of the airline files into a data frame using readAsRaw and dstrsplit, normalizes the categorical variables and transforms the departure times, creates a model matrix from the resulting data.frame, strips the row names of the model matrix, creates the text output representation, and writes it to the output file. The output connection is recycled in each iteration of the loop, thereby appending each year's data.

> library(tidyr)
>
> # The variables we'll use in the linear regression.
> form = ~ ArrDelay + DayOfWeek + DepTime + DepDelay + Month
>
> # A function to normalize the categorical variables and turn
> # departure time into minutes after midnight.
> normalize_df = function(x) {
+   names(x) = col_names
+   x$DayOfWeek = factor(x$DayOfWeek, levels=1:7)
+   x$Month = factor(x$Month, levels=1:12)
+   x$DepTime = sprintf("%04d", x$DepTime)
+   x$DepTime = as.numeric(substr(x$DepTime, 1, 2))*60 +
+               as.numeric(substr(x$DepTime, 3, 4))
+   x
+ }
>
> # Remove the rownames from the output.
> strip_rownames = function(x) {
+   rownames(x) = NULL
+   x
+ }
>
> # Read the files and write a single model matrix.
> data_files = paste0(1988:2008, ".csv")
> out_file = file("airline_mm.io", "wb")
> for (data_file in data_files) {
+   data_file %>% readAsRaw %>%
+     dstrsplit(sep=",", skip=1, col_types=col_classes) %>% normalize_df %>%
+     model.matrix(form, .) %>% strip_rownames %>% as.output(sep=",") %>%
+     writeBin(out_file)
+ }
> close(out_file)

Fitting the model with mstrsplit and chunk.apply

With the model matrices created, the next step is to estimate the slope coefficients β in the model

    Y = X β + ε,   (2)

where Y, ε ∈ Rⁿ and β ∈ Rᵈ, with n ≥ d; each element of ε is an i.i.d. random variable with mean zero; and X is a matrix in Rⁿˣᵈ with full column rank. The analytic solution for estimating the slope coefficients β is

    β̂ = (XᵀX)⁻¹ XᵀY.   (3)

Consider the row-wise partitioning (or chunking) of Equation 2:

    [Y_1]   [X_1]       [ε_1]
    [Y_2] = [X_2] β  +  [ε_2]
    [ ⋮ ]   [ ⋮ ]       [ ⋮ ]
    [Y_r]   [X_r]       [ε_r]

where Y_1, ..., Y_r, X_1, ..., X_r, and ε_1, ..., ε_r are data partitions in which each chunk is composed of a subset of the rows of the model matrix. Then Equation 3 can be expressed as

    β̂ = ( ∑_{i=1}^{r} XᵢᵀXᵢ )⁻¹ ∑_{i=1}^{r} XᵢᵀYᵢ.   (4)

The matrices XᵢᵀXᵢ and XᵢᵀYᵢ can be calculated on each chunk and then summed to calculate the slope coefficients. We remark that computed solutions rarely use Equation 3 directly, but rather use the QR decomposition of X for numerical stability. In practice we have found that the amount of numerical stability gained does not warrant the QR calculation, especially when distinguishing nearly collinear variables is not critical.

Code to fit the model will need to read from airline_mm.io in chunks, and where before data were read into a data frame, we would now like to read the data into a numeric matrix. Interestingly enough, this functionality is not provided in base R or the Matrix (Bates and Maechler, 2014) package.
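As an aside, the chunked identity in Equation 4 can be checked numerically on a toy example before being applied out-of-core. The dimensions and chunk count below are arbitrary:

```r
set.seed(1)
X <- matrix(rnorm(60), nrow = 20, ncol = 3)
Y <- matrix(rnorm(20), ncol = 1)

# Split the 20 rows into r = 4 chunks and accumulate per-chunk crossproducts.
idx <- split(seq_len(20), rep(1:4, each = 5))
xtx <- Reduce(`+`, lapply(idx, function(i) crossprod(X[i, , drop = FALSE])))
xty <- Reduce(`+`, lapply(idx, function(i)
  crossprod(X[i, , drop = FALSE], Y[i, , drop = FALSE])))

# The chunked and full-data solutions agree up to floating-point error.
all.equal(solve(xtx, xty), solve(crossprod(X), crossprod(X, Y)))
```

This is exactly the accumulation pattern the chunk.apply code below performs, with file chunks playing the role of the row index sets.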
Traditionally, users who wanted to read matrices from disk either used the load/dget functions, forcing them to write with save/dput, or read the data in as a data frame and then converted it with the as.matrix function. The former approach allows an R user to quickly import and export data, but is not easily accessed from other computing environments. The latter requires a redundant copy of the data. The iotools package fills this gap by providing mstrsplit, a matrix import function similar to dstrsplit.

An implementation fitting the linear model out-of-core is shown below. The chunk.apply function reads and processes chunks, in this case contiguous groups of rows in the model matrix. It takes as arguments a connection or file, a function with a single parameter, and a number of parallel processes to use. The function parameter requires a single argument corresponding to the raw vector to be parsed by dstrsplit or mstrsplit.

> # Get the factor expansion of the variables.
> mm_col_names = data_files[1] %>% read.csv.raw(header=TRUE, nrows=2) %>%
+   normalize_df %>% model.matrix(form, .) %>% colnames
>
> ne_chunks = chunk.apply("airline_mm.io",
+   function(x) {
+     mm = mstrsplit(x, sep=",", type="numeric")
+     colnames(mm) = mm_col_names
+     list(xtx=crossprod(mm[,-2]),
+          xty=crossprod(mm[,-2], mm[,2, drop=FALSE]))
+   }, CH.MERGE=list, parallel=4)
>
> xtx = Reduce(`+`, Map(function(x) x$xtx, ne_chunks))
> xty = Reduce(`+`, Map(function(x) x$xty, ne_chunks))
>
> qr_xtx = qr(xtx)
> keep_vars = qr_xtx$pivot[1:qr_xtx$rank]
>
> # The regression coefficients
> solve(xtx[keep_vars,keep_vars]) %*% xty[keep_vars]
                     [,1]
(Intercept)  0.5564085990
DayOfWeek2   0.5720431343
DayOfWeek3   0.8480978666
DayOfWeek4   1.2436976583
DayOfWeek5   1.0805744488
DayOfWeek6  -1.2235684080
DayOfWeek7  -0.9883340887
DepTime      0.0003022008
DepDelay     0.9329374752
Month2       0.2880436452
Month3      -0.2198123852
...

Figure 3: Time to import a matrix by element type (load, mstrsplit, and as.matrix compared for integer, numeric, complex, logical, character, and raw columns).

Figure 3 compares the performance of mstrsplit with read.table followed by a call to as.matrix, along with binary importing using load. As with dstrsplit, mstrsplit outperforms the base-R read.table benchmarks by an order of magnitude, and even outperforms load for character data.

Parallel Processing of Chunks

In the example above, xtx and xty for each chunk are calculated independently of any other chunk. The chunk.apply function includes a parameter, parallel, allowing the user to specify the number of parallel processes, taking advantage of the embarrassingly parallel nature of these calculations. However, it is worth noting that parallelism in the chunk.apply function is slightly different than in other functions such as mclapply.

Most parallel functions in R work by having worker processes receive data and an expression to compute. The master process initiates the computations and waits for them to complete. For I/O-intensive computations this means that either the master loads the data before initiating the computation or the worker processes load it themselves. The former case is supported in iotools through iterator functions (idstrsplit and imstrsplit), which are compatible with the foreach package. However, in this case, new tasks cannot be started until data has been loaded for each of the workers.
Loading the data on the master process may become a bottleneck, and it may require much more time to load the data than to process it. The latter approach is also supported in iotools and ensures the master process is not a bottleneck, but if multiple worker processes on a single machine load a large amount of data from the same disk, then resource contention at the system level may also cause excessive delays. The operating system has to service multiple requests for data from the same disk, which has limited I/O capability.

A third option, implemented in chunk.apply, provides pipeline parallelism, where the master process sequentially loads data and then calls mcparallel to initiate the parallel computation. When the maximum number of worker processes has been reached, the master process pre-fetches the next chunk and then blocks on the result of the running worker processes. When a result is returned, a newly created worker begins processing the pre-fetched data. In this way the master never waits idly for worker processing, and there is no resource contention since only the master is retrieving data.

Pipeline parallelism increases execution throughput when the computation time is around the same order as the load time. When it is not, the overhead involved in initiating worker processes and getting their results may overwhelm the computation time, and parallel processing yields less performant results.

Figure 4: Time to fit the linear model against the number of cores, comparing one reader with multiple processes, multiple readers with multiple processes, and pipeline parallelism.

Figure 4 shows the times required to calculate XᵀX and XᵀY from the normal equations in the regression described above using the three approaches described: all workers read, only the master reads, and pipeline parallel.
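The pre-fetching scheme that chunk.apply automates can be sketched in miniature with the parallel package's fork-based primitives. This is a simplified sketch for Unix-like systems only, assuming the airline_mm.io file produced earlier; chunk.apply handles scheduling, merging, and error cases that are omitted here.

```r
library(parallel)   # mcparallel/mccollect (fork-based, Unix-like systems)
library(iotools)

reader <- chunk.reader(file("airline_mm.io", "rb"))
jobs <- list(); results <- list()
max_workers <- 4

repeat {
  chunk <- read.chunk(reader)                 # master reads the next chunk
  if (length(chunk) == 0) break
  jobs[[length(jobs) + 1]] <- mcparallel({    # a forked worker parses/computes
    mm <- mstrsplit(chunk, sep = ",", type = "numeric")
    crossprod(mm)
  })
  if (length(jobs) >= max_workers) {          # block on the oldest worker only
    results[[length(results) + 1]] <- mccollect(jobs[[1]])[[1]]
    jobs <- jobs[-1]
  }
}
for (j in jobs) results[[length(results) + 1]] <- mccollect(j)[[1]]
xtx <- Reduce(`+`, results)
```

Because only the master touches the disk, reads stay sequential while parsing and arithmetic overlap with I/O, which is the source of the pipeline-parallel speedup shown in Figure 4.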
Pipeline parallelism performs best, followed by all workers reading. It should be noted that all workers reading will only keep pace with pipeline parallelism as long as there is sufficient hard-drive bandwidth and little contention from the multiple reads. As a result, the pipeline-parallel approach is likely the more general, and therefore preferred, strategy.

Conclusion

This paper presents the iotools package for out-of-core processing of data and explores its use in analyzing the Airline On-time data set. The examples emphasize computing on a single machine; however, it should be noted that iotools is by no means limited to this configuration. The "chunk" functions are compatible with any object derived from a connection and could therefore be used with compressed files or even pipes and sockets. In fact, our current work uses iotools as a building block for more tightly integrating R with the Hadoop Streaming and Spark frameworks. Early results show iotools achieves better performance in processing terabyte- and even petabyte-scale data when compared to other existing packages.

Bibliography

D. Adler, C. Gläser, O. Nenadic, J. Oehlschlägel, and W. Zucchini. ff: memory-efficient storage of large data on disk and fast access functions, 2014. URL http://CRAN.R-project.org/package=ff. R package version 2.2-13.

D. Bates and M. Maechler. Matrix: Sparse and Dense Matrix Classes and Methods, 2014. URL http://CRAN.R-project.org/package=Matrix. R package version 1.1-4.

J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, Jan. 2008. ISSN 0001-0782. doi: 10.1145/1327452.1327492. URL http://doi.acm.org/10.1145/1327452.1327492.

S. Guha, R. Hafen, J. Rounds, J. Xia, J. Li, B. Xi, and W. S. Cleveland. Large complex data: divide and recombine (D&R) with RHIPE. Stat, 1(1):53–67, 2012. ISSN 2049-1573. doi: 10.1002/sta4.7. URL http://dx.doi.org/10.1002/sta4.7.

J. A. Hartigan.
Necessary and sufficient conditions for asymptotic joint normality of a statistic and its subsample values. The Annals of Statistics, 3(3):573–580, 1975. doi: 10.1214/aos/1176343123. URL http://dx.doi.org/10.1214/aos/1176343123.

M. J. Kane, J. Emerson, and S. Weston. Scalable strategies for computing with massive data. Journal of Statistical Software, 55(14):1–19, 2013. URL http://www.jstatsoft.org/v55/i14/.

A. Kleiner, A. Talwalkar, P. Sarkar, and M. I. Jordan. A scalable bootstrap for massive data. arXiv preprint arXiv:1112.5016, 2011.

N. Matloff. Software Alchemy: Turning Complex Statistical Computations into Embarrassingly-Parallel Ones. ArXiv e-prints, Sept. 2014.

O. Mersmann. microbenchmark: Accurate Timing Functions, 2014. URL http://CRAN.R-project.org/package=microbenchmark. R package version 1.4-2.

R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2014. URL http://www.R-project.org/.

Revolution Analytics. iterators: Iterator construct for R, 2014. URL http://CRAN.R-project.org/package=iterators. R package version 1.0.7.

RITA. The Airline on-time performance data set website, 2009. URL http://stat-computing.org/dataexpo/2009/. Research and Innovation Technology Administration, Bureau of Transportation Statistics.

The Apache Software Foundation. Apache Hadoop Streaming, 2013. Available at http://hadoop.apache.org.

S. Weston and Revolution Analytics. foreach: Foreach looping construct for R, 2014. URL http://CRAN.R-project.org/package=foreach. R package version 1.4.2.

H. Wickham. tidyr: Easily Tidy Data with `spread()` and `gather()` Functions, 2016. URL https://CRAN.R-project.org/package=tidyr. R package version 0.4.1.

H. Wickham and R. Francois.
readr: Read Tabular Data, 2015. URL https://CRAN.R-project.org/package=readr. R package version 0.2.2.

M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster computing with working sets. HotCloud, 10:10–10, 2010.

Taylor Arnold
AT&T Labs – Statistics Research
33 Thomas Street, NY, USA
taylor@research.att.com

Michael J. Kane
Yale University
300 George Street, New Haven, CT, USA
michael.kane@yale.edu

Simon Urbanek
AT&T Labs – Statistics Research
1 AT&T Way, Bedminster, NJ, USA
urbanek@research.att.com

The R Journal Vol. XX/YY, ISSN 2073-4859