The Simulator: An Engine to Streamline Simulations


Authors: Jacob Bien

Jacob Bien
Departments of BSCB and Statistical Science, Cornell University

Abstract

The simulator is an R package that streamlines the process of performing simulations by creating a common infrastructure that can be easily used and reused across projects. Methodological statisticians routinely write simulations to compare their methods to preexisting ones. While developing ideas, there is a temptation to write "quick and dirty" simulations to try out ideas. This approach of rapid prototyping is useful but can sometimes backfire if bugs are introduced. Using the simulator allows one to remove the "dirty" without sacrificing the "quick." Coding is quick because the statistician focuses exclusively on those aspects of the simulation that are specific to the particular paper being written. Code written with the simulator is succinct, highly readable, and easily shared with others. The modular nature of simulations written with the simulator promotes code reusability, which saves time and facilitates reproducibility. The syntax of the simulator leads to simulation code that is easily human-readable. Other benefits of using the simulator include the ability to "step in" to a simulation and change one aspect without having to rerun the entire simulation from scratch, the straightforward integration of parallel computing into simulations, and the ability to rapidly generate plots, tables, and reports with minimal effort.

Keywords: simulations, reproducible code, parallel computing.

1. Introduction

Methodological statisticians spend an appreciable amount of time writing code for simulation studies. Every paper introducing a new method has a simulation section in which the new method is compared across several metrics to preexisting methods under various scenarios.
Given the formulaic nature of the simulation studies in most statistics papers, there is a lot of code that can be reused. The goal of the simulator is to streamline the process of performing simulations by creating a common infrastructure that can be easily used and reused. By infrastructure, we mean everything that is common across projects. This includes code for

• running simulations in parallel,
• proper handling of pseudorandom number generator streams (relevant when running simulations in parallel),
• storing simulation outputs at various stages to allow one to "step in" and change one aspect of a simulation (such as how a method is being evaluated) without having to rerun the entire simulation from scratch,
• summarizing simulation results with plots and tables showing how various methods compare across one or more metrics, and
• generating reports to easily communicate simulation results and code to others.

The simulator is an R (R Core Team 2016) package focused on all of these aspects of a simulation. This allows the statistician to focus exclusively on problem-specific coding.
The benefits of working this way include

• a decrease in code size, which reduces both the amount of time spent coding and the likelihood of introducing bugs,
• code that is easy to understand, share, and reuse,
• the ability to maintain focus on the statistical issues at hand without either getting sidetracked by coding-related issues or alternatively settling for "quick and dirty" approaches that may save time in the present but slow down a project in the future (if one must, for example, rerun all results from scratch),
• the ability to add to or modify a simulation without rerunning everything,
• the ability to open/close the R session without having to save the workspace (a common practice that produces unwieldy R environments that can lead to bugs in which global variables unintentionally affect the behavior of functions),
• a consistent organization of and approach to simulations across all projects, making it easier to resume a project that may have been dormant for years, and
• the ability to easily "plug in" components written by others for similar problems. Indeed, if a community of researchers (either at the research group level or at the level of a subarea of statistics) is using the simulator, standard libraries of models, methods, and metrics can be developed at the community level, leading to easy code sharing.

The simulator divides a statistical simulation study into four basic components or modules:

1) Models: The statistical model, determining how data is generated.
2) Methods: The statistical procedures being compared. Given the data, each method produces an output such as an estimate, prediction, or decision.
3) Evaluations: Metrics that evaluate a method's output based on the input.
4) Plots, Tables, Reports: How one displays the evaluated metrics for various methods under various scenarios, and how one communicates the results and code to others.
When using the simulator, one codes the models, methods, and metrics particular to the problem at hand. These are then "plugged in" to the simulator. After an initial look at the simulator in action (Section 2), we discuss each of the components of the simulator mentioned above in greater detail (Section 3). In Section 4, we discuss certain principles that informed our design choices for the package. Section 5 recommends a simple way to get started with the simulator. Section 6 gives credit to the R packages that are leveraged by the simulator, and Section 7 reviews other simulation packages. We conclude (in Section 8) with a discussion of the future of the simulator.

2. A first look: Betting on sparsity with the lasso

The easiest way to learn about the simulator is by example. For this reason, we have created a series of vignettes, available online, which show how the simulator can be used in the context of some of the most well-known papers in statistics. We provide here a first look at working with the simulator, applying it to a simulation involving the lasso (an extended form of this example is available online as a vignette).

Figure 1: A simulation can be viewed as a pipeline of interlocking components.

Hastie, Tibshirani, and Friedman (2009) put forward the "bet on sparsity" principle: "Use a procedure that does well in sparse problems, since no procedure does well in dense problems." The authors perform simulations comparing the lasso (Tibshirani 1996) with ridge regression in sparse and dense situations. A simulation of this sort can be concisely coded in several easy-to-read lines of code using the simulator.
library(simulator)

new_simulation(name = "bet-on-sparsity", label = "Bet on sparsity") %>%
  generate_model(make_sparse_linear_model, n = 200, p = 500,
                 k = as.list(seq(5, 80, by = 5)), vary_along = "k") %>%
  simulate_from_model(nsim = 5, index = 1) %>%
  run_method(list(lasso, ridge)) %>%
  evaluate(list(mse, bestmse, df))

The code above can be read as follows: Create a new simulation (with some name and label); add to it a sequence of sparse linear models where the sparsity level k is varied from 5 to 80; simulate 5 random draws from each of these models; run the lasso and ridge regression on each draw; finally, compute the mean squared error, best mean squared error (over all values of the tuning parameter), and degrees of freedom for each of these. In coding this simulation, we only needed to define the problem-specific parts, which are 1) the function make_sparse_linear_model, which creates a Model object, 2) the Method objects lasso and ridge, and 3) the Metric objects mse, df, and bestmse. This problem-specific code is provided in the Appendix. The functions new_simulation, generate_model, simulate_from_model, run_method, and evaluate are simulator functions that take care of the details of bookkeeping, random number generation, parallel processing (if desired), saving outputs to file, etc.

It is useful to think of a simulation as a pipeline (see Figure 1), and this is emphasized by the simulator's use of the magrittr pipe %>% (Bache and Wickham 2014). We discuss the pipe further in Section 4.4. The results from all intermediate stages are saved to file. To get access to these results for further analysis, one loads the Simulation object that was created, using the name it was given.

sim <- load_simulation("bet-on-sparsity")

This object only contains references to the simulation results (and not the results themselves) and thus is quick to load and remains lightweight even for large simulations.
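To give a flavor of the problem-specific code, a model-definition function consistent with the pipeline above might look as follows. This is only a sketch consistent with the parameters printed later (x, beta, mu, sigma, n, p, k); the exact code in the Appendix may differ in its details.

```r
library(simulator)

# Sketch of a model-generating function (hypothetical; see Appendix A.1
# for the actual version). It returns a Model object whose simulate
# function produces nsim response vectors y = mu + noise.
make_sparse_linear_model <- function(n, p, k) {
  x <- matrix(rnorm(n * p), n, p)      # fixed design matrix
  beta <- rep(c(1, 0), c(k, p - k))    # k nonzero coefficients
  mu <- as.numeric(x %*% beta)
  new_model(name = "slm",
            label = sprintf("n = %s, p = %s, k = %s", n, p, k),
            params = list(x = x, beta = beta, mu = mu, sigma = 1,
                          n = n, p = p, k = k),
            simulate = function(mu, sigma, nsim) {
              y <- mu + sigma * matrix(rnorm(nsim * length(mu)),
                                       length(mu), nsim)
              split(y, col(y))         # a list of nsim draws
            })
}
```

Note that the simulate function's arguments are drawn by name from the params list, which is what allows generate_model to vary k while holding the other parameters fixed.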
We next make a plot showing how the best MSE of each method varies with the sparsity level.

plot_eval_by(sim, "best_sqr_err", varying = "k", main = "Betting on sparsity")

[Plot: "Betting on sparsity" — mean best squared error versus k for the Lasso and Ridge methods.]

The plot shows (a Monte Carlo estimate of) the best-achievable mean squared error of both methods. The rationale for the "bet on sparsity" principle is apparent from the plot. When k is low, we see large gains in performance using the lasso compared to ridge; when k is large, ridge does better; however, in this regime, neither method is doing particularly well (and in a relative sense, ridge's advantage is only slight). Observe that the call to plot_eval_by is fairly simple considering that the simulator has labeled the axes and has identified the methods in a legend. This is an example of how the formulaic nature of most simulation studies allows for efficiency in coding.

The simulator is designed to make simulations as modular as possible so that one can easily add new parts, run various steps in parallel, etc. For example, if one later decides to compare to the minimax concave penalty (MCP) (Zhang 2010), one can simply call

sim <- run_method(sim, mcp)

and the earlier steps of generating data will not be repeated. If one decides to add more than 5 random draws to the simulation, one can do so without having to redo the earlier computation (this will be discussed in greater detail in Section 4.5). Rather than looking at just the best MSE of each method, we might also like to examine how the MSE varies with degrees of freedom for each method. We look at the methods' results when the true sparsity k is low and when it is high.
subset_simulation(sim, k == 20 | k == 80) %>% plot_evals("df", "sqr_err")

[Plot: squared error versus degrees of freedom for the Lasso and Ridge, in two panels (n = 200, p = 500, k = 20 and n = 200, p = 500, k = 80).]

In practice, it is common to use cross-validation to select the tuning parameter, so we might also want to simulate the two methods when cross-validation is used. We organize this as a new simulation that shares the same models and simulated data as before, but with two new methods.

sim2 <- subset_simulation(sim, methods = "") %>%
  rename("bet-on-sparsity-cv") %>%
  relabel("Bet on sparsity (with cross validation)") %>%
  run_method(list(lasso + cv, ridge + cv)) %>%
  evaluate(mse)

We explain in the Appendix in greater depth the simulator construct that allows us to write lasso + cv, but for now we note that doing so allows us to avoid rerunning the lasso and ridge from scratch, which would be wasteful given that we have already computed them in these situations in the previous simulation. A plot reveals that qualitatively the results are quite similar.

plot_eval_by(sim2, "sqr_err", varying = "k", main = "Betting on sparsity")

[Plot: "Betting on sparsity" — mean squared error versus k for the cross-validated Lasso and Ridge.]

3. A look at each component

In this section, we look more closely at the components of the simulator and give brief overviews of the functions pertaining to each component. Thinking of a simulation as a pipeline, a Simulation object is what gets pushed through the series of pipes. Each of the main pipeline functions of the simulator (which can be thought of as the pipes) inputs and outputs a Simulation object. Each main pipeline function creates new objects, saves them to file, and then records what was done in the Simulation object.
Thus, the Simulation object serves as the memory of what has been done and as a handle for getting to the various intermediate stages of the simulation. Importantly, a Simulation object does not contain the results themselves, only references to them. This is discussed in greater detail in Section 4.

• new_simulation is used to create a new Simulation object. By default, when it is called, a record of its existence is saved to file.
• save_simulation is passed a Simulation object and saves it to file. The location is given by the name and dir (short for directory) slots of the object.
• load_simulation, when passed the name of a simulation, loads the Simulation object into the R session.
• subset_simulation is useful for creating a new Simulation object from a preexisting one. For example, this is useful for focusing a plot on only certain models or for reusing only certain simulation results in a different context.

3.1. Models component

The first component in the simulator is the models component. An object of class Model defines a set of parameters and a function, called simulate, that takes these parameters and returns a specified number of random draws from this distribution. For example, in the "bet on sparsity" example of Section 2, we created a sequence of Model objects with varying sparsity level k. Loading one such object, we can have a look by printing it:

m <- model(sim, k == 10)
m

Model Component
 name: slm/k_10/n_200/p_500
 label: n = 200, p = 500, k = 10
 params: x beta mu sigma n p k

The parameters are stored in a named list m@params, and the function to simulate data is m@simulate. Here are some of the most useful functions related to the models component:

• new_model creates a new Model object. In the example, it is called in make_sparse_linear_model (see Appendix A.1).
• generate_model is one of the main pipeline functions of the simulator. The user supplies it with a function (make_sparse_linear_model, in the example) that returns an object of class Model, and generate_model then creates this model, saves it to file, and adds a reference to it to the Simulation object. In many simulation studies, one is interested in a sequence of models, indexed by several parameters. The vary_along parameter of generate_model is used to specify which parameters should be varied.
• simulate_from_model is another main pipeline function. Once the model has been generated, simulate_from_model creates a specified number of random draws from the distribution defined by the model. The Draws objects are saved to file and references to them are added to the Simulation object. These simulations can be broken into chunks and run either sequentially or in parallel (or a mix of the two). Parallelization is described in greater detail in Section 4.5.
• model loads one or more Model objects from a Simulation object into the R session.
• draws loads one or more Draws objects from a Simulation object into the R session.

The Draws objects have a simple structure.

d <- draws(sim, k == 10)
d

Draws Component
 name: slm/k_10/n_200/p_500
 label: (Block 1:) 5 draws from n = 200, p = 500, k = 10

In this example, d@draws is a list of length 5 (since we had nsim = 5), with each element being a realization of the response vector y.

3.2. Methods component

The next component in the simulator is the methods component. An object of class Method defines a function that takes arguments model and draw. In the example, the design matrix X is contained in the model (accessible via model$x), and the response vector y comes from the draw argument. A Method object can also (optionally) include a list of settings.

• new_method creates a new Method object. In the example, the objects lasso and ridge were produced by the new_method function (see Appendix A.2).
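As an illustration of the Method interface described above, a lasso method might be defined roughly as follows. This is a hedged sketch, assuming the glmnet package; the actual definition appears in Appendix A.2 and may differ.

```r
library(simulator)

# Sketch of a Method object (hypothetical; see Appendix A.2 for the
# actual version). The method function receives the Model and one draw
# and returns a named list, which becomes the "out" of an Output object.
lasso <- new_method(name = "lasso", label = "Lasso",
  method = function(model, draw) {
    fit <- glmnet::glmnet(x = model$x, y = draw)
    list(beta = fit$beta,              # coefficient estimates along the path
         yhat = model$x %*% fit$beta,  # fitted values along the path
         lambda = fit$lambda,
         df = fit$df)
  })
```

The names returned here (beta, yhat, lambda, df) match the out slots printed for the lasso later in this section; computing time is recorded automatically.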
• run_method is a main pipeline function. It runs one or more methods on the simulated data, creates Output objects, saves these to file, and adds references to these to the Simulation object.
• output loads one or more Output objects from a Simulation object into the R session.
• (for advanced usage; not needed in most simulations) new_method_extension and new_extended_method are used to create new MethodExtension and ExtendedMethod objects. An ExtendedMethod is an object that behaves like a method but is allowed to use the output of another method. This can be useful when one method builds off of another. In the example, cv is a MethodExtension and lasso + cv is an ExtendedMethod (see Appendix A.4 for more).

At the core of an Output object is a list called out.

o <- output(sim, k == 10, methods = "lasso")
o

Output Component
 model_name: slm/k_10/n_200/p_500
 index: 1
 nsim: 5
 method_name: lasso
 method_label: Lasso
 out: beta, yhat, lambda, df, time

Each element of the list o@out corresponds to a random draw (i.e., a realization) and gives the output of the lasso on this draw.

3.3. Metrics/Evaluation component

The next component in the simulator is the evaluations component, in which we evaluate each method based on some user-defined metrics. An object of class Metric defines a function with arguments model and out (the output of a method). In the example, the mean squared error is a metric of interest, which measures how far the true coefficient vector β (accessible via model$beta) is from a method's estimate of this quantity (accessible via out$beta).

• new_metric creates a new Metric object. In the example, the objects mse, bestmse, and df are produced by the new_metric function (see Appendix A.3).
• evaluate is a main pipeline function. It computes one or more metrics on the outputs of the methods, creates Evals objects, saves these to file, and adds references to these to the Simulation object. In addition to the metrics that are explicitly passed to evaluate, the computing time for a method is automatically saved as well.
• evals loads one or more Evals objects from a Simulation object into the R session.
• as.data.frame converts an Evals object (or list of Evals objects) to a data.frame.

An Evals object contains the computed metrics on each method-draw pair.

e <- evals(sim, k == 10)
e

Evals Component
 model_name: slm/k_10/n_200/p_500
 index: 1 (5 nsim)
 method_name(s): lasso, ridge (labeled: Lasso, Ridge)
 metric_name(s): sqr_err, df, best_sqr_err, time
 metric_label(s): squared error, degrees of freedom, best squared error, Computing time

In particular, e@evals$lasso and e@evals$ridge are lists of length 5 (one per random draw), and e@evals$lasso[[1]]$best_sqr_err gives the best MSE for the lasso on the first draw.

3.4. Plot/Table component

Making standard plots and tables of simulation results is a task that is particularly streamlined using the simulator.

• plot_eval takes the name of a metric and makes side-by-side box plots for each method. If multiple models are included, these are shown as a multifaceted plot.

subset_simulation(sim, k == 20 | k == 80) %>% plot_eval("time")

[Plot: box plots of computing time for the Lasso and Ridge, in two panels (n = 200, p = 500, k = 20 and n = 200, p = 500, k = 80).]

• plot_evals takes the names of two metrics and plots the first versus the second with differently colored/styled lines (or points) for each method. We used this function in Section 2.
• plot_eval_by plots a metric versus a model parameter that was varied across models (in our example, this corresponded to the sparsity level k of the true coefficient vector). By default, a line with error bars is shown. The center and width of these error bars are, by default, the sample mean (across the random realizations) and the standard error of the sample mean, respectively; however, one can pass custom-made Aggregator objects to have the center and width of the error bars represent different aggregates. Another option (with type = "raw") simply shows the raw values of the metric for each method at each value of the model parameter.
• tabulate_eval takes the name of a metric and creates a table (in latex, markdown, or html format) in which each row is a model and each column is a method. As with plot_eval_by, the default is to show the sample mean (with its standard error in parentheses), and the user can provide custom Aggregator objects so that these represent more general aggregates.
• (for advanced usage; not needed in most simulations) new_aggregator can be used to create a custom Aggregator.

4. Design principles

4.1. Modularity of components

As noted earlier, a simulation can be viewed as a pipeline of interlocking components. Each component can be easily inserted, modified, or removed without requiring one to change the structure of the overall simulation. For example, if one wishes to consider an additional model, include another method, or evaluate the methods using a different metric, this can all be done with minimal structural changes to the overall simulation. Each component in the pipeline has a standardized input and is responsible for providing output that will fit into the next component's input. For example, at the heart of a Metric object is a function that takes in a method's output and a Model and outputs a number or vector of numbers. The output of a metric can then be easily fed into the simulator's plot or tabulation functions.
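To make the Metric interface concrete: a squared-error metric comparing the true model$beta with a method's estimate out$beta could be sketched as follows. This is a hypothetical sketch (the actual definitions of mse, bestmse, and df appear in Appendix A.3).

```r
library(simulator)

# Sketch of a Metric object (hypothetical; see Appendix A.3 for the
# actual version). It returns the mean squared error of the estimated
# coefficients at each point along a method's tuning-parameter path,
# so the result is a vector of numbers for each draw.
mse <- new_metric(name = "sqr_err", label = "squared error",
  metric = function(model, out) {
    colMeans(as.matrix(out$beta - model$beta)^2)
  })
```

Because a metric returns a number or vector per draw, its output slots directly into plot_eval, plot_evals, plot_eval_by, and tabulate_eval without further adaptation.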
Another aspect of modularity is that the name (a short identifier used in file names) and the label (a longer human-readable string) of a component are stored as part of the component. This is in contrast to the typical practice of only labeling methods or models when plots or tables are made. Conceptually, it makes much more sense to label components at the location where they are defined. [1] A great benefit is that all downstream components have access to the labels. For example, the simulator's plot functions automatically provide informative descriptions such as legends identifying the various methods, labeled axes, and titles describing the model.

Decoupling the components of a simulation is also important when certain aspects require more computing time than others. For example, applying statistical methods to simulated data is often much slower than computing metrics to evaluate their success. The modularity of the simulator disentangles these two parts so that if one changes a single metric, one does not have to rerun the statistical methods. Likewise, the plotting and tabulating components are themselves separate from all previous parts, since one should not have to rerun any major computation as one makes slight modifications to plots or tables.

Finally, modularity makes it easier to share components across one's projects and with other researchers. A group of researchers might develop a set of common models and methods for, say, high-dimensional covariance estimation.

[1] This is similar in spirit to roxygen2 (Wickham, Danenberg, and Eugster 2015), in which one documents functions in the same place where they are defined rather than writing separate .Rd files.
When someone proposes a new covariance estimator, one simply needs to define the new Method object, and then this can be easily plugged into the preexisting code base, saving a lot of time and making for a more meaningful comparison.

4.2. Every component's results saved to file

Every component of a simulation generates R objects that take time to compute and take space to keep in memory. When we modify a component of a simulation, we should not have to rerun the parts of the simulation that are "upstream" of the modification. Also, in analyzing a simulation, we might want to examine the generated objects at various stages in the simulation; again, we should not have to rerun anything to see an intermediate stage of the simulation. For all of these reasons, the simulator automatically saves all generated objects to files when one runs a simulation. The files are kept in an organized directory structure that is easy to interpret; however, the user never has to explicitly learn or pay attention to the particulars of the directory structure, since there is a series of simulator functions that make it simple to load these saved files.

4.3. Working with references

In the most common usage of the simulator, a Simulation object is passed through the simulator pipeline. As it gets fed through the various component functions, it accumulates more and more of a "record" of the simulation. An important aspect is that the Simulation object does not contain the objects generated by the various components themselves (which would be extremely memory intensive); rather, it contains references to these objects. A reference is an object containing the saved location of an object generated by the simulator. The fact that Simulation objects contain references rather than the objects themselves makes the simulator behave much more nimbly. One only loads the specific pieces of a simulation as needed.
The functions models, draws, outputs, and evals allow one to load specific objects referred to in a simulation. A more advanced consequence is that multiple simulations can refer to the same set of saved objects (thus, if multiple simulations have some common aspects, one can avoid recomputing these or creating multiple copies of them). The function subset_simulation is useful in this context.

4.4. Magrittr friendly

The first argument and the output of most simulator functions is a Simulation object. This is done to facilitate magrittr integration, which allows one to write succinct "one-liner" simulations that read (almost) like sentences. The pipe operator passes the output of its left-hand side to the first argument of the function on its right. [2] Without the pipe, the example presented in Section 2 would be a bit more cumbersome.

sim <- new_simulation(name = "bet-on-sparsity", label = "Bet on sparsity")
sim <- generate_model(sim, make_sparse_linear_model, n = 200, p = 500,
                      k = as.list(seq(5, 80, by = 5)), vary_along = "k")
sim <- simulate_from_model(sim, nsim = 5, index = 1)
sim <- run_method(sim, list(lasso, ridge))
sim <- evaluate(sim, list(mse, bestmse, df))

4.5. Parallel processing and random seed streams

The simulator uses the parallel package (R Core Team 2016) to allow simulations to be run in parallel. In simulate_from_model, one simulates draws in chunks, indexed by index. For example,

simulate_from_model(sim, nsim = 10, index = 1:10)

would lead to a total of 100 simulations performed in chunks of size 10. The index of a chunk is used to specify a distinct stream of pseudorandom numbers, using the "L'Ecuyer-CMRG" generator in R (L'Ecuyer 1999). The use of streams is convenient because it compartmentalizes chunks of random draws, so that the starting state of one chunk does not depend on the end state of another chunk.
In particular, the following three options all give identical results.

1) In sequence:

simulate_from_model(sim, nsim = 10, index = 1:10) %>%
  run_methods(list_of_methods)

2) In parallel:

simulate_from_model(sim, nsim = 10, index = 1:10) %>%
  run_methods(list_of_methods, parallel = list(socket_names = 4))

3) Mixed and matched, with no particular order to index:

simulate_from_model(sim, nsim = 10, index = 1:2) %>%
  run_methods(list_of_methods)
# perhaps a day later:
sim <- load_simulation("bet-on-sparsity")
simulate_from_model(sim, nsim = 10, index = 5:10) %>%
  run_methods(list_of_methods, parallel = list(socket_names = 4))
# realize later you forgot two chunks:
sim <- load_simulation("bet-on-sparsity")
simulate_from_model(sim, nsim = 10, index = 3:4) %>%
  run_methods(list_of_methods)

When simulate_from_model creates a new Draws object (i.e., a chunk of random draws), the end state of the random number generator is saved to file. Whenever run_method is called on that Draws object, the random number generator is first set to that stored state. This ensures that we will get the same results regardless of the order in which we apply our methods (if any are randomized algorithms).

[2] For example, matrix(rnorm(5 * 2), 5, 2) could equivalently be written 5 * 2 %>% rnorm %>% matrix(5, 2).

4.6. Unified interface for plotting and tables

Gelman, Pasarica, and Dodhia (2002) go through every table appearing in an issue of the Journal of the American Statistical Association and conclude that plots are more effective than tables for most tasks. The simulator therefore focuses mostly on the generation of plots rather than tables. That said, the basic function for making tables (which can be in latex, html, or markdown format) is designed to have a very similar interface to that of the plotting functions, reflecting the similar tasks these two functions accomplish. For example, returning to the comparison of the lasso and ridge with cross-validation, one could tabulate the results (even though the plot generated in Section 2 seems more useful).

subset_simulation(sim2, k > 30 & k <= 60) %>%
  tabulate_eval("sqr_err", format = list(nsmall = 2, digits = 0))

Table 1: A comparison of mean squared error (averaged over 5 replicates).

                          Lasso cross validated   Ridge cross validated
n = 200, p = 500, k = 35  0.04 (0.00)             0.05 (0.00)
n = 200, p = 500, k = 40  0.06 (0.01)             0.06 (0.00)
n = 200, p = 500, k = 45  0.08 (0.01)             0.07 (0.00)
n = 200, p = 500, k = 50  0.09 (0.01)             0.08 (0.00)
n = 200, p = 500, k = 55  0.09 (0.00)             0.08 (0.00)
n = 200, p = 500, k = 60  0.11 (0.00)             0.09 (0.00)

Everything about the table created by this command, including the caption, was automatically generated by tabulate_eval.

4.7. Easy extraction of data into data.frames

The simulator is intended to provide all the major functionality that is typically required when one performs a simulation study; however, users should not feel "locked in" to the simulator. In particular, since there may be specific situations in which a non-standard operation is required (e.g., generating an unusual sort of plot based on simulation results), the simulator makes it simple to extract simulation data into a data.frame, which can then be manipulated outside the simulator framework.

5. Getting started

The easiest way to start using the simulator is to use the function create. The command

create("name/of/new/directory")

will produce the specified directory and the following files that make up the skeleton of a simulation:

• model_functions.R - file with a function that creates Model objects.
• method_functions.R - file that creates Method objects.
• eval_functions.R - file that creates Metric objects.
• main.R - file with the main simulation pipeline code (and that sources the three *_functions.R files).
• writeup.Rmd - an rmarkdown file that lays out the code and results (showing the main code first and the definitions of the various components later).

The simulation code is not itself stored in writeup.Rmd, so that as one makes changes to the R files, the report will remain up-to-date. Some basic caching logic is built into writeup.Rmd so that if the .R files of the simulation have not been changed since the simulation data was created, the simulation will not be rerun when one knits writeup.Rmd. This is convenient since one can see the effects of changes to the report without having to rerun the entire simulation. A series of online vignettes for the simulator demonstrate its use in varied settings and may be useful for getting started with the simulator.

6. Dependence on other R packages

The simulator makes use of several R packages. The pipe operator %>% comes from magrittr (Bache and Wickham 2014); the parallel processing capabilities are built upon the parallel package (R Core Team 2016); file name formation is based on the digest package (Eddelbuettel, with contributions by Lucas, Tuszynski, Bengtsson, Urbanek, Frasca, Lewis, Stokely, Muehleisen, Murdoch, Hester, Wu, and Onkelinx
2016); knitr (Xie 2016) and rmarkdown (Allaire, Cheng, Xie, McPherson, Chang, Allen, Wickham, Atkins, and Hyndman 2016) are used for creating tables and for report generation; plots are by default generated using ggplot2 (Wickham 2009), although each plotting function has a use_ggplot2 = FALSE setting for users preferring the basic plot functions provided through graphics; finally, although not directly involved in user-called functions, the packages devtools (Wickham and Chang 2016) and testthat (Wickham 2011) were instrumental in the development of the package.

7. Related work

A look into the literature and a search online for software packages reveal several packages similar in spirit to the simulator, which we briefly review here. The packages ezsim (Chan 2014) and simsalapar (Hofert and Mächler 2016) are similar in spirit in many ways to the simulator and provide much of the same functionality (such as the ability to parallelize simulations, save results to files, generate plots, and vary model parameters). The major differences are in terms of design choices. For example, these packages' approaches are much less modular and less object-oriented. All aspects of the simulation are passed as functions to a single master function. By contrast, the simulator design encourages one to think of a simulation as a pipeline with replaceable parts that can be easily reused, swapped, and shared. Another difference is that ezsim does not appear well-suited to simulations that evolve over time (for example, if one later decides to add a method, increase the number of random draws, or add an additional metric, the simulator does not require one to rerun everything). Alfons, Templ, and Filzmoser (2010) develop the package simFrame as an object-oriented framework for simulation (and also include parallelization and plotting abilities).
Their particular focus is on survey statistics (with associated issues of outliers and missing data), but the authors note that the framework is general and can be extended to many other applications. In the examples presented, it appears that the user writes a function, passed to simFrame::runSimulation, that has all the methods and metrics computed within it. This bundling of all methods and metrics together goes against the simulator design principle that methods should be decoupled from each other (so they can be run in parallel or at different times) and that the computation of metrics to evaluate the method objects should be viewed as a separate layer of the simulation. The package harvestr (Redd 2014) focuses on the parallelization and caching infrastructure (and adds color with function names related to gardening!); the framework is quite general but less tailored to the sort of statistical simulations in which one compares methods across different models and wants to rapidly generate plots. SimDesign (Chalmers 2016) has many aspects in common with the simulator and simFrame; for example, its generate-analyse-summarise pipeline is similar to the model-method-evaluate pipeline of the simulator. However, like the other simulation packages, SimDesign's methods appear to be bundled together in the analyse step. Likewise, all metrics are bundled together in the summarise step. Again, a distinguishing feature of the simulator is the modular organization of the Models/Methods/Metrics components.

8. Discussion

The simulator attempts to streamline an important yet often dreaded part of writing statistics papers. By taking care of the infrastructure and enforcing some structure on the process, we have found that it makes conducting simulation studies less tedious and less haphazard.
Using this tool allows the user to focus time, energy, and thought on the problem-specific aspects, which hopefully leads to more carefully thought-out, insightful studies that might even be enjoyable to conduct. The foundation provided by the simulator can easily be built upon by developing domain-specific libraries of models, methods, and metrics. For example, the areas of high-dimensional regression, classification, multiple testing, graphical models, clustering, network science, etc., could each have a repository of simulator Model, Method, and Metric objects that can be easily shared. Building up such a library of components would greatly facilitate the sharing of simulation models, leading to greater consistency across the literature while at the same time requiring less effort.

Acknowledgments

The inspiration to take the large code base we had developed over years and turn it into an easy-to-use R package came from devtools, which was also used to make this package. We also acknowledge the book "R Packages" (Wickham 2015) for influencing many of the design and code-style choices for the simulator package.

References

Alfons A, Templ M, Filzmoser P (2010). "An Object-Oriented Framework for Statistical Simulation: The R Package simFrame." Journal of Statistical Software, 37(i03).

Allaire J, Cheng J, Xie Y, McPherson J, Chang W, Allen J, Wickham H, Atkins A, Hyndman R (2016). rmarkdown: Dynamic Documents for R. R package version 0.9.6, URL https://CRAN.R-project.org/package=rmarkdown.

Bache SM, Wickham H (2014). magrittr: A Forward-Pipe Operator for R. R package version 1.5, URL https://CRAN.R-project.org/package=magrittr.

Chalmers P (2016). SimDesign: Structure for Organizing Monte Carlo Simulation Designs. R package version 1.0, URL https://CRAN.R-project.org/package=SimDesign.

Chan TJ (2014).
ezsim: Provide an Easy to Use Framework to Conduct Simulation. R package version 0.5.5, URL https://CRAN.R-project.org/package=ezsim.

Gelman A, Pasarica C, Dodhia R (2002). "Let's Practice What We Preach: Turning Tables into Graphs." The American Statistician, 56(2), 121–130.

Hastie T, Tibshirani R, Friedman J (2009). The Elements of Statistical Learning: Data Mining, Inference and Prediction, Second Edition. Springer Verlag, New York.

Hofert M, Mächler M (2016). "Parallel and Other Simulations in R Made Easy: An End-to-End Study." Journal of Statistical Software, 69(4), 1–44. doi:10.18637/jss.v069.i04.

L'Ecuyer P (1999). "Good Parameters and Implementations for Combined Multiple Recursive Random Number Generators." Operations Research, 47(1), 159–164.

R Core Team (2016). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

Redd A (2014). harvestr: A Parallel Simulation Framework. R package version 0.6.0, URL https://CRAN.R-project.org/package=harvestr.

Tibshirani R (1996). "Regression Shrinkage and Selection via the Lasso." Journal of the Royal Statistical Society, Series B, 58, 267–288.

Wickham H (2009). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. ISBN 978-0-387-98140-6. URL http://ggplot2.org.

Wickham H (2011). "testthat: Get Started with Testing." The R Journal, 3, 5–10. URL http://journal.r-project.org/archive/2011-1/RJournal_2011-1_Wickham.pdf.

Wickham H (2015). R Packages. O'Reilly Media, Inc.

Wickham H, Chang W (2016). devtools: Tools to Make Developing R Packages Easier. R package version 1.11.1, URL https://CRAN.R-project.org/package=devtools.

Wickham H, Danenberg P, Eugster M (2015). roxygen2: In-Source Documentation for R. R package version 5.0.1, URL https://CRAN.R-project.org/package=roxygen2.
Eddelbuettel D, with contributions by Lucas A, Tuszynski J, Bengtsson H, Urbanek S, Frasca M, Lewis B, Stokely M, Muehleisen H, Murdoch D, Hester J, Wu W, Onkelinx T (2016). digest: Create Compact Hash Digests of R Objects. R package version 0.6.9, URL https://CRAN.R-project.org/package=digest.

Xie Y (2016). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.13, URL https://CRAN.R-project.org/package=knitr.

Zhang CH (2010). "Nearly Unbiased Variable Selection under Minimax Concave Penalty." The Annals of Statistics, pp. 894–942.

A. Appendix

This appendix provides the problem-specific code required for the example shown in Section 2.

A.1. The model

We simulate from a linear model Y = Xβ + ε, where Y ∈ R^n, β ∈ R^p, and ε ∼ N(0, σ² I_n). We have taken X to have iid N(0, 1) entries and treat it as fixed in this simulation. We define a Model object, which specifies the parameters and, most importantly, describes how to simulate data.

```r
make_sparse_linear_model <- function(n, p, k) {
  x <- matrix(rnorm(n * p), n, p)
  beta <- rep(c(1, 0), c(k, p - k))
  mu <- as.numeric(x %*% beta)
  sigma <- sqrt(sum(mu^2) / (n * 2))  # fixes signal-to-noise ratio at 2
  new_model(name = "slm",
            label = sprintf("n = %s, p = %s, k = %s", n, p, k),
            params = list(x = x, beta = beta, mu = mu, sigma = sigma,
                          n = n, p = p, k = k),
            simulate = function(mu, sigma, nsim) {
              y <- mu + sigma * matrix(rnorm(nsim * n), n, nsim)
              return(split(y, col(y)))  # make each col its own list element
            })
}
```

We will typically put the code above in a file named model_functions.R.

A.2. The methods

We compare the lasso and ridge regression. Both of these methods depend on a tuning parameter, so we compute a sequence of solutions.
```r
library(glmnet)

lasso <- new_method("lasso", "Lasso",
                    method = function(model, draw, lambda = NULL) {
                      if (is.null(lambda)) {
                        fit <- glmnet(x = model$x, y = draw, nlambda = 50,
                                      intercept = FALSE)
                      } else {
                        fit <- glmnet(x = model$x, y = draw, lambda = lambda,
                                      intercept = FALSE)
                      }
                      list(beta = fit$beta, yhat = model$x %*% fit$beta,
                           lambda = fit$lambda, df = fit$df)
                    })

ridge <- new_method("ridge", "Ridge",
                    method = function(model, draw, lambda = NULL) {
                      sv <- svd(model$x)
                      df_fun <- function(lam) {
                        # degrees of freedom when the tuning parameter is lam
                        sum(sv$d^2 / (sv$d^2 + lam))
                      }
                      if (is.null(lambda)) {
                        nlambda <- 50
                        get_lam <- function(target_df) {
                          f <- function(lam) df_fun(lam) - target_df
                          uniroot(f, c(0, 100 * max(sv$d^2)))$root
                        }
                        lambda <- sapply(seq(1, nrow(model$x),
                                             length = nlambda), get_lam)
                      }
                      df <- sapply(lambda, df_fun)
                      beta <- sapply(lambda, function(r) {
                        d <- sv$d / (sv$d^2 + r)
                        return(sv$v %*% (d * crossprod(sv$u, draw)))
                      })
                      list(beta = beta, yhat = model$x %*% beta,
                           lambda = lambda, df = df)
                    })
```

Methods can return different items. However, aspects of a method that will be used downstream in the simulation and compared across methods should be in a common format. Thus beta, yhat, and df are in the same format in each case. These will be the items used when evaluating the methods' performances.

We will typically put the code above in a file named method_functions.R.

A.3. The metrics

When we compare methods in plots and tables, there are usually a number of "metrics" we use. An object of class Metric specifies how to take a model's parameters and the output of a method and return some quantity of interest.
```r
mse <- new_metric("mse", "Mean-squared error",
                  metric = function(model, out) {
                    colMeans(as.matrix(out$beta - model$beta)^2)
                  })

bestmse <- new_metric("bestmse", "Best mean-squared error",
                      metric = function(model, out) {
                        min(colMeans(as.matrix(out$beta - model$beta)^2))
                      })

df <- new_metric("df", "Degrees of freedom",
                 metric = function(model, out) out$df)
```

Observe that out refers to the list returned by our methods and model refers to the Model object generated by make_sparse_linear_model. The $ operator can be used to get parameters stored in the Model object.

We will typically put the code above in a file named eval_functions.R.

A.4. Extensions of the methods

In Section 2, we wanted to study cross-validated versions of the lasso and ridge regression, so we passed to run_method the objects lasso + cv and ridge + cv. These are ExtendedMethod objects, which behave much like Method objects except that they have access to the output of another method. If we only cared about cross-validating the lasso, we could have created an ExtendedMethod directly; however, in this situation we wanted both methods to be cross-validated in an identical fashion. In the spirit of code reusability, we therefore created what we call a MethodExtension object, cv. A MethodExtension object, when added to a Method object, generates an ExtendedMethod.
```r
make_folds <- function(n, nfolds) {
  nn <- round(n / nfolds)
  sizes <- rep(nn, nfolds)
  sizes[nfolds] <- sizes[nfolds] + n - nn * nfolds
  b <- c(0, cumsum(sizes))
  ii <- sample(n)
  folds <- list()
  for (i in seq(nfolds)) folds[[i]] <- ii[seq(b[i] + 1, b[i + 1])]
  folds
}

cv <- new_method_extension("cv", "cross validated",
                           method_extension = function(model, draw, out,
                                                       base_method) {
                             nfolds <- 5
                             err <- matrix(NA, ncol(out$beta), nfolds)
                             ii <- make_folds(model$n, nfolds)
                             for (i in seq_along(ii)) {
                               train <- model
                               train@params$x <- model@params$x[-ii[[i]], ]
                               train@params$n <- nrow(train@params$x)
                               train_draw <- draw[-ii[[i]]]
                               test <- model
                               test@params$x <- model@params$x[ii[[i]], ]
                               test@params$n <- nrow(test@params$x)
                               test_draw <- draw[ii[[i]]]
                               fit <- base_method@method(model = train,
                                                         draw = train_draw,
                                                         lambda = out$lambda)
                               yhat <- test$x %*% fit$beta
                               ll <- seq(ncol(yhat))
                               err[ll, i] <- colMeans((yhat - test_draw)^2)
                             }
                             m <- rowMeans(err)
                             se <- apply(err, 1, sd) / sqrt(nfolds)
                             imin <- which.min(m)
                             ioneserule <- max(which(m <= m[imin] + se[imin]))
                             list(err = err, m = m, se = se, imin = imin,
                                  ioneserule = ioneserule,
                                  beta = out$beta[, imin],
                                  yhat = model$x %*% out$beta[, imin])
                           })
```

Of course, if we later added mcp to the simulation, we could easily incorporate cross-validation into it as well with mcp + cv. We will typically put the code above in the file named method_functions.R.
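As a recap of how the appendix files fit together, a main.R along the following lines would run the full pipeline. This is a hedged sketch: the pipeline verbs (new_simulation, generate_model, simulate_from_model, run_method, evaluate, plot_eval_by) are those used throughout the paper, but the specific arguments shown here are assumptions rather than the paper's exact code.

```r
library(simulator)  # provides the pipeline verbs and %>%

# The three component files defined in this appendix:
source("model_functions.R")
source("method_functions.R")
source("eval_functions.R")

sim <- new_simulation("bet-on-sparsity", "Bet on sparsity") %>%
  # vary the sparsity level k across models (list-valued + vary_along)
  generate_model(make_sparse_linear_model, n = 200, p = 500,
                 k = as.list(seq(35, 60, by = 5)),
                 vary_along = "k") %>%
  simulate_from_model(nsim = 5, index = 1) %>%
  run_method(list(lasso + cv, ridge + cv)) %>%
  evaluate(list(mse, bestmse, df))

# One call summarizes performance across the varied parameter k:
plot_eval_by(sim, "mse", varying = "k")
```

Because each stage saves its results to file, any of these lines can later be rerun or extended (more draws, more methods, more metrics) without redoing the rest.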
