R Markdown: Integrating A Reproducible Analysis Tool into Introductory Statistics
Nolan and Temple Lang argue that "the ability to express statistical computations is an essential skill." A key related capacity is the ability to conduct and present data analysis in a way that another person can understand and replicate. The copy-a…
Authors: Ben Baumer, Mine Cetinkaya-Rundel, Andrew Bray
Ben Baumer∗, Mine Çetinkaya-Rundel†, Andrew Bray∗, Linda Loi∗ and Nicholas J. Horton‡

Abstract

Nolan and Temple Lang argue that “the ability to express statistical computations is an essential skill.” A key related capacity is the ability to conduct and present data analysis in a way that another person can understand and replicate. The copy-and-paste workflow that is an artifact of antiquated user-interface design makes reproducibility of statistical analysis more difficult, especially as data become increasingly complex and statistical methods become increasingly sophisticated. R Markdown is a new technology that makes creating fully-reproducible statistical analysis simple and painless. It provides a solution suitable not only for cutting edge research, but also for use in an introductory statistics course. We present evidence that R Markdown can be used effectively in introductory statistics courses, and discuss its role in the rapidly-changing world of statistical computation.

1. Introduction

Statistical analysis of data is both increasingly common and increasingly sophisticated. While the imperative to convey findings with clarity remains, the modern statistical analyst faces a variety of challenges that may make analyses more difficult to understand. First, as the field of statistics deepens, applications of statistics are increasingly complex. Second, collaboration among researchers is now the norm, rather than the exception. Third, much of that collaboration is conducted remotely, with written analyses, data files, and computing scripts shared via electronic means. Fourth, the underlying data being analyzed is larger and more complex, making it impossible to fully describe on paper, and thus necessitating transmission via an electronic file.
Each of these complications makes it harder to completely follow someone else’s work. Yet this task (in a word, reproducibility) remains the lifeblood of scientific collaboration.

In the past few years, the startling realization that many modern scientific findings cannot be replicated has been highlighted in the popular press (The Economist Editorial 2013; Johnson 2014) as well as the scientific literature (Ioannidis 2013). Many factors have been identified, including publication bias, reporting bias, conflicts of interest, and insufficient statistical power. This last factor can be remedied by encouraging the replication of studies and then conducting subsequent meta-analyses. In order for a scientific study to be replicated, however, the method of statistical analysis must be entirely reproducible.

Teaching reproducible analysis in an introductory statistics course not only makes students aware of these issues, but also paves the way toward making them valuable contributors to modern data analysis. These future contributions could be made as part of academic research or for a data-centric enterprise that needs to conduct daily analysis on new data.

∗ Department of Mathematics & Statistics, Smith College, Northampton, MA 01063
† Department of Statistical Science, Duke University, Durham, NC 27708
‡ Department of Mathematics, Amherst College, Amherst, MA 01002

The journal Nature addressed these issues head-on in an editorial outlining the efforts that the journal would take to reduce their irreproducibility (Nature Editorial 2013). A key provision is their pledge to “examine statistics more closely and encourage authors to be transparent, for example by including their raw data.” In searching for the source of this irreproducibility, they note that “mentoring of young scientists on matters of rigour and transparency is inconsistent at best” (p. 398).
A natural environment to provide this mentoring is the first time that most young scientists will encounter the formal principles of scientific inquiry: in introductory statistics.

The introductory statistics course has changed greatly in recent decades, with more focus on active learning, use of technology for conceptual understanding and analysis of data, along with “naked, realistic and real” data (GAISE College Group 2005). In this paper, we discuss how R Markdown (Allaire, Horner, Marti, and Porte 2013), a simple, easy-to-learn, open source markup language, can be integrated into an introductory statistics course in an effort to achieve the GAISE guidelines, specifically, to enable students to develop the basic capacity to undertake modern data analysis and communicate their results.

Our world is increasingly awash in data. How we plan to store that information, how we will analyze it, and what insight we hope to extract from it are central unanswered questions facing data scientists, network scientists, statisticians, and computer scientists alike, both inside of academia and out. As statistics instructors, we face the difficult task of preparing our students to make their way in a sea of data that is foreign to many of us. Meanwhile, intrepid graduates hop on massive vessels like Google, Facebook, and Amazon each year, perhaps not knowing that the two-sample t-test they spent so much time studying in college is unlikely to be sufficient for their future work (Cobb 2007).

But our aim here is not to discuss the topics in introductory statistics courses, but rather, the workflow. Hopefully the days in which students perform statistical analyses by hand or with a calculator are, if not over, bounded above by a function tending to zero. Thus, we take it as a given that our students’ future work will be done on a computer.
As such, providing students with the tools to “think with data” and “compute with data” is essential to their prosperity as data analysts. At the same time, the ability to communicate one’s findings to other people is imperative. At the end of every data analysis task, there is a person who wants to understand the analyst’s findings. That person may be a boss, a journalist, a doctor, or a policy-maker, who more often than not will have a weaker technical background than the analyst. This is the way it should be, since otherwise, the person might as well perform the analysis themselves. But the end result is that the analyst’s value is ultimately tied up in how understandable she can make her work.

Because both computation and presentation are essential, a typical workflow comprises at least two major components: a statistical software package for performing the data analysis; and a layout package for presenting the results. For the former, we have had very positive experiences using R (R Core Team 2013), even in an introductory course, but other options (e.g. SAS or Stata) may be feasible.¹ For presentation, written reports tend to be composed in a word-processing application (e.g. Microsoft Word, LibreOffice Writer, or Google Docs) while oral presentations tend to use slides prepared in a presentation application (e.g. Apple Keynote, Microsoft PowerPoint, LaTeX with beamer, or Prezi).

A pairing of a statistical package and a layout package constitutes a workflow. That is, the analyst’s work will typically begin in their statistical package of choice, wherein the data analysis will be performed. Once completed, translated summaries of work, be they tables, charts, images, or other output, need to be integrated into the layout application.
Once in this environment, additional material can be layered onto the statistical results, so that a (usually less technical) human being can understand the findings.

This workflow is ubiquitous, and in most undergraduate courses where students are expected to compute with data, student homework assignments are produced in this manner. That is, statistical computations are performed in a statistics package, say R, and then a written summary is produced in, say, Word. Tables and plots are simply copied-and-pasted from R to Word.

So if this workflow is so common, what is wrong with it? In truth, there are several important undesirable aspects. First, it is not reproducible. Since the commands used to generate the statistical output are not present in the final presentation, then either: a) the reader must assume that the student has calculated exactly what they say they have calculated, since there is no way of verifying the computation; or b) the grader must rely on the student to also copy-and-paste the commands used to generate the analysis. In either case, the grader will frequently be unable to completely follow the student’s work. Moreover, the issue of reproducibility is relevant not only for a second party (i.e. a grader), but also for the student. Being able to retrace steps while studying for a final, for example, is a desirable outcome. More concretely, the student may be reminded years later of the analysis, and seek to reapply the same methods in a different setting. Having the commands separated from the results inhibits this process.

¹ Here we must distinguish between command-driven software packages (each of the aforementioned) and menu-driven software packages (e.g. StatCrunch or Microsoft Excel). It is increasingly the case that the complexity of data analysis tasks requires the additional functionality and programmability of command-driven applications. The iteration required of students doing inquiry-based projects often breaks down in a menu-driven workflow.

Second, the separation of computation from analysis is not logical. The commands in an R script proceed chronologically, such that the analyst will most likely run the entire script all at once. A written report will be read in the same order, and there is no reason why the commands and analysis should not be interwoven. Rather, the impetus to separate the commands from the analysis is that the statistical package is not good at presentation, and the word-processing application is not good at computing. But this is not the ideal setup for the data analyst; it is simply an artifact of software design. R Markdown helps to bridge this gap in the data analysis workflow.

Third, the separation of computing from presentation is not necessarily honest. At Smith College, a strict honor code, to which all students are bound, discourages cheating. But it is all too easy for a student copying-and-pasting output from one program to another to fudge a few numbers. Again, the divorce of the computation from the presentation enables the student to edit the content along the way. The possibility of getting “lost in translation” is disastrous for the data analyst. More subtly and less perniciously, the copy-and-paste paradigm enables, and in many cases even encourages, selective reporting. That is, the tabular output from R is admittedly not of presentation quality. Thus the student may be tempted or even encouraged to prettify tabular output before submitting. But while one is fiddling with margins and headers, it is all too tempting to remove rows or columns that do not suit the student’s purpose. Since the commands used to generate the table are not present, the reader is none the wiser.

Lastly, the copy-and-paste paradigm is error prone.
When jumping between multiple windows (R and a word processor), students, often working on laptops with small cluttered screens, inadvertently copy-and-paste partial output or forget to update the output or plots included in the written report as they revise their analysis. This not only complicates grading, but it also results in increased frustration levels in students who devote time to improving their analysis but lose points for turning in a report that does not contain the desired results.

2. Related Work

While the notion that scientific results should be reproducible is fundamental, the recent interest in reproducible statistical analysis is a modern outgrowth fueled by developments in computing and networking. In particular, computational statistical methods have become more popular as computational power has become cheaper. Similarly, the Internet has eliminated many barriers to information dissemination. Thus, it is now possible to transmit the entirety of a statistical research project to nearly anyone in the world in almost no time, and at almost no cost. With the inclusion of both data and code the possibility exists that another person can entirely duplicate an analyst’s findings with little effort.

Knuth was an early advocate of literate programming, which emphasized the use of detailed comments embedded in code to explain exactly what the code was doing (Knuth 1984). The goal was to tie explanations to instructions so that work could be recreated, better understood, and verified. This idea was a predecessor to the notion of reproducible research. According to Xie (2014), the use of the term reproducible research first appeared in Claerbout (1994). Buckheit and Donoho were early disciples of Claerbout’s ideas, incorporating them into their work with Matlab libraries (Buckheit and Donoho 1995).
They proposed that in a scientific publication that relies on computation, the scholarship is not merely the presentation of the figures, etc. that further the author’s case. Rather, “the actual scholarship is the complete software development environment and complete set of instructions which generated the figures” (Buckheit and Donoho 1995). From here, it is clear that the burden of reproducibility rests on the original author, and that publication of computer code is considered a necessary but not sufficient condition for achieving reproducibility.

Particular advocacy of reproducibility has come from the community surrounding R. Sweave (Leisch 2002) provided a method for integrating executed R code into LaTeX documents. The knitr package by Xie (2014) provides equivalent functionality, but also partners with R Markdown to bring reproducibility and dynamic document generation to those who are not familiar with LaTeX (Gandrud 2013). In many ways knitr can be seen as the realization of the vision for reproducible statistical analysis described by Gentleman and Temple Lang (2004).

The emphasis on reproducibility can be seen as a necessary but not sufficient part of ensuring that students have the capacity to “think with data.” Along these lines, recent efforts in statistics education have advocated for an increased use of computing in the statistics curriculum, both at the undergraduate and graduate levels (Nolan and Temple Lang 2010). Yet while they argue strongly for the need for students to learn programming (and presumably, literate programming) they provide no mechanism for allowing students to express their statistical computations. R Markdown provides exactly such a mechanism, and fits squarely into the statistical computing workflow.

3. R Markdown

R Markdown is an easy-to-use system that enables students to combine statistical computing in an environment of their choosing and written analysis in one document. At a high level, it renders a well-annotated R script into a self-contained HTML file, replete with graphics, commands, and stylized text.

Like LaTeX or HTML, R Markdown relies on a source file and output file paradigm. Text, with simple rules for creating styles, is typed into an R Markdown source file, which has the .Rmd extension. R commands are typed directly into this file, set off in “chunks”. The knitr rendering engine then parses the .Rmd file. It first executes each of the R commands in the chunks and processes the output from those commands. This generates an intermediate Markdown file (with a .md extension) which is of no immediate interest. Next, this Markdown file is rendered into a single HTML file with embedded graphics. For those familiar with LaTeX, Sweave, or PHP, it is very similar to the way that each of these processes one source file into another output file. A comparison of the workflow in rendering applications is given in Table 1.

Source language   Source file format   Rendering engine   Intermediate file format   Output file format
LaTeX             .tex                 pdflatex           .log, .aux                 .pdf, .ps
Sweave            .Rnw                 Sweave, knitr      .tex                       .pdf
PHP               .php                 PHP                -                          .html
R Markdown        .Rmd                 knitr              .md                        .html

Table 1: Comparison of similar rendering applications

The primary benefit of R Markdown is that it restores the logical connection between the statistical computing and the statistical analysis that was broken by the copy-and-paste paradigm. Each chunk of R code is rendered into two parts: first, a box that contains the syntax-highlighted, tidied R code; followed by the output from those commands. In this manner, it is perfectly clear exactly what command has been run, and there is no way to fudge or edit the output from those commands.²
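As a rough illustration of the source-file paradigm described above, the following sketch builds a tiny .Rmd source as a character vector, writes it to a temporary file, and then picks out the R commands inside its single chunk. The file name, the prose, the option values, and the use of the built-in cars data are all assumptions made for this example. The three-backtick chunk delimiter is assembled with strrep() only so that the example is easy to embed in this text.

```r
# Illustrative sketch: file name, prose, and option values are assumptions.
fence <- strrep("`", 3)  # three backticks open and close a chunk

rmd_lines <- c(
  "Stopping distance",
  "=================",
  "",
  "The cars data ship with R. Plain text is written in Markdown,",
  "and R commands are set off in chunks.",
  "",
  paste0(fence, "{r exPlot, fig.width=5, fig.height=4}"),  # named chunk with two options
  "summary(cars$speed)",
  "plot(dist ~ speed, data = cars)",
  fence,
  "",
  "Faster cars need more distance to stop."
)

# Write the source file; a rendering engine would take it from here.
rmd_file <- file.path(tempdir(), "example.Rmd")
writeLines(rmd_lines, rmd_file)

# Locate the chunk: a header line, then the closing fence.
is_header <- startsWith(rmd_lines, paste0(fence, "{r"))
is_close <- rmd_lines == fence
chunk_code <- rmd_lines[(which(is_header)[1] + 1):(which(is_close)[1] - 1)]
print(chunk_code)
```

Passing rmd_file to knitr::knit() should then execute the chunk and write the intermediate Markdown file; that step is left out here so that the sketch runs with base R alone.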
Additional content in the form of text, lists, headers, tables, external images, and web links, etc. can surround the R chunks in a standard way.

One of the major advantages of R Markdown over existing technologies, such as Sweave (Leisch 2002), is that the Markdown syntax is very simple. For example, to make a word show up in boldface, it is surrounded with double asterisks. Compare this to HTML, in which you’d have to put “<b>” before the word and “</b>” after it. Or consider LaTeX, in which you would have to encase the word: \textbf{word}. A side-by-side comparison of the alternatives is shown in Table 2.

HTML          LaTeX           R Markdown
<b>word</b>   \textbf{word}   **word**

Table 2: Comparison of syntax for typesetting “word” in bold face.

The syntax employed by R Markdown does not require learning a separate set of complex rules, as does HTML or LaTeX. To make a bulleted list in R Markdown, a series of lines are prefaced with an asterisk in exactly the manner as in a plain-text email (see Figure 1). Thus, students can learn to use R Markdown without the burden of learning a wholly new technology, such as LaTeX or HTML. The R Markdown syntax is so simple that the majority of it is presented on a short web page (RStudio 2013).

Figure 1: Bulleted list in R Markdown, input (left) and output (right).

R commands and output are distinguished from plain text with the use of chunks. Chunks begin with a series of three backticks, and conclude with three more. Figure 2 illustrates a simple chunk of R code and its rendered output. Note that the chunk in Figure 2 is named (exPlot), and sets two options to non-default values (fig.width, fig.height).

Figure 2: An example of an R Markdown chunk (left) and its rendered output (right).

We should note that the knitr rendering engine is not specific to R or RStudio, a popular open source integrated development environment for R. The following R commands are equivalent to clicking on the “Knit HTML” button in RStudio (note the intermediate generation of a Markdown file):

    library(knitr)     # provides knit()
    library(markdown)  # provides markdownToHTML()
    knit("filename.Rmd")  # creates filename.md
    markdownToHTML("filename.md", "filename.html")
    browseURL("filename.html")

RStudio is available as either a client application or a server (cloud-based) version. The latter setup, implemented at our institutions, allows students to access and run R Markdown and RStudio through a browser, and minimizes startup time.

Moreover, while in this paper we focus on the use of R Markdown in the introductory statistics class, we should also note that, just like R, R Markdown also extends beyond the introductory classroom. Students who are introduced to the concept of reproducibility at this level carry the skills they acquire with them throughout their undergraduate career (and beyond). At the point where the simple formatting of R Markdown becomes limiting to producing high quality customizable reports, students who are familiar with LaTeX can easily transition to Sweave/knitr. In fact, at Duke University, students taking the Statistical Consulting course (STA 470) as one of the last courses in the major curriculum use Sweave/knitr to complete their data analysis assignments, as do students in Mathematical Statistics at Smith College.

² On the claim that the output cannot be edited: a hacker-student could edit the HTML file manually, but good luck trying to edit the figures, which are embedded as base64-encoded strings! That is, instead of the typical configuration wherein images on a web page are stored in separate files, R Markdown converts each image to an equivalent machine-readable string inside the HTML file. This allows each rendered R Markdown document to include images without requiring external files.

4. Using R Markdown in Introductory Statistics

4.1. Duke University

At Duke University in Durham, North Carolina, 272 statistics students used R Markdown during the 2012-2013 academic year (221 enrolled in STA 101 during the Fall and Spring semesters, and 51 enrolled in STA 102 during Spring). Both of these are non-calculus based introductory statistics courses usually taken by first and second year students majoring in either the social sciences or the life sciences, respectively. Only a very small number of the students each semester have any meaningful computational background. Both courses have lecture and lab components, and students used R Markdown to complete their lab assignments as well as data analysis project(s). In STA 101, students complete a simpler project on statistical inference evaluating univariate distributions or bivariate relationships (completed individually) and a more advanced project on multiple regression (completed in teams). In STA 102, the students complete an open-ended data analysis project using the appropriate methods covered in the course (completed individually).

The STA 101 course employs the flipped classroom model as well as team-based learning. Students are assigned to teams by the instructor at the beginning of the semester based on their performance on the pre-test of the ARTIST CAOS (Comprehensive Assessment of Outcomes in a First Statistics course; delMas, Garfield, Ooms, and Chance 2007) and their responses to a survey on their statistics, mathematics, and computer science background as well as their interests and reasons for taking the course. The teams are created to be heterogeneous with respect to statistics experience and homogeneous with respect to student interests. Students work in teams in many components of the course, including the weekly R labs. The final product of the weekly labs is a team lab report, created using R Markdown.
The labs are designed such that students complete the majority of the exercises during the lab sessions led by the teaching assistants. However, finalizing the analysis and the write-up requires spending time outside of class. Reports produced in R Markdown facilitate easy and organized sharing of the code and the write-up among the team members. Prior to integrating R Markdown into the course curriculum, students struggled with sharing their work among team members and with version control. Using R Markdown for the weekly labs allows students to work collaboratively on data analysis throughout the semester, and they reap the benefits of having developed a workflow that has reproducibility at its heart when working on their larger scale individual and team projects.

In addition, reports produced using R Markdown present the code and the output in one place (as input and output), making it easier for students to learn R and locate the cause of an error. Likewise, uniformity of the output and the enforced structure of the reports significantly aid the instructors in debugging issues as they arise as well as simplifying the task of grading (see Appendix C for a sample lab assignment and student solution).

In previous versions of the course, prior to adopting R Markdown, labs and projects still required analyses performed in R. As the students were learning R concurrently with new statistical concepts, they would often struggle to organize their analyses. They took a trial-and-error approach to coding, and made ad-hoc changes as they went through the analysis. Despite our efforts to instill best practices, most students never really developed a habit of separately saving their code. This often resulted in cluttered workspaces and R consoles, difficult-to-diagnose errors due to overwriting data, and hence student frustration.
We believe that the root of the problem was that the desired final product (the lab report, the project write-up, etc.) was just a presentation of results (typed up in a word processor like Microsoft Word or Google Docs) that did not include the underlying code.

On the other hand, comments from students enrolled in recent versions of the course, after adopting R Markdown, suggest that they appreciate the ease of organization of their code:

• “I think the labs have been great. Using R Markdown has been so great because we do not spend as much time solving the format/design of the paper and instead focus on actual problem solving. R is super easy to use and useful.”
• “The labs have been enjoyable, and R Markdown makes the process very easy.”
• “Labs can sometimes be troublesome and confusing, however, the TAs are very helpful. The R Markdown used to generate lab reports/proposals are very helpful for organizing our information.”

Students also commented on the usefulness of templates provided with the labs (see Appendix C). Another notable point was a general sense of excitement and interest around the labs.

• “The labs are fun. There is something satisfying about hitting ‘knit’ and having the text turn into figures and tables.”
• “I like it a lot actually. It has sparked an interest in coding for me.”

4.2. Smith College

At Smith College in Northampton, Massachusetts, 145 statistics students used R Markdown during the 2012-2013 academic year. In the fall semester, 42 students completed MTH 245, an advanced first course in statistics for students with a calculus background. The course is worth five credits and has both lecture and lab components. These students completed most of their lab assignments in R Markdown.
Furthermore, after conducting a statistical investigation involving multiple regression as part of their final project, they submitted a “technical appendix” composed in R Markdown. A total of 33 other students took a second course in statistics, MTH 247, which focused on regression analysis. These students completed all of their homework assignments in R Markdown and, for their final project, submitted both a technical appendix written in R Markdown and a write-up composed in a word-processing application.

Anecdotal success with this pilot program at Smith led to the integration of R Markdown into three sections of the spring semester introductory statistics course. MTH 241 is the four-credit equivalent of MTH 245, which similarly requires calculus but does not have a lab component. As in MTH 245, 70 students completed almost all homework assignments in R Markdown, as well as a technical appendix for their final project. These students were given surveys at the beginning and end of the semester in order to gauge their attitudes toward R and R Markdown. (This project was approved by the Smith College Institutional Review Board.) The results, which we present in detail below, suggest that:

1. Students grew to appreciate R Markdown’s ability to streamline their homework workflow. In particular, students did not prefer to copy-and-paste their work from R into Microsoft Word.
2. While students experienced frustration with both R and R Markdown, this frustration waned over the course of the semester.
3. There was little to no correlation between a student’s attitude towards R Markdown and that student’s performance in the course.
4. Lack of prior exposure to markup languages similar to R Markdown was not an impediment to learning R Markdown.
From the point of view of the instructor, while the introduction of R Markdown brings some overhead and growing pains, these hurdles will be overcome, and the benefits are well worth it. Specifically, the lesson of reproducibility is emphasized throughout the semester, homework is easier to grade, and students receive more comprehensive and specific feedback on their statistical computing than they would using the typical copy-and-paste paradigm.

Survey data

Of the aforementioned 70 students, 56 completed the Likert-scale survey shown in Appendix A at both the beginning of the semester (after some initial exposure to R and R Markdown), and at the end of the semester. A summary of their responses to questions is shown in Table 3 and Figure 3 (Bryer and Speerschneider 2013). The responses in Table 3 are scored on a scale from −2 to 2, where −2 represents strong disagreement with a statement that is favorable to R or R Markdown, and 2 represents strong agreement with that same statement. Note that only about half of the statements on the survey were worded favorably towards R Markdown, so for questions 3, 4, 6, 7, 10, and 11, −2 corresponds to “strongly agree”, whereas for questions 1, 2, 5, 8, and 9, −2 corresponds to “strongly disagree”. Thus, in the interpretation of Table 3, larger positive numbers are always good (from the point of view of R Markdown).

Questions 5, 6, 7, 9, and 10 address R Markdown’s role in the data analysis workflow. For all five questions, the students’ responses were favorable at the end of semester, and grew more favorable over the course of the semester.
Most notably , while students were largely indifferent to R Ma rkdo wn ’s ability to make their homework easier to read and understand at the b eginning of the semester ( mean init ial response to R5 of 0.35), b y t he end of the semester the most studen ts realized this b enefit (mean final resp onse to R5 of 0.84). The improv ement of 0.51 was among the largest c hanges for any of the elev en questions. Note that this question forces the studen ts to consider the p erspective of someone reading their work—it do es not solely address a question in the student’s immediate self-in terest. Before After Change Question Idea N Mean (SD) N Mean (SD) N Mean (SD) B1 prior R 56 1.30 (0.60) 56 1.34 (0.58) 56 0.04 (0.50) B2.CSS prior CSS 56 0.14 (0.35) 56 0.12 (0.33) 56 -0.02 (0.23) B2.HTML prior HTML 56 0.46 (0.50) 56 0.48 (0.50) 56 0.02 (0.45) B2.LaT eX prior L A T E X 56 0.07 (0.26) 56 0.07 (0.26) 56 0.00 (0.19) B2.Wiki prior Wiki 56 0.14 (0.40) 56 0.12 (0.33) 56 -0.02 (0.40) B2.XML prior XML 56 0.00 (0.19) 56 0.04 (0.19) 56 0.04 (0.19) R1 simplicit y 55 -0.30 (0.93) 56 0.24 (1.05) 55 0.53 (1.12) R2 compilation 55 -0.53 (1.07) 56 -0.04 (1.05) 55 0.48 (1.42) R3 RM frustration 56 -0.50 (1.04) 56 -0.10 (1.07) 56 0.40 (1.19) R4 R frustration 55 -0.68 (0.90) 56 -0.21 (1.17) 55 0.45 (1.26) R5 readabilit y 55 0.35 (0.95) 56 0.84 (0.80) 55 0.51 (0.79) R6 cop y-and-paste 51 0.73 (0.94) 55 0.87 (1.06) 50 0.10 (0.99) R7 co ercion 53 0.35 (0.99) 55 0.55 (1.02) 52 0.24 (1.03) R8 improv emen t 56 0.22 (0.83) 56 0.83 (0.75) 56 0.61 (0.94) R9 ease 55 0.08 (0.89) 56 0.33 (0.93) 55 0.25 (1.03) R10 difficult y 55 -0.05 (1.00) 55 0.30 (1.00) 55 0.35 (0.99) R11 training 56 -1.46 (0.79) 56 -1.51 (0.79) 56 -0.04 (0.95) T able 3: Summary of b efore and after resp onses to questionnaire. Resp onses were scored according to the scale: no opinion =N/A, strongly disagree = − 2, disagree = − 1, indifferen t = 0, agree = 1, strongly agree = 2. 
The responses to questions 3, 4, 6, 7, 10, and 11 have been flipped. Thus, higher scores are more favorable to R and R Markdown, and lower scores are less favorable. Note that what is being shown in the third group of columns is the mean change in response, not the change in mean response.

Moreover, while the initial responses to questions 9 and 10 were indistinguishable from zero, by the end of the semester there was mild agreement that R Markdown makes it easier for students to complete their homework. Thus, students acknowledged that R Markdown, in addition to being a benefit to their audience (as demonstrated by question 5), was of mild benefit to them as well.

Questions 6 and 7 address the possibility of alternative workflows. On question 7, students initially expressed a mild lack of resentment at being forced to use R Markdown, and what resentment remained waned over the course of the semester. More interestingly, students were quite opposed to the typical workflow, which would require them to copy-and-paste their results from R into Word. Moreover, there was little change in these responses over the course of the semester. Thus, the results suggest that not only do students prefer R Markdown to Word after having used it all semester long, but that they never preferred to use Word in the first place. This should help to encourage those instructors who are most comfortable in Word to consider making a change. It does not appear that these students were beholden to word processing applications.

Questions 3, 4, and 8 address the issue of frustration with R and R Markdown. Here, it is expected that many students will express frustration with R, which is an admittedly expert-friendly software package.
[Figure 3: stacked bar chart showing, for each of the survey statements R1 through R11 (worded as in Appendix A), the percentage of Before and After responses at each level: Strongly Disagree, Disagree, Indifferent, Agree, Strongly Agree. Horizontal axis: Percentage.]

Figure 3: Results from Likert scale R Markdown survey administered at Smith College, also summarized in Table 3. Responses from students who circled more than one answer were rounded to the more extreme value.

The data suggests that while initial frustration with both R and R Markdown was reasonably high, by the end of the semester it had largely dissipated. In particular, frustration with R Markdown was negligible by the end of the semester, and frustration with R was considerably diminished.
This notion was addressed more directly by question 8, which showed the largest change over the course of the semester (0.61). Here, most students agreed that they were frustrated by R Markdown at first, but had gotten the hang of it by the end of the semester.

Questions 1, 2, and 11 address the students’ experience working with R Markdown. On question 11, students were almost unanimous in their desire to have received a more thorough introduction to the logic and features of R Markdown. Unlike the previous questions, this initial reaction was not moderated over the course of the semester. While it is expected that many students will request additional help in working with new technologies, future versions of the course will include some kind of “workshop” during the first month that eases the adoption of R and R Markdown.

On the other hand, questions 1 and 2 show evidence of student growth. While many students did not find R Markdown to be particularly simple and understandable upon initial exposure, by the end of the semester they mildly supported the claim that R Markdown was simple and understandable. Perhaps more importantly, students showed a marked improvement in their ability to debug R Markdown. At the beginning of the semester, many students did not feel as though they knew how to fix compilation errors in R Markdown, but by the end of the semester, they did not disagree (to a statistically significant extent) with the notion that they could debug their own R Markdown errors.

Anticipated Problems

Two fears that we had did not seem to be supported by the data. First, we feared that since the use of R Markdown was so thoroughly integrated into the course, and so vital for completing the homework (which constituted 20% of the total grade for the course), students who viewed R Markdown more favorably would be advantaged with respect to their overall grade in the course.
Second, we feared that students who had stronger prior exposure to technologies similar to R Markdown would have an easier time completing their assignments. More specifically, we feared that students who had not been exposed to technologies similar to R Markdown would suffer since they might have to spend more time on their homework. Neither of these fears was borne out in the data.

To test these hypotheses, we examined the relationships between survey responses at the beginning and end of the semester, and two measures of performance in the course: the student’s final course grade, and her score on the Comprehensive Assessment of Outcomes in a First Statistics Course (CAOS; delMas et al. 2007) post-test. None of the correlations between the scores on each of the 11 questions and the student’s final course grade were statistically significant at the 5% level (see footnote 3). Only two (R5 and R6) were significant at the 10% level, with both indicating weak positive association with final course grade (0.23 and 0.26, respectively). Correlations between the responses and CAOS scores revealed a similar lack of association (R6 showed a correlation of 0.27, but a 95% confidence interval [−0.004, 0.514] for the true value of this parameter still included 0). Moreover, assessment of the association between the two performance measures and the change in response over the course of the semester revealed no statistically significant associations. Thus, not only were the students’ initial reactions to R Markdown not an important indicator of their performance in the course or their absorption of statistical material, but neither was their change in attitude towards R Markdown over the course of the semester.

Similarly, prior exposure to R Markdown-like technologies did not appear to be associated with student performance.
First of all, only one quarter (14 of the 56) of the students had ever heard of R prior to taking the course, and only two had used it. Only four students had prior exposure to LaTeX, and only nine reported having edited a Wiki. While eight students had seen Cascading Style Sheets (CSS), all eight had prior exposure to Hypertext Markup Language (HTML), along with 18 students who had used HTML but not CSS. Thus, HTML was the only prior technology with which students had reasonably varied experience. While there was no association between prior exposure to HTML and score on the CAOS exam, there was a borderline statistically significant negative correlation between prior exposure to HTML and final course grade (p = 0.051). Due to the marginal significance of this result, its counter-intuitive nature, and the multiple uncorrected tests we performed, we do not interpret it as being of interest.

Footnote 3: We acknowledge that none of the measures of statistical significance reported were corrected for multiple comparisons. However, since the purpose of this analysis is to show that there is little statistical evidence of correlation between attitudes towards R Markdown and performance in the course, and a multiple comparisons correction would only weaken any claims of statistical significance, we do not feel that this omission detracts from our findings.

Ancillary Outcomes

Finally, the end-of-semester responses to questions 5 and 6 deserve a moment’s reflection in their own right. For the most part, students agreed (0.84) that R Markdown made their homework easier to read and understand. [What they were comparing it to, perhaps a handwritten document or one pasted into Word, is left open.] Moreover, they would rather not (0.87) copy-and-paste their homework into Microsoft Word.
We interpret the responses to question 5 as an affirmation of R Markdown’s usefulness for students, and note that this perception grew over the course of the semester. The responses to question 6 confirm that working with R Markdown for a semester, and the occasional frustration that goes along with it, did not make students yearn for a return to Word. While this attitude did not change much over the course of the semester, it reveals the perhaps surprising discovery that even students who had never heard of R a few weeks before would rather not copy-and-paste their statistical results into Word as part of the homework preparation. We interpret these findings as further evidence that open-source tools are perfectly suitable for use in even introductory statistics courses at the undergraduate level.

5. Discussion

Having presented a motivation for using R Markdown in introductory statistics, described the technology, and reviewed our experience using it, we close with a discussion of some additional benefits, challenges, and limitations.

5.1. Challenges and common problems

One of the benefits of using R Markdown in both the introductory and intermediate statistics courses is the development of knowledge within the institution. At Smith, one of five statistics teaching assistants is available for two hours each night from Sunday to Thursday. All of these students are now familiar with R Markdown and capable of helping introductory students with common problems. In good faith, we present some of those issues below.

• Workspace confusion: Many errors result from a failure to understand that each R Markdown file, when compiled, runs in a fresh workspace that does not have access to any of the objects in the existing workspace active in RStudio.
    – Failure to load packages: Students will often forget to load additional packages in their R Markdown scripts (e.g. require(mosaic)).
    – Reading external data files: Students often forget to add the read.csv() call to their R Markdown file after having loaded the data into their workspace by typing the command in the console.
• Improper use of chunks: Students often forget to put their R code into a valid chunk. A useful solution is to tell them to select “Insert Chunk” from the green Chunks menu whenever they want to enter commands.
• Forgetting to close quotes or parentheses or chunks: Syntax highlighting in RStudio mitigates this issue, but it still arises.
• Issues specific to R as opposed to R Markdown: Invalid syntax for commands.
• Debugging: The compilation errors that occur when R Markdown is rendered are not always straightforward to interpret. Thus, students occasionally have a hard time identifying the particular command that is causing the problem. This can be mitigated by encouraging students to name their chunks and to use common process-of-elimination debugging techniques.
• Package versioning: In some cases the packages on a student’s machine may become out-of-date or out-of-sync. Encouraging them to keep all of their packages up-to-date (especially knitr) with the update.packages() command usually provides a solution. Alternatively, encouraging students to use a server version of RStudio (administered by your institution) can be an effective solution.
• Formatting: While R Markdown is capable of implementing basic formatting operations, many more advanced features are not available. Some of the more useful and accessible options are:
    – Gratuitous output: Without the message=FALSE option in an R Markdown chunk, unwanted messages are rendered in the output.
    – Plot size: The size of a rendered plot can be changed by using the fig.width and fig.height chunk options.
    – Chunk naming: Assigning a name to each R chunk is helpful with debugging.
• When all else fails: Restarting RStudio can solve many problems.
  Any package can be safely removed and reinstalled. Occasionally doing this will solve less obvious problems.

5.2. Limitations

While R Markdown is suitable for many purposes, it has a few limitations that may prove problematic. Specifically:

• While objects defined in previous chunks become part of the workspace and are thus available for later use, plots defined in previous chunks cannot be modified by later chunks. The most common work-around for this issue is to create a plot in a single chunk or assign the output of a plot to an object that can be printed in a subsequent chunk.
• There is no easy way to count words or pages in the rendered R Markdown output. This makes it difficult to check to see if a submitted homework assignment meets any such guidelines.

Use of the default formatting options in R Markdown can result in very long documents. If the rendered HTML file that a student wishes to submit is very long, then it can quickly become cumbersome and even expensive to print it out and submit a hard copy. On the other hand, if the document is to be submitted and evaluated electronically, the length of the document may be of no real concern. Thus, while use of even basic non-default formatting options can dramatically reduce the length of rendered R Markdown documents, there is a sense in which moving to electronic submission and grading will mesh well with R Markdown adoption. Indeed, if the grader knows HTML, it is even possible to give inline feedback on a student’s submission. (This process has been implemented at two of our institutions.)

Given the interest in having students collaborate on projects at the introductory level (Halvorsen and Moore 2001), streamlining a collaborative workflow is worthwhile. R Markdown provides such a mechanism in part due to its inherent emphasis on reproducibility.
Students working together are able to follow, and even extend, each other’s work with minimal effort. Nevertheless, a fool-proof solution for having multiple students edit the same R Markdown file simultaneously does not yet exist. The use of an RStudio server, or a third-party file synchronization solution (e.g. Dropbox), can provide a functional workaround. Future versions of RStudio may also include additional features designed to facilitate real-time collaboration projects.

5.3. Additional Thoughts

Another component of reproducibility relates to the version of R and its associated packages, which are often updated. While somewhat beyond the scope of this manuscript, further efforts to facilitate the reproduction of analyses that require specific (older) versions of packages will be needed.

It is worth noting that knitr provides functionality for condensing an R Markdown file into a conventional R script, and vice versa. More generally, those who are comfortable working with R scripts will find it easy to augment those scripts into R Markdown files, which will retain the ability to send successive R commands to the current console.

It would be interesting to assess the extent to which students absorb the importance of reproducibility in this course. Adding an assessment that specifically addresses reproducibility and is presented to students with a set of concrete learning objectives is something that is under consideration and a topic of future work. However, it is not trivial to add material to an already busy introductory statistics curriculum, and doing so therefore requires careful consideration of the existing material and assessments.

On a cautionary note, we remind the reader that due to the multiple uncorrected tests we ran, the claims of statistical significance made in Section 4.2.2 should not be overstated.

6.
Conclusion

The aforementioned Nature editorial (Nature Editorial 2013) concludes with a call to action: “We urge others to take note . . . and do whatever they can to improve research reproducibility” (p. 398). As statistics educators, we are the members of the scientific community that are most well-suited to, and responsible for, addressing this challenge. R Markdown is a new technology that integrates seamlessly into existing computational work done with R within RStudio. With appropriate support mechanisms, introductory statistics students are receptive to its adoption. In our experience at two very different institutions with very different student bodies, R Markdown made a welcomed improvement to the traditional copy-and-paste workflow. Students left the course equipped with functional skills that will help them in any future quantitative endeavor.

7. Acknowledgements

This work was partially supported by Project MOSAIC, US National Science Foundation (DUE-0920350).

References

Allaire, J., Horner, J., Marti, V., and Porte, N. (2013), markdown: Markdown rendering for R, R package version 0.6.3, http://CRAN.R-project.org/package=markdown.

Bryer, J. and Speerschneider, K. (2013), likert: Functions to analyze and visualize likert type items, R package version 1.1, http://CRAN.R-project.org/package=likert.

Buckheit, J. B. and Donoho, D. L. (1995), “Wavelab and reproducible research,” Tech. Rep. 474, Stanford University, http://statweb.stanford.edu/~wavelab/Wavelab_850/wavelab.pdf.

Claerbout, J. (1994), “Hypertext documents about reproducible research,” Tech. rep., Stanford University, http://sepwww.stanford.edu/sep/jon/nrc.html.

Cobb, G. W. (2007), “The Introductory Statistics Course: A Ptolemaic Curriculum?” Technology Innovations in Statistics Education (TISE), 1, http://escholarship.org/uc/item/6hb3k0nz.

delMas, R., Garfield, J., Ooms, A., and Chance, B.
(2007), “Assessing Students’ Conceptual Understanding after a First Course in Statistics,” Statistics Education Research Journal, 6, 28–58, https://apps3.cehd.umn.edu/artist/caos.html.

Fomel, S. and Claerbout, J. F. (2009), “Guest Editors’ Introduction: Reproducible Research,” Computing in Science & Engineering, 11, 5–7.

GAISE College Group (2005), “Guidelines for Assessment and Instruction in Statistics Education,” Tech. rep., American Statistical Association, http://www.amstat.org/education/gaise, accessed August 15, 2013.

Gandrud, C. (2013), Reproducible Research With R and RStudio, Chapman & Hall/CRC.

Gentleman, R. and Temple Lang, D. (2004), “Statistical analyses and reproducible research,” Bioconductor Project Working Papers, Working Paper 2, http://biostats.bepress.com/bioconductor/paper2.

Hall, M. R. and Rowell, G. H. (2008), “Introductory statistics education and the National Science Foundation,” Journal of Statistics Education, 16, http://www.amstat.org/publications/jse/v16n2/rowell1.pdf.

Halvorsen, K. T. and Moore, T. L. (2001), “Motivating, monitoring, and evaluating student projects,” MAA Notes, 27–32.

Ioannidis, J. P. (2013), “This I believe in genetics: discovery can be a nuisance, replication is science, implementation matters,” Frontiers in Genetics, 4.

Johnson, G. (2014), “New truths that only one can see,” The New York Times, http://www.nytimes.com/2014/01/21/science/new-truths-that-only-one-can-see.html.

Knuth, D. E. (1984), “Literate programming,” The Computer Journal, 27, 97–111.

Leisch, F. (2002), “Sweave: Dynamic generation of statistical reports using literate data analysis,” in Compstat, Springer, pp. 575–580.

Nature Editorial (2013), “Announcement: Reducing our irreproducibility,” Nature, 496, http://www.nature.com/news/announcement-reducing-our-irreproducibility-1.12852.

Nolan, D. and Temple Lang, D.
(2010), “Computing in the statistics curricula,” The American Statistician, 64, 97–107.

R Core Team (2013), R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, ISBN 3-900051-07-0, http://www.R-project.org/.

RStudio (2013), Using R Markdown with RStudio, http://www.rstudio.com/ide/docs/authoring/using_markdown.

The Economist Editorial (2013), “Trouble at the lab. (Cover story).” http://www.economist.com/node/21588057/.

Xie, Y. (2014), Dynamic Documents with R and knitr, Chapman & Hall/CRC.

A. R Markdown Survey

This survey is part of an ongoing research study to help improve the use of technology in introductory statistics courses. Responses will be merged with other assessment data from the class to create a de-identified research dataset accessible only to the instructor, and the original forms will be destroyed. Only aggregate data will be included in any reports.

The decision to participate in this study is entirely up to you. You may refuse to take part in the study at any time without affecting your relationship with the investigators of this study, your grade in the course or Smith College. You have the right not to answer any single question. You are under no obligation to complete the survey. Your submission of the completed survey constitutes your consent to use of the data within these constraints.

If you have any further questions about the study, at any time feel free to contact Nicholas Horton at nhorton@smith.edu or by telephone at 413-585-3688. If you like, a summary of the results of the study will be sent to you. If you have any other concerns about your rights as a research participant that have not been answered by the investigators, you may contact Phil Peake, Co-chair of the Smith College Institutional Review Board at (413) 585-3914.

Your Name:

Background

1.
How often had you used R prior to taking this course (circle one)?
    had never heard of it    never    infrequently    a few times    frequently

2. To which of the following markup languages had you been exposed prior to taking this course (circle all that apply)?
    HTML    CSS    XML    LaTeX    Wikipedia (editing)

R Markdown

Please indicate the response that most closely matches your attitude towards each of the following statements.

1. I find the R Markdown syntax to be simple and understandable.
    no opinion    strongly disagree    disagree    indifferent    agree    strongly agree

2. When my Markdown document does not compile, I know how to go about fixing it.
    no opinion    strongly disagree    disagree    indifferent    agree    strongly agree

3. I am frequently frustrated by R Markdown when doing my homework.
    no opinion    strongly disagree    disagree    indifferent    agree    strongly agree

4. I am frequently frustrated by R when doing my homework.
    no opinion    strongly disagree    disagree    indifferent    agree    strongly agree

5. R Markdown makes my homework easier to read and understand.
    no opinion    strongly disagree    disagree    indifferent    agree    strongly agree

6. I would rather copy and paste my results (plots, tables, and numbers) into a word processing program (e.g. Word).
    no opinion    strongly disagree    disagree    indifferent    agree    strongly agree

7. I resent being forced to use R Markdown. It should be my choice how I prepare my homework.
    no opinion    strongly disagree    disagree    indifferent    agree    strongly agree

8. I found R Markdown to be frustrating at first, but now I’ve got the hang of it.
    no opinion    strongly disagree    disagree    indifferent    agree    strongly agree

9. R Markdown makes it easier for me to complete my homework.
    no opinion    strongly disagree    disagree    indifferent    agree    strongly agree

10. R Markdown makes it more difficult for me to complete my homework.
    no opinion    strongly disagree    disagree    indifferent    agree    strongly agree

11.
I wish I had received a more thorough introduction to the logic and features of R Markdown.
    no opinion    strongly disagree    disagree    indifferent    agree    strongly agree

B. Introducing R Markdown in class

A Prezi introducing the features of R Markdown and its use in lab reports can be found at:
http://prezi.com/dvmgx17e_was/reproducible/?utm_campaign=share&utm_medium=copy
Figure 4 provides two sample slides, diagramming the difference between the two workflows.

Figure 4: The traditional workflow, characterized by a separation between the data analysis and the interpretation that are then fused together by copy-and-paste. By contrast, the R Markdown workflow integrates these two components into a single document.

C. Sample assignment and solution

A sample lab assignment and student solution are included below.

Lab 1: Introduction to data

Some define Statistics as the field that focuses on turning information into knowledge. The first step in that process is to summarize and describe the raw information - the data. In this lab, you will gain insight into public health by generating simple graphical and numerical summaries of a data set collected by the Centers for Disease Control and Prevention (CDC). As this is a large data set, along the way you’ll also learn the indispensable skills of data processing and subsetting.

Template for lab report

Before you begin the lab, download the lab report template. This template makes it very simple to include code and output in your write up from within RStudio as well as ensuring reproducibility of your results.

    download.file("http://stat.duke.edu/courses/Summer13/sta104.01-1/labs/lab1.Rmd", destfile = "lab1.Rmd")

Click on the file called lab1.Rmd under the Files tab on the bottom right pane of your RStudio window. Insert your team name, name of the “author” for the week, and the names of the “discussants” (other team members present in lab today).
Use the allotted spaces to enter your responses. For questions that require R code or a plot, space has been provided for you to enter the relevant code.

[This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license (http://creativecommons.org/licenses/by-sa/3.0). This lab was adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel from a lab written by Mark Hansen of UCLA Statistics.]

Getting started

The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey of 350,000 people in the United States. As its name implies, the BRFSS is designed to identify risk factors in the adult population and report emerging health trends. For example, respondents are asked about their diet and weekly physical activity, their HIV/AIDS status, possible tobacco use, and even their level of healthcare coverage. The BRFSS Web site (http://www.cdc.gov/brfss) contains a complete description of the survey, including the research questions that motivate the study and many interesting results derived from the data.

We will focus on a random sample of 20,000 people from the BRFSS survey conducted in 2000. While there are over 200 variables in this data set, we will work with a small subset.

We begin by loading the data set of 20,000 observations into the R workspace. After launching RStudio, enter the following command.

    source("http://www.openintro.org/stat/data/cdc.R")

The data set cdc that shows up in your workspace is a data matrix, with each row representing a case and each column representing a variable. R calls this data format a data frame, which is a term that will be used throughout the labs.

To view the names of the variables, type the command

    names(cdc)

This returns the names genhlth, exerany, hlthplan, smoke100, height, weight, wtdesire, age, and gender. Each one of these variables corresponds to a question that was asked in the survey.
For example, for genhlth, respondents were asked to evaluate their general health, responding either excellent, very good, good, fair or poor. The exerany variable indicates whether the respondent exercised in the past month (1) or did not (0). Likewise, hlthplan indicates whether the respondent had some form of health coverage (1) or did not (0). The smoke100 variable indicates whether the respondent had smoked at least 100 cigarettes in her lifetime. The other variables record the respondent’s height in inches, weight in pounds as well as their desired weight, wtdesire, age in years, and gender.

Exercise 1 How many cases are there in this data set? How many variables? For each variable, identify its data type (e.g. categorical, discrete).

We can have a look at the first few entries (rows) of our data with the command

    head(cdc)

and similarly we can look at the last few by typing

    tail(cdc)

You could also look at all of the data frame at once by typing its name into the console, but that might be unwise here. We know cdc has 20,000 rows, so viewing the entire data set would mean flooding your screen. It’s better to take small peeks at the data with head, tail or the subsetting techniques that you’ll learn in a moment.

Summaries and tables

The BRFSS questionnaire is a massive trove of information. A good first step in any analysis is to distill all of that information into a few summary statistics and graphics. As a simple example, the function summary returns a numerical summary: minimum, first quartile, median, mean, third quartile, and maximum. For weight this is

    summary(cdc$weight)

R also functions like a very fancy calculator. If you wanted to compute the interquartile range for the respondents’ weight, you would look at the output from the summary command above and then enter

    190 - 140

R also has built-in functions to compute summary statistics one by one.
For instance, to calculate the mean, median, and variance of weight, type

    mean(cdc$weight)
    var(cdc$weight)
    median(cdc$weight)

While it makes sense to describe a quantitative variable like weight in terms of these statistics, what about categorical data? We would instead consider the sample frequency or relative frequency distribution. The function table does this for you by counting the number of times each kind of response was given. For example, to see the number of people who have smoked 100 cigarettes in their lifetime, type

    table(cdc$smoke100)

or instead look at the relative frequency distribution by typing

    table(cdc$smoke100) / 20000

Notice how R automatically divides all entries in the table by 20,000 in the command above. This is similar to something we observed in the last lab; when we multiplied or divided a vector with a number, R applied that action across entries in the vectors. As we see above, this also works for tables. Next, we make a bar plot of the entries in the table by putting the table inside the barplot command.

    barplot(table(cdc$smoke100))

Notice what we’ve done here! We’ve computed the table of cdc$smoke100 and then immediately applied the graphical function, barplot. This is an important idea: R commands can be nested. You could also break this into two steps by typing the following:

    smoke <- table(cdc$smoke100)
    barplot(smoke)

Here, we’ve made a new object, a table, called smoke (the contents of which we can see by typing smoke into the console) and then used it as the input for barplot. The special symbol <- performs an assignment, taking the output of one line of code and saving it into an object in your workspace. This is another important idea that we’ll return to later.

Exercise 2 Create a numerical summary for height and age, and compute the interquartile range for each.

Exercise 3 Compute the relative frequency distribution for gender and genhlth.
How many males are in the sample? What proportion of the sample reports being in excellent health?

The table command can be used to tabulate any number of variables that you provide. For example, to examine which participants have smoked across each gender, we could use the following.

    table(cdc$gender, cdc$smoke100)

Here, we see column labels of 0 and 1. Recall that 1 indicates a respondent has smoked at least 100 cigarettes. The rows refer to gender. To create a mosaic plot of this table, we would enter the following command.

    mosaicplot(table(cdc$gender, cdc$smoke100))

We could have accomplished this in two steps by saving the table in one line and applying mosaicplot in the next (see the table/bar plot example above).

Exercise 4 What does the mosaic plot reveal about smoking habits and gender?

Interlude: How R thinks about data

We mentioned that R stores data in data frames, which you might think of as a type of spreadsheet. Each row is a different observation (a different respondent) and each column is a different variable (the first is genhlth, the second exerany and so on). We can see the size of the data frame next to the object name in the workspace or we can type

    dim(cdc)

which will return the number of rows and columns. Now, if we want to access a subset of the full data frame, we can use row-and-column notation. For example, to see the sixth variable of the 567th respondent, use the format

    cdc[567, 6]

which means we want the element of our data set that is in the 567th row (meaning the 567th person or observation) and the 6th column (in this case, weight). We know that weight is the 6th variable because it is the 6th entry in the list of variable names

    names(cdc)

To see the weights for the first 10 respondents we can type

    cdc[1:10, 6]

In this expression, we have asked just for rows in the range 1 through 10. R uses the ":" to create a range of values, so 1:10 expands to 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.
You can see this by entering

    1:10

Finally, if we want all of the data for the first 10 respondents, type

    cdc[1:10, ]

By leaving out an index or a range (we didn't type anything between the comma and the square bracket), we get all the columns. When starting out in R, this is a bit counterintuitive. As a rule, we omit the column number to see all columns in a data frame. Similarly, if we leave out an index or range for the rows, we would access all the observations, not just the 567th, or rows 1 through 10. Try the following to see the weights for all 20,000 respondents fly by on your screen

    cdc[, 6]

Recall that column 6 represents respondents' weight, so the command above reported all of the weights in the data set. An alternative method to access the weight data is by referring to the name. Previously, we typed names(cdc) to see all the variables contained in the cdc data set. We can use any of the variable names to select items in our data set.

    cdc$weight

The dollar-sign tells R to look in data frame cdc for the column called weight. Since that's a single vector, we can subset it with just a single index inside square brackets. We see the weight for the 567th respondent by typing

    cdc$weight[567]

Similarly, for just the first 10 respondents

    cdc$weight[1:10]

The command above returns the same result as the cdc[1:10,6] command. Both row-and-column notation and dollar-sign notation are widely used; which one you choose depends on your personal preference.

A little more on subsetting

It's often useful to extract all individuals (cases) in a data set that have specific characteristics. We accomplish this through conditioning commands. First, consider expressions like

    cdc$gender == "m"

or

    cdc$age > 30

These commands produce a series of TRUE and FALSE values. There is one value for each respondent, where TRUE indicates that the person was male (via the first command) or older than 30 (second command).
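To make the idea of one TRUE or FALSE value per respondent concrete, here is a minimal sketch on a tiny made-up data frame. The object `toy` and its values are invented for illustration and are not part of the cdc data; the use of `sum` and `mean` on logical vectors is an addition that the lab itself does not cover.

```r
# Toy data frame standing in for cdc (all values are invented)
toy <- data.frame(
  gender = c("m", "f", "m", "f", "m"),
  age    = c(25, 41, 33, 19, 58)
)

# Each comparison returns one TRUE/FALSE value per row
is_male <- toy$gender == "m"
over_30 <- toy$age > 30

is_male
over_30

# TRUE counts as 1 and FALSE as 0 in arithmetic, so sum() counts the
# matching rows and mean() gives the proportion of matching rows
n_male      <- sum(is_male)
prop_over30 <- mean(over_30)

n_male
prop_over30
```

Running the same comparisons on cdc would produce vectors of length 20,000 instead of 5, which is why printing them to the console is rarely useful by itself.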
Suppose we want to extract just the data for the men in the sample, or just for those over 30. We can use the R function subset to do that for us. For example, the command

    mdata <- subset(cdc, cdc$gender == "m")

will create a new data set called mdata that contains only the men from the cdc data set. In addition to finding it in your workspace alongside its dimensions, you can take a peek at the first several rows as usual

    head(mdata)

This new data set contains all the same variables but just under half the rows. It is also possible to tell R to keep only specific variables, which is a topic we'll discuss in a future lab. For now, the important thing is that we can carve up the data based on values of one or more variables.

As an aside, you can use several of these conditions together with & and |. The & is read "and" so that

    m_and_over30 <- subset(cdc, cdc$gender == "m" & cdc$age > 30)

will give you the data for men over the age of 30. The | character is read "or" so that

    m_or_over30 <- subset(cdc, cdc$gender == "m" | cdc$age > 30)

will take people who are men or over the age of 30 (why that's an interesting group is hard to say, but right now the mechanics of this are the important thing). In principle, you may use as many "and" and "or" clauses as you like when forming a subset.

Exercise 5 Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise.

Quantitative data

With our subsetting tools in hand, we'll now return to the task of the day: making basic summaries of the BRFSS questionnaire. We've already looked at categorical data such as smoke and gender so now let's turn our attention to quantitative data. Two common ways to visualize quantitative data are with box plots and histograms.
We can construct a box plot for a single variable with the following command.

    boxplot(cdc$height)

You can compare the locations of the components of the box by examining the summary statistics.

    summary(cdc$height)

Confirm that the median and upper and lower quartiles reported in the numerical summary match those in the graph. The purpose of a boxplot is to provide a thumbnail sketch of a variable for the purpose of comparing across several categories. So we can, for example, compare the heights of men and women with

    boxplot(cdc$height ~ cdc$gender)

The notation here is new. The ~ character can be read "versus" or "as a function of". So we're asking R to give us box plots of heights where the groups are defined by gender.

Next let's consider a new variable that doesn't show up directly in this data set: Body Mass Index (BMI). BMI is a weight to height ratio and can be calculated as

$$BMI = \frac{weight~(lb)}{height~(in)^2} \times 703^{\dagger}$$

† 703 is the approximate conversion factor to change units from metric (meters and kilograms) to imperial (inches and pounds).

The following two lines first make a new object called bmi and then create box plots of these values, defining groups by the variable cdc$genhlth.

    bmi <- (cdc$weight / cdc$height^2) * 703
    boxplot(bmi ~ cdc$genhlth)

Notice that the first line above is just some arithmetic, but it's applied to all 20,000 numbers in the cdc data set. That is, for each of the 20,000 participants, we take their weight, divide by their height-squared and then multiply by 703. The result is 20,000 BMI values, one for each respondent. This is one reason why we like R: it lets us perform computations like this using very simple expressions.

Exercise 6 What does this box plot show? Pick another categorical variable from the data set and see how it relates to BMI. List the variable you chose, why you might think it would have a relationship to BMI, and indicate what the figure seems to suggest.
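The element-by-element arithmetic behind the bmi calculation above can be sketched on a few made-up respondents. The data frame `toy`, its values, and the names `toy_bmi` and `loop_bmi` are hypothetical; the explicit loop and the call to `tapply` are additions for comparison only and do not appear in the lab.

```r
# Invented heights (inches), weights (pounds), and health ratings
toy <- data.frame(
  height  = c(70, 64, 60),
  weight  = c(175, 125, 105),
  genhlth = c("good", "excellent", "good")
)

# One arithmetic expression is applied to every row at once
toy_bmi <- (toy$weight / toy$height^2) * 703
round(toy_bmi, 1)

# The same values computed one respondent at a time with an explicit loop
loop_bmi <- numeric(nrow(toy))
for (i in seq_len(nrow(toy))) {
  loop_bmi[i] <- toy$weight[i] / toy$height[i]^2 * 703
}

# Both approaches agree; the vectorized form is simply shorter
all.equal(toy_bmi, loop_bmi)

# Median BMI within each health group, a numerical counterpart to the
# grouped box plot (tapply is base R but is not used in the lab)
tapply(toy_bmi, toy$genhlth, median)
```

The loop is shown only to spell out what the single vectorized line does; for the real data the one-line form is the natural choice.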
Finally, let's make some histograms. We can look at the histogram for the age of our respondents with the command

    hist(cdc$age)

Histograms are generally a very good way to see the shape of a single distribution, but that shape can change depending on how the data is split between the different bins. You can control the number of bins by adding an argument to the command. In the next two lines, we first make a default histogram of bmi and then one with 50 breaks.

    hist(bmi)
    hist(bmi, breaks = 50)

Note that you can flip between plots that you've created by clicking the forward and backward arrows in the lower right region of RStudio, just above the plots. How do these two histograms compare?

Exercise 7 In the last lab, when exploring how the percentage of boys born varies in time (two numerical variables), we used a scatterplot. Using the same tool, the plot function, make a scatterplot of weight versus desired weight. Describe the relationship between these variables.

At this point, we've done a good first pass at analyzing the information in the BRFSS questionnaire. We've found an interesting association between smoking and gender, and we can say something about the relationship between people's assessment of their general health and their own BMI. We've also picked up essential computing tools (summary statistics, subsetting, and plots) that will serve us well throughout this course.

Class survey

In the rest of this lab you will use the data from the Sta 101 classes to investigate relationships between certain types of variables of interest. You can find a list of the variables and corresponding survey questions here.

    download.file("http://stat.duke.edu/courses/Spring13/sta101.001/data/surveyS13.csv",
                  destfile = "survey.csv")
    survey = read.csv("survey.csv")

Exercise 8 Pick a numerical variable, make an appropriate plot to visualize its distribution. Briefly describe the distribution of the variable using appropriate statistics.
Hint: Use R to calculate summary statistics you might want to mention in your description.

Exercise 9 Pick a categorical variable, make an appropriate plot to visualize its distribution. Briefly describe the distribution of the variable using appropriate statistics.

Exercise 10 Pick one numerical and one categorical variable, make an appropriate plot to visualize the relationship between these variables, and briefly describe the apparent relationship.

Exercise 11 Pick two categorical variables, make an appropriate plot to visualize the relationship between these variables, and briefly describe the apparent relationship.

Exercise 12 Pick two numerical variables, make an appropriate plot to visualize the relationship between these variables, and briefly describe the apparent relationship.

Exercise 13 What concepts from the textbook are covered in this lab? What concepts, if any, are not covered in the textbook? Have you seen these concepts elsewhere, e.g. lecture, textbook, previous labs, etc.? Be specific in your answer.

List of R functions

For your convenience, a list of R functions you will commonly use in this class has been posted on the course website under the resources tab (also linked here). If you aren't sure how to do something in R, the first thing to do is to always search the web. But some of the resources you come across might be overwhelming if they're designed for more experienced users. Please don't hesitate to ask your teammates, TAs, and the professor for help.

Lab 1

Name:

Exercises

Load CDC data:

    source("http://www.openintro.org/stat/data/cdc.R")

Exercise 1: There are 20,000 cases and nine variables. Genhlth: ordinal/categorical, Exerany: categorical, Hlthplan: categorical, Smoke100: categorical, Height: discrete numerical, Weight: discrete numerical, Wtdesire: discrete numerical, Age: discrete numerical, Gender: categorical.

Exercise 2: IQR of height: 70 - 64 = 6. IQR of age: 57 - 31 = 26.

    summary(cdc$height)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    ##    48.0    64.0    67.0    67.2    70.0    93.0

    summary(cdc$age)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    ##    18.0    31.0    43.0    45.1    57.0    99.0

Exercise 3: 47.8% of survey participants are male, so 20,000 x .478 = 9560 participants. 23.28% of participants report being in excellent health, which makes 20,000 x .2328 = 4657 participants.

    table(cdc$gender) / 20000
    ##
    ##      m      f
    ## 0.4784 0.5215

    table(cdc$genhlth) / 20000
    ##
    ## excellent very good      good      fair      poor
    ##   0.23285   0.34860   0.28375   0.10095   0.03385

Exercise 4: The mosaic plot reveals that slightly more females than males completed the survey, which we already knew. More importantly, it reveals that a larger percentage of surveyed males have smoked more than 100 cigarettes in their lives than have surveyed females. Thus, males likely have slightly worse smoking habits than females.

Exercise 5: Using nrow, I find that there are 620 rows in this object, so there are 620 respondents that meet this criteria.

    under23_and_smoke <- subset(cdc, cdc$age < "23" & cdc$smoke100 == "1")

Exercise 6: With these boxplots, we can see that the worse that participants declared their health to be in, the higher their median BMIs. I would think that the amount that respondents exercise also relates to the BMI: one would assume that the more someone exercises, the lower their BMI. The box plots I created indeed show that those who had not exercised in the previous month had a slightly higher median BMI than those who had.

    bmi <- (cdc$weight / cdc$height^2) * 703
    boxplot(bmi ~ cdc$exerany, main = "BMI vs. recent exercise")

Exercise 7: It appears as though the majority of respondents' desired weights are slightly below their current weights. As weight increases, desired weight generally increases as well, so the relationship is relatively strong (positive association, linear).
There are a couple of outliers that seem to be inaccurate reportings (i.e., desired weight of 600 and 700 lbs.), and it would be wise to disregard these data.

    plot(cdc$wtdesire ~ cdc$weight, main = "Weight vs. desired weight")

Load survey data:

    download.file("http://stat.duke.edu/courses/Spring13/sta101.001/data/surveyS13.csv",
                  destfile = "survey.csv")
    survey = read.csv("survey.csv")

Exercise 8: The histogram of GPAs seems to be left skewed and unimodal. Since the highest a GPA at Duke University can be is 4.0, the outliers are probably reporting mistakes and should probably not be considered with the actual data. According to the summary, the median GPA is 3.6, and using 1Q and 3Q, we can calculate that the IQR is 3.78 - 3.36 = .42.

    hist(survey$gpa, breaks = 20, main = "GPAs of surveyed students")

Exercise 9: For this categorical variable, the most popular area of residence of surveyed students was the Southern United States (72/208 = 34.6%), followed by the Northeast US (28.3%), Western US (20.1%), Midwest US (9.6%), and finally international (7.2%).

    table(survey$where_from)
    ##
    ## International  US - Midwest US - Northeast  US - South  US - West
    ##            15            20             59          72         42

Exercise 10: With these two variables we can see that a student whose first choice was Duke goes out a median of about 2 nights per week, while the median number of nights out for students whose first choice was not Duke is one night less per week than the "yes" group.

    boxplot(survey$go_out_times ~ survey$duke_first_choice, main = "First choice vs. Nights out per week")

Exercise 11: In general, surveyed students from regions farther away from home seem to be more homesick than students who live closer to Duke. This is evident because a larger percentage of international students, as well as students from the northeastern and western US, reported that they were homesick than did students from the midwest and southern US.
    mosaicplot(table(survey$where_from, survey$homesick), main = "Home location vs. Homesickness")

Exercise 12: This scatterplot shows that there is a positive, linear association between the number of drinks it takes for a student to get drunk and the average number of drinks he/she consumes on a given night. Students who need more drinks to get drunk will generally drink more on average in a given night than students who do not need as much to get drunk, and the relationship is evident but not particularly strong. There are a few data that have higher numerical responses than the majority of the other respondents', but they follow the same general trend.

    plot(survey$drink_amount ~ survey$drinks_to_drunk, main = "Average drinks consumed vs. Drinks it takes to get drunk")

Exercise 13: In this lab, we covered cases and variables, determining whether variables are numerical or categorical, how to calculate the IQR of a data set, and many different types of plots. This includes scatter plots, boxplots, mosaic plots, tables, and histograms, while using either one or two variables (both numerical and categorical) to build these plots. We also learned how to categorize these plots (for example, positive/negative association, strength, linear/nonlinear, skewness) to understand a plot's distribution or the relationship between two given variables. Additionally, we learned how to calculate information with frequency tables. We learned how to calculate a numerical summary, which is a feature of R, but was not presented in the textbook. We walked through this concept during the lecture on Friday but I had not learned about it before then.