Data Visualization on Day One: Bringing Big Ideas into Intro Stats Early and Often

In a world awash with data, the ability to think and compute with data has become an important skill for students in many fields. For that reason, inclusion of some level of statistical computing in many introductory-level courses has grown more comm…

Authors: Xiaofei Wang, Cynthia Rush, Nicholas Jon Horton

Xiaofei Wang, Cynthia Rush, and Nicholas Jon Horton

May 25, 2017

Abstract

In a world awash with data, the ability to think and compute with data has become an important skill for students in many fields. For that reason, inclusion of some level of statistical computing in many introductory-level courses has grown more common in recent years. Existing literature has documented multiple success stories of teaching statistics with R, bolstered by the capabilities of R Markdown. In this article, we present an in-class data visualization activity intended to expose students to R and R Markdown during the first week of an introductory statistics class. The activity begins with a brief lecture on exploratory data analysis in R. Students are then placed in small groups tasked with exploring a new dataset to produce three visualizations that describe particular insights that are not immediately obvious from the data. Upon completion, students will have produced a series of univariate and multivariate visualizations on a real dataset and practiced describing them.

Keywords: data science, data visualization, introductory statistics, statistical computing, visualization

1. INTRODUCTION

A number of calls to infuse our statistics curricula with statistical computing to expose students to authentic data experiences have recently been published (Nolan and Lang 2012; Hardin, Hoerl, Horton, Nolan, Baumer, Hall-Holt, Murrell, Peng, Roback, Temple Lang, and Ward 2015; Horton and Hardin 2015).
In the revised 2016 Guidelines for Assessment and Instruction in Statistics Education (GAISE) College Report, instructors are encouraged to “integrate real data with a context and a purpose”, to “use technology to explore concepts and analyze data”, and to “emphasize the multivariate nature of the discipline” (ASA GAISE College working group: R. Carver, Everson, Gabrosek, Rowell, Horton, Lock, Mocko, Rossman, Velleman, Witmer, and Wood 2016). With booming interest in data science, it is now more important than ever to bring statistical software into the classroom early, to enable analysis of real-world data, to expose students to the excitement and potential of statistics, and to provide examples where insights are extracted from data. Chance, Ben-Zvi, Garfield, and Medina (2007) provide a broad overview of how technology benefits learning even at the introductory level, citing various tools ranging from graphing calculators to computing software. With the advent of R (R Core Team 2016) and RStudio (RStudio Team 2016), students now have access to free, open-source software with an established prominence in industry. Powerful software such as R does come with a steep learning curve for some; we believe that using data visualization as the entry point to learning R and introductory statistics could lessen the anxiety associated with both.
To facilitate this initial foray into the world of data visualization using R, we advocate the use of (1) R Markdown (Allaire, Cheng, Xie, McPherson, Chang, Allen, Wickham, Atkins, and Hyndman 2016), a system that is integrated into RStudio, because its workflow inherently encourages well-documented reproducible analysis (Baumer, Çetinkaya-Rundel, Bray, Loi, and Horton 2014) and (2) the mosaic package (Pruim, Kaplan, and Horton 2016b), whose modeling language provides a simplified interface to multivariate descriptive statistics, linear models, and graphical displays.

Much has been written about the nuances of data visualization techniques; see for example Tufte and Graves-Morris (1983); Cleveland (1994) and, more recently, Wickham (2009, 2010). Nolan and Perrett (2016) describe the potential for visualization to inform statistical thinking and suggest ways to incorporate this capacity into statistics courses, noting that computational advancements in recent decades have made it far easier for students to apply advanced graphical tools at the level of introductory statistics.

In this article, we present an in-class introductory multivariate data visualization activity designed for a single class period during the first week of class. The activity begins with a brief, instructor-led introduction to exploratory data analysis in R. Students are then placed in small groups tasked with exploring a new dataset to produce three visualizations that describe particular insights that are not immediately obvious from the data. Upon completion, students will have produced a series of univariate and multivariate visualizations on a real dataset and practiced describing them.
Table 1: Abridged codebook for the ‘HELPrct’ dataset

Variable     Description
sex          ‘male’ or ‘female’
age          subject age at baseline (in years)
racegrp      race/ethnicity: ‘black’, ‘hispanic’, ‘white’, or ‘other’
anysub       use of any substance post-detox: ‘no’ or ‘yes’
cesd         Center for Epidemiologic Studies Depression measure at baseline (high scores indicate more depressive symptoms)
substance    primary substance of abuse: ‘alcohol’, ‘cocaine’, or ‘heroin’
mcs          SF-36 Mental Component Score (measured at baseline, lower scores indicate worse status)
pcs          SF-36 Physical Component Score (measured at baseline, lower scores indicate worse status)

2. THE ACTIVITY

2.1. Overview

The proposed activity begins with a 15-minute tutorial led by the instructor on how to generate basic numeric summaries and visualizations in R. The lecture introduces the basic functionality used by the mosaic R package to simplify the process of generating multivariate graphical displays and summary statistics. The focus of this tutorial is not on the mechanics of these exploratory tools (for example, how we compute the heights of a histogram) but rather on enhancing one’s comprehension of the relationships between variables in a dataset. To that end, a complex, multivariate dataset should be selected to serve as the basis of this tutorial. In our case, we used the baseline data from the Health Evaluation and Linkage to Primary Care Clinical Trial (‘HELPrct’), which enrolled subjects without primary medical care while they were attending a substance-use detoxification unit (Samet, Larson, Horton, Doyle, Winter, and Saitz 2003). Table 1 summarizes a portion of the variables collected at baseline that might be of interest to study.

We recommend starting out with some motivating questions for class discussion:

• What does one row of the dataset represent?
• Who is included in this dataset? Who is not?
• What kinds of variables are included?
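The motivating questions above can be grounded in a quick first look at the data. The following is a minimal sketch rather than part of the original handout; it assumes the mosaicData package is installed, and the helper name `table1_vars` is ours, introduced only for illustration.

```r
library(mosaicData)  # assumed installed; provides the HELPrct dataset

# What does one row represent? Look at the first few subjects.
head(HELPrct)

# Who is included? Count the subjects (rows) and variables (columns).
dim(HELPrct)

# What kinds of variables are included? Inspect the types of the
# variables listed in Table 1 (table1_vars is a hypothetical helper name).
table1_vars <- c("sex", "age", "racegrp", "anysub",
                 "cesd", "substance", "mcs", "pcs")
str(HELPrct[, table1_vars])

# The built-in help page serves as the full codebook.
?HELPrct
```

Starting from `head()` and `str()` keeps the discussion on what the data contain before any plotting syntax is introduced.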
After discussing the answers, the students have a better sense of the scope of the dataset, but questions remain regarding the relationships between the variables. The next step is then to show some simple summary statistics and visualizations that help shed light on more targeted questions:

• What are the proportions of men and women in this dataset?
• What are the proportions of different primary substances of abuse in this dataset?
• What does the relationship look like between depression score and overall mental health?

Table 2: Abridged codebook for the ‘CPS85’ dataset

Variable     Description
wage         wage (US dollars per hour)
educ         number of years of education
race         ‘NW’ (nonwhite) or ‘W’ (white)
age          age in years
sex          ‘F’ (female) or ‘M’ (male)
married      ‘Married’ or ‘Single’
exper        number of years of work experience (inferred from age and education)
union        ‘Union’ or ‘Not Union’
sector       sector of employment: ‘clerical’, ‘const’, ‘manag’, ‘manuf’, ‘other’, ‘prof’, ‘sales’, or ‘service’

We distribute to students a handout (see Appendix A) that contains the code to generate a variety of univariate (histograms and barplots) and bivariate (boxplots and scatterplots) plots. We demonstrate how some of these plots can be drawn in R by typing the corresponding code from the handout into an R Markdown file and compiling the results. We also show how easy it is to load the help files for the relevant functions to look at ways of adding third or fourth grouping variables to make multivariate visualizations. We then ask the students to practice statistical thinking by reflecting on what is learned from each plot and discussing it as a class. Some plots are not particularly insightful (see Figure 1 for example) relative to others (Figure 2), and additional information might be gleaned by including another variable (Figure 3).
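The three targeted questions listed earlier could be approached with commands along the following lines. This is a sketch under the assumption that the mosaic and mosaicData packages are installed; the object names `sex_props` and `substance_props` are ours, and the exact commands used in the authors' tutorial may differ.

```r
library(mosaic)      # formula interface for summaries and plots
library(mosaicData)  # provides the HELPrct dataset

# Proportions of men and women
sex_props <- tally(~ sex, data = HELPrct, format = "proportion")
sex_props

# Proportions of each primary substance of abuse
substance_props <- tally(~ substance, data = HELPrct, format = "proportion")
substance_props

# Depression score (cesd) against overall mental health (mcs)
xyplot(cesd ~ mcs, data = HELPrct)
```

Each command follows the same formula pattern, so moving from a table to a plot mostly means changing the function name.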
Through these demonstrations, we encourage students to think about what interesting questions might be answered given the data and the tools at their disposal.

After the brief introduction to R, we then provide students with a different dataset of similar complexity. For this second dataset, we used data from the 1985 Current Population Survey (‘CPS85’) (Berndt 1991). Table 2 summarizes some of the variables contained in this dataset. Note that getting into RStudio and loading the R Markdown template may take some time when students first use R/RStudio and R Markdown. The students are provided a single R Markdown worksheet (Appendix B provides an example) in which they are asked to produce three meaningful plots of the new dataset to provide insight on some facet of interest, and write a couple of sentences about each plot to discuss what they learned from it. Students should complete this task in groups of about two or three over a period of 40 to 50 minutes. The deliverable is a compiled R Markdown file, in HTML form, that collates the code, plots, and descriptions.

[Figure 1: Box plot of age by sex of subjects]

[Figure 2: Box plot of Mental Component Score by primary substance of abuse]

[Figure 3: Boxplot of Mental Component Score by primary substance of abuse, grouped by sex]

2.2. Requirements

The simplest environment to support this activity is an institutional RStudio server since in this setup, students need only a web browser to run their analyses. RStudio servers are licensed on a per-user basis but are provided free for academic institutions for teaching purposes.
Having an RStudio server that all can access on day one helps to overcome several difficulties: students do not have to download RStudio and R on their own computers, which may present its own sort of problem when installation fails; also, the server can be preloaded with some packages that all can use (Çetinkaya-Rundel and Horton 2016); finally, working from a server eliminates the worry of having insufficient computing power on the students’ own computers. Alternatively, the activity could be done within a computer lab where RStudio and necessary packages have been preloaded.

The R Markdown system enables students (and instructors) to break up R code into short, digestible chunks that can then be annotated. The beauty of this system is that it allows for the easy creation of presentation-worthy reports showcasing the results of analysis; it seamlessly weaves together R code with plain text to produce a single file containing analysis and explanation. In this way, R Markdown provides a compelling argument for getting students used to the process of continuously documenting their work and encourages clear presentation of findings (Baumer et al. 2014). With native support within RStudio, and recently bolstered by a new R Notebook feature that provides automatic previews, R Markdown is straightforward for students to work with even on day one of class. Additionally, compiled R Markdown files can be shared publicly via RPubs, a free platform for showcasing R Markdown output [1].

For graphing, we utilize the mosaic package (Pruim, Kaplan, and Horton 2016a; Pruim et al. 2016b) and take advantage of its simple-to-learn R syntax that helps unite the various R functions used for data exploration.
Conveniently, the syntax for modeling can be described by:

    GOAL(Y ~ X, data = DATASET)

with variations depending on whether there are more (multivariate) or fewer (univariate) variables than the typical outcome and single predictor. Students need to pick a GOAL (e.g. create a scatterplot), specify the variables (X and Y) to study, and name the DATASET containing these variables. As an example, a comparison of mean wages by sex could be generated through the command:

    mean(wage ~ sex, data = CPS85)  # wage "by" sex
    ##    F    M
    ## 7.88 9.99

while side-by-side boxplots could be generated by running (see Figure 4):

    bwplot(sex ~ wage, data = CPS85)

Scatterplots of two quantitative variables can be generated using a similar command (see Figure 5):

    xyplot(wage ~ age, data = CPS85)

Multivariate displays are straightforward to generate. Figure 6 modifies Figure 5 by adding sex as a grouping variable via the ‘group=’ argument. The ‘auto.key=TRUE’ argument asks for a figure legend matching colors to levels of the grouping variable. The ‘type=’ argument optionally allows multiple geometric layers to be displayed on the same plot; popular options are “p” for points and “r” for least squares lines.

    xyplot(wage ~ age, group = sex, type = c("p", "r"), auto.key = TRUE, data = CPS85)

[Figure 4: Box plot of hourly wage by sex]

[Figure 5: Scatterplot of hourly wage by age]

[Figure 6: Scatterplot of hourly wage by age, grouped by sex]

[1] See https://rpubs.com/about/getting-started

If we want to incorporate a grouping variable using faceting as opposed to using multiple colors (such as in Figure 3), we could use a vertical pipe to specify our grouping variable.
Figure 3 was produced using the command:

    bwplot(substance ~ mcs | sex, data = HELPrct)

We also recommend making use of the mosaicData R package (Pruim, Kaplan, and Horton 2015), which contains a number of ready-to-use datasets that would serve well for this activity. Both of our selected datasets are contained in this package.

We have used this lab activity in our classes with an average of 25 students. However, we believe that if course assistants are available to maintain the ratio of one assistant per 15 to 20 students, this activity can scale easily for classes with more students. The purpose of course assistants is to help with hurdles as they arise (missing parentheses, misspelled variables/functions, incorrect capitalization), all of which are commonly experienced by a student working in R for the first time. We detail a number of other common pitfalls in the Discussion.

2.3. Selecting Datasets

It is important to use datasets that are of interest to students (Gould 2010). The datasets we have selected contain a good mix of demographic data (like age, race, and years of education), which students easily comprehend, along with some other aspect that is relevant to them (depression scores and alcohol usage in the ‘HELPrct’ dataset and wages, work experience, and sector in the ‘CPS85’ dataset). In addition to the two datasets presented here, we also recommend the 2013 New York City flight delays dataset (Wickham 2016), the IMDB movie ratings dataset (Wickham 2015), and the 2006 New Haven residential property dataset (Emerson, Green, and Hartigan 2012), accessible in R packages nycflights13, ggplot2movies, and barcode, respectively. We should note that larger datasets, say, with sample size larger than 1000, could be tricky for students to work with due to the time it takes to plot a large number of points and excessive overplotting.
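If an instructor nevertheless wants to use a larger dataset, one possible workaround, which is our suggestion and not part of the activity as described, is to plot a random subsample of rows. The sketch below uses only base R and a simulated data frame (`big`, a hypothetical stand-in for a real large dataset).

```r
set.seed(2017)  # make the random subsample reproducible

# Simulated large dataset standing in for a real one (hypothetical data)
n_big <- 50000
big <- data.frame(x = rnorm(n_big))
big$y <- 2 * big$x + rnorm(n_big)

# Keep a random subset of 1000 rows, sampled without replacement
keep <- sample(nrow(big), size = 1000)
small <- big[keep, ]

# Plotting the subsample is faster and suffers less from overplotting
plot(y ~ x, data = small)
```

Preparing the subsample ahead of time would let students work with `small` directly without needing to understand the sampling step on day one.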
Having about 8 to 15 variables provides a good variety of different possibilities for exploration. Too few variables may yield a smaller chance to identify questions of interest, and too many variables would overwhelm. A good mix of categorical and quantitative variables would provide more opportunities to practice with different kinds of data summaries and visualizations. Having enough rows in the dataset is also important so that, for example, if a student wanted to include explorations of two variables by subgroups of a third, the sample sizes would still be substantial enough to render interesting visualizations.

We also recommend selecting datasets that are already built into R or an R package (or including code to download the data directly into the R Markdown file using the read.csv() function in conjunction with a weblink). The benefit of using a dataset that can be loaded with minimal effort is that we can head straight into exploratory data analysis. An added bonus is that most built-in datasets are described using a help page that provides a codebook and description of the variables. The ability to access and interpret help files is an important skill that will inevitably become useful as students delve into the deeper levels of R computing as time progresses.

3. RESULTS

The authors collected student feedback concerning this activity from introductory statistics students at Amherst College. We received permission to report anonymous student feedback and to show examples of student work from those groups of students who have provided consent (Amherst College Institutional Review Board approval #15-028). Out of 50 total students, 39 completed an electronic consent form affirming that their work may be shared, while the other 11 did not complete the form. Rather than pick a few of the examples to share in this article, we have posted all of them online [2].
The students noted strengths and weaknesses of the activity when queried at the one month mark of the course. The following positive aspects of the activity were excerpted from the survey:

• The activity got us to immediately start working in R in a hands-on way. We started with a brief overview and then got to experience the eccentricities of coding in R. Eventually, I became more comfortable with working with R.
• I got to work with data.
• I got to meet and work with my classmates. I appreciate that the activity was self-guided and that it gave opportunity for teamwork.
• The activity was challenging and gave us an accurate depiction of how the class is run. We were able to ask for help from the instructor and the lab assistant when we got stuck.

[2] http://xiaofei-wang.com/research/vislab/

Below are areas where students noted room for improvement:

• The pace was a bit fast. A bit more direction would improve the experience.
• It would be beneficial to discuss the plots (histograms, bargraphs, boxplots, etc.) and compare and contrast them. They utilize different types of variables; I realized this after forming an incorrect plot.
• You asked us to write descriptions before we really knew how to describe.

In summary, most students acknowledged the importance of R and appreciated the chance to experience it early on with support from the instructor and lab assistant. Moreover, the activity served as an ice breaker between classmates, facilitating subsequent group-based activities. At the same time, not all students appreciated the whirlwind tour of R offered by this activity. Some students craved more guidance and a slower pace. Some students expressed that they felt lost through the exercise. Indeed, the group work began after only fifteen minutes of instruction.
However, we believe it is acceptable for a first-week activity to leave students with questions unanswered and a desire to learn more. We recommend setting the right expectations by conveying to students that the activity is intended to provide a first exposure to R, with mastery to be achieved later in the semester.

At the end of the class period, students were able to code and generate a presentable HTML file containing three plots and summaries using a reproducible analysis framework. Incorporating this experience into the first week of class dispelled the sense that the early classes would simply be a rehash of mean, median, and mode, and provided additional confidence for subsequent interactions with R.

Given the critical feedback, a follow-up class discussion might be inserted after the students have learned more about data visualization through readings and in-class examples. This discussion might begin with a critique of a few lab submissions from the activity. For example, we might find an instance where a student plotted a histogram but called it a bargraph in the accompanying description or vice versa. Upon revisiting the activity, some students will have learned that bargraphs are meant for categorical variables and histograms are meant for quantitative variables; reviewing the mistakes from the lab activity helps reinforce these new concepts. We might also take the opportunity to critique the phrasing of some of the descriptions that were written. A student’s description of a graph might say “this is a histogram that depicts the distribution of x”; by this point the class can agree that a more insightful description would be “surprisingly, we see that x is actually bimodal, with peaks at 3 and 5.” This exercise gives students a second chance to reflect upon their previous work and see how far they have come since their initial foray.

4. DISCUSSION

An important learning outcome in any statistics class is for students to begin to think like a statistician. Specifically, we believe this consists of repeated practice with posing statistical questions and answering them with evidence backed by data. Software makes this an achievable practice, even in the first week of class, if we exploit the students’ curiosity about the world around them. If we provide a dataset about which students have some contextual understanding (even better, misunderstanding), they will naturally pose interesting questions.

In our experience, there are some common pitfalls that students encounter during this activity. When students first download the template R Markdown file, some browsers will force a .txt extension on the file. Students have to change the extension back to .Rmd in order to proceed. One way around this issue is to simply copy and paste the contents that appear as raw text in their browser window into a brand new R Markdown file. Sometimes this act of copying and pasting introduces leading whitespace before some code chunks, which has to be manually deleted in order for the file to compile.

Some students may try a lot of different plots in the RStudio console before picking their favorites to submit. In the process of copying their work from the console to their R Markdown script, they may include the “+” and “>” signs that then break the compilation process.

Some students have difficulty distinguishing code chunks from regular text. In several instances, students learned that the “#” symbol creates a comment in R, but placed these comments outside of a code chunk, in which case the comment gets printed as top-level header font in Markdown. Furthermore, some students do not realize that a button or keyboard shortcut allows them to create code chunks, so instead they manually type in the code chunk headers and footers.
With incorrect syntax, they then run into compiling issues.

All of these pitfalls are part of the learning curve; we expect that students will run into a number of these issues sooner or later when learning R. Experiencing these issues in class gives students immediate assistance when they do arise and helps minimize the friction of learning new software. To make the activity run as smoothly as possible, we highly recommend having one course assistant per 15 to 20 students during this activity.

To reiterate, our proposed activity does not aim to produce experts at data visualization or R coding; rather, it is intended to serve as a pedagogical tool to inspire multivariate thinking. Our goal is to motivate students to ask good, statistical questions and then attempt to answer them with data and a minimal amount of computing. Being able to shed light on those questions, albeit without the rigor of considering significance, within a class period in the first week of class is extremely empowering and helps to whet their appetite for more to come.

The activity is extensible depending on how much time can be allotted for it. On some occasions, we have asked students to share their compiled HTML files on RPubs. This approach is attractive since it allows student findings to be shared with the class as a whole by sharing the appropriate RPubs link. In the instances where we added this step, we found that students took more time to polish their work, taking greater ownership in the final published product. If additional time is available (perhaps in a second class period), some groups of students can present their plots to the rest of their class. This helps to develop communication skills, overcome the fear of speaking to one’s classmates, and share insights, all early in the course.

References

Allaire, J. J., Cheng, J., Xie, Y., McPherson, J., Chang, W., Allen, J., Wickham, H., Atkins, A., and Hyndman, R. (2016), rmarkdown: Dynamic Documents for R, R package version 0.9.5.

ASA GAISE College working group: R. Carver, Everson, M., Gabrosek, J., Rowell, G. H., Horton, N. J., Lock, R., Mocko, M., Rossman, A., Velleman, P., Witmer, J., and Wood, B. (2016), “Guidelines for Assessment and Instruction in Statistics Education: College Report,” http://www.amstat.org/education/gaise.

Baumer, B., Çetinkaya-Rundel, M., Bray, A., Loi, L., and Horton, N. J. (2014), “R Markdown: Integrating A Reproducible Analysis Tool into Introductory Statistics,” Technology Innovations in Statistics Education, 8, http://escholarship.org/uc/item/90b2f5xh.

Berndt, E. R. (1991), The Practice of Econometrics: Classic and Contemporary, Addison-Wesley, Reading, MA.

Çetinkaya-Rundel, M. and Horton, N. J. (2016), “Technology Lowering Barriers: Get Started with R at the Snap of a Finger,” in Electronic Conference on Teaching Statistics, https://www.causeweb.org/cause/ecots/ecots16/breakouts/7.

Chance, B., Ben-Zvi, D., Garfield, J., and Medina, E. (2007), “The Role of Technology in Improving Student Learning of Statistics,” Technology Innovations in Statistics Education, 1, http://escholarship.org/uc/item/8sd2t4rr.

Cleveland, W. S. (1994), The Elements of Graphing Data, Hobart Press, Summit, NJ.

Emerson, J. W., Green, W. A., and Hartigan, J. A. (2012), barcode: Barcode Distribution Plots, R package version 1.1.

Gould, R. (2010), “Statistics and the Modern Student,” International Statistical Review, 78, 297–315.

Hardin, J., Hoerl, R., Horton, N. J., Nolan, D., Baumer, B., Hall-Holt, P., Murrell, P., Peng, R., Roback, P., Temple Lang, D., and Ward, M. D. (2015), “Data Science in Statistics Curricula: Preparing Students to ‘Think with Data’,” The American Statistician, 69, 343–353.

Horton, N. J. and Hardin, J. S. (2015), “Teaching the Next Generation of Statistics Students to ‘Think With Data’: Special Issue on Statistics and the Undergraduate Curriculum,” The American Statistician, 69, 259–265, http://amstat.tandfonline.com/doi/full/10.1080/00031305.2015.1094283.

Nolan, D. and Lang, D. T. (2012), “Computing in the Statistics Curricula,” The American Statistician, 64, 97–107.

Nolan, D. and Perrett, J. (2016), “Teaching and Learning Data Visualization: Ideas and Assignments,” The American Statistician, 70, 260–269.

Pruim, R., Kaplan, D., and Horton, N. J. (2015), mosaicData: Project MOSAIC (mosaic-web.org) Data Sets, R package version 0.13.0.

Pruim, R., Kaplan, D., and Horton, N. J. (2016a), “The mosaic Package: Helping Students to ‘Think with Data’ Using R,” Working paper.

Pruim, R., Kaplan, D., and Horton, N. J. (2016b), “Mosaic: Project MOSAIC Statistics and Mathematics Teaching Utilities,” R Journal, R package version 0.14.4. https://cran.r-project.org/web/packages/mosaic/index.html.

R Core Team (2016), “R: A Language and Environment for Statistical Computing,” Vienna, Austria.

RStudio Team (2016), “RStudio: Integrated Development Environment for R,” Boston, MA.

Samet, J. H., Larson, M. J., Horton, N. J., Doyle, K., Winter, M., and Saitz, R. (2003), “Linking Alcohol- and Drug-Dependent Adults to Primary Medical Care: A Randomized Controlled Trial of a Multi-Disciplinary Health Intervention in a Detoxification Unit,” Addiction, 98, 509–516.

Tufte, E. R. and Graves-Morris, P. (1983), The Visual Display of Quantitative Information, vol. 2, Graphics Press, Cheshire, CT.

Wickham, H. (2009), ggplot2: Elegant Graphics for Data Analysis, Springer Science & Business Media, https://github.com/hadley/ggplot2-book.

Wickham, H. (2010), “A Layered Grammar of Graphics,” Journal of Computational and Graphical Statistics, 19, 3–28.

Wickham, H. (2015), ggplot2movies: Movies Data, R package version 0.0.1.

Wickham, H. (2016), nycflights13: Flights that Departed NYC in 2013, R package version 0.2.0.
Appendix A: Lab Handout

The Template

The template for most functions (from the mosaic package in R) is:

    goal( ~ , data = )

Getting R to Work

Each command you type should be guided by the following 2 questions:

1. What do you want R to do?
2. What must R know to do that?

Exploring the Data

In this course, we’ll work with datasets that have a combination of quantitative and categorical variables. Oftentimes, an important first step (before doing any analysis) is to explore the data. Here are some plots that are frequently used to visually display the data.

Univariate Summaries

    tally(~ sex, data = HELPrct)
    ##
    ## female   male
    ##    107    346

    bargraph(~ sex, data = HELPrct)

[Bar graph of counts by sex (female, male)]

    favstats(~ age, data = HELPrct)
    ##  min Q1 median Q3 max mean   sd   n missing
    ##   19 30     35 40  60 35.7 7.71 453       0

    histogram(~ age, data = HELPrct)

[Histogram of age (x-axis: age; y-axis: Density)]

Bivariate Summaries

Categorical var. vs. categorical var.

    tally(homeless ~ sex, data = HELPrct)
    ##           sex
    ## homeless   female male
    ##   homeless     40  169
    ##   housed       67  177

    bargraph(~ sex, group = homeless, data = HELPrct, auto.key = TRUE)

[Bar graph of counts by sex, grouped by homeless status (homeless, housed)]

Quantitative var. vs. quantitative var.

    cor(i1 ~ age, data = HELPrct)
    ## [1] 0.207

    xyplot(i1 ~ age, data = HELPrct)

[Scatterplot of i1 by age]

Categorical var. vs. quantitative var.

    favstats(age ~ racegrp, data = HELPrct)
    ##    racegrp min   Q1 median   Q3 max mean   sd   n missing
    ## 1    black  20 31.0     35 39.0  60 35.7 7.08 211       0
    ## 2 hispanic  21 28.2     32 36.2  55 33.2 7.99  50       0
    ## 3    other  22 30.0     34 40.5  48 35.0 7.66  26       0
    ## 4    white  19 30.0     36 42.0  58 36.5 8.28 166       0

    bwplot(age ~ racegrp, data = HELPrct)

[Boxplot of age by racegrp (black, hispanic, other, white)]

Helpful Tips

• R is case sensitive: x is not the same thing as X.
• In the console, > means R is ready for a new command, whereas + means R is waiting for you to finish an existing command. Hitting ESC gets you out of the latter scenario if you’re there by accident.
• Not sure what a function like msummary() does? Type the function name preceded by a question mark, like this: ?msummary to get help. Scroll down to Examples and replicate some of these on your own.
• If R throws you an error, read it before you panic. Usually, the error is more interpretable than you think!

Appendix B: Lab Activity

Instructions

Please delete this entire section before you submit your file to RPubs!

In your groups, explore the CPS85 dataset within the mosaicData package to try to find some interesting insights. You may want to type ?CPS85 and head(CPS85) to get a glimpse at what this dataset contains. Next, start exploring the dataset using plots, tables, and other numeric summaries. Select 3 favorite plots and tell a story (in writing) about each of them. Extra brownie points if you can weave the 3 plots together into one cohesive story.

PLOT 1

    # put the code for your plot here

(Include the description for your plot here.)

PLOT 2

    # put the code for your plot here

(Include the description for your plot here.)

PLOT 3

    # put the code for your plot here

(Include the description for your plot here.)
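As an illustration only, the code for one worksheet entry might look something like the sketch below. It is hypothetical and not taken from any student submission: the choice of variables and the object name `wage_by_sector` are ours, and it assumes the mosaic and mosaicData packages are installed.

```r
library(mosaic)      # formula interface: goal(y ~ x, data = ...)
library(mosaicData)  # provides the CPS85 dataset

# Numeric summary: hourly wage within each sector of employment
wage_by_sector <- favstats(wage ~ sector, data = CPS85)
wage_by_sector

# Plot: wage against years of education, one panel per sex (faceting
# with the vertical pipe), with points ("p") and a least squares
# line ("r") drawn in each panel
xyplot(wage ~ educ | sex, data = CPS85, type = c("p", "r"))
```

A written description would then follow the code, stating what the plot reveals and avoiding a mere restatement of the plot type.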
