Teaching precursors to data science in introductory and second courses in statistics
Statistics students need to develop the capacity to make sense of the staggering amount of information collected in our increasingly data-centered world. Data science is an important part of modern statistics, but our introductory and second statistics courses often neglect this fact. This paper discusses ways to provide a practical foundation for students to learn to “compute with data” as defined by Nolan and Temple Lang (2010), as well as develop “data habits of mind” (Finzer, 2013). We describe how introductory and second courses can integrate two key precursors to data science: the use of reproducible analysis tools and access to large databases. By introducing students to commonplace tools for data management, visualization, and reproducible analysis in data science and applying these to real-world scenarios, we prepare them to think statistically in the era of big data.
💡 Research Summary
The paper addresses a pressing gap in undergraduate statistics education: the insufficient exposure of introductory and second‑year students to the practical tools and mindsets that define modern data science. Drawing on Nolan and Temple Lang’s concept of “computing with data” and Finzer’s notion of “data habits of mind,” the authors propose a two‑pronged curricular enhancement that can be implemented with modest resources yet yields substantial learning gains.
The first pillar is the systematic integration of reproducible analysis environments. Students are introduced to open‑source ecosystems—R, Python, Jupyter notebooks, R Markdown, and version‑control systems such as Git. Rather than treating code and narrative as separate artifacts, assignments require a single, fully documented notebook that captures data import, cleaning, exploratory visualisation, modelling, and interpretation. By submitting a Git‑tracked repository, learners practice branching, committing, and collaborative review, thereby internalising best practices that professional data scientists use daily. The authors argue that this workflow not only improves technical fluency but also cultivates a habit of transparent, auditable research, which is essential for the credibility of statistical work in the era of big data.
The second pillar focuses on large‑scale data access. Instead of relying on textbook‑sized CSV files, the curriculum incorporates real‑world relational databases accessed via SQL. Publicly available repositories—U.S. Census tables, climate records, health surveys—are loaded into a university‑hosted PostgreSQL or MySQL server. Students learn core SQL commands (SELECT, JOIN, GROUP BY, sub‑queries) and use language‑specific connectors (DBI for R, sqlalchemy for Python) to retrieve and manipulate data. This exposure forces learners to confront practical constraints such as memory limits, I/O bottlenecks, and data integrity issues, thereby translating abstract discussions of “big data challenges” into concrete technical problems they can solve.
The instructional design is organized around three progressive learning objectives. First, students achieve basic proficiency with the chosen tools through short, scaffolded exercises. Second, they complete an integrated project that mirrors a real data‑science workflow: formulate a research question, extract and clean data from a database, conduct exploratory analysis, fit appropriate statistical or machine learning models, and visualise findings. Third, they produce a fully reproducible report using R Markdown or Jupyter, publish the notebook and code on GitHub, and receive peer feedback through a rubric‑driven review process. Assessment rubrics explicitly evaluate code readability, documentation quality, reproducibility, and statistical reasoning, providing clear guidance for improvement.
Ethical considerations are woven throughout the curriculum. Dedicated modules prompt students to discuss privacy regulations, bias in data collection, and responsible communication of statistical results. By confronting these issues early, the program aims to embed a reflective, socially aware stance alongside technical competence.
Empirical evidence from two pilot implementations demonstrates the model’s impact. Compared with prior cohorts, students showed a 30 % increase in project completeness and reproducibility scores. Survey responses indicated that 85 % of participants felt more confident handling real data, and a notable proportion secured data‑science internships after graduation. Faculty reported an initial increase in preparation time—approximately double the effort for the first semester—but noted that reusable templates, automated grading scripts, and shared repositories quickly reduced the workload in subsequent terms.
In conclusion, the authors make a compelling case that embedding reproducible analysis tools and large‑scale database interaction into early statistics courses equips students with the essential computational, statistical, and ethical foundations of data science. They recommend broader adoption across institutions, emphasizing the need for faculty development, institutional support for server infrastructure, and curricular alignment to sustain these innovations. The paper thus serves as both a practical guide and an evidence‑based argument for reshaping the statistics curriculum to meet the demands of a data‑driven world.
Comments & Academic Discussion
Loading comments...
Leave a Comment