Greater data science at baccalaureate institutions

Donoho’s JCGS (in press) paper is a spirited call to action for statisticians, who, he points out, are losing ground in the field of data science by refusing to accept that data science is its own domain (or, at least, a domain that is becoming distinctly defined). He draws on writings by John Tukey, Bill Cleveland, and Leo Breiman, among others, to remind us that statisticians have been dealing with data science for years, and he encourages acceptance of the direction of the field while also ensuring that statistics is tightly integrated into it. As faculty at baccalaureate institutions (where the growth of undergraduate statistics programs has been dramatic), we are keen to ensure statistics has a place in data science and data science education. In his paper, Donoho focuses primarily on graduate education. At our undergraduate institutions, we are considering many of the same questions.


💡 Research Summary

The paper “Greater Data Science at Baccalaureate Institutions” builds on David Donoho’s call for a “Greater Data Science” (GDS) framework and examines how liberal‑arts colleges can integrate statistics and data science into a coherent undergraduate curriculum. Donoho argues that statisticians are losing ground because they refuse to recognize data science as a distinct, evolving domain, even though many of its practices have long been part of statistics. The authors, faculty at Smith College and Amherst College, agree with Donoho’s diagnosis and set out to ensure that statistics retains a central role while embracing the broader data‑science perspective.

The authors adopt the six‑component GDS model: (1) data gathering, preparation, and exploration; (2) data representation and transformation; (3) computing with data; (4) data modeling; (5) data visualization and presentation; and (6) “science about data science.” They use the 2014 ASA Undergraduate Curriculum Guidelines and the 2016 GAISE College Report as foundational references, both of which stress real‑world problems, messy data, and complex models. In addition, they incorporate the 2016 ASA‑endorsed Curriculum Guidelines for Undergraduate Programs in Data Science, which provide a more explicit roadmap for integrating new courses and revising existing ones.

At both institutions, the “Introduction to Data Science” course serves as a hub that touches all six GDS elements. The course blends data wrangling, visualization, ethics, SQL, and liberal‑arts modules in which faculty from other disciplines pose authentic research questions that students address using data‑science tools. Subsequent courses such as Multiple Regression, Intermediate Statistics, and Machine Learning extend these foundations: students clean, transform, and visualize data in regression classes; they learn statistical computing (function writing, simulations, GitHub collaboration) within probability and theoretical statistics; and they practice communication through dedicated “Communicating with Data,” “Visual Analytics,” and “Multivariate Data Analysis” courses. Capstone projects synthesize the entire pipeline, emphasizing reproducible research, version control, and interdisciplinary collaboration.

A central theme is the distinction between teaching specific tools (e.g., particular R packages) and cultivating transferable problem‑solving skills. The authors note that packages such as reshape2 and plyr have been superseded, underscoring the need for curricula that focus on “learning how to learn.” Ethical considerations are woven throughout: Amherst embeds ethics in Intermediate Statistics and reinforces it in electives and capstones; Smith’s Introduction to Data Science includes a dedicated ethics module. The paper cites UC Berkeley’s “Behind the Data: Humans and Values” course as a model for framing data‑science work within broader societal values.

Beyond pedagogy, the authors discuss the emerging scholarly identity of data science. They argue that influential works by Hadley Wickham (tidy data, ggplot2, dplyr) and collections edited by Jenny Bryan and Wickham illustrate a body of scholarship that does not fit neatly into traditional statistics or computer‑science categories. This supports the claim that data science is coalescing into its own research field with distinct publication venues, citation practices, and evaluation criteria. The paper also references recent literature on reproducible research, version control, and metadata management as evidence of a growing methodological canon.

Finally, the authors confront the institutional question of where data science should reside. One author belongs to a Mathematics and Statistics department, while the others are in a dedicated Statistical and Data Sciences program. They weigh the merits of departmental integration versus establishing an independent data‑science unit, noting that data‑science curricula require expertise in statistics, computer science, domain knowledge, and ethics. The conclusion calls for strategic planning at baccalaureate institutions: revising curricula, hiring faculty with hybrid skill sets, and creating governance structures that support the long‑term development of a “Greater Data Science” that remains rooted in statistical thinking while embracing its broader, interdisciplinary nature.

