Expanding the scope of statistical computing: Training statisticians to be software engineers
Traditionally, statistical computing courses have taught the syntax of a particular programming language or specific statistical computation methods. Since the publication of Nolan and Temple Lang (2010), we have seen a greater emphasis on data wrangling, reproducible research, and visualization. This shift better prepares students for careers working with complex datasets and producing analyses for multiple audiences. But, we argue, statisticians are now often called upon to develop statistical software, not just analyses: for example, R packages implementing new analysis methods, or machine learning systems integrated into commercial products. This demands different skills. We describe a graduate course that we developed to meet this need by focusing on four themes: programming practices; software design; important algorithms and data structures; and essential tools and methods. Through code review and revision, and a semester-long software project, students practice all the skills of software engineering. The course allows students to expand their understanding of computing as applied to statistical problems while building expertise in the kind of software development that is increasingly the province of the working statistician. We see this as a model for the future evolution of the computing curriculum in statistics and data science.
💡 Research Summary
The paper argues that the traditional focus of statistical computing courses—teaching the syntax of a specific language or isolated statistical algorithms—no longer suffices for modern statisticians. Since Nolan and Temple Lang’s seminal 2010 paper, curricula have shifted toward data wrangling, reproducible research, and visualization, better preparing students for handling messy, large‑scale data. However, the authors contend that today’s statisticians are increasingly called upon to develop and maintain statistical software products, such as R packages that implement new methods or machine‑learning components embedded in commercial systems. This shift demands a broader set of software‑engineering skills that go beyond one‑off analyses.
To meet this need, the authors describe a graduate‑level course created at Carnegie Mellon University in 2015, now required for both the Master’s in Statistical Practice and the Ph.D. in Statistics & Data Science. The class enrolls roughly 40‑50 students annually, who already possess basic statistical programming experience but come from diverse backgrounds. The course is organized around four thematic pillars:
- Effective Programming Practices – unit testing (e.g., testthat in R), code review, clear naming, and documentation. Students practice peer review through in‑class activities and GitHub pull‑request workflows, learning how systematic testing and review catch bugs and improve maintainability.
- Fundamental Principles of Software Design – modularity, separation of concerns, and clean architecture. A semester‑long “Challenge” project has students design a non‑trivial software product, experience the consequences of good versus poor design, and receive iterative feedback from teaching assistants.
- Important Algorithms, Data Structures, and Representations – exposure to scalable computing tools (SQL, Hadoop, Spark), language‑level performance extensions (Rcpp, Cython), and the algorithmic thinking required to process massive datasets efficiently.
- Essential Tools and Methods – reproducible‑workflow technologies such as knitr, R Markdown, command‑line automation, and continuous‑integration pipelines that support automated testing and deployment.
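The testing habit in the first pillar is language‑agnostic: in R the course points to testthat, and the same one‑expectation‑per‑behavior pattern can be sketched in Python. The `weighted_mean` function below is a hypothetical example invented for illustration, not code from the paper:

```python
import math

def weighted_mean(values, weights):
    """Weighted mean of `values`; a hypothetical function under test."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total

# testthat-style expectations, written as plain pytest-style test functions:
# each test names one behavior and checks it in isolation.
def test_uniform_weights_match_plain_mean():
    assert math.isclose(weighted_mean([1, 2, 3], [1, 1, 1]), 2.0)

def test_length_mismatch_raises():
    try:
        weighted_mean([1, 2], [1])
    except ValueError:
        return  # expected failure mode
    raise AssertionError("length mismatch should raise ValueError")
```

Tests like these, run automatically on every pull request by a continuous‑integration pipeline (the fourth pillar), are what make the peer‑review workflow described above catch regressions early.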
Pedagogically, the course blends lectures, hands‑on labs, code‑review sessions, and the semester‑long project. The authors emphasize that regular, targeted feedback is crucial for mastering complex computational skills. They also present evidence of impact: among 2018 CMU statistics graduates, 25% reported job titles indicating software‑development roles, and industry examples (e.g., Airbnb’s data‑science team) illustrate how statistical models become production‑ready software components.
Overall, the paper positions this course as a prototype for the evolution of statistical and data‑science curricula. By integrating software‑engineering fundamentals with statistical problem solving, the authors argue that future statisticians must be trained not only to analyze data but also to build, test, and maintain robust statistical software that operates at scale in real‑world settings.