Finding Bottlenecks: Predicting Student Attrition with Unsupervised Classifier
With pressure to increase graduation rates and reduce time to degree in higher education, it is important to identify at-risk students early. Automated early warning systems are therefore highly desirable. In this paper, we use unsupervised clustering techniques to predict the graduation status of declared majors in five departments at California State University, Northridge (CSUN), based on a minimal number of lower division courses in each major. In addition, we use the detected clusters to identify hidden bottleneck courses.
💡 Research Summary
The paper investigates the use of unsupervised learning, specifically K‑means clustering, to predict student attrition and to uncover “bottleneck” courses that most strongly separate students who graduate from those who do not. The study focuses on five undergraduate majors at California State University, Northridge (CSUN): Management, Marketing, Business Law, Civil Engineering, and Electrical Engineering. Academic records spanning a ten‑year period (2004‑2014) were collected, yielding 9,088 student instances. For each student, grades in all required lower‑division courses for the declared major were extracted and normalized to a 0‑4.0 grade‑point scale; a required course with no recorded grade was coded as an “F”, so that a missing course weighs the same as a failing grade. Additional metadata such as graduation status, number of semesters in the major, total major credits, and transfer credits were also retained.
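The grade encoding described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the letter-to-point table, the function name `encode_student`, and the course codes are assumptions made for the example. Only the rule that a missing required course counts as an “F” comes from the summary.

```python
# Assumed letter-to-point table on a 0-4.0 scale (illustrative, not from the paper).
GRADE_POINTS = {
    "A": 4.0, "A-": 3.7, "B+": 3.3, "B": 3.0, "B-": 2.7,
    "C+": 2.3, "C": 2.0, "C-": 1.7, "D+": 1.3, "D": 1.0,
    "D-": 0.7, "F": 0.0,
}


def encode_student(grades, required_courses):
    """Return one grade-point value per required course, in a fixed order.

    `grades` maps course code -> letter grade for one student.
    A required course with no recorded grade is coded as "F" (0.0).
    """
    vector = []
    for course in required_courses:
        letter = grades.get(course, "F")  # missing required course -> "F"
        vector.append(GRADE_POINTS[letter])
    return vector


# Hypothetical course codes, used only for illustration.
required = ["COURSE_101", "COURSE_102", "COURSE_201"]
student = {"COURSE_101": "B+", "COURSE_102": "A"}
vector = encode_student(student, required)
```

Keeping the course order fixed means every student is represented by a vector of the same length, which is what a distance-based method such as K‑means requires.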
The methodological pipeline consists of three main steps: (1) preprocessing and encoding of grade data, (2) application of K‑means clustering, and (3) evaluation of the resulting clusters as predictors of graduation. To determine the optimal number of clusters, the authors employed the Calinski‑Harabasz (CH) index within a five‑fold cross‑validation framework. Across all five majors, the CH index peaked at two clusters, which the authors interpreted as “likely to graduate” and “unlikely to graduate” groups. Cluster membership was then used as a binary feature in a simple classifier, and its predictive power was compared against a logistic regression model trained on the full set of course grades.
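To make the cluster-selection step concrete, here is a small, dependency-free sketch of K‑means together with the Calinski‑Harabasz index, defined as (between-cluster dispersion / (k − 1)) divided by (within-cluster dispersion / (n − k)). Several things here are assumptions rather than details from the paper: the deterministic farthest-first initialization, the helper names, and the toy grade vectors. The five-fold cross-validation wrapper is omitted for brevity.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def farthest_first_init(points, k):
    """Assumed initialization: start at points[0], then repeatedly add
    the point farthest from all centers chosen so far."""
    centers = [list(points[0])]
    while len(centers) < k:
        nxt = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(nxt))
    return centers


def kmeans(points, k, max_iter=100):
    """Plain Lloyd iterations; returns (labels, centers)."""
    centers = farthest_first_init(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = centroid(members)
    return labels, centers


def calinski_harabasz(points, labels, centers):
    """CH index; assumes 1 < k < n and non-zero within-cluster dispersion."""
    n, k = len(points), len(centers)
    overall = centroid(points)
    between = sum(
        labels.count(j) * sq_dist(centers[j], overall) for j in range(k)
    )
    within = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return (between / (k - 1)) / (within / (n - k))


def ch_by_k(points, candidate_ks):
    """Map each candidate k to the CH index of its K-means solution."""
    return {k: calinski_harabasz(points, *kmeans(points, k)) for k in candidate_ks}


# Toy grade vectors (two courses per student), invented for illustration.
points = [
    [0.0, 1.0], [1.0, 0.0], [0.7, 1.3], [1.0, 1.0],   # weaker grades
    [4.0, 3.7], [3.7, 4.0], [3.3, 3.7], [4.0, 4.0],   # stronger grades
]
labels, centers = kmeans(points, 2)
scores = ch_by_k(points, [2, 3])
best_k = max(scores, key=scores.get)
```

On real data one would more likely rely on an established library implementation; the hand-written version is used here only so that each quantity in the index is visible.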
Performance was assessed using Receiver Operating Characteristic (ROC) curves, as well as standard classification metrics (accuracy, precision, recall, F1‑score). When the full set of approximately 113 courses was used, logistic regression achieved slightly higher AUC values, but the cluster‑based classifier performed competitively, often within a few percentage points. Remarkably, when the feature set was reduced to only the first three lower‑division courses that students typically take in their first year, both models retained strong predictive ability. This finding suggests that early academic performance contains sufficient signal to forecast eventual graduation outcomes.
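The evaluation metrics named above can be computed directly from their definitions. The sketch below is illustrative only: the function names, labels, and scores are made up, and the AUC is computed as the probability that a randomly chosen graduate receives a higher score than a randomly chosen non-graduate, with ties counted as one half. Note that a purely binary cluster label gives a single operating point on the ROC curve, so the example uses continuous scores of the kind a logistic regression would produce.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison.

    y_true: 1 for graduated, 0 for not graduated.
    scores: higher means "more likely to graduate".
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Invented outcomes and scores for six students.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # threshold chosen arbitrarily

auc = roc_auc(y_true, scores)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```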
A key contribution of the work is the identification of bottleneck courses. By examining the mean grades of each cluster for every course, the authors observed that a small subset of lower‑division courses—particularly those taken in the first year—exhibited the greatest separation between clusters. These courses, therefore, serve as early indicators of student success or failure and are proposed as focal points for intervention. The paper also reports that students in the “non‑graduating” cluster tend to spend an average of four semesters in the major before dropping out, reinforcing the notion that early academic struggles are decisive.
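One simple way to operationalize this comparison is to rank courses by the absolute gap between the two clusters' mean grades. The sketch below assumes exactly two clusters labeled 0 and 1 and uses invented course codes and grades; the summary does not specify the exact separation measure the authors used, so the mean-gap ranking is an assumption.

```python
def rank_bottlenecks(grade_rows, labels, courses):
    """Rank courses by the absolute difference in mean grade between
    cluster 0 and cluster 1 (largest gap first).

    grade_rows: one grade-point vector per student, aligned with `courses`.
    labels: cluster label (0 or 1) per student; both clusters must be non-empty.
    """
    gaps = []
    for idx, course in enumerate(courses):
        g0 = [row[idx] for row, lab in zip(grade_rows, labels) if lab == 0]
        g1 = [row[idx] for row, lab in zip(grade_rows, labels) if lab == 1]
        gap = abs(sum(g1) / len(g1) - sum(g0) / len(g0))
        gaps.append((course, gap))
    return sorted(gaps, key=lambda item: item[1], reverse=True)


# Hypothetical course codes and grades, for illustration only.
courses = ["COURSE_A", "COURSE_B", "COURSE_C"]
grade_rows = [
    [1.0, 3.0, 2.0],  # cluster 0
    [0.0, 3.0, 2.0],  # cluster 0
    [4.0, 3.0, 3.0],  # cluster 1
    [3.0, 4.0, 3.0],  # cluster 1
]
labels = [0, 0, 1, 1]
ranked = rank_bottlenecks(grade_rows, labels, courses)
```

Courses at the top of the ranking are the candidates for bottleneck status; a raw mean gap ignores within-cluster variance, so a standardized difference may be preferable when grade spreads vary widely across courses.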
The discussion acknowledges several limitations. First, the analysis relies solely on grade information; demographic variables, financial aid status, standardized test scores, and other contextual factors are omitted, potentially limiting model generalizability. Second, K‑means assumes spherical clusters and may not capture more complex, non‑linear relationships among students. Third, restricting the solution to two clusters precludes finer‑grained insights such as distinguishing between students who change majors versus those who leave the university entirely. The authors suggest future directions including hierarchical clustering, density‑based methods (e.g., DBSCAN), and collaborative filtering techniques to predict major switches and to generate personalized course recommendations. Incorporating richer student metadata could also improve the robustness of the early‑warning system.
In conclusion, the study demonstrates that unsupervised clustering can effectively predict graduation outcomes using only a minimal set of early‑semester grades, and that the most predictive features correspond to a handful of bottleneck courses. These insights have practical implications for university administrators and advisors: by monitoring performance in identified bottleneck courses and deploying targeted support, institutions may reduce attrition rates and improve overall graduation metrics.