Likely to stop? Predicting Stopout in Massive Open Online Courses

Understanding why students stop out will help in understanding how students learn in MOOCs. In this report, part of a three-unit compendium, we describe how we built accurate predictive models of MOOC student stopout. We document a scalable stopout-prediction methodology, end to end, from raw source data to model analysis. We attempted to predict stopout for the Fall 2012 offering of 6.002x. This involved the meticulous, crowd-sourced engineering of over 25 predictive features extracted for thousands of students, the creation of temporal and non-temporal data representations for use in predictive modeling, the derivation of over 10,000 models with a variety of state-of-the-art machine learning techniques, and the analysis of feature importance by examining over 70,000 models. We found that stopout prediction is a tractable problem. Our models achieved an AUC (receiver operating characteristic area-under-the-curve) as high as 0.95 (and generally 0.88) when predicting one week in advance. Even for more difficult prediction problems, such as predicting stopout at the end of the course with only one week's data, the models attained AUCs of 0.7.


💡 Research Summary

The paper presents a comprehensive methodology for predicting student “stopout” (drop‑out) in massive open online courses (MOOCs), using the Fall 2012 offering of MIT’s 6.002x (Circuits and Electronics) as a case study. The authors begin by ingesting a variety of raw data sources supplied by edX—including click‑stream logs, forum posts, wiki revisions, and MySQL tables containing assignment submissions—into a unified relational schema called MOOCdb. This transformation reduces the original ~70 GB of JSON and MongoDB dumps to a ~7 GB normalized database, enabling fast in‑memory queries and reproducible feature extraction.

A critical design decision is the definition of “stopout.” Rather than using the last interaction of any type (which can be noisy), the authors define stopout as the week in which a learner makes no further assignment submissions. Using this definition, they extract the stopout week for each of the 105,622 learners who ever accessed the course; 52,683 stopped out in week 1 (i.e., never submitted an assignment) and are excluded from subsequent modeling. The remaining cohort provides a clean label for binary classification (stopout = 0, persisted = 1).
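The stopout definition above can be sketched in a few lines. The function name and input format below are illustrative assumptions, not from the paper: each learner is represented by a list of per-week assignment submission counts.

```python
def stopout_week(weekly_submissions):
    """Return a learner's stopout week: the last week in which the
    learner submitted at least one assignment. Learners who never
    submit are assigned week 1 (and were excluded from modeling).

    weekly_submissions -- list of per-week submission counts,
                          index 0 corresponding to course week 1.
    """
    last = 0
    for week, count in enumerate(weekly_submissions, start=1):
        if count > 0:
            last = week
    return max(last, 1)
```

For example, a learner whose submission counts are `[3, 2, 0, 0]` stops out in week 2, while a learner with all-zero counts falls into the week-1 cohort that the authors exclude.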

The prediction problem is formalized with two parameters: lead (how many weeks ahead the model should predict) and lag (how many weeks of historical data are used as features). For example, a lead of 5 and a lag of 3 means that weeks 1‑3 are used to predict whether a learner will stop out in week 8. This formulation yields 91 distinct binary classification tasks across the 14‑week course horizon.
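The lead/lag formulation can be enumerated directly. A minimal sketch, assuming the target week is simply `lag + lead` and every task must fit inside the 14-week horizon (the helper name is hypothetical):

```python
COURSE_WEEKS = 14

def lead_lag_tasks(course_weeks=COURSE_WEEKS):
    """Enumerate all (lead, lag) prediction tasks: features come
    from weeks 1..lag, and the target is the learner's stopout
    status in week lag + lead, which must lie within the course."""
    return [(lead, lag)
            for lag in range(1, course_weeks)
            for lead in range(1, course_weeks - lag + 1)]

tasks = lead_lag_tasks()
```

Under this assumption, a lead of 5 with a lag of 3 targets week 8, and the enumeration yields exactly the 91 distinct classification tasks mentioned above (13 + 12 + ... + 1 = 91).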

Feature engineering is a major contribution. The authors crowd‑source the creation of over 25 predictive features, ultimately selecting 27 sophisticated, interpretable variables that capture multi‑modal student behavior: total video watch time, number of video pauses, number of assignment attempts, assignment correctness rates, forum post and reply counts, wiki edit counts, weekly average grades, grade trends, and derived temporal aggregates (e.g., week‑to‑week change). Both cumulative and delta versions of each metric are included, allowing models to detect sudden drops in engagement.
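The cumulative and delta versions of a per-week metric can be derived mechanically. A minimal sketch (the function name and the convention of a zero baseline before week 1 are assumptions for illustration):

```python
def cumulative_and_delta(weekly_values):
    """Given one per-week metric (e.g. submission counts or video
    watch time), return its cumulative totals and its week-to-week
    deltas, two derived feature series of the kind the authors add
    alongside the raw weekly values."""
    cumulative, deltas = [], []
    total, prev = 0, 0
    for value in weekly_values:
        total += value
        cumulative.append(total)
        deltas.append(value - prev)  # negative = engagement drop
        prev = value
    return cumulative, deltas
```

A sharply negative delta, e.g. the `-5` produced by the weekly series `[4, 6, 1]`, is exactly the kind of sudden engagement drop these derived features let a model detect.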

For modeling, the authors employ a large‑scale, automated pipeline that trains more than 10,000 models using a variety of state‑of‑the‑art machine‑learning algorithms: logistic regression, support vector machines, random forests, gradient‑boosted trees, and deep belief networks. Hyper‑parameter tuning and 5‑fold cross‑validation are applied uniformly. Model performance is evaluated with the area under the ROC curve (AUC). Across most lead‑lag configurations, AUC values range from 0.88 to 0.95, indicating highly discriminative predictions. Even the most challenging scenario—predicting final‑week persistence using only the first week of data—achieves an AUC of 0.71, demonstrating that early signals are sufficiently informative.
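The AUC metric used to score every model has a simple rank-based interpretation: the probability that a randomly chosen learner who persisted is scored above a randomly chosen learner who stopped out. A self-contained sketch of that computation (not the paper's code; in practice a library routine such as scikit-learn's `roc_auc_score` would be used):

```python
def auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney formulation:
    the fraction of (positive, negative) pairs in which the
    positive example receives the higher score; ties count 0.5.

    labels -- 0/1 class labels, scores -- predicted probabilities.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

On this scale, the 0.88-0.95 range reported above means that for almost any persisting/stopping pair of learners, the model ranks them correctly; 0.5 would be chance.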

Feature importance analysis reveals that assignment‑related metrics (submission frequency, correctness, and grade trends) dominate predictive power, while forum and wiki activity provide modest additional signal. Notably, learners who are about to stop out exhibit sharp declines in these core metrics in the weeks preceding dropout, suggesting that abrupt engagement drops can serve as early warning signs.

The paper also discusses practical deployment: an instructor or platform could feed current‑week data into the trained model to obtain real‑time risk scores for each active learner, enabling targeted interventions such as personalized reminders, supplemental resources, or motivational messages. The authors outline a roadmap for extending the methodology to other MOOCs on edX and Coursera, releasing the MOOCdb schema and codebase to the research community, and investigating causal relationships between identified risk factors and actual dropout.
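The intervention loop described above could be as simple as thresholding the model's weekly risk scores. The function name, data shape, and the 0.7 cutoff below are illustrative assumptions, not part of the paper:

```python
def flag_at_risk(risk_scores, threshold=0.7):
    """Return the ids of active learners whose predicted stopout
    probability meets or exceeds the intervention threshold,
    sorted most-at-risk first.

    risk_scores -- dict mapping learner id -> predicted
                   probability of stopout from the trained model.
    """
    flagged = [(p, lid) for lid, p in risk_scores.items() if p >= threshold]
    return [lid for p, lid in sorted(flagged, reverse=True)]
```

An instructor-facing dashboard could recompute these flags each week as new interaction data arrives, triggering reminders or supplemental resources for the flagged cohort.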

In summary, this work demonstrates that stopout prediction in MOOCs is a tractable problem. By meticulously preprocessing massive interaction logs, engineering rich temporal features, and exhaustively exploring modern machine‑learning models, the authors achieve near‑perfect predictive performance in many settings and provide a scalable, reproducible framework that can be adopted for early‑warning systems and for deeper studies of online learning behavior.
