Evaluating Federated Learning for At-Risk Student Prediction: A Comparative Analysis of Model Complexity and Data Balancing

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

This study proposes and validates a Federated Learning (FL) framework to proactively identify at-risk students while preserving data privacy. Persistently high dropout rates in distance education remain a pressing institutional challenge. Using the large-scale OULAD dataset, we simulate a privacy-centric scenario where models are trained on early academic performance and digital engagement patterns. Our work investigates the practical trade-offs between model complexity (Logistic Regression vs. a Deep Neural Network) and the impact of local data balancing. The resulting federated model achieves strong predictive power (ROC AUC approximately 85%), demonstrating that FL is a practical and scalable solution for early-warning systems that inherently respects student data sovereignty.


💡 Research Summary

This paper investigates the feasibility and practical trade‑offs of applying Federated Learning (FL) to the problem of at‑risk student prediction while preserving data privacy. Using the Open University Learning Analytics Dataset (OULAD), which contains records for over 32 000 students across seven distinct course modules, the authors simulate a realistic federated scenario by treating each module as an independent data‑holding institution. After filtering out withdrawn students and labeling “Fail” outcomes as the positive (at‑risk) class, the final cohort comprises 22 437 learners.
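The cohort construction step can be sketched with pandas. The `final_result` values below match OULAD's actual `studentInfo` schema; the toy frame merely stands in for the real table:

```python
import pandas as pd

# Toy stand-in for OULAD's studentInfo table (the real data has ~32,000 rows).
students = pd.DataFrame({
    "id_student": [1, 2, 3, 4, 5],
    "final_result": ["Pass", "Withdrawn", "Fail", "Distinction", "Fail"],
})

# Drop withdrawn students, then mark "Fail" as the positive (at-risk) class.
cohort = students[students["final_result"] != "Withdrawn"].copy()
cohort["at_risk"] = (cohort["final_result"] == "Fail").astype(int)
```

Applied to the full dataset, the same two lines yield the 22 437‑learner cohort with a binary target.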

Feature engineering follows a three‑tier approach: (1) early academic performance (average score and number of assessments submitted within the first 90 days), (2) engagement volume (total VLE clicks and distinct active days), and (3) engagement quality (click counts per activity type such as quizzes, forums, and content). Missing values are imputed with zero, yielding a dense feature matrix suitable for both centralized and federated experiments.
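The engagement tiers can be derived with standard groupby/pivot operations. In the sketch below, `id_student`, `date`, and `sum_click` follow OULAD's `studentVle` schema, while `activity_type` assumes a prior join with the `vle` table:

```python
import pandas as pd

# Toy stand-in for OULAD's studentVle click log, pre-joined with vle
# to attach an activity type to each row.
clicks = pd.DataFrame({
    "id_student":    [1, 1, 1, 2, 2],
    "date":          [5, 40, 120, 10, 10],   # days since module start
    "sum_click":     [3, 7, 9, 2, 4],
    "activity_type": ["quiz", "forum", "quiz", "forum", "forum"],
})

early = clicks[clicks["date"] < 90]           # restrict to the first 90 days

# Tier 2: engagement volume (total clicks, distinct active days).
features = early.groupby("id_student").agg(
    total_clicks=("sum_click", "sum"),
    active_days=("date", "nunique"),
)
# Tier 3: engagement quality (clicks per activity type, one column per type).
per_type = early.pivot_table(index="id_student", columns="activity_type",
                             values="sum_click", aggfunc="sum", fill_value=0)
features = features.join(per_type).fillna(0)  # zero-impute, per the paper
```

Tier 1 (early academic scores) would come from an analogous aggregation over the assessments table.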

Two model families are evaluated: a linear Logistic Regression (LR) and a modest Deep Neural Network (DNN) consisting of two hidden layers (32 and 16 neurons) with ReLU activations and a sigmoid output. In the federated setting, the standard FedAvg algorithm orchestrates training across 50 communication rounds. For LR, local class imbalance is mitigated by applying SMOTE on each client before local training; the DNN is trained on the raw local data without oversampling. After each round, the global model is evaluated on a held‑out centralized test set (20 % of the data) using ROC‑AUC as the primary metric.
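The heart of each FedAvg communication round is a sample-count-weighted average of the clients' locally trained parameters. A minimal numpy sketch of that aggregation step (parameter vectors and client sizes are illustrative):

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """One FedAvg aggregation step: average each client's parameter
    vector, weighted by its local sample count."""
    total = sum(client_sizes)
    stacked = np.stack(client_weights)        # (n_clients, n_params)
    coeffs = np.array(client_sizes) / total   # (n_clients,)
    return coeffs @ stacked                   # weighted average

# Three hypothetical "module" clients with different data volumes.
w = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
sizes = [100, 300, 600]
global_w = fedavg(w, sizes)   # → array([0.7, 0.9])
```

In the paper's setup this aggregation runs for 50 rounds, with each of the seven module-clients training locally between rounds.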

Four experimental conditions are compared: (i) centralized LR, (ii) centralized DNN, (iii) federated LR with local SMOTE, and (iv) federated DNN. Results show that centralized DNN achieves the highest AUC (0.86), while centralized LR reaches 0.84. The federated DNN attains an AUC of 0.85, only marginally lower than its centralized counterpart, and federated LR with SMOTE records 0.83. Thus, the performance gap introduced by federated training is modest (≈1–2 percentage points), confirming that FL can deliver near‑centralized predictive power while keeping raw student data on local premises.
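The primary metric, ROC‑AUC, equals the probability that a randomly chosen at‑risk student is scored above a randomly chosen non‑at‑risk one. A minimal numpy computation via the Mann‑Whitney U statistic (equivalent to scikit‑learn's `roc_auc_score`):

```python
import numpy as np

def roc_auc(y_true, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs the model
    ranks correctly, counting ties as half a win."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
# Correctly ranked pairs: (0.35, 0.1), (0.8, 0.1), (0.8, 0.4); missed: (0.35, 0.4)
# → AUC = 3/4 = 0.75
```

Under this metric, an AUC of 0.85 means the federated DNN ranks a random at‑risk student above a random passing student 85 % of the time.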

The analysis highlights several key insights. First, model complexity matters: the DNN consistently outperforms LR in both centralized and federated regimes, suggesting that non‑linear representations capture richer patterns in engagement and early performance data. Second, local data balancing is crucial for linear models; SMOTE on each client preserves the minority (at‑risk) class representation and prevents severe recall degradation. Third, despite inherent non‑IID distributions across modules, FedAvg converges reliably, with learning curves showing rapid AUC gains in the first ten rounds followed by gradual stabilization.
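The local balancing step can be illustrated with a minimal SMOTE sketch. The paper presumably uses a library implementation such as imbalanced-learn; this numpy version only shows the core idea of interpolating between a minority sample and one of its nearest minority neighbours:

```python
import numpy as np

def smote_minority(X_min, n_new, k=3, seed=0):
    """Minimal SMOTE sketch: create n_new synthetic minority samples by
    interpolating between a random minority point and a random one of
    its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    # Pairwise distances within the minority class.
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=2)
    nbrs = np.argsort(d, axis=1)[:, 1:k + 1]   # skip self (column 0)
    synth = []
    for _ in range(n_new):
        i = rng.integers(n)
        j = nbrs[i, rng.integers(nbrs.shape[1])]
        lam = rng.random()                      # interpolation factor in [0, 1)
        synth.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synth)

# Four minority points in the unit square; synthesize four more.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_pts = smote_minority(X_min, n_new=4)
```

Because each client runs this locally before training, the synthetic samples never leave the institution, keeping the balancing step compatible with the FL privacy model.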

From a privacy and regulatory perspective, the FL approach aligns with GDPR, LGPD, and similar statutes because raw student records never leave the institutional boundary; only anonymized model updates are transmitted. This reduces the attack surface for data breaches and eliminates the need for costly cross‑institution data sharing agreements.

The authors acknowledge limitations: network latency, communication overhead, and potential security threats (e.g., model inversion attacks) are not modeled; the study also does not explore advanced FL variants such as personalized FL, differential privacy, or asynchronous updates. Future work is proposed to integrate these techniques, to evaluate more sophisticated oversampling strategies compatible with FL, and to test the framework in real‑world multi‑institution deployments.

In conclusion, the study provides a comprehensive empirical validation that federated learning is a viable, privacy‑preserving solution for early‑risk student detection. It demonstrates that, with appropriate model selection and local balancing, federated models can achieve performance comparable to centralized baselines while respecting student data sovereignty, thereby offering a scalable blueprint for institutions seeking to implement AI‑driven early‑warning systems without compromising privacy.

