Analysis of factors that affect the students academic performance - Data Mining Approach

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.


💡 Research Summary

This paper investigates the determinants of student academic performance by applying a data‑mining workflow to a well‑known educational dataset. The authors use the “Student Performance” dataset from the UCI Machine Learning Repository, which contains records for 649 Portuguese high‑school students across two schools. Its 33 variables cover demographic information, family background, school‑related factors, study habits, and lifestyle choices, and include three period grades (G1, G2, G3).

Data preprocessing involves handling a negligible number of missing values, converting all categorical attributes to one‑hot encoded vectors, and standardizing continuous features to zero mean and unit variance. The target variable (final grade G3) is binarized into “high performance” (score > 15) and “low/average performance” (score ≤ 15) to reflect a realistic early‑warning scenario for educators.
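
The preprocessing steps above can be sketched in plain Python. This is a minimal illustration, not the authors' code: the three toy records are invented, the helper names (`one_hot`, `standardize`, `binarize_g3`) are hypothetical, and only the column names `school`, `absences`, and `G3` and the threshold of 15 come from the dataset and the summary.

```python
from statistics import mean, pstdev

# Invented toy records; only the column names mirror the UCI dataset.
records = [
    {"school": "GP", "absences": 2, "G3": 17},
    {"school": "MS", "absences": 10, "G3": 11},
    {"school": "GP", "absences": 6, "G3": 15},
]

def one_hot(values):
    """Map each categorical value to a 0/1 vector over the sorted categories."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

def standardize(values):
    """Rescale to zero mean and unit variance (population standard deviation)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # a constant column carries no information
    return [(v - mu) / sigma for v in values]

def binarize_g3(g3, threshold=15):
    """Return 1 for "high performance" (G3 > threshold), else 0."""
    return 1 if g3 > threshold else 0

school_vectors = one_hot([r["school"] for r in records])
absences_z = standardize([r["absences"] for r in records])
labels = [binarize_g3(r["G3"]) for r in records]
```

Note that a grade of exactly 15 falls into the “low/average” class, because the rule uses a strict inequality.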

Six classification algorithms are trained: Logistic Regression, Naïve Bayes, Decision Tree (CART), Random Forest, Gradient Boosting Machine (GBM), and Support Vector Machine (SVM). Hyper‑parameters are tuned via five‑fold cross‑validation, and model quality is assessed using accuracy, precision, recall, F1‑score, and ROC‑AUC. Random Forest achieves the best results (ROC‑AUC ≈ 0.87, F1 ≈ 0.81), followed closely by GBM and SVM, while the linear models perform substantially worse.
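
To make the evaluation protocol concrete, the sketch below computes the reported metrics from scratch and generates five‑fold train/test index splits. It is an illustration under stated assumptions: the labels and scores are invented, predictions are obtained with an assumed 0.5 threshold, and the folds are contiguous and unshuffled because the summary does not say whether the authors shuffled or stratified them. All function names are hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

def roc_auc(y_true, scores):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one sample of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) for k contiguous, unshuffled folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples)
                     if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size

# Invented labels and model scores, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1, 0.8, 0.3]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
folds = list(k_fold_indices(len(y_true), k=5))
```

Unlike the threshold‑dependent metrics, ROC‑AUC is computed from the raw scores, which is why it can rank models even when their default thresholds are poorly calibrated.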

Feature importance is examined through two complementary lenses: the Gini importance from the Random Forest and SHAP (SHapley Additive exPlanations) values, which provide both global and local interpretability. The most influential predictors are the prior grades G1 and G2, confirming that past academic achievement is a strong indicator of future performance. Study time, measured in categorical levels, shows a positive effect; students who report “high” study time achieve on average 12% higher final grades. Absences exert a negative impact, with a sharp decline in performance when missed classes exceed ten. Socio‑economic variables such as family relationship quality and parents’ education level also contribute positively; students whose parents have tertiary education score about 1.5 points higher on average. Lifestyle factors, particularly alcohol consumption and weekend leisure, are negatively correlated with grades.
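
The two importance measures can be illustrated with small from‑scratch building blocks. The sketch below is not the authors' pipeline: `gini_decrease` shows the impurity reduction of a single split (Random Forest Gini importance sums such weighted reductions per feature and averages them over trees), and `shapley_values` computes exact Shapley values by brute force over feature orderings, whereas SHAP tooling typically relies on faster approximations or model‑specific algorithms. The linear `toy_model`, its weights, and the example student and baseline are invented.

```python
from itertools import permutations

def gini_impurity(labels):
    """Gini impurity 1 - sum(p_k^2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_decrease(parent, left, right):
    """Impurity reduction achieved by splitting parent into left and right."""
    n = len(parent)
    return (gini_impurity(parent)
            - len(left) / n * gini_impurity(left)
            - len(right) / n * gini_impurity(right))

def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline).

    Features are switched from their baseline value to their actual value
    one at a time, and marginal changes are averaged over all orderings.
    Cost grows factorially, so this is only feasible for a few features.
    """
    names = list(x)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        previous = model(current)
        for name in order:
            current[name] = x[name]
            updated = model(current)
            totals[name] += updated - previous
            previous = updated
    return {name: total / len(orderings) for name, total in totals.items()}

def toy_model(row):
    """Invented linear scoring rule; the weights are not from the paper."""
    return 0.8 * row["G2"] + 0.5 * row["studytime"] - 0.1 * row["absences"]

student = {"G2": 14, "studytime": 3, "absences": 4}   # invented example
baseline = {"G2": 11, "studytime": 2, "absences": 6}  # invented reference

phi = shapley_values(toy_model, student, baseline)
split_gain = gini_decrease([1, 1, 0, 0], [1, 1], [0, 0])
```

The Shapley values always sum to the difference between the prediction for the student and the prediction for the baseline, which is what makes them usable as a local explanation.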

Interaction effects are visualized using Partial Dependence Plots and two‑dimensional SHAP dependence plots. A notable interaction emerges between study time and absences: high study time boosts performance only when absences are low (≤ 5); once absences surpass ten, the benefit of additional study time diminishes markedly. This finding underscores the importance of attendance policies alongside study‑support initiatives.
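
A partial dependence computation is simple enough to sketch directly: fix one or two features at grid values for every row, then average the model's predictions. The example below is a hedged illustration only. `toy_model` is an invented scoring rule built to mimic the reported interaction (its gains of 1.5, 0.9, and 0.3 per study‑time level are made up), the rows are invented, and the function names are hypothetical.

```python
from statistics import mean

def toy_model(row):
    """Invented rule: each study-time level helps less as absences grow."""
    if row["absences"] <= 5:
        gain = 1.5
    elif row["absences"] <= 10:
        gain = 0.9
    else:
        gain = 0.3
    return 0.5 * row["G2"] + gain * row["studytime"] - 0.1 * row["absences"]

# Invented rows standing in for the dataset.
rows = [
    {"G2": 10, "studytime": 1, "absences": 0},
    {"G2": 12, "studytime": 2, "absences": 4},
    {"G2": 14, "studytime": 3, "absences": 12},
]

def partial_dependence(model, rows, feature, grid):
    """Average prediction when `feature` is forced to each grid value."""
    return [mean([model({**row, feature: value}) for row in rows])
            for value in grid]

def partial_dependence_2d(model, rows, feat_a, grid_a, feat_b, grid_b):
    """Average prediction over a grid of two jointly forced features."""
    return {
        (a, b): mean([model({**row, feat_a: a, feat_b: b}) for row in rows])
        for a in grid_a
        for b in grid_b
    }

pd_study = partial_dependence(toy_model, rows, "studytime", [1, 2, 3, 4])
pd_grid = partial_dependence_2d(
    toy_model, rows, "studytime", [1, 4], "absences", [2, 12]
)

# Benefit of moving from the lowest to the highest study-time level,
# evaluated separately at low and at high absences.
gain_low_absence = pd_grid[(4, 2)] - pd_grid[(1, 2)]
gain_high_absence = pd_grid[(4, 12)] - pd_grid[(1, 12)]
```

Comparing the two gains is the numerical counterpart of reading an interaction off a two‑dimensional plot: if study time and absences did not interact, the gains would be equal.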

The authors translate these analytical insights into actionable recommendations for educational practice. First, they propose an early‑warning system that flags students with high absenteeism or insufficient study time for targeted interventions such as tutoring or mentorship. Second, they suggest augmenting support for families with lower educational attainment through resource provision and parental workshops. Third, they recommend integrating substance‑use prevention programs, given the observed adverse link between alcohol use and grades. Fourth, they advocate for a staged assessment approach that leverages early grades (G1, G2) to identify at‑risk learners and deliver timely remedial instruction.
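
The first and fourth recommendations could be prototyped as a simple rule‑based flag. The sketch below is hypothetical throughout: the summary names the signals (absences, study time, early grades G1 and G2) but not the cut‑offs, so apart from the ten‑absence mark mentioned in the findings, the thresholds, the function name `warning_reasons`, and the example students are all assumptions chosen for illustration.

```python
def warning_reasons(student, max_absences=10, min_studytime=2,
                    min_early_grade=10):
    """Return the list of reasons a student would be flagged, if any.

    All thresholds are illustrative assumptions, not values from the paper.
    """
    reasons = []
    if student["absences"] > max_absences:
        reasons.append("high absenteeism")
    if student["studytime"] < min_studytime:
        reasons.append("low study time")
    if min(student["G1"], student["G2"]) < min_early_grade:
        reasons.append("low early grade")
    return reasons

# Invented example students.
at_risk = {"absences": 12, "studytime": 1, "G1": 9, "G2": 11}
on_track = {"absences": 2, "studytime": 3, "G1": 14, "G2": 15}

at_risk_flags = warning_reasons(at_risk)
on_track_flags = warning_reasons(on_track)
```

Returning the reasons rather than a bare yes/no keeps the flag actionable, since each reason maps to a different intervention such as tutoring or attendance follow‑up.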

The authors acknowledge three limitations: the dataset reflects a specific regional context, which may limit external validity; the feature set is constrained by the original data collection; and the models are static, lacking temporal dynamics that could capture changes in study behavior over a semester. Proposed future work includes multi‑country datasets for broader generalization, time‑series modeling to capture longitudinal patterns, and deep‑learning architectures capable of automatic feature extraction and of handling complex non‑linear relationships.

In sum, the study demonstrates that data‑mining techniques can effectively uncover the multifaceted drivers of student academic success, providing a robust empirical foundation for policy design, resource allocation, and personalized educational interventions aimed at improving overall student outcomes.

