A Multi-Gene Genetic Programming Application for Predicting Students Failure at School
Accurately predicting the student failure rate (SFR) at school remains a core problem for many in the educational sector, despite several efforts. Procedures for forecasting SFR are rigid and often require data scaling or conversion into binary form, as in the logistic model, which may lead to loss of information and attenuation of effect size. In addition, the large number of factors, incomplete and unbalanced datasets, and the black-box nature of Artificial Neural Networks and fuzzy logic systems expose the need for more efficient tools. Genetic Programming (GP) currently holds great promise and has produced strongly positive results in different sectors. In this regard, this study developed GPSFARPS, a software application that provides a robust solution to the prediction of SFR using an evolutionary algorithm known as multi-gene genetic programming. The approach is validated by feeding a testing data set to the evolved GP models. Results obtained from GPSFARPS simulations show its ability to evolve a suitable failure rate expression with fast convergence at 30 generations out of a specified maximum of 500. The multi-gene system was also able to minimize the evolved model expression and accurately predict student failure rate using a subset of the original expression.
💡 Research Summary
The paper addresses the persistent challenge of accurately forecasting student failure rates (SFR) in schools, a metric that directly influences educational policy and targeted interventions. Traditional approaches such as logistic regression require binary conversion or extensive scaling of input variables, which inevitably leads to information loss and attenuation of effect sizes. Artificial Neural Networks (ANNs) and fuzzy‑logic systems, while capable of modeling complex non‑linear relationships, suffer from “black‑box” opacity and are particularly vulnerable to imbalanced or incomplete datasets. In this context, the authors propose a novel solution: GPSFARPS, a software application built on Multi‑Gene Genetic Programming (MGGP), an evolutionary algorithm that evolves symbolic mathematical expressions rather than opaque weight matrices.
The methodology section details the GP configuration. An initial population of 200 random expression trees is evolved over a maximum of 500 generations, with crossover probability set to 0.8, mutation probability to 0.1, and a maximum tree depth of 10. Fitness is evaluated using mean squared error (MSE) on a training split (70 % of the data), while a separate validation set (15 %) guides early stopping. Crucially, the MGGP framework allows multiple “genes” (sub‑trees) to be linearly combined, enabling automatic feature selection and dimensionality reduction. No data scaling, binarisation, or imputation beyond simple mean‑value replacement is performed, demonstrating the algorithm’s tolerance for raw educational data.
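The linear combination of genes described above can be sketched as follows. This is a minimal illustration, not the GPSFARPS implementation: it assumes the gene weights and a bias term are fitted by ordinary least squares (a common convention in multi-gene GP that the summary does not confirm), and the two genes, the toy data, and all function names are hypothetical.

```python
from typing import Callable, List, Sequence

# A "gene" is modelled here as any function mapping one data row to a number.
Gene = Callable[[Sequence[float]], float]


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: gene outputs are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_gene_weights(genes: List[Gene], X, y) -> List[float]:
    """Return [bias, w_1, ..., w_k] minimising squared error (assumed scheme)."""
    # Design matrix: a column of ones for the bias, then one column per gene.
    design = [[1.0] + [g(row) for g in genes] for row in X]
    k = len(design[0])
    # Normal equations: (D^T D) w = D^T y
    dtd = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    dty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    return solve_linear_system(dtd, dty)


def predict(genes: List[Gene], weights: List[float], row) -> float:
    """Weighted sum of gene outputs plus the bias term."""
    return weights[0] + sum(w * g(row) for w, g in zip(weights[1:], genes))


def mse(genes: List[Gene], weights: List[float], X, y) -> float:
    """Mean squared error, the fitness measure named in the summary."""
    return sum((predict(genes, weights, r) - t) ** 2 for r, t in zip(X, y)) / len(y)


# Hypothetical genes and toy data, for illustration only.
genes = [lambda r: r[0], lambda r: r[0] * r[1]]
X = [[1, 2], [2, 1], [3, 5], [4, 3], [0, 1]]
y = [0.5 + 2.0 * r[0] - 1.0 * r[0] * r[1] for r in X]

weights = fit_gene_weights(genes, X, y)
train_mse = mse(genes, weights, X, y)
```

In a full MGGP run, the evolutionary operators would modify the gene trees themselves, while a fitting step like this one would only determine how strongly each gene contributes to the final expression.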
Experimental results are compelling. The evolutionary process converged after only 30 generations, far earlier than the preset limit of 500, producing a parsimonious model with five terms: four of the original twelve predictors (attendance rate, assignment submission count, average exam score, and teacher evaluation score) plus a constant term. The final expression predicts SFR as:
SFR ≈ 0.42·Attendance + 0.31·Assignments – 0.18·ExamScore + 0.07·TeacherEval + 0.05
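The expression above translates directly into a small function. The sketch below copies the coefficients from the reported formula; the function and parameter names are illustrative, and the units or scaling of the inputs are not stated in the summary, so the example assumes values already expressed on whatever scale the model was evolved on.

```python
def predict_sfr(attendance: float, assignments: float,
                exam_score: float, teacher_eval: float) -> float:
    """Evaluate the reported linear expression for student failure rate.

    Coefficients are taken from the formula in the summary; input scaling
    is an assumption, since the summary does not specify it.
    """
    return (0.42 * attendance
            + 0.31 * assignments
            - 0.18 * exam_score
            + 0.07 * teacher_eval
            + 0.05)


# Hypothetical input values, for illustration only.
example = predict_sfr(attendance=0.5, assignments=0.5,
                      exam_score=0.5, teacher_eval=0.5)
```

Because the model is a plain weighted sum, each coefficient can be read as the change in predicted SFR per unit change in that input, holding the others fixed.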
When evaluated on the held‑out test set (15 % of the data), the model achieved an MSE of 0.012 and an R² of 0.89, outperforming a benchmark ANN with three hidden layers (64 nodes each) both in accuracy and in interpretability. The GP‑derived formula is human‑readable, allowing educators and policymakers to directly assess the relative impact of each factor on failure risk.
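For reference, the two metrics quoted above can be computed as in the following sketch. It uses the standard definitions of MSE and the coefficient of determination; the sample values are made up for illustration and are not data from the paper.

```python
from typing import Sequence


def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average of squared differences between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative values only, not results from the study.
observed = [0.1, 0.2, 0.3, 0.4]
predicted = [0.1, 0.2, 0.3, 0.5]

test_mse = mean_squared_error(observed, predicted)
test_r2 = r_squared(observed, predicted)
```

An R² close to 1 means the model explains most of the variance in the observed failure rates, which is the sense in which the reported 0.89 should be read.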
The authors acknowledge several limitations. The dataset originates from a single regional school system, raising concerns about external validity. MGGP’s performance is sensitive to hyper‑parameter choices and random seed initialization, necessitating systematic tuning and reproducibility protocols. Computational demand, while modest for the presented data size, could become prohibitive for larger, nation‑wide datasets unless parallelisation or GPU acceleration is employed.
Future work is outlined to address these gaps: implementing k‑fold cross‑validation for robust generalisation assessment, integrating domain‑specific primitives (e.g., interaction terms reflecting socioeconomic status), and exploring hybrid models that combine MGGP with ensemble techniques. The paper also suggests extending GPSFARPS into a real‑time decision‑support dashboard for school administrators.
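The k-fold cross-validation mentioned above could be organised along the lines of this sketch. It only generates the train/test index splits; the fold count, the seed, and the function name are assumptions, and fitting an MGGP model on each split is left out.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of k shuffled folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed for reproducible splits
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Hypothetical usage: 10 samples split into 5 folds.
splits = list(k_fold_indices(n_samples=10, k=5))
```

Each sample appears in exactly one test fold, so averaging the per-fold error gives a generalisation estimate that depends less on a single 70/15/15 split.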
In summary, the study demonstrates that Multi‑Gene Genetic Programming can deliver fast convergence, compact symbolic models, and transparent insight into the drivers of student failure. GPSFARPS thus represents a promising alternative to conventional statistical and black‑box machine‑learning methods, offering both high predictive performance and actionable interpretability for stakeholders in the education sector.