Classification of Cervical Cancer Dataset

Cervical cancer is the leading gynecological malignancy worldwide. This paper presents diverse classification techniques and shows the advantage of feature selection approaches for better prediction of cervical cancer. The dataset contains 858 samples described by 32 attributes, and it suffers from missing values and class imbalance. Therefore, over-sampling, under-sampling, and combined over- and under-sampling have been applied. Furthermore, dimensionality reduction techniques are required to improve classifier accuracy, so feature selection methods are studied in their two distinct categories, filters and wrappers. The results show that age, age at first sexual intercourse, number of pregnancies, smoking, hormonal contraceptive use, and STDs: genital herpes are the main predictive features, achieving an accuracy of 97.5%. The Decision Tree classifier is shown to handle the classification task with excellent performance.


💡 Research Summary

The paper presents a comprehensive machine‑learning study on the publicly available Cervical Cancer (Risk Factors) dataset, which contains 858 patient records described by 32 clinical and behavioral attributes. The authors first address two fundamental data quality issues: missing values and severe class imbalance. Missing entries, especially in lifestyle‑related variables such as smoking, alcohol consumption, and sexually transmitted disease (STD) history, are imputed using a combination of mean/median substitution and K‑Nearest Neighbors (KNN) interpolation. To mitigate the imbalance between cancer‑positive and cancer‑negative cases (approximately a 1:4 ratio), the authors employ a hybrid resampling strategy that couples synthetic oversampling techniques (SMOTE and ADASYN) with random undersampling of the majority class. This preprocessing pipeline is visualised and quantitatively evaluated, demonstrating a more balanced class distribution without distorting the underlying feature space.
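The preprocessing pipeline above can be sketched in a few lines of scikit-learn. This is a minimal illustration on synthetic stand-in data, not the authors' code: KNN imputation follows the paper, but plain random over- and undersampling stand in for SMOTE/ADASYN so the sketch needs only NumPy and scikit-learn.

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)

# Toy stand-in for the 858x32 risk-factor matrix: some entries missing,
# positives outnumbered roughly 1:4 (shapes and values are illustrative).
X = rng.normal(size=(100, 4))
X[rng.random(X.shape) < 0.1] = np.nan     # inject ~10% missing values
y = np.array([1] * 20 + [0] * 80)         # 1:4 positive:negative imbalance

# Step 1: KNN imputation of missing entries, as described in the paper.
X_imp = KNNImputer(n_neighbors=5).fit_transform(X)

# Step 2: hybrid resampling. Random resampling stands in here for the
# paper's SMOTE/ADASYN oversampling plus random undersampling.
pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
target = len(y) // 2
pos_idx = rng.choice(pos, size=target, replace=True)   # oversample minority
neg_idx = rng.choice(neg, size=target, replace=False)  # undersample majority
idx = np.concatenate([pos_idx, neg_idx])
X_bal, y_bal = X_imp[idx], y[idx]

print(np.isnan(X_bal).sum(), np.bincount(y_bal))  # 0 missing, 50/50 classes
```

In practice the `imbalanced-learn` package provides `SMOTE`, `ADASYN`, and `RandomUnderSampler` classes that implement the paper's actual resampling choices.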

Feature selection is tackled in two complementary stages. In the filter stage, statistical relevance is assessed through chi‑square tests, ANOVA F‑values, and Pearson correlation coefficients; variables with p‑values below 0.05 are retained as candidates. The wrapper stage then evaluates these candidates using recursive feature elimination (RFE) and forward selection, each time training a base classifier (primarily a decision tree) and measuring performance via 5‑fold cross‑validation. The wrapper results converge on a compact set of six highly predictive features: age, age at first sexual intercourse, number of pregnancies, smoking status, hormonal contraceptive use, and presence of genital herpes (an STD). Remarkably, models built on this reduced feature set achieve virtually the same accuracy as those using the full 32‑dimensional space.
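The two-stage filter-then-wrapper procedure can be sketched as follows. The data here is synthetic (the real dataset's named features are not modeled), and the fallback when too few features pass the filter is a defensive addition of this sketch, not part of the paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 858-sample, 32-feature dataset.
X, y = make_classification(n_samples=858, n_features=32, n_informative=6,
                           random_state=0)

# Filter stage: keep features whose ANOVA F-test p-value is below 0.05.
filt = SelectKBest(f_classif, k="all").fit(X, y)
candidates = np.where(filt.pvalues_ < 0.05)[0]
if len(candidates) < 6:            # defensive fallback for this sketch only
    candidates = np.arange(X.shape[1])

# Wrapper stage: recursive feature elimination down to six features,
# driven by a decision tree, mirroring the paper's setup.
tree = DecisionTreeClassifier(random_state=0)
rfe = RFE(tree, n_features_to_select=6).fit(X[:, candidates], y)
selected = candidates[rfe.support_]

# 5-fold cross-validation on the reduced six-feature set.
score = cross_val_score(tree, X[:, selected], y, cv=5).mean()
print(len(selected), round(score, 3))
```

The chi-square filter variant mentioned in the paper requires non-negative inputs (`sklearn.feature_selection.chi2`), which is why the ANOVA F-test is used for this sketch's real-valued synthetic features.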

A suite of classifiers is benchmarked on the preprocessed, feature‑selected data: Decision Tree (CART), Random Forest, Support Vector Machine, Logistic Regression, XGBoost, and a shallow Multi‑Layer Perceptron. Hyper‑parameter optimisation is performed via grid search, and each model is evaluated on accuracy, precision, recall, and F1‑score. The Decision Tree emerges as the top performer, delivering an average accuracy of 97.5 %, precision of 96.8 %, recall of 98.2 %, and an F1‑score of 0.978. Its superiority is attributed not only to raw predictive power but also to interpretability: the tree structure directly mirrors the feature‑selection process, allowing clinicians to trace decision paths back to the six key variables. Random Forest and XGBoost achieve comparable scores but at the cost of higher computational complexity and reduced transparency. SVM and Logistic Regression suffer from lower recall due to residual class imbalance, while the MLP overfits the limited dataset, underscoring the importance of model simplicity for this problem size.
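The benchmarking protocol for the winning model can be sketched like this. Only the decision tree is shown; the grid of hyper-parameters and the synthetic six-feature data are illustrative assumptions, not the paper's exact configuration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the reduced six-feature dataset.
X, y = make_classification(n_samples=858, n_features=6, n_informative=6,
                           n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y,
                                          random_state=0)

# Grid search over a small, illustrative hyper-parameter grid,
# with 5-fold cross-validation as in the paper.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [3, 5, None], "criterion": ["gini", "entropy"]},
    scoring="f1", cv=5,
).fit(X_tr, y_tr)

# Report the four metrics the paper uses on the held-out split.
pred = grid.predict(X_te)
for name, fn in [("accuracy", accuracy_score), ("precision", precision_score),
                 ("recall", recall_score), ("f1", f1_score)]:
    print(f"{name}: {fn(y_te, pred):.3f}")
```

The same loop extends naturally to the other benchmarked models (Random Forest, SVM, Logistic Regression, XGBoost, MLP) by swapping the estimator and its parameter grid.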

The authors acknowledge several limitations. The dataset originates from a single source, limiting external validity; the synthetic samples generated by oversampling may not perfectly capture the distribution of real-world patients; and interactions among variables are only implicitly captured by the tree‑based models, leaving potential non‑linear relationships unexplored. Moreover, no deep‑learning architectures (e.g., convolutional or transformer‑based models) are compared, which could further push performance boundaries.

Future work is outlined along three main directions. First, validation on multi‑institutional or prospective cohorts is needed to confirm generalisability. Second, advanced optimisation techniques such as Bayesian hyper‑parameter tuning could refine model performance while preventing overfitting. Third, the authors propose investigating graph‑neural‑network or attention‑based models to explicitly model variable interactions, as well as integrating the classifier into a real‑time clinical decision‑support system with appropriate user‑interface design and privacy safeguards.

In summary, the study demonstrates that careful handling of missing data, class imbalance, and feature selection can enable a simple, interpretable Decision Tree to achieve state‑of‑the‑art accuracy (97.5 %) on cervical cancer risk prediction. The identified six predictors align with established epidemiological risk factors, reinforcing the clinical relevance of the machine‑learning pipeline and providing a solid foundation for future, larger‑scale, and more sophisticated predictive modeling efforts.

