Evaluation and Implementation of Machine Learning Algorithms to Predict Early Detection of Kidney and Heart Disease in Diabetic Patients

Cardiovascular disease (CVD) and chronic kidney disease (CKD) are major complications of diabetes, leading to high morbidity and mortality. Early detection of these conditions is critical, yet traditional diagnostic markers often lack sensitivity in the initial stages. This study integrates conventional statistical methods with machine learning approaches to improve early diagnosis of CKD and CVD in diabetic patients. Descriptive and inferential statistics were computed in SPSS to explore associations between the diseases and clinical or demographic factors. Patients were categorized into four groups: Group A (both CKD and CVD), Group B (CKD only), Group C (CVD only), and Group D (no disease). Statistical analysis revealed significant correlations: serum creatinine and hypertension with CKD, and cholesterol, triglycerides, myocardial infarction, stroke, and hypertension with CVD. These results guided the selection of predictive features for the machine learning models. Logistic Regression, Support Vector Machine, and Random Forest algorithms were implemented, with Random Forest showing the highest accuracy, particularly for CKD prediction. Ensemble models outperformed single classifiers in identifying high-risk diabetic patients. The SPSS results further validated the significance of the key parameters integrated into the models. While challenges such as interpretability and class imbalance remain, this hybrid statistical and machine learning framework offers a promising advance toward early detection and risk stratification of diabetic complications compared with conventional diagnostic approaches.

💡 Research Summary

The paper addresses the pressing clinical challenge of early detection of chronic kidney disease (CKD) and cardiovascular disease (CVD) among patients with diabetes, conditions that together account for a substantial proportion of diabetes‑related morbidity and mortality. The authors adopt a hybrid approach that combines conventional statistical analysis with modern machine learning (ML) techniques to build predictive models that can stratify diabetic patients according to their risk of developing CKD, CVD, or both.

Data were collected from a single medical center and comprised 1,200 diabetic individuals. Based on the presence or absence of CKD and CVD, participants were divided into four groups: (A) both CKD and CVD, (B) CKD only, (C) CVD only, and (D) neither condition. Initial exploratory analysis was performed in SPSS, where descriptive statistics, chi‑square tests, t‑tests, and ANOVA identified significant associations between clinical variables and disease status. Serum creatinine and hypertension emerged as strong predictors of CKD, while total cholesterol, triglycerides, prior myocardial infarction, prior stroke, and hypertension were significantly linked to CVD. Multivariate logistic regression confirmed the independence of these factors, and variance inflation factor (VIF) analysis was used to control multicollinearity. Twelve variables—age, sex, blood pressure, fasting glucose, HbA1c, serum creatinine, estimated glomerular filtration rate (eGFR), cholesterol, triglycerides, history of myocardial infarction, history of stroke, and smoking status—were selected as the final feature set for ML modeling.
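The four-group scheme described above can be sketched in a few lines of Python. This is only an illustration: the function name `assign_group`, the record layout, and the five patient records are hypothetical and do not come from the paper's dataset.

```python
from collections import Counter


def assign_group(has_ckd: bool, has_cvd: bool) -> str:
    """Map the two diagnosis flags to the study's group labels A-D."""
    if has_ckd and has_cvd:
        return "A"  # both CKD and CVD
    if has_ckd:
        return "B"  # CKD only
    if has_cvd:
        return "C"  # CVD only
    return "D"      # neither condition


# Hypothetical patient records, invented purely for illustration.
patients = [
    {"id": 1, "ckd": True, "cvd": True},
    {"id": 2, "ckd": True, "cvd": False},
    {"id": 3, "ckd": False, "cvd": True},
    {"id": 4, "ckd": False, "cvd": False},
    {"id": 5, "ckd": False, "cvd": False},
]

groups = [assign_group(p["ckd"], p["cvd"]) for p in patients]

# Count patients per group and convert the counts to shares of the sample;
# inspecting these shares is how a class imbalance would first become visible.
group_counts = Counter(groups)
group_shares = {g: n / len(patients) for g, n in group_counts.items()}
print(group_shares)
```

Checking the group shares before modeling is a cheap way to spot a small minority class, such as a rare combined CKD and CVD group, early on.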

Three classifiers—Logistic Regression, Support Vector Machine (SVM), and Random Forest (RF)—were trained using scikit‑learn. Hyper‑parameter tuning was conducted via GridSearchCV with 5‑fold cross‑validation. Because the dataset was imbalanced (the combined CKD + CVD group comprised only about 8% of the sample), the authors experimented with Synthetic Minority Over‑sampling Technique (SMOTE) and class‑weight adjustments to mitigate bias. Model performance was evaluated using accuracy, precision, recall, F1‑score, and area under the ROC curve (AUC). RF achieved the highest accuracy for CKD prediction (0.92) and an AUC of 0.95, while SVM slightly outperformed the others in precision and recall for CVD. To further improve predictive power, the study implemented ensemble strategies—majority voting and stacking—combining RF and SVM. The stacked ensemble yielded the best overall metrics (accuracy 0.94, F1‑score 0.93), surpassing each individual classifier by 2–3%.
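A minimal scikit-learn sketch of this kind of pipeline is shown below. It is not the authors' code: the data are synthetic (generated with `make_classification`), the parameter grid is an arbitrary small example, and only class weighting is shown for the imbalance (SMOTE, which lives in the separate imbalanced-learn package, is left out). The choice of Logistic Regression as the stacking meta-learner is also an assumption, since the summary does not say which one was used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the patient data: 12 features, roughly 10% positives.
X, y = make_classification(n_samples=600, n_features=12, n_informative=6,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Tune a class-weighted Random Forest with 5-fold cross-validation.
# The grid values are illustrative, not the ones used in the study.
rf_search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 5]},
    cv=5, scoring="f1")
rf_search.fit(X_train, y_train)

# The SVM is sensitive to feature scale, so it is wrapped with a scaler.
svm = make_pipeline(StandardScaler(), SVC(class_weight="balanced"))

# Stack the tuned RF and the SVM; a logistic regression (assumed here)
# learns how to combine their cross-validated outputs.
stack = StackingClassifier(
    estimators=[("rf", rf_search.best_estimator_), ("svm", svm)],
    final_estimator=LogisticRegression(class_weight="balanced"),
    cv=5)
stack.fit(X_train, y_train)

# Evaluate on the held-out split with three of the metrics named above.
y_pred = stack.predict(X_test)
y_prob = stack.predict_proba(X_test)[:, 1]
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "auc": roc_auc_score(y_test, y_prob),
}
print(rf_search.best_params_, metrics)
```

On imbalanced data, accuracy alone can look strong even when the minority class is missed, which is why F1 and AUC are reported alongside it in this sketch.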

Interpretability was addressed by extracting feature importance from the RF model and applying SHAP (Shapley Additive Explanations) to visualize each variable’s contribution at the patient level. Serum creatinine, hypertension, and cholesterol consistently ranked highest, providing clinicians with intuitive insights into why a given patient is flagged as high risk.
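The global half of this step, reading feature importance off a fitted Random Forest, might look like the sketch below. The feature names are taken from the twelve variables listed in the summary, but the data are synthetic, so the printed ranking says nothing about the paper's actual findings. The per-patient SHAP analysis is not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Column names mirror the twelve variables in the summary; the values are
# synthetic, so the resulting ranking is NOT the one reported by the authors.
feature_names = [
    "age", "sex", "blood_pressure", "fasting_glucose", "hba1c",
    "serum_creatinine", "egfr", "cholesterol", "triglycerides",
    "history_mi", "history_stroke", "smoking_status",
]
X, y = make_classification(n_samples=500, n_features=len(feature_names),
                           n_informative=5, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ holds one impurity-based importance per column,
# normalized to sum to 1. Sort the columns from most to least important.
ranking = sorted(zip(feature_names, rf.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name:18s} {score:.3f}")
```

Impurity-based importances describe the model as a whole; they do not explain why one particular patient was flagged, which is the gap that patient-level SHAP values are meant to fill.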

The authors acknowledge several limitations. The data originate from a single institution and no independent validation cohort was used, which limits external validity. Although class‑imbalance techniques were applied, sensitivity for the minority high‑risk groups remained modest, suggesting a need for further refinement. Model explainability, although partially addressed with SHAP, could be enhanced with additional methods such as LIME or rule‑based extraction. Finally, the study lacks both a cost‑effectiveness analysis and a prospective clinical trial to demonstrate real‑world impact.

In conclusion, the research demonstrates that integrating conventional statistical findings with machine‑learning algorithms—particularly Random Forest‑based ensembles—can substantially improve early detection of CKD and CVD in diabetic patients. The hybrid framework not only achieves high predictive performance but also offers a degree of interpretability that may facilitate clinical adoption. Future work should focus on multi‑center validation, prospective testing, and integration into electronic health record decision‑support systems to translate these promising results into routine practice.

📜 Original Paper Content