Classification of Heart Disease Using K-Nearest Neighbor and Genetic Algorithm

Data mining techniques have been widely used to extract useful knowledge from medical databases. In data mining, classification is a supervised learning task used to build models that describe important data classes, where the class attribute is involved in the construction of the classifier. K-nearest neighbor (KNN) is a simple, popular, efficient, and effective algorithm for pattern recognition. KNN is a straightforward classifier in which samples are classified based on the class of their nearest neighbors. Medical databases are high-volume in nature. If the data set contains redundant and irrelevant attributes, classification may produce less accurate results. Heart disease is the leading cause of death in India. In Andhra Pradesh, heart disease was the leading cause of mortality, accounting for 32% of all deaths, a rate comparable to that of Canada (35%) and the USA. Hence there is a need for a decision support system that helps clinicians decide when to take precautionary steps. In this paper we propose a new algorithm that combines KNN with a genetic algorithm for effective classification. Genetic algorithms perform a global search in complex, large, and multimodal landscapes and can yield optimal or near-optimal solutions. Experimental results show that our algorithm improves the accuracy of heart disease diagnosis.

💡 Research Summary

The paper addresses the pressing need for reliable decision‑support tools in the diagnosis of coronary artery disease, a leading cause of mortality in India. Recognizing the popularity and simplicity of the k‑Nearest Neighbor (K‑NN) classifier, the authors also acknowledge its well‑known drawbacks when applied to high‑dimensional medical datasets: sensitivity to irrelevant or redundant features, the “curse of dimensionality,” and the critical dependence on the choice of the neighborhood size k. To overcome these limitations, the study proposes a hybrid algorithm that couples K‑NN with a Genetic Algorithm (GA). The GA serves two complementary purposes. First, it performs feature selection by encoding each attribute as a binary gene; the fitness function combines cross‑validated classification accuracy with a penalty for the number of selected features, thereby encouraging parsimonious models. Second, the GA searches for the optimal value of k, treating k as an integer gene and evaluating each candidate through the same cross‑validation scheme. This dual‑optimization framework is intended to produce a K‑NN model that is both leaner and better tuned to the underlying data distribution.
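The dual-optimization idea described above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the authors' implementation: the summary does not specify the selection scheme, crossover and mutation operators, penalty weight, number of folds, or range of k, so truncation selection, one-point crossover, bit-flip mutation, `penalty=0.01`, five folds, and `k_max=15` are all assumptions. The helper names (`knn_predict`, `cv_accuracy`, `fitness`, `evolve`, `make_toy_data`) are hypothetical, and the data is synthetic rather than the heart disease dataset.

```python
import random

def knn_predict(train, query, k, mask):
    """Majority vote among the k nearest rows, using only features where mask is 1."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b, m in zip(x, query, mask) if m), y)
        for x, y in train
    )
    votes = [y for _, y in dists[:k]]
    return max(set(votes), key=votes.count)

def cv_accuracy(data, k, mask, folds=5):
    """Cross-validated accuracy; an empty feature mask scores 0."""
    if not any(mask):
        return 0.0
    correct = 0
    for f in range(folds):
        test = data[f::folds]
        train = [row for i, row in enumerate(data) if i % folds != f]
        correct += sum(knn_predict(train, x, k, mask) == y for x, y in test)
    return correct / len(data)

def fitness(chrom, data, penalty=0.01):
    """Accuracy minus a per-feature penalty (the weight is an assumed value)."""
    mask, k = chrom
    return cv_accuracy(data, k, mask) - penalty * sum(mask)

def evolve(data, n_features, pop_size=12, generations=8, k_max=15, p_mut=0.1, seed=0):
    """Evolve (feature mask, k) chromosomes; operators here are assumptions."""
    rng = random.Random(seed)
    pop = [([rng.randint(0, 1) for _ in range(n_features)], rng.randint(1, k_max))
           for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=lambda c: fitness(c, data), reverse=True)
        elite = ranked[: pop_size // 2]            # truncation selection
        children = []
        while len(elite) + len(children) < pop_size:
            (m1, k1), (m2, k2) = rng.sample(elite, 2)
            cut = rng.randrange(1, n_features)     # one-point crossover on the mask
            mask = [b ^ 1 if rng.random() < p_mut else b for b in m1[:cut] + m2[cut:]]
            k = rng.choice([k1, k2])               # inherit k from one parent
            if rng.random() < p_mut:
                k = rng.randint(1, k_max)          # mutate the integer gene
            children.append((mask, k))
        pop = elite + children
    return max(pop, key=lambda c: fitness(c, data))

def make_toy_data(n=90, n_features=6, seed=1):
    """Synthetic two-class data: only the first two features carry signal."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        x = [rng.gauss(2.0 * y if j < 2 else 0.0, 1.0) for j in range(n_features)]
        data.append((x, y))
    return data

if __name__ == "__main__":
    data = make_toy_data()
    mask, k = evolve(data, n_features=6)
    print("selected features:", mask, "k:", k, "CV accuracy:", cv_accuracy(data, k, mask))
```

Encoding the mask and k in one chromosome lets a single fitness call score both decisions together, which matches the summary's description of evaluating each candidate through the same cross-validation scheme. A practical version would cache fitness values, since this sketch re-evaluates surviving chromosomes every generation.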

The experimental evaluation uses a publicly available heart disease dataset collected from hospitals in Andhra Pradesh, comprising 303 patient records with 13 clinical attributes (e.g., age, cholesterol, resting blood pressure, thalassemia). Missing values are imputed with column means, and all continuous variables are normalized to the

📜 Original Paper Content