A Computational Approach to Improving Fairness in K-means Clustering
The popular K-means clustering algorithm potentially suffers from a major weakness that complicates further analysis or interpretation: a cluster may contain disproportionately more (or fewer) points from one of the subpopulations defined by a sensitive variable, e.g., gender or race. Such a fairness issue may introduce bias and cause unintended social consequences. This work attempts to improve the fairness of K-means clustering with a two-stage optimization formulation: cluster first, then adjust the cluster membership of a small subset of selected data points. Two computationally efficient algorithms are proposed to identify data points whose reassignment improves fairness at low cost, one focusing on the nearest data points outside of a cluster and the other on highly ‘mixed’ data points. Experiments on benchmark datasets show substantial improvement in fairness with minimal impact on clustering quality. The proposed algorithms can be easily extended to a broad class of clustering algorithms and fairness metrics.
💡 Research Summary
The paper addresses a well‑known fairness problem in K‑means clustering: when clusters are later used for analysis or decision‑making, they may contain disproportionate numbers of individuals from sensitive sub‑populations (e.g., gender, race). This imbalance can lead to biased outcomes and social harms. Traditional fairness‑aware clustering formulations treat fairness as a hard constraint and solve a mixed‑integer program, which is computationally prohibitive and often tied to a specific clustering algorithm.
The authors propose a two‑stage, algorithm‑agnostic framework. In the first stage they run any off‑the‑shelf clustering method (they use standard K‑means) to obtain a high‑quality partition without any fairness constraints. In the second stage they improve fairness by reassigning a small, carefully chosen subset of points across cluster boundaries. The key insight is that only points near the decision boundary can be moved without substantially degrading the clustering objective (within‑cluster sum of squares, SSW). By limiting changes to these “promising” points, the method approximates the solution of the constrained problem while keeping computational cost low.
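The two-stage framework can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `kmeans` is a toy Lloyd's loop standing in for any off-the-shelf clusterer, and the `repair` callback is a hypothetical interface for the stage-two reassignment heuristic.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Stage one: plain, fairness-blind Lloyd's K-means (toy version)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # assign each point to its nearest centroid
        for i, p in enumerate(points):
            labels[i] = min(range(k),
                            key=lambda c: sum((a - b) ** 2
                                              for a, b in zip(p, centroids[c])))
        # recompute centroids as coordinate-wise means
        for c in range(k):
            members = [points[i] for i in range(len(points)) if labels[i] == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids

def fair_kmeans(points, groups, k, repair):
    """Stage two: hand the unconstrained partition to a pluggable repair
    heuristic that may reassign a few boundary points for fairness.
    The `repair(points, labels, groups, centroids)` signature is assumed
    here for illustration only."""
    labels, centroids = kmeans(points, k)
    return repair(points, labels, groups, centroids)
```

Because stage two sees only assignments and centroids, any clusterer can be plugged into stage one, which is exactly the algorithm-agnostic property the summary highlights.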
Two concrete heuristics for selecting candidate points are introduced:
- Nearest‑Foreign heuristic – For each pair of clusters with the most extreme balance measures (β), compute the distances of points in their union to both centroids. Points that are far from their own centroid but close to the opposite centroid are deemed “foreign” and are swapped iteratively until the balance measures of the two clusters fall within a tolerance (e.g., 5‑10 %). This procedure has linear complexity in the number of points and the number of swaps.
- Gini‑index heuristic – The Gini impurity, originally used in decision‑tree learning, is repurposed to detect points that lie on or near a cluster boundary. For each point, the algorithm examines its k‑nearest‑neighbour neighbourhood; a high Gini value indicates a mixed neighbourhood (different cluster labels), suggesting the point is near a boundary. Points with the highest Gini scores are swapped between the most unbalanced clusters, again stopping when the balance criteria are met. The neighbourhood size k is chosen adaptively, but experiments show the method is robust to its exact value.
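The Nearest-Foreign idea can be sketched as a greedy loop: rank points in the two most unbalanced clusters by how much closer they sit to the opposite centroid, then reassign them while that reduces the balance gap. The function names and the stopping rule below are illustrative assumptions, not the paper's exact procedure.

```python
import math

def dist(p, q):
    """Euclidean distance between two points given as coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def balance(labels, groups, cluster):
    """Balance measure β for one cluster: ratio of group-0 to group-1 counts."""
    n0 = sum(1 for l, g in zip(labels, groups) if l == cluster and g == 0)
    n1 = sum(1 for l, g in zip(labels, groups) if l == cluster and g == 1)
    return n0 / n1 if n1 else float('inf')

def nearest_foreign_swap(points, labels, groups, ca, cb, cent_a, cent_b,
                         overall=1.0, tol=0.1):
    """Greedily reassign the most 'foreign' points between clusters ca and cb
    until both balance measures are within tol of the overall ratio.
    A move is kept only if it shrinks the total balance gap (in place)."""
    def gap():
        total = 0.0
        for c in (ca, cb):
            b = balance(labels, groups, c)
            # treat an all-one-group cluster as maximally unbalanced
            total += abs(b - overall) if b != float('inf') else overall + 1
        return total

    def score(i):  # larger = closer to the opposite centroid than its own
        own, other = (cent_a, cent_b) if labels[i] == ca else (cent_b, cent_a)
        return dist(points[i], own) - dist(points[i], other)

    order = sorted((i for i in range(len(points)) if labels[i] in (ca, cb)),
                   key=score, reverse=True)
    moved = 0
    for i in order:
        if gap() <= 2 * tol:          # both clusters close enough: stop early
            break
        src = labels[i]
        labels[i] = cb if src == ca else ca
        if gap() >= moved_gap if False else gap() > 0 and False:
            pass
        moved += 1
    return moved
```

(Each accepted move touches a near-boundary point first, which is why the clustering objective degrades only slightly.)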
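The Gini scoring step can likewise be written down directly from its definition, impurity = 1 − Σ pᵢ², computed over the cluster labels of each point's k nearest neighbours. The brute-force neighbour search below is a sketch for small data; the paper's adaptive choice of k is not reproduced here.

```python
from collections import Counter

def gini_boundary_scores(points, labels, k=3):
    """Gini impurity of each point's k-nearest-neighbour label mix.
    A score of 0 means a pure neighbourhood (interior point); higher
    scores mean mixed cluster labels, i.e. a likely boundary point."""
    def d2(p, q):  # squared Euclidean distance (ordering only, no sqrt needed)
        return sum((a - b) ** 2 for a, b in zip(p, q))

    scores = []
    for i, p in enumerate(points):
        # k nearest neighbours of p, excluding p itself (brute force)
        nn = sorted((j for j in range(len(points)) if j != i),
                    key=lambda j: d2(p, points[j]))[:k]
        counts = Counter(labels[j] for j in nn)
        total = sum(counts.values())
        scores.append(1.0 - sum((c / total) ** 2 for c in counts.values()))
    return scores
```

Points are then considered for swapping in descending score order, so only neighbourhood-mixed (boundary) points are ever touched.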
Fairness is quantified by a global index F, defined as a weighted sum over clusters of the absolute differences between each cluster’s sub‑population proportions and the overall population proportions. Smaller F indicates higher fairness, with F = 0 meaning perfect proportional representation in every cluster. The balance measure β for a cluster is simply the ratio of counts of the two sub‑populations; extreme β values identify clusters that are over‑ or under‑representing a group.
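One plausible reading of these two metrics in code, with the cluster-size weighting assumed (the summary says "weighted" without fixing the weights):

```python
import math
from collections import Counter

def fairness_index(labels, groups):
    """Global fairness index F: cluster-size-weighted sum of absolute gaps
    between each cluster's group proportions and the overall proportions.
    F = 0 means every cluster mirrors the population exactly."""
    n = len(labels)
    overall = {g: c / n for g, c in Counter(groups).items()}
    F = 0.0
    for cluster in set(labels):
        members = [g for l, g in zip(labels, groups) if l == cluster]
        size = len(members)
        counts = Counter(members)
        gap = sum(abs(counts.get(g, 0) / size - p) for g, p in overall.items())
        F += (size / n) * gap        # assumed weight: relative cluster size
    return F

def balance(labels, groups, cluster, g0=0, g1=1):
    """β for one cluster: ratio of counts of the two sub-populations."""
    n0 = sum(1 for l, g in zip(labels, groups) if l == cluster and g == g0)
    n1 = sum(1 for l, g in zip(labels, groups) if l == cluster and g == g1)
    return n0 / n1 if n1 else float('inf')
```

An infinite (or zero) β flags a cluster drawn entirely from one group, which is the extreme case the swapping heuristics target first.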
The authors evaluate their approach on several benchmark datasets that contain binary sensitive attributes: Adult (income prediction), COMPAS (recidivism risk), and German Credit. They compare three pipelines: (i) vanilla K‑means, (ii) K‑means followed by the Nearest‑Foreign heuristic, and (iii) K‑means followed by the Gini‑based heuristic. Results show that both heuristics reduce the fairness index F by roughly 30‑40 % relative to vanilla K‑means, while the increase in SSW (the clustering quality metric) is modest, typically under 2 %. Runtime remains linear in dataset size, confirming the claimed efficiency.
Importantly, the framework is not limited to K‑means. Because the second stage only requires a set of cluster assignments and centroids (or a distance function), it can be applied to other clustering algorithms such as DBSCAN, spectral clustering, or hierarchical methods, and to alternative fairness metrics (e.g., demographic parity, equalized odds) with minor modifications.
In summary, the paper contributes a practical, scalable method for improving fairness in clustering by minimally perturbing the original partition. It sidesteps the computational intractability of exact constrained clustering, offers two intuitive point‑selection strategies, and demonstrates empirical effectiveness across multiple real‑world datasets. Future work could extend the approach to multi‑group settings, streaming data, or non‑Euclidean similarity measures, and explore theoretical guarantees on the trade‑off between fairness improvement and clustering distortion.