Data Stability in Clustering: A Closer Look
We consider the model introduced by Bilu and Linial (2010), who study problems for which the optimal clustering does not change when distances are perturbed. They show that even when a problem is NP-hard, it is sometimes possible to obtain efficient algorithms for instances resilient to certain multiplicative perturbations, e.g. on the order of $O(\sqrt{n})$ for max-cut clustering. Awasthi et al. (2010) consider center-based objectives, and Balcan and Liang (2011) analyze the $k$-median and min-sum objectives, giving efficient algorithms for instances resilient to certain constant multiplicative perturbations. Here, we are motivated by the question of how far these assumptions can be relaxed while still allowing for efficient algorithms. We show there is little room to improve these results by giving NP-hardness lower bounds for both the $k$-median and min-sum objectives. On the other hand, we show that constant multiplicative resilience parameters can be so strong as to make the clustering problem trivial, leaving only a narrow range of resilience parameters for which clustering is interesting. We also consider a model of additive perturbations and give a correspondence between additive and multiplicative notions of stability. Our results provide a close examination of the consequences of assuming stability in data.
💡 Research Summary
The paper investigates the algorithmic implications of stability assumptions in clustering, focusing on the perturbation‑resilience model introduced by Bilu and Linial (2010). In this model an optimal clustering must remain unchanged under any multiplicative perturbation of the distances by a factor up to α > 1. Prior work showed that for certain objectives (e.g., Max‑Cut, k‑median, min‑sum) efficient algorithms exist when the instance is α‑resilient for relatively large α (e.g., O(√n) for Max‑Cut, α ≈ 3 for k‑median). The authors ask how far these assumptions can be weakened while still permitting polynomial‑time algorithms.
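As a concrete illustration of what α-resilience buys (my own sketch, not code from the paper): for center-based objectives, α-perturbation resilience is known to imply the simpler α-center proximity property, where every point is more than a factor α closer to its own center than to any other center. A minimal checker, with hypothetical names:

```python
def center_proximity(dist, clusters, centers, alpha):
    """Check alpha-center proximity: every point p in cluster i satisfies
    dist(p, c_j) > alpha * dist(p, c_i) for every other center c_j.
    `dist` is a symmetric matrix, `clusters` maps each point index to its
    cluster index, `centers` lists one center index per cluster."""
    for p, i in enumerate(clusters):
        for j, c in enumerate(centers):
            if j != i and dist[p][c] <= alpha * dist[p][centers[i]]:
                return False
    return True

# Two well-separated groups on a line: {0, 1} centered at 0, {10, 11} centered at 10.
xs = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(a - b) for b in xs] for a in xs]
print(center_proximity(dist, [0, 0, 1, 1], [0, 2], 2.0))  # → True
```

Checking full perturbation resilience directly would require reasoning about all perturbed distance functions; this necessary condition is what the algorithmic results typically exploit.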
The paper makes two complementary contributions. First, it establishes tight hardness lower bounds for the “proper” setting where cluster centers must be chosen from the data points. By reducing from a Perfect Dominating Set promise problem, they prove that even for (2 − ε)-center-stable k-median instances the optimal solution is NP-hard to compute, for any ε > 0. The reduction builds a metric where edges in the original graph correspond to distance ½ and non-edges to distance 1; the optimal k-median cost directly encodes a minimum dominating set. Second, they introduce a new notion of α-min-sum stability (a natural analogue of center stability for the min-sum objective) and show that α-perturbation resilience implies this stability. Using a reduction from the NP-complete Triangle Partition problem, they prove that computing the optimal clustering of (2 − ε)-min-sum-stable instances is also NP-hard, again for any ε > 0. The construction mirrors the k-median reduction, assigning distance ½ to adjacent vertices and 1 otherwise, and shows that a clustering of the target cost exists iff the graph can be partitioned into triangles.
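To make the reduction concrete, here is a small sketch (my own illustration, not the paper's code) that builds the metric from a graph and brute-forces the discrete k-median cost. Each non-center point contributes at least ½, so the optimum equals (n − k)/2 exactly when the graph has a dominating set of size k:

```python
from itertools import combinations

def reduction_metric(n, edges):
    """Distance 1/2 between adjacent vertices, 1 between non-adjacent ones.
    The triangle inequality holds since 1 <= 1/2 + 1/2."""
    d = [[1.0] * n for _ in range(n)]
    for i in range(n):
        d[i][i] = 0.0
    for u, v in edges:
        d[u][v] = d[v][u] = 0.5
    return d

def kmedian_opt(d, k):
    """Brute-force discrete k-median: centers must be chosen among the points."""
    n = len(d)
    return min(sum(min(d[p][c] for c in centers) for p in range(n))
               for centers in combinations(range(n), k))

# Star K_{1,3}: vertex 0 dominates 1, 2, 3, so {0} is a dominating set of size 1.
d = reduction_metric(4, [(0, 1), (0, 2), (0, 3)])
print(kmedian_opt(d, 1))  # → 1.5, i.e. (n - k)/2
```

The exponential brute force is only for illustration; the point of the reduction is precisely that no polynomial-time shortcut exists unless P = NP.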
These lower bounds demonstrate that the known algorithms (α ≈ 1 + √2 for k‑median, α ≈ 2 for min‑sum) are essentially optimal in the sense that any substantial reduction of α would encounter computational intractability. The authors then explore the opposite extreme: when α is large. They prove that if α exceeds 2 + √3, every α‑stable instance exhibits strict separation—each point is strictly closer to all points in its own cluster than to any point outside. In such cases the clustering problem becomes trivial, as a simple distance‑based rule recovers the optimal partition. Thus, there is a narrow “interesting” window for α between roughly 2 and 2 + √3.
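For intuition on why strict separation trivializes the problem, the sketch below (an illustration under the strict-separation assumption, not code from the paper) recovers the partition with naive single-linkage: repeatedly merge the two closest clusters until k remain. Under strict separation the globally closest cross-cluster pair always lies inside a still-split target cluster, so every merge is correct:

```python
def single_linkage(dist, k):
    """Merge the two current clusters with the smallest minimum pairwise
    distance until k clusters remain. Under strict separation this
    recovers the target partition."""
    clusters = [{i} for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                gap = min(dist[p][q] for p in clusters[a] for q in clusters[b])
                if best is None or gap < best[0]:
                    best = (gap, a, b)
        _, a, b = best
        clusters[a] |= clusters.pop(b)
    return clusters

# Three strictly separated groups on a line.
xs = [0.0, 0.1, 0.2, 5.0, 5.1, 10.0, 10.1]
dist = [[abs(u - v) for v in xs] for u in xs]
print(sorted(sorted(c) for c in single_linkage(dist, 3)))
# → [[0, 1, 2], [3, 4], [5, 6]]
```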
The paper also examines an additive perturbation model, defining ε‑additive stability (each point’s average intra‑cluster distance is at least ε smaller than its average distance to any other cluster). They establish a linear correspondence between additive and multiplicative parameters, showing that additive stability does not circumvent the hardness results: the same (2 − ε) lower bounds hold, and algorithms for multiplicatively stable data can be adapted to additive instances with appropriately transformed parameters.
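The additive condition as paraphrased above can be checked directly. The following sketch (a hypothetical helper built from this summary's phrasing, not the paper's formal definition) tests whether each point's average intra-cluster distance beats its average distance to every other cluster by at least ε:

```python
def is_additively_stable(dist, clusters, eps):
    """Check eps-additive stability as described above: for every point, the
    average distance to the other points in its own cluster is at least
    eps below its average distance to each other cluster."""
    groups = {}
    for p, c in enumerate(clusters):
        groups.setdefault(c, []).append(p)
    for p, c in enumerate(clusters):
        own = [dist[p][q] for q in groups[c] if q != p]
        if not own:
            continue  # a singleton cluster imposes no intra-cluster constraint
        own_avg = sum(own) / len(own)
        for c2, members in groups.items():
            if c2 != c:
                other_avg = sum(dist[p][q] for q in members) / len(members)
                if own_avg > other_avg - eps:
                    return False
    return True

xs = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(a - b) for b in xs] for a in xs]
print(is_additively_stable(dist, [0, 0, 1, 1], 2.0))  # → True
```

Note that ε here is an absolute margin, which is why the paper's correspondence between additive and multiplicative parameters is needed to transfer both the algorithms and the (2 − ε) hardness bounds.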
Overall, the work paints a detailed picture of the trade-off between stability assumptions and computational feasibility. Stability with α below 2 does not escape NP-hardness, while α above 2 + √3 over-constrains the data, making the problem trivial. The results narrow the range of α that is both algorithmically tractable and non-trivial, and they demonstrate that switching from multiplicative to additive stability offers no advantage. This clarifies the limits of stability-based approaches to clustering and suggests that future research should focus on the narrow band of α where meaningful, efficient algorithms might still be discovered.