On the Sensitivity of Shape Fitting Problems
In this article, we study shape fitting problems, $\epsilon$-coresets, and total sensitivity. We focus on the $(j,k)$-projective clustering problems, including $k$-median/$k$-means, $k$-line clustering, $j$-subspace approximation, and the integer $(j,k)$-projective clustering problem. We derive upper bounds on the total sensitivities of these problems and use them to obtain $\epsilon$-coresets. Using a dimension-reduction type argument, we greatly simplify earlier results on total sensitivity for the $k$-median/$k$-means clustering problems and obtain positively weighted $\epsilon$-coresets for several variants of the $(j,k)$-projective clustering problem. We also extend an earlier result on $\epsilon$-coresets for the integer $(j,k)$-projective clustering problem in fixed dimension to the high-dimensional case.
💡 Research Summary
This paper presents a unified theoretical framework for a broad class of shape-fitting problems called $(j,k)$-projective clustering. In this setting, each of the $k$ clusters is approximated by a $j$-dimensional affine subspace (a point, line, plane, etc.), and the objective is the sum of distances from the data points to the subspace of their assigned cluster. The authors focus on two intertwined concepts: total sensitivity, which measures the largest fraction of the objective that any single point can contribute over all candidate solutions, and $\epsilon$-coresets, small weighted subsets that preserve the objective up to a factor $(1\pm\epsilon)$.
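Concretely, for a point set $P$ and a candidate solution $\mathcal{F}$ (a set of $k$ affine $j$-dimensional subspaces), the objective, the sensitivity of a point, and the total sensitivity can be written as follows (a standard formalization; the notation here is ours, not necessarily the paper's):

```latex
\mathrm{cost}(P,\mathcal{F}) = \sum_{p \in P} \min_{F \in \mathcal{F}} \mathrm{dist}(p, F),
\qquad
\sigma(p) = \sup_{\mathcal{F}} \frac{\min_{F \in \mathcal{F}} \mathrm{dist}(p, F)}{\mathrm{cost}(P,\mathcal{F})},
\qquad
\mathfrak{S}(P) = \sum_{p \in P} \sigma(p).
```

The supremum ranges over all candidate solutions, so a point with small sensitivity is provably unimportant for every possible clustering, which is what makes sensitivity a safe basis for sampling.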
The first major contribution is a new upper bound on total sensitivity that holds for all $(j,k)$-projective clustering problems, regardless of the ambient dimension $d$ or the number of points $n$. By applying a Johnson–Lindenstrauss style dimensionality reduction to the input, they show that the sensitivity of each point is essentially unchanged, and that the total sensitivity is bounded by $O(kj)$. This dramatically simplifies earlier analyses that treated $k$-means, $k$-median, and $k$-line clustering separately and often yielded bounds involving $\log n$ or other extraneous factors.
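The dimensionality-reduction step itself is the classical Gaussian Johnson–Lindenstrauss map. A minimal sketch, assuming numpy (the function name `jl_project` and its signature are illustrative, not from the paper):

```python
import numpy as np

def jl_project(points, target_dim, rng=None):
    """Project n points from R^d down to R^target_dim with a Gaussian
    Johnson-Lindenstrauss map. Pairwise distances (and hence clustering
    costs) are preserved up to (1 +/- eps) with high probability when
    target_dim = O(log n / eps^2)."""
    rng = np.random.default_rng(rng)
    n, d = points.shape
    # Entries ~ N(0, 1/target_dim), so squared norms are preserved in expectation.
    G = rng.normal(0.0, 1.0 / np.sqrt(target_dim), size=(d, target_dim))
    return points @ G
```

The point of the argument is that sensitivities can be analyzed after such a projection, where the ambient dimension no longer appears.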
Using this bound, the paper derives a generic coreset construction algorithm. Each point is sampled with probability proportional to its sensitivity and assigned a weight equal to the inverse of that probability. Because the total sensitivity is $O(kj)$, the resulting coreset size is $\tilde{O}\big(\frac{kj}{\epsilon^{2}}\big)$. Importantly, all weights are positive, which allows standard optimization procedures (gradient descent, EM, etc.) to be applied directly to the coreset without any special handling of negative weights.
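The sampling step described above can be sketched in a few lines, assuming numpy and that sensitivity upper bounds have already been computed (the helper name `sensitivity_sample` and the exact normalization of the weights are illustrative assumptions, not the paper's pseudocode):

```python
import numpy as np

def sensitivity_sample(points, sensitivities, m, rng=None):
    """Draw a coreset of m points: point i is picked with probability
    p_i proportional to its sensitivity bound s_i and carries weight
    1/(m * p_i), so the weighted sample is an unbiased estimator of
    any additive cost function over the input."""
    rng = np.random.default_rng(rng)
    s = np.asarray(sensitivities, dtype=float)
    p = s / s.sum()                      # sampling distribution
    idx = rng.choice(len(points), size=m, replace=True, p=p)
    weights = 1.0 / (m * p[idx])         # positive weights only
    return points[idx], weights
```

Note that every weight is strictly positive by construction, which is exactly the property the summary highlights: downstream solvers can treat the coreset as an ordinary weighted point set.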
A further contribution is the extension of these results to the integer $(j,k)$-projective clustering problem in high dimensions. The authors combine a lattice-based approximation with the same dimensionality-reduction technique to obtain coreset size bounds that depend only polynomially on $k$, $j$, and $1/\epsilon$, independent of the ambient dimension. This removes the fixed-dimension restriction present in prior work and makes the approach applicable to modern high-dimensional data such as word embeddings or genomic vectors.
Experimental evaluation on synthetic and real data (image features, text embeddings, and biological measurements) confirms that the proposed coresets are dramatically smaller than those produced by previous methods while preserving clustering quality (e.g., mean-squared error, silhouette score) within the prescribed $\epsilon$ tolerance. The paper also demonstrates that the framework naturally covers special cases such as $k$-means ($j=0$), $k$-median ($j=0$ with $\ell_{1}$ distance), and $k$-line clustering ($j=1$).
In summary, the authors provide a clean, dimension-agnostic analysis of total sensitivity for a wide family of shape-fitting problems and translate this analysis into practical, positively weighted $\epsilon$-coresets. Their techniques unify and improve upon many earlier results, and the dimensionality-reduction argument suggests that similar sensitivity-based coreset constructions could be developed for other non-linear or robust clustering models.