Dimension Reduction for Clustering: The Curious Case of Discrete Centers

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

The Johnson-Lindenstrauss transform is a fundamental method for dimension reduction in Euclidean spaces, which can map any dataset of $n$ points into dimension $O(\log n)$ with low distortion of their distances. This dimension bound is tight in general, but one can bypass it for specific problems. Indeed, tremendous progress has been made for clustering problems, especially in the \emph{continuous} setting where centers can be picked from the ambient space $\mathbb{R}^d$. Most notably, for $k$-median and $k$-means, the dimension bound was improved to $O(\log k)$ [Makarychev, Makarychev and Razenshteyn, STOC 2019]. We explore dimension reduction for clustering in the \emph{discrete} setting, where centers can only be picked from the dataset, and present two results that are both parameterized by the doubling dimension of the dataset, denoted $\operatorname{ddim}$. The first result shows that dimension $O_\epsilon(\operatorname{ddim} + \log k + \log\log n)$ suffices, and is moreover tight, to guarantee that the cost is preserved within factor $1\pm\epsilon$ for every set of centers. Our second result eliminates the $\log\log n$ term in the dimension through a relaxation of the guarantee (namely, preserving the cost only for all approximately-optimal sets of centers), which maintains its usefulness for downstream applications. Overall, we achieve strong dimension reduction in the discrete setting, and find that it differs from the continuous setting not only in the dimension bound, which depends on the doubling dimension, but also in the guarantees beyond preserving the optimal value, such as which clusterings are preserved.


💡 Research Summary

The paper investigates oblivious linear dimension reduction for Euclidean clustering when the centers are required to be data points (the discrete setting). While the classic Johnson‑Lindenstrauss (JL) lemma guarantees that any n‑point set can be embedded into O(ε⁻² log n) dimensions with (1±ε) distortion of all pairwise distances, this bound is known to be optimal for preserving distances. However, for specific optimization problems one can often do better. Recent work showed that for continuous k‑median and k‑means the target dimension can be reduced to O(ε⁻² log k), independent of n. The authors ask whether a similar “beyond JL” result holds when centers must be chosen from the data itself.

The key insight is to parameterize the problem by the doubling dimension (ddim) of the dataset, a measure of intrinsic low‑dimensional structure. Using a Gaussian JL matrix G∈ℝ^{t×d} with i.i.d. N(0,1/t) entries, the authors prove two main theorems.
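As a concrete illustration of this setup, the following sketch builds such a Gaussian matrix and applies it to a toy point set; the specific sizes and the distance check are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_jl(points, t):
    """Project n points in R^d down to R^t with a Gaussian JL matrix.

    Entries of G are i.i.d. N(0, 1/t), so squared distances are
    preserved in expectation and concentrate for moderate t.
    """
    n, d = points.shape
    G = rng.normal(loc=0.0, scale=1.0 / np.sqrt(t), size=(t, d))
    return points @ G.T  # row i is G @ points[i]

# Toy check: a pairwise distance survives the projection with
# small relative distortion.
P = rng.normal(size=(50, 1000))
Q = gaussian_jl(P, t=400)
orig = np.linalg.norm(P[0] - P[1])
proj = np.linalg.norm(Q[0] - Q[1])
print(proj / orig)  # typically within a few percent of 1
```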

Theorem 1.1 (strong “for‑all‑centers” guarantee).
If the target dimension is
 t = Õ(ε⁻² (ddim(P) + log k + log log n)),
then with probability at least 2/3:

  1. The optimal discrete clustering cost is preserved up to a (1+ε) factor, i.e., opt(G(P)) ≤ (1+ε)·opt(P).
  2. For every set C⊆P of size at most k, the cost after embedding satisfies
     cost(G(P), G(C)) ≥ (1−ε)·cost(P, C).

Thus the cost of every possible center set contracts by at most a factor of (1−ε). The extra log log n term is shown to be necessary: Theorem 6.2 establishes a lower bound Ω(ε⁻² log log n) even for k=1 and constant ddim, meaning any oblivious linear map that guarantees (1−ε) contraction for all centers must use at least that many dimensions.
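To make the "for‑all‑centers" guarantee tangible, here is a hypothetical numerical check (not from the paper): for k = 1, compare the discrete cost of every candidate center before and after a Gaussian projection, and inspect the worst contraction and expansion over all centers.

```python
import numpy as np

rng = np.random.default_rng(1)

def cost(points, centers):
    """Discrete k-median cost: each point pays its Euclidean
    distance to the nearest chosen center."""
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return d.min(axis=1).sum()

# Toy data: n points in high dimension, projected down to t dims.
P = rng.normal(size=(60, 500))
t = 300
G = rng.normal(scale=1.0 / np.sqrt(t), size=(t, 500))
GP = P @ G.T

# k = 1: compare cost(G(P), G(c)) against cost(P, c) for every
# candidate center c in the dataset.
ratios = [cost(GP, GP[i:i+1]) / cost(P, P[i:i+1]) for i in range(len(P))]
print(min(ratios), max(ratios))  # both close to 1 when t is large enough
```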

Theorem 1.2 (relaxed guarantee).
If we are willing to relax the contraction requirement, we can drop the log log n term. For
 t = Õ(ε⁻² (ddim(P) + log k)),
the same random Gaussian map satisfies with probability ≥2/3:

  1. opt(G(P)) ≤ (1+ε)·opt(P) (optimal value still preserved).
  2. For every C⊆P, |C|≤k,
     cost(G(P), G(C)) ≥ min{ (1−ε)·cost(P, C), α·opt(P) }
     for any constant α>1 (the paper uses α=100). This “α‑relaxed (1−ε) contraction” means that a near‑optimal center set (cost comparable to opt(P)) cannot have its cost drop below (1−ε) of its original value; a far‑from‑optimal center set (cost well above α·opt(P)) may contract further, but only down to α·opt(P), so it still cannot masquerade as near‑optimal after the embedding. Under this relaxed guarantee, any β‑approximate solution for the embedded instance lifts to a (1+O(ε))·β‑approximate solution for the original data, provided β < α/(1+ε). The constant α can be increased at the expense of an additive O(ε⁻² log log α) term in the dimension.

The authors also explore several notions of solution representation—partitions, centers, and combined center‑and‑partition—and show that “for‑all‑centers” and “for‑all‑partitions” guarantees are incomparable. Achieving both simultaneously (i.e., “for‑all‑centers and partitions”) requires the extra log log n dimensions, as proved in Theorem 6.4.

Lower bounds are complemented by existing results: preserving the optimal value alone already needs Ω(ε⁻² log k) dimensions (from Makarychev et al. 2019) and Ω(ε⁻² ddim) dimensions (from Charikar & Wang 2025). The paper’s bounds are essentially tight for Gaussian JL maps and, by extension, for any oblivious linear map.

The work further generalizes to the setting where a separate candidate set Q of potential centers is given. If |Q|=s, then t = Õ(ε⁻² log s) dimensions suffice for a “for‑all‑centers” guarantee (Theorem 4.1). When both P and Q have low doubling dimension, the required dimension becomes Õ(ε⁻² (ddim(P∪Q) + log k + log log n)) for the strongest guarantee, and Õ(ε⁻² (ddim(P∪Q) + log k)) for the relaxed version.

Algorithmically, these results enable a simple pipeline: apply the random Gaussian map, run any existing low‑dimensional k‑median/k‑means approximation algorithm (e.g., a PTAS, a 2‑approximation, or a fast heuristic), and lift the resulting centers back to the original space. Theorems 1.1 and 1.2 guarantee that the lifted solution retains the same approximation factor up to a (1+O(ε)) multiplicative loss, while the dimensionality reduction dramatically reduces runtime and memory usage, especially when ddim(P) is constant and k is modest.
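A minimal sketch of this pipeline, assuming a tiny instance so the discrete k‑median can be solved exhaustively (the solver and all sizes here are illustrative stand‑ins, not the paper's algorithms):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)

def cost(points, idx):
    """Discrete k-median cost with centers points[idx]."""
    centers = points[idx]
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return d.min(axis=1).sum()

def best_discrete(points, k):
    """Exhaustive discrete k-median (feasible only for tiny n, k)."""
    return min(combinations(range(len(points)), k),
               key=lambda idx: cost(points, list(idx)))

# Pipeline: project with a Gaussian JL map, solve in low dimension,
# then lift the chosen *indices* back to the original space.
P = rng.normal(size=(20, 400))
t = 200
G = rng.normal(scale=1.0 / np.sqrt(t), size=(t, 400))
GP = P @ G.T

idx_low = best_discrete(GP, k=2)           # solve on the sketch
lifted = cost(P, list(idx_low))            # lifted solution's true cost
opt = cost(P, list(best_discrete(P, 2)))   # true discrete optimum
print(lifted / opt)  # close to 1 for a suitable target dimension
```

Because centers are data points, "lifting" is trivial: the indices chosen in the low-dimensional sketch directly name centers in the original space.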

In summary, the paper establishes that for discrete clustering, the intrinsic doubling dimension, rather than the ambient dimension or the number of points, governs the feasibility of strong dimension reduction. It provides nearly optimal upper and lower bounds, introduces a novel relaxed contraction notion to eliminate the log log n term, and clarifies the precise trade‑offs between the strength of the preservation guarantee and the required target dimension. The results open avenues for further research on non‑linear embeddings, data‑dependent reductions, and empirical validation on real‑world low‑doubling datasets.

