Relative Error Fair Clustering in the Weak-Strong Oracle Model
We study fair clustering problems in a setting where distance information is obtained from two sources: a strong oracle providing exact distances, but at a high cost, and a weak oracle providing potentially inaccurate distance estimates at a low cost. The goal is to produce a near-optimal fair clustering on $n$ input points with a minimum number of strong oracle queries. This models the increasingly common trade-off between accurate but expensive similarity measures (e.g., large-scale embeddings) and cheaper but inaccurate alternatives. The study of fair clustering in this model is motivated by the important goal of achieving fairness in the presence of inaccurate information. We achieve the first $(1+\varepsilon)$-coresets for fair $k$-median clustering using $\text{poly}\left(\frac{k}{\varepsilon}\cdot\log n\right)$ queries to the strong oracle. Furthermore, our results imply coresets for the standard setting (without fairness constraints): we in fact obtain $(1+\varepsilon)$-coresets for $(k,z)$-clustering for general $z=O(1)$ with a similar number of strong oracle queries. In contrast, previous results achieved only constant-factor $(>10)$ approximations for standard $k$-clustering problems, and no previous work considered the fair $k$-median clustering problem.
💡 Research Summary
The paper addresses the problem of fair clustering in a setting where distance information can be obtained from two distinct sources: a strong oracle that returns exact distances at a high computational cost, and a weak oracle that provides cheap but potentially inaccurate distance estimates. This “weak‑strong oracle model” captures the practical trade‑off between expensive, high‑quality similarity measures (such as deep embeddings) and inexpensive, noisy alternatives (such as lightweight heuristics). The authors focus on producing a near‑optimal fair k‑median clustering while minimizing the number of queries to the strong oracle.
The main contributions are threefold. First, they construct (1 + ε)‑coresets for fair k‑median clustering with size O(Λ·k²/ε²), where Λ denotes the maximum number of demographic groups any single point may belong to. The construction uses only O(Λ·k·log n/ε) strong‑oracle point (or edge) queries and O(Λ·n·k·log n) weak‑oracle queries, achieving polylogarithmic dependence on the dataset size n in the strong‑oracle budget. When groups are disjoint (Λ = 1), the coreset size improves to O(k²·log⁴ n·log(n/ε)/ε²) and the strong‑oracle query count drops to O(k·log⁴ n), which is optimal up to polylogarithmic factors.
Second, they extend the technique to the standard (non‑fair) (k, z)‑clustering problem for any constant z = O(1) (including k‑means, z = 2, as a special case). They obtain (1 + ε)‑coresets of size O(k²/ε³) using O(k²/ε³) strong‑oracle queries, again with only linearly many (in n) weak‑oracle queries.
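To make the (k, z) objective concrete, here is a minimal sketch (illustrative code, not the paper's algorithm) of the weighted cost a coreset must preserve: the sum over points of w(x)·d(x, C)^z, which a (1 + ε)‑coreset approximates within a (1 ± ε) factor for every candidate center set C.

```python
import math

def kz_cost(points, weights, centers, z):
    """Weighted (k, z)-clustering cost: sum_x w(x) * min_{c in C} d(x, c)^z.
    A (1+eps)-coreset (S, w) satisfies, for every set C of k centers,
    (1-eps) * cost(P, C) <= cost_w(S, C) <= (1+eps) * cost(P, C)."""
    total = 0.0
    for p, w in zip(points, weights):
        d = min(math.dist(p, c) for c in centers)  # nearest-center distance
        total += w * d ** z
    return total

# Toy example: 4 unit-weight points, 2 centers, z = 2 (the k-means cost).
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
weights = [1.0] * 4
centers = [(0.0, 0.5), (5.0, 0.5)]
print(kz_cost(points, weights, centers, 2))  # each point at distance 0.5 -> 4 * 0.25 = 1.0
```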
Third, they provide empirical validation on real‑world datasets (Adult and Credit Card Default). The weak oracle is simulated adversarially, returning erroneously small distances on a fraction of queries, and the proposed algorithm is compared against a uniform random sampling baseline. Despite using a comparable number of strong‑oracle queries, the coreset built by the new method yields significantly lower clustering cost and better adherence to the fairness constraints.
Technically, the paper adapts the classic ring‑sampling coreset construction to the weak‑strong oracle setting. Traditional ring sampling requires exact knowledge of point‑to‑center distances to partition points into concentric “rings” and then sample O(k/ε²) points per ring. Because the weak oracle can be adversarial, the authors cannot rely on it for accurate ring assignments, yet querying the strong oracle for every point would be prohibitively expensive. Their solution is to perform heavy‑hitter sampling: in each iteration they randomly sample O(k·poly(log n)/ε) points, query the strong oracle only for these, and identify “heavy” rings that contain enough sampled points. Light rings, which contain few points, are shown to contribute negligibly to the overall clustering cost and can be ignored. The heavy rings are then “peeled off” using a procedure from prior work, and the process recurses on the remaining points. This yields an “unconstrained” (1 + ε)‑coreset with only polylogarithmically many strong‑oracle queries.
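The heavy‑hitter step can be illustrated with a minimal sketch (hypothetical code, not the paper's implementation; the `strong_dist` callback, the geometric ring base of 2, and the fixed count threshold are all illustrative assumptions): sample a few points, spend strong‑oracle queries only on those, bucket them into geometric rings by exact distance, and flag rings with many sampled points as heavy.

```python
import math
import random
from collections import Counter

def heavy_rings(points, center, strong_dist, sample_size, threshold):
    """Hypothetical sketch of heavy-hitter ring detection.

    Rather than querying the strong oracle for every point, sample a few
    points, query exact distances only for those, bucket them into
    geometric rings [2^i, 2^(i+1)), and mark rings with at least
    `threshold` sampled points as 'heavy'. Light rings hold few points
    and (per the paper's analysis) contribute little to the total cost."""
    sample = random.sample(points, min(sample_size, len(points)))
    ring_counts = Counter()
    for p in sample:
        d = strong_dist(p, center)  # one strong-oracle query per sampled point
        ring = math.floor(math.log2(d)) if d > 0 else -math.inf
        ring_counts[ring] += 1
    return {r for r, c in ring_counts.items() if c >= threshold}

# Toy usage with Euclidean distance standing in for the strong oracle.
random.seed(0)
pts = [(random.uniform(1, 2), 0.0) for _ in range(100)]    # heavy ring: d in [1, 2)
pts += [(random.uniform(64, 100), 0.0) for _ in range(3)]  # light far ring
heavy = heavy_rings(pts, (0.0, 0.0), math.dist, sample_size=30, threshold=5)
print(heavy)  # only ring 0 can reach the threshold; the far ring has 3 points
```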
For the fair coreset, the authors incorporate assignment‑preserving constraints. They adjust sampling probabilities and reweight points so that each demographic group’s fraction within any cluster respects the prescribed lower and upper bounds (α_j, β_j). By leveraging lemmas from earlier fair‑clustering literature, they prove that the resulting coreset maintains the fairness guarantees of the original dataset. Consequently, any (1 + ε)‑approximation algorithm run on the coreset will produce a clustering that is both low‑cost and fair.
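The (α_j, β_j) constraints themselves are simple to state: in every cluster, the fraction of points belonging to group j must lie in [α_j, β_j]. A minimal checker (illustrative code under assumed data structures, not from the paper) makes this concrete:

```python
def is_fair(assignment, groups, alpha, beta):
    """Check the (alpha_j, beta_j) fairness constraints: within every
    cluster, the fraction of points from group j must lie in
    [alpha[j], beta[j]]. `assignment` maps point index -> cluster id;
    `groups` maps point index -> set of group labels (so Lambda bounds
    how many groups any single point carries)."""
    clusters = {}
    for i, c in assignment.items():
        clusters.setdefault(c, []).append(i)
    for members in clusters.values():
        size = len(members)
        for j in alpha:
            frac = sum(1 for i in members if j in groups[i]) / size
            if not (alpha[j] <= frac <= beta[j]):
                return False
    return True

# Toy example: two disjoint groups; each cluster must be 25-75% of each.
assignment = {0: "A", 1: "A", 2: "B", 3: "B"}
groups = {0: {"f"}, 1: {"m"}, 2: {"f"}, 3: {"m"}}
alpha = {"f": 0.25, "m": 0.25}
beta = {"f": 0.75, "m": 0.75}
print(is_fair(assignment, groups, alpha, beta))  # True: each cluster is 50/50
```

An assignment‑preserving coreset guarantees that a fair assignment on the coreset can be lifted back to a fair assignment on the full dataset, which is why the reweighting described above must respect these per‑cluster fractions.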
The paper acknowledges limitations: achieving a true (1 + ε)‑approximation for fair clustering is NP‑hard, so the final clustering step may still be exponential in the worst case. The authors suggest using existing approximation heuristics on the coreset as a practical workaround. Moreover, the query complexity for (k, z)‑clustering is slightly higher than for fair k‑median, and the coreset size scales linearly with Λ, leaving room for improvement when many overlapping groups exist.
In summary, this work introduces the first (1 + ε)‑approximate coreset constructions for fair k‑median clustering in the weak‑strong oracle model, dramatically reducing the number of expensive strong‑oracle queries while preserving both clustering quality and fairness. It also generalizes the approach to standard (k, z)‑clustering, offering a versatile toolkit for cost‑effective, fairness‑aware unsupervised learning in modern data‑intensive applications.