A Generic Framework for Fair Consensus Clustering in Streams


Consensus clustering seeks to combine multiple clusterings of the same dataset, potentially derived by different agents in a multi-agent environment considering various non-sensitive attributes, into a single partitioning that best reflects the overall structure of the underlying data. Recent work by Chakraborty et al. introduced a fair variant under proportional fairness and obtained a constant-factor approximation by naively selecting the closest fair clustering of the best input clustering; however, their offline approach requires storing all input clusterings, which is prohibitively expensive for most large-scale applications. In this paper, we initiate the study of fair consensus clustering in the streaming model, where input clusterings arrive sequentially and memory is limited. We design the first constant-factor approximation algorithm that processes the stream while storing only a logarithmic number of inputs. En route, we introduce a new generic algorithmic framework that integrates closest fair clustering with cluster fitting, yielding improved approximation guarantees not only in the streaming setting but also when revisited offline. Furthermore, the framework is fairness-agnostic: it applies to any fairness definition for which an approximately closest fair clustering can be computed efficiently. Finally, we extend our methods to the more general k-median consensus clustering problem.


💡 Research Summary

The paper tackles the problem of aggregating multiple clusterings of the same data set into a single, fair consensus clustering when the input clusterings arrive as a stream and memory is severely limited. Prior work on fair consensus clustering (Chakraborty et al., COLT’25) achieved a constant‑factor approximation by storing all input clusterings, computing a closest fair clustering for each, and selecting the best candidate. This offline approach is infeasible for modern large‑scale or federated scenarios where data arrives continuously and cannot be kept in full.

The authors introduce a novel two‑phase framework that works in the insertion‑only streaming model. The key assumption is the existence of a γ‑approximation algorithm for the “closest fair clustering” subproblem under a given fairness constraint P (e.g., proportional, statistical, multi‑color). This subroutine is treated as a black box, making the framework agnostic to the specific definition of fairness.
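The black-box assumption can be pictured as a simple interface. The sketch below is illustrative only: the type alias `Clustering`, the oracle signature `ClosestFairOracle`, and the toy `trivial_oracle` are hypothetical names, not the paper's; the point is that the framework only needs *some* callable with this shape, for *any* fairness notion P.

```python
from typing import Callable, List

# A clustering over n points, encoded as one cluster label per point.
Clustering = List[int]

# Hypothetical signature of the assumed black box: given any input
# clustering, return a *fair* clustering whose distance to the input is
# within a factor gamma of the closest fair clustering under P.
ClosestFairOracle = Callable[[Clustering], Clustering]

def trivial_oracle(c: Clustering) -> Clustering:
    """Toy stand-in (gamma = 1 in the degenerate case where every
    clustering already satisfies P). Returns a defensive copy."""
    return list(c)
```

Because the oracle is the only fairness-specific component, swapping in a subroutine for a different fairness definition changes nothing else in the pipeline.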

Phase 1 (candidate generation) samples O(log (m n)) input clusterings uniformly at random from the stream. For each sampled clustering, the black‑box closest‑fair algorithm is invoked, producing a set of candidate fair clusterings. To avoid an explosion of candidates, the authors apply a “cluster fitting” technique that merges redundant structure among the sampled clusterings, yielding a compact candidate pool.
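Phase 1 can be sketched with standard reservoir sampling over the stream, feeding each retained clustering to the black box. This is a minimal illustration, not the paper's exact procedure: the sample size `ceil(log(m*n))` and the function names are assumptions, and the cluster-fitting deduplication step is omitted.

```python
import math
import random

def sample_candidates(stream, m, n, closest_fair, rng):
    """Phase 1 sketch: reservoir-sample O(log(m*n)) clusterings from a
    stream of m clusterings over n points, then run the black-box
    closest-fair subroutine on each sampled clustering."""
    size = max(1, math.ceil(math.log(m * n)))
    reservoir = []
    for t, clustering in enumerate(stream):
        if len(reservoir) < size:
            reservoir.append(clustering)
        else:
            # Keep each of the t+1 clusterings seen so far with equal probability.
            j = rng.randrange(t + 1)
            if j < size:
                reservoir[j] = clustering
    return [closest_fair(c) for c in reservoir]
```

Reservoir sampling is a natural fit here because it produces a uniform sample in one pass without knowing the stream length in advance.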

Phase 2 (candidate selection) draws a second independent logarithmic‑size sample from the stream. Using this sample, the algorithm estimates the total median objective (sum of distances to all inputs) for each candidate and picks the one with the smallest estimated cost. The estimation error is bounded via Chernoff‑type concentration inequalities, guaranteeing that with high probability the selected candidate is within a constant factor of the optimum.
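Phase 2 reduces to estimating each candidate's average distance to the inputs from a small sample and taking the argmin. The sketch below uses a Hamming stand-in metric purely for illustration (the paper's metric counts disagreeing pairs); the function names are assumptions.

```python
def hamming(a, b):
    # Stand-in metric for illustration; the paper uses pairwise disagreements.
    return sum(x != y for x, y in zip(a, b))

def select_candidate(candidates, sample, distance):
    """Phase 2 sketch: score each candidate against a second, independent
    sample of input clusterings. The sample mean concentrates around the
    true average distance (Chernoff-type bounds), so the argmin is
    near-optimal with high probability."""
    def est_cost(cand):
        return sum(distance(cand, s) for s in sample) / len(sample)
    return min(candidates, key=est_cost)
```

Because the two samples are independent, the candidates generated in Phase 1 are not correlated with the inputs used to score them, which is what makes the concentration argument go through.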

For the 1‑median (single representative) fair consensus problem, the streaming algorithm achieves a (γ + 1.995)‑approximation while using O(n log (m n)) bits of space, which is essentially optimal because any output clustering already requires Ω(n log n) bits. For the more general k‑median variant (k fair representatives), the algorithm attains a (1.0151 γ + 1.99951)‑approximation using O(k² n polylog (m n)) space. When the underlying closest‑fair subroutine is exact (γ = 1), the overall factors become roughly 2.995 for 1‑median and about 3.0 for k‑median, improving upon the previous offline guarantees.
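The quoted constants for an exact subroutine follow directly by substituting γ = 1 into the stated factors; the snippet below just checks that arithmetic.

```python
gamma = 1.0  # exact closest-fair subroutine

one_median = gamma + 1.995             # 1-median streaming factor
k_median = 1.0151 * gamma + 1.99951    # k-median streaming factor

# With gamma = 1: one_median = 2.995, k_median ~= 3.0146
```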

The framework is modular: any fairness notion for which a close‑fair clustering algorithm exists can be plugged in, including proportional fairness with arbitrary numbers of colors, overlapping groups, or statistical parity constraints. This contrasts with earlier work that focused solely on two‑color proportional fairness.

A technical challenge stems from the distance metric used for clusterings, which counts disagreeing pairs of points. This pairwise metric does not admit standard coreset constructions, so the authors resort to sampling‑based distance estimation. They prove that a logarithmic number of sampled pairs suffices to approximate the total distance to within a constant factor, and that the additional error does not degrade the overall approximation guarantee.
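The pairwise metric and its sampling-based estimate can be made concrete as follows. This is a straightforward sketch: `disagreement_distance` is the exact pair-counting metric described above, and `sampled_distance` is an unbiased estimator from uniformly sampled pairs (the function names and sample size are illustrative choices, not the paper's).

```python
import itertools
import random

def disagreement_distance(a, b):
    """Exact metric: number of point pairs on which clusterings a and b
    disagree (co-clustered in one but separated in the other)."""
    assert len(a) == len(b)
    return sum(
        (a[i] == a[j]) != (b[i] == b[j])
        for i, j in itertools.combinations(range(len(a)), 2)
    )

def sampled_distance(a, b, num_pairs, rng):
    """Unbiased estimate of disagreement_distance from sampled pairs;
    scales the empirical disagreement rate by the total pair count."""
    n = len(a)
    total_pairs = n * (n - 1) // 2
    hits = 0
    for _ in range(num_pairs):
        i, j = rng.sample(range(n), 2)
        hits += (a[i] == a[j]) != (b[i] == b[j])
    return hits / num_pairs * total_pairs
```

The exact metric costs Θ(n²) per comparison, which is exactly why the algorithm relies on the sampled estimator instead.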

For the k‑median extension, each of the k representatives must individually satisfy the fairness constraint. The algorithm therefore maintains k separate candidate pools, each built from its own logarithmic sample, and employs monotone far‑away sampling to ensure diversity among the representatives. The resulting space usage grows quadratically in k but remains sublinear in the total number of input clusterings.
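The diversity step can be illustrated with greedy farthest-point selection. This is an assumption standing in for the paper's monotone far-away sampling, used here only to show the underlying idea of spreading representatives out.

```python
def far_away_sample(pool, k, distance):
    """Greedy farthest-point sketch: starting from the first candidate,
    repeatedly add the pool element whose minimum distance to the
    already-chosen representatives is largest."""
    chosen = [pool[0]]
    while len(chosen) < k:
        nxt = max(pool, key=lambda c: min(distance(c, ch) for ch in chosen))
        chosen.append(nxt)
    return chosen

def hamming(a, b):
    # Stand-in metric for illustration; the paper uses pairwise disagreements.
    return sum(x != y for x, y in zip(a, b))
```

Keeping the k representatives mutually far apart prevents the candidate pools from collapsing onto near-duplicate clusterings, which is what drives the quadratic-in-k space bound.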

Empirical evaluation (as described in the paper) demonstrates that the streaming algorithms match or exceed the quality of the offline baseline while using orders of magnitude less memory. The experiments cover synthetic and real‑world datasets, varying the number of colors, population ratios, and fairness definitions, confirming the theoretical robustness of the approach.

In summary, this work makes several significant contributions: (1) it initiates the study of fair consensus clustering in the streaming model; (2) it provides the first constant‑factor streaming algorithms for both 1‑median and k‑median fair consensus problems; (3) it offers a fairness‑agnostic, modular framework that can incorporate any efficiently computable closest‑fair clustering subroutine; and (4) it achieves near‑optimal space complexity, making it practical for large‑scale, real‑time applications such as federated learning, online community detection, and continuous bio‑informatics pipelines.

