Coresets for Clustering with Fairness Constraints


Authors: Lingxiao Huang, Shaofeng H.-C. Jiang, Nisheeth K. Vishnoi

December 18, 2019

Abstract

In a recent work, [19] studied the following "fair" variants of classical clustering problems such as $k$-means and $k$-median: given a set of $n$ data points in $\mathbb{R}^d$ and a binary type associated to each data point, the goal is to cluster the points while ensuring that the proportion of each type in each cluster is roughly the same as its underlying proportion. Subsequent work has focused either on extending this setting to the case where each data point has multiple, non-disjoint sensitive types such as race and gender [6], or on addressing the fact that the clustering algorithms in the above work do not scale well [39, 7, 5]. The main contribution of this paper is an approach to clustering with fairness constraints that involves multiple, non-disjoint types and is also scalable. Our approach is based on novel constructions of coresets: for the $k$-median objective we construct an $\varepsilon$-coreset of size $O(\Gamma k^2 \varepsilon^{-d})$, where $\Gamma$ is the number of distinct collections of groups that a point may belong to, and for the $k$-means objective we show how to construct an $\varepsilon$-coreset of size $O(\Gamma k^3 \varepsilon^{-d-1})$. The former result is the first known coreset construction for the fair clustering problem with the $k$-median objective, and the latter removes the dependence on the size of the full dataset present in [39] and generalizes it to multiple, non-disjoint types. Plugging our coresets into existing algorithms for fair clustering such as [5] results in the fastest algorithms for several cases.
Empirically, we assess our approach over the Adult, Bank, Diabetes and Athlete datasets, and show that the coreset sizes are much smaller than the full dataset; applying coresets indeed accelerates the running time of computing the fair clustering objective while ensuring that the resulting objective difference is small. We also achieve a speed-up over recent fair clustering algorithms [5, 6] by incorporating our coreset construction.

∗ EPFL, Switzerland. Email: huanglingxiao1990@126.com
† Weizmann Institute of Science, Israel. Email: shaofeng.jiang@weizmann.ac.il
‡ Yale University, USA. Email: nisheeth.vishnoi@yale.edu

Contents

1 Introduction
  1.1 Other related works
2 Problem definition
3 Technical overview
4 Coresets for fair $k$-median clustering
  4.1 The line case
  4.2 Proof of Theorem 4.3
  4.3 Extending to higher dimension
5 Coresets for fair $k$-means clustering
  5.1 The line case
  5.2 Extending to higher dimension
6 Empirical results
  6.1 Results
7 Conclusion and future work
A Other Empirical Results
  A.1 Results: with a binary type
  A.2 Results: with normalization

1 Introduction

Clustering algorithms are widely used in automated decision-making tasks, e.g., unsupervised learning [40], feature engineering [30, 25], and recommendation systems [9, 37, 20].
With the increasing application of clustering algorithms in human-centric contexts, there is a growing concern that, if left unchecked, they can lead to discriminatory outcomes for protected groups, e.g., females/black people. For instance, the proportion of a minority group assigned to some cluster can be far from its underlying proportion, even if the clustering algorithm does not take the sensitive attribute into account in its decision making [19]. Such an outcome may, in turn, lead to unfair treatment of minority groups; e.g., women may receive proportionally fewer high-salary job recommendations [21, 36] due to their underrepresentation in the cluster of high-salary recommendations. To address this issue, Chierichetti et al. [19] recently proposed the fair clustering problem, which requires the clustering assignment to be balanced with respect to a binary sensitive type, e.g., sex.¹ Given a set $X$ of $n$ data points in $\mathbb{R}^d$ and a binary type associated to each data point, the goal is to cluster the points such that the proportion of each type in each cluster is roughly the same as its underlying proportion, while minimizing the clustering objective. Subsequent work has focused either on extending this setting to the case where each data point has multiple, non-disjoint sensitive types [6] (Definition 2.3), or on addressing the fact that the clustering algorithms do not scale well [19, 38, 39, 7, 5]. Due to the large scale of datasets, several existing fair clustering algorithms have to work with samples instead of the full dataset, since their running time is at least quadratic in the input size [19, 38, 7, 6]. Very recently, Backurs et al. [5] proposed a nearly linear-time approximation algorithm for fair $k$-median, but it only works for a binary type. It is still unknown whether there exists a scalable approximation algorithm for multiple sensitive types [5].
To improve the running time of fair clustering algorithms, a powerful technique called a coreset can be used. Roughly, a coreset for fair clustering is a small weighted point set such that, for any $k$-subset and any fairness constraint, the fair clustering objective computed over the coreset is approximately the same as that computed from the full dataset (Definition 2.1). Thus, a coreset can be used as a proxy for the full dataset: one can apply any fair clustering algorithm to the coreset, obtain a good approximate solution on the full dataset, and hope to speed up the algorithm. As mentioned in [5], using coresets can indeed accelerate computation and save storage space for fair clustering problems. Another benefit is that one may want to compare clustering performance under different fairness constraints, in which case it is more efficient to reuse a coreset repeatedly. Currently, the only known coreset result for fair clustering is by Schmidt et al. [39], who constructed an $\varepsilon$-coreset for fair $k$-means clustering. However, their coreset size includes a $\log n$ factor and is restricted to a single sensitive type. Moreover, there is no known coreset construction for other commonly used clustering objectives, e.g., fair $k$-median.

Our contributions. The main contribution of this paper is the efficient construction of coresets for clustering with fairness constraints that involve multiple, non-disjoint types. Technically, we show an efficient construction of $\varepsilon$-coresets of size independent of $n$ for both fair $k$-median and fair $k$-means, summarized in Table 1. Let $\Gamma$ denote the number of distinct collections of groups that a point may belong to (see the first paragraph of Section 4 for the formal definition).

¹ A type consists of several disjoint groups, e.g., the sex type consists of females and males.
• Our coreset for fair $k$-median has size $O(\Gamma k^2 \varepsilon^{-d})$ (Theorem 4.1); to the best of our knowledge, it is the first known coreset for this problem.

• For fair $k$-means, our coreset has size $O(\Gamma k^3 \varepsilon^{-d-1})$ (Theorem 5.1), which improves the result of [39] by a $\Theta(\frac{\log n}{\varepsilon k^2})$ factor and generalizes it to multiple, non-disjoint types.

• As mentioned in [5], applying coresets can accelerate the running time of fair clustering algorithms while suffering only an additional $(1 + \varepsilon)$ factor in the approximation ratio. Setting $\varepsilon = \Omega(1)$ and plugging our coresets into existing algorithms [39, 6, 5], we directly obtain scalable fair clustering algorithms, summarized in Table 2.

We present novel technical ideas to deal with fairness constraints for coresets.

• Our first technical contribution is a reduction to the case $\Gamma = 1$ (Theorem 4.2), which greatly simplifies the problem. The reduction works not only for our specific construction but for all coreset constructions in general.

• Furthermore, to deal with the $\Gamma = 1$ case, we provide several interesting geometric observations about the optimal fair $k$-median/means clustering (Lemma 4.1), which may be of independent interest.

We implement our algorithm and conduct experiments on the Adult, Bank, Diabetes and Athlete datasets.

• A vanilla implementation results in a coreset whose size depends on $\varepsilon^{-d}$. Our implementation is inspired by our theoretical results and produces coresets whose size is much smaller in practice. This improved implementation is still within the framework of our analysis, and the same worst-case theoretical bound still holds.

• To validate the performance of our implementation, we experiment with varying $\varepsilon$ for both fair $k$-median and $k$-means. As expected, the empirical error is well under the theoretical guarantee $\varepsilon$, and the size does not suffer from the $\varepsilon^{-d}$ factor.
Specifically, for fair $k$-median we achieve 5% empirical error using only 3% of the points of the original datasets, and we achieve similar error using 20% of the points of the original dataset in the $k$-means case. In addition, our coreset for fair $k$-means achieves lower empirical error than both uniform sampling and the coreset of [39].

• The small size of the coreset translates to a more than 200x speed-up (with error ~10%) in the running time of computing the fair clustering objective when the fairness constraint $F$ is given. We also apply our coreset to the recent fair clustering algorithms [5, 6], and drastically improve their running times, by approximately 5-15 times for [5] and 15-30 times for [6], on all above-mentioned datasets plus a large dataset, Census1990, that consists of 2.5 million records, even taking the coreset construction time into account.

1.1 Other related works

There is a growing body of work on fair clustering algorithms. Chierichetti et al. [19] introduced the fair clustering problem for a binary type and obtained approximation algorithms for fair $k$-median/center. Backurs et al. [5] improved the running time to nearly linear for fair $k$-median, but the approximation ratio is $\tilde{O}(d \log n)$. Rösner and Schmidt [38] designed a 14-approximation algorithm for fair $k$-center, and the ratio was improved to 5 by [7].

Table 1: Summary of coreset results. $T_1(n)$ and $T_2(n)$ denote the running time of an $O(1)$-approximate algorithm for $k$-median and $k$-means, respectively.

| method | $k$-Median size | $k$-Median construction time | $k$-Means size | $k$-Means construction time |
|--------|-----------------|------------------------------|----------------|-----------------------------|
| [39]   |                 |                              | $O(\Gamma k \varepsilon^{-d-2} \log n)$ | $\tilde{O}(k \varepsilon^{-d-2} n \log n + T_2(n))$ |
| This   | $O(\Gamma k^2 \varepsilon^{-d})$ | $O(k \varepsilon^{-d+1} n + T_1(n))$ | $O(\Gamma k^3 \varepsilon^{-d-1})$ | $O(k \varepsilon^{-d+1} n + T_2(n))$ |

Table 2: Summary of fair clustering algorithms. $\Delta$ denotes the maximum number of groups that a point may belong to, and "multi" means the algorithm can handle multiple non-disjoint types.
| method | multi | $k$-Median approx. ratio | $k$-Median time | $k$-Means approx. ratio | $k$-Means time |
|--------|-------|--------------------------|-----------------|-------------------------|----------------|
| [19]   |       | $O(1)$                   | $\Omega(n^2)$   |                         |                |
| [39]   |       |                          |                 | $O(1)$                  | $n^{O(k)}$     |
| [5]    |       | $\tilde{O}(d \log n)$    | $O(dn \log n + T_1(n))$ |                 |                |
| [7]    |       | $(3.488, 1)$             | $\Omega(n^2)$   | $(4.675, 1)$            | $\Omega(n^2)$  |
| [6]    | ✓     | $(O(1), 4\Delta + 4)$    | $\Omega(n^2)$   | $(O(1), 4\Delta + 4)$   | $\Omega(n^2)$  |
| This   |       | $\tilde{O}(d \log n)$    | $O(dlk^2 \log(lk) + T_1(lk^2))$ | $O(1)$  | $(lk)^{O(k)}$  |
| This   | ✓     | $(O(1), 4\Delta + 4)$    | $\Omega(l^{2\Delta} k^4)$ | $(O(1), 4\Delta + 4)$ | $\Omega(l^{2\Delta} k^6)$ |

For fair $k$-means, Schmidt et al. [39] introduced the notion of fair coresets and presented an efficient streaming algorithm. More generally, Bercea et al. [7] proposed a bi-criteria approximation for fair $k$-median/means/center/supplier/facility location. Very recently, Bera et al. [6] presented a bi-criteria approximation algorithm for the fair $(k,z)$-clustering problem (Definition 2.3) with arbitrary group structures (potentially overlapping), and Anagnostopoulos et al. [4] improved their results by proposing the first constant-factor approximation algorithm. It remains open to design a near-linear-time $O(1)$-approximation algorithm for the fair $(k,z)$-clustering problem.

There are other fair variants of clustering problems. Ahmadian et al. [3] studied a variant of the fair $k$-center problem in which the number of points of each type in each cluster has an upper bound, and proposed a bi-criteria approximation algorithm. Chen et al. [18] studied the fair clustering problem in which any $n/k$ points are entitled to form their own cluster if there is another center closer in distance for all of them. Kleindessner et al. [32] investigated the fair $k$-center problem in which each center has a type, and the selection of the $k$-subset is restricted to include a fixed number of centers belonging to each type. In another paper [33], they developed fair variants of spectral clustering (a heuristic $k$-means clustering framework) by incorporating the proportional fairness constraints proposed by [19].
The notion of a coreset was first proposed by Agarwal et al. [1]. There is a large body of work on coresets for unconstrained clustering problems in Euclidean spaces [2, 26, 17, 27, 34, 22, 23, 8]. Apart from these, for the general $(k,z)$-clustering problem, Feldman and Langberg [22] presented an $\varepsilon$-coreset of size $\tilde{O}(dk\varepsilon^{-2z})$ computable in $\tilde{O}(nk)$ time. Huang et al. [28] showed an $\varepsilon$-coreset of size $\tilde{O}(\mathrm{ddim}(X) \cdot k^3 \varepsilon^{-2z})$, where $\mathrm{ddim}(X)$ is the doubling dimension, which measures the intrinsic dimensionality of a space. For the special case of $k$-means, Braverman et al. [8] improved the size to $\tilde{O}(k\varepsilon^{-2} \cdot \min\{k/\varepsilon, d\})$ by a dimension-reduction approach. Works such as [22] use an importance-sampling technique that avoids the $\varepsilon^{-d}$ size factor, but it is unknown whether such approaches can be used in fair clustering.

2 Problem definition

Consider a set $X \subseteq \mathbb{R}^d$ of $n$ data points, an integer $k$ (the number of clusters), and $l$ groups $P_1, \ldots, P_l \subseteq X$. An assignment constraint, proposed by Schmidt et al. [39], is a $k \times l$ integer matrix $F$. A clustering $\mathcal{C} = \{C_1, \ldots, C_k\}$, which is a $k$-partitioning of $X$, is said to satisfy the assignment constraint $F$ if
$$|C_i \cap P_j| = F_{ij}, \quad \forall i \in [k],\ j \in [l].$$
For a $k$-subset $C = \{c_1, \ldots, c_k\} \subseteq \mathbb{R}^d$ (the center set) and $z \in \mathbb{R}_{>0}$, we define $K_z(X, F, C)$ as the minimum value of $\sum_{i \in [k]} \sum_{x \in C_i} d^z(x, c_i)$ over all clusterings $\mathcal{C} = \{C_1, \ldots, C_k\}$ that satisfy $F$, which we call the optimal fair $(k,z)$-clustering value. If no clustering satisfies $F$, $K_z(X, F, C)$ is set to infinity. The following is our notion of coresets for fair $(k,z)$-clustering. It generalizes the notion introduced in [39], which only considers a partitioned group structure.

Definition 2.1 (Coreset for fair clustering). Given a set $X \subseteq \mathbb{R}^d$ of $n$ points and $l$ groups $P_1, \ldots$
$\ldots, P_l \subseteq X$, a weighted point set $S \subseteq \mathbb{R}^d$ with weight function $w: S \to \mathbb{R}_{>0}$ is an $\varepsilon$-coreset for the fair $(k,z)$-clustering problem if, for each $k$-subset $C \subseteq \mathbb{R}^d$ and each assignment constraint $F \in \mathbb{Z}^{k \times l}_{\geq 0}$, it holds that
$$K_z(S, F, C) \in (1 \pm \varepsilon) \cdot K_z(X, F, C).$$

Since points in $S$ might receive fractional weights, we modify the definition of $K_z$ slightly, so that in evaluating $K_z(S, F, C)$ a point $x \in S$ may be partially assigned to more than one cluster, with the total amount of assignment of $x$ equal to $w(x)$.

The currently most general notion of fairness in clustering was proposed by [6]; it enforces both upper and lower bounds on any group's proportion in a cluster.

Definition 2.2 ($(\alpha, \beta)$-proportionally-fair). A clustering $\mathcal{C} = (C_1, \ldots, C_k)$ is $(\alpha, \beta)$-proportionally-fair ($\alpha, \beta \in [0,1]^l$) if, for each cluster $C_i$ and each $j \in [l]$, it holds that
$$\alpha_j \leq \frac{|C_i \cap P_j|}{|C_i|} \leq \beta_j.$$

The above definition directly implies that for each cluster $C_i$ and any two groups $P_{j_1}, P_{j_2}$ ($j_1, j_2 \in [l]$),
$$\frac{\alpha_{j_1}}{\beta_{j_2}} \leq \frac{|C_i \cap P_{j_1}|}{|C_i \cap P_{j_2}|} \leq \frac{\beta_{j_1}}{\alpha_{j_2}}.$$
In other words, the ratio of points belonging to groups $P_{j_1}$ and $P_{j_2}$ in each cluster is bounded from both sides. Indeed, similar fairness constraints have been investigated in work on other fundamental algorithmic problems such as data summarization [13], ranking [15, 41], elections [11], personalization [16, 12], classification [10], and online advertising [14]. Naturally, Bera et al. [6] also defined the fair clustering problem with respect to $(\alpha, \beta)$-proportional fairness as follows.

Definition 2.3 ($(\alpha, \beta)$-proportionally-fair $(k,z)$-clustering). Given a set $X \subseteq \mathbb{R}^d$ of $n$ points, $l$ groups $P_1, \ldots, P_l \subseteq X$, and two vectors $\alpha, \beta \in [0,1]^l$, the objective of $(\alpha, \beta)$-proportionally-fair $(k,z)$-clustering is to find a $k$-subset $C = \{c_1, \ldots$
$\ldots, c_k\} \subseteq \mathbb{R}^d$ and an $(\alpha, \beta)$-proportionally-fair clustering $\mathcal{C} = \{C_1, \ldots, C_k\}$ such that the objective function $\sum_{i \in [k]} \sum_{x \in C_i} d^z(x, c_i)$ is minimized.

Our notion of coresets is very general, and we relate it to the $(\alpha, \beta)$-proportionally-fair clustering problem via the following observation, which is similar to Proposition 5 in [39].

Proposition 2.1. Given a $k$-subset $C$, the assignment restriction required by $(\alpha, \beta)$-proportional fairness can be modeled as a collection of assignment constraints.

As a result, if a weighted set $S$ is an $\varepsilon$-coreset satisfying Definition 2.1, then for any $\alpha, \beta \in [0,1]^l$, the $(\alpha, \beta)$-proportionally-fair $(k,z)$-clustering value computed from $S$ must be a $(1 \pm \varepsilon)$-approximation of that computed from $X$.

Remark 2.1. Definition 2.2 enforces fairness by looking at the proportion of a group in each cluster. We can also consider another type of constraint, over the number of group points in each cluster, defined as follows.

Definition 2.4 ($(\alpha, \beta)$-fair). We call a clustering $\mathcal{C} = \{C_1, \ldots, C_k\}$ $(\alpha, \beta)$-fair ($\alpha, \beta \in \mathbb{Z}^{k \times l}_{\geq 0}$) if, for each cluster $C_i$ and each $j \in [l]$, we have $\alpha_{ij} \leq |C_i \cap P_j| \leq \beta_{ij}$.

For instance, the above definition can be applied if one only cares about diversity and requires that each cluster contain at least one element from each group, i.e., $|C_i \cap P_j| \geq 1$ for all $i, j$. We can define the $(\alpha, \beta)$-fair $(k,z)$-clustering problem with respect to the above definition analogously to Definition 2.3, and Proposition 2.1 still holds in this case. Hence, an $\varepsilon$-coreset for fair $(k,z)$-clustering also preserves the clustering objective of the $(\alpha, \beta)$-fair $(k,z)$-clustering problem.

3 Technical overview

We introduce novel techniques to tackle the assignment constraints.
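To keep the assignment-constrained objective $K_z(X, F, C)$ from Section 2 concrete, the following minimal brute-force sketch evaluates it by exhaustive enumeration. This is our own illustration, not an algorithm from the paper: the name `kz_value` is hypothetical, points are one-dimensional for simplicity, and the enumeration is exponential in $n$, so it is only for building intuition on tiny instances.

```python
from itertools import product

def kz_value(X, groups, F, C, z=1):
    """Brute-force K_z(X, F, C): the minimum clustering cost over all
    assignments of the points in X to the k centers in C that satisfy
    the assignment constraint F, where F[i][j] is the required size of
    |C_i ∩ P_j|.  Points are 1-D numbers; groups are sets of point
    indices.  Exponential in |X|: for intuition on tiny instances only."""
    n, k, l = len(X), len(C), len(groups)
    best = float("inf")
    for assign in product(range(k), repeat=n):  # cluster index per point
        # check |C_i ∩ P_j| == F[i][j] for every cluster i and group j
        feasible = all(
            sum(1 for p in range(n) if assign[p] == i and p in groups[j]) == F[i][j]
            for i in range(k) for j in range(l)
        )
        if feasible:
            cost = sum(abs(X[p] - C[assign[p]]) ** z for p in range(n))
            best = min(best, cost)
    return best  # infinity if no assignment satisfies F, as in the text
```

For example, with four points in two balanced groups and the balanced constraint $F = [[1,1],[1,1]]$, the optimum assigns one point of each group to each center; an infeasible $F$ yields infinity, matching the convention above.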
Recall that $\Gamma$ denotes the number of distinct collections of groups that a point may belong to. Our first technical contribution is a general reduction to the $\Gamma = 1$ case that works for any coreset construction algorithm (Theorem 4.2). The idea is to divide $X$ into $\Gamma$ parts according to the collection of groups that a point belongs to, and to construct a fair coreset with parameter $\Gamma = 1$ for each part. The observation is that the union of these coresets is a coreset for the original instance, whose size is larger by a factor of $\Gamma$.

Our coreset construction for the case $\Gamma = 1$ is based on the framework of [27], which provides unconstrained $k$-median/means coresets. We first briefly introduce the framework of [27] and then highlight the main technical difficulty of our work. The main observation of [27] is that it suffices to deal with $X$ that lies on a line. Specifically, they show that it suffices to construct at most $O(k\varepsilon^{-d+1})$ lines, project $X$ onto the closest of these lines, and construct an $\varepsilon/3$-coreset for each line. The coreset for each line is then constructed by partitioning the line into $\mathrm{poly}(k/\varepsilon)$ contiguous sub-intervals and designating at most two points to represent each sub-interval; these representative points are included in the coreset. Their analysis crucially uses the property that the clustering induced by any given centers partitions $X$ into $k$ contiguous parts on the line, since each point must be assigned to its nearest center. However, this property may fail in fair clustering, which is the main difficulty. Nonetheless, we manage to prove a new structural lemma: the optimal fair $k$-median/means clustering partitions $X$ into $O(k)$ contiguous intervals.
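The partition step of the Theorem 4.2 reduction described earlier in this overview can be sketched as follows. This is our own illustration under hypothetical naming: `coreset_single` stands in for any $\Gamma = 1$ coreset construction (it is not specified here) and maps a list of points to (point, weight) pairs.

```python
from collections import defaultdict

def coreset_via_reduction(X, groups, coreset_single):
    """Sketch of the reduction to Gamma = 1: split X by the signature
    P_x (the collection of groups each point belongs to), run a
    Gamma = 1 coreset construction on each part, and return the union.
    `coreset_single` is a placeholder for any such construction."""
    parts = defaultdict(list)
    for idx, x in enumerate(X):
        signature = frozenset(j for j, P in enumerate(groups) if idx in P)
        parts[signature].append(x)
    coreset = []
    for signature, part in parts.items():
        # every coreset point inherits the group signature of its part
        for point, weight in coreset_single(part):
            coreset.append((point, weight, signature))
    return coreset
```

The union has size at most $\Gamma$ times the size guarantee of the single-collection construction, mirroring the $\Gamma t$ bound of Theorem 4.2.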
For fair $k$-median, the key geometric observation is that there always exists a center whose corresponding optimal fair $k$-median cluster forms a contiguous interval (Claim 4.1); combined with an induction, this implies that the optimal fair clustering partitions $X$ into at most $2k - 1$ intervals. For fair $k$-means, we show that each optimal fair cluster actually forms a single contiguous interval. Thanks to these new structural properties, plugging a slightly different set of parameters into [27] yields fair coresets.

4 Coresets for fair $k$-median clustering

In this section, we construct coresets for fair $k$-median ($z = 1$). For each $x \in X$, denote by $P_x = \{i \in [l] : x \in P_i\}$ the collection of groups that $x$ belongs to. Let $\Gamma$ denote the number of distinct $P_x$'s. Let $T_z(n)$ denote the running time of a constant-factor approximation algorithm for the $(k,z)$-clustering problem. The main theorem is as follows.

Theorem 4.1 (Coreset for fair $k$-median). There exists an algorithm that constructs an $\varepsilon$-coreset for the fair $k$-median problem of size $O(\Gamma k^2 \varepsilon^{-d})$ in $O(k\varepsilon^{-d+1} n + T_1(n))$ time.

Note that $\Gamma$ is usually small. For instance, if there is only one sensitive attribute [39], then each $P_x$ is a singleton and $\Gamma = l$. More generally, if $\Lambda$ denotes the maximum number of groups that any point belongs to, then $\Gamma \leq l^{\Lambda}$, and typically each point has only $O(1)$ sensitive attributes.

The main technical difficulty in the coreset construction is dealing with the assignment constraints. We make an important observation (Theorem 4.2) that one only needs to prove Theorem 4.1 for the case $l = 1$, and we thus focus on that case. This theorem is a generalization of Theorem 7 in [39], and the coreset of [39] in fact extends to arbitrary group structures thanks to our theorem.

Theorem 4.2 (Reduction from $l$ groups to a single group).
Suppose there exists an algorithm that computes an $\varepsilon$-coreset of size $t$ for the fair $(k,z)$-clustering problem on any instance $\hat{X}$ with $l = 1$, in time $T(|\hat{X}|, \varepsilon, k, z)$. Then there exists an algorithm that, given a set $X$ that can be partitioned into $\Gamma$ distinct subsets $X^{(1)}, \ldots, X^{(\Gamma)}$ such that all points $x \in X^{(i)}$ share the same collection $P_x$ for each $i \in [\Gamma]$, computes an $\varepsilon$-coreset for the fair $(k,z)$-clustering problem of size $\Gamma t$, in time $\sum_{i \in [\Gamma]} T(|X^{(i)}|, \varepsilon, k, z)$.

Proof. First consider the case $\Gamma = 1$, in which all $P_x$'s are the same. This case degenerates to $l = 1$ and hence has an $\varepsilon$-coreset of size $t$ by assumption. For each $i \in [\Gamma]$, let $S^{(i)}$ be an $\varepsilon$-coreset for the fair $(k,z)$-clustering problem on $X^{(i)}$, where each point in $S^{(i)}$ belongs to all groups in $P_i$. Let $S := \bigcup_{i \in [\Gamma]} S^{(i)}$. To establish both the correctness and the running time, it suffices to prove that $S$ is an $\varepsilon$-coreset for the fair $(k,z)$-clustering problem on $X$.

Given a $k$-subset $C \subseteq \mathbb{R}^d$ and an assignment constraint $F$, let $C^\star_1, \ldots, C^\star_k$ be the optimal fair clustering of the instance $(X, F, C)$. For each part $X^{(i)}$ ($i \in [\Gamma]$), we construct an assignment constraint $F^{(i)} \in \mathbb{Z}^{k \times l}$ as follows: for each $j_1 \in [k]$ and $j_2 \in [l]$, let $F^{(i)}_{j_1, j_2} = 0$ if $j_2 \notin P_i$, and $F^{(i)}_{j_1, j_2} = |C^\star_{j_1} \cap X^{(i)}|$ if $j_2 \in P_i$; that is, $F^{(i)}_{j_1, j_2}$ is the number of points within $X^{(i)}$ that belong to $C^\star_{j_1} \cap P_{j_2}$. By definition, for each $j_1 \in [k]$ and $j_2 \in [l]$,
$$F_{j_1, j_2} = \sum_{i \in [\Gamma]} F^{(i)}_{j_1, j_2}. \quad (1)$$
Then
$$K_z(X, F, C) = \sum_{i \in [\Gamma]} K_z(X^{(i)}, F^{(i)}, C) \qquad (\text{definitions of } K_z \text{ and } F^{(i)})$$
$$\geq (1 - \varepsilon) \cdot \sum_{i \in [\Gamma]} K_z(S^{(i)}, F^{(i)}, C) \qquad (\text{definition of } S^{(i)})$$
$$\geq (1 - \varepsilon) \cdot K_z(S, F, C) \qquad (\text{optimality and Eq. } (1)).$$
Similarly, we can prove that $K_z(S, F, C) \geq (1 - \varepsilon) \cdot K_z(X, F, C)$.
This completes the proof.

Our coreset constructions for both fair $k$-median and fair $k$-means are similar to that of [27], except that we use a different set of parameters. At a high level, the algorithm reduces general instances to instances where the data lie on a line, so it only remains to give a coreset for the line case.

Remark 4.1. Theorem 4.2 can be applied to construct an $\varepsilon$-coreset of size $O(\Gamma k \varepsilon^{-d+1})$ for the fair $k$-center clustering problem, since Har-Peled's coreset result [26] directly provides an $\varepsilon$-coreset of size $O(k\varepsilon^{-d+1})$ for the case $l = 1$.

4.1 The line case

Since $l = 1$, we describe $F$ as an integer vector in $\mathbb{Z}^k_{\geq 0}$. For a weighted point set $S$ with weight function $w: S \to \mathbb{R}_{\geq 0}$, we define the mean of $S$ by $\bar{S} := \frac{1}{|S|} \sum_{p \in S} w(p) \cdot p$ and the error of $S$ by $\Delta(S) := \sum_{p \in S} w(p) \cdot d(p, \bar{S})$. Denote by $\mathrm{OPT}$ the optimal value of the unconstrained $k$-median clustering. Our construction is similar to [27] and is summarized in Algorithm 1; an illustration may be found in Figure 1.

Algorithm 1: FairMedian-1D($X$, $k$)
Input: $X = \{x_1, \ldots, x_n\} \subset \mathbb{R}^d$ lying on the real line with $x_1 \leq \ldots \leq x_n$, an integer $k \in [n]$, and a number $\mathrm{OPT}$ equal to the optimal value of $k$-median clustering.
Output: an $\varepsilon$-coreset $S$ of $X$ together with weights $w: S \to \mathbb{R}_{\geq 0}$.
1. Set a threshold $\xi = \frac{\varepsilon \cdot \mathrm{OPT}}{30k}$.
2. Scan the points from $x_1$ to $x_n$ and group them into batches greedily: each batch $B$ is a maximal point set satisfying $\Delta(B) \leq \xi$.
3. Denote by $\mathcal{B}(X)$ the collection of all batches. Let $S \leftarrow \bigcup_{B \in \mathcal{B}(X)} \bar{B}$.
4. For each point $x = \bar{B} \in S$, set $w(x) \leftarrow |B|$.
5. Return $(S, w)$.

Analysis. We now prove the following theorem, which establishes the correctness of our coreset in the line case.

Theorem 4.3 (Coreset for fair $k$-median when $X$ lies on a line). Algorithm 1 computes an $\varepsilon/3$-coreset $S$ for fair $k$-median clustering of $X$ in time $O(|X|)$.
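As a concrete illustration, here is a minimal sketch of the batching step of Algorithm 1 in one dimension. This is our own rendering under assumed naming: for clarity it recomputes $\Delta$ from scratch at every step, whereas an incremental update of the batch mean and error yields the $O(|X|)$ bound claimed in Theorem 4.3.

```python
def fair_median_1d(xs, k, opt, eps):
    """Sketch of the batching in Algorithm 1 (FairMedian-1D): greedily
    grow maximal batches B of consecutive sorted points with
    Delta(B) <= xi, where Delta(B) = sum_{p in B} |p - mean(B)|, and
    represent each batch by its mean with weight |B|.  `opt` is the
    unconstrained k-median optimum (an O(1)-approximation suffices up
    to constants)."""
    xi = eps * opt / (30 * k)
    xs = sorted(xs)

    def delta(batch):
        m = sum(batch) / len(batch)
        return sum(abs(p - m) for p in batch)

    coreset, batch = [], []
    for x in xs:
        if batch and delta(batch + [x]) > xi:
            # adding x would violate Delta(B) <= xi: close the batch
            coreset.append((sum(batch) / len(batch), len(batch)))
            batch = []
        batch.append(x)
    if batch:
        coreset.append((sum(batch) / len(batch), len(batch)))
    return coreset  # list of (coreset point, weight) pairs
```

With a small threshold, two well-separated clumps of points collapse to two weighted means; with a huge threshold, the whole input collapses to a single weighted mean, illustrating how $\xi$ controls the trade-off between size and error.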
[Figure 1: an illustration of Algorithm 1 that divides $X$ into 9 batches; each batch $B_i$ satisfies $\Delta(B_i) \leq \xi$ and its mean $\bar{B_i}$ receives weight $w(\bar{B_i}) = |B_i|$.]

The running time bound is straightforward: for each batch $B \in \mathcal{B}(X)$, it only costs $O(|B|)$ time to compute $\bar{B}$, so Algorithm 1 runs in $O(|X|)$ time. In the following, we focus on correctness. In [27], it was shown that $S$ is an $\varepsilon/3$-coreset for the unconstrained $k$-median clustering problem. Their analysis crucially uses the fact that the optimal clustering partitions $X$ into $k$ contiguous intervals. Unfortunately, this nice "contiguous" property does not hold in our case because of the assignment constraint $F \in \mathbb{Z}^k_{\geq 0}$. To resolve this issue, we prove a new structural property (Lemma 4.1): the optimal fair $k$-median clustering partitions $X$ into only $O(k)$ contiguous intervals.

Lemma 4.1 (Fair $k$-median clustering consists of $2k - 1$ contiguous intervals). Suppose $X := \{x_1, \ldots, x_n\} \subset \mathbb{R}^d$ lies on the real line with $x_1 \leq \ldots \leq x_n$. For any $k$-subset $C = (c_1, \ldots, c_k) \subseteq \mathbb{R}^d$ and any assignment constraint $F \in \mathbb{Z}^k_{\geq 0}$, there exists an optimal fair $k$-median clustering that partitions $X$ into at most $2k - 1$ contiguous intervals.

Proof. We prove the lemma by induction on $k$. The induction hypothesis is that, for any $k \geq 1$, Lemma 4.1 holds for any data set $X$, any $k$-subset $C \subseteq \mathbb{R}^d$, and any assignment constraint $F \in \mathbb{Z}^k_{\geq 0}$. The base case $k = 1$ holds trivially, since all points in $X$ must be assigned to $c_1$. Assume the lemma holds for $k - 1$ ($k \geq 2$); we prove the inductive step for $k$. Let $C^\star_1, \ldots, C^\star_k$ be the optimal fair $k$-median clustering with respect to $C$ and $F$, where $C^\star_i \subseteq X$ is the subset assigned to center $c_i$. We use the structural property stated in Claim 4.1, whose proof is given later.

Claim 4.1. There exists $i \in [k]$ such that $C^\star_i$
consists of exactly one contiguous interval.

We continue the proof of the inductive step: let $i_0 \in [k]$ be an index given by Claim 4.1, and construct a reduced instance $(X', F', C')$ where a) $C' := C \setminus \{c_{i_0}\}$; b) $X' := X \setminus C^\star_{i_0}$; c) $F'$ is formed by removing the $i_0$-th coordinate of $F$. Applying the hypothesis to $(X', F', C')$, we know the optimal fair $(k-1)$-median clustering consists of at most $2k - 3$ contiguous intervals. Adding back $C^\star_{i_0}$, which is exactly one contiguous interval, increases the number of intervals by at most 2. Thus, the optimal fair $k$-median clustering for $(X, F, C)$ has at most $2k - 1$ contiguous intervals. This finishes the inductive step.

Finally, we complete the proof of Claim 4.1. We first prove the following fact as preparation.

Fact 4.1. Suppose $p, q \in \mathbb{R}^d$. Define $f: \mathbb{R} \to \mathbb{R}$ by $f(x) := d(x, p) - d(x, q)$ (here we abuse notation by treating $x$ as a point on the $x$-axis of $\mathbb{R}^d$). Then $f$ is either ID or DI.²

² ID means that the function $f$ first (non-strictly) increases and then (non-strictly) decreases; DI means the other way round.

Proof. Let $h_p$ and $h_q$ be the distances from $p$ and $q$ to the $x$-axis, respectively, and let $u_p$ and $u_q$ be the corresponding $x$-coordinates of $p$ and $q$. We have
$$f(x) = \sqrt{(x - u_p)^2 + h_p^2} - \sqrt{(x - u_q)^2 + h_q^2}.$$
Thus we can regard $p, q$ as two points in $\mathbb{R}^2$ by letting $p = (u_p, h_p)$ and $q = (u_q, h_q)$. We also have
$$f'(x) = \frac{x - u_p}{\sqrt{(x - u_p)^2 + h_p^2}} - \frac{x - u_q}{\sqrt{(x - u_q)^2 + h_q^2}} = \frac{x - u_p}{d(x, p)} - \frac{x - u_q}{d(x, q)}.$$
W.l.o.g. assume that $u_p \leq u_q$. Next, we rewrite $f'(x)$ in terms of $\cos(\angle p x u_p)$ and $\cos(\angle q x u_q)$.

1. If $x \leq u_p$, then $f'(x) = \frac{d(x, u_q)}{d(x, q)} - \frac{d(x, u_p)}{d(x, p)} = \cos(\angle q x u_q) - \cos(\angle p x u_p)$.
2. If $u_p < x \leq u_q$, then $f'(x) = \frac{d(x, u_p)}{d(x, p)} + \frac{d(x, u_q)}{d(x, q)} = \cos(\angle p x u_p) + \cos(\angle q x u_q)$.
3. If $x > u_q$,
then $f'(x) = \frac{d(x, u_p)}{d(x, p)} - \frac{d(x, u_q)}{d(x, q)} = \cos(\angle p x u_p) - \cos(\angle q x u_q)$.

Denote by $y$ the point where line $pq$ intersects the $x$-axis; specifically, if $h_p = h_q$, we set $y = -\infty$. Note that $f'(x) = 0$ if and only if $x = y$. Now we analyze $f'(x)$ in two cases, according to whether $h_p \leq h_q$.

• Case i): $h_p \leq h_q$, which implies $y < u_p$. As $x$ goes from $-\infty$ to $u_p$, first $f'(x) \leq 0$ and then $f'(x) \geq 0$. When $x > u_p$, $f'(x) \geq 0$.
• Case ii): $h_p > h_q$, which implies $y > u_q$. When $x \leq u_q$, $f'(x) \geq 0$. As $x$ goes from $u_q$ to $+\infty$, first $f'(x) \geq 0$ and then $f'(x) \leq 0$.

Therefore, $f(x)$ is either DI or ID.

Proof of Claim 4.1. Suppose for contradiction that for every $i \in [k]$, $C^\star_i$ consists of at least two contiguous intervals. Pick any $i$ and suppose $S_L, S_R \subseteq C^\star_i$ are two contiguous intervals such that $S_L$ lies to the left of $S_R$. Let $y_L$ denote the rightmost point of $S_L$ and $y_R$ the leftmost point of $S_R$. Since $S_L$ and $S_R$ are two distinct contiguous intervals, there exists some point $y \in X$ between $y_L$ and $y_R$ such that $y \in C^\star_j$ for some $j \neq i$. Define $g: \mathbb{R} \to \mathbb{R}$ by $g(x) := d(x, c_j) - d(x, c_i)$. By Fact 4.1, $g$ is either ID or DI. If $g$ is ID, we swap the assignments of $y$ and $y_{\min} := \arg\min_{x \in \{y_L, y_R\}} g(x)$ in the optimal fair $k$-median clustering. Since $g$ is ID, for any interval $P$ with endpoints $p$ and $q$, $\min_{x \in P} g(x) = \min_{x \in \{p, q\}} g(x)$. This fact, together with $y_L \leq y \leq y_R$, implies that $g(y_{\min}) - g(y) \leq 0$. Hence, the change in the objective is
$$d(y, c_i) - d(y, c_j) - d(y_{\min}, c_i) + d(y_{\min}, c_j) = g(y_{\min}) - g(y) \leq 0.$$
This contradicts the optimality of $C^\star$, and hence $g$ has to be DI. Next, we show that there is no $y' \in C^\star_j$ such that $y' < y_L$ or $y' > y_R$.
We prove this by contradiction and only focus on the case $y' < y_L$, since the case $y' > y_R$ can be proved similarly by symmetry. We swap the assignments of $y_L$ and $y_{\max} := \arg\max_{x \in \{y, y'\}} g(x)$ in the optimal fair $k$-median clustering. The change in the objective is
$$d(y_L, c_j) - d(y_L, c_i) - d(y_{\max}, c_j) + d(y_{\max}, c_i) = g(y_L) - g(y_{\max}) \le 0,$$
where the last inequality follows from the fact that $g$ is DI. This contradicts the optimality of $C^\star$. Hence, we conclude that such a $y'$ does not exist, and therefore $\forall x \in C^\star_j$, $y_L < x < y_R$.

By assumption, $C^\star_j$ consists of at least two contiguous intervals within $(y_L, y_R)$. However, we can apply exactly the same argument to $C^\star_j$ as in the case of $i$, and eventually we would find a $j'$ such that $C^\star_{j'}$ lies inside a strictly smaller interval $(y'_L, y'_R)$ of $X$, where $y_L < y'_L < y'_R < y_R$. Since $n$ is finite, we cannot repeat this procedure indefinitely, which is a contradiction. This finishes the proof of Claim 4.1.

4.2 Proof of Theorem 4.3

Now we are ready to prove the main theorem of the last subsection.

Proof. The proof idea is similar to that of Lemma 2.8 in [27]. We first rotate the space such that the line is the $x$-axis and assume that $x_1 \le x_2 \le \ldots \le x_n$. Given an assignment constraint $F \in \mathbb{R}^k$ and a $k$-subset $C = \{c_1, \ldots, c_k\} \subseteq \mathbb{R}^d$, let $c'_i$ denote the projection of the point $c_i$ to the real line and assume that $c'_1 \le c'_2 \le \ldots \le c'_k$. Our goal is to prove that
$$|\mathcal{K}_1(S, F, C) - \mathcal{K}_1(X, F, C)| \le \frac{\varepsilon}{3} \cdot \mathcal{K}_1(X, F, C).$$
By the construction of $S$, we build a mapping $\pi : X \to S$ by letting $\pi(x) = \bar{B}$ for any $x \in B$. For each $i \in [k]$, let $C_i$ denote the collection of points assigned to $c_i$ in the optimal fair $k$-median clustering of $X$. By Lemma 4.1, $C_1, \ldots, C_k$ partition the line into at most $2k-1$ intervals $I_1, \ldots, I_t$ ($t \le 2k-1$), such that all points of any interval $I_i$ are assigned to the same center. Denote an assignment function $f : X \to C$ by $f(x) = c_i$ if $x \in C_i$. Let $\widehat{\mathcal{B}}$ denote the set of all batches $B$ which either intersect more than one interval $I_i$, or whose interval $I(B)$ contains the projection of a center point of $C$ to the $x$-axis. Clearly, $|\widehat{\mathcal{B}}| \le 2k-2+k = 3k-2$. For each batch $B \in \widehat{\mathcal{B}}$, we have
$$\sum_{x \in B} \big( d(\pi(x), f(x)) - d(x, f(x)) \big) \;\le\; \sum_{x \in B} d(x, \pi(x)) \;=\; \sum_{x \in B} d(x, \bar{B}) \;\le\; \frac{\varepsilon\,\mathrm{OPT}}{30k}, \qquad (2)$$
where the first inequality is the triangle inequality and the last is by the definition of a batch. Note that $X \setminus \bigcup_{B \in \widehat{\mathcal{B}}} B$ can be partitioned into at most $3k-1$ contiguous intervals. Denote these intervals by $I'_1, \ldots, I'_{t'}$ ($t' \le 3k-1$). By definition, all points of each interval $I'_i$ are assigned to the same center, whose projection is outside $I'_i$. Then by the proof of Lemma 2.8 in [27], we have that for each $I'_i$,
$$\sum_{x \in I'_i} \big( d(\pi(x), f(x)) - d(x, f(x)) \big) \le 2\xi = \frac{\varepsilon\,\mathrm{OPT}}{15k}. \qquad (3)$$
Combining Inequalities (2) and (3), we have
$$\mathcal{K}_1(S, F, C) - \mathcal{K}_1(X, F, C) \le \sum_{x \in X} \big( d(\pi(x), f(x)) - d(x, f(x)) \big) \qquad (\text{defn. of } \mathcal{K}_1(S, F, C))$$
$$= \sum_{B \in \widehat{\mathcal{B}}} \sum_{x \in B} \big( d(\pi(x), f(x)) - d(x, f(x)) \big) + \sum_{i \in [t']} \sum_{x \in I'_i} \big( d(\pi(x), f(x)) - d(x, f(x)) \big)$$
$$\le (3k-2) \cdot \frac{\varepsilon\,\mathrm{OPT}}{30k} + (3k-1) \cdot \frac{\varepsilon\,\mathrm{OPT}}{15k} \le \frac{\varepsilon\,\mathrm{OPT}}{3} \le \frac{\varepsilon}{3} \cdot \mathcal{K}_1(X, F, C). \qquad (4)$$
To prove the other direction, we can regard $S$ as a collection of $n$ unweighted points and consider the optimal fair $k$-median clustering of $S$. Again, the optimal fair $k$-median clustering of $S$ partitions the $x$-axis into at most $2k-1$ contiguous intervals, and can be described by an assignment function $f' : S \to C$. Then we can build a mapping $\pi' : S \to X$ as the inverse of $\pi$. For each batch $B$, let $S_B$ denote the collection of $|B|$ unweighted points located at $\bar{B}$.
We have the following inequality, similar to Inequality (2):
$$\sum_{x \in S_B} \big( d(\pi'(x), f'(x)) - d(x, f'(x)) \big) \le \frac{\varepsilon\,\mathrm{OPT}}{30k}.$$
Suppose a contiguous interval $I$ consists of several batches and satisfies that all points of $I \cap S$ are assigned by $f'$ to the same center, whose projection is outside $I$. Then by the proof of Lemma 2.8 in [27], we have that
$$\sum_{B \in I} \sum_{x \in S_B} \big( d(\pi'(x), f'(x)) - d(x, f'(x)) \big) \le 0.$$
Then, by a similar argument as for Inequality (4), we can prove the other direction,
$$\mathcal{K}_1(X, F, C) - \mathcal{K}_1(S, F, C) \le \frac{\varepsilon}{3} \cdot \mathcal{K}_1(X, F, C),$$
which completes the proof.

4.3 Extending to higher dimensions

The extension is the same as that of [27]. For completeness, we describe the detailed procedure for coresets for fair $k$-median.

1. We start by computing an approximate $k$-subset $C^\star = \{c_1, \ldots, c_k\} \subseteq \mathbb{R}^d$ such that $\mathrm{OPT} \le \mathcal{K}_1(X, C^\star) \le c \cdot \mathrm{OPT}$ for some constant $c > 1$ (for example, we can set $c = 10$ by [31]).
2. Then we partition the point set $X$ into sets $X_1, \ldots, X_k$ such that $X_i$ is the collection of points closest to $c_i$.
3. For each center $c_i$, we take a unit sphere centered at $c_i$ and construct an $\frac{\varepsilon}{3c}$-net $N_{c_i}$ on this sphere (an $\varepsilon$-net $Q$ is a set such that for any point $p$ on the unit sphere there exists $q \in Q$ with $d(p, q) \le \varepsilon$). By Lemma 2.6 in [27], $|N_{c_i}| = O(\varepsilon^{-d+1})$, and it may be computed in $O(\varepsilon^{-d+1})$ time. Then for every $p \in N_{c_i}$, we emit a ray from $c_i$ through $p$. Overall, there are at most $O(k\varepsilon^{-d+1})$ lines.
4. For each $i \in [k]$, we project all points of $X_i$ onto the closest line around $c_i$. Let $\pi : X \to \mathbb{R}^d$ denote the projection function. By the definition of an $\frac{\varepsilon}{3c}$-net, we have $\sum_{x \in X} d(x, \pi(x)) \le \varepsilon \cdot \mathrm{OPT}/3$, which means the projection cost is negligible. Then for each line, we compute an $\varepsilon/3$-coreset of size $O(k\varepsilon^{-1})$ for fair $k$-median by Theorem 4.3. Let $S$ denote the union of the coresets generated from all lines.

Proof of Theorem 4.1. Since there are at most $O(k\varepsilon^{-d+1})$ lines and the coreset on each line has size at most $O(k\varepsilon^{-1})$ by Theorem 4.3, the total size of $S$ is $O(k^2\varepsilon^{-d})$. For correctness, by the optimality of $\mathrm{OPT}$ (which is the unconstrained optimum), for any given assignment constraint $F \in \mathbb{R}^k$ and any $k$-subset $C \subseteq \mathbb{R}^d$ we have $\mathrm{OPT} \le \mathcal{K}_1(X, F, C)$. Combining this fact with Theorem 4.3, $S$ is an $\varepsilon$-coreset for fair $k$-median clustering, by the same argument as in Theorem 2.9 of [27]. For the running time, we need $T_1(n)$ time to compute $C^\star$ and $\mathrm{APX}$, and the remaining construction time is upper bounded by $O(k\varepsilon^{-d+1} n)$, the cost of projecting the points to the lines. This completes the proof.

Remark 4.2. In fact, it suffices to emit a set of rays such that the total cost of projecting points to the rays is at most $\frac{\varepsilon \cdot \mathrm{OPT}}{3}$. This observation is crucially used in our implementations (Section 6) to reduce the size of the coreset, particularly to avoid the construction of the $O(\varepsilon)$-net, which is of size $O(\varepsilon^{-d})$.

5 Coresets for fair k-means clustering

In this section, we show how to construct coresets for fair $k$-means. As in the fair $k$-median case, we apply the approach of [27]. The main theorem is as follows.

Theorem 5.1 (Coreset for fair $k$-means). There exists an algorithm that constructs an $\varepsilon$-coreset for the fair $k$-means problem of size $O(\Gamma k^3 \varepsilon^{-d-1})$, in $O(k^2 \varepsilon^{-d+1} n + T_2(n, d, k))$ time.

Note that the above result improves the coreset size of [39] by an $O(\frac{\log n}{\varepsilon k^2})$ factor. Similar to the fair $k$-median case, it suffices to prove the case $l = 1$. Recall that an assignment constraint for $l = 1$ can be described by a vector $F \in \mathbb{R}^k$. Denote by $\mathrm{OPT}$ the optimal $k$-means value without any assignment constraint.

5.1 The line case

Similar to [27], we first consider the case that $X$ is a point set on the real line.
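For intuition about the higher-dimensional construction above (step 4 of Section 4.3 and the ray-emission idea of Remark 4.2), the following is a minimal sketch of projecting points onto the nearest of a set of rays. The directions and points here are illustrative assumptions, not an actual $\frac{\varepsilon}{3c}$-net.

```python
import numpy as np

def project_to_rays(points, center, directions):
    """Project each point onto the nearest ray {center + t*u : t >= 0},
    one ray per unit direction u (a sketch of step 4 in Section 4.3)."""
    shifted = points - center                          # work relative to the center
    t = np.clip(shifted @ directions.T, 0.0, None)     # projection lengths, shape (n, m)
    cand = t[:, :, None] * directions[None, :, :]      # candidate projections, (n, m, d)
    d2 = np.sum((shifted[:, None, :] - cand) ** 2, axis=2)
    best = d2.argmin(axis=1)                           # nearest ray per point
    return center + cand[np.arange(len(points)), best]

# toy usage: two orthogonal rays from the origin in the plane
dirs = np.array([[1.0, 0.0], [0.0, 1.0]])
pts = np.array([[2.0, 0.1], [0.05, 3.0]])
proj = project_to_rays(pts, np.zeros(2), dirs)         # -> [[2, 0], [0, 3]]
```

In the actual construction, the directions come from the $\frac{\varepsilon}{3c}$-net, which bounds the total projection cost by $\varepsilon \cdot \mathrm{OPT}/3$; the sketch only shows the projection mechanics.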
Recall that for a weighted point set $S$ with weight $w : S \to \mathbb{R}_{\ge 0}$, the mean of $S$ is $\bar{S} := \frac{1}{|S|} \sum_{p \in S} w(p) \cdot p$, and the error of $S$ is $\Delta(S) := \sum_{p \in S} w(p) \cdot d^2(p, \bar{S})$. Again, our construction is similar to [27] and is summarized in Algorithm 2. The main difference from Algorithm 1 is in Line 3: for each batch, we need to construct two weighted points for the coreset using a constructive lemma of [27], summarized in Lemma 5.1. Also note that the selected threshold $\xi$ is different from that in Algorithm 1.

Algorithm 2: FairMeans-1D($X$, $k$)
Input: $X = \{x_1, \ldots, x_n\} \subset \mathbb{R}^d$ lying on the real line with $x_1 \le \ldots \le x_n$; an integer $k \in [n]$; a number $\mathrm{OPT}$ equal to the optimal value of $k$-means clustering.
Output: an $\varepsilon$-coreset $S$ of $X$ together with weights $w : S \to \mathbb{R}_{\ge 0}$.
1. Set a threshold $\xi = \frac{\varepsilon^2\,\mathrm{OPT}}{200 k^2}$.
2. Consider the points from $x_1$ to $x_n$ and group them into batches greedily: each batch $B$ is a maximal point set satisfying $\Delta(B) \le \xi$.
3. Denote by $\mathcal{B}(X)$ the collection of all batches. For each batch $B$, construct a collection $J(B)$ of two points $q_1, q_2$ together with weights $w_1, w_2$ satisfying Lemma 5.1.
4. Let $S \leftarrow \bigcup_{B \in \mathcal{B}(X)} J(B)$.
5. Return $(S, w)$.

Lemma 5.1 (Lemmas 3.2 and 3.4 in [27]). The number of batches is $O(k^2/\varepsilon^2)$. For each batch $B$, there exist two weighted points $q_1, q_2 \in I(B)$ together with weights $w_1, w_2$ satisfying:
- $w_1 + w_2 = |B|$.
- Let $J(B)$ denote the collection of the two weighted points $q_1$ and $q_2$. Then $\overline{J(B)} = \bar{B}$ and $\Delta(B) = \Delta(J(B))$.
- For any point $q \in \mathbb{R}^d$, we have $\mathcal{K}_2(B, q) = \Delta(B) + |B| \cdot d^2(q, \bar{B}) = \mathcal{K}_2(J(B), q)$.

Analysis. We argue that $S$ is indeed an $\varepsilon/3$-coreset for the fair $k$-means clustering problem.
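As a sanity check of the identity in Lemma 5.1, the sketch below replaces a 1-D batch by two weighted points with the same total weight, mean, and error. This is one valid choice of $q_1, q_2$ realizing the lemma's guarantees, not necessarily the concrete construction of [27].

```python
import math

def two_point_batch(batch):
    """Replace a 1-D batch by two weighted points with the same total
    weight, mean, and error (one valid realization of Lemma 5.1)."""
    n = len(batch)
    mu = sum(batch) / n                                # batch mean
    err = sum((p - mu) ** 2 for p in batch)            # error Delta(B)
    off = math.sqrt(err / n)                           # off <= max deviation,
    return [(mu - off, n / 2), (mu + off, n / 2)]      # so q1, q2 stay in I(B)

def k2(weighted_pts, q):
    """Weighted k-means cost of assigning every point to one center q."""
    return sum(w * (p - q) ** 2 for p, w in weighted_pts)

batch = [1.0, 2.0, 2.0, 5.0]
rep = two_point_batch(batch)                           # [(1.0, 2.0), (4.0, 2.0)]
orig = [(p, 1.0) for p in batch]
# the identity K2(B, q) = Delta(B) + |B| * d^2(q, mean(B)) = K2(J(B), q)
for q in (-3.0, 0.0, 2.5, 10.0):
    assert abs(k2(orig, q) - k2(rep, q)) < 1e-9
```

Matching the mean and the error is exactly what makes the single-center cost identical for every center $q$, which is why two points per batch suffice for the $k$-means objective.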
By Theorem 3.5 in [27], $S$ is an $\varepsilon/3$-coreset for $k$-means clustering of $X$. However, we need to handle additional assignment constraints. To address this, we introduce the following lemma showing that every optimal cluster satisfying the given assignment constraint lies within a contiguous interval.

Lemma 5.2 (Clusters are contiguous for fair $k$-means). Suppose $X = \{x_1, \ldots, x_n\}$ where $x_1 \le x_2 \le \ldots \le x_n$. Given an assignment constraint $F \in \mathbb{R}^k$ and a $k$-subset $C = \{c_1, \ldots, c_k\} \subseteq \mathbb{R}^d$, let $C_i := \{x_{1+\sum_{j<i} F_j}, \ldots, x_{\sum_{j \le i} F_j}\}$ […]

[…] $> d$. Let $e := a + b - d > 0$. Since $a + b \le c + d$, we have $e \le c$ and hence $e^2 \le c^2$. Hence, it suffices to prove that $a^2 + b^2 \le e^2 + d^2$. Note that
$$e^2 + d^2 = (a + b - d)^2 + d^2 = a^2 + b^2 + 2(d - a)(d - b) \ge a^2 + b^2,$$
which completes the proof.

Now we come back to prove Lemma 5.2. We have the following inequality:
$$d^2(x_{j_1}, c_{i_1}) + d^2(x_{j_2}, c_{i_2}) = d^2(x_{j_1}, c'_{i_1}) + d^2(c'_{i_1}, c_{i_1}) + d^2(x_{j_2}, c'_{i_2}) + d^2(c'_{i_2}, c_{i_2}) \qquad (\text{Pythagorean theorem})$$
$$\le d^2(x_{j_1}, c'_{i_2}) + d^2(c'_{i_1}, c_{i_1}) + d^2(x_{j_2}, c'_{i_1}) + d^2(c'_{i_2}, c_{i_2}) \qquad (\text{Ineq. (7)})$$
$$= d^2(x_{j_1}, c_{i_2}) + d^2(x_{j_2}, c_{i_1}). \qquad (\text{Pythagorean theorem})$$
This contradicts the assumption that $x_{j_1} \in C_{i_2}$ and $x_{j_2} \in C_{i_1}$. Hence, we complete the proof.

Now we are ready to give the following theorem.

Theorem 5.2 (Coreset for fair $k$-means when $X$ lies on a line). Algorithm 2 outputs an $\varepsilon/3$-coreset for fair $k$-means clustering of $X$ in time $O(|X|)$.

Proof. The proof is similar to that of Theorem 3.5 in [27]. The running time analysis is exactly the same, so we only focus on correctness. We first rotate the space such that the line is the $x$-axis and assume that $x_1 \le x_2 \le \ldots \le x_n$. Given an assignment constraint $F \in \mathbb{R}^k$ and a $k$-subset $C = \{c_1, \ldots, c_k\} \subseteq \mathbb{R}^d$, let $c'_i$ denote the projection of the point $c_i$ to the real line and assume that $c'_1 \le c'_2 \le \ldots \le c'_k$. Our goal is to prove that
$$|\mathcal{K}_2(S, F, C) - \mathcal{K}_2(X, F, C)| \le \frac{\varepsilon}{3} \cdot \mathcal{K}_2(X, F, C).$$
By Lemma 5.2, the optimal fair clustering of $X$ should be $C_i := \{x_{1+\sum_{j<i} F_j}, \ldots\}$ […]

[…] $> 20\%$ but larger when $\varepsilon \le 20\%$. A possible explanation is that the way our algorithm works might not capture the pattern of the normalized datasets. Recall that our algorithm emits rays and projects points such that the projection cost is bounded, but we find that this part becomes a bottleneck when $\varepsilon$ is small. Intuitively, if the dataset is well clustered around a few lines, our algorithm should offer superior performance; however, this might not be the case for a dataset that tends to lie around a hyper-sphere.

Table 6: performance of $\varepsilon$-coresets for fair $k$-median with one sensitive type w.r.t. varying $\varepsilon$.

Dataset    ε     emp. err. (Ours)   emp. err. (Uni)   size    T_S (ms)   T_C (ms)   T_X (ms)
Adult      10%   2.97%              32.66%            46      5          363        3592
Adult      15%   3.32%              82.47%            37      5          332        -
Adult      20%   5.39%              30.24%            36      5          295        -
Adult      25%   4.44%              42.81%            28      5          308        -
Adult      30%   7.00%              30.67%            26      5          304        -
Adult      35%   6.82%              22.46%            30      5          311        -
Adult      40%   6.20%              23.55%            24      5          308        -
Bank       10%   1.27%              10.08%            838     21         1264       2817
Bank       15%   2.58%              4.52%             292     11         652        -
Bank       20%   3.13%              12.79%            238     10         607        -
Bank       25%   3.01%              16.74%            272     11         605        -
Bank       30%   4.31%              10.93%            193     9          513        -
Bank       35%   4.80%              12.42%            140     7          543        -
Bank       40%   5.56%              12.68%            102     7          468        -
Diabetes   10%   1.23%              40.20%            51102   3766       143910     14414
Diabetes   15%   1.47%              14.28%            22811   909        45238      -
Diabetes   20%   2.12%              1.84%             7699    193        15366      -
Diabetes   25%   2.76%              2.57%             3159    74         8402       -
Diabetes   30%   3.76%              3.47%             941     23         4710       -
Diabetes   35%   4.56%              4.78%             577     15         4367       -
Diabetes   40%   6.33%              10.99%            324     11         3642       -

Table 7: performance of $\varepsilon$-coresets for fair $k$-means with one sensitive type w.r.t. varying $\varepsilon$.
Dataset    ε     emp. err. (Ours)   (BICO)    (Uni)     size    T_S Ours (ms)   T_S BICO (ms)   T_C (ms)   T_X (ms)
Adult      10%   0.91%              1.16%     184.60%   209     9               547             397        3908
Adult      15%   0.78%              1.08%     30.07%    162     8               468             469        -
Adult      20%   0.584%             1.80%     63.80%    135     8               516             392        -
Adult      25%   1.40%              1.42%     33.16%    118     7               540             407        -
Adult      30%   1.58%              2.47%     52.42%    108     7               521             391        -
Adult      35%   2.29%              4.09%     100.10%   99      7               534             400        -
Adult      40%   1.79%              4.39%     90.48%    92      6               510             423        -
Bank       10%   2.87%              5.11%     19.80%    127     9               1411            500        2662
Bank       15%   2.85%              5.63%     44.81%    100     7               611             518        -
Bank       20%   3.04%              4.47%     38.91%    84      6               518             484        -
Bank       25%   2.77%              6.97%     38.27%    72      7               530             516        -
Bank       30%   2.60%              6.59%     52.24%    68      6               585             492        -
Bank       35%   2.64%              8.23%     34.05%    60      6               554             500        -
Bank       40%   2.67%              8.90%     75.58%    56      6               566             501        -
Diabetes   10%   4.44%              9.44%     3.46%     16749   484             65396           1035       16748
Diabetes   15%   8.01%              6.88%     8.94%     1658    34              11491           971        -
Diabetes   20%   11.17%             14.41%    15.87%    408     11              5203            872        -
Diabetes   25%   15.55%             19.05%    23.35%    158     9               3047            922        -
Diabetes   30%   14.94%             24.62%    43.00%    104     6               2849            896        -
Diabetes   35%   14.72%             29.42%    16.78%    96      6               2907            875        -
Diabetes   40%   14.67%             25.78%    23.26%    84      7               2847            875        -

Table 8: performance of $\varepsilon$-coresets for fair $k$-median with normalization w.r.t. varying $\varepsilon$.
Dataset    ε     emp. err. (Ours)   emp. err. (Uni)   size    T_S (ms)   T_C (ms)   T_X (ms)
Adult      10%   1.08%              5.31%             22483   2085       10506      7138
Adult      15%   1.66%              3.77%             14388   1179       4835       -
Adult      20%   2.43%              3.05%             9396    643        2562       -
Adult      25%   3.28%              1.68%             5828    361        1642       -
Adult      30%   4.39%              3.57%             4111    244        1271       -
Adult      35%   5.52%              2.22%             3409    195        1125       -
Adult      40%   6.26%              1.45%             2100    113        959        -
Bank       10%   1.28%              11.38%            3503    165        1604       5286
Bank       15%   2.90%              8.62%             1529    70         938        -
Bank       20%   4.32%              5.03%             863     39         696        -
Bank       25%   7.73%              4.13%             526     25         595        -
Bank       30%   9.52%              27.06%            329     18         524        -
Bank       35%   10.26%             15.78%            226     13         486        -
Bank       40%   9.00%              29.16%            216     13         492        -
Diabetes   10%   0.67%              8.29%             76380   9306       96198      15303
Diabetes   15%   1.15%              26.08%            55761   6321       35494      -
Diabetes   20%   1.91%              11.04%            34182   3560       14650      -
Diabetes   25%   2.96%              2.98%             18085   1780       7076       -
Diabetes   30%   4.10%              1.84%             10834   854        4112       -
Diabetes   35%   5.37%              2.12%             6402    418        2712       -
Diabetes   40%   6.52%              2.42%             3968    234        2044       -
Athlete    10%   1.25%              1.80%             3472    91         5719       76081
Athlete    15%   1.99%              2.78%             1372    35         3340       -
Athlete    20%   2.94%              5.75%             678     20         2407       -
Athlete    25%   3.85%              7.69%             381     13         1902       -
Athlete    30%   4.57%              4.62%             208     9          1582       -
Athlete    35%   6.95%              9.22%             173     9          1525       -
Athlete    40%   7.37%              11.96%            117     8          1406       -

Table 9: performance of $\varepsilon$-coresets for fair $k$-means with normalization w.r.t. varying $\varepsilon$.
Dataset    ε     emp. err. (Ours)   (BICO)    (Uni)     size    T_S Ours (ms)   T_S BICO (ms)   T_C (ms)   T_X (ms)
Adult      10%   3.39%              1.61%     3.45%     17231   1533            21038           3141       17224
Adult      15%   6.78%              3.67%     14.14%    8876    595             8178            1584       -
Adult      20%   10.93%             7.90%     9.38%     4087    228             4047            1000       -
Adult      25%   13.87%             12.63%    7.12%     2213    116             2652            884        -
Adult      30%   18.36%             19.24%    4.20%     1113    56              1799            844        -
Adult      35%   19.80%             24.98%    9.03%     791     38              1237            782        -
Adult      40%   19.64%             23.00%    3.88%     759     38              1229            782        -
Bank       10%   15.90%             11.88%    54.84%    254     15              453             719        4800
Bank       15%   15.40%             16.53%    50.86%    213     13              470             735        -
Bank       20%   16.71%             16.41%    42.25%    186     12              470             719        -
Bank       25%   15.25%             23.19%    42.86%    173     12              448             719        -
Bank       30%   15.23%             15.25%    44.19%    166     12              455             718        -
Bank       35%   14.99%             18.20%    54.96%    149     11              455             719        -
Bank       40%   14.86%             22.81%    33.25%    149     12              454             765        -
Diabetes   10%   4.88%              9.85%     30.26%    51841   5601            111971          8891       14893
Diabetes   15%   8.80%              4.65%     1.90%     20188   1798            22840           3390       -
Diabetes   20%   12.72%             13.11%    2.63%     6520    372             7644            1907       -
Diabetes   25%   16.96%             20.99%    3.67%     2861    144             4251            1610       -
Diabetes   30%   21.03%             28.79%    6.32%     1399    72              2421            1547       -
Diabetes   35%   23.23%             33.81%    8.17%     886     45              1526            1521       -
Diabetes   40%   27.10%             42.67%    8.49%     478     28              982             1516       -
Athlete    10%   5.43%              3.06%     5.50%     2350    56              20560           1362       72642
Athlete    15%   8.27%              9.95%     7.32%     519     15              7748            1241       -
Athlete    20%   14.18%             12.83%    23.49%    224     10              3413            1216       -
Athlete    25%   15.31%             19.06%    14.49%    127     9               2059            1206       -
Athlete    30%   18.16%             22.53%    16.78%    88      7               1537            1192       -
Athlete    35%   17.95%             24.08%    26.39%    82      7               1527            1224       -
Athlete    40%   18.03%             22.41%    25.87%    74      7               1539            1216       -
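The empirical error reported above is a relative gap between the clustering cost on the coreset and on the full data. The exact experimental protocol (fair assignment constraints and choice of centers) is described earlier in Section 6 and not in this excerpt, so the sketch below assumes the plain definition $|\mathcal{K}(S, C) - \mathcal{K}(X, C)| / \mathcal{K}(X, C)$ for a fixed set of centers.

```python
def cost(weighted_points, centers):
    """k-means cost with fixed centers: each weighted point pays its squared
    distance to the nearest center (use the distance itself for k-median)."""
    return sum(w * min((p - c) ** 2 for c in centers) for p, w in weighted_points)

def empirical_error(full_data, coreset, centers):
    """Assumed definition of `emp. err.`: the relative gap between the
    coreset cost and the full-data cost for the same solution."""
    full_cost = cost(full_data, centers)
    return abs(cost(coreset, centers) - full_cost) / full_cost

# toy usage: a 4-point data set on a line and a 2-point weighted coreset
X = [(0.0, 1.0), (1.0, 1.0), (9.0, 1.0), (10.0, 1.0)]
S = [(0.5, 2.0), (9.5, 2.0)]
err = empirical_error(X, S, centers=(0.0, 10.0))       # -> 0.5
```

A deliberately crude coreset is used here so the gap is visible; an $\varepsilon$-coreset guarantees this quantity is at most $\varepsilon$ for every assignment constraint and every set of centers.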