Distributed Statistical Estimation and Rates of Convergence in Normal Approximation
Stanislav Minsker† (E-mail: minsker@usc.edu) and Nate Strawn (E-mail: nate.strawn@georgetown.edu)

Summary. This paper presents a class of new algorithms for distributed statistical estimation that exploit the divide-and-conquer approach. We show that one of the key benefits of the divide-and-conquer strategy is robustness, an important characteristic for large distributed systems. We establish connections between the performance of these distributed algorithms and the rates of convergence in normal approximation, and prove non-asymptotic deviation guarantees, as well as limit theorems, for the resulting estimators. Our techniques are illustrated through several examples: in particular, we obtain new results for the median-of-means estimator, as well as provide performance guarantees for distributed maximum likelihood estimation.

1. Introduction.

According to (IBM, 2015), "Every day, we create 2.5 quintillion bytes of data – so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos... to name a few. This data is big data." Novel scalable and robust algorithms are required to successfully address the challenges posed by big data problems. This paper develops and analyzes techniques that exhibit scalability, a necessary characteristic of modern methods designed to perform statistical analysis of large datasets, as well as robustness, which guarantees stable performance of distributed systems when some of the nodes exhibit abnormal behavior. The computational power of a single computer is often insufficient to store and process modern data sets; instead, data is stored and analyzed in a distributed way by a cluster consisting of several machines.
We consider a distributed estimation framework wherein data is assumed to be randomly assigned to computational nodes that produce intermediate results. We assume that no communication between the nodes is allowed at this first stage. At the second stage, these intermediate results are used to compute some statistic on the whole dataset; see Fig. 1 for a graphical illustration. Often, such a distributed setting is unavoidable in applications, whence interactions between subsamples stored on different machines are inevitably lost. Most previous research focused on the following question: how significantly does this loss affect the quality of statistical estimation when compared to an "oracle" that has access to the whole sample?

† S. Minsker was partially supported by the National Science Foundation grant DMS-1712956.

Fig. 1: Distributed estimation protocol where data is randomly distributed across nodes to obtain "local" estimates that are aggregated to compute a "global" estimate.

The question that we ask in this paper is different: what can be gained from randomly splitting the data across several subsamples? What are the statistical advantages of the divide-and-conquer framework? Our work indicates that one of the key benefits of an appropriate merging strategy is robustness. In particular, the quality of estimation attained by the distributed estimation algorithm is preserved even if a subset of machines stops working properly. At the same time, the resulting estimators admit tight probabilistic guarantees (expressed in the form of exponential concentration inequalities) even when the distribution of the data has heavy tails – a viable model of real-world samples contaminated by outliers. We establish connections between a class of randomized divide-and-conquer strategies and the rates of convergence in normal approximation.
Using these connections, we provide a new analysis of the "median-of-means" estimator which often yields significant improvements over the previously available results. We further illustrate the implications of our results by constructing novel algorithms for distributed Maximum Likelihood Estimation that admit strong performance guarantees under weak assumptions on the underlying distribution.

1.1. Background and related work.

We begin by introducing a simple model for distributed statistical estimation. Let $X_1, \ldots, X_N$ be a sequence of independent random variables with values in a measurable space $(S, \mathcal{S})$ that represent the data available to a statistician. We will assume that $N$ is large, and that the sample $X = (X_1, \ldots, X_N)$ is partitioned into $k$ disjoint subsets $G_1, \ldots, G_k$ of cardinalities $n_j := \mathrm{card}(G_j)$ respectively, where the partitioning scheme is independent of the data. Let $P_j$ be the distribution of $X_j$, $j = 1, \ldots, N$. The goal is to estimate an unknown parameter $\theta_\ast = \theta_\ast(P_j)$, $j = 1, \ldots, N$, shared by $P_1, \ldots, P_N$ and taking values in a separable Hilbert space $(\mathbb{H}, \|\cdot\|_{\mathbb{H}})$; for example, if $S = \mathbb{H}$, $\theta_\ast$ could be the common mean of $X_1, \ldots, X_N$. The distributed estimation protocol proceeds by performing "local" computations on each subset $G_j$, $j \le k$; the local estimators $\bar\theta_j := \bar\theta_j(G_j)$, $j \le k$, are then pieced together to produce the final "global" estimator $\hat\theta^{(k)} = \hat\theta^{(k)}(\bar\theta_1, \ldots, \bar\theta_k)$. We are interested in the statistical properties of such distributed estimation protocols, and our main focus is on the final step that combines the local estimators. Let us mention that the condition requiring the sets $G_j$, $1 \le j \le k$, to be disjoint can be relaxed; we discuss the extensions related to U-quantiles in section 2.6 below.
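The two-stage protocol just described (independent "local" estimates computed on disjoint groups, followed by a single merging step) can be sketched in a few lines of Python. This is our own minimal illustration, not code from the paper; the function names and the use of NumPy are hypothetical choices:

```python
import numpy as np

def distributed_estimate(X, k, local_estimator, merge, seed=None):
    """Randomly partition the sample X into k disjoint groups G_1, ..., G_k,
    compute a "local" estimate on each group, and merge the k results into
    a single "global" estimate."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))        # partitioning scheme independent of the data values
    groups = np.array_split(idx, k)      # k disjoint index sets
    local = np.array([local_estimator(X[g]) for g in groups])
    return merge(local)

# Example: estimating the mean of a heavy-tailed sample.
rng = np.random.default_rng(0)
X = rng.standard_t(df=3, size=100_000)   # true mean is 0, finite variance
est = distributed_estimate(X, k=300, local_estimator=np.mean, merge=np.median, seed=1)
```

Choosing `merge=np.mean` recovers plain averaging of the local estimators, while `merge=np.median` gives the median-of-means merging rule discussed in this paper.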
The problem of distributed and communication-efficient statistical estimation has recently received significant attention from the research community. While our review provides only a subsample of the abundant literature in this field, it is important to acknowledge the works by Mcdonald et al. (2009); Zhang et al. (2012); Fan et al. (2014); Battey et al. (2015); Duchi et al. (2014); Shafieezadeh-Abadeh et al. (2015); Lee et al. (2015); Cheng and Shang (2015); Rosenblatt and Nadler (2016); Zinkevich et al. (2010). Li et al. (2016); Scott et al. (2016); Shang and Cheng (2015); Minsker et al. (2014) have investigated closely related problems for distributed Bayesian inference. Applications to important algorithms such as Principal Component Analysis were investigated in (Fan et al., 2017; Liang et al., 2014), among others. In Jordan (2013), the author provides an overview of recent trends in the intersection of the statistics and computer science communities, describes popular existing strategies such as the "bag of little bootstraps", as well as successful applications of the divide-and-conquer paradigm to problems such as matrix factorization. The majority of the aforementioned works propose averaging of local estimators as a final merging step. Indeed, averaging reduces variance, hence, if the bias of each local estimator is sufficiently small, their average often attains optimal rates of convergence to the unknown parameter $\theta_\ast$. For example, when $\theta_\ast(P) = \mathbb{E}_P X$ is the mean of $X$ and $\bar\theta_j$ is the sample mean evaluated over the subsample $G_j$, $j = 1, \ldots, k$, then the average of the local estimators $\tilde\theta = \frac{1}{k}\sum_{j=1}^k \bar\theta_j$ is just the empirical mean evaluated over the whole sample. More generally, it has been shown by Battey et al. (2015); Zhang et al.
(2013) that in many problems (for instance, linear regression), $k$ can be taken as large as $O(\sqrt{N})$ without negatively affecting the estimation rates; similar guarantees hold for a variety of M-estimators (see Rosenblatt and Nadler, 2016). However, if the number of nodes $k$ itself is large (the case we are mainly interested in), then the averaging scheme has a drawback: if one or more of the local estimators $\bar\theta_j$ is anomalous (for example, due to data corruption or a computer system malfunctioning), then the statistical properties of the average will be negatively affected as well. For large distributed systems, this drawback can be costly. One way to address this issue is to replace averaging by a more robust procedure, such as the median or a robust M-estimator; this approach is investigated in the present work. In the univariate case ($\theta_\ast \in \mathbb{R}$), the merging strategies we study can be described as solutions of the optimization problem

\[ \hat\theta^{(k)} = \mathrm{argmin}_{z \in \mathbb{R}} \sum_{j=1}^k \rho\big( |\bar\theta_j - z| \big) \tag{1} \]

for an appropriately defined convex function $\rho$; we investigate this class of estimators in detail. A natural extension to the case $\theta_\ast \in \mathbb{R}^m$ is to consider

\[ \hat\theta^{(k)} = \mathrm{argmin}_{y \in \mathbb{R}^m} \sum_{j=1}^k \rho\big( \|\bar\theta_j - y\|_\circ \big) \]

for some convex function $\rho$ and norm $\|\cdot\|_\circ$. For example, if $\rho(x) = x$, then $\hat\theta^{(k)}$ becomes the spatial (also known as geometric or Haldane's) median (Haldane, 1948; Small, 1990) of $\bar\theta_1, \ldots, \bar\theta_k$. Since the median remains stable as long as at least half of the nodes in the system perform as expected, such a model for distributed estimation is robust. The merging approach based on various notions of the multivariate median has been previously considered by Minsker (2015) and Hsu and Sabato (2016); here, we analyze the setting where $\rho(x) = x$ and $\|\cdot\|_\circ$ is the $L_1$-norm using a novel approach.
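As an aside, the spatial (geometric) median mentioned above can be computed by the classical Weiszfeld iteration. The sketch below is ours (the function names and tolerances are hypothetical choices); it also shows the coordinatewise median, which corresponds to the choice $\rho(x) = x$ with the $L_1$-norm:

```python
import numpy as np

def geometric_median(points, n_iter=200, eps=1e-9):
    """Weiszfeld's algorithm: an iteratively re-weighted average converging to
    argmin_y sum_j ||theta_j - y||_2, the spatial median of the rows."""
    y = points.mean(axis=0)                   # start from the (non-robust) average
    for _ in range(n_iter):
        d = np.maximum(np.linalg.norm(points - y, axis=1), eps)
        w = 1.0 / d                           # weights inversely proportional to distance
        y = (w[:, None] * points).sum(axis=0) / w.sum()
    return y

def coordinatewise_median(points):
    """The L1-norm ("coordinatewise") median: a plain median in every coordinate."""
    return np.median(points, axis=0)
```

Both merges remain stable when a minority of the rows (the local estimates) are anomalous.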
Existing results for the median-based merging strategies have several pitfalls related to the deviation rates, and in most cases the known guarantees are suboptimal. In particular, these guarantees suggest that estimators obtained via the median-based approach are very sensitive to the choice of $k$, the number of partitions. For instance, consider the problem of univariate mean estimation, where $X_1, \ldots, X_N$ are i.i.d. copies of $X \in \mathbb{R}$, and $\theta_\ast = \mathbb{E}X$ is the expectation of $X$. Assume that $\mathrm{card}(G_j) \ge n := \lfloor N/k \rfloor$ for all $j$, let $\bar\theta_j = \frac{1}{|G_j|}\sum_{i: X_i \in G_j} X_i$ be the empirical mean evaluated over the subsample $G_j$, and define the "median-of-means" estimator via

\[ \hat\theta^{(k)} = \mathrm{med}\big( \bar\theta_1, \ldots, \bar\theta_k \big), \tag{2} \]

where $\mathrm{med}(\cdot)$ is the usual univariate median. This estimator was introduced by Nemirovski and Yudin (1983) in the context of stochastic optimization, and later appeared in (Jerrum et al., 1986) and (Alon et al., 1996). If $\mathrm{Var}(X) = \sigma^2 < \infty$, it has been shown (for example, by Lerasle and Oliveira, 2011) that the median-of-means estimator $\hat\theta^{(k)}$ satisfies

\[ \big| \hat\theta^{(k)} - \theta_\ast \big| \le 2\sigma\sqrt{6e}\, \sqrt{\frac{k}{N}} \tag{3} \]

with probability $\ge 1 - e^{-k}$. However, this bound, while being the current state of the art, does not tell us what happens at confidence levels other than $1 - e^{-k}$. For example, if $k = \lfloor \sqrt{N} \rfloor$, the only conclusion we can make is that $|\hat\theta^{(k)} - \theta_\ast| \lesssim N^{-1/4}$ with high probability, which is far from the optimal rate $N^{-1/2}$. And if we want the bound to hold with confidence 99% instead of $1 - e^{-\sqrt{N}}$, then, according to (3), we should take $k = \lfloor \log 100 \rfloor + 1 = 5$, in which case the beneficial effect of parallel computation is very limited. The natural question to ask is the following: is the median-based merging step indeed suboptimal for large values of $k$ (e.g., $k = \lfloor \sqrt{N} \rfloor$), or is the problem related to the suboptimality of the existing bounds?
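Before turning to the improved bounds, a quick simulation (ours, not from the paper) illustrates the robustness that motivates the median-based merge: corrupting a handful of the $k$ local means barely moves the median-of-means, while plain averaging is ruined.

```python
import numpy as np

rng = np.random.default_rng(0)
N, k = 100_000, 316                           # k of order sqrt(N)
X = rng.normal(loc=1.0, scale=2.0, size=N)    # true mean is 1

# Local step: the sample mean on each of k disjoint groups of size n = N // k.
n = N // k
local_means = X[: n * k].reshape(k, n).mean(axis=1)

# Five nodes malfunction and report garbage values.
corrupted = local_means.copy()
corrupted[:5] = 1e6

mom = np.median(corrupted)                    # median-of-means: barely affected
avg = np.mean(corrupted)                      # plain average: dominated by the 5 bad nodes
```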
We claim that in many situations the latter is the case, and that previously known results can be strengthened: for instance, the statement of Corollary 1 below implies that whenever $\mathbb{E}|X - \theta_\ast|^3 < \infty$, the median-of-means estimator satisfies

\[ \big| \hat\theta^{(k)} - \theta_\ast \big| \le 3\sigma \left( \frac{\mathbb{E}|X - \theta_\ast|^3}{\sigma^3}\, \frac{k}{N-k} + \sqrt{\frac{s}{N-k}} \right) \]

with probability $\ge 1 - 4e^{-2s}$, for all $s \lesssim k$. In particular, this inequality shows that the estimator (2) has "typical" deviations of order $N^{-1/2}$ whenever $k = O(\sqrt{N})$, hence the "statistical cost" of employing a large number of computational nodes is minor. Moreover, we will prove that

\[ \sqrt{N} \big( \hat\theta^{(k)} - \theta_\ast \big) \xrightarrow{d} N\!\left( 0, \frac{\pi}{2}\sigma^2 \right) \quad \text{if } k \to \infty \text{ and } k = o(\sqrt{N}) \]

as $N \to \infty$. It will also be demonstrated that improved bounds hold in other important scenarios, such as maximum likelihood estimation, even when the subgroups have different sizes and the observations are not identically distributed.

1.2. Organization of the paper.

Section 1.3 describes the notation used throughout the paper. Sections 2 and 3 present the main results and examples for the cases of a univariate and multivariate parameter, respectively. Outcomes of numerical simulations are discussed in section 4, and proofs of the main results are contained in section 5.

1.3. Notation.

Everywhere below, $\|\cdot\|_1$ and $\|\cdot\|_2$ stand for the $L_1$ and $L_2$ norms of a vector, and $\|\cdot\|$ for the operator norm of a matrix (its largest singular value). Given a probability measure $P$, $\mathbb{E}_P(\cdot)$ will stand for the expectation with respect to $P$, and we will write $\mathbb{E}(\cdot)$ when $P$ is clear from the context. Convergence in distribution will be denoted by $\xrightarrow{d}$. For two sequences $\{a_j\}_{j \ge 1} \subset \mathbb{R}$ and $\{b_j\}_{j \ge 1} \subset \mathbb{R}$, the expression $a_j \lesssim b_j$ means that there exists a constant $c > 0$ such that $a_j \le c\, b_j$ for all $j \in \mathbb{N}$.
Absolute constants will be denoted $c, C, c_1$, etc., and may take different values in different parts of the paper. For a function $f: \mathbb{R}^d \to \mathbb{R}$, we define

\[ \mathrm{argmin}_{z \in \mathbb{R}^d} f(z) = \{ z \in \mathbb{R}^d : f(z) \le f(x) \text{ for all } x \in \mathbb{R}^d \}, \]

and $\|f\|_\infty := \mathrm{ess\,sup}\{ |f(x)| : x \in \mathbb{R}^d \}$. Finally,

\[ f'_+(x) = \lim_{t \downarrow 0} \frac{f(x+t) - f(x)}{t} \quad \text{and} \quad f'_-(x) = \lim_{t \uparrow 0} \frac{f(x+t) - f(x)}{t} \]

will denote the right and left derivatives of $f$, respectively (whenever these limits exist). Additional notation and auxiliary results are introduced on demand for the proofs in section 5.

1.4. Main results.

As we have argued above, existing guarantees for the estimator (2) are sensitive to the choice of $k$, the number of partitions. In the following sections, we demonstrate that these bounds are often suboptimal, and show that large values of $k$ often do not have a significant negative effect on the statistical performance of the resulting algorithms. The key observation underlying the subsequent exposition is the following: assume that the "local estimators" $\bar\theta_j$, $1 \le j \le k$, are asymptotically normal with asymptotic mean equal to $\theta_\ast$. In particular, the distributions of the $\bar\theta_j$'s are approximately symmetric, with $\theta_\ast$ being the center of symmetry. The location parameter of a symmetric distribution admits many robust estimators of the form (1), the sample median being a notable example. This intuition allows us to establish a parallel between the non-asymptotic deviation guarantees for distributed estimation procedures of the form (1) and the degree of symmetry of the "local" estimators, quantified by the rates of convergence in normal approximation. Results for the univariate case are presented in section 2, and extensions to the multivariate case are presented in section 3.

2. The univariate case.

We assume that X_1, . . .
, X_N is a collection of independent (but not necessarily identically distributed) $S$-valued random variables with distributions $P_1, \ldots, P_N$, respectively. The data are partitioned into disjoint groups $G_1, \ldots, G_k$ of cardinalities $n_j := \mathrm{card}(G_j)$, so that $\sum_{j=1}^k n_j = N$. Let $\bar\theta_j := \bar\theta_j(G_j)$, $1 \le j \le k$, be a sequence of independent estimators of the parameter $\theta_\ast \in \mathbb{R}$ shared by $P_1, \ldots, P_N$. Our main assumption will be that $\bar\theta_1, \ldots, \bar\theta_k$ are asymptotically normal, as quantified by the following condition.

Assumption 1. Let $\Phi(t)$ be the cumulative distribution function of the standard normal random variable $Z \sim N(0, 1)$. For each $j = 1, \ldots, k$, there exists a sequence $\{\sigma^{(j)}_n\}_{n \in \mathbb{N}} \subset \mathbb{R}_+$ such that

\[ g_j(n_j) := \sup_{t \in \mathbb{R}} \left| P\left( \frac{\bar\theta_j - \theta_\ast}{\sigma^{(j)}_{n_j}} \le t \right) - \Phi(t) \right| \to 0 \quad \text{as } n_j \to \infty. \]

Clearly, the functions $g_j(n_j)$ control the rate of convergence of the estimators $\bar\theta_1, \ldots, \bar\theta_k$ to the normal law. Furthermore, let

\[ H_k := \left( \frac{1}{k} \sum_{j=1}^k \frac{1}{\sigma^{(j)}_{n_j}} \right)^{-1} \]

be the harmonic mean of the $\sigma^{(j)}_{n_j}$'s, and set $\alpha_j = H_k / \sigma^{(j)}_{n_j}$. Note that $\sum_{j=1}^k \alpha_j = k$, and that $\alpha_1 = \ldots = \alpha_k = 1$ if $\sigma^{(1)}_{n_1} = \ldots = \sigma^{(k)}_{n_k}$.

2.1. Merging procedure based on the median.

In this subsection, we establish guarantees for the merging procedure based on the sample median, namely, $\hat\theta^{(k)} = \mathrm{med}\big( \bar\theta_1, \ldots, \bar\theta_k \big)$. This case is treated separately due to its practical importance, the fact that we can obtain better numerical constants, and a conceptually simpler proof.

Theorem 1. Assume that $s > 0$ and $n_j = \mathrm{card}(G_j)$, $j = 1, \ldots, k$, are such that

\[ \left( \frac{1}{k} \sum_{i=1}^k g_i(n_i) + \sqrt{\frac{s}{k}} \right) \cdot \max_{j=1,\ldots,k} \alpha_j < \frac{1}{2}. \tag{4} \]

Moreover, let Assumption 1 be satisfied, and let $\zeta_j(n_j, s)$ solve the equation

\[ \Phi\left( \zeta_j(n_j, s) / \sigma^{(j)}_{n_j} \right) - \frac{1}{2} = \alpha_j \cdot \left( \frac{1}{k} \sum_{i=1}^k g_i(n_i) + \sqrt{\frac{s}{k}} \right). \]
Then for all $s$ satisfying (4),

\[ \big| \hat\theta^{(k)} - \theta_\ast \big| \le \zeta(s) := \max_{j=1,\ldots,k} \zeta_j(n_j, s) \]

with probability at least $1 - 4e^{-2s}$.

Proof. See section 5.2.

The following lemma yields a more explicit form of the bound with numerical constants.

Lemma 1. Assume that $\left( \frac{1}{k} \sum_{i=1}^k g_i(n_i) + \sqrt{s/k} \right) \cdot \max_{j=1,\ldots,k} \alpha_j \le 0.33$. Then

\[ \zeta(s) \le 3 H_k \cdot \left( \frac{1}{k} \sum_{j=1}^k g_j(n_j) + \sqrt{\frac{s}{k}} \right). \]

Proof. See section 5.7.

Remark 1. Let $\bar\sigma^{(1)} \le \ldots \le \bar\sigma^{(k)}$ be the non-decreasing rearrangement of $\sigma^{(1)}_{n_1}, \ldots, \sigma^{(k)}_{n_k}$. It is easy to see that the harmonic mean $H_k$ of $\sigma^{(1)}_{n_1}, \ldots, \sigma^{(k)}_{n_k}$ satisfies

\[ H_k \le \frac{k}{\lfloor k/m \rfloor} \cdot \frac{1}{\lfloor k/m \rfloor} \sum_{j=1}^{\lfloor k/m \rfloor} \bar\sigma^{(j)} \]

for any integer $1 \le m \le k$; hence, informally speaking, the deviations of $\hat\theta^{(k)}$ are controlled by the smallest $\sigma^{(j)}_{n_j}$'s rather than the largest.

2.2. Example: new bounds for the median-of-means estimator.

The univariate mean estimation problem is pervasive in statistics, and serves as a building block of more advanced methods such as empirical risk minimization. Early works on robust mean estimation include Tukey's "trimmed mean" (Tukey and Harris, 1946), as well as the "winsorized mean" (Bickel et al., 1965); see also the discussion in (Bubeck et al., 2013). These techniques often produce estimators with significant bias. A different approach based on M-estimation was suggested by O. Catoni (Catoni, 2012); Catoni's estimator yields almost optimal constants; however, its construction requires additional information about the variance or the kurtosis of the underlying distribution. Moreover, its computation is not easily parallelizable, so this technique cannot easily be employed in the distributed setting. Here, we will focus on a fruitful idea commonly referred to as the "median-of-means" estimator, formally defined in equation (2) above. Several refinements
and extensions of this estimator to higher dimensions have recently been introduced by Minsker (2015); Hsu and Sabato (2013); Devroye et al. (2016); Joly et al. (2016); Lugosi and Mendelson (2017). Advantages of this method include the facts that it can be implemented in parallel and does not require prior knowledge of any parameters of the distribution (e.g., its variance). The following result for the median-of-means estimator is a corollary of Theorem 1; for brevity, we treat only the i.i.d. case. Recall that $n = \lfloor N/k \rfloor$ and $\mathrm{card}(G_j) \ge n$, $j = 1, \ldots, k$.

Corollary 1. Let $X_1, \ldots, X_N$ be a sequence of i.i.d. copies of a random variable $X \in \mathbb{R}$ such that $\mathbb{E}X = \theta_\ast$, $\mathrm{Var}(X) = \sigma^2$, $\mathbb{E}|X - \theta_\ast|^3 < \infty$, and set

\[ c_n = 0.4748\, \frac{\mathbb{E}|X - \theta_\ast|^3}{\sigma^3 \sqrt{n}}. \]

Then for all $s > 0$ such that $c_n + \sqrt{s/k} \le 0.33$, the estimator $\hat\theta^{(k)}$ defined in (2) satisfies

\[ \big| \hat\theta^{(k)} - \theta_\ast \big| \le \sigma \left( 1.43\, \frac{\mathbb{E}|X - \theta_\ast|^3 / \sigma^3}{n} + 3\sqrt{\frac{s}{kn}} \right) \]

with probability at least $1 - 4e^{-2s}$.

Remark 2. The term $1.43\, \sigma\, \frac{\mathbb{E}|X - \theta_\ast|^3/\sigma^3}{n}$ can be thought of as the "bias" due to the asymmetry of the distribution of the sample mean. Note that whenever $k \lesssim \sqrt{N}$ (so that $n \gtrsim \sqrt{N}$), the right-hand side of the inequality above is of order $(kn)^{-1/2} \simeq N^{-1/2}$.

Proof. It follows from the Berry–Esseen theorem (Fact 1 in section 5.1) that Assumption 1 is satisfied with $\sigma^{(1)}_n = \ldots = \sigma^{(k)}_n = \frac{\sigma}{\sqrt{n}}$ and $g_j(n) \le c_n = 0.4748\, \frac{\mathbb{E}|X - \theta_\ast|^3}{\sigma^3 \sqrt{n}}$ for all $j$. Lemma 1 implies that $\max_j \zeta_j(n, s) \le 3\frac{\sigma}{\sqrt{n}} \big( c_n + \sqrt{s/k} \big)$, and the claim follows from Theorem 1.

For distributions with an infinite third moment, the rate of convergence in the Berry–Esseen type bound is slower, and the following result holds instead.

Corollary 2. Let $X_1, \ldots, X_N$ be a sequence of i.i.d.
copies of a random variable $X \in \mathbb{R}$ such that $\mathbb{E}X = \theta_\ast$, $\mathrm{Var}(X) = \sigma^2$, and $\mathbb{E}|X - \theta_\ast|^{2+\delta} < \infty$ for some $\delta \in (0, 1]$. Then there exist absolute constants $c_1, c_2 > 0$ such that for all $s > 0$ and $k$ satisfying

\[ \frac{\mathbb{E}|X - \theta_\ast|^{2+\delta}}{\sigma^{2+\delta}\, n^{\delta/2}} + \sqrt{\frac{s}{k}} \le c_1, \]

the following inequality holds with probability at least $1 - 4e^{-2s}$:

\[ \big| \hat\theta^{(k)} - \theta_\ast \big| \le c_2\, \sigma \left( \frac{\mathbb{E}|X - \theta_\ast|^{2+\delta} / \sigma^{2+\delta}}{n^{\frac{1+\delta}{2}}} + \sqrt{\frac{s}{N}} \right). \]

In this case, typical deviations of $\hat\theta^{(k)}$ are still of order $N^{-1/2}$ as long as $k \lesssim N^{\delta/(1+\delta)}$. The proof of this result follows from Fact 2 in section 5.1 in the same way that Corollary 1 was deduced from the Berry–Esseen bound.

2.3. Example: distributed maximum likelihood estimation.

Let $X_1, \ldots, X_N$ be i.i.d. copies of a random vector $X \in \mathbb{R}^d$ with distribution $P_{\theta_\ast}$, where $\theta_\ast \in \Theta \subseteq \mathbb{R}$. Assume that for each $\theta \in \Theta$, $P_\theta$ is absolutely continuous with respect to a $\sigma$-finite measure $\mu$, and let $p_\theta = \frac{dP_\theta}{d\mu}$ be the corresponding density. In this section, we state sufficient conditions for Assumption 1 to be satisfied when $\bar\theta_1, \ldots, \bar\theta_k$ are the maximum likelihood estimators (van der Vaart, 1998) of $\theta_\ast$. The conditions stated below were obtained by Pinelis (2016). All derivatives below (denoted by $'$) are taken with respect to $\theta$, unless noted otherwise.
Assume that the log-likelihood function $\ell_x(\theta) = \log p_\theta(x)$ satisfies the following:

(1) $[\theta_\ast - \delta, \theta_\ast + \delta] \subseteq \Theta$ for some $\delta > 0$;
(2) "standard regularity conditions" that allow differentiation under the expectation: assume that $\mathbb{E}\,\ell'_X(\theta_\ast) = 0$, and that the Fisher information $\mathbb{E}\big( \ell'_X(\theta_\ast) \big)^2 = -\mathbb{E}\,\ell''_X(\theta_\ast) := I(\theta_\ast)$ is finite;
(3) $\mathbb{E}|\ell'_X(\theta_\ast)|^3 + \mathbb{E}|\ell''_X(\theta_\ast)|^3 < \infty$;
(4) for $\mu$-almost all $x$, $\ell_x(\theta)$ is three times differentiable for $\theta \in [\theta_\ast - \delta, \theta_\ast + \delta]$, and $\mathbb{E} \sup_{|\theta - \theta_\ast| \le \delta} |\ell'''_X(\theta)|^3 < \infty$;
(5) $P\big( |\bar\theta_1 - \theta_\ast| \ge \delta \big) \le c\,\gamma^n$ for some positive constant $c$ and $\gamma \in [0, 1)$.

In turn, condition (5) above is implied by the following two inequalities (see Pinelis, 2016, section 6.2, for a detailed discussion and examples):

(a) $H^2(\theta, \theta_\ast) \ge 2 - 2\big( 1 + c_0(\theta - \theta_\ast)^2 \big)^{-\gamma}$, where $H(\theta_1, \theta_2) = \Big( \int_{\mathbb{R}^d} \big( \sqrt{p_{\theta_1}} - \sqrt{p_{\theta_2}} \big)^2 \, d\mu \Big)^{1/2}$ is the Hellinger distance, and $c_0, \gamma$ are positive constants;
(b) $I(\theta) \le c_1 + c_2 |\theta|^\alpha$ for some positive constants $c_1, c_2, \alpha$ and all $\theta \in \Theta$.

Corollary 3. Assume that conditions (1)-(5) are satisfied, and that $\mathrm{card}(G_j) \ge n = \lfloor N/k \rfloor$, $j = 1, \ldots, k$. Then for all $s > 0$ such that $\frac{C}{\sqrt{n}} + c\gamma^n + \sqrt{s/k} \le 0.33$,

\[ \big| \hat\theta^{(k)} - \theta_\ast \big| \le \frac{3}{\sqrt{I(\theta_\ast)}} \left( \frac{C}{n} + \frac{c}{\sqrt{n}}\,\gamma^n + \sqrt{\frac{s}{kn}} \right) \]

with probability at least $1 - 4e^{-2s}$, where $C$ is a positive constant that depends only on $\{P_\theta\}_{\theta \in [\theta_\ast - \delta, \theta_\ast + \delta]}$.

Proof. It follows from the results in (Pinelis, 2016), in particular equation (5.5), that whenever conditions (1)-(5) hold, Assumption 1 is satisfied for all $j$ with $\sigma^{(j)}_n = (n I(\theta_\ast))^{-1/2}$, where $I(\theta_\ast)$ is the Fisher information, and $g_j(n) \le \frac{C}{\sqrt{n}} + c\gamma^n$, where $C$ is a constant that depends only on $\{P_\theta\}_{\theta \in [\theta_\ast - \delta, \theta_\ast + \delta]}$. Lemma 1 implies that $\max_{j=1,\ldots,k} \zeta_j(n, s) \le 3 (n I(\theta_\ast))^{-1/2} \big( \frac{C}{\sqrt{n}} + c\gamma^n + \sqrt{s/k} \big)$, and the claim follows from Theorem 1.

Remark 3.
The results of this section can be extended to other M-estimators besides the MLE, as Bentkus et al. (1997) have shown that M-estimators satisfy a variant of the Berry–Esseen bound under rather general conditions.

2.4. Merging procedures based on robust M-estimators.

In this subsection, we study the family of merging procedures based on the M-estimators

\[ \hat\theta^{(k)}_\rho := \mathrm{argmin}_{z \in \mathbb{R}} \sum_{j=1}^k \rho\big( z - \bar\theta_j \big). \tag{5} \]

The sample median $\mathrm{med}\big( \bar\theta_1, \ldots, \bar\theta_k \big)$ corresponds to the choice of the (non-smooth) $\rho(x) = |x|$ and was treated separately above; here, it will be assumed that $\rho$ is a convex, even, differentiable function such that $\rho(z) \to \infty$ as $|z| \to \infty$ and $\|\rho'\|_\infty < \infty$. A particular example of such a function is Huber's loss

\[ \rho_M(z) = \begin{cases} z^2/2, & |z| \le M, \\ M|z| - M^2/2, & |z| > M, \end{cases} \tag{6} \]

where $M$ is a positive constant. The following result quantifies the non-asymptotic performance of the estimator $\hat\theta^{(k)}_\rho$. As before, we set

\[ H_k = \left( \frac{1}{k} \sum_{i=1}^k \frac{1}{\sigma^{(i)}_{n_i}} \right)^{-1} \quad \text{and} \quad \alpha_j = \frac{H_k}{\sigma^{(j)}_{n_j}}, \tag{7} \]

where the $\sigma^{(j)}_n$'s are defined in Assumption 1. Moreover, given the loss $\rho$ as above, let $C_\rho > 0$ be such that $|\rho'(x)| \ge \frac{\|\rho'\|_\infty}{2}$ for $|x| > C_\rho$.

Theorem 2. Let Assumption 1 be satisfied, and suppose that $s > 0$ and $n_1, \ldots, n_k$ are such that

\[ \max_{j=1,\ldots,k} \left( \alpha_j\, e^{(C_\rho/\sigma^{(j)}_{n_j})^2} \right) \cdot \frac{1}{k} \sum_{i=1}^k \left( \sqrt{\frac{s}{k}} + 2 g_i(n_i) \right) \le 0.33. \tag{8} \]

Then for all $s$ satisfying (8),

\[ \big| \hat\theta^{(k)}_\rho - \theta_\ast \big| \le 3 H_k \max_{j=1,\ldots,k} e^{(C_\rho/\sigma^{(j)}_{n_j})^2} \cdot \frac{1}{k} \sum_{i=1}^k \left( \sqrt{\frac{s}{k}} + 2 g_i(n_i) \right) \tag{9} \]

with probability at least $1 - 4e^{-2s}$.

Proof. See section 5.3.

Note that the bound depends on $\rho$ only through $\max_{j=1,\ldots,k} e^{(C_\rho/\sigma^{(j)}_{n_j})^2}$. Assume for concreteness that $n_1 = \ldots = n_k = \lfloor N/k \rfloor$, and that $\rho = \rho_M$ is Huber's loss defined in (6), so that $C_\rho = M/2$.
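The merging step (5) with Huber's loss (6) can be solved by a short iteratively re-weighted least squares loop. The sketch below is ours, not the paper's (the function name, the MAD-based default calibration of $M$, and the numerical constants are all our own choices):

```python
import numpy as np

def huber_merge(theta_bar, M=None, n_iter=100):
    """Minimize sum_j rho_M(z - theta_bar_j) over z, where rho_M is Huber's
    loss, via iteratively re-weighted least squares (IRLS)."""
    theta_bar = np.asarray(theta_bar, dtype=float)
    if M is None:
        # Calibrate M to the scale of the local estimates using the median
        # absolute deviation; 0.6745 is approximately the 0.75-quantile of
        # the standard normal distribution.
        med = np.median(theta_bar)
        M = np.median(np.abs(theta_bar - med)) / 0.6745 + 1e-12
    z = np.median(theta_bar)                  # robust starting point
    for _ in range(n_iter):
        r = theta_bar - z
        w = np.minimum(1.0, M / np.maximum(np.abs(r), 1e-12))  # Huber weights psi(r)/r
        z = np.sum(w * theta_bar) / np.sum(w)
    return z
```

Because the weights are capped, a wildly anomalous local estimate $\bar\theta_j$ exerts only a bounded pull on the merged value, in contrast with plain averaging.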
For $\max_{j=1,\ldots,k} e^{(C_\rho/\sigma^{(j)}_{n_j})^2}$ to be bounded above by an absolute constant, one should choose $M$ to be of the order $\min_{j=1,\ldots,k} \sigma^{(j)}_{n_j}$. While the latter quantity is typically unknown, it can be estimated in some cases. For example, if the data are i.i.d., then $\sigma^{(j)}_{n_j} = \sqrt{\mathrm{Var}\big( \bar\theta_1 \big)}$ for all $j$. Since the $\bar\theta_j$'s are approximately normal, their standard deviation can be estimated by the median absolute deviation as

\[ \hat\sigma_{n_1} = \frac{1}{\Phi^{-1}(0.75)}\, \mathrm{med}\Big( \big| \bar\theta_1 - \mathrm{med}\big( \bar\theta_1, \ldots, \bar\theta_k \big) \big|, \ldots, \big| \bar\theta_k - \mathrm{med}\big( \bar\theta_1, \ldots, \bar\theta_k \big) \big| \Big), \]

where the factor $1/\Phi^{-1}(0.75)$ is introduced to make the estimator consistent (Hampel et al., 2011); another possibility is to use the bootstrap (Ghosh et al., 1984).

2.5. Asymptotic results.

In this section, we complement the previously discussed non-asymptotic deviation bounds for $\hat\theta^{(k)}_\rho$ with asymptotic results. For the benefit of clarity, we state the complete list of assumptions made below:

(a) $X_1, \ldots, X_N$ are i.i.d., $n = \lfloor N/k \rfloor$ and $\mathrm{card}(G_j) = n$, $j = 1, \ldots, k$; a result for non-identically distributed data is presented in Appendix A.
(b) Assumption 1 is satisfied for some function $g(n)$ (note that there is no dependence on the index $j$ due to the i.i.d. assumption).
(c) $k$ and $n$ are such that $k \to \infty$ and $\sqrt{k} \cdot g(n) \to 0$ as $N \to \infty$.
(d) $\rho$ is a convex, even function such that $\rho(z) \to \infty$ as $|z| \to \infty$ and $\|\rho'\|_\infty < \infty$ (here, $\rho'(x)$ is defined as the average of the right and left derivatives of $\rho$ at $x$).
(e) $\hat\theta^{(k)}_\rho$ is defined as

\[ \hat\theta^{(k)}_\rho := \mathrm{argmin}_{z \in \mathbb{R}} \sum_{j=1}^k \rho\left( \frac{z - \bar\theta_j}{\sigma_n} \right), \]

where $\sigma^{(1)}_n = \ldots = \sigma^{(k)}_n \equiv \sigma_n$ is the normalizing sequence from Assumption 1 (our definition of the estimator is slightly different from that in section 2.4, which allows us to keep $\rho$ fixed as $k$ and $n$ change).

For $z \in \mathbb{R}$, define $L(z) := \mathbb{E}\rho'(z + Z)$, where $Z \sim N(0, 1)$.
Note that, since $\rho$ is differentiable almost everywhere, $L(z) = \mathbb{E}\rho'_-(z + Z) = \mathbb{E}\rho'_+(z + Z)$.

Theorem 3. Under assumptions (a)-(e) above,

\[ \sqrt{k}\, \frac{\hat\theta^{(k)}_\rho - \theta_\ast}{\sigma_n} \xrightarrow{d} N(0, \Delta^2), \quad \text{where } \Delta^2 = \frac{\mathbb{E}\big( \rho'(Z) \big)^2}{\big( L'(0) \big)^2}. \]

Proof. See section 5.4.

For example, if $\rho(x) = |x|$, Theorem 3 implies that under appropriate assumptions, the median-of-means estimator $\hat\theta^{(k)}$ defined in (2) satisfies $\sqrt{N}\big( \hat\theta^{(k)} - \theta_\ast \big) \xrightarrow{d} N\big( 0, \frac{\pi}{2}\sigma^2 \big)$. Indeed, in this case $\sigma_n = \sigma/\sqrt{n}$, where $\sigma^2 = \mathrm{Var}(X_1)$, and

\[ \rho'(x) = \begin{cases} -1, & x < 0, \\ 0, & x = 0, \\ 1, & x > 0, \end{cases} \]

hence a simple calculation yields $\Delta^2 = 1/(L'(0))^2 = \pi/2$. If we consider the mean estimation problem with Huber's loss $\rho_M(x)$ (6) instead of $\rho(x) = |x|$, we similarly deduce that

\[ \rho'_M(x) = \begin{cases} -M, & x \le -M, \\ x, & |x| < M, \\ M, & x \ge M, \end{cases} \]

and we get the well-known (Huber, 1964) expression

\[ \Delta^2 = \frac{\int_{-M}^{M} x^2\, d\Phi(x) + 2M^2\big( 1 - \Phi(M) \big)}{\big( 2\Phi(M) - 1 \big)^2}; \]

in particular, $\Delta^2 \to 1$ as $M \to \infty$, and the convergence is fast. For instance, $\Delta^2 \simeq 1.15$ for $M = 2$ and $\Delta^2 \simeq 1.01$ for $M = 3$.

Remark 4. The key assumptions in the list (a)-(e) governing the regime of growth of $k$ and $n$ are (b) and (c). For instance, if the random variables possess finite moments of order $(2 + \delta)$ for some $\delta \in (0, 1]$, then it follows from Fact 2 in section 5.1 that $\sqrt{k}\, g(n) \to 0$ if $k = o\big( N^{\frac{\delta}{1+\delta}} \big)$ as $N \to \infty$.

2.6. Connections to U-quantiles.

In this section, we discuss connections of the proposed algorithms to U-quantiles, and revisit the assumption requiring the groups $G_1, \ldots, G_k$ to be disjoint. We assume that the data $X_1, \ldots, X_N$ are i.i.d. with common distribution $P$, and let $\theta_\ast = \theta_\ast(P) \in \mathbb{R}$ be a real-valued parameter of interest. It is clear that the estimators produced by the distributed algorithms considered above depend on the random partition of the sample.
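This dependence is easy to observe numerically; the sketch below (ours, with a fixed seed and hypothetical names) computes the median-of-means of one fixed sample under two different random partitions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_t(df=3, size=10_000)         # one fixed sample, true mean 0
k, n = 100, 100

def mom_under_partition(X, perm):
    """Median-of-means for the random partition encoded by the permutation."""
    return np.median(X[perm].reshape(k, n).mean(axis=1))

est1 = mom_under_partition(X, rng.permutation(len(X)))
est2 = mom_under_partition(X, rng.permutation(len(X)))
# est1 and est2 generally differ: the estimator is a function of the partition.
```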
A natural way to avoid such dependence is to consider the U-quantile (in this case, the median)

\[ \tilde\theta^{(k)} = \mathrm{med}\big( \bar\theta_J,\ J \in A^{(n)}_N \big), \]

where $A^{(n)}_N := \{ J : J \subseteq \{1, \ldots, N\},\ \mathrm{card}(J) = n := \lfloor N/k \rfloor \}$ is the collection of all distinct subsets of $\{1, \ldots, N\}$ of cardinality $n$, and $\bar\theta_J := \bar\theta(X_j, j \in J)$ is an estimator of $\theta_\ast$ based on $\{X_j, j \in J\}$. For instance, when $\mathrm{card}(J) = 2$ and $\bar\theta_J = \frac{1}{\mathrm{card}(J)} \sum_{j \in J} X_j$, $\tilde\theta^{(k)}$ is the well-known Hodges–Lehmann estimator of the location parameter, see (Hodges and Lehmann, 1963; Lehmann and D'Abrera, 2006); for a comprehensive study of U-quantiles, see (Arcones, 1996). The main result of this section is an analogue of Theorem 1 for the estimator $\tilde\theta^{(k)}$; it implies that the theoretical guarantees for the performance of $\tilde\theta^{(k)}$ are at least as good as those for the estimator $\hat\theta^{(k)}$. Since the data are i.i.d., it is enough to impose Assumption 1 on $\bar\theta(X_1, \ldots, X_n)$ only; hence we drop the index $j$ and denote the normalizing sequence by $\{\sigma_n\}_{n \in \mathbb{N}}$ and the corresponding error function by $g(n)$.

Theorem 4. Assume that $s > 0$ and $n = \lfloor N/k \rfloor$ are such that

\[ g(n) + \sqrt{\frac{s}{k}} < \frac{1}{2}. \tag{10} \]

Moreover, let Assumption 1 be satisfied, and let $\zeta(n, s)$ solve the equation

\[ \Phi\big( \zeta(n, s) \big) = \frac{1}{2} + g(n) + \sqrt{\frac{s}{k}}. \]

Then for any $s$ satisfying (10),

\[ \big| \tilde\theta^{(k)} - \theta_\ast \big| \le \sigma_n\, \zeta(n, s) \]

with probability at least $1 - 4e^{-2s}$.

Proof. See section 5.5.

As before, a more explicit form of the bound immediately follows from Lemma 1. A drawback of the estimator $\tilde\theta^{(k)}$ is the fact that its exact computation requires the evaluation of $\binom{N}{n}$ estimators $\bar\theta_J$ over the subsamples $\big\{ \{X_j, j \in J\},\ J \in A^{(n)}_N \big\}$. For large $N$ and $n$, such a task becomes intractable. However, an approximate result can be obtained by choosing $\ell$ subsets $J_1, \ldots, J_\ell$ from $A^{(n)}_N$ uniformly at random, and setting $\tilde\theta^{(k)}_\ell := \mathrm{med}\big( \bar\theta_{J_1}, \ldots$
, ¯ θ J ` . T ypically , the error e θ ( k ) ` − e θ ( k ) is of order ` − 1 / 2 with high probabilit y ov er the random dra w of J 1 , . . . , J ` . W e note that Theorem 2 admits a similar extension for the estimator defined as e θ ( k ) ρ := argmin z ∈ R X J ∈A ( n ) N ρ z − ¯ θ J . Namely , if the data are i.i.d., then under the assumptions of section 2.4 , e θ ( k ) ρ − θ ∗ ≤ 3 e ( C ρ /σ n ) 2 · σ n r s k + 2 g ( n ) (11) with probability at least 1 − 4 e − 2 s , whenever s > 0 and n = b N /k c are suc h that e ( C ρ /σ n ) 2 r s k + 2 g ( n ) ≤ 0 . 33 . W e omit the pro of of ( 11 ) since the required mo difications in the argumen t of Theorem 2 are exactly the same as those explained in the pro of of Theorem 4 . 14 S. Minsker and N. Stra wn 3. Estimation in higher dimensions. In this section, it will be assumed that θ ∗ ∈ R m , m ≥ 2, is a v ector-v alued parameter of in terest. Let X 1 , . . . , X N b e indep endent S -v alued random v ariables that are randomly partitioned into disjoin t groups G 1 , . . . , G k of cardinality n = b N /k c each. Let ¯ θ j := ¯ θ j ( G j ) ∈ R m , 1 ≤ j ≤ k b e a sequence of estimators of θ ∗ , the common parameter of the distributions of X j ’s. Assume that ρ 1 , . . . , ρ m are conv ex, even functions such that ρ i ( z ) → ∞ as | z | → ∞ and k ρ 0 i k ∞ < ∞ , with ρ 0 i ( x ) defined as the a v erage of the righ t and left deriv atives of ρ i , i = 1 , . . . , m , and let b θ ( k ) := argmin z ∈ R m k X j =1 m X i =1 ρ i z i − ¯ θ j,i , (12) where z = ( z 1 , . . . , z m ) and ¯ θ j = ( ¯ θ j, 1 , . . . , ¯ θ j,m ) for 1 ≤ j ≤ k . F or the sak e of clarity , w e will assume b elo w that X 1 , . . . , X N are i.i.d. Ho w ev er, results can be easily extended to the case of non-iden tically distributed data in a manner describ ed in section 2.4 . Assumption 1 will b e required to hold co ordinatewise, namely , w e will assume that there exist sequences { σ n,i } n ∈ N ⊂ R + , i = 1 , . . . 
, m , such that g m ( n ) := max i =1 ,...,m sup t ∈ R P ¯ θ 1 ,i − θ ∗ σ n,i ≤ t − Φ( t ) → 0 as n → ∞ . Note that the maxim um ov er the second index j disapp ears due to the i.i.d. assumption. Theorem 5. L et C ρ i > 0 b e such that | ρ 0 + ,i ( x ) | ≥ k ρ 0 + ,i k ∞ 2 and | ρ 0 − ,i ( x ) | ≥ k ρ 0 − ,i k ∞ 2 for | x | > C ρ i , i = 1 , . . . , m . L et assumption 1 hold for e ach c o or dinate of ¯ θ 1 , and supp ose that s > 0 and n = b N /k c ar e such that max i =1 ,...,m e ( C ρ i /σ n,i ) 2 r s k + 2 g m ( n ) ≤ 0 . 33 . (13) Then for al l s satisfying ( 13 ) and al l 1 ≤ i ≤ m simultane ously, b θ ( k ) i − θ ∗ ,i ≤ 3 e ( C ρ i /σ n,i ) 2 · σ n,i r s k + 2 g m ( n ) (14) with pr ob ability at le ast 1 − 4 me − 2 s . Proof. See section 5.6 . 3.1. Example: multiv ariate median-of-means estimator . Consider the sp ecial case of Theorem 5 when θ ∗ = E X is the mean of X ∈ R m , ¯ θ j ( X ) := 1 | G j | P X i ∈ G j X i is the sample mean ev aluated ov er the subsample G j , and ρ i ( x ) = | x | for all i . In this case, b θ ( k ) b ecomes the spatial median with resp ect to the L 1 -norm, namely , b θ ( k ) := argmin z ∈ R m k X j =1 z − ¯ θ j 1 . (15) Distributed Statistical Estimation 15 The problem of finding the mean estimator that admits sub-Gaussian concentration around E X under weak momen t assumptions on the underlying distribution has re- cen tly b een inv estigated in sev eral w orks. F or instance, Joly et al. ( 2016 ) construct an estimator that admits “almost optimal” b eha vior under the assumption that the entries of X p ossess 4 moments. Recently , Lugosi and Mendelson ( 2017 , 2018 ) prop osed new estimators that attains optimal b ounds and requires existence of only 2 momen ts. More sp ecifically , the aforemen tioned pap ers show that, for an y s such that 2 N < e − s < 1, there exists an estimator ˆ θ ( s ) suc h that with probability at least 1 − C 1 e − s , ˆ θ ( s ) − θ ∗ 2 ≤ C 2 r tr (Σ) N + r s λ max (Σ) N ! 
, where C 1 , C 2 > 0 are n umerical constants, Σ is the cov ariance matrix of X , tr (Σ) is its trace and λ max (Σ) - its largest eigen v alue. How ever, construction of these estimators explicitly dep ends on the desired confidence level s , and (more imp ortan tly) they are n umerically difficult to compute. On the other hand, Theorem 5 demonstrates that p erformance of the multiv ariate median-of-means estimator is robust with resp ect to the c hoice of the n umber of sub- groups k , and the resulting deviation b ounds hold simultaneously ov er the range of confidence parameter s whenever the co ordinates of X p ossess 2 + δ moments for some δ > 0. The following corollary summarizes these claims. Corollar y 4. L et X 1 , . . . , X N b e i.i.d. r andom ve ctors such that θ ∗ = E X 1 is the unknown me an, Σ = E ( X 1 − θ ∗ )( X 1 − θ ∗ ) T is the c ovarianc e matrix, σ 2 i = Σ i,i , and max i =1 ,...,m E | X 1 ,i | 2+ δ < ∞ for some δ ∈ (0 , 1] . Then ther e exist absolute c onstants c 1 , c 2 > 0 such that for al l s > 0 and k satisfying r s k + max i =1 ,...,m E | X 1 ,i − θ ∗ ,i | 3 σ 3 i √ n ≤ c 1 , with pr ob ability at le ast 1 − 4 me − 2 s for al l i = 1 , . . . , m simultane ously, b θ ( k ) i − θ ∗ ,i ≤ c 2 σ i max i =1 ,...,m E | X 1 ,i − θ ∗ ,i | 2+ δ /σ 2+ δ i n 1+ δ 2 + r s N ! . Proof. It follo ws from fact 2 in section 5.1 that g m ( n ) can b e b ounded as g m ( n ) ≤ A max i =1 ,...,m E | X 1 ,i − θ ∗ ,i | 2+ δ σ 2+ δ i n δ / 2 for an absolute constan t A > 0. Moreo v er, it is easy to see that C ρ i = 0 for all i and that assumption 1 holds with σ n,i = σ i √ n . Now the claim immediately follows from Theorem 5 . Remark 5. Estimator ( 15 ) admits a natur al gener alization of the form b θ ( k ) ρ, k·k ◦ := ar gmin z ∈ R m k X j =1 ρ z − ¯ θ j ◦ , (16) 16 S. Minsker and N. Stra wn wher e k · k ◦ is a norm in R m and ρ is a c onvex, non-de cr e asing function. 
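In the simplest special case $\rho(x) = x$ and $\|\cdot\|_\circ = \|\cdot\|_1$, the minimizer of (16) is just the coordinatewise median of the group means (estimator (15) computed coordinate by coordinate) and requires no iterative optimization. The following sketch (ours, not part of the paper; the function name and parameter choices are illustrative) makes the construction concrete:

```python
import numpy as np

def median_of_means(X, k, seed=0):
    """Coordinatewise median-of-means: split the N samples into k disjoint
    random groups of size n = N // k, average within each group, and return
    the coordinatewise median of the k group means (the minimizer of (15))."""
    rng = np.random.default_rng(seed)
    N, m = X.shape
    n = N // k
    idx = rng.permutation(N)[: n * k]                    # random disjoint groups G_1, ..., G_k
    group_means = X[idx].reshape(k, n, m).mean(axis=1)   # the k group means (k, m)
    return np.median(group_means, axis=0)                # coordinatewise median
```

For heavy-tailed data, e.g. i.i.d. Student-t samples with 3 degrees of freedom (so that the coordinates possess $2+\delta$ moments only for $\delta < 1$), the estimate concentrates around the true mean in the manner Corollary 4 describes.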
For example, if $\|\cdot\|_\circ$ is the Euclidean norm, the resulting estimator is invariant with respect to orthogonal transformations. However, the available performance guarantees for this estimator hold under stronger assumptions (such as joint asymptotic normality of the coordinates of the $\bar\theta_j$'s instead of coordinatewise asymptotic normality) and exhibit suboptimal dependence on the dimension; these results, along with a discussion of relevant numerical methods, are presented in Appendix C. A complete characterization of the effect of the norm $\|\cdot\|_\circ$ on the geometry of the problem and the performance of the corresponding estimator (16) warrants further study.

4. Simulation results.

We illustrate the results of the previous sections with numerical simulations that compare the performance of the median-of-means estimator with the usual sample mean; see Figure 2 below. Moreover, we compared the theoretical guarantees for the median-of-means estimator (described in Section 2.2) against the empirical outcomes for the Lomax distribution with shape parameter $\alpha = 4$ and scale parameter $\lambda = 1$; the corresponding probability density function is
$$p(x) = \frac{\alpha}{\lambda}\left(1 + \frac{x}{\lambda}\right)^{-(\alpha+1)} \quad\text{for } x\ge 0.$$
In particular, the Lomax distribution with $\alpha = 4$ and $\lambda = 1$ has mean $1/3$ and median $\sqrt[4]{2} - 1 \approx 0.1892$. Since the mean and median do not coincide, the error of the median-of-means estimator has a significant bias component for large values of $k$. Figure 3 depicts the impact of the bias beyond $k = \sqrt{N}$ (equivalently, $\log_N k = 1/2$), and also the fact that the median error is mostly flat for $k < \sqrt{N}$. Finally, we assessed the empirical coverage of the confidence intervals constructed using Theorem 3 and centered at the median-of-means estimator; the results are presented in Figure 4.
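The Lomax part of these experiments can be reproduced along the following lines (a minimal sketch, not the authors' simulation code; only the parameter values $\alpha = 4$, $\lambda = 1$ and $N = 10^5$ are taken from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
N, alpha, lam = 10**5, 4.0, 1.0
true_mean = lam / (alpha - 1)        # = 1/3 for alpha = 4, lambda = 1

# Lomax(alpha, lam) samples via inverse transform:
# F(x) = 1 - (1 + x/lam)^(-alpha), so x = lam * (V**(-1/alpha) - 1), V ~ U(0, 1]
V = 1.0 - rng.random(N)
x = lam * (V ** (-1.0 / alpha) - 1.0)

def median_of_means(x, k):
    """Median of the k group means over disjoint groups of size n = N // k."""
    n = len(x) // k
    return np.median(x[: n * k].reshape(k, n).mean(axis=1))

# The error is roughly flat for k below sqrt(N) and bias-dominated beyond it,
# since the median of the group means drifts toward the Lomax median ~ 0.1892.
errors = {k: abs(median_of_means(x, k) - true_mean) for k in (10, 316, 10000)}
```

Plotting `errors` against $\log_N k$ reproduces the qualitative shape of Figure 3: a flat regime followed by a bias-driven increase once $k$ exceeds $\sqrt{N}$.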
The sample of size $N = 10^5$ was generated from the half-t distribution with 3 degrees of freedom; recall that a random variable $\xi$ has the half-t distribution with $\nu$ degrees of freedom if $\xi \stackrel{d}{=} |\eta|$, where $\eta$ has the usual t-distribution with $\nu$ degrees of freedom. It is clear that the half-t distribution is both asymmetric and heavy-tailed. Each sample was further corrupted by outliers sampled from the normal distribution with mean 0 and standard deviation $10^5$; the number of outliers ranged from 0 to $\sqrt{N} = 100$ with increments of 20. The median-of-means estimator was constructed with $k = \sqrt{N} = 100$. For comparison, we present the empirical coverage levels attained by the sample mean in the same framework.

5. Proofs

In this section, we present the proofs of the main results.

5.1. Preliminaries.

We recall several facts that are used in the proofs below. The following bound was established by A. Berry (Berry, 1941) and C.-G. Esseen (Esseen, 1942); the version with an explicit constant given below is due to Shevtsova (2011).

Fact 1 (Berry–Esseen bound). Assume that $Y_1,\dots,Y_n$ is a sequence of i.i.d. copies of a random variable $Y$ with mean $\mu$, variance $\sigma^2$, and such that $\mathbb{E}|Y|^3 < \infty$. Then
$$\sup_{s\in\mathbb{R}}\left| P\left(\sqrt{n}\,\frac{\bar Y - \mu}{\sigma} \le s\right) - \Phi(s)\right| \le 0.4748\,\frac{\mathbb{E}|Y-\mu|^3}{\sigma^3\sqrt{n}},$$
where $\bar Y = \frac1n\sum_{j=1}^n Y_j$ and $\Phi(s)$ is the cumulative distribution function of the standard normal random variable.

The following generalization of the Berry–Esseen bound is due to Petrov (1995).

Fact 2 (Generalization of the Berry–Esseen bound). Assume that $Y_1,\dots,Y_n$ is a sequence of i.i.d. copies of a random variable $Y$ with mean $\mu$, variance $\sigma^2$, and such that $\mathbb{E}|Y|^{2+\delta} < \infty$ for some $\delta\in(0,1]$. Then there exists an absolute constant $A > 0$ such that
$$\sup_{s\in\mathbb{R}}\left| P\left(\sqrt{n}\,\frac{\bar Y - \mu}{\sigma} \le s\right) - \Phi(s)\right| \le A\,\frac{\mathbb{E}|Y-\mu|^{2+\delta}}{\sigma^{2+\delta}\, n^{\delta/2}}.$$

Next, we recall a well-known concentration inequality.
Fact 3 (Bounded difference inequality). Let $X_1,\dots,X_k$ be i.i.d. random variables, and assume that $Z = g(X_1,\dots,X_k)$, where $g$ is such that for all $j = 1,\dots,k$ and all $x_1,\dots,x_{j-1},x_j,x_j',x_{j+1},\dots,x_k$,
$$\left| g(x_1,\dots,x_{j-1},x_j,x_{j+1},\dots,x_k) - g(x_1,\dots,x_{j-1},x_j',x_{j+1},\dots,x_k)\right| \le c_j.$$
Then
$$P(Z - \mathbb{E}Z \ge t) \le \exp\left(-\frac{2t^2}{\sum_{j=1}^k c_j^2}\right) \quad\text{and}\quad P(Z - \mathbb{E}Z \le -t) \le \exp\left(-\frac{2t^2}{\sum_{j=1}^k c_j^2}\right).$$

Finally, we recall the definition of a U-statistic. Let $h : \mathbb{R}^n\mapsto\mathbb{R}$ be a measurable function of $n$ variables, and let $A^{(n)}_N := \{J : J\subseteq\{1,\dots,N\},\ \operatorname{card}(J) = n\}$. A U-statistic of order $n$ with kernel $h$ based on the i.i.d. sample $X_1,\dots,X_N$ is defined as (Hoeffding, 1948)
$$U_N(h) = \frac{1}{\binom{N}{n}}\sum_{J\in A^{(n)}_N} h(X_j,\ j\in J).$$
Clearly, $\mathbb{E}U_N(h) = \mathbb{E}h(X_1,\dots,X_n)$; moreover, $U_N(h)$ has the smallest variance among all unbiased estimators. The following analogue of Fact 3 holds for U-statistics:

Fact 4 (Concentration inequality for U-statistics, (Hoeffding, 1963)). Assume that the kernel $h$ satisfies $|h(x_1,\dots,x_n)| \le M$ for all $x_1,\dots,x_n$. Then for all $t > 0$,
$$P\left(\left|U_N(h) - \mathbb{E}U_N(h)\right| \ge t\right) \le 2\exp\left(-\frac{2\lfloor N/n\rfloor\, t^2}{M^2}\right).$$

5.2. Proof of Theorem 1.

Observe that $\widehat\theta^{(k)} - \theta_* = \operatorname{med}\left(\bar\theta_1 - \theta_*,\dots,\bar\theta_k - \theta_*\right)$. Let $\Phi^{(n_j,j)}(\cdot)$ be the distribution function of $\bar\theta_j - \theta_*$, $j = 1,\dots,k$, and let $\widehat\Phi_k(\cdot)$ be the empirical distribution function corresponding to the sample $W_1 = \bar\theta_1 - \theta_*,\dots,W_k = \bar\theta_k - \theta_*$, that is,
$$\widehat\Phi_k(z) = \frac1k\sum_{j=1}^k I\{W_j \le z\}.$$
Suppose that $z\in\mathbb{R}$ is fixed, and note that $\widehat\Phi_k(z)$ is a function of the random variables $W_1,\dots,W_k$ with $\mathbb{E}\widehat\Phi_k(z) = \frac1k\sum_{j=1}^k\Phi^{(n_j,j)}(z)$. Moreover, the hypothesis of the bounded difference inequality (Fact 3) is satisfied with $c_j = 1/k$ for $j = 1,\dots,k$, and therefore it implies that
$$\left|\widehat\Phi_k(z) - \frac1k\sum_{j=1}^k\Phi^{(n_j,j)}(z)\right| \le \sqrt{\frac{s}{k}} \qquad (17)$$
with probability $\ge 1 - 2e^{-2s}$ over the draw of $W_1,\dots,W_k$. Let $z_1\ge z_2$ be such that
$$\frac1k\sum_{j=1}^k\Phi^{(n_j,j)}(z_1) \ge \frac12 + \sqrt{\frac{s}{k}} \quad\text{and}\quad \frac1k\sum_{j=1}^k\Phi^{(n_j,j)}(z_2) \le \frac12 - \sqrt{\frac{s}{k}}.$$
Applying (17) with $z = z_1$ and $z = z_2$ together with the union bound, we see that for $j = 1, 2$,
$$\left|\widehat\Phi_k(z_j) - \frac1k\sum_{i=1}^k\Phi^{(n_i,i)}(z_j)\right| \le \sqrt{\frac{s}{k}}$$
on an event $E$ of probability $\ge 1 - 4e^{-2s}$. It follows that on $E$, $\widehat\Phi_k(z_1) \ge 1/2$ and $1 - \widehat\Phi_k(z_2) \ge 1/2$ simultaneously, hence
$$\operatorname{med}(W_1,\dots,W_k)\in[z_2, z_1] \qquad (18)$$
by the definition of the median.

It remains to estimate $z_1$ and $z_2$. Assumption 1 implies that
$$\frac1k\sum_{j=1}^k\Phi^{(n_j,j)}(z_1) \ge \frac1k\sum_{j=1}^k\Phi\left(\frac{z_1}{\sigma^{(j)}_{n_j}}\right) - \frac1k\sum_{j=1}^k\left|\Phi^{(n_j,j)}(z_1) - \Phi\left(\frac{z_1}{\sigma^{(j)}_{n_j}}\right)\right| \ge \frac1k\sum_{j=1}^k\Phi\left(\frac{z_1}{\sigma^{(j)}_{n_j}}\right) - \frac1k\sum_{j=1}^k g_j(n_j).$$
Hence, it suffices to find $z_1$ such that $\frac1k\sum_{j=1}^k\Phi\left(\frac{z_1}{\sigma^{(j)}_{n_j}}\right) \ge \frac12 + \sqrt{\frac{s}{k}} + \frac1k\sum_{j=1}^k g_j(n_j)$. Recall that
$$\alpha_j = \frac{1/\sigma^{(j)}_{n_j}}{\frac1k\sum_{i=1}^k 1/\sigma^{(i)}_{n_i}}, \quad j = 1,\dots,k,$$
and let $\zeta_j(n_j,s)$ be the solution of the equation
$$\Phi\left(\zeta_j(n_j,s)/\sigma^{(j)}_{n_j}\right) - \frac12 = \alpha_j\cdot\left(\frac1k\sum_{i=1}^k g_i(n_i) + \sqrt{\frac{s}{k}}\right).$$
Note that $\zeta_j(n_j,s)$ always exists since $\alpha_j\cdot\left(\frac1k\sum_{i=1}^k g_i(n_i) + \sqrt{s/k}\right) < \frac12$ by assumption. Finally, since $\sum_{j=1}^k\alpha_j = k$, it is clear that any $z_1 \ge \max_{j=1,\dots,k}\zeta_j(n_j,s)$ satisfies the requirements. Similarly,
$$\frac1k\sum_{j=1}^k\Phi^{(n_j,j)}(z_2) \le \frac1k\sum_{j=1}^k\Phi\left(\frac{z_2}{\sigma^{(j)}_{n_j}}\right) + \frac1k\sum_{j=1}^k\left|\Phi^{(n_j,j)}(z_2) - \Phi\left(\frac{z_2}{\sigma^{(j)}_{n_j}}\right)\right| \le \frac1k\sum_{j=1}^k\Phi\left(\frac{z_2}{\sigma^{(j)}_{n_j}}\right) + \frac1k\sum_{j=1}^k g_j(n_j)$$
by Assumption 1; hence it is sufficient to choose $z_2$ such that $z_2 \le \min_{j=1,\dots,k}\tilde\zeta_j(n_j,s)$, where $\tilde\zeta_j(n_j,s)$ satisfies
$$\Phi\left(\tilde\zeta_j(n_j,s)/\sigma^{(j)}_{n_j}\right) - \frac12 = -\alpha_j\cdot\left(\frac1k\sum_{i=1}^k g_i(n_i) + \sqrt{\frac{s}{k}}\right).$$
Noting that $\tilde\zeta_j(n_j,s) = -\zeta_j(n_j,s)$ and recalling (18), we conclude that
$$\left|\widehat\theta^{(k)} - \theta_*\right| \le \max_{j=1,\dots,k}\zeta_j(n_j,s)$$
with probability at least $1 - 4e^{-2s}$.

5.3. Proof of Theorem 2.

We will use the same notation as in the proof of Theorem 1. Clearly, $\widehat\theta^{(k)}_\rho$ satisfies the equation $G(\widehat\theta^{(k)}_\rho) = 0$, where
$$G(z) = \frac1k\sum_{j=1}^k\rho'\left(z - \bar\theta_j\right).$$
Suppose $z_1, z_2$ are such that $G(z_1) > 0$ and $G(z_2) < 0$. Since $G$ is increasing, it is easy to see that $\widehat\theta^{(k)}_\rho\in(z_2, z_1)$. To find such $z_1$ and $z_2$, we proceed in 3 steps.

(a) First, observe that the bounded difference inequality (Fact 3) implies that for any fixed $z\in\mathbb{R}$,
$$\left|\frac1k\sum_{j=1}^k\left(\rho'\left(z - \bar\theta_j\right) - \mathbb{E}\rho'\left(z - \bar\theta_j\right)\right)\right| \le \left\|\rho'\right\|_\infty\sqrt{\frac{s}{k}}$$
with probability $\ge 1 - 2e^{-2s}$.

(b) Next, we will find an upper bound for
$$\left|\frac1k\sum_{j=1}^k\left(\mathbb{E}\rho'\left(z - \bar\theta_j\right) - \mathbb{E}\rho'\left(z - Z_j\right)\right)\right|,$$
where $Z_j\sim N\left(\theta_*, \left(\sigma^{(j)}_{n_j}\right)^2\right)$, $j = 1,\dots,k$, are independent. Note that for any bounded non-negative function $f : \mathbb{R}\mapsto\mathbb{R}_+$ and a signed measure $Q$,
$$\left|\int_{\mathbb{R}} f(x)\,dQ\right| = \left|\int_0^{\|f\|_\infty} Q\left(x : f(x)\ge t\right)dt\right| \le \|f\|_\infty\max_{t\ge 0}\left|Q\left(x : f(x)\ge t\right)\right|.$$
Since any bounded function $f$ can be written as $f = \max(f,0) - \max(-f,0)$, we deduce that
$$\left|\int_{\mathbb{R}} f(x)\,dQ\right| \le \|f\|_\infty\left(\max_{t\ge 0}\left|Q(x : f(x)\ge t)\right| + \max_{t\le 0}\left|Q(x : f(x)\le t)\right|\right).$$
Moreover, if $f$ is monotone, the sets $\{x : f(x)\ge t\}$ and $\{x : f(x)\le t\}$ are half-intervals. Applying this to $f = \rho'$ and $Q = \Phi^{(n_j,j)} - \Phi$, we deduce that
$$\left|\frac1k\sum_{j=1}^k\left(\mathbb{E}\rho'\left(z - \bar\theta_j\right) - \mathbb{E}\rho'(z - Z_j)\right)\right| \le 2\|\rho'\|_\infty\,\frac1k\sum_{j=1}^k\sup_{t\in\mathbb{R}}\left|\Phi^{(n_j,j)}(t) - \Phi(t)\right| \le 2\|\rho'\|_\infty\,\frac1k\sum_{j=1}^k g_j(n_j)$$
by Assumption 1.

(c) It remains to find $z_1$ satisfying
$$\frac1k\sum_{j=1}^k\mathbb{E}\rho'\left(z_1 - \theta_* - (Z_j - \theta_*)\right) > \left\|\rho'\right\|_\infty\left(\sqrt{\frac{s}{k}} + \frac2k\sum_{i=1}^k g_i(n_i)\right).$$
Let $\tilde z_1 := z_1 - \theta_*$ and $\tilde Z_j := Z_j - \theta_*$.
Since $\sum_{j=1}^k\alpha_j = k$ (where the $\alpha_j$'s were defined in (7)), it suffices to find $z_1$ such that
$$\mathbb{E}\rho'\left(\tilde z_1 - \tilde Z_j\right) > \alpha_j\|\rho'\|_\infty\left(\sqrt{\frac{s}{k}} + \frac2k\sum_{i=1}^k g_i(n_i)\right) \quad\text{for all } j.$$
For any bounded function $h$ such that $h(-x) = -h(x)$ and $h(x)\ge 0$ for $x\ge 0$, and any $z\ge 0$,
$$\int_{\mathbb{R}} h(x+z)\varphi(x)\,dx = \int_0^\infty h(x)\left(\varphi(x-z) - \varphi(-x-z)\right)dx \ge 0,$$
where $\varphi(x) = (2\pi)^{-1/2}e^{-x^2/2}$. Recall that $C_\rho > 0$ is such that $|\rho'(x)| \ge \|\rho'\|_\infty/2$ for $|x|\ge C_\rho$. It follows that
$$\mathbb{E}\rho'\left(\tilde z_1 - \tilde Z_j\right) \ge \frac12\|\rho'\|_\infty\,\mathbb{E}\left(I\{\tilde z_1 - \tilde Z_j\ge C_\rho\} - I\{\tilde z_1 - \tilde Z_j\le -C_\rho\}\right) = \frac12\|\rho'\|_\infty\left(P\left(\tilde Z_j\ge C_\rho - \tilde z_1\right) - P\left(\tilde Z_j\le -C_\rho - \tilde z_1\right)\right) = \frac12\|\rho'\|_\infty\, P\left(Z\in\left[\frac{C_\rho - \tilde z_1}{\sigma^{(j)}_{n_j}}, \frac{C_\rho + \tilde z_1}{\sigma^{(j)}_{n_j}}\right]\right),$$
where $Z\sim N(0,1)$. Next, Lemma 3 implies that
$$P\left(Z\in\left[\frac{C_\rho - \tilde z_1}{\sigma^{(j)}_{n_j}}, \frac{C_\rho + \tilde z_1}{\sigma^{(j)}_{n_j}}\right]\right) \ge 2\,e^{-\left(C_\rho/\sigma^{(j)}_{n_j}\right)^2} P\left(Z\in\left[0, \tilde z_1/\sigma^{(j)}_{n_j}\right]\right).$$
Combining the previous two bounds, we deduce that it suffices to find $\tilde z_1 > 0$ such that
$$P\left(Z\in\left[0, \tilde z_1/\sigma^{(j)}_{n_j}\right]\right) \ge \alpha_j\, e^{\left(C_\rho/\sigma^{(j)}_{n_j}\right)^2}\left(\sqrt{\frac{s}{k}} + \frac2k\sum_{i=1}^k g_i(n_i)\right).$$
By our assumptions, $\max_{j=1,\dots,k}\alpha_j\, e^{\left(C_\rho/\sigma^{(j)}_{n_j}\right)^2}\left(\sqrt{s/k} + \frac2k\sum_{i=1}^k g_i(n_i)\right) \le 0.33$. Lemma 1 yields that it suffices to take
$$\tilde z_1 = z_1 - \theta_* = 3 H_k\max_{j=1,\dots,k} e^{\left(C_\rho/\sigma^{(j)}_{n_j}\right)^2}\left(\sqrt{\frac{s}{k}} + \frac2k\sum_{i=1}^k g_i(n_i)\right).$$
The estimate for $z_2$ follows the same pattern and yields that one can take
$$z_2 = \theta_* - 3 H_k\max_{j=1,\dots,k} e^{\left(C_\rho/\sigma^{(j)}_{n_j}\right)^2}\left(\sqrt{\frac{s}{k}} + \frac2k\sum_{i=1}^k g_i(n_i)\right),$$
implying the claim.

5.4. Proof of Theorem 3.

Recall that $L(z) = \mathbb{E}\rho'(z + Z)$ for $Z\sim N(0,1)$, and note that under our assumptions the equation $L(z) = 0$ has the unique solution $z = 0$ (even if $\rho$ is not strictly convex). Next, observe that
$$P\left(\sum_{j=1}^k\rho'\left(\frac{\theta_* - \bar\theta_j}{\sigma_n} + \frac{t\Delta}{\sqrt k}\right) < 0\right) \le P\left(\frac{\sqrt k}{\sigma_n}\left(\widehat\theta^{(k)}_\rho - \theta_*\right) \ge t\Delta\right) \le P\left(\sum_{j=1}^k\rho'\left(\frac{\theta_* - \bar\theta_j}{\sigma_n} + \frac{t\Delta}{\sqrt k}\right) \le 0\right),$$
hence it suffices to show that both the left-hand side and the right-hand side of the inequality above converge to $1 - \Phi(t)$ for all $t$. We will outline the argument for the left-hand side; the remaining part is proven in a similar fashion. Note that
$$P\left(\sum_{j=1}^k Y_{n,j} < 0\right) = P\left(\frac{\sum_{j=1}^k\left(Y_{n,j} - \mathbb{E}Y_{n,j}\right)}{\sqrt{k\operatorname{Var}(Y_{n,1})}} < -\frac{\sqrt k\,\mathbb{E}Y_{n,1}}{\sqrt{\operatorname{Var}(Y_{n,1})}}\right), \qquad (19)$$
where $Y_{n,j} = \rho'\left(\frac{\theta_* - \bar\theta_j}{\sigma_n} + \frac{t\Delta}{\sqrt k}\right)$.

Lemma 2. Under the assumptions of Theorem 3,
$$\sqrt k\,\mathbb{E}Y_{n,1} \to t\Delta\, L'(0) \quad\text{and}\quad \sqrt{\operatorname{Var}(Y_{n,1})} \to \sqrt{\mathbb{E}\left(\rho'(Z)\right)^2} = \Delta\cdot L'(0) \quad\text{as } N\to\infty,$$
where $Z\sim N(0,1)$.

Proof (of Lemma 2). Let $Z\sim N(0,1)$. Since $\rho$ is convex, its derivative $\rho' := (\rho'_+ + \rho'_-)/2$ is monotone and continuous almost everywhere (with respect to Lebesgue measure). Together with the assumption that $\|\rho'\|_\infty < \infty$, the Lebesgue dominated convergence theorem implies that
$$\frac{d}{dz}L(z)\Big|_{z=0} = \lim_{h\to 0}\frac{1}{h\sqrt{2\pi}}\int_{\mathbb{R}}\rho'(x+h)\,e^{-x^2/2}\,dx = \lim_{h\to 0}\frac{1}{h\sqrt{2\pi}}\int_{\mathbb{R}}\rho'(x)\,e^{-(x-h)^2/2}\,dx = \frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}} x\rho'(x)\,e^{-x^2/2}\,dx. \qquad (20)$$
Next, we will prove the assertion that $\sqrt k\,\mathbb{E}Y_{n,1} \to t\Delta\, L'(0)$. It is easy to see that
$$\sqrt k\,\mathbb{E}Y_{n,1} = \sqrt k\left(\mathbb{E}\rho'\left(\frac{\theta_* - \bar\theta_1}{\sigma_n} + \frac{t\Delta}{\sqrt k}\right) - \mathbb{E}\rho'\left(Z + \frac{t\Delta}{\sqrt k}\right)\right) + t\Delta\cdot\frac{1}{t\Delta/\sqrt k}\Big(\mathbb{E}\rho'\left(Z + \frac{t\Delta}{\sqrt k}\right) - \underbrace{\mathbb{E}\rho'(Z)}_{=0}\Big).$$
Reasoning as in the proof of Theorem 2 (see step (b) in Section 5.3), we deduce that
$$\left|\mathbb{E}\rho'\left(\frac{\theta_* - \bar\theta_1}{\sigma_n} + \frac{t\Delta}{\sqrt k}\right) - \mathbb{E}\rho'\left(Z + \frac{t\Delta}{\sqrt k}\right)\right| \le 2\left\|\rho'\right\|_\infty g(n),$$
where $g(n)$ is the function from Assumption 1. Hence, recalling that $g(n)\sqrt k\to 0$ as $N\to\infty$, we obtain that
$$\sqrt k\left(\mathbb{E}\rho'\left(\frac{\theta_* - \bar\theta_1}{\sigma_n} + \frac{t\Delta}{\sqrt k}\right) - \mathbb{E}\rho'\left(Z + \frac{t\Delta}{\sqrt k}\right)\right) \to 0 \quad\text{as } N\to\infty.$$
On the other hand, it follows from (20) that for $t\ne 0$,
$$t\Delta\cdot\frac{1}{t\Delta/\sqrt k}\,\mathbb{E}\rho'\left(Z + \frac{t\Delta}{\sqrt k}\right) \xrightarrow[N\to\infty]{} t\Delta\cdot L'(0).$$
For $t = 0$, the claim also holds since $\mathbb{E}\rho'(Z) = 0$.
To establish the fact that $\sqrt{\operatorname{Var}(Y_{n,1})}\to\sqrt{\mathbb{E}(\rho'(Z))^2}$, note that the weak convergence of $\frac{\bar\theta_1 - \theta_*}{\sigma_n}$ to the normal law (Assumption 1), together with the Lebesgue dominated convergence theorem, implies that
$$\mathbb{E}\rho'\left(\frac{\theta_* - \bar\theta_1}{\sigma_n} + \frac{t\Delta}{\sqrt k}\right) \to \mathbb{E}\rho'(Z) = 0, \qquad \mathbb{E}\left(\rho'\left(\frac{\theta_* - \bar\theta_1}{\sigma_n} + \frac{t\Delta}{\sqrt k}\right)\right)^2 \to \mathbb{E}\left(\rho'(Z)\right)^2.$$
Since $L'(0) > 0$, we deduce that $\mathbb{E}^{1/2}\left(\rho'(Z)\right)^2 = \Delta\cdot L'(0)$, and the claim follows.

Lemma 2 implies that $-\frac{\sqrt k\,\mathbb{E}Y_{n,1}}{\sqrt{\operatorname{Var}(Y_{n,1})}} \xrightarrow[N\to\infty]{} -t$. It remains to apply Lindeberg's central limit theorem (Serfling, 1981, Theorem 1.9.3) to the $Y_{n,j}$'s to deduce the result from equation (19). To this end, we only need to verify the Lindeberg condition requiring that for any $\varepsilon > 0$,
$$\mathbb{E}\left[\left(Y_{n,1} - \mathbb{E}Y_{n,1}\right)^2 I\left\{\left|Y_{n,1} - \mathbb{E}Y_{n,1}\right| \ge \varepsilon\sqrt k\right\}\right] \to 0 \quad\text{as } k\to\infty. \qquad (21)$$
However, since $\rho'(\cdot)$ (and hence $Y_{n,1}$) is bounded, (21) easily follows.

5.5. Proof of Theorem 4.

The argument is similar to the proof of Theorem 1. Let $\Phi^{(n)}(\cdot)$ be the distribution function of $\frac{\bar\theta_1 - \theta_*}{\sigma_n}$, and let $\widehat\Phi_{\binom Nn}(\cdot)$ be the empirical distribution function corresponding to the sample $\left\{W_J = \frac{\bar\theta_J - \theta_*}{\sigma_n},\ J\in A^{(n)}_N\right\}$ of size $\binom Nn$. Suppose that $z\in\mathbb{R}$ is fixed, and note that $\widehat\Phi_{\binom Nn}(z)$ is a U-statistic with mean $\Phi^{(n)}(z)$. We will apply the concentration inequality for U-statistics (Fact 4) with $M = 1$ to get that
$$\left|\widehat\Phi_{\binom Nn}(z) - \Phi^{(n)}(z)\right| \le \sqrt{\frac{s}{\lfloor N/n\rfloor}} \le \sqrt{\frac{s}{k}} \qquad (22)$$
with probability $\ge 1 - 2e^{-2s}$; here, we also used the fact that $n = \lfloor N/k\rfloor$. Let $z_1\ge z_2$ be such that $\Phi^{(n)}(z_1)\ge\frac12 + \sqrt{s/k}$ and $\Phi^{(n)}(z_2)\le\frac12 - \sqrt{s/k}$. Applying (22) with $z = z_1$ and $z = z_2$ together with the union bound, we see that for $j = 1,2$,
$$\left|\widehat\Phi_{\binom Nn}(z_j) - \Phi^{(n)}(z_j)\right| \le \sqrt{\frac{s}{k}}$$
on an event $E$ of probability $\ge 1 - 4e^{-2s}$. It follows that on $E$, $\operatorname{med}\left(W_J,\ J\in A^{(n)}_N\right)\in[z_2, z_1]$. The rest of the proof repeats the argument of Section 5.2.

5.6. Proof of Theorem 5.

Set $F(z) := \sum_{j=1}^k\sum_{i=1}^m\rho_i\left(z_i - \bar\theta_{j,i}\right)$. Then $\widehat\theta^{(k)} = \operatorname{argmin}_{z\in\mathbb{R}^m} F(z)$ by definition. Since $F$ is convex, the necessary and sufficient condition for $\widehat\theta^{(k)}$ to be its minimizer is that $0\in\partial F(\widehat\theta^{(k)})$, the subdifferential of $F$ at the point $\widehat\theta^{(k)}$. It is easy to see that
$$\partial F(z) = \left\{u\in\mathbb{R}^m : \sum_{j=1}^k\rho'_{-,i}\left(z_i - \bar\theta_{j,i}\right) \le u_i \le \sum_{j=1}^k\rho'_{+,i}\left(z_i - \bar\theta_{j,i}\right),\ i = 1,\dots,m\right\},$$
where
$$\rho'_{+,i}(x) := \lim_{t\searrow 0}\frac{\rho_i(x+t) - \rho_i(x)}{t} \quad\text{and}\quad \rho'_{-,i}(x) := \lim_{t\nearrow 0}\frac{\rho_i(x+t) - \rho_i(x)}{t}$$
are the right and left derivatives of $\rho_i$ at the point $x$, respectively. Since the subdifferential is convex, it suffices to find points $z_{i,1}, z_{i,2}$, $i = 1,\dots,m$, such that for all $i$,
$$\sum_{j=1}^k\rho'_{+,i}\left(z_{i,1} - \bar\theta_{j,i}\right) \ge 0, \qquad \sum_{j=1}^k\rho'_{-,i}\left(z_{i,2} - \bar\theta_{j,i}\right) \le 0. \qquad (23)$$
This task has already been accomplished in the proof of Theorem 2: since $\rho'_{+,i}, \rho'_{-,i}$, $i = 1,\dots,m$, are non-decreasing functions, repeating the argument of Section 5.3 yields that, on an event of probability $\ge 1 - 4e^{-2s}$, the inequalities (23) hold with
$$z_{i,1} = \theta_{*,i} + 3\sigma_{n,i}\, e^{(C_{\rho_i}/\sigma_{n,i})^2}\left(\sqrt{\frac{s}{k}} + 2g_m(n)\right), \qquad z_{i,2} = \theta_{*,i} - 3\sigma_{n,i}\, e^{(C_{\rho_i}/\sigma_{n,i})^2}\left(\sqrt{\frac{s}{k}} + 2g_m(n)\right). \qquad (24)$$
We have thus shown that for each $i = 1,\dots,m$,
$$\left|\widehat\theta^{(k)}_i - \theta_{*,i}\right| \le 3\, e^{(C_{\rho_i}/\sigma_{n,i})^2}\cdot\sigma_{n,i}\left(\sqrt{\frac{s}{k}} + 2g_m(n)\right)$$
with probability $\ge 1 - 4e^{-2s}$. Applying the union bound over all $i$, we obtain the result.

5.7. Proof of Lemma 1.

It is a simple numerical fact that whenever
$$\alpha_j\cdot\left(\frac1k\sum_{j=1}^k g_j(n_j) + \sqrt{\frac{s}{k}}\right) \le 0.33,$$
we have $\zeta_j(n_j,s)/\sigma^{(j)}_{n_j} \le 1$ (indeed, this follows since $\Phi(1)\simeq 0.8413 > 1/2 + 0.33$). Set $B(s) := \frac1k\sum_{j=1}^k g_j(n_j) + \sqrt{s/k}$ for brevity. Since $e^{-y^2/2}\ge 1 - \frac{y^2}{2}$, we have
$$\sqrt{2\pi}\,\alpha_j\cdot B(s) = \int_0^{\zeta_j(n_j,s)/\sigma^{(j)}_{n_j}} e^{-y^2/2}\,dy \ge \frac{\zeta_j(n_j,s)}{\sigma^{(j)}_{n_j}} - \frac16\left(\frac{\zeta_j(n_j,s)}{\sigma^{(j)}_{n_j}}\right)^3 \ge \frac56\,\frac{\zeta_j(n_j,s)}{\sigma^{(j)}_{n_j}}, \qquad (25)$$
where the last inequality follows since $\zeta_j(n_j,s)/\sigma^{(j)}_{n_j} \le 1$. Equation (25) implies that $\frac{\zeta_j(n_j,s)}{\sigma^{(j)}_{n_j}} \le \frac65\sqrt{2\pi}\,\alpha_j B(s)$. Proceeding again as in (25), we see that
$$\sqrt{2\pi}\,\alpha_j B(s) \ge \frac{\zeta_j(n_j,s)}{\sigma^{(j)}_{n_j}} - \frac16\left(\frac{\zeta_j(n_j,s)}{\sigma^{(j)}_{n_j}}\right)^3 \ge \frac{\zeta_j(n_j,s)}{\sigma^{(j)}_{n_j}} - \frac{12\pi}{25}\,\alpha_j^2\left(B(s)\right)^2\,\frac{\zeta_j(n_j,s)}{\sigma^{(j)}_{n_j}} \ge \frac{\zeta_j(n_j,s)}{\sigma^{(j)}_{n_j}}\left(1 - 1.51\,\alpha_j^2\left(B(s)\right)^2\right),$$
hence
$$\frac{\zeta_j(n_j,s)}{\sigma^{(j)}_{n_j}} \le \frac{\sqrt{2\pi}}{1 - 1.51\,\alpha_j^2\left(B(s)\right)^2}\,\alpha_j B(s).$$
The claim follows since $\alpha_j B(s)\le 0.33$ for all $j$ by assumption, and $\sigma^{(j)}_{n_j}\alpha_j \equiv H_k$.

Acknowledgements

The authors would like to thank Anatoli Juditsky for many insightful comments and suggestions.

References

Alon, N., Matias, Y. and Szegedy, M. (1996) The space complexity of approximating the frequency moments. In Proceedings of the twenty-eighth annual ACM symposium on Theory of computing, 20–29. ACM.
Arcones, M. A. (1996) The Bahadur–Kiefer representation for U-quantiles. The Annals of Statistics, 24, 1400–1422.
Battey, H., Fan, J., Liu, H., Lu, J. and Zhu, Z. (2015) Distributed estimation and inference with statistical guarantees. arXiv preprint arXiv:1509.05457.
Bentkus, V. (2003) On the dependence of the Berry–Esseen bound on dimension. Journal of Statistical Planning and Inference, 113, 385–402.
Bentkus, V., Bloznelis, M. and Götze, F. (1997) A Berry–Esséen bound for M-estimators. Scandinavian Journal of Statistics, 24, 485–502.
Berry, A. C. (1941) The accuracy of the Gaussian approximation to the sum of independent variates. Transactions of the American Mathematical Society, 49, 122–136.
Bickel, P. J. et al. (1965) On some robust estimates of location. The Annals of Mathematical Statistics, 36, 847–858.
Bubeck, S., Cesa-Bianchi, N. and Lugosi, G. (2013) Bandits with heavy tail.
IEEE Transactions on Information Theory, 59, 7711–7717.
Cardot, H., Cénac, P., Zitt, P.-A. et al. (2013) Efficient and fast estimation of the geometric median in Hilbert spaces with an averaged stochastic gradient algorithm. Bernoulli, 19, 18–43.
Catoni, O. (2012) Challenging the empirical mean and empirical variance: a deviation study. In Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, vol. 48, 1148–1185. Institut Henri Poincaré.
Cheng, G. and Shang, Z. (2015) Computational limits of divide-and-conquer method. arXiv preprint arXiv:1512.09226.
Cohen, M. B., Lee, Y. T., Miller, G., Pachocki, J. and Sidford, A. (2016) Geometric median in nearly linear time. arXiv preprint arXiv:1606.05225.
Devroye, L., Lerasle, M., Lugosi, G., Oliveira, R. I. et al. (2016) Sub-Gaussian mean estimators. The Annals of Statistics, 44, 2695–2725.
Duchi, J. C., Jordan, M. I., Wainwright, M. J. and Zhang, Y. (2014) Optimality guarantees for distributed statistical estimation. arXiv preprint arXiv:1405.0782.
Dudley, R. M. (1978) Central limit theorems for empirical measures. The Annals of Probability, 899–929.
Esseen, C.-G. (1942) On the Liapounoff limit of error in the theory of probability. Almqvist & Wiksell.
Fan, J., Han, F. and Liu, H. (2014) Challenges of Big Data analysis. National Science Review, 1, 293–314.
Fan, J., Wang, D., Wang, K. and Zhu, Z. (2017) Distributed estimation of principal eigenspaces. arXiv preprint arXiv:1702.06488.
Ghosh, M., Parr, W. C., Singh, K., Babu, G. J. et al. (1984) A note on bootstrapping the sample median. The Annals of Statistics, 12, 1130–1135.
Giné, E. and Nickl, R. (2015) Mathematical foundations of infinite-dimensional statistical models, vol. 40. Cambridge University Press.
Haldane, J. B. S. (1948) Note on the median of a multivariate distribution. Biometrika, 35, 414–417.
Hampel, F. R., Ronchetti, E.
M., Rousseeuw, P. J. and Stahel, W. A. (2011) Robust statistics: the approach based on influence functions, vol. 196. John Wiley & Sons.
Haussler, D. (1995) Sphere packing numbers for subsets of the Boolean n-cube with bounded Vapnik–Chervonenkis dimension. Journal of Combinatorial Theory, Series A, 69, 217–232.
Hodges, J. L. and Lehmann, E. L. (1963) Estimates of location based on rank tests. The Annals of Mathematical Statistics, 598–611.
Hoeffding, W. (1948) A class of statistics with asymptotically normal distribution. The Annals of Mathematical Statistics, 293–325.
— (1963) Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58, 13–30.
Hsu, D. and Sabato, S. (2013) Loss minimization and parameter estimation with heavy tails. arXiv preprint arXiv:1307.1827.
— (2016) Loss minimization and parameter estimation with heavy tails. Journal of Machine Learning Research, 17, 1–40.
Huber, P. J. (1964) Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35, 73–101.
IBM (2015) What is Big Data? https://www-01.ibm.com/software/data/bigdata/what-is-big-data.html.
Jerrum, M. R., Valiant, L. G. and Vazirani, V. V. (1986) Random generation of combinatorial structures from a uniform distribution. Theoretical Computer Science, 43, 169–188.
Joly, E., Lugosi, G. and Oliveira, R. I. (2016) On the estimation of the mean of a random vector. arXiv preprint arXiv:1607.05421.
Jordan, M. (2013) On statistics, computation and scalability. Bernoulli, 19, 1378–1390.
Kärkkäinen, T. and Äyrämö, S. (2005) On computation of spatial median for robust data mining. Evolutionary and Deterministic Methods for Design, Optimization and Control with Applications to Industrial and Societal Problems, EUROGEN, Munich.
Kemperman, J.
(1987) The median of a finite measure on a Banach space. Statistical data analysis based on the L1-norm and related methods, 217–230.
Kuhn, H. W. (1973) A note on Fermat's problem. Mathematical Programming, 4, 98–107.
Lee, J. D., Sun, Y., Liu, Q. and Taylor, J. E. (2015) Communication-efficient sparse regression: a one-shot approach. arXiv preprint arXiv:1503.04337.
Lehmann, E. L. and D'Abrera, H. J. (2006) Nonparametrics: statistical methods based on ranks. Springer New York.
Lerasle, M. and Oliveira, R. I. (2011) Robust empirical mean estimators. arXiv preprint arXiv:1112.3914.
Li, C., Srivastava, S. and Dunson, D. B. (2016) Simple, scalable and accurate posterior interval estimation. arXiv preprint arXiv:1605.04029.
Liang, Y., Balcan, M.-F. F., Kanchanapally, V. and Woodruff, D. (2014) Improved distributed Principal Component Analysis. In Advances in Neural Information Processing Systems, 3113–3121.
Lugosi, G. and Mendelson, S. (2017) Sub-Gaussian estimators of the mean of a random vector. arXiv preprint arXiv:1702.00482.
— (2018) Near-optimal mean estimators with respect to general norms. arXiv preprint arXiv:1806.06233.
Mcdonald, R., Mohri, M., Silberman, N., Walker, D. and Mann, G. S. (2009) Efficient large-scale distributed training of conditional maximum entropy models. In Advances in Neural Information Processing Systems, 1231–1239.
Minsker, S., Srivastava, S., Lin, L. and Dunson, D. B. (2014) Robust and scalable Bayes via a median of subset posterior measures. arXiv preprint arXiv:1403.2660.
Minsker, S. (2015) Geometric median and robust estimation in Banach spaces. Bernoulli, 21, 2308–2335.
Nemirovski, A. and Yudin, D. (1983) Problem complexity and method efficiency in optimization. John Wiley & Sons Inc.
Ostresh, L. M. (1978) On the convergence of a class of iterative methods for solving the Weber location problem.
Operations Research, 26, 597–609.
Overton, M. L. (1983) A quadratically convergent method for minimizing a sum of Euclidean norms. Mathematical Programming, 27, 34–63.
Petrov, V. V. (1995) Limit theorems of probability theory: sequences of independent random variables. Oxford, New York.
Pinelis, I. (2016) Optimal-order bounds on the rate of convergence to normality for maximum likelihood estimators. arXiv preprint arXiv:1601.02177.
Pollard, D. (2000) Asymptopia: an exposition of statistical asymptotic theory. Available at http://www.stat.yale.edu/~pollard/Books/Asymptopia.
Rosenblatt, J. D. and Nadler, B. (2016) On the optimality of averaging in distributed statistical learning. Information and Inference, 5, 379–404.
Scott, S. L., Blocker, A. W., Bonassi, F. V., Chipman, H. A., George, E. I. and McCulloch, R. E. (2016) Bayes and big data: the consensus Monte Carlo algorithm. International Journal of Management Science and Engineering Management, 11, 78–88.
Serfling, R. J. (1981) Approximation theorems of mathematical statistics.
Shafieezadeh-Abadeh, S., Esfahani, P. M. and Kuhn, D. (2015) Distributionally robust logistic regression. In Advances in Neural Information Processing Systems, 1576–1584.
Shang, Z. and Cheng, G. (2015) A Bayesian splitotic theory for nonparametric models. arXiv preprint arXiv:1508.04175.
Shevtsova, I. (2011) On the absolute constants in the Berry–Esseen type inequalities for identically distributed summands. arXiv preprint arXiv:1111.6554.
Small, C. (1990) A survey of multidimensional medians. International Statistical Review, 58, 263–277.
Talagrand, M. (2005) The generic chaining. Springer.
Tukey, J. and Harris, T. (1946) Sampling from contaminated distributions. Ann. Math. Statist., 17, 501.
van der Vaart, A. W. (1998) Asymptotic statistics, vol.
3 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge: Cambridge University Press.
van der Vaart, A. W. and Wellner, J. A. (1996) Weak convergence and empirical processes. Springer Series in Statistics. New York: Springer-Verlag.
Vardi, Y. and Zhang, C.-H. (2000) The multivariate L1-median and associated data depth. Proceedings of the National Academy of Sciences, 97, 1423–1426.
Weiszfeld, E. (1936) Sur un problème de minimum dans l'espace. Tohoku Mathematical Journal.
Yang, T. and Lin, Q. (2015) RSG: Beating subgradient method without smoothness and strong convexity. arXiv preprint arXiv:1512.03107.
Zhang, Y., Duchi, J. and Wainwright, M. (2013) Divide and conquer kernel ridge regression. In Conference on Learning Theory, 592–617.
Zhang, Y., Wainwright, M. J. and Duchi, J. C. (2012) Communication-efficient algorithms for statistical optimization. In Advances in Neural Information Processing Systems, 1502–1510.
Zinkevich, M., Weimer, M., Li, L. and Smola, A. J. (2010) Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems, 2595–2603.

A. Central limit theorem for non-i.i.d. data.

We present an extension of Theorem 3 to non-i.i.d. data for the estimator $\hat\theta^{(k)} = \mathrm{med}\big(\bar\theta_1,\ldots,\bar\theta_k\big)$ that holds under the following assumptions:

(a) $X_1,\ldots,X_N$ are independent, $\mathrm{card}(G_j) = n_j$, and $\sum_{j=1}^k n_j = N$;
(b) Assumption 1 is satisfied with some $\{\sigma^{(j)}_n\}_{n\ge 1}$ and $g_j(n)$, $j = 1,\ldots,k$;
(c) $k\to\infty$ and $\max_{j=1,\ldots,k}\sqrt k\,g_j(n_j)\to 0$ as $N\to\infty$;
(d) $\max_{j\le k}\frac{H_k}{\sigma^{(j)}_{n_j}\sqrt k}\xrightarrow{N\to\infty} 0$, where $H_k := \Big(\frac1k\sum_{j=1}^k \frac{1}{\sigma^{(j)}_{n_j}}\Big)^{-1}$ is the harmonic mean of the $\sigma^{(j)}_{n_j}$'s.

Theorem 6. Under assumptions (a)–(d) above,
$$\sqrt k\,\frac{\hat\theta^{(k)}-\theta^*}{H_k} \xrightarrow{d} N\Big(0,\frac\pi2\Big).$$

Proof.
Define $d_-(x) := I\{x>0\} - I\{x\le 0\}$ and
$$Y_{n_j,j} = d_-\Big(\theta^* - \bar\theta_j + t\sqrt{\tfrac\pi2}\,\frac{H_k}{\sqrt k}\Big).$$
We will show that

(a) $\frac1k\sum_{j=1}^k \sqrt k\,\mathbb E Y_{n_j,j} \to t$ as $N\to\infty$;
(b) $\frac1k\sum_{j=1}^k \mathrm{Var}(Y_{n_j,j}) \to 1$ as $N\to\infty$.

To prove the first claim, first assume that $t\ne 0$ (for $t=0$ the argument follows the same lines with simplifications), and observe that
$$\sqrt k\,\mathbb E Y_{n_j,j} = \sqrt k\left(\mathbb E\,d_-\Big(\frac{\theta^*-\bar\theta_j}{\sigma^{(j)}_{n_j}} + t\sqrt{\tfrac\pi2}\,\frac{H_k}{\sigma^{(j)}_{n_j}\sqrt k}\Big) - \mathbb E\,d_-\Big(Z + t\sqrt{\tfrac\pi2}\,\frac{H_k}{\sigma^{(j)}_{n_j}\sqrt k}\Big)\right) + t\sqrt{\tfrac\pi2}\,\frac{H_k}{\sigma^{(j)}_{n_j}}\cdot\frac{1}{t\sqrt{\pi/2}\,\frac{H_k}{\sigma^{(j)}_{n_j}\sqrt k}}\Big(\mathbb E\,d_-\Big(Z + t\sqrt{\tfrac\pi2}\,\frac{H_k}{\sigma^{(j)}_{n_j}\sqrt k}\Big) - \underbrace{\mathbb E\,d_-(Z)}_{=0}\Big).$$
Moreover,
$$\left|\mathbb E\,d_-\Big(\frac{\theta^*-\bar\theta_j}{\sigma^{(j)}_{n_j}} + t\sqrt{\tfrac\pi2}\,\frac{H_k}{\sigma^{(j)}_{n_j}\sqrt k}\Big) - \mathbb E\,d_-\Big(Z + t\sqrt{\tfrac\pi2}\,\frac{H_k}{\sigma^{(j)}_{n_j}\sqrt k}\Big)\right| \le 2 g_j(n_j),$$
while under assumption (d),
$$\frac{1}{t\sqrt{\pi/2}\,\frac{H_k}{\sigma^{(j)}_{n_j}\sqrt k}}\Big(\mathbb E\,d_-\Big(Z + t\sqrt{\tfrac\pi2}\,\frac{H_k}{\sigma^{(j)}_{n_j}\sqrt k}\Big) - \underbrace{\mathbb E\,d_-(Z)}_{=0}\Big) \to \frac{2}{\sqrt{2\pi}} \quad\text{as } N\to\infty.$$
It then follows from assumption (c) that
$$\frac1k\sum_{j=1}^k\sqrt k\,\mathbb E Y_{n_j,j} - t\,\underbrace{H_k\,\frac1k\sum_{j=1}^k\frac{1}{\sigma^{(j)}_{n_j}}}_{=1} \to 0 \quad\text{as } N\to\infty.$$
Claim (b) follows since $\mathbb E Y_{n_j,j}^2 = 1$ and $\max_{j\le k}\big|\mathbb E Y_{n_j,j}\big| \to 0$ under assumption (d). The rest of the argument repeats the proof of Theorem 3 with $\rho(x) = |x|$.

B. Supplementary results.

Lemma 3. Let $A\subset\mathbb R$ be symmetric, meaning that $A = -A$, and let $Z\sim N(0,1)$. Then for all $x\in\mathbb R$,
$$P(Z\in A-x) \ge e^{-x^2/2}\,P(Z\in A).$$

Proof. Observe that
$$P(Z\in A) = \int_{\mathbb R} I\{z\in A\}\,\frac{1}{\sqrt{2\pi}}e^{-z^2/2}\,dz = e^{x^2/2}\int_{\mathbb R} I\{z\in A\}\,\sqrt{\frac{1}{\sqrt{2\pi}}e^{-(z-x)^2/2}}\,\sqrt{\frac{1}{\sqrt{2\pi}}e^{-(z+x)^2/2}}\,dz$$
$$\le e^{x^2/2}\sqrt{\int_{\mathbb R} I\{z\in A\}\frac{1}{\sqrt{2\pi}}e^{-(z-x)^2/2}\,dz}\;\sqrt{\int_{\mathbb R} I\{z\in A\}\frac{1}{\sqrt{2\pi}}e^{-(z+x)^2/2}\,dz}$$
$$= e^{x^2/2}\int_{\mathbb R} I\{z\in A\}\frac{1}{\sqrt{2\pi}}e^{-(z-x)^2/2}\,dz = e^{x^2/2}\,P(Z\in A-x),$$
where the inequality is Cauchy–Schwarz and the last equality uses the symmetry of $A$, and the claim follows.

Lemma 4. The inequality $\tanh(x) \ge x\,\frac{1+x}{1+x+x^2}$ holds for all $x\ge 0$. Moreover, if $\tanh(x)\le 1/2$ and $x\ge 0$, then $\tanh(x)\ge 0.83\,x$.

Proof.
Since $e^x \ge 1 + x + \frac{x^2}{2}$ for all $x\ge 0$, applying this with $2x$ in place of $x$ gives $e^{2x} \ge 1 + 2x + 2x^2$, hence
$$\tanh(x) = 1 - \frac{2}{1+e^{2x}} \ge 1 - \frac{1}{1+x+x^2} = x\,\frac{1+x}{1+x+x^2}.$$
Note that $f(x) = \frac{1+x}{1+x+x^2}$ is decreasing on $[0,\infty)$. Whenever $\tanh(x)\le 1/2$, we have $x \le \frac{\log 3}{2} \le 0.55$, hence $\tanh(x) \ge x\,f(0.55) \ge 0.83\,x$.

C. Results for the spatial median with respect to the $\|\cdot\|_2$ norm.

In this section, we discuss estimation of the multivariate parameter $\theta^*\in\mathbb R^m$ based on the $L_2$-median. Let $X_1,\ldots,X_N\in\mathbb R^d$ be i.i.d. copies of $X$, randomly partitioned into disjoint groups $G_1,\ldots,G_k$ of cardinality $n\ge\lfloor N/k\rfloor$ each, and let $\bar\theta_j := \bar\theta_j(G_j)\in\mathbb R^m$, $1\le j\le k$, be a sequence of i.i.d. estimators of $\theta^*$. We define
$$\hat\theta^{(k)} = \mathrm{med}_g\big(\bar\theta_1,\ldots,\bar\theta_k\big) := \mathop{\mathrm{argmin}}_{z\in\mathbb R^m}\sum_{j=1}^k\big\|z-\bar\theta_j\big\|_2 \qquad (26)$$
to be the $L_2$-median of $\bar\theta_1,\ldots,\bar\theta_k$. Let $Z\in\mathbb R^m$ have the multivariate normal distribution $N(0,\Sigma)$, and define $\Phi_\Sigma(A) := P(Z\in A)$ for a Borel measurable set $A\subseteq\mathbb R^m$. Moreover, define $\mathcal S_m$ to be the set of closed cones
$$\mathcal S_m = \big\{\,C_u(t;b) = \{x\in\mathbb R^m : \langle x-b,u\rangle \ge t\|x-b\|_2\},\ \|u\|_2 = 1,\ b\in\mathbb R^m,\ 0\le t\le 1\,\big\}. \qquad (27)$$
We will assume that $\bar\theta_1$ is "asymptotically normal on cones":

Assumption 2. There exist a sequence $\{\sigma_n\}_{n\in\mathbb N}\subset\mathbb R_+$ and a positive-definite matrix $\Sigma$ with $\|\Sigma\|\le 1$ such that
$$g_{\mathcal S_m}(n) := \sup_{S\in\mathcal S_m}\Big|P\Big(\frac{1}{\sigma_n}\big(\bar\theta_1-\theta^*\big)\in S\Big) - \Phi_\Sigma(S)\Big| \to 0 \quad\text{as } n\to\infty.$$

Theorem 7. Let Assumption 2 be satisfied. Then with probability $\ge 1-e^{-2s}$,
$$\tanh\Big(\frac{1}{\sigma_n}\big\|\hat\theta^{(k)}-\theta^*\big\|_2\Big) \le 26.8\,\big\|\Sigma^{-1/2}\big\|\left(\frac{C_1(m)}{\sqrt k} + C_2(m)\Big(\sqrt{\frac{s}{4k}} + g_{\mathcal S_m}(n)\Big)\right), \qquad (28)$$
where $C_1(m) = 6\sqrt{\log\big(4e^{5/2}\big)}\,(m+4)\sqrt{m+2\sqrt{(m-1)\ln 4}}$ and $C_2(m) = \sqrt{m+2\sqrt{(m-1)\ln 4}}$.

Remark 6. It follows from Lemma 4 that whenever the right-hand side of inequality (28) is bounded by $1/2$,
$$\tanh\Big(\frac{1}{\sigma_n}\big\|\hat\theta^{(k)}-\theta^*\big\|_2\Big) \ge 0.83\,\frac{1}{\sigma_n}\big\|\hat\theta^{(k)}-\theta^*\big\|_2,$$
which leads to a more explicit bound for $\|\hat\theta^{(k)}-\theta^*\|_2$.
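Since Remark 6 relies on the linearization $\tanh(x) \ge 0.83\,x$ from Lemma 4, a quick numerical sanity check of both inequalities may be reassuring (a sketch; the grid over $(0,5]$ and the tolerance are arbitrary choices of ours):

```python
import math

# Check tanh(x) >= x(1+x)/(1+x+x^2) for x >= 0, and
# tanh(x) >= 0.83*x whenever tanh(x) <= 1/2 (Lemma 4).
for i in range(1, 5001):
    x = i / 1000.0  # grid over (0, 5]
    assert math.tanh(x) >= x * (1 + x) / (1 + x + x * x) - 1e-12
    if math.tanh(x) <= 0.5:
        assert math.tanh(x) >= 0.83 * x
```

The second inequality is only claimed on the region $\tanh(x) \le 1/2$, i.e. $x \le \tfrac12\log 3 \approx 0.55$; outside it the linear bound eventually fails, which is why the check is conditional.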
As an example, we consider the problem of multivariate mean estimation. Recall that the condition number $\mathrm{cond}(A)$ of a non-singular matrix $A$ is defined as $\mathrm{cond}(A) = \|A\|\,\|A^{-1}\|$.

Corollary 5. Let $X_1,\ldots,X_N$ be a sequence of i.i.d. copies of a random vector $X\in\mathbb R^d$ such that $\mathbb E X = \theta^*$, $\mathbb E(X-\theta^*)(X-\theta^*)^T = \tilde\Sigma$, and $\mathbb E\|X-\theta^*\|_2^3 < \infty$. Define $\hat\theta^{(k)} = \mathrm{med}_g\big(\bar\theta_1,\ldots,\bar\theta_k\big)$. Assume that $s > 0$ and $k \le N/2$ are such that
$$\mathrm{cond}\big(\tilde\Sigma^{1/2}\big)\left(\frac{C_1(d)}{\sqrt k} + C_2(d)\left(\sqrt{\frac{s}{4k}} + \frac{400\,d^{1/4}\,\mathbb E\big\|\tilde\Sigma^{-1/2}(X-\theta^*)\big\|_2^3}{\sqrt n}\right)\right) \le 0.037.$$
Then
$$\big\|\hat\theta^{(k)}-\theta^*\big\|_2 \le 32.4\,\big\|\tilde\Sigma^{1/2}\big\|\,\mathrm{cond}\big(\tilde\Sigma^{1/2}\big)\left(\frac{C_1(d)}{\sqrt{kn}} + C_2(d)\left(\sqrt{\frac{s}{4kn}} + \frac{400\,d^{1/4}\,\mathbb E\big\|\tilde\Sigma^{-1/2}(X-\theta^*)\big\|_2^3}{n}\right)\right)$$
with probability $\ge 1-e^{-2s}$, where $C_1(d)$ and $C_2(d)$ are the same as in Theorem 7.

Proof. It follows from the multivariate Berry–Esseen bound (Fact 5) that Assumption 2 is satisfied with $\sigma_n = \sqrt{\|\tilde\Sigma\|/n}$, $\Sigma = \tilde\Sigma/\|\tilde\Sigma\|$ and $g_{\mathcal S_d}(n) = 400\,d^{1/4}\,\mathbb E\|\tilde\Sigma^{-1/2}(X-\theta^*)\|_2^3/\sqrt n$. Noting that $\|\Sigma^{-1/2}\| = \|\tilde\Sigma^{1/2}\|\,\|\tilde\Sigma^{-1/2}\| = \mathrm{cond}(\tilde\Sigma^{1/2})$, it is easy to deduce the bound from (28) and Remark 6.

Remark 7. Note that, similarly to the case $d = 1$, whenever $k \lesssim \sqrt N$ (hence $n \gtrsim \sqrt N$), the bound of Corollary 5 is of order $N^{-1/2}$ with respect to the sample size $N$. However, the dependence of the bound on the dimension $d$ is suboptimal.

C.1. Overview of numerical algorithms.

For $x_1,\ldots,x_k\in\mathbb R^d$, the function $F(z) := \sum_{j=1}^k \|z-x_j\|$ is convex, and it achieves its minimum at a unique point (unless $x_1,\ldots,x_k$ lie on the same line (Kemperman, 1987)) that belongs to the convex hull of $x_1,\ldots,x_k$. The classical algorithm that approximates $\mathop{\mathrm{argmin}}_z F(z)$ is the famous Weiszfeld's algorithm (Weiszfeld, 1936): starting from some $z_0$ in the affine hull of $x_1,\ldots,$
$x_k$, iterate
$$z_{m+1} = \sum_{j=1}^k \alpha^{(j)}_{m+1}\,x_j, \qquad (29)$$
where $\alpha^{(j)}_{m+1} = \frac{\|x_j-z_m\|^{-1}}{\sum_{i=1}^k\|x_i-z_m\|^{-1}}$. H. W. Kuhn (Kuhn, 1973) showed that Weiszfeld's algorithm converges to the geometric median for all but countably many initial points. It is easy to check that (29) is a gradient descent scheme: indeed, it is equivalent to $z_{m+1} = z_m - \beta_{m+1} g_{m+1}$, where $\beta_{m+1} = \big(\sum_{j=1}^k \|x_j-z_m\|^{-1}\big)^{-1}$ and $g_{m+1} = \sum_{j=1}^k \frac{z_m-x_j}{\|z_m-x_j\|}$ is the gradient of $F$ (we assume here that $z_m\notin\{x_1,\ldots,x_k\}$).

Various improvements and accelerated versions of Weiszfeld's algorithm have been proposed and analyzed. Ostresh (1978) provides a modified version of Weiszfeld's algorithm that converges to the geometric median under reasonable initialization conditions, but the rate of convergence is not specified. Kärkkäinen and Äyrämö (2005) study the empirical behavior of several modifications of Weiszfeld's algorithm and obtain convergence for an SOR method. Vardi and Zhang (2000) demonstrate convergence of another modified Weiszfeld algorithm, but only provide empirical convergence rates. Overton (1983) gives an algorithm that exhibits quadratic convergence under some assumptions, but a quantitative rate is not stated. Cardot et al. (2013) develop an online stochastic gradient descent algorithm and provide an asymptotic convergence rate. Quantitative error bounds are not available for any of the algorithms discussed so far.

The computer science literature considers the computational complexity of algorithms for computing $\tilde\theta^{(k)}$ such that $F(\tilde\theta^{(k)})$ is close to the minimum value $F(\hat\theta^{(k)})$. A thorough comparison of such results is provided by Cohen et al. (2016). The results of this line of work are fully quantitative, but they need to be adapted to our setting.
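A minimal implementation of iteration (29) might look as follows (a sketch, not the authors' code; the stopping tolerance, iteration cap, and the early return when an iterate lands on a data point — the situation flagged by Kuhn — are our own choices):

```python
import numpy as np

def weiszfeld(x, tol=1e-8, max_iter=1000):
    """Approximate the geometric median of the rows of x via iteration (29)."""
    z = x.mean(axis=0)  # starting point in the convex hull of the points
    for _ in range(max_iter):
        d = np.linalg.norm(x - z, axis=1)
        if np.any(d < 1e-12):   # iterate coincides with a data point; stop here
            return z
        w = 1.0 / d             # weights alpha_j are proportional to 1/||x_j - z||
        z_new = (w[:, None] * x).sum(axis=0) / w.sum()
        if np.linalg.norm(z_new - z) < tol:
            return z_new
        z = z_new
    return z
```

When the points lie on one line the minimizer need not be unique (Kemperman, 1987), in which case the routine simply returns one minimizer.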
In our statistical estimation setting, we use $\hat\theta^{(k)}$ to estimate the true parameter $\theta^*$, so we want bounds on the proximity $\|\tilde\theta^{(k)}-\hat\theta^{(k)}\|$ instead of bounds on the value $F(\tilde\theta^{(k)})$. The following theorem (proven in Section D.3) provides a "local lower bound."

Theorem 8. Suppose $\{x_i\}_{i=1}^k\subset\mathbb R^d$, let $\bar x = \frac1k\sum_{i=1}^k x_i$, set $m_t = \frac1k\sum_{i=1}^k\|x_i-\bar x\|^t$ for $t = 1,2,3$, and assume that the empirical covariance matrix $\hat\Sigma = \frac1k\sum_{i=1}^k(x_i-\bar x)(x_i-\bar x)^T$ satisfies
$$a := \sum_{j=2}^d \lambda_j(\hat\Sigma) > 0,$$
where $\lambda_j(\hat\Sigma)$ are the eigenvalues of $\hat\Sigma$ listed with multiplicity and in non-increasing order. Then, for all $\theta\in\mathbb R^d$,
$$\frac1k\big(F(\theta)-F(\hat\theta^{(k)})\big) \ge \frac12\,\frac{a\,\|\theta-\hat\theta^{(k)}\|^2}{b^2\big(\|\theta-\hat\theta^{(k)}\|+b\big)}, \qquad\text{where } b = \frac{20m_1^3 + 6m_1m_2 + m_3}{a}.$$

Theorem 8 allows us to infer proximity bounds from the results in the computer science literature that bound the value of $F$. Moreover, this bound is asymptotically stable in the i.i.d. sampling setting assuming the existence of three moments. For small $\|\theta-\hat\theta^{(k)}\|$, the lower bound is approximately quadratic, and hence the proximity bound behaves like $\sqrt{F(\theta)-F(\hat\theta^{(k)})}$. This local lower bound also fits well with the theory of Restarted Gradient Descent (Yang and Lin, 2015).

D. Proofs of results in Appendix C.

D.1. Technical background.

Everywhere below, $\Phi_\Sigma$ stands for the distribution of the normal vector with mean 0 and covariance matrix $\Sigma$. The following multivariate version of the Berry–Esseen theorem for convex sets has been established by Bentkus (2003).

Fact 5 (Multivariate Berry–Esseen bound). Assume that $Y_1,\ldots,Y_n$ is a sequence of i.i.d. copies of a random vector $Y\in\mathbb R^d$ with mean $\mu$, covariance matrix $\Sigma\succ 0$, and such that $\mathbb E\|Y\|_2^3 < \infty$. Let $Z$ have normal distribution $N(0,\Sigma)$, and let $\mathcal A$ be the class of all convex subsets of $\mathbb R^d$.
Then
$$\sup_{A\in\mathcal A}\Big|P\big(\sqrt n\,(\bar Y-\mu)\in A\big) - \Phi_\Sigma(A)\Big| \le \frac{400\,d^{1/4}\,\mathbb E\big\|\Sigma^{-1/2}(Y-\mu)\big\|_2^3}{\sqrt n},$$
where $\bar Y = \frac1n\sum_{j=1}^n Y_j$.

Given a metric space $(T,\rho)$, the covering number $N(T,\rho,\varepsilon)$ is defined as the smallest $N\in\mathbb N$ such that there exists a subset $F\subseteq T$ of cardinality $N$ with the property that for all $z\in T$, $\rho(z,F)\le\varepsilon$. When the metric $\rho$ is clear from the context, we simply write $N(T,\varepsilon)$. Let $\{Y(t),\,t\in T\}$ be a stochastic process indexed by $T$. We say that it has sub-Gaussian increments with respect to the metric $\rho$ if for all $t_1,t_2\in T$ and $s>0$,
$$P\big(|Y_{t_1}-Y_{t_2}| \ge s\,\rho(t_1,t_2)\big) \le 2e^{-s^2/2}.$$

Fact 6 (Dudley's entropy bound). Let $\{Y(t),\,t\in T\}$ be a centered stochastic process with sub-Gaussian increments. Then the following inequality holds:
$$\mathbb E\sup_{t\in T} Y(t) \le 12\int_0^{D(T)}\sqrt{\log N(T,\rho,\varepsilon)}\,d\varepsilon, \qquad (30)$$
where $D(T)$ is the diameter of the space $T$ with respect to $\rho$.

Proof. See (Talagrand, 2005).

Finally, we recall two useful facts related to Vapnik–Chervonenkis (VC) combinatorics (see van der Vaart and Wellner, 1996, for the definition of VC dimension and related theory). Let $\mathcal F$ be a finite-dimensional vector space of real-valued functions on $S$.

Fact 7. Let $\mathcal C = \{\{f\ge 0\} : f\in\mathcal F\}$ and $\mathcal C_+ = \{\{f>0\} : f\in\mathcal F\}$. Then $\mathrm{VC}(\mathcal C) = \mathrm{VC}(\mathcal C_+) = \dim(\mathcal F)$.

Proof. See Proposition 3.6.6 in (Giné and Nickl, 2015).

Fact 8. Let $\mathcal C$ be a class of sets of VC dimension $V$. Then, for any probability measure $Q$,
$$N\big(\mathcal C, L_2(Q), \varepsilon\big) \le e(V+1)(4e)^V\Big(\frac1\varepsilon\Big)^{2V} \qquad (31)$$
for all $0<\varepsilon\le 1$.

Proof. This bound follows from results of R. Dudley (Dudley, 1978) and D. Haussler (Haussler, 1995). The bound with explicit constants as stated above is given in (Pollard, 2000).

D.2. Proof of Theorem 7.
By the definition of the geometric median,
$$\hat\theta^{(k)} = \mathop{\mathrm{argmin}}_{z\in\mathbb R^m}\sum_{j=1}^k\|z-\bar\theta_j\|_2,$$
hence
$$\frac{1}{\sigma_n}\big(\hat\theta^{(k)}-\theta^*\big) = \mathop{\mathrm{argmin}}_{z\in\mathbb R^m}\sum_{j=1}^k\Big\|z - \frac{1}{\sigma_n}\big(\bar\theta_j-\theta^*\big)\Big\|_2. \qquad (32)$$
Set $F_k(z) := \sum_{j=1}^k\big\|z - \frac{1}{\sigma_n}(\bar\theta_j-\theta^*)\big\|_2$. Then (32) is equivalent to
$$\hat\mu^{(k)} := \frac{1}{\sigma_n}\big(\hat\theta^{(k)}-\theta^*\big) = \mathop{\mathrm{argmin}}_{z\in\mathbb R^m} F_k(z).$$
Denote by $\Phi^{(n)}$ the distribution of $\frac{1}{\sigma_n}(\bar\theta_1-\theta^*)$, and by $\Phi^{(n)}_k$ the empirical distribution corresponding to the sample $W_1 = \frac{1}{\sigma_n}(\bar\theta_1-\theta^*),\ldots,W_k = \frac{1}{\sigma_n}(\bar\theta_k-\theta^*)$. Let
$$DF_k\big(\hat\mu^{(k)};u\big) := \lim_{t\downarrow 0}\frac{F_k(\hat\mu^{(k)}+tu) - F_k(\hat\mu^{(k)})}{t}$$
be the directional derivative of $F_k$ at the point $\hat\mu^{(k)}$ in direction $u$. Clearly, $DF_k(\hat\mu^{(k)};u)\ge 0$ for any $u$ such that $\|u\|_2=1$. On the other hand, it is easy to check that $DF_k(\hat\mu^{(k)};u) = \Phi^{(n)}_k f_{u,\hat\mu^{(k)}}$, where
$$f_{u,b}(x) = \begin{cases}\Big\langle\frac{x-b}{\|x-b\|_2},u\Big\rangle, & x\ne b,\\[2pt] 1, & x = b.\end{cases}$$
Let $\mathcal S_m$ be the set of closed cones defined in (27), and note that for any unit vector $u\in\mathbb R^m$ and $t\in[0,1]$,
$$\big\{x\in\mathbb R^m : f_{u,\hat\mu^{(k)}}(x)\ge t\big\} = C_u\big(t;\hat\mu^{(k)}\big). \qquad (33)$$
Next, observe that
$$0 \le DF_k\big(\hat\mu^{(k)};u\big) = \big(\Phi^{(n)}_k-\Phi^{(n)}\big)f_{u,\hat\mu^{(k)}} + \big(\Phi^{(n)}-\Phi_\Sigma\big)f_{u,\hat\mu^{(k)}} + \Phi_\Sigma f_{u,\hat\mu^{(k)}}. \qquad (34)$$
We will assume that $u$ is chosen such that $\Phi_\Sigma f_{u,\hat\mu^{(k)}}\le 0$ (if not, simply replace $u$ by $-u$). Then (34) implies that
$$\Phi_\Sigma f_{-u,\hat\mu^{(k)}} \le \big(\Phi^{(n)}_k-\Phi^{(n)}\big)f_{u,\hat\mu^{(k)}} + \big(\Phi^{(n)}-\Phi_\Sigma\big)f_{u,\hat\mu^{(k)}}. \qquad (35)$$
It remains to estimate the left-hand side of inequality (35) from below and its right-hand side from above. We start by finding an upper bound (proved in Section D.2.1) for $(\Phi^{(n)}-\Phi_\Sigma)f_{u,\hat\mu^{(k)}}$.

Lemma 5. The following bound holds:
$$\big(\Phi^{(n)}-\Phi_\Sigma\big)f_{u,\hat\mu^{(k)}} \le 2\,g_{\mathcal S_m}(n),$$
where $g_{\mathcal S_m}(n)$ was defined in Assumption 2.

The next lemma (proved in Section D.2.2) provides an upper bound for $(\Phi^{(n)}_k-\Phi^{(n)})f_{u,\hat\mu^{(k)}}$.

Lemma 6.
With probability $\ge 1-e^{-2s}$,
$$\big(\Phi^{(n)}_k-\Phi^{(n)}\big)f_{u,\hat\mu^{(k)}} \le \frac{12(m+4)}{\sqrt k}\sqrt{\log\big(4e^{5/2}\big)} + \sqrt{\frac sk}.$$

Finally, it remains to estimate $\Phi_\Sigma f_{-u,\hat\mu^{(k)}}$ from below. The following inequality (proved in Section D.2.3) holds:

Lemma 7. Set $u = -\frac{\Sigma^{-1}\hat\mu^{(k)}}{\|\Sigma^{-1}\hat\mu^{(k)}\|_2}$. Then
$$\Phi_\Sigma f_{-u,\hat\mu^{(k)}} \ge \frac{0.15}{2\,\|\Sigma^{-1/2}\|\sqrt{m+2\sqrt{(m-1)\ln 4}}}\,\tanh\big(\|\hat\mu^{(k)}\|_2\big),$$
where $\tanh(\cdot)$ is the hyperbolic tangent, $\tanh(x) = \frac{1-e^{-2x}}{1+e^{-2x}}$.

It therefore follows from Lemmas 5, 6 and 7 that with probability exceeding $1-e^{-2s}$,
$$\frac{0.15}{2\,\|\Sigma^{-1/2}\|\sqrt{m+2\sqrt{(m-1)\ln 4}}}\tanh\big(\|\hat\mu^{(k)}\|_2\big) \le \frac{12(m+4)}{\sqrt k}\sqrt{\log\big(4e^{5/2}\big)} + \sqrt{\frac sk} + 2\,g_{\mathcal S_m}(n),$$
which implies the bound of Theorem 7.

D.2.1. Proof of Lemma 5.

Recall that for any non-negative function $f:\mathbb R^m\to\mathbb R_+$ and a signed measure $Q$,
$$\int_{\mathbb R^m} f(x)\,dQ = \int_0^\infty Q\big(x : f(x)\ge t\big)\,dt. \qquad (36)$$
Hence
$$\big(\Phi^{(n)}-\Phi_\Sigma\big)f_{u,\hat\mu^{(k)}} = \big(\Phi^{(n)}-\Phi_\Sigma\big)\max\big(f_{u,\hat\mu^{(k)}},0\big) - \big(\Phi^{(n)}-\Phi_\Sigma\big)\max\big(f_{-u,\hat\mu^{(k)}},0\big),$$
where we used the identity $-f_{u,\hat\mu^{(k)}} = f_{-u,\hat\mu^{(k)}}$. Next, it follows from (33) that
$$\big(\Phi^{(n)}-\Phi_\Sigma\big)\max\big(f_{u,\hat\mu^{(k)}},0\big) = \int_0^1\big(\Phi^{(n)}-\Phi_\Sigma\big)\big(x : f_{u,\hat\mu^{(k)}}\ge t\big)\,dt \le \max_{0\le t\le 1}\big(\Phi^{(n)}-\Phi_\Sigma\big)\big(C_u(t;\hat\mu^{(k)})\big) \le g_{\mathcal S_m}(n)$$
by Assumption 2. It implies that $(\Phi^{(n)}-\Phi_\Sigma)f_{u,\hat\mu^{(k)}} \le 2\,g_{\mathcal S_m}(n)$, as claimed.

D.2.2. Proof of Lemma 6.

Using (36) and proceeding as in the proof of Lemma 5, we obtain that
$$\big(\Phi^{(n)}_k-\Phi^{(n)}\big)f_{u,\hat\mu^{(k)}} \le \max_{0\le t\le 1}\big(\Phi^{(n)}_k-\Phi^{(n)}\big)\big(C_u(t;\hat\mu^{(k)})\big) \le \sup_{A\in\mathcal S_m}\big|\Phi^{(n)}_k(A)-\Phi^{(n)}(A)\big|.$$
It follows from the bounded difference inequality (Fact 3) that for all $s>0$,
$$P\left(\sup_{A\in\mathcal S_m}\big|\Phi^{(n)}_k(A)-\Phi^{(n)}(A)\big| - \mathbb E\sup_{A\in\mathcal S_m}\big|\Phi^{(n)}_k(A)-\Phi^{(n)}(A)\big| \ge \sqrt{\frac sk}\right) \le e^{-2s},$$
hence it is enough to control $\mathbb E\sup_{A\in\mathcal S_m}\big|\Phi^{(n)}_k(A)-\Phi^{(n)}(A)\big|$.
To this end, we estimate the covering numbers of the class of cones $\mathcal S_m$ and use Dudley's entropy bound (Fact 6).

Given a vector $x\in\mathbb R^m$, let $x_1,\ldots,x_m$ be its coordinates with respect to the standard Euclidean basis. Note that
$$\langle x-b,u\rangle \ge t\|x-b\|_2 \iff \langle x-b,u\rangle\ge 0 \ \text{ and }\ \langle x-b,u\rangle^2 \ge t^2\|x-b\|_2^2,$$
which is equivalent to $\sum_{i,j=1}^m\alpha_{i,j}x_ix_j + \sum_{j=1}^m\beta_jx_j + \gamma \ge 0$ and $\langle x-b,u\rangle\ge 0$, where $\alpha_{i,j}$, $\beta_j$, $i,j=1,\ldots,m$, and $\gamma$ are functions of $t$, $b_j$ and $u_j$, $j=1,\ldots,m$. In particular, every element $A\in\mathcal S_m$ is the intersection of a half-space $\{x : \langle x-b,u\rangle\ge 0\}$ and a set $\{x : f(x)\ge 0\}$, where $f$ is a polynomial of degree 2 in $m$ variables. The dimension of the space $V_{2,m}$ of polynomials of degree at most 2 is $\dim(V_{2,m}) = \binom{m+2}{2}$, hence the Vapnik–Chervonenkis dimension of the collection of sets $\mathcal S_{V_{2,m}} = \big\{\{x : f(x)\ge 0\},\ f\in V_{2,m}\big\}$ is $\tilde m := \binom{m+2}{2}$ by Fact 7. It follows from Fact 8 that for any probability measure $Q$,
$$N\big(\mathcal S_{V_{2,m}}, L_2(Q), \varepsilon\big) \le e(\tilde m+1)(4e)^{\tilde m}\Big(\frac1\varepsilon\Big)^{2\tilde m} \qquad (37)$$
for all $0<\varepsilon\le 1$. It is also well known (and can be deduced from similar reasoning) that the VC dimension of the collection $\mathcal S_L$ of half-spaces of $\mathbb R^m$ is $m+1$, hence
$$N\big(\mathcal S_L, L_2(Q), \varepsilon\big) \le e(m+2)(4e)^{m+1}\Big(\frac1\varepsilon\Big)^{2(m+1)}.$$
Given two collections of sets $\mathcal C_1,\mathcal C_2$, let $A^{(1)}_1,\ldots,A^{(1)}_{N(\mathcal C_1,L_2(Q),\varepsilon)}$ and $A^{(2)}_1,\ldots,A^{(2)}_{N(\mathcal C_2,L_2(Q),\varepsilon)}$ be the $L_2(Q)$ $\varepsilon$-nets of smallest cardinality for the classes of functions $\{I_A : A\in\mathcal C_1\}$ and $\{I_A : A\in\mathcal C_2\}$ respectively. Let $A'\in\mathcal C_1$, $A''\in\mathcal C_2$, and assume without loss of generality that $\|I_{A'}-I_{A^{(1)}_1}\|_{L_2(Q)}\le\varepsilon$ and $\|I_{A''}-I_{A^{(2)}_1}\|_{L_2(Q)}\le\varepsilon$.
Then
$$\big\|I_{A'}I_{A''} - I_{A^{(1)}_1}I_{A^{(2)}_1}\big\|_{L_2(Q)} \le 2\varepsilon,$$
which implies that the covering number of the class $\mathcal D = \{I_{A_1}I_{A_2},\ A_1\in\mathcal C_1,\ A_2\in\mathcal C_2\}$ corresponding to intersections of elements of $\mathcal C_1$ and $\mathcal C_2$ satisfies
$$N(\mathcal D, L_2(Q), \varepsilon) \le N(\mathcal C_1, L_2(Q), \varepsilon/2)\,N(\mathcal C_2, L_2(Q), \varepsilon/2).$$
In particular, the metric entropy of the class of cones $\mathcal S_m$ can be bounded as
$$\log N(\mathcal S_m, L_2(Q), \varepsilon) \le 2\left(\binom{m+2}{2}+m+1\right)\log\frac{4e^{3/2}}{\varepsilon}$$
uniformly over all probability measures $Q$, hence Fact 6 implies that
$$\mathbb E\sup_{A\in\mathcal S_m}\big|\Phi^{(n)}_k(A)-\Phi^{(n)}(A)\big| \le \frac{12}{\sqrt k}\,\mathbb E\left[\int_0^1\sqrt{\log N\big(\mathcal S_m, L_2(\Phi^{(n)}_k),\varepsilon\big)}\,d\varepsilon\right] \le \frac{12}{\sqrt k}\,\mathbb E\sqrt{\int_0^1\log N\big(\mathcal S_m, L_2(\Phi^{(n)}_k),\varepsilon\big)\,d\varepsilon} \le \frac{12(m+4)}{\sqrt k}\sqrt{\log\big(4e^{5/2}\big)}.$$

D.2.3. Proof of Lemma 7.

Making the change of variables $x = \Sigma^{1/2}z$, we obtain
$$\int_{\mathbb R^m}\Big\langle\frac{x-\hat\mu^{(k)}}{\|x-\hat\mu^{(k)}\|_2},u\Big\rangle\,d\Phi_\Sigma(x) = \int_{\mathbb R^m}\Big\langle\frac{\Sigma^{1/2}(z-\Sigma^{-1/2}\hat\mu^{(k)})}{\|\Sigma^{1/2}(z-\Sigma^{-1/2}\hat\mu^{(k)})\|_2},u\Big\rangle\,d\Phi(z) \ge \big\|\Sigma^{1/2}u\big\|_2\int_{\mathbb R^m}\Big\langle\frac{z-\Sigma^{-1/2}\hat\mu^{(k)}}{\|z-\Sigma^{-1/2}\hat\mu^{(k)}\|_2},\tilde u\Big\rangle\,d\Phi(z),$$
where $\tilde u = \frac{\Sigma^{1/2}u}{\|\Sigma^{1/2}u\|_2}$. Let $\kappa := \|\Sigma^{-1/2}\hat\mu^{(k)}\|_2$, and note that $\kappa \ge \|\hat\mu^{(k)}\|_2$ since $\|\Sigma\|\le 1$ by assumption. Let $V$ be any orthogonal transformation that maps $\Sigma^{-1/2}\hat\mu^{(k)}$ to $\kappa e_1$ (here, $e_1,\ldots,e_m$ is the standard Euclidean basis of $\mathbb R^m$). Then, letting $y = V(z-\Sigma^{-1/2}\hat\mu^{(k)})$, we observe that
$$\int_{\mathbb R^m}\Big\langle\frac{x-\hat\mu^{(k)}}{\|x-\hat\mu^{(k)}\|_2},u\Big\rangle\,d\Phi_\Sigma(x) \ge \big\|\Sigma^{1/2}u\big\|_2\int_{\mathbb R^m}\Big\langle\frac{y}{\|y\|_2},V\tilde u\Big\rangle\,d\Phi(y+\kappa e_1).$$
Setting $u = -\frac{\Sigma^{-1}\hat\mu^{(k)}}{\|\Sigma^{-1}\hat\mu^{(k)}\|_2}$, we obtain from the last inequality that
$$\int_{\mathbb R^m}\Big\langle\frac{x-\hat\mu^{(k)}}{\|x-\hat\mu^{(k)}\|_2},u\Big\rangle\,d\Phi_\Sigma(x) \ge \frac{1}{\|\Sigma^{-1/2}\|}\int_{\mathbb R^m}\Big\langle\frac{y}{\|y\|_2},-e_1\Big\rangle\,d\Phi(y+\kappa e_1).$$
Set $y = (-t,z)$, where $t\in\mathbb R$ and $z\in\mathbb R^{m-1}$. We also let $\varphi_k$ denote the density (with respect to the Lebesgue measure) of the standard normal distribution on $\mathbb R^k$.
Then
$$\int_{\mathbb R^m}\Big\langle\frac{y}{\|y\|_2},-e_1\Big\rangle\,d\Phi(y+\kappa e_1) = \int_{\mathbb R^{m-1}}\int_{-\infty}^\infty\frac{t}{\sqrt{t^2+\|z\|_2^2}}\,\varphi_1(t-\kappa)\,\varphi_{m-1}(z)\,dt\,dz.$$
Setting $h(t,z) = t/\sqrt{t^2+\|z\|_2^2}$, we have that
$$\int_{-\infty}^\infty h(t,z)\varphi_1(t-\kappa)\,dt = \int_{-\infty}^0 h(t,z)\varphi_1(t-\kappa)\,dt + \int_0^\infty h(t,z)\varphi_1(t-\kappa)\,dt = -\int_0^\infty h(t,z)\varphi_1(t+\kappa)\,dt + \int_0^\infty h(t,z)\varphi_1(t-\kappa)\,dt = \int_0^\infty h(t,z)\big[\varphi_1(t-\kappa)-\varphi_1(t+\kappa)\big]\,dt, \qquad (38)$$
where we used the fact that $h(-t,z) = -h(t,z)$ and $\varphi_1(-t-\kappa) = \varphi_1(t+\kappa)$. Now, for any $t\ge 0$,
$$\varphi_1(t-\kappa)-\varphi_1(t+\kappa) = \frac{e^{-(t^2+\kappa^2)/2}}{\sqrt{2\pi}}\big(e^{t\kappa}-e^{-t\kappa}\big) = \frac{e^{-(t^2+\kappa^2)/2}}{\sqrt{2\pi}}\tanh(t\kappa)\big(e^{t\kappa}+e^{-t\kappa}\big) \ge \frac{e^{-(t^2+\kappa^2)/2}}{\sqrt{2\pi}}\tanh(t\kappa)\,e^{t\kappa} = \tanh(t\kappa)\,\varphi_1(t-\kappa),$$
hence
$$\int_{\mathbb R^m}\Big\langle\frac{y}{\|y\|_2},-e_1\Big\rangle\,d\Phi(y+\kappa e_1) \ge \int_{\mathbb R^{m-1}}\int_0^\infty h(t,z)\tanh(t\kappa)\varphi_1(t-\kappa)\varphi_{m-1}(z)\,dt\,dz \ge \int_{\|z\|_2\le R}\int_1^\infty h(t,z)\tanh(t\kappa)\varphi_1(t-\kappa)\varphi_{m-1}(z)\,dt\,dz$$
$$\ge \frac{\tanh(\kappa)}{\sqrt{1+R^2}}\int_{\|z\|_2\le R}\varphi_{m-1}(z)\,dz\int_1^\infty\varphi_1(t-\kappa)\,dt \ge 0.15\,\frac{\tanh(\kappa)}{\sqrt{1+R^2}}\int_{\|z\|_2\le R}\varphi_{m-1}(z)\,dz,$$
where we have used the inequality $h(t,z)\ge(1+R^2)^{-1/2}$, valid whenever $\|z\|_2\le R$ and $t\ge 1$, and the bound $1-\Phi(1) > 0.15$ (here $\Phi$ is the standard normal distribution function). Finally, a well-known bound states that if $Y$ has the $\chi^2_{m-1}$ distribution, then for all $t>0$,
$$P\Big(\frac{Y}{m-1}-1 > t\Big) \le e^{-(m-1)t^2/8}.$$
For $R^2 := m-1+2\sqrt{(m-1)\ln 4}$, it implies that
$$\int_{\|z\|_2\le R}\varphi_{m-1}(z)\,dz = P\big(Y\le R^2\big) = P\left(\frac{Y}{m-1}-1 \le 2\sqrt{\frac{\ln 4}{m-1}}\right) \ge 1/2,$$
which concludes the proof.

D.3. Proof of Theorem 8.

To simplify the notation in what follows, we let $z^* = \mathop{\mathrm{argmin}}_{z\in\mathbb R^d}F(z)$. We let $f_i(z) = \|z-x_i\|$ for all $i=1,\ldots,k$ and observe that a weak gradient of $f_i$ is given by
$$\nabla f_i(z) = \begin{cases}0, & z = x_i,\\ \frac{z-x_i}{\|z-x_i\|}, & z\ne x_i.\end{cases}$$
Hence, $\nabla F(z) = \sum_{i=1}^k\nabla f_i(z)$ is a weak gradient of $F$. Now, fix $z\in\mathbb R^d$ with $z\ne z^*$, let $r = \|z-z^*\|$, and set $u = \frac1r(z-z^*)$.
The second fundamental theorem of calculus yields
$$F(z)-F(z^*) = \int_0^r\nabla F(z^*+tu)^Tu\,dt = \int_0^r\sum_{i=1}^k\frac{(z^*-x_i+tu)^Tu}{\|z^*-x_i+tu\|}\,dt = \int_0^r\sum_{i=1}^k\frac{(z^*-x_i)^Tu+t}{\sqrt{\|z^*-x_i\|^2+2t(z^*-x_i)^Tu+t^2}}\,dt = \int_0^r\sum_{i=1}^k\frac{\gamma_ic_i+t}{\sqrt{(\gamma_ic_i+t)^2+\gamma_i^2(1-c_i^2)}}\,dt.$$
In the last line, we have set $\gamma_i = \|z^*-x_i\|$ and $c_i = \frac{1}{\gamma_i}(z^*-x_i)^Tu$. By Cauchy–Schwarz, we have that $c_i^2\le 1$. If $c_i^2=1$, then
$$\frac{\gamma_ic_i+t}{\sqrt{(\gamma_ic_i+t)^2+\gamma_i^2(1-c_i^2)}} = \mathrm{sgn}(\gamma_ic_i+t) \ge c_i \quad\text{for all } t\ge 0.$$
If $c_i^2<1$, then we have that
$$\frac{\gamma_ic_i+t}{\sqrt{(\gamma_ic_i+t)^2+\gamma_i^2(1-c_i^2)}} = c_i + \int_0^t\frac{\gamma_i^2(1-c_i^2)}{\big((\gamma_ic_i+s)^2+\gamma_i^2(1-c_i^2)\big)^{3/2}}\,ds.$$
Note that $\sum_{i=1}^kc_i = \nabla F(z^*)^Tu = 0$ since $z^*$ is the minimizer. Consequently, we have
$$F(z)-F(z^*) \ge \int_0^r\left(\sum_{i=1}^kc_i + \sum_{i:\,c_i^2<1}\int_0^t\frac{\gamma_i^2(1-c_i^2)}{\big((\gamma_ic_i+s)^2+\gamma_i^2(1-c_i^2)\big)^{3/2}}\,ds\right)dt = \sum_{i:\,c_i^2<1}\int_0^r\int_0^t\frac{\gamma_i^2(1-c_i^2)}{\big((\gamma_ic_i+s)^2+\gamma_i^2(1-c_i^2)\big)^{3/2}}\,ds\,dt = \sum_{i:\,c_i^2<1}\int_0^r\int_0^t\frac{1-c_i^2}{\gamma_i}\cdot\frac{1}{\big[(c_i+\frac{s}{\gamma_i})^2+(1-c_i^2)\big]^{3/2}}\,ds\,dt.$$
Given that
$$\Big(c_i+\frac{s}{\gamma_i}\Big)^2+(1-c_i^2) = \frac{s^2}{\gamma_i^2}+\frac{2c_is}{\gamma_i}+1 \le \frac{s^2}{\gamma_i^2}+\frac{2s}{\gamma_i}+1 = \Big(1+\frac{s}{\gamma_i}\Big)^2,$$
we obtain the lower bound
$$F(z)-F(z^*) \ge \sum_{i:\,c_i^2<1}\int_0^r\int_0^t\frac{1-c_i^2}{\gamma_i}\cdot\frac{1}{(\frac{s}{\gamma_i}+1)^3}\,ds\,dt = \sum_{i:\,c_i^2<1}\int_0^r\int_0^t\frac{\gamma_i^2(1-c_i^2)}{(s+\gamma_i)^3}\,ds\,dt = \sum_{j=1}^k\gamma_j^2(1-c_j^2)\int_0^r\int_0^t\sum_{i=1}^k\frac{\gamma_i^2(1-c_i^2)}{\sum_{j=1}^k\gamma_j^2(1-c_j^2)}\cdot\frac{1}{(s+\gamma_i)^3}\,ds\,dt.$$
Noting that the inverse cubic function is convex, Jensen's inequality and straightforward integration yield
$$F(z)-F(z^*) \ge \sum_{j=1}^k\gamma_j^2(1-c_j^2)\int_0^r\int_0^t\left(s+\frac{\sum_{i=1}^k\gamma_i^3(1-c_i^2)}{\sum_{j=1}^k\gamma_j^2(1-c_j^2)}\right)^{-3}ds\,dt = \frac12\,\frac{\sum_{j=1}^k\gamma_j^2(1-c_j^2)\,r^2}{\left(\frac{\sum_{i=1}^k\gamma_i^3(1-c_i^2)}{\sum_{j=1}^k\gamma_j^2(1-c_j^2)}\right)^2\left(r+\frac{\sum_{i=1}^k\gamma_i^3(1-c_i^2)}{\sum_{j=1}^k\gamma_j^2(1-c_j^2)}\right)}.$$
We now observe that, since $\|z^*-\bar x\| \le \frac1k\sum_{i=1}^k\big(\|z^*-x_i\|+\|x_i-\bar x\|\big) = \frac1kF(z^*)+m_1 \le \frac1kF(\bar x)+m_1 = 2m_1$,
$$\sum_{i=1}^k\gamma_i^3(1-c_i^2) \le \sum_{i=1}^k\|z^*-x_i\|^3 \le \sum_{i=1}^k\big(\|z^*-\bar x\|+\|\bar x-x_i\|\big)^3 \le \sum_{i=1}^k\big(2m_1+\|\bar x-x_i\|\big)^3,$$
and also that
$$\sum_{i=1}^k\gamma_i^2(1-c_i^2) = \sum_{i=1}^k\Big(\|z^*-x_i\|^2-\big((z^*-x_i)^Tu\big)^2\Big) = \sum_{i=1}^k\sum_{j=2}^du_j^T(z^*-x_i)(z^*-x_i)^Tu_j,$$
where $\{u,u_2,\ldots,u_d\}$ is an orthonormal basis for $\mathbb R^d$. We further observe that
$$\sum_{i=1}^k(z^*-x_i)(z^*-x_i)^T = \sum_{i=1}^k(z^*-\bar x+\bar x-x_i)(z^*-\bar x+\bar x-x_i)^T = k(z^*-\bar x)(z^*-\bar x)^T + \sum_{i=1}^k(x_i-\bar x)(x_i-\bar x)^T.$$
The Courant–Fischer characterization of eigenvalues gives us
$$\sum_{i=1}^k\gamma_i^2(1-c_i^2) \ge \sum_{j=2}^du_j^T\left(\sum_{i=1}^k(x_i-\bar x)(x_i-\bar x)^T\right)u_j \ge k\sum_{j=2}^d\lambda_j(\hat\Sigma),$$
where $\{\lambda_j(\hat\Sigma)\}_{j=1}^d$ are the eigenvalues of the empirical covariance matrix listed with multiplicity and in non-increasing order. We therefore have
$$\frac1k\big(F(z)-F(z^*)\big) \ge \frac12\,\frac{\Big(\sum_{j=2}^d\lambda_j(\hat\Sigma)\Big)\,r^2}{\left(\frac{\frac1k\sum_{i=1}^k(2m_1+\|x_i-\bar x\|)^3}{\sum_{j=2}^d\lambda_j(\hat\Sigma)}\right)^2\left(r+\frac{\frac1k\sum_{i=1}^k(2m_1+\|x_i-\bar x\|)^3}{\sum_{j=2}^d\lambda_j(\hat\Sigma)}\right)},$$
and the result follows.
Fig. 2: Comparison of errors corresponding to the median-of-means and sample mean estimators over 256 runs of the experiment. In (a), the sample of size $N = 10^6$ consists of i.i.d. random vectors in $\mathbb R^2$ with independent Pareto-distributed entries possessing only $2.1$ moments. Each run computes the median-of-means estimator using a partition into $k = 1000$ groups, as well as the usual sample mean. In (b), the ordered differences between the error of the sample mean and that of the median-of-means over all 256 runs illustrate robustness: positive error differences indicate lower error for the median-of-means, and negative error differences occur when the sample mean provided a better estimate. Images (c) and (d) illustrate a similar experiment performed for two-dimensional random vectors with independent entries following Student's t-distribution with 2 degrees of freedom; in this case, the sample size is $N = 100$ and the number of groups is $k = 10$.

Fig. 3: The solid and dotted lines indicate theoretical bounds for the different values of the sample size $N$, with the solid part indicating the numbers of subgroups $k$ for which our estimates hold. The dashed lines indicate the empirical error between the median-of-means estimator and the true mean. We consider three cases: $N = 2^{16}$ (blue), $N = 2^{18}$ (green), and $N = 2^{20}$ (red).
The x-axis is $\log_N k$, taken from a uniform partition of $(0,1)$, and the y-axis indicates the median error of the median-of-means estimator over $2^{16}$ runs of the experiment. For each value of $N$ and $k$, we run $2^{16}$ simulations by drawing $N$ i.i.d. random variables with the Lomax distribution with shape parameter $\alpha = 4$ and scale parameter $\lambda = 1$, splitting them into $k$ groups, and then computing the median of the means of those groups. From the $2^{16}$ simulations, we display (on a logarithmic scale) the median of the absolute differences between the true mean $1/3$ and the median-of-means estimators, producing the dashed lines in the figure. The solid and dotted lines are our theoretical bounds with $4e^{-2s} = 1/2$ (that is, the probability that the solid and dotted bounds hold is guaranteed to be at least $1/2$).

Fig. 4: Empirical coverage levels of confidence intervals constructed using (a) the Central Limit Theorem for the sample mean and (b) Theorem 3 for the median-of-means estimator. Columns correspond to the fraction of outliers in the sample.

(a) Sample mean:

Nominal confidence level |    0   | 0.2/√N | 0.4/√N | 0.6/√N | 0.8/√N | 1/√N
0.8                      |  0.94  | 0.0008 |   0    |   0    |   0    |  0
0.95                     |  0.99  | 0.001  |   0    |   0    |   0    |  0

(b) Median of means:

Nominal confidence level |    0   | 0.2/√N | 0.4/√N | 0.6/√N | 0.8/√N | 1/√N
0.8                      |  0.88  |  0.82  |  0.77  |  0.66  |  0.6   | 0.53
0.95                     |  0.99  |  0.97  |  0.93  |  0.85  |  0.79  | 0.71
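The median-of-means computation used in the experiments above can be sketched in a few lines (an illustrative sketch, not the authors' code; the seed and the choices $N = 10^5$, $k = 100$ are ours, while the Lomax shape $\alpha = 4$, scale $\lambda = 1$ and true mean $1/3$ match Figure 3):

```python
import numpy as np

def median_of_means(x, k):
    """Partition x into k groups, average each group, return the median of the averages."""
    groups = np.array_split(x, k)
    return np.median([g.mean() for g in groups])

rng = np.random.default_rng(0)
N, k = 100_000, 100
# numpy's pareto(a) samples the Lomax distribution with shape a and scale 1,
# whose mean is 1/(a - 1); for a = 4 the true mean is 1/3, as in Figure 3.
x = rng.pareto(4.0, size=N)
err_mom = abs(median_of_means(x, k) - 1 / 3)
err_mean = abs(x.mean() - 1 / 3)
```

Repeating this over many draws, and optionally corrupting a few observations, typically reproduces the qualitative behavior reported in Figures 2–4: on clean heavy-tailed data the two errors are comparable, while under contamination the median-of-means error remains stable.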