Recent Advances in Algorithmic High-Dimensional Robust Statistics∗

Ilias Diakonikolas†
University of Wisconsin-Madison
ilias@cs.wisc.edu

Daniel M. Kane‡
University of California, San Diego
dakane@cs.ucsd.edu

November 15, 2019

Abstract

Learning in the presence of outliers is a fundamental problem in statistics. Until recently, all known efficient unsupervised learning algorithms were very sensitive to outliers in high dimensions. In particular, even for the task of robust mean estimation under natural distributional assumptions, no efficient algorithm was known. Recent work in theoretical computer science gave the first efficient robust estimators for a number of fundamental statistical tasks, including mean and covariance estimation. Since then, there has been a flurry of research activity on algorithmic high-dimensional robust estimation in a range of settings. In this survey article, we introduce the core ideas and algorithmic techniques in the emerging area of algorithmic high-dimensional robust statistics with a focus on robust mean estimation. We also provide an overview of the approaches that have led to computationally efficient robust estimators for a range of broader statistical tasks and discuss new directions and opportunities for future work.

∗This article is an expanded version of an invited chapter entitled "Robust High-Dimensional Statistics" in the book "Beyond the Worst-Case Analysis of Algorithms" to be published by Cambridge University Press.
†Supported by NSF Award CCF-1652862 (CAREER) and a Sloan Research Fellowship.
‡Supported by NSF Award CCF-1553288 (CAREER) and a Sloan Research Fellowship.

1 Introduction

1.1 Background

Consider the following basic statistical task: Given n independent samples from an unknown-mean spherical Gaussian distribution N(µ, I) on R^d, estimate its mean vector µ within small ℓ2-norm. It is not hard to see that the empirical mean has ℓ2-error at most O(√(d/n)) from µ with high probability. Moreover, this error upper bound is best possible among all n-sample estimators. The Achilles heel of the empirical estimator is that it crucially relies on the assumption that the observations were generated by a spherical Gaussian. The existence of even a single outlier can arbitrarily compromise this estimator's performance. However, the Gaussian assumption is only ever approximately valid, as real datasets are typically exposed to some source of contamination. Hence, any estimator that is to be used in practice must be robust in the presence of outliers.

Learning in the presence of outliers is an important goal in statistics and has been studied in the robust statistics community since the 1960s [73, 44] (see [38, 45] for introductory statistical textbooks on the topic). In recent years, the problem of designing robust and computationally efficient estimators for various high-dimensional statistical tasks has become a pressing challenge in a number of data analysis applications. These include the analysis of biological datasets, where natural outliers are common [68, 64, 58] and can contaminate the downstream statistical analysis, and data poisoning attacks in machine learning [4], where even a small fraction of fake data (outliers) can substantially degrade the quality of the learned model [9, 70].
Classical work in robust statistics pinned down the minimax risk of high-dimensional robust estimation in several basic settings of interest. In contrast, until very recently, even the most basic computational questions in this field were poorly understood. For example, the Tukey median [74] is a sample-efficient robust mean estimator for spherical Gaussian distributions. However, it is NP-hard to compute in general [47], and the heuristics proposed to approximate it degrade in the quality of their approximation as the dimension scales. Similar hardness results have been shown [5, 39] for essentially all known classical estimators in robust statistics.

Until recently, all known computationally efficient estimators could only tolerate a negligible fraction of outliers in high dimensions, even for the basic task of mean estimation. Recent work by Diakonikolas, Kamath, Kane, Li, Moitra, and Stewart [21], and by Lai, Rao, and Vempala [57], gave the first efficient robust estimators for various high-dimensional unsupervised learning tasks, including mean and covariance estimation. Specifically, [21] obtained the first polynomial-time robust estimators with dimension-independent error guarantees, i.e., with error scaling only with the fraction of corrupted samples and not with the dimensionality of the data. Since the dissemination of these works [21, 57], there has been significant research activity on designing computationally efficient robust estimators in a variety of settings.

1.2 Contamination Model

Throughout this article, we focus on the following model of data contamination, which generalizes several other existing models:

Definition 1.1 (Strong Contamination Model). Given a parameter 0 < ε < 1/2 and a distribution family D on R^d, the adversary operates as follows: The algorithm specifies a number of samples n, and n samples are drawn from some unknown X ∈ D. The adversary is allowed to inspect the samples, remove up to εn of them, and replace them with arbitrary points. This modified set of n points is then given as input to the algorithm. We say that a set of samples is ε-corrupted if it is generated by the above process.

The contamination model of Definition 1.1 can be viewed as a semi-random input model: First, nature draws a set S of i.i.d. samples from a statistical model of interest, and then an adversary is allowed to change the set S in a bounded way to obtain an ε-corrupted set T. The parameter ε is the proportion of contamination and quantifies the power of the adversary. Intuitively, among our samples, an unknown (1 − ε)-fraction are generated from a distribution of interest and are called inliers, and the rest are called outliers.
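To fix ideas, the following sketch (ours) simulates Definition 1.1 with a spherical Gaussian inlier distribution. The particular adversary, which plants all of its points at one location, is a hypothetical choice for illustration; the model allows any replacement rule.

import numpy as np

def eps_corrupt(samples, eps, adversary):
    """Return an ε-corrupted copy of `samples` (Definition 1.1).

    The adversary inspects the clean samples, removes ⌊εn⌋ of them,
    and replaces them with arbitrary points of its choosing.
    """
    n = len(samples)
    k = int(eps * n)
    corrupted = samples.copy()
    corrupted[:k] = adversary(samples, k)  # replace k points arbitrarily
    return corrupted

# One (hypothetical) adversary: place all outliers at a single point.
def cluster_adversary(samples, k):
    d = samples.shape[1]
    spike = np.full(d, 3.0)                # arbitrary location
    return np.tile(spike, (k, 1))

rng = np.random.default_rng(0)
n, d, eps = 10_000, 100, 0.1
clean = rng.normal(size=(n, d))            # inliers from N(0, I)
T = eps_corrupt(clean, eps, cluster_adversary)
print(np.linalg.norm(clean.mean(axis=0)))  # ≈ sqrt(d/n), small
print(np.linalg.norm(T.mean(axis=0)))      # shifted by ≈ ε·‖spike‖₂ = 3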
One can consider less powerful adversaries, giving rise to weaker contamination models. An adversary may be (i) adaptive or oblivious to the inliers, and (ii) only allowed to add outliers, only allowed to remove inliers, or allowed to do both. We provide some examples of natural and well-studied such models in the following paragraphs.

In Huber's contamination model [44], the adversary is oblivious to the inliers and is only allowed to add outliers. More specifically, in Huber's model, the adversary generates samples from a mixture distribution P of the form P = (1 − ε)X + εN, where X ∈ D is the unknown target distribution and N is an adversarially chosen noise distribution. Another natural contamination model, very common in theoretical computer science, allows the adversary to perturb the distribution X by at most ε in total variation distance, i.e., the adversary generates samples from a distribution Y that satisfies d_TV(Y, X) ≤ ε. Intuitively, the adversary in this model is oblivious to the inliers and is allowed to both add outliers and remove inliers. This model is very similar to a contamination model proposed by Hampel [37]. We note that contamination in total variation distance is strictly stronger than Huber's model. More broadly, one can naturally modify this model to study model misspecification with respect to different loss functions (see, e.g., [76]).

In computational learning theory, the contamination model of Definition 1.1 is related to the agnostic model [40, 49], where the goal is to learn a labeling function whose agreement with some underlying target function is close to the best possible among all functions in some given class. An important difference with our setting is that the agnostic model requires that we fit all the data, while in our robust setting we only want to fit the uncorrupted data. The strong contamination model can be viewed as the unsupervised analogue of the challenging nasty noise model [13] (itself a strengthening of the malicious model [75, 51]). In the nasty model, an adversary is allowed to corrupt an ε-fraction of both the labels and the samples, and the goal of the learning algorithm is to output a hypothesis with small misclassification error with respect to the clean data-generating distribution.

In robust mean estimation, given an ε-corrupted set of samples from a well-behaved distribution (e.g., N(µ, I)), we want to output a vector µ̂ that closely approximates the unknown mean vector µ. A natural choice of metric between the means for identity covariance distributions is the ℓ2-error ‖µ̂ − µ‖2, and in this article we focus on designing robust estimators minimizing ‖µ̂ − µ‖2. We note, however, that the same algorithms typically lead to small Mahalanobis distance, i.e., ‖µ̂ − µ‖_Σ = ((µ̂ − µ)^T Σ^{−1} (µ̂ − µ))^{1/2}, when the underlying distribution has covariance Σ.

One can analogously define robust estimation for other parameters of high-dimensional distributions (e.g., the covariance matrix and higher-order moment tensors) with respect to natural loss functions (e.g., Frobenius norm, spectral norm). A more general statistical task is that of robust density estimation: Given an ε-corrupted set of samples from an unknown distribution X ∈ D, output a hypothesis distribution H (not necessarily in D) such that the total variation distance d_TV(H, X) is minimized. We note that robust density estimation and robust parameter estimation are closely related to each other. For many natural parametric distributions, the latter can be reduced to the former for an appropriate choice of metric between the parameters.
In all these settings, the goal is to design computationally efficient learning algorithms that achieve dimension-independent error, i.e., error that scales only with the contamination parameter ε, but not with the dimension d. The information-theoretic limits of robust estimation depend on our assumptions about the distribution family D. In the following subsection, we provide the basic relevant background.

1.3 Basic Background: Sample-Efficient Robust Estimation

Before we proceed with presenting computationally efficient robust estimators in the next sections, we provide a few basic facts on the information-theoretic limits of robust estimation. For concreteness, we focus here on robust mean estimation. We note that similar arguments can be applied to various other parameter estimation tasks. The interested reader is referred to [21] and to [15, 59] (and references therein) for recent information-theoretic work from the statistics community.

We first note that some assumptions on the underlying distribution family D are necessary for robust mean estimation to be information-theoretically possible. Consider, for example, the family D = {D_x : x ∈ R}, where D_x is a probability distribution on the real line with a single point x ∈ R having positive mass Pr[D_x = x] = ε > 0 and such that E[D_x] = x. While estimating the mean of an arbitrary distribution in D is straightforward without corruptions (by taking samples until we see a sample twice, which must be the true mean), an adversary can erase all information about the mean in an ε-corrupted sample from D_x. Indeed, an adversary can delete the samples at x and move them to an arbitrary location, thereby arbitrarily corrupting the sample mean.

Typical assumptions on the family D are either parametric (e.g., D could be the family of all Gaussian distributions), or are defined by concentration properties (e.g., each distribution in D satisfies sub-gaussian concentration), or by conditions on low-degree moments (e.g., each distribution in D has appropriately bounded higher-order moments).

Another basic observation is that, in contrast to the uncorrupted setting, in the contaminated setting of Definition 1.1 it is not possible to obtain consistent estimators, that is, estimators whose error converges to zero in probability as the sample size increases indefinitely. Typically, there is an information-theoretic limit on the minimum attainable error that depends on the proportion of contamination ε and structural properties of the underlying distribution family. In particular, for robust mean estimation of a high-dimensional Gaussian, we have:

Fact 1.2. For any d ≥ 1, any robust estimator for the mean of X = N(µ_X, I) on R^d must have ℓ2-error Ω(ε), even in Huber's contamination model.

This fact can be shown as follows: Given two distinct distributions N(µ1, I) and N(µ2, I) with ‖µ1 − µ2‖2 = Θ(ε), the adversary constructs two noise distributions N1, N2 on R^d such that

    Y = (1 − ε) N(µ1, I) + ε N1 = (1 − ε) N(µ2, I) + ε N2 .

Consequently, even in the infinite-sample regime, any algorithm can at best learn that its samples are coming from Y, but will be unable to distinguish between the cases where the real distribution is N(µ1, I) and where it is N(µ2, I). This proves Fact 1.2.
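For intuition, this construction can be carried out numerically in one dimension. The sketch below (ours) discretizes the real line and forms one standard choice of N1 and N2: each absorbs the part of the other Gaussian's density that its own mixture is missing, topped up by a common remainder so that each integrates to one. The grid endpoints and the use of P1 as the remainder density are arbitrary choices.

import numpy as np

def phi(x, mu):
    """Density of N(mu, 1)."""
    return np.exp(-(x - mu) ** 2 / 2) / np.sqrt(2 * np.pi)

xs = np.linspace(-10, 10, 200_001)
dx = xs[1] - xs[0]

eps = 0.1
mu1, mu2 = 0.0, eps                        # mean gap Θ(ε)
p1, p2 = phi(xs, mu1), phi(xs, mu2)

tv = 0.5 * np.sum(np.abs(p1 - p2)) * dx    # d_TV(P1, P2), about ε/2 here
assert (1 - eps) * tv <= eps               # the residuals fit in the ε budget

excess12 = (1 - eps) / eps * np.maximum(p2 - p1, 0.0)
excess21 = (1 - eps) / eps * np.maximum(p1 - p2, 0.0)
remainder = (1 - excess12.sum() * dx) * p1  # common top-up; any density works
n1 = excess12 + remainder
n2 = excess21 + remainder

y1 = (1 - eps) * p1 + eps * n1
y2 = (1 - eps) * p2 + eps * n2
print(np.max(np.abs(y1 - y2)))   # ≈ 0: the two corrupted distributions coincide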
If the target distribution X is allowed to come from a broader class of distributions (such as any distribution with sub-gaussian tails, or any distribution with bounded covariance), the situation is even worse. If one can find two distributions X and X′ in the desired class with d_TV(X, X′) ≤ ε, it becomes information-theoretically impossible for an algorithm to distinguish between the two. However, if the difference between X and X′ is concentrated in the tails of the distribution, then X and X′ might have very different means. This implies that for the class of distributions with sub-gaussian tails (and identity covariance) we cannot hope to learn the mean to ℓ2-error better than Ω(ε√log(1/ε)); and for the class of distributions with covariance Σ bounded by σ²I, we cannot expect to do better than Ω(σ√ε). It turns out that these bounds are information-theoretically optimal, and in fact, as we will see, the means of such distributions can be robustly estimated to these errors in many cases.

The problem of robust mean estimation for N(µ, I) seems so innocuous that one could naturally wonder why simple approaches do not work. In the one-dimensional case, it is well known that the median is a robust estimator of the mean, matching the lower bound of Fact 1.2. Specifically, it is easy to show that the median µ̂ of a multiset of n = Ω(log(1/τ)/ε²) ε-corrupted samples from a one-dimensional Gaussian N(µ, 1) satisfies |µ̂ − µ| ≤ O(ε) with probability at least 1 − τ.

In high dimensions, the situation is more subtle. There are many reasonable ways to attempt to generalize the median as a robust estimator in high dimensions, but unfortunately, most natural ones lead to ℓ2-error of Ω(ε√d) in d dimensions, even in the infinite-sample regime (see, e.g., [21, 57]). Perhaps the most obvious high-dimensional generalization of the median is the coordinate-wise median. Here the i-th coordinate of the output is the median of the i-th coordinates of the input samples. This estimator guarantees that every coordinate of the output is within O(ε) of the corresponding coordinate of the true mean. This suffices for obtaining small ℓ∞-error, but if one wants ℓ2-error (which is natural in the case of Gaussians), then there exist instances where the coordinate-wise median has ℓ2-error as large as Ω(ε√d). Another potential way to generalize the median to high dimensions is via the geometric median, i.e., the point x* that minimizes Σ_i ‖x^{(i)} − x*‖2. Unfortunately, the geometric median can also produce ℓ2-error of Ω(ε√d), if the adversary places the ε-fraction of outliers all in the same direction away from the mean.
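To make the Ω(ε√d) failure mode concrete, here is a small numerical sketch (ours): an adversary that plants all outliers at a far corner shifts every coordinate's median by Θ(ε), so the coordinate-wise median accumulates ℓ2-error Θ(ε√d).

import numpy as np

rng = np.random.default_rng(1)
n, d, eps = 20_000, 400, 0.1
k = int(eps * n)

X = rng.normal(size=(n, d))      # inliers from N(0, I); true mean is 0
X[:k] = 10.0                     # outliers: +10 in every coordinate

# Each coordinate's median moves from the 1/2-quantile of N(0,1) to
# roughly its 1/(2(1-ε))-quantile, i.e., by Θ(ε) per coordinate.
cw_median = np.median(X, axis=0)
print(cw_median[:3])                      # each entry ≈ 0.14 here
print(np.linalg.norm(cw_median))          # ≈ Θ(ε·√d): grows with dimension
print(np.linalg.norm(np.mean(X, axis=0))) # the sample mean is far worse (≈ 20)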
A third high-dimensional generalization of the median relies on the observation that taking the median of any univariate projection of our input points gives us an approximation to the projected mean. Finding a mean vector that minimizes the error over the worst direction can actually be used to obtain ℓ2-error of O(ε) with high probability. In other words, it is possible to reduce the high-dimensional robust mean estimation problem to a collection of one-dimensional robust mean estimation problems. This is the underlying idea in Tukey's median [74], which is known to be a robust mean estimator for spherical Gaussians and for more general symmetric distributions. Unfortunately, however, the Tukey median relies on combining information from infinitely many directions and is, unsurprisingly, NP-hard to compute in general. The following proposition gives a computationally inefficient robust mean estimator matching the lower bound of Fact 1.2:

Proposition 1.3. There exists an algorithm that, on input an ε-corrupted set of samples from X = N(µ_X, I) of size n = Ω((d + log(1/τ))/ε²), runs in poly(n, 2^d) time, and outputs µ̂ ∈ R^d such that with probability at least 1 − τ, it holds that ‖µ̂ − µ_X‖2 = O(ε).

The algorithm establishing Proposition 1.3 proceeds by using a one-dimensional robust mean estimator to estimate v · µ for an appropriate net of 2^{O(d)} unit vectors v ∈ R^d, and then combines these estimates (by solving a large linear program) to obtain an accurate estimate of µ. When X = N(µ_X, I), our one-dimensional robust mean estimator will be the median, giving the O(ε) error in Proposition 1.3. We note that the same procedure is applicable to other distribution families as well (even non-symmetric ones), as long as there is an accurate robust mean estimator for each univariate projection. Specifically, if X has tails bounded by those of a Gaussian with covariance σ²I, one can use the trimmed mean for each univariate projection. This gives error of O(σε√log(1/ε)). If X is only assumed to have bounded covariance matrix (Σ_X ⪯ σ²I), we can similarly use the trimmed mean, which leads to total error of O(σ√ε). By the discussion following Fact 1.2, both these upper bounds are optimal, within constant factors, under the corresponding assumptions.
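The univariate building block in these reductions is simple. Below is a sketch (ours) of a trimmed mean; trimming a 2ε-fraction from each side is our choice of constant, made so that, for tail-planted corruptions like the one simulated here, every spike falls beyond the empirical quantiles and is discarded.

import numpy as np

def trimmed_mean(samples_1d, eps):
    """Mean after discarding the bottom and top 2ε-quantile tails.

    With at most an ε-fraction of corruptions, spikes planted in a tail
    land beyond the empirical (1-2ε)-quantile and are removed; the
    residual bias comes only from trimming genuine inlier tails.
    """
    lo, hi = np.quantile(samples_1d, [2 * eps, 1 - 2 * eps])
    kept = samples_1d[(samples_1d >= lo) & (samples_1d <= hi)]
    return kept.mean()

rng = np.random.default_rng(2)
n, eps = 50_000, 0.05
x = rng.normal(size=n)          # clean projection: N(0, 1)
x[: int(eps * n)] = 100.0       # adversarial spike in an ε-fraction
print(np.mean(x))               # ruined: ≈ ε·100 = 5
print(trimmed_mean(x, eps))     # ≈ 0.09, on the order of ε·√log(1/ε)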
1.4 Structure of this Article

In Section 2, we provide a unified presentation of two related algorithmic techniques that gave the first computationally efficient algorithms for high-dimensional robust mean estimation. Section 2 is the main technical section of this article and showcases a number of core ideas and techniques that can be applied to several high-dimensional robust estimation tasks. Section 3 provides an overview of recent algorithmic progress for more general robust estimation tasks. Finally, in Section 4 we conclude with a few general directions for future work.

1.5 Preliminaries and Notation

For a distribution X, we will use the notation x ∼ X to denote that x is a sample drawn from X. For a finite set S, we will write x ∼_u S to denote that x is drawn from the uniform distribution on S. The probability of event E will be denoted by Pr[E]. We will use E[X] and Var[X] to denote the expectation and the variance of a random variable X. If X is multivariate, we will denote by Cov[X] its covariance matrix. We will also use the notation µ_X and Σ_X for the mean and covariance of X. Similarly, for a finite set S, we will denote by µ_S and Σ_S the sample mean and sample covariance of S.

For a vector v, we will use ‖v‖2 to denote its ℓ2-norm. For a matrix A, we will denote by ‖A‖2 and ‖A‖_F its spectral and Frobenius norms respectively, and by tr(A) its trace. We will denote by ⪯ the Loewner order between matrices. We will use standard asymptotic notation O(·), Ω(·). The Õ(·) notation hides logarithmic factors in its argument.

2 High-Dimensional Robust Mean Estimation

In this section, we illustrate the main insights underlying recent robust high-dimensional learning algorithms by focusing on the problem of robust mean estimation. The algorithmic techniques presented in this section were introduced in [21, 22]. Here we give a simplified and unified presentation that illustrates the key ideas and the connections between them. The objective of this section is to provide the intuition and background required to develop robust learning algorithms in an accessible way. As such, we will not attempt to optimize the sample or computational complexities of the algorithms presented, other than to show that they are polynomial in the relevant parameters.

In the problem of robust mean estimation, we are given an ε-corrupted set of samples from a distribution X on R^d and our goal is to approximate the mean of X within small error in ℓ2-norm. In order for such a goal to be information-theoretically possible, it is required that X belongs to a suitably well-behaved family of distributions. A typical assumption is that X belongs to a family whose moments are guaranteed to satisfy certain conditions, or equivalently, a family with appropriate concentration properties. In our initial discussion, we will use the running example of a spherical Gaussian, although the results presented here hold in greater generality. That is, the reader is encouraged to imagine that X is of the form N(µ, I) for some unknown µ ∈ R^d.

Structure of this Section. In Section 2.1, we discuss the basic intuition underlying the presented approach. In Section 2.2, we describe a stability condition that is necessary for the algorithms in this section to succeed. In the subsequent subsections, we present two related algorithmic techniques taking advantage of the stability condition in different ways. Specifically, in Section 2.3, we describe an algorithm that relies on convex programming. In Section 2.4, we describe an iterative outlier removal technique, which has been the method of choice in practice. In Section 2.5, we conclude with an overview of the relevant literature.

2.1 Key Difficulties and High-Level Intuition

Arguably the most natural idea to robustly estimate the mean of a distribution would be to identify the outliers and output the empirical mean of the remaining points. The key conceptual difficulty is the fact that, in high dimensions, the outliers cannot be identified at an individual level, even when they move the mean significantly. In many cases, we can easily identify the "extreme outliers" via a pruning procedure exploiting the concentration properties of the inliers. Alas, such naive approaches typically do not suffice to obtain non-trivial error guarantees.

The simplest example illustrating this difficulty is that of a high-dimensional spherical Gaussian. Typical samples will be at ℓ2-distance approximately Θ(√d) from the true mean. That is, we can certainly identify as outliers all points of our dataset at distance more than Ω(√d) from the coordinate-wise median of the dataset. All other points cannot be removed via such a procedure, as this could result in removing many inliers as well. However, by placing an ε-fraction of outliers at distance √d in the same direction from the unknown mean, an adversary can corrupt the sample mean by as much as Ω(ε√d).
This leaves the algorithm designer with a dilemma of sorts. On the one hand, potential outliers at distance Θ(√d) from the unknown mean could lead to large ℓ2-error, scaling polynomially with d. On the other hand, if the adversary places outliers at distance approximately Θ(√d) from the true mean in random directions, it may be information-theoretically impossible to distinguish them from the inliers. The way out is the realization that it is in fact not necessary to detect and remove all outliers. It is only required that the algorithm can detect the "consequential outliers", i.e., the ones that can significantly impact our estimates of the mean.

Let us assume without loss of generality that there are no extreme outliers (as these can be removed via pre-processing). Then the only way that the empirical mean can be far from the true mean is if there is a "conspiracy" of many outliers, all producing errors in approximately the same direction. Intuitively, if our corrupted points are at distance O(√d) from the true mean in random directions, their contributions will on average cancel out, leading to a small error in the sample mean. In conclusion, it suffices to be able to detect these kinds of conspiracies of outliers.

The next key insight is simple and powerful. Let T be an ε-corrupted set of points drawn from N(µ, I). If such a conspiracy of outliers substantially moves the empirical mean µ̂ of T, it must move µ̂ in some direction. That is, there is a unit vector v such that these outliers cause v · (µ̂ − µ) to be large. For this to happen, it must be the case that these outliers are on average far from µ in the v-direction. In particular, if an ε-fraction of corrupted points in T move the sample average of v · (U_T − µ), where U_T is the uniform distribution on T, by more than δ (where δ should be thought of as small, but substantially larger than ε), then on average these corrupted points x must have v · (x − µ) at least δ/ε. This in turn means that these corrupted points will have a contribution of at least ε · (δ/ε)² = δ²/ε to the variance of v · U_T. Fortunately, this condition can actually be algorithmically detected! In particular, by computing the top eigenvector of the sample covariance matrix, we can efficiently determine whether or not there is any direction v for which the variance of v · U_T is abnormally large.

The aforementioned discussion leads us to the overall structure of the algorithms we will describe in this section. Starting with an ε-corrupted set of points T (perhaps weighted in some way), we compute the sample covariance matrix and find the eigenvector v* with largest eigenvalue λ*. If λ* is not much larger than what it should be (in the absence of outliers), by the above discussion, the empirical mean is close to the true mean, and we can return that as an answer. Otherwise, we have obtained a particular direction v* for which we know that the outliers play an unusual role, i.e., behave significantly differently from the inliers. The distribution of the points projected in the v*-direction can then be used to perform some sort of outlier removal. The outlier removal procedure can be quite subtle and crucially depends on our distributional assumptions about the clean data.
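Both halves of this discussion are easy to see numerically. In the toy experiment below (ours), a coordinated ε-fraction of outliers at distance √d shifts the empirical mean by about ε√d, while the top eigenvalue of the sample covariance, about 1 + ε(1 − ε)d instead of 1, gives the conspiracy away, and its eigenvector recovers the planted direction.

import numpy as np

rng = np.random.default_rng(3)
n, d, eps = 20_000, 200, 0.1
k = int(eps * n)

v = np.ones(d) / np.sqrt(d)          # the conspiracy direction (unknown to us)
X = rng.normal(size=(n, d))          # inliers from N(0, I)
X[:k] = np.sqrt(d) * v               # outliers at distance √d, all aligned

# The coordinated outliers shift the mean by ≈ ε·√d ...
print(np.linalg.norm(X.mean(axis=0)))   # ≈ 0.1·√200 ≈ 1.4

# ... and inflate the variance in their direction by ≈ ε(1-ε)·d,
# which the top eigenvalue of the sample covariance exposes.
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
print(eigvals[-1])                       # ≈ 1 + 0.09·200 ≈ 19, far above 1
v_star = eigvecs[:, -1]
print(abs(v_star @ v))                   # ≈ 1: recovers the direction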
2.2 Good Sets and Stability

In this section, we give a deterministic condition on the uncorrupted data that we call stability (Definition 2.1), which is necessary for the algorithms presented here to succeed. Furthermore, we provide an efficiently checkable condition under which the empirical mean is certifiably close to the true mean (Lemma 2.4).

Let S be a set of n i.i.d. samples drawn from X. We will typically call these sample points good. The adversary can select up to an ε-fraction of points in S and replace them with arbitrary points to obtain an ε-corrupted set T, which is given as input to the algorithm. To establish correctness of an algorithm, we need to show that with high probability over the choice of the set S, for any choice of how the adversary corrupts the good samples, the algorithm will output an accurate estimate of the target mean.

To carry out such an analysis, it is convenient to explicitly state a collection of sufficient deterministic conditions on the set S. Specifically, we will define a notion of a "stable" set, quantified by the proportion of contamination ε and the distribution X. The precise stability conditions vary considerably based on the underlying estimation task and the assumptions on the distribution family of the uncorrupted data. Roughly speaking, we require that the uniform distribution over a stable set S behaves similarly to the distribution X with respect to higher moments and, potentially, tail bounds. Importantly, we require that these conditions hold even after removing an arbitrary ε-fraction of points in S.

The notion of a stable set must have two critical properties: (1) A set of n i.i.d. samples from X is stable with high probability, when n is at least a sufficiently large polynomial in the relevant parameters; and (2) If S is a stable set and T is obtained from S by changing at most an ε-fraction of the points in S, then the algorithm, when run on the set T, will succeed.

The robust mean estimation algorithms that will be presented in this section crucially rely on considering sample means and covariances. The following stability condition is an important ingredient in the success criteria of these algorithms:

Definition 2.1 (Stability Condition). Fix 0 < ε < 1/2 and δ ≥ ε. A finite set S ⊂ R^d is (ε, δ)-stable (with respect to a distribution X) if for every unit vector v ∈ R^d and every S′ ⊆ S with |S′| ≥ (1 − ε)|S|, the following conditions hold:

1. | (1/|S′|) Σ_{x∈S′} v · (x − µ_X) | ≤ δ, and
2. | (1/|S′|) Σ_{x∈S′} (v · (x − µ_X))² − 1 | ≤ δ²/ε.

The aforementioned stability condition, or a variant thereof, is used in every known robust mean estimation algorithm. Definition 2.1 requires that after restricting to a (1 − ε)-density subset S′, the sample mean of S′ is within δ of µ_X, and the sample variance of S′ is 1 ± δ²/ε in every direction. (We note that Definition 2.1 is intended for distributions X with covariance Σ_X = I or, more generally, Σ_X ⪯ I. The case of arbitrary known or bounded covariance can be reduced to this case via an appropriate linear transformation of the data.)
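Checking Definition 2.1 exactly requires examining every unit vector and every large subset. Along a fixed direction, however, the worst-case subset is obtained by deleting an ε-tail, as the proof sketch of Proposition 2.2 below exploits. The following sketch (ours) estimates the worst-case deviations along a few random directions; it assumes the true mean is known (it appears in the definition), and it is a spot check only, not a certificate over all directions.

import numpy as np

def stability_along(S, v, eps, mu):
    """Worst-case deviations of Definition 2.1 along one direction v.

    For fixed v, the mean condition is extremized by deleting an ε-tail
    of v·x on one side, and the variance condition by deleting the points
    with largest (resp. smallest) squared projections.
    """
    y = (S - mu) @ v
    keep = len(y) - int(eps * len(y))
    ys = np.sort(y)
    dev_mean = max(abs(ys[:keep].mean()), abs(ys[-keep:].mean()))
    sq = np.sort(y ** 2)
    dev_var = max(abs(sq[:keep].mean() - 1), abs(sq[-keep:].mean() - 1))
    # Stable along v iff dev_mean ≤ δ and dev_var ≤ δ²/ε.
    return dev_mean, dev_var

rng = np.random.default_rng(4)
n, d, eps = 100_000, 50, 0.05
S = rng.normal(size=(n, d))
for _ in range(3):
    v = rng.normal(size=d)
    v /= np.linalg.norm(v)
    print(stability_along(S, v, eps, mu=np.zeros(d)))
    # For a Gaussian: dev_mean ≈ 0.1 ≈ ε√(2·log(1/ε)), dev_var ≈ ε·2·log(1/ε)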
The fact that the conditions of Definition 2.1 must hold for every large subset S′ of S might make it unclear whether they can hold with high probability. However, one can show the following:

Proposition 2.2. A set of i.i.d. samples from an identity covariance sub-gaussian distribution of size Ω(d/ε²) is (ε, O(ε√log(1/ε)))-stable with high probability.

We sketch a proof of Proposition 2.2. The only property required for the proof is that the distribution of the uncorrupted data has identity covariance and sub-gaussian tails in each direction, i.e., the tail probability of each univariate projection is bounded from above by the corresponding Gaussian tail. Fix a direction v. To show that the first condition holds for this v, we note that we can maximize (1/|S′|) Σ_{x∈S′} v · (x − µ_X) by removing from S the ε-fraction of points x for which v · x is smallest. Since the empirical mean of S is close to µ_X with high probability, we need to understand how much this quantity is altered by removing the ε-tail in the v-direction. Given our assumptions on the distribution of the uncorrupted data, removing the ε-tail only changes the mean by O(ε√log(1/ε)). Therefore, if the empirical distribution of v · x, x ∈ S, behaves like a spherical Gaussian in this way, the first condition is satisfied. The second condition follows via a similar analysis. We can minimize the relevant quantity by removing the ε-fraction of points x ∈ S with |v · (x − µ_X)| as large as possible. If v · x is distributed like a unit-variance sub-gaussian, the total mass of its square over the ε-tails is O(ε log(1/ε)). We have thus established that both conditions hold with high probability for any fixed direction. That the conditions hold with high probability for all directions simultaneously can be shown by an appropriate covering argument.

More generally, one can show quantitatively different stability conditions under various distributional assumptions. In particular, we state the following proposition without proof. (The interested reader is referred to [22] for a proof.)

Proposition 2.3. A set of i.i.d. samples from a distribution with covariance Σ ⪯ I of size Ω̃(d/ε) is (ε, O(√ε))-stable with high probability.

We note that analogous bounds can be easily shown for identity covariance distributions with bounded higher central moments. For example, if our distribution has identity covariance and its k-th central moment is bounded from above by a constant, one can show that a set of Ω(d/ε^{2−2/k}) samples is (ε, O(ε^{1−1/k}))-stable with high probability.

The aforementioned notion of stability is powerful and suffices for robust mean estimation. Some of the algorithms that will be presented in this section work under only the stability condition, while others require additional conditions beyond stability. The main reason why stability suffices is quantified in the following lemma:

Lemma 2.4 (Certificate for Empirical Mean). Let S be an (ε, δ)-stable set with respect to a distribution X, for some δ ≥ ε > 0. Let T be an ε-corrupted version of S. Let µ_T and Σ_T be the empirical mean and covariance of T.
If the largest eigenvalue of Σ_T is at most 1 + λ, for some λ ≥ 0, then ‖µ_T − µ_X‖2 ≤ O(δ + √(ελ)).

Proof of Lemma 2.4. Let S′ = S ∩ T and T′ = T \ S′. By replacing S′ with a subset if necessary, we may assume that |S′| = (1 − ε)|S| and |T′| = ε|S|. Let µ_{S′}, µ_{T′}, Σ_{S′}, Σ_{T′} denote the empirical means and covariance matrices of S′ and T′. A simple calculation gives that

    Σ_T = (1 − ε)Σ_{S′} + εΣ_{T′} + ε(1 − ε)(µ_{S′} − µ_{T′})(µ_{S′} − µ_{T′})^T .

Let v be the unit vector in the direction of µ_{S′} − µ_{T′}. We have that

    1 + λ ≥ v^T Σ_T v = (1 − ε) v^T Σ_{S′} v + ε v^T Σ_{T′} v + ε(1 − ε) v^T (µ_{S′} − µ_{T′})(µ_{S′} − µ_{T′})^T v
          ≥ (1 − ε)(1 − δ²/ε) + ε(1 − ε)‖µ_{S′} − µ_{T′}‖2²
          ≥ 1 − O(δ²/ε) + (ε/2)‖µ_{S′} − µ_{T′}‖2² ,

where we used the variational characterization of eigenvalues, the fact that Σ_{T′} is positive semidefinite, and the second stability condition for S′. By rearranging, we obtain that ‖µ_{S′} − µ_{T′}‖2 = O(δ/ε + √(λ/ε)). Therefore, we can write

    ‖µ_T − µ_X‖2 = ‖(1 − ε)µ_{S′} + εµ_{T′} − µ_X‖2 = ‖µ_{S′} − µ_X + ε(µ_{T′} − µ_{S′})‖2
                 ≤ ‖µ_{S′} − µ_X‖2 + ε‖µ_{S′} − µ_{T′}‖2 = O(δ) + ε · O(δ/ε + √(λ/ε)) = O(δ + √(ελ)) ,

where we used the first stability condition for S′ and our bound on ‖µ_{S′} − µ_{T′}‖2.

We note that the proof of Lemma 2.4 only used the lower bound part of the second condition of Definition 2.1, i.e., that the sample variance of S′ in each direction is at least 1 − δ²/ε. The upper bound part will be crucially used in the design and analysis of our robust mean estimation algorithms in the following sections.

Lemma 2.4 says that if our input set of points T is an ε-corrupted version of any stable set S and has bounded sample covariance, then the sample mean of T closely approximates the true mean. This lemma, or a variant thereof, is a key result in all known robust mean estimation algorithms. Unfortunately, we are not always guaranteed that the set T we are given has this property. In order to deal with this, we will want to find a subset of T with bounded covariance and large intersection with S. However, for some of the algorithms presented, it will be convenient to find a probability distribution over T rather than a subset. For this, we will need a slight generalization of Lemma 2.4.

Lemma 2.5. Let S be an (ε, δ)-stable set with respect to a distribution X, for some δ ≥ ε > 0 with |S| > 1/ε. Let W be a probability distribution that differs from U_S, the uniform distribution over S, by at most ε in total variation distance. Let µ_W and Σ_W be the mean and covariance of W. If the largest eigenvalue of Σ_W is at most 1 + λ, for some λ ≥ 0, then ‖µ_W − µ_X‖2 ≤ O(δ + √(ελ)).

Note that this subsumes Lemma 2.4 by letting W be the uniform distribution over T. The proof is essentially identical to that of Lemma 2.4, except that we need to show that the mean and variance of the conditional distribution W|_S are approximately correct, whereas in Lemma 2.4 the bounds on the mean and variance of S ∩ T followed directly from stability.

Lemma 2.5 clarifies the goal of our outlier removal procedure. In particular, given our initial ε-corrupted set T, we will attempt to find a distribution W supported on T so that Σ_W has no large eigenvalues. The weight W(x), x ∈ T, quantifies our belief as to whether point x is an inlier or an outlier. We will also need to ensure that any such W we choose is close to the uniform distribution over S.
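In algorithmic form, Lemma 2.4 gives an immediately checkable certificate. A minimal sketch (ours; the input lam plays the role of λ and is an assumption of the caller):

import numpy as np

def certify_mean(T, lam):
    """Certificate of Lemma 2.4.

    If the top eigenvalue of the empirical covariance of T is at most
    1 + lam, return the empirical mean (which is O(δ + sqrt(ε·lam)) close
    to µ_X whenever the inliers form a stable set); otherwise return the
    offending direction for use by an outlier-removal step.
    """
    mu = T.mean(axis=0)
    cov = np.cov(T, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    if eigvals[-1] <= 1 + lam:
        return mu, None                # certified: output the sample mean
    return None, eigvecs[:, -1]        # top eigenvector: filter along it

Every algorithm in this section is, in essence, a loop around this test: either certify, or filter along the returned direction and repeat.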
More concretely, we now describe a framework that captures our robust mean estimation algorithms. We start with the following definition:

Definition 2.6. Let S be a (3ε, δ)-stable set with respect to X, and let T be an ε-corrupted version of S. Let C be the set of all probability distributions W supported on T with W(x) ≤ 1/(|T|(1 − ε)) for all x ∈ T.

We note that any distribution in C differs from U_S, the uniform distribution on S, by at most 3ε. Indeed, for ε ≤ 1/3, we have that:

    d_TV(U_S, W) = Σ_{x∈T} max{W(x) − U_S(x), 0}
                 = Σ_{x∈S∩T} max{W(x) − 1/|T|, 0} + Σ_{x∈T\S} W(x)
                 ≤ Σ_{x∈S∩T} ε/(|T|(1 − ε)) + Σ_{x∈T\S} 1/(|T|(1 − ε))
                 ≤ |T| · ε/(|T|(1 − ε)) + ε|T| · 1/(|T|(1 − ε)) = 2ε/(1 − ε) ≤ 3ε .

Therefore, if we find W ∈ C with Σ_W having no large eigenvalues, Lemma 2.5 implies that µ_W is a good approximation to µ_X. Fortunately, we know that such a W exists! In particular, if we take W to be W*, the uniform distribution over S ∩ T, the largest eigenvalue is at most 1 + δ²/ε, and thus we achieve ℓ2-error O(δ).

At this point, we have an inefficient algorithm for approximating µ_X: Find any W ∈ C with bounded covariance. The remaining question is how we can efficiently find one. There are two basic algorithmic techniques to achieve this, which we present in the subsequent subsections.

The first algorithmic technique we will describe is based on convex programming. We will call this the unknown convex programming method. Note that C is a convex set and that finding a point in C that has bounded covariance is almost a convex program. It is not quite a convex program, because the variance of v · W, for fixed v, is not a convex function of W. However, one can show that given a W with variance in some direction significantly larger than 1 + δ²/ε, we can efficiently construct a hyperplane separating W from W* (recall that W* is the uniform distribution over S ∩ T) (Section 2.3). This method has the advantage of naturally working under only the stability assumption. On the other hand, as it relies on the ellipsoid algorithm, it is quite slow (although polynomial time).

Our second technique, which we will call filtering, is an iterative outlier removal method that is typically faster, as it relies primarily on spectral techniques. The main idea of the method is the following: If Σ_W does not have large eigenvalues, then the empirical mean is close to the true mean. Otherwise, there is some unit vector v such that Var[v · W] is substantially larger than it should be. This can only be the case if W assigns substantial mass to elements of T \ S whose values of v · x are very far from v · µ_X. This observation allows us to perform some kind of outlier removal, in particular by removing (or down-weighting) the points x that have v · x inappropriately large. An important conceptual property is that one cannot ensure that only outliers are removed, but it is possible to ensure that more outliers are removed than inliers.
Given a W ∈ C where Σ_W has a large eigenvalue, one filtering step gives a new distribution W′ ∈ C that is closer to W* than W was. Repeating the process eventually gives a W with no large eigenvalues. The filtering method and its variations are discussed in Section 2.4.

2.3 The Unknown Convex Programming Method

By Lemma 2.5, it suffices to find a distribution W ∈ C with Σ_W having no large eigenvalues. We note that this condition almost defines a convex program. This is because C is a convex set of probability distributions and the bounded covariance condition says that Var[v · W] ≤ 1 + λ for all unit vectors v. Unfortunately, the variance Var[v · W] = E[|v · (W − µ_W)|²] is not quite linear in W. (If we instead had E[|v · (W − ν)|²], where ν is some fixed vector, this would be linear in W.) However, we will show that a unit vector v for which Var[v · W] is too large can be used to obtain a separation oracle, i.e., a linear function L for which L(W) > L(W*).

Suppose that we identify a unit vector v such that Var[v · W] = 1 + λ, where λ > c(δ²/ε) for a sufficiently large universal constant c > 0. Applying Lemma 2.5 to the one-dimensional projection v · W gives

    |v · (µ_W − µ_X)| ≤ O(δ + √(ελ)) = O(√(ελ)) .

Let L(Y) := E_Y[|v · (Y − µ_W)|²] and note that L is a linear function of the probability distribution Y with L(W) = 1 + λ. We can write

    L(W*) = E_{W*}[|v · (W* − µ_W)|²] = Var[v · W*] + |v · (µ_W − µ_{W*})|²
          ≤ 1 + δ²/ε + 2|v · (µ_W − µ_X)|² + 2|v · (µ_{W*} − µ_X)|²
          ≤ 1 + O(δ²/ε + ελ) < 1 + λ = L(W) .

In summary, we have an explicit convex set C of probability distributions from which we want to find one whose covariance has eigenvalues bounded by 1 + O(δ²/ε). Given any W ∈ C which does not satisfy this condition, we can produce a linear function L that separates W from W*. Using the ellipsoid algorithm, we obtain the following general theorem:

Theorem 2.7. Let S be a (3ε, δ)-stable set with respect to a distribution X and let T be an ε-corrupted version of S. There exists a polynomial-time algorithm which, given T, returns µ̂ such that ‖µ̂ − µ_X‖2 = O(δ).
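The separation oracle just described is straightforward to implement. The sketch below (ours) represents a distribution W ∈ C as a weight vector w over the points of T, and either certifies bounded variance or returns the coefficients of the separating linear functional L; the surrounding ellipsoid iteration over C is omitted, and the threshold is an input assumption.

import numpy as np

def separation_oracle(T, w, threshold):
    """Try to separate the weight vector w from W* (Section 2.3).

    Returns None if Var[v·W] ≤ threshold for every unit v (certified case);
    otherwise returns the per-point coefficients of the linear functional
    L(Y) = Σ_x Y(x)·(v·(x − µ_W))², which satisfies L(W) > L(W*).
    """
    mu_w = w @ T                                   # weighted mean
    centered = T - mu_w
    cov_w = centered.T @ (centered * w[:, None])   # weighted covariance
    eigvals, eigvecs = np.linalg.eigh(cov_w)
    if eigvals[-1] <= threshold:
        return None
    v = eigvecs[:, -1]                             # worst direction
    return (centered @ v) ** 2                     # coefficients of L

Coupled with the ellipsoid algorithm over the convex set C, this oracle yields Theorem 2.7; in practice, the filtering method of Section 2.4 is preferred.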
Implications for Concrete Distribution Families. Combining Theorem 2.7 with the corresponding stability bounds, we obtain a number of concrete applications for various distribution families of interest. Using Proposition 2.2, we obtain:

Corollary 2.8 (Identity Covariance Sub-gaussian Distributions). Let T be an ε-corrupted set of samples of size Ω(d/ε²) from an identity covariance sub-gaussian distribution X on R^d. There exists a polynomial-time algorithm which, given T, returns µ̂ such that with high probability ‖µ̂ − µ_X‖2 = O(ε√log(1/ε)).

We note that Corollary 2.8 can be immediately adapted for identity covariance distributions satisfying weaker concentration assumptions. For example, if X satisfies sub-exponential concentration in each direction, we obtain an efficient robust mean estimation algorithm with ℓ2-error of O(ε log(1/ε)). If X has identity covariance and bounded k-th central moments, k ≥ 2, we obtain error O(ε^{1−1/k}). These error bounds are information-theoretically optimal, within constant factors.

For distributions with unknown bounded covariance, using Proposition 2.3 we obtain:

Corollary 2.9 (Unknown Bounded Covariance Distributions). Let T be an ε-corrupted set of samples of size Ω̃(d/ε) from a distribution X on R^d with unknown bounded covariance Σ_X ⪯ σ²I. There exists a polynomial-time algorithm which, given T, returns µ̂ such that with high probability ‖µ̂ − µ_X‖2 = O(σ√ε).

By the discussion following Fact 1.2, the above error bound is also information-theoretically optimal, within constant factors.

2.4 The Filtering Method

As in the unknown convex programming method, the goal of the filtering method is to find a distribution W ∈ C so that Σ_W has bounded eigenvalues. Given a W ∈ C, Σ_W either has bounded eigenvalues (in which case the weighted empirical mean works) or there is a direction v in which Var[v · W] is too large. In the latter case, the projections v · W must behave very differently from the projections v · S or v · X. In particular, since an ε-fraction of outliers is causing a much larger increase in the standard deviation, the distribution of v · W will have many "extreme points", more than one would expect to find in v · S. This fact allows us to identify a non-empty subset of extreme points, the majority of which are outliers. These points can then be removed (or down-weighted) in order to "clean up" our sample. Formally, given a W ∈ C whose covariance has unbounded eigenvalues, we can efficiently find a W′ ∈ C so that W′ is closer to W* than W was. Iterating this procedure eventually terminates, giving a W with bounded eigenvalues.

We note that while it may be conceptually useful to consider the above scheme for general distributions W over points, in most cases it suffices to consider only W given as the uniform distribution over some set of points. The filtering step in this case consists of replacing the set T by some subset T′ = T \ R, where R ⊂ T. To guarantee progress towards W* (the uniform distribution over S ∩ T), it suffices to ensure that at most a third of the elements of R are also in S, or equivalently, that at least two thirds of the removed points are outliers (perhaps in expectation). The algorithm will terminate when the current set of points T′ has bounded empirical covariance, and the output will be the empirical mean of T′.

Before we proceed with a more detailed technical discussion, we note that there are several possible ways to implement the filtering step, and that the method used has a significant impact on the analysis. In general, a filtering step removes all points that are "far" from the sample mean in a large-variance direction. However, the precise way that this is quantified can vary in important ways.

2.4.1 Basic Filtering

In this subsection, we present a filtering method that yields efficient robust mean estimators with optimal error bounds for identity covariance (or, more generally, known covariance) distributions whose univariate projections satisfy appropriate tail bounds. For the purposes of this section, we will restrict ourselves to the Gaussian setting. We note, however, that this method immediately extends to distributions with weaker concentration properties, e.g., sub-exponential or even inverse-polynomial concentration, with appropriate modifications.
We note that the filtering method presented here requires an additional condition on our good set of samples, on top of the stability condition. This is quantified in the following definition:

Definition 2.10. A set S ⊂ R^d is tail-bound-good (with respect to X = N(µ_X, I)) if, for any unit vector v and any t > 0, we have

    Pr_{x ∼_u S} [ |v · (x − µ_X)| > 2t + 2 ] ≤ e^{−t²/2} .   (1)

Since any univariate projection of X is distributed like a standard Gaussian, Equation (1) should hold if the uniform distribution over S were replaced by X. It can be shown that this condition holds with high probability if S consists of i.i.d. random samples from X of a sufficiently large polynomial size [21]. Intuitively, the additional tail condition of Definition 2.10 is needed to guarantee that the filtering method used here will remove more outliers than inliers. Formally, we have the following:

Lemma 2.11. Let ε > 0 be a sufficiently small constant. Let S ⊂ R^d be both (2ε, δ)-stable and tail-bound-good with respect to X = N(µ_X, I), with δ = cε√log(1/ε), for c > 0 a sufficiently large constant. Let T ⊂ R^d be such that |T ∩ S| ≥ (1 − ε) max(|T|, |S|), and assume we are given a unit vector v ∈ R^d for which Var[v · T] > 1 + 2δ²/ε. There exists a polynomial-time algorithm that returns a subset R ⊂ T satisfying |R ∩ S| < |R|/3.

To see why Lemma 2.11 suffices for our purposes, note that by replacing T by T′ = T \ R, we obtain a less noisy version of S than T was. In particular, it is easy to see that the size of the symmetric difference between S and T′ is strictly smaller than the size of the symmetric difference between S and T. From this it follows that the hypothesis |T ∩ S| ≥ (1 − ε) max(|T|, |S|) still holds when T is replaced by T′, allowing us to iterate this process until we are left with a set with small variance.

Proof. Let Var[v · T] = 1 + λ. By applying Lemma 2.4 to the set T, we get that |v · µ_X − v · µ_T| ≤ c√(λε). By (1), it follows that

    Pr_{x ∼_u S} [ |v · (x − µ_T)| > 2t + 2 + c√(λε) ] ≤ e^{−t²/2} .

We claim that there exists a threshold t₀ such that

    Pr_{x ∼_u T} [ |v · (x − µ_T)| > 2t₀ + 2 + c√(λε) ] > 4 e^{−t₀²/2} ,   (2)

where the constants have not been optimized. Given this claim, the set R = {x ∈ T : |v · (x − µ_T)| > 2t₀ + 2 + c√(λε)} will satisfy the conditions of the lemma.

To prove our claim, we analyze the variance of v · T and note that much of the excess must be due to points in T \ S. In particular, by our assumption on the variance in the v-direction, Σ_{x∈T} |v · (x − µ_T)|² = |T| Var[v · T] = |T|(1 + λ), where λ > 2δ²/ε. The contribution from the points x ∈ S ∩ T is at most

    Σ_{x∈S} |v · (x − µ_T)|² = |S| (Var[v · S] + |v · (µ_T − µ_S)|²) ≤ |S|(1 + δ²/ε + 2c²λε) ≤ |T|(1 + 2c²λε + 3λ/5) ,

where the first inequality uses the stability of S, and the last uses that |T| ≥ (1 − ε)|S|. If ε is sufficiently small relative to c, it follows that Σ_{x∈T\S} |v · (x − µ_T)|² ≥ |T|λ/3. On the other hand, by definition we have:

    Σ_{x∈T\S} |v · (x − µ_T)|² = |T| ∫₀^∞ 2t Pr_{x ∼_u T} [ |v · (x − µ_T)| > t, x ∉ S ] dt .   (3)

Assume for the sake of contradiction that there is no t₀ for which Equation (2) is satisfied. Then the right-hand side of (3) is at most

    |T| ( ∫₀^{2+c√(λε)+10√log(1/ε)} 2t Pr_{x ∼_u T}[x ∉ S] dt + ∫_{2+c√(λε)+10√log(1/ε)}^∞ 2t Pr_{x ∼_u T}[|v · (x − µ_T)| > t] dt )
    ≤ |T| ( ε (2 + c√(λε) + 10√log(1/ε))² + ∫_{5√log(1/ε)}^∞ 16 (2t + 2 + c√(λε)) e^{−t²/2} dt )
    ≤ |T| ( O(c²λε² + ε log(1/ε)) + O(ε²(√log(1/ε) + c√(λε))) )
    ≤ |T| O(c²λε² + (δ²/ε)/c) < |T| λ/3 ,

which is a contradiction. Therefore, the tail bounds and the concentration violation together imply the existence of such a t₀ (which can be efficiently computed).
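In code, one filtering step of Lemma 2.11 amounts to a threshold scan. The sketch below (ours) checks Equation (2) on a finite grid of candidate values of t rather than exactly, and the constants mirror the un-optimized ones of the proof; slack stands for the additive term c√(λε).

import numpy as np

def basic_filter_step(T, v, mu_T, slack):
    """One basic filtering step (cf. Lemma 2.11).

    T: (n, d) corrupted sample; v: unit vector with inflated variance;
    slack: the additive term c·sqrt(λ·ε) from the proof.
    Returns indices of the tail to remove, or None if no threshold works.
    """
    z = np.abs(T @ v - mu_T @ v)             # distances in the v-direction
    n = len(z)
    for t in np.linspace(0.0, np.sqrt(2 * np.log(n)), 200):
        cutoff = 2 * t + 2 + slack
        tail = np.count_nonzero(z > cutoff)
        if tail > 4 * np.exp(-t * t / 2) * n:   # Equation (2) holds at t
            return np.nonzero(z > cutoff)[0]    # a majority are outliers
    return None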
2.4.2 Randomized Filtering

The basic filtering method of the previous subsection is deterministic, relying on the violation of a concentration inequality satisfied by the inliers. In some settings, deterministic filtering seems to fail to give optimal results, and we require the filtering procedure to be randomized. A concrete such setting is when the uncorrupted distribution is only assumed to have bounded covariance.

The main idea of randomized filtering is simple: Suppose we can identify a non-negative function f(x), defined on the samples x, for which (under some high-probability condition on the inliers) it holds that Σ_T f(x) ≥ 2 Σ_S f(x), where T is an ε-corrupted set of samples and S is the corresponding set of inliers. Then we can create a randomized filter by removing each sample point x ∈ T with probability proportional to f(x). This ensures that the expected number of outliers removed is at least the expected number of inliers removed. The analysis of such a randomized filter is slightly more subtle, so we discuss it in the following paragraphs.

The key property the above randomized filter ensures is that the sequence of random variables (#Inliers removed) − (#Outliers removed) (where "inliers" are points in S and "outliers" points in T \ S) across iterations is a sub-martingale. Since the total number of outliers removed across all iterations accounts for at most an ε-fraction of the total samples, this means that with probability at least 2/3, at no point does the algorithm remove more than a 2ε-fraction of the inliers. A formal statement follows:

Theorem 2.12. Let S ⊂ R^d be a (6ε, δ)-stable set (with respect to X) and let T be an ε-corrupted version of S. Suppose that, given any T′ ⊆ T with |T′ ∩ S| ≥ (1 − 6ε)|S| for which Cov[T′] has an eigenvalue bigger than 1 + λ, for some λ ≥ 0, there is an efficient algorithm that computes a non-zero function f : T′ → R₊ such that Σ_{x∈T′} f(x) ≥ 2 Σ_{x∈T′∩S} f(x). Then there exists a polynomial-time randomized algorithm that computes a vector µ̂ that with probability at least 2/3 satisfies ‖µ̂ − µ_X‖2 = O(δ + √(ελ)).

The algorithm is described in pseudocode below:

Algorithm Randomized Filtering
1. Compute Cov[T] and its largest eigenvalue ν.
2. If ν ≤ 1 + λ, return µ_T.
3. Else
   • Compute f as guaranteed in the theorem statement.
   • Remove each x ∈ T with probability f(x)/max_{x∈T} f(x) and return to Step 1 with the new set T.
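A direct transcription of this pseudocode into Python (a sketch, ours): it plugs in the score function of Proposition 2.13 in Section 2.4.3 below as one valid choice of f, and lam and eps are the theorem's parameters, supplied by the caller.

import numpy as np

def randomized_filtering(T, lam, eps, rng):
    """Randomized filtering (pseudocode of Theorem 2.12).

    Uses the universal-filter score of Proposition 2.13 as f: the squared
    projection on the top eigenvector, restricted to the ε·|T| points on
    which that score is largest.
    """
    T = T.copy()
    while True:
        mu = T.mean(axis=0)
        cov = np.cov(T, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(cov)
        if eigvals[-1] <= 1 + lam:
            return mu                          # Step 2: certified mean
        v = eigvecs[:, -1]                     # Step 3: worst direction
        g = (T @ v - mu @ v) ** 2
        f = np.where(g >= np.quantile(g, 1 - eps), g, 0.0)
        # Remove each point with probability f(x)/max f(x); the argmax is
        # removed with probability 1, so the loop always terminates.
        keep = rng.random(len(T)) >= f / f.max()
        T = T[keep]

rng = np.random.default_rng(5)
n, d, eps = 5_000, 50, 0.1
X = rng.normal(size=(n, d))
X[: int(eps * n)] += 8.0                       # coordinated corruption
mu_hat = randomized_filtering(X, lam=0.5, eps=eps, rng=rng)
print(np.linalg.norm(mu_hat))   # small; the dirty mean was off by ε·8·√d ≈ 5.7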
Proof of Theorem 2.12. First, it is easy to see that this algorithm runs in polynomial time. Indeed, as the point x ∈ T attaining the maximum value of f(x) is definitely removed in each filtering iteration, each iteration reduces |T| by at least one.

To establish correctness, we will show that, with probability at least 2/3, at each iteration of the algorithm it holds that |S ∩ T| ≥ (1 − 6ε)|S|. Assuming this claim, Lemma 2.4 implies that our final error will be as desired.

To prove the desired claim, we consider the sequence of random variables d(T) = |S Δ T| = |S \ T| + |T \ S| across the iterations of the algorithm. We note that, initially, d(T) = 2ε|S| and that d(T) cannot drop below 0. Finally, we note that at each stage of the algorithm d(T) changes by (#Inliers removed) − (#Outliers removed), and that the expectation of this quantity is proportional to

    Σ_{x∈S∩T} f(x) − Σ_{x∈T\S} f(x) = 2 Σ_{x∈S∩T} f(x) − Σ_{x∈T} f(x) ≤ 0 .

This means that d(T) is a sub-martingale (at least until we reach a point where |S ∩ T| ≤ (1 − 6ε)|S|). However, if we set a stopping time at the first occasion where this condition fails, we note that the expectation of d(T) is at most 2ε|S|. Since d(T) is at least 0, this means that with probability at least 2/3 it is never more than 6ε|S|, which would imply that |S ∩ T| ≥ (1 − 6ε)|S| throughout the algorithm. If this is the case, the inequality |T′ ∩ S| ≥ (1 − 6ε)|S| will continue to hold throughout our algorithm, thus eventually yielding such a set with the variance of T′ bounded. By Lemma 2.4, the mean of this T′ will be a suitable estimate for the true mean.

Methods of Point Removal. The randomized filtering method described above only requires that each point x is removed with probability f(x)/max_{x∈T} f(x), without any assumption of independence. Therefore, given an f, there are several ways to implement this scheme. A few natural ones are given here, with a short code sketch following the list:

• Randomized Thresholding: Perhaps the easiest method for implementing our randomized filter is generating a uniform random number y ∈ [0, max_{x∈T} f(x)] and removing all points x ∈ T for which f(x) ≥ y. This method is practically useful in many applications. Finding the set of such points is often fairly easy, as this condition may well correspond to a simple threshold.

• Independent Removal: Each x ∈ T is removed independently with probability f(x)/max_{x∈T} f(x). This scheme has the advantage of leading to less variance in d(T). A careful analysis of the random walk involved allows one to reduce the failure probability to exp(−Ω(ε|S|)).

• Deterministic Reweighting: Instead of removing points, this scheme allows for weighted sets of points. In particular, each point will be assigned a weight in [0, 1], and we will consider weighted means and covariances. Instead of removing a point with probability proportional to f(x), we can remove a fraction of x's weight proportional to f(x). This ensures that the appropriate weighted version of d(T) is deterministically non-increasing, implying deterministic correctness of the algorithm.
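The three schemes differ only in how they consume the scores f(x). A compact sketch (ours), each operating on a score vector f over the current point set:

import numpy as np

def randomized_thresholding(f, rng):
    # One uniform threshold y; drop every point with f(x) >= y.
    y = rng.uniform(0.0, f.max())
    return f < y                           # boolean mask of points to keep

def independent_removal(f, rng):
    # Drop each point independently with probability f(x)/max f(x).
    return rng.random(len(f)) >= f / f.max()

def deterministic_reweighting(f, w):
    # Shrink weights instead of removing points; the point with the
    # largest score always has its weight zeroed, guaranteeing progress.
    return w * (1.0 - f / f.max())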
Practical Considerations. While the aforementioned point-removal methods have similar theoretical guarantees, recent implementations [24] suggest that they have different practical performance on real datasets. The deterministic reweighting method is somewhat slower in practice, as its worst-case runtime and its typical runtime are comparable. In more detail, one can guarantee termination by setting the constant of proportionality so that at each step at least one of the non-zero weights is set to zero. However, in practical circumstances we cannot expect to do better than this worst case: the algorithm may well be forced to undergo $\epsilon|S|$ iterations. On the other hand, the randomized versions of the algorithm are likely to remove several points of $T$ at each filtering step.

Another reason why the randomized versions may be preferable has to do with the quality of the results. The randomized algorithms only produce bad results when there is a chance that $d(T)$ ends up being very large. However, since $d(T)$ is a supermartingale, this will only ever be the case if there is a corresponding possibility that $d(T)$ will be exceptionally small. Thus, although the randomized algorithms may have some probability of giving worse results, this will only happen if, a corresponding fraction of the time, they also give better results than the theory guarantees. This consideration suggests that the randomized thresholding procedure might have advantages over the independent removal procedure precisely because it has a higher probability of failure. This has been observed experimentally in [24]: on real datasets (poisoned with a constant fraction of adversarial outliers), the number of iterations of randomized filtering is typically bounded by a small constant.

2.4.3 Universal Filtering

In this subsection, we show how to use randomized filtering to construct a universal filter that works under only the stability condition (Definition 2.1), without requiring the tail-bound condition of the basic filter (Lemma 2.11). Formally, we show:

Proposition 2.13. Let $S \subset \mathbb{R}^d$ be an $(\epsilon, \delta)$-stable set, for $\epsilon, \delta > 0$ sufficiently small constants with $\delta$ at least a sufficiently large multiple of $\epsilon$. Let $T$ be an $\epsilon$-corrupted version of $S$. Suppose that $\mathrm{Cov}[T]$ has largest eigenvalue $1 + \lambda > 1 + 8\delta^2/\epsilon$. Then there exists a computationally efficient algorithm that, on input $\epsilon, \delta, T$, computes a non-zero function $f: T \to \mathbb{R}_+$ satisfying $\sum_{x \in T} f(x) \ge 2 \sum_{x \in T \cap S} f(x)$.

By combining Theorem 2.12 and Proposition 2.13, we obtain a filter-based algorithm establishing Theorem 2.7.

Proof of Proposition 2.13. The algorithm to construct $f$ is the following: We start by computing the sample mean $\mu_T$ and the top (unit) eigenvector $v$ of $\mathrm{Cov}[T]$. For $x \in T$, we let $g(x) = (v \cdot (x - \mu_T))^2$. Let $L$ be the set of $\epsilon \cdot |T|$ elements of $T$ on which $g(x)$ is largest. We define $f$ by $f(x) = 0$ for $x \notin L$ and $f(x) = g(x)$ for $x \in L$.
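A minimal numpy sketch of this construction follows (the naming is ours); it returns the scores $f$ for the rows of $T$:

import numpy as np

def universal_filter_scores(T, eps):
    # Scores f from Proposition 2.13: f(x) = (v.(x - mu_T))^2 on the eps*|T|
    # points with the largest such values, and 0 elsewhere, where v is the
    # top eigenvector of Cov[T].
    mu = T.mean(axis=0)
    cov = np.cov(T, rowvar=False, bias=True)
    _, eigvecs = np.linalg.eigh(cov)
    v = eigvecs[:, -1]                            # top (unit) eigenvector
    g = ((T - mu) @ v) ** 2
    f = np.zeros(len(T))
    L = np.argsort(g)[-max(1, int(eps * len(T))):]  # indices of largest g
    f[L] = g[L]
    return f

This routine can serve as the compute_f oracle in the Randomized Filtering sketch of Section 2.4.2, e.g., via lambda T: universal_filter_scores(T, eps).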
Our basic plan of attack is as follows: First, we note that the sum of $g(x)$ over $x \in T$ is $|T|$ times the variance of $v \cdot Z$, $Z \sim_u T$, which is substantially larger than the sum of $g(x)$ over $S$, which is approximately $|S|$ times the variance of $v \cdot Z$, $Z \sim_u S$. Therefore, the sum of $g(x)$ over the $\epsilon|S|$ elements of $T \setminus S$ must be quite large. In fact, using the stability condition, we can show that the latter quantity must be larger than the sum of the largest $\epsilon|S|$ values of $g(x)$ over $x \in S$. However, since $|T \setminus S| \le |L|$, we have that $\sum_{x \in T} f(x) = \sum_{x \in L} g(x) \ge \sum_{x \in T \setminus S} g(x) \ge 2 \sum_{x \in S} f(x)$. We now proceed with the detailed analysis.

First, note that
\[ \sum_{x \in T} g(x) = |T|\, \mathrm{Var}[v \cdot T] = |T|(1 + \lambda). \]
Moreover, for any $S' \subseteq S$ with $|S'| \ge (1 - 2\epsilon)|S|$, we have that
\[ \sum_{x \in S'} g(x) = |S'| \big( \mathrm{Var}[v \cdot S'] + (v \cdot (\mu_T - \mu_{S'}))^2 \big). \tag{4} \]
By the second stability condition, we have that $|\mathrm{Var}[v \cdot S'] - 1| \le \delta^2/\epsilon$. Furthermore, the stability condition and Lemma 2.4 give
\[ \|\mu_T - \mu_{S'}\|_2 \le \|\mu_T - \mu_X\|_2 + \|\mu_X - \mu_{S'}\|_2 = O(\delta + \sqrt{\epsilon\lambda}). \]
Since $\lambda \ge 8\delta^2/\epsilon$, combining the above gives that $\sum_{x \in T \setminus S} g(x) \ge (2/3)|S|\lambda$. Moreover, since $|L| \ge |T \setminus S|$ and since $g$ takes its largest values on points $x \in L$, we have that
\[ \sum_{x \in T} f(x) = \sum_{x \in L} g(x) \ge \sum_{x \in T \setminus S} g(x) \ge (16/3)\, |S|\, \delta^2/\epsilon. \]
Comparing the results of Equation (4) with $S' = S$ and $S' = S \setminus L$, we find that
\[ \sum_{x \in S \cap T} f(x) = \sum_{x \in S \cap L} g(x) = \sum_{x \in S} g(x) - \sum_{x \in S \setminus L} g(x) = |S|\big(1 \pm \delta^2/\epsilon + O(\delta^2 + \epsilon\lambda)\big) - |S \setminus L|\big(1 \pm \delta^2/\epsilon + O(\delta^2 + \epsilon\lambda)\big) \le 2|S|\delta^2/\epsilon + |S|\, O(\delta^2 + \epsilon\lambda). \]
The latter quantity is at most $(1/2) \sum_{x \in T} f(x)$ when $\delta$ and $\epsilon/\delta$ are sufficiently small constants. This completes the proof of Proposition 2.13.

2.5 Bibliographic Notes

The convex programming and filtering methods described in this article appeared in [21, 22]. Here we gave a simplified and unified presentation of these techniques. The idea of removing outliers by projecting on the top eigenvector of the empirical covariance goes back to [53], who used it in the context of learning linear separators with malicious noise. That work [53] used a "hard" filtering step which only removes outliers and consequently leads to errors that scale logarithmically with the dimension. Subsequently, the work of [1] employed a soft outlier-removal step in the same supervised setting as [53] to obtain improved bounds for that problem. It should be noted that the soft-outlier method of [1] is similarly insufficient to obtain dimension-independent error bounds in the unsupervised setting.

The work of [57] developed a recursive dimension-halving technique for robust mean estimation. Their technique leads to error $O(\epsilon \sqrt{\log(1/\epsilon)} \sqrt{\log d})$ for Gaussian robust mean estimation in Huber's contamination model. In short, the algorithm of [57] begins by removing any extreme outliers from the input set of $\epsilon$-corrupted samples. This ensures that, after this basic outlier-removal step, the empirical covariance matrix has trace $d(1 + \widetilde{O}(\epsilon))$, which implies that the $d/2$ smallest eigenvalues are all at most $1 + \widetilde{O}(\epsilon)$. This allows [57] to show, using techniques akin to Lemma 2.4, that the projections of the true mean and the empirical mean onto the subspace spanned by the corresponding (small) eigenvectors are close. The [57] algorithm then uses this approximation for this projection of the mean, projects the remaining points onto the orthogonal subspace, and recursively finds the mean of the other projection.

In addition to robust mean and covariance estimation, [21, 57] gave robust learning algorithms for various other statistical tasks, including robust density estimation for mixtures of spherical Gaussians and binary product distributions, robust independent component analysis (ICA), and robust singular value decomposition (SVD).
Building on the robust mean estimation techniques of [21], [18] gave robust parameter estimation algorithms for Bayesian networks with known graph structure. Another extension of these results was found by [69], who gave an efficient algorithm for robust mean estimation with respect to all $\ell_p$-norms.

The algorithmic approaches described in this section robustly estimate the mean of a spherical Gaussian within error $O(\epsilon \sqrt{\log(1/\epsilon)})$ in the strong contamination model of Definition 1.1. A more sophisticated filtering technique that achieves the optimal error of $O(\epsilon)$ in the additive contamination model was developed in [23]. Very roughly, this algorithm proceeds by using a novel filter to remove bad points if the empirical covariance matrix has many eigenvalues of size $1 + \Omega(\epsilon)$. Otherwise, the algorithm uses the empirical mean to estimate the mean on the space spanned by the small eigenvectors, and then uses brute force to estimate the projection onto the few principal eigenvectors. For the strong contamination model, it was shown in [26] that any improvement on the $O(\epsilon \sqrt{\log(1/\epsilon)})$ error requires super-polynomial time in the Statistical Query model.

Finally, we note that ideas from [21] have led to proof-of-concept improvements in the analysis of genetic data [22] and in adversarial machine learning [24, 72].

3 Beyond Robust Mean Estimation

In this section, we provide an overview of the ideas behind recently developed robust estimators for more general statistical tasks. This section follows the structure of a STOC 2019 tutorial by the authors [29].

3.1 Robust Stochastic Optimization

It turns out that the algorithmic techniques for high-dimensional robust mean estimation described in the previous section can be viewed as useful primitives for robustly solving a range of machine learning problems. More specifically, we will argue in this section that any efficient robust mean estimator can be used (in an essentially black-box manner) to obtain efficient robust algorithms for machine learning tasks that can be expressed as stochastic optimization problems.

In a stochastic optimization problem, we are given samples from an unknown distribution $\mathcal{F}$ over functions $f: \mathbb{R}^d \to \mathbb{R}$, and our goal is to find an approximate minimizer of the function $F(w) = \mathbb{E}_{f \sim \mathcal{F}}[f(w)]$ over $\mathcal{W} \subseteq \mathbb{R}^d$. This framework encapsulates a number of well-studied machine learning problems. First, we note that the problem of mean estimation can be expressed in this form, by observing that the mean of a distribution $X$ is the value $\mu = \arg\min_{w \in \mathbb{R}^d} \mathbb{E}_{x \sim X}[\|w - x\|_2^2]$. That is, given a sample $x \sim X$, the distribution $\mathcal{F}$ over functions $f_x(w) = \|w - x\|_2^2$ turns the task of mean estimation into a stochastic optimization problem. A more interesting example is the problem of least-squares linear regression: Given a distribution $\mathcal{D}$ over pairs $(x, y)$, where $x \in \mathbb{R}^d$ and $y \in \mathbb{R}$, we want to find a vector $w \in \mathbb{R}^d$ minimizing $\mathbb{E}_{(x,y) \sim \mathcal{D}}[(w \cdot x - y)^2]$. Similarly, linear regression fits in the stochastic optimization framework by defining the distribution $\mathcal{F}$ over functions $f_{(x,y)}(w) = (w \cdot x - y)^2$, where $(x, y) \sim \mathcal{D}$. Similar formulations exist for numerous other machine learning problems, including $L_1$-regression, logistic regression, support vector machines, and generalized linear models (see, e.g., [24]).
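As an illustration of the framework, here is a minimal sketch of the per-sample losses and gradients for the two examples above (the function names are ours):

import numpy as np

# Mean estimation: f_x(w) = ||w - x||_2^2, minimized in expectation at the mean.
def mean_loss(w, x):
    return np.sum((w - x) ** 2)

def mean_grad(w, x):
    return 2.0 * (w - x)

# Least-squares regression: f_{(x,y)}(w) = (w.x - y)^2.
def regression_loss(w, x, y):
    return (w @ x - y) ** 2

def regression_grad(w, x, y):
    return 2.0 * (w @ x - y) * x

Averaging mean_grad over clean samples gives $2(w - \widehat{\mu})$, so (robustly) driving the average gradient to zero is exactly (robust) mean estimation.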
Finally , we note that the sto c hastic optimization framew ork encompasses non-conv ex p roblems as we ll. F or example, th e general and c hallenging problem of training a neural n et can b e expressed in this f r amew ork, where w represents some high-dimensional ve ctor of parameters classifying the n et and eac h f u nction f ( w ) quantifies ho w w ell that particular net classifies a given data p oin t. Before w e discuss r obust sto chastic optimiza tion, we mak e a few basic remarks regarding the non-robust s etting. W e start by n oting that, without an y assumptions, the p roblem of optimizing the function F ( w ) = E f ∼F [ f ( w )], even app ro ximately , is NP-hard. On the other hand, in man y situations, it su ffices to find an appr o ximate critical p oint of F , i.e., a p oin t w suc h that k∇ F ( w ) k 2 is small. F or example, if F is conv ex (whic h holds if eac h f ∼ F is conv ex), an app ro ximate critical p oin t is also an appro ximate global min im u m . F or several structured non-conv ex problems, an app ro ximate critical p oin t is also considered a satisfactory solution. On inp ut a set of clean samples, i.e., an i.i.d. set of functions f 1 , . . . , f n ∼ F , w e can efficien tly find an approximat e critical p oint of F u sing (pro j ected) gradien t descen t. F or more structured problems, e.g., linear regression, faster and more direct metho d s ma y b e a v ailable. In the robust setting, w e hav e access to an ǫ -corrup ted training set of functions f 1 , . . . , f n from F . Unfortun ately , even a single corrupted s amp le can completely compromise the guarant ees of gradien t descen t. The robust v er s ion of this p r oblem wa s first studied by [14], who consid ered th e problem in the case where a m a jorit y of the datap oint s are outliers. The v anilla outlier-robu s t setting, wh ere ǫ < 1 / 2, w as fi rst studied in tw o concurrent works [65, 24]. The main int uition present in b oth these wo r ks is that robustly estimating the gradien t of the ob jectiv e f unction can b e view ed as a robust mean estimation problem. As a result, if an efficien t r obust gradien t oracle is a v ailable, w e can “sim u late” gradient descent and compute an appr o ximate critical p oint of F . Note that this m etho d employs a robu st mean estimation algorithm at eve ry step of gradient descen t. The w ork of [24] also pr op osed an alternativ e approac h, whic h turn s out to b e muc h faster in practice. Instead of using a r obust gradien t estimator as a blac k-b o x, one uses any appro ximate empirical risk minimizer (ERM) in conjun ction with the fi ltering algorithm for robust mean es- timation of the pr evious sect ion. This metho d only requires blac k -b o x access to an appro ximate ERM and calls th e filtering r ou tin e only when the ERM reac hes an appr o ximate critical p oint. The correctness of this algorithm relies on structural pr op erties of the fi ltering metho d . Roughly sp eaking, the main idea is as follo w s: S u pp ose th at we ha ve reac hed an approximate critical p oin t w of the emp irical risk and at this stage we apply a fi ltering step. By th e guarante es of the fi lter, w e know that w e are in one of t w o cases: either the filtering step remov es more outliers (corru pted functions) than inliers (in exp ectation), or it ce rtifies that the gradien t of F at w is close to the gradien t of th e empirical risk at w . In the former case, we mak e p rogress as w e pro d uce a “cleaner” set of f unctions. 
In the latter case, since $w$ is an approximate critical point of the empirical risk, our certificate implies that $w$ is also an approximate critical point of $F$, as desired.

For the above robust optimization approaches to be computationally efficient, we require some assumptions on the distribution of the clean samples (functions). With no such assumption, much of the value of $F$ could be determined by a small fraction of the $f$'s, in such a way that it is computationally intractable to determine whether these values are due to corruptions or not. A natural condition used in [14, 24] is that, for every $w$, the covariance matrix of the gradient distribution, $\mathrm{Cov}_{f \sim \mathcal{F}}[\nabla f(w)]$, is bounded from above. Under this condition, if one has enough samples from $\mathcal{F}$ (so that the empirical covariance of the clean samples is bounded for all $w$), one can use the filtering algorithm of Section 2.4 to robustly estimate $\nabla F(w)$ to $\ell_2$-error $O(\sqrt{\epsilon})$ for any $w$. Using either of the two afore-described approaches, one can find a point $w$ such that $\|\nabla F(w)\|_2 = O(\sqrt{\epsilon})$ [24]. We note that [65] uses the robust mean estimator of [57], which requires somewhat stronger distributional assumptions and incurs error scaling logarithmically with the dimension.

In summary, we have described two meta-algorithms for robust stochastic optimization. An interesting open problem is to obtain a faster algorithm for this general task, in particular one that has information-theoretically optimal sample complexity and uses a minimum number of queries to an ERM oracle.
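A minimal sketch of the first meta-algorithm follows; robust_mean is any robust mean estimation routine (e.g., the filtering algorithm of Section 2.4), grads maps $w$ to the $n$ per-sample gradients, and the fixed step size is our simplification:

import numpy as np

def robust_gradient_descent(grads, w0, robust_mean, steps=100, lr=0.1):
    # grads       : function w -> (n, d) array of per-sample gradients,
    #               an eps-fraction of whose rows may be adversarial.
    # robust_mean : robust mean estimator applied to the gradients.
    w = w0
    for _ in range(steps):
        g = robust_mean(grads(w))   # robust estimate of grad F(w)
        w = w - lr * g
    return w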
Robust Linear Regression. As already mentioned, if $F$ is convex, the approximate critical point computed via robust stochastic optimization translates to an approximate global minimizer. In the following paragraphs, we describe how to obtain an approximate global minimum for the fundamental task of linear regression. Several other applications are given in [24, 65]. We will focus on the following standard setup: We are given a collection of labeled examples $(x^{(i)}, y^{(i)})$, where $x^{(i)}$ is drawn from a distribution $X$ on $\mathbb{R}^d$ and $y^{(i)} = \beta \cdot x^{(i)} + e^{(i)}$, where $e^{(i)}$ is drawn from a distribution $e$ that is independent of $X$ and has mean $0$ and variance $1$. The objective is to find a vector $w^* \in \mathbb{R}^d$ that approximately minimizes the function $F(w) = \mathbb{E}_{(x,y)}[f_{(x,y)}(w)]$, where $f_{(x,y)}(w) = (w \cdot x - y)^2$.

Recall that the robust stochastic optimization approach described in the previous paragraphs relies on the assumption that the covariance matrix of the gradients, $\mathrm{Cov}_{f \sim \mathcal{F}}[\nabla f(w)]$, is bounded. This condition translates to certain necessary conditions on the distribution $X$. To better understand these conditions, we first consider the gradient of $f$. We have that
\[ \nabla_w f_{(x,y)}(w) = 2(w \cdot x - y)\, x = 2\big((w - \beta) \cdot x - e\big)\, x. \]
Using this expression, it is not hard to see that the variance of the gradient $\nabla_w f_{(x,y)}(w)$ in the direction $v$ is equal to
\[ 4\, \mathbb{E}_{x \sim X}[(v \cdot x)^2] + 4\, \mathbb{E}_{x \sim X}\big[(v \cdot x)^2 ((w - \beta) \cdot x)^2\big]. \]
Note that the first term above is bounded as long as $X$ has bounded covariance. To bound the second, we will need to assume that $X$ has bounded fourth moments. In particular, if we assume that $X$ satisfies $\mathbb{E}_{x \sim X}[(v \cdot x)^4] = O(1)$ for all unit vectors $v$, the covariance of the gradients has maximum eigenvalue bounded by $O(1 + \|w - \beta\|_2^2)$.

For simplicity, let us assume that we know a priori a ball of constant $\ell_2$-radius containing $\beta$. Then, using our robust stochastic optimization routine, we can efficiently compute an approximate critical point with gradient of $\ell_2$-norm $O(\sqrt{\epsilon})$. We now show that such an approximate critical point is also an approximate global minimum. Note that the gradient of $F$ at $w$ equals $\mathbb{E}[2((w - \beta) \cdot x - e)\, x] = 2\, \mathbb{E}[x x^T](w - \beta)$. For convenience, we will additionally assume that $\mathbb{E}[x x^T]$ is bounded from below by a constant multiple of the identity matrix. This means that for any $w \in \mathbb{R}^d$ we have $\|w - \beta\|_2 = O(\|\nabla_w F(w)\|_2)$. Therefore, an approximate critical point of $F$ is equivalent to a good approximation of $\beta$, which is the global minimizer of $F$.

The above immediately gives an approximate global minimizer with $\ell_2$-error $O(\sqrt{\epsilon})$, assuming we started with a constant-radius ball containing $\beta$. It is not difficult to handle a very rough approximation to $\beta$ with error at most $R$. A simple (but somewhat inefficient) method is to search for $w$ within a ball of radius $R$, in which the covariance of the gradients is bounded by $O(1 + R^2)$. For $R > 1$, this guarantees a new point which approximates $\beta$ within $\ell_2$-error $O(\sqrt{\epsilon}\, R)$. Iterating this procedure, we can achieve a final error of $O(\sqrt{\epsilon})$. A detailed and more efficient procedure is described in [24].

Finally, it should be noted that the problem of robust linear regression has been extensively studied in recent years. Using the Sums-of-Squares hierarchy, [52] developed computationally efficient algorithms for robust linear regression. For the case where the covariates follow a Gaussian distribution, [31] obtained computationally efficient algorithms with near-minimax sample complexity and error guarantee. There has also been recent related work [8, 7, 71], which proposed efficient algorithms for "robust" linear regression in a restrictive corruption model that only allows adversarial corruptions to the responses, but not to the covariates.

3.2 Robust Covariance Estimation

The algorithmic techniques for high-dimensional robust mean estimation described in Section 2 can be generalized to robustly estimate higher moments under appropriate assumptions. In this section, we describe how to adapt the filter technique to robustly estimate the covariance matrix $\Sigma$ of a distribution $X$ satisfying appropriate moment conditions.

First note that we can assume without loss of generality that $X$ is centered. Specifically, by considering the differences between pairs of $\epsilon$-corrupted samples from $X$, we have access to a set of $2\epsilon$-corrupted samples from a distribution $X'$ with mean $0$ and covariance matrix $2\Sigma$.

The basic idea underlying this section is fairly simple: Robustly estimating the covariance matrix of a centered random variable $X$ is essentially equivalent to robustly estimating the mean of the random variable $Y = X X^T$. That is, the problem of robust covariance estimation can be "reduced" to the problem of robust mean estimation of a more complicated random variable.
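Schematically, the reduction looks as follows. This is a sketch only: robust_mean is a generic robust mean estimation oracle for vectors, and the moment conditions on $Y$ needed for that oracle to actually succeed are discussed next.

import numpy as np

def covariance_via_robust_mean(T, robust_mean, rng=np.random.default_rng(0)):
    # Estimate Cov[X] by robustly estimating the mean of Y = X'X'^T, where
    # X' is the difference of two independent samples, so that X' is
    # centered with covariance 2*Cov[X].
    idx = rng.permutation(len(T))
    m = len(T) // 2
    diffs = T[idx[:m]] - T[idx[m:2 * m]]          # pairwise differences X'
    Y = np.einsum('ni,nj->nij', diffs, diffs)     # outer products X'X'^T
    flat = Y.reshape(m, -1)                       # flatten to d^2-dim vectors
    sigma_flat = robust_mean(flat) / 2.0          # E[X'X'^T] = 2*Cov[X]
    d = T.shape[1]
    return sigma_flat.reshape(d, d)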
If the random variable $Y$ satisfies appropriate moment conditions (or tail bounds), we can hope to apply the techniques of the previous section. This is the approach taken by [21, 57]. At a very high level, it is possible to design a robust covariance estimation algorithm using the filtering method [21], where each filtering step removes points based on the empirical fourth moment tensor.

Formalizing the above approach requires some care, for the following reason: The robust mean estimation techniques for a distribution $Y$ require an a priori upper bound on its covariance $\mathrm{Cov}[Y]$. Unfortunately, such bounds do not hold for our random variable $Y = X X^T$, even if $X$ is a Gaussian distribution. To handle this issue, we need to use additional structural properties of $X$. Specifically, if $X \sim \mathcal{N}(0, \Sigma)$, we can leverage the fact that the covariance of $Y$ can be expressed as a function of the covariance of $X$. An upper bound on $\Sigma$ will give us an upper bound on the covariance of $Y$, which can then be used to obtain a better approximation of $\Sigma$. Applying this idea iteratively will allow us to bootstrap better and better approximations, until we end up with an approximation to $\Sigma$ with error close to the information-theoretic optimum. This method is proposed and analyzed (with slightly different terminology) in [21, 23].

Before we proceed with a detailed outline of the method, we should clarify the metric we will use to approximate the covariance matrix. Recall that for mean estimation we used the $\ell_2$-norm between vectors. A natural choice for the covariance would be the Frobenius norm, i.e., we would like to find an estimate $\widehat{\Sigma}$ of the true matrix $\Sigma$ such that $\|\widehat{\Sigma} - \Sigma\|_F$ is small. Here we will instead use the Mahalanobis distance, which is affine invariant and intuitively corresponds to multiplicative approximation. Specifically, we want to compute an estimate $\widehat{\Sigma}$ such that $\|\Sigma^{-1/2} \widehat{\Sigma} \Sigma^{-1/2} - I\|_F$ is small. A basic fact motivating the use of this metric is that the total variation distance between two Gaussian distributions $\mathcal{N}(0, \Sigma)$ and $\mathcal{N}(0, \widehat{\Sigma})$ is bounded from above by $O(\|\Sigma^{-1/2} \widehat{\Sigma} \Sigma^{-1/2} - I\|_F)$.

For the remainder of this section, we will assume for concreteness that $X \sim \mathcal{N}(0, \Sigma)$, and we will describe an efficient algorithm for robust covariance estimation that achieves Mahalanobis distance $O(\epsilon \log(1/\epsilon))$, which is within a logarithmic factor of the information-theoretic optimum of $\Theta(\epsilon)$. We note that the same approach gives error $O(\sqrt{\epsilon})$ for any distribution whose fourth moment tensor is appropriately bounded by a function of the covariance.

To get started, we first need to understand the relationship between $\boldsymbol{\Sigma} := \mathrm{Cov}[Y]$ (viewing $Y$ as a $d^2$-dimensional random vector) and $\Sigma$. To that end, let us denote by $A^{\mathrm{flat}}$ the canonical flattening of a matrix $A$ into a vector. With this notation, it is not hard to verify that
\[ (A^{\mathrm{flat}})^T\, \boldsymbol{\Sigma}\, A^{\mathrm{flat}} \;=\; 2\, \Big\| \Sigma^{1/2}\, \frac{A + A^T}{2}\, \Sigma^{1/2} \Big\|_F^2. \tag{5} \]
This formula essentially expresses $\boldsymbol{\Sigma}$ in terms of the quadratic form that it defines. An important consequence of Equation (5) is the following: Given covariance matrices $\Sigma, \Sigma'$ and the corresponding matrices $\boldsymbol{\Sigma}, \boldsymbol{\Sigma}'$, we have that if $\Sigma \preceq \Sigma'$, then $\boldsymbol{\Sigma} \preceq \boldsymbol{\Sigma}'$. In other words, if we have an upper bound on the true covariance $\Sigma$, this gives us an upper bound on the covariance of $Y$.
Specifically, if $\Sigma \preceq \Sigma_0$ for some matrix $\Sigma_0$, we have that $\mathrm{Cov}[\Sigma_0^{-1/2} Y \Sigma_0^{-1/2}] = O(I)$. Using our robust mean estimator for random variables with bounded covariance will allow us to approximate $\mathbb{E}[\Sigma_0^{-1/2} Y \Sigma_0^{-1/2}] = \Sigma_0^{-1/2} \Sigma \Sigma_0^{-1/2}$ to error $O(\sqrt{\epsilon})$ in Frobenius norm. This gives us an estimate $\widehat{\Sigma}$ such that $\|\Sigma_0^{-1/2} (\widehat{\Sigma} - \Sigma) \Sigma_0^{-1/2}\|_F = O(\sqrt{\epsilon})$. This means that, given an upper bound $\Sigma_0$ on $\Sigma$, we can obtain a better one. To obtain an initial upper bound $\Sigma_0$, we note that twice the sample covariance of a large set of samples from $X$ provides an upper bound on the true covariance of $X$ even with corruptions: although corruptions can substantially increase the empirical covariance of $X$, they cannot decrease it by much. Starting from $\Sigma_0$, we obtain a new approximation $\Sigma_1 = \Sigma + O(\sqrt{\epsilon}) \Sigma_0$; from this we can obtain an improved approximation $\Sigma_2 = \Sigma + O(\sqrt{\epsilon}) \Sigma_1$, and so forth. Iterating this technique yields a matrix $\widehat{\Sigma}$ such that $\|\Sigma^{-1/2} (\widehat{\Sigma} - \Sigma) \Sigma^{-1/2}\|_F = O(\sqrt{\epsilon})$.

The error guarantee of $O(\sqrt{\epsilon})$ achieved above is already fairly accurate. Once we have such a good approximation to the true covariance $\Sigma$, we can improve the error guarantee even further by using stronger tail bounds for the Gaussian distribution. In the following, we will assume a rough scale for $\Sigma$, namely that $I \preceq \Sigma \preceq 2I$. Suppose that, for some $\delta > \epsilon$, we have a matrix $\Sigma_0$ satisfying $\|\Sigma_0 - \Sigma\|_F \le \delta$. We can then use Equation (5) to approximate $\boldsymbol{\Sigma}$ from $\Sigma_0$. It is not hard to see that if we know $\Sigma$ within Frobenius norm $\delta$, this allows us to compute $\boldsymbol{\Sigma}$ within spectral norm $O(\delta)$. Thus, after applying an appropriate linear transformation to $Y$, we obtain a random variable $Y'$ whose covariance is within $O(\delta)$ of the identity matrix in spectral norm. We claim that $Y'$ has strong tail bounds. This follows from standard tail bounds for degree-$2$ polynomials over Gaussian random variables. Specifically, for any unit vector $v$, $v \cdot Y'$ is a quadratic polynomial in $X$ with variance $O(1)$. Standard results imply that $v \cdot Y'$ has exponential tails. From this we can show that any sufficiently large number of samples from $Y'$ will be $\big(O(\sqrt{\epsilon\delta} + \epsilon \log(1/\epsilon)),\, \epsilon\big)$-stable with high probability. In summary, if we know $\Sigma$ to Frobenius error $\delta$, we can use robust mean estimation techniques to learn it to Mahalanobis error $O(\sqrt{\epsilon\delta} + \epsilon \log(1/\epsilon))$. Iterating this, we can obtain a final Mahalanobis error of $O(\epsilon \log(1/\epsilon))$, as desired.

We conclude by noting that the aforementioned can be used to robustly learn a Gaussian distribution with unknown mean and covariance as follows: First, one learns the covariance as above (by reducing to the mean-$0$ case). Then one learns the mean of a Gaussian with an (approximately) known covariance matrix. In summary, one obtains a hypothesis Gaussian $\mathcal{N}(\widehat{\mu}, \widehat{\Sigma})$ within total variation distance $O(\epsilon \log(1/\epsilon))$ from the uncorrupted distribution $\mathcal{N}(\mu, \Sigma)$.
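The bootstrapping loop above can be sketched as follows; centered samples, a generic robust_mean oracle, and the constant c are our placeholder assumptions, and the current upper bound is assumed positive definite:

import numpy as np

def bootstrap_covariance(T, robust_mean, eps, rounds=5, c=1.0):
    # Maintain an upper bound sigma0 on Sigma, whiten by it, robustly
    # estimate the mean of the whitened outer products, un-whiten, and
    # tighten via sigma_{k+1} = sigma_hat + O(sqrt(eps)) * sigma_k.
    n, d = T.shape
    sigma0 = 2.0 * np.cov(T, rowvar=False, bias=True)    # initial upper bound
    for _ in range(rounds):
        w, V = np.linalg.eigh(sigma0)
        root = V @ np.diag(np.sqrt(w)) @ V.T             # sigma0^{1/2}
        inv_root = V @ np.diag(1.0 / np.sqrt(w)) @ V.T   # sigma0^{-1/2}
        Z = T @ inv_root                                 # whitened samples
        Y = np.einsum('ni,nj->nij', Z, Z).reshape(n, -1)
        M = robust_mean(Y).reshape(d, d)     # ~ sigma0^{-1/2} Sigma sigma0^{-1/2}
        sigma_hat = root @ M @ root                      # refined estimate
        sigma0 = sigma_hat + c * np.sqrt(eps) * sigma0   # tightened upper bound
    return sigma_hat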
3.3 Robust Sparse Estimation Tasks

The task of leveraging sparsity in high-dimensional parameter estimation is a well-studied problem in statistics. In the context of robust estimation, this problem was first considered in [2], which adapted the unknown convex programming method of [21] described in this article. Here we focus on robust sparse mean estimation and describe two algorithms: the convex programming algorithm of [2] and a novel filtering method [30] that only uses spectral operations.

Formally, given $\epsilon$-corrupted samples from $\mathcal{N}(\mu, I)$, where the mean $\mu$ is unknown and assumed to be $k$-sparse, i.e., supported on an unknown set of $k$ coordinates, we would like to approximate $\mu$ in $\ell_2$-distance. Without corruptions, this problem is easy: We draw $O(k \log(d/k)/\epsilon^2)$ samples and output the empirical mean truncated to its $k$ largest-magnitude entries. The goal is to obtain similar sample complexity and error guarantees in the robust setting.

At a high level, we note that the truncated sample mean should be accurate as long as there is no $k$-sparse direction in which the error between the true mean and the sample mean is large. This condition can be certified as long as we know that the sample variance of $v \cdot X$ is close to $1$ for all unit $k$-sparse vectors $v$. This would in turn allow us to create a filter-based algorithm for $k$-sparse robust mean estimation that uses only $O(k \log(d/k)/\epsilon^2)$ samples. While this idea naturally leads to a sample-optimal robust algorithm for the problem, it is computationally infeasible. This holds because the problem of determining whether there is a $k$-sparse direction with large variance (sparse PCA) is known to be computationally hard, even under natural distributional assumptions [6]. To circumvent this hardness result, [2] considers a convex relaxation of sparse PCA, which leads to a polynomial-time version of the aforementioned algorithm that requires $O(k^2 \log(d/k)/\epsilon^2)$ samples. Moreover, there is evidence [26], in the form of a lower bound in the Statistical Query model (a restricted but powerful computational model), that this quadratic blow-up in the sample complexity is necessary for polynomial-time algorithms. Note that although the $O(k^2 \log(d/k)/\epsilon^2)$ sample complexity is worse than the information-theoretic optimum of $\Theta(k \log(d/k)/\epsilon^2)$, for small $k$ it is still substantially better than the $\Omega(d/\epsilon^2)$ sample size required by dense methods.

The convex programming algorithm of [2] works as follows: Let $\widehat{\Sigma}$ be the empirical covariance matrix. If there is a $k$-sparse unit vector $v$ with $v^T \widehat{\Sigma} v$ large, we have that $\mathrm{tr}(\widehat{\Sigma} v v^T)$ is large. Here $v v^T$ is a positive semi-definite, trace-$1$ matrix whose entries have $\ell_2$-norm at most $1$ and $\ell_1$-norm at most $k$ (the latter following from the sparsity of $v$). The work of [2] considers the following convex relaxation of the problem of finding the sparse vector $v$: Find a positive semi-definite, trace-$1$ matrix $H$, whose entries have $\ell_2$-norm at most $1$ and $\ell_1$-norm at most $k$, so that $\mathrm{tr}(\widehat{\Sigma} H)$ is as large as possible. If the optimal solution to this convex relaxation is small, we have certified that $\widehat{\Sigma}$ has no sparse directions of large variance, and consequently we have certified the accuracy of the truncated empirical mean. On the other hand, if the optimal value of the convex relaxation is large, we have found a "sparse direction" of large variance and can use this to refine our set of samples. In particular, [2] use the ellipsoid method to find a subset of the samples so that, for all such $H$, $\mathrm{tr}(\widehat{\Sigma} H)$ is not too large.
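For reference, the truncated empirical mean used above is essentially a one-liner; in the robust algorithm it is returned only after the convex program has certified that no $k$-sparse direction has large empirical variance:

import numpy as np

def truncated_mean(T, k):
    # Empirical mean truncated to its k largest-magnitude coordinates.
    mu = T.mean(axis=0)
    out = np.zeros_like(mu)
    top = np.argsort(np.abs(mu))[-k:]
    out[top] = mu[top]
    return out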
To bound the sample complexity of this method, one needs to show that, for a sufficiently large set $S$ of i.i.d. samples from $\mathcal{N}(\mu, I)$, the empirical covariance of $S$, $\widehat{\Sigma}_S$, satisfies that the trace $\mathrm{tr}(\widehat{\Sigma}_S H)$ is appropriately small for all such $H$. One can show this as follows: First, it is easy to see that, given a set $S$ of size $\Omega(k^2 \log(d)/\epsilon^2)$, every entry of $\widehat{\Sigma}_S - I$ is $O(\epsilon/k)$ with high probability. If this holds, then we have that
\[ \mathrm{tr}(\widehat{\Sigma}_S H) = \mathrm{tr}(H) + \mathrm{tr}\big((\widehat{\Sigma}_S - I) H\big) = 1 + O(\epsilon/k) \cdot k = 1 + O(\epsilon), \]
where the second equality holds because the entries of $\widehat{\Sigma}_S - I$ have bounded $\ell_\infty$-norm and the entries of $H$ have bounded $\ell_1$-norm. Therefore, with $\Omega(k^2 \log(d)/\epsilon^2)$ samples, this algorithm can be shown to work with high probability.

In subsequent work, [61] gave an iterative filter-based method for robust sparse mean estimation, which avoids the use of the ellipsoid method but still requires multiple solutions to the convex relaxation of sparse PCA in each filtering iteration. Another algorithm for robust sparse mean estimation, proposed by [60], works via iterative trimmed thresholding. While this algorithm seems practically viable in terms of runtime, it can only tolerate a vanishingly small fraction of outliers.

More recently, [30] developed iterative spectral algorithms for robust sparse estimation tasks (including sparse mean estimation and sparse PCA). These algorithms achieve the same error guarantees as [2], while being significantly faster. In the context of robust sparse mean estimation, the algorithm of [30] considers the $O(k^2)$ largest entries of $\widehat{\Sigma} - I$. If the $\ell_2$-norm of these entries is much larger than $\epsilon$, it follows that there is a sparse, degree-$2$ polynomial $p(x)$ whose empirical expectation over the full (corrupted) set of samples is substantially different from its average value over the clean samples. This allows us to build a filter based on the values of the points under $p$. On the other hand, if this is not the case, it means that, for sparse vectors $v$, the contribution to $v^T (\widehat{\Sigma} - I) v$ coming from entries other than the top $O(k^2)$ ones is small. Therefore, $v^T \widehat{\Sigma} v$ could only be large if we could find such a $v$ supported on the rows and columns of these $O(k^2)$ entries. We can then check all vectors $v$ on this limited support. A careful analysis shows that with $\widetilde{O}(k^2 \log(d)/\epsilon^2)$ samples the appropriate concentration conditions hold for every $k^2$-sparse vector and degree-$2$ polynomial. This allows the appropriate filters to work with this sample size.

3.4 List-Decodable Learning

In this article, we have focused on the classical robust statistics setting, where the outliers constitute a minority of the dataset, quantified by the proportion of contamination $\epsilon < 1/2$, and the goal is to obtain estimators whose error scales as a function of $\epsilon$ (and is independent of the dimension $d$). A related setting of interest focuses on the regime where the fraction $\alpha$ of clean data (inliers) is small, i.e., strictly smaller than $1/2$. That is, we observe $n$ samples, an $\alpha$-fraction of which (for some $\alpha < 1/2$) are drawn from the distribution of interest, while the rest are arbitrary. This question was first studied in the context of mean estimation in [14].
A first observation is that, in this regime, it is information-theoretically impossible to estimate the mean with a single hypothesis. Indeed, an adversary can produce $\Omega(1/\alpha)$ clusters of points, each drawn from a copy of the target distribution with a different mean. Even if the algorithm could learn the distribution of the samples exactly, it still would not be able to identify which of the clusters is the correct one. To circumvent this bottleneck, the definition of learning must be somewhat relaxed: the algorithm should be allowed to return a small list of hypotheses with the guarantee that at least one of the hypotheses is close to the true mean. This is the model of list-decodable learning, a learning model introduced by [3]. Another qualitative difference from the small-$\epsilon$ regime is that in list-decodable learning it is often information-theoretically necessary for the error to increase without bound as the fraction of clean data $\alpha$ goes to $0$. In summary, given polynomially many corrupted samples, we would like to output $O(1/\alpha)$ (or $\mathrm{poly}(1/\alpha)$) many hypotheses with the guarantee that (with high probability) at least one hypothesis is within $f(\alpha)$ of the true mean, where $f(\alpha)$ depends on the concentration properties of the distribution in question, but is otherwise information-theoretically best possible.

The information-theoretic limits of list-decodable mean estimation have only been addressed very recently. The work [28] gave nearly tight bounds on the minimum error achievable for list-decodable mean estimation on $\mathbb{R}^d$ (with $\mathrm{poly}(1/\alpha)$ candidate hypotheses) for structured distribution families, including Gaussians and distributions with bounded covariance. In particular, the optimal $\ell_2$-error was determined to be $\Theta(\sqrt{\log(1/\alpha)})$ for spherical Gaussians and $\Theta(\alpha^{-1/2})$ for bounded covariance distributions.

The algorithmic aspects of list-decodable mean estimation have turned out to be much more challenging. For bounded covariance distributions, [14] gave an SDP-based algorithm achieving near-optimal error of $\widetilde{O}(\alpha^{-1/2})$. In the rest of this section, we describe a generalization of the filtering method for list-decodable mean estimation introduced in [28].

List-Decodable Mean Estimation via (Multi-)Filters. The filtering techniques discussed in Section 2.4 can be adapted to work in the list-decodable setting as well. For the remainder of this discussion, we will restrict ourselves to list-decodable mean estimation when the clean data is drawn from an identity covariance Gaussian distribution. At a high level, the adaptation of the filtering method works as follows: If the sample covariance matrix has no large eigenvalues, this certifies that the true mean and the sample mean are not too far apart. However, if a large eigenvalue exists, the construction of a filter is more elaborate. To some extent, this is a necessary difficulty, because the algorithm must return multiple hypotheses. To handle this issue, one needs to construct a multi-filter, which may return several subsets of the original dataset with the guarantee that at least one of them is cleaner than the original. Such a multi-filter was first introduced in [28]. We now proceed with a detailed overview.
The main idea is to employ some type of filtering to obtain a subset $S$ of our original dataset $T$ so that the following conditions are satisfied: (i) the set $S$ contains at least half of the original clean samples in $T$; and (ii) the empirical covariance of $S$ is bounded from above by some small parameter $\sigma > 0$ in every direction. If such a subset $S$ can be efficiently computed, we can certify that the empirical mean of $S$ will be close to the true mean. To show this, let $\mu$, $\mu_G$, and $\mu_S$ denote the true mean of the uncorrupted distribution, the mean of the clean samples in $S$, and the mean of all the samples in $S$, respectively. By condition (i), it is easy to see that $\|\mu - \mu_G\|_2 = O(1)$. On the other hand, the variance of $S$ in the $\mu_G - \mu_S$ direction is at least $(\alpha/2)\|\mu_G - \mu_S\|_2^2$, since an at least $(\alpha/2)$-fraction of clean samples have distance $\|\mu_G - \mu_S\|_2$ from the full mean. Combining this with condition (ii), we have that $\|\mu_G - \mu_S\|_2 = O(\sqrt{\sigma/\alpha})$, and hence by the triangle inequality it follows that $\|\mu - \mu_S\|_2 = O(1 + \sqrt{\sigma/\alpha})$.

We can attempt to use the filtering approach of Section 2.4 to find such a set $S$. If the initial set of samples $T$ has bounded covariance, then its empirical mean works, so we can use $T$ itself as our set $S$. Otherwise, we can project the samples in $T$ onto a direction of large variance in an attempt to remove outliers. Unfortunately, the outlier-removal step cannot be so straightforward in this setting. The filtering steps we have described so far generally work by first deriving an approximation to the true mean and then removing samples that are too far away from it. However, the first step of this procedure inherently fails here, since the outliers constitute the majority of the dataset. In particular, if the initial set of samples $T$ comes in two large but separated clusters, we will not be able to determine which cluster contains the true mean, and thus will not be able to find any points that are definitively outliers. This difficulty is of course necessary, as the list-decodable algorithm is in general required to produce several hypotheses. To circumvent this issue, our algorithm will return multiple subsets of points with the guarantee that at least one of these subsets is cleaner than the original. We will call such an algorithm a multi-filter.

Given a set $S$ of samples containing at least half of the original clean points and a direction in which $S$ has large variance, we want our multi-filter to return a collection of (potentially overlapping) subsets $S_i$ of $S$ with the following properties: First, we need it to be the case that, for at least one of these sets, at most an $\alpha/2$-fraction of the points in $S \setminus S_i$ are clean. Second, we need to ensure that the blowup in the number of such subsets is not too large. One way to achieve this is to require that $\sum_i |S_i|^2 \le |S|^2$. Our overall algorithm will work by maintaining several sets $S_i$ of samples. If any of these sets has too large variance in some direction, we will apply the multi-filter, replacing it with several smaller subsets. We note that, by the first condition above, if we started with a set where at most an $\alpha/2$-fraction of its complement was clean, at least one of the subsets will also have this property.
Therefore, at the end of this procedure, we are guaranteed to end up with at least one $S_i$ satisfying the conditions necessary to give a good approximation to the true mean. Our second condition implies that, at any stage of this algorithm, the sum of the squares of the sizes of the $S_i$'s will never exceed the squared size of our original set of samples. This condition guarantees that the sample complexity and runtime of the overall algorithm are polynomial. In fact, by observing that we only need to return the sample means of those $S_i$'s that contain at least an $\alpha/2$-fraction of our original set of samples, we will have at most $O(1/\alpha^2)$ hypotheses. To reduce the list size further, there is a simple method [28] that shows how to efficiently reduce any set of polynomially many hypotheses (at least one of which is guaranteed to be within $r$ of the true mean) to a list of $O(1/\alpha)$ hypotheses, at least one of which is nearly as close.

The multi-filter step works as follows: Given a direction $v$ in which the variance is too large, there are two ways we can attempt to get an appropriate collection of subsets $S_i$. First, if there is some interval $I$ so that all but an $\alpha/10$-fraction of the samples $x \in S$ have $v \cdot x \in I$, we know almost certainly that $v \cdot \mu \in I$, since $v \cdot \mu$ should have at least an $\alpha/4$-fraction of samples (coming from the clean samples) on either side of it. This implies that samples $x$ whose $v$-projections are at distance much further than $\sqrt{\log(1/\alpha)}$ from the endpoints of $I$ are almost certainly outliers. Using techniques similar to the ones we discussed in Section 2.4, if the variance in the $v$-direction is more than a sufficiently large constant multiple of $|I|^2 + \log(1/\alpha)$, one can find a single subset $S'$ of $S$ so that with high probability almost all points in $S \setminus S'$ are outliers.

It remains to handle the complementary case. If for some $x \in S$ we let $S_1 = \{ y \in S : v \cdot y \ge x - 10\sqrt{\log(1/\alpha)} \}$ and $S_2 = \{ y \in S : v \cdot y \le x + 10\sqrt{\log(1/\alpha)} \}$, it is not hard to see that at least one of $S_1$ or $S_2$ keeps almost all of the clean samples in $S$. In particular, if $v \cdot \mu \ge x$, then $S_1$ will only throw out clean samples that are at least $10\sqrt{\log(1/\alpha)}$ to the left of $\mu$ (i.e., at most an $\alpha^{10}$-fraction). Similarly, if $v \cdot \mu \le x$, then $S_2$ will throw away at most an $\alpha^{10}$-fraction of the clean samples. If, additionally, (i) each of $S \setminus S_1$ and $S \setminus S_2$ contains at least an $\alpha^2$-fraction of the total samples, and (ii) $|S_1|^2 + |S_2|^2 \le |S|^2$, then these subsets will suffice.

The key observation is that if the variance of $S$ in the $v$-direction is more than a sufficiently large multiple of $\log(1/\alpha)$, we can always apply at least one of the two multi-filters described above. This holds because, if we try to apply the second multi-filter for some given value of $x$, we will find that either the fraction of samples with $v \cdot y \le x - 10\sqrt{\log(1/\alpha)}$ is much smaller than the fraction with $v \cdot y \le x$, or the fraction of samples with $v \cdot y \ge x + 10\sqrt{\log(1/\alpha)}$ is much smaller than the fraction with $v \cdot y \ge x$. In either case, the tails of $v \cdot S$ must decay fairly rapidly, at least until they are smaller than $\alpha^2$. Thus, if we cannot apply this filter for any $x$, the contribution to the variance coming from everything except the tails must be small. On the other hand, letting $I$ be the interval excluding the $\alpha^2$-tails on either side, we can apply the first multi-filter if the contribution from the $\alpha^2$-tails is large. In summary, we can always apply one of the two multi-filters unless the variance of $v \cdot S$ is small. Overall, this algorithm outputs $\mathrm{poly}(1/\alpha)$ many hypotheses, at least one of which is within $O(\sqrt{\log(1/\alpha)}/\alpha)$ of the true mean.
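A minimal sketch of the second multi-filter case follows (the naming is ours; t0 plays the role of the threshold $x$ above, and the returned pair is used only when the stated size conditions hold):

import numpy as np

def multifilter_split(S, v, t0, alpha):
    # Split along direction v at threshold t0 into two overlapping subsets;
    # at least one of them keeps almost all clean samples.
    t = 10.0 * np.sqrt(np.log(1.0 / alpha))
    proj = S @ v
    S1 = S[proj >= t0 - t]
    S2 = S[proj <= t0 + t]
    n = len(S)
    # Conditions (i) and (ii) from the text: each piece removes at least an
    # alpha^2-fraction of the points, and the squared sizes do not blow up.
    enough_removed = min(n - len(S1), n - len(S2)) >= alpha**2 * n
    small_blowup = len(S1)**2 + len(S2)**2 <= n**2
    return (S1, S2) if (enough_removed and small_blowup) else None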
3.5 Robust Estimation Using High-Degree Moments

The algorithms presented so far robustly estimate the mean of high-dimensional distributions by leveraging structural information about their covariance matrix. The robust covariance estimation algorithm of Section 3.2 uses structural information about the fourth moment tensor, but also fits in this framework, as it works by robustly estimating the mean of the random variable $X X^T$. It is natural to ask whether (and to what extent) one can exploit structural information about higher-degree moments of the uncorrupted distribution to robustly estimate its parameters.

For the basic case where the uncorrupted distribution is a Gaussian, algorithmically exploiting higher-degree moment information to obtain robust estimators with information-theoretically near-optimal accuracy turns out to be manageable. For a concrete example, we focus here on the problem of robust mean estimation for $\mathcal{N}(\mu, I)$. Recall that the convex programming and filtering algorithms of Section 2 achieve $\ell_2$-error $O(\epsilon \sqrt{\log(1/\epsilon)})$ in the strong contamination model, which is optimal for spherical sub-gaussian distributions but suboptimal (up to the $O(\sqrt{\log(1/\epsilon)})$ multiplicative factor) for spherical Gaussians. The reason for this discrepancy is that the stability condition of Definition 2.1 and the associated filter/convex programming algorithms rely only on the first two moments of the distribution. The work of [26] showed how to leverage higher-order moment information to improve on the $O(\epsilon \sqrt{\log(1/\epsilon)})$ error bound. Specifically, this work gave a generalized filtering algorithm that performs "outlier removal" based on higher-order tensor information (of degree $\Omega(\log^{1/2}(1/\epsilon))$) to robustly estimate the mean of $\mathcal{N}(\mu, I)$ in the strong contamination model within $\ell_2$-error $O(\epsilon)$ in time $O_\epsilon(d^{O(\log^{1/2}(1/\epsilon))})$. (This runtime upper bound is qualitatively matched by an SQ lower bound shown in the same paper; see Section 3.7.) This generalized filtering and its correctness analysis leverage properties of the Gaussian distribution, including a priori knowledge of the higher moments and the concentration of high-degree polynomials.

Subsequently, the work [27] (see also [25] for the case of discrete distributions) used higher-order moments to robustly learn bounded-degree polynomial threshold functions (PTFs) under various distributions. It should be noted that the latter result does not require knowledge of all the higher-degree moments. Specifically, the algorithm of [27] requires appropriate concentration and anti-concentration properties, and (approximate) knowledge of the moments up to degree $2d$, where $d$ is the degree of the underlying PTF.

The more general setting, where we only have upper bounds on the higher-degree moments of the uncorrupted distribution, turns out to be substantially more challenging algorithmically.
In general, upper bounds on the higher moments imply better information-theoretic error upper bounds for robust estimation. For example, for robust mean estimation of distributions with bounded $k$-th central moments, the information-theoretically optimal error is easily seen to be $\Theta(\epsilon^{1-1/k})$. However, it is unclear whether this error bound is attainable algorithmically for $k \ge 4$. Without any additional assumptions on the underlying distribution (beyond the bounded higher moments condition), recent work [43] gave evidence that obtaining error $o(\epsilon^{1/2})$ may be computationally intractable (see Section 3.7).

However, there are circumstances in which higher moment information can be usefully exploited. A number of concurrent works obtained efficient algorithms leveraging higher-degree moments to obtain near-optimal error guarantees [28, 42, 56, 54, 55]. Specifically, the work of [28] gave a higher-moment generalization of the multi-filter technique described in Section 3.4 that leads to a near-optimal error algorithm for list-decodable mean estimation of $\mathcal{N}(\mu, I)$. As an application, [28] obtained an efficient algorithm to learn the parameters of mixtures of spherical Gaussians under near-optimal separation between the components. The works [42, 56, 54, 55] used the Sums-of-Squares meta-algorithm to obtain a number of algorithmic results, including robust mean estimation given a Sums-of-Squares proof certifying bounded central moments [42, 56], learning mixtures of spherical Gaussians [42, 54], and list-decodable mean estimation [54] (under a similar Sums-of-Squares certifiability assumption). More recently, [48, 67] used the Sums-of-Squares method to obtain the first non-trivial algorithms for list-decodable linear regression.

In this section, we describe in more detail two important settings where a fairly sophisticated use of higher-degree moments is required: list-decodable mean estimation (with near-optimal error guarantees) for Gaussian distributions, and robust mean estimation with certifiably bounded central moments. Our presentation mainly focuses on the methodology developed by the authors in [28]. We provide a high-level overview of the Sums-of-Squares approach to these problems and refer the interested reader to the recent survey [66] for a more technical exposition of this approach.

Near-Optimal List-Decodable Gaussian Mean Estimation. We consider the problem of list-decodable mean estimation, assuming the uncorrupted samples are drawn from a spherical Gaussian distribution $\mathcal{N}(\mu, I)$. The techniques we discussed in Section 3.4 show how to compute a list of $O(1/\alpha)$ hypotheses (candidate mean vectors) such that (with high probability over the uncorrupted samples) at least one hypothesis is within $\ell_2$-distance $\widetilde{O}(1/\alpha^{1/2})$ from the true mean $\mu$. We note that this error bound is actually very far from the information-theoretic optimum. In particular, for a point to be a reasonable hypothesis, there must be a cluster consisting of at least an $\alpha$-fraction of the samples that are roughly Gaussian distributed around it. If two such hypotheses are separated by more than a large multiple of $\sqrt{\log(1/\alpha)}$, these clusters cannot overlap on more than an $\alpha$-fraction of their points (by Gaussian tail bounds).
However, by a simple counting argument, this implies that there cannot be more than $O(1/\alpha)$ such hypotheses pairwise separated by $\Omega(\sqrt{\log(1/\alpha)})$. Hence, information-theoretically, there exists a list of $O(1/\alpha)$ many hypotheses such that (with high probability) at least one is within distance $O(\sqrt{\log(1/\alpha)})$ of the true mean. In fact, this upper bound is known to be tight: in [28], it is shown how to construct distributions that are consistent with many plausible true means, each pair separated by $\Omega(\sqrt{\log(1/\alpha)})$.

Unfortunately, while achieving better error is information-theoretically possible, the algorithm discussed in Section 3.4 is not able to achieve it. There are inherent structural reasons for this: the algorithm attempts to find subsets of the samples with small variance, and small variance is not sufficient to imply better than $O(1/\alpha^{1/2})$ error. This is because, if our $\alpha$-fraction of good samples is located at distance $\alpha^{-1/2}$ from the other samples in some direction, the variance in that direction would only be approximately $1$, yet the true mean would differ from the sample mean by about $\alpha^{-1/2}$ in $\ell_2$-norm.

Another way of putting it is that this is a question about concentration. Our existing algorithm manages to produce a set of samples, at least an $\alpha$-fraction of which are good, which also has bounded variance. Since bounded variance ensures some measure of concentration, this guarantees that the mean of the clean samples in this set (which is close to the true mean) cannot be too far from the sample mean of this set. Unfortunately, the bounded variance condition can only take us so far, and in fact is consistent with the mean of the good samples being as far as $\alpha^{-1/2}$ from the sample mean. To improve our error guarantee, we want to find a set of samples with stronger concentration bounds. A natural way to do this is to ensure that our set of samples has bounded higher moments. Specifically, we say that a set of points $S$ has bounded $d$-th central moments if for every unit vector $v$ we have $\mathbb{E}_{x \sim_u S}[|v \cdot (x - \mu)|^d] \le C$, for some constant $C > 0$, where $\mu$ here denotes the mean of $S$. We note that if $S$ has bounded $d$-th central moments, the mean of any $\alpha$-fraction of the points is no more than $(C/\alpha)^{1/d}$ far from the overall mean. Thus, if we could find a set with a large fraction of clean points which also had bounded $d$-th central moments for some $d \ge 2$, we would be able to obtain better error bounds.
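Checking the bounded central moment condition along any single direction is straightforward; as discussed next, the hard part is certifying it over all unit vectors simultaneously. A minimal sketch, with names ours:

import numpy as np

def central_moment(S, v, d):
    # Empirical d-th central moment of the samples S along direction v.
    # This is only a necessary-condition check for a given v, not a
    # certificate over all directions.
    proj = (S - S.mean(axis=0)) @ v
    return np.mean(np.abs(proj) ** d)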
The above paragraph naturally suggests the outline of a better algorithm. We start with some set $S$ of samples. If its $d$-th central moments are small, we return the mean of $S$. Otherwise, we find some direction $v$ in which the central moments are large, project onto that direction, use this information to create a multi-filter, and repeat. Unfortunately, it is not clear how to implement this algorithm efficiently. Recent results suggest that determining whether a generic point set has bounded central moments, for $d > 2$, is computationally intractable [43]. However, we are not dealing with arbitrary point sets. We can take advantage of the fact that Gaussian distributions satisfy stronger conditions on their higher moments than simply bounded central moments, and we can design algorithms that attempt to find sets of samples satisfying these more stringent (and hopefully computationally checkable) conditions instead. There are two different ways to implement this idea: the squares of polynomials method [28] and the Sums-of-Squares method.

For the squares of polynomials method, we note that the $d = 2$ case is easy, as it amounts to finding the maximum value of a quadratic polynomial on a sphere (which can be solved by spectral methods). For $d > 2$, the problem requires optimizing a higher-degree polynomial, which is not as simple. However, if our good samples are Gaussian with a given mean, we know what the expectations of higher-degree polynomials should be, and can thus check for them. In particular, in order to verify that the sample set has bounded $2d$-th central moments, one can check whether there is any degree-$d$ polynomial $p$ for which the average value of $p^2$ is too large; this suffices, since the $2d$-th central moment in direction $v$ corresponds to the choice $p(x) = (v \cdot (x - \mu))^d$. This can be checked by optimizing a quadratic form over polynomials $p$. If such a $p$ is found, we can use it to construct a multi-filter, though the mechanism for doing so is highly non-trivial and invokes many special properties of the Gaussian distribution. The interested reader can find the full details in [28].

The other method makes use of the Sums-of-Squares proof system. In particular, it can be shown that the bounded central moments of a Gaussian are provable in the Sums-of-Squares proof system. Thus, the goal is to find subsets of the samples for which a similar Sums-of-Squares proof allows one to show bounded central moments. These techniques were developed in [42, 55].

For list-decodable mean estimation, both of these methods allow one to learn the mean of a Gaussian to $\ell_2$-error approximately $\alpha^{-1/(2d)}$ by considering the $2d$-th moments. These algorithms have runtime $\mathrm{poly}(n^d)$ and, by making $d$ super-constant, we can in fact obtain poly-logarithmic error in quasi-polynomial time. The Sums-of-Squares method has the advantage that it is much more general, applying not just to Gaussians but to any distribution whose bounded central moments can be certified by Sums-of-Squares proofs of small degree. However, these systems must search for Sums-of-Squares proofs, which requires solving large convex optimization problems, meaning that these algorithms will be slower in practice.

Robust Mean Estimation with Certifiably-Bounded Higher Moments. Another interesting application of the Sums-of-Squares method in this context is in reducing the conditions needed for robust mean estimation. Recall that the definition of stability (Definition 2.1) has two conditions: First, it requires strong concentration of the uncorrupted samples, to ensure that removing any small fraction does not substantially alter the mean (a condition that is information-theoretically necessary in some contexts). Second, to satisfy the definition for $\delta = o(\sqrt{\epsilon})$, the algorithm needs to know the covariance matrix of the good samples. This latter condition is not required for information-theoretic reasons, but for computational ones.
Robust Mean Estimation with Certifiably-Bounded Higher Moments. Another interesting application of the Sums-of-Squares method in this context is in relaxing the conditions needed for robust mean estimation. Recall that the definition of stability (Definition 2.1) has two conditions. First, it requires strong concentration of the uncorrupted samples, to ensure that removing any small fraction does not substantially alter the mean (a condition that is information-theoretically necessary in some contexts). Second, to satisfy the definition for δ = o(√ǫ), the algorithm needs to know the covariance matrix of the good samples. This latter condition is not required for information-theoretic reasons, but for computational ones: the algorithm needs to know the covariance matrix of the clean samples so that it can detect even small increases in this covariance caused by corrupted samples.

The above implies, for example, that if we have a distribution with, say, bounded fourth central moments, the first stability condition holds with δ = O(ǫ^{3/4}), while the second will only hold with δ = Ω(√ǫ), as the actual covariance matrix with no errors might be modestly far from the identity. The first condition implies that, information-theoretically, it should be possible to learn the mean to error O(ǫ^{3/4}) (for example, by looking at a truncated mean in each direction), but the standard filter will not achieve error better than Ω(√ǫ), as an adversary can add errors that keep the covariance matrix bounded by the identity and yet corrupt the sample mean by this much. To circumvent this natural bottleneck, one would similarly need a way to take advantage of higher moments and leverage this stronger concentration. Once again, the Sums-of-Squares method can be used to achieve this under certain assumptions. Roughly speaking, if the uncorrupted distribution has bounded d-th central moments provable by low-degree Sums-of-Squares proofs, then by searching for sets of samples admitting such a proof, one can obtain error O_d(ǫ^{1−1/d}).
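For intuition about where the ǫ^{1−1/d} rate comes from, the following Hölder-type computation (a standard argument, not specific to the Sums-of-Squares machinery) bounds how much an ǫ-fraction of the clean distribution can shift the mean:

```latex
% If the clean distribution satisfies E[ |v.(X - mu)|^d ] <= C for every unit
% vector v, then for any event T with Pr[T] <= eps (e.g., the part of the clean
% distribution that an adversary deletes), Holder's inequality with exponents
% d and d/(d-1) gives
\[
\left| \mathbb{E}\!\left[ \mathbf{1}_T \cdot v \cdot (X - \mu) \right] \right|
  \le \left( \mathbb{E}\!\left[ |v \cdot (X - \mu)|^d \right] \right)^{1/d}
      \Pr[T]^{1 - 1/d}
  \le C^{1/d}\, \epsilon^{1 - 1/d} .
\]
% This is the source of the O_d(eps^{1-1/d}) rate quoted above; for d = 2 it
% recovers the familiar O(sqrt(eps)) error under bounded covariance.
```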
3.6 Fast Algorithms for High-Dimensional Robust Estimation

The main focus of the recent algorithmic work in this field has been on obtaining polynomial-time algorithms for various high-dimensional robust estimation tasks. Once a polynomial-time algorithm for a computational problem is discovered, the natural next step is to develop faster algorithms for the problem, with linear time as the ultimate goal. While recent work has led to polynomial-time robust learners for several fundamental tasks, these algorithms are significantly slower than their non-robust counterparts (e.g., the sample average for mean estimation). This raises the following question: Can we design robust estimators that are as efficient as their non-robust analogues? In addition to its potential practical implications, progress in this direction is of fundamental theoretical interest, as it can elucidate the effect of the robustness requirement on the computational complexity of high-dimensional statistical learning.

The above direction was initiated by [16] in the context of robust mean estimation. More specifically, [16] gave a robust mean estimation algorithm for bounded covariance distributions on R^d that achieves the (minimax optimal) ℓ2-error guarantee of O(√ǫ) and runs in time Õ(nd)/poly(ǫ), where n is the number of samples. That is, the algorithm of [16] has the same (optimal) sample complexity and error guarantee as previous polynomial-time algorithms [21, 22], while running in near-linear time when the fraction of outliers ǫ is a small constant. At the technical level, [16] builds on the convex programming approach of Section 2.3. Subsequent work [20] observed that a simple preprocessing step allows one to reduce to the case where the fraction of corruptions is a small universal constant. As a corollary, it was shown in [20] that a simple modification of the [16] algorithm runs in Õ(nd) time.

More importantly, [20] gave a probabilistic analysis leading to a fast mean estimation algorithm that is simultaneously outlier-robust and achieves sub-gaussian tail bounds. (We note that the question of designing estimators for the mean of heavy-tailed distributions with sub-gaussian rates has attracted substantial interest recently; the interested reader is referred to [41, 19, 62, 63] for recent developments on this topic.) Independently and concurrently to [20], [32] built on the filtering framework of Section 2.4 to give a different Õ(nd)-time robust mean estimation algorithm. Moreover, [32] gave an empirical evaluation demonstrating the practical performance of their algorithm.

Prior to the aforementioned developments, the fastest known runtime for robust mean estimation was Õ(nd²), achieved by the filtering algorithm of Section 2.4. While we did not provide a detailed runtime analysis in Section 2.4, it is not hard to show that each filter iteration can be implemented in Õ(nd) time (using power iteration; see the sketch below) and that the number of iterations can be bounded from above by O(d). While the filtering algorithm has been observed to run very fast in practice, taking at most a small constant number of iterations on real datasets [22, 24], one can construct examples where an ǫ-fraction of outliers forces the algorithm to take Ω(d) iterations. This can be achieved by placing the outliers in Ω(d) orthogonal directions at appropriate distances from the true mean. Conceptually, the bottleneck of the filtering method is that it relies on a certificate (Lemma 2.4) that allows the algorithm to remove outliers from only one direction at each iteration.

A detailed explanation and comparison of the techniques in [16, 20, 32] is beyond the scope of this survey. At a high level, a conceptual commonality of these works is that they leverage techniques from continuous optimization to develop iterative methods (with each iteration taking near-linear time) that are able to deal with multiple directions in parallel. In particular, the total number of iterations in each of these methods is at most poly-logarithmic in d/ǫ.

Beyond robust mean estimation, the work [17] recently studied the problem of robust covariance estimation with a focus on designing faster algorithms. Building on the techniques of [16], they obtained an algorithm for this problem with runtime Õ(d^{3.26}). Rather curiously, this runtime is not linear in the input size, but nearly matches the (best known) runtime of the corresponding non-robust estimator (i.e., computing the empirical covariance). Intriguingly, [17] also provided evidence that the runtime of their algorithm may be best possible with current algorithmic techniques.
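To illustrate the per-iteration runtime claim above, here is a minimal sketch (our own, with simplified placeholder thresholds and trimming rule rather than the exact tail-bound-based rules of Section 2.4) of a single filter iteration implemented with power iteration, using only O(nd)-time matrix-vector products:

```python
# One filter iteration in ~O(nd) time, as described above: the top eigenvector
# of the empirical second-moment matrix is found by power iteration using only
# products X^T (X v), so the d x d covariance matrix is never formed.  The
# variance threshold (1.5) and the 5% trimming rule are simplified placeholders.
import numpy as np

def filter_iteration(X, iters=50, rng=None):
    """One filtering step: find the top-variance direction, drop the worst tail."""
    rng = rng or np.random.default_rng()
    n, d = X.shape
    Xc = X - X.mean(axis=0)
    v = rng.normal(size=d)
    v /= np.linalg.norm(v)
    for _ in range(iters):                      # power iteration: O(nd) per step
        v = Xc.T @ (Xc @ v) / n
        v /= np.linalg.norm(v)
    var = v @ (Xc.T @ (Xc @ v)) / n             # variance in direction v
    if var <= 1.5:                              # placeholder variance certificate
        return X, True                          # bounded variance: accept the mean
    scores = (Xc @ v) ** 2                      # projections onto the bad direction
    keep = scores <= np.quantile(scores, 0.95)  # drop the most extreme 5%
    return X[keep], False

# Usage: iterate until the variance certificate holds, then output the mean.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(950, 20)), rng.normal(8.0, 1.0, size=(50, 20))])
done = False
while not done:
    X, done = filter_iteration(X, rng=rng)
print(X.mean(axis=0)[:3])
```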
3.7 Computational-Statistical Tradeoffs in Robust Estimation

The gold standard in robust estimation is to design computationally efficient algorithms with optimal sample complexities and error guarantees. A conceptual message of the recent body of algorithmic work in this area is that robustness may not pose computational impediments to high-dimensional estimation. Indeed, for a range of fundamental statistical tasks, recent work has developed computationally efficient robust estimators with dimension-independent (and, in some cases, near-optimal) error guarantees.

However, it turns out that, in some settings, robustness does create computational-statistical tradeoffs. Specifically, for several natural high-dimensional robust estimation tasks, we now have compelling evidence that achieving, or even approximating, the information-theoretically optimal error is computationally intractable. Progress in this direction was first made in [26], which used the framework of Statistical Query (SQ) algorithms [50] to establish computational-statistical tradeoffs for a range of robust estimation tasks involving Gaussian distributions. More specifically, it was shown in [26] that even for the basic problems of robust mean and covariance estimation of a high-dimensional Gaussian with contamination in total variation distance, achieving the optimal error requires super-polynomial time. The same work established computational-statistical tradeoffs for the problem of robust sparse mean estimation, even in Huber's contamination model, showing that efficient algorithms require quadratically more samples than the information-theoretic minimum. Interestingly, both of these SQ lower bounds are matched by the performance of recently developed robust learning algorithms [21, 2].

Motivated by this progress, [43] made a first step towards proving computational lower bounds for robust mean estimation based on worst-case hardness assumptions. In particular, this work established that current algorithmic techniques for robust mean estimation may not be improvable in terms of their error guarantees, in the sense that they run up against a well-known computational barrier: the so-called small set expansion hypothesis (SSE), closely related to the unique games conjecture (UGC). More recently, [12] proposed a k-partite variant of the planted clique problem [46] and gave a reduction, inspired by the SQ lower bound of [26], implying (subject only to the hardness of the proposed problem) a statistical-computational gap for robust sparse mean estimation. An interesting open problem is to establish compelling evidence (e.g., in the form of a Sums-of-Squares lower bound) of the average-case hardness of the proposed k-partite variant of planted clique. In the following paragraphs, we provide a more detailed description of [26, 43].

Statistical Query Lower Bounds. For the statistical estimation tasks studied in this article, the input is a set of samples drawn from a probability distribution of interest. Statistical Query (SQ) algorithms [50] are a restricted class of algorithms that are only allowed to (adaptively) query expectations of bounded functions of the distribution, rather than directly access samples. In particular, if f is any bounded function, one may approximate E[f(X)] by taking samples from the distribution X: with O(1/τ²) samples, one can obtain the correct answer within additive accuracy τ with high probability. By doing this for several different functions f, perhaps chosen adaptively, one can try to learn properties of the underlying distribution X. In the SQ model, the algorithm instead asks queries of an oracle; each query consists of a function f with range contained in [−1, 1] and a desired accuracy τ. The oracle then returns E[f(X)] to accuracy τ, and the algorithm gets to (adaptively) choose another f, up to Q times. Roughly speaking, an SQ algorithm with accuracy τ and Q queries corresponds to an actual algorithm using O(1/τ²) samples and O(Q) time.
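The following toy simulation (our own illustration; the class name and constants are hypothetical) makes the sample-based correspondence concrete by answering each SQ query from O(1/τ²) fresh samples:

```python
# Simulating an SQ oracle from samples, as described above: a query f with
# range in [-1, 1] and accuracy tau is answered using O(1/tau^2) fresh samples,
# since by Hoeffding's inequality (for values in [-1, 1]) the empirical mean of
# f is within tau of E[f(X)] with high probability.
import numpy as np

class SimulatedSQOracle:
    def __init__(self, sampler, failure_prob=1e-3, rng=None):
        self.sampler = sampler            # draws n i.i.d. samples from X
        self.delta = failure_prob
        self.rng = rng or np.random.default_rng()

    def query(self, f, tau):
        """Return E[f(X)] to additive accuracy tau (w.p. at least 1 - delta)."""
        # Hoeffding for [-1, 1]-valued f: n >= (2/tau^2) * ln(2/delta) suffices.
        n = int(np.ceil(2.0 * np.log(2.0 / self.delta) / tau**2))
        vals = f(self.sampler(n, self.rng))
        assert np.all(np.abs(vals) <= 1.0), "SQ queries must have range in [-1, 1]"
        return float(vals.mean())

# Usage: estimate E[clip(x_1, -1, 1)] for X = N((0.3, -0.1), I) via a bounded query.
sampler = lambda n, rng: rng.normal([0.3, -0.1], 1.0, size=(n, 2))
oracle = SimulatedSQOracle(sampler)
print(oracle.query(lambda x: np.clip(x[:, 0], -1.0, 1.0), tau=0.05))
```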
The Statistical Query (SQ) model is actually quite powerful: a wide range of known algorithmic techniques in machine learning are implementable using SQs. These include spectral techniques, moment and tensor methods, local search (e.g., Expectation Maximization), and many others (see, e.g., [34]). In fact, nearly every known statistical algorithm with provable performance guarantees (with a small number of exceptions) can be simulated with a small loss of efficiency in the SQ model. This makes the SQ model very useful for proving lower bounds, as a lower bound in this model applies to a broad family of algorithms. It is easy to see that one can estimate moments and approximate medians in the SQ model. In fact, the various filter algorithms described in this article can be implemented in the SQ framework without difficulty. Indeed, moment computations allow one to estimate the covariance matrix. If large eigenvalues are found, further measurements can approximate the cumulative distribution function of the projection and decide on a filter. From then on, measurements can be made conditional on passing the filter (by measuring f(x) times the indicator function of x passing the filter).

A recent line of work [34, 36, 35, 33] developed a framework of SQ algorithms for search problems over distributions. One can prove unconditional lower bounds on the computational complexity of SQ algorithms via the notion of Statistical Query dimension. This complexity measure was introduced in [11] for PAC learning of Boolean functions and was recently generalized to the unsupervised setting [34, 33]. A lower bound on the SQ dimension of a learning problem provides an unconditional lower bound on the computational complexity of any SQ algorithm for the problem. Suppose we want to estimate the parameters of an unknown distribution X that belongs to a known family D. Roughly speaking, the aforementioned work has shown that if there are many possible distributions X ∈ D whose density functions are pairwise nearly orthogonal with respect to an appropriate inner product, then any SQ algorithm with insufficient accuracy will require many queries to determine which of these distributions it is sampling from.

The work of [26] gives an SQ lower bound for the statistical task of non-Gaussian component analysis (NGCA) [10]. Intuitively, this is the problem of finding a non-Gaussian direction in a high-dimensional dataset. In more detail, let A be the pdf of a univariate distribution with the property that its first m moments match the corresponding moments of the standard univariate Gaussian N(0, 1). For a unit vector v ∈ R^d, let P_v be the pdf of the distribution on R^d defined as follows: the projection of P_v in the v-direction is equal to A, and P_v is a standard Gaussian in the orthogonal complement v⊥. Given sample access to a distribution P_{v*}, for some unknown direction (unit vector) v* ∈ R^d, the goal of NGCA is to approximate the hidden vector v* (a sampler for this construction is sketched below).
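For concreteness, here is a toy sampler (our own illustration) for the NGCA construction P_v just defined. The Rademacher choice of A below matches only the first three moments of N(0, 1), whereas the hard instances of [26] use distributions matching many more moments:

```python
# A toy sampler for P_v: the marginal along the hidden direction v follows a
# non-Gaussian law A, and the distribution is standard Gaussian on the
# orthogonal complement of v.  Here A is the Rademacher (+/-1) distribution,
# which matches the first three moments of N(0, 1) but not the fourth.
import numpy as np

def sample_ngca(n, v, rng):
    """Draw n samples from P_v: law A along v, N(0, I) on the complement of v."""
    v = v / np.linalg.norm(v)
    d = v.shape[0]
    g = rng.normal(size=(n, d))            # standard Gaussian in R^d
    a = rng.choice([-1.0, 1.0], size=n)    # the non-Gaussian component A
    # Replace the v-component of g with the A-component: x = (I - v v^T) g + a v.
    return g - np.outer(g @ v, v) + np.outer(a, v)

rng = np.random.default_rng(2)
v_star = np.ones(10) / np.sqrt(10)
X = sample_ngca(5000, v_star, rng)
# Sanity check: the 4th moment along v_star is ~1 (Rademacher), while along a
# generic coordinate direction it is roughly 3 (nearly Gaussian).
print(((X @ v_star) ** 4).mean(), ((X @ np.eye(10)[0]) ** 4).mean())
```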
It is shown in [26] that any SQ algorithm that approximates v* requires either queries of accuracy d^{−Ω(m)} or exponentially many (namely 2^{d^{Ω(1)}}) oracle queries. This SQ lower bound can be used in an essentially black-box manner to obtain nearly tight SQ lower bounds for a range of high-dimensional estimation tasks involving Gaussian distributions, including learning mixtures of Gaussians, robust mean and covariance estimation, and robust sparse mean estimation. At a high level, this is achieved by constructing instances of these problems that amount to an NGCA instance for an appropriate number m of matching moments. The main idea is to add noise to the distribution in question so that the noisy distribution is of the form P_{v*} for some moment-matching distribution A. Using these techniques, it was shown that super-polynomial complexity (in terms of either the number of queries or the accuracy) is required to learn the mean of an ǫ-corrupted Gaussian to ℓ2-error o(ǫ√log(1/ǫ)) or to learn its covariance to Frobenius error o(ǫ log(1/ǫ)), both of which are tight [21]. It was also shown that robustly learning the mean of a k-sparse Gaussian to constant ℓ2-error requires either super-polynomially many queries or queries with accuracy O(max(1/k, 1/√d)), which morally corresponds to taking at least Ω(min{d, k²}) samples, nearly matching the sample complexity of known algorithms [2].

Reductions from Worst-Case Hard Problems. Proving computational lower bounds via reductions from worst-case hard problems has proved to be a rather challenging goal for statistical estimation tasks. Some recent progress was made on this front in the context of robust mean estimation. Specifically, the work [43] established computational lower bounds against algorithmic techniques operating along the same lines as existing ones. In particular, existing algorithms for robust mean estimation depend on being able to find a computationally verifiable certificate that (under appropriate conditions) implies that the sample mean is close to the true mean (analogous to Lemma 2.5). It is shown in [43] that, in some cases, finding natural certificates may be computationally intractable. As a specific example, consider the class of distributions on R^d with bounded fourth central moments. For such distributions, it is not hard to show that the trimmed mean correctly estimates the mean of any one-dimensional projection within error O(ǫ^{3/4}) (see the sketch below), showing that this error rate is information-theoretically optimal within constant factors. However, for an algorithm to achieve this using techniques like those already known, it would need a way to certify whether or not the sample mean of a given point set is close to the true mean. A natural way to do this would involve verifying whether the point set itself has bounded fourth central moments. However, [43] shows that, subject to the Small-Set Expansion Hypothesis, this is computationally intractable. In fact, the certification problem remains intractable even if one only needs to distinguish between a distribution that has many bounded central moments and one that lacks bounded fourth central moments. While this hardness result is hardly definitive (as it leaves room for a variety of different kinds of algorithms), it rules out some of the most natural approaches for extending existing techniques.
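For reference, here is a minimal sketch of the univariate trimmed mean just mentioned; the symmetric ǫ-trimming rule below is a simplified choice rather than the tuned rule from the literature:

```python
# The one-dimensional trimmed mean referenced above: discard the most extreme
# ~eps-tails of the projected points and average the rest.  Under a fourth-
# moment bound, the trimming threshold scales like eps^{-1/4} (Markov on the
# 4th moment), so removing an eps-fraction shifts the clean mean by at most
# about eps * eps^{-1/4} = eps^{3/4}.
import numpy as np

def trimmed_mean(y, eps):
    """Average of y after discarding the bottom and top eps-fraction of values."""
    lo, hi = np.quantile(y, [eps, 1.0 - eps])
    kept = y[(y >= lo) & (y <= hi)]
    return kept.mean()

# Usage: an eps-fraction of gross outliers barely moves the trimmed mean.
rng = np.random.default_rng(3)
eps = 0.05
y = np.concatenate([rng.standard_t(df=5, size=1900),  # finite 4th moment, mean 0
                    np.full(100, 50.0)])               # eps-fraction of outliers
print(np.mean(y), trimmed_mean(y, eps))                # naive vs trimmed
```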
4 Conclusions

In this article, we gave an overview of the recent developments on algorithmic aspects of high-dimensional robust statistics. While substantial progress has been made in this field during the past few years, the results obtained so far merely scratch the surface of the possibilities ahead. A major goal going forward is to develop a general algorithmic theory of robust estimation. This involves (1) developing novel algorithmic techniques that lead to efficient robust estimators for more general probabilistic models and estimation tasks; (2) obtaining a deeper understanding of the computational limits of robust estimation; (3) developing mathematical connections to related areas, including non-convex optimization and privacy; and (4) exploring applications of algorithmic robust statistics to exploratory data analysis, safe machine learning, and deep learning.

One of the main conceptual contributions of the classical theory of robust statistics has been to challenge traditional statistical assumptions about the data generating process, thereby enabling the design of methods that are stable to deviations from these assumptions. The precise form of such deviations depends on the setting and can give rise to various models of robustness. We believe that a central objective in a modern theory of robustness is to rethink old models and develop new ones that enable the design of new practical algorithms with provable guarantees.

Acknowledgments

We thank Alistair Stewart for extensive collaboration in this area. We thank Tim Roughgarden, Ankur Moitra, and Greg Valiant for useful comments on a shorter version of this article.

References

[1] P. Awasthi, M. F. Balcan, and P. M. Long. The power of localization for efficiently learning linear separators with noise. J. ACM, 63(6):50:1–50:27, 2017.

[2] S. Balakrishnan, S. S. Du, J. Li, and A. Singh. Computationally efficient robust sparse estimation in high dimensions. In Proc. 30th Annual Conference on Learning Theory, pages 169–212, 2017.

[3] M. F. Balcan, A. Blum, and S. Vempala. A discriminative framework for clustering via similarity functions. In STOC, pages 671–680, 2008.

[4] M. Barreno, B. Nelson, A. D. Joseph, and J. D. Tygar. The security of machine learning. Machine Learning, 81(2):121–148, 2010.

[5] T. Bernholt. Robust estimators are hard to compute. Technical report, University of Dortmund, Germany, 2006.

[6] Q. Berthet and P. Rigollet. Complexity theoretic lower bounds for sparse principal component detection. In COLT 2013 - The 26th Annual Conference on Learning Theory, pages 1046–1066, 2013.

[7] K. Bhatia, P. Jain, P. Kamalaruban, and P. Kar. Consistent robust regression. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, pages 2107–2116, 2017.

[8] K. Bhatia, P. Jain, and P. Kar. Robust regression via hard thresholding. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, pages 721–729, 2015.

[9] B. Biggio, B. Nelson, and P. Laskov. Poisoning attacks against support vector machines. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, 2012.
[10] G. Blanchard, M. Kawanabe, M. Sugiyama, V. G. Spokoiny, and K.-R. Müller. In search of non-Gaussian components of a high-dimensional distribution. J. Mach. Learn. Res., 7:247–282, 2006.

[11] A. Blum, M. Furst, J. Jackson, M. Kearns, Y. Mansour, and S. Rudich. Weakly learning DNF and characterizing statistical query learning using Fourier analysis. In Proceedings of the Twenty-Sixth Annual Symposium on Theory of Computing, pages 253–262, 1994.

[12] M. Brennan and G. Bresler. Average-case lower bounds for learning sparse mixtures, robust estimation and semirandom adversaries. CoRR, abs/1908.06130, 2019.

[13] N. Bshouty, N. Eiron, and E. Kushilevitz. PAC learning with nasty noise. Theoretical Computer Science, 288(2):255–275, 2002.

[14] M. Charikar, J. Steinhardt, and G. Valiant. Learning from untrusted data. In Proc. 49th Annual ACM Symposium on Theory of Computing, pages 47–60, 2017.

[15] M. Chen, C. Gao, and Z. Ren. Robust covariance and scatter matrix estimation under Huber's contamination model. Ann. Statist., 46(5):1932–1960, 2018.

[16] Y. Cheng, I. Diakonikolas, and R. Ge. High-dimensional robust mean estimation in nearly-linear time. CoRR, abs/1811.09380, 2018. Conference version in SODA 2019, pages 2755–2771.

[17] Y. Cheng, I. Diakonikolas, R. Ge, and D. P. Woodruff. Faster algorithms for high-dimensional robust covariance estimation. In Conference on Learning Theory, COLT 2019, pages 727–757, 2019.

[18] Y. Cheng, I. Diakonikolas, D. M. Kane, and A. Stewart. Robust learning of fixed-structure Bayesian networks. In Proc. 33rd Annual Conference on Neural Information Processing Systems (NeurIPS), pages 10304–10316, 2018.

[19] Y. Cherapanamjeri, N. Flammarion, and P. L. Bartlett. Fast mean estimation with sub-gaussian rates. In Conference on Learning Theory, COLT 2019, pages 786–806, 2019.

[20] J. Depersin and G. Lecue. Robust subgaussian estimation of a mean vector in nearly linear time. CoRR, abs/1906.03058, 2019.

[21] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart. Robust estimators in high dimensions without the computational intractability. In Proc. 57th IEEE Symposium on Foundations of Computer Science (FOCS), pages 655–664, 2016.

[22] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart. Being robust (in high dimensions) can be practical. In Proc. 34th International Conference on Machine Learning (ICML), pages 999–1008, 2017.

[23] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart. Robustly learning a Gaussian: Getting optimal error, efficiently. In Proc. 29th Annual Symposium on Discrete Algorithms, pages 2683–2702, 2018.

[24] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, J. Steinhardt, and A. Stewart. Sever: A robust meta-algorithm for stochastic optimization. CoRR, abs/1803.02815, 2018. Conference version in ICML 2019.

[25] I. Diakonikolas and D. M. Kane. Degree-d Chow parameters robustly determine degree-d PTFs (and algorithmic applications). In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, pages 804–815, 2019. Full version available at https://arxiv.org/abs/1811.03491.
[26] I. Diakonikolas, D. M. Kane, and A. Stewart. Statistical query lower bounds for robust estimation of high-dimensional Gaussians and Gaussian mixtures. In Proc. 58th IEEE Symposium on Foundations of Computer Science (FOCS), pages 73–84, 2017.

[27] I. Diakonikolas, D. M. Kane, and A. Stewart. Learning geometric concepts with nasty noise. In Proc. 50th Annual ACM Symposium on Theory of Computing (STOC), pages 1061–1073, 2018.

[28] I. Diakonikolas, D. M. Kane, and A. Stewart. List-decodable robust mean estimation and learning mixtures of spherical Gaussians. In Proc. 50th Annual ACM Symposium on Theory of Computing (STOC), pages 1047–1060, 2018.

[29] I. Diakonikolas and D. M. Kane. Recent advances in high-dimensional robust statistics. STOC 2019 Tutorial. Slides available at http://www.iliasdiakonikolas.org/stoc-robust-tutorial.html, 2019.

[30] I. Diakonikolas, S. Karmalkar, D. Kane, E. Price, and A. Stewart. Outlier-robust high-dimensional sparse estimation via iterative filtering. In Advances in Neural Information Processing Systems 33, NeurIPS 2019, 2019.

[31] I. Diakonikolas, W. Kong, and A. Stewart. Efficient algorithms and lower bounds for robust linear regression. In Proc. 30th Annual Symposium on Discrete Algorithms (SODA), pages 2745–2754, 2019.

[32] Y. Dong, S. B. Hopkins, and J. Li. Quantum entropy scoring for fast robust mean estimation and improved outlier detection. CoRR, abs/1906.11366, 2019. Conference version in NeurIPS 2019.

[33] V. Feldman. A general characterization of the statistical query complexity. CoRR, abs/1608.02198, 2016.

[34] V. Feldman, E. Grigorescu, L. Reyzin, S. Vempala, and Y. Xiao. Statistical algorithms and a lower bound for detecting planted cliques. In Proceedings of STOC'13, pages 655–664, 2013.

[35] V. Feldman, C. Guzman, and S. Vempala. Statistical query algorithms for stochastic convex optimization. CoRR, abs/1512.09170, 2015.

[36] V. Feldman, W. Perkins, and S. Vempala. On the complexity of random satisfiability problems with planted solutions. In Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, STOC 2015, pages 77–86, 2015.

[37] F. R. Hampel. A general qualitative definition of robustness. Ann. Math. Statist., 42(6):1887–1896, 1971.

[38] F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw, and W. A. Stahel. Robust Statistics: The Approach Based on Influence Functions. Wiley, New York, 1986.

[39] M. Hardt and A. Moitra. Algorithms and hardness for robust subspace recovery. In Proc. 26th Annual Conference on Learning Theory (COLT), pages 354–375, 2013.

[40] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100:78–150, 1992.

[41] S. B. Hopkins. Sub-gaussian mean estimation in polynomial time. CoRR, abs/1809.07425, 2018.

[42] S. B. Hopkins and J. Li. Mixture models, robustness, and sum of squares proofs. In Proc. 50th Annual ACM Symposium on Theory of Computing (STOC), pages 1021–1034, 2018.

[43] S. B. Hopkins and J. Li. How hard is robust mean estimation? In Conference on Learning Theory, COLT 2019, pages 1649–1682, 2019.

[44] P. J. Huber. Robust estimation of a location parameter. Ann. Math. Statist., 35(1):73–101, 1964.
[45] P. J. Huber and E. M. Ronchetti. Robust Statistics. Wiley, New York, 2009.

[46] M. Jerrum. Large cliques elude the Metropolis process. Random Struct. Algorithms, 3(4):347–360, 1992.

[47] D. S. Johnson and F. P. Preparata. The densest hemisphere problem. Theoretical Computer Science, 6:93–107, 1978.

[48] S. Karmalkar, A. R. Klivans, and P. K. Kothari. List-decodable linear regression. CoRR, abs/1905.05679, 2019. To appear in NeurIPS'19.

[49] M. Kearns, R. Schapire, and L. Sellie. Toward efficient agnostic learning. Machine Learning, 17(2/3):115–141, 1994.

[50] M. J. Kearns. Efficient noise-tolerant learning from statistical queries. In Proc. 25th Annual ACM Symposium on Theory of Computing (STOC), pages 392–401. ACM Press, 1993.

[51] M. J. Kearns and M. Li. Learning in the presence of malicious errors. SIAM Journal on Computing, 22(4):807–837, 1993.

[52] A. Klivans, P. Kothari, and R. Meka. Efficient algorithms for outlier-robust regression. In Proc. 31st Annual Conference on Learning Theory (COLT), pages 1420–1430, 2018.

[53] A. Klivans, P. Long, and R. Servedio. Learning halfspaces with malicious noise. Journal of Machine Learning Research, 10:2715–2740, 2009.

[54] P. K. Kothari and J. Steinhardt. Better agnostic clustering via relaxed tensor norms. CoRR, abs/1711.07465, 2017.

[55] P. K. Kothari, J. Steinhardt, and D. Steurer. Robust moment estimation and improved clustering via sum of squares. In Proc. 50th Annual ACM Symposium on Theory of Computing (STOC), pages 1035–1046, 2018.

[56] P. K. Kothari and D. Steurer. Outlier-robust moment-estimation via sum-of-squares. CoRR, abs/1711.11581, 2017.

[57] K. A. Lai, A. B. Rao, and S. Vempala. Agnostic estimation of mean and covariance. In Proc. 57th IEEE Symposium on Foundations of Computer Science (FOCS), pages 665–674, 2016.

[58] J. Z. Li, D. M. Absher, H. Tang, A. M. Southwick, A. M. Casto, S. Ramachandran, H. M. Cann, G. S. Barsh, M. Feldman, L. L. Cavalli-Sforza, and R. M. Myers. Worldwide human relationships inferred from genome-wide patterns of variation. Science, 319:1100–1104, 2008.

[59] H. Liu and C. Gao. Density estimation with contamination: minimax rates and theory of adaptation. Electron. J. Statist., 13(2):3613–3653, 2019.

[60] L. Liu, T. Li, and C. Caramanis. High dimensional robust estimation of sparse models via trimmed hard thresholding. CoRR, abs/1901.08237, 2019.

[61] L. Liu, Y. Shen, T. Li, and C. Caramanis. High dimensional robust sparse regression. CoRR, abs/1805.11643, 2018.

[62] G. Lugosi and S. Mendelson. Mean estimation and regression under heavy-tailed distributions: A survey. Foundations of Computational Mathematics, 19(5):1145–1190, 2019.

[63] G. Lugosi and S. Mendelson. Robust multivariate mean estimation: the optimality of trimmed mean. CoRR, abs/1907.11391, 2019.

[64] P. Paschou, J. Lewis, A. Javed, and P. Drineas. Ancestry informative markers for fine-scale individual assignment to worldwide populations. Journal of Medical Genetics, 47:835–847, 2010.

[65] A. Prasad, A. S. Suggala, S. Balakrishnan, and P. Ravikumar. Robust estimation via robust gradient estimation. arXiv preprint arXiv:1802.06485, 2018.

[66] P. Raghavendra, T. Schramm, and D. Steurer. High-dimensional estimation via sum-of-squares proofs. CoRR, abs/1807.11419, 2018.
[67] P. Raghavendra and M. Yau. List decodable learning via sum of squares. CoRR, abs/1905.04660, 2019. To appear in SODA'20.

[68] N. Rosenberg, J. Pritchard, J. Weber, H. Cann, K. Kidd, L. A. Zhivotovsky, and M. W. Feldman. Genetic structure of human populations. Science, 298:2381–2385, 2002.

[69] J. Steinhardt, M. Charikar, and G. Valiant. Resilience: A criterion for learning in the presence of arbitrary outliers. In Proc. 9th Innovations in Theoretical Computer Science Conference (ITCS), pages 45:1–45:21, 2018.

[70] J. Steinhardt, P. Wei Koh, and P. S. Liang. Certified defenses for data poisoning attacks. In Advances in Neural Information Processing Systems 30, pages 3520–3532, 2017.

[71] A. S. Suggala, K. Bhatia, P. Ravikumar, and P. Jain. Adaptive hard thresholding for near-optimal consistent robust regression. In Conference on Learning Theory, COLT 2019, pages 2892–2897, 2019.

[72] B. Tran, J. Li, and A. Madry. Spectral signatures in backdoor attacks. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, pages 8011–8021, 2018.

[73] J. W. Tukey. A survey of sampling from contaminated distributions. Contributions to Probability and Statistics, 2:448–485, 1960.

[74] J. W. Tukey. Mathematics and picturing of data. In Proceedings of ICM, volume 6, pages 523–531, 1975.

[75] L. Valiant. Learning disjunctions of conjunctions. In Proc. 9th IJCAI, pages 560–566, 1985.

[76] B. Zhu, J. Jiao, and J. Steinhardt. Generalized resilience and robust statistics. CoRR, abs/1909.08755, 2019.