Recent Advances in Algorithmic High-Dimensional Robust Statistics∗

Ilias Diakonikolas†
University of Wisconsin-Madison
ilias@cs.wisc.edu

Daniel M. Kane‡
University of California, San Diego
dakane@cs.ucsd.edu

November 15, 2019

Abstract

Learning in the presence of outliers is a fundamental problem in statistics. Until recently, all known efficient unsupervised learning algorithms were very sensitive to outliers in high dimensions. In particular, even for the task of robust mean estimation under natural distributional assumptions, no efficient algorithm was known. Recent work in theoretical computer science gave the first efficient robust estimators for a number of fundamental statistical tasks, including mean and covariance estimation. Since then, there has been a flurry of research activity on algorithmic high-dimensional robust estimation in a range of settings. In this survey article, we introduce the core ideas and algorithmic techniques in the emerging area of algorithmic high-dimensional robust statistics with a focus on robust mean estimation. We also provide an overview of the approaches that have led to computationally efficient robust estimators for a range of broader statistical tasks and discuss new directions and opportunities for future work.

∗This article is an expanded version of an invited chapter entitled "Robust High-Dimensional Statistics" in the book "Beyond the Worst-Case Analysis of Algorithms" to be published by Cambridge University Press.
†Supported by NSF Award CCF-1652862 (CAREER) and a Sloan Research Fellowship.
‡Supported by NSF Award CCF-1553288 (CAREER) and a Sloan Research Fellowship.

1 Introduction

1.1 Background

Consider the following basic statistical task: Given n independent samples from an unknown-mean spherical Gaussian distribution N(µ, I) on R^d, estimate its mean vector µ within small ℓ2-norm. It is not hard to see that the empirical mean has ℓ2-error at most O(√(d/n)) from µ with high probability. Moreover, this error upper bound is best possible among all n-sample estimators. The Achilles heel of the empirical estimator is that it crucially relies on the assumption that the observations were generated by a spherical Gaussian. The existence of even a single outlier can arbitrarily compromise this estimator's performance. However, the Gaussian assumption is only ever approximately valid, as real datasets are typically exposed to some source of contamination. Hence, any estimator that is to be used in practice must be robust in the presence of outliers.

Learning in the presence of outliers is an important goal in statistics and has been studied in the robust statistics community since the 1960s [73, 44] (see [38, 45] for introductory statistical textbooks on the topic). In recent years, the problem of designing robust and computationally efficient estimators for various high-dimensional statistical tasks has become a pressing challenge in a number of data analysis applications. These include the analysis of biological datasets, where natural outliers are common [68, 64, 58] and can contaminate the downstream statistical analysis, and data poisoning attacks in machine learning [4], where even a small fraction of fake data (outliers) can substantially degrade the quality of the learned model [9, 70].
Classical work in robust statistics pinned down the minimax risk of high-dimensional robust estimation in several basic settings of interest. In contrast, until very recently, even the most basic computational questions in this field were poorly understood. For example, the Tukey median [74] is a sample-efficient robust mean estimator for spherical Gaussian distributions. However, it is NP-hard to compute in general [47], and the heuristics proposed to approximate it degrade in the quality of their approximation as the dimension scales. Similar hardness results have been shown [5, 39] for essentially all known classical estimators in robust statistics.

Until recently, all known computationally efficient estimators could only tolerate a negligible fraction of outliers in high dimensions, even for the basic task of mean estimation. Recent work by Diakonikolas, Kamath, Kane, Li, Moitra, and Stewart [21], and by Lai, Rao, and Vempala [57], gave the first efficient robust estimators for various high-dimensional unsupervised learning tasks, including mean and covariance estimation. Specifically, [21] obtained the first polynomial-time robust estimators with dimension-independent error guarantees, i.e., with error scaling only with the fraction of corrupted samples and not with the dimensionality of the data. Since the dissemination of these works [21, 57], there has been significant research activity on designing computationally efficient robust estimators in a variety of settings.

1.2 Contamination Model

Throughout this article, we focus on the following model of data contamination, which generalizes several other existing models:

Definition 1.1 (Strong Contamination Model). Given a parameter 0 < ε < 1/2 and a distribution family D on R^d, the adversary operates as follows: The algorithm specifies a number of samples n, and n samples are drawn from some unknown X ∈ D. The adversary is allowed to inspect the samples, remove up to εn of them, and replace them with arbitrary points. This modified set of n points is then given as input to the algorithm. We say that a set of samples is ε-corrupted if it is generated by the above process.

The contamination model of Definition 1.1 can be viewed as a semi-random input model: First, nature draws a set S of i.i.d. samples from a statistical model of interest, and then an adversary is allowed to change the set S in a bounded way to obtain an ε-corrupted set T. The parameter ε is the proportion of contamination and quantifies the power of the adversary. Intuitively, among our samples, an unknown (1 − ε)-fraction are generated from a distribution of interest and are called inliers, and the rest are called outliers.
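To fix ideas, the following sketch (ours) simulates Definition 1.1 with a spherical Gaussian inlier distribution. The particular adversary, which plants all of its points at one location, is a hypothetical choice for illustration; the model allows any replacement rule.

import numpy as np

def eps_corrupt(samples, eps, adversary):
    """Return an ε-corrupted copy of `samples` (Definition 1.1).

    The adversary inspects the clean samples, removes ⌊εn⌋ of them,
    and replaces them with arbitrary points of its choosing.
    """
    n = len(samples)
    k = int(eps * n)
    corrupted = samples.copy()
    corrupted[:k] = adversary(samples, k)  # replace k points arbitrarily
    return corrupted

# One (hypothetical) adversary: place all outliers at a single point.
def cluster_adversary(samples, k):
    d = samples.shape[1]
    spike = np.full(d, 3.0)                # arbitrary location
    return np.tile(spike, (k, 1))

rng = np.random.default_rng(0)
n, d, eps = 10_000, 100, 0.1
clean = rng.normal(size=(n, d))            # inliers from N(0, I)
T = eps_corrupt(clean, eps, cluster_adversary)
print(np.linalg.norm(clean.mean(axis=0)))  # ≈ sqrt(d/n), small
print(np.linalg.norm(T.mean(axis=0)))      # shifted by ≈ ε·‖spike‖₂ = 3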
One can consider less powerful adversaries, giving rise to weaker contamination models. An adversary may be (i) adaptive or oblivious to the inliers, and (ii) only allowed to add outliers, only allowed to remove inliers, or allowed to do both. We provide some examples of natural and well-studied such models in the following paragraphs.

In Huber's contamination model [44], the adversary is oblivious to the inliers and is only allowed to add outliers. More specifically, in Huber's model, the adversary generates samples from a mixture distribution P of the form P = (1 − ε)X + εN, where X ∈ D is the unknown target distribution and N is an adversarially chosen noise distribution. Another natural contamination model, very common in theoretical computer science, allows the adversary to perturb the distribution X by at most ε in total variation distance, i.e., the adversary generates samples from a distribution Y that satisfies d_TV(Y, X) ≤ ε. Intuitively, the adversary in this model is oblivious to the inliers and is allowed to both add outliers and remove inliers. This model is very similar to a contamination model proposed by Hampel [37]. We note that contamination in total variation distance is strictly stronger than Huber's model. More broadly, one can naturally modify this model to study model misspecification with respect to different loss functions (see, e.g., [76]).

In computational learning theory, the contamination model of Definition 1.1 is related to the agnostic model [40, 49], where the goal is to learn a labeling function whose agreement with some underlying target function is close to the best possible among all functions in some given class. An important difference with our setting is that the agnostic model requires that we fit all the data, while in our robust setting we only want to fit the uncorrupted data. The strong contamination model can be viewed as the unsupervised analogue of the challenging nasty noise model [13] (itself a strengthening of the malicious model [75, 51]). In the nasty model, an adversary is allowed to corrupt an ε-fraction of both the labels and the samples, and the goal of the learning algorithm is to output a hypothesis with small misclassification error with respect to the clean data-generating distribution.

In robust mean estimation, given an ε-corrupted set of samples from a well-behaved distribution (e.g., N(µ, I)), we want to output a vector µ̂ that closely approximates the unknown mean vector µ. A natural choice of metric between the means for identity covariance distributions is the ℓ2-error ‖µ̂ − µ‖2, and in this article we focus on designing robust estimators minimizing ‖µ̂ − µ‖2. We note, however, that the same algorithms typically lead to small Mahalanobis distance, i.e., ‖µ̂ − µ‖_Σ = ((µ̂ − µ)^T Σ^{−1} (µ̂ − µ))^{1/2}, when the underlying distribution has covariance Σ.

One can analogously define robust estimation for other parameters of high-dimensional distributions (e.g., the covariance matrix and higher-order moment tensors) with respect to natural loss functions (e.g., Frobenius norm, spectral norm). A more general statistical task is that of robust density estimation: Given an ε-corrupted set of samples from an unknown distribution X ∈ D, output a hypothesis distribution H (not necessarily in D) such that the total variation distance d_TV(H, X) is minimized. We note that robust density estimation and robust parameter estimation are closely related to each other. For many natural parametric distributions, the latter can be reduced to the former for an appropriate choice of metric between the parameters.
In all these settings, the goal is to design computationally efficient learning algorithms that achieve dimension-independent error, i.e., error that scales only with the contamination parameter ε, but not with the dimension d. The information-theoretic limits of robust estimation depend on our assumptions about the distribution family D. In the following subsection, we provide the basic relevant background.

1.3 Basic Background: Sample-Efficient Robust Estimation

Before we proceed with presenting computationally efficient robust estimators in the next sections, we provide a few basic facts on the information-theoretic limits of robust estimation. For concreteness, we focus here on robust mean estimation. We note that similar arguments can be applied to various other parameter estimation tasks. The interested reader is referred to [21] and to [15, 59] (and references therein) for recent information-theoretic work from the statistics community.

We first note that some assumptions on the underlying distribution family D are necessary for robust mean estimation to be information-theoretically possible. Consider, for example, the family D = {D_x : x ∈ R}, where D_x is a probability distribution on the real line with a single point x ∈ R having positive mass Pr[D_x = x] = ε > 0 and such that E[D_x] = x. While estimating the mean of an arbitrary distribution in D is straightforward without corruptions (by taking samples until we see a sample twice, which must be the true mean), an adversary can erase all information about the mean in an ε-corrupted sample from D_x. Indeed, an adversary can delete the samples at x and move them to an arbitrary location, thereby arbitrarily corrupting the sample mean.

Typical assumptions on the family D are either parametric (e.g., D could be the family of all Gaussian distributions), or are defined by concentration properties (e.g., each distribution in D satisfies sub-gaussian concentration), or by conditions on low-degree moments (e.g., each distribution in D has appropriately bounded higher-order moments).

Another basic observation is that, in contrast to the uncorrupted setting, in the contaminated setting of Definition 1.1 it is not possible to obtain consistent estimators, that is, estimators whose error converges to zero in probability as the sample size increases indefinitely. Typically, there is an information-theoretic limit on the minimum attainable error that depends on the proportion of contamination ε and structural properties of the underlying distribution family. In particular, for robust mean estimation of a high-dimensional Gaussian, we have:

Fact 1.2. For any d ≥ 1, any robust estimator for the mean of X = N(µ_X, I) on R^d must have ℓ2-error Ω(ε), even in Huber's contamination model.

This fact can be shown as follows: Given two distinct distributions N(µ1, I) and N(µ2, I) with ‖µ1 − µ2‖2 = Θ(ε), the adversary constructs two noise distributions N1, N2 on R^d such that

    Y = (1 − ε) N(µ1, I) + ε N1 = (1 − ε) N(µ2, I) + ε N2 .

Consequently, even in the infinite-sample regime, any algorithm can at best learn that its samples are coming from Y, but will be unable to distinguish between the cases where the real distribution is N(µ1, I) and where it is N(µ2, I). This proves Fact 1.2.
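For intuition, this construction can be carried out numerically in one dimension. The sketch below (ours) discretizes the real line and forms one standard choice of N1 and N2: each absorbs the part of the other Gaussian's density that its own mixture is missing, topped up by a common remainder so that each integrates to one. The grid endpoints and the use of P1 as the remainder density are arbitrary choices.

import numpy as np

def phi(x, mu):
    """Density of N(mu, 1)."""
    return np.exp(-(x - mu) ** 2 / 2) / np.sqrt(2 * np.pi)

xs = np.linspace(-10, 10, 200_001)
dx = xs[1] - xs[0]

eps = 0.1
mu1, mu2 = 0.0, eps                        # mean gap Θ(ε)
p1, p2 = phi(xs, mu1), phi(xs, mu2)

tv = 0.5 * np.sum(np.abs(p1 - p2)) * dx    # d_TV(P1, P2), about ε/2 here
assert (1 - eps) * tv <= eps               # the residuals fit in the ε budget

excess12 = (1 - eps) / eps * np.maximum(p2 - p1, 0.0)
excess21 = (1 - eps) / eps * np.maximum(p1 - p2, 0.0)
remainder = (1 - excess12.sum() * dx) * p1  # common top-up; any density works
n1 = excess12 + remainder
n2 = excess21 + remainder

y1 = (1 - eps) * p1 + eps * n1
y2 = (1 - eps) * p2 + eps * n2
print(np.max(np.abs(y1 - y2)))   # ≈ 0: the two corrupted distributions coincide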
If the target distribution X is allowed to come from a broader class of distributions (such as any distribution with sub-gaussian tails, or any distribution with bounded covariance), the situation is even worse. If one can find two distributions X and X′ in the desired class with d_TV(X, X′) ≤ ε, it becomes information-theoretically impossible for an algorithm to distinguish between the two. However, if the difference between X and X′ is concentrated in the tails of the distribution, then X and X′ might have very different means. This implies that for the class of distributions with sub-gaussian tails (and identity covariance) we cannot hope to learn the mean to ℓ2-error better than Ω(ε√log(1/ε)); and for the class of distributions with covariance Σ bounded by σ²I, we cannot expect to do better than Ω(σ√ε). It turns out that these bounds are information-theoretically optimal, and in fact, as we will see, the means of such distributions can be robustly estimated to these errors in many cases.

The problem of robust mean estimation for N(µ, I) seems so innocuous that one could naturally wonder why simple approaches do not work. In the one-dimensional case, it is well known that the median is a robust estimator of the mean, matching the lower bound of Fact 1.2. Specifically, it is easy to show that the median µ̂ of a multiset of n = Ω(log(1/τ)/ε²) ε-corrupted samples from a one-dimensional Gaussian N(µ, 1) satisfies |µ̂ − µ| ≤ O(ε) with probability at least 1 − τ.

In high dimensions, the situation is more subtle. There are many reasonable ways to attempt to generalize the median as a robust estimator in high dimensions, but unfortunately, most natural ones lead to ℓ2-error of Ω(ε√d) in d dimensions, even in the infinite-sample regime (see, e.g., [21, 57]). Perhaps the most obvious high-dimensional generalization of the median is the coordinate-wise median. Here the i-th coordinate of the output is the median of the i-th coordinates of the input samples. This estimator guarantees that every coordinate of the output is within O(ε) of the corresponding coordinate of the true mean. This suffices for obtaining small ℓ∞-error, but if one wants ℓ2-error (which is natural in the case of Gaussians), then there exist instances where the coordinate-wise median has ℓ2-error as large as Ω(ε√d). Another potential way to generalize the median to high dimensions is via the geometric median, i.e., the point x* that minimizes Σ_i ‖x^{(i)} − x*‖2. Unfortunately, the geometric median can also produce ℓ2-error of Ω(ε√d), if the adversary places the ε-fraction of outliers all in the same direction away from the mean.
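To make the Ω(ε√d) failure mode concrete, here is a small numerical sketch (ours): an adversary that plants all outliers at a far corner shifts every coordinate's median by Θ(ε), so the coordinate-wise median accumulates ℓ2-error Θ(ε√d).

import numpy as np

rng = np.random.default_rng(1)
n, d, eps = 20_000, 400, 0.1
k = int(eps * n)

X = rng.normal(size=(n, d))      # inliers from N(0, I); true mean is 0
X[:k] = 10.0                     # outliers: +10 in every coordinate

# Each coordinate's median moves from the 1/2-quantile of N(0,1) to
# roughly its 1/(2(1-ε))-quantile, i.e., by Θ(ε) per coordinate.
cw_median = np.median(X, axis=0)
print(cw_median[:3])                      # each entry ≈ 0.14 here
print(np.linalg.norm(cw_median))          # ≈ Θ(ε·√d): grows with dimension
print(np.linalg.norm(np.mean(X, axis=0))) # the sample mean is far worse (≈ 20)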
A third high-dimensional generalization of the median relies on the observation that taking the median of any univariate projection of our input points gives us an approximation to the projected mean. Finding a mean vector that minimizes the error over the worst direction can actually be used to obtain ℓ2-error of O(ε) with high probability. In other words, it is possible to reduce the high-dimensional robust mean estimation problem to a collection of one-dimensional robust mean estimation problems. This is the underlying idea in Tukey's median [74], which is known to be a robust mean estimator for spherical Gaussians and for more general symmetric distributions. Unfortunately, however, the Tukey median relies on combining information from infinitely many directions and is, unsurprisingly, NP-hard to compute in general. The following proposition gives a computationally inefficient robust mean estimator matching the lower bound of Fact 1.2:

Proposition 1.3. There exists an algorithm that, on input an ε-corrupted set of samples from X = N(µ_X, I) of size n = Ω((d + log(1/τ))/ε²), runs in poly(n, 2^d) time, and outputs µ̂ ∈ R^d such that with probability at least 1 − τ, it holds that ‖µ̂ − µ_X‖2 = O(ε).

The algorithm establishing Proposition 1.3 proceeds by using a one-dimensional robust mean estimator to estimate v · µ for an appropriate net of 2^{O(d)} unit vectors v ∈ R^d, and then combines these estimates (by solving a large linear program) to obtain an accurate estimate of µ. When X = N(µ_X, I), our one-dimensional robust mean estimator will be the median, giving the O(ε) error in Proposition 1.3. We note that the same procedure is applicable to other distribution families as well (even non-symmetric ones), as long as there is an accurate robust mean estimator for each univariate projection. Specifically, if X has tails bounded by those of a Gaussian with covariance σ²I, one can use the trimmed mean for each univariate projection. This gives error of O(σε√log(1/ε)). If X is only assumed to have bounded covariance matrix (Σ_X ⪯ σ²I), we can similarly use the trimmed mean, which leads to total error of O(σ√ε). By the discussion following Fact 1.2, both these upper bounds are optimal, within constant factors, under the corresponding assumptions.
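The univariate building block in these reductions is simple. Below is a sketch (ours) of a trimmed mean; trimming a 2ε-fraction from each side is our choice of constant, made so that, for tail-planted corruptions like the one simulated here, every spike falls beyond the empirical quantiles and is discarded.

import numpy as np

def trimmed_mean(samples_1d, eps):
    """Mean after discarding the bottom and top 2ε-quantile tails.

    With at most an ε-fraction of corruptions, spikes planted in a tail
    land beyond the empirical (1-2ε)-quantile and are removed; the
    residual bias comes only from trimming genuine inlier tails.
    """
    lo, hi = np.quantile(samples_1d, [2 * eps, 1 - 2 * eps])
    kept = samples_1d[(samples_1d >= lo) & (samples_1d <= hi)]
    return kept.mean()

rng = np.random.default_rng(2)
n, eps = 50_000, 0.05
x = rng.normal(size=n)          # clean projection: N(0, 1)
x[: int(eps * n)] = 100.0       # adversarial spike in an ε-fraction
print(np.mean(x))               # ruined: ≈ ε·100 = 5
print(trimmed_mean(x, eps))     # ≈ 0.09, on the order of ε·√log(1/ε)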
1.4 Structure of this Article

In Section 2, we provide a unified presentation of two related algorithmic techniques that gave the first computationally efficient algorithms for high-dimensional robust mean estimation. Section 2 is the main technical section of this article and showcases a number of core ideas and techniques that can be applied to several high-dimensional robust estimation tasks. Section 3 provides an overview of recent algorithmic progress for more general robust estimation tasks. Finally, in Section 4 we conclude with a few general directions for future work.

1.5 Preliminaries and Notation

For a distribution X, we will use the notation x ∼ X to denote that x is a sample drawn from X. For a finite set S, we will write x ∼_u S to denote that x is drawn from the uniform distribution on S. The probability of event E will be denoted by Pr[E]. We will use E[X] and Var[X] to denote the expectation and the variance of a random variable X. If X is multivariate, we will denote by Cov[X] its covariance matrix. We will also use the notation µ_X and Σ_X for the mean and covariance of X. Similarly, for a finite set S, we will denote by µ_S and Σ_S the sample mean and sample covariance of S.

For a vector v, we will use ‖v‖2 to denote its ℓ2-norm. For a matrix A, we will denote by ‖A‖2 and ‖A‖_F its spectral and Frobenius norms respectively, and by tr(A) its trace. We will denote by ⪯ the Loewner order between matrices. We will use standard asymptotic notation O(·), Ω(·). The Õ(·) notation hides logarithmic factors in its argument.

2 High-Dimensional Robust Mean Estimation

In this section, we illustrate the main insights underlying recent robust high-dimensional learning algorithms by focusing on the problem of robust mean estimation. The algorithmic techniques presented in this section were introduced in [21, 22]. Here we give a simplified and unified presentation that illustrates the key ideas and the connections between them. The objective of this section is to provide the intuition and background required to develop robust learning algorithms in an accessible way. As such, we will not attempt to optimize the sample or computational complexities of the algorithms presented, other than to show that they are polynomial in the relevant parameters.

In the problem of robust mean estimation, we are given an ε-corrupted set of samples from a distribution X on R^d and our goal is to approximate the mean of X within small error in ℓ2-norm. In order for such a goal to be information-theoretically possible, it is required that X belongs to a suitably well-behaved family of distributions. A typical assumption is that X belongs to a family whose moments are guaranteed to satisfy certain conditions, or equivalently, a family with appropriate concentration properties. In our initial discussion, we will use the running example of a spherical Gaussian, although the results presented here hold in greater generality. That is, the reader is encouraged to imagine that X is of the form N(µ, I) for some unknown µ ∈ R^d.

Structure of this Section. In Section 2.1, we discuss the basic intuition underlying the presented approach. In Section 2.2, we describe a stability condition that is necessary for the algorithms in this section to succeed. In the subsequent subsections, we present two related algorithmic techniques taking advantage of the stability condition in different ways. Specifically, in Section 2.3, we describe an algorithm that relies on convex programming. In Section 2.4, we describe an iterative outlier removal technique, which has been the method of choice in practice. In Section 2.5, we conclude with an overview of the relevant literature.

2.1 Key Difficulties and High-Level Intuition

Arguably the most natural idea to robustly estimate the mean of a distribution would be to identify the outliers and output the empirical mean of the remaining points. The key conceptual difficulty is the fact that, in high dimensions, the outliers cannot be identified at an individual level, even when they move the mean significantly. In many cases, we can easily identify the "extreme outliers" via a pruning procedure exploiting the concentration properties of the inliers. Alas, such naive approaches typically do not suffice to obtain non-trivial error guarantees.

The simplest example illustrating this difficulty is that of a high-dimensional spherical Gaussian. Typical samples will be at ℓ2-distance approximately Θ(√d) from the true mean. That is, we can certainly identify as outliers all points of our dataset at distance more than Ω(√d) from the coordinate-wise median of the dataset. All other points cannot be removed via such a procedure, as this could result in removing many inliers as well. However, by placing an ε-fraction of outliers at distance √d in the same direction from the unknown mean, an adversary can corrupt the sample mean by as much as Ω(ε√d).
This leaves the algorithm designer with a dilemma of sorts. On the one hand, potential outliers at distance Θ(√d) from the unknown mean could lead to large ℓ2-error, scaling polynomially with d. On the other hand, if the adversary places outliers at distance approximately Θ(√d) from the true mean in random directions, it may be information-theoretically impossible to distinguish them from the inliers. The way out is the realization that it is in fact not necessary to detect and remove all outliers. It is only required that the algorithm can detect the "consequential outliers", i.e., the ones that can significantly impact our estimates of the mean.

Let us assume without loss of generality that there are no extreme outliers (as these can be removed via pre-processing). Then the only way that the empirical mean can be far from the true mean is if there is a "conspiracy" of many outliers, all producing errors in approximately the same direction. Intuitively, if our corrupted points are at distance O(√d) from the true mean in random directions, their contributions will on average cancel out, leading to a small error in the sample mean. In conclusion, it suffices to be able to detect these kinds of conspiracies of outliers.

The next key insight is simple and powerful. Let T be an ε-corrupted set of points drawn from N(µ, I). If such a conspiracy of outliers substantially moves the empirical mean µ̂ of T, it must move µ̂ in some direction. That is, there is a unit vector v such that these outliers cause v · (µ̂ − µ) to be large. For this to happen, it must be the case that these outliers are on average far from µ in the v-direction. In particular, if an ε-fraction of corrupted points in T move the sample average of v · (U_T − µ), where U_T is the uniform distribution on T, by more than δ (where δ should be thought of as small, but substantially larger than ε), then on average these corrupted points x must have v · (x − µ) at least δ/ε. This in turn means that these corrupted points will have a contribution of at least ε · (δ/ε)² = δ²/ε to the variance of v · U_T. Fortunately, this condition can actually be algorithmically detected! In particular, by computing the top eigenvector of the sample covariance matrix, we can efficiently determine whether or not there is any direction v for which the variance of v · U_T is abnormally large.

The aforementioned discussion leads us to the overall structure of the algorithms we will describe in this section. Starting with an ε-corrupted set of points T (perhaps weighted in some way), we compute the sample covariance matrix and find the eigenvector v* with largest eigenvalue λ*. If λ* is not much larger than what it should be (in the absence of outliers), by the above discussion, the empirical mean is close to the true mean, and we can return that as an answer. Otherwise, we have obtained a particular direction v* for which we know that the outliers play an unusual role, i.e., behave significantly differently from the inliers. The distribution of the points projected in the v*-direction can then be used to perform some sort of outlier removal. The outlier removal procedure can be quite subtle and crucially depends on our distributional assumptions about the clean data.
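Both halves of this discussion are easy to see numerically. In the toy experiment below (ours), a coordinated ε-fraction of outliers at distance √d shifts the empirical mean by about ε√d, while the top eigenvalue of the sample covariance, about 1 + ε(1 − ε)d instead of 1, gives the conspiracy away, and its eigenvector recovers the planted direction.

import numpy as np

rng = np.random.default_rng(3)
n, d, eps = 20_000, 200, 0.1
k = int(eps * n)

v = np.ones(d) / np.sqrt(d)          # the conspiracy direction (unknown to us)
X = rng.normal(size=(n, d))          # inliers from N(0, I)
X[:k] = np.sqrt(d) * v               # outliers at distance √d, all aligned

# The coordinated outliers shift the mean by ≈ ε·√d ...
print(np.linalg.norm(X.mean(axis=0)))   # ≈ 0.1·√200 ≈ 1.4

# ... and inflate the variance in their direction by ≈ ε(1-ε)·d,
# which the top eigenvalue of the sample covariance exposes.
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
print(eigvals[-1])                       # ≈ 1 + 0.09·200 ≈ 19, far above 1
v_star = eigvecs[:, -1]
print(abs(v_star @ v))                   # ≈ 1: recovers the direction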
2.2 Good Sets and Stability

In this section, we give a deterministic condition on the uncorrupted data that we call stability (Definition 2.1), which is necessary for the algorithms presented here to succeed. Furthermore, we provide an efficiently checkable condition under which the empirical mean is certifiably close to the true mean (Lemma 2.4).

Let S be a set of n i.i.d. samples drawn from X. We will typically call these sample points good. The adversary can select up to an ε-fraction of points in S and replace them with arbitrary points to obtain an ε-corrupted set T, which is given as input to the algorithm. To establish correctness of an algorithm, we need to show that with high probability over the choice of the set S, for any choice of how the adversary corrupts the good samples, the algorithm will output an accurate estimate of the target mean.

To carry out such an analysis, it is convenient to explicitly state a collection of sufficient deterministic conditions on the set S. Specifically, we will define a notion of a "stable" set, quantified by the proportion of contamination ε and the distribution X. The precise stability conditions vary considerably based on the underlying estimation task and the assumptions on the distribution family of the uncorrupted data. Roughly speaking, we require that the uniform distribution over a stable set S behaves similarly to the distribution X with respect to higher moments and, potentially, tail bounds. Importantly, we require that these conditions hold even after removing an arbitrary ε-fraction of points in S.

The notion of a stable set must have two critical properties: (1) A set of n i.i.d. samples from X is stable with high probability, when n is at least a sufficiently large polynomial in the relevant parameters; and (2) If S is a stable set and T is obtained from S by changing at most an ε-fraction of the points in S, then the algorithm, when run on the set T, will succeed.

The robust mean estimation algorithms that will be presented in this section crucially rely on considering sample means and covariances. The following stability condition is an important ingredient in the success criteria of these algorithms:

Definition 2.1 (Stability Condition). Fix 0 < ε < 1/2 and δ ≥ ε. A finite set S ⊂ R^d is (ε, δ)-stable (with respect to a distribution X) if for every unit vector v ∈ R^d and every S′ ⊆ S with |S′| ≥ (1 − ε)|S|, the following conditions hold:

1. | (1/|S′|) Σ_{x∈S′} v · (x − µ_X) | ≤ δ, and
2. | (1/|S′|) Σ_{x∈S′} (v · (x − µ_X))² − 1 | ≤ δ²/ε.

The aforementioned stability condition, or a variant thereof, is used in every known robust mean estimation algorithm. Definition 2.1 requires that after restricting to a (1 − ε)-density subset S′, the sample mean of S′ is within δ of µ_X, and the sample variance of S′ is 1 ± δ²/ε in every direction. (We note that Definition 2.1 is intended for distributions X with covariance Σ_X = I or, more generally, Σ_X ⪯ I. The case of arbitrary known or bounded covariance can be reduced to this case via an appropriate linear transformation of the data.)
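Checking Definition 2.1 exactly requires examining every unit vector and every large subset. Along a fixed direction, however, the worst-case subset is obtained by deleting an ε-tail, as the proof sketch of Proposition 2.2 below exploits. The following sketch (ours) estimates the worst-case deviations along a few random directions; it assumes the true mean is known (it appears in the definition), and it is a spot check only, not a certificate over all directions.

import numpy as np

def stability_along(S, v, eps, mu):
    """Worst-case deviations of Definition 2.1 along one direction v.

    For fixed v, the mean condition is extremized by deleting an ε-tail
    of v·x on one side, and the variance condition by deleting the points
    with largest (resp. smallest) squared projections.
    """
    y = (S - mu) @ v
    keep = len(y) - int(eps * len(y))
    ys = np.sort(y)
    dev_mean = max(abs(ys[:keep].mean()), abs(ys[-keep:].mean()))
    sq = np.sort(y ** 2)
    dev_var = max(abs(sq[:keep].mean() - 1), abs(sq[-keep:].mean() - 1))
    # Stable along v iff dev_mean ≤ δ and dev_var ≤ δ²/ε.
    return dev_mean, dev_var

rng = np.random.default_rng(4)
n, d, eps = 100_000, 50, 0.05
S = rng.normal(size=(n, d))
for _ in range(3):
    v = rng.normal(size=d)
    v /= np.linalg.norm(v)
    print(stability_along(S, v, eps, mu=np.zeros(d)))
    # For a Gaussian: dev_mean ≈ 0.1 ≈ ε√(2·log(1/ε)), dev_var ≈ ε·2·log(1/ε)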
The fact that the conditions of Definition 2.1 must hold for every large subset S′ of S might make it unclear whether they can hold with high probability. However, one can show the following:

Proposition 2.2. A set of i.i.d. samples from an identity covariance sub-gaussian distribution of size Ω(d/ε²) is (ε, O(ε√log(1/ε)))-stable with high probability.

We sketch a proof of Proposition 2.2. The only property required for the proof is that the distribution of the uncorrupted data has identity covariance and sub-gaussian tails in each direction, i.e., the tail probability of each univariate projection is bounded from above by the corresponding Gaussian tail. Fix a direction v. To show that the first condition holds for this v, we note that we can maximize (1/|S′|) Σ_{x∈S′} v · (x − µ_X) by removing from S the ε-fraction of points x for which v · x is smallest. Since the empirical mean of S is close to µ_X with high probability, we need to understand how much this quantity is altered by removing the ε-tail in the v-direction. Given our assumptions on the distribution of the uncorrupted data, removing the ε-tail only changes the mean by O(ε√log(1/ε)). Therefore, if the empirical distribution of v · x, x ∈ S, behaves like a spherical Gaussian in this way, the first condition is satisfied. The second condition follows via a similar analysis. We can minimize the relevant quantity by removing the ε-fraction of points x ∈ S with |v · (x − µ_X)| as large as possible. If v · x is distributed like a unit-variance sub-gaussian, the total mass of its square over the ε-tails is O(ε log(1/ε)). We have thus established that both conditions hold with high probability for any fixed direction. That the conditions hold with high probability for all directions simultaneously can be shown by an appropriate covering argument.

More generally, one can show quantitatively different stability conditions under various distributional assumptions. In particular, we state the following proposition without proof. (The interested reader is referred to [22] for a proof.)

Proposition 2.3. A set of i.i.d. samples from a distribution with covariance Σ ⪯ I of size Ω̃(d/ε) is (ε, O(√ε))-stable with high probability.

We note that analogous bounds can be easily shown for identity covariance distributions with bounded higher central moments. For example, if our distribution has identity covariance and its k-th central moment is bounded from above by a constant, one can show that a set of Ω(d/ε^{2−2/k}) samples is (ε, O(ε^{1−1/k}))-stable with high probability.

The aforementioned notion of stability is powerful and suffices for robust mean estimation. Some of the algorithms that will be presented in this section work under only the stability condition, while others require additional conditions beyond stability. The main reason why stability suffices is quantified in the following lemma:

Lemma 2.4 (Certificate for Empirical Mean). Let S be an (ε, δ)-stable set with respect to a distribution X, for some δ ≥ ε > 0. Let T be an ε-corrupted version of S. Let µ_T and Σ_T be the empirical mean and covariance of T.
If the largest eigenvalue of Σ_T is at most 1 + λ, for some λ ≥ 0, then ‖µ_T − µ_X‖2 ≤ O(δ + √(ελ)).

Proof of Lemma 2.4. Let S′ = S ∩ T and T′ = T \ S′. By replacing S′ with a subset if necessary, we may assume that |S′| = (1 − ε)|S| and |T′| = ε|S|. Let µ_{S′}, µ_{T′}, Σ_{S′}, Σ_{T′} denote the empirical means and covariance matrices of S′ and T′. A simple calculation gives that

    Σ_T = (1 − ε)Σ_{S′} + εΣ_{T′} + ε(1 − ε)(µ_{S′} − µ_{T′})(µ_{S′} − µ_{T′})^T .

Let v be the unit vector in the direction of µ_{S′} − µ_{T′}. We have that

    1 + λ ≥ v^T Σ_T v = (1 − ε) v^T Σ_{S′} v + ε v^T Σ_{T′} v + ε(1 − ε) v^T (µ_{S′} − µ_{T′})(µ_{S′} − µ_{T′})^T v
          ≥ (1 − ε)(1 − δ²/ε) + ε(1 − ε)‖µ_{S′} − µ_{T′}‖2²
          ≥ 1 − O(δ²/ε) + (ε/2)‖µ_{S′} − µ_{T′}‖2² ,

where we used the variational characterization of eigenvalues, the fact that Σ_{T′} is positive semidefinite, and the second stability condition for S′. By rearranging, we obtain that ‖µ_{S′} − µ_{T′}‖2 = O(δ/ε + √(λ/ε)). Therefore, we can write

    ‖µ_T − µ_X‖2 = ‖(1 − ε)µ_{S′} + εµ_{T′} − µ_X‖2 = ‖µ_{S′} − µ_X + ε(µ_{T′} − µ_{S′})‖2
                 ≤ ‖µ_{S′} − µ_X‖2 + ε‖µ_{S′} − µ_{T′}‖2 = O(δ) + ε · O(δ/ε + √(λ/ε)) = O(δ + √(ελ)) ,

where we used the first stability condition for S′ and our bound on ‖µ_{S′} − µ_{T′}‖2.

We note that the proof of Lemma 2.4 only used the lower bound part of the second condition of Definition 2.1, i.e., that the sample variance of S′ in each direction is at least 1 − δ²/ε. The upper bound part will be crucially used in the design and analysis of our robust mean estimation algorithms in the following sections.

Lemma 2.4 says that if our input set of points T is an ε-corrupted version of any stable set S and has bounded sample covariance, then the sample mean of T closely approximates the true mean. This lemma, or a variant thereof, is a key result in all known robust mean estimation algorithms. Unfortunately, we are not always guaranteed that the set T we are given has this property. In order to deal with this, we will want to find a subset of T with bounded covariance and large intersection with S. However, for some of the algorithms presented, it will be convenient to find a probability distribution over T rather than a subset. For this, we will need a slight generalization of Lemma 2.4.

Lemma 2.5. Let S be an (ε, δ)-stable set with respect to a distribution X, for some δ ≥ ε > 0 with |S| > 1/ε. Let W be a probability distribution that differs from U_S, the uniform distribution over S, by at most ε in total variation distance. Let µ_W and Σ_W be the mean and covariance of W. If the largest eigenvalue of Σ_W is at most 1 + λ, for some λ ≥ 0, then ‖µ_W − µ_X‖2 ≤ O(δ + √(ελ)).

Note that this subsumes Lemma 2.4 by letting W be the uniform distribution over T. The proof is essentially identical to that of Lemma 2.4, except that we need to show that the mean and variance of the conditional distribution W|_S are approximately correct, whereas in Lemma 2.4 the bounds on the mean and variance of S ∩ T followed directly from stability.

Lemma 2.5 clarifies the goal of our outlier removal procedure. In particular, given our initial ε-corrupted set T, we will attempt to find a distribution W supported on T so that Σ_W has no large eigenvalues. The weight W(x), x ∈ T, quantifies our belief as to whether point x is an inlier or an outlier. We will also need to ensure that any such W we choose is close to the uniform distribution over S.
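In algorithmic form, Lemma 2.4 gives an immediately checkable certificate. A minimal sketch (ours; the input lam plays the role of λ and is an assumption of the caller):

import numpy as np

def certify_mean(T, lam):
    """Certificate of Lemma 2.4.

    If the top eigenvalue of the empirical covariance of T is at most
    1 + lam, return the empirical mean (which is O(δ + sqrt(ε·lam)) close
    to µ_X whenever the inliers form a stable set); otherwise return the
    offending direction for use by an outlier-removal step.
    """
    mu = T.mean(axis=0)
    cov = np.cov(T, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    if eigvals[-1] <= 1 + lam:
        return mu, None                # certified: output the sample mean
    return None, eigvecs[:, -1]        # top eigenvector: filter along it

Every algorithm in this section is, in essence, a loop around this test: either certify, or filter along the returned direction and repeat.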
More concretely, we now describe a framework that captures our robust mean estimation algorithms. We start with the following definition:

Definition 2.6. Let S be a (3ε, δ)-stable set with respect to X, and let T be an ε-corrupted version of S. Let C be the set of all probability distributions W supported on T with W(x) ≤ 1/(|T|(1 − ε)) for all x ∈ T.

We note that any distribution in C differs from U_S, the uniform distribution on S, by at most 3ε. Indeed, for ε ≤ 1/3, we have that:

    d_TV(U_S, W) = Σ_{x∈T} max{W(x) − U_S(x), 0}
                 = Σ_{x∈S∩T} max{W(x) − 1/|T|, 0} + Σ_{x∈T\S} W(x)
                 ≤ Σ_{x∈S∩T} ε/(|T|(1 − ε)) + Σ_{x∈T\S} 1/(|T|(1 − ε))
                 ≤ |T| · ε/(|T|(1 − ε)) + ε|T| · 1/(|T|(1 − ε)) = 2ε/(1 − ε) ≤ 3ε .

Therefore, if we find W ∈ C with Σ_W having no large eigenvalues, Lemma 2.5 implies that µ_W is a good approximation to µ_X. Fortunately, we know that such a W exists! In particular, if we take W to be W*, the uniform distribution over S ∩ T, the largest eigenvalue is at most 1 + δ²/ε, and thus we achieve ℓ2-error O(δ).

At this point, we have an inefficient algorithm for approximating µ_X: Find any W ∈ C with bounded covariance. The remaining question is how we can efficiently find one. There are two basic algorithmic techniques to achieve this, which we present in the subsequent subsections.

The first algorithmic technique we will describe is based on convex programming. We will call this the unknown convex programming method. Note that C is a convex set and that finding a point in C that has bounded covariance is almost a convex program. It is not quite a convex program, because the variance of v · W, for fixed v, is not a convex function of W. However, one can show that given a W with variance in some direction significantly larger than 1 + δ²/ε, we can efficiently construct a hyperplane separating W from W* (recall that W* is the uniform distribution over S ∩ T) (Section 2.3). This method has the advantage of naturally working under only the stability assumption. On the other hand, as it relies on the ellipsoid algorithm, it is quite slow (although polynomial time).

Our second technique, which we will call filtering, is an iterative outlier removal method that is typically faster, as it relies primarily on spectral techniques. The main idea of the method is the following: If Σ_W does not have large eigenvalues, then the empirical mean is close to the true mean. Otherwise, there is some unit vector v such that Var[v · W] is substantially larger than it should be. This can only be the case if W assigns substantial mass to elements of T \ S whose values of v · x are very far from v · µ_X. This observation allows us to perform some kind of outlier removal, in particular by removing (or down-weighting) the points x that have v · x inappropriately large. An important conceptual property is that one cannot ensure that only outliers are removed, but it is possible to ensure that more outliers are removed than inliers.
Given a W ∈ C where Σ_W has a large eigenvalue, one filtering step gives a new distribution W′ ∈ C that is closer to W* than W was. Repeating the process eventually gives a W with no large eigenvalues. The filtering method and its variations are discussed in Section 2.4.

2.3 The Unknown Convex Programming Method

By Lemma 2.5, it suffices to find a distribution W ∈ C with Σ_W having no large eigenvalues. We note that this condition almost defines a convex program. This is because C is a convex set of probability distributions and the bounded covariance condition says that Var[v · W] ≤ 1 + λ for all unit vectors v. Unfortunately, the variance Var[v · W] = E[|v · (W − µ_W)|²] is not quite linear in W. (If we instead had E[|v · (W − ν)|²], where ν is some fixed vector, this would be linear in W.) However, we will show that a unit vector v for which Var[v · W] is too large can be used to obtain a separation oracle, i.e., a linear function L for which L(W) > L(W*).

Suppose that we identify a unit vector v such that Var[v · W] = 1 + λ, where λ > c(δ²/ε) for a sufficiently large universal constant c > 0. Applying Lemma 2.5 to the one-dimensional projection v · W gives

    |v · (µ_W − µ_X)| ≤ O(δ + √(ελ)) = O(√(ελ)) .

Let L(Y) := E_Y[|v · (Y − µ_W)|²] and note that L is a linear function of the probability distribution Y with L(W) = 1 + λ. We can write

    L(W*) = E_{W*}[|v · (W* − µ_W)|²] = Var[v · W*] + |v · (µ_W − µ_{W*})|²
          ≤ 1 + δ²/ε + 2|v · (µ_W − µ_X)|² + 2|v · (µ_{W*} − µ_X)|²
          ≤ 1 + O(δ²/ε + ελ) < 1 + λ = L(W) .

In summary, we have an explicit convex set C of probability distributions from which we want to find one whose covariance has eigenvalues bounded by 1 + O(δ²/ε). Given any W ∈ C which does not satisfy this condition, we can produce a linear function L that separates W from W*. Using the ellipsoid algorithm, we obtain the following general theorem:

Theorem 2.7. Let S be a (3ε, δ)-stable set with respect to a distribution X and let T be an ε-corrupted version of S. There exists a polynomial-time algorithm which, given T, returns µ̂ such that ‖µ̂ − µ_X‖2 = O(δ).
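The separation oracle just described is straightforward to implement. The sketch below (ours) represents a distribution W ∈ C as a weight vector w over the points of T, and either certifies bounded variance or returns the coefficients of the separating linear functional L; the surrounding ellipsoid iteration over C is omitted, and the threshold is an input assumption.

import numpy as np

def separation_oracle(T, w, threshold):
    """Try to separate the weight vector w from W* (Section 2.3).

    Returns None if Var[v·W] ≤ threshold for every unit v (certified case);
    otherwise returns the per-point coefficients of the linear functional
    L(Y) = Σ_x Y(x)·(v·(x − µ_W))², which satisfies L(W) > L(W*).
    """
    mu_w = w @ T                                   # weighted mean
    centered = T - mu_w
    cov_w = centered.T @ (centered * w[:, None])   # weighted covariance
    eigvals, eigvecs = np.linalg.eigh(cov_w)
    if eigvals[-1] <= threshold:
        return None
    v = eigvecs[:, -1]                             # worst direction
    return (centered @ v) ** 2                     # coefficients of L

Coupled with the ellipsoid algorithm over the convex set C, this oracle yields Theorem 2.7; in practice, the filtering method of Section 2.4 is preferred.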
Implications for Concrete Distribution Families. Combining Theorem 2.7 with the corresponding stability bounds, we obtain a number of concrete applications for various distribution families of interest. Using Proposition 2.2, we obtain:

Corollary 2.8 (Identity Covariance Sub-gaussian Distributions). Let T be an ε-corrupted set of samples of size Ω(d/ε²) from an identity covariance sub-gaussian distribution X on R^d. There exists a polynomial-time algorithm which, given T, returns µ̂ such that with high probability ‖µ̂ − µ_X‖2 = O(ε√log(1/ε)).

We note that Corollary 2.8 can be immediately adapted for identity covariance distributions satisfying weaker concentration assumptions. For example, if X satisfies sub-exponential concentration in each direction, we obtain an efficient robust mean estimation algorithm with ℓ2-error of O(ε log(1/ε)). If X has identity covariance and bounded k-th central moments, k ≥ 2, we obtain error O(ε^{1−1/k}). These error bounds are information-theoretically optimal, within constant factors.

For distributions with unknown bounded covariance, using Proposition 2.3 we obtain:

Corollary 2.9 (Unknown Bounded Covariance Distributions). Let T be an ε-corrupted set of samples of size Ω̃(d/ε) from a distribution X on R^d with unknown bounded covariance Σ_X ⪯ σ²I. There exists a polynomial-time algorithm which, given T, returns µ̂ such that with high probability ‖µ̂ − µ_X‖2 = O(σ√ε).

By the discussion following Fact 1.2, the above error bound is also information-theoretically optimal, within constant factors.

2.4 The Filtering Method

As in the unknown convex programming method, the goal of the filtering method is to find a distribution W ∈ C so that Σ_W has bounded eigenvalues. Given a W ∈ C, Σ_W either has bounded eigenvalues (in which case the weighted empirical mean works) or there is a direction v in which Var[v · W] is too large. In the latter case, the projections v · W must behave very differently from the projections v · S or v · X. In particular, since an ε-fraction of outliers is causing a much larger increase in the standard deviation, the distribution of v · W will have many "extreme points", more than one would expect to find in v · S. This fact allows us to identify a non-empty subset of extreme points, the majority of which are outliers. These points can then be removed (or down-weighted) in order to "clean up" our sample. Formally, given a W ∈ C whose covariance has unbounded eigenvalues, we can efficiently find a W′ ∈ C so that W′ is closer to W* than W was. Iterating this procedure eventually terminates, giving a W with bounded eigenvalues.

We note that while it may be conceptually useful to consider the above scheme for general distributions W over points, in most cases it suffices to consider only W given as the uniform distribution over some set of points. The filtering step in this case consists of replacing the set T by some subset T′ = T \ R, where R ⊂ T. To guarantee progress towards W* (the uniform distribution over S ∩ T), it suffices to ensure that at most a third of the elements of R are also in S, or equivalently, that at least two thirds of the removed points are outliers (perhaps in expectation). The algorithm will terminate when the current set of points T′ has bounded empirical covariance, and the output will be the empirical mean of T′.

Before we proceed with a more detailed technical discussion, we note that there are several possible ways to implement the filtering step, and that the method used has a significant impact on the analysis. In general, a filtering step removes all points that are "far" from the sample mean in a large-variance direction. However, the precise way that this is quantified can vary in important ways.

2.4.1 Basic Filtering

In this subsection, we present a filtering method that yields efficient robust mean estimators with optimal error bounds for identity covariance (or, more generally, known covariance) distributions whose univariate projections satisfy appropriate tail bounds. For the purposes of this section, we will restrict ourselves to the Gaussian setting. We note, however, that this method immediately extends to distributions with weaker concentration properties, e.g., sub-exponential or even inverse-polynomial concentration, with appropriate modifications.
We note that the filtering method presented here requires an additional condition on our good set of samples, on top of the stability condition. This is quantified in the following definition:

Definition 2.10. A set S ⊂ R^d is tail-bound-good (with respect to X = N(µ_X, I)) if, for any unit vector v and any t > 0, we have

    Pr_{x ∼_u S} [ |v · (x − µ_X)| > 2t + 2 ] ≤ e^{−t²/2} .   (1)

Since any univariate projection of X is distributed like a standard Gaussian, Equation (1) should hold if the uniform distribution over S were replaced by X. It can be shown that this condition holds with high probability if S consists of i.i.d. random samples from X of a sufficiently large polynomial size [21]. Intuitively, the additional tail condition of Definition 2.10 is needed to guarantee that the filtering method used here will remove more outliers than inliers. Formally, we have the following:

Lemma 2.11. Let ε > 0 be a sufficiently small constant. Let S ⊂ R^d be both (2ε, δ)-stable and tail-bound-good with respect to X = N(µ_X, I), with δ = cε√log(1/ε), for c > 0 a sufficiently large constant. Let T ⊂ R^d be such that |T ∩ S| ≥ (1 − ε) max(|T|, |S|), and assume we are given a unit vector v ∈ R^d for which Var[v · T] > 1 + 2δ²/ε. There exists a polynomial-time algorithm that returns a subset R ⊂ T satisfying |R ∩ S| < |R|/3.

To see why Lemma 2.11 suffices for our purposes, note that by replacing T by T′ = T \ R, we obtain a less noisy version of S than T was. In particular, it is easy to see that the size of the symmetric difference between S and T′ is strictly smaller than the size of the symmetric difference between S and T. From this it follows that the hypothesis |T ∩ S| ≥ (1 − ε) max(|T|, |S|) still holds when T is replaced by T′, allowing us to iterate this process until we are left with a set with small variance.

Proof. Let Var[v · T] = 1 + λ. By applying Lemma 2.4 to the set T, we get that |v · µ_X − v · µ_T| ≤ c√(λε). By (1), it follows that

    Pr_{x ∼_u S} [ |v · (x − µ_T)| > 2t + 2 + c√(λε) ] ≤ e^{−t²/2} .

We claim that there exists a threshold t₀ such that

    Pr_{x ∼_u T} [ |v · (x − µ_T)| > 2t₀ + 2 + c√(λε) ] > 4 e^{−t₀²/2} ,   (2)

where the constants have not been optimized. Given this claim, the set R = {x ∈ T : |v · (x − µ_T)| > 2t₀ + 2 + c√(λε)} will satisfy the conditions of the lemma.

To prove our claim, we analyze the variance of v · T and note that much of the excess must be due to points in T \ S. In particular, by our assumption on the variance in the v-direction, Σ_{x∈T} |v · (x − µ_T)|² = |T| Var[v · T] = |T|(1 + λ), where λ > 2δ²/ε. The contribution from the points x ∈ S ∩ T is at most

    Σ_{x∈S} |v · (x − µ_T)|² = |S| (Var[v · S] + |v · (µ_T − µ_S)|²) ≤ |S|(1 + δ²/ε + 2c²λε) ≤ |T|(1 + 2c²λε + 3λ/5) ,

where the first inequality uses the stability of S, and the last uses that |T| ≥ (1 − ε)|S|. If ε is sufficiently small relative to c, it follows that Σ_{x∈T\S} |v · (x − µ_T)|² ≥ |T|λ/3. On the other hand, by definition we have:

    Σ_{x∈T\S} |v · (x − µ_T)|² = |T| ∫₀^∞ 2t Pr_{x ∼_u T} [ |v · (x − µ_T)| > t, x ∉ S ] dt .   (3)

Assume for the sake of contradiction that there is no t₀ for which Equation (2) is satisfied. Then the right-hand side of (3) is at most

    |T| ( ∫₀^{2+c√(λε)+10√log(1/ε)} 2t Pr_{x ∼_u T}[x ∉ S] dt + ∫_{2+c√(λε)+10√log(1/ε)}^∞ 2t Pr_{x ∼_u T}[|v · (x − µ_T)| > t] dt )
    ≤ |T| ( ε (2 + c√(λε) + 10√log(1/ε))² + ∫_{5√log(1/ε)}^∞ 16 (2t + 2 + c√(λε)) e^{−t²/2} dt )
    ≤ |T| ( O(c²λε² + ε log(1/ε)) + O(ε²(√log(1/ε) + c√(λε))) )
    ≤ |T| O(c²λε² + (δ²/ε)/c) < |T| λ/3 ,

which is a contradiction. Therefore, the tail bounds and the concentration violation together imply the existence of such a t₀ (which can be efficiently computed).
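In code, one filtering step of Lemma 2.11 amounts to a threshold scan. The sketch below (ours) checks Equation (2) on a finite grid of candidate values of t rather than exactly, and the constants mirror the un-optimized ones of the proof; slack stands for the additive term c√(λε).

import numpy as np

def basic_filter_step(T, v, mu_T, slack):
    """One basic filtering step (cf. Lemma 2.11).

    T: (n, d) corrupted sample; v: unit vector with inflated variance;
    slack: the additive term c·sqrt(λ·ε) from the proof.
    Returns indices of the tail to remove, or None if no threshold works.
    """
    z = np.abs(T @ v - mu_T @ v)             # distances in the v-direction
    n = len(z)
    for t in np.linspace(0.0, np.sqrt(2 * np.log(n)), 200):
        cutoff = 2 * t + 2 + slack
        tail = np.count_nonzero(z > cutoff)
        if tail > 4 * np.exp(-t * t / 2) * n:   # Equation (2) holds at t
            return np.nonzero(z > cutoff)[0]    # a majority are outliers
    return None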
2.4.2 Randomized Filtering

The basic filtering method of the previous subsection is deterministic, relying on the violation of a concentration inequality satisfied by the inliers. In some settings, deterministic filtering seems to fail to give optimal results, and we require the filtering procedure to be randomized. A concrete such setting is when the uncorrupted distribution is only assumed to have bounded covariance.

The main idea of randomized filtering is simple: Suppose we can identify a non-negative function f(x), defined on the samples x, for which (under some high-probability condition on the inliers) it holds that Σ_T f(x) ≥ 2 Σ_S f(x), where T is an ε-corrupted set of samples and S is the corresponding set of inliers. Then we can create a randomized filter by removing each sample point x ∈ T with probability proportional to f(x). This ensures that the expected number of outliers removed is at least the expected number of inliers removed. The analysis of such a randomized filter is slightly more subtle, so we discuss it in the following paragraphs.

The key property the above randomized filter ensures is that the sequence of random variables (#Inliers removed) − (#Outliers removed) (where "inliers" are points in S and "outliers" points in T \ S) across iterations is a sub-martingale. Since the total number of outliers removed across all iterations accounts for at most an ε-fraction of the total samples, this means that with probability at least 2/3, at no point does the algorithm remove more than a 2ε-fraction of the inliers. A formal statement follows:

Theorem 2.12. Let S ⊂ R^d be a (6ε, δ)-stable set (with respect to X) and let T be an ε-corrupted version of S. Suppose that, given any T′ ⊆ T with |T′ ∩ S| ≥ (1 − 6ε)|S| for which Cov[T′] has an eigenvalue bigger than 1 + λ, for some λ ≥ 0, there is an efficient algorithm that computes a non-zero function f : T′ → R₊ such that Σ_{x∈T′} f(x) ≥ 2 Σ_{x∈T′∩S} f(x). Then there exists a polynomial-time randomized algorithm that computes a vector µ̂ that with probability at least 2/3 satisfies ‖µ̂ − µ_X‖2 = O(δ + √(ελ)).

The algorithm is described in pseudocode below:

Algorithm Randomized Filtering
1. Compute Cov[T] and its largest eigenvalue ν.
2. If ν ≤ 1 + λ, return µ_T.
3. Else
   • Compute f as guaranteed in the theorem statement.
   • Remove each x ∈ T with probability f(x)/max_{x∈T} f(x) and return to Step 1 with the new set T.
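A direct transcription of this pseudocode into Python (a sketch, ours): it plugs in the score function of Proposition 2.13 in Section 2.4.3 below as one valid choice of f, and lam and eps are the theorem's parameters, supplied by the caller.

import numpy as np

def randomized_filtering(T, lam, eps, rng):
    """Randomized filtering (pseudocode of Theorem 2.12).

    Uses the universal-filter score of Proposition 2.13 as f: the squared
    projection on the top eigenvector, restricted to the ε·|T| points on
    which that score is largest.
    """
    T = T.copy()
    while True:
        mu = T.mean(axis=0)
        cov = np.cov(T, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(cov)
        if eigvals[-1] <= 1 + lam:
            return mu                          # Step 2: certified mean
        v = eigvecs[:, -1]                     # Step 3: worst direction
        g = (T @ v - mu @ v) ** 2
        f = np.where(g >= np.quantile(g, 1 - eps), g, 0.0)
        # Remove each point with probability f(x)/max f(x); the argmax is
        # removed with probability 1, so the loop always terminates.
        keep = rng.random(len(T)) >= f / f.max()
        T = T[keep]

rng = np.random.default_rng(5)
n, d, eps = 5_000, 50, 0.1
X = rng.normal(size=(n, d))
X[: int(eps * n)] += 8.0                       # coordinated corruption
mu_hat = randomized_filtering(X, lam=0.5, eps=eps, rng=rng)
print(np.linalg.norm(mu_hat))   # small; the dirty mean was off by ε·8·√d ≈ 5.7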
Proof of Theorem 2.12. First, it is easy to see that this algorithm runs in polynomial time. Indeed, as the point x ∈ T attaining the maximum value of f(x) is definitely removed in each filtering iteration, each iteration reduces |T| by at least one.

To establish correctness, we will show that, with probability at least 2/3, at each iteration of the algorithm it holds that |S ∩ T| ≥ (1 − 6ε)|S|. Assuming this claim, Lemma 2.4 implies that our final error will be as desired.

To prove the desired claim, we consider the sequence of random variables d(T) = |S Δ T| = |S \ T| + |T \ S| across the iterations of the algorithm. We note that, initially, d(T) = 2ε|S| and that d(T) cannot drop below 0. Finally, we note that at each stage of the algorithm d(T) changes by (#Inliers removed) − (#Outliers removed), and that the expectation of this quantity is proportional to

    Σ_{x∈S∩T} f(x) − Σ_{x∈T\S} f(x) = 2 Σ_{x∈S∩T} f(x) − Σ_{x∈T} f(x) ≤ 0 .

This means that d(T) is a sub-martingale (at least until we reach a point where |S ∩ T| ≤ (1 − 6ε)|S|). However, if we set a stopping time at the first occasion where this condition fails, we note that the expectation of d(T) is at most 2ε|S|. Since d(T) is at least 0, this means that with probability at least 2/3 it is never more than 6ε|S|, which would imply that |S ∩ T| ≥ (1 − 6ε)|S| throughout the algorithm. If this is the case, the inequality |T′ ∩ S| ≥ (1 − 6ε)|S| will continue to hold throughout our algorithm, thus eventually yielding such a set with the variance of T′ bounded. By Lemma 2.4, the mean of this T′ will be a suitable estimate for the true mean.

Methods of Point Removal. The randomized filtering method described above only requires that each point x is removed with probability f(x)/max_{x∈T} f(x), without any assumption of independence. Therefore, given an f, there are several ways to implement this scheme. A few natural ones are given here, with a short code sketch following the list:

• Randomized Thresholding: Perhaps the easiest method for implementing our randomized filter is generating a uniform random number y ∈ [0, max_{x∈T} f(x)] and removing all points x ∈ T for which f(x) ≥ y. This method is practically useful in many applications. Finding the set of such points is often fairly easy, as this condition may well correspond to a simple threshold.

• Independent Removal: Each x ∈ T is removed independently with probability f(x)/max_{x∈T} f(x). This scheme has the advantage of leading to less variance in d(T). A careful analysis of the random walk involved allows one to reduce the failure probability to exp(−Ω(ε|S|)).

• Deterministic Reweighting: Instead of removing points, this scheme allows for weighted sets of points. In particular, each point will be assigned a weight in [0, 1], and we will consider weighted means and covariances. Instead of removing a point with probability proportional to f(x), we can remove a fraction of x's weight proportional to f(x). This ensures that the appropriate weighted version of d(T) is deterministically non-increasing, implying deterministic correctness of the algorithm.
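The three schemes differ only in how they consume the scores f(x). A compact sketch (ours), each operating on a score vector f over the current point set:

import numpy as np

def randomized_thresholding(f, rng):
    # One uniform threshold y; drop every point with f(x) >= y.
    y = rng.uniform(0.0, f.max())
    return f < y                           # boolean mask of points to keep

def independent_removal(f, rng):
    # Drop each point independently with probability f(x)/max f(x).
    return rng.random(len(f)) >= f / f.max()

def deterministic_reweighting(f, w):
    # Shrink weights instead of removing points; the point with the
    # largest score always has its weight zeroed, guaranteeing progress.
    return w * (1.0 - f / f.max())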
Practical Considerations. While the aforementioned point-removal methods have similar theoretical guarantees, recent implementations [24] suggest that they have different practical performance on real datasets. The deterministic reweighting method is somewhat slower in practice, as its worst-case runtime and its typical runtime are comparable. In more detail, one can guarantee termination by setting the constant of proportionality so that at each step at least one of the non-zero weights is set to zero. However, in practical circumstances we cannot expect to do better than this worst case: the algorithm may well be forced to undergo $\epsilon|S|$ iterations. On the other hand, the randomized versions of the algorithm are likely to remove several points of $T$ at each filtering step.

Another reason why the randomized versions may be preferable has to do with the quality of the results. The randomized algorithms only produce bad results when there is a chance that $d(T)$ ends up being very large. However, since $d(T)$ is a supermartingale, this will only ever be the case if there is a corresponding possibility that $d(T)$ will be exceptionally small. Thus, although the randomized algorithms may have some probability of giving worse results, this will only happen if, a corresponding fraction of the time, they also give better results than the theory guarantees. This consideration suggests that the randomized thresholding procedure might have advantages over the independent removal procedure precisely because it has a higher probability of failure. This has been observed experimentally in [24]: on real datasets (poisoned with a constant fraction of adversarial outliers), the number of iterations of randomized filtering is typically bounded by a small constant.

2.4.3 Universal Filtering

In this subsection, we show how to use randomized filtering to construct a universal filter that works under only the stability condition (Definition 2.1), without requiring the tail-bound condition of the basic filter (Lemma 2.11). Formally, we show:

Proposition 2.13. Let $S \subset \mathbb{R}^d$ be an $(\epsilon, \delta)$-stable set, for $\epsilon, \delta > 0$ sufficiently small constants with $\delta$ at least a sufficiently large multiple of $\epsilon$. Let $T$ be an $\epsilon$-corrupted version of $S$. Suppose that $\mathrm{Cov}[T]$ has largest eigenvalue $1 + \lambda > 1 + 8\delta^2/\epsilon$. Then there exists a computationally efficient algorithm that, on input $\epsilon, \delta, T$, computes a non-zero function $f: T \to \mathbb{R}_+$ satisfying $\sum_{x \in T} f(x) \ge 2 \sum_{x \in T \cap S} f(x)$.

By combining Theorem 2.12 and Proposition 2.13, we obtain a filter-based algorithm establishing Theorem 2.7.

Proof of Proposition 2.13. The algorithm to construct $f$ is the following: We start by computing the sample mean $\mu_T$ and the top (unit) eigenvector $v$ of $\mathrm{Cov}[T]$. For $x \in T$, we let $g(x) = (v \cdot (x - \mu_T))^2$. Let $L$ be the set of $\epsilon \cdot |T|$ elements of $T$ on which $g(x)$ is largest. We define $f$ by $f(x) = 0$ for $x \notin L$ and $f(x) = g(x)$ for $x \in L$.
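A minimal numpy sketch of this construction follows (the naming is ours); it returns the scores $f$ for the rows of $T$:

import numpy as np

def universal_filter_scores(T, eps):
    # Scores f from Proposition 2.13: f(x) = (v.(x - mu_T))^2 on the eps*|T|
    # points with the largest such values, and 0 elsewhere, where v is the
    # top eigenvector of Cov[T].
    mu = T.mean(axis=0)
    cov = np.cov(T, rowvar=False, bias=True)
    _, eigvecs = np.linalg.eigh(cov)
    v = eigvecs[:, -1]                            # top (unit) eigenvector
    g = ((T - mu) @ v) ** 2
    f = np.zeros(len(T))
    L = np.argsort(g)[-max(1, int(eps * len(T))):]  # indices of largest g
    f[L] = g[L]
    return f

This routine can serve as the compute_f oracle in the Randomized Filtering sketch of Section 2.4.2, e.g., via lambda T: universal_filter_scores(T, eps).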
Our basic plan of attack is as follows: First, we note that the sum of $g(x)$ over $x \in T$ is $|T|$ times the variance of $v \cdot Z$, $Z \sim_u T$, which is substantially larger than the sum of $g(x)$ over $S$, which is approximately $|S|$ times the variance of $v \cdot Z$, $Z \sim_u S$. Therefore, the sum of $g(x)$ over the $\epsilon|S|$ elements of $T \setminus S$ must be quite large. In fact, using the stability condition, we can show that the latter quantity must be larger than the sum of the largest $\epsilon|S|$ values of $g(x)$ over $x \in S$. However, since $|T \setminus S| \le |L|$, we have that $\sum_{x \in T} f(x) = \sum_{x \in L} g(x) \ge \sum_{x \in T \setminus S} g(x) \ge 2 \sum_{x \in S} f(x)$. We now proceed with the detailed analysis.

First, note that
\[ \sum_{x \in T} g(x) = |T|\, \mathrm{Var}[v \cdot T] = |T|(1 + \lambda). \]
Moreover, for any $S' \subseteq S$ with $|S'| \ge (1 - 2\epsilon)|S|$, we have that
\[ \sum_{x \in S'} g(x) = |S'| \big( \mathrm{Var}[v \cdot S'] + (v \cdot (\mu_T - \mu_{S'}))^2 \big). \tag{4} \]
By the second stability condition, we have that $|\mathrm{Var}[v \cdot S'] - 1| \le \delta^2/\epsilon$. Furthermore, the stability condition and Lemma 2.4 give
\[ \|\mu_T - \mu_{S'}\|_2 \le \|\mu_T - \mu_X\|_2 + \|\mu_X - \mu_{S'}\|_2 = O(\delta + \sqrt{\epsilon\lambda}). \]
Since $\lambda \ge 8\delta^2/\epsilon$, combining the above gives that $\sum_{x \in T \setminus S} g(x) \ge (2/3)|S|\lambda$. Moreover, since $|L| \ge |T \setminus S|$ and since $g$ takes its largest values on points $x \in L$, we have that
\[ \sum_{x \in T} f(x) = \sum_{x \in L} g(x) \ge \sum_{x \in T \setminus S} g(x) \ge (16/3)\, |S|\, \delta^2/\epsilon. \]
Comparing the results of Equation (4) with $S' = S$ and $S' = S \setminus L$, we find that
\[ \sum_{x \in S \cap T} f(x) = \sum_{x \in S \cap L} g(x) = \sum_{x \in S} g(x) - \sum_{x \in S \setminus L} g(x) = |S|\big(1 \pm \delta^2/\epsilon + O(\delta^2 + \epsilon\lambda)\big) - |S \setminus L|\big(1 \pm \delta^2/\epsilon + O(\delta^2 + \epsilon\lambda)\big) \le 2|S|\delta^2/\epsilon + |S|\, O(\delta^2 + \epsilon\lambda). \]
The latter quantity is at most $(1/2) \sum_{x \in T} f(x)$ when $\delta$ and $\epsilon/\delta$ are sufficiently small constants. This completes the proof of Proposition 2.13.

2.5 Bibliographic Notes

The convex programming and filtering methods described in this article appeared in [21, 22]. Here we gave a simplified and unified presentation of these techniques. The idea of removing outliers by projecting on the top eigenvector of the empirical covariance goes back to [53], who used it in the context of learning linear separators with malicious noise. That work [53] used a "hard" filtering step which only removes outliers and consequently leads to errors that scale logarithmically with the dimension. Subsequently, the work of [1] employed a soft outlier-removal step in the same supervised setting as [53] to obtain improved bounds for that problem. It should be noted that the soft-outlier method of [1] is similarly insufficient to obtain dimension-independent error bounds in the unsupervised setting.

The work of [57] developed a recursive dimension-halving technique for robust mean estimation. Their technique leads to error $O(\epsilon \sqrt{\log(1/\epsilon)} \sqrt{\log d})$ for Gaussian robust mean estimation in Huber's contamination model. In short, the algorithm of [57] begins by removing any extreme outliers from the input set of $\epsilon$-corrupted samples. This ensures that, after this basic outlier-removal step, the empirical covariance matrix has trace $d(1 + \widetilde{O}(\epsilon))$, which implies that the $d/2$ smallest eigenvalues are all at most $1 + \widetilde{O}(\epsilon)$. This allows [57] to show, using techniques akin to Lemma 2.4, that the projections of the true mean and the empirical mean onto the subspace spanned by the corresponding (small) eigenvectors are close. The [57] algorithm then uses this approximation for this projection of the mean, projects the remaining points onto the orthogonal subspace, and recursively finds the mean of the other projection.

In addition to robust mean and covariance estimation, [21, 57] gave robust learning algorithms for various other statistical tasks, including robust density estimation for mixtures of spherical Gaussians and binary product distributions, robust independent component analysis (ICA), and robust singular value decomposition (SVD).
Building on the robust mean estimation techniques of [21], [18] gave robust parameter estimation algorithms for Bayesian networks with known graph structure. Another extension of these results was found by [69], who gave an efficient algorithm for robust mean estimation with respect to all $\ell_p$-norms.

The algorithmic approaches described in this section robustly estimate the mean of a spherical Gaussian within error $O(\epsilon \sqrt{\log(1/\epsilon)})$ in the strong contamination model of Definition 1.1. A more sophisticated filtering technique that achieves the optimal error of $O(\epsilon)$ in the additive contamination model was developed in [23]. Very roughly, this algorithm proceeds by using a novel filter to remove bad points if the empirical covariance matrix has many eigenvalues of size $1 + \Omega(\epsilon)$. Otherwise, the algorithm uses the empirical mean to estimate the mean on the space spanned by the small eigenvectors, and then uses brute force to estimate the projection onto the few principal eigenvectors. For the strong contamination model, it was shown in [26] that any improvement on the $O(\epsilon \sqrt{\log(1/\epsilon)})$ error requires super-polynomial time in the Statistical Query model.

Finally, we note that ideas from [21] have led to proof-of-concept improvements in the analysis of genetic data [22] and in adversarial machine learning [24, 72].

3 Beyond Robust Mean Estimation

In this section, we provide an overview of the ideas behind recently developed robust estimators for more general statistical tasks. This section follows the structure of a STOC 2019 tutorial by the authors [29].

3.1 Robust Stochastic Optimization

It turns out that the algorithmic techniques for high-dimensional robust mean estimation described in the previous section can be viewed as useful primitives for robustly solving a range of machine learning problems. More specifically, we will argue in this section that any efficient robust mean estimator can be used (in an essentially black-box manner) to obtain efficient robust algorithms for machine learning tasks that can be expressed as stochastic optimization problems.

In a stochastic optimization problem, we are given samples from an unknown distribution $\mathcal{F}$ over functions $f: \mathbb{R}^d \to \mathbb{R}$, and our goal is to find an approximate minimizer of the function $F(w) = \mathbb{E}_{f \sim \mathcal{F}}[f(w)]$ over $\mathcal{W} \subseteq \mathbb{R}^d$. This framework encapsulates a number of well-studied machine learning problems. First, we note that the problem of mean estimation can be expressed in this form, by observing that the mean of a distribution $X$ is the value $\mu = \arg\min_{w \in \mathbb{R}^d} \mathbb{E}_{x \sim X}[\|w - x\|_2^2]$. That is, given a sample $x \sim X$, the distribution $\mathcal{F}$ over functions $f_x(w) = \|w - x\|_2^2$ turns the task of mean estimation into a stochastic optimization problem. A more interesting example is the problem of least-squares linear regression: Given a distribution $\mathcal{D}$ over pairs $(x, y)$, where $x \in \mathbb{R}^d$ and $y \in \mathbb{R}$, we want to find a vector $w \in \mathbb{R}^d$ minimizing $\mathbb{E}_{(x,y) \sim \mathcal{D}}[(w \cdot x - y)^2]$. Similarly, linear regression fits in the stochastic optimization framework by defining the distribution $\mathcal{F}$ over functions $f_{(x,y)}(w) = (w \cdot x - y)^2$, where $(x, y) \sim \mathcal{D}$. Similar formulations exist for numerous other machine learning problems, including $L_1$-regression, logistic regression, support vector machines, and generalized linear models (see, e.g., [24]).
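As an illustration of the framework, here is a minimal sketch of the per-sample losses and gradients for the two examples above (the function names are ours):

import numpy as np

# Mean estimation: f_x(w) = ||w - x||_2^2, minimized in expectation at the mean.
def mean_loss(w, x):
    return np.sum((w - x) ** 2)

def mean_grad(w, x):
    return 2.0 * (w - x)

# Least-squares regression: f_{(x,y)}(w) = (w.x - y)^2.
def regression_loss(w, x, y):
    return (w @ x - y) ** 2

def regression_grad(w, x, y):
    return 2.0 * (w @ x - y) * x

Averaging mean_grad over clean samples gives $2(w - \widehat{\mu})$, so (robustly) driving the average gradient to zero is exactly (robust) mean estimation.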
Finally , we note that the sto c hastic optimization framew ork encompasses non-conv ex p roblems as we ll. F or example, th e general and c hallenging problem of training a neural n et can b e expressed in this f r amew ork, where w represents some high-dimensional ve ctor of parameters classifying the n et and eac h f u nction f ( w ) quantifies ho w w ell that particular net classifies a given data p oin t. Before w e discuss r obust sto chastic optimiza tion, we mak e a few basic remarks regarding the non-robust s etting. W e start by n oting that, without an y assumptions, the p roblem of optimizing the function F ( w ) = E f ∼F [ f ( w )], even app ro ximately , is NP-hard. On the other hand, in man y situations, it su ffices to find an appr o ximate critical p oint of F , i.e., a p oin t w suc h that k∇ F ( w ) k 2 is small. F or example, if F is conv ex (whic h holds if eac h f ∼ F is conv ex), an app ro ximate critical p oin t is also an appro ximate global min im u m . F or several structured non-conv ex problems, an app ro ximate critical p oin t is also considered a satisfactory solution. On inp ut a set of clean samples, i.e., an i.i.d. set of functions f 1 , . . . , f n ∼ F , w e can efficien tly find an approximat e critical p oint of F u sing (pro j ected) gradien t descen t. F or more structured problems, e.g., linear regression, faster and more direct metho d s ma y b e a v ailable. In the robust setting, w e hav e access to an ǫ -corrup ted training set of functions f 1 , . . . , f n from F . Unfortun ately , even a single corrupted s amp le can completely compromise the guarant ees of gradien t descen t. The robust v er s ion of this p r oblem wa s first studied by [14], who consid ered th e problem in the case where a m a jorit y of the datap oint s are outliers. The v anilla outlier-robu s t setting, wh ere ǫ < 1 / 2, w as fi rst studied in tw o concurrent works [65, 24]. The main int uition present in b oth these wo r ks is that robustly estimating the gradien t of the ob jectiv e f unction can b e view ed as a robust mean estimation problem. As a result, if an efficien t r obust gradien t oracle is a v ailable, w e can “sim u late” gradient descent and compute an appr o ximate critical p oint of F . Note that this m etho d employs a robu st mean estimation algorithm at eve ry step of gradient descen t. The w ork of [24] also pr op osed an alternativ e approac h, whic h turn s out to b e muc h faster in practice. Instead of using a r obust gradien t estimator as a blac k-b o x, one uses any appro ximate empirical risk minimizer (ERM) in conjun ction with the fi ltering algorithm for robust mean es- timation of the pr evious sect ion. This metho d only requires blac k -b o x access to an appro ximate ERM and calls th e filtering r ou tin e only when the ERM reac hes an appr o ximate critical p oint. The correctness of this algorithm relies on structural pr op erties of the fi ltering metho d . Roughly sp eaking, the main idea is as follo w s: S u pp ose th at we ha ve reac hed an approximate critical p oin t w of the emp irical risk and at this stage we apply a fi ltering step. By th e guarante es of the fi lter, w e know that w e are in one of t w o cases: either the filtering step remov es more outliers (corru pted functions) than inliers (in exp ectation), or it ce rtifies that the gradien t of F at w is close to the gradien t of th e empirical risk at w . In the former case, we mak e p rogress as w e pro d uce a “cleaner” set of f unctions. 
In the latter case, since $w$ is an approximate critical point of the empirical risk, our certificate implies that $w$ is also an approximate critical point of $F$, as desired.

For the above robust optimization approaches to be computationally efficient, we require some assumptions on the distribution of the clean samples (functions). With no such assumption, much of the value of $F$ could be determined by a small fraction of the $f$'s, in such a way that it is computationally intractable to determine whether these values are due to corruptions or not. A natural condition used in [14, 24] is that, for every $w$, the covariance matrix of the gradient distribution, $\mathrm{Cov}_{f \sim \mathcal{F}}[\nabla f(w)]$, is bounded from above. Under this condition, if one has enough samples from $\mathcal{F}$ (so that the empirical covariance of the clean samples is bounded for all $w$), one can use the filtering algorithm of Section 2.4 to robustly estimate $\nabla F(w)$ to $\ell_2$-error $O(\sqrt{\epsilon})$ for any $w$. Using either of the two afore-described approaches, one can find a point $w$ such that $\|\nabla F(w)\|_2 = O(\sqrt{\epsilon})$ [24]. We note that [65] uses the robust mean estimator of [57], which requires somewhat stronger distributional assumptions and incurs error scaling logarithmically with the dimension.

In summary, we have described two meta-algorithms for robust stochastic optimization. An interesting open problem is to obtain a faster algorithm for this general task, in particular one that has information-theoretically optimal sample complexity and uses a minimum number of queries to an ERM oracle.
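A minimal sketch of the first meta-algorithm follows; robust_mean is any robust mean estimation routine (e.g., the filtering algorithm of Section 2.4), grads maps $w$ to the $n$ per-sample gradients, and the fixed step size is our simplification:

import numpy as np

def robust_gradient_descent(grads, w0, robust_mean, steps=100, lr=0.1):
    # grads       : function w -> (n, d) array of per-sample gradients,
    #               an eps-fraction of whose rows may be adversarial.
    # robust_mean : robust mean estimator applied to the gradients.
    w = w0
    for _ in range(steps):
        g = robust_mean(grads(w))   # robust estimate of grad F(w)
        w = w - lr * g
    return w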
Robust Linear Regression. As already mentioned, if $F$ is convex, the approximate critical point computed via robust stochastic optimization translates to an approximate global minimizer. In the following paragraphs, we describe how to obtain an approximate global minimum for the fundamental task of linear regression. Several other applications are given in [24, 65]. We will focus on the following standard setup: We are given a collection of labeled examples $(x^{(i)}, y^{(i)})$, where $x^{(i)}$ is drawn from a distribution $X$ on $\mathbb{R}^d$ and $y^{(i)} = \beta \cdot x^{(i)} + e^{(i)}$, where $e^{(i)}$ is drawn from a distribution $e$ that is independent of $X$ and has mean $0$ and variance $1$. The objective is to find a vector $w^* \in \mathbb{R}^d$ that approximately minimizes the function $F(w) = \mathbb{E}_{(x,y)}[f_{(x,y)}(w)]$, where $f_{(x,y)}(w) = (w \cdot x - y)^2$.

Recall that the robust stochastic optimization approach described in the previous paragraphs relies on the assumption that the covariance matrix of the gradients, $\mathrm{Cov}_{f \sim \mathcal{F}}[\nabla f(w)]$, is bounded. This condition translates to certain necessary conditions on the distribution $X$. To better understand these conditions, we first consider the gradient of $f$. We have that
\[ \nabla_w f_{(x,y)}(w) = 2(w \cdot x - y)\, x = 2\big((w - \beta) \cdot x - e\big)\, x. \]
Using this expression, it is not hard to see that the variance of the gradient $\nabla_w f_{(x,y)}(w)$ in the direction $v$ is equal to
\[ 4\, \mathbb{E}_{x \sim X}[(v \cdot x)^2] + 4\, \mathbb{E}_{x \sim X}\big[(v \cdot x)^2 ((w - \beta) \cdot x)^2\big]. \]
Note that the first term above is bounded as long as $X$ has bounded covariance. To bound the second, we will need to assume that $X$ has bounded fourth moments. In particular, if we assume that $X$ satisfies $\mathbb{E}_{x \sim X}[(v \cdot x)^4] = O(1)$ for all unit vectors $v$, the covariance of the gradients has maximum eigenvalue bounded by $O(1 + \|w - \beta\|_2^2)$.

For simplicity, let us assume that we know a priori a ball of constant $\ell_2$-radius containing $\beta$. Then, using our robust stochastic optimization routine, we can efficiently compute an approximate critical point with gradient of $\ell_2$-norm $O(\sqrt{\epsilon})$. We now show that such an approximate critical point is also an approximate global minimum. Note that the gradient of $F$ at $w$ equals $\mathbb{E}[2((w - \beta) \cdot x - e)\, x] = 2\, \mathbb{E}[x x^T](w - \beta)$. For convenience, we will additionally assume that $\mathbb{E}[x x^T]$ is bounded from below by a constant multiple of the identity matrix. This means that for any $w \in \mathbb{R}^d$ we have $\|w - \beta\|_2 = O(\|\nabla_w F(w)\|_2)$. Therefore, an approximate critical point of $F$ is equivalent to a good approximation of $\beta$, which is the global minimizer of $F$.

The above immediately gives an approximate global minimizer with $\ell_2$-error $O(\sqrt{\epsilon})$, assuming we started with a constant-radius ball containing $\beta$. It is not difficult to handle a very rough approximation to $\beta$ with error at most $R$. A simple (but somewhat inefficient) method is to search for $w$ within a ball of radius $R$, in which the covariance of the gradients is bounded by $O(1 + R^2)$. For $R > 1$, this guarantees a new point which approximates $\beta$ within $\ell_2$-error $O(\sqrt{\epsilon}\, R)$. Iterating this procedure, we can achieve a final error of $O(\sqrt{\epsilon})$. A detailed and more efficient procedure is described in [24].

Finally, it should be noted that the problem of robust linear regression has been extensively studied in recent years. Using the Sums-of-Squares hierarchy, [52] developed computationally efficient algorithms for robust linear regression. For the case where the covariates follow a Gaussian distribution, [31] obtained computationally efficient algorithms with near-minimax sample complexity and error guarantee. There has also been recent related work [8, 7, 71], which proposed efficient algorithms for "robust" linear regression in a restrictive corruption model that only allows adversarial corruptions to the responses, but not to the covariates.

3.2 Robust Covariance Estimation

The algorithmic techniques for high-dimensional robust mean estimation described in Section 2 can be generalized to robustly estimate higher moments under appropriate assumptions. In this section, we describe how to adapt the filter technique to robustly estimate the covariance matrix $\Sigma$ of a distribution $X$ satisfying appropriate moment conditions.

First note that we can assume without loss of generality that $X$ is centered. Specifically, by considering the differences between pairs of $\epsilon$-corrupted samples from $X$, we have access to a set of $2\epsilon$-corrupted samples from a distribution $X'$ with mean $0$ and covariance matrix $2\Sigma$.

The basic idea underlying this section is fairly simple: Robustly estimating the covariance matrix of a centered random variable $X$ is essentially equivalent to robustly estimating the mean of the random variable $Y = X X^T$. That is, the problem of robust covariance estimation can be "reduced" to the problem of robust mean estimation of a more complicated random variable.
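Schematically, the reduction looks as follows. This is a sketch only: robust_mean is a generic robust mean estimation oracle for vectors, and the moment conditions on $Y$ needed for that oracle to actually succeed are discussed next.

import numpy as np

def covariance_via_robust_mean(T, robust_mean, rng=np.random.default_rng(0)):
    # Estimate Cov[X] by robustly estimating the mean of Y = X'X'^T, where
    # X' is the difference of two independent samples, so that X' is
    # centered with covariance 2*Cov[X].
    idx = rng.permutation(len(T))
    m = len(T) // 2
    diffs = T[idx[:m]] - T[idx[m:2 * m]]          # pairwise differences X'
    Y = np.einsum('ni,nj->nij', diffs, diffs)     # outer products X'X'^T
    flat = Y.reshape(m, -1)                       # flatten to d^2-dim vectors
    sigma_flat = robust_mean(flat) / 2.0          # E[X'X'^T] = 2*Cov[X]
    d = T.shape[1]
    return sigma_flat.reshape(d, d)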
If the random variable $Y$ satisfies appropriate moment conditions (or tail bounds), we can hope to apply the techniques of the previous section. This is the approach taken by [21, 57]. At a very high level, it is possible to design a robust covariance estimation algorithm using the filtering method [21], where each filtering step removes points based on the empirical fourth moment tensor.

Formalizing the above approach requires some care, for the following reason: The robust mean estimation techniques for a distribution $Y$ require an a priori upper bound on its covariance $\mathrm{Cov}[Y]$. Unfortunately, such bounds do not hold for our random variable $Y = X X^T$, even if $X$ is a Gaussian distribution. To handle this issue, we need to use additional structural properties of $X$. Specifically, if $X \sim \mathcal{N}(0, \Sigma)$, we can leverage the fact that the covariance of $Y$ can be expressed as a function of the covariance of $X$. An upper bound on $\Sigma$ will give us an upper bound on the covariance of $Y$, which can then be used to obtain a better approximation of $\Sigma$. Applying this idea iteratively will allow us to bootstrap better and better approximations, until we end up with an approximation to $\Sigma$ with error close to the information-theoretic optimum. This method is proposed and analyzed (with slightly different terminology) in [21, 23].

Before we proceed with a detailed outline of the method, we should clarify the metric we will use to approximate the covariance matrix. Recall that for mean estimation we used the $\ell_2$-norm between vectors. A natural choice for the covariance would be the Frobenius norm, i.e., we would like to find an estimate $\widehat{\Sigma}$ of the true matrix $\Sigma$ such that $\|\widehat{\Sigma} - \Sigma\|_F$ is small. Here we will instead use the Mahalanobis distance, which is affine invariant and intuitively corresponds to multiplicative approximation. Specifically, we want to compute an estimate $\widehat{\Sigma}$ such that $\|\Sigma^{-1/2} \widehat{\Sigma} \Sigma^{-1/2} - I\|_F$ is small. A basic fact motivating the use of this metric is that the total variation distance between two Gaussian distributions $\mathcal{N}(0, \Sigma)$ and $\mathcal{N}(0, \widehat{\Sigma})$ is bounded from above by $O(\|\Sigma^{-1/2} \widehat{\Sigma} \Sigma^{-1/2} - I\|_F)$.

For the remainder of this section, we will assume for concreteness that $X \sim \mathcal{N}(0, \Sigma)$, and we will describe an efficient algorithm for robust covariance estimation that achieves Mahalanobis distance $O(\epsilon \log(1/\epsilon))$, which is within a logarithmic factor of the information-theoretic optimum of $\Theta(\epsilon)$. We note that the same approach gives error $O(\sqrt{\epsilon})$ for any distribution whose fourth moment tensor is appropriately bounded by a function of the covariance.

To get started, we first need to understand the relationship between $\boldsymbol{\Sigma} := \mathrm{Cov}[Y]$ (viewing $Y$ as a $d^2$-dimensional random vector) and $\Sigma$. To that end, let us denote by $A^{\mathrm{flat}}$ the canonical flattening of a matrix $A$ into a vector. With this notation, it is not hard to verify that
\[ (A^{\mathrm{flat}})^T\, \boldsymbol{\Sigma}\, A^{\mathrm{flat}} \;=\; 2\, \Big\| \Sigma^{1/2}\, \frac{A + A^T}{2}\, \Sigma^{1/2} \Big\|_F^2. \tag{5} \]
This formula essentially expresses $\boldsymbol{\Sigma}$ in terms of the quadratic form that it defines. An important consequence of Equation (5) is the following: Given covariance matrices $\Sigma, \Sigma'$ and the corresponding matrices $\boldsymbol{\Sigma}, \boldsymbol{\Sigma}'$, we have that if $\Sigma \preceq \Sigma'$, then $\boldsymbol{\Sigma} \preceq \boldsymbol{\Sigma}'$. In other words, if we have an upper bound on the true covariance $\Sigma$, this gives us an upper bound on the covariance of $Y$.
Specifically, if $\Sigma \preceq \Sigma_0$ for some matrix $\Sigma_0$, we have that $\mathrm{Cov}[\Sigma_0^{-1/2} Y \Sigma_0^{-1/2}] = O(I)$. Using our robust mean estimator for random variables with bounded covariance will allow us to approximate $\mathbb{E}[\Sigma_0^{-1/2} Y \Sigma_0^{-1/2}] = \Sigma_0^{-1/2} \Sigma \Sigma_0^{-1/2}$ to error $O(\sqrt{\epsilon})$ in Frobenius norm. This gives us an estimate $\widehat{\Sigma}$ such that $\|\Sigma_0^{-1/2} (\widehat{\Sigma} - \Sigma) \Sigma_0^{-1/2}\|_F = O(\sqrt{\epsilon})$. This means that, given an upper bound $\Sigma_0$ on $\Sigma$, we can obtain a better one. To obtain an initial upper bound $\Sigma_0$, we note that twice the sample covariance of a large set of samples from $X$ provides an upper bound on the true covariance of $X$ even with corruptions: although corruptions can substantially increase the empirical covariance of $X$, they cannot decrease it by much. Starting from $\Sigma_0$, we obtain a new approximation $\Sigma_1 = \Sigma + O(\sqrt{\epsilon}) \Sigma_0$; from this we can obtain an improved approximation $\Sigma_2 = \Sigma + O(\sqrt{\epsilon}) \Sigma_1$, and so forth. Iterating this technique yields a matrix $\widehat{\Sigma}$ such that $\|\Sigma^{-1/2} (\widehat{\Sigma} - \Sigma) \Sigma^{-1/2}\|_F = O(\sqrt{\epsilon})$.

The error guarantee of $O(\sqrt{\epsilon})$ achieved above is already fairly accurate. Once we have such a good approximation to the true covariance $\Sigma$, we can improve the error guarantee even further by using stronger tail bounds for the Gaussian distribution. In the following, we will assume a rough scale for $\Sigma$, namely that $I \preceq \Sigma \preceq 2I$. Suppose that, for some $\delta > \epsilon$, we have a matrix $\Sigma_0$ satisfying $\|\Sigma_0 - \Sigma\|_F \le \delta$. We can then use Equation (5) to approximate $\boldsymbol{\Sigma}$ from $\Sigma_0$. It is not hard to see that if we know $\Sigma$ within Frobenius norm $\delta$, this allows us to compute $\boldsymbol{\Sigma}$ within spectral norm $O(\delta)$. Thus, after applying an appropriate linear transformation to $Y$, we obtain a random variable $Y'$ whose covariance is within $O(\delta)$ of the identity matrix in spectral norm. We claim that $Y'$ has strong tail bounds. This follows from standard tail bounds for degree-$2$ polynomials over Gaussian random variables. Specifically, for any unit vector $v$, $v \cdot Y'$ is a quadratic polynomial in $X$ with variance $O(1)$. Standard results imply that $v \cdot Y'$ has exponential tails. From this we can show that any sufficiently large number of samples from $Y'$ will be $\big(O(\sqrt{\epsilon\delta} + \epsilon \log(1/\epsilon)),\, \epsilon\big)$-stable with high probability. In summary, if we know $\Sigma$ to Frobenius error $\delta$, we can use robust mean estimation techniques to learn it to Mahalanobis error $O(\sqrt{\epsilon\delta} + \epsilon \log(1/\epsilon))$. Iterating this, we can obtain a final Mahalanobis error of $O(\epsilon \log(1/\epsilon))$, as desired.

We conclude by noting that the aforementioned can be used to robustly learn a Gaussian distribution with unknown mean and covariance as follows: First, one learns the covariance as above (by reducing to the mean-$0$ case). Then one learns the mean of a Gaussian with an (approximately) known covariance matrix. In summary, one obtains a hypothesis Gaussian $\mathcal{N}(\widehat{\mu}, \widehat{\Sigma})$ within total variation distance $O(\epsilon \log(1/\epsilon))$ from the uncorrupted distribution $\mathcal{N}(\mu, \Sigma)$.
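The bootstrapping loop above can be sketched as follows; centered samples, a generic robust_mean oracle, and the constant c are our placeholder assumptions, and the current upper bound is assumed positive definite:

import numpy as np

def bootstrap_covariance(T, robust_mean, eps, rounds=5, c=1.0):
    # Maintain an upper bound sigma0 on Sigma, whiten by it, robustly
    # estimate the mean of the whitened outer products, un-whiten, and
    # tighten via sigma_{k+1} = sigma_hat + O(sqrt(eps)) * sigma_k.
    n, d = T.shape
    sigma0 = 2.0 * np.cov(T, rowvar=False, bias=True)    # initial upper bound
    for _ in range(rounds):
        w, V = np.linalg.eigh(sigma0)
        root = V @ np.diag(np.sqrt(w)) @ V.T             # sigma0^{1/2}
        inv_root = V @ np.diag(1.0 / np.sqrt(w)) @ V.T   # sigma0^{-1/2}
        Z = T @ inv_root                                 # whitened samples
        Y = np.einsum('ni,nj->nij', Z, Z).reshape(n, -1)
        M = robust_mean(Y).reshape(d, d)     # ~ sigma0^{-1/2} Sigma sigma0^{-1/2}
        sigma_hat = root @ M @ root                      # refined estimate
        sigma0 = sigma_hat + c * np.sqrt(eps) * sigma0   # tightened upper bound
    return sigma_hat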
3.3 Robust Sparse Estimation Tasks

The task of leveraging sparsity in high-dimensional parameter estimation is a well-studied problem in statistics. In the context of robust estimation, this problem was first considered in [2], which adapted the unknown convex programming method of [21] described in this article. Here we focus on robust sparse mean estimation and describe two algorithms: the convex programming algorithm of [2] and a novel filtering method [30] that only uses spectral operations.

Formally, given $\epsilon$-corrupted samples from $\mathcal{N}(\mu, I)$, where the mean $\mu$ is unknown and assumed to be $k$-sparse, i.e., supported on an unknown set of $k$ coordinates, we would like to approximate $\mu$ in $\ell_2$-distance. Without corruptions, this problem is easy: We draw $O(k \log(d/k)/\epsilon^2)$ samples and output the empirical mean truncated to its $k$ largest-magnitude entries. The goal is to obtain similar sample complexity and error guarantees in the robust setting.

At a high level, we note that the truncated sample mean should be accurate as long as there is no $k$-sparse direction in which the error between the true mean and the sample mean is large. This condition can be certified as long as we know that the sample variance of $v \cdot X$ is close to $1$ for all unit $k$-sparse vectors $v$. This would in turn allow us to create a filter-based algorithm for $k$-sparse robust mean estimation that uses only $O(k \log(d/k)/\epsilon^2)$ samples. While this idea naturally leads to a sample-optimal robust algorithm for the problem, it is computationally infeasible. This holds because the problem of determining whether there is a $k$-sparse direction with large variance (sparse PCA) is known to be computationally hard, even under natural distributional assumptions [6]. To circumvent this hardness result, [2] considers a convex relaxation of sparse PCA, which leads to a polynomial-time version of the aforementioned algorithm that requires $O(k^2 \log(d/k)/\epsilon^2)$ samples. Moreover, there is evidence [26], in the form of a lower bound in the Statistical Query model (a restricted but powerful computational model), that this quadratic blow-up in the sample complexity is necessary for polynomial-time algorithms. Note that although the $O(k^2 \log(d/k)/\epsilon^2)$ sample complexity is worse than the information-theoretic optimum of $\Theta(k \log(d/k)/\epsilon^2)$, for small $k$ it is still substantially better than the $\Omega(d/\epsilon^2)$ sample size required by dense methods.

The convex programming algorithm of [2] works as follows: Let $\widehat{\Sigma}$ be the empirical covariance matrix. If there is a $k$-sparse unit vector $v$ with $v^T \widehat{\Sigma} v$ large, we have that $\mathrm{tr}(\widehat{\Sigma} v v^T)$ is large. Here $v v^T$ is a positive semi-definite, trace-$1$ matrix whose entries have $\ell_2$-norm at most $1$ and $\ell_1$-norm at most $k$ (the latter following from the sparsity of $v$). The work of [2] considers the following convex relaxation of the problem of finding the sparse vector $v$: Find a positive semi-definite, trace-$1$ matrix $H$, whose entries have $\ell_2$-norm at most $1$ and $\ell_1$-norm at most $k$, so that $\mathrm{tr}(\widehat{\Sigma} H)$ is as large as possible. If the optimal solution to this convex relaxation is small, we have certified that $\widehat{\Sigma}$ has no sparse directions of large variance, and consequently we have certified the accuracy of the truncated empirical mean. On the other hand, if the optimal value of the convex relaxation is large, we have found a "sparse direction" of large variance and can use this to refine our set of samples. In particular, [2] use the ellipsoid method to find a subset of the samples so that, for all such $H$, $\mathrm{tr}(\widehat{\Sigma} H)$ is not too large.
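For reference, the truncated empirical mean used above is essentially a one-liner; in the robust algorithm it is returned only after the convex program has certified that no $k$-sparse direction has large empirical variance:

import numpy as np

def truncated_mean(T, k):
    # Empirical mean truncated to its k largest-magnitude coordinates.
    mu = T.mean(axis=0)
    out = np.zeros_like(mu)
    top = np.argsort(np.abs(mu))[-k:]
    out[top] = mu[top]
    return out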
To bound the sample complexity of this method, one needs to show that, for a sufficiently large set $S$ of i.i.d. samples from $\mathcal{N}(\mu, I)$, the empirical covariance of $S$, $\widehat{\Sigma}_S$, satisfies that the trace $\mathrm{tr}(\widehat{\Sigma}_S H)$ is appropriately small for all such $H$. One can show this as follows: First, it is easy to see that, given a set $S$ of size $\Omega(k^2 \log(d)/\epsilon^2)$, every entry of $\widehat{\Sigma}_S - I$ is $O(\epsilon/k)$ with high probability. If this holds, then we have that
\[ \mathrm{tr}(\widehat{\Sigma}_S H) = \mathrm{tr}(H) + \mathrm{tr}\big((\widehat{\Sigma}_S - I) H\big) = 1 + O(\epsilon/k) \cdot k = 1 + O(\epsilon), \]
where the second equality holds because the entries of $\widehat{\Sigma}_S - I$ have bounded $\ell_\infty$-norm and the entries of $H$ have bounded $\ell_1$-norm. Therefore, with $\Omega(k^2 \log(d)/\epsilon^2)$ samples, this algorithm can be shown to work with high probability.

In subsequent work, [61] gave an iterative filter-based method for robust sparse mean estimation, which avoids the use of the ellipsoid method but still requires multiple solutions to the convex relaxation of sparse PCA in each filtering iteration. Another algorithm for robust sparse mean estimation, proposed by [60], works via iterative trimmed thresholding. While this algorithm seems practically viable in terms of runtime, it can only tolerate a vanishingly small fraction of outliers.

More recently, [30] developed iterative spectral algorithms for robust sparse estimation tasks (including sparse mean estimation and sparse PCA). These algorithms achieve the same error guarantees as [2], while being significantly faster. In the context of robust sparse mean estimation, the algorithm of [30] considers the $O(k^2)$ largest entries of $\widehat{\Sigma} - I$. If the $\ell_2$-norm of these entries is much larger than $\epsilon$, it follows that there is a sparse, degree-$2$ polynomial $p(x)$ whose empirical expectation over the full (corrupted) set of samples is substantially different from its average value over the clean samples. This allows us to build a filter based on the values of the points under $p$. On the other hand, if this is not the case, it means that, for sparse vectors $v$, the contribution to $v^T (\widehat{\Sigma} - I) v$ coming from entries other than the top $O(k^2)$ ones is small. Therefore, $v^T \widehat{\Sigma} v$ could only be large if we could find such a $v$ supported on the rows and columns of these $O(k^2)$ entries. We can then check all vectors $v$ on this limited support. A careful analysis shows that with $\widetilde{O}(k^2 \log(d)/\epsilon^2)$ samples the appropriate concentration conditions hold for every $k^2$-sparse vector and degree-$2$ polynomial. This allows the appropriate filters to work with this sample size.

3.4 List-Decodable Learning

In this article, we have focused on the classical robust statistics setting, where the outliers constitute a minority of the dataset, quantified by the proportion of contamination $\epsilon < 1/2$, and the goal is to obtain estimators whose error scales as a function of $\epsilon$ (and is independent of the dimension $d$). A related setting of interest focuses on the regime where the fraction $\alpha$ of clean data (inliers) is small, i.e., strictly smaller than $1/2$. That is, we observe $n$ samples, an $\alpha$-fraction of which (for some $\alpha < 1/2$) are drawn from the distribution of interest, while the rest are arbitrary. This question was first studied in the context of mean estimation in [14].
A first observation is that, in this regime, it is information-theoretically impossible to estimate the mean with a single hypothesis. Indeed, an adversary can produce $\Omega(1/\alpha)$ clusters of points, each drawn from a copy of the target distribution with a different mean. Even if the algorithm could learn the distribution of the samples exactly, it still would not be able to identify which of the clusters is the correct one. To circumvent this bottleneck, the definition of learning must be somewhat relaxed: the algorithm should be allowed to return a small list of hypotheses with the guarantee that at least one of the hypotheses is close to the true mean. This is the model of list-decodable learning, a learning model introduced by [3]. Another qualitative difference from the small-$\epsilon$ regime is that in list-decodable learning it is often information-theoretically necessary for the error to increase without bound as the fraction of clean data $\alpha$ goes to $0$. In summary, given polynomially many corrupted samples, we would like to output $O(1/\alpha)$ (or $\mathrm{poly}(1/\alpha)$) many hypotheses with the guarantee that (with high probability) at least one hypothesis is within $f(\alpha)$ of the true mean, where $f(\alpha)$ depends on the concentration properties of the distribution in question, but is otherwise information-theoretically best possible.

The information-theoretic limits of list-decodable mean estimation have only been addressed very recently. The work [28] gave nearly tight bounds on the minimum error achievable for list-decodable mean estimation on $\mathbb{R}^d$ (with $\mathrm{poly}(1/\alpha)$ candidate hypotheses) for structured distribution families, including Gaussians and distributions with bounded covariance. In particular, the optimal $\ell_2$-error was determined to be $\Theta(\sqrt{\log(1/\alpha)})$ for spherical Gaussians and $\Theta(\alpha^{-1/2})$ for bounded covariance distributions.

The algorithmic aspects of list-decodable mean estimation have turned out to be much more challenging. For bounded covariance distributions, [14] gave an SDP-based algorithm achieving near-optimal error of $\widetilde{O}(\alpha^{-1/2})$. In the rest of this section, we describe a generalization of the filtering method for list-decodable mean estimation introduced in [28].

List-Decodable Mean Estimation via (Multi-)Filters. The filtering techniques discussed in Section 2.4 can be adapted to work in the list-decodable setting as well. For the remainder of this discussion, we will restrict ourselves to list-decodable mean estimation when the clean data is drawn from an identity covariance Gaussian distribution. At a high level, the adaptation of the filtering method works as follows: If the sample covariance matrix has no large eigenvalues, this certifies that the true mean and the sample mean are not too far apart. However, if a large eigenvalue exists, the construction of a filter is more elaborate. To some extent, this is a necessary difficulty, because the algorithm must return multiple hypotheses. To handle this issue, one needs to construct a multi-filter, which may return several subsets of the original dataset with the guarantee that at least one of them is cleaner than the original. Such a multi-filter was first introduced in [28]. We now proceed with a detailed overview.
The main idea is to employ some type of filtering to obtain a subset $S$ of our original dataset $T$ so that the following conditions are satisfied: (i) the set $S$ contains at least half of the original clean samples in $T$; and (ii) the empirical covariance of $S$ is bounded from above by some small parameter $\sigma > 0$ in every direction. If such a subset $S$ can be efficiently computed, we can certify that the empirical mean of $S$ will be close to the true mean. To show this, let $\mu$, $\mu_G$, and $\mu_S$ denote the true mean of the uncorrupted distribution, the mean of the clean samples in $S$, and the mean of all the samples in $S$, respectively. By condition (i), it is easy to see that $\|\mu - \mu_G\|_2 = O(1)$. On the other hand, the variance of $S$ in the $\mu_G - \mu_S$ direction is at least $(\alpha/2)\|\mu_G - \mu_S\|_2^2$, since an at least $(\alpha/2)$-fraction of clean samples have distance $\|\mu_G - \mu_S\|_2$ from the full mean. Combining this with condition (ii), we have that $\|\mu_G - \mu_S\|_2 = O(\sqrt{\sigma/\alpha})$, and hence by the triangle inequality it follows that $\|\mu - \mu_S\|_2 = O(1 + \sqrt{\sigma/\alpha})$.

We can attempt to use the filtering approach of Section 2.4 to find such a set $S$. If the initial set of samples $T$ has bounded covariance, then its empirical mean works, so we can use $T$ itself as our set $S$. Otherwise, we can project the samples in $T$ onto a direction of large variance in an attempt to remove outliers. Unfortunately, the outlier-removal step cannot be so straightforward in this setting. The filtering steps we have described so far generally work by first deriving an approximation to the true mean and then removing samples that are too far away from it. However, the first step of this procedure inherently fails here, since the outliers constitute the majority of the dataset. In particular, if the initial set of samples $T$ comes in two large but separated clusters, we will not be able to determine which cluster contains the true mean, and thus will not be able to find any points that are definitively outliers. This difficulty is of course necessary, as the list-decodable algorithm is in general required to produce several hypotheses. To circumvent this issue, our algorithm will return multiple subsets of points with the guarantee that at least one of these subsets is cleaner than the original. We will call such an algorithm a multi-filter.

Given a set $S$ of samples containing at least half of the original clean points and a direction in which $S$ has large variance, we want our multi-filter to return a collection of (potentially overlapping) subsets $S_i$ of $S$ with the following properties: First, we need it to be the case that, for at least one of these sets, at most an $\alpha/2$-fraction of the points in $S \setminus S_i$ are clean. Second, we need to ensure that the blowup in the number of such subsets is not too large. One way to achieve this is to require that $\sum_i |S_i|^2 \le |S|^2$. Our overall algorithm will work by maintaining several sets $S_i$ of samples. If any of these sets has too large variance in some direction, we will apply the multi-filter, replacing it with several smaller subsets. We note that, by the first condition above, if we started with a set where at most an $\alpha/2$-fraction of its complement was clean, at least one of the subsets will also have this property.
Therefore, at the end of this procedure, we are guaranteed to end up with at least one $S_i$ satisfying the conditions necessary to give a good approximation to the true mean. Our second condition implies that, at any stage of this algorithm, the sum of the squares of the sizes of the $S_i$'s will never exceed the squared size of our original set of samples. This condition guarantees that the sample complexity and runtime of the overall algorithm are polynomial. In fact, by observing that we only need to return the sample means of those $S_i$'s that contain at least an $\alpha/2$-fraction of our original set of samples, we will have at most $O(1/\alpha^2)$ hypotheses. To reduce the list size further, there is a simple method [28] that shows how to efficiently reduce any set of polynomially many hypotheses (at least one of which is guaranteed to be within $r$ of the true mean) to a list of $O(1/\alpha)$ hypotheses, at least one of which is nearly as close.

The multi-filter step works as follows: Given a direction $v$ in which the variance is too large, there are two ways we can attempt to get an appropriate collection of subsets $S_i$. First, if there is some interval $I$ so that all but an $\alpha/10$-fraction of the samples $x \in S$ have $v \cdot x \in I$, we know almost certainly that $v \cdot \mu \in I$, since $v \cdot \mu$ should have at least an $\alpha/4$-fraction of samples (coming from the clean samples) on either side of it. This implies that samples $x$ whose $v$-projections are at distance much further than $\sqrt{\log(1/\alpha)}$ from the endpoints of $I$ are almost certainly outliers. Using techniques similar to the ones we discussed in Section 2.4, if the variance in the $v$-direction is more than a sufficiently large constant multiple of $|I|^2 + \log(1/\alpha)$, one can find a single subset $S'$ of $S$ so that with high probability almost all points in $S \setminus S'$ are outliers.

It remains to handle the complementary case. If for some $x \in S$ we let $S_1 = \{ y \in S : v \cdot y \ge x - 10\sqrt{\log(1/\alpha)} \}$ and $S_2 = \{ y \in S : v \cdot y \le x + 10\sqrt{\log(1/\alpha)} \}$, it is not hard to see that at least one of $S_1$ or $S_2$ keeps almost all of the clean samples in $S$. In particular, if $v \cdot \mu \ge x$, then $S_1$ will only throw out clean samples that are at least $10\sqrt{\log(1/\alpha)}$ to the left of $\mu$ (i.e., at most an $\alpha^{10}$-fraction). Similarly, if $v \cdot \mu \le x$, then $S_2$ will throw away at most an $\alpha^{10}$-fraction of the clean samples. If, additionally, (i) each of $S \setminus S_1$ and $S \setminus S_2$ contains at least an $\alpha^2$-fraction of the total samples, and (ii) $|S_1|^2 + |S_2|^2 \le |S|^2$, then these subsets will suffice.

The key observation is that if the variance of $S$ in the $v$-direction is more than a sufficiently large multiple of $\log(1/\alpha)$, we can always apply at least one of the two multi-filters described above. This holds because, if we try to apply the second multi-filter for some given value of $x$, we will find that either the fraction of samples with $v \cdot y \le x - 10\sqrt{\log(1/\alpha)}$ is much smaller than the fraction with $v \cdot y \le x$, or the fraction of samples with $v \cdot y \ge x + 10\sqrt{\log(1/\alpha)}$ is much smaller than the fraction with $v \cdot y \ge x$. In either case, the tails of $v \cdot S$ must decay fairly rapidly, at least until they are smaller than $\alpha^2$. Thus, if we cannot apply this filter for any $x$, the contribution to the variance coming from everything except the tails must be small. On the other hand, letting $I$ be the interval excluding the $\alpha^2$-tails on either side, we can apply the first multi-filter if the contribution from the $\alpha^2$-tails is large. In summary, we can always apply one of the two multi-filters unless the variance of $v \cdot S$ is small. Overall, this algorithm outputs $\mathrm{poly}(1/\alpha)$ many hypotheses, at least one of which is within $O(\sqrt{\log(1/\alpha)}/\alpha)$ of the true mean.
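A minimal sketch of the second multi-filter case follows (the naming is ours; t0 plays the role of the threshold $x$ above, and the returned pair is used only when the stated size conditions hold):

import numpy as np

def multifilter_split(S, v, t0, alpha):
    # Split along direction v at threshold t0 into two overlapping subsets;
    # at least one of them keeps almost all clean samples.
    t = 10.0 * np.sqrt(np.log(1.0 / alpha))
    proj = S @ v
    S1 = S[proj >= t0 - t]
    S2 = S[proj <= t0 + t]
    n = len(S)
    # Conditions (i) and (ii) from the text: each piece removes at least an
    # alpha^2-fraction of the points, and the squared sizes do not blow up.
    enough_removed = min(n - len(S1), n - len(S2)) >= alpha**2 * n
    small_blowup = len(S1)**2 + len(S2)**2 <= n**2
    return (S1, S2) if (enough_removed and small_blowup) else None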
3.5 Robust Estimation Using High-Degree Moments

The algorithms presented so far robustly estimate the mean of high-dimensional distributions by leveraging structural information about their covariance matrix. The robust covariance estimation algorithm of Section 3.2 uses structural information about the fourth moment tensor, but also fits in this framework, as it works by robustly estimating the mean of the random variable $X X^T$. It is natural to ask whether (and to what extent) one can exploit structural information about higher-degree moments of the uncorrupted distribution to robustly estimate its parameters.

For the basic case where the uncorrupted distribution is a Gaussian, algorithmically exploiting higher-degree moment information to obtain robust estimators with information-theoretically near-optimal accuracy turns out to be manageable. For a concrete example, we focus here on the problem of robust mean estimation for $\mathcal{N}(\mu, I)$. Recall that the convex programming and filtering algorithms of Section 2 achieve $\ell_2$-error $O(\epsilon \sqrt{\log(1/\epsilon)})$ in the strong contamination model, which is optimal for spherical sub-gaussian distributions but suboptimal (up to the $O(\sqrt{\log(1/\epsilon)})$ multiplicative factor) for spherical Gaussians. The reason for this discrepancy is that the stability condition of Definition 2.1 and the associated filter/convex programming algorithms rely only on the first two moments of the distribution. The work of [26] showed how to leverage higher-order moment information to improve on the $O(\epsilon \sqrt{\log(1/\epsilon)})$ error bound. Specifically, this work gave a generalized filtering algorithm that performs "outlier removal" based on higher-order tensor information (of degree $\Omega(\log^{1/2}(1/\epsilon))$) to robustly estimate the mean of $\mathcal{N}(\mu, I)$ in the strong contamination model within $\ell_2$-error $O(\epsilon)$ in time $O_\epsilon(d^{O(\log^{1/2}(1/\epsilon))})$. (This runtime upper bound is qualitatively matched by an SQ lower bound shown in the same paper; see Section 3.7.) This generalized filtering and its correctness analysis leverage properties of the Gaussian distribution, including a priori knowledge of the higher moments and the concentration of high-degree polynomials.

Subsequently, the work [27] (see also [25] for the case of discrete distributions) used higher-order moments to robustly learn bounded-degree polynomial threshold functions (PTFs) under various distributions. It should be noted that the latter result does not require knowledge of all the higher-degree moments. Specifically, the algorithm of [27] requires appropriate concentration and anti-concentration properties, and (approximate) knowledge of the moments up to degree $2d$, where $d$ is the degree of the underlying PTF.

The more general setting, where we only have upper bounds on the higher-degree moments of the uncorrupted distribution, turns out to be substantially more challenging algorithmically.
In general, upper bounds on the higher moments imply better information-theoretic error upper bounds for robust estimation. For example, for robust mean estimation of distributions with bounded $k$-th central moments, the information-theoretically optimal error is easily seen to be $\Theta(\epsilon^{1-1/k})$. However, it is unclear whether this error bound is attainable algorithmically for $k \ge 4$. Without any additional assumptions on the underlying distribution (beyond the bounded higher moments condition), recent work [43] gave evidence that obtaining error $o(\epsilon^{1/2})$ may be computationally intractable (see Section 3.7).

However, there are circumstances in which higher moment information can be usefully exploited. A number of concurrent works obtained efficient algorithms leveraging higher-degree moments to obtain near-optimal error guarantees [28, 42, 56, 54, 55]. Specifically, the work of [28] gave a higher-moment generalization of the multi-filter technique described in Section 3.4 that leads to a near-optimal error algorithm for list-decodable mean estimation of $\mathcal{N}(\mu, I)$. As an application, [28] obtained an efficient algorithm to learn the parameters of mixtures of spherical Gaussians under near-optimal separation between the components. The works [42, 56, 54, 55] used the Sums-of-Squares meta-algorithm to obtain a number of algorithmic results, including robust mean estimation given a Sums-of-Squares proof certifying bounded central moments [42, 56], learning mixtures of spherical Gaussians [42, 54], and list-decodable mean estimation [54] (under a similar Sums-of-Squares certifiability assumption). More recently, [48, 67] used the Sums-of-Squares method to obtain the first non-trivial algorithms for list-decodable linear regression.

In this section, we describe in more detail two important settings where a fairly sophisticated use of higher-degree moments is required: list-decodable mean estimation (with near-optimal error guarantees) for Gaussian distributions, and robust mean estimation with certifiably bounded central moments. Our presentation mainly focuses on the methodology developed by the authors in [28]. We provide a high-level overview of the Sums-of-Squares approach to these problems and refer the interested reader to the recent survey [66] for a more technical exposition of this approach.

Near-Optimal List-Decodable Gaussian Mean Estimation. We consider the problem of list-decodable mean estimation, assuming the uncorrupted samples are drawn from a spherical Gaussian distribution $\mathcal{N}(\mu, I)$. The techniques we discussed in Section 3.4 show how to compute a list of $O(1/\alpha)$ hypotheses (candidate mean vectors) such that (with high probability over the uncorrupted samples) at least one hypothesis is within $\ell_2$-distance $\widetilde{O}(1/\alpha^{1/2})$ from the true mean $\mu$. We note that this error bound is actually very far from the information-theoretic optimum. In particular, for a point to be a reasonable hypothesis, there must be a cluster consisting of at least an $\alpha$-fraction of the samples that are roughly Gaussian distributed around it. If two such hypotheses are separated by more than a large multiple of $\sqrt{\log(1/\alpha)}$, these clusters cannot overlap on more than an $\alpha$-fraction of their points (by Gaussian tail bounds).
However, by a simple counting argument, this implies that there cannot be more than $O(1/\alpha)$ such hypotheses pairwise separated by $\Omega(\sqrt{\log(1/\alpha)})$. Hence, information-theoretically, there exists a list of $O(1/\alpha)$ many hypotheses such that (with high probability) at least one is within distance $O(\sqrt{\log(1/\alpha)})$ of the true mean. In fact, this upper bound is known to be tight: in [28], it is shown how to construct distributions that are consistent with many plausible true means, each pair separated by $\Omega(\sqrt{\log(1/\alpha)})$.

Unfortunately, while achieving better error is information-theoretically possible, the algorithm discussed in Section 3.4 is not able to achieve it. There are inherent structural reasons for this: the algorithm attempts to find subsets of the samples with small variance, and small variance is not sufficient to imply better than $O(1/\alpha^{1/2})$ error. This is because, if our $\alpha$-fraction of good samples is located at distance $\alpha^{-1/2}$ from the other samples in some direction, the variance in that direction would only be approximately $1$, yet the true mean would differ from the sample mean by about $\alpha^{-1/2}$ in $\ell_2$-norm.

Another way of putting it is that this is a question about concentration. Our existing algorithm manages to produce a set of samples, at least an $\alpha$-fraction of which are good, which also has bounded variance. Since bounded variance ensures some measure of concentration, this guarantees that the mean of the clean samples in this set (which is close to the true mean) cannot be too far from the sample mean of this set. Unfortunately, the bounded variance condition can only take us so far, and in fact is consistent with the mean of the good samples being as far as $\alpha^{-1/2}$ from the sample mean. To improve our error guarantee, we want to find a set of samples with stronger concentration bounds. A natural way to do this is to ensure that our set of samples has bounded higher moments. Specifically, we say that a set of points $S$ has bounded $d$-th central moments if for every unit vector $v$ we have $\mathbb{E}_{x \sim_u S}[|v \cdot (x - \mu)|^d] \le C$, for some constant $C > 0$, where $\mu$ here denotes the mean of $S$. We note that if $S$ has bounded $d$-th central moments, the mean of any $\alpha$-fraction of the points is no more than $(C/\alpha)^{1/d}$ far from the overall mean. Thus, if we could find a set with a large fraction of clean points which also had bounded $d$-th central moments for some $d \ge 2$, we would be able to obtain better error bounds.
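Checking the bounded central moment condition along any single direction is straightforward; as discussed next, the hard part is certifying it over all unit vectors simultaneously. A minimal sketch, with names ours:

import numpy as np

def central_moment(S, v, d):
    # Empirical d-th central moment of the samples S along direction v.
    # This is only a necessary-condition check for a given v, not a
    # certificate over all directions.
    proj = (S - S.mean(axis=0)) @ v
    return np.mean(np.abs(proj) ** d)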
The above paragraph naturally suggests the outline of a better algorithm. We start with some set $S$ of samples. If its $d$-th central moments are small, we return the mean of $S$. Otherwise, we find some direction $v$ in which the central moments are large, project onto that direction, use this information to create a multi-filter, and repeat. Unfortunately, it is not clear how to implement this algorithm efficiently. Recent results suggest that determining whether a generic point set has bounded central moments, for $d > 2$, is computationally intractable [43]. However, we are not dealing with arbitrary point sets. We can take advantage of the fact that Gaussian distributions satisfy stronger conditions on their higher moments than simply bounded central moments, and we can design algorithms that attempt to find sets of samples satisfying these more stringent (and hopefully computationally checkable) conditions instead. There are two different ways to implement this idea: the squares of polynomials method [28] and the Sums-of-Squares method.

For the squares of polynomials method, we note that the $d = 2$ case is easy, as it amounts to finding the maximum value of a quadratic polynomial on a sphere (which can be solved by spectral methods). For $d > 2$, the problem requires optimizing a higher-degree polynomial, which is not as simple. However, if our good samples are Gaussian with a given mean, we know what the expectations of higher-degree polynomials should be, and can thus check for them. In particular, in order to verify that the sample set has bounded $2d$-th central moments, one can check whether there is any degree-$d$ polynomial $p$ for which the average value of $p^2$ is too large; this suffices, since the $2d$-th central moment in direction $v$ corresponds to the choice $p(x) = (v \cdot (x - \mu))^d$. This can be checked by optimizing a quadratic form over polynomials $p$. If such a $p$ is found, we can use it to construct a multi-filter, though the mechanism for doing so is highly non-trivial and invokes many special properties of the Gaussian distribution. The interested reader can find the full details in [28].

The other method makes use of the Sums-of-Squares proof system. In particular, it can be shown that the bounded central moments of a Gaussian are provable in the Sums-of-Squares proof system. Thus, the goal is to find subsets of the samples for which a similar Sums-of-Squares proof allows one to show bounded central moments. These techniques were developed in [42, 55].

For list-decodable mean estimation, both of these methods allow one to learn the mean of a Gaussian to $\ell_2$-error approximately $\alpha^{-1/(2d)}$ by considering the $2d$-th moments. These algorithms have runtime $\mathrm{poly}(n^d)$ and, by making $d$ super-constant, we can in fact obtain poly-logarithmic error in quasi-polynomial time. The Sums-of-Squares method has the advantage that it is much more general, applying not just to Gaussians but to any distribution whose bounded central moments can be certified by Sums-of-Squares proofs of small degree. However, these systems must search for Sums-of-Squares proofs, which requires solving large convex optimization problems, meaning that these algorithms will be slower in practice.

Robust Mean Estimation with Certifiably-Bounded Higher Moments. Another interesting application of the Sums-of-Squares method in this context is in reducing the conditions needed for robust mean estimation. Recall that the definition of stability (Definition 2.1) has two conditions: First, it requires strong concentration of the uncorrupted samples, to ensure that removing any small fraction does not substantially alter the mean (a condition that is information-theoretically necessary in some contexts). Second, to satisfy the definition for $\delta = o(\sqrt{\epsilon})$, the algorithm needs to know the covariance matrix of the good samples. This latter condition is not required for information-theoretic reasons, but for computational ones.
Robust Mean Estimation with Certifiably-Bounded Higher Moments. Another interesting application of the Sums-of-Squares method in this context is in relaxing the conditions needed for robust mean estimation. Recall that the definition of stability (Definition 2.1) has two conditions. First, it requires strong concentration of the uncorrupted samples, to ensure that removing any small fraction does not substantially alter the mean (a condition that is information-theoretically necessary in some contexts). Second, to satisfy the definition for δ = o(√ǫ), the algorithm needs to know the covariance matrix of the good samples. This latter condition is not required for information-theoretic reasons, but for computational ones: the algorithm needs to know the covariance matrix of the clean samples so that it can detect even small increases in this covariance caused by corrupted samples.

The above implies, for example, that if we have a distribution with, say, bounded fourth central moments, the first stability condition holds with δ = O(ǫ^{3/4}), while the second will only hold with δ = Ω(√ǫ), as the actual covariance matrix with no errors might be modestly far from the identity. The first condition implies that, information-theoretically, it should be possible to learn the mean to error O(ǫ^{3/4}) (for example, by looking at a truncated mean in each direction), but the standard filter will not achieve error better than Ω(√ǫ), as an adversary can add errors that keep the covariance matrix bounded by the identity and yet corrupt the sample mean by this much. To circumvent this natural bottleneck, one would similarly need a way to take advantage of higher moments and leverage this stronger concentration. Once again, the Sums-of-Squares method can be used to achieve this under certain assumptions. Roughly speaking, if the uncorrupted distribution has bounded d-th central moments provable by low-degree Sums-of-Squares proofs, then by searching for sets of samples admitting such a proof, one can obtain error O_d(ǫ^{1−1/d}).
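For intuition about where the ǫ^{1−1/d} rate comes from, the following Hölder-type computation (a standard argument, not specific to the Sums-of-Squares machinery) bounds how much an ǫ-fraction of the clean distribution can shift the mean:

```latex
% If the clean distribution satisfies E[ |v.(X - mu)|^d ] <= C for every unit
% vector v, then for any event T with Pr[T] <= eps (e.g., the part of the clean
% distribution that an adversary deletes), Holder's inequality with exponents
% d and d/(d-1) gives
\[
\left| \mathbb{E}\!\left[ \mathbf{1}_T \cdot v \cdot (X - \mu) \right] \right|
  \le \left( \mathbb{E}\!\left[ |v \cdot (X - \mu)|^d \right] \right)^{1/d}
      \Pr[T]^{1 - 1/d}
  \le C^{1/d}\, \epsilon^{1 - 1/d} .
\]
% This is the source of the O_d(eps^{1-1/d}) rate quoted above; for d = 2 it
% recovers the familiar O(sqrt(eps)) error under bounded covariance.
```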
3.6 Fast Algorithms for High-Dimensional Robust Estimation

The main focus of the recent algorithmic work in this field has been on obtaining polynomial-time algorithms for various high-dimensional robust estimation tasks. Once a polynomial-time algorithm for a computational problem is discovered, the natural next step is to develop faster algorithms for the problem, with linear time as the ultimate goal. While recent work has led to polynomial-time robust learners for several fundamental tasks, these algorithms are significantly slower than their non-robust counterparts (e.g., the sample average for mean estimation). This raises the following question: Can we design robust estimators that are as efficient as their non-robust analogues? In addition to its potential practical implications, progress in this direction is of fundamental theoretical interest, as it can elucidate the effect of the robustness requirement on the computational complexity of high-dimensional statistical learning.

The above direction was initiated by [16] in the context of robust mean estimation. More specifically, [16] gave a robust mean estimation algorithm for bounded covariance distributions on R^d that achieves the (minimax optimal) ℓ2-error guarantee of O(√ǫ) and runs in time Õ(nd)/poly(ǫ), where n is the number of samples. That is, the algorithm of [16] has the same (optimal) sample complexity and error guarantee as previous polynomial-time algorithms [21, 22], while running in near-linear time when the fraction of outliers ǫ is a small constant. At the technical level, [16] builds on the convex programming approach of Section 2.3. Subsequent work [20] observed that a simple preprocessing step allows one to reduce to the case where the fraction of corruptions is a small universal constant. As a corollary, it was shown in [20] that a simple modification of the [16] algorithm runs in Õ(nd) time.

More importantly, [20] gave a probabilistic analysis leading to a fast mean estimation algorithm that is simultaneously outlier-robust and achieves sub-gaussian tail bounds. (We note that the question of designing estimators for the mean of heavy-tailed distributions with sub-gaussian rates has attracted substantial interest recently; the interested reader is referred to [41, 19, 62, 63] for recent developments on this topic.) Independently and concurrently to [20], [32] built on the filtering framework of Section 2.4 to give a different Õ(nd)-time robust mean estimation algorithm. Moreover, [32] gave an empirical evaluation demonstrating the practical performance of their algorithm.

Prior to the aforementioned developments, the fastest known runtime for robust mean estimation was Õ(nd²), achieved by the filtering algorithm of Section 2.4. While we did not provide a detailed runtime analysis in Section 2.4, it is not hard to show that each filter iteration can be implemented in Õ(nd) time (using power iteration; see the sketch below) and that the number of iterations can be bounded from above by O(d). While the filtering algorithm has been observed to run very fast in practice, taking at most a small constant number of iterations on real datasets [22, 24], one can construct examples where an ǫ-fraction of outliers forces the algorithm to take Ω(d) iterations. This can be achieved by placing the outliers in Ω(d) orthogonal directions at appropriate distances from the true mean. Conceptually, the bottleneck of the filtering method is that it relies on a certificate (Lemma 2.4) that allows the algorithm to remove outliers from only one direction at each iteration.

A detailed explanation and comparison of the techniques in [16, 20, 32] is beyond the scope of this survey. At a high level, a conceptual commonality of these works is that they leverage techniques from continuous optimization to develop iterative methods (with each iteration taking near-linear time) that are able to deal with multiple directions in parallel. In particular, the total number of iterations in each of these methods is at most poly-logarithmic in d/ǫ.

Beyond robust mean estimation, the work [17] recently studied the problem of robust covariance estimation with a focus on designing faster algorithms. Building on the techniques of [16], they obtained an algorithm for this problem with runtime Õ(d^{3.26}). Rather curiously, this runtime is not linear in the input size, but nearly matches the (best known) runtime of the corresponding non-robust estimator (i.e., computing the empirical covariance). Intriguingly, [17] also provided evidence that the runtime of their algorithm may be best possible with current algorithmic techniques.
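To illustrate the per-iteration runtime claim above, here is a minimal sketch (our own, with simplified placeholder thresholds and trimming rule rather than the exact tail-bound-based rules of Section 2.4) of a single filter iteration implemented with power iteration, using only O(nd)-time matrix-vector products:

```python
# One filter iteration in ~O(nd) time, as described above: the top eigenvector
# of the empirical second-moment matrix is found by power iteration using only
# products X^T (X v), so the d x d covariance matrix is never formed.  The
# variance threshold (1.5) and the 5% trimming rule are simplified placeholders.
import numpy as np

def filter_iteration(X, iters=50, rng=None):
    """One filtering step: find the top-variance direction, drop the worst tail."""
    rng = rng or np.random.default_rng()
    n, d = X.shape
    Xc = X - X.mean(axis=0)
    v = rng.normal(size=d)
    v /= np.linalg.norm(v)
    for _ in range(iters):                      # power iteration: O(nd) per step
        v = Xc.T @ (Xc @ v) / n
        v /= np.linalg.norm(v)
    var = v @ (Xc.T @ (Xc @ v)) / n             # variance in direction v
    if var <= 1.5:                              # placeholder variance certificate
        return X, True                          # bounded variance: accept the mean
    scores = (Xc @ v) ** 2                      # projections onto the bad direction
    keep = scores <= np.quantile(scores, 0.95)  # drop the most extreme 5%
    return X[keep], False

# Usage: iterate until the variance certificate holds, then output the mean.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(950, 20)), rng.normal(8.0, 1.0, size=(50, 20))])
done = False
while not done:
    X, done = filter_iteration(X, rng=rng)
print(X.mean(axis=0)[:3])
```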
3.7 Computational-Statistical Tradeoffs in Robust Estimation

The gold standard in robust estimation is to design computationally efficient algorithms with optimal sample complexities and error guarantees. A conceptual message of the recent body of algorithmic work in this area is that robustness may not pose computational impediments to high-dimensional estimation. Indeed, for a range of fundamental statistical tasks, recent work has developed computationally efficient robust estimators with dimension-independent (and, in some cases, near-optimal) error guarantees.

However, it turns out that, in some settings, robustness does create computational-statistical tradeoffs. Specifically, for several natural high-dimensional robust estimation tasks, we now have compelling evidence that achieving, or even approximating, the information-theoretically optimal error is computationally intractable. Progress in this direction was first made in [26], which used the framework of Statistical Query (SQ) algorithms [50] to establish computational-statistical tradeoffs for a range of robust estimation tasks involving Gaussian distributions. More specifically, it was shown in [26] that even for the basic problems of robust mean and covariance estimation of a high-dimensional Gaussian with contamination in total variation distance, achieving the optimal error requires super-polynomial time. The same work established computational-statistical tradeoffs for the problem of robust sparse mean estimation, even in Huber's contamination model, showing that efficient algorithms require quadratically more samples than the information-theoretic minimum. Interestingly, both of these SQ lower bounds are matched by the performance of recently developed robust learning algorithms [21, 2].

Motivated by this progress, [43] made a first step towards proving computational lower bounds for robust mean estimation based on worst-case hardness assumptions. In particular, this work established that current algorithmic techniques for robust mean estimation may not be improvable in terms of their error guarantees, in the sense that they run up against a well-known computational barrier: the so-called small set expansion hypothesis (SSE), closely related to the unique games conjecture (UGC). More recently, [12] proposed a k-partite variant of the planted clique problem [46] and gave a reduction, inspired by the SQ lower bound of [26], implying (subject only to the hardness of the proposed problem) a statistical-computational gap for robust sparse mean estimation. An interesting open problem is to establish compelling evidence (e.g., in the form of a Sums-of-Squares lower bound) of the average-case hardness of the proposed k-partite variant of planted clique. In the following paragraphs, we provide a more detailed description of [26, 43].

Statistical Query Lower Bounds. For the statistical estimation tasks studied in this article, the input is a set of samples drawn from a probability distribution of interest. Statistical Query (SQ) algorithms [50] are a restricted class of algorithms that are only allowed to (adaptively) query expectations of bounded functions of the distribution, rather than directly access samples. In particular, if f is any bounded function, one may approximate E[f(X)] by taking samples from the distribution X: with O(1/τ²) samples, one can obtain the correct answer within additive accuracy τ with high probability. By doing this for several different functions f, perhaps chosen adaptively, one can try to learn properties of the underlying distribution X. In the SQ model, the algorithm instead asks queries of an oracle; each query consists of a function f with range contained in [−1, 1] and a desired accuracy τ. The oracle then returns E[f(X)] to accuracy τ, and the algorithm gets to (adaptively) choose another f, up to Q times. Roughly speaking, an SQ algorithm with accuracy τ and Q queries corresponds to an actual algorithm using O(1/τ²) samples and O(Q) time.
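The following toy simulation (our own illustration; the class name and constants are hypothetical) makes the sample-based correspondence concrete by answering each SQ query from O(1/τ²) fresh samples:

```python
# Simulating an SQ oracle from samples, as described above: a query f with
# range in [-1, 1] and accuracy tau is answered using O(1/tau^2) fresh samples,
# since by Hoeffding's inequality (for values in [-1, 1]) the empirical mean of
# f is within tau of E[f(X)] with high probability.
import numpy as np

class SimulatedSQOracle:
    def __init__(self, sampler, failure_prob=1e-3, rng=None):
        self.sampler = sampler            # draws n i.i.d. samples from X
        self.delta = failure_prob
        self.rng = rng or np.random.default_rng()

    def query(self, f, tau):
        """Return E[f(X)] to additive accuracy tau (w.p. at least 1 - delta)."""
        # Hoeffding for [-1, 1]-valued f: n >= (2/tau^2) * ln(2/delta) suffices.
        n = int(np.ceil(2.0 * np.log(2.0 / self.delta) / tau**2))
        vals = f(self.sampler(n, self.rng))
        assert np.all(np.abs(vals) <= 1.0), "SQ queries must have range in [-1, 1]"
        return float(vals.mean())

# Usage: estimate E[clip(x_1, -1, 1)] for X = N((0.3, -0.1), I) via a bounded query.
sampler = lambda n, rng: rng.normal([0.3, -0.1], 1.0, size=(n, 2))
oracle = SimulatedSQOracle(sampler)
print(oracle.query(lambda x: np.clip(x[:, 0], -1.0, 1.0), tau=0.05))
```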
The Statistical Query (SQ) model is actually quite powerful: a wide range of known algorithmic techniques in machine learning are implementable using SQs. These include spectral techniques, moment and tensor methods, local search (e.g., Expectation Maximization), and many others (see, e.g., [34]). In fact, nearly every known statistical algorithm with provable performance guarantees (with a small number of exceptions) can be simulated with a small loss of efficiency in the SQ model. This makes the SQ model very useful for proving lower bounds, as a lower bound in this model applies to a broad family of algorithms. It is easy to see that one can estimate moments and approximate medians in the SQ model. In fact, the various filter algorithms described in this article can be implemented in the SQ framework without difficulty. Indeed, moment computations allow one to estimate the covariance matrix. If large eigenvalues are found, further measurements can approximate the cumulative distribution function of the projection and decide on a filter. From then on, measurements can be made conditional on passing the filter (by measuring f(x) times the indicator function of x passing the filter).

A recent line of work [34, 36, 35, 33] developed a framework of SQ algorithms for search problems over distributions. One can prove unconditional lower bounds on the computational complexity of SQ algorithms via the notion of Statistical Query dimension. This complexity measure was introduced in [11] for PAC learning of Boolean functions and was recently generalized to the unsupervised setting [34, 33]. A lower bound on the SQ dimension of a learning problem provides an unconditional lower bound on the computational complexity of any SQ algorithm for the problem. Suppose we want to estimate the parameters of an unknown distribution X that belongs to a known family D. Roughly speaking, the aforementioned work has shown that if there are many possible distributions X ∈ D whose density functions are pairwise nearly orthogonal with respect to an appropriate inner product, then any SQ algorithm with insufficient accuracy will require many queries to determine which of these distributions it is sampling from.

The work of [26] gives an SQ lower bound for the statistical task of non-Gaussian component analysis (NGCA) [10]. Intuitively, this is the problem of finding a non-Gaussian direction in a high-dimensional dataset. In more detail, let A be the pdf of a univariate distribution with the property that its first m moments match the corresponding moments of the standard univariate Gaussian N(0, 1). For a unit vector v ∈ R^d, let P_v be the pdf of the distribution on R^d defined as follows: the projection of P_v in the v-direction is equal to A, and P_v is a standard Gaussian in the orthogonal complement v⊥. Given sample access to a distribution P_{v*}, for some unknown direction (unit vector) v* ∈ R^d, the goal of NGCA is to approximate the hidden vector v* (a sampler for this construction is sketched below).
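For concreteness, here is a toy sampler (our own illustration) for the NGCA construction P_v just defined. The Rademacher choice of A below matches only the first three moments of N(0, 1), whereas the hard instances of [26] use distributions matching many more moments:

```python
# A toy sampler for P_v: the marginal along the hidden direction v follows a
# non-Gaussian law A, and the distribution is standard Gaussian on the
# orthogonal complement of v.  Here A is the Rademacher (+/-1) distribution,
# which matches the first three moments of N(0, 1) but not the fourth.
import numpy as np

def sample_ngca(n, v, rng):
    """Draw n samples from P_v: law A along v, N(0, I) on the complement of v."""
    v = v / np.linalg.norm(v)
    d = v.shape[0]
    g = rng.normal(size=(n, d))            # standard Gaussian in R^d
    a = rng.choice([-1.0, 1.0], size=n)    # the non-Gaussian component A
    # Replace the v-component of g with the A-component: x = (I - v v^T) g + a v.
    return g - np.outer(g @ v, v) + np.outer(a, v)

rng = np.random.default_rng(2)
v_star = np.ones(10) / np.sqrt(10)
X = sample_ngca(5000, v_star, rng)
# Sanity check: the 4th moment along v_star is ~1 (Rademacher), while along a
# generic coordinate direction it is roughly 3 (nearly Gaussian).
print(((X @ v_star) ** 4).mean(), ((X @ np.eye(10)[0]) ** 4).mean())
```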
It is shown in [26] that any SQ algorithm that approximates v* requires either queries of accuracy d^{−Ω(m)} or exponentially many (namely 2^{d^{Ω(1)}}) oracle queries. This SQ lower bound can be used in an essentially black-box manner to obtain nearly tight SQ lower bounds for a range of high-dimensional estimation tasks involving Gaussian distributions, including learning mixtures of Gaussians, robust mean and covariance estimation, and robust sparse mean estimation. At a high level, this is achieved by constructing instances of these problems that amount to an NGCA instance for an appropriate number m of matching moments. The main idea is to add noise to the distribution in question so that the noisy distribution is of the form P_{v*} for some moment-matching distribution A. Using these techniques, it was shown that super-polynomial complexity (in terms of either the number of queries or the accuracy) is required to learn the mean of an ǫ-corrupted Gaussian to ℓ2-error o(ǫ√log(1/ǫ)) or to learn its covariance to Frobenius error o(ǫ log(1/ǫ)), both of which are tight [21]. It was also shown that robustly learning the mean of a k-sparse Gaussian to constant ℓ2-error requires either super-polynomially many queries or queries with accuracy O(max(1/k, 1/√d)), which morally corresponds to taking at least Ω(min{d, k²}) samples, nearly matching the sample complexity of known algorithms [2].

Reductions from Worst-Case Hard Problems. Proving computational lower bounds via reductions from worst-case hard problems has proved to be a rather challenging goal for statistical estimation tasks. Some recent progress was made on this front in the context of robust mean estimation. Specifically, the work [43] established computational lower bounds against algorithmic techniques operating along the same lines as existing ones. In particular, existing algorithms for robust mean estimation depend on being able to find a computationally verifiable certificate that (under appropriate conditions) implies that the sample mean is close to the true mean (analogous to Lemma 2.5). It is shown in [43] that, in some cases, finding natural certificates may be computationally intractable. As a specific example, consider the class of distributions on R^d with bounded fourth central moments. For such distributions, it is not hard to show that the trimmed mean correctly estimates the mean of any one-dimensional projection within error O(ǫ^{3/4}) (see the sketch below), showing that this error rate is information-theoretically optimal within constant factors. However, for an algorithm to achieve this using techniques like those already known, it would need a way to certify whether or not the sample mean of a given point set is close to the true mean. A natural way to do this would involve verifying whether the point set itself has bounded fourth central moments. However, [43] shows that, subject to the Small-Set Expansion Hypothesis, this is computationally intractable. In fact, the certification problem remains intractable even if one only needs to distinguish between a distribution that has many bounded central moments and one that lacks bounded fourth central moments. While this hardness result is hardly definitive (as it leaves room for a variety of different kinds of algorithms), it rules out some of the most natural approaches for extending existing techniques.
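For reference, here is a minimal sketch of the univariate trimmed mean just mentioned; the symmetric ǫ-trimming rule below is a simplified choice rather than the tuned rule from the literature:

```python
# The one-dimensional trimmed mean referenced above: discard the most extreme
# ~eps-tails of the projected points and average the rest.  Under a fourth-
# moment bound, the trimming threshold scales like eps^{-1/4} (Markov on the
# 4th moment), so removing an eps-fraction shifts the clean mean by at most
# about eps * eps^{-1/4} = eps^{3/4}.
import numpy as np

def trimmed_mean(y, eps):
    """Average of y after discarding the bottom and top eps-fraction of values."""
    lo, hi = np.quantile(y, [eps, 1.0 - eps])
    kept = y[(y >= lo) & (y <= hi)]
    return kept.mean()

# Usage: an eps-fraction of gross outliers barely moves the trimmed mean.
rng = np.random.default_rng(3)
eps = 0.05
y = np.concatenate([rng.standard_t(df=5, size=1900),  # finite 4th moment, mean 0
                    np.full(100, 50.0)])               # eps-fraction of outliers
print(np.mean(y), trimmed_mean(y, eps))                # naive vs trimmed
```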
4 Conclusions

In this article, we gave an overview of the recent developments on algorithmic aspects of high-dimensional robust statistics. While substantial progress has been made in this field during the past few years, the results obtained so far merely scratch the surface of the possibilities ahead. A major goal going forward is to develop a general algorithmic theory of robust estimation. This involves (1) developing novel algorithmic techniques that lead to efficient robust estimators for more general probabilistic models and estimation tasks; (2) obtaining a deeper understanding of the computational limits of robust estimation; (3) developing mathematical connections to related areas, including non-convex optimization and privacy; and (4) exploring applications of algorithmic robust statistics to exploratory data analysis, safe machine learning, and deep learning.

One of the main conceptual contributions of the classical theory of robust statistics has been to challenge traditional statistical assumptions about the data generating process, thereby enabling the design of methods that are stable to deviations from these assumptions. The precise form of such deviations depends on the setting and can give rise to various models of robustness. We believe that a central objective in a modern theory of robustness is to rethink old models and develop new ones that enable the design of new practical algorithms with provable guarantees.

Acknowledgments

We thank Alistair Stewart for extensive collaboration in this area. We thank Tim Roughgarden, Ankur Moitra, and Greg Valiant for useful comments on a shorter version of this article.

References

[1] P. Awasthi, M. F. Balcan, and P. M. Long. The power of localization for efficiently learning linear separators with noise. J. ACM, 63(6):50:1–50:27, 2017.

[2] S. Balakrishnan, S. S. Du, J. Li, and A. Singh. Computationally efficient robust sparse estimation in high dimensions. In Proc. 30th Annual Conference on Learning Theory, pages 169–212, 2017.

[3] M. F. Balcan, A. Blum, and S. Vempala. A discriminative framework for clustering via similarity functions. In STOC, pages 671–680, 2008.

[4] M. Barreno, B. Nelson, A. D. Joseph, and J. D. Tygar. The security of machine learning. Machine Learning, 81(2):121–148, 2010.

[5] T. Bernholt. Robust estimators are hard to compute. Technical report, University of Dortmund, Germany, 2006.

[6] Q. Berthet and P. Rigollet. Complexity theoretic lower bounds for sparse principal component detection. In COLT 2013 - The 26th Annual Conference on Learning Theory, pages 1046–1066, 2013.

[7] K. Bhatia, P. Jain, P. Kamalaruban, and P. Kar. Consistent robust regression. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, pages 2107–2116, 2017.

[8] K. Bhatia, P. Jain, and P. Kar. Robust regression via hard thresholding. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, pages 721–729, 2015.

[9] B. Biggio, B. Nelson, and P. Laskov. Poisoning attacks against support vector machines. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, 2012.
[10] G. Blanchard, M. Kawanabe, M. Sugiyama, V. G. Spokoiny, and K.-R. Müller. In search of non-Gaussian components of a high-dimensional distribution. J. Mach. Learn. Res., 7:247–282, 2006.

[11] A. Blum, M. Furst, J. Jackson, M. Kearns, Y. Mansour, and S. Rudich. Weakly learning DNF and characterizing statistical query learning using Fourier analysis. In Proceedings of the Twenty-Sixth Annual Symposium on Theory of Computing, pages 253–262, 1994.

[12] M. Brennan and G. Bresler. Average-case lower bounds for learning sparse mixtures, robust estimation and semirandom adversaries. CoRR, abs/1908.06130, 2019.

[13] N. Bshouty, N. Eiron, and E. Kushilevitz. PAC learning with nasty noise. Theoretical Computer Science, 288(2):255–275, 2002.

[14] M. Charikar, J. Steinhardt, and G. Valiant. Learning from untrusted data. In Proc. 49th Annual ACM Symposium on Theory of Computing, pages 47–60, 2017.

[15] M. Chen, C. Gao, and Z. Ren. Robust covariance and scatter matrix estimation under Huber's contamination model. Ann. Statist., 46(5):1932–1960, 2018.

[16] Y. Cheng, I. Diakonikolas, and R. Ge. High-dimensional robust mean estimation in nearly-linear time. CoRR, abs/1811.09380, 2018. Conference version in SODA 2019, pages 2755–2771.

[17] Y. Cheng, I. Diakonikolas, R. Ge, and D. P. Woodruff. Faster algorithms for high-dimensional robust covariance estimation. In Conference on Learning Theory, COLT 2019, pages 727–757, 2019.

[18] Y. Cheng, I. Diakonikolas, D. M. Kane, and A. Stewart. Robust learning of fixed-structure Bayesian networks. In Proc. 33rd Annual Conference on Neural Information Processing Systems (NeurIPS), pages 10304–10316, 2018.

[19] Y. Cherapanamjeri, N. Flammarion, and P. L. Bartlett. Fast mean estimation with sub-gaussian rates. In Conference on Learning Theory, COLT 2019, pages 786–806, 2019.

[20] J. Depersin and G. Lecue. Robust subgaussian estimation of a mean vector in nearly linear time. CoRR, abs/1906.03058, 2019.

[21] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart. Robust estimators in high dimensions without the computational intractability. In Proc. 57th IEEE Symposium on Foundations of Computer Science (FOCS), pages 655–664, 2016.

[22] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart. Being robust (in high dimensions) can be practical. In Proc. 34th International Conference on Machine Learning (ICML), pages 999–1008, 2017.

[23] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart. Robustly learning a Gaussian: Getting optimal error, efficiently. In Proc. 29th Annual Symposium on Discrete Algorithms, pages 2683–2702, 2018.

[24] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, J. Steinhardt, and A. Stewart. Sever: A robust meta-algorithm for stochastic optimization. CoRR, abs/1803.02815, 2018. Conference version in ICML 2019.

[25] I. Diakonikolas and D. M. Kane. Degree-d Chow parameters robustly determine degree-d PTFs (and algorithmic applications). In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, pages 804–815, 2019. Full version available at https://arxiv.org/abs/1811.03491.
[26] I. Diakonikolas, D. M. Kane, and A. Stewart. Statistical query lower bounds for robust estimation of high-dimensional Gaussians and Gaussian mixtures. In Proc. 58th IEEE Symposium on Foundations of Computer Science (FOCS), pages 73–84, 2017.

[27] I. Diakonikolas, D. M. Kane, and A. Stewart. Learning geometric concepts with nasty noise. In Proc. 50th Annual ACM Symposium on Theory of Computing (STOC), pages 1061–1073, 2018.

[28] I. Diakonikolas, D. M. Kane, and A. Stewart. List-decodable robust mean estimation and learning mixtures of spherical Gaussians. In Proc. 50th Annual ACM Symposium on Theory of Computing (STOC), pages 1047–1060, 2018.

[29] I. Diakonikolas and D. M. Kane. Recent advances in high-dimensional robust statistics. STOC 2019 Tutorial. Slides available at http://www.iliasdiakonikolas.org/stoc-robust-tutorial.html, 2019.

[30] I. Diakonikolas, S. Karmalkar, D. Kane, E. Price, and A. Stewart. Outlier-robust high-dimensional sparse estimation via iterative filtering. In Advances in Neural Information Processing Systems 33, NeurIPS 2019, 2019.

[31] I. Diakonikolas, W. Kong, and A. Stewart. Efficient algorithms and lower bounds for robust linear regression. In Proc. 30th Annual Symposium on Discrete Algorithms (SODA), pages 2745–2754, 2019.

[32] Y. Dong, S. B. Hopkins, and J. Li. Quantum entropy scoring for fast robust mean estimation and improved outlier detection. CoRR, abs/1906.11366, 2019. Conference version in NeurIPS 2019.

[33] V. Feldman. A general characterization of the statistical query complexity. CoRR, abs/1608.02198, 2016.

[34] V. Feldman, E. Grigorescu, L. Reyzin, S. Vempala, and Y. Xiao. Statistical algorithms and a lower bound for detecting planted cliques. In Proceedings of STOC'13, pages 655–664, 2013.

[35] V. Feldman, C. Guzman, and S. Vempala. Statistical query algorithms for stochastic convex optimization. CoRR, abs/1512.09170, 2015.

[36] V. Feldman, W. Perkins, and S. Vempala. On the complexity of random satisfiability problems with planted solutions. In Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, STOC 2015, pages 77–86, 2015.

[37] F. R. Hampel. A general qualitative definition of robustness. Ann. Math. Statist., 42(6):1887–1896, 1971.

[38] F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw, and W. A. Stahel. Robust Statistics: The Approach Based on Influence Functions. Wiley, New York, 1986.

[39] M. Hardt and A. Moitra. Algorithms and hardness for robust subspace recovery. In Proc. 26th Annual Conference on Learning Theory (COLT), pages 354–375, 2013.

[40] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100:78–150, 1992.

[41] S. B. Hopkins. Sub-gaussian mean estimation in polynomial time. CoRR, abs/1809.07425, 2018.

[42] S. B. Hopkins and J. Li. Mixture models, robustness, and sum of squares proofs. In Proc. 50th Annual ACM Symposium on Theory of Computing (STOC), pages 1021–1034, 2018.

[43] S. B. Hopkins and J. Li. How hard is robust mean estimation? In Conference on Learning Theory, COLT 2019, pages 1649–1682, 2019.

[44] P. J. Huber. Robust estimation of a location parameter. Ann. Math. Statist., 35(1):73–101, 1964.
[45] P. J. Huber and E. M. Ronchetti. Robust Statistics. Wiley, New York, 2009.

[46] M. Jerrum. Large cliques elude the Metropolis process. Random Struct. Algorithms, 3(4):347–360, 1992.

[47] D. S. Johnson and F. P. Preparata. The densest hemisphere problem. Theoretical Computer Science, 6:93–107, 1978.

[48] S. Karmalkar, A. R. Klivans, and P. K. Kothari. List-decodable linear regression. CoRR, abs/1905.05679, 2019. To appear in NeurIPS'19.

[49] M. Kearns, R. Schapire, and L. Sellie. Toward efficient agnostic learning. Machine Learning, 17(2/3):115–141, 1994.

[50] M. J. Kearns. Efficient noise-tolerant learning from statistical queries. In Proc. 25th Annual ACM Symposium on Theory of Computing (STOC), pages 392–401. ACM Press, 1993.

[51] M. J. Kearns and M. Li. Learning in the presence of malicious errors. SIAM Journal on Computing, 22(4):807–837, 1993.

[52] A. Klivans, P. Kothari, and R. Meka. Efficient algorithms for outlier-robust regression. In Proc. 31st Annual Conference on Learning Theory (COLT), pages 1420–1430, 2018.

[53] A. Klivans, P. Long, and R. Servedio. Learning halfspaces with malicious noise. Journal of Machine Learning Research, 10:2715–2740, 2009.

[54] P. K. Kothari and J. Steinhardt. Better agnostic clustering via relaxed tensor norms. CoRR, abs/1711.07465, 2017.

[55] P. K. Kothari, J. Steinhardt, and D. Steurer. Robust moment estimation and improved clustering via sum of squares. In Proc. 50th Annual ACM Symposium on Theory of Computing (STOC), pages 1035–1046, 2018.

[56] P. K. Kothari and D. Steurer. Outlier-robust moment-estimation via sum-of-squares. CoRR, abs/1711.11581, 2017.

[57] K. A. Lai, A. B. Rao, and S. Vempala. Agnostic estimation of mean and covariance. In Proc. 57th IEEE Symposium on Foundations of Computer Science (FOCS), pages 665–674, 2016.

[58] J. Z. Li, D. M. Absher, H. Tang, A. M. Southwick, A. M. Casto, S. Ramachandran, H. M. Cann, G. S. Barsh, M. Feldman, L. L. Cavalli-Sforza, and R. M. Myers. Worldwide human relationships inferred from genome-wide patterns of variation. Science, 319:1100–1104, 2008.

[59] H. Liu and C. Gao. Density estimation with contamination: minimax rates and theory of adaptation. Electron. J. Statist., 13(2):3613–3653, 2019.

[60] L. Liu, T. Li, and C. Caramanis. High dimensional robust estimation of sparse models via trimmed hard thresholding. CoRR, abs/1901.08237, 2019.

[61] L. Liu, Y. Shen, T. Li, and C. Caramanis. High dimensional robust sparse regression. CoRR, abs/1805.11643, 2018.

[62] G. Lugosi and S. Mendelson. Mean estimation and regression under heavy-tailed distributions: A survey. Foundations of Computational Mathematics, 19(5):1145–1190, 2019.

[63] G. Lugosi and S. Mendelson. Robust multivariate mean estimation: the optimality of trimmed mean. CoRR, abs/1907.11391, 2019.

[64] P. Paschou, J. Lewis, A. Javed, and P. Drineas. Ancestry informative markers for fine-scale individual assignment to worldwide populations. Journal of Medical Genetics, 47:835–847, 2010.

[65] A. Prasad, A. S. Suggala, S. Balakrishnan, and P. Ravikumar. Robust estimation via robust gradient estimation. arXiv preprint arXiv:1802.06485, 2018.

[66] P. Raghavendra, T. Schramm, and D. Steurer. High-dimensional estimation via sum-of-squares proofs. CoRR, abs/1807.11419, 2018.
[67] P. Raghavendra and M. Yau. List decodable learning via sum of squares. CoRR, abs/1905.04660, 2019. To appear in SODA'20.

[68] N. Rosenberg, J. Pritchard, J. Weber, H. Cann, K. Kidd, L. A. Zhivotovsky, and M. W. Feldman. Genetic structure of human populations. Science, 298:2381–2385, 2002.

[69] J. Steinhardt, M. Charikar, and G. Valiant. Resilience: A criterion for learning in the presence of arbitrary outliers. In Proc. 9th Innovations in Theoretical Computer Science Conference (ITCS), pages 45:1–45:21, 2018.

[70] J. Steinhardt, P. Wei Koh, and P. S. Liang. Certified defenses for data poisoning attacks. In Advances in Neural Information Processing Systems 30, pages 3520–3532, 2017.

[71] A. S. Suggala, K. Bhatia, P. Ravikumar, and P. Jain. Adaptive hard thresholding for near-optimal consistent robust regression. In Conference on Learning Theory, COLT 2019, pages 2892–2897, 2019.

[72] B. Tran, J. Li, and A. Madry. Spectral signatures in backdoor attacks. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, pages 8011–8021, 2018.

[73] J. W. Tukey. A survey of sampling from contaminated distributions. Contributions to Probability and Statistics, 2:448–485, 1960.

[74] J. W. Tukey. Mathematics and picturing of data. In Proceedings of ICM, volume 6, pages 523–531, 1975.

[75] L. Valiant. Learning disjunctions of conjunctions. In Proc. 9th IJCAI, pages 560–566, 1985.

[76] B. Zhu, J. Jiao, and J. Steinhardt. Generalized resilience and robust statistics. CoRR, abs/1909.08755, 2019.