Robust Regression and Lasso


Authors: Huan Xu, Constantine Caramanis, Shie Mannor

Abstract — Lasso, or $\ell_1$ regularized least squares, has been explored extensively for its remarkable sparsity properties. It is shown in this paper that the solution to Lasso, in addition to its sparsity, has robustness properties: it is the solution to a robust optimization problem. This has two important consequences. First, robustness provides a connection of the regularizer to a physical property, namely, protection from noise. This allows a principled selection of the regularizer, and in particular, generalizations of Lasso that also yield convex optimization problems are obtained by considering different uncertainty sets. Secondly, robustness can itself be used as an avenue to exploring different properties of the solution. In particular, it is shown that robustness of the solution explains why the solution is sparse. The analysis as well as the specific results obtained differ from standard sparsity results, providing different geometric intuition. Furthermore, it is shown that the robust optimization formulation is related to kernel density estimation, and based on this approach, a proof that Lasso is consistent is given using robustness directly. Finally, a theorem saying that sparsity and algorithmic stability contradict each other, and hence Lasso is not stable, is presented.

Index Terms — Statistical learning, regression, regularization, kernel density estimator, Lasso, robustness, sparsity, stability.

(A preliminary version of this paper was presented at the Twenty-Second Annual Conference on Neural Information Processing Systems. H. Xu and S. Mannor are with the Department of Electrical and Computer Engineering, McGill University, Montréal, H3A 2A7, Canada (email: xuhuan@cim.mcgill.ca; shie.mannor@mcgill.ca). C. Caramanis is with the Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX 78712 USA (email: cmcaram@ece.utexas.edu).)

I. INTRODUCTION

In this paper we consider linear regression problems with least-square error. The problem is to find a vector $\mathbf{x}$ so that the $\ell_2$ norm of the residual $\mathbf{b} - A\mathbf{x}$ is minimized, for a given matrix $A \in \mathbb{R}^{n \times m}$ and vector $\mathbf{b} \in \mathbb{R}^n$. From a learning/regression perspective, each row of $A$ can be regarded as a training sample, and the corresponding element of $\mathbf{b}$ as the target value of this observed sample. Each column of $A$ corresponds to a feature, and the objective is to find a set of weights so that the weighted sum of the feature values approximates the target value.

It is well known that minimizing the least squared error can lead to sensitive solutions [1]–[4]. Many regularization methods have been proposed to decrease this sensitivity. Among them, Tikhonov regularization [5] and Lasso [6], [7] are two widely known and cited algorithms. These methods minimize a weighted sum of the residual norm and a certain regularization term, $\|\mathbf{x}\|_2$ for Tikhonov regularization and $\|\mathbf{x}\|_1$ for Lasso. In addition to providing regularity, Lasso is also known for the tendency to select sparse solutions. Recently this has attracted much attention for its ability to reconstruct sparse solutions when sampling occurs far below the Nyquist rate, and
also for its ability to recover the sparsity pattern exactly with probability one, asymptotically as the number of observations increases (there is an extensive literature on this subject, and we refer the reader to [8]–[12] and references therein).

The first result of this paper is that the solution to Lasso has robustness properties: it is the solution to a robust optimization problem. In itself, this interpretation of Lasso as the solution to a robust least squares problem is a development in line with the results of [13]. There, the authors propose an alternative approach of reducing sensitivity of linear regression by considering a robust version of the regression problem, i.e., minimizing the worst-case residual for the observations under some unknown but bounded disturbance. Most of the research in this area considers either the case where the disturbance is row-wise uncoupled [14], or the case where the Frobenius norm of the disturbance matrix is bounded [13]. None of these robust optimization approaches produces a solution that has sparsity properties (in particular, the solution to Lasso does not solve any of these previously formulated robust optimization problems). In contrast, we investigate the robust regression problem where the uncertainty set is defined by feature-wise constraints. Such a noise model is of interest when the values of features are obtained with some noisy pre-processing steps, and the magnitudes of such noises are known or bounded. Another situation of interest is where features are meaningfully coupled. We define coupled and uncoupled disturbances and uncertainty sets precisely in Section II-A below. Intuitively, a disturbance is feature-wise coupled if the variations across features satisfy joint constraints, and uncoupled otherwise.

Considering the solution to Lasso as the solution of a robust least squares problem has two important consequences. First, robustness provides a connection of the regularizer to a physical property, namely, protection from noise. This allows more principled selection of the regularizer, and in particular, considering different uncertainty sets, we construct generalizations of Lasso that also yield convex optimization problems. Secondly, and perhaps most significantly, robustness is a strong property that can itself be used as an avenue to investigating different properties of the solution. We show that robustness of the solution can explain why the solution is sparse. The analysis as well as the specific results we obtain differ from standard sparsity results, providing different geometric intuition, and extending beyond the least-squares setting. Sparsity results obtained for Lasso ultimately depend on the fact that introducing additional features incurs a larger $\ell_1$-penalty than the least squares error reduction. In contrast, we exploit the fact that a robust solution is, by definition, the optimal solution under a worst-case perturbation. Our results show that, essentially, a coefficient of the solution is nonzero if the corresponding feature is relevant under all allowable perturbations. In addition to sparsity, we also use robustness directly to prove consistency of Lasso. We briefly list the main contributions as well as the organization of this paper.
• In Section II, we formulate the robust regression problem with feature-wise independent disturbances, and show that this formulation is equivalent to a least-squares problem with a weighted $\ell_1$ norm regularization term. Hence, we provide an interpretation of Lasso from a robustness perspective.
• We generalize the robust regression formulation to loss functions of arbitrary norm in Section III. We also consider uncertainty sets that require disturbances of different features to satisfy joint conditions. This can be used to mitigate the conservativeness of the robust solution and to obtain solutions with additional properties.
• In Section IV, we present new sparsity results for the robust regression problem with feature-wise independent disturbances. This provides a new robustness-based explanation of the sparsity of Lasso. Our approach gives new analysis and also geometric intuition, and furthermore allows one to obtain sparsity results for more general loss functions, beyond the squared loss.
• Next, we relate Lasso to kernel density estimation in Section V. This allows us to re-prove consistency in a statistical learning setup, using the new robustness tools and formulation we introduce. Along with our results on sparsity, this illustrates the power of robustness in explaining and also exploring different properties of the solution.
• Finally, we prove in Section VI a "no-free-lunch" theorem, stating that an algorithm that encourages sparsity cannot be stable.

Notation. We use capital letters to represent matrices, and boldface letters to represent column vectors. Row vectors are represented as the transpose of column vectors. For a vector $\mathbf{z}$, $z_i$ denotes its $i$th element. Throughout the paper, $\mathbf{a}_i$ and $\mathbf{r}_j^\top$ are used to denote the $i$th column and the $j$th row of the observation matrix $A$, respectively. We use $a_{ij}$ to denote the $ij$th element of $A$; hence it is the $j$th element of $\mathbf{r}_i$, and the $i$th element of $\mathbf{a}_j$. For a convex function $f(\cdot)$, $\partial f(\mathbf{z})$ represents any of its sub-gradients evaluated at $\mathbf{z}$. A vector of length $n$ with each element equal to 1 is denoted $\mathbf{1}_n$.

II. ROBUST REGRESSION WITH FEATURE-WISE DISTURBANCE

In this section, we show that our robust regression formulation recovers Lasso as a special case. We also derive probabilistic bounds that guide the construction of the uncertainty set. The regression formulation we consider differs from the standard Lasso formulation, as we minimize the norm of the error rather than the squared norm. It is known that these two coincide up to a change of the regularization coefficient. Yet as we discuss above, our results lead to more flexible and potentially powerful robust formulations, and give new insight into known results.

A. Formulation

Robust linear regression considers the case where the observed matrix is corrupted by some potentially malicious disturbance. The objective is to find the optimal solution in the worst-case sense. This is usually formulated as the following min-max problem,

Robust Linear Regression:
$$\min_{\mathbf{x} \in \mathbb{R}^m} \Big\{ \max_{\Delta A \in \mathcal{U}} \| \mathbf{b} - (A + \Delta A)\mathbf{x} \|_2 \Big\}, \qquad (1)$$

where $\mathcal{U}$ is called the uncertainty set, or the set of admissible disturbances of the matrix $A$. In this section, we consider the class of uncertainty sets that bound the norm of the disturbance to each feature, without placing any joint requirements across feature disturbances. That is, we consider the class of uncertainty sets:
$$\mathcal{U} \triangleq \Big\{ (\delta_1, \cdots, \delta_m) \,\Big|\, \|\delta_i\|_2 \le c_i,\; i = 1, \cdots, m \Big\}, \qquad (2)$$
for given $c_i \ge 0$. We call these uncertainty sets feature-wise uncoupled, in contrast to coupled uncertainty sets that require disturbances of different features to satisfy some joint constraints (we discuss these, and their significance, extensively below). While the inner maximization problem of (1) is nonconvex, we show in the next theorem that uncoupled norm-bounded uncertainty sets lead to an easily solvable optimization problem.

Theorem 1: The robust regression problem (1) with an uncertainty set of the form (2) is equivalent to the following $\ell_1$ regularized regression problem:
$$\min_{\mathbf{x} \in \mathbb{R}^m} \Big\{ \| \mathbf{b} - A\mathbf{x} \|_2 + \sum_{i=1}^m c_i |x_i| \Big\}. \qquad (3)$$

Proof: Fix $\mathbf{x}^*$. We prove that $\max_{\Delta A \in \mathcal{U}} \|\mathbf{b} - (A + \Delta A)\mathbf{x}^*\|_2 = \|\mathbf{b} - A\mathbf{x}^*\|_2 + \sum_{i=1}^m c_i |x_i^*|$. The left hand side can be written as
$$\begin{aligned}
\max_{\Delta A \in \mathcal{U}} \|\mathbf{b} - (A + \Delta A)\mathbf{x}^*\|_2
&= \max_{(\delta_1, \cdots, \delta_m):\, \|\delta_i\|_2 \le c_i} \Big\| \mathbf{b} - \big(A + (\delta_1, \cdots, \delta_m)\big)\mathbf{x}^* \Big\|_2 \\
&= \max_{(\delta_1, \cdots, \delta_m):\, \|\delta_i\|_2 \le c_i} \Big\| \mathbf{b} - A\mathbf{x}^* - \sum_{i=1}^m x_i^* \delta_i \Big\|_2 \\
&\le \max_{(\delta_1, \cdots, \delta_m):\, \|\delta_i\|_2 \le c_i} \Big\{ \|\mathbf{b} - A\mathbf{x}^*\|_2 + \sum_{i=1}^m \|x_i^* \delta_i\|_2 \Big\}
\le \|\mathbf{b} - A\mathbf{x}^*\|_2 + \sum_{i=1}^m |x_i^*|\, c_i. \qquad (4)
\end{aligned}$$
Now, let
$$\mathbf{u} \triangleq \begin{cases} \dfrac{\mathbf{b} - A\mathbf{x}^*}{\|\mathbf{b} - A\mathbf{x}^*\|_2} & \text{if } A\mathbf{x}^* \ne \mathbf{b}, \\ \text{any vector with unit } \ell_2 \text{ norm} & \text{otherwise}; \end{cases}$$
and let $\delta_i^* \triangleq -c_i\, \mathrm{sgn}(x_i^*)\, \mathbf{u}$. Observe that $\|\delta_i^*\|_2 \le c_i$, hence $\Delta A^* \triangleq (\delta_1^*, \cdots, \delta_m^*) \in \mathcal{U}$. Notice that
$$\begin{aligned}
\max_{\Delta A \in \mathcal{U}} \|\mathbf{b} - (A + \Delta A)\mathbf{x}^*\|_2
&\ge \|\mathbf{b} - (A + \Delta A^*)\mathbf{x}^*\|_2
= \Big\| (\mathbf{b} - A\mathbf{x}^*) - \sum_{i=1}^m \big(-x_i^* c_i\, \mathrm{sgn}(x_i^*)\, \mathbf{u}\big) \Big\|_2 \\
&= \Big\| (\mathbf{b} - A\mathbf{x}^*) + \Big( \sum_{i=1}^m c_i |x_i^*| \Big) \mathbf{u} \Big\|_2
= \|\mathbf{b} - A\mathbf{x}^*\|_2 + \sum_{i=1}^m c_i |x_i^*|. \qquad (5)
\end{aligned}$$
The last equation holds by the definition of $\mathbf{u}$. Combining Inequalities (4) and (5) establishes the equality $\max_{\Delta A \in \mathcal{U}} \|\mathbf{b} - (A + \Delta A)\mathbf{x}^*\|_2 = \|\mathbf{b} - A\mathbf{x}^*\|_2 + \sum_{i=1}^m c_i |x_i^*|$ for any $\mathbf{x}^*$. Minimizing over $\mathbf{x}$ on both sides proves the theorem.

Taking $c_i = c$ and normalizing $\mathbf{a}_i$ for all $i$, Problem (3) recovers the well-known Lasso [6], [7].
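The identity at the heart of Theorem 1 is easy to exercise numerically: for a fixed $\mathbf{x}$, the explicit perturbation $\delta_i^* = -c_i\, \mathrm{sgn}(x_i)\, \mathbf{u}$ from the proof attains the regularized value, while random admissible perturbations never exceed it. The following sketch (plain NumPy; the problem data are synthetic, chosen only for illustration) checks both facts:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 20, 5
A = rng.standard_normal((n, m))
b = rng.standard_normal(n)
x = rng.standard_normal(m)
c = np.full(m, 0.1)  # per-feature disturbance budgets c_i

residual = b - A @ x
reg_value = np.linalg.norm(residual) + np.sum(c * np.abs(x))

# Worst-case perturbation from the proof: delta_i^* = -c_i sgn(x_i) u.
u = residual / np.linalg.norm(residual)
dA_star = -u[:, None] * (c * np.sign(x))[None, :]
worst = np.linalg.norm(b - (A + dA_star) @ x)
assert np.isclose(worst, reg_value)  # (5): the upper bound is attained

# Random feasible perturbations never exceed the regularized value (4).
for _ in range(1000):
    dA = rng.standard_normal((n, m))
    dA *= c / np.maximum(np.linalg.norm(dA, axis=0), 1e-12)  # scale column i to norm c_i
    assert np.linalg.norm(b - (A + dA) @ x) <= reg_value + 1e-9
```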
B. Uncertainty Set Construction

The selection of an uncertainty set $\mathcal{U}$ in robust optimization is of fundamental importance. One way this can be done is as an approximation of so-called chance constraints, where a deterministic constraint is replaced by the requirement that a constraint is satisfied with at least some probability. These can be formulated when we know the distribution exactly, or when we have only partial information of the uncertainty, such as, e.g., first and second moments. This chance-constraint formulation is particularly important when the distribution has large support, rendering the naive robust optimization formulation overly pessimistic. For confidence level $\eta$, the chance constraint formulation becomes:
$$\begin{aligned} \text{minimize:} \quad & t \\ \text{subject to:} \quad & \Pr\big( \|\mathbf{b} - (A + \Delta A)\mathbf{x}\|_2 \le t \big) \ge 1 - \eta. \end{aligned}$$
Here, $\mathbf{x}$ and $t$ are the decision variables. Constructing the uncertainty set for feature $i$ can be done quickly via line search and bisection, as long as we can evaluate $\Pr(\|\delta_i\|_2 \ge c)$. If we know the distribution exactly (i.e., if we have complete probabilistic information), this can be done quickly via sampling.
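As an illustration of this sampling approach, the budget $c_i$ for confidence level $\eta$ can be found by bisecting on $c$ until the sampled estimate of $\Pr(\|\delta_i\|_2 \ge c)$ drops to $\eta$. A minimal sketch (NumPy; the helper name `calibrate_budget` and the Gaussian noise model are illustrative assumptions, not part of the paper):

```python
import numpy as np

def calibrate_budget(sample_noise, eta, n_samples=100_000, tol=1e-4):
    """Bisection for the smallest c with Pr(||delta_i||_2 >= c) <= eta,
    estimated from samples of the feature-noise distribution."""
    norms = np.linalg.norm(sample_noise(n_samples), axis=1)
    lo, hi = 0.0, norms.max()
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if np.mean(norms >= mid) <= eta:
            hi = mid  # mid is feasible; try a smaller budget
        else:
            lo = mid
    return hi

# Example: i.i.d. Gaussian noise on a feature column of length n = 50.
rng = np.random.default_rng(1)
c_i = calibrate_budget(lambda k: 0.05 * rng.standard_normal((k, 50)), eta=0.05)
```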
Another setting of interest is when we have access only to some moments of the distribution of the uncertainty, e.g., the mean and variance. In this setting, the uncertainty sets are constructed via a bisection procedure which evaluates the worst-case probability over all distributions with given mean and variance. We do this using a tight bound on the probability of an event, given the first two moments. In the scalar case, the Markov inequality provides such a bound. The next theorem is a generalization of the Markov inequality to $\mathbb{R}^n$, which bounds the probability that the disturbance on a given feature exceeds $c_i$, when only the first and second moments of the random variable are known. We postpone the proof to the appendix, and refer the reader to [15] for similar results using semi-definite optimization.

Theorem 2: Consider a random vector $\mathbf{v} \in \mathbb{R}^n$ such that $\mathbb{E}(\mathbf{v}) = \mathbf{a}$ and $\mathbb{E}(\mathbf{v}\mathbf{v}^\top) = \Sigma$, $\Sigma \succeq 0$. Then we have
$$\Pr\{\|\mathbf{v}\|_2 \ge c_i\} \;\le\; \begin{aligned}[t] \min_{P, \mathbf{q}, r, \lambda} \quad & \operatorname{Trace}(\Sigma P) + 2\mathbf{q}^\top \mathbf{a} + r \\ \text{subject to:} \quad & \begin{pmatrix} P & \mathbf{q} \\ \mathbf{q}^\top & r \end{pmatrix} \succeq 0; \quad \begin{pmatrix} I & \mathbf{0} \\ \mathbf{0}^\top & -c_i^2 \end{pmatrix} \preceq \lambda \begin{pmatrix} P & \mathbf{q} \\ \mathbf{q}^\top & r - 1 \end{pmatrix}; \quad \lambda \ge 0. \end{aligned} \qquad (6)$$

The optimization problem (6) is a semi-definite program, which is known to be solvable in polynomial time. Furthermore, if we replace $\mathbb{E}(\mathbf{v}\mathbf{v}^\top) = \Sigma$ by the inequality $\mathbb{E}(\mathbf{v}\mathbf{v}^\top) \preceq \Sigma$, the upper bound still holds. Thus, even if our estimate of the variance is not precise, we are still able to bound the probability of a "large" disturbance.
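The bound (6) can be evaluated with an off-the-shelf conic solver. In the sketch below (CVXPY is assumed available), the multiplier $\lambda$, which multiplies the decision variables, is handled by a simple grid search, solving a semi-definite program for each fixed value — a pragmatic simplification for illustration, since for fixed $\lambda$ the remaining problem is linear in $(P, \mathbf{q}, r)$:

```python
import numpy as np
import cvxpy as cp

def moment_bound(Sigma, a, c, lam_grid=np.logspace(-3, 3, 25)):
    """Sketch of bound (6): min Trace(Sigma P) + 2 q^T a + r under the
    constraints of Theorem 2, with lambda fixed on a grid."""
    n = len(a)
    E = np.zeros((n + 1, n + 1)); E[-1, -1] = 1.0  # shifts the r-corner to r - 1
    L = np.zeros((n + 1, n + 1)); L[:n, :n] = np.eye(n); L[-1, -1] = -c ** 2
    best = 1.0  # a probability never exceeds 1
    for lam in lam_grid:
        P = cp.Variable((n, n), symmetric=True)
        q = cp.Variable((n, 1)); r = cp.Variable()
        M = cp.bmat([[P, q], [q.T, cp.reshape(r, (1, 1))]])  # [[P, q], [q^T, r]]
        cons = [M >> 0, lam * (M - E) >> L]
        prob = cp.Problem(cp.Minimize(cp.trace(Sigma @ P) + 2 * (q[:, 0] @ a) + r), cons)
        prob.solve(solver=cp.SCS)
        if prob.value is not None and np.isfinite(prob.value):
            best = min(best, prob.value)
    return best

# Sanity check: with a = 0 and Sigma = I the optimum is Trace(Sigma)/c^2 = 1/3
# at lambda = c^2 = 9; this coarse grid misses that point, returning about 0.4.
print(moment_bound(np.eye(3), np.zeros(3), c=3.0))
```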
III. GENERAL UNCERTAINTY SETS

One reason the robust optimization formulation is powerful is that, having provided the connection to Lasso, it then allows the opportunity to generalize to efficient "Lasso-like" regularization algorithms. In this section, we make several generalizations of the robust formulation (1) and derive counterparts of Theorem 1. We generalize the robust formulation in two ways: (a) to the case of an arbitrary norm; and (b) to the case of coupled uncertainty sets.

We first consider the case of an arbitrary norm $\|\cdot\|_a$ of $\mathbb{R}^n$ as a cost function rather than the squared loss. The proof of the next theorem is identical to that of Theorem 1, with only the $\ell_2$ norm changed to $\|\cdot\|_a$.

Theorem 3: The robust regression problem
$$\min_{\mathbf{x} \in \mathbb{R}^m} \Big\{ \max_{\Delta A \in \mathcal{U}_a} \|\mathbf{b} - (A + \Delta A)\mathbf{x}\|_a \Big\}; \qquad \mathcal{U}_a \triangleq \Big\{ (\delta_1, \cdots, \delta_m) \,\Big|\, \|\delta_i\|_a \le c_i,\; i = 1, \cdots, m \Big\};$$
is equivalent to the following regularized regression problem
$$\min_{\mathbf{x} \in \mathbb{R}^m} \Big\{ \|\mathbf{b} - A\mathbf{x}\|_a + \sum_{i=1}^m c_i |x_i| \Big\}.$$

We next remove the assumption that the disturbances are feature-wise uncoupled. Allowing coupled uncertainty sets is useful when we have some additional information about potential noise in the problem, and we want to limit the conservativeness of the worst-case formulation. Consider the following uncertainty set:
$$\mathcal{U}' \triangleq \Big\{ (\delta_1, \cdots, \delta_m) \,\Big|\, f_j\big(\|\delta_1\|_a, \cdots, \|\delta_m\|_a\big) \le 0;\; j = 1, \cdots, k \Big\},$$
where the $f_j(\cdot)$ are convex functions. Notice that both $k$ and $f_j$ can be arbitrary; hence this is a very general formulation, and provides us with significant flexibility in designing uncertainty sets and, equivalently, new regression algorithms (see for example Corollaries 1 and 2). The following theorem converts this formulation to tractable optimization problems. The proof is postponed to the appendix.

Theorem 4: Assume that the set $\mathcal{Z} \triangleq \{\mathbf{z} \in \mathbb{R}^m \mid f_j(\mathbf{z}) \le 0,\; j = 1, \cdots, k;\; \mathbf{z} \ge \mathbf{0}\}$ has non-empty relative interior. Then the robust regression problem
$$\min_{\mathbf{x} \in \mathbb{R}^m} \Big\{ \max_{\Delta A \in \mathcal{U}'} \|\mathbf{b} - (A + \Delta A)\mathbf{x}\|_a \Big\}$$
is equivalent to the following regularized regression problem:
$$\min_{\boldsymbol{\lambda} \in \mathbb{R}_+^k,\, \boldsymbol{\kappa} \in \mathbb{R}_+^m,\, \mathbf{x} \in \mathbb{R}^m} \Big\{ \|\mathbf{b} - A\mathbf{x}\|_a + v(\boldsymbol{\lambda}, \boldsymbol{\kappa}, \mathbf{x}) \Big\}; \quad \text{where: } v(\boldsymbol{\lambda}, \boldsymbol{\kappa}, \mathbf{x}) \triangleq \max_{\mathbf{c} \in \mathbb{R}^m} \Big[ (\boldsymbol{\kappa} + |\mathbf{x}|)^\top \mathbf{c} - \sum_{j=1}^k \lambda_j f_j(\mathbf{c}) \Big]. \qquad (7)$$

Remark: Problem (7) is efficiently solvable. Denote $z_{\mathbf{c}}(\boldsymbol{\lambda}, \boldsymbol{\kappa}, \mathbf{x}) \triangleq (\boldsymbol{\kappa} + |\mathbf{x}|)^\top \mathbf{c} - \sum_{j=1}^k \lambda_j f_j(\mathbf{c})$. This is a convex function of $(\boldsymbol{\lambda}, \boldsymbol{\kappa}, \mathbf{x})$, and the sub-gradient of $z_{\mathbf{c}}(\cdot)$ can be computed easily for any $\mathbf{c}$. The function $v(\boldsymbol{\lambda}, \boldsymbol{\kappa}, \mathbf{x})$ is the maximum of a set of convex functions $z_{\mathbf{c}}(\cdot)$, hence is convex, and satisfies
$$\partial v(\boldsymbol{\lambda}^*, \boldsymbol{\kappa}^*, \mathbf{x}^*) = \partial z_{\mathbf{c}_0}(\boldsymbol{\lambda}^*, \boldsymbol{\kappa}^*, \mathbf{x}^*),$$
where $\mathbf{c}_0$ maximizes $(\boldsymbol{\kappa}^* + |\mathbf{x}^*|)^\top \mathbf{c} - \sum_{j=1}^k \lambda_j^* f_j(\mathbf{c})$. We can efficiently evaluate $\mathbf{c}_0$ due to convexity of the $f_j(\cdot)$, and hence we can efficiently evaluate the sub-gradient of $v(\cdot)$.

The next two corollaries are a direct application of Theorem 4.

Corollary 1: Suppose
$$\mathcal{U}' = \Big\{ (\delta_1, \cdots, \delta_m) \,\Big|\, \big\| \big( \|\delta_1\|_a, \cdots, \|\delta_m\|_a \big) \big\|_s \le l \Big\}$$
for a symmetric norm $\|\cdot\|_s$; then the resulting regularized regression problem is
$$\min_{\mathbf{x} \in \mathbb{R}^m} \Big\{ \|\mathbf{b} - A\mathbf{x}\|_a + l\, \|\mathbf{x}\|_s^* \Big\},$$
where $\|\cdot\|_s^*$ is the dual norm of $\|\cdot\|_s$.

This corollary interprets arbitrary norm-based regularizers from a robust regression perspective. For example, it is straightforward to show that if we take both $\|\cdot\|_a$ and $\|\cdot\|_s$ as the Euclidean norm, then $\mathcal{U}'$ is the set of matrices with bounded Frobenius norm, and Corollary 1 reduces to the robust formulation introduced by [13].

Corollary 2: Suppose
$$\mathcal{U}' = \Big\{ (\delta_1, \cdots, \delta_m) \,\Big|\, \exists\, \mathbf{c} \ge \mathbf{0}: T\mathbf{c} \le \mathbf{s};\; \|\delta_j\|_a \le c_j \Big\};$$
then the resulting regularized regression problem is
$$\begin{aligned} \text{minimize:} \quad & \|\mathbf{b} - A\mathbf{x}\|_a + \mathbf{s}^\top \boldsymbol{\lambda} \\ \text{subject to:} \quad & \mathbf{x} \le T^\top \boldsymbol{\lambda};\quad -\mathbf{x} \le T^\top \boldsymbol{\lambda};\quad \boldsymbol{\lambda} \ge \mathbf{0}. \end{aligned}$$

Unlike the previous results, this corollary considers general polytope uncertainty sets. Advantages of such sets include the linearity of the final formulation. Moreover, the modeling power is considerable, as many interesting disturbances can be modeled in this way.

We briefly mention some further examples meant to illustrate the power and flexibility of the robust formulation, and refer the interested reader to [16] for full details. As the results above indicate, the robust formulation can model a broad class of uncertainties and yield computationally tractable (i.e., convex) problems. In particular, one can use the polytope uncertainty discussed above to show (see [16]) that by employing an uncertainty set first used in [17], we can model cardinality-constrained noise, where some (unknown) subset of at most $k$ features can be corrupted. Another avenue one may take using robustness, and which is also possible to solve easily, is the case where the uncertainty set allows independent perturbation of the columns and the rows of the matrix $A$. The resulting formulation resembles the elastic-net formulation [18], where there is a combination of $\ell_2$ and $\ell_1$ regularization.
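To make Corollary 2 concrete, the resulting problem is directly expressible in a convex modeling language. A minimal sketch (CVXPY assumed; $T$ and $\mathbf{s}$ are arbitrary illustration data), using the $\ell_2$ loss for $\|\cdot\|_a$:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(2)
n, m, k = 30, 8, 3
A = rng.standard_normal((n, m)); b = rng.standard_normal(n)
T = np.abs(rng.standard_normal((k, m)))  # polytope T c <= s on the budgets c
s = np.ones(k)

x = cp.Variable(m)
lam = cp.Variable(k)
constraints = [x <= T.T @ lam, -x <= T.T @ lam, lam >= 0]
objective = cp.Minimize(cp.norm(b - A @ x, 2) + s @ lam)
cp.Problem(objective, constraints).solve()
print(x.value, lam.value)
```

With an $\ell_1$ or $\ell_\infty$ loss in place of the $\ell_2$ norm, every term becomes piecewise linear and the whole problem is a linear program — the linearity advantage mentioned above.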
IV. SPARSITY

In this section, we investigate the sparsity properties of robust regression (1), and equivalently Lasso. Lasso's ability to recover sparse solutions has been extensively studied and discussed (cf. [8]–[11]). There are generally two approaches. The first approach investigates the problem from a statistical perspective. That is, it assumes that the observations are generated by a (sparse) linear combination of the features, and investigates the asymptotic or probabilistic conditions required for Lasso to correctly recover the generative model. The second approach treats the problem from an optimization perspective, and studies under what conditions a pair $(A, \mathbf{b})$ defines a problem with sparse solutions (e.g., [19]).

We follow the second approach and do not assume a generative model. Instead, we consider the conditions that lead to a feature receiving zero weight. Our first result paves the way for the remainder of this section. We show in Theorem 5 that, essentially, a feature receives no weight (namely, $x_i^* = 0$) if there exists an allowable perturbation of that feature which makes it irrelevant. This result holds for general norm loss functions, but in the $\ell_2$ case we obtain further geometric results. For instance, using Theorem 5, we show, among other results, that "nearly" orthogonal features get zero weight (Theorem 6). Using similar tools, we provide additional results in [16]. There, we show, among other results, that the sparsity pattern of any optimal solution must satisfy certain angular separation conditions between the residual and the relevant features, and that "nearly" linearly dependent features get zero weight.

Substantial research regarding sparsity properties of Lasso can be found in the literature (cf. [8]–[11], [20]–[23] and many others). In particular, similar results that rely on an incoherence property have been established in, e.g., [19], and are used as standard tools in investigating sparsity of Lasso from the statistical perspective. However, a proof exploiting robustness and properties of the uncertainty set is novel. Indeed, such a proof shows a fundamental connection between robustness and sparsity, and implies that robustifying with respect to a feature-wise independent uncertainty set might be a plausible way to achieve sparsity for other problems.

To state the main theorem of this section, from which the other results derive, we introduce some notation to facilitate the discussion. Given a feature-wise uncoupled uncertainty set $\mathcal{U}$, an index subset $I \subseteq \{1, \ldots, m\}$, and any $\Delta A \in \mathcal{U}$, let $\Delta A_I$ denote the element of $\mathcal{U}$ that equals $\Delta A$ on each feature indexed by $i \in I$, and is zero elsewhere. Then, we can write any element $\Delta A \in \mathcal{U}$ as $\Delta A_I + \Delta A_{I^c}$ (where $I^c = \{1, \ldots, m\} \setminus I$). Then we have the following theorem. We note that the result holds for any norm loss function, but we state and prove it for the $\ell_2$ norm, since the proof for other norms is identical.

Theorem 5: The robust regression problem
$$\min_{\mathbf{x} \in \mathbb{R}^m} \Big\{ \max_{\Delta A \in \mathcal{U}} \|\mathbf{b} - (A + \Delta A)\mathbf{x}\|_2 \Big\}$$
has a solution supported on an index set $I$ if there exists some perturbation $\Delta \tilde{A}_{I^c} \in \mathcal{U}$ of the features in $I^c$, such that the robust regression problem
$$\min_{\mathbf{x} \in \mathbb{R}^m} \Big\{ \max_{\Delta \tilde{A}_I \in \mathcal{U}_I} \|\mathbf{b} - (A + \Delta \tilde{A}_{I^c} + \Delta \tilde{A}_I)\mathbf{x}\|_2 \Big\}$$
has a solution supported on the set $I$.

Thus, robust regression has an optimal solution supported on a set $I$ if some allowable perturbation of the features in the complement of $I$ makes them irrelevant. Theorem 5 is a special case of the following theorem with $c_j = 0$ for all $j \notin I$.
Theorem 5′: Let $\mathbf{x}^*$ be an optimal solution of the robust regression problem
$$\min_{\mathbf{x} \in \mathbb{R}^m} \Big\{ \max_{\Delta A \in \mathcal{U}} \|\mathbf{b} - (A + \Delta A)\mathbf{x}\|_2 \Big\},$$
and let $I \subseteq \{1, \cdots, m\}$ be such that $x_j^* = 0$ for all $j \notin I$. Let
$$\tilde{\mathcal{U}} \triangleq \Big\{ (\delta_1, \cdots, \delta_m) \,\Big|\, \|\delta_i\|_2 \le c_i,\; i \in I;\; \|\delta_j\|_2 \le c_j + l_j,\; j \notin I \Big\}.$$
Then $\mathbf{x}^*$ is an optimal solution of
$$\min_{\mathbf{x} \in \mathbb{R}^m} \Big\{ \max_{\Delta A \in \tilde{\mathcal{U}}} \|\mathbf{b} - (\tilde{A} + \Delta A)\mathbf{x}\|_2 \Big\}$$
for any $\tilde{A}$ that satisfies $\|\tilde{\mathbf{a}}_j - \mathbf{a}_j\| \le l_j$ for $j \notin I$, and $\tilde{\mathbf{a}}_i = \mathbf{a}_i$ for $i \in I$.

Proof: Notice that
$$\max_{\Delta A \in \tilde{\mathcal{U}}} \|\mathbf{b} - (A + \Delta A)\mathbf{x}^*\|_2 = \max_{\Delta A \in \mathcal{U}} \|\mathbf{b} - (A + \Delta A)\mathbf{x}^*\|_2 = \max_{\Delta A \in \mathcal{U}} \|\mathbf{b} - (\tilde{A} + \Delta A)\mathbf{x}^*\|_2.$$
These equalities hold because for $j \notin I$ we have $x_j^* = 0$, hence the $j$th columns of both $\tilde{A}$ and $\Delta A$ have no effect on the residual. For an arbitrary $\mathbf{x}'$, we have
$$\max_{\Delta A \in \tilde{\mathcal{U}}} \|\mathbf{b} - (\tilde{A} + \Delta A)\mathbf{x}'\|_2 \ge \max_{\Delta A \in \mathcal{U}} \|\mathbf{b} - (A + \Delta A)\mathbf{x}'\|_2.$$
This is because $\|\mathbf{a}_j - \tilde{\mathbf{a}}_j\| \le l_j$ for $j \notin I$ and $\mathbf{a}_i = \tilde{\mathbf{a}}_i$ for $i \in I$; hence we have
$$\big\{ A + \Delta A \,\big|\, \Delta A \in \mathcal{U} \big\} \subseteq \big\{ \tilde{A} + \Delta A \,\big|\, \Delta A \in \tilde{\mathcal{U}} \big\}.$$
Finally, notice that, by optimality of $\mathbf{x}^*$,
$$\max_{\Delta A \in \mathcal{U}} \|\mathbf{b} - (A + \Delta A)\mathbf{x}^*\|_2 \le \max_{\Delta A \in \mathcal{U}} \|\mathbf{b} - (A + \Delta A)\mathbf{x}'\|_2.$$
Therefore we have
$$\max_{\Delta A \in \tilde{\mathcal{U}}} \|\mathbf{b} - (\tilde{A} + \Delta A)\mathbf{x}^*\|_2 \le \max_{\Delta A \in \tilde{\mathcal{U}}} \|\mathbf{b} - (\tilde{A} + \Delta A)\mathbf{x}'\|_2.$$
Since this holds for arbitrary $\mathbf{x}'$, we establish the theorem.

We can interpret the result of this theorem by considering a generative model¹ $\mathbf{b} = \sum_{i \in I} w_i \mathbf{a}_i + \tilde{\xi}$, where $I \subseteq \{1, \cdots, m\}$ and $\tilde{\xi}$ is a random variable; i.e., $\mathbf{b}$ is generated by the features belonging to $I$. In this case, for a feature $j \notin I$, Lasso would assign zero weight as long as there exists a perturbed value of this feature such that the optimal regression assigns it zero weight.

(¹ While we are not assuming generative models to establish the results, it is still interesting to see how these results can help in a generative model setup.)

When we consider the $\ell_2$ loss, we can translate the condition of a feature being "irrelevant" into a geometric condition, namely, orthogonality. We now use the result of Theorem 5 to show that robust regression has a sparse solution as long as an incoherence-type property is satisfied. This result is more in line with the traditional sparsity results, but we note that the geometric reasoning is different, and ours is based on robustness. Indeed, we show that a feature receives zero weight if it is "nearly" (i.e., within an allowable perturbation) orthogonal to the signal and all relevant features.

Theorem 6: Let $c_i = c$ for all $i$ and consider the $\ell_2$ loss. If there exists $I \subset \{1, \cdots, m\}$ such that for all $\mathbf{v} \in \mathrm{span}\big( \{\mathbf{a}_i,\, i \in I\} \cup \{\mathbf{b}\} \big)$ with $\|\mathbf{v}\| = 1$, we have $\mathbf{v}^\top \mathbf{a}_j \le c$ for all $j \notin I$, then any optimal solution $\mathbf{x}^*$ satisfies $x_j^* = 0$ for all $j \notin I$.

Proof: For $j \notin I$, let $\mathbf{a}_j^=$ denote the projection of $\mathbf{a}_j$ onto the span of $\{\mathbf{a}_i,\, i \in I\} \cup \{\mathbf{b}\}$, and let $\mathbf{a}_j^+ \triangleq \mathbf{a}_j - \mathbf{a}_j^=$. Thus, we have $\|\mathbf{a}_j^=\| \le c$. Let $\hat{A}$ be such that
$$\hat{\mathbf{a}}_i = \begin{cases} \mathbf{a}_i & i \in I; \\ \mathbf{a}_i^+ & i \notin I. \end{cases}$$
Now let $\hat{\mathcal{U}} \triangleq \{(\delta_1, \cdots, \delta_m) \mid \|\delta_i\|_2 \le c,\; i \in I;\; \|\delta_j\|_2 = 0,\; j \notin I\}$. Consider the robust regression problem $\min_{\hat{\mathbf{x}}} \big\{ \max_{\Delta A \in \hat{\mathcal{U}}} \|\mathbf{b} - (\hat{A} + \Delta A)\hat{\mathbf{x}}\|_2 \big\}$, which is equivalent to
$$\min_{\hat{\mathbf{x}}} \Big\{ \|\mathbf{b} - \hat{A}\hat{\mathbf{x}}\|_2 + \sum_{i \in I} c\, |\hat{x}_i| \Big\}.$$
Note that the $\hat{\mathbf{a}}_j$, $j \notin I$, are orthogonal to the span of $\{\hat{\mathbf{a}}_i,\, i \in I\} \cup \{\mathbf{b}\}$. Hence for any given $\hat{\mathbf{x}}$, changing $\hat{x}_j$ to zero for all $j \notin I$ does not increase the objective. Since $\|\hat{\mathbf{a}}_j - \mathbf{a}_j\| = \|\mathbf{a}_j^=\| \le c$ for all $j \notin I$ (and recall that $\mathcal{U} = \{(\delta_1, \cdots, \delta_m) \mid \|\delta_i\|_2 \le c,\, \forall i\}$), applying Theorem 5 concludes the proof.
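Theorem 6 is easy to exercise numerically: build $\mathbf{b}$ from a few features, append a feature whose projection onto the span of the relevant features and $\mathbf{b}$ has norm smaller than $c$, and check that it receives zero weight. A small sketch (CVXPY assumed; the data are synthetic, and the in-span component of size $0.05 < c$ is chosen to satisfy the theorem's condition by construction):

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(3)
n, c = 50, 0.3
a1, a2 = rng.standard_normal(n), rng.standard_normal(n)
b = 2.0 * a1 - 1.0 * a2 + 0.1 * rng.standard_normal(n)

# A feature nearly orthogonal to span{a1, a2, b}: tiny in-span component.
S = np.linalg.qr(np.column_stack([a1, a2, b]))[0]  # orthonormal basis of the span
w = rng.standard_normal(n)
a3 = w - S @ (S.T @ w)                             # component orthogonal to the span
a3 = a3 / np.linalg.norm(a3) + 0.05 * S[:, 0]      # in-span projection has norm 0.05 < c

A = np.column_stack([a1, a2, a3])
x = cp.Variable(3)
cp.Problem(cp.Minimize(cp.norm(b - A @ x, 2) + c * cp.norm1(x))).solve()
print(x.value)  # the third coefficient should be (numerically) zero
```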
V. DENSITY ESTIMATION AND CONSISTENCY

In this section, we investigate the robust linear regression formulation from a statistical perspective and rederive, using only robustness properties, that Lasso is asymptotically consistent. The basic idea of the consistency proof is as follows. We show that the robust optimization formulation can be seen as the maximum error with respect to a class of probability measures. This class includes a kernel density estimator, and using this, we show that Lasso is consistent.

A. Robust Optimization, Worst-case Expected Utility and Kernel Density Estimator

In this subsection, we present some notions and intermediate results. In particular, we link a robust optimization formulation with a worst-case expected utility (with respect to a class of probability measures); we then briefly recall the definition of a kernel density estimator. These results will be used in establishing the consistency of Lasso, as well as providing some additional insights on robust optimization. Proofs are postponed to the appendix.

We first establish a general result on the equivalence between a robust optimization formulation and a worst-case expected utility:

Proposition 1: Given a function $h: \mathbb{R}^{m+1} \to \mathbb{R}$ and Borel sets $\mathcal{Z}_1, \cdots, \mathcal{Z}_n \subseteq \mathbb{R}^{m+1}$, let
$$\mathcal{P}_n \triangleq \Big\{ \mu \in \mathcal{P} \,\Big|\, \forall S \subseteq \{1, \cdots, n\}: \mu\Big(\bigcup_{i \in S} \mathcal{Z}_i\Big) \ge |S|/n \Big\}.$$
The following holds:
$$\frac{1}{n} \sum_{i=1}^n \sup_{(\mathbf{r}_i, b_i) \in \mathcal{Z}_i} h(\mathbf{r}_i, b_i) = \sup_{\mu \in \mathcal{P}_n} \int_{\mathbb{R}^{m+1}} h(\mathbf{r}, b)\, d\mu(\mathbf{r}, b).$$

This leads to the following corollary for Lasso, which states that for a given $\mathbf{x}$, the robust regression loss over the training data is equal to the worst-case expected generalization error.

Corollary 3: Given $\mathbf{b} \in \mathbb{R}^n$, $A \in \mathbb{R}^{n \times m}$, the following equation holds for any $\mathbf{x} \in \mathbb{R}^m$:
$$\|\mathbf{b} - A\mathbf{x}\|_2 + \sqrt{n}\, c_n \|\mathbf{x}\|_1 + \sqrt{n}\, c_n = \sup_{\mu \in \hat{\mathcal{P}}(n)} \sqrt{ n \int_{\mathbb{R}^{m+1}} (b' - \mathbf{r}'^\top \mathbf{x})^2 \, d\mu(\mathbf{r}', b') }. \qquad (8)$$
Here²,
$$\hat{\mathcal{P}}(n) \triangleq \bigcup_{\|\boldsymbol{\sigma}\|_2 \le \sqrt{n} c_n;\; \forall i: \|\delta_i\|_2 \le \sqrt{n} c_n} \mathcal{P}_n(A, \Delta, \mathbf{b}, \boldsymbol{\sigma});$$
$$\mathcal{P}_n(A, \Delta, \mathbf{b}, \boldsymbol{\sigma}) \triangleq \Big\{ \mu \in \mathcal{P} \,\Big|\, \mathcal{Z}_i = [b_i - \sigma_i, b_i + \sigma_i] \times \prod_{j=1}^m [a_{ij} - \delta_{ij}, a_{ij} + \delta_{ij}];\; \forall S \subseteq \{1, \cdots, n\}: \mu\Big(\bigcup_{i \in S} \mathcal{Z}_i\Big) \ge |S|/n \Big\}.$$

(² Recall that $a_{ij}$ is the $j$th element of $\mathbf{r}_i$.)

Remark 1: We briefly explain Corollary 3 to avoid possible confusion. Equation (8) is a non-probabilistic equality. That is, it holds without any assumption (e.g., i.i.d., or generated by certain distributions) on $\mathbf{b}$ and $A$, and it does not involve any probabilistic operation such as taking an expectation on the left-hand side; instead, it is an equivalence relationship which holds for an arbitrary set of samples. Notice that the right-hand side also depends on the samples, since $\hat{\mathcal{P}}(n)$ is defined through $A$ and $\mathbf{b}$. Indeed, $\hat{\mathcal{P}}(n)$ represents the union of classes of distributions $\mathcal{P}_n(A, \Delta, \mathbf{b}, \boldsymbol{\sigma})$ such that the norm of each column of $\Delta$ is bounded, where $\mathcal{P}_n(A, \Delta, \mathbf{b}, \boldsymbol{\sigma})$ is the set of distributions corresponding to (see Proposition 1) disturbances in hyper-rectangular Borel sets $\mathcal{Z}_1, \cdots, \mathcal{Z}_n$ centered at $(b_i, \mathbf{r}_i^\top)$ with side lengths $(2\sigma_i, 2\delta_{i1}, \cdots, 2\delta_{im})$.

We will later show that $\hat{\mathcal{P}}(n)$ contains a kernel density estimator; hence we recall its definition here. The kernel density estimator for a density $\hat{h}$ in $\mathbb{R}^d$, originally proposed in [24], [25], is defined by
$$h_n(\mathbf{x}) = (n c_n^d)^{-1} \sum_{i=1}^n K\Big( \frac{\mathbf{x} - \hat{\mathbf{x}}_i}{c_n} \Big),$$
where $\{c_n\}$ is a sequence of positive numbers, the $\hat{\mathbf{x}}_i$ are i.i.d. samples generated according to $\hat{h}$, and $K$ is a Borel measurable function (kernel) satisfying $K \ge 0$, $\int K = 1$. See [26], [27] and the references therein for detailed discussions. Figure 1 illustrates a kernel density estimator using a Gaussian kernel for a randomly generated sample set.

[Fig. 1. Illustration of kernel density estimation: a sample set, the kernel function, and the estimated density.]

A celebrated property of the kernel density estimator is that it converges in $L_1$ to $\hat{h}$ when $c_n \downarrow 0$ and $n c_n^d \uparrow \infty$ [26].
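The following one-dimensional sketch (NumPy, Gaussian kernel; the sample distribution and bandwidth are arbitrary illustration choices) reproduces the construction behind Figure 1:

```python
import numpy as np

def kde(samples, c_n, kernel=lambda u: np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)):
    """Kernel density estimator h_n(x) = (n c_n)^{-1} sum_i K((x - x_i)/c_n) in R^1."""
    samples = np.asarray(samples)
    n = len(samples)
    return lambda x: kernel((np.asarray(x)[..., None] - samples) / c_n).sum(-1) / (n * c_n)

rng = np.random.default_rng(4)
xs = rng.normal(0.0, 3.0, size=25)      # 25 samples from the "true" density
h_n = kde(xs, c_n=1.0)
grid = np.linspace(-10, 10, 201)
density = h_n(grid)
print(density.sum() * (grid[1] - grid[0]))  # Riemann sum: approximately 1
```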
B. Consistency of Lasso

We restrict our discussion to the case where the magnitude of the allowable uncertainty for all features equals $c$ (i.e., the standard Lasso), and establish the statistical consistency of Lasso from a distributional robustness argument. Generalization to the non-uniform case is straightforward. Throughout, we use $c_n$ to represent $c$ when there are $n$ samples (we take $c_n$ to zero as $n$ grows).

Recall the standard generative model in statistical learning: let $\mathbb{P}$ be a probability measure with bounded support that generates i.i.d. samples $(b_i, \mathbf{r}_i)$, and has a density $f^*(\cdot)$. Denote the set of the first $n$ samples by $\mathcal{S}_n$. Define
$$\mathbf{x}(c_n, \mathcal{S}_n) \triangleq \arg\min_{\mathbf{x}} \Big\{ \sqrt{\tfrac{1}{n} \textstyle\sum_{i=1}^n (b_i - \mathbf{r}_i^\top \mathbf{x})^2} + c_n \|\mathbf{x}\|_1 \Big\} = \arg\min_{\mathbf{x}} \Big\{ \tfrac{\sqrt{n}}{n} \sqrt{\textstyle\sum_{i=1}^n (b_i - \mathbf{r}_i^\top \mathbf{x})^2} + c_n \|\mathbf{x}\|_1 \Big\};$$
$$\mathbf{x}(\mathbb{P}) \triangleq \arg\min_{\mathbf{x}} \Big\{ \sqrt{ \int_{b, \mathbf{r}} (b - \mathbf{r}^\top \mathbf{x})^2 \, d\mathbb{P}(b, \mathbf{r}) } \Big\}.$$
In words, $\mathbf{x}(c_n, \mathcal{S}_n)$ is the solution to Lasso with the tradeoff parameter set to $c_n \sqrt{n}$, and $\mathbf{x}(\mathbb{P})$ is the "true" optimal solution. We have the following consistency result. The theorem itself is a well-known result; however, the proof technique is novel. This technique is of interest because the standard techniques to establish consistency in statistical learning, including Vapnik–Chervonenkis (VC) dimension (e.g., [28]) and algorithmic stability (e.g., [29]), often work only for a limited range of algorithms; e.g., the $k$-Nearest Neighbor method is known to have infinite VC dimension, and we show in Section VI that Lasso is not stable. In contrast, a much wider range of algorithms have robustness interpretations, allowing a unified approach to prove their consistency.

Theorem 7: Let $\{c_n\}$ be such that $c_n \downarrow 0$ and $\lim_{n \to \infty} n (c_n)^{m+1} = \infty$. Suppose there exists a constant $H$ such that $\|\mathbf{x}(c_n, \mathcal{S}_n)\|_2 \le H$. Then,
$$\lim_{n \to \infty} \sqrt{ \int_{b, \mathbf{r}} \big(b - \mathbf{r}^\top \mathbf{x}(c_n, \mathcal{S}_n)\big)^2 \, d\mathbb{P}(b, \mathbf{r}) } = \sqrt{ \int_{b, \mathbf{r}} \big(b - \mathbf{r}^\top \mathbf{x}(\mathbb{P})\big)^2 \, d\mathbb{P}(b, \mathbf{r}) },$$
almost surely.

Proof: Step 1: We show that the right hand side of Equation (8) includes a kernel density estimator for the true (unknown) distribution. Consider the following kernel estimator, given samples $\mathcal{S}_n = (b_i, \mathbf{r}_i)_{i=1}^n$ and tradeoff parameter $c_n$:
$$f_n(b, \mathbf{r}) \triangleq (n c_n^{m+1})^{-1} \sum_{i=1}^n K\Big( \frac{(b, \mathbf{r}) - (b_i, \mathbf{r}_i)}{c_n} \Big), \quad \text{where: } K(\mathbf{x}) \triangleq \mathbb{I}_{[-1, +1]^{m+1}}(\mathbf{x}) / 2^{m+1}. \qquad (9)$$
Let $\hat{\mu}_n$ denote the distribution given by the density function $f_n(b, \mathbf{r})$.
It is easy to check that $\hat{\mu}_n$ belongs to $\mathcal{P}_n(A, (c_n \mathbf{1}_n, \cdots, c_n \mathbf{1}_n), \mathbf{b}, c_n \mathbf{1}_n)$, and hence belongs to $\hat{\mathcal{P}}(n)$ by definition.

Step 2: Using the $L_1$ convergence property of the kernel density estimator, we prove the consistency of robust regression, and equivalently Lasso. First notice that $\|\mathbf{x}(c_n, \mathcal{S}_n)\|_2 \le H$ and the bounded support of $\mathbb{P}$ imply that there exists a universal constant $C$ such that $\max_{b, \mathbf{r}} (b - \mathbf{r}^\top \mathbf{x}(c_n, \mathcal{S}_n))^2 \le C$. By Corollary 3 and $\hat{\mu}_n \in \hat{\mathcal{P}}(n)$ we have
$$\begin{aligned}
\sqrt{ \int_{b, \mathbf{r}} \big(b - \mathbf{r}^\top \mathbf{x}(c_n, \mathcal{S}_n)\big)^2 \, d\hat{\mu}_n(b, \mathbf{r}) }
&\le \sup_{\mu \in \hat{\mathcal{P}}(n)} \sqrt{ \int_{b, \mathbf{r}} \big(b - \mathbf{r}^\top \mathbf{x}(c_n, \mathcal{S}_n)\big)^2 \, d\mu(b, \mathbf{r}) } \\
&= \frac{\sqrt{n}}{n} \sqrt{ \sum_{i=1}^n \big(b_i - \mathbf{r}_i^\top \mathbf{x}(c_n, \mathcal{S}_n)\big)^2 } + c_n \|\mathbf{x}(c_n, \mathcal{S}_n)\|_1 + c_n \\
&\le \frac{\sqrt{n}}{n} \sqrt{ \sum_{i=1}^n \big(b_i - \mathbf{r}_i^\top \mathbf{x}(\mathbb{P})\big)^2 } + c_n \|\mathbf{x}(\mathbb{P})\|_1 + c_n,
\end{aligned}$$
where the last inequality holds by the definition of $\mathbf{x}(c_n, \mathcal{S}_n)$. Taking the square of both sides, we have
$$\int_{b, \mathbf{r}} \big(b - \mathbf{r}^\top \mathbf{x}(c_n, \mathcal{S}_n)\big)^2 \, d\hat{\mu}_n(b, \mathbf{r}) \le \frac{1}{n} \sum_{i=1}^n \big(b_i - \mathbf{r}_i^\top \mathbf{x}(\mathbb{P})\big)^2 + c_n^2 \big(1 + \|\mathbf{x}(\mathbb{P})\|_1\big)^2 + 2 c_n \big(1 + \|\mathbf{x}(\mathbb{P})\|_1\big) \sqrt{ \frac{1}{n} \sum_{i=1}^n \big(b_i - \mathbf{r}_i^\top \mathbf{x}(\mathbb{P})\big)^2 }.$$
Notice that the right-hand side converges to $\int_{b, \mathbf{r}} (b - \mathbf{r}^\top \mathbf{x}(\mathbb{P}))^2 \, d\mathbb{P}(b, \mathbf{r})$ almost surely as $n \uparrow \infty$ and $c_n \downarrow 0$. Furthermore, we have
$$\begin{aligned}
\int_{b, \mathbf{r}} \big(b - \mathbf{r}^\top \mathbf{x}(c_n, \mathcal{S}_n)\big)^2 \, d\mathbb{P}(b, \mathbf{r})
&\le \int_{b, \mathbf{r}} \big(b - \mathbf{r}^\top \mathbf{x}(c_n, \mathcal{S}_n)\big)^2 \, d\hat{\mu}_n(b, \mathbf{r}) + \Big( \max_{b, \mathbf{r}} \big(b - \mathbf{r}^\top \mathbf{x}(c_n, \mathcal{S}_n)\big)^2 \Big) \int_{b, \mathbf{r}} |f_n(b, \mathbf{r}) - f^*(b, \mathbf{r})| \, d(b, \mathbf{r}) \\
&\le \int_{b, \mathbf{r}} \big(b - \mathbf{r}^\top \mathbf{x}(c_n, \mathcal{S}_n)\big)^2 \, d\hat{\mu}_n(b, \mathbf{r}) + C \int_{b, \mathbf{r}} |f_n(b, \mathbf{r}) - f^*(b, \mathbf{r})| \, d(b, \mathbf{r}),
\end{aligned}$$
where the last inequality follows from the definition of $C$. Notice that $\int_{b, \mathbf{r}} |f_n(b, \mathbf{r}) - f^*(b, \mathbf{r})| \, d(b, \mathbf{r})$ goes to zero almost surely when $c_n \downarrow 0$ and $n c_n^{m+1} \uparrow \infty$, since $f_n(\cdot)$ is a kernel density estimator of $f^*(\cdot)$ (see e.g. Theorem 3.1 of [26]). Hence the theorem follows.

We can remove the assumption that $\|\mathbf{x}(c_n, \mathcal{S}_n)\|_2 \le H$; as in Theorem 7, the proof technique rather than the result itself is of interest.

Theorem 8: Let $\{c_n\}$ converge to zero sufficiently slowly. Then
$$\lim_{n \to \infty} \sqrt{ \int_{b, \mathbf{r}} \big(b - \mathbf{r}^\top \mathbf{x}(c_n, \mathcal{S}_n)\big)^2 \, d\mathbb{P}(b, \mathbf{r}) } = \sqrt{ \int_{b, \mathbf{r}} \big(b - \mathbf{r}^\top \mathbf{x}(\mathbb{P})\big)^2 \, d\mathbb{P}(b, \mathbf{r}) },$$
almost surely.

Proof: To prove the theorem, we need to consider a set of distributions belonging to $\hat{\mathcal{P}}(n)$; hence we first establish the following lemma.

Lemma 1: Partition the support of $\mathbb{P}$ into $V_1, \cdots, V_T$ such that the $\ell_\infty$ radius of each set is less than $c_n$. If a distribution $\mu$ satisfies
$$\mu(V_t) = \#\big\{ i \mid (b_i, \mathbf{r}_i) \in V_t \big\} / n; \quad t = 1, \cdots, T, \qquad (10)$$
then $\mu \in \hat{\mathcal{P}}(n)$.

Proof: Let $\mathcal{Z}_i = [b_i - c_n, b_i + c_n] \times \prod_{j=1}^m [a_{ij} - c_n, a_{ij} + c_n]$; recall that $a_{ij}$ is the $j$th element of $\mathbf{r}_i$. Since $V_t$ has $\ell_\infty$ radius less than $c_n$, we have $\big( (b_i, \mathbf{r}_i) \in V_t \big) \Rightarrow V_t \subseteq \mathcal{Z}_i$. Therefore, for any $S \subseteq \{1, \cdots, n\}$, the following holds:
$$\mu\Big( \bigcup_{i \in S} \mathcal{Z}_i \Big) \ge \mu\Big( \bigcup \big\{ V_t \,\big|\, \exists i \in S: (b_i, \mathbf{r}_i) \in V_t \big\} \Big) = \sum_{t \mid \exists i \in S: (b_i, \mathbf{r}_i) \in V_t} \mu(V_t) = \sum_{t \mid \exists i \in S: (b_i, \mathbf{r}_i) \in V_t} \#\big\{ (b_i, \mathbf{r}_i) \in V_t \big\} / n \ge |S| / n.$$
Hence $\mu \in \mathcal{P}_n(A, \Delta, \mathbf{b}, c_n \mathbf{1}_n)$, where each element of $\Delta$ is $c_n$, which leads to $\mu \in \hat{\mathcal{P}}(n)$.

Now we proceed to prove the theorem. Partition the support of $\mathbb{P}$ into $T$ subsets such that the $\ell_\infty$ radius of each one is smaller than $c_n$. Denote by $\tilde{\mathcal{P}}(n)$ the set of probability measures satisfying Equation (10). Hence $\tilde{\mathcal{P}}(n) \subseteq \hat{\mathcal{P}}(n)$ by Lemma 1. Further notice that there exists a universal constant $K$ such that $\|\mathbf{x}(c_n, \mathcal{S}_n)\|_2 \le K / c_n$, due to the fact that the squared loss of the solution $\mathbf{x} = \mathbf{0}$ is bounded by a constant that depends only on the support of $\mathbb{P}$. Thus, there exists a constant $C$ such that $\max_{b, \mathbf{r}} (b - \mathbf{r}^\top \mathbf{x}(c_n, \mathcal{S}_n))^2 \le C / c_n^2$. Following an argument similar to the proof of Theorem 7, we have
$$\sup_{\mu_n \in \tilde{\mathcal{P}}(n)} \int_{b, \mathbf{r}} \big(b - \mathbf{r}^\top \mathbf{x}(c_n, \mathcal{S}_n)\big)^2 \, d\mu_n(b, \mathbf{r}) \le \frac{1}{n} \sum_{i=1}^n \big(b_i - \mathbf{r}_i^\top \mathbf{x}(\mathbb{P})\big)^2 + c_n^2 \big(1 + \|\mathbf{x}(\mathbb{P})\|_1\big)^2 + 2 c_n \big(1 + \|\mathbf{x}(\mathbb{P})\|_1\big) \sqrt{ \frac{1}{n} \sum_{i=1}^n \big(b_i - \mathbf{r}_i^\top \mathbf{x}(\mathbb{P})\big)^2 }, \qquad (11)$$
and
$$\begin{aligned}
\int_{b, \mathbf{r}} \big(b - \mathbf{r}^\top \mathbf{x}(c_n, \mathcal{S}_n)\big)^2 \, d\mathbb{P}(b, \mathbf{r})
&\le \inf_{\mu_n \in \tilde{\mathcal{P}}(n)} \Big\{ \int_{b, \mathbf{r}} \big(b - \mathbf{r}^\top \mathbf{x}(c_n, \mathcal{S}_n)\big)^2 \, d\mu_n(b, \mathbf{r}) + \max_{b, \mathbf{r}} \big(b - \mathbf{r}^\top \mathbf{x}(c_n, \mathcal{S}_n)\big)^2 \int_{b, \mathbf{r}} |f_{\mu_n}(b, \mathbf{r}) - f^*(b, \mathbf{r})| \, d(b, \mathbf{r}) \Big\} \\
&\le \sup_{\mu_n \in \tilde{\mathcal{P}}(n)} \int_{b, \mathbf{r}} \big(b - \mathbf{r}^\top \mathbf{x}(c_n, \mathcal{S}_n)\big)^2 \, d\mu_n(b, \mathbf{r}) + \frac{2C}{c_n^2} \inf_{\mu'_n \in \tilde{\mathcal{P}}(n)} \Big\{ \int_{b, \mathbf{r}} |f_{\mu'_n}(b, \mathbf{r}) - f^*(b, \mathbf{r})| \, d(b, \mathbf{r}) \Big\},
\end{aligned}$$
where $f_\mu$ stands for the density function of a measure $\mu$. Notice that $\tilde{\mathcal{P}}(n)$ is the set of distributions satisfying Equation (10); hence $\inf_{\mu'_n \in \tilde{\mathcal{P}}(n)} \int_{b, \mathbf{r}} |f_{\mu'_n}(b, \mathbf{r}) - f^*(b, \mathbf{r})| \, d(b, \mathbf{r})$ is upper-bounded by $\sum_{t=1}^T \big| \mathbb{P}(V_t) - \#\{(b_i, \mathbf{r}_i) \in V_t\}/n \big|$, which goes to zero as $n$ increases for any fixed $c_n$ (see for example Proposition A6.6 of [30]). Therefore,
$$\frac{2C}{c_n^2} \inf_{\mu'_n \in \tilde{\mathcal{P}}(n)} \Big\{ \int_{b, \mathbf{r}} |f_{\mu'_n}(b, \mathbf{r}) - f^*(b, \mathbf{r})| \, d(b, \mathbf{r}) \Big\} \to 0,$$
if $c_n \downarrow 0$ sufficiently slowly. Combining this with Inequality (11) proves the theorem.
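The consistency statement can also be observed empirically by solving $\min_{\mathbf{x}} \frac{1}{\sqrt{n}}\|\mathbf{b} - A\mathbf{x}\|_2 + c_n \|\mathbf{x}\|_1$ with a slowly vanishing $c_n$ and tracking the population risk of the solution. A rough simulation sketch (CVXPY assumed; the generative model and the schedule $c_n = n^{-1/4}$ are illustrative choices, not the theorem's exact conditions):

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(5)
m = 3
x_true = np.array([1.0, -2.0, 0.0])

def sample(n):
    R = rng.uniform(-1, 1, size=(n, m))        # bounded support
    b = R @ x_true + 0.1 * rng.uniform(-1, 1, n)
    return R, b

R_test, b_test = sample(50_000)                 # large test set: proxy for the population risk
for n in [50, 500, 5_000]:
    R, b = sample(n)
    c_n = n ** -0.25
    x = cp.Variable(m)
    cp.Problem(cp.Minimize(cp.norm(b - R @ x, 2) / np.sqrt(n)
                           + c_n * cp.norm1(x))).solve()
    risk = np.sqrt(np.mean((b_test - R_test @ x.value) ** 2))
    print(n, risk)                              # should approach the risk of x(P)
```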
VI. STABILITY

Knowing that the robust regression problem (1), and in particular Lasso, encourages sparsity, it is of interest to investigate another desirable characteristic of a learning algorithm, namely, stability. We show in this section that Lasso is not stable. This is a special case of a more general result we prove in [31], where we show that this is a common property of all algorithms that encourage sparsity: if a learning algorithm achieves a certain sparsity condition, then it cannot have a non-trivial stability bound.

We first recall the definition of uniform stability [29]. We let $\mathcal{Z}$ denote the space of points and labels (typically this will be a compact subset of $\mathbb{R}^{n+1}$), so that $S \in \mathcal{Z}^m$ denotes a collection of $m$ labelled training points. We let $\mathbb{L}$ denote a learning algorithm, and for $S \in \mathcal{Z}^m$, we let $\mathbb{L}_S$ denote the output of the learning algorithm (i.e., the regression function it has learned from the training data). Then, given a loss function $l$ and a labelled point $s = (\mathbf{z}, b) \in \mathcal{Z}$, we let $l(\mathbb{L}_S, s)$ denote the loss of the algorithm that has been trained on the set $S$, evaluated on the data point $s$. Thus for squared loss, we would have $l(\mathbb{L}_S, s) = \|\mathbb{L}_S(\mathbf{z}) - b\|^2$.

Definition 1: An algorithm $\mathbb{L}$ has a uniform stability bound of $\beta_m$ with respect to the loss function $l$ if the following holds:
$$\forall S \in \mathcal{Z}^m,\; \forall i \in \{1, \cdots, m\}: \quad \big\| l(\mathbb{L}_S, \cdot) - l(\mathbb{L}_{S^{\setminus i}}, \cdot) \big\|_\infty \le \beta_m.$$
Here $\mathbb{L}_{S^{\setminus i}}$ stands for the learned solution with the $i$th sample removed from $S$.

At first glance, this definition may seem too stringent for any reasonable algorithm to exhibit good stability properties. However, as shown in [29], Tikhonov-regularized regression has stability that scales as $1/m$. Stability that scales at least as fast as $o(1/\sqrt{m})$ can be used to establish strong PAC bounds (see [29]).
In this section we show that not only is the stability (in the sense defined above) of Lasso much worse than the stability of $\ell_2$-regularized regression, but in fact Lasso's stability is, in the following sense, as bad as it gets. To this end, we define the notion of the trivial bound, which is the worst possible error a trained algorithm can have for an arbitrary training set and a testing sample labelled by zero.

Definition 2: Given a subset from which we can draw $m$ labelled points, $\mathcal{Z} \subseteq \mathbb{R}^{n \times (m+1)}$, and a subset for one unlabelled point, $\mathcal{X} \subseteq \mathbb{R}^m$, a trivial bound for a learning algorithm $\mathbb{L}$ with respect to $\mathcal{Z}$ and $\mathcal{X}$ is
$$b(\mathbb{L}, \mathcal{Z}, \mathcal{X}) \triangleq \max_{S \in \mathcal{Z},\; \mathbf{z} \in \mathcal{X}} l\big( \mathbb{L}_S, (\mathbf{z}, 0) \big).$$
As above, $l(\cdot, \cdot)$ is a given loss function. Notice that the trivial bound does not diminish as the number of samples increases, since by repeatedly choosing the worst sample, the algorithm will yield the same solution.

Now we show that the uniform stability bound of Lasso can be no better than its trivial bound with the number of features halved.

Theorem 9: Let $\hat{\mathcal{Z}} \subseteq \mathbb{R}^{n \times (2m+1)}$ be the domain of the sample set and $\hat{\mathcal{X}} \subseteq \mathbb{R}^{2m}$ be the domain of the new observation, such that
$$(\mathbf{b}, A) \in \mathcal{Z} \implies (\mathbf{b}, A, A) \in \hat{\mathcal{Z}}; \qquad (\mathbf{z}^\top) \in \mathcal{X} \implies (\mathbf{z}^\top, \mathbf{z}^\top) \in \hat{\mathcal{X}}.$$
Then the uniform stability bound of Lasso is lower bounded by $b(\mathrm{Lasso}, \mathcal{Z}, \mathcal{X})$.

Proof: Let $(\mathbf{b}^*, A^*)$ and $(0, \mathbf{z}^{*\top})$ be the sample set and the new observation that jointly achieve $b(\mathrm{Lasso}, \mathcal{Z}, \mathcal{X})$, and let $\mathbf{x}^*$ be the optimal solution of Lasso with respect to $(\mathbf{b}^*, A^*)$. Consider the following sample set:
$$\begin{pmatrix} \mathbf{b}^* & A^* & A^* \\ 0 & \mathbf{0}^\top & \mathbf{z}^{*\top} \end{pmatrix}.$$
Observe that $(\mathbf{x}^{*\top}, \mathbf{0}^\top)^\top$ is an optimal solution of Lasso with respect to this sample set. Now remove the last sample from the sample set; notice that $(\mathbf{0}^\top, \mathbf{x}^{*\top})^\top$ is an optimal solution for this new sample set. Using the last sample as a testing observation, the solution with respect to the full sample set has zero cost, while the solution of the leave-one-out sample set has cost $b(\mathrm{Lasso}, \mathcal{Z}, \mathcal{X})$. Hence we prove the theorem.
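The duplicated-feature construction of the proof is easy to replay numerically: with the features duplicated, Lasso is indifferent between the two copies, so the full-sample and leave-one-out solutions can tie on the training objective while differing by a constant on the held-out sample. A toy sketch (CVXPY assumed; data are synthetic, and tie-breaking between copies is solver-dependent, so this illustrates rather than certifies the gap):

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(6)
n, m, c = 10, 4, 0.5
A = rng.standard_normal((n, m)); b = rng.standard_normal(n)
z = rng.standard_normal(m)

def lasso_obj(A_, b_, x_, c_):
    return np.linalg.norm(b_ - A_ @ x_) + c_ * np.sum(np.abs(x_))

# x_star: a Lasso solution on the original pair (b, A).
xv = cp.Variable(m)
cp.Problem(cp.Minimize(cp.norm(b - A @ xv, 2) + c * cp.norm1(xv))).solve()
x_star = xv.value

# Proof construction: duplicate the features; append the sample ((0, z), label 0).
sol_full = np.concatenate([x_star, np.zeros(m)])  # optimal with the extra sample present
sol_loo = np.concatenate([np.zeros(m), x_star])   # optimal once it is removed
A_loo = np.hstack([A, A])                         # leave-one-out design matrix

# Both vectors tie on the leave-one-out objective, so either may be returned...
assert np.isclose(lasso_obj(A_loo, b, sol_full, c), lasso_obj(A_loo, b, sol_loo, c))

# ...yet on the held-out sample ((0, z), label 0) their squared losses differ:
test = np.concatenate([np.zeros(m), z])
print((test @ sol_full) ** 2, (test @ sol_loo) ** 2)  # 0.0 vs. (z^T x_star)^2
```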
VII. CONCLUSION

In this paper, we considered robust regression with a least-square-error loss. In contrast to previous work on robust regression, we considered the case where the perturbations of the observations are in the features. We showed that this formulation is equivalent to a weighted $\ell_1$ norm regularized regression problem if no correlation of disturbances among different features is allowed, and hence provided an interpretation of the widely used Lasso algorithm from a robustness perspective. We also formulated tractable robust regression problems for disturbances coupled among different features, and hence generalized Lasso to a wider class of regularization schemes.

The sparsity and consistency of Lasso were also investigated based on its robustness interpretation. In particular, we presented a "no-free-lunch" theorem saying that sparsity and algorithmic stability contradict each other. This result shows that although sparsity and algorithmic stability are both regarded as desirable properties of regression algorithms, it is not possible to achieve them simultaneously, and we have to trade off these two properties in designing a regression algorithm.

The main thrust of this work is to treat the widely used regularized regression scheme from a robust optimization perspective, and to extend the result of [13] (i.e., that Tikhonov regularization is equivalent to a robust formulation for a Frobenius-norm-bounded disturbance set) to a broader range of disturbance sets and hence regularization schemes. This provides us not only with new insight into why regularization schemes work, but also offers solid motivation for selecting the regularization parameter of an existing regularization scheme, and facilitates the design of new regularization schemes.

REFERENCES

[1] L. Elden. Perturbation theory for the least-square problem with linear equality constraints. BIT, 24:472–476, 1985.
[2] G. Golub and C. Van Loan. Matrix Computation. Johns Hopkins University Press, Baltimore, 1989.
[3] D. Higham and N. Higham. Backward error and condition of structured linear systems. SIAM Journal on Matrix Analysis and Applications, 13:162–175, 1992.
[4] R. Fierro and J. Bunch. Collinearity and total least squares. SIAM Journal on Matrix Analysis and Applications, 15:1167–1181, 1994.
[5] A. Tikhonov and V. Arsenin. Solution for Ill-Posed Problems. Wiley, New York, 1977.
[6] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.
[7] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32(2):407–499, 2004.
[8] S. Chen, D. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33–61, 1998.
[9] A. Feuer and A. Nemirovski. On sparse representation in pairs of bases. IEEE Transactions on Information Theory, 49(6):1579–1581, 2003.
[10] E. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2):489–509, 2006.
[11] J. Tropp. Greed is good: Algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50(10):2231–2242, 2004.
[12] M. Wainwright. Sharp thresholds for noisy and high-dimensional recovery of sparsity using ℓ1-constrained quadratic programming. Technical report, available from http://www.stat.berkeley.edu/tech-reports/709.pdf, Department of Statistics, UC Berkeley, 2006.
[13] L. El Ghaoui and H. Lebret. Robust solutions to least-squares problems with uncertain data. SIAM Journal on Matrix Analysis and Applications, 18:1035–1064, 1997.
[14] P. Shivaswamy, C. Bhattacharyya, and A. Smola. Second order cone programming approaches for handling missing and uncertain data. Journal of Machine Learning Research, 7:1283–1314, July 2006.
[15] D. Bertsimas and I. Popescu. Optimal inequalities in probability theory: A convex optimization approach. SIAM Journal of Optimization, 15(3):780–800, 2004.
[16] H. Xu, C. Caramanis, and S. Mannor. Robust regression and Lasso. Technical report, GERAD, available from http://www.cim.mcgill.ca/~xuhuan/LassoGerad.pdf, 2008.
[17] D. Bertsimas and M. Sim. The price of robustness. Operations Research, 52(1):35–53, January 2004.
[18] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67(2):301–320, 2005.
[19] J. Tropp. Just relax: Convex programming methods for identifying sparse signals. IEEE Transactions on Information Theory, 51(3):1030–1051, 2006.
[20] F. Girosi. An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6):1445–1480, 1998.
[21] R. R. Coifman and M. V. Wickerhauser. Entropy-based algorithms for best-basis selection. IEEE Transactions on Information Theory, 38(2):713–718, 1992.
[22] S. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397–3415, 1993.
[23] D. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.
[24] M. Rosenblatt. Remarks on some nonparametric estimates of a density function. Annals of Mathematical Statistics, 27:832–837, 1956.
[25] E. Parzen. On the estimation of a probability density function and the mode. Annals of Mathematical Statistics, 33:1065–1076, 1962.
[26] L. Devroye and L. Györfi. Nonparametric Density Estimation: The L1 View. John Wiley & Sons, 1985.
[27] D. Scott. Multivariate Density Estimation: Theory, Practice and Visualization. John Wiley & Sons, 1992.
[28] V. Vapnik and A. Chervonenkis. The necessary and sufficient conditions for consistency in the empirical risk minimization method. Pattern Recognition and Image Analysis, 1(3):260–284, 1991.
[29] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499–526, 2002.
[30] A. van der Vaart and J. Wellner. Weak Convergence and Empirical Processes. Springer-Verlag, New York, 2000.
[31] H. Xu, C. Caramanis, and S. Mannor. Sparse algorithms are not stable: A no-free-lunch theorem. In Proceedings of the Forty-Sixth Allerton Conference on Communication, Control, and Computing, 2008.

APPENDIX A
PROOF OF THEOREM 2

Theorem 2. Consider a random vector $\mathbf{v} \in \mathbb{R}^n$ such that $\mathbb{E}(\mathbf{v}) = \mathbf{a}$ and $\mathbb{E}(\mathbf{v}\mathbf{v}^\top) = \Sigma$, $\Sigma \succeq 0$. Then we have
$$\Pr\{\|\mathbf{v}\|_2 \ge c_i\} \;\le\; \begin{aligned}[t] \min_{P, \mathbf{q}, r, \lambda} \quad & \operatorname{Trace}(\Sigma P) + 2\mathbf{q}^\top \mathbf{a} + r \\ \text{subject to:} \quad & \begin{pmatrix} P & \mathbf{q} \\ \mathbf{q}^\top & r \end{pmatrix} \succeq 0; \quad \begin{pmatrix} I & \mathbf{0} \\ \mathbf{0}^\top & -c_i^2 \end{pmatrix} \preceq \lambda \begin{pmatrix} P & \mathbf{q} \\ \mathbf{q}^\top & r - 1 \end{pmatrix}; \quad \lambda \ge 0. \end{aligned} \qquad (12)$$

Proof: Consider a function $f(\cdot)$ parameterized by $P, \mathbf{q}, r$, defined as $f(\mathbf{v}) = \mathbf{v}^\top P \mathbf{v} + 2\mathbf{q}^\top \mathbf{v} + r$. Notice $\mathbb{E}\big(f(\mathbf{v})\big) = \operatorname{Trace}(\Sigma P) + 2\mathbf{q}^\top \mathbf{a} + r$. Now we show that $f(\mathbf{v}) \ge \mathbf{1}_{\|\mathbf{v}\|_2 \ge c_i}$ for all $P, \mathbf{q}, r$ satisfying the constraints in (12). To show this, we need to establish (i) $f(\mathbf{v}) \ge 0$ for all $\mathbf{v}$, and (ii) $f(\mathbf{v}) \ge 1$ when $\|\mathbf{v}\|_2 \ge c_i$. Notice that
$$f(\mathbf{v}) = \begin{pmatrix} \mathbf{v} \\ 1 \end{pmatrix}^\top \begin{pmatrix} P & \mathbf{q} \\ \mathbf{q}^\top & r \end{pmatrix} \begin{pmatrix} \mathbf{v} \\ 1 \end{pmatrix},$$
hence (i) holds because $\begin{pmatrix} P & \mathbf{q} \\ \mathbf{q}^\top & r \end{pmatrix} \succeq 0$. To establish condition (ii), it suffices to show that $\mathbf{v}^\top \mathbf{v} \ge c_i^2$ implies $\mathbf{v}^\top P \mathbf{v} + 2\mathbf{q}^\top \mathbf{v} + r \ge 1$, which is equivalent to showing
$$\big\{ \mathbf{v} \,\big|\, \mathbf{v}^\top P \mathbf{v} + 2\mathbf{q}^\top \mathbf{v} + r - 1 \le 0 \big\} \subseteq \big\{ \mathbf{v} \,\big|\, \mathbf{v}^\top \mathbf{v} \le c_i^2 \big\}.$$
Noticing that this is an ellipsoid-containment condition, by the S-procedure we see that it is equivalent to the existence of a $\lambda \ge 0$ such that
$$\begin{pmatrix} I & \mathbf{0} \\ \mathbf{0}^\top & -c_i^2 \end{pmatrix} \preceq \lambda \begin{pmatrix} P & \mathbf{q} \\ \mathbf{q}^\top & r - 1 \end{pmatrix}.$$
Hence we have $f(\mathbf{v}) \ge \mathbf{1}_{\|\mathbf{v}\|_2 \ge c_i}$; taking expectations of both sides, and noticing that the expectation of an indicator function is the probability of the corresponding event, we establish the theorem.
APPENDIX B
PROOF OF THEOREM 4

Theorem 4. Assume that the set $\mathcal{Z} \triangleq \{\mathbf{z} \in \mathbb{R}^m \mid f_j(\mathbf{z}) \le 0,\; j = 1, \cdots, k;\; \mathbf{z} \ge \mathbf{0}\}$ has non-empty relative interior. Then the robust regression problem
$$\min_{\mathbf{x} \in \mathbb{R}^m} \Big\{ \max_{\Delta A \in \mathcal{U}'} \|\mathbf{b} - (A + \Delta A)\mathbf{x}\|_a \Big\}$$
is equivalent to the following regularized regression problem:
$$\min_{\boldsymbol{\lambda} \in \mathbb{R}_+^k,\, \boldsymbol{\kappa} \in \mathbb{R}_+^m,\, \mathbf{x} \in \mathbb{R}^m} \Big\{ \|\mathbf{b} - A\mathbf{x}\|_a + v(\boldsymbol{\lambda}, \boldsymbol{\kappa}, \mathbf{x}) \Big\}; \quad \text{where: } v(\boldsymbol{\lambda}, \boldsymbol{\kappa}, \mathbf{x}) \triangleq \max_{\mathbf{c} \in \mathbb{R}^m} \Big[ (\boldsymbol{\kappa} + |\mathbf{x}|)^\top \mathbf{c} - \sum_{j=1}^k \lambda_j f_j(\mathbf{c}) \Big].$$

Proof: Fix a solution $\mathbf{x}^*$. Notice that
$$\mathcal{U}' = \big\{ (\delta_1, \cdots, \delta_m) \,\big|\, \mathbf{c} \in \mathcal{Z};\; \|\delta_i\|_a \le c_i,\; i = 1, \cdots, m \big\}.$$
Hence we have:
$$\begin{aligned}
\max_{\Delta A \in \mathcal{U}'} \|\mathbf{b} - (A + \Delta A)\mathbf{x}^*\|_a
&= \max_{\mathbf{c} \in \mathcal{Z}} \Big\{ \max_{\|\delta_i\|_a \le c_i,\, i = 1, \cdots, m} \big\| \mathbf{b} - \big(A + (\delta_1, \cdots, \delta_m)\big)\mathbf{x}^* \big\|_a \Big\} \\
&= \max_{\mathbf{c} \in \mathcal{Z}} \Big\{ \|\mathbf{b} - A\mathbf{x}^*\|_a + \sum_{i=1}^m c_i |x_i^*| \Big\}
= \|\mathbf{b} - A\mathbf{x}^*\|_a + \max_{\mathbf{c} \in \mathcal{Z}} \big\{ |\mathbf{x}^*|^\top \mathbf{c} \big\}. \qquad (13)
\end{aligned}$$
The second equation follows from Theorem 3. Now we need to evaluate $\max_{\mathbf{c} \in \mathcal{Z}} \{|\mathbf{x}^*|^\top \mathbf{c}\}$, which equals $-\min_{\mathbf{c} \in \mathcal{Z}} \{-|\mathbf{x}^*|^\top \mathbf{c}\}$; hence we are minimizing a linear function over a set of convex constraints. Furthermore, by assumption, Slater's condition holds, so the duality gap of $\min_{\mathbf{c} \in \mathcal{Z}} \{-|\mathbf{x}^*|^\top \mathbf{c}\}$ is zero. A standard duality analysis shows that
$$\max_{\mathbf{c} \in \mathcal{Z}} \big\{ |\mathbf{x}^*|^\top \mathbf{c} \big\} = \min_{\boldsymbol{\lambda} \in \mathbb{R}_+^k,\, \boldsymbol{\kappa} \in \mathbb{R}_+^m} v(\boldsymbol{\lambda}, \boldsymbol{\kappa}, \mathbf{x}^*). \qquad (14)$$
We establish the theorem by substituting Equation (14) back into Equation (13) and taking the minimum over $\mathbf{x}$ on both sides.

APPENDIX C
PROOF OF PROPOSITION 1

Proposition 1. Given a function $h: \mathbb{R}^{m+1} \to \mathbb{R}$ and Borel sets $\mathcal{Z}_1, \cdots, \mathcal{Z}_n \subseteq \mathbb{R}^{m+1}$, let
$$\mathcal{P}_n \triangleq \Big\{ \mu \in \mathcal{P} \,\Big|\, \forall S \subseteq \{1, \cdots, n\}: \mu\Big(\bigcup_{i \in S} \mathcal{Z}_i\Big) \ge |S|/n \Big\}.$$
The following holds:
$$\frac{1}{n} \sum_{i=1}^n \sup_{(\mathbf{r}_i, b_i) \in \mathcal{Z}_i} h(\mathbf{r}_i, b_i) = \sup_{\mu \in \mathcal{P}_n} \int_{\mathbb{R}^{m+1}} h(\mathbf{r}, b)\, d\mu(\mathbf{r}, b).$$

Proof: To prove Proposition 1, we first establish the following lemma.

Lemma 2: Given a function $f: \mathbb{R}^{m+1} \to \mathbb{R}$ and a Borel set $\mathcal{Z} \subseteq \mathbb{R}^{m+1}$, the following holds:
$$\sup_{\mathbf{x}' \in \mathcal{Z}} f(\mathbf{x}') = \sup_{\mu \in \mathcal{P}:\, \mu(\mathcal{Z}) = 1} \int_{\mathbb{R}^{m+1}} f(\mathbf{x})\, d\mu(\mathbf{x}).$$

Proof: Let $\hat{\mathbf{x}}$ be an $\epsilon$-optimal solution to the left hand side, and consider the probability measure $\mu'$ that puts mass 1 on $\hat{\mathbf{x}}$, which satisfies $\mu'(\mathcal{Z}) = 1$. Hence, we have
$$\sup_{\mathbf{x}' \in \mathcal{Z}} f(\mathbf{x}') - \epsilon \le \sup_{\mu \in \mathcal{P}:\, \mu(\mathcal{Z}) = 1} \int_{\mathbb{R}^{m+1}} f(\mathbf{x})\, d\mu(\mathbf{x});$$
since $\epsilon$ can be arbitrarily small, this leads to
$$\sup_{\mathbf{x}' \in \mathcal{Z}} f(\mathbf{x}') \le \sup_{\mu \in \mathcal{P}:\, \mu(\mathcal{Z}) = 1} \int_{\mathbb{R}^{m+1}} f(\mathbf{x})\, d\mu(\mathbf{x}). \qquad (15)$$
Next, construct the function $\hat{f}: \mathbb{R}^{m+1} \to \mathbb{R}$ as
$$\hat{f}(\mathbf{x}) \triangleq \begin{cases} f(\hat{\mathbf{x}}) & \mathbf{x} \in \mathcal{Z}; \\ f(\mathbf{x}) & \text{otherwise}. \end{cases}$$
By the definition of $\hat{\mathbf{x}}$ we have $f(\mathbf{x}) \le \hat{f}(\mathbf{x}) + \epsilon$ for all $\mathbf{x} \in \mathbb{R}^{m+1}$. Hence, for any probability measure $\mu$ such that $\mu(\mathcal{Z}) = 1$, the following holds:
$$\int_{\mathbb{R}^{m+1}} f(\mathbf{x})\, d\mu(\mathbf{x}) \le \int_{\mathbb{R}^{m+1}} \hat{f}(\mathbf{x})\, d\mu(\mathbf{x}) + \epsilon = f(\hat{\mathbf{x}}) + \epsilon \le \sup_{\mathbf{x}' \in \mathcal{Z}} f(\mathbf{x}') + \epsilon.$$
This leads to
$$\sup_{\mu \in \mathcal{P}:\, \mu(\mathcal{Z}) = 1} \int_{\mathbb{R}^{m+1}} f(\mathbf{x})\, d\mu(\mathbf{x}) \le \sup_{\mathbf{x}' \in \mathcal{Z}} f(\mathbf{x}') + \epsilon.$$
Since $\epsilon$ can be arbitrarily small, we have
$$\sup_{\mu \in \mathcal{P}:\, \mu(\mathcal{Z}) = 1} \int_{\mathbb{R}^{m+1}} f(\mathbf{x})\, d\mu(\mathbf{x}) \le \sup_{\mathbf{x}' \in \mathcal{Z}} f(\mathbf{x}'). \qquad (16)$$
Combining (15) and (16) proves the lemma.

Now we proceed to prove the proposition. Let $\hat{\mathbf{x}}_i$ be an $\epsilon$-optimal solution to $\sup_{\mathbf{x}_i \in \mathcal{Z}_i} h(\mathbf{x}_i)$. Observe that the empirical distribution of $(\hat{\mathbf{x}}_1, \cdots, \hat{\mathbf{x}}_n)$ belongs to $\mathcal{P}_n$; since $\epsilon$ can be arbitrarily close to zero, we have
$$\frac{1}{n} \sum_{i=1}^n \sup_{\mathbf{x}_i \in \mathcal{Z}_i} h(\mathbf{x}_i) \le \sup_{\mu \in \mathcal{P}_n} \int_{\mathbb{R}^{m+1}} h(\mathbf{x})\, d\mu(\mathbf{x}). \qquad (17)$$
Without loss of generality, assume
$$h(\hat{\mathbf{x}}_1) \le h(\hat{\mathbf{x}}_2) \le \cdots \le h(\hat{\mathbf{x}}_n). \qquad (18)$$
Now construct the following function:
$$\hat{h}(\mathbf{x}) \triangleq \begin{cases} \min_{i:\, \mathbf{x} \in \mathcal{Z}_i} h(\hat{\mathbf{x}}_i) & \mathbf{x} \in \bigcup_{j=1}^n \mathcal{Z}_j; \\ h(\mathbf{x}) & \text{otherwise}. \end{cases} \qquad (19)$$
Observe that $h(\mathbf{x}) \le \hat{h}(\mathbf{x}) + \epsilon$ for all $\mathbf{x}$.
Furthermore, given $\mu \in \mathcal{P}_n$, we have
$$\int_{\mathbb{R}^{m+1}} h(\mathbf{x})\, d\mu(\mathbf{x}) - \epsilon \le \int_{\mathbb{R}^{m+1}} \hat{h}(\mathbf{x})\, d\mu(\mathbf{x}) = \sum_{k=1}^n h(\hat{\mathbf{x}}_k) \Big[ \mu\Big(\bigcup_{i=1}^k \mathcal{Z}_i\Big) - \mu\Big(\bigcup_{i=1}^{k-1} \mathcal{Z}_i\Big) \Big].$$
Denoting $\alpha_k \triangleq \mu\big(\bigcup_{i=1}^k \mathcal{Z}_i\big) - \mu\big(\bigcup_{i=1}^{k-1} \mathcal{Z}_i\big)$, we have
$$\sum_{k=1}^n \alpha_k = 1, \qquad \sum_{k=1}^t \alpha_k \ge t/n.$$
Hence by Equation (18) we have
$$\sum_{k=1}^n \alpha_k h(\hat{\mathbf{x}}_k) \le \frac{1}{n} \sum_{k=1}^n h(\hat{\mathbf{x}}_k).$$
Thus we have, for any $\mu \in \mathcal{P}_n$,
$$\int_{\mathbb{R}^{m+1}} h(\mathbf{x})\, d\mu(\mathbf{x}) - \epsilon \le \frac{1}{n} \sum_{k=1}^n h(\hat{\mathbf{x}}_k).$$
Therefore,
$$\sup_{\mu \in \mathcal{P}_n} \int_{\mathbb{R}^{m+1}} h(\mathbf{x})\, d\mu(\mathbf{x}) - \epsilon \le \frac{1}{n} \sum_{i=1}^n \sup_{\mathbf{x}_i \in \mathcal{Z}_i} h(\mathbf{x}_i).$$
Since $\epsilon$ can be arbitrarily close to 0, combining with (17) proves the proposition.

APPENDIX D
PROOF OF COROLLARY 3

Corollary 3. Given $\mathbf{b} \in \mathbb{R}^n$ and $A \in \mathbb{R}^{n \times m}$, the following equation holds for any $\mathbf{x} \in \mathbb{R}^m$:
$$\|\mathbf{b} - A\mathbf{x}\|_2 + \sqrt{n}\, c_n \|\mathbf{x}\|_1 + \sqrt{n}\, c_n = \sup_{\mu \in \hat{\mathcal{P}}(n)} \sqrt{ n \int_{\mathbb{R}^{m+1}} (b' - \mathbf{r}'^\top \mathbf{x})^2 \, d\mu(\mathbf{r}', b') }. \qquad (20)$$
Here,
$$\hat{\mathcal{P}}(n) \triangleq \bigcup_{\|\boldsymbol{\sigma}\|_2 \le \sqrt{n} c_n;\; \forall i: \|\delta_i\|_2 \le \sqrt{n} c_n} \mathcal{P}_n(A, \Delta, \mathbf{b}, \boldsymbol{\sigma});$$
$$\mathcal{P}_n(A, \Delta, \mathbf{b}, \boldsymbol{\sigma}) \triangleq \Big\{ \mu \in \mathcal{P} \,\Big|\, \mathcal{Z}_i = [b_i - \sigma_i, b_i + \sigma_i] \times \prod_{j=1}^m [a_{ij} - \delta_{ij}, a_{ij} + \delta_{ij}];\; \forall S \subseteq \{1, \cdots, n\}: \mu\Big(\bigcup_{i \in S} \mathcal{Z}_i\Big) \ge |S|/n \Big\}.$$

Proof: The right-hand side of Equation (20) equals
$$\sup_{\|\boldsymbol{\sigma}\|_2 \le \sqrt{n} c_n;\; \forall i: \|\delta_i\|_2 \le \sqrt{n} c_n} \Big\{ \sup_{\mu \in \mathcal{P}_n(A, \Delta, \mathbf{b}, \boldsymbol{\sigma})} \sqrt{ n \int_{\mathbb{R}^{m+1}} (b' - \mathbf{r}'^\top \mathbf{x})^2 \, d\mu(\mathbf{r}', b') } \Big\}.$$
Notice, by the equivalence to the robust formulation, the left-hand side equals
$$\begin{aligned}
\max_{\|\boldsymbol{\sigma}\|_2 \le \sqrt{n} c_n;\; \forall i: \|\delta_i\|_2 \le \sqrt{n} c_n} \Big\| \mathbf{b} + \boldsymbol{\sigma} - \big(A + [\delta_1, \cdots, \delta_m]\big)\mathbf{x} \Big\|_2
&= \sup_{\|\boldsymbol{\sigma}\|_2 \le \sqrt{n} c_n;\; \forall i: \|\delta_i\|_2 \le \sqrt{n} c_n} \Bigg\{ \sup_{(\hat{b}_i, \hat{\mathbf{r}}_i) \in \mathcal{Z}_i} \sqrt{ \sum_{i=1}^n (\hat{b}_i - \hat{\mathbf{r}}_i^\top \mathbf{x})^2 } \Bigg\} \\
&= \sup_{\|\boldsymbol{\sigma}\|_2 \le \sqrt{n} c_n;\; \forall i: \|\delta_i\|_2 \le \sqrt{n} c_n} \sqrt{ \sum_{i=1}^n \sup_{(\hat{b}_i, \hat{\mathbf{r}}_i) \in \mathcal{Z}_i} (\hat{b}_i - \hat{\mathbf{r}}_i^\top \mathbf{x})^2 },
\end{aligned}$$
where $\mathcal{Z}_i = [b_i - \sigma_i, b_i + \sigma_i] \times \prod_{j=1}^m [a_{ij} - \delta_{ij}, a_{ij} + \delta_{ij}]$. Furthermore, applying Proposition 1 yields
$$\sqrt{ \sum_{i=1}^n \sup_{(\hat{b}_i, \hat{\mathbf{r}}_i) \in \mathcal{Z}_i} (\hat{b}_i - \hat{\mathbf{r}}_i^\top \mathbf{x})^2 } = \sqrt{ \sup_{\mu \in \mathcal{P}_n(A, \Delta, \mathbf{b}, \boldsymbol{\sigma})} n \int_{\mathbb{R}^{m+1}} (b' - \mathbf{r}'^\top \mathbf{x})^2 \, d\mu(\mathbf{r}', b') } = \sup_{\mu \in \mathcal{P}_n(A, \Delta, \mathbf{b}, \boldsymbol{\sigma})} \sqrt{ n \int_{\mathbb{R}^{m+1}} (b' - \mathbf{r}'^\top \mathbf{x})^2 \, d\mu(\mathbf{r}', b') },$$
which proves the corollary.
