Approximation errors of online sparsification criteria


Paul Honeine, Member, IEEE

Abstract—Many machine learning frameworks, such as resource-allocating networks, kernel-based methods, Gaussian processes, and radial-basis-function networks, require a sparsification scheme in order to address the online learning paradigm. For this purpose, several online sparsification criteria have been proposed to restrict the model definition to a subset of samples. The best-known criterion is the (linear) approximation criterion, which discards any sample that can be well represented by the already contributing samples, an operation with excessive computational complexity. Several computationally efficient sparsification criteria have been introduced in the literature, such as the distance, the coherence and the Babel criteria. In this paper, we provide a framework that connects these sparsification criteria to the issue of approximating samples, by deriving theoretical bounds on the approximation errors. Moreover, we investigate the error of approximating any feature, by proposing upper bounds on the approximation error for each of the aforementioned sparsification criteria. Two classes of features are described in detail, the empirical mean and the principal axes in kernel principal component analysis.

Index Terms—Sparse approximation, adaptive filtering, kernel-based methods, resource-allocating networks, Gaussian processes, Gram matrix, machine learning, pattern recognition, online learning, sparsification criteria.

I. INTRODUCTION

DATA DELUGE in the era of "Big Data" brings new challenges (and opportunities) to the areas of machine learning and signal processing [1], [2], [3]. This paradigm demands online learning, which cannot be addressed directly by most (if not all) conventional learning machines, such as resource-allocating networks [4], kernel-based methods for classification and regression [5], Gaussian processes [6], radial-basis-function networks [7] and kernel principal component analysis [8], to name only a few. Indeed, these machines share essentially the same underlying model, with as many parameters to be estimated as training samples, as defined by the "Representer Theorem" [9]. This model is inappropriate in online learning, where a new sample is available at each instant. To stay computationally tractable, one needs to restrict the growth of the model complexity, by selecting the subset of samples that contributes to a reduced-order model approximating the full-order feature to be estimated.

To overcome this bottleneck in online learning, sparsification schemes have been proposed for all the aforementioned machines, defined as follows: at each instant, the scheme determines whether the new sample can be safely discarded from contributing to the order growth of the model; otherwise, the sample takes part in the order incrementation. The best-known online sparsification criterion is the approximation criterion, also called approximate linear dependency.

P. Honeine is with the Institut Charles Delaunay (CNRS), Université de technologie de Troyes, 10000 Troyes, France.
Phone: +33 (0)3 25 71 56 25; Fax: +33 (0)3 25 71 56 99; E-mail: paul.honeine@utt.fr.

The approximation criterion has been widely investigated in the literature, for Gaussian processes [10], the kernel recursive least squares algorithm [11], the kernel least mean square algorithm [12], and kernel principal component analysis [8]. This criterion determines the relevance of discarding or accepting the current sample by comparing, to a predefined threshold, the residual error of approximating it with a representation (i.e., linear combination) of samples (or nonlinearly mapped samples, as in kernel methods) already contributing to the model. A crucial issue in the approximation criterion is its computational complexity, which scales cubically with the model order.

Several computationally efficient sparsification criteria have been introduced in the literature, with essentially the same computational complexity that scales linearly with the model order. These sparsification criteria rely on the topology of the samples in order to select the most relevant samples. The most widely investigated criteria are the distance and the coherence criteria, as well as the Babel criterion. The distance criterion, introduced by Platt in [4] to control the complexity of resource-allocating networks in radial-basis-function networks, retains the most mutually distant samples; see also [13], [14] for recent advances. The coherence criterion, introduced by Honeine, Richard, and Bermudez in [15], [16] following the recent advances in compressed sensing [17], [18], retains samples that are mutually least coherent. As an extension of the coherence criterion, the Babel criterion uses the cumulative coherence as a measure of diversity [19].

These sparsification criteria have been separately investigated in the literature. To the best of our knowledge, there is no work that studies all these sparsification criteria together. The conducted analyses have often been based on the computational complexity, as advocated in [16], [20] by criticizing the computational cost of the approximation criterion in favor of the other sparsification criteria. In [15], [16], [21], we have developed with colleagues several theoretical results that allow comparing the coherence criterion to the approximation criterion. These results have not been extended to other sparsification criteria, and were demonstrated only for the particular case of unit-norm data.

This paper presents a framework to study online sparsification criteria by cross-fertilizing previously derived results, by obtaining often tighter bounds, and by extending these results to other sparsification criteria such as the distance and the Babel criteria. On the one hand, we bridge the gap between the approximation criterion and the other online sparsification criteria, firstly by providing upper bounds on the error of approximating, with samples already retained, any sample discarded by the sparsification criterion, and secondly by providing lower bounds on the error of approximating accepted samples.
On the other hand, we examine the relevance of approximating any feature with a sparse model obtained with any of the aforementioned sparsification criteria, including the approximation criterion. We provide upper bounds on the error of approximating any feature in the general case. Furthermore, we explore in detail two particular features, the empirical mean (i.e., centroid, studied for instance in [22], [23]) and the principal axes in kernel principal component analysis (kernel-PCA, [24]). The big picture of the cross-fertilization and extensions given in this paper is illustrated in TABLE I.

                                    Distance   Approximation   Coherence   Babel   Section
Reference: most known work          [4]        [10]            [16]        [17]
Reference: more recent work         [20]       [8]             [25]        [19]
Approximation of any sample         X          ·               X           X       IV
  Error on discarded samples        X          ·               [15]        X       IV-A
  Error on any atom                 X          ·               [16]        [19]    IV-B
Approximation of any feature        X          X               X           X       V
  Error on the mean (centroid)      X          X               [21]        X       V-A
  Error on the principal axes       X          [11]            [15]        X       V-B

TABLE I. A bird's-eye view of this paper. Some of the results were previously studied for unit-norm kernels, as shown with the references given in the table (where · denotes triviality). In this work, we provide an extensive study that completes the analysis for all sparsification criteria, often with tighter bounds (shown in gray color in the original), and we derive new theoretical results. Moreover, we generalize these results to any type of kernel, beyond unit-norm kernels.

The remainder of this paper is organized as follows. The next section introduces the kernel-based machines for online learning and presents the key issues studied in this work. Section III presents the aforementioned computationally efficient sparsification criteria. Section IV investigates bounds on the error of approximating samples, either discarded or accepted by any sparsification criterion. These results are extended in Section V to the problem of approximating any feature. Section VI concludes this document with some discussions.

II. KERNEL-BASED MACHINES FOR ONLINE LEARNING

In this section, we introduce the kernel-based machines for online learning, by presenting the approximation criterion with the key issues studied in this paper.

A. Machine learning and online learning

Machine learning seeks a feature $\psi(\cdot)$ connecting an input space $\mathcal{X} \subset \mathbb{R}^d$ to an output space $\mathcal{Y} \subset \mathbb{R}$, by using a set of training samples, denoted $\{(x_1,y_1), (x_2,y_2), \ldots, (x_n,y_n)\}$ with $(x_k,y_k) \in \mathcal{X} \times \mathcal{Y}$. Considering a loss function $C(\cdot,\cdot)$ defined on $\mathcal{Y} \times \mathcal{Y}$ that measures the error between the desired output and the one estimated with $\psi(\cdot)$, the optimization problem consists in minimizing a regularized empirical risk of the form

$$\operatorname*{argmin}_{\psi(\cdot)\in\mathcal{H}} \; \sum_{i=1}^{n} C(\psi(x_i), y_i) + \eta\, R\big(\|\psi(\cdot)\|_{\mathcal{H}}^2\big), \qquad (1)$$

where $\mathcal{H}$ is the feature space of candidate solutions and $\eta$ is a parameter that controls the tradeoff between the fitness error (first term) and the regularity of the solution (second term), with $R(\cdot)$ being a monotonically increasing function.
Examples of loss functions are the quadratic loss $|\psi(x_i)-y_i|^2$, the hinge loss $(1-\psi(x_i)\,y_i)_+$ of the SVM [5], the logistic regression loss $\log(1+\exp(-\psi(x_i)\,y_i))$, as well as the unsupervised loss function $-|\psi(x_i)|^2$, which is related to the PCA.

Let $\kappa : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a positive definite kernel, and $(\mathcal{H}, \langle\cdot,\cdot\rangle_{\mathcal{H}})$ the induced reproducing kernel Hilbert space (RKHS) with its inner product. The reproducing property states that any function $\psi(\cdot)$ of $\mathcal{H}$ can be evaluated at any sample $x_i$ of $\mathcal{X}$ using $\psi(x_i) = \langle\psi(\cdot), \kappa(\cdot,x_i)\rangle_{\mathcal{H}}$. This property shows that any sample $x_i$ of $\mathcal{X}$ is represented with $\kappa(\cdot,x_i)$ in the space $\mathcal{H}$. Moreover, the reproducing property leads to the so-called kernel trick, that is, for any pair of samples $(x_i,x_j)$, we have $\langle\kappa(\cdot,x_i), \kappa(\cdot,x_j)\rangle_{\mathcal{H}} = \kappa(x_i,x_j)$. In particular, $\|\kappa(\cdot,x_i)\|_{\mathcal{H}}^2 = \langle\kappa(\cdot,x_i), \kappa(\cdot,x_i)\rangle_{\mathcal{H}} = \kappa(x_i,x_i)$. The most used kernels and their expressions are as follows:

Kernel        $\kappa(x_i,x_j)$
Linear        $\langle x_i, x_j\rangle$
Polynomial    $(\langle x_i, x_j\rangle + c)^p$
Exponential   $\exp(\langle x_i, x_j\rangle)$
Gaussian      $\exp\big(-\tfrac{1}{2\sigma^2}\|x_i-x_j\|^2\big)$

Among these kernels, only the Gaussian kernel is unit-norm, that is, $\|\kappa(x,\cdot)\|_{\mathcal{H}} = 1$ for any sample $x \in \mathcal{X}$. Other kernels can be unit-norm on some restricted $\mathcal{X}$, such as the linear kernel when dealing with unit-norm samples. In this paper, we do not restrict ourselves to any particular kernel or space $\mathcal{X}$. We denote $r^2 = \inf_{x\in\mathcal{X}} \kappa(x,x)$ and $R^2 = \sup_{x\in\mathcal{X}} \kappa(x,x)$. For unit-norm kernels, we get $R = r = 1$.

The Representer Theorem provides a principal result that is essential in kernel-based machines for classification and regression, as well as unsupervised learning. It states that the solution of the optimization problem (1) takes the form

$$\psi(\cdot) = \sum_{i=1}^{n} \alpha_i\, \kappa(x_i,\cdot). \qquad (2)$$

The proof of this theorem is derived in [9], and a sketch of the proof is given in the footnote¹. This theorem shows that the optimal solution has as many parameters $\alpha_i$ to be estimated as the number of available samples $(x_i,y_i)$. This result constitutes the principal bottleneck for online learning. Indeed, in an online setting, the solution should be adapted based on a new sample available at each instant, namely $(x_t,y_t)$ at instant $t$. Thus, by including the new pair $(x_t,y_t)$ in the training set, the corresponding parameter $\alpha_t$ is added to the set of parameters to be estimated, by following the Representer Theorem. As a consequence, the order of the model (2) is continuously increasing.

¹To prove the Representer Theorem (2), we decompose any function $\psi(\cdot)$ of $\mathcal{H}$ into $\psi(\cdot) = \sum_{i=1}^n \alpha_i\,\kappa(x_i,\cdot) + \psi_\perp(\cdot)$, where $\langle\psi_\perp(\cdot), \kappa(x_i,\cdot)\rangle_{\mathcal{H}} = 0$ for all $i = 1,2,\ldots,n$. On the one hand, any evaluation $\psi(x_i)$ is independent of $\psi_\perp(\cdot)$ since $\psi(x_i) = \langle\psi(\cdot), \kappa(x_i,\cdot)\rangle_{\mathcal{H}}$. On the other hand, the monotonically increasing function $R(\cdot)$ guarantees that $R(\|\psi(\cdot)\|_{\mathcal{H}}^2) = R(\|\sum_{i=1}^n \alpha_i\,\kappa(x_i,\cdot) + \psi_\perp(\cdot)\|_{\mathcal{H}}^2) \geq R(\|\sum_{i=1}^n \alpha_i\,\kappa(x_i,\cdot)\|_{\mathcal{H}}^2)$, where the Pythagorean theorem is used. Therefore, a null $\psi_\perp(\cdot)$ minimizes the regularization term without affecting the fitness term.
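To fix the notation in code, here is a minimal Python sketch (ours, not from the paper) of the kernel evaluations and of the full-order model (2); the names gaussian_kernel, gram_matrix and evaluate_model are illustrative choices.

```python
import numpy as np

def gaussian_kernel(xi, xj, sigma=1.0):
    # kappa(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2)); unit-norm: kappa(x, x) = 1
    return np.exp(-np.sum((np.asarray(xi) - np.asarray(xj)) ** 2) / (2.0 * sigma**2))

def gram_matrix(X, kernel):
    # n-by-n Gram matrix K with entries kappa(x_i, x_j)
    return np.array([[kernel(xi, xj) for xj in X] for xi in X])

def evaluate_model(x, X, alpha, kernel):
    # Full-order model (2): psi(x) = sum_i alpha_i kappa(x_i, x)
    return sum(a * kernel(xi, x) for a, xi in zip(alpha, X))
```

The online bottleneck is visible here: both the length of alpha and the size of the Gram matrix grow with every new training pair.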
To overcome this bottleneck, one needs to control the growth of the model order at each instant, by keeping only a fraction of the kernel functions in the expansion (2). The reduced-order model takes the form

$$\psi(\cdot) = \sum_{j=1}^{m} \alpha_j\, \kappa(\check{x}_j,\cdot), \qquad (3)$$

with $m \ll t$, predefined or dependent on $t$. In this expression, $\{\check{x}_1, \check{x}_2, \ldots, \check{x}_m\}$ is a subset of $\{x_1, x_2, \ldots, x_t\}$, namely $\check{x}_j$ is some $x_{\omega_j}$ with $\omega_j \in \{1,2,\ldots,t\}$. We denote by dictionary the set $\mathcal{D} = \{\kappa(\check{x}_1,\cdot), \kappa(\check{x}_2,\cdot), \ldots, \kappa(\check{x}_m,\cdot)\}$, and by atoms its elements. Throughout this paper, all quantities associated with the dictionary carry an accent (by analogy to phonetics, where stress accents are associated with prominence). This is the case for instance of the $m$-by-$m$ Gram matrix $\check{K}$ whose $(i,j)$-th entry is $\kappa(\check{x}_i,\check{x}_j)$. The eigenvalues of this matrix are denoted $\check{\lambda}_1, \check{\lambda}_2, \ldots, \check{\lambda}_m$, given in non-increasing order.

The optimization problem is twofold at each instant: selecting the proper dictionary $\mathcal{D}$ and estimating the corresponding parameters $\alpha_1, \alpha_2, \ldots, \alpha_m$. New challenges (and opportunities) arise in an online learning setting. Determining the optimal dictionary at each instant is a combinatorial optimization problem, when optimality is measured by comparing the reduced-order solution (3) to the feature in its full-order form (2). An elegant way to overcome this computationally intractable problem is a recursive update, by determining if the new kernel function $\kappa(x_t,\cdot)$ needs to be included in the dictionary, or if it can be discarded since it is efficiently approximated with atoms already belonging to the dictionary. This is the essence of the approximation criterion.

B. Approximation criterion

The (linear) approximation criterion was initially proposed in [26] for classification and regression, and in [27] for Gaussian processes. In online learning with kernels, as studied for system identification in [11] and more recently for kernel principal component analysis in [8], it operates as follows: the current sample is discarded (not included in the dictionary) if it can be sufficiently represented by a linear combination of atoms already belonging to the dictionary; otherwise, it is included in the dictionary. Formally, the kernel function $\kappa(x_t,\cdot)$ is included in the dictionary if

$$\min_{\xi_1\cdots\xi_m} \Big\| \kappa(x_t,\cdot) - \sum_{j=1}^{m} \xi_j\, \kappa(\check{x}_j,\cdot) \Big\|_{\mathcal{H}}^2 \geq \delta^2, \qquad (4)$$

where $\delta$ is a positive threshold parameter that controls the level of sparseness. The above norm is the residual error obtained by projecting $\kappa(x_t,\cdot)$ onto the space spanned by the dictionary. The optimal value of each coefficient $\xi_j$ is obtained by nullifying the derivative of the above cost function with respect to it, which leads to $\xi = \check{K}^{-1}\check{\kappa}(x_t)$, where $\check{\kappa}(x_t)$ is the column vector of entries $\kappa(\check{x}_j,x_t)$, for $j = 1,2,\ldots,m$. By injecting this expression into the condition (4), we get the following condition for accepting the current kernel function:

$$\kappa(x_t,x_t) - \check{\kappa}(x_t)^\top \check{K}^{-1} \check{\kappa}(x_t) \geq \delta^2. \qquad (5)$$

The resulting dictionary is called $\delta$-approximate, satisfying the relation

$$\min_{i=1\cdots m}\; \min_{\xi_1\cdots\xi_m} \Big\| \kappa(\check{x}_i,\cdot) - \sum_{\substack{j=1 \\ j\neq i}}^{m} \xi_j\, \kappa(\check{x}_j,\cdot) \Big\|_{\mathcal{H}} \geq \delta.$$
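The acceptance test (5) translates directly into code. The following Python sketch is an illustrative implementation, not the paper's code; the small jitter added to the Gram matrix before the linear solve is our addition for numerical stability.

```python
import numpy as np

def ald_accept(x_t, dictionary, kernel, delta, jitter=1e-10):
    # Approximation criterion, condition (5):
    # accept x_t when kappa(x_t, x_t) - k_t^T K^{-1} k_t >= delta^2
    if not dictionary:
        return True
    m = len(dictionary)
    K = np.array([[kernel(xi, xj) for xj in dictionary] for xi in dictionary])
    k_t = np.array([kernel(xj, x_t) for xj in dictionary])
    # O(m^3) linear solve: the dominant cost discussed in the text
    residual = kernel(x_t, x_t) - k_t @ np.linalg.solve(K + jitter * np.eye(m), k_t)
    return residual >= delta**2

def sparsify(stream, kernel, threshold, accept=ald_accept):
    # Online loop: grow the dictionary only when the criterion accepts x_t
    dictionary = []
    for x_t in stream:
        if accept(x_t, dictionary, kernel, threshold):
            dictionary.append(x_t)
    return dictionary
```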
One could also include a removal process, in the same spirit as the fixed-budget concept, by discarding the atom that can be well approximated with the other atoms, as investigated for instance in [28]. Nonetheless, the dictionary is still $\delta$-approximate. The use of a removal process does not affect the results given in this paper.

C. Issues studied in this paper

In the following, we describe several issues that motivate (and structure) this work, illustrated here with respect to the approximation criterion.

Computational complexity

The approximation criterion requires the inversion of the Gram matrix associated with the dictionary, which is the most computationally expensive step. Its computational complexity scales cubically with the size of the dictionary, i.e., $O(m^3)$ operations. Moreover, the evaluation of the condition expressed in (5) requires two matrix multiplications at each instant. These computational costs may counteract the benefits of several online learning techniques, such as gradient-based and least-mean-square algorithms (e.g., LMS, NLMS, affine projection, ...).

To reduce the computational burden of the approximation criterion, several computationally efficient sparsification criteria have been proposed in the literature, sharing essentially the same computational complexity that scales linearly with the size of the dictionary, i.e., $O(m)$ operations at each instant. The best-known criteria are the distance, the coherence and the Babel criteria; see Section III for a description.

Approximation error of any sample

The approximation criterion relies on establishing a dictionary such that the error of approximating each of its atoms, with a linear combination of the other atoms, cannot be smaller than the given threshold $\delta$. Moreover, the decision of discarding any sample from the dictionary is defined by the same process, namely when its approximation error, with a linear combination of the other atoms, is smaller than the same threshold $\delta$. While the approximation criterion possesses such a duality between accepting and discarding samples at the very same value of thresholding the approximation error, this is not the case for the other sparsification criteria.

In Section IV, we bridge the gap between the approximation criterion and the other online sparsification criteria. For this purpose, on the one hand in Section IV-A, we derive upper bounds on the error of approximating a discarded sample with atoms of a dictionary obtained by the distance, the coherence, or the Babel criterion. On the other hand in Section IV-B, we derive lower bounds on the error of approximating any atom with the other atoms of the sparse dictionary under scrutiny.

From approximating samples to approximating features

All the aforementioned sparsification criteria operate in a pre-processing scheme, by selecting samples independently of the resulting sparse representation of the feature. In other words, the selection of the relevant subset $\{\check{x}_1, \check{x}_2, \ldots, \check{x}_m\}$ from the set $\{x_1, x_2, \ldots, x_t\}$ is only based on the topology of the samples; it is independent of the power of the dictionary to approximate accurately any feature of the form (2) with the reduced-order model (3).
In Section V, we study the relevance of approximating any feature with a sparse dictionary obtained by any sparsification criterion, including the approximation criterion. We derive upper bounds on the approximation error of any feature, before examining in detail two particular classes of features, the empirical mean studied in Section V-A and the most relevant principal axes in kernel-PCA investigated in Section V-B.

III. ONLINE SPARSIFICATION CRITERIA

With a novel sample $x_t$ available at each instant $t$, a sparsification rule determines if $\kappa(x_t,\cdot)$ should be included in the dictionary, by incrementing the model order $m$ and setting $\check{x}_{m+1} = x_t$. The sparsification criteria measure the relevance of such a complexity incrementation by comparing the current kernel function $\kappa(x_t,\cdot)$ with the atoms of the dictionary. They are defined by either a dissimilarity measure, i.e., constructing the dictionary with the most mutually distant atoms, or a similarity measure, i.e., constructing the dictionary with the least coherent or correlated atoms. To this end, a threshold is used to control the level of sparsity of the dictionary. The most investigated criteria are outlined in the following.

A. Distance criterion

It is natural to propose a sparsification criterion that constructs a dictionary with large distances between its entries, thus discarding samples that are too close to any of the atoms already belonging to the dictionary. The current kernel function $\kappa(x_t,\cdot)$ is included in the dictionary if

$$\min_{j=1\cdots m}\; \min_{\xi}\; \|\kappa(x_t,\cdot) - \xi\, \kappa(\check{x}_j,\cdot)\|_{\mathcal{H}} \geq \delta, \qquad (6)$$

for a predefined positive threshold $\delta$; otherwise, it can be efficiently approximated, up to a multiplicative constant, by an atom of the dictionary. It is easy to see that the optimal value of the scaling factor $\xi$ is $\kappa(x_t,\check{x}_j)/\kappa(\check{x}_j,\check{x}_j)$, since the left-hand side in the above expression is the residual error of the projection of $\kappa(x_t,\cdot)$ onto $\kappa(\check{x}_j,\cdot)$ (in the same spirit as the approximation criterion). This allows us to simplify the condition (6) to get

$$\min_{j=1\cdots m} \Big( \kappa(x_t,x_t) - \frac{\kappa(x_t,\check{x}_j)^2}{\kappa(\check{x}_j,\check{x}_j)} \Big) \geq \delta^2. \qquad (7)$$

The resulting dictionary, called $\delta$-distant, satisfies for any pair $(\check{x}_i,\check{x}_j)$:

$$\kappa(\check{x}_i,\check{x}_i) - \frac{\kappa(\check{x}_i,\check{x}_j)^2}{\kappa(\check{x}_j,\check{x}_j)} \geq \delta^2. \qquad (8)$$

For unit-norm atoms, this expression reduces to the condition $|\kappa(\check{x}_i,\check{x}_j)| \leq \sqrt{1-\delta^2}$. This sparsification criterion has been extensively used in the literature under different names, such as the novelty criterion proposed in [4] (where the scaling factor was dropped and a prediction error mechanism was included in a second stage; see also [29], [7]) and the quantized criterion defined in [30].

B. Coherence criterion

The coherence measure has been extensively studied in the literature of compressed sensing in the particular case of the linear kernel with unit-norm samples [17], [18]. In the more general case with the kernel formalism, the coherence of a dictionary is defined with the measure

$$\max_{\substack{i,j=1\cdots m \\ i\neq j}} \frac{|\kappa(\check{x}_i,\check{x}_j)|}{\sqrt{\kappa(\check{x}_i,\check{x}_i)\,\kappa(\check{x}_j,\check{x}_j)}}, \qquad (9)$$

which corresponds to the largest cosine of the angle between any pair of atoms, since the above objective function can be written as

$$\frac{|\langle\kappa(\check{x}_i,\cdot), \kappa(\check{x}_j,\cdot)\rangle_{\mathcal{H}}|}{\|\kappa(\check{x}_i,\cdot)\|_{\mathcal{H}}\, \|\kappa(\check{x}_j,\cdot)\|_{\mathcal{H}}}.$$
The coherence criterion introduced in [15], [16] constructs a dictionary with atoms that are mutually least coherent, by restricting this measure below some predefined value $\gamma \in [0,1]$, where a null value yields an orthogonal basis. The criterion includes the current kernel function $\kappa(x_t,\cdot)$ in the dictionary if

$$\max_{j=1\cdots m} \frac{|\kappa(x_t,\check{x}_j)|}{\sqrt{\kappa(x_t,x_t)\,\kappa(\check{x}_j,\check{x}_j)}} \leq \gamma. \qquad (10)$$

It is worth noting that the denominator in each of the above expressions reduces to 1 when dealing with unit-norm atoms; thus expression (10) becomes $\max_{j=1\cdots m} |\kappa(x_t,\check{x}_j)| \leq \gamma$.

C. Babel criterion

While the coherence measure examines the largest correlation between all pairs of atoms in a dictionary, a more thorough analysis is provided by the Babel measure, which considers the maximum cumulative correlation between an atom and all the atoms of the dictionary [31], [17]. The Babel criterion for online sparsification is defined as follows: the current kernel function $\kappa(x_t,\cdot)$ is included in the dictionary if

$$\sum_{j=1}^{m} |\kappa(x_t,\check{x}_j)| \leq \gamma, \qquad (11)$$

for a given positive threshold $\gamma$ [19]. The resulting dictionary, called $\gamma$-Babel, satisfies

$$\max_{i=1\cdots m} \sum_{\substack{j=1 \\ j\neq i}}^{m} |\kappa(\check{x}_i,\check{x}_j)| \leq \gamma. \qquad (12)$$

By analogy with the coherence measure, which corresponds to the $\infty$-norm when dealing with unit-norm atoms, the Babel measure² is related to the 1-norm of the Gram matrix, where $\|\check{K}\|_1 = \max_i \sum_j |\kappa(\check{x}_i,\check{x}_j)|$.

²One can also consider a normalized version of the Babel measure, by substituting $\kappa(x_t,\check{x}_j)$ in (11) with $\kappa(x_t,\check{x}_j)/\sqrt{\kappa(x_t,x_t)\,\kappa(\check{x}_j,\check{x}_j)}$. These two definitions are equivalent when dealing with unit-norm atoms. To the best of our knowledge, this formulation is not used in the literature. Moreover, it loses the matrix-norm notion.
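All three criteria need only the $m$ kernel evaluations between $x_t$ and the atoms, which is the source of their $O(m)$ cost per instant. A minimal Python sketch of the acceptance tests (7), (10) and (11), under the same illustrative naming as the earlier snippets, is given below.

```python
import numpy as np

def distance_accept(x_t, dictionary, kernel, delta):
    # Distance criterion, condition (7)
    k_tt = kernel(x_t, x_t)
    return all(k_tt - kernel(x_t, xj) ** 2 / kernel(xj, xj) >= delta**2
               for xj in dictionary)

def coherence_accept(x_t, dictionary, kernel, gamma):
    # Coherence criterion, condition (10)
    k_tt = kernel(x_t, x_t)
    return all(abs(kernel(x_t, xj)) / np.sqrt(k_tt * kernel(xj, xj)) <= gamma
               for xj in dictionary)

def babel_accept(x_t, dictionary, kernel, gamma):
    # Babel criterion, condition (11): cumulative coherence at most gamma
    return sum(abs(kernel(x_t, xj)) for xj in dictionary) <= gamma
```

Any of these functions can be passed as the accept argument of the sparsify loop sketched earlier, in place of the $O(m^3)$ approximation test.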
IV. APPROXIMATION ERROR OF ANY SAMPLE

In this section, we study the elementary issue of approximating a sample by the span of a sparse dictionary, in the kernel-based framework. To this end, this issue is considered in its two folds: on the one hand, the error of approximating a discarded sample, and on the other hand, the error of approximating any accepted sample, namely approximating any atom of the dictionary with all the other atoms. We provide upper bounds on the former and lower bounds on the latter, for each of the sparsification criteria studied in the previous section. It is worth noting that only the approximation criterion relies on a duality of discarding and accepting samples at the very same value in thresholding the approximation error, which is not the case for the other criteria, as examined in the following.

Let $\check{P}$ be the projection operator onto the subspace spanned by the atoms $\kappa(\check{x}_1,\cdot), \ldots, \kappa(\check{x}_m,\cdot)$ of a dictionary resulting from a sparsification criterion. Thus, for any sample $x$, the projection of the kernel function $\kappa(x,\cdot)$ onto this subspace is given by $\check{P}\kappa(x,\cdot)$. The quadratic norm of the latter corresponds to the maximum inner product $\langle\kappa(x,\cdot), \varphi(\cdot)\rangle_{\mathcal{H}}$ over all the unit-norm functions $\varphi(\cdot)$ of that subspace. By writing $\varphi(\cdot) = \sum_{j=1}^m \beta_j\,\kappa(\check{x}_j,\cdot) / \|\sum_{j=1}^m \beta_j\,\kappa(\check{x}_j,\cdot)\|_{\mathcal{H}}$, one gets

$$\|\check{P}\kappa(x,\cdot)\|_{\mathcal{H}}^2 = \max_{\beta} \frac{\big\langle \sum_{j=1}^m \beta_j\,\kappa(\check{x}_j,\cdot),\; \kappa(x,\cdot)\big\rangle_{\mathcal{H}}}{\|\sum_{j=1}^m \beta_j\,\kappa(\check{x}_j,\cdot)\|_{\mathcal{H}}} = \max_{\beta} \frac{\sum_{j=1}^m \beta_j\,\kappa(x,\check{x}_j)}{\|\sum_{j=1}^m \beta_j\,\kappa(\check{x}_j,\cdot)\|_{\mathcal{H}}}. \qquad (13)$$

Moreover, the Pythagorean theorem allows us to measure the residual norm of this projection, with

$$\|(I-\check{P})\kappa(x,\cdot)\|_{\mathcal{H}}^2 = \|\kappa(x,\cdot)\|_{\mathcal{H}}^2 - \|\check{P}\kappa(x,\cdot)\|_{\mathcal{H}}^2 = \kappa(x,x) - \|\check{P}\kappa(x,\cdot)\|_{\mathcal{H}}^2,$$

where $I$ is the identity operator. Therefore, the quadratic approximation error is

$$\|(I-\check{P})\kappa(x,\cdot)\|_{\mathcal{H}}^2 = \kappa(x,x) - \max_{\beta} \frac{\sum_{j=1}^m \beta_j\,\kappa(x,\check{x}_j)}{\|\sum_{j=1}^m \beta_j\,\kappa(\check{x}_j,\cdot)\|_{\mathcal{H}}}. \qquad (14)$$

In the following, we investigate this expression in order to derive proper bounds on the approximation error, either when the sample is discarded or when it already belongs to the dictionary.

A. Approximation error of discarded samples

When the sample $x_t$ is discarded, we propose to upper-bound the quadratic approximation error (14) with

$$\|(I-\check{P})\kappa(x_t,\cdot)\|_{\mathcal{H}}^2 \leq \kappa(x_t,x_t) - \max_j \frac{|\kappa(x_t,\check{x}_j)|}{\sqrt{\kappa(\check{x}_j,\check{x}_j)}}, \qquad (15)$$

where the inequality corresponds to the special choice of the coefficients, with $\beta_1 = \ldots = \beta_m = 0$ except for $\beta_j = \operatorname{sign}(\kappa(x_t,\check{x}_j))$. Next, we show that the quotient $|\kappa(x_t,\check{x}_j)|/\sqrt{\kappa(\check{x}_j,\check{x}_j)}$ in the above expression is bounded, with a lower bound that depends on the threshold of the investigated sparsification criterion. For this purpose, we examine separately the distance (Theorem 1), the coherence (Theorem 2), and the Babel (Theorem 3) criteria.

Theorem 1 (Discarding error for the distance criterion): Let $x_t$ be a sample not satisfying the distance condition (7) for some given threshold $\delta$. The quadratic error of approximating $\kappa(x_t,\cdot)$ with a linear combination of atoms from the resulting dictionary is upper-bounded by $\delta^2$ and by $\kappa(x_t,x_t) - \sqrt{\kappa(x_t,x_t) - \delta^2}$. The latter upper bound is sharper when $\delta^2 > \kappa(x_t,x_t) - 1$. This is the case when dealing with unit-norm atoms, where we get the upper bound $1 - \sqrt{1-\delta^2}$.

Proof: Firstly, one can easily derive the first expression of the upper bound, since

$$\min_{\xi_1\cdots\xi_m} \Big\|\kappa(x_t,\cdot) - \sum_{i=1}^m \xi_i\,\kappa(\check{x}_i,\cdot)\Big\|_{\mathcal{H}}^2 \leq \min_{j=1\cdots m}\,\min_{\xi_j} \big\|\kappa(x_t,\cdot) - \xi_j\,\kappa(\check{x}_j,\cdot)\big\|_{\mathcal{H}}^2 < \delta^2,$$

where the first inequality follows from the special case when all $\xi_i$ are null except for a single one, and the second inequality is due to the violation of (6).

Secondly, the second expression of the upper bound is a bit trickier. The approximation error is given by the norm of the residual of the projection of $\kappa(x_t,\cdot)$ onto the subspace spanned by the dictionary atoms, namely as given in (15). Since $x_t$ does not satisfy the condition (6)-(7), we have

$$\min_{j=1\cdots m} \Big( \kappa(x_t,x_t) - \frac{\kappa(x_t,\check{x}_j)^2}{\kappa(\check{x}_j,\check{x}_j)} \Big) < \delta^2,$$

and as a consequence, since $\kappa(x_t,x_t) \geq \delta^2$, we can easily show that

$$\sqrt{\kappa(x_t,x_t) - \delta^2} < \max_{j=1\cdots m} \frac{|\kappa(x_t,\check{x}_j)|}{\sqrt{\kappa(\check{x}_j,\check{x}_j)}}.$$

By injecting this inequality in (15), we get the second expression of the upper bound with

$$\|(I-\check{P})\kappa(x_t,\cdot)\|_{\mathcal{H}}^2 < \kappa(x_t,x_t) - \sqrt{\kappa(x_t,x_t) - \delta^2}.$$

Finally, we compare these two expressions. It is easy to see that $\kappa(x_t,x_t) - \sqrt{\kappa(x_t,x_t) - \delta^2}$ is sharper than $\delta^2$ when

$$\kappa(x_t,x_t) - \sqrt{\kappa(x_t,x_t) - \delta^2} < \delta^2,$$

namely when $\kappa(x_t,x_t) - \delta^2 < \sqrt{\kappa(x_t,x_t) - \delta^2}$.
This condition, of the form $u^2 < u$, is satisfied when $u < 1$, namely $\kappa(x_t,x_t) - \delta^2 < 1$.

Theorem 2 (Discarding error for the coherence criterion): Let $x_t$ be a sample not satisfying the coherence condition (10) for some given threshold $\gamma$. The quadratic error of approximating $\kappa(x_t,\cdot)$ with a linear combination of atoms from the resulting dictionary is upper-bounded by $\kappa(x_t,x_t) - \gamma\sqrt{\kappa(x_t,x_t)}$. In the particular case of unit-norm atoms, we get $1 - \gamma$.

Proof: The unfulfilled coherence condition (10), namely

$$\max_{j=1\cdots m} \frac{|\kappa(x_t,\check{x}_j)|}{\sqrt{\kappa(x_t,x_t)\,\kappa(\check{x}_j,\check{x}_j)}} > \gamma,$$

can be written in the equivalent form

$$\max_{j=1\cdots m} \frac{|\kappa(x_t,\check{x}_j)|}{\sqrt{\kappa(\check{x}_j,\check{x}_j)}} > \gamma\sqrt{\kappa(x_t,x_t)}.$$

By injecting this inequality in (15), we get an upper bound on the approximation error as follows:

$$\|(I-\check{P})\kappa(x_t,\cdot)\|_{\mathcal{H}}^2 < \kappa(x_t,x_t) - \gamma\sqrt{\kappa(x_t,x_t)},$$

which concludes the proof.

Theorem 3 (Discarding error for the Babel criterion): Let $x_t$ be a sample not satisfying the Babel condition (11) for some given threshold $\gamma$. The quadratic error of approximating $\kappa(x_t,\cdot)$ with a linear combination of atoms from the resulting dictionary is upper-bounded by

$$\kappa(x_t,x_t) - \frac{\gamma}{\sqrt{m(R^2+\gamma)}},$$

which becomes $1 - \frac{\gamma}{\sqrt{m(1+\gamma)}}$ for unit-norm atoms.

Proof: To prove this result, we use the quadratic approximation error given in expression (14) where, for the particular case of $\beta_j = \operatorname{sign}(\kappa(x_t,\check{x}_j))$, we get

$$\|(I-\check{P})\kappa(x_t,\cdot)\|_{\mathcal{H}}^2 = \kappa(x_t,x_t) - \max_{\beta} \frac{\sum_{j=1}^m \beta_j\,\kappa(x_t,\check{x}_j)}{\|\sum_{j=1}^m \beta_j\,\kappa(\check{x}_j,\cdot)\|_{\mathcal{H}}} \leq \kappa(x_t,x_t) - \frac{\sum_{j=1}^m |\kappa(x_t,\check{x}_j)|}{(\beta^\top \check{K}\beta)^{1/2}}.$$

The above numerator is bounded since the Babel condition (11) is not satisfied, namely $\sum_{j=1}^m |\kappa(x_t,\check{x}_j)| > \gamma$. The above denominator is bounded thanks to the min-max theorem (i.e., the Rayleigh-Ritz quotient), with $\beta^\top \check{K}\beta \leq \check{\lambda}_1\,\|\beta\|^2$. It turns out that this upper bound is equal to $m(R^2+\gamma)$. To show this, on the one hand, we have $\sum_{j=1}^m \beta_j^2 = \sum_{j=1}^m |\operatorname{sign}(\kappa(x_t,\check{x}_j))|^2 = m$ due to the aforementioned particular choice of the $\beta_j$. On the other hand, the eigenvalues of a Gram matrix $\check{K}$ associated with a $\gamma$-Babel dictionary are upper-bounded by $R^2+\gamma$, as given in the Appendix; see [32, Theorem 5] for more details. The combination of all these results concludes the proof.

B. Approximation error of an atom from the dictionary

In this section, we study the approximation of an atom of a dictionary with a linear combination of its other atoms. We provide a lower bound on the approximation error for each sparsification criterion.

Let $\kappa(\check{x}_i,\cdot)$ be an atom of the dictionary, and consider its projection onto the subspace spanned by the other $m-1$ atoms. By following the same derivations as at the beginning of Section IV, we have

$$\|(I-\check{P})\kappa(\check{x}_i,\cdot)\|_{\mathcal{H}}^2 = \kappa(\check{x}_i,\check{x}_i) - \max_{\beta} \frac{\sum_{j=1, j\neq i}^m \beta_j\,\kappa(\check{x}_i,\check{x}_j)}{\|\sum_{j=1, j\neq i}^m \beta_j\,\kappa(\check{x}_j,\cdot)\|_{\mathcal{H}}}.$$

On the one hand, the numerator in the above expression is upper-bounded, since we have from the Cauchy-Schwarz inequality:

$$\Big( \sum_{\substack{j=1 \\ j\neq i}}^m \beta_j\,\kappa(\check{x}_i,\check{x}_j) \Big)^2 \leq \sum_{\substack{j=1 \\ j\neq i}}^m \beta_j^2 \; \sum_{\substack{j=1 \\ j\neq i}}^m |\kappa(\check{x}_i,\check{x}_j)|^2.$$
On the other hand, the denominator has a lower bound, since we have:

$$\Big\| \sum_{\substack{j=1 \\ j\neq i}}^m \beta_j\,\kappa(\check{x}_j,\cdot) \Big\|_{\mathcal{H}}^2 = \beta^\top \check{K}_{\setminus\{i\}}\,\beta \;\geq\; \check{\lambda}_{m-1}^{\setminus\{i\}}\,\|\beta\|^2,$$

where $\check{K}_{\setminus\{i\}}$ is the $(m-1)$-by-$(m-1)$ submatrix of the Gram matrix $\check{K}$ obtained by removing its $i$-th row and its $i$-th column, i.e., the entries associated with $\check{x}_i$, and $\check{\lambda}_{m-1}^{\setminus\{i\}}$ is its smallest eigenvalue. By combining these two inequalities, we get

$$\|(I-\check{P})\kappa(\check{x}_i,\cdot)\|_{\mathcal{H}}^2 \;\geq\; \kappa(\check{x}_i,\check{x}_i) - \sqrt{\frac{1}{\check{\lambda}_{m-1}^{\setminus\{i\}}} \sum_{\substack{j=1 \\ j\neq i}}^m |\kappa(\check{x}_i,\check{x}_j)|^2}. \qquad (16)$$

This expression is investigated in the following for the distance (Theorem 4), the coherence (Theorem 5), and the Babel (Theorem 6) criteria. For each sparsification criterion, the above lower bound is written by using the corresponding summation expression and the appropriate lower bound on the eigenvalues, as derived in [32, Section IV] and put in a nutshell in the Appendix.

Theorem 4 (Acceptance error for the distance criterion): For a dictionary resulting from the distance criterion for some given threshold $\delta$, the quadratic error of approximating any atom $\kappa(\check{x}_i,\cdot)$ with a linear combination of the other atoms is lower-bounded by

$$\kappa(\check{x}_i,\check{x}_i) - \sqrt{\frac{\big(\kappa(\check{x}_i,\check{x}_i) - \delta^2\big)\,(m-1)\,R^2}{r^2 - (m-2)\,R\sqrt{R^2-\delta^2}}}.$$

For unit-norm atoms, we get a lower bound for all atoms, with

$$1 - \sqrt{\frac{(m-1)(1-\delta^2)}{1 - (m-2)\sqrt{1-\delta^2}}}.$$

Proof: The proof is split in two parts, by investigating expression (16). Firstly, the summation term is upper-bounded since, from (8), we have that any pair $(\check{x}_i,\check{x}_j)$ satisfies

$$\sum_{\substack{j=1 \\ j\neq i}}^m |\kappa(\check{x}_i,\check{x}_j)|^2 \leq \sum_{\substack{j=1 \\ j\neq i}}^m \kappa(\check{x}_j,\check{x}_j)\big(\kappa(\check{x}_i,\check{x}_i) - \delta^2\big) = \big(\kappa(\check{x}_i,\check{x}_i) - \delta^2\big) \sum_{\substack{j=1 \\ j\neq i}}^m \kappa(\check{x}_j,\check{x}_j) \leq \big(\kappa(\check{x}_i,\check{x}_i) - \delta^2\big)\,(m-1)\,\max_{\substack{j=1\cdots m \\ j\neq i}} \kappa(\check{x}_j,\check{x}_j) \leq \big(\kappa(\check{x}_i,\check{x}_i) - \delta^2\big)\,(m-1)\,R^2.$$

Secondly, the eigenvalue in this expression is lower-bounded by $r^2 - (m-2)\,R\sqrt{R^2-\delta^2}$ for a $\delta$-distant dictionary of $m-1$ atoms, as shown in Lemma A.1 of the Appendix.

Theorem 5 (Acceptance error for the coherence criterion): For a dictionary resulting from the coherence criterion for some given threshold $\gamma$, the quadratic error of approximating any atom $\kappa(\check{x}_i,\cdot)$ with a linear combination of the other atoms is lower-bounded by

$$\kappa(\check{x}_i,\check{x}_i) - \sqrt{\frac{(m-1)\,\gamma^2\,R^2\,\kappa(\check{x}_i,\check{x}_i)}{r^2 - (m-2)\,\gamma R^2}}.$$

For unit-norm atoms, this bound becomes independent of $\check{x}_i$, with

$$1 - \sqrt{\frac{(m-1)\,\gamma^2}{1 - (m-2)\,\gamma}}.$$

Proof: The proof follows the same procedure as in the previous proof. On the one hand, we have

$$\sum_{\substack{j=1 \\ j\neq i}}^m |\kappa(\check{x}_i,\check{x}_j)|^2 \leq (m-1) \max_{\substack{j=1\cdots m \\ j\neq i}} |\kappa(\check{x}_i,\check{x}_j)|^2 \leq (m-1)\,\gamma^2 \max_{\substack{j=1\cdots m \\ j\neq i}} \kappa(\check{x}_i,\check{x}_i)\,\kappa(\check{x}_j,\check{x}_j) \leq (m-1)\,\gamma^2 R^2\,\kappa(\check{x}_i,\check{x}_i),$$

where the second inequality follows from the coherence condition. On the other hand, we use the lower bound $r^2 - (m-2)\,\gamma R^2$ on the eigenvalues associated with a $\gamma$-coherent dictionary of $m-1$ atoms, as derived in Lemma A.2 of the Appendix. To complete the proof, we combine these results in (16).
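As a sanity check of Theorem 5, one may build a γ-coherent dictionary with unit-norm Gaussian atoms and compare the exact acceptance error of each atom with the stated lower bound. The following Python sketch is illustrative (the random stream, the bandwidth and the helper kern are our choices) and runs in the regime $(m-2)\gamma < 1$ where the unit-norm bound is real.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, gamma = 1.0, 0.2

def kern(a, b):
    # Gaussian kernel: unit-norm atoms, kern(x, x) = 1
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma**2))

# Build a gamma-coherent dictionary from a random 1-D stream, condition (10).
D = []
for _ in range(1000):
    x = rng.uniform(-3, 3, size=1)
    if all(abs(kern(x, d)) <= gamma for d in D):
        D.append(x)
m = len(D)

K = np.array([[kern(a, b) for b in D] for a in D])
if (m - 2) * gamma < 1:  # regime where the unit-norm bound of Theorem 5 is real
    bound = 1 - np.sqrt((m - 1) * gamma**2 / (1 - (m - 2) * gamma))
    for i in range(m):
        idx = [j for j in range(m) if j != i]
        k_i = K[idx, i]
        # exact quadratic error of approximating atom i with the other atoms
        err = K[i, i] - k_i @ np.linalg.solve(K[np.ix_(idx, idx)], k_i)
        assert err >= bound - 1e-9
```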
Theorem 6 (Acceptance error for the Babel criterion): For a dictionary resulting from the Babel criterion for some given threshold $\gamma$, the quadratic error of approximating any atom $\kappa(\check{x}_i,\cdot)$ with a linear combination of the other atoms is lower-bounded by

$$\kappa(\check{x}_i,\check{x}_i) - \frac{\gamma}{\sqrt{r^2-\gamma}}.$$

For unit-norm atoms, we get the following lower bound for all atoms:

$$1 - \frac{\gamma}{\sqrt{1-\gamma}}.$$

Proof: The proof is obtained by substituting the $\ell_2$-norm in (16), i.e., $\big(\sum_j |\kappa(\check{x}_i,\check{x}_j)|^2\big)^{1/2}$, with an $\ell_1$-norm, since we have the relation $\|u\|_2 \leq \|u\|_1$. This yields

$$\|(I-\check{P})\kappa(\check{x}_i,\cdot)\|_{\mathcal{H}}^2 \;\geq\; \kappa(\check{x}_i,\check{x}_i) - \frac{1}{\sqrt{\check{\lambda}_{m-1}^{\setminus\{i\}}}} \sum_{\substack{j=1 \\ j\neq i}}^m |\kappa(\check{x}_i,\check{x}_j)|.$$

The above summation term is upper-bounded by $\gamma$ thanks to the Babel definition in (12). Moreover, the above eigenvalue is lower-bounded by $r^2-\gamma$, as derived in Lemma A.3 of the Appendix. This concludes the proof.

This theorem should be compared with the work of [19, Theorem 1], where the authors propose a lower bound on the quadratic approximation error for unit-norm atoms, with $1-\gamma$. It is easy to see that the lower bound given in Theorem 6 is tighter than the previously proposed bound, and extends the result to atoms that are not unit-norm.

V. APPROXIMATION OF A FEATURE

In this section, we study the relevance of approximating any feature with its projection onto the subspace spanned by the atoms of a dictionary. An upper bound on the approximation error is derived in the following theorem for any sparsification criterion, while specific bounds in terms of the threshold of each criterion are given in the subsequent Theorem 8. Moreover, these results are explored in two particular kernel-based learning algorithms, with the empirical mean (see Section V-A) and the principal axes (see Section V-B) as features to be estimated.

Theorem 7: Consider the approximation of some feature $\psi(\cdot) = \sum_{i=1}^n \alpha_i\,\kappa(x_i,\cdot)$ with a sparse solution given by projecting it onto the subspace spanned by the $m$ atoms of a given dictionary. The quadratic error of such an approximation is upper-bounded by $(n-m)\,\|\alpha\|^2\,\epsilon^2$, where $\epsilon$ is an upper bound on the error of approximating any $\kappa(x_i,\cdot)$ with a linear combination of atoms from the dictionary.

Proof: Let $\check{P}$ be the projection operator onto the subspace spanned by the atoms of the dictionary under scrutiny, approximating $\psi(\cdot) = \sum_{i=1}^n \alpha_i\,\kappa(x_i,\cdot)$ with $\check{\psi}(\cdot) = \sum_{j=1}^m \check{\alpha}_j\,\kappa(\check{x}_j,\cdot)$. The error of such an approximation is

$$\|(I-\check{P})\psi(\cdot)\|_{\mathcal{H}} = \Big\| \sum_{i=1}^n \alpha_i\,(I-\check{P})\kappa(x_i,\cdot) \Big\|_{\mathcal{H}} \leq \sum_{i=1}^n |\alpha_i|\, \|(I-\check{P})\kappa(x_i,\cdot)\|_{\mathcal{H}}, \qquad (17)$$

where the inequality is due to the generalized triangle inequality. By applying the Cauchy-Schwarz inequality, we get the quadratic approximation error

$$\|(I-\check{P})\psi(\cdot)\|_{\mathcal{H}}^2 \leq \sum_{i=1}^n \alpha_i^2 \; \sum_{i=1}^n \|(I-\check{P})\kappa(x_i,\cdot)\|_{\mathcal{H}}^2. \qquad (18)$$

The first summation is the quadratic $\ell_2$-norm of the vector of coefficients, namely $\|\alpha\|^2$. For the second summation, we separate it into two terms: entries belonging to the dictionary and those discarded thanks to the used sparsification criterion. While the former do not contribute to the error, only the latter take part in the summation, namely the $n-m$ discarded samples, where $m$ is the size of the dictionary.
Let $\epsilon^2$ be an upper bound on the quadratic error of discarding samples, as given in Section IV-A. Then, we get

$$\|(I-\check{P})\psi(\cdot)\|_{\mathcal{H}}^2 \leq (n-m)\,\|\alpha\|^2\,\epsilon^2,$$

which concludes the proof.

By revisiting the upper bounds given in Section IV-A for each sparsification criterion, we can easily show the following results. Expressions for non-unit-norm atoms can be derived without difficulty from Theorems 1, 2, 3.

Theorem 8: The upper bound given in Theorem 7 can be specified for each sparsification criterion in terms of the used threshold. For unit-norm atoms, we have

• $(n-m)\,\|\alpha\|^2\,(1-\sqrt{1-\delta^2})$ for the $\delta$-distant criterion.
• $(n-m)\,\|\alpha\|^2\,\delta^2$ for the $\delta$-approximate criterion.
• $(n-m)\,\|\alpha\|^2\,(1-\gamma)$ for the $\gamma$-coherent criterion.
• $(n-m)\,\|\alpha\|^2\,\Big(1-\frac{\gamma}{\sqrt{m(1+\gamma)}}\Big)$ for the $\gamma$-Babel criterion.

We explore next these results for two particular kernel-based learning algorithms, in order to clarify the relevance of these bounds.

A. Approximation of the empirical mean

The empirical mean is a fundamental feature of a set of samples, and its use is essential in many statistical methods. For instance, it is investigated in [23] for visualization and clustering of nonnegative data and in [22], [33] for one-class classification with kernel-based methods. In the following, we study the relevance of approximating the empirical mean by its projection onto the subspace spanned by the atoms of a dictionary.

Let $\psi(\cdot) = \frac{1}{n}\sum_{i=1}^n \kappa(x_i,\cdot)$ be the empirical mean, namely $\alpha_i = 1/n$ for any $i = 1,2,\ldots,n$. From Theorem 7, we get

$$\|(I-\check{P})\psi(\cdot)\|_{\mathcal{H}}^2 \leq \Big(1-\frac{m}{n}\Big)\,\epsilon^2, \qquad (19)$$

where $\max_i \|(I-\check{P})\kappa(x_i,\cdot)\|_{\mathcal{H}} \leq \epsilon$.

In the following, we provide a sharper bound by relaxing the use of the Cauchy-Schwarz inequality in (18), thanks to the fact that the coefficients $\alpha_i$ are constant, i.e., independent of $i$. As a consequence, we get by revisiting expression (17):

$$\|(I-\check{P})\psi(\cdot)\|_{\mathcal{H}} \leq \sum_{i=1}^n |\alpha_i|\,\|(I-\check{P})\kappa(x_i,\cdot)\|_{\mathcal{H}} = \frac{1}{n}\sum_{i=1}^n \|(I-\check{P})\kappa(x_i,\cdot)\|_{\mathcal{H}} \leq \frac{1}{n}(n-m)\,\epsilon,$$

where we have followed the same decomposition as in the proof of Theorem 7, with only the $n-m$ discarded samples contributing to the summation term. Therefore, the quadratic approximation error is upper-bounded as follows:

$$\|(I-\check{P})\psi(\cdot)\|_{\mathcal{H}}^2 \leq \Big(1-\frac{m}{n}\Big)^2\,\epsilon^2.$$

This bound is sharper than the one in (19) since $1-\frac{m}{n} < 1$.

By revisiting Theorem 8 in the light of this result, the upper bound on the quadratic approximation error $\|(I-\check{P})\psi(\cdot)\|_{\mathcal{H}}^2$ can be described in terms of the threshold of each sparsification criterion, as follows:

• $\big(1-\frac{m}{n}\big)^2\,(1-\sqrt{1-\delta^2})$ for the $\delta$-distant criterion.
• $\big(1-\frac{m}{n}\big)^2\,\delta^2$ for the $\delta$-approximate criterion.
• $\big(1-\frac{m}{n}\big)^2\,(1-\gamma)$ for the $\gamma$-coherent criterion.
• $\big(1-\frac{m}{n}\big)^2\,\Big(1-\frac{\gamma}{\sqrt{m(1+\gamma)}}\Big)$ for the $\gamma$-Babel criterion.

These results generalize the work in [21], where only the case of the coherence criterion is studied. Expressions for dictionaries with atoms that are not unit-norm can be easily obtained thanks to Theorems 1, 2, 3.
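The bound above is easy to probe numerically. The sketch below is illustrative (a uniform 1-D stream, a Gaussian kernel and a γ-coherent dictionary are our assumptions): it computes the exact quadratic error of projecting the empirical mean onto the dictionary span and compares it with $(1-m/n)^2(1-\gamma)$.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, gamma = 1.0, 0.5
X = rng.uniform(-3, 3, size=(200, 1))

def kern(a, b):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma**2))

# Online coherence criterion over the stream X, condition (10).
D = []
for x in X:
    if all(abs(kern(x, d)) <= gamma for d in D):
        D.append(x)
n, m = len(X), len(D)

# Exact quadratic error of projecting psi = (1/n) sum_i kappa(x_i, .) onto span(D):
# ||(I - P)psi||^2 = <psi, psi> - k_psi^T K_D^{-1} k_psi
K_full = np.array([[kern(a, b) for b in X] for a in X])
K_D = np.array([[kern(a, b) for b in D] for a in D])
k_psi = np.array([np.mean([kern(x, d) for x in X]) for d in D])
err = np.mean(K_full) - k_psi @ np.linalg.solve(K_D, k_psi)

bound = (1 - m / n) ** 2 * (1 - gamma)  # Section V-A bound, gamma-coherent case
print(err, bound, err <= bound + 1e-9)
```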
B. Approximation of the most relevant principal axes

Any sparsification criterion can be seen as a dimensionality reduction technique, because it identifies a subspace by selecting relevant samples from the available ones. Since it is an unsupervised approach, it is natural to connect it with kernel principal component analysis. For the sake of clarity, it is assumed that the data are centered in the feature space; see [34] for connections to the uncentered case.

Principal component analysis (PCA) seeks the principal axes that capture most of the data variance. The principal axes correspond to the eigenvectors associated with the largest eigenvalues of the covariance matrix. In its kernel-based counterpart, i.e., kernel-PCA, the $k$-th principal axis takes the form $\psi_k(\cdot) = \sum_{i=1}^n \alpha_{i,k}\,\kappa(x_i,\cdot)$, where the coefficients $\alpha_{i,k}$ are the entries of the $k$-th eigenvector of the Gram matrix $K$. Moreover, to get unit-norm principal axes, the coefficients $\alpha_{i,k}$ are normalized such that $\sum_{i=1}^n \alpha_{i,k}^2 = 1/n\lambda_k$. In this expression, $\lambda_k$ is the $k$-th eigenvalue of the Gram matrix, also called principal value. In the following, we highlight the connections between kernel-PCA and the online sparsification criteria.

Theorem 9: Let $\psi_k(\cdot)$ be the $k$-th principal axis of the kernel functions $\kappa(x_i,\cdot)$, for $i = 1,2,\ldots,n$, associated with the eigenvalue $\lambda_k$ of the corresponding Gram matrix. Its approximation with a dictionary of $m$ kernel functions has a quadratic error that can be upper-bounded by

$$\Big(1-\frac{m}{n}\Big)\,\frac{\epsilon^2}{\lambda_k},$$

where $\epsilon$ is an upper bound on the error of approximating any $\kappa(x_i,\cdot)$ with a linear combination of atoms from the dictionary.

The proof of this theorem is straightforward, by substituting $\|\alpha\|^2$ with $1/n\lambda_k$ in Theorem 7. Theorem 9 shows that, under the only condition that the used dictionary has an upper bound on the error of approximating each kernel function, the principal axes associated with the largest principal values have the smallest approximation errors. One can therefore say that the most relevant principal axes lie, with a small error, in the span of the sparse dictionary.

Moreover, we derive expressions for each sparsification criterion, as given next in terms of the used threshold:

• $\big(1-\frac{m}{n}\big)\,\frac{1-\sqrt{1-\delta^2}}{\lambda_k}$ for the $\delta$-distant criterion.
• $\big(1-\frac{m}{n}\big)\,\frac{\delta^2}{\lambda_k}$ for the $\delta$-approximate criterion.
• $\big(1-\frac{m}{n}\big)\,\frac{1-\gamma}{\lambda_k}$ for the $\gamma$-coherent criterion.
• $\big(1-\frac{m}{n}\big)\,\frac{1}{\lambda_k}\Big(1-\frac{\gamma}{\sqrt{m(1+\gamma)}}\Big)$ for the $\gamma$-Babel criterion.

These results generalize previous work on the approximation and the coherence criteria, and provide tighter bounds than the ones previously known in the literature. Indeed, the upper bound $\delta^2/\lambda_k$ was derived for the approximation criterion in [11, Theorem 3.3] and in [8, Theorem 5], while the coherence criterion is studied in [15, Proposition 5] with the upper bound $(1-\gamma)/\lambda_k$.
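Theorem 9 can be probed numerically as well. The sketch below is illustrative under our own assumptions: centering in the feature space is skipped for brevity, and $\lambda_k$ is taken as an eigenvalue of the normalized Gram matrix $K/n$ so that $\|\alpha\|^2 = 1/(n\lambda_k)$ holds with unit-norm axes.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, gamma = 1.0, 0.5
X = rng.uniform(-3, 3, size=(150, 1))

def kern(a, b):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma**2))

D = []  # gamma-coherent dictionary, condition (10)
for x in X:
    if all(abs(kern(x, d)) <= gamma for d in D):
        D.append(x)
n, m = len(X), len(D)

K = np.array([[kern(a, b) for b in X] for a in X])
lam, U = np.linalg.eigh(K / n)          # eigenvalues of the normalized Gram matrix
order = np.argsort(lam)[::-1]
lam, U = lam[order], U[:, order]

K_D = np.array([[kern(a, b) for b in D] for a in D])
C = np.array([[kern(x, d) for d in D] for x in X])  # n-by-m cross-kernel matrix

for k in range(3):                      # the three leading principal axes
    alpha = U[:, k] / np.sqrt(n * lam[k])   # normalization: ||alpha||^2 = 1/(n lam_k)
    k_psi = C.T @ alpha                     # <psi_k, kappa(x_j, .)> for each atom
    err = alpha @ K @ alpha - k_psi @ np.linalg.solve(K_D, k_psi)
    bound = (1 - m / n) * (1 - gamma) / lam[k]  # Theorem 9, gamma-coherent case
    print(k, err, bound, err <= bound + 1e-9)
```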
VI. FINAL REMARKS

In this paper, we studied the approximation errors of any sample when dealing with the distance, the coherence, or the Babel criterion, revealing that these criteria are roughly based on an approximation process. By deriving an upper bound on the error of approximating a sample discarded from the dictionary, we showed that the atoms are "sufficient" to represent any sample. The dual condition, namely showing that each atom of the dictionary is "necessary", was also exhibited by providing a lower bound on the error of approximating any atom of the dictionary with the other atoms. Moreover, beyond the analysis of a single sample, we extended these results to the estimation of any feature, by describing in detail two classes of features, the empirical mean (i.e., centroid) and the principal axes in kernel-PCA.

This work did not devise any particular sparsification criterion. It provided a framework to study online sparsification criteria. We argued that these criteria behave according to essentially the same mechanism, and share many interesting and desirable properties. Without loss of generality, we considered the framework of kernel-based learning algorithms. It is worth noting that these machines are intimately connected with Gaussian processes [6], where the approximation criterion was initially proposed [10].

APPENDIX

This appendix provides bounds on the eigenvalues of a Gram matrix associated with a sparse dictionary, for each of the sparsity measures investigated in this paper. For completeness, these bounds are put here in a nutshell; see [32, Section IV] for more details. A cornerstone of these results is the well-known Geršgorin Discs Theorem [35, Chapter 6]. Revisited here for a Gram matrix associated with a sparse dictionary, it states that any of its eigenvalues lies in the union of the $m$ discs, centered on each diagonal entry of $\check{K}$ with a radius given by the sum of the absolute values of the other $m-1$ entries from the same row. In other words, for each eigenvalue $\check{\lambda}_k$, there exists at least one $i \in \{1,2,\ldots,m\}$ such that

$$|\check{\lambda}_k - \kappa(\check{x}_i,\check{x}_i)| \leq \sum_{\substack{j=1 \\ j\neq i}}^m |\kappa(\check{x}_i,\check{x}_j)|. \qquad (20)$$

This theorem provides upper and lower bounds on the eigenvalues of a Gram matrix associated with a sparse dictionary, as described in the following for each sparsity measure.

Lemma A.1: The eigenvalues of a Gram matrix associated with a $\delta$-distant dictionary of $m$ atoms are bounded as follows:

$$r^2 - (m-1)\,R\sqrt{R^2-\delta^2} \;\leq\; \check{\lambda}_m \leq \cdots \leq \check{\lambda}_1 \;\leq\; R^2 + (m-1)\,R\sqrt{R^2-\delta^2}.$$

Proof: From (8), a $\delta$-distant dictionary satisfies

$$|\kappa(\check{x}_i,\check{x}_j)| \leq \sqrt{\kappa(\check{x}_j,\check{x}_j)\big(\kappa(\check{x}_i,\check{x}_i)-\delta^2\big)},$$

for any $i = 1,2,\ldots,m$, which yields

$$\sum_j |\kappa(\check{x}_i,\check{x}_j)| \leq \sum_j \sqrt{\kappa(\check{x}_j,\check{x}_j)\big(\kappa(\check{x}_i,\check{x}_i)-\delta^2\big)} = \sqrt{\kappa(\check{x}_i,\check{x}_i)-\delta^2}\; \sum_j \sqrt{\kappa(\check{x}_j,\check{x}_j)}.$$

By substituting this relation in (20), we get that, for each eigenvalue $\check{\lambda}_k$, there exists at least one $i$ such that

$$|\check{\lambda}_k - \kappa(\check{x}_i,\check{x}_i)| \leq \sqrt{\kappa(\check{x}_i,\check{x}_i)-\delta^2}\; \sum_{\substack{j=1 \\ j\neq i}}^m \sqrt{\kappa(\check{x}_j,\check{x}_j)}.$$

The stated bounds follow by using $r^2 \leq \kappa(\check{x}_i,\check{x}_i) \leq R^2$, $\sqrt{\kappa(\check{x}_i,\check{x}_i)-\delta^2} \leq \sqrt{R^2-\delta^2}$ and $\sqrt{\kappa(\check{x}_j,\check{x}_j)} \leq R$.

Lemma A.2: The eigenvalues of a Gram matrix associated with a $\gamma$-coherent dictionary of $m$ atoms are bounded as follows:

$$r^2 - (m-1)\,\gamma R^2 \;\leq\; \check{\lambda}_m \leq \cdots \leq \check{\lambda}_1 \;\leq\; R^2 + (m-1)\,\gamma R^2.$$

Proof: A $\gamma$-coherent dictionary satisfies

$$\max_{\substack{j=1\cdots m \\ j\neq i}} \frac{|\kappa(\check{x}_i,\check{x}_j)|}{\sqrt{\kappa(\check{x}_i,\check{x}_i)\,\kappa(\check{x}_j,\check{x}_j)}} \leq \gamma,$$

for any $i,j = 1,2,\ldots,m$, which yields

$$\max_{\substack{j=1\cdots m \\ j\neq i}} |\kappa(\check{x}_i,\check{x}_j)| \leq \gamma \max_{\substack{j=1\cdots m \\ j\neq i}} \sqrt{\kappa(\check{x}_i,\check{x}_i)\,\kappa(\check{x}_j,\check{x}_j)} = \gamma\sqrt{\kappa(\check{x}_i,\check{x}_i)} \max_{\substack{j=1\cdots m \\ j\neq i}} \sqrt{\kappa(\check{x}_j,\check{x}_j)} \leq \gamma R\sqrt{\kappa(\check{x}_i,\check{x}_i)}.$$

By injecting this expression in (20), we get

$$\sum_{\substack{j=1 \\ j\neq i}}^m |\kappa(\check{x}_i,\check{x}_j)| \leq (m-1) \max_{\substack{j=1\cdots m \\ j\neq i}} |\kappa(\check{x}_i,\check{x}_j)| \leq (m-1)\,\gamma R\sqrt{\kappa(\check{x}_i,\check{x}_i)}.$$

Since $\kappa(\check{x}_i,\check{x}_i) \leq R^2$, this completes the proof.

Lemma A.3: The eigenvalues of a Gram matrix associated with a $\gamma$-Babel dictionary are bounded as follows:

$$r^2 - \gamma \;\leq\; \check{\lambda}_m \leq \cdots \leq \check{\lambda}_1 \;\leq\; R^2 + \gamma.$$

Proof: The proof is straightforward from the Geršgorin Discs Theorem, since expression (20) becomes

$$|\check{\lambda}_k - \kappa(\check{x}_i,\check{x}_i)| \leq \sum_{\substack{j=1 \\ j\neq i}}^m |\kappa(\check{x}_i,\check{x}_j)| \leq \gamma,$$

where the last inequality is due to the Babel measure.
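As a quick numerical illustration of Lemma A.2, one can build a γ-coherent dictionary (with the same illustrative helpers as above) and check that all eigenvalues of its Gram matrix fall within the stated interval; note that the lower bound $1-(m-1)\gamma$ is informative only when $(m-1)\gamma < 1$, and is trivially satisfied otherwise.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, gamma = 1.0, 0.3

def kern(a, b):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma**2))

D = []  # gamma-coherent dictionary, condition (10)
for x in rng.uniform(-4, 4, size=(500, 1)):
    if all(abs(kern(x, d)) <= gamma for d in D):
        D.append(x)
m = len(D)

K = np.array([[kern(a, b) for b in D] for a in D])
eig = np.linalg.eigvalsh(K)
# Lemma A.2 with unit-norm atoms (r = R = 1)
low, high = 1 - (m - 1) * gamma, 1 + (m - 1) * gamma
print(eig.min() >= low - 1e-9, eig.max() <= high + 1e-9)
```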
REFERENCES

[1] "Special issue on signal processing for big data," IEEE Journal of Selected Topics in Signal Processing, vol. 8, pp. 507–507, June 2014.
[2] "Special issue on signal processing for big data," IEEE Transactions on Signal Processing, vol. 62, pp. 1899–1899, April 2014.
[3] G. Giannakis, F. Bach, R. Cendrillon, M. Mahoney, and J. Neville, "Signal processing for big data [from the guest editors]," IEEE Signal Processing Magazine, vol. 31, pp. 15–16, Sept. 2014.
[4] J. Platt, "A resource-allocating network for function interpolation," Neural Comput., vol. 3, pp. 213–225, June 1991.
[5] V. N. Vapnik, Statistical Learning Theory. New York, NY, USA: Wiley, September 1998.
[6] C. E. Rasmussen and C. Williams, Gaussian Processes for Machine Learning. MIT Press, 2006.
[7] G.-B. Huang, P. Saratchandran, and N. Sundararajan, "A generalized growing and pruning RBF (GGAP-RBF) neural network for function approximation," IEEE Transactions on Neural Networks, vol. 16, pp. 57–67, 2005.
[8] P. Honeine, "Online kernel principal component analysis: a reduced-order model," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, pp. 1814–1826, September 2012.
[9] B. Schölkopf, R. Herbrich, and A. J. Smola, "A generalized representer theorem," in Proc. 14th Annual Conference on Computational Learning Theory and 5th European Conference on Computational Learning Theory, COLT/EuroCOLT, (London, UK), pp. 416–426, Springer-Verlag, 2001.
[10] L. Csató and M. Opper, "Sparse online Gaussian processes," Neural Computation, vol. 14, pp. 641–668, 2002.
[11] Y. Engel, S. Mannor, and R. Meir, "The kernel recursive least squares algorithm," IEEE Trans. Signal Processing, vol. 52, no. 8, pp. 2275–2285, 2004.
[12] P. P. Pokharel, W. Liu, and J. C. Principe, "Kernel least mean square algorithm with constrained growth," Signal Processing, vol. 89, no. 3, pp. 257–265, 2009.
[13] G. S. Babu and S. Suresh, "Meta-cognitive RBF network and its projection based learning algorithm for classification problems," Appl. Soft Comput., vol. 13, pp. 654–666, Jan. 2013.
[14] Y.-K. Yang, T.-Y. Sun, C.-L. Huo, Y.-H. Yu, C.-C. Liu, and C.-H. Tsai, "A novel self-constructing radial basis function neural-fuzzy system," Applied Soft Computing, vol. 13, no. 5, pp. 2390–2404, 2013.
[15] P. Honeine, C. Richard, and J. C. M. Bermudez, "On-line nonlinear sparse approximation of functions," in Proc. IEEE International Symposium on Information Theory, (Nice, France), pp. 956–960, June 2007.
[16] C. Richard, J. C. M. Bermudez, and P. Honeine, "Online prediction of time series data with kernels," IEEE Transactions on Signal Processing, vol. 57, pp. 1058–1067, March 2009.
[17] J. A. Tropp, "Greed is good: algorithmic results for sparse approximation," IEEE Trans. Information Theory, vol. 50, pp. 2231–2242, 2004.
[18] M. Elad, Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing. Springer, 2010.
[19] H. Fan, Q. Song, and S. B. Shrestha, "Online learning with kernel regularized least mean square algorithms," Knowledge-Based Systems, vol. 59, pp. 21–32, 2014.
[20] W. Liu, J. C. Principe, and S. Haykin, Kernel Adaptive Filtering: A Comprehensive Introduction. Wiley Publishing, 1st ed., 2010.
[21] Z. Noumir, P. Honeine, and C. Richard, "One-class machines based on the coherence criterion," in Proc. IEEE Workshop on Statistical Signal Processing, (Ann Arbor, Michigan, USA), pp. 600–603, 5–8 August 2012.
[22] Z. Noumir, P. Honeine, and C. Richard, "On simple one-class classification methods," in Proc. IEEE International Symposium on Information Theory, (MIT, Cambridge (MA), USA), pp. 2022–2026, 1–6 July 2012.
[23] R. Jenssen, "Mean vector component analysis for visualization and clustering of nonnegative data," IEEE Transactions on Neural Networks and Learning Systems, vol. 24, pp. 1553–1564, Oct. 2013.
[24] B. Schölkopf, A. Smola, and K.-R. Müller, "Nonlinear component analysis as a kernel eigenvalue problem," Neural Comput., vol. 10, pp. 1299–1319, July 1998.
[25] C. Saidé, R. Lengellé, P. Honeine, and R. Achkar, "Online kernel adaptive algorithms with dictionary adaptation for MIMO models," IEEE Signal Processing Letters, vol. 20, pp. 535–538, May 2013.
[26] G. Baudat and F. Anouar, "Kernel-based methods and function approximation," in International Joint Conference on Neural Networks (IJCNN), vol. 5, (Washington, DC, USA), pp. 1244–1249, July 2001.
[27] L. Csató and M. Opper, "Sparse representation for Gaussian process models," in Advances in Neural Information Processing Systems 13, pp. 444–450, MIT Press, 2001.
[28] D. Nguyen-Tuong and J. Peters, "Incremental online sparsification for model learning in real-time robot control," Neurocomputing, vol. 74, no. 11, pp. 1859–1867, 2011.
[29] R. Rosipal, M. Koska, and I. Farkas, "Prediction of chaotic time-series with a resource-allocating RBF network," in Neural Processing Letters, pp. 185–197, 1997.
[30] B. Chen, S. Zhao, P. Zhu, and J. Principe, "Quantized kernel least mean square algorithm," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, pp. 22–32, Jan. 2012.
[31] A. C. Gilbert, S. Muthukrishnan, M. J. Strauss, and J. Tropp, "Improved sparse approximation over quasi-incoherent dictionaries," in International Conference on Image Processing (ICIP), vol. 1, (Barcelona, Spain), pp. 37–40, Sept. 2003.
[32] P. Honeine, "Analyzing sparse dictionaries for online learning with kernels," IEEE Transactions on Signal Processing, 2014, submitted.
[33] Z. Noumir, P. Honeine, and C. Richard, "Online one-class machines based on the coherence criterion," in Proc. 20th European Conference on Signal Processing, (Bucharest, Romania), pp. 664–668, 27–31 August 2012.
[34] P. Honeine, "An eigenanalysis of data centering in machine learning," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014, submitted.
[35] R. A. Horn and C. R. Johnson, Matrix Analysis. New York, NY, USA: Cambridge University Press, 2nd ed., December 2012.

Paul Honeine (M'07) was born in Beirut, Lebanon, on October 2, 1977. He received the Dipl.-Ing. degree in mechanical engineering in 2002 and the M.Sc. degree in industrial control in 2003, both from the Faculty of Engineering, the Lebanese University, Lebanon.
In 2007, he received the Ph.D. degree in Systems Optimisation and Security from the University of Technology of Troyes, France, and was a Postdoctoral Research Associate with the Systems Modeling and Dependability Laboratory from 2007 to 2008. Since September 2008, he has been an Assistant Professor at the University of Technology of Troyes, France. His research interests include nonstationary signal analysis and classification, nonlinear and statistical signal processing, sparse representations, and machine learning. Of particular interest are applications to (wireless) sensor networks, biomedical signal processing, hyperspectral imagery and nonlinear adaptive system identification. He is the co-author (with C. Richard) of the 2009 Best Paper Award at the IEEE Workshop on Machine Learning for Signal Processing. Over the past 5 years, he has published more than 100 peer-reviewed papers.
