Dictionary learning for fast classification based on soft-thresholding


Authors: Alhussein Fawzi*, Mike Davies†, Pascal Frossard*

October 3, 2014

Abstract

Classifiers based on sparse representations have recently been shown to provide excellent results in many visual recognition and classification tasks. However, the high cost of computing sparse representations at test time is a major obstacle that limits the applicability of these methods in large-scale problems, or in scenarios where computational power is restricted. We consider in this paper a simple yet efficient alternative to sparse coding for feature extraction. We study a classification scheme that applies the soft-thresholding nonlinear mapping in a dictionary, followed by a linear classifier. A novel supervised dictionary learning algorithm tailored for this low-complexity classification architecture is proposed. The dictionary learning problem, which jointly learns the dictionary and linear classifier, is cast as a difference-of-convex (DC) program and solved efficiently with an iterative DC solver. We conduct experiments on several datasets, and show that our learning algorithm, which leverages the structure of the classification problem, outperforms generic learning procedures. Our simple classifier based on soft-thresholding also competes with recent sparse coding classifiers when the dictionary is learned appropriately. The adopted classification scheme further requires less computational time at the testing stage than other classifiers. The proposed scheme shows the potential of the adequately trained soft-thresholding mapping for classification and paves the way towards the development of very efficient classification methods for vision problems.

1 Introduction

The recent decade has witnessed the emergence of huge volumes of high-dimensional information produced by all sorts of sensors.
For instance, a massive amount of high-resolution images is uploaded to the Internet every minute. In this context, one of the key challenges is to develop techniques to process these large amounts of data in a computationally efficient way. We focus in this paper on the image classification problem, which is one of the most challenging tasks in image analysis and computer vision. Given training examples from multiple classes, the goal is to find a rule that permits the prediction of the class of test samples. Linear classification is a computationally efficient way to categorize test samples. It consists in finding a linear separator between two classes. Linear classification has been the focus of much research in statistics and machine learning for decades, and the resulting algorithms are well understood. However, many datasets cannot be separated linearly and require complex nonlinear classifiers. A popular nonlinear scheme, which leverages the efficiency and simplicity of linear classifiers, embeds the data into a high-dimensional feature space, where a linear classifier is eventually sought. The feature space mapping is chosen to be nonlinear in order to convert nonlinear relations to linear relations. This nonlinear classification framework is at the heart of the popular kernel-based methods (Shawe-Taylor and Cristianini, 2004), which make use of a computational shortcut to bypass the explicit computation of feature vectors. Despite the popularity of kernel-based classification, its computational complexity at test time strongly depends on the number of training samples (Burges, 1998), which limits its applicability in large-scale settings. A more recent approach for nonlinear classification is based on sparse coding, which consists in finding a compact representation of the data in an overcomplete dictionary.
Sparse coding is known to be beneficial in signal processing tasks such as denoising (Elad and Aharon, 2006), inpainting (Fadili et al, 2009) and coding (Figueras i Ventura et al, 2006), but it has also recently emerged in the context of classification, where it is viewed as a nonlinear feature extraction mapping. It is usually followed by a linear classifier (Raina et al, 2007), but can also be used in conjunction with other classifiers (Wright et al, 2009). Classification architectures based on sparse coding have been shown to work very well in practice and even achieve state-of-the-art results on particular tasks (Mairal et al, 2012; Yang et al, 2009). The crucial drawback of sparse coding classifiers is, however, the prohibitive cost of computing the sparse representation of a signal or image sample at test time. This limits the relevance of such techniques in large-scale vision problems or when computational power is scarce.

* Ecole Polytechnique Federale de Lausanne (EPFL), Signal Processing Laboratory (LTS4), Lausanne 1015, Switzerland. Email: (alhussein.fawzi@epfl.ch, pascal.frossard@epfl.ch)
† IDCOM, The University of Edinburgh, Edinburgh, UK. Email: mike.davies@ed.ac.uk

Figure 1: Soft-thresholding classification scheme. The box in the middle applies the soft-thresholding nonlinearity h_α.

To remedy these large computational requirements, we adopt in the classification a computationally efficient sparsifying transform, the soft-thresholding mapping h_α, defined by:

h_α(z) = max(0, z − α) = (z − α)_+,   (1)

for α ∈ R_+ and (·)_+ = max(0, ·). Note that, unlike the usual definition of soft-thresholding given by sgn(z)(|z| − α)_+, we consider here the one-sided version of the soft-thresholding map, where the function is equal to zero for negative values (see Fig. 3 (a) vs. Fig. 3 (b)).
The map h_α is naturally extended to vectors z by applying the scalar map to each coordinate independently. Given a dictionary D, this map can be applied to a transformed signal z = D^T x that represents the coefficients of features in a signal x. Its outcome, which only retains the most important features of x, is used for classification. In more detail, we consider in this paper the following simple two-step procedure for classification:

1. Feature extraction: Let D = [d_1 | … | d_N] ∈ R^{n×N} and α ∈ R_+. Given a test point x ∈ R^n, compute h_α(D^T x).

2. Linear classification: Let w ∈ R^N. If w^T h_α(D^T x) is positive, assign x to class 1. Otherwise, assign it to class −1.

The architecture is illustrated in Fig. 1. The proposed classification scheme has the advantage of being simple, efficient and easy to implement, as it involves a single matrix-vector multiplication and a max operation. The soft-thresholding map has been successfully used in (Coates and Ng, 2011), as well as in a number of deep learning architectures (Kavukcuoglu et al, 2010b), which shows the relevance of this efficient feature extraction mapping. The remarkable results in Coates and Ng (2011) show that this simple encoder, when coupled with a standard learning algorithm, can often achieve results comparable to those of sparse coding, provided that the number of labeled samples and the dictionary size are large enough. However, when this is not the case, a proper training of the classifier parameters (D, w) becomes crucial for reaching good classification performance. This is the objective of this paper. We propose a novel supervised dictionary learning algorithm, which we call LAST (Learning Algorithm for Soft-Thresholding classifier). It jointly learns the dictionary D and the linear classifier w tailored for the classification architecture based on soft-thresholding.
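As an illustration, the two-step scheme above can be sketched in a few lines of NumPy (the function and variable names are ours, not from the paper):

```python
import numpy as np

def soft_threshold(z, alpha):
    """One-sided soft-thresholding h_alpha(z) = max(0, z - alpha) of Eq. (1),
    applied coordinate-wise; it returns zero for coordinates below alpha."""
    return np.maximum(0.0, z - alpha)

def classify(x, D, w, alpha=1.0):
    """Two-step scheme: feature extraction h_alpha(D^T x), then a linear
    classifier w; returns the predicted class, +1 or -1."""
    features = soft_threshold(D.T @ x, alpha)
    return 1 if w @ features > 0 else -1
```

At test time this costs a single matrix-vector product, one subtraction, one max and one dot product, which is the efficiency argument made above.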
We pose the learning problem as an optimization problem comprising a loss term that controls the classification accuracy and a regularizer that prevents overfitting. This problem is shown to be a difference-of-convex (DC) program, which is solved efficiently with an iterative DC solver. We then perform extensive experiments on texture, digit and natural image datasets, and show that the proposed classifier, coupled with our dictionary learning approach, exhibits remarkable performance with respect to numerous competitor methods. In particular, we show that our classifier provides comparable or better classification accuracy than sparse coding schemes.

The rest of this paper is organized as follows. In the next section, we highlight the related work. In Section 3, we formulate the dictionary learning problem for classifiers based on soft-thresholding. Section 4 then presents our novel learning algorithm, LAST, based on DC optimization. In Section 5, we perform extensive experiments on texture, natural image and digit datasets, and Section 6 finally gathers a number of important observations on the dictionary learning algorithm and the classification scheme.

2 Related work

We first highlight in this section the difference between the proposed approach and existing techniques from the sparse coding and dictionary learning literature. Then, we draw a connection between the considered approach and neural network models on the architecture and optimization aspects.

2.1 Sparse coding

The classification scheme adopted in this paper shares similarities with the now popular architectures that use sparse coding at the feature extraction stage. We recall that the sparse coding mapping, applied to a data point x in a dictionary D, consists in solving the optimization problem

argmin_{c ∈ R^N} ||x − Dc||_2^2 + λ ||c||_1.   (2)

It is now known that, when the parameters of the sparse coding classifier are trained in a discriminative way, excellent classification results are obtained in many vision tasks (Mairal et al, 2012, 2008; Ramirez et al, 2010). In particular, significant gains over the standard reconstructive dictionary learning approaches are obtained when the dictionary is optimized for classification. Several dictionary learning methods also consider an additional structure (e.g., low-rankness) on the dictionary, in order to incorporate task-specific prior knowledge (Zhang et al, 2013; Chen et al, 2012; Ma et al, 2012). This line of research is especially popular in face recognition applications, where a mixture-of-subspaces model is known to hold (Wright et al, 2009). To the best of our knowledge, all the discriminative dictionary learning methods optimize the dictionary with respect to the sparse coding map in Eq. (2), or a variant that still requires solving a nontrivial optimization problem. In our work, however, we introduce a discriminative dictionary learning method specific to the efficient soft-thresholding map. Interestingly, soft-thresholding can be viewed as a coarse approximation to non-negative sparse coding, as we show in Appendix A. This further motivates the use of soft-thresholding for feature extraction, as the merits of sparse coding for classification are now well established.

Closer to our work, several approaches have been introduced to approximate sparse coding with a more efficient feed-forward predictor (Kavukcuoglu et al, 2010a; Gregor and LeCun, 2010), whose parameters are learned in order to minimize the approximation error with respect to sparse codes. These works are, however, different from ours in several aspects. First, our approach does not require the result of the soft-thresholding mapping to be close to that of sparse coding.
We rather solely require a good classification accuracy on the training samples. Moreover, our dictionary learning approach is purely supervised, unlike Kavukcuoglu et al (2010a,b). Finally, these methods often use nonlinear maps (e.g., the hyperbolic tangent in Kavukcuoglu et al (2010a), multi-layer soft-thresholding in Gregor and LeCun (2010)) that are different from the one considered in this paper. The single soft-thresholding mapping considered here has the advantage of being simple, very efficient and easy to implement in practice. It is also strongly tied to sparse coding (see Appendix A).

2.2 Neural networks

The classification architecture considered in our work is also quite strongly related to artificial neural network models (Bishop, 1995). Neural network models are multi-layer architectures, where each layer consists of a set of neurons. The neurons compute a linear combination of the activation values of the preceding layer, and an activation function is then used to convert each neuron's weighted input to its activation value. Popular choices of activation functions are the logistic sigmoid and hyperbolic tangent nonlinearities. Our classification architecture can be seen as a neural network with one hidden layer, h_α as the hidden units' activation function, and zero bias (Fig. 2). Equivalently, the activation function can be set to max(0, x) with a constant bias −α across all hidden units. The dictionary D defines the connections between the input and hidden layer, while w represents the weights that connect the hidden layer to the output.

Figure 2: Neural network representation of our classification architecture. Greyed neurons have zero activation value.
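The equivalence with a one-hidden-layer rectifier network can be checked numerically: applying a ReLU to pre-activations carrying a shared bias of −α reproduces h_α exactly. A minimal sketch (all names are illustrative):

```python
import numpy as np

def soft_threshold(z, alpha):
    """The paper's one-sided soft-thresholding h_alpha."""
    return np.maximum(0.0, z - alpha)

def relu(z):
    """Rectifier activation max(0, x)."""
    return np.maximum(0.0, z)

def net_output(x, D, w, alpha):
    """One-hidden-layer network view: pre-activations D^T x with a bias of
    -alpha shared by all hidden units, ReLU activations, linear output w."""
    return w @ relu(D.T @ x - alpha)
```

Since relu(z − α) = h_α(z) coordinate-wise, `net_output` and the direct soft-thresholding scheme compute the same score.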
In an important recent contribution, Glorot et al (2011) showed that using the rectifier activation function max(0, x) results in better performance for deep networks than the more classical hyperbolic tangent function. On top of that, the rectifier nonlinearity is more biologically plausible and leads to sparse networks, a property that is highly desirable in representation learning (Bengio et al, 2013). While the architecture considered in this paper is close to that of Glorot et al (2011), it differs in several important aspects. First, our architecture assumes that hidden units have a bias equal to −α < 0, shared across all the hidden units, while it is unclear whether any constraint on the bias is set in the existing rectifier networks. The parameter α is intimately related to the sparsity of the features. This can be justified by the fact that h_α is an approximant to the non-negative sparse coding map with sparsity penalty α (see Appendix A). Without imposing any restriction on the neurons' bias (e.g., negativity) in rectifier networks, the representation might, however, not be sparse. This potentially explains the necessity of using an additional ℓ1 sparsifying regularizer on the activation values in Glorot et al (2011) to enforce the sparsity of the network, while sparsity is achieved implicitly in our scheme. Second, unlike the work of Glorot et al (2011), which employs a biological argument to introduce the rectifier function, we choose the soft-thresholding nonlinearity due to its strong relation to sparse coding. Our work therefore provides an independent motivation for considering the rectifier activation function, while the biological motivation in Glorot et al (2011) in turn gives us another motivation for considering soft-thresholding.
Third, rectified linear units are very often used in the context of deep networks (Maas et al, 2013; Zeiler et al, 2013), and seldom used with only one hidden layer. In that sense, the classification scheme considered in this paper has a simpler description, and can be seen as a particular instance of the general neural network models.

From an optimization perspective, our learning algorithm leverages the simplicity of our classification architecture and is very different from the generic techniques used to train neural networks. In particular, while neural networks are generally trained with stochastic gradient descent, we adopt an optimization based on the DC framework that directly exploits the structure of the learning problem.

3 Problem formulation

We present below the learning problem that jointly estimates the dictionary D ∈ R^{n×N} and the linear classifier w ∈ R^N in our fast classification scheme described in Section 1. We consider the binary classification task where X = [x_1 | … | x_m] ∈ R^{n×m} and y = [y_1 | … | y_m] ∈ {−1, 1}^m denote respectively the set of training points and their associated labels. We consider the following supervised learning formulation:

argmin_{D, w} Σ_{i=1}^m L(y_i w^T h_α(D^T x_i)) + (ν/2) ||w||_2^2,   (3)

where L denotes a convex loss function that penalizes incorrect classification of a training sample and ν is a regularization parameter that prevents overfitting. The soft-thresholding map h_α has been defined in Eq. (1). Typical loss functions that can be used in Eq. (3) are the hinge loss (L(x) = max(0, 1 − x)), which we adopt in this paper, or its smooth approximation, the logistic loss (L(x) = log(1 + e^{−x})). The above optimization problem attempts to find a dictionary D and a linear separator w such that w^T (D^T x_i − α)_+ has the same sign as y_i on the training set, which leads to correct classification.
At the same time, it keeps ||w||_2 small in order to prevent overfitting. Note that, to simplify the exposition, the bias term in the linear classifier is dropped. However, our study extends straightforwardly to include a nonzero bias. The problem formulation in Eq. (3) is reminiscent of the popular support vector machine (SVM) training procedure, where only a linear classifier w is learned. Instead, we embed the nonlinearity directly in the problem formulation, and learn jointly the dictionary D and the linear classifier w. This significantly broadens the applicability of the learned classifier to important nonlinear classification tasks. Note, however, that adding a nonlinear mapping raises an important optimization challenge, as the learning problem is no longer convex.

When we look closer at the optimization problem in Eq. (3), we note that, for any α > 0, the objective function is equal to:

Σ_{i=1}^m L(y_i α w^T h_1(D^T x_i / α)) + (ν/2) ||w||_2^2 = Σ_{i=1}^m L(y_i w̃^T h_1(D̃^T x_i)) + (ν′/2) ||w̃||_2^2,

where w̃ = α w, D̃ = D/α and ν′ = ν/α^2. Therefore, without loss of generality, we set the sparsity parameter α to 1 in the rest of this paper. This is in contrast with traditional dictionary learning approaches based on ℓ0 or ℓ1 minimization problems, where a sparsity parameter needs to be set manually beforehand. Fixing α = 1 and leaving the norms of the dictionary atoms unconstrained essentially permits adapting the sparsity to the problem at hand. This represents an important advantage, as setting the sparsity parameter is in general a difficult task. A sample x is then assigned to class '+1' if w^T h_1(D^T x) > 0, and class '−1' otherwise. Finally, we note that, even if our focus primarily goes to the binary classification problem, the extension to multi-class can easily be done through a one-vs-all strategy, for instance.

4 Learning algorithm

The problem in Eq.
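This scale invariance is easy to verify numerically: rescaling the pair (D, w) to (D/α, αw) while setting the threshold to 1 leaves the classification score unchanged, since α·max(0, t) = max(0, αt). A small sketch with arbitrary random data:

```python
import numpy as np

def h(z, alpha):
    """One-sided soft-thresholding of Eq. (1)."""
    return np.maximum(0.0, z - alpha)

def score(x, D, w, alpha):
    """Classification score w^T h_alpha(D^T x)."""
    return w @ h(D.T @ x, alpha)

rng = np.random.default_rng(0)
n, N = 5, 8
x = rng.standard_normal(n)
D = rng.standard_normal((n, N))
w = rng.standard_normal(N)
alpha = 0.3

# The rescaled pair (D/alpha, alpha*w) with threshold fixed to 1 gives
# exactly the same score as (D, w) with threshold alpha.
s1 = score(x, D, w, alpha)
s2 = score(x, D / alpha, alpha * w, 1.0)
```

This is why α carries no free modeling capacity once atom norms are unconstrained.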
(3) is non-convex and difficult to solve in general. In this section, we propose to relax the original optimization problem and cast it as a difference-of-convex (DC) program. Leveraging this property, we introduce LAST, an efficient algorithm for learning the dictionary and the classifier parameters in our classification scheme based on soft-thresholding.

4.1 Relaxed formulation

We now rewrite the learning problem in an appropriate form for optimization. We start with a simple but crucial change of variables. Specifically, we define u_j ← |w_j| d_j, v_j ← |w_j| and s_j ← sgn(w_j). Using this change of variables, we have, for any 1 ≤ i ≤ m,

y_i w^T h_1(D^T x_i) = y_i Σ_{j=1}^N sgn(w_j)(|w_j| d_j^T x_i − |w_j|)_+ = y_i Σ_{j=1}^N s_j (u_j^T x_i − v_j)_+.

Therefore, the problem in Eq. (3), with α = 1, can be rewritten in the following way:

argmin_{U, v, s} Σ_{i=1}^m L(y_i Σ_{j=1}^N s_j (u_j^T x_i − v_j)_+) + (ν/2) ||v||_2^2,   (4)
subject to v > 0.

The equivalence between the two problem formulations in Eqs. (3) and (4) only holds when the components of the linear classifier w are restricted to be all nonzero. This is, however, not a limiting assumption, as zero components in the normal vector of the optimal hyperplane of Eq. (3) can be removed, which is equivalent to using a dictionary of smaller size. The variable s, that is, the sign of the components of w, essentially encodes the "classes" of the different atoms. In other words, an atom d_j for which s_j = +1 (i.e., w_j is positive) is most likely to be active for samples of class '1'. Conversely, atoms with s_j = −1 are most likely active for class '−1' samples. We assume here that the vector s is known a priori. In other words, this means that we have prior knowledge on the proportion of class 1 and class −1 atoms in the desired dictionary.
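The identity underlying this change of variables can likewise be checked numerically; it relies on the positive homogeneity of (·)_+, which lets the factor |w_j| move inside the threshold. A sketch with random data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, N = 4, 6
x = rng.standard_normal(n)
D = rng.standard_normal((n, N))
w = rng.standard_normal(N)

# Original parameterization: w^T h_1(D^T x)
lhs = w @ np.maximum(0.0, D.T @ x - 1.0)

# Change of variables: u_j = |w_j| d_j, v_j = |w_j|, s_j = sgn(w_j)
U = D * np.abs(w)          # scales column j of D by |w_j|
v = np.abs(w)
s = np.sign(w)
rhs = s @ np.maximum(0.0, U.T @ x - v)
```

Both parameterizations yield the same score, so optimizing over (U, v) with s fixed is equivalent to optimizing over (D, w) with fixed signs.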
For example, setting half of the entries of the vector s to be equal to +1 and the other half to −1 encodes the prior knowledge that we are searching for a dictionary with a balanced number of class-specific atoms. Note that s can be estimated from the distribution of the different classes in the training set, assuming that the proportion of class-specific atoms in the dictionary should approximately follow that of the training samples.

Figure 3: (a): sgn(x)(|x| − α)_+, (b): h_α (solid), and its smooth approximation q(x − α) (dashed), with β = 10. We used α = 1.

After the above change of variables, we now approximate the term (u_j^T x_i − v_j)_+ in Eq. (4) with a smooth function q(u_j^T x_i − v_j), where q(x) = (1/β) log(1 + exp(βx)), and β is a parameter that controls the accuracy of the approximation (Fig. 3 (b)). Specifically, as β increases, the quality of the approximation becomes better. The function q with β = 1 is often referred to as "softplus" and plays an important role in the training objective of many classification schemes, such as classification restricted Boltzmann machines (Larochelle et al, 2012). Note that this approximation is used only to make the optimization easier at the learning stage; at test time, the original soft-thresholding is applied for feature extraction. Finally, we replace the strict inequality v > 0 in Eq. (4) with v ≥ ε, where ε is a small positive constant. The latter constraint is easier to handle in the optimization, yet both constraints are essentially equivalent in practice. We end up with the following optimization problem:

(P):  argmin_{U, v} Σ_{i=1}^m L(y_i Σ_{j=1}^N s_j q(u_j^T x_i − v_j)) + (ν/2) ||v||_2^2,
subject to v ≥ ε,

which is a relaxed version of the learning problem in Eq. (4).
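The quality of the softplus surrogate q can be illustrated numerically: its gap to (·)_+ is largest at the origin, where it equals log(2)/β, so it shrinks as β grows. A short sketch:

```python
import numpy as np

def softplus(x, beta):
    """Smooth surrogate q(x) = (1/beta) * log(1 + exp(beta * x));
    np.logaddexp(0, t) computes log(1 + exp(t)) without overflow."""
    return np.logaddexp(0.0, beta * x) / beta

# Worst-case gap between q and the one-sided threshold on a grid
x = np.linspace(-2.0, 2.0, 401)
gap_10 = np.max(np.abs(softplus(x, 10.0) - np.maximum(0.0, x)))
gap_100 = np.max(np.abs(softplus(x, 100.0) - np.maximum(0.0, x)))
```

With β = 100, as used later in the experiments, the pointwise error is below 0.007, which explains why the relaxation barely changes the objective.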
Once the optimal variables (U, v) are determined, D and w can be obtained by inverting the above change of variables.

4.2 DC decomposition

The problem (P) is still a non-convex optimization problem that can be hard to solve using traditional methods, such as gradient descent or Newton-type methods. However, we show in this section that problem (P) can be written as a difference-of-convex (DC) program (Horst, 2000), which leads to efficient solutions. We first define DC functions. A real-valued function f defined on a convex set U ⊆ R^n is called DC on U if, for all x ∈ U, f can be expressed in the form

f(x) = g(x) − h(x),

where g and h are convex functions on U. A representation of the above form is said to be a DC decomposition of f. Note that DC decompositions are clearly not unique, as f(x) = (g(x) + c(x)) − (h(x) + c(x)) provides another decomposition of f for any convex function c. Optimization problems of the form min_x {f(x) : f_i(x) ≤ 0, i = 1, …, p}, where f and f_i for 1 ≤ i ≤ p are all DC functions, are called DC programs. The following proposition states that the problem (P) is DC:

Proposition 1. For any convex loss function L and any convex function q, the problem (P) is DC.

While Proposition 1 states that the problem (P) is DC, it does not give an explicit decomposition of the objective function, which is crucial for optimization. The following proposition exhibits a decomposition when L is the hinge loss.

Proposition 2. When L(x) = max(0, 1 − x), the objective function of problem (P) is equal to g − h, where

g = (ν/2) ||v||_2^2 + Σ_{i=1}^m max( Σ_{j: s_j = y_i} q(u_j^T x_i − v_j), 1 + Σ_{j: s_j ≠ y_i} q(u_j^T x_i − v_j) ),
h = Σ_{i=1}^m Σ_{j: s_j = y_i} q(u_j^T x_i − v_j).

The proofs of Propositions 1 and 2 are given in Appendix B. Due to Proposition 2, the problem (P) can be solved efficiently using a DC solver.
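The decomposition of Proposition 2 can be sanity-checked on a single sample: writing A and B for the sums of softplus terms over atoms whose sign agrees, respectively disagrees, with the label, the hinge loss equals max(A, 1 + B) − A. A sketch of the per-sample terms (omitting the (ν/2)||v||² regularizer, which sits in g):

```python
import numpy as np

def softplus(x, beta=100.0):
    """q(x) = (1/beta) log(1 + exp(beta x)), computed stably."""
    return np.logaddexp(0.0, beta * x) / beta

rng = np.random.default_rng(2)
n, N = 4, 6
x, y = rng.standard_normal(n), 1
U = rng.standard_normal((n, N))
v = rng.uniform(0.5, 1.5, N)
s = np.where(rng.random(N) < 0.5, 1, -1)

q = softplus(U.T @ x - v)
same = q[s == y].sum()            # atoms whose sign matches the label
diff = q[s != y].sum()            # atoms whose sign disagrees

hinge = max(0.0, 1.0 - y * (s @ q))   # per-sample loss term of (P)
g_i = max(same, 1.0 + diff)           # per-sample term of g (Prop. 2)
h_i = same                            # per-sample term of h (Prop. 2)
```

Both `g_i` and `h_i` are convex in (U, v) since q is convex, a sum of convex functions is convex, and max preserves convexity; their difference recovers the hinge term exactly.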
4.3 Optimization

DC problems are well-studied optimization problems, and efficient algorithms with good performance in practice have been proposed (Horst, 2000; Tao and An, 1998; see An and Tao (2005) and references therein, and Sriperumbudur et al (2007)). While there exists a number of popular approaches that solve DC programs globally (e.g., cutting plane and branch-and-bound algorithms (Horst, 2000)), these techniques are often inefficient and limited to very small scale problems. A robust and efficient difference-of-convex algorithm (DCA), suited for solving general large-scale DC programs, is proposed in Tao and An (1998). DCA is an iterative algorithm that consists in solving, at each iteration, the convex optimization problem obtained by linearizing h (i.e., the non-convex part of f = g − h) around the current solution. The local convergence of DCA is proven in Theorem 3.7 of Tao and An (1998), and we refer to this paper for further theoretical guarantees on the stability and robustness of the algorithm. Although DCA is only guaranteed to reach a local minimum, the authors of Tao and An (1998) state that DCA often converges to a global optimum. When this is not the case, multiple restarts might be used to improve the solution. We note that DCA is very close to the concave-convex procedure (CCCP) introduced in (Yuille et al, 2002).

At iteration k of DCA, the linearized optimization problem is given by:

argmin_{(U, v)} { g(U, v) − Tr(U^T A) − v^T b }  subject to v ≥ ε,   (5)

where (A, b) = ∇h(U^k, v^k), (U^k, v^k) are the solution estimates at iteration k, and the functions g and h are defined in Proposition 2. Note that, due to the convexity of g, the problem in Eq. (5) is convex and can be solved using any convex optimization algorithm (Boyd and Vandenberghe, 2004).
The method we propose to use here is a projected first-order stochastic subgradient descent algorithm. Stochastic gradient descent is an efficient optimization algorithm that can handle large training sets (Akata et al, 2014). To make the exposition clearer, we first define the function:

p(U, v; x_i, y_i) = max( Σ_{j: s_j = y_i} q(u_j^T x_i − v_j), 1 + Σ_{j: s_j ≠ y_i} q(u_j^T x_i − v_j) ) + (1/m) ( (ν/2) ||v||_2^2 − Tr(U^T A) − v^T b ).

The objective function of Eq. (5) that we wish to minimize can then be written as Σ_{i=1}^m p(U, v; x_i, y_i). We solve this optimization problem with the projected stochastic subgradient descent algorithm in Algorithm 1.

Algorithm 1 Optimization algorithm to solve the linearized problem in Eq. (5)
1. Initialization: U ← U^k and v ← v^k.
2. For t = 1, …, T:
  2.1 Let (x, y) be a randomly chosen training point and its associated label.
  2.2 Choose the stepsize ρ_t ← min(ρ, ρ t_0 / t).
  2.3 Update U and v by a projected subgradient step:
      U ← U − ρ_t ∂_U p(U, v; x, y),
      v ← Π_{v ≥ ε}(v − ρ_t ∂_v p(U, v; x, y)),
      where Π_{v ≥ ε} is the projection operator onto the set {v ≥ ε}.
3. Return U^{k+1} ← U and v^{k+1} ← v.

In more detail, at each iteration of Algorithm 1, a training sample (x, y) is drawn. U and v are then updated by performing a step in the direction of a subgradient ∂p(U, v; x, y). Many different stepsize rules can be used with stochastic gradient descent methods. In this paper, similarly to the strategy employed in Mairal et al (2012), we have chosen a stepsize that remains constant for the first t_0 iterations, and then takes the value ρ t_0 / t.[1] Moreover, to accelerate the convergence of the stochastic gradient descent algorithm, we consider a small variation of Algorithm 1, where a minibatch containing several training samples along with their labels is drawn at each iteration, instead of a single sample.
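A compact NumPy rendering of Algorithm 1 might look as follows. This is an illustrative sketch, not the authors' reference implementation: the subgradient of the max term only involves the active branch, the softplus derivative is a sigmoid, and all hyperparameter defaults are placeholders:

```python
import numpy as np

def sigmoid(t):
    """Derivative of softplus: q'(z) = sigmoid(beta * z)."""
    return 1.0 / (1.0 + np.exp(-np.clip(t, -60, 60)))

def algorithm1(X, y, s, U0, v0, A, b, beta=100.0, nu=1.0, eps=1e-6,
               rho=0.01, T=500, seed=0):
    """Projected stochastic subgradient descent on the linearized problem
    of Eq. (5). (A, b) = grad h at the current DCA iterate; the stepsize
    follows rho_t = min(rho, rho * t0 / t) with t0 = T/10, and v is
    projected back onto {v >= eps} after each step."""
    rng = np.random.default_rng(seed)
    U, v = U0.copy(), v0.copy()
    m = X.shape[1]
    t0 = T / 10.0
    for t in range(1, T + 1):
        i = rng.integers(m)                      # draw one training sample
        x, yi = X[:, i], y[i]
        z = U.T @ x - v
        q = np.logaddexp(0.0, beta * z) / beta   # smoothed (.)_+
        same, diff = s == yi, s != yi
        # subgradient of max(., .): only the active branch contributes
        active = same if q[same].sum() >= 1.0 + q[diff].sum() else diff
        dq = sigmoid(beta * z) * active          # q'(z_j) on active atoms
        gU = np.outer(x, dq) - A / m             # d p / d U
        gv = -dq + (nu * v - b) / m              # d p / d v
        rho_t = min(rho, rho * t0 / t)
        U = U - rho_t * gU
        v = np.maximum(v - rho_t * gv, eps)      # projection onto v >= eps
    return U, v
```

The projection step is trivial here because the feasible set {v ≥ ε} is a box, which is one reason this constraint is preferred over the strict inequality v > 0.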
This is a classical heuristic in stochastic gradient descent algorithms. Note that, when the size of the minibatch is equal to the number of training samples, this algorithm reduces to traditional batch gradient descent.

Finally, our complete LAST learning algorithm based on DCA is formally given in Algorithm 2. Starting from a feasible point U^0 and v^0, LAST iteratively solves the constrained convex problem given in Eq. (5) with the procedure of Algorithm 1. Recall that this problem corresponds to the original DC program (P), except that the function h has been replaced by its linear approximation around the current solution (U^k, v^k) at iteration k. Many criteria can be used to terminate the algorithm. We choose here to terminate when a maximum number of iterations K has been reached, and to terminate the algorithm earlier when the following condition is satisfied:

min( |(ω^{k+1} − ω^k)_{i,j}|, |(ω^{k+1} − ω^k)_{i,j} / (ω^k)_{i,j}| ) ≤ δ,

where the matrix Ω^k = ((ω^k)_{i,j}) is the row concatenation of U and v^T, and δ is a small positive number. This condition detects the convergence of the learning algorithm, and is verified whenever the change in U and v is very small. This termination criterion is used, for example, in Sriperumbudur et al (2007).

5 Experimental results

In this section, we evaluate the performance of our classification algorithm on texture, digit and natural image datasets, and compare it to different competitor schemes. We expose in Section 5.1 the choice of the parameters of the model and the algorithm. We then focus on the experimental assessment of our scheme.
Following the methodology of Coates and Ng (2011), we break the feature extraction algorithms into (i) a learning algorithm (e.g., K-means) where a set of basis functions (or dictionary) is learned, and (ii) an encoding function (e.g., ℓ1 sparse coding) that maps an input point to its feature vector. In the first step of our analysis (Section 5.2), we therefore fix the encoder to be the soft-thresholding mapping and compare LAST to existing supervised and unsupervised learning techniques. Then, in the following subsections, we compare our complete classification architecture (i.e., learning and encoding function) to several classifiers, in terms of accuracy and efficiency. In particular, we show that our proposed approach is able to compete with recent classifiers, despite its simplicity.

[1] The precise choice of the parameters ρ and t_0 is discussed in Section 5.1.

Algorithm 2 LAST (Learning Algorithm for Soft-Thresholding classifier)
1. Choose any initial point: U^0 and v^0 ≥ ε.
2. For k = 0, …, K − 1:
  2.1 Compute (A, b) = ∇h(U^k, v^k).
  2.2 Solve with Algorithm 1 the convex optimization problem:
      (U^{k+1}, v^{k+1}) ← argmin_{(U, v)} { g(U, v) − Tr(U^T A) − v^T b }  subject to v ≥ ε.
  2.3 If (U^{k+1}, v^{k+1}) ≈ (U^k, v^k), return (U^{k+1}, v^{k+1}).

5.1 Parameter selection

We first discuss the choice of the model parameters for our method. Unless stated otherwise, we choose the vector s according to the distribution of the different classes in the training set. We set the value of the regularization parameter to ν = 1, as it was found empirically to be a good choice in our experiments. It is worth mentioning that setting ν by cross-validation might give better results, but it would also be computationally more expensive. We moreover set the parameter of the soft-thresholding mapping approximation to β = 100.
Recall finally that the sparsity parameter α is always equal to 1 in our method, and therefore does not require any manual setting or cross-validation procedure. In all experiments, we have moreover chosen to initialize LAST by setting U^0 equal to a random subsample of the training set, and v^0 to the vector whose entries are all equal to 1. We however noticed empirically that choosing a different initialization strategy does not significantly change the testing accuracy. Then, we fix the maximum number of iterations of LAST to K = 50. Moreover, setting properly the parameters t_0 and ρ in Algorithm 1 is quite crucial in controlling the convergence of the algorithm. In all the experiments, we have set the parameter t_0 = T/10, where T denotes the number of iterations. Furthermore, during the first T/20 iterations, several values of ρ are tested ({0.1, 0.01, 0.001}), and the value that leads to the smallest objective function is chosen for the rest of the iterations. Finally, the minibatch size in Algorithm 1 depends on the size of the training data. In particular, when the size of the training data m is relatively small (i.e., smaller than 5000), we used batch gradient descent, as the computation of the (complete) gradient is tractable. In this case, we set the number of iterations to T = 1000. Otherwise, we use a batch size of 200, and perform T = 5000 iterations of the stochastic gradient descent in Algorithm 1.

5.2 Analysis of the learning algorithm

In a first set of experiments, we focus on the comparison of our learning algorithm (LAST) to other learning techniques, and fix the encoder to be the soft-thresholding mapping for all the methods. We present a comparative study on textures and natural images classification tasks.

5.2.1 Experimental settings

We consider the following dictionary learning algorithms:

1.
Supervised random samples: The atoms of D are chosen randomly from the training set, in a supervised manner. That is, if κ denotes the desired proportion of class '1' atoms in the dictionary, the dictionary is built by randomly picking κN training samples from class '1' and (1 − κ)N samples from class '−1', where N is the number of atoms in the dictionary.

2. Supervised K-means: We build the dictionary by merging the subdictionaries obtained by applying the K-means algorithm successively to the training samples of class '1' and '−1', where the number of clusters is fixed respectively to κN and (1 − κ)N.

3. Dictionary learning for ℓ1 sparse coding: The dictionary D is built by solving the classical dictionary learning problem for ℓ1 sparse coding:

min_{D, c_i} Σ_{i=1}^m ||x_i − D c_i||_2^2 + λ ||c_i||_1  subject to ∀j, ||d_j||_2^2 ≤ 1.  (6)

To solve this optimization problem, we used the algorithm proposed by Mairal et al (2010) and implemented in the SPAMS package. The parameter λ is chosen by a cross-validation procedure in the set {0.1, 0.01, 0.001}. Note that, while the previous two learning algorithms make use of the labels, this algorithm is unsupervised.

4. Stochastic Gradient Descent (SGD): The dictionary D and classifier w are obtained by optimizing the following objective function using mini-batch stochastic gradient descent:

J(D, w) = Σ_{i=1}^m L(y_i w^T q(D^T x_i − α)) + (ν/2) ||w||_2^2,  with q(x) = (1/β) log(1 + exp(βx)).

This corresponds to the original objective function in Eq. (3), where h_α is replaced with its smooth approximant.² This smoothing procedure is similar to the one used in our relaxed formulation (Section 4.1). As in LAST, we set β = 100, α = 1, and use the same initialization strategy. This setting allows us to directly compare LAST and this generic stochastic gradient descent procedure widely used for training neural networks.
Following Glorot et al (2011), we use a mini-batch size of 10, and a constant step size chosen in {0.1, 0.01, 0.001, 0.0001}. The stepsize is chosen through a cross-validation procedure, with a randomly chosen validation set made up of 10% of the training data. The number of iterations of SGD is set to 250000.

For the first three algorithms, the parameter α in the soft-thresholding mapping is chosen with cross-validation in {0.1, 0.2, ..., 0.9, 1}. The features are then computed by applying the soft-thresholding map h_α, and a linear SVM classifier is trained in the feature space. For the random samples and K-means approaches, we set κ = 0.5, as we consider classification tasks with roughly equal numbers of training samples from each class. Finally, for SGD and LAST, the dictionary D and linear classifier w are learned simultaneously. The encoder h_1 is used to compute the features.

5.2.2 Experimental results

In our first experiment, we consider two binary texture classification tasks, where the textures are collected from the 32 Brodatz dataset (Valkealahti and Oja, 1998) and shown in Fig. 4. For each pair of textures under test, we build the training set by randomly selecting 500 patches of size 12 × 12 per texture, and the test data is constructed similarly by taking 500 patches per texture. The test data does not contain any of the training patches. All the patches are moreover normalized to have unit ℓ2 norm. Fig. 5 shows the binary classification accuracy of the soft-thresholding based classifier as a function of the dictionary size, for dictionaries learned with the different algorithms.

Figure 4: Two binary classification tasks (bark vs. woodgrain and pigskin vs. pressedcl).

For the first task (bark vs.
woodgrain), one can see that the LAST and SGD dictionary learning methods outperform the other methods for small dictionary sizes. For large dictionaries (i.e., N ≈ 400), however, all the learning algorithms yield approximately the same classification accuracy. This result is in agreement with the conclusions of Coates and Ng (2011), where the authors show empirically that the choice of the learning algorithm becomes less crucial when dictionaries are very large. In the second and more difficult classification task (pigskin vs. pressedcl), our algorithm yields the best classification accuracy for all tested dictionary sizes (10 ≤ N ≤ 400). Interestingly, unlike in the previous task, the design of the dictionary is crucial for all tested dictionary sizes. Using much larger dictionaries might result in performance that is close to the one obtained with our algorithm, but comes at the price of additional computational and memory costs.

²We also tested SGD on the original (non-smooth) optimization problem. This resulted in slightly worse performance. We therefore only report results obtained on the smoothed objective function.

Figure 5: Texture classification results (fixed soft-thresholding encoder): classification accuracy vs. dictionary size for supervised random samples, supervised K-means, DL for ℓ1 sparse coding, SGD and LAST. (a) Bark vs. Woodgrain; (b) Pigskin vs. Pressedcl.

Fig. 6 further illustrates the evolution of the objective function with respect to the elapsed training time for LAST and SGD, for a dictionary of size 50. One can see that LAST quickly converges to a solution with a small objective function.
On the other hand, SGD reaches a solution with a larger objective function than LAST.

Figure 6: J(D, w) as a function of the elapsed time [s] for stochastic gradient descent and LAST. For SGD, J(D_{t=100}, w_{t=100}) = 19; for LAST, J(D_{t=100}, w_{t=100}) = 1.4.

We now conduct experiments on the popular CIFAR-10 image database (Krizhevsky and Hinton, 2009). The dataset contains 10 classes of 32 × 32 RGB images. For simplicity and better comparison of the different learning algorithms, we restrict in a first stage the dataset to the two classes "deer" and "horse". We extend our results to the multi-class scenario later in Section 5.5. Fig. 7 illustrates some training examples from the two classes. The classification results are reported in Fig. 8.

Figure 7: Examples of CIFAR-10 images in the categories "deer" and "horse".

Once again, the soft-thresholding based classifier with a dictionary and linear classifier learned with LAST outperforms all other learning techniques. In particular, using the LAST dictionary learning strategy results in significantly higher performance than stochastic gradient descent for all dictionary sizes. We further note that with a very small dictionary (i.e., N = 2), LAST reaches an accuracy of 77%, whereas some learning algorithms (e.g., K-means) do not reach this accuracy even with a dictionary that contains as many as 400 atoms.

Figure 8: Performance on the "deer" vs. "horse" binary classification task (fixed soft-thresholding encoder): classification accuracy vs. dictionary size for the five learning methods.

To further illustrate this point, we show in Fig. 9 the 2-D testing features obtained with a dictionary of two atoms, when D is learned respectively with the K-means method and LAST.
Despite the very low dimensionality of the feature vectors, the two classes can be separated with reasonable accuracy using our algorithm (Fig. 9 (b)), whereas the features obtained with the K-means algorithm clearly cannot be discriminated (Fig. 9 (a)). We finally illustrate in Fig. 10 the dictionaries learned using K-means and LAST for N = 30 atoms. It can be observed that, while the K-means dictionary consists of smoothed images that minimize the reconstruction error, our algorithm learns a discriminative dictionary whose goal is to underline the differences between the images of the two classes. In summary, our supervised learning algorithm, specifically tailored for the soft-thresholding encoder, provides significant improvements over traditional dictionary learning schemes. Our classifier can reach high accuracy rates, even with very small dictionaries, which is not possible with other learning schemes.

Figure 9: Learned 2D features (max(0, d_1^T x − 1), max(0, d_2^T x − 1)) and linear classifiers with K-means and LAST for the "deer" vs. "horse" classification task (N = 2). (a) K-means; (b) LAST.

Figure 10: Normalized dictionary atoms learned with K-means and LAST, for the "deer" vs. "horse" binary classification task (N = 30). (a) K-means; (b) LAST.

5.3 Classification performance on binary datasets

In this section, we compare the proposed LAST classification method³ to other classifiers. Before going through the experimental results, we first present the different methods under comparison:

1. Linear SVM: We use the efficient Liblinear (Fan et al, 2008) implementation for training the linear classifier. The regularization parameter is chosen using a cross-validation procedure.
³By extension, we define the LAST classifier to be the soft-thresholding based classifier, where the parameters (D, w) are learned with LAST.

2. RBF kernel SVM: We use LibSVM (Chang and Lin, 2011) for training. Similarly, the regularization and width parameters are set with cross-validation.

3. Sparse coding: Similarly to the previous section, we train the dictionary by solving Eq. (6). We use however the encoder that "matches naturally" with this training algorithm, that is:

argmin_c ||x − Dc||_2^2 + λ ||c||_1,

where x is the test sample, D the previously learned dictionary and c the resulting feature vector. A linear SVM is then trained on the resulting feature vectors. This classification architecture, denoted "sparse coding" below, is similar to that of Raina et al (2007).

4. Nearest neighbor classifier (NN): Our last comparative scheme is a nearest neighbor classifier, where the dictionary is learned using the supervised K-means procedure described in Section 5.2.1. At test time, the sample is assigned the label of the dictionary atom (i.e., cluster) that is closest to it.

Note that we have dropped the supervised random samples learning algorithm used in the previous section, as it was shown to have worse classification accuracy than the K-means approach.

                          Task 1 [%]   Task 2 [%]
Linear SVM                49.5         49.1
RBF kernel SVM            98.5         90.1
Sparse coding (N = 50)    97.5         85.5
Sparse coding (N = 400)   98.1         90.9
NN (N = 50)               94.3         84.1
NN (N = 400)              97.8         86.6
LAST (N = 50)             98.7         87.3
LAST (N = 400)            98.6         93.5

Table 1: Classification accuracy for the binary texture classification tasks.

Table 1 first shows the accuracies of the different classifiers in the two binary texture classification tasks described in Section 5.2.2. In both experiments, the linear SVM classifier results in a very poor performance, which is close to that of a random classifier. This suggests that the considered task is nonlinear, and has to be tackled with a nonlinear classifier.
One can see that the RBF kernel SVM results in a significant increase in the classification accuracy. Similarly, the ℓ1 sparse coding nonlinear mapping also results in much better performance compared to the linear classifier, while the nearest neighbor approach performs a bit worse than sparse coding. We note that, for a fixed dictionary size, our classifier outperforms the NN and sparse coding classifiers in both tasks. Moreover, it provides comparable or superior performance to the RBF kernel SVM in both tasks.

We now turn to the binary experiment "deer" vs. "horse" described in the previous subsection. We show the classification accuracies of the different classifiers in Table 2. LAST outperforms the sparse coding and nearest neighbor classifiers for the tested dictionary sizes. The RBF kernel SVM however slightly outperforms LAST with N = 100 in this experiment. Note however that the RBF kernel SVM approach is much slower at test time, which makes it impractical for large-scale problems.

                          "deer" vs. "horse" [%]
Linear SVM                72.6
RBF kernel SVM            83.5
Sparse coding (N = 10)    70.6
Sparse coding (N = 100)   76.2
NN (N = 10)               67.7
NN (N = 100)              70.9
LAST (N = 10)             80.1
LAST (N = 100)            82.8

Table 2: Classification accuracy on the binary classification problem "deer" vs. "horse".

Overall, the proposed LAST classifier compares favorably to the different tested classifiers. In particular, LAST outperforms the sparse coding technique for a fixed dictionary size in our experiments. This result is notable, as sparse coding classifiers are known to provide very good classification performance in vision tasks. Note that, when used with another standard learning approach such as K-means, the soft-thresholding based classifier is outperformed by sparse coding, which shows the importance of the learning scheme in the success of this classifier.
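The ℓ1 sparse-coding encoder used by the comparison scheme above can be sketched with a plain ISTA loop, a basic proximal-gradient solver. This is an illustrative sketch only: the paper's experiments rely on the SPAMS package, which uses faster solvers, and the function name is ours.

```python
import numpy as np

def ista_encoder(D, x, lam, n_iter=200):
    """l1 sparse-coding encoder: argmin_c ||x - D c||_2^2 + lam * ||c||_1,
    solved with plain ISTA (gradient step followed by soft-thresholding)."""
    L = 2.0 * np.linalg.norm(D, 2) ** 2   # Lipschitz constant of the gradient
    c = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ c - x)
        z = c - grad / L
        # Soft-thresholding proximal step for the l1 penalty.
        c = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return c
```

For an orthonormal D this recovers the closed-form solution c_i = sign((D^T x)_i) max(|(D^T x)_i| − λ/2, 0), which makes the contrast with the single soft-thresholding map of our scheme explicit.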
                                        MNIST   USPS
Linear SVM                              8.19    9.07
RBF kernel SVM                          1.4     4.2
K-NN ℓ2                                 5.0     5.2
LAST                                    1.32    4.53
Sparse coding                           3.0     5.33
Huang and Aviyente (2006)               -       6.05
SDL-G L (Mairal et al, 2008)            3.56    6.67
SDL-D L (Mairal et al, 2008)            1.05    3.54
Ramirez et al (2010)                    1.26    3.98
SGD                                     2.22    5.88
3-layer ReLU net (Glorot et al, 2011)   1.43    -

Table 3: Classification error (percentage) on the MNIST and USPS datasets.

5.4 Handwritten digits classification

We now consider a classification task on the MNIST (LeCun et al, 1998) and USPS (Hull, 1994) handwritten digits datasets. USPS contains 9298 images of size 16 × 16 pixels, with 7291 images used for training and 2007 for testing. The larger MNIST database is composed of 60000 training images and 10000 test images, all of size 28 × 28 pixels. We preprocess all the images to have zero mean and unit Euclidean norm. We address the multi-class classification task using a one-vs-all strategy, as is often done in classification problems. Specifically, we learn a separate dictionary and a binary linear classifier by solving the optimization problem for each one-vs-all problem. Classification is then done by predicting with each binary classifier, and choosing the prediction with the highest score. In LAST, for each one-vs-all task, we naturally set 1/10 of the entries of s to 1 and the other entries to −1, assuming the distribution of features of the different classes in the dictionary should roughly be that of the images in the training set. In our proposed approach and SGD, we use dictionaries of size N = 200 for USPS and N = 400 for MNIST, as the latter dataset contains many more training samples. We compare LAST to the baseline classification techniques described in the previous section, as well as to sparse coding based methods.
In addition to building the dictionary in an unsupervised way, we consider the sparse coding classifiers of Mairal et al (2008), Huang and Aviyente (2006), and Ramirez et al (2010), which construct the dictionary in a supervised fashion. Classification results are shown in Table 3. One can see that LAST largely outperforms the linear and nearest neighbor classifiers. Moreover, our method has a slightly better accuracy than the RBF-SVM on MNIST, while being slightly worse on the USPS dataset. Our approach also outperforms the soft-thresholding based classifier optimized with stochastic gradient descent on both tasks, which highlights the benefits of our optimization technique compared to the standard algorithm used for training neural networks. We also report from Glorot et al (2011) the performance of a three-hidden-layer rectifier network optimized with stochastic gradient descent, without unsupervised pre-training. It can be seen that LAST, while having a much simpler architecture, slightly outperforms the deep rectifier network on the MNIST task. Furthermore, LAST outperforms the unsupervised sparse coding classifier on both datasets. Interestingly, the proposed scheme also competes with, and sometimes outperforms, the discriminative sparse coding techniques of Huang and Aviyente (2006), Mairal et al (2008), and Ramirez et al (2010), where the dictionary is tuned for classification. While providing comparable results, the LAST classifier is much faster at test time than sparse coding techniques and RBF-SVM classifiers. It is noteworthy that the best discriminative dictionary learning results we are aware of on these datasets are achieved by Mairal et al (2012), with an error rate of 0.54% on MNIST and 2.84% on USPS. Note however that in that work, the authors explicitly incorporate translation invariance in the problem by augmenting the training set with shifted versions of the digits.
Our focus here is instead on methods that do not augment the training set with distorted or transformed samples.

5.5 CIFAR-10 classification

We now consider the multi-class classification problem on the CIFAR-10 dataset (Krizhevsky and Hinton, 2009). The dataset contains 60000 color images of size 32 × 32 pixels, with 50000 images for training and 10000 for testing. The classifier input consists of vectors of raw pixel values of dimension 32 × 32 × 3 = 3072. This setting, similar to that of Glorot et al (2011), takes no advantage of the fact that we are dealing with images and is sometimes referred to as "permutation invariant", as columns in the data could be shuffled without affecting the result. We consider this scenario to focus on the comparison of the performance of the classifiers. Due to the relatively high dimensions of the problem (n = 3072, m = 50000), we limit ourselves to classifiers with feedforward architectures. In fact, using an RBF-SVM for this task would be prohibitively slow at the training and testing stages. For each one-vs-all task, we set the dictionary size of the LAST and SGD methods to 400.

                                        CIFAR-10
Linear SVM                              59.70
LAST (N = 400)                          46.56
SGD (N = 400)                           52.96
3-layer ReLU net                        50.86
3-layer ReLU net + sup. pre-training    49.96

Table 4: Classification error (percentage) on the CIFAR-10 dataset. ReLU net results are reported from (Glorot et al, 2011).

Moreover, unlike in the previous experiment, we set in LAST half of the entries of the sign vector s to 1 and the other half to −1. This is due to the high variability of intra-class images and the relatively small dictionary size: the number of atoms required to encode the positive class might not be sufficient if s is set according to the distribution of images in the training set. The results are reported in Table 4. Once again, this experiment confirms the superiority of our learning algorithm over linear SVM.
Moreover, LAST significantly outperforms the generic SGD training algorithm (by more than 6%) in this challenging classification example. What is more surprising is that LAST significantly surpasses the rectifier neural network with 3 hidden layers (Glorot et al, 2011) trained using a generic stochastic gradient descent algorithm (with or without pre-training). This shows that, despite the simplicity of our architecture (it can be seen as one hidden layer), adequate training of the classification scheme can give better performance than complicated structures that are potentially difficult to train. We finally report the results of the sparse coding classifier with a dictionary trained using Eq. (6). If we use a dictionary with 400 atoms, we get an error of 53.9%. By using a much larger dictionary of 4000 atoms, the error reduces to 46.5%. The computation of the test features is however computationally very expensive in that case.

6 Discussion

We first discuss in this section aspects related to the computational complexity of LAST. Then, we analyze the sparsity of the obtained solutions. We finally explain some of the differences between LAST and the generic stochastic gradient descent algorithm.

6.1 Computational complexity at test time

We compare the computational complexity and running times of the LAST classifier to those of different classification algorithms. Table 5 shows the computational complexity for classifying one test sample using various classifiers, and the time needed to classify the MNIST test images. We recall that n, m, and N denote respectively the signal dimension, the number of training samples and the dictionary size. Clearly, linear classification is very efficient, as it only requires the computation of one inner product between two vectors of dimension n.
Nonlinear SVMs however have a test complexity that is linear in the number of support vectors, which scales linearly with the training size (Burges, 1998). This solution is therefore not practical for relatively large training sets, like MNIST or CIFAR-10. Feature extraction with sparse coding involves solving an optimization problem, which roughly requires 1/√ε matrix-vector multiplications, where ε controls the precision (Beck and Teboulle, 2009). For a typical value of ε = 10⁻⁶, the complexity becomes 1000 nN (neglecting other constants), that is, 3 orders of magnitude larger than the complexity of the proposed method. This can be seen clearly in the computation times, as our approach is slightly more expensive than linear SVM, but remains much faster than the other methods. Note moreover that the soft-thresholding classification scheme is very simple to implement in practice at test time, as it is a direct map that only involves max and linear operations.

6.2 Sparsity

Sparsity is a highly beneficial property in representation learning, as it helps decompose the factors of variation in the data into high-level features (Bengio et al, 2013; Glorot et al, 2011). To assess the sparsity of the learned representation, we compute the average sparsity of our representation over all data points (training and testing combined) on the MNIST and CIFAR-10 datasets. We obtain an average of 96.7% zeros in the MNIST case, and 95.3% for CIFAR-10. In other words, our representations are very sparse, without adding an explicit sparsity penalization as in (Glorot et al, 2011). Interestingly, the reported average sparsity in (Glorot et al, 2011) is 83.4% on MNIST and 72.0% on CIFAR-10. Our one-layer representation therefore exhibits an interesting sparsity property, while providing good predictive performance.
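Since the test-time map only involves linear operations and a max, the full multi-class prediction (one-vs-all, as used in the digit and CIFAR-10 experiments) reduces to a few lines. This is a sketch: the function name and the `models` container for the learned (D, w) pairs are illustrative.

```python
import numpy as np

def predict_one_vs_all(x, models, alpha=1.0):
    """One-vs-all prediction for the soft-thresholding classifier (sketch).
    `models` is a list of (D, w) pairs, one per class. Each binary score is
    w^T max(0, D^T x - alpha), an O(nN) direct map; the class with the
    highest score wins."""
    scores = [w @ np.maximum(0.0, D.T @ x - alpha) for D, w in models]
    return int(np.argmax(scores))
```

Each per-class score costs one matrix-vector product plus an entrywise max, which is the O(nN) complexity reported in Table 5.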
                 Complexity   Time [s]
Linear SVM       O(n)         0.4
RBF kernel SVM   O(nm)        154
Sparse coding    O(nN/√ε)⁴    14⁵
LAST classifier  O(nN)        1.0

Table 5: Computational complexity for classifying one test sample, and time needed to predict the labels of the 10000 test samples in the MNIST dataset. For reference, all the experiments are carried out on a 2.6 GHz Intel Core i7 machine with 16 GB RAM.

6.3 LAST vs. stochastic gradient descent

As discussed earlier, the soft-thresholding classification scheme belongs to the more general family of neural network models. Neural networks are commonly optimized with stochastic gradient descent algorithms, as opposed to the DC method proposed in this paper. The proposed learning algorithm has several advantages compared to SGD:

• Better local minimum: In all our experiments, LAST reached a better solution than SGD in terms of testing accuracy. This confirms the observations of Tao and An (1998), whereby DCA converges to "good" local minima, and often to global minima in practice.

• Descent method: Unlike stochastic gradient descent, LAST (and more generally DCA) is a descent method. Moreover, it is guaranteed to converge to a critical point (Tao and An, 1998).

• No stepsize selection: Stochastic gradient descent (and more generally gradient descent based algorithms) are very sensitive to the difficult choice of the stepsize. Choosing a large stepsize in SGD can be beneficial, as it helps escape local minima, but it can also lead to an oscillatory behaviour that prevents convergence. Interestingly, our optimization algorithm does not involve any stepsize selection, when given a convex optimization solver. In fact, our algorithm solves a sequence of convex problems, which can be solved with any off-the-shelf convex solver.
Note that even if the intermediate convex optimization problems are solved with a gradient-descent based technique, the choice of the stepsize is less challenging, as we have a better understanding of the theoretical properties of stepsize rules in convex optimization problems. As we have previously mentioned, unlike SGD, our algorithm assumes the sign vector of the linear classifier w to be known. A simple heuristic choice of this parameter was however shown to provide very good results in the experiments, compared to SGD. Of course, choosing this parameter with cross-validation might lead to better results, but it also implies a slower training procedure.

7 Conclusion

We have proposed a supervised learning algorithm tailored for the soft-thresholding based classifier. The learning problem, which jointly estimates a discriminative dictionary D and a classifier hyperplane w, is cast as a DC problem and solved efficiently with an iterative algorithm. The proposed algorithm (LAST), which leverages the DC structure, significantly outperforms stochastic gradient descent in all our experiments. Furthermore, the resulting classifier consistently leads to better results than the unsupervised sparse coding classifier. Our method moreover compares favorably to other standard techniques such as linear, RBF kernel or nearest neighbor classifiers. The proposed LAST classifier has also been shown to compete with recent discriminative sparse coding techniques in handwritten digits classification experiments. We should mention that, while the sparse coding encoder features some form of competition between the different atoms in the dictionary (often referred to as explaining-away (Gregor and LeCun, 2010)), our encoder acts on the different atoms independently. Despite this simple behavior, our scheme is competitive when the dictionary and classifier parameters are learned in a suitable manner.
The classification scheme adopted in this paper can be seen as a one-hidden-layer neural network with a soft-thresholding activation function. This activation function has recently gained significant attention in the deep learning community, as it is believed to make the training procedure easier and less prone to bad local minima. Our work reveals an interesting structure of the optimization problem for the one-hidden-layer version of that network that allows good minima to be reached. An interesting question is whether it is possible to find a similar structure for networks with many hidden layers. This would help the training of deep networks, and offer insights on this challenging problem, which is usually tackled using stochastic gradient descent.

⁴The complexity reported here is that of the FISTA algorithm (Beck and Teboulle, 2009), where ε denotes the required precision. Note that another popular method for solving sparse coding is the homotopy method, which is efficient in practice; however, it has exponential theoretical complexity (Mairal and Yu, 2012).

⁵To provide a fair comparison with our method, we used dictionaries of the same size as for our proposed approach, for the sake of this experiment.

A Soft-thresholding as an approximation to non-negative sparse coding

We show here that soft-thresholding can be viewed as a coarse approximation to the non-negative sparse coding mapping (Denil and de Freitas, 2012). To see this, we consider the proximal gradient algorithm to solve the sparse coding problem with additional nonnegativity constraints on the coefficients. Specifically, we consider the following mapping:

argmin_{c ∈ R^N} ||x − Dc||_2^2 + λ ||c||_1  subject to c ≥ 0.
The proximal gradient algorithm proceeds by iterating the following recursive equation until convergence:

c^{k+1} = prox_{λt ||·||_1 + I_{·≥0}} (c^k + t D^T (x − D c^k)),

where prox is the proximal operator, t is the chosen stepsize, and I_{·≥0} is the indicator function, which is equal to 0 if all the components of the vector are nonnegative, and +∞ otherwise. Using the definition of the proximal mapping, we have

prox_{λt ||·||_1 + I_{·≥0}} (x) := argmin_{u ≥ 0} { (1/2) ||u − x||_2^2 + λt ||u||_1 } = max(0, x − λt).

Therefore, imposing the initial condition c^0 = 0 and a stepsize t = 1, the first step of the proximal gradient algorithm can be written as

c^1 = max(0, D^T x − λ) = h_λ(D^T x),

which precisely corresponds to our soft-thresholding map. In this way, our soft-thresholding map corresponds to an approximation of sparse coding, where only one iteration of the proximal gradient algorithm is performed.

B Proofs

B.1 Proof of Proposition 1

Before going through the proof of Proposition 1, we need the following results from (Horst, 2000, Section 4.2):

Proposition 3
1. Let {f_i}_{i=1}^l be DC functions. Then, for any set of real numbers (λ_1, ..., λ_l), Σ_{i=1}^l λ_i f_i is also DC.
2. Let f : R^n → R be DC and g : R → R be convex. Then, the composition g(f(x)) is DC.

We recall that the objective function of (P) is given by:

Σ_{i=1}^m L(y_i Σ_{j=1}^N s_j q(u_j^T x_i − v_j)) + (ν/2) ||v||_2^2.

The function ||v||_2^2 is convex and therefore DC. We show that the first part of the objective function is also DC. We rewrite this part as follows:

Σ_{i=1}^m L( Σ_{j : s_j = y_i} q(u_j^T x_i − v_j) − Σ_{j : s_j ≠ y_i} q(u_j^T x_i − v_j) ).

Since q is convex, q(u_j^T x_i − v_j) is also convex (Boyd and Vandenberghe, 2004). As the loss function L is convex, we finally conclude from Proposition 3 that the objective function is DC.
Moreover, since the constraint v ≥ ε is convex, we conclude that (P) is a DC optimization problem.

B.2 Proof of Proposition 2

We now suppose that L(x) = \max(0, 1 - x), and derive the DC form of the objective function. We have:

\sum_{i=1}^m L\Big( y_i \sum_{j=1}^N s_j \, q(u_j^T x_i - v_j) \Big)
= \sum_{i=1}^m \max\Big( 0, \; 1 + \sum_{j : s_j \ne y_i} q(u_j^T x_i - v_j) - \sum_{j : s_j = y_i} q(u_j^T x_i - v_j) \Big)
= \sum_{i=1}^m \max\Big( \sum_{j : s_j = y_i} q(u_j^T x_i - v_j) - \sum_{j : s_j = y_i} q(u_j^T x_i - v_j), \; 1 + \sum_{j : s_j \ne y_i} q(u_j^T x_i - v_j) - \sum_{j : s_j = y_i} q(u_j^T x_i - v_j) \Big)
= \sum_{i=1}^m \max\Big( \sum_{j : s_j = y_i} q(u_j^T x_i - v_j), \; 1 + \sum_{j : s_j \ne y_i} q(u_j^T x_i - v_j) \Big) - \sum_{i=1}^m \sum_{j : s_j = y_i} q(u_j^T x_i - v_j).

The objective function of (P) can therefore be written as g - h, with:

g = \frac{\nu}{2} \| v \|_2^2 + \sum_{i=1}^m \max\Big( \sum_{j : s_j = y_i} q(u_j^T x_i - v_j), \; 1 + \sum_{j : s_j \ne y_i} q(u_j^T x_i - v_j) \Big),
h = \sum_{i=1}^m \sum_{j : s_j = y_i} q(u_j^T x_i - v_j),

where g and h are convex functions.

Acknowledgments

The authors would like to thank the associate editor and the anonymous reviewers for their valuable comments and references that helped to improve the quality of this paper.

References

Akata Z, Perronnin F, Harchaoui Z, Schmid C (2014) Good practice in large-scale learning for image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence pp 507–520
An LTH, Tao PD (2005) The DC (difference of convex functions) programming and DCA revisited with DC models of real world nonconvex optimization problems. Annals of Operations Research 133(1-4):23–46
Beck A, Teboulle M (2009) A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2(1):183–202
Bengio Y, Courville A, Vincent P (2013) Representation learning: A review and new perspectives.
IEEE Transactions on Pattern Analysis and Machine Intelligence 35(8):1798–1828
Bishop CM (1995) Neural Networks for Pattern Recognition. Oxford University Press, Inc.
Boyd S, Vandenberghe L (2004) Convex Optimization. Cambridge University Press
Burges C (1998) A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2(2):121–167
Chang CC, Lin CJ (2011) LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2:27:1–27:27
Chen CF, Wei CP, Wang YC (2012) Low-rank matrix recovery with structural incoherence for robust face recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 2618–2625
Coates A, Ng A (2011) The importance of encoding versus training with sparse coding and vector quantization. In: International Conference on Machine Learning (ICML), pp 921–928
Denil M, de Freitas N (2012) Recklessly approximate sparse coding. arXiv preprint arXiv:1208.0959
Elad M, Aharon M (2006) Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing 15(12):3736–3745
Fadili J, Starck JL, Murtagh F (2009) Inpainting and zooming using sparse representations. The Computer Journal 52(1):64–79
Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9:1871–1874
Glorot X, Bordes A, Bengio Y (2011) Deep sparse rectifier networks. In: International Conference on Artificial Intelligence and Statistics (AISTATS), vol 15, pp 315–323
Gregor K, LeCun Y (2010) Learning fast approximations of sparse coding. In: International Conference on Machine Learning (ICML), pp 399–406
Horst R (2000) Introduction to Global Optimization. Springer
Huang K, Aviyente S (2006) Sparse representation for signal classification.
In: Advances in Neural Information Processing Systems, pp 609–616
Hull JJ (1994) A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence 16(5):550–554
Kavukcuoglu K, Ranzato M, LeCun Y (2010a) Fast inference in sparse coding algorithms with applications to object recognition. arXiv preprint arXiv:1010.3467
Kavukcuoglu K, Sermanet P, Boureau YL, Gregor K, Mathieu M, LeCun Y (2010b) Learning convolutional feature hierarchies for visual recognition. In: Advances in Neural Information Processing Systems (NIPS), pp 1090–1098
Krizhevsky A, Hinton G (2009) Learning multiple layers of features from tiny images. Master's thesis, Department of Computer Science, University of Toronto
Larochelle H, Mandel M, Pascanu R, Bengio Y (2012) Learning algorithms for the classification restricted Boltzmann machine. The Journal of Machine Learning Research 13:643–669
LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324
Ma L, Wang C, Xiao B, Zhou W (2012) Sparse representation for face recognition based on discriminative low-rank dictionary learning. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 2586–2593
Maas A, Hannun A, Ng A (2013) Rectifier nonlinearities improve neural network acoustic models. In: International Conference on Machine Learning (ICML)
Mairal J, Yu B (2012) Complexity analysis of the lasso regularization path. In: International Conference on Machine Learning (ICML), pp 353–360
Mairal J, Bach F, Ponce J, Sapiro G, Zisserman A (2008) Supervised dictionary learning. In: Advances in Neural Information Processing Systems (NIPS), pp 1033–1040
Mairal J, Bach F, Ponce J, Sapiro G (2010) Online learning for matrix factorization and sparse coding.
The Journal of Machine Learning Research 11:19–60
Mairal J, Bach F, Ponce J (2012) Task-driven dictionary learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 34(4):791–804
Raina R, Battle A, Lee H, Packer B, Ng AY (2007) Self-taught learning: transfer learning from unlabeled data. In: International Conference on Machine Learning (ICML), pp 759–766
Ramirez I, Sprechmann P, Sapiro G (2010) Classification and clustering via dictionary learning with structured incoherence and shared features. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 3501–3508
Shawe-Taylor J, Cristianini N (2004) Kernel Methods for Pattern Analysis. Cambridge University Press
Sriperumbudur BK, Torres DA, Lanckriet GR (2007) Sparse eigen methods by DC programming. In: International Conference on Machine Learning (ICML), pp 831–838
Tao PD, An LTH (1998) A DC optimization algorithm for solving the trust-region subproblem. SIAM Journal on Optimization 8(2):476–505
Valkealahti K, Oja E (1998) Reduced multidimensional co-occurrence histograms in texture classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(1):90–94
Figueras i Ventura R, Vandergheynst P, Frossard P (2006) Low-rate and flexible image coding with redundant representations. IEEE Transactions on Image Processing 15(3):726–739
Wright J, Yang A, Ganesh A, Sastry S, Ma Y (2009) Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(2):210–227
Yang J, Yu K, Gong Y, Huang T (2009) Linear spatial pyramid matching using sparse coding for image classification. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 1794–1801
Yuille A, Rangarajan A (2002) The concave-convex procedure (CCCP).
In: Advances in Neural Information Processing Systems (NIPS), vol 2, pp 1033–1040
Zeiler M, Ranzato M, Monga R, Mao M, Yang K, Le Q, Nguyen P, Senior A, Vanhoucke V, Dean J, Hinton G (2013) On rectified linear units for speech processing. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 3517–3521
Zhang Y, Jiang Z, Davis L (2013) Learning structured low-rank representations for image classification. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 676–683
