Feature Selection By KDDA For SVM-Based MultiView Face Recognition

SETIT 2007 — 4th International Conference: Sciences of Electronic, Technologies of Information and Telecommunications, March 25-29, 2007, Tunisia

Seyyed Majid Valiollahzadeh, Abolghasem Sayadiyan, Mohammad Nazari
Electrical Engineering Department, Amirkabir University of Technology, Tehran, Iran, 15914
valiollahzadeh@yahoo.com, eea35@aut.ac.ir, mohnazari@aut.ac.ir

Abstract: Applications such as Face Recognition (FR) that deal with high-dimensional data need a mapping technique that produces low-dimensional feature representations with enhanced discriminatory power, together with a classifier able to handle those complex features. Most traditional Linear Discriminant Analysis (LDA) methods suffer from the disadvantage that their optimality criteria are not directly related to the classification ability of the obtained feature representation. Moreover, their classification accuracy is affected by the "small sample size" (SSS) problem, which is often encountered in FR tasks. In this short paper, we combine a nonlinear kernel-based mapping of the data, called KDDA, with a Support Vector Machine (SVM) classifier to deal with both of these shortcomings in an efficient and cost-effective manner. The proposed method is compared, in terms of classification accuracy, to other commonly used FR methods on the UMIST face database. Results indicate that the performance of the proposed method is overall superior to those of traditional FR approaches, such as the Eigenfaces, Fisherfaces, and D-LDA methods with traditional linear classifiers.

Keywords: Face Recognition, Kernel Direct Discriminant Analysis (KDDA), small sample size problem (SSS), Support Vector Machine (SVM).

1 INTRODUCTION

Selecting appropriate features to represent faces and properly classifying these features are two central issues in face recognition (FR) systems.
For feature selection, successful solutions appear to be the appearance-based approaches (see [3], [2] for a survey), which operate directly on images or appearances of face objects and process the images as two-dimensional (2-D) holistic patterns, avoiding the difficulties associated with three-dimensional (3-D) modelling and shape or landmark detection [2]. For data reduction and feature extraction in the appearance-based approaches, Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) have been introduced as two powerful tools. Eigenfaces [4] and Fisherfaces [5], built on these two techniques respectively, are two state-of-the-art FR methods that have proved very successful. It is generally believed that LDA-based algorithms outperform PCA-based ones in pattern classification problems, since the former optimize the low-dimensional representation of the objects with a focus on the most discriminant feature extraction, while the latter achieves simply object reconstruction. However, many LDA-based algorithms suffer from the so-called small sample size (SSS) problem, which arises in high-dimensional pattern recognition tasks where the number of available samples is smaller than the dimensionality of the samples. The traditional solution to the SSS problem is to utilize PCA concepts in conjunction with LDA (PCA+LDA), as was done, for example, in Fisherfaces [11]. Recently, more effective solutions, called Direct LDA (D-LDA) methods, have been presented [12], [13]. Although successful in many cases, linear methods fail to deliver good performance when face patterns are subject to large variations in viewpoint, which results in a highly non-convex and complex distribution. The limited success of these methods should be attributed to their linear nature [14].
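The contrast drawn above between PCA (reconstruction) and LDA (discrimination) can be checked on synthetic data: when the classes are separated only along a low-variance axis, the top PCA direction ignores the labels and misses the separation, while the Fisher direction recovers it. A minimal NumPy sketch (the data construction and variable names are ours, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
# Two classes: large shared variance on axis 0, class separation only on axis 1.
X0 = np.column_stack([rng.normal(0, 10, n), rng.normal(-1, 0.1, n)])
X1 = np.column_stack([rng.normal(0, 10, n), rng.normal(+1, 0.1, n)])
X = np.vstack([X0, X1])

# PCA keeps the direction of maximum variance (axis 0) and ignores the labels.
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
pca_dir = eigvecs[:, -1]            # eigenvector of the largest eigenvalue

# Fisher/LDA direction: w ~ Sw^{-1} (mu1 - mu0), which uses the labels.
Sw = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)
lda_dir = np.linalg.solve(Sw, X1.mean(axis=0) - X0.mean(axis=0))
lda_dir /= np.linalg.norm(lda_dir)
```

Projecting both classes onto `pca_dir` leaves the class means nearly coincident relative to the spread, while projecting onto `lda_dir` separates them cleanly.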
Kernel Direct Discriminant Analysis (KDDA) generalizes the strengths of the recently presented D-LDA [1] and of kernel techniques, while at the same time overcoming many of their shortcomings and limitations. In this work, we first nonlinearly map the original input space to an implicit high-dimensional feature space, where the distribution of face patterns is hoped to be linearized and simplified. Then, the KDDA method is introduced to effectively solve the SSS problem and derive a set of optimal discriminant basis vectors in the feature space. Finally, an SVM is used for classification.

The rest of the paper is organized as follows. In Section 2, we start the analysis by briefly reviewing the KDDA method. In Section 3, the SVM is introduced and analyzed as a powerful classifier. In Section 4, a set of experiments is presented to demonstrate the effectiveness of the KDDA algorithm together with the SVM classifier on highly nonlinear, highly complex face pattern distributions. The proposed method is compared, in terms of classification error rate, to KPCA (kernel-based PCA), GDA (Generalized Discriminant Analysis), and the KDDA algorithm with a nearest-neighbour classifier on the multi-view UMIST face database. Conclusions are summarized in Section 5.

2 Kernel Direct Discriminant Analysis (KDDA)

2.1 Linear Discriminant Analysis

In statistical pattern recognition, the problem of feature extraction can be stated as follows. Assume that a training set of $L$ images, $\{Z_i\}_{i=1}^{L}$, is available. Each image is defined as a vector of length $N = I_w \times I_h$, i.e. $Z_i \in \mathbb{R}^N$, where $I_w \times I_h$ is the face image size and $\mathbb{R}^N$ denotes an $N$-dimensional real space [1]. It is further assumed that each image belongs to one of $C$ classes $\{Z_i\}_{i=1}^{C}$.
The objective is to find a transformation $\varphi$, based on the optimization of certain separability criteria, which produces a mapping $y_i = \varphi(Z_i)$, $y_i \in \mathbb{R}^M$ with $M \ll N$, that leads to an enhanced separability of the different face objects.

Let $S_{BTW}$ and $S_{WTH}$ be the between- and within-class scatter matrices in the feature space, respectively, expressed as follows:

$$S_{BTW} = \frac{1}{L}\sum_{i=1}^{C} C_i\,(\bar\phi_i - \bar\phi)(\bar\phi_i - \bar\phi)^T \qquad (1)$$

$$S_{WTH} = \frac{1}{L}\sum_{i=1}^{C}\sum_{j=1}^{C_i} (\phi_{ij} - \bar\phi_i)(\phi_{ij} - \bar\phi_i)^T \qquad (2)$$

where $\phi_{ij} = \phi(Z_{ij})$, $C_i$ is the number of samples in class $Z_i$, $\bar\phi_i$ is the mean of class $Z_i$, and $\bar\phi$ is the average of the ensemble:

$$\bar\phi_i = \frac{1}{C_i}\sum_{j=1}^{C_i}\phi(Z_{ij}) \qquad (3)$$

$$\bar\phi = \frac{1}{L}\sum_{i=1}^{C}\sum_{j=1}^{C_i}\phi(Z_{ij}) \qquad (4)$$

The maximization can be achieved by solving the following eigenvalue problem:

$$\Phi = \arg\max_{\Phi} \frac{\left|\Phi^T S_{BTW}\,\Phi\right|}{\left|\Phi^T S_{WTH}\,\Phi\right|} \qquad (5)$$

The feature space $F$ can be considered a "linearization space" [6]; however, its dimensionality could be arbitrarily large, and possibly infinite. Solving this problem leads to LDA [1]. Assuming that $S_{WTH}$ is nonsingular, the basis vectors $\Phi$ correspond to the first $M$ eigenvectors with the largest eigenvalues of the discriminant criterion

$$J(\Phi) = \mathrm{tr}\!\left(S_{WTH}^{-1} S_{BTW}\right) \qquad (6)$$

The $M$-dimensional representation is then obtained by projecting the original face images onto the subspace spanned by the $M$ eigenvectors.

2.2 Kernel Direct Discriminant Analysis (KDDA)

The maximization process in (5) is not directly linked to the classification error, which is the criterion of performance used to measure the success of the FR procedure. Modified versions of the method, such as the Direct LDA (D-LDA) approach, use a weighting function in the input space to penalize those classes that are close and can potentially lead to misclassifications in the output space. Most LDA-based algorithms, including Fisherfaces [7] and D-LDA [9], utilize the conventional Fisher criterion given by (5). The introduction of the kernel function allows us to avoid the explicit evaluation of the mapping.
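Before moving to the kernel case, the quantities in Eqs. (1)-(4) are straightforward to compute in the plain linear setting, i.e. with $\phi$ taken as the identity. A minimal NumPy sketch (the function name is ours, not from the paper):

```python
import numpy as np

def scatter_matrices(X, y):
    """Between- and within-class scatter, Eqs. (1)-(2), with phi = identity.
    X: (L, N) data matrix, y: (L,) class labels."""
    L, N = X.shape
    mu = X.mean(axis=0)                       # ensemble mean, Eq. (4)
    S_btw = np.zeros((N, N))
    S_wth = np.zeros((N, N))
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)                # class mean, Eq. (3)
        d = (mu_c - mu)[:, None]
        S_btw += len(Xc) * (d @ d.T)          # C_i-weighted between-class term
        S_wth += (Xc - mu_c).T @ (Xc - mu_c)  # within-class term
    return S_btw / L, S_wth / L
```

When $S_{WTH}$ is nonsingular, the LDA basis of Eq. (6) follows from the leading eigenvectors of `np.linalg.inv(S_wth) @ S_btw`; under the SSS conditions discussed above ($L < N$), $S_{WTH}$ is singular and this inversion fails, which is precisely the problem the kernel D-LDA treatment addresses.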
Any function satisfying Mercer's condition can be used as a kernel; typical kernel functions include the polynomial function, the radial basis function (RBF), and multi-layer perceptrons [10]. The KDDA method implements an improved D-LDA in a high-dimensional feature space using a kernel approach, replacing (5) with the modified criterion

$$\Phi = \arg\max_{\Phi} \frac{\left|\Phi^T S_{BTW}\,\Phi\right|}{\left|\Phi^T S_{BTW}\,\Phi + \Phi^T S_{WTH}\,\Phi\right|} \qquad (7)$$

KDDA introduces a nonlinear mapping from the input space to an implicit high-dimensional feature space, where the nonlinear and complex distribution of patterns in the input space is "linearized" and "simplified" so that conventional LDA can be applied; it effectively solves the small sample size (SSS) problem in the high-dimensional feature space by employing an improved D-LDA algorithm. Unlike the original D-LDA method of [9], zero eigenvalues of the within-class scatter matrix are never used as divisors in the improved one. In this way, the optimal discriminant features can be exactly extracted from both inside and outside the null space of $S_{WTH}$.

In GDA, to remove the null space of $S_{WTH}$, it is required to compute the pseudo-inverse of the kernel matrix $K$, which can be extremely ill-conditioned when certain kernels or kernel parameters are used. Pseudo-inversion is based on inversion of the nonzero eigenvalues.

3 SVM Based Approach for Classification

The principle of the Support Vector Machine (SVM) relies on a linear separation in a high-dimensional feature space where the data have been previously mapped, in order to take into account the eventual nonlinearities of the problem.

3.1 Support Vector Machines (SVM)

Assume that the training set $X = \{x_i\}_{i=1}^{l} \subset \mathbb{R}^R$, where $l$ is the number of training vectors and $R$ is the number of modalities, is labelled with two class targets $Y = \{y_i\}_{i=1}^{l}$, $y_i \in \{-1, +1\}$, and that

$$\Phi : \mathbb{R}^R \rightarrow F \qquad (8)$$

maps the data into a feature space $F$.
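The mapping $\Phi$ of Eq. (8) is never evaluated explicitly: a Mercer kernel returns the feature-space inner product directly. For the degree-2 homogeneous polynomial kernel on 2-D inputs, the equivalence can be checked by hand; a small illustrative sketch (the explicit map below is a textbook example, not part of the paper's method):

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 homogeneous polynomial kernel
    (2-D input only; illustrative)."""
    return np.array([x[0]**2, np.sqrt(2.0) * x[0] * x[1], x[1]**2])

def poly2_kernel(x, y):
    """K(x, y) = <x, y>^2, a Mercer kernel: equals <phi(x), phi(y)>
    without ever forming phi."""
    return np.dot(x, y) ** 2
```

Expanding $(x_1 y_1 + x_2 y_2)^2$ term by term shows it matches $\langle \phi(x), \phi(y) \rangle$; the same principle lets the SVM and KDDA work in very high (even infinite) dimensional spaces at the cost of kernel evaluations only.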
Vapnik has proved that maximizing the minimum distance in the space $F$ between $\Phi(X)$ and the separating hyperplane $H(w,b)$ is a good means of reducing the generalization risk, where

$$H(w,b) = \{\, f \in F \mid \langle w, f \rangle + b = 0 \,\} \quad (\langle\cdot,\cdot\rangle:\ \text{inner product}) \qquad (9)$$

Vapnik also proved that the optimal hyperplane can be obtained by solving the following convex quadratic programming (QP) problem:

$$\text{Minimize } \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{l}\xi_i \quad \text{subject to } y_i\left(\langle w, \Phi(x_i)\rangle + b\right) \ge 1 - \xi_i, \quad i = 1, \ldots, l \qquad (10)$$

where the constant $C$ and the slack variables $\xi_i$ are introduced to take into account the eventual non-separability of $\Phi(X)$ in $F$. In practice, this criterion is softened to the minimization of a cost factor involving both the complexity of the classifier and the degree to which marginal points are misclassified; the tradeoff between these factors is managed through a margin-of-error parameter (usually designated $C$) which is tuned through cross-validation procedures.

Although the SVM is based upon a linear discriminator, it is not restricted to making linear hypotheses. Non-linear decisions are made possible by a non-linear mapping of the data to a higher-dimensional space. The phenomenon is analogous to folding a flat sheet of paper into some three-dimensional shape and then cutting it into two halves: the resultant non-linear boundary in the two-dimensional space is revealed by unfolding the pieces. The SVM's non-parametric mathematical formulation allows these transformations to be applied efficiently and implicitly: the SVM's objective is a function of the dot product between pairs of vectors, and substituting the original dot products with those computed in another space eliminates the need to transform the original data points explicitly to the higher space. The computation of dot products between vectors without explicitly mapping to another space is performed by a kernel function.
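For a fixed $(w, b)$, the primal objective of Eq. (10) can be evaluated directly by setting each slack to its smallest feasible value, $\xi_i = \max(0,\, 1 - y_i(\langle w, \Phi(x_i)\rangle + b))$. A hedged NumPy sketch with $\Phi$ taken as the identity (function name ours):

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """Primal objective of Eq. (10) with Phi = identity and the slack
    variables at their minimum feasible values (hinge losses)."""
    margins = y * (X @ w + b)               # y_i (<w, x_i> + b)
    xi = np.maximum(0.0, 1.0 - margins)     # slack variables xi_i
    return 0.5 * np.dot(w, w) + C * xi.sum()
```

Points with margin at least 1 contribute no slack, so only the margin violators are penalized, weighted by the tradeoff parameter $C$ described above.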
The nonlinear projection of the data is performed by these kernel functions. Several common kernel functions are used, such as the linear kernel, the polynomial kernel $K(x,y) = (\langle x, y\rangle + 1)^d$, and the sigmoidal kernel $K(x,y) = \tanh(\langle x, y\rangle + a)$, where $x$ and $y$ are feature vectors in the input space. The other popular kernel is the Gaussian (or "radial basis function") kernel, defined as

$$K(x,y) = \exp\!\left(-\frac{\|x-y\|^2}{2\sigma^2}\right) \qquad (11)$$

where $\sigma$ is a scale parameter and $x$ and $y$ are feature vectors in the input space. The Gaussian kernel has two hyperparameters that control performance: $C$ and the scale parameter $\sigma$. In this paper we used the radial basis function (RBF) kernel.

3.2 Multi-class SVM

The standard Support Vector Machine (SVM) is designed for dichotomic classification problems (two classes, also called binary classification). Several different schemes can be applied to the basic SVM algorithm to handle the K-class pattern classification problem; these schemes are discussed in this section. The K-class pattern classification problem is posed as follows:

• Given an i.i.d. sample $(x_1, y_1), \ldots, (x_l, y_l)$, where $x_i$, for $i = 1, \ldots, l$, is a feature vector of length $d$ and $y_i \in \{1, \ldots, K\}$ is the class label of data point $x_i$.
• Find a classifier with decision function $f(x)$ such that $y = f(x)$, where $y$ is the class label of $x$.

The multi-class classification problem is commonly solved by decomposition into several binary problems for which the standard SVM can be used. Approaches for solving the multi-class problem are listed below:

• Using $K$ one-to-rest classifiers (one-against-all).
• Using $k(k-1)/2$ pairwise classifiers.
• Extending the formulation of the SVM to support the K-class problem.

3.2.1 Combination of one-to-rest classifiers

This scheme is the simplest, and it does give reasonable results. $K$ classifiers are constructed, one for each class.
The $k$-th classifier is trained to classify the training data of class $k$ against all other training data. The decision functions of the classifiers are then combined to give the final classification decision on the K-class problem. In this case, the K-class classification problem is decomposed into $K$ dichotomy decisions $f_m(x)$, $m \in \{1, \ldots, K\}$, where the rule $f_m(x)$ separates the training data of the $m$-th class from the other training patterns. The classification of a pattern $x$ is performed according to the maximal value of the functions $f_m(x)$; i.e., the label of $x$ is computed as

$$f(x) = \arg\max_{m \in \{1,\ldots,K\}} f_m(x) \qquad (12)$$

3.2.2 Pairwise coupling classifiers

This scheme requires a binary classifier for each possible pair of classes. The decision functions of the SVM classifiers for $y_1$-to-$y_2$ and $y_2$-to-$y_1$ have reflectional symmetry about the zero plane, hence only one classifier per pair is needed. The total number of classifiers for a K-class problem is then $k(k-1)/2$. The training data for each classifier is a subset of the available training data, containing only the data of the two involved classes and relabelled accordingly, i.e. one class is labelled +1 and the other −1. These classifiers are then combined with a voting scheme to give the final classification results. The voting schemes need the pairwise probability, i.e. the probability that $x$ belongs to class $i$ given that it can only belong to class $i$ or class $j$.

The output value of the decision function of an SVM is not an estimate of the p.d.f. of a class or of the pairwise probability. One way to estimate the required information from the output of the SVM decision function was proposed by Hastie and Tibshirani (1996): the Gaussian p.d.f. of a particular class is estimated from the output values of the decision function, $f(x)$, for all $x$ in that class.
The centroid and radius of the Gaussian are the mean and standard deviation of $f(x)$, respectively.

4 EXPERIMENTS AND RESULTS

4.1 Database

In our work, we used a popular face database (UMIST [11]) to demonstrate the effectiveness of the proposed combined KDDA and SVM method. It is compared with the KPCA, GDA, and KDDA algorithms with a nearest-neighbour classifier. We use a radial basis function (RBF) kernel:

$$K(x,y) = \exp\!\left(-\frac{\|x-y\|^2}{2\sigma^2}\right) \qquad (13)$$

where $\sigma$ is a scale parameter and $x$ and $y$ are feature vectors in the input space. The RBF function is selected for the proposed SVM method and for KDDA in the experiments. The selection of the scale parameter $\sigma$ is empirical. In addition, since the training set is selected randomly each time, there is some fluctuation among the results. In order to reduce this fluctuation, we run each experiment more than 10 times and report the average.

4.2 UMIST Database

The UMIST repository is a multi-view database consisting of 575 images of 20 people, each covering a wide range of poses from profile to frontal views. Figure 1 depicts some samples contained in the database, where each image is scaled to 112 × 92, resulting in an input dimensionality of N = 10304. For the face recognition experiments, the UMIST database is randomly partitioned into a training set and a test set with no overlap between the two. We used ten images per person, randomly chosen, for training, and the rest for testing; thus, a training set of 200 images and the remaining 375 images form the test set. It is worth mentioning that this experimental setup introduces SSS conditions, since the number of training samples is much smaller than the dimensionality of the input space [1].

Figure 1: Some sample images of four persons randomly chosen from the UMIST database.
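The random partition just described (a fixed number of training images per person, the remainder held out for testing) can be sketched as follows; `labels` is assumed to hold one subject id per image, and the helper name is ours:

```python
import numpy as np

def split_per_subject(labels, n_train=10, seed=0):
    """Randomly pick n_train image indices per subject for training;
    all remaining indices of that subject go to the test set."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for subject in np.unique(labels):
        idx = np.flatnonzero(labels == subject)
        rng.shuffle(idx)                      # idx is a fresh array; safe to shuffle
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    return np.array(train_idx), np.array(test_idx)
```

Repeating the call with different seeds and averaging the resulting recognition rates gives the over-10-runs averaging described above.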
On this database, we test the methods with different numbers of training and testing samples, corresponding to training numbers k = 2, 3, 4, 5, 6, 7, 10 per subject. Each time, k samples are randomly selected from each subject for training and the remaining samples are used for testing. The experimental results are given in Table 1.

Table 1. Recognition rate (%) on the UMIST database.

K     Our method (KDDA+SVM)     KDDA+NN*     KPCA     GDA
2     81.8                      81.9         75.5     71.5
3     83.5                      83.4         76.2     72.8
4     87.3                      85.4         77.1     74.5
5     90.4                      87.9         79.8     75.1
6     94.1                      89.1         83.4     79.0
7     96.0                      93.9         87.1     82.1
10    96.5                      95.2         89.1     83.0

* Nearest Neighbour

Figure 2 depicts the first two most discriminant features extracted by KDDA, and shows the decision boundary for the first 6 classes of the training data in the combination of one-to-rest classifier SVM.

Figure 2: The decision boundary for the first 6 classes of the training data (combination of one-to-rest classifier SVM).

The only kernel parameter for the RBF is the scale value $\sigma^2$ of the SVM classifier. Figure 3 shows the error rates as functions of $\sigma^2$ when the optimal number of feature vectors M is used.

Figure 3: Error rates as functions of $\sigma^2$ of the SVM ($\sigma^2$ for KDDA chosen as in [1]).

The average error rates of our method with the RBF kernel are shown in Figure 4, which gives the error rates as functions of M within the range from 2 to 19 ($\sigma^2$ optimal).

5 Discussions and Conclusions

A new FR method has been introduced in this paper. The proposed method combines kernel-based methodologies with discriminant analysis techniques and an SVM classifier. The kernel function is utilized to map the original face patterns to a high-dimensional feature space, where the highly non-convex and complex distribution of face patterns is simplified, so that linear discriminant techniques can be used for feature extraction.
The small sample size problem caused by the high dimensionality of the mapped patterns is addressed by a kernel-based D-LDA technique (KDDA), which exactly finds the optimal discriminant subspace of the feature space without any loss of significant discriminant information. The resulting features are then fed to the SVM classifier. Experimental results indicate that the performance of the KDDA algorithm together with the SVM is overall superior to that obtained by the KPCA or GDA approaches. In conclusion, the KDDA mapping with an SVM classifier is a general pattern recognition method for nonlinear feature extraction from high-dimensional input patterns that does not suffer from the SSS problem. We expect that, in addition to face recognition, KDDA will provide excellent performance in applications where classification tasks are routinely performed, such as content-based image indexing and retrieval, and video and audio classification.

Figure 4: Comparison of error rates based on the RBF kernel function.

Acknowledgements

The authors would like to acknowledge the Iran Telecommunication Research Center (ITRC) for financially supporting this work. We would also like to thank Dr. Daniel Graham and Dr. Nigel Allinson for providing the UMIST face database.

References

[1] J. Lu, K. N. Plataniotis, and A. N. Venetsanopoulos, "Face recognition using LDA-based algorithms," IEEE Trans. Neural Networks, vol. 14, no. 1, Jan. 2003.
[2] M. Turk, "A random walk through eigenspace," IEICE Trans. Inform. Syst., vol. E84-D, pp. 1586–1695, Dec. 2001.
[3] R. Chellappa, C. L. Wilson, and S. Sirohey, "Human and machine recognition of faces: A survey," Proc. IEEE, vol. 83, pp. 705–740, May 1995.
[4] M. Turk and A. P. Pentland, "Eigenfaces for recognition," J. Cognitive Neurosci., vol. 3, no. 1, pp. 71–86, 1991.
[5] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, "Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection," IEEE Trans.
Pattern Anal. Machine Intell., vol. 19, no. 7, pp. 711–720, July 1997.
[6] M. A. Aizerman, E. M. Braverman, and L. I. Rozonoer, "Theoretical foundations of the potential function method in pattern recognition learning," Automation and Remote Control, vol. 25, pp. 821–837, 1964.
[7] L.-F. Chen, H.-Y. Mark Liao, M.-T. Ko, J.-C. Lin, and G.-J. Yu, "A new LDA-based face recognition system which can solve the small sample size problem," Pattern Recognition, vol. 33, pp. 1713–1726, 2000.
[9] H. Yu and J. Yang, "A direct LDA algorithm for high-dimensional data with application to face recognition," Pattern Recognition, vol. 34, pp. 2067–2070, 2001.
[10] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, 1995.
[11] D. B. Graham and N. M. Allinson, "Characterizing virtual eigensignatures for general purpose face recognition," in Face Recognition: From Theory to Applications, H. Wechsler, P. J. Phillips, V. Bruce, F. Fogelman-Soulie, and T. S. Huang, Eds., 1998, vol. 163, NATO ASI Series F, Computer and Systems Sciences, pp. 446–456.
[12] D. L. Swets and J. Weng, "Using discriminant eigenfeatures for image retrieval," IEEE Trans. Pattern Anal. Machine Intell., vol. 18, pp. 831–836, Aug. 1996.
[13] Q. Liu, R. Huang, H. Lu, and S. Ma, "Kernel-based optimized feature vectors selection and discriminant analysis for face recognition," 2002 IEEE.
[14] C. Liu and H. Wechsler, "Evolutionary pursuit and its application to face recognition," IEEE Trans. Pattern Anal. Machine Intell., vol. 22, pp. 570–582, June 2000.
[15] K. Liu, Y. Q. Cheng, J. Y. Yang, and X. Liu, "An efficient algorithm for Foley–Sammon optimal set of discriminant vectors by algebraic method," Int. J. Pattern Recog. Artificial Intell., vol. 6, pp. 817–829, 1992.
[16] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, John Wiley & Sons, 2000.
[17] O. L. Mangasarian and D. R. Musicant,
"Successive overrelaxation for support vector machines," IEEE Transactions on Neural Networks, vol. 10, no. 5, 1999.
