Ensembles of Random Sphere Cover Classifiers
We propose and evaluate alternative ensemble schemes for a new instance based learning classifier, the Randomised Sphere Cover (RSC) classifier. RSC fuses instances into spheres, then bases classification on distance to spheres rather than distance to instances. The randomised nature of RSC makes it ideal for use in ensembles. We propose two ensemble methods tailored to the RSC classifier: αβRSE, an ensemble based on instance resampling, and αRSSE, a subspace ensemble. We compare αβRSE and αRSSE to tree based ensembles on a set of UCI datasets and demonstrate that RSC ensembles perform significantly better than some of these ensembles, and not significantly worse than the others. We demonstrate via a case study on six gene expression data sets that αRSSE can outperform other subspace ensemble methods on high dimensional data when used in conjunction with an attribute filter. Finally, we perform a set of Bias/Variance decomposition experiments to analyse the source of improvement in comparison to a base classifier.
Authors: Anthony Bagnall, Reda Younsi
UNIVERSITY OF EAST ANGLIA COMPUTER SCIENCE TECHNICAL REPORT CMPC14-05

Index Terms: Sphere classifier, ensemble

1 INTRODUCTION

We propose and evaluate alternative ensemble schemes for a simple instance based learning classifier, the Randomised Sphere Cover (RSC) classifier, first introduced in [49]. RSC creates spheres around a subset of instances from the training data, then bases classification on distance to spheres, rather than distance to instances. Nearest neighbour (NN) based classifiers remain popular in a wide range of fields such as image processing. One of their strengths lies in the fact that they are robust to changes in the training data.
However, this feature of NN classifiers means that there is less observable benefit (in terms of error reduction) of using them in conjunction with resampling ensemble schemes such as bagging [3]. RSC aims to overcome this problem by using a randomised heuristic to select a subset of instances to represent the spheres used in classification. RSC can be seen as a form of data reduction, and hence scales well for large data sets. Data reduction algorithms [47, 26, 23] search the training data for a subset of cases and/or attributes with which to classify new instances, to achieve the maximum compression with the minimum reduction in accuracy. RSC can be described by the compression scheme method [12]. The compression scheme has been proposed to explain the generalisation performance of sparse algorithms. In general, algorithms are called sparse because they keep a subset of the training set as part of their learning process. A large number of algorithms fall in this category, such as Support Vector Machines (SVM), the Perceptron algorithm and KNN [19]. Recently, the compression scheme was rejuvenated to explore an algorithm similar to RSC, the set covering machine (SCM), proposed by Shawe-Taylor [40]. Younsi [50] examined the relationships between α, the accuracy and the cardinality of the sphere cover classifier using existing probabilistic bounds based on the compression scheme. Although it is clear that sphere cover accuracy is synonymous with covering, the compression scheme has shown that degradation in accuracy is only possible by heavily pruning spheres. This suggests that the sphere cover classifier is indeed a strong candidate for exploring the accuracy/diversity dilemma found in ensemble design [24, 38, 42].

(A. Bagnall is with the School of Computing Sciences, University of East Anglia, Norwich, Norfolk, United Kingdom. E-mail: ajb@uea.ac.uk)
The process that creates the spheres for RSC is controlled by two parameters: α, the minimum number of cases a sphere must contain in order to be retained as part of the classifier; and β, the number of misclassified instances a sphere can contain. We investigate how these parameters can be utilised to diversify the ensemble. We propose two ensemble methods tailored to the RSC classifier: αβRSE, an ensemble based on resampling, and αRSSE, a subspace ensemble. We demonstrate that the resulting ensemble classifiers are at least comparable to, and often better than, state of the art ensemble techniques. We perform a case study on six high dimensional gene expression data sets to demonstrate that αRSSE works well with attribute filters and that it outperforms other subspace ensemble methods on these data sets. Finally, we perform a set of Bias/Variance (BV) decomposition experiments to analyse the source of improvement in comparison to a base classifier.

The structure of the rest of this paper is as follows. In Section 2 we provide the background motivation for the RSC classifier, an overview of the relevant ensemble literature and a brief summary of Domingos' BV decomposition technique [10]. In Section 3 we formally describe the RSC classifier and in Section 4 we define our two ensemble schemes. In Section 5 we present the results and in Section 7 we summarise our conclusions.

2 BACKGROUND

A classifier constructs a decision rule based on a set of l training examples D = {(x_i, y_i)}_{i=1}^{l}, where x_i represents a vector of observations of m explanatory variables associated with the i-th case, and y_i indicates the class to which the i-th example belongs. We call the range of all possible values of the explanatory variables X, and the range of the discrete response variable Y = {C_1, C_2, ..., C_r}.
We assume a dissimilarity measure d is defined on X: a function d : X × X → R^+ such that for all x_1, x_2 ∈ X, d(x_1, x_1) = 0 and d(x_1, x_2) = d(x_2, x_1) ≥ 0. A classifier f : X → Y, f(x) = ŷ, is a function from the attribute space to the response variable space.

2.1 Sphere Cover Classifiers

The sphere covering mechanism we use stems from the class covering approach to classification, first introduced in [6]. A sphere B_i is associated with a particular class C_{B_i}, and is defined by a centre c_i and radius r_i. In practice we also include in the sphere definition all the instances within its boundary. Hence, a sphere is defined by a 4-tuple B_i = ⟨C_{B_i}, c_i, r_i, X_{B_i}⟩, where

    X_{B_i} = {x ∈ D : d(x, c_i) < r_i}.

The centre of the sphere is the vector of the means of the attributes of the cases contained within. The radius of the sphere B_i is defined as the distance from the centre to the closest example from a class other than C_{B_i} that is not in X_{B_i}, i.e.

    r_i = min_{x_j ∈ X \ X_{B_i}, y_j ≠ C_{B_i}} d(x_j, c_i),

where X = {x ∈ D}. A union of spheres is called a cover. A cover that contains all of the examples in D is called proper, and one consisting of spheres that only contain examples of one class is said to be pure. The class cover problem (CCP) involves finding a pure and proper cover that has the minimum number of spheres of all possible pure and proper covers. The solution to the CCP proposed in [36] involves constructing a Class Cover Catch Digraph (CCCD), a directed graph based on the proximity of training cases. However, finding the optimal covering via the CCCD is NP-hard [7]. Hence [31, 30] proposed a number of greedy algorithms to find an approximately optimal set covering. However, these algorithms are still slow and only find pure covers. The constraint of pure and proper covers will tend to lead to a classifier that overfits the training data.
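The radius and covered-set definitions above can be sketched as follows. This is a minimal illustration under our own naming (the array layout and the function `sphere_radius_and_cover` are not from the report); it computes, for one chosen centre, the distance to the closest case of a different class and the indices of the cases strictly inside the resulting sphere.

```python
import numpy as np

def sphere_radius_and_cover(X, y, centre_idx):
    """Given training data (X, y) and the index of a chosen centre,
    return the sphere radius (distance to the closest case of a
    different class) and the indices of the cases it covers."""
    centre, target = X[centre_idx], y[centre_idx]
    dists = np.linalg.norm(X - centre, axis=1)   # Euclidean distances to the centre
    other = y != target                          # cases of a different class
    radius = dists[other].min()                  # the closest such case sets the radius
    covered = np.where(dists < radius)[0]        # cases strictly inside the sphere
    return radius, covered
```

Note that, following the definition, the boundary case itself is never covered, since membership uses a strict inequality.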
An algorithm that relaxes the requirement of class purity was proposed by [36]. This algorithm introduces two parameters to alleviate the constraint of requiring a pure proper cover. The parameter α relaxes the proper requirement by only allowing spheres that contain at least α cases to be retained in the classifier. The parameter β reduces the purity constraint by allowing a sphere to contain β cases of the wrong class. The authors admit the resulting algorithms are infeasible for large data, and hence (to the best of our knowledge) there has been very limited experimental evaluation of this and other CCP based classifiers. Furthermore, the resulting classifiers are very sensitive to the parameters. In particular, β, if constant for all spheres, is too crude a mechanism for relaxing the purity constraint. In Section 3 we describe an ensemble base classifier derived from the CCP algorithm proposed in [32] that is randomised (rather than constructive) and retains just the single parameter, α.

2.2 Ensemble Methods

An ensemble of classifiers is a set of base classifiers whose individual decisions are combined through some process of fusion to classify new examples [33, 9]. One key concept in ensemble design is the requirement to inject diversity into the ensemble [9, 39, 35, 15, 16, 18]. Broadly speaking, diversity can be achieved in an ensemble by:

  • employing different classification algorithms to train each base classifier, to form a heterogeneous ensemble;
  • changing the training data for each base classifier through a sampling scheme or by directed weighting of instances;
  • selecting different attributes to train each classifier;
  • modifying each classifier internally, either through re-weighting the training data or through inherent randomization.

Clearly, these approaches can be combined (see below).
In this paper we compare our homogeneous ensemble methods (described in Section 4) with the following related ensembles.

  • Bagging [3] diversifies through sampling the training data by bootstrapping (sampling with replacement) for each member of the ensemble.
  • Random Subspace [20] ensembles select a random subset of attributes for each base classifier.
  • AdaBoost (Adaptive Boosting) [13] involves iteratively re-weighting the sampling distribution over the training data based on the training accuracy of the base classifiers at each iteration. The weights can then be either embedded into the classifier algorithm or used as a weighting in a cost function for classifier selection for inclusion.
  • Random Committee [11] is a technique that creates diversity through randomising the base classifiers, which are a form of random tree.
  • MultiBoost [45] is a combination of a boosting strategy (similar to AdaBoost) and wagging, a Poisson weighted form of bagging.
  • Random Forests [4] combine bootstrap sampling with random attribute selection to construct a collection of unpruned trees. At each node the optimal split is derived by searching a random subset of size K of candidate attributes, selected without replacement from the candidate attributes. Random Forest thus combines attribute sampling with bootstrap case sampling.
  • Rotation Forests [38] involve partitioning the attribute space then transforming into the principal components space. Each classifier is given the entire data set but trains on a different component space.

In order to maintain consistency across these techniques we use C4.5 decision trees as the base classifier for all the ensembles.

Forming a final classification from an ensemble requires some sort of fusion. We employ a majority vote fusion [27] with ties resolved randomly.
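Majority vote fusion with random tie-breaking can be sketched as follows (our own illustration; the report does not prescribe an implementation, and the function name and seeded generator are assumptions for reproducibility):

```python
import random
from collections import Counter

def majority_vote(predictions, rng=random.Random(0)):
    """Fuse base-classifier predictions for one test case by majority
    vote, resolving ties by a random choice among the tied classes."""
    counts = Counter(predictions)
    top = max(counts.values())
    # classes tied on the top vote count; sorted for determinism before sampling
    tied = sorted(c for c, n in counts.items() if n == top)
    return rng.choice(tied)
```

With an odd number of base classifiers and two classes, ties cannot occur; the random tie-break only matters for even ensemble sizes or multi-class problems.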
For alternative fusion schemes see [25]. Beyond simple accuracy comparison, there are three common approaches to analysing ensemble performance: diversity measures [28, 42]; margin theory [37, 33]; and BV decomposition [22, 43, 14, 5, 44, 2]. These have all been linked [42, 10].

2.3 Bias/Variance Decomposition

In this section we briefly describe BV decomposition using Domingos' framework [10]. This framework is applicable to any loss function, but for simplicity's sake we restrict ourselves to a two class classification problem with a 0/1 loss function. We label the two class values {C_1 = -1, C_2 = 1}. The generalisation error of a classifier is defined as the expected error for a given loss function over the entire attribute space. A loss function L(y, ŷ) measures how close the predicted value is to the actual value for any observation (x, y). The response variable Y will generally be stochastic, so for a two class problem the expected loss is defined as

    E_y[L(y, ŷ)] = p(Y = -1 | x) · L(-1, ŷ) + p(Y = 1 | x) · L(1, ŷ),

and the optimal prediction y* is the prediction that minimizes the expected loss. The optimal or Bayes classifier is one that minimizes the expected loss for all possible values of the attribute space, i.e. f(x) = y* for all x ∈ X. The expected loss over the attribute space of the Bayes classifier, E_x[E_y[L(y, y*)]], more commonly written E_{x,y}[L(y, y*)], is called the Bayes rate and is the lower bound for the error of any classifier.

In practice, classifiers are constructed with a finite data set, and the expected loss for any given instance will vary depending on which data set the classifier is given. Let D be a set of s training sets, D = {D_i}_{i=1}^{s}. The set of predictions for any element x is then Ŷ = {ŷ_i, i = 1, ..., s}, where ŷ_i is the prediction of the i-th classifier, defined on training data D_i, when given explanatory variables x.
We then denote the mode of Ŷ as the main prediction, ŷ. If we assume each data set is equally likely to have been observed, the expected loss over s data sets for a given instance x is simply the average over the data sets,

    E_{D,y}[L(y, ŷ)] = (1/s) Σ_{i=1}^{s} E_y[L(y, ŷ_i)].

The Domingos framework decomposes this expected loss into three terms: Bias, Variance and Noise. The Bias is defined as the loss of the main prediction in relation to the optimal prediction,

    B(x) = L(y*, ŷ).

Bias is caused by systematic errors in classification resulting from the algorithm not capturing the underlying complexity of the true decision boundary (i.e. underfitting). Variance describes the mean variation within the set of predictions about the main prediction for a given instance, i.e.

    V(x) = (1/s) Σ_{i=1}^{s} L(ŷ_i, ŷ),

and is the result of variability of the classification function caused by the finite training sample size, and the hence inevitable variation across training samples (overfitting). Noise is the unavoidable (and unmeasurable) component of the loss that is incurred independently of the learning algorithm. The Noise term is

    N(x) = E_y[L(y, y*)].

So for a single example, we can describe the expected loss as

    E_{D,y}[L(y, ŷ)] = N(x) + B(x) + c_2 · V(x),

where c_2 is +1 if B(x) = 0 and -1 if B(x) = 1. Bias and variance may be averaged over all examples, in which case Domingos calls them average bias, B = E_x[B(x)], average variance, V = E_x[V(x)], and average noise, N = E_x[N(x)]. The expected loss over all examples is the expected value of the expected loss, and can be decomposed as

    E_{D,y,x}[L(y, ŷ)] = N + B + c_2 · V.

Domingos shows that the net variance can be expressed as

    V_n = E_x[(1 - 2B(x)) · V(x)],

and that it can be further deconstructed into the biased variance V_b and the unbiased variance V_u.
V_u is the average variance within the set of classifier estimates where the main prediction is correct (B(x) = 0); V_b is the variance where the main prediction is incorrect. The net variance V_n is the difference between the unbiased and the biased variance, V_n = V_u - V_b. Hence, unbiased variance increases the net variance (and thus the generalisation error) whereas biased variance decreases the net variance.

The principal benefit of performing a Bias/Variance (BV) decomposition for an ensemble algorithm is to address the question of whether an observed reduction in the expected loss is due to a reduction in bias, a reduction in unbiased variance, an increase in biased variance or, more usually, a combination of these factors. Without unlimited data, these statistics are generally estimated through resampling. In Section 6 we describe our experimental design and perform a BV decomposition to assess the ensemble algorithms we propose in Section 4 in conjunction with the base classifier described in Section 3.

3 THE RANDOMISED SPHERE COVER CLASSIFIER (RSC)

The reason for designing the αRSC algorithm was to develop an instance based classifier to use in ensembles. Hence our design criteria were that it should be randomised (to allow for diversity), fast (to mitigate against the inevitable overhead of ensembles) and comprehensible (to help produce meaningful interpretations from the models produced). The αRSC algorithm has a single integer parameter, α, that specifies the minimum size for any sphere. Informally, αRSC works as follows.

Repeat until all data are covered or discarded:
  1) Randomly select a data point and add it to the set of covered cases.
  2) Create a new sphere centred at this point.
  3) Find the closest case in the training set of a different class to the one selected as a centre.
  4) Set the radius of the sphere to be the distance to this case.
  5) Find all cases in the training set within the radius of this sphere.
  6) If the number of cases in the sphere is greater than α, add all cases in the sphere to the set of covered cases and save the sphere details (centre, class and radius).

A more formal algorithmic description is given in Algorithm 1. For all our experiments we use the Euclidean distance metric, although the algorithm can work with any distance function. All attributes are normalised onto the range [0, 1]. The parameter α allows us to smooth the decision boundary, which has been shown to provide better generalisation by mitigating against noise and outliers (see, for example, [29]). Figure 1 provides an example of the smoothing effect of removing small spheres on the decision boundary.

Algorithm 1 buildRSC(D, d, α). A Randomised Sphere Cover Classifier (αRSC)
 1: Input: Cases D = {(x_1, y_1), ..., (x_n, y_n)}, distance function d(x_i, x_j), parameter α
 2: Output: Set of spheres B
 3: Let covered cases be set C = ∅
 4: Let uncovered cases be set U = ∅
 5: while D ≠ C ∪ U do
 6:   Select a random element (x_i, y_i) ∈ D \ C
 7:   Copy (x_i, y_i) to C
 8:   Find min_{(x_j, y_j) ∈ D} d(x_i, x_j) such that y_i ≠ y_j
 9:   Let r_i = d(x_i, x_j)
10:   Create a sphere B_i with centre c_i = x_i, radius r_i and target class y_i
11:   Find all the cases in B_i and store in temporary set T
12:   if |T| ≥ α then
13:     C = C ∪ T
14:     Store the sphere B_i in B
15:   else
16:     U = U ∪ T
17:   end if
18: end while

The αRSC algorithm classifies a new case by the following rules:

  1) Rule 1. A test example that is covered by a sphere takes the target class of the sphere. If there is more than one sphere of different target class covering the test example, the classifier takes the target class of the sphere with the closest centre.
  2) Rule 2.
In the case where a test example is not covered by a sphere, the classifier selects the class of the closest sphere edge.

A case covered by Rule 2 will generally be an outlier or at the boundary of the class distribution. Therefore, it may be preferable not to have spheres over-covering areas where such cases may occur. These areas are either close to the decision boundary, particularly where there is high overlap between classes (an illustration is given in Figure 1(a)), or areas where noisy cases lie within dense areas of examples of a different target class. The αRSC method of compressing through sphere covering and smoothing via boundary setting was first proposed in [49], and has been shown to provide a robust, simple classifier that is competitive with other commonly used classifiers [49]. In this paper we focus on the best way to use it as a base classifier for an ensemble.

4 ENSEMBLE METHODS FOR αRSC

4.1 A Simple Ensemble: αRSE

One of the basic design criteria for αRSC was to randomise the cover mechanism so that we could create diversity in an ensemble. Hence our first ensemble algorithm, αRSE, is simply a majority voting ensemble of αRSC classifiers. With all ensembles we denote the number of classifiers in the ensemble as L. We fix α for all members of the ensemble. Each classifier is built using Algorithm 1 on the entire training data. The basic question we experimentally assess is whether the inherent randomness of αRSC provides enough implicit diversity to make the ensemble robust.

Fig. 1. An example of the smoothing effect of removing small spheres: (a) a sphere cover with α = 1; (b) the same cover with α = 2.

4.2 A Resampling/Re-weighting Ensemble: αβRSE

The original motivation for RSC is the classifiers derived from the Class Cover Catch Digraph (CCCD) described in Section 2. These classifiers have two parameters, α and β.
The α parameter (minimum sphere size) is used to improve generalisation. The β parameter (the number of misclassified examples allowed within a sphere) is meant to filter outliers. In the CCCD, both the α and β parameters are chosen in advance. α can be set through cross validation. However, setting β is problematic: a global value of β is too arbitrary, and a local value for each sphere is impractical. We propose an automatic method for implicitly setting β iteratively.

We define the border cases of a sphere to be the closest data of the negative class in a given dataset. Border cases are the particular instances that halt the growth of a sphere, and are hence crucial in the construction of the αRSC classifier. Our design principle for diversification of the ensemble is then to iteratively remove some or all of the border cases during the process of ensemble construction. Informally, the algorithm proceeds as follows:

  1) Initialise the current training set D_1 to the whole set D.
  2) Build a base αRSC on the entire training set.
  3) Find the border cases for the classifier.
  4) Find the cases in the current training set that are uncovered by the classifier.
  5) Find the cases in the entire training set that are misclassified by the classifier.
  6) Set the next training set, D_2, equal to D_1.
  7) Remove border cases from D_2.
  8) Replace the border cases with a random sample (with replacement) taken from the list of border, uncovered and misclassified cases, and add them to D_2.
  9) Repeat the process for each of the L classifiers.

Algorithm 2 A Randomised Sphere Cover Ensemble (αβRSE)
Input: Cases D = {(x_1, y_1), ..., (x_n, y_n)}, distance function d(x_i, x_j), parameters α, L
Output: L random sphere cover classifiers B_1, ..., B_L
 1: D_1 = D
 2: for j = 1 to L do
 3:   B_j = buildRSC(D_j, d, α)
 4:   E = borderCases(B_j, D_j)
 5:   F = uncoveredCases(B_j, D_j)
 6:   G = misclassifiedCases(B_j, D)
 7:   H = E ∪ F ∪ G
 8:   D_{j+1} = D_j \ E
 9:   for m = 1 to |E| do
10:     c = randomSample(H)
11:     D_{j+1} = D_{j+1} ∪ {c}
12:   end for
13: end for

A formal description is given in Algorithm 2. New cases are classified by a majority vote of the L classifiers. The principal idea is that we re-weight the training data by removing border cases, thus facilitating spheres that are not pure on the original data, but continue to focus on the harder cases by inserting possible duplicates of border, uncovered or misclassified cases, thus implicitly re-weighting the training data. Data previously removed from the training data can be replaced if misclassified on the current iteration. This data driven iterative approach has strong analogies to constructive algorithms such as boosting.

4.3 A Random Subspace Ensemble: αRSSE

As outlined in Section 2.2, rather than resampling and/or re-weighting for ensemble members, an alternative approach to diversification is to present each base classifier with a different set of attributes with which to train. The Random Subspace Sphere Cover Ensemble (αRSSE) builds base classifiers using random subsets of attributes, sampled without replacement from the original full attribute set. Each base classifier has the same number of attributes, κ. The attributes used by a classifier are also stored, and the same set of attributes is used to classify a test example. The majority vote is again employed for the final hypothesis.

Fig. 2. An illustration showing a cover modification with the β parameter on a binary class toy dataset.

Algorithm 3 A Random Subspace Sphere Cover Ensemble (αRSSE)
Input: Cases D = {(x_1, y_1), ..., (x_n, y_n)}, distance function d(x_i, x_j), parameters α, L, κ
Output: L random sphere cover classifiers B_1, ..., B_L and associated attribute sets K_1, ..., K_L
 1: for j = 1 to L do
 2:   K_j = randomAttributes(D, κ)
 3:   D_j = filterAttributes(D, K_j)
 4:   B_j = buildRSC(D_j, d, α)
 5: end for

5 ACCURACY COMPARISONS

Our base classifier αRSC is a competitive classifier in its own right, achieving accuracy results comparable to C4.5, Naive Bayes, Naive Bayes Tree, K-Nearest Neighbour and the Non-Nested Generalised Hyper Rectangle classifiers [46]. We wish to compare the performance of αRSC based ensembles with equivalent tree based ensemble techniques. Our experimental aims are:

  1) To confirm that ensembling αRSC improves the performance of the base classifier (Section 5.2).
  2) To show that the RSC ensemble αβRSE performs better than tree based ensembles that utilise the whole feature space (Section 5.3).
  3) To demonstrate that the RSC ensemble αRSSE performs significantly better than all the subspace ensembles except Rotation Forest, which itself is not significantly better than αRSSE (Section 5.4).
  4) To consider, through a case study, whether αRSC ensembles outperform other subspace ensemble methods on classification problems with a high dimensional feature space (Section 5.5).

To assess the relative performance of the classifiers, we adopt the procedure described in [8], which is based on a two stage rank sum test. The first test, the Friedman test, is a non-parametric equivalent to ANOVA and tests the null hypothesis that the average rank of k classifiers on n data sets is the same, against the alternative that at least one classifier's mean rank is different. If the Friedman test results in a rejection of the null hypothesis (i.e.
we reject the hypothesis that all the mean ranks are the same), Demšar recommends a post-hoc pairwise Nemenyi test to discover where the differences lie. The performance of two classifiers is significantly different if the corresponding average ranks differ by at least the critical difference

    CD = q_α √(k(k + 1) / (6n)),

where k is the number of classifiers, n the number of problems and q_α is based on the studentised range statistic. The results of a post-hoc Nemenyi test are shown in the critical difference diagrams (as introduced in [8]). These graphs show the mean rank order of each algorithm on a linear scale, with bars indicating cliques, within which there is no significant difference in rank (see Figure 4 below for an example). Alternatively, if one of the classifiers can be considered a control, it is more powerful to test for a difference of mean rank between classifiers i and j based on a Bonferroni adjustment. Under the null hypothesis of no difference in mean rank between classifiers i and j, the statistic

    z = (r̄_i − r̄_j) / √(k(k + 1) / (6n))

follows a standard normal distribution. If we are performing (k − 1) pairwise comparisons with our control classifier, a Bonferroni adjustment simply divides the critical value α by the number of comparisons performed.
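The critical difference calculation can be sketched as follows (a minimal illustration with our own function name; q_α must be looked up from studentised range tables and is passed in by the caller):

```python
import math

def critical_difference(q_alpha, k, n):
    """Nemenyi critical difference: two classifiers differ significantly
    if their average ranks differ by at least this amount.
    q_alpha: critical value of the studentised range statistic;
    k: number of classifiers; n: number of data sets."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))
```

For example, with q_α = 2.459 (Demšar's tabulated 10% value for five classifiers), k = 5 and n = 16, this gives CD ≈ 1.375, the value quoted in Section 5.3.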
5.1 Data Sets

TABLE 1
Benchmark datasets used for the empirical evaluations

Dataset          Examples  Attributes  Classes
Abalone              4177           8        3
Waveform             5000          40        3
Satimage             6435          36        6
Ringnorm             7400          20        2
Twonorm              7400          20        2
Image                2310          18        2
German               1000          20        2
wdbc                  569          30        2
Yeast                1484           8       10
Diabetes              768           8        2
Ionosphere            351          34        2
Sonar                 208          60        2
Heart                 270          13        2
Cancer                315          13        2
Wisconsin             699           9        2
Ecoli                 336           7        8
Breast Cancer          97       24481        2
Prostate              136       12600        2
Lung Cancer           181       12533        2
Ovarian               253       15154        2
Colon Tumor            62        2000        2
Central Nervous        60        7129        2

To evaluate the performance of the ensembles we use sixteen datasets from the UCI data repository [11] and six benchmark gene expression datasets from [41]. These datasets are summarised in Table 1. They were selected because they vary in the numbers of training examples, classes and attributes, and thus provide a diverse testbed. In addition, they all have only continuous attributes, and this allows us to fix the distance measure for all experiments to Euclidean distance. All the features are normalised onto a [0, 1] scale. The first sixteen data sets are used for all classification experiments in Sections 5.3 and 5.4. The six gene expression data sets are used for the experiments presented in Section 5.5, to evaluate how the subspace based ensembles perform in conjunction with a feature selection filter on problems with a high dimensional feature space.

5.2 Base Classifier vs Ensemble

As a basic sanity check, we start by showing that the ensemble outperforms the base classifier by comparing αβRSE with 25 base classifiers against the average of 25 αRSC classifiers. Figure 3 shows the classification accuracy (measured through 10 fold cross validation) for four different datasets. The ensemble accuracies are better than those of the 25 averaged classifiers, and this pattern was consistent across all data sets. In addition, we notice both curves follow a similar evolution in relation to α.
That is, the α values that return the best classification accuracy for αβRSE are similar to those of a single classifier. This is the motivation for the model selection method we adopt in Section 5.3.

Fig. 3. Accuracy as a function of α on four data sets: (a) Waveform, (b) Twonorm, (c) Ringnorm, (d) Satimage. Each point is the ten fold cross validation accuracy of αβRSE with 25 classifiers and the average of 25 separate αRSC classifiers.

5.3 Full Feature Space Ensembles

Tables 2 and 3 show the classification accuracy of αRSE and αβRSE against that of AdaBoost, Bagging and MultiBoost trained with 25 and 100 base classifiers respectively. AdaBoost, Bagging and MultiBoost were used with the default settings for the decision tree and ensemble parameters, and were trained on the full training split. For αRSE and αβRSE, α was set through a quick form of model selection, using the optimal training set cross validation values of a single classifier. This form of quick, off-line model selection is possible because RSC is controlled by just a single parameter, and it has little impact on the overall time taken to build the ensemble classifier. As described in Section 4.2, the β parameter of αβRSE is set implicitly through the sampling scheme. The average ranks and rank order are given in the final two rows of Tables 2 and 3.
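The average ranks in the final rows of Tables 2 and 3 can be reproduced as follows. This is a sketch under our own naming: ranks are assigned per data set, with rank 1 for the highest accuracy and tied classifiers sharing the mean of their tied ranks, then averaged over data sets.

```python
def average_ranks(accuracy_table):
    """accuracy_table: list of rows, one per data set, each a list of
    accuracies (one per classifier). Returns the mean rank of each
    classifier, where rank 1 is best and ties share the mean rank."""
    n_clf = len(accuracy_table[0])
    totals = [0.0] * n_clf
    for row in accuracy_table:
        # sort classifier indices by accuracy, best first
        order = sorted(range(n_clf), key=lambda i: -row[i])
        ranks = [0.0] * n_clf
        pos = 0
        while pos < n_clf:
            # find the run of classifiers tied on this accuracy
            end = pos
            while end + 1 < n_clf and row[order[end + 1]] == row[order[pos]]:
                end += 1
            mean_rank = (pos + end) / 2.0 + 1.0   # ranks are 1-based
            for t in range(pos, end + 1):
                ranks[order[t]] = mean_rank
            pos = end + 1
        totals = [t + r for t, r in zip(totals, ranks)]
    return [t / len(accuracy_table) for t in totals]
```

For instance, two data sets on which three classifiers finish in opposite orders yield equal mean ranks of 2.0 for all three.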
The critical difference for a test of difference in average rank for 5 classifiers and 16 data sets at the 10% level is 1.375. We make the following observations from these results:
• Firstly, although αβRSE has the highest rank, we cannot reject the null hypothesis of no significant difference between the mean ranks of the classifiers. The performance of the simple majority vote ensemble αRSE is comparable to bagging with decision trees. This suggests that the base classifier αRSC inherently diversifies as much as bootstrapping decision trees, and lends support to using αRSC as a base classifier.
• Secondly, αβRSE outperforms αRSE on 12 out of 16 data sets (with 2 ties) with 25 base classifiers, and on 14 out of 16 with 100 base classifiers. If we were performing a single comparison between these two classifiers, the difference would be significant. Whilst the multiple classifier comparisons mean we cannot make this claim, the results do indicate that allowing some misclassification and guiding the sphere creation process through directed resampling does improve performance, and that a simple ensemble does not best utilise the base classifier.
• Thirdly, αβRSE has the highest average rank of the five algorithms, from which we infer that it performs at least comparably to Adaboost and Multiboost, and performs better than Bagging. These experiments demonstrate that the re-weighting based ensemble αβRSE is at least comparable to the widely used tree based sampling and/or re-weighting ensembles.

5.4 Subspace Ensemble Methods

Tables 4 and 5 show the classification accuracy of αRSSE against those of Rotation Forest, Random Subspace,

TABLE 2
Mean classification accuracy (in %) and standard deviation of αRSE, αβRSE, Adaboost, Bagging and Multiboost over 30 different runs on independent train/test splits with 25 base classifiers.
Data Set       αRSE          αβRSE         Adaboost      Bagging       MultiBoost
Abalone        54.25 ± 0.94  54.89 ± 1.02  52.30 ± 1.20  53.98 ± 0.91  53.04 ± 1.47
Waveform       90.40 ± 0.67  90.68 ± 0.65  89.60 ± 0.69  88.71 ± 0.58  89.63 ± 0.56
Satimage       90.90 ± 0.41  90.90 ± 0.41  91.21 ± 0.45  89.82 ± 0.69  90.94 ± 0.57
Ringnorm       96.71 ± 0.38  97.17 ± 0.30  97.26 ± 0.33  95.01 ± 0.50  97.12 ± 0.31
Twonorm        97.32 ± 0.26  97.41 ± 0.26  96.43 ± 0.32  95.58 ± 0.46  96.41 ± 0.37
Image          96.87 ± 0.50  96.87 ± 0.51  97.77 ± 0.64  95.78 ± 0.90  97.32 ± 0.75
German         73.21 ± 1.76  74.00 ± 1.69  74.52 ± 1.76  75.24 ± 1.36  75.09 ± 2.51
wdbc           93.21 ± 1.47  93.86 ± 1.52  96.79 ± 1.26  95.19 ± 1.38  96.61 ± 1.22
Yeast          56.34 ± 2.09  58.22 ± 1.24  58.23 ± 1.59  60.65 ± 1.57  58.65 ± 1.77
Diabetes       74.52 ± 1.78  75.01 ± 1.79  73.54 ± 1.88  75.94 ± 2.00  74.74 ± 2.34
Iono           93.48 ± 2.05  93.39 ± 2.25  92.85 ± 2.20  92.31 ± 2.60  93.25 ± 2.05
Sonar          84.67 ± 4.17  84.43 ± 3.66  81.38 ± 4.21  76.33 ± 5.66  80.76 ± 4.57
Heart          78.85 ± 3.60  80.74 ± 3.26  80.41 ± 3.11  81.26 ± 3.66  81.22 ± 2.87
Cancer         69.46 ± 2.97  70.07 ± 3.62  69.07 ± 4.36  73.44 ± 2.87  69.35 ± 4.71
Winsc          95.53 ± 1.34  95.67 ± 1.33  96.21 ± 0.84  96.01 ± 0.97  96.49 ± 0.71
Ecoli          85.36 ± 2.78  85.51 ± 2.64  83.07 ± 2.75  83.45 ± 3.58  83.45 ± 2.73
Average Ranks  3.31          2.50          3.13          3.28          2.78
Ranking        5             1             3             4             2

TABLE 3
Mean classification accuracy (in %) and standard deviation of αRSE, αβRSE, Adaboost, Bagging and Multiboost over 30 different runs on independent train/test splits with 100 base classifiers.
Data Set       αRSE          αβRSE         Adaboost      Bagging       MultiBoost
Abalone        54.36 ± 1.16  54.48 ± 1.23  52.82 ± 0.99  54.10 ± 0.91  54.22 ± 1.47
Waveform       90.56 ± 0.70  90.32 ± 0.66  90.27 ± 0.58  89.08 ± 0.84  90.20 ± 0.93
Satimage       90.91 ± 0.38  91.12 ± 0.44  92.00 ± 0.39  90.47 ± 0.55  91.11 ± 0.60
Ringnorm       96.88 ± 0.37  97.54 ± 0.31  97.75 ± 0.29  95.23 ± 0.52  97.05 ± 0.52
Twonorm        97.36 ± 0.28  97.49 ± 0.22  97.13 ± 0.26  96.35 ± 0.38  96.95 ± 0.27
Image          96.77 ± 0.50  96.80 ± 0.56  97.98 ± 0.56  96.23 ± 0.80  96.71 ± 0.34
German         73.23 ± 1.82  74.16 ± 1.58  74.46 ± 1.54  74.91 ± 1.85  74.70 ± 0.64
wdbc           93.39 ± 1.56  93.91 ± 1.57  96.91 ± 1.55  96.33 ± 1.35  96.47 ± 1.07
Yeast          57.26 ± 1.44  58.41 ± 1.36  58.13 ± 1.62  60.08 ± 1.56  59.57 ± 1.22
Diabetes       74.53 ± 1.84  75.04 ± 2.57  73.53 ± 2.20  75.68 ± 2.57  74.54 ± 1.28
Iono           93.56 ± 2.06  93.53 ± 1.96  92.99 ± 2.29  91.20 ± 3.01  92.39 ± 2.25
Sonar          84.86 ± 4.23  85.00 ± 3.72  82.71 ± 5.14  78.57 ± 5.86  82.71 ± 2.21
Heart          79.26 ± 3.40  80.67 ± 3.10  81.19 ± 2.88  81.56 ± 3.59  82.33 ± 4.20
Cancer         69.53 ± 3.29  69.58 ± 3.32  68.82 ± 5.07  73.19 ± 3.34  71.33 ± 3.51
Winsc          95.54 ± 1.33  95.71 ± 1.33  96.48 ± 0.88  96.09 ± 0.94  97.00 ± 4.31
Ecoli          85.54 ± 2.96  85.86 ± 2.65  83.07 ± 2.75  83.45 ± 3.58  84.82 ± 0.75
Average Ranks  3.38          2.38          3.03          3.44          2.78
Ranking        4             1             3             5             2

TABLE 4
Classification accuracy (in %) and standard deviation of αRSSE, Rotation Forest (RotF), Random Subspace (RandS), Random Forest (RandF) and Random Committee (RandC) using average results of 30 different runs on independent train/test splits with 25 base classifiers.

Data Set    αRSSE         RotF          RandS         RandF         RandC
Abalone     54.77 ± 1.28  55.56 ± 1.04  54.62 ± 1.09  54.05 ± 1.16  53.56 ± 1.19
Waveform    90.21 ± 0.51  90.72 ± 0.77  89.35 ± 0.73  89.51 ± 0.61  89.32 ± 0.61
Satimage    91.71 ± 0.47  91.03 ± 0.50  90.79 ± 0.54  90.80 ± 0.52  90.24 ± 0.44
Ringnorm    98.29 ± 0.26  97.57 ± 0.23  96.82 ± 0.35  95.49 ± 0.38  96.60 ± 0.30
Twonorm     97.03 ± 0.30  97.42 ± 0.27  95.88 ± 0.33  96.02 ± 0.37  96.18 ± 0.35
Image       97.39 ± 0.65  98.04 ± 0.51  96.42 ± 0.73  97.27 ± 0.63  96.08 ± 0.58
German      74.59 ± 1.47  76.26 ± 1.63  72.28 ± 1.53  74.85 ± 1.46  73.65 ± 1.77
wdbc        94.67 ± 1.33  96.40 ± 1.03  95.35 ± 1.31  95.30 ± 1.42  96.04 ± 1.26
Yeast       58.80 ± 1.90  61.06 ± 1.82  57.38 ± 2.45  58.96 ± 1.69  60.26 ± 1.75
Diabetes    76.17 ± 2.25  76.25 ± 2.30  74.48 ± 1.98  75.43 ± 1.92  74.78 ± 1.51
Iono        94.53 ± 1.79  93.50 ± 1.79  92.68 ± 2.40  93.05 ± 1.86  93.13 ± 2.33
Sonar       84.52 ± 4.49  82.86 ± 4.50  79.57 ± 5.24  81.00 ± 4.68  82.19 ± 3.99
Heart       82.74 ± 4.02  82.74 ± 3.32  83.30 ± 3.55  81.67 ± 3.17  81.00 ± 3.62
Cancer      76.27 ± 2.96  73.87 ± 3.29  74.73 ± 2.81  71.18 ± 3.74  70.93 ± 4.29
Winsc       97.21 ± 0.95  97.18 ± 0.83  96.35 ± 1.01  96.48 ± 0.72  97.00 ± 0.84
Ecoli       85.00 ± 2.07  87.41 ± 2.44  84.02 ± 3.13  85.33 ± 2.76  84.82 ± 2.62
Mean Ranks  2.09          1.53          4.00          3.50          3.88
Ranks       2             1             5             3             4

TABLE 5
Classification accuracy (in %) and standard deviation of αRSSE, Rotation Forest (RotF), Random Subspace (RandS), Random Forest (RandF) and Random Committee (RandC) using average results of 30 different runs on independent train/test splits with 100 base classifiers.

Data Set    αRSSE         RotF          RandS         RandF         RandC
Abalone     54.91 ± 0.98  56.04 ± 1.04  54.79 ± 1.02  54.47 ± 0.86  52.83 ± 0.95
Waveform    90.73 ± 0.53  91.07 ± 0.77  89.68 ± 0.62  89.97 ± 0.62  90.36 ± 0.63
Satimage    91.92 ± 0.54  91.70 ± 0.50  91.28 ± 0.55  91.59 ± 0.46  91.82 ± 0.46
Ringnorm    98.43 ± 0.27  97.77 ± 0.23  97.22 ± 0.35  95.66 ± 0.43  97.70 ± 0.26
Twonorm     97.39 ± 0.28  97.53 ± 0.27  96.24 ± 0.51  96.38 ± 0.50  97.22 ± 0.27
Image       97.83 ± 0.53  98.16 ± 0.51  96.78 ± 0.62  97.45 ± 0.62  97.93 ± 0.56
German      74.28 ± 1.56  75.69 ± 1.63  72.37 ± 1.06  75.63 ± 0.64  74.79 ± 1.86
wdbc        95.00 ± 1.44  96.75 ± 1.03  96.35 ± 1.49  96.95 ± 1.17  97.11 ± 1.32
Yeast       59.43 ± 1.93  61.65 ± 1.82  58.94 ± 1.84  60.03 ± 1.31  58.22 ± 1.57
Diabetes    76.25 ± 2.21  76.12 ± 2.30  74.84 ± 2.07  75.14 ± 2.04  74.00 ± 2.02
Iono        94.76 ± 1.68  94.19 ± 1.79  92.74 ± 1.80  92.39 ± 1.77  93.33 ± 1.94
Sonar       85.24 ± 5.39  84.43 ± 4.50  79.62 ± 5.62  82.05 ± 4.44  82.24 ± 4.63
Heart       84.00 ± 3.43  83.30 ± 3.15  83.41 ± 3.92  82.70 ± 3.35  81.22 ± 4.50
Cancer      76.16 ± 2.75  74.12 ± 3.29  75.30 ± 2.85  71.36 ± 4.41  68.82 ± 5.07
Winsc       97.42 ± 0.91  97.38 ± 0.83  96.60 ± 0.98  96.71 ± 0.90  96.47 ± 0.78
Ecoli       85.71 ± 2.36  87.41 ± 2.44  84.02 ± 3.13  85.33 ± 2.76  83.45 ± 2.73
Mean Ranks  1.94          1.69          4.06          3.50          3.81
Ranks       2             1             5             3             4

Random Committee and Random Forest ensembles of decision trees, based on 25 and 100 classifiers. As with αβRSE, the αRSSE parameters α and κ were set through cross validation on one third of the training set. The optimal value of κ was estimated first, then the best value of α found for that κ. The other ensembles were trained on the entire training set with default parameters. Figure 4 shows the Critical Difference diagram for the subspace methods with 25 base classifiers. There is a significant difference in average rank between the classifiers (the F statistic is 14.97, which gives a P value of less than 0.00001). This difference can be described by two clear cliques: Random Subspace, Random Committee and Random Forest are significantly outperformed by the clique of αRSSE and Rotation Forest.

Fig. 4. Critical difference diagram for 5 subspace ensembles on 16 data sets (mean ranks: RotFor 1.5313, αRSSE 2.0938, RandFor 3.5, RandComm 3.875, RandSub 4.0). Critical difference is 1.375.
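The F statistic quoted above is the Iman-Davenport correction of the Friedman statistic, computed directly from the mean ranks; a sketch of the calculation:

```python
def iman_davenport_f(mean_ranks, n_datasets):
    """Friedman statistic with the Iman-Davenport correction,
    for k classifiers compared over N data sets."""
    k = len(mean_ranks)
    n = n_datasets
    chi2 = (12.0 * n / (k * (k + 1))) * (
        sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4.0
    )
    return (n - 1) * chi2 / (n * (k - 1) - chi2)
```

With the mean ranks shown in Figure 4 (1.5313, 2.0938, 3.5, 3.875, 4.0) and N = 16 this returns roughly 14.97, matching the value reported above.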
So whilst Rotation Forest has a lower average rank than αRSSE on these data sets, the difference is not significant. We further note that the difference in performance between Rotation Forest and αRSSE reduces with an increase in the number of base classifiers. Table 6 shows the classification accuracy (calculated through 10CV) of αRSSE for various ensemble sizes, from 15 to 500 base classifiers. In general, ensembles perform better when the size of the ensemble is large. However, with many ensemble methods, increasing the ensemble size dramatically results in over training and hence lower testing accuracy. Table 6 demonstrates that the performance of αRSSE actually improves with over 100 base classifiers, indicating that αRSSE does not have a tendency to overfit data sets with large ensemble sizes. Figure 5 shows the combined critical difference diagram for all 10 ensembles. The increase in the number of ensembles means a much larger critical difference is required to detect a significant difference. However, a similar pattern of ranking is apparent.

Fig. 5. Critical difference diagram for 10 ensembles on 16 data sets (mean ranks: RotFor 2.0625, αRSSE 2.6875, RandFor 5.3438, αβRSE 5.6875, RandComm 5.875, MultiBoost 6.25, Adaboost 6.3125, RandSub 6.3125, αRSE 7.0625, Bagging 7.4063). Critical difference is 3.1257.

TABLE 6
αRSSE 10CV accuracy for ensemble sizes of 15 to 500.
                       Ensemble Size
Dataset     15     25     50     100    250    500
Waveform    89.87  90.38  90.72  90.85  91.21  90.97
Ringnorm    97.97  98.14  98.27  98.31  98.37  98.39
Twonorm     96.79  97.20  97.39  97.49  97.63  97.64
Image       97.44  97.80  97.92  97.87  98.01  98.03
German      74.77  75.43  75.52  75.47  75.52  75.66
wdbc        97.27  97.45  97.75  97.68  97.99  97.98
Yeast       59.02  59.79  59.56  59.58  59.86  59.94
Diabetes    76.89  76.95  76.96  77.03  77.21  76.96
Iono        95.09  95.37  95.23  95.11  95.46  95.43
Sonar       86.85  87.81  88.30  88.69  88.03  88.47
Heart       81.74  84.26  83.85  83.63  83.81  83.96
Cancer      75.54  75.88  76.05  77.06  76.93  77.30
Winsc       97.08  97.34  97.24  97.40  97.38  97.33
Ecoli       86.17  86.45  86.57  86.15  86.62  86.60

The no free lunch theorem [48] convinces us that there will not be a single dominant algorithm for all classification problems. Instance based approaches remain popular in a range of problem domains, particularly in research areas relating to image processing and databases. αβRSE and αRSSE offer instance based approaches to classification that are highly competitive with the best tree based subspace and non-subspace ensemble techniques. In the following section we propose a type of problem domain where we think αRSSE outperforms the tree based ensembles.

5.5 Gene Expression Classification Case Study: Subspace Ensemble Comparison

Gene expression profiling helps to identify the set of genes responsible for cancerous tissue. Gene expression data are generally characterised by a very large number of attributes and relatively few cases. Instance based learners such as k-NN often perform poorly in high dimensional attribute spaces. We demonstrate that the subspace ensemble αRSSE can overcome this inherent problem and in fact outperform the other ensemble techniques.
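The subspace mechanism being evaluated can be illustrated with a toy ensemble. The sketch below is not αRSSE itself (the base learner here is plain 1-NN rather than αRSC, and the class names are ours), but it shows the idea: each member is restricted to κ randomly selected attributes and predictions are combined by majority vote:

```python
import random
from collections import Counter

class SubspaceNNEnsemble:
    """Illustrative subspace ensemble with a 1-NN base classifier."""

    def __init__(self, n_members=25, kappa=5, seed=0):
        self.n_members, self.kappa = n_members, kappa
        self.rng = random.Random(seed)

    def fit(self, X, y):
        self.X, self.y = X, y
        n_attr = len(X[0])
        # each member sees its own random subset of kappa attributes
        self.subspaces = [
            self.rng.sample(range(n_attr), min(self.kappa, n_attr))
            for _ in range(self.n_members)
        ]
        return self

    def _nn_label(self, attrs, x):
        # 1-NN restricted to the member's attribute subset
        dist = lambda a: sum((a[i] - x[i]) ** 2 for i in attrs)
        return self.y[min(range(len(self.X)), key=lambda j: dist(self.X[j]))]

    def predict(self, x):
        votes = Counter(self._nn_label(attrs, x) for attrs in self.subspaces)
        return votes.most_common(1)[0][0]
```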
TABLE 7
The best test set accuracy (in %) of αRSSE (αR), Rotation Forest (RotF), Random Subspace (RandS), Random Forest (RandF), Adaboost (AB), Bagging (Bag) and MultiBoostAB (Multi) using average results of 30 different runs on χ². BC=Breast Cancer, CN=Central Nervous, CT=Colon Tumor, LC=Lung Cancer, OV=Ovarian and PR=Prostate.

Dataset  αR     RotF   RandS  RandF  AB     Bag    Multi
BC       82.93  79.60  76.26  80.91  79.19  78.99  78.79
CN       77.83  76.83  74.33  80.33  76.33  76.17  76.50
CT       85.87  86.19  83.49  84.13  82.38  83.65  82.86
LC       99.34  99.34  95.03  99.34  97.81  97.21  97.87
OV       99.18  99.80  97.88  98.98  97.73  97.84  97.73
PR       94.13  93.70  91.30  94.57  91.23  91.38  91.09
F-avg    1.75   2.10   5.83   1.92   5.58   5.00   5.58
F-ranks  1      3      7      2      5.5    4      5.5

TABLE 8
The best test set accuracy (in %) using average results of 30 different runs on Information Gain.

Dataset  αR     RotF   RandS  RandF  AB     Bag    Multi
BC       85.15  79.39  77.47  83.94  79.49  80.10  79.80
CN       79.17  76.50  73.50  80.00  75.67  76.17  76.00
CT       86.98  84.76  82.54  84.44  82.70  82.54  82.38
LC       99.34  99.34  94.75  99.34  97.76  97.16  97.81
OV       99.25  99.76  98.00  98.86  97.73  97.88  97.73
PR       93.77  93.48  91.74  93.62  91.09  92.32  90.80
F-avg    1.42   2.75   5.92   2.08   5.42   4.58   5.58
F-ranks  1      3      7      2      5      4      6

TABLE 9
The best test set accuracy (in %) using average results of 30 different runs on Relief.

Dataset  αR     RotF   RandS  RandF  AB     Bag    Multi
BC       80.20  79.19  72.42  78.18  73.74  74.85  73.23
CN       76.00  75.50  72.17  76.00  74.00  72.00  73.33
CT       83.65  84.76  80.63  83.33  79.37  83.17  79.68
LC       99.34  99.23  94.75  98.91  97.43  96.61  97.49
OV       98.43  99.37  98.04  98.90  97.61  97.69  97.61
PR       89.13  93.33  91.67  93.62  93.41  89.71  93.26
F-avg    2.58   2.00   5.67   2.25   4.92   5.33   5.25
F-ranks  3      1      7      2      4      6      5

TABLE 10
The best test set accuracy (in %) over the three attribute ranking methods.

Dataset  αR     RotF   RandS  RandF  Adaboost  Bagging  Multi
BC       84.04  79.60  77.47  83.94  79.49     80.10    79.80
CN       79.17  76.83  74.33  80.33  76.33     76.17    76.50
CT       86.98  86.19  83.49  84.44  82.70     83.65    82.86
LC       99.34  99.34  95.03  99.34  97.81     97.21    97.87
OV       99.18  99.76  98.00  98.98  97.73     97.88    97.73
PR       94.13  93.70  91.74  94.57  93.41     92.32    93.26
F-avg    1.58   2.58   6.17   1.92   5.58      5.00     4.92
F-ranks  1      3      7      2      6         5        4

Broadly speaking, there are three types of approach to problems with a large number of attributes [17]: employ a filter that uses a scoring method to rank the attributes independently of the classifier; use a wrapper to score subsets of attributes using the classifier to produce the model; or embed the attribute selection as part of the algorithm used to build the classifier [34]. We focus on three simple, commonly used filter measures, χ², Information Gain (IG) and Relief, which are used to select a fixed number of attributes by ranking each on how well it splits the training data in terms of the response variable. We compare αRSSE to Adaboost, Bagging, Random Committee, Multiboost, Random Subspace, Random Forest and Rotation Forest. Our methodology is to filter on the k = 5, 10, 20, 30, 40 and 50 best ranked attributes for each of the three ranking measures. Model selection for αRSSE is conducted as described in Section 5.3. All the ensembles use 100 classifiers. For Adaboost, Bagging and the base decision tree classifiers in the ensembles we use the default parameters. Tables 7, 8 and 9 show the relative performance of the ensemble classifiers at the best attribute filter setting for each of the filter techniques. We note that αRSSE is ranked highest overall when using χ² and Information Gain, and is ranked third with Relief. From this we infer that, when used in conjunction with filtering, αRSSE can overcome the inherent problem instance based learners have with high dimensional attribute spaces and produce results better than state of the art tree based ensemble classifiers.
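A minimal sketch of the filter step for one of the three measures, Information Gain, assuming the continuous attributes have already been discretised (the χ² and Relief scores would slot into the same ranking loop; function names are ours):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG of one (discretised) attribute with respect to the class labels."""
    n = len(labels)
    cond = 0.0
    for v in set(feature_values):
        subset = [l for fv, l in zip(feature_values, labels) if fv == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

def top_k_attributes(X, y, k):
    """Rank attributes by information gain and keep the k best (the filter step)."""
    scores = [information_gain([row[i] for row in X], y) for i in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```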
6 BIAS/VARIANCE ANALYSIS OF RSC ENSEMBLE TECHNIQUES

The purpose of our bias/variance analysis of the ensembles αβRSE and αRSSE is to identify whether the reduction in generalisation error in comparison to the base classifier is due to a reduction in bias, a reduction in unbiased variance or an increase in biased variance. We followed an experimental framework similar to that found in [44]. The standard experimental design for bias/variance decomposition is to estimate bias and variance using small training sets and large test sets. We used bootstrapping to sample eight of our datasets. The data is divided into a training set and a test set, with the test set being 1/3 of the entire set. 200 separate training bootstrap samples of size 200 were taken by uniformly sampling with replacement from the training set. We then compute the main prediction, the bias, both the unbiased and biased variance, and the net variance (as defined in Section 2.3) over the 200 test sets. Figure 6 shows both bias and variance in relation to κ (the number of attributes used in each classifier for αRSSE) for four of the datasets. We observe a strong relationship between averaged error and bias for small κ, but as κ increases, variance contributes a larger component of the error. Increasing κ appears to have a greater influence on unbiased variance reduction than on biased variance. To compare αRSC, αβRSE and αRSSE, we performed the bias/variance experiment on the three classifiers with the optimal set of parameters (determined experimentally). We conclude from these results that αβRSE, in most cases, reduces the net variance in comparison with a single classifier because of a decrease in unbiased variance. However, the picture is not as straightforward for bias.
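The decomposition described above can be sketched in the style of Domingos [10] for 0/1 loss, computed from a matrix of bootstrap predictions (the function and variable names are ours):

```python
from collections import Counter

def bv_decompose(preds, truth):
    """0/1-loss bias/variance decomposition over bootstrap predictions.
    preds[m][i]: prediction of model m on test point i; truth[i]: true label.
    Returns (bias, net_variance, unbiased_variance, biased_variance)."""
    n_models, n_points = len(preds), len(truth)
    bias = vu = vb = 0.0
    for i in range(n_points):
        column = [preds[m][i] for m in range(n_models)]
        main = Counter(column).most_common(1)[0][0]      # main prediction
        var = sum(p != main for p in column) / n_models  # P(prediction != main)
        if main == truth[i]:
            vu += var        # disagreement hurts where the main prediction is right
        else:
            bias += 1.0
            vb += var        # disagreement helps where the main prediction is wrong
    return (bias / n_points, (vu - vb) / n_points,
            vu / n_points, vb / n_points)
```

For a binary problem with deterministic labels, average error equals bias plus net variance (unbiased minus biased variance) under this decomposition.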
It might be that bias reduction depends on the geometrical complexity of the sample [21] (complex structures require complex decision boundaries), the chosen value of the pruning parameter α, and the interaction between α and β. In that case, finding a method that systematically reduces bias while keeping unbiased variance low would further reduce the ensemble average error. Table 11 shows the bias/variance decomposition of αRSSE, αβRSE and αRSC. We make the following observations from these results: 1) the average error of αRSSE and αβRSE is lower than that of αRSC for all the problems; 2) for αβRSE, this is more commonly a result of a reduction in net variance rather than a reduction in bias; 3) for αRSSE, whilst bias is reduced, we also see a more consistent reduction in variance. These experiments reinforce our preconception as to the effectiveness of the ensembles: αβRSE introduces further diversity into the ensemble by allowing misclassified instances within the spheres. The major effect of this is to reduce the variance of the resulting classifier. On the other hand, the subspace ensemble reduces the inherent bias commonly observed when instance based classifiers are used in conjunction with a Euclidean distance metric: redundant attributes result in overfitting.

7 CONCLUSION

We have described an instance based classifier, αRSC, that has several interesting properties that can be exploited in ensemble design. We described three different ensemble methods with which it can be used, and demonstrated that the resulting ensembles are competitive with the best tree based ensemble techniques on a wide range of standard data sets. We further investigated the reasons for the improvement in performance of the ensembles relative to the base classifier using bias/variance decomposition.
For the ensemble based on resampling (αβRSE), accuracy was increased primarily by a reduction in variance. Hence we conclude that the diversity introduced via the proposed technique is mostly beneficial, and that the resulting ensemble classifier is more robust. We also demonstrated through bias/variance decomposition that the subspace ensemble αRSSE improves performance primarily through a decrease in bias. An obvious next step would be to embed the resampling technique within the random subspace ensemble. However, we found that employing the β mechanism in the subspace ensemble did not make a significant difference to the αRSSE ensemble.

Fig. 6. Bias/Variance decomposition of the αRSSE classifier as a function of the number of attributes: (a) average error and bias for Diabetes; (b) variance decomposition for Diabetes; (c) average error and bias for Heart; (d) variance decomposition for Heart; (e) average error and bias for Image; (f) variance decomposition for Image; (g) average error and bias for Waveform; (h) variance decomposition for Waveform.
This implies that attribute selection is the most important stage in ensembling αRSC, other than model selection by setting α. This has led us to investigate embedding attribute selection (rather than randomisation) into the ensemble, with promising preliminary results. We believe that αRSC is a useful addition to the family of instance based learners since it is easy to understand, quick to train and test, and can effectively be employed in ensembles to achieve classification accuracy comparable to the most popular ensemble methods.

TABLE 11
Comparing bias/variance of αRSC, αβRSE and αRSSE. (Var. unb.) and (Var. bias.) stand for unbiased and biased variance. (Diff) stands for the percentage difference between the algorithms. An up arrow ↑ denotes an increase and a down arrow ↓ a decrease.

Dataset                        Avg Error  Bias     Net Var  Var. Unb.  Var. Bias.
Waveform
(1) αRSC, α = 11               0.1387     0.0961   0.0426   0.0722     0.0296
(2) αβRSE, α = 10              0.1223     0.0976   0.0247   0.0500     0.0254
(3) αRSSE, α = 2, κ = 11       0.1141     0.0906   0.0235   0.0472     0.0237
Diff (1) vs (2) %              ↓ 11.82    ↑ 1.56   ↓ 42.01  ↓ 30.74    ↓ 14.18
Diff (1) vs (3) %              ↓ 17.73    ↓ 5.72   ↓ 44.83  ↓ 34.62    ↓ 19.93
Diabetes
(1) αRSC, α = 3                0.2780     0.2367   0.0413   0.1006     0.0594
(2) αβRSE, α = 3               0.2685     0.2359   0.0326   0.0847     0.0521
(3) αRSSE, α = 2, κ = 5        0.2603     0.2332   0.0271   0.0741     0.0469
Diff (1) vs (2) %              ↓ 3.41     ↓ 0.33   ↓ 21.06  ↓ 15.80    ↓ 12.29
Diff (1) vs (3) %              ↓ 6.37     ↓ 1.48   ↓ 34.38  ↓ 26.34    ↓ 21.04
Heart
(1) αRSC, α = 7                0.2138     0.1667   0.0471   0.0872     0.0400
(2) αβRSE, α = 10              0.1896     0.1756   0.0140   0.0431     0.0290
(3) αRSSE, α = 2, κ = 5        0.1814     0.1533   0.0281   0.0568     0.0287
Diff (1) vs (2) %              ↓ 11.31    ↑ 5.33   ↓ 70.27  ↓ 50.57    ↓ 27.50
Diff (1) vs (3) %              ↓ 15.15    ↓ 8.04   ↓ 40.34  ↓ 34.86    ↓ 28.25
wdbc
(1) αRSC, α = 8                0.0898     0.0784   0.0114   0.0275     0.0161
(2) αβRSE, α = 2               0.0771     0.0663   0.0108   0.0255     0.0147
(3) αRSSE, α = 0, κ = 13       0.0698     0.0553   0.0145   0.0258     0.0112
Diff (1) vs (2) %              ↓ 14.14    ↓ 15.43  ↓ 5.26   ↓ 7.27     ↓ 8.69
Diff (1) vs (3) %              ↓ 22.27    ↓ 29.46  ↑ 27.19  ↓ 6.18     ↓ 30.43
Image
(1) αRSC, α = 0                0.1184     0.0650   0.0534   0.0759     0.0225
(2) αβRSE, α = 0               0.1050     0.0665   0.0385   0.0603     0.0218
(3) αRSSE, α = 0, κ = 10       0.0873     0.0495   0.0378   0.0541     0.0163
Diff (1) vs (2) %              ↓ 11.31    ↑ 2.30   ↓ 27.90  ↓ 20.55    ↓ 3.11
Diff (1) vs (3) %              ↓ 26.26    ↓ 23.84  ↓ 29.21  ↓ 28.72    ↓ 27.55
Twonorm
(1) αRSC, α = 10               0.0515     0.0222   0.0293   0.0366     0.0073
(2) αβRSE, α = 10              0.0345     0.0224   0.0121   0.0179     0.0058
(3) αRSSE, α = 2, κ = 13       0.0328     0.0225   0.0103   0.0159     0.0057
Diff (1) vs (2) %              ↓ 33.01    ↑ 0.90   ↓ 58.70  ↓ 51.09    ↓ 20.54
Diff (1) vs (3) %              ↓ 36.31    ↑ 1.35   ↓ 64.84  ↓ 56.55    ↓ 21.91
Ringnorm
(1) αRSC, α = 0                0.1183     0.0596   0.0587   0.0783     0.0783
(2) αβRSE, α = 0               0.0527     0.0208   0.0320   0.0377     0.0058
(3) αRSSE, α = 0, κ = 10       0.0288     0.0167   0.0121   0.0166     0.0045
Diff (1) vs (2) %              ↓ 55.45    ↓ 65.10  ↓ 45.48  ↓ 51.85    ↓ 70.40
Diff (1) vs (3) %              ↓ 75.65    ↓ 71.97  ↓ 79.38  ↓ 78.79    ↓ 94.25

REFERENCES

[1] D. Aha, D. Kibler and M.K. Albert: Instance-based learning algorithms, Machine Learning, vol. 6, no. 1, pp. 37-66, 1991.
[2] E. Bauer and R. Kohavi: An empirical comparison of voting classification algorithms: bagging, boosting, and variants, Machine Learning, vol. 36, no. 1-2, pp. 105-139, 1999.
[3] L. Breiman: Bagging predictors, Machine Learning, vol. 24, no. 2, pp. 123-140, 1996.
[4] L. Breiman: Random forests, Machine Learning, vol. 45, no. 1, pp. 5-32, 2001.
[5] L. Breiman: Bias, variance, and arcing classifiers, Statistics Department, Berkeley, technical report, no. 460, 1996.
[6] A. Cannon and L.J. Cowen: Approximation algorithms for the class cover problem, Annals of Mathematics and Artificial Intelligence, vol. 40, no. 3-4, pp. 215-223, 2004.
[7] A. Cannon, J. Mark Ettinger, D.
Hush and C. Scovel: Machine learning with data dependent hypothesis classes, Journal of Machine Learning Research, vol. 2, pp. 335-358, 2002.
[8] J. Demšar: Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research, vol. 7, pp. 1-30, 2006.
[9] T.G. Dietterich: An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization, Machine Learning, vol. 40, no. 2, pp. 139-157, 2000.
[10] P. Domingos: A unified bias-variance decomposition for zero-one and squared loss, Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence, pp. 564-569, 2000.
[11] A. Frank and A. Asuncion: UCI Machine Learning Repository, University of California, Irvine, School of Information and Computer Sciences, http://archive.ics.uci.edu/ml, 2010.
[12] S. Floyd and M.K. Warmuth: Sample compression, learnability, and the Vapnik-Chervonenkis dimension, Machine Learning, vol. 21, no. 5, pp. 269-304, 1995.
[13] Y. Freund and R.E. Schapire: Experiments with a new boosting algorithm, ICML, pp. 148-156, 1996.
[14] J.H. Friedman and U. Fayyad: On bias, variance, 0/1-loss, and the curse-of-dimensionality, Data Mining and Knowledge Discovery, vol. 1, pp. 55-77, 1997.
[15] P. Geurts, D. Ernst and L. Wehenkel: Extremely randomised trees, Machine Learning, vol. 63, no. 1, pp. 3-42, 2006.
[16] Y. Grandvalet, S. Canu and S. Boucheron: Noise injection: theoretical prospects, Neural Computation, vol. 9, no. 5, pp. 1093-1108, 1997.
[17] I. Guyon and A. Elisseeff: An introduction to variable and feature selection, Journal of Machine Learning Research, vol. 3, pp. 1157-1182, 2003.
[18] L.K. Hansen and P.
Salamon: Neural network ensembles, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, no. 10, pp. 993-1001, 1990.
[19] R. Herbrich: Learning Kernel Classifiers: Theory and Algorithms, MIT Press, Cambridge, MA, USA, 2002.
[20] T.K. Ho: The random subspace method for constructing decision forests, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pp. 832-844, 1998.
[21] T.K. Ho: Geometrical complexity of classification problems, Proceedings of the 7th Course on Ensemble Methods for Learning Machines at the International School on Neural Nets, 2004.
[22] G. James: Variance and bias for general loss functions, Machine Learning, vol. 51, pp. 115-135, 2003.
[23] S.W. Kim and B.J. Oommen: A brief taxonomy and ranking of creative prototype reduction schemes, Pattern Analysis and Applications, vol. 6, no. 3, pp. 232-244, 2003.
[24] L.I. Kuncheva: That elusive diversity in classifier ensembles, Lecture Notes in Computer Science, pp. 1126-1138, 2003.
[25] L.I. Kuncheva: A theoretical study on six classifier fusion strategies, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 2, pp. 281-286, 2002.
[26] L.I. Kuncheva and J.C. Bezdek: Presupervised and postsupervised prototype classifier design, IEEE Transactions on Neural Networks, vol. 10, no. 5, pp. 1142-1152, 1999.
[27] L.I. Kuncheva, C. Whitaker, C. Shipp and R.P.W. Duin: Limits on the majority vote accuracy in classifier fusion, Pattern Analysis and Applications, vol. 6, no. 1, pp. 22-31, 2003.
[28] L.I. Kuncheva: Diversity in multiple classifier systems, Information Fusion, vol. 6, no. 1, pp. 3-4, 2005.
[29] H. Liu and H. Motoda: On issues of instance selection, Data Mining and Knowledge Discovery, vol. 6, no. 2, pp. 115-130, 2002.
[30] D.J. Marchette: Random Graphs for Statistical Pattern Recognition, Wiley-Interscience, 2004.
[31] D.J. Marchette and C.E.
Priebe: Characterizing the scale dimension of a high-dimensional classification problem, Pattern Recognition, vol. 36, no. 1, pp. 45-60, 2003.
[32] D.J. Marchette, E.J. Wegman and C.E. Priebe: A fast algorithm for approximating the dominating set of a class cover catch digraph, technical report, JHU DMS TR 635, 2003.
[33] R. Meir and G. Rätsch: An introduction to boosting and leveraging, Machine Learning Summer School, Springer, pp. 118-183, 2002.
[34] L.C. Molina, L. Belanche and A. Nebot: Feature selection algorithms: a survey and experimental evaluation, IEEE International Conference on Data Mining, pp. 306-313, 2002.
[35] D. Opitz and R. Maclin: Popular ensemble methods: an empirical study, Journal of Artificial Intelligence Research, vol. 11, pp. 169-198, 1999.
[36] C.E. Priebe, J.G. DeVinney, D.J. Marchette and D.A. Socolinsky: Classification using class cover catch digraphs, Journal of Classification, vol. 20, no. 1, pp. 3-23, 2003.
[37] G. Raetsch, T. Onoda and K.R. Mueller: Soft margins for AdaBoost, Machine Learning, vol. 42, no. 3, pp. 287-320, 2001.
[38] J.J. Rodriguez, L.I. Kuncheva and C.J. Alonso: Rotation forest: a new classifier ensemble method, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 10, pp. 1619-1630, 2006.
[39] R.E. Schapire: Theoretical views of boosting and applications, International Workshop on Algorithmic Learning Theory, pp. 13-25, 1999.
[40] M. Marchand and J. Shawe-Taylor: The set covering machine, Journal of Machine Learning Research, vol. 3, pp. 723-746, 2002.
[41] G. Stiglic and P. Kokol: GEMLeR: Gene Expression Machine Learning Repository. Available at: http://gemler.fzv.uni-mb.si/.
[42] E.K. Tang, P.N. Suganthan and X. Yao: An analysis of diversity measures, Machine Learning, vol. 65, no. 1, pp. 247-271, 2006.
[43] R.
Tibshirani: Bias, variance and prediction error for classification rules, University of Toronto, Dept. of Statistics, Toronto, 1996.
[44] G. Valentini and T.G. Dietterich: Bias-variance analysis of support vector machines for the development of SVM-based ensemble methods, Journal of Machine Learning Research, vol. 5, pp. 725-775, 2004.
[45] G.I. Webb: Multiboosting: a technique for combining boosting and wagging, Machine Learning, vol. 40, pp. 159-196, 2000.
[46] D. Wettschereck: A hybrid nearest-neighbor and nearest-hyperrectangle algorithm, Proceedings of the European Conference on Machine Learning, vol. 784, pp. 323-338, 1994.
[47] D. Wilson and T.R. Martinez: Reduction techniques for instance-based learning algorithms, Machine Learning, vol. 38, pp. 257-286, 2000.
[48] D.H. Wolpert and W.G. Macready: No free lunch theorems for optimization, IEEE Transactions on Evolutionary Computation, vol. 1, no. 1, pp. 67-82, 1997.
[49] R. Younsi and A. Bagnall: An efficient randomised sphere cover classifier, International Journal of Data Mining, Modelling and Management, vol. 4, no. 2, pp. 156-171, 2012.
[50] R. Younsi: Investigating Randomised Sphere Covers in Supervised Learning, PhD thesis, University of East Anglia, 2011.