Ensembles of Random Sphere Cover Classifiers
We propose and evaluate alternative ensemble schemes for a new instance based learning classifier, the Randomised Sphere Cover (RSC) classifier. RSC fuses instances into spheres, then bases classification on distance to spheres rather than distance to instances. The randomised nature of RSC makes it ideal for use in ensembles. We propose two ensemble methods tailored to the RSC classifier: αβRSE, an ensemble based on instance resampling, and αRSSE, a subspace ensemble. We compare αβRSE and αRSSE to tree based ensembles on a set of UCI datasets and demonstrate that RSC ensembles perform significantly better than some of these ensembles, and not significantly worse than the others. We demonstrate via a case study on six gene expression data sets that αRSSE can outperform other subspace ensemble methods on high dimensional data when used in conjunction with an attribute filter. Finally, we perform a set of Bias/Variance decomposition experiments to analyse the source of improvement in comparison to a base classifier.
Authors: Anthony Bagnall, Reda Younsi
UNIVERSITY OF EAST ANGLIA COMPUTER SCIENCE TECHNICAL REPORT CMPC14-05

Index Terms: Sphere classifier, ensemble

1 INTRODUCTION

We propose and evaluate alternative ensemble schemes for a simple instance based learning classifier, the Randomised Sphere Cover (RSC) classifier, first introduced in [49]. RSC creates spheres around a subset of instances from the training data, then bases classification on distance to spheres, rather than distance to instances. Nearest neighbour (NN) based classifiers remain popular in a wide range of fields such as image processing. One of their strengths lies in the fact that they are robust to changes in the training data.
However, this feature of NN classifiers means that there is less observable benefit (in terms of error reduction) of using them in conjunction with resampling ensemble schemes such as bagging [3]. RSC aims to overcome this problem by using a randomised heuristic to select a subset of instances to represent the spheres used in classification. RSC can be seen as a form of data reduction, and hence scales well for large data sets. Data reduction algorithms [47, 26, 23] search the training data for a subset of cases and/or attributes with which to classify new instances, to achieve the maximum compression with the minimum reduction in accuracy. RSC can be described by the compression scheme method [12]. The compression scheme has been proposed to explain the generalisation performance of sparse algorithms. In general, algorithms are called sparse because they keep a subset of the training set as part of their learning process. A large number of algorithms fall in this category, such as Support Vector Machines (SVM), the Perceptron algorithm and KNN [19]. Recently, the compression scheme was rejuvenated to explore an algorithm similar to RSC, the set covering machine (SCM), proposed by Shawe-Taylor [40]. Younsi [50] examined the relationships between α, the accuracy and the cardinality of the sphere cover classifier using existing probabilistic bounds based on the compression scheme. Although it is clear that sphere cover accuracy is synonymous with covering, the compression scheme has shown that degradation in accuracy is only possible by heavily pruning spheres. This suggests that the sphere cover classifier is indeed a strong candidate for exploring the accuracy/diversity dilemma found in ensemble design [24, 38, 42].

(A. Bagnall is with the School of Computing Sciences, University of East Anglia, Norwich, Norfolk, United Kingdom. E-mail: ajb@uea.ac.uk)
The process that creates the spheres for RSC is controlled by two parameters: α, the minimum number of cases a sphere must contain in order to be retained as part of the classifier; and β, the number of misclassified instances a sphere can contain. We investigate how these parameters can be utilised to diversify the ensemble. We propose two ensemble methods tailored to the RSC classifier: αβRSE, an ensemble based on resampling, and αRSSE, a subspace ensemble. We demonstrate that the resulting ensemble classifiers are at least comparable to, and often better than, state of the art ensemble techniques. We perform a case study on six high dimensional gene expression data sets to demonstrate that αRSSE works well with attribute filters and that it outperforms other subspace ensemble methods on these data sets. Finally, we perform a set of Bias/Variance (BV) decomposition experiments to analyse the source of improvement in comparison to a base classifier.

The structure of the rest of this paper is as follows. In Section 2 we provide the background motivation for the RSC classifier, an overview of the relevant ensemble literature and a brief summary of Domingos' BV decomposition technique [10]. In Section 3 we formally describe the RSC classifier and in Section 4 we define our two ensemble schemes. In Section 5 we present the results and in Section 7 we summarise our conclusions.

2 BACKGROUND

A classifier constructs a decision rule based on a set of l training examples D = {(x_i, y_i)}_{i=1}^{l}, where x_i represents a vector of observations of m explanatory variables associated with the i-th case, and y_i indicates the class to which the i-th example belongs. We call the range of all possible values of the explanatory variables X, and the range of the discrete response variable Y = {C_1, C_2, ..., C_r}.
We assume a dissimilarity measure d is defined on X: a function d : X × X → R^+ such that for all x_1, x_2 ∈ X, d(x_1, x_1) = 0 and d(x_1, x_2) = d(x_2, x_1) ≥ 0. A classifier f : X → Y, f(x) = ŷ, is a function from the attribute space to the response variable space.

2.1 Sphere Cover Classifiers

The sphere covering mechanism we use stems from the class covering approach to classification, first introduced in [6]. A sphere B_i is associated with a particular class C_{B_i}, and is defined by a centre c_i and radius r_i. In practice we also include in the sphere definition all the instances within its boundary. Hence, a sphere is defined by a 4-tuple B_i = ⟨C_{B_i}, c_i, r_i, X_{B_i}⟩, where

    X_{B_i} = {x ∈ D : d(x, c_i) < r_i}.

The centre of the sphere is the vector of the means of the attributes of the cases contained within. The radius of the sphere B_i is defined as the distance from the centre to the closest example from a class other than C_{B_i} that is not in X_{B_i}, i.e.

    r_i = min_{x_j ∈ X \ X_{B_i}, y_j ≠ C_{B_i}} d(x_j, c_i),

where X = {x ∈ D}. A union of spheres is called a cover. A cover that contains all of the examples in D is called proper, and one consisting of spheres that only contain examples of one class is said to be pure. The class cover problem (CCP) involves finding a pure and proper cover that has the minimum number of spheres of all possible pure and proper covers. The solution to the CCP proposed in [36] involves constructing a Class Cover Catch Digraph (CCCD), a directed graph based on the proximity of training cases. However, finding the optimal covering via the CCCD is NP-hard [7]. Hence [31, 30] proposed a number of greedy algorithms to find an approximately optimal set covering. However, these algorithms are still slow and only find pure covers. The constraint of pure and proper covers will tend to lead to a classifier that overfits the training data.
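The radius and covered-set definitions above can be sketched as follows. This is a minimal illustration under our own naming (the array layout and the function `sphere_radius_and_cover` are not from the report); it computes, for one chosen centre, the distance to the closest case of a different class and the indices of the cases strictly inside the resulting sphere.

```python
import numpy as np

def sphere_radius_and_cover(X, y, centre_idx):
    """Given training data (X, y) and the index of a chosen centre,
    return the sphere radius (distance to the closest case of a
    different class) and the indices of the cases it covers."""
    centre, target = X[centre_idx], y[centre_idx]
    dists = np.linalg.norm(X - centre, axis=1)   # Euclidean distances to the centre
    other = y != target                          # cases of a different class
    radius = dists[other].min()                  # the closest such case sets the radius
    covered = np.where(dists < radius)[0]        # cases strictly inside the sphere
    return radius, covered
```

Note that, following the definition, the boundary case itself is never covered, since membership uses a strict inequality.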
An algorithm that relaxes the requirement of class purity was proposed by [36]. This algorithm introduces two parameters to alleviate the constraint of requiring a pure proper cover. The parameter α relaxes the proper requirement by only allowing spheres that contain at least α cases to be retained in the classifier. The parameter β reduces the purity constraint by allowing a sphere to contain β cases of the wrong class. The authors admit the resulting algorithms are infeasible for large data, and hence (to the best of our knowledge) there has been very limited experimental evaluation of this and other CCP based classifiers. Furthermore, the resulting classifiers are very sensitive to the parameters. In particular, β, if constant for all spheres, is too crude a mechanism for relaxing the purity constraint. In Section 3 we describe an ensemble base classifier derived from the CCP algorithm proposed in [32] that is randomised (rather than constructive) and retains just the single parameter, α.

2.2 Ensemble Methods

An ensemble of classifiers is a set of base classifiers whose individual decisions are combined through some process of fusion to classify new examples [33, 9]. One key concept in ensemble design is the requirement to inject diversity into the ensemble [9, 39, 35, 15, 16, 18]. Broadly speaking, diversity can be achieved in an ensemble by:

  • employing different classification algorithms to train each base classifier, to form a heterogeneous ensemble;
  • changing the training data for each base classifier through a sampling scheme or by directed weighting of instances;
  • selecting different attributes to train each classifier;
  • modifying each classifier internally, either through re-weighting the training data or through inherent randomization.

Clearly, these approaches can be combined (see below).
In this paper we compare our homogeneous ensemble methods (described in Section 4) with the following related ensembles.

  • Bagging [3] diversifies through sampling the training data by bootstrapping (sampling with replacement) for each member of the ensemble.
  • Random Subspace [20] ensembles select a random subset of attributes for each base classifier.
  • AdaBoost (Adaptive Boosting) [13] involves iteratively re-weighting the sampling distribution over the training data based on the training accuracy of the base classifiers at each iteration. The weights can then be either embedded into the classifier algorithm or used as a weighting in a cost function for classifier selection for inclusion.
  • Random Committee [11] is a technique that creates diversity through randomising the base classifiers, which are a form of random tree.
  • MultiBoost [45] is a combination of a boosting strategy (similar to AdaBoost) and wagging, a Poisson weighted form of bagging.
  • Random Forests [4] combine bootstrap sampling with random attribute selection to construct a collection of unpruned trees. At each node the optimal split is derived by searching a random subset of size K of candidate attributes, selected without replacement from the candidate attributes. Random Forest thus combines attribute sampling with bootstrap case sampling.
  • Rotation Forests [38] involve partitioning the attribute space then transforming into the principal components space. Each classifier is given the entire data set but trains on a different component space.

In order to maintain consistency across these techniques we use C4.5 decision trees as the base classifier for all the ensembles.

Forming a final classification from an ensemble requires some sort of fusion. We employ a majority vote fusion [27] with ties resolved randomly.
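Majority vote fusion with random tie-breaking can be sketched as follows (our own illustration; the report does not prescribe an implementation, and the function name and seeded generator are assumptions for reproducibility):

```python
import random
from collections import Counter

def majority_vote(predictions, rng=random.Random(0)):
    """Fuse base-classifier predictions for one test case by majority
    vote, resolving ties by a random choice among the tied classes."""
    counts = Counter(predictions)
    top = max(counts.values())
    # classes tied on the top vote count; sorted for determinism before sampling
    tied = sorted(c for c, n in counts.items() if n == top)
    return rng.choice(tied)
```

With an odd number of base classifiers and two classes, ties cannot occur; the random tie-break only matters for even ensemble sizes or multi-class problems.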
For alternative fusion schemes see [25]. Beyond simple accuracy comparison, there are three common approaches to analysing ensemble performance: diversity measures [28, 42]; margin theory [37, 33]; and BV decomposition [22, 43, 14, 5, 44, 2]. These have all been linked [42, 10].

2.3 Bias/Variance Decomposition

In this section we briefly describe BV decomposition using Domingos' framework [10]. This framework is applicable to any loss function, but for simplicity's sake we restrict ourselves to a two class classification problem with a 0/1 loss function. We label the two class values {C_1 = -1, C_2 = 1}. The generalisation error of a classifier is defined as the expected error for a given loss function over the entire attribute space. A loss function L(y, ŷ) measures how close the predicted value is to the actual value for any observation (x, y). The response variable Y will generally be stochastic, so for a two class problem the expected loss is defined as

    E_y[L(y, ŷ)] = p(Y = -1 | x) · L(-1, ŷ) + p(Y = 1 | x) · L(1, ŷ),

and the optimal prediction y* is the prediction that minimizes the expected loss. The optimal or Bayes classifier is one that minimizes the expected loss for all possible values of the attribute space, i.e. f(x) = y* for all x ∈ X. The expected loss over the attribute space of the Bayes classifier, E_x[E_y[L(y, y*)]], more commonly written E_{x,y}[L(y, y*)], is called the Bayes rate and is the lower bound for the error of any classifier.

In practice, classifiers are constructed with a finite data set, and the expected loss for any given instance will vary depending on which data set the classifier is given. Let D be a set of s training sets, D = {D_i}_{i=1}^{s}. The set of predictions for any element x is then Ŷ = {ŷ_i, i = 1, ..., s}, where ŷ_i is the prediction of the i-th classifier, defined on training data D_i, when given explanatory variables x.
We then denote the mode of Ŷ as the main prediction, ŷ. If we assume each data set is equally likely to have been observed, the expected loss over s data sets for a given instance x is simply the average over the data sets,

    E_{D,y}[L(y, ŷ)] = (1/s) Σ_{i=1}^{s} E_y[L(y, ŷ_i)].

The Domingos framework decomposes this expected loss into three terms: Bias, Variance and Noise. The Bias is defined as the loss of the main prediction in relation to the optimal prediction,

    B(x) = L(y*, ŷ).

Bias is caused by systematic errors in classification resulting from the algorithm not capturing the underlying complexity of the true decision boundary (i.e. underfitting). Variance describes the mean variation within the set of predictions about the main prediction for a given instance, i.e.

    V(x) = (1/s) Σ_{i=1}^{s} L(ŷ_i, ŷ),

and is the result of variability of the classification function caused by the finite training sample size, and the hence inevitable variation across training samples (overfitting). Noise is the unavoidable (and unmeasurable) component of the loss that is incurred independently of the learning algorithm. The Noise term is

    N(x) = E_y[L(y, y*)].

So for a single example, we can describe the expected loss as

    E_{D,y}[L(y, ŷ)] = N(x) + B(x) + c_2 · V(x),

where c_2 is +1 if B(x) = 0 and -1 if B(x) = 1. Bias and variance may be averaged over all examples, in which case Domingos calls them average bias, B = E_x[B(x)], average variance, V = E_x[V(x)], and average noise, N = E_x[N(x)]. The expected loss over all examples is the expected value of the expected loss, and can be decomposed as

    E_{D,y,x}[L(y, ŷ)] = N + B + c_2 · V.

Domingos shows that the net variance can be expressed as

    V_n = E_x[(1 - 2B(x)) · V(x)],

and that it can be further deconstructed into the biased variance V_b and the unbiased variance V_u.
V_u is the average variance within the set of classifier estimates where the main prediction is correct (B(x) = 0); V_b is the variance where the main prediction is incorrect. The net variance V_n is the difference between the unbiased and the biased variance, V_n = V_u - V_b. Hence, unbiased variance increases the net variance (and thus the generalisation error) whereas biased variance decreases the net variance.

The principal benefit of performing a Bias/Variance (BV) decomposition for an ensemble algorithm is to address the question of whether an observed reduction in the expected loss is due to a reduction in bias, a reduction in unbiased variance, an increase in biased variance or, more usually, a combination of these factors. Without unlimited data, these statistics are generally estimated through resampling. In Section 6 we describe our experimental design and perform a BV decomposition to assess the ensemble algorithms we propose in Section 4 in conjunction with the base classifier described in Section 3.

3 THE RANDOMISED SPHERE COVER CLASSIFIER (RSC)

The reason for designing the αRSC algorithm was to develop an instance based classifier to use in ensembles. Hence our design criteria were that it should be randomised (to allow for diversity), fast (to mitigate against the inevitable overhead of ensembles) and comprehensible (to help produce meaningful interpretations from the models produced). The αRSC algorithm has a single integer parameter, α, that specifies the minimum size for any sphere. Informally, αRSC works as follows.

Repeat until all data are covered or discarded:
  1) Randomly select a data point and add it to the set of covered cases.
  2) Create a new sphere centred at this point.
  3) Find the closest case in the training set of a different class to the one selected as a centre.
  4) Set the radius of the sphere to be the distance to this case.
  5) Find all cases in the training set within the radius of this sphere.
  6) If the number of cases in the sphere is greater than α, add all cases in the sphere to the set of covered cases and save the sphere details (centre, class and radius).

A more formal algorithmic description is given in Algorithm 1. For all our experiments we use the Euclidean distance metric, although the algorithm can work with any distance function. All attributes are normalised onto the range [0, 1]. The parameter α allows us to smooth the decision boundary, which has been shown to provide better generalisation by mitigating against noise and outliers (see, for example, [29]). Figure 1 provides an example of the smoothing effect of removing small spheres on the decision boundary.

Algorithm 1 buildRSC(D, d, α). A Randomised Sphere Cover Classifier (αRSC)
 1: Input: Cases D = {(x_1, y_1), ..., (x_n, y_n)}, distance function d(x_i, x_j), parameter α
 2: Output: Set of spheres B
 3: Let covered cases be set C = ∅
 4: Let uncovered cases be set U = ∅
 5: while D ≠ C ∪ U do
 6:   Select a random element (x_i, y_i) ∈ D \ C
 7:   Copy (x_i, y_i) to C
 8:   Find min_{(x_j, y_j) ∈ D} d(x_i, x_j) such that y_i ≠ y_j
 9:   Let r_i = d(x_i, x_j)
10:   Create a sphere B_i with centre c_i = x_i, radius r_i and target class y_i
11:   Find all the cases in B_i and store in temporary set T
12:   if |T| ≥ α then
13:     C = C ∪ T
14:     Store the sphere B_i in B
15:   else
16:     U = U ∪ T
17:   end if
18: end while

The αRSC algorithm classifies a new case by the following rules:

  1) Rule 1. A test example that is covered by a sphere takes the target class of the sphere. If there is more than one sphere of different target class covering the test example, the classifier takes the target class of the sphere with the closest centre.
  2) Rule 2.
In the case where a test example is not covered by a sphere, the classifier selects the class of the closest sphere edge.

A case covered by Rule 2 will generally be an outlier or at the boundary of the class distribution. Therefore, it may be preferable not to have spheres over-covering areas where such cases may occur. These areas are either close to the decision boundary, particularly where there is high overlap between classes (an illustration is given in Figure 1(a)), or areas where noisy cases lie within dense areas of examples of a different target class. The αRSC method of compressing through sphere covering and smoothing via boundary setting was first proposed in [49], and has been shown to provide a robust, simple classifier that is competitive with other commonly used classifiers [49]. In this paper we focus on the best way to use it as a base classifier for an ensemble.

4 ENSEMBLE METHODS FOR αRSC

4.1 A Simple Ensemble: αRSE

One of the basic design criteria for αRSC was to randomise the cover mechanism so that we could create diversity in an ensemble. Hence our first ensemble algorithm, αRSE, is simply a majority voting ensemble of αRSC classifiers. With all ensembles we denote the number of classifiers in the ensemble as L. We fix α for all members of the ensemble. Each classifier is built using Algorithm 1 on the entire training data. The basic question we experimentally assess is whether the inherent randomness of αRSC provides enough implicit diversity to make the ensemble robust.

Fig. 1. An example of the smoothing effect of removing small spheres: (a) a sphere cover with α = 1; (b) the same cover with α = 2.

4.2 A Resampling/Re-weighting Ensemble: αβRSE

The original motivation for RSC is the classifiers derived from the Class Cover Catch Digraph (CCCD) described in Section 2. These classifiers have two parameters, α and β.
The α parameter (minimum sphere size) is used to improve generalisation. The β parameter (the number of misclassified examples allowed within a sphere) is meant to filter outliers. In the CCCD, both the α and β parameters are chosen in advance. α can be set through cross validation. However, setting β is problematic: a global value of β is too arbitrary, and a local value for each sphere is impractical. We propose an automatic method for implicitly setting β iteratively.

We define the border cases of a sphere to be the closest data of the negative class in a given dataset. Border cases are the particular instances that halt the growth of a sphere, and are hence crucial in the construction of the αRSC classifier. Our design principle for diversification of the ensemble is then to iteratively remove some or all of the border cases during the process of ensemble construction. Informally, the algorithm proceeds as follows:

  1) Initialise the current training set D_1 to the whole set D.
  2) Build a base αRSC on the entire training set.
  3) Find the border cases for the classifier.
  4) Find the cases in the current training set that are uncovered by the classifier.
  5) Find the cases in the entire training set that are misclassified by the classifier.
  6) Set the next training set, D_2, equal to D_1.
  7) Remove border cases from D_2.
  8) Replace the border cases with a random sample (with replacement) taken from the list of border, uncovered and misclassified cases, and add them to D_2.
  9) Repeat the process for each of the L classifiers.

Algorithm 2 A Randomised Sphere Cover Ensemble (αβRSE)
Input: Cases D = {(x_1, y_1), ..., (x_n, y_n)}, distance function d(x_i, x_j), parameters α, L
Output: L random sphere cover classifiers B_1, ..., B_L
 1: D_1 = D
 2: for j = 1 to L do
 3:   B_j = buildRSC(D_j, d, α)
 4:   E = borderCases(B_j, D_j)
 5:   F = uncoveredCases(B_j, D_j)
 6:   G = misclassifiedCases(B_j, D)
 7:   H = E ∪ F ∪ G
 8:   D_{j+1} = D_j \ E
 9:   for m = 1 to |E| do
10:     c = randomSample(H)
11:     D_{j+1} = D_{j+1} ∪ {c}
12:   end for
13: end for

A formal description is given in Algorithm 2. New cases are classified by a majority vote of the L classifiers. The principal idea is that we re-weight the training data by removing border cases, thus facilitating spheres that are not pure on the original data, but continue to focus on the harder cases by inserting possible duplicates of border, uncovered or misclassified cases, thus implicitly re-weighting the training data. Data previously removed from the training data can be replaced if misclassified on the current iteration. This data driven iterative approach has strong analogies to constructive algorithms such as boosting.

4.3 A Random Subspace Ensemble: αRSSE

As outlined in Section 2.2, rather than resampling and/or re-weighting for ensemble members, an alternative approach to diversification is to present each base classifier with a different set of attributes with which to train. The Random Subspace Sphere Cover Ensemble (αRSSE) builds base classifiers using random subsets of attributes, sampled without replacement from the original full attribute set. Each base classifier has the same number of attributes, κ. The attributes used by a classifier are also stored, and the same set of attributes is used to classify a test example. The majority vote is again employed for the final hypothesis.

Fig. 2. An illustration showing a cover modification with the β parameter on a binary class toy dataset.

Algorithm 3 A Random Subspace Sphere Cover Ensemble (αRSSE)
Input: Cases D = {(x_1, y_1), ..., (x_n, y_n)}, distance function d(x_i, x_j), parameters α, L, κ
Output: L random sphere cover classifiers B_1, ..., B_L and associated attribute sets K_1, ..., K_L
 1: for j = 1 to L do
 2:   K_j = randomAttributes(D, κ)
 3:   D_j = filterAttributes(D, K_j)
 4:   B_j = buildRSC(D_j, d, α)
 5: end for

5 ACCURACY COMPARISONS

Our base classifier αRSC is a competitive classifier in its own right, achieving accuracy results comparable to C4.5, Naive Bayes, Naive Bayes Tree, K-Nearest Neighbour and the Non-Nested Generalised Hyper Rectangle classifiers [46]. We wish to compare the performance of αRSC based ensembles with equivalent tree based ensemble techniques. Our experimental aims are:

  1) To confirm that ensembling αRSC improves the performance of the base classifier (Section 5.2).
  2) To show that the RSC ensemble αβRSE performs better than tree based ensembles that utilise the whole feature space (Section 5.3).
  3) To demonstrate that the RSC ensemble αRSSE performs significantly better than all the subspace ensembles except Rotation Forest, which itself is not significantly better than αRSSE (Section 5.4).
  4) To consider, through a case study, whether αRSC ensembles outperform other subspace ensemble methods on classification problems with a high dimensional feature space (Section 5.5).

To assess the relative performance of the classifiers, we adopt the procedure described in [8], which is based on a two stage rank sum test. The first test, the Friedman test, is a non-parametric equivalent to ANOVA and tests the null hypothesis that the average rank of k classifiers on n data sets is the same, against the alternative that at least one classifier's mean rank is different. If the Friedman test results in a rejection of the null hypothesis (i.e.
we reject the hypothesis that all the mean ranks are the same), Demšar recommends a post-hoc pairwise Nemenyi test to discover where the differences lie. The performance of two classifiers is significantly different if the corresponding average ranks differ by at least the critical difference

    CD = q_α √(k(k + 1) / (6n)),

where k is the number of classifiers, n the number of problems and q_α is based on the studentised range statistic. The results of a post-hoc Nemenyi test are shown in the critical difference diagrams (as introduced in [8]). These graphs show the mean rank order of each algorithm on a linear scale, with bars indicating cliques, within which there is no significant difference in rank (see Figure 4 below for an example). Alternatively, if one of the classifiers can be considered a control, it is more powerful to test for a difference of mean rank between classifiers i and j based on a Bonferroni adjustment. Under the null hypothesis of no difference in mean rank between classifiers i and j, the statistic

    z = (r̄_i − r̄_j) / √(k(k + 1) / (6n))

follows a standard normal distribution. If we are performing (k − 1) pairwise comparisons with our control classifier, a Bonferroni adjustment simply divides the critical value α by the number of comparisons performed.
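The critical difference calculation can be sketched as follows (a minimal illustration with our own function name; q_α must be looked up from studentised range tables and is passed in by the caller):

```python
import math

def critical_difference(q_alpha, k, n):
    """Nemenyi critical difference: two classifiers differ significantly
    if their average ranks differ by at least this amount.
    q_alpha: critical value of the studentised range statistic;
    k: number of classifiers; n: number of data sets."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))
```

For example, with q_α = 2.459 (Demšar's tabulated 10% value for five classifiers), k = 5 and n = 16, this gives CD ≈ 1.375, the value quoted in Section 5.3.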
5.1 Data Sets

TABLE 1
Benchmark datasets used for the empirical evaluations

Dataset          Examples  Attributes  Classes
Abalone              4177           8        3
Waveform             5000          40        3
Satimage             6435          36        6
Ringnorm             7400          20        2
Twonorm              7400          20        2
Image                2310          18        2
German               1000          20        2
wdbc                  569          30        2
Yeast                1484           8       10
Diabetes              768           8        2
Ionosphere            351          34        2
Sonar                 208          60        2
Heart                 270          13        2
Cancer                315          13        2
Wisconsin             699           9        2
Ecoli                 336           7        8
Breast Cancer          97       24481        2
Prostate              136       12600        2
Lung Cancer           181       12533        2
Ovarian               253       15154        2
Colon Tumor            62        2000        2
Central Nervous        60        7129        2

To evaluate the performance of the ensembles we use sixteen datasets from the UCI data repository [11] and six benchmark gene expression datasets from [41]. These datasets are summarised in Table 1. They were selected because they vary in the numbers of training examples, classes and attributes, and thus provide a diverse testbed. In addition, they all have only continuous attributes, and this allows us to fix the distance measure for all experiments to Euclidean distance. All the features are normalised onto a [0, 1] scale. The first sixteen data sets are used for all classification experiments in Sections 5.3 and 5.4. The six gene expression data sets are used for the experiments presented in Section 5.5, to evaluate how the subspace based ensembles perform in conjunction with a feature selection filter on problems with a high dimensional feature space.

5.2 Base Classifier vs Ensemble

As a basic sanity check, we start by showing that the ensemble outperforms the base classifier by comparing αβRSE with 25 base classifiers against the average of 25 αRSC classifiers. Figure 3 shows the classification accuracy (measured through 10 fold cross validation) for four different datasets. The ensemble accuracies are better than those of the 25 averaged classifiers, and this pattern was consistent across all data sets. In addition, we notice both curves follow a similar evolution in relation to α.
That is, the α values that return the best classification accuracy for αβRSE are similar to those of a single classifier. This is the motivation for the model selection method we adopt in Section 5.3.

Fig. 3. Accuracy as a function of α on four data sets: (a) Waveform, (b) Twonorm, (c) Ringnorm, (d) Satimage. Each point is the ten fold cross validation accuracy of αβRSE with 25 classifiers and the average of 25 separate αRSC classifiers.

5.3 Full Feature Space Ensembles

Tables 2 and 3 show the classification accuracy of αRSE and αβRSE against that of AdaBoost, Bagging and MultiBoost trained with 25 and 100 base classifiers respectively. AdaBoost, Bagging and MultiBoost were used with the default settings for the decision tree and ensemble parameters, and were trained on the full training split. For αRSE and αβRSE, α was set through a quick form of model selection, using the optimal training set cross validation values of a single classifier. This form of quick, off-line model selection is possible because RSC is controlled by just a single parameter, and it has little impact on the overall time taken to build the ensemble classifier. As described in Section 4.2, the β parameter of αβRSE is set implicitly through the sampling scheme. The average ranks and rank order are given in the final two rows of Tables 2 and 3.
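The average ranks in the final rows of Tables 2 and 3 can be reproduced as follows. This is a sketch under our own naming: ranks are assigned per data set, with rank 1 for the highest accuracy and tied classifiers sharing the mean of their tied ranks, then averaged over data sets.

```python
def average_ranks(accuracy_table):
    """accuracy_table: list of rows, one per data set, each a list of
    accuracies (one per classifier). Returns the mean rank of each
    classifier, where rank 1 is best and ties share the mean rank."""
    n_clf = len(accuracy_table[0])
    totals = [0.0] * n_clf
    for row in accuracy_table:
        # sort classifier indices by accuracy, best first
        order = sorted(range(n_clf), key=lambda i: -row[i])
        ranks = [0.0] * n_clf
        pos = 0
        while pos < n_clf:
            # find the run of classifiers tied on this accuracy
            end = pos
            while end + 1 < n_clf and row[order[end + 1]] == row[order[pos]]:
                end += 1
            mean_rank = (pos + end) / 2.0 + 1.0   # ranks are 1-based
            for t in range(pos, end + 1):
                ranks[order[t]] = mean_rank
            pos = end + 1
        totals = [t + r for t, r in zip(totals, ranks)]
    return [t / len(accuracy_table) for t in totals]
```

For instance, two data sets on which three classifiers finish in opposite orders yield equal mean ranks of 2.0 for all three.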
The critical difference for a test of difference in average rank for 5 classifiers and 16 data sets at the 10% level is 1.375. We make the following observations from these results:
• Firstly, although αβRSE has the highest rank, we cannot reject the null hypothesis of no significant difference between the mean ranks of the classifiers. The performance of the simple majority vote ensemble αRSE is comparable to bagging with decision trees. This suggests that the base classifier αRSC inherently diversifies as much as bootstrapping decision trees, and lends support to using αRSC as a base classifier.
• Secondly, αβRSE outperforms αRSE on 12 out of 16 data sets (with 2 ties) with 25 base classifiers, and on 14 out of 16 with 100 base classifiers. If we were performing a single comparison between these two classifiers, the difference would be significant. Whilst the multiple classifier comparisons mean we cannot make this claim, the results do indicate that allowing some misclassification and guiding the sphere creation process through directed resampling does improve performance, and that a simple ensemble does not best utilise the base classifier.
• Thirdly, αβRSE has the highest average rank of the five algorithms, from which we infer that it performs at least comparably to Adaboost and Multiboost, and performs better than Bagging. These experiments demonstrate that the re-weighting based ensemble αβRSE is at least comparable to the widely used tree based sampling and/or re-weighting ensembles.

5.4 Subspace Ensemble Methods

Tables 4 and 5 show the classification accuracy of αRSSE against those of Rotation Forest, Random Subspace,

TABLE 2
Mean classification accuracy (in %) and standard deviation of αRSE, αβRSE, Adaboost, Bagging and Multiboost over 30 different runs on independent train/test splits with 25 base classifiers.
Data Set       αRSE          αβRSE         Adaboost      Bagging       MultiBoost
Abalone        54.25 ± 0.94  54.89 ± 1.02  52.30 ± 1.20  53.98 ± 0.91  53.04 ± 1.47
Waveform       90.40 ± 0.67  90.68 ± 0.65  89.60 ± 0.69  88.71 ± 0.58  89.63 ± 0.56
Satimage       90.90 ± 0.41  90.90 ± 0.41  91.21 ± 0.45  89.82 ± 0.69  90.94 ± 0.57
Ringnorm       96.71 ± 0.38  97.17 ± 0.30  97.26 ± 0.33  95.01 ± 0.50  97.12 ± 0.31
Twonorm        97.32 ± 0.26  97.41 ± 0.26  96.43 ± 0.32  95.58 ± 0.46  96.41 ± 0.37
Image          96.87 ± 0.50  96.87 ± 0.51  97.77 ± 0.64  95.78 ± 0.90  97.32 ± 0.75
German         73.21 ± 1.76  74.00 ± 1.69  74.52 ± 1.76  75.24 ± 1.36  75.09 ± 2.51
wdbc           93.21 ± 1.47  93.86 ± 1.52  96.79 ± 1.26  95.19 ± 1.38  96.61 ± 1.22
Yeast          56.34 ± 2.09  58.22 ± 1.24  58.23 ± 1.59  60.65 ± 1.57  58.65 ± 1.77
Diabetes       74.52 ± 1.78  75.01 ± 1.79  73.54 ± 1.88  75.94 ± 2.00  74.74 ± 2.34
Iono           93.48 ± 2.05  93.39 ± 2.25  92.85 ± 2.20  92.31 ± 2.60  93.25 ± 2.05
Sonar          84.67 ± 4.17  84.43 ± 3.66  81.38 ± 4.21  76.33 ± 5.66  80.76 ± 4.57
Heart          78.85 ± 3.60  80.74 ± 3.26  80.41 ± 3.11  81.26 ± 3.66  81.22 ± 2.87
Cancer         69.46 ± 2.97  70.07 ± 3.62  69.07 ± 4.36  73.44 ± 2.87  69.35 ± 4.71
Winsc          95.53 ± 1.34  95.67 ± 1.33  96.21 ± 0.84  96.01 ± 0.97  96.49 ± 0.71
Ecoli          85.36 ± 2.78  85.51 ± 2.64  83.07 ± 2.75  83.45 ± 3.58  83.45 ± 2.73
Average Ranks  3.31          2.50          3.13          3.28          2.78
Ranking        5             1             3             4             2

TABLE 3
Mean classification accuracy (in %) and standard deviation of αRSE, αβRSE, Adaboost, Bagging and Multiboost over 30 different runs on independent train/test splits with 100 base classifiers.
Data Set       αRSE          αβRSE         Adaboost      Bagging       MultiBoost
Abalone        54.36 ± 1.16  54.48 ± 1.23  52.82 ± 0.99  54.10 ± 0.91  54.22 ± 1.47
Waveform       90.56 ± 0.70  90.32 ± 0.66  90.27 ± 0.58  89.08 ± 0.84  90.20 ± 0.93
Satimage       90.91 ± 0.38  91.12 ± 0.44  92.00 ± 0.39  90.47 ± 0.55  91.11 ± 0.60
Ringnorm       96.88 ± 0.37  97.54 ± 0.31  97.75 ± 0.29  95.23 ± 0.52  97.05 ± 0.52
Twonorm        97.36 ± 0.28  97.49 ± 0.22  97.13 ± 0.26  96.35 ± 0.38  96.95 ± 0.27
Image          96.77 ± 0.50  96.80 ± 0.56  97.98 ± 0.56  96.23 ± 0.80  96.71 ± 0.34
German         73.23 ± 1.82  74.16 ± 1.58  74.46 ± 1.54  74.91 ± 1.85  74.70 ± 0.64
wdbc           93.39 ± 1.56  93.91 ± 1.57  96.91 ± 1.55  96.33 ± 1.35  96.47 ± 1.07
Yeast          57.26 ± 1.44  58.41 ± 1.36  58.13 ± 1.62  60.08 ± 1.56  59.57 ± 1.22
Diabetes       74.53 ± 1.84  75.04 ± 2.57  73.53 ± 2.20  75.68 ± 2.57  74.54 ± 1.28
Iono           93.56 ± 2.06  93.53 ± 1.96  92.99 ± 2.29  91.20 ± 3.01  92.39 ± 2.25
Sonar          84.86 ± 4.23  85.00 ± 3.72  82.71 ± 5.14  78.57 ± 5.86  82.71 ± 2.21
Heart          79.26 ± 3.40  80.67 ± 3.10  81.19 ± 2.88  81.56 ± 3.59  82.33 ± 4.20
Cancer         69.53 ± 3.29  69.58 ± 3.32  68.82 ± 5.07  73.19 ± 3.34  71.33 ± 3.51
Winsc          95.54 ± 1.33  95.71 ± 1.33  96.48 ± 0.88  96.09 ± 0.94  97.00 ± 4.31
Ecoli          85.54 ± 2.96  85.86 ± 2.65  83.07 ± 2.75  83.45 ± 3.58  84.82 ± 0.75
Average Ranks  3.38          2.38          3.03          3.44          2.78
Ranking        4             1             3             5             2

TABLE 4
Classification accuracy (in %) and standard deviation of αRSSE, Rotation Forest (RotF), Random Subspace (RandS), Random Forest (RandF) and Random Committee (RandC) using average results of 30 different runs on independent train/test splits with 25 base classifiers.

Data Set    αRSSE         RotF          RandS         RandF         RandC
Abalone     54.77 ± 1.28  55.56 ± 1.04  54.62 ± 1.09  54.05 ± 1.16  53.56 ± 1.19
Waveform    90.21 ± 0.51  90.72 ± 0.77  89.35 ± 0.73  89.51 ± 0.61  89.32 ± 0.61
Satimage    91.71 ± 0.47  91.03 ± 0.50  90.79 ± 0.54  90.80 ± 0.52  90.24 ± 0.44
Ringnorm    98.29 ± 0.26  97.57 ± 0.23  96.82 ± 0.35  95.49 ± 0.38  96.60 ± 0.30
Twonorm     97.03 ± 0.30  97.42 ± 0.27  95.88 ± 0.33  96.02 ± 0.37  96.18 ± 0.35
Image       97.39 ± 0.65  98.04 ± 0.51  96.42 ± 0.73  97.27 ± 0.63  96.08 ± 0.58
German      74.59 ± 1.47  76.26 ± 1.63  72.28 ± 1.53  74.85 ± 1.46  73.65 ± 1.77
wdbc        94.67 ± 1.33  96.40 ± 1.03  95.35 ± 1.31  95.30 ± 1.42  96.04 ± 1.26
Yeast       58.80 ± 1.90  61.06 ± 1.82  57.38 ± 2.45  58.96 ± 1.69  60.26 ± 1.75
Diabetes    76.17 ± 2.25  76.25 ± 2.30  74.48 ± 1.98  75.43 ± 1.92  74.78 ± 1.51
Iono        94.53 ± 1.79  93.50 ± 1.79  92.68 ± 2.40  93.05 ± 1.86  93.13 ± 2.33
Sonar       84.52 ± 4.49  82.86 ± 4.50  79.57 ± 5.24  81.00 ± 4.68  82.19 ± 3.99
Heart       82.74 ± 4.02  82.74 ± 3.32  83.30 ± 3.55  81.67 ± 3.17  81.00 ± 3.62
Cancer      76.27 ± 2.96  73.87 ± 3.29  74.73 ± 2.81  71.18 ± 3.74  70.93 ± 4.29
Winsc       97.21 ± 0.95  97.18 ± 0.83  96.35 ± 1.01  96.48 ± 0.72  97.00 ± 0.84
Ecoli       85.00 ± 2.07  87.41 ± 2.44  84.02 ± 3.13  85.33 ± 2.76  84.82 ± 2.62
Mean Ranks  2.09          1.53          4.00          3.50          3.88
Ranks       2             1             5             3             4

TABLE 5
Classification accuracy (in %) and standard deviation of αRSSE, Rotation Forest (RotF), Random Subspace (RandS), Random Forest (RandF) and Random Committee (RandC) using average results of 30 different runs on independent train/test splits with 100 base classifiers.

Data Set    αRSSE         RotF          RandS         RandF         RandC
Abalone     54.91 ± 0.98  56.04 ± 1.04  54.79 ± 1.02  54.47 ± 0.86  52.83 ± 0.95
Waveform    90.73 ± 0.53  91.07 ± 0.77  89.68 ± 0.62  89.97 ± 0.62  90.36 ± 0.63
Satimage    91.92 ± 0.54  91.70 ± 0.50  91.28 ± 0.55  91.59 ± 0.46  91.82 ± 0.46
Ringnorm    98.43 ± 0.27  97.77 ± 0.23  97.22 ± 0.35  95.66 ± 0.43  97.70 ± 0.26
Twonorm     97.39 ± 0.28  97.53 ± 0.27  96.24 ± 0.51  96.38 ± 0.50  97.22 ± 0.27
Image       97.83 ± 0.53  98.16 ± 0.51  96.78 ± 0.62  97.45 ± 0.62  97.93 ± 0.56
German      74.28 ± 1.56  75.69 ± 1.63  72.37 ± 1.06  75.63 ± 0.64  74.79 ± 1.86
wdbc        95.00 ± 1.44  96.75 ± 1.03  96.35 ± 1.49  96.95 ± 1.17  97.11 ± 1.32
Yeast       59.43 ± 1.93  61.65 ± 1.82  58.94 ± 1.84  60.03 ± 1.31  58.22 ± 1.57
Diabetes    76.25 ± 2.21  76.12 ± 2.30  74.84 ± 2.07  75.14 ± 2.04  74.00 ± 2.02
Iono        94.76 ± 1.68  94.19 ± 1.79  92.74 ± 1.80  92.39 ± 1.77  93.33 ± 1.94
Sonar       85.24 ± 5.39  84.43 ± 4.50  79.62 ± 5.62  82.05 ± 4.44  82.24 ± 4.63
Heart       84.00 ± 3.43  83.30 ± 3.15  83.41 ± 3.92  82.70 ± 3.35  81.22 ± 4.50
Cancer      76.16 ± 2.75  74.12 ± 3.29  75.30 ± 2.85  71.36 ± 4.41  68.82 ± 5.07
Winsc       97.42 ± 0.91  97.38 ± 0.83  96.60 ± 0.98  96.71 ± 0.90  96.47 ± 0.78
Ecoli       85.71 ± 2.36  87.41 ± 2.44  84.02 ± 3.13  85.33 ± 2.76  83.45 ± 2.73
Mean Ranks  1.94          1.69          4.06          3.50          3.81
Ranks       2             1             5             3             4

Random Committee and Random Forest ensembles of decision trees, based on 25 and 100 classifiers. As with αβRSE, the αRSSE parameters α and κ were set through cross validation on one third of the training set. The optimal value of κ was estimated first, then the best value of α found for that κ. The other ensembles were trained on the entire training set with default parameters. Figure 4 shows the Critical Difference diagram for the subspace methods with 25 base classifiers. There is a significant difference in average rank between the classifiers (the F statistic is 14.97, which gives a P value of less than 0.00001). This difference can be described by two clear cliques: Random Subspace, Random Committee and Random Forest are significantly outperformed by the clique of αRSSE and Rotation Forest.

Fig. 4. Critical difference diagram for 5 subspace ensembles on 16 data sets (mean ranks: RotFor 1.5313, αRSSE 2.0938, RandFor 3.5, RandComm 3.875, RandSub 4.0). Critical difference is 1.375.
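The F statistic quoted above is the Iman-Davenport correction of the Friedman statistic, computed directly from the mean ranks; a sketch of the calculation:

```python
def iman_davenport_f(mean_ranks, n_datasets):
    """Friedman statistic with the Iman-Davenport correction,
    for k classifiers compared over N data sets."""
    k = len(mean_ranks)
    n = n_datasets
    chi2 = (12.0 * n / (k * (k + 1))) * (
        sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4.0
    )
    return (n - 1) * chi2 / (n * (k - 1) - chi2)
```

With the mean ranks shown in Figure 4 (1.5313, 2.0938, 3.5, 3.875, 4.0) and N = 16 this returns roughly 14.97, matching the value reported above.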
So whilst Rotation Forest has a lower average rank than αRSSE on these data sets, the difference is not significant. We further note that the difference in performance between Rotation Forest and αRSSE reduces with an increase in the number of base classifiers. Table 6 shows the classification accuracy (calculated through 10CV) of αRSSE for various ensemble sizes, from 15 to 500 base classifiers. In general, ensembles perform better when the size of the ensemble is large. However, with many ensemble methods, increasing the ensemble size dramatically results in over training and hence lower testing accuracy. Table 6 demonstrates that the performance of αRSSE actually improves with over 100 base classifiers, indicating that αRSSE does not have a tendency to overfit data sets with large ensemble sizes. Figure 5 shows the combined critical difference diagram for all 10 ensembles. The increase in the number of ensembles means a much larger critical difference is required to detect a significant difference. However, a similar pattern of ranking is apparent.

Fig. 5. Critical difference diagram for 10 ensembles on 16 data sets (mean ranks: RotFor 2.0625, αRSSE 2.6875, RandFor 5.3438, αβRSE 5.6875, RandComm 5.875, MultiBoost 6.25, Adaboost 6.3125, RandSub 6.3125, αRSE 7.0625, Bagging 7.4063). Critical difference is 3.1257.

TABLE 6
αRSSE 10CV accuracy for ensemble sizes of 15 to 500.
                       Ensemble Size
Dataset     15     25     50     100    250    500
Waveform    89.87  90.38  90.72  90.85  91.21  90.97
Ringnorm    97.97  98.14  98.27  98.31  98.37  98.39
Twonorm     96.79  97.20  97.39  97.49  97.63  97.64
Image       97.44  97.80  97.92  97.87  98.01  98.03
German      74.77  75.43  75.52  75.47  75.52  75.66
wdbc        97.27  97.45  97.75  97.68  97.99  97.98
Yeast       59.02  59.79  59.56  59.58  59.86  59.94
Diabetes    76.89  76.95  76.96  77.03  77.21  76.96
Iono        95.09  95.37  95.23  95.11  95.46  95.43
Sonar       86.85  87.81  88.30  88.69  88.03  88.47
Heart       81.74  84.26  83.85  83.63  83.81  83.96
Cancer      75.54  75.88  76.05  77.06  76.93  77.30
Winsc       97.08  97.34  97.24  97.40  97.38  97.33
Ecoli       86.17  86.45  86.57  86.15  86.62  86.60

The no free lunch theorem [48] convinces us that there will not be a single dominant algorithm for all classification problems. Instance based approaches remain popular in a range of problem domains, particularly in research areas relating to image processing and databases. αβRSE and αRSSE offer instance based approaches to classification that are highly competitive with the best tree based subspace and non-subspace ensemble techniques. In the following section we propose a type of problem domain where we think αRSSE outperforms the tree based ensembles.

5.5 Gene Expression Classification Case Study: Subspace Ensemble Comparison

Gene expression profiling helps to identify the set of genes responsible for cancerous tissue. Gene expression data are generally characterised by a very large number of attributes and relatively few cases. Instance based learners such as k-NN often perform poorly in high dimensional attribute spaces. We demonstrate that the subspace ensemble αRSSE can overcome this inherent problem and in fact outperform the other ensemble techniques.
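The subspace mechanism being evaluated can be illustrated with a toy ensemble. The sketch below is not αRSSE itself (the base learner here is plain 1-NN rather than αRSC, and the class names are ours), but it shows the idea: each member is restricted to κ randomly selected attributes and predictions are combined by majority vote:

```python
import random
from collections import Counter

class SubspaceNNEnsemble:
    """Illustrative subspace ensemble with a 1-NN base classifier."""

    def __init__(self, n_members=25, kappa=5, seed=0):
        self.n_members, self.kappa = n_members, kappa
        self.rng = random.Random(seed)

    def fit(self, X, y):
        self.X, self.y = X, y
        n_attr = len(X[0])
        # each member sees its own random subset of kappa attributes
        self.subspaces = [
            self.rng.sample(range(n_attr), min(self.kappa, n_attr))
            for _ in range(self.n_members)
        ]
        return self

    def _nn_label(self, attrs, x):
        # 1-NN restricted to the member's attribute subset
        dist = lambda a: sum((a[i] - x[i]) ** 2 for i in attrs)
        return self.y[min(range(len(self.X)), key=lambda j: dist(self.X[j]))]

    def predict(self, x):
        votes = Counter(self._nn_label(attrs, x) for attrs in self.subspaces)
        return votes.most_common(1)[0][0]
```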
TABLE 7
The best test set accuracy (in %) of αRSSE (αR), Rotation Forest (RotF), Random Subspace (RandS), Random Forest (RandF), Adaboost (AB), Bagging (Bag) and MultiBoostAB (Multi) using average results of 30 different runs on χ². BC=Breast Cancer, CN=Central Nervous, CT=Colon Tumor, LC=Lung Cancer, OV=Ovarian and PR=Prostate.

Dataset  αR     RotF   RandS  RandF  AB     Bag    Multi
BC       82.93  79.60  76.26  80.91  79.19  78.99  78.79
CN       77.83  76.83  74.33  80.33  76.33  76.17  76.50
CT       85.87  86.19  83.49  84.13  82.38  83.65  82.86
LC       99.34  99.34  95.03  99.34  97.81  97.21  97.87
OV       99.18  99.80  97.88  98.98  97.73  97.84  97.73
PR       94.13  93.70  91.30  94.57  91.23  91.38  91.09
F-avg    1.75   2.10   5.83   1.92   5.58   5.00   5.58
F-ranks  1      3      7      2      5.5    4      5.5

TABLE 8
The best test set accuracy (in %) using average results of 30 different runs on Information Gain.

Dataset  αR     RotF   RandS  RandF  AB     Bag    Multi
BC       85.15  79.39  77.47  83.94  79.49  80.10  79.80
CN       79.17  76.50  73.50  80.00  75.67  76.17  76.00
CT       86.98  84.76  82.54  84.44  82.70  82.54  82.38
LC       99.34  99.34  94.75  99.34  97.76  97.16  97.81
OV       99.25  99.76  98.00  98.86  97.73  97.88  97.73
PR       93.77  93.48  91.74  93.62  91.09  92.32  90.80
F-avg    1.42   2.75   5.92   2.08   5.42   4.58   5.58
F-ranks  1      3      7      2      5      4      6

TABLE 9
The best test set accuracy (in %) using average results of 30 different runs on Relief.

Dataset  αR     RotF   RandS  RandF  AB     Bag    Multi
BC       80.20  79.19  72.42  78.18  73.74  74.85  73.23
CN       76.00  75.50  72.17  76.00  74.00  72.00  73.33
CT       83.65  84.76  80.63  83.33  79.37  83.17  79.68
LC       99.34  99.23  94.75  98.91  97.43  96.61  97.49
OV       98.43  99.37  98.04  98.90  97.61  97.69  97.61
PR       89.13  93.33  91.67  93.62  93.41  89.71  93.26
F-avg    2.58   2.00   5.67   2.25   4.92   5.33   5.25
F-ranks  3      1      7      2      4      6      5

TABLE 10
The best test set accuracy (in %) over the three attribute ranking methods.

Dataset  αR     RotF   RandS  RandF  Adaboost  Bagging  Multi
BC       84.04  79.60  77.47  83.94  79.49     80.10    79.80
CN       79.17  76.83  74.33  80.33  76.33     76.17    76.50
CT       86.98  86.19  83.49  84.44  82.70     83.65    82.86
LC       99.34  99.34  95.03  99.34  97.81     97.21    97.87
OV       99.18  99.76  98.00  98.98  97.73     97.88    97.73
PR       94.13  93.70  91.74  94.57  93.41     92.32    93.26
F-avg    1.58   2.58   6.17   1.92   5.58      5.00     4.92
F-ranks  1      3      7      2      6         5        4

Broadly speaking, there are three types of approach to problems with a large number of attributes [17]: employ a filter that uses a scoring method to rank the attributes independently of the classifier; use a wrapper to score subsets of attributes using the classifier to produce the model; or embed the attribute selection as part of the algorithm used to build the classifier [34]. We focus on three simple, commonly used filter measures, χ², Information Gain (IG) and Relief, which are used to select a fixed number of attributes by ranking each on how well it splits the training data in terms of the response variable. We compare αRSSE to Adaboost, Bagging, Random Committee, Multiboost, Random Subspace, Random Forest and Rotation Forest. Our methodology is to filter on the k = 5, 10, 20, 30, 40 and 50 best ranked attributes for each of the three ranking measures. Model selection for αRSSE is conducted as described in Section 5.3. All the ensembles use 100 classifiers. For Adaboost, Bagging and the base decision tree classifiers in the ensembles we use the default parameters. Tables 7, 8 and 9 show the relative performance of the ensemble classifiers at the best attribute filter setting for each of the filter techniques. We note that αRSSE is ranked highest overall when using χ² and Information Gain, and is ranked third with Relief. From this we infer that, when used in conjunction with filtering, αRSSE can overcome the inherent problem instance based learners have with high dimensional attribute spaces and produce results better than state of the art tree based ensemble classifiers.
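A minimal sketch of the filter step for one of the three measures, Information Gain, assuming the continuous attributes have already been discretised (the χ² and Relief scores would slot into the same ranking loop; function names are ours):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG of one (discretised) attribute with respect to the class labels."""
    n = len(labels)
    cond = 0.0
    for v in set(feature_values):
        subset = [l for fv, l in zip(feature_values, labels) if fv == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

def top_k_attributes(X, y, k):
    """Rank attributes by information gain and keep the k best (the filter step)."""
    scores = [information_gain([row[i] for row in X], y) for i in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```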
6 BIAS/VARIANCE ANALYSIS OF RSC ENSEMBLE TECHNIQUES

The purpose of our bias/variance analysis of the ensembles αβRSE and αRSSE is to identify whether the reduction in generalisation error in comparison to the base classifier is due to a reduction in bias, a reduction in unbiased variance or an increase in biased variance. We followed an experimental framework similar to that found in [44]. The standard experimental design for bias/variance decomposition is to estimate bias and variance using small training sets and large test sets. We used bootstrapping to sample eight of our datasets. The data is divided into a training set and a test set, with the test set being 1/3 of the entire set. 200 separate training bootstrap samples of size 200 were taken by uniformly sampling with replacement from the training set. We then compute the main prediction, the bias, both the unbiased and biased variance, and the net variance (as defined in Section 2.3) over the 200 test sets. Figure 6 shows both bias and variance in relation to κ (the number of attributes used in each classifier for αRSSE) for four of the datasets. We observe a strong relationship between averaged error and bias for small κ, but as κ increases, variance contributes a larger component of the error. Increasing κ appears to have a greater influence on unbiased variance reduction than on biased variance. To compare αRSC, αβRSE and αRSSE, we performed the bias/variance experiment on the three classifiers with the optimal set of parameters (determined experimentally). We conclude from these results that αβRSE, in most cases, reduces the net variance in comparison with a single classifier because of a decrease in unbiased variance. However, the picture is not as straightforward for bias.
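The decomposition described above can be sketched in the style of Domingos [10] for 0/1 loss, computed from a matrix of bootstrap predictions (the function and variable names are ours):

```python
from collections import Counter

def bv_decompose(preds, truth):
    """0/1-loss bias/variance decomposition over bootstrap predictions.
    preds[m][i]: prediction of model m on test point i; truth[i]: true label.
    Returns (bias, net_variance, unbiased_variance, biased_variance)."""
    n_models, n_points = len(preds), len(truth)
    bias = vu = vb = 0.0
    for i in range(n_points):
        column = [preds[m][i] for m in range(n_models)]
        main = Counter(column).most_common(1)[0][0]      # main prediction
        var = sum(p != main for p in column) / n_models  # P(prediction != main)
        if main == truth[i]:
            vu += var        # disagreement hurts where the main prediction is right
        else:
            bias += 1.0
            vb += var        # disagreement helps where the main prediction is wrong
    return (bias / n_points, (vu - vb) / n_points,
            vu / n_points, vb / n_points)
```

For a binary problem with deterministic labels, average error equals bias plus net variance (unbiased minus biased variance) under this decomposition.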
It might be that bias reduction depends on the geometrical complexity of the sample [21] (complex structures require complex decision boundaries), the chosen value of the pruning parameter α, and the interaction between α and β. In that case, finding a method that systematically reduces bias while keeping unbiased variance low would further reduce the ensemble average error. Table 11 shows the bias/variance decomposition of αRSSE, αβRSE and αRSC. We make the following observations from these results: 1) the average error of αRSSE and αβRSE is lower than that of αRSC for all the problems; 2) for αβRSE, this is more commonly a result of a reduction in net variance rather than a reduction in bias; 3) for αRSSE, whilst bias is reduced, we also see a more consistent reduction in variance. These experiments reinforce our preconception as to the effectiveness of the ensembles: αβRSE introduces further diversity into the ensemble by allowing misclassified instances within the spheres. The major effect of this is to reduce the variance of the resulting classifier. On the other hand, the subspace ensemble reduces the inherent bias commonly observed when instance based classifiers are used in conjunction with a Euclidean distance metric: redundant attributes result in overfitting.

7 CONCLUSION

We have described an instance based classifier, αRSC, that has several interesting properties that can be exploited in ensemble design. We described three different ensemble methods with which it can be used, and demonstrated that the resulting ensembles are competitive with the best tree based ensemble techniques on a wide range of standard data sets. We further investigated the reasons for the improvement in performance of the ensembles relative to the base classifier using bias/variance decomposition.
For the ensemble based on resampling (αβRSE), accuracy was increased primarily by a reduction in variance. Hence we conclude that the diversity introduced via the proposed technique is mostly beneficial, and that the resulting ensemble classifier is more robust. We also demonstrated through bias/variance decomposition that the subspace ensemble αRSSE improves performance primarily through a decrease in bias. An obvious next step would be to embed the resampling technique within the random subspace ensemble. However, we found that employing the β mechanism in the subspace ensemble did not make a significant difference to the αRSSE ensemble.

Fig. 6. Bias/Variance decomposition of the αRSSE classifier as a function of the number of attributes: (a) average error and bias for Diabetes; (b) variance decomposition for Diabetes; (c) average error and bias for Heart; (d) variance decomposition for Heart; (e) average error and bias for Image; (f) variance decomposition for Image; (g) average error and bias for Waveform; (h) variance decomposition for Waveform.
This implies that attribute selection is the most important stage in ensembling αRSC, other than model selection by setting α. This has led us to investigate embedding attribute selection (rather than randomisation) into the ensemble, with promising preliminary results. We believe that αRSC is a useful addition to the family of instance based learners since it is easy to understand, quick to train and test, and can effectively be employed in ensembles to achieve classification accuracy comparable to the most popular ensemble methods.

TABLE 11
Comparing bias/variance of αRSC, αβRSE and αRSSE. (Var. unb.) and (Var. bias.) stand for unbiased and biased variance. (Diff) stands for the percentage difference between the algorithms. An up arrow ↑ denotes an increase and a down arrow ↓ a decrease.

Dataset                        Avg Error  Bias     Net Var  Var. Unb.  Var. Bias.
Waveform
(1) αRSC, α = 11               0.1387     0.0961   0.0426   0.0722     0.0296
(2) αβRSE, α = 10              0.1223     0.0976   0.0247   0.0500     0.0254
(3) αRSSE, α = 2, κ = 11       0.1141     0.0906   0.0235   0.0472     0.0237
Diff (1) vs (2) %              ↓ 11.82    ↑ 1.56   ↓ 42.01  ↓ 30.74    ↓ 14.18
Diff (1) vs (3) %              ↓ 17.73    ↓ 5.72   ↓ 44.83  ↓ 34.62    ↓ 19.93
Diabetes
(1) αRSC, α = 3                0.2780     0.2367   0.0413   0.1006     0.0594
(2) αβRSE, α = 3               0.2685     0.2359   0.0326   0.0847     0.0521
(3) αRSSE, α = 2, κ = 5        0.2603     0.2332   0.0271   0.0741     0.0469
Diff (1) vs (2) %              ↓ 3.41     ↓ 0.33   ↓ 21.06  ↓ 15.80    ↓ 12.29
Diff (1) vs (3) %              ↓ 6.37     ↓ 1.48   ↓ 34.38  ↓ 26.34    ↓ 21.04
Heart
(1) αRSC, α = 7                0.2138     0.1667   0.0471   0.0872     0.0400
(2) αβRSE, α = 10              0.1896     0.1756   0.0140   0.0431     0.0290
(3) αRSSE, α = 2, κ = 5        0.1814     0.1533   0.0281   0.0568     0.0287
Diff (1) vs (2) %              ↓ 11.31    ↑ 5.33   ↓ 70.27  ↓ 50.57    ↓ 27.50
Diff (1) vs (3) %              ↓ 15.15    ↓ 8.04   ↓ 40.34  ↓ 34.86    ↓ 28.25
wdbc
(1) αRSC, α = 8                0.0898     0.0784   0.0114   0.0275     0.0161
(2) αβRSE, α = 2               0.0771     0.0663   0.0108   0.0255     0.0147
(3) αRSSE, α = 0, κ = 13       0.0698     0.0553   0.0145   0.0258     0.0112
Diff (1) vs (2) %              ↓ 14.14    ↓ 15.43  ↓ 5.26   ↓ 7.27     ↓ 8.69
Diff (1) vs (3) %              ↓ 22.27    ↓ 29.46  ↑ 27.19  ↓ 6.18     ↓ 30.43
Image
(1) αRSC, α = 0                0.1184     0.0650   0.0534   0.0759     0.0225
(2) αβRSE, α = 0               0.1050     0.0665   0.0385   0.0603     0.0218
(3) αRSSE, α = 0, κ = 10       0.0873     0.0495   0.0378   0.0541     0.0163
Diff (1) vs (2) %              ↓ 11.31    ↑ 2.30   ↓ 27.90  ↓ 20.55    ↓ 3.11
Diff (1) vs (3) %              ↓ 26.26    ↓ 23.84  ↓ 29.21  ↓ 28.72    ↓ 27.55
Twonorm
(1) αRSC, α = 10               0.0515     0.0222   0.0293   0.0366     0.0073
(2) αβRSE, α = 10              0.0345     0.0224   0.0121   0.0179     0.0058
(3) αRSSE, α = 2, κ = 13       0.0328     0.0225   0.0103   0.0159     0.0057
Diff (1) vs (2) %              ↓ 33.01    ↑ 0.90   ↓ 58.70  ↓ 51.09    ↓ 20.54
Diff (1) vs (3) %              ↓ 36.31    ↑ 1.35   ↓ 64.84  ↓ 56.55    ↓ 21.91
Ringnorm
(1) αRSC, α = 0                0.1183     0.0596   0.0587   0.0783     0.0783
(2) αβRSE, α = 0               0.0527     0.0208   0.0320   0.0377     0.0058
(3) αRSSE, α = 0, κ = 10       0.0288     0.0167   0.0121   0.0166     0.0045
Diff (1) vs (2) %              ↓ 55.45    ↓ 65.10  ↓ 45.48  ↓ 51.85    ↓ 70.40
Diff (1) vs (3) %              ↓ 75.65    ↓ 71.97  ↓ 79.38  ↓ 78.79    ↓ 94.25

REFERENCES

[1] D. Aha, D. Kibler and M.K. Albert: Instance-based learning algorithms, Machine Learning, vol. 6, no. 1, pp. 37-66, 1991.
[2] E. Bauer and R. Kohavi: An empirical comparison of voting classification algorithms: bagging, boosting, and variants, Machine Learning, vol. 36, no. 1-2, pp. 105-139, 1999.
[3] L. Breiman: Bagging predictors, Machine Learning, vol. 24, no. 2, pp. 123-140, 1996.
[4] L. Breiman: Random forests, Machine Learning, vol. 45, no. 1, pp. 5-32, 2001.
[5] L. Breiman: Bias, variance, and arcing classifiers, Statistics Department, Berkeley, technical report, no. 460, 1996.
[6] A. Cannon and L.J. Cowen: Approximation algorithms for the class cover problem, Annals of Mathematics and Artificial Intelligence, vol. 40, no. 3-4, pp. 215-223, 2004.
[7] A. Cannon, J. Mark Ettinger, D.
Hush and C. Scovel: Machine learning with data dependent hypothesis classes, Journal of Machine Learning Research, vol. 2, pp. 335-358, 2002.
[8] J. Demšar: Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research, vol. 7, pp. 1-30, 2006.
[9] T.G. Dietterich: An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization, Machine Learning, vol. 40, no. 2, pp. 139-157, 2000.
[10] P. Domingos: A unified bias-variance decomposition for zero-one and squared loss, Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence, pp. 564-569, 2000.
[11] A. Frank and A. Asuncion: UCI Machine Learning Repository, University of California, Irvine, School of Information and Computer Sciences, http://archive.ics.uci.edu/ml, 2010.
[12] S. Floyd and M.K. Warmuth: Sample compression, learnability, and the Vapnik-Chervonenkis dimension, Machine Learning, vol. 21, no. 5, pp. 269-304, 1995.
[13] Y. Freund and R.E. Schapire: Experiments with a new boosting algorithm, ICML, pp. 148-156, 1996.
[14] J.H. Friedman and U. Fayyad: On bias, variance, 0/1-loss, and the curse-of-dimensionality, Data Mining and Knowledge Discovery, vol. 1, pp. 55-77, 1997.
[15] P. Geurts, D. Ernst and L. Wehenkel: Extremely randomised trees, Machine Learning, vol. 63, no. 1, pp. 3-42, 2006.
[16] Y. Grandvalet, S. Canu and S. Boucheron: Noise injection: theoretical prospects, Neural Computation, vol. 9, no. 5, pp. 1093-1108, 1997.
[17] I. Guyon and A. Elisseeff: An introduction to variable and feature selection, Journal of Machine Learning Research, vol. 3, pp. 1157-1182, 2003.
[18] L.K. Hansen and P.
Salamon: Neural network ensembles, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, no. 10, pp. 993-1001, 1990.
[19] R. Herbrich: Learning Kernel Classifiers: Theory and Algorithms, MIT Press, Cambridge, MA, USA, 2002.
[20] T.K. Ho: The random subspace method for constructing decision forests, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pp. 832-844, 1998.
[21] T.K. Ho: Geometrical complexity of classification problems, Proceedings of the 7th Course on Ensemble Methods for Learning Machines at the International School on Neural Nets, 2004.
[22] G. James: Variance and bias for general loss functions, Machine Learning, vol. 51, pp. 115-135, 2003.
[23] S.W. Kim and B.J. Oommen: A brief taxonomy and ranking of creative prototype reduction schemes, Pattern Analysis and Applications, vol. 6, no. 3, pp. 232-244, 2003.
[24] L.I. Kuncheva: That elusive diversity in classifier ensembles, Lecture Notes in Computer Science, pp. 1126-1138, 2003.
[25] L.I. Kuncheva: A theoretical study on six classifier fusion strategies, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 2, pp. 281-286, 2002.
[26] L.I. Kuncheva and J.C. Bezdek: Presupervised and postsupervised prototype classifier design, IEEE Transactions on Neural Networks, vol. 10, no. 5, pp. 1142-1152, 1999.
[27] L.I. Kuncheva, C. Whitaker, C. Shipp and R.P.W. Duin: Limits on the majority vote accuracy in classifier fusion, Pattern Analysis and Applications, vol. 6, no. 1, pp. 22-31, 2003.
[28] L.I. Kuncheva: Diversity in multiple classifier systems, Information Fusion, vol. 6, no. 1, pp. 3-4, 2005.
[29] H. Liu and H. Motoda: On issues of instance selection, Data Mining and Knowledge Discovery, vol. 6, no. 2, pp. 115-130, 2002.
[30] D.J. Marchette: Random Graphs for Statistical Pattern Recognition, Wiley-Interscience, 2004.
[31] D.J. Marchette and C.E.
Priebe: Characterizing the scale dimension of a high-dimensional classification problem, Pattern Recognition, vol. 36, no. 1, pp. 45-60, 2003.
[32] D.J. Marchette, E.J. Wegman and C.E. Priebe: A fast algorithm for approximating the dominating set of a class cover catch digraph, technical report, JHU DMS TR 635, 2003.
[33] R. Meir and G. Rätsch: An introduction to boosting and leveraging, Machine Learning Summer School, Springer, pp. 118-183, 2002.
[34] L.C. Molina, L. Belanche and A. Nebot: Feature selection algorithms: a survey and experimental evaluation, IEEE International Conference on Data Mining, pp. 306-313, 2002.
[35] D. Opitz and R. Maclin: Popular ensemble methods: an empirical study, Journal of Artificial Intelligence Research, vol. 11, pp. 169-198, 1999.
[36] C.E. Priebe, J.G. DeVinney, D.J. Marchette and D.A. Socolinsky: Classification using class cover catch digraphs, Journal of Classification, vol. 20, no. 1, pp. 3-23, 2003.
[37] G. Raetsch, T. Onoda and K.R. Mueller: Soft margins for AdaBoost, Machine Learning, vol. 42, no. 3, pp. 287-320, 2001.
[38] J.J. Rodriguez, L.I. Kuncheva and C.J. Alonso: Rotation forest: a new classifier ensemble method, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 10, pp. 1619-1630, 2006.
[39] R.E. Schapire: Theoretical views of boosting and applications, International Workshop on Algorithmic Learning Theory, pp. 13-25, 1999.
[40] M. Marchand and J. Shawe-Taylor: The set covering machine, Journal of Machine Learning Research, vol. 3, pp. 723-746, 2002.
[41] G. Stiglic and P. Kokol: GEMLeR: Gene Expression Machine Learning Repository. Available at: http://gemler.fzv.uni-mb.si/.
[42] E.K. Tang, P.N. Suganthan and X. Yao: An analysis of diversity measures, Machine Learning, vol. 65, no. 1, pp. 247-271, 2006.
[43] R.
Tibshirani: Bias, variance and prediction error for classification rules, University of Toronto, Dept. of Statistics, Toronto, 1996.
[44] G. Valentini and T.G. Dietterich: Bias-variance analysis of support vector machines for the development of SVM-based ensemble methods, Journal of Machine Learning Research, vol. 5, pp. 725-775, 2004.
[45] G.I. Webb: Multiboosting: a technique for combining boosting and wagging, Machine Learning, vol. 40, pp. 159-196, 2000.
[46] D. Wettschereck: A hybrid nearest-neighbor and nearest-hyperrectangle algorithm, Proceedings of the European Conference on Machine Learning, vol. 784, pp. 323-338, 1994.
[47] D. Wilson and T.R. Martinez: Reduction techniques for instance-based learning algorithms, Machine Learning, vol. 38, pp. 257-286, 2000.
[48] D.H. Wolpert and W.G. Macready: No free lunch theorems for optimization, IEEE Transactions on Evolutionary Computation, vol. 1, no. 1, pp. 67-82, 1997.
[49] R. Younsi and A. Bagnall: An efficient randomised sphere cover classifier, International Journal of Data Mining, Modelling and Management, vol. 4, no. 2, pp. 156-171, 2012.
[50] R. Younsi: Investigating Randomised Sphere Covers in Supervised Learning, PhD thesis, University of East Anglia, 2011.