Classification using distance nearest neighbours
This paper proposes a new probabilistic classification algorithm using a Markov random field approach. The joint distribution of class labels is explicitly modelled using the distances between feature vectors. Intuitively, a class label should depend…
This paper is concerned with the problem of supervised classification, a topic of interest in both statistics and machine learning. Hastie et al. (2001) give a description of various classification methods. We outline our problem as follows. We have a collection of training data {(x_i, y_i), i = 1, ..., n}. The values in the collection x = {x_1, ..., x_n} are often called features and can be conveniently thought of as covariates. We denote the class labels as y = {y_1, ..., y_n}, where each y_i takes one of the values 1, 2, ..., G. Given a collection of incomplete/unlabelled test data {(x_i, y_i), i = n + 1, ..., n + m}, the problem amounts to predicting the class labels y* = {y_{n+1}, ..., y_{n+m}} with corresponding feature vectors x* = {x_{n+1}, ..., x_{n+m}}.
Perhaps the most common approach to classification is the well-known k-nearest neighbours (k-nn) algorithm. This algorithm amounts to classifying an unlabelled y_{n+i} as the most common class among the k nearest neighbours of x_{n+i} in the training set {(x_i, y_i), i = 1, ..., n}. While this algorithm is easy to implement and often gives good performance, it can be criticised since it does not allow any uncertainty to be associated with the test class labels, or with the value of k. Indeed the choice of k is crucial to the performance of the algorithm. The value of k is often chosen on the basis of leave-one-out cross-validation.
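As a concrete illustration (a generic sketch, not the authors' Matlab code), a k-nn classifier with leave-one-out cross-validation for choosing k might look like this:

```python
import numpy as np
from collections import Counter

def knn_predict(x_train, y_train, x_new, k):
    """Classify x_new as the most common class among its k nearest neighbours."""
    d = np.linalg.norm(x_train - x_new, axis=1)   # Euclidean distances
    nearest = np.argsort(d)[:k]                   # indices of k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]

def choose_k_loocv(x_train, y_train, k_values):
    """Pick k minimising the leave-one-out misclassification count."""
    errors = []
    for k in k_values:
        wrong = 0
        for i in range(len(y_train)):
            x_rest = np.delete(x_train, i, axis=0)  # hold out point i
            y_rest = np.delete(y_train, i)
            wrong += knn_predict(x_rest, y_rest, x_train[i], k) != y_train[i]
        errors.append(wrong)
    return k_values[int(np.argmin(errors))]
```

Note that ties (in distance or in the class vote) are broken arbitrarily here; a production implementation would need an explicit tie-breaking rule.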
There has been some interest in extending the k-nearest neighbours algorithm to allow for uncertainty in the test class labelling, most notably by Holmes and Adams (2002, 2003) and more recently Cucala et al. (2009). Each of these probabilistic variants of the k-nearest neighbours algorithm is based on defining a neighbourhood of each point x_i consisting of the k nearest neighbours of x_i. Moreover, each of these neighbouring points has equal influence in determining the missing class label for y_i, regardless of its distance from x_i. In this article we present a class of models, the distance nearest neighbour model, which shares many of the advantages of these probabilistic approaches but in which, by contrast, the relative influence of neighbouring points depends on their distance from x_i. Formally, the distance nearest neighbour model is a discrete-valued Markov random field and, as is typical with such models, depends on an intractable normalising constant. To overcome this problem we use the exchange algorithm of Murray et al. (2006) and illustrate that this provides a computationally efficient algorithm with very good mixing properties. This contrasts with the difficulties encountered by Cucala et al. (2009) in their implementation of the sampling scheme of Møller et al. (2006).
This article is organised as follows. Section 2 presents an overview of recent probabilistic approaches to supervised classification. Section 3 introduces the new distance nearest neighbour model and outlines how it compares and contrasts with previous probabilistic nearest neighbour approaches. We provide a computationally efficient framework for carrying out inference for the distance nearest neighbour model in Section 4. The performance of the algorithm is illustrated in Section 5 for a variety of benchmark datasets, as well as for challenging high-dimensional datasets. Finally, we present some closing remarks in Section 6.
2 Probabilistic nearest neighbour models

Holmes and Adams (2003) attempted to place the k-nn algorithm in a probabilistic setting, thereby allowing for uncertainty in the test class labelling. In their approach the full-conditional distribution for a training label is written as

π(y_i | y_{-i}, β, k) ∝ exp{ (β/k) Σ_{j ∼_k i} I(y_i = y_j) },
where the summation is over the k nearest neighbours of x_i and where I(y_i = y_j) is an indicator function taking the value 1 if y_i = y_j and 0 otherwise. The notation j ∼_k i means that x_j is one of the k nearest neighbours of x_i. However, as pointed out in Cucala et al. (2009), there is a difficulty with this formulation: there will almost never be a joint probability for y corresponding to this collection of full-conditionals. The reason is simply that the k-nn neighbourhood system is usually asymmetric: if x_i is one of the k nearest neighbours of x_j, it does not necessarily follow that x_j is one of the k nearest neighbours of x_i. Cucala et al. (2009) corrected the issue surrounding the asymmetry of the k-nn neighbourhood system. In their probabilistic k-nn (pk-nn) model, the full-conditional for class label y_i appears as

π(y_i | y_{-i}, β, k) ∝ exp{ (β/k) ( Σ_{j ∼_k i} I(y_i = y_j) + Σ_{j : i ∼_k j} I(y_i = y_j) ) },

and this gives rise to the joint distribution

π(y | β, k) ∝ exp{ (β/k) Σ_{i=1}^n Σ_{j ∼_k i} I(y_i = y_j) }.   (1)
Therefore, under this model, following (1), mutual neighbours are given double weight with respect to non-mutual neighbours, and for this reason the model could perhaps be seen as an ad-hoc solution to the problem.
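The asymmetry that motivates the pk-nn correction is easy to demonstrate numerically; a small self-contained example (illustrative, not from the paper):

```python
import numpy as np

def knn_set(x, i, k):
    """Return the indices of the k nearest neighbours of x[i] (excluding i)."""
    d = np.linalg.norm(x - x[i], axis=1)
    d[i] = np.inf                        # a point is never its own neighbour
    return set(int(j) for j in np.argsort(d)[:k])

# Three points on a line: 0 and 1 are close together, 2 sits far to the right.
x = np.array([[0.0], [1.0], [10.0]])

# Point 2's nearest neighbour is point 1, but point 1's nearest neighbour
# is point 0 -- so the relation "x_j is a k-nn of x_i" is asymmetric.
asymmetric = (1 in knn_set(x, 2, 1)) and (2 not in knn_set(x, 1, 1))
```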
It is important to also note that both Holmes and Adams (2002) and Cucala et al. (2009) allow the value of k to be a variable. Therefore the neighbourhood size can vary. Holmes and Adams (2002) argue that allowing k to vary has a certain type of smoothing effect.
Motivated by the work of Holmes and Adams (2002) and Cucala et al. (2009), our interest focuses on modelling the distribution of the training data as a Markov random field. In contrast to these approaches, however, our model explicitly depends on the distances between points in the training set. Specifically, we define the full-conditional distribution of the class label y_i as

π(y_i | y_{-i}, x, β, σ) ∝ exp{ β Σ_{j ∈ n(i)} w^j_i I(y_i = y_j) }.
Positive values of the Markov random field parameter β encourage aggregation of the class labels. When β = 0, the class labels are uncorrelated. In contrast to the pk-nn model, here the neighbourhood set of x_i is constructed to be

n(i) = {x_j : j = 1, ..., n, j ≠ i},
and is therefore of maximal size. We consider three possible models, depending on how the collection of weights {w^j_i}, for j = 1, ..., i−1, i+1, ..., n, is defined.

1. d-nn_1:
where d is a distance measure such as Euclidean.
2. d-nn_2:
w^j_i ∝ ε + I(d(x_i, x_j) < σ), for j = 1, ..., i−1, i+1, ..., n, where, again, I is an indicator function taking the value 1 if d(x_i, x_j) < σ and 0 otherwise. Further, ε ∈ (0, 1) is a constant set to a value close to 0. (Throughout this paper we assign the value ε = 10^{−10}.) A non-zero value of ε guarantees that if there are no features within a distance σ of x_i then the class of y_i is modelled using the marginal proportions of the class labels.
3. d-nn_3:
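The three weighting schemes can be sketched in code. Only the d-nn_2 weights are fully specified above; the exponential kernel for d-nn_1 and the inverse-power kernel for d-nn_3 below are illustrative assumptions, chosen to match the qualitative behaviour described for those models (d-nn_1 weights become uniform as σ → ∞, d-nn_3 weights become uniform as σ → 0):

```python
import numpy as np

def dnn_weights(x, i, sigma, model, eps=1e-10):
    """Normalised weights w_i^j over all j != i for the three d-nn variants.

    The d-nn_1 and d-nn_3 kernels are assumptions for illustration:
    d-nn_1 decays exponentially in distance (uniform as sigma -> infinity),
    d-nn_3 decays as a power of distance (uniform as sigma -> 0)."""
    d = np.linalg.norm(x - x[i], axis=1)        # Euclidean distances to x_i
    d = np.delete(d, i)                         # the neighbourhood excludes i
    if model == 1:
        w = np.exp(-d / sigma)                  # assumed d-nn_1 kernel
    elif model == 2:
        w = (d < sigma).astype(float) + eps     # d-nn_2: ball indicator + eps
    else:
        w = (1.0 + d) ** (-sigma)               # assumed d-nn_3 kernel
    return w / w.sum()                          # weights sum to one
```

For example, with model 2 and no points inside the σ-ball, the returned weights are uniform over the whole training set, so distance has no effect, exactly the behaviour discussed in the comments that follow.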
Clearly the neighbourhood system for each of these models is symmetric, and so the Hammersley-Clifford theorem guarantees that the joint distribution of the class labels is a Markov random field. This joint distribution is written as

π(y | β, σ, x) = exp{ β Σ_{i=1}^n Σ_{j ∈ n(i)} w^j_i I(y_i = y_j) } / z(β, σ).   (2)
As usual, the normalising constant of such a Markov random field is difficult to evaluate in all but trivial cases. It appears as

z(β, σ) = Σ_{y ∈ {1,...,G}^n} exp{ β Σ_{i=1}^n Σ_{j ∈ n(i)} w^j_i I(y_i = y_j) },   (3)

where the summation is over all G^n possible label configurations.
Some comments:
1. The k-nn algorithm and its probabilistic variants always use neighbourhoods of size k, regardless of how far each of the neighbouring points is from the centre point x_i. Moreover, each neighbouring point x_j has equal influence, regardless of its distance from x_i. It could therefore be argued that these algorithms are sensitive to outliers. By contrast, the distance nearest neighbour models deal with outlying points in a more robust manner: if a point x_j lies further away from other neighbours of x_i, it will have a relatively smaller weight, and consequently less influence in determining the likely class label of y_i.
2. The formulation of distance nearest neighbour models includes every training point in the neighbourhood set, but the value of σ determines the relative influence of points in the neighbourhood set. For the d-nn_1 model, small values of σ imply that only those points with small distance from the centre point will be influential, while for large values of σ, points in the neighbourhood set are more uniformly weighted. Similarly, for the d-nn_2 model, points within a radius σ of the centre point are weighted equally, while those outside a radius σ of the centre point will have relatively little weight when ε is very close to 0. By contrast, for the d-nn_3 model, large values of the parameter σ imply that points close to the centre point will be influential.
3. For the d-nn_2 model, if there are no features in the training set within a distance σ of x_i, then π(y_i = j | y_{-i}, x, β, σ) ∝ exp(β p^j_i), for j = 1, ..., G, where p^j_i denotes the proportion of class label j in the set y \ {y_i}. The parameter β determines the dependence on the class proportions. A large value of β typically predicts the class label to be the class with the largest proportion, whereas a small value of β results in a prediction which is almost uniform over all possible classes. Conversely, if there are any feature vectors within a radius σ of x_i, then the class labels for these features will most influence the class label of y_i.
4. As β → ∞, the most frequently occurring training label in the neighbourhood of a test point will be chosen with increasingly large probability. The β parameter can be thought of, in a sense, as a tempering parameter: in the limit as β → ∞, the modal class label in the neighbourhood set has probability 1.
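A single-site Gibbs update of a label under a full-conditional of this weighted-vote form can be sketched as follows; the function and argument names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_update_label(i, y, w_i, beta, G):
    """Draw y_i from pi(y_i = g | y_-i) propto exp(beta * sum_j w_i[j] I(y_j = g)).

    w_i holds the normalised weights of every j != i, in order."""
    others = np.delete(y, i)                        # labels of the neighbours
    score = np.zeros(G)
    for g in range(G):
        score[g] = beta * w_i[others == g].sum()    # weighted votes for class g
    p = np.exp(score - score.max())                 # numerically stable softmax
    p /= p.sum()
    return rng.choice(G, p=p)

# Toy example: 4 points, 2 classes, uniform weights over the 3 neighbours.
y = np.array([0, 0, 1, 0])
w_i = np.full(3, 1/3)
y[0] = gibbs_update_label(0, y, w_i, beta=2.0, G=2)
```

With β = 0 the class probabilities are uniform; as β grows, the draw concentrates on the weighted modal class, matching comment 4 above.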
There has been work on extending the k-nearest neighbours algorithm to weight neighbours within the neighbourhood of size k. For example, Dudani (1976) weighted neighbours linearly in distance, standardising the weights to lie in [0, 1].
A model similar to the d-nn_1 model appeared in Zhu and Ghahramani (2002), but it does not contain the β Markov random field parameter to control the level of aggregation in the spatial field. Moreover, the authors outline some MCMC approaches, but note that inference for this model is challenging. The aim of this paper is to illustrate how this model may be generalised and to present an efficient algorithm to sample from it. We now address the latter issue.
Throughout we consider a Bayesian treatment of this problem. The posterior distribution of the test labels and the Markov random field parameters can be expressed as

π(y*, β, σ | x, y, x*) ∝ π(y, y* | β, σ, x, x*) π(β) π(σ),
where π(β) and π(σ) are prior distributions for β and σ, respectively. Note, however, that the first term on the right-hand side above depends on the intractable normalising constant (3). In fact, the number of test labels is often much greater than the number of training labels, and so the resulting normalising constant for the distribution π(y, y* | β, σ, x, x*) involves a summation over G^{n+m} terms, where as before n, m and G are the number of training data points, test data points and class labels, respectively.
A more pragmatic alternative is to consider the posterior distribution of the unknown parameters given the training class labels,

π(β, σ | x, y) ∝ π(y | β, σ, x) π(β) π(σ) = ( q(y | β, σ, x) / z(β, σ) ) π(β) π(σ),
where now the normalising constant depends on G^n terms. Test class labels can then be predicted by averaging over the posterior distribution of the training data,

π(y_{n+i} | x, y, x_{n+i}) = ∫∫ π(y_{n+i} | y, β, σ, x, x_{n+i}) π(β, σ | x, y) dβ dσ.
Obviously, this assumes that the test class labels y* are mutually independent given the training data, which will typically be an unreasonable assumption. Clearly, this is not ideal from the Bayesian perspective. Nevertheless, it should reduce the computational complexity of the problem dramatically.
In practice, we can estimate the predictive probability of y_{n+i} as an ergodic average

π(y_{n+i} | x, y, x_{n+i}) ≈ (1/J) Σ_{j=1}^J π(y_{n+i} | y, β^{(j)}, σ^{(j)}, x, x_{n+i}),
where β^{(j)}, σ^{(j)} are samples from the posterior distribution π(β, σ | x, y).
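This ergodic average is straightforward to compute once posterior draws are available; a sketch, where cond_probs is an assumed stand-in for the model's full-conditional class probabilities:

```python
import numpy as np

def predictive_probs(cond_probs, betas, sigmas):
    """Ergodic-average estimate of the predictive class probabilities.

    cond_probs(beta, sigma) -> length-G array of class probabilities for a
    single test point; it is an assumed, model-specific ingredient.
    betas, sigmas are matched posterior draws of (beta, sigma)."""
    probs = [cond_probs(b, s) for b, s in zip(betas, sigmas)]
    return np.mean(probs, axis=0)      # average over the posterior sample
```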
A standard approach to approximating the distribution of a Markov random field is the pseudolikelihood approximation, first proposed in Besag (1974). This approximation consists of a product of easily normalised full-conditional distributions. For our model, we can write a pseudolikelihood approximation as

q_PL(y | β, σ, x) = Π_{i=1}^n π(y_i | y_{-i}, β, σ, x).

This yields a fast approximation to the posterior distribution; however, it ignores dependencies beyond first order.
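As a sketch, the log pseudolikelihood can be computed by normalising each full-conditional over the G classes; the weight-matrix representation used here is an assumption for illustration:

```python
import numpy as np

def log_pseudolikelihood(y, W, beta, G):
    """Besag (1974) pseudolikelihood: the sum over i of the log of the
    (easily normalised) full-conditional pi(y_i | y_-i, beta, sigma, x).

    W is an assumed n x n weight matrix with W[i, j] = w_i^j and a zero
    diagonal; sigma enters only through W."""
    total = 0.0
    for i in range(len(y)):
        # unnormalised log full-conditional score for each class g
        score = np.array([beta * W[i, y == g].sum() for g in range(G)])
        log_z = np.log(np.exp(score - score.max()).sum()) + score.max()
        total += score[y[i]] - log_z       # log pi(y_i | y_-i)
    return total
```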
The main computational burden is sampling from the posterior distribution π(β, σ | x, y).
A naive implementation of a Metropolis-Hastings algorithm proposing to move from (β, σ) to (β′, σ′) would require calculation of the following ratio at each sweep of the algorithm:

( q(y | β′, σ′, x) π(β′) π(σ′) / q(y | β, σ, x) π(β) π(σ) ) × ( z(β, σ) / z(β′, σ′) ).   (4)
The intractability of the normalising constants z(β, σ) and z(β′, σ′) makes this algorithm unworkable. There has been work tackling the problem of sampling from such complicated distributions, for example Møller et al. (2006). The algorithm presented in that paper overcomes, to a large extent, the problem of sampling from a distribution with an intractable normalising constant. However, the algorithm can result in an MCMC chain with poor mixing among the parameters. The algorithm of Møller et al. (2006) has been extended and improved by Murray, Ghahramani and MacKay (2006).
The algorithm samples from an augmented distribution

π(β′, σ′, y′, β, σ | y, x) ∝ π(y | β, σ, x) π(β) π(σ) h(β′, σ′ | β, σ) π(y′ | β′, σ′, x),   (5)
where π(y′ | β′, σ′, x) is the same distance nearest neighbour distribution as that of the training data y. The distribution h(β′, σ′ | β, σ) is any arbitrary distribution for the augmented variables (β′, σ′), which might depend on the variables (β, σ); for example, a random walk distribution centred at (β, σ). It is clear that the marginal distribution of (5) for the variables β and σ is the posterior distribution of interest. The algorithm can be written in the following concise way:
1. Gibbs update of (β′, σ′, y′):
(i) Draw (β′, σ′) ∼ h(·, · | β, σ).
(ii) Draw y′ ∼ π(· | β′, σ′, x).
2. Exchange move: propose to move from (β, σ, y), (β′, σ′, y′) to (β′, σ′, y), (β, σ, y′) with probability

min{ 1, [ q(y | β′, σ′, x) q(y′ | β, σ, x) π(β′) π(σ′) h(β, σ | β′, σ′) ] / [ q(y | β, σ, x) q(y′ | β′, σ′, x) π(β) π(σ) h(β′, σ′ | β, σ) ] }.
Note that in Step 2 all intractable normalising constants cancel above and below the fraction. The difficult step of the algorithm in the context of the d-nn model is Step 1(ii), since this requires a draw from π(y′ | β′, σ′, x). Perfect sampling (Propp and Wilson 1996) is often possible for Markov random field models; a pragmatic alternative, however, is to sample from π(· | β′, σ′, x) by standard MCMC methods, for example Gibbs sampling, and take a realisation from a long run of the chain as an approximate draw from the distribution. Note that this is the approach that Cucala et al. (2009) take. They argue that perfect sampling is possible for the pk-nn algorithm in the case where there are two classes, but that the time to coalescence can be prohibitively large. They note that perfect sampling for more than two classes is not yet available.
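A minimal sketch of one exchange update for (β, σ), assuming placeholder functions log_q (unnormalised log-likelihood), sample_aux (the approximate auxiliary draw discussed above) and log_prior; all three names are illustrative, not from the paper:

```python
import math
import random

def exchange_step(y, beta, sigma, log_q, sample_aux, log_prior,
                  step=0.1, rng=random.Random(1)):
    """One exchange-algorithm update of (beta, sigma).

    log_q(labels, b, s): unnormalised log-likelihood log q(labels | b, s, x);
    sample_aux(b, s):    approximate draw y' from pi(. | b, s, x);
    log_prior(b, s):     log pi(b) + log pi(s).
    All three are assumed, model-specific ingredients."""
    beta_p = beta + rng.gauss(0.0, step)       # symmetric random-walk h
    sigma_p = sigma + rng.gauss(0.0, step)
    y_aux = sample_aux(beta_p, sigma_p)        # auxiliary draw at the proposal
    # the intractable normalising constants z cancel in this log-ratio
    log_alpha = (log_q(y, beta_p, sigma_p) - log_q(y, beta, sigma)
                 + log_q(y_aux, beta, sigma) - log_q(y_aux, beta_p, sigma_p)
                 + log_prior(beta_p, sigma_p) - log_prior(beta, sigma))
    if math.log(rng.random()) < log_alpha:     # exchange move accepted
        return beta_p, sigma_p
    return beta, sigma
```

Working on the log scale avoids overflow in q; note the acceptance ratio matches Step 2 above with a symmetric h.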
Note that this algorithm has some similarities with Approximate Bayesian Computation (ABC) methods (Sisson, Fan and Tanaka 2007) in the sense that ABC algorithms also rely on drawing exact values from analytically intractable distributions. By contrast, however, ABC algorithms rely on comparing summary statistics of the auxiliary data to summary statistics of the observed data. Finally, note that the Metropolis-Hastings ratio in Step 2 above, after re-arranging some terms and assuming that h(β, σ | β′, σ′) is symmetric, can be written as

[ q(y | β′, σ′, x) π(β′) π(σ′) q(y′ | β, σ, x) ] / [ q(y | β, σ, x) π(β) π(σ) q(y′ | β′, σ′, x) ].
Comparing this to (4), we see that the ratio of normalising constants, z(β, σ)/z(β′, σ′), is replaced by q(y′ | β, σ, x)/q(y′ | β′, σ′, x), which itself can be interpreted as a single-sample importance sampling estimate of z(β, σ)/z(β′, σ′), since

E_{y′ ∼ π(· | β′, σ′, x)} [ q(y′ | β, σ, x) / q(y′ | β′, σ′, x) ] = Σ_{y′} ( q(y′ | β, σ, x) / q(y′ | β′, σ′, x) ) ( q(y′ | β′, σ′, x) / z(β′, σ′) ) = z(β, σ) / z(β′, σ′).
The performance of our algorithm is illustrated in a variety of settings. We begin by testing the algorithm on a collection of benchmark datasets and follow this by exploring two real datasets with high-dimensional feature vectors. Matlab computer code and all of the datasets (test and training) used in this paper can be found at mathsci.ucd.ie/~nial/dnn/.
In this section we present results for our model, and in each case we compare results with the k-nn algorithm for well-known benchmark datasets. A summary description of each dataset is presented in Table 1. In all situations, the training dataset was approximately 25% of the size of the overall dataset, thereby presenting a challenging scenario for the various algorithms. Note that the sizes of the datasets range from quite small, in the case of the iris dataset, to reasonably large, in the case of the forensic dataset. In all examples, the data were standardised to give transformed features with zero mean and unit variance. In the Bayesian model, non-informative N(0, 50²) and U(0, 100) priors were chosen for β and σ, respectively. Each d-nn algorithm was run for 20,000 iterations, with the first 10,000 serving as burn-in iterations. The auxiliary chain within the exchange algorithm was run for 1,000 iterations. The k-nn algorithm was computed for values of k from 1 to half the number of feature vectors in the training set. In terms of computational run time, the d-nn algorithms took, depending on the size of the dataset, between 1 and 12 hours to run using Matlab code on a 2GHz desktop machine.
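The standardisation step can be written compactly. Whether the moments were estimated from the training set alone or from the pooled data is not stated above, so the version below, which uses training-set moments, is an assumption:

```python
import numpy as np

def standardise(train, test):
    """Zero-mean, unit-variance transform; the moments are estimated on the
    training features (an assumption) and applied to both sets."""
    mu = train.mean(axis=0)
    sd = train.std(axis=0)
    sd[sd == 0] = 1.0              # guard against constant features
    return (train - mu) / sd, (test - mu) / sd
```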
A summary of misclassification error rates is presented in Table 2 for various benchmark datasets. In almost all of the situations d-nn_1 and d-nn_3 perform at least as well as k-nn, and often considerably better. In general, d-nn_1 and d-nn_3 performed better than d-nn_2. A possible explanation for this may be the cut-off nature of the weight function in the d-nn_2 model: if a point x_i has no neighbours inside a ball of radius σ, then w^j_i is uniform over the entire training set, and consequently distance has no effect. By contrast, both the d-nn_1 and d-nn_3 models have weight functions which depend on distance and smoothly converge to a uniform distribution as σ → ∞ and σ → 0, respectively. As before, non-informative normal N(0, 50²) and uniform U(0, 10) priors were chosen for β and σ, respectively. In the exchange algorithm, the auxiliary chain was run for 1,000 iterations, and the overall chain was run for 20,000 iterations, of which the first 10,000 were discarded as burn-in. The overall acceptance rate for the exchange algorithm was around 25% for each of the d-nn models.
The misclassification error rate for leave-one-out cross-validation on the training dataset is minimised at k = 3 and k = 4; see Figure 1(a). At these two values, the k-nn algorithm yielded misclassification error rates of 35% and 39%, respectively, on the test dataset; see Figure 1(b). By comparison, the d-nn_1, d-nn_2 and d-nn_3 models achieved misclassification error rates of 29%, 33% and 27%, respectively, on the test dataset. This example further illustrates the value of the d-nn models.
This example concerns classifying Greek olive-oil samples, again based on infra-red spectroscopy. Here 65 samples of Greek virgin olive-oil were collected. The aim of this study was to see if these measurements could be used to classify each olive-oil sample to the correct geographical region. There were 3 possible classes: Crete (18 locations), Peloponnese (28 locations) and other regions (19 locations).
In our experiment the data were randomly split into a training set of 25 observations and a test set of 40 observations. In the training dataset the proportion of class labels was similar to that in the complete dataset.
In the Bayesian model, non-informative N(0, 50²) and U(0, 100) priors were chosen for β and σ. In the exchange algorithm, the auxiliary chain was run for 1,000 iterations, and the overall chain was run for 50,000 iterations, of which the first 20,000 were discarded as burn-in. The overall acceptance rate for the exchange algorithm was around 15% for each of the Markov chains.
The d-nn_1, d-nn_2 and d-nn_3 models achieved misclassification rates of 20%, 26% and 20%, respectively. In terms of comparison with the k-nn algorithm, leave-one-out cross-validation was minimised at k = 3 for the training dataset; see Figure 2(a). The misclassification rate at this value of k was 29% on the test dataset; see Figure 2(b).
It is again encouraging that the d-nn algorithms yielded improved misclassification rates by comparison.
In terms of providing a probabilistic approach to a Bayesian analysis of supervised learning, our work builds on that of Cucala et al. (2009) and shares many of the advantages of that approach, providing a sound setting for Bayesian inference. The most likely allocations for the test dataset can be evaluated, together with the uncertainty that goes with them. This makes it possible to determine regions where allocation to specific classes is uncertain. In addition, the Bayesian framework allows for an automatic approach to choosing weights for neighbours or neighbourhood sizes. The present paper also addresses the computational difficulties related to the well-known issue of the intractable normalising constant for discrete exponential family models. While Cucala et al. (2009) demonstrated that MCMC sampling is a practical alternative to the perfect sampling scheme of Møller et al. (2006), there remain difficulties with their implementation of that approach, namely the choice of an auxiliary distribution. To partially overcome the difficulties of a poor choice, Cucala et al. (2009) use an adaptive algorithm in which the auxiliary distribution is defined using historical values from the Monte Carlo run. We use an alternative approach based on the exchange algorithm, which avoids this choice and adaptation, has very good mixing properties and is therefore also computationally efficient.
An issue with the neighbourhood model of Cucala et al. (2009), which is an Ising or Boltzmann type model, is that it is necessary to define an upper value for the association parameter β. This upper value arises from the phase transition of the model, which is known for a regular neighbourhood structure but has to be investigated empirically for the probabilistic neighbourhood model. Our distance nearest neighbour models avoid this difficulty.
Our approach is also robust to outliers, whereas under the nearest neighbour approaches an outlying point always has k neighbours and is therefore classified according to points which, although they are its nearest neighbours, may be distant and effectively unrelated to it.