Bandwidth selection for kernel estimation in mixed multi-dimensional spaces


📝 Original Info

  • Title: Bandwidth selection for kernel estimation in mixed multi-dimensional spaces
  • ArXiv ID: 0709.1920
  • Date: 2007-09-12
  • Authors: Researchers from original ArXiv paper

📝 Abstract

Kernel estimation techniques, such as mean shift, suffer from one major drawback: the kernel bandwidth selection. The bandwidth can be fixed for the whole data set or can vary at each point. Automatic bandwidth selection becomes a real challenge in the case of multidimensional heterogeneous features. This paper presents a solution to this problem. It is an extension of \cite{Comaniciu03a}, which was based on the fundamental property of normal distributions regarding the bias of the normalized density gradient. The selection is done iteratively for each type of feature, by looking for the stability of local bandwidth estimates across a predefined range of bandwidths. A pseudo balloon mean shift filtering and partitioning are introduced. The validity of the method is demonstrated in the context of color image segmentation based on a 5-dimensional space.

📄 Full Content

shift" search for the modes of a kernel density estimation. The non-parametric aspect of the approach makes it very versatile for analyzing arbitrary feature spaces. Hierarchical clustering methods are also non-parametric. However, they are computationally expensive and defining the stopping criterion is not simple. These reasons explain why mean shift clustering has recently become so popular in computer vision applications.

Mean shift was first introduced by Fukunaga [9] and later revisited by Cheng [3]. It has since been widely studied, in particular by Comaniciu [7,6,5]. Mean shift is an iterative gradient ascent method used to locate the density modes of a cloud of points, i.e. the local maxima of its density. The density is estimated through a kernel density estimation. The difficulty is to define the size of the kernel, i.e. the bandwidth matrix. The value of the bandwidth matrix strongly influences the results of the mean shift clustering.
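The iteration described above can be sketched as follows. This is a minimal illustration, assuming a Gaussian kernel and a single scalar bandwidth h (i.e. a bandwidth matrix of the form h² I); the function name mean_shift_mode and the synthetic two-cluster data are hypothetical choices, not taken from the paper.

```python
import numpy as np

def mean_shift_mode(x0, points, h, max_iter=200, tol=1e-6):
    """Move x0 uphill on a Gaussian kernel density estimate until it stops moving."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        # Gaussian weight of every sample relative to the current position
        sq_dist = np.sum((points - x) ** 2, axis=1)
        w = np.exp(-0.5 * sq_dist / h**2)
        # The weighted mean of the samples is the next position
        x_new = w @ points / w.sum()
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Synthetic data (assumption): two well-separated 2-D Gaussian clusters
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.3, size=(200, 2)),
    rng.normal(loc=[4.0, 4.0], scale=0.3, size=(200, 2)),
])

mode_a = mean_shift_mode(points[0], points, h=0.5)   # starts in the first cluster
mode_b = mean_shift_mode(points[-1], points, h=0.5)  # starts in the second cluster
```

Points whose trajectories converge to the same mode are grouped into one cluster. Changing h changes which modes survive, which is why the bandwidth choice matters so much.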

There are two types of bandwidth matrices. Fixed bandwidth matrices are the same for the whole data set. In contrast, variable bandwidth matrices vary along the set and capture the local characteristics of the data. The second type is more appropriate for real scenes: a fixed bandwidth degrades the estimation performance by undersmoothing the tails of the density and oversmoothing the peaks. A variable bandwidth mean shift procedure has been introduced in [5]. It is based on the sample point density estimator [10]. The bias of this estimator is lower than that of fixed-bandwidth estimators, while the covariance remains the same. Choosing a good value for the bandwidth matrix is essential for the variable bandwidth mean shift. Indeed, when the bandwidth is not selected properly, the performance is often worse than with a fixed bandwidth.

Another variable bandwidth estimator is the balloon estimator. It suffers from several drawbacks and has therefore never been used in a mean shift algorithm. However, it has been shown in [23] that this estimator gives better results than the fixed bandwidth and the sample point estimators when the dimensionality of the data is higher than three. Hence, in section 3.3.2, we will propose a new mean shift clustering algorithm based on the balloon estimator.
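For reference, the three estimators discussed above are usually written as below. The notation is the standard textbook one and is an assumption here, since the section of the paper that defines them is not part of this excerpt: n samples x_i, a kernel K, and a symmetric positive definite bandwidth matrix H.

```latex
% K_H(z) = |H|^{-1/2} K(H^{-1/2} z): kernel K scaled by the bandwidth matrix H
\begin{align*}
\hat f_{\text{fixed}}(x)   &= \frac{1}{n}\sum_{i=1}^{n} K_{H}(x - x_i)
  && \text{one } H \text{ for the whole data set} \\
\hat f_{\text{sample}}(x)  &= \frac{1}{n}\sum_{i=1}^{n} K_{H(x_i)}(x - x_i)
  && H \text{ attached to each sample } x_i \\
\hat f_{\text{balloon}}(x) &= \frac{1}{n}\sum_{i=1}^{n} K_{H(x)}(x - x_i)
  && H \text{ attached to the estimation point } x
\end{align*}
```

The only difference between the two variable bandwidth estimators is where the bandwidth is evaluated: at the samples (sample point) or at the location where the density is being estimated (balloon).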

Bandwidth selection can be statistical analysis-based or task-oriented. Statistical analysis-based methods compute the best bandwidth by balancing the bias against the variance of the density estimate. Task-oriented methods rely on the stability of the feature space partitioning. For example, a semi-parametric bandwidth selection algorithm, well adapted to variable bandwidth mean shift, has been proposed by Comaniciu in [4,7]. It works as follows. Fixed-bandwidth mean shift partitionings are run on the data for several predefined bandwidth values. Each cluster obtained is described by a normal law. Then, for each point, the clusters to which it belongs across the range of predefined bandwidths are compared. The bandwidth finally selected for this point is the one, within the predefined range, that gave the most stable of these clusters. The results obtained for color segmentation were promising. However, this method has some limits. In particular, for multidimensional data points composed of independent features, the bandwidth for each feature subspace should be chosen independently. Indeed, the most stable cluster is not always the same for all the feature subspaces. A solution could be to define a set of bandwidths for each domain and to partition the data using all the possible bandwidth matrices resulting from the combination of the different sets. However, as the dimension becomes high and/or the sets of predefined bandwidths become large, the algorithm can become very computationally expensive.
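The cost of the exhaustive approach is easy to quantify: one full mean shift partitioning is needed per combination of candidate bandwidths, so the number of runs is the product of the set sizes. The small sketch below makes this concrete; the candidate values and the helper name grid_size are hypothetical and purely illustrative.

```python
from itertools import product
from math import prod

# Hypothetical candidate bandwidths for two feature domains (illustrative values only)
spatial_bandwidths = [2.0, 4.0, 8.0, 16.0]
color_bandwidths = [3.0, 5.0, 8.0, 12.0, 20.0]

# Exhaustive search: every (spatial, color) pair defines one bandwidth matrix,
# and each of them requires its own mean shift partitioning of the data
combos = list(product(spatial_bandwidths, color_bandwidths))
n_runs = len(combos)

def grid_size(set_sizes):
    """Number of partitionings needed when every combination is tried."""
    return prod(set_sizes)

# Three domains with 9 candidates each already need 9**3 runs
three_domain_runs = grid_size([9, 9, 9])
```

Selecting the bandwidth of one domain at a time instead requires a number of runs that grows with the sum of the set sizes rather than their product.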

In this paper we address the problem of data-driven bandwidth selection for multidimensional data composed of different independent features (a data point is a concatenation of different, possibly multidimensional features, thus living in a product of different feature spaces). As no statistically founded method exists for variable bandwidth and for high dimension, we concentrate on a task-oriented method, i.e. a method that relies on the stability of the feature space partitionings. Bandwidths are selected by iteratively applying the stability criterion of [4] to each feature space or domain. We also introduce a new pseudo balloon mean shift which is better adapted to high dimensional feature spaces than the variable bandwidth mean shift of [7].
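To illustrate what a product of feature spaces means for the kernel, the sketch below builds a bandwidth matrix for a 5-dimensional point made of a 2-D spatial part and a 3-D color part, and checks that the Gaussian kernel then factorizes into one kernel per domain. This is only an illustration under assumptions: one scalar bandwidth per domain, a Gaussian kernel, and hypothetical helper names (block_bandwidth, gaussian_kernel); the paper's actual bandwidth matrices may be more general.

```python
import numpy as np

def block_bandwidth(h_per_domain, dims_per_domain):
    """Diagonal bandwidth matrix with h_k**2 repeated over the dimensions of domain k."""
    diag = np.concatenate(
        [np.full(d, h**2) for h, d in zip(h_per_domain, dims_per_domain)]
    )
    return np.diag(diag)

def gaussian_kernel(z, H):
    """Multivariate Gaussian kernel K_H(z) for a bandwidth (covariance-like) matrix H."""
    d = z.shape[-1]
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(H))
    return np.exp(-0.5 * z @ np.linalg.inv(H) @ z) / norm

# Difference between two 5-D points: (x, y) position followed by three color channels
z = np.array([1.0, -2.0, 5.0, 0.0, -3.0])

# Assumed bandwidths: 8 for the spatial domain, 6 for the color domain
H = block_bandwidth([8.0, 6.0], [2, 3])

joint = gaussian_kernel(z, H)
k_spatial = gaussian_kernel(z[:2], 8.0**2 * np.eye(2))
k_color = gaussian_kernel(z[2:], 6.0**2 * np.eye(3))
```

Because the joint kernel equals k_spatial * k_color, the bandwidth of each domain can be changed without touching the others, which is what makes a domain-by-domain selection meaningful.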

We first recall some theory on kernel density estimation (section 2) and mean shift filtering (section 3), and introduce the pseudo balloon mean shift filtering and partitioning (subsection 3.3.1). In section 4, we present our bandwidth selection algorithm for multivariate data, and finally we show results of our algorithm for color c

…(Full text truncated)…

Reference

This content is AI-processed based on ArXiv data.