Fast Selection of Spectral Variables with B-Spline Compression

Reading time: 6 minute
...

📝 Original Info

  • Title: Fast Selection of Spectral Variables with B-Spline Compression
  • ArXiv ID: 0709.3639
  • Date: 2007-09-26
  • Authors: ** > 논문에 명시된 저자 정보가 제공되지 않았습니다. (원문에 저자 명단이 포함되어 있지 않음) **

📝 Abstract

The large number of spectral variables in most data sets encountered in spectral chemometrics often renders the prediction of a dependent variable uneasy. The number of variables hopefully can be reduced, by using either projection techniques or selection methods; the latter allow for the interpretation of the selected variables. Since the optimal approach of testing all possible subsets of variables with the prediction model is intractable, an incremental selection approach using a nonparametric statistics is a good option, as it avoids the computationally intensive use of the model itself. It has two drawbacks however: the number of groups of variables to test is still huge, and colinearities can make the results unstable. To overcome these limitations, this paper presents a method to select groups of spectral variables. It consists in a forward-backward procedure applied to the coefficients of a B-Spline representation of the spectra. The criterion used in the forward-backward procedure is the mutual information, allowing to find nonlinear dependencies between variables, on the contrary of the generally used correlation. The spline representation is used to get interpretability of the results, as groups of consecutive spectral variables will be selected. The experiments conducted on NIR spectra from fescue grass and diesel fuels show that the method provides clearly identified groups of selected variables, making interpretation easy, while keeping a low computational load. The prediction performances obtained using the selected coefficients are higher than those obtained by the same method applied directly to the original variables and similar to those obtained using traditional models, although using significantly less spectral variables.

💡 Deep Analysis

📄 Full Content

Prediction problems are often encountered in analytical spectral chemometrics. They require estimating the unknown value of a dependent variable from, for example, a near-infrared spectrum. Such problems may be encountered in the food (Ozaki et al., 1992), pharmaceutical (Blanco et al., 1999) and textile (Blanco et al., 1997) industry, to cite only a few.

Viewed from a statistical or data analysis perspective, the main difficulty in such problem is to cope with the colinearity between spectral variables: not only consecutive variables in a spectrum are highly correlated by nature, but in addition real applications usually concern databases with a low number of known spectra, and a high number of spectral variables. Any method built on the original spectral variables is thus ill-posed, making feature (spectral variable) selection and/or projection necessary.

Selection and projection methods differ by several aspects. Projection methods are more general by essence, as selection may be regarded as projection with many zero weights. However, projection methods usually build factors (latent variables) that are combinations of a large number of original features. Even if their prediction properties are good, they usually suffer from the fact that the latent variables are hardly interpretable in terms of original features (wavelengths in the case of infrared spectra). On the contrary, selection methods are based on the principle of choosing a small number of variables among the original ones, leading to easy interpretation. Of course, the challenge with selection methods is to obtain prediction performances of the same level as projection ones.

In this work, we are interested in variable selection methods providing interpretability. However, if the whole procedure consisting in selecting the features and building a prediction model on them is kept linear, it will certainly lead to poorer performances than the traditional and widely used PLS (Partial Least Squares), as the latter consists in a projection and a prediction. It is thus investigated how nonlinear models may be used, both for selecting the features and performing the prediction.

Nonlinear models could be used in a wrapper approach (Kohavi and John, 1997), in which their estimated generalization performances is used as a relevance criterion for a group of variables. This however, is very demanding in terms of computational load because resampling techniques must be used to estimate accurately the predicted error of the model, in addition to the fact that one model must be learned for each considered feature set. This paper thus focuses on the so-called filter approach: the features are selected prior the use of any prediction model. Among filter methods, the correlation is the standard criterion to be used for selecting features in a linear way: features with maximal correlation with the dependent (output) variable, and possibly with minimal information between them to avoid redundancy, are selected. Mutual information (see e.g. Cover and Thomas (1991)) extends the correlation to the measure of nonlinear dependencies, while correlation is strictly limited to linear ones. As an example, the correlation between a centered antisymmetric variable and its second power is zero, despite the fact they obviously depend one from another (though in a nonlinear way). The mutual information avoids this drawback, providing a more general and less restricted way to measure dependencies between variables.

The mutual information (MI) has already been used to select variables from near-infrared spectra (Rossi et al., 2006). Despite it provides a promising way to extend state-of-the-art spectral analysis to nonlinear methodologies, the direct selection of variables by MI suffers from some drawbacks. First, the MI estimation becomes difficult as the number of selected variables grows. Indeed in a forward procedure the estimation is faced to the curse of dimensionality, making the estimation of the MI with the last selected feature much more difficult than with the first selected one. Second, the low number of spectra usually available for learning makes the results of the selection highly dependent on the data set: a small change in the data can lead to different selected variable sets, resulting in difficult interpretation. Finally, even though the estimation of the mutual information is less demanding in terms of computation time than the construction of a nonlinear model, the large number of initial variables results in high computation times for the selection.

In this paper, we propose to first reduce the number of variables through a projection of the spectral features before the selection by mutual information. To maintain the interpretability despite the use of a projection, the latter is achieved by ensuring that each coordinate in the projection corresponds to a restricted set of initial features with consecutive wavelengths. The general methodology proposed

📸 Image Gallery

cover.png

Reference

This content is AI-processed based on open access ArXiv data.

Start searching

Enter keywords to search articles

↑↓
ESC
⌘K Shortcut