Mutual information for the selection of relevant variables in spectrometric nonlinear modelling
Data from spectrophotometers form vectors of a large number of exploitable variables. Building quantitative models from these variables most often requires working with a smaller set than the initial one: too many input variables means too many model parameters, leading to overfitting and poor generalization. In this paper, we suggest using the mutual information measure to select variables from the initial set. Mutual information measures the information content of input variables with respect to the model output, without making any assumption about the model that will be used; it is thus suitable for nonlinear modelling. In addition, it selects variables from the initial set rather than forming linear or nonlinear combinations of them. Without decreasing model performance compared to variable-projection methods, it therefore allows greater interpretability of the results.
💡 Research Summary
Spectroscopic measurements generate high‑dimensional data sets, often containing hundreds or thousands of wavelength‑specific absorbance values. Using all of these variables directly in a regression or classification model leads to an explosion in the number of model parameters, which in turn raises the risk of over‑fitting and degrades the model’s ability to generalize to new samples. Conventional dimensionality‑reduction techniques such as Principal Component Regression (PCR) and Partial Least Squares (PLS) address this problem by projecting the original variables onto a smaller set of latent components. While effective at reducing dimensionality, these projection‑based methods produce new variables that are linear combinations of the original wavelengths, making it difficult to interpret which physical spectral features drive the model’s predictions.
The paper proposes a different strategy: selecting a subset of the original variables based on mutual information (MI), an information‑theoretic measure that quantifies the amount of shared information between two random variables. Unlike correlation‑based metrics, MI captures both linear and nonlinear dependencies without assuming any particular functional form for the relationship between input and output. Consequently, MI is well suited for the nonlinear modelling tasks that are common in chemometrics and spectroscopic analysis.
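This distinction can be seen in a small sketch (not from the paper; it assumes scikit-learn's `mutual_info_regression`, a k-NN-based MI estimator of the kind discussed in the Methodology below): for a purely quadratic dependence, the Pearson correlation is near zero while the estimated MI is clearly positive.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = x**2 + 0.1 * rng.normal(size=1000)  # purely nonlinear dependence on x

# Linear correlation misses the relationship almost entirely...
r = np.corrcoef(x, y)[0, 1]

# ...while the k-NN MI estimate detects it (MI is in nats, >= 0).
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]

print(f"Pearson r = {r:.3f}, MI = {mi:.2f}")
```

Correlation-based screening would discard `x` here; an MI-based ranking would retain it.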
Methodology
- Pre‑processing – Raw spectra are baseline‑corrected, smoothed, and normalized to a common scale.
- Probability density estimation – Because MI requires probability distributions, the authors employ non‑parametric estimators. They compare kernel‑density estimation (KDE) with a k‑nearest‑neighbour (k‑NN) based entropy estimator, ultimately favouring the k‑NN approach for its lower sensitivity to bandwidth selection and its computational efficiency (approximately O(N·k·log N)).
- Variable ranking – For each wavelength variable X_i, the MI with the target variable Y (e.g., concentration) is computed as I(X_i; Y). Variables are sorted by descending MI.
- Redundancy correction – To avoid selecting variables that convey largely the same information, a conditional MI term is introduced. When a candidate variable X_j is evaluated after a set S of already-selected variables, the effective contribution is approximated as I_eff(X_j; Y | S) = I(X_j; Y) − Σ_{X_i ∈ S} I(X_j; X_i). This penalises variables that are highly correlated with the current set.
- Stopping criteria – The selection loop stops when (a) the cross‑validated root‑mean‑square error of prediction (RMSEP) improves by less than a predefined threshold, (b) a maximum number of variables is reached, or (c) computational budget is exhausted.
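The ranking-plus-redundancy loop above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes scikit-learn's `mutual_info_regression` as the k-NN MI estimator and, for brevity, stops at a fixed variable count rather than the RMSEP-based criterion, which would wrap this loop in a cross-validation check.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression


def select_variables(X, y, n_select=10, n_neighbors=5):
    """Greedy forward selection: rank variables by MI with y,
    penalising redundancy with already-selected variables."""
    n_vars = X.shape[1]
    # Relevance: I(X_i; Y) for every candidate wavelength (k-NN estimator).
    relevance = mutual_info_regression(
        X, y, n_neighbors=n_neighbors, random_state=0)
    selected, remaining = [], list(range(n_vars))
    redundancy = np.zeros(n_vars)  # running sum of I(X_j; X_i) over selected X_i
    while remaining and len(selected) < n_select:
        # Effective contribution: I(X_j; Y) - sum_{X_i in S} I(X_j; X_i)
        scores = relevance[remaining] - redundancy[remaining]
        j = remaining[int(np.argmax(scores))]
        selected.append(j)
        remaining.remove(j)
        if remaining:
            # Update the redundancy penalty with MI against the new pick.
            redundancy[remaining] += mutual_info_regression(
                X[:, remaining], X[:, j],
                n_neighbors=n_neighbors, random_state=0)
    return selected
```

The redundancy update is the subtractive penalty from the formula above; because the k-NN estimator is nonparametric, the same loop handles linear and nonlinear dependencies alike.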
Experimental Evaluation
Two real spectroscopic data sets are used: a visible‑near‑infrared (Vis‑NIR) data set and a mid‑infrared (MIR) data set, each paired with a quantitative reference measurement (e.g., sugar content). The MI‑based selector is benchmarked against PCR, PLS, Random Forest variable importance, and LASSO. Performance is assessed using 10‑fold cross‑validation, focusing on (i) prediction error (RMSEP), (ii) number of selected variables, and (iii) interpretability of the chosen wavelengths.
Results show that the MI‑based method achieves prediction errors comparable to, and in some cases slightly better than, PCR and PLS (typically within 2–5 % of the best RMSEP). Crucially, the number of retained wavelengths is reduced to roughly 10–15 % of the original dimensionality. The selected wavelengths correspond to known absorption peaks (e.g., O–H, C–H stretching bands), providing a direct chemical interpretation that projection‑based methods cannot offer. Random Forest and LASSO also reduce dimensionality but tend to select variables that are part of complex linear combinations, making post‑hoc interpretation more difficult.
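The cross-validated RMSEP used for benchmarking can be computed as below (a sketch; the estimator and data are placeholders, and the paper's exact CV protocol may differ in detail):

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.linear_model import LinearRegression


def rmsep_cv(model, X, y, n_splits=10):
    """Root-mean-square error of prediction under k-fold cross-validation:
    every sample is predicted by a model that never saw it during fitting."""
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    y_hat = cross_val_predict(model, X, y, cv=cv)
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))


# Example with a plain linear model standing in for the regressor under test.
# rmsep = rmsep_cv(LinearRegression(), X_selected, y)
```

In the paper's comparison, the same RMSEP protocol is applied to each method's selected or projected variables, so differences reflect the variable-reduction step rather than the CV scheme.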
Computational Considerations
The k‑NN MI estimator scales well with sample size; for data sets of 500–1000 spectra the full selection process completes in under two minutes on a standard workstation. KDE, while accurate, requires careful bandwidth tuning and is more computationally intensive. The authors recommend using k = 5–10 neighbours for a good trade‑off between bias and variance.
Limitations and Future Work
MI estimation can become unstable when the number of samples is small, leading to biased information estimates. The paper suggests bootstrapping or Bayesian density estimation as possible remedies. Moreover, the current approach evaluates MI for each variable independently; extending the framework to multivariate MI (e.g., I({X_i, X_j}; Y)) would allow direct assessment of interaction effects. Finally, integrating MI‑based pre‑selection with deep learning regressors is highlighted as a promising direction, potentially combining the interpretability of variable selection with the expressive power of neural networks.
Conclusion
By leveraging mutual information, the authors present a variable‑selection technique that (1) respects the nonlinear nature of spectroscopic relationships, (2) selects actual wavelengths rather than abstract latent components, and (3) maintains or slightly improves predictive performance while dramatically enhancing model interpretability. This makes the method highly attractive for chemometric applications where both accurate quantification and insight into the underlying spectral chemistry are required.