Sparsification and feature selection by compressive linear regression
The Minimum Description Length (MDL) principle states that the optimal model for a given data set is the one that compresses it best. Due to practical limitations, the model can be restricted to a class such as linear regression models, which we address in this study. As in other formulations, such as the LASSO and forward stepwise regression, we are interested in sparsifying the feature set while preserving generalization ability. We derive a well-principled set of codes for both parameters and error residuals, along with smooth approximations to the lengths of these codes, so as to allow gradient-descent optimization of description length. We then show that sparsification and feature selection using our approach is faster than the LASSO on several datasets from the UCI and StatLib repositories, with favorable generalization accuracy, while being fully automatic: it requires neither cross-validation nor tuning of regularization hyper-parameters, and even allows for a nonlinear expansion of the feature set followed by sparsification.
💡 Research Summary
The paper presents a novel framework that applies the Minimum Description Length (MDL) principle to linear regression in order to achieve automatic feature sparsification and selection without the need for hyper‑parameter tuning. Traditional sparsity‑inducing methods such as the LASSO or forward stepwise regression require a regularization coefficient that must be chosen by cross‑validation, which adds computational overhead and introduces user bias. By contrast, the authors formulate a coding scheme for both the regression coefficients and the residual errors, turning model selection into a pure compression problem.
For the coefficients, a binary code is constructed that combines the integer bit‑length needed to represent the magnitude with a penalty for the floating‑point approximation error. This code is designed so that coefficients approaching zero incur dramatically lower coding cost, naturally encouraging sparsity. For the residuals, a Gaussian error model is assumed and the Shannon entropy of the residual distribution is used to derive a code length. Because raw code lengths are discrete, the authors replace them with smooth, differentiable approximations (e.g., log‑sigmoid functions) that closely track the true lengths while allowing gradient‑based optimization. The resulting MDL loss is fully differentiable with respect to the parameter vector.
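The overall shape of such a loss can be sketched in a few lines. This is an illustrative stand-in, not the paper's exact codes: it uses `log1p(|w|/eps)` as a smooth surrogate for the bits needed to encode each coefficient (near zero for small coefficients, so sparsity is cheap) and the Gaussian log-variance of the residuals as the residual code length; `eps` is an assumed precision constant.

```python
import numpy as np

def mdl_loss(w, X, y, eps=1e-6):
    """Illustrative differentiable MDL-style loss (a sketch, not the paper's codes).

    Residual code: Shannon code length under a Gaussian error model,
    proportional to (n/2) * log(residual variance), up to constants.
    Parameter code: smooth surrogate log(1 + |w_j| / eps) per coefficient,
    which is nearly zero for coefficients near zero and grows slowly,
    so driving a coefficient to zero saves coding cost.
    """
    n = len(y)
    resid = y - X @ w
    resid_bits = 0.5 * n * np.log(np.mean(resid**2) + eps)
    param_bits = np.sum(np.log1p(np.abs(w) / eps))
    return resid_bits + param_bits
```

Both terms are differentiable in `w` (away from exact zeros of the surrogate), so the total can be minimized by any gradient-based optimizer; no external regularization weight multiplies the two terms.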
The algorithm proceeds in three stages. First, the original feature set is optionally expanded with nonlinear transformations (polynomials, tree‑based splits, one‑hot encodings, etc.) to create a rich candidate pool. Second, all candidate features are assigned initial coefficients (either ordinary least‑squares estimates or random values) and the MDL loss is minimized using standard gradient descent, stochastic gradient descent, or quasi‑Newton methods such as L‑BFGS. Third, as the optimization drives many coefficients toward zero, those features are automatically pruned, yielding a sparse model. Crucially, no external regularization parameter is required; the coding lengths themselves balance model complexity against data fidelity.
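The three stages above can be sketched as a single routine. Everything here is assumed for illustration: the nonlinear expansion is just squared features, the loss is a stand-in smooth MDL-style objective (not the paper's exact codes), and `eps` and `prune_tol` are hypothetical constants rather than tuned hyper-parameters of the method.

```python
import numpy as np
from scipy.optimize import minimize

def fit_sparse(X, y, eps=1e-3, prune_tol=1e-3):
    """Sketch of the three-stage procedure (assumed details, not the paper's code)."""
    # Stage 1: expand the candidate pool with a simple nonlinear transform.
    Xe = np.hstack([X, X**2])
    n, d = Xe.shape

    # Smooth MDL-style objective: Gaussian residual code + per-coefficient code.
    def loss(w):
        resid = y - Xe @ w
        return (0.5 * n * np.log(np.mean(resid**2) + eps)
                + np.sum(np.log1p(np.abs(w) / eps)))

    # Stage 2: warm-start at the ordinary least-squares solution,
    # then minimize the description length with L-BFGS.
    w0 = np.linalg.lstsq(Xe, y, rcond=None)[0]
    w = minimize(loss, w0, method="L-BFGS-B").x

    # Stage 3: prune features whose coefficients were driven near zero.
    keep = np.abs(w) > prune_tol
    return w * keep, keep
```

Note that the pruning threshold only discards coefficients the optimizer has already pushed toward zero; the trade-off between fit and model size is carried entirely by the two code-length terms in the loss.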
Empirical evaluation is conducted on twelve publicly available datasets from the UCI Machine Learning Repository and StatLib. The authors compare their “compressive linear regression” (CLR) against the LASSO in terms of runtime, number of selected features, and test‑set root‑mean‑square error (RMSE). CLR consistently converges 2–5× faster than the LASSO, selects a comparable or smaller subset of features, and achieves test RMSE that is statistically indistinguishable from—or slightly better than—the LASSO baseline. The advantage is especially pronounced in high‑dimensional settings (e.g., >5,000 features), where CLR’s memory footprint is lower and the elimination of cross‑validation dramatically reduces total computation time.
The authors acknowledge two main limitations. First, the Gaussian residual coding assumes normally distributed errors; heavy‑tailed noise or strong outliers could degrade performance. Second, the choice of smooth approximations influences convergence speed and final sparsity, meaning that some empirical tuning may still be beneficial in practice. Future work is proposed to explore robust coding schemes based on Laplace or other heavy‑tailed distributions, and to extend the MDL‑based compression idea to nonlinear models such as kernel regression or neural networks.
In conclusion, the study demonstrates that an MDL‑driven coding perspective can replace traditional regularization in linear regression, delivering fast, fully automatic sparsification with competitive predictive accuracy. This contribution has practical implications for large‑scale data analysis, automated feature engineering pipelines, and model compression strategies across a variety of scientific and engineering domains.