Automated Model Selection for Generalized Linear Models

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

In this paper, we show how mixed-integer conic optimization can be used to combine feature subset selection with holistic generalized linear models, fully automating the model selection process. Concretely, we directly optimize the Akaike and Bayesian information criteria while imposing constraints that address multicollinearity in the feature selection task. To this end, we propose a novel pairwise correlation constraint that combines the sign-coherence constraint with ideas from classical statistical models such as Ridge regression and the OSCAR model.


💡 Research Summary

The paper presents a unified framework for fully automated model selection in generalized linear models (GLMs) by formulating the feature subset selection (FSS) problem as a mixed‑integer conic optimization task that directly minimizes information criteria such as AIC and BIC. Traditional best subset selection (BSS) focuses on a fixed number of predictors and is NP‑hard; most practical approaches resort to heuristics, LASSO‑type relaxations, or piecewise‑linear approximations of the log‑likelihood. In contrast, the authors embed the GLM log‑likelihood itself into conic form: Gaussian regression uses a second‑order cone, while logistic and Poisson regressions are expressed with the exponential cone. This exact conic representation enables modern conic solvers to obtain globally optimal solutions without approximating the likelihood.
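The exact conic formulation requires a specialized solver, but the objective it targets can be illustrated with a brute-force sketch: for a Gaussian GLM, enumerate candidate feature subsets and score each by AIC. This is a minimal illustration assuming only numpy; the function names are ours, and exhaustive enumeration is tractable only for small p, unlike the paper's mixed-integer conic approach.

```python
import itertools
import numpy as np

def gaussian_aic(y, X, subset):
    """AIC (up to an additive constant) of an OLS fit on the given
    feature subset, i.e. the Gaussian special case of a GLM."""
    n = len(y)
    Xs = X[:, list(subset)]
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    rss = float(np.sum((y - Xs @ beta) ** 2))
    k = len(subset) + 1  # regression coefficients plus the variance parameter
    return n * np.log(rss / n) + 2 * k

def best_subset_aic(y, X):
    """Enumerate every non-empty subset and return the AIC-minimizing pair."""
    p = X.shape[1]
    best = None
    for r in range(1, p + 1):
        for subset in itertools.combinations(range(p), r):
            aic = gaussian_aic(y, X, subset)
            if best is None or aic < best[0]:
                best = (aic, subset)
    return best

# Synthetic data: only features 0 and 2 carry signal.
rng = np.random.default_rng(0)
n, p = 200, 6
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=n)
aic, subset = best_subset_aic(y, X)
# features 0 and 2 should appear in the selected subset
```

Directly minimizing AIC in this way, rather than fixing the subset size in advance, is exactly what the conic formulation achieves at scale.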

A central contribution is a novel pairwise correlation constraint designed to mitigate multicollinearity. The constraint merges three ideas: (i) sign coherence, which forces the coefficients of highly correlated predictors to share the same sign; (ii) Ridge-type equal-magnitude shrinkage, which encourages the selected coefficients to have comparable absolute values; and (iii) OSCAR-style clustering, which can force exact equality among groups of coefficients. Mathematically, for any pair of predictors i and j the constraint either enforces |β_i| = |β_j| or imposes a correlation-dependent big-M bound of the form |β_i|·|β_j| ≤ τ·corr(X_i, X_j)·M; both cases are linearized and incorporated as mixed-integer linear constraints. This formulation reduces variance inflation, improves interpretability, and stabilizes the solution in the presence of strongly correlated predictors.
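As a concrete illustration of the sign-coherence component, the hypothetical helper below checks whether a fitted coefficient vector satisfies the condition for every predictor pair whose absolute correlation exceeds a threshold τ. In the paper this condition is enforced inside the MIP via linearized big-M constraints rather than verified after the fact; the function name and exact check are ours.

```python
import numpy as np

def sign_coherent(beta, corr, tau=0.8):
    """Return True if, for every pair (i, j) with |corr[i, j]| >= tau,
    the coefficient signs agree with the sign of the correlation
    (illustrative post-hoc check, not the paper's MIP constraint)."""
    p = len(beta)
    for i in range(p):
        for j in range(i + 1, p):
            if abs(corr[i, j]) >= tau:
                # A violation: e.g. positively correlated predictors
                # carrying coefficients of opposite sign.
                if np.sign(corr[i, j]) * beta[i] * beta[j] < 0:
                    return False
    return True
```

For example, with two predictors correlated at 0.9 and τ = 0.8, the coefficient vectors (1, 2) and (1, −2) pass and fail the check, respectively; raising τ above 0.9 deactivates the constraint for that pair.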

The paper also addresses the well‑known separation problem in binomial GLMs, where perfect separation leads to infinite maximum‑likelihood estimates. The authors propose a pre‑check based on a linear program that verifies the existence of a finite solution before invoking the conic optimizer, thereby preventing solvers from returning spurious unbounded results.
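The idea of such a pre-check can be sketched as a feasibility linear program: the data are completely separated if and only if some hyperplane classifies every point with positive margin. A minimal sketch follows, assuming scipy is available; the paper's exact LP formulation may differ, and the encoding below (labels in {−1, +1}, margin normalized to 1) is our choice for illustration.

```python
import numpy as np
from scipy.optimize import linprog

def is_separable(X, y):
    """Feasibility LP for complete separation, labels y in {-1, +1}.
    Separation holds iff some w satisfies y_i * (x_i @ w) >= 1 for all i,
    i.e. the constraint system -y_i * x_i^T w <= -1 is feasible.
    (Sketch of a separation pre-check; not the paper's exact LP.)"""
    n, d = X.shape
    A_ub = -(y[:, None] * X)       # rows: -y_i * x_i^T
    b_ub = -np.ones(n)
    res = linprog(c=np.zeros(d),   # zero objective: pure feasibility test
                  A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * d)  # w is free, not nonnegative
    return res.status == 0         # 0 = feasible/optimal, 2 = infeasible

# 1-D example with an intercept column: classes split at x = 2 are
# separable; interleaved classes are not.
X_sep = np.array([[0.0, 1.0], [1.0, 1.0], [3.0, 1.0], [4.0, 1.0]])
X_mix = np.array([[0.0, 1.0], [2.0, 1.0], [1.0, 1.0], [3.0, 1.0]])
y_lab = np.array([-1.0, -1.0, 1.0, 1.0])
```

Running this check before invoking the conic optimizer is what prevents the solver from chasing the unbounded likelihood that perfect separation induces.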

Extensive simulation studies and experiments on real‑world datasets demonstrate that the proposed method consistently yields lower AIC/BIC values than stepwise regression, LASSO, hybrid coordinate‑descent plus combinatorial search, and other heuristic FSS techniques. In scenarios with severe multicollinearity, conventional methods either flip coefficient signs or suffer numerical instability, whereas the new pairwise correlation constraint maintains stable, sign‑consistent estimates. Moreover, the framework naturally extends to Poisson regression, marking the first use of conic optimization for FSS in count data models.

In summary, the paper’s three key innovations are: (1) an exact conic‑programming formulation of GLM FSS that directly optimizes AIC/BIC, (2) a robust pairwise correlation constraint that simultaneously enforces sign coherence, equal magnitude shrinkage, and coefficient clustering to combat multicollinearity, and (3) a systematic separation detection step for binomial models. By integrating statistical model selection criteria with state‑of‑the‑art mixed‑integer conic solvers, the work bridges the gap between statistical theory and optimization practice, offering a scalable, interpretable, and fully automated solution for high‑dimensional GLM analysis.

