Catching Up Faster by Switching Sooner: A Prequential Solution to the AIC-BIC Dilemma
Bayesian model averaging, model selection and its approximations such as BIC are generally statistically consistent, but sometimes achieve slower rates of convergence than other methods such as AIC and leave-one-out cross-validation. On the other hand, these other methods can be inconsistent. We identify the “catch-up phenomenon” as a novel explanation for the slow convergence of Bayesian methods. Based on this analysis we define the switch distribution, a modification of the Bayesian marginal distribution. We show that, under broad conditions, model selection and prediction based on the switch distribution is both consistent and achieves optimal convergence rates, thereby resolving the AIC-BIC dilemma. The method is practical; we give an efficient implementation. The switch distribution has a data compression interpretation, and can thus be viewed as a “prequential” or MDL method; yet it is different from the MDL methods that are usually considered in the literature. We compare the switch distribution to Bayes factor model selection and leave-one-out cross-validation.
💡 Research Summary
The paper tackles the long‑standing “AIC‑BIC dilemma” by introducing a novel prequential approach called the switch distribution. The authors begin by observing that Bayesian model averaging and BIC, while asymptotically consistent, often suffer from slow convergence in finite samples. They attribute this to what they term the catch‑up phenomenon: the Bayesian posterior initially spreads its mass over many candidate models and only after a substantial amount of data does it concentrate on the true model. In contrast, criteria such as AIC and leave‑one‑out cross‑validation (LOOCV) adapt quickly and achieve optimal convergence rates (typically O(1/n) risk), but they lack consistency and can over‑select in the limit.
To reconcile these opposing properties, the authors propose the switch distribution, a modification of the standard Bayesian marginal likelihood that incorporates a discrete “switch time” variable. At each observation t, the method weighs whether to remain with the current model or to switch to a different one, where the cost of a switch is determined by the prior placed on switch times and model choices. This dynamic re‑weighting prevents the posterior from lingering on suboptimal models for too long, allowing a rapid “catch‑up” once a better model becomes evident.
Mathematically, the log‑marginal under the switch distribution decomposes into the models’ prequential data‑fit terms plus an additive term reflecting the cumulative switching costs. This formulation admits a clean minimum description length (MDL) interpretation: the total code length consists of the usual data‑fit term plus a penalty for each model transition, mirroring the prequential coding scheme. The authors prove two central theorems under fairly mild conditions (finite model class, bounded KL divergences, sufficiently diffuse priors): (1) Consistency – the switch‑based selector converges almost surely to the true model as n → ∞; (2) Optimal rate – the expected predictive risk decays at the same O(1/n) rate as AIC or LOOCV, thereby achieving the best possible trade‑off between bias and variance.
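The code‑length decomposition described above can be sketched schematically as follows. This is a reconstruction for illustration, not the paper’s exact notation: here s denotes one fixed switch sequence, π a prior over such sequences, and p_k the prequential predictive distribution of model k; the switch distribution itself mixes over all sequences s.

```latex
% Code length for one switch sequence s = ((t_1, k_1), \dots, (t_m, k_m)),
% with t_1 = 1 and the convention t_{m+1} - 1 = n.
L_s(x^n) \;=\;
  \underbrace{-\log \pi(s)}_{\text{penalty per model transition}}
  \;+\; \sum_{j=1}^{m} \sum_{t=t_j}^{t_{j+1}-1}
  \underbrace{-\log p_{k_j}\!\left(x_t \mid x^{t-1}\right)}_{\text{data fit}}
```

The first term is the switching cost; the double sum is the usual prequential log‑loss of whichever model is in force between consecutive switch points.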
From an algorithmic standpoint, the paper presents an efficient dynamic‑programming implementation. For each time step the algorithm updates a table of cumulative log‑likelihoods and switching costs for all candidate models, yielding an overall computational complexity of O(T·K) where T is the sample size and K the number of models. This makes the method practical for online or streaming settings.
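The O(T·K) per‑sweep structure described above resembles fixed‑share‑style expert tracking, and can be sketched as follows. This is a minimal illustration under our own assumptions, not the paper’s implementation: `log_pred[t][k]` is a hypothetical table of per‑step predictive log‑likelihoods log p_k(x_t | x^{t-1}), and `alpha` is an assumed constant per‑step switch probability standing in for the paper’s prior on switch times.

```python
import math

def switch_mix_logloss(log_pred, alpha=0.01):
    """Prequential log-loss of a switch-style mixture over K models.

    log_pred[t][k] = log p_k(x_t | x^{t-1})  (hypothetical input table)
    alpha          = assumed per-step switch probability

    Runs in O(T*K): one weight update per model per observation.
    """
    K = len(log_pred[0])
    w = [1.0 / K] * K          # current weights over the K models
    total = 0.0                # accumulated code length -log p_sw(x^n)
    for row in log_pred:
        # Mixture predictive probability of the next observation x_t.
        p = sum(w[k] * math.exp(row[k]) for k in range(K))
        total -= math.log(p)
        # Bayesian update of the weights within each model.
        w = [w[k] * math.exp(row[k]) / p for k in range(K)]
        # Switching step: redistribute a fraction alpha uniformly,
        # so no model's weight can vanish and a "catch-up" stays cheap.
        w = [(1.0 - alpha) * wk + alpha / K for wk in w]
    return total
```

With `alpha = 0`, the loop reduces to ordinary Bayesian model averaging; a positive `alpha` is what lets the mixture track a late-emerging better model instead of lingering on the early leader.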
Empirical evaluations span linear regression, Bayesian networks, and non‑Gaussian mixture models. In every scenario the switch distribution outperforms standard Bayes factor model selection and LOOCV in terms of faster risk reduction, lower model‑selection error, and shorter compressed code length. Notably, when the true model is complex or the sample size is modest, the switch method retains stability while AIC‑type criteria tend to over‑fit and Bayesian methods lag behind due to the catch‑up effect.
In conclusion, the switch distribution offers a prequential solution that simultaneously guarantees asymptotic consistency and finite‑sample optimal convergence rates, effectively resolving the AIC‑BIC dilemma. The authors suggest future work extending the framework to non‑parametric settings, hierarchical model spaces, and large‑scale ensemble learning, where dynamic switching could further enhance both predictive performance and interpretability.