Equations of States in Singular Statistical Estimation
Learning machines that have hierarchical structures or hidden variables are singular statistical models because they are non-identifiable and their Fisher information matrices are singular. In singular statistical models, the Bayes a posteriori distribution does not converge to a normal distribution, nor does the maximum likelihood estimator satisfy asymptotic normality. This is the main reason why it has been difficult to predict their generalization performance from trained states. In this paper, we study four errors, (1) the Bayes generalization error, (2) the Bayes training error, (3) the Gibbs generalization error, and (4) the Gibbs training error, and prove that mathematical relations hold among them. The formulas proved in this paper are equations of states in statistical estimation because they hold for any true distribution, any parametric model, and any a priori distribution. We also show that the Bayes and Gibbs generalization errors can be estimated from the Bayes and Gibbs training errors, and propose widely applicable information criteria that apply to both regular and singular statistical models.
💡 Research Summary
Learning machines that possess hierarchical structures or hidden variables belong to a class of statistical models known as singular models. In such models the map from parameters to probability distributions is not one-to-one (the model is non-identifiable), and the Fisher information matrix is singular. Consequently, the classical asymptotic results, namely the asymptotic normality of the maximum-likelihood estimator (MLE) and the convergence of the Bayesian posterior to a multivariate normal distribution, break down. This lack of regularity has long prevented researchers from predicting a model's generalization performance solely from its trained state.
The paper tackles this problem by focusing on four error quantities that are natural in Bayesian and Gibbs (i.e., posterior‑sampling) frameworks:
- Bayes generalization error ($B_g$) – the expected loss on a fresh data point when predictions are made by averaging over the full posterior.
- Bayes training error ($B_t$) – the analogous loss evaluated on the training data.
- Gibbs generalization error ($G_g$) – the expected loss when a single parameter is drawn from the posterior and used for prediction on new data.
- Gibbs training error ($G_t$) – the same Gibbs loss, measured on the training set. (A small numerical sketch of all four quantities follows this list.)
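To make the four definitions concrete, here is a minimal numerical sketch. It is an illustration under assumptions that are not from the paper: a toy conjugate Gaussian model (whose tempered posterior can be sampled exactly) stands in for a singular learning machine, and expectations over fresh data are approximated by a large held-out sample. All identifiers (`log_p`, `log_q`, `K`, and so on) are hypothetical.

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)

# Toy setup (an assumption for illustration, not the paper's example):
# true distribution q = N(0, 1); model p(x|mu) = N(mu, 1); prior N(0, 1).
n, K, n_test = 50, 2000, 4000
x = rng.normal(0.0, 1.0, size=n)           # training sample drawn from q
x_new = rng.normal(0.0, 1.0, size=n_test)  # fresh sample to approximate E_X[.]

beta = 1.0  # inverse temperature of the posterior (beta = 1 is standard Bayes)
# With this conjugate setup, the tempered posterior of mu is exactly Gaussian:
post_var = 1.0 / (1.0 + beta * n)
post_mean = beta * x.sum() * post_var
w = rng.normal(post_mean, np.sqrt(post_var), size=K)  # posterior draws of mu

def log_p(xs, mus):
    """Model log-density log p(x | mu); returns shape (len(xs), len(mus))."""
    return -0.5 * np.log(2 * np.pi) - 0.5 * (xs[:, None] - mus[None, :]) ** 2

def log_q(xs):
    """True log-density of N(0, 1)."""
    return -0.5 * np.log(2 * np.pi) - 0.5 * xs ** 2

def log_predictive(xs):
    """Bayes predictive log-density: log (1/K) sum_k p(x | w_k)."""
    return logsumexp(log_p(xs, w), axis=1) - np.log(K)

# The four errors, written as averages of log q - log p (KL-style losses):
B_g = np.mean(log_q(x_new) - log_predictive(x_new))     # Bayes generalization
B_t = np.mean(log_q(x) - log_predictive(x))             # Bayes training
G_g = np.mean(log_q(x_new)[:, None] - log_p(x_new, w))  # Gibbs generalization
G_t = np.mean(log_q(x)[:, None] - log_p(x, w))          # Gibbs training
print(f"B_g={B_g:.4f}  B_t={B_t:.4f}  G_g={G_g:.4f}  G_t={G_t:.4f}")
```

In such a run, the training errors typically come out smaller than their generalization counterparts, since they reuse the data that shaped the posterior; that gap is exactly what the equations of state below quantify.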
The authors prove that these four quantities satisfy a set of universal linear relations that hold for any true data‑generating distribution, any parametric family, and any prior. The relations can be written compactly as
$$\mathbb{E}[B_g] = \mathbb{E}[B_t] + 2\beta\,\bigl(\mathbb{E}[G_t]-\mathbb{E}[B_t]\bigr) + o(1/n),$$
$$\mathbb{E}[G_g] = \mathbb{E}[G_t] + 2\beta\,\bigl(\mathbb{E}[G_t]-\mathbb{E}[B_t]\bigr) + o(1/n),$$

where $\mathbb{E}[\cdot]$ denotes the expectation over training sets of size $n$ and $\beta$ is the inverse temperature of the posterior ($\beta = 1$ gives the standard Bayes posterior). Because the right-hand sides contain only training errors, the unobservable generalization errors can be estimated from observable quantities, which is what yields the widely applicable information criteria proposed in the paper.
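Continuing the hypothetical sketch above, the equations of state can be read as estimators: every term on the right-hand side is computable from the training set, so the two generalization errors can be predicted without held-out data. Since the equalities hold in expectation over training sets, and only up to $o(1/n)$, a single run agrees only up to sampling noise.

```python
# Equations of state used as estimators: expectations over training sets
# are replaced by the single observed training set (valid up to noise).
B_g_hat = B_t + 2 * beta * (G_t - B_t)  # predicts the Bayes generalization error
G_g_hat = G_t + 2 * beta * (G_t - B_t)  # predicts the Gibbs generalization error
print(f"B_g = {B_g:.4f}   estimated = {B_g_hat:.4f}")
print(f"G_g = {G_g:.4f}   estimated = {G_g_hat:.4f}")
```

This is the sense in which the paper's widely applicable information criteria estimate generalization performance from training quantities alone, for regular and singular models alike.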