On the Relation between Realizable and Nonrealizable Cases of the Sequence Prediction Problem

A sequence $x_1,\dots,x_n,\dots$ of discrete-valued observations is generated according to some unknown probabilistic law (measure) $\mu$. After observing each outcome, one is required to give conditional probabilities of the next observation. The realizable case is when the measure $\mu$ belongs to an arbitrary but known class $\mathcal C$ of process measures. The non-realizable case is when $\mu$ is completely arbitrary, but the prediction performance is measured with respect to a given set $\mathcal C$ of process measures. We are interested in the relations between these problems and between their solutions, as well as in characterizing the cases when a solution exists and finding these solutions. We show that if the quality of prediction is measured using the total variation distance, then these problems coincide, while if it is measured using the expected average KL divergence, then they are different. For some of the formalizations we also show that when a solution exists, it can be obtained as a Bayes mixture over a countable subset of $\mathcal C$. We also obtain several characterizations of those sets $\mathcal C$ for which solutions to the considered problems exist. As an illustration of the general results obtained, we show that a solution to the non-realizable case of the sequence prediction problem exists for the set of all finite-memory processes, but does not exist for the set of all stationary processes. It should be emphasized that the framework is completely general: the process measures considered are not required to be i.i.d., mixing, stationary, or to belong to any parametric family.


💡 Research Summary

The paper investigates the problem of sequential prediction for an infinite discrete‑valued observation stream $x_1, x_2, \dots$ generated by an unknown probability measure $\mu$. After each observation the predictor must output a conditional distribution for the next symbol. Two distinct settings are considered. In the realizable case the true measure $\mu$ is assumed to belong to a known class $\mathcal C$ of stochastic processes. In the non‑realizable case $\mu$ may be arbitrary, but the quality of a predictor is evaluated with respect to the class $\mathcal C$; that is, a predictor is judged by how well it performs relative to the best predictor that could be built from knowledge of $\mathcal C$.
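To make the protocol concrete, here is a minimal sketch of the prediction loop in Python; the `Predictor` interface and all names are illustrative assumptions, not code from the paper.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, List, Sequence

class Predictor(ABC):
    """Sequential predictor: after seeing x_1..x_n, output a distribution for x_{n+1}."""

    @abstractmethod
    def predict(self, history: Sequence[int]) -> Dict[int, float]:
        """Return the conditional probabilities P(x_{n+1} = a | x_1..x_n) for each symbol a."""

def run_protocol(predictor: Predictor, stream: Iterable[int]) -> List[Dict[int, float]]:
    """Online protocol: at each step, forecast the next symbol, then observe it."""
    history: List[int] = []
    forecasts = []
    for x in stream:                                  # x_1, x_2, ... revealed one at a time
        forecasts.append(predictor.predict(history))  # conditional distribution for next symbol
        history.append(x)                             # the realized symbol becomes part of the past
    return forecasts
```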

Two loss functions are used to quantify prediction quality: (i) the total variation (TV) distance between the true and predicted conditional distributions of the future given the past, which should vanish almost surely as observations accumulate, and (ii) the expected average Kullback–Leibler (KL) divergence between the next‑symbol conditionals. The main contributions are a systematic comparison of the realizable and non‑realizable problems under these two metrics, a set of necessary and sufficient conditions for the existence of a universal predictor, and constructive solutions when they exist.
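For concreteness, a standard formalization of these two criteria (the notation here is ours and may differ in minor details from the paper) reads as follows. A predictor $\rho$ predicts $\mu$ in total variation if, $\mu$-almost surely,

$$
v(\mu,\rho,x_{1..n}) \;=\; \sup_{A}\,\bigl|\mu(A \mid x_{1..n}) - \rho(A \mid x_{1..n})\bigr| \;\longrightarrow\; 0 \quad (n\to\infty),
$$

where $A$ ranges over events concerning the future of the sequence; and $\rho$ predicts $\mu$ in expected average KL divergence if

$$
\bar d_n(\mu,\rho) \;=\; \frac{1}{n}\,\mathbb E_\mu \sum_{t=1}^{n} \sum_{a} \mu(x_t = a \mid x_{1..t-1}) \log\frac{\mu(x_t = a \mid x_{1..t-1})}{\rho(x_t = a \mid x_{1..t-1})} \;\longrightarrow\; 0 \quad (n\to\infty).
$$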

TV‑based results. The authors prove that under the TV metric the realizable and non‑realizable problems are equivalent: a predictor that is TV‑consistent for every $\mu \in \mathcal C$ also solves the non‑realizable problem, predicting an arbitrary data‑generating measure in TV at least as well, asymptotically, as any measure in $\mathcal C$, and vice versa. The proof relies on the martingale property of conditional probabilities and the triangle inequality for the TV distance. Consequently, any method that solves the realizable problem (for example, a Bayes mixture over a countable dense subset of $\mathcal C$) automatically solves the non‑realizable problem as well.
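The following is a minimal Python sketch of such a Bayes mixture predictor over a (necessarily truncated) countable family; the `ConditionalModel` representation, the weighting scheme, and all names are illustrative assumptions, not the paper's construction verbatim.

```python
import math
from typing import Callable, Dict, List, Sequence

# A component measure is represented by its next-symbol conditional:
# cond(history) -> {symbol: probability}.
ConditionalModel = Callable[[Sequence[int]], Dict[int, float]]

class BayesMixturePredictor:
    """Bayes mixture rho = sum_i w_i mu_i over a countable (here truncated) family.

    The mixture's conditional is a posterior-weighted average of the components'
    conditionals: the weight of mu_i after x_1..x_n is proportional to
    w_i * mu_i(x_1..x_n).
    """

    def __init__(self, models: List[ConditionalModel], prior: List[float]):
        assert abs(sum(prior) - 1.0) < 1e-9 and all(w > 0 for w in prior)
        self.models = models
        self.log_weights = [math.log(w) for w in prior]  # log of (unnormalized) posterior weights
        self.history: List[int] = []

    def predict(self) -> Dict[int, float]:
        """Conditional distribution rho(. | x_1..x_n) as a weighted average."""
        # Normalize the posterior weights in a numerically stable way.
        m = max(self.log_weights)
        post = [math.exp(lw - m) for lw in self.log_weights]
        z = sum(post)
        post = [p / z for p in post]
        mix: Dict[int, float] = {}
        for w, model in zip(post, self.models):
            for a, p in model(self.history).items():
                mix[a] = mix.get(a, 0.0) + w * p
        return mix

    def update(self, x: int) -> None:
        """Observe x; multiply each component's posterior weight by its likelihood of x."""
        for i, model in enumerate(self.models):
            p = model(self.history).get(x, 0.0)
            self.log_weights[i] += math.log(p) if p > 0 else -math.inf
        self.history.append(x)
```

For instance, taking the components to be i.i.d. Bernoulli models with rational parameters yields a countable family that is dense among all i.i.d. binary processes.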

KL‑based results. The situation changes dramatically when the loss is measured by the expected average KL divergence. Here the two problems diverge. The paper shows that a KL‑consistent predictor exists for the non‑realizable setting only if the class $\mathcal C$ possesses a certain absolute‑continuity structure: every possible data‑generating measure must be absolutely continuous with respect to at least one element of $\mathcal C$. When this holds, a predictor can be constructed as a Bayes mixture over a countable sub‑family $\{\mu_i\}_{i\in\mathbb N} \subset \mathcal C$ with positive weights $w_i$. The mixture predictor $\rho := \sum_{i\in\mathbb N} w_i \mu_i$ then predicts via its conditional probabilities $\rho(x_{n+1} \mid x_1,\dots,x_n)$.
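Written out explicitly (standard Bayes‑mixture algebra, stated here in our notation rather than quoted from the paper), the mixture's next‑symbol conditional is a posterior‑weighted average of the components' conditionals:

$$
\rho(x_{n+1} \mid x_{1..n}) \;=\; \frac{\rho(x_{1..n+1})}{\rho(x_{1..n})} \;=\; \sum_{i\in\mathbb N} \frac{w_i\,\mu_i(x_{1..n})}{\sum_{j\in\mathbb N} w_j\,\mu_j(x_{1..n})}\; \mu_i(x_{n+1} \mid x_{1..n}),
$$

so components that have assigned high probability to the observed prefix dominate the prediction.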