Learning the Experts for Online Sequence Prediction

Online sequence prediction is the problem of predicting the next element of a sequence given previous elements. This problem has been extensively studied in the context of individual sequence prediction, where no prior assumptions are made on the origin of the sequence. Individual sequence prediction algorithms work quite well for long sequences, where the algorithm has enough time to learn the temporal structure of the sequence. However, they might give poor predictions for short sequences. A possible remedy is to rely on the general model of prediction with expert advice, where the learner has access to a set of $r$ experts, each of which makes its own predictions on the sequence. It is well known that it is possible to predict almost as well as the best expert if the sequence length is on the order of $\log(r)$. But without firm prior knowledge of the problem, it is not clear how to choose a small set of {\em good} experts. In this paper we describe and analyze a new algorithm that learns a good set of experts using a training set of previously observed sequences. We demonstrate the merits of our approach by applying it to the task of click prediction on the web.


💡 Research Summary

The paper tackles the well‑known problem of online sequence prediction, where a learner must forecast the next element of a sequence based only on the elements observed so far. Classical individual‑sequence approaches assume no prior knowledge about the data source and rely on the algorithm’s ability to learn temporal regularities from the stream itself. While these methods achieve low regret on long sequences, they often perform poorly when only a few observations are available—an issue that is critical in many real‑time applications such as ad click‑through prediction, recommendation, or stock‑price forecasting, where decisions must be made after only a handful of events.

To overcome this limitation, the authors adopt the “prediction with expert advice” paradigm. In this setting the learner has access to a pool of r experts, each of which independently produces a prediction for the next symbol. The learner aggregates these predictions, typically by maintaining a weight distribution over the experts and updating the weights according to observed losses. Classical results (e.g., Weighted Majority, Hedge) guarantee that the cumulative loss of the learner exceeds that of the best expert by at most O(√(T log r)); once the sequence length T is on the order of log r, the average per‑round regret is already a small constant, so the learner can essentially match the best expert’s performance.
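The aggregation described above can be sketched in a few lines. The following is a minimal Hedge-style update, not the paper's exact procedure; the losses and learning rate η are illustrative:

```python
import math

def hedge_predict(expert_preds, weights):
    """Aggregate expert predictions under the current weight distribution."""
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, expert_preds)) / total

def hedge_update(weights, losses, eta):
    """Exponentially down-weight each expert in proportion to its loss."""
    return [w * math.exp(-eta * l) for w, l in zip(weights, losses)]

# Toy run: two experts, one always right (loss 0), one always wrong (loss 1).
weights = [1.0, 1.0]
eta = 0.5
for _ in range(10):
    weights = hedge_update(weights, [0.0, 1.0], eta)

# After a few rounds the good expert dominates the weight distribution.
good_share = weights[0] / sum(weights)
```

After ten rounds the wrong expert's weight has decayed to e^{-5} of its initial value, so almost all mass sits on the good expert.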

The central challenge, however, is how to obtain a “good” set of experts when no domain‑specific prior is available. The paper’s main contribution is a two‑stage algorithm that learns the expert pool from a collection of previously observed sequences (the training set) and then uses this learned pool for online prediction on new sequences. The first stage, called “expert learning,” proceeds as follows:

  1. Candidate Generation – For each training sequence, a simple predictive model is fitted. The authors experiment with first‑order Markov models, hidden Markov models (HMMs), and lightweight rule‑based predictors. Each fitted model becomes a candidate expert.
  2. Regularized Selection – The candidate set may be large, so a regularization term proportional to log r is added to a loss function that measures each candidate’s predictive error on a validation split of the training data. An optimization (often a convex surrogate such as exponential weighting) yields a weight vector over the candidates, effectively pruning the pool to a manageable size while preserving diversity.
  3. Theoretical Guarantees – By bounding the empirical Rademacher complexity of the candidate class and applying concentration inequalities, the authors prove that with high probability the learned expert pool contains at least one expert whose expected loss is within ε of the Bayes optimal predictor, provided the number of training sequences scales as O((log r)/ε²). This establishes a sample‑complexity guarantee for the learning phase.
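Steps 1 and 2 can be sketched concretely. The snippet below is a simplified illustration, not the paper's exact procedure: each training sequence yields a first‑order Markov candidate, and candidates are then scored by log‑loss on a held‑out validation sequence (the sequences and alphabet are made up for the example):

```python
import math
from collections import defaultdict

def fit_markov(seq, alphabet):
    """Fit a first-order Markov expert: P(next | current), add-one smoothed."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(seq, seq[1:]):
        counts[a][b] += 1
    def predict(context):
        row = counts[context]
        total = sum(row.values()) + len(alphabet)
        return {s: (row[s] + 1) / total for s in alphabet}
    return predict

def validation_log_loss(expert, seq):
    """Average log-loss of an expert on a held-out validation sequence."""
    losses = [-math.log(expert(a)[b]) for a, b in zip(seq, seq[1:])]
    return sum(losses) / len(losses)

# Step 1: one candidate expert per training sequence.
alphabet = ["a", "b"]
train_seqs = [list("ababab"), list("aaaaa")]
experts = [fit_markov(s, alphabet) for s in train_seqs]

# Step 2: score candidates on a validation split and keep the best ones.
val = list("abab")
scores = [validation_log_loss(e, val) for e in experts]
best = min(range(len(experts)), key=scores.__getitem__)
```

Here the expert fitted on the alternating sequence scores a lower validation loss than the one fitted on the constant sequence, so it survives the pruning step.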

In the second stage, the online prediction phase, the learner receives a new sequence and at each time step obtains predictions from the r pre‑selected experts. It then updates a weight distribution using a Hedge‑style exponential update rule, where the initial weights are those obtained in the learning phase. The cumulative loss L_T incurred by the learner satisfies:

 L_T ≤ L_T^* + O(√(T log r)),

where L_T^* is the loss of the best expert in the learned pool. Because the pool already contains a near‑optimal expert (by the learning‑phase guarantee), the overall regret relative to the true optimal predictor is bounded by the sum of the learning‑phase error ε and the online regret term O(√(T log r)). Notably, when T ≈ c log r for a modest constant c, the average per‑round regret √((log r)/T) is already a small constant, so the learner essentially matches the best expert after only a few observations.
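The effect of warm‑starting the online phase with the weights from the learning phase can be illustrated with a small simulation. This is a hypothetical sketch, not the paper's experiment; the prior weights, losses, and learning rate are invented for the example:

```python
import math

def run_hedge(init_weights, expert_losses, eta):
    """Run Hedge from given initial weights; return the learner's total loss."""
    weights = list(init_weights)
    total_loss = 0.0
    for step_losses in expert_losses:
        z = sum(weights)
        # Learner's expected loss under the current weight distribution.
        total_loss += sum(w * l for w, l in zip(weights, step_losses)) / z
        weights = [w * math.exp(-eta * l) for w, l in zip(weights, step_losses)]
    return total_loss

# Two experts over a short horizon; expert 0 incurs no loss, expert 1 always loses.
losses = [[0.0, 1.0]] * 5
uniform = run_hedge([1.0, 1.0], losses, eta=1.0)       # cold start
warm = run_hedge([0.9, 0.1], losses, eta=1.0)          # learned prior favours expert 0
```

On this five‑step sequence the warm‑started learner incurs noticeably less cumulative loss than the uniform cold start, mirroring the paper's point that the learned weights matter most when T is small.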

The authors validate their framework on a real‑world click‑prediction task using a large web‑log dataset. They compare three approaches: (i) a pure individual‑sequence baseline based on LSTM networks, (ii) a static expert pool consisting of handcrafted rule‑based predictors, and (iii) the proposed learned‑expert approach. Evaluation metrics include accuracy, AUC, and log‑loss. Results show that for the first 100–200 clicks (the “cold‑start” regime) the learned‑expert method outperforms the LSTM baseline by 5–7 percentage points in AUC and reduces log‑loss by roughly 0.12. As the sequence length grows, performance converges to that of the LSTM, confirming that the method does not sacrifice long‑term accuracy while delivering a substantial early‑stage advantage. Additional ablation studies varying r demonstrate the predicted logarithmic relationship: doubling r changes the sequence length required for a given error threshold by roughly a constant additive amount, matching the theoretical analysis.

In summary, the paper makes three key contributions:

  1. A practical algorithm for constructing a compact, high‑quality expert pool from historical sequences, bridging the gap between purely data‑driven online learning and expert‑advice frameworks.
  2. Rigorous theoretical analysis that links the sample complexity of the learning phase to the regret bound of the online phase, providing guarantees that hold even when the online horizon is extremely short.
  3. Empirical evidence on a large‑scale web‑click dataset, demonstrating that the method yields significant improvements in cold‑start prediction—a scenario where traditional individual‑sequence models struggle.

The work opens several avenues for future research. One direction is to replace the simple Markov/HMM candidates with deep neural networks trained on subsets of the data, thereby enriching the expert pool with more expressive models while still benefiting from the expert‑advice aggregation. Another promising line is to extend the framework to multi‑modal inputs (e.g., combining click streams with textual content or visual features) and to explore adaptive mechanisms that can add or retire experts on‑the‑fly as the data distribution drifts. Overall, the paper provides a compelling blend of theory and practice, showing that learning the experts themselves can dramatically improve online sequence prediction, especially in the critical early‑stage regime.