An Empirical Comparison of Algorithms for Aggregating Expert Predictions

Predicting the outcomes of future events is a challenging problem for which a variety of solution methods have been explored. We present an empirical comparison of online and offline adaptive algorithms for aggregating experts’ predictions of the outcomes of five years of US National Football League games (1319 games), using expert probability elicitations obtained from an Internet contest called ProbabilitySports. We find that it is difficult to improve over simple averaging of the predictions in terms of prediction accuracy, but that there is room for improvement in quadratic loss. Somewhat surprisingly, a Bayesian estimation algorithm which estimates the variance of each expert’s prediction exhibits the most consistent superior performance over simple averaging among our collection of algorithms.


💡 Research Summary

The paper tackles the classic problem of aggregating probabilistic forecasts from multiple experts to predict future events, using a large‑scale real‑world dataset drawn from the ProbabilitySports online contest. The authors collected probability predictions for the outcomes of 1,319 National Football League (NFL) games spanning five seasons. Each game attracted predictions from thousands of participants, who supplied a probability between 0 and 1 for the home team’s victory. The authors treat these predictions as “expert advice” and evaluate a suite of online and offline adaptive aggregation algorithms on two performance metrics: binary accuracy (the proportion of correctly predicted winners) and quadratic loss (the Brier score, i.e., the mean squared error between predicted probabilities and actual outcomes).

Algorithmic Landscape
The study groups the methods into two families.

  1. Online adaptive algorithms update expert weights after each game, aiming for low regret. The set includes Hedge, Weighted Majority, Exponential Weights, and a variant of Follow‑the‑Regularized‑Leader (FTRL). These algorithms are computationally light (O(N) per round, where N is the number of experts) and are designed to react quickly to changing expert performance.
  2. Offline (batch) algorithms first process the entire dataset to learn a fixed set of weights. This family comprises ordinary least‑squares regression, Expectation‑Maximization (EM) mixture models, and several Bayesian estimators. The Bayesian approaches are the most distinctive: they treat each expert’s prediction as a noisy observation of the true probability and place a prior on the expert’s variance. By inferring posterior variances, the methods automatically down‑weight erratic experts and up‑weight consistently accurate ones.
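To make the online family concrete, here is a minimal sketch of a Hedge-style exponential-weights aggregator. This is an illustration of the general technique, not the paper's exact implementation; the learning rate `eta` and the choice of per-expert quadratic loss are our assumptions.

```python
import numpy as np

def hedge_aggregate(predictions, outcomes, eta=0.5):
    """Exponential-weights (Hedge) aggregation of expert probabilities.

    predictions: (T, N) array of expert probabilities for T games, N experts.
    outcomes:    (T,) array of realized 0/1 outcomes.
    Returns one aggregated forecast per game, each made before that game's
    outcome is used to update the weights.
    """
    T, N = predictions.shape
    log_w = np.zeros(N)                    # log-weights; start uniform
    forecasts = np.empty(T)
    for t in range(T):
        w = np.exp(log_w - log_w.max())    # exponentiate stably
        w /= w.sum()                       # normalize to a distribution
        forecasts[t] = w @ predictions[t]  # weighted-average forecast
        loss = (predictions[t] - outcomes[t]) ** 2  # per-expert quadratic loss
        log_w -= eta * loss                # multiplicative-weights update
    return forecasts
```

Note the O(N) work per round: one pass over the experts to forecast and one to update, which is what makes this family attractive for real-time use.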

Experimental Design
Two loss functions drive the evaluation. Binary accuracy treats any predicted probability >0.5 as a “win” prediction, yielding a 0‑1 loss. Quadratic loss (Brier score) penalizes the distance between the forecasted probability and the realized binary outcome, thus rewarding well‑calibrated probability estimates even when the binary decision is wrong. The authors run each algorithm on the full 1,319‑game sequence, using a rolling‑origin evaluation for the online methods (i.e., predictions are made before the outcome of each game is revealed, then weights are updated).
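The two evaluation metrics described above are straightforward to compute; a small sketch (the 0.5 threshold for the accuracy metric is as stated in the text):

```python
import numpy as np

def brier_score(probs, outcomes):
    """Quadratic loss: mean squared error between forecast probabilities
    and realized 0/1 outcomes. Lower is better."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return float(np.mean((probs - outcomes) ** 2))

def binary_accuracy(probs, outcomes, threshold=0.5):
    """Fraction of games where the thresholded forecast (> threshold
    means 'win') matches the realized outcome."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=int)
    return float(np.mean((probs > threshold).astype(int) == outcomes))
```

The Brier score is what separates the algorithms in this study: two forecasters can have identical binary accuracy while one is far better calibrated.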

Key Findings

  • Accuracy: Across the board, the simple arithmetic mean of all expert forecasts performs on par with the sophisticated online methods. No algorithm achieves a statistically significant improvement in win‑loss prediction accuracy over the mean. This suggests that, for a binary decision, the majority‑vote effect inherent in averaging already captures most of the signal present in the crowd.
  • Quadratic loss: Here the picture changes. Bayesian variance‑estimation algorithms consistently achieve lower Brier scores than the mean. The best performers—Variational Bayesian and a Bayesian mean‑variance estimator—reduce quadratic loss by roughly 3–5 % relative to simple averaging. The improvement stems from the explicit modeling of each expert’s uncertainty: experts who tend to be over‑confident or erratic receive higher posterior variance, which translates into lower influence on the final aggregate.
  • Computational considerations: Online algorithms are extremely lightweight (linear in the number of experts per round) and thus suitable for real‑time deployment, but their performance ceiling is low in this domain. Bayesian batch methods require iterative optimization (EM or variational inference) with a complexity on the order of O(N·T·K) (N = experts, T = games, K = number of iterations). While more demanding, the computational cost is acceptable for offline analysis and yields tangible gains in calibration.
  • Data characteristics: The NFL setting provides binary outcomes but continuous probability forecasts. The variance among experts is substantial; some consistently provide well‑calibrated probabilities, while others are noisy or biased. Simple averaging smooths out noise but cannot differentiate between high‑quality and low‑quality contributors. The Bayesian framework’s ability to learn per‑expert variance exploits this heterogeneity, leading to better calibrated predictions.
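The core idea behind the variance-aware methods can be illustrated with a simplified precision-weighted average. This sketch is a stand-in for the paper's Bayesian estimator, not a reproduction of it: it estimates each expert's variance from historical squared error and weights experts by inverse variance, so erratic experts are automatically down-weighted.

```python
import numpy as np

def variance_weighted_aggregate(train_preds, train_outcomes, test_preds):
    """Precision-weighted averaging: a simplified stand-in for Bayesian
    per-expert variance estimation.

    train_preds:    (T, N) historical expert probabilities.
    train_outcomes: (T,) realized 0/1 outcomes for those games.
    test_preds:     (M, N) new expert probabilities to aggregate.
    Returns (M,) aggregated forecasts.
    """
    # Use each expert's mean squared residual as a variance estimate;
    # a small floor avoids division by zero for a near-perfect expert.
    resid = train_preds - train_outcomes[:, None]
    var = np.maximum(np.mean(resid ** 2, axis=0), 1e-6)
    # Weight each expert by precision (inverse variance), normalized.
    precision = 1.0 / var
    weights = precision / precision.sum()
    return test_preds @ weights
```

Compared with the uniform weights of simple averaging, this aggregate shifts toward the historically well-calibrated experts, which is the mechanism the Key Findings credit for the lower quadratic loss.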

Implications and Future Directions
The authors conclude that, for tasks where the primary goal is accurate binary classification, simple averaging is hard to beat. However, when the quality of the probability estimate itself matters—e.g., in betting markets, risk‑adjusted decision making, or any application that uses the forecast as a probability input—modeling expert uncertainty yields measurable benefits. The study suggests several avenues for further research: (1) extending the Bayesian model with hierarchical priors to capture group‑level effects (team‑specific, season‑specific trends); (2) incorporating non‑linear transformations or kernel methods to capture interactions among experts; (3) developing online Bayesian updating schemes that retain the calibration advantages while operating in real time; and (4) testing the approaches on other domains (financial forecasts, medical diagnosis) where expert probability elicitation is common.

In sum, the paper provides a thorough empirical benchmark of aggregation techniques on a realistic, large‑scale dataset, demonstrating that while simple averaging remains a robust baseline for accuracy, Bayesian variance‑aware aggregation offers a principled path to superior probabilistic calibration.