Test-Time Efficient Pretrained Model Portfolios for Time Series Forecasting
Is bigger always better for time series foundation models? With this question in mind, we explore an alternative to training a single, large monolithic model: building a portfolio of smaller, pretrained forecasting models. By applying ensembling or model selection over these portfolios, we achieve competitive performance on large-scale benchmarks using far fewer parameters. We explore strategies for designing such portfolios and find that collections of specialist models consistently outperform portfolios of independently trained generalists. Remarkably, we demonstrate that post-training a base model is a compute-effective approach for creating sufficiently diverse specialists, and we provide evidence that ensembling and model selection are more compute-efficient than test-time fine-tuning.
💡 Research Summary
The paper challenges the prevailing “bigger is better” belief in time‑series foundation models by proposing a portfolio‑based alternative. Instead of training a single massive model, the authors construct a collection of small pretrained forecasters (1 M–9 M parameters) based on the Chronos‑Bolt encoder‑decoder architecture. Diversity is introduced by partitioning the training corpus along metadata dimensions such as frequency (hourly, daily, etc.) and domain (energy, finance, etc.). Each partition is used to fine‑tune a shared generalist model, producing specialist models via a lightweight “post‑training” step that costs only about 0.5 % of the original pretraining time. This approach avoids the prohibitive expense of training every specialist from scratch while still yielding a diverse set of experts.
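The partition-then-post-train recipe described above can be sketched as follows. This is a minimal illustration, not the authors' code: the `fine_tune` callable, the metadata keys, and the step count are all hypothetical stand-ins for the paper's actual training setup.

```python
# Hypothetical sketch: partition a corpus by metadata, then briefly
# post-train a shared generalist on each partition to get specialists.
from collections import defaultdict


def build_specialists(corpus, generalist, fine_tune, post_train_steps=1000):
    """corpus: iterable of (series, metadata) pairs, where metadata has
    'frequency' and 'domain' keys (assumed names, for illustration).
    fine_tune(model, data, steps) is assumed to return a fine-tuned
    copy of the model without mutating the original."""
    partitions = defaultdict(list)
    for series, meta in corpus:
        # Each series contributes to one frequency partition and
        # one domain partition.
        partitions[("freq", meta["frequency"])].append(series)
        partitions[("domain", meta["domain"])].append(series)
    # One lightweight post-training run per partition; per the paper,
    # each run costs only ~0.5% of the original pretraining compute.
    return {
        key: fine_tune(generalist, data, post_train_steps)
        for key, data in partitions.items()
    }
```

Because every specialist starts from the same generalist checkpoint, the total cost is the single pretraining run plus a small per-partition surcharge, rather than one full pretraining run per specialist.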
At inference time the portfolio can be leveraged in two ways. The first is model selection: a validation window is used to identify the single specialist that achieves the lowest loss, and that model alone is used for forecasting. The second is greedy ensemble selection (Caruana et al., 2004), which assigns weights to multiple specialists and produces a weighted average forecast. Both strategies rely on time‑series cross‑validation to generate validation folds, and they can be applied even when only a single target series is available.
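The greedy ensemble selection of Caruana et al. (2004) admits a compact sketch: repeatedly add, with replacement, whichever model most reduces validation loss of the running average, then convert selection counts into weights. The version below is an illustrative simplification that scores candidates with MAE on point forecasts rather than the quantile loss the paper uses.

```python
import numpy as np


def greedy_ensemble(val_forecasts, y_val, n_rounds=10):
    """Caruana-style forward selection with replacement.

    val_forecasts: dict mapping model name -> validation forecast array
    with the same shape as y_val. Returns per-model weights that sum
    to 1. Uses MAE for simplicity (the paper uses quantile loss)."""
    counts = {name: 0 for name in val_forecasts}
    ensemble_sum = np.zeros_like(y_val, dtype=float)
    total = 0
    for _ in range(n_rounds):
        best_name, best_loss = None, np.inf
        for name, forecast in val_forecasts.items():
            # Loss of the ensemble if this model were added once more.
            candidate = (ensemble_sum + forecast) / (total + 1)
            loss = float(np.abs(candidate - y_val).mean())
            if loss < best_loss:
                best_name, best_loss = name, loss
        counts[best_name] += 1
        ensemble_sum += val_forecasts[best_name]
        total += 1
    return {name: c / total for name, c in counts.items()}
```

Note that with `n_rounds=1` this reduces to model selection: all weight goes to the single specialist with the lowest validation loss.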
Experiments are conducted primarily on Chronos Benchmark II, a large zero‑shot benchmark that was not seen during training. For each model size (1 M, 2 M, 4 M, and “tiny” 9 M) the authors train a generalist on the full corpus and then create a portfolio of specialists by post‑training on the frequency‑ and domain‑specific subsets. Results show that specialist‑augmented portfolios consistently outperform portfolios consisting solely of generalists, delivering 2–4 % relative improvements in quantile loss. Moreover, the best‑performing specialist or a small greedy ensemble matches or exceeds the accuracy of much larger monolithic models (10 M–100 M parameters) while using an order of magnitude fewer active parameters at test time. The compute‑performance scaling curve of the portfolio mirrors that of large models, confirming that the approach retains the desirable scaling properties of foundation models without their inference overhead.
Ablation studies reveal that the primary source of diversity is the metadata‑driven data partitioning; random bagging or perturbations provide far less benefit. Post‑training proves crucial for efficiency: training specialists from scratch would increase total compute by roughly tenfold, whereas the proposed method adds negligible overhead. The authors also compare model selection versus ensembling, finding that selection is computationally cheaper and often yields comparable accuracy, while ensembling can give modest gains when more compute is available.
In summary, the paper demonstrates three key contributions: (1) a portfolio of small, specialist time‑series models can achieve accuracy on par with state‑of‑the‑art large pretrained forecasters; (2) a simple post‑training scheme efficiently generates diverse specialists without incurring the full cost of independent training; and (3) test‑time model selection or lightweight greedy ensembling provides a more compute‑effective alternative to test‑time fine‑tuning. The work opens a new direction for building scalable, cost‑efficient forecasting systems and suggests that similar portfolio strategies could be applied to other domains where large foundation models dominate.