Cisco Time Series Model Technical Report


📝 Original Info

  • Title: Cisco Time Series Model Technical Report
  • ArXiv ID: 2511.19841
  • Date: 2025-11-26
  • Authors: Researchers from original ArXiv paper

📝 Abstract

We introduce the Cisco Time Series Model, a univariate zero-shot forecaster. This time series foundation model is the result of a general architectural innovation to a time series model enabling it to accept multiresolution input, applied to a popular decoder-only time series model (TimesFM). The resulting multiresolution decoder-only model is trained on over 300B unique data points, with more than half coming from the observability domain. Quantitative and qualitative evaluations demonstrate that the resulting model achieves superior performance on observability datasets while retaining very similar performance on a standard general-purpose forecasting benchmark (GIFT-Eval), and suggest that the multiresolution structure enables the model to make more accurate predictions on long context input.


📄 Full Content

Modern LLMs are capable of learning complex statistical properties of language from a vast corpus of text. Rather than being trained to emulate a particular style or perform a particular task, they learn structure across diverse examples of token sequences, and the learned representations can be transferred to many downstream tasks and applications. The main idea of a time series foundation model (TSFM) is to apply the same playbook, including the transformer architecture that has revolutionized natural language processing, to sequences of numerical data, i.e., time series. Our present focus is to train a univariate TSFM capable of high-quality zero-shot forecasting, with emphasis on time series arising in certain business domains (initially, observability). Thus, having been exposed to patterns across many time series during training, given a segment of a new (unseen) time series, the TSFM is expected to predict its subsequent segment without any auxiliary parameter adjustment or fitting.

Architectural differences among TSFMs can be found in their approaches to tokenization, transformer configuration, and prediction heads. PatchTST [Nie+23] introduces the idea of a time series patch as the analogue of a token, uses a linear transformation of a patch as a replacement for the token embedding, and finally applies a standard transformer encoder architecture. TimesFM [Das+24] uses a residual block to embed time series patches, enabling learning of more complex representations, and applies a decoder-only architecture. Chronos [Ans+24] tokenizes individual data points via scaling and then applies the (encoder-decoder) T5 architecture [Raf+20], notably formulating forecasting as a classification problem; subsequent versions (Chronos-Bolt, Chronos-2 [Ans+25]) utilize patching and “meta features” before applying transformer layers, and Chronos-2 uses a T5 encoder. Moirai [Woo+24] utilizes multiple patch sizes and also learns a mixture of distributions, elevating probabilistic forecasting to a first-class consideration; Moirai-MoE [Liu+25] applies the mixture-of-experts pattern as a learnable replacement for various frequency heuristics. Toto [Coh+25b], [Coh+25a] uses a causal patching mechanism, a learned mixture of t-distributions for the prediction head, and a composite loss function; its training corpus and accompanying BOOM benchmark are based heavily on observability data.
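The patch-based tokenization common to these models can be sketched in a few lines: a univariate series is split into fixed-length patches, and each patch is mapped to an embedding vector by a linear map (PatchTST) or, in TimesFM, a residual block. The following is a minimal illustrative sketch, not the paper's implementation; the function names, patch length, and embedding dimension are all assumptions for exposition.

```python
import numpy as np

def patchify(series: np.ndarray, patch_len: int) -> np.ndarray:
    """Split a 1-D series into non-overlapping patches of length patch_len,
    dropping any leftover tail that does not fill a whole patch."""
    n_patches = len(series) // patch_len
    return series[: n_patches * patch_len].reshape(n_patches, patch_len)

def embed_patches(patches: np.ndarray, d_model: int) -> np.ndarray:
    """Linear patch embedding: each patch becomes a d_model-dim 'token'.
    (A fixed random matrix stands in for a learned projection.)"""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.02, size=(patches.shape[1], d_model))
    return patches @ w  # shape: (n_patches, d_model)

series = np.sin(np.linspace(0.0, 20.0, 512))
tokens = embed_patches(patchify(series, patch_len=32), d_model=64)
# 512 points with patch length 32 yield 16 token embeddings of dimension 64
```

The resulting token sequence is what the transformer stack (encoder-only in PatchTST, decoder-only in TimesFM) actually attends over.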

There have been several efforts to incorporate multiscale structure in TSFMs. Pyraformer [Liu+22] uses convolutions at multiple scales to build a multiresolution representation, then applies attention in a pyramidal pattern to share information across resolutions. Scaleformer [Sha+23] processes the same input at multiple scales, proceeding from coarser to finer, using average pooling and upsampling to translate across resolutions. Pathformer [Che+24] introduces an adaptive multi-scale transformer block, giving the patch size a dynamic flavor; it also intentionally models trend and seasonality. Multiresolution time series transformer [Zha+24] iteratively applies attention directly to several patchings of the same time series (with different length and stride); the attention operates separately on each patching, and the results are combined.
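A recurring primitive in these multiscale designs is producing coarser views of the same input by average pooling, as in Scaleformer's coarse-to-fine processing. A minimal sketch of that pooling step, with illustrative scale factors that are assumptions rather than values from any of the cited papers:

```python
import numpy as np

def downsample(series: np.ndarray, factor: int) -> np.ndarray:
    """Average-pool a 1-D series by an integer factor (truncating the tail)."""
    n = len(series) // factor
    return series[: n * factor].reshape(n, factor).mean(axis=1)

def multiscale_views(series: np.ndarray, factors=(1, 4, 16)) -> dict:
    """Build the same input at several resolutions, finest to coarsest."""
    return {f: downsample(series, f) for f in factors}

series = np.arange(64, dtype=float)
views = multiscale_views(series)
# views[1] is the original series; views[4] and views[16] are pooled copies
```

Note that every view here is derived from one full-resolution input, which is exactly the property the next paragraph contrasts with genuinely multiresolution input.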

While existing TSFMs have introduced a variety of architectural innovations, from patch-based tokenization to mixture-of-experts designs, the majority remain constrained by relatively limited context windows, typically spanning 512 to 4,096 data points, with the latest TimesFM 2.5 [Das+24] extending this to 16,384. All of the more complex multiresolution architectures mentioned above share the characteristic that they process the same input at multiple resolutions, and so have no particular suitability for long context. This limitation hampers their ability to effectively leverage long historical sequences, which are often critical for accurate forecasting in domains (such as observability) where past patterns persist over extended periods. Addressing this gap is central to our work: we propose an architecture and data handling methodology that explicitly target improved modeling in long context scenarios. In contrast to prior multiresolution approaches, we view the coarser resolution context as a potential asset to make the finer resolution predictions more accurate. A very practical motivation for our architecture is that time series data is often available at different resolutions according to age (fine resolution data “expires” and is aggregated into coarse resolution summaries), and 1-minute and 1-hour resolutions in particular are often persisted. Our model is able to exploit pre-computed rollups in scenarios where full history at the finest resolution may not be available. We achieve efficient use of long context and a better tradeoff between recent detail and historical context as our model operates directly on multiresolution input: the more complex multiresolution architectures would require a conte

…(Full text truncated)…
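The multiresolution input scenario described above, where expired fine-resolution data survives only as pre-computed coarse rollups, can be sketched as a simple preprocessing step: keep a recent window at 1-minute resolution and aggregate the older history into 1-hour averages. This is an illustrative sketch under assumed window sizes, not the paper's data pipeline.

```python
import numpy as np

def to_multiresolution(series_1min: np.ndarray,
                       recent_minutes: int = 1440,
                       rollup_factor: int = 60):
    """Split a minutely series into (hourly rollups of older history,
    recent fine-resolution tail), mimicking pre-computed rollups."""
    recent = series_1min[-recent_minutes:]
    old = series_1min[:-recent_minutes]
    n = len(old) // rollup_factor
    rollups = old[: n * rollup_factor].reshape(n, rollup_factor).mean(axis=1)
    return rollups, recent

# One week of minutely data: ~6 days become 144 hourly rollups,
# and the last day stays at 1-minute resolution (1440 points).
week = np.random.default_rng(0).normal(size=7 * 24 * 60)
coarse, fine = to_multiresolution(week)
```

A model that accepts such input directly sees 144 + 1440 tokens' worth of context instead of the 10,080 fine-resolution points it would otherwise need, which is the long-context tradeoff the text motivates.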


Reference

This content is AI-processed based on ArXiv data.
