MMTS-BENCH: A Comprehensive Benchmark for Time Series Understanding and Reasoning


Time series data are central to domains such as finance, healthcare, and cloud computing, yet existing benchmarks for evaluating large language models (LLMs) on temporal tasks remain scattered and unsystematic. To bridge this gap, we introduce MMTS-BENCH, a comprehensive multimodal benchmark built upon a hierarchical taxonomy of time-series tasks spanning structural awareness, feature analysis, temporal reasoning, sequence matching, and cross-modal alignment. MMTS-BENCH comprises 2,424 time-series question-answering (TSQA) pairs across four subsets: Base, InWild, Match, and Align, generated through a progressive real-world QA framework and modular synthetic data construction. We conduct extensive evaluations of closed-source LLMs, open-source LLMs, and existing time-series-adapted large language models (TS-LLMs), revealing that: (1) TS-LLMs significantly lag behind general-purpose LLMs in cross-domain generalization; (2) LLMs show weaknesses on local tasks compared to global tasks; (3) chain-of-thought (CoT) reasoning and multimodal integration substantially improve performance; and (4) the dominant factor in existing TS-LLMs remains the capability of the backbone network rather than the design of the time-series encoder. MMTS-BENCH not only provides a rigorous evaluation framework but also offers clear directions for advancing LLMs toward robust, interpretable, and generalizable time-series reasoning.


💡 Research Summary

MMTS‑BENCH is introduced as a comprehensive multimodal benchmark designed to evaluate large language models (LLMs) on a wide spectrum of time‑series tasks. Recognizing that existing benchmarks are fragmented, lack hierarchical structure, and are often confined to single domains, the authors propose a systematic taxonomy that organizes time‑series capabilities into five core dimensions: structural awareness, feature analysis, temporal reasoning, sequence matching, and cross‑modal alignment. Each dimension is further broken down into concrete subtasks, yielding a total of 286 fine‑grained composite tasks.
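A hierarchy like this can be represented directly in code. The sketch below uses the five dimension names from the paper; the example subtasks are illustrative placeholders, not the benchmark's actual 286 composite tasks.

```python
# Minimal sketch of MMTS-BENCH's five-dimension taxonomy.
# Dimension names come from the paper; the subtask lists here
# are hypothetical examples, not the real task inventory.
TAXONOMY = {
    "structural_awareness": ["series_length", "sampling_rate"],
    "feature_analysis": ["trend", "seasonality"],
    "temporal_reasoning": ["causal_order", "anomaly_explanation"],
    "sequence_matching": ["isomorphism", "localization"],
    "cross_modal_alignment": ["series_to_text", "text_to_series"],
}

def list_composite_tasks(taxonomy):
    """Flatten the hierarchy into (dimension, subtask) pairs."""
    return [(dim, sub) for dim, subs in taxonomy.items() for sub in subs]
```

Flattening the hierarchy this way makes it easy to iterate over every fine-grained task when scoring a model per dimension.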

The benchmark comprises 2,424 time‑series question‑answer (TSQA) pairs divided into four subsets:

- Base uses a controllable synthetic data pipeline built from 17 expert‑designed templates that combine trend, seasonality, and noise through additive and concatenative operations. Precise parameter logging enables graded difficulty and diagnostic probing of model behavior.
- InWild draws real‑world multivariate series from five domains (transport, finance, cloud operations, climate, healthcare) in the LOTSA dataset. A three‑stage generation process integrates multimodal context (raw series, visual plots, domain metadata, statistical features) into LLM prompts, produces summaries and QA pairs, and then undergoes expert verification, ensuring high‑quality, domain‑rich questions.
- Match focuses on sequence‑matching abilities such as isomorphism, robustness, localization, and reverse correspondence.
- Align evaluates bidirectional mapping between time series and natural‑language semantics, testing advanced cross‑modal alignment.
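The Base subset's recipe of composing trend, seasonality, and noise additively (with concatenation to create regime changes) while logging every parameter can be sketched as follows. This is a minimal illustration of that idea, not the paper's actual templates; all function names and defaults are assumptions.

```python
import math
import random

def make_series(n=128, trend_slope=0.05, season_period=24,
                season_amp=1.0, noise_std=0.2, seed=0):
    """Additively compose trend + seasonality + Gaussian noise.

    Returns the series together with a parameter log, so the ground
    truth behind each sample stays recoverable for graded-difficulty
    QA generation (hypothetical analogue of the Base pipeline).
    """
    rng = random.Random(seed)
    series = [
        trend_slope * t
        + season_amp * math.sin(2 * math.pi * t / season_period)
        + rng.gauss(0.0, noise_std)
        for t in range(n)
    ]
    log = {"trend_slope": trend_slope, "season_period": season_period,
           "season_amp": season_amp, "noise_std": noise_std, "seed": seed}
    return series, log

def concat_series(a, b):
    """Concatenative composition: join two regimes to form a change point."""
    return a + b
```

Because the log records the generating parameters, a question such as "what is the seasonal period?" has an exact answer, which is what makes diagnostic probing of model behavior possible.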

Evaluation is performed on a range of closed‑source LLMs (e.g., GPT‑4, Claude), open‑source LLMs (e.g., LLaMA‑2, Falcon), and specialized time‑series LLMs (Time‑MQA, ChatTime, ChatTS). Results reveal four key insights: (1) TS‑LLMs lag behind general‑purpose LLMs in out‑of‑distribution (OOD) generalization across domains, indicating that current TS‑LLM designs over‑fit to narrow data distributions; (2) models perform worse on local tasks (e.g., short‑window volatility detection) than on global tasks (e.g., long‑term trend inference), suggesting limited capacity for capturing fine‑grained temporal dependencies; (3) incorporating chain‑of‑thought (CoT) prompting and multimodal inputs (time‑series plots) yields substantial gains, especially on complex reasoning and cross‑modal alignment, with average accuracy improvements of 12 percentage points on reasoning tasks; (4) the dominant factor driving TS‑LLM performance is the underlying backbone network's expressive power rather than the specific time‑series encoder, as models sharing the same backbone exhibit similar scores regardless of encoder variations.
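Insight (3) concerns how the question is posed to the model. A minimal sketch of assembling a TSQA prompt with optional CoT instructions and a plot attachment, plus exact-match scoring, might look like this; the prompt wording and scoring rule are assumptions for illustration, not the benchmark's actual protocol.

```python
def build_prompt(question, series, use_cot=True, plot_path=None):
    """Assemble a TSQA prompt from the raw series, an optional plot
    reference, and an optional chain-of-thought instruction
    (hypothetical wording, not MMTS-BENCH's exact template)."""
    parts = [f"Series: {', '.join(f'{x:.2f}' for x in series)}"]
    if plot_path:
        parts.append(f"[attached plot: {plot_path}]")
    parts.append(f"Question: {question}")
    if use_cot:
        parts.append("Think step by step before giving a final answer.")
    return "\n".join(parts)

def accuracy(predictions, gold):
    """Case-insensitive exact-match accuracy over QA pairs."""
    correct = sum(p.strip().lower() == g.strip().lower()
                  for p, g in zip(predictions, gold))
    return correct / len(gold)
```

Comparing `accuracy` with and without `use_cot=True` (and with and without the plot) is the kind of ablation behind the reported 12-percentage-point gain on reasoning tasks.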

The paper argues that future progress should prioritize scaling and pre‑training of the language backbone, while treating time‑series encoders as lightweight adapters. Moreover, the hierarchical taxonomy and the fine‑grained diagnostic capability of MMTS‑BENCH enable researchers to pinpoint precise weaknesses (e.g., local pattern detection) and guide targeted model improvements. The authors release all data, code, and evaluation scripts, inviting the community to extend the benchmark with additional domains, more complex multivariate interactions, and novel model architectures.

In summary, MMTS‑BENCH fills a critical gap by providing a systematic, hierarchical, and multimodal evaluation suite for time‑series understanding and reasoning, offering clear evidence of current LLM and TS‑LLM limitations and outlining concrete directions for building more robust, interpretable, and generalizable time‑series foundation models.

