TS-Debate: Multimodal Collaborative Debate for Zero-Shot Time Series Reasoning
Recent progress at the intersection of large language models (LLMs) and time series (TS) analysis has revealed both promise and fragility. While LLMs can reason over temporal structure given carefully engineered context, they often struggle with numeric fidelity, modality interference, and principled cross-modal integration. We present TS-Debate, a modality-specialized, collaborative multi-agent debate framework for zero-shot time series reasoning. TS-Debate assigns dedicated expert agents to textual context, visual patterns, and numerical signals, preceded by explicit domain knowledge elicitation, and coordinates their interaction via a structured debate protocol. Reviewer agents evaluate agent claims using a verification-conflict-calibration mechanism, supported by lightweight code execution and numerical lookup for programmatic verification. This architecture preserves modality fidelity, exposes conflicting evidence, and mitigates numeric hallucinations without task-specific fine-tuning. Across 20 tasks spanning three public benchmarks, TS-Debate achieves consistent and significant performance improvements over strong baselines, including standard multimodal debate in which all agents observe all inputs.
💡 Research Summary
The paper introduces TS‑Debate, a novel framework for zero‑shot time‑series reasoning (TSR) that explicitly leverages multimodal expertise while avoiding the pitfalls of naïve multimodal fusion. The authors first identify three core weaknesses of existing LLM‑based TSR approaches: (1) numeric fidelity loss when raw series are serialized as text, (2) modality interference when visual, textual, and numeric cues are merged without clear separation, and (3) lack of systematic verification, leading to hallucinated numbers and contradictory conclusions.
TS‑Debate addresses these issues by constructing a three‑stage pipeline. In the knowledge‑elicitation stage, the system extracts domain priors (k) from the user query (q) and any auxiliary context (c). These priors encode task semantics, expected patterns, constraints, and guidance on which modalities are most informative. This step grounds the subsequent reasoning in a shared “analysis contract” and reduces spurious reasoning paths.
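Since the paper describes elicitation only at the level of inputs (q, c) and output priors (k), the following is a hypothetical sketch of what the elicitation call might look like; the prompt fields mirror the priors described above but are not taken from the paper, and `llm` is assumed to be any callable mapping a prompt string to a response string.

```python
# Hypothetical prompt template for the knowledge-elicitation stage; the
# numbered fields mirror the priors described above, not the paper's prompt.
ELICITATION_PROMPT = """\
You are preparing an analysis contract for a time-series task.
Query: {query}
Context: {context}

Return domain priors covering:
1. Task semantics (what is being asked, and in what units).
2. Expected temporal patterns (trend, seasonality, known events).
3. Hard constraints (value ranges, physical limits).
4. Which modalities (text / chart / numeric) are most informative, and why.
"""

def elicit_priors(llm, query, context=""):
    """Produce the shared priors k from the query q and auxiliary context c.
    `llm` is assumed to be a chat-style callable: prompt -> str."""
    return llm(ELICITATION_PROMPT.format(query=query, context=context))
```

The returned priors are then prepended to every downstream agent's context, which is what makes them function as a shared "analysis contract."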
Next, the multimodal query construction stage transforms the raw time‑series x₁:T into three complementary views: a temporal view (raw values and trend markers), a spectral view (frequency‑domain representation such as FFT or spectrogram), and a statistical‑feature view (summary statistics, detected peaks, anomalies). Each view is presented through a modality‑specific interface: a textual description for the Text Analyst, a chart image for the Visual Analyst, and precise numeric arrays for the Numerical Analyst.
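The three views above can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions, not the paper's implementation: the exact trend markers, spectral representation, and feature set are chosen here for concreteness (first differences, top FFT magnitudes, and simple local-maximum peaks).

```python
import numpy as np

def build_views(x, top_k_freqs=3):
    """Decompose a series x_1:T into temporal, spectral, and statistical views.
    Contents are illustrative; the paper's exact views may differ."""
    x = np.asarray(x, dtype=float)
    # Temporal view: raw values plus first differences as simple trend markers.
    temporal = {"values": x.tolist(), "diffs": np.diff(x).tolist()}
    # Spectral view: dominant FFT frequencies (mean removed) and magnitudes.
    mags = np.abs(np.fft.rfft(x - x.mean()))
    freqs = np.fft.rfftfreq(len(x))
    top = np.argsort(mags)[::-1][:top_k_freqs]
    spectral = {"dominant_freqs": freqs[top].tolist(),
                "magnitudes": mags[top].tolist()}
    # Statistical-feature view: summary stats and naive local-maximum peaks.
    peaks = [i for i in range(1, len(x) - 1) if x[i] > x[i-1] and x[i] > x[i+1]]
    statistical = {"mean": float(x.mean()), "std": float(x.std()),
                   "min": float(x.min()), "max": float(x.max()),
                   "peak_indices": peaks}
    return {"temporal": temporal, "spectral": spectral, "statistical": statistical}
```

Each view then feeds its own interface: the statistical view is verbalized for the Text Analyst, the temporal view is rendered as a chart for the Visual Analyst, and the raw arrays go to the Numerical Analyst.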
The collaborative debate stage deploys three specialist agents, each operating on its dedicated modality. Every agent produces structured statements consisting of Observations, Inferences, and Limitations. For example, the Visual Analyst may note a rising trend in the chart, the Numerical Analyst may compute the exact mean increase, and the Text Analyst may cite domain knowledge such as “product launches typically cause a 5‑10 % price bump”. Importantly, the agents are isolated; they never see each other’s raw inputs, ensuring that disagreements reflect genuine cross‑modal evidence rather than paraphrasing.
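The Observations/Inferences/Limitations format lends itself to a simple structured record. The field names below follow the text; everything else (the class name, the `to_prompt` rendering) is an illustrative guess at the schema, not the paper's code.

```python
from dataclasses import dataclass, field

@dataclass
class AgentStatement:
    """Structured statement a specialist agent emits in each debate round."""
    agent: str  # "text", "visual", or "numerical"
    observations: list = field(default_factory=list)  # what the modality shows
    inferences: list = field(default_factory=list)    # conclusions drawn
    limitations: list = field(default_factory=list)   # caveats the agent flags

    def to_prompt(self):
        # Render the statement as plain text for the reviewers' context window.
        parts = [f"[{self.agent} analyst]"]
        for label, items in (("Observations", self.observations),
                             ("Inferences", self.inferences),
                             ("Limitations", self.limitations)):
            parts.append(label + ":")
            parts.extend(f"- {item}" for item in items)
        return "\n".join(parts)
```

Because reviewers only ever see these rendered statements, and the analysts never see each other's raw inputs, any disagreement between statements must originate in the modalities themselves.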
The heart of the framework is the Verification‑Conflict‑Calibration (VCC) protocol. A set of Reviewer agents equipped with lightweight code execution and lookup tools programmatically checks every quantitative claim. If a claim like “the average after Q3 is 4.2” is made, the reviewer computes the actual average from the numeric view and flags any discrepancy. Simultaneously, the reviewers detect cross‑modal conflicts (e.g., visual trend vs. numeric flatness) and assign conflict scores.
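The reviewer-side check for the running "average is 4.2" example can be sketched as follows. A real reviewer would presumably extract claims with an LLM before handing them to tools; this sketch substitutes a regex for claims of the form "the average ... is <number>" purely for illustration, and the tolerance threshold is an assumed parameter.

```python
import re

def verify_claim(claim, series, tol=0.05):
    """Programmatically check a quantitative claim against the numeric view.
    Returns the recomputed value and a flag when relative error exceeds tol."""
    m = re.search(r"average.*?is\s*(-?\d+(?:\.\d+)?)", claim)
    if m is None:
        return {"verifiable": False}  # non-numeric claims go to conflict scoring
    claimed = float(m.group(1))
    # Recompute the actual statistic directly from the numeric view.
    actual = sum(series) / len(series)
    rel_err = abs(claimed - actual) / max(abs(actual), 1e-9)
    return {"verifiable": True, "claimed": claimed, "actual": actual,
            "flagged": rel_err > tol}
```

Flagged claims are returned to the originating agent with the recomputed value, which is what lets the protocol suppress numeric hallucinations rather than merely vote over them.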
Finally, a Synthesizer aggregates reviewer scores, weighs them against the domain priors, and produces a calibrated answer. When conflicts are resolvable (e.g., a visual artifact is identified as a plotting error), the synthesizer adjusts the answer accordingly; when conflicts remain (e.g., ambiguous future prediction), it may output a probabilistic statement or request additional data.
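As a minimal sketch of the aggregation step, assume each candidate answer carries a list of reviewer scores in [0, 1] and an optional weight derived from the domain priors k; the plain weighted-average scheme below is an assumption for illustration, not the paper's scoring rule.

```python
def synthesize(candidates, prior_weight=None):
    """Aggregate reviewer-scored candidates into a calibrated answer.

    candidates:   dict answer -> list of reviewer scores in [0, 1]
    prior_weight: optional dict answer -> weight from the domain priors
    """
    prior_weight = prior_weight or {}
    scored = {}
    for answer, scores in candidates.items():
        base = sum(scores) / len(scores)          # mean reviewer score
        scored[answer] = base * prior_weight.get(answer, 1.0)
    total = sum(scored.values())
    # Normalize into a probability-like calibration over candidate answers,
    # which supports the probabilistic outputs used when conflicts remain.
    calibrated = {a: s / total for a, s in scored.items()}
    best = max(calibrated, key=calibrated.get)
    return best, calibrated
```

Returning the full calibrated distribution, not just the argmax, is what allows the synthesizer to emit a hedged probabilistic statement when no candidate dominates.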
Empirically, TS‑Debate is evaluated on 20 tasks across three public benchmarks: MTBench, TimerBed, and TSQA. Compared with strong baselines—including standard multimodal debate (where all agents see all inputs) and single‑modality LLMs—TS‑Debate achieves consistent improvements: +7.39 % on MTBench, +22.74 % on TimerBed, and +21.58 % on TSQA. The gains are especially pronounced in tasks requiring precise numeric reasoning (anomaly detection, forecasting), where error rates drop by over 30 % relative to baselines.
A notable contribution is that TS‑Debate requires no task‑specific fine‑tuning; it operates with off‑the‑shelf LLMs and MLLMs, relying solely on prompt engineering, modular agents, and the VCC protocol. This makes the approach lightweight and readily adaptable to new domains such as finance, climate monitoring, or industrial IoT, simply by updating the domain‑knowledge elicitation component.
The paper also discusses limitations: the current design handles only three modalities, assumes the availability of a code execution sandbox for verification, and may need extensions for streaming or high‑frequency data, where real‑time verification is challenging. Suggested future directions include expanding modality coverage, learning meta‑policies for reviewer verification, and integrating online conflict resolution for continuous monitoring scenarios.
In summary, TS‑Debate introduces a principled, verification‑driven multimodal debate architecture that preserves modality fidelity, surfaces conflicting evidence, and dramatically improves zero‑shot time‑series reasoning without any model retraining. It represents a significant step toward trustworthy, multimodal LLM‑assisted analytics in real‑world time‑series applications.