SenTSR-Bench: Thinking with Injected Knowledge for Time-Series Reasoning



Zelin He^{1*}, Boran Han^2, Xiyuan Zhang^2, Shuai Zhang^2, Haotian Lin^3, Qi Zhu^2, Haoyang Fang^2, Danielle C. Maddix^2, Abdul Fatir Ansari^2, Akash Chandrayan^3, Abhinav Pradhan^3, Bernie Wang^2, Matthew Reimherr^{1,3}
^1 The Pennsylvania State University   ^2 AWS AI Labs   ^3 Amazon RME

Abstract

Time-series diagnostic reasoning is essential for many applications, yet existing solutions face a persistent gap: general reasoning large language models (GRLMs) possess strong reasoning skills but lack the domain-specific knowledge to understand complex time-series patterns. Conversely, fine-tuned time-series LLMs (TSLMs) understand these patterns but lack the capacity to generalize reasoning to more complicated questions. To bridge this gap, we propose a hybrid knowledge-injection framework that injects TSLM-generated insights directly into the GRLM's reasoning trace, thereby achieving strong time-series reasoning with in-domain knowledge. As collecting data for knowledge-injection fine-tuning is costly, we further leverage a reinforcement learning-based approach with verifiable rewards (RLVR) to elicit knowledge-rich traces without human supervision, then transfer such in-domain thinking traces into the GRLM for efficient knowledge injection. We further release SenTSR-Bench, a multivariate time-series diagnostic reasoning benchmark collected from real-world industrial operations. Across SenTSR-Bench and other public datasets, our method consistently surpasses TSLMs by 9.1%–26.1% and GRLMs by 7.9%–22.4%, delivering robust, context-aware time-series diagnostic insights.

1 Introduction

Diagnostic reasoning over time-series data is a fundamental capability in many domains, enabling critical tasks such as event characterization, root-cause diagnosis, and decision-making (Leite et al., 2024; Chen et al., 2024a).
In industrial operations, for instance, streams of sensor data measuring machine temperature and vibration are analyzed to diagnose potential equipment failures (Figure 1 (a)). However, existing research in this domain has predominantly focused on surface-level anomaly detection (Alnegheimish et al., 2025). While effective at identifying irregularities, these techniques cannot offer actionable insights because they lack the capacity for the temporal and causal reasoning required to explain an anomaly's origin, diagnose its root cause, or recommend corrective actions.

Recent advances in LLMs (Jaech et al., 2024; Guo et al., 2025; Anthropic, 2025) have unlocked enhanced reasoning capabilities via embedded implicit reasoning mechanisms (Yeo et al., 2025), yielding remarkable gains on benchmarks requiring reasoning. However, these general reasoning LLMs (GRLMs) lack the domain knowledge needed to interpret complex time-series patterns, thereby producing incorrect reasoning trajectories and thus incorrect diagnoses (Merrill et al., 2024; Cao et al., 2026). In parallel, smaller LLM variants fine-tuned on domain-specific time-series–text pairs (Xie et al., 2024; Zhang et al., 2025) have shown improved alignment with time-series understanding tasks. Yet these fine-tuned time-series language models (TSLMs) frequently overfit to narrow, template-like tasks and lack the reasoning depth or generalization capacity required for out-of-distribution scenarios. As a result, both standalone GRLMs and TSLMs fall short in practice (illustrated in Figure 1 (c)).

* Work done during an internship at Amazon.

Figure 1: (a) The newly released SenTSR-Bench benchmark, collected from real-world machine monitoring environments, with multi-stage diagnostic questions. (b) Performance of the proposed framework on SenTSR-Bench, surpassing both stand-alone time-series specialists (TSLM) and general reasoning models (GRLM). (c) Case study illustrating why knowledge injection helps: the specialist captures key time-series patterns but fails to connect them to the correct root cause; the general reasoner shows strong reasoning but overlooks domain-specific critical failure patterns; our method injects the in-domain knowledge from the fine-tuned specialist into the reasoner's reasoning trace, aligning the trace with domain knowledge and producing the correct diagnosis.

To address the above challenge, we propose a reasoning-with-knowledge-injection framework that couples the reasoning power of GRLMs with the in-domain knowledge of TSLMs. At its core, the framework injects knowledge from TSLMs directly into the reasoning process of GRLMs, allowing the generated reasoning trace to continue with guidance from in-domain information. When the injected knowledge is reliable, it helps steer the reasoning trajectory toward accurate diagnoses; when the knowledge is weaker, the model corrects it with its strong critical thinking capacity.

One additional challenge is that a TSLM trained for in-domain question answering often fails to function effectively as an assistant for a GRLM. A typical alternative is to fine-tune a dedicated helper model, but this approach is constrained by the need to construct large, high-quality datasets explicitly tailored for knowledge injection. To overcome this supervision bottleneck, we introduce thinking transfer. Our method trains the TSLM within a reinforcement learning with verifiable rewards framework (Guo et al., 2025), leveraging rule-based verifiable rewards and an explicit thinking structure to naturally elicit knowledge-rich thinking traces without any manual supervision.
At inference, these RL-honed traces are injected into the GRLM, providing it with high-quality, in-domain knowledge to ground its subsequent reasoning process.

Furthermore, to benchmark time-series diagnostic reasoning in real-world diagnostic settings, we introduce the Sensor-based Time-Series Diagnostic Reasoning (SenTSR-Bench) Benchmark, a first-of-its-kind dataset of multivariate sensor streams and diagnostic texts for time-series diagnostic reasoning evaluation. In contrast to prior benchmarks that are either purely synthetic or LLM-annotated, SenTSR-Bench is built on real-world multivariate time-series data drawn from real-world diagnostic events with human-annotated data.

Across SenTSR-Bench and other existing benchmark datasets, and on both closed-source and open-source reasoning models, our method surpasses TSLMs by 9.1–26.1% and GRLMs by 7.9–22.4%. RL-enhanced injection further yields 1.66×–2.92× larger gains than SFT-enhanced injection, and consistently outperforms few-shot prompting and prompt-based collaboration approaches.

Figure 2: Overview of the proposed paradigm. (a) Knowledge injection: given a reasoning question and its time-series, a time-series LM (TSLM) produces grounded analysis snippets that are injected into the reasoning trace of a frozen general reasoning LM (GRLM) to answer diagnostic queries without weight updates. (b) Thinking transfer via RL: we train the TSLM using reinforcement learning with verifiable rewards (RLVR) with an explicit thinking structure to elicit analysis-first thinking traces without human supervision; at inference, these traces are transferred via injection into the reasoning LM to strengthen temporal grounding for diagnosis.

Taken together, our key contributions are as follows:

• New Paradigm for Time-Series Reasoning. We formalize a framework that injects in-domain knowledge from a TSLM into a GRLM's reasoning process, steering reasoning with domain knowledge.
• RL-Based Method for Efficient Injection. We propose an injection paradigm that utilizes reinforcement learning with verifiable rewards to elicit knowledge-rich thinking traces for injection without manual supervision.

• Real-World Benchmark and Evaluation. We release SenTSR-Bench, a de-identified, real-world multivariate time-series benchmark for diagnostic reasoning. Evaluations on SenTSR-Bench and public datasets show state-of-the-art diagnostic accuracy of our proposed solution, with interpretable explanations.

2 Methodology

Figure 2 provides an overview of our proposed framework. In this section, we first establish preliminaries and formally define the reasoning model generation process (Section 2.1). We then introduce the general paradigm of knowledge injection (Section 2.2) and instantiate this framework (Section 2.3). Finally, we describe a reinforcement learning-based framework for efficient knowledge injection (Section 2.4).

2.1 Preliminaries and Notation

Multimodal Input. Write V for the discrete token vocabulary and V* for the space of finite token sequences, and use [a, b] to denote the concatenation of two sequences a and b. Let q = (q_1, ..., q_n) ∈ V* be a sequence of textual tokens describing the task (e.g., question, context, or instructions). A multivariate time-series is denoted by X = {x_t}_{t=1}^{T}, where each x_t ∈ R^D is the reading of D channels at time step t. To interface with language models, X must be mapped into the token space V*. This can be done, for example, by rendering the series as a line-plot image and encoding it (Liu et al., 2025c), converting it into structured JSON text followed by standard text tokenization, or applying a specialized time-series tokenizer (Xie et al., 2024). With slight abuse of notation, we use X to denote the final tokenized representation.

Reasoning Model.
We define a reasoning model through its generative distribution π (also referred to as a policy in later context), which generates two outputs: an internal reasoning trace r = (r_1, ..., r_K) ∈ V* and a final answer y = (y_1, ..., y_M) ∈ V*. Generation proceeds in two phases. In the reasoning phase, the model autoregressively produces a latent reasoning trace conditioned on the input pair (X, q) and a special thinking structure:

    π(r | X, q) = ∏_{k=1}^{K} π(r_k | X, q, [⟨think⟩, r_{<k}]).    (1)

Here, ⟨think⟩ marks the beginning of the reasoning segment, which continues until the model emits the closing token ⟨/think⟩. In the response phase, the model conditions on both the input and the full reasoning trace to generate the final answer:

    π(y | X, q, [⟨think⟩, r, ⟨/think⟩]) = ∏_{j=1}^{M} π(y_j | X, q, [⟨think⟩, r, ⟨/think⟩, y_{<j}]).    (2)

This reasoning-then-response decomposition exposes the latent reasoning trace r, which we later inspect and modify through knowledge injection. In this paper, we distinguish two models: a (frozen) general reasoning model (GRLM), denoted π_G, is a large open/closed-source model that follows the reasoning-then-response factorization discussed above; a time-series language model (TSLM), denoted π_T, is a small fine-tuned in-domain specialist.

2.2 General Knowledge Injection Paradigm

Specialist Knowledge Generation. Given the current reasoning state of the general reasoner π_G at step k, i.e., the prefix r^G_{<k} together with inputs (X, q), we form an injection-oriented token sequence q̃ = Query(q, r^G_{≤k}), where Query(·) is a deterministic query-shaping function (e.g., "provide helpful information", "validate the claims"). Concrete choices are given in later subsections. Then a TSLM is invoked on (X, q̃) to produce an output sequence K^T ∼ π_T(· | X, q̃).
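The two-phase generation process of Section 2.1 underlies both models and can be sketched in code. The sketch below is purely illustrative and not the paper's implementation: `toy_policy` is a hypothetical deterministic stand-in for an autoregressive LM, whereas a real system samples token-by-token from π.

```python
# Minimal sketch of the reasoning-then-response decomposition (Eqs. 1-2).
# `policy` maps a token prefix to the next token; `toy_policy` below is a
# hypothetical deterministic stand-in, not an actual language model.

THINK_OPEN, THINK_CLOSE = "<think>", "</think>"

def sample_trace_then_answer(policy, x_tokens, q_tokens, max_steps=64):
    """Phase 1 (Eq. 1): emit trace r until </think>. Phase 2 (Eq. 2): emit answer y."""
    base = list(x_tokens) + list(q_tokens) + [THINK_OPEN]
    trace = []
    for _ in range(max_steps):          # reasoning phase
        tok = policy(base + trace)
        if tok == THINK_CLOSE:
            break
        trace.append(tok)
    ctx, answer = base + trace + [THINK_CLOSE], []
    for _ in range(max_steps):          # response phase
        tok = policy(ctx + answer)
        if tok == "<eos>":
            break
        answer.append(tok)
    return trace, answer

# Fixed script the toy policy replays: a short trace, then a short answer.
SCRIPT = ["vibration", "spikes", "at", "t=120", THINK_CLOSE,
          "bearing", "wear", "<eos>"]

def toy_policy(prefix):
    emitted = len(prefix) - prefix.index(THINK_OPEN) - 1  # tokens after <think>
    return SCRIPT[emitted]

r, y = sample_trace_then_answer(toy_policy, ["<ts>"], ["What failed?"])
# r == ['vibration', 'spikes', 'at', 't=120'];  y == ['bearing', 'wear']
```

Exposing the trace r as an explicit object, rather than decoding it jointly with y, is what makes the injection operations below possible.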
Intuitively, K^T stands for the relevant in-domain time-series knowledge for injection.

Reasoning with Knowledge Injection. Given the TSLM knowledge output K^T and the current GRLM reasoning prefix r^G_{≤k}, we apply injection with r^{Inj}_{≤k} = Inject(r^G_{≤k}, K^T), which returns an updated thinking-trace prefix to be used. Here Inject(·) is a deterministic injection function; the choice of k depends on the injection method. The general reasoner then resumes generation conditioned on the updated prefix:

    r^G_j ∼ π_G(· | X, q, [⟨think⟩, r^{Inj}_{<j}]),  j ≥ k,    (3)

and then produces the final answer as in Eq. (2).

2.3 Instantiating Knowledge Injection

Early Knowledge Injection. A simple yet effective way to realize the injection paradigm is through early injection: injecting immediately after ⟨think⟩. We choose a tokenized instruction v_help (e.g., "produce a step-by-step analysis of the question with the time-series data") and form q̃ = Query_help(q, ∅) = [q, v_help]. The specialist then generates the knowledge snippet K^T ∼ π_T(· | X, q̃) from its learnt time-series knowledge, to which we append a brief reflection trigger v_reflect (e.g., "Wait, let me reflect on my previous thinking process with the time-series data.") to elicit critical reasoning. We then inject at k = 1:

    r^{Inj}_{≤1} = Inject_reflect(∅, K^T) = [K^T, v_reflect],

after which the general reasoner π_G continues its reasoning trace and produces the final response conditioned on r^{Inj}_{≤1} (cf. Section 2.1). Conceptually, π_T contributes grounded, in-domain, time-series-based insights extracted from X, while π_G performs the general reasoning by integrating the injected knowledge with the context q, adjudicating alternatives, and producing the final answer.

Algorithm 1: Workflow for Knowledge Injection with RL-Honed Thinking
  Input: Training set D_train, test set D_test, general reasoner policy π_G
  Output: Trained specialist policy π_T and predictions on D_test
  # Stage I: Train TSLM with RLVR
  1: for (X, q, y*) ∈ D_train do
  2:     update π_T using RLVR training with the composite reward   // cf. Eq. (6)
  # Stage II: Inference-time knowledge injection for GRLM
  3: for (X, q) ∈ D_test do
  4:     obtain r^T ∼ π_T(· | X, q)
  5:     form r^{Inj}_{≤1} ← Inject_reflect(∅, r^T)   // cf. Eq. (5)
  6:     obtain r^G ∼ π_G(· | X, q, [⟨think⟩, r^{Inj}_{≤1}])
  7:     produce and record y^G with X, q, r^G   // cf. Eq. (2)
  8: return π_T and all test predictions

Other Injection Paradigms. Beyond early injection, the framework also supports alternative strategies. Examples include intermediate injection, which corrects the GRLM's reasoning process by inserting the TSLM's knowledge at low-confidence points in the reasoning trace, and late injection, which prompts the TSLM to critique the entire GRLM reasoning trace and prompts reflection before the final answer. Full implementation details are provided in Appendix E. In practice, we find that early injection is the most broadly effective; we therefore adopt early injection as the default in subsequent method development and report comparison results for the other variants in Section 4.3.

Practical Implementation. The method is easy to implement and compatible with standard LLM APIs. For models that support assistant prefill, the injected trace can be directly fed as the assistant's initial tokens by pre-inserting [⟨think⟩, r^{Inj}_{≤k}]. For models that do not allow prefill of reasoning traces, we instead use an instructional proxy by wrapping the injected trace in the model's recommended thinking templates. See Appendix E for details.
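The early-injection recipe (query shaping, reflection trigger, assistant prefill) amounts to plain string and message construction. The sketch below is a minimal illustration under assumptions: helper names such as `build_prefilled_messages` and the example TSLM snippet are invented for demonstration, and actual prefill formats are API-specific.

```python
# Sketch of early injection via assistant prefill (Sec. 2.3). Helper names
# and the example strings are illustrative; real prefill formats vary by API.

V_HELP = "Produce a step-by-step analysis of the question with the time-series data."
V_REFLECT = ("Wait, let me reflect on my previous thinking process "
             "with the time-series data.")

def query_help(q):
    """Shape the injection-oriented query: q~ = [q, v_help]."""
    return f"{q}\n{V_HELP}"

def inject_reflect(k_t):
    """Early injection at k = 1: prefix = [K_T, v_reflect] (Eq. 5)."""
    return f"{k_t}\n{V_REFLECT}"

def build_prefilled_messages(ts_tokens, q, k_t):
    """Messages for a GRLM that supports assistant prefill: the injected
    trace is pre-inserted as the assistant's opening <think> tokens."""
    return [
        {"role": "user", "content": f"{ts_tokens}\n{q}"},
        {"role": "assistant", "content": f"<think>\n{inject_reflect(k_t)}"},
    ]

msgs = build_prefilled_messages(
    ts_tokens="[tokenized multivariate series]",
    q="Which component most likely caused the breakdown?",
    k_t="Velocity RMS triples at t=120 while temperature stays flat.",  # hypothetical K_T
)
```

The GRLM then continues decoding from the prefilled ⟨think⟩ prefix, so the injected analysis and reflection cue become the opening tokens of its own reasoning trace rather than external instructions.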
2.4 Knowledge Injection with RL-Honed Thinking Traces

A time-series specialist π_T is typically optimized for direct question answering, y^T ∼ π_T(· | X, q), where the objective is to predict the answer tokens y^T given inputs (X, q). In contrast, knowledge injection requires the specialist to provide an intermediate analysis or evidence rather than a final answer. This is usually elicited through a help-oriented query,

    K^T ∼ π_T(· | X, q̃),  q̃ = [q, v_help],

where, recall, v_help is an instruction for producing helping knowledge (cf. Section 2.3). This mismatch induces a task shift: the TSLM π_T, trained to produce direct answers, tends to generate hallucinated content rather than faithful, unbiased analysis. As a result, K^T is systematically misaligned with the desired ground-truth knowledge for injection. Constructing large expert-annotated corpora specifically for this injection setting could mitigate the issue but is prohibitively costly.

Thinking Transfer. To resolve the task shift between answering and supplying knowledge, we propose to align the specialist with its injection role by training it to produce a thinking trace before any answer. Such a specialist thinking trace then serves directly as the knowledge source,

    K^T_think := r^T ∼ π_T(· | X, q).    (4)

At inference, we perform injection by starting GRLM reasoning with this analysis and a brief reflection cue,

    r^{Inj}_{≤1} = Inject_reflect(∅, r^T) = [r^T, v_reflect],    (5)

and then continue the general reasoning process. This design naturally aligns training and deployment: the TSLM learns to produce analysis first, and the injected analysis serves as a grounded knowledge source for steering the reasoner.

RL Training without Thinking Supervision. Directly training a TSLM to produce analysis-first thinking traces, as in Eq. (4), is challenging, as most time-series diagnostic datasets contain only ground-truth answers y* but not the intermediate reasoning traces r*. To overcome this, we employ reinforcement learning with verifiable rewards (RLVR) (Guo et al., 2025). Let z = [r, y] denote a sampled completion containing both a trace and an answer. For each context (X, q), we draw a group of G completions {z_i}_{i=1}^{G} and optimize the group-relative objective:

    max_θ E_{{z_i}_{i=1}^{G} ∼ π_θ(· | X, q)} [ L_GRPO(θ, {R(z_i)}_{i=1}^{G}) ],    (6)

with R(z) = r_fmt(z) + r_hard(z), where r_fmt ∈ {0, 1} is a format reward that equals 1 if the output follows the target structure ⟨think⟩ r ⟨/think⟩ ⟨answer⟩ y ⟨/answer⟩, and 0 otherwise. The hard reward r_hard ∈ {0, 1} equals 1 if the predicted answer y matches the ground-truth y*, and 0 otherwise. The objective is computed over groups of G sampled completions, with rewards normalized within the group; the detailed form of L_GRPO is provided in Appendix A.1. Importantly, no labeled traces are required: the policy is driven to elicit analysis-first reasoning purely through structural and correctness feedback. This is particularly valuable for time-series diagnostics, where ground-truth outcomes are available but the intermediate causal links between the time series X and the underlying root cause y* are unobserved and must be discovered through learning. The full algorithm is summarized in Algorithm 1.

3 Benchmark: SenTSR-Bench

Figure 3: SenTSR-Bench construction pipeline.

Despite the growing interest in time-series diagnostic reasoning, there are still very limited high-quality datasets that couple real-world time-series with textual diagnostic annotations.
As summarized in Table 1, existing work primarily relies on LLM-annotated versions of public time-series datasets or fully synthetic time series–text pairs, and typically provides only a single question per series, falling short of capturing real-world diagnostic complexity. In this work, we introduce SenTSR-Bench, a new benchmark directly motivated by real-world sensor monitoring for machine breakdown diagnosis and troubleshooting.

Table 1: Comparison of time-series diagnostic reasoning benchmarks.

Benchmark                          | New Time-Series? | Real-World? | Multi-stage Advancing Questions? | Annotation
TSEvol (Xie et al., 2024)          | ✗                | ✓           | ✗                                | LLM
TS&Language (Merrill et al., 2024) | ✓                | ✗           | ✗                                | LLM
MTBench (Chen et al., 2025b)       | ✓                | ✓           | ✗                                | LLM
Sensor-TSR (Ours)                  | ✓                | ✓           | ✓                                | Human

The benchmark consists of de-identified, multivariate time-series signals collected from vibration (acceleration, velocity) and temperature sensors, paired with human-curated diagnostic annotations. SenTSR-Bench moves beyond anomaly flagging and evaluates the full procedure of diagnostic reasoning. The benchmark contains different levels of questions: (i) what happened (recognizing anomalous segments in multivariate time-series), (ii) how it happened (inferring plausible root causes behind the observed signals), and (iii) suggested fix (proposing potential corrective actions). This benchmark provides a realistic and challenging testbed for developing models capable of robust, context-aware diagnostic reasoning. Figure 3 shows a simplified version of the data construction pipeline. Additional details on the benchmark construction pipeline are provided in Appendix D.

Evaluation Dataset Curation. To build the evaluation dataset, we follow a three-stage curation pipeline. First, we filter 110 multivariate sensor streams out of an initial pool of over 2,000 candidate samples, selecting those that exhibit clear anomalous patterns tied to potential troubleshooting actions.
All signals are then standardized to remove sensitive information. Second, we design an annotation pipeline that generates multi-stage diagnostic text while preserving privacy, producing faithful but de-identified annotations. Third, we construct 330 multiple-choice questions (MCQs) by pairing ground-truth answers with distractors. This process yields a benchmark that is both realistic and privacy-preserving, while supporting rigorous evaluation of anomaly recognition, root-cause reasoning, and fix-proposal tasks.

Training Dataset Generation. A key challenge is generating diverse multivariate sensor streams when only a small number of real-world seeds are available. To address this, we design a two-stage synthetic generation pipeline powered by vision–language models (VLMs). Stage 1: iterative code synthesis prompts a VLM with plots and context from 23 de-identified seeds to produce Python code that mimics the original behaviors. Stage 2: diversification and simplification transforms these simulators into compact stochastic generators that introduce randomized dynamics and parameter variation, yielding broad families of realistic synthetic series. The resulting synthetic data are then used to construct 6,000 MCQ training entries consistent with the evaluation design.

4 Experiments

4.1 Experiment Setup

Datasets. For evaluation, we use SenTSR-Bench, our de-identified, real-world benchmark of multivariate time-series with three progressively harder tasks: What happened (key time-series anomaly characterization), How it happened (root-cause diagnosis), and Suggested fix (action recommendation). We additionally assess performance on two public benchmarks: TSEvol (Dataset A) from Xie et al. (2024), which covers inductive, deductive, and causal reasoning, and the MCQ2 dataset from the TS&Language Benchmark (Merrill et al., 2024), which poses relational queries over paired time-series under textual context.
Additional details on the datasets are provided in Appendix D.

Table 2: Reasoning performance on the SenTSR-Bench Benchmark (mean ± std). Best per block are bolded. The last two columns report relative gains (in %) for injection rows vs. the corresponding specialized TSLM and the zero-shot general reasoner (GRLM), respectively.

Model                   | Paradigm      | What Happened | How Happened  | Suggested Fix | Overall       | vs. TSLM | vs. GRLM
TSLM (Qwen-VL-3B)       | SFT           | 0.530 ± 0.037 | 0.567 ± 0.029 | 0.548 ± 0.011 | 0.549 ± 0.019 | —        | —
TSLM (Qwen-VL-3B)       | RL            | 0.512 ± 0.038 | 0.594 ± 0.019 | 0.546 ± 0.009 | 0.551 ± 0.014 | —        | —
GRLM (Claude3.7-Text)   | Zero-shot     | 0.712 ± 0.019 | 0.409 ± 0.033 | 0.473 ± 0.024 | 0.531 ± 0.011 | —        | —
GRLM (Claude3.7-Text)   | Few-shot      | 0.691 ± 0.009 | 0.561 ± 0.011 | 0.509 ± 0.009 | 0.587 ± 0.006 | —        | +10.5%
TSLM + GRLM             | SFT-Injection | 0.742 ± 0.023 | 0.603 ± 0.021 | 0.558 ± 0.019 | 0.634 ± 0.006 | +15.5%   | +19.4%
TSLM + GRLM             | RL-Injection  | 0.779 ± 0.014 | 0.627 ± 0.018 | 0.542 ± 0.028 | 0.650 ± 0.010 | +18.0%   | +22.4%
TSLM (Qwen-VL-3B)       | SFT           | 0.530 ± 0.037 | 0.567 ± 0.029 | 0.548 ± 0.011 | 0.549 ± 0.019 | —        | —
TSLM (Qwen-VL-3B)       | RL            | 0.512 ± 0.038 | 0.594 ± 0.019 | 0.546 ± 0.009 | 0.551 ± 0.014 | —        | —
GRLM (Claude3.7-Vision) | Zero-shot     | 0.764 ± 0.016 | 0.542 ± 0.019 | 0.555 ± 0.018 | 0.620 ± 0.006 | —        | —
GRLM (Claude3.7-Vision) | Few-shot      | 0.824 ± 0.014 | 0.552 ± 0.014 | 0.555 ± 0.018 | 0.643 ± 0.005 | —        | +3.7%
TSLM + GRLM             | SFT-Injection | 0.756 ± 0.031 | 0.588 ± 0.013 | 0.649 ± 0.029 | 0.665 ± 0.020 | +21.1%   | +7.3%
TSLM + GRLM             | RL-Injection  | 0.827 ± 0.009 | 0.661 ± 0.014 | 0.597 ± 0.032 | 0.695 ± 0.012 | +26.1%   | +12.1%

Table 3: Reasoning performance on TSEvol and the TS&Language Benchmark (mean ± std). Best per block are bolded. The last two columns report relative gains (in %) for injection rows vs. the corresponding specialized TSLM and the zero-shot general reasoner (GRLM), respectively.

Model                        | Paradigm      | Causal        | Deductive     | Inductive     | MCQ2          | Overall       | vs. TSLM | vs. GRLM
TSLM (Qwen-VL-3B)            | SFT           | 0.623 ± 0.006 | 0.520 ± 0.013 | 0.357 ± 0.010 | 0.507 ± 0.032 | 0.502 ± 0.005 | —        | —
TSLM (Qwen-VL-3B)            | RL            | 0.627 ± 0.016 | 0.496 ± 0.014 | 0.313 ± 0.023 | 0.597 ± 0.031 | 0.508 ± 0.006 | —        | —
GRLM (Qwen3-32B)             | Zero-shot     | 0.507 ± 0.041 | 0.473 ± 0.035 | 0.623 ± 0.036 | 0.407 ± 0.015 | 0.502 ± 0.023 | —        | —
GRLM (Qwen3-32B)             | Few-shot      | 0.622 ± 0.028 | 0.473 ± 0.035 | 0.460 ± 0.033 | 0.427 ± 0.015 | 0.495 ± 0.010 | —        | -1.4%
TSLM + GRLM                  | SFT-Injection | 0.569 ± 0.035 | 0.543 ± 0.013 | 0.592 ± 0.031 | 0.410 ± 0.036 | 0.528 ± 0.008 | +5.2%    | +5.2%
TSLM + GRLM                  | RL-Injection  | 0.627 ± 0.025 | 0.512 ± 0.047 | 0.588 ± 0.035 | 0.490 ± 0.046 | 0.554 ± 0.021 | +9.1%    | +10.4%
TSLM (Qwen-VL-3B)            | SFT           | 0.623 ± 0.006 | 0.520 ± 0.013 | 0.357 ± 0.010 | 0.507 ± 0.032 | 0.502 ± 0.005 | —        | —
TSLM (Qwen-VL-3B)            | RL            | 0.627 ± 0.016 | 0.496 ± 0.014 | 0.313 ± 0.023 | 0.597 ± 0.031 | 0.508 ± 0.006 | —        | —
GRLM (R1-Distilled-Qwen-32B) | Zero-shot     | 0.522 ± 0.022 | 0.550 ± 0.054 | 0.525 ± 0.015 | 0.483 ± 0.015 | 0.520 ± 0.010 | —        | —
GRLM (R1-Distilled-Qwen-32B) | Few-shot      | 0.542 ± 0.017 | 0.558 ± 0.040 | 0.478 ± 0.022 | 0.513 ± 0.021 | 0.523 ± 0.007 | —        | +0.6%
TSLM + GRLM                  | SFT-Injection | 0.594 ± 0.023 | 0.535 ± 0.023 | 0.519 ± 0.004 | 0.490 ± 0.020 | 0.534 ± 0.007 | +6.4%    | +2.7%
TSLM + GRLM                  | RL-Injection  | 0.634 ± 0.013 | 0.543 ± 0.013 | 0.532 ± 0.010 | 0.537 ± 0.032 | 0.561 ± 0.011 | +10.4%   | +7.9%

Implementation and Evaluation. For the general reasoning model, we test the open-source models DeepSeek-R1-Distilled-Qwen-32B (Guo et al., 2025) and Qwen3-32B (Yang et al., 2025) as well as the closed-source model Claude3.7 (Anthropic, 2025), with time-series encoded in either vision form (-vision) or textual form (-text). All models are run with standard configurations. For the fine-tuned TSLM, we primarily use Qwen2.5-VL-3B (Bai et al., 2025) for SFT and RL training. We also use ChatTS-14B (Xie et al., 2024) for injection design exploration. For evaluation, generative QA tasks (inductive reasoning in SenTSR-Bench) are evaluated using RAGAS; verifiable tasks report accuracy. All results are averaged over three independent runs. Further details on implementation and evaluation are provided in Appendix E.
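The accuracy numbers reported for verifiable MCQ tasks are plain per-run accuracies aggregated as mean ± std over the three runs. A minimal sketch of that aggregation, with illustrative (not actual) predictions, and population standard deviation as one possible reporting convention:

```python
# Sketch of the mean +/- std aggregation for verifiable (MCQ) tasks.
# The predictions below are illustrative; population std shown here.
from statistics import mean, pstdev

def mcq_accuracy(preds, golds):
    """Fraction of multiple-choice predictions matching the answer key."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

golds = ["A", "C", "B", "B"]
runs = [
    mcq_accuracy(["A", "C", "B", "D"], golds),  # run 1 -> 0.75
    mcq_accuracy(["A", "C", "D", "B"], golds),  # run 2 -> 0.75
    mcq_accuracy(["A", "C", "B", "B"], golds),  # run 3 -> 1.00
]
print(f"{mean(runs):.3f} ± {pstdev(runs):.3f}")  # -> 0.833 ± 0.118
```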
4.2 Performance Analysis

For performance analysis, we evaluate the proposed knowledge injection framework on both our newly released SenTSR-Bench benchmark and public benchmarks. We test injection across different TSLM training and injection paradigms (SFT/RL-based) and multiple general reasoning models. Results are presented in Tables 2 and 3. We make the following observations.

Injection Lifts Both Baselines. Across all benchmark datasets, injecting TSLM knowledge, whether from an SFT- or RL-tuned TSLM, consistently boosts accuracy over both stand-alone specialists and stand-alone reasoners. On SenTSR-Bench, gains range from +15.5% to +26.1% over the specialized TSLM and +7.3% to +22.4% over the general GRLM; improvements span all three tasks and are most pronounced on How happened, which involves both in-domain anomaly detection knowledge and strong causal reasoning capacity.

Table 4: Performance with different injection strategies on TSEvol and the TS&Language Benchmark (mean ± std). Best results are bolded.

Model                   | Injection Strategy | Inductive     | Deductive     | Causal        | MCQ2          | Overall
TSLM (ChatTS-14B)       | —                  | 0.812 ± 0.007 | 0.597 ± 0.013 | 0.732 ± 0.006 | 0.590 ± 0.026 | 0.683 ± 0.010
GRLM (Claude3.7-Text)   | —                  | 0.763 ± 0.021 | 0.612 ± 0.029 | 0.645 ± 0.021 | 0.640 ± 0.014 | 0.665 ± 0.010
TSLM + GRLM             | Intermediate       | 0.805 ± 0.026 | 0.659 ± 0.022 | 0.645 ± 0.010 | 0.703 ± 0.037 | 0.703 ± 0.006
TSLM + GRLM             | Late               | 0.791 ± 0.014 | 0.667 ± 0.011 | 0.703 ± 0.019 | 0.680 ± 0.022 | 0.710 ± 0.003
TSLM + GRLM             | Early              | 0.824 ± 0.019 | 0.643 ± 0.011 | 0.703 ± 0.019 | 0.690 ± 0.016 | 0.715 ± 0.003
TSLM (ChatTS-14B)       | —                  | 0.812 ± 0.007 | 0.597 ± 0.013 | 0.732 ± 0.006 | 0.590 ± 0.026 | 0.683 ± 0.010
GRLM (Claude3.7-Vision) | —                  | 0.792 ± 0.016 | 0.643 ± 0.011 | 0.630 ± 0.009 | 0.690 ± 0.008 | 0.689 ± 0.005
TSLM + GRLM             | Intermediate       | 0.809 ± 0.011 | 0.674 ± 0.000 | 0.663 ± 0.009 | 0.713 ± 0.017 | 0.715 ± 0.004
TSLM + GRLM             | Late               | 0.800 ± 0.019 | 0.682 ± 0.011 | 0.707 ± 0.009 | 0.697 ± 0.005 | 0.721 ± 0.005
TSLM + GRLM             | Early              | 0.825 ± 0.011 | 0.643 ± 0.029 | 0.746 ± 0.005 | 0.730 ± 0.014 | 0.736 ± 0.002
On the public benchmarks, we observe similar trends: +5.2% to +10.4% over the specialist and +2.7% to +10.4% over the reasoner. The injected variant also shows robustness: even when the TSLM performs poorly (e.g., on the Inductive task), the injected model leverages the reasoner's critical thinking capacity to maintain competitive performance. Taken together, injection delivers the best overall performance across settings.

RL-based Injection Consistently Yields Larger Gains. Compared with SFT-based injection, RL-based thinking transfer delivers consistently larger improvements over zero-shot GRLMs: when measuring gains, RL-based injection provides 1.66× the improvement on Claude3.7-Vision, 2.00× on Qwen3-32B, and 2.92× on DeepSeek-R1-Distilled-Qwen-32B. Both SFT and RL injection outperform few-shot prompting, but RL provides the biggest lifts (e.g., 3.27× the improvement of few-shot prompting on Claude3.7-Vision). Moreover, injection is more token-efficient: tokenized multivariate time-series in TSEvol can exceed ~50k tokens, making few-shot prompts infeasible, whereas injection provides a compact analysis snippet through thinking prefill, offering a more scalable mechanism for time-series diagnostic reasoning.

4.3 Framework Analysis

Figure 4: Comparison of baseline (zero-shot) reasoning, knowledge prompting, and knowledge injection. (a) SenTSR-Bench Benchmark with Qwen-VL-3B (RL) as the TSLM. (b) TSEvol and TS&Language Benchmarks with Qwen-VL-3B (RL) as the TSLM. (c) TSEvol and TS&Language Benchmarks with ChatTS-14B as the TSLM. Across all settings, the injection-based method consistently outperforms the others.

Comparison across Different Injection Strategies. We evaluate three injection strategies: early, intermediate, and late.
Early injection inserts the specialist's analysis immediately after the opening token; intermediate injection corrects the reasoning at the lowest-confidence token position; and late injection appends a specialist-generated critique to the full reasoning trace at the end. Further implementation details are provided in Appendix E. We use ChatTS-14B (Xie et al., 2024) here (rather than smaller specialists) as it is tuned on a broad range of time-series QA tasks and is thus better suited to investigating this question. Table 4 reports results across Claude3.7-Text and Claude3.7-Vision. Results show that all three strategies consistently outperform both baselines, in line with our previous findings. Among them, early injection yields the strongest gains across both text and vision reasoners. One key reason is that mid/late injection requires the specialist to read and revise long reasoning traces, which lie outside the distribution of QA-style SFT and lead to drift or hallucination. Early injection aligns naturally with the specialist's strength of producing short, focused analyses that can be directly prefixed into the reasoning trajectory.

Comparison between Prompting and Knowledge Injection. We next compare our knowledge injection approach with a prompting-based alternative. In the prompting setup, the same TSLM outputs are provided to the reasoning model as additional prompt instructions, rather than being integrated into its internal reasoning trace. Figure 4 contrasts the three strategies: baseline (zero-shot) reasoning, prompting, and injection. Across all model families, from open-source to closed-source, and across all three benchmark datasets, we observe that injection consistently outperforms prompting. This advantage arises because injection places domain knowledge directly inside the reasoning process, which encourages the model to interact with and reflect upon the knowledge more effectively.
In contrast, when knowledge is only presented as external prompt instructions, the reasoning model often fails to fully incorporate it. See Appendix F for illustrative case studies.

4.4 Additional Analysis

We ablate the role of direct time-series access by removing the raw series from the GRLM input, showing that relying solely on the TSLM's textual summary creates an information bottleneck that limits downstream reasoning (Appendix B.1). We also compare injection against prompting-based alternatives such as few-shot, self-consistency, and tree-of-thought in terms of accuracy and inference latency (Appendix B.2). To assess training data requirements, we measure the sensitivity of TSLM performance to synthetic data diversity (Appendix B.3), and to examine whether the gains from injection can be replicated by stronger RL objectives alone, we evaluate DAPO, GSPO, and CISPO alongside GRPO (Appendix B.4). Qualitative case studies comparing standalone baselines with injection, and contrasting knowledge prompting with knowledge injection, are presented in Appendix F.

5 Related Work

Time-series reasoning has recently attracted growing interest. One line of work studies prompting-based structured reasoning over temporal data (Jiang et al., 2025; Liu et al., 2025d; Merrill et al., 2024; Liu et al., 2025c). Another line develops specialist models post-trained on time-series–text pairs (Kong et al., 2025; Xie et al., 2024). While these approaches show promise, the former lacks domain-specific priors for capturing key diagnostic patterns, and the latter often overfits to in-domain data and struggles with generalization. Our knowledge-injection framework aims to bridge these gaps by combining the reasoning capacity of general LLMs with domain-aligned insights from time-series specialists. Another related direction investigates interventions on the reasoning process.
Prior work has explored modifying reasoning traces or internal reasoning for improved faithfulness, safety, and instruction following (Wu et al., 2025; Arcuschin et al., 2025; Baker et al., 2025), as well as methods for controlling the length of reasoning traces to balance accuracy and efficiency (Han et al., 2024a; Aggarwal and Welleck, 2025; Lee et al., 2025). Our work differs in that we explicitly inject domain knowledge from a specialized model into a general reasoning model, with a specific focus on diagnostic reasoning over time-series data. See Appendix C for additional related work on time-series reasoning for forecasting and on time-series reasoning benchmarks.

6 Conclusion

In this paper, we introduced a knowledge injection framework that combines domain knowledge from time-series specialists with the strong reasoning ability of large general LLMs. We further proposed RL-based thinking transfer for knowledge injection, which naturally elicits analysis-first traces without supervision, enabling effective and task-aligned injection. In addition, we released SenTSR-Bench, a real-world benchmark for time-series diagnostic reasoning with multi-stage questions covering anomaly recognition, root-cause diagnosis, and corrective suggestions. Across SenTSR-Bench and public datasets, our injection framework achieves 7.9%–26.1% improvements over standalone baselines. We encourage exploration on SenTSR-Bench and further investigation of knowledge injection approaches for broader time-series diagnostic reasoning tasks.

References

Aggarwal, P. and Welleck, S. (2025). L1: Controlling how long a reasoning model thinks with reinforcement learning. arXiv preprint.

Alnegheimish, S., He, Z., Reimherr, M., Chandrayan, A., Pradhan, A., and D'Angelo, L. (2025). M2AD: Multi-sensor multi-system anomaly detection through global scoring and calibrated thresholding. In International Conference on Artificial Intelligence and Statistics, pages 4384–4392.
PMLR.

Anthropic (2025). Claude 3.7 Sonnet system card. System card, Anthropic PBC.

Arcuschin, I., Janiak, J., Krzyzanowski, R., Rajamanoharan, S., Nanda, N., and Conmy, A. (2025). Chain-of-thought reasoning in the wild is not always faithful. arXiv preprint.

Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., and Lin, J. (2025). Qwen2.5-VL technical report.

Baker, B., Huizinga, J., Gao, L., Dou, Z., Guan, M. Y., Madry, A., Zaremba, W., Pachocki, J., and Farhi, D. (2025). Monitoring reasoning models for misbehavior and the risks of promoting obfuscation. arXiv preprint arXiv:2503.11926.

Cao, Y., Fallahi, F., Dandu, M. M. K., Morishetti, L., Zhao, K., Ma, L., Subramaniam, S., Xu, J., Korpeoglu, E., Nag, K., et al. (2026). Is more context always better? Examining LLM reasoning capability for time interval prediction. arXiv preprint.

Chen, A., Li, A., Gong, B., Jiang, B., Fei, B., Yang, B., Shan, B., Yu, C., Wang, C., Zhu, C., et al. (2025a). MiniMax-M1: Scaling test-time compute efficiently with lightning attention. arXiv preprint arXiv:2506.13585.

Chen, J., Feng, A., Zhao, Z., Garza, J., Nurbek, G., Qin, C., Maatouk, A., Tassiulas, L., Gao, Y., and Ying, R. (2025b). MTBench: A multimodal time series benchmark for temporal reasoning and question answering. arXiv preprint.

Chen, M., Cui, D., Haick, H., and Tang, N. (2024a). Artificial intelligence-based medical sensors for healthcare system. Advanced Sensor Research, 3(3):2300009.

Chen, W., Hao, X., Wu, Y., and Liang, Y. (2024b). Terra: A multimodal spatio-temporal dataset spanning the earth. Advances in Neural Information Processing Systems, 37:66329–66356.

Guo, D., Yang, D., Zhang, H., Song, J., Wang, P., Zhu, Q., Xu, R., Zhang, R., Ma, S., Bi, X., et al. (2025).
DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature, 645(8081):633–638.

Han, T., Wang, Z., Fang, C., Zhao, S., Ma, S., and Chen, Z. (2024a). Token-budget-aware LLM reasoning. arXiv preprint arXiv:2412.18547.

Han, X., Zhang, Z., Wu, Y., Zhang, X., and Wu, Z. (2024b). Event traffic forecasting with sparse multimodal data. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 8855–8864.

Jaech, A., Kalai, A., Lerer, A., Richardson, A., El-Kishky, A., Low, A., Helyar, A., Madry, A., Beutel, A., Carney, A., et al. (2024). OpenAI o1 system card. arXiv preprint arXiv:2412.16720.

Jiang, Y., Yu, W., Lee, G., Song, D., Shin, K., Cheng, W., Liu, Y., and Chen, H. (2025). Explainable multi-modal time series prediction with LLM-in-the-loop. arXiv preprint.

Jin, M., Wang, S., Ma, L., Chu, Z., Zhang, J. Y., Shi, X., Chen, P.-Y., Liang, Y., Li, Y.-F., Pan, S., et al. (2023). Time-LLM: Time series forecasting by reprogramming large language models. arXiv preprint arXiv:2310.01728.

Kong, Y., Yang, Y., Hwang, Y., Du, W., Zohren, S., Wang, Z., Jin, M., and Wen, Q. (2025). Time-MQA: Time series multi-task question answering with context enhancement. arXiv preprint.

Le, H., Do, D., Nguyen, D., and Venkatesh, S. (2025). Reasoning under 1 billion: Memory-augmented reinforcement learning for large language models. arXiv preprint.

Lee, A., Che, E., and Peng, T. (2025). How well do LLMs compress their own chain-of-thought? A token complexity approach. arXiv preprint.

Leite, D., Andrade, E., Rativa, D., and Maciel, A. M. (2024). Fault detection and diagnosis in Industry 4.0: a review on challenges and opportunities. Sensors (Basel, Switzerland), 25(1):60.

Liu, C., Xu, Q., Miao, H., Yang, S., Zhang, L., Long, C., Li, Z., and Zhao, R. (2025a). TimeCMA: Towards LLM-empowered multivariate time series forecasting via cross-modality alignment.
In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 18780–18788.

Liu, H., Kamarthi, H., Zhao, Z., Xu, S., Wang, S., Wen, Q., Hartvigsen, T., Wang, F., and Prakash, B. A. (2025b). How can time series analysis benefit from multiple modalities? A survey and outlook. arXiv preprint arXiv:2503.11835.

Liu, H., Liu, C., and Prakash, B. A. (2025c). A picture is worth a thousand numbers: Enabling LLMs reason about time series via visualization. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 7486–7518.

Liu, H., Zhao, Z., Li, S., and Prakash, B. A. (2025d). Evaluating System 1 vs. 2 reasoning approaches for zero-shot time series forecasting: A benchmark and insights. arXiv preprint arXiv:2503.01895.

Liu, P., Guo, H., Dai, T., Li, N., Bao, J., Ren, X., Jiang, Y., and Xia, S.-T. (2025e). CALF: Aligning LLMs for time series forecasting via cross-modal fine-tuning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 18915–18923.

Liu, Y., Qin, G., Huang, X., Wang, J., and Long, M. (2024). AutoTimes: Autoregressive time series forecasters via large language models. Advances in Neural Information Processing Systems, 37:122154–122184.

Merrill, M., Tan, M., Gupta, V., Hartvigsen, T., and Althoff, T. (2024). Language models still struggle to zero-shot reason about time series. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 3512–3533.

Tavakoli, M., Chandra, R., Tian, F., and Bravo, C. (2025). Multi-modal deep learning for credit rating prediction using text and numerical data streams. Applied Soft Computing, 171:112771.

Wan, Z., Liu, C., Wang, X., Tao, C., Shen, H., Peng, Z., Fu, J., Arcucci, R., Yao, H., and Zhang, M. (2024).
MEIT: Multi-modal electrocardiogram instruction tuning on large language models for report generation. arXiv preprint.

Wang, X., Feng, M., Qiu, J., Gu, J., and Zhao, J. (2024). From news to forecast: Integrating event analysis in LLM-based time series forecasting with reflection. Advances in Neural Information Processing Systems, 37:58118–58153.

Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., and Zhou, D. (2022). Self-consistency improves chain of thought reasoning in language models. arXiv preprint.

Wu, T., Xiang, C., Wang, J. T., Suh, G. E., and Mittal, P. (2025). Effectively controlling reasoning models through thinking intervention. arXiv preprint.

Xie, Z., Li, Z., He, X., Xu, L., Wen, X., Zhang, T., Chen, J., Shi, R., and Pei, D. (2024). ChatTS: Aligning time series with LLMs via synthetic data for enhanced understanding and reasoning. arXiv preprint arXiv:2412.03104.

Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. (2025). Qwen3 technical report. arXiv preprint.

Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T., Cao, Y., and Narasimhan, K. (2023). Tree of Thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36:11809–11822.

Yeo, E., Tong, Y., Niu, M., Neubig, G., and Yue, X. (2025). Demystifying long chain-of-thought reasoning in LLMs. arXiv preprint.

Yu, Q., Zhang, Z., Zhu, R., Yuan, Y., Zuo, X., Yue, Y., Dai, W., Fan, T., Liu, G., Liu, L., et al. (2025). DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint.

Zhang, C., Zhang, Y., Shao, Q., Feng, J., Li, B., Lv, Y., Piao, X., and Yin, B. (2024). BjTT: A large-scale multimodal dataset for traffic prediction. IEEE Transactions on Intelligent Transportation Systems.

Zhang, J., Feng, L., Guo, X., Wu, Y., Dong, Y., and Xu, D. (2025).
TimeMaster: Training time-series multimodal LLMs to reason via reinforcement learning. arXiv preprint.

Zheng, C., Liu, S., Li, M., Chen, X.-H., Yu, B., Gao, C., Dang, K., Liu, Y., Men, R., Yang, A., et al. (2025). Group sequence policy optimization. arXiv preprint.

A Additional Technical Details

A.1 Details of GRPO Training Objective

For completeness, we provide the explicit form of the Group Relative Policy Optimization (GRPO) objective L_GRPO(θ, R(z)) (Guo et al., 2025) used in Eq. (6). Given a training context (X, q), we first sample a group of G complete sequences {z_i}_{i=1}^G ~ π_{θ_old}(· | X, q), where each z_i contains both a reasoning trace and a final answer. We then compute scalar rewards {r_i}_{i=1}^G for each sampled sequence using the composite reward function R(z). We normalize the rewards into advantages by subtracting the group mean and dividing by the standard deviation:

    Â_i = (r_i − µ_r) / σ_r,   µ_r = (1/G) ∑_{j=1}^G r_j,   σ_r = sqrt( (1/G) ∑_{j=1}^G (r_j − µ_r)² + γ ),

where γ is a small constant to ensure numerical stability. Each token z_{i,k} in sequence z_i shares the same normalized advantage Â_i, ensuring stable gradient updates across contexts. We then optimize the clipped surrogate objective with KL regularization against a frozen reference model π_ref:

    L_GRPO(θ) = (1/G) ∑_{i=1}^G (1/|z_i|) ∑_{k=1}^{|z_i|} min( ρ_{i,k} Â_i, clip(ρ_{i,k}, 1 − ϵ, 1 + ϵ) Â_i ) − β KL[ π_θ(· | X, q) ‖ π_ref(· | X, q) ],

where ρ_{i,k} = π_θ(z_{i,k} | z_{i,<k}, X, q) / π_{θ_old}(z_{i,k} | z_{i,<k}, X, q) is the token-level importance ratio, ϵ is the PPO clipping threshold, and β is the KL regularization coefficient. This objective balances three forces: (i) improving the likelihood of high-reward completions relative to the old policy, (ii) clipping updates to maintain stability, and (iii) penalizing divergence from a reference model to prevent degeneration.
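The two formulas above can be illustrated with a minimal NumPy sketch; the group size, reward values, and clipping threshold below are illustrative, not the paper's training configuration.

```python
import numpy as np

def grpo_advantages(rewards, gamma=1e-8):
    """Group-normalized advantages: A_i = (r_i - mu_r) / sigma_r,
    with sigma_r = sqrt(mean((r - mu_r)^2) + gamma) for stability."""
    r = np.asarray(rewards, dtype=float)
    mu = r.mean()
    sigma = np.sqrt(((r - mu) ** 2).mean() + gamma)
    return (r - mu) / sigma

def clipped_surrogate(ratio, advantage, eps=0.1):
    """Token-level clipped surrogate term:
    min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)."""
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage)

# A group of G = 4 sampled completions with composite rewards:
adv = grpo_advantages([1.0, 0.0, 0.5, 0.5])
# The highest-reward completion gets a positive advantage, the lowest a
# negative one, and the advantages are mean-centered within the group.
```

Note how the clip keeps large importance ratios from amplifying the update: with eps = 0.1, a ratio of 2.0 on a positive advantage contributes only 1.1 times the advantage.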
A.2 Other Injection Paradigms

Beyond early insertion, the same framework can be applied to other paradigms by changing where we place the snippet and how we shape the request. For completeness, we introduce the frameworks for intermediate and late injection here.

Intermediate Knowledge Injection. The general reasoner first drafts a partial or full reasoning trace. Although the trace is token-level in our notation, in experiments we elicit a sentence-level structure by instructing the model to reason step by step, sentence by sentence, starting from observable time-series evidence and then connecting to higher-level conclusions. Formally, we apply a deterministic segmentation operator that groups tokens into sentences, r̂_G = [s_1, s_2, ..., s_L], s_ℓ ∈ V*. Along with each sentence s_ℓ, we prompt the model to output a self-reported confidence score c_ℓ ∈ [0, 1]. In this paper we allow the reasoner to complete the draft r̂_G and then select the sentence with the lowest confidence,

    ℓ* = argmin_{ℓ ∈ {1,...,L}} Conf(s_ℓ).

Let k* be the token index at the start of s_{ℓ*}. We now shape an assistance query that asks the specialist to judge this specific statement against the time series and to provide evidence or a correction,

    q̃ = Query_assist( q, r̂_{G,<k*}, s_{ℓ*}, v_judge ),

where v_judge instructs the specialist to verify whether s_{ℓ*} is supported by the time series X. The specialist returns knowledge K_T ~ π_T(· | X, q̃). We then perform injection by rolling back to the insertion point and inserting a brief reflection cue v_reflect between the existing trace and the specialist knowledge,

    r_{Inj,≤k*} = Inject_assist( r̂_{G,<k*}, v_reflect, K_T ) = [ r̂_{G,<k*}, v_reflect, K_T ].

The reasoner then resumes generation for j ≥ k* conditioned on r_Inj and produces the final answer following Eq. (2).
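The rollback-and-splice step above can be sketched at the sentence level as follows; `specialist_fn` is a hypothetical stand-in for the specialist π_T, and the reflection cue text is illustrative.

```python
def intermediate_inject(sentences, confidences, specialist_fn,
                        v_reflect="Wait, let me check this against the series."):
    """Sketch of intermediate injection: find the lowest-confidence
    sentence l* in the draft, query the specialist about it, then roll
    back to that point and splice in [prefix, v_reflect, K_T].
    `specialist_fn` is a placeholder for pi_T(. | X, q~)."""
    l_star = min(range(len(sentences)), key=lambda i: confidences[i])
    prefix = sentences[:l_star]            # draft trace up to the insertion point
    k_t = specialist_fn(sentences[l_star])  # specialist knowledge K_T
    return prefix + [v_reflect, k_t]        # reasoner resumes after this prefix
```

In practice the spliced prefix would be handed back to the reasoner to continue generation; here the function only constructs the injected trace.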
This design targets the least certain statement in the draft and supplies focused, time-series-grounded evidence at that point.

Late Knowledge Injection. The general reasoner first produces a complete reasoning trace r̂_G. As in the intermediate setting, we elicit a sentence-by-sentence structure by prompting the model to enumerate observations from the time series before drawing conclusions. Recall that we segment the trace into sentences r̂_G = [s_1, s_2, ..., s_L], s_ℓ ∈ V*, where each s_ℓ states an observation or an intermediate claim about X. In late injection there is no confidence monitoring. Instead, we submit the entire draft to the specialist for a structured critique. We shape a critique query that includes the question, the full draft, and a critique instruction,

    q̃ = Query_critique( q, r̂_G, v_critique ),

where v_critique asks the specialist to examine each sentence s_ℓ against X, indicate whether it is supported or contradicted, explain why, and, if incorrect, provide a corrected statement with channel and time references. The specialist returns a knowledge sequence K_T ~ π_T(· | X, q̃), which we structure as a list of per-sentence judgments and corrections. We then inject after the full draft by appending a brief reflection cue followed by the specialist critique,

    r_Inj = Inject_critique( r̂_G, v_reflect, K_T ) = [ r̂_G, v_reflect, K_T ].

The reasoner performs a short refinement pass that summarizes the critique, reconciles disagreements, and updates its conclusion, then generates the final answer following Eq. (2). This late insertion supplies broad, time-series-grounded feedback on the entire draft and encourages reflection before finalization.

B Additional Experiment Results

B.1 Ablation on Reliance on TSLM Textual Summaries

We examine whether the GRLM can rely solely on the TSLM's textual summary, or whether its own direct access to the raw time series X is necessary for effective reasoning.
In our full design, both the TSLM and the GRLM receive X, and the injected summary serves as auxiliary guidance rather than the only information source. This is formalized in Eq. (3) and Algorithm 1, where the GRLM conditions on (X, q, r). The motivation is to avoid a potential failure mode where errors or omissions in the TSLM summary become a single point of failure for downstream reasoning.

To explicitly test this concern, we introduce an ablation in which the GRLM receives only the TSLM-generated textual summary, without access to the raw time series. As shown in Figure 5(a), the "Injection w/o TS" variant improves over the standalone TSLM, indicating that transferring learned knowledge from the specialist is beneficial. However, it consistently underperforms the full injection setting. On average, relying only on the textual summary yields approximately a 7% improvement, whereas full injection with direct time-series access achieves around a 17% improvement.

Figure 5: (a) Performance comparison between (i) the standalone TSLM, (ii) knowledge injection where the GRLM receives only the TSLM textual summary (Injection w/o TS), and (iii) full knowledge injection where the GRLM receives both the raw time series and the injected summary (Injection w/ TS). (b) Comparison of overall diagnostic accuracy versus inference latency for different methods.

Figure 6: (a) Performance of the TSLM versus synthetic training data diversity, measured by varying the proportion of seed-generated synthetic data used during training. (b) Comparison of reward trajectories for four RL objectives (GRPO, DAPO, GSPO, CISPO) used to train the TSLM.

The gap is most pronounced for the "What Happened" stage, where accurate perception of temporal patterns and anomalies is critical.
This ablation demonstrates that the dual-input design is essential: the GRLM does not blindly inherit the TSLM's errors, but instead combines its own perception of the time series with injected domain knowledge, leading to more robust and accurate diagnostic reasoning.

B.2 Accuracy and Latency Comparison Across Prompting-Based Alternatives and Injection

We compare our injection-based approach against several commonly used prompting alternatives, including few-shot prompting, self-consistency (Wang et al., 2022), and tree-of-thought (Yao et al., 2023). Self-consistency is implemented with three independent reasoning runs, and tree-of-thought uses three parallel branches. As shown in Figure 5(b), these methods consistently improve over the zero-shot GRLM baseline, confirming that structured prompting and sampling-based reasoning can enhance performance. However, all prompting-based approaches remain noticeably below the injection method in final accuracy, despite incurring substantially higher inference latency. This indicates that the advantage of injection stems from transferring knowledge learned by the TSLM through training, rather than from prompt-level heuristics.

B.3 Sensitivity of TSLM Performance to Synthetic Data Diversity

To examine the sensitivity of the TSLM to the quality and diversity of synthetic training data, we conduct an ablation where the model is trained with increasing proportions of seed-generated synthetic data, ranging from no synthetic data to the full dataset. As shown in Figure 6(a), training without synthetic data yields performance close to random guessing across all subtasks, indicating that training is essential for establishing basic time-series diagnostic reasoning ability in small models.
Once training is introduced, performance improves rapidly: using roughly 50% of the seed-generated data already recovers the majority of the final performance, while increasing diversity beyond 75% yields only marginal additional gains. This trend is consistent across subtasks, with slightly stronger saturation effects for higher-level reasoning tasks (How Happened and Suggested Fix).

B.4 Reward Convergence under Different RL Optimization Methods

Motivated by recent work on efficient R1-style fine-tuning (Le et al., 2025; Yeo et al., 2025), we further examine whether more advanced RL objectives can improve TSLM training in our setting. We evaluate three representative RL objectives, DAPO (Yu et al., 2025), GSPO (Zheng et al., 2025), and CISPO (Chen et al., 2025a), alongside our GRPO baseline. As shown in Figure 6(b), methods like DAPO yield faster and smoother reward convergence. At the same time, we observe that the final reward achieved remains similar across objectives. This suggests that, beyond convergence efficiency, overall performance is primarily constrained by the available supervision and the capacity of the model rather than by the specific RL method.

C Additional Related Work

Multi-modal Time-Series Forecasting Models. Several recent lines of research explore multimodal solutions for time-series analysis that incorporate information from textual data (Liu et al., 2025b). Examples include augmenting series with domain-relevant text (Jin et al., 2023; Liu et al., 2025a; 2024; 2025e), aligning physiological signals with clinical notes (Wan et al., 2024), linking stock trends with news (Wang et al., 2024; Tavakoli et al., 2025), and incorporating geographic context (Chen et al., 2024b), traffic data (Zhang et al., 2024), or external events (Han et al., 2024b) for traffic-flow modeling.
However, these works mostly focus on forecasting tasks rather than multi-modal understanding and diagnostic tasks.

Time-Series Reasoning Models and Benchmarks. Time-series reasoning has recently drawn growing interest as research moves from prediction toward explanation and diagnosis. Several works explore prompting-based reasoning over temporal data (Jiang et al., 2025; Liu et al., 2025d; Merrill et al., 2024). VLTime (Liu et al., 2025c) represents time series as visual plots and queries multimodal models such as GPT-4o for zero- or few-shot interpretation, while TimeMQA (Kong et al., 2025) formulates question answering tasks using multiple-choice reasoning. Both works introduce accompanying benchmarks, TimerBench and TimeMQA, which are derived from forecasting, classification, or anomaly detection datasets rather than from diagnostic annotations. Recent datasets such as TS&Language (Merrill et al., 2024) and TSEvol (Xie et al., 2024) extend the setting to textual question–answer tasks, but their explanations are automatically generated by large language models and lack verified diagnostic grounding. Our SenTSR-Bench benchmark differs in two key aspects: it provides human-verified diagnostic annotations and introduces a multi-stage problem structure that progresses from identifying anomalies to inferring root causes and suggesting fixes, reflecting the reasoning depth required in real-world maintenance scenarios. Concurrently, Cao et al. (2026) provide one of the first systematic investigations of LLMs on structured temporal reasoning, focusing on time interval prediction. Their findings reveal that LLMs outperform lightweight statistical baselines yet consistently underperform dedicated machine learning models, and that incorporating additional context does not always improve, and can even degrade, prediction quality.
This formal characterization of LLM temporal reasoning capabilities lays important groundwork for the broader time-series reasoning direction, and extending such analysis beyond interval prediction to diagnostic reasoning remains an interesting future direction.

D Dataset Details

D.1 Public Datasets

We evaluate our framework on two public benchmarks, TSEvol and TS&Language. TSEvol (Xie et al., 2024) consists of multiple subdatasets, among which we specifically use Dataset A, as it contains real-world time series collected from diverse domains such as AIOps, meteorology, the Numenta Anomaly Benchmark (NAB), and Oracle system metrics. The time series in Dataset A are manually annotated to mark key temporal behaviors, while the contextual prompts and root-cause options are generated automatically by an LLM. The dataset includes 525 questions spanning three reasoning categories: (i) inductive reasoning, summarizing the physical semantics of univariate or multivariate series; (ii) deductive reasoning, verifying temporal conditions; and (iii) causal reasoning, selecting the most plausible cause under a given textual context. TS&Language (MCQ2) (Merrill et al., 2024) is an open-source dataset designed for relational and comparison reasoning between two time series under textual context. The time series, questions, and answers are automatically generated by large language models. Following Xie et al. (2024), we focus on its diagnostic-style multiple-choice subset and exclude etiological reasoning and forecasting components that are not aligned with our evaluation objectives, randomly sampling 100 representative questions.

D.2 SenTSR-Bench

SenTSR-Bench is a de-identified real-world diagnostic reasoning benchmark derived from industrial sensor systems.
It contains 110 multivariate time series paired with 330 human-verified diagnostic questions, spanning three progressive reasoning stages: (i) what happened, identifying anomalous signals and temporal patterns; (ii) how it happened, inferring plausible root causes behind the observed behavior; and (iii) suggested fix, proposing potential corrective actions. The dataset captures realistic multivariate temporal reasoning complexity, from signal interpretation to causal and prescriptive reasoning.

D.2.1 Evaluation Dataset Curation

The construction of SenTSR-Bench proceeds in three stages:

Stage 1: Signal selection and preprocessing. We start from a large pool of approximately 2,000 multivariate sensor time series collected from real monitoring systems. From these, we identify 110 streams that display clear anomalous behaviors such as persistent deviations, sharp drops or spikes, or sudden shifts in periodicity. Each selected stream is associated with a downstream troubleshooting event in real practice, ensuring the anomalies are tied to actionable diagnostic contexts. We then apply preprocessing to standardize sampling frequency, normalize scales across sensor channels, and fully de-identify the signals by removing all system identifiers and metadata that could reveal sensitive operational information.

Stage 2: Human annotation pipeline. We develop a de-identified annotation pipeline that preserves the realism of paired textual data while protecting privacy. Human experts annotate the selected anomalous windows with concise descriptions of the observed pattern, plausible root causes, and candidate corrective actions. To prevent leakage of proprietary context, annotators are provided only with sanitized time-series segments and high-level machine categories. The resulting annotations capture domain-relevant diagnostic reasoning in natural language while guaranteeing de-identification.

Stage 3: Construction of evaluation queries.
To enable systematic benchmarking, we cluster the curated time series into families of similar anomaly types (e.g., belt-failure-like patterns vs. thermal-runaway patterns). From these families we generate multiple-choice questions that follow a multi-stage structure. Each query involves (i) identifying the anomalous segment, (ii) inferring its root cause, and (iii) suggesting a corrective action. Ground-truth answers are paired with distractors sampled from other clusters, ensuring that solving the task requires both correct recognition and reasoning rather than memorization. This multi-stage curation yields SenTSR-Bench as a realistic and challenging benchmark, with human-authored annotations grounded in real sensor signals and a design that emphasizes both diagnostic depth and privacy protection.

D.2.2 Training Dataset Generation

Building training data at scale for diagnostic reasoning is especially challenging in the multivariate sensor setting: real-world signals are scarce, and their complexity makes direct augmentation difficult. We therefore propose a two-stage pipeline that leverages vision–language models (VLMs) to bootstrap realistic simulators from a small set of seeds.

Stage 1: Iterative code synthesis. We begin with 23 standardized and de-identified multivariate time series, each containing channels such as vibration (acceleration, velocity) and temperature. Each seed is plotted and presented to a VLM together with high-level context prompts (e.g., "write Python code that simulates similar behavior with interpretable dynamics"). The VLM outputs candidate simulation code, which we execute to generate synthetic traces. If the output contains runtime errors or fails to reproduce core dynamics of the seed (e.g., anomaly shape, periodic structure), we refine the prompt and re-run. This iterative prompt–code–simulate cycle continues until the simulator consistently reproduces the desired behaviors.
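For intuition, a VLM-synthesized simulator of this kind might look like the toy sketch below. This is an invented illustration, not the actual generated code: the channel names, latent-state dynamics, and anomaly shape are all assumptions chosen to mirror the description (periodic vibration, drifting temperature, randomized thermal-runaway-style event).

```python
import numpy as np

def simulate_sensor_stream(n=500, anomaly_start=300, seed=None):
    """Toy stochastic generator: a periodic vibration channel and a
    slowly drifting temperature channel, with a growing anomaly
    (rising vibration amplitude and temperature ramp) injected at
    anomaly_start. All parameters are illustrative."""
    rng = np.random.default_rng(seed)
    t = np.arange(n)
    # Latent operating state: small random-walk drift.
    drift = np.cumsum(rng.normal(0.0, 0.01, n))
    # Randomized period draw makes each run a variant, not a replica.
    vibration = np.sin(2 * np.pi * t / rng.uniform(20, 40)) + rng.normal(0, 0.1, n)
    temperature = 60.0 + drift + rng.normal(0, 0.2, n)
    # Inject the anomaly after anomaly_start.
    ramp = np.clip(t - anomaly_start, 0, None).astype(float)
    vibration += 0.02 * ramp * rng.normal(1.0, 0.1, n)
    temperature += 0.05 * ramp
    return {"vibration": vibration, "temperature": temperature}
```

Each call with a fresh seed yields a different but statistically similar stream, which is exactly the property the diversification stage aims for.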
The outcome is a library of seed-aligned simulators.

Stage 2: Diversification and simplification. To scale up diversity, we prompt an LLM to transform each simulator into a stochastic generator. Deterministic heuristics are replaced with latent-state dynamics and randomized parameter draws (e.g., varying noise levels, decay rates, or event frequencies). This produces a family of realistic series rather than exact replicas. We further refactor the simulators into compact, modular forms so they can be easily reused and extended. The diversified generators collectively produce a large corpus of synthetic signals that retain the statistical and structural properties of the real seeds while covering a wider variety of operating conditions. Finally, we apply the same query-construction pipeline as in evaluation: anomalous segments from synthetic series are paired with diagnostic labels to form QA and MCQ items. This ensures consistency between training and evaluation, while enabling large-scale supervised training from only a handful of seed signals.

E Implementation Details

E.1 Implementation Details: Reasoning Model Baselines

We evaluate standard reasoning baselines under both zero-shot and few-shot prompting. All models are accessed through an OpenAI-compatible server implemented with vLLM, using HuggingFace checkpoints as backends. Unless otherwise noted, reasoning traces are obtained in a zero-shot setting, while few-shot experiments prepend a small set of curated exemplars. For few-shot prompting on the SenTSR-Bench benchmark, we provide 3 randomly sampled demonstrations in an in-context learning format, inserted as prior user–assistant interactions. Each demonstration contains either the time-series image or JSON text paired with its ground-truth answer. For Qwen3 and DeepSeek R1, the encoded time series far exceeds the context length.
In such cases, we include only the question and answer template in the demonstrations, omitting the full time-series input.

Encoding time series for LLM input. For image encoding, we render multivariate time series as stacked line plots using matplotlib. Each channel is placed in a vertically aligned subplot with labeled axes and channel identifiers, following best practices for visual clarity. Detailed plotting functions are provided in the released source code. For text encoding, we convert each channel into a structured JSON-like format. The following template illustrates the format used to render time-series data into textual tokens for inclusion in prompts:

{
  "Series 1": [0.25, 0.31, 0.28, ...],
  "Series 2": [1.02, 1.13, 0.95, ...],
  "Series 3": [-0.42, -0.38, -0.41, ...]
}

This structured form facilitates tokenization and preserves the alignment of values across channels. When column names are available, they are preserved; otherwise, generic names are assigned.

Long-context adaptation. For certain benchmarks such as TSEvol, multivariate time-series inputs can exceed 50k tokens when encoded as text. To accommodate these cases, we apply RoPE scaling to extend the context length of open-source models such as Qwen3 and DeepSeek R1, ensuring that the full series can be processed without truncation. This scaling is necessary for faithfully grounding reasoning in long multivariate signals.

Infrastructure. All open-source models are hosted on AWS EC2 instances equipped with 8 × A100 GPUs, served through vLLM. Closed-source reasoning models are accessed via AWS Bedrock.

E.2 TSLM Post-training

All time-series specialists (TSLMs) are initialized from the public Qwen-VL-3B-Instruct checkpoint. Post-training is carried out in two stages: supervised fine-tuning (SFT) and reinforcement learning (RL) with verifiable rewards.
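For the RL stage with verifiable rewards, the group-relative advantage at the heart of GRPO-style training can be sketched as follows. This is a minimal sketch of the advantage estimator only; it omits the KL penalty and clipping used in actual training.

```python
import numpy as np

def group_relative_advantages(rewards):
    """Group-relative advantage normalization in the spirit of GRPO:
    for G completions sampled per prompt, each completion's advantage is
    its verifiable reward standardized against the group mean and std,
    removing the need for a learned value baseline. The epsilon guard
    is an implementation assumption."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: binary verifiable reward (answer correct or not) for a group of G = 8
adv = group_relative_advantages([1, 0, 0, 1, 1, 0, 0, 0])
```

Correct completions receive positive advantages and incorrect ones negative, so the policy gradient pushes probability mass toward completions whose final answers verify.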
For the public benchmarks TSEvol and TS&Language, we fine-tune on 3k causal reasoning tasks from the TSEvol SFT set, restricting training to causal tasks to test cross-task generalization to inductive, deductive, and MCQ-2 tasks at evaluation. For the SenTSR-Bench benchmark, SFT data is constructed from the curated What happened and How happened stages, leaving the Suggested fix stage unseen for out-of-distribution evaluation. SFT training uses a cutoff length of 4,096 tokens, a per-device batch size of 4 with gradient accumulation of 2 (effective batch size 64 on 8 GPUs), a learning rate of 1 × 10^-5, and cosine decay scheduling with warmup ratio 0.1.

For reinforcement learning, we adopt Group Relative Policy Optimization (GRPO) to elicit analysis-first completions without explicit thinking supervision. For TSEvol, RL training again focuses on causal tasks, and for SenTSR-Bench we apply it to the What happened and How happened datasets. RL training is configured with KL divergence coefficient β = 0.001, group size G = 8, maximum sequence length L_max = 512, PPO clipping threshold 0.1, an effective batch size of 16 on 8 GPUs, and a learning rate of 1 × 10^-6.

E.3 Practical Implementation of Knowledge Injection

The injection workflow is straightforward to implement with standard LLM APIs. For models and servers that support assistant prefill (e.g., OpenAI-compatible endpoints), we directly seed the private trace by pre-inserting [⟨think⟩, r^{Inj}_{≤k}] as the assistant's initial tokens; the general reasoner π_G then continues generation conditioned on this prefix. For providers that do not expose editable thinking buffers (as in some closed-source reasoning models), we use an instructional proxy: wrap r^{Inj}_{≤k} inside the model's recommended "thinking template" tags in the user/system message (e.g., a documented ... block) and instruct the model to begin its thinking process with the instructed template.
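Both routes reduce to plain message construction against a chat API; a sketch follows, in which the tag names, the instruction wording, and the helper itself are illustrative assumptions rather than the released implementation.

```python
def build_injection_messages(question, injected_trace, supports_prefill=True):
    """Construct chat messages for the two injection routes described in
    E.3. With assistant prefill (OpenAI-compatible servers), the trace is
    seeded as the assistant's initial tokens inside a think block and the
    general reasoner continues from it. Without prefill, an instructional
    proxy wraps the trace in the thinking-template tags inside the user
    message. Tag names and wording are illustrative assumptions."""
    if supports_prefill:
        return [
            {"role": "user", "content": question},
            # Pre-inserted assistant prefix: the server continues
            # generation conditioned on this partial reply.
            {"role": "assistant", "content": "<think>\n" + injected_trace},
        ]
    proxy = (f"{question}\n\n<think>\n{injected_trace}\n</think>\n"
             "Begin your thinking process with the template above.")
    return [{"role": "user", "content": proxy}]

msgs = build_injection_messages(
    "What caused the vibration spike?",
    "Observation 1: vibration rises sharply at t=120.")
```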
In practice this proxy reliably steers the internal reasoning trace and reproduces the effect of in-chain injection.

E.4 Prompt Design for Injection Strategies

We provide the prompt templates used in experiments for evaluating different knowledge injection positions. These prompts are designed for a strong instruction-following TSLM (ChatTS-14B) paired with general reasoning LLMs (GRLMs). The goal is to examine how injecting time-series knowledge at different points in the reasoning process, including early, intermediate, and late injection, affects overall reasoning performance.

Early Injection. In the early injection setup, the TSLM first produces structured, quantitative observations from the time series, which are inserted at the start of the GRLM's reasoning process to guide the subsequent chain of thought.

TSLM (Observation Generation)

You are analyzing a time series to extract key quantitative observations that help answer the question. Provide detailed, objective numerical observations by following these guidelines:
1. Make numbered, precise observations about the quantitative aspects of the time series.
2. Be specific about values, positions, and magnitudes when describing features.
3. Begin each observation with "Observation 1:", "Observation 2:", etc.
Start your response with: "To answer this question, I need to carefully analyze the time series. Here are my observations: Observation 1... Observation 2..."

GRLM (Reasoning with Early Injection)

[TSLM Observations] Wait, let me summarize and reflect on the previous observations from the time series, and then continue my reasoning process to derive the final answer...

Intermediate Injection. In this setting, the GRLM begins reasoning but calls for the TSLM's input when encountering uncertainty (identified as a low-confidence step). The TSLM then provides clarifications, which are integrated back into the GRLM's ongoing reasoning.
TSLM (Intermediate Feedback)

You are assisting with a time series reasoning process. Here is the question and the current partial reasoning: [Partial Thought / Low-Confidence Segment]
Please analyze whether this reasoning point is correct based on the time series, why or why not, and how it relates to answering the question. Be specific about numerical values and time positions.

GRLM (Reasoning with Intermediate Injection)

[Partial Reasoning] I am uncertain about this point: [Low-Confidence Segment]. Let me reconsider it with the following clarification from the time-series model: "[TSLM Feedback]" ...

Late Injection. Under late injection, the GRLM first completes its reasoning trace. The TSLM then reviews the reasoning for factual consistency with the time series, and the GRLM revises its conclusion accordingly.

TSLM (Late Review)

You are reviewing a completed reasoning process to check whether its quantitative claims about the time series are correct. For each observation, discuss whether it is correct or incorrect. If incorrect, provide the accurate interpretation of what the data shows.
Example: "For Observation 1, after review, it is incorrect. In fact... For Observation 2..."

GRLM (Revision with Late Injection)

[Original Reasoning Trace] Wait, let me reexamine my previous reasoning based on the review below: [TSLM Review] ...

Additional Adaptations. For the RL-honed TSLM, we apply early injection by directly using the model's self-generated reasoning trace from R1-style GRPO training as the injected knowledge, without explicit prompting; for closed-source models (e.g., Claude-3.7), injected content is wrapped in ...
delimiters, followed by an instruction such as "continue the thinking process above."; for the prompting-based baseline, the same TSLM-generated content is provided externally as additional context:

Prompting Baseline

Here is an analysis from a time-series model that is good at time-series analysis but may be limited in general reasoning: [TSLM Observations].

F Additional Case Study

We present two qualitative case studies that illustrate the benefits of our knowledge injection framework. Figure 7 compares the standalone TSLM, the standalone GRLM, and our injection-based method on a diagnostic reasoning example. The TSLM correctly detects rising vibration and stable temperature but hallucinates a joint increase in both signals, yielding an incorrect diagnosis. The GRLM similarly misreads the series, assuming a late temperature rise. By injecting the TSLM's accurate signal-level observations into the GRLM's reasoning trace, our method corrects the reasoning flaw while preserving domain-grounded pattern recognition, producing the correct final diagnosis. Figure 8 further contrasts knowledge prompting with knowledge injection. When the TSLM analysis is provided as an external prompt, the GRLM reasons largely in isolation, leading to insufficient use of domain knowledge. In contrast, the injection-based approach integrates the TSLM's analysis directly into the reasoning flow, enabling joint exploration and progressive narrowing of hypotheses, and resulting in a correct diagnosis.

Figure 7: Case study on knowledge injection versus standalone baselines. The TSLM correctly detects rising vibration and stable temperature but hallucinates a joint increase in both, yielding an incorrect diagnosis. The GRLM similarly misreads the series, assuming a late temperature rise. Our method leverages the TSLM's accurate signal interpretation while correcting its reasoning flaw, producing the correct final diagnosis.
Figure 8: Case study on knowledge prompting versus knowledge injection. In the prompting-based approach, the GRLM reasons largely in isolation, referencing the TSLM's analysis only at the end for validation, leading to partial use of domain knowledge. In contrast, the injection-based approach integrates the TSLM's discussion directly into the reasoning flow, enabling joint exploration and narrowing of hypotheses, and resulting in a correct, well-grounded diagnosis.
