Reasoning Relay: Evaluating Stability and Interchangeability of Large Language Models in Mathematical Reasoning

Reading time: 5 minutes
...

📝 Original Info

  • Title: Reasoning Relay: Evaluating Stability and Interchangeability of Large Language Models in Mathematical Reasoning
  • ArXiv ID: 2512.20647
  • Date: 2025-12-16
  • Authors: Leo Lu, Jonathan Zhang, Sean Chua, Spencer Kim, Kevin Zhu, Sean O’Brien, Vasu Sharma

📝 Abstract

Chain-of-Thought (CoT) prompting has significantly advanced the reasoning capabilities of large language models (LLMs). While prior work focuses on improving model performance through internal reasoning strategies, little is known about the interchangeability of reasoning across different models. In this work, we explore whether a partially completed reasoning chain from one model can be reliably continued by another model, either within the same model family or across families. We achieve this by assessing the sufficiency of intermediate reasoning traces as transferable scaffolds for logical coherence and final answer accuracy. We interpret this interchangeability as a means of examining inference-time trustworthiness, probing whether reasoning remains both coherent and reliable under model substitution. Using token-level log-probability thresholds to truncate reasoning at early, mid, and late stages from our baseline models, Gemma-3-4B-IT and LLaMA-3.1-70B-Instruct, we conduct continuation experiments with Gemma-3-1B-IT and LLaMA-3.1-8B-Instruct to test intra-family and cross-family behaviors. Our evaluation pipeline leverages truncation thresholds with a Process Reward Model (PRM), providing a reproducible framework for assessing reasoning stability via model interchange. Evaluations with a PRM reveal that hybrid reasoning chains often preserve, and in some cases even improve, final accuracy and logical structure. Our findings point towards interchangeability as an emerging behavioral property of reasoning models, offering insights into new paradigms for reliable modular reasoning in collaborative AI systems.
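To make the truncation mechanism concrete, here is a minimal sketch of log-probability-guided truncation, assuming one plausible reading of "cumulative log-probability" as accumulated absolute per-token log-probability mass. The function name and toy values below are illustrative, not the authors' implementation; the 25%, 50%, and 75% cut points correspond to the early, mid, and late handoff stages described in the paper.

```python
# Illustrative sketch (not the paper's code) of cumulative log-probability truncation.
def truncate_by_cumulative_logprob(tokens, logprobs, fraction):
    """Return the token prefix covering `fraction` of the trace's total |log-prob| mass."""
    assert len(tokens) == len(logprobs) and 0.0 < fraction <= 1.0
    total = sum(abs(lp) for lp in logprobs)      # total "mass" of the full chain
    target, running = fraction * total, 0.0
    for i, lp in enumerate(logprobs, start=1):
        running += abs(lp)
        if running >= target:                    # 0.25 / 0.50 / 0.75 -> early / mid / late cut
            return tokens[:i]
    return tokens

# Toy example; a real run would use the baseline model's own tokens and per-token log-probabilities.
toy_tokens   = ["Step", "1:", "compute", "2+2", "=", "4", ".", "So", "the", "answer", "is", "4"]
toy_logprobs = [-0.2, -0.1, -0.9, -1.5, -0.3, -0.4, -0.1, -0.5, -0.2, -0.3, -0.1, -0.2]
early_prefix = truncate_by_cumulative_logprob(toy_tokens, toy_logprobs, 0.25)
```

The prefixes produced at these cut points are what get handed to a different model to finish, which is the continuation step evaluated in the rest of the paper.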

💡 Deep Analysis

Figure 1

📄 Full Content

Reasoning Relay: Evaluating Stability and Interchangeability of Large Language Models in Mathematical Reasoning

Leo Lu (Pennsylvania State University), Jonathan Zhang (Binghamton University), Sean Chua (University of Toronto), Spencer Kim (UC Berkeley), Kevin Zhu (Algoverse), Sean O’Brien (Algoverse), Vasu Sharma (Algoverse)

39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Socially Responsible and Trustworthy Foundation Models (ResponsibleFM)

1 Introduction

Chain-of-Thought (CoT) prompting emerged as a powerful mechanism for improving the reasoning capabilities of large language models (LLMs) by encouraging structured intermediate reasoning steps before arriving at a final answer [Wei et al., 2023]. Previous work has explored how CoT improves individual model performance even in zero-shot settings [Kojima et al., 2023, Zhang et al., 2022, Jin et al., 2024]. More recently, Hebenstreit et al. [2024] examined the transferability of entire CoT sequences by evaluating whether rationale prompts discovered on one model could generalize reasoning strategies across a range of models and tasks. However, it remains unclear to what extent reasoning trajectories are interchangeable when only partially reused. In light of this, we aim to answer the central research question: to what extent can the modular decomposition of complex mathematical reasoning tasks enhance the zero-shot performance and interpretability of large language models, when utilizing a collaborative framework that includes both intra-family and cross-family LLMs?
In this work, we investigate process-level interchangeability in language model reasoning by evaluating how well different models can continue another model’s CoT midstream. We begin with full CoT traces generated by a strong base model (e.g., Gemma-3-4B-IT and LLaMA-3.1-70B-Instruct), recording token-level log-probabilities to guide strategic truncation at 25%, 50%, and 75% of the cumulative log-probability, capturing early, mid, and late stages of reasoning based on informativeness. From these truncation points, alternative models (including those from different families or architectures) are tasked with continuing the reasoning process using only the truncated intermediate steps as input. We then assess not only accuracy but also the coherence, semantic alignment, and logical consistency of the full reasoning chain, using a Process Reward Model (PRM) trained to evaluate multi-step mathematical reasoning. Ultimately, our aim is to characterize how the stability of transfer depends on the truncation point, model pairing, and reasoning domain, yielding a clearer picture of the dynamics of CoT continuation success that goes beyond final-answer accuracy.

Whereas prior work has explored how CoT prompting improves reasoning within individual models [Wei et al., 2023], whether reasoning can be interchanged across models mid-process remains largely unexamined. We provide compelling early evidence that such a handoff is often successful within the same model family.
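Below is a sketch of the continuation ("relay") step, assuming a HuggingFace-style causal LM as the continuation model. The model ID, prompt wording, and decoding settings are illustrative assumptions rather than the paper's exact setup; the completed hybrid chain would then be scored with the PRM and checked for final-answer accuracy as described above.

```python
# Hypothetical sketch of the cross-model relay: a continuation model picks up a
# truncated chain of thought produced by a baseline model. Model ID, prompt
# template, and generation settings are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

CONTINUATION_MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # assumed HF model ID

tokenizer = AutoTokenizer.from_pretrained(CONTINUATION_MODEL)
model = AutoModelForCausalLM.from_pretrained(
    CONTINUATION_MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

def continue_reasoning(question: str, partial_cot: str, max_new_tokens: int = 512) -> str:
    """Ask the continuation model to finish another model's partial reasoning chain."""
    messages = [
        {"role": "user", "content": (
            f"{question}\n\nHere is a partial solution. Continue the reasoning "
            f"from exactly where it stops and give the final answer.\n\n{partial_cot}"
        )},
    ]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(
        inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,               # greedy decoding keeps the comparison deterministic
        pad_token_id=tokenizer.eos_token_id,
    )
    # Decode only the newly generated continuation, not the prompt.
    return tokenizer.decode(output[0, inputs.shape[-1]:], skip_special_tokens=True)
```

Greedy decoding is used here so that differences between hybrid chains reflect the handoff point and model pairing rather than sampling noise; the actual decoding configuration is an assumption in this sketch.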


Reference

This content is AI-processed based on open access ArXiv data.
