Title: Reasoning Relay: Evaluating Stability and Interchangeability of Large Language Models in Mathematical Reasoning
ArXiv ID: 2512.20647
Date: 2025-12-16
Authors: Leo Lu, Jonathan Zhang, Sean Chua, Spencer Kim, Kevin Zhu, Sean O’Brien, Vasu Sharma
📝 Abstract
Chain-of-Thought (CoT) prompting has significantly advanced the reasoning capabilities of large language models (LLMs). While prior work focuses on improving model performance through internal reasoning strategies, little is known about the interchangeability of reasoning across different models. In this work, we explore whether a partially completed reasoning chain from one model can be reliably continued by another model, either within the same model family or across families. We achieve this by assessing the sufficiency of intermediate reasoning traces as transferable scaffolds for logical coherence and final answer accuracy. We interpret this interchangeability as a means of examining inference-time trustworthiness, probing whether reasoning remains both coherent and reliable under model substitution. Using token-level log-probability thresholds to truncate reasoning at early, mid, and late stages from our baseline models, Gemma-3-4B-IT and LLaMA-3.1-70B-Instruct, we conduct continuation experiments with Gemma-3-1B-IT and LLaMA-3.1-8B-Instruct to test intra-family and cross-family behaviors. Our evaluation pipeline leverages truncation thresholds with a Process Reward Model (PRM), providing a reproducible framework for assessing reasoning stability via model interchange. Evaluations with a PRM reveal that hybrid reasoning chains often preserve, and in some cases even improve, final accuracy and logical structure. Our findings point towards interchangeability as an emerging behavioral property of reasoning models, offering insights into new paradigms for reliable modular reasoning in collaborative AI systems.
📄 Full Content
Reasoning Relay: Evaluating Stability and Interchangeability of Large Language Models in Mathematical Reasoning

Leo Lu∗ (Pennsylvania State University), lbl5561@psu.edu
Jonathan Zhang∗ (Binghamton University), jzhang78@binghamton.edu
Sean Chua∗ (University of Toronto), seaneugene.chua@mail.utoronto.ca
Spencer Kim (UC Berkeley), spencer_kim@berkeley.edu
Kevin Zhu†‡ (Algoverse), kevin@algoverse.us
Sean O’Brien‡ (Algoverse), 2000.seano@gmail.com
Vasu Sharma‡ (Algoverse), sharma.vasu55@gmail.com

∗Equal Contribution  †Corresponding Author  ‡Senior Author

39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Socially Responsible and Trustworthy Foundation Models (ResponsibleFM).
Abstract
Chain-of-Thought (CoT) prompting has significantly advanced the reasoning capabilities of large language models (LLMs). While prior work focuses on improving model performance through internal reasoning strategies, little is known about the interchangeability of reasoning across different models. In this work, we explore whether a partially completed reasoning chain from one model can be reliably continued by another model, either within the same model family or across families. We achieve this by assessing the sufficiency of intermediate reasoning traces as transferable scaffolds for logical coherence and final answer accuracy. We interpret this interchangeability as a means of examining inference-time trustworthiness, probing whether reasoning remains both coherent and reliable under model substitution. Using token-level log-probability thresholds to truncate reasoning at early, mid, and late stages from our baseline models, Gemma-3-4B-IT and LLaMA-3.1-70B-Instruct, we conduct continuation experiments with Gemma-3-1B-IT and LLaMA-3.1-8B-Instruct to test intra-family and cross-family behaviors. Our evaluation pipeline leverages truncation thresholds with a Process Reward Model (PRM), providing a reproducible framework for assessing reasoning stability via model interchange. Evaluations with a PRM reveal that hybrid reasoning chains often preserve, and in some cases even improve, final accuracy and logical structure. Our findings point towards interchangeability as an emerging behavioral property of reasoning models, offering insights into new paradigms for reliable modular reasoning in collaborative AI systems.
1 Introduction
Chain-of-Thought (CoT) prompting emerged as a powerful mechanism for improving the reasoning
capabilities of large language models (LLMs) by encouraging intermediate, structured reasoning
steps before arriving at a final answer [Wei et al., 2023]. Previous work has explored how CoTs
improve individual model performance even in zero-shot settings [Kojima et al., 2023, Zhang et al.,
2022, Jin et al., 2024]. More recently, Hebenstreit et al. [2024] examined the transferability of entire
CoT sequences by evaluating whether rationale prompts discovered on one model could generalize
reasoning strategies across a range of models and tasks. However, it remains unclear to what extent
reasoning trajectories are interchangeable when only partially reused. In light of this, our aim is
to answer the central research question: To what extent can the modular decomposition of complex
mathematical reasoning tasks enhance the zero-shot performance and interpretability of Large
Language Models, when utilizing a collaborative framework that includes both intra-family and
cross-family LLMs?
In this work, we investigate process-level interchangeability in language model reasoning by
evaluating how well one model can continue another model's CoT midstream. We begin with
full CoT traces generated by a strong base model (e.g., Gemma-3-4B-IT and LLaMA-3.1-70B-
Instruct), recording token-level log-probabilities to guide strategic truncation at 25%, 50%, and
75% of the cumulative log-probability, capturing early, mid, and late stages of reasoning based on
informativeness. From these truncation points, alternative models (including those from different
families or architectures) are tasked with continuing the reasoning process using only the truncated
intermediate steps as input. We then assess not only accuracy but also the coherence, semantic
alignment, and logical consistency of the full reasoning chain, using a Process Reward Model
(PRM) trained to evaluate multi-step mathematical reasoning. Ultimately, our aim is to
characterize how transfer stability depends on the truncation point, the model pairing, and the
reasoning domain, yielding clearer insight into the dynamics of CoT continuation success beyond
final-answer accuracy.
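As a concrete illustration of this pipeline, the sketch below shows one way to implement the cumulative log-probability truncation and the cross-model handoff with Hugging Face transformers. It is a minimal sketch under our own assumptions, not the paper's released code: the helper names, the prompt template, and the token-alignment conventions are illustrative, and the PRM scoring step is omitted.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def per_token_logprobs(model, tokenizer, text):
    """Log-probability the base model assigns to each token of `text` (teacher forcing)."""
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**enc).logits                      # [1, T, vocab]
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)  # positions predicting tokens 1..T-1
    targets = enc.input_ids[:, 1:]
    return logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)[0]  # shape [T-1]


def truncate_by_cumulative_logprob(tokenizer, text, logps, fraction):
    """Cut the CoT where the running sum of |log p| first reaches `fraction`
    (e.g. 0.25, 0.50, 0.75) of the chain's total cumulative log-probability mass."""
    cum = torch.cumsum(logps.abs(), dim=0)
    cut = int((cum >= fraction * cum[-1]).nonzero()[0].item())
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    # logps[i] scores token ids[i + 1], so keep everything up to and including it.
    return tokenizer.decode(ids[: cut + 2], skip_special_tokens=True)


def continue_reasoning(model, tokenizer, problem, partial_cot, max_new_tokens=512):
    """Hand the truncated chain to a different (continuation) model and let it finish."""
    prompt = (f"Problem: {problem}\n"
              f"Partial reasoning so far: {partial_cot}\n"
              f"Continue the reasoning step by step and state the final answer.")
    enc = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**enc, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(out[0, enc.input_ids.shape[1]:], skip_special_tokens=True)
```

With these helpers, a trace from a base model can be truncated at 25%, 50%, or 75% of its cumulative log-probability mass and passed to a different continuation model by loading both model/tokenizer pairs and calling the functions in sequence; the resulting hybrid chain is then scored by the PRM.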
Whereas prior work has explored how CoT prompting improves reasoning within individual models
[Wei et al., 2023], whether reasoning can be interchanged across models mid-process remains largely
unexamined.
We provide compelling early evidence that such a handoff is often successful within the same model
family. We show that