
Original Info

  • ArXiv ID: 2512.13102

Abstract

Large Language Models (LLMs) are usually used to answer questions, but many high-stakes applications (e.g., tutoring, clinical support) require the complementary skill of asking questions: detecting missing information, requesting clarifications, and using them to solve tasks. We study this skill in reasoning-heavy domains where progress depends on inquiry rather than factual recall. We define an interactive protocol in which a student model engages a stronger teacher under a small turn budget. After each teacher reply, we evaluate the student on the original task with Pass@k. We propose the Outcome-Driven Question optimization Strategy (ODQS), a training framework that learns a questioning policy from downstream task outcomes. At each turn, we sample multiple candidate questions, query the teacher with each, and score the student's resulting performance. Using these scores, we train the student via supervised fine-tuning followed by Direct Preference Optimization (DPO), without any human labels. On GSM8K, HumanEval, and OpenCoder, ODQS produces large gains over interactive baselines, boosting Pass@5 by up to 54.7% (absolute) on math and 22.9% (absolute) on coding, and matching baseline performance in three fewer turns. Thus, question asking can be explicitly trained from task outcomes, improving both accuracy and efficiency in interactive reasoning.

Full Content

The dominant paradigm for language models is reactive: present a prompt and receive a response. This works beautifully when the model has the information it needs. However, many real-world applications, such as educational tutoring (Hu et al., 2023; Pan et al., 2024; Team et al., 2025; Kim et al., 2024) and medical assistance (Li et al., 2024, 2025), require models to identify uncertainties, ask questions, and adapt to new information. For example, a diagnostic assistant must ask targeted questions before recommending treatment, and a tutor must ask probing questions to identify a student's misconceptions. In these settings, knowing what to ask is the central bottleneck. In such dynamic interactions, models fail not because they cannot generate answers, but because they ask the wrong questions, or none at all.

Recent work has explored interactive settings, including agents that ask clarifying questions (Aliannejadi et al., 2019;Press et al., 2023;Yao et al., 2023) and student-teacher setups where a stronger model guides a weaker one (Kendapadi et al., 2025). These approaches show that interaction helps, but a key gap remains: we lack a training signal that teaches a model which questions to ask. Most methods rely on heuristics, scaffolds, or human judgments of question quality (Aliannejadi et al., 2019;Yao et al., 2023). We argue that question quality should be judged not by style or surface semantics, but by utility: does the question improve the model’s ability to solve the task? Some work applies reward-based refinement for clarifying questions (Andukuri et al., 2024;Srivastava et al., 2019), but previous work has not explored training questioning policies or interaction in reasoning tasks.

We focus on reasoning-intensive domains (math and code) and formalize the interaction via a student-teacher protocol: a student model S attempts a problem and is allowed to query a stronger teacher T, which provides guidance but never the final answer. The student operates under a budget of questioning turns. After each teacher response, we evaluate whether S can now solve the original problem by sampling answer-only attempts and computing Pass@k. This yields a clean operational definition of utility: a question is good if and only if it increases downstream Pass@k.
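One way to make this criterion explicit (our notation, not taken from the paper) is to score a candidate question q, given the current dialogue history H, by the change in downstream Pass@k it induces once the teacher's reply is appended:

```latex
% Hedged formalization: T(H, q) denotes the teacher's reply to q given history H.
U(q \mid H) \;=\; \mathrm{Pass@}k\!\left(S \mid H \cup \{q,\, T(H, q)\}\right)
             \;-\; \mathrm{Pass@}k\!\left(S \mid H\right),
\qquad \text{$q$ is useful} \iff U(q \mid H) > 0 .
```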

Building such agents presents three core challenges. The first challenge is search, since the space of possible questions that can be posed in natural language is immense. The second challenge is supervision, since there are no existing datasets of ground-truth best questions to ask; current post-training methods such as SFT (Ouyang et al., 2022) or DPO (Rafailov et al., 2024) are primarily used to shape style, tone, and subjective response quality (Bai et al., 2022), not information seeking. The third challenge is efficiency, since interactions are expensive in real deployments, so improvements must come from making substantial progress per turn, not simply adding more dialogue.

[Figure: To train S to ask better questions, we collect preference pairs for DPO. At each turn, we sample k candidate questions {q_1, ..., q_k}; each question is evaluated based on downstream Pass@k, and preference pairs are formed accordingly. S is then trained on these pairs. Assessment: we explore injecting intermediate assessments, in which the teacher provides feedback on S's attempts at the problem, either at turn 0 (PRE-ASSESSMENT) or at turn 2 (MID-ASSESSMENT).]

To bridge these gaps, we introduce two classes of questioning strategies. Our first approach, assessment-based questioning, is a prompting strategy in which the student attempts to solve the problem at the start (PRE-ASSESSMENT) or mid-interaction (MID-ASSESSMENT). The teacher then assesses this attempt, and the student uses that feedback to ask the next question. Our second, outcome-driven question optimization (ODQS), is a supervised method that frames question asking as an outcome-labeled decision problem. At each turn, multiple candidate questions are generated. Each candidate is executed by querying the teacher, and the resulting trajectory is scored using the student's downstream Pass@k performance on the problem. The highest-scoring candidate is treated as a preferred example, while lower-scoring candidates serve as contrasts. These preference pairs are then used to train a questioning policy. This approach enables models to bootstrap effective questioning behavior without human labels of question quality.

On GSM8K (math) and HumanEval/OpenCoder (code), interactive baselines consistently improve Pass@k over a static baseline, with absolute gains of at least 0.5. ODQS also improves interaction efficiency: ODQS-trained students match the final performance of UNSCAFFOLDED interaction using three fewer turns. Overall, ODQS boosts Pass@5 by up to 54.7% on math and 22.9% on coding. We also identify the optimal position for adding assessments and analyze its impact in both domains. Our contributions are:

• We develop ODQS, a DPO-based approach that trains models to ask questions based on reasoning outcomes rather than human judgments.

• We introduce assessment-based interaction strategies that provide targeted feedback to students to improve question asking and learning.

• We empirically show that asking good questions is a learnable skill that improves accuracy and interaction efficiency across models and benchmarks.

• We provide an in-depth analysis of interactive learning dynamics, explaining how improved progress per turn and earlier high-quality questions drive performance gains.

General Interactive Learning: Early work on language-guided machine learning relied on single-turn instructions (Srivastava et al., 2017; Hancock et al., 2018; Labutov et al., 2018; Arabshahi et al., 2020). To improve comprehension, researchers investigated active learning (Collins et al., 2008; Tamkin et al., 2022) and language-based clarification (Rao and Daumé III, 2018; Srivastava et al., 2019). Recent work probes LLMs' information-seeking abilities. Bertolazzi et al. (2023) studied the 20-questions game through an information-theoretic lens, though its binary format limits extension to richer interactions. Other lines develop tool-using QA agents (Pan et al., 2024), improve retrieval via query rephrasing (Deng et al., 2024), and extend interactive learning to multimodal setups (Hu et al., 2023). Closely related to our work, Kendapadi et al. (2025) study teacher-led interactions in knowledge domains such as lyrics and news, primarily testing factual knowledge. In contrast, we focus on domains where reasoning rather than retrieval is the main challenge.

Application contexts: Interactive frameworks have been explored in high-stakes domains such as medicine and education. In clinical support, prior work studies adaptive collaborations among LLMs (Kim et al., 2024), diagnostic-style reasoning frameworks (Chen et al., 2025), and uncertainty-triggered follow-ups (Li et al., 2024; Hu et al., 2024), as well as approaches based on static question sets or handcrafted attributes (Winston et al., 2024; Li et al., 2025). In education, LLM tutors have been evaluated for pedagogical quality (Tack and Piech, 2022), with datasets and post-training methods developed to improve tutoring behavior (Jurenka et al., 2024; Chevalier et al., 2024; Kwon et al., 2024; Dan et al., 2023; Sonkar et al., 2023).

We study whether a student model can improve its performance on reasoning tasks by asking helpful questions to a stronger teacher. Let S denote the student and T the teacher. The ultimate goal for the student is to solve a reasoning problem P. For this, S interacts with T for a fixed budget of N questioning turns (we use N = 5), as illustrated in Figure 1. At each turn i ∈ {1, ..., N}, the student asks a single targeted question, and the teacher returns a clarification while being explicitly instructed not to reveal the final answer to P. After each teacher utterance, we evaluate the student's ability to solve P using the full interaction history so far: we sample k direct answers from S and compute Pass@k (Chen et al., 2021a; Wang et al., 2023).
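A rough sketch of this per-problem protocol is shown below (our reconstruction, not the authors' code); the callables stand in for the model calls and prompts given in Appendix F, and their names and signatures are our assumptions.

```python
# Per-problem student-teacher interaction with a Pass@k check after each teacher reply.
def run_interaction(problem, ask_student, ask_teacher, sample_answers, is_correct,
                    n_turns=5, k=5):
    history = []                       # shared dialogue history for this problem
    pass_curve = []                    # Pass@k indicator after each teacher reply
    for _turn in range(1, n_turns + 1):
        question = ask_student(problem, history)          # one targeted question
        reply = ask_teacher(problem, history, question)   # guidance, never the final answer
        history += [("student", question), ("teacher", reply)]
        answers = sample_answers(problem, history, k=k)   # k answer-only samples
        pass_curve.append(float(any(is_correct(a, problem) for a in answers)))
    return pass_curve
```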

3.1 Baseline methods

STATIC: Our primary baseline corresponds to the student model's performance before any interaction, i.e., at turn 0.

We also consider two prompt-based interactive baselines: UNSCAFFOLDED and SCAFFOLDED. UNSCAFFOLDED: At each student utterance, S is asked to directly generate a question for T without any explicit scaffolding. In our experiments, we compare this method with our proposed strategies to measure how much performance improves from interaction alone, without any guidance about how to formulate questions. SCAFFOLDED: In each turn, the student's system prompt is augmented with a lightweight CoT-style self-reflection scaffold. Before posing a question, S is instructed to (A) summarize what is already known, (B) identify the missing sub-goals needed to solve the task, and (C) ask a follow-up question. This encourages targeted information seeking.

UNSCAFFOLDED and SCAFFOLDED are instantiations of Algorithm 1 without any assessment (t_assess = -1). Their prompts are provided in Appendix F.1.

Similar to human learners, we hypothesize that timely feedback can help LLM students calibrate their understanding and guide their subsequent questioning. We propose two prompting-based variants that insert a one-shot assessment at different points in the interaction. In an assessment, the student (S) generates a candidate solution to P, and the teacher (T) provides feedback on its correctness and highlights errors in the student's reasoning. This assessment is then appended to the conversation history, grounding all subsequent turns. PRE-ASSESSMENT: Here the assessment is introduced before the student's first turn (t_assess = 0), so the student is explicitly grounded from the very beginning of the interaction. This corresponds to Algorithm 1 with t_assess = 0.

MID-ASSESSMENT: Here the assessment occurs halfway through the interaction (t_assess = 2 out of N = 5 turns). Unlike pre-assessment, the student can leverage the dialogue accumulated over the earlier turns. This corresponds to Algorithm 1 with t_assess = 2. See Appendix F.2 for the prompts used in assessments.
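The two variants only change where (if anywhere) the one-shot assessment is spliced into the history; a minimal sketch of that hook, assuming the same hypothetical helpers as above and the assessment prompts of Appendix F.2:

```python
# Assessment hook in Algorithm 1 (t_assess = -1: none, 0: PRE-ASSESSMENT, 2: MID-ASSESSMENT).
def maybe_insert_assessment(problem, history, turn, t_assess, sample_answers, assess):
    if turn != t_assess:
        return history
    attempt = sample_answers(problem, history, k=1)[0]   # student's candidate solution
    feedback = assess(problem, attempt)                  # teacher marks errors, gives no final answer
    return history + [("student", attempt), ("teacher", feedback)]
```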

We frame question asking as a decision problem supervised by downstream task performance. Our approach first collects candidate questions and then fine-tunes students to ask better questions based on their impact on task success. At each student turn, we sample multiple candidate questions for the student to ask the teacher. For each candidate, we query the teacher, append the resulting teacher reply to the dialogue history, and evaluate the student by downstream Pass@k on the original problem.

We then form preference pairs by marking the highest-scoring candidate as chosen and lower-scoring candidates as rejected. From these, we build an SFT dataset B_SFT of (prompt, chosen) pairs and a DPO dataset B_DPO of (prompt, chosen, rejected) tuples, where the prompt is the same question-generation prompt used by the student. The student is fine-tuned with SFT followed by DPO. This procedure is summarized in Algorithm 2. We propose two variants of this strategy. SELF-ODQS: Here the set of candidate questions is sampled from the student model itself; the student learns from its own self-exploration, discovering which of its own questions are most effective. PEER-ODQS: Here the set of candidate questions is sampled from a more capable peer model; the student learns its question-asking strategy from these externally generated candidates.
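A compact sketch of the data-collection step for a single student turn (our reconstruction of Algorithm 2's inner loop, reusing the hypothetical helpers from the protocol sketch; the number of candidates m, the graded scoring, and all names are assumptions):

```python
# Outcome-driven preference collection: candidates come from the student
# (SELF-ODQS) or a stronger peer (PEER-ODQS) via `propose_question`.
def collect_preferences(problem, history, prompt, propose_question, ask_teacher,
                        sample_answers, is_correct, m=4, k=5):
    scored = []
    for _ in range(m):
        q = propose_question(problem, history)
        reply = ask_teacher(problem, history, q)
        trial = history + [("student", q), ("teacher", reply)]
        answers = sample_answers(problem, trial, k=k)
        score = sum(is_correct(a, problem) for a in answers) / k   # graded proxy for Pass@k
        scored.append((score, q))
    scored.sort(key=lambda t: t[0], reverse=True)
    best = scored[0][1]
    sft_example = {"prompt": prompt, "chosen": best}
    dpo_examples = [{"prompt": prompt, "chosen": best, "rejected": q} for _, q in scored[1:]]
    return sft_example, dpo_examples
```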

We evaluate our methods on two reasoning-intensive domains that require multi-step reasoning and benefit from interactive clarification: math and code. In the Math setting, a high-school-level arithmetic problem is given, and the student must reason step by step to compute the correct numerical answer. In the Coding setting, a natural-language problem description is provided, and the student must synthesize a correct Python program.

Math: We evaluate on GSM8K (Cobbe et al., 2021). For each student, we construct a subset of 2000 examples that are teacher-solvable but student-unsolvable. A problem is considered teacher-solvable if the teacher is able to solve it, and student-unsolvable if the student's Pass@5 < 0.5 under five stochastic generations (temperature 0.3) in the static (zero-shot) setting. Coding: We evaluate on HumanEval (Chen et al., 2021b) and OpenCoder (Huang et al., 2025). For each student, we construct a subset of 100 examples that are teacher-solvable but student-unsolvable. A problem is considered teacher-solvable if the teacher's solution passes all the canonical unit tests, and student-unsolvable if the student's Pass@5 < 0.5 under five stochastic generations (temperature 0.3), evaluated with the canonical unit tests.
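The "teacher-solvable but student-unsolvable" filter can be sketched as follows; the predicate implementations (answer checking for math, canonical unit tests for code) and the subset sizes come from the setup above, while everything else is our assumption:

```python
# Keep problems the teacher solves but the student fails in the static setting.
def filter_examples(problems, teacher_solves, student_pass_at_5, max_n):
    kept = []
    for p in problems:
        if not teacher_solves(p):            # math: correct answer; code: passes canonical tests
            continue
        if student_pass_at_5(p) >= 0.5:      # student already solves it statically
            continue
        kept.append(p)
        if len(kept) == max_n:               # 2000 for GSM8K, 100 for HumanEval/OpenCoder
            break
    return kept
```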

Our ODQS strategy (§3.3) requires curated preference data over questions. Table 2 summarizes the number of collected examples used for training.

We use Qwen2.5-72B-Instruct (Qwen et al., 2025) as the teacher and two instruction-tuned models as students: (1) Qwen2.5-7B-Instruct (Qwen et al., 2025), and (2) Mistral-7B-Instruct-v0.3 (Jiang et al., 2023). For ODQS sampling, we use Qwen3-30B-A3B-Instruct-2507 (Yang et al., 2025) as the peer.

To track the student's progress, we use a 5-turn interaction setup and evaluate the student's performance before each student question (t ∈ {1, 2, ..., 5}). At each evaluation step, we prompt the student model S for the final answer only and draw k stochastic generations (temperature 0.3, max tokens 2048). Pass@k (Chen et al., 2021a; Wang et al., 2023) is defined as the probability that any one of the k generations matches the correct answer. We report the average Pass@k at a given evaluation turn t. Details of ODQS training are provided in Appendix C.
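With exactly k samples per evaluation, this reduces to an any-correct indicator averaged over problems at a given turn; a small helper under that reading (our interpretation of the metric):

```python
# Average Pass@k at one evaluation turn.
def average_pass_at_k(correct_flags_per_problem):
    # correct_flags_per_problem: one length-k list of correctness booleans per problem.
    return sum(any(flags) for flags in correct_flags_per_problem) / len(correct_flags_per_problem)
```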

In this section, we compare different questioning strategies. Our goal is not to compare models against each other, but to analyze how much each student improves over its own static baseline through interaction. We therefore study learning dynamics via the trajectory of Pass@5 across turns t ∈ {1, ..., 5}: effective strategies produce steadily increasing curves (with faster early gains being preferable), whereas flat trajectories indicate limited benefit from interaction. We organize our analysis around three research questions: (RQ1) Do interactions improve performance over a static baseline? (RQ2) Do assessments help students ask better questions? (RQ3) Can students be trained to ask better questions? All results are shown in Figures 2, 3, and 4, and include bootstrapped error bars, which are too small to be noticeable in most cases. Tabulated results and statistical significance tests are reported in Appendix D. All reported improvements are absolute. A concise summary of the main findings is provided in Table 1.

Table 1: Summary of main findings.

RQ1: Even simple student-teacher interactions yield gains of up to +0.85 in Pass@k. Scaffolding provides additional improvements, especially in early turns. (§5.1)

RQ2: Assessment timing is domain-dependent: the Pre variant works best for math (up to +24.7%), while Mid is most effective for coding (up to +12.8%). (§5.2)

RQ3: The Self and Peer variants yield the strongest learning curves (improvements up to +49.3%) and match baseline performance with three fewer turns. (§5.3)

In this research question, we test whether even simple interaction helps. We compare the UNSCAFFOLDED and SCAFFOLDED baselines with the STATIC baseline. Results for the two student models appear in Figures 2a and 2b. UNSCAFFOLDED interaction yields strong improvements. First, we compare UNSCAFFOLDED (red) with the STATIC baseline (dashed). Even with UNSCAFFOLDED interaction, on math, the performance of Qwen-7B improves by +0.85 absolute Pass@5 over the STATIC baseline over the course of the five-turn interaction. Similarly, the performance of Mistral-7B improves by +0.78. Coding shows similar trends, with gains of +0.47 for Qwen-7B and +0.65 for Mistral-7B. Thus even the simplest interactive baseline significantly improves performance relative to the STATIC baseline.

SCAFFOLDED helps most in early turns. Analyzing the performance of SCAFFOLDED (blue), we see that it provides similar gains over the STATIC baseline and consistent but smaller gains on top of UNSCAFFOLDED. On math, SCAFFOLDED improves over UNSCAFFOLDED by +7% on average (up to +18.8%) for Qwen-7B and by +1% on average (up to +4.7%) for Mistral-7B. On coding, it yields +0.5% on average (up to +7.3%) for Qwen-7B and +1.8% on average (up to +8.9%) for Mistral-7B. Across settings, these gains are most pronounced in early turns and saturate later.

In this research question, we examine whether providing explicit assessments helps students ask better questions, and how the timing of such assessments affects learning. Since scaffolding improved early-turn learning in RQ1, we apply it by default. We compare the two assessment variants, PRE-ASSESSMENT and MID-ASSESSMENT, against a no-assessment condition, i.e., providing no assessment at all.

Results for both models are shown in Figures 3a and 3b (Math), and Figures 3c and 3d (Coding).

We first compare PRE-ASSESSMENT (red) and MID-ASSESSMENT (yellow) against no-assessment (blue). As expected, models immediately benefit from the injected feedback. At turn 0, PRE-ASSESSMENT improves over no-assessment by +21.5% for Qwen-7B and +18.3% for Mistral-7B. This early gain persists across subsequent turns: relative to no-assessment, PRE-ASSESSMENT is better by +6.8% on average for Qwen-7B (up to +21.5%) and by +8.5% on average for Mistral-7B (up to +19.4%). In contrast, MID-ASSESSMENT yields only marginal improvements over no-assessment.

When comparing the two assessment positions directly, we observe no significant differences at later turns. However, PRE-ASSESSMENT remains preferable because it introduces a performance boost from the very first turn. Moreover, PRE-ASSESSMENT matches the final performance of the no-assessment setting using one fewer turn, indicating higher efficiency under a fixed interaction budget.

For Coding, MID-ASSESSMENT outperforms PRE-ASSESSMENT. At turn 2, MID-ASSESSMENT provides a strong boost, improving over no-assessment by +10.9% for Qwen-7B. Across turns 2-5, MID-ASSESSMENT continues to outperform no-assessment by +10.0% on average for Qwen-7B (up to +10.9%) and by +2.4% on average for Mistral-7B (up to +4.5%). While PRE-ASSESSMENT yields an initial improvement at turn 0, its benefits diminish quickly and remain marginal thereafter.

Crucially, MID-ASSESSMENT surpasses PRE-ASSESSMENT in later turns: across turns 2-5, MID-ASSESSMENT outperforms PRE-ASSESSMENT by +9.7% on average for Qwen-7B (up to +10.1%) and by +9.6% for Mistral-7B (up to +12.8%). Additionally, MID-ASSESSMENT reaches the same performance as no-assessment in two fewer turns, demonstrating superior efficiency.

The optimal position of the assessment is domain-dependent. Math problems are stepwise and incremental, making early grounding (PRE-ASSESSMENT) especially valuable. Coding problems, in contrast, are more divergent: an incorrect early direction can derail multiple subsequent turns.

A delayed assessment (MID-ASSESSMENT) allows the model to first explore solution paths, then correct course once sufficient feedback has accumulated. These trends are examined in detail in §6.

In this research question, we evaluate whether students can be explicitly trained to ask better questions. For Mistral-7B , SELF-ODQS yields an average gain of +10.1% (up to +20.5%), and PEER-ODQS yields +10.9% on average (up to +30.7%).

A similar trend holds for Coding. For Qwen-7B , SELF-ODQS improves over no-training by +0.6% on average (up to +3.0%), while PEER-ODQS improves by +1.4% on average (up to +1.7%). For Mistral-7B , SELF-ODQS yields an average gain of +5.2% (up to +8.7%), and PEER-ODQS improves by +7.2% on average (up to +16.4%).

Beyond Pass@5 gains, both ODQS variants are also more interaction-efficient: for both models and domains, they reach the same final performance as the assessment strategy using three fewer turns. While PEER-ODQS is slightly stronger at later turns, it incurs higher cost due to the need for a peer model, making SELF-ODQS a more efficient alternative in practice.

In this section, we analyze why and when different questioning strategies outperform the UNSCAFFOLDED baseline, and examine the effect of assessment position on student performance.

We hypothesize that performance improves because better methods let the student (1) make more meaningful progress toward solving the problem at each turn, and (2) ask higher-quality questions earlier in the interaction. We empirically verify both claims.

Here, we only analyze students' questions at turns 0-4, since a question at the final turn (5) cannot be evaluated.

Claim 1: Progress per turn. We use an LLM-as-a-Judge evaluation: for each turn, we provide the problem, its gold solution, and the student's question and reasoning to Qwen-72B, which assigns a progress score in [0, 1] (prompts in Appendix F.3). This score measures the student's reasoning progress, not correctness (unlike Pass@5).
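A hedged sketch of this scoring call is given below. The prompt template lives in Appendix F.3; the `judge_generate` helper and the exact JSON schema ({"progress": float, "justification": str}) are our assumptions, based on the prompt's instruction to return only valid JSON with a numeric "progress" field.

```python
import json

# Per-turn progress scoring with an LLM judge (Qwen-72B in the paper).
def score_progress(judge_generate, filled_prompt):
    raw = judge_generate(filled_prompt)   # judge is instructed to return JSON only
    result = json.loads(raw)              # e.g. {"progress": 0.62, "justification": "setup correct"}
    return max(0.0, min(1.0, float(result["progress"])))   # clamp defensively to [0, 1]
```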

[Figure 7: Impact of assessment position. Pass@5 for Qwen-7B when inserting a single assessment at each turn (0-5). On math tasks (a), early assessment (Turn 0) provides the largest and most persistent improvements. On coding tasks (b), the benefit of assessment is short-lived, with later assessments outperforming earlier ones.]

Figure 5 shows progress trajectories for Qwen-7B on GSM8K. SCAFFOLDED consistently increases progress per turn. Both assessments yield sharp improvements exactly at their insertion points (Turns 1 and 3). The largest gains come from the ODQS variants, Self and Peer, which outperform all other methods across turns. This verifies Claim 1.

Claim 2: Asking good questions earlier.

We again use an LLM-as-a-Judge setup with Qwen-72B to compute the similarity between questions asked by UNSCAFFOLDED and SELF-ODQS students. The judge's prompts are shown in Appendix F.4. Figure 6 shows the heatmap comparing SELF-ODQS vs. UNSCAFFOLDED for Math; a darker color indicates higher similarity. Looking at the top-most row, SELF-ODQS questions at Turn 0 resemble UNSCAFFOLDED questions at Turns 1-4. Similarly, SELF-ODQS at Turn 1 resembles UNSCAFFOLDED at Turns 2-4. This pattern persists across all turns: SELF-ODQS asks high-quality questions 1-3 turns earlier.

This verifies Claim 2. We observe similar patterns for other methods in both the Math and Coding domains. Example transcripts showing this behavior are provided in Appendix G.

Both analyses show that better methods increase reasoning progress at each turn and enable the student to ask more advanced questions earlier; these are the key drivers of their performance gains.

To study how assessment timing affects performance, we place a single assessment at each turn from 0 to 5 and evaluate on both math and coding problems. Figure 7 reports Pass@5 for Qwen-7B across all assessment positions. On math, assessment at Turn 0 performs best: it produces a large boost at Turn 0 that persists across the remaining turns. On coding, the improvement is transient: assessment at Turn 0 boosts Turn 0 but is soon overtaken by assessments at Turns 1 and 2, and so on. Later assessments yield better performance because coding problems are not stepwise: early incorrect attempts often derail the solution, and a delayed assessment can correct accumulated errors. This explains the earlier empirical results:

• Math: PRE-ASSESSMENT (Turn 0) is best.

• Coding: MID-ASSESSMENT (Turn 2) is best.

We investigated whether LLMs can be prompted and trained to behave as effective interactive learners. Our results show that even a simple UNSCAFFOLDED student-teacher interaction yields substantial gains, with absolute improvements in Pass@k often exceeding +0.5 and reaching as high as +0.85. SCAFFOLDED guidance further amplifies these gains. We also found that assessment timing matters: PRE-ASSESSMENT is highly beneficial for math tasks, whereas MID-ASSESSMENT is most effective for coding tasks, reflecting fundamental differences in their problem structures (stepwise vs. non-stepwise).

Beyond prompting strategies, we demonstrated that the skill of asking good questions can be explicitly trained. Both SELF-ODQS and PEER-ODQS substantially improve students’ questioning policies, leading to stronger downstream performance and greater learning efficiency, often achieving the same performance as UNSCAFFOLDED students with three fewer turns. Our analysis further showed that stronger methods help models make more progress per turn and ask higher-quality questions earlier in the trajectory.

Overall, these findings indicate that LLMs can move beyond static information retrieval and operate as adaptive, interactive learners. A key challenge ahead is the reliance on brittle, in-context memory: developing more robust mechanisms for retaining and updating knowledge across turns is an important next step toward building truly adaptive learning agents.

Our framework assumes access to a reliable teacher that can (i) answer questions correctly and (ii) provide meaningful feedback during assessments. Performance may degrade if the teacher is noisy, biased, or weaker than the student, which limits applicability in low-resource or fully autonomous settings.

We define question quality indirectly through downstream task performance. While effective, this may conflate good questioning with favorable answer sampling, and does not explicitly capture pedagogical properties such as clarity, minimality, or interpretability of questions.

We evaluate our methods on math (GSM8K) and coding (HumanEval/OPC) tasks, which capture structured reasoning but do not cover broader interactive settings such as open-ended dialogue, longhorizon planning, or multimodal environments. The effectiveness of guided questioning in these domains remains an open question.

We study interactions under a small, fixed turn budget. Longer or adaptive interactions may exhibit different dynamics, and the efficiency gains observed here may not directly extrapolate to settings with unbounded or variable-length conversations.

Outcome-driven training requires executing multiple candidate questions per turn and evaluating them via teacher interaction, which can be computationally expensive. While effective, this may limit scalability to very large datasets or models without further optimization.

All learning occurs within the context window or via parameter updates. The framework does not address long-term memory, retrieval, or knowledge consolidation across tasks, which are important for truly adaptive interactive agents.

Algorithm 1 describes the UNSCAFFOLDED and SCAFFOLDED student-teacher interaction framework. Algorithm 2 presents the self-training and peer-training approaches, including data collection, DPO training, and subsequent interaction-based evaluation.

For DPO training we use QLoRA (Dettmers et al., 2023) with 4-bit quantization, r=16, α=32, dropout 0.05, batch size 4, AdamW (β1=0.9, β2=0.95), learning rate 5×10^-6, cosine decay, and 5 epochs on an NVIDIA RTX A6000 (48 GB) with early stopping.

Algorithm 2 (excerpt):
for each (x, y) in D do
    H ← turn-0 history initialized with a teacher message;  Q ← ∅
    for j = 1 to m do
        sample candidate question q_j;  h_j ← TEACHER(T, H ∪ {q_j})   ▷ teacher response to q_j
        s_j ← downstream Pass@k of S given H ∪ {q_j, h_j};  Q ← Q ∪ {(q_j, h_j, s_j, Â_j)}
    end for
    j* ← argmax_j s_j   ▷ pick the best question based on Pass@k
    add (prompt, q_j*) to B_SFT
    for each (q_j, h_j, s_j, Â_j) ∈ Q with j ≠ j* do
        add (prompt, q_j*, q_j) to B_DPO
    end for
end for
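The listed configuration can be expressed with the Hugging Face trl/peft stack; the sketch below is illustrative only (argument names differ across trl versions, and the preference data shown is a placeholder), not the authors' training script.

```python
import torch
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig
from trl import DPOConfig, DPOTrainer

# Placeholder preference pairs in the standard (prompt, chosen, rejected) format
# produced by the ODQS collection step.
pairs = Dataset.from_list([
    {"prompt": "...question-generation prompt...",
     "chosen": "...highest-scoring question...",
     "rejected": "...lower-scoring question..."},
])

model_name = "Qwen/Qwen2.5-7B-Instruct"   # one of the two student models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True,
                                           bnb_4bit_compute_dtype=torch.bfloat16),
)

args = DPOConfig(
    output_dir="odqs-dpo",
    per_device_train_batch_size=4,
    learning_rate=5e-6,
    lr_scheduler_type="cosine",
    num_train_epochs=5,
    adam_beta1=0.9,
    adam_beta2=0.95,
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=pairs,
    processing_class=tokenizer,   # named `tokenizer` in older trl releases
    peft_config=LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM"),
)
trainer.train()
```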

This section provides tabulated performance improvements and statistical significance test results.

For math, results are reported for Qwen-7B in Table 3 and for Mistral-7B in Table 4. For coding, results are reported for Qwen-7B in Table 5 and for Mistral-7B in Table 6. Statistical significance tests comparing methods are summarized by research question: Table 7 for RQ1, Table 8 for RQ2, and Table 9 for RQ3.

We investigate whether the strength of the teacher influences the quality of conversational learning. Using Qwen-7B as the student model, we repeat our primary experiments with Qwen-7B , Qwen-14B, Qwen-32B, and Qwen-72B as teachers. For fairness, the dataset is resampled for each teacher model, except when the teacher is identical to the student (Qwen-7B ), to ensure the teacher can correctly answer each question.

Math: Figure 8a shows nearly identical trends across all teacher sizes. Interestingly, strong gains are observed even when the teacher and student are the same (Qwen-7B ). These results suggest that, for math tasks, the benefit comes less from teacher strength and more from the role-playing dynamic itself: performing chain-of-thought reasoning in a conversational format appears more effective than simply prompting the model to reason in isolation.

Code: Figure 8b shows clear divergence between strong and weak teachers. Qwen-32B and Qwen-72B provide useful guidance, leading to gains of +0.3 and +0.4 Pass@k, respectively. As expected, the stronger Qwen-72B teacher yields strictly higher improvements across all turns. By contrast, weaker teachers (Qwen-7B and Qwen-14B) fail to provide useful guidance and even degrade student performance after interaction. We hypothesize that although weaker teachers can solve the tasks themselves, they struggle to convey their reasoning in a way that benefits the student, often steering it into unproductive paths.

All prompts are task-specific and are provided below as prompt boxes. We group them by their role in the pipeline: interaction, progress analysis, and similarity analysis.

For each task, we use four prompts: two variants for eliciting student questions (UNSCAFFOLDED vs. SCAFFOLDED ), one for eliciting student answers for Pass@k evaluation, and one for eliciting teacher responses. Math: We use Prompt 9 for UNSCAFFOLDED and Prompt 10 for SCAFFOLDED to elicit student questions. We use Prompt 11 to elicit answer-only outputs for computing Pass@k. We use Prompt 12 to elicit teacher responses. Coding: We use Prompt 13 for UNSCAFFOLDED and Prompt 14 for SCAFFOLDED to elicit student questions. We use Prompt 15 to elicit answer-only outputs for computing Pass@k. We use Prompt 16 to elicit teacher responses.

To generate assessments, we provide the problem together with the student’s answer attempt to the teacher model, which returns structured feedback on the correctness and quality of the answer.

We use Prompt 17 for math tasks and Prompt 18 for coding tasks. Examples 19 and 20 show sample feedback from the teacher.

To measure per-turn reasoning progress, we provide the student’s response to a judge model and prompt it to assign a progress score in [0, 1]. We use Qwen-72B as the judge. For math, we use Prompt 21 and for coding, we use Prompt 22.

To compare question content across methods and turns, we provide pairs of student questions to a judge model and prompt it to assign a similarity score in [0, 1]. We use Qwen-72B as the judge. For math, we use Prompt 23 and for coding, we use Prompt 24.

You are an AI assistant tasked with solving math problems. To help you improve your skills, your math tutor has posed the following problem: -PROBLEM- To help you in solving this problem, you may ask the tutor any question relevant to the task (clarification questions, requirement questions, methodology questions etc.). Think about how you would solve the problem, and what you still need to know in order to complete the question. Do not solve the problem directly, do not ask the tutor for any solutions - the tutor has been instructed not to provide you with any direct answers. Keep your questions concise and to the point.

Transcript 25 illustrates how a pre-assessment helps the model ask a more informative question on Math.

Coding: Assessment Example

Your response: -STUDENT ANSWER-
Evaluation result: FAIL
Feedback: Your current approach uses a set to track unique numbers, but this removes duplicates without preserving the original order and also doesn't handle the requirement to remove elements that occur more than once.

You are a strict and consistent math grader. Your task is to evaluate how much progress a student’s reasoning shows toward solving a math problem.

You must estimate a single real-valued score between 0 and 1:

- 0.0 → completely wrong or irrelevant reasoning.
- 0.25 → only the setup or an initial idea is correct.
- 0.50 → roughly halfway; some correct derivations but key steps missing or incorrect.
- 0.75 → nearly correct; only small arithmetic or algebraic mistakes.
- 1.0 → fully correct and complete solution.
- Use intermediate values (e.g., 0.62) if progress lies between anchor points.
- Be strict but fair: reward correct logical steps that meaningfully advance toward the right answer.
- Ignore minor stylistic differences (notation, units, variable names) if mathematically equivalent.
- Do not reward irrelevant or circular reasoning.
- If the final answer matches the gold answer but reasoning is absent or wrong, do NOT give 1.0; cap the score to reflect weak progress.
- The gold answer is provided only as a reference for what constitutes a correct solution; do NOT copy it or include it in your output.
- Ensure "progress" is a numeric value (not a string) in [0, 1].
- Return only valid JSON (nothing else) and include a short justification (≤ 30 words).
- Do not repeat the problem text or add extra commentary.



