Socratic Students: Training Language Models to Ask Better Questions for Reasoning
Rajeev Bhatt Ambati 1∗, Tianyi Niu 1∗, Aashu Singh 2,
Shlok Mishra 2, Snigdha Chaturvedi 1, Shashank Srivastava 1
1UNC Chapel Hill, 2Meta
*Equal contribution. Correspondence: ambati@cs.unc.edu
Abstract
Large Language Models (LLMs) are usually used to answer questions, but many high-stakes applications (e.g., tutoring, clinical support) require the complementary skill of asking questions: detecting missing information, requesting clarifications, and using them to solve tasks. We study this skill in reasoning-heavy domains where progress depends on inquiry rather than factual recall. We define an interactive protocol where a student model engages a stronger teacher under a small turn budget. After each teacher reply, we evaluate the student on the original task with Pass@k. We propose the Outcome-Driven Question Optimization Strategy (ODQS), a training framework that learns a questioning policy from downstream task outcomes. At each turn, we sample multiple candidate questions, query the teacher with each, and score the student's resulting performance. Using these scores, we train the student via supervised fine-tuning followed by Direct Preference Optimization (DPO), without any human labels. On GSM8K, HumanEval, and OpenCoder, ODQS produces large gains over interactive baselines, boosting Pass@5 by up to 54.7% (absolute) on math and 22.9% (absolute) on coding, and matching baseline performance in three fewer turns. Thus, question asking can be explicitly trained from task outcomes, improving both accuracy and efficiency in interactive reasoning.
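To make the training recipe concrete, the following is a minimal sketch of how per-question outcome scores could be turned into SFT targets and DPO preference pairs. The helper names (`sample_question`, `teacher.answer`, `check_answer`) and the best-vs-worst pairing rule are illustrative assumptions, not the paper's exact implementation.

```python
# Illustrative sketch (not the authors' released code): score candidate
# questions by downstream task outcomes and build SFT / DPO training data.

def pass_at_k(student, problem, dialogue, k):
    """Empirical Pass@k proxy: fraction of k answer-only attempts that succeed.
    `check_answer` is a hypothetical task-specific verifier (e.g., exact match
    for math, unit tests for code)."""
    attempts = [student.answer(problem, dialogue) for _ in range(k)]
    return sum(check_answer(problem, a) for a in attempts) / k

def build_training_examples(student, teacher, problem, dialogue,
                            num_candidates=4, k=5):
    scored = []
    for _ in range(num_candidates):
        q = student.sample_question(problem, dialogue)       # candidate question
        reply = teacher.answer(problem, dialogue, q)          # guidance, never the final answer
        utility = pass_at_k(student, problem, dialogue + [(q, reply)], k)
        scored.append((utility, q))

    scored.sort(key=lambda s: s[0], reverse=True)
    (best_u, best_q), (worst_u, worst_q) = scored[0], scored[-1]

    sft_example = {"prompt": (problem, dialogue), "target": best_q}   # imitate the best question
    dpo_pair = None
    if best_u > worst_u:                                              # keep only informative pairs
        dpo_pair = {"prompt": (problem, dialogue),
                    "chosen": best_q, "rejected": worst_q}
    return sft_example, dpo_pair
```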
1 Introduction
The dominant paradigm for language models is reactive: present a prompt and receive a response. This works beautifully when the model has the information it needs. However, many real-world applications, such as educational tutoring (Hu et al., 2023; Pan et al., 2024; Team et al., 2025; Kim et al., 2024) and medical assistance (Li et al., 2024, 2025), require models to identify uncertainties, ask questions, and adapt to new information. For example, a diagnostic assistant must ask targeted questions before recommending treatment, and a tutor must ask probing questions to identify a student's misconceptions. In these settings, knowing what to ask is the central bottleneck. In such dynamic interactions, models fail not because they cannot generate answers, but because they ask the wrong questions, or none at all.
Recent work has explored interactive settings, including agents that ask clarifying questions (Aliannejadi et al., 2019; Press et al., 2023; Yao et al., 2023) and student–teacher setups where a stronger model guides a weaker one (Kendapadi et al., 2025). These approaches show that interaction helps, but a key gap remains: we lack a training signal that teaches a model which questions to ask. Most methods rely on heuristics, scaffolds, or human judgments of question quality (Aliannejadi et al., 2019; Yao et al., 2023). We argue that question quality should be judged not by style or surface semantics, but by utility: does the question improve the model's ability to solve the task? Some work applies reward-based refinement for clarifying questions (Andukuri et al., 2024; Srivastava et al., 2019), but previous work has not explored training questioning policies or interaction in reasoning tasks.
We focus on reasoning-intensive domains (math and code) and formalize the interaction via a student–teacher protocol: a student model S attempts a problem and is allowed to query a stronger teacher T, which provides guidance but never the final answer. The student operates under a budget of questioning turns. After each teacher response, we evaluate whether S can now solve the original problem by sampling answer-only attempts and computing Pass@k. This yields a clean operational definition of utility: a question is good if and only if it increases downstream Pass@k.
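A minimal sketch of this interaction loop, assuming hypothetical `student`, `teacher`, and `check_answer` interfaces and a fixed turn budget, is shown below; it is meant only to pin down the protocol just described, not to reproduce the exact evaluation harness.

```python
# Sketch of the student-teacher protocol: the student asks one question per
# turn, the teacher replies with guidance (never the final answer), and the
# student is re-evaluated on the original problem with Pass@k after each turn.

def interact_and_evaluate(student, teacher, problem, budget=3, k=5):
    dialogue, pass_at_k_per_turn = [], []
    for _ in range(budget):
        question = student.ask(problem, dialogue)             # what to ask next
        reply = teacher.guide(problem, dialogue, question)     # hints only
        dialogue += [("student", question), ("teacher", reply)]

        # Utility check: sample k answer-only attempts and test each one.
        attempts = [student.answer(problem, dialogue) for _ in range(k)]
        solved = any(check_answer(problem, a) for a in attempts)
        pass_at_k_per_turn.append(solved)
    return pass_at_k_per_turn
```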
Building such agents presents three core challenges. The first challenge is search, since the
[Figure: Example student–teacher dialogue on a math word problem. The student asks whether to express the total number of chocolate bars sold in terms of the number of boxes; the teacher confirms the approach. The student then lets x be the number of bars per box, sets up 3.5x + 4.5x = 64, simplifies to 8x = 64, and solves for x by dividing both sides by 8, with the teacher confirming each step.]