Llama-Polya: Instruction Tuning for Large Language Model based on Polya's Problem-solving
This paper introduces Llama-Polya, an instruction-tuned large language model that integrates Polya’s four-step problem-solving framework into its dialogue structure to support mathematical reasoning. Mathematical problem-solving is central to students’ success in mathematics education, yet many learners struggle to plan, justify, and verify their solutions. Although large language models (LLMs) show promise as intelligent tutors, they often lack structured pedagogical alignment grounded in established learning theories. To address this gap, we operationalize Polya’s problem-solving framework within an instruction-tuned LLM to promote metacognitive engagement and examine the effects of pedagogy-aligned fine-tuning compared to domain-only and general-purpose instruction tuning. Built on the Llama-3.1-8B architecture, Llama-Polya was fine-tuned on synthetic math problem-solving data derived from GSM8K, structured according to Polya’s four stages. We developed and evaluated multiple variants (general-purpose instruct, math-domain metamath, pedagogy-aligned polya-v2, and sequential metamath+polya-v2) using both quantitative accuracy metrics and qualitative pedagogical assessments. Results indicate that models tuned with Polya’s framework and domain-specific data produced more balanced reasoning-stage distributions and fewer premature answers. Expert evaluators also observed improved pedagogical coherence and metacognitive prompting, although limitations in personalization and mathematical rigor remained. These findings suggest that pedagogy-grounded instruction tuning can enhance educational alignment and reasoning transparency in LLM-based tutoring systems.
Research Summary
Background and Motivation
Mathematical problem solving is a cornerstone of students’ cognitive and metacognitive development. George Polya’s four‑step framework—understand the problem, devise a plan, carry out the plan, and look back—has long guided instructional design and assessment in mathematics education. Recent large language models (LLMs) have shown promise as conversational tutors, yet most are tuned for general-purpose instruction or domain‑specific reasoning without explicit alignment to pedagogical theory. Consequently, they tend to provide final answers quickly, bypassing the reasoning process that learners need to internalize. The authors argue that embedding an established educational scaffold directly into the model’s training objective could promote metacognitive engagement and improve tutoring quality.
Related Work
The paper reviews three strands of literature: (1) the educational psychology of problem solving, highlighting Polya, Schoenfeld, and Garofalo’s models; (2) AI applications in mathematics education, noting successes in content generation and adaptive feedback but also persistent issues of accuracy, alignment, and personalization; and (3) instruction‑tuning of LLMs, from early InstructGPT and FLAN‑T5 to recent parameter‑efficient fine‑tuning (PEFT) methods. The authors point out a gap: few studies have explicitly incorporated a pedagogical framework into the fine‑tuning pipeline.
Methodology
- Model Base – Llama‑3.1‑8B, a decoder‑only transformer released by Meta.
- Data Construction – The authors sampled ~32 k multi‑step word problems from GSM8K (covering arithmetic, measurement, and geometry). Using a carefully engineered prompt schema, they asked GPT‑4o to generate tutoring dialogues that explicitly follow Polya’s four stages. Each prompt contains eight elements: (i) Situation Information, (ii) Utterance Guidelines, (iii) Student Persona, (iv) Math Problem, (v) Stage‑Flow description, (vi) Few‑shot examples, (vii) Template, and (viii) Instruction. Random variables (student persona, problem) ensure diversity, while optimized variables (stage flow, few‑shots) are iteratively refined based on human review.
- Instruction Tuning – Dialogue pairs are formatted in ChatML (<|im_start|>…<|im_end|>) with the assistant’s response as the target. Full‑parameter fine‑tuning (no LoRA/QLoRA) is performed for one epoch using the Axolotl framework, batch size 1, learning rate 0.0002, weight decay 0.1, and DeepSpeed ZeRO‑2 on eight A100 GPUs.
- Model Variants – Five configurations are trained: (a) base (pre‑trained only), (b) instruct (general‑purpose instruction tuning), (c) polya‑v2 (fine‑tuned on Polya‑aligned math dialogues), (d) metamath (fine‑tuned on a formal reasoning dataset), and (e) metamath + polya‑v2 (sequential fine‑tuning first on Metamath, then on Polya data).
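The ChatML formatting described under Instruction Tuning can be illustrated with a minimal sketch. It is not the authors' preprocessing code (in their setup the Axolotl framework would handle this step); the helper names `render_turn` and `to_training_pairs` and the dictionary layout of a dialogue are assumptions made for illustration. The sketch renders each turn between `<|im_start|>` and `<|im_end|>` and, for every assistant turn, yields the preceding context as the prompt and the assistant's response as the target.

```python
from typing import Dict, List, Tuple

IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def render_turn(role: str, content: str) -> str:
    """Render one dialogue turn in ChatML form."""
    return f"{IM_START}{role}\n{content}{IM_END}\n"


def to_training_pairs(dialogue: List[Dict[str, str]]) -> List[Tuple[str, str]]:
    """Split a dialogue into (prompt, target) pairs, one per assistant turn.

    The dialogue layout (a list of {"role", "content"} dicts) is a
    hypothetical choice for this sketch, not taken from the paper.
    """
    pairs = []
    context = ""
    for turn in dialogue:
        if turn["role"] == "assistant":
            # Prompt: everything so far plus the opening assistant tag.
            prompt = context + f"{IM_START}assistant\n"
            # Target: the assistant response, closed by the end marker.
            target = turn["content"] + IM_END
            pairs.append((prompt, target))
        context += render_turn(turn["role"], turn["content"])
    return pairs


if __name__ == "__main__":
    example = [
        {"role": "system", "content": "You are a math tutor following Polya's four stages."},
        {"role": "user", "content": "I need help with a word problem about apples."},
        {"role": "assistant", "content": "Let's start by understanding it. What is the problem asking?"},
    ]
    for prompt, target in to_training_pairs(example):
        print(prompt + target)
```

In a setup like this, the loss would typically be computed only on the target tokens, which matches the summary's statement that the assistant's response is the training target.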
Evaluation Protocol
Evaluation proceeds in two stages. First, researchers engage each model in 10–20 turn tutoring sessions across the three problem domains (arithmetic, measurement, and geometry), manually annotating each turn with Polya’s stage labels. Second, a panel of mathematics‑education experts rates the dialogues on (i) stage‑wise coherence, (ii) metacognitive prompting (e.g., “Why did you choose this strategy?”), and (iii) solution accuracy. Quantitative metrics include exact‑match accuracy, the proportion of tokens generated per stage, and the rate of premature final answers (answers given before the “look‑back” stage).
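The two process-oriented metrics can be sketched from stage-annotated turns. This is one possible reading of the definitions in the summary, not the authors' evaluation code: the `Turn` structure, the stage label strings, the `gives_final_answer` flag, and whitespace tokenization are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical label names for Polya's four stages.
STAGES = ("understand", "plan", "carry_out", "look_back")


@dataclass
class Turn:
    stage: str                        # manually annotated Polya stage
    text: str                         # tutor (model) utterance
    gives_final_answer: bool = False  # annotator flag, assumed here


def stage_token_share(turns: List[Turn]) -> Dict[str, float]:
    """Fraction of tokens per stage, using whitespace tokens as a proxy."""
    counts = {stage: 0 for stage in STAGES}
    for turn in turns:
        if turn.stage not in counts:
            raise ValueError(f"unknown stage label: {turn.stage}")
        counts[turn.stage] += len(turn.text.split())
    total = sum(counts.values())
    return {s: (counts[s] / total if total else 0.0) for s in STAGES}


def is_premature(turns: List[Turn]) -> bool:
    """True if a final answer appears before any look-back turn."""
    for turn in turns:
        if turn.stage == "look_back":
            return False
        if turn.gives_final_answer:
            return True
    return False


def premature_rate(dialogues: List[List[Turn]]) -> float:
    """Share of dialogues in which the final answer is given prematurely."""
    if not dialogues:
        return 0.0
    return sum(is_premature(d) for d in dialogues) / len(dialogues)
```

A real evaluation would more likely count tokens with the model's own tokenizer; whitespace splitting is used here only to keep the sketch dependency-free.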
Results
- Stage Distribution: Polya‑v2 and Metamath + Polya‑v2 allocate substantially more tokens to the “Plan” and “Look‑back” stages compared with the general‑purpose instruct model, indicating deeper engagement with the problem‑solving process.
- Premature Answer Reduction: Relative to the instruct baseline, the premature‑answer rate drops by 23 percentage points for Polya‑v2 and by 31 percentage points for the sequential Metamath + Polya‑v2 model.
- Expert Ratings: Metacognitive prompting and pedagogical coherence receive the highest average scores (4.2/5) for the Polya‑aligned models. However, in complex geometry problems, experts note occasional lapses in mathematical rigor, especially in proof‑style reasoning.
- Personalization Gap: Although the prompt schema includes a “Student Persona” field, the persona is static per dialogue and does not adapt to a learner’s history, limiting true personalization.
Discussion
The study demonstrates that grounding LLM fine‑tuning in a well‑established educational framework can shift model behavior from answer‑first to process‑first, fostering metacognitive dialogue. Synthetic data alone proved sufficient to produce measurable gains, suggesting that large‑scale, theory‑driven data generation is a viable shortcut when real tutoring transcripts are scarce. Nonetheless, the authors acknowledge several limitations: (i) lack of longitudinal evaluation of learning gains, (ii) insufficient handling of formal proof steps, and (iii) limited adaptive personalization.
Conclusion and Future Work
Llama‑Polya provides a proof‑of‑concept that Polya‑aligned instruction tuning yields more pedagogically coherent and metacognitively supportive tutoring interactions. Future research directions include (1) integrating learner models to enable dynamic, personalized scaffolding, (2) mixing formal reasoning corpora (e.g., Metamath) with Polya‑structured dialogues to improve proof‑level rigor, and (3) conducting classroom‑scale studies to measure actual learning outcomes over time.
Overall, the paper contributes a novel methodology for marrying educational theory with LLM fine‑tuning, offering a roadmap for developing AI tutors that not only solve problems but also teach students how to think about problems.