Beyond Output Critique: Self-Correction via Task Distillation
Large language models (LLMs) have shown promising self-correction abilities, where iterative refinement improves the quality of generated responses. However, most existing approaches operate at the level of output critique, patching surface errors while often failing to correct deeper reasoning flaws. We propose SELF-THOUGHT, a framework that introduces an intermediate step of task abstraction before solution refinement. Given an input and an initial response, the model first distills the task into a structured template that captures key variables, constraints, and problem structure. This abstraction then guides solution instantiation, grounding subsequent responses in a clearer understanding of the task and reducing error propagation. Crucially, we show that these abstractions can be transferred across models: templates generated by larger models can serve as structured guides for smaller LLMs, which typically struggle with intrinsic self-correction. By reusing distilled task structures, smaller models achieve more reliable refinements without heavy fine-tuning or reliance on external verifiers. Experiments across diverse reasoning tasks demonstrate that SELF-THOUGHT improves accuracy, robustness, and generalization for both large and small models, offering a scalable path toward more reliable self-correcting language systems.
💡 Research Summary
The paper introduces SELF‑THOUGHT, a novel self‑correction framework for large language models (LLMs) that goes beyond the conventional “output critique” paradigm. Traditional self‑correction methods work by generating an answer, critiquing it, and then patching surface‑level errors. While effective for simple mistakes, these approaches often fail to address deeper reasoning flaws because they do not explicitly model the underlying task structure.
SELF‑THOUGHT adds an intermediate Task Abstraction step. After an initial generation (ŷ₀) from an input prompt ℑ, the model receives a second, abstraction prompt ℘ that asks it to distill the problem into a structured template d. This template explicitly lists the key variables, constraints, goal type, and any domain‑specific rules in a machine‑readable format (e.g., JSON key‑value pairs). By separating “understanding the task” from “solving the task”, the model can ground subsequent reasoning in a clear, formal representation, reducing error propagation.
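As a concrete illustration, a distilled template d for one Game of 24 instance might look like the sketch below. The field names are illustrative assumptions; the paper specifies only that the template captures variables, constraints, and goal type in a machine‑readable format such as JSON:

```python
import json

# Hypothetical distilled template d for one Game of 24 instance.
# Field names are illustrative, not taken from the paper.
template = {
    "task_type": "arithmetic_combination",
    "variables": [4, 7, 8, 8],
    "constraints": [
        "use each number exactly once",
        "only the operators +, -, *, / are allowed",
    ],
    "goal": "build an expression that evaluates to 24",
}

# Serialize for inclusion in the refinement prompt.
template_str = json.dumps(template, indent=2)
print(template_str)
```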
The next phase, Solution Instantiation, uses a third prompt ℜ that presents the original input x, the initial answer ŷ₀, and the distilled template d. The model then generates a refined answer ŷ₁ that is guided by the explicit constraints encoded in d. The process is iterated up to a fixed number of steps or until a stopping condition is met, as formalized in Algorithm 1.
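The abstraction–instantiation loop of Algorithm 1 can be sketched as follows. Here `llm` stands in for any prompt‑to‑text completion function, and the prompt wording, step budget, and fixed‑point stopping rule are simplified assumptions rather than the paper's exact algorithm:

```python
# Minimal sketch of the SELF-THOUGHT loop (Algorithm 1).
# `llm` is any prompt -> text completion function; prompt wording,
# stopping rule, and step budget are illustrative assumptions.
def self_thought(x, llm, abstraction_prompt, refinement_prompt, max_steps=3):
    y = llm(f"Solve: {x}")  # initial answer y_hat_0
    for _ in range(max_steps):
        # Task abstraction: distill the problem into a template d.
        d = llm(f"{abstraction_prompt}\nProblem: {x}\nCurrent answer: {y}")
        # Solution instantiation: refine the answer guided by d.
        y_next = llm(f"{refinement_prompt}\nProblem: {x}\nTemplate: {d}\nPrevious answer: {y}")
        if y_next == y:  # fixed point reached: stop early
            break
        y = y_next
    return y

# Toy stand-in model that always returns the same text, so the loop
# stops after one refinement step.
def toy_llm(prompt):
    return "stub answer"

refined = self_thought("combine 4, 7, 8, 8 into 24", toy_llm,
                       "Distill the task.", "Refine using the template.")
```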
A key contribution is DISTIL‑THOUGHT, the mechanism for transferring templates from a large, capable model to a smaller one. Large models (e.g., GPT‑4o) produce high‑quality abstractions that capture the essential reasoning pattern of a problem. Smaller models (e.g., Qwen‑2.5B, LLaMA‑3‑70B) can then reuse these templates without needing to generate them themselves. This transfer eliminates the need for external verifiers or costly fine‑tuning for low‑resource models.
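A minimal sketch of this transfer, assuming two interchangeable completion functions (`big_llm` producing the template once, `small_llm` consuming it); the wiring and prompt wording are illustrative, not the paper's exact protocol:

```python
# Sketch of DISTIL-THOUGHT template transfer: a strong model distills
# the template once; a weaker model reuses it during refinement.
def distil_thought(x, big_llm, small_llm, abstraction_prompt, refinement_prompt):
    template = big_llm(f"{abstraction_prompt}\nProblem: {x}")  # one-time distillation
    draft = small_llm(f"Solve: {x}")
    return small_llm(
        f"{refinement_prompt}\nTemplate: {template}\nProblem: {x}\nDraft: {draft}"
    )

# Toy stand-ins: the "big" model emits a template marker, the "small"
# model reports whether it saw that marker in its prompt.
big = lambda prompt: "[TEMPLATE]"
small = lambda prompt: "guided" if "[TEMPLATE]" in prompt else "unguided"

answer = distil_thought("sort: pear apple fig", big, small, "Distill.", "Refine.")
```

The point of the toy run is that only the final refinement call sees the template marker, mirroring how the small model is guided without ever generating the abstraction itself.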
Experimental Setup
The authors evaluate SELF‑THOUGHT and DISTIL‑THOUGHT across a diverse suite of reasoning tasks:
- Game of 24 – a numeric puzzle requiring arithmetic combinations.
- Word Sorting – a lexical ordering challenge.
- Checkmate‑In‑One – a chess problem demanding precise move selection.
- AIME 2024/2025 – high‑school level competition math problems with multi‑step reasoning.
Baselines include prominent self‑correction methods such as SELF‑REFINE, SELF‑TICK, and PROGC‑O, all of which rely on output‑level critique. The models tested span from large (GPT‑4o‑Mini, GPT‑4o, o3‑Mini, DeepSeek‑R1) to medium‑size open‑source models (Qwen‑2.5B, LLaMA‑3‑70B).
Results
Across all tasks, SELF‑THOUGHT consistently outperforms the baselines. For example, on Game of 24, GPT‑4o‑Mini achieves a 126.30% relative improvement; on Word Sorting, an 81.82% gain; and on the hardest benchmark, AIME 2025, a 199.85% boost. When the distilled templates are supplied to smaller models, the gains remain substantial: Qwen‑2.5B sees an average 154.54% improvement, while LLaMA‑3‑70B improves by 121.42%. These numbers demonstrate that the abstraction template is a model‑agnostic conduit for reasoning knowledge.
Analysis of Strengths
- Deep Error Correction – By forcing the model to articulate the problem’s structure, SELF‑THOUGHT catches logical inconsistencies that surface‑level edits miss.
- Scalability Across Model Sizes – The template transfer enables small models to benefit from the reasoning patterns of large models without additional training.
- No External Verifiers Needed – The framework operates purely with prompting, avoiding the overhead of separate verification modules.
- Reusable Knowledge – Once a template is generated for a problem class, it can be reused for similar instances, reducing inference cost.
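The reuse point above can be made concrete with a per‑problem‑class cache; `distill` below is a hypothetical placeholder for the LLM abstraction call, not an interface from the paper:

```python
from functools import lru_cache

# Cache one distilled template per problem class so repeated instances
# skip the (expensive) abstraction call. `distill` is a hypothetical
# stand-in for the LLM abstraction step.
calls = []

@lru_cache(maxsize=None)
def distill(problem_class):
    calls.append(problem_class)   # track how often abstraction really runs
    return f"<template:{problem_class}>"

t1 = distill("game-of-24")
t2 = distill("game-of-24")        # cache hit: no second abstraction call
```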
Limitations
- The quality of the abstraction depends heavily on the design of the meta‑distillation prompt ℘; domain‑specific tuning may be required.
- The current work focuses on static question‑answer settings; extending to multi‑turn dialogues or interactive environments remains an open challenge.
- Experiments involve models that are still relatively large (billions of parameters and up); the efficacy for truly lightweight models (<100 M parameters) is not demonstrated.
- The approach assumes that a single structured template suffices; more complex tasks might need hierarchical or dynamic abstractions.
Future Directions
- Automated Prompt Optimization – Learning to generate optimal ℘ and ℜ prompts could make the framework more plug‑and‑play across domains.
- Hierarchical or Multi‑Level Abstractions – Introducing nested templates could capture richer problem hierarchies.
- Multimodal Extension – Applying task abstraction to inputs that include images, tables, or code snippets.
- Dynamic Stopping Criteria – Learning when the abstraction‑instantiation loop has converged to reduce unnecessary iterations.
- Evaluation on Ultra‑Lightweight Models – Testing DISTIL‑THOUGHT on models suitable for edge devices to assess real‑world deployment potential.
Conclusion
SELF‑THOUGHT reframes self‑correction as a two‑stage process: first understand the task through explicit abstraction, then solve it using that abstraction as a guide. This design addresses the core weakness of prior methods—lack of deep task comprehension—and enables a scalable knowledge transfer mechanism (DISTIL‑THOUGHT) that bridges the performance gap between large and small LLMs. Empirical results across a broad set of reasoning benchmarks confirm that the framework yields substantial accuracy gains, improved robustness, and better generalization without requiring additional training data or external verification modules. The paper opens a promising avenue for building more reliable, interpretable, and resource‑efficient self‑correcting language systems.