Strengthening In-Context Learning on Novel Reasoning Tasks with CoT-Recipe Meta-Learning

Reading time: 6 minutes

📝 Abstract

Chain-of-thought (CoT) prompting combined with few-shot in-context learning (ICL) has unlocked significant reasoning capabilities in large language models (LLMs). However, ICL with CoT examples is ineffective on novel tasks when the pre-training knowledge is insufficient. We study this problem in a controlled setting using the CoT-ICL Lab (Kothapalli et al., 2025) framework, and propose meta-training techniques to learn novel abstract reasoning tasks in-context. Although CoT examples facilitate reasoning, we noticed that their excessive inclusion during meta-training degrades performance when CoT supervision is limited. To mitigate such behavior, we propose CoT-Recipe, a formal approach to modulate the mix of CoT and non-CoT examples in meta-training sequences. We demonstrate that careful modulation via CoT-Recipe can increase the accuracy of transformers on novel tasks by up to 300% even when there are no CoT examples available in-context. We confirm the broader effectiveness of these techniques by applying them to pretrained LLMs (Qwen2.5 series) for symbolic reasoning tasks and observing gains of up to 130% in accuracy. Code is available at: https://github.com/kvignesh1420/cot-icl-lab

📄 Content

Recent advances in large language models (LLMs) have demonstrated remarkable reasoning abilities when prompted to generate step-by-step solutions to problems. A prime example is chain-of-thought (CoT) prompting (Wei et al., 2022; Kojima et al., 2022), where appending a prompt with “Let’s think step by step” can induce an LLM to generate intermediate “thought” steps (Nye et al., 2022) and tackle multi-step problems. CoT prompting, especially when combined with in-context learning (ICL) (Brown et al., 2020), has yielded impressive gains on arithmetic, commonsense, and symbolic reasoning benchmarks (Wei et al., 2022; Kojima et al., 2022), and has become a key technique for eliciting reasoning in LLMs.
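To make the two prompting styles concrete, the snippet below builds a zero-shot CoT prompt and a few-shot CoT prompt as plain strings. The question and demonstration text are illustrative placeholders, not examples from the paper; only the general prompt shapes follow Kojima et al. (2022) and Wei et al. (2022).

```python
# Illustrative sketch: two common ways to elicit chain-of-thought reasoning.
# The question and demonstration below are made-up placeholders.

question = "A farmer has 15 sheep and buys 8 more. How many sheep are there now?"

# Zero-shot CoT (Kojima et al., 2022): a trigger phrase induces thought steps.
zero_shot_prompt = f"Q: {question}\nA: Let's think step by step."

# Few-shot CoT (Wei et al., 2022): demonstrations include worked rationales.
demo = (
    "Q: Roger has 5 balls and buys 2 cans of 3 balls each. How many balls?\n"
    "A: Roger starts with 5 balls. 2 cans of 3 balls is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
)
few_shot_prompt = demo + f"Q: {question}\nA:"
```

The key difference is that the few-shot variant reveals task structure through rationales in the demonstrations, while the zero-shot variant relies entirely on the model's pre-trained ability to produce intermediate steps.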

Despite its successes, CoT in-context learning (CoT-ICL) faces several limitations. First, models often require carefully chosen exemplars (Min et al., 2022b; Zhao et al., 2021) for effective ICL. Second, few-shot CoT prompting uses handcrafted demonstrations of question-answer pairs with rationales, which can be labor-intensive to create for each task (Wei et al., 2022; Kim et al., 2023). Moreover, the benefits of CoT prompting tend to emerge with model scale: smaller models struggle to produce answers with good reasoning unless fine-tuned to do so (Li et al., 2023a; Huang et al., 2023; Ho et al., 2023; Kim et al., 2023).

These issues are exacerbated when the tasks are entirely novel and the pre-training knowledge of LLMs is insufficient to generate correct responses, for example when prompting LLMs to answer domain-specific queries whose background knowledge is not included in the pre-training data. In such scenarios, the models have to rely solely on the (possibly limited) task descriptions and the in-context examples to generate a response. CoT examples help reveal more information about the task, but their availability might be limited due to data curation constraints.

While previous works have explored meta-training approaches (Min et al., 2022a; Chen et al., 2022) with ICL as an objective, the role of CoT exemplars in the data recipes and inference prompts has been largely overlooked. By addressing this gap, our work aims to understand whether models can be meta-trained to effectively leverage the (limited) CoT examples at inference for solving novel tasks. In particular, we study this problem in a controlled setting using the CoT-ICL Lab framework (Kothapalli et al., 2025) for abstract reasoning with transformers. Although CoT exemplars can aid in learning about the task, we find that their excessive inclusion during meta-training can be detrimental to the model’s performance when such supervision is limited (during inference). We propose principled data curation recipes that modulate the mix of CoT and non-CoT examples in sequences to address this issue (Figure 1). We also create a novel symbolic reasoning dataset called CIL-LangSym and meta-train LLMs (Qwen-2.5 series) with our data recipes to show that they can reason effectively on these domain-specific queries (1) even in the absence of CoT exemplars and (2) with only limited task descriptions. In summary, our key contributions are as follows:

  1. We introduce CoT-ICL Lab-2.0, an extension of CoT-ICL Lab by Kothapalli et al. (2025) for meta-training on abstract reasoning tasks. It incorporates special tokens to isolate the ‘input’, ‘thinking’, and ‘answer’ segments of examples, and allows reasoning control in trained transformers by enabling dynamic invocation of multi-step reasoning or direct final answers as needed.

  2. We introduce CoT-Recipe, a formal approach to modulate the mix of CoT and non-CoT examples in sequences during meta-training. In essence, datasets curated with this approach allow the models to reason even in the absence of CoT exemplars.

  3. We leverage the insights from systematic experiments with CoT-ICL Lab-2.0 to improve the CoT-ICL capabilities of real-world LLMs (Qwen-2.5 series) on symbolic reasoning tasks, especially when their pre-trained knowledge is insufficient for reasoning.
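The two ideas above can be sketched together: special tokens delimit the input, thinking, and answer segments of each example, and a recipe parameter controls what fraction of examples in a meta-training sequence carry a CoT segment. The function name, token strings, and `p_cot` parameter below are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import random

# Hypothetical sketch of a CoT-Recipe-style sequence builder. The special
# tokens and the probability-based mixing rule are illustrative assumptions.
INPUT, THINK, ANSWER = "<input>", "<think>", "<answer>"

def build_sequence(examples, p_cot, rng=random):
    """Serialize ICL examples, keeping the CoT segment with probability p_cot.

    examples: list of (x, cot_steps, y) triples.
    p_cot:    fraction of examples rendered with their chain of thought;
              varying this across the corpus is the "recipe" being modulated.
    """
    parts = []
    for x, cot, y in examples:
        parts.append(f"{INPUT}{x}")
        if rng.random() < p_cot:
            parts.append(f"{THINK}{' '.join(cot)}")  # CoT example
        parts.append(f"{ANSWER}{y}")  # non-CoT examples carry only the answer
    return "".join(parts)

# A sequence mixing CoT and non-CoT renderings of two toy examples.
seq = build_sequence(
    [("2+3", ["2 plus 3 is 5"], "5"), ("4*2", ["4 times 2 is 8"], "8")],
    p_cot=0.5,
    rng=random.Random(0),
)
```

Setting `p_cot=1.0` recovers pure CoT meta-training, while `p_cot=0.0` yields answer-only sequences; the paper's finding suggests the interesting regimes lie in between, where the model learns to produce answers both with and without intermediate reasoning.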

Chain-of-Thought Prompting. CoT prompting with (Wei et al., 2022) and without (Kojima et al., 2022) ICL has been an effective strategy to improve model performance on complex reasoning tasks. However, CoT prompting’s effectiveness is strongly dependent on model scale and the quality of exemplars (Li et al., 2023a). Additionally, designing good CoT exemplars for few-shot prompting can be non-trivial (Min et al., 2022b; Zhao et al., 2021), as the exemplars need to be representative and align with the task at hand (Wang et al., 2023; Chu et al., 2024). This highlights the brittleness of current models in utilizing CoT exemplars. Beyond the basic paradigm of CoT prompting, the ‘ReAct’ framework by Yao et al. (2023) blends CoT prompting with actions that interface with an external environment. In this framework, the model is prompted in a format where it alternates between generating a thought (a reflection on the task state) and an action.

This content is AI-processed based on ArXiv data.
