Input-Time Scaling: Adding Noise and Irrelevance into Less-Is-More Drastically Improves Reasoning Performance and Efficiency
Large Language Models (LLMs) excel at reasoning, but this has traditionally required high-quality, large-scale data and extensive training. Recent works reveal a very appealing Less-Is-More phenomenon in which very small, carefully curated high-quality datasets match resource-intensive approaches. In this work, we further systematically relax their quality constraints by adding controlled noise via persona context relevance and comparing datasets of different qualities. Counterintuitively, we find that mixing relevant and irrelevant contexts consistently across training and inference stages yields optimal results – a phenomenon we term training-testing co-design. Dataset quality comparisons show that high-quality data benefits weaker models on easy questions, while low-quality data achieves higher scores on hard questions with capable models. Across our experiments, reasoning performance is linked to reasoning efficiency. We find, for the first time, that adding noisy and irrelevant contexts to queries can improve reasoning efficiency at no extra cost and without any targeted design. Building on these insights, we propose Input-Time Scaling: applying small, low-quality data to capable models with training-testing co-design. This maintains Less-Is-More while further removing labor-intensive quality curation and improving reasoning effectiveness and efficiency, making the approach more applicable and affordable. Our method achieves 76.7% pass@1 on AIME24/25 using Qwen2.5-32B-Instruct, and 90.0%/80.0% with DeepSeek-R1-Distill-Qwen-32B – state-of-the-art among Qwen2.5-32B variants. We are open-sourcing our datasets, pipelines, evaluation results, and checkpoints to facilitate reproducibility and further research.
💡 Research Summary
The paper revisits the “Less‑Is‑More” phenomenon, which shows that very small, carefully curated datasets can match the performance of large‑scale training for large language models (LLMs). While prior work still required substantial human effort to filter and curate high‑quality data, the authors ask whether data quality constraints can be relaxed even further. To explore this, they introduce controlled noise into the input‑output pairs by concatenating persona contexts of varying relevance to the original query. Four persona strategies are defined: None (no added context), Similar (a persona that is semantically related to the query), Dissimilar (a persona that is unrelated and thus noisy), and Random (a persona drawn from a random domain). The persona‑query relevance serves as a proxy for noise level, allowing systematic degradation of data quality without altering the original chain‑of‑thought (CoT) or answer.
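The four persona strategies can be pictured as a simple query-augmentation step. The sketch below is illustrative only: the persona texts, the prompt template, and the way a "similar" persona is selected are assumptions, not the authors' actual pipeline.

```python
import random

# Hypothetical persona pool; the real paper draws personas from a much
# larger collection and scores their relevance to each query.
PERSONAS = {
    "math": "a competition mathematician who enjoys olympiad problems",
    "cooking": "a pastry chef who writes detailed dessert recipes",
    "law": "a contract lawyer reviewing commercial agreements",
}

def augment_query(query: str, strategy: str, rng: random.Random) -> str:
    """Prepend a persona context to the query; the CoT and answer are untouched."""
    if strategy == "none":
        return query
    if strategy == "similar":        # persona semantically related to the query
        persona = PERSONAS["math"]   # placeholder: would be chosen by a relevance score
    elif strategy == "dissimilar":   # deliberately unrelated persona, i.e. noise
        persona = PERSONAS["cooking"]
    elif strategy == "random":       # persona drawn from a random domain
        persona = PERSONAS[rng.choice(list(PERSONAS))]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return f"Imagine you are {persona}.\n\n{query}"
```

Because only the input side changes, the same transformation can be applied identically at training and at inference time, which is what the co-design experiments below vary.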
Two benchmark datasets are used to represent opposite ends of the quality spectrum. LIMO (less than 1 K examples) is a high‑quality, heavily filtered set with curated reasoning chains. OpenThought (over 1 M examples) is a low‑quality set with minimal filtering, many missing answers, and high query diversity. Both datasets are transformed with the four persona strategies, yielding eight training configurations (high‑quality × four persona types, low‑quality × four persona types).
Experiments are conducted primarily on 32‑billion‑parameter models: Qwen2.5‑32B‑Instruct, DeepSeek‑R1‑Distill‑Qwen‑32B, and Llama‑3 variants. Training uses a modest 240 update steps, batch size 48, learning rate 5e‑6, and a cosine schedule. Evaluation uses AIME 24 and AIME 25 as the main mathematical reasoning benchmarks, supplemented by Math500 and GPQA. Each test set is also augmented with the same four persona variants, enabling a full factorial analysis of training‑testing combinations (32 evaluations per model, 256 total across all configurations). Performance is measured with pass@1 (greedy decoding for 32 B models, 4‑sample averaging for smaller models). Two aggregate metrics are reported: “avg” (mean across all four benchmarks) and “avg2” (mean across the two AIME benchmarks).
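The two aggregate metrics are straightforward means over benchmark scores. A minimal sketch, with placeholder benchmark names taken from the text and pass@1 computed as the fraction of questions answered correctly on the single greedy sample:

```python
def pass_at_1(correct_flags):
    """Fraction of questions whose first (greedy) sample is correct."""
    return sum(correct_flags) / len(correct_flags)

def aggregate(scores: dict) -> dict:
    """'avg' averages all four benchmarks; 'avg2' averages the two AIME sets."""
    all_benches = ["AIME24", "AIME25", "Math500", "GPQA"]
    aime = ["AIME24", "AIME25"]
    return {
        "avg": sum(scores[b] for b in all_benches) / len(all_benches),
        "avg2": sum(scores[b] for b in aime) / len(aime),
    }
```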
Key findings:
- Adding noise via Dissimilar or Random personas during training consistently improves performance over the baseline (None). For the high‑quality LIMO set, noise‑augmented training yields an average gain of +8 % (avg) and +5 % (avg2) compared to the baseline. For the low‑quality OpenThought set, the gains are even larger (+18 % avg, +35 % avg2).
- The “training‑testing co‑design” effect emerges: applying any persona strategy (S, D, or R) during both training and inference yields the highest scores. Pairings that use personas at both stages (e.g., S‑D, R‑R) outperform mismatched ones (e.g., training with personas but testing without).
- Data quality interacts with model capacity. High‑quality data benefits smaller or weaker models on easier questions, whereas low‑quality, larger‑scale data benefits strong models on harder problems. This overturns the conventional belief that higher quality is always better.
- Token‑level analysis shows that persona contexts increase the number of “thinking tokens” (the model spends more computation before generating the final answer) while reducing the length of the final answer, leading to higher reasoning efficiency. In other words, the added noise acts as a catalyst that forces the model to engage in deeper internal reasoning.
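The token-level analysis above can be approximated by splitting each completion at a thinking delimiter. The sketch below assumes an R1-style `</think>` marker and uses whitespace word counts as a crude stand-in for tokens; both are illustrative assumptions, not the paper's exact measurement.

```python
def split_thinking(completion: str, delim: str = "</think>"):
    """Return (thinking_length, answer_length) in whitespace-separated words.

    Assumes the model emits its internal reasoning before `delim` and the
    final answer after it; completions without the delimiter are treated
    as pure answer.
    """
    if delim in completion:
        thinking, answer = completion.split(delim, 1)
    else:
        thinking, answer = "", completion
    return len(thinking.split()), len(answer.split())
```

Under this proxy, the paper's finding corresponds to persona-augmented inputs raising the first count (more thinking) while lowering the second (shorter final answers).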
Based on these observations, the authors propose “Input‑Time Scaling”: (a) use a small amount of low‑quality data, (b) apply persona contexts consistently during both training and inference, and (c) leverage capable models that can extract signal from noisy inputs. This approach eliminates the need for expensive data curation while preserving or even improving reasoning performance and efficiency.
Empirically, Input‑Time Scaling achieves state‑of‑the‑art results on the AIME benchmarks: Qwen2.5‑32B‑Instruct reaches 76.7 % pass@1 on AIME 24/25 using only 1 K low‑quality examples, and DeepSeek‑R1‑Distill‑Qwen‑32B attains 90 % on AIME 24 and 80 % on AIME 25, surpassing all other Qwen2.5‑32B variants. These results match or exceed the performance of models trained on ten times more data and with reinforcement learning, demonstrating the practical impact of the method.
All datasets, data‑processing pipelines, evaluation scripts, and model checkpoints are released publicly, facilitating reproducibility and encouraging further research into low‑quality data utilization, noise‑aware training, and efficient reasoning in LLMs.