Prompt Repetition Improves Non-Reasoning LLMs
When not using reasoning, repeating the input prompt improves performance for popular models (Gemini, GPT, Claude, and Deepseek) without increasing the number of generated tokens or latency.
Research Summary
The paper "Prompt Repetition Improves Non-Reasoning LLMs" investigates a surprisingly simple yet effective prompting technique: duplicating the entire user query before the model generates an answer. The authors argue that causal language models, which can only attend to previous tokens, treat the order of tokens in a prompt as a critical factor. In a single-shot prompt, the model may see the context before the question (or vice versa), leading to variable performance across "question-first" and "options-first" formats. By feeding the prompt twice, the model can attend to every token of the first copy while reading the second, so no part of the query is out of causal reach when the answer is generated.
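The preprocessing step itself is trivial. A minimal sketch (the helper name, separator, and example prompt are illustrative, not from the paper):

```python
def repeat_prompt(prompt: str, n: int = 2, separator: str = "\n\n") -> str:
    """Duplicate the full user prompt n times before sending it to the model.

    Only the input grows: the model generates no extra tokens, but while
    reading the second copy a causal LM can attend to every token of the
    first copy, including options that originally came before the question.
    """
    return separator.join([prompt] * n)


# Example: an options-first multiple-choice prompt.
prompt = (
    "Options: (A) 2 (B) 4 (C) 8\n"
    "Question: What is 2 + 2? Answer with the letter only."
)
doubled = repeat_prompt(prompt)  # vanilla x2 repetition
print(doubled)
```

The repeated string is then sent to the model exactly as a normal prompt would be; no change to decoding or the API call is required.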
The experimental protocol spans seven prominent LLMs (Google Gemini 2.0 Flash/Flash-Lite, OpenAI GPT-4o-mini/GPT-4o, Anthropic Claude 3 Haiku/3.7 Sonnet, DeepSeek V3) accessed via official APIs in early 2025. The models are evaluated on seven benchmarks: ARC-Challenge, OpenBookQA, GSM8K, MMLU-Pro, MATH, plus two custom tasks designed to stress ordering (NameIndex and MiddleMatch). For each benchmark the authors test both "question-first" and "options-first" variants, yielding 70 model-benchmark combinations.
Key findings under a "no-reasoning" setting (i.e., the model is instructed not to perform chain-of-thought or step-by-step reasoning) are:
- Prompt repetition yields statistically significant accuracy gains in 47 out of 70 cases (McNemar test, p < 0.1), with zero losses.
- Gains are especially pronounced for the custom tasks: Gemini 2.0 Flash-Lite's accuracy on NameIndex jumps from 21.33% to 97.33%; similar leaps are observed across all models for MiddleMatch.
- The improvement is larger when the prompt is presented in the "options-first" order, confirming that the second copy compensates for the model's inability to see future tokens.
- Variants such as "Verbose" (adding a "Let me repeat that:" preamble) and "×3" (three repetitions) perform on par with or better than the vanilla double repetition, with the ×3 version delivering the strongest gains on the custom tasks.
- A control experiment that pads the input with periods to match the token length of the repeated prompt (without actual repetition) shows no performance boost, reinforcing that the benefit stems from repeated semantic content, not merely longer inputs.
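The per-case significance claims rest on McNemar's test over paired per-question outcomes. A small stdlib sketch of the exact (binomial) form of that test; the function name and toy accuracy data are invented for illustration:

```python
from math import comb


def mcnemar_exact(baseline: list[bool], repeated: list[bool]) -> float:
    """Two-sided exact McNemar test on paired per-question correctness.

    Only discordant pairs matter: b = baseline right / repeated wrong,
    c = baseline wrong / repeated right. Under the null hypothesis of no
    difference, the discordant outcomes follow Binomial(b + c, 0.5).
    """
    b = sum(1 for x, y in zip(baseline, repeated) if x and not y)
    c = sum(1 for x, y in zip(baseline, repeated) if not x and y)
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence either way
    k = min(b, c)
    p = 2 * sum(comb(n, i) * 0.5**n for i in range(k + 1))
    return min(p, 1.0)


# Toy data: repetition flips 8 questions to correct and loses 1.
base = [False] * 8 + [True] * 1 + [True] * 20
rep  = [True]  * 8 + [False] * 1 + [True] * 20
print(mcnemar_exact(base, rep))  # ~0.039, below the paper's 0.1 threshold
```

Because the test conditions on the same questions being answered by both prompt formats, it is more sensitive than comparing raw accuracies of the two runs.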
Efficiency analysis shows that prompt repetition does not increase the number of generated tokens or the end-to-end latency for non-reasoning runs. The extra tokens are processed during the pre-fill stage, which is parallelizable and does not affect the generation phase. Only the Anthropic models exhibit modest latency increases for very long inputs, likely due to KV-cache handling overhead.
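That accounting is easy to verify for any model client. A hedged sketch using a hypothetical `generate` callable (any prompt-in, completion-out function) and a crude whitespace token proxy, neither of which comes from the paper:

```python
import time


def compare_latency(generate, prompt: str, reps: int = 2) -> dict:
    """Run a plain vs. repeated prompt through `generate` and record
    input size, output size, and wall-clock latency.

    For non-reasoning runs the completion length should be unchanged;
    only the (parallelizable) prefill sees more input tokens.
    """
    results = {}
    for name, p in [("plain", prompt),
                    ("repeated", "\n\n".join([prompt] * reps))]:
        t0 = time.perf_counter()
        out = generate(p)
        results[name] = {
            "input_tokens": len(p.split()),   # crude whitespace proxy
            "output_tokens": len(out.split()),
            "latency_s": time.perf_counter() - t0,
        }
    return results


# Stub standing in for an API client: answers with a single token.
stats = compare_latency(lambda p: "(B)",
                        "Options: (A) 2 (B) 4\nQuestion: 2 + 2?")
print(stats)
```

With a real API client in place of the stub, the `output_tokens` counts for the two runs should match, and any latency gap isolates the prefill cost of the extra input.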
The authors position prompt repetition as a low-cost alternative to reasoning-oriented prompting (e.g., Chain-of-Thought, "Think step-by-step"), which typically inflates output length and latency. When reasoning is explicitly enabled, repetition is neutral to slightly positive (5 wins, 1 loss, 22 ties across 28 tests), suggesting that the models already repeat parts of the prompt during reasoning.
Limitations acknowledged include reliance on black-box API calls (preventing direct inspection of internal attention patterns), lack of systematic exploration of optimal repetition count beyond 2-3, and uncertainty about scalability to very long prompts or multimodal inputs. The paper also does not quantify memory overhead in the KV-cache, nor does it examine potential trade-offs in multi-turn dialogues.
Future work outlined by the authors encompasses:
- fine-tuning models on repeated-prompt data
- training reasoning models with repetition to encourage internal cache reuse
- dynamic repetition of generated tokens during inference
- caching only the second copy to achieve zero latency impact
- selective repetition of salient prompt segments
- reordering prompts via a smaller auxiliary model
- extending the technique to image or audio modalities
- deeper analysis of attention patterns induced by repetition
- combining repetition with selective attention, speculative decoding, or prefix-LM methods
- systematic studies of when and why repetition helps
In summary, the paper demonstrates that a trivial preprocessing step, duplicating the user prompt, can dramatically boost accuracy for a wide range of LLMs on non-reasoning tasks without any penalty in latency or output length. This makes prompt repetition a practical, drop-in improvement for production systems that rely on fast, concise responses, and it opens several promising research avenues for integrating repetition more deeply into model training and inference pipelines.