Input Reduction Enhanced LLM-based Program Repair

Notice: This research summary and analysis were automatically generated using AI. For authoritative details, please refer to the original arXiv source.

Large Language Models (LLMs) have shown great potential in Automated Program Repair (APR). Test inputs, being crucial for reasoning about the root cause of failures, are always included in the prompt for LLM-based APR. Unfortunately, LLMs struggle to retain key information in long prompts: when the test inputs in the prompt are extensive, they may trigger the “lost-in-the-middle” issue, compromising repair performance. We propose ReduceFix, which prompts an LLM to generate a reducer that minimizes failure-inducing test inputs without human effort, and then feeds the reduced failure-inducing inputs to guide patch generation. For targeted evaluation, we constructed LFTBench, the first long-input APR benchmark, with 200 real bugs from 20 programming tasks, each paired with a failure-inducing input whose median size is 1 MB. On this benchmark, ReduceFix shrinks inputs by 89.1% on average and improves overall pass@10 by up to 53.8% relative to a prompt that includes the original test, and by 17.6% compared with omitting the test entirely. Adding the same reduction step to ChatRepair and CREF increases their fix rates by 21.3% and 2.6%, respectively, without other changes. The gains hold against a ddmin-only reducer-template baseline and transfer to repository-level OSS-Fuzz cases. Ablation studies further highlight the impact of input length and compressed failure information on repair success. These results underscore that automatically reducing failing inputs is a practical and powerful complement to LLM-based APR, significantly improving its scalability and effectiveness.


💡 Research Summary

The paper addresses a critical bottleneck in large‑language‑model (LLM) based automated program repair (APR): when failure‑inducing test inputs are large, they consume most of the model’s context window and cause the “lost‑in‑the‑middle” effect, whereby the model overlooks the crucial part of the input that actually reveals the bug. To mitigate this, the authors propose ReduceFix, a three‑stage framework that automatically generates a task‑specific input reducer, applies it to shrink the failing test, and then feeds the reduced test together with the buggy code and problem description to the LLM for patch generation.

In the first stage, ReduceFix constructs a one-shot prompt that contains the full problem statement, a concrete example of a reducer (derived from the classic ddmin algorithm), and a few I/O pairs from the target task. This prompt is sent to a code-oriented LLM (e.g., Qwen2.5-Plus), which returns a Python script that can reduce inputs of the given format while preserving the failure-inducing behavior. The second stage runs the generated reducer under a strict time budget (60 seconds): the reducer iteratively deletes chunks of the original input, checking after each deletion whether the reference solution and the buggy submission still diverge. The process stops when no further reduction is possible, yielding a minimal failing input i*. Across the benchmark, the reducers achieve an average compression ratio of 89.1%, and 95% of them succeed.
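The chunk-deletion loop described above can be sketched in the style of the classic ddmin algorithm. This is a minimal illustration, not the paper's artifact: the actual reducers are LLM-generated and task-specific, and `still_fails` stands in for the oracle that checks whether the buggy submission still diverges from the reference solution on a candidate input.

```python
def ddmin_lines(lines, still_fails):
    """Shrink a list of input lines while `still_fails(candidate)` holds.

    Simplified ddmin sketch: repeatedly try deleting one chunk at a time,
    refining the granularity when no chunk can be removed.
    """
    n = 2  # number of chunks to split the input into
    while len(lines) >= 2:
        chunk = max(1, len(lines) // n)
        reduced = False
        for start in range(0, len(lines), chunk):
            # Candidate input with one chunk deleted.
            candidate = lines[:start] + lines[start + chunk:]
            if candidate and still_fails(candidate):
                lines = candidate          # keep the smaller failing input
                n = max(n - 1, 2)          # coarsen slightly after success
                reduced = True
                break
        if not reduced:
            if chunk == 1:
                break                      # single-line granularity exhausted
            n = min(n * 2, len(lines))     # refine granularity and retry
    return lines
```

In the real pipeline the oracle would run both programs on the candidate input and compare their outputs (or crash status), and the whole loop would be cut off by the 60-second budget.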

The third stage concatenates the problem description, the buggy code, and the reduced input i* into a repair prompt. If i* is still too large, ReduceFix truncates it, keeping the first and last halves of the retained lines and inserting an ellipsis token to mark the omitted middle. The LLM then samples up to ten candidate patches, each of which is compiled and executed against the full hidden test suite; the first patch that passes all tests is returned as the repaired program.
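The fallback truncation can be sketched as follows; the function name and line-based budget are our illustration of the head-plus-tail scheme, not the paper's exact implementation.

```python
def truncate_middle(text, max_lines, marker="..."):
    """Keep the first and last halves of a line budget, marking the omission."""
    lines = text.splitlines()
    if len(lines) <= max_lines:
        return text                         # already within budget
    head = lines[: max_lines // 2]          # first half of the budget
    tail = lines[len(lines) - (max_lines - max_lines // 2):]  # last half
    return "\n".join(head + [marker] + tail)
```

For example, truncating a six-line input to four lines keeps the first two and last two lines with the ellipsis token in between.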

For evaluation, the authors built LFTBench, the first APR benchmark focused on long inputs, containing 200 real bugs from 20 AtCoder tasks (median failing-input size ≈ 1 MB). ReduceFix, when paired with four different LLMs (including Qwen2.5-Plus, GLM-4-9B-chat, and Qwen2.5-Coder-7B-instruct), improves overall pass@10 by up to 53.8% compared with prompts that embed the full test, and by 17.6% compared with omitting the test entirely. The approach also generalizes: adding the same reduction step to the existing APR tools ChatRepair and CREF boosts their pass@10 by 21.3% and 2.6%, respectively, without any other modifications.
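For context, pass@k is commonly computed with the unbiased estimator popularized by the Codex paper; whether the paper uses exactly this estimator is an assumption here. Given n sampled patches of which c pass, it estimates the probability that at least one of k samples passes:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: P(at least one of k samples passes),
    given n total samples of which c passed."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For instance, with n = 2 samples of which c = 1 passes, pass@1 is 0.5.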

Further experiments on 12 OSS-Fuzz crash cases from five projects show that reduced inputs raise the micro-average pass@10 from 16.7% (original test) to 41.7% under Docker-grounded validation. Ablation studies reveal that the primary driver of improvement is the reduction in input length; providing compressed failure evidence alone yields the largest gains, while merely appending additional evidence offers limited benefit.

In summary, ReduceFix shows that automatically generating and applying input reducers is a practical, hands‑free complement to LLM‑based APR. By shrinking massive failing inputs while preserving their discriminative power, it alleviates context‑window limitations, enhances the model’s focus on the bug‑relevant information, and substantially improves the scalability and effectiveness of LLM‑driven program repair.

