LLMs and Fuzzing in Tandem: A New Approach to Automatically Generating Weakest Preconditions

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

The weakest precondition (WP) of a program describes the largest set of initial states from which all terminating executions of the program satisfy a given postcondition. Generating WPs is an important task with practical applications ranging from verification to run-time error checking. This paper proposes combining Large Language Models (LLMs) and fuzz testing to generate WPs. In pursuit of this goal, we introduce \emph{Fuzzing Guidance} (FG), which directs LLMs towards correct WPs using program execution feedback. FG utilises fuzz testing to approximately check the validity and weakness of candidate WPs; this information is then fed back to the LLM as a means of context refinement. We demonstrate the effectiveness of our approach on a comprehensive benchmark set of deterministic array programs in Java. Our experiments indicate that LLMs are capable of producing viable candidate WPs, and that this ability can be practically enhanced through FG.
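As a concrete illustration of the WP concept (this example is not from the paper; the program, postcondition, and names are hypothetical), consider a method that increments its argument, with the postcondition that the result is positive. The weakest precondition is then `x > -1`, and its validity can be spot-checked over a sampled input range:

```java
public class WpExample {
    // Hypothetical program fragment: increments x by 1.
    static int foo(int x) {
        return x + 1;
    }

    // Postcondition Q: the result is positive.
    static boolean post(int result) {
        return result > 0;
    }

    // Weakest precondition wp(foo, Q): x > -1, i.e. x >= 0.
    // It is "weakest" because every input it excludes (x <= -1)
    // genuinely violates Q: x + 1 <= 0.
    static boolean wp(int x) {
        return x > -1;
    }

    public static void main(String[] args) {
        for (int x = -1000; x <= 1000; x++) {
            // Validity: every input satisfying the WP must satisfy Q.
            if (wp(x) && !post(foo(x))) {
                throw new AssertionError("WP is not valid for x = " + x);
            }
            // Weakness: every input violating the WP must violate Q.
            if (!wp(x) && post(foo(x))) {
                throw new AssertionError("WP is too strong for x = " + x);
            }
        }
        System.out.println("WP validated on the sampled range");
    }
}
```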


💡 Research Summary

The paper tackles the long‑standing challenge of automatically generating weakest preconditions (WPs) for programs, a cornerstone of formal verification that defines the largest set of initial states guaranteeing a postcondition. Traditional WP derivation relies on heavy mathematical machinery—Hoare logic, loop invariants, and theorem provers—making it labor‑intensive and often undecidable for programs with loops. To overcome these limitations, the authors propose a novel hybrid framework called Fuzzing Guidance (FG), which tightly couples Large Language Models (LLMs) with execution‑based fuzz testing.

FG operates in two distinct fuzzing phases. The first, validity‑fuzzing, attempts to find inputs that satisfy a candidate WP but violate the program’s postcondition. If such a counterexample is discovered, the WP is deemed invalid and fed back to the LLM for repair. The second, weakness‑fuzzing, assumes a WP is already valid and searches for inputs that violate the WP while still satisfying the postcondition. Finding such inputs indicates the WP is not the weakest possible; the LLM is again prompted to broaden the precondition. These phases are orchestrated in iterative FG‑cycles: multiple rounds of validity‑fuzzing and repair followed by a single round of weakness‑fuzzing and repair. The feedback is embedded in specially crafted prompts (“repair‑validity” and “repair‑weakness”), driving the LLM to progressively refine its answer.
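The two fuzzing phases can be sketched as random-input searches for the two kinds of counterexample. In this minimal sketch (the program, the candidate WP, and all method names are hypothetical, not the paper's implementation), the candidate WP `x > 5` is valid but too strong, so validity-fuzzing finds nothing while weakness-fuzzing quickly finds an input such as `x = 0`:

```java
import java.util.OptionalInt;
import java.util.Random;

public class FgSketch {
    // Hypothetical program under analysis and its postcondition.
    static int foo(int x) { return x + 1; }
    static boolean post(int r) { return r > 0; }

    // Candidate WP proposed by the LLM: x > 5. Valid, but too strong.
    static boolean candidateWp(int x) { return x > 5; }

    // Validity-fuzzing: search for an input that satisfies the candidate
    // WP yet produces a result violating the postcondition.
    static OptionalInt validityFuzz(Random rng, int trials) {
        for (int i = 0; i < trials; i++) {
            int x = rng.nextInt(2001) - 1000; // sample from [-1000, 1000]
            if (candidateWp(x) && !post(foo(x))) return OptionalInt.of(x);
        }
        return OptionalInt.empty();
    }

    // Weakness-fuzzing: search for an input outside the candidate WP
    // whose execution nevertheless satisfies the postcondition.
    static OptionalInt weaknessFuzz(Random rng, int trials) {
        for (int i = 0; i < trials; i++) {
            int x = rng.nextInt(2001) - 1000;
            if (!candidateWp(x) && post(foo(x))) return OptionalInt.of(x);
        }
        return OptionalInt.empty();
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        // No validity counterexample exists: x > 5 implies x + 1 > 0.
        System.out.println("validity cex found: "
                + validityFuzz(rng, 10_000).isPresent());
        // A weakness counterexample (any x in [0, 5]) does exist, so the
        // LLM would be re-prompted with a "repair-weakness" message.
        System.out.println("weakness cex found: "
                + weaknessFuzz(rng, 10_000).isPresent());
    }
}
```

In the full FG loop, whichever counterexample is found would be embedded in the corresponding repair prompt and the LLM asked for a revised candidate.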

The experimental evaluation focuses on a benchmark suite of deterministic Java array programs (e.g., copying, sorting, searching). Each program follows a uniform pattern: a static method `foo(int

