Quokka: Accelerating Program Verification with LLMs via Invariant Synthesis


Program verification relies on loop invariants, yet automatically discovering strong invariants remains a long-standing challenge. We investigate whether large language models (LLMs) can accelerate program verification by generating useful loop invariants. We introduce Quokka, a simple and effective framework for LLM-based invariant synthesis that provides sound evaluation while achieving state-of-the-art speedup results. Unlike prior work that designs complex, highly customized algorithms, Quokka employs a simple and principled verification procedure. We construct a benchmark of 866 instances and evaluate 9 state-of-the-art LLMs across multiple model families. Our results show that Quokka consistently outperforms all prior LLM-based verifiers, achieving speedups of at least 1.2x on 81 instances compared to 39 instances for the previous best approach. We further demonstrate that supervised fine-tuning and Best-of-N sampling can yield measurable improvements in accelerating verification.


💡 Research Summary

Program verification fundamentally relies on loop invariants: logical conditions that must hold before and after each iteration of a loop. While many traditional techniques (constraint solving, abstract interpretation, Craig interpolation, syntax‑guided synthesis) have been proposed over the past four decades, automatically discovering strong invariants—those that are not only correct but also significantly reduce verification effort—remains an open challenge. Recent work has explored the use of large language models (LLMs) for invariant generation, but two critical shortcomings have limited its practical impact. First, prior evaluations (e.g., Pei et al., 2023) used Daikon, a dynamic analysis tool, to judge correctness, which is unsound because Daikon‑derived invariants may not hold for all executions and may reject semantically equivalent predicates. Second, these studies measured only whether an LLM‑generated invariant is correct, not whether it actually accelerates verification.

Quokka addresses both issues by introducing a principled, sound decision procedure that directly integrates LLM‑generated invariants with a formal verification engine (Ultimate Automizer). The core idea is to treat each candidate invariant q = ⟨ψ, ℓ⟩ as a property and issue two verifier queries:

  1. da = V(P, ∅, q) – checks whether q is a valid invariant for the original program P without any assumptions.
  2. db = V(P, {q}, p★) – checks whether the target property p★ (the final assertion) holds when q is assumed true at its location ℓ.

If da and db both return T (proved), the procedure concludes that the invariant is both correct and useful, yielding a prove judgment. If db returns F (refuted) under the assumption, the system can immediately declare the target property false on the original program (short‑circuit refutation): an assume statement only prunes traces, so any violating trace found in the assumed program is also a trace of the original. Any other combination results in an inconclusive outcome, preserving soundness while acknowledging verifier incompleteness or timeouts.
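As a sketch, the two-query procedure above can be expressed in a few lines of Python. The `verify` callback stands in for a call to the underlying verifier; all names here are illustrative, not the paper's actual API:

```python
from enum import Enum

class Result(Enum):
    T = "proved"
    F = "refuted"
    U = "unknown"

def decide(verify, program, q, p_star):
    """Sketch of the two-query decision procedure.

    verify(program, assumptions, prop) stands in for the underlying
    verifier (Ultimate Automizer in the paper) and returns Result.T,
    Result.F, or Result.U. The paper runs the two queries in parallel;
    they are sequential here for clarity.
    """
    da = verify(program, set(), q)     # is q a valid invariant of P?
    db = verify(program, {q}, p_star)  # does p_star hold assuming q?

    if db is Result.F:
        # assume statements only prune traces, so a counterexample
        # found under the assumption is a trace of the original program
        return Result.F                # DEC-FALSE: short-circuit refutation
    if da is Result.T and db is Result.T:
        return Result.T                # DEC-PROP: q is correct and useful
    return Result.U                    # DEC-U: inconclusive
```

Note that the refutation branch never consults da: soundness of the short circuit depends only on assumptions restricting the trace set.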

The authors formalize this workflow as a small proof calculus with three inference rules (DEC‑FALSE, DEC‑PROP, DEC‑U) and prove a decision soundness theorem: whenever the calculus derives T (true) or F (false), the corresponding statement about the original program is guaranteed to be correct. This guarantees that Quokka never reports a speedup based on a spurious invariant.

Implementation details include a syntactic filter that rejects predicates containing side effects (e.g., assignments) and parallel execution of the two verifier queries to minimize latency. The system is built on top of Ultimate Automizer, a state‑of‑the‑art SMT‑based verifier.
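The summary does not spell out how the syntactic filter works; a minimal, purely illustrative version for C-like candidate predicates might reject assignment and increment/decrement operators with a regular expression:

```python
import re

# Hypothetical side-effect filter for C-like candidate predicates:
# reject assignments and increment/decrement, which would make the
# "invariant" alter program state when evaluated. (Illustrative only;
# not the paper's implementation.)
SIDE_EFFECT = re.compile(
    r"(\+\+|--|"                 # increment / decrement
    r"[^=!<>+\-*/%&|^]=(?!=)|"   # plain assignment, but not ==, <=, >=, !=
    r"[+\-*/%&|^]=)"             # compound assignment like +=, |=
)

def is_pure_predicate(pred: str) -> bool:
    """Return True if pred contains no obvious side-effecting syntax."""
    return SIDE_EFFECT.search(pred) is None
```

A production filter would parse the predicate with the verifier's front end rather than pattern-match, but the intent is the same: only pure boolean expressions may be assumed or checked.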

For evaluation, the authors constructed a benchmark of 866 verification instances drawn from the latest edition of SV‑COMP—the largest benchmark used in LLM‑based verification work to date. They evaluated nine contemporary LLMs spanning OpenAI’s GPT series, Anthropic’s Claude models, and the Qwen and LLaMA families. The primary metric is speedup: the factor by which verification time is reduced compared to a baseline run without any LLM assistance. Quokka achieved at least a 1.2× speedup on 81 instances, compared to 39 instances for the previous best LLM‑based verifier (LEMUR). For more substantial gains (≥2.0×), Quokka succeeded on 51 instances versus 22 for LEMUR. GPT‑5.2 consistently delivered the strongest performance across all models.
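The speedup metric itself is simple arithmetic: for each instance, divide the baseline verification time by the LLM‑assisted time and count how many instances clear each threshold. A toy illustration (the data below is invented, not from the paper):

```python
def speedup_counts(baseline, assisted, thresholds=(1.2, 2.0)):
    """Count instances whose verification time improves by at least
    each threshold factor. baseline and assisted map instance names
    to verification times in seconds."""
    return {
        t: sum(1 for k in baseline
               if k in assisted and baseline[k] / assisted[k] >= t)
        for t in thresholds
    }

# Invented timings for three instances:
base = {"a": 120.0, "b": 60.0, "c": 30.0}
fast = {"a": 50.0, "b": 55.0, "c": 14.0}
print(speedup_counts(base, fast))  # {1.2: 2, 2.0: 2}
```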

Beyond raw model comparison, the paper explores two techniques to improve invariant generation:

  • Supervised fine‑tuning – The authors generated a synthetic training set of 3,589 programs using GPT‑4o with a carefully crafted prompt. Each candidate program was filtered through the verifier‑based pipeline to ensure only high‑quality invariants were retained as labels. Fine‑tuning on this data yielded measurable improvements in both correctness and strength of generated invariants.
  • Best‑of‑N sampling – By sampling N candidate invariants per query and selecting the best according to the verifier’s outcome, they observed a 22% increase in the number of instances achieving ≥1.2× speedup (99 instances for N = 8 versus 81 for a single sample).
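The Best‑of‑N loop can be sketched as follows; the helper names `sample` and `verify_candidate` are hypothetical stand‑ins for the LLM query and the verifier‑based check described above:

```python
def best_of_n(sample, verify_candidate, n=8):
    """Best-of-N sampling sketch: draw up to n candidate invariants
    and stop at the first whose verifier outcome is conclusive.

    sample() returns one LLM-generated candidate invariant;
    verify_candidate(q) returns "prove", "refute", or "unknown".
    """
    fallback = None
    for _ in range(n):
        q = sample()
        outcome = verify_candidate(q)
        if outcome in ("prove", "refute"):
            return q, outcome             # conclusive: stop early
        fallback = fallback or (q, outcome)  # remember one candidate
    return fallback                        # all n samples inconclusive
```

Because each candidate is validated by the sound decision procedure, drawing more samples can only add conclusive outcomes, never spurious ones.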

In summary, Quokka demonstrates that a simple, sound verification‑centric procedure can harness LLMs to produce strong loop invariants that materially accelerate program verification. It outperforms prior LLM‑based systems without resorting to complex post‑processing pipelines, and it shows that modest engineering efforts such as fine‑tuning and Best‑of‑N sampling can further boost performance. The work establishes a clear, reproducible methodology for future research at the intersection of large language models and formal software verification.

