Title: VeruSAGE: A Study of Agent-Based Verification for Rust Systems
ArXiv ID: 2512.18436
Date: 2025-12-20
Authors: Chenyuan Yang, Natalie Neamtu, Chris Hawblitzel, Jacob R. Lorch, Shan Lu
📝 Abstract
Large language models (LLMs) have shown impressive capability to understand and develop code. However, their capability to rigorously reason about and prove code correctness remains in question. This paper offers a comprehensive study of LLMs' capability to develop correctness proofs for system software written in Rust. We curate a new system-verification benchmark suite, VeruSAGE-Bench, which consists of 849 proof tasks extracted from eight open-source Verus-verified Rust systems. Furthermore, we design different agent systems to match the strengths and weaknesses of different LLMs (o4-mini, GPT-5, Sonnet 4, and Sonnet 4.5). Our study shows that different tools and agent settings are needed to stimulate the system-verification capability of different types of LLMs. The best LLM-agent combination in our study completes over 80% of system-verification tasks in VeruSAGE-Bench. It also completes over 90% of a set of system proof tasks not part of VeruSAGE-Bench because they had not yet been finished by human experts. This result shows the great potential for LLM-assisted development of verified system software.
📄 Full Content
VeruSAGE: A Study of Agent-Based Verification for Rust Systems
Chenyuan Yang⋆
Natalie Neamtu♦
Chris Hawblitzel■
Jacob R. Lorch■
Shan Lu▲■
⋆University of Illinois Urbana-Champaign
♦Carnegie Mellon University
■Microsoft Research
▲University of Chicago
Abstract
Large language models (LLMs) have shown impressive capability to understand and develop code. However, their capability to rigorously reason about and prove code correctness remains in question. This paper offers a comprehensive study of LLMs' capability to develop correctness proofs for system software written in Rust. We curate a new system-verification benchmark suite, VeruSAGE-Bench, which consists of 849 proof tasks extracted from eight open-source Verus-verified Rust systems. Furthermore, we design different agent systems to match the strengths and weaknesses of different LLMs (o4-mini, GPT-5, Sonnet 4, and Sonnet 4.5). Our study shows that different tools and agent settings are needed to stimulate the system-verification capability of different types of LLMs. The best LLM-agent combination in our study completes over 80% of system-verification tasks in VeruSAGE-Bench. It also completes over 90% of a set of system proof tasks not part of VeruSAGE-Bench because they had not yet been finished by human experts. This result shows the great potential for LLM-assisted development of verified system software.
1 Introduction
In the past few years, two contrasting code and system development methodologies have progressed. On the one hand, AI coding agents [37, 43] are becoming popular. They are very good at quickly producing a large amount of code with little human support. Their weakness is the lack of a correctness guarantee, which is particularly problematic for reliability-critical software, including most system software. On the other hand, system verification techniques are getting mature after decades of research. Recent work [6, 10, 17, 19, 46] has demonstrated the feasibility for human experts to develop large-scale system software in a popular system programming language (i.e., Rust), together with formal correctness specifications and proofs that can be mathematically verified by tools such as Verus [17, 18]. However, the speed of such code and proof development and the accessibility of such verification techniques to general developers remain questionable. We naturally wonder whether these two methodologies can complement each other. Specifically, can large language models (LLMs) help write correctness proofs for system software?
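To make concrete what such verified Rust development looks like, the following minimal sketch (our own toy illustration, not code from any of the cited projects) shows the Verus style: a `spec` function captures the mathematical intent, and the executable function carries an `ensures` clause that Verus proves holds for every input at compile time.

```rust
// Toy illustration of Verus-verified Rust (not from any cited project).
use vstd::prelude::*;

verus! {

// Specification: a pure mathematical definition, visible only to the verifier.
spec fn max_spec(a: int, b: int) -> int {
    if a >= b { a } else { b }
}

// Executable Rust with a machine-checked contract: Verus statically proves
// that the returned value matches the specification for all inputs.
fn max_u64(a: u64, b: u64) -> (result: u64)
    ensures result as int == max_spec(a as int, b as int),
{
    if a >= b { a } else { b }
}

} // verus!
```

Verus also supports proof functions, loop invariants, and ghost code; annotations of this kind are what the proof tasks discussed below ask an LLM to supply.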
Several research projects have explored using LLMs for proof writing, but none of them offered an answer to our question. Some of them focused on verification that requires special proof-oriented languages, instead of general programming languages [7]; the others focused on small programming problems like binary search [1, 9, 23, 40, 48]. For example, the AutoVerus project [40] designed a benchmark suite of 150 small Rust programs with Verus specifications, called VerusBench. It also designed an agent system that empowers GPT-4o [14] to prove 90% of the tasks in VerusBench. Most recently, RagVerus [48] tried AutoVerus plus Retrieval Augmented Generation (RAG), i.e., providing LLMs with example proofs from the same project. They did this for four system projects: VeriSMo [49], Vest [6], IronKV [17], and a small part of Anvil [32]. Unfortunately, they report a depressing result: only 20% or less of the proof tasks in VeriSMo, IronKV, and Vest could be proved by GPT-4o.
These recent research efforts bring up several natural questions: How do real-world system proof tasks fundamentally differ from small programming problems? Is the poor performance due to limitations in the agent architecture (e.g., AutoVerus + RAG) or the underlying model capabilities (e.g., GPT-4o)? And, ultimately, can state-of-the-art LLMs, paired with specialized agentic designs, effectively tackle the complexity of real-world system verification?
To answer these questions, we curate VeruSAGE-Bench, a comprehensive Verus system-verification benchmark suite. It consists of 849 proof tasks extracted from eight open-source Verus-verified system projects authored by different research groups. These projects cover various domains, including operating systems, memory allocators, storage systems, distributed systems, etc. Every task corresponds to one proof function or executable Rust function in the original project, with all the dependencies extracted into a stand-alone Rust file that can be individually compiled and verified. Each task file contains no proof bodies from the original project (i.e., no proof example for LLMs), and hence brings us closer to testing LLMs' real system-proof-writing capabilities.
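As a hypothetical illustration of this task format, a stand-alone task file might look roughly like the sketch below: spec-level dependencies are carried over, the target proof function keeps its signature and contract, and its proof body is stripped for the LLM to fill in. The names and the lemma itself are our own invention, not an actual VeruSAGE-Bench task.

```rust
// Hypothetical sketch of a stand-alone task file (names are illustrative,
// not taken from VeruSAGE-Bench or any of the eight projects).
use vstd::prelude::*;

verus! {

// Dependency carried over from the original project: a spec-level definition.
spec fn sum_to(n: nat) -> nat
    decreases n
{
    if n == 0 { 0 } else { n + sum_to((n - 1) as nat) }
}

// The proof task: signature, contract, and decreases clause are kept,
// but the original proof body is removed. The LLM agent must write a body
// (e.g., an induction on n) that makes Verus accept the lemma.
proof fn lemma_sum_to_closed_form(n: nat)
    ensures sum_to(n) == n * (n + 1) / 2,
    decreases n,
{
    // proof body removed -- as handed to the LLM, verification fails here
}

} // verus!
```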
With this benchmark suite, we can quantitatively measure the makeup of system proofs, and the differences between them and proofs for small programming tasks. Table 1 lists some of these measurements. It clearly demonstrates