VeruSAGE: A Study of Agent-Based Verification for Rust Systems

Reading time: 5 minutes
...

📝 Original Info

  • Title: VeruSAGE: A Study of Agent-Based Verification for Rust Systems
  • ArXiv ID: 2512.18436
  • Date: 2025-12-20
  • Authors: Chenyuan Yang, Natalie Neamtu, Chris Hawblitzel, Jacob R. Lorch, Shan Lu

📝 Abstract

Large language models (LLMs) have shown impressive capability to understand and develop code. However, their capability to rigorously reason about and prove code correctness remains in question. This paper offers a comprehensive study of LLMs' capability to develop correctness proofs for system software written in Rust. We curate a new system-verification benchmark suite, VeruSAGE-Bench, which consists of 849 proof tasks extracted from eight open-source Verus-verified Rust systems. Furthermore, we design different agent systems to match the strengths and weaknesses of different LLMs (o4-mini, GPT-5, Sonnet 4, and Sonnet 4.5). Our study shows that different tools and agent settings are needed to stimulate the system-verification capability of different types of LLMs. The best LLM-agent combination in our study completes over 80% of system-verification tasks in VeruSAGE-Bench. It also completes over 90% of a set of system proof tasks not part of VeruSAGE-Bench because they had not yet been finished by human experts. This result shows the great potential for LLM-assisted development of verified system software.

💡 Deep Analysis

Figure 1

📄 Full Content

VeruSAGE: A Study of Agent-Based Verification for Rust Systems

Chenyuan Yang⋆, Natalie Neamtu♦, Chris Hawblitzel■, Jacob R. Lorch■, Shan Lu▲■
⋆University of Illinois Urbana-Champaign  ♦Carnegie Mellon University  ■Microsoft Research  ▲University of Chicago

Abstract

Large language models (LLMs) have shown impressive capability to understand and develop code. However, their capability to rigorously reason about and prove code correctness remains in question. This paper offers a comprehensive study of LLMs' capability to develop correctness proofs for system software written in Rust. We curate a new system-verification benchmark suite, VeruSAGE-Bench, which consists of 849 proof tasks extracted from eight open-source Verus-verified Rust systems. Furthermore, we design different agent systems to match the strengths and weaknesses of different LLMs (o4-mini, GPT-5, Sonnet 4, and Sonnet 4.5). Our study shows that different tools and agent settings are needed to stimulate the system-verification capability of different types of LLMs. The best LLM-agent combination in our study completes over 80% of system-verification tasks in VeruSAGE-Bench. It also completes over 90% of a set of system proof tasks not part of VeruSAGE-Bench because they had not yet been finished by human experts. This result shows the great potential for LLM-assisted development of verified system software.

1 Introduction

In the past few years, two contrasting code and system development methodologies have progressed. On the one hand, AI coding agents [37, 43] are becoming popular. They are very good at quickly producing a large amount of code with little human support. Their weakness is the lack of a correctness guarantee, which is particularly problematic for reliability-critical software, including most system software. On the other hand, system verification techniques are maturing after decades of research. Recent work [6, 10, 17, 19, 46] has demonstrated the feasibility of human experts developing large-scale system software in a popular systems programming language (i.e., Rust), together with formal correctness specifications and proofs that can be mathematically verified by tools such as Verus [17, 18]. However, the speed of such code and proof development and the accessibility of such verification techniques to general developers remain questionable. We naturally wonder whether these two methodologies can complement each other. Specifically, can large language models (LLMs) help write correctness proofs for system software?

Several research projects have explored using LLMs for proof writing, but none of them offered an answer to our question. Some of them focused on verification that requires special proof-oriented languages instead of general programming languages [7]; the others focused on small programming problems like binary search [1, 9, 23, 40, 48]. For example, the AutoVerus project [40] designed a benchmark suite of 150 small Rust programs with Verus specifications, called VerusBench. It also designed an agent system that empowers GPT-4o [14] to prove 90% of the tasks in VerusBench. Most recently, RagVerus [48] tried AutoVerus plus Retrieval-Augmented Generation (RAG), i.e., providing LLMs with example proofs from the same project. They did this for four system projects: VeriSMo [49], Vest [6], IronKV [17], and a small part of Anvil [32]. Unfortunately, they report a depressing result: only 20% or less of the proof tasks in VeriSMo, IronKV, and Vest could be proved by GPT-4o.
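To give a sense of what such a small Verus proof task involves, the following is a minimal, illustrative sketch (not an actual VerusBench task): the executable function and its requires/ensures contract are given, and the proof work consists of supplying annotations, such as the loop invariants below, so that Verus can discharge the verification conditions with its SMT backend. The function, its name, and its specification are assumptions made up for this example.

```rust
// A minimal sketch of a small Verus-style proof task (illustrative only).
// The `ensures` clause is the contract to be verified; the loop `invariant`
// lines are the kind of proof annotation a human or an LLM has to supply.
use vstd::prelude::*;

verus! {

fn count_zeros(v: &Vec<u64>) -> (n: usize)
    ensures
        n <= v.len(),
{
    let mut n: usize = 0;
    let mut i: usize = 0;
    while i < v.len()
        invariant
            i <= v.len(),
            n <= i,
    {
        if v[i] == 0 {
            n = n + 1;
        }
        i = i + 1;
    }
    n
}

fn main() {}

} // verus!
```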
These recent research efforts bring up several natural questions: How do real-world system proof tasks fundamentally differ from small programming problems? Is the poor performance due to limitations in the agent architecture (e.g., AutoVerus + RAG) or in the underlying model capabilities (e.g., GPT-4o)? And, ultimately, can state-of-the-art LLMs, paired with specialized agentic designs, effectively tackle the complexity of real-world system verification?

To answer these questions, we curate VeruSAGE-Bench, a comprehensive Verus system-verification benchmark suite. It consists of 849 proof tasks extracted from eight open-source Verus-verified system projects authored by different research groups. These projects cover various domains, including operating systems, memory allocators, storage systems, distributed systems, etc. Every task corresponds to one proof function or executable Rust function in the original project, with all the dependencies extracted into a stand-alone Rust file that can be individually compiled and verified. Each task file contains no proof bodies from the original project (i.e., no proof examples for LLMs), and hence brings us closer to testing LLMs' real system-proof-writing capabilities.

With this benchmark suite, we can quantitatively measure the makeup of system proofs, and the differences between them and proofs for small programming tasks. Table 1 lists some of these measurements. It clearly demonstrates
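As a rough illustration of this task format, the sketch below shows what an extracted, stand-alone task file could look like. Everything in it (the names, the page-alignment spec, and the placeholder body) is a hypothetical example invented for illustration, not taken from VeruSAGE-Bench or any of the eight projects. The point is that dependencies keep their specifications but lose their original proof bodies, and the agent must write a body for the target function so that the whole file verifies.

```rust
// Hypothetical sketch of an extracted task file (illustrative only; not the
// benchmark's actual contents or extraction format).
use vstd::prelude::*;

verus! {

// Dependency: a specification function carried over from the original project.
spec fn page_aligned(addr: nat) -> bool {
    addr % 4096 == 0
}

// Dependency: a lemma whose signature and contract are kept, but whose
// original proof body has been removed, so there is no proof text to copy.
proof fn lemma_aligned_add(a: nat, b: nat)
    requires
        page_aligned(a),
        page_aligned(b),
    ensures
        page_aligned(a + b),
{
    assume(false);  // proof body elided when the task was extracted
}

// Target: the agent must supply a proof body that makes this file verify.
proof fn lemma_aligned_sum3(a: nat, b: nat, c: nat)
    requires
        page_aligned(a),
        page_aligned(b),
        page_aligned(c),
    ensures
        page_aligned(a + b + c),
{
    // One possible proof: chain the dependency lemma twice.
    lemma_aligned_add(a, b);
    lemma_aligned_add(a + b, c);
}

fn main() {}

} // verus!
```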

📸 Image Gallery

verusage.png

Reference

This content is AI-processed based on open access ArXiv data.
