Beyond Basic Specifications? A Systematic Study of Logical Constructs in LLM-based Specification Generation
Formal specifications play a pivotal role in accurately characterizing program behaviors and ensuring software correctness. In recent years, leveraging large language models (LLMs) for the automatic generation of program specifications has emerged as a promising avenue for enhancing verification efficiency. However, existing research has been predominantly confined to generating specifications based on basic syntactic constructs, falling short of meeting the demands for high-level abstraction in complex program verification. Consequently, we propose incorporating logical constructs into existing LLM-based specification generation frameworks. Nevertheless, there remains a lack of systematic investigation into whether LLMs can effectively generate such complex constructs. To this end, we conduct an empirical study exploring the impact of different types of syntactic constructs on LLM-based specification generation. Specifically, we define four syntactic configurations with varying levels of abstraction and perform extensive evaluations on mainstream program verification datasets, employing a diverse set of representative LLMs. Experimental results first confirm that LLMs are capable of generating valid logical constructs. Further analysis reveals that the synergistic use of logical constructs and basic syntactic constructs improves both verification capability and robustness, without significantly increasing verification overhead. Additionally, we uncover the distinct advantages of two refinement paradigms. To the best of our knowledge, this is the first systematic work exploring the feasibility of using LLMs to generate high-level logical constructs, providing an empirical basis and guidance for the future construction of automated program verification frameworks with enhanced abstraction capabilities.
💡 Research Summary
The paper investigates whether large language models (LLMs) can generate not only basic syntactic constructs but also higher‑level logical constructs—such as predicates, logic functions, lemmas, and axioms—within formal specification languages, and how these constructs affect automated program verification. The authors focus on ACSL (ANSI/ISO C Specification Language) as a representative formalism and define four “syntactic configurations”: (1) Basic Config (only basic constructs such as requires, ensures, and assigns), (2) Verifiable Logical Config (adds predicates, logic functions, and lemmas), (3) Axiom Config (emphasizes axioms), and (4) Full Config (no restrictions). Each configuration corresponds to a distinct prompt that steers the LLM toward producing specifications at a different level of abstraction.
To answer three research questions—(RQ1) can LLMs reliably generate logical constructs, (RQ2) how do different construct types compare in verification success, stability, and efficiency, and (RQ3) how do they interact with two refinement paradigms—the authors build an evaluation pipeline consisting of three phases: guess (LLM generates candidate specs), verify (a verification tool checks them), and refine (either Deletion or Modification). Six LLMs of varying capability (including GPT‑4, Claude‑2, Llama‑2‑70B, Mistral‑7B, Falcon‑180B, and an open‑source baseline) are tested on mainstream verification benchmarks. Two verification tools are employed, and metrics cover syntactic/semantic validity, verification success rate, stability across runs, and runtime/memory overhead.
Experimental findings are as follows: (1) Mid‑to‑high‑capability LLMs can produce logical constructs with >95 % syntactic correctness; the strongest models tend to avoid overly complex logical expressions, preferring simpler axioms. (2) Logical constructs are not a replacement for basic ones but are strongly complementary: using logical constructs alone raises verification success by about 12 %, while combining them with basic constructs yields an additional ~8 % gain, for a total improvement of roughly 20 %. (3) Introducing logical constructs slightly increases verification uncertainty, yet when paired with basic constructs high‑capability models see the instability drop to under 3 %. (4) The added abstraction does not materially increase verification cost—the overall runtime grows by less than 3 %. (5) The two refinement paradigms have distinct trade‑offs: the Deletion paradigm minimizes verification overhead and converges quickly, whereas the Modification paradigm excels at correcting erroneous specifications and enhancing expressiveness. A hybrid approach that selects the appropriate paradigm per context delivers the best overall performance.
The study thus provides the first systematic evidence that LLMs can generate advanced logical specifications and that such specifications, when used together with basic constructs, improve verification outcomes without prohibitive overhead. It also offers practical guidance on prompt design, configuration selection, and refinement strategy for future LLM‑driven verification pipelines. Future work is suggested on extending the approach to other programming and specification languages, developing meta‑learning methods for automatic refinement strategy selection, and improving prompt engineering to further boost the reliability of logical construct generation.