Validating Formal Specifications with LLM-generated Test Cases
Validation is a central activity when developing formal specifications. As in coding, one possible validation technique is to define upfront test cases or scenarios that a future specification should (or should not) satisfy. Unfortunately, specifying such test cases is burdensome and error-prone, which can cause users to skip this validation task. This paper reports the results of an empirical evaluation of using pre-trained large language models (LLMs) to automate the generation of test cases from natural language requirements. In particular, we focus on test cases for structural requirements of simple domain models formalized in the Alloy specification language. Our evaluation focuses on the state-of-the-art GPT-5 model, but results from other closed- and open-source LLMs are also reported. The results show that, in this context, GPT-5 is already quite effective at generating positive (and negative) test cases that are syntactically correct, that satisfy (or violate) the given requirement, and that can detect many wrong specifications written by humans.
💡 Research Summary
The paper investigates whether large language models (LLMs), specifically the state‑of‑the‑art GPT‑5, can be used to automatically generate test cases that validate formal specifications written in the Alloy modeling language. Validation of formal specifications traditionally relies on manually crafted test cases or scenarios, a process that is time‑consuming and error‑prone. By leveraging LLMs, the authors aim to reduce this burden and enable a test‑driven modeling workflow for structural (non‑behavioral) domain models.
To evaluate the approach, the authors built an empirical study around four small domain models drawn from the publicly available Alloy4Fun benchmark: a social‑network model, a production‑line model, a train‑station model, and a course‑management model. Each model contains between 15 and 33 natural‑language requirements, together with a large corpus of correct formalizations and many incorrect attempts collected from university courses. For each requirement the study asks the LLM to produce a pair of Alloy “run” commands: a positive test case (expect 1) that should satisfy the requirement, and a negative test case (expect 0) that should violate it.
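For a requirement such as “a user cannot follow itself” in a social‑network model, such a pair might look like the following sketch (the signature, field, and fact names are illustrative, not taken from the benchmark):

```alloy
sig User { follows: set User }

-- candidate formalization under validation
fact NoSelfFollow { no u: User | u in u.follows }

-- positive test: a scenario the specification should admit
run { some u1, u2: User | u1 != u2 and u2 in u1.follows } for 3 expect 1

-- negative test: a scenario the specification should rule out
run { some u: User | u in u.follows } for 3 expect 0
```

Running both commands in the Alloy Analyzer checks that the first scenario is satisfiable and the second is not: a correct specification passes both, while a formalization that omitted the fact would wrongly admit the second scenario and fail the `expect 0` check.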
The experimental design explores several research questions: (RQ1) the impact of prompt engineering (zero‑shot, one‑shot, few‑shot) on test‑case quality; (RQ2) the effect of LLM non‑determinism even at low temperature settings; (RQ3) a comparative performance analysis of GPT‑5 against other leading models (Claude‑3, Llama‑2‑70B) and several open‑source, cost‑effective alternatives; (RQ4) the characteristics of invalid test cases (syntactic errors, missing expectations, incorrect handling of overloaded fields); and (RQ5) the ability of the generated test suites to detect wrong formal specifications.
Results show that GPT‑5 produces syntactically correct Alloy test cases in over 92 % of attempts and meets the semantic expectation (positive vs. negative) in about 85 % of cases. Few‑shot prompts that include two or three examples improve accuracy by roughly 12 % compared with pure zero‑shot prompts. Repeating the same prompt 5–7 times yields an average of 3–4 distinct test cases per requirement, demonstrating that controlled non‑determinism can be harnessed to increase test‑suite diversity.
When compared with other models, Claude‑3 matches GPT‑5’s syntactic success rate but lags by about 10 % on semantic correctness. Llama‑2‑70B and the smaller open‑source models exhibit a much higher syntax‑error rate (≈20 % or more) and struggle to generate valid negative tests. Although GPT‑5 is the most expensive per token, the overall cost per generated test case remains within a practical range for academic and industrial use.
Analysis of the failures reveals that most invalid tests stem from mishandling Alloy’s overloaded fields, omission of the cross‑product operator (→), or forgetting to include the expect clause. These errors can be mitigated by explicitly instructing the model to generate both positive and negative cases and to always specify the expected outcome.
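As an illustration of the last two pitfalls (signature and field names hypothetical): a scenario that pins a relation to a concrete tuple needs the cross‑product operator `->`, and every command needs an explicit `expect` clause stating the intended outcome:

```alloy
sig Item {}
sig Machine { processes: set Item }

-- positive test: the processes relation holds exactly one tuple;
-- the tuple m -> i uses the cross-product operator, and the
-- intended outcome is stated explicitly with `expect 1`
run { some m: Machine, i: Item | processes = m -> i } for 3 expect 1
```

Dropping either the `->` between `m` and `i` or the trailing `expect 1` yields exactly the classes of invalid tests reported above.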
Regarding fault detection, the generated suites uncover an average of 78 % of the known incorrect specifications in the benchmark; for complex logical errors (e.g., overlapping professor‑student roles) detection rates exceed 90 %. However, subtle logical omissions, such as dropping the universal quantifier in “every professor must teach at least one course”, are caught only about 45 % of the time, indicating that test‑case diversity is still insufficient for exhaustive validation.
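Sketching the professor example (identifiers hypothetical): a suite only catches the missing quantifier if it contains a negative test that forces a professor without courses:

```alloy
sig Course {}
sig Professor { teaches: set Course }

-- correct formalization: every professor teaches at least one course
fact EveryoneTeaches { all p: Professor | some p.teaches }

-- negative test: a course-less professor must be impossible;
-- a wrong specification that drops the `all` quantifier
-- (e.g. constraining only `some` professor) still admits this
-- scenario and is therefore flagged by the `expect 0` check
run { some p: Professor | no p.teaches } for 3 expect 0
```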
The authors conclude that GPT‑5 is currently the most effective model for automatic test‑case generation in Alloy, provided that prompts are carefully designed and that multiple generations are used to capture a broader set of scenarios. Limitations include incomplete internalization of Alloy’s syntax, especially for advanced features like field overloading, and the need for richer prompts to handle nuanced logical constraints. Future work is suggested in three directions: (1) integrating LLM‑generated tests with traditional automated test‑generation techniques (mutation testing, model‑based testing) to improve coverage; (2) developing quantitative metrics for test‑suite diversity and adequacy; and (3) exploring fine‑tuning or instruction‑tuning of LLMs on Alloy‑specific corpora to reduce syntactic errors. Overall, the study provides strong empirical evidence that LLMs can substantially lower the barrier to test‑driven validation of formal specifications, while also highlighting open challenges in building fully reliable, automated validation pipelines.