PALM: Path-aware LLM-based Test Generation with Comprehension
Symbolic execution is a widely used technique for test generation, offering systematic exploration of program paths through constraint solving. However, it is fundamentally constrained by its ability to model the target code, including library functions, as symbolic constraints, and by the power of the underlying constraint solvers. As a result, many paths involving complex features remain unanalyzed or insufficiently modeled. Recent advances in large language models (LLMs) have shown promise in generating diverse and valid test inputs. Yet LLMs lack mechanisms for systematically enumerating program paths and often fail to cover subtle corner cases. We observe that directly prompting an LLM with the full program leads to missed coverage of interesting paths. In this paper, we present PALM, a test generation system that combines symbolic path enumeration with LLM-assisted test generation. PALM statically enumerates possible paths through AST-level analysis and transforms each into an executable variant with embedded assertions that specify the target path. This avoids the need to translate path constraints into SMT formulas by instead constructing program variants that the LLM can interpret. Importantly, PALM provides an interactive frontend that visualizes path coverage alongside generated tests, organizing tests by the specific paths they exercise. A user study with 12 participants demonstrates that PALM’s frontend helps users better understand path coverage and identify which paths are actually exercised by PALM-generated tests, through verification and visualization of their path profiles.
💡 Research Summary
PALM (Path‑aware LLM‑based Test Generation with Comprehension) tackles two long‑standing limitations of traditional symbolic execution and recent LLM‑driven test generators. Symbolic execution systematically explores program paths but often stalls when it must model complex library calls, string operations, or case‑insensitive comparisons, because these operations lack built‑in SMT theories and require manual modeling. Consequently, many feasible paths remain unexplored. Conversely, large language models (LLMs) such as GPT‑4o can synthesize diverse test inputs from a program description, yet they do not enumerate paths and therefore miss subtle corner cases; prompting an LLM with the whole source typically yields “generic” tests that cover common scenarios but ignore rare, bug‑revealing executions.
PALM bridges this gap by using static AST‑level analysis to enumerate every feasible execution path of a Java program, then converting each path into a path‑specific program variant. The variant is the original code augmented with assertTrue or assertFalse statements that explicitly encode the branch outcomes required for that path. This transformation eliminates the need to translate path constraints into SMT formulas; the variant itself serves as a human‑readable, LLM‑friendly prompt that tells the model exactly which conditions must hold.
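To make the idea of a path-specific variant concrete, here is a minimal hypothetical example (not taken from the paper; the method and conditions are invented for illustration). The variant keeps the statements along one chosen path and replaces each branch with an assertion encoding the outcome that path requires, which is what the LLM is then asked to satisfy:

```java
// Hypothetical illustration of a path-specific variant.
// PALM uses JUnit's assertTrue/assertFalse; plain `assert` is used here
// only to keep the example self-contained (enable with `java -ea`).
public class VariantExample {
    // Original method under test, with two nested branches.
    static int classify(int x) {
        int y = x * 2;
        if (y > 10) {
            if (x % 2 == 0) {
                return 2;   // target path: y > 10 AND x even
            }
            return 1;
        }
        return 0;
    }

    // Path-specific variant for the target path: the same computation,
    // with asserts encoding the branch outcomes that path requires.
    // An LLM prompted with this variant must pick an x that makes
    // every assert pass.
    static int classifyVariant(int x) {
        int y = x * 2;
        assert y > 10 : "path requires y > 10";
        assert x % 2 == 0 : "path requires x even";
        return 2;
    }

    public static void main(String[] args) {
        // x = 6 satisfies both path conditions: y = 12 > 10, and 6 is even.
        System.out.println(classify(6));        // 2
        System.out.println(classifyVariant(6)); // 2
    }
}
```

Because the variant is ordinary Java rather than an SMT formula, conditions that are hard to encode symbolically (string operations, library calls) remain directly interpretable by the model.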
The system operates in two phases:
- Path Extraction – PALM recursively traverses the program’s abstract syntax tree. For if statements it creates two sub‑paths, each prefixed with the appropriate assert; for loops it unrolls up to a user‑defined bound K (default 2) and treats each unrolling as a separate path; for block statements it computes the Cartesian product of sub‑paths. The algorithm (presented as Algorithm 1) also performs function inlining, variable renaming, and constant propagation and folding to ensure the generated variant is executable. Users can mark specific functions as “symbolic” and designate an entry point, limiting the exploration to areas of interest.
- Test Generation & Validation – Each variant is sent to an LLM as a prompt. The LLM returns a concrete test method (e.g., a call to the target function with concrete arguments). PALM then runs this test against the variant. If all asserts pass, the test is marked as covering the corresponding leaf node in a visual symbolic‑execution tree. If any assert fails, PALM records the first failing assertion, feeds both the failing test and the assertion back to the LLM, and retries up to five iterations. This feedback loop guides the model to correct its misunderstanding of the path’s data dependencies (e.g., a variable updated before a later condition).
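The path-extraction step can be sketched on a toy AST. The code below is a simplified reconstruction of the enumeration idea, not PALM's actual Algorithm 1 (the node types and rendering are invented): an if statement splits every path in two, each prefixed with an assert on the condition, while a block combines its children's sub-paths via a Cartesian product.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of path enumeration over a minimal AST (hypothetical types;
// simplified from the loop-unrolling and inlining the paper describes).
public class PathEnum {
    interface Node {}
    record Stmt(String code) implements Node {}
    record If(String cond, Node then, Node els) implements Node {}
    record Block(List<Node> stmts) implements Node {}

    // Returns every path through `n`, each rendered as the statements
    // and path-encoding asserts along it.
    static List<List<String>> paths(Node n) {
        if (n instanceof Stmt s) {
            return List.of(List.of(s.code()));
        }
        if (n instanceof If i) {
            List<List<String>> out = new ArrayList<>();
            for (List<String> p : paths(i.then())) {       // true branch
                List<String> path = new ArrayList<>();
                path.add("assertTrue(" + i.cond() + ");");
                path.addAll(p);
                out.add(path);
            }
            for (List<String> p : paths(i.els())) {        // false branch
                List<String> path = new ArrayList<>();
                path.add("assertFalse(" + i.cond() + ");");
                path.addAll(p);
                out.add(path);
            }
            return out;
        }
        // Block: Cartesian product of each child's sub-paths.
        Block b = (Block) n;
        List<List<String>> acc = List.of(List.of());
        for (Node child : b.stmts()) {
            List<List<String>> next = new ArrayList<>();
            for (List<String> prefix : acc)
                for (List<String> suffix : paths(child)) {
                    List<String> joined = new ArrayList<>(prefix);
                    joined.addAll(suffix);
                    next.add(joined);
                }
            acc = next;
        }
        return acc;
    }

    public static void main(String[] args) {
        // Two sequential ifs yield 2 x 2 = 4 paths.
        Node prog = new Block(List.of(
            new If("x > 0", new Stmt("a();"), new Stmt("b();")),
            new If("y > 0", new Stmt("c();"), new Stmt("d();"))));
        System.out.println(paths(prog).size()); // 4
    }
}
```

The Cartesian product in the block case is also where path explosion (discussed under Limitations below) originates: each additional conditional multiplies the path count.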
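The validate-and-retry loop in the second phase can likewise be sketched. Everything in this snippet is hypothetical scaffolding (the variant is stood in for by a hard-coded check, and the LLM by a function), but the control flow mirrors the description above: run the candidate test against the variant, and on failure feed the first failing assertion back to the model, retrying a bounded number of times.

```java
import java.util.function.Function;

// Self-contained sketch of PALM's feedback loop (hypothetical names and
// stand-ins; not PALM's actual API).
public class FeedbackLoop {
    record RunResult(boolean passed, String firstFailingAssert) {}

    // Stand-in for "run the generated test against the path variant".
    // This fake variant's path requires the input to be > 5 and even.
    static RunResult runAgainstVariant(int testInput) {
        if (testInput <= 5)
            return new RunResult(false, "assertTrue(x > 5)");
        if (testInput % 2 != 0)
            return new RunResult(false, "assertTrue(x % 2 == 0)");
        return new RunResult(true, null);
    }

    // `llm` maps feedback (null on the first attempt) to a candidate input.
    static Integer generate(Function<String, Integer> llm, int maxRetries) {
        String feedback = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            int candidate = llm.apply(feedback);
            RunResult r = runAgainstVariant(candidate);
            if (r.passed()) return candidate;   // path covered
            // Feed the first failing assertion back to the model.
            feedback = "failing assertion: " + r.firstFailingAssert();
        }
        return null;                            // path left uncovered
    }

    public static void main(String[] args) {
        // Fake model: guesses 3 first, then "reads" the feedback and fixes it.
        Function<String, Integer> fakeLlm =
            fb -> fb == null ? 3 : (fb.contains("x > 5") ? 8 : 6);
        System.out.println(generate(fakeLlm, 5)); // 8
    }
}
```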
A central interactive frontend visualizes the symbolic‑execution tree, coloring nodes green (covered), red (uncovered), or gray (unreachable). Users can click any leaf to inspect its variant, view the generation history, edit the LLM prompt, or manually adjust the test. The UI also supports “Locate Path” (find the path exercised by an existing test) and “Verify Path” (run a test on the variant to confirm coverage). This tight coupling of visualization and generation gives developers immediate insight into which paths remain untested and why a generated test may have missed its target.
Empirical Evaluation
PALM was evaluated on 124 Java programs drawn from the HumanEval‑Java benchmark, which includes nested loops, external API calls, and non‑trivial string manipulations. Using the same LLM backends (GPT‑4o‑mini and o3‑mini), PALM achieved 35.0 % and 24.2 % higher path coverage, respectively, than a raw LLM approach that does not perform path enumeration. Adding the iterative validation‑feedback loop further improved coverage by 14.2 % over a non‑iterative baseline. In contrast, Symbolic PathFinder, a representative traditional symbolic executor, failed to produce accurate path constraints for 34.3 % of the programs because of insufficient modeling of external library calls.
A within‑subject user study with 12 participants examined the usability of PALM’s frontend. Participants reported higher confidence in generating sufficient tests, better ability to disambiguate redundant or missing tests, and easier verification that a test indeed exercised a specific path, compared with a pure LLM tool. Quantitative metrics (e.g., time to achieve a target coverage, number of correctly identified uncovered paths) also favored PALM.
Limitations and Future Work
The primary scalability concern is path explosion: the number of enumerated paths grows exponentially with the number of conditionals and loop iterations. PALM mitigates this by allowing users to set loop‑unrolling bounds and to select only a subset of functions for symbolic analysis, but more sophisticated path‑selection heuristics (e.g., risk‑based prioritization) are needed for large codebases. Additionally, tests that involve external resources (files, network) may not be fully validated by simple asserts; extending the variant language to include mock‑or‑stub generation could address this. Finally, the approach relies on the LLM’s ability to respect the asserts; while the feedback loop improves reliability, occasional syntactic or semantic errors still require manual correction.
Conclusion
PALM demonstrates that combining symbolic execution’s exhaustive path discovery with LLMs’ natural language and code generation capabilities yields a practical, high‑coverage test generation pipeline that sidesteps the need for handcrafted symbolic models of complex library functions. The path‑specific program variants act as an intuitive bridge between formal execution semantics and LLM prompting, while the interactive frontend empowers developers to understand and steer the testing process. Empirical results show substantial gains in path coverage and user confidence, suggesting that PALM’s design can serve as a blueprint for future systems that blend formal analysis with generative AI to automate software testing at scale.