Stateful Testing: Finding More Errors in Code and Contracts


Automated random testing has been shown to be an effective approach to finding faults, but it still faces a major unsolved issue: how to generate test inputs diverse enough to find many faults and to find them quickly. Stateful testing, the automated testing technique introduced in this article, generates new test cases that improve an existing test suite. The generated test cases are designed to violate the dynamically inferred contracts (invariants) characterizing the existing test suite. As a consequence, they are in a good position to detect new errors, and also to improve the accuracy of the inferred contracts by discovering those that are unsound. Experiments on 13 data structure classes totalling over 28,000 lines of code demonstrate the effectiveness of stateful testing in improving over the results of long sessions of random testing: stateful testing found 68.4% new errors and improved the accuracy of automatically inferred contracts to over 99%, with just a 7% time overhead.


💡 Research Summary

The paper addresses a well‑known limitation of automated random testing: after a long testing session the object pool becomes huge, and the probability of randomly selecting the “right” objects that would expose new faults drops dramatically. To overcome this, the authors introduce Stateful Testing, a technique that builds on top of an existing random‑testing campaign and automatically generates new test cases aimed at violating dynamically inferred contracts (pre‑ and post‑conditions).

Workflow

  1. Random testing (AutoTest) runs for many hours, creating a large pool of objects and a suite of test cases.
  2. Object/transition database – all objects, their public query results, and the state transitions caused by method calls are serialized and stored in a relational database.
  3. Dynamic contract inference (AutoInfer) extracts contracts from the passing test cases using template‑based heuristics. These contracts summarize observed regularities, e.g., Current.disjoint(other) as a precondition or old Current.is_equal(other) ⇒ Current.is_empty as a postcondition. Because inference is based on finite observations, some contracts are unsound.
  4. Reduction phase searches the database for objects that violate one or more inferred contracts. When such objects are found, new test cases are built that invoke the corresponding routine with those objects. If necessary, additional method calls are inserted to mutate objects so that missing preconditions become satisfied (e.g., moving a cursor before calling merge_tree_after).
  5. Execution of the new suite either uncovers previously unseen faults (if the call fails) or demonstrates that an inferred contract can be violated without causing a failure (hence the contract is unsound and should be discarded).
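Steps 2, 4, and 5 of the workflow can be sketched as a small pipeline. The schema, values, and routine names below are illustrative assumptions, not the paper's actual database layout: one row per observed object pair, with the stored result of the public query `disjoint` used to hunt for violations of the inferred precondition `Current.disjoint(other)`.

```python
import sqlite3

# Hypothetical schema: one row per serialized object pair observed
# during random testing, with the result of its public 'disjoint' query.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE pool (obj_id INTEGER, other_id INTEGER, disjoint INTEGER)")
db.executemany("INSERT INTO pool VALUES (?, ?, ?)",
               [(1, 2, 1), (3, 4, 0), (5, 6, 0)])  # 0 = overlapping sets

# Step 4 (reduction): find object pairs that violate the inferred
# precondition Current.disjoint(other) of TWO_WAY_SORTED_SET.merge.
violators = db.execute(
    "SELECT obj_id, other_id FROM pool WHERE disjoint = 0").fetchall()

# Step 5 (execution): each violating pair becomes a new test case that
# deserializes the objects and calls merge on them; a clean run disproves
# the contract, while a failure reveals a new fault.
new_tests = [("merge", pair) for pair in violators]
print(new_tests)
```

The point of the relational store is that "rare" configurations (here, overlapping sets) become a cheap `SELECT` rather than a low-probability random draw from the pool.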

Key examples

  • Unsound precondition: In TWO_WAY_SORTED_SET.merge, the inferred precondition Current.disjoint(other) was never challenged. The database contained overlapping sets; using them generated calls that showed the precondition is unnecessary, leading to its removal.
  • Unsound postcondition: In LINKED_LIST.merge_left, the inferred postcondition old Current.is_equal(other) ⇒ Current.is_empty was challenged by constructing equal but non‑empty lists; the call completed without failure while the consequent did not hold, demonstrating that the postcondition is unsound and should be discarded.
  • Object mutation: For TWO_WAY_TREE.merge_tree_after, the required precondition not off could not be satisfied directly because all sibling trees in the pool had the cursor off. The system automatically selected a helper routine start to move the cursor, then performed the merge, exposing a real implementation bug.
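The object‑mutation example can be illustrated with a minimal sketch. The class and routine names mirror the Eiffel example (`off`, `start`, `merge_tree_after`), but the implementation and the helper‑selection logic here are simplified assumptions, not the paper's algorithm:

```python
class Tree:
    """Toy stand-in for TWO_WAY_TREE: a sibling list with a cursor."""
    def __init__(self, items):
        self.items = list(items)
        self.cursor = None          # cursor starts "off", as in the pool

    def off(self):
        return self.cursor is None

    def start(self):                # helper routine: move cursor to first item
        self.cursor = 0

    def merge_tree_after(self, other):
        assert not self.off(), "precondition 'not off' violated"
        # insert the other tree's siblings right after the cursor position
        self.items[self.cursor + 1:self.cursor + 1] = other.items

def enable_and_call(target, helper_name, routine_name, *args):
    """If the required precondition does not hold, first apply the
    selected helper routine, then invoke the routine under test."""
    if target.off():
        getattr(target, helper_name)()
    getattr(target, routine_name)(*args)
    return target

t = enable_and_call(Tree([1, 2]), "start", "merge_tree_after", Tree([9]))
print(t.items)  # [1, 9, 2]
```

In the real system the helper call (`start`) is chosen automatically from the class's public routines; composing it before the target call is what made the previously unreachable `merge_tree_after` state testable.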

Experimental evaluation
The technique was applied to 13 data‑structure classes from EiffelBase and Gobo (≈28 000 lines of code). A 520‑hour random‑testing campaign produced 149 293 distinct test cases, uncovered 95 faults, and inferred hundreds of contracts. Stateful testing was then run for 36 hours on the same data. Results:

  • 65 new faults were discovered, a 68.4 % improvement over the original campaign.
  • 39.3 % of the inferred contracts were validated as sound; manual inspection confirmed that virtually all validated contracts were indeed correct.
  • The additional time required was only about 7 % of the total execution time, demonstrating a modest overhead.

Insights and contributions

  1. Synergy of dynamic contracts and searchable object metadata – By turning the object pool into a queryable database, the system can deliberately target “rare” object configurations that random selection would miss.
  2. Automatic contract refinement – Unsound contracts are either disproved (by successful execution of a violating test) or confirmed (by causing a failure), providing a self‑cleaning mechanism for inferred specifications.
  3. State mutation for test generation – When a direct violation is impossible, the framework can compose auxiliary method calls to bring objects into the required state, dramatically expanding the reachable test space.
  4. Practical scalability – The modest 7 % overhead shows that the approach is feasible for real‑world code bases; the technique is language‑agnostic in principle, requiring only public queries and a contract mechanism (e.g., Design by Contract in Eiffel, Code Contracts in .NET, or JML for Java).
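The self‑cleaning mechanism of insight 2 boils down to a two‑way classification of each contract‑violating run. The function below is an assumed sketch of that decision logic, not the tool's actual code:

```python
def classify(contract, run_without_failure):
    """Classify the outcome of executing a test case that deliberately
    violates an inferred contract.

    contract: human-readable description of the violated contract.
    run_without_failure: True if the call completed normally.
    """
    if run_without_failure:
        # Violated contract, yet nothing failed: the contract is unsound
        # and is discarded from the inferred specification.
        return ("discard", contract)
    # The violation triggered a failure: either a previously unseen fault
    # in the implementation, or evidence the contract is genuinely needed.
    return ("keep", contract)

print(classify("Current.disjoint(other)", True))   # ('discard', ...)
print(classify("not off", False))                  # ('keep', ...)
```

Either branch is useful: the first prunes unsound contracts, the second either confirms a contract or surfaces a new fault, which is why stateful testing improves both the test suite and the inferred specification at once.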

Limitations and future work
The quality of inferred contracts depends on the templates used; richer templates could capture more subtle properties. Database search costs may grow with very large pools, suggesting the need for indexing or sampling strategies. The current implementation is tied to Eiffel; extending it to other languages will require mapping their contract facilities to the same inference pipeline. Future research directions include (a) integrating static analysis to guide contract inference, (b) scaling the approach to large industrial systems, and (c) exploring automated repair actions once a faulty contract is identified.

In summary, Stateful Testing demonstrates that augmenting random testing with contract‑guided, database‑driven test generation can substantially increase fault detection and improve the reliability of automatically inferred specifications, offering a compelling path toward more effective automated software testing.
