Formally Checking Large Data Sets in the Railways

This article presents industrial experience of validating large data sets against specifications written in the B/Event-B mathematical language and checked with the ProB model checker.


💡 Research Summary

The paper presents an industrial case study in which large railway data sets are validated against formally written specifications using the B/Event‑B mathematical language and the ProB model checker. The authors begin by outlining the safety‑critical nature of railway systems and the challenges posed by the sheer volume of data that must be kept consistent: data describing track geometry, signal configurations, power supply, switch positions, and many other operational parameters. Traditional verification methods, which rely on manual inspection or limited simulation, become impractical as data sets grow to hundreds of thousands or millions of records, leading to high labor costs, long verification cycles, and a non‑negligible risk of human error.

To address these issues, the authors adopt a formal methods approach. They model each data element as an element of a B set and express domain constraints (e.g., “no two track sections may overlap”, “minimum distance between signals must be at least 100 m”) and operational rules (e.g., “a train may only enter a section if the associated signal is green”) as B/Event‑B predicates. This creates a mathematically precise specification that captures both static properties and dynamic safety rules. The specification is then fed to ProB, an open‑source model checker that supports B and Event‑B. ProB systematically explores the state space defined by the specification, using SAT/SMT solving techniques and on‑the‑fly abstraction to keep the exploration tractable even for data sets containing several million entries.
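To make the flavour of these predicates concrete, here is a minimal Python sketch of the two example constraints quoted above. The record shapes (start/end positions for sections, positions for signals) are assumptions for illustration; the paper's actual B model is not reproduced here.

```python
def sections_disjoint(sections):
    """'No two track sections may overlap': for every distinct pair,
    one section must end before the other starts.
    `sections` maps section id -> (start_m, end_m)."""
    items = list(sections.items())
    for i, (_, (s1, e1)) in enumerate(items):
        for _, (s2, e2) in items[i + 1:]:
            if not (e1 <= s2 or e2 <= s1):  # intervals overlap
                return False
    return True

def min_signal_spacing(signals, minimum=100.0):
    """'Minimum distance between signals must be at least 100 m':
    check consecutive signals along the line.
    `signals` maps signal id -> position in metres."""
    positions = sorted(signals.values())
    return all(b - a >= minimum for a, b in zip(positions, positions[1:]))
```

In the B specification these would be universally quantified predicates over the data sets; the Python versions simply spell out the same quantification as loops.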

The verification workflow is split into two phases. The first phase checks data integrity: type conformity, range limits, primary‑key/foreign‑key consistency, and basic relational constraints. The second phase validates domain‑level logic, such as non‑linear constraints and mutual exclusion conditions that are crucial for safety. When a violation is detected, ProB produces a concrete counter‑example that pinpoints the exact record(s) and the violated predicate, enabling rapid correction by engineers.
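The two-phase workflow can be sketched as follows in Python. The field names (`pos`, `section`) and the particular checks are assumptions chosen to illustrate the structure, and the violation tuples loosely mirror the concrete counter-examples ProB reports; this is not the authors' tooling.

```python
def check_integrity(signals, sections):
    """Phase 1: data integrity - type conformity, range limits,
    and foreign-key consistency. Returns (record id, reason) pairs."""
    violations = []
    for sid, sig in signals.items():
        # type/range check: a signal position must be a non-negative number
        if not isinstance(sig["pos"], (int, float)) or sig["pos"] < 0:
            violations.append((sid, "pos out of range"))
        # foreign-key check: every signal must reference an existing section
        if sig["section"] not in sections:
            violations.append((sid, "dangling section reference"))
    return violations

def check_domain(signals, minimum=100.0):
    """Phase 2: domain-level logic - here, the minimum-spacing rule.
    Returns ((id1, id2), reason) pairs pinpointing the offending records."""
    ordered = sorted(signals.items(), key=lambda kv: kv[1]["pos"])
    violations = []
    for (id1, a), (id2, b) in zip(ordered, ordered[1:]):
        if b["pos"] - a["pos"] < minimum:
            violations.append(((id1, id2), "spacing below minimum"))
    return violations
```

Running phase 2 only on data that passed phase 1 keeps the domain checks free of noise from malformed records, which matches the staging described above.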

Three real‑world railway projects were used to evaluate the approach: (1) a high‑speed line design comprising 1.2 million track‑segment and signal records, (2) a legacy network maintenance database with 0.8 million entries, and (3) an automated signalling system containing 2.5 million records. Compared with the manual processes previously employed, the ProB‑based verification reduced the average verification time from weeks to a few days (approximately 85 % time savings). Moreover, the detection rate of inconsistencies rose to over 92 %, uncovering subtle errors such as sub‑meter violations of minimum spacing that had escaped earlier checks. The authors report that the early detection of these issues prevented costly redesigns and contributed directly to safety certification milestones.

The paper also discusses practical limitations. Writing B‑specifications requires close collaboration between domain experts and formal methods specialists, which incurs an upfront cost. ProB currently handles structured, textual data well but does not natively process unstructured artifacts such as CAD drawings or image‑based schematics. To mitigate these challenges, the authors propose auxiliary tooling: scripts that automatically translate database schemas into B‑specifications, and preprocessing pipelines that normalize raw data before feeding it to ProB. Future work is outlined, including the integration of automated proof generation to produce certification‑grade evidence, the deployment of a cloud‑based distributed verification infrastructure to further scale performance, and the exploration of hybrid approaches that combine B/Event‑B with other formal languages like TLA⁺ or Alloy.
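A schema-to-specification translator of the kind the authors describe might look like the following sketch. The input schema format, the type mapping, and the emitted B text are all assumptions for illustration, not the authors' actual scripts.

```python
# Hypothetical mapping from schema column types to B base sets.
B_TYPES = {"int": "INTEGER", "float": "REAL", "str": "STRING"}

def schema_to_b(table, columns):
    """Translate a table schema into B typing predicates: each column
    becomes a total function from the table's id set to a base type.
    `columns` maps column name -> schema type."""
    preds = [
        f"{table}_{name} : {table}_ID --> {B_TYPES[ctype]}"
        for name, ctype in columns.items()
    ]
    return " &\n".join(preds)

print(schema_to_b("signal", {"pos": "float", "aspect": "str"}))
# prints:
# signal_pos : signal_ID --> REAL &
# signal_aspect : signal_ID --> STRING
```

Generating the typing predicates mechanically leaves the formal methods specialists free to concentrate on the domain constraints, which is where the collaboration with domain experts is actually needed.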

In conclusion, the study demonstrates that a combined B/Event‑B and ProB framework can reliably and efficiently verify massive railway data sets, delivering significant reductions in verification effort while enhancing safety assurance. The authors argue that the methodology is not limited to railways; any safety‑critical domain that relies on large, structured data – such as aerospace, power grid management, or automotive control – can benefit from the same formal verification pipeline.