Formally Checking Large Data Sets in the Railways
This article presents industrial experience of validating large data sets against specifications written in the B / Event-B mathematical language, using the ProB model checker.
Research Summary
The paper presents an industrial case study in which large railway data sets are validated against formally written specifications using the B/Event-B mathematical language and the ProB model checker. The authors begin by outlining the safety-critical nature of railway systems and the challenges posed by the sheer volume of data that must be kept consistent: data describing track geometry, signal configurations, power supply, switch positions, and many other operational parameters. Traditional verification methods, which rely on manual inspection or limited simulation, become impractical as data sets grow to hundreds of thousands or millions of records, leading to high labor costs, long verification cycles, and a non-negligible risk of human error.
To address these issues, the authors adopt a formal methods approach. They model each data element as an element of a B set and express domain constraints (e.g., "no two track sections may overlap", "minimum distance between signals must be at least 100 m") and operational rules (e.g., "a train may only enter a section if the associated signal is green") as B/Event-B predicates. This creates a mathematically precise specification that captures both static properties and dynamic safety rules. The specification is then fed to ProB, an open-source model checker that supports B and Event-B. ProB systematically explores the state space defined by the specification, using SAT/SMT solving techniques and on-the-fly abstraction to keep the exploration tractable even for data sets containing several million entries.
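To make the two static constraints quoted above concrete, the following is a minimal Python sketch of what such predicates check. It is not the authors' B specification and does not use ProB. The record layout (`TrackSection`, positions in metres along a single line) and all function names are assumptions introduced purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrackSection:
    # Hypothetical record layout: a section covers [start_m, end_m) on one line.
    section_id: str
    start_m: int
    end_m: int


MIN_SIGNAL_SPACING_M = 100  # the 100 m minimum quoted in the text


def overlapping_sections(sections):
    """Return one witness pair for every section that overlaps an earlier one.

    Roughly the B predicate (names are illustrative):
      !(s1,s2).(s1:SECTIONS & s2:SECTIONS & s1/=s2
                => (sec_end(s1) <= sec_start(s2) or sec_end(s2) <= sec_start(s1)))
    """
    violations = []
    furthest = None  # earlier section that reaches furthest along the line
    for sec in sorted(sections, key=lambda s: (s.start_m, s.end_m)):
        if furthest is not None and sec.start_m < furthest.end_m:
            violations.append((furthest.section_id, sec.section_id))
        if furthest is None or sec.end_m > furthest.end_m:
            furthest = sec
    return violations


def signals_too_close(signal_positions, min_spacing=MIN_SIGNAL_SPACING_M):
    """Return (id_a, id_b, distance) for neighbouring signals closer than min_spacing.

    signal_positions maps a signal id to its position in metres. If any two
    signals are too close, some pair of neighbours in sorted order is too,
    so checking neighbours is enough to detect a violation.
    """
    ordered = sorted(signal_positions.items(), key=lambda kv: kv[1])
    return [
        (a, b, pos_b - pos_a)
        for (a, pos_a), (b, pos_b) in zip(ordered, ordered[1:])
        if pos_b - pos_a < min_spacing
    ]


sections = [
    TrackSection("T1", 0, 500),
    TrackSection("T2", 500, 900),
    TrackSection("T3", 850, 1200),  # overlaps T2 by 50 m
]
print(overlapping_sections(sections))
print(signals_too_close({"S1": 0, "S2": 150, "S3": 230}))
```

In a B specification the same rules would be stated declaratively as quantified predicates, and the checker, rather than hand-written loops, would search for violating records.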
The verification workflow is split into two phases. The first phase checks data integrity: type conformity, range limits, primary-key/foreign-key consistency, and basic relational constraints. The second phase validates domain-level logic, such as non-linear constraints and mutual exclusion conditions that are crucial for safety. When a violation is detected, ProB produces a concrete counter-example that pinpoints the exact record(s) and the violated predicate, enabling rapid correction by engineers.
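The two-phase structure can be sketched in a few lines of Python. This is only an illustration of the workflow's shape, not ProB's behaviour or output format: the `Violation` record, the dictionary-based data layout, and the domain rule used in phase two (a signal must lie inside the section it references) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Violation:
    # A counter-example: which phase, which rule, and which records broke it.
    phase: str
    rule: str
    record_ids: tuple


def check_integrity(signals, sections):
    """Phase 1: type conformity, range limits and foreign-key consistency."""
    found = []
    for sid, rec in signals.items():
        pos = rec.get("position_m")
        if not isinstance(pos, int):
            found.append(Violation("integrity", "position_m is an integer", (sid,)))
        elif pos < 0:
            found.append(Violation("integrity", "position_m >= 0", (sid,)))
        if rec.get("section") not in sections:
            found.append(Violation("integrity", "section references an existing track section", (sid,)))
    return found


def check_domain(signals, sections):
    """Phase 2: a hypothetical domain rule, evaluated on well-formed data only."""
    found = []
    for sid, rec in signals.items():
        start_m, end_m = sections[rec["section"]]
        if not (start_m <= rec["position_m"] < end_m):
            found.append(Violation("domain", "signal lies inside its section", (sid, rec["section"])))
    return found


def validate(signals, sections):
    """Run phase 1 first; run phase 2 only if the data passed phase 1."""
    violations = check_integrity(signals, sections)
    if violations:
        return violations
    return check_domain(signals, sections)


sections = {"T1": (0, 500), "T2": (500, 900)}
signals = {
    "S1": {"position_m": 100, "section": "T1"},
    "S2": {"position_m": 450, "section": "T2"},  # outside T2
}
for v in validate(signals, sections):
    print(v.phase, "|", v.rule, "|", v.record_ids)
```

Running the integrity phase first means the domain rules can assume well-typed values and valid references, which keeps each domain predicate short and its counter-examples easier to read.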
Three real-world railway projects were used to evaluate the approach: (1) a high-speed line design comprising 1.2 million track-segment and signal records, (2) a legacy network maintenance database with 0.8 million entries, and (3) an automated signalling system containing 2.5 million records. Compared with the manual processes previously employed, the ProB-based verification reduced the average verification time from weeks to a few days (approximately 85% time savings). Moreover, the detection rate of inconsistencies rose to over 92%, uncovering subtle errors such as sub-meter violations of minimum spacing that had escaped earlier checks. The authors report that the early detection of these issues prevented costly redesigns and contributed directly to safety certification milestones.
The paper also discusses practical limitations. Writing B specifications requires close collaboration between domain experts and formal methods specialists, which incurs an upfront cost. ProB currently handles structured, textual data well but does not natively process unstructured artifacts such as CAD drawings or image-based schematics. To mitigate these challenges, the authors propose auxiliary tooling: scripts that automatically translate database schemas into B specifications, and preprocessing pipelines that normalize raw data before feeding it to ProB. Future work is outlined, including the integration of automated proof generation to produce certification-grade evidence, the deployment of a cloud-based distributed verification infrastructure to further scale performance, and the exploration of hybrid approaches that combine B/Event-B with other formal languages like TLA+ or Alloy.
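As an illustration of the schema-translation idea, the sketch below renders one table description as the text of a small B machine, with each non-key column declared as a partial function from the key type. This is not the authors' tooling: the schema format, the type mapping, and the names `schema_to_b_machine`, `SignalData`, and `signal_*` are assumptions, and a real translator would also have to emit the data values and the domain predicates.

```python
# Hypothetical mapping from schema column types to B types.
B_TYPES = {"int": "INTEGER", "nat": "NATURAL", "str": "STRING", "bool": "BOOL"}


def schema_to_b_machine(machine_name, table, key_column, columns):
    """Render a table schema as B machine text.

    columns maps column name -> schema type. Each non-key column becomes a
    constant typed as a partial function (+->) from the key type to the
    column type.
    """
    key_type = B_TYPES[columns[key_column]]
    attrs = [name for name in columns if name != key_column]
    if not attrs:
        raise ValueError("table needs at least one non-key column")
    constants = [f"{table}_{name}" for name in attrs]
    properties = [
        f"{table}_{name} : {key_type} +-> {B_TYPES[columns[name]]}" for name in attrs
    ]
    lines = [
        f"MACHINE {machine_name}",
        "CONSTANTS " + ", ".join(constants),
        "PROPERTIES",
        "    " + " &\n    ".join(properties),
        "END",
    ]
    return "\n".join(lines) + "\n"


text = schema_to_b_machine(
    "SignalData",
    table="signal",
    key_column="id",
    columns={"id": "str", "position_m": "nat", "section": "str"},
)
print(text)
```

Generating the typing predicates mechanically from the schema leaves the hand-written part of the specification to the domain rules, which is where the collaboration between domain experts and formal methods specialists is actually needed.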
In conclusion, the study demonstrates that a combined B/Event-B and ProB framework can reliably and efficiently verify massive railway data sets, delivering significant reductions in verification effort while enhancing safety assurance. The authors argue that the methodology is not limited to railways; any safety-critical domain that relies on large, structured data (such as aerospace, power grid management, or automotive control) can benefit from the same formal verification pipeline.