Formal verification of tree-based machine learning models for lateral spreading

F ormal v eriﬁcation of tree-based mac hine learning mo dels for lateral spreading Krishna Kumar ∗ a Oden Institute of Computational Engine ering and Scienc es, The University of T exas at Austin, A ustin, T exas, USA Abstract Mac hine learning mo dels for geotec hnical hazard prediction can achiev e high accuracy while learning physi- cally inconsisten t relationships from sparse or biased training data. Curren t remedies (p ost-ho c explainabilit y suc h as SHAP and LIME, and training-time constrain ts) either diagnose individual predictions appro ximately or restrict mo del capacity without pro viding exhaustive guaran tees. This pap er enco des trained tree ensem- bles as logical form ulas in a Satisﬁability Mo dulo Theories (SMT) solver and chec ks physical sp eciﬁcations across the entire input domain, not just sampled p oin ts. F our geotechnical speciﬁcations (w ater table depth, PGA monotonicity , distance safety , and ﬂat-ground safet y) are formalized as decidable logical formulas and v eriﬁed via SMT against b oth XGBoost ensembles and Explainable Bo osting Machines (EBMs) trained on the 2011 Christc h urch earthquak e lateral spreading dataset (7,291 sites, four features). The SMT solver either produces a concrete counterexample where a sp eciﬁcation fails or prov es that no violation exists. Unconstrained XGBoost (82.5% accuracy) violates all four sp eciﬁcations. Monotone-constrained XGBoost (69.3% accuracy) satisﬁes PGA monotonicity but still violates comp ound sp eciﬁcations that no single-feature constrain t can express. The unconstrained EBM (80.1% accuracy) violates all four sp eciﬁcations. Imposing monotone constrain ts on PGA alone ﬁxes Sp eciﬁcation B but leav es the remaining three violated. A fully constrained EBM with monotone constraints on all four features (67.2% accuracy) satisﬁes three of four sp eciﬁcations (water table depth safety remains violated by a thin margin), demonstrating that iterative constrain t application guided by veriﬁcation can progressively improv e physical consistency . A Pareto anal- ysis of 33 mo del v arian ts rev eals a p ersisten t trade-oﬀ, as among the v arian ts studied, no mo del ac hieves b oth > 80% accuracy and full compliance with the sp eciﬁcation set. SHAP analysis of sp eciﬁcation coun- terexamples sho ws that the oﬀending feature can rank last among four features, demonstrating that p ost-ho c explanations do not substitute for formal veriﬁcation. These results establish a verify-ﬁx-v erify engineering lo op for deploying physically consistent ML mo dels in safety-critical geotechnical applications, and pro vide a foundation for formal certiﬁcation of geotec hnical ML mo dels b efore ﬁeld deploymen t. Keywor ds: formal v eriﬁcation, physical consistency, machine learning, lateral spreading, SMT solver, X GBo ost, explainable bo osting machine ∗ Corresponding author Email addr ess: krishnak@utexas.edu (Krishna Kumar) 1. In tro duction Mac hine learning (ML) mo dels increasingly predict earthquake-induced geotechnical hazards, including liquefaction triggering [25, 10, 7] and lateral spreading [11, 9]. These models capture complex, non-linear relationships among input features (p eak ground acceleration (PGA), groundwater depth, distance to a free face, ground slope) that traditional empirical metho ds cannot represen t [31]. Ensemble methods lik e random forests [2] and gradient-bo osted trees [5] ac hiev e this by aggregating h undreds of w eak predictors, each ﬁtting a lo cal partition of the feature space. The aggregate captures in teraction eﬀects that no single decision tree could express, but it also obscures ho w individual features drive predictions. This opacity creates a practical problem. Mo dels trained on sparse or biased geotec hnical data can learn ph ysically inconsistent relationships that remain in visible until deploymen t. Hsiao et al. [11] demonstrated this concretely . Their X GBoost mo del, trained on the 2011 Christch urch earthquake lateral spreading dataset [9], predicted that the probability of lateral spreading de cr e ase d as PGA increased in certain input regions, in verting the fundamen tal exp ectation that stronger shaking pro duces more ground displacement. The model still achiev ed appro ximately 84% ov erall accuracy b ecause the non-physical pattern w as conﬁned to data- sparse input regions, precisely the regions where reliable predictions matter most. This failure mode is not unique to one mo del or dataset. Maurer and Sanger [21] review ed 75 AI liquefaction publications and identiﬁed systematic deﬁciencies (lack of comparison to state-of-practice mo dels, departure from ML b est practices, mo dels discussed but not provided) that undermine trust in ML predictions across the ﬁeld. This pap er addresses a complementary deﬁciency , the absence of exhaustiv e mo del-assurance to ols that can determine whether a trained mo del satisﬁes sp eciﬁc consistency conditions, regardless of how it w as trained or b enc hmark ed. P ost-ho c explainability metho ds oﬀer one line of defense. SHapley A dditive exPlanations (SHAP) [20] and Lo cal Interpretable Model-agnostic Explanations (LIME) [26] hav e prov en v aluable for geotechnical ML [3, 11], but both are approximate. Huang and Marques-Silv a [14] pro ved that SHAP can assign non-zero imp ortance to irrelev ant features and zero imp ortance to relev an t ones. Post-hoc metho ds can suggest the presence of failure, but they cannot c ertify its absence. Correcting non-ph ysical relationships requires mo difying the mo del. Monotone constraints [5] and shap e function editing [12] improv e physical consistency but op erate on individual features, cannot express com- p ound conditions (e.g., “far from free face and weak shaking should not predict high risk”), and provide no pro of that the corrected mo del satisﬁes the intended prop ert y across its full input domain. This pap er closes that gap by in tro ducing formal veriﬁcation via Satisﬁability Mo dulo Theories (SMT) solv ers to geotechnical ML. W e enco de trained XGBoost and Explainable Boosting Machine (EBM) tree ensem bles as logical form ulas in an SMT solver and chec k whether ph ysical sp eciﬁcations hold across all p ossible inputs simultaneously . T ree ensem bles partition the input space into ﬁnitely many h yperrectangular regions, each assigned a constant output. Negating a ph ysical sp eciﬁcation and inlining this partition yields a quantiﬁer-free formula in linear real arithmetic that the solv er decides exactly , returning either sat with 2 a concrete coun terexample or unsat (sp eciﬁcation prov en). W e use Z3 [22], though the approach applies to an y QF_LRA-capable solver. The contribution is metho dological, a mo del-assurance workﬂo w demonstrating ho w formal veriﬁcation complements training-time constraints, rather than prop osing a new predictor. As geotec hnical ML mo dels mo ve to w ard op erational deplo yment, formal certiﬁcation that these models resp ect established sp eciﬁcations b ecomes essen tial for safe adoption in practice. The con tributions of this pap er are as follo ws. 1. The ﬁrst application of SMT-based formal veriﬁcation to geotechnical ML mo dels, demonstrating that b oth X GBo ost ensem bles and Explainable Bo osting Mac hines trained on lateral spreading data can b e enco ded as logical formulas and formally chec k ed against sp eciﬁed geotechnical sp eciﬁcations. 2. F ormalization of four geotec hnical sp eciﬁcations, spanning single-feature thresholds (A), monotonicity (B), and compound m ulti-feature conditions (C, D). Each speciﬁcation reduces to a quan tiﬁer-free satisﬁabilit y query in linear real arithmetic (QF_LRA) after inlining the tree-ensemble score. The sp eciﬁc thresholds are calibrated to the Christc hurc h dataset and can b e adapted to other sites. The v eriﬁcation framew ork itself is general and accepts an y threshold v alues that local engineers deem appropriate. 3. A v erify-ﬁx-v erify engineering loop that uses SMT coun terexamples to diagnose violations, applies targeted monotone constrain ts, and re-veriﬁes to conﬁrm or reveal residual failures. 4. A P areto analysis of 33 mo del v arian ts quantifying the accuracy-consistency trade-oﬀ. Among the v ariants studied, no model achiev es b oth > 80% accuracy and full compliance with the sp eciﬁcation set, establishing a concrete cost for ph ysical consistency that practitioners can ev aluate. 2. Bac kground and related work 2.1. Machine le arning for later al spr e ading Lateral spreading, the p ermanen t horizon tal ground displacemen t tow ard a free face during earthquake- induced liquefaction, damages buried utilities, bridge foundations, and building fo otings in seismic regions. Predicting where lateral spreading will o ccur requires relating subsurface soil conditions and seismic loading to surface displacement, a task complicated b y the heterogeneit y of natural dep osits and the rarit y of strong- motion ev ents. Duran te and Rathje [9] compiled 7,291 sites from the 2011 Christch urc h earthquake, each characterized b y four features (GWD, distance to free-face L , slop e, PGA) and lab eled via remote sensing, training XG- Bo ost classiﬁers achieving approximately 80–84% accuracy . Demir and Şahin [7] conﬁrmed that bo osted tree ensem bles consisten tly outp erform logistic regression and SVMs on this dataset. Chen et al. [6] extended the approach to displacement magnitude prediction, and Hsiao et al. [13] incorp orated CPT proﬁles via auto encoder, improving accuracy ab o ve 83%. Hsiao et al. [11] found that the XGBoost model assigned ne gative SHAP v alues to PGA in certain regions, meaning increasing shaking de cr e ase d predicted spreading probabilit y . This non-physical b ehavior arose from 3 sparse training data in the lo w-PGA, high-distance region. SHAP revealed the problem visually but could not quan tify its extent or guarantee that a corrected mo del would b e free of similar artifacts. Hsiao et al. [12] replaced XGBoost with an EBM [23, 19], a Generalized Additiv e Mo del (GAM) decom- p osing the prediction into univ ariate shap e functions f i ( x i ) and pairwise in teractions f ij ( x i , x j ) . g  E [ y ]  = β 0 + X i f i ( x i ) + X i 10 ∧ x 1 < 0 , the solv er deduces x 2 > 10 and returns a satisfying assignment such as x 1 = − 1 , x 2 = 12 ; tree-ensemble queries ha ve exactly this structure, with eac h leaf contributing a constan t to the sum. Physical sp eciﬁcations for tree ensembles are stated as universally quan tiﬁed ﬁrst-order form ulas. V eriﬁcation negates each sp eciﬁcation and inlines the tree-ensemble score as a nested If-Then-Else (ITE) expression. The resulting solv er queries are quan tiﬁer-free formulas in linear real arithmetic with ITE terms (Quantiﬁer-F ree Linear Real Arithmetic, QF_LRA), b ecause every tree split compares a feature x j against a threshold θ and the ensemble output is a sum of constant leaf w eights. The decision pro cedure for QF_LRA is complete, meaning the solver alw a ys terminates. W e use Z3 [22] as our v eriﬁcation engine, though the metho dology applies to any QF_LRA- capable solv er. T r e e ensemble veriﬁc ation. Kantc helian et al. [17] pioneered Mixed Integer Linear Programming (MILP)- based tree ensemble veriﬁcation for computing optimal adversarial examples. Chen et al. [4] recast robustness v eriﬁcation as a max-clique problem, achieving speedups o ver MILP . Dev os et al. [8] dev eloped V eritas for generic veriﬁcation tasks b ey ond adv ersarial robustness, and Törnblom and Nadjm-T ehrani [28] built V oTE for safet y-critical autonomous systems. Sato et al. [27] prop osed metho ds to extract violation ranges for constructing input ﬁlters. On the explanation side, Ignatiev et al. [16] developed formal ab ductiv e explanations (subset-minimal feature sets guaranteeing a prediction), extended by Ignatiev et al. [15] to MaxSA T enco dings. V arshney et al. [29] combined MILP and SMT into a data-aw are sensitivity framework scaling to 800-tree ensem bles. All prior w ork fo cused on adversarial robustness or generic sensitivity . No study has applied formal v eriﬁcation to chec k domain-sp eciﬁc physical speciﬁcations, which is the contribution of this pap er. The key insigh t is that tree ensembles partition the input space into ﬁnitely many h yp errectangular regions, each assigned a constant output. Negating a physical sp eciﬁcation and inlining this partition yields a decidable QF_LRA form ula. 2.4. Positioning: formal veriﬁc ation as c omplement to c onstr aints Hard constraints pr event violations during training by restricting the hypothesis space, but provide no diagnostic information ab out where the unconstrained mo del w ould hav e failed. F ormal veriﬁcation audits a trained mo del b y chec king whether its learned function satisﬁes sp eciﬁed prop erties, returning either a pro of or a concrete coun terexample. These t wo capabilities comp ose into a verify-ﬁx-v erify lo op. 1. T rain an unconstrained mo del, optimizing purely for accuracy . 2. V erify physical sp eciﬁcations formally . The solver pro vides concrete coun terexamples for each violation. 3. Analyze coun terexamples to determine which constraints or corrections are needed. 5 4. Apply targeted ﬁxes (monotone constraints, shape function editing, or retraining with mo diﬁed hyper- parameters) guided b y the counterexample analysis. 5. Re-v erify to conﬁrm compliance or to discov er residual violations that require additional iteration. This lo op pro duces mo dels whose physical consistency is not assumed but pr oven . The key distinction from simply applying constraints a priori is that veriﬁcation reveals whether the constraints are suﬃcien t. As we demonstrate in Section 5.2, monotone constraints ﬁx Sp eciﬁcation B (single-feature monotonicity) but leav e Sp eciﬁcations A, C, and D (threshold and compound conditions) unsatisﬁed, a failure that would remain undetected without formal v eriﬁcation. 3. Problem formulation 3.1. Dataset W e use the 2011 Christc hurc h earthquak e lateral spreading dataset [9], comprising 7,291 sites (42% spreading, 58% not) classiﬁed by remote sensing displacement measurements exceeding 0.3 m. The four features (T able 1) each hav e clear physical relationships to lateral spreading that domain exp erts can articulate as sp eciﬁcations. T able 1: Input features for lateral spreading classiﬁcation. F eature Description Range Unit GWD Groundw ater depth 0.37–6.05 m L Distance to nearest free-face 0–3.29 km Slop e Ground surface slop e 0–10.92 % PGA P eak ground acceleration 0.33–0.57 g 3.2. Physic al sp e ciﬁc ations as lo gic al formulas W e formalize four ph ysical sp eciﬁcations that any ML lateral spreading mo del should resp ect. The four sp eciﬁcations are c hosen to exercise structurally diﬀerent formula classes (single-feature threshold, monotonic- it y , and comp ound multi-feature conditions) so that the veriﬁcation framework is tested against the range of constraint t yp es that arise in practice. Our contribution is the veriﬁcation framework, not the particular rules. Let the input v ector b e x = ( g, ℓ, s, p ) ∈ X ⊆ R 4 , (2) where g is groundwater depth (GWD), ℓ is distance to free face ( L ), s is ground slope, and p is peak ground acceleration (PGA). Let f : X → R denote the mo del’s logit (log-o dds) output, so that the predicted class is ˆ y ( x ) = 1 [ f ( x ) > 0] . T able 2 summarizes the notation and logical symbols used throughout. Each 6 T able 2: Notation and logical symbols. Sym b ol Meaning x = ( g, ℓ, s, p ) Input v ector (GWD, distance, slop e, PGA) X ⊆ R 4 Bounded input domain f ( x ) Mo del logit (log-o dds) output ˆ y ( x ) Predicted class: 1 [ f ( x ) > 0] ∀ Univ ersal quantiﬁcation (“for all”) ∃ Existen tial quantiﬁcation (“there exists”) ¬ Logical negation (“not”) ⇒ Logical implication (“if . . . then”) ∧ Logical conjunction (“and”) ∨ Logical disjunction (“or”) V Conjunction o ver a set (“and” for all elements) ∈ Set mem b ership (“b elongs to”) ≤ , > Arithmetic comparison sat Satisﬁable: counterexample exists (sp eciﬁcation violated) unsat Unsatisﬁable: no counterexample (sp eciﬁcation prov en) sp eciﬁcation is a universally quantiﬁed ﬁrst-order formula ov er the b ounded domain X . V eriﬁcation reduces eac h to a quantiﬁer-free satisﬁability query after inlining the mo del as an ITE expression (Section 3.3). The sp eciﬁcations enco de ne c essary conditions from geotechnical domain knowledge. Satisfying them do es not guarantee correctness in all resp ects. The sp eciﬁc thresholds are calibrated to the Christch urc h dataset and encode site-sp eciﬁc engineering b ounds rather than universal ph ysical laws. Practitioners applying this framew ork elsewhere should recalibrate thresholds to lo cal conditions. Sp e ciﬁc ation A (W ater table depth). Sites with deep groundwater should not b e predicted as lateral spreading. Geyin et al. [10] found that surface manifestation of liquefaction b ecomes negligible for GWD ab o ve 3–4 m, and w e adopt a conserv ativ e 5 m threshold. ∀ ( g , ℓ, s, p ) ∈ X : g > 5 . 0 ⇒ f ( g , ℓ, s, p ) ≤ 0 (3) Sp e ciﬁc ation B (PGA monotonicity). Increasing PGA should not decrease predicted risk, a ﬁrst-principles consequence of soil dynamics that holds regardless of other features. ∀ ( g , ℓ, s, p 1 ) , ( g , ℓ, s, p 2 ) ∈ X : p 1 ≤ p 2 ⇒ f ( g , ℓ, s, p 1 ) ≤ f ( g , ℓ, s, p 2 ) (4) P er-feature monotone constrain ts can directly enforce Speciﬁcation B by requiring p to ha v e a non-decreasing marginal eﬀect during tree construction. The same mechanism cannot enforce Speciﬁcation A, because 7 monotone constraints con trol the dir e ction of a feature’s marginal eﬀect but not the absolute output v alue at a given threshold. A mo del with a monotonically decreasing g eﬀect can still pro duce a p ositive logit at g = 5 . 01 m if the c om bined contributions from ℓ , s , and p are suﬃciently large. Sp e ciﬁc ation C (Distanc e safety). Sites far from a free face exp eriencing weak shaking should not b e predicted as lateral spreading. The 2.5 km threshold exceeds the maxim um documented lateral spread exten t in Christc hurc h, and the 0.35 g threshold falls b elow conﬁrmed spreading PGA v alues (minimum 0.37 g). ∀ ( g , ℓ, s, p ) ∈ X : ( ℓ > 2 . 5 ∧ p < 0 . 35) ⇒ f ( g , ℓ, s, p ) ≤ 0 (5) This sp eciﬁcation constrains tw o features simultaneously , and no single-feature monotone constraint can ex- press the conjunction. Only 4 training observ ations fall within the premise region, none exhibiting spreading. Sp e ciﬁc ation D (Flat-gr ound safety). Sp eciﬁcation D restricts the same safety conclusion to a smaller, more ph ysically certain subregion b y adding the requiremen t that the ground surface is essen tially ﬂat ( s < 0 . 1 %), eliminating the gen tly-sloping-ground driving mechanism. ∀ ( g , ℓ, s, p ) ∈ X : ( s < 0 . 1 ∧ ℓ > 2 . 5 ∧ p < 0 . 35) ⇒ f ( g, ℓ, s, p ) ≤ 0 (6) Only 1 observ ation in the training data satisﬁes the premise, and it is non-spreading. Speciﬁcation D is not logically indep enden t of Sp eciﬁcation C. Because its premise implies that of Sp eciﬁcation C ( s < 0 . 1 ∧ ℓ > 2 . 5 ∧ p < 0 . 35 ⇒ ℓ > 2 . 5 ∧ p < 0 . 35 ), an y mo del satisfying Sp eciﬁcation C automatically satisﬁes Sp eciﬁca- tion D. Sp eciﬁcation D is therefore a weak er requirement, included as a diagnostic. When a mo del violates Sp eciﬁcation C, Sp eciﬁcation D determines whether the violation p ersists even in the smaller ﬂat-ground subregion where the driving conditions are w eakest. T able 3 summarizes the four sp eciﬁcations and their structural classes. Standard training-time monotone constraints can directly enforce only Sp eciﬁcation B. T able 3: Sp eciﬁcation taxonomy . T raining-time monotone constraints can directly enforce only the monotonicity class (B). Threshold (A) and comp ound (C, D) speciﬁcations require formal v eriﬁcation to c heck. Sp ec. Name T yp e Monotone constrain t suﬃcient? A W ater table depth Single-feature threshold No B PGA monotonicity Single-feature monotonicit y Y es C Distance safety Comp ound multi-feature No D Flat-ground safety Comp ound multi-feature No W e verify all four sp eciﬁcations ov er X = [0 . 37 , 6 . 05] × [0 , 3 . 29] × [0 , 10 . 5] × [0 . 33 , 0 . 57] ( g × ℓ × s × p ), whic h rounds the observed data ranges to clean b ounds and excludes tw o outliers (one with Slop e = 10 . 9 %, one with PGA = 0 . 329 g), cov ering 7,289 of 7,291 observ ations (99.97%). V eriﬁcation guarantees apply only within X ; mo del b eha vior outside these b ounds is extrapolation. 8 3.3. SMT enc o ding of tr e e ensembles A gradien t-b o osted tree ensemble with T trees computes the logit as f ( x ) = β 0 + T X t =1 h t ( x ) (7) where β 0 is a base score and each tree h t maps features to a leaf weigh t via binary splits. W e enco de each tree recursiv ely as an SMT If-Then-Else (ITE) expression: Encode ( n ) =        w n if n is a leaf with w eight w n If  x j n < θ n , Encode ( n. left ) , Encode ( n. right )  if n is in ternal (8) The full ensemble logit is f ( x ) = β 0 + P T t =1 Encode ( ro ot t ) . F or example, a depth-2 tree with ro ot split on PGA at 0.42 g (Figure 1a) b ecomes h 1 ( g , ℓ, s, p ) = If  p < 0 . 42 , If ( s < 0 . 05 , w 1 , w 2 ) , If ( p < 0 . 48 , w 3 , w 4 )  (9) Eac h leaf corresp onds to a conjunction of linear inequalities deﬁning a p olyhedral region of the input space (e.g., leaf w 1 is reached when p < 0 . 42 ∧ s < 0 . 05 ). Summing T such ITE expressions yields the ensem ble logit f ( g , ℓ, s, p ) = β 0 + P t h t ( g , ℓ, s, p ) . F or our 100-tree, depth-5 mo del, this produces a single symbolic expression o ver four v ariables with 2,380 leaf terms. Figure 1 illustrates the enco ding. Eac h root-to-leaf path b ecomes a conjunction of linear inequalities (left), and the full ensemble partitions the input space into regions of constan t output (right, ov er g × p ), eac h corresp onding to a unique leaf com bination across all 100 trees. Because this partition is ﬁnite, the solv er reasons ov er all regions in a single query . T w o implemen tation details are critical. First, the base score β 0 is stored separately from the tree dump and m ust be included. Omitting it shifts every logit by a constant oﬀset. Second, X GBo ost uses 32-bit ﬂoats while SMT solvers use exact arithmetic, so we cast all thresholds and input features to float32 to preven t precision mismatches. Every counterexample was v alidated against the actual XGBoost mo del to conﬁrm exact logit agreemen t. T o verify a sp eciﬁcation, we assert its ne gation . Eac h sp eciﬁcation has the form ∀ x ∈ X : P ( x ) . By the logical equiv alence ¬ ( ∀ x : P ( x )) ≡ ∃ x : ¬ P ( x ) , “it is not the case that P holds for all x ” is the same as “there exists some x where P fails.” A sp eciﬁcation is therefore violated if and only if suc h a coun terexample exists. The solver does not en umerate inputs. Instead, for Sp eciﬁcation A (“for all inputs with GWD > 5 m, the mo del should not predict spreading”), the solver asks whether there exists an input where GWD > 5 m and the mo del predicts spreading. F or Sp eciﬁcation C, the negated query is ∃ ( g , ℓ, s, p ) ∈ X : ℓ > 2 . 5 ∧ p < 0 . 35 ∧ f ( g, ℓ, s, p ) > 0 (10) After inlining f as an ITE expression, this becomes a quan tiﬁer-free QF_LRA formula ov er four free real v ariables with domain constraints. If the solv er returns sat , the satisfying assignmen t (input) is a counterex- ample and the sp eciﬁcation is violated. If unsat , no suc h input exists and the speciﬁcation holds univ ersally . 9 PGA (g) < 0 . 42 Slope (%) < 0 . 05 PGA (g) < 0 . 48 w 1 w 2 w 3 w 4 Y es No T r e e 1 of 100 f ( x ) = β 0 + 100 X t =1 ITE t ( x ) each split → If ( x i < τ , left , right ) encode Negated Sp eciﬁcation B ∃ ( g , ℓ, s, p 1 ) , ( g , ℓ, s, p 2 ) ∈ X : p 1 < p 2 ∧ f ( g , ℓ, s, p 2 ) < f ( g, ℓ, s, p 1 ) conjoin ¬ spec ϕ ∧ ¬ sp ec SMT solv er SA T → counter-example UNSA T → sp ec. prov en (a) V eriﬁcation pip eline. An XGBoost tree (left) is enco ded as a nested SMT If-Then-Else expression. The negation of a physical speciﬁcation is conjoined with the formula. If the solv er returns SA T, it pro duces a concrete counter-example; if UNSA T, the speciﬁcation is prov en to hold o ver the entir e input domain. 1 2 3 4 5 6 GWD (m) 0.35 0.40 0.45 0.50 0.55 PGA (g) Specification A (GWD > 5 m) 0.00 0.12 0.24 0.36 0.48 0.60 0.72 0.84 0.96 P(Lateral Spreading) (b) The resulting input space partition over g and p (with ℓ = 0 . 5 km, s = 2 %), colored by predicted lateral spreading probability . Each colored region corresp onds to a unique leaf combination across all 100 trees. The dashed line marks the Sp eciﬁcation A threshold ( g > 5 m). The SMT solver reasons ov er all regions simultaneously . Figure 1: SMT enco ding of an XGBoost gradient-bo osted ensem ble. Each of the 100 trees (a) is enco ded as a nested If-Then- Else expression ov er real-v alued leaf weigh ts; the ensemble sum of all trees partitions the input space into regions of constant predicted probability (b). 10 4. Metho dology 4.1. T r e e-to-SMT enc o ding W e parse X GBo ost’s JSON tree dump, which represen ts eac h tree as a nested dictionary of in ternal no des (with split feature, threshold, and child pointers) and leaf nodes (with weigh t v alues). Algorithm 1 describes the recursiv e enco ding. Algorithm 1 Recursive enco ding of a decision tree as an SMT expression. 1: function EncodeTree (no de, features) 2: if no de is leaf then 3: return RealVal ( float32 ( no de.w eigh t )) 4: end if 5: j ← node.split_feature 6: θ ← float32 ( node.threshold ) ▷ Matc h XGBoost arithmetic 7: left ← EncodeTree ( no de.y es_child , features ) 8: right ← EncodeTree ( node.no_child , features ) 9: return If ( features [ j ] < θ , left , right ) 10: end function F or an ensemble of T trees, the logit expression is logit = β 0 + T X t =1 EncodeTree ( tree t . ro ot , features ) (11) The resulting expression is a deeply nested ITE term. A mo del with 100 trees of depth 5 pro duces up to 100 × 2 5 = 3 , 200 leaf terms (2,380 in our mo del), though the solver’s internal simpliﬁcation reduces this substan tially b efore solving. 4.2. EBM-to-SMT enc o ding EBMs decomp ose the prediction in to univ ariate shap e functions and pairwise in teractions, each stored as a lo okup table ov er learned bin edges. Eac h univ ariate function b ecomes a chain of ITE expressions ov er its bin edges: f i ( x i ) = If ( x i < b 1 , s 1 , If ( x i < b 2 , s 2 , . . . s n i )) (12) A djacent bins with identical scores are merged, compressing 4,043 univ ariate bins to approximately 2,900 ITE no des. Pairwise interactions b ecome nested tw o-dimensional ITE expressions (approximately 23,400 no des for s ix interaction terms). The full EBM logit is f ( x ) = β 0 + P i f i ( x i ) + P i 5 ∧ ( h 1 + h 2 ) > 0 . The solver propagates g > 5 (and therefore g ≥ 2 . 5 ) in to tree 1’s right branch, yielding either − 0 . 1 or 0 . 05 . If tree 1 outputs 0 . 05 ( p ≥ 0 . 42 ) and tree 2 outputs 0 . 2 ( ℓ < 1 . 0 ), the logit is 0 . 25 > 0 , so the solver returns sat with a coun terexample (e.g., g = 5 . 5 , ℓ = 0 . 5 , p = 0 . 45 ). The solv er reasons sym b olically o ver the entire piecewise-constan t structure, c hecking every region sim ul- taneously via constrain t propagation. F or our 100-tree mo del, each speciﬁcation c heck typically completes in under 5 seconds. 4.4. V eriﬁc ation pr o c e dur e F or Sp eciﬁcations A, C, and D, the solver chec ks a single query by asserting the sp eciﬁcation’s negation along with physical range constraints, and returns sat with a v alidated counterexample or unsat (sp eciﬁcation pro ven). Sp eciﬁcation B (monotonicit y) is more exp ensiv e b ecause it compares the mo del at tw o inputs diﬀering only in PGA. A direct enco ding declares t w o feature vectors, constrains non-PGA features to be equal, and asserts that higher PGA pro duces lo w er logit, doubling the formula size and requiring 480 seconds for the monotone mo del. W e resolv e this cost by leveraging the additive ensemble structure. Prop osition 1 (Additiv e decomposition for monotonicity) . L et f ( x ) = β 0 + P M m =1 g m ( x ) b e an additive mo del. If e ach c omp onent g m is monotonic al ly non-de cr e asing in fe atur e x j , then f is monotonic al ly non- de cr e asing in x j . Pr o of. Fix inputs x , x ′ diﬀering only in x j with x j < x ′ j . By hypothesis, g m ( x ) ≤ g m ( x ′ ) for each m = 1 , . . . , M . Summing gives P m g m ( x ) ≤ P m g m ( x ′ ) ; adding β 0 to b oth sides gives f ( x ) ≤ f ( x ′ ) . Prop osition 1 reduces ensemble monotonicity to T indep endent p er-tree queries. W e verify Sp eciﬁcation B b y c hec king each of the 100 trees individually . Each query inv olv es a single tree rather than the full ensemble, and every query completes in under 1 second. All 100 p er-tree chec ks return unsat , proving Sp eciﬁcation B for the full ensem ble by Prop osition 1. This p er-tree strategy yields a complete formal proof, not an empirical appro ximation. 4.5. V erify-ﬁx-verify lo op V eriﬁcation results guide iterative mo del impro vemen ts: (1) train an unconstrained mo del (82.5% accu- racy); (2) verify all sp eciﬁcations (all four violated); (3) retrain with monotone_constraints (GWD − , L − , 12 PGA + ); (4) re-v erify (Speciﬁcation B pro v en via p er-tree decomposition, but Sp eciﬁcations A, C, and D still violated at 69.3% accuracy). The lo op reveals that monotone constrain ts are nece ssary but not suﬃcient; coun terexamples pinp oin t where comp ound violations p ersist. 4.6. F ormal ab ductive explanations The SMT enco ding also enables computing formal ab ductive explanations , the minimal subset of features that, when ﬁxed to a sample’s v alues, guaran tees the prediction regardless of ho w the remaining features v ary . F or a sample x ∗ predicted as p ositiv e ( f ( x ∗ ) > 0 ), a subset S ⊆ { 1 , . . . , d } is a suﬃcient reason if ∀ x ∈ X :  ^ j ∈ S x j = x ∗ j  ⇒ f ( x ) > 0 (15) W e compute this iteratively , starting from the full feature set and testing whether removing eac h feature preserv es the ab ov e prop ert y via SMT queries. If removing feature j yields unsat (meaning no counterexample exists even without ﬁxing j ), the feature is unnecessary and can be dropp ed. If sat , some input diﬀering only in j c hanges the prediction, so j is formally necessary . Comparing these formal explanations with SHAP’s top- k features (where k matches the formal explanation size) measures ho w faithfully SHAP represents the mo del’s actual decision logic. 5. Results 5.1. Mo del tr aining W e train four model families on the C hristc h urch dataset (80/20 train-test split, 5,832/1,459 sites): (1) unconstrained X GBo ost, (2) monotone-constrained XGBoost (GWD − , L − , PGA + ), (3) unconstrained EBM, and (4) monotone-constrained EBM. F or X GBoost, we v ary depth d ∈ { 3 , 5 , 7 , 9 } and estimators n ∈ { 20 , 50 , 100 } (24 v ariants); for EBMs, interaction terms i ∈ { 0 , 3 , 5 , 10 } (9 v arian ts including the fully constrained mo del), yielding 33 total. The baseline unconstrained X GBoost ( d = 5 , n = 100 ) achiev es 82.5% test accuracy; 10-fold cross- v alidation conﬁrms s tabilit y ( 83 . 2 ± 1 . 4 % unconstrained, 67 . 4 ± 1 . 0 % monotone). The b est unconstrained EBM ( i = 10 ) achiev es 80.1% with a smaller train-test gap, reﬂecting the GAM structure’s inheren t regularization. The SHAP b eeswarm p lot for XGBoost (Figure 2, left) reveals that high PGA exhibits both p ositiv e and negative SHAP v alues, the non-physical pattern iden tiﬁed by Hsiao et al. [11]. EBM feature imp ortance (Figure 2, right) ﬂags a similar concern, but neither visualization can quantify ho w extensively the mo del violates ph ysical exp ectations across its full input domain. The EBM shap e functions (Figure 3) reveal the source of these violations. The unconstrained PGA curve exhibits a dip b eyond 0.45 g where increasing PGA decreases predicted log-odds, and the PGA-monotone constrain t eliminates this artifact. 13 4 3 2 1 0 1 2 3 SHAP value (impact on log-odds) Slope (%) GWD (m) L (km) PGA (g) (a) XGBoost (unconstrained) 0 2 4 6 8 10 12 14 Shape function range (log-odds) L (km) & PGA (g) L (km) L (km) & Slope (%) GWD (m) & L (km) GWD (m) GWD (m) & PGA (g) PGA (g) GWD (m) & Slope (%) Slope (%) & PGA (g) Slope (%) (b) EBM (unconstrained) F eature importance Low High F eature value Figure 2: Global feature imp ortance. Left: SHAP b eesw arm plot for unconstrained XGBoost. Each dot represents a test sample; color indicates feature v alue (red = high, blue = low). PGA shows b oth p ositiv e and negative SHAP v alues, indicating non- monotonic behavior. Right: EBM feature imp ortance measured b y shap e function range (maximum min us minimum log-o dds contribution). 1 2 3 4 GWD (m) 4 3 2 1 0 1 Unconstrained EBM Score (log-odds) GWD (m) 0.0 0.5 1.0 1.5 2.0 2.5 3.0 L (km) 4 2 0 2 4 L (km) 0 2 4 6 8 Slope (%) 0.5 0.0 0.5 1.0 1.5 Slope (%) 0.35 0.40 0.45 0.50 0.55 PGA (g) 3 2 1 0 1 PGA (g) Non-monotone 1 2 3 4 GWD (m) 2.0 1.5 1.0 0.5 0.0 0.5 1.0 PGA-monotone EBM Score (log-odds) 0.0 0.5 1.0 1.5 2.0 2.5 3.0 L (km) 3 2 1 0 1 2 0 2 4 6 8 Slope (%) 0.2 0.0 0.2 0.4 0.6 0.8 1.0 1.2 0.35 0.40 0.45 0.50 0.55 PGA (g) 0.8 0.6 0.4 0.2 0.0 0.2 0.4 EBM Shape F unctions: Unconstrained vs PGA-Monotone Figure 3: EBM shap e functions for all four features. T op row: unconstrained EBM. Bottom row: PGA-monotone EBM. The PGA shap e function (righ tmost column) shows non-physical b eha vior in the unconstrained mo del, sp eciﬁcally a dip near 0.42 g where increasing PGA decreases predicted log-o dds. The monotone constraint eliminates this artifact. 14 5.2. Sp e ciﬁc ation veriﬁc ation r esults W e veriﬁed all four sp eciﬁcations against each mo del family (T able 4); the following paragraphs discuss eac h mo del in detail, with coun terexample diagnostics in Figure 4. T able 4: Sp eciﬁcation veriﬁcation results. XGBoost results are full SMT pro ofs (Z3). EBM results use SMT for Sp eciﬁcations A, C, and D; Sp eciﬁcation B uses p er-term score-array decomp osition (Proposition 1). The solver chec ks the ne gation of each speciﬁcation: if satisﬁable, a counterexample exists and the sp eciﬁcation is violated ( ✗ ); if unsatisﬁable, no counterexample exists and the sp eciﬁcation is prov en ( ✓ ). All coun terexamples are v alidated against the actual mo del. Mo del A (W ater table) B (PGA mono.) C (Distance) D (Flat-ground) T est Acc. X GBo ost (unconstrained) ✗ ✗ ✗ ✗ 82.5% X GBo ost (monotone) ✗ ✓ ✗ ✗ 69.3% EBM (unconstrained) ✗ ✗ ✗ ✗ 80.1% EBM (PGA monotone) ✗ ✓ ✗ ✗ 70.5% EBM (fully constrained) ✗ ✓ ✓ ✓ 67.2% 0.35 0.40 0.45 0.50 0.55 PGA (g) 0.0 0.2 0.4 0.6 0.8 P(Lateral Spreading) (a) Specification B: PGA Monotonicity Low risk (GWD 1.5 m, L 0.5 km) Medium risk (GWD 3.0 m, L 1.5 km) High risk (GWD 4.0 m, L 2.5 km) 0.0 0.5 1.0 1.5 2.0 2.5 3.0 L (km) 0.35 0.40 0.45 0.50 0.55 PGA (g) Specification C region (b) Specification C: Distance vs PGA 1 2 3 4 5 6 GWD (m) 0.0 0.2 0.4 0.6 0.8 P(Lateral Spreading) (c) Specification A: W ater T able Depth Low PGA (0.35 g) Moderate PGA (0.42 g) Strong PGA (0.50 g) Severe PGA (0.55 g) Specification A threshold 0.00 0.15 0.30 0.45 0.60 0.75 0.90 P(Spreading) Figure 4: Sp eciﬁcation violation diagnostics for unconstrained XGBoost. Left: PGA monotonicity (Speciﬁcation B) for three soil proﬁles; red dots mark non-monotone regions. Center: contour of predicted spreading probability ov er distance ( L ) and PGA, with Speciﬁcation C safe region marked. Righ t: predicted probability v ersus groundw ater depth for sev eral PGA levels, with Sp eciﬁcation A threshold at 5 m. Unc onstr aine d X GBo ost. The solv er returns sat for all four speciﬁcations (eac h within 1 second). The Sp eciﬁcation A coun terexample (GWD = 5 . 001 m, logit = +0 . 065 , 52%) predicts spreading despite deep groundw ater. The Speciﬁcation B counterexample shows increasing PGA from 0.42 g to 0.52 g de cr e asing the logit by 0.15 units, conﬁrming the SHAP-based observ ation of Hsiao et al. [11] by exhaustive formal pro of. The Sp eciﬁcation C counterexample ( L = 2 . 501 km, PGA = 0 . 33 g, logit = +0 . 169 , 54%) and Sp eciﬁcation D coun terexample (Slop e = 0 . 05 %, L = 2 . 501 km, logit = +0 . 169 , 54%) predict spreading in regions where the adopted screening sp eciﬁcations exclude the usual driving conditions and the training data con tain no observ ed spreading cases (Figure 5b). 15 f j f i lb j ub j lb i ub i (a) f j f i lb j ub j lb i ub i × (b) Figure 5: Sc hematic of t wo failure mo des in a t w o-dimensional feature subspace deﬁned b y bounds lb and ub . (a) No training data fall in the green region; the model interpolates without empirical supp ort. (b) T raining observ ations exist in the red region but are exclusiv ely negativ e-class ( ◦ ), y et the model predicts y = 1 at the SMT coun terexample ( × ). Sp eciﬁcation D violations correspond to case (b). Monotone-c onstr aine d X GBo ost. Monotone constraints drop accuracy to 69.3% ( − 13.2 pp). The Sp eciﬁ- cation A counterexample (GWD = 5 . 014 m, L = 0 . 011 km, PGA = 0 . 33 g, logit = +0 . 221 ) shows that p er-feature constrain ts cannot enco de the comp ound reasoning that deep groundwater should ov erride prox- imit y and shaking. Sp eciﬁcation B is pro ven in tw o indep enden t w ays, p er-tree decomp osition (100 trees, 0.6 s total) and direct dual-ensem ble veriﬁcation ( unsat in 480 s), but Sp eciﬁcations C and D still return sat , conﬁrming that monotone constraints ﬁx single-feature violations while lea ving comp ound speciﬁcations unsatisﬁed. EBM variants. The unconstrained EBM violates all four sp eciﬁcations. Sp eciﬁcation A is violated (GWD = 5 . 000 m, L = 0 . 005 km, logit = +1 . 68 , 84%), despite the EBM’s additive structure, b ecause the combined p ositiv e con tributions from proximit y ( L ), slop e, and PGA outw eigh the GWD shap e function’s negative con tribution at deep water table depths. Sp eciﬁcation B is violated (272 non-monotone segments in the PGA shap e function). Sp eciﬁcation C is violated ( L = 2 . 52 km, PGA = 0 . 33 g, logit = +0 . 13 , 53%); this query requires the sp ecialized QF_LRA solv er and completes in 1,331 s. Sp eciﬁcation D is violated (Slop e = 0 . 001 %, L = 2 . 72 km, logit = +0 . 03 , 51%). The PGA-monotone EBM prov es Sp eciﬁcation B via p er-term decomposition but still violates Sp eciﬁ- cation A (GWD = 5 . 00 m, L = 0 . 005 km, logit = +0 . 14 , 53%), with accuracy dropping 9.6 pp (80.1% → 70.5%). F ul ly c onstr aine d EBM. T o test whether more sp eciﬁcations can be satisﬁed sim ultaneously , we train an EBM with monotone constrain ts on all four features (GWD decreasing, L decreasing, Slope increasing, PGA increasing), enforcing the ph ysical directions for every feature. This mo del prov es three of four speciﬁcations, 16 namely Sp eciﬁcation B (PGA monotonicity , via p er-term decomp osition), Sp eciﬁcation C (distance safet y), and Speciﬁcation D (ﬂat-ground safety), the latter t wo prov en b y the SMT solv er returning unsat . Ho wev er, Sp eciﬁcation A (w ater table depth) is violated b y a thin margin; the solver ﬁnds a coun terexample at GWD = 5 . 003 m, L = 0 . 003 km, Slope = 2 . 5 %, PGA = 0 . 46 g, with a predicted probability of 50.3% (logit = +0 . 012 ). Even with monotonicity constraints on all four features, the mo del cannot guarantee Sp eciﬁcation A b ecause monotonicity enforces the dir e ction of a feature’s eﬀect, not an absolute threshold on the output. The accuracy cost of full constraint is 12.9 p ercen tage p oints (80.1% → 67.2%), but the mo del ac hieves the highest sp eciﬁcation compliance of any v ariant tested. Figure 6 illustrates the eﬀect. F or a ﬁxed soil proﬁle, unconstrained XGBoost and EBM exhibit clear non-monotone PGA regions, while the constrained v ariants pro duce strictly increasing curves. 0.35 0.40 0.45 0.50 0.55 PGA (g) 0.0 0.2 0.4 0.6 0.8 1.0 P(Lateral Spreading) (a) XGBoost Non-monotone 0.35 0.40 0.45 0.50 0.55 PGA (g) (b) EBM (no constraints) Non-monotone 0.35 0.40 0.45 0.50 0.55 PGA (g) (c) EBM (PGA monotone) 0.35 0.40 0.45 0.50 0.55 PGA (g) (d) EBM (full constraints) GWD = 1.5 m, L = 0.5 km, Slope = 2.0% Figure 6: PGA monotonicity comparison across mo del families for a ﬁxed soil proﬁle (GWD = 1.5 m, L = 0.5 km, Slop e = 2.0%). Red dots mark non-monotone regions where increasing PGA decreases predicted probability . Unconstrained XGBoost and EBM exhibit violations; monotone-constrained varian ts do not. 5.3. Par eto fr ontier: ac cur acy versus physic al c onsistency T o quan tify the accuracy cost of physical consistency , w e trained all 33 mo del v arian ts and measured Sp eciﬁcation B compliance as the fraction of 120 empirically sampled PGA-sw eep proﬁles satisfying mono- tonicit y (Figure 7). This empirical proxy enables broad hyperparameter comparison; formal SMT pro ofs (T able 4) are applied to representativ e mo dels in each family . Unconstrained mo dels cluster at high accuracy (74–84%) with near-zero compliance; constrained mo dels cluster at lo wer accuracy (68–71%) with full compliance. No mo del o ccupies the top-right region where accuracy exceeds 80% and compliance reaches 100%, consisten t with unconstrained mo dels ﬁtting patterns that violate the sp eciﬁcations. T able 5 conﬁrms this pattern across all 33 v arian ts. The 10–15 percentage p oin t accuracy gap p ersists regardless of mo del family or h yp erparameter setting. 5.4. SHAP versus formal explanations SHAP explains how a mo del arrives at a prediction b y decomp osing it into p er-feature contributions. F or- mal veriﬁcation answers a diﬀeren t question, namely do es the prediction comply with physical sp eciﬁcations? W e show that these tw o questions can giv e contradictory signals. 17 0.0 0.2 0.4 0.6 0.8 1.0 Physical Consistency (1 violation rate) 0.60 0.65 0.70 0.75 0.80 0.85 0.90 T est Accuracy Ideal region Accuracy consistency gap XGBoost XGBoost (mono) EBM EBM (mono) Figure 7: Accuracy versus physical consistency (Speciﬁcation B compliance) for 33 model v arian ts. Unconstrained mo dels cluster at high accuracy with zero compliance (upp er-left); constrained models cluster at low er accuracy with full compliance (low er-righ t). The fully constrained EBM (low est green diamond, 67.2%) achiev es full Speciﬁcation B compliance and pro v es 3 of 4 speciﬁcations via SMT. No mo del occupies the ideal top-righ t region. T able 5: Summary statistics for the Pareto frontier, aggregated across hyperparameter v arian ts. Violations and compliance refer to Speciﬁcation B (PGA monotonicity) only , measured on 120 empirically sampled PGA-sweep proﬁles. F ull sp eciﬁcation veriﬁcation results (Sp eciﬁcations A–D) are in T able 4. Mo del family V arian ts A cc. range B violations B compliance X GBo ost (unconstrained) 12 74.0–84.2% 120/120 0% X GBo ost (monotone) 12 68.3–70.0% 0/120 ✓ 100% ✓ EBM (unconstrained) 4 74.6–80.1% 116/120 3.3% EBM (monotone) 4 67.9–70.5% 0/120 ✓ 100% ✓ EBM (fully constrained) 1 67.2% 0/120 ✓ 100% ✓ 18 0.0 0.5 1.0 1.5 2.0 L og- odds (a) Specification A : SHAP waterfall at counter e xample ( G W D = 5 . 6 5 m > 5 m s h o u l d p r e d i c t n o s p r e a d i n g ) GWD = 5.65 Slope = 4.86 PGA = 0.51 L = 1.21 -0.12 +0.52 +0.61 +1.67 E [ f ( x ) ] = - 0 . 3 8 f ( x ) = + 2 . 3 1 ( 9 1 % s p r e a d i n g ) 0.35 0.40 0.45 0.50 0.55 PGA (g) (b) Specification B: SHAP dependence for PGA ( h i g h e r P G A s h o u l d h i g h e r r i s k ) 4 3 2 1 0 1 2 3 SHAP value (log- odds) Figure 8: SHAP analysis of sp eciﬁcation violations (unconstrained XGBoost). (a) W aterfall at a Sp eciﬁcation A counterexample (GWD = 5 . 65 m, predicted 91% spreading). Red bars push tow ard spreading, blue bars push against it. GWD is the feature that violates Sp eciﬁcation A, yet SHAP ranks it last (#4) with the smallest magnitude ( − 0 . 12 ). (b) SHAP dep endence plot for PGA. Each dot is one test sample; color indicates the SHAP v alue (red = pushes tow ard spreading, blue = pushes against). The black line (binned mean) should increase monotonically if the model respects Sp eciﬁcation B; instead, it drops sharply b et ween 0.42–0.52 g (shaded region, red arrows), conﬁrming the non-monotonicity that formal veriﬁcation detects. Figure 8 illustrates this gap using a Sp eciﬁcation A coun terexample. The input has GWD = 5 . 65 m, L = 1 . 21 km, Slop e = 4 . 86 %, PGA = 0 . 51 g. The mo del predicts spreading with 91% conﬁdence (logit = +2 . 31 ), violating Sp eciﬁcation A (GWD > 5 m should not predict spreading). Panel (a) sho ws the standard SHAP waterfall that an engineer w ould insp ect. L contributes +1 . 67 , PGA +0 . 61 , Slop e +0 . 52 , and GWD only − 0 . 12 . An engineer would read this as “close to a free face, strong shaking, steep slop e” and ﬁnd the prediction reasonable. GWD, the feature that makes this prediction a sp eciﬁcation violation, is rank ed last by SHAP with the smallest magnitude. The underlying issue is that SHAP decomp oses predictions, not v alidates them. SHAP correctly rep orts that GWD contributes − 0 . 12 log-o dds at this input, but it ranks the GWD feature last, indicating its minimal impact. In contrast, formal veriﬁcation prov es that any input with GWD > 5 m violates Sp eciﬁcation A, regardless of the con tributions from other features. While SHAP explains the model’s reasoning for a sp eciﬁc input, it cannot detect that the mo del’s ov erall logic violations against a sp eciﬁcation. The same pattern app ears for Speciﬁcation B (Figure 8b). The SHAP dependence plot for PGA sho ws that the mo del’s SHAP v alues drop sharply b et ween 0.42–0.52 g instead of increasing monotonically . While the global dep endence plot makes this visible, individual w aterfall plots w ould not detect the violation, because eac h w aterfall app ears lo cally reasonable. F ormal veriﬁcation detects the violation automatically and returns a concrete coun terexample. 5.5. Counter example analysis Eac h SMT counterexample is a concrete input vector that the solver constructs to violate the sp eciﬁca- tion. V alidating each counterexample against the actual mo del conﬁrms the enco ding’s ﬁdelity; T able 6 lists 19 represen tative counterexamples for the unconstrained XGBoost mo del. T able 6: Representativ e counterexamples from SMT veriﬁcation of unconstrained XGBoost. A p ositive logit (probability > 50 %) means the mo del predicts spreading, violating the sp eciﬁcation. All counterexamples were v alidated against the actual XGBoost model to machine precision. Sp ec. GWD (m) L (km) Slop e (%) PGA (g) Logit Probabilit y A (w ater table) 5.001 1.35 1.58 0.37 +0.065 52% C (distance safet y) 1.73 2.501 0.05 0.33 +0.169 54% D (ﬂat-ground) 1.72 2.501 0.05 0.33 +0.169 54% The Sp eciﬁcation A counterexample (GWD = 5 . 001 m) predicts spreading at 52% probabilit y b ecause the mo del relies on proximit y ( L = 1 . 35 km) while insuﬃciently weigh ting groundwater depth. The Sp eci- ﬁcation C counterexample ( L = 2 . 501 km, PGA = 0 . 33 g) predicts spreading at 54% in a region where the Christc hurc h data con tain no observ ed spreading cases, b ecause the ensem ble accumulates positive con tribu- tions where training data is sparse. Both coun terexamples o ccur in data-sparse b oundary regions, connecting to the ro ot cause identiﬁed by Hsiao et al. [12]. F ormal v eriﬁcation transforms the in tuition that “sparse data causes problems” in to a concrete, v alidated demonstration. 6. Discussion and conclusions F ormal veriﬁc ation versus empiric al testing. Grid-base d testing can detect violations but cannot certify their absence. T able 7 shows that uniform grid search with up to n = 200 p oints p er feature (1.6 billion ev aluations) ﬁnds zer o Sp eciﬁcation C violations, while the SMT solver ﬁnds a counterexample in 0.19 s. The violation region o ccupies a thin sliver ( L ∈ [2 . 500 , 2 . 505] km, Slop e < 0 . 05 %) smaller than a single grid step. When the solv er returns unsat , the sp eciﬁcation holds univ ersally , a guarantee qualitatively diﬀerent from any ﬁnite-sample test. Sc alability. XGBoost veriﬁcation completes in under 1 s p er sp eciﬁcation for single-ensemble queries. P er- tree decomp osition reduces monotonicity pro ofs by orders of magnitude compared to the direct dual-ensemble enco ding. EBM veriﬁcation is slo w er due to larger enco dings; the most expensive query requires a specialized QF_LRA solver. Higher-dimensional mo dels will require abstraction techniques [29, 17] to main tain practical run times. Comp ound c onstr aints and c onstr aint inter actions. Monotone constraints enforce the direction of a feature’s marginal eﬀect but cannot guaran tee threshold behavior (Speciﬁcation A) or multi-feature conjunctions (Sp eciﬁcation C). Progressiv ely adding constraints improv es compliance, but some sp eciﬁcations resist full resolution because monotonicity enforces the dir e ction of a feature’s eﬀect, not an absolute threshold on the output. Promising extensions include coun terexample-guided data augmentation and p ost-ho c output clamping; this pap er establishes the diagnostic prerequisite by identifying what fails and wher e . 20 T able 7: Grid search v ersus SMT v eriﬁcation for Sp eciﬁcations C and D (unconstrained X GBo ost). Grid search ev aluates the model on n 4 uniformly spaced points ov er X . The SMT solver encodes the mo del symbolically and reasons ov er the en tire contin uous domain. Sp eciﬁcation D adds a Slop e < 0 . 1 % constraint, which further reduces the num ber of grid p oin ts in the premise region. Metho d P oints ev al. Premise (C) Premise (D) Time Violated? Grid ( n = 30 ) 810,000 18,900 630 < 0.1 s None found Grid ( n = 50 ) 6,250,000 150,000 3,000 0.1 s None found Grid ( n = 100 ) 100,000,000 2,160,000 21,60 0 1.7 s None found Grid ( n = 150 ) 506,250,000 10,530,000 140,400 17 s None found Grid ( n = 200 ) 1,600,000,000 32,640,000 326,400 184 s None found SMT (C) entir e c ontinuous domain 0.19 s Y es SMT (D) entir e c ontinuous domain 0.7 s Y es Why not emb e d har d c onstr aints and skip veriﬁc ation. A v ailable constraint mechanisms op erate p er-feature and cannot encode comp ound conditions (Speciﬁcation C) or threshold guaran tees (Sp eciﬁcation A). Our results conﬁrm that even the most constrained mo del cannot satisfy all sp eciﬁcations. SMT counterexam- ples pinp oint wher e a mo del fails, providing actionable information that a pass/fail constrain t ﬂag cannot. F ormal veriﬁcation therefore complements hard constraints. Constrain ts restrict the hypothesis space, while v eriﬁcation determines whether the resulting trained mo del satisﬁes the sp eciﬁcation. Deterministic versus pr ob abilistic guar ante es. The SMT solver pro ves that a speciﬁcation holds for al l inputs or provides a single counterexample where it fails. This deterministic guarantee applies to the mo del , not to the physical system; it sa ys nothing about whether f correctly predicts the real-world outcome, which dep ends on data quality , feature suﬃciency , and epistemic uncertaint y . A natural extension would combine formal v eriﬁcation with probabilistic calibration (e.g., conformal prediction), providing complemen tary guar- an tees of physical consistency and w ell-calibrated uncertaint y . Data-a ware v eriﬁcation [29], which constrains coun terexamples to remain near the training distribution, oﬀers a middle ground. Sp e ciﬁc ation sele ction and sc op e. The four speciﬁcations exercise structurally diﬀerent formula classes, in- cluding a single-feature threshold (Speciﬁcation A), monotonicity (Sp eciﬁcation B), and compound multi- feature screening conditions (Sp eciﬁcations C and D). This distinction matters b ecause standard monotone constrain ts can directly enforce only Sp eciﬁcation B. A pro duction sp eciﬁcation library would b e richer, in- corp orating CPT-based susceptibility , depth-weigh ted liquefaction p oten tial, or magnitude-duration eﬀects, but each new sp eciﬁcation b ecomes an independent SMT query with no framew ork changes. This study is methodological. It demonstrates that SMT-based formal veriﬁcation can c heck tree ensem bles against geotec hnical sp eciﬁcations, rather than establishing a new predictor or regional mo del. 21 T able 8: Threshold sensitivity for Speciﬁcation A (water table depth) against unconstrained XGBoost. The mo del violates Speciﬁcation A at every GWD threshold. GWD threshold (m) Result Coun terexample logit 3.0 Violated +0 . 131 3.5 Violated +1 . 405 4.0 Violated +0 . 062 4.5 Violated +0 . 366 5.0 Violated +1 . 405 5.5 Violated +1 . 405 6.0 Violated +0 . 045 Limitations and futur e dir e ctions. The sp eciﬁcations enco de necessary conditions, not suﬃcient ones. A complian t mo del could still learn non-physical patterns outside the sp eciﬁcation set. The approach applies only to tree-based mo dels; neural netw ork veriﬁcation via MILP enco dings [17] is p ossible but computationally more exp ensiv e. The thresholds in Sp eciﬁcations A and C are site-sp eciﬁc Christch urc h b ounds, not univ ersal constan ts; applying the framew ork elsewhere requires recalibrating thresholds to lo cal conditions. T able 8 demonstrates that the unconstrained mo del violates Speciﬁcation A across a range of thresholds, conﬁrming a p ersisten t pattern rather than a narrow edge case. Extending the framework to regression models predicting displacemen t magnitudes is conceptually straigh tforward in SMT but requires reformulating sp eciﬁcations o ver contin uous outputs. In summary , SMT-based formal veriﬁcation can detect physical inconsistencies in tree ensemble mo dels that empirical testing and p ost-hoc explanations miss. T raining-time constrain ts ﬁx targeted violations but cannot enforce comp ound or threshold sp eciﬁcations, and physical consistency comes at a quantiﬁable accuracy cost. The v erify-ﬁx-verify lo op (train for accuracy , v erify against sp eciﬁcations, apply targeted constrain ts, re-verify) provides a systematic workﬂo w for deploying physically consistent ML mo dels in safety- critical geotechnical applications. As ML mo dels increasingly inform engineering decisions, formal v eriﬁcation oﬀers a path to ward certifying that these mo dels resp ect established sp eciﬁcations b efore ﬁeld deploymen t. Data av ailabilit y The Christc hurc h lateral spreading dataset is av ailable from Duran te and Rathje [9]. The veriﬁcation co de and trained mo dels are av ailable at https://github.com/geoelements- dev/2026- formal- verify- liq . References [1] Barrett, C., Sebastiani, R., Seshia, S.A., Tinelli, C., 2021. Satisﬁability mo dulo theories, in: Biere, A., Heule, M., v an Maaren, H., W alsh, T. (Eds.), Handb o ok of Satisﬁability . 2nd ed.. IOS Press, pp. 22 1267–1329. doi: 10.3233/FAIA201017 . [2] Breiman, L., 2001. Random forests. Machine Learning 45, 5–32. doi: 10.1023/A:1010933404324 . [3] Can, R., Kocaman, S., Gokceoglu, C., 2021. A comprehensive assessmen t of XGBoost algorithm for landslide susceptibility mapping in the upper basin of Ataturk dam, T urkey. Applied Sciences 11, 4993. doi: 10.3390/app11114993 . [4] Chen, H., Zhang, H., Si, S., Li, Y., Boning, D., Hsieh, C.J., 2019. Robustness veriﬁcation of tree-based mo dels, in: A dv ances in Neural Information Pro cessing Systems, pp. 12317–12328. [5] Chen, T., Guestrin, C., 2016. XGBoost: A scalable tree bo osting system, in: Pro ceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discov ery and Data Mining, pp. 785–794. doi: 10.1145/2939672.2939785 . [6] Chen, Z., Ni, P ., Chen, G., W ang, Y., Chen, J., W ang, C., 2024. Ev aluation of liquefaction-induced lateral spread displacement based on ensemble learning. Geological Journal 59, 2534–2548. doi: 10.1002/ gj.4951 . [7] Demir, S., Şahin, E.K., 2023. Predicting o ccurrence of liquefaction-induced lateral spreading using gradi- en t b o osting algorithms integrated with particle swarm optimization: PSO-X GBo ost, PSO-LightGBM, and PSO-CatBo ost. Acta Geotechnica 18, 3403–3419. doi: 10.1007/s11440- 022- 01777- 1 . [8] Dev os, L., Meert, W., Davis, J., 2021. V ersatile veriﬁcation of tree ensembles. Pro ceedings of Machine Learning Researc h 139, 2654–2664. [9] Duran te, M.G., Rathje, E.M., 2021. An exploration of the use of mac hine learning to predict lateral spreading. Earthquake Sp ectra 37, 2288–2314. doi: 10.1177/87552930211004613 . [10] Geyin, M., Maurer, B.W., Bradley , B.A., Green, R.A., v an Ballego o y , S., 2022. Developmen t and v alidation of a probabilistic liquefaction triggering mo del for CPT-based analyses. ASCE Journal of Geotec hnical and Geo en vironmen tal Engineering 148, 04022042. doi: 10.1061/(ASCE)GT.1943- 5606. 0002814 . [11] Hsiao, C.H., Kumar, K., Rathje, E.M., 2024. Explainable AI mo dels for predicting liquefaction-induced lateral spreading. F rontiers in Built Environmen t 10, 1387953. doi: 10.3389/fbuil.2024.1387953 . [12] Hsiao, C.H., Kumar, K., Rathje, E.M., 2025a. Domain-informed explainable b oosting mac hines for trust worth y lateral spread predictions. Manuscript in preparation. [13] Hsiao, C.H., Rathje, E.M., Kumar, K., 2025b. Inv estigating the eﬀect of CPT in lateral spreading prediction using explainable AI. arXiv preprint arXiv:2503.13389 . 23 [14] Huang, X., Marques-Silv a, J., 2024. On the failings of Shapley v alues for explainability . International Journal of Appro ximate Reasoning 171, 109112. doi: 10.1016/j.ijar.2023.109112 . [15] Ignatiev, A., Izza, Y., Stuck ey , P .J., Marques-Silv a, J., 2022. Using MaxSA T for eﬃcient explanations of tree ensembles, in: Proceedings of the 36th AAAI Conference on Artiﬁcial Intelligence, pp. 3776–3785. doi: 10.1609/aaai.v36i4.20292 . [16] Ignatiev, A., Naro dytsk a, N., Marques-Silv a, J., 2019. Ab duction-based explanations for machine learn- ing mo dels, in: Pro ceedings of the 33rd AAAI Conference on Artiﬁcial Intelligence, pp. 1511–1519. [17] Kan tchelian, A., Tygar, J.D., Joseph, A.D., 2016. Ev asion and hardening of tree ensem ble classiﬁers, in: Pro ceedings of the 33rd International Conference on Machine Learning, pp. 2387–2396. [18] Kumar, K., 2025. Deep learning in geotechnical engineering: a critical assessmen t of PINNs and op erator learning. arXiv preprint arXiv:2512.24365 . [19] Lou, Y., Caruana, R., Gehrk e, J., 2012. Intelligible mo dels for classiﬁcation and regression, in: Pro ceed- ings of the 18th ACM SIGKDD International Conference on Knowledge Discov ery and Data Mining, pp. 150–158. doi: 10.1145/2339530.2339556 . [20] Lundb erg, S.M., Lee, S.I., 2017. A uniﬁed approac h to interpreting mo del predictions. Adv ances in Neural Information Pro cessing Systems 30. [21] Maurer, B.W., Sanger, M.D., 2023. Why “ AI” mo dels for predicting soil liquefaction ha ve been ignored, plus some that shouldn’t b e. Earthquake Sp ectra 39, 1883–1910. doi: 10.1177/87552930231173711 . [22] de Moura, L., Bjørner, N., 2008. Z3: An eﬃcien t SMT solv er, in: International Conference on T ools and Algorithms for the Construction and Analysis of Systems, Springer. pp. 337–340. doi: 10.1007/ 978- 3- 540- 78800- 3_24 . [23] Nori, H., Jenkins, S., K och, P ., Caruana, R., 2019. In terpretML: A uniﬁed framework for machine learning in terpretability . arXiv preprin t arXiv:1909.09223 . [24] Raissi, M., Perdik aris, P ., Karniadakis, G.E., 2019. Ph ysics-informed neural netw orks: A deep learning framew ork for solving forw ard and inv erse problems in volving nonlinear partial diﬀerential equations. Journal of Computational Ph ysics 378, 686–707. doi: 10.1016/j.jcp.2018.10.045 . [25] Rateria, G., Maurer, B.W., 2022. Ev aluation and up dating of Ishihara’s (1985) mo del for liquefaction surface expression. Soils and F oundations 62, 101131. doi: 10.1016/j.sandf.2022.101131 . [26] Rib eiro, M.T., Singh, S., Guestrin, C., 2016. “ Why should I trust you?”: Explaining the predictions of any classiﬁer, in: Pro ceedings of the 22nd A CM SIGKDD International Conference on Knowledge Disco very and Data Mining, pp. 1135–1144. doi: 10.1145/2939672.2939778 . 24 [27] Sato, N., et al., 2020. F ormal v eriﬁcation of a decision-tree ensemble mo del and detection of its violation ranges. IEICE T ransactions on Information and Systems E103-D, 363–378. [28] Törn blom, J., Nadjm-T ehrani, S., 2020. F ormal v eriﬁcation of input-output mappings of tree ensembles. Science of Computer Programming 194, 102450. doi: 10.1016/j.scico.2020.102450 . [29] V arshney , N., Gupta, A., Ahmad, A., T ay al, T.V., Aksha y , S., 2026. Data-a ware and scalable sensitivit y analysis for decision tree ensembles, in: Pro ceedings of the 14th International Conference on Learning Represen tations. [30] Y uan, B., Cho o, C.S., Y eo, L.Y., W ang, Y., Y ang, Z., Guan, Q., Suryasen tana, S., Cho o, J., Shen, H., Megia, M., Zhang, J., Liu, Z., Song, Y., W ang, H., Chen, X., 2025. Ph ysics-informed machine learning in geotec hnical engineering: a direction pap er. Geomechanics and Geo engineering 20, 1128–1159. [31] Zhang, W., Gu, X., T ang, L., Yin, Y., Liu, D., Zhang, Y., 2022. Application of machine learning, deep learning and optimization algorithms in geo engineering and geoscience: Comprehensive review and future c hallenge. Gondw ana Research 109, 1–17. doi: 10.1016/j.gr.2022.03.015 . 25

Formal verification of tree-based machine learning models for lateral spreading

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment