Formal verification of tree-based machine learning models for lateral spreading

Machine learning models for geotechnical hazard prediction can achieve high accuracy while learning physically inconsistent relationships from sparse or biased training data. Current remedies (post-hoc explainability, such as SHAP and LIME, and train…

Authors: Krishna Kumar

Formal verification of tree-based machine learning models for lateral spreading
F ormal v erification of tree-based mac hine learning mo dels for lateral spreading Krishna Kumar ∗ a Oden Institute of Computational Engine ering and Scienc es, The University of T exas at Austin, A ustin, T exas, USA Abstract Mac hine learning mo dels for geotec hnical hazard prediction can achiev e high accuracy while learning physi- cally inconsisten t relationships from sparse or biased training data. Curren t remedies (p ost-ho c explainabilit y suc h as SHAP and LIME, and training-time constrain ts) either diagnose individual predictions appro ximately or restrict mo del capacity without pro viding exhaustive guaran tees. This pap er enco des trained tree ensem- bles as logical form ulas in a Satisfiability Mo dulo Theories (SMT) solver and chec ks physical sp ecifications across the entire input domain, not just sampled p oin ts. F our geotechnical specifications (w ater table depth, PGA monotonicity , distance safety , and flat-ground safet y) are formalized as decidable logical formulas and v erified via SMT against b oth XGBoost ensembles and Explainable Bo osting Machines (EBMs) trained on the 2011 Christc h urch earthquak e lateral spreading dataset (7,291 sites, four features). The SMT solver either produces a concrete counterexample where a sp ecification fails or prov es that no violation exists. Unconstrained XGBoost (82.5% accuracy) violates all four sp ecifications. Monotone-constrained XGBoost (69.3% accuracy) satisfies PGA monotonicity but still violates comp ound sp ecifications that no single-feature constrain t can express. The unconstrained EBM (80.1% accuracy) violates all four sp ecifications. Imposing monotone constrain ts on PGA alone fixes Sp ecification B but leav es the remaining three violated. A fully constrained EBM with monotone constraints on all four features (67.2% accuracy) satisfies three of four sp ecifications (water table depth safety remains violated by a thin margin), demonstrating that iterative constrain t application guided by verification can progressively improv e physical consistency . A Pareto anal- ysis of 33 mo del v arian ts rev eals a p ersisten t trade-off, as among the v arian ts studied, no mo del ac hieves b oth > 80% accuracy and full compliance with the sp ecification set. SHAP analysis of sp ecification coun- terexamples sho ws that the offending feature can rank last among four features, demonstrating that p ost-ho c explanations do not substitute for formal verification. These results establish a verify-fix-v erify engineering lo op for deploying physically consistent ML mo dels in safety-critical geotechnical applications, and pro vide a foundation for formal certification of geotec hnical ML mo dels b efore field deploymen t. Keywor ds: formal v erification, physical consistency, machine learning, lateral spreading, SMT solver, X GBo ost, explainable bo osting machine ∗ Corresponding author Email addr ess: krishnak@utexas.edu (Krishna Kumar) 1. In tro duction Mac hine learning (ML) mo dels increasingly predict earthquake-induced geotechnical hazards, including liquefaction triggering [25, 10, 7] and lateral spreading [11, 9]. These models capture complex, non-linear relationships among input features (p eak ground acceleration (PGA), groundwater depth, distance to a free face, ground slope) that traditional empirical metho ds cannot represen t [31]. Ensemble methods lik e random forests [2] and gradient-bo osted trees [5] ac hiev e this by aggregating h undreds of w eak predictors, each fitting a lo cal partition of the feature space. The aggregate captures in teraction effects that no single decision tree could express, but it also obscures ho w individual features drive predictions. This opacity creates a practical problem. Mo dels trained on sparse or biased geotec hnical data can learn ph ysically inconsistent relationships that remain in visible until deploymen t. Hsiao et al. [11] demonstrated this concretely . Their X GBoost mo del, trained on the 2011 Christch urch earthquake lateral spreading dataset [9], predicted that the probability of lateral spreading de cr e ase d as PGA increased in certain input regions, in verting the fundamen tal exp ectation that stronger shaking pro duces more ground displacement. The model still achiev ed appro ximately 84% ov erall accuracy b ecause the non-physical pattern w as confined to data- sparse input regions, precisely the regions where reliable predictions matter most. This failure mode is not unique to one mo del or dataset. Maurer and Sanger [21] review ed 75 AI liquefaction publications and identified systematic deficiencies (lack of comparison to state-of-practice mo dels, departure from ML b est practices, mo dels discussed but not provided) that undermine trust in ML predictions across the field. This pap er addresses a complementary deficiency , the absence of exhaustiv e mo del-assurance to ols that can determine whether a trained mo del satisfies sp ecific consistency conditions, regardless of how it w as trained or b enc hmark ed. P ost-ho c explainability metho ds offer one line of defense. SHapley A dditive exPlanations (SHAP) [20] and Lo cal Interpretable Model-agnostic Explanations (LIME) [26] hav e prov en v aluable for geotechnical ML [3, 11], but both are approximate. Huang and Marques-Silv a [14] pro ved that SHAP can assign non-zero imp ortance to irrelev ant features and zero imp ortance to relev an t ones. Post-hoc metho ds can suggest the presence of failure, but they cannot c ertify its absence. Correcting non-ph ysical relationships requires mo difying the mo del. Monotone constraints [5] and shap e function editing [12] improv e physical consistency but op erate on individual features, cannot express com- p ound conditions (e.g., “far from free face and weak shaking should not predict high risk”), and provide no pro of that the corrected mo del satisfies the intended prop ert y across its full input domain. This pap er closes that gap by in tro ducing formal verification via Satisfiability Mo dulo Theories (SMT) solv ers to geotechnical ML. W e enco de trained XGBoost and Explainable Boosting Machine (EBM) tree ensem bles as logical form ulas in an SMT solver and chec k whether ph ysical sp ecifications hold across all p ossible inputs simultaneously . T ree ensem bles partition the input space into finitely many h yperrectangular regions, each assigned a constant output. Negating a ph ysical sp ecification and inlining this partition yields a quantifier-free formula in linear real arithmetic that the solv er decides exactly , returning either sat with 2 a concrete coun terexample or unsat (sp ecification prov en). W e use Z3 [22], though the approach applies to an y QF_LRA-capable solver. The contribution is metho dological, a mo del-assurance workflo w demonstrating ho w formal verification complements training-time constraints, rather than prop osing a new predictor. As geotec hnical ML mo dels mo ve to w ard op erational deplo yment, formal certification that these models resp ect established sp ecifications b ecomes essen tial for safe adoption in practice. The con tributions of this pap er are as follo ws. 1. The first application of SMT-based formal verification to geotechnical ML mo dels, demonstrating that b oth X GBo ost ensem bles and Explainable Bo osting Mac hines trained on lateral spreading data can b e enco ded as logical formulas and formally chec k ed against sp ecified geotechnical sp ecifications. 2. F ormalization of four geotec hnical sp ecifications, spanning single-feature thresholds (A), monotonicity (B), and compound m ulti-feature conditions (C, D). Each specification reduces to a quan tifier-free satisfiabilit y query in linear real arithmetic (QF_LRA) after inlining the tree-ensemble score. The sp ecific thresholds are calibrated to the Christc hurc h dataset and can b e adapted to other sites. The v erification framew ork itself is general and accepts an y threshold v alues that local engineers deem appropriate. 3. A v erify-fix-v erify engineering loop that uses SMT coun terexamples to diagnose violations, applies targeted monotone constrain ts, and re-verifies to confirm or reveal residual failures. 4. A P areto analysis of 33 mo del v arian ts quantifying the accuracy-consistency trade-off. Among the v ariants studied, no model achiev es b oth > 80% accuracy and full compliance with the sp ecification set, establishing a concrete cost for ph ysical consistency that practitioners can ev aluate. 2. Bac kground and related work 2.1. Machine le arning for later al spr e ading Lateral spreading, the p ermanen t horizon tal ground displacemen t tow ard a free face during earthquake- induced liquefaction, damages buried utilities, bridge foundations, and building fo otings in seismic regions. Predicting where lateral spreading will o ccur requires relating subsurface soil conditions and seismic loading to surface displacement, a task complicated b y the heterogeneit y of natural dep osits and the rarit y of strong- motion ev ents. Duran te and Rathje [9] compiled 7,291 sites from the 2011 Christch urc h earthquake, each characterized b y four features (GWD, distance to free-face L , slop e, PGA) and lab eled via remote sensing, training XG- Bo ost classifiers achieving approximately 80–84% accuracy . Demir and Şahin [7] confirmed that bo osted tree ensem bles consisten tly outp erform logistic regression and SVMs on this dataset. Chen et al. [6] extended the approach to displacement magnitude prediction, and Hsiao et al. [13] incorp orated CPT profiles via auto encoder, improving accuracy ab o ve 83%. Hsiao et al. [11] found that the XGBoost model assigned ne gative SHAP v alues to PGA in certain regions, meaning increasing shaking de cr e ase d predicted spreading probabilit y . This non-physical b ehavior arose from 3 sparse training data in the lo w-PGA, high-distance region. SHAP revealed the problem visually but could not quan tify its extent or guarantee that a corrected mo del would b e free of similar artifacts. Hsiao et al. [12] replaced XGBoost with an EBM [23, 19], a Generalized Additiv e Mo del (GAM) decom- p osing the prediction into univ ariate shap e functions f i ( x i ) and pairwise in teractions f ij ( x i , x j ) . g  E [ y ]  = β 0 + X i f i ( x i ) + X i 10 ∧ x 1 < 0 , the solv er deduces x 2 > 10 and returns a satisfying assignment such as x 1 = − 1 , x 2 = 12 ; tree-ensemble queries ha ve exactly this structure, with eac h leaf contributing a constan t to the sum. Physical sp ecifications for tree ensembles are stated as universally quan tified first-order form ulas. V erification negates each sp ecification and inlines the tree-ensemble score as a nested If-Then-Else (ITE) expression. The resulting solv er queries are quan tifier-free formulas in linear real arithmetic with ITE terms (Quantifier-F ree Linear Real Arithmetic, QF_LRA), b ecause every tree split compares a feature x j against a threshold θ and the ensemble output is a sum of constant leaf w eights. The decision pro cedure for QF_LRA is complete, meaning the solver alw a ys terminates. W e use Z3 [22] as our v erification engine, though the metho dology applies to any QF_LRA- capable solv er. T r e e ensemble verific ation. Kantc helian et al. [17] pioneered Mixed Integer Linear Programming (MILP)- based tree ensemble verification for computing optimal adversarial examples. Chen et al. [4] recast robustness v erification as a max-clique problem, achieving speedups o ver MILP . Dev os et al. [8] dev eloped V eritas for generic verification tasks b ey ond adv ersarial robustness, and Törnblom and Nadjm-T ehrani [28] built V oTE for safet y-critical autonomous systems. Sato et al. [27] prop osed metho ds to extract violation ranges for constructing input filters. On the explanation side, Ignatiev et al. [16] developed formal ab ductiv e explanations (subset-minimal feature sets guaranteeing a prediction), extended by Ignatiev et al. [15] to MaxSA T enco dings. V arshney et al. [29] combined MILP and SMT into a data-aw are sensitivity framework scaling to 800-tree ensem bles. All prior w ork fo cused on adversarial robustness or generic sensitivity . No study has applied formal v erification to chec k domain-sp ecific physical specifications, which is the contribution of this pap er. The key insigh t is that tree ensembles partition the input space into finitely many h yp errectangular regions, each assigned a constant output. Negating a physical sp ecification and inlining this partition yields a decidable QF_LRA form ula. 2.4. Positioning: formal verific ation as c omplement to c onstr aints Hard constraints pr event violations during training by restricting the hypothesis space, but provide no diagnostic information ab out where the unconstrained mo del w ould hav e failed. F ormal verification audits a trained mo del b y chec king whether its learned function satisfies sp ecified prop erties, returning either a pro of or a concrete coun terexample. These t wo capabilities comp ose into a verify-fix-v erify lo op. 1. T rain an unconstrained mo del, optimizing purely for accuracy . 2. V erify physical sp ecifications formally . The solver pro vides concrete coun terexamples for each violation. 3. Analyze coun terexamples to determine which constraints or corrections are needed. 5 4. Apply targeted fixes (monotone constraints, shape function editing, or retraining with mo dified hyper- parameters) guided b y the counterexample analysis. 5. Re-v erify to confirm compliance or to discov er residual violations that require additional iteration. This lo op pro duces mo dels whose physical consistency is not assumed but pr oven . The key distinction from simply applying constraints a priori is that verification reveals whether the constraints are sufficien t. As we demonstrate in Section 5.2, monotone constraints fix Sp ecification B (single-feature monotonicity) but leav e Sp ecifications A, C, and D (threshold and compound conditions) unsatisfied, a failure that would remain undetected without formal v erification. 3. Problem formulation 3.1. Dataset W e use the 2011 Christc hurc h earthquak e lateral spreading dataset [9], comprising 7,291 sites (42% spreading, 58% not) classified by remote sensing displacement measurements exceeding 0.3 m. The four features (T able 1) each hav e clear physical relationships to lateral spreading that domain exp erts can articulate as sp ecifications. T able 1: Input features for lateral spreading classification. F eature Description Range Unit GWD Groundw ater depth 0.37–6.05 m L Distance to nearest free-face 0–3.29 km Slop e Ground surface slop e 0–10.92 % PGA P eak ground acceleration 0.33–0.57 g 3.2. Physic al sp e cific ations as lo gic al formulas W e formalize four ph ysical sp ecifications that any ML lateral spreading mo del should resp ect. The four sp ecifications are c hosen to exercise structurally different formula classes (single-feature threshold, monotonic- it y , and comp ound multi-feature conditions) so that the verification framework is tested against the range of constraint t yp es that arise in practice. Our contribution is the verification framework, not the particular rules. Let the input v ector b e x = ( g, ℓ, s, p ) ∈ X ⊆ R 4 , (2) where g is groundwater depth (GWD), ℓ is distance to free face ( L ), s is ground slope, and p is peak ground acceleration (PGA). Let f : X → R denote the mo del’s logit (log-o dds) output, so that the predicted class is ˆ y ( x ) = 1 [ f ( x ) > 0] . T able 2 summarizes the notation and logical symbols used throughout. Each 6 T able 2: Notation and logical symbols. Sym b ol Meaning x = ( g, ℓ, s, p ) Input v ector (GWD, distance, slop e, PGA) X ⊆ R 4 Bounded input domain f ( x ) Mo del logit (log-o dds) output ˆ y ( x ) Predicted class: 1 [ f ( x ) > 0] ∀ Univ ersal quantification (“for all”) ∃ Existen tial quantification (“there exists”) ¬ Logical negation (“not”) ⇒ Logical implication (“if . . . then”) ∧ Logical conjunction (“and”) ∨ Logical disjunction (“or”) V Conjunction o ver a set (“and” for all elements) ∈ Set mem b ership (“b elongs to”) ≤ , > Arithmetic comparison sat Satisfiable: counterexample exists (sp ecification violated) unsat Unsatisfiable: no counterexample (sp ecification prov en) sp ecification is a universally quantified first-order formula ov er the b ounded domain X . V erification reduces eac h to a quantifier-free satisfiability query after inlining the mo del as an ITE expression (Section 3.3). The sp ecifications enco de ne c essary conditions from geotechnical domain knowledge. Satisfying them do es not guarantee correctness in all resp ects. The sp ecific thresholds are calibrated to the Christch urc h dataset and encode site-sp ecific engineering b ounds rather than universal ph ysical laws. Practitioners applying this framew ork elsewhere should recalibrate thresholds to lo cal conditions. Sp e cific ation A (W ater table depth). Sites with deep groundwater should not b e predicted as lateral spreading. Geyin et al. [10] found that surface manifestation of liquefaction b ecomes negligible for GWD ab o ve 3–4 m, and w e adopt a conserv ativ e 5 m threshold. ∀ ( g , ℓ, s, p ) ∈ X : g > 5 . 0 ⇒ f ( g , ℓ, s, p ) ≤ 0 (3) Sp e cific ation B (PGA monotonicity). Increasing PGA should not decrease predicted risk, a first-principles consequence of soil dynamics that holds regardless of other features. ∀ ( g , ℓ, s, p 1 ) , ( g , ℓ, s, p 2 ) ∈ X : p 1 ≤ p 2 ⇒ f ( g , ℓ, s, p 1 ) ≤ f ( g , ℓ, s, p 2 ) (4) P er-feature monotone constrain ts can directly enforce Specification B by requiring p to ha v e a non-decreasing marginal effect during tree construction. The same mechanism cannot enforce Specification A, because 7 monotone constraints con trol the dir e ction of a feature’s marginal effect but not the absolute output v alue at a given threshold. A mo del with a monotonically decreasing g effect can still pro duce a p ositive logit at g = 5 . 01 m if the c om bined contributions from ℓ , s , and p are sufficiently large. Sp e cific ation C (Distanc e safety). Sites far from a free face exp eriencing weak shaking should not b e predicted as lateral spreading. The 2.5 km threshold exceeds the maxim um documented lateral spread exten t in Christc hurc h, and the 0.35 g threshold falls b elow confirmed spreading PGA v alues (minimum 0.37 g). ∀ ( g , ℓ, s, p ) ∈ X : ( ℓ > 2 . 5 ∧ p < 0 . 35) ⇒ f ( g , ℓ, s, p ) ≤ 0 (5) This sp ecification constrains tw o features simultaneously , and no single-feature monotone constraint can ex- press the conjunction. Only 4 training observ ations fall within the premise region, none exhibiting spreading. Sp e cific ation D (Flat-gr ound safety). Sp ecification D restricts the same safety conclusion to a smaller, more ph ysically certain subregion b y adding the requiremen t that the ground surface is essen tially flat ( s < 0 . 1 %), eliminating the gen tly-sloping-ground driving mechanism. ∀ ( g , ℓ, s, p ) ∈ X : ( s < 0 . 1 ∧ ℓ > 2 . 5 ∧ p < 0 . 35) ⇒ f ( g, ℓ, s, p ) ≤ 0 (6) Only 1 observ ation in the training data satisfies the premise, and it is non-spreading. Specification D is not logically indep enden t of Sp ecification C. Because its premise implies that of Sp ecification C ( s < 0 . 1 ∧ ℓ > 2 . 5 ∧ p < 0 . 35 ⇒ ℓ > 2 . 5 ∧ p < 0 . 35 ), an y mo del satisfying Sp ecification C automatically satisfies Sp ecifica- tion D. Sp ecification D is therefore a weak er requirement, included as a diagnostic. When a mo del violates Sp ecification C, Sp ecification D determines whether the violation p ersists even in the smaller flat-ground subregion where the driving conditions are w eakest. T able 3 summarizes the four sp ecifications and their structural classes. Standard training-time monotone constraints can directly enforce only Sp ecification B. T able 3: Sp ecification taxonomy . T raining-time monotone constraints can directly enforce only the monotonicity class (B). Threshold (A) and comp ound (C, D) specifications require formal v erification to c heck. Sp ec. Name T yp e Monotone constrain t sufficient? A W ater table depth Single-feature threshold No B PGA monotonicity Single-feature monotonicit y Y es C Distance safety Comp ound multi-feature No D Flat-ground safety Comp ound multi-feature No W e verify all four sp ecifications ov er X = [0 . 37 , 6 . 05] × [0 , 3 . 29] × [0 , 10 . 5] × [0 . 33 , 0 . 57] ( g × ℓ × s × p ), whic h rounds the observed data ranges to clean b ounds and excludes tw o outliers (one with Slop e = 10 . 9 %, one with PGA = 0 . 329 g), cov ering 7,289 of 7,291 observ ations (99.97%). V erification guarantees apply only within X ; mo del b eha vior outside these b ounds is extrapolation. 8 3.3. SMT enc o ding of tr e e ensembles A gradien t-b o osted tree ensemble with T trees computes the logit as f ( x ) = β 0 + T X t =1 h t ( x ) (7) where β 0 is a base score and each tree h t maps features to a leaf weigh t via binary splits. W e enco de each tree recursiv ely as an SMT If-Then-Else (ITE) expression: Encode ( n ) =        w n if n is a leaf with w eight w n If  x j n < θ n , Encode ( n. left ) , Encode ( n. right )  if n is in ternal (8) The full ensemble logit is f ( x ) = β 0 + P T t =1 Encode ( ro ot t ) . F or example, a depth-2 tree with ro ot split on PGA at 0.42 g (Figure 1a) b ecomes h 1 ( g , ℓ, s, p ) = If  p < 0 . 42 , If ( s < 0 . 05 , w 1 , w 2 ) , If ( p < 0 . 48 , w 3 , w 4 )  (9) Eac h leaf corresp onds to a conjunction of linear inequalities defining a p olyhedral region of the input space (e.g., leaf w 1 is reached when p < 0 . 42 ∧ s < 0 . 05 ). Summing T such ITE expressions yields the ensem ble logit f ( g , ℓ, s, p ) = β 0 + P t h t ( g , ℓ, s, p ) . F or our 100-tree, depth-5 mo del, this produces a single symbolic expression o ver four v ariables with 2,380 leaf terms. Figure 1 illustrates the enco ding. Eac h root-to-leaf path b ecomes a conjunction of linear inequalities (left), and the full ensemble partitions the input space into regions of constan t output (right, ov er g × p ), eac h corresp onding to a unique leaf com bination across all 100 trees. Because this partition is finite, the solv er reasons ov er all regions in a single query . T w o implemen tation details are critical. First, the base score β 0 is stored separately from the tree dump and m ust be included. Omitting it shifts every logit by a constant offset. Second, X GBo ost uses 32-bit floats while SMT solvers use exact arithmetic, so we cast all thresholds and input features to float32 to preven t precision mismatches. Every counterexample was v alidated against the actual XGBoost mo del to confirm exact logit agreemen t. T o verify a sp ecification, we assert its ne gation . Eac h sp ecification has the form ∀ x ∈ X : P ( x ) . By the logical equiv alence ¬ ( ∀ x : P ( x )) ≡ ∃ x : ¬ P ( x ) , “it is not the case that P holds for all x ” is the same as “there exists some x where P fails.” A sp ecification is therefore violated if and only if suc h a coun terexample exists. The solver does not en umerate inputs. Instead, for Sp ecification A (“for all inputs with GWD > 5 m, the mo del should not predict spreading”), the solver asks whether there exists an input where GWD > 5 m and the mo del predicts spreading. F or Sp ecification C, the negated query is ∃ ( g , ℓ, s, p ) ∈ X : ℓ > 2 . 5 ∧ p < 0 . 35 ∧ f ( g, ℓ, s, p ) > 0 (10) After inlining f as an ITE expression, this becomes a quan tifier-free QF_LRA formula ov er four free real v ariables with domain constraints. If the solv er returns sat , the satisfying assignmen t (input) is a counterex- ample and the sp ecification is violated. If unsat , no suc h input exists and the specification holds univ ersally . 9 PGA (g) < 0 . 42 Slope (%) < 0 . 05 PGA (g) < 0 . 48 w 1 w 2 w 3 w 4 Y es No T r e e 1 of 100 f ( x ) = β 0 + 100 X t =1 ITE t ( x ) each split → If ( x i < τ , left , right ) encode Negated Sp ecification B ∃ ( g , ℓ, s, p 1 ) , ( g , ℓ, s, p 2 ) ∈ X : p 1 < p 2 ∧ f ( g , ℓ, s, p 2 ) < f ( g, ℓ, s, p 1 ) conjoin ¬ spec ϕ ∧ ¬ sp ec SMT solv er SA T → counter-example UNSA T → sp ec. prov en (a) V erification pip eline. An XGBoost tree (left) is enco ded as a nested SMT If-Then-Else expression. The negation of a physical specification is conjoined with the formula. If the solv er returns SA T, it pro duces a concrete counter-example; if UNSA T, the specification is prov en to hold o ver the entir e input domain. 1 2 3 4 5 6 GWD (m) 0.35 0.40 0.45 0.50 0.55 PGA (g) Specification A (GWD > 5 m) 0.00 0.12 0.24 0.36 0.48 0.60 0.72 0.84 0.96 P(Lateral Spreading) (b) The resulting input space partition over g and p (with ℓ = 0 . 5 km, s = 2 %), colored by predicted lateral spreading probability . Each colored region corresp onds to a unique leaf combination across all 100 trees. The dashed line marks the Sp ecification A threshold ( g > 5 m). The SMT solver reasons ov er all regions simultaneously . Figure 1: SMT enco ding of an XGBoost gradient-bo osted ensem ble. Each of the 100 trees (a) is enco ded as a nested If-Then- Else expression ov er real-v alued leaf weigh ts; the ensemble sum of all trees partitions the input space into regions of constant predicted probability (b). 10 4. Metho dology 4.1. T r e e-to-SMT enc o ding W e parse X GBo ost’s JSON tree dump, which represen ts eac h tree as a nested dictionary of in ternal no des (with split feature, threshold, and child pointers) and leaf nodes (with weigh t v alues). Algorithm 1 describes the recursiv e enco ding. Algorithm 1 Recursive enco ding of a decision tree as an SMT expression. 1: function EncodeTree (no de, features) 2: if no de is leaf then 3: return RealVal ( float32 ( no de.w eigh t )) 4: end if 5: j ← node.split_feature 6: θ ← float32 ( node.threshold ) ▷ Matc h XGBoost arithmetic 7: left ← EncodeTree ( no de.y es_child , features ) 8: right ← EncodeTree ( node.no_child , features ) 9: return If ( features [ j ] < θ , left , right ) 10: end function F or an ensemble of T trees, the logit expression is logit = β 0 + T X t =1 EncodeTree ( tree t . ro ot , features ) (11) The resulting expression is a deeply nested ITE term. A mo del with 100 trees of depth 5 pro duces up to 100 × 2 5 = 3 , 200 leaf terms (2,380 in our mo del), though the solver’s internal simplification reduces this substan tially b efore solving. 4.2. EBM-to-SMT enc o ding EBMs decomp ose the prediction in to univ ariate shap e functions and pairwise in teractions, each stored as a lo okup table ov er learned bin edges. Eac h univ ariate function b ecomes a chain of ITE expressions ov er its bin edges: f i ( x i ) = If ( x i < b 1 , s 1 , If ( x i < b 2 , s 2 , . . . s n i )) (12) A djacent bins with identical scores are merged, compressing 4,043 univ ariate bins to approximately 2,900 ITE no des. Pairwise interactions b ecome nested tw o-dimensional ITE expressions (approximately 23,400 no des for s ix interaction terms). The full EBM logit is f ( x ) = β 0 + P i f i ( x i ) + P i 5 ∧ ( h 1 + h 2 ) > 0 . The solver propagates g > 5 (and therefore g ≥ 2 . 5 ) in to tree 1’s right branch, yielding either − 0 . 1 or 0 . 05 . If tree 1 outputs 0 . 05 ( p ≥ 0 . 42 ) and tree 2 outputs 0 . 2 ( ℓ < 1 . 0 ), the logit is 0 . 25 > 0 , so the solver returns sat with a coun terexample (e.g., g = 5 . 5 , ℓ = 0 . 5 , p = 0 . 45 ). The solv er reasons sym b olically o ver the entire piecewise-constan t structure, c hecking every region sim ul- taneously via constrain t propagation. F or our 100-tree mo del, each specification c heck typically completes in under 5 seconds. 4.4. V erific ation pr o c e dur e F or Sp ecifications A, C, and D, the solver chec ks a single query by asserting the sp ecification’s negation along with physical range constraints, and returns sat with a v alidated counterexample or unsat (sp ecification pro ven). Sp ecification B (monotonicit y) is more exp ensiv e b ecause it compares the mo del at tw o inputs differing only in PGA. A direct enco ding declares t w o feature vectors, constrains non-PGA features to be equal, and asserts that higher PGA pro duces lo w er logit, doubling the formula size and requiring 480 seconds for the monotone mo del. W e resolv e this cost by leveraging the additive ensemble structure. Prop osition 1 (Additiv e decomposition for monotonicity) . L et f ( x ) = β 0 + P M m =1 g m ( x ) b e an additive mo del. If e ach c omp onent g m is monotonic al ly non-de cr e asing in fe atur e x j , then f is monotonic al ly non- de cr e asing in x j . Pr o of. Fix inputs x , x ′ differing only in x j with x j < x ′ j . By hypothesis, g m ( x ) ≤ g m ( x ′ ) for each m = 1 , . . . , M . Summing gives P m g m ( x ) ≤ P m g m ( x ′ ) ; adding β 0 to b oth sides gives f ( x ) ≤ f ( x ′ ) . Prop osition 1 reduces ensemble monotonicity to T indep endent p er-tree queries. W e verify Sp ecification B b y c hec king each of the 100 trees individually . Each query inv olv es a single tree rather than the full ensemble, and every query completes in under 1 second. All 100 p er-tree chec ks return unsat , proving Sp ecification B for the full ensem ble by Prop osition 1. This p er-tree strategy yields a complete formal proof, not an empirical appro ximation. 4.5. V erify-fix-verify lo op V erification results guide iterative mo del impro vemen ts: (1) train an unconstrained mo del (82.5% accu- racy); (2) verify all sp ecifications (all four violated); (3) retrain with monotone_constraints (GWD − , L − , 12 PGA + ); (4) re-v erify (Specification B pro v en via p er-tree decomposition, but Sp ecifications A, C, and D still violated at 69.3% accuracy). The lo op reveals that monotone constrain ts are nece ssary but not sufficient; coun terexamples pinp oin t where comp ound violations p ersist. 4.6. F ormal ab ductive explanations The SMT enco ding also enables computing formal ab ductive explanations , the minimal subset of features that, when fixed to a sample’s v alues, guaran tees the prediction regardless of ho w the remaining features v ary . F or a sample x ∗ predicted as p ositiv e ( f ( x ∗ ) > 0 ), a subset S ⊆ { 1 , . . . , d } is a sufficient reason if ∀ x ∈ X :  ^ j ∈ S x j = x ∗ j  ⇒ f ( x ) > 0 (15) W e compute this iteratively , starting from the full feature set and testing whether removing eac h feature preserv es the ab ov e prop ert y via SMT queries. If removing feature j yields unsat (meaning no counterexample exists even without fixing j ), the feature is unnecessary and can be dropp ed. If sat , some input differing only in j c hanges the prediction, so j is formally necessary . Comparing these formal explanations with SHAP’s top- k features (where k matches the formal explanation size) measures ho w faithfully SHAP represents the mo del’s actual decision logic. 5. Results 5.1. Mo del tr aining W e train four model families on the C hristc h urch dataset (80/20 train-test split, 5,832/1,459 sites): (1) unconstrained X GBo ost, (2) monotone-constrained XGBoost (GWD − , L − , PGA + ), (3) unconstrained EBM, and (4) monotone-constrained EBM. F or X GBoost, we v ary depth d ∈ { 3 , 5 , 7 , 9 } and estimators n ∈ { 20 , 50 , 100 } (24 v ariants); for EBMs, interaction terms i ∈ { 0 , 3 , 5 , 10 } (9 v arian ts including the fully constrained mo del), yielding 33 total. The baseline unconstrained X GBoost ( d = 5 , n = 100 ) achiev es 82.5% test accuracy; 10-fold cross- v alidation confirms s tabilit y ( 83 . 2 ± 1 . 4 % unconstrained, 67 . 4 ± 1 . 0 % monotone). The b est unconstrained EBM ( i = 10 ) achiev es 80.1% with a smaller train-test gap, reflecting the GAM structure’s inheren t regularization. The SHAP b eeswarm p lot for XGBoost (Figure 2, left) reveals that high PGA exhibits both p ositiv e and negative SHAP v alues, the non-physical pattern iden tified by Hsiao et al. [11]. EBM feature imp ortance (Figure 2, right) flags a similar concern, but neither visualization can quantify ho w extensively the mo del violates ph ysical exp ectations across its full input domain. The EBM shap e functions (Figure 3) reveal the source of these violations. The unconstrained PGA curve exhibits a dip b eyond 0.45 g where increasing PGA decreases predicted log-odds, and the PGA-monotone constrain t eliminates this artifact. 13 4 3 2 1 0 1 2 3 SHAP value (impact on log-odds) Slope (%) GWD (m) L (km) PGA (g) (a) XGBoost (unconstrained) 0 2 4 6 8 10 12 14 Shape function range (log-odds) L (km) & PGA (g) L (km) L (km) & Slope (%) GWD (m) & L (km) GWD (m) GWD (m) & PGA (g) PGA (g) GWD (m) & Slope (%) Slope (%) & PGA (g) Slope (%) (b) EBM (unconstrained) F eature importance Low High F eature value Figure 2: Global feature imp ortance. Left: SHAP b eesw arm plot for unconstrained XGBoost. Each dot represents a test sample; color indicates feature v alue (red = high, blue = low). PGA shows b oth p ositiv e and negative SHAP v alues, indicating non- monotonic behavior. Right: EBM feature imp ortance measured b y shap e function range (maximum min us minimum log-o dds contribution). 1 2 3 4 GWD (m) 4 3 2 1 0 1 Unconstrained EBM Score (log-odds) GWD (m) 0.0 0.5 1.0 1.5 2.0 2.5 3.0 L (km) 4 2 0 2 4 L (km) 0 2 4 6 8 Slope (%) 0.5 0.0 0.5 1.0 1.5 Slope (%) 0.35 0.40 0.45 0.50 0.55 PGA (g) 3 2 1 0 1 PGA (g) Non-monotone 1 2 3 4 GWD (m) 2.0 1.5 1.0 0.5 0.0 0.5 1.0 PGA-monotone EBM Score (log-odds) 0.0 0.5 1.0 1.5 2.0 2.5 3.0 L (km) 3 2 1 0 1 2 0 2 4 6 8 Slope (%) 0.2 0.0 0.2 0.4 0.6 0.8 1.0 1.2 0.35 0.40 0.45 0.50 0.55 PGA (g) 0.8 0.6 0.4 0.2 0.0 0.2 0.4 EBM Shape F unctions: Unconstrained vs PGA-Monotone Figure 3: EBM shap e functions for all four features. T op row: unconstrained EBM. Bottom row: PGA-monotone EBM. The PGA shap e function (righ tmost column) shows non-physical b eha vior in the unconstrained mo del, sp ecifically a dip near 0.42 g where increasing PGA decreases predicted log-o dds. The monotone constraint eliminates this artifact. 14 5.2. Sp e cific ation verific ation r esults W e verified all four sp ecifications against each mo del family (T able 4); the following paragraphs discuss eac h mo del in detail, with coun terexample diagnostics in Figure 4. T able 4: Sp ecification verification results. XGBoost results are full SMT pro ofs (Z3). EBM results use SMT for Sp ecifications A, C, and D; Sp ecification B uses p er-term score-array decomp osition (Proposition 1). The solver chec ks the ne gation of each specification: if satisfiable, a counterexample exists and the sp ecification is violated ( ✗ ); if unsatisfiable, no counterexample exists and the sp ecification is prov en ( ✓ ). All coun terexamples are v alidated against the actual mo del. Mo del A (W ater table) B (PGA mono.) C (Distance) D (Flat-ground) T est Acc. X GBo ost (unconstrained) ✗ ✗ ✗ ✗ 82.5% X GBo ost (monotone) ✗ ✓ ✗ ✗ 69.3% EBM (unconstrained) ✗ ✗ ✗ ✗ 80.1% EBM (PGA monotone) ✗ ✓ ✗ ✗ 70.5% EBM (fully constrained) ✗ ✓ ✓ ✓ 67.2% 0.35 0.40 0.45 0.50 0.55 PGA (g) 0.0 0.2 0.4 0.6 0.8 P(Lateral Spreading) (a) Specification B: PGA Monotonicity Low risk (GWD 1.5 m, L 0.5 km) Medium risk (GWD 3.0 m, L 1.5 km) High risk (GWD 4.0 m, L 2.5 km) 0.0 0.5 1.0 1.5 2.0 2.5 3.0 L (km) 0.35 0.40 0.45 0.50 0.55 PGA (g) Specification C region (b) Specification C: Distance vs PGA 1 2 3 4 5 6 GWD (m) 0.0 0.2 0.4 0.6 0.8 P(Lateral Spreading) (c) Specification A: W ater T able Depth Low PGA (0.35 g) Moderate PGA (0.42 g) Strong PGA (0.50 g) Severe PGA (0.55 g) Specification A threshold 0.00 0.15 0.30 0.45 0.60 0.75 0.90 P(Spreading) Figure 4: Sp ecification violation diagnostics for unconstrained XGBoost. Left: PGA monotonicity (Specification B) for three soil profiles; red dots mark non-monotone regions. Center: contour of predicted spreading probability ov er distance ( L ) and PGA, with Specification C safe region marked. Righ t: predicted probability v ersus groundw ater depth for sev eral PGA levels, with Sp ecification A threshold at 5 m. Unc onstr aine d X GBo ost. The solv er returns sat for all four specifications (eac h within 1 second). The Sp ecification A coun terexample (GWD = 5 . 001 m, logit = +0 . 065 , 52%) predicts spreading despite deep groundw ater. The Specification B counterexample shows increasing PGA from 0.42 g to 0.52 g de cr e asing the logit by 0.15 units, confirming the SHAP-based observ ation of Hsiao et al. [11] by exhaustive formal pro of. The Sp ecification C counterexample ( L = 2 . 501 km, PGA = 0 . 33 g, logit = +0 . 169 , 54%) and Sp ecification D coun terexample (Slop e = 0 . 05 %, L = 2 . 501 km, logit = +0 . 169 , 54%) predict spreading in regions where the adopted screening sp ecifications exclude the usual driving conditions and the training data con tain no observ ed spreading cases (Figure 5b). 15 f j f i lb j ub j lb i ub i (a) f j f i lb j ub j lb i ub i × (b) Figure 5: Sc hematic of t wo failure mo des in a t w o-dimensional feature subspace defined b y bounds lb and ub . (a) No training data fall in the green region; the model interpolates without empirical supp ort. (b) T raining observ ations exist in the red region but are exclusiv ely negativ e-class ( ◦ ), y et the model predicts y = 1 at the SMT coun terexample ( × ). Sp ecification D violations correspond to case (b). Monotone-c onstr aine d X GBo ost. Monotone constraints drop accuracy to 69.3% ( − 13.2 pp). The Sp ecifi- cation A counterexample (GWD = 5 . 014 m, L = 0 . 011 km, PGA = 0 . 33 g, logit = +0 . 221 ) shows that p er-feature constrain ts cannot enco de the comp ound reasoning that deep groundwater should ov erride prox- imit y and shaking. Sp ecification B is pro ven in tw o indep enden t w ays, p er-tree decomp osition (100 trees, 0.6 s total) and direct dual-ensem ble verification ( unsat in 480 s), but Sp ecifications C and D still return sat , confirming that monotone constraints fix single-feature violations while lea ving comp ound specifications unsatisfied. EBM variants. The unconstrained EBM violates all four sp ecifications. Sp ecification A is violated (GWD = 5 . 000 m, L = 0 . 005 km, logit = +1 . 68 , 84%), despite the EBM’s additive structure, b ecause the combined p ositiv e con tributions from proximit y ( L ), slop e, and PGA outw eigh the GWD shap e function’s negative con tribution at deep water table depths. Sp ecification B is violated (272 non-monotone segments in the PGA shap e function). Sp ecification C is violated ( L = 2 . 52 km, PGA = 0 . 33 g, logit = +0 . 13 , 53%); this query requires the sp ecialized QF_LRA solv er and completes in 1,331 s. Sp ecification D is violated (Slop e = 0 . 001 %, L = 2 . 72 km, logit = +0 . 03 , 51%). The PGA-monotone EBM prov es Sp ecification B via p er-term decomposition but still violates Sp ecifi- cation A (GWD = 5 . 00 m, L = 0 . 005 km, logit = +0 . 14 , 53%), with accuracy dropping 9.6 pp (80.1% → 70.5%). F ul ly c onstr aine d EBM. T o test whether more sp ecifications can be satisfied sim ultaneously , we train an EBM with monotone constrain ts on all four features (GWD decreasing, L decreasing, Slope increasing, PGA increasing), enforcing the ph ysical directions for every feature. This mo del prov es three of four specifications, 16 namely Sp ecification B (PGA monotonicity , via p er-term decomp osition), Sp ecification C (distance safet y), and Specification D (flat-ground safety), the latter t wo prov en b y the SMT solv er returning unsat . Ho wev er, Sp ecification A (w ater table depth) is violated b y a thin margin; the solver finds a coun terexample at GWD = 5 . 003 m, L = 0 . 003 km, Slope = 2 . 5 %, PGA = 0 . 46 g, with a predicted probability of 50.3% (logit = +0 . 012 ). Even with monotonicity constraints on all four features, the mo del cannot guarantee Sp ecification A b ecause monotonicity enforces the dir e ction of a feature’s effect, not an absolute threshold on the output. The accuracy cost of full constraint is 12.9 p ercen tage p oints (80.1% → 67.2%), but the mo del ac hieves the highest sp ecification compliance of any v ariant tested. Figure 6 illustrates the effect. F or a fixed soil profile, unconstrained XGBoost and EBM exhibit clear non-monotone PGA regions, while the constrained v ariants pro duce strictly increasing curves. 0.35 0.40 0.45 0.50 0.55 PGA (g) 0.0 0.2 0.4 0.6 0.8 1.0 P(Lateral Spreading) (a) XGBoost Non-monotone 0.35 0.40 0.45 0.50 0.55 PGA (g) (b) EBM (no constraints) Non-monotone 0.35 0.40 0.45 0.50 0.55 PGA (g) (c) EBM (PGA monotone) 0.35 0.40 0.45 0.50 0.55 PGA (g) (d) EBM (full constraints) GWD = 1.5 m, L = 0.5 km, Slope = 2.0% Figure 6: PGA monotonicity comparison across mo del families for a fixed soil profile (GWD = 1.5 m, L = 0.5 km, Slop e = 2.0%). Red dots mark non-monotone regions where increasing PGA decreases predicted probability . Unconstrained XGBoost and EBM exhibit violations; monotone-constrained varian ts do not. 5.3. Par eto fr ontier: ac cur acy versus physic al c onsistency T o quan tify the accuracy cost of physical consistency , w e trained all 33 mo del v arian ts and measured Sp ecification B compliance as the fraction of 120 empirically sampled PGA-sw eep profiles satisfying mono- tonicit y (Figure 7). This empirical proxy enables broad hyperparameter comparison; formal SMT pro ofs (T able 4) are applied to representativ e mo dels in each family . Unconstrained mo dels cluster at high accuracy (74–84%) with near-zero compliance; constrained mo dels cluster at lo wer accuracy (68–71%) with full compliance. No mo del o ccupies the top-right region where accuracy exceeds 80% and compliance reaches 100%, consisten t with unconstrained mo dels fitting patterns that violate the sp ecifications. T able 5 confirms this pattern across all 33 v arian ts. The 10–15 percentage p oin t accuracy gap p ersists regardless of mo del family or h yp erparameter setting. 5.4. SHAP versus formal explanations SHAP explains how a mo del arrives at a prediction b y decomp osing it into p er-feature contributions. F or- mal verification answers a differen t question, namely do es the prediction comply with physical sp ecifications? W e show that these tw o questions can giv e contradictory signals. 17 0.0 0.2 0.4 0.6 0.8 1.0 Physical Consistency (1 violation rate) 0.60 0.65 0.70 0.75 0.80 0.85 0.90 T est Accuracy Ideal region Accuracy consistency gap XGBoost XGBoost (mono) EBM EBM (mono) Figure 7: Accuracy versus physical consistency (Specification B compliance) for 33 model v arian ts. Unconstrained mo dels cluster at high accuracy with zero compliance (upp er-left); constrained models cluster at low er accuracy with full compliance (low er-righ t). The fully constrained EBM (low est green diamond, 67.2%) achiev es full Specification B compliance and pro v es 3 of 4 specifications via SMT. No mo del occupies the ideal top-righ t region. T able 5: Summary statistics for the Pareto frontier, aggregated across hyperparameter v arian ts. Violations and compliance refer to Specification B (PGA monotonicity) only , measured on 120 empirically sampled PGA-sweep profiles. F ull sp ecification verification results (Sp ecifications A–D) are in T able 4. Mo del family V arian ts A cc. range B violations B compliance X GBo ost (unconstrained) 12 74.0–84.2% 120/120 0% X GBo ost (monotone) 12 68.3–70.0% 0/120 ✓ 100% ✓ EBM (unconstrained) 4 74.6–80.1% 116/120 3.3% EBM (monotone) 4 67.9–70.5% 0/120 ✓ 100% ✓ EBM (fully constrained) 1 67.2% 0/120 ✓ 100% ✓ 18 0.0 0.5 1.0 1.5 2.0 L og- odds (a) Specification A : SHAP waterfall at counter e xample ( G W D = 5 . 6 5 m > 5 m s h o u l d p r e d i c t n o s p r e a d i n g ) GWD = 5.65 Slope = 4.86 PGA = 0.51 L = 1.21 -0.12 +0.52 +0.61 +1.67 E [ f ( x ) ] = - 0 . 3 8 f ( x ) = + 2 . 3 1 ( 9 1 % s p r e a d i n g ) 0.35 0.40 0.45 0.50 0.55 PGA (g) (b) Specification B: SHAP dependence for PGA ( h i g h e r P G A s h o u l d h i g h e r r i s k ) 4 3 2 1 0 1 2 3 SHAP value (log- odds) Figure 8: SHAP analysis of sp ecification violations (unconstrained XGBoost). (a) W aterfall at a Sp ecification A counterexample (GWD = 5 . 65 m, predicted 91% spreading). Red bars push tow ard spreading, blue bars push against it. GWD is the feature that violates Sp ecification A, yet SHAP ranks it last (#4) with the smallest magnitude ( − 0 . 12 ). (b) SHAP dep endence plot for PGA. Each dot is one test sample; color indicates the SHAP v alue (red = pushes tow ard spreading, blue = pushes against). The black line (binned mean) should increase monotonically if the model respects Sp ecification B; instead, it drops sharply b et ween 0.42–0.52 g (shaded region, red arrows), confirming the non-monotonicity that formal verification detects. Figure 8 illustrates this gap using a Sp ecification A coun terexample. The input has GWD = 5 . 65 m, L = 1 . 21 km, Slop e = 4 . 86 %, PGA = 0 . 51 g. The mo del predicts spreading with 91% confidence (logit = +2 . 31 ), violating Sp ecification A (GWD > 5 m should not predict spreading). Panel (a) sho ws the standard SHAP waterfall that an engineer w ould insp ect. L contributes +1 . 67 , PGA +0 . 61 , Slop e +0 . 52 , and GWD only − 0 . 12 . An engineer would read this as “close to a free face, strong shaking, steep slop e” and find the prediction reasonable. GWD, the feature that makes this prediction a sp ecification violation, is rank ed last by SHAP with the smallest magnitude. The underlying issue is that SHAP decomp oses predictions, not v alidates them. SHAP correctly rep orts that GWD contributes − 0 . 12 log-o dds at this input, but it ranks the GWD feature last, indicating its minimal impact. In contrast, formal verification prov es that any input with GWD > 5 m violates Sp ecification A, regardless of the con tributions from other features. While SHAP explains the model’s reasoning for a sp ecific input, it cannot detect that the mo del’s ov erall logic violations against a sp ecification. The same pattern app ears for Specification B (Figure 8b). The SHAP dependence plot for PGA sho ws that the mo del’s SHAP v alues drop sharply b et ween 0.42–0.52 g instead of increasing monotonically . While the global dep endence plot makes this visible, individual w aterfall plots w ould not detect the violation, because eac h w aterfall app ears lo cally reasonable. F ormal verification detects the violation automatically and returns a concrete coun terexample. 5.5. Counter example analysis Eac h SMT counterexample is a concrete input vector that the solver constructs to violate the sp ecifica- tion. V alidating each counterexample against the actual mo del confirms the enco ding’s fidelity; T able 6 lists 19 represen tative counterexamples for the unconstrained XGBoost mo del. T able 6: Representativ e counterexamples from SMT verification of unconstrained XGBoost. A p ositive logit (probability > 50 %) means the mo del predicts spreading, violating the sp ecification. All counterexamples were v alidated against the actual XGBoost model to machine precision. Sp ec. GWD (m) L (km) Slop e (%) PGA (g) Logit Probabilit y A (w ater table) 5.001 1.35 1.58 0.37 +0.065 52% C (distance safet y) 1.73 2.501 0.05 0.33 +0.169 54% D (flat-ground) 1.72 2.501 0.05 0.33 +0.169 54% The Sp ecification A counterexample (GWD = 5 . 001 m) predicts spreading at 52% probabilit y b ecause the mo del relies on proximit y ( L = 1 . 35 km) while insufficiently weigh ting groundwater depth. The Sp eci- fication C counterexample ( L = 2 . 501 km, PGA = 0 . 33 g) predicts spreading at 54% in a region where the Christc hurc h data con tain no observ ed spreading cases, b ecause the ensem ble accumulates positive con tribu- tions where training data is sparse. Both coun terexamples o ccur in data-sparse b oundary regions, connecting to the ro ot cause identified by Hsiao et al. [12]. F ormal v erification transforms the in tuition that “sparse data causes problems” in to a concrete, v alidated demonstration. 6. Discussion and conclusions F ormal verific ation versus empiric al testing. Grid-base d testing can detect violations but cannot certify their absence. T able 7 shows that uniform grid search with up to n = 200 p oints p er feature (1.6 billion ev aluations) finds zer o Sp ecification C violations, while the SMT solver finds a counterexample in 0.19 s. The violation region o ccupies a thin sliver ( L ∈ [2 . 500 , 2 . 505] km, Slop e < 0 . 05 %) smaller than a single grid step. When the solv er returns unsat , the sp ecification holds univ ersally , a guarantee qualitatively different from any finite-sample test. Sc alability. XGBoost verification completes in under 1 s p er sp ecification for single-ensemble queries. P er- tree decomp osition reduces monotonicity pro ofs by orders of magnitude compared to the direct dual-ensemble enco ding. EBM verification is slo w er due to larger enco dings; the most expensive query requires a specialized QF_LRA solver. Higher-dimensional mo dels will require abstraction techniques [29, 17] to main tain practical run times. Comp ound c onstr aints and c onstr aint inter actions. Monotone constraints enforce the direction of a feature’s marginal effect but cannot guaran tee threshold behavior (Specification A) or multi-feature conjunctions (Sp ecification C). Progressiv ely adding constraints improv es compliance, but some sp ecifications resist full resolution because monotonicity enforces the dir e ction of a feature’s effect, not an absolute threshold on the output. Promising extensions include coun terexample-guided data augmentation and p ost-ho c output clamping; this pap er establishes the diagnostic prerequisite by identifying what fails and wher e . 20 T able 7: Grid search v ersus SMT v erification for Sp ecifications C and D (unconstrained X GBo ost). Grid search ev aluates the model on n 4 uniformly spaced points ov er X . The SMT solver encodes the mo del symbolically and reasons ov er the en tire contin uous domain. Sp ecification D adds a Slop e < 0 . 1 % constraint, which further reduces the num ber of grid p oin ts in the premise region. Metho d P oints ev al. Premise (C) Premise (D) Time Violated? Grid ( n = 30 ) 810,000 18,900 630 < 0.1 s None found Grid ( n = 50 ) 6,250,000 150,000 3,000 0.1 s None found Grid ( n = 100 ) 100,000,000 2,160,000 21,60 0 1.7 s None found Grid ( n = 150 ) 506,250,000 10,530,000 140,400 17 s None found Grid ( n = 200 ) 1,600,000,000 32,640,000 326,400 184 s None found SMT (C) entir e c ontinuous domain 0.19 s Y es SMT (D) entir e c ontinuous domain 0.7 s Y es Why not emb e d har d c onstr aints and skip verific ation. A v ailable constraint mechanisms op erate p er-feature and cannot encode comp ound conditions (Specification C) or threshold guaran tees (Sp ecification A). Our results confirm that even the most constrained mo del cannot satisfy all sp ecifications. SMT counterexam- ples pinp oint wher e a mo del fails, providing actionable information that a pass/fail constrain t flag cannot. F ormal verification therefore complements hard constraints. Constrain ts restrict the hypothesis space, while v erification determines whether the resulting trained mo del satisfies the sp ecification. Deterministic versus pr ob abilistic guar ante es. The SMT solver pro ves that a specification holds for al l inputs or provides a single counterexample where it fails. This deterministic guarantee applies to the mo del , not to the physical system; it sa ys nothing about whether f correctly predicts the real-world outcome, which dep ends on data quality , feature sufficiency , and epistemic uncertaint y . A natural extension would combine formal v erification with probabilistic calibration (e.g., conformal prediction), providing complemen tary guar- an tees of physical consistency and w ell-calibrated uncertaint y . Data-a ware v erification [29], which constrains coun terexamples to remain near the training distribution, offers a middle ground. Sp e cific ation sele ction and sc op e. The four specifications exercise structurally different formula classes, in- cluding a single-feature threshold (Specification A), monotonicity (Sp ecification B), and compound multi- feature screening conditions (Sp ecifications C and D). This distinction matters b ecause standard monotone constrain ts can directly enforce only Sp ecification B. A pro duction sp ecification library would b e richer, in- corp orating CPT-based susceptibility , depth-weigh ted liquefaction p oten tial, or magnitude-duration effects, but each new sp ecification b ecomes an independent SMT query with no framew ork changes. This study is methodological. It demonstrates that SMT-based formal verification can c heck tree ensem bles against geotec hnical sp ecifications, rather than establishing a new predictor or regional mo del. 21 T able 8: Threshold sensitivity for Specification A (water table depth) against unconstrained XGBoost. The mo del violates Specification A at every GWD threshold. GWD threshold (m) Result Coun terexample logit 3.0 Violated +0 . 131 3.5 Violated +1 . 405 4.0 Violated +0 . 062 4.5 Violated +0 . 366 5.0 Violated +1 . 405 5.5 Violated +1 . 405 6.0 Violated +0 . 045 Limitations and futur e dir e ctions. The sp ecifications enco de necessary conditions, not sufficient ones. A complian t mo del could still learn non-physical patterns outside the sp ecification set. The approach applies only to tree-based mo dels; neural netw ork verification via MILP enco dings [17] is p ossible but computationally more exp ensiv e. The thresholds in Sp ecifications A and C are site-sp ecific Christch urc h b ounds, not univ ersal constan ts; applying the framew ork elsewhere requires recalibrating thresholds to lo cal conditions. T able 8 demonstrates that the unconstrained mo del violates Specification A across a range of thresholds, confirming a p ersisten t pattern rather than a narrow edge case. Extending the framework to regression models predicting displacemen t magnitudes is conceptually straigh tforward in SMT but requires reformulating sp ecifications o ver contin uous outputs. In summary , SMT-based formal verification can detect physical inconsistencies in tree ensemble mo dels that empirical testing and p ost-hoc explanations miss. T raining-time constrain ts fix targeted violations but cannot enforce comp ound or threshold sp ecifications, and physical consistency comes at a quantifiable accuracy cost. The v erify-fix-verify lo op (train for accuracy , v erify against sp ecifications, apply targeted constrain ts, re-verify) provides a systematic workflo w for deploying physically consistent ML mo dels in safety- critical geotechnical applications. As ML mo dels increasingly inform engineering decisions, formal v erification offers a path to ward certifying that these mo dels resp ect established sp ecifications b efore field deploymen t. Data av ailabilit y The Christc hurc h lateral spreading dataset is av ailable from Duran te and Rathje [9]. The verification co de and trained mo dels are av ailable at https://github.com/geoelements- dev/2026- formal- verify- liq . References [1] Barrett, C., Sebastiani, R., Seshia, S.A., Tinelli, C., 2021. Satisfiability mo dulo theories, in: Biere, A., Heule, M., v an Maaren, H., W alsh, T. (Eds.), Handb o ok of Satisfiability . 2nd ed.. IOS Press, pp. 22 1267–1329. doi: 10.3233/FAIA201017 . [2] Breiman, L., 2001. Random forests. Machine Learning 45, 5–32. doi: 10.1023/A:1010933404324 . [3] Can, R., Kocaman, S., Gokceoglu, C., 2021. A comprehensive assessmen t of XGBoost algorithm for landslide susceptibility mapping in the upper basin of Ataturk dam, T urkey. Applied Sciences 11, 4993. doi: 10.3390/app11114993 . [4] Chen, H., Zhang, H., Si, S., Li, Y., Boning, D., Hsieh, C.J., 2019. Robustness verification of tree-based mo dels, in: A dv ances in Neural Information Pro cessing Systems, pp. 12317–12328. [5] Chen, T., Guestrin, C., 2016. XGBoost: A scalable tree bo osting system, in: Pro ceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discov ery and Data Mining, pp. 785–794. doi: 10.1145/2939672.2939785 . [6] Chen, Z., Ni, P ., Chen, G., W ang, Y., Chen, J., W ang, C., 2024. Ev aluation of liquefaction-induced lateral spread displacement based on ensemble learning. Geological Journal 59, 2534–2548. doi: 10.1002/ gj.4951 . [7] Demir, S., Şahin, E.K., 2023. Predicting o ccurrence of liquefaction-induced lateral spreading using gradi- en t b o osting algorithms integrated with particle swarm optimization: PSO-X GBo ost, PSO-LightGBM, and PSO-CatBo ost. Acta Geotechnica 18, 3403–3419. doi: 10.1007/s11440- 022- 01777- 1 . [8] Dev os, L., Meert, W., Davis, J., 2021. V ersatile verification of tree ensembles. Pro ceedings of Machine Learning Researc h 139, 2654–2664. [9] Duran te, M.G., Rathje, E.M., 2021. An exploration of the use of mac hine learning to predict lateral spreading. Earthquake Sp ectra 37, 2288–2314. doi: 10.1177/87552930211004613 . [10] Geyin, M., Maurer, B.W., Bradley , B.A., Green, R.A., v an Ballego o y , S., 2022. Developmen t and v alidation of a probabilistic liquefaction triggering mo del for CPT-based analyses. ASCE Journal of Geotec hnical and Geo en vironmen tal Engineering 148, 04022042. doi: 10.1061/(ASCE)GT.1943- 5606. 0002814 . [11] Hsiao, C.H., Kumar, K., Rathje, E.M., 2024. Explainable AI mo dels for predicting liquefaction-induced lateral spreading. F rontiers in Built Environmen t 10, 1387953. doi: 10.3389/fbuil.2024.1387953 . [12] Hsiao, C.H., Kumar, K., Rathje, E.M., 2025a. Domain-informed explainable b oosting mac hines for trust worth y lateral spread predictions. Manuscript in preparation. [13] Hsiao, C.H., Rathje, E.M., Kumar, K., 2025b. Inv estigating the effect of CPT in lateral spreading prediction using explainable AI. arXiv preprint arXiv:2503.13389 . 23 [14] Huang, X., Marques-Silv a, J., 2024. On the failings of Shapley v alues for explainability . International Journal of Appro ximate Reasoning 171, 109112. doi: 10.1016/j.ijar.2023.109112 . [15] Ignatiev, A., Izza, Y., Stuck ey , P .J., Marques-Silv a, J., 2022. Using MaxSA T for efficient explanations of tree ensembles, in: Proceedings of the 36th AAAI Conference on Artificial Intelligence, pp. 3776–3785. doi: 10.1609/aaai.v36i4.20292 . [16] Ignatiev, A., Naro dytsk a, N., Marques-Silv a, J., 2019. Ab duction-based explanations for machine learn- ing mo dels, in: Pro ceedings of the 33rd AAAI Conference on Artificial Intelligence, pp. 1511–1519. [17] Kan tchelian, A., Tygar, J.D., Joseph, A.D., 2016. Ev asion and hardening of tree ensem ble classifiers, in: Pro ceedings of the 33rd International Conference on Machine Learning, pp. 2387–2396. [18] Kumar, K., 2025. Deep learning in geotechnical engineering: a critical assessmen t of PINNs and op erator learning. arXiv preprint arXiv:2512.24365 . [19] Lou, Y., Caruana, R., Gehrk e, J., 2012. Intelligible mo dels for classification and regression, in: Pro ceed- ings of the 18th ACM SIGKDD International Conference on Knowledge Discov ery and Data Mining, pp. 150–158. doi: 10.1145/2339530.2339556 . [20] Lundb erg, S.M., Lee, S.I., 2017. A unified approac h to interpreting mo del predictions. Adv ances in Neural Information Pro cessing Systems 30. [21] Maurer, B.W., Sanger, M.D., 2023. Why “ AI” mo dels for predicting soil liquefaction ha ve been ignored, plus some that shouldn’t b e. Earthquake Sp ectra 39, 1883–1910. doi: 10.1177/87552930231173711 . [22] de Moura, L., Bjørner, N., 2008. Z3: An efficien t SMT solv er, in: International Conference on T ools and Algorithms for the Construction and Analysis of Systems, Springer. pp. 337–340. doi: 10.1007/ 978- 3- 540- 78800- 3_24 . [23] Nori, H., Jenkins, S., K och, P ., Caruana, R., 2019. In terpretML: A unified framework for machine learning in terpretability . arXiv preprin t arXiv:1909.09223 . [24] Raissi, M., Perdik aris, P ., Karniadakis, G.E., 2019. Ph ysics-informed neural netw orks: A deep learning framew ork for solving forw ard and inv erse problems in volving nonlinear partial differential equations. Journal of Computational Ph ysics 378, 686–707. doi: 10.1016/j.jcp.2018.10.045 . [25] Rateria, G., Maurer, B.W., 2022. Ev aluation and up dating of Ishihara’s (1985) mo del for liquefaction surface expression. Soils and F oundations 62, 101131. doi: 10.1016/j.sandf.2022.101131 . [26] Rib eiro, M.T., Singh, S., Guestrin, C., 2016. “ Why should I trust you?”: Explaining the predictions of any classifier, in: Pro ceedings of the 22nd A CM SIGKDD International Conference on Knowledge Disco very and Data Mining, pp. 1135–1144. doi: 10.1145/2939672.2939778 . 24 [27] Sato, N., et al., 2020. F ormal v erification of a decision-tree ensemble mo del and detection of its violation ranges. IEICE T ransactions on Information and Systems E103-D, 363–378. [28] Törn blom, J., Nadjm-T ehrani, S., 2020. F ormal v erification of input-output mappings of tree ensembles. Science of Computer Programming 194, 102450. doi: 10.1016/j.scico.2020.102450 . [29] V arshney , N., Gupta, A., Ahmad, A., T ay al, T.V., Aksha y , S., 2026. Data-a ware and scalable sensitivit y analysis for decision tree ensembles, in: Pro ceedings of the 14th International Conference on Learning Represen tations. [30] Y uan, B., Cho o, C.S., Y eo, L.Y., W ang, Y., Y ang, Z., Guan, Q., Suryasen tana, S., Cho o, J., Shen, H., Megia, M., Zhang, J., Liu, Z., Song, Y., W ang, H., Chen, X., 2025. Ph ysics-informed machine learning in geotec hnical engineering: a direction pap er. Geomechanics and Geo engineering 20, 1128–1159. [31] Zhang, W., Gu, X., T ang, L., Yin, Y., Liu, D., Zhang, Y., 2022. Application of machine learning, deep learning and optimization algorithms in geo engineering and geoscience: Comprehensive review and future c hallenge. Gondw ana Research 109, 1–17. doi: 10.1016/j.gr.2022.03.015 . 25

Original Paper

Loading high-quality paper...

Comments & Academic Discussion

Loading comments...

Leave a Comment