Conformalized Signal Temporal Logic Inference under Covariate Shift

Yixuan Wang¹, Danyang Li², Matthew Cleaveland³, Roberto Tron², Mingyu Cai¹

Abstract: Signal Temporal Logic (STL) inference learns interpretable logical rules for temporal behaviors in dynamical systems. To ensure the correctness of learned STL formulas, recent approaches have incorporated conformal prediction as a statistical tool for uncertainty quantification. However, most existing methods assume that calibration and testing data are identically distributed and exchangeable, an assumption that is frequently violated in real-world settings. This paper proposes a conformalized STL inference framework that explicitly addresses covariate shift between training and deployment trajectory datasets. Technically, the approach first employs a template-free, differentiable STL inference method to learn an initial model, and then refines it using a limited deployment-side dataset to promote distribution alignment. To provide validity guarantees under distribution shift, the framework estimates the likelihood ratio between training and deployment distributions and integrates it into an STL-robustness-based weighted conformal prediction scheme. Experimental results on trajectory datasets demonstrate that the proposed framework preserves the interpretability of STL formulas while significantly improving symbolic learning reliability at deployment time. The project page is available at https://sites.google.com/ucr.edu/confrtlics?usp=sharing.

Index Terms: Formal Methods, Conformal Prediction, Signal Temporal Logic, Temporal Logic Inference

I. INTRODUCTION

Interpretable decision models are particularly valuable in control and robotics, where learned decision rules are often used in safety-critical settings and must therefore be inspected and verified.
¹ Mingyu Cai and Yixuan Wang are with Mechanical Engineering, University of California, Riverside, CA, 92521, USA. {mingyu.cai, ywang1457}@ucr.edu
² Danyang Li and Roberto Tron are with Mechanical Engineering, Boston University. {danyangl, tron}@bu.edu
³ Matthew Cleaveland is with MIT Lincoln Laboratory, Lexington, MA, 02421, USA. {matthew.cleaveland@ll}@mit.edu
DISTRIBUTION STATEMENT A. Approved for public release. Distribution is unlimited. This material is based upon work supported by the Department of the Army under Air Force Contract No. FA8702-15-D-0001 or FA8702-25-D-B002. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Department of the Army. © 2026 Massachusetts Institute of Technology. Delivered to the U.S. Government with Unlimited Rights, as defined in DFARS Part 252.227-7013 or 7014 (Feb 2014). Notwithstanding any copyright notice, U.S. Government rights in this work are defined by DFARS 252.227-7013 or DFARS 252.227-7014 as detailed above. Use of this work other than as specifically authorized by the U.S. Government may violate any copyrights that exist in this work.

Signal Temporal Logic (STL) provides an interpretable formal language for describing temporal behaviors of dynamical systems. It specifies temporal properties through human-readable logical formulas and equips them with a quantitative robustness semantics that measures the degree of satisfaction or violation. These properties make STL a useful foundation for learning interpretable classifiers over trajectories. A substantial body of prior work has studied STL inference from data. Early methods typically relied on fixed logical templates or small template families and optimized real-valued parameters using robustness-based objectives [1], [2].
More recent approaches have expanded the space of learnable specifications through logic-based learning from demonstrations and differentiable neural-symbolic frameworks that embed temporal and Boolean operators into trainable computation graphs [3], [4], [5], [6], [7], [8], [9]. These developments improve scalability and flexibility while preserving the interpretability of the learned specifications. In particular, recent differentiable STL learning approaches optimize both formula structure and parameters directly from trajectory data, providing a practical route to compact and expressive symbolic classifiers.

However, interpretability alone is not sufficient for deployment. In many control and robotics applications, a learned classifier is expected not only to be accurate but also to quantify the reliability of its decisions. A symbolic STL formula may be human-readable, yet its predictions can still be poorly calibrated or statistically unreliable when learned from finite data. This motivates the need for statistical correctness guarantees in addition to interpretable formula learning. Conformal prediction (CP) provides a principled framework for this purpose by converting a real-valued score into a calibrated decision rule with finite-sample coverage guarantees under exchangeability [10], [11]. In practice, split CP uses a held-out calibration set to compute empirical quantiles of nonconformity scores, which are then used as thresholds for constructing calibrated decision rules [12], [13], [14]. CP has also been used in robotics and control to provide calibrated decision rules and uncertainty-aware safety wrappers for learning-based systems [15], [16], [17]. In previous works [18], [19], CP was integrated with differentiable STL inference, e.g., TLINet [7] and STLcg [6], to quantify prediction uncertainty while preserving interpretability.
A further challenge is that standard CP relies on exchangeability, meaning that the calibration data and the future test data are assumed to follow the same distributional mechanism. STL formulas are typically learned from trajectories collected under nominal or controlled conditions, whereas deployment trajectories may differ because of environmental variation, operational changes, or data-collection bias. Such covariate shift can substantially degrade deployment-time reliability even when the conditional labeling rule remains unchanged. Existing STL inference methods primarily optimize empirical performance on source-side data and do not explicitly address calibration under mismatch between source-side and deployment-side distributions. Prior conformalized STL inference [18], [19] likewise does not explicitly address mismatch among training, calibration, and deployment data. Weighted CP provides one statistical mechanism for handling such mismatch by reweighting calibration samples according to their deployment relevance [20], [15]. We therefore posit that, in the STL setting, reliability under distribution shift should be addressed jointly at two levels: the learned formula should adapt toward deployment-relevant regions, and the final decision threshold should be calibrated in a distribution-aware manner. Our framework is broadly compatible with differentiable STL inference methods. In the sequel, we formulate the method using binary trajectory classification as a concrete instantiation.

Motivated by this challenge, we study conformalized STL inference under covariate shift. Our approach starts from a differentiable STL inference model and refines the learned formula using an additional dataset that is more representative of deployment conditions. The refinement stage adapts the formula toward the test-time distribution through data reweighting [21], [22] and robustness-distribution regularization [23].
We then perform weighted CP for the resulting robustness-based decision rule, so that calibration better reflects the test-time distribution. In this way, the framework jointly addresses formula adaptation and statistical calibration under distribution shift.

Contributions are as follows:
• We develop a covariate-shift-aware incremental learning scheme for differentiable STL inference that adapts a pretrained STL formula through STL-robustness-based distributional alignment, while maintaining both classification performance and interpretability.
• To provide statistical correctness under distribution mismatch, we extend conformalized STL inference to the covariate shift setting using weighted CP [20] with a robustness-based nonconformity score tailored to learned STL formulas.
• To improve the stability of shift alignment during learning, we introduce a distribution-level termination criterion that stops ineffective learning iterations.
• Experiments on trajectory datasets demonstrate that the proposed method achieves well-calibrated inference coverage under covariate shift and reduces empirical miscoverage.

II. PRELIMINARIES

We work with trajectories of the form $X = (x_0, \dots, x_T)$, where $x_t \in \mathbb{R}^d$ denotes the system state at time $t$.

A. Signal Temporal Logic Inference

Signal Temporal Logic (STL) is a formal language for describing temporal and spatial properties of time-series data. Atomic predicates are taken to be linear inequalities of the form

$$\mu(x_t) \equiv a^\top x_t \ge b, \quad a \in \mathbb{R}^d,\ b \in \mathbb{R},$$

which test whether a linear constraint on the state is satisfied at a given time. STL formulas are generated from atomic predicates according to the grammar

$$\phi ::= \mu \mid \neg\phi \mid \phi_1 \wedge \phi_2 \mid \phi_1 \vee \phi_2 \mid \Diamond_{[t_1,t_2]}\phi \mid \Box_{[t_1,t_2]}\phi,$$

where $\neg$, $\wedge$, and $\vee$ denote Boolean negation, conjunction, and disjunction, and $\Diamond_{[t_1,t_2]}$ and $\Box_{[t_1,t_2]}$ are the temporal "eventually" and "always" operators over the interval $[t_1, t_2] \subseteq \{0, \dots, T\}$.

The standard quantitative semantics of STL associates with each formula $\phi$, trajectory $X$, and time $t$ a real-valued robustness score $\rho_\phi(X, t) \in \mathbb{R}$ that measures the degree of satisfaction. We denote $\rho_\phi(X) := \rho_\phi(X, 0)$. Positive values indicate satisfaction, negative values indicate violation, and larger absolute values correspond to larger robustness margins. For atomic predicates, Boolean operators, and temporal operators, the robust semantics are defined recursively as follows:

$$\rho_\mu(X, t) = a^\top x_t - b \tag{1}$$
$$\rho_{\neg\phi}(X, t) = -\rho_\phi(X, t) \tag{2}$$
$$\rho_{\phi_1 \wedge \phi_2}(X, t) = \min\{\rho_{\phi_1}(X, t),\ \rho_{\phi_2}(X, t)\} \tag{3}$$
$$\rho_{\phi_1 \vee \phi_2}(X, t) = \max\{\rho_{\phi_1}(X, t),\ \rho_{\phi_2}(X, t)\} \tag{4}$$
$$\rho_{\Diamond_{[t_1,t_2]}\phi}(X, t) = \max_{\tau \in [t_1,t_2]} \rho_\phi(X, t+\tau) \tag{5}$$
$$\rho_{\Box_{[t_1,t_2]}\phi}(X, t) = \min_{\tau \in [t_1,t_2]} \rho_\phi(X, t+\tau) \tag{6}$$

STL formulas can be used as binary classifiers for trajectories through the sign of the robustness score. Given a labeled trajectory $(X, Y)$ with $Y \in \{-1, +1\}$, define

$$\hat{Y}(X) = \begin{cases} +1, & \rho_\phi(X) > 0, \\ -1, & \text{otherwise.} \end{cases} \tag{7}$$

Thus, the product $Y\rho_\phi(X)$ is positive when the robustness sign agrees with the true label and negative otherwise.

a) Misclassification rate (MCR): For an evaluation dataset $D = \{(X_i, Y_i)\}_{i=1}^{n}$, the misclassification rate (MCR) induced by $\phi$ is defined as

$$\mathrm{MCR}(D; \phi) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\big[\hat{Y}(X_i) \ne Y_i\big].$$

Minimizing the MCR corresponds to learning an STL formula whose robustness sign aligns with the ground-truth labels.

b) Differentiable STL learning (TLINet): TLINet [7] is adopted as a differentiable STL learner for optimizing STL formulas from data. Since the 0-1 misclassification loss $\mathbb{1}[Y\rho_{\phi_\theta}(X) \le 0]$ is non-differentiable, optimization is performed using a smooth margin-based surrogate, namely the logistic loss, as described in Section IV-A.

B. Conformal Prediction

Conformal Prediction (CP) [11] is a distribution-free procedure that augments the output of a fixed predictive model with prediction regions that enjoy finite-sample coverage guarantees. In split CP, calibration is performed on a held-out dataset using a nonconformity score, which measures how inconsistent a labeled sample is with the model's decision rule. The dataset is partitioned into a training set $D_{\mathrm{train}}$, a calibration set $D_{\mathrm{cal}} = \{(X_i, Y_i)\}_{i=1}^{n_{\mathrm{cal}}}$, and a test set $D_{\mathrm{test}}$, where $D_{\mathrm{cal}}$ and a new test sample are exchangeable. In our shift-aware setting, the calibration data are not assumed to be exchangeable with the test data. Instead, weighted CP is used to approximate deployment-relevant nonconformity score quantiles under covariate shift.

Let $A : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ denote a measurable nonconformity score function, where $\mathcal{X}$ is the trajectory space and $\mathcal{Y}$ is the label space. The nonconformity scores on the calibration set are computed as

$$s_i = A(X_i, Y_i), \quad i = 1, \dots, n_{\mathrm{cal}}. \tag{8}$$

For a target miscoverage level $\alpha \in (0, 1)$, the conformal threshold is defined as the $(1-\alpha)$ quantile of the multiset $\{s_1, \dots, s_{n_{\mathrm{cal}}}\}$. Specifically, let $k = \lceil (n_{\mathrm{cal}} + 1)(1 - \alpha) \rceil$ and define $T_{\mathrm{CP}} := s_{(k)}$, where $s_{(k)}$ denotes the $k$-th smallest value. The resulting CP set for a test input $X_{\mathrm{test}} \in \mathcal{X}$ is

$$C(X_{\mathrm{test}}) := \{Y_{\mathrm{test}} \in \mathcal{Y} : A(X_{\mathrm{test}}, Y_{\mathrm{test}}) \le T_{\mathrm{CP}}\}.$$

Under the exchangeability assumption, the CP guarantee yields

$$P\big(Y_{\mathrm{test}} \in C(X_{\mathrm{test}})\big) \ge 1 - \alpha. \tag{9}$$

III. PROBLEM FORMULATION

We consider the problem of binary classification of system trajectories using STL. Let $\phi_\theta$ denote an STL formula learned by a differentiable STL inference method, where $\theta$ collects the learnable formula parameters, such as predicate and temporal parameters. The induced robustness score $\rho_{\phi_\theta}(X)$ is used to define the binary classifier.
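To make the robustness-based classifier concrete, the following is a minimal NumPy sketch of the recursive semantics in Eqs. (1)-(6), the sign classifier of Eq. (7), and the MCR. It is illustrative only and is not the TLINet implementation: the function names, the array layout (one row per time step), and the convention of returning NaN at time steps where a temporal window runs past the end of the trajectory are all assumptions made here.

```python
import numpy as np

def rho_pred(X, a, b):
    """Eq. (1): robustness of a^T x_t >= b at every t; X has shape (T+1, d)."""
    return X @ np.asarray(a, dtype=float) - b

def rho_not(r):
    """Eq. (2): negation flips the sign of the robustness signal."""
    return -r

def rho_and(r1, r2):
    """Eq. (3): conjunction is the pointwise minimum."""
    return np.minimum(r1, r2)

def rho_or(r1, r2):
    """Eq. (4): disjunction is the pointwise maximum."""
    return np.maximum(r1, r2)

def _sliding(r, t1, t2, reduce):
    # Assumed convention: NaN wherever the window [t+t1, t+t2] leaves the horizon.
    n = len(r)
    out = np.full(n, np.nan)
    for t in range(n - t2):
        out[t] = reduce(r[t + t1 : t + t2 + 1])
    return out

def rho_eventually(r, t1, t2):
    """Eq. (5): maximum of the sub-formula robustness over [t+t1, t+t2]."""
    return _sliding(r, t1, t2, np.max)

def rho_always(r, t1, t2):
    """Eq. (6): minimum of the sub-formula robustness over [t+t1, t+t2]."""
    return _sliding(r, t1, t2, np.min)

def classify(rho0):
    """Eq. (7): predict +1 iff the robustness at time 0 is strictly positive."""
    return 1 if rho0 > 0 else -1

def mcr(rhos, labels):
    """Misclassification rate of the sign classifier on a labeled dataset."""
    preds = np.where(np.asarray(rhos) > 0, 1, -1)
    return float(np.mean(preds != np.asarray(labels)))

# Toy 1-D trajectory x_t = t and the hypothetical formula
# eventually_[0,2](x >= 1.5) AND always_[0,1](x <= 3).
X = np.arange(5, dtype=float).reshape(-1, 1)
r1 = rho_eventually(rho_pred(X, [1.0], 1.5), 0, 2)
r2 = rho_always(rho_pred(X, [-1.0], -3.0), 0, 1)   # x <= 3 written as -x >= -3
rho0 = rho_and(r1, r2)[0]
print(rho0, classify(rho0))
```

For this toy trajectory the eventually-branch has robustness 0.5 and the always-branch 2.0 at time 0, so the conjunction evaluates to 0.5 and the trajectory is classified as +1.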
a) Data distributions and covariate shift: Following the distribution-shift setting described in the introduction, let $P_{\mathrm{train}}$ denote the nominal training distribution and $P_{\mathrm{dep}}$ denote the deployment distribution. We consider a covariate shift setting in which the conditional labeling rule remains invariant while the marginal distribution over trajectories changes.

Assumption 1 (Covariate Shift under STL Semantics). We assume

$$P_{\mathrm{train}}(Y \mid X) = P_{\mathrm{dep}}(Y \mid X), \qquad P_{\mathrm{train}}(X) \ne P_{\mathrm{dep}}(X).$$

This assumption is natural in our setting because labels are defined by STL satisfaction, which depends only on the trajectory and the specification, and is therefore invariant to operating conditions.

b) Data partitioning: We consider three mutually disjoint datasets. A core training set $D_{\mathrm{train}} \sim P_{\mathrm{train}}$ is used to learn an initial STL formula and define task semantics. Because it is collected under nominal or conservative conditions, it may not adequately cover the range of trajectories encountered at deployment. We assume access to a limited deployment-side dataset $D_{\mathrm{dep}}$ that captures behaviors underrepresented in $D_{\mathrm{train}}$. Although $D_{\mathrm{dep}}$ is only a partial and potentially biased sample from $P_{\mathrm{dep}}$, it provides information about regions likely to arise at test time and is incorporated only through distribution-aware objectives, without altering the labeling rule. Finally, a separate calibration set $D_{\mathrm{cal}} \sim P_{\mathrm{train}}$ is reserved exclusively for weighted conformal calibration of the robustness-based decision rule [11], [17]. Since $D_{\mathrm{cal}}$ is drawn from $P_{\mathrm{train}}$ whereas deployment-time test trajectories follow $P_{\mathrm{dep}}$, standard calibration is no longer distribution-matched. We therefore use importance weighting to make calibration better reflect the target deployment distribution.

c) Problem Formulation: Given $D_{\mathrm{train}}$, $D_{\mathrm{dep}}$, and $D_{\mathrm{cal}}$ as defined above, the objective is to learn $\phi_\theta$.
For a given $\theta$, the corresponding prediction set is induced by the calibration rule and is denoted by $C_\theta(\cdot)$. The resulting STL decision rule should achieve both low misclassification and reliable coverage under the deployment distribution:

$$\min_\theta\ \mathrm{MCR}_{P_{\mathrm{dep}}}(\phi_\theta) := \mathbb{E}_{(X,Y) \sim P_{\mathrm{dep}}}\big[\mathbb{1}[Y\rho_{\phi_\theta}(X) \le 0]\big] \quad \text{s.t.} \quad P_{(X,Y) \sim P_{\mathrm{dep}}}\big(Y \in C_\theta(X)\big) \ge 1 - \alpha.$$

[Fig. 1: Pipeline of the proposed shift-aware conformal STL framework under covariate shift. The framework consists of four sequential stages integrating STL formula learning, distribution alignment, and weighted conformal calibration.]

Remark 1 (Challenges and Motivations). If the covariate shift is ignored, an STL formula learned only from $D_{\mathrm{train}}$ may overfit regions of the trajectory space that are well represented during training but less relevant at deployment, leading to degraded classification performance and miscalibration under $P_{\mathrm{dep}}$. A natural alternative is to retrain directly on $D_{\mathrm{dep}}$. This can be effective when $D_{\mathrm{dep}}$ is sufficiently large and representative, but in practice the available deployment-side data is often limited and distributionally mismatched relative to $D_{\mathrm{train}}$, making naive retraining statistically unstable. Our approach instead uses the STL formula learned from $D_{\mathrm{train}}$ as a warm start and incorporates $D_{\mathrm{dep}}$ through distribution-aware objectives that steer the learned robustness scores toward regions more relevant to deployment. To assess how informative the deployment-side data is, we monitor an effective sample size diagnostic derived from density-ratio weights. This quantity reflects the concentration of the weights and indicates whether additional refinement is likely to materially improve deployment alignment.

IV. CONFORMALIZED ACTIVE STL INFERENCE UNDER COVARIATE SHIFT

In Section IV-A, we present a shift-aware STL framework that learns a formula from training data and adapts it using deployment-proxy data.
Section IV-B establishes correctness guarantees under distribution shift. The overall pipeline is summarized in Fig. 1.

A. STL Inference Adaptation

The framework builds on TLINet [7] as a differentiable, template-free backbone for learning STL formulas from labeled trajectories, enabling gradient-based optimization and yielding interpretable, parameterized STL specifications. Training is performed by minimizing a margin-based surrogate loss applied to the signed robustness value $Y\rho_{\phi_\theta}(X)$. Following TLINet, the logistic loss is used:

$$\ell\big(Y\rho_{\phi_\theta}(X)\big) = \log\big(1 + \exp(-Y\rho_{\phi_\theta}(X))\big).$$

Given a core training dataset $D_{\mathrm{train}} = \{(X_i, Y_i)\}_{i=1}^{n_c}$, the nominal STL formula is obtained by minimizing the empirical surrogate loss

$$L_{\mathrm{train}}(\theta) = \frac{1}{n_c}\sum_{(X,Y) \in D_{\mathrm{train}}} \ell\big(Y\rho_{\phi_\theta}(X)\big).$$

Let $\theta_0$ denote a minimizer of $L_{\mathrm{train}}$, and $\phi_{\theta_0}$ the corresponding nominal STL formula learned from $D_{\mathrm{train}}$. This formula serves as a warm-start initialization for the subsequent shift-aware adaptation stage and provides the semantic structure of the STL specification. The adaptation stage further refines the formula parameters using deployment-side data $D_{\mathrm{dep}}$, moving them toward deployment-relevant trajectory regions through the distribution-aware objectives introduced below.

a) Density-Ratio Weighting on the Alignment Set: As defined in Section III, the learning setting assumes a covariate shift between training and deployment trajectories. For a trajectory $X \in \mathcal{X}$, the ideal importance weight is the density ratio that exactly reweights expectations from the training distribution to the deployment distribution:

$$\omega^\star(X) := \frac{p_{\mathrm{dep}}(X)}{p_{\mathrm{train}}(X)}, \tag{10}$$

where $p_{\mathrm{train}}$ and $p_{\mathrm{dep}}$ denote the underlying marginal densities of trajectories under the training and deployment distributions, respectively. Direct estimation of $\omega^\star(X)$ is generally difficult without strong parametric assumptions.
We therefore approximate it using a $k$-nearest-neighbor (kNN) density-ratio estimator in an embedding space [24], which provides a simple nonparametric approximation based on local sample density. Other density-ratio estimators, such as kernel mean matching (KMM), KLIEP, and uLSIF, could also be used [25]. The intuition is that, in a $p$-dimensional embedding space, the local sample density around $X$ is approximately inversely proportional to the volume of its kNN neighborhood, and hence to $r_k(X; S)^p$. Therefore, the ratio of kNN radii provides a local approximation to the density ratio between the deployment and training distributions.

To instantiate this estimator, let $f : \mathcal{X} \to \mathbb{R}^p$ denote a fixed embedding. For a reference set $S$, we define the kNN radius of $X$ as

$$r_k(X; S) = \big\| f(X) - f(X_{(k)}) \big\|_2,$$

where $X_{(k)} \in S$ is the $k$-th nearest neighbor of $X$. The unnormalized kNN density-ratio weight is then given by

$$\hat{\omega}(X) = \left( \frac{r_k(X; D_{\mathrm{train}})}{r_k(X; D_{\mathrm{dep}})} \right)^{p}.$$

To improve numerical stability, we normalize the raw weights to have unit empirical mean over $D_{\mathrm{dep}}$ and clip them to a fixed range:

$$\tilde{\omega}(X) = \frac{\hat{\omega}(X)}{\frac{1}{|D_{\mathrm{dep}}|}\sum_{X' \in D_{\mathrm{dep}}} \hat{\omega}(X')}, \tag{11}$$
$$\omega(X) = \min\big\{\max\{\tilde{\omega}(X),\ \omega_{\min}\},\ \omega_{\max}\big\}. \tag{12}$$

Hence, $\omega(X)$ is used as a practical approximation to the ideal density-ratio weight $\omega^\star(X)$. Although not exact, it captures relative distributional differences between $P_{\mathrm{train}}$ and $P_{\mathrm{dep}}$ and emphasizes trajectories more relevant to deployment.

Lemma 1. Under Assumption 1, assume that $P_{\mathrm{dep}}$ is absolutely continuous with respect to $P_{\mathrm{train}}$ over $\mathcal{X}$. Then, with the ideal density-ratio weight $\omega^\star(X)$, for any measurable function $h : \mathcal{X} \times \{-1, +1\} \to \mathbb{R}$ with finite expectation, we have

$$\mathbb{E}_{(X,Y) \sim P_{\mathrm{dep}}}[h(X, Y)] = \mathbb{E}_{(X,Y) \sim P_{\mathrm{train}}}[\omega^\star(X)\, h(X, Y)].$$

Proof.
By the law of total expectation under $P_{\mathrm{dep}}$,

$$\mathbb{E}_{P_{\mathrm{dep}}}[h(X, Y)] = \int_{\mathcal{X}} \sum_{y \in \{-1,+1\}} h(x, y)\, p_{\mathrm{dep}}(y \mid x)\, p_{\mathrm{dep}}(x)\, dx.$$

Using $p_{\mathrm{dep}}(x) = \omega^\star(x)\, p_{\mathrm{train}}(x)$, together with $p_{\mathrm{dep}}(y \mid x) = p_{\mathrm{train}}(y \mid x)$ from Assumption 1, we obtain

$$\mathbb{E}_{P_{\mathrm{dep}}}[h(X, Y)] = \int_{\mathcal{X}} \sum_{y \in \{-1,+1\}} h(x, y)\, \omega^\star(x)\, p_{\mathrm{train}}(y \mid x)\, p_{\mathrm{train}}(x)\, dx = \mathbb{E}_{P_{\mathrm{train}}}[\omega^\star(X)\, h(X, Y)].$$

Proposition 1. Under covariate shift, consider the expected misclassification rate under the deployment distribution $P_{\mathrm{dep}}$. For any fixed STL formula $\phi_\theta$, the misclassification rate under the deployment distribution can be written as

$$\mathbb{E}_{(X,Y) \sim P_{\mathrm{dep}}}\big[\mathbb{1}\{Y\rho_{\phi_\theta}(X) \le 0\}\big] = \mathbb{E}_{(X,Y) \sim P_{\mathrm{train}}}\big[\omega^\star(X)\, \mathbb{1}\{Y\rho_{\phi_\theta}(X) \le 0\}\big].$$

In practice, $\omega^\star(X)$ is replaced by its kNN-based approximation $\omega(X)$.

Proof. This proposition is a special case of Lemma 1 with $h(X, Y) = \mathbb{1}\{Y\rho_{\phi_\theta}(X) \le 0\}$.

Proposition 1 motivates a deployment-time learning objective that combines the nominal training loss on $D_{\mathrm{train}}$ with a weighted loss on $D_{\mathrm{dep}}$:

$$L_{\mathrm{train}}(\theta) = \frac{1}{|D_{\mathrm{train}}|}\sum_{(X,Y) \in D_{\mathrm{train}}} \ell\big(Y\rho_{\phi_\theta}(X)\big) + \frac{1}{|D_{\mathrm{dep}}|}\sum_{(X,Y) \in D_{\mathrm{dep}}} \omega(X)\, \ell\big(Y\rho_{\phi_\theta}(X)\big).$$

The first term preserves the nominal STL semantics learned from $D_{\mathrm{train}}$, while the second term emphasizes deployment-relevant trajectories through density-ratio weighting. However, these two terms alone are not sufficient. In practice, the estimated density-ratio weights can be highly non-uniform, so direct optimization of the weighted term may lead to unstable updates and may distort the robustness structure learned from $D_{\mathrm{train}}$ [20]. To stabilize refinement and further adapt the learned STL rule at the distribution level, we add a robustness-distribution regularizer based on the Jensen–Rényi divergence (JRD) [26].

b) Regularization of Robustness-Value Distributions via Jensen–Rényi Divergence: The JRD is adopted to compare the empirical robustness distributions induced on $D_{\mathrm{train}}$ and $D_{\mathrm{dep}}$.
As a symmetric divergence between probability measures, the JRD provides a distribution-level adaptation term that encourages the learned robustness values to remain aligned across the source and deployment-side data. Other distributional discrepancies, e.g., the Jensen–Shannon divergence [27], could also be used.

Definition 1. For a given parameter $\theta$, define the empirical robustness distributions induced by the training set and the deployment set as

$$\hat{P}^{\theta}_{\mathrm{train}} := \frac{1}{|D_{\mathrm{train}}|}\sum_{(X,Y) \in D_{\mathrm{train}}} \delta_{\rho_{\phi_\theta}(X)}, \qquad \hat{P}^{\theta}_{\mathrm{dep}} := \frac{1}{|D_{\mathrm{dep}}|}\sum_{X \in D_{\mathrm{dep}}} \delta_{\rho_{\phi_\theta}(X)}, \tag{13}$$

where $\delta_{\rho_{\phi_\theta}(X)}$ denotes the unit point mass located at the robustness value $\rho_{\phi_\theta}(X)$.

The following proposition shows that this regularizer yields an explicit bound on the resulting robustness-distribution mismatch.

Proposition 2. Consider the regularized training objective

$$L(\theta) = L_{\mathrm{train}}(\theta) + \lambda_{\mathrm{JRD}}\, \mathrm{JRD}\big(\hat{P}^{\theta}_{\mathrm{train}},\ \hat{P}^{\theta}_{\mathrm{dep}}\big),$$

where $\lambda_{\mathrm{JRD}} > 0$ is a regularization parameter controlling the strength of the robustness-distribution alignment penalty. Let $\theta^\star$ be a minimizer of $L$, and let $\theta_{\mathrm{train}}$ be a minimizer of the unregularized training loss $L_{\mathrm{train}}$. Then

$$\mathrm{JRD}\big(\hat{P}^{\theta^\star}_{\mathrm{train}},\ \hat{P}^{\theta^\star}_{\mathrm{dep}}\big) \le \frac{L_{\mathrm{train}}(\theta_{\mathrm{train}}) - L_{\mathrm{train}}(\theta^\star)}{\lambda_{\mathrm{JRD}}} + \mathrm{JRD}\big(\hat{P}^{\theta_{\mathrm{train}}}_{\mathrm{train}},\ \hat{P}^{\theta_{\mathrm{train}}}_{\mathrm{dep}}\big).$$

Proof. By optimality of $\theta^\star$, we have $L(\theta^\star) \le L(\theta_{\mathrm{train}})$. Rearranging terms yields

$$\lambda_{\mathrm{JRD}}\, \mathrm{JRD}\big(\hat{P}^{\theta^\star}_{\mathrm{train}},\ \hat{P}^{\theta^\star}_{\mathrm{dep}}\big) \le L_{\mathrm{train}}(\theta_{\mathrm{train}}) - L_{\mathrm{train}}(\theta^\star) + \lambda_{\mathrm{JRD}}\, \mathrm{JRD}\big(\hat{P}^{\theta_{\mathrm{train}}}_{\mathrm{train}},\ \hat{P}^{\theta_{\mathrm{train}}}_{\mathrm{dep}}\big),$$

which implies the stated bound.

Although the practical density-ratio weights are clipped to improve numerical stability, clipping alone does not directly control distribution-level drift in the induced robustness values.
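Both ingredients mentioned here, the clipped kNN density-ratio weights of Eqs. (10)-(12) and a divergence between the two robustness distributions, can be sketched in a few lines of NumPy. This is a rough illustration under stated assumptions rather than the authors' implementation: the embeddings $f(X)$ are assumed to be precomputed rows of `Z_train` and `Z_dep`, each deployment point is excluded from its own neighbor search, and the Jensen–Rényi divergence is estimated from shared-bin histograms with equal mixture weights and Rényi order 0.5. The text does not specify the estimator or the order, and a histogram estimate is not differentiable, so it serves only as a diagnostic here. All function names and default values are hypothetical.

```python
import numpy as np

def knn_radius(query, ref, k, skip_self=False):
    """Distance from each query embedding to its k-th nearest neighbor in ref."""
    dists = np.linalg.norm(query[:, None, :] - ref[None, :, :], axis=2)
    dists.sort(axis=1)
    # If query points are members of ref, column 0 is the zero self-distance.
    return dists[:, k if skip_self else k - 1]

def knn_density_ratio_weights(Z_train, Z_dep, k=5, w_min=0.1, w_max=10.0, eps=1e-12):
    """Clipped, mean-normalized kNN weights for the points of D_dep (Eqs. 10-12)."""
    p = Z_dep.shape[1]
    r_train = knn_radius(Z_dep, Z_train, k)                # r_k(X; D_train)
    r_dep = knn_radius(Z_dep, Z_dep, k, skip_self=True)    # r_k(X; D_dep)
    raw = (r_train / (r_dep + eps)) ** p                   # unnormalized ratio
    normalized = raw / raw.mean()                          # Eq. (11)
    return np.clip(normalized, w_min, w_max)               # Eq. (12)

def renyi_entropy(prob, alpha):
    """Renyi entropy of order alpha (alpha != 1) of a discrete distribution."""
    prob = prob[prob > 0]
    return np.log(np.sum(prob ** alpha)) / (1.0 - alpha)

def jrd_hist(rho_a, rho_b, alpha=0.5, bins=20):
    """Histogram-based Jensen-Renyi divergence between two robustness samples."""
    rho_a, rho_b = np.asarray(rho_a, float), np.asarray(rho_b, float)
    lo = min(rho_a.min(), rho_b.min())
    hi = max(rho_a.max(), rho_b.max())
    if hi == lo:                      # all robustness values coincide
        return 0.0
    edges = np.linspace(lo, hi, bins + 1)
    pa = np.histogram(rho_a, bins=edges)[0] / len(rho_a)
    pb = np.histogram(rho_b, bins=edges)[0] / len(rho_b)
    mix = 0.5 * (pa + pb)
    return renyi_entropy(mix, alpha) - 0.5 * (
        renyi_entropy(pa, alpha) + renyi_entropy(pb, alpha)
    )
```

For orders in (0, 1) the Rényi entropy is concave, so this estimate is non-negative and vanishes when the two histograms coincide. A gradient-based refinement would need a smooth estimator (for example a kernel-based one) in place of the histogram.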
We therefore refine the pretrained STL formula by minimizing the following unified shift-aware objective:

$$L(\theta) = \frac{1}{|D_{\mathrm{train}}|}\sum_{(X,Y) \in D_{\mathrm{train}}} \ell\big(Y\rho_{\phi_\theta}(X)\big) + \frac{1}{|D_{\mathrm{dep}}|}\sum_{(X,Y) \in D_{\mathrm{dep}}} \omega(X)\, \ell\big(Y\rho_{\phi_\theta}(X)\big) + \lambda_{\mathrm{JRD}}\, \mathrm{JRD}\big(\hat{P}^{\theta}_{\mathrm{train}},\ \hat{P}^{\theta}_{\mathrm{dep}}\big), \tag{14}$$

where $\omega(X)$ denotes the density-ratio weight, and $\lambda_{\mathrm{JRD}} > 0$ is a regularization parameter.

B. Weighted Conformalized STL Inference

This section provides correctness guarantees for the learned STL formulas using CP. The general nonconformity score in Section II-B is specialized to the robustness-based form for binary classification:

$$A(X, Y) = S_\theta(X, Y) := -Y\rho_{\phi_\theta}(X), \tag{15}$$

where larger values of $S_\theta$ indicate stronger disagreement with label $Y$. This choice is simple and compatible with our prior formulation [19]. More importantly, it yields a unified decision statistic: the same signed robustness margin governs STL training, binary classification, and CP.

Under covariate shift, calibration and deployment-time test samples are generally not exchangeable. We therefore adopt weighted CP [20], where the calibration nonconformity scores are reweighted according to the density-ratio weights introduced in Section IV-A. The nonconformity scores on $D_{\mathrm{cal}}$ are defined as

$$s_i := S_\theta(X_i, Y_i), \quad i = 1, \dots, n_{\mathrm{cal}}. \tag{16}$$

The weighted empirical cumulative distribution of the calibration nonconformity scores is defined as

$$\hat{F}_w(t) := \frac{\sum_{i=1}^{n_{\mathrm{cal}}} \omega(X_i)\, \mathbb{1}\{s_i \le t\}}{\sum_{i=1}^{n_{\mathrm{cal}}} \omega(X_i)}. \tag{17}$$

The weighted conformal threshold is then taken as the $(1-\alpha)$-quantile of this weighted empirical distribution:

$$T_{\mathrm{wcp}} := \inf\big\{ t \in \mathbb{R} : \hat{F}_w(t) \ge 1 - \alpha \big\}. \tag{18}$$

Accordingly, the weighted CP set is defined by

$$C(X) = \big\{ Y \in \{-1, +1\} : S_\theta(X, Y) \le T_{\mathrm{wcp}} \big\}. \tag{19}$$

Under Assumption 1, and using the ideal density-ratio weight $\omega^\star(X)$ defined in Section IV-A, the prediction set $C(X)$ is a direct specialization of the weighted conformal construction under covariate shift in [20] and inherits its coverage guarantee, i.e.,

$$P_{(X,Y) \sim P_{\mathrm{dep}}}\big(Y \in C(X)\big) \ge 1 - \alpha. \tag{20}$$

In practice, however, highly non-uniform weights may reduce the effective number of weighted samples and increase the variability of weighted empirical estimates. We quantify this effect using the effective sample size (ESS) [28]:

$$\mathrm{ESS}(D) = \frac{\big(\sum_{X \in D} \omega(X)\big)^2}{\sum_{X \in D} \omega(X)^2},$$

with the normalized form $\mathrm{ESS}(D)/|D| \in (0, 1]$. The following result formalizes the relation between weight concentration and the conditional variability of weighted empirical estimates.

Proposition 3. Let $\{Z_i\}_{i=1}^{n}$ be i.i.d. from a distribution $Q$ and let $\{w_i\}_{i=1}^{n}$ be nonnegative weights with normalized weights $\bar{w}_i := w_i / \sum_{j=1}^{n} w_j$. For any measurable $g$ with $g(z) \in [-1, 1]$ for all $z$, define

$$\hat{\mu}_w(g) := \sum_{i=1}^{n} \bar{w}_i\, g(Z_i).$$

Then, conditional on the weights $(w_1, \dots, w_n)$,

$$\mathrm{Var}\big(\hat{\mu}_w(g) \mid w_{1:n}\big) \le \sum_{i=1}^{n} \bar{w}_i^2 = \frac{1}{\mathrm{ESS}}.$$

This bound shows that the conditional variability of a weighted empirical estimate is controlled by $1/\mathrm{ESS}$. In particular, a smaller ESS yields a larger upper bound on the conditional variance, indicating reduced stability when the weights are highly concentrated.

Proof. The quantity $\hat{\mu}_w(g)$ is a weighted average of independent variables $g(Z_1), \dots, g(Z_n)$, each bounded in $[-1, 1]$. Conditional on the weights, the variance is therefore bounded by the sum of squared normalized weights. Since $|g| \le 1$, $\mathrm{Var}[g(Z_i)] \le 1$.
Independence yields

$$\mathrm{Var}\left[\sum_{i=1}^{n} \bar{w}_i\, g(Z_i)\right] = \sum_{i=1}^{n} \bar{w}_i^2\, \mathrm{Var}[g(Z_i)] \le \sum_{i=1}^{n} \bar{w}_i^2.$$

Finally,

$$\sum_{i=1}^{n} \bar{w}_i^2 = \frac{\sum_{i=1}^{n} w_i^2}{\big(\sum_{i=1}^{n} w_i\big)^2} = \frac{1}{\mathrm{ESS}}.$$

As a result, we use the ESS as a termination criterion for learning: it balances the risk of overfitting against maintaining the effective number of calibration samples for distributional alignment during the iterative shift-aware refinement procedure. In particular, let $\mathrm{ESS}^{(t)}$ denote the effective sample size at iteration $t$. The learning process in Section IV-A is terminated when

$$\left| \frac{\mathrm{ESS}^{(t)}}{n} - \frac{\mathrm{ESS}^{(t-1)}}{n} \right| \le \varepsilon,$$

where $n$ is the number of calibration samples and $\varepsilon > 0$ is a tolerance parameter. This criterion indicates that further updates no longer substantially change the effective distribution alignment.

V. EXPERIMENTAL RESULTS

We evaluate the proposed conformalized STL inference framework on three trajectory datasets with distinct distributional characteristics under covariate shift. All experiments were conducted on a Linux workstation equipped with an Intel Core i9-13900KF CPU (24 cores, 32 threads). The objectives of the experiments are threefold:
• To assess the classification performance of the refined STL classifier compared to nominal TLINet training;
• To evaluate the behavior of CP under distribution shift;
• To examine the efficiency-coverage tradeoff between standard CP and weighted CP.

[Fig. 2: Schematic diagram of the datasets. Panels: (a) VIMA simulated environment, (b) VIMA Dataset, (c) Naval Dataset, (d) Motion Planning Dataset.]

A. Evaluation Metrics

For a target miscoverage level $\alpha$, the conformal predictor aims to achieve a coverage level of $1 - \alpha$. In experiments we report the empirical coverage, defined as

$$\mathrm{Coverage} = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\{Y_i \in C(X_i)\}, \tag{21}$$

which estimates the probability that the prediction set contains the true label under the test distribution.
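The weighted calibration of Section IV-B, the ESS diagnostic with its stopping rule, and the empirical coverage of Eq. (21) can be sketched together as follows. This is a minimal illustration, assuming the calibration scores $s_i = -Y_i\rho_{\phi_\theta}(X_i)$ and the weights $\omega(X_i)$ have already been computed; the function names are hypothetical and do not come from the paper's code.

```python
import numpy as np

def weighted_threshold(scores, weights, alpha):
    """Eqs. (17)-(18): smallest score t whose weighted ECDF reaches 1 - alpha."""
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(scores)
    s, w = scores[order], weights[order]
    cdf = np.cumsum(w) / np.sum(w)
    idx = np.searchsorted(cdf, 1.0 - alpha, side="left")
    return s[min(idx, len(s) - 1)]   # guard against round-off in the last cdf entry

def prediction_set(rho, t_wcp):
    """Eq. (19): labels y in {-1, +1} whose score -y * rho is at most T_wcp."""
    return [y for y in (-1, 1) if -y * rho <= t_wcp]

def ess(weights):
    """Effective sample size of a weight vector."""
    weights = np.asarray(weights, dtype=float)
    return np.sum(weights) ** 2 / np.sum(weights ** 2)

def ess_converged(ess_prev, ess_curr, n, eps):
    """Stopping rule: the normalized ESS changed by at most eps between iterations."""
    return abs(ess_curr / n - ess_prev / n) <= eps

def empirical_coverage(rhos, labels, t_wcp):
    """Eq. (21): fraction of samples whose true label lies in the prediction set."""
    hits = [y in prediction_set(r, t_wcp) for r, y in zip(rhos, labels)]
    return float(np.mean(hits))

# Toy calibration scores with non-uniform weights (illustrative values only).
cal_scores = np.array([-2.0, -1.0, -0.5, 0.3])
cal_weights = np.array([1.0, 1.0, 2.0, 4.0])
t_weighted = weighted_threshold(cal_scores, cal_weights, alpha=0.25)
t_uniform = weighted_threshold(cal_scores, np.ones(4), alpha=0.25)
print(t_weighted, t_uniform, ess(cal_weights))
```

In this toy case the heavily weighted high-score sample pushes the weighted threshold to 0.3, whereas uniform weights give -0.5, and the ESS drops from 4 to roughly 2.9. A larger threshold admits more labels into each prediction set, which is how up-weighting deployment-relevant calibration points can trade set size for coverage.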
To quantify the efficiency of the prediction sets, we report the average prediction set size, referred to as inefficiency:

$$\mathrm{Inefficiency} = \frac{1}{n} \sum_{i=1}^{n} |C(X_i)|. \tag{22}$$

This metric measures the average size of the prediction sets, where smaller values correspond to more compact predictions. In the binary classification setting considered here, the prediction set $C(X)$ defined in Section II-B has cardinality in $\{0, 1, 2\}$. A singleton set corresponds to a confident prediction, a two-label set indicates ambiguity, and an empty set indicates that neither label satisfies the acceptance condition.

B. Datasets

Dataset 1: Naval Surveillance. The Naval Surveillance dataset consists of labeled time-series trajectories derived from a marine propulsion system benchmark. Each trajectory is a multivariate signal $X \in \mathbb{R}^{d \times T}$, and the task is binary classification of normal versus anomalous operational behavior.

Dataset 2: Place a block into a basket without violating the safety constraint. This task requires the robot to transport a target block into a designated basket region while keeping the trajectory within the basket boundary marked by tape.

C. Baseline Comparison

We compare our framework with several baseline classifiers on the considered trajectory datasets: TLINet, a differentiable STL inference model; LSTM/RNN, a recurrent neural-network baseline for sequence classification [29]; BCDT, a boosted classification-tree method [30], [31]; DT, a standard decision tree [32]; and DAG, a directed-acyclic-graph-based temporal classifier [33], [34]. These methods serve as reference baselines for classification performance. Neural baselines such as LSTM can be further adapted with additional data, whereas the remaining non-neural baselines do not naturally admit the same gradient-based refinement mechanism as our framework and typically require retraining instead.

Table I reports the MCR, training time, and the learned STL formulas for all compared methods. Across the evaluated datasets, models trained only on source-side data degrade under the deployment distribution, which is consistent with covariate shift. Our framework reduces this degradation by refining the initially trained STL formula with deployment-side data and remains close to the oracle TLINet trained directly on $P_{\mathrm{dep}}$. Compared with the neural and tree-based baselines, it achieves competitive or better classification performance while retaining an explicit STL formula. The reported training times show that the additional refinement stage remains computationally practical.

TABLE I: Classification results of our method and baseline methods on trajectory datasets.

| Dataset | Method | MCR for $P_{\mathrm{train}}$ | MCR for $P_{\mathrm{dep}}$ | Time (s) | STL formula |
|---|---|---|---|---|---|
| Naval | Ours | 1.25 | 0.07 | 16 | $\Diamond_{[25,26]}(x \le 38.6) \wedge \Box_{[11,12]}(y \ge 24.1)$ |
| Naval | TLINet | 1.25 | 0.05 | 38 | $\Diamond_{[55,60]}(x < 25.89) \wedge \Box_{[0,16]}(y > 23.77)$ |
| Naval | LSTM/RNN | 0.01 | 0.025 | 19 | N/A |
| Naval | BCDT | 0.0100 | N/R | 1996 | $\Diamond_{[28,53]}(x < 30.85) \wedge \Box_{[2,26]}(y > 21.31) \wedge (x > 11.10)$ |
| Naval | DT | 0.0195 | N/R | 140 | $\neg(\Diamond_{[38,53]}(x > 20.1) \wedge \Diamond_{[12,37]}(x > 43.2)) \vee (\Diamond_{[38,53]}(x > 20.1) \wedge \neg\Diamond_{[20,59]}(y > 32.2)) \vee (\Diamond_{[38,53]}(x > 20.1) \wedge \Diamond_{[20,59]}(y > 32.2) \wedge \Box_{[14,60]}(y > 30.1))$ |
| Naval | DAG | 0.0885 | N/R | 996 | $\Diamond_{[0,33]}(\Box_{[18,23]}(y > 19.88) \wedge \Box_{[9,30]}(x < 34.08))$ |
| VIMA | Ours | 8.55 | 0.015 | 15 | $\Box_{[17,19]}((88.62 < x < 152.21) \wedge (45.47 < y < 89.73))$ |
| VIMA | TLINet | 8.55 | 0.0125 | 45 | $\Box_{[17,19]}((89.86 < x < 147.14) \wedge (59.64 < y < 89.97))$ |
| VIMA | LSTM/RNN | 8.30 | 0.08 | 23 | N/A |

Fig. 3: Weighted CP with covariate shift.

D. Ablation Study

a) STL Classification: Table I isolates the effect of the shift-aware refinement stage. Training only on $D_{\mathrm{train}}$ leads to degraded performance under $P_{\mathrm{dep}}$, while incorporating deployment-side data $D_{\mathrm{dep}}$ substantially reduces the MCR across datasets.
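The ablations in this subsection rely on density-ratio-weighted calibration and on the ESS-based stopping rule from Section IV. As a minimal sketch of those two ingredients, the code below computes the ESS as $(\sum_i w_i)^2 / \sum_i w_i^2$, checks the tolerance criterion on the normalized ESS, and evaluates a weighted conformal quantile in the standard form of weighted split CP [20], with the test point's mass placed at $+\infty$. It assumes generic nonconformity scores and already-estimated likelihood-ratio weights; the paper's robustness-based score and its density-ratio estimator are not reproduced, and all function and variable names are hypothetical.

```python
import math
from typing import Sequence


def effective_sample_size(weights: Sequence[float]) -> float:
    """ESS = (sum w)^2 / sum w^2, the reciprocal of the sum of squared normalized weights."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    if total <= 0.0 or total_sq <= 0.0:
        raise ValueError("weights must be non-negative with a positive sum")
    return total * total / total_sq


def ess_converged(ess_prev: float, ess_curr: float, n: int, eps: float) -> bool:
    """Stopping test |ESS(t)/n - ESS(t-1)/n| <= eps on the normalized ESS."""
    return abs(ess_curr / n - ess_prev / n) <= eps


def weighted_conformal_quantile(scores: Sequence[float],
                                cal_weights: Sequence[float],
                                test_weight: float,
                                alpha: float) -> float:
    """Level-(1 - alpha) quantile of the weighted calibration scores.

    Calibration point i carries mass w_i / (sum_j w_j + w_test); the remaining
    mass w_test / (sum_j w_j + w_test) sits at +infinity (assumed standard form).
    """
    if len(scores) != len(cal_weights):
        raise ValueError("scores and cal_weights must have equal length")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    total = sum(cal_weights) + test_weight
    cumulative = 0.0
    for score, weight in sorted(zip(scores, cal_weights)):
        cumulative += weight / total
        if cumulative >= 1.0 - alpha:
            return score
    return math.inf  # not enough calibration mass: the threshold is vacuous


# Hypothetical toy example: uniform weights recover unweighted split CP.
scores = [0.1, 0.4, 0.2, 0.9]
weights = [1.0, 1.0, 1.0, 1.0]
print(effective_sample_size(weights))                         # 4.0
print(weighted_conformal_quantile(scores, weights, 1.0, 0.5)) # 0.4
print(ess_converged(4.0, 3.9, n=4, eps=0.05))                 # True
```

With uniform weights the ESS equals the number of calibration samples, and it shrinks as the weights concentrate on a few points, which is why a stabilizing ESS is a reasonable signal that further refinement is no longer changing the effective alignment.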
b) Weighted Conformal Prediction under Covariate Shift: Figure 3 compares weighted CP applied to the refined STL classifier with the same procedure applied to a model trained without shift-aware retraining. In both cases, density-ratio weighting is used during calibration.

On the Naval dataset, the refined model yields substantially smaller prediction set sizes across all miscoverage levels. The gap between the two curves remains consistent, indicating that shift-aware retraining improves the alignment between the learned robustness scores and the deployment distribution. As a result, fewer candidate labels are required to maintain the same coverage level.

A similar trend is observed on VIMA. Although the difference is less pronounced than on Naval, the refined model consistently achieves lower inefficiency across the range of target miscoverage levels. This suggests that incorporating $D_{\mathrm{dep}}$ during training reduces the impact of marginal distribution mismatch on the calibration score distribution.

Overall, these results indicate that shift-aware refinement enhances the effectiveness of weighted CP by improving the distributional alignment of robustness scores under deployment conditions.

Fig. 4: Comparison across datasets with weighted CP. (a) Naval dataset; (b) Motion Planning dataset.

c) Effect of Shift-aware Refinement on Conformal Prediction: Figure 4 further examines how shift-aware refinement affects CP. On both datasets, the refined model yields prediction sets whose average size is closer to one at comparable target coverage levels, indicating better alignment between the learned robustness scores and the deployment distribution. At the same time, empirical coverage remains closer to the target level across settings.

Fig. 5: Comparison between weighted CP and standard CP using our method on different datasets.

d) Weighted vs. Standard Conformal Prediction: Figure 5 compares weighted CP and standard CP in terms of empirical coverage–inefficiency trade-offs on the Naval and VIMA datasets. Across both datasets, the two methods yield closely matched curves, indicating that density-ratio reweighting has only a minor effect on the conformal quantile in the current setting. This suggests that the calibration and test score distributions are already reasonably aligned after refinement.

VI. CONCLUSION

We studied STL-based trajectory classification under covariate shift and proposed a shift-aware STL inference framework that refines a learned STL formula using deployment-time data. The method combines TLINet pretraining with a refinement stage that improves the alignment between the learned robustness scores and the deployment distribution. In addition, CP is incorporated to quantify prediction uncertainty while maintaining statistical coverage guarantees. Experimental results on multiple trajectory datasets show that the proposed framework improves classification performance under distribution shift while preserving the interpretability of STL-based decision rules. Future work includes extending the approach to multi-class STL inference and investigating more advanced shift-adaptation strategies.

REFERENCES

[1] E. Asarin, A. Donzé, O. Maler, and D. Nickovic, “Parametric identification of temporal properties,” in International Conference on Runtime Verification. Springer, 2011, pp. 147–160.
[2] B. Hoxha, A. Dokhanchi, and G. Fainekos, “Mining parametric temporal logic properties in model-based design for cyber-physical systems,” International Journal on Software Tools for Technology Transfer, vol. 20, no. 1, pp. 79–93, 2018.
[3] E. Bartocci, C. Mateis, E. Nesterini, and D. Nickovic, “Survey on mining signal temporal logic specifications,” Information and Computation, vol. 289, p. 104957, 2022.
[4] G. Bombara and C. Belta, “Offline and online learning of signal temporal logic formulae using decision trees,” ACM Transactions on Cyber-Physical Systems, vol. 5, no. 3, pp. 1–23, 2021.
[5] C. Yoo and C. Belta, “Rich time series classification using temporal logic,” in Robotics: Science and Systems, 2017.
[6] K. Leung, N. Aréchiga, and M. Pavone, “Backpropagation through signal temporal logic specifications: Infusing logical structure into gradient-based methods,” The International Journal of Robotics Research, vol. 42, no. 6, pp. 356–370, 2023.
[7] D. Li, M. Cai, C.-I. Vasile, and R. Tron, “Tlinet: Differentiable neural network temporal logic inference,” 2024. [Online]. Available: https://arxiv.org/abs/2405.06670
[8] D. Li, M. Cai, C.-I. Vasile, and R. Tron, “Learning signal temporal logic through neural network for interpretable classification,” in 2023 American Control Conference (ACC). IEEE, 2023, pp. 1907–1914.
[9] R. Yan, Z. Xu, and A. Julius, “Swarm signal temporal logic inference for swarm behavior analysis,” IEEE Robotics and Automation Letters, vol. 4, no. 3, pp. 3021–3028, 2019.
[10] V. Vovk, A. Gammerman, and G. Shafer, Algorithmic learning in a random world. Springer, 2005.
[11] A. N. Angelopoulos and S. Bates, “A gentle introduction to conformal prediction and distribution-free uncertainty quantification,” arXiv preprint arXiv:2107.07511, 2021.
[12] H. Papadopoulos, Inductive conformal prediction: Theory and application to neural networks. INTECH Open Access Publisher Rijeka, 2008.
[13] J. Lei, M. G’Sell, A. Rinaldo, R. J. Tibshirani, and L. Wasserman, “Distribution-free predictive inference for regression,” Journal of the American Statistical Association, vol. 113, no. 523, pp. 1094–1111, 2018.
[14] Y. Romano, E. Patterson, and E. Candes, “Conformalized quantile regression,” Advances in Neural Information Processing Systems, vol. 32, 2019.
[15] A. Dixit, L. Lindemann, S. X. Wei, M. Cleaveland, G. J. Pappas, and J. W. Burdick, “Adaptive conformal prediction for motion planning among dynamic agents,” in Learning for Dynamics and Control Conference. PMLR, 2023, pp. 300–314.
[16] K. Liang, L. Luo, Y. Wang, M. Cai, and C. I. Vasile, “Time-aware motion planning in dynamic environments with conformal prediction,” Learning for Decision and Control (L4DC), 2026.
[17] L. Lindemann, M. Cleaveland, G. Shim, and G. J. Pappas, “Safe planning in dynamic environments using conformal prediction,” IEEE Robotics and Automation Letters, vol. 8, no. 8, pp. 5116–5123, 2023.
[18] E. Soroka, R. Sinha, and S. Lall, “Learning temporal logic predicates from data with statistical guarantees,” arXiv preprint arXiv:2406.10449, 2024.
[19] D. Li, Y. Wang, M. Cleaveland, M. Cai, and R. Tron, “Conformal prediction for signal temporal logic inference,” ArXiv, vol. abs/2509.25473, 2025. [Online]. Available: https://api.semanticscholar.org/CorpusID:281682043
[20] R. J. Tibshirani, R. Foygel Barber, E. Candes, and A. Ramdas, “Conformal prediction under covariate shift,” Advances in Neural Information Processing Systems, vol. 32, 2019.
[21] H. Shimodaira, “Improving predictive inference under covariate shift by weighting the log-likelihood function,” Journal of Statistical Planning and Inference, vol. 90, no. 2, pp. 227–244, 2000.
[22] D. O. Loftsgaarden and C. P. Quesenberry, “A nonparametric estimate of a multivariate density function,” The Annals of Mathematical Statistics, vol. 36, no. 3, pp. 1049–1051, 1965.
[23] A. B. Hamza, “Jensen-Rényi divergence measure: Theoretical and computational perspectives,” in IEEE Int. Symp. Inf. Theory, 2003.
[24] P. Zhao and L. Lai, “Analysis of knn density estimation,” IEEE Transactions on Information Theory, vol. 68, no. 12, pp. 7971–7995, 2022.
[25] M. Sugiyama, T. Suzuki, and T. Kanamori, Density ratio estimation in machine learning. Cambridge University Press, 2012.
[26] L. G. S. Giraldo and J. C. Principe, “Information theoretic learning with infinitely divisible kernels,” arXiv preprint arXiv:1301.3551, 2013.
[27] C. Shui, Q. Chen, J. Wen, F. Zhou, C. Gagné, and B. Wang, “A novel domain adaptation theory with Jensen–Shannon divergence,” Knowledge-Based Systems, vol. 257, p. 109808, 2022.
[28] J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence, Covariate Shift by Kernel Mean Matching, 2009, pp. 131–160.
[29] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[30] J. H. Friedman, “Greedy function approximation: a gradient boosting machine,” Annals of Statistics, pp. 1189–1232, 2001.
[31] E. Aasi, C. I. Vasile, M. Bahreinian, and C. Belta, “Classification of time-series data using boosted decision trees,” in 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2022, pp. 1263–1268.
[32] L. Breiman, J. Friedman, R. A. Olshen, and C. J. Stone, Classification and regression trees. Chapman and Hall/CRC, 2017.
[33] J. Platt, N. Cristianini, and J. Shawe-Taylor, “Large margin DAGs for multiclass classification,” Advances in Neural Information Processing Systems, vol. 12, 1999.
[34] Z. Kong, A. Jones, and C. Belta, “Temporal logics for learning and detection of anomalous behavior,” IEEE Transactions on Automatic Control, vol. 62, no. 3, pp. 1210–1222, 2016.
