From High-Level Requirements to KPIs: Conformal Signal Temporal Logic Learning for Wireless Communications
Authors: Jiechen Chen, Michele Polese, Osvaldo Simeone
Jiechen Chen, Member, IEEE, Michele Polese, Member, IEEE, and Osvaldo Simeone, Fellow, IEEE

Abstract—Softwarized radio access networks (RANs), such as those based on the Open RAN (O-RAN) architecture, generate rich streams of key performance indicators (KPIs) that can be leveraged to extract actionable intelligence for network optimization. However, bridging the gap between low-level KPI measurements and high-level requirements, such as quality of experience (QoE), requires methods that are both relevant, capturing temporal patterns predictive of user-level outcomes, and interpretable, providing human-readable insights that operators can validate and act upon. This paper introduces conformal signal temporal logic learning (C-STLL), a framework that addresses both requirements. C-STLL leverages signal temporal logic (STL), a formal language for specifying temporal properties of time series, to learn interpretable formulas that distinguish KPI traces satisfying high-level requirements from those that do not. To ensure reliability, C-STLL wraps around existing STL learning algorithms with a conformal calibration procedure based on the Learn Then Test (LTT) framework. This procedure produces a set of STL formulas with formal guarantees: with high probability, the set contains at least one formula achieving a user-specified accuracy level. The calibration jointly optimizes for reliability, formula complexity, and diversity through principled acceptance and stopping rules validated via multiple hypothesis testing. Experiments using the ns-3 network simulator on a mobile gaming scenario demonstrate that C-STLL effectively controls risk below target levels while returning compact, diverse sets of interpretable temporal specifications that relate KPI behavior to QoE outcomes.
Index Terms—Radio access networks, signal temporal logic, signal temporal logic learning, learn then test.

I. INTRODUCTION

A. Motivation

The wealth of data produced by modern wireless networks creates a unique opportunity to extract actionable intelligence that can drive network optimization, anticipate failures, and ensure quality of service guarantees [1]. In particular, network controllers in disaggregated architectures like O-RAN can aggregate rich streams of key performance indicators (KPIs) at the radio access network (RAN), including throughput, latency, packet loss, and signal quality [2–4].

J. Chen is with the Department of Engineering, King's College London, London, WC2R 2LS, UK (email: jiechen.chen@kcl.ac.uk). M. Polese is with the Institute for Intelligent Networked Systems, Northeastern University, Boston, MA 02115 USA (email: m.polese@northeastern.edu). O. Simeone is with the Institute for Intelligent Networked Systems, Northeastern University London, One Portsoken Street, London, E1 8PH, UK (email: o.simeone@northeastern.edu). This work was supported by the European Research Council (ERC) under the European Union's Horizon Europe Programme (grant agreement No. 101198347), by an Open Fellowship of the EPSRC (EP/W024101/1), by the EPSRC project (EP/X011852/1), and by the U.S. NSF under grant TI-2449452.

Fig. 1. From high-level requirements to KPI requirements: The radio access network (RAN) collects traces of KPIs such as throughput and latency over time. High-level QoE evaluations are provided by end users. The goal of this work is to infer interpretable properties of KPI traces that are predictive of whether high-level requirements are satisfied or not. These properties are expressed in the formal language of signal temporal logic (STL), which provides tools to describe temporal constraints on time series.
Beyond relevance, intelligence extracted from the RAN must also be interpretable [5–7], allowing network operators to understand why a decision was made, either to validate decisions against domain expertise or to satisfy regulatory requirements. However, extracting interpretable intelligence from RAN traces presents a challenge given the gap between the low-level KPI measurements readily available at the RAN and the high-level behavioral requirements that matter to network operators and end users (see Fig. 1). Explanations about QoE outcomes must be constructed from the only available data, namely low-level RAN KPI trace logs. This necessitates the development of novel interpretable mechanisms to bridge the semantic gap between QoE targets and RAN KPI traces.

Signal Temporal Logic (STL) offers a principled framework for expressing and learning interpretable temporal properties of time-series data [8, 9]. An STL formula such as

□[0,∞]((latency < 100 ms) ∧ ♢[0,5](throughput > 50 Mbps)) (1)

provides a human-readable specification that directly relates KPI behavior to requirements. For example, the STL formula in (1) states that latency must always remain below 100 ms and (denoted as ∧) throughput must exceed 50 Mbps at least once within every 5-TTI window. Unlike feature attributions from standard explainable AI methods such as SHAP or LIME [5], STL formulas constitute formal specifications that can be verified, composed, and directly translated into network policies.

That said, existing STL learning (STLL) methods, while effective at extracting candidate formulas from labeled data, provide no formal guarantees on the reliability of learned specifications [10, 11]. A formula that performs well on training data may fail to generalize, potentially leading to costly misclassifications in deployment.
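For concreteness, the property in (1) can be checked on a finite discrete-time trace with a few lines of code. This is only an illustrative sketch: the function name, the KPI values, and the decision to let trailing windows be shorter than 5 TTIs are our own choices, not part of the STL formalism.

```python
# Illustrative Boolean check of the STL formula in (1) on a discrete KPI
# trace: "latency always below 100 ms, and throughput above 50 Mbps at
# least once within every 5-TTI window". All names/values are hypothetical.

def satisfies_example_formula(latency, throughput, window=5):
    """Return True if the trace satisfies the formula in (1)."""
    # "always" clause: latency < 100 ms at every TTI
    always_latency = all(l < 100.0 for l in latency)
    # "eventually" clause: throughput > 50 Mbps somewhere in every
    # sliding window (trailing windows may be shorter than `window`)
    eventually_tp = all(
        any(tp > 50.0 for tp in throughput[t:t + window])
        for t in range(len(throughput))
    )
    return always_latency and eventually_tp

good = satisfies_example_formula(
    latency=[20, 35, 40, 25, 30, 22],
    throughput=[60, 10, 12, 14, 55, 70],
)
bad = satisfies_example_formula(
    latency=[20, 120, 40, 25, 30, 22],   # latency spike violates the [] clause
    throughput=[60, 10, 12, 14, 55, 70],
)
```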
This limitation is particularly concerning in wireless communications, where network decisions based on unreliable predictions can degrade user experience or violate service level agreements (SLAs).

B. Related Work

1) Signal temporal logic: STL enables precise specification of temporal properties [8], and the quantitative semantics of STL, known as robustness, provide a real-valued measure indicating the degree to which a signal satisfies a formula [9, 11, 12]. This quantitative interpretation enables gradient-based optimization and serves as a natural loss function for learning [10]. While STL learning has been successfully applied in cyber-physical systems, robotics, and autonomous vehicles [13], its application to wireless communications remains largely unexplored. In particular, recent work has begun to explore STL formalisms for network verification and monitoring [14].

2) Explainable AI for wireless communications: The need for explainable AI (XAI) in wireless networks has been increasingly recognized as ML models are deployed for critical network functions [5, 6, 15, 16]. Several frameworks have been developed to provide interpretability specifically for AI-driven O-RAN control. These include EXPLORA [7], which uses attributed graphs to link DRL-based agent actions to wireless network context; XAI-on-RAN, which integrates GPU-accelerated explainability techniques for real-time operation [17]; as well as work on RAN slicing and resource allocation [18], anomaly detection using interpretable autoencoders [19], and energy efficiency optimization [20]. However, existing XAI methods for wireless predominantly rely on statistical attribution, and the application of formal specification languages to wireless network intelligence remains an open research direction.
3) Conformal prediction and reliability guarantees: Conformal prediction is a distribution-free framework for constructing prediction sets with guaranteed coverage [21, 22]. It has recently been applied to wireless communications for tasks including demodulation and neuromorphic wireless edge intelligence [23–26]. Reference [27] studied the problem of predictive monitoring based on STL and conformal prediction. The Learn Then Test (LTT) framework recasts risk control as multiple hypothesis testing, enabling control of general risk functions beyond miscoverage [28].

4) QoE prediction: Predicting user QoE from network-level measurements [29] was addressed in reference [30] for voice-over-IP (VoIP) systems, while reference [31] focused on predicting future virtual reality QoE levels in O-RAN networks. These approaches inherit the limitations of black-box models in terms of interpretability and statistical validity.

C. Main Contributions

This paper introduces conformal STL learning (C-STLL), a novel framework for extracting interpretable, reliability-guaranteed temporal specifications from wireless KPI traces. The main contributions are as follows:

1) STL learning for wireless communications: We demonstrate that STL provides a natural and effective formalism for expressing temporal properties of KPI traces that are predictive of high-level QoE requirements. Unlike black-box ML models, learned STL formulas are human-readable and verifiable.

2) Conformal calibration for reliable specification learning: We develop C-STLL, a calibration methodology that wraps around any STLL algorithm to produce a set of STL formulas with formal reliability guarantees (see Fig. 2). Specifically, C-STLL ensures that, with high probability, the returned set contains at least one formula achieving a target accuracy level.
This is accomplished through a novel combination of sequential STL formula generation, acceptance rules based on complexity and diversity criteria, and multiple hypothesis testing via the Learn Then Test framework [28, 32, 33].

3) Experimental validation: We evaluate C-STLL using the ns-3 network simulator, modeling a mobile gaming scenario with realistic traffic patterns and network conditions. Experiments demonstrate that C-STLL successfully controls risk below specified tolerance levels while producing compact, diverse sets of interpretable STL formulas. Comparisons with ablated variants of the proposed framework confirm the importance of each of its components.

The remainder of this paper is organized as follows. Section II reviews the syntax and semantics of STL. Section III formulates the STL learning problem and describes existing solution approaches. Section IV presents the proposed C-STLL framework, including the sequential generation procedure and the calibration methodology based on multiple hypothesis testing. Section V provides experimental results, and Section VI concludes the paper.

II. BACKGROUND: SIGNAL TEMPORAL LOGIC

Signal Temporal Logic (STL) is a formal language for specifying and analyzing the temporal properties of time-series data. In this section, we present the standard STL syntax together with quantitative metrics used to evaluate the degree of satisfaction of given STL properties.

A. STL Formulas

Let X = [x_0, x_1, . . . , x_{T−1}] be a finite discrete-time trajectory with x_t = [x_{1,t}, . . . , x_{d,t}] ∈ R^d denoting the system state at time t, where the integer d is the dimension of the state. In wireless communications, each vector x_t may represent a collection of KPIs extracted over transmission time intervals (TTIs) indexed as t = 0, 1, . . .
, e.g., one entry x_{i,t} of vector x_t may report the throughput of some user and another entry, x_{j,t}, the latency of the user.

Fig. 2. Illustration of the proposed conformal STL learning (C-STLL) scheme. As illustrated in Fig. 1, KPI traces are labeled as Y = 1 or Y = −1 based on whether they correspond to settings in which higher-level requirements are satisfied or not. Via any STLL algorithm [10], the dataset D of labeled KPI traces can be leveraged to produce an STL formula ϕ, providing interpretable conditions on the KPI time series that are consistent with positive labels. However, this approach provides no guarantees on the reliability and accuracy of the formula ϕ. The proposed C-STLL wraps around any STLL procedure to produce a set of STL formulas with reliability and quality assurance. Through a sequential application of the STLL algorithm, each newly learned formula is either accepted for inclusion in the set or rejected, and the process is continued or stopped, based on calibrated threshold-based procedures. In this example, formula ϕ_2 is rejected due to its high complexity, while ϕ_3 is rejected for being too similar to ϕ_1, which is already in the set.

An STL formula ϕ specifies a property that a time series X may or may not exhibit. Specifically, one writes

(X, t) |= ϕ, (2)

if sequence X satisfies property ϕ from time t onward, so the symbol |= reads "satisfies". For example, a sequence of KPIs, X, corresponding to nominal behavior of a radio access network may satisfy the property ϕ that the latency of a certain group of users does not drop below 1 ms for more than a given number of TTIs starting from some TTI t. STL properties ϕ are built from atomic predicates of the form

μ ::= a⊤x > b, (3)

with a ∈ R^d and b ∈ R, where the symbol ::= is to be read as "is defined as".
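In code, checking an atomic predicate of the form (3) on a single state vector amounts to one inner product. The KPI layout and the values of a, x, and b below are illustrative assumptions.

```python
# Atomic predicate (3): mu ::= a^T x > b, evaluated on one state vector x_t.
# The state layout and numbers are hypothetical.

def predicate_holds(a, x, b):
    """Return True if a^T x > b for the state vector x."""
    return sum(ai * xi for ai, xi in zip(a, x)) > b

# Two KPIs per state: [throughput (Mbps), latency (ms)]. A one-hot weight
# vector a selects the throughput entry and requires it to exceed 50 Mbps.
holds = predicate_holds(a=[1.0, 0.0], x=[55.0, 3.0], b=50.0)
```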
Specifically, the basic STL property (X, t) |= μ is equivalent to the condition a⊤x_t > b, requiring the predicate (3) to be valid at time t. As we will discuss below, more complex properties can be built from basic predicates of the form (3). For example, consider the property that the throughput x_{i,t} of a given user must be larger than a threshold b = 50 Mbit per second (Mbps). Using a "one-hot" vector a with a single one in position i and all other entries zero, denoted as e_i, this can be expressed as the STL property (X, t) |= (e_i⊤x = x_i > 50 Mbps).

More generally, STL formulas can be constructed by leveraging the eventually operator ♢[t1,t2], the always operator □[t1,t2], as well as logical operations such as AND and OR. In particular, using the eventually operator, the condition (X, t) |= ϕ with the STL property

ϕ ::= ♢[t1,t2] μ (4)

is true if the condition μ holds at some time t′ ∈ [t + t1, t + t2], i.e., in an interval of the form [t1, t2] with time measured starting from index t. In a similar way, using the always operator, the condition (X, t) |= ϕ with the STL property

ϕ ::= □[t1,t2] μ (5)

is valid if the property μ holds for all t′ ∈ [t + t1, t + t2]. For instance, continuing the wireless example, the STL property (X, 0) |= □[0,∞](x_i > 50 Mbps) is true if the throughput x_{i,t′} is no smaller than 50 Mbps for all TTIs starting at time index 0. In broadest generality, formulas such as (4) and (5) can be composed recursively, making use also of logical operations, namely AND (∧) and OR (∨). Specifically, one can recursively define an STL property via the rule

ϕ ::= μ | ∧_{i=1}^n ϕ_i | ∨_{i=1}^n ϕ_i | ♢[t1,t2] ϕ | □[t1,t2] ϕ, (6)

where μ is an atomic predicate (3), and each step of the recursion applies one of the options in (6), which are separated by a bar "|" following a standard convention [34].
For instance, consider the property for a given user that the throughput x_{i,t} must be larger than 50 Mbit per second (Mbps) at every TTI, and that the latency x_{j,t} must be smaller than 1 ms in at least one of every 5 consecutive TTIs, starting at TTI t = 0. The STL property

(X, 0) |= □[0,∞)(x_i > 50 Mbps ∧ ♢[0,5](x_j ≤ 1 ms)) (7)

is true if, for each TTI t starting from t = 0, the throughput is larger than 50 Mbps and the latency is no larger than 1 ms in at least one of the 5 subsequent TTIs.

B. Robustness

STL is equipped with a robustness score, which assigns to each formula ϕ and signal X a real-valued function ρ(X, ϕ, t) ∈ R that measures the degree to which ϕ is satisfied when the evaluation begins at time t [8]. By definition of the robustness measure, a signal satisfies the formula ϕ from time t if and only if its robustness is positive, i.e.,

(X, t) |= ϕ ⟺ ρ(X, ϕ, t) > 0, (8)

and larger robustness values ρ(X, ϕ, t) indicate stronger satisfaction. For an atomic predicate μ in (3), the robustness is defined as

ρ(X, μ, t) = a⊤x_t − b. (9)

Accordingly, the robustness is positive if a⊤x_t > b. When considering a general STL formula constructed via the rule (6), the robustness can be obtained by using the following rules. The robustness of the logical AND of multiple properties, ϕ = ∧_{i=1}^n ϕ_i, is the minimum

ρ(X, ∧_{i=1}^n ϕ_i, t) = min_{i=1,...,n} ρ(X, ϕ_i, t), (10)

since all properties {ϕ_i}_{i=1}^n must be satisfied; while for the logical-OR formula ϕ = ∨_{i=1}^n ϕ_i, the robustness is the maximum

ρ(X, ∨_{i=1}^n ϕ_i, t) = max_{i=1,...,n} ρ(X, ϕ_i, t), (11)

since the validity of any property ϕ_i is sufficient for the validity of the entire property ϕ. Finally, in a similar way, for the eventually operator, we have

ρ(X, ♢[t1,t2] ϕ, t) = max_{t′ ∈ [t+t1, t+t2]} ρ(X, ϕ, t′), (12)

and for the always operator the robustness is

ρ(X, □[t1,t2] ϕ, t) = min_{t′ ∈ [t+t1, t+t2]} ρ(X, ϕ, t′).
(13)

III. LEARNING STL PROPERTIES

In this section, we discuss the problem of learning, from labeled traces, an interpretable prediction rule based on STL by following [10]. The goal is to infer a formal property, in the form of an STL formula ϕ, that lower-level traces X satisfying higher-level requirements are likely to meet.

A. Formulating STL Learning

Consider a system in which traces X can be collected that describe the evolution of the variables of interest, here the KPIs of a RAN. To each trace X, an end user can potentially assign a binary label Y ∈ {1, −1} indicating whether trajectory X satisfies given higher-level requirements, Y = 1, or violates them, Y = −1. Henceforth, we will refer to a sequence X labeled Y = 1 as positive, and to a sequence X with Y = −1 as negative. As in Fig. 1, the higher-level requirements may refer to the QoE of end users when interacting with certain applications over the network.

We are interested in the problem of STL learning (STLL), where the goal is to infer an STL formula ϕ that can effectively distinguish between traces with positive and negative labels. Specifically, we would like to identify an STL formula ϕ that is likely to be satisfied by traces X with positive labels, i.e., (X, 0) |= ϕ if Y = 1, while being violated by traces with negative labels, i.e., (X, 0) ⊭ ϕ if Y = −1. Note that the starting time is chosen as t = 0 by convention. Accordingly, the classification decision Ŷ associated with an STL property ϕ is given by

Ŷ = { 1, if (X, 0) |= ϕ; −1, if (X, 0) ⊭ ϕ, (14)

or equivalently in terms of robustness

Ŷ = { 1, if ρ(X, ϕ, 0) > 0; −1, if ρ(X, ϕ, 0) < 0. (15)

Accordingly, the robustness ρ(X, ϕ, 0) serves as a logit in conventional binary classification. The STLL problem is formulated as a supervised learning task.
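The robustness semantics (9)-(13) and the sign-based classification rule (15) can be implemented with a small recursive evaluator. The sketch below uses nested tuples as a hypothetical formula representation; it is not the parameterized template machinery of [10], only a direct transcription of the recursive rules.

```python
# Minimal recursive implementation of the quantitative robustness
# semantics (9)-(13), with formulas encoded as nested tuples (our own
# illustrative representation, not the one used in [10]).

def robustness(X, phi, t=0):
    """rho(X, phi, t) for formulas built from the grammar in (6).

    X is a list of state vectors; phi is one of:
      ('mu', a, b)                -> a^T x_t - b               (9)
      ('and', phi_1, ..., phi_n)  -> min over sub-formulas     (10)
      ('or',  phi_1, ..., phi_n)  -> max over sub-formulas     (11)
      ('eventually', t1, t2, phi) -> max over [t+t1, t+t2]     (12)
      ('always',     t1, t2, phi) -> min over [t+t1, t+t2]     (13)
    """
    op = phi[0]
    if op == 'mu':
        _, a, b = phi
        return sum(ai * xi for ai, xi in zip(a, X[t])) - b
    if op == 'and':
        return min(robustness(X, sub, t) for sub in phi[1:])
    if op == 'or':
        return max(robustness(X, sub, t) for sub in phi[1:])
    if op in ('eventually', 'always'):
        _, t1, t2, sub = phi
        # clip the window to the finite trace length
        window = range(t + t1, min(t + t2, len(X) - 1) + 1)
        vals = [robustness(X, sub, tp) for tp in window]
        return max(vals) if op == 'eventually' else min(vals)
    raise ValueError(f'unknown operator {op}')

# Classification rule (15): positive label iff robustness is positive.
X = [[60.0], [10.0], [55.0]]                      # single-KPI (throughput) trace
phi = ('eventually', 0, 2, ('mu', [1.0], 50.0))   # <>[0,2](x > 50)
rho = robustness(X, phi)                          # max(10.0, -40.0, 5.0)
y_hat = 1 if rho > 0 else -1
```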
To this end, assume access to a labeled dataset D = {(X_i, Y_i)}_{i=1}^{|D|} in which each pair (X_i, Y_i) is drawn i.i.d. from an underlying ground-truth distribution P_{X,Y}, following the standard frequentist learning framework (see, e.g., [35]). The data distribution describes the current, ground-truth relationship between the traces X and the label Y. Using (15), for each labeled pair (X, Y) ∼ P_{X,Y}, the classification margin, given by the product Y ρ(X, ϕ, 0), is positive if the STL formula correctly marks a trace X as positive or negative; while a negative classification margin Y ρ(X, ϕ, 0) indicates an incorrect classification decision. Accordingly, any non-increasing function of the classification margin can serve as a loss function for the STLL problem (see, e.g., [35, Chapter 6]). Specifically, in [10], the shifted hinge loss

ℓ(ϕ, X, Y) = ReLU(β − Y ρ(X, ϕ, 0)) − γβ (16)

was adopted, where the hyperparameter β > 0 specifies the desired robustness margin and γ > 0 adjusts the trade-off in penalizing samples that already satisfy the margin. Using the loss ℓ(ϕ, X, Y), for any given joint distribution P_{X,Y}, the goal is to find an STL rule ϕ that minimizes the population loss

L_p(ϕ) = E_{(X,Y)∼P_{X,Y}}[ℓ(ϕ, X, Y)], (17)

where the average is evaluated over the joint distribution P_{X,Y} of input X and target Y. The expected value in (17) is approximated in practice using the dataset D, yielding the training loss

L_D(ϕ) = (1/|D|) Σ_{i=1}^{|D|} ℓ(ϕ, X_i, Y_i). (18)

However, a direct optimization of the training loss L_D(ϕ) over the STL formula ϕ is made difficult by the fact that the space of STL formulas is discrete.

B. Addressing the STL Learning Problem

In order to address the STLL problem of minimizing the training loss (18) over the STL formula ϕ, reference [10] introduced a relaxation-based method.
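Before turning to the relaxation, the shifted hinge loss (16) and its empirical average (18) can be sketched in a few lines. The robustness values and the hyperparameter settings β = 0.5, γ = 0.1 below are illustrative choices, not values prescribed by [10].

```python
# The shifted hinge loss (16) and the empirical training loss (18),
# written directly in terms of a robustness value rho = rho(X, phi, 0).
# The beta/gamma values are illustrative.

def hinge_loss(y, rho, beta=0.5, gamma=0.1):
    """ell(phi, X, Y) = ReLU(beta - Y * rho) - gamma * beta, as in (16)."""
    return max(0.0, beta - y * rho) - gamma * beta

def training_loss(labeled_rhos, beta=0.5, gamma=0.1):
    """Average of (16) over (label, robustness) pairs, as in (18)."""
    return sum(hinge_loss(y, r, beta, gamma)
               for y, r in labeled_rhos) / len(labeled_rhos)

# A correctly classified trace with margin >= beta attains the minimum
# loss -gamma*beta; a misclassified trace is penalized linearly.
loss_good = hinge_loss(+1, 2.0)    # max(0, 0.5 - 2.0) - 0.05 = -0.05
loss_bad = hinge_loss(+1, -1.0)    # max(0, 0.5 + 1.0) - 0.05 = 1.45
avg = training_loss([(+1, 2.0), (+1, -1.0)])
```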
Specifically, the STLL approach in [10] relaxes the discrete variables dictating the sequence of temporal operators and logical operations applied in the construction rule (6) to obtain the STL formula ϕ.

Fig. 3. An illustration of a parameterized template for the construction of an STL formula ϕ. In this example, the STL formula ϕ is constructed by the following steps: 1) a number of predicates μ, as in (3), are selected; 2) the predicates are combined via Boolean operators, either ∧ or ∨; 3) temporal operators ♢[t1,t2] or □[t1,t2] are applied to the resulting partial STL formulas; and 4) another Boolean operator is used to combine the partial STL properties constructed at the previous steps. Each layer is specified by parameters that are optimized by addressing the STLL problem.

To this end, reference [10] starts by assuming a template for the STL formula ϕ as a sequence of steps in the construction (6) of an STL rule. An example, adapted from [10], is shown in Fig. 3, in which the formula ϕ is constructed via the following steps: 1) a number of predicates μ, as in (3), are selected; 2) the predicates are combined via Boolean operators, which may be either ∧ or ∨; 3) temporal operators, namely ♢[t1,t2] or □[t1,t2], are applied to the resulting partial STL formulas; and 4) another Boolean operator, either ∧ or ∨, is used to combine the partial STL properties constructed at the previous steps. Accordingly, as illustrated in Fig. 3, a general template contains a predicate layer, Boolean layers, and temporal layers. Therefore, templates such as that in Fig.
3 fully specify an STL property ϕ once a number of parameters are given, namely: (i) the variables a and b defining the atomic predicate of a predicate module; (ii) the binary variables determining which logical operator (AND or OR) is applied and which formulas constructed from previous steps are combined in a Boolean module; and (iii) the binary variables determining whether a temporal operator is "always" □ or "eventually" ♢, as well as the corresponding interval bounds t1 and t2, in a temporal module. Minimizing the training loss L_D(ϕ) over an STL formula ϕ following a given template thus requires optimizing over both continuous variables, namely a, b, t1, t2, and discrete variables, i.e., the mentioned binary indicators. Reference [10] proposes relaxing the binary variables to probabilities so as to obtain a differentiable training loss, while adding regularizers to push the optimized probabilities towards either 0 or 1. For completeness, more details on the STLL procedure introduced in [10] can be found in Appendix A.

IV. CONFORMAL STL LEARNING

As discussed in the previous section, given a dataset D, STLL methods obtain an STL formula ϕ that ideally distinguishes positive and negative sequences with minimal average error. However, being based on standard supervised learning methodologies, such techniques do not provide any formal, finite-sample guarantee on the actual average loss obtained by the optimized formula ϕ. This can make the deployment of these methods in sensitive domains potentially problematic. In this section, we introduce conformal STL learning (C-STLL), a calibration methodology building on sequential generation and multiple hypothesis testing.

A. Reliable STL Learning via Sequential Generation

Via sequential generation, C-STLL constructs a set of candidate STL formulas using multiple runs of an STLL algorithm.
We specifically adopt the STLL algorithm in [10], which was reviewed in Section III-B. As illustrated in Fig. 2, C-STLL calibrates an acceptance rule determining whether a formula generated via STLL is accepted for inclusion in the set, together with a stopping rule dictating when to stop generating. C-STLL leverages held-out data, and has the following aims: (i) to formally ensure that at least one high-accuracy STL formula is included in the set with sufficiently large probability, and (ii) to make a best effort at minimizing the complexity, and at maximizing the diversity, of the STL formulas in the generated set. C-STLL follows the general methodology of conformal language modeling [33], which is itself grounded in Pareto testing [32] and learn then test (LTT) [28]. In the following, we first describe the sequential generation of STL formulas via STLL and the accuracy requirement, and then we present C-STLL as a calibration methodology for acceptance and stopping rules.

Formally, given a dataset D, C-STLL constructs a set of STL formulas C_λ(D) through multiple runs of STLL [10] via the following steps.

1) Sequential Generation: At each learning run l = 1, 2, . . ., STLL is applied to the dataset D as described in Sec. III-A, obtaining a generally different formula ϕ_l. To ensure the nondeterminism of STLL, and thus the generation of different STL formulas, we adopt an ensembling methodology based on randomizing the initialization and the mini-batch selection for stochastic gradient descent over the trainable parameters, as in deep ensembles [36]. An alternative approach may treat the STL parameters as random variables and sample from a learned posterior distribution [37]. C-STLL includes formula ϕ_l in set C_λ(D) according to an inclusion rule I_λ(D, ϕ_1, . . . , ϕ_l), to be designed:

I_λ(D, ϕ_1, . . . , ϕ_l) = { 1, if ϕ_l ∈ C_λ(D); 0, otherwise.
(19)

C-STLL stops when a stopping indicator S_λ(D, ϕ_1, . . . , ϕ_l), to be designed, returns 1, while proceeding to the next run l + 1 otherwise:

S_λ(D, ϕ_1, . . . , ϕ_l) = { 1, if a stopping condition is satisfied; 0, otherwise. (20)

Acceptance rule I_λ(D, ϕ_1, . . . , ϕ_l) and stopping rule S_λ(D, ϕ_1, . . . , ϕ_l) generally depend on a vector of hyperparameters λ, as introduced in the next subsection.

2) Reliability Requirements: C-STLL designs the hyperparameter λ of the acceptance and stopping rules with the aim of ensuring a reliability requirement for the set C_λ(D), while also accounting for the complexity and diversity of the formulas in set C_λ(D). The reliability condition is formalized as follows. Given an STL formula ϕ, any new trace X can be classified as positive (Ŷ = 1) or negative (Ŷ = −1) based on (14), or equivalently (15). Given a validation dataset D_v, we wish to impose that the fraction of correctly classified samples is larger than a user-defined threshold φ, i.e.,

(1/|D_v|) Σ_{i=1}^{|D_v|} 1(Ŷ_i = Y_i) > φ, (21)

where 1(·) is the indicator function (1(true) = 1 and 1(false) = 0). Accordingly, we define the accuracy indicator for an STL formula ϕ as

A(ϕ, D_v) = { 1, if (1/|D_v|) Σ_{i=1}^{|D_v|} 1(Ŷ_i = Y_i) > φ; 0, otherwise, (22)

so that A(ϕ, D_v) = 1 indicates that the STL formula ϕ is sufficiently accurate, while A(ϕ, D_v) = 0 signals the unreliability of formula ϕ.

In practice, the joint distribution P_{DD_v} captures the variability of the relationship between KPI traces X and high-level labels Y across the tasks of interest. To formalize this, we introduce an indicator τ for a task of interest, corresponding, e.g., to some users, network conditions, and QoE requirements. Each pair (D, D_v) of datasets pertains to a particular task τ.
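The accuracy indicator (22) above is a simple threshold test on validation accuracy. The sketch below stands in for a learned formula by its precomputed robustness values on the validation traces; labels, robustness values, and the threshold are illustrative.

```python
# Accuracy indicator A(phi, D_v) of (22): the formula is deemed reliable
# if the fraction of correct decisions (15) exceeds the user-set
# threshold. Inputs are illustrative stand-ins for a learned formula.

def accuracy_indicator(labels, rhos, threshold):
    """Return 1 if the fraction of correct decisions (15) exceeds threshold."""
    correct = sum(1 for y, rho in zip(labels, rhos)
                  if (1 if rho > 0 else -1) == y)
    return 1 if correct / len(labels) > threshold else 0

labels = [+1, +1, -1, -1]
rhos = [0.3, 0.1, -0.2, 0.4]       # the last trace is misclassified
A = accuracy_indicator(labels, rhos, threshold=0.7)   # accuracy 0.75 > 0.7
```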
We model the task τ ∼ P_τ as a random variable with an unknown distribution P_τ, while the datasets D = {(X_i, Y_i)}_{i=1}^{|D|} and D_v = {(X_i, Y_i)}_{i=1}^{|D_v|} contain i.i.d. data from an unknown task-specific distribution P_{X,Y|τ}. Accordingly, the joint distribution of the dataset pair (D, D_v) is given by the mixture

P_{DD_v} = ∫ P_τ P_{D|τ} P_{D_v|τ} dτ, (23)

where the task indicator τ is marginalized over P_τ, and P_{D|τ} and P_{D_v|τ} represent the i.i.d. distributions obtained from distribution P_{X,Y|τ}.

Using the accuracy indicator (22), we define the risk associated with the choice of hyperparameter λ as the probability that there is no sufficiently accurate STL formula ϕ in set C_λ(D), i.e.,

R(λ) = Pr{∄ ϕ ∈ C_λ(D) : A(ϕ, D_v) = 1}. (24)

In interpreting the probability in (24), we view the dataset pair (D, D_v) as random, following the unknown joint distribution P_{DD_v}. Our design goal is to identify a hyperparameter λ* which guarantees that the risk is controlled with high probability, i.e.,

Pr(R(λ*) < ϵ) ≥ 1 − δ, (25)

where the probability is over any randomness associated with the optimization of the hyperparameter λ*. By (25), the risk of the generated set C_λ*(D) must be smaller than ϵ with probability larger than 1 − δ.

B. Conformal STL Learning

As discussed, C-STLL builds the set C_λ(D) sequentially, and we define henceforth as C_{λ,l}(D) the current set at iteration l, with the initial set C_{λ,0}(D) being empty.

Acceptance rule: At each run l, C-STLL accepts an STL formula ϕ_l into the set C_{λ,l}(D), setting I_λ(D, ϕ_1, . . . , ϕ_l) = 1, if the formula is of sufficiently low complexity and sufficiently diverse relative to the previously accepted formulas. To elaborate, let H(ϕ_l) denote a function that maps the learned STL formula ϕ_l to a scalar quantity measuring its complexity.
A less complex formula is generally more interpretable, and thus more useful to relate KPI traces to high-level requirements. We specifically adopt the number of operations as the measure of complexity H(ϕ_l) [38]. The candidate formula ϕ_l is deemed of sufficiently low complexity if the measure H(ϕ_l) is smaller than a threshold λ_1, i.e.,

H(ϕ_l) < λ_1. (26)

Low complexity alone, however, is insufficient for constructing an informative set of formulas, since the learned formulas may be structurally similar. To avoid admitting redundant formulas, each candidate is additionally compared with those already in set C_{λ,l}(D) by using a distance measure, which is designed to detect duplicates and promote structural diversity. A new candidate ϕ_l is deemed to be sufficiently diverse from previously included formulas if the distance D(ϕ_l, ϕ) between ϕ_l and all of the previously accepted formulas ϕ ∈ C_{λ,l−1}(D) is larger than a threshold λ_2, i.e.,

D(ϕ_l, ϕ) > λ_2, for all ϕ ∈ C_{λ,l−1}(D). (27)

As distance function, we follow [39] and measure semantic dissimilarity by comparing the robustness scores of the two formulas on the dataset D as

D(ϕ_l, ϕ) = σ( (1/|D|) Σ_{i=1}^{|D|} | |ρ(X_i, ϕ_l, 0)| − |ρ(X_i, ϕ, 0)| | ). (28)

The use of the sigmoid function σ(·) = (1 + e^{−(·)})^{−1} in (28) ensures that the distance is normalized between 0 and 1. Two STL formulas that behave similarly across traces in dataset D will produce nearly identical robustness values, leading to a small distance (28). When a candidate satisfies both the complexity and diversity criteria, it is accepted into the set C_{λ,l}(D), i.e., the acceptance function is given by

I_λ(D, ϕ_1, . . . , ϕ_l) = 1(H(ϕ_l) < λ_1 and D(ϕ_l, ϕ) > λ_2 for all ϕ ∈ C_{λ,l−1}(D)). (29)
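The acceptance test (29) combines the complexity check (26) with the pairwise distance check (27)-(28). In the sketch below, each formula is summarized only by its complexity and its robustness profile over the traces in D; this flattened representation, and all numeric values, are our own illustrative assumptions. Note that, with the sigmoid normalization in (28), identical robustness profiles give a distance of σ(0) = 0.5, so a duplicate-rejecting threshold λ_2 would sit above 0.5.

```python
import math

# Acceptance rule (29): admit a candidate if its complexity (26) is below
# lambda1 and its robustness-based distance (28) to every formula already
# in the set exceeds lambda2. Formulas are represented by
# (complexity, robustness-profile) pairs, an illustrative simplification.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def distance(rhos_a, rhos_b):
    """Semantic distance (28) between two robustness profiles."""
    n = len(rhos_a)
    return sigmoid(sum(abs(abs(ra) - abs(rb))
                       for ra, rb in zip(rhos_a, rhos_b)) / n)

def accept(candidate, accepted_set, lam1, lam2):
    """Acceptance indicator (29)."""
    H, rhos = candidate
    if H >= lam1:                                   # complexity check (26)
        return False
    return all(distance(rhos, r) > lam2             # diversity check (27)
               for _, r in accepted_set)

current = [(3, [0.2, -0.1, 0.4])]                   # one accepted formula
dup = (2, [0.21, -0.1, 0.4])                        # near-duplicate profile
new = (2, [1.5, -2.0, 3.0])                         # clearly different profile
keep_dup = accept(dup, current, lam1=5, lam2=0.55)  # rejected: too similar
keep_new = accept(new, current, lam1=5, lam2=0.55)  # accepted
```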
(29)

Algorithm 1: Conformal STL Learning (C-STLL) — test time
1: Input: Labeled dataset D
2: Initialization: The initial set is empty, C_{λ,0}(D) = {}
3: for l = 1, 2, …, L_max do
4:   Run STLL [10] on dataset D to learn an STL formula ϕ_l
5:   if H(ϕ_l) < λ_1 and D(ϕ_l, ϕ) > λ_2 for all ϕ ∈ C_{λ,l−1}(D)
6:     Add ϕ_l to the set C_{λ,l}(D)
7:   if F(C_{λ,l}(D)) > λ_3
8:     Exit the for loop
9:   end if
10: end for
11: Output: Set C_λ(D) = C_{λ,l}(D)

Accordingly, the set is updated as

C_{λ,l}(D) = C_{λ,l−1}(D) ∪ {ϕ_l} if I_λ(D, ϕ_1, …, ϕ_l) = 1, and C_{λ,l}(D) = C_{λ,l−1}(D) otherwise.  (30)

Stopping rule: The sequential generation process is run for at most L_max times, and may terminate early if the evolving set is assessed to be of high enough quality. To formalize the resulting stopping function S_λ(D, ϕ_1, …, ϕ_l) in (20), we define a set-based quality function F(C_{λ,l}(D)) for the set C_{λ,l}(D). Specifically, we define the quality of any STL formula ϕ_l as an increasing function of its average robustness on the dataset D as

Q(ϕ_l) = σ( (1/|D|) Σ_{i=1}^{|D|} ρ(X_i, ϕ_l, 0) ),  (31)

where the sigmoid function again ensures a normalized measure in the interval [0, 1]. The set-based quality function F(C_{λ,l}(D)) is then defined as the average

F(C_{λ,l}(D)) = (1/|C_{λ,l}(D)|) Σ_{ϕ ∈ C_{λ,l}(D)} Q(ϕ).  (32)

C-STLL stops generating STL formulas at the earliest iteration l for which we have the inequality

F(C_{λ,l}(D)) > λ_3,  (33)

where λ_3 is a threshold. The overall C-STLL procedure is summarized in Algorithm 1. The vector of hyperparameters λ = (λ_1, λ_2, λ_3), which encompasses the complexity threshold λ_1 (26), the diversity threshold λ_2 (27), and the early-stopping threshold λ_3 (33), is optimized during a preliminary calibration phase, which is discussed next.

C. Reliable Hyperparameter Optimization via Multiple Hypothesis Testing

As discussed in Sec.
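Under the assumption of a black-box candidate generator standing in for STLL [10] and user-supplied measures H, D, and F as defined above, the test-time loop of Algorithm 1 can be sketched as follows (all function arguments here are placeholders, not the paper's actual implementation):

```python
def c_stll_test_time(learn_stl, H, D, F, lam1, lam2, lam3, L_max=10):
    """Sequential set construction of Algorithm 1 (sketch).
    learn_stl() returns a candidate formula; H measures complexity (26),
    D measures pairwise diversity (27), F measures set quality (32)."""
    accepted = []
    for _ in range(L_max):
        phi = learn_stl()
        # Acceptance rule (29): low complexity and sufficient diversity
        # from every previously accepted formula.
        if H(phi) < lam1 and all(D(phi, psi) > lam2 for psi in accepted):
            accepted.append(phi)
            # Stopping rule (33): exit once the set quality exceeds lam3.
            if F(accepted) > lam3:
                break
    return accepted

# Toy run: integer "formulas", zero complexity, absolute-difference
# diversity, and quality proportional to the set size.
cands = iter(range(10))
result = c_stll_test_time(
    learn_stl=lambda: next(cands),
    H=lambda p: 0.0,
    D=lambda a, b: abs(a - b),
    F=lambda s: len(s) / 3,
    lam1=1.0, lam2=0.5, lam3=0.9,
)
# stops after the third acceptance, when F = 3/3 > 0.9
```

The same loop with L_max = 1 recovers the plain STLL baseline discussed in the experiments.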
IV-A, C-STLL aims at identifying hyperparameters λ* that satisfy the reliability requirement (25), while attempting to introduce STL formulas with low complexity and to maintain diversity. To this end, C-STLL carries out an offline calibration procedure based on calibration data encompassing examples of dataset pairs (D, D^v) from different tasks. Following the setting described in Section IV-A2, we assume the availability of a calibration dataset D_cal = {(D_k, D^v_k)}_{k=1}^{|D_cal|} consisting of pairs (D_k, D^v_k) of training and validation datasets drawn i.i.d. from the dataset distribution P_{D,D^v}. Accordingly, each pair (D_k, D^v_k) corresponds to an independently generated task τ_k ∼ P_τ.

C-STLL uses the calibration dataset D_cal to select a hyperparameter λ* so that the condition (25) is satisfied, where the outer probability is taken over the calibration dataset D_cal. To this end, C-STLL follows the LTT methodology [28] by specifically adopting Pareto testing [32]. In LTT, one starts by identifying a discrete subset Λ of configurations for the hyperparameter λ. This can be done via a standard grid or through any optimization algorithm using prior information or separate data. For each hyperparameter λ ∈ Λ, LTT tests the null hypothesis H_λ that the true risk (24) is above the target uncertainty level ϵ, i.e.,

H_λ : R(λ) ≥ ϵ.  (34)

Rejecting the null hypothesis H_λ identifies a hyperparameter λ yielding reliable performance as measured by the accuracy indicator (22). Using the calibration data D_cal, for each hyperparameter λ ∈ Λ, we evaluate the empirical risk R̂(λ, D_cal) as

R̂(λ, D_cal) = (1/|D_cal|) Σ_{k=1}^{|D_cal|} 1{ ∄ ϕ ∈ C_λ(D_k) : A(ϕ, D^v_k) = 1 },  (35)

which corresponds to an unbiased estimate of the true risk R(λ) in (24). Accordingly, if the null hypothesis H_λ holds, the estimate (35) will tend to be larger than ϵ.
More precisely, let b(|D_cal|, ϵ) denote a binomial random variable with sample size |D_cal| and success probability ϵ. A valid p-value for the hypothesis H_λ in (34) is given by [33]

p(λ) = Pr( b(|D_cal|, ϵ) ≤ |D_cal| R̂(λ, D_cal) ),  (36)

in the sense that we have the inequality Pr(p(λ) ≤ δ | H_λ) ≤ δ for all probabilities δ.

Using the p-values {p(λ)}_{λ∈Λ}, LTT applies a family-wise error rate (FWER) multiple hypothesis testing (MHT) procedure to obtain a subset Λ_valid ⊆ Λ with the property

Pr( ∃ λ ∈ Λ_valid : R(λ) ≥ ϵ ) ≤ δ.  (37)

If the set Λ_valid is empty, we select the solution λ* = [λ_1 = ∞, λ_2 = 0, λ_3 = ∞]^⊤, which allows the set C_λ(D) to include all L_max STL formulas. Otherwise, using any configuration in the subset Λ_valid guarantees the reliability condition (25) by the FWER property (37). In C-STLL, we select the hyperparameter λ* ∈ Λ_valid that minimizes the empirical set size, i.e.,

λ* = arg min_{λ ∈ Λ_valid} (1/|D_cal|) Σ_{k=1}^{|D_cal|} |C_λ(D_k)|.  (38)

As for the FWER procedure, in order to efficiently search the set of candidates Λ, we use the Pareto testing procedure

Algorithm 2: Conformal STL Learning (C-STLL) — calibration phase
1: Input: Calibration dataset D_cal, set Λ, risk tolerance ϵ, error level δ
2: Initialization: Initialize the reliable set as empty, Λ_valid = {}
3: for λ ∈ Λ do
4:   Evaluate the empirical risk R̂(λ, D_cal) using (35)
5:   Compute the p-value p(λ) for hypothesis H_λ using (36)
6: end for
7: Apply the LTT procedure on {p(λ)}_{λ∈Λ} to obtain a subset Λ_valid ⊆ Λ satisfying the FWER guarantee (37) (see Appendix B)
8: if Λ_valid is empty
9:   Set λ* = [λ_1 = ∞, λ_2 = 0, λ_3 = ∞]^⊤
10: else
11:   Select λ* ∈ Λ_valid that minimizes the empirical average set size as in (38)
12: end if
13: Output: Calibrated hyperparameter λ*
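The p-value (36) is simply a binomial CDF evaluated at the empirical failure count, so it can be computed with nothing beyond the standard library; the counts in the sketch below are illustrative:

```python
from math import comb

def binom_cdf(k: int, n: int, eps: float) -> float:
    """P( Binomial(n, eps) <= k ), computed exactly."""
    return sum(comb(n, j) * eps**j * (1 - eps)**(n - j)
               for j in range(k + 1))

def p_value(n_failures: int, n_cal: int, eps: float) -> float:
    """Valid p-value (36) for the null hypothesis H_lambda: R(lambda) >= eps,
    where `n_failures` is the number of calibration pairs with no
    sufficiently accurate formula (empirical risk = n_failures / n_cal)."""
    return binom_cdf(n_failures, n_cal, eps)

# Example: 50 calibration pairs, target risk eps = 0.2, 5 failing pairs,
# i.e., empirical risk 0.1; a small p-value supports rejecting H_lambda
# and declaring the configuration reliable.
p = p_value(5, 50, 0.2)
```

A low empirical risk relative to ϵ drives the p-value down, matching the intuition after (35) that under H_λ the estimate tends to exceed ϵ.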
Pareto T esting exploits structure in Λ by first using a proportion of the dataset D cal to approximate the Pareto-optimal frontier in two-dimensional space, and by then iterativ ely validating promising configurations using fixed- sequence testing [ 40 ] on the remaining calibration data. Details can be found in Appendix B. D. Theor etical Guarantees Giv en its reliance on L TT , C-STLL guarantees the target reliability condition ( 25 ). Theorem 1 ( Reliability of C-STLL via L TT ) . By setting the hyperparameter vector λ ∗ as described in Algorithm 2 , the risk of the set C λ ∗ ( D ) constructed via C-STLL as in Algorithm 1 satisfies the inequality ( 25 ) . Pr oof. The proof is provided for completeness in the Ap- pendix C. V . E X P E R I M E N T S In this section, we provide numerical results to demonstrate the effecti veness of the proposed C-STLL calibration method- ology . A. Setting Throughout this section, we adopt the network simulator ns-3 1 to model a single g aming user communicating with a remote host over a wireless network (see Fig. 1 ). Bidirec- tional user datagram protocol (UDP) traffic is generated to emulate interactiv e g aming, while optional background users introduce network congestion. The latency KPI is defined as the one-way end-to-end delay of successfully received gaming 1 The simulator is av ailable at https://www .nsnam.org/. packets, av eraged over uplink and downlink. The bac klog KPI represents the number of user-side queued UDP packets awaiting transmission. Each trace X consists of latency and backlog KPIs collected over 61 time steps, which corresponds to uniformly spaced observations over a real time observation of 13s. W e consider two dif ferent labeling strategies, with the first relying on a ground-truth STL formula, and the second mimicking a high-level QoE assessment. The first approach is included for reference, as it makes it possible to compare true and estimated STL formulas directly . 
In the first approach, a trace is labeled as positive (Y = 1) if the latency remains below a threshold T_1 over all time steps, while the backlog is below a threshold T_2 during the final 5 time steps; otherwise, it is labeled as negative (Y = −1). This labeling approach has the advantage of providing a ground-truth STL formula, namely

ϕ_true = □_{[0,60]}(latency < T_1 ms) ∧ □_{[56,60]}(backlog < T_2 kilobyte),  (39)

facilitating evaluation. We specifically use the threshold pairs (T_1, T_2) = (100, 30), (110, 28), (120, 26), (130, 24), and (140, 32) to label the entire dataset, corresponding to five distinct tasks.

We also consider a labeling setting mimicking the assessment of QoE levels from the end user (see Fig. 1). To this end, we adopt an LLM-as-a-judge approach [41, 42], whereby an LLM (ChatGPT 5.1) is used to assign QoE labels based on the temporal evolution of latency and backlog. This is done by using the following prompt: “You are a gaming user communicating with a remote host over a wireless network. The attached file contains KPIs including latency and backlog over 61 time steps. Evaluate the QoE according to one of the following five QoE definitions: (i) latency-dominant QoE, prioritizing consistently low latency; (ii) backlog-dominant QoE, prioritizing low queue backlog; (iii) balanced QoE, requiring both low latency and low backlog; (iv) burst-tolerant QoE, allowing short-term spikes in latency or backlog but penalizing sustained degradation; and (v) strict QoE, penalizing any persistent latency or backlog increase. Label each example with Y = 1 corresponding to high QoE and Y = −1 to low QoE.”

Each training dataset D contains 5,000 examples, and each validation set D^v contains 1,000 examples. Given five tasks, each dataset pair (D, D^v) is generated by first uniformly sampling a task τ, and then sampling from the labeled data corresponding to this task.
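The ground-truth labeling rule (39) reduces to two window checks over each 61-step trace; a minimal sketch, using the threshold values of the first task:

```python
def label_trace(latency, backlog, T1=100.0, T2=30.0):
    """Label a trace per the ground-truth STL formula (39):
    Y = +1 iff latency stays below T1 over the whole horizon [0, 60]
    and backlog stays below T2 over the final window [56, 60]."""
    assert len(latency) == len(backlog) == 61
    always_latency = all(x < T1 for x in latency)        # box over [0, 60]
    always_backlog = all(x < T2 for x in backlog[56:61])  # box over [56, 60]
    return 1 if (always_latency and always_backlog) else -1

# A trace with uniformly low latency and backlog is positive;
# a single latency spike anywhere, or a late backlog burst, flips it.
good = label_trace([50.0] * 61, [10.0] * 61)
```

Swapping in the other four threshold pairs yields the remaining tasks.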
We generate 100 calibration data pairs (D, D^v), with 50 pairs used for Pareto testing, and the remaining 50 used for fixed-sequence testing (see Section IV-C).

B. Implementation

We adopt the STLL method in [10], which learns an STL formula via neural network-based optimization, as explained in Section III-B and Appendix A. We follow a template similar to Fig. 3, with six predicate modules at the input layer, followed by six temporal modules and a single Boolean module as the output layer. Following the recommendations in [10], the learning rate is set to 0.1, the batch size to 512, and the number of training epochs to 5; in the loss function (16), we set β = 0.1 and γ = 0.01.

TABLE I: Examples of learned STL formula sets with conventional STLL [10] and with variants of the proposed C-STLL calibration schemes, given the same training dataset D (true labeling STL formula (39) with T_1 = 100 and T_2 = 30).

STLL:
  ϕ = □_{[24,39]}(latency < 21) ∧ □_{[1,57]}(backlog < 59) ∧ □_{[29,30]}(latency < 66) ∧ □_{[11,42]}(backlog > 3)
  Accuracy: ϕ: 65%

C-STLL:
  ϕ_1 = □_{[20,31]}(latency < 137) ∧ □_{[25,60]}(backlog < 55),
  ϕ_2 = ♢_{[11,40]}(backlog > 3) ∧ □_{[0,55]}(latency < 30)
  Accuracy: ϕ_1: 96.9%, ϕ_2: 85%

C-STLL with stopping rule only:
  ϕ_1 = □_{[19,49]}(backlog < 64),
  ϕ_2 = ♢_{[19,33]}(latency < 154) ∨ □_{[9,48]}(latency < 22),
  ϕ_3 = ♢_{[13,35]}(backlog > 60) ∨ ♢_{[14,45]}(latency < 20),
  ϕ_4 = ♢_{[11,60]}(backlog < 4) ∨ □_{[13,38]}(backlog < 63),
  ϕ_5 = □_{[5,29]}(backlog < 81),
  ϕ_6 = ♢_{[18,32]}(backlog > 16) ∧ ♢_{[24,54]}(backlog > 11) ∧ ♢_{[0,33]}(backlog < 45),
  ϕ_7 = □_{[15,36]}(latency < 14) ∧ ♢_{[17,53]}(backlog < 24),
  ϕ_8 = □_{[25,29]}(latency < 27),
  ϕ_9 = □_{[0,33]}(backlog > 11) ∨ □_{[8,46]}(backlog > 21),
  ϕ_10 = □_{[15,38]}(backlog > 2) ∧ □_{[18,29]}(latency < 28)
  Accuracy: ϕ_1: 84.1%, ϕ_2: 84.1%, ϕ_3: 52.7%, ϕ_4: 57.5%, ϕ_5: 84.1%, ϕ_6: 81.3%, ϕ_7: 47.3%, ϕ_8: 76.8%, ϕ_9: 52.7%, ϕ_10: 71.9%

C-STLL with complexity check and stopping rule:
  ϕ_1 = ♢_{[23,45]}(backlog > 2) ∧ □_{[21,39]}(backlog < 75),
  ϕ_2 = ♢_{[24,39]}(backlog > 8) ∧ □_{[27,46]}(latency < 32),
  ϕ_3 = ♢_{[18,32]}(latency < 17) ∧ ♢_{[26,52]}(backlog > 4),
  ϕ_4 = ♢_{[27,33]}(backlog < 45),
  ϕ_5 = ♢_{[17,47]}(backlog > 12) ∧ ♢_{[1,35]}(backlog < 73),
  ϕ_6 = ♢_{[1,44]}(latency < 17) ∧ ♢_{[13,52]}(backlog > 13),
  ϕ_7 = □_{[25,54]}(latency < 20) ∧ ♢_{[5,36]}(backlog > 6)
  Accuracy: ϕ_1: 84.1%, ϕ_2: 78.6%, ϕ_3: 53.8%, ϕ_4: 82.6%, ϕ_5: 52.7%, ϕ_6: 57.7%, ϕ_7: 48.1%

C-STLL with diversity check and stopping rule:
  ϕ_1 = □_{[24,51]}(latency < 32) ∧ ♢_{[9,46]}(latency > 7) ∧ □_{[13,37]}(backlog > 369) ∧ □_{[4,36]}(latency < 31),
  ϕ_2 = □_{[1,53]}(latency < 29) ∨ ♢_{[0,58]}(latency < 28),
  ϕ_3 = ♢_{[11,45]}(latency < 179) ∧ □_{[19,55]}(backlog > 48) ∧ □_{[23,56]}(latency > 10) ∧ ♢_{[25,54]}(backlog > 195) ∧ ♢_{[0,52]}(latency < 101) ∧ □_{[9,32]}(backlog > 189)
  Accuracy: ϕ_1: 70.6%, ϕ_2: 83.9%, ϕ_3: 83.9%

The grid Λ contains all threshold tuples (λ_1, λ_2, λ_3) with λ_1 ∈ {0.33, 0.5, 0.67}, λ_2 ∈ {0.50, 0.52, 0.54, 0.56, 0.58}, and λ_3 ∈ {0.3, 0.4, 0.5, 0.6, 0.7}. These ranges were chosen based on preliminary experiments to identify regions with meaningful variation in risk and set size over separate data generated in the same way. For C-STLL, unless otherwise stated, the accuracy threshold in (22) is set to φ = 0.8, thus requiring that a fraction larger than 80% of validation data points be correctly classified. The risk tolerance in (34) is set to ϵ = 0.2, so that for the selected hyperparameter λ* the probability of meeting the accuracy requirement is at least 80%. Finally, the error level in (25) is δ = 0.05, so that the probability of selecting an unreliable hyperparameter λ* is no more than 0.05. C-STLL is run for at most L_max = 10 iterations.

C. Benchmarks

For comparison, we consider the standard application of STLL [10], along with a number of variants of C-STLL:

STLL: This baseline learns a single STL formula by minimizing a robustness-based loss (16) over the training dataset D. Unlike C-STLL, STLL does not generate a set of candidate formulas and does not employ stopping criteria or an acceptance rule, and therefore provides no explicit reliability control. It is obtained from Algorithm 1 by setting L_max = 1.

C-STLL with stopping rule only: This variant of the proposed C-STLL scheme uses the stopping rule based on the set-level quality function F, but it does not apply the complexity and diversity checks when accepting individual formulas ϕ_l. Accordingly, in this case, only the stopping threshold λ_3 is optimized, while thresholds λ_1 and λ_2 are not used in Algorithm 1.

C-STLL with complexity check and stopping rule: This variant of C-STLL enforces the complexity-based acceptance rule, admitting a candidate formula ϕ_l only if its complexity satisfies (26), but no diversity constraint is applied. In this case, only the complexity threshold λ_1 and the stopping threshold λ_3 are optimized, while threshold λ_2 is not used in Algorithm 1.

C-STLL with diversity check and stopping rule: This variant enforces the diversity-based acceptance rule, requiring each accepted formula ϕ_l to be sufficiently dissimilar from previously accepted formulas, but no complexity constraint is applied. In this case, only the diversity threshold λ_2 and the stopping threshold λ_3 are optimized, while threshold λ_1 is not used in Algorithm 1.
C-STLL with Bonferroni correction: We also consider a baseline that selects the hyperparameter λ using Bonferroni-corrected MHT as the FWER-controlling mechanism. For each λ ∈ Λ, this scheme computes the p-value in (36), and declares λ reliable if p(λ) < δ/|Λ|. Therefore, the reliable set is given by Λ_valid = {λ ∈ Λ | p(λ) < δ/|Λ|}, and we select the hyperparameter λ* ∈ Λ_valid that yields the smallest set size as in (38). This benchmark allows us to verify the benefits of Pareto testing [32] as a hyperparameter selection strategy. We emphasize that all C-STLL variants are novel, and that prior art only considered STLL [10].

D. Results

We start by considering data labeled via the ground-truth STL formula ϕ_true in (39). To illustrate the operation of different schemes, in Table I, we present examples of STL formula sets learned using the same dataset D under STLL, which returns a single formula, as well as with the mentioned C-STLL variants. As discussed, the C-STLL variants implement different acceptance and stopping mechanisms, which directly shape the resulting set. Table I also reports the accuracy on the same validation dataset D^v as in (21), i.e., (Σ_{i=1}^{|D^v|} 1(Ŷ_i = Y_i))/|D^v|, marking in bold the STL formulas with accuracy above the target threshold φ = 0.8.

STLL returns a single formula, which is quite different from the ground-truth STL formula ϕ_true in (39), yielding an accuracy level below the target 80%. In contrast, C-STLL results in a compact set, where each formula is both low in complexity and complementary to the others, resulting in at least one formula in the set with accuracy above the threshold φ. When only the stopping rule is used, C-STLL admits all learned formulas until the aggregate quality criterion is met. As a result, the corresponding set is large, and contains many formulas that are structurally similar or redundant.
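The Bonferroni baseline admits a configuration whenever its p-value clears the corrected threshold δ/|Λ|; a sketch with illustrative p-values (the configuration names are placeholders):

```python
def bonferroni_valid(p_values: dict, delta: float) -> list:
    """Bonferroni-corrected FWER control: declare a configuration lambda
    reliable iff p(lambda) < delta / |Lambda|.  Testing every hypothesis
    against the corrected level guarantees FWER <= delta by a union bound,
    but is typically more conservative than fixed-sequence testing."""
    threshold = delta / len(p_values)
    return [lam for lam, p in p_values.items() if p < threshold]

# With delta = 0.05 and |Lambda| = 5, the corrected threshold is 0.01,
# so only the configurations with p-values 0.001 and 0.009 survive.
valid = bonferroni_valid(
    {"a": 0.001, "b": 0.02, "c": 0.009, "d": 0.5, "e": 0.04}, delta=0.05
)
```

The conservatism visible here (configuration "b" would pass at the uncorrected level δ) is what Pareto testing [32] is designed to mitigate.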
Introducing a complexity check restricts the admission of formulas with many temporal or Boolean operators, leading to a smaller set composed of simpler and more interpretable formulas. In contrast, enforcing a diversity check suppresses formulas whose robustness behaviors are similar to previously accepted ones, encouraging structural variety but still allowing relatively complex formulas to enter the set. This analysis validates the importance of including complexity and diversity checks.

We now evaluate aggregate metrics to study the average performance of C-STLL. To this end, Fig. 4 shows the test performance of STLL and of all C-STLL variants as a function of the risk tolerance ϵ on a test set. We evaluate four key metrics: average risk (24), average set size |C_λ(D)|, average formula complexity (26), and average diversity (28). Recall that the risk R(λ) measures the probability of the set C_λ(D) including no STL formulas above the target reliability φ. The average complexity is defined as the normalized number of temporal operations in an STL formula (see Section IV-B), while the average diversity is computed as the mean of all pairwise distances between formulas using (28).

Fig. 4(a) shows the empirical average risk achieved by each scheme on the test dataset. The dashed diagonal indicates the target constraint ϵ. STLL fails to control the risk, while all C-STLL schemes control the risk below the pre-defined tolerance ϵ, validating the effectiveness of the conformal calibration framework. The Bonferroni-based scheme consistently yields the lowest empirical risk, which is expected due to its conservative nature. In contrast, the baseline C-STLL and its variants exhibit a tradeoff between reliability and expressiveness: relaxing the selection criterion allows the model to admit more compact or diverse formula sets at the cost of operating closer to the risk boundary. Fig.
4(b) reports the average size of the learned STL formula set. A decreasing trend is observed for most schemes as the target ϵ increases, indicating that a looser risk tolerance allows fewer formulas while controlling the risk. C-STLL consistently produces the smallest sets, as it directly optimizes for early stopping once a sufficient quality level is reached. The C-STLL with stopping-rule-only variant yields larger sets, since it lacks additional structural constraints and thus accumulates more candidate formulas before termination.

Fig. 4(c) illustrates the average normalized complexity of formulas in the selected sets. Schemes that explicitly penalize complexity achieve consistently lower complexity values. C-STLL maintains relatively low complexity due to its early-stopping mechanism. In contrast, the diversity-aware and Bonferroni-based schemes result in higher complexity on average, since diversity and conservative testing both favor retaining richer temporal structures to hedge against different failure modes.

Fig. 4(d) shows the average diversity of the learned STL sets. As expected, C-STLL with diversity check and stopping rule consistently achieves high diversity. The C-STLL with complexity check and stopping rule scheme shows lower diversity than other variants, since restricting formula complexity inherently limits structural variation.

Finally, we consider the setting in which labels are assigned by an LLM-as-a-judge. Fig. 5 reports the corresponding test performance as a function of the risk tolerance ϵ on the test dataset. All C-STLL variants successfully control the empirical risk below the target risk tolerance ϵ, validating the effectiveness of the proposed framework also under subjective, QoE-driven labels. Compared to Fig. 4, we observe moderately larger average set sizes and complexity values for most schemes, reflecting the increased ambiguity and variability induced by LLM-based labeling.
Nevertheless, C-STLL yields the smallest sets due to its early-stopping generation approach, and the different variants of C-STLL provide relative gains that are consistent with those discussed in Fig. 4.

VI. CONCLUSIONS

This paper has introduced conformal signal temporal logic learning (C-STLL), a novel framework for extracting interpretable temporal specifications from wireless network KPI traces with formal reliability guarantees. By combining the expressiveness of STL with the distribution-free coverage guarantees of modern hyperparameter selection methods, C-STLL addresses both relevance, capturing temporal patterns predictive of high-level requirements such as QoE, and interpretability, providing human-readable specifications that network operators can validate, understand, and act upon.

The key insight behind C-STLL is to move beyond learning a single STL formula to constructing a calibrated set of formulas with provable guarantees. Through sequential generation with principled acceptance and stopping rules, C-STLL ensures that the returned set contains at least one formula achieving a target accuracy level with high probability.

Fig. 4. Average test performance as a function of the target risk ϵ (the probability that the set C_λ(D) includes no STL formulas with validation accuracy above φ = 0.8) with data labeled using the ground-truth STL formula (39): (a) Average risk. (b) Average set size. (c) Average complexity (26). (d) Average diversity (28).

Fig. 5. Average test performance as a function of the target risk ϵ (the probability that the set C_λ(D) includes no STL formulas with validation accuracy above φ = 0.8) with data labeled using the LLM: (a) Average risk. (b) Average set size. (c) Average complexity (26). (d) Average diversity (28).

The acceptance rules, based on complexity and diversity thresholds,
promote interpretability by favoring simpler formulas while encouraging structural variety to capture different aspects of the underlying temporal patterns. The calibration procedure, grounded in the Learn Then Test (LTT) framework and implemented via Pareto testing followed by fixed-sequence testing [33], efficiently searches the hyperparameter space while maintaining rigorous statistical validity.

Experimental results on a mobile gaming scenario simulated using ns-3 have demonstrated the effectiveness of C-STLL. Compared to standard STLL [10], which returns a single formula without reliability assurances, C-STLL produced compact sets of diverse, low-complexity formulas that achieved high accuracy on validation data. Ablation studies confirmed the importance of each component: complexity checks yielded more interpretable formulas, diversity checks prevented redundancy, and the stopping rule enabled efficient early termination once sufficient quality was achieved.

Several directions merit further investigation. First, there may be QoE requirements that are better expressed in terms of functions of the KPIs; developing solutions that account for this, while preserving interpretability, is an open challenge. Second, extending C-STLL to an online setting, where KPI traces arrive sequentially and the formula set must be updated incrementally, would enhance its applicability to real-time network monitoring. Third, incorporating multi-task learning across heterogeneous network conditions and application types could improve generalization and reduce the calibration data requirements. Fourth, validating C-STLL with real QoE measurements, as well as on a real O-RAN architecture, would support practical deployment for automated KPI-to-requirement translation in softwarized networks.
Finally, investigating the use of large language models to generate natural-language explanations of learned STL formulas could further bridge the gap between formal specifications and operator understanding.

APPENDIX A: TEMPORAL LOGIC INFERENCE NETWORK

This appendix reviews TLINet [10], which learns STL formulas by integrating quantitative STL semantics with neural network optimization.

vSTL Representation: TLINet uses vectorized STL (vSTL), where Boolean operators ∧/∨ and temporal operators □/♢ are parameterized by binary vectors. A binary vector w_b ∈ {0,1}^n selects active subformulas in Boolean operations, while w_I ∈ {0,1}^T encodes the time interval [t_1, t_2] for temporal operators, with w_I^t = 1 if t_1 ≤ t ≤ t_2. The robustness computations in (10)–(13) are reformulated accordingly over these binary vectors.

TLINet Architecture: As illustrated in Fig. 3, TLINet contains predicate, temporal, and Boolean modules arranged in layers. The predicate module takes a trajectory X as input and outputs robustness vectors using trainable parameters a and b. Temporal modules learn the interval bounds t_1, t_2 via smooth approximations of w_I, and select between □ and ♢ using a Bernoulli variable κ with trainable probability p_κ. Boolean modules similarly learn subformula selection via Bernoulli variables for w_b and the operator type, either ∧ or ∨, via another Bernoulli variable κ_b. The output robustness ρ(X, ϕ, 0) serves as the decision score. The training objective combines the hinge classification loss in (16) with regularizers promoting binarization of all probabilities via p(1−p) terms and sparsity in Boolean selection via ℓ_1 regularization on p_b. Parameters are optimized using stochastic gradient descent.
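Assuming the interval-selection semantics of w_I described above, the hard (non-smooth) robustness of a vSTL temporal operator reduces to a masked min or max over the per-step robustness of the subformula; a sketch (TLINet itself replaces these operations with smooth, trainable approximations during learning):

```python
def temporal_robustness(rho, w_I, op="always"):
    """Hard robustness of a temporal operator over the time steps
    selected by the binary vector w_I: 'always' (box) takes the min
    of the subformula robustness over the interval, 'eventually'
    (diamond) takes the max."""
    selected = [r for r, w in zip(rho, w_I) if w == 1]
    return min(selected) if op == "always" else max(selected)

rho = [3.0, -1.0, 2.0, 5.0]   # per-step robustness of a subformula
w_I = [1, 1, 1, 0]            # encodes the time interval [0, 2]
r_always = temporal_robustness(rho, w_I, "always")      # min over window
r_event = temporal_robustness(rho, w_I, "eventually")   # max over window
```

The negative "always" value on this window signals a violation at some selected step, while the positive "eventually" value signals satisfaction at least once, matching the quantitative STL semantics.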
APPENDIX B: LTT VIA PARETO TESTING

The calibration set D_cal is partitioned into D_cal,1 with K_1 pairs for Pareto testing and D_cal,2 with K_2 pairs for fixed-sequence testing.

Pareto testing: For each candidate λ ∈ Λ, we run C-STLL on D_cal,1 to compute the empirical risk

R̂(λ, D_cal,1) = (1/K_1) Σ_{k=1}^{K_1} 1( ∄ ϕ ∈ C_λ(D_{k,1}) : A(ϕ, D^v_{k,1}) = 1 )

and the average set size

|C_λ(D_cal,1)| = (1/K_1) Σ_{k=1}^{K_1} |C_λ(D_{k,1})|.

Since risk and set size are conflicting objectives, solving the multi-objective problem min_λ { R̂(λ, D_cal,1), |C_λ(D_cal,1)| } yields a Pareto frontier Λ̄ ⊆ Λ. The hyperparameters in Λ̄ are then ordered by increasing p-values as defined in (36).

Fixed-sequence testing: Using D_cal,2, we test the ordered hyperparameters sequentially. For each λ_{π(i)}, we compute the empirical risk

R̂(λ_{π(i)}, D_cal,2) = (1/K_2) Σ_{k=1}^{K_2} 1( ∄ ϕ ∈ C_{λ_{π(i)}}(D_{k,2}) : A(ϕ, D^v_{k,2}) = 1 ),

and its p-value

p(λ_{π(i)}) = Pr( b(K_2, ϵ) ≤ K_2 R̂(λ_{π(i)}, D_cal,2) ).

The hypothesis in (34) is rejected if p(λ_{π(i)}) < δ. Testing proceeds until the first non-rejected hypothesis at index π(i*), yielding Λ_valid = {λ_{π(1)}, …, λ_{π(i*−1)}}. The final selection λ* ∈ Λ_valid minimizes the average set size |C_{λ*}(D_cal,2)|.

APPENDIX C: PROOF OF THEOREM 1

The reliability condition (25) follows from the validity of the p-value in (36) combined with the FWER control of fixed-sequence testing. To show that the p-value is valid, let S = |D_cal| R̂(λ, D_cal) and note that S ∼ b(|D_cal|, R(λ)), since the calibration pairs are i.i.d. For any α ∈ [0,1], because the binomial CDF F_ϵ(·) is nondecreasing, the event {F_ϵ(S) ≤ α} implies S ≤ c_α, where c_α is the α-quantile index. Under the null hypothesis (34), the binomial CDF is nonincreasing in its success probability, so Pr(S ≤ c_α) ≤ F_ϵ(c_α) ≤ α.
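The fixed-sequence stage walks through the Pareto-ordered hyperparameters one by one and stops at the first failure; a sketch with illustrative p-values:

```python
def fixed_sequence_test(ordered_p, delta: float) -> list:
    """Fixed-sequence testing [40]: hypotheses are tested in a
    pre-specified order, each at the full level delta; testing stops
    at the first non-rejection, and every configuration before it is
    validated.  Returns the p-values of the validated configurations."""
    valid = []
    for p in ordered_p:
        if p < delta:
            valid.append(p)
        else:
            break
    return valid

# With delta = 0.05: the third hypothesis in the sequence fails, so only
# the first two configurations enter Lambda_valid, even though the fourth
# p-value would have passed on its own.
accepted = fixed_sequence_test([0.001, 0.02, 0.2, 0.003], delta=0.05)
```

Because each hypothesis is tested at the full level δ rather than a Bonferroni-corrected one, this procedure rewards a good ordering of the Pareto frontier, which is exactly what the first calibration split is used to produce.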
This confirms the inequality Pr(p(λ) ≤ α | H_λ) ≤ α. Since fixed-sequence testing controls the FWER at level δ [40], with probability at least 1 − δ no true null hypothesis is rejected, and the selected λ* satisfies the reliability condition (25).

REFERENCES

[1] B. Brik, A. Ksentini, and M. Bouaziz, “Deep learning for B5G open radio access network: Evolution, survey, case studies, and challenges,” IEEE Open Journal of the Communications Society, vol. 3, pp. 228–250, 2022.
[2] M. Polese, L. Bonati, S. D’Oro, S. Basagni, and T. Melodia, “Understanding O-RAN: Architecture, interfaces, algorithms, security, and research challenges,” IEEE Communications Surveys & Tutorials, vol. 25, no. 2, pp. 1376–1411, 2023.
[3] L. Bonati, M. Polese, S. D’Oro, S. Basagni, and T. Melodia, “Intelligence and learning in O-RAN for data-driven NextG cellular networks,” IEEE Communications Magazine, vol. 59, no. 10, pp. 21–27, 2021.
[4] A. Lacava, M. Polese, R. Sivaraj, R. Soundrarajan, B. S. Bhati, T. Singh, T. Zugno, F. Cuomo, and T. Melodia, “Programmable and customized intelligence for traffic steering in 5G networks using open RAN architectures,” IEEE Transactions on Mobile Computing, vol. 23, no. 4, pp. 2882–2897, 2024.
[5] W. Guo, “Explainable artificial intelligence for 6G: Improving trust between human and machine,” IEEE Communications Magazine, vol. 58, no. 6, pp. 39–45, 2020.
[6] B. Brik, K. Boutiba, and A. Ksentini, “Explainable AI in 6G O-RAN: A tutorial and survey on architecture, use cases, challenges, and future directions,” IEEE Communications Surveys & Tutorials, vol. 26, no. 4, pp. 2490–2520, 2024.
[7] C. Fiandrino, L. Bonati, S. D’Oro, M. Polese, T. Melodia, and J. Widmer, “EXPLORA: AI/ML explainability for the open RAN,” Proceedings of the ACM on Networking, vol. 1, no. CoNEXT3, pp. 1–26, 2023.
[8] O. Maler and D.
Nickovic, “Monitoring temporal properties of continuous signals,” in International Symposium on Formal Techniques in Real-Time and Fault-Tolerant Systems, pp. 152–166, Springer, 2004.
[9] A. Donzé and O. Maler, “Robust satisfaction of temporal logic over real-valued signals,” in Proc. FORMATS, pp. 92–106, Springer, 2010.
[10] D. Li, M. Cai, C.-I. Vasile, and R. Tron, “TLINet: Differentiable neural network temporal logic inference,” arXiv preprint arXiv:2405.06670, 2024.
[11] G. Bombara, C.-I. Vasile, F. Penber, H. Yasuoka, and C. Belta, “Offline and online learning of signal temporal logic formulae using decision trees,” ACM Transactions on Cyber-Physical Systems, vol. 5, no. 3, pp. 1–23, 2021.
[12] G. E. Fainekos and G. J. Pappas, “Robustness of temporal logic specifications for continuous-time signals,” Theoretical Computer Science, vol. 410, no. 42, pp. 4262–4291, 2009.
[13] E. Bartocci et al., “Specification-based monitoring of cyber-physical systems: A survey on theory, tools and applications,” Handbook of Runtime Verification, pp. 135–175, 2018.
[14] L. Panizo, M.-d.-M. Gallardo, F. Luque-Schempp, and P. Merino, “Runtime monitoring of 5G network slicing using STAn,” Journal of Logical and Algebraic Methods in Programming, vol. 145, p. 101059, 2025.
[15] A. Adadi and M. Berrada, “Peeking inside the black-box: A survey on explainable artificial intelligence (XAI),” IEEE Access, vol. 6, pp. 52138–52160, 2018.
[16] D. Luan and J. Thompson, “Channelformer: Attention based neural solution for wireless channel estimation and effective online training,” IEEE Transactions on Wireless Communications, vol. 22, no. 10, pp. 6562–6577, 2023.
[17] O. T. Basaran and F. Dressler, “XAI-on-RAN: Explainable, AI-native, and GPU-accelerated RAN towards 6G,” arXiv preprint arXiv:2511.17514, 2025.
[18] F. Rezazadeh, H. Chergui, and J.
Mangues-Bafalluy , “Explanation-guided deep reinforcement learning for trustworthy 6G ran slicing, ” in Proc. IEEE international confer ence on communications workshops , pp. 1026–1031, 2023. [19] O. T . Basaran and F . Dressler, “Xainomaly: Explainable and interpretable deep contractiv e autoencoder for o-ran traffic anomaly detection, ” Computer Networks , vol. 261, p. 111145, 2025. [20] L. Malakalapalli, V . Gudepu, B. Chirumamilla, S. Y adhunandan, and K. Kondepu, “Integrating explainable ai for energy efficient open radio access networks, ” in Proc. IEEE Futur e Networks W orld F orum , pp. 232–237, 2024. [21] V . V ovk, A. Gammerman, and G. Shafer, Algorithmic Learning in a Random W orld . 2005. [22] G. Shafer and V . V ovk, “ A tutorial on conformal prediction, ” Journal of Machine Learning Resear ch , vol. 9, pp. 371–421, 2008. [23] K. Cohen, S. Park, O. Simeone, and S. Shamai, “Calibrating AI models for wireless communications via conformal prediction, ” IEEE T ransactions on Machine Learning in Communications and Networking , vol. 1, pp. 296–312, 2023. [24] K. Cohen, S. Park, O. Simeone, P . Popovski, and S. Shamai, “Guaranteed dynamic scheduling of ultra-reliable low-latency traffic via conformal prediction, ” IEEE Communications Letters , vol. 27, no. 5, pp. 1473–1477, 2023. [25] O. Simeone, S. Park, and M. Zecchin, “Conformal calibration: Ensuring the reliability of black-box AI in wireless systems, ” arXiv preprint arXiv:2504.09310 , 2025. [26] J. Chen, S. Park, P . Popovski, H. V . Poor , and O. Simeone, “Neuromorphic split computing with wak e-up radios: Architec- ture and design via digital twinning, ” IEEE T ransactions on Signal Processing , vol. 72, pp. 4635–4650, 2024. [27] F . Cairoli, L. Bortolussi, J. V . Deshmukh, L. Lindemann, and N. Paoletti, “Conformal predicti ve monitoring for multi- modal scenarios, ” in Proc. International Confer ence on Runtime V erification , pp. 336–356, 2025. [28] A. N. Angelopoulos, S. Bates, E. J. Cand ` es, M. 
I. Jordan, and L. Lei, “Learn then test: Calibrating predictiv e algorithms to achiev e risk control, ” The Annals of Applied Statistics , vol. 19, no. 2, pp. 1641–1662, 2025. [29] A. A. Barakabitze, A. Ahmad, R. Mijumbi, and A. Hines, “QoE management of multimedia streaming services in future net- works: A tutorial and surve y , ” IEEE Communications Surve ys & T utorials , vol. 22, no. 1, pp. 526–565, 2020. [30] P . Charonyktakis, M. Plakia, I. Tsamardinos, and M. Pa- padopouli, “On user-centric modular QoE prediction for V oIP based on machine-learning algorithms, ” IEEE T ransactions on Mobile Computing , vol. 15, no. 6, pp. 1443–1456, 2016. [31] G. K ougioumtzidis, V . Poulko v , P . Lazaridis, and Z. Zaharis, “Deep learning-aided QoE prediction for virtual reality appli- cations over open radio access networks, ” IEEE Access , vol. 11, pp. 143514–143529, 2023. [32] B. Laufer-Goldshtein, A. Fisch, R. Barzilay , and T . Jaakkola, “Efficiently controlling multiple risks with pareto testing, ” arXiv pr eprint arXiv:2210.07913 , 2022. [33] V . Quach, A. Fisch, T . Schuster , A. Y ala, J. H. Sohn, T . S. Jaakkola, and R. Barzilay , “Conformal language modeling, ” arXiv preprint arXiv:2306.10193 , 2023. [34] A. Donz ´ e, “On signal temporal logic, ” in Pr oc. International confer ence on runtime verification , pp. 382–383, 2013. [35] O. Simeone, Machine Learning for Engineers . Cambridge Univ ersity Press, 2022. [36] M. A. Ganaie, M. Hu, A. K. Malik, M. T an veer , and P . N. Suganthan, “Ensemble deep learning: A revie w , ” Engineering Applications of Artificial Intelligence , v ol. 115, p. 105151, 2022. [37] L. V . Jospin, H. Laga, F . Boussaid, W . Buntine, and M. Ben- namoun, “Hands-on bayesian neural networks—a tutorial for deep learning users, ” IEEE Computational Intelligence Maga- zine , vol. 17, no. 2, pp. 29–48, 2022. [38] G. Bombara, C.-I. V asile, F . Penedo, H. Y asuoka, and C. 
Belta, “ A decision tree approach to data classification using signal temporal logic, ” in Proc. International Confer ence on Hybrid Systems: Computation and Contr ol , pp. 1–10, 2016. [39] C. Madsen, P . V aidyanathan, S. Sadraddini, C.-I. V asile, N. A. DeLateur , R. W eiss, D. Densmore, and C. Belta, “Metrics for signal temporal logic formulae, ” in Pr oc. IEEE Confer ence on Decision and Contr ol (CDC) , pp. 1542–1547, 2018. [40] B. L. Wiens, “ A fixed sequence bonferroni procedure for testing multiple endpoints, ” Pharmaceutical Statistics: The Journal of Applied Statistics in the Pharmaceutical Industry , vol. 2, no. 3, pp. 211–215, 2003. [41] H. Li, Q. Dong, J. Chen, H. Su, Y . Zhou, Q. Ai, Z. Y e, and Y . Liu, “LLMs-as-judges: a comprehensive surve y on llm-based ev aluation methods, ” arXiv preprint , 2024. [42] J. Gu et al. , “ A survey on LLM-as-a-judge, ” The Innovation , 2024.