Conservative Continuous-Time Treatment Optimization
We develop a conservative continuous-time stochastic control framework for treatment optimization from irregularly sampled patient trajectories. The unknown patient dynamics are modeled as a controlled stochastic differential equation with treatment as a continuous-time control. Naive model-based optimization can exploit model errors and propose out-of-support controls, so optimizing the estimated dynamics may not optimize the true dynamics. To limit extrapolation, we add a consistent signature-based MMD regularizer on path space that penalizes treatment plans whose induced trajectory distribution deviates from observed trajectories. The resulting objective minimizes a computable upper bound on the true cost. Experiments on benchmark datasets show improved robustness and performance compared to non-conservative baselines.
Authors: Nora Schneider, Georg Manten, Niki Kilbertus
1. Introduction

A key challenge in healthcare is finding the best treatment plan to guide patient trajectories toward a desired clinical outcome, such as selecting a chemotherapy dosing schedule that maximally reduces tumor size. Unlike a one-shot decision, treatment is inherently temporal and continuous: patient state evolves constantly, measurements are taken irregularly, and interventions are applied over time. Designing optimal plans therefore requires a model that can estimate potential future patient trajectories under candidate treatment plans, including ones never administered before. Historically, this has been approached with mechanistic continuous-time models, such as pharmacokinetic differential equations.
These often manually prescribed and hard-earned mathematical models are interpretable and continuous by design, but are often still overly simplistic and fail to capture the complex and stochastic dynamics found in heterogeneous patient settings. Flexible data-driven methods from machine learning, in particular causal inference, promise to overcome these limitations. However, most of these methods operate in discrete time with fixed decision points (and often finite actions), which does not account for the continuous and irregular nature of clinical reality.

* Equal contribution. 1 Technical University of Munich, 2 Helmholtz Munich, 3 Munich Center for Machine Learning. Correspondence to: Nora Schneider <nora.schneider@tum.de>, Georg Manten <georg.manten@tum.de>, Niki Kilbertus <niki.kilbertus@tum.de>. Preprint. March 18, 2026.

In high-stakes decisions, we often cannot perform direct experimentation due to practical or ethical considerations. Therefore, one is typically constrained to an offline setting: given observed trajectories collected under existing treatment plans, we aim to design future treatment plans that ideally perform better than historical ones. In this offline setting, when first learning the system's dynamics and then optimizing over that model, the optimization may act adversarially by finding and exploiting model errors, in particular erroneous extrapolation behavior in data-poor regions. This problem, well known in offline RL, is severely exacerbated in continuous-time treatment planning, where actions live in path space. The classical overlap (or positivity) assumption, already challenging in discrete settings, becomes practically impossible to satisfy.

In this paper, we develop a principled method for conservative optimization of continuous-time treatments from irregularly sampled observational trajectories.
We model patient dynamics as a controlled stochastic differential equation with treatment as a continuous-time control, learn this model from data, and, when optimizing treatments, conservatively penalize deviation between model-induced and observed trajectory distributions via a signature-kernel conditional MMD on path space.

2. Related Work

Continuous-Time Longitudinal Causal Inference. Motivated by these limitations, machine learning, more specifically causal inference, has proposed flexible data-driven techniques for longitudinal treatment effect estimation by extending ideas from the potential outcomes framework. Recent work includes extensions to continuous-time settings that handle irregular observations, event-driven dynamics, and time-dependent confounding, but do not offer a principled way to optimize treatment plans from finite observational data. Some methods focus on population-level outcomes, unable to condition on patient-specific history, which prevents predicting personalized responses to candidate treatment plans (Sun & Crawford, 2022; Ying, 2025; Lok, 2008; Schulam & Saria, 2017; Hızlı et al., 2023). Other approaches condition on patient history via neural ODEs or neural CDEs, but typically model treatments as categorical variables applied at discrete decision times (De Brouwer et al., 2022; Seedat et al., 2022; Hess et al., 2023; Hess & Feuerriegel, 2025). The INSITE framework (Kacprzyk et al., 2024) models continuous-valued treatments as controls in the governing differential equations for treatment effect estimation.

Although these methods are identifiable under continuous-time extensions of standard potential-outcomes assumptions in the infinite-sample limit, they rely on overlap/positivity, which is already fragile in discrete settings (D'Amour et al., 2021).
In continuous time, finite data make this issue substantially worse: covering all treatments at all times for all histories is impossible, so optimization can select plans that drive the state-treatment trajectory far outside the empirical support, where estimates may become unreliable. Overall, existing works stop at estimation and do not optimize treatments, often consider treatment as categorical and/or applied only at discrete event times, preventing them from discovering new continuous-time controls, and, when naively optimized over, may suffer from large extrapolation error due to poor overlap.

Optimal Dynamic Treatment Regimes (DTR) learn a sequence of history-dependent decision rules that map patient information at each decision point to a treatment so as to optimize expected clinical outcomes (Murphy, 2003; Robins, 2004; Chakraborty & Murphy, 2014; Li et al., 2023; Tsirtsis & Rodriguez, 2023). Most of this literature is formulated in discrete time with fixed decision points, and is often developed with sequential decision making (and sometimes adaptive data collection) in mind rather than personalized planning from a fixed observational dataset. Offline variants exist; for example, Zhou et al. (2023) propose a pessimistic Bayesian Q-learning approach for learning DTRs from observational data, and Wang et al. (2025) propose variational counterfactual intervention planning, but these remain discrete-time frameworks. As a result, existing DTR approaches do not address the irregular observation schedules typical of clinical time series, where measurements and treatment adjustments occur at non-uniform times. Xu & Zhang (2024) instead learn a joint model not only over dosage values but also dosage times, while still fundamentally considering a discrete sequence of events rather than a continuous-time control problem.
Offline Reinforcement Learning (RL) has also developed ideas around pessimism and uncertainty-aware regularization to prevent model exploitation in discrete Markov decision processes (MDPs) (Yu et al., 2020; Kidambi et al., 2020; Yu et al., 2021), also extended to truncating trajectories depending on model uncertainty (Zhang et al., 2023). Again, existing works focus on discrete settings, do not transfer to continuous time with irregular observations, and do not have to grapple with the identification of potential outcomes in our setting. Perhaps most closely related to our work is the approach of Koprulu et al. (2025), who propose a neural stochastic differential equation model to explicitly capture both aleatoric and epistemic model uncertainty. They truncate/penalize rollouts based on implicit, local model uncertainty. Unlike their approach, we do not rely directly on model uncertainty, but consistently penalize distributional mismatch between trajectories induced by a candidate treatment and the observed distribution.

Our main contributions include:
• We formalize personalized, continuous-time, offline treatment optimization from irregular observations within a stochastic control framework built on controlled SDEs.
• We derive a conservative planning objective that upper bounds the true cost via conditional integral probability metrics on path space.
• We develop a practically tractable and consistent implementation of this regularizer via a signature-kernel conditional MMD penalizing out-of-support treatment plans to avoid model exploitation.
• We demonstrate empirically that our method outperforms non-conservative baselines in performance and reliability.

3. Problem Setting

We consider the problem of finding optimal treatment plans that guide patient trajectories toward desirable outcomes from a causal perspective.

Setup.
Throughout this paper, let $I = [t_s, t_f] \subset \mathbb{R}$ be a compact time interval and $(\Omega, \mathcal{F}, \mathbb{F}, P)$ a complete filtered probability space with filtration $\mathbb{F} = (\mathcal{F}_t)_{t \in I}$. On this space, we consider stochastic processes $U : I \times \Omega \to A \subseteq \mathbb{R}^{d_u}$, called the treatment process (e.g., chemotherapy), $Y : I \times \Omega \to \mathbb{R}^{d_y}$, the outcome process (e.g., tumor volume), and $Z : I \times \Omega \to \mathbb{R}^{d_z}$, the time-varying covariates (e.g., biomarkers), compactly denoted by the combined process $X = (Z, Y) : I \times \Omega \to \mathbb{R}^{d_x}$, where $d_x = d_y + d_z$, called the state process. For a fixed treatment plan $u : I \to A \subseteq \mathbb{R}^{d_u}$, $X^{(u,\mathrm{pot})} = (X^{(u,\mathrm{pot})}_t)_{t \in I}$ is the potential state process under $u$, i.e., the trajectory that would be observed if the treatment were externally set to follow $u$ over the entire interval $I$. For a stochastic process, we write $X_{[t_1,t_2]}$ for the restriction of $X$ to $[t_1, t_2]$, viewed as a path-valued random variable (RV) $\Omega \to (\mathbb{R}^{d_x})^{[t_1,t_2]}$, denote its law by $P_{X_{[t_1,t_2]}} = P \circ X_{[t_1,t_2]}^{-1} \in \mathcal{M}_1((\mathbb{R}^{d_x})^{[t_1,t_2]})$, and write $X = \{X_t\}_{t \in I}$ for the associated family of time-indexed RVs.¹ In addition, the filtration of $X$ is $\mathbb{F}^X = (\mathcal{F}^X_t)_{t \in I}$ with $\mathcal{F}^X_t = \sigma(X_{t'} \mid t_s \le t' \le t)$, and we often write $P_{Y \mid X_{[t_1,t_2]}} = P_{Y \mid \mathcal{F}^X_{t_1,t_2}}$, where $\mathcal{F}^X_{t_1,t_2} = \sigma(X_t \mid t_1 \le t \le t_2)$. We use lowercase letters for sample-path realizations, e.g., $x = (X_t(\omega))_{t \in I}$ for $\omega \in \Omega$.

Data. We observe $N$ i.i.d. patient trajectories. For each patient $i$, measurements are recorded at a finite, irregular time grid $t_s \le t^{(i)}_1 < \cdots < t^{(i)}_{n_i} \le t_f$, which may differ across patients. At each time point $t^{(i)}_j$ we observe the triplet $(Z_{t^{(i)}_j}, Y_{t^{(i)}_j}, U_{t^{(i)}_j})$.
We denote the dataset by $\mathcal{D} = \{(z^{(i)}, y^{(i)}, u^{(i)})\}_{i=1}^N$, where $z^{(i)} = \{(t^{(i)}_j, z^{(i)}_j)\}_{j=1}^{n_i}$, $y^{(i)} = \{(t^{(i)}_j, y^{(i)}_j)\}_{j=1}^{n_i}$, and $u^{(i)} = \{(t^{(i)}_j, u^{(i)}_j)\}_{j=1}^{n_i}$ consist of time-value tuples.

Objective. Consider a patient of interest at decision time $t_0 \in I$. Given the available information up to $t_0$ (i.e., $\mathcal{F}_{t_0}$), we aim to find an optimal treatment plan $u^* : [t_0, t_f] \to \mathbb{R}^{d_u}$ that minimizes the following function of the potential state process $(X^{(u,\mathrm{pot})}_t)_{t \in [t_0,t_f]}$:

$$\min_{u \in \mathcal{U}_{\mathrm{adm}}} \mathbb{E}\Big[\int_{t_0}^{t_f} f\big(t, X^{(u,\mathrm{pot})}_t, u_t\big)\,\mathrm{d}t + g\big(X^{(u,\mathrm{pot})}_{t_f}\big) \,\Big|\, \mathcal{F}_{t_0}\Big] \quad (1)$$

where $f : I \times \mathbb{R}^{d_x} \times A \to \mathbb{R}$ and $g : \mathbb{R}^{d_x} \to \mathbb{R}$ are running and terminal cost functions, respectively, and $\mathcal{U}_{\mathrm{adm}}$ is the class of admissible treatment plans, which we specify later.

4. Methodology

Our objective in Equation (1) poses two key challenges: (a) Potential state trajectories $X^{(u,\mathrm{pot})}$ are challenging to estimate from observational data, because they model hypothetical scenarios that cannot be observed simultaneously (Imbens & Rubin, 2015). We address this with a structural dynamics model approximated from observational data; we then show which assumptions on the treatment process warrant identification of potential state trajectories (Section 4.1). (b) Optimizing over treatment plans from observational data is challenging because the data consist of historical treatment plans only. When evaluating new treatment plans, any learned model for the potential state process must extrapolate, and predictions may be inaccurate due to model error. When optimized over, these errors can be exploited, resulting in suboptimal treatment plans. We address this by proposing an augmented objective that regularizes deviation from the observed trajectory distribution, resulting in a conservative upper bound on the performance of the selected treatment plan (Section 4.2).
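Before turning to the model, the irregularly sampled dataset $\mathcal{D}$ of Section 3 can be sketched as a simple in-memory structure. This is an illustrative sketch only: the field names, dimensions, and random simulation are hypothetical, not the paper's implementation.

```python
import numpy as np

# Hypothetical container for one patient record: each field is stored
# on the patient's own irregular time grid as (time, value) samples.
def make_patient(rng, t_s=0.0, t_f=60.0, d_z=3, d_y=1, d_u=2, n_max=20):
    n = int(rng.integers(5, n_max + 1))            # number of visits n_i
    t = np.sort(rng.uniform(t_s, t_f, size=n))     # irregular grid t_1 < ... < t_n
    return {
        "t": t,
        "z": rng.normal(size=(n, d_z)),            # covariates  Z_{t_j}
        "y": rng.normal(size=(n, d_y)),            # outcomes    Y_{t_j}
        "u": rng.uniform(size=(n, d_u)),           # treatments  U_{t_j}
    }

rng = np.random.default_rng(0)
D = [make_patient(rng) for _ in range(100)]        # N i.i.d. patients
# Grids differ across patients, matching the irregular-sampling setting.
assert len({len(p["t"]) for p in D}) > 1
```

The key point is that nothing is aligned to a shared grid; every downstream component must consume per-patient time stamps.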
¹ $\mathcal{M}_1(\cdot)$ is the set of all probability measures on $\cdot$, and $(\mathbb{R}^{d_x})^{[t_1,t_2]}$ the set of maps from $[t_1, t_2]$ to $\mathbb{R}^{d_x}$.

4.1. Structural Dynamics Model for Identifiability

We model the patient state as the solution of a controlled stochastic dynamical system, with treatment as the control input. This structural model gives mathematical meaning to potential state trajectories for deterministic plans and, with standard assumptions on the treatment process, connects these potential outcomes to the observational data distribution. Technical details are deferred to Appendix B.

Assumption 4.1 (Structural Dynamics Model). The observed state process $X = (X_t)_{t \in I}$ evolves under the (possibly stochastic) treatment process $U = (U_t)_{t \in I}$ as

$$\mathrm{d}X_t = \mu(X_t, U_t)\,\mathrm{d}t + \sigma(X_t, U_t)\,\mathrm{d}W_t, \quad (2)$$

where $X_0 \sim \mu_0 \in \mathcal{M}_1(\mathbb{R}^{d_x})$, $W = (W_t)_{t \in I}$ is a standard $d_w$-dimensional Wiener process, and $\mu$ and $\sigma$ are the measurable drift and diffusion functions. For a deterministic plan $u : I \to A$, we define $X^{(u,\mathrm{pot})}$ as the solution of Equation (2) with $U_t = u_t$ for all $t \in I$ and $X^{(u,\mathrm{pot})}_0 \sim \mu_0$.

We further assume that $(\mu, \sigma)$ in Equation (2) satisfy the standard regularity conditions (i.e., global Lipschitz and linear growth assumptions) ensuring existence and pathwise uniqueness of a strong solution to the above controlled SDE. Details can be found in Appendix B. With the structural dynamics model, the observed state trajectories are recovered by the potential process under the realized, observed treatment plan, meaning $X = X^{(U,\mathrm{pot})}$ a.s., as they share the same underlying stochastic dynamics $(\mu, \sigma, W)$ and only differ through the sampled plan $u$ and initialization $x_0$. With $\mathbb{F} = (\mathcal{F}_t)_{t \in I}$ the (completed, right-continuous) filtration generated by the observed processes $(X, U)$, i.e., $\mathcal{F}_t := \sigma(X_{[t_0,t]}, U_{[t_0,t]})$, we make the following two assumptions on the treatment process $U$, following Ying (2025).
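To make the controlled SDE in Equation (2) concrete, it can be simulated with a basic Euler–Maruyama scheme. The drift, diffusion, and step-shaped treatment plan below are toy stand-ins for illustration, not the paper's model.

```python
import numpy as np

def euler_maruyama(mu, sigma, u, x0, t0, tf, n_steps, rng):
    """Simulate dX_t = mu(X_t, u(t)) dt + sigma(X_t, u(t)) dW_t."""
    ts = np.linspace(t0, tf, n_steps + 1)
    dt = ts[1] - ts[0]
    x = np.asarray(x0, dtype=float)
    path = [x.copy()]
    for t in ts[:-1]:
        dw = rng.normal(scale=np.sqrt(dt), size=x.shape)  # Wiener increment
        x = x + mu(x, u(t)) * dt + sigma(x, u(t)) * dw
        path.append(x.copy())
    return ts, np.array(path)

# Toy 1-d dynamics: the state mean-reverts toward the current dose level.
mu = lambda x, u: -(x - u)
sigma = lambda x, u: 0.1 * np.ones_like(x)
dose = lambda t: np.array([1.0 if t >= 0.5 else 0.0])     # step treatment plan

ts, path = euler_maruyama(mu, sigma, dose, x0=[0.0], t0=0.0, tf=2.0,
                          n_steps=200, rng=np.random.default_rng(1))
# After the dose switches on at t = 0.5, the state drifts toward 1.
```

Swapping `dose` for another deterministic plan `u` yields a sample of the corresponding potential state process under these toy dynamics.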
Assumption 4.2 (Full conditional randomization). There exists a bounded function $\varepsilon(t, \eta) > 0$ such that $\int_{t_0}^{t_f} \varepsilon(t, \eta)\,\mathrm{d}t \to 0$ as $\eta \downarrow 0$, and for all $t \in I$, $\eta > 0$,

$$\sup_{u \in \mathcal{U}_{\mathrm{adm}}} \mathbb{E}\Big[\big\| P_{X^{(u,\mathrm{pot})}_{[t,t_f]} \mid U_{[t_0,t+\eta]}, \mathcal{F}_t} - P_{X^{(u,\mathrm{pot})}_{[t,t_f]} \mid \mathcal{F}_t} \big\|_{\mathrm{TV}}\Big] \le \varepsilon(t, \eta),$$

where $\|\cdot\|_{\mathrm{TV}}$ denotes the total variation distance on the relevant path space. We can read Assumption 4.2 as the continuous-time analogue of "no unmeasured confounding". Next, we also extend the typical overlap/positivity assumption to continuous time.

Assumption 4.3 (Positivity (Overlap)). Let $P$ denote the observational law of the joint trajectory $(X_{[t_0,t_f]}, U_{[t_0,t_f]})$ on the corresponding product path space (with its canonical $\sigma$-field). For each plan $u \in \mathcal{U}_{\mathrm{adm}}$, let $P^{(u)}$ denote the interventional law of the joint trajectory $(X^{(u,\mathrm{pot})}_{[t_0,t_f]}, u_{[t_0,t_f]})$ obtained by fixing the treatment input to $u$. Assume that for every $u \in \mathcal{U}_{\mathrm{adm}}$,

$$P^{(u)} \ll P,$$

meaning that events $B$ on path space that are impossible in the observed data ($P(B) = 0$) are also impossible under intervention ($P^{(u)}(B) = 0$).

We already argued that, even though required for rigorous identification statements, overlap is extremely fragile, if not practically impossible, in continuous-time settings. This is a key motivation for our practical, finite-sample remedy of penalizing treatment plans that have zero probability under the observed law. Finally, in contrast to Ying (2025), we do not assume that the true data-generating process is known; instead, we make the following assumption about the chosen model being well-specified.

Assumption 4.4 (Distribution).
There exists a parameter $\theta^\star$ with $(\mu_{\theta^\star} = \mu, \sigma_{\theta^\star} = \sigma)$ (the functions of our structural dynamics model) such that for every $u \in \mathcal{U}_{\mathrm{adm}}$,

$$\mathrm{d}X^{(u,\mathrm{pot})}_t = \mu_{\theta^\star}(X^{(u,\mathrm{pot})}_t, u_t)\,\mathrm{d}t + \sigma_{\theta^\star}(X^{(u,\mathrm{pot})}_t, u_t)\,\mathrm{d}W_t$$

is well-defined, and $(\mu_{\theta^\star}, \sigma_{\theta^\star})$ can be estimated consistently from the observational law $P_{X,U}$. Importantly, we restrict $\mathcal{U}_{\mathrm{adm}}$ to controls for which the induced law $P^{(u)}$ satisfies the positivity condition.

Remark 4.5 (Neural SDEs). In neural SDEs, $\mu_\theta$ and $\sigma_\theta$ are parameterized by neural networks and $\theta$ is fit from observational trajectories. For a rich enough model class and a consistent estimator, Assumption 4.4 holds. A sufficient condition that helps ensure $P^{(u)} \ll P$ is that $\sigma\sigma^\top$ is independent of $u$.

We can now leverage Ying (2025, Theorem 1) to identify the target value when considering deterministic plans as stochastic interventions assigning probability one to the path $u$:

Proposition 4.6 (Identifiability). Under Assumptions 4.2 to 4.4 and the structural dynamics Assumption 4.1 such that $X = X^{(U,\mathrm{pot})}$ a.s., the value

$$J(t_0, X^{(u,\mathrm{pot})}, u) := \mathbb{E}\big[\nu(X^{(u,\mathrm{pot})}, u) \mid \mathcal{F}_{t_0}\big]$$

is identifiable from the observational law. In particular, for

$$\nu(X^{(u,\mathrm{pot})}, u) = \int_{t_0}^{t_f} f(t, X^{(u,\mathrm{pot})}_t, u_t)\,\mathrm{d}t + g(X^{(u,\mathrm{pot})}_{t_f}),$$

the value $J(t_0, X^{(u,\mathrm{pot})}, u)$ is identified.

In the following, we denote by $X^{(t_0,x_0,u)}$ the (strong) solution of Equation (2) on $I$ under a deterministic treatment plan $u$ and initial condition $X_{t_0} = x_0$, whose existence and uniqueness are ensured by the regularity conditions in Appendix B. In particular, the induced state process of Equation (2) satisfies the flow property: for any $\tau \ge t_0$,

$$X^{(t_0,x_0,u)}_t = X^{(\tau,\, X^{(t_0,x_0,u)}_\tau,\, u)}_t \quad \text{a.s.},$$

equivalently meaning that $X$ is Markov: conditioned on the current state (and the future control values), the future evolution is independent of the history.
Thus, instead of conditioning on the full information $\mathcal{F}_{t_0}$, it suffices to condition on $(X_{t_0}, U_{t_0})$. Under Assumptions 4.1 to 4.3, we can rewrite the treatment-optimization problem in Equation (1) as a stochastic control problem over the class of admissible controls $\mathcal{U}_{\mathrm{adm}}$ and finite time horizon $[t_0, t_f]$:

$$u^* := \arg\min_{u \in \mathcal{U}_{\mathrm{adm}}} J(t_0, X^{(t_0,x_0,u)}, u), \qquad J(t_0, X^{(t_0,x_0,u)}, u) := \mathbb{E}\Big[\int_{t_0}^{t_f} f\big(t, X^{(t_0,x_0,u)}_t, u_t\big)\,\mathrm{d}t + g\big(X^{(t_0,x_0,u)}_{t_f}\big)\Big]. \quad (3)$$

We refer to Appendix B for regularity details on the set of admissible controls and the functions $f$, $g$.

4.2. Conservative Optimization of Treatment Plans

The expected cost of a candidate treatment plan in Equation (3) cannot be evaluated directly because the structural dynamics governing $X^{(t_0,x_0,u)}$ are unknown. We therefore learn an approximate dynamics model and use it to compute the expected cost of a treatment plan under the learned dynamics. This model-based objective is ultimately not the quantity of interest: the goal is to find treatment plans that minimize the expected cost under the true dynamics. Directly optimizing the model-based objective can exploit model error and find suboptimal plans under the true dynamics. We address this and derive a tractable upper bound on the true expected cost $J(t_0, X^{(t_0,x_0,u)}, u)$ of a plan $u$. We propose a consistent estimator for the upper bound and optimize treatment plans by minimizing it. To the best of our knowledge, such a conservative bound, together with a consistent estimator, has not previously been established for the continuous-time setting.

Estimated Dynamics Model. We approximate the dynamics from observational data $\mathcal{D}$ and denote by $\hat{X}^{(t_0,x_0,u)}$ the model-implied state process under treatment plan $u$, initialized at $\hat{X}_{t_0} = x_0$. This allows us to evaluate a model-based surrogate cost $J(t_0, \hat{X}^{(t_0,x_0,u)}, u)$.
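A minimal sketch of how such a surrogate cost can be estimated: Euler–Maruyama rollouts of stand-in "learned" dynamics, trapezoid-rule integration of the running cost, and Monte-Carlo averaging. The paper's neural-SDE parameterization and solver are not reproduced here; all functions below are illustrative assumptions.

```python
import numpy as np

def trapezoid(vals, ts):
    # Trapezoid-rule integral of the sampled running cost.
    return float(np.sum((vals[1:] + vals[:-1]) / 2.0 * np.diff(ts)))

def rollout(mu_hat, sigma_hat, u, x0, ts, rng):
    # Euler-Maruyama rollout of the (stand-in) learned dynamics under plan u.
    x = np.asarray(x0, float)
    xs = [x.copy()]
    for tl, tr in zip(ts[:-1], ts[1:]):
        dt = tr - tl
        dw = rng.normal(scale=np.sqrt(dt), size=x.shape)
        x = x + mu_hat(x, u(tl)) * dt + sigma_hat(x, u(tl)) * dw
        xs.append(x.copy())
    return np.array(xs)

def surrogate_cost(f, g, mu_hat, sigma_hat, u, x0, t0, tf,
                   n_steps=100, n_mc=64, rng=None):
    """Monte-Carlo estimate of the model-based surrogate cost:
    integrated running cost plus terminal cost, averaged over rollouts."""
    rng = np.random.default_rng() if rng is None else rng
    ts = np.linspace(t0, tf, n_steps + 1)
    total = 0.0
    for _ in range(n_mc):
        xs = rollout(mu_hat, sigma_hat, u, x0, ts, rng)
        run = np.array([f(t, x, u(t)) for t, x in zip(ts, xs)])
        total += trapezoid(run, ts) + g(xs[-1])
    return total / n_mc

# Deterministic sanity check (zero diffusion): dX = -X dt from x0 = 1
# gives X(1) = exp(-1), so the terminal cost g(x) = x^2 is about exp(-2).
cost = surrogate_cost(
    f=lambda t, x, u: 0.0,
    g=lambda x: float(x[0] ** 2),
    mu_hat=lambda x, u: -x,
    sigma_hat=lambda x, u: np.zeros_like(x),
    u=lambda t: 0.0,
    x0=[1.0], t0=0.0, tf=1.0, n_steps=1000,
    rng=np.random.default_rng(0),
)
```

In the stochastic case the same routine averages over diffusion noise, which is the role of the `n_mc` rollouts.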
In principle, any identification method can be used, provided the chosen model class is rich enough to contain (or closely approximate) the structural dynamics model in Equation (2). In our experiments, we instantiate the model class with a neural stochastic differential equation (neural SDE; see, e.g., Kidger (2022)). Specifically, we parameterize drift and diffusion by neural networks $\mu_\theta : \mathbb{R}^{d_x} \times A \to \mathbb{R}^{d_x}$ and $\sigma_\theta : \mathbb{R}^{d_x} \times A \to \mathbb{R}^{d_x \times d_w}$, and define $\hat{X}$ as the strong solution of

$$\mathrm{d}\hat{X}_t = \mu_\theta(\hat{X}_t, U_t)\,\mathrm{d}t + \sigma_\theta(\hat{X}_t, U_t)\,\mathrm{d}W_t, \quad (4)$$

where $W$ is a $d_w$-dimensional Brownian motion. To generate the rollout $\hat{X}^{(t_0,x_0,u)}$, we set $U_t = u_t$ and initialize the SDE at $\hat{X}_{t_0} = x_0$.

Decomposition of Model Bias. To derive a conservative bound on the objective, we first relate the true cost of a treatment plan $u$ to the estimated cost under the learned model. We decompose the error $\Delta(t_0, x_0, u) := J(t_0, X^{(t_0,x_0,u)}, u) - J(t_0, \hat{X}^{(t_0,x_0,u)}, u)$ over a temporal partition $t_0 < t_1 < \cdots < t_K = t_f$ and express it as a telescoping sum of conditional one-step discrepancies on shorter time intervals. We comment later on further motivation for the decomposition into shorter time intervals.

Lemma 4.7 (Value Error Telescope).
For an admissible control $(u : I \to \mathbb{R}^{d_u}) \in \mathcal{U}_{\mathrm{adm}}$ and initial condition $x_0$,

$$\Delta(t_0, x_0, u) = \sum_{j=0}^{K-1} \mathbb{E}_{x \sim P_{\hat{X}_{t_j}}}\big[D_j(x, u)\big],$$

with the one-step discrepancy on $[t_j, t_{j+1}]$ between true and learned dynamics given by

$$D_j(x, u) = \mathbb{E}_{\hat{x}_{[t_j,t_{j+1}]} \sim P_{\hat{X}_{[t_j,t_{j+1}]} \mid \hat{X}_{t_j} = x}}\big[h_j(\hat{x}_{[t_j,t_{j+1}]})\big] - \mathbb{E}_{x_{[t_j,t_{j+1}]} \sim P_{X_{[t_j,t_{j+1}]} \mid X_{t_j} = x}}\big[h_j(x_{[t_j,t_{j+1}]})\big],$$

where $h_j$ is the deterministic path functional

$$x_{[t_j,t_{j+1}]} \mapsto \int_{t_j}^{t_{j+1}} f(s, x_s, u_s)\,\mathrm{d}s + J(t_{j+1}, x_{t_{j+1}}, u),$$

mapping a deterministic path segment $x_{[t_j,t_{j+1}]}$ to the cumulative cost on the current subinterval $[t_j, t_{j+1}]$ plus the continuation value at $t_{j+1}$.

The lemma extends related results from discrete time (Luo et al., 2019) to the continuous-time setting; the detailed proof can be found in Appendix C.1. With Lemma 4.7 we can upper bound the true cost $J(t_0, X^{(t_0,x_0,u)}, u)$ under a candidate control $u$ via

$$J(t_0, X^{(t_0,x_0,u)}, u) \le J(t_0, \hat{X}^{(t_0,x_0,u)}, u) + \sum_{j=0}^{K-1} \mathbb{E}_{x \sim P_{\hat{X}_{t_j}}}\big[|D_j(x, u)|\big].$$

This suggests that a treatment plan with low model-based cost and small accumulated one-step discrepancies will also have low cost under the true dynamics. The difficulty is that $D_j(x, u)$ depends on the true conditional law and on the continuation value under the true dynamics. To obtain a data-support-aware bound that depends only on distributional differences between the true and learned dynamics, we upper bound each one-step discrepancy by an integral probability metric (IPM) between the corresponding conditional path laws (cf. the discrete-time argument in Yu et al. (2020)).
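Among the IPM instances enumerated below, the MMD admits a simple plug-in estimator from samples. A toy sketch with a Gaussian (RBF) kernel on vectors, rather than the path-space signature kernel used by the method (bandwidth, sample sizes, and distributions are illustrative):

```python
import numpy as np

def rbf(a, b, bw):
    # Gaussian kernel matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 bw^2)).
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bw ** 2))

def mmd2_unbiased(x, y, bw=1.0):
    """Unbiased estimate of MMD_k^2(P, Q) from samples x ~ P, y ~ Q."""
    m, n = len(x), len(y)
    kxx, kyy, kxy = rbf(x, x, bw), rbf(y, y, bw), rbf(x, y, bw)
    term_x = (kxx.sum() - np.trace(kxx)) / (m * (m - 1))  # off-diagonal mean
    term_y = (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
    return term_x + term_y - 2.0 * kxy.mean()

rng = np.random.default_rng(0)
same = mmd2_unbiased(rng.normal(size=(500, 2)), rng.normal(size=(500, 2)))
diff = mmd2_unbiased(rng.normal(size=(500, 2)),
                     rng.normal(loc=2.0, size=(500, 2)))
assert abs(same) < diff  # shifted samples give a larger discrepancy
```

Replacing the RBF kernel with a kernel on paths (e.g., a signature kernel) turns the same estimator into a discrepancy between trajectory distributions, which is the role the penalty plays in the conservative objective.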
For any function class $\mathcal{G} \subseteq L^0\big((\mathbb{R}^{d_x})^{[t_j,t_{j+1}]}, \mathbb{R}\big)$ with $h_j \in \mathcal{G}$, the one-step discrepancy satisfies

$$|D_j(y, u)| \le \mathrm{IPM}_{\mathcal{G}}\big[P_{X_{[t_j,t_{j+1}]} \mid X_{t_j} = y},\; P_{\hat{X}_{[t_j,t_{j+1}]} \mid \hat{X}_{t_j} = y}\big],$$

where $\mathrm{IPM}_{\mathcal{G}}(P, Q) := \sup_{g \in \mathcal{G}} |\mathbb{E}_P[g] - \mathbb{E}_Q[g]|$ is the integral probability metric generated by $\mathcal{G}$. Common choices for $\mathcal{G}$ in machine learning include the following (see Appendix D for details): (i) the total variation distance, induced by $\mathcal{G}_{\mathrm{TV}} = \{f \in L^0(\mathcal{X}, \mathbb{R}) \mid \sup_{x \in \mathcal{X}} |f(x)| \le 1\}$; (ii) given an RKHS $(\mathcal{H} \subseteq \mathbb{C}^{\mathcal{X}}, \langle\cdot,\cdot\rangle_{\mathcal{H}}, k : \mathcal{X} \times \mathcal{X} \to \mathbb{C})$, the IPM for $\mathcal{G}_{\mathcal{H}} = \{f \in \mathcal{H} \mid \|f\|_{\mathcal{H}} \le 1\}$, called the maximum mean discrepancy (MMD), denoted $\mathrm{MMD}_k$; (iii) given a metric space $(\mathcal{X}, d)$, the IPM for $\mathcal{G}_W = \{f \in L^0(\mathcal{X}, \mathbb{R}) \cap B(\mathcal{X}, \mathbb{R}) \mid \|f\|_L := \sup_{x \neq y} \frac{|f(x) - f(y)|}{d(x, y)} \le 1\}$, called the earth mover, Wasserstein, or $L^1$ distance.

Due to the Lipschitz regularity assumptions on the structural dynamics model, the evolution over a finite time horizon, and the boundedness of the objective functions (Appendix B), we can assume there exists a $c \in \mathbb{R}_{>0}$ such that $\frac{h_j}{c} \in \mathcal{G}$, and thus

$$|D_j(y, u)| \le c \cdot \mathrm{IPM}_{\mathcal{G}}\big[P_{X_{[t_j,t_{j+1}]} \mid X_{t_j} = y},\; P_{\hat{X}_{[t_j,t_{j+1}]} \mid \hat{X}_{t_j} = y}\big].$$

Conservative Objective. From this, we can alter our optimization objective to a conservative, "out-of-distribution aware" upper-bound objective

$$\min_{u \in \mathcal{U}_{\mathrm{adm}}} \mathbb{E}\Big[\int_{t_0}^{t_f} f\big(t, \hat{X}^{(t_0,x_0,u)}_t, u_t\big)\,\mathrm{d}t + g\big(\hat{X}^{(t_0,x_0,u)}_{t_f}\big)\Big] + \lambda \sum_{j=0}^{K-1} \mathbb{E}_{x_{t_j} \sim P_{\hat{X}_{t_j}}}\Big[\mathrm{IPM}_{\mathcal{G}}\big[P_{X_{[t_j,t_{j+1}]} \mid X_{t_j} = x_{t_j}},\; P_{\hat{X}_{[t_j,t_{j+1}]} \mid \hat{X}_{t_j} = x_{t_j}}\big]\Big] \quad (5)$$

where $\lambda > 0$ is a hyperparameter controlling the degree of conservatism.

Signature Kernel-Based Estimator.
As we cannot access the true conditional path distributions, one possibility for estimating the MMD between $P_{X \mid u, x_0}$ and $P_{\hat{X} \mid u, x_0}$ is via the maximum conditional mean discrepancy (MCMD) (Park & Muandet, 2020), through conditional mean embeddings (CMEs) in a suitable reproducing kernel Hilbert space (RKHS). The MCMD has the advantage of working purely on RKHS embeddings (e.g., avoiding density estimation) and offering a consistent (regularized) kernel-ridge-type estimator (Park et al., 2021) that is tractable purely from observational data (see Appendix D for details). Since $X$ is a path, the kernel must be defined on path space. A natural choice of kernel can be constructed via the signature transform, a feature map $S : C_p(I, \mathbb{R}^d) \to H_S(\mathbb{R}^d)$ that converts a continuous path $x \in C_p(\mathbb{R}^d)$ of bounded variation into a graded sequence of iterated integrals

$$S(x) = \big(1, S^{(1)}(x), \ldots, S^{(k)}(x), \ldots\big) \in \prod_{k=0}^{\infty} (\mathbb{R}^d)^{\otimes k}, \qquad S^{(k)}(x) = \int_{t_0 \le s_1 < \cdots < s_k \le t_1} \mathrm{d}x_{s_1} \otimes \cdots \otimes \mathrm{d}x_{s_k}$$

for a path on $[t_0, t_1]$.

The same segment-wise viewpoint also connects naturally to offline model-based RL, where long rollouts under a learned model can amplify error and motivate local control of discrepancies, for instance via adaptive rollout truncation (Koprulu et al., 2025; Zhang et al., 2023).

5. Experimental Setup

We evaluate and benchmark our treatment optimization approach on simulated, irregularly sampled longitudinal patient trajectories generated from state-of-the-art pharmacokinetic–pharmacodynamic (PK/PD) models. This is a standard approach in longitudinal treatment-effect estimation and allows us to evaluate ground-truth outcomes under optimized treatment plans.

Data. We use the lung cancer tumor growth simulator of Geng et al. (2017), which models the combined effects of chemotherapy and radiotherapy and is widely used for longitudinal treatment-effect estimation (Kacprzyk et al.
, 2024; Seedat et al., 2022; Hess et al., 2023; Hess & Feuerriegel, 2025). We randomly select between two clinically motivated treatment plans (Curran Jr et al., 2011): sequential therapy (three weeks of weekly chemotherapy followed by three weeks of radiotherapy) and concurrent therapy (biweekly joint chemo-radiotherapy). The state and control dimensions are $d_x = 2$ and $d_u = 2$ (tumor volume and drug concentration; chemotherapy and radiotherapy dosing). We simulate trajectories over a 60-day horizon and induce irregular sampling by masking 30% of measurements at random.

In addition, we adopt the Covid-19 disease progression model under dexamethasone treatment of Qian et al. (2021), an extension of Dai et al. (2021). We consider a single-intervention setting in a 14-day window, where dexamethasone is administered at one of three predefined time points with randomly sampled dosage $d \in [0, 10]$. The control is the plasma dexamethasone concentration ($d_u = 1$), and the state comprises viral load, innate immune response, adaptive immune response, and lung-tissue dexamethasone concentration ($d_x = 4$). We simulate trajectories over a 14-day horizon and induce irregular sampling by masking 30% of measurements at random.

Treatment Optimization Objectives. We consider a total of three treatment optimization objectives. In the cancer problem, we consider two. The first objective (absolute target) minimizes tumor volume at the terminal time $t_f$ while penalizing treatment intensity to discourage overly aggressive dosing. The second objective (relative target) encourages achieving a fixed relative reduction in tumor volume (70%) with respect to the initial volume, again trading off tumor control against treatment intensity.
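The exact objective definitions are deferred to Appendix E.2; a hypothetical sketch of an absolute-target-style cost illustrates the structure (terminal tumor volume plus an integrated dosing penalty). The weight `alpha` and the quadratic penalty are illustrative assumptions, not the paper's exact form.

```python
import numpy as np

def absolute_target_cost(ts, tumor_volume, doses, alpha=0.1):
    """Hypothetical absolute-target-style objective:
    g-term  = tumor volume at the terminal time,
    f-term  = alpha * integral of squared total dosing (trapezoid rule)."""
    intensity = (doses ** 2).sum(axis=-1)          # combined dosing penalty
    running = alpha * float(
        np.sum((intensity[1:] + intensity[:-1]) / 2.0 * np.diff(ts))
    )
    terminal = float(tumor_volume[-1])
    return terminal + running

ts = np.linspace(0.0, 60.0, 61)                    # 60-day horizon
volume = np.linspace(10.0, 2.0, 61)                # toy shrinking tumor
doses = np.full((61, 2), 0.5)                      # constant chemo/radio dosing
cost = absolute_target_cost(ts, volume, doses)
```

A relative-target variant would replace the terminal term with a penalty on the gap to 30% of the initial volume, following the same template.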
In the Covid-19 problem, we define a desired target trajectory that mirrors the average disease progression and optimize the treatment to track this trajectory (target trajectory).

Figure 1. Effect of increasing regularization strength λ. (Left) Improvement of the ground-truth costs for λ ∈ {1, 10, 100} relative to the ground-truth costs with no regularization (λ = 0) on three optimization problems across 15 different initial conditions (Cancer (Explicit), Cancer (Relative), Covid-19 (Target Trajectory)). Boxplots show medians and interquartile ranges (IQR); whiskers extend to the farthest point within 1.5 × IQR from the boxes. (Right) Example of predicted optimal treatment plans for λ ∈ {0, 1, 10, 100}, together with treatment plans seen during training and the ground-truth optimal treatment plan.

Table 1. Benchmarking the optimized treatment plans on all three tasks: Cancer (Explicit), Cancer (Relative), and Covid-19 (Target Trajectory).

(a) Average Spearman correlation between predicted costs and true costs of the candidate treatment plans in the control library. Higher values indicate better results.

Method          Cancer (Explicit)   Cancer (Relative)   Covid-19 (Target Traj.)
TECDE            0.29 ± 0.01        -0.30 ± 0.10         0.79 ± 0.34
INSITE/SINDy    -0.77 ± 0.00        -0.78 ± 0.00         0.43 ± 0.35
Ours (rank)      0.40 ± 0.00         0.23 ± 0.02         0.51 ± 0.24

(b) Average true costs of the optimized treatment plans. For our approach, we use a regularization strength of λ = 100.
Lower values indicate better results.

Method          Cancer (Explicit)    Cancer (Relative)   Covid-19 (Target Traj.)
TECDE           467.26 ± 726.64      280.97 ± 414.76     5.80 ± 2.55
INSITE/SINDy    559.74 ± 824.42      333.77 ± 490.53     7.59 ± 3.83
Ours (λ = 100)  242.07 ± 314.60       73.42 ± 108.34     5.61 ± 3.79

We solve each optimization problem for 15 randomly sampled initial conditions. Exact mathematical definitions of all objectives are provided in Appendix E.2.

Practical Implementation. Our proposed approach is modular, separating (i) learning a continuous-time dynamics model from observational trajectories and (ii) optimizing a conservative treatment objective under the learned dynamics. While any dynamics identification method whose model class contains the true structural dynamics could be used, we instantiate the dynamics with a neural SDE by parameterizing the drift and diffusion in Equation (2) with neural networks, and train it using an adaptation of the signature-kernel score method of Issa et al. (2023) that conditions on an initial state and control path (details in Appendix E). For treatment optimization, we restrict to a parameterized family of deterministic open-loop controls $u_\phi : [t_0, t_f] \to \mathbb{R}^{d_u}$ (see Appendix E.3). In our settings, the parameters $\phi$ describe the time points and dosage values of the treatment (chemo- and radiotherapy for cancer, and dexamethasone for Covid-19). We then compute optimized plans by gradient-based optimization over $\phi$: at each iteration we sample trajectories from the learned neural SDE under $u_\phi$, form a Monte-Carlo estimate of the (conservative) objective $J(t_0, \hat{X}^{(t_0,x_0,u_\phi)}, u_\phi)$ with numerical integration, and backpropagate through the solver, corresponding to a direct single-shooting approach in optimal control (Hoglund et al., 2023; Gros & Diehl, 2020).

Baselines. We benchmark against state-of-the-art continuous-time treatment effect estimators.
Specifically, we implement TE-CDE (Seedat et al., 2022) and INSITE (Kacprzyk et al., 2024). TE-CDE is based on neural controlled differential equations and predicts potential outcomes by conditioning on the full observed history of covariates and treatments. INSITE learns an explicit dynamical model via sparse identification of nonlinear dynamics (SINDy) (Brunton et al., 2016a). Implementation details and hyperparameter ranges are reported in Appendix E. To ensure a fair comparison, all baselines are tuned on held-out validation data by selecting hyperparameters that minimize the mean squared error of predictions.

Since the baseline methods do not trivially extend to treatment optimization, we evaluate them using a control-library protocol (Seedat et al., 2022; Wang et al., 2025). Concretely, for each optimization problem, we generate a library of 100 candidate control functions by sampling from the same control parameterization used in our optimization procedure. Each baseline then predicts the state trajectory and the corresponding objective value for every candidate control, and selects the treatment plan with the lowest predicted objective.

6. Experimental Results

Our approach finds better treatment plans compared to baselines. Table 1 (left) reports the Spearman rank correlation between predicted and ground-truth costs on a fixed library of candidate treatment plans. This measures whether a method preserves the cost-induced ordering of candidate plans (i.e., whether plans with low ground-truth cost are also assigned low predicted cost). For our method, we compute predicted costs by simulating the learned neural SDE under each candidate plan. On both cancer tasks, our approach yields the highest rank correlation, indicating a closer match to the ground-truth ordering; on the Covid-19 task, TE-CDE performs best, followed by our approach.
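A minimal sketch of this control-library protocol, including the Spearman rank metric, is shown below. The surrogate and true cost functions are toy stand-ins for a learned model and the ground-truth simulator, and the Spearman implementation assumes no ties; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_candidate(k_doses=5, horizon=60.0):
    """One candidate plan: K treatment timepoints and dosages in [0.1, 0.3]."""
    return {"times": np.sort(rng.uniform(0.0, horizon, k_doses)),
            "doses": rng.uniform(0.1, 0.3, k_doses)}

def spearman(pred, true):
    """Spearman rank correlation: Pearson correlation of ranks (no ties)."""
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    a = rank(pred) - (len(pred) - 1) / 2.0
    b = rank(true) - (len(true) - 1) / 2.0
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

def control_library_protocol(predict_cost, true_cost, n_candidates=100):
    """Score every candidate with the model, pick the lowest predicted cost,
    and report rank agreement between predicted and true costs."""
    library = [sample_candidate() for _ in range(n_candidates)]
    pred = np.array([predict_cost(c) for c in library])
    true = np.array([true_cost(c) for c in library])
    best = library[int(np.argmin(pred))]
    return best, spearman(pred, true)

# Toy stand-ins: the "model" slightly mis-weights late doses.
true_cost = lambda c: float(np.sum(c["doses"] * (1.0 + c["times"] / 60.0)))
pred_cost = lambda c: float(np.sum(c["doses"] * (1.0 + 0.8 * c["times"] / 60.0)))
```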
Note that this evaluation does not involve optimizing over controls and therefore does not probe model-exploitation effects, since the candidate library remains close to the training distribution. Table 1 (right) reports the ground-truth costs of the optimized treatment plans, evaluated under the true dynamics. Here, our method achieves the lowest average cost across all three tasks.

Figure 2. Example patient trajectories under treatment plans optimized with different conservatism levels (λ). Tumor volume over time for two optimized treatment plans (no regularization, λ = 0.0, and regularization, λ = 100.0). Shaded regions indicate variability across simulated trajectories, the bold curve denotes the mean trajectory, and the dot marks the mean final tumor volume.

Increasing regularization strength λ can improve treatment plan performance. Figure 1 investigates the effect of the conservative regularization strength λ on treatment optimization. For each test instance, we optimize the control under the learned dynamics for λ ∈ {0, 1, 10, 100} and evaluate the resulting plan using the ground-truth simulator cost J(t_0, X^{(t_0,x_0,u)}, u). We report the relative change in ground-truth cost compared to the unregularized solution (λ = 0). On both cancer tasks, increasing λ yields a consistent reduction in ground-truth cost, indicating that conservatism mitigates model exploitation during optimization. For the Covid-19 task, we do not observe a clear monotonic trend, but regularization does not degrade performance. A plausible explanation is that the optimized plans remain closer to the training distribution in this task, so distribution-shift effects are less pronounced. As expected, a large λ can eventually dominate the objective and limit further improvements.
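The evaluation protocol above reduces to a small computation once ground-truth costs are available. The sketch below assumes the Figure 1 metric is a cost ratio (regularized over unregularized, so values below 1 indicate improvement); this reading is an assumption consistent with the figure's axis, not a definition from the paper.

```python
import numpy as np

def relative_improvement(gt_cost_reg, gt_cost_unreg):
    """Assumed Figure-1-style metric: ground-truth cost of the regularized
    plan relative to the unregularized (lambda = 0) plan; < 1 is better."""
    return np.asarray(gt_cost_reg, dtype=float) / np.asarray(gt_cost_unreg, dtype=float)

def box_stats(values):
    """Median and interquartile range, as summarized by the boxplots."""
    q1, med, q3 = np.percentile(values, [25, 50, 75])
    return float(med), float(q3 - q1)
```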
The right panel of Figure 1 shows an example of the optimized treatment plan in the Covid-19 task for different values of λ; in this instance, stronger regularization moves the solution closer to the ground-truth optimum. Figure 2 shows the distribution of ground-truth outcome trajectories for a representative patient under the optimized treatment plans with λ = 0 and λ = 100. The conservative solution (λ = 100) achieves a lower mean tumor-volume trajectory and final state, mirroring the cost gains from regularization observed in Figure 1.

7. Conclusion

We introduced a conservative continuous-time framework for treatment optimization from irregularly sampled patient trajectories. We model the patient dynamics as a controlled stochastic differential equation in which the treatment is a continuous-time control input. To avoid model exploitation during treatment optimization, we derived a conservative objective that augments the model-based cost with a consistent signature-based MMD regularizer, which penalizes plans whose learned-model rollout distribution deviates from the observational trajectory distribution via a conditional MMD on path space. Experiments on pharmacokinetic-pharmacodynamic simulations show that our conservative objective can improve robustness and lead to better treatment plans under the true dynamics compared to the unregularized setting and baseline approaches.

Impact Statement

This paper develops technical methodology for conservative, continuous-time treatment optimization from irregularly sampled observational trajectories, with the goal of improving the reliability of planning under learned dynamics. If translated responsibly, such tools could support better decision-making in high-stakes settings (e.g., proposing candidate dosing plans for further expert review, simulation, and prospective study) and may also benefit other domains that require robust control from logged data.
At the same time, there are meaningful risks: observational data can reflect selection effects, unmeasured confounding, missingness, and societal or clinical biases, and optimizing over learned models can still produce misleading or harmful recommendations if deployed without rigorous validation, appropriate constraints, and human oversight. Additionally, clinical and behavioral data raise privacy and security concerns. Our work is primarily methodological and does not constitute a deployable medical device or clinical guidance; rather, it aims to advance broadly applicable machine learning foundations for reliable offline optimization, while explicitly emphasizing the importance of safety, uncertainty, and out-of-distribution robustness in any real-world use, particularly in healthcare.

Acknowledgements

We thank Junhyung Park for his time and for helpful discussions about the paper and the MCMD estimator. We thank Cristopher Salvi for answering questions about the signature scores for neural SDE training. We thank Emilio Ferrucci for answering questions about causal identifiability, stochastic control, and rough path theory, and for sharing useful pointers to the related literature. This work has been supported by the German Federal Ministry of Education and Research (Grant: 01IS24082).

References

Brunton, S. L., Proctor, J. L., and Kutz, J. N. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences, 113(15):3932-3937, 2016a.

Brunton, S. L., Proctor, J. L., and Kutz, J. N. Sparse identification of nonlinear dynamics with control (SINDYc).
IFAC-PapersOnLine, 49(18):710-715, 2016b. doi: 10.1016/j.ifacol.2016.10.249. URL https://linkinghub.elsevier.com/retrieve/pii/S2405896316318298.

Cass, T. and Salvi, C. Lecture notes on rough paths and applications to machine learning. arXiv preprint arXiv:2404.06583, 2024.

Chakraborty, B. and Murphy, S. A. Dynamic treatment regimes. Annual Review of Statistics and Its Application, 1(1):447-464, 2014.

Curran Jr., W. J., Paulus, R., Langer, C. J., Komaki, R., Lee, J. S., Hauser, S., Movsas, B., Wasserman, T., Rosenthal, S. A., Gore, E., et al. Sequential vs concurrent chemoradiation for stage III non-small cell lung cancer: randomized phase III trial RTOG 9410. Journal of the National Cancer Institute, 103(19):1452-1460, 2011.

Dai, W., Rao, R., Sher, A., Tania, N., Musante, C. J., and Allen, R. A prototype QSP model of the immune response to SARS-CoV-2 for community development. CPT: Pharmacometrics & Systems Pharmacology, 10(1):18-29, 2021.

De Brouwer, E., Gonzalez, J., and Hyland, S. Predicting the impact of treatments over time with uncertainty aware neural differential equations. In International Conference on Artificial Intelligence and Statistics, pp. 4705-4722. PMLR, 2022.

D'Amour, A., Ding, P., Feller, A., Lei, L., and Sekhon, J. Overlap in observational studies with high-dimensional covariates. Journal of Econometrics, 221(2):644-654, 2021.

Geng, C., Paganetti, H., and Grassberger, C. Prediction of treatment response for combined chemo- and radiation therapy for non-small cell lung cancer patients using a bio-mathematical model. Scientific Reports, 7(1):13542, 2017.

Gros, S. and Diehl, M. Numerical optimal control (draft), 2020.

Hess, K.
and Feuerriegel, S. Stabilized neural prediction of potential outcomes in continuous time. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=aN57tSd5Us.

Hess, K., Melnychuk, V., Frauen, D., and Feuerriegel, S. Bayesian neural controlled differential equations for treatment effect estimation. arXiv preprint arXiv:2310.17463, 2023.

Hızlı, Ç., John, S., Juuti, A. T., Saarinen, T. T., Pietiläinen, K. H., and Marttinen, P. Causal modeling of policy interventions from treatment-outcome sequences. In International Conference on Machine Learning, pp. 13050-13084. PMLR, 2023.

Hoglund, M., Ferrucci, E., Hernández, C., Gonzalez, A. M., Salvi, C., Sánchez-Betancourt, L., and Zhang, Y. A neural RDE approach for continuous-time non-Markovian stochastic control problems. arXiv preprint arXiv:2306.14258, 2023.

Imbens, G. W. and Rubin, D. B. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.

Issa, Z., Horvath, B., Lemercier, M., and Salvi, C. Non-adversarial training of neural SDEs with signature kernel scores. Advances in Neural Information Processing Systems, 36:11102-11126, 2023.

Kacprzyk, K., Holt, S., Berrevoets, J., Qian, Z., and van der Schaar, M. ODE discovery for longitudinal heterogeneous treatment effects inference. arXiv preprint arXiv:2403.10766, 2024.

Kidambi, R., Rajeswaran, A., Netrapalli, P., and Joachims, T. MOReL: Model-based offline reinforcement learning. Advances in Neural Information Processing Systems, 33:21810-21823, 2020.

Kidger, P. On neural differential equations, 2022. URL https://arxiv.org/abs/2202.02435.

Koprulu, C., Djeumou, F., and Topcu, U. Neural stochastic differential equations for uncertainty-aware offline RL. In The Thirteenth International Conference on Learning Representations, 2025.
URL https://openreview.net/forum?id=hxUMQ4fic3.

Li, Z., Chen, J., Laber, E., Liu, F., and Baumgartner, R. Optimal treatment regimes: a review and empirical comparison. International Statistical Review, 91(3):427-463, 2023.

Lok, J. J. Statistical modeling of causal effects in continuous time. The Annals of Statistics, 36(3):1464-1507, June 2008. doi: 10.1214/009053607000000820.

Luo, Y., Xu, H., Li, Y., Tian, Y., Darrell, T., and Ma, T. Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. In International Conference on Learning Representations (ICLR), 2019.

Murphy, S. A. Optimal dynamic treatment regimes. Journal of the Royal Statistical Society Series B: Statistical Methodology, 65(2):331-355, 2003.

Park, J. and Muandet, K. A measure-theoretic approach to kernel conditional mean embeddings. In Advances in Neural Information Processing Systems, volume 33, pp. 16665-16677, 2020.

Park, J., Shalit, U., Schölkopf, B., and Muandet, K. Conditional distributional treatment effect with kernel conditional mean embeddings and U-statistic regression. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 8401-8412. PMLR, July 2021. URL https://proceedings.mlr.press/v139/park21c.html.

Pham, H. Continuous-time Stochastic Control and Optimization with Financial Applications, volume 61 of Stochastic Modelling and Applied Probability. Springer, Berlin, Heidelberg, 2009. doi: 10.1007/978-3-540-89500-8.

Qian, Z., Zame, W., Fleuren, L., Elbers, P., and van der Schaar, M.
Integrating expert ODEs into neural ODEs: pharmacology and disease progression. Advances in Neural Information Processing Systems, 34:11364-11383, 2021.

Robins, J. M. Optimal structural nested models for optimal sequential decisions. In Proceedings of the Second Seattle Symposium in Biostatistics: Analysis of Correlated Data, pp. 189-326. Springer, 2004.

Salvi, C., Cass, T., Foster, J., Lyons, T., and Yang, W. The signature kernel is the solution of a Goursat PDE. SIAM Journal on Mathematics of Data Science, 3:873-899, January 2021. doi: 10.1137/20M1366794. URL https://arxiv.org/abs/2006.14794.

Schulam, P. and Saria, S. Reliable decision support using counterfactual models. Advances in Neural Information Processing Systems, 30, 2017.

Seedat, N., Imrie, F., Bellot, A., Qian, Z., and van der Schaar, M. Continuous-time modeling of counterfactual outcomes using neural controlled differential equations. arXiv preprint arXiv:2206.08311, 2022.

Sun, J. and Crawford, F. W. Causal identification for continuous-time stochastic processes. arXiv preprint arXiv:2211.15934, 2022.

Tsirtsis, S. and Rodriguez, M. Finding counterfactually optimal action sequences in continuous state spaces. Advances in Neural Information Processing Systems, 36:3220-3247, 2023.

Vanderschueren, T., Curth, A., Verbeke, W., and Van Der Schaar, M. Accounting for informative sampling when learning to forecast treatment outcomes over time. In International Conference on Machine Learning, pp. 34855-34874. PMLR, 2023.

Wang, X., Lyu, S., Luo, C., Zhou, X., and Chen, H. Variational counterfactual intervention planning to achieve target outcomes. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=ggyHPOXLGH.

Xu, Y. and Zhang, Z. Modeling and optimizing dynamic treatment regimens in continuous time. In Statistics in Precision Health: Theory, Methods and Applications, pp.
513-535. Springer, 2024.

Ying, A. Causal identification for complex functional longitudinal studies. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=96beVMeHh9.

Yu, T., Thomas, G., Yu, L., Ermon, S., Zou, J. Y., Levine, S., Finn, C., and Ma, T. MOPO: Model-based offline policy optimization. Advances in Neural Information Processing Systems, 33:14129-14142, 2020.

Yu, T., Kumar, A., Rafailov, R., Rajeswaran, A., Levine, S., and Finn, C. COMBO: Conservative offline model-based policy optimization. Advances in Neural Information Processing Systems, 34:28954-28967, 2021.

Zhang, J., Lyu, J., Ma, X., Yan, J., Yang, J., Wan, L., and Li, X. Uncertainty-driven trajectory truncation for data augmentation in offline reinforcement learning. arXiv preprint arXiv:2304.04660, 2023.

Zhou, Y., Qi, Z., Shi, C., and Li, L. Optimizing pessimism in dynamic treatment regimes: A Bayesian learning approach. In International Conference on Artificial Intelligence and Statistics, pp. 6704-6721. PMLR, 2023.

Table 2. Summary of notation

Symbol                      Meaning
I                           I := [t_s, t_f], the considered finite time horizon
N                           the number of training samples
D                           the dataset D := {(z^(i), y^(i), u^(i))}_{i in [N]}
d_z                         dimension of the stochastic covariate process Z : Ω × I → R^{d_z}
d_y                         dimension of the stochastic outcome process Y : Ω × I → R^{d_y}
d_u                         dimension of the stochastic treatment process U : Ω × I → R^{d_u}
d_x                         d_x = d_z + d_y, the dimension of X = (Z, Y)^⊤
M_1(Ω, A)                   the set of probability measures on the measurable space (Ω, A)
Y^X                         the set of all functions Funct(X, Y) = {f : X → Y}
L^0((X, A_X), (Y, A_Y))     the A_X-A_Y measurable functions f : X → Y

A. Summary of Notation

A summary of notation is provided in Table 2.
We write u for a sample of the stochastic process U.

B. On Identifiability for Continuous-Time Treatment Effects

In order to render the stochastic control problem from Equation (1) computationally tractable, we reformulated it as Equation (3). In this section we (i) give regularity assumptions on the controlled dynamics that guarantee existence and uniqueness of the resulting state trajectory, (ii) specify the assumptions for Equation (1) to be identifiable, and (iii) narrow down the class of admissible treatment processes for the control problem to be well-defined. Concretely, we model the patient state by a controlled SDE with measurable drift and diffusion functions satisfying the standard conditions (e.g., uniform Lipschitz continuity in the state, plus suitable integrability of the control) so that, for any initial condition (t_0, x_0) and any admissible control u, the SDE admits a unique strong solution X^{(t_0,x_0,u)} with finite objective.

Structural Dynamics Model. Let I = [t_s, t_f], let (Ω, F, (F_t)_{t∈I}, P) be a complete filtered probability space with the filtration (F_t)_{t∈I} satisfying the usual conditions, let W = (W_t)_{t∈I} be an R^{d_w}-valued Wiener process adapted to the filtration, let U = (U_t)_{t∈I} be the progressively measurable control process taking values in A ⊆ R^{d_u}, and let

    μ : R^{d_x} × A → R^{d_x},    σ : R^{d_x} × A → R^{d_x × d_w},    (6)

be the measurable drift and diffusion functions of the controlled SDE

    dX_t = μ(X_t, U_t) dt + σ(X_t, U_t) dW_t,    t ∈ I.    (7)

Assuming (i) a uniform Lipschitz condition in the state, uniformly over treatments, meaning there exists K ≥ 0 such that for all x, y ∈ R^{d_x} and all u ∈ A,

    ‖μ(x, u) − μ(y, u)‖ + ‖σ(x, u) − σ(y, u)‖ ≤ K ‖x − y‖,
and (ii) that the set of control processes is restricted such that

    E[ ∫_{t_0}^{t_f} ‖μ(0, u_t)‖² + ‖σ(0, u_t)‖² dt ] < ∞,

the controlled SDE in Equation (7) admits a unique strong solution for every initial condition (t_0, x_0) ∈ I × R^{d_x} and any progressively measurable control process U, which we denote by X^{(t_0,x_0,U)} = (X^{(t_0,x_0,U)}_t)_{t∈I}. For more details see Pham (2009).

The Class of Admissible Treatments. Assuming a controlled SDE (μ, σ, I) as in Equation (7) and measurable functions f : I × R^{d_x} × A → R and g : R^{d_x} → R such that (i) g is lower-bounded and (ii) there exists C > 0 with |g(x)| ≤ C(1 + |x|²) for all x ∈ R^{d_x}, the set of admissible controls for (t_0, x_0) ∈ I × R^{d_x},

    U_adm := U^f_{(t_0,x_0)} := { u | E[ ∫_{t_0}^{t_f} |f(s, X^{(t_0,x_0,u)}_s, u_s)| ds ] < ∞ } ≠ ∅,

is non-empty, making the SDE-governed control problem

    v(t_0, x_0) = arg inf_{u ∈ U_adm} J(t_0, x_0, u),    J(t_0, x_0, u) = E[ ∫_{t_0}^{t_f} f(s, X^{(t_0,x_0,u)}_s, u_s) ds + g(X^{(t_0,x_0,u)}_{t_f}) ],

well-defined, where J(t_0, x_0, u) is the cost function and v(t_0, x_0) the associated value function.

C. Details on Our Proposed Approach

C.1. Proof of the telescope lemma

In this subsection, we prove Lemma 4.7, w.l.o.g. assuming I = [0, 1] for notational convenience.

Remark C.1. Given an admissible u ∈ U, we have, by the pathwise uniqueness of the flow of the SDE for X due to the Markovian structure of the SDE,

    X^{(t_0,x_0,u)}_t = X^{(τ, X^{(t_0,x_0,u)}_τ, u)}_t,    t ≥ τ,

for any stopping time τ valued in [t_0, 1].
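The flow property in Remark C.1 can be checked numerically for an Euler-Maruyama discretization: restarting the scheme at an intermediate step from the state reached there, while reusing the same Brownian increments, reproduces the remainder of the path exactly. The drift, diffusion, and step size below are illustrative choices.

```python
import numpy as np

def em_path(x0, drift, sigma, dw, dt=0.01):
    """Euler-Maruyama solution using a *given* Brownian increment sequence dw."""
    x = [x0]
    for inc in dw:
        x.append(x[-1] + drift(x[-1]) * dt + sigma * inc)
    return np.array(x)

rng = np.random.default_rng(0)
dt, n = 0.01, 200
dw = rng.normal(0.0, np.sqrt(dt), n)       # one fixed Brownian path
drift = lambda x: -0.5 * x                 # toy Lipschitz drift

full = em_path(1.0, drift, 0.2, dw, dt)    # solve on the whole horizon
m = 80                                     # intermediate (deterministic) time
restart = em_path(full[m], drift, 0.2, dw[m:], dt)
# Pathwise uniqueness of the discretized flow: both solutions agree for t >= t_m.
```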
Hence we obtain

    J(t_0, x_0, u) = E[ ∫_{t_0}^{τ} f(s, X^{(t_0,x_0,u)}_s, u_s) ds + ∫_{τ}^{1} f(s, X^{(t_0,x_0,u)}_s, u_s) ds + g(X^{(t_0,x_0,u)}_1) ]    (8)
    (∗) = E[ E[ ∫_{t_0}^{τ} f(s, X^{(t_0,x_0,u)}_s, u_s) ds + ∫_{τ}^{1} f(s, X^{(t_0,x_0,u)}_s, u_s) ds + g(X^{(t_0,x_0,u)}_1) | F_τ ] ]    (9)
    (∗∗) = E[ ∫_{t_0}^{τ} f(s, X^{(t_0,x_0,u)}_s, u_s) ds + E[ ∫_{τ}^{1} f(s, X^{(t_0,x_0,u)}_s, u_s) ds + g(X^{(t_0,x_0,u)}_1) | F_τ ] ]    (10)

where in (∗) we used the tower property of the conditional expectation and in (∗∗) we used that τ is a stopping time, making ∫_{t_0}^{τ} f(s, X^{(t_0,x_0,u)}_s, u_s) ds F_τ-measurable. Moreover, by shifting the control and the Brownian motion by −τ in the underlying controlled SDE, we obtain

    E[ ∫_{τ}^{1} f(s, X^{(t_0,x_0,u)}_s, u_s) ds + g(X^{(t_0,x_0,u)}_1) | F_τ ] = J(τ, X^{(t_0,x_0,u)}_τ, u)    P-a.s.,

such that

    J(t_0, x_0, u) = E[ ∫_{t_0}^{τ} f(s, X^{(t_0,x_0,u)}_s, u_s) ds + J(τ, X^{(t_0,x_0,u)}_τ, u) ].

Proof. For j ∈ {0, . . . , N}, define the stochastic process X̃^{(j)} that evolves up to time t_j according to the (estimated) SDE governed by μ_θ, σ_θ and from t_j onward according to the true SDE with μ, σ, meaning X̃^{(j)} = (X̃^{(j)}_s)_{s∈I} satisfies

    X̃^{(j)}_{[0,t_j]} = solution of x_0 + ∫_0^{t_j} μ_θ(X̃^{(j)}_s, u_s) ds + ∫_0^{t_j} σ_θ(X̃^{(j)}_s) dW_s,    (11)
    X̃^{(j)}_{[t_j,1]} = solution of X̃^{(j)}_{t_j} + ∫_{t_j}^{1} μ(X̃^{(j)}_s, u_s) ds + ∫_{t_j}^{1} σ(X̃^{(j)}_s) dW_s.    (12)

Note that X̃^{(0)} := X := X^{(0,x_0,u)} and X̃^{(N)} = X̂ = X̂^{(0,x_0,u)}. For a stopping time τ, we can therefore define

    J̃^{(j)} := J(0, X̃^{(j)}, u) = E[ ∫_0^{τ} f(X̃^{(j)}_s, u) ds + J(τ, X̃^{(j)}_τ, u) ].

By a telescoping-sum argument, we can write

    J(0, X^{(0,x_0,u)}, u) − J(0, X̂^{(0,x_0,u)}, u) = J̃^{(0)} − J̃^{(N)} = Σ_{j=0}^{N−1} ( J̃^{(j)} − J̃^{(j+1)} ).
With the choice τ = t_j, we can write

    J̃^{(j)} = E[ ∫_0^{t_j} f(s, X̃^{(j)}_s, u_s) ds + E[ ∫_{t_j}^{1} f(s, X̃^{(j)}_s, u_s) ds + g(X̃^{(j)}_1) | F̃^{(j)}_{t_j} ] ]
             = E[ ∫_0^{t_j} f(s, X̃^{(j)}_s, u_s) ds + E[ ∫_{t_j}^{t_{j+1}} f(s, X̃^{(j)}_s, u_s) ds + E[ ∫_{t_{j+1}}^{1} f(s, X̃^{(j)}_s, u_s) ds + g(X̃^{(j)}_1) | F̃^{(j)}_{t_{j+1}} ] | F̃^{(j)}_{t_j} ] ],

where the innermost conditional expectation equals J(t_{j+1}, X̃^{(j)}_{t_{j+1}}, u) a.s. and F̃^{(j)}_t = σ( X̃^{(j)}_s | s ≤ t ). Similarly,

    J̃^{(j+1)} = E[ ∫_0^{t_j} f(s, X̃^{(j+1)}_s, u_s) ds + E[ ∫_{t_j}^{t_{j+1}} f(s, X̃^{(j+1)}_s, u_s) ds + E[ ∫_{t_{j+1}}^{1} f(s, X̃^{(j+1)}_s, u_s) ds + g(X̃^{(j+1)}_1) | F̃^{(j+1)}_{t_{j+1}} ] | F̃^{(j+1)}_{t_j} ] ],

where the innermost conditional expectation equals J(t_{j+1}, X̃^{(j+1)}_{t_{j+1}}, u) a.s. Then, as X̃^{(j)}_s = X̃^{(j+1)}_s for all s ≤ t_j, we have F̃^{(j)}_{t_j} = F̃^{(j+1)}_{t_j}, from which we can write

    J̃^{(j)} − J̃^{(j+1)} = E[ E[ ( ∫_{t_j}^{t_{j+1}} f(s, X̃^{(j)}_s, u_s) ds + J(t_{j+1}, X̃^{(j)}_{t_{j+1}}, u) ) − ( ∫_{t_j}^{t_{j+1}} f(s, X̃^{(j+1)}_s, u_s) ds + J(t_{j+1}, X̃^{(j+1)}_{t_{j+1}}, u) ) | F̃^{(j)}_{t_j} ] ]
    (a) = E[ ∫ ( h_j(X̃^{(j)}_{[t_j,t_{j+1}]}(ω′)) − h_j(X̃^{(j+1)}_{[t_j,t_{j+1}]}(ω′)) ) dP^{(t_j)}(ω, dω′) ]
    (b) = E_{y ∼ P_{X̃^{(j)}_{t_j}}}[ E_{P_{X̂_{[t_j,t_{j+1}]} | X̂_{t_j} = y}}[ h_j(x̂_{[t_j,t_{j+1}]}) ] − E_{P_{X_{[t_j,t_{j+1}]} | X_{t_j} = y}}[ h_j(x_{[t_j,t_{j+1}]}) ] ] =: E_{y ∼ P_{X̃^{(j)}_{t_j}}}[ D_j(y, u) ],

where in (a) we inserted the deterministic function h_j given by

    (γ : [t_j, t_{j+1}] → R^n) ↦ ∫_{t_j}^{t_{j+1}} f(s, γ_s, u_s) ds + J(t_{j+1}, γ_{t_{j+1}}, u),

and in (b) we used a path notation for the push-forward measure. Note the dependence of h_j on u_{[t_j,t_{j+1}]}, hence the subscript j. Altogether, we therefore obtain

    J − Ĵ = Σ_{j=0}^{N−1} E_{y ∼ P_{X̃^{(j)}_{t_j}}}[ E_{P_{X̂_{[t_j,t_{j+1}]} | X̂_{t_j} = y}}[ h_j(x̂_{[t_j,t_{j+1}]}) ] − E_{P_{X_{[t_j,t_{j+1}]} | X_{t_j} = y}}[ h_j(x_{[t_j,t_{j+1}]}) ] ].
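The telescoping argument above can be illustrated on a discrete toy system: interpolating processes that follow the learned model for the first j steps and the true dynamics afterwards decompose the total cost gap into stepwise differences. The dynamics and cost below are illustrative assumptions, not the paper's model.

```python
import numpy as np

def rollout(step_fns, x0, n):
    """Roll out n steps of a scalar system, using step_fns[j] at step j."""
    x, traj = x0, [x0]
    for j in range(n):
        x = step_fns[j](x)
        traj.append(x)
    return np.array(traj)

true_step = lambda x: 0.9 * x + 0.1     # ground-truth dynamics (toy)
model_step = lambda x: 0.85 * x + 0.12  # learned, slightly wrong model (toy)
cost = lambda traj: float((traj ** 2).sum())

n, x0 = 10, 1.0
# Interpolating process j: model dynamics for the first j steps, true afterwards
# (j = 0 is the all-true process X, j = n the all-model process X-hat).
J = [cost(rollout([model_step] * j + [true_step] * (n - j), x0, n))
     for j in range(n + 1)]
gap = J[0] - J[n]                        # true cost minus model cost
telescoped = sum(J[j] - J[j + 1] for j in range(n))
```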
C.2. Signature Transform and MCMD Estimator

The Signature Kernel on Path Space. A suitable tool for learning from streamed data is the signature transform, a feature map S : C_p(I, R^d) → H_S(R^d) that converts a data stream (a continuous path x ∈ C_p(R^d) of bounded variation) into a graded sequence of iterated integrals

    S(x) = ( 1, S^{(1)}(x), . . . , S^{(k)}(x), . . . ) ∈ ∏_{k=0}^{∞} (R^d)^{⊗k},    S^{(k)}(x) = ∫_{t_1 < · · · < t_k, t_i ∈ I} dx_{t_1} ⊗ · · · ⊗ dx_{t_k}.

E.4. Practical Implementation Details of Our Approach

NEURAL SDE. We model the drift and diffusion of the SDE with neural networks μ_θ : R^{d_x} × A → R^{d_x} and σ_θ : R^{d_x} × A → R^{d_x × d_w}, and use ζ_θ : R^{d_v} → R^{d_x} to model the strong solution X̂ : I → R^{d_x} of

    dX̂_t = μ_θ(X̂_t, U_t) dt + σ_θ(X̂_t, U_t) dW_t,    X̂_0 = ζ_θ(V),    V ∼ N(0, I_{d_v}),    (19)

where W : I → R^{d_w} is a d_w-dimensional Brownian motion and U = (U_t)_{t∈I} is our control. In all experiments, we parameterized the drift and diffusion so that each state has its own independent neural networks. Each drift network is a three-layer fully-connected neural network with 64 hidden units, while each diffusion network is a one-layer fully-connected neural network with 8 hidden units. We use the LipSwish activation function for intermediate layers and a tanh activation for the final layers. We optimize the neural SDE using a signature-kernel score objective (see below, based on Issa et al. (2023)). For training we use the RMSProp optimizer with a learning rate of 0.0001 and 0.001 for the Cancer and Covid datasets, respectively. We optimize over 15000 and 30000 gradient steps, respectively, where each gradient step uses a batch size of 16 trajectories.

Signature score for training neural SDEs. Here, we detail how we adapt the signature score training method of Issa et al.
(2023) to our conditional setting, as we have samples from both the treatment process and the state process. Issa et al. (2023) propose a non-adversarial training objective for neural SDEs based on the signature kernel: the signature kernel score of P ∈ M_1(X, B(T_X)) and y ∈ X is the map Φ_sig : M_1(X, B(T_X)) × X → R defined by

    Φ_sig(P, y) := E_{x,x′∼P}[ k_sig(x, x′) ] − 2 E_{x∼P}[ k_sig(x, y) ].

Φ_sig is a strictly proper score, meaning it assigns the lowest expected score when the proposed prediction is realized by the true probability distribution. Let (Ω, F, P) be a probability space, T > 0, and d_x ∈ N. For the path space

    X := { γ ∈ C_BV([0, T], R^{d_x}) | ∃ i ∈ [d] : γ_i monotone },

let T be a topology such that the signature kernel k_sig : X × X → R is continuous, and let X_true : Ω → X be a random variable with samples {x^{(i)}}_{i=1}^{N} drawn i.i.d. from P_{X_true} ∈ M_1(X, B(T_X)). In the unconditional setting (as described in Issa et al. (2023)), one can learn a neural SDE X_θ into X such that its law satisfies P_{X_θ} ≈ P_{X_true} using the training objective

    min_θ L(θ),    L(θ) := E_{y∼P_{X_true}}[ Φ_sig(P_{X_θ}, y) ],

which is equivalent to optimizing MMD²_{k_sig}[P_{X_θ}, P_{X_true}], as the remaining term E_{y,y′∼P_{X_true}}[ k_sig(y, y′) ] is constant w.r.t. θ. However, our neural SDE is in a conditional setting, since we observe both a treatment process and a state process. Given a distribution P_Z to condition on, let P_{X_true|Z}(· | z), z ∼ P_Z, be the target conditional distribution with samples {(z^{(i)}, x^{(i)})}_{i∈[N]}, z^{(i)} ∼ P_Z, x^{(i)} ∼ P_{X_true|Z}(· | z^{(i)}). To learn a neural SDE X_θ into X such that P_{X_θ}(· | z) ≈ P_{X_true}(· | z), we train with the modified objective

    min_θ L′(θ),    where L′(θ) := E_{z∼P_Z} E_{x∼P_{X_true}(·|z)}[ Φ_sig(P_{X_θ}(· | z), x) ].
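As an aside, the low-order signature features underlying k_sig can be computed exactly for piecewise-linear (interpolated) paths via Chen-style iterated sums. This is an illustrative sketch only; the paper computes k_sig itself by solving a Goursat PDE (Salvi et al., 2021) rather than truncating the signature.

```python
import numpy as np

def signature_depth2(path):
    """Depth-1 and depth-2 signature terms of a piecewise-linear path of
    shape [T, d], via iterated sums; exact for linear segments."""
    dx = np.diff(path, axis=0)                       # segment increments
    s1 = dx.sum(axis=0)                              # S^(1): total increment
    # Cumulative increment strictly before each segment:
    pre = np.vstack([np.zeros(dx.shape[1]), np.cumsum(dx, axis=0)[:-1]])
    # Cross-segment double integrals plus within-segment halves:
    s2 = pre.T @ dx + 0.5 * dx.T @ dx                # S^(2)_{ij}
    return s1, s2

# Example: the L-shaped path (0,0) -> (1,0) -> (1,1).
path = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
s1, s2 = signature_depth2(path)
```

A quick sanity check is the shuffle identity S2 + S2ᵀ = S1 ⊗ S1, which holds for any path.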
Thus, in our controlled SDE setting, we adapt the above conditional objective in the following way. Given a control signal u to condition on, the target conditional distribution is P_{X_true|U}(· | u) with samples {(u^{(i)}, x^{(i)})}_{i∈[N]}, u^{(i)} ∼ P_U, x^{(i)} ∼ P_{X_true|U}(· | u^{(i)}). To learn a neural SDE X_θ into X such that P_{X_θ}(· | u, x_0) ≈ P_{X_true}(· | u, x_0), we train with the modified objective

    min_θ L′(θ),    where L′(θ) := E_{u∼P_U, x_0∼P_{X_0}} E_{x∼P_{X_true}(·|u,x_0)}[ Φ_sig(P_{X_θ}(· | u, x_0), x) ].

In all of our experiments, we use an RBF-lifting kernel with smoothing parameter σ = 1 to compute the above loss. We set the order of dyadic refinement associated with the PDE solver for computing the signature kernel to 1. Following Issa et al. (2023), we apply a time-normalization transform and augment true and predicted paths with the normalized-time component before passing them through the loss function.

TREATMENT PLAN OPTIMIZATION.

Signature-Kernel Based Regularizer. We compute the MCMD regularizer using the respective validation dataset together with an RBF-lifting kernel with smoothing parameter σ = 1. Again, we use a dyadic refinement factor of 1 for the PDE solver, apply a time-normalization transform, and augment true and predicted paths with the normalized-time component before evaluating the paths. For the initial condition, we also use an RBF kernel with smoothing parameter σ = 1.

Optimization Procedure. We use the parameterized representation of the control functions described in Appendix E.3, which allows for direct single-shooting control optimization. We initialize the timepoint parameters ϕ_timepoints by uniformly sampling over the time interval I. The dosage parameters ϕ_dosage are initialized by uniformly sampling from [0.1, 0.3]. We use gradient-descent-based methods (the RMSProp optimizer) with a learning rate of 0.001 for both Cancer tasks and a learning rate of 0.01 for the Covid task. All tasks are optimized with 5000 gradient-update steps. Moreover, we use Monte Carlo sampling with a sample size of 10 to approximate the expectation in our objective function J.

Ours (rank). When we report the Spearman correlation (to compare with the benchmark methods), we rely on the same control library. There, we simply use our trained neural SDE to predict the costs for the respective candidate controls. We do not use regularization in these settings.

E.5. Baseline Methods

TE-CDE (Seedat et al., 2022). In our experiments, we use the same model architecture as proposed in the original paper, namely a 2-layer neural network with hidden state dimension 128 for both the neural CDE encoder and decoder. We train the decoder to predict the same time horizon as needed in our control tasks. We also change the balancing-representation loss from a cross-entropy loss to a mean-squared-error loss, as this naturally extends to our general formulation of continuous-valued treatments. We perform a grid search to select the learning rate ∈ {0.0001, 0.001, 0.01}, the number of epochs ∈ {50, 100}, and the dimension of the latent state ∈ {8, 16}. The optimal hyperparameters are determined by choosing the set that results in the lowest mean squared error on the validation dataset, for the Cancer and Covid datasets respectively.

INSITE/SINDy (Kacprzyk et al., 2024; Brunton et al., 2016a). INSITE is a framework bridging treatment effect estimation and dynamical system discovery. Following Kacprzyk et al. (2024), we instantiate the INSITE framework using SINDy (Brunton et al., 2016a).
Importantly, since we do not have patient heterogeneity in our experiments (i.e., each patient does not have their "own" dynamics parameters) and instead model patient-conditioned effects by conditioning on the observed history, we only obtain the population-level dynamics equation and do not further fine-tune it on a specific trajectory. Moreover, following Kacprzyk et al. (2024), we incorporate the continuous-valued treatment directly as a control input and use SINDy with control.

SINDy works as follows. Assuming that the target symbolic expression $f$ is a linear combination of library functions from $L \subset \mathrm{Funct}(\mathbb{R}^{d_x}, \mathbb{R}^{d_y})$ (i.e., $f = \sum_i \theta_i h_i$, or $\mathrm{span}(L) = F$), linear symbolic regression solves the multivariate, multidimensional SR problem ($F \subseteq \mathrm{Funct}(\mathbb{R}^{d_x}, \mathbb{R}^{d_y})$, $D = \{(x_i, y_i)\}_{i \in [n]}$, $x_i \in \mathbb{R}^{d_x}$, $y_i \in \mathbb{R}^{d_y}$) by:

(i) Assign each $f_j \in L$ a coefficient vector $\theta_j \in \mathbb{R}^{d_y}$, together with an intercept $\theta_0 \in \mathbb{R}^{d_y}$, such that
$$y = \sum_{j=1}^{k} f_j(x)\, \theta_j + \theta_0. \tag{A}$$

(ii) Apply (A) to all input-output pairs to obtain a system of linear equations, which can be written in matrix form as
$$\underbrace{\begin{pmatrix} y_1^\top \\ y_2^\top \\ \vdots \\ y_n^\top \end{pmatrix}}_{Y} = \underbrace{\begin{pmatrix} 1 & f_1(x_1) & f_2(x_1) & \cdots & f_k(x_1) \\ 1 & f_1(x_2) & f_2(x_2) & \cdots & f_k(x_2) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & f_1(x_n) & f_2(x_n) & \cdots & f_k(x_n) \end{pmatrix}}_{U(X)} \underbrace{\begin{pmatrix} \theta_0^\top \\ \theta_1^\top \\ \vdots \\ \theta_k^\top \end{pmatrix}}_{\Theta} \tag{B}$$
or, more compactly, as $Y = U(X) \cdot \Theta$, where $U := U(X) \in \mathbb{R}^{n \times (k+1)}$ is the matrix of library functions evaluated on the input data and $\Theta \in \mathbb{R}^{(k+1) \times d_y}$ is the (sparse) matrix of coefficients.

(iii) Solve (B) via $\Theta = (U^\top U)^\dagger\, U^\top Y$.

Remark E.1. The SINDy framework described above is extended to control inputs (SINDy with control; see Brunton et al. (2016b)) by augmenting the library with functions in $(x, u)$ and fitting the dynamics $\dot{x} = f(x, u)$.

In order to establish a fair comparison with the SINDy baseline, similar to Kacprzyk et al.
(2024), we used the STLSQ optimizer together with the polynomial library (instead of the wSINDy weak-form library) and performed hyperparameter tuning over the polynomial degree $\deg \in \{1, 2\}$, the threshold $\in \{0.1, 0.2, 0.5\}$, and the ridge regularization parameter $\lambda \in \{0.1, 0.2, 0.5\}$.

Library of Candidate Controls. The library of candidate controls consists of 100 control candidates. We randomly sample the parameters $\phi$ which parameterize the continuous-time control function, using the parameterization described in Appendix E.3 for the respective problems. We initialize the treatment timepoints $\phi_{\mathrm{timepoints}}$ by uniformly sampling over the time interval of interest $[t_0, t_f]$. The dosages are obtained by uniformly sampling over the interval $[0.1, 0.3]$ (in the min-max-normalized control space). For the Cancer tasks, we choose a maximum of $K = 5$ treatment administrations. For the Covid task, we chose $K = 1$ treatment, mirroring the fact that the target trajectory was also generated with a single treatment administration.

F. Additional Results

[Figure 3, panel (a): three plots (Cancer (Explicit), Cancer (Relative), COVID-19 (Target Traj.)) of MMD relative to $\lambda = 0$ versus regularization strength $\lambda \in \{1.0, 10.0, 100.0\}$.]
(a) MMD$_{k_{\mathrm{sig}}}$ between the true and predicted patient trajectory distributions for the optimal control obtained with different levels of regularization strength $\lambda$. This figure shows the Signature-Kernel Score.
[Figure 3, panel (b): three plots (Cancer (Explicit), Cancer (Relative), COVID-19 (Target Traj.)) of the IPM value relative to $\lambda = 0$ versus regularization strength $\lambda \in \{1.0, 10.0, 100.0\}$.]
(b) MCMD$_{k_{\mathrm{sig}}}$ estimator for the optimal control obtained with different levels of regularization strength $\lambda$.

Figure 3. Effects of the regularization strength parameter $\lambda$.
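As an illustration of how the library of candidate controls described in Appendix E.5 is generated, the following sketch samples treatment timepoints and dosages uniformly (function and variable names are ours; we assume $t_0 = 0$, $t_f = 1$ for concreteness, while the actual control parameterization is given in Appendix E.3):

```python
import numpy as np

def sample_control_library(n_candidates=100, K=5, t0=0.0, tf=1.0,
                           dose_low=0.1, dose_high=0.3, seed=0):
    """Sample a library of candidate controls: each candidate consists of K
    treatment timepoints drawn uniformly from [t0, tf] and K dosages drawn
    uniformly from [dose_low, dose_high] (min-max-normalized control space)."""
    rng = np.random.default_rng(seed)
    timepoints = np.sort(rng.uniform(t0, tf, size=(n_candidates, K)), axis=1)
    dosages = rng.uniform(dose_low, dose_high, size=(n_candidates, K))
    return timepoints, dosages

# Cancer tasks: up to K = 5 administrations; Covid task: a single K = 1
# administration, mirroring the generation of the target trajectory.
tp_cancer, dose_cancer = sample_control_library(K=5)
tp_covid, dose_covid = sample_control_library(K=1)
print(tp_cancer.shape, dose_covid.shape)  # (100, 5) (100, 1)
```

Each sampled parameter vector is then mapped to a continuous-time control function, and the trained model predicts the cost of each candidate for the Spearman-correlation comparison.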