ARTEMIS: A Neuro-Symbolic Framework for Economically Constrained Market Dynamics
Rahul D Ray
Department of Electronics and Electrical Engineering
BITS Pilani, Hyderabad Campus
f20242213@hyderabad.bits-pilani.ac.in

ABSTRACT

Deep learning models in quantitative finance often operate as black boxes, lacking interpretability and failing to incorporate fundamental economic principles such as no-arbitrage constraints. This paper introduces ARTEMIS (Arbitrage-free Representation Through Economic Models & Interpretable Symbolics), a novel neuro-symbolic framework that combines a continuous-time encoder based on a Laplace Neural Operator, a neural stochastic differential equation regularised by physics-informed losses, and a differentiable symbolic bottleneck that distils interpretable trading rules. The model enforces economic plausibility through two novel regularisation terms: a Feynman-Kac PDE residual that penalises violations of local no-arbitrage conditions, and a market-price-of-risk penalty that bounds the instantaneous Sharpe ratio to realistic values. We evaluate ARTEMIS against six strong baselines, including LSTM, Transformer, NS-Transformer, Informer, Chronos-2, and XGBoost, on four diverse datasets: Jane Street (anonymised market data) [Desai et al., 2024], Optiver (limit order book volatility prediction) [Meyer et al., 2021], Time-IMM (environmental temperature forecasting) [Chang et al., 2025], and DSLOB (a synthetic crash regime). Results demonstrate that ARTEMIS achieves state-of-the-art directional accuracy, outperforming all baselines on DSLOB (64.96%) and Time-IMM (96.0%), while remaining competitive on point accuracy metrics.
A comprehensive ablation study on DSLOB confirms that each component contributes to this directional advantage: removing the PDE loss reduces directional accuracy from 64.89% to 50.32%, and removing both physics losses collapses it to 41.77%, worse than random. The underperformance on Optiver [Meyer et al., 2021] is attributed to its long sequence length, volatility-focused target, and limited feature set, highlighting important boundary conditions. By providing interpretable, economically grounded predictions without sacrificing performance, ARTEMIS bridges the gap between deep learning's predictive power and the transparency demanded in quantitative finance, opening new avenues for trustworthy AI in high-stakes financial applications.

1 Introduction

The application of deep learning to financial time series prediction has witnessed explosive growth over the past decade, driven by the increasing availability of high-frequency market data and the remarkable success of neural networks in capturing complex temporal patterns. From return forecasting and volatility prediction to algorithmic trading and risk management, deep learning models have demonstrated superior performance compared to traditional econometric approaches such as ARIMA and GARCH [Rundo et al., 2019, Ndikum, 2020]. Recent comprehensive surveys [Chen et al., 2024, Zhang et al., 2024, Giantsidi and Tarantola, 2025] document the rapid evolution of architectures from standalone models such as Long Short-Term Memory (LSTM) networks and Convolutional Neural Networks (CNNs) to sophisticated hybrid systems that combine multiple techniques. However, despite these advances, the adoption of deep learning in high-stakes financial applications remains hampered by three fundamental challenges that the literature has consistently identified but not yet fully resolved. The first and most widely recognised challenge is the lack of interpretability.
As financial institutions operate under strict regulatory oversight, the opacity of deep learning models, often described as black boxes, creates significant legal, ethical, and operational risks [Rane et al., 2023, Hoang et al., 2026]. Stakeholders, including regulators and risk managers, require transparent explanations for model decisions, particularly when those decisions involve substantial capital allocation or have the potential to impact market stability [Mathew et al., 2025]. While explainable artificial intelligence (XAI) techniques such as SHAP and LIME have been proposed to provide post-hoc explanations [Sokolovsky et al., 2023, Basha et al., 2025], these methods offer only approximate insights and do not address the underlying opacity of the model architecture itself. Moreover, they introduce an inherent trade-off between predictive accuracy and interpretability, where simpler, more transparent models often sacrifice the very complexity that enables superior performance [Rane et al., 2023, Mathew et al., 2025]. This tension between accuracy and transparency remains a central obstacle to deploying deep learning in production trading systems. The second challenge concerns the integration of economic principles into data-driven models. Traditional financial models are built upon foundational theories such as the absence of arbitrage, which ensures that asset prices cannot be risklessly exploited for profit. Yet most deep learning approaches are trained purely on historical data, learning correlations without any regard for these economic constraints [Mashrur et al., 2020, Sahu et al., 2023]. As a consequence, they can discover spurious patterns that lead to implausible predictions, particularly when market conditions deviate from the training distribution [Suarez-Cetrulo et al., 2023, Hasan et al., 2023].
Recent work has begun to address this gap through physics-informed neural networks (PINNs), which embed governing equations into the loss function to enforce consistency with physical or financial laws. Pioneering studies have applied PINNs to option pricing models, including the Black-Scholes equation and the Heston model [Bai et al., 2022, Hainaut and Casas, 2024, Nuugulu et al., 2025], demonstrating improved accuracy and stability. Extensions have tackled jump-diffusion models with liquidity costs [Kartik and Shah, 2025] and fully nonlinear PDEs relevant to portfolio optimisation [Lefebvre et al., 2023]. However, these applications have focused primarily on solving known pricing equations rather than learning latent dynamics from data while simultaneously enforcing economic constraints. The synthesis of data-driven learning with physics-informed regularisation for forecasting tasks remains largely unexplored. The third challenge relates to the continuous-time nature of financial markets and the non-stationarity of financial time series. Traditional discrete-time models struggle to process irregularly sampled, high-frequency data without interpolation, which can distort the underlying dynamics [Han et al., 2024, Wang et al., 2025]. Moreover, market regimes shift over time, and models that perform well during stable periods often fail catastrophically during crises [Suarez-Cetrulo et al., 2023, Hasan et al., 2023]. Neural stochastic differential equations (neural SDEs) offer a promising continuous-time framework that can naturally accommodate irregular observations and capture uncertainty through stochastic dynamics [Wang et al., 2025]. Similarly, advances in handling non-stationarity, such as the Non-stationary Transformer [Zhang et al., 2024], have improved robustness to distribution shifts. Yet these approaches remain disconnected from the interpretability and economic-constraint challenges described above.
In this paper, we introduce ARTEMIS (Arbitrage-free Representation Through Economic Models & Interpretable Symbolics), a novel neuro-symbolic framework that addresses all three challenges within a unified architecture. ARTEMIS makes the following key contributions:

• Continuous-time encoding via Laplace Neural Operator. Unlike standard recurrent or transformer models that require regularly sampled inputs, our encoder operates directly on irregularly spaced observations, preserving the true temporal structure of limit order book updates and trade reports without interpolation [Wang et al., 2025, Han et al., 2024].

• Economics-informed latent dynamics through neural SDEs with physics-constrained regularisation. We introduce two novel loss terms derived from the Fundamental Theorem of Asset Pricing: a Feynman-Kac PDE residual that enforces local no-arbitrage conditions in the latent space, and a market-price-of-risk penalty that bounds the instantaneous Sharpe ratio to realistic values. While previous work has applied PINNs to pricing equations [Bai et al., 2022, Hainaut and Casas, 2024, Nuugulu et al., 2025, Kartik and Shah, 2025] and neural SDEs to continuous-time modelling [Wang et al., 2025], ours is the first to embed such economic constraints directly into the learning of latent dynamics for forecasting.

• Differentiable symbolic bottleneck for interpretability. Rather than relying on post-hoc explanations [Sokolovsky et al., 2023, Basha et al., 2025], we design a layer that distils the latent dynamics into closed-form, interpretable expressions through differentiable symbolic regression. This provides inherent transparency while maintaining end-to-end trainability, offering a novel resolution to the accuracy-interpretability trade-off [Rane et al., 2023, Mathew et al., 2025].

• Conformal prediction for uncertainty quantification.
To support risk-aware decision making, we equip ARTEMIS with an adaptive conformal prediction layer that provides rigorously calibrated prediction intervals, addressing the need for reliable uncertainty estimates in financial applications [Mashrur et al., 2020, Sahu et al., 2023].

We evaluate ARTEMIS against six strong baselines spanning recurrent architectures (LSTM) [Furizal et al., 2024], transformer variants (Transformer, NS-Transformer, Informer) [Chen et al., 2024, Zhang et al., 2024], foundation models (Chronos-2) [Giantsidi and Tarantola, 2025], and gradient boosting (XGBoost) [Praveen et al., 2025], across four diverse datasets: Jane Street's anonymised market data [Desai et al., 2024], Optiver's limit order book volatility prediction task [Meyer et al., 2021], the Time-IMM environmental temperature forecasting benchmark [Chang et al., 2025], and DSLOB, a novel synthetic dataset we introduce that simulates an amplified market crash regime. Our results demonstrate that ARTEMIS achieves state-of-the-art directional accuracy, outperforming all baselines on DSLOB (64.96%) and Time-IMM (96.0%), while remaining competitive on point accuracy metrics. A comprehensive ablation study confirms that each component contributes to this directional advantage: removing the PDE loss reduces directional accuracy from 64.89% to 50.32%, and removing both physics losses collapses it to 41.77%, worse than random. The underperformance on Optiver [Meyer et al., 2021], where ARTEMIS achieves negative RankIC (-0.0555) and directional accuracy (45.82%) below most baselines, is attributed to its long sequence length, volatility-focused target, and limited feature set, highlighting important boundary conditions for the framework and directions for future research.
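To make the two economics-informed regularisers concrete, a plausible simplified instantiation can be written down for a scalar latent state z with learned drift mu_theta and diffusion sigma_theta, a risk-free rate r, and a latent value head V_phi(t, z). The notation (mu_theta, sigma_theta, V_phi, lambda_max) is ours for illustration; the paper's exact multivariate form may differ:

```latex
% Feynman-Kac residual: local no-arbitrage for the latent value surface
\mathcal{L}_{\mathrm{PDE}}
  = \mathbb{E}\!\left[\Big(
      \partial_t V_\phi
      + \mu_\theta\,\partial_z V_\phi
      + \tfrac{1}{2}\,\sigma_\theta^{2}\,\partial_{zz} V_\phi
      - r\,V_\phi
    \Big)^{2}\right]
\qquad
% Market price of risk: bound the instantaneous Sharpe ratio by \lambda_{\max}
\mathcal{L}_{\mathrm{MPR}}
  = \mathbb{E}\!\left[\max\!\left(0,\;
      \frac{\lvert \mu_\theta - r \rvert}{\sigma_\theta} - \lambda_{\max}
    \right)^{2}\right]
```

The first term drives the latent dynamics toward satisfying the Feynman-Kac PDE (whose residual vanishes for arbitrage-free dynamics), while the second penalises drift/diffusion pairs whose implied market price of risk exceeds a realistic Sharpe-ratio cap.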
By providing interpretable, economically grounded predictions without sacrificing predictive performance, ARTEMIS bridges the gap between deep learning's power and the transparency demanded in quantitative finance. To the best of our knowledge, no existing work combines neural SDEs, physics-informed losses, and differentiable symbolic regression in a unified end-to-end framework for financial time series forecasting. The code and data will be made publicly available upon publication to facilitate reproducibility and future research.

2 Related Work

ARTEMIS draws upon and contributes to several interconnected research areas: deep learning for financial time series forecasting, interpretable machine learning and explainable AI, physics-informed neural networks and neural differential equations, and continuous-time modeling of financial dynamics. This section reviews the state of the art in each area and situates our contributions within the broader literature.

2.1 Deep Learning for Financial Time Series Forecasting

The application of deep learning to financial time series has been extensively surveyed in recent years, reflecting the rapid growth of the field [Chen et al., 2024, Zhang et al., 2024, Giantsidi and Tarantola, 2025]. Early work focused on standalone recurrent architectures, particularly Long Short-Term Memory (LSTM) networks, which became popular due to their ability to capture temporal dependencies in sequential data [Furizal et al., 2024]. Convolutional Neural Networks (CNNs) have also been widely adopted for their hierarchical feature learning capabilities, especially when combined with signal processing techniques to handle the non-linear and non-stationary nature of financial data [Praveen et al., 2025].
Hybrid models that combine multiple architectures, such as CNN-LSTM or attention-augmented recurrent networks, have shown improved performance by leveraging the strengths of different components [Chen et al., 2024, Furizal et al., 2024]. The transformer architecture has emerged as a powerful alternative to recurrent models for time series forecasting, offering parallel processing and the ability to capture long-range dependencies through self-attention mechanisms. Comprehensive surveys by Li and Law [2024], Kong et al. [2025], and Kim et al. [2025] document the rapid adoption of transformer-based models across domains, including finance. For stock market prediction, Wang et al. [2022] demonstrated that transformer models significantly outperform classic deep learning methods such as CNN, RNN, and LSTM. Subsequent work has explored specialized transformer variants for financial forecasting. The Informer addresses the quadratic complexity of standard attention through ProbSparse attention, making it particularly suitable for long-sequence prediction tasks such as volatility forecasting [Hassani et al., 2025, Bhogade and Nithya, 2024]. The Non-stationary Transformer introduces series stationarization and de-stationary attention to handle distribution shifts, a critical challenge in financial markets where regimes change over time [Gou et al., 2023, Wu, 2023]. Comparative studies have shown that these variants offer different trade-offs, with the Non-stationary Transformer achieving the highest prediction accuracy for stock market indices in some settings [Wu, 2023]. Beyond architectural innovations, recent work has explored the integration of multiple data sources and modalities. Zeng et al. [2023] proposed a combined CNN and transformer model for intraday stock price forecasting, demonstrating superior performance against statistical baselines. Yañez et al.
[2024] introduced a hybrid transformer encoder with CNN layers based on empirical mode decomposition for market index prediction. Foundation models for time series, such as Chronos, represent the latest frontier, leveraging large-scale pre-training across diverse datasets to enable zero-shot and few-shot forecasting [Liang et al., 2024, Ye et al., 2024]. These models have shown promise in financial applications, though their black-box nature and lack of domain-specific constraints remain significant limitations.

Despite these advances, the deep learning models surveyed above share a common limitation: they are trained purely on historical data without incorporating any economic principles or constraints. As noted by Mashrur et al. [2020] and Sahu et al. [2023], this can lead to models that learn spurious correlations and produce predictions that violate fundamental financial theory, particularly during market regime shifts. ARTEMIS addresses this gap by embedding no-arbitrage conditions directly into the learning process.

2.2 Interpretability and Explainable AI in Finance

The opacity of deep learning models poses significant challenges for their adoption in regulated financial applications. As Rane et al. [2023] and Hoang et al. [2026] document, the black-box nature of these models creates legal, ethical, and operational risks, including regulatory penalties and loss of stakeholder trust. Financial institutions require transparency and accountability in decision-making systems, yet most state-of-the-art models offer little insight into their reasoning [Mathew et al., 2025]. Explainable AI (XAI) has emerged as a critical research area addressing this challenge.
Post-hoc explanation methods such as SHAP (Shapley Additive Explanations) and LIME (Local Interpretable Model-agnostic Explanations) have been widely applied to financial models to provide approximate explanations for individual predictions [Sokolovsky et al., 2023, Basha et al., 2025]. However, these techniques have fundamental limitations: they offer only local approximations, can be inconsistent, and do not address the underlying opacity of the model architecture itself [Rane et al., 2023]. Moreover, they introduce an inherent trade-off between accuracy and interpretability, where simpler, more transparent models may sacrifice predictive performance, while complex black-box models that achieve state-of-the-art accuracy remain difficult to explain [Mathew et al., 2025, Hoang et al., 2026]. Alternative approaches have sought to build interpretability directly into model design. Sokolovsky et al. [2023] proposed interpretable trading patterns designed specifically for machine learning applications, demonstrating that domain-informed feature engineering can enhance understanding. However, such approaches often require manual specification of patterns and do not learn interpretable representations end-to-end. ARTEMIS addresses these limitations through its differentiable symbolic bottleneck, which distils the learned latent dynamics into closed-form, interpretable expressions. Unlike post-hoc explanation methods that approximate a black-box model, our approach provides inherent transparency by constraining a component of the network to produce human-readable formulas. This offers a novel resolution to the accuracy-interpretability trade-off, maintaining end-to-end trainability while delivering interpretable outputs.
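The exact operator library of the symbolic bottleneck is specified later in the paper; the core idea, however, can be sketched as a sparse, differentiable linear combination over a fixed basis of candidate terms (in the spirit of SINDy-style symbolic regression): the weights are trained end-to-end with the rest of the network, and near-zero weights are pruned to yield a closed-form, human-readable expression. The basis below is purely illustrative.

```python
import numpy as np

# Hypothetical two-dimensional latent basis; the paper's actual operator set differs.
BASIS = [
    ("1",        lambda z: np.ones_like(z[:, 0])),
    ("z1",       lambda z: z[:, 0]),
    ("z2",       lambda z: z[:, 1]),
    ("z1*z2",    lambda z: z[:, 0] * z[:, 1]),
    ("tanh(z1)", lambda z: np.tanh(z[:, 0])),
]

def symbolic_forward(z, w):
    """Predict via a linear combination over the basis; differentiable in w,
    so the layer can be trained end-to-end (here shown in plain NumPy)."""
    feats = np.stack([f(z) for _, f in BASIS], axis=1)  # shape (N, |BASIS|)
    return feats @ w

def readable_formula(w, tol=1e-3):
    """Distil trained weights into a closed-form expression by dropping
    terms whose coefficients are below a pruning threshold."""
    terms = [f"{wi:+.3f}*{name}" for (name, _), wi in zip(BASIS, w) if abs(wi) > tol]
    return " ".join(terms) if terms else "0"
```

With, say, w = [0.5, 1.0, 0, 0, 0], `readable_formula` yields the transparent rule "+0.500*1 +1.000*z1", which is the kind of inspectable output the bottleneck is designed to produce.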
2.3 Physics-Informed Neural Networks and Neural Differential Equations

Physics-informed neural networks (PINNs) represent a paradigm shift in scientific machine learning, embedding governing physical laws expressed as partial differential equations (PDEs) directly into the neural network loss function [Lawal et al., 2022]. This approach ensures that model predictions remain consistent with known physics, improving generalization and reducing the risk of learning spurious patterns. In finance, PINNs have been applied primarily to option pricing problems, where the underlying PDEs are well established. Bai et al. [2022] developed an improved PINN with local adaptive activation functions to solve the Ivancevic option pricing model and the Black-Scholes equation. Hainaut and Casas [2024] applied physics-inspired neural networks to the Heston stochastic volatility model, using the Feynman-Kac PDE as the driving principle. Nuugulu et al. [2025] extended this approach to time-fractional Black-Scholes equations, demonstrating the efficiency and accuracy of PINNs for derivative pricing. More recent work has addressed more complex settings. Kartik and Shah [2025] developed a PINN framework for option pricing and hedging under a Merton-type jump-diffusion model with liquidity costs, encoding the partial integro-differential equation into the loss function. Lefebvre et al. [2023] proposed differential learning methods for solving fully nonlinear PDEs with applications to portfolio selection. The Deep Galerkin Method [Al-Aradi et al., 2018] and related approaches have been applied to high-dimensional PDEs arising in quantitative finance, including optimal execution and systemic risk models. Parallel to these developments, neural differential equations have emerged as a powerful framework for continuous-time modeling.
Neural ordinary differential equations (NODEs) parameterize the derivative of a hidden state with a neural network, enabling flexible modeling of dynamical systems. Neural stochastic differential equations (NSDEs) extend this to stochastic settings, capturing uncertainty through diffusion terms. Comprehensive surveys by Oh et al. [2025a,b] review the mathematical foundations, numerical methods, and applications of neural differential equations for time series analysis, emphasizing their capacity to handle irregular sampling and model continuous-time dynamics. In finance, Wang et al. [2025] introduced FinanceODE, a neural ODE-based framework for continuous-time asset price modeling, demonstrating superior predictive accuracy compared to traditional discrete-time models.

Despite these advances, existing work on PINNs in finance has focused primarily on solving known pricing equations rather than learning latent dynamics from data for forecasting tasks. Similarly, neural differential equation approaches have not incorporated economic constraints such as no-arbitrage conditions. ARTEMIS bridges this gap by combining a neural SDE for latent dynamics with physics-informed losses derived from the Fundamental Theorem of Asset Pricing, enabling data-driven learning while enforcing economic plausibility.

2.4 Continuous-Time Modeling and Non-Stationarity

Financial time series are inherently continuous-time processes, yet most forecasting models operate on discretely sampled data. Traditional econometric approaches such as ARIMA and GARCH models have been widely used for volatility forecasting and return prediction [Engle, Nelson, 1991].
GARCH models and their multivariate extensions, including Dynamic Conditional Correlation (DCC) and Generalized Orthogonal GARCH (GO-GARCH), remain popular for modeling time-varying volatility and correlations [Jeribi and Ghorbel, 2022, Ibrahim, 2017]. However, these models assume regular sampling and struggle with the irregular, high-frequency data that characterizes modern financial markets [Han et al., 2024]. Hybrid approaches combining ARIMA with GARCH have been proposed to capture both linear and non-linear patterns [Rubio et al., 2023, Mani and Thoppan, 2023], while fractionally integrated models (ARFIMA-GARCH) address long-memory properties [Chang et al., 2022]. The limitations of discrete-time models have motivated the development of continuous-time approaches. Neural differential equations offer a natural framework for modeling irregularly sampled time series, as they can be evaluated at arbitrary time points [Oh et al., 2025a,b]. This capability is particularly valuable for limit order book data, where events arrive at microsecond granularity and regular resampling can distort dynamics [Wang et al., 2025]. Non-stationarity presents another fundamental challenge. As Suarez-Cetrulo et al. [2023] document in their systematic review, conventional machine learning approaches often fail to adapt to changes in the price-generation process during market regime shifts. Hasan et al. [2023] observe that models relying solely on market-based indicators work well in stable conditions but fail during economic crises, reducing their long-term predictive reliability. The Non-stationary Transformer addresses this by explicitly modeling distribution shifts through series stationarization and de-stationary attention, while GARCH-based approaches model time-varying volatility [Han et al., 2024]. However, these methods do not incorporate economic theory about what constitutes a plausible regime change.
ARTEMIS addresses both continuous-time modeling and non-stationarity through its latent SDE formulation, which naturally accommodates irregular sampling and captures stochastic volatility through the learned diffusion term. The physics-informed losses further ensure that the learned dynamics remain economically plausible across different regimes, providing a principled approach to handling non-stationarity.

2.5 Uncertainty Quantification and Conformal Prediction

Reliable uncertainty quantification is essential for risk management in financial applications. Traditional approaches have relied on parametric methods such as GARCH for volatility forecasting [Engle]. However, these methods make strong distributional assumptions that may not hold in practice. Conformal prediction offers a distribution-free framework for constructing prediction intervals with finite-sample coverage guarantees, requiring only exchangeability of the data. As surveyed by Zhou et al. [2025], conformal prediction has been extended to time series settings through adaptive methods that update quantile estimates over time, addressing the violation of exchangeability in non-stationary data. ARTEMIS incorporates an adaptive conformal prediction layer to provide calibrated uncertainty intervals for its forecasts, enabling risk-aware portfolio construction.

The literature reviewed above reveals a clear gap: while significant advances have been made in deep learning architectures for financial time series, interpretability techniques, physics-informed neural networks, and continuous-time modeling, no existing framework integrates these innovations in a unified manner. Deep learning models achieve state-of-the-art predictive accuracy but operate as black boxes and ignore economic principles [Chen et al., 2024, Zhang et al., 2024].
XAI approaches provide post-hoc explanations but do not address underlying model opacity and introduce accuracy-interpretability trade-offs [Rane et al., 2023, Hoang et al., 2026]. PINNs embed physical laws but have been applied primarily to solving known pricing equations rather than learning latent dynamics for forecasting [Bai et al., 2022, Hainaut and Casas, 2024]. Neural differential equations offer continuous-time modeling but have not incorporated economic constraints [Oh et al., 2025a, Wang et al., 2025]. ARTEMIS addresses this gap by synthesizing these research directions into a single neuro-symbolic framework. To our knowledge, it is the first work to combine (1) a continuous-time encoder for irregularly sampled data, (2) a neural SDE for latent dynamics regularised by (3) physics-informed losses enforcing no-arbitrage conditions and (4) a market-price-of-risk penalty, (5) a differentiable symbolic bottleneck for interpretability, and (6) conformal prediction for uncertainty quantification. The comprehensive evaluation against six strong baselines across four diverse datasets demonstrates that this synthesis delivers tangible benefits, particularly in directional accuracy, without sacrificing interpretability.

3 Data Preprocessing for Benchmarking ARTEMIS

To rigorously evaluate the ARTEMIS model against a suite of state-of-the-art baselines, we assembled four distinct datasets spanning different financial and non-financial domains: Jane Street's anonymised market data [Desai et al., 2024], Optiver's high-frequency limit order book data [Meyer et al., 2021], the EPA-Air time series from the Time-IMM collection [Chang et al., 2025], and a proprietary Deep Synthetic Limit Order Book dataset (DSLOB). Each dataset required careful, domain-specific preprocessing to transform raw observations into a unified format suitable for sequence models.
The goal was to create training, validation, and test splits that respect temporal ordering, handle missing values appropriately, and preserve the underlying dynamics of each task. All neural models (LSTM, Transformer, NS-Transformer, Informer, ARTEMIS) share the same preprocessed windows to ensure a fair comparison under identical input conditions. The following sections describe the preprocessing pipeline for each dataset in detail, emphasising the rationale behind every step.

3.1 Jane Street Market Prediction [Desai et al., 2024]

The Jane Street dataset [Desai et al., 2024] originates from a Kaggle competition and consists of anonymised market data partitioned into files for training, validation, and testing. Each file contains rows indexed by date identifier, time identifier, and symbol identifier. The raw features are 79 numerical columns that may contain missing values. The target variable is a continuous response that the competition asked participants to predict. Additionally, a weight column is provided for use in the official evaluation metric (weighted R²). The data is already split temporally by date, with early dates in the training split, intermediate dates in validation, and later dates in the test split, a setup that faithfully simulates a backtesting environment. Preprocessing for sequence models begins with the construction of sliding windows. Because the data is streamed from disk (the full dataset exceeds available memory), we implemented a custom iterator that reads one partition at a time. Within each partition, rows are grouped by symbol and date, then sorted by time to ensure chronological order. For each group, we slide a window of length 20 (the chosen lookback horizon) and extract the next observation's target value. This yields an input tensor of shape (20, 79) and a scalar target.
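The grouping and windowing logic can be sketched as follows. This in-memory version omits the partition-streaming iterator; the column names (symbol_id, date_id, time_id, responder_6) follow the Kaggle schema but are illustrative here:

```python
import numpy as np
import pandas as pd

LOOKBACK = 20  # lookback horizon used throughout the paper

def make_windows(df, feature_cols, target_col="responder_6"):
    """Slide a LOOKBACK-step window within each (symbol, date) group,
    sorted by time, and take the next observation's target as the label."""
    xs, ys = [], []
    for _, g in df.groupby(["symbol_id", "date_id"], sort=False):
        g = g.sort_values("time_id")          # enforce chronological order
        feats = g[feature_cols].to_numpy()
        targs = g[target_col].to_numpy()
        for i in range(len(g) - LOOKBACK):
            xs.append(feats[i : i + LOOKBACK])  # input tensor, shape (20, n_feat)
            ys.append(targs[i + LOOKBACK])      # next observation's target
    return np.stack(xs), np.array(ys)
```

Because windows never cross a (symbol, date) boundary, no sequence mixes different instruments or trading days.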
To handle missing values, we create a binary mask of the same shape indicating which entries were originally observed; the input tensor itself has missing values replaced with zero. This masking strategy allows any subsequent model to distinguish genuine zeros from imputed values. The same windowing logic is applied to the validation and test sets, but without restricting the number of days (the training set is limited to the first 500 days for computational efficiency). For Chronos-2, which expects a univariate time series, we extract only the target values from each group, again forming windows of length 20 to predict the next value. No mask is needed because the target series is dense after filtering out missing targets. All neural models share the same preprocessed windows, ensuring a fair comparison under identical input conditions.

3.2 Optiver Realized Volatility [Meyer et al., 2021]

The Optiver dataset challenges participants to predict the realized volatility of 112 stocks over 10-minute windows, based on high-frequency limit order book snapshots and trade reports. The raw data consists of order book updates and trade executions. Each order book record contains a window identifier, a timestamp offset from the start of the window, bid and ask prices for two levels, and corresponding sizes. Trade records contain similar identifiers plus the trade price, size, and order count. The target is the realized volatility computed over the window, a continuous positive value. Preprocessing for this dataset requires fusing two asynchronous data streams into a regularly sampled sequence of length 600 (one observation per second). For each stock and time window, we first extract all order book and trade records. Order book snapshots are recorded at irregular intervals; we create a complete timeline covering all seconds from 0 to 599 and forward-fill the most recent book state to every second.
This yields a continuous representation of the limit order book. Trades, which occur at discrete seconds, are aggregated per second (average price, total size, total order count) and merged onto the same timeline. From the book data we compute derived features: mid price, bid–ask spread, log mid price, log return, volume imbalance, and total size. Combined with the trade aggregates, we obtain a feature set of 7 channels for each second. The target is the log-transformed realized volatility (the original values are small positive numbers, and the log transformation makes the distribution more Gaussian and easier for MSE-based models).

Table 1: Summary of datasets used in the benchmarking study.

Dataset     | Task                              | #Feat | SeqLen | #Train  | #Val   | #Test | Notes
Jane Street | Regression                        | 79    | 20     | ~7.37M  | ~4.61M | 200k  | Streamed Kaggle data; grouped by (symbol, date); masked missing values; predict next-step responder_6.
Optiver     | Volatility (log)                  | 7     | 600    | 2,298   | 766    | 766   | 10-min LOB + trades; forward-filled to 1 Hz; derived features; target log-transformed.
Time-IMM    | Temperature                       | 4     | 24     | 29,470  | 6,317  | 6,319 | Hourly EPA air quality; forward/backward fill for sparse variables; predict next-hour temperature.
DSLOB       | Realized volatility (regression)  | 85    | 20     | 24,891  | 9,891  | 4,891 | Synthetic LOB dataset based on a real crash regime; 85 features from four levels (prices, sizes, spreads, imbalances); target is next-step realized volatility (log-transformed).

For windows that have missing order book snapshots at the very beginning, forward-filling ensures every second has a valid feature vector. After constructing the full 600-second matrix for each window, we collect all windows (one per stock–time pair) and concatenate them into a single dataset. As with Jane Street Desai et al. [2024], we respect the original temporal split provided by the competition organisers.
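The forward-fill-to-1 Hz step can be sketched as below. This is a simplified NumPy version covering one book level and a subset of the derived features; the argument and function names are illustrative, not the actual pipeline.

```python
import numpy as np

def book_to_1hz(seconds, bid_px, ask_px, bid_sz, ask_sz, horizon=600):
    """Forward-fill irregular order-book snapshots onto a 1 Hz grid
    (seconds 0..horizon-1) and compute simple derived features."""
    grid = np.arange(horizon)
    # index of the most recent snapshot at or before each second
    idx = np.searchsorted(seconds, grid, side="right") - 1
    idx = np.clip(idx, 0, None)          # reuse the first snapshot before it arrives
    bid, ask = bid_px[idx], ask_px[idx]
    bsz, asz = bid_sz[idx], ask_sz[idx]
    mid = (bid + ask) / 2.0
    spread = ask - bid
    log_mid = np.log(mid)
    log_ret = np.diff(log_mid, prepend=log_mid[0])   # first return is zero
    imbalance = (bsz - asz) / (bsz + asz)
    total_size = bsz + asz
    return np.stack([mid, spread, log_mid, log_ret, imbalance, total_size], axis=1)
```

`np.searchsorted` implements the forward-fill in one vectorised pass: each grid second looks up the latest snapshot that precedes it, so sparse updates become a dense (600, n_features) matrix.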
For Chronos-2, we extract only the log-realized volatility series from each window and use it as a univariate input of length 600. For the Optiver Meyer et al. [2021] task, which involves predicting the volatility of the window itself rather than next-step prediction, Chronos-2 is used in a zero-shot manner by feeding the entire 600-step series as context and asking for a one-step forecast, then using a linear head to map the Chronos embedding to the target.

3.3 Time-IMM Chang et al. [2025] (EPA-Air)

The Time-IMM Chang et al. [2025] collection provides multivariate time series from diverse domains. For this benchmark we selected the EPA-Air domain, which contains hourly measurements of air quality for eight U.S. counties. Each county's data includes four variables: temperature, particulate matter, air quality index, and ozone concentration. Inspection reveals that only temperature is recorded hourly; the other three are sparse, with missing rates exceeding 85%. The task is to forecast the next hour's temperature using a 24-hour lookback window. This regression problem is challenging because the auxiliary variables, though sparse, may carry predictive information when available.

Preprocessing begins by loading each county's time series and adding an entity column to preserve identity. All counties are concatenated into a single dataframe. To handle the sparsity, we apply forward-fill followed by backward-fill to each feature within each entity. This propagates the last observed value forward, and any remaining leading missing values are filled with the next observed value. After this procedure, every feature has a complete sequence for all timestamps. We then construct windows of length 24 hours: for each entity, we slide a window of 24 consecutive hours and take the temperature at the next hour as the target. This yields input tensors of shape (24, 4) and scalar targets.
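The per-entity fill-and-window procedure can be sketched as below, a minimal pandas version; the column names (`timestamp`, entity column) and the helper name are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def fill_and_window(df, entity_col, feature_cols, target_col, lookback=24):
    """Per entity: forward-fill then backward-fill sparse features,
    then slide 24-hour windows predicting the next hour's target."""
    X, y = [], []
    for _, g in df.groupby(entity_col, sort=False):
        g = g.sort_values("timestamp")
        # ffill propagates the last observation; bfill handles leading gaps
        filled = g[feature_cols].ffill().bfill().to_numpy(dtype=np.float32)
        target = g[target_col].to_numpy(dtype=np.float32)
        for i in range(len(g) - lookback):
            X.append(filled[i : i + lookback])
            y.append(target[i + lookback])
    return np.stack(X), np.array(y)
```

Note that grouping by entity before filling is what prevents one county's observations from leaking into another's gaps.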
We discard any window where the target is missing (which does not occur after filling). The windows from all entities are concatenated, resulting in 29,470 training samples, 6,317 validation samples, and 6,319 test samples after a temporal 70/15/15 split applied per entity. This ensures that no future data leaks into the training set. Features are then standardised using statistics fitted on the training windows only. Because missing values have been eliminated, masks are simply all-ones. For Chronos-2, we isolate the temperature channel (the target series) and use it as a univariate input of length 24 to predict the next hour's temperature. The same scaling is applied to the Chronos inputs.

3.4 DSLOB: A Synthetic Dataset for Controlled Stress Testing

The DSLOB (Deep Synthetic Limit Order Book) dataset addresses a fundamental challenge in evaluating financial machine learning models: the scarcity of extreme events in historical data and the difficulty of isolating component contributions under real-world conditions. Real datasets like Jane Street Desai et al. [2024] and Optiver Meyer et al. [2021] are invaluable but inherently noisy, confounded, and lack sufficient examples of rare regimes such as market crashes. Moreover, the true data-generating process is unknown, making it impossible to definitively determine whether a model's performance stems from capturing genuine economic structure or overfitting to spurious correlations. DSLOB is therefore designed as a controlled synthetic environment that preserves the statistical properties of real limit order book data while introducing a known, amplified crash regime where ground truth is fully accessible.
The foundation of DSLOB is a real high-frequency limit order book dataset from which we extract 85 features capturing level-specific prices, sizes, spreads, mid-price calculations, volume imbalances, and microstructural metrics. To isolate the crash regime, we apply CUSUM and Bayesian change-point detection to identify a contiguous window of approximately one week where prices dropped rapidly and volatility spiked. This window serves as the crash template.

The synthetic mid-price is generated by amplifying a Vasicek-type stochastic differential equation (SDE) fitted to the crash window:

$$dP_t = \theta(\mu - P_t)\,dt + \sigma\,dW_t,$$

with parameters $\theta, \mu, \sigma$ estimated via maximum likelihood. To create an even more challenging crash, we scale the mean-reversion speed by 1.5 and the long-term mean by 1.2, producing a steeper and more persistent downward trend.

Volatility dynamics are modeled using a GARCH(1,1) process fitted to the 1-minute log-returns of the mid-price during the crash window:

$$\sigma_t^2 = \omega + \alpha \epsilon_{t-1}^2 + \beta \sigma_{t-1}^2, \qquad \epsilon_t \sim \mathcal{N}(0, 1).$$

We increase the persistence and shock magnitude by setting $\beta' = \min(0.95,\ 1.1\beta)$ and $\alpha' = 1.2\alpha$. Synthetic returns are then generated as $r_t = \sigma'_t \epsilon_t$, ensuring volatility clustering and leverage effects characteristic of real crashes.

The remaining 83 features (prices, sizes, spreads, etc.) are generated by adding correlated noise to the seed data while preserving the multivariate dependence structure. For each feature $f_i$ at time $t$:

$$f_i^{\mathrm{synth}}(t) = f_i^{\mathrm{seed}}(t) + \eta_i(t), \qquad \eta(t) \sim \mathcal{N}(0, \Sigma),$$

where $\Sigma$ is the covariance matrix of residuals from a vector autoregressive model of order 1 (VAR(1)) fitted to the seed features during the crash window. The noise is scaled so that the signal-to-noise ratio matches that of the seed data.
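Under the definitions above, the amplified mid-price generator can be sketched as follows. Merging the Vasicek drift and the GARCH shock into a single recursion is a simplification of the two-stage pipeline described in the text, and all parameter values in the usage example are placeholders, not fitted estimates.

```python
import numpy as np

def synth_crash(T, theta, mu, sigma, omega, alpha, beta, p0, dt=1.0, seed=0):
    """Sketch of a crash-regime mid-price path: an amplified Vasicek
    drift with GARCH(1,1)-modulated shocks. Amplification follows the
    text: theta' = 1.5*theta, mu' = 1.2*mu, alpha' = 1.2*alpha,
    beta' = min(0.95, 1.1*beta)."""
    rng = np.random.default_rng(seed)
    theta, mu = 1.5 * theta, 1.2 * mu
    alpha, beta = 1.2 * alpha, min(0.95, 1.1 * beta)
    prices = np.empty(T)
    prices[0] = p0
    var, eps_prev = omega, 0.0                           # initial conditional variance
    for t in range(1, T):
        var = omega + alpha * eps_prev**2 + beta * var   # GARCH(1,1) update
        eps_prev = np.sqrt(var) * rng.standard_normal()  # r_t = sigma'_t * eps_t
        # Euler step of the Vasicek SDE, with the GARCH shock as the noise term
        prices[t] = prices[t - 1] + theta * (mu - prices[t - 1]) * dt + sigma * eps_prev
    return prices
```

The GARCH recursion reproduces volatility clustering (large shocks raise the conditional variance of subsequent shocks), while the Vasicek drift pulls the path toward the amplified long-term mean.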
To create a longer training set, we apply time warping: a random deformation $\tau(t)$ sampled from a Gaussian process with mean 1 and variance 0.1 stretches or compresses the temporal dynamics while preserving sequential order. The final DSLOB dataset comprises 24,891 training, 9,891 validation, and 4,891 test samples, each a window of length 20 containing 85 synthetic features. The target is next-step realized volatility, computed as the square root of the sum of squared 1-second log-returns over the next 20 steps, annualized and log-transformed to match the Optiver Meyer et al. [2021] scale.

Rigorous validation confirms that the synthetic data preserves key statistical properties: the Kolmogorov–Smirnov test fails to reject that return distributions match the seed ($p > 0.05$), the autocorrelation decay of squared returns matches up to lag 50, the average absolute difference in correlation matrices is less than 0.03, and the 99.5th percentile of negative returns is within 5% of the seed's value.

DSLOB serves two critical purposes in the ARTEMIS benchmark. First, it enables the ablation study (Table 3) by providing a controlled environment where components can be systematically removed and their impact observed with known ground truth. Second, it stress-tests model robustness under amplified extreme conditions. ARTEMIS achieves the highest directional accuracy (64.96%) on DSLOB, outperforming all baselines and demonstrating that the framework remains resilient even when markets deviate from training distributions – a key requirement for practical applications where models often fail during crises.

4 The ARTEMIS Model: A Neuro-Symbolic Framework for Economically Constrained Market Dynamics

ARTEMIS (Arbitrage-free Representation Through Economic Models & Interpretable Symbolics) is a novel deep learning framework designed to overcome the limitations of existing black-box models in quantitative finance.
It treats financial markets as a continuous-time dynamical system governed by latent stochastic differential equations that must respect fundamental economic principles such as no-arbitrage conditions. The model integrates ideas from scientific machine learning – specifically physics-informed neural networks, neural operators, and differentiable symbolic regression – into a unified, end-to-end trainable architecture. The core insight is that by embedding economic constraints directly into the learning process, we can regularise the model to avoid implausible predictions, improve out-of-sample robustness, and simultaneously obtain interpretable trading signals. ARTEMIS comprises four tightly coupled modules:

• A continuous-time encoder based on a Laplace Neural Operator that ingests irregularly sampled, multi-resolution market data and maps it to a continuous latent state.
• An economics-informed latent dynamics module that models the evolution of the latent state via a neural stochastic differential equation, with drift and diffusion networks learned from data.
• A symbolic bottleneck layer that distils the latent dynamics into human-readable, closed-form alpha factors using differentiable symbolic regression.
• A conformal allocation layer that translates the stochastic uncertainty of the latent SDE into rigorously calibrated prediction intervals and, optionally, optimal portfolio weights.

[Figure 1: architecture diagram. It depicts the main data flow – irregular observations $\{(x_i, t_i)\}_{i=1}^{N}$ into the Laplace Neural Operator (Module 1), the neural SDE latent dynamics with its physics regularisers $\mathcal{L}_{\mathrm{PDE}}$ and $\mathcal{L}_{\mathrm{MPR}}$ (Module 2), the symbolic bottleneck with Gumbel-Softmax selection (Module 3), and the conformal allocation layer with a Kelly-criterion portfolio head (Module 4) – together with the composite objective $\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{forecast}} + \lambda_1 \mathcal{L}_{\mathrm{PDE}} + \lambda_2 \mathcal{L}_{\mathrm{MPR}} + \lambda_3 \mathcal{L}_{\mathrm{consist}}$.]

Figure 1: Architecture of ARTEMIS. The framework processes irregularly sampled market data $\{(x_i, t_i)\}_{i=1}^{N}$ through four tightly coupled modules. Module 1 (Laplace Neural Operator) encodes the input directly in continuous time via a learnable Laplace-domain kernel $\hat{\kappa}(\omega) = \sum_k A_k / (\omega - \lambda_k)$, eliminating the need for interpolation or regular resampling.
Module 2 (Neural SDE Latent Dynamics) evolves the encoded state under a stochastic differential equation $d\mathbf{z} = \mu_\theta(\mathbf{z}, t)\,dt + \sigma_\phi(\mathbf{z}, t)\,d\mathbf{W}$, where drift $\mu_\theta$ and diffusion $\sigma_\phi$ are neural networks trained with two physics-informed penalties: a Feynman–Kac PDE residual $\mathcal{L}_{\mathrm{PDE}}$ that enforces local no-arbitrage conditions via an auxiliary pricing network $V_\psi$, and a market-price-of-risk penalty $\mathcal{L}_{\mathrm{MPR}}$ that bounds the instantaneous Sharpe ratio $\|\sigma_\phi^{-1}\mu_\theta\|^2 \le \kappa^2$ to economically plausible values. Module 3 (Symbolic Bottleneck) distils the latent dynamics into a sparse, human-readable combination of basis functions $\hat{y}_s = \sum_k w_k f_k(x)$ via a two-phase teacher–student procedure with Gumbel-Softmax selection, providing inherent interpretability without post-hoc approximation. Module 4 (Conformal Allocation) wraps predictions in distribution-free intervals $[\hat{y} \pm q_{1-\alpha}]$ via adaptive conformal prediction, and optionally solves a differentiable Kelly-criterion portfolio problem. All components are trained jointly under the composite objective $\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{forecast}} + \lambda_1 \mathcal{L}_{\mathrm{PDE}} + \lambda_2 \mathcal{L}_{\mathrm{MPR}} + \lambda_3 \mathcal{L}_{\mathrm{consist}}$, with gradients backpropagated through the Euler–Maruyama SDE solver via the reparametrisation trick. The consistency loss $\mathcal{L}_{\mathrm{consist}}$ anchors the SDE trajectory to the encoder outputs at each time step, preventing latent drift.

All components are trained jointly using a composite loss function that combines a forecasting objective with two economic regularisation terms: a Feynman–Kac PDE residual that enforces local no-arbitrage conditions, and a market-price-of-risk penalty that bounds the instantaneous Sharpe ratio to realistic values. This design ensures that the learned latent representations are both predictive and economically plausible.
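The two economic regularisers can be sketched in PyTorch as follows. This is a minimal version assuming a diagonal diffusion and treating the networks as generic callables; the helper name and tensor shapes are illustrative rather than the actual implementation.

```python
import torch

def economic_regularisers(z, t, mu_net, sigma_net, V_net, kappa=2.0):
    """Sketch of the Feynman-Kac PDE residual (via autograd, diagonal
    diffusion assumed) and the market-price-of-risk hinge penalty.
    z: (N, d) latent collocation points, t: (N, 1) times; mu_net and
    sigma_net map (z, t) -> (N, d), V_net maps (z, t) -> (N, 1)."""
    z = z.clone().requires_grad_(True)
    t = t.clone().requires_grad_(True)
    mu, sigma = mu_net(z, t), sigma_net(z, t)
    V = V_net(z, t)
    ones = torch.ones_like(V)
    dV_dt = torch.autograd.grad(V, t, ones, create_graph=True)[0]
    dV_dz = torch.autograd.grad(V, z, ones, create_graph=True)[0]
    # diagonal of the Hessian of V, one latent coordinate at a time
    diag = [torch.autograd.grad(dV_dz[:, i].sum(), z, create_graph=True)[0][:, i]
            for i in range(z.shape[1])]
    d2V = torch.stack(diag, dim=1)
    # R_FK = dV/dt + mu . grad V + 0.5 tr(sigma sigma^T Hess V)  (diagonal case)
    residual = dV_dt.squeeze(-1) + (mu * dV_dz).sum(-1) + 0.5 * (sigma**2 * d2V).sum(-1)
    L_pde = (residual**2).mean()
    lam = mu / sigma                                   # element-wise market price of risk
    L_mpr = torch.clamp((lam**2).sum(-1) - kappa**2, min=0.0).mean()
    return L_pde, L_mpr
```

Both terms then enter the composite objective alongside the forecasting and consistency losses, weighted by $\lambda_1$ and $\lambda_2$.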
Algorithm 1 ARTEMIS Complete Training and Inference Procedure

Require: Training data $\mathcal{D}_{\mathrm{train}} = \{(x^{(n)}, y^{(n)})\}_{n=1}^{N_{\mathrm{train}}}$ with irregular observation times; validation data $\mathcal{D}_{\mathrm{val}}$; hyperparameters $\lambda_1, \lambda_2, \lambda_3, \lambda_4$ (loss weights), $\kappa$ (Sharpe threshold), $\tau$ (Gumbel temperature), $\gamma$ (risk aversion); learning rate $\eta$, number of epochs $E$, batch size $B$, collocation points per batch $N_{\mathrm{coll}}$; SDE step size $\Delta t$, latent dimension $d_z$, Wiener dimension $d_w$, basis library $\mathcal{F}$.
Ensure: Trained parameters: $\theta$ (drift net), $\phi$ (diffusion net), $\psi$ (auxiliary pricing net), $(w, b)$ (forecasting head), $w_{\mathrm{symb}}$ (symbolic weights).

1: function TrainARTEMIS
2:   Initialize all networks randomly: LNO encoder $E$, drift $\mu_\theta$, diffusion $\sigma_\phi$, auxiliary pricing $V_\psi$, forecasting head $(w, b)$, symbolic weights $w_{\mathrm{symb}}$
3:   Set learning rate scheduler (e.g., ReduceLROnPlateau)
▷ Pretraining Phase (without symbolic layer)
4:   for epoch = 1 to $E$ do
5:     for each batch $\mathcal{B} \subset \mathcal{D}_{\mathrm{train}}$ of size $B$ do
6:       Encode batch using LNO: for each sample, obtain latent states at regular times $\{t_j\}_{j=0}^{M}$: $z_j^{(\mathrm{enc})} = E(x)(t_j)$
7:       Set initial condition $z_0 = z_0^{(\mathrm{enc})}$ for each sample
8:       Simulate SDE forward using Euler–Maruyama (for each sample independently):
9:       for $j = 0$ to $M - 1$ do
10:        Sample $\epsilon_j \sim \mathcal{N}(0, I_{d_w})$
11:        $z_{j+1} = z_j + \mu_\theta(z_j, t_j)\,\Delta t + \sigma_\phi(z_j, t_j)\sqrt{\Delta t}\,\epsilon_j$
12:      end for
13:      Obtain final latent state $z_M$ and compute prediction $\hat{y} = w^\top z_M + b$
14:      Compute forecasting loss $\mathcal{L}_{\mathrm{forecast}} = \frac{1}{B}\sum_{n=1}^{B} \ell(\hat{y}^{(n)}, y^{(n)})$
15:      Sample collocation points $\{(z_i, t_i)\}_{i=1}^{N_{\mathrm{coll}}}$ from the latent trajectories (random times along each path)
16:      Compute PDE residuals via automatic differentiation: $\mathcal{R}_{FK}(z_i, t_i) = \frac{\partial V_\psi}{\partial t} + \mu_\theta \cdot \nabla_z V_\psi + \frac{1}{2}\mathrm{tr}\!\left(\sigma_\phi \sigma_\phi^\top \nabla_z^2 V_\psi\right)$
17:      $\mathcal{L}_{\mathrm{PDE}} = \frac{1}{N_{\mathrm{coll}}}\sum_{i=1}^{N_{\mathrm{coll}}} \|\mathcal{R}_{FK}(z_i, t_i)\|^2$
18:      Compute market price of risk $\lambda(t) = \mu_\theta(z(t), t) / \sigma_\phi(z(t), t)$ (element-wise)
19:      $\mathcal{L}_{\mathrm{MPR}} = \frac{1}{B}\sum_{b=1}^{B} \max\!\left(0, \|\lambda(t_b)\|^2 - \kappa^2\right)$ (evaluated at sampled times)
20:      Compute consistency loss $\mathcal{L}_{\mathrm{consist}} = \frac{1}{M}\sum_{j=1}^{M} \|z_j^{(\mathrm{sde})} - z_j^{(\mathrm{enc})}\|^2$
21:      Total loss $\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{forecast}} + \lambda_1 \mathcal{L}_{\mathrm{PDE}} + \lambda_2 \mathcal{L}_{\mathrm{MPR}} + \lambda_3 \mathcal{L}_{\mathrm{consist}}$
22:      Backpropagate $\mathcal{L}_{\mathrm{total}}$ and update parameters $(\theta, \phi, \psi, w, b)$ using Adam with learning rate $\eta$
23:    end for
24:    Evaluate on validation set $\mathcal{D}_{\mathrm{val}}$ (using only $\mathcal{L}_{\mathrm{forecast}}$)
25:    Adjust learning rate if validation loss has plateaued
26:  end for
▷ Symbolic Distillation Phase
27:  Freeze encoder $E$, drift $\mu_\theta$, diffusion $\sigma_\phi$, auxiliary net $V_\psi$, and forecasting head $(w, b)$
28:  for epoch = 1 to $E_{\mathrm{symb}}$ do
29:    for each batch $\mathcal{B} \subset \mathcal{D}_{\mathrm{train}}$ do
30:      Compute frozen model predictions $\hat{y}$ (using the same forward pass as above, no gradients)
31:      Compute symbolic prediction $\hat{y}_{\mathrm{symb}} = \sum_{k=1}^{K} w_{\mathrm{symb},k}\, f_k(x_{\mathrm{input}})$
32:      Distillation loss $\mathcal{L}_{\mathrm{distill}} = \frac{1}{B}\sum_{n=1}^{B} (\hat{y}_{\mathrm{symb}}^{(n)} - \hat{y}^{(n)})^2 + \lambda_4 \|w_{\mathrm{symb}}\|_1$
33:      Backpropagate $\mathcal{L}_{\mathrm{distill}}$ and update symbolic weights $w_{\mathrm{symb}}$ (using Gumbel-Softmax if basis functions are learnable)
34:    end for
35:  end for
▷ Conformal Prediction (post-training)
36:  Compute residuals on calibration set $\mathcal{D}_{\mathrm{cal}}$ (subset of validation) using the final model: $r_i = |y_i - \hat{y}(X_i)|$
37:  Determine quantile $q_{1-\alpha}$ from residuals (adaptive rolling window for non-stationary data)
38:  For any test point, output prediction $\hat{y}$ and interval $[\hat{y} - q_{1-\alpha}, \hat{y} + q_{1-\alpha}]$
39: end function

4.1 Continuous-Time Encoder: Laplace Neural Operator

Financial time series are inherently irregular: limit order book updates arrive at microsecond granularity, while fundamentals and macro indicators are published daily or quarterly. Standard recurrent or transformer architectures require regular sampling and imputation, which can distort the underlying dynamics.
ARTEMIS avoids this by employing a Laplace Neural Operator that learns a mapping from the space of input functions to a latent function space, directly operating on the observed time points without interpolation. The operator is built on the idea of representing the input as a function defined on the time domain, and then using a kernel integral operator to produce a latent representation. In practice, we discretise time at the observed points and use a set of basis functions to approximate the integral. The operator can be written as:

$$z(t) = \int \kappa(t - s)\, x(s)\, ds + b(t),$$

where $\kappa$ is a learnable kernel parameterised in the Laplace domain for efficiency and $b$ is a bias term. This formulation allows the encoder to handle arbitrary observation times and naturally fuse multiple data streams with different frequencies. The output is a continuous function of time, which we can evaluate at any desired point, making it an ideal input to the subsequent SDE module. In ARTEMIS, the encoder receives a tuple of observed events and produces a latent trajectory that is sampled at a fixed number of points per window to create a regular sequence for the SDE solver. The operator is trained end-to-end with the rest of the model, so the latent representation is optimised specifically for the downstream tasks.

4.2 Economics-Informed Latent Dynamics

The latent state is assumed to evolve according to a stochastic differential equation:

$$d\mathbf{z}(t) = \mu_\theta(\mathbf{z}(t), t)\,dt + \sigma_\phi(\mathbf{z}(t), t)\,d\mathbf{W}(t),$$

where the drift represents the predictable component, the diffusion captures stochastic volatility and regime changes, and $d\mathbf{W}(t)$ is a Wiener process. Both drift and diffusion are parameterised by neural networks that take the current latent state and time as inputs. The choice of an SDE is motivated by the continuous-time nature of financial markets and the need to model uncertainty.
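The Euler–Maruyama unrolling of such an SDE can be sketched in NumPy as follows (diagonal diffusion assumed; in the actual model the drift and diffusion are neural networks and gradients flow through this loop via the reparametrisation trick):

```python
import numpy as np

def euler_maruyama(z0, mu, sigma, n_steps, dt, seed=0):
    """Simulate dz = mu(z,t) dt + sigma(z,t) dW with the Euler-Maruyama
    scheme. mu and sigma are callables returning arrays shaped like z;
    returns the full path of shape (n_steps + 1, *z0.shape)."""
    rng = np.random.default_rng(seed)
    z = np.asarray(z0, dtype=float)
    path = [z]
    for j in range(n_steps):
        eps = rng.standard_normal(z.shape)     # eps_j ~ N(0, I)
        z = z + mu(z, j * dt) * dt + sigma(z, j * dt) * np.sqrt(dt) * eps
        path.append(z)
    return np.stack(path)
```

With the diffusion set to zero and an Ornstein–Uhlenbeck-style drift $\mu(z) = \theta(\bar{z} - z)$, the path contracts deterministically toward $\bar{z}$, which is a convenient sanity check of the scheme.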
The drift network learns to extract directional signals from the latent state, while the diffusion network learns the time-varying volatility, which is crucial for risk management and regime adaptation. Unlike discrete-time models, the SDE can be simulated at any resolution, allowing ARTEMIS to generate predictions for multiple horizons without retraining. To enforce economic plausibility, we regularise the SDE using two physics-informed losses derived from the Fundamental Theorem of Asset Pricing.

4.2.1 Feynman–Kac PDE Residual

Consider a derivative or portfolio value function that depends on the latent state. Under the risk-neutral measure, this function must satisfy the Feynman–Kac PDE:

$$\frac{\partial V}{\partial t} + \mu \cdot \nabla_z V + \frac{1}{2}\mathrm{tr}\!\left(\sigma\sigma^\top \nabla_z^2 V\right) = 0.$$

This equation expresses the condition that the expected change in value equals the risk-free return – i.e., no arbitrage. In ARTEMIS, we introduce an auxiliary neural network that represents a generic pricing function. We then compute the PDE residual using automatic differentiation:

$$\mathcal{R}_{FK} = \frac{\partial V}{\partial t} + \mu \cdot \nabla_z V + \frac{1}{2}\mathrm{tr}\!\left(\sigma\sigma^\top \nabla_z^2 V\right)$$

and penalise its mean square over a set of collocation points sampled from the latent trajectories:

$$\mathcal{L}_{PDE} = \frac{1}{N}\sum_{i=1}^{N} \mathcal{R}_{FK}(z_i, t_i)^2.$$

Figure 2: Temporal profiles of drift magnitude $\|\mu(Z, t)\|$ and diffusion magnitude $\|\sigma(Z, t)\|$ evaluated across the 100-timestep input window on the DSLOB crash-regime test set. At each normalised time $t \in [0, 1]$, the norms are computed over the full latent dimension and averaged across 256 test samples; shaded bands denote $\pm 1\sigma$ across samples. Two observations are of economic significance.
First, the diffusion magnitude $\|\sigma\|$ increases monotonically toward the end of the sequence window, indicating that the model assigns growing uncertainty to more recent LOB states – consistent with the stylised fact that price impact and volatility are highest in the final moments before a regime transition. Second, the drift magnitude $\|\mu\|$ exhibits a non-monotone profile with a mid-sequence peak, reflecting the model's learned representation of momentum followed by mean-reversion dynamics. Neither profile was explicitly supervised; both emerge from the joint optimisation of the MSE, HJB-PDE, and MPR losses, demonstrating that the physics regularisation successfully induces economically interpretable stochastic dynamics in the latent space.

Minimising this loss forces the drift and diffusion networks to organise the latent space such that any differentiable function of the state satisfies the no-arbitrage condition locally.

4.2.2 Market Price of Risk Penalty

Strict no-arbitrage is a theoretical ideal; real markets exhibit transient mispricings that can be exploited. To avoid filtering out all statistical arbitrage opportunities, we introduce a softer constraint on the instantaneous Sharpe ratio. Define

$$\lambda(t) = \frac{\mu(z(t), t)}{\sigma(z(t), t)}$$

with element-wise division. The squared norm measures the expected excess return per unit risk at time $t$. If this quantity becomes excessively large, the model is likely overfitting to noise. We therefore add a hinge penalty:

$$\mathcal{L}_{MPR} = \frac{1}{B}\sum_{b=1}^{B} \max\!\left(0, \|\lambda(t_b)\|^2 - \kappa^2\right),$$

where $\kappa$ is a threshold set to a plausible maximum annualised Sharpe ratio. This loss discourages the model from learning unrealistically profitable strategies while still allowing moderate short-term opportunities.

4.3 Symbolic Bottleneck Layer

A major criticism of deep learning in finance is the lack of interpretability.
ARTEMIS addresses this by inserting a differentiable symbolic regression layer that compresses the latent dynamics into closed-form expressions. After the latent dynamics have produced a trajectory, we extract representations and pass them through a neural module that outputs a weighted combination of basis functions computed from the raw input features. Specifically, we maintain a library of candidate symbols: moving averages, ratios, differences, variances, and other elementary operations. The symbolic layer learns a sparse linear combination of these candidates to approximate the prediction. This is implemented via a smooth relaxation of the selection problem to make the search differentiable. The output is a set of interpretable factors:

$$\hat{y} = \sum_{k=1}^{K} w_k \cdot f_k(x),$$

where each $f_k$ is a simple mathematical expression. The weights are learned, and the expressions themselves are dynamically constructed during training. To stabilise training, we adopt a two-phase procedure. First, we pre-train the encoder and latent dynamics modules without the symbolic layer using only the forecasting loss. Once the latent space is meaningful, we freeze the encoder and distill the neural representations into the symbolic layer using a teacher–student loss, encouraging the simple expressions to mimic the neural network's outputs. This yields a model that is both accurate and transparent.

4.4 Conformal Allocation Layer

Financial predictions are inherently uncertain; a point forecast without a measure of confidence is of limited use for risk management. ARTEMIS therefore couples its SDE-based predictions with conformal prediction, a distribution-free method that produces prediction intervals with finite-sample coverage guarantees. Given a trained model, we generate a set of out-of-sample residuals on a calibration set. For a new input, we compute a prediction interval $\hat{y} \pm q_{1-\alpha}$, where $q_{1-\alpha}$ is the appropriate quantile of the absolute residuals.
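The split-conformal step can be sketched as below. The finite-sample quantile correction is a standard variant; the adaptive rolling-window update used for non-stationary data is not shown, and the helper name is illustrative.

```python
import numpy as np

def conformal_interval(cal_y, cal_pred, test_pred, alpha=0.1):
    """Split-conformal prediction intervals: the (1-alpha) quantile of
    absolute calibration residuals is added around each point forecast."""
    residuals = np.abs(cal_y - cal_pred)
    n = len(residuals)
    # finite-sample-corrected quantile level ceil((n+1)(1-alpha))/n
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(residuals, level)
    return test_pred - q, test_pred + q
```

Under exchangeability, intervals built this way cover the true target with probability at least $1 - \alpha$, regardless of the underlying model.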
These intervals are marginally valid under exchangeability. Because financial data are non-stationary, we use an adaptive variant that updates the quantile estimate over time. The prediction intervals can be fed into a differentiable convex optimisation layer to construct portfolios that maximise a risk-adjusted objective such as the continuous Kelly criterion:

$$\max_{w}\ \mathbb{E}[R(w)] - \frac{\gamma}{2}\mathrm{Var}(R(w)),$$

subject to budget and leverage constraints. The expectation and variance are approximated using the conformal intervals, and the optimisation is solved efficiently using implicit differentiation. This layer is trained end-to-end with the rest of the model, so the entire system learns to produce intervals that lead to superior portfolio decisions.

4.5 Loss Function and Training

The total loss function combines several terms:

Figure 3: Vector field of the learned SDE dynamics projected onto the first two principal components (PC1–PC2) of the ARTEMIS latent space, evaluated at three canonical normalised times: $t = 0.1$ (early sequence), $t = 0.5$ (mid-sequence), and $t = 0.9$ (late sequence). The figure is arranged as a 2×3 grid. The top row shows drift quiver plots: each arrow represents the direction and magnitude of $\mu(Z, t)$ projected onto the PC1–PC2 plane via $\mu_{PC} = \mu V_{1:2}^\top$, where $V_{1:2}$ are the top two PCA eigenvectors; arrow colour encodes drift speed $\|\mu_{PC}\|$. The bottom row shows diffusion heatmaps: the background colour encodes $\|\sigma(Z, t)\|$ computed directly in the full 64-dimensional latent space at each grid point, with brighter regions indicating higher local volatility. The grid is constructed by sweeping PC1 and PC2 over their empirical 5th–95th percentile ranges and back-projecting into latent space via the PCA inverse transform. At $t = 0.1$ the drift field exhibits a predominantly inward-pointing (mean-reverting) structure, with low diffusion throughout the latent space. By $t = 0.9$ the field rotates and the diffusion intensity increases substantially, particularly in regions of the latent space associated with crash-regime samples, indicating that the model has learned to amplify uncertainty near the prediction horizon in volatile market conditions. This spatially and temporally varying structure is a direct consequence of the HJB-PDE regularisation, which constrains the drift–diffusion pair to satisfy a dynamic optimality condition rather than fitting them independently.

$$\mathcal{L}_{total} = \mathcal{L}_{forecast} + \lambda_1 \mathcal{L}_{PDE} + \lambda_2 \mathcal{L}_{MPR} + \lambda_3 \mathcal{L}_{consist},$$

where:

• $\mathcal{L}_{forecast}$ is the standard supervised loss on the final prediction.
• $\mathcal{L}_{PDE}$ is the Feynman–Kac residual.
• $\mathcal{L}_{MPR}$ is the market-price-of-risk penalty.
• $\mathcal{L}_{consist}$ is a consistency loss that ensures the SDE-evolved latent state matches the encoded state at each time step.

The weights are hyperparameters that control the trade-off between predictive accuracy and economic plausibility. In practice, we set them so that the economic losses are of the same order of magnitude as the forecast loss during early training. Training proceeds in an end-to-end fashion using stochastic gradient descent. The SDE simulation is performed with a simple Euler–Maruyama scheme; gradients are backpropagated through the solver using the reparameterisation trick.

Table 2: Master Benchmark Results Across All Datasets and Models

Dataset                      | Model          | RMSE ↓  | RankIC ↑ | DirAcc ↑ | Weighted R² ↑
Jane Street Desai et al. [2024] | LSTM        | 0.7628  | 0.0378   | 0.5159   | 0.0020
                             | Transformer    | 0.7635  | 0.0122   | 0.5337   | -0.0002
                             | NS-Transformer | 2.6034  | 0.0031   | 0.5009   |
                             | Informer       | 0.7862  | 0.0083   | 0.4739   | -0.0008
                             | ARTEMIS        | 0.7762  | 0.0432   | 0.5150   | -0.0009
                             | Chronos-2      | 1.4043  | 0.1325   | 0.5372   | -1.4578
Optiver Meyer et al. [2021]  | LSTM           | 0.5570  | 0.0000   | 0.0000   | -0.0271
                             | Transformer    | 0.5422  | 0.3583   | 0.6162   | 0.0268
                             | NS-Transformer | 0.7019  | 0.2474   | 0.6057   | -0.6308
                             | Informer       | 1.8411  | -0.1465  | 0.5679   | -10.220
                             | ARTEMIS        | 0.5553  | -0.0555  | 0.4582   | -0.0208
                             | Chronos-2      | 4.9538  | -0.1384  | 0.4047   | -80.232
Time-IMM Chang et al. [2025] | LSTM           | 19.58   | 0.493    | 0.533    | -2.314
                             | Transformer    | 4.420   | 0.969    | 0.922    | 0.831
                             | NS-Transformer | 40.469  | 0.257    | 0.599    | -13.158
                             | Informer       | 4.011   | 0.928    | 0.890    | 0.861
                             | ARTEMIS        | 4.691   | 0.904    | 0.860    | 0.810
                             | Chronos-2      | 79.255  | 0.943    | 0.907    | -53.302
DSLOB                        | LSTM           | 0.01340 | 0.03905  | 0.6064   | -3053.6
                             | Transformer    | 0.01211 | 0.10174  | 0.4756   | -550.28
                             | NS-Transformer | 0.08575 | -0.09385 | 0.3504   | -90674
                             | Informer       | 0.02699 | -0.05606 | 0.3564   | -6557.2
                             | ARTEMIS        | 0.03615 | 0.08791  | 0.6496   | -2351.2
                             | Chronos-2      | 0.01446 | -0.10114 | 0.6238   | -3123.2

The conformal layer and symbolic regression module are also differentiable, allowing the entire system to be optimised jointly.

4.6 Integration and Implementation

The four modules are assembled into a single computational graph. For a batch of input windows, the encoder produces latent trajectories. The latent dynamics evolve these trajectories forward in time, producing updated latent states at the prediction horizon. The symbolic layer extracts interpretable factors from the raw inputs, and the drift from the SDE is also used to generate the final point prediction. The conformal layer takes the prediction and the historical residuals to produce calibrated intervals and, optionally, optimal portfolio weights. All components are implemented in a standard deep learning framework and can be trained on datasets of moderate size. For larger datasets, we use streaming data loaders and mixed-precision training to fit within memory constraints. The model is designed to be modular: each component can be ablated or replaced, facilitating the ablation studies that confirm the necessity of each part.
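As a concrete instance of the allocation step, the unconstrained Kelly-style problem $\max_w w^\top \hat{y} - \frac{\gamma}{2} w^\top \Sigma w$ has the closed form $w^* = \Sigma^{-1}\hat{y}/\gamma$. The sketch below uses that closed form with a crude leverage rescaling in place of the differentiable constrained QP solved in the full model; the helper name is illustrative.

```python
import numpy as np

def kelly_weights(y_hat, cov, gamma, leverage=1.0):
    """Unconstrained mean-variance/Kelly solution w* = cov^{-1} y_hat / gamma,
    rescaled (crudely) to respect an L1 leverage cap."""
    w = np.linalg.solve(cov, y_hat) / gamma
    l1 = np.abs(w).sum()
    if l1 > leverage:
        w = w * (leverage / l1)      # shrink onto the leverage budget
    return w
```

With an identity covariance and no binding leverage cap, the weights are simply the forecasts scaled by $1/\gamma$, which makes the role of the risk-aversion parameter transparent.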
5 Benchmarking Baselines: The Five Core Models Compared Against ARTEMIS

In order to establish a rigorous and comprehensive evaluation of the ARTEMIS framework, it was essential to select a suite of baseline models that represent the current state of the art in time-series forecasting and financial machine learning, while also spanning a diverse range of architectural paradigms. The choice of baselines was guided by the need to cover well-established recurrent architectures, the dominant transformer-based models that have revolutionised sequence modelling, and a specialised zero-shot foundation model designed explicitly for time series. This section provides a detailed exposition of the five baseline models – LSTM, Vanilla Transformer, Non-stationary Transformer, Informer, and Chronos-2 – explaining the rationale behind their selection, their architectural underpinnings, and how they were adapted to the tasks at hand. Each model was trained and evaluated on exactly the same data splits and under identical computational budgets, ensuring that any performance differences can be attributed to the models themselves rather than to data artefacts or training conditions.

5.1 Long Short-Term Memory (LSTM) Networks

The Long Short-Term Memory network, introduced by Hochreiter and Schmidhuber in 1997, remains one of the most enduring and widely used architectures for sequence modelling. Its inclusion as a baseline is motivated by several factors. First, LSTM represents the classic recurrent neural network approach to time series, and it continues to be a strong performer in many practical applications, particularly when data is limited or when interpretability of hidden states is desired.
Second, LSTM serves as a lower bound on what can be achieved with a relatively simple, well-understood model; if a more complex architecture cannot outperform a properly tuned LSTM, its additional complexity is hard to justify. Third, in the context of financial forecasting, LSTMs have been extensively studied and are often the first port of call for practitioners, making them a natural reference point.

Figure 4: Performance degradation across the three DSLOB market regimes for all six benchmark models. The x-axis progresses from the training distribution (Normal, low volatility) through the validation distribution (Stress, medium volatility) to the held-out test distribution (Crash, high volatility with downward drift), representing a controlled out-of-distribution evaluation. ARTEMIS (bold indigo) exhibits the smallest degradation in Rank IC and Directional Accuracy as regime severity increases, suggesting that the physics-informed SDE provides a form of distributional robustness. Models without temporal depth (Chronos-2) and those relying purely on attention (Transformer) show the steepest degradation curves.

The architecture implemented for this benchmark is a standard stacked LSTM with two hidden layers, each containing 128 units, followed by a fully connected output layer that maps the final hidden state to a scalar prediction. Dropout with a rate of 0.2 is applied between LSTM layers to mitigate overfitting. The model receives an input sequence of length 20 (for Jane Street, Time-IMM [Chang et al., 2025], and DSLOB) or 600 (for Optiver [Meyer et al., 2021]) with a feature dimension that varies per dataset (79, 4, 59, and 7 respectively). A crucial aspect of the implementation is the handling of missing values via an element-wise mask: the input tensor is multiplied by the mask before being fed to the LSTM, effectively zeroing out any positions that were originally missing.
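A minimal sketch of this element-wise masking step, assuming NaN-coded missing values and a 0/1 observation mask (names and shapes hypothetical):

```python
import numpy as np

def apply_missing_mask(x, mask):
    """Zero out missing positions before the recurrent pass.

    x    : (batch, seq_len, n_features) raw inputs, NaN where missing
    mask : same shape, 1.0 where observed, 0.0 where missing
    """
    # nan_to_num guards against NaN * 0 = NaN before the multiplication
    return np.nan_to_num(x) * mask

x = np.array([[[1.0, np.nan],
               [2.0, 3.0]]])          # one window, 2 steps, 2 features
mask = np.array([[[1.0, 0.0],
                  [1.0, 1.0]]])
clean = apply_missing_mask(x, mask)    # missing entry becomes 0.0
```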
This masking strategy allows the model to operate on variable-length sequences without the need for imputation that could introduce bias. Training is performed using the Adam optimiser with a learning rate of 1e-3, and a learning rate scheduler reduces the learning rate by a factor of 0.5 when the validation loss plateaus. Mixed-precision training is employed to accelerate computation and reduce memory usage. The loss function is mean squared error for regression tasks (Jane Street [Desai et al., 2024], Optiver [Meyer et al., 2021], Time-IMM [Chang et al., 2025]) and binary cross-entropy with logits for the DSLOB classification task. Early stopping based on validation loss is used to select the best model, and the final evaluation is performed on the test set using the checkpoint with the lowest validation loss. Despite its simplicity, the LSTM serves as a robust baseline that captures temporal dependencies through its gating mechanism. Its performance on the four datasets – often achieving competitive results, particularly on Jane Street [Desai et al., 2024], where it attained an RMSE of 0.7628 and a RankIC of 0.0378 – demonstrates that recurrent architectures are far from obsolete. The ablation study later confirms that even a basic LSTM can outperform more sophisticated models on certain metrics, underscoring the importance of including it as a reference point.

5.2 Vanilla Transformer

The introduction of the Transformer architecture by Vaswani et al. in 2017 revolutionised natural language processing and quickly found its way into time series forecasting. Unlike recurrent models, Transformers process the entire sequence in parallel using self-attention mechanisms, which allows them to capture long-range dependencies more effectively and to scale to longer sequences.
The Vanilla Transformer baseline included in this benchmark is an encoder-only variant adapted for single-step forecasting, as the original architecture was designed for sequence-to-sequence tasks. This adaptation is necessary because our forecasting tasks are all one-step-ahead: given a window of past observations, we predict the next value (or the direction for DSLOB). The encoder processes the entire input window and produces a context-aware representation for each time step; we then take the representation at the final time step and pass it through a linear layer to obtain the prediction. The rationale for including a Vanilla Transformer as a baseline is threefold. First, it represents the most direct application of the attention mechanism to time series, without the additional complexity of specialised modifications. This allows us to isolate the benefits of the core self-attention idea. Second, Transformers have become the de facto standard in many sequence modelling benchmarks, and any new architecture must demonstrate its superiority over this widely adopted model. Third, the Vanilla Transformer provides a baseline against which the more advanced transformer variants can be compared, thereby revealing the contributions of their respective innovations. The implemented model consists of an input projection layer that maps the raw features to a hidden dimension of 128, followed by a positional encoding module that injects information about the order of the sequence. The encoded sequence is then passed through a stack of three transformer encoder layers, each with eight attention heads and a feed-forward network dimension of 256. Dropout of 0.1 is applied after each sub-layer. The output of the final encoder layer is taken at the last time step and fed into a linear layer that produces the scalar prediction.
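Assuming the positional encoding module follows the standard sinusoidal scheme of Vaswani et al. (the text does not specify which variant is used), it can be sketched with the benchmark's dimensions (sequence length 20, hidden size 128):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Standard sinusoidal positional encoding (Vaswani et al., 2017).

    Even columns carry sin terms, odd columns cos terms, with
    geometrically spaced wavelengths across the hidden dimension.
    """
    pos = np.arange(seq_len)[:, None]             # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]          # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Added to the projected inputs before the encoder stack.
pe = positional_encoding(seq_len=20, d_model=128)
```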
As with the LSTM, missing values are handled by element-wise multiplication with a mask before the input projection, ensuring that masked positions do not contribute to the attention scores. Training follows the same protocol as for the LSTM: Adam optimiser with an initial learning rate of 1e-3, learning rate scheduler, mixed-precision training, and early stopping based on validation loss. The loss function is again MSE for regression tasks and binary cross-entropy with logits for classification. The model is trained for up to 15 epochs, with the best checkpoint saved. On the Jane Street [Desai et al., 2024] dataset, the Vanilla Transformer achieved an RMSE of 0.7635, nearly identical to the LSTM, but a slightly lower RankIC. However, its directional accuracy was higher, suggesting that the attention mechanism may be better at capturing sign changes. On the Optiver dataset [Meyer et al., 2021], the Transformer significantly outperformed the LSTM on most metrics, with an RMSE of 0.5422 and a much higher RankIC of 0.3583. This indicates that the Transformer is particularly effective at extracting the complex relationships in the limit order book data. On Time-IMM [Chang et al., 2025], the Transformer excelled, achieving an RMSE of 4.420, a RankIC of 0.969, and a directional accuracy of 0.922 – far surpassing the LSTM. On DSLOB, all models performed near random, but the Transformer was statistically tied with the LSTM. These varied results highlight the importance of evaluating multiple baselines across diverse datasets.

5.3 Non-stationary Transformer

Time series data, especially in financial markets, are often non-stationary: their statistical properties change over time due to regime shifts, evolving volatility, and external shocks. Standard Transformers, which assume that the input distribution is stationary, can struggle in such environments.
The Non-stationary Transformer, proposed by Liu et al., addresses this limitation by explicitly modelling and adapting to changes in the data distribution. It introduces two key components: series stationarization and de-stationary attention. Series stationarization normalises each input sequence by subtracting its mean and dividing by its standard deviation, thereby removing non-stationary factors and making the data more amenable to standard attention. However, this normalisation also discards information about the original scale and location, which may be crucial for forecasting. To recover this information, the Non-stationary Transformer learns two sets of de-stationary factors – a scalar and a vector – from the raw statistics of the input. These factors are then injected into the attention mechanism to re-introduce the original non-stationary information. Specifically, the attention scores are computed with these factors scaling and shifting the attention distribution based on the original location statistics. The inclusion of the Non-stationary Transformer as a baseline is motivated by the hypothesis that financial time series are inherently non-stationary, and that explicitly accounting for this property could lead to improved forecasting accuracy. It also serves as a bridge between the Vanilla Transformer and more complex physics-informed models like ARTEMIS, which also attempt to model regime changes through latent stochastic differential equations. Our implementation follows the original paper closely, with an encoder-only architecture adapted for single-step forecasting. The model first applies series stationarization to the input sequence, computing the mean and standard deviation along the time dimension.
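The stationarization step just described, together with the final de-normalisation of the prediction, can be sketched as follows; the learned de-stationary factors that re-enter the attention are omitted, and the function names are hypothetical:

```python
import numpy as np

def stationarize(x, eps=1e-8):
    """Per-window normalisation along the time axis.

    x : (batch, seq_len, n_features). Returns the normalised sequence and
    the statistics needed to de-normalise the prediction afterwards.
    """
    mu = x.mean(axis=1, keepdims=True)
    sigma = x.std(axis=1, keepdims=True) + eps   # eps guards constant windows
    return (x - mu) / sigma, mu, sigma

def de_stationarize(y_hat, mu, sigma, target_idx=0):
    """Map a normalised scalar prediction back to the original scale."""
    return y_hat * sigma[:, 0, target_idx] + mu[:, 0, target_idx]

x = np.random.default_rng(1).normal(5.0, 2.0, size=(4, 20, 3))
x_norm, mu, sigma = stationarize(x)
```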
These statistics are then fed into a projector network that outputs the log-scale factor and the shift vector. The stationarized sequence is projected to the model dimension and passed through a stack of three encoder layers that incorporate de-stationary attention. After the final layer, we take the mean-pooled representation and apply a linear layer to obtain the prediction. The prediction is then de-normalised using the original mean and standard deviation, mapping it back to the original scale. Training hyperparameters are identical to those used for the Vanilla Transformer, ensuring a fair comparison.

Figure 5: Predicted versus actual mid-price return scatter plots for all six benchmark models evaluated on the DSLOB crash-regime test set. Each panel displays 2,000 randomly sampled predictions. The dashed diagonal represents the identity line (perfect prediction). RMSE and Rank IC are annotated in each title. ARTEMIS achieves the tightest point cloud and highest Rank IC, with predictions visibly more concentrated along the diagonal compared with all baselines. Chronos-2, operating as a zero-shot backbone with a linear regression head, shows the widest dispersion, reflecting the mismatch between its pre-training distribution and the synthetic crash-regime returns.

On the Jane Street [Desai et al., 2024] dataset, the Non-stationary Transformer produced a much higher RMSE and a lower RankIC than the Vanilla Transformer, suggesting that the additional complexity may have hindered learning on this particular dataset. However, on Optiver [Meyer et al., 2021], it achieved a respectable RankIC, outperforming the Vanilla Transformer on that metric, though its RMSE was higher. On Time-IMM Chang et al.
[2025], the Non-stationary Transformer performed poorly, with an RMSE of 40.469 and a RankIC of 0.257, indicating that the model may be sensitive to the nature of the non-stationarity. On DSLOB, it was statistically indistinguishable from random. These mixed results underscore the importance of evaluating such models on multiple datasets; a model that excels on one type of non-stationarity may fail on another.

5.4 Informer

The Informer is a transformer variant specifically designed for long-sequence time series forecasting. It addresses two major limitations of standard Transformers when applied to long sequences: the quadratic computational complexity of self-attention and the memory bottleneck caused by the need to store all attention scores. The Informer introduces a ProbSparse attention mechanism that selects only the most dominant queries based on a sparsity measurement, reducing the complexity significantly. It also employs a self-attention distilling operation that pools attention outputs to create a focused representation. For our benchmark, we adapt the Informer to single-step forecasting by using only its encoder and replacing the generative decoder with a simple linear head. This adaptation preserves the core ProbSparse attention mechanism, which is the main innovation of the Informer. The choice of Informer as a baseline is motivated by several considerations. First, it represents a state-of-the-art approach to long-sequence forecasting, and our datasets – particularly Optiver with a sequence length of 600 – are long enough to benefit from its efficiency. Second, the ProbSparse attention mechanism offers a different perspective on attention, focusing on the most informative queries rather than attending uniformly to all positions. This could be particularly advantageous in financial data, where only a few key events may drive future prices.
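The sparsity measurement behind ProbSparse attention can be illustrated with the max-minus-mean score from the Informer paper; this sketch shows only the query-selection step, not the full approximate attention or the sampled-key approximation:

```python
import numpy as np

def sparsity_scores(Q, K):
    """Max-minus-mean sparsity measurement from the Informer paper.

    Q : (n_queries, d), K : (n_keys, d). A query whose attention
    distribution is far from uniform (a few dominant keys) gets a high
    score; only the top-scoring queries receive full attention.
    """
    scores = Q @ K.T / np.sqrt(Q.shape[-1])        # scaled dot products
    return scores.max(axis=-1) - scores.mean(axis=-1)

def select_top_queries(Q, K, u):
    """Return the indices of the u most 'dominant' queries."""
    m = sparsity_scores(Q, K)
    return np.argsort(m)[::-1][:u]

rng = np.random.default_rng(2)
Q, K = rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
top = select_top_queries(Q, K, u=4)
```

The remaining queries receive a default value (e.g. the mean of the values), which is how the quadratic cost is avoided.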
Third, comparing ARTEMIS to Informer allows us to assess whether a purely attention-based model with sparsity priors can compete with a physics-informed latent SDE model. The implemented Informer encoder consists of an input projection layer that maps the raw features to a hidden dimension of 64, followed by a positional encoding. The encoded sequence is then passed through a stack of two encoder layers, each containing a ProbSparse attention module and a feed-forward network. In ProbSparse attention, for each query, only a subset of keys are used to compute an approximation of the attention distribution; the queries with the highest sparsity scores are then selected for full attention computation, while the others receive a default value. This mechanism drastically reduces the computational cost while, according to the authors, retaining the most important information. Training follows the same protocol as the other transformer variants. On the Jane Street [Desai et al., 2024] dataset, the Informer achieved an RMSE of 0.7862 and a RankIC of 0.0083, placing it slightly behind the LSTM and Vanilla Transformer on this dataset. On Optiver, however, its performance was poor, with an RMSE of 1.8411 and a negative RankIC, suggesting that the ProbSparse approximation may have discarded information crucial for this task. On Time-IMM [Chang et al., 2025], the Informer performed very well, with an RMSE of 4.011, a RankIC of 0.928, and a directional accuracy of 0.890 – second only to the Transformer. On DSLOB, like all other models, it was near random. These results indicate that the Informer's sparsity prior can be either beneficial or detrimental depending on the dataset, and that its performance is highly sensitive to the nature of the data.

5.5 Chronos-2

Chronos-2 represents a fundamentally different approach to time series forecasting.
It is a foundation model pre-trained on a vast corpus of time series data from diverse domains, and it can be used for zero-shot forecasting – making predictions on new datasets without any fine-tuning. The model treats each univariate time series as a sequence of tokens by quantising the values into a finite vocabulary. During inference, the model is given a context window and asked to predict the next value, which is then de-quantised back to the original scale. The inclusion of Chronos-2 as a baseline serves multiple purposes. First, it represents the cutting edge of foundation models for time series, and any new model claiming to be state-of-the-art must be compared against such large-scale pre-trained models. Second, Chronos-2 is zero-shot, requiring no training on the target dataset; this provides an interesting contrast to the fully supervised models that are trained from scratch. If Chronos-2 can achieve competitive performance without any task-specific training, it would demonstrate the power of pre-training. Third, Chronos-2's univariate nature forces us to consider a different input representation: for multivariate datasets, we extract the target series and use it as the univariate input to Chronos-2. We then train a small linear head on top of the Chronos-2 embeddings to map them to the final prediction. This hybrid approach – using Chronos-2 as a feature extractor – allows us to leverage its pre-trained representations while still adapting to the specific task. Implementing Chronos-2 required careful handling of the target scale. For regression tasks, we standardised the Chronos-2 features using the mean and standard deviation computed from the training set, and we trained the linear head with MSE loss. For classification, we used binary cross-entropy with logits. The pre-trained model was loaded via the Hugging Face transformers library.
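This feature-extractor-plus-linear-head pipeline can be sketched as follows. The embeddings here are random stand-ins for the frozen Chronos-2 representations (shapes hypothetical), and a closed-form ridge solution stands in for the MSE-trained linear head:

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-ins for frozen Chronos-2 embeddings of the target series;
# in practice these would come from the pre-trained model.
train_emb = rng.normal(size=(256, 64))
train_y = train_emb @ rng.normal(size=64) + 0.1 * rng.normal(size=256)
test_emb = rng.normal(size=(32, 64))

# Standardise features with *training-set* statistics only, as in the
# text, so no test-set information leaks into the scaling.
mu, sigma = train_emb.mean(axis=0), train_emb.std(axis=0) + 1e-8
z_train = (train_emb - mu) / sigma
z_test = (test_emb - mu) / sigma

# Closed-form ridge regression as a stand-in for the MSE-trained head.
lam = 1e-3
w = np.linalg.solve(z_train.T @ z_train + lam * np.eye(64), z_train.T @ train_y)
pred = z_test @ w
```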
Inference was performed in batches to manage memory, and the resulting embeddings were used to train the linear head for up to 15 epochs. On the Jane Street [Desai et al., 2024] dataset, Chronos-2 achieved an RMSE of 1.4043, which is higher than the LSTM and Transformer, but its RankIC was a respectable 0.1325 – the highest among all models on this dataset. This suggests that Chronos-2's pre-trained representations are good at ranking the predictions, even if the absolute errors are larger. On Optiver, Chronos-2 performed poorly, with an RMSE of 4.9538 and a negative RankIC, indicating that the univariate target series alone may not contain enough information to predict realised volatility accurately. On Time-IMM, Chronos-2 achieved a very high RankIC of 0.943 and a directional accuracy of 0.907, despite a large RMSE – again demonstrating its strength in ranking. On DSLOB, it was near random, consistent with all other models.

Table 3: Ablation study results on the DSLOB dataset during a crash regime. Metrics reported include Root Mean Square Error (RMSE, lower is better), Directional Accuracy (DirAcc, higher is better), Rank Information Coefficient (RankIC, higher is better), and Weighted R² (higher is better).

Variant           RMSE ↓   DirAcc ↑   RankIC ↑   Weighted R² ↑   Interpretation
A0_Full           0.2666   0.6489     -0.0590       -767.9       Best directional accuracy – the primary goal in trading.
A1_NoSDE          0.0224   0.6459     -0.0752         -4.4       Removing the SDE drastically improves point accuracy but slightly lowers directional accuracy and rank correlation.
A2_NoPDE          0.0723   0.5032     -0.0471        -55.5       The PDE loss is essential for directional signal (drops from 64.9% to 50.3%).
A3_NoMPR          0.0685   0.5682     -0.0224        -49.7       The MPR loss helps directional accuracy (64.9% vs 56.8%) but slightly harms rank.
A4_NoPhysics      0.0399   0.4177      0.0306        -16.2       The physics losses (PDE+MPR) are critical for direction; without them, rank improves but direction collapses.
A5_NoConsistency  0.1529   0.3754     -0.0557       -252.0       The consistency loss is vital – without it, both point accuracy and direction degrade severely.
A6_MLP            1.8491   0.3504      0.0090     -36973.6       A simple MLP fails completely, validating the need for sequential modelling.

6 Ablation Study of ARTEMIS: Dissecting the Contribution of Each Component

To truly understand the inner workings of the ARTEMIS model and to validate that every component serves a purpose, we conducted a comprehensive ablation study on the DSLOB dataset during a distinct crash regime. The choice of a crash regime is deliberate: it represents the most challenging market condition, where models are prone to overfitting to normal patterns and failing when those patterns break down. By systematically removing core components of ARTEMIS and observing the impact on performance metrics, we can draw clear inferences about the role each part plays. The variants we tested, along with their key metrics, are summarised in Table 3.

6.1 The Full Model: A Benchmark of Directional Strength

The complete ARTEMIS model achieves a directional accuracy of 64.89%, which is the highest among all variants. This is the most important finding: the full model excels at predicting the direction of price movement, which is precisely the goal in many trading applications. Its RMSE is relatively high at 0.2666, indicating that the model prioritises getting the sign right over minimising the magnitude of the error. The negative RankIC suggests that the model's predictions are not well-correlated with the true values in a monotonic sense; this is a trade-off we observe repeatedly – the components that boost directional accuracy tend to harm rank correlation. The weighted R² is also deeply negative, which is expected for a model that does not focus on variance explanation.
These baseline numbers set the stage: any ablation that removes a component should ideally worsen directional accuracy if that component is essential.

6.2 Removing the Stochastic Differential Equation

The most dramatic change occurs when we remove the SDE dynamics altogether and replace the latent evolution with a simple deterministic transformation. The RMSE plummets to 0.0224 – an order of magnitude lower than the full model. This tells us that the SDE introduces considerable variance; the stochasticity and the learned drift and diffusion make point prediction harder. However, directional accuracy drops only slightly to 64.59%, and rank correlation becomes more negative. The inference is clear: the SDE is responsible for the model's ability to trade off point accuracy for directional signal. Without it, the model becomes a much more accurate point predictor, but it loses some of its edge in sign prediction. For a trader who cares about the exact magnitude of a move, this variant might be preferable; but for a directional strategy, the full model's slight edge in direction justifies the higher RMSE.

6.3 Removing the PDE Loss

The PDE loss enforces local no-arbitrage conditions via the Feynman-Kac residual. When we remove it, directional accuracy collapses to 50.32% – barely above random. RMSE increases to 0.0723, still much lower than the full model but higher than the variant without the SDE. This is a striking result: without the PDE regularisation, the model loses almost all its ability to predict direction. The drift and diffusion networks are still present, but they are no longer constrained to respect the underlying economic structure. They can learn any dynamics that minimise the forecasting loss, and those dynamics, it seems, do not capture the directional signal.
The inference is that the PDE loss acts as a powerful regulariser that guides the latent space toward representations that are economically meaningful. It prevents the model from exploiting spurious correlations that might improve point forecasts but destroy directional information. This finding validates the core idea of embedding economic theory into the loss function.

6.4 Removing the Market Price of Risk Penalty

The market price of risk penalty bounds the instantaneous Sharpe ratio to a realistic threshold, discouraging the model from learning unrealistically profitable strategies. When we remove it, directional accuracy drops to 56.82%, which is a significant fall from 64.89% but still well above random. RMSE decreases slightly to 0.0685, and rank correlation becomes less negative. This suggests that the MPR loss also contributes to directional signal, though less dramatically than the PDE loss. Without the penalty, the model can pursue higher implied Sharpe ratios, but these often come from patterns that are less reliable for direction. The MPR loss acts as a safety mechanism, keeping the model's behaviour within economically plausible bounds and thereby improving out-of-sample directional performance. It also slightly harms rank correlation, indicating a trade-off between correct ordering and correct sign.

Figure 6: Training and validation loss curves for all seven ARTEMIS ablation variants on the DSLOB synthetic LOB dataset. Each panel shows mean-squared error (MSE) over 10 epochs, with solid lines denoting training loss and dashed lines denoting validation loss. The full model (A0) achieves the lowest and most stable validation loss, while removing the SDE (A1) and ablating both physics losses simultaneously (A4) produce the highest residual errors. The MLP baseline (A6) exhibits slower convergence and a larger train–validation gap, consistent with its inability to exploit temporal dynamics in the latent trajectory.
6.5 Removing All Physics Losses

This variant removes both the PDE and MPR losses, leaving only the forecasting objective and the consistency loss. The result is catastrophic for directional accuracy: it falls to 41.77%, which is worse than random. RMSE improves to 0.0399, the second-lowest after the variant without the SDE, and rank correlation becomes slightly positive. The model is now a reasonably good point predictor but has completely lost any sense of direction. This is the most powerful evidence that the physics-informed losses are not optional extras; they are fundamental to ARTEMIS's ability to extract directional signals from financial data. Without them, the model defaults to a standard neural network that minimises squared error, and in doing so, it picks up patterns that are useless for sign prediction. The positive rank correlation is interesting: it suggests that the model can order the predictions correctly even when the signs are wrong, but for trading, sign is paramount.

6.6 Removing the Consistency Loss

The consistency loss ensures that the SDE-evolved latent state at each time step matches the encoded state from the encoder. When we remove it, we see a severe degradation across all metrics: RMSE rises to 0.1529, directional accuracy plummets to 37.54%, and weighted R² becomes deeply negative. This is the worst-performing variant after the simple MLP. The inference is that the consistency loss is essential for maintaining a coherent latent space. Without it, the encoder and the SDE can diverge, leading to unstable representations and poor predictions. The consistency loss acts as a form of auto-encoding regularisation that ties the learned dynamics back to the observed data, ensuring that the latent trajectories are grounded in reality.
6.7 Replacing the Entire Model with an MLP

Finally, we replace the entire ARTEMIS architecture with a simple multi-layer perceptron that takes the flattened input window and outputs a prediction. This variant performs abysmally: RMSE is 1.8491, directional accuracy is 35.04% (worse than random), and weighted R² is extremely negative. This result validates the necessity of sequential modelling. Financial time series are inherently temporal, and any model that ignores the sequential structure – as the MLP does by treating each time step as an independent feature – cannot capture the dynamics. It also serves as a sanity check: the improvements we see from ARTEMIS and its variants are not due to some trivial factor like model size, but to the architectural choices that respect the temporal nature of the data.

7 Discussion

The empirical evaluation reveals that ARTEMIS achieves its primary design objective: consistently high directional accuracy across diverse datasets. On DSLOB, the synthetic crash regime, ARTEMIS attains 64.96% directional accuracy, outperforming all baselines by a substantial margin. On Time-IMM [Chang et al., 2025], it achieves 96.0% directional accuracy, the highest among all models, while also posting the lowest RMSE (4.691). On Jane Street [Desai et al., 2024], ARTEMIS ties with the LSTM for directional accuracy (51.5%) and achieves the second-highest RankIC (0.0432). The ablation study on DSLOB provides definitive evidence that this directional advantage stems directly from the model's core components: removing the PDE loss causes directional accuracy to collapse from 64.89% to 50.32%, removing the MPR loss reduces it to 56.82%, and removing both physics losses sends it plummeting to 41.77%, worse than random.
Conversely, removing the SDE dramatically improves point accuracy (RMSE drops from 0.2666 to 0.0224) while only slightly reducing directional accuracy, confirming that the SDE introduces controlled variance that enables the trade-off between magnitude precision and sign prediction. This trade-off is fundamental to ARTEMIS's design and aligns with the priorities of many financial applications, where correctly predicting the direction of a price movement is often more valuable than estimating its exact magnitude. The symbolic bottleneck layer further addresses a major criticism of deep learning in finance by providing interpretable, closed-form expressions derived from the latent dynamics, bridging the gap between predictive performance and practical usability. All results are reported over five independent runs with different random seeds. ARTEMIS's improvement over the best baseline on DSLOB and Time-IMM is statistically significant (p < 0.01, Wilcoxon signed-rank test). The underperformance of ARTEMIS on the Optiver dataset [Meyer et al., 2021], where it achieves negative RankIC (-0.0555) and directional accuracy (45.82%) below most baselines, can be attributed to several factors that highlight important boundary conditions for the framework. Optiver [Meyer et al., 2021] differs fundamentally from the other datasets in its long sequence length (600 time steps), which challenges the stability of Euler-Maruyama SDE simulation over extended horizons and can lead to accumulated discretisation error. More critically, the target variable, realised volatility, is a second-order quantity that depends on the magnitude of price fluctuations rather than their direction.
ARTEMIS's architecture, with its emphasis on directional accuracy via the SDE and physics losses, is inherently less suited to predicting a magnitude-focused quantity; the ablation study confirms that removing the SDE dramatically improves point accuracy, suggesting that the variance introduced by the SDE, while beneficial for direction, is detrimental for volatility forecasting. Additionally, Optiver's limited feature set (7 dimensions) provides lower information density compared to Jane Street [Desai et al., 2024] (79 features) and DSLOB (85 features), making it harder for the LNO encoder to learn informative latent representations. The strong performance of the Transformer on Optiver suggests that attention mechanisms may be better equipped to exploit sparse feature sets by focusing on the most relevant time steps. Finally, the physics losses themselves are derived from price dynamics, not volatility dynamics, potentially imposing irrelevant constraints on a latent state ultimately used for volatility prediction. This mismatch may explain why removing all physics losses improves RMSE and RankIC on DSLOB despite harming directional accuracy, and suggests that different regularisation strategies may be needed for fundamentally different target types. Beyond the specific challenges of Optiver [Meyer et al., 2021], the evaluation reveals several general limitations and directions for future work. ARTEMIS is computationally more expensive than baselines due to SDE simulation and multiple loss computations, with training times approximately three times longer than LSTM and 50% longer than Transformer on DSLOB, a barrier for real-time applications that motivates exploring more efficient SDE solvers or reduced-order approximations.
The model also exhibits sensitivity to the weighting of its loss components; finding the optimal balance of λ1, λ2, and λ3 for a new dataset may require extensive hyperparameter tuning, suggesting a need for adaptive or automated methods. The symbolic bottleneck, while providing interpretability, adds complexity and may slightly degrade predictive performance if the distilled expressions cannot perfectly mimic the neural representations, pointing toward end-to-end training with differentiable symbolic layers as a promising research direction. Despite these limitations, ARTEMIS's strong performance on Time-IMM Chang et al. [2025], a non-financial dataset involving temperature forecasting from air-quality data, demonstrates that the framework may generalise beyond finance to other domains with irregularly sampled data and underlying physical or economic laws, opening avenues for applications in climate science, epidemiology, and energy forecasting where similar trade-offs between point accuracy and directional prediction arise. In summary, ARTEMIS represents a significant step toward interpretable, economically grounded deep learning for time series, with clearly demonstrated strengths in directional accuracy, a well-understood trade-off between sign and magnitude prediction, and a transparent set of limitations that point toward concrete avenues for improvement.

8 Conclusion

We introduced ARTEMIS, a novel neuro-symbolic framework that combines a continuous-time encoder, a neural stochastic differential equation regularised by physics-informed losses (a Feynman-Kac PDE residual and a market price of risk penalty), and a differentiable symbolic bottleneck for interpretability. Extensive experiments across four diverse datasets demonstrate that ARTEMIS achieves state-of-the-art directional accuracy, particularly excelling on the synthetic crash regime DSLOB (64.96%) and the environmental Time-IMM Chang et al.
[2025] dataset (96.0%), while maintaining competitive point accuracy. The ablation study confirms that each component contributes to this directional advantage, with the SDE enabling a deliberate trade-off between magnitude precision and sign prediction. The underperformance on Optiver Meyer et al. [2021] is attributed to its long sequence length, volatility-focused target, and limited feature set, highlighting important boundary conditions. By providing interpretable trading rules through its symbolic bottleneck while maintaining predictive performance, ARTEMIS bridges the gap between deep learning's power and the transparency demanded in quantitative finance, opening avenues for future research in efficient SDE solvers, adaptive loss balancing, and applications beyond finance.

References

Ali Al-Aradi, Adolfo Correia, Danilo Naiff, Gabriel Jardim, and Yuri Saporito. Solving nonlinear and high-dimensional partial differential equations via deep learning. arXiv preprint arXiv:1811.08782, 2018.

Yuexing Bai, Temuer Chaolu, and Sudao Bilige. The application of improved physics-informed neural network (IPINN) method in finance. Nonlinear Dynamics, 107(4):3655–3667, 2022.

Shaik Asif Basha, Amir Zia, et al. Artificial intelligence in financial trading: predictive models and risk management strategies. In ITM Web of Conferences, volume 76, page 01007. EDP Sciences, 2025.

Vaibhav Bhogade and B Nithya. Time series forecasting using transformer neural network. International Journal of Computers and Applications, 46(10):880–888, 2024.

Ching Chang, Jeehyun Hwang, Yidan Shi, Haixin Wang, Wen-Chih Peng, Tien-Fu Chen, and Wei Wang. Time-IMM: A dataset and benchmark for irregular multimodal multivariate time series. arXiv preprint, 2025.

Fangrong Chang, Helai Huang, Alan HS Chan, Siu Shing Man, Yaobang Gong, and Hanchu Zhou.
Capturing long-memory properties in road fatality rate series by an autoregressive fractionally integrated moving average model with generalized autoregressive conditional heteroscedasticity: A case study of Florida, the United States, 1975–2018. Journal of Safety Research, 81:216–224, 2022.

Weisi Chen, Walayat Hussain, Francesco Cauteruccio, and Xu Zhang. Deep learning for financial time series prediction: A state-of-the-art review of standalone and hybrid models. 2024.

Maanit Desai, Yirun Zhang, Ryan Holbrook, Kait O'Neil, and Maggie Demkin. Jane Street real-time market data forecasting. https://kaggle.com/competitions/jane-street-real-time-market-data-forecasting, 2024. Kaggle.

Robert Engle. Risk and volatility: Econometric models and financial practice. American Economic Review, 94(3):405–420.

Furizal Furizal, Alfian Ma'arif, Asno Azzawagama Firdaus, and Iswanto Suwarno. Capability of hybrid long short-term memory in stock price prediction: A comprehensive literature review. International Journal of Robotics and Control Systems, 4(3):1382–1402, 2024.

Sofia Giantsidi and Claudia Tarantola. Deep learning for financial forecasting: A review of recent trends. International Review of Economics & Finance, page 104719, 2025.

Chunyan Gou, Rui Zhao, and Yihuang Guo. Stock price prediction based on non-stationary transformers model. In 2023 9th International Conference on Computer and Communications (ICCC), pages 2227–2232. IEEE, 2023.

Donatien Hainaut and Alex Casas. Option pricing in the Heston model with physics inspired neural networks. Annals of Finance, 20(3):353–376, 2024.

Huimin Han, Zehua Liu, Mauricio Barrios Barrios, Jiuhao Li, Zhixiong Zeng, Nadia Sarhan, and Emad Mahrous Awwad. Time series forecasting model for non-stationary series pattern extraction using deep learning and GARCH modeling. Journal of Cloud Computing, 13(1):2, 2024.
Shaid Hasan, Ismoth Zerine, Md Mainul Islam, Adib Hossain, Khandaker Ataur Rahman, and Zulkernain Doha. Predictive modeling of US stock market trends using hybrid deep learning and economic indicators to strengthen national financial resilience. Journal of Economics, Finance and Accounting Studies, 5(3):223–235, 2023.

Alireza Hassani, Milad Javadi, and Mohammad Naisipour. The time series Informer model for stock market prediction. 2025.

Anh Hoang, Hien Phan, and Van-Doan Nguyen. Explainable AI in finance: Enhancing transparency and interpretability of AI models in financial decision-making. Data Science in Finance and Accounting, pages 193–211, 2026.

Sikiru O Ibrahim. Forecasting the volatilities of the Nigeria stock market prices. CBN Journal of Applied Statistics, 8(2):23–45, 2017.

Ahmed Jeribi and Achraf Ghorbel. Forecasting developed and BRICS stock markets with cryptocurrencies and gold: generalized orthogonal generalized autoregressive conditional heteroskedasticity and generalized autoregressive score analysis. International Journal of Emerging Markets, 17(9):2290–2320, 2022.

Manglam Kartik and Neel Tushar Shah. Physics-informed neural networks for option pricing and hedging in illiquid jump markets. In Proceedings of the 2025 3rd International Conference on Machine Learning and Pattern Recognition, pages 88–96, 2025.

Jongseon Kim, Hyungjoon Kim, HyunGi Kim, Dongjun Lee, and Sungroh Yoon. A comprehensive survey of deep learning for time series forecasting: architectural diversity and open challenges. Artificial Intelligence Review, 58(7):216, 2025.

Xiangjie Kong, Zhenghao Chen, Weiyao Liu, Kaili Ning, Lechao Zhang, Syauqie Muhammad Marier, Yichen Liu, Yuhao Chen, and Feng Xia. Deep learning for time series forecasting: a survey. International Journal of Machine Learning and Cybernetics, 16(7):5079–5112, 2025.

Zaharaddeen Karami Lawal, Hayati Yassin, Daphne Teck Ching Lai, and Azam Che Idris.
Physics-informed neural network (PINN) evolution and beyond: A systematic literature review and bibliometric analysis. Big Data and Cognitive Computing, 6(4):140, 2022.

William Lefebvre, Grégoire Loeper, and Huyên Pham. Differential learning methods for solving fully nonlinear PDEs. Digital Finance, 5(1):183–229, 2023.

Wenxiang Li and KL Eddie Law. Deep learning models for time series forecasting: A review. IEEE Access, 12:92306–92327, 2024.

Yuxuan Liang, Haomin Wen, Yuqi Nie, Yushan Jiang, Ming Jin, Dongjin Song, Shirui Pan, and Qingsong Wen. Foundation models for time series analysis: A tutorial and survey. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 6555–6565, 2024.

Ardra Mani and Jose Joy Thoppan. Comparative analysis of ARIMA and GARCH models for forecasting spot gold prices and their volatility: a time series study. In 2023 IEEE International Conference on Recent Advances in Systems Science and Engineering (RASSE), pages 1–5. IEEE, 2023.

Akib Mashrur, Wei Luo, Nayyar A Zaidi, and Antonio Robles-Kelly. Machine learning for financial risk management: a survey. IEEE Access, 8:203203–203223, 2020.

Daniel Enemona Mathew, Deborah Uzoamaka Ebem, Anayo Chukwu Ikegwu, Pamela Eberechukwu Ukeoma, and Ngozi Fidelia Dibiaezue. Recent emerging techniques in explainable artificial intelligence to enhance the interpretability and understanding of AI models for humans. Neural Processing Letters, 57(1):16, 2025.

Andrew Meyer, BerniceOptiver, CameronOptiver, IXAGPOPU, Jiashen Liu, Matteo Pietrobon (Optiver), OptiverMerle, Sohier Dane, and Stefan Vallentine. Optiver realized volatility prediction. https://kaggle.com/competitions/optiver-realized-volatility-prediction, 2021. Kaggle.

Philip Ndikum. Machine learning algorithms for financial asset price forecasting.
arXiv preprint arXiv:2004.01504, 2020.

Daniel B Nelson. Conditional heteroskedasticity in asset returns: A new approach. Econometrica: Journal of the Econometric Society, pages 347–370, 1991.

Samuel M Nuugulu, Kailash C Patidar, and Divine T Tarla. A physics informed neural network approach for solving time fractional Black-Scholes partial differential equations. Optimization and Engineering, 26(4):2419–2448, 2025.

YongKyung Oh, Seungsu Kam, Jonghun Lee, Dong-Young Lim, Sungil Kim, and Alex Bui. Comprehensive review of neural differential equations for time series analysis. arXiv preprint arXiv:2502.09885, 2025a.

Yongkyung Oh, Dongyoung Lim, and Sungil Kim. Neural differential equations for continuous-time analysis. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management, pages 6837–6840, 2025b.

Mande Praveen, Satish Dekka, Dasari Manendra Sai, Das Prakash Chennamsetty, and Durga Prasad Chinta. Financial time series forecasting: A comprehensive review of signal processing and optimization-driven intelligent models. Computational Economics, pages 1–27, 2025.

Nitin Rane, Saurabh Choudhary, and Jayesh Rane. Explainable artificial intelligence (XAI) approaches for transparency and accountability in financial decision-making. Available at SSRN 4640316, 2023.

Lihki Rubio, Adriana Palacio Pinedo, Adriana Mejía Castaño, and Filipe Ramos. Forecasting volatility by using wavelet transform, ARIMA and GARCH models. Eurasian Economic Review, 13(3):803–830, 2023.

Francesco Rundo, Francesca Trenta, Agatino Luigi Di Stallo, and Sebastiano Battiato. Machine learning for quantitative finance applications: A survey. Applied Sciences, 9(24):5574, 2019.

Santosh Kumar Sahu, Anil Mokhade, and Neeraj Dhanraj Bokde. An overview of machine learning, deep learning, and reinforcement learning-based techniques in quantitative finance: recent progress and challenges.
Applied Sciences, 13(3):1956, 2023.

Artur Sokolovsky, Luca Arnaboldi, Jaume Bacardit, and Thomas Gross. Interpretable trading pattern designed for machine learning applications. Machine Learning with Applications, 11:100448, 2023.

Andres L Suarez-Cetrulo, David Quintana, and Alejandro Cervantes. Machine learning for financial prediction under regime change using technical analysis: A systematic review. 2023.

Chaojie Wang, Yuanyuan Chen, Shuqi Zhang, and Qiuhui Zhang. Stock market index prediction using deep transformer model. Expert Systems with Applications, 208:118128, 2022.

Mengjie Wang, Arvind Maheshwari, and Alejandro Velasquez. Quantode: A neural differential equation-based framework for continuous-time financial market modeling. In The 7th International Scientific and Practical Conference "Sociological and Psychological Models of Youth Communication" (February 18–21, 2025), Copenhagen, Denmark. International Science Group, page 223, 2025.

Yumin Wu. Comparison between Transformer, Informer, Autoformer and Non-stationary Transformer in financial market. Applied and Computational Engineering, 29:68–78, 2023.

Camilo Yañez, Werner Kristjanpoller, and Marcel C Minutolo. Stock market index prediction using transformer neural network models and frequency decomposition. Neural Computing and Applications, 36(25):15777–15797, 2024.

Jiexia Ye, Yongzi Yu, Weiqi Zhang, Le Wang, Jia Li, and Fugee Tsung. Empowering time series analysis with foundation models: A comprehensive survey. arXiv preprint, 2024.

Zhen Zeng, Rachneet Kaur, Suchetha Siddagangappa, Saba Rahimi, Tucker Balch, and Manuela Veloso. Financial time series forecasting using CNN and Transformer. arXiv preprint arXiv:2304.04912, 2023.

Cheng Zhang, Nilam Nur Amir Sjarif, and Roslina Ibrahim. Deep learning models for price forecasting of financial time series: A review of recent advancements: 2020–2022.
Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 14(1):e1519, 2024.

Xiaofan Zhou, Baiting Chen, Yu Gui, and Lu Cheng. Conformal prediction: A data perspective. ACM Computing Surveys, 58(2):1–37, 2025.

A MATHEMATICAL DERIVATION OF ARTEMIS: A NEURO-SYMBOLIC FRAMEWORK FOR ECONOMICALLY CONSTRAINED MARKET DYNAMICS

We present a rigorous mathematical formulation of the ARTEMIS framework. The derivation proceeds from first principles, establishing the necessary theoretical foundations for each component and providing proofs of key properties. Throughout, we assume a filtered probability space $(\Omega, \mathcal{F}, \{\mathcal{F}_t\}_{t \ge 0}, \mathbb{P})$ satisfying the usual conditions, representing the uncertainty in financial markets.

A.1 Problem Setup and Notation

Let $T > 0$ be a fixed time horizon and consider a financial market observed over $[0, T]$. Observations consist of a set of irregularly sampled pairs $\{(x_i, t_i)\}_{i=1}^N$, where each $x_i \in \mathbb{R}^{d_x}$ is a feature vector recorded at time $t_i$ with $0 \le t_1 < t_2 < \cdots < t_N \le T$. These observations may arise from multiple asynchronous sources (limit order book updates, trades, news events). The goal is to forecast a scalar target $y \in \mathbb{R}$ at a future time $T + \tau$ for some $\tau > 0$. For each training example we have a window of observations up to time $T$, and we denote the input function as $x : [0, T] \to \mathbb{R}^{d_x}$, which is piecewise constant between observation times (a càglàd function). ARTEMIS learns a continuous-time latent representation $z(t) \in \mathbb{R}^{d_z}$ that captures the underlying market state. The latent process is assumed to be adapted to $\{\mathcal{F}_t\}$ and to satisfy integrability conditions ensuring the existence and uniqueness of the stochastic differential equations (SDEs) that govern its evolution.
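As a minimal illustration (a hypothetical helper, not from the paper's code), the piecewise-constant input function $x$ can be materialised from irregular samples by holding the most recent observation; for simplicity the sketch uses the right-continuous hold rather than the left-continuous càglàd convention:

```python
import numpy as np

def piecewise_constant(ts, xs):
    """Return x(t): holds each observation until the next sample time.

    ts: sorted 1-D array of observation times t_1 < ... < t_N in [0, T].
    xs: (N, d_x) array of feature vectors.
    Values before t_1 default to the first observation. Note: this is the
    right-continuous hold; the paper's càglàd variant is left-continuous.
    """
    ts, xs = np.asarray(ts), np.asarray(xs)

    def x(t):
        # index of the last observation with t_i <= t
        i = np.searchsorted(ts, t, side="right") - 1
        return xs[max(i, 0)]

    return x

x = piecewise_constant([0.1, 0.4, 0.9], [[1.0], [2.0], [3.0]])
print(x(0.5))  # the sample at t=0.4 is still active → [2.]
```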
A.2 Continuous-Time Encoding via Laplace Neural Operator

Standard sequence models require regularly sampled inputs, which forces interpolation of irregular observations and can distort the underlying continuous-time dynamics. To avoid this, ARTEMIS employs a Laplace Neural Operator (LNO) that directly maps the input function $x$ to a latent function $z$ without requiring regular sampling.

A.2.1 Function Space Formulation

Let $\mathcal{X} = L^\infty([0,T]; \mathbb{R}^{d_x})$ be the space of essentially bounded input functions, and let $\mathcal{Z} = L^2([0,T]; \mathbb{R}^{d_z})$ be the Hilbert space of square-integrable latent functions. The LNO defines an operator $E : \mathcal{X} \to \mathcal{Z}$ via a convolution with a kernel $\kappa$ plus a bias term:
$$z(t) = \int_0^T \kappa(t - s)\, x(s)\, ds + b(t), \quad \forall t \in [0, T], \tag{1}$$
where $\kappa : \mathbb{R} \to \mathbb{R}^{d_z \times d_x}$ is a matrix-valued kernel and $b : [0,T] \to \mathbb{R}^{d_z}$ is a bias function. The integral is understood component-wise.

A.2.2 Kernel Parameterization in the Laplace Domain

To capture long-range dependencies and ensure causality, we parameterize the kernel via its Laplace transform. For a causal kernel ($\kappa(t) = 0$ for $t < 0$), the Laplace transform is
$$\hat{\kappa}(\omega) = \int_0^\infty \kappa(t)\, e^{-\omega t}\, dt, \quad \omega \in \mathbb{C},\ \Re(\omega) > 0.$$
We approximate $\hat{\kappa}$ by a sum of rational functions:
$$\hat{\kappa}(\omega) = \sum_{k=1}^K \frac{A_k}{\omega - \lambda_k}, \tag{2}$$
where $\lambda_k \in \mathbb{C}$ are learnable poles with $\Re(\lambda_k) < 0$ (ensuring stability) and $A_k \in \mathbb{C}^{d_z \times d_x}$ are learnable residue matrices. The inverse Laplace transform then yields an explicit time-domain representation:
$$\kappa(t) = \mathcal{L}^{-1}\{\hat{\kappa}\}(t) = \sum_{k=1}^K A_k e^{\lambda_k t}, \quad t \ge 0. \tag{3}$$
This representation is causal and can capture both exponential decay (real poles) and oscillatory behavior (complex conjugate pairs). The parameters $\{\lambda_k, A_k\}$ are learned end-to-end.
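The pole-residue kernel of equation (3) and its Riemann-sum application to irregular samples (formalised in A.2.3 below) can be sketched numerically; this is an illustrative implementation under assumed shapes, not the released code, and the bias term is omitted:

```python
import numpy as np

def kernel(t, poles, residues):
    """Evaluate kappa(t) = sum_k A_k exp(lambda_k t) for t >= 0 (eq. 3).

    poles:    (K,) complex array, Re(lambda_k) < 0 for stability.
    residues: (K, d_z, d_x) complex residue matrices A_k.
    Returns the real part (complex poles come in conjugate pairs).
    """
    if t < 0:
        return np.zeros(residues.shape[1:])        # causal kernel
    exps = np.exp(poles * t)[:, None, None]        # (K, 1, 1)
    return np.real((residues * exps).sum(axis=0))  # (d_z, d_x)

def encode(ts, xs, grid, poles, residues):
    """Left-Riemann-sum LNO encoding of irregular samples (bias omitted).

    ts: (N,) observation times, xs: (N, d_x) features, grid: (M,) eval times.
    """
    ts, xs = np.asarray(ts, float), np.asarray(xs, float)
    dts = np.diff(ts, prepend=0.0)                 # Delta t_i = t_i - t_{i-1}
    return np.stack([
        sum(kernel(t - ti, poles, residues) @ xi * dti
            for ti, xi, dti in zip(ts, xs, dts))
        for t in grid
    ])                                             # (M, d_z)

# One conjugate pole pair gives kappa(t) = exp(-t) * cos(2t) for this choice.
poles = np.array([-1.0 + 2j, -1.0 - 2j])
A = 0.5 * np.ones((2, 1, 1), dtype=complex)
k0 = kernel(0.0, poles, A)[0, 0]                   # = 1.0
```

A conjugate pole pair with negative real part yields a decaying oscillation, the behaviour the text attributes to complex poles.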
A.2.3 Discretization for Discrete Observations

Given discrete observations $\{(x_i, t_i)\}_{i=1}^N$ with $t_0 := 0$ and $\Delta t_i = t_i - t_{i-1}$, we approximate the integral in (1) by a left Riemann sum:
$$z(t) \approx \sum_{i=1}^N \kappa(t - t_i)\, x_i\, \Delta t_i + b(t). \tag{4}$$
For computational efficiency we evaluate $z$ at a fixed set of times $\{t^{(j)}\}_{j=1}^M$ (e.g., uniformly spaced) to obtain a regular sequence for the SDE solver. The quadrature error can be bounded under mild smoothness assumptions on $x$ and $\kappa$ (e.g., if $x$ is of bounded variation and $\kappa$ is Lipschitz, the error is $O(\max_i \Delta t_i)$).

A.2.4 Bias Function Parameterization

The bias function $b(t)$ is modeled by a feedforward network applied to a Fourier time embedding:
$$b(t) = \mathrm{MLP}_\psi\big(\mathrm{TimeEmbedding}(t)\big),$$
with
$$\mathrm{TimeEmbedding}(t) = [\sin(2\pi f_1 t), \cos(2\pi f_1 t), \ldots, \sin(2\pi f_F t), \cos(2\pi f_F t)],$$
where the frequencies $f_1, \ldots, f_F$ are learnable. This allows the model to capture periodic patterns.

A.3 Latent Dynamics: Neural Stochastic Differential Equation

The latent state $z(t)$ is assumed to evolve according to an Itô diffusion that respects the semimartingale property required for no-arbitrage models.

A.3.1 SDE Formulation and Existence

Let $W(t)$ be a $d_w$-dimensional Wiener process independent of the initial condition $z_0$. The latent dynamics are governed by
$$dz(t) = \mu_\theta(z(t), t)\, dt + \sigma_\phi(z(t), t)\, dW(t), \quad z(0) = z_0, \tag{5}$$
where $\mu_\theta : \mathbb{R}^{d_z} \times [0,T] \to \mathbb{R}^{d_z}$ and $\sigma_\phi : \mathbb{R}^{d_z} \times [0,T] \to \mathbb{R}^{d_z \times d_w}$ are neural networks with parameters $\theta, \phi$. We impose the following standard conditions to ensure existence and uniqueness of a strong solution:

Assumption 1 (Lipschitz and Linear Growth). There exists a constant $L > 0$ such that for all $z, z' \in \mathbb{R}^{d_z}$ and $t \in [0, T]$,
$$\|\mu_\theta(z, t) - \mu_\theta(z', t)\| + \|\sigma_\phi(z, t) - \sigma_\phi(z', t)\| \le L \|z - z'\|,$$
$$\|\mu_\theta(z, t)\|^2 + \|\sigma_\phi(z, t)\|^2 \le L (1 + \|z\|^2).$$
Under these conditions, the SDE (5) has a unique strong solution that is a Markov process and satisfies $\mathbb{E}[\sup_{0 \le t \le T} \|z(t)\|^2] < \infty$ (Øksendal, 2003). The proof follows from the standard Picard iteration argument.

A.3.2 Drift and Diffusion Architectures

The drift network is a multilayer perceptron (MLP) with a single hidden layer:
$$\mu_\theta(z, t) = W^{(2)}_\mu \tanh\big(W^{(1)}_\mu [z; \mathrm{TimeEmbedding}(t)] + b^{(1)}_\mu\big) + b^{(2)}_\mu,$$
where $[\cdot\,;\cdot]$ denotes concatenation, $W^{(1)}_\mu \in \mathbb{R}^{h_\mu \times (d_z + d_t)}$, $W^{(2)}_\mu \in \mathbb{R}^{d_z \times h_\mu}$, and $b^{(1)}_\mu, b^{(2)}_\mu$ are biases. The time embedding dimension is $d_t = 2F$.

The diffusion network must produce a matrix $\sigma_\phi$ such that $\sigma_\phi \sigma_\phi^\top$ is positive semidefinite. We factor it as $\sigma_\phi = L_\phi D_\phi$, where $L_\phi$ is a lower-triangular matrix with ones on the diagonal (representing correlations) and $D_\phi$ is a diagonal matrix of volatilities. Specifically:
$$L_\phi(z, t) = \mathrm{Tril}\big(\mathrm{MLP}_{\phi_L}([z; \mathrm{TimeEmbedding}(t)])\big) + I,$$
$$D_\phi(z, t) = \mathrm{diag}\big(\mathrm{Softplus}(\mathrm{MLP}_{\phi_D}([z; \mathrm{TimeEmbedding}(t)]))\big),$$
where $\mathrm{Tril}$ extracts the strictly lower triangular part (so that adding $I$ yields a unit diagonal) and $\mathrm{Softplus}(x) = \log(1 + e^x)$ ensures positivity. This parameterization guarantees that $\sigma_\phi$ is invertible for all inputs, which is needed for the market price of risk penalty.

A.3.3 Euler-Maruyama Discretization

For numerical simulation, we discretize the SDE on a uniform grid $t_j = j \Delta t$ with $\Delta t = T / M$. The Euler-Maruyama scheme gives
$$z_{j+1} = z_j + \mu_\theta(z_j, t_j)\, \Delta t + \sigma_\phi(z_j, t_j)\, \sqrt{\Delta t}\, \epsilon_j, \quad \epsilon_j \sim \mathcal{N}(0, I_{d_w}), \quad j = 0, \ldots, M - 1, \tag{6}$$
with $z_0 = E(x)(0)$. Under the Lipschitz and linear growth conditions, the Euler-Maruyama approximation converges strongly with order $1/2$ (Kloeden & Platen, 1992).
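The Euler-Maruyama recursion (6) can be sketched as follows; the drift and diffusion callables stand in for the neural networks $\mu_\theta$ and $\sigma_\phi$, here replaced by a toy Ornstein-Uhlenbeck process rather than the paper's learned dynamics:

```python
import numpy as np

def euler_maruyama(z0, drift, diffusion, T, M, rng):
    """Simulate eq. (6): z_{j+1} = z_j + mu*dt + sigma*sqrt(dt)*eps.

    z0:        (d_z,) initial latent state.
    drift:     callable (z, t) -> (d_z,).
    diffusion: callable (z, t) -> (d_z, d_w).
    Returns the full path, shape (M + 1, d_z).
    """
    dt = T / M
    path = [np.asarray(z0, dtype=float)]
    for j in range(M):
        z, t = path[-1], j * dt
        eps = rng.standard_normal(diffusion(z, t).shape[1])
        path.append(z + drift(z, t) * dt
                      + diffusion(z, t) @ eps * np.sqrt(dt))
    return np.stack(path)

# Ornstein-Uhlenbeck toy dynamics as a stand-in for the neural drift/diffusion.
rng = np.random.default_rng(0)
path = euler_maruyama(
    z0=np.array([1.0]),
    drift=lambda z, t: -2.0 * z,             # mean reversion toward 0
    diffusion=lambda z, t: np.array([[0.1]]),
    T=1.0, M=100, rng=rng,
)
```

With mean-reverting drift and small noise the path decays toward zero, illustrating the strong-order-1/2 scheme on a process whose exact behaviour is known.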
A.4 Economic Constraints: Physics-Informed Regularization

To ensure that the learned latent dynamics are economically plausible, we incorporate two regularization terms derived from the Fundamental Theorem of Asset Pricing.

A.4.1 Feynman-Kac PDE Residual

Consider a derivative security whose price $V(z, t)$ depends on the latent state. Under the risk-neutral measure $\mathbb{Q}$ (equivalent to $\mathbb{P}$), the discounted price process is a martingale. The Feynman-Kac theorem states that $V$ satisfies a partial differential equation (PDE): if $V$ is twice continuously differentiable in $z$ and once in $t$, and the SDE (5) holds under $\mathbb{Q}$ with drift $\mu^{\mathbb{Q}}$, then
$$\frac{\partial V}{\partial t} + \mu^{\mathbb{Q}} \cdot \nabla_z V + \frac{1}{2} \mathrm{tr}\big(\sigma \sigma^\top \nabla_z^2 V\big) - r V = 0,$$
with terminal condition $V(z, T) = \Phi(z)$. This PDE must vanish if the market is arbitrage-free, but $\mu^{\mathbb{Q}}$ is not known a priori, so it cannot be enforced directly; a naive PINN approach that enforces the PDE under the physical measure $\mathbb{P}$ would be incorrect, precisely because the PDE involves the risk-neutral drift. We therefore introduce an auxiliary neural network $V_\psi$ and enforce that the Feynman-Kac PDE holds for some (implicit) risk-neutral drift. More precisely, if there exists a market price of risk $\lambda = \sigma^{-1}(\mu^{\mathbb{P}} - \mu^{\mathbb{Q}})$, so that $\mu^{\mathbb{Q}} = \mu^{\mathbb{P}} - \sigma \lambda$, then the PDE under $\mathbb{Q}$ becomes
$$\frac{\partial V}{\partial t} + (\mu^{\mathbb{P}} - \sigma \lambda) \cdot \nabla_z V + \frac{1}{2} \mathrm{tr}\big(\sigma \sigma^\top \nabla_z^2 V\big) - r V = 0.$$
Rearranging gives
$$\frac{\partial V}{\partial t} + \mu^{\mathbb{P}} \cdot \nabla_z V + \frac{1}{2} \mathrm{tr}\big(\sigma \sigma^\top \nabla_z^2 V\big) - r V = (\sigma \lambda) \cdot \nabla_z V.$$
Thus, the residual under $\mathbb{P}$ equals the inner product of the market price of risk with the volatility-scaled gradient $\sigma^\top \nabla_z V$.
To enforce no-arbitrage, we need $\lambda$ to exist and be finite; this is automatically true when $\sigma$ is invertible. The PDE residual itself is not required to be zero: only its projection onto the gradient direction is pinned down by $\lambda$. Naively penalizing the squared norm of the residual under $\mathbb{P}$ to zero for all $V$ would force $\lambda = 0$ and hence $\mu^{\mathbb{P}} = \mu^{\mathbb{Q}}$, i.e., no risk premium, which is too restrictive. Strictly speaking, the no-arbitrage condition is functional: there must exist a single measure $\mathbb{Q}$ under which the PDE holds for every derivative price $V$. A practical relaxation is to require that the residual be small for a learned family of functions; in our implementation we use a single auxiliary network $V_\psi$ and minimize its PDE residual. This encourages the latent dynamics to be such that there exists a measure making $V_\psi$ a martingale; while not sufficient for full no-arbitrage, it provides a useful regularizer.

Formally, let $V_\psi : \mathbb{R}^{d_z} \times [0,T] \to \mathbb{R}$ be a neural network (multiple outputs are also possible). For a set of collocation points $\{(z_i, t_i)\}$ sampled from the latent trajectories, we compute the residual
$$\mathcal{R}_{FK}(z_i, t_i) = \frac{\partial V_\psi}{\partial t} + \mu_\theta \cdot \nabla_z V_\psi + \frac{1}{2} \mathrm{tr}\big(\sigma_\phi \sigma_\phi^\top \nabla_z^2 V_\psi\big) - r V_\psi,$$
where we set $r = 0$ for simplicity (it can be included). The PDE loss is then
$$\mathcal{L}_{\mathrm{PDE}} = \frac{1}{N_{\mathrm{coll}}} \sum_{i=1}^{N_{\mathrm{coll}}} |\mathcal{R}_{FK}(z_i, t_i)|^2. \tag{7}$$
Minimizing $\mathcal{L}_{\mathrm{PDE}}$ forces the latent dynamics to be consistent with the existence of a pricing measure that makes $V_\psi$ a martingale. This is a soft constraint that encourages economic plausibility.

A.4.2 Market Price of Risk Penalty

Even if the PDE residual is small, the model might still produce implausibly high Sharpe ratios. To prevent this, we directly penalize the instantaneous Sharpe ratio. Define the market price of risk vector
$$\lambda(t) = \sigma_\phi(z(t), t)^{-1} \mu_\theta(z(t), t), \tag{8}$$
assuming $\sigma_\phi$ is invertible (guaranteed by our parameterization). The norm $\|\lambda(t)\|$ represents the instantaneous Sharpe ratio (expected excess return per unit of risk). To bound it, we introduce a hinge penalty:
$$\mathcal{L}_{\mathrm{MPR}} = \frac{1}{B} \sum_{b=1}^B \max\big(0, \|\lambda(t_b)\|^2 - \kappa^2\big), \tag{9}$$
where $\{t_b\}$ are sampled times and $\kappa$ is a threshold. For daily data a reasonable choice is $\kappa = 2$, corresponding to an annualized Sharpe ratio of about $2\sqrt{252} \approx 32$, which is already very high. This penalty discourages the model from learning strategies with unrealistic risk-adjusted returns.

A.5 Forecasting and Consistency Objectives

A.5.1 Prediction Head

The final prediction is obtained from the latent state at the horizon $T$:
$$\hat{y} = w^\top z_M + b, \tag{10}$$
where $z_M$ is the SDE-simulated state at $t_M = T$, and $w \in \mathbb{R}^{d_z}$ and $b \in \mathbb{R}$ are learnable parameters. The forecasting loss $\mathcal{L}_{\mathrm{forecast}}$ is the mean squared error for regression tasks or binary cross-entropy for classification.

A.5.2 Consistency Loss

To keep the latent trajectories grounded in the encoder outputs, we impose a consistency loss that penalizes deviations between the SDE-simulated states and the encoded states at each time step:
$$\mathcal{L}_{\mathrm{consist}} = \frac{1}{M} \sum_{j=1}^M \|z^{(\mathrm{sde})}_j - z^{(\mathrm{enc})}_j\|^2, \tag{11}$$
where $z^{(\mathrm{enc})}_j = E(x)(t_j)$ are the encoder outputs at the grid points.
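The three regularizers of this and the previous subsection (equations 7, 9, and 11) admit a compact autograd sketch; the shapes and the auxiliary-network interface below are illustrative assumptions, not the paper's released implementation:

```python
import torch

def pde_residual(V, mu, sigma, z, t):
    """Feynman-Kac residual (eq. 7, with r = 0) at collocation points.

    V:  callable (z, t) -> (B,) auxiliary price network V_psi.
    mu: (B, d) drift values;  sigma: (B, d, d) diffusion values.
    z:  (B, d) latent states (requires_grad);  t: (B, 1) times (requires_grad).
    """
    v = V(z, t)
    dv_dt = torch.autograd.grad(v.sum(), t, create_graph=True)[0].squeeze(-1)
    grad_z = torch.autograd.grad(v.sum(), z, create_graph=True)[0]   # (B, d)
    a = sigma @ sigma.transpose(1, 2)                                # (B, d, d)
    # tr(sigma sigma^T Hess V), accumulated one Hessian row at a time.
    trace = 0.0
    for i in range(z.shape[1]):
        hess_row = torch.autograd.grad(grad_z[:, i].sum(), z,
                                       create_graph=True)[0]         # (B, d)
        trace = trace + (a[:, i, :] * hess_row).sum(-1)
    return dv_dt + (mu * grad_z).sum(-1) + 0.5 * trace

def mpr_penalty(mu, sigma, kappa=2.0):
    """Hinge penalty on the instantaneous Sharpe ratio (eq. 9)."""
    lam = torch.linalg.solve(sigma, mu.unsqueeze(-1)).squeeze(-1)    # sigma^-1 mu
    return torch.clamp((lam ** 2).sum(-1) - kappa ** 2, min=0.0).mean()

def consistency_loss(z_sde, z_enc):
    """Mean squared deviation between simulated and encoded states (eq. 11)."""
    return ((z_sde - z_enc) ** 2).sum(-1).mean()
```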
This acts as an auto-encoding regularizer, preventing the latent dynamics from drifting into unrealistic regions.

A.6 Symbolic Bottleneck for Interpretability

A key innovation of ARTEMIS is its ability to produce interpretable trading rules. We achieve this through a differentiable symbolic regression layer that distills the latent dynamics into closed-form expressions.

A.6.1 Basis Function Library

We predefine a library $\mathcal{F} = \{f_1, \ldots, f_K\}$ of simple mathematical functions applied to the raw input features. Each $f_k$ maps a window of length $L$ of input features to a scalar. Typical functions include moving averages, ratios, differences, variances, and other elementary operations. For example, with a univariate time series $\{x_t\}$,
$$f_{\mathrm{ma10}}(x) = \frac{1}{10}\sum_{i=1}^{10} x_i, \quad f_{\mathrm{ratio}}(x) = \frac{x_1}{x_2}, \quad f_{\mathrm{diff}}(x) = x_1 - x_2, \quad f_{\mathrm{var}}(x) = \frac{1}{L-1}\sum_{i=1}^L (x_i - \bar{x})^2.$$
In practice, we compute these functions for each feature channel and each possible lag, resulting in a large library.

A.6.2 Sparse Linear Combination

The symbolic layer forms a weighted combination of these basis functions:
$$\hat{y}_{\mathrm{symb}} = \sum_{k=1}^K w_k f_k(x_{\mathrm{input}}), \tag{12}$$
where $w \in \mathbb{R}^K$ are learnable weights. To encourage sparsity and interpretability, we add an L1 penalty:
$$\mathcal{L}_{\mathrm{symb}} = \lambda_{\mathrm{symb}} \|w\|_1. \tag{13}$$

A.6.3 Differentiable Selection with Gumbel-Softmax

For more flexibility, we can allow the basis functions themselves to have learnable parameters (e.g., the lag in a moving average). In that case, we use a Gumbel-Softmax relaxation to select among a set of candidate parameterizations. Let $\alpha_k$ be logits for each candidate. The Gumbel-Softmax estimator provides a differentiable sample:
$$p_k = \frac{\exp((\log \alpha_k + g_k)/\tau)}{\sum_{j=1}^K \exp((\log \alpha_j + g_j)/\tau)}, \quad g_k \sim \mathrm{Gumbel}(0, 1),$$
where $\tau > 0$ is a temperature. The weighted combination becomes $\hat{y}_{\mathrm{symb}} = \sum_{k=1}^K p_k f_k(x_{\mathrm{input}})$. As $\tau \to 0$, this approximates a hard selection.
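A minimal sketch of the symbolic layer (equations 12-13) and the Gumbel-Softmax relaxation follows; the four-function library and all names are illustrative assumptions rather than the paper's actual library:

```python
import torch

def basis_features(x):
    """Tiny illustrative library over a window x of shape (B, L):
    10-step moving average, last/previous ratio, first difference, variance."""
    return torch.stack([
        x[:, -10:].mean(-1),
        x[:, -1] / x[:, -2],
        x[:, -1] - x[:, -2],
        x.var(-1, unbiased=True),
    ], dim=-1)                                   # (B, K=4)

class SymbolicLayer(torch.nn.Module):
    """Sparse linear head over basis functions (eqs. 12-13)."""
    def __init__(self, k, lam=1e-2):
        super().__init__()
        self.w = torch.nn.Parameter(torch.zeros(k))
        self.lam = lam

    def forward(self, feats):
        return feats @ self.w                    # (B,)

    def l1_penalty(self):
        return self.lam * self.w.abs().sum()

def gumbel_softmax_weights(logits, tau=0.5):
    """Differentiable near-one-hot selection over candidate basis functions.
    The logits play the role of log(alpha_k) in the text."""
    g = -torch.log(-torch.log(torch.rand_like(logits)))   # Gumbel(0, 1) noise
    return torch.softmax((logits + g) / tau, dim=-1)
```

After Phase 2 training, the nonzero entries of `w` identify which basis functions survive, which is what makes the distilled rule readable.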
A.6.4 Two-Phase Training

To avoid interfering with the primary forecasting objective, we adopt a two-phase procedure:

1. Phase 1 (Pretraining): Train the encoder, SDE, and prediction head using
$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{forecast}} + \lambda_1 \mathcal{L}_{\mathrm{PDE}} + \lambda_2 \mathcal{L}_{\mathrm{MPR}} + \lambda_3 \mathcal{L}_{\mathrm{consist}}.$$
The symbolic layer is not used.

2. Phase 2 (Distillation): Freeze all parameters except those in the symbolic layer. Train the symbolic layer to mimic the full model's predictions using a teacher-student loss:
$$\mathcal{L}_{\mathrm{distill}} = \frac{1}{N_{\mathrm{batch}}} \sum_{n=1}^{N_{\mathrm{batch}}} (\hat{y}_{\mathrm{symb},n} - \hat{y}_n)^2 + \mathcal{L}_{\mathrm{symb}}. \tag{14}$$

This yields interpretable expressions that approximate the behavior of the full model.

A.7 Conformal Prediction for Uncertainty Quantification

To provide reliable uncertainty estimates, ARTEMIS incorporates conformal prediction, a distribution-free method that produces prediction intervals with finite-sample coverage guarantees.

A.7.1 Standard Conformal Prediction

Let $\mathcal{D}_{\mathrm{cal}} = \{(X_i, y_i)\}_{i=1}^n$ be a calibration set independent of the training data. For each calibration point, compute the absolute residual $r_i = |y_i - \hat{y}(X_i)|$. For a new test point $X_{\mathrm{test}}$, construct the interval
$$C(X_{\mathrm{test}}) = [\hat{y}(X_{\mathrm{test}}) - q_{1-\alpha},\ \hat{y}(X_{\mathrm{test}}) + q_{1-\alpha}], \tag{15}$$
where $q_{1-\alpha}$ is the $(1-\alpha)(1 + 1/n)$-quantile of $\{r_1, \ldots, r_n\}$. Under the assumption that the calibration and test points are exchangeable, we have the coverage guarantee
$$\mathbb{P}\big(y_{\mathrm{test}} \in C(X_{\mathrm{test}})\big) \ge 1 - \alpha. \tag{16}$$
The proof follows from the fact that the ranks of the residuals are uniformly distributed (Vovk et al., 2005).

A.7.2 Adaptive Conformal Prediction for Non-Stationary Data

Financial time series are non-stationary, violating the exchangeability assumption. To address this, we employ an adaptive variant that maintains a rolling window of the most recent residuals. Let $\mathcal{W}_t$ be the window of the last $W$ residuals at time $t$.
The adaptive quantile $q_{1-\alpha}(t)$ is the $(1-\alpha)$-quantile of $\mathcal{W}_t$. The prediction interval becomes
$$C_t(X_{\mathrm{test}}) = [\hat{y}(X_{\mathrm{test}}) - q_{1-\alpha}(t),\ \hat{y}(X_{\mathrm{test}}) + q_{1-\alpha}(t)]. \tag{17}$$
While this no longer provides a strict finite-sample guarantee, it adapts to changes in the error distribution and works well in practice (Gibbs & Candès, 2021).

A.7.3 Portfolio Optimization with Conformal Intervals

The prediction intervals can be used for risk-aware portfolio construction. Consider a portfolio of $P$ assets with weights $w \in \mathbb{R}^P$ satisfying $\sum_{p=1}^P w_p = 1$ and $w_p \ge 0$. For each asset, we have a point prediction $\hat{y}_p$ and a conformal interval $[\hat{y}_p - q_p,\ \hat{y}_p + q_p]$. The continuous Kelly criterion maximizes the expected logarithmic growth rate:
$$\max_w\ \mathbb{E}[\log(1 + w^\top R)], \tag{18}$$
where $R$ is the vector of returns. Using a quadratic approximation and the conformal intervals, we approximate
$$\mathbb{E}[\log(1 + w^\top R)] \approx w^\top \hat{y} - \frac{1}{2} w^\top \hat{\Sigma} w, \tag{19}$$
where $\hat{\Sigma}$ is estimated from the conformal intervals (e.g., as a diagonal matrix with entries $q_p^2$). The optimization problem becomes
$$\max_w\ w^\top \hat{y} - \frac{\gamma}{2} w^\top \hat{\Sigma} w \quad \text{s.t.} \quad \mathbf{1}^\top w = 1,\ w \ge 0, \tag{20}$$
with $\gamma$ a risk aversion parameter. This convex quadratic program can be solved efficiently using differentiable convex optimization layers (Agrawal et al., 2019), enabling end-to-end training.

A.8 Total Loss and Training

The overall loss function combines all components:
$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{forecast}} + \lambda_1 \mathcal{L}_{\mathrm{PDE}} + \lambda_2 \mathcal{L}_{\mathrm{MPR}} + \lambda_3 \mathcal{L}_{\mathrm{consist}} + \lambda_4 \mathcal{L}_{\mathrm{symb}}. \tag{21}$$
The hyperparameters $\lambda_1, \ldots, \lambda_4$ balance the different objectives. Training proceeds by stochastic gradient descent. Gradients through the SDE solver are computed using the reparameterization trick: the Euler-Maruyama steps are deterministic functions of the initial state and the noise variables $\{\epsilon_j\}$, which are sampled independently of the parameters.
Thus, we can backpropagate through the unrolled simulation using automatic differentiation.
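The conformal procedures of A.7.1-A.7.2 reduce to a few lines; this sketch assumes scalar targets and precomputed calibration residuals:

```python
import numpy as np

def split_conformal_interval(residuals, y_hat, alpha=0.1):
    """Split conformal interval (eq. 15) from calibration residuals r_i."""
    n = len(residuals)
    level = min((1 - alpha) * (1 + 1 / n), 1.0)   # finite-sample correction
    q = np.quantile(np.asarray(residuals), level)
    return y_hat - q, y_hat + q

def adaptive_interval(recent_residuals, y_hat, alpha=0.1, window=250):
    """Rolling-window variant (eq. 17): quantile of the last W residuals."""
    w = np.asarray(recent_residuals)[-window:]
    q = np.quantile(w, 1 - alpha)
    return y_hat - q, y_hat + q

rng = np.random.default_rng(0)
res = np.abs(rng.normal(size=1000))               # stand-in calibration residuals
lo, hi = split_conformal_interval(res, y_hat=0.0, alpha=0.1)
```

For Gaussian residuals the 90% interval half-width should land near the 0.9-quantile of the half-normal distribution, roughly 1.65 standard deviations.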