ARTEMIS: A Neuro-Symbolic Framework for Economically Constrained Market Dynamics
Rahul D Ray
Department of Electronics and Electrical Engineering
BITS Pilani, Hyderabad Campus
f20242213@hyderabad.bits-pilani.ac.in

ABSTRACT

Deep learning models in quantitative finance often operate as black boxes, lacking interpretability and failing to incorporate fundamental economic principles such as no-arbitrage constraints. This paper introduces ARTEMIS (Arbitrage-free Representation Through Economic Models & Interpretable Symbolics), a novel neuro-symbolic framework that combines a continuous-time encoder based on a Laplace Neural Operator, a neural stochastic differential equation regularised by physics-informed losses, and a differentiable symbolic bottleneck that distils interpretable trading rules. The model enforces economic plausibility through two novel regularisation terms: a Feynman-Kac PDE residual that penalises violations of local no-arbitrage conditions, and a market-price-of-risk penalty that bounds the instantaneous Sharpe ratio to realistic values. We evaluate ARTEMIS against six strong baselines, including LSTM, Transformer, NS-Transformer, Informer, Chronos-2, and XGBoost, on four diverse datasets: Jane Street (anonymised market data) [Desai et al., 2024], Optiver (limit order book volatility prediction) [Meyer et al., 2021], Time-IMM (environmental temperature forecasting) [Chang et al., 2025], and DSLOB (a synthetic crash regime). Results demonstrate that ARTEMIS achieves state-of-the-art directional accuracy, outperforming all baselines on DSLOB (64.96%) and Time-IMM (96.0%), while remaining competitive on point accuracy metrics.
A comprehensive ablation study on DSLOB confirms that each component contributes to this directional advantage: removing the PDE loss reduces directional accuracy from 64.89% to 50.32%, and removing both physics losses collapses it to 41.77%, worse than random. The underperformance on Optiver [Meyer et al., 2021] is attributed to its long sequence length, volatility-focused target, and limited feature set, highlighting important boundary conditions. By providing interpretable, economically grounded predictions without sacrificing performance, ARTEMIS bridges the gap between deep learning's predictive power and the transparency demanded in quantitative finance, opening new avenues for trustworthy AI in high-stakes financial applications.

1 Introduction

The application of deep learning to financial time series prediction has witnessed explosive growth over the past decade, driven by the increasing availability of high-frequency market data and the remarkable success of neural networks in capturing complex temporal patterns. From return forecasting and volatility prediction to algorithmic trading and risk management, deep learning models have demonstrated superior performance compared to traditional econometric approaches such as ARIMA and GARCH [Rundo et al., 2019, Ndikum, 2020]. Recent comprehensive surveys [Chen et al., 2024, Zhang et al., 2024, Giantsidi and Tarantola, 2025] document the rapid evolution of architectures from standalone models such as Long Short-Term Memory (LSTM) networks and Convolutional Neural Networks (CNNs) to sophisticated hybrid systems that combine multiple techniques. However, despite these advances, the adoption of deep learning in high-stakes financial applications remains hampered by three fundamental challenges that the literature has consistently identified but not yet fully resolved. The first and most widely recognised challenge is the lack of interpretability.
As financial institutions operate under strict regulatory oversight, the opacity of deep learning models, often described as black boxes, creates significant legal, ethical, and operational risks [Rane et al., 2023, Hoang et al., 2026]. Stakeholders, including regulators and risk managers, require transparent explanations for model decisions, particularly when those decisions involve substantial capital allocation or have the potential to impact market stability [Mathew et al., 2025]. While explainable artificial intelligence (XAI) techniques such as SHAP and LIME have been proposed to provide post-hoc explanations [Sokolovsky et al., 2023, Basha et al., 2025], these methods offer only approximate insights and do not address the underlying opacity of the model architecture itself. Moreover, they introduce an inherent trade-off between predictive accuracy and interpretability, where simpler, more transparent models often sacrifice the very complexity that enables superior performance [Rane et al., 2023, Mathew et al., 2025]. This tension between accuracy and transparency remains a central obstacle to deploying deep learning in production trading systems. The second challenge concerns the integration of economic principles into data-driven models. Traditional financial models are built upon foundational theories such as the absence of arbitrage, which ensures that asset prices cannot be risklessly exploited for profit. Yet most deep learning approaches are trained purely on historical data, learning correlations without any regard for these economic constraints [Mashrur et al., 2020, Sahu et al., 2023]. As a consequence, they can discover spurious patterns that lead to implausible predictions, particularly when market conditions deviate from the training distribution [Suarez-Cetrulo et al., 2023, Hasan et al., 2023].
Recent work has begun to address this gap through physics-informed neural networks (PINNs), which embed governing equations into the loss function to enforce consistency with physical or financial laws. Pioneering studies have applied PINNs to option pricing models, including the Black-Scholes equation and the Heston model [Bai et al., 2022, Hainaut and Casas, 2024, Nuugulu et al., 2025], demonstrating improved accuracy and stability. Extensions have tackled jump-diffusion models with liquidity costs [Kartik and Shah, 2025] and fully nonlinear PDEs relevant to portfolio optimisation [Lefebvre et al., 2023]. However, these applications have focused primarily on solving known pricing equations rather than learning latent dynamics from data while simultaneously enforcing economic constraints. The synthesis of data-driven learning with physics-informed regularisation for forecasting tasks remains largely unexplored. The third challenge relates to the continuous-time nature of financial markets and the non-stationarity of financial time series. Traditional discrete-time models struggle to process irregularly sampled, high-frequency data without interpolation, which can distort the underlying dynamics [Han et al., 2024, Wang et al., 2025]. Moreover, market regimes shift over time, and models that perform well during stable periods often fail catastrophically during crises [Suarez-Cetrulo et al., 2023, Hasan et al., 2023]. Neural stochastic differential equations (neural SDEs) offer a promising continuous-time framework that can naturally accommodate irregular observations and capture uncertainty through stochastic dynamics [Wang et al., 2025]. Similarly, advances in handling non-stationarity, such as the Non-stationary Transformer [Zhang et al., 2024], have improved robustness to distribution shifts. Yet these approaches remain disconnected from the interpretability and economic-constraint challenges described above.
In this paper, we introduce ARTEMIS (Arbitrage-free Representation Through Economic Models & Interpretable Symbolics), a novel neuro-symbolic framework that addresses all three challenges within a unified architecture. ARTEMIS makes the following key contributions:

• Continuous-time encoding via Laplace Neural Operator. Unlike standard recurrent or transformer models that require regularly sampled inputs, our encoder operates directly on irregularly spaced observations, preserving the true temporal structure of limit order book updates and trade reports without interpolation [Wang et al., 2025, Han et al., 2024].

• Economics-informed latent dynamics through neural SDEs with physics-constrained regularisation. We introduce two novel loss terms derived from the Fundamental Theorem of Asset Pricing: a Feynman-Kac PDE residual that enforces local no-arbitrage conditions in the latent space, and a market-price-of-risk penalty that bounds the instantaneous Sharpe ratio to realistic values. While previous work has applied PINNs to pricing equations [Bai et al., 2022, Hainaut and Casas, 2024, Nuugulu et al., 2025, Kartik and Shah, 2025] and neural SDEs to continuous-time modelling [Wang et al., 2025], ours is the first to embed such economic constraints directly into the learning of latent dynamics for forecasting.

• Differentiable symbolic bottleneck for interpretability. Rather than relying on post-hoc explanations [Sokolovsky et al., 2023, Basha et al., 2025], we design a layer that distils the latent dynamics into closed-form, interpretable expressions through differentiable symbolic regression. This provides inherent transparency while maintaining end-to-end trainability, offering a novel resolution to the accuracy-interpretability trade-off [Rane et al., 2023, Mathew et al., 2025].

• Conformal prediction for uncertainty quantification.
To support risk-aware decision making, we equip ARTEMIS with an adaptive conformal prediction layer that provides rigorously calibrated prediction intervals, addressing the need for reliable uncertainty estimates in financial applications [Mashrur et al., 2020, Sahu et al., 2023].

We evaluate ARTEMIS against six strong baselines spanning recurrent architectures (LSTM) [Furizal et al., 2024], transformer variants (Transformer, NS-Transformer, Informer) [Chen et al., 2024, Zhang et al., 2024], foundation models (Chronos-2) [Giantsidi and Tarantola, 2025], and gradient boosting (XGBoost) [Praveen et al., 2025], across four diverse datasets: Jane Street's anonymised market data [Desai et al., 2024], Optiver's limit order book volatility prediction task [Meyer et al., 2021], the Time-IMM environmental temperature forecasting benchmark [Chang et al., 2025], and DSLOB, a novel synthetic dataset we introduce that simulates an amplified market crash regime. Our results demonstrate that ARTEMIS achieves state-of-the-art directional accuracy, outperforming all baselines on DSLOB (64.96%) and Time-IMM (96.0%), while remaining competitive on point accuracy metrics. A comprehensive ablation study confirms that each component contributes to this directional advantage: removing the PDE loss reduces directional accuracy from 64.89% to 50.32%, and removing both physics losses collapses it to 41.77%, worse than random. The underperformance on Optiver [Meyer et al., 2021], where ARTEMIS achieves negative RankIC (-0.0555) and directional accuracy (45.82%) below most baselines, is attributed to its long sequence length, volatility-focused target, and limited feature set, highlighting important boundary conditions for the framework and directions for future research.
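To make the two economics-informed regularisers concrete, a plausible simplified instantiation can be written down for a scalar latent state z with learned drift mu_theta and diffusion sigma_theta, a risk-free rate r, and a latent value head V_phi(t, z). The notation (mu_theta, sigma_theta, V_phi, lambda_max) is ours for illustration; the paper's exact multivariate form may differ:

```latex
% Feynman-Kac residual: local no-arbitrage for the latent value surface
\mathcal{L}_{\mathrm{PDE}}
  = \mathbb{E}\!\left[\Big(
      \partial_t V_\phi
      + \mu_\theta\,\partial_z V_\phi
      + \tfrac{1}{2}\,\sigma_\theta^{2}\,\partial_{zz} V_\phi
      - r\,V_\phi
    \Big)^{2}\right]
\qquad
% Market price of risk: bound the instantaneous Sharpe ratio by \lambda_{\max}
\mathcal{L}_{\mathrm{MPR}}
  = \mathbb{E}\!\left[\max\!\left(0,\;
      \frac{\lvert \mu_\theta - r \rvert}{\sigma_\theta} - \lambda_{\max}
    \right)^{2}\right]
```

The first term drives the latent dynamics toward satisfying the Feynman-Kac PDE (whose residual vanishes for arbitrage-free dynamics), while the second penalises drift/diffusion pairs whose implied market price of risk exceeds a realistic Sharpe-ratio cap.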
By providing interpretable, economically grounded predictions without sacrificing predictive performance, ARTEMIS bridges the gap between deep learning's power and the transparency demanded in quantitative finance. To the best of our knowledge, no existing work combines neural SDEs, physics-informed losses, and differentiable symbolic regression in a unified end-to-end framework for financial time series forecasting. The code and data will be made publicly available upon publication to facilitate reproducibility and future research.

2 Related Work

ARTEMIS draws upon and contributes to several interconnected research areas: deep learning for financial time series forecasting, interpretable machine learning and explainable AI, physics-informed neural networks and neural differential equations, and continuous-time modeling of financial dynamics. This section reviews the state of the art in each area and situates our contributions within the broader literature.

2.1 Deep Learning for Financial Time Series Forecasting

The application of deep learning to financial time series has been extensively surveyed in recent years, reflecting the rapid growth of the field [Chen et al., 2024, Zhang et al., 2024, Giantsidi and Tarantola, 2025]. Early work focused on standalone recurrent architectures, particularly Long Short-Term Memory (LSTM) networks, which became popular due to their ability to capture temporal dependencies in sequential data [Furizal et al., 2024]. Convolutional Neural Networks (CNNs) have also been widely adopted for their hierarchical feature learning capabilities, especially when combined with signal processing techniques to handle the non-linear and non-stationary nature of financial data [Praveen et al., 2025].
Hybrid models that combine multiple architectures, such as CNN-LSTM or attention-augmented recurrent networks, have shown improved performance by leveraging the strengths of different components [Chen et al., 2024, Furizal et al., 2024]. The transformer architecture has emerged as a powerful alternative to recurrent models for time series forecasting, offering parallel processing and the ability to capture long-range dependencies through self-attention mechanisms. Comprehensive surveys by Li and Law [2024], Kong et al. [2025], and Kim et al. [2025] document the rapid adoption of transformer-based models across domains, including finance. For stock market prediction, Wang et al. [2022] demonstrated that transformer models significantly outperform classic deep learning methods such as CNN, RNN, and LSTM. Subsequent work has explored specialized transformer variants for financial forecasting. The Informer addresses the quadratic complexity of standard attention through ProbSparse attention, making it particularly suitable for long-sequence prediction tasks such as volatility forecasting [Hassani et al., 2025, Bhogade and Nithya, 2024]. The Non-stationary Transformer introduces series stationarization and de-stationary attention to handle distribution shifts, a critical challenge in financial markets where regimes change over time [Gou et al., 2023, Wu, 2023]. Comparative studies have shown that these variants offer different trade-offs, with the Non-stationary Transformer achieving the highest prediction accuracy for stock market indices in some settings [Wu, 2023]. Beyond architectural innovations, recent work has explored the integration of multiple data sources and modalities. Zeng et al. [2023] proposed a combined CNN and transformer model for intraday stock price forecasting, demonstrating superior performance against statistical baselines. Yañez et al.
[2024] introduced a hybrid transformer encoder with CNN layers based on empirical mode decomposition for market index prediction. Foundation models for time series, such as Chronos, represent the latest frontier, leveraging large-scale pre-training across diverse datasets to enable zero-shot and few-shot forecasting [Liang et al., 2024, Ye et al., 2024]. These models have shown promise in financial applications, though their black-box nature and lack of domain-specific constraints remain significant limitations.

Despite these advances, the deep learning models surveyed above share a common limitation: they are trained purely on historical data without incorporating any economic principles or constraints. As noted by Mashrur et al. [2020] and Sahu et al. [2023], this can lead to models that learn spurious correlations and produce predictions that violate fundamental financial theory, particularly during market regime shifts. ARTEMIS addresses this gap by embedding no-arbitrage conditions directly into the learning process.

2.2 Interpretability and Explainable AI in Finance

The opacity of deep learning models poses significant challenges for their adoption in regulated financial applications. As Rane et al. [2023] and Hoang et al. [2026] document, the black-box nature of these models creates legal, ethical, and operational risks, including regulatory penalties and loss of stakeholder trust. Financial institutions require transparency and accountability in decision-making systems, yet most state-of-the-art models offer little insight into their reasoning [Mathew et al., 2025]. Explainable AI (XAI) has emerged as a critical research area addressing this challenge.
Post-hoc explanation methods such as SHAP (Shapley Additive Explanations) and LIME (Local Interpretable Model-agnostic Explanations) have been widely applied to financial models to provide approximate explanations for individual predictions [Sokolovsky et al., 2023, Basha et al., 2025]. However, these techniques have fundamental limitations: they offer only local approximations, can be inconsistent, and do not address the underlying opacity of the model architecture itself [Rane et al., 2023]. Moreover, they introduce an inherent trade-off between accuracy and interpretability, where simpler, more transparent models may sacrifice predictive performance, while complex black-box models that achieve state-of-the-art accuracy remain difficult to explain [Mathew et al., 2025, Hoang et al., 2026]. Alternative approaches have sought to build interpretability directly into model design. Sokolovsky et al. [2023] proposed interpretable trading patterns designed specifically for machine learning applications, demonstrating that domain-informed feature engineering can enhance understanding. However, such approaches often require manual specification of patterns and do not learn interpretable representations end-to-end. ARTEMIS addresses these limitations through its differentiable symbolic bottleneck, which distils the learned latent dynamics into closed-form, interpretable expressions. Unlike post-hoc explanation methods that approximate a black-box model, our approach provides inherent transparency by constraining a component of the network to produce human-readable formulas. This offers a novel resolution to the accuracy-interpretability trade-off, maintaining end-to-end trainability while delivering interpretable outputs.
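The exact operator library of the symbolic bottleneck is specified later in the paper; the core idea, however, can be sketched as a sparse, differentiable linear combination over a fixed basis of candidate terms (in the spirit of SINDy-style symbolic regression): the weights are trained end-to-end with the rest of the network, and near-zero weights are pruned to yield a closed-form, human-readable expression. The basis below is purely illustrative.

```python
import numpy as np

# Hypothetical two-dimensional latent basis; the paper's actual operator set differs.
BASIS = [
    ("1",        lambda z: np.ones_like(z[:, 0])),
    ("z1",       lambda z: z[:, 0]),
    ("z2",       lambda z: z[:, 1]),
    ("z1*z2",    lambda z: z[:, 0] * z[:, 1]),
    ("tanh(z1)", lambda z: np.tanh(z[:, 0])),
]

def symbolic_forward(z, w):
    """Predict via a linear combination over the basis; differentiable in w,
    so the layer can be trained end-to-end (here shown in plain NumPy)."""
    feats = np.stack([f(z) for _, f in BASIS], axis=1)  # shape (N, |BASIS|)
    return feats @ w

def readable_formula(w, tol=1e-3):
    """Distil trained weights into a closed-form expression by dropping
    terms whose coefficients are below a pruning threshold."""
    terms = [f"{wi:+.3f}*{name}" for (name, _), wi in zip(BASIS, w) if abs(wi) > tol]
    return " ".join(terms) if terms else "0"
```

With, say, w = [0.5, 1.0, 0, 0, 0], `readable_formula` yields the transparent rule "+0.500*1 +1.000*z1", which is the kind of inspectable output the bottleneck is designed to produce.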
2.3 Physics-Informed Neural Networks and Neural Differential Equations

Physics-informed neural networks (PINNs) represent a paradigm shift in scientific machine learning, embedding governing physical laws expressed as partial differential equations (PDEs) directly into the neural network loss function [Lawal et al., 2022]. This approach ensures that model predictions remain consistent with known physics, improving generalization and reducing the risk of learning spurious patterns. In finance, PINNs have been applied primarily to option pricing problems, where the underlying PDEs are well established. Bai et al. [2022] developed an improved PINN with local adaptive activation functions to solve the Ivancevic option pricing model and the Black-Scholes equation. Hainaut and Casas [2024] applied physics-inspired neural networks to the Heston stochastic volatility model, using the Feynman-Kac PDE as the driving principle. Nuugulu et al. [2025] extended this approach to time-fractional Black-Scholes equations, demonstrating the efficiency and accuracy of PINNs for derivative pricing. More recent work has addressed more complex settings. Kartik and Shah [2025] developed a PINN framework for option pricing and hedging under a Merton-type jump-diffusion model with liquidity costs, encoding the partial integro-differential equation into the loss function. Lefebvre et al. [2023] proposed differential learning methods for solving fully nonlinear PDEs with applications to portfolio selection. The Deep Galerkin Method [Al-Aradi et al., 2018] and related approaches have been applied to high-dimensional PDEs arising in quantitative finance, including optimal execution and systemic risk models. Parallel to these developments, neural differential equations have emerged as a powerful framework for continuous-time modeling.
Neural ordinary differential equations (NODEs) parameterize the derivative of a hidden state with a neural network, enabling flexible modeling of dynamical systems. Neural stochastic differential equations (NSDEs) extend this to stochastic settings, capturing uncertainty through diffusion terms. Comprehensive surveys by Oh et al. [2025a,b] review the mathematical foundations, numerical methods, and applications of neural differential equations for time series analysis, emphasizing their capacity to handle irregular sampling and model continuous-time dynamics. In finance, Wang et al. [2025] introduced FinanceODE, a neural ODE-based framework for continuous-time asset price modeling, demonstrating superior predictive accuracy compared to traditional discrete-time models.

Despite these advances, existing work on PINNs in finance has focused primarily on solving known pricing equations rather than learning latent dynamics from data for forecasting tasks. Similarly, neural differential equation approaches have not incorporated economic constraints such as no-arbitrage conditions. ARTEMIS bridges this gap by combining a neural SDE for latent dynamics with physics-informed losses derived from the Fundamental Theorem of Asset Pricing, enabling data-driven learning while enforcing economic plausibility.

2.4 Continuous-Time Modeling and Non-Stationarity

Financial time series are inherently continuous-time processes, yet most forecasting models operate on discretely sampled data. Traditional econometric approaches such as ARIMA and GARCH models have been widely used for volatility forecasting and return prediction [Engle, Nelson, 1991].
GARCH models and their multivariate extensions, including Dynamic Conditional Correlation (DCC) and Generalized Orthogonal GARCH (GO-GARCH), remain popular for modeling time-varying volatility and correlations [Jeribi and Ghorbel, 2022, Ibrahim, 2017]. However, these models assume regular sampling and struggle with the irregular, high-frequency data that characterizes modern financial markets [Han et al., 2024]. Hybrid approaches combining ARIMA with GARCH have been proposed to capture both linear and non-linear patterns [Rubio et al., 2023, Mani and Thoppan, 2023], while fractionally integrated models (ARFIMA-GARCH) address long-memory properties [Chang et al., 2022]. The limitations of discrete-time models have motivated the development of continuous-time approaches. Neural differential equations offer a natural framework for modeling irregularly sampled time series, as they can be evaluated at arbitrary time points [Oh et al., 2025a,b]. This capability is particularly valuable for limit order book data, where events arrive at microsecond granularity and regular resampling can distort dynamics [Wang et al., 2025]. Non-stationarity presents another fundamental challenge. As Suarez-Cetrulo et al. [2023] document in their systematic review, conventional machine learning approaches often fail to adapt to changes in the price-generation process during market regime shifts. Hasan et al. [2023] observe that models relying solely on market-based indicators work well in stable conditions but fail during economic crises, reducing their long-term predictive reliability. The Non-stationary Transformer addresses this by explicitly modeling distribution shifts through series stationarization and de-stationary attention, while GARCH-based approaches model time-varying volatility [Han et al., 2024]. However, these methods do not incorporate economic theory about what constitutes a plausible regime change.
ARTEMIS addresses both continuous-time modeling and non-stationarity through its latent SDE formulation, which naturally accommodates irregular sampling and captures stochastic volatility through the learned diffusion term. The physics-informed losses further ensure that the learned dynamics remain economically plausible across different regimes, providing a principled approach to handling non-stationarity.

2.5 Uncertainty Quantification and Conformal Prediction

Reliable uncertainty quantification is essential for risk management in financial applications. Traditional approaches have relied on parametric methods such as GARCH for volatility forecasting [Engle]. However, these methods make strong distributional assumptions that may not hold in practice. Conformal prediction offers a distribution-free framework for constructing prediction intervals with finite-sample coverage guarantees, requiring only exchangeability of the data. As surveyed by Zhou et al. [2025], conformal prediction has been extended to time series settings through adaptive methods that update quantile estimates over time, addressing the violation of exchangeability in non-stationary data. ARTEMIS incorporates an adaptive conformal prediction layer to provide calibrated uncertainty intervals for its forecasts, enabling risk-aware portfolio construction.

The literature reviewed above reveals a clear gap: while significant advances have been made in deep learning architectures for financial time series, interpretability techniques, physics-informed neural networks, and continuous-time modeling, no existing framework integrates these innovations in a unified manner. Deep learning models achieve state-of-the-art predictive accuracy but operate as black boxes and ignore economic principles [Chen et al., 2024, Zhang et al., 2024].
XAI approaches provide post-hoc explanations but do not address underlying model opacity and introduce accuracy-interpretability trade-offs [Rane et al., 2023, Hoang et al., 2026]. PINNs embed physical laws but have been applied primarily to solving known pricing equations rather than learning latent dynamics for forecasting [Bai et al., 2022, Hainaut and Casas, 2024]. Neural differential equations offer continuous-time modeling but have not incorporated economic constraints [Oh et al., 2025a, Wang et al., 2025]. ARTEMIS addresses this gap by synthesizing these research directions into a single neuro-symbolic framework. To our knowledge, it is the first work to combine (1) a continuous-time encoder for irregularly sampled data, (2) a neural SDE for latent dynamics regularised by (3) physics-informed losses enforcing no-arbitrage conditions and (4) a market-price-of-risk penalty, (5) a differentiable symbolic bottleneck for interpretability, and (6) conformal prediction for uncertainty quantification. The comprehensive evaluation against six strong baselines across four diverse datasets demonstrates that this synthesis delivers tangible benefits, particularly in directional accuracy, without sacrificing interpretability.

3 Data Preprocessing for Benchmarking ARTEMIS

To rigorously evaluate the ARTEMIS model against a suite of state-of-the-art baselines, we assembled four distinct datasets spanning different financial and non-financial domains: Jane Street's anonymised market data [Desai et al., 2024], Optiver's high-frequency limit order book data [Meyer et al., 2021], the EPA-Air time series from the Time-IMM collection [Chang et al., 2025], and a proprietary Deep Synthetic Limit Order Book dataset (DSLOB). Each dataset required careful, domain-specific preprocessing to transform raw observations into a unified format suitable for sequence models.
The goal was to create training, validation, and test splits that respect temporal ordering, handle missing values appropriately, and preserve the underlying dynamics of each task. All neural models (LSTM, Transformer, NS-Transformer, Informer, ARTEMIS) share the same preprocessed windows to ensure a fair comparison under identical input conditions. The following sections describe the preprocessing pipeline for each dataset in detail, emphasising the rationale behind every step.

3.1 Jane Street Market Prediction [Desai et al., 2024]

The Jane Street dataset [Desai et al., 2024] originates from a Kaggle competition and consists of anonymised market data partitioned into files for training, validation, and testing. Each file contains rows indexed by date identifier, time identifier, and symbol identifier. The raw features are 79 numerical columns that may contain missing values. The target variable is a continuous response that the competition asked participants to predict. Additionally, a weight column is provided for use in the official evaluation metric (weighted R²). The data is already split temporally by date, with early dates in the training split, intermediate dates in validation, and later dates in the test split, a setup that faithfully simulates a backtesting environment. Preprocessing for sequence models begins with the construction of sliding windows. Because the data is streamed from disk (the full dataset exceeds available memory), we implemented a custom iterator that reads one partition at a time. Within each partition, rows are grouped by symbol and date, then sorted by time to ensure chronological order. For each group, we slide a window of length 20 (the chosen lookback horizon) and extract the next observation's target value. This yields an input tensor of shape (20, 79) and a scalar target.
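The grouping and windowing logic can be sketched as follows. This in-memory version omits the partition-streaming iterator; the column names (symbol_id, date_id, time_id, responder_6) follow the Kaggle schema but are illustrative here:

```python
import numpy as np
import pandas as pd

LOOKBACK = 20  # lookback horizon used throughout the paper

def make_windows(df, feature_cols, target_col="responder_6"):
    """Slide a LOOKBACK-step window within each (symbol, date) group,
    sorted by time, and take the next observation's target as the label."""
    xs, ys = [], []
    for _, g in df.groupby(["symbol_id", "date_id"], sort=False):
        g = g.sort_values("time_id")          # enforce chronological order
        feats = g[feature_cols].to_numpy()
        targs = g[target_col].to_numpy()
        for i in range(len(g) - LOOKBACK):
            xs.append(feats[i : i + LOOKBACK])  # input tensor, shape (20, n_feat)
            ys.append(targs[i + LOOKBACK])      # next observation's target
    return np.stack(xs), np.array(ys)
```

Because windows never cross a (symbol, date) boundary, no sequence mixes different instruments or trading days.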
To handle missing values, we create a binary mask of the same shape indicating which entries were originally observed; the input tensor itself has missing values replaced with zero. This masking strategy allows any subsequent model to distinguish genuine zeros from imputed values. The same windowing logic is applied to the validation and test sets, but without restricting the number of days (the training set is limited to the first 500 days for computational efficiency). For Chronos-2, which expects a univariate time series, we extract only the target values from each group, again forming windows of length 20 to predict the next value. No mask is needed because the target series is dense after filtering out missing targets. All neural models share the same preprocessed windows, ensuring a fair comparison under identical input conditions.

3.2 Optiver Realized Volatility [Meyer et al., 2021]

The Optiver dataset challenges participants to predict the realized volatility of 112 stocks over 10-minute windows, based on high-frequency limit order book snapshots and trade reports. The raw data consists of order book updates and trade executions. Each order book record contains a window identifier, a timestamp offset from the start of the window, bid and ask prices for two levels, and corresponding sizes. Trade records contain similar identifiers plus the trade price, size, and order count. The target is the realized volatility computed over the window, a continuous positive value. Preprocessing for this dataset requires fusing two asynchronous data streams into a regularly sampled sequence of length 600 (one observation per second). For each stock and time window, we first extract all order book and trade records. Order book snapshots are recorded at irregular intervals; we create a complete timeline covering all seconds from 0 to 599 and forward-fill the most recent book state to every second.
This yields a continuous representation of the limit order book. Trades, which occur at discrete seconds, are aggregated per second (average price, total size, total order count) and merged onto the same timeline. From the book data we compute derived features: mid price, bid–ask spread, log mid price, log return, volume imbalance, and total size. Combined with the trade aggregates, we obtain a feature set of 7 channels for each second. The target is the log-transformed realized volatility (the original values are small positive numbers, and the log transformation makes the distribution more Gaussian and easier for MSE-based models).

Table 1: Summary of datasets used in the benchmarking study.

Dataset     | Task                              | #Feat | SeqLen | #Train  | #Val   | #Test | Notes
Jane Street | Regression                        | 79    | 20     | ~7.37M  | ~4.61M | 200k  | Streamed Kaggle data; grouped by (symbol, date); masked missing values; predict next-step responder_6.
Optiver     | Volatility (log)                  | 7     | 600    | 2,298   | 766    | 766   | 10-min LOB + trades; forward-filled to 1 Hz; derived features; target log-transformed.
Time-IMM    | Temperature                       | 4     | 24     | 29,470  | 6,317  | 6,319 | Hourly EPA air quality; forward/backward fill for sparse variables; predict next-hour temperature.
DSLOB       | Realized volatility (regression)  | 85    | 20     | 24,891  | 9,891  | 4,891 | Synthetic LOB dataset based on a real crash regime; 85 features from four levels (prices, sizes, spreads, imbalances); target is next-step realized volatility (log-transformed).

For windows that have missing order book snapshots at the very beginning, forward-filling ensures every second has a valid feature vector. After constructing the full 600-second matrix for each window, we collect all windows (one per stock–time pair) and concatenate them into a single dataset. As with Jane Street Desai et al. [2024], we respect the original temporal split provided by the competition organisers.
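The forward-fill-to-1 Hz step can be sketched as below. This is a simplified NumPy version covering one book level and a subset of the derived features; the argument and function names are illustrative, not the actual pipeline.

```python
import numpy as np

def book_to_1hz(seconds, bid_px, ask_px, bid_sz, ask_sz, horizon=600):
    """Forward-fill irregular order-book snapshots onto a 1 Hz grid
    (seconds 0..horizon-1) and compute simple derived features."""
    grid = np.arange(horizon)
    # index of the most recent snapshot at or before each second
    idx = np.searchsorted(seconds, grid, side="right") - 1
    idx = np.clip(idx, 0, None)          # reuse the first snapshot before it arrives
    bid, ask = bid_px[idx], ask_px[idx]
    bsz, asz = bid_sz[idx], ask_sz[idx]
    mid = (bid + ask) / 2.0
    spread = ask - bid
    log_mid = np.log(mid)
    log_ret = np.diff(log_mid, prepend=log_mid[0])   # first return is zero
    imbalance = (bsz - asz) / (bsz + asz)
    total_size = bsz + asz
    return np.stack([mid, spread, log_mid, log_ret, imbalance, total_size], axis=1)
```

`np.searchsorted` implements the forward-fill in one vectorised pass: each grid second looks up the latest snapshot that precedes it, so sparse updates become a dense (600, n_features) matrix.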
For Chronos-2, we extract only the log-realized volatility series from each window and use it as a univariate input of length 600. For the Optiver Meyer et al. [2021] task, which involves predicting the volatility of the window itself rather than next-step prediction, Chronos-2 is used in a zero-shot manner by feeding the entire 600-step series as context and asking for a one-step forecast, then using a linear head to map the Chronos embedding to the target.

3.3 Time-IMM Chang et al. [2025] (EPA-Air)

The Time-IMM Chang et al. [2025] collection provides multivariate time series from diverse domains. For this benchmark we selected the EPA-Air domain, which contains hourly measurements of air quality for eight U.S. counties. Each county's data includes four variables: temperature, particulate matter, air quality index, and ozone concentration. Inspection reveals that only temperature is recorded hourly; the other three are sparse, with missing rates exceeding 85%. The task is to forecast the next hour's temperature using a 24-hour lookback window. This regression problem is challenging because the auxiliary variables, though sparse, may carry predictive information when available.

Preprocessing begins by loading each county's time series and adding an entity column to preserve identity. All counties are concatenated into a single dataframe. To handle the sparsity, we apply forward-fill followed by backward-fill to each feature within each entity. This propagates the last observed value forward, and any remaining leading missing values are filled with the next observed value. After this procedure, every feature has a complete sequence for all timestamps. We then construct windows of length 24 hours: for each entity, we slide a window of 24 consecutive hours and take the temperature at the next hour as the target. This yields input tensors of shape (24, 4) and scalar targets.
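The per-entity fill-and-window procedure can be sketched as below, a minimal pandas version; the column names (`timestamp`, entity column) and the helper name are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def fill_and_window(df, entity_col, feature_cols, target_col, lookback=24):
    """Per entity: forward-fill then backward-fill sparse features,
    then slide 24-hour windows predicting the next hour's target."""
    X, y = [], []
    for _, g in df.groupby(entity_col, sort=False):
        g = g.sort_values("timestamp")
        # ffill propagates the last observation; bfill handles leading gaps
        filled = g[feature_cols].ffill().bfill().to_numpy(dtype=np.float32)
        target = g[target_col].to_numpy(dtype=np.float32)
        for i in range(len(g) - lookback):
            X.append(filled[i : i + lookback])
            y.append(target[i + lookback])
    return np.stack(X), np.array(y)
```

Note that grouping by entity before filling is what prevents one county's observations from leaking into another's gaps.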
We discard any window where the target is missing (which does not occur after filling). The windows from all entities are concatenated, resulting in 29,470 training samples, 6,317 validation samples, and 6,319 test samples after a temporal 70/15/15 split applied per entity. This ensures that no future data leaks into the training set. Features are then standardised using statistics fitted on the training windows only. Because missing values have been eliminated, masks are simply all-ones. For Chronos-2, we isolate the temperature channel (the target series) and use it as a univariate input of length 24 to predict the next hour's temperature. The same scaling is applied to the Chronos inputs.

3.4 DSLOB: A Synthetic Dataset for Controlled Stress Testing

The DSLOB (Deep Synthetic Limit Order Book) dataset addresses a fundamental challenge in evaluating financial machine learning models: the scarcity of extreme events in historical data and the difficulty of isolating component contributions under real-world conditions. Real datasets like Jane Street Desai et al. [2024] and Optiver Meyer et al. [2021] are invaluable but inherently noisy, confounded, and lack sufficient examples of rare regimes such as market crashes. Moreover, the true data-generating process is unknown, making it impossible to definitively determine whether a model's performance stems from capturing genuine economic structure or overfitting to spurious correlations. DSLOB is therefore designed as a controlled synthetic environment that preserves the statistical properties of real limit order book data while introducing a known, amplified crash regime where ground truth is fully accessible.
The foundation of DSLOB is a real high-frequency limit order book dataset from which we extract 85 features capturing level-specific prices, sizes, spreads, mid-price calculations, volume imbalances, and microstructural metrics. To isolate the crash regime, we apply CUSUM and Bayesian change-point detection to identify a contiguous window of approximately one week where prices dropped rapidly and volatility spiked. This window serves as the crash template.

The synthetic mid-price is generated by amplifying a Vasicek-type stochastic differential equation (SDE) fitted to the crash window:

$$dP_t = \theta(\mu - P_t)\,dt + \sigma\,dW_t,$$

with parameters $\theta, \mu, \sigma$ estimated via maximum likelihood. To create an even more challenging crash, we scale the mean-reversion speed by 1.5 and the long-term mean by 1.2, producing a steeper and more persistent downward trend.

Volatility dynamics are modeled using a GARCH(1,1) process fitted to the 1-minute log-returns of the mid-price during the crash window:

$$\sigma_t^2 = \omega + \alpha \epsilon_{t-1}^2 + \beta \sigma_{t-1}^2, \qquad \epsilon_t \sim \mathcal{N}(0, 1).$$

We increase the persistence and shock magnitude by setting $\beta' = \min(0.95,\ 1.1\beta)$ and $\alpha' = 1.2\alpha$. Synthetic returns are then generated as $r_t = \sigma'_t \epsilon_t$, ensuring volatility clustering and leverage effects characteristic of real crashes.

The remaining 83 features (prices, sizes, spreads, etc.) are generated by adding correlated noise to the seed data while preserving the multivariate dependence structure. For each feature $f_i$ at time $t$:

$$f_i^{\mathrm{synth}}(t) = f_i^{\mathrm{seed}}(t) + \eta_i(t), \qquad \eta(t) \sim \mathcal{N}(0, \Sigma),$$

where $\Sigma$ is the covariance matrix of residuals from a vector autoregressive model of order 1 (VAR(1)) fitted to the seed features during the crash window. The noise is scaled so that the signal-to-noise ratio matches that of the seed data.
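Under the definitions above, the amplified mid-price generator can be sketched as follows. Merging the Vasicek drift and the GARCH shock into a single recursion is a simplification of the two-stage pipeline described in the text, and all parameter values in the usage example are placeholders, not fitted estimates.

```python
import numpy as np

def synth_crash(T, theta, mu, sigma, omega, alpha, beta, p0, dt=1.0, seed=0):
    """Sketch of a crash-regime mid-price path: an amplified Vasicek
    drift with GARCH(1,1)-modulated shocks. Amplification follows the
    text: theta' = 1.5*theta, mu' = 1.2*mu, alpha' = 1.2*alpha,
    beta' = min(0.95, 1.1*beta)."""
    rng = np.random.default_rng(seed)
    theta, mu = 1.5 * theta, 1.2 * mu
    alpha, beta = 1.2 * alpha, min(0.95, 1.1 * beta)
    prices = np.empty(T)
    prices[0] = p0
    var, eps_prev = omega, 0.0                           # initial conditional variance
    for t in range(1, T):
        var = omega + alpha * eps_prev**2 + beta * var   # GARCH(1,1) update
        eps_prev = np.sqrt(var) * rng.standard_normal()  # r_t = sigma'_t * eps_t
        # Euler step of the Vasicek SDE, with the GARCH shock as the noise term
        prices[t] = prices[t - 1] + theta * (mu - prices[t - 1]) * dt + sigma * eps_prev
    return prices
```

The GARCH recursion reproduces volatility clustering (large shocks raise the conditional variance of subsequent shocks), while the Vasicek drift pulls the path toward the amplified long-term mean.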
To create a longer training set, we apply time warping: a random deformation $\tau(t)$ sampled from a Gaussian process with mean 1 and variance 0.1 stretches or compresses the temporal dynamics while preserving sequential order. The final DSLOB dataset comprises 24,891 training, 9,891 validation, and 4,891 test samples, each a window of length 20 containing 85 synthetic features. The target is next-step realized volatility, computed as the square root of the sum of squared 1-second log-returns over the next 20 steps, annualized and log-transformed to match the Optiver Meyer et al. [2021] scale.

Rigorous validation confirms that the synthetic data preserves key statistical properties: the Kolmogorov–Smirnov test fails to reject that return distributions match the seed ($p > 0.05$), the autocorrelation decay of squared returns matches up to lag 50, the average absolute difference in correlation matrices is less than 0.03, and the 99.5th percentile of negative returns is within 5% of the seed's value.

DSLOB serves two critical purposes in the ARTEMIS benchmark. First, it enables the ablation study (Table 3) by providing a controlled environment where components can be systematically removed and their impact observed with known ground truth. Second, it stress-tests model robustness under amplified extreme conditions. ARTEMIS achieves the highest directional accuracy (64.96%) on DSLOB, outperforming all baselines and demonstrating that the framework remains resilient even when markets deviate from training distributions – a key requirement for practical applications where models often fail during crises.

4 The ARTEMIS Model: A Neuro-Symbolic Framework for Economically Constrained Market Dynamics

ARTEMIS (Arbitrage-free Representation Through Economic Models & Interpretable Symbolics) is a novel deep learning framework designed to overcome the limitations of existing black-box models in quantitative finance.
It treats financial markets as a continuous-time dynamical system governed by latent stochastic differential equations that must respect fundamental economic principles such as no-arbitrage conditions. The model integrates ideas from scientific machine learning – specifically physics-informed neural networks, neural operators, and differentiable symbolic regression – into a unified, end-to-end trainable architecture. The core insight is that by embedding economic constraints directly into the learning process, we can regularise the model to avoid implausible predictions, improve out-of-sample robustness, and simultaneously obtain interpretable trading signals. ARTEMIS comprises four tightly coupled modules:

• A continuous-time encoder based on a Laplace Neural Operator that ingests irregularly sampled, multi-resolution market data and maps it to a continuous latent state.
• An economics-informed latent dynamics module that models the evolution of the latent state via a neural stochastic differential equation, with drift and diffusion networks learned from data.
• A symbolic bottleneck layer that distils the latent dynamics into human-readable, closed-form alpha factors using differentiable symbolic regression.
• A conformal allocation layer that translates the stochastic uncertainty of the latent SDE into rigorously calibrated prediction intervals and, optionally, optimal portfolio weights.

[Figure 1: architecture diagram. It depicts the main data flow – irregular observations $\{(x_i, t_i)\}_{i=1}^{N}$ into the Laplace Neural Operator (Module 1), the neural SDE latent dynamics with its physics regularisers $\mathcal{L}_{\mathrm{PDE}}$ and $\mathcal{L}_{\mathrm{MPR}}$ (Module 2), the symbolic bottleneck with Gumbel-Softmax selection (Module 3), and the conformal allocation layer with a Kelly-criterion portfolio head (Module 4) – together with the composite objective $\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{forecast}} + \lambda_1 \mathcal{L}_{\mathrm{PDE}} + \lambda_2 \mathcal{L}_{\mathrm{MPR}} + \lambda_3 \mathcal{L}_{\mathrm{consist}}$.]

Figure 1: Architecture of ARTEMIS. The framework processes irregularly sampled market data $\{(x_i, t_i)\}_{i=1}^{N}$ through four tightly coupled modules. Module 1 (Laplace Neural Operator) encodes the input directly in continuous time via a learnable Laplace-domain kernel $\hat{\kappa}(\omega) = \sum_k A_k / (\omega - \lambda_k)$, eliminating the need for interpolation or regular resampling.
Module 2 (Neural SDE Latent Dynamics) evolves the encoded state under a stochastic differential equation $d\mathbf{z} = \mu_\theta(\mathbf{z}, t)\,dt + \sigma_\phi(\mathbf{z}, t)\,d\mathbf{W}$, where drift $\mu_\theta$ and diffusion $\sigma_\phi$ are neural networks trained with two physics-informed penalties: a Feynman–Kac PDE residual $\mathcal{L}_{\mathrm{PDE}}$ that enforces local no-arbitrage conditions via an auxiliary pricing network $V_\psi$, and a market-price-of-risk penalty $\mathcal{L}_{\mathrm{MPR}}$ that bounds the instantaneous Sharpe ratio $\|\sigma_\phi^{-1}\mu_\theta\|^2 \le \kappa^2$ to economically plausible values. Module 3 (Symbolic Bottleneck) distils the latent dynamics into a sparse, human-readable combination of basis functions $\hat{y}_s = \sum_k w_k f_k(x)$ via a two-phase teacher–student procedure with Gumbel-Softmax selection, providing inherent interpretability without post-hoc approximation. Module 4 (Conformal Allocation) wraps predictions in distribution-free intervals $[\hat{y} \pm q_{1-\alpha}]$ via adaptive conformal prediction, and optionally solves a differentiable Kelly-criterion portfolio problem. All components are trained jointly under the composite objective $\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{forecast}} + \lambda_1 \mathcal{L}_{\mathrm{PDE}} + \lambda_2 \mathcal{L}_{\mathrm{MPR}} + \lambda_3 \mathcal{L}_{\mathrm{consist}}$, with gradients backpropagated through the Euler–Maruyama SDE solver via the reparametrisation trick. The consistency loss $\mathcal{L}_{\mathrm{consist}}$ anchors the SDE trajectory to the encoder outputs at each time step, preventing latent drift.

All components are trained jointly using a composite loss function that combines a forecasting objective with two economic regularisation terms: a Feynman–Kac PDE residual that enforces local no-arbitrage conditions, and a market-price-of-risk penalty that bounds the instantaneous Sharpe ratio to realistic values. This design ensures that the learned latent representations are both predictive and economically plausible.
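The two economic regularisers can be sketched in PyTorch as follows. This is a minimal version assuming a diagonal diffusion and treating the networks as generic callables; the helper name and tensor shapes are illustrative rather than the actual implementation.

```python
import torch

def economic_regularisers(z, t, mu_net, sigma_net, V_net, kappa=2.0):
    """Sketch of the Feynman-Kac PDE residual (via autograd, diagonal
    diffusion assumed) and the market-price-of-risk hinge penalty.
    z: (N, d) latent collocation points, t: (N, 1) times; mu_net and
    sigma_net map (z, t) -> (N, d), V_net maps (z, t) -> (N, 1)."""
    z = z.clone().requires_grad_(True)
    t = t.clone().requires_grad_(True)
    mu, sigma = mu_net(z, t), sigma_net(z, t)
    V = V_net(z, t)
    ones = torch.ones_like(V)
    dV_dt = torch.autograd.grad(V, t, ones, create_graph=True)[0]
    dV_dz = torch.autograd.grad(V, z, ones, create_graph=True)[0]
    # diagonal of the Hessian of V, one latent coordinate at a time
    diag = [torch.autograd.grad(dV_dz[:, i].sum(), z, create_graph=True)[0][:, i]
            for i in range(z.shape[1])]
    d2V = torch.stack(diag, dim=1)
    # R_FK = dV/dt + mu . grad V + 0.5 tr(sigma sigma^T Hess V)  (diagonal case)
    residual = dV_dt.squeeze(-1) + (mu * dV_dz).sum(-1) + 0.5 * (sigma**2 * d2V).sum(-1)
    L_pde = (residual**2).mean()
    lam = mu / sigma                                   # element-wise market price of risk
    L_mpr = torch.clamp((lam**2).sum(-1) - kappa**2, min=0.0).mean()
    return L_pde, L_mpr
```

Both terms then enter the composite objective alongside the forecasting and consistency losses, weighted by $\lambda_1$ and $\lambda_2$.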
Algorithm 1 ARTEMIS Complete Training and Inference Procedure

Require: Training data $\mathcal{D}_{\mathrm{train}} = \{(x^{(n)}, y^{(n)})\}_{n=1}^{N_{\mathrm{train}}}$ with irregular observation times; validation data $\mathcal{D}_{\mathrm{val}}$; hyperparameters $\lambda_1, \lambda_2, \lambda_3, \lambda_4$ (loss weights), $\kappa$ (Sharpe threshold), $\tau$ (Gumbel temperature), $\gamma$ (risk aversion); learning rate $\eta$, number of epochs $E$, batch size $B$, collocation points per batch $N_{\mathrm{coll}}$; SDE step size $\Delta t$, latent dimension $d_z$, Wiener dimension $d_w$, basis library $\mathcal{F}$.
Ensure: Trained parameters: $\theta$ (drift net), $\phi$ (diffusion net), $\psi$ (auxiliary pricing net), $(w, b)$ (forecasting head), $w_{\mathrm{symb}}$ (symbolic weights).

1: function TrainARTEMIS
2:   Initialize all networks randomly: LNO encoder $E$, drift $\mu_\theta$, diffusion $\sigma_\phi$, auxiliary pricing $V_\psi$, forecasting head $(w, b)$, symbolic weights $w_{\mathrm{symb}}$
3:   Set learning rate scheduler (e.g., ReduceLROnPlateau)
▷ Pretraining Phase (without symbolic layer)
4:   for epoch = 1 to $E$ do
5:     for each batch $\mathcal{B} \subset \mathcal{D}_{\mathrm{train}}$ of size $B$ do
6:       Encode batch using LNO: for each sample, obtain latent states at regular times $\{t_j\}_{j=0}^{M}$: $z_j^{(\mathrm{enc})} = E(x)(t_j)$
7:       Set initial condition $z_0 = z_0^{(\mathrm{enc})}$ for each sample
8:       Simulate SDE forward using Euler–Maruyama (for each sample independently):
9:       for $j = 0$ to $M - 1$ do
10:        Sample $\epsilon_j \sim \mathcal{N}(0, I_{d_w})$
11:        $z_{j+1} = z_j + \mu_\theta(z_j, t_j)\,\Delta t + \sigma_\phi(z_j, t_j)\sqrt{\Delta t}\,\epsilon_j$
12:      end for
13:      Obtain final latent state $z_M$ and compute prediction $\hat{y} = w^\top z_M + b$
14:      Compute forecasting loss $\mathcal{L}_{\mathrm{forecast}} = \frac{1}{B}\sum_{n=1}^{B} \ell(\hat{y}^{(n)}, y^{(n)})$
15:      Sample collocation points $\{(z_i, t_i)\}_{i=1}^{N_{\mathrm{coll}}}$ from the latent trajectories (random times along each path)
16:      Compute PDE residuals via automatic differentiation: $\mathcal{R}_{FK}(z_i, t_i) = \frac{\partial V_\psi}{\partial t} + \mu_\theta \cdot \nabla_z V_\psi + \frac{1}{2}\mathrm{tr}\!\left(\sigma_\phi \sigma_\phi^\top \nabla_z^2 V_\psi\right)$
17:      $\mathcal{L}_{\mathrm{PDE}} = \frac{1}{N_{\mathrm{coll}}}\sum_{i=1}^{N_{\mathrm{coll}}} \|\mathcal{R}_{FK}(z_i, t_i)\|^2$
18:      Compute market price of risk $\lambda(t) = \mu_\theta(z(t), t) / \sigma_\phi(z(t), t)$ (element-wise)
19:      $\mathcal{L}_{\mathrm{MPR}} = \frac{1}{B}\sum_{b=1}^{B} \max\!\left(0, \|\lambda(t_b)\|^2 - \kappa^2\right)$ (evaluated at sampled times)
20:      Compute consistency loss $\mathcal{L}_{\mathrm{consist}} = \frac{1}{M}\sum_{j=1}^{M} \|z_j^{(\mathrm{sde})} - z_j^{(\mathrm{enc})}\|^2$
21:      Total loss $\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{forecast}} + \lambda_1 \mathcal{L}_{\mathrm{PDE}} + \lambda_2 \mathcal{L}_{\mathrm{MPR}} + \lambda_3 \mathcal{L}_{\mathrm{consist}}$
22:      Backpropagate $\mathcal{L}_{\mathrm{total}}$ and update parameters $(\theta, \phi, \psi, w, b)$ using Adam with learning rate $\eta$
23:    end for
24:    Evaluate on validation set $\mathcal{D}_{\mathrm{val}}$ (using only $\mathcal{L}_{\mathrm{forecast}}$)
25:    Adjust learning rate if validation loss has plateaued
26:  end for
▷ Symbolic Distillation Phase
27:  Freeze encoder $E$, drift $\mu_\theta$, diffusion $\sigma_\phi$, auxiliary net $V_\psi$, and forecasting head $(w, b)$
28:  for epoch = 1 to $E_{\mathrm{symb}}$ do
29:    for each batch $\mathcal{B} \subset \mathcal{D}_{\mathrm{train}}$ do
30:      Compute frozen model predictions $\hat{y}$ (using the same forward pass as above, no gradients)
31:      Compute symbolic prediction $\hat{y}_{\mathrm{symb}} = \sum_{k=1}^{K} w_{\mathrm{symb},k}\, f_k(x_{\mathrm{input}})$
32:      Distillation loss $\mathcal{L}_{\mathrm{distill}} = \frac{1}{B}\sum_{n=1}^{B} (\hat{y}_{\mathrm{symb}}^{(n)} - \hat{y}^{(n)})^2 + \lambda_4 \|w_{\mathrm{symb}}\|_1$
33:      Backpropagate $\mathcal{L}_{\mathrm{distill}}$ and update symbolic weights $w_{\mathrm{symb}}$ (using Gumbel-Softmax if basis functions are learnable)
34:    end for
35:  end for
▷ Conformal Prediction (post-training)
36:  Compute residuals on calibration set $\mathcal{D}_{\mathrm{cal}}$ (subset of validation) using the final model: $r_i = |y_i - \hat{y}(X_i)|$
37:  Determine quantile $q_{1-\alpha}$ from residuals (adaptive rolling window for non-stationary data)
38:  For any test point, output prediction $\hat{y}$ and interval $[\hat{y} - q_{1-\alpha}, \hat{y} + q_{1-\alpha}]$
39: end function

4.1 Continuous-Time Encoder: Laplace Neural Operator

Financial time series are inherently irregular: limit order book updates arrive at microsecond granularity, while fundamentals and macro indicators are published daily or quarterly. Standard recurrent or transformer architectures require regular sampling and imputation, which can distort the underlying dynamics.
ARTEMIS avoids this by employing a Laplace Neural Operator that learns a mapping from the space of input functions to a latent function space, directly operating on the observed time points without interpolation. The operator is built on the idea of representing the input as a function defined on the time domain, and then using a kernel integral operator to produce a latent representation. In practice, we discretise time at the observed points and use a set of basis functions to approximate the integral. The operator can be written as:

$$z(t) = \int \kappa(t - s)\, x(s)\, ds + b(t),$$

where $\kappa$ is a learnable kernel parameterised in the Laplace domain for efficiency and $b$ is a bias term. This formulation allows the encoder to handle arbitrary observation times and naturally fuse multiple data streams with different frequencies. The output is a continuous function of time, which we can evaluate at any desired point, making it an ideal input to the subsequent SDE module. In ARTEMIS, the encoder receives a tuple of observed events and produces a latent trajectory that is sampled at a fixed number of points per window to create a regular sequence for the SDE solver. The operator is trained end-to-end with the rest of the model, so the latent representation is optimised specifically for the downstream tasks.

4.2 Economics-Informed Latent Dynamics

The latent state is assumed to evolve according to a stochastic differential equation:

$$d\mathbf{z}(t) = \mu_\theta(\mathbf{z}(t), t)\,dt + \sigma_\phi(\mathbf{z}(t), t)\,d\mathbf{W}(t),$$

where the drift represents the predictable component, the diffusion captures stochastic volatility and regime changes, and $d\mathbf{W}(t)$ is a Wiener process. Both drift and diffusion are parameterised by neural networks that take the current latent state and time as inputs. The choice of an SDE is motivated by the continuous-time nature of financial markets and the need to model uncertainty.
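The Euler–Maruyama unrolling of such an SDE can be sketched in NumPy as follows (diagonal diffusion assumed; in the actual model the drift and diffusion are neural networks and gradients flow through this loop via the reparametrisation trick):

```python
import numpy as np

def euler_maruyama(z0, mu, sigma, n_steps, dt, seed=0):
    """Simulate dz = mu(z,t) dt + sigma(z,t) dW with the Euler-Maruyama
    scheme. mu and sigma are callables returning arrays shaped like z;
    returns the full path of shape (n_steps + 1, *z0.shape)."""
    rng = np.random.default_rng(seed)
    z = np.asarray(z0, dtype=float)
    path = [z]
    for j in range(n_steps):
        eps = rng.standard_normal(z.shape)     # eps_j ~ N(0, I)
        z = z + mu(z, j * dt) * dt + sigma(z, j * dt) * np.sqrt(dt) * eps
        path.append(z)
    return np.stack(path)
```

With the diffusion set to zero and an Ornstein–Uhlenbeck-style drift $\mu(z) = \theta(\bar{z} - z)$, the path contracts deterministically toward $\bar{z}$, which is a convenient sanity check of the scheme.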
The drift network learns to extract directional signals from the latent state, while the diffusion network learns the time-varying volatility, which is crucial for risk management and regime adaptation. Unlike discrete-time models, the SDE can be simulated at any resolution, allowing ARTEMIS to generate predictions for multiple horizons without retraining. To enforce economic plausibility, we regularise the SDE using two physics-informed losses derived from the Fundamental Theorem of Asset Pricing.

4.2.1 Feynman–Kac PDE Residual

Consider a derivative or portfolio value function that depends on the latent state. Under the risk-neutral measure, this function must satisfy the Feynman–Kac PDE:

$$\frac{\partial V}{\partial t} + \mu \cdot \nabla_z V + \frac{1}{2}\mathrm{tr}\!\left(\sigma\sigma^\top \nabla_z^2 V\right) = 0.$$

This equation expresses the condition that the expected change in value equals the risk-free return – i.e., no arbitrage. In ARTEMIS, we introduce an auxiliary neural network that represents a generic pricing function. We then compute the PDE residual using automatic differentiation:

$$\mathcal{R}_{FK} = \frac{\partial V}{\partial t} + \mu \cdot \nabla_z V + \frac{1}{2}\mathrm{tr}\!\left(\sigma\sigma^\top \nabla_z^2 V\right)$$

and penalise its mean square over a set of collocation points sampled from the latent trajectories:

$$\mathcal{L}_{PDE} = \frac{1}{N}\sum_{i=1}^{N} \mathcal{R}_{FK}(z_i, t_i)^2.$$

Figure 2: Temporal profiles of drift magnitude $\|\mu(Z, t)\|$ and diffusion magnitude $\|\sigma(Z, t)\|$ evaluated across the 100-timestep input window on the DSLOB crash-regime test set. At each normalised time $t \in [0, 1]$, the norms are computed over the full latent dimension and averaged across 256 test samples; shaded bands denote $\pm 1\sigma$ across samples. Two observations are of economic significance.
First, the diffusion magnitude $\|\sigma\|$ increases monotonically toward the end of the sequence window, indicating that the model assigns growing uncertainty to more recent LOB states – consistent with the stylised fact that price impact and volatility are highest in the final moments before a regime transition. Second, the drift magnitude $\|\mu\|$ exhibits a non-monotone profile with a mid-sequence peak, reflecting the model's learned representation of momentum followed by mean-reversion dynamics. Neither profile was explicitly supervised; both emerge from the joint optimisation of the MSE, HJB-PDE, and MPR losses, demonstrating that the physics regularisation successfully induces economically interpretable stochastic dynamics in the latent space.

Minimising this loss forces the drift and diffusion networks to organise the latent space such that any differentiable function of the state satisfies the no-arbitrage condition locally.

4.2.2 Market Price of Risk Penalty

Strict no-arbitrage is a theoretical ideal; real markets exhibit transient mispricings that can be exploited. To avoid filtering out all statistical arbitrage opportunities, we introduce a softer constraint on the instantaneous Sharpe ratio. Define

$$\lambda(t) = \frac{\mu(z(t), t)}{\sigma(z(t), t)}$$

with element-wise division. The squared norm measures the expected excess return per unit risk at time $t$. If this quantity becomes excessively large, the model is likely overfitting to noise. We therefore add a hinge penalty:

$$\mathcal{L}_{MPR} = \frac{1}{B}\sum_{b=1}^{B} \max\!\left(0, \|\lambda(t_b)\|^2 - \kappa^2\right),$$

where $\kappa$ is a threshold set to a plausible maximum annualised Sharpe ratio. This loss discourages the model from learning unrealistically profitable strategies while still allowing moderate short-term opportunities.

4.3 Symbolic Bottleneck Layer

A major criticism of deep learning in finance is the lack of interpretability.
ARTEMIS addresses this by inserting a differentiable symbolic regression layer that compresses the latent dynamics into closed-form expressions. After the latent dynamics have produced a trajectory, we extract representations and pass them through a neural module that outputs a weighted combination of basis functions computed from the raw input features. Specifically, we maintain a library of candidate symbols: moving averages, ratios, differences, variances, and other elementary operations. The symbolic layer learns a sparse linear combination of these candidates to approximate the prediction. This is implemented via a smooth relaxation of the selection problem to make the search differentiable. The output is a set of interpretable factors:

$$\hat{y} = \sum_{k=1}^{K} w_k \cdot f_k(x),$$

where each $f_k$ is a simple mathematical expression. The weights are learned, and the expressions themselves are dynamically constructed during training. To stabilise training, we adopt a two-phase procedure. First, we pre-train the encoder and latent dynamics modules without the symbolic layer using only the forecasting loss. Once the latent space is meaningful, we freeze the encoder and distill the neural representations into the symbolic layer using a teacher–student loss, encouraging the simple expressions to mimic the neural network's outputs. This yields a model that is both accurate and transparent.

4.4 Conformal Allocation Layer

Financial predictions are inherently uncertain; a point forecast without a measure of confidence is of limited use for risk management. ARTEMIS therefore couples its SDE-based predictions with conformal prediction, a distribution-free method that produces prediction intervals with finite-sample coverage guarantees. Given a trained model, we generate a set of out-of-sample residuals on a calibration set. For a new input, we compute a prediction interval $\hat{y} \pm q_{1-\alpha}$, where $q_{1-\alpha}$ is the appropriate quantile of the absolute residuals.
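The split-conformal step can be sketched as below. The finite-sample quantile correction is a standard variant; the adaptive rolling-window update used for non-stationary data is not shown, and the helper name is illustrative.

```python
import numpy as np

def conformal_interval(cal_y, cal_pred, test_pred, alpha=0.1):
    """Split-conformal prediction intervals: the (1-alpha) quantile of
    absolute calibration residuals is added around each point forecast."""
    residuals = np.abs(cal_y - cal_pred)
    n = len(residuals)
    # finite-sample-corrected quantile level ceil((n+1)(1-alpha))/n
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(residuals, level)
    return test_pred - q, test_pred + q
```

Under exchangeability, intervals built this way cover the true target with probability at least $1 - \alpha$, regardless of the underlying model.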
These intervals are marginally valid under exchangeability. Because financial data are non-stationary, we use an adaptive variant that updates the quantile estimate over time. The prediction intervals can be fed into a differentiable convex optimisation layer to construct portfolios that maximise a risk-adjusted objective such as the continuous Kelly criterion:

$$\max_{w}\ \mathbb{E}[R(w)] - \frac{\gamma}{2}\mathrm{Var}(R(w)),$$

subject to budget and leverage constraints. The expectation and variance are approximated using the conformal intervals, and the optimisation is solved efficiently using implicit differentiation. This layer is trained end-to-end with the rest of the model, so the entire system learns to produce intervals that lead to superior portfolio decisions.

4.5 Loss Function and Training

The total loss function combines several terms:

Figure 3: Vector field of the learned SDE dynamics projected onto the first two principal components (PC1–PC2) of the ARTEMIS latent space, evaluated at three canonical normalised times: $t = 0.1$ (early sequence), $t = 0.5$ (mid-sequence), and $t = 0.9$ (late sequence). The figure is arranged as a 2×3 grid. The top row shows drift quiver plots: each arrow represents the direction and magnitude of $\mu(Z, t)$ projected onto the PC1–PC2 plane via $\mu_{PC} = \mu V_{1:2}^\top$, where $V_{1:2}$ are the top two PCA eigenvectors; arrow colour encodes drift speed $\|\mu_{PC}\|$. The bottom row shows diffusion heatmaps: the background colour encodes $\|\sigma(Z, t)\|$ computed directly in the full 64-dimensional latent space at each grid point, with brighter regions indicating higher local volatility. The grid is constructed by sweeping PC1 and PC2 over their empirical 5th–95th percentile ranges and back-projecting into latent space via the PCA inverse transform. At $t = 0.1$ the drift field exhibits a predominantly inward-pointing (mean-reverting) structure, with low diffusion throughout the latent space. By $t = 0.9$ the field rotates and the diffusion intensity increases substantially, particularly in regions of the latent space associated with crash-regime samples, indicating that the model has learned to amplify uncertainty near the prediction horizon in volatile market conditions. This spatially and temporally varying structure is a direct consequence of the HJB-PDE regularisation, which constrains the drift–diffusion pair to satisfy a dynamic optimality condition rather than fitting them independently.

$$\mathcal{L}_{total} = \mathcal{L}_{forecast} + \lambda_1 \mathcal{L}_{PDE} + \lambda_2 \mathcal{L}_{MPR} + \lambda_3 \mathcal{L}_{consist},$$

where:

• $\mathcal{L}_{forecast}$ is the standard supervised loss on the final prediction.
• $\mathcal{L}_{PDE}$ is the Feynman–Kac residual.
• $\mathcal{L}_{MPR}$ is the market-price-of-risk penalty.
• $\mathcal{L}_{consist}$ is a consistency loss that ensures the SDE-evolved latent state matches the encoded state at each time step.

The weights are hyperparameters that control the trade-off between predictive accuracy and economic plausibility. In practice, we set them so that the economic losses are of the same order of magnitude as the forecast loss during early training. Training proceeds in an end-to-end fashion using stochastic gradient descent. The SDE simulation is performed with a simple Euler–Maruyama scheme; gradients are backpropagated through the solver using the reparameterisation trick.

Table 2: Master Benchmark Results Across All Datasets and Models

Dataset                      | Model          | RMSE ↓  | RankIC ↑ | DirAcc ↑ | Weighted R² ↑
Jane Street Desai et al. [2024] | LSTM        | 0.7628  | 0.0378   | 0.5159   | 0.0020
                             | Transformer    | 0.7635  | 0.0122   | 0.5337   | -0.0002
                             | NS-Transformer | 2.6034  | 0.0031   | 0.5009   |
                             | Informer       | 0.7862  | 0.0083   | 0.4739   | -0.0008
                             | ARTEMIS        | 0.7762  | 0.0432   | 0.5150   | -0.0009
                             | Chronos-2      | 1.4043  | 0.1325   | 0.5372   | -1.4578
Optiver Meyer et al. [2021]  | LSTM           | 0.5570  | 0.0000   | 0.0000   | -0.0271
                             | Transformer    | 0.5422  | 0.3583   | 0.6162   | 0.0268
                             | NS-Transformer | 0.7019  | 0.2474   | 0.6057   | -0.6308
                             | Informer       | 1.8411  | -0.1465  | 0.5679   | -10.220
                             | ARTEMIS        | 0.5553  | -0.0555  | 0.4582   | -0.0208
                             | Chronos-2      | 4.9538  | -0.1384  | 0.4047   | -80.232
Time-IMM Chang et al. [2025] | LSTM           | 19.58   | 0.493    | 0.533    | -2.314
                             | Transformer    | 4.420   | 0.969    | 0.922    | 0.831
                             | NS-Transformer | 40.469  | 0.257    | 0.599    | -13.158
                             | Informer       | 4.011   | 0.928    | 0.890    | 0.861
                             | ARTEMIS        | 4.691   | 0.904    | 0.860    | 0.810
                             | Chronos-2      | 79.255  | 0.943    | 0.907    | -53.302
DSLOB                        | LSTM           | 0.01340 | 0.03905  | 0.6064   | -3053.6
                             | Transformer    | 0.01211 | 0.10174  | 0.4756   | -550.28
                             | NS-Transformer | 0.08575 | -0.09385 | 0.3504   | -90674
                             | Informer       | 0.02699 | -0.05606 | 0.3564   | -6557.2
                             | ARTEMIS        | 0.03615 | 0.08791  | 0.6496   | -2351.2
                             | Chronos-2      | 0.01446 | -0.10114 | 0.6238   | -3123.2

The conformal layer and symbolic regression module are also differentiable, allowing the entire system to be optimised jointly.

4.6 Integration and Implementation

The four modules are assembled into a single computational graph. For a batch of input windows, the encoder produces latent trajectories. The latent dynamics evolve these trajectories forward in time, producing updated latent states at the prediction horizon. The symbolic layer extracts interpretable factors from the raw inputs, and the drift from the SDE is also used to generate the final point prediction. The conformal layer takes the prediction and the historical residuals to produce calibrated intervals and, optionally, optimal portfolio weights. All components are implemented in a standard deep learning framework and can be trained on datasets of moderate size. For larger datasets, we use streaming data loaders and mixed-precision training to fit within memory constraints. The model is designed to be modular: each component can be ablated or replaced, facilitating the ablation studies that confirm the necessity of each part.
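As a concrete instance of the allocation step, the unconstrained Kelly-style problem $\max_w w^\top \hat{y} - \frac{\gamma}{2} w^\top \Sigma w$ has the closed form $w^* = \Sigma^{-1}\hat{y}/\gamma$. The sketch below uses that closed form with a crude leverage rescaling in place of the differentiable constrained QP solved in the full model; the helper name is illustrative.

```python
import numpy as np

def kelly_weights(y_hat, cov, gamma, leverage=1.0):
    """Unconstrained mean-variance/Kelly solution w* = cov^{-1} y_hat / gamma,
    rescaled (crudely) to respect an L1 leverage cap."""
    w = np.linalg.solve(cov, y_hat) / gamma
    l1 = np.abs(w).sum()
    if l1 > leverage:
        w = w * (leverage / l1)      # shrink onto the leverage budget
    return w
```

With an identity covariance and no binding leverage cap, the weights are simply the forecasts scaled by $1/\gamma$, which makes the role of the risk-aversion parameter transparent.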
5 Benchmarking Baselines: The Five Core Models Compared Against ARTEMIS

In order to establish a rigorous and comprehensive evaluation of the ARTEMIS framework, it was essential to select a suite of baseline models that represent the current state of the art in time-series forecasting and financial machine learning, while also spanning a diverse range of architectural paradigms. The choice of baselines was guided by the need to cover well-established recurrent architectures, the dominant transformer-based models that have revolutionised sequence modelling, and a specialised zero-shot foundation model designed explicitly for time series. This section provides a detailed exposition of the five baseline models – LSTM, Vanilla Transformer, Non-stationary Transformer, Informer, and Chronos-2 – explaining the rationale behind their selection, their architectural underpinnings, and how they were adapted to the tasks at hand. Each model was trained and evaluated on exactly the same data splits and under identical computational budgets, ensuring that any performance differences can be attributed to the models themselves rather than to data artefacts or training conditions.

5.1 Long Short-Term Memory (LSTM) Networks

The Long Short-Term Memory network, introduced by Hochreiter and Schmidhuber in 1997, remains one of the most enduring and widely used architectures for sequence modelling. Its inclusion as a baseline is motivated by several factors. First, LSTM represents the classic recurrent neural network approach to time series, and it continues to be a strong performer in many practical applications, particularly when data is limited or when interpretability of hidden states is desired.
Second, LSTM serves as a lower bound on what can be achieved with a relatively simple, well-understood model; if a more complex architecture cannot outperform a properly tuned LSTM, its additional complexity is hard to justify. Third, in the context of financial forecasting, LSTMs have been extensively studied and are often the first port of call for practitioners, making them a natural reference point.

Figure 4: Performance degradation across the three DSLOB market regimes for all six benchmark models. The x-axis progresses from the training distribution (Normal, low volatility) through the validation distribution (Stress, medium volatility) to the held-out test distribution (Crash, high volatility with downward drift), representing a controlled out-of-distribution evaluation. ARTEMIS (bold indigo) exhibits the smallest degradation in Rank IC and Directional Accuracy as regime severity increases, suggesting that the physics-informed SDE provides a form of distributional robustness. Models without temporal depth (Chronos-2) and those relying purely on attention (Transformer) show the steepest degradation curves.

The architecture implemented for this benchmark is a standard stacked LSTM with two hidden layers, each containing 128 units, followed by a fully connected output layer that maps the final hidden state to a scalar prediction. Dropout with a rate of 0.2 is applied between LSTM layers to mitigate overfitting. The model receives an input sequence of length 20 (for Jane Street, Time-IMM [Chang et al., 2025], and DSLOB) or 600 (for Optiver [Meyer et al., 2021]) with a feature dimension that varies per dataset (79, 4, 59, and 7 respectively). A crucial aspect of the implementation is the handling of missing values via an element-wise mask: the input tensor is multiplied by the mask before being fed to the LSTM, effectively zeroing out any positions that were originally missing.
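A minimal sketch of this element-wise masking step, assuming NaN-coded missing values and a 0/1 observation mask (names and shapes hypothetical):

```python
import numpy as np

def apply_missing_mask(x, mask):
    """Zero out missing positions before the recurrent pass.

    x    : (batch, seq_len, n_features) raw inputs, NaN where missing
    mask : same shape, 1.0 where observed, 0.0 where missing
    """
    # nan_to_num guards against NaN * 0 = NaN before the multiplication
    return np.nan_to_num(x) * mask

x = np.array([[[1.0, np.nan],
               [2.0, 3.0]]])          # one window, 2 steps, 2 features
mask = np.array([[[1.0, 0.0],
                  [1.0, 1.0]]])
clean = apply_missing_mask(x, mask)    # missing entry becomes 0.0
```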
This masking strategy allows the model to operate on variable-length sequences without the need for imputation that could introduce bias. Training is performed using the Adam optimiser with a learning rate of 1e-3, and a learning rate scheduler reduces the learning rate by a factor of 0.5 when the validation loss plateaus. Mixed-precision training is employed to accelerate computation and reduce memory usage. The loss function is mean squared error for regression tasks (Jane Street [Desai et al., 2024], Optiver [Meyer et al., 2021], Time-IMM [Chang et al., 2025]) and binary cross-entropy with logits for the DSLOB classification task. Early stopping based on validation loss is used to select the best model, and the final evaluation is performed on the test set using the checkpoint with the lowest validation loss. Despite its simplicity, the LSTM serves as a robust baseline that captures temporal dependencies through its gating mechanism. Its performance on the four datasets – often achieving competitive results, particularly on Jane Street [Desai et al., 2024], where it attained an RMSE of 0.7628 and a RankIC of 0.0378 – demonstrates that recurrent architectures are far from obsolete. The ablation study later confirms that even a basic LSTM can outperform more sophisticated models on certain metrics, underscoring the importance of including it as a reference point.

5.2 Vanilla Transformer

The introduction of the Transformer architecture by Vaswani et al. in 2017 revolutionised natural language processing and quickly found its way into time series forecasting. Unlike recurrent models, Transformers process the entire sequence in parallel using self-attention mechanisms, which allows them to capture long-range dependencies more effectively and to scale to longer sequences.
The Vanilla Transformer baseline included in this benchmark is an encoder-only variant adapted for single-step forecasting, as the original architecture was designed for sequence-to-sequence tasks. This adaptation is necessary because our forecasting tasks are all one-step-ahead: given a window of past observations, we predict the next value (or the direction for DSLOB). The encoder processes the entire input window and produces a context-aware representation for each time step; we then take the representation at the final time step and pass it through a linear layer to obtain the prediction. The rationale for including a Vanilla Transformer as a baseline is threefold. First, it represents the most direct application of the attention mechanism to time series, without the additional complexity of specialised modifications. This allows us to isolate the benefits of the core self-attention idea. Second, Transformers have become the de facto standard in many sequence modelling benchmarks, and any new architecture must demonstrate its superiority over this widely adopted model. Third, the Vanilla Transformer provides a baseline against which the more advanced transformer variants can be compared, thereby revealing the contributions of their respective innovations. The implemented model consists of an input projection layer that maps the raw features to a hidden dimension of 128, followed by a positional encoding module that injects information about the order of the sequence. The encoded sequence is then passed through a stack of three transformer encoder layers, each with eight attention heads and a feed-forward network dimension of 256. Dropout of 0.1 is applied after each sub-layer. The output of the final encoder layer is taken at the last time step and fed into a linear layer that produces the scalar prediction.
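Assuming the positional encoding module follows the standard sinusoidal scheme of Vaswani et al. (the text does not specify which variant is used), it can be sketched with the benchmark's dimensions (sequence length 20, hidden size 128):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Standard sinusoidal positional encoding (Vaswani et al., 2017).

    Even columns carry sin terms, odd columns cos terms, with
    geometrically spaced wavelengths across the hidden dimension.
    """
    pos = np.arange(seq_len)[:, None]             # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]          # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Added to the projected inputs before the encoder stack.
pe = positional_encoding(seq_len=20, d_model=128)
```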
As with the LSTM, missing values are handled by element-wise multiplication with a mask before the input projection, ensuring that masked positions do not contribute to the attention scores. Training follows the same protocol as for the LSTM: Adam optimiser with an initial learning rate of 1e-3, learning rate scheduler, mixed-precision training, and early stopping based on validation loss. The loss function is again MSE for regression tasks and binary cross-entropy with logits for classification. The model is trained for up to 15 epochs, with the best checkpoint saved. On the Jane Street [Desai et al., 2024] dataset, the Vanilla Transformer achieved an RMSE of 0.7635, nearly identical to the LSTM, but a slightly lower RankIC. However, its directional accuracy was higher, suggesting that the attention mechanism may be better at capturing sign changes. On the Optiver dataset [Meyer et al., 2021], the Transformer significantly outperformed the LSTM on most metrics, with an RMSE of 0.5422 and a much higher RankIC of 0.3583. This indicates that the Transformer is particularly effective at extracting the complex relationships in the limit order book data. On Time-IMM [Chang et al., 2025], the Transformer excelled, achieving an RMSE of 4.420, a RankIC of 0.969, and a directional accuracy of 0.922 – far surpassing the LSTM. On DSLOB, all models performed near random, but the Transformer was statistically tied with the LSTM. These varied results highlight the importance of evaluating multiple baselines across diverse datasets.

5.3 Non-stationary Transformer

Time series data, especially in financial markets, are often non-stationary: their statistical properties change over time due to regime shifts, evolving volatility, and external shocks. Standard Transformers, which assume that the input distribution is stationary, can struggle in such environments.
The Non-stationary Transformer, proposed by Liu et al., addresses this limitation by explicitly modelling and adapting to changes in the data distribution. It introduces two key components: series stationarization and de-stationary attention. Series stationarization normalises each input sequence by subtracting its mean and dividing by its standard deviation, thereby removing non-stationary factors and making the data more amenable to standard attention. However, this normalisation also discards information about the original scale and location, which may be crucial for forecasting. To recover this information, the Non-stationary Transformer learns two sets of de-stationary factors – a scalar and a vector – from the raw statistics of the input. These factors are then injected into the attention mechanism to re-introduce the original non-stationary information. Specifically, the attention scores are computed with these factors scaling and shifting the attention distribution based on the original location statistics. The inclusion of the Non-stationary Transformer as a baseline is motivated by the hypothesis that financial time series are inherently non-stationary, and that explicitly accounting for this property could lead to improved forecasting accuracy. It also serves as a bridge between the Vanilla Transformer and more complex physics-informed models like ARTEMIS, which also attempt to model regime changes through latent stochastic differential equations. Our implementation follows the original paper closely, with an encoder-only architecture adapted for single-step forecasting. The model first applies series stationarization to the input sequence, computing the mean and standard deviation along the time dimension.
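The stationarization step just described, together with the final de-normalisation of the prediction, can be sketched as follows; the learned de-stationary factors that re-enter the attention are omitted, and the function names are hypothetical:

```python
import numpy as np

def stationarize(x, eps=1e-8):
    """Per-window normalisation along the time axis.

    x : (batch, seq_len, n_features). Returns the normalised sequence and
    the statistics needed to de-normalise the prediction afterwards.
    """
    mu = x.mean(axis=1, keepdims=True)
    sigma = x.std(axis=1, keepdims=True) + eps   # eps guards constant windows
    return (x - mu) / sigma, mu, sigma

def de_stationarize(y_hat, mu, sigma, target_idx=0):
    """Map a normalised scalar prediction back to the original scale."""
    return y_hat * sigma[:, 0, target_idx] + mu[:, 0, target_idx]

x = np.random.default_rng(1).normal(5.0, 2.0, size=(4, 20, 3))
x_norm, mu, sigma = stationarize(x)
```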
These statistics are then fed into a projector network that outputs the log-scale factor and the shift vector. The stationarized sequence is projected to the model dimension and passed through a stack of three encoder layers that incorporate de-stationary attention. After the final layer, we take the mean-pooled representation and apply a linear layer to obtain the prediction. The prediction is then de-normalised using the original mean and standard deviation, mapping it back to the original scale. Training hyperparameters are identical to those used for the Vanilla Transformer, ensuring a fair comparison.

Figure 5: Predicted versus actual mid-price return scatter plots for all six benchmark models evaluated on the DSLOB crash-regime test set. Each panel displays 2,000 randomly sampled predictions. The dashed diagonal represents the identity line (perfect prediction). RMSE and Rank IC are annotated in each title. ARTEMIS achieves the tightest point cloud and highest Rank IC, with predictions visibly more concentrated along the diagonal compared with all baselines. Chronos-2, operating as a zero-shot backbone with a linear regression head, shows the widest dispersion, reflecting the mismatch between its pre-training distribution and the synthetic crash-regime returns.

On the Jane Street [Desai et al., 2024] dataset, the Non-stationary Transformer produced a much higher RMSE and a lower RankIC than the Vanilla Transformer, suggesting that the additional complexity may have hindered learning on this particular dataset. However, on Optiver [Meyer et al., 2021], it achieved a respectable RankIC, outperforming the Vanilla Transformer on that metric, though its RMSE was higher. On Time-IMM Chang et al.
[2025], the Non-stationary Transformer performed poorly, with an RMSE of 40.469 and a RankIC of 0.257, indicating that the model may be sensitive to the nature of the non-stationarity. On DSLOB, it was statistically indistinguishable from random. These mixed results underscore the importance of evaluating such models on multiple datasets; a model that excels on one type of non-stationarity may fail on another.

5.4 Informer

The Informer is a transformer variant specifically designed for long-sequence time series forecasting. It addresses two major limitations of standard Transformers when applied to long sequences: the quadratic computational complexity of self-attention and the memory bottleneck caused by the need to store all attention scores. The Informer introduces a ProbSparse attention mechanism that selects only the most dominant queries based on a sparsity measurement, reducing the complexity significantly. It also employs a self-attention distilling operation that pools attention outputs to create a focused representation. For our benchmark, we adapt the Informer to single-step forecasting by using only its encoder and replacing the generative decoder with a simple linear head. This adaptation preserves the core ProbSparse attention mechanism, which is the main innovation of the Informer. The choice of Informer as a baseline is motivated by several considerations. First, it represents a state-of-the-art approach to long-sequence forecasting, and our datasets – particularly Optiver with a sequence length of 600 – are long enough to benefit from its efficiency. Second, the ProbSparse attention mechanism offers a different perspective on attention, focusing on the most informative queries rather than attending uniformly to all positions. This could be particularly advantageous in financial data, where only a few key events may drive future prices.
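The sparsity measurement behind ProbSparse attention can be illustrated with the max-minus-mean score from the Informer paper; this sketch shows only the query-selection step, not the full approximate attention or the sampled-key approximation:

```python
import numpy as np

def sparsity_scores(Q, K):
    """Max-minus-mean sparsity measurement from the Informer paper.

    Q : (n_queries, d), K : (n_keys, d). A query whose attention
    distribution is far from uniform (a few dominant keys) gets a high
    score; only the top-scoring queries receive full attention.
    """
    scores = Q @ K.T / np.sqrt(Q.shape[-1])        # scaled dot products
    return scores.max(axis=-1) - scores.mean(axis=-1)

def select_top_queries(Q, K, u):
    """Return the indices of the u most 'dominant' queries."""
    m = sparsity_scores(Q, K)
    return np.argsort(m)[::-1][:u]

rng = np.random.default_rng(2)
Q, K = rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
top = select_top_queries(Q, K, u=4)
```

The remaining queries receive a default value (e.g. the mean of the values), which is how the quadratic cost is avoided.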
Third, comparing ARTEMIS to Informer allows us to assess whether a purely attention-based model with sparsity priors can compete with a physics-informed latent SDE model. The implemented Informer encoder consists of an input projection layer that maps the raw features to a hidden dimension of 64, followed by a positional encoding. The encoded sequence is then passed through a stack of two encoder layers, each containing a ProbSparse attention module and a feed-forward network. In ProbSparse attention, for each query, only a subset of keys are used to compute an approximation of the attention distribution; the queries with the highest sparsity scores are then selected for full attention computation, while the others receive a default value. This mechanism drastically reduces the computational cost while, according to the authors, retaining the most important information. Training follows the same protocol as the other transformer variants. On the Jane Street [Desai et al., 2024] dataset, the Informer achieved an RMSE of 0.7862 and a RankIC of 0.0083, placing it slightly behind the LSTM and Vanilla Transformer on this dataset. On Optiver, however, its performance was poor, with an RMSE of 1.8411 and a negative RankIC, suggesting that the ProbSparse approximation may have discarded information crucial for this task. On Time-IMM [Chang et al., 2025], the Informer performed very well, with an RMSE of 4.011, a RankIC of 0.928, and a directional accuracy of 0.890 – second only to the Transformer. On DSLOB, like all other models, it was near random. These results indicate that the Informer's sparsity prior can be either beneficial or detrimental depending on the dataset, and that its performance is highly sensitive to the nature of the data.

5.5 Chronos-2

Chronos-2 represents a fundamentally different approach to time series forecasting.
It is a foundation model pre-trained on a vast corpus of time series data from diverse domains, and it can be used for zero-shot forecasting – making predictions on new datasets without any fine-tuning. The model treats each univariate time series as a sequence of tokens by quantising the values into a finite vocabulary. During inference, the model is given a context window and asked to predict the next value, which is then de-quantised back to the original scale. The inclusion of Chronos-2 as a baseline serves multiple purposes. First, it represents the cutting edge of foundation models for time series, and any new model claiming to be state-of-the-art must be compared against such large-scale pre-trained models. Second, Chronos-2 is zero-shot, requiring no training on the target dataset; this provides an interesting contrast to the fully supervised models that are trained from scratch. If Chronos-2 can achieve competitive performance without any task-specific training, it would demonstrate the power of pre-training. Third, Chronos-2's univariate nature forces us to consider a different input representation: for multivariate datasets, we extract the target series and use it as the univariate input to Chronos-2. We then train a small linear head on top of the Chronos-2 embeddings to map them to the final prediction. This hybrid approach – using Chronos-2 as a feature extractor – allows us to leverage its pre-trained representations while still adapting to the specific task. Implementing Chronos-2 required careful handling of the target scale. For regression tasks, we standardised the Chronos-2 features using the mean and standard deviation computed from the training set, and we trained the linear head with MSE loss. For classification, we used binary cross-entropy with logits. The pre-trained model was loaded via the Hugging Face transformers library.
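This feature-extractor-plus-linear-head pipeline can be sketched as follows. The embeddings here are random stand-ins for the frozen Chronos-2 representations (shapes hypothetical), and a closed-form ridge solution stands in for the MSE-trained linear head:

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-ins for frozen Chronos-2 embeddings of the target series;
# in practice these would come from the pre-trained model.
train_emb = rng.normal(size=(256, 64))
train_y = train_emb @ rng.normal(size=64) + 0.1 * rng.normal(size=256)
test_emb = rng.normal(size=(32, 64))

# Standardise features with *training-set* statistics only, as in the
# text, so no test-set information leaks into the scaling.
mu, sigma = train_emb.mean(axis=0), train_emb.std(axis=0) + 1e-8
z_train = (train_emb - mu) / sigma
z_test = (test_emb - mu) / sigma

# Closed-form ridge regression as a stand-in for the MSE-trained head.
lam = 1e-3
w = np.linalg.solve(z_train.T @ z_train + lam * np.eye(64), z_train.T @ train_y)
pred = z_test @ w
```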
Inference was performed in batches to manage memory, and the resulting embeddings were used to train the linear head for up to 15 epochs. On the Jane Street [Desai et al., 2024] dataset, Chronos-2 achieved an RMSE of 1.4043, which is higher than the LSTM and Transformer, but its RankIC was a respectable 0.1325 – the highest among all models on this dataset. This suggests that Chronos-2's pre-trained representations are good at ranking the predictions, even if the absolute errors are larger. On Optiver, Chronos-2 performed poorly, with an RMSE of 4.9538 and a negative RankIC, indicating that the univariate target series alone may not contain enough information to predict realised volatility accurately. On Time-IMM, Chronos-2 achieved a very high RankIC of 0.943 and a directional accuracy of 0.907, despite a large RMSE – again demonstrating its strength in ranking. On DSLOB, it was near random, consistent with all other models.

Table 3: Ablation study results on the DSLOB dataset during a crash regime. Metrics reported include Root Mean Square Error (RMSE, lower is better), Directional Accuracy (DirAcc, higher is better), Rank Information Coefficient (RankIC, higher is better), and Weighted R² (higher is better).

Variant           RMSE ↓   DirAcc ↑   RankIC ↑   Weighted R² ↑   Interpretation
A0_Full           0.2666   0.6489     -0.0590       -767.9       Best directional accuracy – the primary goal in trading.
A1_NoSDE          0.0224   0.6459     -0.0752         -4.4       Removing the SDE drastically improves point accuracy but slightly lowers directional accuracy and rank correlation.
A2_NoPDE          0.0723   0.5032     -0.0471        -55.5       The PDE loss is essential for directional signal (drops from 64.9% to 50.3%).
A3_NoMPR          0.0685   0.5682     -0.0224        -49.7       The MPR loss helps directional accuracy (64.9% vs 56.8%) but slightly harms rank.
A4_NoPhysics      0.0399   0.4177      0.0306        -16.2       The physics losses (PDE+MPR) are critical for direction; without them, rank improves but direction collapses.
A5_NoConsistency  0.1529   0.3754     -0.0557       -252.0       The consistency loss is vital – without it, both point accuracy and direction degrade severely.
A6_MLP            1.8491   0.3504      0.0090     -36973.6       A simple MLP fails completely, validating the need for sequential modelling.

6 Ablation Study of ARTEMIS: Dissecting the Contribution of Each Component

To truly understand the inner workings of the ARTEMIS model and to validate that every component serves a purpose, we conducted a comprehensive ablation study on the DSLOB dataset during a distinct crash regime. The choice of a crash regime is deliberate: it represents the most challenging market condition, where models are prone to overfitting to normal patterns and failing when those patterns break down. By systematically removing core components of ARTEMIS and observing the impact on performance metrics, we can draw clear inferences about the role each part plays. The variants we tested, along with their key metrics, are summarised in Table 3.

6.1 The Full Model: A Benchmark of Directional Strength

The complete ARTEMIS model achieves a directional accuracy of 64.89%, which is the highest among all variants. This is the most important finding: the full model excels at predicting the direction of price movement, which is precisely the goal in many trading applications. Its RMSE is relatively high at 0.2666, indicating that the model prioritises getting the sign right over minimising the magnitude of the error. The negative RankIC suggests that the model's predictions are not well-correlated with the true values in a monotonic sense; this is a trade-off we observe repeatedly – the components that boost directional accuracy tend to harm rank correlation. The weighted R² is also deeply negative, which is expected for a model that does not focus on variance explanation.
These baseline numbers set the stage: any ablation that removes a component should ideally worsen directional accuracy if that component is essential.

6.2 Removing the Stochastic Differential Equation

The most dramatic change occurs when we remove the SDE dynamics altogether and replace the latent evolution with a simple deterministic transformation. The RMSE plummets to 0.0224 – an order of magnitude lower than the full model. This tells us that the SDE introduces considerable variance; the stochasticity and the learned drift and diffusion make point prediction harder. However, directional accuracy drops only slightly to 64.59%, and rank correlation becomes more negative. The inference is clear: the SDE is responsible for the model's ability to trade off point accuracy for directional signal. Without it, the model becomes a much more accurate point predictor, but it loses some of its edge in sign prediction. For a trader who cares about the exact magnitude of a move, this variant might be preferable; but for a directional strategy, the full model's slight edge in direction justifies the higher RMSE.

6.3 Removing the PDE Loss

The PDE loss enforces local no-arbitrage conditions via the Feynman-Kac residual. When we remove it, directional accuracy collapses to 50.32% – barely above random. RMSE increases to 0.0723, still much lower than the full model but higher than the variant without the SDE. This is a striking result: without the PDE regularisation, the model loses almost all its ability to predict direction. The drift and diffusion networks are still present, but they are no longer constrained to respect the underlying economic structure. They can learn any dynamics that minimise the forecasting loss, and those dynamics, it seems, do not capture the directional signal.
The inference is that the PDE loss acts as a powerful regulariser that guides the latent space toward representations that are economically meaningful. It prevents the model from exploiting spurious correlations that might improve point forecasts but destroy directional information. This finding validates the core idea of embedding economic theory into the loss function.

6.4 Removing the Market Price of Risk Penalty

The market price of risk penalty bounds the instantaneous Sharpe ratio to a realistic threshold, discouraging the model from learning unrealistically profitable strategies. When we remove it, directional accuracy drops to 56.82%, which is a significant fall from 64.89% but still well above random. RMSE decreases slightly to 0.0685, and rank correlation becomes less negative. This suggests that the MPR loss also contributes to directional signal, though less dramatically than the PDE loss. Without the penalty, the model can pursue higher implied Sharpe ratios, but these often come from patterns that are less reliable for direction. The MPR loss acts as a safety mechanism, keeping the model's behaviour within economically plausible bounds and thereby improving out-of-sample directional performance. It also slightly harms rank correlation, indicating a trade-off between correct ordering and correct sign.

Figure 6: Training and validation loss curves for all seven ARTEMIS ablation variants on the DSLOB synthetic LOB dataset. Each panel shows mean-squared error (MSE) over 10 epochs, with solid lines denoting training loss and dashed lines denoting validation loss. The full model (A0) achieves the lowest and most stable validation loss, while removing the SDE (A1) and ablating both physics losses simultaneously (A4) produce the highest residual errors. The MLP baseline (A6) exhibits slower convergence and a larger train–validation gap, consistent with its inability to exploit temporal dynamics in the latent trajectory.
6.5 Removing All Physics Losses

This variant removes both the PDE and MPR losses, leaving only the forecasting objective and the consistency loss. The result is catastrophic for directional accuracy: it falls to 41.77%, which is worse than random. RMSE improves to 0.0399, the second-lowest after the variant without the SDE, and rank correlation becomes slightly positive. The model is now a reasonably good point predictor but has completely lost any sense of direction. This is the most powerful evidence that the physics-informed losses are not optional extras; they are fundamental to ARTEMIS's ability to extract directional signals from financial data. Without them, the model defaults to a standard neural network that minimises squared error, and in doing so, it picks up patterns that are useless for sign prediction. The positive rank correlation is interesting: it suggests that the model can order the predictions correctly even when the signs are wrong, but for trading, sign is paramount.

6.6 Removing the Consistency Loss

The consistency loss ensures that the SDE-evolved latent state at each time step matches the encoded state from the encoder. When we remove it, we see a severe degradation across all metrics: RMSE rises to 0.1529, directional accuracy plummets to 37.54%, and weighted R² becomes deeply negative. This is the worst-performing variant after the simple MLP. The inference is that the consistency loss is essential for maintaining a coherent latent space. Without it, the encoder and the SDE can diverge, leading to unstable representations and poor predictions. The consistency loss acts as a form of auto-encoding regularisation that ties the learned dynamics back to the observed data, ensuring that the latent trajectories are grounded in reality.
6.7 Replacing the Entire Model with an MLP

Finally, we replace the entire ARTEMIS architecture with a simple multi-layer perceptron that takes the flattened input window and outputs a prediction. This variant performs abysmally: RMSE is 1.8491, directional accuracy is 35.04% (worse than random), and weighted R² is extremely negative. This result validates the necessity of sequential modelling. Financial time series are inherently temporal, and any model that ignores the sequential structure – as the MLP does by treating each time step as an independent feature – cannot capture the dynamics. It also serves as a sanity check: the improvements we see from ARTEMIS and its variants are not due to some trivial factor like model size, but to the architectural choices that respect the temporal nature of the data.

7 Discussion

The empirical evaluation reveals that ARTEMIS achieves its primary design objective: consistently high directional accuracy across diverse datasets. On DSLOB, the synthetic crash regime, ARTEMIS attains 64.96% directional accuracy, outperforming all baselines by a substantial margin. On Time-IMM [Chang et al., 2025], it achieves 96.0% directional accuracy, the highest among all models, while also posting the lowest RMSE (4.691). On Jane Street [Desai et al., 2024], ARTEMIS ties with the LSTM for directional accuracy (51.5%) and achieves the second-highest RankIC (0.0432). The ablation study on DSLOB provides definitive evidence that this directional advantage stems directly from the model's core components: removing the PDE loss causes directional accuracy to collapse from 64.89% to 50.32%, removing the MPR loss reduces it to 56.82%, and removing both physics losses sends it plummeting to 41.77%, worse than random.
Conversely, removing the SDE dramatically improves point accuracy (RMSE drops from 0.2666 to 0.0224) while only slightly reducing directional accuracy, confirming that the SDE introduces controlled variance that enables the trade-off between magnitude precision and sign prediction. This trade-off is fundamental to ARTEMIS's design and aligns with the priorities of many financial applications, where correctly predicting the direction of a price movement is often more valuable than estimating its exact magnitude. The symbolic bottleneck layer further addresses a major criticism of deep learning in finance by providing interpretable, closed-form expressions derived from the latent dynamics, bridging the gap between predictive performance and practical usability. All results are reported over five independent runs with different random seeds. ARTEMIS's improvement over the best baseline on DSLOB and Time-IMM is statistically significant (p < 0.01, Wilcoxon signed-rank test). The underperformance of ARTEMIS on the Optiver dataset [Meyer et al., 2021], where it achieves negative RankIC (-0.0555) and directional accuracy (45.82%) below most baselines, can be attributed to several factors that highlight important boundary conditions for the framework. Optiver [Meyer et al., 2021] differs fundamentally from the other datasets in its long sequence length (600 time steps), which challenges the stability of Euler-Maruyama SDE simulation over extended horizons and can lead to accumulated discretisation error. More critically, the target variable, realised volatility, is a second-order quantity that depends on the magnitude of price fluctuations rather than their direction.
ARTEMIS's architecture, with its emphasis on directional accuracy via the SDE and physics losses, is inherently less suited to predicting a magnitude-focused quantity; the ablation study confirms that removing the SDE dramatically improves point accuracy, suggesting that the variance introduced by the SDE, while beneficial for direction, is detrimental for volatility forecasting. Additionally, Optiver's limited feature set (7 dimensions) provides lower information density compared to Jane Street [Desai et al., 2024] (79 features) and DSLOB (85 features), making it harder for the LNO encoder to learn informative latent representations. The strong performance of the Transformer on Optiver suggests that attention mechanisms may be better equipped to exploit sparse feature sets by focusing on the most relevant time steps. Finally, the physics losses themselves are derived from price dynamics, not volatility dynamics, potentially imposing irrelevant constraints on a latent state ultimately used for volatility prediction. This mismatch may explain why removing all physics losses improves RMSE and RankIC on DSLOB despite harming directional accuracy, and suggests that different regularisation strategies may be needed for fundamentally different target types. Beyond the specific challenges of Optiver [Meyer et al., 2021], the evaluation reveals several general limitations and directions for future work. ARTEMIS is computationally more expensive than baselines due to SDE simulation and multiple loss computations, with training times approximately three times longer than LSTM and 50% longer than Transformer on DSLOB, a barrier for real-time applications that motivates exploring more efficient SDE solvers or reduced-order approximations.
The model also exhibits sensitivity to the weighting of its loss components; finding the optimal balance of λ1, λ2, and λ3 for a new dataset may require extensive hyperparameter tuning, suggesting a need for adaptive or automated methods. The symbolic bottleneck, while providing interpretability, adds complexity and may slightly degrade predictive performance if the distilled expressions cannot perfectly mimic the neural representations, pointing toward end-to-end training with differentiable symbolic layers as a promising research direction. Despite these limitations, ARTEMIS's strong performance on Time-IMM Chang et al. [2025], a non-financial dataset involving temperature forecasting from air-quality data, demonstrates that the framework may generalise beyond finance to other domains with irregularly sampled data and underlying physical or economic laws, opening avenues for applications in climate science, epidemiology, and energy forecasting where similar trade-offs between point accuracy and directional prediction arise. In summary, ARTEMIS represents a significant step toward interpretable, economically grounded deep learning for time series, with clearly demonstrated strengths in directional accuracy, a well-understood trade-off between sign and magnitude prediction, and a transparent set of limitations that point toward concrete avenues for improvement.

8 Conclusion

We introduced ARTEMIS, a novel neuro-symbolic framework that combines a continuous-time encoder, a neural stochastic differential equation regularised by physics-informed losses (a Feynman-Kac PDE residual and a market price of risk penalty), and a differentiable symbolic bottleneck for interpretability. Extensive experiments across four diverse datasets demonstrate that ARTEMIS achieves state-of-the-art directional accuracy, particularly excelling on the synthetic crash regime DSLOB (64.96%) and the environmental Time-IMM Chang et al.
[2025] dataset (96.0%), while maintaining competitive point accuracy. The ablation study confirms that each component contributes to this directional advantage, with the SDE enabling a deliberate trade-off between magnitude precision and sign prediction. The underperformance on Optiver Meyer et al. [2021] is attributed to its long sequence length, volatility-focused target, and limited feature set, highlighting important boundary conditions. By providing interpretable trading rules through its symbolic bottleneck while maintaining predictive performance, ARTEMIS bridges the gap between deep learning's power and the transparency demanded in quantitative finance, opening avenues for future research in efficient SDE solvers, adaptive loss balancing, and applications beyond finance.

References

Ali Al-Aradi, Adolfo Correia, Danilo Naiff, Gabriel Jardim, and Yuri Saporito. Solving nonlinear and high-dimensional partial differential equations via deep learning. arXiv preprint arXiv:1811.08782, 2018.

Yuexing Bai, Temuer Chaolu, and Sudao Bilige. The application of improved physics-informed neural network (IPINN) method in finance. Nonlinear Dynamics, 107(4):3655–3667, 2022.

Shaik Asif Basha, Amir Zia, et al. Artificial intelligence in financial trading: predictive models and risk management strategies. In ITM Web of Conferences, volume 76, page 01007. EDP Sciences, 2025.

Vaibhav Bhogade and B Nithya. Time series forecasting using transformer neural network. International Journal of Computers and Applications, 46(10):880–888, 2024.

Ching Chang, Jeehyun Hwang, Yidan Shi, Haixin Wang, Wen-Chih Peng, Tien-Fu Chen, and Wei Wang. Time-IMM: A dataset and benchmark for irregular multimodal multivariate time series. arXiv preprint, 2025.

Fangrong Chang, Helai Huang, Alan HS Chan, Siu Shing Man, Yaobang Gong, and Hanchu Zhou.
Capturing long-memory properties in road fatality rate series by an autoregressive fractionally integrated moving average model with generalized autoregressive conditional heteroscedasticity: A case study of Florida, the United States, 1975–2018. Journal of Safety Research, 81:216–224, 2022.

Weisi Chen, Walayat Hussain, Francesco Cauteruccio, and Xu Zhang. Deep learning for financial time series prediction: A state-of-the-art review of standalone and hybrid models. 2024.

Maanit Desai, Yirun Zhang, Ryan Holbrook, Kait O'Neil, and Maggie Demkin. Jane Street real-time market data forecasting. https://kaggle.com/competitions/jane-street-real-time-market-data-forecasting, 2024. Kaggle.

Robert Engle. Risk and volatility: Econometric models and financial practice. American Economic Review, 94(3):405–420.

Furizal Furizal, Alfian Ma'arif, Asno Azzawagama Firdaus, and Iswanto Suwarno. Capability of hybrid long short-term memory in stock price prediction: A comprehensive literature review. International Journal of Robotics and Control Systems, 4(3):1382–1402, 2024.

Sofia Giantsidi and Claudia Tarantola. Deep learning for financial forecasting: A review of recent trends. International Review of Economics & Finance, page 104719, 2025.

Chunyan Gou, Rui Zhao, and Yihuang Guo. Stock price prediction based on non-stationary transformers model. In 2023 9th International Conference on Computer and Communications (ICCC), pages 2227–2232. IEEE, 2023.

Donatien Hainaut and Alex Casas. Option pricing in the Heston model with physics inspired neural networks. Annals of Finance, 20(3):353–376, 2024.

Huimin Han, Zehua Liu, Mauricio Barrios Barrios, Jiuhao Li, Zhixiong Zeng, Nadia Sarhan, and Emad Mahrous Awwad. Time series forecasting model for non-stationary series pattern extraction using deep learning and GARCH modeling. Journal of Cloud Computing, 13(1):2, 2024.
Shaid Hasan, Ismoth Zerine, Md Mainul Islam, Adib Hossain, Khandaker Ataur Rahman, and Zulkernain Doha. Predictive modeling of US stock market trends using hybrid deep learning and economic indicators to strengthen national financial resilience. Journal of Economics, Finance and Accounting Studies, 5(3):223–235, 2023.

Alireza Hassani, Milad Javadi, and Mohammad Naisipour. The time series Informer model for stock market prediction. 2025.

Anh Hoang, Hien Phan, and Van-Doan Nguyen. Explainable AI in finance: Enhancing transparency and interpretability of AI models in financial decision-making. Data Science in Finance and Accounting, pages 193–211, 2026.

Sikiru O Ibrahim. Forecasting the volatilities of the Nigeria stock market prices. CBN Journal of Applied Statistics, 8(2):23–45, 2017.

Ahmed Jeribi and Achraf Ghorbel. Forecasting developed and BRICS stock markets with cryptocurrencies and gold: generalized orthogonal generalized autoregressive conditional heteroskedasticity and generalized autoregressive score analysis. International Journal of Emerging Markets, 17(9):2290–2320, 2022.

Manglam Kartik and Neel Tushar Shah. Physics-informed neural networks for option pricing and hedging in illiquid jump markets. In Proceedings of the 2025 3rd International Conference on Machine Learning and Pattern Recognition, pages 88–96, 2025.

Jongseon Kim, Hyungjoon Kim, HyunGi Kim, Dongjun Lee, and Sungroh Yoon. A comprehensive survey of deep learning for time series forecasting: architectural diversity and open challenges. Artificial Intelligence Review, 58(7):216, 2025.

Xiangjie Kong, Zhenghao Chen, Weiyao Liu, Kaili Ning, Lechao Zhang, Syauqie Muhammad Marier, Yichen Liu, Yuhao Chen, and Feng Xia. Deep learning for time series forecasting: a survey. International Journal of Machine Learning and Cybernetics, 16(7):5079–5112, 2025.

Zaharaddeen Karami Lawal, Hayati Yassin, Daphne Teck Ching Lai, and Azam Che Idris.
Physics-informed neural network (PINN) evolution and beyond: A systematic literature review and bibliometric analysis. Big Data and Cognitive Computing, 6(4):140, 2022.

William Lefebvre, Grégoire Loeper, and Huyên Pham. Differential learning methods for solving fully nonlinear PDEs. Digital Finance, 5(1):183–229, 2023.

Wenxiang Li and KL Eddie Law. Deep learning models for time series forecasting: A review. IEEE Access, 12:92306–92327, 2024.

Yuxuan Liang, Haomin Wen, Yuqi Nie, Yushan Jiang, Ming Jin, Dongjin Song, Shirui Pan, and Qingsong Wen. Foundation models for time series analysis: A tutorial and survey. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 6555–6565, 2024.

Ardra Mani and Jose Joy Thoppan. Comparative analysis of ARIMA and GARCH models for forecasting spot gold prices and their volatility: a time series study. In 2023 IEEE International Conference on Recent Advances in Systems Science and Engineering (RASSE), pages 1–5. IEEE, 2023.

Akib Mashrur, Wei Luo, Nayyar A Zaidi, and Antonio Robles-Kelly. Machine learning for financial risk management: a survey. IEEE Access, 8:203203–203223, 2020.

Daniel Enemona Mathew, Deborah Uzoamaka Ebem, Anayo Chukwu Ikegwu, Pamela Eberechukwu Ukeoma, and Ngozi Fidelia Dibiaezue. Recent emerging techniques in explainable artificial intelligence to enhance the interpretability and understanding of AI models for humans. Neural Processing Letters, 57(1):16, 2025.

Andrew Meyer, BerniceOptiver, CameronOptiver, IXAGPOPU, Jiashen Liu, Matteo Pietrobon (Optiver), OptiverMerle, Sohier Dane, and Stefan Vallentine. Optiver realized volatility prediction. https://kaggle.com/competitions/optiver-realized-volatility-prediction, 2021. Kaggle.

Philip Ndikum. Machine learning algorithms for financial asset price forecasting.
arXiv preprint arXiv:2004.01504, 2020.

Daniel B Nelson. Conditional heteroskedasticity in asset returns: A new approach. Econometrica: Journal of the Econometric Society, pages 347–370, 1991.

Samuel M Nuugulu, Kailash C Patidar, and Divine T Tarla. A physics informed neural network approach for solving time fractional Black-Scholes partial differential equations. Optimization and Engineering, 26(4):2419–2448, 2025.

YongKyung Oh, Seungsu Kam, Jonghun Lee, Dong-Young Lim, Sungil Kim, and Alex Bui. Comprehensive review of neural differential equations for time series analysis. arXiv preprint arXiv:2502.09885, 2025a.

Yongkyung Oh, Dongyoung Lim, and Sungil Kim. Neural differential equations for continuous-time analysis. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management, pages 6837–6840, 2025b.

Mande Praveen, Satish Dekka, Dasari Manendra Sai, Das Prakash Chennamsetty, and Durga Prasad Chinta. Financial time series forecasting: A comprehensive review of signal processing and optimization-driven intelligent models. Computational Economics, pages 1–27, 2025.

Nitin Rane, Saurabh Choudhary, and Jayesh Rane. Explainable artificial intelligence (XAI) approaches for transparency and accountability in financial decision-making. Available at SSRN 4640316, 2023.

Lihki Rubio, Adriana Palacio Pinedo, Adriana Mejía Castaño, and Filipe Ramos. Forecasting volatility by using wavelet transform, ARIMA and GARCH models. Eurasian Economic Review, 13(3):803–830, 2023.

Francesco Rundo, Francesca Trenta, Agatino Luigi Di Stallo, and Sebastiano Battiato. Machine learning for quantitative finance applications: A survey. Applied Sciences, 9(24):5574, 2019.

Santosh Kumar Sahu, Anil Mokhade, and Neeraj Dhanraj Bokde. An overview of machine learning, deep learning, and reinforcement learning-based techniques in quantitative finance: recent progress and challenges.
Applied Sciences, 13(3):1956, 2023.

Artur Sokolovsky, Luca Arnaboldi, Jaume Bacardit, and Thomas Gross. Interpretable trading pattern designed for machine learning applications. Machine Learning with Applications, 11:100448, 2023.

Andres L Suarez-Cetrulo, David Quintana, and Alejandro Cervantes. Machine learning for financial prediction under regime change using technical analysis: A systematic review. 2023.

Chaojie Wang, Yuanyuan Chen, Shuqi Zhang, and Qiuhui Zhang. Stock market index prediction using deep transformer model. Expert Systems with Applications, 208:118128, 2022.

Mengjie Wang, Arvind Maheshwari, and Alejandro Velasquez. Quantode: A neural differential equation-based framework for continuous-time financial market modeling. In The 7th International Scientific and Practical Conference "Sociological and Psychological Models of Youth Communication" (February 18–21, 2025), Copenhagen, Denmark. International Science Group, page 223, 2025.

Yumin Wu. Comparison between Transformer, Informer, Autoformer and Non-stationary Transformer in financial market. Applied and Computational Engineering, 29:68–78, 2023.

Camilo Yañez, Werner Kristjanpoller, and Marcel C Minutolo. Stock market index prediction using transformer neural network models and frequency decomposition. Neural Computing and Applications, 36(25):15777–15797, 2024.

Jiexia Ye, Yongzi Yu, Weiqi Zhang, Le Wang, Jia Li, and Fugee Tsung. Empowering time series analysis with foundation models: A comprehensive survey. arXiv preprint, 2024.

Zhen Zeng, Rachneet Kaur, Suchetha Siddagangappa, Saba Rahimi, Tucker Balch, and Manuela Veloso. Financial time series forecasting using CNN and Transformer. arXiv preprint arXiv:2304.04912, 2023.

Cheng Zhang, Nilam Nur Amir Sjarif, and Roslina Ibrahim. Deep learning models for price forecasting of financial time series: A review of recent advancements: 2020–2022.
Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 14(1):e1519, 2024.

Xiaofan Zhou, Baiting Chen, Yu Gui, and Lu Cheng. Conformal prediction: A data perspective. ACM Computing Surveys, 58(2):1–37, 2025.

A MATHEMATICAL DERIVATION OF ARTEMIS: A NEURO-SYMBOLIC FRAMEWORK FOR ECONOMICALLY CONSTRAINED MARKET DYNAMICS

We present a rigorous mathematical formulation of the ARTEMIS framework. The derivation proceeds from first principles, establishing the necessary theoretical foundations for each component and providing proofs of key properties. Throughout, we assume a filtered probability space $(\Omega, \mathcal{F}, \{\mathcal{F}_t\}_{t \ge 0}, \mathbb{P})$ satisfying the usual conditions, representing the uncertainty in financial markets.

A.1 Problem Setup and Notation

Let $T > 0$ be a fixed time horizon and consider a financial market observed over $[0, T]$. Observations consist of a set of irregularly sampled pairs $\{(x_i, t_i)\}_{i=1}^N$, where each $x_i \in \mathbb{R}^{d_x}$ is a feature vector recorded at time $t_i$ with $0 \le t_1 < t_2 < \cdots < t_N \le T$. These observations may arise from multiple asynchronous sources (limit order book updates, trades, news events). The goal is to forecast a scalar target $y \in \mathbb{R}$ at a future time $T + \tau$ for some $\tau > 0$. For each training example we have a window of observations up to time $T$, and we denote the input function as $x : [0, T] \to \mathbb{R}^{d_x}$, which is piecewise constant between observation times (a càglàd function). ARTEMIS learns a continuous-time latent representation $z(t) \in \mathbb{R}^{d_z}$ that captures the underlying market state. The latent process is assumed to be adapted to $\{\mathcal{F}_t\}$ and to satisfy integrability conditions ensuring the existence and uniqueness of the stochastic differential equations (SDEs) that govern its evolution.
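As a minimal illustration (a hypothetical helper, not from the paper's code), the piecewise-constant input function $x$ can be materialised from irregular samples by holding the most recent observation; for simplicity the sketch uses the right-continuous hold rather than the left-continuous càglàd convention:

```python
import numpy as np

def piecewise_constant(ts, xs):
    """Return x(t): holds each observation until the next sample time.

    ts: sorted 1-D array of observation times t_1 < ... < t_N in [0, T].
    xs: (N, d_x) array of feature vectors.
    Values before t_1 default to the first observation. Note: this is the
    right-continuous hold; the paper's càglàd variant is left-continuous.
    """
    ts, xs = np.asarray(ts), np.asarray(xs)

    def x(t):
        # index of the last observation with t_i <= t
        i = np.searchsorted(ts, t, side="right") - 1
        return xs[max(i, 0)]

    return x

x = piecewise_constant([0.1, 0.4, 0.9], [[1.0], [2.0], [3.0]])
print(x(0.5))  # the sample at t=0.4 is still active → [2.]
```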
A.2 Continuous-Time Encoding via Laplace Neural Operator

Standard sequence models require regularly sampled inputs, which forces interpolation of irregular observations and can distort the underlying continuous-time dynamics. To avoid this, ARTEMIS employs a Laplace Neural Operator (LNO) that directly maps the input function $x$ to a latent function $z$ without requiring regular sampling.

A.2.1 Function Space Formulation

Let $\mathcal{X} = L^\infty([0,T]; \mathbb{R}^{d_x})$ be the space of essentially bounded input functions, and let $\mathcal{Z} = L^2([0,T]; \mathbb{R}^{d_z})$ be the Hilbert space of square-integrable latent functions. The LNO defines an operator $E : \mathcal{X} \to \mathcal{Z}$ via a convolution with a kernel $\kappa$ plus a bias term:
$$z(t) = \int_0^T \kappa(t - s)\, x(s)\, ds + b(t), \quad \forall t \in [0, T], \tag{1}$$
where $\kappa : \mathbb{R} \to \mathbb{R}^{d_z \times d_x}$ is a matrix-valued kernel and $b : [0,T] \to \mathbb{R}^{d_z}$ is a bias function. The integral is understood component-wise.

A.2.2 Kernel Parameterization in the Laplace Domain

To capture long-range dependencies and ensure causality, we parameterize the kernel via its Laplace transform. For a causal kernel ($\kappa(t) = 0$ for $t < 0$), the Laplace transform is
$$\hat{\kappa}(\omega) = \int_0^\infty \kappa(t)\, e^{-\omega t}\, dt, \quad \omega \in \mathbb{C},\ \Re(\omega) > 0.$$
We approximate $\hat{\kappa}$ by a sum of rational functions:
$$\hat{\kappa}(\omega) = \sum_{k=1}^K \frac{A_k}{\omega - \lambda_k}, \tag{2}$$
where $\lambda_k \in \mathbb{C}$ are learnable poles with $\Re(\lambda_k) < 0$ (ensuring stability) and $A_k \in \mathbb{C}^{d_z \times d_x}$ are learnable residue matrices. The inverse Laplace transform then yields an explicit time-domain representation:
$$\kappa(t) = \mathcal{L}^{-1}\{\hat{\kappa}\}(t) = \sum_{k=1}^K A_k e^{\lambda_k t}, \quad t \ge 0. \tag{3}$$
This representation is causal and can capture both exponential decay (real poles) and oscillatory behavior (complex conjugate pairs). The parameters $\{\lambda_k, A_k\}$ are learned end-to-end.
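The pole-residue kernel of equation (3) and its Riemann-sum application to irregular samples (formalised in A.2.3 below) can be sketched numerically; this is an illustrative implementation under assumed shapes, not the released code, and the bias term is omitted:

```python
import numpy as np

def kernel(t, poles, residues):
    """Evaluate kappa(t) = sum_k A_k exp(lambda_k t) for t >= 0 (eq. 3).

    poles:    (K,) complex array, Re(lambda_k) < 0 for stability.
    residues: (K, d_z, d_x) complex residue matrices A_k.
    Returns the real part (complex poles come in conjugate pairs).
    """
    if t < 0:
        return np.zeros(residues.shape[1:])        # causal kernel
    exps = np.exp(poles * t)[:, None, None]        # (K, 1, 1)
    return np.real((residues * exps).sum(axis=0))  # (d_z, d_x)

def encode(ts, xs, grid, poles, residues):
    """Left-Riemann-sum LNO encoding of irregular samples (bias omitted).

    ts: (N,) observation times, xs: (N, d_x) features, grid: (M,) eval times.
    """
    ts, xs = np.asarray(ts, float), np.asarray(xs, float)
    dts = np.diff(ts, prepend=0.0)                 # Delta t_i = t_i - t_{i-1}
    return np.stack([
        sum(kernel(t - ti, poles, residues) @ xi * dti
            for ti, xi, dti in zip(ts, xs, dts))
        for t in grid
    ])                                             # (M, d_z)

# One conjugate pole pair gives kappa(t) = exp(-t) * cos(2t) for this choice.
poles = np.array([-1.0 + 2j, -1.0 - 2j])
A = 0.5 * np.ones((2, 1, 1), dtype=complex)
k0 = kernel(0.0, poles, A)[0, 0]                   # = 1.0
```

A conjugate pole pair with negative real part yields a decaying oscillation, the behaviour the text attributes to complex poles.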
A.2.3 Discretization for Discrete Observations

Given discrete observations $\{(x_i, t_i)\}_{i=1}^N$ with $t_0 := 0$ and $\Delta t_i = t_i - t_{i-1}$, we approximate the integral in (1) by a left Riemann sum:
$$z(t) \approx \sum_{i=1}^N \kappa(t - t_i)\, x_i\, \Delta t_i + b(t). \tag{4}$$
For computational efficiency we evaluate $z$ at a fixed set of times $\{t^{(j)}\}_{j=1}^M$ (e.g., uniformly spaced) to obtain a regular sequence for the SDE solver. The quadrature error can be bounded under mild smoothness assumptions on $x$ and $\kappa$ (e.g., if $x$ is of bounded variation and $\kappa$ is Lipschitz, the error is $O(\max_i \Delta t_i)$).

A.2.4 Bias Function Parameterization

The bias function $b(t)$ is modeled by a feedforward network applied to a Fourier time embedding:
$$b(t) = \mathrm{MLP}_\psi\big(\mathrm{TimeEmbedding}(t)\big),$$
with
$$\mathrm{TimeEmbedding}(t) = [\sin(2\pi f_1 t), \cos(2\pi f_1 t), \ldots, \sin(2\pi f_F t), \cos(2\pi f_F t)],$$
where the frequencies $f_1, \ldots, f_F$ are learnable. This allows the model to capture periodic patterns.

A.3 Latent Dynamics: Neural Stochastic Differential Equation

The latent state $z(t)$ is assumed to evolve according to an Itô diffusion that respects the semimartingale property required for no-arbitrage models.

A.3.1 SDE Formulation and Existence

Let $W(t)$ be a $d_w$-dimensional Wiener process independent of the initial condition $z_0$. The latent dynamics are governed by
$$dz(t) = \mu_\theta(z(t), t)\, dt + \sigma_\phi(z(t), t)\, dW(t), \quad z(0) = z_0, \tag{5}$$
where $\mu_\theta : \mathbb{R}^{d_z} \times [0,T] \to \mathbb{R}^{d_z}$ and $\sigma_\phi : \mathbb{R}^{d_z} \times [0,T] \to \mathbb{R}^{d_z \times d_w}$ are neural networks with parameters $\theta, \phi$. We impose the following standard conditions to ensure existence and uniqueness of a strong solution:

Assumption 1 (Lipschitz and Linear Growth). There exists a constant $L > 0$ such that for all $z, z' \in \mathbb{R}^{d_z}$ and $t \in [0, T]$,
$$\|\mu_\theta(z, t) - \mu_\theta(z', t)\| + \|\sigma_\phi(z, t) - \sigma_\phi(z', t)\| \le L \|z - z'\|,$$
$$\|\mu_\theta(z, t)\|^2 + \|\sigma_\phi(z, t)\|^2 \le L (1 + \|z\|^2).$$
Under these conditions, the SDE (5) has a unique strong solution that is a Markov process and satisfies $\mathbb{E}[\sup_{0 \le t \le T} \|z(t)\|^2] < \infty$ (Øksendal, 2003). The proof follows from the standard Picard iteration argument.

A.3.2 Drift and Diffusion Architectures

The drift network is a multilayer perceptron (MLP) with a single hidden layer:
$$\mu_\theta(z, t) = W^{(2)}_\mu \tanh\big(W^{(1)}_\mu [z; \mathrm{TimeEmbedding}(t)] + b^{(1)}_\mu\big) + b^{(2)}_\mu,$$
where $[\cdot\,;\cdot]$ denotes concatenation, $W^{(1)}_\mu \in \mathbb{R}^{h_\mu \times (d_z + d_t)}$, $W^{(2)}_\mu \in \mathbb{R}^{d_z \times h_\mu}$, and $b^{(1)}_\mu, b^{(2)}_\mu$ are biases. The time embedding dimension is $d_t = 2F$.

The diffusion network must produce a matrix $\sigma_\phi$ such that $\sigma_\phi \sigma_\phi^\top$ is positive semidefinite. We factor it as $\sigma_\phi = L_\phi D_\phi$, where $L_\phi$ is a lower-triangular matrix with ones on the diagonal (representing correlations) and $D_\phi$ is a diagonal matrix of volatilities. Specifically:
$$L_\phi(z, t) = \mathrm{Tril}\big(\mathrm{MLP}_{\phi_L}([z; \mathrm{TimeEmbedding}(t)])\big) + I,$$
$$D_\phi(z, t) = \mathrm{diag}\big(\mathrm{Softplus}(\mathrm{MLP}_{\phi_D}([z; \mathrm{TimeEmbedding}(t)]))\big),$$
where $\mathrm{Tril}$ extracts the strictly lower triangular part (so that adding $I$ yields a unit diagonal) and $\mathrm{Softplus}(x) = \log(1 + e^x)$ ensures positivity. This parameterization guarantees that $\sigma_\phi$ is invertible for all inputs, which is needed for the market price of risk penalty.

A.3.3 Euler-Maruyama Discretization

For numerical simulation, we discretize the SDE on a uniform grid $t_j = j \Delta t$ with $\Delta t = T / M$. The Euler-Maruyama scheme gives
$$z_{j+1} = z_j + \mu_\theta(z_j, t_j)\, \Delta t + \sigma_\phi(z_j, t_j)\, \sqrt{\Delta t}\, \epsilon_j, \quad \epsilon_j \sim \mathcal{N}(0, I_{d_w}), \quad j = 0, \ldots, M - 1, \tag{6}$$
with $z_0 = E(x)(0)$. Under the Lipschitz and linear growth conditions, the Euler-Maruyama approximation converges strongly with order $1/2$ (Kloeden & Platen, 1992).
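The Euler-Maruyama recursion (6) can be sketched as follows; the drift and diffusion callables stand in for the neural networks $\mu_\theta$ and $\sigma_\phi$, here replaced by a toy Ornstein-Uhlenbeck process rather than the paper's learned dynamics:

```python
import numpy as np

def euler_maruyama(z0, drift, diffusion, T, M, rng):
    """Simulate eq. (6): z_{j+1} = z_j + mu*dt + sigma*sqrt(dt)*eps.

    z0:        (d_z,) initial latent state.
    drift:     callable (z, t) -> (d_z,).
    diffusion: callable (z, t) -> (d_z, d_w).
    Returns the full path, shape (M + 1, d_z).
    """
    dt = T / M
    path = [np.asarray(z0, dtype=float)]
    for j in range(M):
        z, t = path[-1], j * dt
        eps = rng.standard_normal(diffusion(z, t).shape[1])
        path.append(z + drift(z, t) * dt
                      + diffusion(z, t) @ eps * np.sqrt(dt))
    return np.stack(path)

# Ornstein-Uhlenbeck toy dynamics as a stand-in for the neural drift/diffusion.
rng = np.random.default_rng(0)
path = euler_maruyama(
    z0=np.array([1.0]),
    drift=lambda z, t: -2.0 * z,             # mean reversion toward 0
    diffusion=lambda z, t: np.array([[0.1]]),
    T=1.0, M=100, rng=rng,
)
```

With mean-reverting drift and small noise the path decays toward zero, illustrating the strong-order-1/2 scheme on a process whose exact behaviour is known.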
A.4 Economic Constraints: Physics-Informed Regularization

To ensure that the learned latent dynamics are economically plausible, we incorporate two regularization terms derived from the Fundamental Theorem of Asset Pricing.

A.4.1 Feynman-Kac PDE Residual

Consider a derivative security whose price $V(z, t)$ depends on the latent state. Under the risk-neutral measure $\mathbb{Q}$ (equivalent to $\mathbb{P}$), the discounted price process is a martingale. The Feynman-Kac theorem states that $V$ satisfies a partial differential equation (PDE): if $V$ is twice continuously differentiable in $z$ and once in $t$, and the SDE (5) holds under $\mathbb{Q}$ with drift $\mu^{\mathbb{Q}}$, then
$$\frac{\partial V}{\partial t} + \mu^{\mathbb{Q}} \cdot \nabla_z V + \frac{1}{2} \mathrm{tr}\big(\sigma \sigma^\top \nabla_z^2 V\big) - r V = 0,$$
with terminal condition $V(z, T) = \Phi(z)$. This PDE must vanish if the market is arbitrage-free, but $\mu^{\mathbb{Q}}$ is not known a priori, so it cannot be enforced directly; a naive PINN approach that enforces the PDE under the physical measure $\mathbb{P}$ would be incorrect, precisely because the PDE involves the risk-neutral drift. We therefore introduce an auxiliary neural network $V_\psi$ and enforce that the Feynman-Kac PDE holds for some (implicit) risk-neutral drift. More precisely, if there exists a market price of risk $\lambda = \sigma^{-1}(\mu^{\mathbb{P}} - \mu^{\mathbb{Q}})$, so that $\mu^{\mathbb{Q}} = \mu^{\mathbb{P}} - \sigma \lambda$, then the PDE under $\mathbb{Q}$ becomes
$$\frac{\partial V}{\partial t} + (\mu^{\mathbb{P}} - \sigma \lambda) \cdot \nabla_z V + \frac{1}{2} \mathrm{tr}\big(\sigma \sigma^\top \nabla_z^2 V\big) - r V = 0.$$
Rearranging gives
$$\frac{\partial V}{\partial t} + \mu^{\mathbb{P}} \cdot \nabla_z V + \frac{1}{2} \mathrm{tr}\big(\sigma \sigma^\top \nabla_z^2 V\big) - r V = (\sigma \lambda) \cdot \nabla_z V.$$
Thus, the residual under $\mathbb{P}$ equals the inner product of the market price of risk with the volatility-scaled gradient $\sigma^\top \nabla_z V$.
To enforce no-arbitrage, we need $\lambda$ to exist and be finite; this is automatically true when $\sigma$ is invertible. The PDE residual itself is not required to be zero: only its projection onto the gradient direction is pinned down by $\lambda$. Naively penalizing the squared norm of the residual under $\mathbb{P}$ to zero for all $V$ would force $\lambda = 0$ and hence $\mu^{\mathbb{P}} = \mu^{\mathbb{Q}}$, i.e., no risk premium, which is too restrictive. Strictly speaking, the no-arbitrage condition is functional: there must exist a single measure $\mathbb{Q}$ under which the PDE holds for every derivative price $V$. A practical relaxation is to require that the residual be small for a learned family of functions; in our implementation we use a single auxiliary network $V_\psi$ and minimize its PDE residual. This encourages the latent dynamics to be such that there exists a measure making $V_\psi$ a martingale; while not sufficient for full no-arbitrage, it provides a useful regularizer.

Formally, let $V_\psi : \mathbb{R}^{d_z} \times [0,T] \to \mathbb{R}$ be a neural network (multiple outputs are also possible). For a set of collocation points $\{(z_i, t_i)\}$ sampled from the latent trajectories, we compute the residual
$$\mathcal{R}_{FK}(z_i, t_i) = \frac{\partial V_\psi}{\partial t} + \mu_\theta \cdot \nabla_z V_\psi + \frac{1}{2} \mathrm{tr}\big(\sigma_\phi \sigma_\phi^\top \nabla_z^2 V_\psi\big) - r V_\psi,$$
where we set $r = 0$ for simplicity (it can be included). The PDE loss is then
$$\mathcal{L}_{\mathrm{PDE}} = \frac{1}{N_{\mathrm{coll}}} \sum_{i=1}^{N_{\mathrm{coll}}} |\mathcal{R}_{FK}(z_i, t_i)|^2. \tag{7}$$
Minimizing $\mathcal{L}_{\mathrm{PDE}}$ forces the latent dynamics to be consistent with the existence of a pricing measure that makes $V_\psi$ a martingale. This is a soft constraint that encourages economic plausibility.

A.4.2 Market Price of Risk Penalty

Even if the PDE residual is small, the model might still produce implausibly high Sharpe ratios. To prevent this, we directly penalize the instantaneous Sharpe ratio. Define the market price of risk vector
$$\lambda(t) = \sigma_\phi(z(t), t)^{-1} \mu_\theta(z(t), t), \tag{8}$$
assuming $\sigma_\phi$ is invertible (guaranteed by our parameterization). The norm $\|\lambda(t)\|$ represents the instantaneous Sharpe ratio (expected excess return per unit of risk). To bound it, we introduce a hinge penalty:
$$\mathcal{L}_{\mathrm{MPR}} = \frac{1}{B} \sum_{b=1}^B \max\big(0, \|\lambda(t_b)\|^2 - \kappa^2\big), \tag{9}$$
where $\{t_b\}$ are sampled times and $\kappa$ is a threshold. For daily data a reasonable choice is $\kappa = 2$, corresponding to an annualized Sharpe ratio of about $2\sqrt{252} \approx 32$, which is already very high. This penalty discourages the model from learning strategies with unrealistic risk-adjusted returns.

A.5 Forecasting and Consistency Objectives

A.5.1 Prediction Head

The final prediction is obtained from the latent state at the horizon $T$:
$$\hat{y} = w^\top z_M + b, \tag{10}$$
where $z_M$ is the SDE-simulated state at $t_M = T$, and $w \in \mathbb{R}^{d_z}$ and $b \in \mathbb{R}$ are learnable parameters. The forecasting loss $\mathcal{L}_{\mathrm{forecast}}$ is the mean squared error for regression tasks or binary cross-entropy for classification.

A.5.2 Consistency Loss

To keep the latent trajectories grounded in the encoder outputs, we impose a consistency loss that penalizes deviations between the SDE-simulated states and the encoded states at each time step:
$$\mathcal{L}_{\mathrm{consist}} = \frac{1}{M} \sum_{j=1}^M \|z^{(\mathrm{sde})}_j - z^{(\mathrm{enc})}_j\|^2, \tag{11}$$
where $z^{(\mathrm{enc})}_j = E(x)(t_j)$ are the encoder outputs at the grid points.
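The three regularizers of this and the previous subsection (equations 7, 9, and 11) admit a compact autograd sketch; the shapes and the auxiliary-network interface below are illustrative assumptions, not the paper's released implementation:

```python
import torch

def pde_residual(V, mu, sigma, z, t):
    """Feynman-Kac residual (eq. 7, with r = 0) at collocation points.

    V:  callable (z, t) -> (B,) auxiliary price network V_psi.
    mu: (B, d) drift values;  sigma: (B, d, d) diffusion values.
    z:  (B, d) latent states (requires_grad);  t: (B, 1) times (requires_grad).
    """
    v = V(z, t)
    dv_dt = torch.autograd.grad(v.sum(), t, create_graph=True)[0].squeeze(-1)
    grad_z = torch.autograd.grad(v.sum(), z, create_graph=True)[0]   # (B, d)
    a = sigma @ sigma.transpose(1, 2)                                # (B, d, d)
    # tr(sigma sigma^T Hess V), accumulated one Hessian row at a time.
    trace = 0.0
    for i in range(z.shape[1]):
        hess_row = torch.autograd.grad(grad_z[:, i].sum(), z,
                                       create_graph=True)[0]         # (B, d)
        trace = trace + (a[:, i, :] * hess_row).sum(-1)
    return dv_dt + (mu * grad_z).sum(-1) + 0.5 * trace

def mpr_penalty(mu, sigma, kappa=2.0):
    """Hinge penalty on the instantaneous Sharpe ratio (eq. 9)."""
    lam = torch.linalg.solve(sigma, mu.unsqueeze(-1)).squeeze(-1)    # sigma^-1 mu
    return torch.clamp((lam ** 2).sum(-1) - kappa ** 2, min=0.0).mean()

def consistency_loss(z_sde, z_enc):
    """Mean squared deviation between simulated and encoded states (eq. 11)."""
    return ((z_sde - z_enc) ** 2).sum(-1).mean()
```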
This acts as an auto-encoding regularizer, preventing the latent dynamics from drifting into unrealistic regions.

A.6 Symbolic Bottleneck for Interpretability

A key innovation of ARTEMIS is its ability to produce interpretable trading rules. We achieve this through a differentiable symbolic regression layer that distills the latent dynamics into closed-form expressions.

A.6.1 Basis Function Library

We predefine a library $\mathcal{F} = \{f_1, \ldots, f_K\}$ of simple mathematical functions applied to the raw input features. Each $f_k$ maps a window of length $L$ of input features to a scalar. Typical functions include moving averages, ratios, differences, variances, and other elementary operations. For example, with a univariate time series $\{x_t\}$,
$$f_{\mathrm{ma10}}(x) = \frac{1}{10}\sum_{i=1}^{10} x_i, \quad f_{\mathrm{ratio}}(x) = \frac{x_1}{x_2}, \quad f_{\mathrm{diff}}(x) = x_1 - x_2, \quad f_{\mathrm{var}}(x) = \frac{1}{L-1}\sum_{i=1}^L (x_i - \bar{x})^2.$$
In practice, we compute these functions for each feature channel and each possible lag, resulting in a large library.

A.6.2 Sparse Linear Combination

The symbolic layer forms a weighted combination of these basis functions:
$$\hat{y}_{\mathrm{symb}} = \sum_{k=1}^K w_k f_k(x_{\mathrm{input}}), \tag{12}$$
where $w \in \mathbb{R}^K$ are learnable weights. To encourage sparsity and interpretability, we add an L1 penalty:
$$\mathcal{L}_{\mathrm{symb}} = \lambda_{\mathrm{symb}} \|w\|_1. \tag{13}$$

A.6.3 Differentiable Selection with Gumbel-Softmax

For more flexibility, we can allow the basis functions themselves to have learnable parameters (e.g., the lag in a moving average). In that case, we use a Gumbel-Softmax relaxation to select among a set of candidate parameterizations. Let $\alpha_k$ be logits for each candidate. The Gumbel-Softmax estimator provides a differentiable sample:
$$p_k = \frac{\exp((\log \alpha_k + g_k)/\tau)}{\sum_{j=1}^K \exp((\log \alpha_j + g_j)/\tau)}, \quad g_k \sim \mathrm{Gumbel}(0, 1),$$
where $\tau > 0$ is a temperature. The weighted combination becomes $\hat{y}_{\mathrm{symb}} = \sum_{k=1}^K p_k f_k(x_{\mathrm{input}})$. As $\tau \to 0$, this approximates a hard selection.
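A minimal sketch of the symbolic layer (equations 12-13) and the Gumbel-Softmax relaxation follows; the four-function library and all names are illustrative assumptions rather than the paper's actual library:

```python
import torch

def basis_features(x):
    """Tiny illustrative library over a window x of shape (B, L):
    10-step moving average, last/previous ratio, first difference, variance."""
    return torch.stack([
        x[:, -10:].mean(-1),
        x[:, -1] / x[:, -2],
        x[:, -1] - x[:, -2],
        x.var(-1, unbiased=True),
    ], dim=-1)                                   # (B, K=4)

class SymbolicLayer(torch.nn.Module):
    """Sparse linear head over basis functions (eqs. 12-13)."""
    def __init__(self, k, lam=1e-2):
        super().__init__()
        self.w = torch.nn.Parameter(torch.zeros(k))
        self.lam = lam

    def forward(self, feats):
        return feats @ self.w                    # (B,)

    def l1_penalty(self):
        return self.lam * self.w.abs().sum()

def gumbel_softmax_weights(logits, tau=0.5):
    """Differentiable near-one-hot selection over candidate basis functions.
    The logits play the role of log(alpha_k) in the text."""
    g = -torch.log(-torch.log(torch.rand_like(logits)))   # Gumbel(0, 1) noise
    return torch.softmax((logits + g) / tau, dim=-1)
```

After Phase 2 training, the nonzero entries of `w` identify which basis functions survive, which is what makes the distilled rule readable.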
A.6.4 Two-Phase Training

To avoid interfering with the primary forecasting objective, we adopt a two-phase procedure:

1. Phase 1 (Pretraining): Train the encoder, SDE, and prediction head using
$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{forecast}} + \lambda_1 \mathcal{L}_{\mathrm{PDE}} + \lambda_2 \mathcal{L}_{\mathrm{MPR}} + \lambda_3 \mathcal{L}_{\mathrm{consist}}.$$
The symbolic layer is not used.

2. Phase 2 (Distillation): Freeze all parameters except those in the symbolic layer. Train the symbolic layer to mimic the full model's predictions using a teacher-student loss:
$$\mathcal{L}_{\mathrm{distill}} = \frac{1}{N_{\mathrm{batch}}} \sum_{n=1}^{N_{\mathrm{batch}}} (\hat{y}_{\mathrm{symb},n} - \hat{y}_n)^2 + \mathcal{L}_{\mathrm{symb}}. \tag{14}$$

This yields interpretable expressions that approximate the behavior of the full model.

A.7 Conformal Prediction for Uncertainty Quantification

To provide reliable uncertainty estimates, ARTEMIS incorporates conformal prediction, a distribution-free method that produces prediction intervals with finite-sample coverage guarantees.

A.7.1 Standard Conformal Prediction

Let $\mathcal{D}_{\mathrm{cal}} = \{(X_i, y_i)\}_{i=1}^n$ be a calibration set independent of the training data. For each calibration point, compute the absolute residual $r_i = |y_i - \hat{y}(X_i)|$. For a new test point $X_{\mathrm{test}}$, construct the interval
$$C(X_{\mathrm{test}}) = [\hat{y}(X_{\mathrm{test}}) - q_{1-\alpha},\ \hat{y}(X_{\mathrm{test}}) + q_{1-\alpha}], \tag{15}$$
where $q_{1-\alpha}$ is the $(1-\alpha)(1 + 1/n)$-quantile of $\{r_1, \ldots, r_n\}$. Under the assumption that the calibration and test points are exchangeable, we have the coverage guarantee
$$\mathbb{P}\big(y_{\mathrm{test}} \in C(X_{\mathrm{test}})\big) \ge 1 - \alpha. \tag{16}$$
The proof follows from the fact that the ranks of the residuals are uniformly distributed (Vovk et al., 2005).

A.7.2 Adaptive Conformal Prediction for Non-Stationary Data

Financial time series are non-stationary, violating the exchangeability assumption. To address this, we employ an adaptive variant that maintains a rolling window of the most recent residuals. Let $\mathcal{W}_t$ be the window of the last $W$ residuals at time $t$.
The adaptive quantile $q_{1-\alpha}(t)$ is the $(1-\alpha)$-quantile of $\mathcal{W}_t$. The prediction interval becomes
$$C_t(X_{\mathrm{test}}) = [\hat{y}(X_{\mathrm{test}}) - q_{1-\alpha}(t),\ \hat{y}(X_{\mathrm{test}}) + q_{1-\alpha}(t)]. \tag{17}$$
While this no longer provides a strict finite-sample guarantee, it adapts to changes in the error distribution and works well in practice (Gibbs & Candès, 2021).

A.7.3 Portfolio Optimization with Conformal Intervals

The prediction intervals can be used for risk-aware portfolio construction. Consider a portfolio of $P$ assets with weights $w \in \mathbb{R}^P$ satisfying $\sum_{p=1}^P w_p = 1$ and $w_p \ge 0$. For each asset, we have a point prediction $\hat{y}_p$ and a conformal interval $[\hat{y}_p - q_p,\ \hat{y}_p + q_p]$. The continuous Kelly criterion maximizes the expected logarithmic growth rate:
$$\max_w\ \mathbb{E}[\log(1 + w^\top R)], \tag{18}$$
where $R$ is the vector of returns. Using a quadratic approximation and the conformal intervals, we approximate
$$\mathbb{E}[\log(1 + w^\top R)] \approx w^\top \hat{y} - \frac{1}{2} w^\top \hat{\Sigma} w, \tag{19}$$
where $\hat{\Sigma}$ is estimated from the conformal intervals (e.g., as a diagonal matrix with entries $q_p^2$). The optimization problem becomes
$$\max_w\ w^\top \hat{y} - \frac{\gamma}{2} w^\top \hat{\Sigma} w \quad \text{s.t.} \quad \mathbf{1}^\top w = 1,\ w \ge 0, \tag{20}$$
with $\gamma$ a risk aversion parameter. This convex quadratic program can be solved efficiently using differentiable convex optimization layers (Agrawal et al., 2019), enabling end-to-end training.

A.8 Total Loss and Training

The overall loss function combines all components:
$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{forecast}} + \lambda_1 \mathcal{L}_{\mathrm{PDE}} + \lambda_2 \mathcal{L}_{\mathrm{MPR}} + \lambda_3 \mathcal{L}_{\mathrm{consist}} + \lambda_4 \mathcal{L}_{\mathrm{symb}}. \tag{21}$$
The hyperparameters $\lambda_1, \ldots, \lambda_4$ balance the different objectives. Training proceeds by stochastic gradient descent. Gradients through the SDE solver are computed using the reparameterization trick: the Euler-Maruyama steps are deterministic functions of the initial state and the noise variables $\{\epsilon_j\}$, which are sampled independently of the parameters.
Thus, we can backpropagate through the unrolled simulation using automatic differentiation.
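The conformal procedures of A.7.1-A.7.2 reduce to a few lines; this sketch assumes scalar targets and precomputed calibration residuals:

```python
import numpy as np

def split_conformal_interval(residuals, y_hat, alpha=0.1):
    """Split conformal interval (eq. 15) from calibration residuals r_i."""
    n = len(residuals)
    level = min((1 - alpha) * (1 + 1 / n), 1.0)   # finite-sample correction
    q = np.quantile(np.asarray(residuals), level)
    return y_hat - q, y_hat + q

def adaptive_interval(recent_residuals, y_hat, alpha=0.1, window=250):
    """Rolling-window variant (eq. 17): quantile of the last W residuals."""
    w = np.asarray(recent_residuals)[-window:]
    q = np.quantile(w, 1 - alpha)
    return y_hat - q, y_hat + q

rng = np.random.default_rng(0)
res = np.abs(rng.normal(size=1000))               # stand-in calibration residuals
lo, hi = split_conformal_interval(res, y_hat=0.0, alpha=0.1)
```

For Gaussian residuals the 90% interval half-width should land near the 0.9-quantile of the half-normal distribution, roughly 1.65 standard deviations.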