A Controlled Comparison of Deep Learning Architectures for Multi-Horizon Financial Forecasting: Evidence from 918 Experiments
Authors: Nabeel Ahmad Saidd
Dr. A.P.J. Abdul Kalam Technical University (AKTU)
nabeelahmadsaidd@gmail.com

Abstract. Multi-horizon price forecasting is central to portfolio allocation, risk management, and algorithmic trading, yet deep learning architectures have proliferated faster than rigorous comparisons on financial data can assess them. Existing benchmarks are weakened by uncontrolled hyperparameter budgets, single-seed evaluation, narrow asset-class coverage, and absent pairwise statistical correction. This study compares nine deep learning architectures (Autoformer, DLinear, iTransformer, LSTM, ModernTCN, N-HiTS, PatchTST, TimesNet, and TimeXer) from four families (Transformer, MLP, CNN, RNN) across three asset classes (cryptocurrency, forex, equity indices) and two forecasting horizons (h ∈ {4, 24} hours). All 918 runs follow a five-stage protocol: fixed-seed Bayesian hyperparameter optimisation, configuration freezing per asset class, multi-seed final training, metric aggregation with uncertainty quantification, and statistical validation. ModernTCN achieves the best mean rank (1.333) with a 75% first-place rate across 24 evaluation points; PatchTST ranks second (2.000). The global rank leaderboard reveals a clear three-tier structure separating the top pair from a middle cluster and a bottom tier. Architecture explains 99.90% of raw RMSE variance versus 0.01% for seed randomness. Rankings remain stable across horizons despite 2–2.5× error amplification. Directional accuracy is indistinguishable from 50% across all 54 model–category–horizon combinations, indicating that MSE-trained architectures lack directional skill at hourly resolution.
Four practical implications follow: (i) large-kernel temporal convolutions and patch-based Transformers consistently outperform alternatives; (ii) the complexity–performance relationship is non-monotonic, with architectural inductive bias mattering more than raw capacity; (iii) three-seed replication suffices given negligible seed variance; and (iv) directional forecasting requires explicit loss-function redesign. The full codebase, data, trained models, and evaluation outputs are released for independent replication.

Keywords: financial time-series forecasting; deep learning; model benchmarking; multi-horizon forecasting; hyperparameter optimisation; reproducibility; seed robustness; directional accuracy

JEL Classification: C45, C52, C53, G17

1 Introduction

1.1 Motivation

Multi-horizon price forecasting is central to modern finance: portfolio allocation relies on expected return estimates at multiple horizons, risk management depends on volatility projections, and algorithmic trading requires predictive signals whose quality degrades with the forecast window. Unlike physical systems governed by conservation laws, financial time series exhibit non-stationarity, heavy-tailed returns, volatility clustering, leverage effects, and abrupt regime transitions (Cont, 2001; Fama, 1970), properties that make accurate multi-step prediction exceptionally difficult.

Deep learning architectures for temporal sequence modelling have proliferated rapidly over the past five years. Transformer-based models now span a wide range of inductive biases: auto-correlation in the frequency domain (Wu et al., 2021), patch-based tokenisation with channel independence (Nie et al., 2023), inverted variate-wise attention (Liu et al., 2024), and exogenous-variable-aware cross-attention with learnable global tokens (Wang et al., 2024).
At the same time, decomposition-based linear mappings have been shown to match or surpass Transformer performance on standard benchmarks (Zeng et al., 2023), while modern temporal convolutional architectures exploit large-kernel depthwise convolutions (Luo and Wang, 2024) and FFT-based 2D reshaping (Wu et al., 2023) to capture multi-scale temporal structure. Hierarchical MLP designs with multi-rate pooling (Challu et al., 2023) offer yet another approach to direct multi-step forecasting.

This architectural diversity raises a practical question: which architecture should a practitioner deploy for a given financial forecasting task, at which horizon, and for which asset class? A reliable answer requires controlled experimentation that isolates architectural merit from confounding factors, a requirement that existing benchmarks have not met, as Section 2.4 demonstrates.

1.2 Research Gap

Despite the growing body of forecasting studies, five persistent methodological shortcomings undermine published model comparisons:

G1. Uncontrolled hyperparameter budgets. Many benchmarks allocate different tuning effort to different models, or skip hyperparameter optimisation entirely, confounding tuning luck with architectural merit. Prior work has introduced architectures with custom-tuned configurations while evaluating competitors under default or unspecified settings (Wu et al., 2023; Zeng et al., 2023), preventing fair attribution of performance differences.

G2. Single-seed evaluation. The vast majority of comparative studies report results from a single random initialisation. Seed-induced variance has been shown to exceed algorithmic differences in standard machine learning benchmarks (Bouthillier et al., 2021), with analogous effects documented in reinforcement learning (Henderson et al., 2018).
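The two-factor sum-of-squares decomposition that this study later applies (Section 4.4) makes the architecture-versus-seed comparison concrete. The following is a toy sketch of its mechanics on hypothetical RMSE values; the numbers and model subset are illustrative only, not the paper's pipeline or results:

```python
# Toy two-factor (model x seed) sum-of-squares variance decomposition.
# All RMSE values below are hypothetical, chosen only to illustrate the
# mechanics of the test; they are not results from the paper.
rmse = {
    "ModernTCN": {123: 0.101, 456: 0.102, 789: 0.100},
    "LSTM":      {123: 0.240, 456: 0.238, 789: 0.241},
    "DLinear":   {123: 0.150, 456: 0.151, 789: 0.149},
}
models, seeds = list(rmse), [123, 456, 789]
cells = [rmse[m][s] for m in models for s in seeds]
grand = sum(cells) / len(cells)

# Total dispersion, and the between-model / between-seed components.
ss_total = sum((v - grand) ** 2 for v in cells)
ss_model = len(seeds) * sum(
    (sum(rmse[m].values()) / len(seeds) - grand) ** 2 for m in models
)
ss_seed = len(models) * sum(
    (sum(rmse[m][s] for m in models) / len(models) - grand) ** 2 for s in seeds
)

share_model = ss_model / ss_total  # analogous quantity to the paper's 99.90%
share_seed = ss_seed / ss_total    # analogous quantity to the paper's 0.01%
print(f"model share: {share_model:.2%}  seed share: {share_seed:.4%}")
```

In a balanced model × seed table, the factor shares compare row-mean and column-mean dispersion against total dispersion; any remainder is attributed to interaction and noise.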
Without multi-seed replication, it is impossible to distinguish genuine architectural advantage from stochastic variation in weight initialisation.

G3. Single-horizon analysis. Most comparisons evaluate a single forecasting horizon, precluding investigation of how architectural inductive biases interact with prediction difficulty as the forecast window extends. Prior work has benchmarked recurrent networks at fixed horizons without characterising degradation behaviour, leaving cross-horizon generalisation unaddressed (Hewamalage et al., 2021).

G4. Absent pairwise statistical correction. Even studies that report omnibus significance tests (e.g., Friedman) rarely apply post-hoc pairwise corrections with family-wise error control, leaving the significance of individual ranking differences unquantified. The M4 competition (Makridakis et al., 2018) provided aggregate rankings but did not report pairwise tests among deep learning participants.

G5. Narrow asset-class coverage. Benchmarks focusing on a single asset class, whether cryptocurrency (Sezer et al., 2020), equities, or standard forecasting datasets (ETTh, Weather, Electricity), cannot assess whether ranking conclusions generalise across the structurally distinct dynamics of different financial markets. Cross-class evaluation is necessary for reliable deployment guidance.

Collectively, these gaps prevent reliable inference about the relationship between architectural inductive biases and financial time-series structure, and deprive practitioners of evidence-based model selection guidance, despite the finding (Section 4.4) that architecture choice, not random initialisation, explains over 99% of total forecast variance.

1.3 Hypotheses and Objectives

Four testable hypotheses structure the empirical investigation. Each states a clear null and a designated statistical test:

H1. Ranking Non-Uniformity.
The global performance ranking of the nine architectures is significantly non-uniform across evaluation points. H0: all models have equal expected rank. Test: rank-based leaderboard analysis across 24 evaluation points (12 assets × 2 horizons); win-rate counts; mean rank gaps between tiers. Evidence: Sections 4.1 and 4.6.

H2. Cross-Horizon Ranking Stability. Top-ranked architectures at h = 4 maintain their relative superiority at h = 24, despite absolute error amplification. H0: rankings at h = 4 and h = 24 are independent (ρ_S = 0). Test: Spearman rank correlation between h = 4 and h = 24 model rankings per asset; percentage degradation analysis. Evidence: Section 4.3.

H3. Variance Dominance. Architecture choice explains a significantly larger proportion of total forecast variance than random seed initialisation. H0: model and seed factors contribute equally to variance. Test: two-factor sum-of-squares variance decomposition; comparison of variance proportions across panels (raw, z-normalised all models, z-normalised modern only). Evidence: Section 4.4.

H4. Non-Monotonic Complexity–Performance. The relationship between trainable parameter count and forecasting error is non-monotonic: architectural inductive bias matters more than raw model capacity. H0: monotonic negative correlation (ρ_S = −1) between parameter count and RMSE rank. Test: Spearman rank correlation between parameter count and mean RMSE rank; visual inspection of the complexity–performance scatter. Evidence: Section 4.7.

The primary objective is to provide a controlled, statistically validated comparison of nine deep learning architectures across three asset classes at two forecasting horizons under identical experimental conditions, yielding evidence-based deployment guidance.

1.4 Contributions

Five specific contributions advance the state of knowledge:

C1. Protocol-controlled fair comparison (addresses G1). Nine architectures spanning four families (Transformer, MLP, CNN, RNN) are evaluated under a single five-stage protocol: fixed-seed Bayesian HPO (5 Optuna TPE trials, seed 42), configuration freezing per asset class, identical chronological 70/15/15 splits, a common OHLCV feature set, and rank-based performance evaluation across 12 instruments, three asset classes, and two horizons, totalling 918 runs (648 final training runs + 270 HPO trials).

C2. Multi-seed robustness quantification (addresses G2). Final training is replicated across three independent seeds (123, 456, 789). A two-factor variance decomposition shows that architecture choice explains 99.90% of total forecast variance while seed variation accounts for 0.01%, establishing model selection as the dominant lever for accuracy improvement and confirming that three-seed replication is sufficient.

C3. Cross-horizon generalisation analysis (addresses G3). Identical models are evaluated at h = 4 and h = 24 with a matched protocol, characterising architecture-specific degradation (Table 10) and identifying which inductive biases scale with prediction difficulty.

C4. Asset-class-specific deployment guidance (addresses G5). Category-level analysis across cryptocurrency, forex, and equity indices (Table 9) shows that top-tier rankings (ModernTCN, PatchTST) hold across all three classes, while mid-tier orderings (DLinear, N-HiTS, TimeXer) are category-dependent. A per-asset best-model matrix (Table 8) further reveals niche advantages: N-HiTS achieves the lowest error on lower-capitalisation cryptocurrency assets, providing actionable asset-level deployment guidance.

C5. Open, deterministic benchmarking framework (addresses G1–G5).
The complete pipeline (source code, configuration files, raw market data, processed datasets, all trained models, and evaluation outputs) is released under an open licence. Accompanying documentation enables any researcher to reproduce every experiment via a unified command-line interface.

1.5 Paper Organisation

The remainder of this paper is organised as follows. Section 2 surveys related work and positions this study relative to existing benchmarks. Section 3 presents the unified experimental protocol. Section 4 reports the empirical findings structured by the four hypotheses. Section 5 interprets the results, provides economic context, and discusses limitations. Section 6 summarises contributions and offers deployment recommendations. Section 7 provides a reproducibility statement. Supplementary results are collected in the Appendix.

2 Related Work

This section positions the present study within the broader landscape of financial time-series forecasting. It traces the evolution from classical econometric models through four families of deep learning architectures, reviews multi-step forecasting strategies, and identifies the methodological gaps this benchmark addresses.

2.1 Classical and Statistical Approaches

The ARIMA family (Box and Jenkins, 1970) remains a cornerstone of time-series analysis, capturing linear temporal dependencies through differencing and lagged-error terms. Exponential smoothing methods (Hyndman and Khandakar, 2008) offer computationally efficient trend–seasonality decomposition, while the ARCH (Engle, 1982) and GARCH (Bollerslev, 1986) frameworks provide the standard toolkit for modelling conditional heteroscedasticity in financial returns. These methods share three fundamental limitations.
First, they assume linearity in the conditional mean or variance, yet financial returns exhibit nonlinear phenomena (leverage effects, long memory, and regime transitions; Cont, 2001) that violate these assumptions. Second, classical models are inherently univariate (or require explicit cross-variable specification), limiting their ability to exploit joint OHLCV information. Third, multi-step forecasting under these frameworks typically proceeds recursively, compounding prediction errors at longer horizons (Ben Taieb et al., 2012). These limitations motivate the use of deep learning architectures that learn nonlinear, multivariate mappings directly from data.

2.2 Deep Learning Architectures for Time-Series

Deep learning approaches to time-series forecasting have evolved along four architectural families, each encoding distinct assumptions about temporal structure. The specific models included in this benchmark are reviewed below, with emphasis on their temporal inductive biases.

Recurrent architectures. Long Short-Term Memory networks (Gers et al., 2000; Hochreiter and Schmidhuber, 1997) introduced gated recurrence to address vanishing gradients, maintaining a cell state that selectively retains or discards information. While effective for short-range dependencies, sequential processing limits parallelisation and hinders learning of very long-range patterns. Surveys confirm that LSTMs served as the default neural forecasting baseline during 2017–2021 (Hewamalage et al., 2021; Lim and Zohren, 2021). In this benchmark, LSTM represents the recurrent family, providing a classical reference point. Its inductive bias is autoregressive: the hidden state compresses all past information into a fixed-dimensional vector, relying on recurrence to propagate long-range context.

Transformer-based architectures. The self-attention mechanism (Vaswani et al., 2017) enables direct modelling of pairwise dependencies across arbitrary temporal lags, overcoming the sequential bottleneck of recurrence. Four Transformer variants are evaluated:

• Autoformer (Wu et al., 2021) replaces canonical attention with an auto-correlation mechanism operating in the frequency domain at O(L log L) complexity, coupled with progressive trend–seasonal decomposition. Its inductive bias assumes that dominant temporal patterns manifest as periodic auto-correlations detectable via spectral analysis.

• PatchTST (Nie et al., 2023) segments input series into patches, treats each as a token, and applies a Transformer encoder with channel-independent processing and RevIN normalisation (Kim et al., 2022) for distribution-shift mitigation. Its inductive bias prioritises local temporal coherence within patches while using attention for global inter-patch dependencies.

• iTransformer (Liu et al., 2024) inverts the attention paradigm: each variate's full temporal trajectory serves as a token, and attention operates across the variate dimension, directly capturing cross-variable interactions. Its inductive bias assumes that inter-variate relationships are the primary source of predictive information.

• TimeXer (Wang et al., 2024) separates target and exogenous variables, embedding the patched target alongside learnable global tokens and applying cross-attention to query inverted exogenous representations. Its inductive bias explicitly separates autoregressive dynamics from exogenous covariate influence.

MLP and linear architectures. The necessity of attention for time-series forecasting has been challenged (Zeng et al., 2023): DLinear, a decomposition-based model with dual independent linear layers mapping seasonal and trend components, without hidden layers, activations, or attention, can match or exceed Transformer performance on standard benchmarks. Its inductive bias assumes that temporal patterns are adequately captured by linear projections of decomposed components. N-HiTS (Challu et al., 2023) employs a hierarchical stack of MLP blocks with multi-rate pooling: each block operates at a different temporal resolution, produces coefficients via a multi-layer perceptron, and interpolates them to the forecast horizon through basis-function expansion. Its inductive bias prioritises multi-scale temporal structure through hierarchical signal decomposition. Together, these results raise the question of whether attention contributes meaningfully to forecasting accuracy, a question this benchmark addresses.

Convolutional architectures. Temporal convolutional networks (TCNs) apply causal, dilated convolutions to capture long-range dependencies through hierarchical receptive fields (Bai et al., 2018). The two convolutional models in this benchmark, TimesNet and ModernTCN, each depart from the standard TCN template in distinct ways. TimesNet (Wu et al., 2023) transforms the forecasting problem from 1D to 2D by identifying dominant FFT-based periods, reshaping the sequence into 2D tensors indexed by period length and intra-period position, and applying Inception-style 2D convolutions. Its inductive bias assumes that temporal dynamics decompose into inter-period and intra-period variations best captured through spatial convolutions. ModernTCN (Luo and Wang, 2024) employs large-kernel depthwise convolutions with structural reparameterisation (dual branches merged at inference), multi-stage downsampling, and optional RevIN normalisation.
Its inductive bias holds that local temporal patterns at multiple scales, captured through large receptive fields with efficient depthwise operations, suffice for accurate forecasting without frequency-domain or attention mechanisms.

A key question is whether these architectural differences yield consistent and statistically significant performance differences on financial data, or whether the experimental protocol dominates observed rankings. The present study disentangles these effects through the controlled protocol described in Section 3.

2.3 Multi-Step Forecasting Strategies

Multi-step-ahead prediction admits three principal strategies (Ben Taieb et al., 2012; Chevillon, 2007). The recursive (iterated) strategy applies a one-step model iteratively, feeding predictions back as inputs; this approach is straightforward but accumulates errors geometrically with the horizon length. The direct strategy trains independent output heads for each future step, avoiding error accumulation at the cost of ignoring inter-step temporal coherence. The multi-input multi-output (MIMO) strategy produces all horizon steps in a single forward pass, preserving inter-step dependencies without iterative error propagation.

This benchmark adopts direct multi-step forecasting: each model outputs all h forecast steps simultaneously (h ∈ {4, 24}) in a single forward pass, matching the native output design of all nine architectures. No model feeds predictions back as inputs. Two separate experiments per horizon use distinct lookback windows (w = 24 for h = 4; w = 96 for h = 24), enabling isolation of horizon-dependent degradation from architecture-dependent effects.
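The interface difference between the recursive and direct strategies can be sketched as follows. The toy drift "models" here are purely illustrative and are not part of the benchmark; the point is only where predictions flow:

```python
# Sketch of the recursive vs. direct multi-step interfaces described above.
# The "models" are toy linear-trend extrapolators, purely illustrative.
def one_step(window):
    # toy one-step forecaster: linear drift from the last two observations
    return 2 * window[-1] - window[-2]

def recursive_forecast(window, h):
    # recursive strategy: each prediction is fed back as an input,
    # so one-step errors compound over the horizon
    window = list(window)
    preds = []
    for _ in range(h):
        y_hat = one_step(window)
        preds.append(y_hat)
        window = window[1:] + [y_hat]  # prediction re-enters the input
    return preds

def direct_forecast(window, h):
    # direct strategy (used in this benchmark): the full h-step vector is
    # produced in one pass; no prediction ever re-enters the input
    drift = window[-1] - window[-2]
    return [window[-1] + k * drift for k in range(1, h + 1)]

closes = [1.0, 1.1, 1.2, 1.3]
print(recursive_forecast(closes, 4))
print(direct_forecast(closes, 4))
```

On a perfectly linear toy series the two strategies coincide; on noisy data the recursive variant compounds its one-step errors, which is the confound the direct design avoids.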
Three considerations motivate this choice: (i) it avoids the error-accumulation confound of recursive strategies; (ii) it matches every architecture's native mode, preventing protocol mismatch; and (iii) it enables clean cross-horizon comparison (H2) by ensuring that h = 4 and h = 24 results differ only in task difficulty.

2.4 Benchmarking Practices and Identified Gaps

Several prior studies have compared deep learning architectures for time-series forecasting, but persistent methodological limitations constrain the conclusions that can be drawn. The most relevant benchmarks are reviewed below, mapped to the five gaps from Section 1.2.

Large-scale forecasting competitions have advanced standardised evaluation methodologies. The M4 competition (Makridakis et al., 2018) introduced a common evaluation protocol across 100,000 series but focused on macroeconomic and demographic data, did not include modern deep learning architectures (released after 2020), and did not apply pairwise statistical corrections (G1, G4, G5). The M5 competition (Makridakis et al., 2022) extended to retail sales forecasting with hierarchical structure but again excluded recent Transformer, TCN, and MLP architectures (G5).

Within the deep learning literature, a comprehensive benchmark of recurrent networks (Hewamalage et al., 2021) excluded Transformer- and CNN-based alternatives and evaluated a single horizon (G3, G5). The competitiveness of linear models against Transformers was demonstrated (Zeng et al., 2023), but with default hyperparameters for competitors and a single seed (G1, G2). TimesNet was benchmarked against multiple baselines (Wu et al., 2023) but with uncontrolled HPO budgets and single-seed evaluation (G1, G2). PatchTST was introduced with strong benchmark results (Nie et al., 2023) but on non-financial datasets and without multi-seed replication (G2, G5).
iTransformer was evaluated across standard time-series benchmarks (Liu et al., 2024) but without pairwise statistical tests or multi-horizon degradation analysis (G3, G4). Within the financial forecasting literature specifically, a comprehensive survey (Sezer et al., 2020) noted the absence of controlled experimental comparisons without providing one.

Table 1 summarises these prior studies and their gap coverage. The central finding is that no prior study simultaneously addresses G1–G5: every existing benchmark leaves at least two gaps uncontrolled. The present study fills this compound gap by providing the first controlled, multi-seed, multi-horizon, multi-asset-class comparison with full statistical validation for financial time-series forecasting.

Table 1. Summary of prior comparative studies in time-series forecasting. Columns indicate the number of models evaluated, number of datasets or asset classes, horizons tested, whether multi-seed evaluation was performed, and whether post-hoc pairwise statistical tests were applied.

Study                            Models    Datasets        Horizons  Multi-Seed  Pairwise Tests  Open Code
(Makridakis et al., 2018) (M4)   Many      100K series     Multiple  No          No              Partial
(Hewamalage et al., 2021)        RNN only  6               Multiple  No          No              Partial
(Zeng et al., 2023)              6         9               4         No          No              Yes
(Wu et al., 2023)                8         8               4         No          No              Yes
(Nie et al., 2023)               7         8               4         No          No              Yes
(Liu et al., 2024)               8         7               4         No          No              Yes
Present study                    9         12 (3 classes)  2         Yes (3)     Yes             Yes

3 Experimental Design

This section presents the complete experimental protocol as a unified, replicable specification. Every design choice is stated with its rationale. A reader equipped with the accompanying repository can reproduce every reported number by following this section sequentially.
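Before the formal definitions, a minimal sketch of the rolling-window construction (Section 3.2.3) and strictly chronological split (Section 3.2.4). It is simplified relative to the released pipeline (normalisation omitted, univariate Close series only) and the helper names are illustrative:

```python
# Minimal sketch of the rolling-window construction and strictly
# chronological 70/15/15 split (simplified; not the released pipeline).
def make_windows(series, w, h):
    # each sample: a lookback window of length w and an h-step target,
    # built from rolling windows of total length w + h
    return [
        (series[t : t + w], series[t + w : t + w + h])
        for t in range(len(series) - w - h + 1)
    ]

def chrono_split(samples, train_frac=0.70, val_frac=0.15):
    # no shuffling at any stage: first 70% train, next 15% val, rest test
    n_train = int(len(samples) * train_frac)
    n_val = int(len(samples) * val_frac)
    return (
        samples[:n_train],
        samples[n_train : n_train + n_val],
        samples[n_train + n_val :],
    )

closes = [float(i) for i in range(200)]   # stand-in for hourly Close prices
samples = make_windows(closes, w=24, h=4)  # short-term configuration
train, val, test = chrono_split(samples)
print(len(samples), len(train), len(val), len(test))
```

Because the split is positional rather than random, every validation and test window starts strictly after every training window starts, which is what prevents future-data leakage.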
3.1 Formal Problem Definition

Let X_t ∈ R^(w×d) denote a multivariate input window of w consecutive hourly observations with d = 5 OHLCV features (Open, High, Low, Close, Volume), ending at time index t. The forecasting task is to learn a parametric mapping

    f_θ : R^(w×d) → R^h,    ŷ_(t+1:t+h) = f_θ(X_t),    h ∈ {4, 24},    (1)

where ŷ_(t+1:t+h) is the predicted vector of future Close prices and θ collects all learnable parameters. Two horizon configurations are evaluated as completely separate experiments, with no shared weights or intermediate results:

• Short-term (h = 4): lookback window w = 24 hours, predicting 4 hours ahead.
• Long-term (h = 24): lookback window w = 96 hours, predicting 24 hours ahead.

All models employ direct multi-step forecasting: the entire horizon vector ŷ ∈ R^h is produced in a single forward pass. No model feeds predictions back as inputs, avoiding the error accumulation of recursive strategies (Ben Taieb et al., 2012; Chevillon, 2007). The training objective is mean squared error:

    L(θ) = (1/n) Σ_(i=1)^n ‖y_i − ŷ_i‖²,    (2)

and the model selection criterion is minimum validation MSE. This common loss function and selection criterion apply identically to all architectures, eliminating confounds from differential training objectives.

3.2 Data

3.2.1 Asset Universe

The benchmark spans 12 financial instruments across three asset classes, each with structurally distinct market microstructure:

• Cryptocurrency (4 assets): BTC/USDT, ETH/USDT, BNB/USDT, ADA/USDT. These instruments trade continuously (24/7), exhibit high volatility, and are subject to rapid regime changes driven by speculative activity and regulatory events.

• Forex (4 assets): EUR/USD, USD/JPY, GBP/USD, AUD/USD. Major currency pairs are characterised by high liquidity, short-term mean-reverting tendencies, and sensitivity to macroeconomic announcements and central-bank policy.

• Equity indices (4 assets): Dow Jones, S&P 500, NASDAQ 100, DAX. These indices track broad equity markets, exhibiting trending behaviour, lower intra-day volatility relative to cryptocurrency, and session-based trading hours.

All instruments are sampled at H1 (1-hour) frequency, providing uniform temporal resolution across asset classes. For hyperparameter optimisation, one representative asset per class is designated: BTC/USDT (cryptocurrency), EUR/USD (forex), and Dow Jones (equity indices). Optimised configurations are frozen and applied to all assets within the corresponding class, preventing asset-level overfitting while preserving category-level calibration (Section 3.4.2).

3.2.2 Feature Specification

All models receive identical input tensors comprising five raw market features: Open, High, Low, Close, and Volume (OHLCV). This deliberate restriction to unprocessed market data isolates the contribution of architectural design from feature-engineering confounds. No technical indicators, lagged returns, calendar variables, or external covariates are introduced. The sole forecast target is the Close price at each horizon step, ensuring that performance differences reflect architecture rather than feature availability.

3.2.3 Preprocessing

The preprocessing pipeline transforms raw market data into normalised, windowed tensors through four stages:

1. Loading. Raw hourly OHLCV records are ingested from CSV files containing datetime, open, high, low, close, and volume columns.

2. Truncation. To ensure comparable dataset sizes across instruments with different histories, the most recent 30,000 time steps are retained for every asset prior to windowing.

3. Normalisation. Standard z-score normalisation (zero mean, unit variance) is fitted exclusively on the training partition and applied unchanged to validation and test partitions. This prevents leakage of future distributional information into the scaling statistics. After inference, predictions are inverse-scaled to the original price domain before metric computation.

4. Windowing. Rolling windows of length w + h are constructed. Two configurations are employed: (w, h) = (24, 4) for short-term and (w, h) = (96, 24) for long-term forecasting. Each window yields an input matrix X ∈ R^(w×5) and a target vector y ∈ R^h (Close prices only).

3.2.4 Chronological Splits

All partitions are strictly chronological to prevent future-data leakage:

• Training: first 70% of samples (approximately 21,000 windows per asset per horizon).
• Validation: next 15% of samples (approximately 4,500 windows).
• Test: final 15% of samples (approximately 4,500 windows).

No shuffling is performed at any stage, preserving the temporal ordering essential for financial time-series. Identical splits are applied to all models, ensuring that every architecture receives exactly the same training, validation, and test observations for each (asset, horizon) pair. Split boundaries and sample counts are recorded alongside the processed datasets (Table 2). Return distributional statistics (mean return, volatility, skewness, excess kurtosis, first-order autocorrelation, and ADF unit-root test p-value) for all twelve assets are reported in Table 3; all return series are stationary (p < 0.001) with heavy tails (excess kurtosis 15–96) and near-zero mean returns consistent with weak-form market efficiency.

Table 2. Dataset summary. All assets use H1 (hourly) frequency.
Table 2. Dataset summary. All assets use H1 (hourly) frequency. The most recent 30,000 windowed samples are retained per (asset, horizon) pair, split chronologically into 70%/15%/15% train/val/test partitions. Window lengths: w = 24 for h = 4 and w = 96 for h = 24. Features: OHLCV (5 channels); target: Close price.

Category  Asset        Date Range         Train    Val     Test    Total
Crypto    BTC/USDT     2021-03 – 2026-02  21,000   4,500   4,500   30,000
          ETH/USDT     2021-03 – 2026-02  21,000   4,500   4,500   30,000
          BNB/USDT     2021-03 – 2026-02  21,000   4,500   4,500   30,000
          ADA/USDT     2021-05 – 2026-02  21,000   4,500   4,500   30,000
Forex     EUR/USD      2017-12 – 2026-02  21,000   4,500   4,500   30,000
          USD/JPY      2017-12 – 2026-02  21,000   4,500   4,500   30,000
          GBP/USD      2017-12 – 2026-02  21,000   4,500   4,500   30,000
          AUD/USD      2017-12 – 2026-02  21,000   4,500   4,500   30,000
Indices   Dow Jones    2019-01 – 2026-02  21,000   4,500   4,500   30,000
          S&P 500      2019-01 – 2026-02  21,000   4,500   4,500   30,000
          NASDAQ 100   2019-01 – 2026-02  21,000   4,500   4,500   30,000
          DAX          2019-02 – 2026-02  21,000   4,500   4,500   30,000
Total per horizon                         252,000  54,000  54,000  360,000

Table 3. Distributional statistics of hourly log returns for all twelve assets. All series reject the Augmented Dickey–Fuller unit-root null at p < 0.001, confirming return stationarity.

Cat.     Asset        μ̄_r (×10⁻⁴)  σ_r (%)  Skew    Kurt    ACF(1)   ADF p
Crypto   BTC/USDT     +0.4          0.783    −0.47   +35.8   −0.043   <0.001
         ETH/USDT     +0.3          0.973    −0.56   +27.3   −0.026   <0.001
         BNB/USDT     +0.8          1.107    +0.08   +35.0   −0.077   <0.001
         ADA/USDT      0.0          1.187    −0.05   +19.2   −0.085   <0.001
Forex    EUR/USD       0.0          0.109    −0.01   +15.2   −0.010   <0.001
         USD/JPY      +0.1          0.119    −0.98   +46.6   −0.002   <0.001
         GBP/USD       0.0          0.115    −1.66   +95.8   −0.018   <0.001
         AUD/USD       0.0          0.141    −0.07   +17.9   −0.020   <0.001
Indices  Dow Jones    +0.2          0.225    −0.57   +60.8   −0.012   <0.001
         S&P 500      +0.2          0.235    −1.13   +90.2   −0.026   <0.001
         NASDAQ 100   +0.3          0.286    −0.70   +47.7   −0.014   <0.001
         DAX          +0.2          0.262    −1.74   +83.0   −0.011   <0.001

All return distributions exhibit excess kurtosis (heavy tails) consistent with the stylised facts of financial markets; equities and forex show negative skewness (downside asymmetry), while crypto skewness is near zero. Negative ACF(1) values indicate mild mean-reversion in hourly returns. Volatility varies by roughly an order of magnitude across asset classes (crypto ≈ 1%, forex ≈ 0.1%, indices ≈ 0.2%), reflecting differences in leverage, liquidity, and trading hours.

Figure 1. Representative hourly Close-price time series for one asset per class: BTC/USDT (cryptocurrency), EUR/USD (forex), and Dow Jones (equity indices). Vertical dashed lines indicate chronological train/validation/test boundaries (70/15/15 split). The three classes exhibit qualitatively different dynamics: high-volatility trending behaviour (cryptocurrency), low-volatility mean-reversion around a narrow range (forex), and moderate-volatility upward drift (equity indices). All series comprise the most recent 30,000 hourly observations.

3.3 Model Architectures

Nine architectures spanning four families are evaluated. All models conform to a unified interface: input shape (B, w, d) with d = 5 OHLCV features; output shape (B, h), where B is the batch size. No model-specific feature engineering or data augmentation is permitted. Table 4 summarises the key architectural properties.

Transformer family (4 models). Autoformer (Wu et al., 2021) replaces self-attention with an auto-correlation mechanism operating in the frequency domain at O(L log L) complexity, incorporating progressive series decomposition to separate trend and seasonal components. PatchTST (Nie et al., 2023) segments input series into patches, treats each as a token, and applies a Transformer encoder with channel-independent processing and RevIN normalisation (Kim et al., 2022). iTransformer (Liu et al., 2024) inverts the attention paradigm: each variate's full temporal trajectory serves as a token, and attention is computed across the variate dimension. TimeXer (Wang et al., 2024) separates target and exogenous variables, embedding patched target representations alongside learnable global tokens and applying cross-attention to query inverted exogenous embeddings.

MLP/linear family (2 models). DLinear (Zeng et al., 2023) applies moving-average decomposition to separate seasonal and trend components, then maps each through an independent linear layer, with no hidden layers, activations, or attention. It has the fewest parameters of any model in the benchmark (approximately 1,000 at h = 4). N-HiTS (Challu et al., 2023) employs a hierarchical stack of MLP blocks with multi-rate pooling. Each block operates at a different temporal resolution and interpolates coefficients to the forecast horizon through basis-function expansion.

Convolutional family (2 models). TimesNet (Wu et al., 2023) transforms forecasting from 1D to 2D by identifying dominant FFT-based periods, reshaping the sequence into 2D tensors, and applying Inception-style 2D convolutions to capture intra-period and inter-period patterns. ModernTCN (Luo and Wang, 2024) employs large-kernel depthwise convolutions with structural reparameterisation, multi-stage downsampling, and optional RevIN normalisation for multi-scale temporal pattern extraction.
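Of the nine designs, DLinear is compact enough to sketch end-to-end. The following is an illustrative single-channel NumPy forward pass (in practice the two weight matrices are learned by gradient descent, and the per-channel variant applies one pair per input channel):

```python
import numpy as np

def dlinear_forward(x, W_trend, W_seasonal, kernel=21):
    """One-channel DLinear forward pass: a centred moving average splits the
    input into trend and seasonal parts, and each part gets its own linear
    map to the horizon. x: (w,); W_trend, W_seasonal: (h, w); kernel odd."""
    pad = kernel // 2
    xp = np.pad(x, (pad, pad), mode="edge")            # pad ends for the MA
    trend = np.convolve(xp, np.ones(kernel) / kernel, mode="valid")  # (w,)
    seasonal = x - trend
    return W_trend @ trend + W_seasonal @ seasonal     # (h,) direct forecast
```

Because the whole map is linear in x, the model costs roughly 2·w·h weights per channel, consistent with the ~1,000-parameter figure quoted above at h = 4.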
Recurrent family (1 model). LSTM (Hochreiter and Schmidhuber, 1997) serves as the classical baseline: a multi-layer stacked LSTM encoder extracts the final hidden state, which a two-layer MLP head projects to the forecast vector.

Table 4. Summary of the nine deep learning architectures evaluated. All models receive OHLCV input of shape (B, w, 5) and produce direct multi-step forecasts of shape (B, h). Family abbreviations: RNN = recurrent, MLP = multi-layer perceptron, TCN = temporal convolutional network, TF = Transformer.

Model         Family  Key Mechanism                                       Reference
Autoformer    TF      Auto-correlation & series decomposition             (Wu et al., 2021)
DLinear       MLP     Linear layers with trend–seasonal decomposition     (Zeng et al., 2023)
iTransformer  TF      Inverted attention on variate tokens                (Liu et al., 2024)
LSTM          RNN     Gated recurrent cells + dense MLP projection head   (Hochreiter and Schmidhuber, 1997)
ModernTCN     TCN     Large-kernel depthwise convolutions with patching   (Luo and Wang, 2024)
N-HiTS        MLP     Hierarchical interpolation with multi-rate pooling  (Challu et al., 2023)
PatchTST      TF      Channel-independent patch-based Transformer         (Nie et al., 2023)
TimesNet      TCN     2D temporal variation via FFT + Inception blocks    (Wu et al., 2023)
TimeXer       TF      Exogenous-variable-aware cross-attention            (Wang et al., 2024)

3.4 Five-Stage Experimental Pipeline

The experimental pipeline comprises five sequential stages, each designed to eliminate a specific source of confounding. Figure 2 provides a schematic overview.

3.4.1 Stage 1: Fixed-Seed Hyperparameter Optimisation

Hyperparameter optimisation is performed using the Optuna framework (Akiba et al., 2019) with the Tree-structured Parzen Estimator (TPE) sampler (Bergstra et al., 2011). To ensure fairness, the following settings are held constant across all architectures:

• Deterministic seed: 42 (identical random state for all HPO runs).
• Trial budget: 5 trials per (model, category, horizon) configuration.
• Sampler: TPE with median pruner (2 startup trials, 5 warm-up steps).
• Objective: minimise validation MSE.
• Training budget per trial: 50 epochs, batch size 256.

HPO is conducted exclusively on one representative asset per category (BTC/USDT, EUR/USD, Dow Jones), preventing overfitting to individual assets while capturing category-level dynamics. This design is motivated by two considerations: (i) tuning on all 12 assets would multiply computational cost by 12× without a mechanism for cross-asset generalisation, and (ii) freezing configurations at the category level ensures that the same inductive prior governs all assets within a class. Table 5 provides the full search-space specification.

The controlled 5-trial budget is a deliberate choice that prioritises comparative fairness over exhaustive peak-performance search. Three considerations justify this constraint. First, in financial time-series forecasting, where the signal-to-noise ratio is low and non-stationarity is pervasive, extensive HPO risks selecting configurations that overfit to transient market regimes and fail to generalise to the test set. To mitigate this risk, search ranges were drawn from commonly used configurations in the literature, and identical budgets were applied to all models to ensure a symmetrical evaluation framework. Second, model parameter counts were kept in a comparable range across architectures (with the exception of DLinear, which is intentionally lightweight by design). Maintaining comparable capacity reduces search-space imbalance and prevents capacity-driven advantages from confounding the architectural comparison.
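A dependency-free stand-in for this Stage 1 loop is sketched below. The study itself uses Optuna's TPE sampler with a median pruner; this illustration substitutes seeded random search over a simplified, PatchTST-like discrete space (values loosely follow Table 5 and are not the exact search space):

```python
import random

# Illustrative discrete search space (loosely after Table 5; the study's
# learning-rate range is log-uniform, simplified to three points here).
SEARCH_SPACE = {
    "lr": [5e-4, 1e-3, 5e-3],
    "batch_size": [64, 128],
    "patch_len": [16, 24],
    "stride": [4, 8],
}

def run_hpo(objective, n_trials=5, seed=42):
    """Fixed-seed, fixed-budget search mirroring the Stage 1 protocol:
    every architecture gets the same seed and the same 5-trial budget.
    (The study uses Optuna TPE; seeded random search stands in here.)"""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        cfg = {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}
        val_mse = objective(cfg)  # in the study: train 50 epochs, return val MSE
        if best is None or val_mse < best[0]:
            best = (val_mse, cfg)
    return best  # frozen in Stage 2 and reused for every asset in the class
```

The fixed seed makes the trial sequence, and therefore the selected configuration, identical on every rerun, which is what makes the cross-architecture comparison symmetric.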
Third, the empirical evidence corroborates this design: the variance decomposition and rank stability across horizons (Section 4) show that architectural identity is the dominant factor in performance variance, and the high consistency of model rankings suggests that an expanded HPO budget would yield diminishing returns, unlikely to alter the comparative conclusions. Consequently, this protocol prioritises cross-architectural fairness and statistical validity, ensuring that observed performance differences reflect structural merits rather than differential tuning intensity.

Table 5. Hyperparameter search spaces per model. All models share learning rate ∈ [5 × 10⁻⁴, 5 × 10⁻³] (log-uniform) and batch size ∈ {64, 128}. Only architecture-specific parameters are shown. HPO uses Optuna TPE with 5 trials per (model × horizon × asset class) on representative assets (BTC/USDT, EUR/USD, Dow Jones) only.

Model         Hyperparameter           Range / Choices
Autoformer    Model dimension          {64, 128}
              Attention heads          {4, 8}
              Enc./dec. layers         {1, 2} each
              Feedforward dimension    {64, 128}
              Dropout rate             [0.0, 0.2], step 0.05
DLinear       Moving-average kernel    [3, 51], step 2
              Per-channel mapping      {true, false}
iTransformer  Model dimension          {96, 112}
              Feedforward dimension    {128, 256}
              Encoder layers           [2, 4]
              Attention heads          {4, 8, 16}
LSTM          Hidden state size        {64, 128}
              Recurrent layers         [1, 3]
              Projection head width    {64, 128}
              Bidirectional            {true, false}
ModernTCN     Patch size               {8, 16}
              Channel dimensions       [32, 64, 96]
              RevIN normalisation      {true, false}
              Dropout / head dropout   [0.0, 0.2], step 0.05
N-HiTS        Number of blocks         [2, 3]
              Hidden layer width       {64, 96}
              Layers per block         [3, 6]
              Pooling kernel sizes     {[2, 4, 8], [4, 8, 12]}
PatchTST      Model dimension          {64}
              Encoder layers           [1, 3]
              Patch length             {16, 24}
              Stride                   {4, 8}
TimesNet      Model dimension          {24, 32}
              Feedforward dimension    {32, 64}
              Encoder layers           [2, 3]
TimeXer       Model dimension          {64, 96}
              Feedforward dimension    {128, 256}
              Encoder layers           [2, 3]
              Attention heads          {4, 8}

3.4.2 Stage 2: Configuration Freezing

The best-performing configuration for each (model, category, horizon) triple, determined by validation MSE in Stage 1, is recorded and frozen. These frozen configurations are applied unchanged to all assets within the corresponding category; no further tuning is permitted. This eliminates asset-level overfitting and ensures that each architecture is evaluated on a single, category-level configuration. Table 6 presents the selected configurations.

Table 6. Frozen best hyperparameters selected via Optuna TPE (5 trials, seed 42, objective: minimum validation MSE). Only key architecture-specific parameters are shown; all configurations also include learning rate and batch size. Shared entries across categories indicate that the representative-asset optimum was identical.

Model         Category      h     Key Parameters
Autoformer    Crypto/Forex  4/24  d_model = 128, heads = 4, enc = 1, dec = 2, d_ff = 128, drop = 0.20
              Indices       4/24  d_model = 128, heads = 4, enc = 1, dec = 2, d_ff = 128, drop = 0.10
DLinear       All           4     kernel = 21, individual = true
              Crypto/Forex  24    kernel = 43, individual = true
iTransformer  Crypto        24    d_model = 112, d_ff = 128, layers = 2, heads = 16, drop = 0.15
              Other         4     d_model = 96, d_ff = 128, layers = 4, heads = 16, drop = 0.00
LSTM          All           4     hidden = 128, layers = 1, mlp = 128, bidir = true, drop = 0.15
              All           24    hidden = 64, layers = 1, mlp = 128, bidir = true, drop = 0.10
ModernTCN     Crypto        24    patch = 16, dims = [32, 64, 96], RevIN, drop = 0.20, head-drop = 0.20
              Other         4/24  patch = 8, dims = [32, 64, 96], RevIN+affine, drop = 0.10, head-drop = 0.10
N-HiTS        Crypto        4     blocks = 2, hidden = 64, layers = 3, pool = [4, 8, 12], drop = 0.05
              Crypto        24    blocks = 3, hidden = 96, layers = 4, pool = [4, 8, 12], drop = 0.00
PatchTST      Crypto        4     d_model = 64, layers = 2, patch = 24, stride = 8, drop = 0.05
              Other         24    d_model = 64, layers = 1, patch = 24, stride = 4, drop = 0.15
TimesNet      All           4/24  d_model = 32, d_ff = 32, layers = 2, top-k = 3, drop = 0.00
TimeXer       Crypto        4     d_model = 64, d_ff = 256, layers = 2, heads = 4, drop = 0.05
              Other         4/24  d_model = 64, d_ff = 128, layers = 3, heads = 8, drop = 0.05

3.4.3 Stage 3: Multi-Seed Final Training

Final training is conducted for every (model, asset, horizon, seed) combination under the frozen configuration from Stage 2. The following training protocol is applied identically to all runs:

• Seeds: 123, 456, 789 (three independent initialisations per configuration).
• Maximum epochs: 100.
• Batch size: as determined by HPO (typically 64 or 128).
• Optimiser: Adam with weight decay 10⁻⁴.
• Learning-rate scheduler: ReduceLROnPlateau (patience 5, factor 0.5, minimum LR 10⁻⁶).
• Early stopping: patience 15, monitoring validation loss (minimum improvement threshold δ_min = 10⁻⁴).
• Gradient clipping: ℓ₂-norm clipped at 1.0.
• Loss function: MSE (Equation 2).

Each seed controls all sources of randomness: Python's standard library, NumPy, PyTorch CPU and CUDA generators, cuDNN backend settings, and the Python hash seed (set via environment variable before process startup). DataLoader workers derive their seeds deterministically from the primary seed.
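The seed-control protocol can be sketched as follows (an illustrative helper, not the study's code; as the text notes, the hash seed only takes effect if exported before interpreter start-up):

```python
import os
import random
import numpy as np

def set_global_seed(seed: int) -> None:
    """Seed every RNG a training run touches (sketch of the Stage 3
    protocol). PYTHONHASHSEED only affects hashing if exported before
    the interpreter starts, as the text specifies."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    try:
        # PyTorch CPU/CUDA generators and deterministic cuDNN, if installed.
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass
```

Calling this once per (model, asset, horizon, seed) run with seed ∈ {123, 456, 789} makes each of the three final-training repetitions reproducible in isolation.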
Checkpointing occurs every epoch, retaining both the best model (lowest validation loss) and the latest model. Interrupted training resumes from the last completed epoch, restoring optimiser, scheduler, and random-number-generator states.

3.4.4 Stage 4: Metric Aggregation

Evaluation metrics (RMSE, MAE, DA) are computed exclusively on the held-out test set. Predictions are inverse-scaled to the original price domain using training-set scaler parameters before metric computation, ensuring that errors are expressed in economically meaningful units. For each (model, asset, horizon) triple, the three seed-specific metrics are aggregated as mean ± standard deviation, providing both a point estimate and an uncertainty bound.

3.4.5 Stage 5: Benchmarking and Statistical Validation

The final stage generates all comparative analyses:

• Global leaderboard: models ranked by mean RMSE rank across all 12 assets and 2 horizons (24 evaluation points per model).
• Category-level analysis: per-category aggregated metrics and rankings.
• Cross-horizon degradation: RMSE change from h = 4 to h = 24 per model per asset.
• Statistical validation: rank-based leaderboard analysis and two-factor variance decomposition of model vs. seed contributions (Section 3.6).

Dual-plot convention. LSTM serves as a classical baseline, but its errors (one to two orders of magnitude above modern models) compress the visual scale in comparison plots, obscuring performance differences among the eight modern architectures. Body figures therefore use the no-LSTM variant for finer discrimination; the all-models variant including LSTM appears in Appendix A.5. All tabular results always include all nine models.
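Stage 4 can be sketched as follows: `point_metrics` mirrors the RMSE/MAE/DA definitions formalised in Section 3.5, and `aggregate_over_seeds` produces the mean ± standard deviation reported in the tables (both names are illustrative, not from the study's codebase):

```python
import numpy as np

def point_metrics(y, y_hat):
    """RMSE, MAE and directional accuracy (Section 3.5) on inverse-scaled
    price paths; y, y_hat are 1-D arrays in the original price domain."""
    err = y - y_hat
    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))
    # Fraction of steps where predicted and realised price-change signs agree.
    da = float(np.mean(np.sign(np.diff(y_hat)) == np.sign(np.diff(y))))
    return rmse, mae, da

def aggregate_over_seeds(per_seed_metrics):
    """Stage 4 aggregation: mean and standard deviation across the
    seed-specific metric triples for one (model, asset, horizon)."""
    m = np.asarray(per_seed_metrics)   # shape (n_seeds, 3)
    return m.mean(axis=0), m.std(axis=0)
```

Because inverse scaling happens before `point_metrics`, the aggregated errors stay in price units, which is what makes the cross-model comparisons economically interpretable.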
[Figure 2: pipeline flowchart spanning data ingestion, preprocessing, and chronological splitting through Stages 1-5 and manuscript generation; the Stage 5 box lists Friedman/Iman–Davenport, Wilcoxon signed-rank, intraclass correlation, and critical-difference diagrams.]

Figure 2. Five-stage experimental pipeline. Stage 1: fixed-seed Bayesian HPO on representative assets (BTC/USDT, EUR/USD, Dow Jones; seed 42; 5 Optuna TPE trials; 50 epochs per trial). Stage 2: best configuration frozen per (model, category, horizon) triple.
Stage 3: multi-seed final training (seeds 123, 456, 789; 100 epochs maximum; early stopping with patience 15). Stage 4: test-set metric aggregation with inverse scaling (mean ± std across seeds). Stage 5: benchmarking with rank-based leaderboard analysis, visualisation, and variance decomposition. All 918 experimental runs (270 HPO trials plus 648 final training runs) are conducted under identical conditions.

3.5 Evaluation Metrics

Three complementary metrics are computed on the held-out test set for every (model, asset, horizon, seed) configuration. All predictions are inverse-transformed to the original price scale before computation, ensuring that error magnitudes are economically interpretable.

1. RMSE (Root Mean Squared Error): the primary ranking metric, penalising large deviations quadratically. RMSE is appropriate for financial risk assessment, where large forecast errors carry disproportionate cost:

    RMSE = sqrt( (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)² ).    (3)

2. MAE (Mean Absolute Error): a secondary metric that is robust to outliers and provides a median-biased point estimate of forecast error:

    MAE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i|.    (4)

3. DA (Directional Accuracy): the fraction of horizon steps at which the predicted direction of price change matches the realised direction, providing an economically interpretable measure of forecast quality relevant to trading-signal applications:

    DA = (1/n) Σ_{i=1}^{n} 1[ sign(ŷ_i − ŷ_{i−1}) = sign(y_i − y_{i−1}) ].    (5)

All results are reported as mean ± standard deviation across 3 seeds (123, 456, 789), enabling quantification of initialisation-induced uncertainty. The concordance between RMSE and MAE rankings is examined in Section 4.2 to verify that findings are robust to the choice of error metric.

3.6 Statistical Validation Framework

A two-tier framework is employed to characterise observed performance differences:
1. Two-factor variance decomposition → H3. A sum-of-squares decomposition partitions total forecast variance into three components: model-attributable, seed-attributable, and residual interaction. Each is reported as a percentage of the total sum of squares across three panels (raw, z-normalised all models, z-normalised modern only).

2. Spearman rank correlation → H2, H4. Spearman's ρ between h = 4 and h = 24 model rankings per asset tests cross-horizon stability (H2). Spearman's ρ between parameter count and mean RMSE rank tests whether the complexity-performance relationship is monotonic (H4). Both correlations are tested for significance against ρ = 0.

3.7 Experimental Scale

The total experimental scale is:

• HPO (Stage 1): 9 models × 3 representative assets × 2 horizons × 5 trials = 270 trial runs.
• Final training (Stage 3): 9 models × 12 assets × 2 horizons × 3 seeds = 648 training runs.
• Total: 270 + 648 = 918 experimental runs.

Each of the 648 final training runs produces a separate metrics evaluation on the test set, yielding 648 individual (model, asset, horizon, seed) performance records that form the basis of all analyses in Section 4.

4 Results

This section presents findings from 918 experimental runs: 648 final training runs (9 models × 12 assets × 2 horizons × 3 seeds) plus 270 HPO trials. Results are organised by the four hypotheses from Section 1.3. All claims reference specific tables or figures; mechanistic interpretation is deferred to Section 5.

4.1 Global Performance Rankings

Table 7 presents the global leaderboard, ranking nine models across four architectural families by mean RMSE rank across all 12 assets and both horizons (24 evaluation points per model). Three distinct performance tiers emerge:

• Top tier: ModernTCN (CNN; mean rank 1.333, median 1.0) and PatchTST (Transformer; mean rank 2.000, median 2.0).
• Middle tier: iTransformer (3.667), TimeXer (4.292), DLinear (4.958), and N-HiTS (5.250), spanning the Transformer and MLP/Linear families.
• Bottom tier: TimesNet (7.708), Autoformer (7.833), and LSTM (7.958).

The separation between tiers is substantial: the gap between the top tier (ranks 1-2) and bottom tier (ranks 7-9) spans more than 5.5 rank positions, and performance does not correlate uniformly with model family. Both the top and bottom tiers contain CNN and Transformer-based architectures, suggesting that specific implementation details (e.g., patching, large-kernel convolutions) matter more than broad architectural classes.

In terms of win counts, ModernTCN achieves the lowest RMSE on 18 of 24 evaluation points (a 75.0% win rate), while N-HiTS and PatchTST each win 3 points (12.5%). No other architecture achieves a single first-place finish. Table 8 disaggregates these wins by asset and horizon. ModernTCN's dominance is concentrated in forex and equity indices (16 of 16 wins), while its cryptocurrency record is more contested: N-HiTS wins on ETH/USDT (both horizons) and ADA/USDT (h = 4), and PatchTST wins on BTC/USDT (h = 4; RMSE = 731.05 vs. ModernTCN's 731.63, Δ = 0.08%) and ADA/USDT (h = 24). Notably, no model other than ModernTCN achieves the lowest RMSE on any forex or equity-index asset at h = 24, underscoring its superior long-horizon generalisation outside the cryptocurrency domain.

Table 7. Global model ranking aggregated across all 12 assets and both horizons (h ∈ {4, 24}), categorised by architectural family. Each (asset, horizon) pair contributes one rank based on mean RMSE over three seeds. Mean and median ranks are computed over 24 evaluation slots. Win Count indicates the number of slots in which a model achieved rank 1. Bold marks the best value per column.
Model         Family       Mean Rank  Median Rank  Wins (of 24)
ModernTCN     CNN          1.333      1.0          18 (75.0%)
PatchTST      Transformer  2.000      2.0          3 (12.5%)
iTransformer  Transformer  3.667      3.0          0
TimeXer       Transformer  4.292      4.0          0
DLinear       MLP/Linear   4.958      5.0          0
N-HiTS        MLP/Linear   5.250      6.0          3 (12.5%)
TimesNet      CNN          7.708      8.0          0
Autoformer    Transformer  7.833      8.0          0
LSTM          RNN          7.958      9.0          0

Table 8. Best-performing model per asset and horizon, determined by lowest mean RMSE across three seeds. ModernTCN achieves the lowest RMSE on 18 of 24 evaluation points (75%); N-HiTS wins on 3 points (all in cryptocurrency); PatchTST wins on 3 points (two crypto, one forex). Bold highlights non-ModernTCN winners, revealing niche asset-specific advantages.

Category  Asset       Best at h = 4  Best at h = 24
Crypto    BTC/USDT    PatchTST       ModernTCN
          ETH/USDT    N-HiTS         N-HiTS
          BNB/USDT    ModernTCN      ModernTCN
          ADA/USDT    N-HiTS         PatchTST
Forex     EUR/USD     ModernTCN      ModernTCN
          USD/JPY     ModernTCN      ModernTCN
          GBP/USD     PatchTST       ModernTCN
          AUD/USD     ModernTCN      ModernTCN
Indices   DAX         ModernTCN      ModernTCN
          Dow Jones   ModernTCN      ModernTCN
          S&P 500     ModernTCN      ModernTCN
          NASDAQ 100  ModernTCN      ModernTCN
Win count summary: ModernTCN 18; N-HiTS 3; PatchTST 3.

N-HiTS wins exclusively on lower-capitalisation cryptocurrency assets (ETH/USDT, ADA/USDT), suggesting that its hierarchical multi-rate pooling is particularly suited to the high-frequency, multi-scale volatility structure of altcoin markets. PatchTST wins at h = 4 on BTC/USDT (RMSE = 731.05 vs. ModernTCN's 731.63; Δ = 0.08%) and on GBP/USD, but cedes to ModernTCN at h = 24 in both cases. No model other than ModernTCN wins on any forex or equity-index asset at h = 24, underscoring its superior long-horizon generalisation.

Figure 3 displays the RMSE heatmap across all evaluation points.
The body panel excludes LSTM to reveal finer distinctions among the eight modern architectures (the all-models variant appears in Appendix A.5). ModernTCN and PatchTST consistently occupy the lowest-error cells. Figure 4 presents the rank distribution, confirming the three-tier structure.

Figure 3. Global RMSE heatmap across eight modern architectures and 24 evaluation points (12 assets × 2 horizons). Lighter cells indicate lower error. ModernTCN and PatchTST consistently achieve the lowest RMSE values across all asset-horizon combinations. LSTM is excluded for visual clarity; the full nine-model variant is provided in Appendix A.5, Figure 31. Values represent mean RMSE across three seeds.

Figure 4. Global mean-rank comparison across 24 evaluation points (12 assets × 2 horizons). Lower values indicate better performance. Three distinct tiers are visible: ModernTCN and PatchTST (ranks 1-2), a middle group of four models (ranks 3-6), and a bottom group comprising TimesNet, Autoformer, and LSTM (ranks 7-9). Error bars represent rank standard deviation across evaluation points.

4.2 Category-Level Analysis

Table 9 reports category-level aggregated RMSE and MAE for each model, providing the asset-class dimension of H1. Error magnitudes differ by several orders of magnitude across categories due to underlying price scales, but the relative model ordering is preserved.

Table 9. Category-level mean RMSE and MAE averaged across all assets within each category and across both horizons (h ∈ {4, 24}). Values are aggregated over 4 assets × 2 horizons × 3 seeds. Bold marks the best (lowest) value per category and metric. LSTM is included for completeness but excluded from ranking discussions due to convergence failures.

              Crypto              Forex             Indices
Model         RMSE      MAE       RMSE     MAE      RMSE      MAE
Autoformer    532.21    393.83    0.2068   0.1551   179.66    136.24
DLinear       323.48    220.87    0.1279   0.0940   123.80    90.67
iTransformer  320.83    217.99    0.1136   0.0785   113.72    79.13
LSTM          2398.94   2041.58   3.6679   3.5858   1548.56   1478.21
ModernTCN     314.66    211.14    0.1098   0.0750   110.89    76.09
N-HiTS        352.91    255.77    0.2356   0.1977   135.01    101.04
PatchTST      314.87    211.70    0.1108   0.0761   111.71    77.04
TimesNet      352.92    249.41    0.2127   0.1596   172.98    131.61
TimeXer       318.38    215.24    0.1174   0.0815   114.09    79.19

Cryptocurrency. ModernTCN achieves the lowest mean RMSE (314.66), closely followed by PatchTST (314.87; Δ = 0.07%) and TimeXer (318.38). iTransformer (320.83) and DLinear (323.48) are competitive within a 3% range of the leader. LSTM exhibits errors approximately 7.6× higher than the best model (2,398.94), reflecting fundamental convergence difficulties under the standard training protocol. N-HiTS (352.91), TimesNet (352.92), and Autoformer (532.21) form the lower-performing group.

Forex. ModernTCN leads (mean RMSE 0.1098), followed by PatchTST (0.1108; Δ = 0.9%) and iTransformer (0.1136). TimeXer (0.1174) and DLinear (0.1279) remain competitive on an absolute scale. LSTM (3.668) is approximately 33× worse than the leader, while N-HiTS (0.2356), Autoformer (0.2068), and TimesNet (0.2127) form the lower tier.

Equity indices. ModernTCN achieves the lowest RMSE (110.89), with PatchTST (111.71; Δ = 0.7%), iTransformer (113.72), and TimeXer (114.09) in close succession. DLinear (123.80) and N-HiTS (135.01) are moderately higher. TimesNet (172.98), Autoformer (179.67), and LSTM (1,548.56) trail substantially.

Cross-category ranking consistency. Across all three categories, ModernTCN and PatchTST consistently occupy the top two positions (Figure 5).
This consistency is confirmed by the rmse – mae concordance in Figure 6 : near -per f ect linear cor relation betw een the two error metr ics sho ws that rankings are robus t to metric choice. Categor y dendrograms and per -categor y per f ormance matr ices 4 Results 24 appear in Appendix A.7 (Figures 34 and 35 ); Moder n TCN and P atchTS T cluster at the top of e v er y dendrogram. Figure 5. Category-le vel rank distributions across assets within each categor y , ex cluding LS TM f or visual clarity . Modern TCN exhibits the tightest rank distribution (consistently rank 1 across all categories), indicating stable cross-asset per f ormance. The full nine-model v ar iant is pro vided in Appendix A.5 . Box es sho w interquar tile rang e; whiskers e xtend to the most e xtreme rank observ ed. Figure 6. Category-le vel rmse v s. mae scatter plot f or eight modern architectures. Each point represents one model’ s mean er ror within a category . The near -per f ect linear cor relation ( 𝑅 2 > 0 . 99 ) confirms that model rankings are consistent across the two error metrics, indicating that findings based on rmse g eneralise to mae . The full nine-model v ar iant is provided in Appendix A.5 . 4 Results 25 4.3 Cr oss-Horizon Degradation T able 10 presents rmse at ℎ = 4 and ℎ = 24 f or each representativ e asset (BTC/USDT , EUR/USD, Dow Jones), along with the percentage degradation ( Δ % = 100 × ( RMSE ℎ = 24 − RMSE ℎ = 4 ) / RMSE ℎ = 4 ). T able 13 pro vides the corresponding rank shift Δ = 𝑟 24 − 𝑟 4 f or each model, isolating relativ e order ing chang es from absolute er ror magnitudes. T able 10. Hor izon degradation f or BTC/USDT (top), EUR/USD (middle), and Do w Jones (bottom). rmse v alues are three-seed means. Δ % denotes the relativ e rmse increase from ℎ = 4 to ℎ = 24 : Δ % = 100 × ( RMSE 24 − RMSE 4 ) / RMSE 4 . Bold marks the model with the lo wes t rmse at ℎ = 24 across all nine architectures. Model rmse , ℎ = 4 rmse , ℎ = 24 Δ % A utof or mer 1284 . 90 2670 . 
00 107 . 80 DLinear 772 . 00 1644 . 10 113 . 00 i T ransf or mer 743 . 50 1651 . 30 122 . 10 LSTM 8029 . 90 10 878 . 70 35 . 50 Modern TCN 731 . 60 1617 . 40 121 . 10 N-Hi TS 930 . 50 1724 . 50 85 . 30 Patc hTST 731 . 10 1619 . 10 121 . 50 TimesN et 793 . 60 1840 . 40 131 . 90 TimeX er 750 . 30 1624 . 90 116 . 60 T able 11. Hor izon degradation for EUR/USD . rmse values are three-seed means. Δ % defined as in T able 10 . Model rmse , ℎ = 4 rmse , ℎ = 24 Δ % A utof or mer 0 . 00 0 . 01 120 . 30 DLinear 0 . 00 0 . 00 118 . 50 i T ransf or mer 0 . 00 0 . 00 121 . 00 LSTM 0 . 00 0 . 00 102 . 60 Modern TCN 0 . 00 0 . 00 121 . 40 N-Hi TS 0 . 00 0 . 00 107 . 40 Patc hTST 0 . 00 0 . 00 124 . 90 TimesN et 0 . 00 0 . 01 110 . 30 TimeX er 0 . 00 0 . 00 116 . 70 All models e xhibit higher rmse at ℎ = 24 relativ e to ℎ = 4 , consistent with the established understanding that prediction uncer tainty g ro w s with the f orecast window . Ho w e ver , degradation rates v ar y meaningfull y across architectures and assets: BTC/USDT . Degradation ranges from 85.3% (N-Hi TS) to 131.9% (TimesNet) among modern architectures. The top-tier models Modern TCN ( Δ % = 121 . 1 ) and Patc hTST ( Δ % = 121 . 5 ) e xhibit nearl y identical degradation. LSTM show s the lo w est relative deg radation (35.5%), not because of strong long-horizon perf or mance, but because its ℎ = 4 er rors are already substantiall y ele vated (8,029.9 vs. 731.6 f or Moder n TCN). 4 Results 26 T able 12. Hor izon degradation for Do w Jones . rmse values are three-seed means. Δ % defined as in T able 10 . Model rmse , ℎ = 4 rmse , ℎ = 24 Δ % A utof or mer 160 . 70 410 . 70 155 . 60 DLinear 121 . 30 279 . 10 130 . 10 i T ransf or mer 115 . 20 258 . 40 124 . 30 LSTM 1261 . 70 1688 . 50 33 . 80 Modern TCN 112 . 50 249 . 50 121 . 70 N-Hi TS 160 . 20 255 . 90 59 . 70 Patc hTST 113 . 00 252 . 20 123 . 20 TimesN et 160 . 30 431 . 40 169 . 10 TimeX er 118 . 40 258 . 30 118 . 10 T able 13. 
Horizon ranking shift for the three representative assets. r_4 and r_24: model rank at h = 4 and h = 24 respectively (lower is better; 1 = best). Δ = r_24 − r_4: positive values indicate rank degradation (the model performs relatively worse at longer horizons); negative values indicate rank improvement. An asterisk marks |Δ| ≥ 2. Rankings are based on mean RMSE across three seeds.

               BTC/USDT        EUR/USD        Dow Jones
Model          r4  r24    Δ    r4  r24    Δ    r4  r24    Δ
Autoformer      8    8    0     8    8    0     8    7   -1
DLinear         5    4   -1     4    4    0     5    6   +1
iTransformer    3    5   +2*    3    3    0     3    5   +2*
LSTM            9    9    0     7    7    0     9    9    0
ModernTCN       2    1   -1     1    1    0     1    1    0
N-HiTS          7    6   -1     6    6    0     6    3   -3*
PatchTST        1    2   +1     2    2    0     2    2    0
TimesNet        6    7   +1     9    9    0     7    8   +1
TimeXer         4    3   -1     5    5    0     4    4    0

EUR/USD exhibits perfect rank stability (Δ = 0 for all nine models), confirming that the forex ranking is invariant to horizon. iTransformer degrades by 2 positions in both BTC/USDT and Dow Jones, suggesting that its inductive bias is less suited to longer-horizon financial prediction. N-HiTS improves by 3 positions on Dow Jones at h = 24, the largest positive shift observed, consistent with its multi-rate pooling design capturing longer temporal structure in indices.

EUR/USD. Degradation magnitudes are comparable: ModernTCN (Δ% = 121.4) and PatchTST (Δ% = 124.9) degrade similarly, while N-HiTS (Δ% = 107.4) shows the lowest degradation among modern architectures.

Dow Jones. TimesNet exhibits the highest degradation (169.1%), followed by Autoformer (155.6%).
N-HiTS shows notably lower degradation (59.7%), suggesting that its hierarchical multi-rate pooling may capture multi-scale patterns that transfer across horizons.

Cross-horizon rank stability. Despite 2-2.5× absolute error amplification, top-tier models maintain their relative ranking across horizons for all three representative assets: ModernTCN and PatchTST hold positions 1-2 at both h = 4 and h = 24. Rank shifts are concentrated in the middle tier; for example, N-HiTS improves by 3 ranks on Dow Jones (from rank 6 to rank 3), while iTransformer drops by 2 ranks on BTC/USDT (from rank 3 to rank 5). EUR/USD rankings are perfectly stable across horizons. A comprehensive heatmap of per-model, per-asset degradation percentages appears in Appendix A.4 (Figure 30). Section 4.8 provides complementary qualitative evidence through actual-versus-predicted overlays.

Figure 7 extends this analysis to all twelve assets. ModernTCN and PatchTST consistently occupy the lowest-error positions across all assets and both horizons, confirming that their inductive biases (large-kernel convolutions and patch-based self-attention) generalise robustly to extended temporal contexts. Conversely, TimesNet and Autoformer show the most pronounced degradation, suggesting that their multi-periodicity mechanisms are susceptible to fidelity loss at longer horizons in high-noise financial domains.

Figure 7. Cross-horizon RMSE comparison for eight modern architectures across twelve assets. Each asset group displays RMSE at h = 4 and h = 24. The top-tier architectures (ModernTCN, PatchTST) maintain superior rankings across both horizons, while middle-tier models exhibit varying sensitivity to the forecast window length.
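The degradation percentages used throughout this subsection follow directly from the Δ% definition. A minimal sketch, using the three-seed mean RMSE values quoted in Table 10 for BTC/USDT (the dictionary below is illustrative, not the study's code):

```python
def degradation_pct(rmse_h4, rmse_h24):
    """Relative RMSE increase from h=4 to h=24:
    delta% = 100 * (RMSE_24 - RMSE_4) / RMSE_4."""
    return 100.0 * (rmse_h24 - rmse_h4) / rmse_h4

# Three-seed mean RMSEs for BTC/USDT (Table 10).
btc_rmse = {
    "ModernTCN": (731.6, 1617.4),    # 121.1% degradation
    "PatchTST":  (731.1, 1619.1),    # 121.5%
    "LSTM":      (8029.9, 10878.7),  # 35.5%: low only because h=4 errors are already large
}

for model, (h4, h24) in btc_rmse.items():
    print(f"{model:10s} {degradation_pct(h4, h24):6.1f}%")
```

Note that Δ% is scale-free, which is why it remains comparable across assets whose absolute RMSE levels differ by orders of magnitude (Tables 10-12).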
4.4 Seed Robustness and Variance Decomposition

The two-factor variance decomposition (Table 14) is reported in three panels: raw RMSE, z-score-normalised RMSE across all nine models, and z-score-normalised RMSE excluding the LSTM outlier.

Raw panel. On the original price scale, architecture absorbs 99.90% of the total sum-of-squares variance, versus 0.01% for seed and 0.09% for the residual. This extreme dominance is largely an artefact of LSTM's outlier RMSE values: a single model 7-33× above the median inflates SS_model relative to all other terms.

Figure 8. Cross-horizon RMSE degradation for eight modern architectures across representative assets. Lines connect each model's RMSE at h = 4 and h = 24. All architectures exhibit absolute error growth, but degradation magnitudes are architecture-dependent: N-HiTS degrades least, while TimesNet and Autoformer exhibit the steepest increase. ModernTCN and PatchTST maintain top-tier performance at both horizons. The full nine-model variant is provided in Appendix A.5.

Table 14. Two-factor variance decomposition of forecast RMSE. Raw (untransformed) and z-normalised (within each asset-horizon slot) panels are shown with and without LSTM. Seed variance is negligible (< 0.1%) in all cases.

Factor                     Raw (%)   z-norm, all (%)   z-norm, no LSTM (%)
Model (architecture)         99.90             48.32                 68.33
Seed (initialisation)         0.01              0.04                  0.02
Residual (model × slot)       0.09             51.64                 31.66

Raw panel: pooled variance decomposition on untransformed RMSE; LSTM's errors (7-33× higher than the best model) dominate the model sum of squares, inflating the Model fraction to 99.90%. z-norm panels: RMSE standardised within each (asset, horizon) slot before ANOVA, removing price-magnitude scale effects across asset classes.
Residual (≈ 32-52%) reflects model × slot interaction: each architecture has a context-dependent advantage on different asset-horizon combinations. Seed variance (< 0.1%) is negligible across all three panels, validating the three-seed protocol.

Z-normalised panel: all models. After z-scoring each model's RMSE within each (asset, horizon) slot, the architecture factor falls to 48.32%, seed to 0.04%, and the residual rises to 51.64%. The large residual reflects genuine heterogeneity in which model excels on a given asset-horizon combination, a meaningful signal rather than noise.

Z-normalised panel: modern models only. Excluding LSTM, architecture recovers to 68.33% (seed: 0.02%; residual: 31.66%), confirming that architecture remains the dominant factor even among competitive modern models, though asset-horizon context contributes substantially.

In all three panels, seed variance is negligible (≤ 0.04%): rankings are stable with respect to random initialisation, and three seeds suffice.

Per-asset variance decompositions corroborate the global result. At h = 4, architecture accounts for 99.75% of variance on BTC/USDT, 97.44% on EUR/USD, and 99.08% on Dow Jones; seed fractions are 0.02%, 0.35%, and 0.01% respectively. Even EUR/USD, with the highest relative seed contribution, shows seed variance two orders of magnitude below the architecture factor. At h = 24, the pattern tightens further: architecture explains 99.86% on BTC/USDT (seed: 0.02%), confirming that initialisation effects diminish, rather than amplify, at longer horizons.

Figure 9(a) displays seed-to-seed RMSE variation per model as a violin plot. Inter-seed variance is negligible relative to inter-model differences across all nine architectures: even LSTM, which has the highest absolute seed variance, shows seed-induced variation that is small relative to its distance from the nearest competitor.
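The two-factor sum-of-squares decomposition behind Table 14 can be sketched on a small models × seeds error grid. The grid below is synthetic (a large architecture effect, a tiny seed effect), chosen only to reproduce the qualitative pattern, not the study's data:

```python
def variance_shares(grid):
    """Two-factor (model x seed) sum-of-squares decomposition.
    Returns (model, seed, residual) shares of the total SS, in percent."""
    m, s = len(grid), len(grid[0])
    grand = sum(map(sum, grid)) / (m * s)
    row_means = [sum(row) / s for row in grid]                          # per-model means
    col_means = [sum(grid[i][j] for i in range(m)) / m for j in range(s)]  # per-seed means
    ss_total = sum((grid[i][j] - grand) ** 2 for i in range(m) for j in range(s))
    ss_model = s * sum((r - grand) ** 2 for r in row_means)
    ss_seed = m * sum((c - grand) ** 2 for c in col_means)
    ss_resid = ss_total - ss_model - ss_seed   # interaction remainder
    return tuple(100.0 * ss / ss_total for ss in (ss_model, ss_seed, ss_resid))

# Synthetic 3-model x 3-seed RMSE grid: model means 10/20/30, seed offsets 0/0.1/0.2.
grid = [[base + off for off in (0.0, 0.1, 0.2)] for base in (10.0, 20.0, 30.0)]
model_pct, seed_pct, resid_pct = variance_shares(grid)
print(f"model {model_pct:.2f}%  seed {seed_pct:.2f}%  residual {resid_pct:.2f}%")
```

Because the synthetic grid is purely additive, the residual (interaction) share is essentially zero; in the study's data the residual instead captures the model × slot interaction discussed above.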
Figure 9(b) provides a pie chart of the raw variance decomposition; the z-normalised breakdown appears in Table 14. Per-asset seed-variance box plots and scatter plots for all three representative assets appear in Appendix B.3 (Figures 36-39).

Figure 9. Seed robustness analysis. (a) Violin plot of seed-to-seed RMSE variation: inter-seed variation is negligible relative to inter-model differences across all nine architectures. (b) Two-factor variance decomposition on raw price-scale RMSE: architecture explains 99.90%, seed 0.01%, residual 0.09%. After z-score normalisation within each (asset, horizon) slot, the architecture share falls to 48.3% (68.3% excluding LSTM); see Table 14 for the full dual-panel breakdown.

4.5 Directional Accuracy

Directional accuracy (DA) quantifies the fraction of forecasts that correctly predict the sign of the next price change. Table 15 reports mean DA per model, category, and horizon. Across all 9 models × 3 categories × 2 horizons (= 54 combinations), mean DA is 50.08%. No combination deviates meaningfully from 50%, and no horizon trend is discernible. MSE-trained deep learning architectures produce directional forecasts equivalent to a fair coin flip on hourly financial data. This is consistent with the weak-form efficient-market hypothesis at hourly resolution and with MSE training's regression-to-the-mean bias. Section 5.4 discusses implications for trading strategies.

Table 15. Mean directional accuracy (DA, %) per model, asset class, and horizon, averaged across assets within each category and over three seeds. Values close to 50% indicate no systematic directional bias.

               Crypto           Forex            Indices
Model          h=4    h=24      h=4    h=24      h=4    h=24
Autoformer     50.41  49.96     50.06  50.11     50.01  49.98
DLinear        49.70  49.94     50.13  50.12     49.97  49.96
iTransformer   49.58  50.22     50.04  49.99     49.90  50.18
LSTM           49.73  50.07     50.08  49.95     49.88  49.95
ModernTCN      50.04  49.96     49.96  50.03     50.34  50.03
N-HiTS         50.05  49.87     50.02  50.03     50.78  49.92
PatchTST       50.07  49.85     50.10  49.98     50.42  49.95
TimesNet       50.57  50.34     50.05  49.88     50.27  50.18
TimeXer        49.86  49.94     49.98  50.08     50.92  50.20

Mean DA across all 54 model-category-horizon combinations is 50.08%, indistinguishable from a fair-coin baseline. No combination deviates meaningfully from 50%. Values below 50% indicate a slight down-trend bias arising from scaling artefacts, not genuine negative directional skill.

4.6 Statistical Significance Tests

Tables 16-19 report the full battery of statistical tests, providing formal confirmation of the descriptive findings in Sections 4.1-4.5.

Table 16. Statistical significance tests, Panel A: Friedman-Iman-Davenport omnibus tests. χ²: Friedman χ² statistic (8 df); F: Iman-Davenport F-statistic; n: number of blocks (evaluation points). All tests reject H₀ at α = 0.001.

Scope                                   χ²(8)        F         df         p          n
Global (all 12 assets, h ∈ {4, 24})    156.49    101.36    (8, 184)   < 10⁻¹⁵      24
Crypto (4 assets, h ∈ {4, 24})          49.23     23.34    (8, 56)    < 10⁻¹⁵       8
Forex (4 assets, h ∈ {4, 24})           56.00     49.00    (8, 56)    < 10⁻¹⁵       8
Indices (4 assets, h ∈ {4, 24})         60.40    117.44    (8, 56)    < 10⁻¹⁵       8

Global Friedman-Iman-Davenport test. The Friedman-Iman-Davenport test on all 24 evaluation points (12 assets × 2 horizons) yields F(8, 184) = 101.36, p < 10⁻¹⁵ (Friedman χ²(8) = 156.49, p = 8.67 × 10⁻³⁰), firmly rejecting the null hypothesis that all nine architectures perform equivalently. Table 16 indicates that exactly the same conclusion holds within every individual asset class: crypto (F = 23.34), forex (F = 49.00), and indices (F = 117.44), each with p < 10⁻¹⁵.

Table 17.
Statistical significance tests, Panel B: Spearman rank correlations between model rankings at h = 4 and h = 24, plus the Stouffer combined test. Rankings are based on mean RMSE across three seeds for each of the nine architectures (n = 9 per asset). All 12 per-asset correlations are significant at α = 0.05.

Category   Asset         Spearman ρ    p-value
Crypto     ADA/USDT            0.68    0.0424
Crypto     BNB/USDT            0.92    0.0005
Crypto     BTC/USDT            0.92    0.0005
Crypto     ETH/USDT            0.78    0.0125
Forex      AUD/USD             0.87    0.0025
Forex      EUR/USD             1.00    < 10⁻⁹
Forex      GBP/USD             0.80    0.0096
Forex      USD/JPY             0.95    0.0001
Indices    DAX                 0.88    0.0016
Indices    Dow Jones           0.87    0.0025
Indices    S&P 500             0.97    < 10⁻⁴
Indices    NASDAQ 100          0.98    < 10⁻⁵
Stouffer combined (n = 12 assets): Z = 6.17, p = 3.47 × 10⁻¹⁰

The Stouffer combined Z-statistic aggregates the 12 per-asset one-sided Spearman p-values using the inverse-normal method. The global p = 3.47 × 10⁻¹⁰ confirms that cross-horizon rank stability is a systematic property of the benchmark.

Table 18. Statistical significance tests, Panel C: intraclass correlation coefficient (ICC; two-way mixed, absolute agreement) for three representative assets at h = 24. High ICC values confirm negligible seed-to-seed variation relative to inter-model differences.

Asset (h = 24)    ICC    F-statistic    p-value
BTC/USDT         1.00        1650.20    < 10⁻¹⁵
EUR/USD          0.99         309.60    < 10⁻¹⁵
S&P 500          1.00        2255.60    < 10⁻¹⁵

Each ICC is computed over 9 models × 3 seeds. The F-statistic tests H₀: all seed means are equal. Values above 0.99 indicate that > 99% of inter-model variance is attributable to architecture rather than random initialisation.

Table 19. Statistical significance tests, Panel D: Jonckheere-Terpstra (JT) test for a monotonic relationship between model complexity (parameter count) and RMSE rank. A significant positive result would indicate that more parameters reliably yield better ranks.
Three complexity groups: ≤ 30K, 30K-200K, and > 200K parameters.

Category   Horizon    JT z    p-value   Monotonic?
Crypto         4      0.35      0.36    No
Crypto        24     -0.35      0.64    No
Forex          4      0.38      0.35    No
Forex         24     -0.29      0.61    No
Indices        4     -1.25      0.89    No
Indices       24     -1.48      0.93    No

No test achieves significance (α = 0.05); all p > 0.35. Negative z values indicate a tendency for fewer parameters to yield better ranks, consistent with the Pareto frontier defined by DLinear, PatchTST, and ModernTCN (Figure 11).

Post-hoc Holm-Wilcoxon pairwise tests (global). Of the C(9,2) = 36 pairwise Wilcoxon comparisons at the global level (n = 24 observations), 33 are statistically significant after Holm correction (α = 0.05). The three non-significant pairs are all intra-tier: TimeXer vs. iTransformer (p_Holm = 0.480), Autoformer vs. TimesNet (p_Holm = 0.529), and DLinear vs. N-HiTS (p_Holm = 0.529), confirming that only neighbouring models within the same performance tier are statistically indistinguishable.

At the per-category level (n = 8: 4 assets × 2 horizons), no pairwise comparison reaches significance after Holm correction (all p_Holm > 0.28). This reflects a power constraint, not an absence of effect: with n = 8, the minimum achievable Wilcoxon p-value is 0.0078, while the Holm step-down procedure requires the most extreme raw p-value to beat 0.05/36 = 0.0014, which is unattainable at this sample size. The per-category result is thus consistent with, and subsumed by, the decisive global test.

Critical difference diagram. Figure 10 visualises the Holm-Wilcoxon significance structure as a critical difference (CD) diagram. The horizontal axis represents the mean rank across all N = 24 evaluation blocks (k = 9 models); lower values denote superior performance. Each model is placed at its exact mean rank derived from global_ranking_aggregated.csv.
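The per-category power constraint described above follows directly from the Holm step-down arithmetic. A minimal sketch of the standard procedure (not the study's implementation):

```python
def holm_reject(pvals, alpha=0.05):
    """Holm step-down: compare the i-th smallest p-value to alpha/(m - i),
    stopping at the first failure. Returns rejection flags in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for step, idx in enumerate(order):
        if pvals[idx] <= alpha / (m - step):
            reject[idx] = True
        else:
            break  # all remaining (larger) p-values fail as well
    return reject

# With n = 8 blocks, the smallest exact two-sided Wilcoxon p-value is
# 2 / 2**8 = 0.0078, but the first Holm threshold over 36 comparisons is
# 0.05 / 36 ~ 0.0014, so no per-category pair can ever reach significance.
flags = holm_reject([2 / 2**8] + [0.5] * 35)
print(any(flags))  # False
```

The step-down structure is what makes the first threshold (alpha divided by the full number of comparisons) binding: once the smallest p-value fails it, every larger one fails too.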
Thick horizontal bars connect pairs that are not statistically distinguishable after Holm correction (α = 0.05); all unlabelled pairs are significant. Three intra-tier equivalence groups emerge directly from the pairwise test results. Within the middle tier, iTransformer (rank 3.667) and TimeXer (rank 4.292) are statistically indistinguishable (p_Holm = 0.480), as are DLinear (rank 4.958) and N-HiTS (rank 5.250) (p_Holm = 0.529). Within the bottom tier, TimesNet (rank 7.708) and Autoformer (rank 7.833) are statistically equivalent (p_Holm = 0.529). All 33 remaining comparisons are statistically significant, including every cross-tier comparison. In particular, the top-tier boundary is unambiguous: ModernTCN (rank 1.333) and PatchTST (rank 2.000) are each significantly superior to every model in the middle and bottom tiers (p_Holm ≤ 0.028 in all cases).

The bracketed interval labelled CD_0.05 = 2.451 in the upper-left corner displays the Nemenyi critical difference, computed as CD = q_0.05(9) × √(k(k+1)/(6N)) = 3.102 × √(90/144), where q_0.05(9) = 3.102 is the studentised-range critical value at α = 0.05 for k = 9 and N = 24. This bracket is shown as a reference only; all significance claims in this paper are derived from the more powerful Holm-corrected Wilcoxon procedure rather than the Nemenyi threshold. Notably, the full rank span from ModernTCN to LSTM (Δr = 6.625) exceeds 2.7 × CD_0.05, confirming that the top-to-bottom separation is not a boundary case but an overwhelming statistical gap.

Figure 10. Critical difference diagram for nine forecasting architectures across N = 24 evaluation blocks (12 assets × 2 horizons). Models are placed at their exact mean RMSE rank; lower rank is better.
Thick horizontal bars connect models that are not statistically distinguishable at α = 0.05 under Holm-corrected Wilcoxon tests (holm_wilcoxon.csv): iTransformer-TimeXer (p_Holm = 0.480), DLinear-N-HiTS (p_Holm = 0.529), and TimesNet-Autoformer (p_Holm = 0.529). All other 33 pairwise comparisons are significant. The bracketed interval (upper left) shows the Nemenyi critical difference CD_0.05 = 2.451 (k = 9, N = 24, q_0.05 = 3.102) for reference only.

Cross-horizon Spearman rank correlations and Stouffer combination. Table 17 reports the Spearman ρ between model rankings at h = 4 and h = 24 for all twelve assets. All 12 correlations are positive and statistically significant (p < 0.05), with ρ ranging from 0.683 (ADA/USDT) to 1.000 (EUR/USD). The Stouffer combined statistic Z_S = 6.17 (p = 3.47 × 10⁻¹⁰) confirms that cross-horizon rank stability is a globally systematic property: no single architecture's ranking collapses between the two horizons.

Intraclass correlation coefficient (ICC). ICC(3,k) analysis for three representative assets at h = 24 (Table 18) yields ICC > 0.990 in all cases, with F-statistics ranging from 309.6 to 2255.6 (p < 10⁻¹⁵). These values indicate that more than 99% of inter-model variance arises from architecture rather than random initialisation, corroborating the ≤ 0.04% seed contribution in Table 14. At h = 4, ICC remains high: 0.9966 (BTC/USDT, F = 873.9), 0.9668 (EUR/USD, F = 88.4), and 0.9864 (Dow Jones, F = 218.2), all with p < 10⁻¹¹. EUR/USD's comparatively lower h = 4 ICC reflects the tighter model clustering on this low-volatility pair rather than genuine seed instability, as even 96.7% remains far above conventional reliability thresholds.

Diebold-Mariano pairwise tests.
For BTC/USDT at h = 24, Holm-corrected Diebold-Mariano (DM) tests show that the top cluster (ModernTCN, PatchTST, TimeXer, DLinear, iTransformer) is only partially distinguishable internally. Specifically, ModernTCN vs. PatchTST is not significant (p_Holm = 0.453), and ModernTCN vs. TimeXer is borderline (p_Holm = 0.094), while all comparisons to LSTM, Autoformer, and TimesNet are highly significant (p < 10⁻¹⁴).

EUR/USD at h = 24 exhibits a sharper separation structure. ModernTCN is statistically distinguishable from PatchTST (t_DM = -10.17, p_Holm = 2.77 × 10⁻²⁴) and from TimeXer (t_DM = -9.21, p_Holm = 3.30 × 10⁻²⁰), unlike BTC/USDT, where these top-tier differences are not significant. The non-significant pairs on EUR/USD are DLinear vs. PatchTST (p_Holm = 0.122), DLinear vs. TimeXer (p_Holm = 0.187), DLinear vs. iTransformer (p_Holm = 0.187), and iTransformer vs. PatchTST (p_Holm = 0.717). Thus, on the most liquid and low-noise forex pair, ModernTCN's superiority is statistically unambiguous, whereas the middle cluster (DLinear, iTransformer, PatchTST, TimeXer) remains internally indistinguishable. This asset-dependent DM separability suggests that the statistical power to discriminate the top two architectures depends on the signal-to-noise properties of the underlying market. These results confirm that the top-tier boundary (ModernTCN and PatchTST) is not artefactual, but fine-grained rankings within the three-to-five-model cluster should be interpreted with appropriate caution on individual assets.

Jonckheere-Terpstra test for complexity monotonicity. The Jonckheere-Terpstra (JT) test for a monotonic rank-descending trend with increasing parameter count finds no significant relationship in any of the six category-horizon combinations tested (all p > 0.35; Table 19).
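The JT statistic itself can be sketched as a sum of pairwise between-group counts with the usual normal approximation (no tie correction; the toy groups below are illustrative, not the study's data):

```python
def jonckheere(groups):
    """Jonckheere-Terpstra statistic for ordered groups and its normal
    approximation z = (JT - mu) / sigma (no tie correction)."""
    k = len(groups)
    # JT = sum over ordered group pairs of Mann-Whitney counts #{x < y}.
    jt = sum(1 for i in range(k) for j in range(i + 1, k)
             for x in groups[i] for y in groups[j] if x < y)
    sizes = [len(g) for g in groups]
    n = sum(sizes)
    mu = (n * n - sum(s * s for s in sizes)) / 4
    var = (n * n * (2 * n + 3) - sum(s * s * (2 * s + 3) for s in sizes)) / 72
    return jt, (jt - mu) / var ** 0.5

# Toy example: values grouped by ascending parameter count. A strictly
# increasing trend gives the maximal JT and a positive z; the
# fewer-parameters-better tendency in Table 19 corresponds to negative z.
jt, z = jonckheere([[1, 2], [3, 4], [5, 6]])
print(jt, round(z, 2))  # 12 2.38
```

Reversing the group order flips the sign of z, which is the sense in which the predominantly negative z-values in Table 19 point towards smaller models ranking better.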
Four of the six z-values are negative, indicating that fewer parameters tend to yield better ranks on average. This formally corroborates the non-monotonic complexity-performance finding (Section 4.7).

Directional accuracy z-tests. One-sample z-tests on directional accuracy confirm that no architecture's DA deviates significantly from the 50% null (all Holm-corrected p > 0.43 for BTC/USDT at h = 24), corroborating the aggregate finding in Section 4.5. The test with the largest |z| is iTransformer (z = 0.263, p_Holm = 1.0), confirming that no model possesses directional skill at this resolution.

4.7 Complexity-Performance Relationship

The relationship between model complexity, measured by trainable parameter count, and forecasting performance is shown in Figure 11. The analysis focuses on modern architectures, excluding the LSTM baseline. The empirical results reveal several insights:

Pareto-efficient architectures. A distinct Pareto frontier is defined by DLinear, PatchTST, and ModernTCN. DLinear (approx. 1,000 parameters) represents the extreme efficiency point, achieving mid-tier performance (rank 5) with minimal capacity. PatchTST (approx. 103K) and ModernTCN (approx. 230K) occupy the optimal region, providing the lowest RMSE ranks globally by deploying parameters into inductive biases suited to financial time series.

Figure 11. Complexity-performance trade-off (excluding LSTM). (a) Horizon h = 4 (forex; Spearman ρ = -0.143, p = 0.736). (b) Horizon h = 24 (forex; Spearman ρ = -0.310, p = 0.456). The horizontal axis shows the number of trainable parameters (log scale, with an OLS trend on the log scale); the vertical axis shows the mean RMSE rank across all assets and seeds. The Pareto frontier is clearly defined by DLinear, PatchTST, and ModernTCN.

Diminishing returns at high complexity. Beyond approximately 2.5 × 10⁵ parameters, returns diminish sharply. Autoformer (approx. 438K) and iTransformer (approx. 253K) do not improve commensurately with their larger capacity. Autoformer's rank is consistently worse than that of the simpler DLinear, suggesting that excess capacity without appropriate temporal decomposition may lead to overfitting on volatile OHLCV features.

Horizon consistency. Comparing Figures 11a and 11b shows that the efficiency profile remains stable across horizons. Absolute errors increase at h = 24, but relative model positions on the complexity-performance plane are preserved, indicating that architectural efficiency is a robust property of the model design.

4.8 Qualitative Forecast Fidelity

Quantitative metrics efficiently rank architectures but compress multi-dimensional forecast behaviour into scalar summaries. This subsection complements the tabular evidence with actual-versus-predicted overlays, organised along three dimensions: short-horizon tracking (h = 4), medium-horizon behaviour (h = 24, step 12), and long-horizon degradation (h = 24, step 24). All plots use seed 123; analogous patterns hold across seeds given the near-zero seed variance established in Section 4.4.

Short-horizon tracking fidelity (h = 4, steps 1 and 4). Figure 12 presents ModernTCN's actual-versus-predicted overlay on BTC/USDT at h = 4 for steps 1 and 4 of the forecast vector.
At step 1 (Figure 12a), the predicted curve closely mirrors the actual price trajectory, capturing both the direction and the amplitude of hourly movements. This near-perfect alignment is consistent with ModernTCN's lowest category-level RMSE in cryptocurrency (314.66; Table 9). At step 4 (Figure 12b), overall alignment is maintained, but the predicted curve shows modest amplitude attenuation during high-volatility episodes, a visual signature of the regression-to-the-mean effect inherent in MSE-optimised multi-step predictors. No systematic phase shift or directional bias appears in either panel.

Figure 12. Actual versus predicted close price for ModernTCN on BTC/USDT, h = 4 (seed 123). (a) Step 1 (+1 hour ahead): the model tracks the actual price with high fidelity, capturing directional turns and amplitude fluctuations. (b) Step 4 (+4 hours ahead): trend structure is preserved but short-lived volatility spikes are modestly attenuated, consistent with MSE-induced shrinkage. No systematic phase shift is observed.

Figure 13 juxtaposes PatchTST on EUR/USD and TimeXer on Dow Jones, both at h = 4, step 1. The EUR/USD panel (Figure 13a) shows that PatchTST's predictions follow the low-amplitude, mean-reverting dynamics of the currency pair with high precision, corroborating its category-leading RMSE in forex (0.1108; Table 9).
The Dow Jones panel (Figure 13b) shows the middle-tier TimeXer (rank 4): it tracks the general directional drift but exhibits a wider deviation band and less precise recovery of sharp reversals, a qualitative reflection of the rank gap between the top and middle tiers.

Figure 13. Cross-architecture, cross-asset contrast at h = 4, step 1 (seed 123). (a) PatchTST on EUR/USD (rank 2) tracks the low-amplitude, mean-reverting forex dynamics with high accuracy. (b) TimeXer on Dow Jones (rank 4) captures the directional structure but with a wider deviation band, illustrating the qualitative signature of the rank gap between top and middle tiers.

Figure 14 provides a direct comparison between the third-ranked iTransformer and the top-ranked ModernTCN on BTC/USDT at h = 4. Both models track the actual price trajectory closely at step 1, but at step 4 iTransformer exhibits slightly more amplitude attenuation during volatile episodes. This visual difference is consistent with the small but consistent RMSE gap between iTransformer (743.5) and ModernTCN (731.6) on BTC/USDT at h = 4 (Table 10).

Medium-horizon behaviour (h = 24, step 12). Figure 15 presents step-12 overlays for two model-asset pairings at the midpoint of the h = 24 vector. ModernTCN on EUR/USD (Figure 15a) shows that macro-directional structure is preserved 12 hours ahead: the predicted series follows multi-session trends while understandably missing the sharpest intra-session swings.
PatchTST on Dow Jones (Figure 15b) similarly maintains directional integrity at step 12, but with a visibly wider error envelope than at h = 4, confirming the 2-2.5× RMSE amplification (Table 10). Both panels show that the dominant degradation signature is amplitude attenuation rather than phase error or directional reversal.

Figure 14. iTransformer (rank 3) on BTC/USDT at h = 4 (seed 123). (a) At step 1, tracking fidelity is comparable to ModernTCN (Figure 12a). (b) At step 4, slightly more amplitude attenuation is visible relative to ModernTCN (Figure 12b), consistent with the 1.6% RMSE gap. The inverted attention mechanism produces qualitatively similar but measurably weaker temporal representations for short-horizon cryptocurrency forecasting.

Figure 15. Medium-horizon actual-versus-predicted overlays at step 12 of the h = 24 forecast vector (seed 123). (a) ModernTCN on EUR/USD: directional content is preserved 12 hours ahead while high-frequency amplitude is dampened.
(b) PatchTST on Dow Jones: directional integrity is maintained but the forecast envelope is wider than at h = 4, corroborating the 2-2.5× RMSE amplification (Table 10). Both top architectures exhibit amplitude attenuation as the primary degradation mode.

Long-horizon degradation (h = 24, step 24). Figure 16 presents the 24-step-ahead overlay for ModernTCN on BTC/USDT, the most demanding combination in the benchmark, pairing the highest price volatility with the maximum forecast depth. The predicted series retains directional drift but shows progressive amplitude compression beyond step 12, with increasingly imprecise timing of oscillatory reversals. These characteristics are consistent with ModernTCN's RMSE degradation from 731.6 (h = 4) to 1,617.4 (h = 24; Table 10), a 121.1% increase that, while substantial, does not erase directional signal or introduce systematic bias. Long-horizon predictions are trend-indicative rather than instance-specific: architectures differ not in whether degradation occurs but in how gracefully their temporal representations transfer to the maximum horizon.

Figure 16. Long-horizon overlay: ModernTCN on BTC/USDT, h = 24, step 24 (seed 123). The model retains directional integrity and captures low-frequency trend components, but high-amplitude intra-day reversals are under-predicted. This pattern, directional fidelity without amplitude precision, is the signature of MSE-optimised direct multi-step forecasting at the maximum horizon. The RMSE at h = 24 (1,617.4) represents a 121.1% degradation relative to h = 4 (731.6; Table 10).
5 Discussion

This section interprets the empirical findings from Section 4, adjudicating each hypothesis, analysing architectural mechanisms, and discussing economic implications, connections to prior work, and limitations.

5.1 Hypothesis Adjudication

H1: Ranking Non-Uniformity — SUPPORTED. The global leaderboard (Table 7) reveals a clear, consistent hierarchy: ModernTCN (mean rank 1.333, 75% win rate) and PatchTST (mean rank 2.000) lead across all 24 evaluation points, separated by more than 5.5 ranks from the bottom tier (TimesNet, Autoformer, LSTM). The per-asset best-model matrix (Table 8) shows that ModernTCN's wins span all three asset classes and both horizons, with N-HiTS and PatchTST achieving niche first-place finishes exclusively on cryptocurrency assets and short horizons. The Friedman–Iman–Davenport test confirms this non-uniformity at the global level with F(8, 184) = 101.36, p < 10⁻¹⁵ (Table 16), and the same result holds within each asset class. Post-hoc Holm–Wilcoxon tests (Section 4.6) establish that 33 of 36 pairwise differences are statistically significant; the only non-significant pairs are intra-tier neighbours (TimeXer vs. iTransformer, Autoformer vs. TimesNet, DLinear vs. N-HiTS). The ModernTCN–PatchTST gap (0.667 mean rank difference) is not individually significant by Diebold–Mariano on BTC/USDT (p_Holm = 0.453), reflecting near-equivalent top-tier performance rather than statistical equivalence across the full distribution.

H2: Cross-Horizon Ranking Stability — SUPPORTED. Top-tier rankings (ModernTCN, PatchTST) are preserved at both ℎ = 4 and ℎ = 24 across all three representative assets (Tables 10 and 13). EUR/USD rankings are perfectly stable; BTC/USDT and Dow Jones exhibit rank shifts confined to the middle tier (N-HiTS improves by 3 ranks on Dow Jones; iTransformer drops by 2 on BTC/USDT).
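The Friedman omnibus with the Iman–Davenport F correction used for H1 can be reproduced from any blocks-by-models error matrix. A minimal sketch on synthetic data (the study's own 24 × 9 matrix is not reproduced here; `iman_davenport` is an illustrative helper name):

```python
import numpy as np
from scipy import stats

def iman_davenport(scores: np.ndarray):
    """Friedman omnibus with Iman-Davenport F correction.

    scores: (n_blocks, k_models) error matrix; lower is better.
    Returns (F statistic, p-value, (df1, df2)).
    """
    n, k = scores.shape
    chi2, _ = stats.friedmanchisquare(*scores.T)
    f_stat = (n - 1) * chi2 / (n * (k - 1) - chi2)
    df1, df2 = k - 1, (n - 1) * (k - 1)
    return f_stat, stats.f.sf(f_stat, df1, df2), (df1, df2)

# Synthetic example: three models, ten blocks, one rank inversion.
a = np.arange(1.0, 11.0)        # model A errors
b = a + 1.0
b[0] = 0.0                      # model B beats A in block 0 only
c = a + 2.0                     # model C always worst
F, p, dof = iman_davenport(np.column_stack([a, b, c]))
print(dof)          # (2, 18) -- the paper's global test has (8, 184)
print(round(F, 1))  # 91.0
```

With nine models and 24 evaluation points, df1 = 8 and df2 = 184, matching the reported F(8, 184).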
This cross-horizon stability is formally confirmed by Spearman cross-horizon rank correlations: all 12 assets yield ρ ≥ 0.683 with p < 0.05 (Table 17), including ρ = 1.000 for EUR/USD. The Stouffer combined Z_S = 6.17 (p = 3.47 × 10⁻¹⁰) confirms this as a systematic, not asset-specific, property. Error amplification of 2.0–2.5× at the longer horizon reflects increased task difficulty rather than differential model degradation. Figure 16 provides qualitative confirmation: ModernTCN retains directional correlation with BTC/USDT at step 24 despite the RMSE increase.

H3: Variance Dominance — STRONGLY SUPPORTED. The two-factor decomposition (Table 14) is reported in three panels to separate scale artefacts from structural effects.¹ On the raw price scale, architecture explains 99.90% of variance, driven by LSTM's outlier errors (7–33× the category median). After z-score normalisation, architecture accounts for 48.32% (seed: 0.04%; residual: 51.64%); excluding LSTM, it rises to 68.33% (seed: 0.02%). The residual reflects genuine model–slot interaction: no single architecture dominates every slot. Across all panels, seed variance is negligible (≤ 0.04%), validating three-seed replication as sufficient. This conclusion is independently corroborated by ICC analysis: for three representative assets at ℎ = 24, ICC > 0.990 with F > 309 (p < 10⁻¹⁵; Table 18), confirming that > 99% of inter-model variance is attributable to architecture rather than random seed. Per-asset decompositions extend this result to ℎ = 4, where architecture still explains 97.4%–99.7% of variance (Section 4.4), confirming that seed-invariance holds at both forecast depths.

H4: Non-Monotonic Complexity–Performance — SUPPORTED. The complexity–performance scatter (Figure 11) demonstrates a non-monotonic relationship between trainable parameter count and mean RMSE rank.
DLinear (approximately 1,000 parameters) achieves rank 5, while Autoformer (approximately 438,000 parameters) and LSTM (approximately 172,000 parameters) achieve ranks 8–9. ModernTCN (approximately 230,000 parameters) and PatchTST (approximately 103,000 parameters) occupy the top positions with moderate parameter budgets. The Jonckheere–Terpstra test for a monotonic complexity–rank relationship finds no significant trend in any of the six category–horizon combinations (all p > 0.35; Table 19); four of the six z-values are negative, suggesting an inverse tendency. This formally confirms that how parameters are deployed, that is, the specific temporal inductive bias, determines forecast quality, not the raw quantity of learnable weights.

5.2 Architecture-Specific Insights

ModernTCN. ModernTCN's consistent superiority (rank 1 on 18/24 evaluation points) is consistent with complementary design features: large-kernel depthwise convolutions capture multi-range temporal dependencies without quadratic attention cost; multi-stage downsampling enables hierarchical feature extraction suited to the multi-scale dynamics of financial series; and RevIN normalisation mitigates distributional shift between training and test periods. All of this is achieved with approximately 230,000 parameters, a moderate footprint.

PatchTST. PatchTST's consistent second-place ranking supports the view that patch-based tokenisation offers an effective compromise between local pattern recognition and global dependency modelling.

¹ We reserve "strongly supported" for hypotheses where the effect holds at > 99% magnitude across all analysis panels; the remaining hypotheses are labelled "supported".
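The RevIN normalisation credited above for mitigating distributional shift can be illustrated as per-window instance normalisation with an inverse transform applied to the forecast. A minimal NumPy sketch; the actual RevIN (Kim et al., 2022) additionally learns affine parameters, which are omitted here:

```python
import numpy as np

def revin_normalise(window: np.ndarray, eps: float = 1e-5):
    """Normalise one look-back window; return the stats needed to invert."""
    mu = window.mean(axis=0, keepdims=True)
    sigma = window.std(axis=0, keepdims=True) + eps
    return (window - mu) / sigma, (mu, sigma)

def revin_denormalise(forecast: np.ndarray, stats) -> np.ndarray:
    """Map a model output back to the original price scale."""
    mu, sigma = stats
    return forecast * sigma + mu

# Round trip on a synthetic (look_back, features) window of random-walk prices.
window = np.cumsum(np.random.default_rng(0).normal(size=(96, 5)), axis=0)
normed, stats = revin_normalise(window)
assert np.allclose(revin_denormalise(normed, stats), window, atol=1e-4)
```

The model only ever sees the normalised window, so a level shift between training and test periods no longer moves the input distribution.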
Segmenting the input into patches reduces the token count, enabling attention over temporally coherent segments, while channel-independent processing prevents cross-feature leakage, which is appropriate given the heterogeneous scales of OHLCV components. The narrow ModernTCN–PatchTST gap (0.667 mean rank difference) suggests that both architectures capture the relevant temporal structure through distinct mechanisms.

iTransformer and TimeXer. iTransformer's inverted attention paradigm (rank 3) shows that computing attention across the five OHLCV features rather than across time steps is effective for multivariate financial forecasting. TimeXer (rank 4) further separates target and exogenous variables through cross-attention, but the marginal improvement suggests that explicit target–exogenous decomposition adds complexity without proportionate benefit when all input features are closely correlated.

DLinear and N-HiTS. DLinear's fifth-place ranking with approximately 1,000 parameters and no nonlinear activations corroborates the finding that simple linear mappings can be surprisingly effective (Zeng et al., 2023). Its trend–seasonal decomposition captures the dominant low-frequency structure of financial series at minimal cost. N-HiTS (rank 6) benefits from multi-rate pooling across temporal resolutions but processes only the target channel, potentially limiting cross-feature exploitation. Its notably lower cross-horizon degradation on BTC/USDT (85.3% vs. 121% for ModernTCN; Table 10) suggests that hierarchical temporal decomposition transfers well across horizons for trending series. Notably, N-HiTS achieves the lowest RMSE on ETH/USDT (both horizons) and ADA/USDT (ℎ = 4), both lower-capitalisation, higher-volatility cryptocurrency assets, accounting for all three of its first-place finishes (Table 8).
This suggests that its multi-rate pooling is particularly suited to the superimposed multi-scale oscillatory patterns characteristic of altcoin markets, where multiple speculative timescales dominate the price dynamics.

TimesNet, Autoformer, and LSTM. TimesNet (rank 7) and Autoformer (rank 8) both employ frequency-domain inductive biases, FFT-based 2D reshaping and auto-correlation respectively, that presuppose periodic structure largely absent in hourly financial data. Their designs are better suited to domains with strong seasonality such as electricity demand or weather forecasting. LSTM's consistently poor performance (worst-ranked across all conditions; errors 7–33× higher than the best model) confirms that the recurrent architecture is not competitive for multi-step financial forecasting. Compressing all temporal context into a fixed-dimensional hidden state severely limits representation capacity for direct multi-step prediction.

5.3 Asset-Class Dynamics

The three asset classes present distinct forecasting challenges shaped by different market microstructures (Section 3.2.1). Cryptocurrency markets exhibit the highest absolute errors (mean RMSE 314–2,399; Table 9) due to elevated price levels, 24/7 trading, and speculative volatility. Forex markets produce the smallest errors (0.110–3.668), reflecting low price magnitudes and high liquidity, while equity indices occupy an intermediate position (111–1,549). Despite these scale differences, relative rankings are largely preserved: ModernTCN and PatchTST consistently occupy the top two positions in every category (Table 9, Figure 5), indicating general-purpose temporal modelling capabilities rather than class-specific advantages. Mid-tier variation is more pronounced. TimeXer ranks 3rd in cryptocurrency but 4th in forex and indices.
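Because absolute errors differ by orders of magnitude across asset classes, every cross-class statement here rests on within-slot ranks rather than raw RMSE. A minimal sketch of that aggregation with synthetic RMSE values (not the study's):

```python
import numpy as np
from scipy.stats import rankdata

# Synthetic RMSE table: rows = evaluation slots (asset x horizon),
# columns = models. Row magnitudes mimic crypto vs. forex vs. indices;
# ranking within each row neutralises the scale differences.
rmse = np.array([
    [731.6, 745.0, 760.0],   # crypto-like magnitudes
    [0.110, 0.118, 0.125],   # forex-like magnitudes
    [111.0, 118.0, 115.0],   # index-like magnitudes
])
ranks = np.vstack([rankdata(row) for row in rmse])  # 1 = best per slot
mean_rank = ranks.mean(axis=0)
print(mean_rank)  # model 0 wins every slot, so its mean rank is 1.0
```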
N-HiTS (rank 6 overall) performs relatively better on cryptocurrency, where multi-rate pooling may better capture multi-scale volatility. DLinear performs better on forex (rank 5), where simpler, mean-reverting dynamics may be well-approximated by linear projections.

Asset-specific niche advantages. The per-asset best-model matrix (Table 8) reveals a finer-grained picture than category-level rankings. N-HiTS achieves the lowest RMSE on ETH/USDT at both horizons and on ADA/USDT at ℎ = 4, all lower-capitalisation cryptocurrency assets characterised by higher relative volatility and more pronounced multi-scale dynamics. This suggests that N-HiTS's hierarchical multi-rate pooling, which decomposes the input at several temporal resolutions, captures the superimposed short- and medium-term oscillation patterns that dominate altcoin price series. PatchTST wins on BTC/USDT at ℎ = 4 (with a margin of only 0.08% over ModernTCN), on ADA/USDT at ℎ = 24, and on GBP/USD at ℎ = 4, indicating that patch-based self-attention is competitive for short-horizon prediction on assets with diverse temporal structure. Critically, no model other than ModernTCN wins on any forex or equity index evaluation point at ℎ = 24. This long-horizon, cross-category dominance, 16 out of 16 possible wins outside the cryptocurrency domain at ℎ = 24, suggests that the combination of large-kernel depthwise convolutions and multi-stage downsampling provides the most transferable temporal representations as the forecast window extends.

Statistical separability varies by market microstructure. Diebold–Mariano tests reveal that the top-tier gap is market-dependent. On EUR/USD (ℎ = 24), ModernTCN is statistically separable from every other architecture, including PatchTST (p_Holm = 2.77 × 10⁻²⁴; Section 4.6). On BTC/USDT at the same horizon, ModernTCN vs. PatchTST is not significant (p_Holm = 0.453).
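The Diebold–Mariano comparisons cited here test whether two forecast-error sequences differ in expected squared loss. A minimal sketch with a rectangular Newey–West long-run variance for multi-step forecasts (simplified relative to whatever implementation the study used; `diebold_mariano` is an illustrative name):

```python
import numpy as np
from scipy.stats import norm

def diebold_mariano(e1, e2, h: int = 1):
    """DM test on squared-loss differentials d_t = e1_t^2 - e2_t^2.

    Uses h-1 autocovariance lags, appropriate for h-step-ahead forecasts.
    Returns (DM statistic, two-sided asymptotic p-value).
    """
    d = np.asarray(e1) ** 2 - np.asarray(e2) ** 2
    n = d.size
    d_bar = d.mean()
    lrv = np.mean((d - d_bar) ** 2)          # lag-0 autocovariance
    for lag in range(1, h):                  # add 2x lagged autocovariances
        lrv += 2.0 * np.mean((d[lag:] - d_bar) * (d[:-lag] - d_bar))
    dm = d_bar / np.sqrt(lrv / n)
    return dm, 2.0 * norm.sf(abs(dm))

# Synthetic check: model 1 has clearly smaller errors than model 2.
rng = np.random.default_rng(42)
e_good = rng.normal(size=2000)
e_bad = 1.3 * rng.normal(size=2000)
dm, p = diebold_mariano(e_good, e_bad, h=4)
print(dm < 0 and p < 0.01)  # True: model 1 significantly better
```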
This EUR/USD versus BTC/USDT disparity reflects the differing signal-to-noise ratios: EUR/USD's lower intrinsic noise amplifies small but consistent performance differences into statistical separability, whereas BTC/USDT's high volatility masks the same differences. Practitioners operating in low-noise markets can thus have higher confidence in ModernTCN's superiority; in high-noise crypto markets, the top-tier models should be treated as near-equivalent.

5.4 Economic Interpretation

While this study evaluates forecasting accuracy rather than trading profitability, several economically relevant observations emerge.

Cost of model selection error. The RMSE gap between ModernTCN and the worst modern alternative (Autoformer) ranges from 60% to 70% across categories (Table 9); the gap to LSTM is an order of magnitude larger. Where forecast error translates into sizing or timing errors, systematic architecture evaluation yields substantial returns relative to default model selection.

Directional accuracy. The mean DA across all 54 model–category–horizon combinations is 50.08%, with no combination deviating meaningfully from the 50% baseline (Table 15). MSE-optimised architectures do not exhibit directional skill at hourly resolution. Application to directional trading would require explicit directional loss functions or post-processing of point forecasts into probabilistic signals.

Variance decomposition implication. Architecture choice explains the overwhelming majority of forecast variance while seed explains ≤ 0.04%. The practical corollary: effort invested in architecture selection yields far higher returns than effort spent on seed selection or initialisation-based ensembles.

Caveat. These economic interpretations are preliminary. RMSE and MAE measure statistical accuracy, not economic value.
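The directional-accuracy comparison against the 50% baseline amounts to a sign-agreement rate plus a binomial test. A minimal sketch on hypothetical arrays (not the study's forecasts; `directional_accuracy` is an illustrative helper name):

```python
import numpy as np
from scipy.stats import binomtest

def directional_accuracy(actual: np.ndarray, predicted: np.ndarray):
    """Fraction of steps where predicted and actual changes share a sign,
    with an exact binomial p-value against the 50% coin-flip baseline."""
    da = np.diff(actual)
    dp = np.diff(predicted)
    hits = int(np.sum(np.sign(da) == np.sign(dp)))
    n = da.size
    return hits / n, binomtest(hits, n, p=0.5).pvalue

rng = np.random.default_rng(7)
actual = np.cumsum(rng.normal(size=1001))              # random-walk "price"
predicted = actual + rng.normal(scale=5.0, size=1001)  # noisy MSE-style forecast
da, p = directional_accuracy(actual, predicted)
```

A large p-value here means the forecast's sign agreement is statistically indistinguishable from chance, which is exactly the pattern Table 15 reports for all 54 combinations.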
Trading performance additionally requires consideration of transaction costs, market impact, slippage, position sizing, and risk-adjusted returns (Sharpe ratio, maximum drawdown). This study provides the statistical foundation for such economic evaluations but does not assess trading profitability.

5.5 Comparison with Prior Literature

Linear models are competitive. DLinear's fifth-place ranking with approximately 1,000 parameters is consistent with the finding that linear temporal mappings can match or exceed more complex alternatives (Zeng et al., 2023). This benchmark extends that finding from standard time-series benchmarks (ETTh, Weather, Electricity) to financial data across three asset classes under controlled HPO.

Patch-based attention is effective. PatchTST's consistent second-place ranking supports the view that patch tokenisation is an effective strategy for time-series Transformers (Nie et al., 2023), and this effectiveness extends to financial data across asset classes and horizons.

Seed variance is small in financial forecasting. The 99.90% raw model vs. 0.01% seed decomposition (z-normalised: 48.3% vs. 0.04%) extends prior work documenting the importance of separating implementation variance from algorithmic performance (Bouthillier et al., 2021). In financial forecasting with fixed splits and deterministic preprocessing, seed variance is even smaller than in the general settings examined there, justifying the three-seed protocol.

Recurrent models are not competitive. The order-of-magnitude inferiority of LSTM corroborates prior observations on the declining competitiveness of recurrent architectures (Hewamalage et al., 2021; Lim and Zohren, 2021). This study provides the most controlled evidence to date for this conclusion in the financial domain.

Positioning relative to the compound gap. Prior studies (Table 1) address at most two or three of the five gaps.
This study simultaneously addresses all five: controlled HPO (G1), multi-seed evaluation (G2), multi-horizon analysis (G3), formal pairwise statistical correction with Holm–Wilcoxon and Diebold–Mariano tests (G4), and multi-asset-class coverage (G5). Addressing all five methodological gaps places this study at the intersection of rigorous experimental design and financial time-series benchmarking.

5.6 Limitations and Threats to Validity

The following limitations are organised by severity, from most impactful to least:

L1. HPO budget. Five Optuna trials represent a limited search budget. Models with larger search spaces may be disadvantaged. Increasing the budget may shift mid-tier rankings, though top-tier positions appear robust.

L2. Statistical validation. A comprehensive battery of formal statistical tests is reported (Section 4.6 and Tables 16–19), including Friedman–Iman–Davenport omnibus tests, Holm–Wilcoxon pairwise comparisons, Diebold–Mariano tests, Spearman/Stouffer cross-horizon correlations, ICC, and Jonckheere–Terpstra complexity tests. The main remaining limitation is that per-category pairwise Wilcoxon tests are underpowered due to small block size (n = 8); a category-level study with more assets per class would overcome this constraint.

L3. Feature set. The OHLCV-only restriction, while essential for fair comparison, may underestimate models designed to exploit technical indicators, order-book data, or sentiment features. Rankings could differ under richer input configurations.

L4. Temporal scope. All experiments use H1 (hourly) frequency. Generalisability to higher frequencies (tick-level, M1) or lower frequencies (H4, daily) is not established. The relative importance of different inductive biases may change with the characteristic time scale.

L5. Asset universe.
Twelve instruments across three asset classes provide a representative but not exhaustive sample. Commodities, fixed income, and emerging-market instruments are absent.

L6. Horizon granularity. Degradation is characterised at only two points (ℎ = 4 and ℎ = 24). Intermediate horizons (ℎ = 8, 12, 16) would yield smoother degradation curves and more precise characterisation of architecture-specific scaling behaviour.

L7. Effect-size reporting. A critical difference diagram (Figure 10) visualises the Holm–Wilcoxon significance structure, and Diebold–Mariano pairwise tests are reported for representative assets (Section 4.6). However, Cohen's d standardised effect sizes, which would facilitate cross-study comparison, are not computed. While the Holm-adjusted p-values and rank differences provide implicit effect scaling, explicit d values remain a useful complement and are identified as future work.

L8. Out-of-sample recency. The test set comprises the final 15% of historical data, evaluated in batch. A live forward-walk evaluation with rolling retraining would provide stronger evidence for real-time deployment.

6 Conclusion

6.1 Summary of Contributions

This paper presented a controlled comparison of nine deep learning architectures, spanning Transformer, MLP, convolutional, and recurrent families, for multi-horizon financial time-series forecasting across 918 experimental runs. Five contributions were made: (C1) protocol-controlled fair comparison, (C2) multi-seed robustness quantification, (C3) cross-horizon generalisation analysis, (C4) asset-class-specific deployment guidance, and (C5) release of a fully open benchmarking framework.

6.2 Principal Findings

All four hypotheses from Section 1.3 are supported (see Section 5.1 for detailed adjudication):

H1 (Ranking non-uniformity): SUPPORTED.
A clear three-tier hierarchy emerges, with ModernTCN and PatchTST separated from the bottom tier by over 5.5 mean-rank positions.

H2 (Cross-horizon stability): SUPPORTED. Top-tier rankings are preserved across horizons despite 2–2.5× error amplification; rank shifts are confined to mid-tier models.

H3 (Variance dominance): STRONGLY SUPPORTED. Architecture explains ≥ 99.9% of raw variance; seed variance is negligible (≤ 0.04%), confirming that three-seed replication suffices.

H4 (Non-monotonic complexity): SUPPORTED. DLinear (approximately 1,000 parameters) outranks Autoformer (approximately 438,000) and LSTM (approximately 172,000); architectural inductive bias dominates raw capacity.

6.3 Practical Recommendations

Building on the empirical evidence from 648 final training runs and the hypothesis adjudication (Section 5.1):

• Default recommendation: ModernTCN. Rank 1 on 75% of evaluation points with moderate cost. Large-kernel depthwise convolutions and multi-stage downsampling provide effective general-purpose temporal modelling across all asset classes and horizons. On low-noise markets (EUR/USD), its superiority is statistically unambiguous (p_Holm < 10⁻²⁰ vs. all competitors; Section 4.6).

• Transformer alternative: PatchTST. Consistently rank 2, with a narrow gap from ModernTCN (0.667). Patch-based tokenisation with channel independence suits settings where Transformer-family models are preferred.

• Altcoin specialist: N-HiTS. Achieves the lowest RMSE on ETH/USDT and ADA/USDT (Table 8), suggesting a niche advantage on lower-capitalisation cryptocurrency assets due to its multi-rate pooling design.

• Resource-constrained environments: DLinear. Rank 5 with approximately 1,000 parameters and no nonlinear activations, suitable when inference latency or memory is binding.

• Not recommended: LSTM.
Ranks last across all conditions with errors 7–33× higher than the best model.

Directional accuracy. MSE-optimised architectures produce directional forecasts indistinguishable from a coin flip (mean DA = 50.08%; Table 15). Applications requiring directional skill must incorporate explicit directional objectives or post-processing.

These recommendations are conditioned on the experimental scope described in Section 3 (OHLCV features, H1 frequency, 12 assets, 2 horizons) and the limitations acknowledged in Section 5.6.

6.4 Future Work

The following extensions are prioritised by expected impact:

1. Cohen's d effect-size matrices to complement the Holm–Wilcoxon and Diebold–Mariano pairwise analyses already reported (Section 4.6), facilitating standardised cross-study comparison.

2. Increased HPO budget and seed count to strengthen ranking evidence, particularly for mid-tier models where the current 5-trial budget may be limiting.

3. Directional loss training via differentiable surrogate losses, investigating whether MSE-accurate architectures also achieve superior directional accuracy under explicit supervision.

4. Extended horizon coverage (ℎ ∈ {1, 8, 12, 48, 96}) for smoother degradation curves and more precise characterisation of architecture-specific scaling behaviour.

5. Asset-specific model selection, exploring whether the niche advantages observed for N-HiTS on lower-capitalisation cryptocurrencies (Table 8) extend to a broader altcoin universe.

6. Richer feature sets (technical indicators, sentiment, order-book data) to assess ranking sensitivity to input dimensionality.

7. Heterogeneous ensembles of top architectures for potential accuracy gains, particularly combining ModernTCN's convolutional inductive bias with PatchTST's patch-based attention.

8. Alternative frequencies (H4, daily, tick-level) to test cross-frequency generalisation.

9.
Expanded asset universe (commodities, fixed income, emerging markets).

10. Forward-walk evaluation with rolling retraining for real-time deployment evidence.

7 Reproducibility Statement

All artefacts required to reproduce the reported results are publicly available in the accompanying repository.

7.1 Code and Data Availability

The repository contains the complete source code, version-controlled configuration files, and data. The codebase follows a modular organisation separating data preprocessing, model implementations, training, evaluation, and benchmarking into dedicated subpackages. All experimental parameters are declared in YAML configuration files rather than embedded in source code. Released artefacts include preprocessed windowed datasets with split metadata, HPO trial logs and best configurations, all 648 trained model checkpoints, and complete evaluation outputs (per-seed metrics, aggregated statistics, benchmark summaries, statistical test results, and figures).

7.2 Determinism Controls

Randomness is managed through a centralised seed-management module (src/utils/seed.py) invoked at the start of every experimental run. Seeds are applied consistently to the Python standard library, NumPy, PyTorch CPU and CUDA random number generators, and the PYTHONHASHSEED environment variable. The cuDNN backend is configured with deterministic=True and benchmark=False. DataLoader workers derive their seeds deterministically from the primary seed, preserving consistent data-loading order across runs. PyTorch's fully deterministic algorithm mode (use_deterministic_algorithms) was not enabled, so minor floating-point variation remains possible across different GPU architectures. This limitation is acknowledged in Section 5.6 and mitigated by the three-seed replication protocol.
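A determinism setup of the kind described above can be sketched as follows. This is an illustrative sketch, not the repository's actual src/utils/seed.py, with the PyTorch calls guarded so the function also works in torch-free environments:

```python
import os
import random

import numpy as np

def set_global_seed(seed: int) -> None:
    """Seed Python, NumPy, PYTHONHASHSEED and, when available, PyTorch/cuDNN."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)          # no-op on CPU-only builds
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass  # Python + NumPy determinism still applies without torch

# Two runs from the same seed must produce identical draws.
set_global_seed(123)
first = np.random.rand(3)
set_global_seed(123)
second = np.random.rand(3)
assert np.array_equal(first, second)
```

Calling this once at the start of each run, with worker seeds derived from the primary seed, reproduces the control structure the protocol describes.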
7.3 Configuration Versioning

The configuration hierarchy captures all experimental degrees of freedom: HPO sampler settings and trial budgets; final training schedules, seed lists, and early-stopping criteria; per-model hyperparameter search spaces; dataset window sizes, split ratios, and horizon definitions; and asset category assignments. After HPO, the best hyperparameter set for each (model, category, horizon) triple is serialised as a frozen configuration file (Table 6) and held fixed for all subsequent stages.

7.4 Execution Overview

The pipeline proceeds through three sequential stages, hyperparameter optimisation, multi-seed final training, and benchmarking with statistical validation, each accessible via a unified command-line interface. Stages may be executed independently with optional filtering by model, seed, horizon, or configuration path. Training supports automatic checkpoint resumption, restoring model weights, optimiser and scheduler states, and all random-number-generator states from the most recent checkpoint.

7.5 Environment Specification

The software environment is fully specified by the repository's environment.yml (Conda; Python version, deep learning framework, CUDA toolkit) and requirements.txt (pinned library versions). All experiments were run using PyTorch (Paszke et al., 2019) on CUDA-enabled hardware. Results are reproducible within the same software environment, subject to standard floating-point and hardware-level numerical variation.

References

T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2623–2631, 2019.

S. Bai, J. Z. Kolter, and V. Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling.
arXiv preprint arXiv:1803.01271, 2018.

S. Ben Taieb, G. Bontempi, A. F. Atiya, and A. Sorjamaa. A review and comparison of strategies for multi-step ahead time series forecasting based on the NN5 forecasting competition. Expert Systems with Applications, 39(8):7067–7083, 2012.

J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems (NeurIPS), volume 24, pages 2546–2554, 2011.

T. Bollerslev. Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 31(3):307–327, 1986.

X. Bouthillier, P. Delaunay, M. Bronzi, A. Trofimov, B. Nichyporuk, J. Szeto, N. Sepah, E. Raff, K. Mber, H. Voleti, S. E. Kahou, and C. Pal. Accounting for variance in machine learning benchmarks. In Proceedings of Machine Learning and Systems (MLSys), volume 3, pages 747–769, 2021.

G. E. P. Box and G. M. Jenkins. Time series analysis: Forecasting and control. Journal of the American Statistical Association, 65(332):1509–1526, 1970.

C. Challu, K. G. Olivares, B. N. Oreshkin, F. Garza, M. Mergenthaler-Canseco, and A. Dubrawski. N-HiTS: Neural hierarchical interpolation for time series forecasting. In AAAI Conference on Artificial Intelligence, volume 37, pages 6989–6997, 2023.

G. Chevillon. Direct multi-step estimation and forecasting. Journal of Economic Surveys, 21(4):746–785, 2007.

R. Cont. Empirical properties of asset returns: Stylized facts and statistical issues. Quantitative Finance, 1(2):223–236, 2001.

R. F. Engle. Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econometrica, 50(4):987–1007, 1982.

E. F. Fama. Efficient capital markets: A review of theory and empirical work. The Journal of Finance, 25(2):383–417, 1970.

F. A. Gers, J. Schmidhuber, and F. Cummins.
Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10):2451–2471, 2000.

P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger. Deep reinforcement learning that matters. In AAAI Conference on Artificial Intelligence, volume 32, 2018.

H. Hewamalage, C. Bergmeir, and K. Bandara. Recurrent neural networks for time series forecasting: Current status and future directions. International Journal of Forecasting, 37(1):388–427, 2021.

S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

R. J. Hyndman and Y. Khandakar. Automatic time series forecasting: The forecast package for R. Journal of Statistical Software, 27(3):1–22, 2008.

T. Kim, J. Kim, Y. Tae, C. Park, J.-H. Choo, and J. Ko. Reversible instance normalization for accurate time-series forecasting against distribution shift. In International Conference on Learning Representations (ICLR), 2022.

B. Lim and S. Zohren. Time-series forecasting with deep learning: A survey. Philosophical Transactions of the Royal Society A, 379(2194):20200209, 2021.

Y. Liu, T. Hu, H. Zhang, H. Wu, S. Wang, L. Ma, and M. Long. iTransformer: Inverted transformers are effective for time series forecasting. In International Conference on Learning Representations (ICLR), 2024.

D. Luo and X. Wang. ModernTCN: A modern pure convolution structure for general time series analysis. In International Conference on Learning Representations (ICLR), 2024.

S. Makridakis, E. Spiliotis, and V. Assimakopoulos. The M4 competition: Results, findings, conclusion and way forward. International Journal of Forecasting, 34(4):802–808, 2018.

S. Makridakis, E. Spiliotis, and V. Assimakopoulos. The M5 accuracy competition: Results, findings, and conclusions. International Journal of Forecasting, 38(4):1346–1364, 2022.

Y. Nie, N. H. Nguyen, P. Sinthong, and J.
Kalagnanam. A time series is worth 64 words: Long-term forecasting with transformers. In International Conference on Learning Representations (ICLR), 2023.

A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems (NeurIPS), volume 32, pages 8024–8035, 2019.

O. B. Sezer, M. U. Gudelek, and A. M. Ozbayoglu. Financial time series forecasting with deep learning: A systematic literature review: 2005–2019. Applied Soft Computing, 90:106181, 2020.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), volume 30, pages 5998–6008, 2017.

Y. Wang, H. Wu, J. Dong, Y. Liu, Y. Qiu, H. Zhang, J. Wang, and M. Long. TimeXer: Empowering transformers for time series forecasting with exogenous variables. In Advances in Neural Information Processing Systems (NeurIPS), 2024.

H. Wu, J. Xu, J. Wang, and M. Long. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. In Advances in Neural Information Processing Systems (NeurIPS), volume 34, pages 22419–22430, 2021.

H. Wu, T. Hu, Y. Liu, H. Zhou, J. Wang, and M. Long. TimesNet: Temporal 2D-variation modeling for general time series analysis. In International Conference on Learning Representations (ICLR), 2023.

A. Zeng, M. Chen, L. Zhang, and Q. Xu. Are transformers effective for time series forecasting? In AAAI Conference on Artificial Intelligence, volume 37, pages 11121–11128, 2023.
A Additional Empirical Results

A.1 Representative Per-Asset Results

One representative instrument from each asset class—BTC/USDT (cryptocurrency), EUR/USD (forex), and Dow Jones/USA30IDXUSD (equity indices)—is presented below. These are the HPO representative assets and therefore provide the most controlled inter-model comparison. RMSE bar charts for all remaining instruments (ETH/USDT, BNB/USDT, ADA/USDT, USD/JPY, GBP/USD, AUD/USD, S&P 500, NASDAQ 100, and DAX), together with corresponding MAE and rank variants, are available in the Online Supplementary Materials. Results are reported as mean ± standard deviation across seeds 123, 456, and 789.

A.1.1 Cryptocurrency

Figure 17. RMSE comparison for BTC/USDT at ℎ = 4 (left) and ℎ = 24 (right), excluding LSTM for visual clarity. PatchTST and ModernTCN achieve the two lowest errors at ℎ = 4; ModernTCN leads marginally at ℎ = 24. Mean ± std across three seeds.

A.1.2 Forex

Figure 18. RMSE comparison for EUR/USD at ℎ = 4 (left) and ℎ = 24 (right), excluding LSTM. ModernTCN and PatchTST rank highest across both horizons; the remaining modern architectures cluster within a narrow error band. Mean ± std across three seeds.

A.1.3 Equity Indices

Figure 19. RMSE comparison for Dow Jones (USA30IDXUSD) at ℎ = 4 (left) and ℎ = 24 (right), excluding LSTM. ModernTCN ranks first at both horizons; PatchTST follows closely. Mean ± std across three seeds.

A.2 Remaining Per-Asset Results

Figures 20–28 present RMSE bar charts for the nine assets not shown in Section A.1, completing the full 12-asset comparison.
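Every bar in the figures that follow reports the mean ± standard deviation of RMSE across the three seeds. A minimal aggregation sketch (the per-seed values below are hypothetical, not the paper's numbers):

```python
import numpy as np

# Hypothetical per-seed test RMSE for one model-asset-horizon cell.
rmse_by_seed = {123: 0.842, 456: 0.851, 789: 0.847}

vals = np.array(list(rmse_by_seed.values()))
mean = vals.mean()
std = vals.std(ddof=1)  # sample std, matching "mean ± std across three seeds"
print(f"{mean:.3f} ± {std:.3f}")
```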
Across all assets, the three-tier structure (ModernTCN/PatchTST at the top; iTransformer/TimeXer/DLinear/N-HiTS in the middle; TimesNet/Autoformer at the bottom) is maintained, with the notable exceptions of ETH/USDT and ADA/USDT, where N-HiTS achieves the lowest RMSE (Table 8).

A.2.1 Cryptocurrency — Remaining Assets

Figure 20. RMSE comparison for ETH/USDT at ℎ = 4 (left) and ℎ = 24 (right), excluding LSTM. N-HiTS achieves the lowest RMSE at both horizons, one of only two assets where ModernTCN does not rank first. Mean ± std across three seeds.

Figure 21. RMSE comparison for BNB/USDT at ℎ = 4 (left) and ℎ = 24 (right), excluding LSTM. ModernTCN leads at both horizons with a clear margin. Mean ± std across three seeds.

Figure 22. RMSE comparison for ADA/USDT at ℎ = 4 (left) and ℎ = 24 (right), excluding LSTM. N-HiTS leads at ℎ = 4; PatchTST leads at ℎ = 24. This asset exhibits the lowest cross-horizon Spearman ρ (0.683; Table 17), consistent with more variable rankings between horizons. Mean ± std across three seeds.

A.2.2 Forex — Remaining Assets

Figure 23. RMSE comparison for USD/JPY at ℎ = 4 (left) and ℎ = 24 (right), excluding LSTM. ModernTCN ranks first; the modern-architecture cluster is tightly packed. Mean ± std across three seeds.

Figure 24. RMSE comparison for GBP/USD at ℎ = 4 (left) and ℎ = 24 (right), excluding LSTM. PatchTST leads at ℎ = 4; ModernTCN leads at ℎ = 24. Mean ± std across three seeds.

A.2.3 Equity Indices — Remaining Assets

A.3 Cross-Horizon Analysis

Figure 29 reproduces the horizon-degradation line plot from the main text (Figure 8) with all nine models included. The no-LSTM variant is used as the primary reference because LSTM's elevated baseline compresses the vertical scale, obscuring differences among the eight modern architectures.

Figure 25.
RMSE comparison for AUD/USD at ℎ = 4 (left) and ℎ = 24 (right), excluding LSTM. ModernTCN ranks first at both horizons. Mean ± std across three seeds.

Figure 26. RMSE comparison for DAX (DEUIDXEUR) at ℎ = 4 (left) and ℎ = 24 (right), excluding LSTM. ModernTCN leads at both horizons; the top-four cluster (ModernTCN, PatchTST, iTransformer, TimeXer) is tightly grouped. Mean ± std across three seeds.

Figure 27. RMSE comparison for S&P 500 (USA500IDXUSD) at ℎ = 4 (left) and ℎ = 24 (right), excluding LSTM. ModernTCN leads; the cross-horizon Spearman ρ = 0.967 (Table 17) indicates near-perfect rank preservation. Mean ± std across three seeds.

Figure 28. RMSE comparison for NASDAQ 100 (USATECHIDXUSD) at ℎ = 4 (left) and ℎ = 24 (right), excluding LSTM. ModernTCN leads; the cross-horizon Spearman ρ = 0.983 (Table 17) is the highest among all assets. Mean ± std across three seeds.

Figure 29. Cross-horizon RMSE degradation for all nine models. Compare with Figure 8 (no-LSTM variant) for finer inter-model discrimination.

A.4 Horizon Sensitivity Heatmap

Figure 30 displays the percentage RMSE degradation from ℎ = 4 to ℎ = 24 for each model–asset combination (excluding LSTM). Cool cells indicate architectures whose representations transfer well across horizons; warm cells highlight combinations where temporal structure deteriorates rapidly with forecast depth.

A.5 Dual-Plot Variants (All Models Including LSTM)

Figure 31 shows the global RMSE heatmap including all nine models; compare with Figure 3 for finer discrimination among modern architectures. Figure 32 presents the cross-horizon comparison for the full nine-model set. LSTM produces the highest errors across all assets and horizons, with its error floor exceeding even that of the lowest-performing modern architectures.
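The per-asset cross-horizon Spearman ρ values quoted in the captions (Table 17) measure how stable the model ranking is between ℎ = 4 and ℎ = 24. A tie-free sketch of the statistic (the per-model RMSE values below are illustrative, not the paper's; scipy.stats.spearmanr gives the same result with tie handling):

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman rank correlation for tie-free data:
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), d_i = rank difference."""
    rx = np.argsort(np.argsort(x))  # ranks of x (0-based)
    ry = np.argsort(np.argsort(y))  # ranks of y (0-based)
    d = rx - ry
    n = len(x)
    return 1.0 - 6.0 * float(np.sum(d * d)) / (n * (n * n - 1))

# Illustrative per-model RMSE for the eight modern architectures
# at the two horizons; only one adjacent pair swaps rank.
rmse_h4  = [0.31, 0.33, 0.40, 0.42, 0.45, 0.47, 0.55, 0.60]
rmse_h24 = [0.70, 0.74, 0.92, 0.90, 1.01, 1.08, 1.30, 1.45]
print(spearman_rho(rmse_h4, rmse_h24))  # one swap among 8 -> rho = 1 - 12/504
```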
A.6 Appendix: Efficiency Including LSTM Models

While the main-body analysis (Section 4.7) focuses on modern architectures, Figure 33 includes LSTM. Despite a parameter budget (≈ 172,000) that is moderate in absolute terms, LSTM achieves the worst ranking at both horizons, reinforcing that modern architectural components provide a far higher return on parameter investment than recurrent inductive biases for this task.

A.7 Category Hierarchical Rankings

Figures 34a–34c display hierarchical ranking dendrograms for each asset class, grouping architectures by performance similarity. These complement the leaderboard tables by revealing clustering structure: closely branched architectures perform similarly; widely separated branches indicate consistent performance gaps. The no-LSTM variant is shown.

Figure 30. Horizon sensitivity heatmap: percentage RMSE degradation (Δ% = 100 × (RMSE₂₄ − RMSE₄) / RMSE₄) for the eight modern architectures across all twelve assets. LSTM is excluded for visual clarity. Values below 90% (low degradation) appear in cooler colours; values above 150% appear in warmer shades. N-HiTS on Dow Jones achieves the lowest degradation (59.7%); TimesNet on Dow Jones the highest (169.1%).

Figure 31. Global RMSE heatmap for all nine architectures (including LSTM) across 24 evaluation points. LSTM's extreme errors dominate the colour scale, which is why the no-LSTM variant (Figure 3) is used as the primary figure in the main body.

Figure 32. Cross-horizon RMSE comparison for all nine architectures across twelve assets, including LSTM. The recurrent baseline illustrates the generational performance gap relative to modern time-series architectures.
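The degradation metric used in Figure 30 is a one-line computation; a sketch, checked against the lowest reported cell (N-HiTS on Dow Jones, 59.7%), with the absolute RMSE values themselves being hypothetical:

```python
def pct_degradation(rmse_h4: float, rmse_h24: float) -> float:
    """Percentage RMSE degradation from h=4 to h=24:
    delta% = 100 * (RMSE_24 - RMSE_4) / RMSE_4."""
    return 100.0 * (rmse_h24 - rmse_h4) / rmse_h4

# A model whose error grows from 1.000 to 1.597 degrades by 59.7%,
# matching the N-HiTS/Dow-Jones cell in Figure 30.
print(pct_degradation(1.000, 1.597))
```

Note that the 2–2.5× error amplification reported for most models corresponds to degradations of 100–150% on this scale.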
[Figure 33 panels: scatter plots of parameter count (log scale) vs. mean RMSE for all nine models, with an OLS trend line. (a) Horizon ℎ = 4: Spearman = −0.083, p-value = 0.831. (b) Horizon ℎ = 24: Spearman = −0.450, p-value = 0.224.]

Figure 33. Extended complexity–performance relationship including LSTM. The recurrent baseline lies far from the Pareto frontier defined by modern architectures.

A.8 Category Performance Matrices

Figures 35a–35c show category performance matrices displaying normalised RMSE values for each model–asset combination within each class, providing a complementary view to the overall leaderboard. Rows represent models and columns represent assets; cell intensity encodes the normalised error relative to the best model on that asset.

(a) Cryptocurrency. (b) Forex. (c) Equity Indices.

Figure 34. Category hierarchical ranking dendrograms for the eight modern architectures (LSTM excluded). Branch lengths encode performance dissimilarity within each asset class. ModernTCN and PatchTST are consistently co-clustered at the top of each dendrogram, confirming that their joint dominance is stable across all three asset classes.

(a) Cryptocurrency. (b) Forex. (c) Equity Indices.

Figure 35. Category performance matrices for the eight modern architectures (LSTM excluded). Cell intensity encodes normalised RMSE relative to the best model per asset (darker = worse).
ModernTCN and PatchTST consistently occupy the lightest cells; Autoformer and TimesNet occupy the darkest, confirming the three-tier structure observed in the global leaderboard.

B Methodological Details

B.1 Hyperparameter Search Spaces

Table 5 in the main text summarises the HPO search dimensions for all models. The key varied parameters by model family are as follows.

• Transformer models (Autoformer, PatchTST, iTransformer, TimeXer): model dimension, attention heads, encoder depth, feedforward width, dropout, learning rate, batch size, and architecture-specific parameters (patch configuration, moving-average windows).
• MLP models (DLinear, N-HiTS): kernel size and decomposition mode (DLinear); block count, hidden-layer width, depth, and pooling scales (N-HiTS).
• CNN models (TimesNet, ModernTCN): channel and depth parameters, dominant-frequency or patch-size settings, and model-specific design choices.
• RNN model (LSTM): hidden-state size, layer count, projection width, and directionality (unidirectional or bidirectional).

All shared training hyperparameters (learning-rate range, batch-size options) were held constant across all models to ensure no implicit tuning advantage. Complete YAML search-space definitions for each model are available in the Online Supplementary Materials.

B.2 Training Protocol

Final models were trained across three random seeds (123, 456, 789) using the frozen best-hyperparameter configurations identified during the HPO stage; no retuning was performed. Checkpoints were saved at each epoch to support full resumability from the last completed epoch. All randomness sources (random, NumPy, PyTorch CPU and CUDA) were seeded identically per run, with cudnn.deterministic=True and cudnn.benchmark=False enforced throughout.
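The seeding discipline described above can be sketched as a single helper. This is a minimal illustration, not the project's actual utility; the torch branch is guarded so the sketch also runs in environments without PyTorch installed:

```python
import os
import random

import numpy as np

def seed_everything(seed: int) -> None:
    """Seed every randomness source identically per run (123, 456, or 789)."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
        torch.manual_seed(seed)           # CPU RNG
        torch.cuda.manual_seed_all(seed)  # all CUDA devices
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass  # PyTorch absent; stdlib/NumPy seeding still applies

seed_everything(123)
```

Calling the helper at the top of each run makes every stochastic component (weight init, shuffling, dropout) reproducible per seed, up to the hardware caveat noted below.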
Note: torch.use_deterministic_algorithms(True) was not enabled; minor hardware-dependent floating-point variation is possible across GPU models (see Section 7.2). Full training logs are archived in the Online Supplementary Materials.

B.3 Robustness Checks

Seed-variance box plots (Figures 36–37) and RMSE-versus-seed-variance scatter plots (Figures 38–39) confirm that inter-seed variance is negligible relative to inter-model variance across all asset classes and horizons. Results for BTC/USDT, EUR/USD, and Dow Jones are shown; the remaining assets exhibit qualitatively identical patterns and are archived in the Online Supplementary Materials.

(a) BTC/USDT. (b) EUR/USD. (c) Dow Jones.

Figure 36. Seed-variance box plots at ℎ = 4 for the three representative assets (LSTM excluded). Each box spans the interquartile range of RMSE values across seeds 123, 456, and 789. Negligible box widths relative to inter-model differences confirm that seed accounts for < 0.1% of total forecast variance.

B.4 Detailed Results and Figures

While the main paper and this appendix present key empirical findings and summary visualisations, the full set of results from all 918 experimental runs is hosted in the project repository. This comprehensive archive ensures full transparency and supports independent verification of all reported metrics. The repository is available at: https://github.com/NabeelAhmad9/compare_forecasting_models

The following content is available in the repository:

• Benchmark Results: Complete RMSE, MAE, and DA scores for all 108 model–asset–horizon combinations across three seeds.
• Figures: High-resolution visualisations for all instruments, including per-seed actual-vs-predicted plots.
• Intermediate Outputs: Full HPO trial logs, saved models, and frozen best-hyperparameter configurations.
• Training Artefacts: All 648 trained model checkpoints and corresponding per-epoch CSV training logs.
• Statistical Validation: Raw outputs for all Friedman, Nemenyi, and variance-decomposition tests.
• Full Project Code: All scripts for models, training routines, benchmarking, evaluation, and utilities.
• Notebooks: Jupyter notebooks for running experiments, reproducing figures, and exploring intermediate results.

(a) BTC/USDT. (b) EUR/USD. (c) Dow Jones.

Figure 37. Seed-variance box plots at ℎ = 24 for the three representative assets (LSTM excluded). The pattern mirrors ℎ = 4: box widths remain negligible at the longer horizon, confirming that H3 holds at both forecast depths.

(a) BTC/USDT. (b) EUR/USD. (c) Dow Jones.

Figure 38. Mean RMSE vs. seed-variance scatter at ℎ = 4 for the eight modern architectures. Each point represents one model; the horizontal axis shows mean RMSE across seeds, and the vertical axis shows the across-seed variance. High-performing models (low mean RMSE) also exhibit low seed variance, confirming that architectural quality and robustness co-vary positively.

(a) BTC/USDT. (b) EUR/USD. (c) Dow Jones.

Figure 39. Mean RMSE vs. seed-variance scatter at ℎ = 24. The positive co-variation between performance and seed stability is maintained at the longer horizon.

Declaration of Interest

The author declares that they have no known competing financial interests or personal relationships that could have influenced the research, analysis, or conclusions presented in this paper.