Multimodal Forecasting for Commodity Prices Using Spectrogram-Based and Time Series Representations

Forecasting multivariate time series remains challenging due to complex cross-variable dependencies and the presence of heterogeneous external influences. This paper presents Spectrogram-Enhanced Multimodal Fusion (SEMF), which combines spectral and …

Authors: Soyeon Park, Doohee Chung, Charmgil Hong

Soyeon Park 1*, Doohee Chung 1,2, Charmgil Hong 1,2
1 Handong Global University, Pohang, Republic of Korea
2 Impactive AI, Seoul, Republic of Korea
{soyeon.park, profchung, charmgil}@handong.ac.kr

Abstract

Forecasting multivariate time series remains challenging due to complex cross-variable dependencies and the presence of heterogeneous external influences. This paper presents Spectrogram-Enhanced Multimodal Fusion (SEMF), which combines spectral and temporal representations for more accurate and robust forecasting. The target time series is transformed into Morlet wavelet spectrograms, from which a Vision Transformer encoder extracts localized, frequency-aware features. In parallel, exogenous variables, such as financial indicators and macroeconomic signals, are encoded via a Transformer to capture temporal dependencies and multivariate dynamics. A bidirectional cross-attention module integrates these modalities into a unified representation that preserves distinct signal characteristics while modeling cross-modal correlations. Applied to multiple commodity price forecasting tasks, SEMF achieves consistent improvements over seven competitive baselines across multiple forecasting horizons and evaluation metrics. These results demonstrate the effectiveness of multimodal fusion and spectrogram-based encoding in capturing multi-scale patterns within complex financial time series.

Introduction

Time series prediction plays a fundamental role in organizational decision-making across business domains, including finance, energy, and manufacturing. In particular, accurate forecasts of commodity prices, such as those of gold, crude oil, nickel, and aluminum, directly influence strategic planning, risk management, procurement, and hedging decisions (Sezer, Gudelek, and Ozbayoglu 2020).
These forecasts inform not only short-term trading actions but also longer-term operational and investment strategies that require consistency across multiple time horizons. However, decision-makers often face substantial uncertainty because commodity markets reflect complex interactions among macroeconomic indicators, policy shifts, geopolitical events, and supply chain disruptions. Consequently, commodity price forecasting represents a decision-critical task in which forecast errors lead to inventory misallocation, hedging inefficiency, and delayed operational responses.

* Part of this work was carried out during Soyeon Park's internship at Impactive AI.
Copyright © 2026, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Commodity price series exhibit nonlinear and non-stationary behavior that arises from diverse market mechanisms and external drivers. This heterogeneity manifests through shifts in volatility regimes, variations in frequency composition, and differing sensitivities to exogenous factors, which complicate the alignment between forecasts and business decisions. Moreover, organizations typically rely on forecasts across multiple planning horizons, which places additional demands on model robustness and consistency. These characteristics make cross-commodity generalization and multi-horizon reliability central challenges in business-oriented forecasting systems.

Traditional deep learning models for time series forecasting, such as Long Short-Term Memory (LSTM) networks (Hochreiter and Schmidhuber 1997), model sequential dependencies through recurrent architectures and have been widely applied to financial data. While these approaches capture short-term temporal patterns effectively, they exhibit notable performance degradation in long-horizon settings with multiple temporal scales and frequency-dependent patterns (Sezer, Gudelek, and Ozbayoglu 2020).
Commodity price data exhibit abrupt fluctuations alongside long- and short-term dynamics that no single representation can capture. This limitation is particularly problematic in business contexts where forecasts must remain reliable across operational and strategic horizons.

Recent studies have explored the transformation of time series data into image-based representations to exploit advances in computer vision models (Semenoglou, Spiliotis, and Assimakopoulos 2023). While line plot images enable the extraction of structural patterns, they provide limited access to frequency-domain characteristics that are essential for analyzing non-stationary volatility. Alternative approaches reformulate forecasting as an image reconstruction task, demonstrating that pre-trained visual autoencoders can act as generic forecasters (Chen et al. 2024). Although these methods show promise, they often lack sufficient capacity to capture multiscale temporal dynamics and contextual dependencies, such as macroeconomic signals or market-wide risk indicators that influence business decisions. As a result, vision-based representations alone remain insufficient for complex financial forecasting tasks.

Morlet wavelet spectrograms (Torrence and Compo 1998) offer a principled time-frequency representation that preserves localized spectral information across multiple scales. Compared to simple visual representations such as line plot images or generic image-based encodings discussed above, Morlet spectrograms provide a substantially richer description of non-stationary signals by retaining scale-dependent energy distributions and phase information. The Morlet wavelet, formed by modulating a Gaussian window with a complex sinusoid, yields complex-valued coefficients that enable precise characterization of transient oscillatory behavior and abrupt spectral shifts.
Prior studies have demonstrated the effectiveness of Morlet spectrograms as inputs for forecasting models (Zeng et al. 2023). However, existing approaches primarily focus on univariate signals processed by a single Vision Transformer (ViT) (Dosovitskiy 2020), which limits their ability to model multivariate dependencies and incorporate external contextual information.

To address these limitations, we propose Spectrogram-Enhanced Multimodal Fusion (SEMF), a dual-path framework that integrates spectral and temporal representations. SEMF employs a ViT that encodes Morlet wavelet spectrograms in order to capture scale-specific market dynamics, while a Transformer-based encoder processes reversible instance normalization (RevIN)-transformed exogenous time series that represent macroeconomic and financial context (Kim et al. 2021). A bidirectional cross-attention module aligns the two modalities within a shared representation space, which facilitates interaction between time-frequency structures and contextual signals. This design supports forecasts that remain stable across multiple horizons, where spectral market dynamics and external contextual signals must be considered jointly. By jointly modeling spectral patterns and multivariate dependencies, SEMF addresses representational gaps that limit prior approaches.

We evaluate SEMF on commodity price forecasting tasks that involve assets with diverse market characteristics and external influences. The experimental design includes macro-financial variables that reflect signals considered in managerial decision-making. Experimental results show that SEMF achieves consistent performance improvements over conventional time series models and state-of-the-art image-based approaches across multiple forecasting horizons. The findings reveal the robustness and practical relevance of SEMF in financial environments with high volatility and complex multivariate dynamics.
Such robustness across forecasting horizons supports stable risk management and coherent long-term hedging. It reduces horizon-induced decision variability, limits unexpected losses, and improves cost efficiency under uncertainty.

Our contributions are summarized as follows:
• We propose Spectrogram-Enhanced Multimodal Fusion (SEMF), a forecasting architecture that integrates spectral and temporal representations to support robust decision-oriented forecasting.
• We employ Morlet wavelet spectrograms that capture nonlinear and non-stationary dynamics relevant to volatile commodity markets.
• We adopt a bidirectional cross-attention fusion mechanism that aligns spectral patterns with exogenous contextual signals within a unified representation.
• We implement a multi-horizon, multi-task learning framework for business and financial decision-making that supports coordinated short-term and long-term decisions under uncertainty.

Related Work

Time series forecasting has progressed from classical statistical models to modern learning-based approaches. Statistical methods such as ARIMA and Prophet (Taylor and Letham 2018) perform reliably on series with strong seasonal or trend components; however, they generalize poorly in financial settings where nonlinear dynamics and heterogeneous exogenous factors interact. These limitations restrict their ability to represent the complex and non-stationary nature of financial time series.

Deep learning models have, therefore, attracted significant attention. Recurrent architectures, such as LSTM and GRU, capture sequential dependencies, while Transformer-based models leverage self-attention to capture long-range temporal correlations (Nie 2022). Despite their expressive capacity, these approaches often exhibit instability when applied to financial data that involve high-frequency noise, regime shifts, and complex cross-variable interactions.
This observation suggests that a single temporal representation remains insufficient for high-complexity forecasting tasks.

Recent studies have explored image-based formulations of time series forecasting by converting sequential data into visual representations. ViT architectures have demonstrated improved abstraction of temporal structures compared to convolutional models (Li, Li, and Yan 2023). Existing image-based encodings primarily focus on shape and structural patterns, whereas explicit time-frequency resolution remains insufficient, which constrains the modeling of non-stationary volatility and multi-scale dynamics in financial time series. To address this limitation, recent work adopts time-frequency representations based on Morlet wavelet spectrograms. Zeng et al. (2023) encode spectrogram-based features with numerical value intensities through convolutional networks and ViT, although the approach remains confined to univariate series without external contextual variables.

Motivated by these observations, our work proposes a dual-path multimodal framework that integrates time-frequency representations with raw multivariate time series. The target commodity price series is encoded through a ViT operating on Morlet wavelet spectrograms, while exogenous variables are processed by a Transformer-based encoder to model multivariate temporal dynamics. A cross-attention mechanism aligns these feature representations within a unified feature space, enabling joint spectral and temporal modeling for financial forecasting.

Methodology

Problem Statement

Let $X = \{x_1, x_2, \ldots, x_T\}$ denote a multivariate time series, where each observation $x_t \in \mathbb{R}^{D}$ represents $D$ variables observed at time step $t$.
The target series is defined as $y_t = x_t^{(1)}$, while the remaining variables form the exogenous time series $Z = \{z_1, \ldots, z_T\}$ with $z_t \in \mathbb{R}^{D-1}$. Given historical observations $(y_{1:T}, Z_{1:T})$, the objective is to predict the target series at multiple future horizons, $Y = \{y_{T+1}, y_{T+3}, y_{T+7}, y_{T+14}, y_{T+21}, y_{T+35}\}$. The model learns a mapping $f_\theta : (y_{1:T}, Z_{1:T}) \to \hat{Y}$, where $\hat{Y}$ denotes the predicted values and $\theta$ represents the model parameters. The goal of training is to minimize the Mean Squared Error (MSE) over all forecasting horizons, which serves as the primary optimization criterion throughout the experimental evaluation.

Figure 1: Overall framework of Spectrogram-Enhanced Multimodal Fusion (SEMF).

Overall Framework

The proposed Spectrogram-Enhanced Multimodal Fusion (SEMF) framework is designed to model localized frequency variations and long-range temporal dependencies in financial time series. As illustrated in Figure 1, SEMF consists of three main components: a Spectrogram Feature Encoder, an Exogenous Feature Encoder, and a Cross-Attention Fusion Module. The framework transforms the target time series into a Morlet wavelet spectrogram, which is encoded by a ViT to capture time-frequency characteristics. In parallel, the Exogenous Feature Encoder processes multivariate exogenous variables as sequential inputs to model temporal dynamics.
Finally, the Cross-Attention Fusion Module aligns these representations within a unified latent representation space, which enables joint spectral and temporal modeling for multi-horizon forecasting.

Time Series Transformation

This subsection describes the transformation procedures applied to the target and exogenous time series before feature encoding. The target series is extracted from a fixed-length historical window and transformed into a time-frequency representation using the Morlet wavelet. This transformation exposes localized fluctuations, multi-scale temporal patterns, and non-stationary dynamics that are difficult to discern in the raw time domain. The resulting logarithmic amplitude spectrogram is standardized to zero mean and unit variance before being input to the Spectrogram Feature Encoder. This transformation enables effective modeling of both short- and long-term temporal dependencies.

The Morlet wavelet adopted in this study is defined as

$$\psi_0(\eta) = \pi^{-1/4} e^{i\omega_0 \eta} e^{-\eta^2/2},$$

where $\omega_0$ denotes the non-dimensional central frequency (Torrence and Compo 1998). The Morlet wavelet provides favorable theoretical and practical properties for analyzing non-stationary signals with oscillatory behavior. In contrast to fixed-window time-frequency transforms such as the Short-Time Fourier Transform (STFT), wavelet-based representations offer adaptive resolution across scales, which enables fine temporal localization at high frequencies and improved frequency resolution at low frequencies. This property aligns well with commodity price series, which exhibit abrupt short-term fluctuations together with longer-term cyclical dynamics. Among commonly used wavelets, the Morlet wavelet achieves strong joint localization in the time and frequency domains through its Gaussian envelope and complex exponential form.
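As a rough illustration of the transformation described above, the following NumPy sketch evaluates the Morlet definition directly and builds a standardized log-amplitude spectrogram by explicit summation. It is not the authors' implementation: the value `omega0 = 6.0`, the linear scale grid `1..128`, the `1/sqrt(s)` normalization, and the function names are all assumptions made for this example, and the input is a synthetic random walk rather than real price data.

```python
import numpy as np

def morlet(eta, omega0=6.0):
    # psi_0(eta) = pi^(-1/4) * exp(i * omega0 * eta) * exp(-eta^2 / 2)
    # omega0 = 6.0 is an assumed value; the text does not state it.
    return np.pi ** -0.25 * np.exp(1j * omega0 * eta) * np.exp(-0.5 * eta ** 2)

def log_amplitude_spectrogram(y, scales, omega0=6.0, eps=1e-8):
    """Continuous wavelet transform by direct summation, then log-amplitude
    and global standardization to zero mean and unit variance."""
    y = np.asarray(y, dtype=float)
    t = np.arange(len(y))
    lag = t[None, :] - t[:, None]          # lag[n, m] = m - n
    rows = []
    for s in scales:
        # W_s[n] = sum_m y[m] * conj(psi((m - n) / s)) / sqrt(s)
        kernel = np.conj(morlet(lag / s, omega0)) / np.sqrt(s)
        rows.append(kernel @ y)
    coeffs = np.stack(rows)                # complex, shape (n_scales, len(y))
    spec = np.log(np.abs(coeffs) + eps)    # logarithmic amplitude
    return (spec - spec.mean()) / (spec.std() + eps)

# Synthetic 120-step random walk standing in for a price window (illustrative only).
rng = np.random.default_rng(0)
prices = 100.0 + np.cumsum(rng.normal(size=120))
spec = log_amplitude_spectrogram(prices, scales=np.arange(1, 129))
print(spec.shape)  # (128, 120)
```

With a 120-step window and 128 scales, the output has the 128 × 120 (scale × sequence length) layout used in the experiments. A production pipeline would more likely rely on an FFT-based transform from an established wavelet library than on this quadratic-cost summation.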
Compared to real-valued wavelets such as Haar or Daubechies, the complex-valued Morlet wavelet preserves both amplitude and phase information, which facilitates the characterization of transient oscillatory behavior and abrupt spectral shifts. These properties make the Morlet wavelet particularly suitable for representing localized frequency variations and regime transitions in non-stationary financial time series.

Exogenous variables are retained in their raw sequential form in order to preserve temporal structure and variable-specific distributions. This design supports the Exogenous Feature Encoder in modeling cross-variable interactions, delayed effects, and short-term temporal patterns that complement the spectrogram-based representation. For multi-horizon forecasting, target values corresponding to 1, 3, 7, 14, 21, and 35 days ahead are standardized individually. As a result, the SEMF framework operates on input–output pairs that consist of normalized spectrograms, raw exogenous sequences, and standardized target values. This unified preprocessing scheme enables the model to learn time-frequency representations and temporal dependencies within a consistent forecasting setting.

Spectrogram Feature Encoder

The Spectrogram Feature Encoder maps the spectrogram representation into a latent feature representation. It employs a ViT-based architecture (Dosovitskiy 2020), in which the spectrogram is divided into patch tokens and encoded through multi-head self-attention. This design allows the model to capture both local and global temporal-spectral interactions across the entire spectrogram.

Within the Transformer encoder, attention heads attend to complementary aspects of the spectrogram, allowing diverse structural patterns to be represented in the patch embeddings. A dedicated CLS token aggregates global information, and the encoder produces a compact latent vector that summarizes the spectrogram content (Vaswani et al.
2017). This representation serves as a frequency-aware summary that supports downstream multi-horizon forecasting. The resulting latent vector is passed to the fusion module, where it is aligned with temporal representations derived from exogenous variables. The attention-based structure promotes stable learning and facilitates effective modeling of interactions across different frequency patterns.

Exogenous Feature Encoder

The Exogenous Feature Encoder processes multivariate exogenous variables in their raw sequential representations and transforms them into temporal feature embeddings that capture complementary temporal dynamics. These inputs are aligned with the same historical window as the target time series and reflect diverse external conditions. Each sequence is independently normalized using RevIN, which stabilizes training and mitigates distributional shifts while preserving variable-specific scale and behavior (Kim et al. 2021). This normalization preserves the temporal structure of each signal without introducing irreversible transformations. As a result, the encoder retains informative dynamics that are essential for accurate multi-horizon forecasting.

Table 1: Input variables for commodity price forecasting.

Type        Variable
Target      Closing price of target commodity futures
Exogenous   US 10-year Treasury yield
Exogenous   US 2-year Treasury yield
Exogenous   US 3-month Treasury bill yield
Exogenous   US Dollar Index (DXY)
Exogenous   USD to CNY exchange rate
Exogenous   USD to JPY exchange rate
Exogenous   USD to KRW exchange rate
Exogenous   S&P 500 index
Exogenous   S&P 500 VIX index (market volatility)
Exogenous   LME Commodity Index

The encoder employs a Transformer-based architecture to model the temporal dependencies within the exogenous signals (Zhou et al. 2021; Wu et al. 2021). Through self-attention, it captures delayed effects, cross-variable interactions, and long-range temporal influences that may affect future movements of the target series.
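The per-sequence RevIN step mentioned above can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's code: it omits the learnable affine parameters of the original RevIN formulation, and the class name, `eps` value, and synthetic input are choices made for this example.

```python
import numpy as np

class RevIN:
    """Simplified reversible instance normalization (no learnable affine terms).

    Statistics are computed per window and per variable, and stored so the
    transformation can be inverted exactly.
    """

    def __init__(self, eps=1e-5):
        self.eps = eps
        self.mean = None
        self.std = None

    def normalize(self, z):
        # z has shape (T, D): T time steps, D exogenous variables.
        self.mean = z.mean(axis=0, keepdims=True)
        self.std = np.sqrt(z.var(axis=0, keepdims=True) + self.eps)
        return (z - self.mean) / self.std

    def denormalize(self, z_norm):
        # Inverse of normalize(), using the stored per-variable statistics.
        return z_norm * self.std + self.mean

# Three synthetic variables on very different scales (illustrative only).
rng = np.random.default_rng(1)
z = rng.normal(loc=[1.0, 100.0, -5.0], scale=[0.1, 10.0, 2.0], size=(120, 3))
revin = RevIN()
z_norm = revin.normalize(z)
z_back = revin.denormalize(z_norm)
```

Because the statistics are kept alongside the normalized window, variables such as yields, exchange rates, and index levels can be brought to a common scale for the encoder without losing the information needed to map values back.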
This design enables the model to learn patterns that extend beyond local time steps and depend on interactions across variables. The resulting embedding provides a context-rich summary of both short-term dynamics and global relationships, which is subsequently integrated with the spectral representation in the fusion module.

Cross-Attention Fusion Module

The Cross-Attention Fusion Module combines spectrogram-based and exogenous representations into a unified feature space for forecasting. Although each encoder captures modality-specific characteristics, their independent outputs cannot represent the cross-modal dependencies required to model complex financial signals. To address this limitation, the fusion module learns interactions between frequency-domain and time-domain features that are not accessible to either feature representation alone (Tsai et al. 2019).

The module applies a bidirectional cross-attention mechanism in which each modality alternately serves as the query while the other provides key–value pairs (Vaswani et al. 2017). In one direction, a representative vector derived from the spectrogram sequence queries the exogenous feature sequence to relate frequency variations to external contextual signals (Tsai et al. 2019; Zhang and Yan 2023). In the reverse direction, a summary vector from the exogenous sequence queries the spectrogram features to associate temporal dynamics with time-frequency patterns. This bidirectional formulation enables symmetric information exchange and aligns complementary cues across modalities.

The outputs of the cross-attention layers are combined through residual connections and layer normalization to form a unified feature representation. This design preserves modality-specific information, promotes training stability, and supports effective learning of cross-modal dependencies.
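The bidirectional mechanism described above can be sketched in NumPy as two single-head attention calls with swapped roles. Several details here are assumptions rather than facts from the text: the spectrogram summary is taken to be the first (CLS) token, the exogenous summary is a mean over time steps, the two directions are concatenated after their residual and normalization steps, and the token counts and weight shapes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def cross_attention(query, context, w_q, w_k, w_v):
    """Single-head scaled dot-product attention: query (1, d) over context (n, d)."""
    q, k, v = query @ w_q, context @ w_k, context @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v, weights

def bidirectional_fusion(spec_tokens, exo_tokens, params):
    spec_summary = spec_tokens[:1]                        # assumed: CLS token
    exo_summary = exo_tokens.mean(axis=0, keepdims=True)  # assumed: mean pooling
    # Direction 1: the spectrogram summary queries the exogenous sequence.
    s2e, _ = cross_attention(spec_summary, exo_tokens, *params["s2e"])
    # Direction 2: the exogenous summary queries the spectrogram tokens.
    e2s, _ = cross_attention(exo_summary, spec_tokens, *params["e2s"])
    # Residual connection and layer norm per direction, then concatenation
    # (the concatenation is an assumption made for this sketch).
    return np.concatenate(
        [layer_norm(spec_summary + s2e), layer_norm(exo_summary + e2s)], axis=-1
    )

rng = np.random.default_rng(0)
d = 8
spec_tokens = rng.normal(size=(241, d))  # hypothetical: CLS + 240 patch tokens
exo_tokens = rng.normal(size=(120, d))   # hypothetical: one token per time step
params = {
    name: [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3)]
    for name in ("s2e", "e2s")
}
fused = bidirectional_fusion(spec_tokens, exo_tokens, params)
print(fused.shape)  # (1, 16)
```

A single-direction variant would keep only one of the two `cross_attention` calls, which is the kind of contrast the ablation on fusion strategies examines.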
The resulting representation captures interdependencies across time and frequency and provides a compact summary that integrates complementary signals from both input pathways, supporting accurate multi-horizon forecasting.

Table 2: Performance comparison across multiple forecasting horizons. Results are averaged over all horizons. Red and blue denote the best and second-best results, respectively.

Category            Model          Commodity  RMSE ↓    RMAE ↓   MAPE ↓    R² ↑
Time series-based   LSTM           Coal       18.86     0.1230   11.4718   -0.1821
Time series-based   LSTM           Gold       538.60    0.2220   0.2139    -1.5574
Time series-based   LSTM           Steel      570.24    0.1491   15.6967   -2.6951
Time series-based   iTransformer   Coal       104.09    0.5663   55.7048   -37.7016
Time series-based   iTransformer   Gold       112.52    0.0408   0.0407    0.9660
Time series-based   iTransformer   Steel      183.48    0.0419   4.1703    0.5649
Time series-based   TimesNet       Coal       21.09     0.1052   10.4761   -0.4346
Time series-based   TimesNet       Gold       104.23    0.0378   0.0369    0.9600
Time series-based   TimesNet       Steel      199.92    0.0457   4.6178    0.5463
Time series-based   PatchTST       Coal       12.02     0.0959   7.4217    0.5032
Time series-based   PatchTST       Gold       126.77    0.0568   0.0433    0.9545
Time series-based   PatchTST       Steel      177.35    0.0527   4.1102    0.6494
Image-based         VisionTS       Coal       70.77     0.5535   55.6418   -15.6105
Image-based         VisionTS       Gold       823.13    0.3279   0.3290    -187.9199
Image-based         VisionTS       Steel      1522.21   0.4223   43.9314   -23.9076
Image-based         ViT-num-spec   Coal       45.10     0.3307   31.6826   -5.5089
Image-based         ViT-num-spec   Gold       830.62    0.3525   0.3431    -8.4687
Image-based         ViT-num-spec   Steel      314.52    0.0835   8.4544    -0.0619
Multimodal          SEMF (Ours)    Coal       11.85     0.0758   7.4484    0.5500
Multimodal          SEMF (Ours)    Gold       107.05    0.0361   0.0358    0.8425
Multimodal          SEMF (Ours)    Steel      173.29    0.0409   4.1100    0.6734

Multi-Horizon Forecasting

The final fused representation generated by the Cross-Attention Fusion Module is passed through a shared two-layer MLP prediction head, consisting of LayerNorm, a linear projection, GELU activation, dropout, and a final linear layer, to produce six predictions corresponding to multiple horizons (1, 3, 7, 14, 21, and 35 days). Instead of employing a single-step decoder or an autoregressive forecasting strategy, SEMF predicts all horizons simultaneously using a shared output layer. This design enables joint optimization across forecasting horizons and avoids error accumulation associated with recursive prediction.
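The prediction head described above can be sketched as a plain NumPy forward pass followed by the horizon-averaged MSE. The layer order follows the text (LayerNorm, linear, GELU, dropout, linear); the hidden width, weight initialization, tanh-based GELU approximation, and function names are assumptions for illustration, and no training loop is shown.

```python
import numpy as np

HORIZONS = (1, 3, 7, 14, 21, 35)  # days ahead, as listed in the text

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # Widely used tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def prediction_head(fused, w1, b1, w2, b2, drop_p=0.0, rng=None):
    """LayerNorm -> linear -> GELU -> dropout -> linear, one output per horizon."""
    h = gelu(layer_norm(fused) @ w1 + b1)
    if rng is not None and drop_p > 0.0:
        # Inverted dropout, applied only when a generator is supplied (training).
        keep = rng.random(h.shape) >= drop_p
        h = h * keep / (1.0 - drop_p)
    return h @ w2 + b2                       # shape (batch, len(HORIZONS))

def multi_horizon_mse(pred, target):
    # Mean squared error averaged over both the batch and all horizons.
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
fused = rng.normal(size=(4, 16))             # hypothetical batch of fused vectors
w1, b1 = rng.normal(size=(16, 32)) * 0.1, np.zeros(32)
w2, b2 = rng.normal(size=(32, len(HORIZONS))) * 0.1, np.zeros(len(HORIZONS))
pred = prediction_head(fused, w1, b1, w2, b2)
print(pred.shape)  # (4, 6)
```

Emitting all six outputs from one forward pass means no prediction is fed back as an input, which is why this direct strategy does not accumulate errors the way a recursive one-step forecaster can.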
The model is trained by minimizing the mean squared error, averaged across all forecasting horizons. By integrating frequency-sensitive representations with temporal interaction features, SEMF provides stable performance across both short- and long-term prediction horizons.

Experiments

We evaluate the effectiveness of the proposed SEMF framework through a series of empirical experiments. First, we describe the experimental settings used for model training and testing. We then compare SEMF against a set of strong baselines across multiple forecasting horizons and evaluation metrics. Finally, we conduct ablation studies to examine the contributions of each architectural component and to validate the design choices behind the framework.

Experimental Setting

Dataset. We evaluate our model on a multivariate daily time series dataset constructed from commodity prices and macro-financial data from April 2013 to January 2026. The dataset contains daily observations for multiple commodities and associated macroeconomic indicators, with all price-based variables aligned using daily closing prices to ensure consistency across assets. Table 1 summarizes the input variables used for commodity price forecasting. Missing values arising from data collection gaps and market-specific reporting differences are handled through an imputation procedure prior to model training. This preprocessing step ensures complete and temporally aligned input sequences for all commodities considered in the study.

For each commodity, a fixed historical input window is constructed, and future prices at multiple horizons are predicted. The dataset for each commodity is chronologically divided into training, validation, and test sets using a ratio of 0.65, 0.15, and 0.20, respectively. This split results in a total of 3,185 samples per commodity, including 2,070 training samples, 478 validation samples, and 637 test samples.
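The windowing and chronological split described above can be sketched in plain Python. The rounding rule (integer cut points at 65% and 80% of the sample count) is an assumption that happens to reproduce the reported 2,070 / 478 / 637 counts; the series length of 3,339 is hypothetical and chosen only so that a 120-step window and a 35-day maximum horizon yield 3,185 samples.

```python
HORIZONS = (1, 3, 7, 14, 21, 35)

def make_samples(n_obs, window=120, horizons=HORIZONS):
    """Return (input_slice, target_indices) for every valid window position."""
    samples = []
    last_end = n_obs - max(horizons)          # leave room for the longest horizon
    for end in range(window, last_end + 1):   # inputs cover indices [end - window, end)
        targets = [end - 1 + h for h in horizons]  # index of y_{T+h}, with T = end - 1
        samples.append((slice(end - window, end), targets))
    return samples

def chronological_split(n_samples, train_pct=65, val_pct=15):
    """Split sizes from cumulative integer cut points (assumed rounding rule)."""
    train_end = n_samples * train_pct // 100
    val_end = n_samples * (train_pct + val_pct) // 100
    return train_end, val_end - train_end, n_samples - val_end

samples = make_samples(n_obs=3339)            # hypothetical series length
print(len(samples))                           # 3185
print(chronological_split(len(samples)))      # (2070, 478, 637)
```

Keeping the split chronological means every validation and test window ends later than every training window, although windows near a boundary still overlap in their input history unless a gap is inserted.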
Importantly, all commodities evaluated in this study share the same temporal coverage, preprocessing pipeline, and data split configuration. This unified setup enables a fair and consistent comparison of forecasting performance across commodities with diverse market characteristics.

Comparison Methods. For SEMF, the input sequence length is set to 120, which is selected based on preliminary validation experiments to balance the representation of short- and long-term temporal patterns. The Morlet wavelet transform uses 128 scales, resulting in a spectrogram of size 128 × 120 (scale × sequence length). The Spectrogram Feature Encoder adopts a ViT with a patch size of 8. Detailed analyses of these selected hyperparameters and their effects on performance are presented in the Analysis section.

The study compares the proposed SEMF model with seven representative baselines. The baselines include time series models (LSTM (Hochreiter and Schmidhuber 1997), iTransformer (Liu et al. 2023), TimesNet (Wu et al. 2022), and PatchTST (Nie 2022)) and two image-based forecasting models (VisionTS (Chen et al. 2024) and ViT-num-spec (Zeng et al. 2023)). These baselines represent diverse architectures and enable a systematic comparison of SEMF under different modeling assumptions.

Evaluation Metrics. We evaluate forecasting performance using four standard metrics: Root Mean Squared Error (RMSE), Relative Mean Absolute Error (RMAE), Mean Absolute Percentage Error (MAPE), and the coefficient of determination ($R^2$). RMSE captures absolute errors with higher sensitivity to large deviations, while RMAE and MAPE measure relative errors using scale-normalized and percentage-based schemes, respectively. $R^2$ quantifies the proportion of variance in the target series explained by the forecasts. Metrics are computed independently for each forecasting horizon (1, 3, 7, 14, 21, and 35 days) and then averaged across horizons.
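The per-horizon evaluation described above can be sketched with the standard library alone. RMSE, MAPE, and $R^2$ follow their usual definitions; the RMAE formula used here (total absolute error divided by total absolute target value) is an assumption, since the text does not spell it out, and the toy numbers are invented for illustration.

```python
import math

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def rmae(y_true, y_pred):
    # Assumed definition: sum of absolute errors over sum of absolute targets.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / sum(abs(t) for t in y_true)

def mape(y_true, y_pred):
    # Returned as a fraction; multiply by 100 to express it in percent.
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def horizon_average(metric, truth_by_h, pred_by_h):
    """Compute a metric separately per horizon, then average across horizons."""
    scores = [metric(truth_by_h[h], pred_by_h[h]) for h in truth_by_h]
    return sum(scores) / len(scores)

# Toy values for two horizons (illustrative only, not data from the study).
truth = {1: [100.0, 200.0, 300.0], 35: [100.0, 200.0, 300.0]}
pred = {1: [100.0, 200.0, 300.0], 35: [110.0, 190.0, 330.0]}
avg_rmse = horizon_average(rmse, truth, pred)
avg_r2 = horizon_average(r2, truth, pred)
```

Because $R^2$ compares the squared error with the variance of the targets at each horizon, it becomes negative whenever a model does worse than predicting that horizon's mean, which is how strongly negative entries can arise in a results table.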
Lower values indicate better performance for error-based metrics, whereas higher values are preferred for $R^2$.

Results

We now report the forecasting performance of the proposed SEMF framework, with an emphasis on robustness and stability across commodities and forecasting horizons, which are essential for decision-oriented forecasting.

Table 3: Horizon-wise forecasting performance for Gold price forecasts. Best results are shown in red.

Model         H     RMSE     RMAE     MAPE
TimesNet      1     73.88    0.0269   0.0262
TimesNet      3     89.43    0.0325   0.0316
TimesNet      7     99.13    0.0363   0.0353
TimesNet      14    116.28   0.0426   0.0415
TimesNet      21    116.14   0.0418   0.0407
TimesNet      35    130.53   0.0469   0.0459
TimesNet      avg   104.23   0.0378   0.0369
PatchTST      1     100.33   0.0455   0.0344
PatchTST      3     129.40   0.0586   0.0453
PatchTST      7     106.22   0.0479   0.0366
PatchTST      14    127.61   0.0572   0.0424
PatchTST      21    135.95   0.0606   0.0472
PatchTST      35    161.11   0.0710   0.0541
PatchTST      avg   126.77   0.0568   0.0433
SEMF (Ours)   1     107.24   0.0357   0.0347
SEMF (Ours)   3     106.32   0.0360   0.0351
SEMF (Ours)   7     103.38   0.0353   0.0345
SEMF (Ours)   14    100.50   0.0342   0.0338
SEMF (Ours)   21    107.52   0.0369   0.0367
SEMF (Ours)   35    117.34   0.0386   0.0399
SEMF (Ours)   avg   107.05   0.0361   0.0358

Table 2 summarizes forecasting performance across Coal, Gold, and Steel, with results averaged across all forecasting horizons. Time series-based models demonstrate strong performance on individual commodities, although their accuracy varies substantially depending on asset-specific characteristics. Image-based approaches consistently exhibit larger errors, which suggests limited suitability for modeling structured temporal dynamics in isolation. Across all commodities and evaluation metrics, SEMF shows comparable error levels rather than sharp performance fluctuations. Low relative error measures are observed for Gold, while similar accuracy levels are maintained for Coal and Steel under distinct volatility profiles. The resulting performance pattern highlights robustness across heterogeneous market conditions rather than specialization to a single asset class. In particular, for Gold, a discrepancy between RMSE and $R^2$ is observed.
This arises from their different sensitivity to target variance. While RMSE measures absolute error, $R^2$ is normalized by horizon-specific variance. As a result, inconsistent trends may appear when results are averaged across horizons.

Table 3 presents horizon-wise forecasting results for the Gold price prediction task. Baseline time series models exhibit increasing errors as the forecasting horizon extends, which reflects limited long-range stability. By comparison, SEMF shows smaller performance variations as forecast length increases. Accuracy remains relatively stable even at longer horizons, whereas the errors of competing methods grow markedly with the horizon: the RMSE of TimesNet rises from 73.88 at 1 day to 130.53 at 35 days, while that of SEMF stays between 100.50 and 117.34. Such differences in horizon sensitivity affect how forecasting models behave under multi-horizon planning requirements.

Beyond point accuracy, robustness across forecasting horizons carries direct economic implications. In commodity markets, horizon-dependent instability can result in inconsistent inventory allocation, inefficient hedging adjustments, and increased operational uncertainty. Stable forecasting performance across forecast lengths supports coherent planning strategies under uncertainty. This allows a single forecasting framework to be applied across short- and long-term horizons without frequent recalibration. From an operational perspective, such consistency reduces exposure to horizon-driven decision volatility.

Table 4: Forecasting performance of SEMF with different time-frequency representations.

Image    RMSE     RMAE     MAPE
Line     187.98   0.0634   0.0617
STFT     173.20   0.0554   0.0544
CMOR     129.93   0.0495   0.0495
Morlet   107.05   0.0361   0.0358

Analysis: Key Design Choices

We present an ablation analysis that examines the contribution of individual components within the proposed SEMF framework, with evaluation restricted to Gold price forecasting to ensure a controlled and interpretable setting.
Gold is selected as the representative asset due to its high liquidity, stable data availability, and well-characterized market behavior, which allows for clear identification of architectural effects under consistent data conditions.

Impact of Image Transformation. Table 4 summarizes the forecasting performance of SEMF under four image transformations: line plot, Short-Time Fourier Transform (STFT), Complex Morlet wavelet (CMOR), and Morlet wavelet. This comparison examines how shape-based and time-frequency representations affect forecasting accuracy under non-stationary dynamics.

Shape-based line plot representations produce the largest errors, reflecting the absence of explicit frequency information. Time-frequency approaches yield improved performance, with fixed-window STFT providing moderate gains and adaptive wavelet-based representations offering further improvements. Among the evaluated methods, the Morlet wavelet achieves the most accurate and stable forecasts across all metrics. Performance differences across transformations remain consistent across evaluation measures, which underscores the importance of representation choice in spectrogram-based forecasting. Wavelet-based representations offer greater flexibility in characterizing non-stationary temporal patterns than shape-based or fixed-window frequency encodings.

Role of Exogenous Feature Encoding. Table 5 compares forecasting performance under different encoders for exogenous variables. A simple multilayer perceptron exhibits limited effectiveness in modeling these signals, indicating insufficient capacity to capture temporal dependencies and interactions across variables. In contrast, the Transformer-based encoder learns more informative representations, as its self-attention mechanism enables structured modeling of temporal patterns and variable-wise dependencies.
This richer representation is more effectively integrated through the cross-attention fusion module, where exogenous features interact with spectral representations of the target series. As shown in Table 5, substantial improvements in forecasting performance are observed compared to the MLP-based alternative.

Encoder        RMSE     RMAE     MAPE
MLP            273.17   0.1097   0.1095
Transformer    107.05   0.0361   0.0358

Table 5: Comparison of SEMF performance using different Exogenous Feature Encoders (MLP vs. Transformer).

Fusion       RMSE     RMAE     MAPE
Single CA    200.66   0.0766   0.0776
Bi-CA        107.05   0.0361   0.0358

Table 6: Forecasting performance of SEMF with different cross-attention fusion strategies.

Patch   Scale   RMSE     RMAE     MAPE
8       64      199.32   0.0696   0.0670
16      64      142.79   0.0536   0.0524
8       128     107.05   0.0361   0.0358
16      128     181.61   0.0615   0.0587

Table 7: Forecasting performance of SEMF under varying patch sizes and Morlet wavelet scales.

Effect of Cross-Modal Alignment. Cross-modal alignment strategies differ substantially in how effectively they integrate spectral and temporal information. Table 6 compares their impact on forecasting performance in SEMF. A single-direction cross-attention (CA) mechanism allows only limited interaction between spectral and temporal representations, which constrains the effective use of complementary information across modalities. In contrast, bidirectional cross-attention (bi-CA) enables mutual information exchange between the two representations. This bidirectional alignment yields more informative fused representations and supports stable forecasting behavior. Such alignment mechanisms enable complementary spectral and temporal signals to be utilized more effectively within a unified representation for multimodal forecasting.

Design Trade-offs in Temporal-Spectral Resolution. Table 7 reports the forecasting performance of SEMF under different combinations of patch size and Morlet wavelet scale.
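To make the two parameters concrete, the following minimal sketch computes a Morlet scalogram by direct convolution and counts the non-overlapping square patches a ViT-style encoder would receive. It is an illustration under stated assumptions, not the authors' pipeline: the function names, the center frequency `w0 = 6`, the linearly spaced integer scales, the truncation of the wavelet at four scale widths, and the 120-step toy series are all choices made here for the example.

```python
import cmath
import math


def morlet(t, w0=6.0):
    """Complex Morlet mother wavelet at dimensionless time t (w0 is assumed)."""
    return math.pi ** -0.25 * cmath.exp(1j * w0 * t) * math.exp(-0.5 * t * t)


def morlet_scalogram(series, num_scales, w0=6.0):
    """Return a num_scales x len(series) grid of wavelet-transform magnitudes.

    Row k uses scale s = k + 1 (an assumed, linearly spaced scale grid).
    """
    n = len(series)
    rows = []
    for s in range(1, num_scales + 1):
        half = min(4 * s, n - 1)  # truncate wavelet support at +/- 4 scales
        kernel = [morlet(k / s, w0).conjugate() / math.sqrt(s)
                  for k in range(-half, half + 1)]
        row = []
        for t in range(n):
            acc = 0j
            for j, w in enumerate(kernel):
                idx = t + j - half
                if 0 <= idx < n:  # zero padding outside the series
                    acc += series[idx] * w
            row.append(abs(acc))
        rows.append(row)
    return rows


def num_patches(height, width, patch):
    """Number of non-overlapping patch x patch tiles that fit in the image."""
    return (height // patch) * (width // patch)


# Toy input: a slow oscillation (period 30) plus a faster one (period 6).
series = [math.sin(2 * math.pi * t / 30) + 0.5 * math.sin(2 * math.pi * t / 6)
          for t in range(120)]
spec = morlet_scalogram(series, num_scales=16)
```

On this assumed 16 x 120 grid, `num_patches(16, 120, 8)` yields 30 tiles while `num_patches(16, 120, 16)` yields 7, which shows how a larger patch size coarsens the temporal segmentation. The number of scales sets how far toward long-period structure the rows of the image extend.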
These parameters jointly determine the encoding of temporal and spectral information within the spectrogram representation and influence forecasting performance.

Patch size controls the granularity of temporal segmentation. Smaller patches preserve fine-grained temporal variations, whereas larger patches aggregate information over longer intervals and reduce temporal resolution. Wavelet scale governs the extent of spectral context. Larger scales capture longer-term frequency patterns, while smaller scales focus on short-term spectral components.

The results suggest a trade-off between temporal resolution and spectral context. The combination of a small patch size and a large wavelet scale yields the best forecasting performance, as it preserves localized temporal variations while providing sufficient long-range spectral information. In contrast, configurations with both large patch sizes and large wavelet scales result in overly coarse representations, which are associated with degraded performance due to the loss of fine-grained temporal structure. For smaller wavelet scales, the limited spectral context constrains the capacity of the learned representation to capture long-term frequency dynamics, even with preserved temporal resolution.

Seq Length   RMSE     RMAE     MAPE
30           269.94   0.1092   0.1101
60           270.05   0.1090   0.1094
90           244.43   0.0848   0.0821
120          107.05   0.0361   0.0358

Table 8: Forecasting performance with different input sequence lengths.

Effect of Historical Window Size. Table 8 summarizes the forecasting performance of SEMF under different input sequence lengths. The historical window size determines the amount of temporal context available to the model and directly affects its ability to capture both short-term dependencies and long-term trends. Shorter input sequences provide insufficient historical information and result in substantially degraded forecasting performance.
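The role of the input sequence length can be sketched with a simple sliding-window construction. The helper below is hypothetical and is not taken from the paper: the name `make_windows`, the horizon of 10 steps, and the 200-point placeholder series are assumptions chosen only to show how the window length changes what each training sample contains.

```python
def make_windows(series, seq_len, horizon):
    """Split a series into (history, target) pairs.

    Each history holds seq_len consecutive values; the target is the value
    observed horizon steps after the end of that history.
    """
    pairs = []
    last_start = len(series) - seq_len - horizon
    for start in range(last_start + 1):
        history = series[start:start + seq_len]
        target = series[start + seq_len + horizon - 1]
        pairs.append((history, target))
    return pairs


# Placeholder values standing in for a daily price series.
prices = list(range(200))

# Number of training samples available for each candidate window length.
counts = {length: len(make_windows(prices, seq_len=length, horizon=10))
          for length in (30, 60, 90, 120)}
```

A longer window gives each sample more historical context but leaves fewer samples for a series of fixed length, so the choice of window size also interacts with the amount of data available.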
As the sequence length increases beyond 60, forecasting accuracy improves, with the largest gain at a length of 120 (RMSE of 107.05 versus 244.43 at 90), indicating that additional historical context contributes to more informative temporal representations. The longest sequence length considered in this study provides sufficient context to support stable forecasting behavior across horizons. These observations place practical constraints on the minimum historical coverage required for reliable long-range forecasting.

Conclusion

This study presents the Spectrogram-Enhanced Multimodal Fusion (SEMF) framework for multivariate time series forecasting in complex, non-stationary environments. SEMF integrates Morlet wavelet spectrograms of the target series with sequential representations of exogenous variables, which are processed through separate encoding pathways and integrated via bidirectional cross-attention. This design enables the model to jointly capture time-frequency characteristics and multivariate temporal dependencies within a unified forecasting framework. Empirical evaluation demonstrates that SEMF achieves consistent forecasting performance across multiple horizons and commodities. Component-wise analyses further confirm the contribution of each design element and demonstrate the importance of spatial and temporal resolution choices such as patch size, wavelet scale, and historical window length.

Beyond predictive accuracy, the horizon-consistent behavior of SEMF carries meaningful implications for operational and managerial decision-making. Stable performance across forecast horizons reduces sensitivity to planning horizon selection in applications such as procurement, inventory management, and risk mitigation under volatile market conditions. Such consistency supports more reliable coordination between short-term operational actions and longer-term strategic planning.
Forecasting systems with unstable long-horizon behavior can introduce unnecessary replanning and increased decision volatility. In this sense, SEMF offers coherent decision-oriented forecasting across multiple time horizons in business and financial domains.

The applicability of SEMF extends beyond the evaluated commodities to other asset classes and industrial settings characterized by complex temporal dynamics and heterogeneous external influences. The framework is particularly relevant in contexts where forecasting models serve as inputs to downstream decision processes. Future work will explore extensions to higher-frequency and streaming data, alternative spectral representations, and additional information sources such as textual or sentiment-based signals. Further advances in the interpretability of multimodal attention mechanisms may also facilitate broader adoption in decision-critical applications.

Acknowledgments

This research was supported by the Ministry of Science and ICT (MSIT), Korea, under the National Program for Excellence in SW (2023-0-00055), supervised by the Institute for Information & Communications Technology Planning & Evaluation (IITP).

References

Chen, M.; Shen, L.; Li, Z.; Wang, X. J.; Sun, J.; and Liu, C. 2024. Visionts: Visual masked autoencoders are free-lunch zero-shot time series forecasters. arXiv preprint arXiv:2408.17253.

Dosovitskiy, A. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

Hochreiter, S.; and Schmidhuber, J. 1997. Long short-term memory. Neural computation, 9(8): 1735–1780.

Kim, T.; Kim, J.; Tae, Y.; Park, C.; Choi, J.-H.; and Choo, J. 2021. Reversible instance normalization for accurate time-series forecasting against distribution shift. In International conference on learning representations.

Li, Z.; Li, S.; and Yan, X. 2023.
Time series as images: Vision transformer for irregularly sampled time series. Advances in Neural Information Processing Systems, 36: 49187–49204.

Liu, Y.; Hu, T.; Zhang, H.; Wu, H.; Wang, S.; Ma, L.; and Long, M. 2023. itransformer: Inverted transformers are effective for time series forecasting. arXiv preprint arXiv:2310.06625.

Nie, Y. 2022. A Time Series is Worth 64 Words: Long-term Forecasting with Transformers. arXiv preprint arXiv:2211.14730.

Semenoglou, A.-A.; Spiliotis, E.; and Assimakopoulos, V. 2023. Image-based time series forecasting: A deep convolutional neural network approach. Neural Networks, 157: 39–53.

Sezer, O. B.; Gudelek, M. U.; and Ozbayoglu, A. M. 2020. Financial time series forecasting with deep learning: A systematic literature review: 2005–2019. Applied Soft Computing, 90: 106181.

Taylor, S. J.; and Letham, B. 2018. Forecasting at scale. The American Statistician, 72(1): 37–45.

Torrence, C.; and Compo, G. P. 1998. A practical guide to wavelet analysis. Bulletin of the American Meteorological society, 79(1): 61–78.

Tsai, Y.-H. H.; Bai, S.; Liang, P. P.; Kolter, J. Z.; Morency, L.-P.; and Salakhutdinov, R. 2019. Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the conference. Association for computational linguistics. Meeting, volume 2019, 6558.

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. Advances in neural information processing systems, 30.

Wu, H.; Hu, T.; Liu, Y.; Zhou, H.; Wang, J.; and Long, M. 2022. Timesnet: Temporal 2d-variation modeling for general time series analysis. arXiv preprint arXiv:2210.02186.

Wu, H.; Xu, J.; Wang, J.; and Long, M. 2021. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Advances in neural information processing systems, 34: 22419–22430.
Zeng, Z.; Kaur, R.; Siddagangappa, S.; Balch, T.; and Veloso, M. 2023. From pixels to predictions: Spectrogram and vision transformer for better time series forecasting. In Proceedings of the Fourth ACM International Conference on AI in Finance, 82–90.

Zhang, Y.; and Yan, J. 2023. Crossformer: Transformer utilizing cross-dimension dependency for multivariate time series forecasting. In The eleventh international conference on learning representations.

Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; and Zhang, W. 2021. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI conference on artificial intelligence, volume 35, 11106–11115.
