Beyond Accuracy: Evaluating Forecasting Models by Multi-Echelon Inventory Cost
This study develops a digitalized forecasting-inventory optimization pipeline integrating traditional forecasting models, machine learning regressors, and deep sequence models within a unified inventory simulation framework. Using the M5 Walmart data…
Authors: Swata Marik, Swayamjit Saha, Garga Chatterjee
Swata Marik 1*†, Swayamjit Saha 2† and Garga Chatterjee 3

1* Department of Home Science, University of Calcutta, 20 E, Judges Court Road, Alipore, Kolkata, 700027, West Bengal, India.
2 Department of Computer Science and Engineering, Mississippi State University, 665 George Perry Street, Starkville, 39759, MS, USA.
3 Psychology Research Unit, Indian Statistical Institute Kolkata, 203, B.T. Road, Kolkata, 700108, West Bengal, India.

*Corresponding author(s). E-mail(s): swatamarik01@gmail.com;
Contributing authors: ss4706@msstate.edu; garga@isical.ac.in;
†These authors contributed equally to this work.

Abstract

This study develops a digitalized forecasting–inventory optimization pipeline integrating traditional forecasting models, machine learning regressors, and deep sequence models within a unified inventory simulation framework. Using the M5 Walmart dataset, we evaluate seven forecasting approaches and assess their operational impact under single- and two-echelon newsvendor systems. Results indicate that Temporal CNN and LSTM models significantly reduce inventory costs and improve fill rates compared to statistical baselines. Sensitivity and multi-echelon analyses demonstrate robustness and scalability, offering a data-driven decision-support tool for modern supply chains.

Keywords: Demand forecasting, Inventory optimization, Multi-echelon supply chain, Newsvendor simulation, Deep learning (LSTM, Temporal CNN)

1 Introduction

Modern supply chains face growing volatility and disruption risk, motivating digitalization and AI-enabled decision support to improve resilience. Digital transformation allows firms to integrate heterogeneous data, automate analytics, and respond adaptively to shocks; recent work links ML-driven digitalization to stronger disruption absorption and service continuity [1, 2].
Within this ecosystem, demand forecasting is a key input to inventory optimization, shaping replenishment, safety stock, and coordination across distribution centers (DCs) and stores. Forecast errors can propagate upstream and amplify variability (the bullwhip effect) [3], so improved forecasts can directly reduce cost and improve service [4].

However, classical single-model approaches (e.g., ARIMA and exponential smoothing) often struggle with nonlinear, intermittent, and non-stationary retail demand [5]. ML/DL models (e.g., recurrent and attention-based architectures) frequently outperform statistical baselines under such conditions [6–8], yet their value is commonly reported via error metrics rather than operational impact [9–11]. This gap is more pronounced in multi-echelon networks: despite the importance of two-echelon DC–store systems [12], relatively few empirical studies link forecast improvements to coordinated multi-tier inventory outcomes using real retail data and information-sharing signals [13–15].

To address this gap, this study makes four key contributions. First, we develop a unified forecasting pipeline that integrates seven model classes, including statistical, machine learning, and deep learning approaches, within a standardized feature and training framework. Second, we embed these forecasts into a newsvendor-based operational evaluation to quantify how predictive performance translates into practical inventory outcomes across multiple holding–shortage cost ratios. Third, we extend the analysis beyond traditional single-echelon settings by implementing a two-echelon DC–store simulation, enabling assessment of upstream and downstream operational impacts.
Finally, we perform a detailed sensitivity analysis to evaluate the robustness of each forecasting model under varying cost structures, offering actionable insights for practitioners seeking to leverage digital analytics for supply chain resilience.

2 Literature Review

2.1 Classical Forecasting

Classical forecasting is widely used in supply chains due to simplicity and ease of deployment. Common baselines include ARIMA and exponential smoothing (e.g., Holt–Winters) [16–18]. However, retail demand often violates linearity/stationarity assumptions, especially under intermittency and structural breaks. Croston-type methods address intermittent demand by separating demand sizes and inter-arrival times [19, 20], yet can still degrade under abrupt shifts [9].

2.2 Machine Learning Approaches

Tree ensembles such as gradient boosting and related methods are popular for retail forecasting because they capture nonlinear interactions among engineered features and calendar signals [21]. Scalable implementations like XGBoost have shown strong performance on large tabular demand datasets [22], and evidence from the M5 competition highlights the competitiveness of boosted-tree approaches for hierarchical retail sales prediction [23, 24].

2.3 Deep Learning Approaches

Deep learning models learn temporal representations directly from sequences and covariates, which can reduce reliance on handcrafted features. LSTMs are a standard choice for long-range dependencies [25], while multi-horizon architectures such as Temporal Fusion Transformers improve covariate handling and interpretability [7]. Temporal CNNs with dilations provide efficient long-context modeling [26], and hybrid ARIMA–NN models have been used to combine linear structure with nonlinear effects [27]. Surveys emphasize DL advantages when demand drivers interact and forecasting must scale across many related series [6, 8].
2.4 Gaps in Prior Work

2.4.1 Forecasting accuracy vs. inventory outcomes

Forecasting studies commonly emphasize statistical error measures, while inventory studies often assume exogenous/stylized demand inputs; consequently, the operational value of accuracy improvements is not consistently quantified in cost and service metrics [9, 10].

2.4.2 Limited multi-echelon evaluation with real retail data

Although multi-echelon theory is well developed, empirical studies that propagate forecast errors through DC–store replenishment decisions using real retail demand and information-sharing signals (e.g., POS/sell-through) remain comparatively scarce [12–14].

2.4.3 Lack of unified comparisons across statistical, ML, and DL

Only a small subset of work evaluates classical methods, ML ensembles, and DL architectures under a single experimental protocol that links accuracy to downstream multi-tier inventory performance (e.g., total cost, fill rate, backorders) [10, 11, 15].

3 Methodology

3.1 Dataset and Preprocessing

We use the M5 Forecasting dataset [28], combining daily sales from sales_train_validation.csv with exogenous covariates from calendar.csv. Sales are reshaped to long format and merged with the calendar on the d index (date, weekday/month, events, SNAP).

• Subset: We filter to state_id=CA and dept_id=FOODS_1 (CA_FOODS_1) for controlled benchmarking.
• Features: Standard retail predictors are constructed per series: lags ($y_{t-1}, y_{t-7}, y_{t-14}, y_{t-28}$), rolling means (7/14/28 days), and calendar/event indicators (including SNAP).
• Splits: We apply a rolling holdout: final 28 days for test, previous 28 days for validation, remainder for training, with all features computed from past data only.

3.2 Forecasting Models

Let $y_{i,t}$ be daily demand for series $i$ at time $t$, and $\hat{y}_{i,t+1|t}$ the one-step-ahead forecast.
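The lag and rolling-mean features and the chronological split described in Section 3.1 can be sketched for a single series as follows. This is a minimal illustration rather than the authors' code: the function names `make_features` and `chronological_split`, the list-based representation of a series, and the convention that features for day t use only observations before day t are assumptions made here, and the calendar, event, and SNAP indicators are omitted.

```python
from statistics import mean

LAGS = (1, 7, 14, 28)      # lag offsets listed in Section 3.1
WINDOWS = (7, 14, 28)      # rolling-mean window lengths in days


def make_features(y, t):
    """Build features for target day t from y[:t] only.

    y is a list of daily sales for one series (hypothetical layout).
    Entries are None where the history is too short.
    """
    feats = {}
    for lag in LAGS:
        feats[f"lag_{lag}"] = y[t - lag] if t - lag >= 0 else None
    for w in WINDOWS:
        # The window y[t-w:t] ends at day t-1, so y[t] never leaks in.
        feats[f"roll_mean_{w}"] = mean(y[t - w:t]) if t - w >= 0 else None
    return feats


def chronological_split(n_days, test_days=28, val_days=28):
    """Return (train, val, test) day-index ranges over days 0..n_days-1."""
    test_start = n_days - test_days
    val_start = test_start - val_days
    if val_start <= 0:
        raise ValueError("series too short for the requested split")
    return (
        range(0, val_start),
        range(val_start, test_start),
        range(test_start, n_days),
    )
```

For a 100-day series, `chronological_split(100)` yields training days 0–43, validation days 44–71, and test days 72–99. Taking rolling means over `y[t - w:t]` keeps each window strictly before the target day, which is one way to satisfy the past-data-only requirement.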
Classical models are fit per series, while ML/DL models are trained globally across the selected panel using engineered covariates $x_t$ (lags, rolling statistics, and calendar/event indicators). All models are trained on the training split, tuned on validation, and evaluated on test.

1. Naive (lag-1): a persistence baseline that is often competitive for very short horizons.

$$\hat{y}_{i,t+1|t} = y_{i,t}. \tag{1}$$

2. Holt–Winters ES (additive, $m = 7$): captures level, trend, and weekly seasonality via exponential smoothing.

$$\ell_t = \alpha (y_t - s_{t-m}) + (1 - \alpha)(\ell_{t-1} + b_{t-1}), \tag{2}$$
$$b_t = \beta (\ell_t - \ell_{t-1}) + (1 - \beta) b_{t-1}, \tag{3}$$
$$s_t = \gamma (y_t - \ell_t) + (1 - \gamma) s_{t-m}, \tag{4}$$
$$\hat{y}_{t+1|t} = \ell_t + b_t + s_{t+1-m}. \tag{5}$$

3. ARIMA(1,1,1): a differenced linear time-series model that captures short-range autocorrelation in the mean.

$$\nabla y_t = c + \phi_1 \nabla y_{t-1} + \varepsilon_t + \theta_1 \varepsilon_{t-1}. \tag{6}$$

4. Gradient Boosting Regressor (GBR): an ensemble of regression trees that learns nonlinear relations between $x_t$ and demand.

$$\hat{y}_{t+1|t} = f(x_t), \qquad f(x) = \sum_{m=1}^{M} \eta\, g_m(x). \tag{7}$$

5. XGBoost: a regularized gradient-boosted tree model optimized for scalability and strong tabular performance.

$$\hat{y}_{t+1|t} = \sum_{k=1}^{K} f_k(x_t), \qquad \mathcal{L} = \sum_{t} \ell(y_t, \hat{y}_t) + \sum_{k} \Omega(f_k). \tag{8}$$

6. LSTM (global): a recurrent sequence model that learns temporal representations shared across series.

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad h_t = o_t \odot \tanh(c_t), \tag{9}$$
$$\hat{y}_{t+1|t} = W_y h_t + b_y. \tag{10}$$

7. Temporal CNN: a causal dilated-convolution model with large receptive fields to capture long-range patterns efficiently.

$$z^{(\ell)}_t = \sum_{k=0}^{K-1} w^{(\ell)}_k\, x^{(\ell-1)}_{t - d_\ell k}, \tag{11}$$
$$\hat{y}_{t+1|t} = W_z z^{(L_c)}_t + b_z. \tag{12}$$

3.3 Newsvendor Inventory Simulator

We evaluate each forecasting model through a rolling single-period newsvendor simulator. Let $D_{i,t}$ be realized demand for series $i$ at day $t$, and let $Q_{i,t}$ be the order placed before observing $D_{i,t}$.
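A minimal sketch of such a simulator follows, under the usual newsvendor assumptions: each leftover unit costs h, each unit of unmet demand costs b, the order is the point forecast floored at zero, and service is summarized by a demand-weighted fill rate. The function name `newsvendor_kpis`, its arguments, and the flattened list inputs are illustrative assumptions, not the authors' implementation.

```python
def newsvendor_kpis(demand, forecast, h=1.0, b=5.0, eps=1e-9):
    """Return (mean per-period cost, demand-weighted fill rate).

    demand, forecast: equal-length sequences of realized demand and
    one-step-ahead point forecasts, flattened over series and days
    (a hypothetical input layout chosen for this sketch).
    """
    if len(demand) != len(forecast):
        raise ValueError("demand and forecast must have the same length")
    if len(demand) == 0:
        raise ValueError("need at least one period")

    total_cost = 0.0
    total_short = 0.0
    for d, f in zip(demand, forecast):
        q = max(0.0, f)              # order = non-negative point forecast
        over = max(q - d, 0.0)       # leftover units, charged at h
        short = max(d - q, 0.0)      # unmet units, charged at b
        total_cost += h * over + b * short
        total_short += short

    mean_cost = total_cost / len(demand)
    # eps guards against division by zero when total demand is zero.
    fill_rate = 1.0 - total_short / (sum(demand) + eps)
    return mean_cost, fill_rate
```

Because the order depends only on the forecast, calling the function with different values of `b` on the same inputs changes the mean cost but leaves the fill rate unchanged.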
With overage (holding) cost $h > 0$ and underage (shortage) cost $b > 0$, the per-period cost is

$$C_{i,t}(Q_{i,t}) = h \max(Q_{i,t} - D_{i,t}, 0) + b \max(D_{i,t} - Q_{i,t}, 0). \tag{13}$$

• Mapping forecasts to orders: for point forecasts $\hat{D}_{i,t|t-1}$, we set

$$Q_{i,t} = \max\{0, \hat{D}_{i,t|t-1}\}. \tag{14}$$

• KPIs: we report mean cost over the evaluation horizon $\mathcal{T}$,

$$\bar{C} = \frac{1}{N |\mathcal{T}|} \sum_{i=1}^{N} \sum_{t \in \mathcal{T}} C_{i,t}(Q_{i,t}), \tag{15}$$

and demand-weighted fill rate,

$$\mathrm{FR} = 1 - \frac{\sum_{i,\, t \in \mathcal{T}} \max(D_{i,t} - Q_{i,t}, 0)}{\sum_{i,\, t \in \mathcal{T}} D_{i,t} + \epsilon}. \tag{16}$$

3.4 Two-Echelon Extension

We extend the simulator to a two-echelon system with one DC supplying stores $\mathcal{S}$. DC demand is aggregated as

$$D^{DC}_t = \sum_{s \in \mathcal{S}} D_{s,t}, \qquad \hat{D}^{DC}_{t|t-1} = \sum_{s \in \mathcal{S}} \hat{D}_{s,t|t-1}, \tag{17}$$

and the DC orders $Q^{DC}_t = \max\{0, \hat{D}^{DC}_{t|t-1}\}$. Stores request $R_{s,t} = \max\{0, \hat{D}_{s,t|t-1}\}$; if inventory is insufficient, fulfillment is allocated proportionally:

$$F_{s,t} = \frac{R_{s,t}}{\sum_{s'} R_{s',t} + \epsilon} \left( I^{DC}_t + Q^{DC}_t \right). \tag{18}$$

Two-echelon performance is measured by average network cost $\bar{C}^{(2)}$ and fill rate:

$$\bar{C}^{(2)} = \frac{1}{|\mathcal{T}|} \sum_{t \in \mathcal{T}} \left[ C^{DC}_t + \sum_{s \in \mathcal{S}} b_S \max(D_{s,t} - F_{s,t}, 0) \right], \tag{19}$$
$$\mathrm{FR}^{(2)} = \frac{\sum_{t \in \mathcal{T}} \sum_{s \in \mathcal{S}} \min(D_{s,t}, F_{s,t})}{\sum_{t \in \mathcal{T}} \sum_{s \in \mathcal{S}} D_{s,t} + \epsilon}. \tag{20}$$

4 Results

We report results on the held-out test period for the CA_FOODS_1 subset, evaluating both (i) statistical forecast accuracy and (ii) downstream inventory performance under the newsvendor simulator with overage cost $h = 1$ and underage cost $b \in \{2, 5, 10\}$.

4.1 Forecast accuracy and single-echelon inventory performance

Table 1 summarizes test-set accuracy (RMSE/MAE) together with inventory KPIs (average daily newsvendor cost and fill rate) under $(h, b) = (1, 5)$. Overall, modern learning-based models yield consistent operational gains over classical baselines. The Temporal CNN achieves the lowest inventory cost (3.674) and the highest fill rate (0.632), corresponding to an 18.7% cost reduction and a +9.8 pp fill-rate improvement over the naive forecast. The LSTM attains the best RMSE (2.207) and delivers the second-best inventory cost (3.704). Tree ensembles (GBR/XGBoost) also improve both cost and service compared to classical and naive baselines. We note that MAPE is extremely large for this subset due to near-zero denominators on low-demand days; hence we emphasize RMSE/MAE and inventory-based metrics for interpretation.

4.2 Sensitivity to shortage penalty ($b$)

Table 2 varies the underage penalty ($b$) while keeping $h = 1$. As expected, absolute costs increase as shortages become more expensive. Importantly, the overall ranking remains stable: deep models (Temporal CNN/LSTM) retain the lowest costs across all $b$ values, while boosted trees remain competitive. Since our order policy is a deterministic mapping from point forecasts ($Q = \max(0, \hat{D})$), fill rate is unchanged across $b$ for a fixed model; $b$ affects cost trade-offs rather than the realized service level.

Table 1 Test-set forecast accuracy and inventory performance (single-echelon newsvendor, $h = 1$, $b = 5$). Cost reduction and fill-rate gains are relative to Naive (lag-1).

| Model | RMSE | MAE | Avg. Cost/day | Fill rate | Cost ↓ | Fill ↑ (pp) |
|---|---|---|---|---|---|---|
| Naive (lag-1) | 2.909 | 1.505 | 4.521 | 0.534 | 0.0% | 0.0 |
| Holt–Winters ES | 2.677 | 1.487 | 4.182 | 0.583 | 7.5% | 4.9 |
| ARIMA(1,1,1) | 2.636 | 1.486 | 4.258 | 0.572 | 5.8% | 3.8 |
| GBR (global) | 2.296 | 1.293 | 3.854 | 0.605 | 14.8% | 7.1 |
| XGBoost (global) | 2.294 | 1.289 | 3.839 | 0.606 | 15.1% | 7.2 |
| LSTM (global seq) | 2.207 | 1.243 | 3.704 | 0.620 | 18.1% | 8.6 |
| Temporal CNN | 2.260 | 1.293 | 3.674 | 0.632 | 18.7% | 9.8 |

Table 2 Average daily inventory cost on the test set for different shortage penalties ($h = 1$). Best value per column in bold.

| Model | b = 2 | b = 5 | b = 10 |
|---|---|---|---|
| ARIMA(1,1,1) | 2.179 | 4.258 | 7.722 |
| GBR (global) | 1.933 | 3.854 | 7.054 |
| Holt–Winters ES | 2.157 | 4.182 | 7.558 |
| LSTM (global seq) | **1.858** | 3.704 | 6.781 |
| Naive (lag-1) | 2.259 | 4.521 | 8.291 |
| Temporal CNN | 1.888 | **3.674** | **6.652** |
| XGBoost (global) | 1.926 | 3.839 | 7.028 |

5 Discussion

Across all tested settings, deep learning models translate predictive gains into consistent operational improvements: both LSTM and Temporal CNN reduce average newsvendor cost relative to classical (Holt–Winters/ARIMA) and ML baselines, indicating fewer costly stockout/overstock events under the same ordering rule. Among them, the Temporal CNN is the most robust across shortage-to-holding cost ratios, achieving the lowest or near-lowest cost as $b/h$ increases, which suggests that its learned temporal representations generalize better under asymmetric service penalties. Extending the simulator to the multi-echelon setting further highlights that forecast quality at the distribution-center (DC) level has a disproportionate downstream effect: errors in DC-level demand aggregation propagate to store replenishment decisions, amplifying cost and service degradation across multiple outlets even when store-level forecasting is improved.

From a managerial perspective, these results provide a direct economic argument for investing in improved forecasting pipelines: accuracy improvements are not merely statistical, but convert into quantifiable reductions in inventory cost while supporting higher fill rates, strengthening resilience under volatile demand. At the same time, the findings should be interpreted within the study scope. Experiments are restricted to a single department category (CA_FOODS_1) over a limited evaluation horizon, and the ordering policy uses point forecasts without explicit modeling of price elasticity, promotion lift, or substitution.
Future work should extend the analysis across categories and longer horizons, incorporate pricing and promotion response, and evaluate distributional (quantile) forecasting to align ordering decisions more directly with service targets in multi-echelon networks.

6 Conclusion

This paper proposes an end-to-end framework that evaluates demand forecasting models by their downstream inventory impact on real retail data. Using the M5 dataset (CA_FOODS_1), we benchmark classical (Naive, Holt–Winters, ARIMA), ML (GBR, XGBoost), and DL (LSTM, Temporal CNN) approaches and propagate their forecasts through a newsvendor simulator, showing that learning-based models, especially deep architectures, consistently reduce inventory cost while improving service. Practically, the results help manufacturing and retail planners choose forecasting pipelines based on business KPIs (cost and fill rate), not accuracy alone, under asymmetric shortage/holding penalties. Future work will extend to probabilistic (quantile) forecasts, reinforcement-learning inventory control, and richer multi-echelon networks with lateral transshipments and operational constraints.

References

[1] Ivanov, D., Dolgui, A.: Viability of intertwined supply networks: Extending the supply chain resilience angles towards survivability. International Journal of Production Research 58(10), 2904–2915 (2020)
[2] Queiroz, M.M., Ivanov, D., Dolgui, A., Fosso Wamba, S.: Impacts of epidemic outbreaks on supply chains: Mapping a research agenda amid COVID-19. Transportation Research Part E: Logistics and Transportation Review 138, 101958 (2020)
[3] Lee, H.L., Padmanabhan, V., Whang, S.: Information distortion in a supply chain: The bullwhip effect. Management Science 43(4), 546–558 (1997)
[4] Hyndman, R.J., Athanasopoulos, G.: Forecasting: Principles and Practice. OTexts, ??? (2018)
[5] Makridakis, S., Spiliotis, E., Assimakopoulos, V.: Statistical and machine learning forecasting methods: Concerns and ways forward. PLoS ONE 13(3), 0194889 (2018)
[6] Bandara, K., Bergmeir, C., Smyl, S.: Forecasting across time series databases using recurrent neural networks on groups of similar series: A clustering approach. Expert Systems with Applications 140, 112896 (2020) https://doi.org/10.1016/j.eswa.2019.112896
[7] Lim, B., Arık, S.Ö., Loeff, N., Pfister, T.: Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting. International Journal of Forecasting 37(4), 1748–1764 (2021) https://doi.org/10.1016/j.ijforecast.2021.03.012
[8] Lim, B., Zohren, S.: Time-series forecasting with deep learning: a survey. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 379(2194), 20200209 (2021) https://doi.org/10.1098/rsta.2020.0209 arXiv:2004.13408 [cs.LG]
[9] Syntetos, A.A., Boylan, J.E., Disney, S.M.: Forecasting for inventory planning: a 50-year review. Journal of the Operational Research Society 60(S1), 149–160 (2009) https://doi.org/10.1057/jors.2008.173
[10] Petropoulos, F., Wang, X., Disney, S.M.: The inventory performance of forecasting methods: Evidence from the M3 competition data. International Journal of Forecasting 35(1), 251–265 (2019) https://doi.org/10.1016/j.ijforecast.2018.01.004
[11] Jeunet, J.: Demand forecast accuracy and performance of inventory policies under multi-level rolling schedule environments. International Journal of Production Economics 103(1), 401–419 (2006) https://doi.org/10.1016/j.ijpe.2005.10.003
[12] Nahmias, S., Smith, S.A.: Optimizing Inventory Levels in a Two-Echelon Retailer System with Partial Lost Sales. Management Science 40(5), 582–596 (1994) https://doi.org/10.1287/mnsc.40.5.582
[13] Van Belle, J., Guns, T., Verbeke, W.: Using shared sell-through data to forecast wholesaler demand in multi-echelon supply chains. European Journal of Operational Research 288(2), 466–479 (2021) https://doi.org/10.1016/j.ejor.2020.05.059
[14] Schlaich, T., Hoberg, K.: When Is the Next Order? Nowcasting Channel Inventories With Point-of-Sales Data to Predict the Timing of Retail Orders. European Journal of Operational Research 315(1), 35–49 (2024) https://doi.org/10.1016/j.ejor.2023.10.038
[15] Abolghasemi, M., Rostami-Tabar, B., Syntetos, A.: The power of information sharing: evaluating POS and order data for hierarchical forecasting in multi-echelon supply chains. International Journal of Production Research (2025) https://doi.org/10.1080/00207543.2025.2532756. Advance online publication / Latest Articles (volume/issue/pages not yet assigned)
[16] Hyndman, R.J., Khandakar, Y.: Automatic Time Series Forecasting: The forecast Package for R. Journal of Statistical Software 27(3), 1–22 (2008) https://doi.org/10.18637/jss.v027.i03
[17] Chatfield, C., Yar, M.: Holt–Winters Forecasting: Some Practical Issues. Journal of the Royal Statistical Society: Series D (The Statistician) 37(2), 129–140 (1988) https://doi.org/10.2307/2348687
[18] Gardner, E.S.: Exponential smoothing: The state of the art – Part II. International Journal of Forecasting 22(4), 637–666 (2006) https://doi.org/10.1016/j.ijforecast.2006.03.005
[19] Croston, J.D.: Forecasting and Stock Control for Intermittent Demands. Operational Research Quarterly (1970–1977) 23(3), 289–303 (1972) https://doi.org/10.2307/3007885
[20] Syntetos, A.A., Boylan, J.E.: The accuracy of intermittent demand estimates. International Journal of Forecasting 21(2), 303–314 (2005) https://doi.org/10.1016/j.ijforecast.2004.10.001
[21] Friedman, J.H.: Stochastic gradient boosting. Computational Statistics & Data Analysis 38(4), 367–378 (2002) https://doi.org/10.1016/S0167-9473(01)00065-2
[22] Chen, T., Guestrin, C.: XGBoost: A Scalable Tree Boosting System. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2016). https://doi.org/10.1145/2939672.2939785
[23] Makridakis, S., Spiliotis, E., Assimakopoulos, V.: The M5 competition: Background, organization, and implementation. International Journal of Forecasting 38(4), 1325–1336 (2022) https://doi.org/10.1016/j.ijforecast.2021.07.007
[24] Makridakis, S., Spiliotis, E., Assimakopoulos, V.: M5 accuracy competition: Results, findings, and conclusions. International Journal of Forecasting 38(4), 1346–1364 (2022) https://doi.org/10.1016/j.ijforecast.2021.11.013
[25] Hochreiter, S., Schmidhuber, J.: Long Short-Term Memory. Neural Computation 9(8), 1735–1780 (1997) https://doi.org/10.1162/neco.1997.9.8.1735
[26] Borovykh, A., Bohte, S., Oosterlee, C.W.: Conditional time series forecasting with convolutional neural networks. arXiv, 5 (2017) https://doi.org/10.48550/arXiv.1703.04691
[27] Zhang, G.P.: Time series forecasting using a hybrid ARIMA and neural network model. Neurocomputing 50, 159–175 (2003) https://doi.org/10.1016/S0925-2312(01)00702-0
[28] Kaggle: M5 Forecasting - Accuracy. https://www.kaggle.com/competitions/m5-forecasting-accuracy. Competition page (timeline: Start Date: March 2, 2020; End Date: June 30, 2020). Accessed: 2025-12-15 (2020)