Experimental Designs for Multi-Item Multi-Period Inventory Control
Randomized experiments, or A/B testing, are the gold standard for evaluating interventions, yet they remain underutilized in inventory management. This study addresses this gap by analyzing A/B testing strategies in multi-item, multi-period inventory systems with lost sales and capacity constraints. We examine two canonical experimental designs, namely, switchback experiments and item-level randomization, and show that both suffer from systematic bias due to interference: temporal carryover in switchbacks and cannibalization across items under capacity constraints. Under mild conditions, we characterize the direction of this bias, proving that switchback designs systematically underestimate, while item-level randomization systematically overestimate, the global treatment effect. Motivated by two-sided randomization, we propose a pairwise design over items and time and analyze its bias properties. Numerical experiments using real-world data validate our theory and provide concrete guidance for selecting experimental designs in practice.
💡 Research Summary
This paper investigates how to conduct reliable A/B tests for evaluating inventory‑control policies in a multi‑item, multi‑period lost‑sales system with a shared warehouse capacity. While data‑driven inventory algorithms have been extensively studied, their real‑world performance is usually assessed by simulation, which suffers from censored demand, uncertain customer behavior after stockouts, and high computational cost. The authors therefore turn to randomized experiments, the gold standard in many digital platforms, and ask how classic experimental designs behave when the Stable Unit Treatment Value Assumption (SUTVA) is violated by temporal carry‑over and capacity‑induced cannibalization.
The model consists of N items, a common capacity B, and T periods. Demand for each item in each period is a random vector Dₜ with unknown distribution; the decision maker observes an estimate of the demand distribution (typically a forecast) and places orders that are delivered instantly. Unsatisfied demand is lost, and profit in each period is the sum of sales revenue minus ordering and holding costs. The key distinction between the “treatment” (new forecasting or policy) and the “control” (baseline) lies either in the mean of the forecasted demand or in its variance.
Two canonical experimental designs are examined:
-
Switchback experiments – all items receive the same treatment for a block of periods, then the assignment switches. Because the inventory state carries over from one block to the next, a treatment influences future periods (temporal interference). The authors prove that when the treatment mainly changes the forecast mean, the switchback estimator systematically under‑estimates the Global Treatment Effect (GTE). Intuitively, a policy that reduces expected demand leaves less inventory for the subsequent block, depressing observed profit. Conversely, when the treatment primarily inflates forecast variance, the switchback estimator tends to over‑estimate GTE.
-
Item‑level randomization – each item is independently assigned to treatment or control for the whole horizon, while the treatment assignment is fixed over time. With a binding capacity, allocating more inventory to treated items reduces the amount available to control items, creating “cannibalization” across items. When the treatment changes the forecast mean, this design systematically over‑estimates GTE because treated items enjoy higher service levels while control items suffer stockouts. If the treatment only changes variance, the bias disappears asymptotically.
Recognizing the limitations of both designs, the authors propose a pairwise (two‑sided) randomization that randomizes simultaneously over items and time. In each period a random subset of items receives the treatment, and the subset changes across periods. This design diffuses both temporal carry‑over and capacity interference. Theoretical analysis shows that for mean‑driven treatments, pairwise randomization yields substantially lower bias than item‑level randomization, whereas for variance‑driven treatments it can incur higher bias.
To validate the theory, the authors use a large‑scale fresh‑food dataset from Dingdong Fresh. Demand is censored because sales after stockouts are unobserved. They impute censored demand, train several forecasting models (e.g., ARIMA, LSTM, Bayesian networks) to create controlled mean or variance differences, and simulate the inventory system under each experimental design. Results confirm the predicted bias directions and magnitudes: switchback underestimates GTE by roughly 10–15 % when means differ, item‑level randomization overestimates by 12–18 %, and pairwise randomization reduces bias to 3–5 %. When variance differences dominate, switchback overestimates while pairwise randomization shows modest upward bias, matching the analytical findings.
The paper concludes with practical guidance. If a new policy primarily improves demand‑mean forecasts (e.g., better point predictions), practitioners should avoid switchback designs and prefer item‑level or, better yet, pairwise randomization, especially when capacity is tight. If the policy mainly reduces forecast uncertainty (e.g., robust or risk‑averse policies), switchback experiments may be acceptable, while pairwise randomization could still be preferable if the capacity constraint is severe. The authors also provide a decision checklist based on whether the treatment effect is mean‑driven or variance‑driven and on the tightness of the capacity constraint.
Overall, the contribution is threefold: (1) formalizing how SUTVA violations arise in inventory experiments, (2) characterizing the sign and magnitude of bias for common designs under realistic operational regimes, and (3) offering a novel two‑sided randomization that mitigates bias in many practical settings. This work bridges the gap between experimental economics / digital platform A/B testing literature and operations‑management inventory theory, delivering actionable insights for both researchers and industry practitioners.
Comments & Academic Discussion
Loading comments...
Leave a Comment