Designing Time Series Experiments in A/B Testing with Transformer Reinforcement Learning
A/B testing has become a gold standard for modern technology companies to conduct policy evaluation. Yet, its application to time series experiments, where policies are sequentially assigned over time, remains challenging. Existing designs suffer from two limitations: (i) they do not fully leverage the entire history for treatment allocation; (ii) they rely on strong assumptions to approximate the objective function (e.g., the mean squared error (MSE) of the estimated treatment effect) for optimizing the design. We first establish an impossibility theorem showing that failure to condition on the full history leads to suboptimal designs, due to the dynamic dependencies in time series experiments. To address both limitations simultaneously, we next propose a transformer reinforcement learning (RL) approach which leverages transformers to condition allocation on the entire history and employs RL to directly optimize the MSE without relying on restrictive assumptions. Empirical evaluations on synthetic data, a publicly available dispatch simulator, and a real-world ridesharing dataset demonstrate that our proposal consistently outperforms existing designs.
💡 Research Summary
This paper tackles the problem of designing A/B tests for time‑series experiments, where policies are deployed sequentially and exhibit carry‑over effects. The authors first prove an impossibility theorem: any allocation rule that does not condition on the full past (i.e., the entire sequence of past actions and observations) cannot achieve asymptotically optimal mean‑squared error (MSE) for the average treatment effect (ATE) estimator when using doubly‑robust methods. This result formalizes the intuition that ignoring long‑range temporal dependencies leads to sub‑optimal designs.

To overcome both the theoretical limitation and the practical reliance on strong model assumptions (e.g., Markovian dynamics, linearity, short‑lag effects), the authors propose a novel design framework that couples a transformer encoder with reinforcement learning (RL). The transformer ingests the complete history at each time step, producing a state representation Sₜ that captures long‑range dependencies via self‑attention. A Double Deep Q‑Network then selects the treatment (or control) action, treating the negative MSE of the ATE estimator as the immediate reward. By directly optimizing the true MSE rather than an approximation, the method avoids the need for restrictive assumptions about the data‑generating process.

Empirical evaluation is conducted on three fronts: (i) synthetic time‑series data with varying carry‑over lengths, noise levels, and small treatment effects; (ii) a publicly available dispatch simulator that mimics driver‑passenger dynamics in ridesharing; and (iii) a real‑world ridesharing dataset spanning several months and millions of rides. Across all settings, the transformer‑RL design consistently yields lower MSE, typically a 10–30% improvement over Neyman allocation, Bayesian optimal designs, and prior RL‑based approaches, while also producing tighter confidence intervals and higher statistical power.
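The design loop described above can be sketched in miniature. This is not the paper's implementation: the transformer encoder is replaced by a hand-crafted history summary (`encode_history`), the Double DQN by a linear Q-function tuned with crude random search, and the environment (`TRUE_ATE`, `CARRYOVER`, and the outcome model) is an invented toy; the sketch only illustrates the core idea of conditioning allocation on the full history and rewarding the agent with the negative squared error of the ATE estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

TRUE_ATE = 1.0    # ground-truth effect, used only to score a design
CARRYOVER = 0.5   # fraction of the previous action leaking into today's outcome
T = 200           # horizon of one experiment

def encode_history(actions, outcomes):
    # Stand-in for the transformer encoder: compress the full history
    # into a small feature vector (last action, treated share, mean outcome).
    if not actions:
        return np.zeros(3)
    return np.array([actions[-1], np.mean(actions), np.mean(outcomes)])

def ate_estimate(actions, outcomes):
    # Simple difference-in-means ATE estimator.
    a = np.array(actions, dtype=float)
    y = np.array(outcomes, dtype=float)
    if a.sum() in (0, len(a)):
        return 0.0  # degenerate allocation: no contrast available
    return y[a == 1].mean() - y[a == 0].mean()

def run_experiment(q_weights, eps=0.1):
    # Roll out one experiment under an eps-greedy linear Q policy and
    # return the reward: negative squared error of the final ATE estimate.
    actions, outcomes = [], []
    for _ in range(T):
        s = encode_history(actions, outcomes)
        q = q_weights @ s  # Q-values for (control=0, treatment=1)
        a = int(q[1] > q[0]) if rng.random() > eps else int(rng.random() < 0.5)
        prev = actions[-1] if actions else 0
        y = TRUE_ATE * a + CARRYOVER * prev + rng.normal(0.0, 1.0)
        actions.append(a)
        outcomes.append(y)
    err = ate_estimate(actions, outcomes) - TRUE_ATE
    return -err**2

# Toy "training": keep the weights with the best average reward,
# a crude stand-in for the Double DQN update described in the paper.
best_w, best_r = None, -np.inf
for _ in range(20):
    w = rng.normal(size=(2, 3))
    r = np.mean([run_experiment(w) for _ in range(5)])
    if r > best_r:
        best_w, best_r = w, r
```

Because the reward is the (negative) MSE of the estimator itself, nothing in the loop assumes Markovian dynamics or linear carry-over; the policy is simply judged by how well the resulting allocation lets the ATE be estimated.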
The paper’s contributions are threefold: (1) a rigorous impossibility theorem highlighting the necessity of full‑history conditioning; (2) a practical algorithm that leverages modern deep learning (transformers) and RL to directly minimize the MSE without imposing Markov or linearity assumptions; and (3) extensive validation demonstrating substantial gains in realistic, high‑stakes applications such as ridesharing policy evaluation. Limitations include the computational cost of training large transformer‑RL agents and the reliance on simulated reward signals, suggesting future work on lightweight architectures and online learning extensions. Overall, the work provides a compelling, theoretically grounded, and empirically validated solution for optimal experimental design in dynamic, sequential decision‑making environments.