Mean–Variance Portfolio Selection by Continuous-Time Reinforcement Learning: Algorithms, Regret Analysis, and Empirical Study
We study continuous-time mean–variance portfolio selection in markets where stock prices are diffusion processes driven by observable factors that are also diffusion processes, yet the coefficients of these processes are unknown. Based on the recently developed reinforcement learning (RL) theory for diffusion processes, we present a general data-driven RL approach that learns the pre-committed investment strategy directly without attempting to learn or estimate the market coefficients. For multi-stock Black–Scholes markets without factors, we further devise an algorithm and prove its performance guarantee by deriving a sublinear regret bound in terms of the Sharpe ratio. We then carry out an extensive empirical study implementing this algorithm to compare its performance and trading characteristics, evaluated under a host of common metrics, with a large number of widely employed portfolio allocation strategies on S&P 500 constituents. The results demonstrate that the proposed continuous-time RL strategy is consistently among the best, especially in a volatile bear market, and decisively outperforms the model-based continuous-time counterparts by significant margins.
💡 Research Summary
The paper tackles the continuous‑time mean‑variance (MV) portfolio selection problem in a setting where asset prices and observable factors follow diffusion processes, but the drift and volatility coefficients are unknown to the investor. Rather than estimating these market parameters, the authors adopt a model‑free reinforcement learning (RL) framework that learns the optimal pre‑committed investment strategy directly from observed price, factor, and wealth trajectories.
The methodological contribution consists of two parts. First, the authors adapt the recent continuous‑time RL theory of Wang et al. (2020) and Jia & Zhou (2022) to the MV objective. They formulate a set of martingale‑based moment conditions that link the instantaneous mean and covariance of the portfolio return to the unknown market dynamics. By solving these moment equations with data generated through the agent’s own trading, the algorithm bypasses any explicit model estimation. This “policy‑gradient‑style” actor‑critic scheme updates a linear function approximator for the policy (the actor) and a value‑function approximator for the Sharpe ratio (the critic) while enforcing an expectation constraint that guarantees feasibility of the MV problem.
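The exploratory, policy-gradient-style scheme described above can be illustrated with a toy simulation. This is a minimal sketch, not the paper's algorithm: the market is a simulated one-stock Black–Scholes model, the paper's martingale-based critic is replaced by a plain Monte-Carlo (REINFORCE-style) terminal reward, and every symbol (`mu`, `sigma`, `phi`, `w`, `z`, the learning rates, the clipping bounds) is an illustrative assumption. What it does show is the structure the summary describes: a Gaussian exploratory policy whose mean is linear in wealth, learned directly from simulated trading without ever estimating `mu` or `sigma`, with a Lagrange multiplier enforcing the expectation constraint of the MV problem.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical one-stock Black-Scholes market; the true mu and sigma are
# hidden from the learner and only drive the simulated wealth paths.
mu, sigma = 0.3, 0.2
T, n_steps = 1.0, 50
dt = T / n_steps
z = 1.4                        # target expected terminal wealth, E[X_T] = z
batch, n_iters = 32, 100

# Actor: Gaussian exploratory policy u ~ N(-phi*(x - w), v), with a mean
# linear in wealth; w is the Lagrange multiplier for E[X_T] = z.
phi, log_v, w = 0.5, -2.0, 1.0
lr, lr_w = 0.01, 0.05

for _ in range(n_iters):
    g_phi = g_logv = x_bar = 0.0
    for _ in range(batch):
        x, s_phi, s_logv = 1.0, 0.0, 0.0
        for _ in range(n_steps):
            m, v = -phi * (x - w), np.exp(log_v)
            u = rng.normal(m, np.sqrt(v))           # sampled dollar exposure
            s_phi += (u - m) / v * (w - x)          # d log-policy / d phi
            s_logv += 0.5 * ((u - m) ** 2 / v - 1)  # d log-policy / d log_v
            x += u * (mu * dt + sigma * np.sqrt(dt) * rng.standard_normal())
        r = -(x - w) ** 2                           # Lagrangian terminal reward
        g_phi += r * s_phi / batch
        g_logv += r * s_logv / batch
        x_bar += x / batch
    # Crude clipping keeps this toy example numerically stable.
    phi = float(np.clip(phi + lr * np.clip(g_phi, -10, 10), -5.0, 5.0))
    log_v = float(np.clip(log_v + lr * np.clip(g_logv, -10, 10), -6.0, 0.0))
    w -= lr_w * float(np.clip(x_bar - z, -10, 10))  # dual update on constraint

print(f"phi={phi:.3f}, exploration var={np.exp(log_v):.4f}, w={w:.3f}")
```

The key design point mirrored here is that the learner only touches sampled wealth increments, never the coefficients that generated them; swapping the Monte-Carlo return for the paper's martingale moment conditions would turn this into the actual actor-critic scheme.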
Second, for the special case of a multi‑asset Black–Scholes market without additional factors, the authors design a concrete RL algorithm. They prove almost‑sure convergence of the policy parameters and derive a regret bound measured in terms of the Sharpe ratio. The bound scales as \(O(\sqrt{N})\), where \(N\) is the number of learning episodes, i.e., the regret is sublinear. This is the first result establishing a sublinear regret guarantee for a model‑free, continuous‑time, risk‑aware RL algorithm.
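Schematically (the notation here is ours, not the paper's), writing \(\rho^{*}\) for the optimal Sharpe ratio and \(\rho_{n}\) for the Sharpe ratio attained in episode \(n\), a sublinear regret bound of this type reads

```latex
\[
  \mathrm{Regret}(N) \;:=\; \sum_{n=1}^{N}\bigl(\rho^{*}-\rho_{n}\bigr)
  \;=\; O\!\bigl(\sqrt{N}\bigr),
\]
```

so the average per-episode shortfall \(\mathrm{Regret}(N)/N = O(1/\sqrt{N})\) vanishes as \(N \to \infty\): the learned strategy becomes Sharpe-ratio-optimal on average as learning proceeds.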
Empirically, the paper conducts an extensive back‑test on S&P 500 constituents over the period 2000‑2020, using 1990‑2000 as a burn‑in pre‑training window. The RL strategy is benchmarked against 13 widely used portfolio construction methods, including the market portfolio, equal‑weight, sample‑based estimators, factor models, Bayesian estimators, plug‑in continuous‑time MV solutions, linear predictive models, and two generic RL algorithms. Performance is evaluated on both return‑based metrics (annualized return, Sharpe ratio and its variance, maximum drawdown, recovery time) and trading‑behavior metrics (gross risky exposure, turnover, concentration, bankruptcy probability). Across all criteria the RL strategy consistently outperforms the model‑based counterparts, with especially pronounced superiority during volatile bear markets such as the 2008 financial crisis. Notably, the advantage stems not from sophisticated predictive features or deep neural networks but from the fundamentally different decision‑making paradigm: learning the optimal control directly without ever estimating the underlying market model.
The authors acknowledge limitations: the analysis is confined to pure Itô diffusion dynamics, excluding jumps or stochastic volatility; the linear policy approximation may be insufficient for highly nonlinear market regimes; and practical deployment would require online adaptation mechanisms. They suggest future work on extending the theory to jump‑diffusion settings, incorporating nonlinear function approximators (e.g., deep networks), and testing the approach in live trading environments.
In summary, the paper provides a rigorous theoretical foundation (algorithm design, convergence proof, regret analysis) and a comprehensive empirical validation that together demonstrate the viability of model‑free continuous‑time RL for dynamic mean‑variance portfolio optimization. It opens a promising research avenue where reinforcement learning can replace traditional econometric estimation in high‑frequency, continuously traded financial markets.