Symbolic Regression Methods for Reinforcement Learning
Reinforcement learning algorithms can solve dynamic decision-making and optimal control problems. With continuous-valued state and input variables, reinforcement learning algorithms must rely on function approximators to represent the value function and policy mappings. Commonly used numerical approximators, such as neural networks or basis function expansions, have two main drawbacks: they are black-box models offering little insight into the mappings learned, and they require extensive trial-and-error tuning of their hyper-parameters. In this paper, we propose a new approach to constructing smooth value functions in the form of analytic expressions by using symbolic regression. We introduce three off-line methods for finding value functions based on a state-transition model: symbolic value iteration, symbolic policy iteration, and a direct solution of the Bellman equation. The methods are illustrated on four nonlinear control problems: velocity control under friction, one-link and two-link pendulum swing-up, and magnetic manipulation. The results show that the value functions yield well-performing policies and are compact, mathematically tractable, and easy to plug into other algorithms. This makes them potentially suitable for further analysis of the closed-loop system. A comparison with an alternative approach using neural networks shows that our method outperforms the neural network-based one.
💡 Research Summary
The paper tackles a fundamental challenge in continuous‑state reinforcement learning (RL): the lack of interpretability and the heavy hyper‑parameter tuning required by conventional numerical function approximators such as neural networks (NNs) or basis‑function expansions. To address this, the authors propose constructing smooth value functions as closed‑form analytic expressions using symbolic regression (SR), a genetic‑programming‑based technique that evolves mathematical formulas that fit a given data set. Three offline algorithms are introduced, each leveraging a known state‑transition model to generate training samples.
- Symbolic Value Iteration (SVI) adapts the classic value-iteration scheme. At each iteration, a set of state-action-next-state triples is sampled, the Bellman backup R(s,a) + γ V_k(s′) is computed, and SR is employed to evolve a new symbolic expression V_{k+1}(s) that minimizes the mean-squared error against the backup values. A complexity penalty on the expression-tree size is added to keep the resulting formula compact.
- Symbolic Policy Iteration (SPI) separates policy evaluation from policy improvement. In the evaluation phase, the current deterministic policy π_k is fixed and SR is used to fit the corresponding value function V^{π_k}(s). In the improvement phase, the gradient of the newly obtained symbolic V^{π_k} is analytically computed, and the policy is updated to choose actions that maximize the Bellman improvement term. Because the policy itself can be expressed symbolically, the entire control law remains transparent.
- Direct Symbolic Bellman Solution (SBS) formulates the Bellman optimality condition as a residual-minimization problem. A loss L(V) = ∑_s ( V(s) − max_a [ R(s,a) + γ V(s′) ] )², with s′ the model-predicted next state, measures how far a candidate expression V is from satisfying the Bellman equation, and SR searches the expression space for a V that drives this residual toward zero.
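The SVI loop can be sketched in a few lines. The sketch below is illustrative, not the paper's implementation: the dynamics (`step`), reward, and grids are an assumed 1-D velocity-control toy task, and the SR step is replaced by a least-squares polynomial fit as a cheap stand-in for the genetic-programming regressor.

```python
import numpy as np

# Assumed toy dynamics: 1-D velocity control with friction, s' = s + dt*(a - c*s).
def step(s, a, dt=0.1, c=0.5):
    return s + dt * (a - c * s)

# Assumed reward: drive the velocity to 1.0 with a small action penalty.
def reward(s, a):
    return -(s - 1.0) ** 2 - 0.01 * a ** 2

gamma = 0.95
states = np.linspace(-2.0, 3.0, 61)    # sampled states
actions = np.linspace(-2.0, 2.0, 21)   # discretized action set

def fit_value(states, targets, degree=4):
    # Stand-in for symbolic regression: least-squares polynomial fit.
    # The paper instead evolves analytic expressions with genetic programming,
    # penalizing expression-tree size to keep the formula compact.
    coeffs = np.polyfit(states, targets, degree)
    return lambda s: np.polyval(coeffs, s)

V = lambda s: np.zeros_like(np.asarray(s, dtype=float))
for k in range(50):
    # Bellman backup over the sampled triples: max_a [ R(s,a) + gamma * V(s') ]
    backups = np.max(
        [reward(states, a) + gamma * V(step(states, a)) for a in actions],
        axis=0,
    )
    V = fit_value(states, backups)  # V_{k+1} fitted to the backup values
```

After convergence, `V` is an explicit analytic (here polynomial) expression, so a greedy policy can be derived from it in closed form, which is the property the paper exploits.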